Preddle

Help

Video Tutorials


Video tutorial series are available on using Preddle for both the social sciences (e.g. psychology) and business modelling (e.g. insurance).

Using Preddle


1) Upload your data into a new workspace from the workspace settings screen.
2) Set your workspace settings, including choosing your response, offset and weight variables and predictor bands.
3) Explore the data, looking at the distribution of each variable and the correlations between them.
4) Construct the GLM, creating curves for each predictor to see how it influences the response variable.
5) Data mine to automatically pick out segments where the response is high, medium, low and so on. Either the raw response or the GLM residuals can be mined.
6) Copy out the GLM curves, data mine segments or SAS scoring code.

Data upload


Your data must be in the following format:
- CSV (or a zip file containing a single CSV file with the same name as the zip file, e.g. file.zip containing file.csv)
- The first row contains the column headings
- Each column is a predictor, a response, or a weight-style variable
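For example, a minimal upload (with purely hypothetical column names) might look like this:

    age,state,exposure,claim_count
    34,NSW,1.00,0
    51,VIC,0.50,1
    28,QLD,0.75,0

Here age and state are predictors, claim_count is a response, and exposure is a weight-style variable.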

Workspace settings


Select the response variable, offset variable, weight variable, link function, error distribution and predictor variables. The offset, weight and link function need to be chosen carefully depending on how the response variable is structured. Below are some technical details, followed by scenarios to walk you through the settings.

You can select up to three combinations of response variables and workspace settings on the same dataset. This can be handy when you are modelling the rate of something occurring, the magnitude when it does occur, and some other related event. The available predictors must be the same for each response type, but each model need only use a subset of the available predictors.

GLM Construction Technical Details


Response is the variable you are modelling.
Response Transform is a mathematical transformation performed on the Response variable before any modelling or other adjustments are made.
Weight is the relative weight given to each record in computing the Beta parameters. A record with a weight of 2 will have the same influence on the Beta parameters (and associated diagnostics) as if that record appeared in the dataset twice. [None] gives each record equal weight.
Offset (after transformation, as below) is subtracted from the Response before GLM modelling.
Offset Transformation is the mathematical transformation applied to the offset variable before the transformed offset is subtracted from the Response prior to GLM modelling.
Link function is the inverse of the transformation applied to the linear predictor to compute the predicted value for each observation. If the link function is g(x), then for observation i, mu_i = g^-1(B0 + B1*x1_i + B2*x2_i + ...). Typically, the Identity link is used if each predictor has an additive impact on the Response, the Log link if each predictor has a multiplicative impact, the Logit link for modelling a probability (0 to 1), and the Inverse link in more esoteric circumstances.
Error distribution reflects the variance in the response relative to the predicted value. Normal error indicates constant variance (irrespective of the predicted value). Poisson error indicates variance proportional to the predicted value. Gamma error indicates variance proportional to the square of the predicted value. Binomial error, used in modelling probabilities, indicates variance of p*(1-p) where p is the predicted value. The error distribution selected will not affect the Beta parameters or the predicted values for each observation; it only affects the standard deviations of the Betas and the resultant significance (with greater tolerance of observations with high deviations of actual from expected for Poisson, and more so for Gamma).
If you are modelling something per something (e.g. sales per day, dollars per sale, rainy days per year), choose a Log link function.
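These settings map closely onto standard GLM software. As a rough sketch only (Python's statsmodels rather than Preddle itself, with hypothetical column names), the choices above would look like the code below. Note that statsmodels expresses the offset as a term added to the linear predictor, which for a Log link and Log offset transformation achieves the same "per unit of exposure" effect described above.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("file.csv")  # hypothetical dataset
    # Design matrix: predictors plus an intercept term (B0)
    X = sm.add_constant(pd.get_dummies(df[["age", "state"]], drop_first=True).astype(float))

    model = sm.GLM(
        df["claim_count"],                                         # Response
        X,
        family=sm.families.Poisson(link=sm.families.links.Log()), # Error distribution + Link
        offset=np.log(df["exposure"]),                             # Offset with a Log transformation
        # freq_weights=df["weight"],                               # optional record weights
    )
    result = model.fit()
    print(result.summary())  # Betas, standard errors and significance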

Data Mine Technical Details


The datamine sums up the Response and the Offset and then finds the best splits in the predictors to segment the data into different levels of the ratio SUM(Response*Weight)/SUM(Offset*Weight). Offset is treated as 1 if no Offset is selected. The Response Transformation and the Offset Transformation are not applied in these sums, i.e. they are not used in the datamine. If you want to use transformed versions of the response or offset in the Data Mine, include these separately on the imported dataset. The weight variable is used in summing both Response and Offset, so the sums are effectively weighted sums.
Response Transform is ignored for the purpose of the datamine.
Weight is the relative weight given to each record in computing the sums for the numerator and the denominator in the ratio.
Offset is summed for the denominator in the ratio.
Link function is ignored for the datamine.
Error distribution is ignored for the datamine.
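As a rough illustration of the statistic the datamine targets (a sketch with hypothetical column names, not Preddle's actual implementation), the weighted ratio for any candidate segment can be computed like this:

    import pandas as pd

    df = pd.read_csv("file.csv")        # hypothetical dataset
    df["weight"] = df.get("weight", 1)  # weight defaults to 1 if none selected
    df["offset"] = df.get("offset", 1)  # offset treated as 1 if none selected

    def segment_ratio(seg):
        # SUM(Response*Weight) / SUM(Offset*Weight), with no transformations applied
        return (seg["response"] * seg["weight"]).sum() / (seg["offset"] * seg["weight"]).sum()

    # e.g. compare the ratio across the levels of a single predictor
    print(df.groupby("state").apply(segment_ratio))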

Scenarios


Survey data


Each record is one respondent answering questions about themselves and their satisfaction with three aspects of their lives. You have recorded values like their sex, age, weight, state, etc. (which are the predictors) and three stated satisfaction scores (say on a scale of 1 to 10) to be modelled.
To do:
- Choose each satisfaction score for the three Responses and label appropriately.
- Choose no Response Transformation, though you might need to revisit this depending on the shape of the residuals.
- Choose no offset and no weight
- Choose an Identity link if you think the predictors apply additively (e.g. living in NSW instead of VIC adds 1.5 satisfaction points, other things being equal) or a Log link if you think predictors apply multiplicatively (e.g. living in NSW vs VIC increases satisfaction by 5%, other things being equal).
- Choose a Normal error distribution, though you might need to revisit this depending on the shape of the residuals.
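A hedged statsmodels sketch of this setup (column names hypothetical):

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("survey.csv")  # hypothetical survey extract
    X = sm.add_constant(pd.get_dummies(df[["sex", "age", "state"]], drop_first=True).astype(float))

    # No offset, no weight: Identity link with Normal (Gaussian) error
    model = sm.GLM(df["satisfaction_1"], X,
                   family=sm.families.Gaussian(link=sm.families.links.Identity()))
    print(model.fit().params)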

Probability of an event, where each record is a trial and the outcome (Response) is either a success (1) or a failure (0)


Each record is one football game, where we are modelling the probability of a win. You have recorded values like the game location, weather, player stats, etc. (which are the predictors), and the Response to be modelled is the win field, which is 1 for a win and 0 otherwise.
To do:
- Choose the win field as the Response.
- Choose no Response Transformation.
- Choose no offset and no weight
- Choose a Logit link.
- Choose a Binomial error distribution.
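The equivalent hedged sketch in statsmodels (column names hypothetical):

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("games.csv")  # hypothetical game-level extract
    X = sm.add_constant(pd.get_dummies(df[["location", "weather"]], drop_first=True).astype(float))

    # Logit link with Binomial error: models the probability that win = 1
    model = sm.GLM(df["win"], X,
                   family=sm.families.Binomial(link=sm.families.links.Logit()))
    print(model.fit().summary())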

Rate of something occurring, where each record is some exposure to the event occurring and multiple occurrences are possible


Each record is exposure of an insurance policy to the possibility of damage, where we are modelling the rate of damage occurring per unit of exposure. Multiple events can occur against the one policy (e.g. two claims in one year), so we are modelling a rate of occurrence rather than a probability between 0 and 1. You have recorded values like the characteristics of the driver and their vehicle (which are the predictors). Depending on whether the count of claims has been recorded or the rate per unit of exposure, the offset and weight will need to be tweaked.

Case 1: Each record is exactly one unit of exposure.
To do:
- Choose the claim count field as the Response.
- Choose no Response Transformation.
- Choose no offset and no weight
- Choose a Log link.
- Choose a Poisson error distribution.

Case 2: Each record is a different level of exposure (could be less than 1 or greater than 1) and claim counts are recorded
To do:
- Choose the claim count field as the Response.
- Choose no Response Transformation.
- Choose exposure as the offset.
- Choose Log as the offset transformation.
- Choose no weight.
- Choose a Log link.
- Choose a Poisson error distribution.

Case 3: Each record is a different level of exposure (could be less than 1 or greater than 1) and claim rates (i.e. claim count / exposure) are entered for each record
To do:
- Choose the claim rate field as the Response.
- Choose no Response Transformation.
- Choose no offset.
- Choose exposure as the weight.
- Choose a Log link.
- Choose a Poisson error distribution.
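Cases 2 and 3 are two framings of the same model. A hedged statsmodels sketch (column names hypothetical) shows both recovering the same Betas:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("policies.csv")  # hypothetical policy-level extract
    X = sm.add_constant(pd.get_dummies(df[["driver_age", "vehicle"]], drop_first=True).astype(float))

    # Case 2: claim counts as the Response, with log(exposure) as the offset
    case2 = sm.GLM(df["claim_count"], X, family=sm.families.Poisson(),
                   offset=np.log(df["exposure"])).fit()

    # Case 3: claim rates as the Response, weighting each record by its exposure
    case3 = sm.GLM(df["claim_count"] / df["exposure"], X, family=sm.families.Poisson(),
                   var_weights=df["exposure"]).fit()

    print(case2.params)  # the two parameterisations give the same Beta estimates
    print(case3.params)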

Magnitude of something, eg cost of an event or size of a sale


For example, you might be modelling the cost of storms, given predictors like the storm's magnitude, the population density where the storm hit, etc. Each record might be a single storm, or the sum across multiple storms with the same predictors.

Case 1: Each record is exactly one storm.
To do:
- Choose the storm cost as the Response.
- Choose no Response Transformation.
- Choose no offset and no weight
- Choose a Log link if the predictors act multiplicatively, or an Identity link if they act additively.
- Choose a Gamma error distribution, as the variance in cost is likely to be proportional to the square of the expected cost.

Case 2: Each record is the sum of the cost across multiple storms with those predictors.
To do:
- Choose the storm cost as the Response.
- Choose no Response Transformation.
- Choose storm count as the offset
- Choose log as the offset transformation.
- Choose no weight
- Choose a log link.
- Choose a Gamma error distribution, as the variance in cost is likely to be proportional to the square of the expected cost.

Case 3: Each record is the sum across multiple storms, but the cost is recorded as the average cost across these storms, not the sum of the cost
To do:
- Choose the storm cost as the Response.
- Choose no Response Transformation.
- Choose no Offset.
- Choose storm count as the Weight.
- Choose a Log link if the predictors act multiplicatively, or an Identity link if they act additively.
- Choose a Gamma error distribution, as the variance in cost is likely to be proportional to the square of the expected cost.
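A hedged statsmodels sketch of Case 2 (column names hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("storms.csv")  # hypothetical storm-level extract
    X = sm.add_constant(df[["magnitude", "pop_density"]].astype(float))

    # Total cost across several storms: log(storm count) as the offset,
    # Log link with Gamma error (variance ~ square of the expected cost)
    model = sm.GLM(df["total_cost"], X,
                   family=sm.families.Gamma(link=sm.families.links.Log()),
                   offset=np.log(df["storm_count"]))
    print(model.fit().summary())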

Combining different Response types in the same workspace


Different response variables attached to the one workspace can be modelled. Consider a situation where a pricing analyst for an insurer wants to model claim frequency, average claim cost per claim and competitive position (defined as 'competitor premium / own premium'). They upload a CSV file with one record for each customer, containing that customer's exposure, sum of claim count, sum of claim cost, own premium and competitor premium.

To do:
Set up the following workspace characteristics:
1) Response label: Claim frequency
      Response: Claim count
      Response Transform: None
      Weight: None
      Offset: Exposure
      Offset Transform: Log
      Link Function: Log
      Error Distribution: Poisson

2) Response label: Cost per claim
      Response: Total claim cost
      Response Transform: None
      Weight: None
      Offset: Claim count
      Offset Transform: Log
      Link Function: Log
      Error Distribution: Gamma

3) Response label: Competitive position
      Response: Competitor premium
      Response Transform: None
      Weight: Exposure
      Offset: Own premium
      Offset Transform: Log
      Link Function: Log
      Error Distribution: Gamma

'Claim frequency' is modelling claim count per unit of exposure
'Cost per claim' is modelling claim cost per claim
'Competitive position' is modelling the ratio of 'competitor premium / own premium'
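A hedged statsmodels rendering of the three setups on the one dataset (column names hypothetical; records with zero claims are excluded from the second model so that log(claim count) is defined):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("customers.csv")  # hypothetical customer-level extract
    X = sm.add_constant(pd.get_dummies(df[["age", "state"]], drop_first=True).astype(float))

    # 1) Claim frequency: counts with log(exposure) offset, Poisson error
    freq = sm.GLM(df["claim_count"], X, family=sm.families.Poisson(),
                  offset=np.log(df["exposure"]))

    # 2) Cost per claim: total cost with log(claim count) offset, Gamma error
    claimed = df[df["claim_count"] > 0]
    sev = sm.GLM(claimed["total_claim_cost"], X.loc[claimed.index],
                 family=sm.families.Gamma(link=sm.families.links.Log()),
                 offset=np.log(claimed["claim_count"]))

    # 3) Competitive position: competitor premium with log(own premium) offset,
    #    weighted by exposure, Gamma error
    comp = sm.GLM(df["competitor_premium"], X,
                  family=sm.families.Gamma(link=sm.families.links.Log()),
                  offset=np.log(df["own_premium"]),
                  var_weights=df["exposure"])

    # fit and inspect each model
    print(freq.fit().params)
    print(sev.fit().params)
    print(comp.fit().params)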


Use of GLMs


Helpful links

The following link is particularly helpful for those interested in GLMs:
https://www.casact.org/pubs/dpp/dpp04/04dpp1.pdf


Helpful tips

The weight variable can be used as either the number of observations in a record or the credibility of that record.
Typically, the credibility of a record is proportional to the number of observations, but there may be other criteria that make a record more or less credible.

When residuals are plotted against fitted values, linear predictors or response values, the points should form a shapeless random cloud if the Normal error distribution was chosen.
If the Poisson error distribution was chosen, these residuals should fan out for larger values. If the Gamma error distribution was chosen, the residuals should fan out even more than for the Poisson distribution. Regardless of the error distribution selected, no fanning out or other discernible shape should be observed when plotting Pearson residuals or Deviance residuals against fitted values, linear predictors or response values.
Care must be taken when looking at residual plots for binomial responses (success/failure, yes/no, 0/1). Since there can only be one of two outcomes, the residuals will typically appear as two lines, which does not in itself mean much.
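A hedged matplotlib sketch of this check, assuming `result` is a fitted GLM as in the earlier sketches:

    import matplotlib.pyplot as plt

    # Pearson residuals vs fitted values: should form a shapeless cloud
    plt.scatter(result.fittedvalues, result.resid_pearson, s=8, alpha=0.5)
    plt.axhline(0, color="grey", linewidth=1)
    plt.xlabel("Fitted value")
    plt.ylabel("Pearson residual")
    plt.show()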

When deciding between a Poisson or Gamma distribution, be aware of the impact of the response variable's unit of measurement (e.g. whether the response is measured in centimetres vs metres, or pounds vs kilograms) on the ratio of standard deviation to expected value.
The Poisson error distribution's variance is equal to the expected value, so if smaller units of measurement are used (such as centimetres rather than metres) the expected value will increase faster than the standard deviation, resulting in a tendency to overestimate significance. For example, an expected value of 2 metres implies a Poisson standard deviation of about 1.4 (roughly 71% of the mean), whereas the same quantity expressed as 200 centimetres implies a standard deviation of about 14 (only 7% of the mean).
In practice, the Poisson shouldn't be used when the response is a continuous value possessing a unit of measure anyway, so this probably won't matter. The Gamma distribution is invariant to the unit of measurement, since its variance increases with the square of the expected value.
The Normal error distribution is also invariant to the unit of measurement, since the variance is assumed constant over the range of expected values.
The Binomial response is either success or failure and so has no unit of measurement.

When a Poisson error distribution is chosen (and more so a Gamma), the fit gives relatively more weight to points with a low expected response than to points with a high one, since the assumed residual variance increases as the expected value increases.
When a normal error distribution is selected, all points have equal weight.
Obviously, the weight variable is overlaid on top of this.

Useful reference table for insurance modelling examples


Response                          Link function   Error distribution   Scale parameter   Variance function   Prior weights   Offset
Claim frequencies                 ln(x)           Poisson              1                 x                   Exposure        0
Claim numbers or counts           ln(x)           Poisson              1                 x                   1               ln(exposure)
Average claim amounts             ln(x)           Gamma                Estimated         x^2                 # of claims     0
Probability (e.g. of renewing)    ln(x/(1-x))     Binomial             1                 x(1-x) *            1               0

* where the number of trials = 1, or x(t-x)/t where the number of trials = t