## Video Tutorials

Video tutorial series are available on using Preddle for both the social sciences (eg psychology) and business modelling (eg insurance).

## Using Preddle

1) Upload your data into a new workspace from the workspace settings screen.

2) Set your workspace settings, including choosing your response, offset and weight variables and predictor bands.

3) Explore the data, looking at distributions of each variable and correlations between them.

4) Construct the GLM, creating curves for each predictor to see how they influence the response variable.

5) Data mine to automatically pick out segments where the response is high, medium, low etc. Either the raw response or the GLM residuals can be mined.

6) Copy out the GLM curves, data mine segments or SAS scoring code.

## Data upload

Your data must be in the following format:

- CSV (or zip file containing a single CSV file with the same name as the zip file eg file.zip containing file.csv)

- First row is column headings

- Each column is either a predictor or a response or a weight-style variable

## Workspace settings

Select the response variable, offset variable, weight variable, link function, error distribution and predictor variables. The offset, weight and link function selections need to be chosen carefully depending on how the Response variable is structured. Below are some technical details, followed by scenarios to walk you through the settings.

You can select up to three combinations of response variables and workspace settings on the same dataset. This can be handy when you are modelling the rate of something occurring, the magnitude when it does occur, and some other related event. The available predictors must be the same for each response type, but in the actual model you only need to use a subset of available predictors.

# GLM Construction Technical Details

**Response** is the variable you are modelling.

**Response Transform** is a mathematical transformation performed on the Response variable before any modelling or other adjustments are made.

**Weight** is the relative weight given to each record in computing the Beta parameters. A record with a weight of 2 will have the same influence on the Beta parameters (and associated diagnostics) as if that record appeared in the dataset twice.

**[None]** gives each record equal weight.

**Offset** (after transformation, as below) is subtracted from the Response before GLM modelling.

**Offset Transformation** is the mathematical transformation applied to the offset variable before the transformed value is subtracted from the Response for GLM modelling.

**Link function** is the inverse of the transformation applied to the linear predictor to compute the predicted value for each observation. If the link function is g(x), then for observation i, μᵢ = g⁻¹(β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ...). Typically, the

**Identity** link is used if each predictor has an additive impact on the Response, the

**Log** link is used if each predictor has a multiplicative impact, the

**Logit** link is used for modelling a probability (0 to 1) and the

**Inverse** link is used in esoteric circumstances.
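As a sketch of how the inverse links behave, the snippet below (with invented coefficient and predictor values) computes the predicted value μ from the same linear predictor under each link:

```python
import math

# Hypothetical linear predictor eta = B0 + B1*x1 + B2*x2
# (coefficients and predictor values are invented for illustration).
eta = 0.5 + 0.3 * 2.0 + (-0.1) * 4.0  # eta = 0.7

# The predicted value mu is the inverse link applied to eta.
mu_identity = eta                     # Identity: mu = eta
mu_log = math.exp(eta)                # Log: mu = exp(eta), always positive
mu_logit = 1 / (1 + math.exp(-eta))   # Logit: mu in (0, 1), a probability
mu_inverse = 1 / eta                  # Inverse: mu = 1/eta

print(mu_identity, round(mu_log, 4), round(mu_logit, 4), round(mu_inverse, 4))
```

Note how the Log link can never predict a negative value and the Logit link always yields a valid probability, which is why they suit multiplicative and probability models respectively.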

**Error distribution** reflects the variance in the response relative to the predicted value.

**Normal** error indicates constant variance (irrespective of the predicted value).

**Poisson** error indicates variance proportional to the predicted value.

**Gamma** error indicates variance proportional to the square of the predicted value.

**Binomial** error, used when modelling probabilities, indicates variance of p*(1-p) where p is the predicted value. The error distribution selected will not affect the Beta parameters, nor the predicted values for each observation. It only affects the standard deviations of the Betas and the resulting significance (with greater tolerance of observations with high deviations of actual from expected for the Poisson, and more so for the Gamma).
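These variance assumptions can be written down directly; the `glm_variance` helper below is a hypothetical illustration, not part of Preddle:

```python
def glm_variance(mu, family):
    """Variance of the response around the predicted value mu,
    up to a constant scale, for each error distribution."""
    if family == "normal":
        return 1.0              # constant, irrespective of mu
    if family == "poisson":
        return mu               # proportional to the predicted value
    if family == "gamma":
        return mu ** 2          # proportional to its square
    if family == "binomial":
        return mu * (1 - mu)    # p*(1-p) for a probability p
    raise ValueError(f"unknown family: {family}")

# Doubling the predicted value doubles the Poisson variance
# but quadruples the Gamma variance:
for mu in (2.0, 4.0):
    print(mu, glm_variance(mu, "poisson"), glm_variance(mu, "gamma"))
```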

If you are modelling something per something (e.g. sales per day, dollars per sale, rainy days per year), choose a Log link function.

# Data Mine Technical Details

The datamine sums up the Response and the Offset, and then finds the best splits in the predictors to segment the data into different levels of the ratio SUM(Response*Weight)/SUM(Offset*Weight). The Offset is treated as 1 if no Offset is selected. The Response Transformation and the Offset Transformation are not applied in these sums, ie they are not used in the datamine. If you want to use transformed versions of the response or offset in the Data Mine, include these separately in the imported dataset. The weight variable is used in summing both Response and Offset, so the sums are effectively weighted sums.

**Response Transform** is ignored for the purpose of the datamine.

**Weight** is the relative weight given to each record in computing the sums for the numerator and the denominator in the ratio.

**Offset** is summed for the denominator in the ratio.

**Link function** is ignored for the datamine.

**Error distribution** is ignored for the datamine.
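A sketch of the mined quantity on toy data (all values invented, and the segmentation fixed by hand here rather than searched for automatically, as Preddle would do):

```python
# Toy records: (response, offset, weight, segment); all values made up.
records = [
    (2.0, 1.0, 1.0, "A"), (0.0, 0.5, 1.0, "A"), (1.0, 1.0, 2.0, "A"),
    (3.0, 1.0, 1.0, "B"), (0.0, 1.0, 1.0, "B"), (1.0, 0.5, 1.0, "B"),
]

# The mined quantity per segment: SUM(Response*Weight) / SUM(Offset*Weight).
for level in ("A", "B"):
    num = sum(r * w for r, o, w, s in records if s == level)
    den = sum(o * w for r, o, w, s in records if s == level)
    print(level, round(num / den, 4))
```

Note that the record with weight 2 contributes twice to both the numerator and the denominator, exactly as if it appeared in the data twice.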

## Scenarios

# Survey data

Each record is one respondent answering questions about themselves and their satisfaction with three aspects of their lives. You have recorded values like their sex, age, weight, state etc (which are the predictors) and three stated satisfaction scores (say on a scale of 1 to 10) to be modelled.

**To do:**

- Choose each satisfaction score for the three Responses and label appropriately.

- Choose no Response Transformation; you may need to revisit this depending on the shape of the residuals.

- Choose no offset and no weight.

- Choose an Identity link if you think the predictors apply additively (eg living in NSW instead of VIC adds 1.5 satisfaction points other things being equal) or a Log link if you think predictors apply multiplicatively (eg living in NSW vs VIC increases satisfaction by 5% other things being equal).

- Choose a Normal error distribution; you may need to revisit this depending on the shape of the residuals.

# Probability of an event, where each record is a trial and the outcome (Response) is either a success (1) or a failure (0)

Each record is one football game, where we are modelling the probability of a win. You have recorded values like the game location, weather, player stats etc (which are the predictors), and the Response to be modelled is whether the win field is 1, as opposed to 0.

**To do:**

- Choose the win field as the Response.

- Choose no Response Transformation.

- Choose no offset and no weight.

- Choose a Logit link.

- Choose a Binomial error distribution.

# Rate of something occurring, where each record is some exposure to the event occurring and multiple occurrences are possible

Each record is exposure of an insurance policy to the possibility of damage, where we are modelling the rate of damage occurring per unit of exposure. Multiple events can occur against the one policy (e.g. two claims in one year), so we are modelling a rate of occurrence rather than a probability between 0 and 1. You have recorded values like the characteristics of the driver and their vehicle (which are the predictors). Depending on whether the count of claims has been recorded or the rate per unit of exposure, the offset and weight will need to be tweaked.

**Case 1:** Each record is exactly one unit of exposure.

**To do:**

- Choose the claim count field as the Response.

- Choose no Response Transformation.

- Choose no offset and no weight.

- Choose a Log link.

- Choose a Poisson error distribution.

**Case 2:** Each record is a different level of exposure (could be less than 1 or greater than 1) and **claim counts** are recorded.

**To do:**

- Choose the claim count field as the Response.

- Choose no Response Transformation.

- Choose exposure as the offset.

- Choose Log as the offset transformation.

- Choose no weight.

- Choose a Log link.

- Choose a Poisson error distribution.
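A sketch of why the log(exposure) offset works under a Log link (coefficients invented): the offset moves exposure inside the prediction, so the Betas describe the claim rate per unit of exposure.

```python
import math

# With a Log link and ln(exposure) as the offset, the linear predictor is
# eta = B0 + B1*x + ln(exposure), so the predicted count is
# mu = exp(eta) = exposure * exp(B0 + B1*x),
# ie the count scales with exposure while exp(B0 + B1*x) is the rate
# per unit of exposure. Coefficients below are invented for the sketch.
b0, b1, x = -2.0, 0.4, 1.5
for exposure in (0.5, 1.0, 2.0):
    mu = math.exp(b0 + b1 * x + math.log(exposure))  # expected claim count
    print(exposure, round(mu, 4), round(mu / exposure, 4))  # rate is constant
```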

**Case 3:** Each record is a different level of exposure (could be less than 1 or greater than 1) and **claim rates** (ie 'claim count/exposure') are entered for each record.

**To do:**

- Choose the claim rate field as the Response.

- Choose no Response Transformation.

- Choose no offset.

- Choose exposure as the weight.

- Choose a Log link.

- Choose a Poisson error distribution.
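Cases 2 and 3 target the same underlying rate. As a sketch with toy numbers and no predictors, both reduce to total claims divided by total exposure:

```python
# Case 2 view: counts with a log(exposure) offset -> total count / total exposure.
counts = [2, 0, 3]
exposures = [1.0, 0.5, 2.0]
case2 = sum(counts) / sum(exposures)

# Case 3 view: rates (count/exposure) with exposure as the weight ->
# the exposure-weighted mean of the rates, which is algebraically identical.
rates = [c / e for c, e in zip(counts, exposures)]
case3 = sum(r * e for r, e in zip(rates, exposures)) / sum(exposures)

print(round(case2, 4), round(case3, 4))  # the two agree
```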

# Magnitude of something, eg cost of an event or size of a sale

For example, you might be modelling the cost of storms, given predictors like the storm's magnitude, the population density where the storm hit etc. Each record might be a single storm, or the sum across multiple storms with the same predictors.

**Case 1:** Each record is exactly one storm.

**To do:**

- Choose the storm cost as the Response.

- Choose no Response Transformation.

- Choose no offset and no weight.

- Choose a Log link if the predictors act multiplicatively or an Identity link if they act additively.

- Choose a Gamma error distribution, as the variance in cost is likely to be proportional to the square of the expected cost.

**Case 2:** Each record is the sum of the cost across multiple storms with those predictors.

**To do:**

- Choose the storm cost as the Response.

- Choose no Response Transformation.

- Choose storm count as the offset.

- Choose Log as the offset transformation.

- Choose no weight.

- Choose a Log link.

- Choose a Gamma error distribution, as the variance in cost is likely to be proportional to the square of the expected cost.

**Case 3:** Each record is the sum across multiple storms, but the cost is recorded as the **average cost** across these storms, not the **sum of the cost**.

**To do:**

- Choose the storm cost as the Response.

- Choose no Response Transformation.

- Choose no Offset.

- Choose storm count as the Weight.

- Choose a Log link if the predictors act multiplicatively or an Identity link if they act additively.

- Choose a Gamma error distribution, as the variance in cost is likely to be proportional to the square of the expected cost.

# Combining different Response types in the same workspace

Different variables attached to the one workspace can be modelled. Consider a situation where a pricing analyst for an insurer wants to model claim frequency, average claim cost per claim and competitive position (defined as 'competitor premium / own premium'). They upload a CSV file with one record for each customer, containing that customer's exposure, sum of claim count, sum of claim cost, own premium and competitor premium.

**To do:**

Set up the following workspace characteristics:

1) Response label: Claim frequency

Response: Claim count

Response Transform: None

Weight: None

Offset: Exposure

Offset Transform: Log

Link Function: Log

Error Distribution: Poisson

2) Response label: Cost per claim

Response: Total claim cost

Response Transform: None

Weight: None

Offset: Claim count

Offset Transform: Log

Link Function: Log

Error Distribution: Gamma

3) Response label: Competitive position

Response: Competitor premium

Response Transform: None

Weight: Exposure

Offset: Own premium

Offset Transform: Log

Link Function: Log

Error Distribution: Gamma

'Claim frequency' is modelling claim count per unit of exposure

'Cost per claim' is modelling claim cost per claim

'Competitive position' is modelling the ratio of 'competitor premium / own premium'
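As a sketch with invented values for a single customer record, these are the quantities each of the three setups models:

```python
# One customer record from the combined workspace (all values made up):
exposure, claim_count, claim_cost = 0.8, 2, 1500.0
own_premium, competitor_premium = 600.0, 540.0

# The quantity each of the three response setups models, for this record:
claim_frequency = claim_count / exposure                  # claims per unit of exposure
cost_per_claim = claim_cost / claim_count                 # average cost per claim
competitive_position = competitor_premium / own_premium   # premium ratio

print(claim_frequency, cost_per_claim, competitive_position)
```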

## Use of GLMs

# Helpful links

The following link is particularly helpful for those interested in GLMs: https://www.casact.org/pubs/dpp/dpp04/04dpp1.pdf

# Helpful tips

The weight variable can be used as either the number of observations in that record, or the credibility of that record. Typically, the credibility of a record is proportional to the number of observations, but there may be other criteria that make a record more or less credible.

When residuals are plotted against fitted values, linear predictors or response values, they should form a random cloud if the Normal error distribution was chosen.

If the Poisson error distribution was chosen, these residuals should fan out at larger values. If the Gamma error distribution was chosen, the residuals should fan out even more than for the Poisson. Regardless of the error distribution selected, no fanning out or other discernible shape should be observed when plotting Pearson or Deviance residuals against fitted values, linear predictors or response values.

Care must be taken when looking at residual plots of binomial responses (success/failure or yes/no or 0/1). Since there can only be one of two outcomes, the residuals will typically appear as two lines, which does not in itself mean much.

When deciding between a Poisson or Gamma distribution, be aware of the impact of the response variable's unit of measurement (eg if the response is measured in cm vs metres, or pounds vs kilograms) on the ratio of standard deviation to expected value.

The Poisson error distribution's variance is equal to the expected value, so if smaller units of measurement are used (such as centimetres rather than metres) the expected value will increase faster than the standard deviation, resulting in a tendency to overestimate significance.

In practice, the Poisson shouldn't be used anyway when the response is a continuous value possessing a unit of measurement, so this probably won't matter. The Gamma distribution is invariant to the unit of measurement, since the variance increases with the square of the expected value.

The Normal error distribution is also invariant to the unit of measurement, since the variance is assumed constant over the range of expected values.

The Binomial response is either success or failure and so has no unit of measurement.

When a Poisson distribution is chosen (and even more so with a Gamma), fitting the line of best fit gives more weight to points with a low response value than to those with a high response value, since the assumed residual error increases as the expected value increases.

When a normal error distribution is selected, all points have equal weight.

Obviously, the weight variable is overlaid on top of this.
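The effect can be sketched with Pearson residuals, (y − μ)/√V(μ): the same raw error counts for less at large predicted values under Poisson, and less still under Gamma (numbers invented):

```python
import math

# Pearson residual: (y - mu) / sqrt(V(mu)), where V is the variance function.
# A fixed raw error of +5 is scaled down at large mu under Poisson (V = mu)
# and even more under Gamma (V = mu^2); under Normal (constant V, taken as 1)
# it counts the same everywhere.
for mu in (1.0, 10.0, 100.0):
    y = mu + 5.0
    r_normal = y - mu                     # V(mu) = 1
    r_poisson = (y - mu) / math.sqrt(mu)  # V(mu) = mu
    r_gamma = (y - mu) / mu               # V(mu) = mu^2
    print(mu, r_normal, round(r_poisson, 3), round(r_gamma, 3))
```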

# Useful reference table for insurance modelling examples

| Response | Link function | Error distribution | Scale parameter | Variance function | Prior weights | Offset |
| --- | --- | --- | --- | --- | --- | --- |
| Claim frequencies | ln(x) | Poisson | 1 | x | Exposure | 0 |
| Claim numbers or counts | ln(x) | Poisson | 1 | x | 1 | ln(exposure) |
| Average claim amounts | ln(x) | Gamma | Estimated | x^2 | # of claims | 0 |
| Probability (eg of renewing) | ln(x/(1-x)) | Binomial | 1 | x(1-x) | 1 | 0 |