Method description

Standard linear regression

The Linear regression model

Linear regression attempts to model the relationship between two or more explanatory (independent) variables and a response (dependent) variable by fitting a linear equation to the observed data. For example, one can try to predict the total yearly sales of a salesperson (the dependent variable) from independent variables such as age, education, and years of experience.

Consider a random sample of n observations (y_i, x_i1, ..., x_ik), i = 1, ..., n. They can represent, for example, the frequency of some events or a positive response to sale offers, etc. The linear regression model for k explanatory variables x_1, ..., x_k and the dependent variable y is defined as:

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ε_i,    i = 1, ..., n,

where ε_i are the unobserved error terms and the unknown parameters β_0, β_1, ..., β_k are constant. The model is called multiple linear regression (or simple linear regression for k = 1) and describes how the mean response changes with the explanatory variables.

The error term represents the deviations of the observed values from their mean value, which are normally distributed with zero mean and constant variance. The best-fitting line for the observed data is calculated by minimizing the sum of the squares of the vertical deviations from each data point to the line. The model parameters can be estimated using the least squares procedure, which minimizes the sum of the squares of errors:

    S(β_0, ..., β_k) = Σ_{i=1..n} ε_i² = Σ_{i=1..n} (y_i − β_0 − β_1 x_i1 − ... − β_k x_ik)²
Because the deviations are squared prior to summation, there are no cancellations between the positive and negative values.

Linear models are not limited to straight lines or planes, but include a fairly wide range of shapes. For example, a simple quadratic curve y = β_0 + β_1 x + β_2 x² is linear in the statistical sense because it is equivalent to y = β_0 + β_1 z_1 + β_2 z_2, where z_1 = x and z_2 = x² are new transformed explanatory variables. Other examples are a straight-line model in the logarithm of x, y = β_0 + β_1 ln(x), and a polynomial in cos(x), y = β_0 + β_1 cos(x) + β_2 cos²(x), which are also linear in the statistical sense because they are linear in the parameters, though not with respect to the observed explanatory variable x. The required transformations of variables can be done using AdvancedMiner data transformations (for details see the Data Access and Data Processing chapter).
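For illustration, the following sketch (plain NumPy, not the AdvancedMiner transformation interface; all names are illustrative) fits the quadratic example above by adding the transformed column x² to the design matrix and estimating an ordinary linear model:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50)
    y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

    # Design matrix: intercept column, x, and the transformed variable z2 = x**2
    X = np.column_stack([np.ones_like(x), x, x**2])

    # The model is linear in the parameters, so ordinary least squares applies
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)  # approximately [1.0, 2.0, 0.5]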

Model assumptions

The underlying assumption of the linear regression model is that every sample value of the independent variables is associated with a value of the dependent variable, and that the dependent variable is normally distributed with constant variance around a mean that is a linear function of the independent variables. Generally, the following assumptions have to hold:

  1. The independent variables are not constant and not random.

  2. There are no exact linear relationships between the independent variables (they are not collinear).

  3. The errors associated with any two different observations are independent and normally distributed.

  4. The mean value of the probability distribution of random errors is zero, i.e. the average error over an infinitely long series of experiments is 0 for each setting of the independent variable.

  5. The variance of the random error is equal to a constant value.

  6. The distribution of the error term is independent of the joint distribution of explanatory variables.

If these conditions are satisfied then the error terms in the linear regression equation are mutually independent and identically normally distributed with zero mean and constant variance (i.e. the variances are homoscedastic: all error variances are equal and constant over time).

Estimation of parameters: the Least Squares Method

The least squares estimation method (LSE) requires that a straight line be fitted to a set of data points so that the sum of the squares of the distance of the points to the fitted line is minimized. The sum of the squares of the residuals is used instead of their absolute values because this allows the residuals to be treated as a continuous differentiable quantity. However, because the squares of the residuals are used, outlying points can have a disproportionate effect on the fit, a property which may or may not be desirable depending on the problem at hand.

Minimizing the sum of squares leads to the following normal equations, from which the values of the parameter estimates can be computed:

    ∂S/∂β_j = −2 Σ_{i=1..n} x_ij (y_i − β_0 − β_1 x_i1 − ... − β_k x_ik) = 0,    j = 0, 1, ..., k,

where x_i0 = 1 for all i.
The linear regression model can be written in matrix form as:

    y = Xβ + ε

where y denotes the vector of the dependent variable, ε is the error vector, β is the coefficient vector and X is the matrix of the independent variables. In the case of linear regression models with a constant term this matrix is augmented with a unit column (i.e. a column with all elements equal to one).

Using this notation it is not difficult to see that the unknown parameter estimates can be found as:

    β̂ = (XᵀX)⁻¹ Xᵀ y

if the inverse matrix (XᵀX)⁻¹ exists.

From this form one can see that the parameter estimates β̂ are linear functions of the n normally distributed random variables y_1, ..., y_n. Therefore, the estimates β̂_0, β̂_1, ..., β̂_k have a normal sampling distribution.
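A minimal sketch of this closed-form estimator (illustrative NumPy; solving the normal equations is preferred over forming the explicit inverse for numerical stability):

    import numpy as np

    def ols_fit(X, y):
        """Ordinary least squares estimates: solves (XᵀX) β = Xᵀy.

        X is assumed to already contain a leading column of ones
        when the model has an intercept.
        """
        return np.linalg.solve(X.T @ X, X.T @ y)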

Under the above assumptions the Gauss-Markov theorem states that the least squares estimates (LSE) of the parameters have the smallest variance among all linear unbiased estimators; they are efficient, so they are BLUE (Best Linear Unbiased Estimators). If the above assumptions are not fulfilled, the LSE estimators are not BLUE.

The advantages of linear least squares method are:

  • Though there are types of data that are better described by nonlinear functions of the parameters, many processes in real life, science and engineering are sufficiently well described by linear models.

  • The estimates of the unknown parameters obtained from linear least squares regression are optimal in a broad class of possible parameter estimates under the usual assumptions used for modeling. This means that in practice linear least squares regression makes very efficient use of the data: good results can be obtained with relatively small data sets.

  • The theory associated with linear regression is well-understood and allows for the construction of different types of easily-interpretable statistical intervals for predictions, calibrations, and optimizations.

  • Linear least squares regression has earned its place as the primary tool for process modeling because of its effectiveness and completeness.

The disadvantages of the linear least squares method are:

  • Linear models with nonlinear terms in the predictor variables curve relatively slowly, so for inherently nonlinear processes it becomes increasingly difficult to find a linear model that fits the data well as the range of the data increases. As the explanatory variables become extreme, so does the output of the linear model. This means that linear models may not be effective for extrapolating the results of a process for which the data cannot be collected in the region of interest (extrapolation itself is potentially dangerous regardless of model type).

  • While the least squares method often gives optimal estimates of the unknown parameters, it is very sensitive to the presence of unusual data points in the data used to fit the model. One or two outliers can sometimes seriously skew the results of the least squares analysis. This makes model validation, especially with respect to outliers, critical.

Confidence limits for parameter estimations

A 100(1 − α)% confidence interval is calculated for the confidence level α specified in the current algorithm settings. The upper and lower bounds are calculated as:

    β̂_i ± z_{1−α/2} · s(β̂_i)

where z_{1−α/2} is the (1 − α/2) percentile of the standard normal distribution and s(β̂_i)² are the i-th diagonal elements of the covariance matrix of the estimators.
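A hedged sketch of this calculation (illustrative NumPy/SciPy; it assumes the covariance matrix of the estimators is estimated as σ̂²(XᵀX)⁻¹ with σ̂² = SSE/(n − k − 1)):

    import numpy as np
    from scipy.stats import norm

    def coef_confidence_limits(X, y, beta_hat, alpha=0.05):
        n, p = X.shape
        resid = y - X @ beta_hat
        sigma2 = resid @ resid / (n - p)          # estimated error variance
        cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance matrix of the estimators
        se = np.sqrt(np.diag(cov))                # standard errors s(β̂_i)
        z = norm.ppf(1 - alpha / 2)               # standard normal percentile
        return beta_hat - z * se, beta_hat + z * se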

Pearson's correlation coefficient

Pearson's correlation coefficient is used as the measure of association between two variables x and y:

    r_xy = cov(x, y) / sqrt( cov(x, x) · cov(y, y) )

where cov(x, y) is the covariance between the x and y variables, and cov(x, x), cov(y, y) are their variances.
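A short illustrative check of this formula in NumPy:

    import numpy as np

    def pearson_r(x, y):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        c = np.cov(x, y, ddof=1)                  # 2x2 covariance matrix
        return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

    # equivalent to np.corrcoef(x, y)[0, 1]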

Fit of the regression model

Every sample has some variation in it (unless all values are identical, which is unlikely). The total variation is made up of two parts, the part that can be explained by the regression equation and the part that cannot be explained by the regression equation. This division of the total variation follows from the division of the total deviation, as shown below.

Figure 25.1. Fit of the regression model

The fit of the regression model can be assessed by the so-called coefficient of determination, which is the ratio of the variation explained by the regression plane to the total variation of the variable of interest.

The following notation is used:

  • The sum of squares of errors (SSE): Σ_{i=1..n} (y_i − ŷ_i)²

  • The sum of squares due to regression (SSR): Σ_{i=1..n} (ŷ_i − ȳ)²

  • The total sum of squares (SST): Σ_{i=1..n} (y_i − ȳ)²

where ȳ is the mean value of y and ŷ_i is the fitted value of y predicted from the model. Under the least squares method: SST = SSR + SSE. The coefficient of multiple determination (also called the coefficient of multiple correlation), denoted R² and defined as

    R² = SSR / SST = 1 − SSE / SST,

represents the proportion of the total variation in y explained by the regression model.

R² is sensitive to the magnitudes of n and k in small samples. If SSR is large relative to SSE, the model tends to fit the data very well. Usually R² is expressed as a percentage.

The coefficient of multiple determination can be misleading in multiple regression because it never decreases; in fact it often increases as the number of independent variables increases. This problem is overcome by taking into account the degrees of freedom of the sums of squares that enter the formula for R², which adjusts for the number of explanatory variables in the model. The obtained measure of goodness of fit is the so-called adjusted R², defined as:

    R²_adj = 1 − (SSE / (n − k − 1)) / (SST / (n − 1)) = 1 − (1 − R²) (n − 1) / (n − k − 1)
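Both measures in a short illustrative NumPy sketch, using the sums of squares defined above:

    import numpy as np

    def r_squared(y, y_hat, k):
        """Coefficient of determination and its adjusted version.

        k is the number of explanatory variables (without the intercept).
        """
        n = y.size
        sse = np.sum((y - y_hat) ** 2)
        sst = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - sse / sst
        r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
        return r2, r2_adj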

Statistical inference for the model

The overall goodness of fit of the regression model can be evaluated using an F-test in the format of an analysis of variance. Under the null hypothesis H_0: β_1 = β_2 = ... = β_k = 0, the statistic

    F = (SSR / k) / (SSE / (n − k − 1))

has an F-distribution with k and n − k − 1 degrees of freedom.

Whether a particular variable contributes significantly to the regression equation can be tested as follows. For any specific variable x_i the null hypothesis H_0: β_i = 0 is tested by computing the statistic

    t = β̂_i / s(β̂_i)

where s(β̂_i) is the standard error of the coefficient estimator, and performing a one- or two-tailed t-test with n − k − 1 degrees of freedom for models with an intercept and n − k degrees of freedom for models without an intercept.
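An illustrative sketch of both tests (SciPy is assumed only for the reference F and t distributions; the helper name and its signature are not part of the product):

    import numpy as np
    from scipy.stats import f, t

    def regression_tests(X, y, beta_hat):
        """Overall F-test and per-coefficient two-tailed t-tests.

        X is assumed to contain a leading intercept column, so the number
        of explanatory variables is k = X.shape[1] - 1.
        """
        n, p = X.shape
        k = p - 1
        y_hat = X @ beta_hat
        sse = np.sum((y - y_hat) ** 2)
        ssr = np.sum((y_hat - y.mean()) ** 2)

        F = (ssr / k) / (sse / (n - k - 1))
        f_pvalue = f.sf(F, k, n - k - 1)

        sigma2 = sse / (n - k - 1)
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        t_stats = beta_hat / se
        t_pvalues = 2 * t.sf(np.abs(t_stats), n - k - 1)
        return F, f_pvalue, t_stats, t_pvalues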

Standardized regression coefficients

The magnitude of the regression coefficients depends on the scale of measurement used for the dependent variable and the explanatory variables included in the regression equation. Non-standardized regression coefficients cannot be compared directly because of differing units of measurement and different variances of the explanatory variables. It is therefore necessary to standardize the variables for meaningful comparisons. The estimated model

    ŷ = β̂_0 + β̂_1 x_1 + ... + β̂_k x_k

can be written as:

    (ŷ − ȳ) / s_y = β̂_1 (s_1 / s_y) · ((x_1 − x̄_1) / s_1) + ... + β̂_k (s_k / s_y) · ((x_k − x̄_k) / s_k)

The expressions in the parentheses are the standardized variables; β̂_1, ..., β̂_k are the non-standardized regression coefficients, s_1, ..., s_k are the standard deviations of the variables x_1, ..., x_k, and s_y is the standard deviation of the variable y.

The coefficients β̂_j (s_j / s_y) are called the standardized regression coefficients. Note that after standardization the intercept in the model is equal to zero. The standardized regression coefficient measures the impact of a unit change in the standardized value of the explanatory variable on the standardized value of y. However, the regression equation itself should be reported in terms of the non-standardized regression coefficients so that predictions of y can be made directly from the explanatory variables.

In simple linear regression, the value of the standardized regression coefficient is exactly the same as the correlation coefficient; its magnitude can be interpreted in the same way.
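A minimal sketch of this rescaling (illustrative; beta_hat is assumed to come from an unstandardized fit whose first element is the intercept):

    import numpy as np

    def standardized_coefficients(X, y, beta_hat):
        """Rescale unstandardized slopes: β_std_j = β̂_j * s_j / s_y.

        X is assumed to contain a leading intercept column; the intercept
        itself has no standardized counterpart (it becomes zero).
        """
        s_x = X[:, 1:].std(axis=0, ddof=1)   # standard deviations of x_1..x_k
        s_y = y.std(ddof=1)                  # standard deviation of y
        return beta_hat[1:] * s_x / s_y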

Multicollinearity

Multicollinearity (linear dependence between the explanatory variables) can have a significant influence on the quality and stability of the fitted regression model. A commonly used approach to solving the multicollinearity problem is to omit highly correlated explanatory variables. The simplest method for detecting multicollinearity is to calculate the correlation matrix, which can be used to check for high correlations between pairs of explanatory variables. When more subtle patterns of correlation exist, the determinant of the correlation matrix can be used to check for multicollinearity. A determinant value near zero indicates that some or all explanatory variables are highly correlated. A determinant equal to zero indicates a singular matrix, which means that at least one of the explanatory variables is a linear function of one or more of the other explanatory variables. Another approach is to compute the so-called tolerance and/or the Variance Inflation Factor (VIF, the inverse of the tolerance) for each variable. The tolerance of a variable is defined as one minus the squared multiple correlation between that variable and the remaining variables. When the tolerance is small (usually less than 0.1), it is a good idea to discard the variable with the smallest tolerance (largest VIF).
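A hedged sketch of the tolerance/VIF computation (illustrative; each explanatory variable is regressed on the remaining ones and its tolerance is one minus the R² of that auxiliary regression):

    import numpy as np

    def tolerance_and_vif(X):
        """Tolerance and VIF for each explanatory variable.

        X holds only the explanatory variables (no intercept column).
        """
        n, k = X.shape
        tol = np.empty(k)
        for j in range(k):
            xj = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
            resid = xj - others @ beta
            r2 = 1.0 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
            tol[j] = 1.0 - r2
        return tol, 1.0 / tol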

Regression with categorical (qualitative) explanatory variables

Sometimes the variables under consideration are categorical (non-numeric). Explanatory variables included in a regression model do not have to be measured on an interval scale; nominal or ordinal variables can also be used. Such variables can be included in the regression model by creating so-called 'dummy' variables that indicate whether an observation belongs to a given category (yes = 1) or not (no = 0). This can be done by applying the binarize procedure.

Note

The binarize procedure should be used with the Random Redundant option active, since otherwise the method will produce a set of multicollinear (strictly speaking linearly dependent) explanatory variables, which can make the linear regression model impossible to build.
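For illustration only (plain NumPy, not the AdvancedMiner binarize procedure), the sketch below builds dummy columns and drops one redundant category to avoid the linear dependence mentioned in the note:

    import numpy as np

    def dummy_code(values, drop_first=True):
        """One dummy column per category; drop one to avoid exact collinearity."""
        categories = sorted(set(values))
        if drop_first:
            categories = categories[1:]   # the dropped category is the reference level
        return np.column_stack([
            [1.0 if v == c else 0.0 for v in values] for c in categories
        ])

    # Example: education level as a categorical explanatory variable
    dummies = dummy_code(["primary", "secondary", "higher", "secondary"])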

Weighted Linear Regression (WLS)

Applicability

In cases when the assumption of homoscedasticity is not satisfied (the random errors do not have constant variance for all observations in the data), a weighted regression approach is more appropriate than the ordinary least squares linear regression model. In the weighted regression model, in addition to the standard data requirements of the linear model, one has to specify a weight attribute, which represents the residual error of a given observation (refer to the Data Requirements section). This residual value is an argument of the observation weighting function specified in the algorithm settings (refer also to the Weight Types section).

Algorithm description

The original model assumes that the error variance σ² is constant for each observation.

Consider data with known but not constant variance σ_i² for each observation. Both sides of the model can be divided by σ_i, so the original model can be rewritten as:

    y_i / σ_i = β_0 (1 / σ_i) + β_1 (x_i1 / σ_i) + ... + β_k (x_ik / σ_i) + ε_i / σ_i

Now, the rewritten model has constant error variance and can be estimated using the least squares method. This method is called Weighted Least Squares (WLS) regression. The term w_i = 1 / σ_i² is the 'weight'.

For WLS the weighted residual sum of squares is defined as

    S_w(β_0, ..., β_k) = Σ_{i=1..n} w_i (y_i − β_0 − β_1 x_i1 − ... − β_k x_ik)²
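A minimal WLS sketch (illustrative; the weights w_i are assumed known and the weighted normal equations XᵀWXβ = XᵀWy are solved directly):

    import numpy as np

    def wls_fit(X, y, w):
        """Weighted least squares: minimizes sum_i w_i * (y_i - x_i·β)²."""
        W = np.diag(w)
        return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)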

Weight Types

Huber's: w(e) = 1 if |e| ≤ k, and w(e) = k / |e| if |e| > k

student: 

where k is a tuning constant

Iteratively Re-Weighted Least Squares (IRLS) Regression

Applicability

There are cases when the Linear Regression model assumptions are not satisfied, especially when observation errors are not normally distributed (with 0 mean and constant variance), for example when outliers are present. In these situations it is possible to use the Iteratively Re-weighted Least Squares method to obtain model estimators which are better fitted to the true trends in the data.

Algorithm description

In the general case we do not know the real weights for a particular data set, so it is not possible to fit the proper weighted regression (WLS) model directly. To estimate the weights as well as the coefficients of the model we can use the Iteratively Re-Weighted Least Squares method, which determines the appropriate weights by repeatedly fitting WLS models in the so-called IRLS loop.

IRLS Loop. 

  1. Set initial weights to 1.

  2. Estimate the model using WLS.

  3. Compute the model's standard deviation s via the median absolute deviation (MAD) estimator s = MAD / 0.6745, where MAD is the median of the absolute values of the residual errors for all observations.

  4. Exit the loop if the change in s is smaller than an arbitrarily chosen convergence threshold.

  5. Update the weights using the pre-determined weighting function (see Weight Types): w_i = w(e_i / s), where e_i is the residual error of observation i.

  6. Go back to step 2.
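A compact illustrative sketch of this loop; the Huber-type weight function, its tuning constant k = 1.345 and the MAD scaling constant 0.6745 are assumptions of the sketch, not necessarily the exact forms used by the algorithm:

    import numpy as np

    def huber_weight(e_scaled, k=1.345):
        """Huber-type weight: 1 inside the band |e| <= k, k/|e| outside."""
        a = np.abs(e_scaled)
        return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

    def irls_fit(X, y, tol=1e-6, max_iter=50):
        n = X.shape[0]
        w = np.ones(n)                               # step 1: initial weights
        s_old = np.inf
        for _ in range(max_iter):
            W = np.diag(w)                           # step 2: weighted least squares fit
            beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
            resid = y - X @ beta
            s = max(np.median(np.abs(resid)) / 0.6745, 1e-12)   # step 3: MAD scale estimate
            if abs(s - s_old) < tol:                 # step 4: convergence check
                break
            s_old = s
            w = huber_weight(resid / s)              # step 5: re-weight and repeat
        return beta, w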