Linear regression attempts to model the relationship between two or more explanatory (independent) variables and a response (dependent) variable by fitting a linear equation to the observed data. For example, one can try to predict the total yearly sales of a salesperson (the dependent variable) from independent variables such as age, education, and years of experience.
Consider a random sample of $n$ observations $(y_i, x_{i1}, \dots, x_{ip})$, $i = 1, \dots, n$. The values $y_i$ can represent, for example, the frequency of some events or a positive response to sale offers, etc. The linear regression model for $p$ explanatory variables $x_1, \dots, x_p$ and the dependent variable $y$ is defined as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n,$$
where $\varepsilon_i$ are the unobserved error terms and the unknown parameters $\beta_0, \beta_1, \dots, \beta_p$ are constant. The model with $p > 1$ is called multiple linear regression (for $p = 1$ it is simple linear regression) and describes how the mean response $E(y)$ changes with the explanatory variables.
The error term represents the deviations of the observed values
from their mean value, which are normally
distributed with zero mean and constant variance. The best-fitting
line for the observed data is calculated by minimizing the sum of the
squares of the vertical deviations from each data point to the line.
The model parameters can be estimated using the least squares procedure, which minimizes the sum of the squares of the errors:

$$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2.$$
Because the deviations are squared prior to summation, there are no cancellations between the positive and negative values.
Linear models are not limited to straight lines or planes, but include a fairly wide range of shapes. For example, a simple quadratic curve

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$

is linear in the statistical sense because it is equivalent to

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon,$$

where $x_1 = x$ and $x_2 = x^2$ are new, transformed explanatory variables. Other examples are a straight-line model in the logarithm of $x$,

$$y = \beta_0 + \beta_1 \ln(x) + \varepsilon,$$

and a polynomial in $\cos(x)$,

$$y = \beta_0 + \beta_1 \cos(x) + \beta_2 \cos^2(x) + \varepsilon,$$

which are also linear in the statistical sense because they are linear in the parameters, though not with respect to the observed explanatory variable $x$.
The required transformations of variables
can be done using AdvancedMiner data transformations (for details see
the Data Access and Data Processing
chapter).
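For illustration, the same idea can be sketched in a few lines of generic Python/NumPy code (not the AdvancedMiner transformation API; all names and data below are made up): transformed explanatory variables such as $\ln(x)$ or $x^2$ simply become extra columns of the design matrix before the linear model is fitted.

    import numpy as np

    # Hypothetical observed explanatory variable and response (synthetic data).
    rng = np.random.default_rng(0)
    x = rng.uniform(0.5, 5.0, size=100)
    y = 1.0 + 2.0 * np.log(x) + rng.normal(scale=0.1, size=100)

    # Build a design matrix whose columns are transformations of x.
    # The model stays linear in the parameters even though it is
    # nonlinear in x: y = b0 + b1*ln(x) + b2*x**2 + error.
    X = np.column_stack([np.ones_like(x), np.log(x), x**2])

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)   # estimates of b0, b1, b2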
The underlying assumption of the linear regression model is that, for every combination of values of the independent variables, the dependent variable is normally distributed about its mean with constant variance. Generally, the following assumptions have to hold:
The independent variables are non-random and not constant (they vary across observations).
There are no exact linear relationships between the independent variables (they are not collinear).
The errors associated with any two different observations are independent and normally distributed.
The mean value of the probability distribution of random errors is zero, i.e. the average error over an infinitely long series of experiments is 0 for each setting of the independent variable.
The variance of the random error is equal to a constant value.
The distribution of the error term is independent of the joint distribution of explanatory variables.
If these conditions are satisfied then the error terms in the linear regression equation are mutually independent and identically normally distributed with zero mean and constant variance (i.e. variances are homoscedastic: all errors variances are equal and constant over time).
The least squares estimation method (LSE) requires that a straight line be fitted to a set of data points so that the sum of the squares of the vertical distances of the points from the fitted line is minimized. The sum of the squares of the residuals is used instead of their absolute values because this allows the residuals to be treated as a continuous differentiable quantity. However, because the squares of the residuals are used, outlying points can have a disproportionate effect on the fit, a property which may or may not be desirable depending on the problem at hand.
Minimizing the sum of squares leads to the following normal equations, from which the values of the parameter estimates can be computed:

$$\sum_{i=1}^{n} \left( y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip} \right) x_{ij} = 0, \qquad j = 0, 1, \dots, p,$$

where $x_{i0} = 1$ for every observation.
The linear regression model can be written in matrix form as:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\mathbf{y}$ denotes the vector of the dependent variable, $\boldsymbol{\varepsilon}$ is the error vector, $\boldsymbol{\beta}$ is the coefficient vector and $\mathbf{X}$ is the matrix of the independent variables. In the case of linear regression models with a constant term this matrix is augmented with a unit column (i.e. a column with all elements equal to one).
Using this notation it is not difficult to see that the unknown parameter estimates can be found as:

$$\hat{\boldsymbol{\beta}} = \left( \mathbf{X}^{T}\mathbf{X} \right)^{-1} \mathbf{X}^{T}\mathbf{y},$$

if the inverse matrix $\left( \mathbf{X}^{T}\mathbf{X} \right)^{-1}$ exists. From this form one can see that the unknown parameter estimates $\hat{\boldsymbol{\beta}}$ are linear functions of the $n$ normally distributed random variables $y_1, \dots, y_n$. Therefore, the estimates $\hat\beta_j$ have a normal sampling distribution.
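A minimal NumPy sketch of this closed-form estimate, assuming the columns of $\mathbf{X}$ are linearly independent so that $\mathbf{X}^T\mathbf{X}$ is invertible (synthetic data, illustrative names):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # unit column for the intercept
    beta_true = np.array([1.0, 2.0, -0.5, 0.3])
    y = X @ beta_true + rng.normal(scale=0.5, size=n)

    # beta_hat = (X'X)^{-1} X'y; solve() is preferred to forming the inverse explicitly.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)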
Under the above assumptions the Gauss-Markov theorem states that the least squares estimates (LSE) of the parameters have the smallest variance among all linear unbiased estimators; they are efficient, so they are BLUE (Best Linear Unbiased Estimators). If the above assumptions are not fulfilled, the LSE estimators need not be BLUE.
The advantages of linear least squares method are:
Though there are types of data that are better described by nonlinear functions of the parameters, many processes in real life, science and engineering are sufficiently well described by linear models.
The estimates of the unknown parameters obtained from linear least squares regression are the optimal in a broad class of possible parameter estimates under the usual assumptions used for modeling. It means that in practice linear least squares regression makes very efficient use of the data. Good results can be obtained with relatively small data sets.
The theory associated with linear regression is well-understood and allows for the construction of different types of easily-interpretable statistical intervals for predictions, calibrations, and optimizations.
Linear least squares regression has earned its place as the primary tool for process modeling because of its effectiveness and completeness.
The disadvantages of the linear least squares method are:
Linear models with nonlinear terms in the predictor variables curve relatively slowly, so for inherently nonlinear processes it becomes increasingly difficult to find a linear model that fits the data well as the range of the data increases. As the explanatory variables become extreme, so does the output of the linear model. This means that linear models may not be effective for extrapolating the results of a process for which the data cannot be collected in the region of interest (extrapolation itself is potentially dangerous regardless of model type).
While the least squares method often gives optimal estimates of the unknown parameters, it is very sensitive to the presence of unusual data points in the data used to fit the model. One or two outliers can sometimes seriously skew the results of the least squares analysis. This makes model validation, especially with respect to outliers, critical.
A $100(1 - \text{Confidence Level})\%$ confidence interval is calculated for the Confidence Level $= \alpha$ specified in the current algorithm settings. The upper and lower bounds are calculated as:

$$\hat\beta_i \pm z_{1-\alpha/2} \sqrt{c_{ii}},$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ percentile of the standard normal distribution, and $c_{ii}$ are the $i$-th diagonal elements of the covariance matrix of the estimators.
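A sketch of this interval computation, assuming the estimator covariance matrix is estimated as $s^2(\mathbf{X}^T\mathbf{X})^{-1}$ with $s^2 = SSE/(n-p-1)$ (illustrative Python/SciPy code, not the algorithm's internal implementation):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    n, p = 200, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p - 1)          # estimated error variance
    cov = s2 * np.linalg.inv(X.T @ X)         # estimator covariance matrix

    alpha = 0.05                              # assumed meaning of the Confidence Level setting
    z = norm.ppf(1 - alpha / 2)               # standard-normal percentile
    half_width = z * np.sqrt(np.diag(cov))
    lower, upper = beta_hat - half_width, beta_hat + half_width
    print(np.column_stack([lower, beta_hat, upper]))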
Pearson's correlation coefficient is used as the measure of association between two variables:

$$r_{xy} = \frac{s_{xy}}{\sqrt{s_{xx}\, s_{yy}}},$$

where $s_{xy}$ is the covariance between the $x$ and $y$ variables, and $s_{xx}$ and $s_{yy}$ are the variances of $x$ and $y$, respectively.
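For example (a small NumPy sketch with synthetic data), the coefficient can be computed directly from the sample covariance matrix:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)
    y = 0.8 * x + rng.normal(scale=0.5, size=100)

    cov = np.cov(x, y)                      # 2x2 covariance matrix: s_xx, s_xy, s_yy
    r = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print(r, np.corrcoef(x, y)[0, 1])       # the two values agree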
Every sample has some variation in it (unless all values are identical, which is unlikely). The total variation is made up of two parts, the part that can be explained by the regression equation and the part that cannot be explained by the regression equation. This division of the total variation follows from the division of the total deviation, as shown below.
The fit of the regression model can be assessed by the so-called coefficient of determination, which is the fraction of the total variation of the variable of interest that is explained by the regression.
The following notation is used:

$$SST = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2, \qquad SSR = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2, \qquad SSE = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,$$

where $\bar{y}$ is the mean value of $y$ and $\hat{y}_i$ is the fitted value of $y$ predicted from the model. Under the least squares method, $SST = SSR + SSE$. The coefficient of multiple determination, denoted as $R^2$ and defined as

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},$$

represents the proportion of the total variation in $y$ explained by the regression model (its square root is the coefficient of multiple correlation). $R^2$ is sensitive to the magnitudes of $n$ and $p$ in small samples. If $p$ is large relative to $n$, the model tends to fit the data very well. Usually $R^2$ is expressed as a percentage.
The coefficient of multiple determination $R^2$ can be misleading in multiple regression because it never decreases, and in fact often increases, as the number of independent variables increases. This problem is overcome by taking into account the degrees of freedom of the sums of squares that enter the formula for $R^2$, which adjusts for the number of explanatory variables in the model. The obtained measure of goodness of fit is the so-called adjusted $R^2$, defined as:

$$R^2_{adj} = 1 - \frac{SSE / (n - p - 1)}{SST / (n - 1)} = 1 - \left( 1 - R^2 \right) \frac{n - 1}{n - p - 1}.$$
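The sums of squares, $R^2$ and adjusted $R^2$ can be computed as in the following NumPy sketch (synthetic data; $p$ counts the explanatory variables, excluding the intercept):

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = X @ np.array([1.0, 0.5, -0.2, 0.0]) + rng.normal(scale=1.0, size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_fit = X @ beta_hat

    sst = np.sum((y - y.mean())**2)      # total variation
    ssr = np.sum((y_fit - y.mean())**2)  # variation explained by the regression
    sse = np.sum((y - y_fit)**2)         # residual variation; SST = SSR + SSE

    r2 = ssr / sst
    r2_adj = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
    print(r2, r2_adj)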
The overall goodness of fit of the regression model can be evaluated using an F-test in the format of analysis of variance. Under the null hypothesis $H_0\colon \beta_1 = \beta_2 = \dots = \beta_p = 0$, the statistic

$$F = \frac{SSR / p}{SSE / (n - p - 1)}$$

has an F-distribution with $p$ and $n - p - 1$ degrees of freedom.
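Using the same notation, the overall F-test can be sketched as follows (SciPy's F distribution supplies the p-value; illustrative code only):

    import numpy as np
    from scipy.stats import f as f_dist

    rng = np.random.default_rng(5)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = X @ np.array([1.0, 0.5, -0.2, 0.3]) + rng.normal(size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_fit = X @ beta_hat
    ssr = np.sum((y_fit - y.mean())**2)
    sse = np.sum((y - y_fit)**2)

    F = (ssr / p) / (sse / (n - p - 1))
    p_value = f_dist.sf(F, p, n - p - 1)   # survival function = upper-tail probability
    print(F, p_value)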
Whether a particular variable contributes significantly to the regression equation can be tested as follows. For any specific variable $x_j$, the null hypothesis $H_0\colon \beta_j = 0$ is tested by computing the statistic

$$t = \frac{\hat\beta_j}{SE(\hat\beta_j)},$$

where $SE(\hat\beta_j)$ is the standard error of the coefficient estimator, and performing a one- or two-tailed t-test with $n - p - 1$ degrees of freedom for models with an intercept and $n - p$ degrees of freedom for models without an intercept.
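A sketch of the per-coefficient test for a model with an intercept, using a two-tailed t-test (illustrative Python/SciPy code):

    import numpy as np
    from scipy.stats import t as t_dist

    rng = np.random.default_rng(6)
    n, p = 100, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p - 1)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))   # standard errors of the coefficients

    t_stat = beta_hat / se
    p_values = 2 * t_dist.sf(np.abs(t_stat), n - p - 1)  # two-tailed test
    print(t_stat, p_values)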
The magnitude of the regression coefficients depends on the scale of measurement used for the dependent variable $y$ and the explanatory variables included in the regression equation. Non-standardized regression coefficients cannot be compared directly because of differing units of measurement and different variances of the explanatory variables. It is therefore necessary to standardize the variables for meaningful comparisons. The estimated model

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_p x_p$$

can be written as:

$$\frac{\hat{y} - \bar{y}}{s_y} = \hat\beta_1 \frac{s_{x_1}}{s_y} \left( \frac{x_1 - \bar{x}_1}{s_{x_1}} \right) + \dots + \hat\beta_p \frac{s_{x_p}}{s_y} \left( \frac{x_p - \bar{x}_p}{s_{x_p}} \right).$$

The expressions in the parentheses are the standardized variables; $\hat\beta_1, \dots, \hat\beta_p$ are the non-standardized regression coefficients, $s_{x_1}, \dots, s_{x_p}$ are the standard deviations of the variables $x_1, \dots, x_p$, and $s_y$ is the standard deviation of the variable $y$.
The coefficients $\hat\beta_j^{*} = \hat\beta_j \, s_{x_j} / s_y$ are called the standardized regression coefficients. Note that after standardization the intercept in the model will be equal to zero. The standardized regression coefficient measures the impact of a unit change in the standardized value of an explanatory variable on the standardized value of $y$. However, the regression equation itself should be reported in terms of the non-standardized regression coefficients so that the prediction of $y$ can be made directly from the explanatory variables.
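A small NumPy sketch of the rescaling: each standardized coefficient equals the non-standardized coefficient multiplied by $s_{x_j}/s_y$ (synthetic data, illustrative names):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 200
    x1 = rng.normal(scale=10.0, size=n)     # variables on very different scales
    x2 = rng.normal(scale=0.1, size=n)
    y = 2.0 + 0.05 * x1 + 30.0 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    s_x = np.array([x1.std(ddof=1), x2.std(ddof=1)])
    s_y = y.std(ddof=1)
    beta_std = beta_hat[1:] * s_x / s_y     # standardized coefficients (intercept becomes 0)
    print(beta_hat[1:], beta_std)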
In simple linear regression, the value of the standardized regression coefficient is exactly the same as the correlation coefficient; its magnitude can be interpreted in the same way.
Multicollinearity (linear dependence between the explanatory variables) can have a significant influence on the quality and stability of the fitted regression model. A commonly used approach to the multicollinearity problem is to omit highly correlated explanatory variables. The simplest method for detecting multicollinearity is to calculate the correlation matrix, which can be used to check for high correlations between pairs of explanatory variables. When more subtle patterns of correlation exist, the determinant of the correlation matrix can be used to check for multicollinearity: a determinant close to zero indicates that some or all explanatory variables are highly correlated, while a determinant equal to zero indicates a singular matrix, which means that at least one of the explanatory variables is an exact linear function of the others. Another approach is to compute the so-called tolerance and/or the Variance Inflation Factor (VIF, the inverse of the tolerance) for each variable. The tolerance of a variable is defined as one minus the squared multiple correlation between that variable and the remaining variables. When the tolerance is small (usually less than 0.1), it is usually a good decision to discard the variable with the smallest tolerance (largest VIF).
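A sketch of the tolerance/VIF computation, in which each explanatory variable is regressed on the remaining ones and tolerance $= 1 - R^2$ of that auxiliary regression (illustrative Python code; the vif helper below is not part of AdvancedMiner):

    import numpy as np

    def vif(X):
        """Tolerance and VIF for each column of X (columns = explanatory variables)."""
        n, k = X.shape
        out = []
        for j in range(k):
            target = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(others, target, rcond=None)
            fit = others @ beta
            r2 = 1 - np.sum((target - fit)**2) / np.sum((target - target.mean())**2)
            tol = 1 - r2
            out.append((tol, 1 / tol))
        return out

    rng = np.random.default_rng(8)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
    x3 = rng.normal(size=200)
    print(vif(np.column_stack([x1, x2, x3])))    # x1 and x2 get small tolerance / large VIF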
Sometimes the variables under consideration are categorical (non-numeric). Such explanatory variables are not measured on an interval scale; they are nominal or ordinal variables. They can be included in the regression model by creating so-called 'dummy' variables that indicate whether an observation belongs to a given category (yes=1) or not (no=0). This can be done by applying the binarize procedure.
The binarize procedure should be used with the Random Redundant option active, since otherwise the method will produce a set of multicollinear (strictly speaking linearly dependent) explanatory variables, which can make the linear regression model impossible to build.
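A generic sketch of dummy coding in NumPy (not the binarize procedure itself); one category is dropped so that the dummy columns do not sum to the constant column:

    import numpy as np

    colour = np.array(["red", "green", "blue", "green", "red", "blue"])

    categories = np.unique(colour)                 # ['blue', 'green', 'red']
    kept = categories[1:]                          # drop one category to avoid exact collinearity
    dummies = (colour[:, None] == kept[None, :]).astype(float)
    print(kept)
    print(dummies)                                 # columns: is_green, is_red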
In cases when the assumption of homoscedasticity is not satisfied (the random errors do not have constant variance for all observations in the data), a weighted regression approach is more appropriate than the ordinary least-squares linear regression model. In the weighted regression model, in addition to the standard data requirements of the linear model, one has to specify a weight attribute, which represents the residual error of a given observation (refer to the Data Requirements section). This residual value is an argument of the observation weighting function specified in the algorithm settings (refer also to the Weight Types section).
The original model assumes that the error variance $\sigma^2$ is constant for each observation. Consider data with a known but not constant variance $\sigma_i^2$ for each observation. Both sides of the model can be divided by $\sigma_i$, so the original model can be rewritten as:

$$\frac{y_i}{\sigma_i} = \beta_0 \frac{1}{\sigma_i} + \beta_1 \frac{x_{i1}}{\sigma_i} + \dots + \beta_p \frac{x_{ip}}{\sigma_i} + \frac{\varepsilon_i}{\sigma_i}.$$

Now the rewritten model has constant error variance and can be estimated using the least squares method. This method is called Weighted Least Squares Regression (WLS). The term $w_i = 1 / \sigma_i^2$ is the 'weight'. For WLS the residual weighted sum of squares is defined as

$$SSE_w = \sum_{i=1}^{n} w_i \left( y_i - \hat{y}_i \right)^2.$$
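A NumPy sketch of the weighted fit, assuming the $\sigma_i$ are known: the weights $w_i = 1/\sigma_i^2$ enter through the weighted normal equations $(\mathbf{X}^T\mathbf{W}\mathbf{X})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{W}\mathbf{y}$, which is equivalent to dividing each row of the model by $\sigma_i$ and applying ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(9)
    n = 300
    x = rng.uniform(0, 10, size=n)
    sigma = 0.2 + 0.3 * x                       # known, non-constant error standard deviation
    y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

    X = np.column_stack([np.ones(n), x])
    w = 1.0 / sigma**2                          # WLS weights

    # Weighted normal equations: (X' W X) beta = X' W y
    XtW = X.T * w
    beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

    resid = y - X @ beta_wls
    weighted_sse = np.sum(w * resid**2)         # residual weighted sum of squares
    print(beta_wls, weighted_sse)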
There are cases when the Linear Regression model assumptions are not satisfied, especially when observation errors are not normally distributed (with 0 mean and constant variance), for example when outliers are present. In these situations it is possible to use the Iteratively Re-weighted Least Squares method to obtain model estimators which are better fitted to the true trends in the data.
In the general case, we do not know the true weights for a particular set of data, so it is not possible to fit the proper weighted regression (WLS) model directly. To estimate the weights automatically, together with the coefficients of the model, we can use the Iteratively Re-Weighted Least Squares method, which determines the appropriate weights by fitting a sequence of WLS models in the so-called IRLS loop, described below.
IRLS Loop:
1. Set the initial weights to 1.
2. Estimate the model using WLS.
3. Compute the model's standard deviation via the median absolute deviation (MAD) estimator $s = \mathrm{MAD} / 0.6745$, where $\mathrm{MAD}$ is the median of the absolute values of the residual errors over all observations.
4. Exit the loop if the change in $s$ is smaller than an arbitrarily chosen convergence threshold.
5. Change the weights using a pre-determined function (see Weight types): $w_i = f(e_i / s)$, where $e_i$ is the residual of the $i$-th observation.
6. Go back to step 2.
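A compact Python sketch of the loop, assuming a Huber-type weighting function as the pre-determined function (the actual function is selected in the Weight Types settings; the helper names below are illustrative):

    import numpy as np

    def huber_weights(u, c=1.345):
        """Hypothetical Huber weighting function of the scaled residual u = e / s."""
        return np.where(np.abs(u) <= c, 1.0, c / np.abs(u))

    def irls(X, y, tol=1e-6, max_iter=50):
        w = np.ones(len(y))                       # step 1: initial weights
        s_old = np.inf
        for _ in range(max_iter):
            XtW = X.T * w                         # step 2: weighted least squares fit
            beta = np.linalg.solve(XtW @ X, XtW @ y)
            e = y - X @ beta
            mad = np.median(np.abs(e))            # step 3: MAD-based scale estimate
            s = mad / 0.6745
            if abs(s - s_old) < tol:              # step 4: convergence check on s
                break
            s_old = s
            w = huber_weights(e / s)              # step 5: update weights, then repeat
        return beta

    rng = np.random.default_rng(10)
    x = rng.uniform(0, 10, size=100)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
    y[:5] += 30.0                                 # a few outliers
    X = np.column_stack([np.ones_like(x), x])
    print(irls(X, y))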