Usage

Data requirements

The data set for Bivariate Probit model building should contain numerical explanatory attributes (active/active2/active12) and categorical binary target and target2 attributes. To use categorical explanatory variables, convert them to binary zero-one dummy variables using the binarize procedure with the Random Redundant option switched on. Binarization can be performed before building the model, or during the building process by selecting the Automatic Data Transformation option in General Algorithm Settings.
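
The idea behind zero-one dummy variables can be sketched in plain Python. This is only an illustration, not the binarize procedure itself: the helper name to_dummies, the sample attribute values, and the rule for choosing the redundant category are assumptions made for this example.

```python
def to_dummies(values, redundant=None):
    """Encode a categorical column as zero-one dummy columns.

    One category (the 'redundant' one) gets no column of its own, so the
    dummy columns are not collinear with the intercept.
    Hypothetical helper, not part of AdvancedMiner.
    """
    categories = sorted(set(values))
    if redundant is None:
        redundant = categories[-1]  # assumption: drop the last category
    kept = [c for c in categories if c != redundant]
    return {c: [1 if v == c else 0 for v in values] for c in kept}


region = ["north", "south", "west", "south"]  # made-up sample data
dummies = to_dummies(region, redundant="west")
# dummies == {"north": [1, 0, 0, 0], "south": [0, 1, 0, 1]}
```

A variable with k categories yields k - 1 dummy columns; an observation in the redundant category has zeros in all of them.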

Missing values are not supported by the Bivariate Probit module unless the Automatic Data Transformation option in General Algorithm Settings is selected. Alternatively, replace the missing data before building the model, or switch to Liberal Mode in the algorithm settings to automatically omit observations containing missing values. This requirement also applies to the target attribute values in the censored model, because the initial parameter estimates are obtained from logistic regression models.

To build the Bivariate Probit model, the following must be selected in the bivariate function settings: Target (the name of the target variable in the default equation), Target2 (the name of the target variable in the censoring equation), and 1st Target Positive Value and 2nd Target Positive Value (the positive values for the respective targets). For the Target variable, the positive value is understood in the same way as for the Logistic Regression Model. For the Target2 variable, the positive value means that a given observation is not censored and enters the default equation. The two target attributes can also be selected by setting the appropriate usage options in the current AttributeUsageSet: target for the default equation target and target2 for the censoring equation target. For more details see the AdvancedMiner Concepts section.

Variables which are expected to be present in both equations should be set to active12 in the current AttributeUsageSet. Variables which have to be present only in the default equation should be set to active in the current AttributeUsageSet. Variables which have to be present only in the censoring equation should be set to active2 in the current AttributeUsageSet.
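
The mapping from usage types to equations can be sketched as follows. This is plain Python rather than the AdvancedMiner API; the function split_by_equation and all attribute names are hypothetical and serve only to show which variables end up in which equation.

```python
def split_by_equation(usage):
    """Return the explanatory variables of the default and censoring equations.

    `usage` maps attribute name -> usage type. Hypothetical helper that
    mirrors the rules described above; not part of AdvancedMiner.
    """
    default_eq = [name for name, u in usage.items() if u in ("active", "active12")]
    censoring_eq = [name for name, u in usage.items() if u in ("active2", "active12")]
    return default_eq, censoring_eq


usage = {
    "income": "active12",      # both equations
    "debt_ratio": "active",    # default equation only
    "age": "active2",          # censoring equation only
    "default_flag": "target",
    "accepted_flag": "target2",
}
default_eq, censoring_eq = split_by_equation(usage)
# default_eq == ["income", "debt_ratio"]
# censoring_eq == ["income", "age"]
```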

For an example script see the Examples section (the full source code with the data can be found in the Examples appendix).

Model building

Model building is performed in the standard way and the complete procedure is described in the section AdvancedMiner Concepts (see Approximation). Full specification of the model settings contains the elements of Algorithm Settings: General Algorithm Settings, Optimization Algorithm Settings, Variable Selection Settings and Transformation Settings.

It is also possible to use the seedModel option in the BivariateFunctionSettings to build a bivariate model incrementally (especially with different settings), starting from the parameter estimates of the selected seed model. If a seed model is specified, both the Preselection and the Variable Selection Settings options are ignored.

Algorithm settings

The Bivariate Probit algorithm is controlled by the following options:

Table 17.1. Bivariate Probit Module: Algorithm Settings

Name | Description | Possible values | Default value
--- | --- | --- | ---
Automatic Data Transformations | if TRUE, automatic transformations (e.g. replaceMissing, binarization) are executed | TRUE / FALSE | FALSE
Confidence Level | the confidence level used to calculate the interval estimators for the model parameters | real number from the interval (0.5, 1) | 0.95
Execute Init Tests | if TRUE, initial data/task tests are executed | TRUE / FALSE | TRUE
Group Statistics | if TRUE, statistics for variable groups are computed | TRUE / FALSE | TRUE
Liberal Execution Mode | if TRUE, 'liberal' execution is preferred (do not stop on minor errors) | TRUE / FALSE | TRUE
Likelihood Function Type | whether to build a full or censored bivariate probit model | full / censored | censored
Logistic Estimation Method | sets the estimation algorithm for the preselection logistic regression models | fisher / newton | fisher
Logistic Preselection | if TRUE, the algorithm performs preliminary variable selection using two supplementary logistic models | TRUE / FALSE | TRUE
Randomization | if TRUE, likelihood optimization via the Simulated Annealing heuristic is turned on | TRUE / FALSE | TRUE

Likelihood Function Type

If the option full has been chosen then a full bivariate probit model is built. If the option censored has been chosen then the partial observability bivariate probit model is built, i.e. the value of the target variable is observed only if the value of the target2 variable is positive (equal to the 2nd Target Positive Value selected in the BivariateFunctionSettings).
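
For orientation, the two likelihood types can be written out in a common textbook form. The notation is an assumption introduced here and is not taken from the module: y_{1i} and y_{2i} equal 1 when target and target2 take their positive values (0 otherwise), q_{ki} = 2 y_{ki} - 1, x_{1i} and x_{2i} are the explanatory vectors of the default and censoring equations, \beta_1 and \beta_2 are the corresponding coefficient vectors, \rho is the correlation coefficient, \Phi is the standard normal distribution function and \Phi_2(\cdot, \cdot, r) is the standard bivariate normal distribution function with correlation r.

Full model:

\[
\ln L = \sum_{i} \ln \Phi_2\left(q_{1i}\, x_{1i}^{T}\beta_1,\; q_{2i}\, x_{2i}^{T}\beta_2,\; q_{1i} q_{2i}\, \rho\right)
\]

Censored model (y_{1i} contributes only when y_{2i} = 1):

\[
\ln L = \sum_{i} \left[ (1 - y_{2i}) \ln \Phi\left(-x_{2i}^{T}\beta_2\right) + y_{2i} \ln \Phi_2\left(q_{1i}\, x_{1i}^{T}\beta_1,\; x_{2i}^{T}\beta_2,\; q_{1i}\, \rho\right) \right]
\]

In the censored form, observations with a non-positive target2 value contribute only through the censoring equation, which is why their target value is not needed.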

Randomization

If set to FALSE, the usual Newton-Raphson algorithm (with constraints on the correlation coefficient) is executed. If set to TRUE, the Newton-Raphson algorithm with an additional Simulated Annealing heuristic is executed instead. The motivation for using Simulated Annealing is that the Bivariate Probit likelihood is not a concave function, so the ordinary Newton-Raphson algorithm may occasionally end up in a local maximum. To prevent this, an occasional small downward step is allowed with a probability proportional to the difference between the current and the new (i.e. lower) likelihood function value.
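
The acceptance of an occasional downward step can be sketched as below. This sketch uses the standard Metropolis acceptance rule from the Simulated Annealing literature, which is an assumption: the rule, the temperature parameter and the function name accept_step are not taken from the module, whose actual acceptance probability may be computed differently.

```python
import math
import random


def accept_step(current_ll, new_ll, temperature, rng=random):
    """Decide whether to move from current_ll to new_ll (log-likelihood values).

    Uphill moves are always accepted. Downhill moves are accepted with
    probability exp((new_ll - current_ll) / temperature), i.e. the standard
    Metropolis rule (an assumption; not necessarily the module's rule).
    """
    if new_ll >= current_ll:
        return True
    return rng.random() < math.exp((new_ll - current_ll) / temperature)


# An improving step is always taken:
improved = accept_step(-120.0, -119.5, temperature=1.0)  # True
```

Because downhill steps are accepted at random, two runs on the same data can follow different paths, which is the behavior the Note below warns about.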

Note

The Newton-Raphson algorithm with Randomization turned on is a randomized algorithm. For this reason the algorithm behavior (and especially the number of iterations required to converge) may vary from one run to another. Using the seedModel option is recommended.

Figure 17.1. Bivariate Probit Algorithm Settings Window


In addition to the settings specific to the Bivariate Probit algorithm, the user can use:

  • Variable Selection Settings - to control the behavior of the available heuristics for model building; these settings are described in the Automatic Variable Selection chapter

  • Optimization Algorithm Settings - to control the selection of the optimization algorithm; these settings are described in the Optimization Library chapter.

  • Transformation Settings - to control the way of data transformation; these settings are described in the Transformation chapter.

Remarks

  • Unlike in other statistical modules, variable selection in the Bivariate Probit module refers only to variable preselection for the main bivariate probit model. Preselection means building two supplementary logistic models independently (one for each equation). Variables which are significant (with respect to the Entry Level and Leave Level thresholds) become explanatory variables in the bivariate model. Next, the final bivariate model is built (without any subsequent variable selection).

  • All variable selection methods take into account only the variables selected as active or active12 for the first supplementary logistic model (which pre-estimates the default equation parameters), and active2 or active12 for the second supplementary logistic model (which pre-estimates the censoring equation parameters).

  • In the case of the Bivariate Probit algorithm, the following Variable Selection options available to the user do not affect the final algorithm settings used during the model estimation phase:

    • Auxiliary Lift Estimation Mode,
    • Group Mode.

    This is because the bivariate algorithm does not entirely fit the general Variable Selection component.

Model statistics

The final model contains the following statistics: Variable Statistics, Group Statistics (only if Group Statistics is set), Model Fit Statistics, Coefficient Correlation and Covariance Matrices and Attributes Correlation Matrix.

Table 17.2. Bivariate Probit Model Statistics: Variable Statistics

Name | Description
--- | ---
Coeff | the estimated parameter value
Lower Confidence | the lower bound of the confidence interval for the current estimator. The confidence interval is calculated for the confidence level specified in the current algorithm settings (see the Confidence Level option)
Pr(Wald>ChiSq) | the p-value for the Wald statistic for the parameter estimator. The statistic is tested with the chi-square distribution with one degree of freedom
StdErr | the standard error of the parameter estimator
Upper Confidence | the upper bound of the confidence interval for the current estimator. The confidence interval is calculated for the confidence level specified in the current algorithm settings (see the Confidence Level option)
Variable | attribute name (prefixed with '0.' for variables in the first [default] equation and '1.' for variables in the second [censoring] equation). The correlation coefficient name is 'RO'
Wald Test | the Wald statistic for the parameter estimator
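
How the per-variable statistics relate to each other can be sketched in Python using only the standard library. The function name variable_statistics and the sample numbers are hypothetical, and the symmetric normal-approximation confidence interval is an assumption about how the bounds are obtained rather than a statement about the module's implementation.

```python
import math
from statistics import NormalDist


def variable_statistics(coeff, std_err, confidence_level=0.95):
    """Wald statistic, its p-value and a confidence interval for one coefficient.

    Hypothetical helper; assumes a symmetric normal-approximation interval.
    """
    wald = (coeff / std_err) ** 2
    # Survival function of the chi-square distribution with one degree of
    # freedom: P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    # Two-sided normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2.0)
    return {
        "Wald Test": wald,
        "Pr(Wald>ChiSq)": p_value,
        "Lower Confidence": coeff - z * std_err,
        "Upper Confidence": coeff + z * std_err,
    }


stats = variable_statistics(coeff=0.8, std_err=0.25)  # made-up values
```

For these made-up values the Wald statistic is (0.8 / 0.25)^2 = 10.24, and the interval is centered on the coefficient.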

Table 17.3. Bivariate Probit Model Statistics: Group Statistics

Name | Description
--- | ---
Variable | attribute name
DF | the number of degrees of freedom of the variable or group of variables; in the case of the Bivariate algorithm, always DF = 1 (see Note)
Wald Stat | the value of the Wald statistic
Wald Pr>ChiSq | the p-value of the Wald statistic

Table 17.4. Bivariate Probit Model Statistics: Model Statistics

Name | Description
--- | ---
Likelihood Ratio Stat | the value of the Likelihood Ratio statistic
Pr(LRatio>ChiSq) | the p-value for the Likelihood Ratio statistic. The statistic is tested with a chi-square distribution with p degrees of freedom, where p is the number of attributes included in the final model
Pr(Score>ChiSq) | the p-value for the Score statistic. The statistic is tested with a chi-square distribution with p degrees of freedom, where p is the number of attributes included in the final model
Pr(Wald>ChiSq) | the p-value for the Wald statistic. The statistic is tested with a chi-square distribution with p degrees of freedom, where p is the number of attributes included in the final model
Score Stat | the value of the Score statistic
Wald Stat | the value of the Wald statistic
Zero Correlation Stat | the statistic for the zero correlation hypothesis test
Pr(ZeroCorr>ChiSq) | the p-value for the zero correlation statistic. The statistic is tested with a chi-square distribution with one degree of freedom
ZeroCorr Sample | the size of the sample used for the zero correlation hypothesis test
Covariance | the covariance matrix of the parameter estimators
Correlation | the correlation matrix of the parameter estimators

Figure 17.2. Bivariate Probit Model Statistics Window

Covariance

The covariance matrix of the parameter estimators is calculated as

\[
\Sigma = \left(-H\right)^{-1}
\]

where H is the Hessian matrix of the log-likelihood function, calculated for the parameter estimators.

Correlation

The correlation matrix of the parameter estimators is calculated as

\[
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}
\]

where \sigma_{ij} is the (i,j)-th element of the covariance matrix of the parameter estimators.
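
The conversion from a covariance matrix to a correlation matrix can be sketched in a few lines of Python. The function name and the sample matrix are hypothetical; the computation divides each covariance element by the product of the two corresponding standard deviations.

```python
import math


def correlation_from_covariance(cov):
    """Convert a covariance matrix (list of lists) to a correlation matrix."""
    n = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(n)]  # standard errors
    return [[cov[i][j] / (std[i] * std[j]) for j in range(n)] for i in range(n)]


cov = [[0.04, 0.01],
       [0.01, 0.25]]  # made-up covariance matrix
corr = correlation_from_covariance(cov)
# off-diagonal element: 0.01 / (0.2 * 0.5) = 0.1 (up to rounding)
```

The square roots of the diagonal covariance elements are the standard errors of the parameter estimators.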

Likelihood Ratio statistic

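The Likelihood Ratio statistic is calculated as:

\[
LR = 2\left[\ln L(\hat{\theta}) - \ln L(\hat{\theta}_0)\right]
\]

where L is the likelihood function, \hat{\theta} is the maximum likelihood estimator for the final model and \hat{\theta}_0 is the maximum likelihood estimator for the null model, which contains no attributes.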

Score statistic

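The Score statistic is defined as:

\[
S = g^{T} \left(-H\right)^{-1} g
\]

where g is the gradient of the log-likelihood function and H is its Hessian matrix, both evaluated at the parameter estimates of the null model, which contains no attributes.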

Wald statistic

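The Wald statistic is calculated as:

\[
W = \hat{\beta}^{T}\, \Sigma_{\beta}^{-1}\, \hat{\beta}
\]

The Wald statistic for the parameter estimator \hat{\theta}_i is defined as:

\[
W_i = \frac{\hat{\theta}_i^{2}}{\sigma_{ii}}
\]

where \hat{\beta} is the vector of the estimated attribute coefficients, \Sigma_{\beta} is the corresponding block of the covariance matrix \Sigma = (-H)^{-1}, and \sigma_{ii} is the i-th diagonal element of \Sigma. H stands for the Hessian matrix (the matrix of the second order partial derivatives of the log-likelihood function); like the gradient of the log-likelihood function, it is calculated for the maximum likelihood estimator (MLE).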

Model application

The Bivariate Probit module may be applied to classification problems in a similar manner as the Logistic Regression model. Classification is based on the estimated probabilities and is made by setting a threshold probability. An observation (vector of attributes) is classified to one of two groups depending on the comparison between the conditional probability estimated for this observation and the threshold probability (see also the example script below).
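
Threshold-based classification can be sketched as follows. The function name, the category labels and the use of "greater than or equal" at the threshold are assumptions for illustration; this is plain Python, not the AdvancedMiner apply mechanism.

```python
def classify(probabilities, threshold=0.5, positive="1", negative="0"):
    """Assign each observation to a category by comparing its estimated
    probability of the positive category with the threshold.

    Hypothetical helper; treating p == threshold as positive is an assumption.
    """
    return [positive if p >= threshold else negative for p in probabilities]


estimated = [0.12, 0.50, 0.81]  # made-up estimated probabilities
labels = classify(estimated, threshold=0.5)
# labels == ["0", "1", "1"]
```

Raising the threshold makes the positive classification stricter, for example classify(estimated, threshold=0.6) labels only the last observation as positive.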

The Bivariate Probit module can produce two output types when classifying the provided data: category or probability. For details on how to apply the model to the data, see the chapter Applying Models in AdvancedMiner and the Classification subsection in the Applying for different mining functions section.

The table below presents the possible combinations and their descriptions.

Table 17.5. Output items and output types combinations

Output Type | Output Item Type | Description
--- | --- | ---
probability | rank | returns the probability of the n-th best category of the default equation target variable
probability | category | returns the probability of classifying as the given category of the default equation target variable
category | rank | returns the n-th best category of the default equation target variable
category | category | not supported
nodeID | rank | not supported
nodeID | category | not supported
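
The three supported combinations can be illustrated for a single observation as below. This is a sketch in plain Python under stated assumptions: the function name apply_outputs, the category labels and the dictionary keys are hypothetical, and "n-th best" is taken to mean the category with the n-th highest estimated probability.

```python
def apply_outputs(p_positive, positive="1", negative="0"):
    """Build the supported outputs for one observation from the estimated
    probability of the positive category of the default equation target.

    Hypothetical helper; not the AdvancedMiner apply API.
    """
    ranked = sorted(
        [(positive, p_positive), (negative, 1.0 - p_positive)],
        key=lambda pair: pair[1],
        reverse=True,  # best (most probable) category first
    )
    return {
        "category/rank": [c for c, _ in ranked],          # n-th best category
        "probability/rank": [p for _, p in ranked],       # its probability
        "probability/category": dict(ranked),             # probability per category
    }


outputs = apply_outputs(0.25)  # made-up estimated probability
# outputs["category/rank"] == ["0", "1"]
# outputs["probability/rank"] == [0.75, 0.25]
```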