The data set used to build a Bivariate Probit model should contain numerical explanatory attributes (with usage active, active2 or active12) and categorical binary target and target2 attributes. To use categorical explanatory variables, first convert them to binary zero-one dummy variables using the binarize procedure with the Random Redundant option switched on. Binarization can be performed before building the model or during the building process by selecting the Automatic Data Transformations option in General Algorithm Settings.
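The conversion of a categorical variable into zero-one dummy variables can be illustrated with a minimal Python sketch. It is not the binarize procedure itself: the function name `binarize`, its `redundant` parameter and the sample data are hypothetical, and the sketch assumes that one category is treated as redundant and receives no dummy column of its own.

```python
def binarize(values, redundant=None):
    """Convert one categorical column into zero-one dummy columns.

    One category (the 'redundant' one) gets no column, so the dummies
    are not collinear with the intercept of the model.
    """
    categories = sorted(set(values))
    if redundant is None:
        # Arbitrary choice for this sketch: drop the last category.
        redundant = categories[-1]
    kept = [c for c in categories if c != redundant]
    # One 0/1 column per kept category, keyed by the category name.
    return {c: [1 if v == c else 0 for v in values] for c in kept}


region = ["north", "south", "west", "south"]  # hypothetical attribute
dummies = binarize(region, redundant="west")
print(dummies)  # {'north': [1, 0, 0, 0], 'south': [0, 1, 0, 1]}
```

An observation with zeros in all dummy columns belongs to the redundant category.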
Missing values are not supported by the Bivariate Probit module unless the Automatic Data Transformations option in General Algorithm Settings is selected. Alternatively, replace the missing data before building the model, or switch to Liberal Mode in the algorithm settings so that observations containing missing values are omitted automatically. In the censored model this requirement also applies to the target attribute values, because the initial parameters are estimated by logistic regression models.
To build the Bivariate Probit model, select the following in the bivariate function settings: Target (the name of the target variable in the default equation), Target2 (the name of the target variable in the censoring equation), and 1st Target Positive Value and 2nd Target Positive Value (the positive values for the respective targets). For the Target variable the positive value is understood in the same way as for the Logistic Regression Model. For the Target2 variable the positive value means that a given observation is not censored and enters the default equation. The two target attributes can also be selected by setting the appropriate usage options in the current AttributeUsageSet: target for the default equation target and target2 for the censoring equation target. For more details see the AdvancedMiner Concepts section.
Variables which are expected to be present in both equations should be set to active12 in the current AttributeUsageSet. Variables which have to be present only in the default equation should be set to active in the current AttributeUsageSet. Variables which have to be present only in the censoring equation should be set to active2 in the current AttributeUsageSet.
For an example script see the Examples section (the full source code with the data can be found in the Examples appendix).
Model building is performed in the standard way and the complete procedure is described in the section AdvancedMiner Concepts (see Approximation). Full specification of the model settings contains the elements of Algorithm Settings: General Algorithm Settings, Optimization Algorithm Settings, Variable Selection Settings and Transformation Settings.
It is also possible to use the seedModel option in the BivariateFunctionSettings to build a bivariate model incrementally (especially with different settings), starting from the parameter estimates of the selected seed model. If a seed model is specified, both the Preselection and the Variable Selection Settings options are ignored.
The Bivariate Probit algorithm is controlled by the following options:
Table 17.1. Bivariate Probit Module: Algorithm Settings
| Name | Description | Possible values | Default value |
|---|---|---|---|
| Automatic Data Transformations | if TRUE automatic transformations (e.g. replaceMissing, binarization) should be executed. | TRUE / FALSE | FALSE |
| Confidence Level | the confidence level value for the calculation of the interval estimators for model parameters | real number from the interval (0.5,1) | 0.95 |
| Execute Init Tests | if TRUE initial data/task tests should be executed. | TRUE / FALSE | TRUE |
| Group Statistics | if TRUE statistics for variable groups should be computed. | TRUE / FALSE | TRUE |
| Liberal Execution Mode | if TRUE 'liberal' execution is preferred (do not stop on minor errors). | TRUE / FALSE | TRUE |
| Likelihood Function Type | whether to build a full or censored bivariate probit model | full / censored | censored |
| Logistic Estimation Method | sets the estimation algorithm for the preselection logistic regression models | fisher / newton | fisher |
| Logistic Preselection | if TRUE then the algorithm performs preliminary variable selection using two supplementary logistic models | TRUE / FALSE | TRUE |
| Randomization | if TRUE then likelihood optimization via Simulated Annealing heuristic is turned on | TRUE / FALSE | TRUE |
If the Likelihood Function Type option is set to full, a full bivariate probit model is built. If it is set to censored, the partial observability bivariate probit model is built, i.e. the value of the target variable is observed only if the value of the target2 variable is positive (equal to the 2nd Target Positive Value selected in the BivariateFunctionSettings).
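The difference between the two likelihood types can be sketched in Python as the log-likelihood contribution of a single observation. This is a textbook formulation under stated assumptions, not the module's implementation: the function names are hypothetical, `xb1` and `xb2` denote the linear predictors of the default and censoring equations, targets are coded 1 for the positive value and 0 otherwise, and the bivariate normal probability is obtained by simple numerical integration.

```python
import math


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def bvn_cdf(a, b, rho, steps=2000):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate normal, |rho| < 1.

    Integrates phi(x) * Phi((b - rho * x) / sqrt(1 - rho^2)) over
    [-8, a] with Simpson's rule (steps must be even).
    """
    lower = -8.0
    if a <= lower:
        return 0.0
    scale = math.sqrt(1.0 - rho * rho)
    h = (a - lower) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lower + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * norm_pdf(x) * norm_cdf((b - rho * x) / scale)
    return total * h / 3.0


def full_loglik_term(y1, y2, xb1, xb2, rho):
    """Full model: both targets are observed for every observation."""
    q1, q2 = 2 * y1 - 1, 2 * y2 - 1  # map 0/1 to -1/+1
    return math.log(bvn_cdf(q1 * xb1, q2 * xb2, q1 * q2 * rho))


def censored_loglik_term(y1, y2, xb1, xb2, rho):
    """Censored model: y1 is used only when y2 is positive."""
    if y2 == 0:
        # Censored observation: only the censoring equation contributes.
        return math.log(norm_cdf(-xb2))
    if y1 == 1:
        return math.log(bvn_cdf(xb1, xb2, rho))
    return math.log(bvn_cdf(-xb1, xb2, -rho))
```

In the censored variant the three possible outcomes (censored, observed positive, observed negative) have probabilities that sum to one, which is a convenient sanity check for any implementation.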
If the Randomization option is set to FALSE, the usual Newton-Raphson algorithm (with constraints on the correlation coefficient) is executed. If it is set to TRUE, the Newton-Raphson algorithm with an additional Simulated Annealing heuristic is executed instead. The motivation for using Simulated Annealing is that the Bivariate Probit likelihood is not a concave function, so the ordinary Newton-Raphson algorithm may occasionally end up in a local maximum. To prevent this, an occasional small downward step is allowed with a probability proportional to the difference between the current and new (i.e. lower) likelihood function value.
The Newton-Raphson algorithm with Randomization turned on is a randomized algorithm. For this reason the behavior of the algorithm (and especially the number of iterations required to converge) may vary from one run to another. Using the seedModel option is recommended.
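The idea of occasionally accepting a downward step can be sketched as follows. The sketch uses the standard Metropolis acceptance rule, which is an assumption made for illustration; the exact rule and cooling schedule used by the module are not specified here, and the names `accept_step` and `temperature` are hypothetical.

```python
import math
import random


def accept_step(current_ll, candidate_ll, temperature, rng):
    """Metropolis-style acceptance test for a maximization problem."""
    if candidate_ll >= current_ll:
        return True  # upward (or equal) steps are always accepted
    if temperature <= 0.0:
        return False  # no randomness left: plain hill climbing
    # Downward step: accept with probability exp(-drop / temperature).
    drop = current_ll - candidate_ll
    return rng.random() < math.exp(-drop / temperature)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
trials = 10000
accepted = sum(accept_step(-100.0, -100.5, 1.0, rng) for _ in range(trials))
rate = accepted / trials
print(rate)  # roughly exp(-0.5), i.e. about 0.61
```

Because acceptance depends on random draws, two runs with different seeds may follow different paths, which is why the number of iterations can vary between runs.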
In addition to the settings specific to the Bivariate Probit algorithm, the user can use:
Variable Selection Settings - to control the behavior of the available heuristics for model building; these settings are described in the Automatic Variable Selection chapter.
Optimization Algorithm Settings - to control the selection of the optimization algorithm; these settings are described in the Optimization Library chapter.
Transformation Settings - to control the way of data transformation; these settings are described in the Transformation chapter.
Unlike in other statistical modules, variable selection in the Bivariate Probit model refers only to variable preselection for the main bivariate probit model. Preselection means building two supplementary logistic models independently (one for each equation). Variables which are significant (with respect to the Entry Level and Leave Level thresholds) become explanatory variables in the bivariate model. Next, the final bivariate model is built (without any subsequent variable selection).
All variable selection methods take into account only the variables selected as active or active12 for the first supplementary logistic model (which pre-estimates the default equation parameters), and active2 or active12 for the second supplementary logistic model (which pre-estimates the censoring equation parameters).
In the case of the Bivariate Probit algorithm, the following Variable Selection options available to the user do not affect the final algorithm settings used during the model estimation phase:
This is due to the specific nature of the bivariate algorithm, which does not entirely fit the general Variable Selection component.
The final model contains the following statistics: Variable Statistics, Group Statistics (only if Group Statistics is set), Model Fit Statistics, Coefficient Correlation and Covariance Matrices and Attributes Correlation Matrix.
Table 17.2. Bivariate Probit Model Statistics: Variable Statistics
| Name | Description |
|---|---|
| Coeff | the estimated parameter value |
| Lower Confidence | the lower bound of the confidence interval for the current estimator. The confidence interval is calculated for the confidence level specified in the current algorithm settings (see the Confidence Level option) |
| Pr(Wald>ChiSq) | the p-value for the Wald statistic for the parameter estimator. The statistic is tested with the chi-square distribution with one degree of freedom |
| StdErr | the standard error of the parameter estimator |
| Upper Confidence | the upper bound of the confidence interval for the current estimator. The confidence interval is calculated for the confidence level specified in the current algorithm settings (see the Confidence Level option) |
| Variable | attribute name (prefixed with '0.' for variables in the first [default] equation and '1.' for variables in the second [censoring] equation). Correlation coefficient name is 'RO'. |
| Wald Test | the Wald statistic for the parameter estimator |
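The relationship between the columns of this table can be illustrated with a short Python sketch. It assumes the usual normal-approximation confidence interval (the text does not state how the interval is constructed), and the function name `variable_statistics` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist


def variable_statistics(coeff, std_err, confidence_level=0.95):
    """Wald statistic, its p-value and a confidence interval."""
    wald = (coeff / std_err) ** 2
    # Chi-square with one degree of freedom: P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    # Two-sided normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2.0)
    return {
        "Coeff": coeff,
        "StdErr": std_err,
        "Wald Test": wald,
        "Pr(Wald>ChiSq)": p_value,
        "Lower Confidence": coeff - z * std_err,
        "Upper Confidence": coeff + z * std_err,
    }


stats = variable_statistics(coeff=1.96, std_err=1.0)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With a coefficient of about 1.96 standard errors the p-value is close to 0.05 and the lower bound of the 95% interval is close to zero, as expected.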
Table 17.3. Bivariate Probit Model Statistics: Group Statistics
| Name | Description |
|---|---|
| Variable | attribute name |
| DF | the number of degrees of freedom of the variable or group of variables; in the case of the Bivariate Probit algorithm DF = 1 always (see Note) |
| Wald Stat | the value of the Wald statistic |
| Wald Pr>ChiSq | the p-value of the Wald statistic |
Table 17.4. Bivariate Probit Model Statistics: Model Statistics
| Name | Description |
|---|---|
| Likelihood Ratio Stat | the value of the Likelihood Ratio statistic |
| Pr(LRatio>ChiSq) | the p-value for the Likelihood Ratio statistic. The statistic is tested with a chi-square distribution with p degrees of freedom, where p is the number of attributes included in the final model |
| Pr(Score>ChiSq) | the p-value for the Score statistic. The statistic is tested with a chi-square distribution with p degrees of freedom, where p is the number of attributes included in the final model |
| Pr(Wald>ChiSq) | the p-value for the Wald statistic. The statistic is tested with a chi-square distribution with p degrees of freedom, where p is the number of attributes included in the final model |
| Score Stat | the value of the Score statistic |
| Wald Stat | the value of the Wald statistic |
| Zero Correlation Stat | the statistic for the zero correlation hypothesis test |
| Pr(ZeroCorr>ChiSq) | the p-value for the zero correlation statistic. The statistic is tested with a chi-square distribution with one degree of freedom |
| ZeroCorr Sample | the size of the sample used for the zero correlation hypothesis test |
| Covariance | the covariance matrix of the parameter estimators |
| Correlation | the correlation matrix of the parameter estimators |
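The covariance matrix of the parameter estimators is calculated as

$$\Sigma = -H^{-1}(\hat{\theta})$$

where $H(\hat{\theta})$ is the Hessian matrix calculated for the parameter estimators $\hat{\theta}$.

The correlation matrix of the parameter estimators is calculated as

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}$$

where $\sigma_{ij}$ is the $(i,j)$-th element of the covariance matrix of the parameter estimators.

The Likelihood Ratio statistic is calculated as:

$$LR = 2\left[\ell(\hat{\theta}) - \ell(\hat{\theta}_0)\right]$$

where $\ell$ is the log-likelihood function and $\hat{\theta}_0$ is the estimator obtained under the null hypothesis that the coefficients of all $p$ attributes are zero.

The Score statistic is defined as:

$$S = g(\hat{\theta}_0)^{T}\left[-H(\hat{\theta}_0)\right]^{-1} g(\hat{\theta}_0)$$

where $g$ is the gradient of the log-likelihood function.

The Wald statistic is calculated as:

$$W = \hat{\beta}^{T}\,\Sigma_{\beta}^{-1}\,\hat{\beta}$$

where $\hat{\beta}$ is the part of $\hat{\theta}$ containing the coefficients of the $p$ attributes and $\Sigma_{\beta}$ is the corresponding block of $\Sigma$.

The Wald statistic for the parameter estimator is defined as:

$$W_i = \frac{\hat{\theta}_i^{2}}{\sigma_{ii}}$$

where $\Sigma = -H^{-1}$, with $H$ standing for the Hessian matrix (the matrix of the second order partial derivatives of the log-likelihood function), and $g$ is the gradient of the log-likelihood function, both calculated for the maximum likelihood estimator (MLE) indicated in the argument.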
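As a small illustration of how the correlation matrix of the parameter estimators follows from their covariance matrix, the Python sketch below divides each covariance by the product of the two corresponding standard deviations. The function name and the sample matrix are hypothetical.

```python
import math


def correlation_from_covariance(cov):
    """Turn a covariance matrix (list of rows) into a correlation matrix."""
    n = len(cov)
    # Standard deviations are the square roots of the diagonal elements.
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]


cov = [[4.0, 2.0],
       [2.0, 9.0]]
corr = correlation_from_covariance(cov)
print(corr)
```

The diagonal of the result equals one and the off-diagonal element is 2 / (2 * 3), i.e. one third.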
The Bivariate Probit module may be applied to classification problems in a similar manner as the Logistic Regression model. Classification is based on the estimated probabilities and is made by setting a threshold probability. An observation (vector of attributes) is classified to one of two groups depending on the comparison between the conditional probability estimated for this observation and the threshold probability (see also the example script below).
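The thresholding step described above can be sketched in a few lines of Python. The function name `classify`, the category labels, the threshold value and the convention that a probability equal to the threshold counts as positive are all assumptions made for illustration.

```python
def classify(probability, threshold=0.5, positive="1", negative="0"):
    """Assign a category by comparing a probability with a threshold."""
    return positive if probability >= threshold else negative


estimated = [0.12, 0.47, 0.50, 0.91]  # hypothetical estimated probabilities
labels = [classify(p, threshold=0.5) for p in estimated]
print(labels)  # ['0', '0', '1', '1']
```

Lowering the threshold classifies more observations as positive, and raising it has the opposite effect.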
The Bivariate Probit module can produce two output types when classifying the provided data: category or probability. For details on how to apply the model to the data see the chapter Applying Models in AdvancedMiner, and the Classification subsection in the Applying for different mining functions section.
The table below presents the possible combinations and their descriptions.
Table 17.5. Output items and output types combinations
| Output Type | Output Item Type | Description |
|---|---|---|
| probability | rank | returns the probability of the n-th best category of the default equation target variable |
| probability | category | returns the probability of classifying as the given category of the default equation target variable |
| category | rank | returns the n-th best category of the default equation target variable |
| category | category | not supported |
| nodeID | rank | not supported |
| nodeID | category | not supported |