Most models in AdvancedMiner, once built on training data, can be applied to other data sets.
Applying is a special mining function which uses a previously created mining model (i.e. the mining algorithm) to obtain output information for the given input data. Models are applied according to the same general scheme; there are, however, some differences between specific models. This document describes the common procedure of applying a model.
Applying is a process of reading the input data set and creating the output data set. Every input row has a corresponding output row. The output rows are created while executing the apply task, and their values are computed by the algorithm; which algorithm is used depends on the specified mining model.
Both input and output data have to be database tables. The input table must exist, the output table is created during the execution of the apply task.
The structure of the output table is fully customizable by the user. ApplyOutput is a metadata object responsible for specifying the output table.
As is common in AdvancedMiner, applying introduces a layer of abstraction on top of the mining algorithm, called the mining function. Therefore every algorithm performing the same mining function is applied in the same way.
There will be exactly as many attributes in the apply output as there are output items specified by the user. Output items are placed either in directMapping or in applyOutput. The former are items that are simply copied from input to output; the latter are items generated by the model for each observation.
The applying architecture is designed to give the user the maximum flexibility. Therefore, the apply design introduces the concept of apply output item.
A single ApplyOutputItem describes a single attribute in the output table to be created in the applying process. There are four main output items: Category, Cluster, Rank and Source.
Category: creates the output value for the specified target category value; see the classification section for details.
Cluster: creates the output value for the specified cluster; see the clustering section for details.
Rank: output values are assigned to the first, second, ..., n-th ranked prediction; used by classification and clustering models.
Source: the output attribute contains values copied from the input; see the direct mapping section for details.
With every output item an output type is associated. It specifies the exact value type of the output item. For example, for a single classifier decision the output value may be the probability of the decision or the id of the structure associated with the decision. As another example, there are three different output types for the best cluster when applying a clustering model: clusterId, probability, and distance. More examples can be found in the examples sections for each module.
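To illustrate the difference between output item types and output types, the following plain-Python sketch (not the AdvancedMiner API; the dictionary of per-category probabilities is a made-up stand-in for a classifier's output on one row) shows what a rank item and a category item would extract:

```python
# Hypothetical per-category probabilities produced by a classifier
# for a single input row (illustrative values only).
probs = {'yes': 0.7, 'no': 0.3}

# Categories ordered from best (rank 0) downwards, as a rank item sees them.
ranked = sorted(probs, key=probs.get, reverse=True)

# Rank item, output type predictedCategory, rank 0: the best category.
best_category = ranked[0]            # 'yes'

# Rank item, output type probability, rank 0: probability of the best category.
best_probability = probs[ranked[0]]  # 0.7

# Category item, output type probability, for the fixed category 'no'.
prob_of_no = probs['no']             # 0.3
```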
The apply task has the functionality to directly copy some values from the input table to the output table. No transformations are made.
To specify the attributes to be mapped, ApplySourceItem objects must be added to the directMapping list. An ApplySourceItem contains the following attributes:
- the name of the attribute in the input data
- the name of the output attribute
Every mining model contains a signature with logical attributes used for mining purposes. In the apply task, the model is used to create an output row for each input row. It is necessary to specify how the input attributes are to be treated by the model.
Model assignment specifies how the input attributes are mapped to the model signature.
When no mapping is explicitly defined, the default mapping is used. It is a mechanism provided for the user's convenience. AdvancedMiner tries to map the input attributes to the signature attributes by matching their names.
This means that if the apply source data has the same signature (list of attribute names) as the training data set, no mapping is required to perform the applying process properly.
It may happen that some attributes from the model signature have no corresponding input attribute (mapped explicitly or by default). In such cases the attribute is not mapped, and every input value for this attribute is set to a missing (NULL) value. The effects of such a situation depend on the specific algorithm: some algorithms handle the missing data, while others may interrupt with an exception.
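The name-matching rule combined with explicit assignments can be sketched in plain Python (this is an illustration of the described behaviour, not the AdvancedMiner implementation; all names are placeholders):

```python
def default_assignment(signature, input_columns, explicit=None):
    """Map each signature attribute to an input column: explicit
    assignments win, then exact name matches; unmatched attributes
    map to None, i.e. the model receives missing (NULL) values."""
    explicit = explicit or {}
    mapping = {}
    for attr in signature:
        if attr in explicit:
            mapping[attr] = explicit[attr]
        elif attr in input_columns:
            mapping[attr] = attr
        else:
            mapping[attr] = None  # unmapped -> NULL input values
    return mapping

# Signature from training; the input table has two renamed columns.
signature = ['age', 'region', 'assets']
columns = ['age', 'region_code', 'assets_2']

# Without explicit assignments, only 'age' matches by name.
print(default_assignment(signature, columns))
# With explicit assignments (mirroring SignatureAssignment usage),
# all three signature attributes are mapped.
print(default_assignment(signature, columns,
                         {'region': 'region_code', 'assets': 'assets_2'}))
```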
Normally, the apply task expects that the output table does not exist. In this case the table is created with a list of columns according to the user's specification.
When the output table already exists, the apply task executor tries to append results to the existing table. Therefore the table is required to have the same signature as the output specification; if this condition is not met, an exception is signaled.
It is possible to turn off this mechanism and always overwrite the existing output table. To do this, set the replaceExistingData property to TRUE.
Here is the minimal set of steps required to set up a basic apply task:
1. Create the apply task using the MiningApplyTask class.
2. Set the model by assigning the model's name (as it is named in the Repository) to the model property.
3. Set the input and output tables using physical data object names; use the sourceData and targetData properties, respectively.
4. Create and set the ApplyOutput object. The possible types are:
ApproximationApplyOutput
ClusteringApplyOutput
ClassificationApplyOutput
SurvivalApplyOutput
TimeSeriesApplyOutput
It is required to set at least one output item.
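Putting these steps together, a minimal apply-task setup might look as follows. This is a sketch in the AdvancedMiner script language, condensed from the classification examples later in this chapter; it assumes a classification model named 'model' and an input table 'input_table' already exist in the Repository, and all names are placeholders:

```
# minimal apply task sketch -- assumes a classification model 'model'
# and an input table 'input_table' already exist
save('input_table', PhysicalData('input_table'))
save('output_table', PhysicalData('output_table'))

task = MiningApplyTask()
task.modelName = 'model'
task.sourceDataName = 'input_table'
task.targetDataName = 'output_table'

task.applyOutput = ClassificationApplyOutput()
# at least one output item is required
task.applyOutput.item.add(
    ClassificationRankItem('decision',
                           ClassificationOutputType.predictedCategory, 0))

save('apply', task)
execute('apply')
```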
This section contains the information about applying which is specific to various mining functions in AdvancedMiner.
classes: ClassificationOutputType, ClassificationCategoryItem and ClassificationRankItem
Table 15.9. Classification - Output items and output types combinations
| output type | output item type | description |
|---|---|---|
| probability | rank | returns the probability of the n-th best category |
| probability | category | returns the probability of classifying as belonging to a given category |
| predictedCategory | rank | returns the n-th best category |
| predictedCategory | category | not supported |
| nodeId | rank | returns the structure id to which the input case was assigned. It depends on the particular algorithm what the term "structure" means. Some algorithms do not support this feature at all. Refer to the chapter describing the algorithm/module for more details. |
| nodeId | category | not supported |
| leverage, pearsonResidual, devianceResidual, dfbetas, c, cBar, deltaDev, deltaChiSq | logistic regression | returns the statistics specific for the logistic regression algorithm |
classes: ApproximationOutputType
Table 15.10. Approximation - Output type descriptions
| output type | output item type | description |
|---|---|---|
| predictedValue | aprox | returns the predicted value computed by the approximator |
| confidence | aprox | returns the probability that the approximated value is true. It depends on the particular algorithm how this value will be computed. Refer to the chapter describing the algorithm/module for more details. |
| cookDistance, covRatio, dfbetas, dffits, leverage, press, rstudentResidual, studentResidual | linear regression | returns the statistics specific for the linear regression algorithm |
classes: ClusterIdItem and ClusteringRankItem, ClusteringOutputType
Table 15.11. Clustering - Output items and output types combinations
| output type | output item type | description |
|---|---|---|
| probability | rank | returns the probability of the n-th best cluster |
| probability | clusterId | returns the probability that the current input case belongs to the selected cluster |
| clusterId | rank | returns the n-th best clusterId |
| clusterId | clusterId | returns the id of the cluster to which the input case was assigned |
Creating an apply task for survival models does not require adding any particular apply output items to the item property. All the outputs are specified by setting the appropriate SurvivalApplyOutput properties. The settings are:
- the lower bound of the time scale interval for which survival function values are calculated
- the upper bound of the time scale interval for which survival function values are calculated
- the number of time points between the first and last time points for which the survival function will be calculated
- the name prefix used to create the output column names
The output table will contain X output columns, named 'prefix_0' to 'prefix_X-1' where X is equal to numberOfTimePoints.
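The column-naming scheme can be sketched in plain Python (illustrative only; 'surv' is a made-up prefix):

```python
def survival_column_names(prefix, number_of_time_points):
    """Generate column names 'prefix_0' .. 'prefix_X-1'
    for X = number_of_time_points."""
    return ['%s_%d' % (prefix, i) for i in range(number_of_time_points)]

print(survival_column_names('surv', 4))
# ['surv_0', 'surv_1', 'surv_2', 'surv_3']
```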
Survival, like all other methods, makes use of direct mapping.
Creating an apply task for a time series model does not require adding any particular apply output items to the items property. All the outputs are specified by setting the appropriate TimeSeriesApplyOutput properties. The settings are:
- the name prefix used for naming the variables denoting the predicted variances at the successive time points
- the number of successive points in time used in the prediction
The output table will contain the following columns:
- X output columns named 'prefix_0' to 'prefix_X-1', where X is equal to the number of time points; these are the variables denoting the predicted variances at the successive time points
- regressed_mean: the mean computed from the linear part of the model on the basis of the explanatory variables
- forecast_lowerbound/upperbound: the lower/upper bound of the confidence interval for the regressed mean and variance
TimeSeries, like all other methods, makes use of direct mapping.
The following sections show some examples of applying models to data in AdvancedMiner.
The script below produces the model that will be used in the examples that follow. The data is artificial: it represents a hypothetical problem of segmenting customers, based on their attribute values, into an interesting and an uninteresting group.
Example 15.2. A sample model for application examples
# artificial training set
table 'customers_train':
customer_id age region assets interesting
1 18 'A' 1500 'no'
2 34 'A' 200 'no'
3 15 'A' 100 'no'
4 24 'A' 20000 'yes'
5 21 'A' 10000 'yes'
6 54 'B' 30000 'yes'
7 14 'B' 20000 'yes'
8 13 'B' 1200 'yes'
9 15 'B' 100 'yes'
pd = PhysicalData('customers_train')
fs = ClassificationFunctionSettings()
fs.logicalData = LogicalData(pd)
fs.algorithmSettings = TreeSettings()
# excluding the id attribute from model building
att = fs.attributeUsageSet.getAttribute('customer_id')
att.setUsage(UsageOption.inactive)
fs.targetAttributeName = 'interesting'
bt = MiningBuildTask( 'pd', 'fs', 'model' )
save('bt', bt)
save('fs', fs)
save('pd', pd)
execute('bt')
model = load('model')
for leaf in model.modelStatistics.allLeaves:
    # print some node statistics
    print "**Node %d, count: %g" % (leaf.nodeID, leaf.count)
    print 'RULE:'
    print leaf.nodeRule
    print 'Predicted value:'
    print leaf.getPredictedValue()
    print '---'
Output:
**Node 1, count: 4
RULE:
region IN ('B')
Predicted value:
yes
---
**Node 3, count: 3
RULE:
region IN ('A') AND
assets < 5 750.000
Predicted value:
no
---
**Node 4, count: 2
RULE:
region IN ('A') AND
assets >= 5 750.000
Predicted value:
yes
---
The next script prepares and runs an apply task which predicts the category for each observation.
Example 15.3. Basic classification example. Assign decision to an input vector.
# artificial training set,
table 'customers_train':
customer_id age region assets interesting
1 18 'A' 1500 'no'
2 34 'A' 200 'no'
3 15 'A' 100 'no'
4 24 'A' 20000 'yes'
5 21 'A' 10000 'yes'
6 54 'B' 30000 'yes'
7 14 'B' 20000 'yes'
8 13 'B' 1200 'yes'
9 15 'B' 100 'yes'
pd = PhysicalData('customers_train')
fs = ClassificationFunctionSettings()
fs.logicalData = LogicalData(pd)
fs.algorithmSettings = TreeSettings()
# excluding the id attribute from model building
att = fs.attributeUsageSet.getAttribute('customer_id')
att.setUsage(UsageOption.inactive)
fs.targetAttributeName = 'interesting'
bt = MiningBuildTask( 'pd', 'fs', 'model' )
save('bt', bt)
save('fs', fs)
save('pd', pd)
execute('bt')
#--------------------------------------------------
# prepare and run apply task which predicts
# the category for each observation
#--------------------------------------------------
table 'new_customers':
age region assets
14 'A' 400
23 'B' 6500
78 'A' 500
24 'B' 10000
15 'B' 600
14 'B' 200
# a classification mining model built on data similar
# to the one in the data_to_apply set
classifierName = 'model'
inputTableName = 'new_customers'
outputTableName = 'decisions'
save(inputTableName, PhysicalData(inputTableName) )
save(outputTableName, PhysicalData(outputTableName) )
task = MiningApplyTask()
task.setReplaceExistingData(TRUE)
task.applyOutput = ClassificationApplyOutput()
# there is one output attribute for the classifier decision;
# it will contain the target categories selected by the algorithm
# as the best (with first rank)
predictedItem = ClassificationRankItem('decision', ClassificationOutputType.\
predictedCategory, 0)
task.applyOutput.item.add( predictedItem )
task.modelName = classifierName
task.sourceDataName = inputTableName
task.targetDataName = outputTableName
save( 'apply', task )
execute( 'apply' )
# print the results:
print 'decision'
print 20*'_'
trans None <- outputTableName:
    print '%s' % decision
Output:
decision
____________________
no
yes
no
yes
yes
yes
The following script shows how to use a survival model to assign the probability that an event will occur in the next point in time.
The script below enhances the previous script by adding a customer id and using attribute mapping.
Example 15.5. Applying a model with attribute mapping
# artificial training set
table 'customers_train':
customer_id age region assets interesting
1 18 'A' 1500 'no'
2 34 'A' 200 'no'
3 15 'A' 100 'no'
4 24 'A' 20000 'yes'
5 21 'A' 10000 'yes'
6 54 'B' 30000 'yes'
7 14 'B' 20000 'yes'
8 13 'B' 1200 'yes'
9 15 'B' 100 'yes'
pd = PhysicalData('customers_train')
fs = ClassificationFunctionSettings()
fs.logicalData = LogicalData(pd)
fs.algorithmSettings = TreeSettings()
# excluding id attribute from model building
att = fs.attributeUsageSet.getAttribute('customer_id')
att.setUsage(UsageOption.inactive)
fs.targetAttributeName = 'interesting'
bt = MiningBuildTask( 'pd', 'fs', 'model' )
save('bt', bt)
save('fs', fs)
save('pd', pd)
execute('bt')
#--------------------------------------------------
# prepare and run apply task which predicts
# the category for each observation
#--------------------------------------------------
table 'new_customers':
customer_id age region_code assets_2
12 14 'A' 400
18 23 'B' 6500
11 78 'A' 500
16 24 'B' 10000
19 15 'B' 600
17 14 'B' 200
# a classification mining model, built on data similar
# to the one in the data_to_apply set
classifierName = 'model'
inputTableName = 'new_customers'
outputTableName = 'decisions_table'
save(inputTableName, PhysicalData(inputTableName) )
save(outputTableName, PhysicalData(outputTableName) )
task = MiningApplyTask()
# Output attribute 'id' will have values directly copied from
# input attribute 'customer_id', which will let the user
# identify corresponding rows in the input and output tables.
task.directMapping.add( ApplySourceItem('customer_id', 'id') )
# Two columns in the 'new_customers' table have different names
# than these in the model signature. If a mapping is not added,
# the input values for these attributes will be NULL.
task.modelAssignment.assignment.add( SignatureAssignment( 'region_code', 'region' ) )
task.modelAssignment.assignment.add( SignatureAssignment( 'assets_2', 'assets' ) )
task.applyOutput = ClassificationApplyOutput()
task.setReplaceExistingData(TRUE)
# There is one output attribute for the classifier decision.
# It will contain target categories selected by the algorithm
# as the best (with first rank)
predictedItem = ClassificationRankItem('decision', ClassificationOutputType.\
predictedCategory, 0)
task.applyOutput.item.add( predictedItem )
task.modelName = classifierName
task.sourceDataName = inputTableName
task.targetDataName = outputTableName
save( 'apply', task )
execute( 'apply' )
# print the results:
print 'id\tdecision'
print 20*'_'
trans None <- outputTableName:
    print '%d\t%s' % (id, decision)
Output:
id decision
____________________
12 no
18 yes
11 no
16 yes
19 yes
17 yes
This script shows how to change the positive ('good') classification threshold to 0.3.
Example 15.6. Changing the classification threshold
# This script is not complete. An appropriate table and task
# are required
ao = ClassificationApplyOutput()
ao.item.add( ClassificationRankItem( 'good_prob', ClassificationOutputType.\
probability, 0) )
trans 'decision' <- 'apply_output_table':
    drop out good_prob
    if good_prob > 0.3:
        target = 'good'
    else:
        target = 'bad'
In the script below, an example advice system suggests a second decision (i.e. the decision with the second rank) when the probability of the first decision is too low (less than or equal to 0.7).
Example 15.7. An advice system with secondary decision suggestion
# this script is not complete. An appropriate table and task
# are required
ao = ClassificationApplyOutput()
ao.item.add( ClassificationRankItem('first', ClassificationOutputType.\
predictedCategory, 0) )
ao.item.add( ClassificationRankItem('first_prob', ClassificationOutputType.\
probability, 0) )
ao.item.add( ClassificationRankItem('second', ClassificationOutputType.\
predictedCategory, 1) )
trans 'decision' <- 'apply_output_table':
    drop out first_prob
    if first_prob > 0.7:
        second = None