Applying Models in AdvancedMiner

Basic concepts

Most models in AdvancedMiner, once built on the training data, can be applied to other data sets.

The Apply task

Applying is a special mining function which uses a previously created mining model (i.e. the output of a mining algorithm) to obtain output information for the given input data. Models are applied according to the same general scheme; there are, however, some differences between specific models. This document describes the common model applying procedure.

Applying is a process of reading the input data set and creating the output data set. Every input row has a corresponding output row. The output rows are created while executing the apply task. The output row values are computed by the algorithm. Which algorithm is used depends on the specified mining model.

Both input and output data have to be database tables. The input table must exist, the output table is created during the execution of the apply task.

Apply specification

The structure of the output table is fully customizable by the user. ApplyOutput is a metadata object responsible for specifying the output table.

As is common in AdvancedMiner, applying introduces a layer of abstraction on top of the mining algorithms, called mining functions. Therefore every algorithm performing the same mining function is applied in the same way.

The apply output contains exactly as many attributes as there are output items specified by the user. Output items are placed either in the directMapping list or in the applyOutput object. The former holds items that are simply copied from input to output; the latter holds items that are generated by the model for each observation.

Apply Output items

The applying architecture is designed to give the user maximum flexibility. Therefore, the apply design introduces the concept of an apply output item.

A single ApplyOutputItem describes a single attribute of the output table created in the applying process. There are four main output item types: Category, Cluster, Rank and Source.

Category

Creates the output value for the specified target category value; see the classification section for details.

Cluster

Creates the output value for the specified cluster; see the clustering section for details.

Rank

Output values are assigned to the first, second, n-th ranked prediction. Used by classification and clustering models.

Source

The output attribute contains values copied from the input. See the direct mapping section for details.

There are two additional output item types which are specific to the LogisticRegression and LinearRegression algorithms. The details about these output item types can be found in the corresponding chapters.

Output type

Every output item has an associated output type, which specifies the exact kind of value the output item will contain. For example, for a single classifier decision the output value may be the probability of the decision or the id of the structure associated with the decision. As another example, there are three different output types available when applying a clustering model for the best cluster: clusterId, probability and distance. More examples can be found in the examples sections for each module.
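
The best-cluster example above can be illustrated with a plain-Python sketch. Note this is not the AdvancedMiner API: the function name and the inverse-distance "probability" are illustrative assumptions only; real algorithms compute these values their own way.

```python
# Illustrative sketch (plain Python, not the AdvancedMiner API):
# the same "best cluster" decision reported under three output types.

def best_cluster_output(distances, output_type):
    """distances: dict clusterId -> distance of the case to the cluster centre."""
    best = min(distances, key=distances.get)
    if output_type == "clusterId":
        return best
    if output_type == "distance":
        return distances[best]
    if output_type == "probability":
        # toy soft assignment: inverse-distance weights normalised to sum to 1
        weights = {c: 1.0 / (d + 1e-9) for c, d in distances.items()}
        return weights[best] / sum(weights.values())
    raise ValueError("unsupported output type: %s" % output_type)

d = {1: 0.5, 2: 2.0, 3: 4.0}
print(best_cluster_output(d, "clusterId"))   # 1
print(best_cluster_output(d, "distance"))    # 0.5
print(round(best_cluster_output(d, "probability"), 3))
```

The point is only that one decision (cluster 1 is closest) yields three different output values depending on the chosen output type.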

Advanced concepts

Direct mapping

The apply task has the functionality to directly copy some values from the input table to the output table. No transformations are made.

To specify the attributes to be mapped, ApplySourceItem objects must be added to the directMapping list. An ApplySourceItem contains the following attributes:

sourceName

the name of the attribute in the input data

destinationName

the name of the output attribute

Mapping the input to the model signature

Every mining model contains a signature with logical attributes used for mining purposes. In the apply task, the model is used to create an output row for each input row. It is necessary to specify how the input attributes are to be treated by the model.

Model assignment specifies how the input attributes are mapped to the model signature.

Default mapping

When no mapping is explicitly defined, the default mapping is used. It is a mechanism provided for the user's convenience. AdvancedMiner tries to map the input attributes to the signature attributes by matching their names.

This means that if the apply source data has the same signature (list of attribute names) as the training data set, no mapping needs to be specified to properly perform the applying process.

No mapping situation

It may happen that some attributes from the model signature have no corresponding (mapped explicitly or by default) input attribute. In such cases the attribute is not mapped. Therefore, every input value for this attribute will be set to a missing (NULL) value. The effects of such situation may depend on the specific algorithm. Some algorithms will handle the missing data, while others may interrupt with an exception.
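
Default mapping, explicit assignments and the no-mapping situation can be sketched together in plain Python. This is an illustration only, not the AdvancedMiner API; `map_row` and its dict-based assignment are hypothetical names standing in for the model assignment mechanism.

```python
# Illustrative sketch (plain Python, not the AdvancedMiner API) of how
# default mapping and the no-mapping situation behave.

def map_row(signature, input_row, assignment=None):
    """signature: list of model attribute names.
    input_row: dict of input attribute name -> value.
    assignment: optional dict of model attribute -> input attribute
    (playing the role of explicit SignatureAssignment entries)."""
    assignment = assignment or {}
    mapped = {}
    for attr in signature:
        source = assignment.get(attr, attr)   # default mapping: match by name
        mapped[attr] = input_row.get(source)  # unmapped -> missing (NULL) value
    return mapped

signature = ['age', 'region', 'assets']
row = {'age': 23, 'region_code': 'B', 'assets_2': 6500}

# default mapping only: 'region' and 'assets' have no matching input name,
# so their values are missing
print(map_row(signature, row))
# {'age': 23, 'region': None, 'assets': None}

# with explicit assignments, all signature attributes are filled
print(map_row(signature, row, {'region': 'region_code', 'assets': 'assets_2'}))
# {'age': 23, 'region': 'B', 'assets': 6500}
```

This mirrors Example 15.5 below, where the 'region_code' and 'assets_2' columns must be explicitly assigned to the 'region' and 'assets' signature attributes.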

Replacing existing output data

Normally, the apply task expects that the output table does not exist. In this case the table is created with a list of columns according to the user's specification.

When the output table already exists, the apply task executor tries to append the results to the existing table. Therefore the table is required to have the same signature as the output specification. If this condition is not met, an exception is signaled.

It is possible to turn off this mechanism and overwrite the existing output table in every case. To do this set the replaceExistingData property to TRUE.
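
The decision the executor makes about the output table can be summarised with a small plain-Python sketch (illustrative only; `prepare_output` is a hypothetical name, not an AdvancedMiner function):

```python
# Sketch (plain Python, not the AdvancedMiner API) of how the apply task
# treats the output table, as described above.

def prepare_output(existing_signature, wanted_signature, replace_existing_data):
    """existing_signature: column list of the existing table, or None if absent.
    wanted_signature: column list from the apply output specification."""
    if existing_signature is None:
        return 'create'                      # normal case: table is created
    if replace_existing_data:
        return 'overwrite'                   # replaceExistingData = TRUE
    if existing_signature == wanted_signature:
        return 'append'                      # signatures match: append rows
    raise RuntimeError('output table signature does not match the specification')

print(prepare_output(None, ['id', 'decision'], False))                 # create
print(prepare_output(['id', 'decision'], ['id', 'decision'], False))   # append
print(prepare_output(['other'], ['id', 'decision'], True))             # overwrite
```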

Minimal set-up

Here is the minimal set of steps required to set up a basic apply task:

  1. Create the apply task using the MiningApplyTask class

  2. Set the model by assigning the model's name (as it is stored in the Repository) to the model property.

  3. Set the input and output tables using physical data object names. Use the sourceData and targetData properties respectively.

  4. Create and set the ApplyOutput object. The possible types are:

    • ApproximationApplyOutput

    • ClusteringApplyOutput

    • ClassificationApplyOutput

    • SurvivalApplyOutput

    • TimeSeriesApplyOutput

    It is required to set at least one output item.

Applying for different mining functions

This section contains the information about applying which is specific to various mining functions in AdvancedMiner.

Classification

classes: ClassificationOutputType, ClassificationCategoryItem and ClassificationRankItem

Table 15.9.  Classification - Output items and output types combinations

output type        | output item type    | description
probability        | rank                | returns the probability of the n-th best category
probability        | category            | returns the probability of classifying as belonging to the given category
predictedCategory  | rank                | returns the n-th best category
predictedCategory  | category            | not supported
nodeId             | rank                | returns the id of the structure to which the input case was assigned. What the term "structure" means depends on the particular algorithm; some algorithms do not support this feature at all. Refer to the chapter describing the algorithm/module for more details.
nodeId             | category            | not supported
leverage, pearsonResidual, devianceResidual, dfbetas, c, cBar, deltaDev, deltaChiSq | logistic regression | returns the statistics specific to the logistic regression algorithm
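
The rank/category distinction in the table above can be illustrated with a toy plain-Python sketch (not the AdvancedMiner API; the probabilities are invented for illustration):

```python
# Toy illustration (plain Python, not the AdvancedMiner API) of rank
# items versus category items for a single classifier decision.

def rank_item(probs, output_type, rank):
    """probs: dict of category -> probability; rank 0 is the best category."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    category = ordered[rank]
    return category if output_type == 'predictedCategory' else probs[category]

def category_item(probs, category):
    """probability output type for a fixed target category."""
    return probs[category]

probs = {'yes': 0.7, 'no': 0.3}
print(rank_item(probs, 'predictedCategory', 0))  # yes
print(rank_item(probs, 'probability', 0))        # 0.7
print(category_item(probs, 'no'))                # 0.3
```

A rank item asks "what is the n-th best decision (or its probability)?", while a category item asks "what is the probability of this particular category?".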

Approximation

classes: ApproximationOutputType

Table 15.10. Approximation - Output type descriptions

output type     | output item type  | description
predictedValue  | aprox             | returns the predicted value computed by the approximator
confidence      | aprox             | returns the probability that the approximated value is true. How this value is computed depends on the particular algorithm. Refer to the chapter describing the algorithm/module for more details.
cookDistance, covRatio, dfbetas, dffits, leverage, press, rstudentResidual, studentResidual | linear regression | returns the statistics specific to the linear regression algorithm

Clustering

classes: ClusterIdItem and ClusteringRankItem, ClusteringOutputType

Table 15.11. Clustering - Output items and output types combinations

output type  | output item type  | description
probability  | rank              | returns the probability of the n-th best cluster
probability  | clusterId         | returns the probability that the current input case belongs to the selected cluster
clusterId    | rank              | returns the n-th best clusterId
clusterId    | clusterId         | returns the id of the cluster to which the input case was assigned

Survival

Creating an apply task for survival models does not require adding any particular apply output items to the item property. All the outputs are specified by setting the appropriate SurvivalApplyOutput properties. The settings are:

First Time Point

The lower bound of the time scale interval for which survival function values are calculated.

Last Time Point

The upper bound of the time scale interval for which survival function values are calculated.

Number Of Time Points

The number of time points between the first and last time points for which survival function will be calculated.

Prefix

The name prefix used to create output column names.

The output table will contain X output columns, named 'prefix_0' to 'prefix_X-1', where X is equal to numberOfTimePoints.
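
The column-naming rule can be sketched in plain Python (`survival_columns` is an illustrative helper, not part of the AdvancedMiner API):

```python
# Sketch of the survival output column naming rule described above:
# X columns 'prefix_0' .. 'prefix_X-1', where X = numberOfTimePoints.

def survival_columns(prefix, number_of_time_points):
    return ['%s_%d' % (prefix, i) for i in range(number_of_time_points)]

print(survival_columns('surv', 4))
# ['surv_0', 'surv_1', 'surv_2', 'surv_3']
```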

Survival, like all other methods, makes use of direct mapping.

TimeSeries

Creating an apply task for a time series model does not require adding any particular apply output items to the item property. All the outputs are specified by setting the appropriate TimeSeriesApplyOutput properties. The settings are:

Prefix

the name prefix used for naming the variables denoting the predicted variances at the successive time points

Number Of Time Points

the number of successive points in time used in prediction

The output table will contain following columns:

  • X output columns named 'prefix_0' to 'prefix_X-1', where X is equal to Number Of Time Points; these variables denote the predicted variances at the successive time points

  • regressed_mean - mean computed from the linear part of the model on the basis of explanatory variables

  • forecast_lowerbound/upperbound - the lower/upper bound of the confidence interval for the regressed mean and variance
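
The full layout can be sketched in plain Python. This is illustrative only: `timeseries_columns` is a hypothetical helper, and the exact bound column names ('forecast_lowerbound', 'forecast_upperbound') are an assumption based on the description above.

```python
# Sketch (plain Python, not the AdvancedMiner API) of the time series
# output table layout described above.

def timeseries_columns(prefix, number_of_time_points):
    cols = ['%s_%d' % (prefix, i) for i in range(number_of_time_points)]
    cols += ['regressed_mean', 'forecast_lowerbound', 'forecast_upperbound']
    return cols

print(timeseries_columns('var', 2))
# ['var_0', 'var_1', 'regressed_mean', 'forecast_lowerbound', 'forecast_upperbound']
```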

TimeSeries, like all other methods, makes use of direct mapping.

Examples

The following sections show some examples of applying models to data in AdvancedMiner.

Basic examples

The script below produces the model that will be used in the examples that follow. The data is artificial; it represents a hypothetical problem of segmenting customers, based on their attribute values, into an interesting and an uninteresting group.

Example 15.2. A sample model for application examples

# artificial training set
table 'customers_train':

    customer_id  age  region   assets   interesting

    1            18    'A'      1500      'no'
    2            34    'A'      200       'no'
    3            15    'A'      100       'no'
    4            24    'A'      20000     'yes'
    5            21    'A'      10000     'yes'
    6            54    'B'      30000     'yes'
    7            14    'B'      20000     'yes'
    8            13    'B'      1200      'yes'
    9            15    'B'      100       'yes'


pd = PhysicalData('customers_train')

fs = ClassificationFunctionSettings()
fs.logicalData = LogicalData(pd)
fs.algorithmSettings = TreeSettings()

# excluding the id attribute from model building
att = fs.attributeUsageSet.getAttribute('customer_id')
att.setUsage(UsageOption.inactive)
fs.targetAttributeName = 'interesting'

bt = MiningBuildTask( 'pd', 'fs', 'model' )

save('bt', bt)
save('fs', fs)
save('pd', pd)

execute('bt')

model = load('model')

for leaf in model.modelStatistics.allLeaves:

    # print some node statistics
    print "**Node %d, count: %g" % (leaf.nodeID, leaf.count)
    print 'RULE:'
    print leaf.nodeRule
    print 'Predicted value:'
    print leaf.getPredictedValue()
    print '---'

Output:

**Node 1, count: 4
RULE:
region IN ('B')
Predicted value:
yes
---
**Node 3, count: 3
RULE:
region IN ('A') AND 
assets < 5 750.000
Predicted value:
no
---
**Node 4, count: 2
RULE:
region IN ('A') AND 
assets >= 5 750.000
Predicted value:
yes
---
    

The next script prepares and runs an apply task which predicts the category for each observation.

Example 15.3.  Basic classification example. Assign decision to an input vector.

#  artificial training set, 
table 'customers_train':

    customer_id  age  region   assets   interesting

    1            18    'A'      1500      'no'
    2            34    'A'      200       'no'
    3            15    'A'      100       'no'
    4            24    'A'      20000     'yes'
    5            21    'A'      10000     'yes'
    6            54    'B'      30000     'yes'
    7            14    'B'      20000     'yes'
    8            13    'B'      1200      'yes'
    9            15    'B'      100       'yes'


pd = PhysicalData('customers_train')

fs = ClassificationFunctionSettings()
fs.logicalData = LogicalData(pd)
fs.algorithmSettings = TreeSettings()

#  excluding the id attribute from model building
att = fs.attributeUsageSet.getAttribute('customer_id')

att.setUsage(UsageOption.inactive)
fs.targetAttributeName = 'interesting'

bt = MiningBuildTask( 'pd', 'fs', 'model' )

save('bt', bt)
save('fs', fs)
save('pd', pd)

execute('bt')

#--------------------------------------------------
# prepare and run apply task which predicts 
# the category for each observation
#--------------------------------------------------

table 'new_customers':

    age  region   assets  

    14    'A'      400   
    23    'B'      6500  
    78    'A'      500   
    24    'B'      10000 
    15    'B'      600   
    14    'B'      200   


#  a classification mining model built on data similar 
#  to the one in the data_to_apply set
classifierName = 'model'
inputTableName = 'new_customers'
outputTableName = 'decisions'

save(inputTableName, PhysicalData(inputTableName) )
save(outputTableName, PhysicalData(outputTableName) )

task = MiningApplyTask()

task.setReplaceExistingData(TRUE)
task.applyOutput = ClassificationApplyOutput()

#  there is one output attribute for the classifier decision;
#  it will contain the target categories selected by the algorithm 
#  as the best (with first rank)

predictedItem = ClassificationRankItem('decision', ClassificationOutputType.\
predictedCategory, 0)

task.applyOutput.item.add( predictedItem )

task.modelName = classifierName
task.sourceDataName = inputTableName
task.targetDataName = outputTableName

save( 'apply', task )
execute( 'apply' )

#  print the results:
print 'decision'
print 20*'_'
trans None <- outputTableName:
    print '%s' % decision
    

Output:

decision
____________________
no
yes
no
yes
yes
yes
    

The following script shows how to use a survival model to assign the probability that an event will occur in the next point in time.

Example 15.4.  Basic survival example.

task=MiningApplyTask()
ao = SurvivalApplyOutput()
ao.setFirstTimePoint(0)
ao.setLastTimePoint(15)
ao.setNumberOfTimePoints(1)
task.applyOutput = ao
    

Advanced examples

The script below enhances the previous script by adding a customer id and using attribute mapping.

Example 15.5. Applying a model with attribute mapping

# artificial training set
table 'customers_train':

    customer_id  age  region   assets   interesting

    1            18    'A'      1500      'no'
    2            34    'A'      200       'no'
    3            15    'A'      100       'no'
    4            24    'A'      20000     'yes'
    5            21    'A'      10000     'yes'
    6            54    'B'      30000     'yes'
    7            14    'B'      20000     'yes'
    8            13    'B'      1200      'yes'
    9            15    'B'      100       'yes'

pd = PhysicalData('customers_train')

fs = ClassificationFunctionSettings()
fs.logicalData = LogicalData(pd)
fs.algorithmSettings = TreeSettings()

#  excluding id attribute from model building
att = fs.attributeUsageSet.getAttribute('customer_id')
att.setUsage(UsageOption.inactive)

fs.targetAttributeName = 'interesting'

bt = MiningBuildTask( 'pd', 'fs', 'model' )

save('bt', bt)
save('fs', fs)
save('pd', pd)

execute('bt')

#--------------------------------------------------
# prepare and run apply task which predicts
# the category for each observation
#--------------------------------------------------

table 'new_customers':

    customer_id  age  region_code   assets_2

    12            14    'A'      400
    18            23    'B'      6500
    11            78    'A'      500
    16            24    'B'      10000
    19            15    'B'      600
    17            14    'B'      200


# a classification mining model, built on data similar
# to the one in data_to_apply set
classifierName = 'model'
inputTableName = 'new_customers'
outputTableName = 'decisions_table'

save(inputTableName, PhysicalData(inputTableName) )
save(outputTableName, PhysicalData(outputTableName) )

task = MiningApplyTask()

# Output attribute 'id' will have values directly copied from
# input attribute 'customer_id', which will let the user
# identify corresponding rows in the input and output tables.
task.directMapping.add( ApplySourceItem('customer_id', 'id') )

# Two columns in the 'new_customers' table have different names
# than these in the model signature. If a mapping is not added,
# the input values for these attributes will be NULL.
task.modelAssignment.assignment.add( SignatureAssignment( 'region_code', 'region' ) )
task.modelAssignment.assignment.add( SignatureAssignment( 'assets_2', 'assets' ) )

task.applyOutput = ClassificationApplyOutput()
task.setReplaceExistingData(TRUE)

# There is one output attribute for the classifier decision.
# It will contain target categories selected by the algorithm
# as the best (with first rank)
predictedItem = ClassificationRankItem('decision', ClassificationOutputType.\
predictedCategory, 0)

task.applyOutput.item.add( predictedItem )
task.modelName = classifierName
task.sourceDataName = inputTableName
task.targetDataName = outputTableName

save( 'apply', task )
execute( 'apply' )

# print the results:
print 'id\tdecision'
print 20*'_'
trans None <- outputTableName:
    print '%d\t%s' % (id, decision)


Output:

id        decision
____________________
12        no
18        yes
11        no
16        yes
19        yes
17        yes
    

This script shows how to change the positive (good) classification threshold to 0.3.

Example 15.6. Changing the classification threshold

# This script is not complete. An appropriate table and task
# are required
ao = ClassificationApplyOutput()
ao.item.add( ClassificationRankItem( 'good_prob', ClassificationOutputType.\
probability, 0) )

trans 'decision' <- 'apply_output_table':
    drop out good_prob
    if good_prob > 0.3:
        target = 'good'
    else:
        target = 'bad'

In the script below, an example advice system suggests a second decision (i.e. the decision ranked second) when the probability of the first decision is too low (less than or equal to 0.7).

Example 15.7.  An advice system with secondary decision suggestion

# this script is not complete. An appropriate table and task
# are required
ao = ClassificationApplyOutput()
ao.item.add( ClassificationRankItem('first', ClassificationOutputType.\
predictedCategory, 0) )
ao.item.add( ClassificationRankItem('first_prob', ClassificationOutputType.\
probability, 0) )
ao.item.add( ClassificationRankItem('second', ClassificationOutputType.\
predictedCategory, 1) )

trans 'decision' <- 'apply_output_table':
    drop out first_prob
    if first_prob > 0.7:
        second = None