Predefined Transformations for Data Mining Models

Introduction

Transformations play an important role in data processing. One often encounters a data set whose format is improper or which contains invalid data, preventing it from being used or analyzed. When it is necessary to convert data from one format to another, a data transformation can be used. The following sections discuss the built-in transformations:

  • binarize
  • normalize
  • outliers
  • PCA
  • replace missing
  • standardize
  • weight of evidence

Every transformation is defined by a set of parameters. Some of them are common to all transformations while others depend on the type of the transformation. Parameters unique to particular transformations are described in the corresponding sections. The table below contains the descriptions of common parameters.

Table 4.1. Attribute Usage Set

Value | Description
auto | The column will be transformed according to the default settings (e.g. missing values are replaced with the mean/modal value for numerical/categorical variables).
copy | The column will be copied without any changes.
drop | The column will be omitted during the transformation.
transform | The column will be transformed according to user settings.

There are two ways of using transformations in the AdvancedMiner system. A transformation can be created directly with TransformationBuildTask, or indirectly as an automated part of the model building process (Automatic Data Transformation). In the first case the user chooses the type of the transformation, which can then be applied to a data set. In the latter case the system performs the transformations which are necessary for a given algorithm to work on the data set selected for modelling.

Transformation Types

Binarize

Binarization changes a nominal variable into n binary variables, where n is the number of possible values of the nominal variable. NULL values are copied without changes.
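For illustration, the sketch below shows what unipolar/bipolar binarization does to a single column (plain Python; the binarize helper is hypothetical and not part of the AdvancedMiner API):

def binarize(values, bipolar=False):
    # One binary output column per distinct non-NULL value of the input.
    # NULL (None) entries are copied unchanged into every output column.
    low = -1.0 if bipolar else 0.0          # bipolar: -1/1, unipolar: 0/1
    levels = sorted(set(v for v in values if v is not None))
    columns = {}
    for level in levels:
        columns[level] = [None if v is None else (1.0 if v == level else low)
                          for v in values]
    return columns

print(binarize(['a5', 'abcde5', 'a5']))
# {'a5': [1.0, 0.0, 1.0], 'abcde5': [0.0, 1.0, 0.0]}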

The Binarize Transformation has the following settings:

Table 4.2. Binarize Settings

Name | Description | Possible values | Default value
Default Binarized Values | For unipolar the binarized columns take the values 0 and 1, for bipolar -1 and 1. | unipolar / bipolar | unipolar
Max Values Count | If the number of different values of the attribute is greater than 'Max Values Count', the attribute will not be binarized. | integer values | 100
Liberal Mode | If TRUE, 'liberal' execution is preferred (do not stop on minor errors); FALSE otherwise. | TRUE / FALSE | TRUE

As in the case of the Attribute Usage Set, the user can determine additional transformation parameters for each variable.

Table 4.3. Binarize - Custom Attributes

Name | Description
Variable Name | Variable names
Redundant | Selection of the redundant value for the binarized variable:
  • randomRedundant - randomly select the value to omit
  • minRedundant - omit the least frequent value
  • maxRedundant - omit the most frequent value
  • noRedundant - keep all values
Binarized Values | The set of values for the binary variables:
  • unipolar: 0 and 1
  • bipolar: -1 and 1

Normalize

Normalization is used to transform the input table in such a way that after the transformation the values of a given column lie in the interval [Min Value, Max Value]. Normalization transforms the data according to the following formula:

$$x' = \mathrm{MinValue} + \frac{x - x_{\min}}{x_{\max} - x_{\min}}\,(\mathrm{MaxValue} - \mathrm{MinValue}),$$

where $x_{\min}$ and $x_{\max}$ denote the smallest and largest values in the column. If all values in the transformed column are identical, the formula above is undefined and all values in the output column are set to a single constant value. Null values are not changed.
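A minimal sketch of this formula in plain Python (illustrative only, not the AdvancedMiner API; mapping a constant column to min_value is just one possible convention):

def normalize(column, min_value=0.0, max_value=1.0):
    # Linearly rescale the non-NULL values into [min_value, max_value];
    # None entries are left unchanged.
    present = [x for x in column if x is not None]
    lo, hi = min(present), max(present)
    if lo == hi:                     # constant column: formula undefined
        return [None if x is None else min_value for x in column]
    scale = (max_value - min_value) / float(hi - lo)
    return [None if x is None else min_value + (x - lo) * scale
            for x in column]

print(normalize([2, None, 10]))      # [0.0, None, 1.0]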

The Normalize Transformation has the following settings:

Table 4.4. Normalize Settings

Name | Description | Possible values | Default value
Max Value | The right end of the target interval. | real number (greater than Min Value) | 1.0
Min Value | The left end of the target interval. | real number (less than Max Value) | 0.0
Liberal Mode | If TRUE, 'liberal' execution is preferred (do not stop on minor errors); FALSE otherwise. | TRUE / FALSE | TRUE

Outlier transformation (Winsorising)

In data mining processes and data analysis not all non-normality is due to undeclared or missing values. Another reason for non-normality is the presence of outliers - observations which are extreme relative to the rest of the sample. The 'rare event' syndrome is a typical cause of outliers. Outliers can also be defined as extreme observations that for some legitimate reason do not fit within the typical range of other data values. The Outlier transformation replaces outliers with appropriate values in order to improve normality.
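A minimal sketch of winsorising in plain Python (illustrative only; reading lowerPercent/upperPercent as nearest-rank percentile bounds is an assumption, and real implementations may interpolate):

import math

def winsorise(column, lower_percent=5, upper_percent=95):
    # Clip values below the lower and above the upper percentile bound;
    # None entries are left unchanged.
    values = sorted(x for x in column if x is not None)
    n = len(values)
    lo = values[int(math.floor(n * lower_percent / 100.0))]
    hi = values[int(math.ceil(n * upper_percent / 100.0)) - 1]
    return [None if x is None else min(max(x, lo), hi) for x in column]

print(winsorise([1, 2, 3, 4, 100], lower_percent=20, upper_percent=80))
# [2, 2, 3, 4, 4]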

The Outlier Transformation has the following settings:

Table 4.5. Outlier Settings

Name | Description | Possible values | Default value
lowerPercent | Percentage of observations from the lower range to be transformed. | positive integer values from the range (0-100) | 5
upperPercent | Percentage of observations from the upper range to be transformed. | positive integer values from the range (0-100) | 95
minValuesCount | Minimal number of different values required for the transformation to take effect. | integer values | 10
Liberal Mode | If TRUE, 'liberal' execution is preferred (do not stop on minor errors); FALSE otherwise. | TRUE / FALSE | TRUE

PCA (Principal Component Analysis)

Principal component analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a smaller number of uncorrelated variables called principal components. Principal component analysis was originally introduced by Karl Pearson in 1901 and later developed by several other authors (see Hotelling (1933)). Excellent statistical treatments of principal components can be found in Morrison (1976) and in Mardia, Kent, and Bibby (1979); a comprehensive description of PCA with further references can be found in Jolliffe (2002).

Principal component analysis can be used to summarize the data and detect linear relationships. Its objective is to reduce the dimensionality (the number of variables) of the dataset while retaining most of the original variability in the data. The first principal component accounts for as much of the variability in the data as possible, and each successive component accounts for as much of the remaining variability as possible. The PCA transformation performs principal component analysis on the selected dataset.

Principal component analysis is concerned with explaining the variance-covariance structure of a high-dimensional random vector through a few linear combinations of the original component variables. Consider a $p$-dimensional random vector $X = (X_1, \ldots, X_p)^T$. The principal components $Y_1, \ldots, Y_k$ (with $k$ not exceeding $p$) of $X$ are univariate random variables defined by the formulas

$$Y_i = a_i^T X = a_{i1} X_1 + a_{i2} X_2 + \cdots + a_{ip} X_p, \qquad i = 1, \ldots, k.$$

For a given $i$ the coefficient vector $a_i$ is chosen so that $\mathrm{Var}(Y_i)$ is the largest possible subject to the following conditions:

$$a_i^T a_i = 1, \qquad \mathrm{Cov}(Y_i, Y_j) = 0 \quad \text{for } j < i.$$

Thus the $i$-th principal component is the normalized linear combination that maximizes $\mathrm{Var}(a_i^T X)$ while being uncorrelated with all preceding components. There are exactly $p$ such linear combinations. However, the first few of them explain most of the variance in the original data, so instead of working with all the original variables one would typically first perform PCA and then use only the first two or three principal components in subsequent analysis. The Kaiser criterion can also be used to retain only the principal components with eigenvalues greater than one; it was proposed by Kaiser (1960) and is probably the most widely used criterion of this kind.
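The computation behind PCA can be sketched in a few lines of numpy (an illustrative sketch, not the AdvancedMiner API; the pca helper and its parameters are hypothetical, with standardize mirroring the Standardize setting below):

import numpy as np

def pca(data, n_components=None, standardize=False):
    # Project the rows of 'data' onto the leading principal components.
    X = np.asarray(data, dtype=float)
    X = X - X.mean(axis=0)                  # center each column
    if standardize:
        X = X / X.std(axis=0, ddof=1)       # optional standardization
    cov = np.cov(X, rowvar=False)           # covariance (or correlation) matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is None:
        # Kaiser criterion: keep eigenvalues greater than one
        # (most meaningful on the correlation matrix, i.e. standardize=True)
        n_components = max(1, int((eigvals > 1.0).sum()))
    return X.dot(eigvecs[:, :n_components]), eigvals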

The PCA Transformation has the following settings:

Table 4.6. PCA Settings

Name | Description | Possible values | Default value
Components Number | The number of principal components to compute when Criterium is set to number; ignored when Criterium is set to kaiser. | integer numbers | 1
Criterium | Determines how the principal components are selected: kaiser selects the number of components automatically, number lets the user set the number of components manually. | number / kaiser | number
Standardize | If TRUE, standardization is performed prior to the calculation of the principal components; FALSE otherwise. | TRUE / FALSE | FALSE
Liberal Mode | If TRUE, 'liberal' execution is preferred (do not stop on minor errors); FALSE otherwise. | TRUE / FALSE | TRUE

Replace Missing

This transformation replaces or removes null values in the data set. Some algorithms do not work with null values, and in such cases this transformation is especially useful.

Note

In the case of categorical and ordinal variables, missing values are by default replaced by the most frequent (modal) value.
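A minimal sketch of this default behaviour in plain Python (replace_missing is a hypothetical helper, not the AdvancedMiner API):

from collections import Counter

def replace_missing(column, categorical=False):
    # Fill None entries with the column mean (numerical) or with the
    # most frequent, i.e. modal, value (categorical/ordinal).
    present = [x for x in column if x is not None]
    if categorical:
        fill = Counter(present).most_common(1)[0][0]
    else:
        fill = sum(present) / float(len(present))
    return [fill if x is None else x for x in column]

print(replace_missing([1.0, None, 3.0]))             # [1.0, 2.0, 3.0]
print(replace_missing(['a', None, 'a', 'b'], True))  # ['a', 'a', 'a', 'b']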

The Replace Missing Transformation has the following settings:

Table 4.7. Replace Missing Settings

Name | Description | Possible values | Default value
Categorical Replace | The method used to replace missing values of categorical variables. | mean / median / distributionBased / modal / custom | modal
Cut Lower Percent | Percentage of observations from the lower range to be treated as one interval in the distribution-based replacement. | positive integer values from the range (0-100) | 0
Cut Upper Percent | Percentage of observations from the upper range to be treated as one interval in the distribution-based replacement. | positive integer values from the range (0-100) | 100
Numerical Replace | The method used to replace missing values of numerical variables. | mean / median / distributionBased / modal / custom | mean
Ordinal Replace | The method used to replace missing values of ordinal variables. | mean / median / distributionBased / modal / custom | modal
Seed | Random number generator seed used in the case of distribution-based replacement of missing values (distributionBased). | integer values | null
Liberal Mode | If TRUE, 'liberal' execution is preferred (do not stop on minor errors); FALSE otherwise. | TRUE / FALSE | TRUE

As in the case of the Attribute Usage Set, the user can set additional transformation parameters for each variable.

Table 4.8. Replace Missing - Attributes

Name | Description
Name | Variable names
Replace Type | The manner in which missing values will be replaced: mean / median / distributionBased / modal / custom.
Custom Value | User-defined value used to replace missing values (when the 'custom' replacement type is selected).

Standardize

Standardization means subtracting from each entry in a column (of a numerical type) the mean value of all entries in this column and then dividing the result by the standard deviation of that column. Null values remain untouched. A prerequisite for this transformation is that the standard deviation is not zero. Standardization transforms the data according to the formula:

$$x' = \frac{x - \bar{x}}{\sigma_x},$$

where $\bar{x}$ is the mean of the column and $\sigma_x$ is its standard deviation.

After standardization each column has its mean value equal to 0 and its standard deviation equal to 1.
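A minimal sketch of standardization in plain Python (standardize is a hypothetical helper; the text above does not state whether the population or the sample standard deviation is used, so the sketch assumes the sample version):

import math

def standardize(column):
    # Subtract the column mean and divide by the (sample) standard
    # deviation; None entries are left unchanged.
    present = [x for x in column if x is not None]
    mean = sum(present) / float(len(present))
    sd = math.sqrt(sum((x - mean) ** 2 for x in present) / (len(present) - 1))
    return [None if x is None else (x - mean) / sd for x in column]

print(standardize([2.0, None, 10.0]))   # [-0.7071..., None, 0.7071...]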

The Standardize transformation has the following settings:

Table 4.9. Standardize Settings

Name | Description | Possible values | Default value
Liberal Mode | If TRUE, 'liberal' execution is preferred (do not stop on minor errors); FALSE otherwise. | TRUE / FALSE | TRUE

Weight of evidence transformation

Weight of evidence measures the extent to which an observed feature in the data adds to or subtracts from the evidence for the target outcome. Weight of evidence can be calculated in the following way.

Every attribute $X_j$ (where $j = 1, \ldots, m$ and $m$ is the number of attributes) is divided by the algorithm into levels. Let $i$ denote the index of one of the levels of the attribute $X_j$.

Firstly, the following parameter is defined:

$$p_{ij}^{good} = \frac{n_{ij}^{good}}{n^{good}},$$

where $n_{ij}^{good}$ is the number of target values equal to 'good' in the $i$-th level of the attribute $X_j$ and $n^{good}$ is the number of target values equal to 'good' in the whole dataset. The coefficients $p_{ij}^{bad}$ for 'bad' are defined analogously.

Weight of Evidence is computed as:

$$\mathrm{WoE}_{ij} = \ln\left(\frac{p_{ij}^{good}}{p_{ij}^{bad}}\right).$$
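To make the formula concrete, consider the data of Example 4.41 below: the 'man' level of the attribute sex contains 2 of the 5 'good' cases and 1 of the 2 'bad' cases, so

$$\mathrm{WoE}_{man} = \ln\frac{2/5}{1/2} = \ln 0.8 \approx -0.2231,$$

which is exactly the value produced by the transformation in that example.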

The Weight of Evidence Transformation has the following settings:

Table 4.10. Weight of Evidence Settings

Name | Description | Possible values | Default value
Max Values Count | If the number of different values of an attribute is greater than 'Max Values Count', the attribute will not be transformed. | integer values | 100
Positive Target Value | The name of the positive target category. | any value |
Target Attribute | The name of the target attribute. | any value | null
Liberal Mode | If TRUE, 'liberal' execution is preferred (do not stop on minor errors); FALSE otherwise. | TRUE / FALSE | TRUE

Usage

Transformation Build Task

During the transformation building process, the following MR objects are used:

  • PhysicalData - an object describing the dataset used for building the transformation
  • TransformationSettings - an object containing the settings of the transformation
  • TransformationBuildTask - an object controlling the whole transformation building process

The transformation building task consists of the following steps:

  • step 1: creating a PhysicalData object for the dataset used for transformation building
  • step 2: choosing TransformationSettings of a particular type
    • step 2a: adding a LogicalData object corresponding to the created PhysicalData object
    • step 2b: setting the AttributeUsageSet
    • step 2c: setting other parameters
  • step 3: choosing the task: TransformationBuildTask
    • step 3a: adding PhysicalData
    • step 3b: adding TransformationSettings
    • step 3c: setting the new transformation name
  • step 4: executing the task

After execution of the Transformation Build Task a new transformation object appears in the Metadata Repository; it can be further used for data transformation in a TransformationApplyTask.

Transformation Apply Task

During the transformation applying process, the following MR objects are used:

  • PhysicalData - objects describing the source and target datasets of the transformation
  • TransformationApplyTask - an object controlling the whole transformation applying process
  • Transformation - the result of the transformation building process

The transformation applying task consists of the following steps:

  • step 1: creating a PhysicalData object for the data to which the transformation will be applied
  • step 2: creating a PhysicalData object for the output data of the transformation
  • step 3: choosing the task: TransformationApplyTask
    • step 3a: adding the source data - a PhysicalData object of the data to be transformed
    • step 3b: adding the target data - a PhysicalData object for the result of the transformation
    • step 3c: adding the transformation to be applied
    • step 3d: specifying the direct mapping - which variables from the input data are to be copied unchanged into the output set
  • step 4: executing the task

After execution of the Transformation Apply Task a new database table is created, which contains the transformed data. The resulting table name is inherited from the target PhysicalData name.

Automatic transformation

Most AdvancedMiner mining algorithms support the Automatic Data Transformation option, which is a shortcut method for transforming variables during a model building task. With this option data transformations are prepared on the fly using customizable transformation parameters. Customization is available through functionSettings | algorithmSettings | transformationSettings | internalTransformationsSettings (double-click the transformation name and edit its parameters in the Properties window). The transformations are executed automatically just before the model estimation step; however, the transformed data is not saved in the database.

Figure 4.1. Algorithm Settings: Transformation Settings


Figure 4.2. Algorithm Settings: Automatic Data Transformations Option


Examples

The scripts below show how to work with transformation tasks.

Example 4.40. A sample transformation script

table 'data_sample':
    a b c
    2           'a5'             -4
    10          'abcde5'         8

pd = PhysicalData('data_sample')
save('pd', pd)
ld = LogicalData(pd)
save('ld', ld)

# Create the TransformationSettings object
transformationSettings = BinarizeSettings() # or StandardizeSettings, NormalizeSettings, etc.

# Set transformation parameters
transformationSettings.setLogicalData(ld)
transformationSettings.getAttributeUsageSet().getAttribute('a').setUsage\
(TransformationUsageOption.drop)
transformationSettings.getAttributeUsageSet().getAttribute('b').setUsage\
(TransformationUsageOption.copy)
transformationSettings.getAttributeUsageSet().getAttribute('c').setUsage\
(TransformationUsageOption.transform)
transformationSettings.allNoRedundant()

# and save it
save('transformation_settings', transformationSettings)

# Create TransformationBuildTask
bt = TransformationBuildTask()

# Set the TransformationBuildTask parameters
bt.transformationName='my_transformation'
bt.physicalDataName='pd'
bt.transformationSettingsName='transformation_settings'

# and save the TransformationBuildTask
save('transformation_bt', bt)

# and execute the build task
execute('transformation_bt')

# Create TransformationApplyTask
tat=TransformationApplyTask()

# if the table exists it will be replaced
tat.replaceExistingData = TRUE

# Add the transformation
tat.setTransformationName('my_transformation')

# set Source 
tat.setSourceDataName('pd')

# Create PhysicalData for the TransformationApplyTask Target
save('pd_out',PhysicalData('output_table_name'))    

# set the Target
tat.setTargetDataName('pd_out')

# Save TransformationApplyTask
save('transformation_at', tat)

# Run the task
execute('transformation_at')
print 'See the result in the output_table_name table...'
data = tableRead('output_table_name')

# Print the output table
for i in range(len(data)):
    for j in range(len(data[i])):
        print data[i][j],
    print
    

Output:

See the result in the output_table_name table...
b __c___4_0 __c__8_0
a5 1.0 0.0
abcde5 0.0 1.0
    

Example 4.41. Weight of evidence transformation

table 'data_sample':
    id   sex    target
    1    'woman' 'good'
    2    'man'   'bad'
    3    'man'   'good'
    4    'woman' 'good'
    5    'man'   'good'
    6    'woman' 'good'
    7    'woman' 'bad'
    
pd = PhysicalData('data_sample')
save('pd', pd)
ld = LogicalData(pd)
save('ld', ld)

transformationSettings = WeightOfEvidenceSettings()
transformationSettings.setTarget('target')
transformationSettings.setPositiveTargetValue('good')
                                            
transformationSettings.setLogicalData(ld)
transformationSettings.getAttributeUsageSet().getAttribute('id').setUsage(TransformationUsageOption.copy)
transformationSettings.getAttributeUsageSet().getAttribute('target').setUsage(TransformationUsageOption.copy)
transformationSettings.getAttributeUsageSet().getAttribute('sex').setUsage(TransformationUsageOption.transform)

save('transformation_settings', transformationSettings)

bt = TransformationBuildTask()
bt.transformationName='my_transformation'
bt.physicalDataName='pd'
bt.transformationSettingsName='transformation_settings'
save('transformation_bt', bt)
execute('transformation_bt')

tat=TransformationApplyTask()
tat.replaceExistingData=TRUE
tat.setTransformationName('my_transformation')
tat.setSourceDataName('pd')
save('pd_out',PhysicalData('output_table_name'))
tat.setTargetDataName('pd_out')
save('transformation_at', tat)
execute('transformation_at')


print 'Transformed table:'
sql res:
    select * from output_table_name

print res
    
            
# Example - calculating the weight of evidence manually

good_target = 0.0
bad_target = 0.0

good_target_in_man = 0.0
bad_target_in_man = 0.0

good_target_in_woman = 0.0
bad_target_in_woman = 0.0

# iterate over the rows of 'data_sample', counting 'good'/'bad' targets per sex
trans None <- 'data_sample':
    if target == 'good':
        $good_target+=1
        if sex == 'man':
            $good_target_in_man+=1
        elif sex == 'woman':
            $good_target_in_woman+=1
    elif target == 'bad':     
        $bad_target+=1
        if sex == 'man':
            $bad_target_in_man+=1
        elif sex == 'woman':
            $bad_target_in_woman+=1
            
import math   

man_woe = math.log( (good_target_in_man/good_target)/(bad_target_in_man/bad_target) )
woman_woe = math.log( (good_target_in_woman/good_target)/(bad_target_in_woman/bad_target) )
print 'Weight of evidence for man=%2.4f'%man_woe
print 'Weight of evidence for woman=%2.4f'%woman_woe

Output:

Transformed table:
id | target | sex                 | 
+--+--------+---------------------+--
 1 | good   |  0.1823215567939546 | 
 2 | bad    | -0.2231435513142097 | 
 3 | good   | -0.2231435513142097 | 
 4 | good   |  0.1823215567939546 | 
 5 | good   | -0.2231435513142097 | 
 6 | good   |  0.1823215567939546 | 
 7 | bad    |  0.1823215567939546 | 

Weight of evidence for man=-0.2231
Weight of evidence for woman=0.1823

Important notes

The LogicalData object owned by a transformation settings object should not be changed manually; only automatically generated LogicalData can be used in transformations.