Transformations play an important role in data processing. One often encounters a data set whose format is improper or which contains invalid data, preventing it from being used or analyzed. When it is necessary to convert data from one format to another, a data transformation can be used. The following sections discuss the built-in transformations:
Every transformation is defined by a set of parameters. Some of them are common to all transformations while others depend on the type of the transformation. Parameters unique to particular transformations are described in the corresponding sections. The table below contains the descriptions of common parameters.
Table 4.1. Attribute Usage Set
Value | Description |
---|---|
auto | The column will be transformed according to the default settings (e.g. replace missing values with the mean/modal value for numerical/categorical variables). |
copy | The column will be copied without any changes. |
drop | The column will be omitted during transformation. |
transform | The column will be transformed according to user settings. |
There are two ways of using transformations in the Advanced Miner system. Transformation can be created directly with TransformationBuildTask or indirectly as an automated part of the model building process ( Automatic Data Transformation). In the first case the user chooses the type of the transformation, which can then be applied to a data set. In the latter case the system performs the transformations which are necessary for a given algorithm to work on the data set selected for modelling.
Binarization changes nominal variable into n binary variables, where n is the number of possible values of a nominal variable. NULL values are copied without changes.
The Binarize Transformation has the following settings:
Table 4.2. Binarize Settings
Name | Description | Possible values | Default value |
---|---|---|---|
Default Binarized Values | With unipolar the binarized columns take the values 0 and 1; with bipolar, -1 and 1. | unipolar / bipolar | unipolar |
Max Values Count | If the number of different values of the attribute is greater than 'Max Values Count', the attribute will not be binarized. | Integer values | 100 |
Liberal Mode | If TRUE 'liberal' execution is preferred (do not stop on minor errors), FALSE otherwise. | TRUE / FALSE | TRUE |
As in the case of the Attribute Usage Set, the user can determine additional transformation parameters for each variable.
Table 4.3. Binarize - Custom Attributes
Name | Description |
---|---|
Variable Name | Variable names |
Redundant | Value selection for the redundant binarized variable. |
Binarized Values | Select the set of values for the binary variables. |
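The binarization rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the AdvancedMiner API; the `binarize` helper and its signature are hypothetical.

```python
# Sketch of binarization: each distinct value of a nominal column becomes
# one binary output column; None plays the role of NULL and is copied
# through unchanged. 'unipolar' encodes as 0/1, 'bipolar' as -1/1.
def binarize(column, polarity='unipolar'):
    low = 0.0 if polarity == 'unipolar' else -1.0
    values = sorted({v for v in column if v is not None})
    binarized = {}
    for v in values:
        binarized[v] = [None if x is None else (1.0 if x == v else low)
                        for x in column]
    return binarized

cols = binarize(['a5', 'abcde5', None])
# cols['a5']     -> [1.0, 0.0, None]
# cols['abcde5'] -> [0.0, 1.0, None]
```

A 'Max Values Count' style guard would amount to skipping the column entirely when `len(values)` exceeds the limit.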
Normalization transforms the input table in such a way that, after the transformation, the values in a given column lie in the range <MinValue, MaxValue>. Normalization transforms the data according to the following formula:

x' = MinValue + (x - min(x)) / (max(x) - min(x)) * (MaxValue - MinValue)

If all values in the transformed column are identical, all values in the output column are set to a single constant value. Null values are not changed.
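The rescaling described above can be sketched in plain Python (illustration only, not the AdvancedMiner API; mapping an identical-valued column to Min Value is an assumption made here).

```python
# Sketch of min-max normalization into <min_value, max_value>; None plays
# the role of NULL and is passed through unchanged.
def normalize(column, min_value=0.0, max_value=1.0):
    present = [x for x in column if x is not None]
    lo, hi = min(present), max(present)
    if lo == hi:
        # All values identical: which constant to emit is an assumption
        # (Min Value is used here).
        return [None if x is None else min_value for x in column]
    scale = (max_value - min_value) / (hi - lo)
    return [None if x is None else min_value + (x - lo) * scale
            for x in column]

normalize([2.0, 6.0, None, 10.0])  # -> [0.0, 0.5, None, 1.0]
```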
The Normalize Transformation has the following settings:
Table 4.4. Normalize Settings
Name | Description | Possible values | Default value |
---|---|---|---|
Max Value | The right end of the target interval. | real number (greater than Min Value) | 1.0 |
Min Value | The left end of the target interval. | real number (less than Max Value) | 0.0 |
Liberal Mode | If TRUE 'liberal' execution is preferred (do not stop on minor errors), FALSE otherwise. | TRUE / FALSE | TRUE |
In data mining and data analysis, not all non-normalities are due to undeclared or missing values. Another cause of non-normality is the presence of outliers, i.e. observations which are extreme relative to the rest of the sample. The 'rare' event syndrome is a typical cause of outliers. Outliers can also be defined as extreme observations that for some legitimate reason do not fit within the typical range of other data values. The Outlier transformation replaces outliers with appropriate values in order to improve normality.
The Outlier Transformation has the following settings:
Table 4.5. Outlier Settings
Name | Description | Possible values | Default value |
---|---|---|---|
lowerPercent | Percentage value of observations from the lower range to be transformed. | positive integer values from range (0-100) | 5 |
upperPercent | Percentage value of observations from the upper range to be transformed. | positive integer values from range (0-100) | 95 |
minValuesCount | Minimum number of different values for the transformation to take effect. | integer values | 10 |
Liberal Mode | If TRUE 'liberal' execution is preferred (do not stop on minor errors), FALSE otherwise. | TRUE / FALSE | TRUE |
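One common way to realize such a replacement is percentile clipping (winsorizing): everything beyond the lower/upper percentile bounds is pulled back to those bounds. The sketch below is plain Python, not the AdvancedMiner API, and the exact percentile convention is an assumption.

```python
# Sketch of percentile-based outlier replacement: values below the
# lowerPercent percentile or above the upperPercent percentile are
# clipped to those percentile bounds (null-free numeric input assumed).
def winsorize(column, lower_percent=5, upper_percent=95):
    ordered = sorted(column)
    n = len(ordered)
    lo = ordered[min(n - 1, n * lower_percent // 100)]
    hi = ordered[min(n - 1, n * upper_percent // 100)]
    return [min(max(x, lo), hi) for x in column]

winsorize(list(range(1, 101)))  # clips 1..5 up to 6 and 97..100 down to 96
```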
Principal component analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a smaller number of uncorrelated variables called principal components. Principal component analysis was originally introduced by Karl Pearson in 1901 and later developed by several other authors (see Hotelling (1933)). Excellent statistical treatments of principal components can be found in Morrison (1976) and Mardia, Kent, and Bibby (1979). A comprehensive PCA description and references can be found in Jolliffe (2002). Principal component analysis can be used to summarize the data and detect linear relationships. The objective of principal component analysis is to reduce the dimensionality (the number of variables) of the dataset while retaining most of the original variability in the data. The first principal component accounts for as much of the variability in the data as possible, and each successive component accounts for as much of the remaining variability as possible. The PCA transformation performs Principal Component Analysis on the selected dataset. Principal component analysis is concerned with explaining the variance-covariance structure of a high-dimensional random vector through a few linear combinations of the original component variables. Consider a p-dimensional random vector X = (X_1, ..., X_p)^T. The principal components Y_1, ..., Y_k (with k not exceeding p) of X are univariate random variables defined by the formulas

Y_i = a_i^T X = a_i1 X_1 + a_i2 X_2 + ... + a_ip X_p
For a given i the coefficient vector a_i is chosen so that Y_i = a_i^T X has the largest variance subject to the following conditions:

a_i^T a_i = 1 and Cov(a_i^T X, a_j^T X) = 0 for all j < i,

i.e. the i-th principal component is the normalized linear combination, uncorrelated with the preceding components, that maximizes Var(a_i^T X). There are exactly p such linear combinations. However, the first few of them explain most of the variance in the original data, so instead of working with all the original variables one would typically first perform PCA and then use only the first two or three principal components in subsequent analysis. The Kaiser criterion can also be used, retaining only the principal components with eigenvalues greater than one. It was proposed by Kaiser (1960) and is probably the most widely used criterion of this kind.
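An illustrative implementation of this procedure, including the Kaiser criterion, can be written with NumPy. This is a sketch under the assumption that NumPy is available; it is not the AdvancedMiner implementation.

```python
import numpy as np

# Sketch of PCA via eigen-decomposition of the covariance matrix.
# With n_components=None the Kaiser criterion is applied: keep only
# components whose eigenvalue exceeds 1 (most meaningful on
# standardized data, where the eigenvalues sum to roughly p).
def pca(data, n_components=None, standardize=True):
    X = np.asarray(data, dtype=float)
    X = X - X.mean(axis=0)
    if standardize:
        X = X / X.std(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is None:                 # Kaiser criterion
        n_components = max(1, int(np.sum(eigvals > 1.0)))
    return X @ eigvecs[:, :n_components], eigvals

# Two perfectly correlated columns collapse to a single component:
scores, eigvals = pca([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
# scores.shape -> (4, 1)
```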
The PCA Transformation has the following settings:
Table 4.6. PCA Settings
Name | Description | Possible values | Default value |
---|---|---|---|
Components Number | When Criterium is set to kaiser this parameter is ignored. When Criterium is set to number it gives the number of principal components to retain. | integer numbers | 1 |
Criterium | Determines how the principal components are selected: kaiser selects the number of components automatically (Kaiser criterion), while number lets the user set the number of components manually. | number / kaiser | number |
Standardize | When set to TRUE, standardization is performed prior to the calculation of principal components, FALSE otherwise. | TRUE / FALSE | FALSE |
Liberal Mode | If TRUE 'liberal' execution is preferred (do not stop on minor errors), FALSE otherwise. | TRUE / FALSE | TRUE |
This transformation replaces or removes null values in the data set. Some algorithms do not work with null values, and in such cases this transformation is especially useful.
The Replace Missing Transformation has the following settings:
Table 4.7. Replace Missing Settings
Name | Description | Possible values | Default value |
---|---|---|---|
Categorical Replace | The value which will replace missing values of categorical variables. | mean/median/distributionBased/modal/custom | modal |
Cut Lower Percent | Percentage value of observations from the lower range to be treated as one interval in a distribution based process. | Positive integer values from range (0-100) | 0 |
Cut Upper Percent | Percentage value of observations from the upper range to be treated as one interval in a distribution based process. | Positive integer values from range (0-100) | 100 |
Numerical Replace | The value which will replace missing values of numerical variables. | mean/median/distributionBased/modal/custom | mean |
Ordinal Replace | The value which will replace missing values of ordinal variables. | mean/median/distributionBased/modal/custom | modal |
Seed | Random number generator seed used in the case of distribution based replacement of missing values (distributionBased). | integer values | null |
Liberal Mode | If TRUE 'liberal' execution is preferred (do not stop on minor errors), FALSE otherwise. | TRUE / FALSE | TRUE |
As in the case of the Attribute Usage Set, the user can set additional transformation parameters for each variable.
Table 4.8. Replace Missing - Attributes
Name | Description |
---|---|
Name | Variable names |
Replace Type | The manner in which missing values will be replaced: mean/median/distributionBased/modal/custom. |
Custom Value | User selected custom value which will be used to replace missing values (in the case of selecting the 'custom' replacement type). |
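The two default strategies (mean for numerical variables, modal for categorical ones) can be sketched in plain Python (illustration only, not the AdvancedMiner API):

```python
from collections import Counter

# Sketch of missing-value replacement; None plays the role of NULL.
# 'mean' fills with the average of the present values, 'modal' with
# the most frequent one.
def replace_missing(column, strategy='mean'):
    present = [x for x in column if x is not None]
    if strategy == 'mean':
        fill = sum(present) / len(present)
    elif strategy == 'modal':
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError('unsupported strategy: ' + strategy)
    return [fill if x is None else x for x in column]

replace_missing([1.0, None, 3.0])                # -> [1.0, 2.0, 3.0]
replace_missing(['a', 'b', None, 'a'], 'modal')  # -> ['a', 'b', 'a', 'a']
```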
Standardization means subtracting from each entry in a column (of a numerical type) the mean value of all entries in this column and then dividing the result by the standard deviation of that column. Null values remain untouched. A prerequisite for this transformation is a non-zero standard deviation. Standardization transforms the data according to the formula:

x' = (x - m) / s

where m is the mean and s is the standard deviation of the values in the column.
After standardization each column has its mean value equal to 0 and its standard deviation equal to 1.
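A plain-Python sketch of the standardization described above (not the AdvancedMiner API; the population standard deviation is assumed here):

```python
# Sketch of z-score standardization; None plays the role of NULL and is
# left untouched. A zero standard deviation is rejected, matching the
# prerequisite stated above.
def standardize(column):
    present = [x for x in column if x is not None]
    m = sum(present) / len(present)
    var = sum((x - m) ** 2 for x in present) / len(present)
    if var == 0.0:
        raise ValueError('standard deviation is zero')
    s = var ** 0.5
    return [None if x is None else (x - m) / s for x in column]

standardize([2.0, 4.0, None, 6.0])  # -> [-1.2247..., 0.0, None, 1.2247...]
```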
The Standardize transformation has the following settings:
Table 4.9. Standardize Settings
Name | Description | Possible values | Default value |
---|---|---|---|
Liberal Mode | If TRUE 'liberal' execution is preferred (do not stop on minor errors), FALSE otherwise. | TRUE / FALSE | TRUE |
Weight of evidence measures to what extent an observed feature in the data adds to or subtracts from the evidence for the target outcome. Weight of evidence can be calculated in the following way:

Every attribute X_j (where j = 1, ..., m and m is the number of attributes) is divided by the algorithm into levels. Let i denote the index of one of the levels of the attribute X_j.

Firstly, the following parameter is defined:

G_i = g_i / g

where g_i is the number of target values equal to 'good' in the i-th level of the attribute and g is the number of target values equal to 'good' in the whole dataset. The coefficients B_i for 'bad' are defined analogously.

Weight of Evidence is computed as:

WoE_i = ln(G_i / B_i)
The Weight of Evidence Transformation has the following settings:
Table 4.10. Weight of Evidence Settings
Name | Description | Possible values | Default value |
---|---|---|---|
Max Values Count | If the number of different values of an attribute is greater than 'Max Values Count', the attribute will not be transformed. | Integer values | 100 |
Positive Target Value | The name of the positive target category. | any value | |
Target Attribute | The name of the target attribute. | any value | null |
Liberal Mode | If TRUE 'liberal' execution is preferred (do not stop on minor errors), FALSE otherwise. | TRUE / FALSE | TRUE |
During the transformation building process, the following MR objects are used:
The transformation building task consists of the following steps:
After execution of the Transformation Build Task a new transformation object appears in the Metadata Repository, which can be further used for data transformation in TransformationApplyTask.
During the transformation applying process, the following MR objects are used:
The transformation applying task consists of the following steps:
After execution of the Transformation Apply Task a new database table is created, which contains the transformed data. The resulting table name is inherited from the target PhysicalData name.
Most AdvancedMiner mining algorithms support the Automatic Data Transformation option, which is a shortcut method for transforming variables during a model building task. With this option data transformations are prepared on the fly using customizable transformation parameters. Customization is available through functionSettings | algorithmSettings | transformationSettings | internalTransformationsSettings (by double-clicking on the transformation name and in the Properties window). The transformations are executed automatically just before the model estimation step; however, the transformed data is not saved in the database.
The scripts below show how to work with transformation tasks.
Example 4.40. A sample transformation script
table 'data_sample':

a   b         c
2   'a5'     -4
10  'abcde5'  8

pd = PhysicalData('data_sample')
save('pd', pd)
ld = LogicalData(pd)
save('ld', ld)

# Create the TransformationSettings object
transformationSettings = BinarizeSettings() # or StandardizeSettings, NormalizeSettings etc.
# Set transformation parameters
transformationSettings.setLogicalData(ld)
transformationSettings.getAttributeUsageSet().getAttribute('a').setUsage\
(TransformationUsageOption.drop)
transformationSettings.getAttributeUsageSet().getAttribute('b').setUsage\
(TransformationUsageOption.copy)
transformationSettings.getAttributeUsageSet().getAttribute('c').setUsage\
(TransformationUsageOption.transform)
transformationSettings.allNoRedundant()
# and save it
save('transformation_settings', transformationSettings)

# Create TransformationBuildTask
bt = TransformationBuildTask()
# Set the TransformationBuildTask parameters
bt.transformationName='my_transformation'
bt.physicalDataName='pd'
bt.transformationSettingsName='transformation_settings'
# and save the TransformationBuildTask
save('transformation_bt', bt)
# and execute the build task
execute('transformation_bt')

# Create TransformationApplyTask
tat=TransformationApplyTask()
# if the table exists it will be replaced
tat.replaceExistingData = TRUE
# Add the transformation
tat.setTransformationName('my_transformation')
# set Source
tat.setSourceDataName('pd')
# Create PhysicalData for the TransformationApplyTask Target
save('pd_out', PhysicalData('output_table_name'))
# set the Target
tat.setTargetDataName('pd_out')
# Save TransformationApplyTask
save('transformation_at', tat)
# Run the task
execute('transformation_at')

print 'See the result in binarize_out table...'
data = tableRead('output_table_name')
# Print the output table
for i in range(len(data)):
    for j in range(len(data[i])):
        print data[i][j],
    print
Output:
See the result in binarize_out table...
b       __c___4_0   __c__8_0
a5      1.0         0.0
abcde5  0.0         1.0
Example 4.41. Weight of evidence transformation
table 'data_sample':

id  sex      target
1   'woman'  'good'
2   'man'    'bad'
3   'man'    'good'
4   'woman'  'good'
5   'man'    'good'
6   'woman'  'good'
7   'woman'  'bad'

pd = PhysicalData('data_sample')
save('pd', pd)
ld = LogicalData(pd)
save('ld', ld)

transformationSettings = WeightOfEvidenceSettings()
transformationSettings.setTarget('target')
transformationSettings.setPositiveTargetValue('good')
transformationSettings.setLogicalData(ld)
transformationSettings.getAttributeUsageSet().getAttribute('id').setUsage(TransformationUsageOption.copy)
transformationSettings.getAttributeUsageSet().getAttribute('target').setUsage(TransformationUsageOption.copy)
transformationSettings.getAttributeUsageSet().getAttribute('sex').setUsage(TransformationUsageOption.transform)
save('transformation_settings', transformationSettings)

bt = TransformationBuildTask()
bt.transformationName='my_transformation'
bt.physicalDataName='pd'
bt.transformationSettingsName='transformation_settings'
save('transformation_bt', bt)
execute('transformation_bt')

tat=TransformationApplyTask()
tat.replaceExistingData=TRUE
tat.setTransformationName('my_transformation')
source = DataObjectSet('data_sample')
tat.setSourceDataName('pd')
save('pd_out', PhysicalData('output_table_name'))
tat.setTargetDataName('pd_out')
save('transformation_at', tat)
execute('transformation_at')

print 'Transformed table:'
sql res: select * from output_table_name
print res

# Example - calculating weight of evidence
good_target = 0.0
bad_target = 0.0
good_target_in_man = 0.0
bad_target_in_man = 0.0
good_target_in_woman = 0.0
bad_target_in_woman = 0.0
trans None <- 'data_sample':
    if target == 'good':
        $good_target+=1
        if sex == 'man':
            $good_target_in_man+=1
        elif sex == 'woman':
            $good_target_in_woman+=1
    elif target == 'bad':
        $bad_target+=1
        if sex == 'man':
            $bad_target_in_man+=1
        elif sex == 'woman':
            $bad_target_in_woman+=1

import math
man_woe = math.log( (good_target_in_man/good_target)/(bad_target_in_man/bad_target) )
woman_woe = math.log( (good_target_in_woman/good_target)/(bad_target_in_woman/bad_target) )
print 'Weight of evidence for man=%2.4f'%man_woe
print 'Weight of evidence for woman=%2.4f'%woman_woe
Output:
Transformed table:

id | target | sex                 |
---+--------+---------------------+
1  | good   | 0.1823215567939546  |
2  | bad    | -0.2231435513142097 |
3  | good   | -0.2231435513142097 |
4  | good   | 0.1823215567939546  |
5  | good   | -0.2231435513142097 |
6  | good   | 0.1823215567939546  |
7  | bad    | 0.1823215567939546  |

Weight of evidence for man=-0.2231
Weight of evidence for woman=0.1823