Preparing a dataset for modeling (data processing example)


AdvancedMiner provides a scripting language called Gython, which offers a range of tools for data processing. Sampling and splitting a table are two of the dedicated functions available in Gython scripts. The example below shows how both procedures are used and what they produce, and also illustrates how to define a function in Gython.

Example 1.1. Data processing examples

if not tableExists('german_credit'):
    raise "Table 'german_credit' does not exist. Please run the german_credit.py script from the data directory first."
 
 
# FUNCTION DEFINITION Example in Gython
# includes illustration of SQL parametrization - "tableName" parameter

def rowCount(tableName):
    sql result:
        SELECT count(*) FROM $tableName
    return result[0][0]
    
    
   
# SPLITTING DATA Example
# splitting data into data_1 and data_2 sets

in_data = 'german_credit'
split_data_1 = 'german_credit_1'
split_data_2 = 'german_credit_2'
print "Rows count for whole dataset: ", rowCount(in_data)

tableSplit(in_data, split=[7, 3], seed=1234, output=[split_data_1, split_data_2])
print "Table after split 1: ", rowCount(split_data_1)
print "Table after split 2: ", rowCount(split_data_2)



# SAMPLING DATA Example

in_data = 'german_credit'
sample_size = 100
sampled_data = 'german_credit_sample'
print "Rows count for whole dataset: ", rowCount(in_data)

sample(in_data, sampled_data, sample_size)
print "Sampled data size: ", rowCount(sampled_data)

Output:

Rows count for whole dataset:  1000
Table after split 1:  709
Table after split 2:  291
Rows count for whole dataset:  1000
Sampled data size:  100

Figure 1.1. Tables after sampling and splitting
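
In the output above, the two split tables (709 and 291 rows) only approximate the requested 7:3 proportion, while the sample contains exactly 100 rows. One behaviour consistent with these numbers, offered here as an assumption and not as a description of how tableSplit or sample are implemented, is that a split assigns each row to a part independently at random, whereas a sample draws a fixed number of rows. The plain Python sketch below illustrates that difference; split_rows and sample_rows are hypothetical helper names and are not part of Gython.

```python
import random


def split_rows(rows, weights, seed):
    """Assign each row independently to one part, with probability
    proportional to the given weights (assumed behaviour, see lead-in)."""
    rng = random.Random(seed)
    part_indices = list(range(len(weights)))
    parts = [[] for _ in weights]
    for row in rows:
        k = rng.choices(part_indices, weights=weights)[0]
        parts[k].append(row)
    return parts


def sample_rows(rows, size, seed):
    """Draw exactly `size` distinct rows without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, size)


rows = list(range(1000))  # stand-in for the 1000 rows of german_credit

part_1, part_2 = split_rows(rows, [7, 3], seed=1234)
sampled = sample_rows(rows, 100, seed=1234)

# Split sizes vary around 700 and 300; the sample size is always exact.
print("Split sizes:", len(part_1), len(part_2))
print("Sample size:", len(sampled))
```

Under this assumption, a fixed seed makes the split reproducible from run to run, but the part sizes match the requested proportion only approximately.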

Note

Script development is supported by an advanced editor with code completion, on-the-fly error localization, syntax highlighting, etc.
Gython provides access to:
- a wide range of language expressions and statements (conditional instructions, loops, functions, etc.) and libraries dedicated to analytical projects (see the Gython chapter in the Technical Documentation for further reference),
- a data transformation language for working with tables, e.g. create, format, join (see the Data Access and Data Processing chapter in the Technical Documentation),
- SQL code, which the user can parametrize, execute in loops, include in functions, etc.
Besides the approaches presented here, there are other data processing options that are not covered in this tutorial, for example Transformations (e.g. binarize, normalize, outliers, PCA, replace missings, standardize, WoE). For more information on using Transformations, see the Transformations chapter in the Technical Documentation.
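
The last point in the list above (SQL parametrized by a table name, wrapped in a function and executed in a loop) is the pattern used by rowCount in Example 1.1. For readers without access to AdvancedMiner, the following is a rough analogue in standard Python with the sqlite3 module. It is not Gython code: the row_count helper, the in-memory database and the table names credit_1 and credit_2 are assumptions made for illustration only.

```python
import re
import sqlite3


def row_count(conn, table_name):
    """Return the number of rows in the named table.

    Table names cannot be bound as SQL parameters in sqlite3, so the
    name is validated before being substituted into the statement.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError("Invalid table name: %r" % table_name)
    cursor = conn.execute("SELECT count(*) FROM " + table_name)
    return cursor.fetchone()[0]


# Hypothetical in-memory tables standing in for the split outputs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_1 (id INTEGER)")
conn.execute("CREATE TABLE credit_2 (id INTEGER)")
conn.executemany("INSERT INTO credit_1 VALUES (?)", [(i,) for i in range(7)])
conn.executemany("INSERT INTO credit_2 VALUES (?)", [(i,) for i in range(3)])

# Run the same parametrized query in a loop over several tables.
counts = {}
for name in ["credit_1", "credit_2"]:
    counts[name] = row_count(conn, name)
    print(name, counts[name])
```

The validation step is a design choice specific to this sketch; how Gython handles the $tableName substitution internally is not described in this tutorial.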
