Method description

The structure of Classification Trees

A tree is a collection of connected nodes. The first node is called the root. The nodes connected below a given node are called its children. Nodes with no children are called leaves. A test is assigned to every node that is not a leaf; the concept of a test is described below. A target distribution, interpreted as a decision, is assigned to every leaf. Edges represent the possible outcomes of the test in the parent node.

Figure 30.1. Schematic structure of a classification tree


For a given observation, expressed as an input vector, the decision is made by traversing the tree from the root to a leaf. At each node the test assigned to that node is evaluated, and the child corresponding to the test outcome is selected. This procedure is repeated until the vector reaches a leaf. The classifier's decision is then made based on the leaf in which the procedure has finished.
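The traversal procedure above can be sketched in Python. The `Node`/`Leaf` structure and the way a test maps an input vector to a child index are illustrative assumptions, not AdvancedMiner API.

```python
class Leaf:
    def __init__(self, target_distribution):
        # target distribution assigned to the leaf, e.g. {"yes": 0.9, "no": 0.1}
        self.target_distribution = target_distribution

class Node:
    def __init__(self, test, children):
        self.test = test          # function: input vector -> outcome index
        self.children = children  # one child per possible test outcome

def classify(node, x):
    """Traverse from the root to a leaf; the decision comes from that leaf."""
    while isinstance(node, Node):
        node = node.children[node.test(x)]
    # decision = most probable target category in the leaf's distribution
    return max(node.target_distribution, key=node.target_distribution.get)
```

For example, a numeric test on a hypothetical `age` attribute with threshold 30 could be expressed as `lambda x: 1 if x["age"] >= 30 else 0`.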

Test types

A test is a multi-value condition operating on a single attribute value. The names "rule" or "predicate" are also used. There are four different test types in the Classification Tree module:

numeric

This is a binary test. The value of a numerical attribute is compared with the threshold value. If the value is greater than or equal to the threshold, the observation is sent to the right child; otherwise it is sent to the left one.

Figure 30.2. Example numeric test

nominal binary

This is a binary test for nominal attributes. The tested value is compared to a selected category. If the value is equal to the category, the observation is sent to the right child. Otherwise it is sent to the left one.

Figure 30.3. Example nominal-binary test

nominal - all values

This is a multi-value test for nominal attributes. It has as many outcomes as there are possible categories of the nominal attribute. Every child node corresponds to one category value.

Figure 30.4. Example nominal test with all values

nominal - grouped

This is the most general type of nominal test. It can have any number of outcomes. Every outcome is a set of values. If this set contains the value of the current attribute, then the case is sent to the corresponding child node.

Figure 30.5. Example nominal-grouped test

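Each of the four test types can be viewed as a function mapping an attribute value to an outcome index that selects a child node. The sketch below is a hypothetical encoding for illustration, not AdvancedMiner's internal representation.

```python
def numeric_test(value, threshold):
    """Binary: 1 (right child) if value >= threshold, else 0 (left child)."""
    return 1 if value >= threshold else 0

def nominal_binary_test(value, category):
    """Binary: 1 (right child) if value equals the selected category."""
    return 1 if value == category else 0

def nominal_all_values_test(value, categories):
    """Multi-valued: one outcome per possible category."""
    return categories.index(value)

def nominal_grouped_test(value, groups):
    """Multi-valued: outcome = index of the group of values containing value."""
    for i, group in enumerate(groups):
        if value in group:
            return i
    raise ValueError("value not covered by any group")
```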

Tree building algorithm

The Classification Trees module in AdvancedMiner uses a recursive top-down tree building algorithm.

Tree construction is a recursive process of data partitioning. The general rule is to select tests which split the data into subsets in which the target is as homogeneous as possible. An evaluation method is used to compare the possible partitions and select the best one.

When the test is selected, the data set is split and the procedure is repeated recursively for every subset.

The algorithm stops when one of the stop criteria is satisfied.
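As a rough illustration, the recursion can be sketched for a single numeric attribute using entropy gain as the evaluation method. The parameter names mirror the stop criteria described below, but the code is a simplified assumption, not the module's implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of target labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(cases, depth=0, min_node_size=2, max_tree_depth=5, min_gain=0.0):
    """cases: list of (attribute_value, target_label) pairs."""
    labels = [y for _, y in cases]
    # Stop criteria: homogeneous node, node too small, or depth limit reached.
    if len(set(labels)) == 1 or len(cases) < min_node_size or depth >= max_tree_depth:
        return ("leaf", Counter(labels))
    # Rate every candidate threshold and keep the one with the largest gain.
    best = None
    for t in sorted({x for x, _ in cases}):
        left = [y for x, y in cases if x < t]
        right = [y for x, y in cases if x >= t]
        if not left or not right:
            continue  # not a real split
        gain = entropy(labels) - (len(left) * entropy(left)
                                  + len(right) * entropy(right)) / len(cases)
        if best is None or gain > best[0]:
            best = (gain, t)
    # No valid split left: create a leaf.
    if best is None or best[0] <= min_gain:
        return ("leaf", Counter(labels))
    _, t = best
    left = [(x, y) for x, y in cases if x < t]
    right = [(x, y) for x, y in cases if x >= t]
    return ("node", t,
            build_tree(left, depth + 1, min_node_size, max_tree_depth, min_gain),
            build_tree(right, depth + 1, min_node_size, max_tree_depth, min_gain))
```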

Stop Criteria

In the Classification Tree module there are four conditions under which the algorithm stops the recursion and creates a leaf in the current node:

  1. The density of cases with some target category reaches a certain level: all samples belong to the same class or the Homogeneous Density condition is satisfied.

  2. The size of a node is too small to continue splitting: the Min Node Size or Min Node Size Prc condition is not satisfied or there are no cases in the current node.

  3. The tree depth has exceeded the Max Tree Depth limit.

  4. There are no more valid split tests. Such a situation occurs when every test has an evaluation value less than 0 or less than the Min Split Evaluation Gain value.

Note

The stop criteria affect only the tree-building phase. This implies that the min node size or tree depth condition might not be satisfied after the pruning process.

Split selection

The algorithm selects the best split for a given node by rating all valid tests and selecting the one with the largest value of the rating function. Valid values lie in the interval [-1.0, 1.0). A positive value means that a split based on the given test would improve the classification efficiency of the model; negative values indicate deterioration. The value -1.0 means that the split is not valid.

There are three mathematical formulas used to compute the assessment value in AdvancedMiner. All of them are based on the distributions of target values in the current node and its direct children.

Let T be the set of cases for the current node, let T_1, ..., T_k be the assessed partition of the set T, and let |.| be the size-of-set operator. Let p_j(S) be the frequency of target category j in the set S.

Entropy based split

The entropy of the set T is equal to:

    E(T) = - \sum_j p_j(T) \log_2 p_j(T)

The entropy gain is the difference between the entropy of the set and the entropy of the split. The entropy of the split is equal to the weighted average of the entropies of the subsets:

    E_{split} = \sum_{i=1}^{k} \frac{|T_i|}{|T|} E(T_i)

Finally, the entropy gain is equal to E(T) - E_{split}. For details see Quinlan 1993, pages 20-21, section "Gain criterion".
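A direct transcription of the entropy-gain computation, with sets represented as lists of target labels (an illustrative simplification):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(T): entropy of the target distribution in a set of cases."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def entropy_gain(parent, subsets):
    """E(T) - E_split, where E_split is the weighted average subset entropy."""
    split = sum(len(s) / len(parent) * entropy(s) for s in subsets)
    return entropy(parent) - split
```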

Gain ratio based split

Gain ratio is the entropy gain normalized by the distribution of the sizes of the subsets. Denote this distribution by S = (|T_1|/|T|, ..., |T_k|/|T|). Then

    GainRatio = \frac{E(T) - E_{split}}{E(S)}, where E(S) = - \sum_{i=1}^{k} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}

(for details see Quinlan 1993, pages 23-24, section "Gain ratio criterion")
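The normalization can be sketched as follows, where `subset_sizes` plays the role of the size distribution S (illustrative helper names, not module API):

```python
from math import log2

def split_info(subset_sizes):
    """E(S): entropy of the distribution of subset sizes."""
    n = sum(subset_sizes)
    return -sum(s / n * log2(s / n) for s in subset_sizes if s)

def gain_ratio(entropy_gain, subset_sizes):
    """Entropy gain normalized by the subset-size entropy."""
    return entropy_gain / split_info(subset_sizes)
```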

Gini based split

The Gini value for a subset T_i is calculated using the formula

    Gini(T_i) = 1 - \sum_j p_j(T_i)^2

The Gini split evaluation value is computed using the formula

    Gini_{split} = Gini(T) - \sum_{i=1}^{k} \frac{|T_i|}{|T|} Gini(T_i)

(for details see Breiman 1984, page 93, chapter "Splitting Rules")
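A sketch of the Gini evaluation with subsets again represented as lists of target labels (illustrative, not the module's code):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared category frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split_value(parent, subsets):
    """Impurity decrease: Gini(T) minus the weighted Gini of the subsets."""
    return gini(parent) - sum(len(s) / len(parent) * gini(s) for s in subsets)
```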

Tree pruning

Pruning is an independent process, performed after the full tree has been built. Its goal is to reduce the tree size. A single pruning operation replaces a given node either with a leaf or with its largest subtree. This operation is performed recursively from the leaves to the root (bottom-up).

Pruning is intended to reduce overfitting. Each step of the procedure reduces the estimated error made by the tree on unseen data.

In AdvancedMiner the Classification Tree module employs pessimistic error pruning. This method uses training data to pessimistically estimate the probable error made by the classifier on unseen data. Validation data is not used for pruning.
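As an illustration of the pessimistic idea, the sketch below uses the continuity correction of 0.5 error per leaf from Quinlan's original pessimistic pruning; AdvancedMiner's exact estimate may differ, so treat the helper names and decision rule as assumptions.

```python
def pessimistic_errors(errors, num_leaves):
    """Training errors plus 0.5 per leaf (continuity correction)."""
    return errors + 0.5 * num_leaves

def should_prune(subtree_errors, subtree_leaves, leaf_errors):
    # Replace the subtree with a single leaf when the leaf's pessimistic
    # error estimate is not worse than the subtree's.
    return pessimistic_errors(leaf_errors, 1) <= pessimistic_errors(
        subtree_errors, subtree_leaves)
```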

Null values

Missing (null) values are fully supported by the Classification Trees module. They are handled using the partitioning-of-cases method (for details see Quinlan 1993, chapter "Unknown Attribute Values").