Tutorial: How to determine the quality and correctness of classification models? Part 1 – Introduction
What is classification?
Classification is the process of assigning every object from a collection to exactly one class from a known set of classes.
Examples of classification tasks are:
- assigning a patient (the object) to the group of healthy or ill people (the classes) on the basis of the patient’s medical record,
- determining the customer’s (the object) credibility during credit application using, for example, demographic and financial data; in this case the classes are “credible” and “not credible”,
- determining if the customer (the object) is likely to stop using the company’s services or products on the basis of behavioral and demographic data; in this case the classes are “disloyal customers” and “loyal customers”.
How are classification models created?
The creation of a classification model involves the following stages:
1. Data preparation (importing, processing, exploration and statistical analysis). This stage divides the data into two or three parts:
- training data – will be used to build the model
- validation data (in more complex cases) – will be used for evaluation of model quality during its creation
- testing data – will be used to establish the final quality of the model
2. Model creation (using the training data and, optionally, the validation data)
3. Model quality assessment (testing the created model on testing data)
4. Model application and subsequent monitoring (periodic checks that the quality of predictions does not deteriorate over time, for instance due to demographic or market changes)
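The data split from stage 1 can be sketched in a few lines of Python. This is only an illustration, not a prescribed procedure: the function name `split_dataset`, the 60/20/20 proportions and the fixed random seed are all assumptions made for the example.

```python
import random

def split_dataset(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and cut them into training, validation and testing parts.

    The fractions are illustrative defaults; whatever remains after the
    training and validation parts becomes the testing part.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Hypothetical data set: 100 record identifiers.
train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Every record ends up in exactly one part, so the testing data stays unseen while the model is being built.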
Which indicators can be used to determine the quality of classification models?
There are two kinds of indicators that can be used to estimate the quality of classification models:
- Quantitative quality indicators – statistics that express the quality of classification as numerical values.
- Graphical indicators – the quality of classification is represented on a graph which combines selected quantitative indicators. Graphical methods simplify model quality assessment and visualize classification results. Such indicators include:
Basic notions used in the assessment of the quality of classification models
In two-class (binary) problems:
- one class is defined as positive (also known as the target class, rare class or minority class)
- the other class is defined as negative (also known as the normal class)
In problems with more than two classes:
- one class is defined as positive
- all the other classes combined are defined as negative
The positive class should contain the objects that the model is meant to identify. For example, in churn modeling the positive class consists of resigning customers, and in credit scoring projects it consists of customers who defaulted on their debts. In both cases the negative class consists of the remaining customers.
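Reducing a problem to one positive class and one combined negative class can be sketched as a simple relabeling. The function `to_binary_labels` and the customer segments below are hypothetical, chosen only to illustrate the idea.

```python
def to_binary_labels(labels, positive_class):
    """Map each label to 1 (positive class) or 0 (all other classes combined)."""
    return [1 if label == positive_class else 0 for label in labels]

# Hypothetical three-class labels; "churned" is treated as the positive class.
segments = ["churned", "loyal", "dormant", "churned", "loyal"]
binary = to_binary_labels(segments, positive_class="churned")
print(binary)  # [1, 0, 0, 1, 0]
```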
TP, TN, FP, FN
- TP – True Positive – the number of observations correctly assigned to the positive class.
Example: the model’s predictions are correct and resigning customers have been assigned to the class of “disloyal” customers.
- TN – True Negative – the number of observations correctly assigned to the negative class.
Example: the model’s predictions are correct and customers who continue using the service have been assigned to the class of “loyal” customers.
- FP – False Positive – the number of observations assigned by the model to the positive class, which in reality belong to the negative class.
Example: unfortunately the model is not perfect and made a mistake: some customers who continue using the service have been assigned to the class of “disloyal” customers.
- FN – False Negative – the number of observations assigned by the model to the negative class, which in reality belong to the positive class.
Example: unfortunately the model is not perfect and made a mistake: some churning customers have been assigned to the class of “loyal” customers.
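The four counts defined above can be computed directly from the actual and predicted labels. The sketch below assumes labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample data are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for labels encoded as 1 = positive, 0 = negative."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # positive observation correctly assigned to the positive class
        elif a == 0 and p == 0:
            tn += 1  # negative observation correctly assigned to the negative class
        elif a == 0 and p == 1:
            fp += 1  # negative observation wrongly assigned to the positive class
        else:
            fn += 1  # positive observation wrongly assigned to the negative class
    return tp, tn, fp, fn

# Hypothetical churn data: 1 = "disloyal" (churning), 0 = "loyal".
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
print(confusion_counts(actual, predicted))  # (2, 4, 1, 1)
```

Here two churning customers were identified correctly, one was missed (FN) and one loyal customer was wrongly flagged as churning (FP).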
For a perfect classifier (i.e. every observation has been correctly classified) we would have:
FP = 0
FN = 0
TP = number of all observations from the positive class
TN = number of all observations from the negative class
Pos = TP + FN – number of all observations which in reality belong to the positive class
Neg = FP + TN – number of all observations which in reality belong to the negative class
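The totals Pos and Neg, and the conditions for a perfect classifier, follow directly from the four counts. The helper names `class_totals` and `is_perfect` and the example numbers are assumptions made for this sketch.

```python
def class_totals(tp, tn, fp, fn):
    """Return (Pos, Neg): how many observations really belong to each class."""
    pos = tp + fn  # all observations that are positive in reality
    neg = fp + tn  # all observations that are negative in reality
    return pos, neg

def is_perfect(tp, tn, fp, fn):
    """A perfect classifier makes no errors, so TP = Pos and TN = Neg."""
    pos, neg = class_totals(tp, tn, fp, fn)
    return fp == 0 and fn == 0 and tp == pos and tn == neg

# Hypothetical counts: TP = 2, TN = 4, FP = 1, FN = 1.
print(class_totals(2, 4, 1, 1))  # (3, 5)
print(is_perfect(2, 4, 1, 1))    # False
print(is_perfect(3, 5, 0, 0))    # True
```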