Data is everywhere. We generate data when using an ATM, browsing the Internet, calling our friends, buying shoes in our favourite e-shop or posting on Facebook. Companies collect this data en masse in order to make more informed business decisions, such as:
It is not always easy to obtain answers to questions like the ones above. In such situations it is worthwhile to resort to predictive analytics, which provides valuable information that can help make the right decisions. In simple terms, predictive analytics lets us predict the future on the basis of historical data.
For example, if we know which customers stopped using our products in the past, we can build an analytical model describing the patterns of their behavior and characteristics. If we observe similar behavior in other customers, especially the ones which are the most valuable to us (as they generate the largest sales), we may try to prevent them from departing. Predictive analytics will provide us with the ranking of our customers according to their risk of departure (this is the so called score – in our example, the higher the score, the higher the departure risk).
However, to build such an analytical model we need historical data …
Most often the data required for modeling are obtained from databases or flat files. Below, we discuss an example table with source data and ways of interpreting it.
Table columns are referred to as variables or attributes, while table rows are called records, observations or objects.
Variables can be:
We distinguish two basic roles a variable can have:
It is worth remembering, that information about the target variable should not be used when calculating the values of explanatory variables.
Depending on the industry, task, there can be a lot of variables available for analysis. We have worked on databases that had tens of thousands of variables. In addition, variable names are not always understandable. For example, who would guess that POP901, MARR1, IC10 means respectively number of persons, the percentage of married , the percentage of households with an income $ 50,000 – $ 74.999 ? Therefore, during data analysis one should have a data dictionary with variables description.
A predictive model describes the dependencies between explanatory variables and the target. It lets us to predict the target value on the basis of explanatory variables. There are many types of models. The most popular ones include:
Models can have the following roles:
In our next post we are going to focus on the model building process. We will show it on an example of classification task. Classification is the process of assigning every object from a collection to exactly one class from a known set of classes. Examples of classification tasks are: assigning a patient (the object) to a group of healthy or ill (the classes) people on the basis of his or her medical record or determining the customer’s (the object) credibility during credit application using, for example, demographic and financial data; in this case the classes are „credible” and „not credible”.
The third part will be devoted to the notions of scoring and cut-off point.