Predictive Analytics for Beginners – part 1

The role of predictive analytics in business

Data is everywhere. We generate data when using an ATM, browsing the Internet, calling our friends, buying shoes in our favourite e-shop or posting on Facebook. Companies collect this data en masse in order to make more informed business decisions, such as:

  • Which customers should participate in our promotional campaign for a given product in order to maximize response?
  • Which customers deserve special attention because they might be considering leaving our services?
  • Is a particular customer trustworthy and does he/she qualify for a mortgage loan?

It is not always easy to answer questions like the ones above. In such situations it is worthwhile to turn to predictive analytics, which provides valuable information that can help make the right decisions. In simple terms, predictive analytics lets us predict the future on the basis of historical data.

For example, if we know which customers stopped using our products in the past, we can build an analytical model describing the patterns in their behavior and characteristics. If we observe similar behavior in other customers, especially those who are the most valuable to us (as they generate the largest sales), we may try to prevent them from leaving. Predictive analytics will provide us with a ranking of our customers according to their risk of departure (this is the so-called score; in our example, the higher the score, the higher the departure risk).
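A minimal Python sketch of what such a ranking might look like. The customer IDs and score values are made up for illustration; in a real project the scores would come from a model fitted to historical data.

```python
# Hypothetical departure-risk scores (higher = more likely to leave).
# In practice these would be produced by a fitted predictive model.
scores = {
    "C001": 0.12,
    "C002": 0.87,
    "C003": 0.45,
    "C004": 0.91,
}


def rank_by_risk(scores):
    """Return customer IDs ordered from highest to lowest departure risk."""
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_by_risk(scores)
print(ranking)
```

The customers at the top of the list are the ones a retention campaign would most likely address first.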

However, to build such an analytical model we need historical data …

Source data

Most often the data required for modeling are obtained from databases or flat files. Below, we discuss an example table with source data and ways of interpreting it.

Figure: example table with source data

Table columns are referred to as variables or attributes, while table rows are called records, observations or objects.

Variables can be:

  • numerical (also called quantitative or continuous) – for example age, income, temperature,
  • categorical (also called discrete, qualitative, nominal) – such as gender, occupation, eye color.
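As a rough illustration, the two variable types can be told apart by looking at the values a column holds. The column names and values below are invented, and the rule is only a heuristic: numeric codes such as postal codes are categorical even though they consist of digits.

```python
def variable_kind(values):
    """Guess whether a column is 'numerical' or 'categorical' from its values."""

    def is_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    return "numerical" if all(is_number(v) for v in values) else "categorical"


# Hypothetical columns of a small source table.
columns = {
    "age": [34, 51, 27],
    "income": [4200.0, 3100.5, 5800.0],
    "gender": ["F", "M", "F"],
    "occupation": ["teacher", "engineer", "nurse"],
}

kinds = {name: variable_kind(values) for name, values in columns.items()}
print(kinds)
```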

We distinguish two basic roles a variable can have:

  • independent (also called predictor, explanatory, feature) – these variables describe the properties of objects which we want to use as the basis for making inferences,
  • dependent (also called response, explained, target) – these variables describe the features of the object which we want to make inferences about.

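It is worth remembering that information about the target variable should not be used when calculating the values of explanatory variables. Otherwise the model would rely on information that is not available at the moment a prediction has to be made.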
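One common way to respect this rule is to fix an observation date: explanatory variables use only data from before that date, while the target is determined from what happens afterwards. The sketch below assumes a made-up purchase history and a made-up observation date; the function names are hypothetical.

```python
from datetime import date

# Hypothetical purchase history of a single customer.
purchases = [
    date(2023, 1, 10),
    date(2023, 2, 3),
    date(2023, 3, 20),
    date(2023, 5, 2),
]

# Hypothetical observation date separating "past" from "future".
observation_date = date(2023, 4, 1)


def purchases_before(purchases, cutoff):
    """Explanatory variable: number of purchases made before the cutoff."""
    return sum(1 for day in purchases if day < cutoff)


def bought_after(purchases, cutoff):
    """Target variable: did the customer buy on or after the cutoff?"""
    return any(day >= cutoff for day in purchases)


feature = purchases_before(purchases, observation_date)
target = bought_after(purchases, observation_date)
print(feature, target)
```

Counting all four purchases in the explanatory variable would leak the May purchase, which is exactly the information the target describes.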

Depending on the industry and the task, there can be a lot of variables available for analysis. We have worked on databases that had tens of thousands of variables. In addition, variable names are not always understandable. For example, who would guess that POP901, MARR1 and IC10 mean, respectively, the number of persons, the percentage of married people and the percentage of households with an income of $50,000 – $74,999? Therefore, during data analysis one should have a data dictionary with a description of each variable.
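In its simplest form, a data dictionary can be a plain lookup table. The sketch below uses the three variable names mentioned above; the `describe` helper and the fallback text are assumptions made for illustration.

```python
# A tiny data dictionary: variable name -> human-readable description.
data_dictionary = {
    "POP901": "number of persons",
    "MARR1": "percentage of married people",
    "IC10": "percentage of households with an income of $50,000 to $74,999",
}


def describe(variable):
    """Look up a variable's description, with a fallback for unknown names."""
    return data_dictionary.get(variable, "no description available")


print(describe("MARR1"))
print(describe("XYZ42"))
```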

Analytical model

A predictive model describes the dependencies between the explanatory variables and the target. It lets us predict the target value on the basis of the explanatory variables. There are many types of models. The most popular ones include:

  • regression (with the dependency expressed using a mathematical formula). An example:

Figure: example of a regression model

  • decision tree (where the dependency is encoded using a tree-like graph). An example:

Figure: example of a decision tree
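To make the difference between the two model types concrete, here is a small Python sketch. It does not reproduce the models from the figures: the variables, coefficients and thresholds are invented purely for illustration.

```python
def regression_score(age, income):
    """Regression: the dependency is a formula (coefficients are made up)."""
    return 0.5 + 0.01 * age - 0.0001 * income


def tree_predict(age, income):
    """Decision tree: the dependency is a series of nested tests (made up)."""
    if income < 3000:
        if age < 30:
            return "leaves"
        return "stays"
    return "stays"


print(regression_score(40, 2000))
print(tree_predict(25, 2000))
```

The regression returns a number computed from all variables at once, while the tree reaches its answer by following one path of yes/no questions.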

Models can have the following roles:

  • classification – the target variable is discrete (e.g. decision trees, logistic regression),
  • approximation – the target is continuous (e.g. linear regression, neural networks),
  • association – co-occurrence of values (e.g. the Apriori algorithm, associative networks),
  • segmentation – division into subgroups (e.g. the k-means algorithm, Kohonen networks).

In our next post we are going to focus on the model building process, using a classification task as an example. Classification is the process of assigning every object from a collection to exactly one class from a known set of classes. Examples of classification tasks include assigning a patient (the object) to the group of healthy or ill people (the classes) on the basis of his or her medical record, or determining a customer's (the object's) credibility during a credit application using, for example, demographic and financial data; in this case the classes are "credible" and "not credible".
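The credit example can be sketched in a few lines of Python. The rule below (debt payments of at most 40% of income) is a made-up stand-in for a real model, and the applicant data is invented; the point is only that every object ends up in exactly one of the known classes.

```python
# The known set of classes.
CLASSES = ("credible", "not credible")


def classify(applicant):
    """Assign an applicant to exactly one class using a made-up rule."""
    # Hypothetical rule: monthly debt must not exceed 40% of monthly income.
    ratio = applicant["monthly_debt"] / applicant["monthly_income"]
    return "credible" if ratio <= 0.4 else "not credible"


# Hypothetical applicants.
applicants = [
    {"name": "A", "monthly_income": 5000, "monthly_debt": 1000},
    {"name": "B", "monthly_income": 3000, "monthly_debt": 1800},
]

labels = {a["name"]: classify(a) for a in applicants}
print(labels)
```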

The third part will be devoted to the notions of scoring and cut-off point.

