Data Science has been one of the most significant trends in the data analytics market over the last ten years, and it comes with buzzwords that are hard to keep track of. When it comes to actual implementation, however, Data Science projects often fail because the success criteria and the end point of the project were never clearly defined.
In every business segment, you need a methodology that gives you a framework for successful project implementation. The most commonly used methodology in Data Science projects is CRISP-DM (Cross-Industry Standard Process for Data Mining), which has been used for years in analytical projects.
CRISP-DM formally defines six phases. This article treats modeling and evaluation as a single step, which gives 5 major steps that apply to Data Science projects regardless of the industry you work in:
1.) Business Understanding
2.) Data Understanding
3.) Data Preparation
4.) Modeling and model evaluation
5.) Implementation of the solution
I should point out that these steps are interdependent and iterative: what you learn in a later step often sends you back to an earlier one.
To make a Data Science project successful, the most important thing is to determine the project goal, i.e. what value the project brings to the company. Companies often see a Data Science case study at a conference or on the Internet and try to copy it. Such projects fail in most cases, because the goals of one company can be vastly different from those of another.
BUSINESS UNDERSTANDING – CASE STUDY
You own a general merchandise company, and over the last year you have noticed a fall in the number of registered customers buying on your e-commerce site. To stop this trend, you want to take steps to keep your existing customers and attract new ones, but you do not know how.
In this case, the business goal of the project would be to find out which customers have left your business over the past year and to design a specific marketing campaign to retain customers of that type.
2.) DATA UNDERSTANDING – what data do we need to use?
Data Science is not only a technical discipline but also a business one, because it takes detailed domain knowledge to select the data set that will be used to meet the business goal. It is also very important to know what the individual data fields mean in business terms, so that the interpretation of the model results makes sense to business users (here, the marketing team).
DATA UNDERSTANDING – CASE STUDY
After defining the business goal of the project (creating a marketing campaign for customers who have stopped purchasing on the e-commerce site), it is necessary to define the data that will be used to build the Data Science model. In meetings between the marketing and data science teams, we concluded that two data sources will be used: customer behavioral transactions on our website and demographic data on our customers from the CRM system. We also defined a customer as “lost” (a churner) if they have not made any purchases on the e-commerce site in the last three months.
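The churn definition above can be written down as a small rule. The following is only a sketch under stated assumptions: "three months" is approximated as 90 days, a registered customer with no recorded purchase is treated as a churner, and the function name and sample customer ids are hypothetical, not taken from the project.

```python
from datetime import date, timedelta

# Assumption: "three months" is approximated as 90 days.
CHURN_WINDOW = timedelta(days=90)


def label_churners(last_purchase_by_customer, today):
    """Return {customer_id: bool}, where True marks a churner.

    A customer counts as a churner when their most recent purchase is
    older than the churn window, or when no purchase is recorded (None).
    """
    cutoff = today - CHURN_WINDOW
    return {
        customer_id: last_purchase is None or last_purchase < cutoff
        for customer_id, last_purchase in last_purchase_by_customer.items()
    }


# Hypothetical sample data: customer id -> date of the last purchase.
last_purchases = {
    "C001": date(2024, 5, 20),
    "C002": date(2024, 1, 15),
    "C003": None,
}
churn_flags = label_churners(last_purchases, today=date(2024, 6, 30))
print(churn_flags)
```

Fixing the reference date (`today`) as an explicit parameter keeps the label reproducible, which matters when the same definition has to be agreed on by the marketing and data science teams.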
3.) DATA PREPARATION – how to prepare data for modeling?
Data preparation is a purely technical job whose aim is to bring the data into the structure needed to build Data Science models. This part of the methodology covers two tasks:
- Integration of data from multiple sources
- Data cleaning and transformation
This phase usually takes up most of the project time (70-80%), because without high-quality, structured data we cannot build high-quality Data Science models.
DATA PREPARATION – CASE STUDY
Once we have defined the data sources, it is necessary to link the data from the transaction systems and the CRM to get a data set that describes customer behavior and demographics, aggregated per customer. It was also necessary to do so-called feature engineering, i.e. additional modifications and transformations of the data that give a better view of the individual customer.
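As a rough illustration of this step, the sketch below joins a transaction export with a CRM extract and derives a few per-customer features. All field names, customer ids and values are hypothetical, and the chosen features (order count, total spent, average order value, days since last purchase) are just common examples, not the ones used in the actual project.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction export: (customer_id, purchase_date, amount).
transactions = [
    ("C001", date(2024, 3, 2), 40.0),
    ("C001", date(2024, 5, 20), 60.0),
    ("C002", date(2024, 1, 15), 25.0),
]

# Hypothetical CRM extract: customer_id -> demographic attributes.
crm = {
    "C001": {"age": 34, "city": "A"},
    "C002": {"age": 51, "city": "B"},
    "C003": {"age": 27, "city": "A"},
}


def build_customer_table(transactions, crm, today):
    """Join CRM rows with per-customer transaction aggregates."""
    grouped = defaultdict(list)
    for customer_id, purchase_date, amount in transactions:
        grouped[customer_id].append((purchase_date, amount))

    table = {}
    for customer_id, demographics in crm.items():
        purchases = grouped.get(customer_id, [])
        amounts = [amount for _, amount in purchases]
        last_purchase = max((d for d, _ in purchases), default=None)
        table[customer_id] = {
            **demographics,
            # Engineered features derived from the raw transactions.
            "order_count": len(purchases),
            "total_spent": sum(amounts),
            "avg_order_value": sum(amounts) / len(amounts) if amounts else 0.0,
            "days_since_last_purchase": (
                (today - last_purchase).days if last_purchase else None
            ),
        }
    return table


customer_table = build_customer_table(transactions, crm, today=date(2024, 6, 30))
for customer_id, row in customer_table.items():
    print(customer_id, row)
```

Starting the join from the CRM side keeps customers without any transactions (here `C003`) in the table, which is useful because those customers are exactly the ones a churn analysis cares about.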
4.) MODELING AND MODEL EVALUATION – which algorithm to use?
Data Science algorithms are generally divided into two groups:
- Supervised – where the outcome of past events is known and, based on that historical data, we predict future events.
- Unsupervised – where the outcome is not known and the algorithm looks for structure in the data on its own. The best-known examples are data clustering algorithms.
Supervised algorithms can be further divided into:
- Classification algorithms – used when we predict a qualitative outcome, e.g. whether a customer will leave (YES / NO).
- Regression algorithms – used when we predict a quantitative outcome, e.g. the price of an apartment depending on some variables (such as the size of the apartment in square meters and its location).
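To make the difference between the two supervised families concrete, here is a minimal sketch in plain Python. It assumes one input feature per problem and uses deliberately simple stand-ins: a single learned cut-off for classification and a least-squares line for regression. The data is made up for illustration, and real projects would normally use a dedicated library and many more features.

```python
def fit_threshold(xs, labels):
    """Classification: pick the cut-off that best separates the training data.

    The rule predicts True (customer leaves) when the value is above the cut-off.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(xs)):
        correct = sum((x > threshold) == label for x, label in zip(xs, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Hypothetical classification data: days without a purchase -> did the customer leave?
days_inactive = [5, 12, 30, 75, 95, 120]
left = [False, False, False, True, True, True]
cutoff = fit_threshold(days_inactive, left)
will_leave = 80 > cutoff  # qualitative outcome: YES / NO

# Hypothetical regression data: apartment size in square meters -> price.
size_m2 = [40, 60, 80, 100]
price = [100_000, 140_000, 180_000, 220_000]
slope, intercept = fit_line(size_m2, price)
predicted_price = slope * 70 + intercept  # quantitative outcome: a number

print(cutoff, will_leave, predicted_price)
```

Both functions learn from past examples with a known outcome, which is what makes them supervised; the only difference is whether the predicted value is a category or a number.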
A Data Science model is evaluated in one of two ways, depending on the goal:
- Accuracy of the algorithm or model – here the model gives up interpretability, which is why such models are called black-box models. In most cases they achieve very high accuracy. These models make sense when the aim is to automate a process, e.g. recognizing objects in images or video.
- Identifying data patterns – here we want to gain business insight from the data, so the models are simpler and easier for business users to interpret.
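For the accuracy-oriented view, a minimal sketch of how a churn model could be scored on customers whose real outcome is already known. The function names and the sample outcomes are hypothetical, and accuracy is only one of several possible measures; the breakdown of errors is added here because it tends to be easier to discuss with business users.

```python
def accuracy(actual, predicted):
    """Share of predictions that match the actual outcome."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for a YES/NO outcome."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1  # predicted churn, customer really left
        elif p and not a:
            counts["fp"] += 1  # predicted churn, customer stayed
        elif not p and not a:
            counts["tn"] += 1  # predicted stay, customer stayed
        else:
            counts["fn"] += 1  # predicted stay, customer left
    return counts


# Hypothetical held-out customers: True means the customer churned.
actual = [True, True, False, False, True, False, False, False]
predicted = [True, False, False, False, True, True, False, False]

model_accuracy = accuracy(actual, predicted)
model_errors = confusion_counts(actual, predicted)
print(model_accuracy, model_errors)
```

A missed churner (`fn`) and a false alarm (`fp`) usually have different costs for a marketing campaign, so looking at the counts alongside the single accuracy number helps connect the evaluation back to the business goal.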