In this final post about practicing data science in SMBs, I will go into more detail about what a data science project actually looks like, what its essential phases are and, last but not least, how to maintain the deployed solutions.
Phases and timing
In the very first stage of experimenting with a data science solution, it often makes sense to build a quick concept model that shows the potential added value. The timeline and duration of course differ from case to case, but for models such as the churn model from my previous post, building a concept solution should not take more than 10 days of work.
There are essentially five phases when building such a concept solution:
1. Firstly we need to understand the business problem & decide how to approach it
We are looking to answer the question “What is it that we are solving?” More specifically, in the case of our churn model we need to define: “When is a client considered churned?”, “How far in advance do we need to detect a leaving client so that we can take action?”, “Which is worse: targeting clients who wouldn’t leave, or missing someone who would?” and so on. We also need to consider what can realistically be achieved and what actions can be taken afterwards, so close involvement of internal experts is essential.
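The first of these questions, “When is a client considered churned?”, ultimately has to be pinned down as a precise rule. A minimal sketch of such a rule, assuming a hypothetical 90-day inactivity window and made-up client data:

```python
from datetime import date

# Hypothetical churn definition: a client counts as churned if they made no
# purchase in the 90 days before the reference date. The window length and
# the client data below are illustrative assumptions, not a universal rule.
CHURN_WINDOW_DAYS = 90

def is_churned(last_purchase: date, reference: date) -> bool:
    """A client is 'churned' when their last purchase is older than the window."""
    return (reference - last_purchase).days > CHURN_WINDOW_DAYS

reference = date(2024, 6, 30)
last_purchases = {
    "client_a": date(2024, 6, 1),   # bought recently -> active
    "client_b": date(2024, 2, 15),  # no purchase for >90 days -> churned
}
churned = {c: is_churned(d, reference) for c, d in last_purchases.items()}
print(churned)  # {'client_a': False, 'client_b': True}
```

The right window length is exactly the kind of thing the internal experts should decide; for a grocery chain it might be weeks, for a car dealer years.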
2. Secondly we need to understand the data
Again making use of our internal experts, this time more on the data administration side, we need to understand what data are available and which of them can be meaningfully used in the model. For the churn model the minimal requirement is transactional data (who bought what and when); data about customers and products is a nice-to-have. Other data can of course be considered as well: we could, for example, use data about behaviour on the web, merge our client data with data from external partners, or use open data. But the amount of work this requires always needs to correspond to the expectations we have for the final solution.
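To make the minimal requirement concrete, the “who bought what and when” data boils down to a very small schema. A sketch with hypothetical field names, plus one nice-to-have customer attribute:

```python
from dataclasses import dataclass
from datetime import date

# Minimal data sketch for the churn model: transactional records ("who
# bought what and when"), optionally enriched with customer attributes.
# All field names here are assumptions for illustration.
@dataclass
class Transaction:
    client_id: str
    product_id: str
    purchase_date: date
    amount: float

@dataclass
class Customer:
    client_id: str
    segment: str  # nice-to-have attribute, e.g. from a CRM system

tx = Transaction("client_a", "prod_1", date(2024, 6, 20), 80.0)
print(tx.client_id, tx.amount)  # client_a 80.0
```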
3. With the third step, the real data science starts. We need to prepare the data and do so called feature engineering
A feature is a piece of information that might be useful for prediction. In plain English it means “how do we represent the real-life situation in the data to capture all the important traits”, or specifically for the churn model: “What might indicate that a client will leave?” A decrease in the number of transactions? Lower spend on products the client had typically been buying? A negative evaluation of a purchased product? Fewer visits to your website? The experience and ideas of the domain expert are very useful in this step. However, his or her hypotheses should be used as guidelines, not an exhaustive list. We should always generalize them as much as possible so as not to miss something important. Feature engineering is the crucial piece for success. Contrary to what one might expect, it is not the algorithms that make the difference in models, it’s the data and their representation that count.
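To show what turning such a hypothesis into a feature might look like, here is a minimal sketch that compares each client’s recent activity with their previous activity. The transactions, window sizes and field names are all hypothetical:

```python
from datetime import date, timedelta

# Illustrative feature engineering for churn: compare a client's activity
# in the most recent window with the window before it. A drop in counts
# or spend is a candidate churn signal.
transactions = [
    # (client_id, purchase_date, amount) -- made-up records
    ("client_a", date(2024, 5, 5), 120.0),
    ("client_a", date(2024, 6, 20), 80.0),
    ("client_b", date(2024, 5, 2), 200.0),
    # client_b has no purchase in the most recent window -> activity drop
]

def activity_features(txs, client, reference, window_days=30):
    """Transaction count and total spend in the recent vs. previous window."""
    recent_start = reference - timedelta(days=window_days)
    prev_start = reference - timedelta(days=2 * window_days)
    recent = [a for c, d, a in txs if c == client and recent_start <= d <= reference]
    previous = [a for c, d, a in txs if c == client and prev_start <= d < recent_start]
    return {
        "recent_tx_count": len(recent),
        "recent_spend": sum(recent),
        "prev_tx_count": len(previous),
        "prev_spend": sum(previous),
        "tx_count_drop": len(previous) - len(recent),
    }

ref = date(2024, 6, 30)
print(activity_features(transactions, "client_a", ref))
print(activity_features(transactions, "client_b", ref))
```

In practice one would generalize this beyond the expert’s single hypothesis, e.g. by computing the same drop over several window lengths and per product category rather than only overall.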
4. Fourth step is the modelling itself
This is the core part for the data scientist: there is a business specification, there is data, so the one thing missing is to build the actual model that is good enough and does what it should do, in our case predict which customers will leave. An important part of the model specification is testing how well it performs; we test this using something similar to a control group. We choose a point in the past, apply the model to a group of customers and check in how many cases the model correctly identified that they would leave. It is important to bear in mind that there is always a trade-off between how well the model performs and how much time it takes to develop it, so developing a model with 95% accuracy would most probably be too expensive for a concept solution.
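The backtest described above can be sketched in a few lines. The predictions and outcomes below are made-up illustration data; the two metrics map directly onto the business question from step 1 (“target clients who wouldn’t leave” vs. “miss someone who would”):

```python
# Backtesting sketch: score the model as of a past cut-off date, then
# compare its predictions with what actually happened afterwards.
# Both dictionaries below are fabricated for illustration.
predicted_churn = {"a": True, "b": False, "c": True, "d": False}
actual_churn = {"a": True, "b": False, "c": False, "d": True}

tp = sum(predicted_churn[c] and actual_churn[c] for c in predicted_churn)
fp = sum(predicted_churn[c] and not actual_churn[c] for c in predicted_churn)
fn = sum(not predicted_churn[c] and actual_churn[c] for c in predicted_churn)

precision = tp / (tp + fp)  # of the clients we flagged, how many really left
recall = tp / (tp + fn)     # of the clients who left, how many we caught
print(precision, recall)    # 0.5 0.5
```

Which of the two metrics matters more is a business decision: if contacting a loyal client is cheap, optimize for recall; if the reactivation offer is expensive, optimize for precision.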
5. Fifth step is the solution deployment
This step is optional and applies to solutions that are expected to bring continuous value, such as recommendation engines or our churn model. It basically means plugging the data science model into the business’s infrastructure so that it can actually be used, e.g. to build a reactivation campaign for the targeted customers.
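In its simplest form, such a deployment is a scheduled job that turns model scores into a campaign target list. A minimal sketch, where the scoring stub, threshold and field names are all assumptions standing in for the real trained model and customer database:

```python
# Deployment sketch: turn churn scores into a reactivation campaign list.
# 'score_customers' is a stand-in for the trained model; the customers and
# the 0.7 threshold are hypothetical.
def score_customers(customers):
    """Pretend model output: a churn probability per client."""
    return {c["id"]: c["risk"] for c in customers}

def build_campaign(scores, threshold=0.7):
    """Customers at or above the churn-risk threshold enter the campaign."""
    return sorted(cid for cid, p in scores.items() if p >= threshold)

customers = [
    {"id": "a", "risk": 0.9},
    {"id": "b", "risk": 0.3},
    {"id": "c", "risk": 0.75},
]
print(build_campaign(score_customers(customers)))  # ['a', 'c']
```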
Are we done?
If it turns out that the solution works well enough, that’s of course very good. What ideally follows is incorporating the solution into the standard processes so that it is possible to act on it regularly. In other words, making sure that the model scores are available, that there is a reactivation campaign, and that the marketing representative includes the model in the decision-making process for selecting the target customers.
The model also needs to be constantly monitored: evaluated and re-trained so that it stays up-to-date. If needed, we might also invest more in the model’s development. Very often this means including additional data sources or taking more time for feature engineering to improve the model quality and be truly “better than yesterday” each and every new day.
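The monitoring loop can start as something very simple: re-run the evaluation each period and flag the model for retraining when quality degrades. The metric history, baseline and tolerance below are hypothetical numbers for illustration:

```python
# Monitoring sketch: flag the model for retraining when its evaluated
# precision drops too far below the baseline measured at deployment time.
# All numbers here are made up for the example.
BASELINE_PRECISION = 0.60
TOLERANCE = 0.10  # retrain if precision falls >10 points below baseline

monthly_precision = [0.62, 0.58, 0.55, 0.47]  # fabricated evaluation results

def needs_retraining(history, baseline=BASELINE_PRECISION, tol=TOLERANCE):
    """True when the latest evaluation dropped too far below the baseline."""
    return history[-1] < baseline - tol

print(needs_retraining(monthly_precision))  # True (0.47 < 0.50)
```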
Truth be told, incorporating data science into your business is a never-ending, iterative process of maintenance, constant adjustments and improvements. But I hope I was able to explain why I think that in many cases it is 100% worth it.