When building predictive models, you obviously need to pay close attention to their performance. That is essentially what it is all about – getting the prediction right. Especially if you are working for paying clients, you need to prove that the performance of your models is good enough for their business. Fortunately, there is a whole bunch of statistical metrics and tools at hand for assessing a model’s performance.
In my experience, performance metrics for (especially binary) classification tasks, such as the confusion matrix and the metrics derived from it, are naturally understood by almost anyone. The situation is a bit more problematic for regression and time series. For example, when you want to predict future sales or derive income from other parameters, you need to show how close your prediction is to the observed reality.
I will not write about (adjusted) R-squared, the F-test and other statistical measures. Instead, I want to focus on performance metrics that represent a more intuitive concept of performance, as I believe they can help you sell your work much better. These are:
- mean absolute error
- median absolute deviation
- root mean squared error
- mean absolute percentage error
- mean percentage error
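As a minimal sketch, the five metrics listed above can be computed with the Python standard library alone. The function name `regression_metrics` and the sample numbers are made up for illustration. "Median absolute deviation" is read here as the median of the absolute prediction errors, which is an assumption about the intended meaning. The percentage metrics assume that no actual value is zero.

```python
import math
import statistics


def regression_metrics(actual, predicted):
    """Return a dict of simple error metrics (hypothetical helper)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    # Relative errors; this divides by the actual values, so none may be zero.
    pct_errors = [e / a for e, a in zip(errors, actual)]
    return {
        "mae": statistics.mean(abs_errors),
        # Assumption: median of the absolute errors, not the classical MAD.
        "median_abs": statistics.median(abs_errors),
        "rmse": math.sqrt(statistics.mean([e * e for e in errors])),
        "mape": 100 * statistics.mean([abs(p) for p in pct_errors]),
        # Signed percentage error: positive and negative errors cancel out.
        "mpe": 100 * statistics.mean(pct_errors),
    }


# Made-up numbers, for illustration only.
metrics = regression_metrics(actual=[100, 200, 400], predicted=[110, 180, 400])
print(metrics)
```

Note that the mean percentage error keeps the sign of each error, so it indicates bias (systematic over- or under-prediction) rather than overall accuracy.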
There are many situations where we find that our code runs too slowly and the reason is not apparent. In such situations, the Python cProfile module comes in very handy. The module enables us to see how much time individual steps in our code take, as well as the number of times certain functions are called. In the following paragraphs, I will explore its capabilities a bit more.
However, first let’s remember the quote by Donald Knuth: “premature optimization is the root of all evil (or at least most of it) in programming”. So make sure that you don’t start optimizing before you even have working code! In many cases you will not be able to determine the bottlenecks beforehand and might spend a lot of extra effort in the wrong places.
Profiling with cProfile
The easiest way to use the cProfile module from within a Python script looks as follows:
pr = cProfile.Profile()
In the code we create a Profile object and enable it, then execute the code that is of interest to us, disable the profiling and view the results.
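The steps just described (create, enable, run, disable, view) might look like the sketch below. The function `slow_sum` is a made-up stand-in for the code of interest, and sorting by cumulative time and printing the top ten rows are illustrative choices.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload standing in for the code we want to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


pr = cProfile.Profile()
pr.enable()                     # start collecting timing data
result = slow_sum(100_000)      # the code that is of interest to us
pr.disable()                    # stop collecting

# Write the statistics to a string buffer, sorted by cumulative time,
# and keep only the ten most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(pr, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

The report lists, for each function, the number of calls (`ncalls`), the time spent in the function itself (`tottime`) and the time including the functions it calls (`cumtime`).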
Our world is generating more and more data, which people and businesses want to turn into something useful. This naturally attracts many data scientists – sometimes called data analysts, data miners, and many other fancier names – who aim to help with this extraction of information from data.
A lot of data scientists around me graduated in statistics, mathematics, physics or biology. During their studies they focused on individual modelling techniques or nice visualizations for the papers they wrote. None of them had ever taken a proper computer science course that would help them tame the programming language completely and allow them to produce clean, professional code that is easy to read, can be re-used, runs fast and with reasonable memory requirements, is easy to collaborate on and, most importantly, gives reliable results.
I am no exception to this. During my studies we used R and Matlab to get hands-on experience with various machine learning techniques. We obviously focused on choosing the best model, tuning its parameters, dealing with violated model assumptions and other rather theoretical concepts. So when I started my professional career I had to learn how to deal with imperfect input data, how to create a script that can run daily, how to fit the best model and store the predictions in a database, or even how to use them directly in some online client-facing application.
To do this I took the standard path: reading books, papers and blogs, trying new stuff, working on hobby projects, googling, stack-overflowing and asking colleagues. But again, I was mainly focusing on overcoming small ad hoc problems.
Luckily for me, I met a few smart computer scientists on the way who showed me how to develop code that is more professional. Or at least less amateurish. What follows is a list of the most important points I have had to learn since I left university. These points allowed me to work on more complex problems, both theoretically and technically. I must admit that making your coding skills better is a never-ending story that restarts with every new project.
Irrespective of whether the underlying data comes from e-shop customers, your clients, small businesses or large for-profit and non-profit organizations, market segmentation analysis always brings valuable insights and helps you to leverage otherwise hidden information in your favor, for example to increase sales. Therefore, it is vitally important to use an efficient analytical pipeline, which not only helps you understand your customer base but also serves you when planning tailored offers, advertising, promos or strategy. Let us play with some advanced analytics to provide a simple example of efficiency improvement when using segmentation techniques, namely clustering, projection pursuit and t-SNE.
As your goal might be improving your sales through tailored customer contact, you need to discover homogeneous groups of people. Different groups of customers behave and respond differently, so it is only natural to treat them differently. The idea is to increase profit in each segment separately, through a strategy tailored to that segment. Thus, we need to accomplish two fundamental tasks:
- identify homogeneous market segments (i.e. which people are in which group)
- identify important features (i.e. what is decisive for customer behavior)
In this post, I am focusing on the first problem from the technical point of view, using some advanced analytic methods. For the sake of a brief demonstration, I will work with a simple dataset describing the annual spending of clients of a wholesale distributor on diverse product categories. As the figure below shows, it would be difficult to detect any well-separated clusters of clients at first sight.
Following the introduction of R capabilities in Tableau 8.1, the new Tableau 10.1 now also comes with support for Python. This is great news, especially for data scientists who use the reports to visualize results of more sophisticated analytical processes. Such reports can now bring the analytics much closer to the end users, while preserving the given level of user-friendliness.
In this post I am using a simple modelling example to describe how exactly the integration of Tableau and Python works.
Apart from being a data scientist, I also spend a lot of time on my bike. It is therefore no surprise that I am a huge fan of all kinds of wearable devices. A lot of the time, though, I get quite frustrated with the data processing and data visualization software that major providers of wearable devices offer. That’s why I have been trying to take things into my own hands. Recently I have started to play around with plotting my bike route from Python using the Google Maps API. My novice’s guide to all this follows in the post.
In most data science applications, it comes in very handy to be able to run code on the cloud. Be it a simple demonstration of a functionality that we want to make accessible to a potential client or an end-to-end implementation of, let’s say, a predictive model, the accessibility of cloud-based solutions is a definite asset. However, running code on the cloud does have its pitfalls, which can discourage many from taking advantage of it.
This is why I have decided to share our experience with working on the cloud. In this post, I will specifically give a summary of functionalities that can help to run a Python script on the Ubuntu cloud.
Running a Python script on the cloud can become much more bothersome than development on our local computer, especially if we are using a standard SSH connection. Fortunately, to make our lives easier, there are a couple of functionalities that we can use.
1. argparse (python) – to run the script with various input arguments
2. tmux (unix) – to run sessions without the need to have a permanent SSH connection
3. cron (unix) – to run the scripts with a predefined frequency
4. SimpleHTTPServer (python) – lightweight webserver for providing access to files to users that don’t have access to our cloud
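To make the first item concrete, here is a minimal argparse sketch. The argument names `--input`, `--n-iter` and `--verbose`, as well as the file name `sales.csv`, are hypothetical and only illustrate the pattern of one script that can be started with different inputs.

```python
import argparse


def build_parser():
    """Describe the command-line arguments of a hypothetical daily job."""
    parser = argparse.ArgumentParser(description="Hypothetical scoring job")
    parser.add_argument("--input", default="data.csv", help="path to the input file")
    parser.add_argument("--n-iter", type=int, default=10, help="number of iterations")
    parser.add_argument("--verbose", action="store_true", help="print progress")
    return parser


def main(argv=None):
    # With argv=None, argparse reads sys.argv[1:], which is what happens
    # when the script is started from the shell, from tmux or from cron.
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Processing {args.input} with {args.n_iter} iterations")
    return args


# Passing the arguments explicitly keeps this example self-contained.
args = main(["--input", "sales.csv", "--n-iter", "5", "--verbose"])
```

Saved as a script (say `job.py`, a made-up name) that calls `main()` without arguments, it could be started inside a tmux session created with `tmux new -s job`, which keeps running after the SSH connection drops, or scheduled with a crontab entry such as `0 6 * * * python3 /home/ubuntu/job.py --input sales.csv`, which would run it every day at 06:00. The path in that entry is an assumption.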
Managing a marketing budget sometimes seems to be a mysterious art where decisions are based on the ideas of a few enlightened people who know what’s right. But you should not fool yourself: the times are changing, and so is the way successful marketing is managed. As in other fields, experienced marketing managers use the information hidden in data to help them. With the amount of data and methods available, however, it is often tricky not to get lost and to distinguish the signal from the noise. Typical examples are the marketing attribution models – a tool that is widely used but, in my experience, rarely makes the most of the data.
Typically, in marketing attribution, marketers want to know which part of the business KPI (typically site visits, sales, new customers, new revenues etc.) results from which marketing activity. The mainstream approach is to use attribution models that are often very simplistic – like single source attribution (last click, first click) or fractional attribution (where the contribution is distributed among multiple touch points according to some simple rule). These methods tell marketers how important each marketing channel or campaign is with respect to their KPI. Based on this historical information, the marketing managers decide how to allocate the marketing budget. This approach, however, depends heavily on tedious and demanding data detective work to make sure all client touch points are measured correctly. More importantly, there is no way of knowing whether this work has been done correctly, which of course has a significant impact on the credibility of the attribution models.
Knowing these difficulties, we decided on an alternative approach. We thought: Why should we dig into the individual touch points? Shouldn’t we rather focus on marketing investments and model the ultimate business output? And that is exactly what we did. We took investments into individual marketing channels over time and used time series analysis to predict our client’s business goal (number of sales). On top of that, we also added seasonality, marketing investment of competitors and some other simple parameters.
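To show how simplistic these rules are, here is a minimal sketch of last click, first click and one possible fractional rule (equal shares per touch point). The function name `attribute`, the channel names and the equal-share rule are illustrative assumptions, not a description of any particular tool.

```python
def attribute(path, conversions=1.0, model="last_click"):
    """Split the credit for a conversion among the channels on a customer path.

    path: ordered list of touch points (channel names), first contact first.
    """
    if not path:
        raise ValueError("path must contain at least one touch point")
    credit = {channel: 0.0 for channel in path}
    if model == "last_click":
        credit[path[-1]] += conversions      # all credit to the final touch
    elif model == "first_click":
        credit[path[0]] += conversions       # all credit to the first touch
    elif model == "linear":
        share = conversions / len(path)      # equal share per touch point
        for channel in path:
            credit[channel] += share
    else:
        raise ValueError(f"unknown model: {model}")
    return credit


# A made-up customer journey ending in one sale.
journey = ["display", "search", "email", "search"]
last = attribute(journey, model="last_click")
first = attribute(journey, model="first_click")
linear = attribute(journey, model="linear")
print(last, first, linear)
```

The same journey gives three very different pictures of channel importance, and every one of them is only as good as the recorded list of touch points.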
“Even though we are using data to drive marketing decisions on a daily basis, most of the tools that we have used up until now focus on describing the past. Recently we decided to work together with aLook Analytics to change that. Thanks to their modelling approach to marketing investments we now have accurate information about the expected future developments as well.
Using the interactive Shiny application that is built in Keboola Connection, we want to make informed decisions on the fly, which will help us to reach our sales goals in the most cost efficient way.”
Daniel Gorol, BNP Paribas Personal Finance SA / Cetelem
If there is one thing you learn soon as a data scientist, it is that problem solving gets an extra dimension as the data volume grows. One typical example is building recommendation engines.
A very basic form of a recommendation engine can be built using just simple matrix algebra. But the situation quickly changes when analyzing data about many thousands of customers who are buying or rating several hundred products, which generates large and so-called sparse data sets (where a lot of customer-item combinations do not have any value assigned). There are two main problems to overcome – how to store these large sparse matrices and how to run quick calculations over them.
In the following post, I will describe how to approach this problem in R using the package Matrix. The package allows you to store large matrices in R’s virtual memory, supports standard matrix operations (transpose, matrix multiplication, element-wise multiplication etc.) and also provides a nice toolkit for developing new custom functions needed for recommendation engines as well as for other applications where sparse matrices are used.
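The idea behind sparse storage can be sketched in a few lines of plain Python. This is not the R Matrix package, only an illustration of the principle: store just the customer-item pairs that have a value, and compute the item-by-item co-occurrence matrix (the product of the transposed matrix with itself) by touching only those stored entries. All names and the tiny data set are made up.

```python
from collections import defaultdict


def to_rows(triplets):
    """Store only existing (customer, item, value) entries, grouped by customer."""
    rows = defaultdict(dict)
    for customer, item, value in triplets:
        rows[customer][item] = value
    return rows


def item_cooccurrence(rows):
    """Compute the item-by-item product M^T M using only the stored entries."""
    result = defaultdict(float)
    for items in rows.values():
        for item_a, value_a in items.items():
            for item_b, value_b in items.items():
                result[(item_a, item_b)] += value_a * value_b
    return dict(result)


# Made-up purchases: 1 means the customer bought the item.
purchases = [
    ("anna", "bread", 1), ("anna", "milk", 1),
    ("bob", "milk", 1), ("bob", "eggs", 1),
    ("cara", "milk", 1),
]
cooc = item_cooccurrence(to_rows(purchases))
print(cooc[("bread", "milk")])
```

Pairs of items that were never bought together, such as bread and eggs here, never appear in the result, so both memory use and run time scale with the number of stored entries rather than with the full customer-by-item grid.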
Every student of statistics I know has at least once thought about making easy money by predicting the stock market or by predicting sports results. To be honest, I certainly was not an exception. Intimidated by the uncapped randomness of the stock market, I always leaned more towards the second option – sports betting. Nevertheless, it was not until recently that I asked the guys from the team and we decided to actually try whether we could make an easy living from predicting football results. [For impatient readers: it looks promising but it is not so easy]
In the beginning we knew absolutely nothing about how betting works, where to find the data, or what the standard prediction methods are. So let me walk you through what we have learned.