For the last few years everyone has been talking about the importance of advanced analytics in manufacturing as the next step after lean and Six Sigma programs, and about the great potential it can unleash. So when we faced our first such project, we were naturally very excited and curious about what could be done. The outcome exceeded our expectations, both in terms of data modelling and, more importantly, business results for our client.
It’s been almost 3 years since I started aLook, first as a one-man show, later joined by friends and family. During this time we have worked on more than 60 projects with many partners for clients all over the world. Judging by the returning customers and the partners recommending us to their clients, it seems we mostly did a good job. And now we’re hiring!
Trying to motivate the team to work during our first hackathon. 1994 Sid Meier’s Colonization on a phone shared via Apple TV is hard to beat…
Coming from a classical IT background in software development, it took us a while to arrive at an architecture capable of fulfilling our needs for Data Science projects. Be aware that treating Data Science projects in the same manner as classical software development is not a good idea, as you might seriously lower the productivity of your Data Science team.
For businesses where clients generate revenue over time, knowing who your most valuable clients will be in the future is very handy information, especially if you want to optimise your service models.
Automatic Machine Learning (AML) is a pipeline that enables you to automate the repetitive steps in your Machine Learning (ML) problems and so save time to focus on the parts where your expertise has higher value. What is great is that it is not just a vague idea: there are applied packages that build on standard Python ML packages such as scikit-learn.
Anyone familiar with Machine Learning will, in this context, most probably recall the term grid search, and they will be entirely right to do so. AML is in fact an extension of grid search as applied in scikit-learn. However, instead of iterating over a predefined set of values and their combinations, it searches for optimal solutions across methods, features, transformations and parameter values. AML “grid search” therefore does not have to be an exhaustive search over the space of possible configurations. One great application of AML is a package called TPOT, which offers, among other things, genetic algorithms that mix the individual parameters within a configuration to arrive at a near-optimal setting.
In this post I will briefly present some basics of AML and then dive into applications using the TPOT package, including its genetic-algorithm optimization.
The basic concept is very simple: once we receive our raw data, we start with the standard ML pipeline.
For our client, an international start-up company (South Africa, Great Britain, Switzerland…), we are currently looking for (1) a behavioral data scientist and (2) a client delivery analyst.
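To make the contrast between an exhaustive grid and a genetic search more concrete, here is a minimal sketch in plain Python. It does not use TPOT or its API. The configuration space `SPACE`, the toy `score` function (a stand-in for a cross-validated score) and the helper names are all hypothetical and chosen only for illustration.

```python
import itertools
import random

# Hypothetical configuration space: one choice per pipeline step.
SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["tree", "forest", "logreg"],
    "depth": [2, 4, 8, 16],
}


def score(config):
    """Toy stand-in for a cross-validated score (higher is better)."""
    value = {"none": 0.0, "standard": 0.2, "minmax": 0.1}[config["scaler"]]
    value += {"tree": 0.1, "forest": 0.3, "logreg": 0.2}[config["model"]]
    return value - abs(config["depth"] - 8) / 100


def grid_search():
    """Exhaustive search: evaluate every combination in SPACE."""
    keys = list(SPACE)
    configs = [dict(zip(keys, values))
               for values in itertools.product(*SPACE.values())]
    return max(configs, key=score)


def genetic_search(generations=10, population_size=8, seed=0):
    """Evolve a small population instead of trying every combination."""
    rng = random.Random(seed)
    population = [{k: rng.choice(v) for k, v in SPACE.items()}
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the better half as parents.
        population.sort(key=score, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            # Crossover: take each setting from one of the two parents.
            child = {k: rng.choice([a[k], b[k]]) for k in SPACE}
            # Mutation: occasionally re-draw one setting at random.
            if rng.random() < 0.3:
                key = rng.choice(list(SPACE))
                child[key] = rng.choice(SPACE[key])
            children.append(child)
        population = parents + children
    return max(population, key=score)


best_grid = grid_search()
best_genetic = genetic_search()
print("grid:   ", best_grid, round(score(best_grid), 3))
print("genetic:", best_genetic, round(score(best_genetic), 3))
```

The grid evaluates all 36 combinations, while the genetic search only evaluates the configurations in its population, so it is not guaranteed to find the best one. That trade-off is what makes this kind of search attractive when the configuration space is too large to enumerate.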
Table Mountain, Cape Town (SA)
The client you would be working for is a company that provides big corporations with employee behavioral analytics. Our team is responsible for building and maintaining their analytical platform as well as for supporting the internal team of behavioral scientists in developing measurements.
The positions we are offering are demanding but come with unique advantages. Firstly, we don’t mind when or where you work as long as you deliver what you are supposed to. Secondly, you will have a huge opportunity to grow in data science and related fields, supported by our experienced team. And thirdly, you will be in direct contact with an international start-up environment.
The Monte Carlo method is a handy tool for turning problems of a probabilistic nature into computations on random samples, relying on the law of large numbers. Imagine that you want to assess the future value of your investments and see what the worst-case scenario is for a given level of probability. Or that you want to plan the production of your factory, given the past daily performance of individual workers, to ensure that you will meet a tough delivery plan with high enough probability. For these and many more real-life tasks you can use the Monte Carlo method.
Monte Carlo approximation of Pi
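As a minimal sketch of the idea, the classic approximation of Pi draws random points in the unit square and counts the share that falls inside the quarter circle of radius 1. That share approaches Pi/4 as the number of samples grows. The function name `estimate_pi` and the fixed seed are illustrative choices only.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate Pi from the share of random points inside a quarter circle."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:           # inside the quarter circle
            inside += 1
    # Quarter-circle area is Pi/4, so scale the observed share by 4.
    return 4.0 * inside / n_samples


for n in (1_000, 100_000):
    print(n, estimate_pi(n))
```

The estimate gets closer to Pi as the sample size increases, but only slowly: the error shrinks roughly with the square root of the number of samples.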
Clustering is a hugely important step of exploratory data analysis and has plenty of great applications. Typically, a clustering technique will identify different groups of observations in your data. For example, if you need to perform market segmentation, cluster analysis will help you label each segment so that you can evaluate each segment’s potential and target the attractive ones. Your marketing program and positioning strategy therefore rely heavily on this fundamental step: grouping your observations and creating meaningful segments. There are many more use cases in computer science, biology, medicine and social science. However, it often turns out to be quite difficult to define properly what a well-separated cluster looks like.
Today, I will discuss some technical aspects of hierarchical cluster analysis, namely Agglomerative Clustering. One great advantage of this hierarchical approach is the fully automatic selection of the appropriate number of clusters. This matters because in a genuine unsupervised learning problem we have no idea how many clusters we should look for! Also, in my view, this clever clustering technique resolves some of the ambiguity in the vague definition of a cluster and is thus well suited to the automatic detection of such structures. At the same time, the agglomerative clustering process employs standard metrics for describing clustering quality, so it is fairly easy to observe what is going on.
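The bottom-up merging behind agglomerative clustering can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not necessarily the procedure used in the post: it uses single linkage on one-dimensional points, and it stops merging once the closest pair of clusters is further apart than a hypothetical `threshold`, so the number of clusters is a result of the procedure instead of an input.

```python
from itertools import combinations


def gap(a, b):
    """Single-linkage distance: the smallest distance between members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(points, threshold):
    """Merge the two closest clusters until the closest gap exceeds threshold."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > 1:
        # Find the pair of clusters (i < j) with the smallest gap.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        if gap(clusters[i], clusters[j]) > threshold:
            break  # remaining clusters are well separated
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


points = [1.0, 1.2, 1.1, 5.0, 5.3, 9.9]
clusters = agglomerate(points, threshold=1.0)
print(len(clusters), clusters)
```

With these points the procedure ends up with three groups without being told to look for three. The threshold still has to be chosen, which is why quality metrics are useful for judging where to stop merging.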
When building predictive models, you obviously need to pay close attention to their performance. That is essentially what it is all about: getting the prediction right. Especially if you are working for paying clients, you need to prove that the performance of your models is good enough for their business. Fortunately, there is a whole range of statistical metrics and tools at hand for assessing a model’s performance.
In my experience, performance metrics for classification tasks (especially binary ones), such as the confusion matrix and the metrics derived from it, are naturally understood by almost anyone. The situation is a bit more problematic for regression and time series. For example, when you want to predict future sales or derive income from other parameters, you need to show how close your prediction is to the observed reality.
I will not write about (adjusted) R-squared, the F-test and other statistical measures. Instead, I want to focus on performance metrics that represent a more intuitive concept of performance, as I believe they can help you sell your work much better. These are:
- mean absolute error
- median absolute deviation
- root mean squared error
- mean absolute percentage error
- mean percentage error
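The metrics listed above can be sketched with the standard library alone. Some assumptions in this sketch: the error is defined as actual minus predicted (the sign convention for the percentage errors varies between sources), "median absolute deviation" is read here as the median of the absolute errors, and the percentage metrics require all actual values to be non-zero. The function name `regression_errors` is illustrative.

```python
import math
import statistics


def regression_errors(actual, predicted):
    """Return intuitive error metrics for paired actual/predicted values."""
    # Assumed sign convention: error = actual - predicted.
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    squared_errors = [e * e for e in errors]
    # Percentage errors are undefined when an actual value is zero.
    pct_errors = [e / a for e, a in zip(errors, actual)]
    return {
        "mean_absolute_error": statistics.mean(abs_errors),
        # Assumption: median of absolute errors.
        "median_absolute_error": statistics.median(abs_errors),
        "root_mean_squared_error": math.sqrt(statistics.mean(squared_errors)),
        "mean_absolute_percentage_error":
            100 * statistics.mean([abs(pe) for pe in pct_errors]),
        "mean_percentage_error": 100 * statistics.mean(pct_errors),
    }


metrics = regression_errors(actual=[100, 200, 400], predicted=[110, 180, 400])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this small example the mean percentage error is zero even though the mean absolute percentage error is not, because over- and under-predictions cancel out. That is why the mean percentage error is better read as a measure of bias than of accuracy.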
There are many situations where we find that our code runs too slowly and the reason is not apparent. For such situations the Python cProfile module comes in very handy. The module lets us see how much time the individual steps in our code take, as well as how many times certain functions are called. In the following paragraphs, I will explore its capabilities a bit more.
First, however, let’s remember the quote by Donald Knuth: “premature optimization is the root of all evil (or at least most of it) in programming”. So make sure that you don’t start optimizing before you even have working code! In many cases you will not be able to determine the bottlenecks beforehand and might spend a lot of extra effort in the wrong places.
Profiling with cProfile
The easiest way of using the cProfile module from within a Python script can look as follows:
pr = cProfile.Profile()
In the code we create a Profile object and enable it, then execute the code that is of interest to us, disable the profiling and view the results.
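The steps described above can be sketched as a complete script. The function `slow_sum` is a hypothetical placeholder for the code of interest, and sorting by cumulative time and limiting the report to ten rows are illustrative choices.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload standing in for the code we want to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


pr = cProfile.Profile()      # create the Profile object
pr.enable()                  # start collecting profiling data
result = slow_sum(100_000)   # the code that is of interest to us
pr.disable()                 # stop collecting

# Format the collected statistics, slowest cumulative time first.
stream = io.StringIO()
stats = pstats.Stats(pr, stream=stream).sort_stats("cumulative")
stats.print_stats(10)        # limit the report to the top 10 entries
report = stream.getvalue()
print(report)
```

The report lists, for each function, the number of calls together with the total and cumulative time spent in it, which is usually enough to spot where the time goes.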