Python programming is our daily bread. We develop frameworks that are then deployed on our customers’ infrastructure. In some cases there is a strong emphasis on performance, such as a recent project with a recommender engine that had to load an individual recommendation in under 30 ms. Faster computation can therefore be very helpful, especially since using a specific distribution requires no changes to the underlying Python code.
Two weeks ago, Martin discovered that Intel publishes its own distribution of Python, so I decided to have a look. Intel claims that this distribution is faster across the board and shares its benchmarks. Rather than only reproducing the Intel benchmark, I decided to test the distributions with my own benchmark as well, to measure performance on typical operations from a Data Science pipeline.
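Such a benchmark can be as simple as timing a representative operation with the standard library’s timeit module. The sketch below only illustrates the approach, not the actual benchmark from the post: the naive matrix-multiplication workload and the matrix sizes are placeholders.

```python
import timeit

def matmul(a, b):
    """Naive pure-Python matrix multiplication used as a sample workload."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

a = [[float(i + j) for j in range(50)] for i in range(50)]
b = [[float(i - j) for j in range(50)] for i in range(50)]

# Run the workload several times and report the best time per single run.
best = min(timeit.repeat(lambda: matmul(a, b), repeat=3, number=5)) / 5
print(f"naive 50x50 matmul: {best:.4f} s per run")
```

Running the same script under both interpreters then gives directly comparable timings.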
Intel Distribution for Python
Automatic Machine Learning (AML) is a pipeline that enables you to automate the repetitive steps in your Machine Learning (ML) workflow, freeing time for the parts where your expertise adds the most value. What is great is that it is not just a vague idea: there are working packages that build on standard Python ML packages such as scikit-learn.
Anyone familiar with Machine Learning will most probably recall the term grid search in this context, and they will be entirely right to do so. AML is in fact an extension of grid search as implemented in scikit-learn; however, instead of iterating over a predefined set of values and their combinations, it searches for optimal solutions across methods, features, transformations and parameter values. An AML “grid search” therefore does not have to be an exhaustive search over the space of possible configurations. One great application of AML is the TPOT package, which uses, for example, genetic algorithms to mix the individual parameters within a configuration and arrive at an optimal setting.
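For contrast with what AML extends, here is a minimal sketch of a plain exhaustive grid search using only the standard library; the parameter grid and the score function are hypothetical stand-ins for real hyperparameters and cross-validated model performance.

```python
from itertools import product

# Hypothetical parameter grid; in practice these would be model hyperparameters.
grid = {"max_depth": [2, 4, 8], "learning_rate": [0.01, 0.1]}

def score(params):
    # Stand-in for cross-validated model performance.
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)

# Exhaustive search: evaluate every combination of every parameter value.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(candidates, key=score)
print(best)
```

AML replaces this fixed enumeration with a guided search, so not every combination has to be evaluated.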
In this post I will briefly present the basics of AML and then dive into applications of the TPOT package, including its genetic-algorithm optimization of solutions.
The basic concept is very simple: once we receive our raw data, we start with the standard ML pipeline.
There are many situations where we find that our code runs too slowly and we cannot tell why. In such situations the Python cProfile module comes in very handy. The module lets us see how much time the individual steps in our code take, as well as how many times certain functions are called. In the following paragraphs, I will explore its capabilities a bit more.
First, however, let’s remember the quote by Donald Knuth: “premature optimization is the root of all evil (or at least most of it) in programming”. So make sure that you don’t start optimizing before you even have working code! In many cases you will not be able to determine the bottlenecks beforehand and might spend a lot of extra effort in the wrong places.
Profiling with cProfile
The easiest way of using the cProfile module from within a Python script can look as follows:

import cProfile

pr = cProfile.Profile()
pr.enable()
# ... the code we want to profile ...
pr.disable()
pr.print_stats()
In the code we create a Profile object and enable it, then execute the code of interest, disable the profiling and view the results.
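To make this concrete, here is a self-contained sketch that profiles a toy function and formats the results with the companion pstats module; slow_sum is just an illustrative workload.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """A deliberately naive function so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i
    return total

pr = cProfile.Profile()
pr.enable()
for _ in range(100):
    slow_sum(10_000)
pr.disable()

# Sort the collected statistics by cumulative time and print the top entries.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

The output lists each function with its call count and the time spent in it, which is exactly what we need to locate a bottleneck.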
After introducing R capabilities in Tableau 8.1, the new Tableau 10.1 now also comes with support for Python. This is great news, especially for data scientists who use reports to visualize the results of more sophisticated analytical processes. Such reports can now bring the analytics much closer to the end users while preserving the given level of user-friendliness.
In this post I use a simple modelling example to describe exactly how the integration of Tableau and Python works.
Apart from being a data scientist, I also spend a lot of time on my bike. It is therefore no surprise that I am a huge fan of all kinds of wearable devices. A lot of the time, though, I get quite frustrated with the data processing and visualization software that the major providers of wearable devices offer. That’s why I have been trying to take things into my own hands. Recently I started to play around with plotting my bike route from Python using the Google Maps API. My novice’s guide to all this follows in the post.
In most data science applications, it comes in very handy to be able to run code in the cloud. Be it a simple demonstration of functionality that we want to make accessible to a potential client, or an end-to-end implementation of, say, a predictive model, the accessibility of cloud-based solutions is a definite asset. However, running code in the cloud does have its pitfalls, which can discourage many from taking advantage of it.
This is why I have decided to share our experience with working in the cloud. In this post, I will give a summary of the tools that can help run a Python script on an Ubuntu cloud server.
Running a Python script in the cloud can become much more bothersome than development on our local computer, especially if we are using a plain SSH connection. Fortunately, there are a couple of tools that make our lives easier.
1. argparse (python) – to run the script with various input arguments
2. tmux (unix) – to keep sessions running without the need for a permanent SSH connection
3. cron (unix) – to run scripts at a predefined frequency
4. SimpleHTTPServer (python 2) / http.server (python 3) – a lightweight web server for giving access to files to users who don’t have access to our cloud
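As an illustration of the first item, here is a minimal argparse sketch; the argument names are hypothetical. In a real script the explicit list passed to parse_args() would be omitted, so the arguments are read from the command line (e.g. python script.py --input data.csv --iterations 50).

```python
import argparse

# Hypothetical arguments for a training script; the names are illustrative only.
parser = argparse.ArgumentParser(description="Run the model training script.")
parser.add_argument("--input", required=True, help="path to the input data file")
parser.add_argument("--iterations", type=int, default=10,
                    help="number of training iterations")
parser.add_argument("--verbose", action="store_true",
                    help="print progress messages")

args = parser.parse_args(["--input", "data.csv", "--iterations", "50"])
print(args.input, args.iterations, args.verbose)
```

Defaults and type conversion mean the script can be re-run with different settings without touching the code.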
Given the size of data and complexity of processing, many Data Science projects require scalability that can be provided by cloud environments. Clouds combine high performance and cost-efficiency and are therefore very much sought after.
The set-up of a cloud environment can be quite tedious. Fortunately, the needed infrastructure installations are often similar across projects, so tools that automate them can minimize the manual workload. The following blog post covers setting up a cloud environment using Ansible, one such tool.
We will talk about UNIX-based systems because:
- most people already have a working knowledge of Windows-based systems, but are much less knowledgeable about UNIX
- most UNIX-based systems are open source under free licenses, so they are generally cheaper to run
- some handy products (e.g. RStudio Server) are UNIX-only, so situations where basic knowledge is necessary can arise