Coming from a classical IT background in software development, we took a while to arrive at an architecture capable of fulfilling our needs for Data Science projects. Be aware that treating the two in a similar manner is not a good idea, as you might seriously lower the productivity of your Data Science team.
Automatic Machine Learning (AML) is a pipeline that enables you to automate the repetitive steps in your Machine Learning (ML) problems, saving time to focus on the parts where your expertise has higher value. What is great is that it is not just a vague idea: there are applied packages that build on standard Python ML packages such as scikit-learn.
Anyone familiar with Machine Learning will, in this context, most probably recall the term grid search, and they will be entirely right to do so. AML is in fact an extension of grid search as applied in scikit-learn; however, instead of iterating over a predefined set of values and their combinations, it searches for optimal solutions across methods, features, transformations and parameter values. AML “grid search” therefore does not have to be an exhaustive search over the space of possible configurations. One great application of AML is a package called TPOT, which offers, for example, genetic algorithms that mix the individual parameters within a configuration to arrive at a near-optimal setting.
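To make the contrast between an exhaustive grid search and a genetic search more concrete, here is a minimal toy sketch using only the Python standard library. It is not TPOT's API or its actual algorithm: the search space, the score function (a stand-in for a cross-validated model score) and all parameter names are hypothetical and chosen purely for illustration.

```python
import itertools
import random

# Hypothetical search space: every name and value is illustrative only.
SPACE = {
    "model": ["tree", "forest", "linear"],
    "scaler": ["none", "standard", "minmax"],
    "n_features": [5, 10, 20, 40],
    "depth": [2, 4, 8, 16],
}


def score(config):
    """Toy stand-in for a cross-validated model score (higher is better)."""
    s = {"tree": 0.1, "forest": 0.3, "linear": 0.2}[config["model"]]
    s += {"none": 0.0, "standard": 0.2, "minmax": 0.1}[config["scaler"]]
    s -= abs(config["n_features"] - 20) / 100
    s -= abs(config["depth"] - 8) / 100
    return s


def grid_search(space):
    """Exhaustive search: score every combination of values."""
    keys = list(space)
    configs = [dict(zip(keys, values))
               for values in itertools.product(*space.values())]
    return max(configs, key=score), len(configs)


def genetic_search(space, population_size=8, generations=5,
                   mutation_rate=0.2, seed=0):
    """Simple genetic search: keep the best half, breed and mutate the rest."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    keys = list(space)
    population = [{k: rng.choice(space[k]) for k in keys}
                  for _ in range(population_size)]
    best_score, best_config, evaluations = float("-inf"), None, 0

    for _ in range(generations):
        # Score the whole population (surviving parents are re-scored too).
        scored = [(score(c), c) for c in population]
        evaluations += len(scored)
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_config = scored[0]

        # Selection: the better half survives unchanged.
        parents = [c for _, c in scored[: population_size // 2]]

        # Crossover and mutation fill the rest of the next generation.
        children = []
        while len(children) < population_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = {k: rng.choice([a[k], b[k]]) for k in keys}
            for k in keys:
                if rng.random() < mutation_rate:
                    child[k] = rng.choice(space[k])
            children.append(child)
        population = parents + children

    return best_config, evaluations


grid_best, grid_evals = grid_search(SPACE)
ga_best, ga_evals = genetic_search(SPACE)
print("grid search:   ", grid_evals, "evaluations ->", grid_best)
print("genetic search:", ga_evals, "evaluations ->", ga_best)
```

With the default settings, the grid search scores all 144 combinations, while the genetic search calls the score function only 40 times. The genetic search is not guaranteed to find the best configuration; it trades that guarantee for fewer evaluations, which matters when each evaluation means training a model.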
In this post I will briefly present some basics of AML and then dive into applications using the TPOT package, including its genetic algorithm for optimizing solutions.
The basic concept is very simple: once we receive our raw data, we start with the standard ML pipeline.
There are many situations where we find that our code runs too slowly and the reason is not apparent. In such situations the Python cProfile module comes in very handy. The module enables us to see how much time individual steps in our code take, as well as how many times certain functions are called. In the following paragraphs, I will explore its capabilities a bit more.
However, first let’s remember the quote by Donald Knuth: “premature optimization is the root of all evil (or at least most of it) in programming”. So make sure that you don’t start optimizing before you even have working code! In many cases you will not be able to determine the bottlenecks beforehand and might spend a lot of extra effort in the wrong places.
Profiling with cProfile
The easiest way of using the cProfile module from within a Python script can look as follows:
import cProfile
pr = cProfile.Profile()
In the code we create a Profile object and enable it, then execute the code that is of interest to us, disable the profiling and view the results.
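The steps described above (create, enable, run, disable, view) can be sketched in full as follows. The function slow_sum and the helper profile_call are hypothetical names introduced only for this illustration; the cProfile and pstats calls are from the standard library.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive loop so the profiler has something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    pr = cProfile.Profile()
    pr.enable()            # start collecting statistics
    result = func(*args)   # the code that is of interest to us
    pr.disable()           # stop collecting

    # Format the statistics, sorted by cumulative time, top 10 entries.
    stream = io.StringIO()
    stats = pstats.Stats(pr, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

The report lists, for each function, the number of calls together with the total and cumulative time spent in it, which is usually enough to spot the first bottleneck.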
Following the introduction of R capabilities in Tableau 8.1, the new Tableau 10.1 now also comes with support for Python. This is great news, especially for data scientists who use the reports to visualize the results of more sophisticated analytical processes. Such reports can now bring the analytics much closer to the end users while preserving the given level of user-friendliness.
In this post I use a simple modelling example to describe exactly how the integration of Tableau and Python works.
In most data science applications, it comes in very handy to be able to run code in the cloud. Be it a simple demonstration of functionality that we want to make accessible to a potential client or an end-to-end implementation of, let’s say, a predictive model, the accessibility of cloud-based solutions is a definite asset. However, running code in the cloud does have its pitfalls, which can discourage many from taking advantage of it.
This is why I have decided to share our experience of working in the cloud. In this post, I will specifically summarize functionalities that can help to run a Python script on an Ubuntu machine in the cloud.
Running a Python script in the cloud can become much more bothersome than developing on our local computer, especially if we are using a standard SSH connection. Fortunately, to make our lives easier, there are several functionalities that we can use.
1. argparse (python) – to run the script with various input arguments
2. tmux (unix) – to run sessions without the need to have a permanent SSH connection
3. cron (unix) – to run the scripts with a predefined frequency
4. SimpleHTTPServer (python; http.server in Python 3) – lightweight web server for providing access to files to users who don’t have access to our cloud
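As an illustration of the first item in the list, here is a minimal argparse sketch. The argument names (--input, --iterations, --verbose) and their defaults are hypothetical; a real script would define whatever inputs it needs.

```python
import argparse


def build_parser():
    """Define the command-line interface of a hypothetical script."""
    parser = argparse.ArgumentParser(
        description="Hypothetical data processing script run in the cloud."
    )
    parser.add_argument("--input", default="data.csv",
                        help="path to the input file")
    parser.add_argument("--iterations", type=int, default=10,
                        help="number of iterations to run")
    parser.add_argument("--verbose", action="store_true",
                        help="print progress messages")
    return parser


def main(argv=None):
    # argv=None makes argparse read the real command line (sys.argv[1:]).
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Reading {args.input}, running {args.iterations} iterations")
    return args


# Passing a list simulates the command line:
#   python script.py --input sales.csv --iterations 5
args = main(["--input", "sales.csv", "--iterations", "5"])
print(args.input, args.iterations, args.verbose)
```

In an actual script, main() would be called without arguments so that the values come from the command line. This makes it easy to start the same script with different settings from a tmux session or a cron entry.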
Once we have set up our cloud, the next logical step is to create some form of data storage. One possible choice is a MySQL database.
In this article, we will have a look at two ways to secure your MySQL instance in the cloud while still allowing access from the outside. As a result, you will be able to access the database in the cloud using the standard tool MySQL Workbench.
Given the size of data and complexity of processing, many Data Science projects require scalability that can be provided by cloud environments. Clouds combine high performance and cost-efficiency and are therefore very much sought after.
Setting up a cloud environment can be quite tedious. Fortunately, the required infrastructure installations are often similar across projects, so tools that automate them can be used to minimize the manual workload. This blog post covers setting up a cloud environment using Ansible, which is one such tool.
We will talk about UNIX-based systems because:
- most people already have a working knowledge of Windows-based systems, but are much less knowledgeable about UNIX
- most UNIX-based systems are open-source under free licenses, so they are generally cheaper to run
- some handy products (e.g. RStudio Server) are UNIX-only, so situations can arise where basic knowledge is necessary
The following series of blog posts will cover some essentials in terms of tools used by data scientists. The individual posts should serve as a guide to setting up a proper environment for all types of data science tasks. The tools will therefore be described from a technical perspective without paying attention to individual libraries or algorithms.
The posts will cover tools that are either not that common or a little bit tricky to set up. The aim is to provide information about interesting new tools that help to broaden the analytical skills of anyone who deals with data analysis.