Automate your Machine Learning in Python – TPOT and Genetic Algorithms

Automated Machine Learning (AML) is a pipeline that enables you to automate the repetitive steps in your Machine Learning (ML) problems, saving time to focus on the parts where your expertise has higher value. What is great is that it is not just a vague idea: there are working packages that build on standard Python ML packages such as scikit-learn.

Anyone familiar with Machine Learning will most probably recall the term grid search in this context, and they will be entirely right to do so. AML is in fact an extension of grid search as applied in scikit-learn; however, instead of iterating over a predefined set of values and their combinations, it searches for optimal solutions across methods, features, transformations and parameter values. The AML “grid search” therefore does not have to be an exhaustive search over the space of possible configurations. One great application of AML is the TPOT package, which offers, among other techniques, genetic algorithms to mix the individual parameters within a configuration and arrive at an optimal setting.
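
For contrast, a classic scikit-learn grid search looks like this (a minimal sketch, assuming training data X_train, y_train is already prepared):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# exhaustive search over a predefined grid for one fixed model
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
# search.fit(X_train, y_train) tries every combination in the grid; AML tools
# such as TPOT additionally search across models and pre-processing steps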

In this post I will briefly present some basics of AML and then dive into applications using the TPOT package, including its genetic-algorithm-based optimization.

Basic concepts

The basic concept is very simple: once we receive our raw data, we start with the standard ML pipeline.

Within the pipeline there are some steps that are specific to a given dataset/problem, most prominently data cleaning, whose automation remains difficult. However, later in the process we get to tasks such as:

  • feature pre-processing
  • feature selection
  • model selection

What these tasks have in common is that within each of them we use a set of approaches whose performance we then evaluate (feature importance, model performance, …). Since we have clearly defined metrics for the individual steps, we can automate the process. This is where the AML search for optimal solutions across methods, features, transformations and parameter values comes in.
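
To make this concrete, here is a hypothetical candidate configuration expressed as a scikit-learn Pipeline; an AML tool effectively searches over many such pipelines and keeps the best-scoring one:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# one candidate configuration covering the three steps above
candidate = Pipeline([
    ('preprocess', StandardScaler()),         # feature pre-processing
    ('select', SelectKBest(f_classif, k=2)),  # feature selection
    ('model', LogisticRegression()),          # model selection
])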

Among the packages and tools available for Automated Machine Learning are:

  • TPOT
  • Auto-Sklearn
  • Auto-Weka
  • Machine-JS
  • DataRobot

Advantages

Apart from the obvious time savings, there are other advantages. One application that this blog post by Airbnb mentions is the ability to easily create benchmarks (they mention others as well). This enables us to judge the performance of existing ML models and place them in context against the results of other models.

Another benefit is standardizing the basic methods used for any ML task. Instead of classic documentation or a guide, we are able to prepare a configuration that can be used directly by the person working on the problem.

Furthermore, it enables us to run quick prototyping tasks, e.g. to give clients a better estimate of what performance basic models can achieve without having to implement them. The results are just one configuration and execution away.

AML in the TPOT package

We will have a look at some very basic examples of the TPOT package. The most basic one is creating and fitting a simple classifier or regressor in TPOT (the snippet assumes training and test data have already been prepared).

from tpot import TPOTClassifier, TPOTRegressor

# create classifier instance
tpot = TPOTClassifier()
# fit instance
tpot.fit(X_train, y_train)

# create regressor instance
tpot = TPOTRegressor()
# fit instance
tpot.fit(X_train, y_train)

# evaluate performance on test data
tpot.score(X_test, y_test)

# export the script used to create the best model
tpot.export('tpot_exported_pipeline.py')

The last line exports the best pipeline found as standard scikit-learn code, which enables us to further modify/optimize the model starting from the current best solution.
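
For illustration, the exported file contains plain scikit-learn code along these lines (a hypothetical sketch; the actual contents depend on the TPOT version and on the best pipeline found):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# hypothetical best pipeline reconstructed as ordinary scikit-learn code,
# ready to be modified and tuned by hand (data splits assumed as above)
exported_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=12)
)
exported_pipeline.fit(X_train, y_train)
results = exported_pipeline.predict(X_test)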

Genetic algorithm and its parameters

For the TPOT classifier and regressor there is a set of available parameters, shown in this constructor signature from the TPOT source:

class TPOTBase(BaseEstimator):

    def __init__(self, generations=100, population_size=100, offspring_size=None,
                 mutation_rate=0.9, crossover_rate=0.1,
                 scoring=None, cv=5, n_jobs=1,
                 max_time_mins=None, max_eval_time_mins=5,
                 random_state=None, config_dict=None, warm_start=False,
                 verbosity=0, disable_update_check=False):

Many parameters quite logically coincide with parameters from scikit-learn, so we will not explore them any further. Rather, we will have a look at the parameters related to the genetic algorithm used within TPOT (for a detailed list and usage of the parameters refer to the documentation).

Genetic algorithms are based on the idea of creating an initial population, then iteratively combining its members to create offspring based on the “traits/parameters” of their parents. At the end of each iteration we run a fitness test and keep only the fittest individuals from the pool of original and newly created ones. Offspring can therefore replace existing individuals whenever they perform better, which means the best performance in the population increases, or at least stays the same, with each iteration. A toy sketch following the parameter list below illustrates this loop.

The genetic algorithm parameters are:

      • generations – determines the number of iterations where offspring (new individuals) are created
      • population_size – the initial number of individuals to create (these serve for creating offspring)
      • offspring_size – the number of new individuals created in each generation
      • mutation_rate – the rate at which random changes to parameter values occur (a way of introducing values that might not have been available in the initial population)
      • crossover_rate – the percentage of individuals in each generation that are created by recombining two parents
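
The following toy sketch (not TPOT's implementation) shows how these parameters interact; here an individual is just a list of numbers and its fitness is their sum:

import random

def fitness(individual):
    return sum(individual)

def crossover(parent_a, parent_b):
    # combine the "traits" of two parents at a random cut point
    cut = random.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def mutate(individual, rate):
    # random changes to values, introducing traits absent from the initial population
    return [g + random.gauss(0, 1) if random.random() < rate else g
            for g in individual]

generations, population_size, offspring_size = 10, 20, 20
mutation_rate, crossover_rate = 0.1, 0.9
population = [[random.random() for _ in range(5)] for _ in range(population_size)]

for generation in range(generations):
    offspring = []
    for _ in range(offspring_size):
        parent_a, parent_b = random.sample(population, 2)
        # with probability crossover_rate breed two parents, otherwise clone one
        child = crossover(parent_a, parent_b) if random.random() < crossover_rate else list(parent_a)
        offspring.append(mutate(child, mutation_rate))
    # fitness test: keep the fittest of the original + newly created individuals
    population = sorted(population + offspring, key=fitness, reverse=True)[:population_size]
    print(generation, fitness(population[0]))  # the best score never decreases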

Using this iterative process we select the optimal configuration. Just be prepared that results of genetic algorithms in general depend on the initial state: the randomly generated initial population impacts the output, and re-running the same setting can therefore in some cases lead to different results.
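
If repeatability matters, the random_state parameter from the signature above can be fixed so that repeated runs start from the same initial population:

from tpot import TPOTClassifier

# fixing the seed makes the randomly generated initial population reproducible
tpot = TPOTClassifier(generations=5, population_size=50, random_state=42)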

Custom setting dictionaries

Instead of the standard values used for the underlying scikit-learn functionality, you can define your own dictionary, which enables you to incorporate the domain knowledge or parameter requirements you have (e.g. excluding ML algorithms that, in your experience, do not work for the given application). An example of such a dictionary looks like this:

classifier_config_dict = {

    # Classifiers
    'sklearn.naive_bayes.GaussianNB': {
    },

    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },

    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },

    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ["gini", "entropy"],
        'max_depth': range(1, 11),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21)
    }
}
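
The dictionary is then passed in via the config_dict parameter from the constructor signature shown earlier (a minimal sketch, assuming training data as before):

from tpot import TPOTClassifier

# restrict the search to the models and parameter ranges defined above
tpot = TPOTClassifier(config_dict=classifier_config_dict, generations=5,
                      population_size=20, verbosity=2)
tpot.fit(X_train, y_train)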

Iris example

Here we have a very basic example with the iris dataset provided within scikit-learn, together with the output of TPOT.

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.8, test_size=0.2)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

############ OUTPUT #########################
Optimization Progress:  33%|███▎      | 100/300 [00:25<02:36,  1.28pipeline/s]
Generation 1 - Current best internal CV score: 0.9833333333333334
Optimization Progress:  50%|█████     | 150/300 [00:39<00:58,  2.57pipeline/s]
Generation 2 - Current best internal CV score: 0.9833333333333334
Optimization Progress:  67%|██████▋   | 200/300 [00:52<00:28,  3.48pipeline/s]
Generation 3 - Current best internal CV score: 0.9833333333333334
Optimization Progress:  83%|████████▎ | 250/300 [01:05<00:10,  4.66pipeline/s]
Generation 4 - Current best internal CV score: 0.9833333333333334
                                                                              
Generation 5 - Current best internal CV score: 0.9916666666666668

Best pipeline: KNeighborsClassifier(Nystroem(input_matrix, Nystroem__gamma=0.4, Nystroem__kernel=linear, Nystroem__n_components=DEFAULT), KNeighborsClassifier__n_neighbors=12, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=DEFAULT)
0.933333333333

For more complex examples refer to the TPOT tutorials.

Conclusion

Automated Machine Learning enables us to save time and focus on the parts of the work where we have to be creative. It reduces repetitive tasks, which for people in this field are not as interesting as exploring new ways of solving a problem. Try incorporating these automated methods into your own workflows; just make sure you take computation time into consideration, as searching over many combinations results in compute-intensive tasks.
