Data Scientist's toolbox: Cloud set-up using Ansible

Given the size of data and complexity of processing, many Data Science projects require scalability that can be provided by cloud environments. Clouds combine high performance and cost-efficiency and are therefore very much sought after. 

The set-up of a cloud environment can be quite tedious. Fortunately, the required infrastructure installations are often similar across projects, so tools that automate them can be used to minimize the manual workload. This blog post covers setting up a cloud environment using Ansible, which is one such tool.

We will focus on UNIX-based systems because:

  • most people already have a working knowledge of Windows-based systems, but are much less familiar with UNIX
  • most UNIX-based systems are open source under free licenses, so they are generally cheaper to run
  • some handy products (e.g. RStudio Server) are only available for UNIX-based systems, so a situation where basic knowledge is necessary can arise

Choosing a provider and getting access to your instance

The first step is choosing a provider of cloud services. Most providers are similar and differ mainly in price and in the interface through which you control your instance. However, if you need additional tools that only a certain provider has (e.g. Redshift), then your choice is limited. Furthermore, some cloud providers don’t offer Windows-based systems (e.g. DigitalOcean), and running a virtual instance of Windows on a UNIX server is inefficient and can have limitations, such as being able to install only 32-bit systems. In that case you are limited to 4GB of RAM, which is not enough for most cloud computing tasks (if it were, you would probably be able to do most of the modelling on a desktop or laptop with similar RAM).

In order to choose the right cloud provider, you can do your own research given your required functionalities, or you can use some ready-made search engines that are available online, such as this one.

When choosing between Windows and UNIX-based systems, one advantage of UNIX to bear in mind is its command line, which is much more powerful than the command line tool on Windows. The reason for this is the modularity of the UNIX command line. Individual utilities serve as modules, which can then be combined using “|” (a pipe). A simple example is combining the ps command, which lists running processes, with grep, which is a utility for searching text. By combining the two we can find a certain process by issuing the ps -A | grep mysql command. Apart from controlling the environment, the UNIX command line can also be used to do some data pre-processing.

After setting up your cloud instance, the second step is to set up SSH access to it (enable access on port 22, which is usually enabled by default; each cloud provider has guides on how to do so). While the console provided by the cloud provider may be sufficient, for some tasks it is easier to control the instance from an SSH client such as PuTTY (on Windows-based systems).

PuTTY enables you to connect to your instance using either a password or an SSH key. It also includes a key generator, “puttygen.exe”, with which you can generate keys or convert them from one format to another (some providers, such as AWS, provide already generated keys, while with others you have to upload a key you generated yourself).

Once you have set up SSH access to your instance, you can start installing the required tools. This is the point where Ansible comes in very handy.

Infrastructure Installation using Ansible

Ansible is a program that enables you to automate infrastructure installation (similar tools include SaltStack). So what does that mean? Basically, Ansible is a tool in which you create scripts (called playbooks in Ansible) that Ansible then executes while controlling the installation process. This means that it takes care of many things around the installation that you would otherwise have to do manually.

NOTE: many cloud providers have predefined images that may already contain the tools you require. This is an easier alternative to Ansible, provided you find an image that suits your needs.

Before working with Ansible I used standard text files with individual commands and then copied them line-by-line and executed via SSH. This has many disadvantages, such as being more time-intensive, having to control the whole installation process by yourself and not being able to install multiple machines at once.

With Ansible you have a set of standardized modules with predefined parameters that can be used for individual installation steps, and you can simply re-run the whole process. Furthermore, you can use the “command” module to run “raw” UNIX commands if the provided modules are insufficient.

Prerequisites to be able to run Ansible on Ubuntu:

  1. Having Python 2.6 or higher installed on your cloud instance (this is installed by default on most recent UNIX distributions)
  2. Run the following commands:
    • sudo apt-get install software-properties-common
    • sudo apt-add-repository ppa:ansible/ansible
    • sudo apt-get update
    • sudo apt-get install ansible

How Ansible scripts are constructed

Ansible scripts are written in YAML (file extension .yml), a human-readable data serialization format. As in Python, the structure is controlled by indentation. This means you have a standardized syntax for writing the scripts.

A basic command would look like:

- name: "Install MySQL server"
  apt: name=mysql-server

Here, name is the description of the given step; it also shows up in the status output while the Ansible script is executing, so choose names that make sense. apt is an Ansible module for installing packages. In this case we install the package identified by mysql-server, which is equivalent to running apt-get install mysql-server on the command line.

An Ansible script consists of a number of such commands, which are executed in order when the script is run.
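
As a rough sketch of how two such commands look one after the other, the same apt module can also be written with its parameters spelled out as indented keys. The package name comes from the example above; the cache settings are illustrative assumptions and not part of the original set-up.

```yaml
# Sketch of a short task list. The package name is taken from the text above;
# the cache settings are illustrative assumptions.
- name: "Refresh the package index"
  apt:
    update_cache: yes
    cache_valid_time: 3600   # skip the refresh if the cache is under an hour old

- name: "Install MySQL server"
  apt:
    name: mysql-server
    state: present           # install only if the package is not already there
```

The key=value form and the indented form are interchangeable; the indented form tends to be easier to read once a module takes more than one or two parameters.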

Constructing more complicated scripts

When we want to install multiple tools, we want to keep them in separate blocks, so that we can easily include them in or exclude them from an installation. These blocks are called roles in Ansible. So let’s say we want to install an environment with a MySQL instance, RStudio Server and Redis. We will have a main script (the one we run to start the process). For simplicity let’s call it install_main.yml. This script would then contain a part defined as roles. In our case it would look as follows:


roles:
  - mysql
  - r_server
  - redis

If we want to exclude some of the roles in our standard script, we can simply comment out the given line with “#” and thus control which tools are installed without having to change any of our other scripts.
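
A minimal sketch of what the complete main script might look like is shown below. The group name local and the become setting are assumptions made for illustration; the role names have to match the folder names under roles.

```yaml
# install_main.yml (sketch)
- hosts: local        # assumed inventory group name
  become: yes         # assumption: run the tasks with sudo
  roles:
    - mysql
    - r_server
    # - redis         # commented out, so Redis is skipped on this run
```

Removing the “#” in front of the last role brings Redis back into the installation without touching the role itself.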

The structure of our Ansible folder would look like this:

  • install_main.yml
  • roles
    • mysql
    • r_server
    • redis

Furthermore, each of the roles (mysql, r_server, redis) might consist of the following folders:

  • defaults
  • handlers
  • meta
  • tasks
  • templates

Each of these folders, with the exception of templates, is expected to contain a main.yml file (which can in turn include other files).

Playbook files

The files mentioned above are so-called playbook files. Each group is responsible for different actions.

  • defaults: contains default values for individual set-ups, for example the port that a service should run on, the users you want to create (username, password), access rights, etc.
  • handlers: actions that are not part of the installation itself and run only when a task notifies them, for example starting, stopping or restarting a service
  • meta: contains the dependencies of the role
  • tasks: the individual commands that will be executed, for example the installation of individual packages
  • vars: defines variables that can be used in tasks, templates and conditional statements
  • templates: contains configuration files. For example, once you have configured a MySQL instance to your liking, you can copy its configuration file (my.cnf), save it as a template in the format Ansible expects (.j2 extension) and include it in the installation
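
To make the division of labour more concrete, here is a sketch of how three of these files could fit together for the mysql role. The variable name mysql_port and the destination path are assumptions for illustration; the three parts are separate files, shown in one listing only for brevity.

```yaml
# roles/mysql/defaults/main.yml
---
mysql_port: 3306              # hypothetical variable name

# roles/mysql/tasks/main.yml
---
- name: "Install MySQL server"
  apt:
    name: mysql-server
    state: present

- name: "Deploy MySQL configuration"
  template:
    src: my.cnf.j2            # looked up in roles/mysql/templates/
    dest: /etc/mysql/my.cnf   # assumed location of the configuration file
  notify: restart mysql       # triggers the handler only if the file changed

# roles/mysql/handlers/main.yml
---
- name: restart mysql
  service:
    name: mysql
    state: restarted
```

Inside my.cnf.j2 the default value could then be referenced as {{ mysql_port }}, so changing the port only requires editing defaults/main.yml.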

Copying Ansible files to the cloud and modifying them

You can either use the SSH command line to copy the files, or, if you are used to the Windows environment, you can install WinSCP and, after setting up the connection to your cloud instance, copy the files in much the same way as in Windows. WinSCP can import the sessions and keys you have set up in PuTTY, so if you are already using PuTTY, setting up WinSCP should be easy.

Furthermore, WinSCP enables you to edit the files on the cloud instance by opening them in a text editor on your desktop/laptop, rather than having to do that on the command line.

NOTE: For some files you might not have write permission, so you will not be able to modify them via WinSCP without root privileges. You have at least three options:

  • Change the permissions for the selected files and folders from the command line (remember to restore the default setting after modifying the files!)
  • Connect via SSH as root and not your standard account
  • Open the file with elevated privileges from the command line, for example using the sudo nano main.yml command. Nano is a text editor on UNIX and main.yml stands for any .yml file from Ansible

Running Ansible scripts/playbooks

After installing the prerequisites and creating your Ansible playbook, you still have to specify which hosts should be affected by the installation (Ansible is able to set up multiple hosts at once).

To do the installation on your current machine go to the “/etc/ansible” folder and change the hosts file to include the following lines:

[local]
localhost ansible_connection=local

Remember to reference the hosts you want to change in your install_main.yml file, by including the line hosts: local.
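
Since Ansible can handle several hosts at once, the inventory can also list remote machines. The sketch below uses Ansible's YAML inventory format instead of the hosts file shown above; the group name analytics, the IP address and the user name are placeholders, not values from this set-up.

```yaml
# inventory.yml (sketch; the remote host and user are placeholders)
all:
  children:
    local:
      hosts:
        localhost:
          ansible_connection: local   # run directly, without SSH
    analytics:
      hosts:
        203.0.113.10:                 # placeholder address of a remote instance
          ansible_user: ubuntu        # assumed login user
```

Such a file would be passed explicitly with ansible-playbook -i inventory.yml install_main.yml, and the hosts: line in the playbook would name whichever group should be set up.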

Go to the folder with your install_main.yml file and execute ansible-playbook install_main.yml.

This will start executing all the tasks in your yml files and roles. The output is shown in your SSH session and contains the status of the individual tasks. Ansible scripts can be re-run multiple times: for tasks that use the Ansible modules, the state of the machine is compared to the required state, and if no changes are necessary the task is skipped. Only tasks that use the “command” module are executed again on every run.
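
If a “command” task should not be repeated on every run, one option is the creates argument of the command module, which skips the task when a given file already exists. The sketch below assumes a hypothetical archive and target folder.

```yaml
# Sketch: the paths are hypothetical and only illustrate the creates argument.
- name: "Unpack the data archive (runs only once)"
  command: tar -xzf /tmp/data.tar.gz -C /opt/data
  args:
    creates: /opt/data/README   # if this file exists, the task is skipped
```

This way even a raw command behaves much like the regular modules when the playbook is re-run.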


Ansible Galaxy

One of the main aims of Ansible is re-usability. It is therefore recommended to write your yml scripts as logical modules that can be used independently. This also means that you can easily take a script created by someone else and include it in your playbook.

For this purpose use Ansible Galaxy, which contains a large number of Ansible roles, many of which are well maintained and highly configurable. For example, you would only have to modify your install_main.yml script to include the selected roles and then change some default values in those roles.

There is no point in reinventing the wheel, so if these roles suit your needs, use them, or at least take them as inspiration for how an Ansible role should be written.
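
As a sketch of how this could look in practice, the listing below uses geerlingguy.mysql purely as an example of a community role; the overridden variable is an assumption and should be checked against the defaults of whichever role you pick. The two parts are separate files.

```yaml
# requirements.yml: roles to download from Ansible Galaxy
---
- src: geerlingguy.mysql

# install_main.yml: use the downloaded role next to your own
---
- hosts: local
  become: yes
  roles:
    - role: geerlingguy.mysql
      vars:
        mysql_port: 3307      # assumed variable name; check the role's defaults
    - r_server
```

The roles listed in requirements.yml are downloaded with ansible-galaxy install -r requirements.yml before the playbook is run.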


Using Ansible will enable you to set up the cloud environment in a standardized way and much more quickly than doing it manually. This introduction covered only the basic principles of Ansible, enough to understand its capabilities and how it works. Be prepared to spend some time reading the documentation in order to use it effectively.

