In many data science applications, being able to run code in the cloud comes in very handy. Whether it is a simple demonstration of functionality we want to make accessible to a potential client, or an end-to-end implementation of, say, a predictive model, the accessibility of cloud-based solutions is a definite asset. However, running code in the cloud has its pitfalls, which can discourage many from taking advantage of it.
This is why I have decided to share our experience with working in the cloud. In this post, I will give a summary of tools that can help you run a Python script on an Ubuntu cloud server.
Running a Python script in the cloud can be much more cumbersome than development on our local computer, especially over a plain SSH connection. Fortunately, there are a few tools that make our lives easier:
1. argparse (python) – to run the script with various input arguments
2. tmux (unix) – to run sessions without the need to have a permanent SSH connection
3. cron (unix) – to run the scripts with a predefined frequency
4. SimpleHTTPServer (python) – a lightweight web server for sharing files with users who don’t have access to our cloud
Argparse is used to pass input arguments when we run a script from the command line (note: similarly to Python, R also has an argparse package – link). An example of calling a script that uses argparse is:
```shell
python script_name.py -input_1 15 -input_2 10
```
The above line executes the `script_name.py` script with the input arguments `input_1 = 15` and `input_2 = 10`.
The corresponding definition within our Python script would be:
```python
# import the module
import argparse

# create the parser object
parser = argparse.ArgumentParser(description='Data Preprocessing')

# add any arguments we require
parser.add_argument('-e', '--epsilon', type=float, default=.0005,
                    dest='epsilon', help='epsilon to be used')

# parse the arguments
args = parser.parse_args()

# use the individual parameters in other parts of the script
create_model(input_epsilon=args.epsilon)
```
In our example:
- `-e` – short name identifying the argument
- `--epsilon` – long name identifying the argument
- `type` – the data type of the entered value
- `default` – the default value used when the argument is not set
- `dest` – the name of the attribute added to the object returned by parse_args()
- `help` – a description explaining the purpose of the input argument
The attributes that we don’t set will use their default values.
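For instance, reusing the parser from the snippet above, omitting `-e` falls back to the default value (a minimal sketch; `parse_args` is given an explicit empty list here to simulate running the script with no command-line arguments):

```python
import argparse

# same argument definition as in the script above
parser = argparse.ArgumentParser(description='Data Preprocessing')
parser.add_argument('-e', '--epsilon', type=float, default=.0005,
                    dest='epsilon', help='epsilon to be used')

# no arguments supplied, so the default is used
args = parser.parse_args([])
print(args.epsilon)  # 0.0005
```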
Argparse supports many data types (e.g. float, int, str, and lists). For a list argument we use `nargs='+'` when we don’t know the number of items beforehand, or an integer value when we do. The argument is then passed as `python script.py -list item1 item2`.
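A minimal sketch of such a list argument (the argument name is illustrative; `parse_args` is given the arguments explicitly here to simulate the command line):

```python
import argparse

# parser accepting a list of one or more items via nargs='+'
parser = argparse.ArgumentParser(description='nargs example')
parser.add_argument('-l', '--list', nargs='+', dest='items', default=[],
                    help='one or more items')

# simulates: python script.py --list item1 item2
args = parser.parse_args(['--list', 'item1', 'item2'])
print(args.items)  # ['item1', 'item2']
```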
It is good practice to add debugging/logging information to the script, because this output gets stored in the cron logs and we can use it to find out why a script did not execute as expected. Ideally, the level of detail of the logging can itself be set as an input argument when executing the script.
For an in-depth description, have a look at the documentation: https://docs.python.org/2/library/argparse.html .
Tmux allows us to create sessions in our cloud environment over SSH without needing to stay connected while the script executes. So when we want to start a script that is expected to run for a long time (and that does not use cron), we use tmux. Furthermore, it lets us run multiple terminal command lines at the same time. And it comes pre-installed on Ubuntu, so we can get started straight away.
```shell
# prefix for commands (default: ctrl+b)
# create a new session
tmux new -s session_name
# attach to the session session_name
tmux attach -t session_name
# find all running sessions
tmux list-sessions
# general information about all sessions, panes etc.
tmux info
# detach from a session
tmux detach
# ...or using the prefix: prefix + d
# terminate a session
tmux kill-session -t session_name
# ...or from within the session: prefix + :kill-session + enter
# terminate/destroy the current session
exit
```
After attaching to a given session, you can execute the command you want to run (e.g. a Python script with argparse) and detach from the session. The session continues to run, and you can re-attach at any time until you terminate it.
In case you want to share a session with another user (i.e. someone using a different account), you can create a new user (e.g. tmux_user) and, after connecting to your own account over SSH, run

```shell
sudo su - tmux_user
```

and work on your sessions as that user. Alternatively, you can modify the tmux configuration as described here.
Cron is a simple scheduler that enables you to execute scripts at predefined times or frequencies. So basically, we can use it for any task that runs with identical settings at predefined times, such as downloading files, running replications, end-of-day jobs, etc.
Cron can also easily be configured to send emails about the status of a scheduled job, and it stores the command-line output, so we can examine why a script did not finish as expected. Furthermore, if we have a pipeline of scripts that need to run in a certain order (and we don’t want to create a main script to orchestrate them), we can chain them with the `&&` operator, which starts the next script only after the previous one has terminated successfully.
```shell
# list the available cron jobs for the current user
crontab -l
# edit the available cron jobs for the current user
crontab -e
# example crontab entry: execute a script daily at 2am
00 02 * * * /opt/alook/orchestrator.py -c /opt/alook/config/alook_common.cfg
```
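The `&&` chaining mentioned above could look like this in a crontab entry (the paths and script names here are hypothetical):

```shell
# run preprocessing daily at 2am; model.py starts only if
# preprocess.py terminates successfully (exit code 0)
00 02 * * * /opt/alook/preprocess.py && /opt/alook/model.py
```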
For an in-depth explanation of how to configure cron with additional options, have a look here.
Note: For cron to be able to execute a script, the file needs the correct permissions, so don’t forget to set them using `chmod +x filename.sh`. A script that is executed directly (as in the crontab example above) also needs a shebang line, such as `#!/usr/bin/env python`, as its first line.
SimpleHTTPServer is an extremely lightweight HTTP server, useful when you need, for example, to share a couple of files from your cloud with users who don’t have access to it, or to serve some static HTML pages. By running the command
```shell
python -m SimpleHTTPServer 80
```

you give anyone access to the files in the directory where you executed the command (the argument after the module name defines the port the HTTP server listens on). Users can then access the folder from their browser and view or download the files.
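A minimal programmatic sketch of the same server (note that in Python 3 the module was renamed `http.server`; here we bind to an ephemeral port rather than 80, which would require root):

```python
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler

# serve the current directory; port 0 lets the OS pick a free port
server = HTTPServer(('127.0.0.1', 0), SimpleHTTPRequestHandler)
port = server.server_address[1]

# run the server in a background thread so the script can continue
threading.Thread(target=server.serve_forever, daemon=True).start()

# fetch the directory listing that the handler generates
response = urllib.request.urlopen('http://127.0.0.1:%d/' % port)
print(response.status)  # 200
server.shutdown()
```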
Be sure that the directory does not contain any sensitive data and that SSL is not required; otherwise you have to configure it, for example as shown here.
Note: Don’t forget to open the correct port in your firewall if you want to enable access for outside users.
All of the above are powerful tools that come in handy in situations a data scientist deals with on a regular basis, and that we usually have to solve once we start working in the cloud or collaborating with other team members on projects.
More about working in the cloud can be found in my previous article about cloud set-up using Ansible.