The following series of blog posts will cover the essential tools used by data scientists. The individual posts should serve as a guide to setting up a proper environment for all types of data science tasks. The tools will therefore be described from a technical perspective, without paying attention to individual libraries or algorithms.
The posts will cover tools that are either not that common or a little tricky to set up. The aim is to provide information about interesting new tools that help broaden the analytical skills of anyone who deals with data analysis.
Skills that a data scientist should have
The skills required of a data scientist are very extensive. In other technical fields, such as software development, employees within an organization generally have a narrower field of expertise. There are experts on certain products, integration, architecture, etc., so you always have a go-to person. This approach has clear advantages, because deeper knowledge of a certain product or tool leads to better designed and implemented solutions. On the other hand, without some knowledge across areas, cooperation can become ineffective, as each individual works without keeping the global context of the project in mind.
As a data scientist, especially when you are working on small projects or within small teams, you have to deal with a wide range of tools, as the setup of the environment is often fully in your hands. The need for a variety of tools depends very much on the variety of tasks you have to do, but for most data scientists, learning to work with new tools is their daily bread. Obviously, no matter how quickly you learn, your available time is limited, so there is always a trade-off between having a working knowledge of many tools and deep expertise in a few.
Let us start with a basic summary of the tools and technologies that are typically required. For the most part I will talk about open source tools, so it is easy for everyone to try them out.
- Programming language to analyze data (R, Python)
The most popular choices here are R and Python. While these two languages are very different, they complement each other very well. There are many excellent materials on both languages, and because their setup is very easy, we will not cover it any further.
- Data storage (SQL-like databases)
A wide range of free SQL databases is available (MySQL, MariaDB, PostgreSQL…), all of which are pure SQL databases. I will not cover NoSQL or tools for unstructured data for the time being.
- Cloud environment
While many projects don't require the scalability that a cloud environment provides, there are still many situations where being able to set up a high-performance environment is necessary and/or more cost-efficient.
If you need to integrate your solution into a larger environment whose components are independent blocks (both in technology and in business orientation), you will most likely need an interface to communicate with the other components.
- Automated tasks
There are many situations where automated tasks come in handy, be it for data extraction/replication or for scoring of calibrated models, especially in cloud environments.
Without properly visualizing and presenting your results, their impact and influence will be minimal. Here we will focus on individual products such as Tableau, rather than on how to present the results.
The above is a basic summary of the areas that I will cover. While some of the areas may not apply to everyone, it is still convenient to have some knowledge of the options at hand in case you need to solve an unfamiliar task.
Since setting up a programming language or an SQL-like database is a mundane task that you can easily do on your laptop in a couple of minutes, I will not cover it any further. I only mention it to give a complete picture of the required tools. I will, however, provide some references to easy alternatives or handy functionalities.
My first few posts will deal with the following topics:
- setting up a cloud environment using Ansible
- securing access to a MySQL database in the cloud
- setting up automated jobs using Cron