Coming from a classical IT software development background, it took us a while to arrive at an architecture capable of fulfilling the needs of our Data Science projects. Be aware that treating the two in the same manner is not a good idea, as you might seriously lower the productivity of your Data Science team.
For a classical IT project you have the development, acceptance and production environments. Development serves for building new functionality, acceptance for deploying and testing it, and production provides a stable, reliable, highly available environment for standardised functionality.
New features/functionalities are based on functional specifications or other technical documents describing the requirements. Accordingly, the data structures, scripts etc. are adjusted and the new functionality is added. Usually you create some test data for the development and acceptance environments, and if everything works you deploy to production.
With Data Science related development, on the other hand, you usually need the data beforehand, because most tasks require real-life data to calibrate your model, test your hypotheses etc. So basically you need access to the production environment, which is properly integrated and holds reliable, up-to-date data.
And here we arrive at the first caveats:
- the production environment needs to be used primarily for standardised modelling tasks, which run on a daily basis and on which other systems in your environment depend
- you certainly don’t want many users to have access to your production environment (apart from the people in charge of maintenance)
- the data used should be anonymised to comply with data protection rules
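As a minimal sketch of what such anonymisation could look like (the salt handling, field names and hash truncation here are our own illustrative choices, not a standard; a real pipeline should use a vetted pseudonymisation library and proper key management):

```python
import hashlib

# Hypothetical salt; in practice keep it outside the code and manage it securely.
SALT = "replace-with-a-secret-salt"

def pseudonymise(value: str) -> str:
    """Deterministically hash a direct identifier so records can still
    be joined across tables, without exposing the original value."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

def anonymise_record(record: dict, pii_fields=("name", "email")) -> dict:
    """Return a copy of the record with PII fields replaced by salted hashes."""
    return {k: pseudonymise(v) if k in pii_fields else v
            for k, v in record.items()}

row = {"name": "Jane Doe", "email": "jane@example.com", "balance": 1200}
clean = anonymise_record(row)
```

Because the hashing is deterministic, joins and group-bys on the pseudonymised columns still work, which is usually what the Data Scientists need.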
Needless to say, during the first iterations of a new model you can easily overwhelm a cloud instance. Typically you assume your script runs with linear complexity, while in reality the complexity is exponential, so as the amount of data grows you exceed the available resources. So we certainly don’t want any Data Science development taking place in the production environment!
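One cheap guard against this is to time the job on growing subsamples and check how the runtime scales before touching the full dataset. A sketch, where the quadratic `pairwise_distances` function is just a stand-in for any model-fitting step:

```python
import time

def pairwise_distances(points):
    # Stand-in workload: O(n^2), easy to mistake for O(n) on small samples.
    return [abs(a - b) for a in points for b in points]

def empirical_growth(workload, sizes=(500, 1000, 2000)):
    """Time the workload on increasing sample sizes and return the ratios
    between consecutive runtimes: roughly 2 per doubling suggests linear
    scaling, roughly 4 suggests quadratic, anything worse is a red flag."""
    timings = []
    for n in sizes:
        data = list(range(n))
        start = time.perf_counter()
        workload(data)
        timings.append(time.perf_counter() - start)
    return [timings[i + 1] / timings[i] for i in range(len(timings) - 1)]

ratios = empirical_growth(pairwise_distances)
```

If the ratios keep climbing per doubling, extrapolating to the full dataset will tell you whether the instance can handle it before you find out the hard way.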
Or to be more specific: once you have any standardised scoring in place, you need at least two instances in your environment (one for development and one for production), even if the environment exists solely for Data Science purposes!
Once we standardise a Data Science task, we have to deploy it in the same way as any other functionality (i.e. following standard rules for tests, code review etc.). If not, we are creating something unsustainable that will come back to haunt us in the future. Therefore the acceptance environments are essentially identical regardless of whether it is IT or Data Science development. The development environments, on the other hand, differ significantly:
- A Data Science development instance is more permanent, because we usually need to store inputs/outputs so that individual tasks remain reproducible
- A Data Science development instance usually has higher specifications than a general development environment, and they can even exceed those of the production environment; it varies greatly with the task at hand
- A Data Science development instance has limited user permissions. These primarily concern extracting information from the production environment, which should happen in a standardised way, via exposed services and not direct access!
- A Data Science development instance has higher demands in terms of availability, because if the instance is down, the Data Scientists are basically stuck
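A minimal sketch of such a standardised extraction path, where data is pulled through an exposed service instead of a direct database connection (the gateway URL, `dataset` parameter and response shape are all hypothetical):

```python
import json
import urllib.request
from io import BytesIO

# Hypothetical read-only extraction gateway; URL and parameters are
# illustrative only, not a real API.
EXTRACT_URL = "https://data-gateway.internal/api/v1/extract"

def fetch_extract(dataset: str, opener=urllib.request.urlopen):
    """Request a dataset snapshot through the exposed service.
    `opener` is injectable so the function can be exercised offline."""
    with opener(f"{EXTRACT_URL}?dataset={dataset}") as resp:
        return json.loads(resp.read().decode("utf-8"))

# Offline demonstration with a stubbed opener instead of a live gateway:
def _stub_opener(url):
    return BytesIO(json.dumps({"dataset": "loans", "rows": [1, 2, 3]}).encode())

snapshot = fetch_extract("loans", opener=_stub_opener)
```

Funnelling all extraction through one service function like this keeps production access auditable and makes it trivial to throttle or deny heavy queries centrally.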
Development, revision and deployment
From a classical software development perspective, Data Science tasks are not pure programming tasks, in the sense that programming is more deterministic: you usually know what you want to achieve and only look for an optimal way among the possible solutions. Data science can be viewed in the same way, but the range of possible alternatives is usually much wider, and we don’t know in advance whether the task can be achieved with the required precision/predictability (e.g. there may be no relation at all among the variables we are trying to model).
So the whole process becomes a two-step process:
- check whether we are able to resolve/model/predict the problem at hand
- standardise it and deploy it for automatic/regular use in our and other systems
The second point is already classical software development, while the first is more of an investigation, where the output of the individual tasks needs to be stored so that we can come back to them in subsequent iterations if needed.
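A sketch of how such run storage might look; the directory layout, file names and the sample metric values are our own illustrative convention, not a standard:

```python
import datetime
import hashlib
import json
import tempfile
from pathlib import Path

def store_run(inputs: dict, outputs: dict, runs_dir: Path) -> Path:
    """Persist one investigation step so it can be revisited later.
    The run ID combines a timestamp with a hash of the inputs, so a
    rerun with the same inputs is easy to locate again."""
    digest = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode()).hexdigest()[:8]
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir = runs_dir / f"{stamp}-{digest}"
    run_dir.mkdir(parents=True)
    (run_dir / "inputs.json").write_text(json.dumps(inputs, sort_keys=True))
    (run_dir / "outputs.json").write_text(json.dumps(outputs, sort_keys=True))
    return run_dir

# Demonstration in a throwaway directory with made-up values:
run_path = store_run({"cutoff": 0.5}, {"auc": 0.81},
                     runs_dir=Path(tempfile.mkdtemp()))
saved_inputs = json.loads((run_path / "inputs.json").read_text())
```

Even this much structure is enough to answer, months later, which inputs produced which results — which is exactly what the investigation step needs.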
To achieve both in a single environment architecture, we need to give users read access to the production data and a simple tool for extracting it without affecting the performance of the production database. IDEs (such as RStudio, Jupyter) are available in the development environment and optionally in the subsequent environments if needed.
Depending on the specifics of your environment, you might opt for two development environments. In our case the infrastructure is Data Science oriented and software-oriented development takes place in a separate environment (another team), so we share a single environment for our internal software and data science development, as most changes are driven by modelling needs.
Looking at the two steps in the process above, our repositories are constructed accordingly. The first is aimed at data science development, where review is more logical: do the assumptions make sense, are all requirements for using a given method/model satisfied, etc. The second repository is for deployment, where review is primarily code oriented: performance, readability, etc. In many cases the second step never happens, because the results from the first step are unsatisfactory.
The reason for this distinction is the different requirements of the environments where the scripts are used. In the modelling/hypothesis-testing step we don’t want to constrain the Data Scientists with things like naming conventions, run-time etc., because in many cases they are superfluous (the script will run for a limited number of iterations, on specific data), so if we have a reasonable solution that fulfils the methodological requirements, we are content. For development meant for production, on the other hand, we need to keep the scripts maintainable and the environment available and stable.
The above distinction also reflects the different strengths of a developer and a data scientist. The data scientist has profound modelling skills that exceed those of a standard developer, but coding skills that are usually inferior. Dividing the repositories this way lets us assign a data-science-oriented maintainer to the first repository and a development-oriented maintainer to the second.
There is certainly no single infrastructure that suits all requirements perfectly. The standard development, acceptance, production setup works well, but you have to be aware that the requirements a Data Science team places on the environment infrastructure differ from those of a software development team, and adjust your infrastructure accordingly.
This includes giving the Data Scientists more freedom for ad-hoc tasks and not restricting them with software development rules when they are not necessary. Otherwise we are limiting the productivity of our Data Science team.