After introducing R capabilities in Tableau 8.1, Tableau 10.1 now also comes with support for Python. This is great news, especially for data scientists who use reports to visualize the results of more sophisticated analytical processes. Such reports can bring the analytics much closer to end users while preserving the expected level of user-friendliness.
In this post I use a simple modelling example to show how the integration of Tableau and Python works.
Technical Setting & Basic Functionalities
While the R integration used Rserve, and you only needed a running Rserve session to enable a connection from Tableau, the Python integration requires you to install and set up TabPy Server (installation instructions can be found in the Tableau GitHub repository). The setup covers installing TabPy, installing Python 2.7 with Anaconda, and connecting Tableau.
In terms of functionality, the Python integration is very similar to the R integration (as far as script calls from Tableau are concerned), but because tabpy-client lets you deploy endpoints, it is far more powerful. Before going through a few examples of how Python can be used in Tableau, a quick remark: it is not possible to use the Python integration at the same time as the R integration. There is only one external-service connector, so if you want to use both, you would need to connect directly from R to Python or vice versa (for example using the rpy2 package). Furthermore, Tableau Public does not currently support TabPy functionality.
Let’s go through a few examples now, which will cover:
- passing data to Python
- fitting a scikit-learn model
- using a fitted model to predict
- saving and loading a model
- passing user-defined parameters to Python
We will use the iris dataset that is already included in scikit-learn and create a model using the Naive Bayes estimator. The dataset contains 5 columns: sepal width, sepal length, petal width, petal length, and the category (iris type). First, let’s visualize the iris dataset using only two of the four attributes and color-coding the category.
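If you want to inspect the same data outside Tableau first, the dataset can be loaded directly from scikit-learn (a quick sketch, independent of TabPy):

```python
# Load the iris dataset that will be used throughout the examples.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                   # (150, 4) - four numeric attributes
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']
```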
Now we have the required data ready and can start calling the Python functionality. However, for the calculations to be performed for each individual row in the dataset, we have to make sure that we are not working with aggregated measures in Tableau.
To use Python functionality we have to create a new calculated field using one of the SCRIPT_XX functions, where XX is the return data type. The available options are BOOL, INT, REAL, and STR. There are some rules you have to keep in mind when calling Python:
- Only one value per row can be returned from a calculation. So if we want multiple values returned, we need to build a delimited string and define additional calculated fields to extract the desired content.
- We have to return the same number of records as the number of input rows (i.e. if the calculation was executed for all 20 records at once, the return vector must contain 20 elements).
- Python script calls are table calculations, so be careful which dimension the calculation is partitioned by: Tableau makes a separate call to TabPy for each partition.
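To illustrate the first rule, the Python side of a multi-value return is typically one delimited string per row, which Tableau can then split apart with additional calculated fields (a minimal sketch; the values and delimiter are made up):

```python
# Pack two values per row into one '|'-delimited string so a single
# SCRIPT_STR call can return both; Tableau's SPLIT() can unpack them.
predicted = [0, 1, 2]
probability = [0.98, 0.87, 0.91]

packed = ['%d|%.2f' % (p, pr) for p, pr in zip(predicted, probability)]
print(packed)  # ['0|0.98', '1|0.87', '2|0.91']
```

On the Tableau side, a calculated field such as `SPLIT([Packed Result], "|", 2)` would then extract the second part of each string.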
For now let us create a Naive Bayes model from the input data and predict the same data using the fitted model.
```
SCRIPT_REAL("
import numpy as np
from sklearn.naive_bayes import GaussianNB

# create the model
model = GaussianNB()

# transform input data
data_x = np.transpose(np.array([_arg1, _arg2, _arg3, _arg4]))
data_y = np.array(_arg5)

# fit the model
model.fit(data_x, data_y)

# predict the category for the input data
predicted_category = model.predict(data_x)

# transform output
return list(np.round(predicted_category, decimals=2))
",
ATTR([Petal Length]), ATTR([Petal Width]),
ATTR([Sepal Length]), ATTR([Sepal Width]), ATTR([Category]))
```
_argX refers to the individual input arguments (columns from the Tableau sheet). In this example all of the input arguments are vectors. We have to wrap them in ATTR() because SCRIPT_XX requires some sort of aggregation function, even though we are not working with aggregated data. Also, for a call to Python to succeed, the script must end with a return statement.
To visualize the output, we will compare the original categories with the predicted categories from the model.
We can see that the Python script executed successfully and that some observations were misclassified (6 out of 150), which is exactly the same result we get in pure Python.
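The same check is easy to reproduce in plain Python, outside Tableau:

```python
# Fit Gaussian Naive Bayes on iris and count how many of the 150
# training rows the fitted model misclassifies.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
model = GaussianNB().fit(iris.data, iris.target)
predicted = model.predict(iris.data)

misclassified = int(np.sum(predicted != iris.target))
print(misclassified)  # 6
```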
Save, load and deploy
Since for most applications we don’t want to refit the model every time we predict on the same data, we will now save the fitted model and only load it for predictions. This works exactly as it would in pure Python.
```python
import pickle

# fit the model
model.fit(data_x, data_y)

# save the model (pickle requires binary mode)
pickle.dump(model, open('C:\\temp\\model', 'wb'))

# load the model
model = pickle.load(open('C:\\temp\\model', 'rb'))
```
Remember that for some models it might be more efficient to save them using joblib.dump and joblib.load, as described in the scikit-learn documentation.
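A sketch of that joblib alternative (the file path is illustrative; older scikit-learn versions shipped joblib as sklearn.externals.joblib):

```python
import os
import tempfile

import joblib  # installed alongside scikit-learn
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
model = GaussianNB().fit(iris.data, iris.target)

# dump and reload the fitted estimator
path = os.path.join(tempfile.gettempdir(), 'iris_model.joblib')
joblib.dump(model, path)
restored = joblib.load(path)

# the restored model predicts identically to the original
print((restored.predict(iris.data) == model.predict(iris.data)).all())
```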
However, this method is only suitable for testing and playing around; for production use you should deploy functions as described in the TabPy client documentation and expose them as endpoints. When you deploy a function that contains a model, TabPy automatically saves the model definition (using pickle) so it can be used when the function is executed. This should be the preferred way of deploying functions that are exposed to end users. Another alternative is to create your own Python package.
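A sketch of what such a deployment could look like, assuming a TabPy server running on localhost:9004 and the tabpy-client package from the TabPy repository (the endpoint and function names here are made up, and the exact client API may differ between TabPy versions; it also requires a live server, so it cannot run standalone):

```python
# Deploy a fitted model as a TabPy endpoint so dashboards only need
# tabpy.query('PredictIris', ...). Requires a running TabPy server.
import numpy as np
import tabpy_client
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
model = GaussianNB().fit(iris.data, iris.target)

def predict_iris(petal_length, petal_width, sepal_length, sepal_width):
    # Each argument arrives from Tableau as a list, one value per row.
    data = np.column_stack([petal_length, petal_width,
                            sepal_length, sepal_width])
    return model.predict(data).tolist()

client = tabpy_client.Client('http://localhost:9004/')
# The fitted model captured in the closure is pickled and stored
# on the server together with the function definition.
client.deploy('PredictIris', predict_iris, 'Iris category prediction')
```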
The purpose of both methods is to simplify calls in calculated fields to:
```
// with an imported package
SCRIPT_REAL("
# load the package
from custom_function import custom_function
# execute the function
return custom_function(_arg1)
", ATTR([First Argument]))

// with a TabPy-deployed function
SCRIPT_REAL("
# query the exposed endpoint
return tabpy.query('endpoint_name', _arg1)['response']
", ATTR([First Argument]))
```
The calculated field contains a simple function call and you can make changes to the underlying model/function without the need to modify each dashboard (with the exception of changes in input parameters of the function).
In case you want to allow end users to change the input parameters of a function so they can visualize various scenarios (e.g. best case, worst case), you can create standard Tableau parameters and add them to the function call.
```
SCRIPT_REAL("
# load the package
from custom_function import custom_function
# execute the function
return custom_function(_arg1, _arg2)
", ATTR([First Argument]), [Custom Parameter])
```
There are two things you should be careful about when using custom parameters:
- The custom parameter is passed to Python as a vector with one copy per row of data that is sent, not as a scalar.
- Any change of a parameter triggers a recalculation of every calculated field containing that parameter, so only let end users modify parameters that don’t trigger compute- or time-intensive Python calls.
We have gone through some basic applications of the Python integration in Tableau. Python can significantly extend the analytical capabilities of our dashboards. Making it work smoothly, however, requires a solid understanding of Tableau table calculations; otherwise you may be surprised by unexpected values in your calculated fields (computed across a different dimension than you expected) or by errors caused by mismatches between the expected and received vector sizes.
Furthermore, different levels of detail were quite tricky with the R integration, so expect similar complexity with Python. If you can’t find Python/Tableau examples online, try the R/Tableau posts to get a general idea of how to tackle the problem.
- Tableau table calculations – http://drawingwithnumbers.artisart.org/want-to-learn-table-calculations/
- TabPy documentation – https://github.com/tableau/TabPy
- An example of using TabPy – http://databrit.blogspot.com/2016/11/using-python-for-sentiment-analysis-in_15.html
- Examples of R integration – https://boraberan.wordpress.com