Advanced analytics with Python and Tableau 10.1 integration

After introducing R capabilities in Tableau 8.1, the new Tableau 10.1 now comes also with support for Python. This is a great news especially for data scientists, who use the reports to visualize results of some more sophisticated analytical processes. Such reports can now bring the analytics much closer to the end users, while preserving the given level of user-friendliness.

tableau_python

In this post I am using a simple modelling example to describe how exactly the integration of Tableau and Python works.

Technical Setting & Basic Functionalities

While R integration used Rserve and you only needed to create a running Rserve session to enable a connection from Tableau, the Python integration requires you to install and set-up TabPy Server (installation instructions from the Tableau github repository can be found here). The set-up contains instructions on installing TabPy, installing Python 2.7 with Anaconda and connecting Tableau.

In terms of functionalities the python integration is very similar to R integration (if we take into account the script calls from Tableau), but because TabPy-client enables you to deploy endpoints, it is far more powerful. Before going through a few examples, how we can use Python in Tableau, just a quick remark that it is not possible to use the python integration at the same time with R integration. There is only one connector, so in case you want to use both, you would need to connect directly from R to python or vice-versa (for example using the rpy2 package). Furthermore, Tableau public does not currently support the TabPy functionalities.

Let’s show a few examples now, which will contain:

  • passing data to python
  • fitting a scikit-learn model
  • using a fitted model to predict
  • saving and loading a model
  • passing user-defined parameters to python

We will use the iris dataset that is already included in scikit-learn and create a model using the Naive Bayes estimator. The dataset contains 5 columns (sepal width, sepal length, petal width, petal length and the category). First, let’s just do a visualization of the iris dataset using only two of the 4 attributes and color coding the category (Iris type).
iris-plot

Now we have the required data ready to start with calling the python functionalities. However, in order for the calculations to be done for each individual row in the dataset, we have to make sure that we are not working with aggregated measures in Tableau.

aggregate-measures

To use python functionalities we have to create a new calculated field and define SCRIPT_XX, where XX defines the return data type. The available options are BOOL, INT, REAL, STR. There are some rules/specifics, which you have to consider when calling python:

  1. only 1 calculated field can be returned from a calculation. So if we want to have multiple values returned, we need to create a delimited string and define other calculated fields to access the desired content.
  2. We have to return the same number of records as the input number of rows (ie. if the calculation was executed for all 20 records at once then the return vector must contain 20 elements)
  3. Python script calls are table calculations, so be careful what dimension is being used for the calculation! Because for each partition Tableau does an individual call to TabPy.

Model creation

For now let us create a Naive Bayes model from the input data and predict the same data using the fitted model.

 
SCRIPT_REAL("
import numpy as np
from sklearn.naive_bayes import GaussianNB

# create the model
model = GaussianNB()

# transform input data 
data_x = np.transpose(np.array([_arg1, _arg2, _arg3, _arg4]))
data_y = np.array(_arg5)

# fit the model
model.fit(data_x, data_y)

# predict the category for input data
predicted_category = model.predict(data_x)

# transform output
return list(np.round(predicted_category, decimals=2))
", ATTR([Petal Length]), 
   ATTR([Petal Width]), 
   ATTR([Sepal Length]), 
   ATTR([Sepal Width]), 
   ATTR([Category]))

_argX – defines the individual input arguments (columns from Tableau sheets). In this example all of the input arguments are vectors. We have to use the “ATTR()”, because SCRIPT_XX requires some sort of aggregation function although we are not working with aggregated data. Also for a call to Python to be successful, the script requires the return argument.

To visualize the output, we will compare the original categories with the predicted categories from the model.

predicted-categories

We can see that the Python script executed successfully and we misclassified some observations (6 out of the 150), which is exactly the same result we get in Python.

Save, load and deploy

Since for most applications we don’t want to fit the model and then predict the model for the same data, we will now save the model and only load it for predictions. This is identical as we would proceed in pure Python.

import pickle

# fit the model
model.fit(data_x, data_y)

# save the model
pickle.dump(model, open('C:\\temp\\model', 'w'))

# load the model 
model = pickle.load(open('C:\\temp\\model', 'r'))

Remember that for some models it might be more beneficial to save them using joblib.dump and joblib.load as described in the scikit-learn documentation.

However, this method is only for testing/playing around, while for production use you should use deployed functions as mentioned in the Tableau client documentation and define them as endpoints. When you deploy an individual function that contains a model, Tableau automatically saves the model definition (using pickle) to be used in the function execution. This should be the preferred way of deploying functions that are supposed to be exposed to end-users. Another alternative would be to create your own package.

The purpose of both methods is to simplify calls in calculated fields to:

# with import 
SCRIPT_REAL("

# load package
import custom_function

# execute function
return custom_function(_arg1)
",ATTR(First Argument))

# with Tableau deployed function 
SCRIPT_REAL("

# query the exposed endpoint
return tabpy.query('endpoint_name',_arg1)['response']"

",ATTR(First Argument))

The calculated field contains a simple function call and you can make changes to the underlying model/function without the need to modify each dashboard (with the exception of changes in input parameters of the function).

Custom parameters

In case you want to allow the end-users to change input parameters of a certain function to be able to visualize various scenarios (eg. best case, worst case), we can create standard Tableau parameters and add them to the function call.

SCRIPT_REAL("

# load package
import custom_function

# execute function
return custom_function(_arg1, _arg2)
",ATTR([First Argument]),
 ,[Custom Parameter])

There are two things you should be careful about when using custom parameters:

  1. The custom parameter is passed to Python as a vector for each row of data that is sent (not as a scalar).
  2. Any change of a parameter causes the recalculation of calculated fields containing that parameter. So be careful to enable end-users to modify parameters that don’t cause compute-intensive/time-intensive Python calls.

Summary

We have gone through some basic applications of the Python integration in Tableau. Python can extend the analytical capabilities of our dashboards significantly, however to make this work smoothly requires advanced knowledge of Tableau table calculations, else you might be surprised with unexpected values in your calculated fields, because they were done across a different dimension than you expected or errors from mismatches in the expected and received vector size.

Furthermore different levels of detail were quite tricky for R integration, so expect a similar complexity for Python. In case you don’t find examples online for Python/Tableau, try to use the R/Tableau posts to get the general idea, how to tackle the problems.

Useful links

Advertisements

5 thoughts on “Advanced analytics with Python and Tableau 10.1 integration

  1. Henrique

    Hello, I was trying to recreate the example above in tableau and it returned the error
    TypeError : 1 is not JSON serializable
    What you think it might be?

    Like

    Reply
    1. martinzofka Post author

      Hello,

      It looks like a problem with using wrong dimensions (eg. calling the Python script per line instead of the whole table or returning an object with incorrect number of columns or rows). Try to write the input arguments from Tableau and output you create in Python to a file within the script call before the return statement and have a look at it.

      Like

      Reply
      1. Henrique

        Hello,

        I’m sorry, I work with tableau for a long time, but with python or the integration of both I’m pretty new.
        How do I do that?

        Like

      2. martinzofka Post author

        You can do something like this:

        f_log = open('C:\\temp\\log.csv', 'w')
        f_log.write(str(data_x.shape))
        f_log.write(str(data_y.shape))
        f_log.close()

        Also the tabpy server contains information about the request/response data.

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s