UPDATE: July 28, 2015
We have received numerous requests for the workbook which is now attached below. Also this workbook will function fine in the latest releases of Tableau (version 9 and above) as is. Finally one small correction - if you are using RStudio to launch RServe, you will need to keep RStudio running. On the other hand you can launch RServe from an R instance and once you make the connection to Tableau, you can close the R GUI and the connection will not be deleted.
Tableau, the prom queen of data is finally going out with R, the alpha-geek of analytics. This is a moment a lot of us have been waiting for. Tableau version 8.1 allows super easy integration with R. I've been able to use Tableau 8.1 to get some really cool results. Below are a few initial impressions along with a simple workbook you can download and play with.
The best way to understand how the integration works is to think of Tableau as the front end for displaying (and beautifying) results from a typical statistical data analysis that is run using R. While R itself has several graphing packages that can produce dazzling visuals, none of them have the interactive dashboarding capability that Tableau provides so naturally (with the possible exception of Shiny). That alone makes this new version worth the price of admission.
Simply put, the way the puzzle pieces fit into each other is that R is the statistical or predictive analytics engine for Tableau. Lets face it, R doesn't have the prettiest visualization interfaces, whereas Tableau, despite the recently introduced time series forecasting feature, cannot hope to provide the wealth of data mining algorithms that R (or RapidMiner, hopefully we will also have a RapidMiner-Tableau integration soon?) have built into them. So what is better than marrying the strengths of the two packages?
The flowchart below gives a high level overview of how this coupling works followed by a simple example of using the correlation calculation function within R for a small data set that is loaded into Tableau.
So how does the integration work? It starts with an R package called Rserve which is a TCP/IP server that allows other programs to use facilities of R. The stated typical use is "to integrate R backend for computation of statstical models, plots etc. in other applications". The nice thing about using Rserve is that once you start the service, you dont need to have R running at all, if you are on a Windows machine. If you want to reduce memory usage, you would start R, load the Rserve library and start the Rserve service in the R environment, like this:
Then you can shutdown R (if you want to) and continue using R inside of Tableau. But before you can do this, you will need to establish a connection to Rserve from Tableau. This is done in Tableau via the top menu Help -> Settings and Performance -> Manage R connections.
Which will launch the dialog box, where you can use the values shown below. Click OK and for the most part, you are done with the integration.
That was the easy part. (Here is another blog which gives more connection details if your installation has some problems).
So how do we invoke the R functionality within Tableau? The key is to ensure that the data structure being sent into R from Tableau and that which is coming out of R into Tableau, have the same structure. For example, if you want to push a vector or data frame into R, you have to make sure that the same type vector or data frame is output by R before being brought back into Tableau. To this end, all Tableau provides today are four SCRIPTing functions which can be called using the "Create Calculated Field ..." command. These four scripting functions allow us to pass back and forth REAL, INTeger, BOOlean and STRing data structures between the two programs. Inside these scripts, we can call any of the R functions that are available to do the analyses.
Let us look at a simple example of using R to calculate correlations between any pairs of variables within our dataset, and displaying the pairwise correlations using a dynamic dashboard created within Tableau.
Step 1: Load the dataset into Tableau
In this case our dataset is a tiny 25 sample 5 attribute example which exists as a csv file. We can connect the data in the text file to Tableau fairly easily and we will not explore this basic step here. Here is how the data looks inside Tableau.
Step 2: Create parameters for dashboard
As you can see there are 5 attributes and we would like to calculate the correlations between any pair of these attributes. There will be a total of 5*4/2 = 10 distinct correlations between these attributes and we would like to create a dashboard to select any pair, see a scatter plot and read off the Pearson's correlation coefficient between them as shown below.
The actual details for creating these parameters are detailed in this Tableau Knowledge Base article and will not be repeated here as our focus is on R-scripting. Follow the procedure described in the KB, to create two Calculated Field measures, "Measure 1" and "Measure 2" which may be dynamically chosen by the user when using the dashboard, to select two of the five attributes from the dataset. The image below shows the programming behind Measure 1.
Step 3. Add correlation as a calculated measure
This is the heart of the analysis where we invoke R. We need to create a new calculated field, called "Correlation" which essentially takes the two attributes that are chosen using the dashboard selectors and passed onto R for computing the correlation. To do this, we right click on the Measures area in Tableau, select Create Calculated Field ... and populate the dialog box as shown below.
Provide a name in the Name: field and type in the script which will invoke R. Note the following main points about the script
- It is SCRIPT_REAL, meaning it will take in real or floating point arguments and return a floating point result
- Only valid R code must be inserted between the double quotes (" ... ") with the one exception as noted in the next point
- The arguments to R functions must be AGGREGATE measures that exist in the Tableau database. These arguments are passed to R as .arg#
- For example, in our case, the first argument, .arg1 is [Measure 1] which is a SUM([..]) of one of the five attributes from our dataset, as seen in the CASE statements of step 2.
- The cor function takes the two measures as arguments, passes them to Rserve and returns the calculated correlation between the two selected measures
Step 4. Use the calculated Correlation as another measure to create a chart and dashboard
First create a scatter plot between calculated measures, Measure 1 and Measure 2 by dragging them to the Rows and Columns shelf. Show the parameter controls, Measure 1 Selector and Measure 2 selector on the chart by right clicking on the Parameters block and selecting Show Parameter Controls. Finally, drag and drop the calculated Correlation into the Marks shelf. Now the scatter plot is ready for display. Select any two attributes using the Measure selectors and the scatter plot will update. Hovering the mouse pointer on any point in the scatter plot will show the points and the correlation coefficient. To see the dashboard shown at the start of this example, we need to do simply create a dashboard that uses this chart and edit its title as described in this article to make the title change dynamically with the selected measures and calculated correlation.
This article (may require sign in) describes more complex examples of integrating R and Tableau, but our workbook can help you start with something basic.