5 Replies Latest reply on Jun 14, 2018 10:09 AM by Nathan Mannheimer

    How to use TDAmapper from Tableau (via R)?

    Ming Ng

      I'm currently trying to use TDAmapper (which is a package from R) from Tableau because all my data is currently loaded onto Tableau server, and I would like to use the algorithm to identify useful groups within my dataset.

       

       

      This, in principle, should be possible since: (a) one can pass calculations from Tableau to R via Rserve, which I understand the basics of; (b) it is possible to do k-means clustering using the SCRIPT_INT command in Tableau.

       

       

      Unfortunately, while I'm able to follow the basic applications of TDAmapper in R (see below for reference), I don't quite know how to call TDAmapper from Tableau. Can somebody tell me how to do this? I tried the following code and it didn't work:

       

      SCRIPT_INT(

      "library(TDAmapper)

      X<-scale(.arg1,.arg2,.arg3,center=FALSE)

      X.dist=dist(X)

      library(ks)

      filter.kde<-kde(X,H=diag(1,nrow=3),eval.points=X)$estimate

      X.mapper<-mapper(dist_object=X.dist,filter_values=filter.kde,num_intervals=4,percent_overlap=50,num_bins_when_clustering=18)$cluster",

      MAX([Arising Credit Item Cost]),[Sales Discount %],[Sales Margin %]

      )

       

       

       

      ----------

      For convenience, let me briefly discuss the general pieces of this puzzle:

       

       

      1) How does one pass from Tableau to R?

       

       

      One does this with one of four functions (SCRIPT_BOOL, SCRIPT_INT, SCRIPT_REAL, SCRIPT_STR), each named after the type of result it returns. The most relevant to us is SCRIPT_INT, which we use when we expect the computation to return an integer result. The general form of the command is:

       

       

          SCRIPT_INT(

          "R code",

          Tableau fields being passed in)

       

       

      For example, to find the correlation between the variables "Profit" and "Discount", we use the SCRIPT_REAL function (used when we expect the computation to return a real, i.e. non-integer, value), which is written as:

       

       

          SCRIPT_REAL(

          "cor(.arg1,.arg2)",

          sum([Discount]),

          sum([Profit])

          )

       

       

      where sum([Discount]) is .arg1, and sum([Profit]) is .arg2 respectively in the R script.
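To see what the R side of this call looks like, here is a minimal sketch that can be run in plain R, outside Tableau. The values assigned to .arg1 and .arg2 are made up for illustration; in Tableau they would be the vectors of sum([Discount]) and sum([Profit]), one entry per mark.

```r
# Made-up stand-ins for the vectors Tableau would send to Rserve
.arg1 <- c(0.0, 0.1, 0.2, 0.3)   # plays the role of sum([Discount])
.arg2 <- c(100, 80, 65, 40)      # plays the role of sum([Profit])

# The same expression that sits inside the SCRIPT_REAL string
result <- cor(.arg1, .arg2)

# cor() returns a single number, not one value per row
length(result)
result
```

Running the R expression on its own like this is a quick way to check that it returns something of the expected type and length before embedding it in a calculated field.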

       

       

      2) How does one do k-means clustering in Tableau by passing through R?

       

       

      The code is as follows:

       

       

          SCRIPT_INT("

              ## Sets the seed

       

              set.seed( .arg8[1] )

       

              ## Studentizes the variables

       

              age <- ( .arg1 - mean(.arg1) ) / sd(.arg1)

              edu <- ( .arg2 - mean(.arg2) ) / sd(.arg2)

              gen <- ( .arg3 - mean(.arg3) ) / sd(.arg3)

              car <- ( .arg4 - mean(.arg4) ) / sd(.arg4)

              chi <- ( .arg5 - mean(.arg5) ) / sd(.arg5)

              inc <- ( .arg6 - mean(.arg6) ) / sd(.arg6)

              dat <- cbind(age, edu, gen, car, chi, inc)

       

              num <- .arg7[1]

       

              ## Creates the clusters

       

              kmeans(dat, num)$cluster

          ", 

       

          MAX( [Age] ), MAX( [Education ID] ), MAX( [Gender ID] ),

          MAX( [Number of Cars] ), MAX( [Number of Children] ), MAX( [Yearly Income] ),

          [Number of Clusters], [Seed]

          )
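The R body of a call like this can be tried outside Tableau first. The following is only a sketch under assumptions: it uses three columns of the built-in iris data as stand-ins for the Tableau fields, made-up values for the cluster count and seed, and scale() in place of the hand-written standardisation. Because the code above indexes the parameters with [1], the sketch also treats them as vectors.

```r
# Stand-ins for the vectors Tableau would send (assumption: three numeric fields)
.arg1 <- iris$Petal.Length
.arg2 <- iris$Petal.Width
.arg3 <- iris$Sepal.Length
.arg4 <- rep(3, length(.arg1))    # plays the role of [Number of Clusters]
.arg5 <- rep(42, length(.arg1))   # plays the role of [Seed]

set.seed(.arg5[1])

# scale() subtracts the mean and divides by the sd, column by column
dat <- scale(cbind(.arg1, .arg2, .arg3))

# One integer cluster label per input row
clusters <- kmeans(dat, centers = .arg4[1])$cluster

length(clusters) == length(.arg1)
table(clusters)
```

The point to notice is that the last expression evaluated in the Tableau version is a vector with one label per input row, which is the shape SCRIPT_INT can hand back.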

       

       

       

      3) How does one use TDAmapper in R?

       

       

      TDAmapper is an R package implementing the Mapper algorithm, which gives us a specific way of identifying similar members of a dataset. The prototypical example is applying TDAmapper to identify similar types of diabetic patients in the "chemdiab" dataset (loaded below from the locfit package).

       

       

      The general idea is that we have a dataset of points, and we define a particular function (known as the "filter function") that assigns a value to each point. We then cover the range of these filter values with a finite number of overlapping intervals. For the algorithm we therefore also need to specify: (a) the number of intervals we use; (b) the percentage overlap between these intervals.

       

       

       

           library(TDAmapper)

       

              library(locfit)

              data(chemdiab)

          

              normdiab<- chemdiab

       

              normdiab[,1:5]<-scale(normdiab[,1:5],center=FALSE)

              normdiab.dist=dist(normdiab[,1:5])

          

              library(ks)                   

          

       

              filter.kde<-kde(normdiab[,1:5],H=diag(1,nrow = 5),eval.points =normdiab[,1:5])$estimate

       

           ## filter.kde is defined to be our filter function

           ## In this case, we assign values to the data points based on the kernel density.

          

       

             diab.mapper<-mapper(

                  dist_object = normdiab.dist, filter_values = filter.kde,

                  num_intervals =4,

                  percent_overlap=50,

                  num_bins_when_clustering=18)

       

          ## Here, mapper() accepts as input the distance matrix of the data points, the filter values, the number of intervals and the percentage overlap.

          ## The remaining parameter, num_bins_when_clustering, affects the clustering step that mapper() uses internally; it should be a positive integer.
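To make the roles of num_intervals and percent_overlap more concrete, here is a small sketch of one way to cover a range of filter values with equal-length overlapping intervals. The function cover_intervals() and the filter values are made up for illustration; this is not taken from the TDAmapper source, which may place its intervals differently.

```r
# Illustrative only: build num_intervals equal-length intervals over the
# range of the filter values, each overlapping its neighbour by percent_overlap.
cover_intervals <- function(filter_values, num_intervals, percent_overlap) {
  lo <- min(filter_values)
  hi <- max(filter_values)
  overlap <- percent_overlap / 100
  # Pick the length so that the shifted intervals exactly span [lo, hi]
  len <- (hi - lo) / (num_intervals - (num_intervals - 1) * overlap)
  step <- len * (1 - overlap)
  starts <- lo + step * (seq_len(num_intervals) - 1)
  data.frame(start = starts, end = starts + len)
}

filter_values <- c(0, 0.15, 0.3, 0.5, 0.7, 1)   # made-up filter values
cover <- cover_intervals(filter_values, num_intervals = 4, percent_overlap = 50)
print(cover)

# Intervals that contain a point whose filter value is 0.5
hits <- which(cover$start <= 0.5 & 0.5 <= cover$end)
hits
```

With four intervals and 50% overlap on the range 0 to 1, each interval has length 0.4, and a point with filter value 0.5 lands in the second and third intervals. This overlap is why a single data point can end up in more than one vertex.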

       

       

      References:

      - http://bertrand.michel.perso.math.cnrs.fr/Enseignements/TDA/Mapper.html

      - http://breaking-bi.blogspot.co.uk/2013/12/performing-k-means-clustering-in-tableau.html

      - https://www.associationanalytics.com/2017/01/31/tableaus-r-integr

        • 1. Re: How to use TDAmapper from Tableau (via R)?
          Patrick A Van Der Hyde

          Hello Ming

           

          I have raised this question with our extensions team to see if they can assist further.  Thank you

           

          Patrick

          • 2. Re: How to use TDAmapper from Tableau (via R)?
            Ming Ng

            Alright, thank you, I appreciate it!

             

            I suspect the problem might be that I don't entirely understand the output of the TDAmapper algorithm in R. If it helps, the R source code is available here:

             

            TDAmapper/R at master · paultpearson/TDAmapper · GitHub

            • 3. Re: How to use TDAmapper from Tableau (via R)?
              Nathan Mannheimer

              Hello Ming,

               

              So the one missing piece for passing the data to the TDAmapper algorithm is that you'll need to create a data frame from the values you are passing to R from Tableau. The issue then is returning a result. The mapper() function does not return a vector that assigns each point to a cluster the way $cluster does for kmeans. This means you'll need to write a loop or other function that takes the points listed in the $points_in_vertex section of the mapper() result and labels each one with the vertex it falls into. The final result will need to be an ordered vector of vertex assignments, e.g. c(1,1,5,4,3,2,3,2).

               

              Does your use case require this algorithm rather than more standard approaches like k-means or DBSCAN? The TDAmapper algorithm seems to be designed to detect 3D structures in data.

               

              SCRIPT_INT(

              "library(TDAmapper)

              library(ks)

              library(locfit)

               

              df <- data.frame(.arg1,.arg2,.arg3)

               

              X<-scale(df, center=FALSE)

              X.dist <- dist(X)

               

              filter.kde<-kde(X,H=diag(1,nrow=3),eval.points=X)$estimate

               

              X.mapper <- mapper(dist_object=X.dist,filter_values=filter.kde,num_intervals=4,percent_overlap=50,num_bins_when_clustering=18)

               

              # insert code here to assign a vertex/cluster to each sample",

               

              MAX([Petal.Length]),MAX([Petal.Width]),MAX([Sepal.Length]))

              • 4. Re: How to use TDAmapper from Tableau (via R)?
                Ming Ng

                Hi there Nathan,

                 

                Thanks for this! My choice to use TDAmapper was just based on the fact I was impressed by TDAmapper's sensitivity to certain features of the dataset, and hence giving us a more refined way of clustering the data points as opposed to Kmeans. I'm aware that Tableau 10 allows you to do Kmeans clustering via a drag-and-drop operation in the Analytics tab, so that will be my backup plan for this.

                 

                Just to make sure I've understood you:

                 

                The reason why one can do kmeans clustering via R from Tableau is because of the $cluster component, which gives us a vector that indicates the cluster to which each point of data is allocated. And this (somehow) is intelligible to the SCRIPT_INT function, so we can pass this result back to Tableau from R. In our case, we would like to do the same thing for TDAmapper.

                 

                The thing is that TDAmapper groups certain data points into "vertices" (which are like mini-clusters), and then relates closely related vertices to each other via something called an "edge". In other words there are two kinds of relations: the first relates data points that are "close" to each other by grouping them into a vertex, and the second relates vertices that are close to each other by placing an edge between them. So I guess in order to make sure that both pieces of information go through, I would need to write: (1) a function that gives us a vector that indicates the vertex to which each point of data is allocated, and (2) a function that gives us a vector that indicates which vertices are joined by an edge, right?

                 

                I'm somewhat of a novice on R in that I'm literate in R, but not quite confident enough to do much heavy-duty programming in R. I know this is more of an R-related question, than a Tableau-related one, but I'd be grateful if you could give me some useful advice on where I can look for help on the above. For instance, you said I might have to write a loop function - is that what $cluster is? I can't seem to find the explicit code for it online. Do you know of examples of loop functions (or other functions) that sound similar to what I'm trying to accomplish here?

                • 5. Re: How to use TDAmapper from Tableau (via R)?
                  Nathan Mannheimer

                  Hi Ming,

                   

                  I understand your use case a little bit better now. You're also right on in your understanding of how the SCRIPT_ functions work in Tableau. To correctly return a result, they need a vector that is the same length as the vectors you passed to them. So in the case of the iris dataset, with 150 samples, you would pass it vectors of length 150 and need to return a vector of length 150, or a constant, which would be broadcast over the input vector and repeated 150 times. You are also correct that if you wanted multiple return results you would need a SCRIPT_ function for each different result, i.e. one for the vertices and one for the edges. If you wanted to return multiple edges at once, for each sample, the easiest way would be to return them as a string with SCRIPT_STR().
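As a rough sketch of the string idea, the R code below builds one string per data point listing the vertices adjacent to that point's vertex. Everything in it is made up for illustration: vertex_of_point stands for a per-point vertex assignment, and adjacency stands for a 0/1 matrix of edges between three vertices (the mapper() result is assumed to carry something of this shape).

```r
# Hypothetical per-point vertex labels (nine points, three vertices)
vertex_of_point <- c(1, 2, 1, 1, 2, 3, 2, 3, 3)

# Hypothetical 0/1 adjacency matrix: vertex 2 is joined to vertices 1 and 3
adjacency <- matrix(c(0, 1, 0,
                      1, 0, 1,
                      0, 1, 0), nrow = 3, byrow = TRUE)

# For each vertex, collapse the indices of its neighbours into one string
neighbours <- apply(adjacency, 1, function(row) {
  paste(which(row == 1), collapse = ",")
})

# Look up the neighbour string for each point's vertex: one string per point
edge_strings <- neighbours[vertex_of_point]
edge_strings
```

The result has the same length as the input, which is the shape a SCRIPT_STR calculation needs; the strings could then be split or displayed on the Tableau side.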

                   

                  In R, the $ operator pulls a named component out of a list (or a column out of a data frame). kmeans() returns a list, so $cluster simply pulls the vector of cluster assignments out of that result. Once it has been separated as a single vector, it can be passed back to Tableau.
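A quick way to check what $ is pulling out is to inspect a small kmeans() result directly. The six observations below are made up; the calls themselves are base R.

```r
x <- matrix(c(1, 1.2, 0.8, 5, 5.1, 4.9), ncol = 1)   # six made-up observations

set.seed(1)
fit <- kmeans(x, centers = 2)

class(fit)            # the result is an object of class "kmeans"
is.list(fit)          # stored as a list of named components
is.data.frame(fit)    # not a data frame
names(fit)            # the available components, "cluster" among them
fit$cluster           # one integer label per row of x
```

The same inspection with names() and str() on a mapper() result shows which components it offers in place of $cluster.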

                   

                  If I remember correctly,  $points_in_vertex returns essentially a list of lists:

                  points <- result$points_in_vertex

                  points

                  # conceptually: list(c(1,3,4), c(2,5,7), c(6,8,9))

                   

                  where each sub-list contains the points assigned to that vertex. What you would need to do is loop through each sub-list and assign its members to a cluster. The challenge is that the points in those sub-lists are identified by the order in which they were passed into the function, so you need to ensure that the output vector you create is ordered the same way. Given the above example it would look like this (where there are three vertices):

                   

                  c(1,2,1,1,2,3,2,3,3)

                   

                  because the points in each vertex are:

                   

                  vert1: (1,3,4), vert2: (2,5,7) vert3: (6,8,9)
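The conversion in this example can be sketched in R as follows. The helper name assign_vertices() is made up, and the input list is the toy example from this post rather than real mapper() output. One design choice is an assumption: because Mapper intervals overlap, a point may appear in more than one vertex, and this sketch keeps the first vertex it is found in.

```r
# Turn a list of "points in each vertex" into one vertex label per point.
# If a point appears in several vertices, the first one wins (an assumption).
assign_vertices <- function(points_in_vertex, n_points) {
  assignment <- rep(NA_integer_, n_points)
  for (v in seq_along(points_in_vertex)) {
    idx <- points_in_vertex[[v]]
    unassigned <- idx[is.na(assignment[idx])]
    assignment[unassigned] <- v
  }
  assignment
}

points <- list(c(1, 3, 4), c(2, 5, 7), c(6, 8, 9))   # toy example from above
result <- assign_vertices(points, n_points = 9)
result

# A loop-free alternative with the same "first vertex wins" rule
flat <- unlist(points)
vertex_ids <- rep(seq_along(points), lengths(points))
keep <- !duplicated(flat)
alt <- integer(9)
alt[flat[keep]] <- vertex_ids[keep]
alt
```

Both versions give 1, 2, 1, 1, 2, 3, 2, 3, 3 for the toy input. Placed at the end of the R string in the SCRIPT_INT call, with n_points set to length(.arg1), a vector like this is the kind of result Tableau can accept.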

                   

                  I would suggest looking at places like Stack Overflow or online guides like A Tutorial on Loops in R - Usage and Alternatives (article) - DataCamp for information on creating the function you are looking for. While you could use a for loop, R often has more efficient vectorised functions for this kind of task.