9 Replies Latest reply on Mar 25, 2013 2:31 PM by Scott Tennican

    Statistical analysis of groups of data

    Sonya T

      Hi All,

       

      Tableau is a lot more fun to use than Excel and prettier than Prism, but either it can't do statistics as well, or I can't figure out how to do them.

       

      I have the data from a molecular biology experiment. Cell cultures were treated with different agonists in triplicate. I need to compare each treatment with control and find the p-value. I made a bar graph with the average of the three results, but can't figure out how to put in error bars showing SEM. I don't see a way to compare the treatments to the control.

       

      Thanks!

        • 1. Re: Statistical analysis of groups of data
          Elvis Has Left The Building

          Sonya, I'm going to assume that waiting 10 days for an answer to your question isn't unusual for you, considering you're growing bio-cultures. But on these forums (lately) if your Q goes that long, there's a good chance it will go unanswered, fading into the ether. You are right Tableau is a lot more fun than Excel, and most other programs for that matter! And could be quite beneficial to academic endeavours. I no longer participate in these forums (much), but let me ping a few folks you should get to know: Alan Jonathan Josh Hope they'll be able to help you. [They are totally helpful & knowledgeable folks.]

           

          --Shawn

           

          PS: No I'm NOT back. But someone needs to take on the '0' (zero) replies. As Johan mentioned, this is their one and maybe ONLY experience of the forums -- a [blank/no response] shouldn't be their take-away; the forums are so much better than that.

          • 2. Re: Statistical analysis of groups of data
            Jonathan Drummey

            Hi Sonya,

             

            Are you still looking for help with this?

            • 3. Re: Statistical analysis of groups of data
              Sonya T

              Yes, I never did figure it out, and went back to Prism.

               

              From what I can figure out, Tableau only does a linear analysis when both axes are numerical values. If I want a t-test of two groups of data, the name of the groups goes on the x-axis and the statistical feature doesn't know what I'm doing. Or I don't know what I'm doing.

              • 4. Re: Statistical analysis of groups of data
                Jonathan Drummey

                I'm going to ping Scott Tennican on this one, he's a Tableau developer who works on the statistical functions (might be a little while before he can respond, they are trying to get v8 out the door).

                 

                Jonathan

                • 5. Re: Statistical analysis of groups of data
                  Scott Tennican

                  Hi Sonya,

                   

                  It sounds like you would like to test for a significant difference between multiple pairs of population mean estimates. So, you would probably like a single step multiple comparison of means test such as, in order of quality, the Hochbergs's GT2, Dunn-Sidak test or Bonferroni test. You don't want to compare all pairs of means so the Tukey-Kramer doesn't apply. Also, you don't just want to know if any mean is different from any other so one-way ANOVA doesn't apply. Tableau does not compute any of these test statistics for you. Also, the theoretical distributions to which you compare your test statistics are not available as functions in Tableau.

                  In some cases, it is possible to use calculated fields to compute test statistics.

                  But, I don't know how to separate and then combine the values in your control and treatment groups to allow me to compute these statistics. Everything that I have thought of seems to require an illegal combination of aggregate and disaggregated values.

                   

                  So, it might be better to compute your tests in another tool and display the results in Tableau.

                  Also, you might consider requesting the tests you need on the idea forum here: Hypothesis Tests (e.g. t-test, chi-squared, ANOVA)

                   

                  Scott

                  • 6. Re: Statistical analysis of groups of data
                    Jonathan Drummey

                    Hi Scott,

                     

                    You write, "Everything that I have thought of seems to require an illegal combination of aggregate and disaggregated values." There are ways using aggregations like MIN(), MAX(), ATTR(), etc. to aggregate the non-aggregate rows so they can be used in aggregate calculations. I'm not familiar enough with the various statistical tests to be able to translate them into Tableau formulas, if you wanted to help explain the computation that needs to be done I'm pretty sure I could get something that works in Tableau.

                     

                    Jonathan

                    • 7. Re: Statistical analysis of groups of data
                      Scott Tennican

                      Hi Jonathan,

                       

                      You may be right. I'm having trouble just computing the difference in means between the control and treatment groups which is a basic component of many tests like this. If we ignore multiple comparison problems for now and just compute the critical value for a simple t-test of two independent means with equal sample size and equal variance, in R it looks like this:

                      n = length(c1)

                      xbar1 = mean(c1)

                      xbar2 = mean(t1)

                      Sxx =  sqrt( (var(c1) + var(t1)) / 2 )

                      tcrit = (xbar1 - xbar2) / (Sxx * sqrt(2 / n))

                       

                      I created some simple test data and made a workbook where the you'll see xbar1 - xbar2 is incorrect.

                       

                      thanks, Scott

                      • 8. Re: Statistical analysis of groups of data
                        Jonathan Drummey

                        Hi Scott,

                         

                        See the attached. Here's what I noticed in your original:

                         

                        - I generally set up table calcs using a crosstab with all my dimensions on Rows and Measure Names on the Columns shelf, I set this up in the attached. It makes it a lot easier to see effects of changing Compute Usings and patterns in the results.

                         

                        - Group in the view is too much level of detail. Since the Control Value and Treatment Value are row-level calcs based on the value of Group, when Group is in the view (on the Columns Shelf in this case) then the calcs and the calcs based on them return different values for each value of Group. You can see this in the Measure Names/Measure Values worksheet. When it comes time to do the xbar1-xbar2, that calc fails because there are two values for each Avg Control/Treatment Value. The solutions are:

                         

                        a) Use a combination of LOOKUP and/or PREVIOUS_VALUE to make sure that every cell gets the right values. This requires being comfortable with understanding table calculation addressing and partitioning, and how offsets work.

                        b) Remove Group from the level of detail. The one caveat to this is that each pane is now "seeing" raw data from both Groups, so Number of Records and any COUNT() functions may not be returning the desired results and therefore need some adjustment.

                         

                        I chose option b), it's a lot easier to deal with.

                         

                        - The two xbar calcs (Avg Control/Treatment Value) are set up as WINDOW_AVG(AVG([Control Value])) or [Treatment Value], but the aggregation in the view is SUM(Control Value) and SUM(Treatment Value). Since there is only one non-zero record per Trial/PatientID SUM() and AVG() will return the same value, but I changed the table calcs to use SUM() so they'd be the same no matter what. Something I often do as a sanity check is to explicitly create the aggregate calculated fields and then use those in the table calcs, so I don't accidentally use different aggregations and the get myself confused.

                         

                        - The table calculations have different (and in one case incorrect) Compute Usings. The default Compute Using is set to Patient ID for Average Treatment Value, but not Average Control Value. It appears that there aren't any changes to the defaults in the view. Based on my understanding, it seems like you want the Compute Using to be on PatientID, so it partitions on Trial. So that means that Average Control Value needs it's Compute Using set. In the case of the differenceInMeans calc, that means that you can either set each of the nested Compute Using's to PatientID via the Edit Table Calculation dialog, or just set the top-level to PatientID via the context menu and then that setting will propagate to the child calcs.

                         

                        Generally I avoid setting the default Compute Using via the "Default Table Calculation" and just set the Compute Using in the view. The reason I do this is two-fold: 1) It's easier to just remember to set the compute using in one place, and I choose to do so where it really matters, in the view. 2) If a dimension necessary for the Compute Using is not in the view at the time you bring the table calc pill into the view, the table calc fails and the view stops updating until you change the table calc's Compute Using in the view or bring the dimension in. I avoid those problems by accepting Tableau's automatic behavior. If a table calc is depending on certain addressing, then I'll add comments to the table calc's formula.

                         

                        One thing to note is that when working with the crosstab view I'll pretty much always specify the Compute Using (aka fixed addressing) for each table calc as opposed to using relative addressing (Table/Pane Across/Down/etc.). This way the results of the table calculation won't change as I rearrange the view, for example by duplicating the revised worksheet and rearranging pills to create the Control mean and Treament mean worksheets.

                         

                        - I created Sxx and tcrit calcs, and set those Compute Usings in the view to Patient ID. The one calc that I'm not sure of is the n calc, should the value be 10 or 20? If it needs to be 10, i.e. # of patients, then you could use SIZE() with a Compute Using of PatientID instead of the WINDOW_COUNT(). An alternative would be to use an extract or a data source that supports COUNTD() and use that to build n.

                         

                        - There are also various optimizations that could be done, but I'm not sure about the desired final view. For example, if the end goal was to only show a single row per Trial for xbar, n, the difference in means (show a single value of values that are the same for every row in the partition) then I'd wrap the calcs in IF FIRST()==0 THEN ...calc... END. If there was a goal to show some value compared to each patientID (return the same value to every row in the partition), then I'd wrap the calcs in PREVIOUS_VALUE() so that way the calcs would be computed once. Also, in v7 there's an additional optimization that is useful for large data sets for any of the WINDOW_XXX() functions, see http://www.clearlyandsimply.com/clearly_and_simply/2011/01/another-look-at-site-catchment-analysis-with-tableau-6-part-3.html for details.

                         

                        Hope this helps!

                         

                        Jonathan

                        • 9. Re: Statistical analysis of groups of data
                          Scott Tennican

                          Hi Jonathan,

                           

                          Your advice really helped. All I needed to add to get the correct critical values was to replace n with Size() in the tcrit calc. Then I added an approximated pValue and categorized treatment and and control as Different or the Same. The approximation is based on an approximation of the normal distribution which itself is an approximation of the student-t distribution which is the distribution of the null hypothesis for this or any other t-test. I decided to take the advice of the paper here: http://web2.uwindsor.ca/math/hlynka/zogheibhlynka.pdf and use their recommended simple formula. It worked much better than the slightly simpler one that I have used before. The correct values if you used a Welch Two Sample t-test and an actual student-t distribution are: 0.0213 and 0.07902. My approximation gives 0.0182 and 0.0671 which gives the same result for a 95% confidence level.

                          So, this solution could do the job for some people.

                           

                          thanks, Scott