2 Replies Latest reply on Jun 8, 2012 11:35 AM by Scott Tennican

    Pierson Correlation for text fields

      Hi there,


      I have a table that basically looks like the following (it has more columns, but the import things are the studentid, questioned and a result).

      I have shown the id's as strings since they basically are strings in the database.




      I am looking for a way to visual the correlation between questions, so I'd like to have a table that looks something like this:




      I would like to show colored *****/circles where the size/color indicate the pearson coefficient.

      Eventually I would like to either filter based on gender/age/class/teacher, or show several ***** per question that give this information in one big sheet.


      We're trying to find questions that don't add overly much to our final results.

      Our dataset is about 140.000 students x 30 questions.

        • 1. Re: Pierson Correlation for text fields
          Jonathan Drummey

          Hi Eljakim,


          I'm pinging Scott Tennican for help on this one.



          • 2. Re: Pierson Correlation for text fields
            Scott Tennican

            Hi Eljakim,


            Pearson's correlation coefficient is a measure of linear dependence of two continuous variables.

            Your data is categorical so Pearson's and other measures of correlation do not apply.


            Instead, it seems that you might like to build a contingency table of counts and apply a Pearson's chi-squared test for homogeneity/independence of multiple categorical populations.

            I described how to do that in a thread here:



            Also, it seems that you are interested in the strength of the association.

            A simple possibility is just to look at the differences/residuals between the observed and expected counts from which you computed your chi-squared statistic. But, these residuals are hard to compare between cells because they have different ranges. If you just have a 2 by 2 table, you could compute the odds ratio for each of the cells which has a standard range. But, if your table is bigger, using odds ratio gets complicated. Another possibility is to standardize your residuals like this:

            standardized residual for a cell =

            ( observedCount - expectedCount ) / sqrt( expectedCount * (1-rowProportion) * (1-columnProportion) )


            Since there are so many ways to analyze categorical data, if you could describe the goal of your analysis in more detail and provide a packaged workbook with some data for me to work with, I could help you better.