2 Replies Latest reply on Jun 8, 2012 11:35 AM by Scott Tennican

# Pierson Correlation for text fields

Hi there,

I have a table that basically looks like the following (it has more columns, but the import things are the studentid, questioned and a result).

I have shown the id's as strings since they basically are strings in the database.

StudentIDQuestionIDGenderResult
id1piratemalecorrect
id1forestmalecorrect
id2piratemaleincorrect
id2forestmaleincorrect
id3forestfemalecorrect
id1rowingmalecorrect
id2rowingmaleincorrect
id3rowingfemalecorrect

I am looking for a way to visual the correlation between questions, so I'd like to have a table that looks something like this:

pirateforestrowing
pirate1??
forest?1?
rowing??1

I would like to show colored *****/circles where the size/color indicate the pearson coefficient.

Eventually I would like to either filter based on gender/age/class/teacher, or show several ***** per question that give this information in one big sheet.

We're trying to find questions that don't add overly much to our final results.

Our dataset is about 140.000 students x 30 questions.

• ###### 1. Re: Pierson Correlation for text fields

Hi Eljakim,

I'm pinging Scott Tennican for help on this one.

Jonathan

• ###### 2. Re: Pierson Correlation for text fields

Hi Eljakim,

Pearson's correlation coefficient is a measure of linear dependence of two continuous variables.

Your data is categorical so Pearson's and other measures of correlation do not apply.

Instead, it seems that you might like to build a contingency table of counts and apply a Pearson's chi-squared test for homogeneity/independence of multiple categorical populations.

I described how to do that in a thread here:

Also, it seems that you are interested in the strength of the association.

A simple possibility is just to look at the differences/residuals between the observed and expected counts from which you computed your chi-squared statistic. But, these residuals are hard to compare between cells because they have different ranges. If you just have a 2 by 2 table, you could compute the odds ratio for each of the cells which has a standard range. But, if your table is bigger, using odds ratio gets complicated. Another possibility is to standardize your residuals like this:

standardized residual for a cell =

( observedCount - expectedCount ) / sqrt( expectedCount * (1-rowProportion) * (1-columnProportion) )

Since there are so many ways to analyze categorical data, if you could describe the goal of your analysis in more detail and provide a packaged workbook with some data for me to work with, I could help you better.

Scott