Contingency Table Tests Using the G statistic in R

Version 12
    Introduction

     

    Tableau, as the name implies, enables the visualization of tabulated data, which in turn expresses the multidimensional contingencies present in the data. Of central concern to data scientists working with contingency tables is whether the apparent relationship between the factors (the values of the dimension fields) arose by the chance of independent sampling, or represents a genuine contingency between the factors.

     

    The χ2 test, for example, is one of the earliest statistical tests of contingency tables, originally derived for the case of 2 × 2 tables. The χ2 test is itself an analytic simplification of the more general likelihood ratio test based on the G statistic, which in the limit of large sample sizes is χ2 distributed. The G test is robust against incorrect model specification, as it is a non-parametric (or semi-parametric) technique. The trade-off is that the G test is less powerful than tests built on explicitly parameterized models. At the large sample sizes typical of Tableau use cases this is rarely a limiting concern.

     

    In the remainder of this note we will explain how to calculate the G statistic in Tableau, and how to test its significance by finding the tail probability of the appropriate χ2 distribution through a call to R.

     

    Analysis

     

    The general form of the G statistic, derived from the likelihood ratio test, is given by the equation:

     

    G = 2 × Σ_factor ( observed_factor × ln( observed_factor / expected_factor ) )

     

    To aid in implementing the G statistic and test in Tableau, we will work through the meaning of each symbol, reading from right to left:

     

    • expected_factor is the expected number of samples for each distinct combination of factors, under the assumption that the dimensions are independent.
    • observed_factor is the observed number of samples in each distinct combination of factors found in the dimension fields of the tabulation.
    • ln is the natural logarithm.
    • Σ_factor is the sum over every distinct combination of factors found in the dimension fields of the tabulation.

     

    The G statistic, by derivation from the likelihood ratio test, is asymptotically χ2 distributed, with the degrees of freedom given by the difference between the numbers of frequency parameters estimated under the two compared hypotheses.
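
    Before turning to the Tableau implementation, a minimal R sketch may help make the notation concrete. The table below is made up purely for illustration, the variable names are chosen to mirror the symbols above, and the Pearson statistic from chisq.test is included only as a sanity check that the two statistics are close at this sample size.

    # A made-up 2 x 3 table of observed counts, for illustration only
    observed <- matrix(c(30, 10, 20,
                         20, 25, 15),
                       nrow = 2, byrow = TRUE)
    n <- sum(observed)                                  # total number of observations

    # Expected counts under independence: product of the marginal totals over n
    expected <- outer(rowSums(observed), colSums(observed)) / n

    # G statistic: 2 * sum over cells of observed * ln(observed / expected)
    G <- 2 * sum(observed * log(observed / expected))

    # Pearson chi-square statistic on the same table, for comparison
    chisq.test(observed, correct = FALSE)$statistic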

     

    The observed number of samples for each distinct combination of factors, found in a single cell, is simply the record count aggregate:

     

    // Courtesy alias for the observed number of records in each cell

    [observed] := SUM([Number of Records])

     

    It will be helpful to have the sample count over the complete data set available for later use:

     

    // Total number of observations

    [observed all] := ATTR({ FIXED : [observed] })

     

    The expected number of samples requires more care, as it entails computing the independent frequency of each factor in each dimension, and then taking the product of those frequencies for each cell. As a first step we compute the observed counts for each factor in each dimension:

     

    // Count the number of observed samples within

    // each factor of dimension X

    [observed X] := ATTR({ FIXED [Dimension X] : [observed] })

     

    The expected independent frequencies follow as the ratio of the observed samples in each factor of a dimension to the count of all observations:

     

    // The estimated frequency of each factor in dimension X

    [frequency X] := [observed X] / [observed all]

     

    Next we find the expected number of samples by taking the product over the independent expected frequencies, and multiplying by the total number of observations in the whole table:

     

    // For each cell calculate the expected samples assuming the dimensions are independent

    [expected] := [observed all] * [frequency X] * ⋯ * [frequency Z]
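
    For two dimensions this product-of-frequencies construction reduces to the familiar row total times column total over n rule. Continuing the illustrative R sketch introduced earlier (same made-up table), the equivalence can be checked directly:

    # Rebuild the expected counts from the marginal frequencies, mirroring
    # [observed all] * [frequency X] * [frequency Y]
    frequency_rows <- rowSums(observed) / n             # analogue of [frequency X]
    frequency_cols <- colSums(observed) / n             # analogue of [frequency Y]
    expected_from_frequencies <- n * outer(frequency_rows, frequency_cols)

    all.equal(expected, expected_from_frequencies)      # TRUE: identical expected counts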

     

    The G statistic follows as a single table calculation; WINDOW_SUM must be used here because TOTAL cannot nest table calculations:

     

    // The G statistic from the likelihood ratio test

    [G statistic] := 2 * WINDOW_SUM([observed] * LN([observed] / [expected]))

     

    Computing the G test requires calculating the tail probability of the χ2 distribution, which in turn requires a degrees of freedom parameter. The degrees of freedom is the difference between the number of frequency parameters estimated under the hypothesis that the factors are contingent and the number estimated under the hypothesis that the factors are independent. Under the hypothesis that the factors are contingent, the number of frequency parameters estimated is the number of cells minus 1:

     

    // The number of parameters estimated for all interactions

    // is the count of cells minus one

    [parameters] := SIZE() - 1

     

    Under the hypothesis that the factors are independent, the number of frequency parameters estimated is the sum, over the dimensions, of the number of distinct values in each dimension minus 1:

     

    // The number of frequency parameters estimated in dimension X

    [parameters X] := ATTR({ FIXED : COUNTD([Dimension X]) }) - 1

     

    The minus 1 term occurs because the frequency parameters must sum to 1, so one frequency parameter is linearly dependent on the others. The degrees of freedom is then the difference between the number of parameters estimated under the fully contingent hypothesis and under the independent hypothesis:

     

    // Difference between the number of estimated parameters

    [degrees of freedom] := [parameters] - [parameters X] - ⋯ - [parameters Z]
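
    As a worked check on this counting argument, the made-up 2 × 3 table from the earlier R sketch recovers the familiar (rows − 1) × (columns − 1) result:

    # Degrees of freedom for the made-up 2 x 3 table
    parameters      <- length(observed) - 1             # cells minus one: 6 - 1 = 5
    parameters_rows <- nrow(observed) - 1               # 2 - 1 = 1
    parameters_cols <- ncol(observed) - 1               # 3 - 1 = 2
    df <- parameters - parameters_rows - parameters_cols    # 5 - 1 - 2 = 2 = (2 - 1) * (3 - 1)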

     

    Finally we can compute the G test by a call to R, to return the upper tail probability of the χ2 distribution:

     

    // Call the chi square function in R

    [G test] := SCRIPT_REAL(

         "pchisq(q = .arg1, df = .arg2, lower.tail = FALSE)",

         [G statistic],

         [degrees of freedom]

    )
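
    The same expression can be evaluated directly in an R console as a sanity check before wiring up SCRIPT_REAL; here G and df are the values carried over from the running illustrative sketch:

    # Upper tail probability of the chi-square distribution at the observed G
    pchisq(q = G, df = df, lower.tail = FALSE)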

     

    This omnibus calculation of significance is an excellent value to present in either the title or caption of a worksheet. If the G test indicates that the observed contingencies between the dimensions are unlikely to be due to chance alone, then further exploratory analysis is warranted to discern the specific relationships in the data (for example, a G test value ≤ 0.01 indicates that a G statistic as large or larger than the observed one would occur in ≤ 1% of replicated data sets under the hypothesis that the dimensions are independent).

     

    Example

     

    Linked below is an example workbook using randomly generated data. The method for generating the data can be found in this Jupyter notebook. In the attached workbook the call to R in the G test calculated field has been commented out and replaced with the fixed value that R would have returned. The R functionality can easily be restored by editing the G test calculated field. An example workbook can also be found here on GitHub.

     

    https://public.tableau.com/static/images/co/contingency/GTest/1.png

     

    Notes

     

    1. You will need an instance of Rserve running to calculate the test; all other calculations are done in Tableau.
    2. In each instance the table calculations are computed over the whole table, down (the default setting).
    3. The calculations could be simplified and inlined; the variables were explicitly instantiated to illustrate the calculation.
    4. If you are using data densification, zeroes in either [expected] or [observed] need to be handled by replacing the natural logarithm with zero using the ZN() function (an R analogue is sketched after these notes).
    5. A full discussion of the interpretation of significance tests is beyond the scope of this note.
    6. This work was supported by Alberta Health Services, in preparation for a series of short lessons on longitudinal analysis in Tableau.
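
    The zero-count issue in note 4 has a direct R analogue (the counts below are made up): 0 * log(0) evaluates to NaN in R, just as LN(0) is null in Tableau, so terms with a zero observed count are forced to zero explicitly, in keeping with the limit of x ln x as x goes to 0.

    # Guarded per-cell term of the G statistic: a zero observed count contributes zero
    g_term <- function(o, e) ifelse(o == 0, 0, o * log(o / e))

    observed_sparse <- c(12, 0, 7)                      # densified cells, one with zero observations
    expected_sparse <- c(10, 4, 5)                      # hypothetical expected counts
    G_sparse <- 2 * sum(g_term(observed_sparse, expected_sparse))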