5 Replies Latest reply on Jul 17, 2018 9:03 PM by Aaron Sheldon

# Comparison of Means between Unequal Groups

Hello everyone.

I am trying to compare two groups of dealers. One group has around 140,000 dealers (control group) and the other has 4,500 dealers (experimental group). I have combined all the dealers into one data set and created a column that differentiates between the two groups: the control group is group A and the experimental group is group B. The problem I am facing now is how to make a fair comparison between the groups when their sizes are so different. I think I might have to run a t-test, but I am not sure of the best way to go about it.

Can someone give me some suggestions? I have a sales and profit column for all the dealers which is what I am primarily interested in comparing. It would not be much help in comparing them unless they are on equal measure. Hope all this makes sense. Thanks.

• ###### 1. Re: Comparison of Means between Unequal Groups

If you've got your dealers (here Category) on Rows, you can drag Sales and Profit onto Label on the Marks card. Right-click on Sales and select Measure -> Average, then do the same for Profit.

Please let me know if that helps,

Best,

Luisa

• ###### 2. Re: Comparison of Means between Unequal Groups

Hey Danish,

Did you consider metrics like:

Sales per dealer, profit per dealer

profit over sales ratio for entire groups

and similar metrics where you normalize before you compare.
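To make the idea concrete outside of Tableau, here is a minimal Python sketch of those normalized metrics. The function name `group_metrics`, the tuple layout, and the toy numbers are all assumptions for illustration, not the real dealer data.

```python
from collections import defaultdict


def group_metrics(rows):
    """Compute per-group normalized metrics.

    rows: iterable of (group, sales, profit) tuples, one per dealer.
    """
    totals = defaultdict(lambda: {"dealers": 0, "sales": 0.0, "profit": 0.0})
    for group, sales, profit in rows:
        t = totals[group]
        t["dealers"] += 1
        t["sales"] += sales
        t["profit"] += profit

    metrics = {}
    for group, t in totals.items():
        metrics[group] = {
            "dealers": t["dealers"],
            # Per-dealer averages do not depend on how many dealers a group has
            "sales_per_dealer": t["sales"] / t["dealers"],
            "profit_per_dealer": t["profit"] / t["dealers"],
            # Ratio for the group as a whole: total profit over total sales
            "profit_over_sales": t["profit"] / t["sales"] if t["sales"] else float("nan"),
        }
    return metrics


# Hypothetical toy data: (group, sales, profit)
dealers = [
    ("A", 100.0, 10.0),
    ("A", 300.0, 50.0),
    ("A", 200.0, 30.0),
    ("B", 400.0, 100.0),
    ("B", 600.0, 100.0),
]
metrics = group_metrics(dealers)
for group, values in sorted(metrics.items()):
    print(group, values)
```

Because every metric is a per-dealer average or a within-group ratio, the 140,000 vs 4,500 size difference no longer distorts the comparison.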

• ###### 3. Re: Comparison of Means between Unequal Groups

Thanks Luisa. That seems to help in normalizing the data.

• ###### 4. Re: Comparison of Means between Unequal Groups

Thanks Nitish. I did not think of that. That is a good suggestion though.

• ###### 5. Re: Comparison of Means between Unequal Groups

With a sample size that large you have more than enough statistical power to use a non-parametric test like the Mann-Whitney U test for stochastic dominance in a single ranking dimension, such as sales. Odds are good, though, that there will be a statistically significant difference simply because of the large sample size. The harder question is whether it is a meaningful difference, which is driven by two considerations. First, why is there a difference? Answering that depends on whether you have included the correct explanatory latent variables in your data. Second, is the difference large enough to have an operational impact? With a sample size that large, even a less powerful non-parametric test will be very sensitive to small differences in the distributions.
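As a rough illustration of the test mentioned above, here is a standard-library Python sketch of a two-sided Mann-Whitney U test. It uses the normal approximation with a tie correction and no continuity correction, which is a simplifying assumption that is reasonable for large samples. The function name `mann_whitney_u` and the toy sales figures are hypothetical. The third return value is an effect size (the estimated probability that a random group B dealer outsells a random group A dealer), which speaks to the "is it meaningful" question more directly than the p-value does.

```python
import math
from statistics import NormalDist


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (u_b, p_value, prob_superiority), where prob_superiority
    estimates P(B > A) + 0.5 * P(B == A).
    """
    n_a, n_b = len(a), len(b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    n = n_a + n_b

    # Pool the observations, tagging each with its group (0 = A, 1 = B)
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])

    rank_sum_b = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_b += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 1)
        i = j

    u_b = rank_sum_b - n_b * (n_b + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        p_value = 1.0  # every observation is identical
    else:
        z = (u_b - mean_u) / math.sqrt(var_u)
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_b, p_value, u_b / (n_a * n_b)


# Hypothetical toy sales figures
control = [120, 95, 130, 110, 105]
experimental = [140, 150, 135, 160, 145]
u_b, p_value, prob_sup = mann_whitney_u(control, experimental)
print(u_b, p_value, prob_sup)
```

With 140,000 and 4,500 dealers the p-value will often be tiny even when `prob_sup` is very close to 0.5, so the effect size is the number to look at when judging operational impact.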

The really powerful question is to ask: accounting for all the factors that the dealers cannot control, what is the difference in sales? In the life sciences this is referred to as risk adjustment, although propensity scoring might work as well. Answering this question will not only tell you which dealers are adding the most value due to their behaviour, but also which dealers have large unrealised potential for generating sales, if placed in a better market.
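A very simplified way to see what adjustment does is to stratify on a factor the dealers cannot control and compare the groups within each stratum. The sketch below is not a full risk-adjustment or propensity-score model, only a stratified comparison. The function name `stratified_difference`, the market type used as the stratum, and the numbers are assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import fmean


def stratified_difference(rows):
    """Stratum-size-weighted mean of within-stratum differences mean(B) - mean(A).

    rows: iterable of (group, stratum, sales), with group "A" (control)
    or "B" (experimental). Strata missing either group are skipped.
    """
    by_stratum = defaultdict(lambda: {"A": [], "B": []})
    for group, stratum, sales in rows:
        by_stratum[stratum][group].append(sales)

    weighted_sum = 0.0
    total_weight = 0
    for cells in by_stratum.values():
        if not cells["A"] or not cells["B"]:
            continue  # no within-stratum comparison is possible here
        weight = len(cells["A"]) + len(cells["B"])
        weighted_sum += weight * (fmean(cells["B"]) - fmean(cells["A"]))
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no stratum contains dealers from both groups")
    return weighted_sum / total_weight


# Hypothetical toy data: group B happens to sit mostly in the stronger urban market
rows = [
    ("A", "rural", 100), ("A", "rural", 120), ("A", "rural", 110), ("A", "urban", 300),
    ("B", "rural", 115), ("B", "urban", 310), ("B", "urban", 300), ("B", "urban", 320),
]
raw = fmean(s for g, _, s in rows if g == "B") - fmean(s for g, _, s in rows if g == "A")
adjusted = stratified_difference(rows)
print(raw, adjusted)
```

In this toy data the raw difference in mean sales is 103.75, but once market type is held fixed the difference shrinks to 7.5. Most of the raw gap came from where the dealers are located, not from what they do.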

For any financial data I would advise against tests, like the t-test, that rely on assumptions of normality. Most financial processes tend to originate from geometric stochastic processes, which result in variables whose distributions have power-law tails.
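A small sketch of why heavy tails matter for mean-based comparisons: one very large dealer can dominate the arithmetic mean, while the median and the geometric mean (the mean on a log scale) barely move. The sales figures below are invented for illustration.

```python
from statistics import fmean, geometric_mean, median

# Hypothetical sales figures: four ordinary dealers and one very large one
sales = [100, 120, 110, 130, 10000]

arithmetic = fmean(sales)         # pulled far upward by the single large dealer
middle = median(sales)            # depends only on the ordering of the values
log_scale = geometric_mean(sales) # averages on the log scale, damping the outlier

print(arithmetic, middle, log_scale)
```

A test that compares arithmetic means inherits this sensitivity, which is one reason rank-based tests or a log transform are often suggested for sales and profit data.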