Comparing Distributions
What does comparing distributions mean?
- Many questions will give you data that has been split into two related categories
- e.g. Data about daily screen time by children and adults
- the data is all about screen time but has been split into two distributions
- one for children
- one for adults
- To compare distributions, you should compare two things
- the average of the distributions
- the spread (variation) of the distributions
How do I compare the averages of two data sets (distributions)?
- Choose the appropriate average (mode, median or mean)
- The mean includes all the data
- The median is not affected by extreme values
- The mode can be used for non-numerical data
- Consider whether it is better for the average to be bigger or smaller
- If you are comparing time to complete a puzzle - the smaller the average the better
- If you are comparing test scores - the bigger the average the better
- Give numerical values for the average and explicitly compare
- e.g. The mean for dogs is 17 kg which is bigger than the mean for cats which is 13 kg
- Give your comparison in context
- e.g. The mean for dogs is bigger which suggests that, on average, dogs are heavier than cats
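The effect of an extreme value on each average can be checked with a short calculation. The following is a minimal sketch using Python's standard `statistics` module; the puzzle times and pet names are made-up illustrative data, not taken from any question.

```python
from statistics import mean, median, mode

# Hypothetical puzzle completion times in seconds (made up for illustration).
group_a = [42, 45, 47, 50, 51]
group_b = [42, 45, 47, 50, 121]  # same as group_a except for one extreme value

mean_a, mean_b = mean(group_a), mean(group_b)
median_a, median_b = median(group_a), median(group_b)

# The mean uses every value, so the extreme value pulls it upwards.
print(f"Mean:   A = {mean_a} s, B = {mean_b} s")
# The median only depends on the middle value, so it does not change.
print(f"Median: A = {median_a} s, B = {median_b} s")

# The mode can be found for non-numerical data too.
favourite_pets = ["dog", "cat", "dog", "fish", "dog", "cat"]
most_common_pet = mode(favourite_pets)
print("Mode:", most_common_pet)
```

Here the medians are equal but the means differ, so the median would be the more suitable average for comparing these two groups.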
How do I compare the spread (variation) of two data sets (distributions)?
- Choose the appropriate measure of spread (range or interquartile range)
- The range is affected by extreme values
- The interquartile range focuses on the middle 50%
- Consider whether it is better for the spread to be bigger or smaller
- A smaller spread implies the data is more consistent
- A bigger spread implies the data is more varied
- Give numerical values for the measure of spread and explicitly compare
- e.g. The interquartile range for dogs is 6 kg which is bigger than the interquartile range for cats which is 4 kg
- Then give your comparison in context
- e.g. The interquartile range for cats is smaller which suggests that the weights of cats are more consistent and less spread out than the weights of dogs
- When comparing raw data sets, you may also need to check for outliers in either distribution
- If one, or both, of the data sets has a data value that is much larger or smaller than the others, you may need to mention this and suggest a possible reason
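The range, interquartile range and a check for outliers can be sketched in code as follows. This is an illustration under stated assumptions: the weights are made-up data, `spread_summary` is a hypothetical helper, the quartiles use the default "exclusive" method of Python's `statistics.quantiles` (other quartile conventions can give slightly different values), and the 1.5 × IQR rule for flagging outliers is a common convention that is not stated in these notes.

```python
from statistics import quantiles

def spread_summary(data):
    """Return the range, interquartile range and possible outliers of a data set."""
    values = sorted(data)
    # quantiles(..., n=4) returns the lower quartile, median and upper quartile.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    # Assumption: a value is flagged if it lies more than 1.5 x IQR beyond a quartile.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "range": values[-1] - values[0],
        "iqr": iqr,
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical weights in kg (made up for illustration).
dogs = [9, 12, 15, 16, 18, 20, 41]
cats = [10, 11, 12, 13, 14, 15, 16]

dog_spread = spread_summary(dogs)
cat_spread = spread_summary(cats)
print("Dogs:", dog_spread)
print("Cats:", cat_spread)
```

In this made-up data the single very heavy dog inflates the range far more than the interquartile range, which is why the interquartile range is often the better choice when extreme values are present.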
Worked example
Julie collects data on the distances travelled by snails and slugs over the duration of ten minutes. She records a summary of her findings as shown in the table below.
| Median | Interquartile range |
Snails | 7.1 cm | 3.1 cm |
Slugs | 9.7 cm | 4.5 cm |
Compare the distances travelled by snails and slugs over the duration of ten minutes.
For average, compare medians - remember to comment with numbers, then in context.
Slugs have the higher median (9.7 cm > 7.1 cm), which suggests that, on average, slugs travel further than snails in ten minutes.
For spread, compare interquartile ranges - again, comment with numbers, then in context.
Snails have the lower interquartile range (3.1 cm < 4.5 cm), which suggests that there is less variation in the distances travelled by snails than in the distances travelled by slugs.
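The same comparison can be written as a short script. This is only a sketch: the values come from Julie's table, while the dictionary layout and variable names are assumptions chosen for illustration.

```python
# Summary statistics from the table, in cm.
summary = {
    "Snails": {"median": 7.1, "iqr": 3.1},
    "Slugs": {"median": 9.7, "iqr": 4.5},
}

# Average: find which animal has the higher median.
higher_median = max(summary, key=lambda name: summary[name]["median"])
# Spread: find which animal has the lower interquartile range.
lower_iqr = min(summary, key=lambda name: summary[name]["iqr"])

for name, stats in summary.items():
    print(f"{name}: median {stats['median']} cm, IQR {stats['iqr']} cm")

print(f"{higher_median} have the higher median, so on average they travel further.")
print(f"{lower_iqr} have the lower interquartile range, so their distances vary less.")
```

The script only identifies which value is bigger or smaller; a full answer still needs the numbers quoted and the comparison put in context, as in the written answer.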