Types of Data (Edexcel GCSE Statistics): Revision Note

Exam code: 1ST0

Written by: Roger B

Reviewed by: Dan Finlay

Updated on 3 November 2024

Types of Collected Data

What types of data do I need to be familiar with?

There are a number of terms for types of data that you need to be familiar with
- You need to recognise and understand them when they appear in exam questions
- And be able to use them when writing your answers to questions
Raw data is data in exactly the form that it was collected
- i.e. before it has been organised or processed in any way
Raw data can be either quantitative or qualitative
- Quantitative data can be recorded as a number
  - e.g. heights, lengths of time, numbers of people or objects, shoe sizes, etc.
- Qualitative data cannot be recorded as a number
  - e.g. colours, flavours, kinds of animal, makes of car, etc.
Quantitative data can be either continuous or discrete
- Continuous data can take any numerical value on a scale
  - e.g. height, length, weight, mass
  - For continuous data the measurements can become more and more precise the more you 'zoom in'
- Discrete data can only take on particular numerical values on a scale
  - Often these are integers (e.g. numbers of people or objects)
  - But they don't have to be integers (e.g. shoe sizes, which include 'half sizes')
Categorical data is data that can be organised into non-overlapping categories
- 'Non-overlapping' is important here
  - Each piece of data can belong to one and only one category
  - e.g. heights less than 1.7 metres ( $h < 1.7$ ) and heights greater than or equal to 1.7 metres ( $h \geq 1.7$ )
  - but not $h \leq 1.7$ and $h \geq 1.7$ (because a height of 1.7 metres would belong to both categories)
- The categories can be numerical or non-numerical
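One way to test whether a set of categories is non-overlapping is to count how many categories a given value belongs to. The short Python sketch below is an illustration only. The name `count_matches`, and the idea of writing each category as a small rule, are assumptions made for this example.

```python
# Each category is written as a rule that returns True
# if the height h (in metres) belongs to that category.
good_categories = [
    lambda h: h < 1.7,    # h < 1.7
    lambda h: h >= 1.7,   # h >= 1.7
]
bad_categories = [
    lambda h: h <= 1.7,   # h <= 1.7
    lambda h: h >= 1.7,   # h >= 1.7 (shares the value 1.7 with the rule above)
]

def count_matches(categories, value):
    """Return how many of the categories the value belongs to."""
    return sum(1 for rule in categories if rule(value))

print(count_matches(good_categories, 1.7))  # 1
print(count_matches(bad_categories, 1.7))   # 2
```

A height of 1.7 metres belongs to exactly one of the first pair of categories, but to both of the second pair, so the second pair is overlapping.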
Ordinal data is categorical data that can be written in order
- If the data is numbers, these can be ordered in the usual way
- If the data is not numbers, then it must be possible to apply a numerical 'rating scale'
  - e.g. a scale of 1 to 5 with 1 as 'disagree strongly' and 5 as 'agree strongly'
Bivariate data is data that is collected as pairs of values
- This could be data collected to investigate
  - the relationship between two variables
  - how changes in one variable affect the other variable
- e.g. age of car and cost of annual maintenance, train ticket price and length of journey, etc.
Multivariate data is data that is collected in sets of more than two values
- e.g. cholesterol levels, blood pressure and weight for a number of patients in a study

What is the difference between primary data and secondary data?

For the exam, you need to know the difference between primary data and secondary data
- This includes recognising the advantages and disadvantages of each
Primary data is data that is collected either by the person who is going to use it, or specifically for the person who is going to use it
- Advantages of primary data:
  - Can be gathered specifically for the question you are trying to answer
  - The level of accuracy will be known
  - The collection method will be known
- Disadvantages of primary data:
  - Collecting data can require a lot of time
  - It can also be expensive
Secondary data is data that has been collected by somebody else
- Some possible sources for secondary data:
  - the internet
  - print media (newspapers, magazines, etc.)
  - databases
  - research articles
  - census returns
- Advantages of secondary data:
  - Can be quicker to obtain (i.e. less time)
  - Can be easier to obtain (i.e. more convenient)
  - Less expensive than collecting data yourself
  - May be more accurate than data you collect yourself (depending on the source)
- Disadvantages of secondary data:
  - May be hard to find relevant data for your specific question
  - The data may be out of date
  - The level of accuracy may not be known (e.g. the data may have been rounded)
  - The collection method may not be known
  - The source of the data may not be reliable
- If you use secondary data, it is always necessary to acknowledge the source that the data was taken from

Worked Example

(a) Which of the following words can be used to describe the data in the following examples?

quantitative, qualitative, continuous, discrete

More than one word might be applicable in each case.

(i) The weights of dogs participating in a dog show.

Weight is recorded by a number, so it is quantitative data
And weight can take on any value, so it is continuous

quantitative, continuous

(ii) The favourite ice cream flavours of the students in a school.

Flavour is not recorded as a number, so it is qualitative data
And only quantitative data can be discrete or continuous

qualitative

(iii) The number of computers owned by each household in a particular city.

The data is recorded as numbers, so it is quantitative
But only integer (i.e. whole number) values are possible, so it is discrete

quantitative, discrete

(b) Write down two types of data you could collect about cars owned by people in a particular region. State whether each type of data is categorical and/or ordinal.

You could record the make of each car (Renault, Ford, etc.)
This is categorical, because the data can be put into non-overlapping categories (just use the different makes as the categories!)
It is not ordinal, because it cannot be arranged in numerical order

Make of car (categorical, not ordinal)

You could also record the engine size of the car in cubic centimetres (cc)
This is categorical, because the data can be put into non-overlapping categories (just make sure to select the categories carefully!)
It is also ordinal, because the sizes can be put into numerical order

Size of engine in cc (categorical, ordinal)

(c) Gihan is investigating the lateness of flight departures at Heathrow Airport. Explain why it is sensible for Gihan to collect secondary data for his investigation.

It will be quicker and less expensive for Gihan to use secondary data, instead of collecting it himself.
It will also be much easier to find a large amount of data from a secondary source.

Grouped & Ungrouped Data

What are the advantages and disadvantages of grouping data?

For a relatively small data set it is okay to leave the data in ungrouped form
- e.g. the heights (in metres) of eight students in a school club
  
  1.57 1.63 1.69 1.71 1.77 1.79 1.81 1.84
- There are not too many values in that data set
  - so it is possible to get a 'feel' for the set just by looking at the list of values
For a large data set it is often more useful to present the data in grouped form
- The data is divided into a number of categories
  - and the frequency of each category (i.e., the number of values in each category) is reported
- The categories are known as classes
- The intervals defining what goes into what class are known as class intervals
Advantages of using grouped data:
- The distribution of the data can be seen more clearly
- Patterns in the data can be spotted more easily
Disadvantages of using grouped data:
- The exact data values are no longer visible
  - You can only see how many values fall within each class
- Statistics calculated from grouped data are less precise
  - e.g. mean, median and mode from grouped data can only be estimates
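To see why statistics from grouped data are only estimates, the Python sketch below groups the eight heights listed above and compares the exact mean with a mean estimated from the groups. The class intervals, and the use of class midpoints to estimate the mean, are assumptions chosen for this illustration.

```python
heights = [1.57, 1.63, 1.69, 1.71, 1.77, 1.79, 1.81, 1.84]

# Assumed class intervals (lower <= h < upper), chosen only for this illustration
classes = [(1.5, 1.6), (1.6, 1.7), (1.7, 1.8), (1.8, 1.9)]

# Frequency of each class: count the heights that fall inside it
frequencies = [sum(1 for h in heights if lower <= h < upper)
               for lower, upper in classes]

# Exact mean from the raw (ungrouped) data
exact_mean = sum(heights) / len(heights)

# Estimated mean from the grouped data, using the midpoint of each class
midpoints = [(lower + upper) / 2 for lower, upper in classes]
estimated_mean = (sum(m * f for m, f in zip(midpoints, frequencies))
                  / sum(frequencies))

print(frequencies)      # [1, 2, 3, 2]
print(exact_mean)
print(estimated_mean)
```

The exact mean is about 1.726 metres, while the grouped estimate is about 1.725 metres. The two are close but not identical, because the grouped table no longer shows the exact values.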

What things are important when grouping data?

You must be careful when selecting the class intervals for grouped data
The class intervals must not overlap
- For discrete data make sure no data value occurs in more than one class interval
  - e.g. 0-10, 11-20, 21-30, etc.
- For continuous data the class intervals also must not have any gaps between them
  - e.g. $0 \leq x < 10, 10 \leq x < 20, 20 \leq x < 30$ , etc.
  - $0 \leq x \leq 10$ and $11 \leq x \leq 20$ would not be good because there is a gap between 10 and 11
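A check for gaps and overlaps can be sketched in Python by comparing where each class interval ends with where the next one begins. The function name `find_problems` is hypothetical. The sketch assumes the intervals are listed in increasing order and only compares their end points.

```python
def find_problems(intervals):
    """Compare consecutive class intervals given as (lower, upper) pairs.

    Returns a list of strings describing any gaps or overlaps found.
    """
    problems = []
    for (lower1, upper1), (lower2, upper2) in zip(intervals, intervals[1:]):
        if upper1 < lower2:
            problems.append(f"gap between {upper1} and {lower2}")
        elif upper1 > lower2:
            problems.append(f"overlap between {lower2} and {upper1}")
    return problems

print(find_problems([(0, 10), (10, 20), (20, 30)]))  # []
print(find_problems([(0, 10), (11, 20)]))            # ['gap between 10 and 11']
```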
Open-ended class intervals can be used where minimum or maximum values aren't known
- e.g. $x < 30$ for the first class interval
- or $x > 90$ for the last one
Consider how many class intervals to use for grouping the data
- If there are too many intervals (too much detail) or too few intervals (not enough detail)
  - then it can be hard to spot trends in the data
Class intervals do not all need to be the same width
- You will often see grouped data where the class intervals have equal widths
  - This is appropriate when the data is roughly evenly spread out
- But sometimes unequal class widths might be more appropriate
  - e.g. when most of the data values are clustered 'in the middle'
  - It might make more sense to have wider intervals at the start and end
  - and narrower intervals in the middle
- Too many or too few data values falling into certain class intervals
  - can make the data representation less useful
Also be careful with class intervals when working with rounded data values
- All values that might round to a particular value must fall within the same class interval
- e.g. if the data is time rounded to the nearest second
  - then $60 \leq t < 70$ and $70 \leq t < 80$ would not be good intervals to use
  - (because a measurement of 70 seconds to the nearest second could be anywhere between 69.5 and 70.5 seconds)
  - Use $59.5 \leq t < 69.5$ and $69.5 \leq t < 79.5$ instead
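The effect of rounding on class intervals can be illustrated with a short Python sketch. The unrounded time of 69.7 seconds is a made-up value, and `class_index` is a hypothetical helper that reports which class a value falls into (counting the first class as 0).

```python
from bisect import bisect_right

def class_index(value, boundaries):
    """Return the position of the class containing value, where class i
    covers boundaries[i] <= value < boundaries[i + 1]."""
    return bisect_right(boundaries, value) - 1

true_time = 69.7                  # hypothetical unrounded time in seconds
recorded_time = round(true_time)  # 70, to the nearest second

poor_boundaries = [60, 70, 80]          # 60 <= t < 70 and 70 <= t < 80
better_boundaries = [59.5, 69.5, 79.5]  # 59.5 <= t < 69.5 and 69.5 <= t < 79.5

print(class_index(true_time, poor_boundaries),
      class_index(recorded_time, poor_boundaries))    # 0 1
print(class_index(true_time, better_boundaries),
      class_index(recorded_time, better_boundaries))  # 1 1
```

With the first set of intervals the true time and the recorded time land in different classes. With the second set they land in the same class.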

Worked Example

Hazel and Avelaine have been collecting data on the weights of walnuts. After rounding all the weights to the nearest gram, the weights in their data set (in grams) are as follows:

9 13 17 11 15 16 22 18 14 16 15 19

14 13 10 15 20 14 16 13 12 18 16 12

(a) Avelaine suggests using the following table to group the data:

weight ( $w$ grams)	frequency
$w < 10$
$10 \leq w < 13$
$13 \leq w < 15$
$15 \leq w < 17$
$17 \leq w < 20$
$w \geq 20$

Based on the nature of the data, suggest one problem with Avelaine's table.

Remember that rounded and unrounded values need to fall within the same class interval
The unrounded weight of any nut could be up to 0.5 grams more or less than the rounded value

Avelaine's table doesn't take account of the rounding of the data.
For example, a 9.7 g nut would fall in the $w < 10$ class interval, but the rounded value (10 g) would fall in the $10 \leq w < 13$ class interval.

(b) Hazel suggests using the following table instead:

weight ( $w$ grams)	frequency
$w < 9.5$
$9.5 \leq w < 12.5$
$12.5 \leq w < 14.5$
$14.5 \leq w < 16.5$
$16.5 \leq w < 19.5$
$w \geq 19.5$

Complete Hazel's table for the data provided.

Be sure to count carefully
For example use a tally chart and cross off values from the list once you tally them

(Figure: a tally chart for the data values in the question)

Also make sure your frequencies total up to 24 (the number of data values in the list)

weight ( $w$ grams)	frequency
$w < 9.5$	1
$9.5 \leq w < 12.5$	4
$12.5 \leq w < 14.5$	6
$14.5 \leq w < 16.5$	7
$16.5 \leq w < 19.5$	4
$w \geq 19.5$	2
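The tally can also be checked with a few lines of Python. This is only a sketch for checking the counts, not something required in the exam. It uses the standard library function `bisect_right` to find which class each weight falls into.

```python
from bisect import bisect_right

weights = [9, 13, 17, 11, 15, 16, 22, 18, 14, 16, 15, 19,
           14, 13, 10, 15, 20, 14, 16, 13, 12, 18, 16, 12]

# Boundaries between the classes in Hazel's table
boundaries = [9.5, 12.5, 14.5, 16.5, 19.5]

# One counter per class: w < 9.5, 9.5 <= w < 12.5, ..., w >= 19.5
frequencies = [0] * (len(boundaries) + 1)
for w in weights:
    frequencies[bisect_right(boundaries, w)] += 1

print(frequencies)       # [1, 4, 6, 7, 4, 2]
print(sum(frequencies))  # 24
```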

Explanatory & Response Variables

What are explanatory and response variables?

When data is collected from an experiment, the researcher usually wants to know how changes in one variable affect another variable
- The first variable is called the explanatory variable (or independent variable)
  - This is the variable that the researcher changes (or observes changes in)
  - The researcher suspects that changes in this variable will cause changes in the other variable
  - The explanatory variable is thought to 'explain' why the other variable changes
- The second variable is called the response variable (or dependent variable)
  - This is the variable that the researcher measures after changes have been made in the explanatory variable
  - The researcher suspects that this variable will be affected by changes in the explanatory variable
  - The response variable 'responds' to changes in the explanatory variable
- For example, a researcher wants to study the effects of different types of running shoe on how long it takes runners to run 100 metres
  - The explanatory variable is the type of running shoe
  - The response variable is the time taken to run 100 m
- Any other variables in an experiment are known as extraneous variables
  - These should be eliminated or minimised so they don't affect the results
You need to be very careful with explanatory and response variables when drawing a scatter diagram
- The explanatory variable MUST be on the x-axis
- And the response variable MUST be on the y-axis

Worked Example

In each of the following experiments, state which variable is the explanatory variable and which is the response variable.

(a) An engineer wishes to study whether temperature has an effect on charging times for mobile phone batteries.

Explanatory variable: temperature
Response variable: how long it takes the batteries to charge

(b) An education researcher wants to see whether a new AI study app improves students' scores on a maths test.

Explanatory variable: whether or not a student has used the app
Response variable: scores on the test

(c) A naturalist wants to explore whether the number of offspring successfully raised by breeding pairs of a particular species of bird depends on the percentage of tree cover in the region where the birds live.

Explanatory variable: percentage of tree cover
Response variable: number of offspring successfully raised
