Collecting Data (Edexcel GCSE Statistics)
Revision Note
Written by: Roger B
Reviewed by: Dan Finlay
Data Collection Basics
What are different ways to collect data?
You should be familiar with different methods of collecting data
You can use direct observation to collect data
This means observing the things you are interested in and recording what you observe
For example to study pedestrians' use of mobile phones you might observe people walking past a certain spot in a town and tally the numbers who are or aren't looking at a mobile phone while they walk
You will need an appropriate data collection sheet for recording your data
This will usually be a table or tally chart, with appropriate rows or columns for the data you are collecting
For the example above you could use a tally chart, with rows for 'looking at a mobile phone' and 'not looking at a mobile phone'
An advantage of observation can be not affecting the natural behaviour of the things you are observing
But a possible disadvantage is not having any control over the things you are studying
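The tally chart described above can also be kept electronically. The following is a minimal sketch, assuming each pedestrian is recorded as one text entry; the observations listed and the function names `tally` and `print_tally_chart` are made up for illustration.

```python
from collections import Counter

# Hypothetical observations: one entry per pedestrian who walked past.
observations = [
    "looking at a mobile phone",
    "not looking at a mobile phone",
    "not looking at a mobile phone",
    "looking at a mobile phone",
    "not looking at a mobile phone",
]

def tally(observations):
    """Count how many times each category was observed."""
    return Counter(observations)

def print_tally_chart(counts):
    """Print one row per category, with one tally mark per observation."""
    for category, count in counts.items():
        print(f"{category:<32}{'|' * count}  ({count})")

counts = tally(observations)
print_tally_chart(counts)
```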
You can also conduct an experiment to collect data
This is done to see how changes in one variable (the explanatory or independent variable) affect another variable (the response or dependent variable)
It is important to control extraneous variables (see the 'Extraneous Variables' spec point)
Different types of experiment (laboratory experiments, field experiments, and natural experiments) have different advantages and disadvantages
including how much control you have over extraneous variables
Sometimes a pre-test will be used before starting on a full experiment
The intended experiment is run on a small sample
This may reveal any problems with the design of the experiment
And allow the problems to be fixed before the experiment is run for real
Simulation can be used to model events in the real world
Data is collected from the model to predict what would happen in the real world
This may be easier or cheaper than collecting real world data
Random processes may be involved (including the use of random numbers)
For example say 23% of the UK population possesses a certain genetic marker
A two-digit random number generator could serve as a model 'person'
A number from 00 to 22 means the 'person' has the genetic marker, and a number from 23 to 99 means they don't
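The genetic marker model above could be run on a computer. This is a rough sketch using Python's built-in random number generator; the function names, the sample size of 1000 and the seed are assumptions chosen for illustration, not part of the specification.

```python
import random

def simulate_person(rng):
    """Model one 'person' by drawing a random number from 00 to 99.

    00 to 22 (23 of the 100 equally likely values) means the person
    has the genetic marker; 23 to 99 means they do not.
    """
    number = rng.randint(0, 99)  # randint includes both end values
    return number <= 22

def simulate_sample(size, seed=None):
    """Return how many of `size` simulated people have the marker."""
    rng = random.Random(seed)
    return sum(simulate_person(rng) for _ in range(size))

with_marker = simulate_sample(1000, seed=1)
print(f"{with_marker} of 1000 simulated people have the marker")
```

With a large sample the proportion with the marker should come out close to 23%, but it will vary from run to run unless the same seed is used.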
You can gather data from individuals using questionnaires or interviews
These need to be used carefully to avoid bias or other possible issues
See the 'Questionnaires & Interviews' spec point
You can also use reference sources to collect secondary data
e.g., government census data, online sources, etc.
Remember that the source of secondary data needs to be acknowledged
See the 'Types of Data' revision note
What are the advantages and disadvantages of different kinds of experiment?
You should know the advantages and disadvantages of different types of experiment for collecting data
Laboratory experiments
Conducted in a controlled environment (it doesn't have to happen in an official laboratory!)
For example studying people's sleep patterns in a special room where lighting, temperature, bedding materials, etc. are all under the researchers' control
Advantages include
Easy to control extraneous variables
Easy to repeat the experiment under exactly the same conditions
Disadvantages include
Test subjects may not behave naturally in the controlled environment
Field experiments
Conducted in the subject's usual environment, but with the researcher controlling the situation and certain variables
For example studying people's sleep patterns in their own beds at home, but with the researchers providing specific types of pillow and deciding what time subjects should go to bed
Advantages include
More likely than a laboratory experiment to show usual or natural behaviour
Disadvantages include
Can't control all extraneous variables
Harder to repeat the experiment under exactly the same conditions
Natural experiments
Conducted in the subject's usual environment, without the researcher controlling the situation or variables
For example studying people's sleep patterns in their own beds at home, with the subjects using their own beds and bedding, going to sleep at their usual times, etc.
Advantages include
More likely than a laboratory experiment to show usual or natural behaviour
Disadvantages include
Can't control any extraneous variables
Harder to repeat the experiment under exactly the same conditions
What are validity and reliability with regard to collected data?
We say that data is reliable when repeated measurements give similar results
i.e. if you collected the data again under similar circumstances you would get similar results
For example, using a scale to weigh some samples
It should give the same result if the same sample is weighed again
The reliability of collected data is the extent to which this is true
We say that data is valid if it measures what it was intended to measure
i.e. the data should be telling you what you think it is telling you
For example, using a questionnaire to assess participants' stress levels
To be valid, scores from the questionnaire should agree with other accepted ways of measuring stress
The validity of collected data is the extent to which this is true
Reliability and validity are both very important for collected data
The more reliable and valid data is, the more we can trust any predictions or conclusions made from it
Worked Example
Tomas is a researcher studying obedience in pet dogs. He plans to study 8 different dogs. For each dog, he will first visit the dog at its home, ask it to perform 10 basic commands, and record how many the dog successfully carries out. Two days later, Tomas will visit each dog at home a second time, ask it to do the same 10 commands, and record how many the dog successfully carries out.
(a) Design a data collection sheet that Tomas could use to record the results of his experiment.
Tomas will need to record the data for the 8 different dogs
For each dog he will need to record two different data values (the number of commands successfully carried out on each visit)
The best way to do this will be in a table
Dog | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
1st visit | | | | | | | | |
2nd visit | | | | | | | | |
(b) Explain whether Tomas is conducting a laboratory experiment, a field experiment, or a natural experiment.
He is visiting the dogs at their homes, so he is not carrying out a laboratory experiment
He is controlling what the dogs are asked to do on each visit, so it is not a natural experiment
Tomas is visiting the dogs in their home environments, but he is also controlling what they are asked to do on each visit. Therefore it is a field experiment.
(c) Explain what Tomas has done to help assure the reliability of his experimental results.
Tomas is asking the dogs to perform the same 10 commands each time
He is testing them in the same setting (their home) each time
If his experiment is reliable he should get approximately the same results on both visits
He is visiting each dog twice. Both visits are in the dogs' homes, and they are asked to do the same 10 commands each time. This will test whether he gets similar results for each dog when tested in similar circumstances, and help to show whether his results are reliable.
Questionnaires & Interviews
What makes a good questionnaire?
A questionnaire contains a set of questions that are used to collect data
A person who completes a questionnaire is known as a respondent
You should know the difference between open and closed questions
An open question has no suggested answers, so a respondent can give any answer at all
For example, 'How do you think the current town council is doing?'
Every answer can be different
so it can be hard to summarise or analyse the data as a whole
A closed question offers the respondent a number of answers to choose from
For example, 'The current town council is doing a great job. Choose one: ☐ Agree ☐ Disagree'
It is possible to record how many people choose each response
This makes it easier to summarise and analyse the data
Closed questions will often use an opinion scale
For example offering the options 'strongly agree', 'agree', 'disagree' and 'strongly disagree'
A problem with opinion scales is that most people tend to choose responses 'in the middle', so the data collected might be biased towards those middle values
There are a number of things to consider when creating a questionnaire
Avoid leading questions
These are questions that suggest a particular answer
For example 'How delighted are you with our awesome new product?'
This is 'leading' the respondent to give a positive answer
The responses collected are likely to be biased
Make sure that options offered cover all possibilities
For example, 'How many times per day do you use our app? ☐ 1 time ☐ 2 times ☐ 3-5 times'
This doesn't offer '0' or 'more than 5' as options
You may need to include options like 'never', 'other' or 'I don't know'
Make sure any intervals given do not overlap
For example, 'How much do you spend per month on widgets? ☐ £0 to £5 ☐ £5 to £10 ☐ More than £10'
'£5' is included in the first and second options!
Be sure to be specific about time frames
For example, 'How many text messages do you send per week?' is better than 'How many text messages do you send?'
Keep questions short
and use language that is simple and easy to understand
Be careful about asking sensitive questions
i.e. questions about personal matters (age, etc.) or about things people may not want to discuss ('How many times have you stolen things from shops?')
People may not answer the questions
Or they may not answer them honestly
Sometimes a pilot survey will be used before giving the questionnaire to all the respondents in the intended survey
The questionnaire is first given to a smaller sample of people
This may reveal any problems with the design of the questionnaire
And allow the problems to be fixed before the questionnaire is used for real
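Two of the checks above (options that cover all possibilities, and intervals that do not overlap) can be automated before a pilot survey. The sketch below assumes answers are whole numbers and that each option is written as a (low, high) pair with both ends included, with `None` as the upper end meaning 'or more'. The function name `check_options` is hypothetical, and 'More than £10' is treated as '£11 or more' under the whole-number assumption.

```python
def check_options(options, minimum=0):
    """Check whole-number answer ranges for gaps and overlaps.

    Each option is (low, high), both inclusive; high=None means 'or more'.
    Returns a list of problems found (empty if there are none).
    """
    problems = []
    expected = minimum  # smallest value not yet covered by an option
    for low, high in sorted(options, key=lambda option: option[0]):
        if expected is None or low < expected:
            problems.append(f"overlap at option starting {low}")
        elif low > expected:
            problems.append(f"gap: {expected} to {low - 1} not covered")
        if expected is not None:
            expected = None if high is None else max(expected, high + 1)
    if expected is not None:
        problems.append(f"gap: {expected} or more not covered")
    return problems

# Options from the app question: 1 time, 2 times, 3-5 times
app_problems = check_options([(1, 1), (2, 2), (3, 5)])
# Options from the widgets question: £0 to £5, £5 to £10, more than £10
widget_problems = check_options([(0, 5), (5, 10), (11, None)])
# One possible fix for the widgets question: £0 to £5, £6 to £10, £11 or more
fixed_problems = check_options([(0, 5), (6, 10), (11, None)])

print(app_problems)
print(widget_problems)
print(fixed_problems)
```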
What are the advantages and disadvantages of interviews versus anonymous questionnaires?
In interviews an interviewer asks the questions to the respondents and records their responses
This can be done in person or by phone
Advantages of interviews:
The response rate is higher
i.e. every person interviewed will tend to answer the questions
The interviewer can explain questions (if necessary)
The respondent can explain their answers
This avoids unclear or ambiguous answers being recorded
A good interviewer can help respondents feel more comfortable when answering sensitive questions
Disadvantages of interviews:
Conducting interviews can take a lot of the interviewer's time
So a survey based on interviews can take longer and cost more than one based on questionnaires
The sample size will usually be smaller than when using questionnaires
This can make the sample less representative
Respondents may be less likely to be honest or to answer sensitive questions in an in-person interview
Or respondents may try to boast or to give the answers they think the interviewer wants to hear
There may be interviewer bias
This is when the opinions or expectations of the interviewer affect the answers given by the respondent
For example the interviewer may ask a question in a way that leads the respondent towards giving a particular answer
This can lead to biased results
Questionnaires will normally be given to people to fill in anonymously
This can be a printed form or a form accessible online
Advantages of questionnaires:
Respondents can answer questions in their own time
This can make the survey quicker and cheaper to run
Questionnaires can be sent to a large sample
This can make the sample more representative
Respondents may be more likely to be honest and to answer sensitive questions in an anonymous questionnaire
There is no interviewer bias
Disadvantages of questionnaires:
The response rate is lower
People may not answer all the questions, or may not complete or return the questionnaire at all
A respondent may not understand the questions
A respondent's answers may be unclear or ambiguous
Worked Example
A researcher is designing a questionnaire in order to collect data on how often people illegally download music.
One question the researcher is thinking of using is the following:
"A lot of people say that downloading music illegally is really okay, because it doesn't hurt anyone. How bad do you think it is to download music illegally?"
(a) State with a reason whether that is an open or a closed question.
Remember, in a closed question respondents are given a fixed set of responses to choose from
A person could give any answer at all to that question, so it is an open question
(b) State one thing that is wrong with the way the question is asked.
They start by saying that a lot of people think it's okay, before asking the actual question
This is leading a respondent towards giving a certain type of answer
It is a leading question, because it starts off by saying that a lot of people think it's okay to download music illegally
Data Problems & Cleaning Data
What sorts of problems can occur with collected data?
A number of problems can occur with data that has been collected
There may be missing data items
For example collecting data for the ages and weights of a number of puppies
For one puppy the researcher wrote down the weight, but forgot to record the age
There may be non-responses
This may be because someone chose not to answer a questionnaire
But it could also be because a member of a sample cannot be reached for some reason
For example questionnaires sent out to all the businesses in a government database
Some may not respond because they have gone out of business
This could mean that struggling or unsuccessful businesses are under-represented in the sample
There may be incomplete responses
People may return a questionnaire, but not answer all the questions
If lots of people don't answer the same question it may be because there's a problem with the question
Data could be in an incorrect format
For example, decimal points in the wrong place, incorrect or inconsistent units used, etc.
Data sets may contain anomalous data values (also known as outliers)
These are data values that are either very large or very small compared to the rest of the data
They may be valid data values
One very high salary in a list of company salaries may belong to the company CEO
Or they may be mistakes
A member of a sports club whose age is recorded as 500
It's possible the person is really 50, but the person recording the data put in an extra 0 by mistake
How do I clean data?
Before data can be analysed, it should first be cleaned
Incorrect data values should be identified
and corrected if possible
or otherwise removed
This includes outliers
If you decide an outlier is a mistake it should be removed
But outliers that you think are valid should be kept in the data set
You should decide what to do about missing or incomplete data
It may be possible to find missing data values
For example, ring the owner of the puppy whose weight was recorded but whose age was not, and find out the puppy's age
Incomplete data (for example from an incomplete response to a questionnaire) can be kept or removed
Consider the effect that keeping or removing the data would have on any calculated statistics or other analysis
Units or other symbols may need to be removed from the data
For example removing the 'kg' from a list of weights, or the '£' from a list of prices
This is especially important if using spreadsheets or statistical software to calculate statistics from the data
Final calculations and analysis should be done using the cleaned data set
However be sure to justify why any values have been removed from the data set
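Removing units before doing calculations might look like the sketch below. The weights and prices are invented for illustration, and `strip_units` is a hypothetical helper that only handles the two symbols mentioned above.

```python
def strip_units(raw, symbols=("kg", "£")):
    """Remove the given unit symbols from a text value and return a number."""
    text = raw.strip()
    for symbol in symbols:
        text = text.replace(symbol, "")
    return float(text.strip())

# Hypothetical raw data, as it might be typed into a spreadsheet
weights = [strip_units(value) for value in ["4.2kg", "3.9 kg", "5kg"]]
prices = [strip_units(value) for value in ["£2.50", "£10", "£0.99"]]

print(weights)
print(prices)
```

Once the symbols are gone, the values can be used directly in calculations such as a mean or a total.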
Worked Example
At the start of each day, 8 towels are placed in each room in a hotel.
At the end of a particular day, the hotel manager recorded how many towels had gone missing from each room in the hotel. The results are in the table below.
2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |
3.2 | 3 | 0 | 2 | 15 | 0 | 1 | 0 | 1 | 5 | 0 |
(a) Explain why the value 3.2 in the table must be an error.
You cannot have part of a towel go missing, so valid data values must be whole numbers
Only whole numbers of towels can go missing, and 3.2 is not a whole number
(b) Explain why 15 is an anomalous data value, and state with a reason whether it should be kept in or removed from the data set.
15 'stands out' because it is so much bigger than all the other values in the set
And remember that only 8 towels are put in each room to begin with
15 is an anomalous data value because it is much bigger than all the other numbers in the table. It should be removed, because only 8 towels are put into each room at the start of the day, so 15 is very likely to be an error.
(c) Clean and rewrite the data.
Remove the 3.2 and the 15, and rewrite the remaining data values
This is the data set you would use to calculate any statistics or to do other analysis
2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | 3 | 0 | 2 | 0 | 1 | 0 | 1 | 5 | 0 |
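The same cleaning can be sketched in code. This example uses the values from the table in the question and applies the two rules from parts (a) and (b): a valid value must be a whole number, and it cannot be more than the 8 towels placed in each room. The names `is_valid`, `cleaned` and `removed` are chosen for illustration.

```python
TOWELS_PER_ROOM = 8

# Values as recorded by the hotel manager
recorded = [2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2,
            3.2, 3, 0, 2, 15, 0, 1, 0, 1, 5, 0]

def is_valid(value):
    """A count of missing towels must be a whole number from 0 to 8."""
    return float(value).is_integer() and 0 <= value <= TOWELS_PER_ROOM

cleaned = [int(value) for value in recorded if is_valid(value)]
removed = [value for value in recorded if not is_valid(value)]

print(cleaned)
print(removed)
```

Keeping a list of the removed values makes it easier to justify why each one was taken out of the data set.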
Extraneous Variables
What are extraneous variables?
An extraneous variable is a variable that
you are not interested in
but that can affect the results of your experiment
It is important to
identify possible extraneous variables before beginning an experiment
control any extraneous variables that are identified
i.e. eliminate (or at least minimise) their effect on the data
For example an experiment looking at how a new memory technique helps people memorise lists of words
Some people are tested in a quiet room
and some people are tested in a busy café
Background noise is an extraneous variable
People in the café might do worse on the test
But only because they were distracted by the noise
This would not say anything about the memory technique being investigated
Background noise could be controlled by having everyone do the test in the same setting
Worked Example
John is conducting a study to see whether there is a relationship between a person's height and how quickly they are able to run 100 metres.
He selects a random sample of 50 students from his school, and intends to time each of them running 100 metres in the same location on the school's playing fields.
He intends to time half of the students first thing in the morning before morning lessons, and the other half at the end of the day once classes are over.
Suggest an extraneous variable that might affect John's results. Explain how it might affect the results, and suggest what John might do to control that extraneous variable.
There is no one right answer here, as there are a number of extraneous variables that might affect the results:
A student's level of physical fitness
A student's age
A student's gender
Any valid answer and explanation, along with a valid suggestion for controlling the extraneous variable, would get full marks on a question like this
From the information given in the question, however, the time of day a person runs the 100 metres is an obvious thing to choose for the extraneous variable
The time of day that a person runs the 100 metres is an extraneous variable. For example, people might run slower in the afternoon because they are tired after a whole day at school. John can control this extraneous variable by having everyone run the 100 metres at the same time of day.