Collecting Data (Edexcel GCSE Statistics)

Revision Note

Data Collection Basics

What are different ways to collect data?

  • You should be familiar with different methods of collecting data

  • You can use direct observation to collect data

    • This means observing the things you are interested in and recording what you observe

      • For example to study pedestrians' use of mobile phones you might observe people walking past a certain spot in a town and tally the numbers who are or aren't looking at a mobile phone while they walk

    • You will need an appropriate data collection sheet for recording your data

      • This will usually be a table or tally chart, with appropriate rows or columns for the data you are collecting

      • For the example above you could use a tally chart, with rows for 'looking at a mobile phone' and 'not looking at a mobile phone'

    • An advantage of observation is that it may not affect the natural behaviour of the things you are observing

      • But a possible disadvantage is not having any control over the things you are studying

  • You can also conduct an experiment to collect data

    • This is done to see how changes in one variable (the explanatory or independent variable) affect another variable (the response or dependent variable)

    • It is important to control extraneous variables (see the 'Extraneous Variables' spec point)

    • Different types of experiment (laboratory experiments, field experiments, and natural experiments) have different advantages and disadvantages

      • including how much control you have over extraneous variables

    • Sometimes a pre-test will be used before starting on a full experiment

      • The intended experiment is run on a small sample

      • This may reveal any problems with the design of the experiment

      • And allow the problems to be fixed before the experiment is run for real

  • Simulation can be used to model events in the real world

    • Data is collected from the model to predict what would happen in the real world

    • This may be easier or cheaper than collecting real world data

    • Random processes may be involved (including the use of random numbers)

      • For example say 23% of the UK population possesses a certain genetic marker

      • A two-digit random number generator could serve as a model 'person'

      • A number from 00 to 22 means the 'person' has the genetic marker, and a number from 23 to 99 means they don't

  • You can gather data from individuals using questionnaires or interviews

    • These need to be used carefully to avoid bias or other possible issues

      • See the 'Questionnaires & Interviews' spec point

  • You can also use reference sources to collect secondary data

    • e.g., government census data, online sources, etc.

      • Remember that the source of secondary data needs to be acknowledged

    • See the 'Types of Data' revision note
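The random number simulation described above (23% of the population having a genetic marker) could be sketched in Python roughly as follows. This is only an illustrative sketch: the function names `has_marker` and `simulate_marker`, the sample size of 1000 and the seed are assumptions made for the example, not part of the specification.

```python
import random

def has_marker(number):
    """A two-digit number from 00 to 22 means the 'person' has the marker."""
    return 0 <= number <= 22

def simulate_marker(n_people, seed=None):
    """Model n_people 'people' with two-digit random numbers (00 to 99)
    and count how many of them have the genetic marker."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    count = 0
    for _ in range(n_people):
        number = rng.randint(0, 99)  # each number 00 to 99 is equally likely
        if has_marker(number):
            count += 1
    return count

# Hypothetical run: simulate 1000 'people'
count = simulate_marker(1000, seed=1)
print(f"{count} of 1000 simulated people have the marker")
```

Because 23 of the 100 possible numbers (00 to 22) count as 'has the marker', each simulated person has a 23% chance of having it, which matches the real-world proportion being modelled.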

What are the advantages and disadvantages of different kinds of experiment?

  • You should know the advantages and disadvantages of different types of experiment for collecting data

  • Laboratory experiments

    • Conducted in a controlled environment (it doesn't have to happen in an official laboratory!)

      • For example studying people's sleep patterns in a special room where lighting, temperature, bedding materials, etc. are all under the researchers' control

    • Advantages include

      • Easy to control extraneous variables

      • Easy to repeat the experiment under exactly the same conditions

    • Disadvantages include

      • Test subjects may not behave naturally in the controlled environment

  • Field experiments

    • Conducted in the subject's usual environment, but with the researcher controlling the situation and certain variables

      • For example studying people's sleep patterns in their own beds at home, but with the researchers providing specific types of pillow and deciding what time subjects should go to bed

    • Advantages include

      • More likely than a laboratory experiment to show usual or natural behaviour

    • Disadvantages include

      • Can't control all extraneous variables

      • Harder to repeat the experiment under exactly the same conditions

  • Natural experiments

    • Conducted in the subject's usual environment, without the researcher controlling the situation or variables

      • For example studying people's sleep patterns in their own beds at home, with the subjects using their own beds and bedding, going to sleep at their usual times, etc.

    • Advantages include

      • More likely than a laboratory experiment to show usual or natural behaviour

    • Disadvantages include

      • Can't control any extraneous variables

      • Harder to repeat the experiment under exactly the same conditions

What are validity and reliability with regard to collected data?

  • We say that data is reliable when repeated measurements give similar results

    • i.e. if you collected the data again under similar circumstances you would get similar results

      • For example, using a scale to weigh some samples

      • It should give the same result if the same sample is weighed again

    • The reliability of collected data is the extent to which this is true

  • We say that data is valid if it measures what it was intended to measure

    • i.e. the data should be telling you what you think it is telling you

      • For example, using a questionnaire to assess participants' stress levels

      • To be valid, scores from the questionnaire should agree with other accepted ways of measuring stress

    • The validity of collected data is the extent to which this is true

  • Reliability and validity are both very important for collected data

    • The more reliable and valid data is, the more we can trust any predictions or conclusions made from it

Worked Example

Tomas is a researcher studying obedience in pet dogs. He plans to study 8 different dogs. For each dog, he will first visit the dog at its home, ask it to perform 10 basic commands, and record how many the dog successfully carries out. Two days later, Tomas will visit each dog at home a second time, ask it to do the same 10 commands, and record how many the dog successfully carries out.


(a) Design a data collection sheet that Tomas could use to record the results of his experiment.

Tomas will need to record the data for the 8 different dogs
For each dog he will need to record two different data values (the number of commands successfully carried out on each visit)
The best way to do this will be in a table

Dog        | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
1st visit  |   |   |   |   |   |   |   |
2nd visit  |   |   |   |   |   |   |   |

(b) Explain whether Tomas is conducting a laboratory experiment, a field experiment, or a natural experiment.

He is visiting the dogs at their homes, so he is not carrying out a laboratory experiment
He is controlling what the dogs are asked to do on each visit, so it is not a natural experiment

Tomas is visiting the dogs in their home environments, but he is also controlling what they are asked to do on each visit. Therefore it is a field experiment.

(c) Explain what Tomas has done to help assure the reliability of his experimental results.

Tomas is asking the dogs to perform the same 10 commands each time
He is testing them in the same setting (their home) each time
If his experiment is reliable he should get approximately the same results on both visits

He is visiting each dog twice. Both visits are in the dogs' homes, and they are asked to do the same 10 commands each time. This will test whether he gets similar results for each dog when tested in similar circumstances, and help to show whether his results are reliable.
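As a rough illustration of how the completed data collection sheet could be used to check reliability, the Python sketch below compares the two visits for each dog. The scores are invented purely for the example (the question gives no results), and the function name `visit_differences` is hypothetical.

```python
# Hypothetical scores out of 10 for dogs 1 to 8 (invented for illustration)
first_visit = [7, 9, 4, 10, 6, 8, 5, 9]
second_visit = [8, 9, 5, 10, 6, 7, 5, 8]

def visit_differences(first, second):
    """Return (dog number, 1st score, 2nd score, absolute difference) per dog."""
    return [(dog, a, b, abs(a - b))
            for dog, (a, b) in enumerate(zip(first, second), start=1)]

rows = visit_differences(first_visit, second_visit)

# Print the sheet with an extra column for the difference between visits
print("Dog  1st  2nd  Diff")
for dog, a, b, diff in rows:
    print(f"{dog:>3}  {a:>3}  {b:>3}  {diff:>4}")

# Small differences for every dog would suggest the results are reliable
largest = max(diff for _, _, _, diff in rows)
print("Largest difference between visits:", largest)
```

If every dog scores about the same on both visits, that supports the results being reliable. A large difference for some dogs would suggest they are not.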

Questionnaires & Interviews

What makes a good questionnaire?

  • A questionnaire contains a set of questions that are used to collect data

    • A person who completes a questionnaire is known as a respondent

  • You should know the difference between open and closed questions

  • An open question has no suggested answers, so a respondent can give any answer at all

    • For example, 'How do you think the current town council is doing?'

    • Every answer can be different

      • so it can be hard to summarise or analyse the data as a whole

  • A closed question offers the respondent a number of answers to choose from

    • For example, 'The current town council is doing a great job. Choose one: ☐  Agree     ☐  Disagree'

    • It is possible to record how many people choose each response

      • This makes it easier to summarise and analyse the data

    • Closed questions will often use an opinion scale

      • For example offering the options 'strongly agree', 'agree', 'disagree' and 'strongly disagree'

      • A problem with opinion scales is that most people tend to choose responses 'in the middle', so the data collected might be biased towards those middle values

  • There are a number of things to consider when creating a questionnaire

    • Avoid leading questions

      • These are questions that suggest a particular answer

      • For example 'How delighted are you with our awesome new product?'

        • This is 'leading' the respondent to give a positive answer

        • The responses collected are likely to be biased

    • Make sure that options offered cover all possibilities

      • For example, 'How many times per day do you use our app? ☐  1 time     ☐  2 times     ☐  3-5 times'

        • This doesn't offer '0' or 'more than 5' as options

      • You may need to include options like 'never', 'other' or 'I don't know'

    • Make sure any intervals given do not overlap

      • For example, 'How much do you spend per month on widgets? ☐  £0 to £5     ☐  £5 to £10     ☐  More than £10'

        • '£5' is included in the first and second options!

    • Be sure to be specific about time frames

      • For example, 'How many text messages do you send per week?' is better than 'How many text messages do you send?'

    • Keep questions short

      • and use language that is simple and easy to understand

    • Be careful about asking sensitive questions

      • i.e. questions about personal matters (age, etc.) or about things people may not want to discuss ('How many times have you stolen things from shops?')

        • People may not answer the questions

        • Or they may not answer them honestly

  • Sometimes a pilot survey will be used before giving the questionnaire to all the respondents in the intended survey

    • The questionnaire is first given to a smaller sample of people

    • This may reveal any problems with the design of the questionnaire

    • And allow the problems to be fixed before the questionnaire is used for real
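The two checks on answer options described above (covering all possibilities, and not overlapping) can be sketched in Python. This is a simplified sketch that only considers whole-number answers; the function name `check_options` and the upper limits used for the checks are assumptions made for the example.

```python
def check_options(intervals, lowest, highest):
    """Check whole-number answer options for gaps and overlaps.

    intervals: list of (low, high) ranges, with both ends included
    Returns (missing, repeated): the values from lowest to highest that
    are covered by no option, and those covered by more than one option.
    """
    missing, repeated = [], []
    for value in range(lowest, highest + 1):
        matches = sum(1 for low, high in intervals if low <= value <= high)
        if matches == 0:
            missing.append(value)
        elif matches > 1:
            repeated.append(value)
    return missing, repeated

# App question: options '1 time', '2 times', '3-5 times', checked from 0 to 6
app_result = check_options([(1, 1), (2, 2), (3, 5)], 0, 6)
print(app_result)      # 0 and 6 are not covered by any option

# Widgets question in whole pounds: '£0 to £5', '£5 to £10',
# with 'More than £10' treated as 11 to 15 for the purposes of the check
widget_result = check_options([(0, 5), (5, 10), (11, 15)], 0, 15)
print(widget_result)   # 5 appears in two options
```

A well-designed set of options would give two empty lists: no missing values and no repeated values.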

What are the advantages and disadvantages of interviews versus anonymous questionnaires?

  • In interviews an interviewer asks the respondents the questions and records their responses

    • This can be done in person or by phone

    • Advantages of interviews:

      • The response rate is higher

        • i.e. every person interviewed will tend to answer the questions

      • The interviewer can explain questions (if necessary)

      • The respondent can explain their answers

        • This avoids unclear or ambiguous answers being recorded

      • A good interviewer can help respondents feel more comfortable when answering sensitive questions

    • Disadvantages of interviews:

      • Conducting interviews can take a lot of time

        • So a survey that uses interviews can be slower and more expensive to run than one that uses questionnaires

      • The sample size will usually be smaller than when using questionnaires

        • This can make the sample less representative

      • Respondents may be less likely to be honest or to answer sensitive questions in an in-person interview

        • Or respondents may try to boast or to give the answers they think the interviewer wants to hear

      • There may be interviewer bias

        • This is when the opinions or expectations of the interviewer affect the answers given by the respondent

        • For example the interviewer may ask a question in a way that leads the respondent towards giving a particular answer

        • This can lead to biased results

  • Questionnaires will normally be given to people to fill in anonymously

    • This can be a printed form or a form accessible online

    • Advantages of questionnaires:

      • Respondents can answer questions in their own time

        • This can make the survey quicker and cheaper to run

      • Questionnaires can be sent to a large sample

        • This can make the sample more representative

      • Respondents may be more likely to be honest and to answer sensitive questions in an anonymous questionnaire

      • There is no interviewer bias

    • Disadvantages of questionnaires:

      • The response rate is lower

        • People may not answer all the questions, or may not complete or return the questionnaire at all

      • A respondent may not understand the questions

      • A respondent's answers may be unclear or ambiguous

Worked Example

A researcher is designing a questionnaire in order to collect data on how often people illegally download music.

One question the researcher is thinking of using is the following:

"A lot of people say that downloading music illegally is really okay, because it doesn't hurt anyone. How bad do you think it is to download music illegally?"

(a) State with a reason whether that is an open or a closed question.

Remember, in a closed question respondents are given a fixed set of responses to choose from

A person could give any answer at all to that question, so it is an open question

(b) State one thing that is wrong with the way the question is asked.

They start by saying that a lot of people think it's okay, before asking the actual question
This is leading a respondent towards giving a certain type of answer

It is a leading question, because it starts off by saying that a lot of people think it's okay to download music illegally

Data Problems & Cleaning Data

What sorts of problems can occur with collected data?

  • A number of problems can occur with data that has been collected

  • There may be missing data items

    • For example collecting data for the ages and weights of a number of puppies

      • For one puppy the researcher wrote down the weight, but forgot to record the age

  • There may be non-responses

    • This may be because someone chose not to answer a questionnaire

    • But it could also be because a member of a sample cannot be reached for some reason

      • For example questionnaires sent out to all the businesses in a government database

        • Some may not respond because they have gone out of business

        • This could mean that struggling or unsuccessful businesses are under-represented in the sample

  • There may be incomplete responses

    • People may return a questionnaire, but not answer all the questions

      • If lots of people don't answer the same question it may be because there's a problem with the question

  • Data could be in an incorrect format

    • For example, decimal points in the wrong place, incorrect or inconsistent units used, etc.

  • Data sets may contain anomalous data values (also known as outliers)

    • These are data values that are either very large or very small compared to the rest of the data

    • They may be valid data values

      • One very high salary in a list of company salaries may belong to the company CEO

    • Or they may be mistakes

      • A member of a sports club whose age is recorded as 500

      • It's possible the person is really 50, but the person recording the data put in an extra 0 by mistake

How do I clean data?

  • Before data can be analysed, it should first be cleaned

  • Incorrect data values should be identified

    • and corrected if possible

      • or otherwise removed

    • This includes outliers

      • If you decide an outlier is a mistake it should be removed

      • But outliers that you think are valid should be kept in the data set

  • You should decide what to do about missing or incomplete data

    • It may be possible to find missing data values

      • For example, ring the owner of the puppy whose weight was recorded but whose age was not, and find out the puppy's age

    • Incomplete data (for example from an incomplete response to a questionnaire) can be kept or removed

      • Consider the effect that keeping or removing the data would have on any calculated statistics or other analysis

  • Units or other symbols may need to be removed from the data

    • For example removing the 'kg' from a list of weights, or the '£' from a list of prices

    • This is especially important if using spreadsheets or statistical software to calculate statistics from the data

  • Final calculations and analysis should be done using the cleaned data set

    • However be sure to justify why any values have been removed from the data set
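Removing units or symbols so that software can treat the values as numbers might look something like the Python sketch below. The data values and the function name `strip_units` are made up for illustration, and the sketch assumes that 'kg' and '£' are the only symbols present.

```python
def strip_units(raw, symbols=("kg", "£")):
    """Remove the given unit symbols from a text value and return a number."""
    text = raw.strip()
    for symbol in symbols:
        text = text.replace(symbol, "")
    return float(text.strip())

# Hypothetical raw data as it might be typed into a spreadsheet
weights = ["4.2kg", "3.8 kg", " 5kg "]
prices = ["£3.50", "£12", "£0.99"]

clean_weights = [strip_units(w) for w in weights]
clean_prices = [strip_units(p) for p in prices]

print(clean_weights)  # numbers only, so a mean etc. can now be calculated
print(clean_prices)
```

Once the symbols are gone the values are stored as numbers rather than text, so statistics such as the mean can be calculated from them.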

Worked Example

At the start of each day, 8 towels are placed in each room in a hotel.

At the end of a particular day, the hotel manager recorded how many towels had gone missing from each room in the hotel. The results are in the table below.

2   0   0   1   0   0   1   1   0   0   2
3.2 3   0   2   15  0   1   0   1   5   0

(a) Explain why the value 3.2 in the table must be an error.

You cannot have part of a towel go missing, so valid data values must be whole numbers

Only whole numbers of towels can go missing, and 3.2 is not a whole number


(b) Explain why 15 is an anomalous data value, and state with a reason whether it should be kept in or removed from the data set.

15 'stands out' because it is so much bigger than all the other values in the set
And remember that only 8 towels are put in each room to begin with

15 is an anomalous data value because it is much bigger than all the other numbers in the table. It should be removed, because only 8 towels are put into each room at the start of the day, so 15 is very likely to be an error.

(c) Clean and rewrite the data.

Remove the 3.2 and the 15, and rewrite the remaining data values
This is the data set you would use to calculate any statistics or to do other analysis

2   0   0   1   0   0   1   1   0   0
2   3   0   2   0   1   0   1   5   0
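The cleaning in this worked example can also be written as a short Python sketch. It applies the two rules used in the answers above: a valid value must be a whole number, and it cannot be more than the 8 towels placed in each room. The names `TOWELS_PER_ROOM` and `is_valid` are just illustrative choices.

```python
# The 22 values recorded by the hotel manager
towels_missing = [2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2,
                  3.2, 3, 0, 2, 15, 0, 1, 0, 1, 5, 0]

TOWELS_PER_ROOM = 8  # towels placed in each room at the start of the day

def is_valid(value):
    """A valid value is a whole number from 0 to 8 inclusive."""
    return float(value).is_integer() and 0 <= value <= TOWELS_PER_ROOM

cleaned = [int(v) for v in towels_missing if is_valid(v)]
removed = [v for v in towels_missing if not is_valid(v)]

print("Removed:", removed)
print("Cleaned:", cleaned)
print("Number of values kept:", len(cleaned))
```

Keeping a separate list of the removed values makes it easier to justify why each one was taken out of the data set.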

Extraneous Variables

What are extraneous variables?

  • An extraneous variable is a variable that

    • you are not interested in

    • but that can affect the results of your experiment

  • It is important to

    • identify possible extraneous variables before beginning an experiment

    • control any extraneous variables that are identified

      • i.e. eliminate (or at least minimise) their effect on the data

  • For example an experiment looking at how a new memory technique helps people memorise lists of words

    • Some people are tested in a quiet room

      • and some people are tested in a busy café

    • Background noise is an extraneous variable

      • People in the café might do worse on the test

      • But only because they were distracted by the noise

      • This would not say anything about the memory technique being investigated

    • Background noise could be controlled by having everyone do the test in the same setting

Worked Example

John is conducting a study to see whether there is a relationship between a person's height and how quickly they are able to run 100 metres.

He selects a random sample of 50 students from his school, and intends to time each of them running 100 metres in the same location on the school's playing fields.

He intends to time half of the students first thing in the morning before morning lessons, and the other half at the end of the day once classes are over.

Suggest an extraneous variable that might affect John's results. Explain how it might affect the results, and suggest what John might do to control that extraneous variable. 

There is no one right answer here, as there are a number of extraneous variables that might affect the results:

  • A student's level of physical fitness

  • A student's age

  • A student's gender

Any valid answer and explanation, along with a valid suggestion for controlling the extraneous variable, would get full marks on a question like this

From the information given in the question, however, the time of day a person runs the 100 metres is an obvious thing to choose for the extraneous variable

The time of day that a person runs the 100 metres is an extraneous variable. For example, people might run slower in the afternoon because they are tired after a whole day at school. John can control this extraneous variable by having everyone run the 100 metres at the same time of day.


Roger B

Author: Roger B

Expertise: Maths

Roger's teaching experience stretches all the way back to 1992, and in that time he has taught students at all levels between Year 7 and university undergraduate. Having conducted and published postgraduate research into the mathematical theory behind quantum computing, he is more than confident in dealing with mathematics at any level the exam boards might throw at you.

Dan Finlay

Author: Dan Finlay

Expertise: Maths Lead

Dan graduated from the University of Oxford with a First class degree in mathematics. As well as teaching maths for over 8 years, Dan has marked a range of exams for Edexcel, tutored students and taught A Level Accounting. Dan has a keen interest in statistics and probability and their real-life applications.