Collecting Data (Edexcel GCSE Statistics)

Revision Note

Data Collection Basics

What are different ways to collect data?

  • You should be familiar with different methods of collecting data

  • You can use direct observation to collect data

    • This means observing the things you are interested in and recording what you observe

      • For example to study pedestrians' use of mobile phones you might observe people walking past a certain spot in a town and tally the numbers who are or aren't looking at a mobile phone while they walk

    • You will need an appropriate data collection sheet for recording your data

      • This will usually be a table or tally chart, with appropriate rows or columns for the data you are collecting

      • For the example above you could use a tally chart, with rows for 'looking at a mobile phone' and 'not looking at a mobile phone'

    • An advantage of observation can be not affecting the natural behaviour of the things you are observing

      • But a possible disadvantage is not having any control over the things you are studying

  • You can also conduct an experiment to collect data

    • This is done to see how changes in one variable (the explanatory or independent variable) affect another variable (the response or dependent variable)

    • It is important to control extraneous variables (see the 'Extraneous Variables' spec point)

    • Different types of experiment (laboratory experiments, field experiments, and natural experiments) have different advantages and disadvantages

      • including different levels of control for extraneous variables

    • Sometimes a pre-test will be used before starting on a full experiment

      • The intended experiment is run on a small sample

      • This may reveal any problems with the design of the experiment

      • And allow the problems to be fixed before the experiment is run for real

  • Simulation can be used to model events in the real world

    • Data is collected from the model to predict what would happen in the real world

    • This may be easier or cheaper than collecting real world data

    • Random processes may be involved (including the use of random numbers)

      • For example say 23% of the UK population possesses a certain genetic marker

      • A two-digit random number generator could serve as a model 'person'

      • A number from 00 to 22 means the 'person' has the genetic marker, and a number from 23 to 99 means they don't

  • You can gather data from individuals using questionnaires or interviews

    • These need to be used carefully to avoid bias or other possible issues

      • See the 'Questionnaires & Interviews' spec point

  • You can also use reference sources to collect secondary data

    • e.g., government census data, online sources, etc.

      • Remember that the source of secondary data needs to be acknowledged

    • See the 'Types of Data' revision note
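
The simulation example above (a two-digit random number standing in for a 'person', with 00 to 22 meaning the person has the genetic marker) can be sketched in Python. This is only an illustration: the function name `simulate_marker_count`, the sample size of 1000 and the fixed seed are assumptions made for this example.

```python
import random


def simulate_marker_count(n_people, seed=None):
    """Simulate n_people 'people' and count how many have the marker.

    Each 'person' is a random two-digit number from 00 to 99.
    00 to 22 (23 of the 100 possible values) means 'has the marker'.
    """
    rng = random.Random(seed)  # the seed is optional, used for repeatable runs
    count = 0
    for _ in range(n_people):
        number = rng.randint(0, 99)  # both endpoints are included
        if number <= 22:
            count += 1
    return count


# Hypothetical usage: simulate 1000 people
n = 1000
with_marker = simulate_marker_count(n, seed=1)
print(f"{with_marker} of {n} simulated people have the marker")
```

With a large number of simulated people, the proportion with the marker should be close to 23%, but it will vary from run to run unless the seed is fixed.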

What are the advantages and disadvantages of different kinds of experiment?

  • You should know the advantages and disadvantages of different types of experiment for collecting data

  • Laboratory experiments

    • Conducted in a controlled environment (it doesn't have to happen in an official laboratory!)

      • For example studying people's sleep patterns in a special room where lighting, temperature, bedding materials, etc. are all under the researchers' control

    • Advantages include

      • Easy to control extraneous variables

      • Easy to repeat the experiment under exactly the same conditions

    • Disadvantages include

      • Test subjects may not behave naturally in the controlled environment

  • Field experiments

    • Conducted in the subject's usual environment, but with the researcher controlling the situation and certain variables

      • For example studying people's sleep patterns in their own beds at home, but with the researchers providing specific types of pillow and deciding what time subjects should go to bed

    • Advantages include

      • More likely than a laboratory experiment to show usual or natural behaviour

    • Disadvantages include

      • Can't control all extraneous variables

      • Harder to repeat the experiment under exactly the same conditions

  • Natural experiments

    • Conducted in the subject's usual environment, without the researcher controlling the situation or variables

      • For example studying people's sleep patterns in their own beds at home, with the subjects using their own beds and bedding, going to sleep at their usual times, etc.

    • Advantages include

      • More likely than a laboratory experiment to show usual or natural behaviour

    • Disadvantages include

      • Can't control any extraneous variables

      • Harder to repeat the experiment under exactly the same conditions

What are validity and reliability with regard to collected data?

  • We say that data is reliable when repeated measurements give similar results

    • i.e. if you collected the data again under similar circumstances you would get similar results

      • For example, using a scale to weigh some samples

      • It should give the same result if the same sample is weighed again

    • The reliability of collected data is the extent to which this is true

  • We say that data is valid if it measures what it was intended to measure

    • i.e. the data should be telling you what you think it is telling you

      • For example, using a questionnaire to assess participants' stress levels

      • To be valid, scores from the questionnaire should agree with other accepted ways of measuring stress

    • The validity of collected data is the extent to which this is true

  • Reliability and validity are both very important for collected data

    • The more reliable and valid data is, the more we can trust any predictions or conclusions made from it

Worked Example

Tomas is a researcher studying obedience in pet dogs. He plans to study 8 different dogs. For each dog, he will first visit the dog at its home, ask it to perform 10 basic commands, and record how many the dog successfully carries out. Two days later, Tomas will visit each dog at home a second time, ask it to do the same 10 commands, and record how many the dog successfully carries out.


(a) Design a data collection sheet that Tomas could use to record the results of his experiment.

Tomas will need to record the data for the 8 different dogs
For each dog he will need to record two different data values (the number of commands successfully carried out on each visit)
The best way to do this will be in a table

| Dog       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 1st visit |   |   |   |   |   |   |   |   |
| 2nd visit |   |   |   |   |   |   |   |   |

(b) Explain whether Tomas is conducting a laboratory experiment, a field experiment, or a natural experiment.

He is visiting the dogs at their homes, so he is not carrying out a laboratory experiment
He is controlling what the dogs are asked to do on each visit, so it is not a natural experiment

Tomas is visiting the dogs in their home environments, but he is also controlling what they are asked to do on each visit. Therefore it is a field experiment.

(c) Explain what Tomas has done to help assure the reliability of his experimental results.

Tomas is asking the dogs to perform the same 10 commands each time
He is testing them in the same setting (their home) each time
If his experiment is reliable he should get approximately the same results on both visits

He is visiting each dog twice. Both visits are in the dogs' homes, and they are asked to do the same 10 commands each time. This will test whether he gets similar results for each dog when tested in similar circumstances, and help to show whether his results are reliable.

Questionnaires & Interviews

What makes a good questionnaire?

  • A questionnaire contains a set of questions that are used to collect data

    • A person who completes a questionnaire is known as a respondent

  • You should know the difference between open and closed questions

  • An open question has no suggested answers, so a respondent can give any answer at all

    • For example, 'How do you think the current town council is doing?'

    • Every answer can be different

      • so it can be hard to summarise or analyse the data as a whole

  • A closed question offers the respondent a number of answers to choose from

    • For example, 'The current town council is doing a great job. Choose one: ☐  Agree     ☐  Disagree'

    • It is possible to record how many people choose each response

      • This makes it easier to summarise and analyse the data

    • Closed questions will often use an opinion scale

      • For example offering the options 'strongly agree', 'agree', 'disagree' and 'strongly disagree'

      • A problem with opinion scales is that most people tend to choose responses 'in the middle', so the data collected might be biased towards those middle values

  • There are a number of things to consider when creating a questionnaire

    • Avoid leading questions

      • These are questions that suggest a particular answer

      • For example 'How delighted are you with our awesome new product?'

        • This is 'leading' the respondent to give a positive answer

        • The responses collected are likely to be biased

    • Make sure that options offered cover all possibilities

      • For example, 'How many times per day do you use our app? ☐  1 time     ☐  2 times     ☐  3-5 times'

        • This doesn't offer '0' or 'more than 5' as options

      • You may need to include options like 'never', 'other' or 'I don't know'

    • Make sure any intervals given do not overlap

      • For example, 'How much do you spend per month on widgets? ☐  £0 to £5     ☐  £5 to £10     ☐  More than £10'

        • '£5' is included in the first and second options!

    • Be sure to be specific about time frames

      • For example, 'How many text messages do you send per week?' is better than 'How many text messages do you send?'

    • Keep questions short

      • and use language that is simple and easy to understand

    • Be careful about asking sensitive questions

      • i.e. questions about personal matters (age, etc.) or about things people may not want to discuss ('How many times have you stolen things from shops?')

        • People may not answer the questions

        • Or they may not answer them honestly

  • Sometimes a pilot survey will be used before giving the questionnaire to all the respondents in the intended survey

    • The questionnaire is first given to a smaller sample of people

    • This may reveal any problems with the design of the questionnaire

    • And allow the problems to be fixed before the questionnaire is used for real
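
The advice above about overlapping intervals can be checked mechanically. The sketch below is a simplified illustration: the helper `find_overlaps` is a hypothetical name, each option is written as a (low, high) pair with both ends included, and open-ended options such as 'More than £10' are left out.

```python
def find_overlaps(intervals):
    """Return pairs of neighbouring intervals that share at least one value.

    Each interval is a (low, high) tuple with both ends included.
    """
    ordered = sorted(intervals)
    overlaps = []
    for first, second in zip(ordered, ordered[1:]):
        if second[0] <= first[1]:  # next interval starts before this one ends
            overlaps.append((first, second))
    return overlaps


# Options from the widget question above: '£0 to £5' and '£5 to £10'
print(find_overlaps([(0, 5), (5, 10)]))   # [((0, 5), (5, 10))]
# One possible fix, assuming answers are given in whole pounds
print(find_overlaps([(0, 5), (6, 10)]))   # []
```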

What are the advantages and disadvantages of interviews versus anonymous questionnaires?

  • In interviews, an interviewer asks respondents the questions and records their responses

    • This can be done in person or by phone

    • Advantages of interviews:

      • The response rate is higher

        • i.e. every person interviewed will tend to answer the questions

      • The interviewer can explain questions (if necessary)

      • The respondent can explain their answers

        • This avoids unclear or ambiguous answers being recorded

      • A good interviewer can help respondents feel more comfortable when answering sensitive questions

    • Disadvantages of interviews:

      • Conducting interviews can take a lot of time

        • So a survey based on interviews can take longer and be more expensive to run

      • The sample size will usually be smaller than when using questionnaires

        • This can make the sample less representative

      • Respondents may be less likely to be honest or to answer sensitive questions in an in-person interview

        • Or respondents may try to boast or to give the answers they think the interviewer wants to hear

      • There may be interviewer bias

        • This is when the opinions or expectations of the interviewer affect the answers given by the respondent

        • For example the interviewer may ask a question in a way that leads the respondent towards giving a particular answer

        • This can lead to biased results

  • Questionnaires will normally be given to people to fill in anonymously

    • This can be a printed form or a form accessible online

    • Advantages of questionnaires:

      • Respondents can answer questions in their own time

        • This can make the survey quicker and cheaper to run

      • Questionnaires can be sent to a large sample

        • This can make the sample more representative

      • Respondents may be more likely to be honest and to answer sensitive questions in an anonymous questionnaire

      • There is no interviewer bias

    • Disadvantages of questionnaires:

      • The response rate is lower

        • People may not answer all the questions, or may not complete or return the questionnaire at all

      • A respondent may not understand the questions

      • A respondent's answers may be unclear or ambiguous

What is the random response method for collecting sensitive data?

  • Even in an anonymous questionnaire, people may not be willing to give honest answers to sensitive questions

    • The random response method is a way to get better responses for these sorts of questions

    • It uses some sort of random event (for example a coin flip) to determine how a question will be answered

  • For example, say you wanted to collect data on people using handheld phones while driving

    • This is illegal in the UK

    • So people may not be willing to admit that they have done it

  • You could ask the question in this form:

    • "Have you ever driven while using a handheld phone?
      Flip a coin.
      If you get heads, then answer Yes.
      If you get tails, then answer honestly."

    • There is no way to know if a person answering yes really did drive while using a handheld phone, or whether they only answered yes because they flipped the coin and got heads

  • To estimate the proportion of honest 'yes' answers for a random response question:

    • Estimate the number of people who answered a certain way because of the random event

      • For example, with a coin flip about half the people will get heads and half will get tails

    • Remove that many responses from the data set

      • For the example used above, say 1000 people responded to the question

      • We would expect half of them to answer yes because they got heads on the coin

      • So remove 500 yes answers from the data set

    • Perform your analysis on the remaining items in the data set

    • See the Worked Example
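
To see why removing the expected coin-flip answers works, the method can be simulated. The sketch below is illustrative only: the function names, the sample size of 100,000 and the 'true' rate of 20% are assumptions chosen for the demonstration, and a fair coin is assumed.

```python
import random


def simulate_random_response(n_people, true_rate, seed=None):
    """Simulate the coin-flip version of the random response method.

    Heads -> the person answers Yes whatever the truth is.
    Tails -> the person answers honestly.
    Returns the number of Yes answers and the number of No answers.
    """
    rng = random.Random(seed)
    yes = 0
    for _ in range(n_people):
        heads = rng.random() < 0.5
        really_did_it = rng.random() < true_rate
        if heads or really_did_it:
            yes += 1
    return yes, n_people - yes


def estimate_true_rate(yes, no):
    """Remove the expected coin-forced Yes answers, then take the proportion."""
    total = yes + no
    forced_yes = total / 2          # about half of the people get heads
    real_yes = yes - forced_yes
    real_total = real_yes + no      # every No answer is a 'real' answer
    return real_yes / real_total


yes, no = simulate_random_response(100_000, true_rate=0.2, seed=42)
print(round(estimate_true_rate(yes, no), 3))  # should be close to 0.2
```

Nobody's individual answer can be identified as honest or coin-forced, yet the estimate for the whole sample should land near the true rate.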

Worked Example

A researcher is designing a questionnaire in order to collect data on how often people illegally download music.

One question the researcher is thinking of using is the following:

"A lot of people say that downloading music illegally is really okay, because it doesn't hurt anyone. How bad do you think it is to download music illegally?"

(a) State with a reason whether that is an open or a closed question.

Remember, in a closed question respondents are given a fixed set of responses to choose from

A person could give any answer at all to that question, so it is an open question

(b) State one thing that is wrong with the way the question is asked.

They start by saying that a lot of people think it's okay, before asking the actual question
This is leading a respondent towards giving a certain type of answer

It is a leading question, because it starts off by saying that a lot of people think it's okay to download music illegally

In the final version of the questionnaire, one of the questions is as follows:

"Have you ever downloaded music illegally?
  Before answering the question, flip a coin.
  If you get heads on the coin, then answer Yes.
  If you get tails on the coin, then answer honestly."

The questionnaire is sent to a large number of people. 1332 people answer Yes to that question, and 1068 answer No.

(c) Estimate the percentage of people in the sample who have downloaded music illegally.

Start by figuring out the total number of people who responded

$1332 + 1068 = 2400$

The probability of getting heads on a fair coin is $\frac{1}{2}$
Multiply that by 2400 to estimate the number of people who answered Yes because of the coin flip

$\frac{1}{2} \times 2400 = 1200$

Subtract that from 1332 to estimate the number of 'real' Yes responses

$1332 - 1200 = 132$

All the No responses are 'real', because no one answered No just because of the coin flip
So the total number of 'real' responses is 1068 + 132
Divide the number of 'real' Yes responses by that to get the proportion of 'real' responses that were Yes

$\frac{132}{1068 + 132} = \frac{132}{1200} = 0.11$

Multiply by 100 to convert to a percentage

$0.11 \times 100 = 11$

11%
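
The steps in part (c) can be condensed into a single formula. As a sketch, write $n$ for the total number of responses and $y$ for the number of Yes responses (these symbols are introduced here for illustration and do not appear in the question), and assume a fair coin:

$$\text{estimated proportion} = \frac{y - \frac{1}{2}n}{\frac{1}{2}n} = \frac{2y}{n} - 1$$

With $y = 1332$ and $n = 2400$ this gives $\frac{2 \times 1332}{2400} - 1 = 1.11 - 1 = 0.11$, which agrees with the 11% found in part (c).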

Data Problems & Cleaning Data

What sorts of problems can occur with collected data?

  • A number of problems can occur with data that has been collected

  • There may be missing data items

    • For example collecting data for the ages and weights of a number of puppies

      • For one puppy the researcher wrote down the weight, but forgot to record the age

  • There may be non-responses

    • This may be because someone chose not to answer a questionnaire

    • But it could also be because a member of a sample cannot be reached for some reason

      • For example questionnaires sent out to all the businesses in a government database

        • Some may not respond because they have gone out of business

        • This could mean that struggling or unsuccessful businesses are under-represented in the sample

  • There may be incomplete responses

    • People may return a questionnaire, but not answer all the questions

      • If lots of people don't answer the same question it may be because there's a problem with the question

  • Data could be in an incorrect format

    • For example, decimal points in the wrong place, incorrect or inconsistent units used, etc.

  • Data sets may contain anomalous data values (also known as outliers)

    • These are data values that are either very large or very small compared to the rest of the data

    • They may be valid data values

      • One very high salary in a list of company salaries may belong to the company CEO

    • Or they may be mistakes

      • A member of a sports club whose age is recorded as 500

      • It's possible the person is really 50, but the person recording the data put in an extra 0 by mistake

How do I clean data?

  • Before data can be analysed, it should first be cleaned

  • Incorrect data values should be identified

    • and corrected if possible

      • or otherwise removed

    • This includes outliers

      • If you decide an outlier is a mistake it should be removed

      • But outliers that you think are valid should be kept in the data set

  • You should decide what to do about missing or incomplete data

    • It may be possible to find missing data values

      • For example, ring the person whose puppy was weighed but whose age was not recorded, and find out the age

    • Incomplete data (for example from an incomplete response to a questionnaire) can be kept or removed

      • Consider the effect that keeping or removing the data would have on any calculated statistics or other analysis

  • Units or other symbols may need to be removed from the data

    • For example removing the 'kg' from a list of weights, or the '£' from a list of prices

    • This is especially important if using spreadsheets or statistical software to calculate statistics from the data

  • Final calculations and analysis should be done using the cleaned data set

    • However be sure to justify why any values have been removed from the data set
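
Removing units before calculating statistics, as described above, might look like the following in Python. This is a minimal sketch: the helper name `strip_units` and the raw values are made up for illustration.

```python
def strip_units(raw_value, symbols=("kg", "£")):
    """Remove unit symbols and spaces from a text value and return a number."""
    cleaned = raw_value
    for symbol in symbols:
        cleaned = cleaned.replace(symbol, "")
    return float(cleaned.strip())


# Hypothetical raw values, as they might be typed into a spreadsheet
weights = ["4.2kg", "3.8 kg", "5kg"]
prices = ["£12.50", "£8", "£10.25"]

print([strip_units(w) for w in weights])  # [4.2, 3.8, 5.0]
print([strip_units(p) for p in prices])   # [12.5, 8.0, 10.25]
```

Once the values are stored as numbers rather than text, software can calculate means, medians and other statistics from them.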

Worked Example

At the start of each day, 8 towels are placed in each room in a hotel.

At the end of a particular day, the hotel manager recorded how many towels had gone missing from each room in the hotel. The results are in the table below.

| 2   | 0 | 0 | 1 | 0  | 0 | 1 | 1 | 0 | 0 | 2 |
| 3.2 | 3 | 0 | 2 | 15 | 0 | 1 | 0 | 1 | 5 | 0 |

(a) Explain why the value 3.2 in the table must be an error.

You cannot have part of a towel go missing, so valid data values must be whole numbers

Only whole numbers of towels can go missing, and 3.2 is not a whole number


(b) Explain why 15 is an anomalous data value, and state with a reason whether it should be kept in or removed from the data set.

15 'stands out' because it is so much bigger than all the other values in the set
And remember that only 8 towels are put in each room to begin with

15 is an anomalous data value because it is much bigger than all the other numbers in the table. It should be removed, because only 8 towels are put into each room at the start of the day, so 15 is very likely to be an error.

(c) Clean and rewrite the data.

Remove the 3.2 and the 15, and rewrite the remaining data values
This is the data set you would use to calculate any statistics or to do other analysis

| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 3 | 0 | 2 | 0 | 1 | 0 | 1 | 5 | 0 |
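
The cleaning in part (c) can also be expressed as a rule applied to every value. The sketch below assumes the rule 'a valid value is a whole number from 0 to 8', which follows from parts (a) and (b); the names `is_valid`, `cleaned` and `removed` are illustrative.

```python
TOWELS_PER_ROOM = 8  # stated in the question

raw = [2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 3.2, 3, 0, 2, 15, 0, 1, 0, 1, 5, 0]


def is_valid(value):
    """A valid count is a whole number from 0 up to the towels in the room."""
    return value == int(value) and 0 <= value <= TOWELS_PER_ROOM


cleaned = [value for value in raw if is_valid(value)]
removed = [value for value in raw if not is_valid(value)]

print(removed)       # [3.2, 15]
print(len(cleaned))  # 20
```

Keeping a list of the removed values makes it easier to justify why each one was taken out of the data set.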

Extraneous Variables

What are extraneous variables?

  • An extraneous variable is a variable that

    • you are not interested in

    • but that can affect the results of your experiment

  • It is important to

    • identify possible extraneous variables before beginning an experiment

    • control any extraneous variables that are identified

      • i.e. eliminate (or at least minimise) their effect on the data

  • For example an experiment looking at how a new memory technique helps people memorise lists of words

    • Some people are tested in a quiet room

      • and some people are tested in a busy café

    • Background noise is an extraneous variable

      • People in the café might do worse on the test

      • But only because they were distracted by the noise

      • This would not say anything about the memory technique being investigated

    • Background noise could be controlled by having everyone do the test in the same setting

How can I use control groups to control extraneous variables?

  • Control groups are often used when testing new treatments

    • People are randomly selected to be in one of two groups

      • People in the test group are given the treatment

      • People in the control group are not given the treatment

      • The results for the two groups are compared to see how effective the treatment is

  • The circumstances for the test group and control group should be as similar as possible

    • This is to control possible extraneous variables

    • For example, in medical tests the control group may be given an inactive substance (known as a placebo) that looks exactly like the active substance given to the test group

      • Even the people giving the substance may not know who is getting what

      • This makes sure everyone's experience in the experiment is as similar as possible

  • Matched pairs can be used in experiments with control and test groups

    • Each person in one group is paired with a person in the other group

    • People who are paired should have as much as possible in common

      • e.g. age, gender, educational background, annual income, geographic location, etc.

    • The only thing that should be different is the variable being studied

      • This is another way to control extraneous variables

      • But it can be challenging to find enough matched pairs to give a good sample size
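
Random allocation to a test group and a control group, as described above, can be sketched in Python. The function name `allocate_groups` and the participant labels are hypothetical, and the sketch simply splits a shuffled list in half (it does not form matched pairs).

```python
import random


def allocate_groups(participants, seed=None):
    """Randomly split participants into a test group and a control group."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the original list is unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant labels
people = [f"P{i}" for i in range(1, 11)]
test_group, control_group = allocate_groups(people, seed=3)
print("Test group:   ", test_group)
print("Control group:", control_group)
```

Because the shuffle decides who goes where, the researcher's own preferences cannot influence which group a participant ends up in.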

Worked Example

A doctor wants to test the effectiveness of a new medicated lotion for treating a skin condition.

She plans to select a number of people who have the condition, and to divide her test subjects into two groups. The members of one group will receive the medicated lotion to use, and the members of the other group will receive a lotion that looks and feels the same but doesn't contain any medication.

At the end of the study she will compare the two groups to see whether each subject's skin condition has improved, stayed the same, or gotten worse during the time of the study.

(a) State which of the doctor's groups is the control group, and which is the test group.

The test group is the group receiving the medicated lotion, and the control group is the group receiving the unmedicated lotion

(b) Explain how the doctor should choose which test subjects should be in which group.

Participants should be selected randomly for the groups in a control group experiment

She should use random selection

(c) Describe how the doctor could use matched pairs in her study, and explain how this could make the study results more reliable.

Matched pairs pair together people with similar characteristics
This is to control as many extraneous variables as possible

She could pair each person in the control group with a person in the test group who is of the same age and gender
This would help control extraneous variables, by making sure the only difference between people in the two groups is whether or not they use the medicated lotion


Author: Roger B

Expertise: Maths

Roger's teaching experience stretches all the way back to 1992, and in that time he has taught students at all levels between Year 7 and university undergraduate. Having conducted and published postgraduate research into the mathematical theory behind quantum computing, he is more than confident in dealing with mathematics at any level the exam boards might throw at you.

Author: Dan Finlay

Expertise: Maths Lead

Dan graduated from the University of Oxford with a First class degree in mathematics. As well as teaching maths for over 8 years, Dan has marked a range of exams for Edexcel, tutored students and taught A Level Accounting. Dan has a keen interest in statistics and probability and their real-life applications.