Statistics Toolkit (DP IB Applications & Interpretation (AI))

Flashcards

Cards in this collection (69)

  • What is qualitative data?

    Qualitative data is given in words, not numbers.

    Examples include: hair colour, favourite animal, street name, etc.

  • What is quantitative data?

    Quantitative data is given using numbers.

    Examples include: number of siblings, height, time taken to run 100 metres, etc.

  • What is continuous data?

    Continuous data is quantitative data that can take any value within an interval. Continuous data needs to be measured.

    Examples include: height, mass of apples, length of leaves, etc.

  • What is discrete data?

    Discrete data is quantitative data that can only take specific values from a set. Discrete data is normally counted.

    Examples include: number of pets, number of times a coin is flipped until a 'tails' is obtained, etc.

    (Not all discrete data is counted however, e.g. shoe sizes.)

  • What is a population?

    A population is the whole set of things you are interested in observing.

    For example, if you are investigating the heights of giraffes in Africa then the population is the set of all giraffes in Africa.

  • What is a sample?

    A sample is a subset of a population used to collect data.

  • What is a sampling frame?

    A sampling frame is a list of all members of the population.

  • What is a random sample?

    A random sample is where every member of the population has an equal chance of being included in the sample.

  • Name the five sampling techniques.

    The five sampling techniques are:

    • simple random sampling

    • systematic sampling

    • stratified sampling

    • quota sampling

    • convenience sampling

  • Describe how you would use simple random sampling to take a sample of 30 people from a population of 100 people.

    To use simple random sampling to take a sample of 30 people from a population of 100 people:

    • give each member of the population a unique number from 1 to 100,

    • use a random number generator to select 30 different random numbers between 1 and 100,

    • the 30 people with those numbers form the sample.
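
    The steps above can be sketched in Python. This is a minimal illustration rather than a prescribed method: the function name `simple_random_sample` and the optional `seed` argument are assumptions made for the example.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size different ID numbers from 1 to population_size."""
    # The seed is optional; it only makes a run repeatable.
    rng = random.Random(seed)
    # sample() draws without replacement, so no number can repeat.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# 30 different numbers between 1 and 100; the people with these numbers form the sample.
sample_ids = simple_random_sample(100, 30, seed=1)
print(sample_ids)
```

    Drawing without replacement matters here: it guarantees 30 different people rather than 30 numbers that might repeat.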

  • True or False?

    Forming a sample by including every 10th person from a list is an example of stratified sampling.

    False.

    Forming a sample by including every 10th person from a list is not an example of stratified sampling.

    It is an example of systematic sampling.
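
    Systematic sampling can be sketched in Python as follows. The list of 100 names is made up, and choosing the starting position at random within the first interval is an assumption (a common convention, not something stated on the card).

```python
import random


def systematic_sample(members, step, seed=None):
    """Take every step-th member of an ordered list."""
    rng = random.Random(seed)
    # Assumed convention: start at a random position among the first `step` members.
    start = rng.randrange(step)
    return members[start::step]


# Hypothetical list of 100 people.
people = [f"person_{i}" for i in range(1, 101)]
sample = systematic_sample(people, 10, seed=0)
print(sample)
```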

  • What is the main difference between stratified sampling and quota sampling?

    The main difference between stratified sampling and quota sampling is how members for each group are selected.

    For stratified sampling, simple random sampling is used on each group of the population, whereas for quota sampling the members do not have to be selected randomly (e.g. convenience sampling could be used).

  • How do you calculate the number of members of each group to include in a stratified sample?

    For example, if there are 100 animals on a farm (60 chickens and 40 sheep), how many chickens should be included in a stratified sample of 30 animals?

    To calculate the number of members of each group to include in a stratified sample use the formula:

    $\dfrac{\text{size of sample } (n)}{\text{size of population } (N)} \times \text{number of members in the group}$

    If there are 100 animals on a farm (60 chickens and 40 sheep), then to find the number of chickens to include in a stratified sample of 30 animals you would calculate:

    $\frac{30}{100} \times 60 = 18$
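
    The same calculation can be applied to every group at once. The sketch below uses the farm example from this card; the function name `stratified_counts` is an assumption for illustration.

```python
def stratified_counts(group_sizes, sample_size):
    """Number of members to take from each group, in proportion to group size."""
    population_size = sum(group_sizes.values())
    return {
        group: sample_size * size / population_size
        for group, size in group_sizes.items()
    }


counts = stratified_counts({"chickens": 60, "sheep": 40}, 30)
print(counts)  # {'chickens': 18.0, 'sheep': 12.0}
```

    When the results are not whole numbers they need to be rounded, and the rounded values should be checked to make sure they still add up to the sample size.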

  • True or False?

    A survey is always carried out in person.

    False.

    A survey is not always carried out in person.

    A survey can be done remotely such as postal surveys, phone surveys and internet surveys.

  • Why is the following question not suitable for a questionnaire?

    "How much do you appreciate your hardworking, selfless teacher?"

    "How much do you appreciate your hardworking, selfless teacher?"

    This question is not suitable for a questionnaire because it is a leading question: the words "hardworking" and "selfless" push the respondent towards a positive answer.

  • What is meant by reliability in terms of data collection?

    Reliability measures how consistent a process is at measuring a variable.

    A process is reliable if you get the same results by repeating the process with the same sample using the same conditions.

  • What is meant by validity in terms of data collection?

    Validity measures how accurate a process is at measuring a variable.

    A process is valid if it is accurately measuring the variable you want it to measure.

  • What are the two methods to test for reliability?

    The two reliability tests are:

    1. Test-retest

    2. Parallel forms

  • Describe the process for the test-retest method.

    The test-retest method is where you use a data collection process with a sample and then repeat the same process with the same sample at a later time.

  • Describe the process for the parallel forms method.

    The parallel forms method is where you give the same sample a second set of questions (or second set of experiments), which are similar to the first set.

  • What are the two methods to test for validity?

    The two validity tests are:

    • Content-related check

    • Criterion-related check

  • Describe the content-related validity check.

    The content-related validity check is where you check how well the process measures all aspects of the variable.

    The process is valid if it covers all aspects of the variable.

  • Describe the criterion-related validity check.

    The criterion-related validity method is where you check how well one variable predicts the outcome for another variable (called the criterion variable).

    If the process is valid then the variable should be a good predictor.

  • The test-retest method is used to check the reliability of a process.

    What type of correlation should there be between the two sets of results if the process is reliable?

    There should be a strong positive correlation between the two sets of results if the process is reliable.
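
    One way to check this is to calculate Pearson's correlation coefficient for the two sets of results. The sketch below is illustrative only: the scores are made up, and the function name `pearson_r` is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same five people on two occasions.
first_attempt = [12, 15, 9, 20, 17]
second_attempt = [13, 14, 10, 19, 18]

r = pearson_r(first_attempt, second_attempt)
print(round(r, 3))
```

    A value of r close to 1 suggests that the process gives consistent results.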

  • What is the mode of a data set?

    The mode of a data set is the item or items that occur most often.

  • True or False?

    Any data set always has exactly one mode.

    False.

    Not all data sets have exactly one mode.

    A data set may have:

    • no mode

    • more than one mode.

  • How do you find the median of ungrouped data without a GDC?

    To find the median of ungrouped data:

    • put the data in order from smallest to largest,

    • find the middle value(s).

    If there are two middle values then find the midpoint of them.

    For example, the median of 1, 2, 3, 4 is 2.5 (i.e. the midpoint of 2 and 3).
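
    The same procedure written as a short Python sketch (the function name `median` is chosen for the example; Python's `statistics.median` does the same job):

```python
def median(values):
    """Median of ungrouped data: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: there is a single middle value.
        return ordered[mid]
    # Even number of values: take the midpoint of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 2, 3, 4]))     # 2.5
print(median([2, 3, 1, 5, 8]))  # 3
```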

  • How do you find the mean of ungrouped data without a GDC?

    To find the mean of ungrouped data:

    • add all the values together,

    • divide by the number of values.

    For example, the mean of 1, 2, 3, 4 can be found by calculating:

    $\frac{1 + 2 + 3 + 4}{4} = 2.5$

  • $\frac{1}{n} \sum_{i=1}^{n} x_i$

    Which measure of central tendency is found using the stated formula?

    This formula is used to calculate the mean of a data set, which is denoted $\bar{x}$.

  • True or False?

    The lower quartile of a set of data splits the lowest 25% from the highest 75%.

    True.

    The lower quartile of a set of data splits the lowest 25% from the highest 75%.

    For example, the lower quartile of 1, 2, 3, 4 is 1.5.

  • What is the interquartile range of a data set?

    The interquartile range is the range of the central 50% of data.

    It can be found by subtracting the lower quartile from the upper quartile.

    $\text{IQR} = Q_3 - Q_1$
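
    A minimal sketch of the calculation is given here. It assumes the "median of each half" convention for quartiles, which agrees with the earlier example (the lower quartile of 1, 2, 3, 4 is 1.5); other conventions, including some software defaults, can give slightly different values.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower_half = ordered[:half]                  # excludes the median when n is odd
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(ordered), median(upper_half)


q1, q2, q3 = quartiles([1, 2, 3, 4])
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 1.5 2.5 3.5 2.0
```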

  • True or False?

    The range of 2, 3, 1, 5, 8 is 8 - 2 = 6.

    False.

    The range is the difference between the lowest value and the highest value.

    The range of 2, 3, 1, 5, 8 is 8 - 1 = 7.

  • True or False?

    The standard deviation is a measure of central tendency.

    False.

    The standard deviation is not a measure of central tendency.

    The standard deviation is a measure of dispersion: it measures how spread out the data is about the mean.

  • What is the mathematical relationship between the standard deviation and the variance?

    The standard deviation is the positive square root of the variance.

    (Equivalently, the variance is the standard deviation squared.)

  • What statistical measure is denoted by the symbol $\sigma^2$?

    $\sigma^2$ denotes the population variance.
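
    The relationship between the two measures can be seen in a short sketch. It uses the standard definition of population variance (the mean of the squared deviations from the mean), which is not stated on these cards; the function names are assumptions.

```python
from math import sqrt


def population_variance(values):
    """Mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_sd(values):
    """Standard deviation: the positive square root of the variance."""
    return sqrt(population_variance(values))


data = [1, 2, 3, 4]
print(population_variance(data))  # 1.25
print(population_sd(data))
```

    Python's built-in `statistics.pvariance` and `statistics.pstdev` calculate the same quantities.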

  • How do you calculate the mean from a frequency table containing ungrouped data?

    To calculate the mean from a frequency table containing ungrouped data:

    • multiply each value ($x_i$) by its frequency ($f_i$),

    • add these products together,

    • divide by the total frequency ($n$).

    This can be written as the formula $\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$, where $k$ is the number of different values and $n = \sum_{i=1}^{k} f_i$.
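
    As a sketch, the steps translate directly into Python. The table below is made up for illustration, and the function name `frequency_table_mean` is an assumption.

```python
def frequency_table_mean(table):
    """Mean from a frequency table given as {value: frequency}."""
    total_frequency = sum(table.values())
    sum_of_products = sum(value * freq for value, freq in table.items())
    return sum_of_products / total_frequency


# Hypothetical data: number of pets (value) and number of households (frequency).
pets = {0: 4, 1: 7, 2: 6, 3: 3}
print(frequency_table_mean(pets))  # 1.4
```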

  • True or False?

    Score    Frequency
    5        10
    10       8

    The modal score is 10.

    False.

    The mode is the value with the highest frequency. Therefore the modal score is 5.

  • True or False?

    It is possible to find the exact median from a frequency table of ungrouped data.

    True.

    It is possible to find the exact median from a frequency table of ungrouped data.

    You can find the median by finding the middle value. You can use cumulative frequency to help to find which value is the middle value.

    You can also use your GDC to find the median.
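
    The cumulative frequency approach can be sketched in Python using the table from the previous card (score 5 with frequency 10, score 10 with frequency 8). The function names are assumptions for illustration.

```python
def frequency_table_mode(table):
    """All values that share the highest frequency."""
    highest = max(table.values())
    return [value for value, freq in table.items() if freq == highest]


def frequency_table_median(table):
    """Exact median from a frequency table given as {value: frequency}."""
    total = sum(table.values())
    # 1-based positions of the middle value(s); identical when total is odd.
    positions = [(total + 1) // 2, (total + 2) // 2]
    middle_values = []
    for position in positions:
        cumulative = 0
        for value in sorted(table):
            cumulative += table[value]
            if cumulative >= position:
                middle_values.append(value)
                break
    return sum(middle_values) / 2


scores = {5: 10, 10: 8}
print(frequency_table_mode(scores))    # [5]
print(frequency_table_median(scores))  # 5.0
```

    With 18 values in total, the median is the midpoint of the 9th and 10th values, and both of these are 5.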

  • How do you estimate the mean from a frequency table containing grouped data?

    To estimate the mean from a frequency table containing grouped data:

    • find the mid-interval value (midpoint) of each group ($x_i$),

    • multiply each midpoint by its frequency ($f_i$),

    • add these products together,

    • divide by the total frequency ($n$).

    This can be written as the formula $\bar{x} \approx \frac{\sum_{i=1}^{k} f_i x_i}{n}$, where $k$ is the number of groups and $n = \sum_{i=1}^{k} f_i$.
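
    A minimal sketch of the estimate follows. The groups and frequencies are made up for illustration, and the function name `estimate_grouped_mean` is an assumption.

```python
def estimate_grouped_mean(groups):
    """Estimate the mean from (lower boundary, upper boundary, frequency) triples."""
    total_frequency = sum(freq for _, _, freq in groups)
    # Each group is represented by its mid-interval value.
    sum_of_products = sum(
        (lower + upper) / 2 * freq for lower, upper, freq in groups
    )
    return sum_of_products / total_frequency


# Hypothetical grouped data: 0 <= x < 10, 10 <= x < 20, 20 <= x < 30.
groups = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]
print(estimate_grouped_mean(groups))  # 14.0
```

    The result is only an estimate because every value in a group is treated as if it were equal to the midpoint.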

  • Why is it not possible to calculate the exact mean of grouped data?

    It is not possible to calculate the exact mean of grouped data because the exact individual values are unknown.

  • How do you calculate the mid-interval value of a group in a frequency table of grouped data?

    For example, how do you calculate the mid-interval value of the group $10 \le x < 20$?

    To calculate the mid-interval value of a group in a frequency table of grouped data:

    • add the upper and lower boundaries of the group,

    • divide by 2.

    For example, the mid-interval value of the group $10 \le x < 20$ can be found by calculating $\frac{20 + 10}{2} = 15$.

  • True or False?

    If you add 5 to each value in a data set, then the mean also increases by 5.

    True.

    If you add 5 to each value in a data set, then the mean also increases by 5.

  • $\bar{x}$ is the mean of a data set. The constant $k$ is subtracted from each value in the data set. What is the mean of the new values?

    The mean of the new values is $\bar{x} - k$.

    If you subtract a constant from all the values in the data set then you also subtract it from the mean of the original values to find the mean of the new values.

  • What happens to the mean of a data set if each value in the data set is doubled?

    If each value in a data set is doubled, then the mean is also doubled.

  • $\bar{x}$ is the mean of a data set. Each value in the data set is multiplied by the constant $k$. What is the mean of the new values?

    The mean of the new values is $k\bar{x}$.

    If you multiply all the values in the data set by a constant then you also multiply the mean of the original values by the constant to find the mean of the new values.

  • True or False?

    If you add 5 to each value in a data set, then the standard deviation also increases by 5.

    False.

    If you add 5 to each value in a data set, then the standard deviation does not change.

    Adding a constant to (or subtracting a constant from) every value in a data set does not affect the standard deviation.

  • $\sigma^2$ is the variance of a data set. Each value in the data set is multiplied by the constant $k$. What is the variance of the new values?

    The variance of the new values is $k^2 \sigma^2$.

    If you multiply all the values in the data set by a constant then you multiply the variance of the original values by the constant squared to find the variance of the new values.

  • True or False?

    If you multiply each value in a data set by -2, then the standard deviation is also multiplied by -2.

    False.

    If you multiply each value in a data set by -2, then the standard deviation is multiplied by 2.

    The standard deviation is never negative, so it is multiplied by the absolute value of the constant.
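
    The rules for adding a constant and multiplying by a constant can be checked numerically. The sketch below uses the data set 1, 2, 3, 4 from the earlier cards; the helper names are assumptions, and the variance is the population variance.

```python
from math import isclose, sqrt


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


data = [1, 2, 3, 4]
shifted = [x + 5 for x in data]   # add 5 to each value
scaled = [-2 * x for x in data]   # multiply each value by -2

print(mean(data), variance(data))        # 2.5 1.25
print(mean(shifted), variance(shifted))  # 7.5 1.25
print(mean(scaled), variance(scaled))    # -5.0 5.0

# The standard deviation is multiplied by |-2| = 2, not by -2.
print(isclose(sqrt(variance(scaled)), 2 * sqrt(variance(data))))  # True
```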

  • What is an outlier?

    Outliers are extreme data values that do not fit with the rest of the data.

  • What is the formula used to calculate the boundaries for outliers in a data set?

    The boundaries for outliers in a data set are found by:

    • multiplying the interquartile range by 1.5,

    • subtracting this from the lower quartile and adding it to the upper quartile.

    It can be written as a formula: $x$ is an outlier if $x < Q_1 - 1.5 \times \text{IQR}$ or $x > Q_3 + 1.5 \times \text{IQR}$.
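
    As a sketch, the boundaries and the outlier test can be written in Python. The quartile values used here are made up for illustration, and the function names are assumptions.

```python
def outlier_bounds(q1, q3):
    """Lower and upper boundaries for outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x, q1, q3):
    lower, upper = outlier_bounds(q1, q3)
    return x < lower or x > upper


# Hypothetical quartiles: Q1 = 10 and Q3 = 20, so IQR = 10.
print(outlier_bounds(10, 20))  # (-5.0, 35.0)
print(is_outlier(40, 10, 20))  # True
print(is_outlier(-3, 10, 20))  # False
```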

  • True or False?

    All outliers are errors.

    False.

    Not all outliers are errors.

  • Should outliers be removed from a data set?

    Outliers should be removed from a data set if they are clearly errors. However, if they are possibly valid values then they should be included.

  • What is used to represent an outlier on a box and whisker diagram?

    A cross (×) is used to represent an outlier on a box and whisker diagram.

    [Figure: box and whisker diagram with an outlier.]

  • What are the five values needed to draw a box and whisker diagram?

    The five values needed to draw a box and whisker diagram are: lowest data value, lower quartile, median, upper quartile, and highest data value.

  • What proportion of the data set does the box in a box and whisker diagram represent?

    The box in a box and whisker diagram represents the central 50% of the data set.

  • What do the "whiskers" in a box and whisker diagram represent?

    The whiskers represent the lowest 25% and the highest 25% of the data.

  • True or False?

    To draw a cumulative frequency diagram, you plot the midpoint of a group against its frequency.

    False.

    To draw a cumulative frequency diagram, you do not plot the midpoint of a group against its frequency. This is a mistake that students often make on the exam.

    You plot the upper boundary (endpoint) of each group against its cumulative frequency.

  • If the total frequency is 100, how do you use a cumulative frequency diagram to estimate the median value?

    If the total frequency is 100, you can use a cumulative frequency diagram to estimate the median value.

    1. Draw a horizontal line from 50 (100 divided by 2) on the vertical axis to the curve.

    2. Draw a vertical line down from this point on the curve to the horizontal axis.

    The value on the horizontal axis is an estimate of the median.

  • How can you estimate the lower quartile using a cumulative frequency diagram?

    To estimate the lower quartile using a cumulative frequency diagram:

    1. Divide the total frequency by 4.

    2. Draw a horizontal line from that result on the vertical axis to the curve.

    3. Draw a vertical line down from this point to the horizontal axis.

    The number on the horizontal axis is an estimate for the lower quartile.
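
    Reading a value off a cumulative frequency diagram can be imitated in code. The sketch below assumes the plotted points are joined by straight lines (a hand-drawn smooth curve would give slightly different readings); the points are made up, and the function name `read_off` is an assumption.

```python
def read_off(points, target):
    """Estimate the data value at a given cumulative frequency.

    points: (boundary, cumulative frequency) pairs in increasing order,
    starting with the lowest boundary at cumulative frequency 0.
    """
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if cf0 <= target <= cf1 and cf1 > cf0:
            # Linear interpolation along the line segment.
            return x0 + (target - cf0) / (cf1 - cf0) * (x1 - x0)
    raise ValueError("target is outside the range of the cumulative frequencies")


# Hypothetical diagram with a total frequency of 100.
points = [(0, 0), (10, 10), (20, 40), (30, 80), (40, 100)]
total = points[-1][1]

median_estimate = read_off(points, total / 2)          # go across from 50
lower_quartile_estimate = read_off(points, total / 4)  # go across from 25
print(median_estimate, lower_quartile_estimate)  # 22.5 15.0
```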

  • What is a (frequency) histogram?

    A frequency histogram is a diagram that shows the frequency of each class interval for grouped data with equal class widths.

  • True or False?

    You need to leave a gap between the bars when drawing a histogram.

    False.

    You do not leave a gap between the bars when drawing a histogram.

  • Which measure of central tendency does a box and whisker diagram show?

    A box and whisker diagram shows the median.

  • When working with grouped data, which data representation clearly shows the modal group: a cumulative frequency graph or a histogram?

    When working with grouped data, a histogram clearly shows the modal group.

  • Which measure of central tendency (mode, median, mean) is most affected by outliers?

    The mean is most affected by outliers.

  • Which is the only measure of central tendency (mode, median, mean) that can be applied to qualitative data?

    The mode is the only measure of central tendency (mode, median, mean) that can be applied to qualitative data.

  • True or False?

    When comparing two data sets, the one with the higher mean is always better.

    False.

    When comparing two data sets, the one with the higher mean is not always better. It depends on the context.

    If comparing times to complete a race, the smaller mean is better.

    If comparing scores, the bigger mean is better.

  • True or False?

    The interquartile range is not affected by outliers.

    True.

    The interquartile range is not affected by outliers.

    The interquartile range only uses the central 50% of the data.

  • When comparing two data sets, which one is more spread out about the median: the one with the bigger interquartile range or the one with the smaller interquartile range?

    When comparing two data sets, the one with the bigger interquartile range is more spread out about the median.

  • If you are comparing data sets that contain outliers, which measure of dispersion should be used: the standard deviation or the interquartile range?

    If you are comparing data sets which contain outliers, you should use the interquartile range as this is not affected by outliers.