Linearization of Bivariate Data (College Board AP® Statistics): Study Guide

Written by: Mark Curtis

Reviewed by: Dan Finlay

Updated on 24 October 2024

Linearization of bivariate data

What does transforming a variable mean?

Transforming a variable means performing a mathematical operation on either the $x$-coordinates of the data points, or the $y$-coordinates
- e.g. take the $x$-coordinates and square them
  - $x$ becomes $x^{2}$
A common transformation is taking the natural logarithm of the $y$-coordinates
- $y$ becomes $\ln y$
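
As a minimal sketch of the two transformations above, the Python snippet below squares a list of $x$-coordinates and takes the natural logarithm of a list of $y$-coordinates. The data values are made up for illustration and do not come from this guide.

```python
import math

# Hypothetical data points, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.7, 7.4, 20.1, 54.6]

# Transform x: each x becomes x squared.
xs_squared = [x ** 2 for x in xs]

# Transform y: each y becomes ln(y). Every y must be positive for this to work.
ln_ys = [math.log(y) for y in ys]

print(xs_squared)
print([round(v, 2) for v in ln_ys])
```

Only one of the two variables is normally transformed at a time; both are shown here just to illustrate the operation.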

What is linearization of bivariate data?

If a scatterplot shows that data points do not follow a linear relationship
- then it is sometimes possible to transform one of the variables to make the data points follow a more linear relationship
  - This process is called linearization of bivariate data

Examiner Tips and Tricks

When transforming variables, the type of transformation will be given to you in the exam.

How do I know if the transformed data is more linear than the untransformed data?

There are two different methods to check if the transformed data is more linear than the untransformed data:
- Method 1: Create residual plots before and after the transformation
  - If, after the transformation, the residuals are more randomly scattered (no longer following a curve or pattern)
  - then this is evidence that the transformed data is more linear than the untransformed data
- Method 2: Calculate the coefficient of determination, $r^{2}$, before and after the transformation
  - If, after the transformation, $r^{2}$ is closer to 1,
  - then this is evidence that the least-squares regression line for the transformed data is a better model than the regression line for the untransformed data
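
Both methods can be sketched in a few lines of Python. The example below is an illustration under assumptions, not part of the exam method: the data set is invented (it grows roughly exponentially), and the helper names `least_squares`, `residuals` and `r_squared` are hypothetical. It fits a least-squares line before and after taking $\ln y$, then compares the residual signs and $r^{2}$.

```python
import math

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def residuals(xs, ys):
    """Residual = observed y minus predicted y, for each data point."""
    a, b = least_squares(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum(e ** 2 for e in residuals(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.1, 4.8, 8.3, 13.0, 22.5, 36.0]
ln_ys = [math.log(y) for y in ys]

# Method 1: look at the pattern of residual signs before and after.
signs_before = ["+" if e > 0 else "-" for e in residuals(xs, ys)]
signs_after = ["+" if e > 0 else "-" for e in residuals(xs, ln_ys)]
print("residual signs before:", "".join(signs_before))
print("residual signs after: ", "".join(signs_after))

# Method 2: compare r^2 before and after.
r2_before = r_squared(xs, ys)
r2_after = r_squared(xs, ln_ys)
print("r^2 before:", round(r2_before, 3))
print("r^2 after: ", round(r2_after, 3))
```

For this invented data set, the residuals before the transformation are positive at both ends and negative in the middle (a U-shape), and $r^{2}$ moves closer to 1 after the transformation. In an exam, the plots or values would be given rather than computed.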

How do I use the regression equation for the transformed data?

Find the least-squares regression line for the transformed data
- This will either have the form
  - $(\text{transformed } \hat{y}) = a + bx$
- or the form
  - $\hat{y} = a + b(\text{transformed } x)$
Then use this equation to predict $y$-values, given $x$-values
- You may need to rearrange the equation to make $\hat{y}$ the subject
- or you may need to transform the $x$-value before substituting it in
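
The two cases above can be sketched as two small Python functions. This is an illustration only: the function names are hypothetical, the coefficients in the usage lines are made up, and the specific transformations ($\ln y$ and $x^{2}$) are assumed because they are the examples used earlier in this guide.

```python
import math

def predict_from_ln_y_model(a, b, x):
    """Model: ln(y-hat) = a + b x.

    Rearranging to make y-hat the subject gives y-hat = e^(a + b x).
    """
    return math.exp(a + b * x)

def predict_from_x_squared_model(a, b, x):
    """Model: y-hat = a + b (x^2).

    The x-value is transformed (squared) before being substituted in.
    """
    return a + b * x ** 2

# Made-up coefficients, for illustration only.
print(predict_from_ln_y_model(1.0, 0.5, 2))       # e^(1.0 + 0.5 * 2) = e^2
print(predict_from_x_squared_model(3.0, 2.0, 4))  # 3.0 + 2.0 * 16
```

In the first function the transformation is undone at the end (the inverse of $\ln$ is $e$ to the power), while in the second it is applied at the start.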

Worked Example

The scatterplot below shows the population of mosquitoes, $y$, in different parts of an island against the percentage cover of vegetation, $x$%. The least-squares regression line and its residual plot are also shown.

Two graphs: Left graph shows data points and a regression line; right graph shows residuals of these data points, forming a curved pattern.

A biologist claims that the natural logarithm of the population of mosquitoes will have a linear relationship with the percentage cover of vegetation. The scatterplot, least-squares regression line and residual plot for the transformed data are shown below.

Left plot with ln(y) vs. x has a regression line and data points near it; right plot shows residuals vs. x scattered around zero.

(a) State, with justification, whether or not the new plots support the biologist's claim.

It is not enough to say the scatterplot looks more linear

Instead, you need to compare the residual plots

They are more random after the transformation, suggesting that the transformed data is more linear than the untransformed data

Remember to give all comments in context (copy phrases from the question to help)

Answer:

The residual plot from the scatterplot showing the population of mosquitoes, $y$, in different parts of an island against the percentage cover of vegetation, $x$%, shows that the residuals follow a U-shaped pattern (they are not random)

The residual plot from the scatterplot showing the natural logarithm of the population of mosquitoes, $\ln y$, in different parts of an island against the percentage cover of vegetation, $x$%, shows that these residuals are randomly spread (not following a pattern)

This means there is evidence to say that the natural logarithm of the population of mosquitoes, $\ln y$, in different parts of an island against the percentage cover of vegetation, $x$%, has a more linear relationship than the population of mosquitoes, $y$, in different parts of an island against the percentage cover of vegetation, $x$%

This supports the claim by the biologist

(b) Given that the second regression line has a slope of 0.102 and a vertical-axis intercept of 4.29, estimate, to the nearest thousand, the population of mosquitoes in an area on the island with a vegetation cover of 65%.

Answer:

Write out the equation of the least-squares regression line using $\ln y$ instead of $y$ (the $x$ is unchanged)

$\ln \hat{y} = 4.29 + 0.102 x$

Substitute in $x = 65$ and simplify

$\begin{array}{rcl} \ln \hat{y} & = & 4.29 + 0.102 \times 65 \\ \ln \hat{y} & = & 10.92 \end{array}$

Rearrange the equation to make $\hat{y}$ the subject (find $e$ to the power of the right-hand side)

$\hat{y} = e^{10.92} = 55270.79\ldots$

Round this answer to the nearest 1000 and give the answer in context

The population of mosquitoes is approximately 55000 in an area on the island with a vegetation cover of 65%
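
The arithmetic in part (b) can be checked with a short Python sketch. The slope, intercept and $x$-value come from the question; the variable names are chosen here for illustration.

```python
import math

a = 4.29   # intercept of the regression line for ln(y) against x
b = 0.102  # slope of the regression line for ln(y) against x
x = 65     # percentage cover of vegetation

# Substitute x into ln(y-hat) = a + b x.
ln_y_hat = a + b * x

# Rearrange to make y-hat the subject: y-hat = e^(ln(y-hat)).
y_hat = math.exp(ln_y_hat)

# Round to the nearest thousand.
nearest_thousand = round(y_hat / 1000) * 1000

print(round(ln_y_hat, 2))   # 10.92
print(nearest_thousand)     # 55000
```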
