The Least-Squares Regression Line (College Board AP® Statistics): Study Guide

Written by: Mark Curtis

Reviewed by: Dan Finlay

Updated on 28 August 2024

Least-squares regression line

What is the least-squares regression line?

The least-squares regression line is a special type of regression line that:
- minimizes the sum of the squares of the residuals
- and that passes through the mean point $(\bar{x}, \bar{y})$
  - where $\bar{x}$ is the mean of the $x$-values
  - and $\bar{y}$ is the mean of the $y$-values
It is used to predict $y$-values from given $x$-values
- Its full name is the least-squares regression line of $y$ on $x$
- It is not the same line you would use to predict $x$-values from given $y$-values
  - That is the least-squares regression line of $x$ on $y$
  - You cannot simply swap $x$ and $y$ in the equation

What is the sum of the squares of the residuals?

The sum of the squares of the residuals for any regression line is found by
- calculating the residuals for each data point
- squaring each residual
- then adding together all these squared values
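The three steps above can be sketched in Python. This is a minimal illustration only: the function name `sum_squared_residuals`, the data values and the line $\hat{y} = 2 + 3x$ used in the call are assumptions made for the example, and the data are hypothetical (they are not read from any graph in this guide).

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of the squares of the residuals for the line y_hat = a + b*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (a + b * x)  # step 1: residual = actual y minus predicted y
        total += residual ** 2      # steps 2 and 3: square it, then add it on
    return total


# Hypothetical data: hours studied and test scores for five students
hours = [0, 1, 2, 3, 4]
scores = [4, 6, 4, 10, 16]

ssr = sum_squared_residuals(hours, scores, a=2, b=3)
print(ssr)  # 26.0
```

Here the residuals are 2, 1, -4, -1 and 2, so the squares 4, 1, 16, 1 and 4 add up to 26.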

Why is the sum of the squares of the residuals minimized?

Residuals are like errors when comparing regression lines
- A good regression line should minimize the residuals
  - So compare the sum of all the residuals for different regression lines
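- However, the sum of all the residuals is zero for the least-squares regression line (and for any other line that passes through the mean point $(\bar{x}, \bar{y})$)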
  - The positive residuals end up cancelling out the negative ones
- So, instead, compare the sum of the squares of the residuals
  - because squaring the residuals makes them all positive
  - which stops any cancellation
The regression line with the smallest possible sum of the squares of the residuals is the least-squares regression line
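The cancellation argument can be checked numerically. The sketch below is illustrative only: the data are hypothetical, chosen so that the sums of squares agree with the table in the worked example later in this guide (they are not read from the scatterplot), and the helper name `residuals` is an assumption. All three lines pass through the mean point $(2, 8)$ of this hypothetical data.

```python
# Hypothetical data: hours studied and test scores for five students
hours = [0, 1, 2, 3, 4]
scores = [4, 6, 4, 10, 16]


def residuals(xs, ys, a, b):
    """Residuals (actual minus predicted) for the line y_hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Three candidate lines, stored as (a, b) for y_hat = a + b*x
lines = {"4x": (0, 4), "2 + 3x": (2, 3), "2.4 + 2.8x": (2.4, 2.8)}

results = {}
for label, (a, b) in lines.items():
    res = residuals(hours, scores, a, b)
    plain_sum = sum(res)                    # positives cancel negatives
    squared_sum = sum(r ** 2 for r in res)  # no cancellation possible
    results[label] = (plain_sum, squared_sum)
    print(label, round(plain_sum, 6), round(squared_sum, 6))
```

For this data the plain sums all come out as zero (up to floating-point rounding), so they cannot separate the three lines, whereas the sums of squares are 40, 26 and 25.6 and identify $\hat{y} = 2.4 + 2.8x$ as the best of the three.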

What is the equation of the least-squares regression line?

The equation of the least-squares regression line is given by
- $\hat{y} = a + b x$
  - $a$ is the $y$-intercept
  - $b$ is the slope
  - note the order of the terms: the constant $a$ comes first, then the slope term $bx$
- where $\hat{y}$ is the $y$-value predicted by the regression line
  - This is usually different from the actual $y$-value of a data point
- $x$ is the explanatory variable
- $b = r \frac{s_{y}}{s_{x}}$
  - where $r$ is the correlation coefficient
  - $s_{x} = \sqrt{\frac{1}{n - 1} \sum (x_{i} - \bar{x})^{2}} = \sqrt{\frac{\sum (x_{i} - \bar{x})^{2}}{n - 1}}$
  - $s_{y} = \sqrt{\frac{1}{n - 1} \sum (y_{i} - \bar{y})^{2}} = \sqrt{\frac{\sum (y_{i} - \bar{y})^{2}}{n - 1}}$
- and $\bar{y} = a + b \bar{x}$
  - which rearranges to $a = \bar{y} - b \bar{x}$ (to find $a$)
  - You need to find $b$ before you can find $a$
In practice, the equation of the least-squares regression line is found using technology
- e.g. a calculator
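As a rough sketch of what the technology does, the slope and intercept can be computed from the summary statistics with Python's standard `statistics` module. The data are hypothetical. The formula used for $r$ (the sum of the products of the deviations, divided by $(n - 1) s_x s_y$) is the standard sample correlation coefficient and is an addition here, since this guide does not state it.

```python
from statistics import mean, stdev

# Hypothetical data: hours studied and test scores for five students
hours = [0, 1, 2, 3, 4]
scores = [4, 6, 4, 10, 16]

n = len(hours)
x_bar, y_bar = mean(hours), mean(scores)
s_x, s_y = stdev(hours), stdev(scores)  # sample standard deviations (n - 1 divisor)

# Sample correlation coefficient r (standard formula, not given in this guide)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x      # slope: must be found first
a = y_bar - b * x_bar  # intercept: uses the slope and the mean point

print(round(b, 4), round(a, 4))  # 2.8 2.4
```

For this hypothetical data the result is $\hat{y} = 2.4 + 2.8x$. A calculator's linear regression function should give the same values.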

Examiner Tips and Tricks

The formulas for the equation of the least-squares regression line are given in the exam.

How do I interpret the slope of a regression line?

The slope, $b$, of the regression line $\hat{y} = a + b x$ is
- the amount by which the predicted $y$-value, $\hat{y}$, changes for every 1 unit of increase in the $x$-variable
  - i.e. the change in $\hat{y}$ per unit increase in $x$ (a decrease if $b$ is negative)

How do I interpret the y-intercept of a regression line?

The $y$-intercept, $a$, of the regression line $\hat{y} = a + b x$ is
- the predicted value of $y$ when the explanatory variable, $x$, equals zero
In some contexts, the $y$-intercept may not have a logical interpretation, e.g. when $x = 0$ is impossible or lies far outside the range of the data
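Both interpretations follow directly from the equation $\hat{y} = a + b x$. As a short check, writing $\hat{y}(x)$ for the prediction at a given $x$ (this function notation is an assumption made here for convenience):

$$\hat{y}(x + 1) - \hat{y}(x) = \left[ a + b (x + 1) \right] - \left[ a + b x \right] = b \qquad \text{and} \qquad \hat{y}(0) = a + b \cdot 0 = a$$

So increasing $x$ by 1 always changes the prediction by exactly $b$, and the prediction at $x = 0$ is exactly $a$.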

Worked Example

The scatterplot below shows the number of hours spent studying (on the $x$-axis) against the score in a test out of 16 points (on the $y$-axis), for five different students.

The equations of three different regression lines are shown, together with the sums of the squares of their residuals in the table below. The variable $\hat{y}$ is the predicted value of $y$. One of these three regression lines is the least-squares regression line.

Figure: scatterplot of the five data points (black crosses) with the three regression lines $\hat{y} = 4 x$, $\hat{y} = 2 + 3 x$ and $\hat{y} = 2.4 + 2.8 x$; axes labeled $x$ and $y$.

| Regression equation | Sum of the squares of the residuals |
| --- | --- |
| $\hat{y} = 2.4 + 2.8 x$ | 25.6 |
| $\hat{y} = 2 + 3 x$ | 26 |
| $\hat{y} = 4 x$ | 40 |

(a) Explain how you know which regression line is the least-squares regression line.

Answer:

The least-squares regression line minimizes the sum of the squares of the residuals

The regression line $\hat{y} = 2.4 + 2.8 x$ has the smallest sum of the squares of the residuals, as 25.6 < 26 < 40

We are told that one of the three regression lines is the least-squares regression line

This means $\hat{y} = 2.4 + 2.8 x$ is the least-squares regression line

(b) Explain what the $y$-intercept and the slope of the least-squares regression line mean in context.

Answer:

The $y$-intercept of a regression line is the predicted value of $y$ when $x$ is zero

The slope of a regression line is the amount of change in the predicted value of $y$ for every increase by 1 in the value of $x$

The $y$-intercept shows that a student who has done no studying is predicted to score 2.4 (which rounds to 2 points) out of 16

The slope shows that the predicted score of a student increases by 2.8 points per hour of studying

