The Least-Squares Regression Line (College Board AP® Statistics)

Study Guide

Written by: Mark Curtis

Reviewed by: Dan Finlay

Least-squares regression line

What is the least-squares regression line?

  • The least-squares regression line is a special type of regression line that:

    • minimizes the sum of the squares of the residuals

    • and that passes through the mean point $(\bar{x}, \bar{y})$

      • where $\bar{x}$ is the mean of the x-values

      • and $\bar{y}$ is the mean of the y-values

  • It is used to predict y-values from given x-values

    • Its full name is the least-squares regression line of y on x

    • A different line is needed to predict x-values from given y-values

      • That is the least-squares regression line of x on y

      • You cannot get it by rearranging the y on x equation, so x and y cannot be swapped
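
To see why x and y cannot be swapped, the sketch below fits both lines to a small made-up data set. The data and the helper name `fit_line` are illustrative assumptions and are not part of this guide. The slope is computed with the usual least-squares formula, the sum of $(x_i - \bar{x})(y_i - \bar{y})$ divided by the sum of $(x_i - \bar{x})^2$.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line of ys on xs: y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # forces the line through the mean point
    return a, b

# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

a_yx, b_yx = fit_line(xs, ys)   # y on x: predicts y from x
a_xy, b_xy = fit_line(ys, xs)   # x on y: predicts x from y

print(f"y on x: y_hat = {a_yx:.3f} + {b_yx:.3f} x")
print(f"x on y: x_hat = {a_xy:.3f} + {b_xy:.3f} y")
# Rearranging the y-on-x line for x would give a slope of 1 / b_yx,
# which is not the slope of the x-on-y line.
print(f"1 / b_yx = {1 / b_yx:.3f}, but b_xy = {b_xy:.3f}")
```

For this data the two slopes are about 0.526 and 0.507, so rearranging one line does not give the other.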

What is the sum of the squares of the residuals?

  • The sum of the squares of the residuals for any regression line is found by

    • calculating the residuals for each data point

    • squaring each residual

    • then adding together all these squared values
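
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual convention that a residual is the actual y-value minus the predicted y-value. The data set and the function name `sum_squared_residuals` are made up for the example.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of the squares of the residuals for the line y_hat = a + b*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (a + b * x)   # step 1: actual minus predicted
        total += residual ** 2       # steps 2 and 3: square, then add
    return total

# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# For the line y_hat = 2x the residuals are 0, 0, -1 and 0
print(sum_squared_residuals(xs, ys, a=0, b=2))   # prints 1.0
```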

Why is the sum of the squares of the residuals minimized?

  • Residuals are like errors when comparing regression lines

    • A good regression line should make the residuals as small as possible

      • So one idea is to compare the sum of all the residuals for different regression lines

    • However, for any line through the mean point, the sum of all the residuals is zero

      • The positive residuals cancel out the negative ones

    • So, instead, compare the sum of the squares of the residuals

      • because squaring the residuals means none of them are negative

      • which stops any cancellation

  • The regression line with the smallest possible sum of the squares of the residuals is the least-squares regression line
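
The cancellation can be checked numerically. The sketch below uses made-up data and two candidate lines that both pass through the mean point of that data. The data, the candidate lines and the helper name `residuals` are assumptions for illustration only.

```python
def residuals(xs, ys, a, b):
    """Residuals (actual y minus predicted y) for the line y_hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data; its mean point is (2.5, 4.75)
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# Two candidate lines that both pass through the mean point
candidates = {
    "y_hat = 1.9x": (0.0, 1.9),
    "y_hat = 1 + 1.5x": (1.0, 1.5),
}

for label, (a, b) in candidates.items():
    res = residuals(xs, ys, a, b)
    total = sum(res)                      # cancels to zero (up to rounding)
    total_sq = sum(r ** 2 for r in res)   # no cancellation
    print(f"{label}: sum = {total:.2f}, sum of squares = {total_sq:.2f}")
```

Both sums of residuals come out as zero (up to floating-point rounding), so they cannot separate the two lines. The sums of squares are about 0.70 and 1.50, which shows that the first line fits this data better.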

What is the equation of the least-squares regression line?

  • The equation of the least-squares regression line is given by

    • $\hat{y} = a + bx$

      • a is the y-intercept

      • b is the slope

      • note the order of the terms

    • where $\hat{y}$ is the y-value predicted by the regression line

      • This is usually different to the actual y-value of a data point

    • x is the explanatory variable

    • $b = r \dfrac{s_y}{s_x}$

      • where r is the correlation coefficient

      • $s_x = \sqrt{\dfrac{1}{n-1} \sum (x_i - \bar{x})^2} = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n-1}}$

      • $s_y = \sqrt{\dfrac{1}{n-1} \sum (y_i - \bar{y})^2} = \sqrt{\dfrac{\sum (y_i - \bar{y})^2}{n-1}}$

    • and $\bar{y} = a + b\bar{x}$

      • which rearranges to $a = \bar{y} - b\bar{x}$ (to find a)

      • You need to find b before you can find a

  • In practice, the equation of the least-squares regression line is found using technology

    • e.g. a calculator
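
As one possible example of using technology, the sketch below finds the slope first and then the y-intercept, using Python's standard `statistics` module. The data set and the function name `least_squares_line` are made up for illustration. The correlation coefficient is computed from the standard formula, the sum of $(x_i - \bar{x})(y_i - \bar{y})$ divided by $(n-1) s_x s_y$, which is assumed here because this section does not state it.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y_hat = a + b*x, with b = r*s_y/s_x and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations (divide by n - 1)
    # Correlation coefficient r (standard formula, assumed here)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # find the slope first
    a = y_bar - b * x_bar    # then the y-intercept
    return a, b

# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

a, b = least_squares_line(xs, ys)
print(f"slope b = {b:.2f}, y-intercept a = {a:.2f}")
```

For this data the slope is 1.9 and the y-intercept is zero (up to rounding), so the line is approximately $\hat{y} = 1.9x$.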

Examiner Tips and Tricks

The formulas for the equation of the least-squares regression line are given in the exam.

How do I interpret the slope of a regression line?

  • The slope, b, of the regression line $\hat{y} = a + bx$ is

    • the amount by which the predicted y-value, $\hat{y}$, changes for every 1 unit of increase in the x-variable

      • i.e. the change in $\hat{y}$ per unit increase in x (a decrease if b is negative)

How do I interpret the y-intercept of a regression line?

  • The y-intercept, a, of the regression line $\hat{y} = a + bx$ is

    • the predicted value of y when the explanatory variable, x, equals zero

  • In some contexts, the y-intercept may not have a logical interpretation

    • e.g. when x = 0 is impossible in context or lies far outside the range of the data

Worked Example

The scatterplot below shows the number of hours spent studying (on the x-axis) against the score in a test out of 16 points (on the y-axis), for five different students.

The equations of three different regression lines are shown, together with the sums of the squares of their residuals, in the table below. The variable $\hat{y}$ is the predicted value of y. One of these three regression lines is the least-squares regression line.

Graph showing three linear regression lines: ŷ = 4x, ŷ = 2 + 3x, and ŷ = 2.4 + 2.8x along with data points marked as black Xs on a grid. Axes labeled "x" and "y".

Regression equation        | Sum of the squares of the residuals
$\hat{y} = 2.4 + 2.8x$     | 25.6
$\hat{y} = 2 + 3x$         | 26
$\hat{y} = 4x$             | 40

(a) Explain how you know which regression line is the least-squares regression line.

Answer:

The least-squares regression line minimizes the sum of the squares of the residuals

The regression line $\hat{y} = 2.4 + 2.8x$ has the smallest sum of the squares of the residuals, as 25.6 < 26 < 40

We are told that one of the three regression lines is the least-squares regression line

This means $\hat{y} = 2.4 + 2.8x$ is the least-squares regression line

(b) Explain what the y-intercept and the slope of the least-squares regression line mean in context.

Answer:

The y-intercept of a regression line is the predicted value of y when x is zero

The slope of a regression line is the amount of change in the predicted value of y for every increase by 1 in the value of x

The y-intercept shows that a student who has done no studying is predicted to score 2.4 (which rounds to 2 points) out of 16

The slope shows that the predicted score of a student increases by 2.8 points per hour of studying
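
The two interpretations in part (b) can be checked with a short sketch that evaluates the line $\hat{y} = 2.4 + 2.8x$ from the worked example. The function name `predicted_score` and the parameter name `hours` are made up for illustration.

```python
def predicted_score(hours):
    """Predicted test score from the worked example's line y_hat = 2.4 + 2.8x."""
    return 2.4 + 2.8 * hours

# y-intercept: the predicted score after 0 hours of studying
print(predicted_score(0))   # prints 2.4

# slope: the change in predicted score for one extra hour of studying
change = predicted_score(3) - predicted_score(2)
print(round(change, 1))     # prints 2.8
```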

Mark Curtis

Author: Mark Curtis

Expertise: Maths

Mark graduated twice from the University of Oxford: once in 2009 with a First in Mathematics, then again in 2013 with a PhD (DPhil) in Mathematics. He has had nine successful years as a secondary school teacher, specialising in A-Level Further Maths and running extension classes for Oxbridge Maths applicants. Alongside his teaching, he has written five internal textbooks, introduced new spiralling school curriculums and trained other Maths teachers through outreach programmes.

Dan Finlay

Author: Dan Finlay

Expertise: Maths Lead

Dan graduated from the University of Oxford with a First class degree in mathematics. As well as teaching maths for over 8 years, Dan has marked a range of exams for Edexcel, tutored students and taught A Level Accounting. Dan has a keen interest in statistics and probability and their real-life applications.