Least Squares Fit of a Straight Line to Data

The data I'm trying to fit with a straight line might be distance as a function of time, density as a function of pressure, or any of a large number of other paired physical measurements. To simplify the discussion, and to stay consistent with many decades of notation, I'll just work with x, y data pairs. For each of n values of x, a value of y is measured and recorded. For example, in some experiment I might have obtained the following y values for the listed x values.

x   0.0   1.0   2.0   3.0   4.0   5.0   6.0    7.0    8.0
y   0.2   1.8   4.3   5.7   8.3   9.7   12.4   13.6   16.3

Plotting these points, I might guess that the line y = 2 x would be a good fit, but I need a more formal and systematic approach to finding the best equation for a line "fitting" the points.

To keep the notation clean, the values of x and y are numbered using subscripts. For example, the point (x_2, y_2) is (1.0, 1.8). Generally the subscript will be represented by the letter "i", which for this example can take on any value from 1 to 9.

My goal is to determine values for the slope "m" and the intercept "b" in an equation:

y = m x + b

that result in a "best fit" of the line equation to the data points. The concept of "best fit" requires definition of some measure of the error between the data and the line. The choice made in this method is to look at each value x_i, and calculate the square of the difference between the data value y_i and the y value determined by evaluating the above equation at that x_i. Squaring guarantees a non-negative number, so errors of opposite sign cannot cancel and we can sum the errors at all points to obtain an overall measure of error:

$$ E(m,b) = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2 $$

I've written the error measure as a function of "m" and "b" to emphasize the fact that these are the unknowns in our problem. The x_i and y_i are all just known numbers. The slope and intercept will be determined to give a "best fit", by obtaining the smallest possible value of the error.
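As a quick illustration, the error measure (the sum over all points of the squared difference between y_i and m x_i + b) can be evaluated for the guessed line y = 2 x using the sample data above. This is only a sketch: the program name and the variable "err" are my own choices for illustration.

```fortran
program error_measure
  implicit none
  integer, parameter :: n = 9
  real :: x(n), y(n)
  real :: m, b, err

  ! Sample data from the table at the top of this page
  x = (/ 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 /)
  y = (/ 0.2, 1.8, 4.3, 5.7, 8.3, 9.7, 12.4, 13.6, 16.3 /)

  ! Trial line y = 2 x (the guess made after plotting the points)
  m = 2.0
  b = 0.0

  ! Sum of the squared differences between the data and the line
  err = sum( (y - (m*x + b))**2 )

  print *, 'Error for m =', m, ' b =', b, ' is ', err
end program error_measure
```

For this guess the error comes out at about 0.85. Any other choice of "m" and "b" can be tried the same way, and the least squares fit is the pair that makes this number as small as possible.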

We are looking for the least value of the squared measure of error. Calculus tells us that this will occur when the partial derivative of error with respect to "m" and the partial with respect to "b" are both equal to zero. This gives us two equations to use in obtaining the two unknowns.

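The equation resulting from evaluating the partial derivative with respect to "m" is:

$$ \frac{\partial E}{\partial m} = \sum_{i=1}^{n} 2 \left( y_i - m x_i - b \right) (-x_i) = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right) = 0 $$

Dividing both sides of the final form by 2 and rearranging gives one equation for "m" and "b":

$$ m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i $$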

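The equation resulting from evaluating the partial derivative with respect to "b" is:

$$ \frac{\partial E}{\partial b} = \sum_{i=1}^{n} 2 \left( y_i - m x_i - b \right) (-1) = -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right) = 0 $$

Dividing both sides of the final form by 2 and rearranging gives:

$$ m \sum_{i=1}^{n} x_i + n b = \sum_{i=1}^{n} y_i $$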

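In standard notation for linear algebra, with the intercept "b" listed first and the slope "m" second, these two equations can be written as:

$$ \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \begin{pmatrix} b \\ m \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} $$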
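To make this concrete, here is the system worked through by hand for the nine sample points listed at the top of the page. I have ordered the unknowns with the intercept "b" first and the slope "m" second, and I use Cramer's rule only because it is convenient for a 2 by 2 system; the symbol D for the determinant is my own label.

```latex
\begin{align*}
n &= 9, \quad \sum x_i = 36, \quad \sum x_i^2 = 204, \quad
     \sum y_i = 72.3, \quad \sum x_i y_i = 409.2 \\[4pt]
9\,b + 36\,m &= 72.3 \\
36\,b + 204\,m &= 409.2 \\[4pt]
D &= 9 \cdot 204 - 36 \cdot 36 = 540 \\
b &= \frac{72.3 \cdot 204 - 36 \cdot 409.2}{D} = \frac{18}{540} \approx 0.033 \\
m &= \frac{9 \cdot 409.2 - 36 \cdot 72.3}{D} = \frac{1080}{540} = 2.0
\end{align*}
```

So the best fit line for this data is very close to the original guess of y = 2 x, with a small positive intercept.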

This can now be translated into Fortran for solution. Assume that the arrays "x" and "y" contain the "n" data points. Then we load a Fortran doubly dimensioned array as

a(1,1) = n

a(1,2) = sum(x(1:n))

a(2,1) = a(1,2)

a(2,2) = sum(x(1:n)**2)

The array for the right hand side can be set as:

rhs(1) = sum(y(1:n))

rhs(2) = sum(y(1:n)*x(1:n))

These two arrays can be passed to an appropriate linear system solver. With the arrays loaded as above, the first element of the solution is the intercept "b" and the second is the slope "m".
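For a system this small, a general solver is not strictly needed. The following is a minimal sketch that loads the arrays exactly as above for the sample data and then solves the 2 by 2 system directly with Cramer's rule. The use of Cramer's rule, the program name, and the variable "det" are my assumptions for illustration; any linear system solver should give the same answer.

```fortran
program line_fit
  implicit none
  integer, parameter :: n = 9
  real :: x(n), y(n)
  real :: a(2,2), rhs(2)
  real :: det, b, m

  ! Sample data from the table at the top of this page
  x = (/ 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 /)
  y = (/ 0.2, 1.8, 4.3, 5.7, 8.3, 9.7, 12.4, 13.6, 16.3 /)

  ! Coefficient matrix; unknowns are ordered (b, m)
  a(1,1) = real(n)
  a(1,2) = sum(x(1:n))
  a(2,1) = a(1,2)
  a(2,2) = sum(x(1:n)**2)

  ! Right hand side
  rhs(1) = sum(y(1:n))
  rhs(2) = sum(y(1:n)*x(1:n))

  ! The determinant vanishes only if every x value is the same,
  ! in which case no unique line exists
  det = a(1,1)*a(2,2) - a(1,2)*a(2,1)
  if (abs(det) <= epsilon(det)*abs(a(1,1)*a(2,2))) then
    print *, 'All x values are identical; cannot fit a line.'
    stop
  end if

  ! Cramer's rule for the 2 by 2 system
  b = (rhs(1)*a(2,2) - a(1,2)*rhs(2)) / det
  m = (a(1,1)*rhs(2) - a(2,1)*rhs(1)) / det

  print *, 'slope     m = ', m
  print *, 'intercept b = ', b
end program line_fit
```

For the sample data this should report a slope of about 2.0 and an intercept of about 0.033. For larger systems, or when round-off is a concern, a library solver is the safer choice.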


Written and Maintained by John Mahaffy : jhm@psu.edu