Linear Regression Models


References:

Raj Jain, The Art of Computer Systems Performance Analysis, Wiley, 1991, Chapter 14


  1. Basic Terms

  2. What Is a Good Model? [Jain, 14.1]

  3. Estimating Model Parameters [Jain, 14.2]

  4. Allocation of Variation [Jain, 14.3]

  5. Standard Deviation of Errors [Jain, 14.4]

  6. Confidence Intervals for Regression Parameters [Jain, 14.5]

  7. Confidence Intervals for Predictions [Jain, 14.6]

  8. Visual Tests to Verify Assumptions [Jain, 14.7]

  9. Regression with Categorical Predictors [Jain, 15.2]

  10. Common Mistakes [Jain, 15.6]

Basic Terms

Problem Summary

Construct a model that estimates (predicts) a random variable

Example: Predict time required to run "LaTeX" as function of input file size

Some possible models:

Some Terms

response variable
estimated quantity (e.g., run time)

predictor variable/predictor/factor
known quantities that are varied (e.g., file size)

Solution Method

Least squares fit of straight line to graph of response variable versus one predictor variable

More Terms

simple linear regression
yhat = b0 + b1*x

multiple linear regression
yhat = b0 + b1*x1 + b2*x2 + ...

What Is a Good Model?


Estimating Model Parameters

Problem

Given: n observation pairs { (x1,y1), ..., (xn,yn) }

Determine: regression parameters b0 and b1 in yhat = b0 + b1 * x

Facts About the Straight Line

y-intercept
This is the value of b0

slope
This is the value of b1

predicted value of i-th observation
yhat_i = b0 + b1 * x_i

error for i-th observation
e_i = yi - yhat_i

How To Choose the Straight Line: Least-squares Criterion

We could choose the line that makes the sum of errors = e_1 + ... + e_n = 0. But many lines satisfy this equation.

Only one line satisfies:

Minimize:
sum of squared errors (SSE) = e_1^2 + ... + e_n^2

Subject to the constraint:
mean error = (e_1 + ... + e_n)/n = 0.

So to find b1, solve d(SSE)/db1 = 0 for b1, then get b0 from the zero-mean-error constraint. Result:

b1 = (sum(x_i*y_i) - n*xbar*ybar) / (sum(x_i^2) - n*xbar^2)

b0 = ybar - b1*xbar
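The computation can be sketched in a few lines of Python. The (file size, run time) pairs below are made-up numbers for illustration, not measured data:

```python
# Least-squares fit of yhat = b0 + b1*x.
def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Slope from d(SSE)/db1 = 0:
    b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / (
        sum(xi * xi for xi in x) - n * xbar * xbar
    )
    b0 = ybar - b1 * xbar  # intercept chosen so the mean error is 0
    return b0, b1

sizes = [14, 16, 27, 42, 39, 50, 83]   # predictor x (e.g., file size)
times = [2, 5, 7, 9, 10, 13, 20]       # response y (e.g., run time)
b0, b1 = least_squares(sizes, times)
errors = [y - (b0 + b1 * x) for x, y in zip(sizes, times)]
```

Note that the errors sum to zero by construction; only the squared errors distinguish one zero-mean-error line from another.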


Allocation of Variation

Introduction

Variation is like variance, but without dividing by DF (degrees of freedom).

How much of the variation in the observed response is due to:

  1. the regression model (ideally 100%)

  2. the errors between the model and the observed data (ideally 0%)

SST, SSE, and SSR

To compute variation, we use sums of squares:

SST = total variation = sum of (y_i - ybar)^2

SSE = variation unexplained by the model (errors) = sum of (y_i - yhat_i)^2

SSR = variation explained by the regression = sum of (yhat_i - ybar)^2

These satisfy SST = SSR + SSE.

Coefficient of Determination

The ratio SSR/SST is a measure of the goodness of the regression model, and is known as the coefficient of determination, or R^2.

What does R^2=1 mean? R^2=0.64? R^2=0? (Answer: R^2=1 means the regression explains all of the variation; R^2=0.64 means it explains 64% of the variation; R^2=0 means it explains none, so the model predicts no better than the mean ybar.)
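A minimal sketch of the computation, using the least-squares fit described earlier (the data are illustrative assumptions):

```python
# Allocation of variation: SST = SSR + SSE, and R^2 = SSR/SST.
def fit(x, y):
    n, xbar, ybar = len(x), sum(x) / len(x), sum(y) / len(y)
    b1 = (sum(a * b for a, b in zip(x, y)) - n * xbar * ybar) / (
        sum(a * a for a in x) - n * xbar * xbar
    )
    return ybar - b1 * xbar, b1

x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
b0, b1 = fit(x, y)
ybar = sum(y) / len(y)
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained (errors)
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained by model
r2 = ssr / sst
```

The identity SST = SSR + SSE holds exactly only for the least-squares line, which is one reason that criterion is used.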


Standard Deviation of Errors

The variance of errors is the variation of errors divided by the DF. This variance is a useful measure, and is given a special name: Mean squared error (MSE):

MSE = SSE/(n-2)

(See [Jain, p. 228] for an explanation of why the errors have n-2 degrees of freedom.)


Confidence Intervals for Regression Parameters

Regression coefficients b0 and b1 are really estimates from a single sample. Other samples might yield different estimates.

That is, b0 and b1 are estimates of population parameters beta_0 and beta_1 in the true model

y = beta_0 + beta_1 * x

Therefore we can compute confidence intervals for b0 and b1:

b0 ± t*s_b0

b1 ± t*s_b1

where s_b0 and s_b1 are standard deviations. See [Jain, p. 229] for the formulas for s_b0 and s_b1.
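A sketch of the computation, using the standard-deviation formulas from [Jain, p. 229] (the data are illustrative, and the t value is read from a t-table for a 90% interval with n-2 = 5 degrees of freedom):

```python
from math import sqrt

x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum(a * a for a in x) - n * xbar * xbar
b1 = (sum(a * b for a, b in zip(x, y)) - n * xbar * ybar) / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = sqrt(sse / (n - 2))                  # standard deviation of errors
s_b1 = se / sqrt(sxx)
s_b0 = se * sqrt(1 / n + xbar ** 2 / sxx)

t = 2.015                                 # t[0.95; 5] from a t-table
ci_b0 = (b0 - t * s_b0, b0 + t * s_b0)
ci_b1 = (b1 - t * s_b1, b1 + t * s_b1)
```

If a parameter's CI includes zero, that parameter is not significantly different from zero at this confidence level.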


Confidence Intervals for Predictions

Because b0 and b1 are estimates of population parameters, we can also compute a CI for the response variable yhat that we are predicting. The CI is tightest at the sample mean xbar.

So be careful if you use the regression model for prediction far away from the mean!
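A sketch showing how the CI half-width grows away from xbar (illustrative data; the standard deviation of a predicted mean of m future observations follows [Jain, 14.6], and the t value is from a t-table):

```python
from math import sqrt

x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum(a * a for a in x) - n * xbar * xbar
b1 = (sum(a * b for a, b in zip(x, y)) - n * xbar * ybar) / sxx
b0 = ybar - b1 * xbar
se = sqrt(sum((yi - (b0 + b1 * xi)) ** 2
              for xi, yi in zip(x, y)) / (n - 2))

def ci_halfwidth(xp, t=2.015, m=1):
    # Std dev of the predicted mean of m future observations at x = xp:
    s_yhat = se * sqrt(1 / m + 1 / n + (xp - xbar) ** 2 / sxx)
    return t * s_yhat

near = ci_halfwidth(xbar)       # tightest at the sample mean
far = ci_halfwidth(xbar + 50)   # wider far from the mean
```

The (xp - xbar)^2 term is what widens the interval as the prediction point moves away from the center of the data.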


Visual Tests to Verify Assumptions

Test visually that four assumptions underlying linear regression hold:

Linear relationship between response y and predictor x:

Use scatter plot of y vs. x:

Errors are independent:

Plot e_i versus predicted response yhat_i and verify that there is no trend:

Also plot error as a function of experiment number. Any trend would indicate that some factor not accounted for affected the observed values. (Figures shown above for e_i versus file size can be used to illustrate this point; just re-label the x-axis to be "i".)

Errors are normally distributed:

Use a Normal QQ plot of errors e_i.

Constant standard deviation of errors:

Again plot e_i versus predicted response yhat_i and verify that there is no trend:
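The numbers behind these diagnostic plots can be sketched as follows (illustrative data; the QQ pairs use the standard normal quantiles from Python's `statistics.NormalDist`):

```python
from statistics import NormalDist

x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum(a * a for a in x) - n * xbar * xbar
b1 = (sum(a * b for a, b in zip(x, y)) - n * xbar * ybar) / sxx
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, yhat)]

# Residuals vs. predictions: plot these pairs, look for trends
# (independence check) and changing spread (constant-variance check).
trend_pairs = list(zip(yhat, resid))

# Normal QQ plot: sorted residuals against standard normal quantiles;
# an approximately straight line supports the normality assumption.
qq_pairs = [
    (NormalDist().inv_cdf((i + 0.5) / n), e)
    for i, e in enumerate(sorted(resid))
]
```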


Regression with Categorical Predictors

Just Two Categories

Sometimes the predictor variables are categorical (non-numeric). Yet regression is still possible.

Example: Predict execution time as function of CPU type (Pentium, i860).

Solution: Define binary variable x to encode values:

x=0: CPU = Pentium

x=1: CPU = i860

Regression equation: yhat = b0 + b1 * x

More Than Two Categories

Suppose there are three types of CPUs: Pentium, i860, MIPS R4000.

Using one variable requires an artificial ordering. The choice of order affects the success of regression, and makes regression parameters uninterpretable.

Instead, use two variables:

     x_1 = 1 ===> Pentium
     x_1 = 0 ===> not a Pentium

     x_2 = 1 ===> i860
     x_2 = 0 ===> not an i860

Thus the three CPUs are represented by (x_1,x_2):

     (x_1,x_2) = (1,0) ===> Pentium
     (x_1,x_2) = (0,1) ===> i860
     (x_1,x_2) = (0,0) ===> MIPS R4000
     (x_1,x_2) = (1,1) ===> illegal

This requires multiple linear regression [Jain, 15.1]:

yhat = b0 + b1*x_1 + b2*x_2
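A sketch of the encoding with hypothetical timing data. Because this model assigns one free parameter per category (a saturated model), the least-squares coefficients reduce to the baseline group mean and per-group offsets:

```python
from statistics import mean

# Hypothetical execution times per CPU type (illustrative numbers).
times = {
    "Pentium": [10.2, 9.8, 10.0],
    "i860": [12.1, 11.9],
    "R4000": [8.0, 8.4, 8.2],
}
# Dummy coding: (x_1, x_2) per category; R4000 is the (0, 0) baseline.
encode = {"Pentium": (1, 0), "i860": (0, 1), "R4000": (0, 0)}

# Least-squares solution for this saturated model:
b0 = mean(times["R4000"])           # baseline (R4000) mean
b1 = mean(times["Pentium"]) - b0    # Pentium offset from baseline
b2 = mean(times["i860"]) - b0       # i860 offset from baseline

def predict(cpu):
    x1, x2 = encode[cpu]
    return b0 + b1 * x1 + b2 * x2
```

Each prediction is just the mean of the corresponding group, and b1, b2 are directly interpretable as differences from the baseline CPU, which is exactly what an artificial numeric ordering would destroy.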


Common Mistakes

Please read section 15.6 in Jain on your own. Whenever you use regression in CS5014, be sure that you do not make any of the listed mistakes.


Please send inquiries and comments to abrams@vt.edu.