References:
Raj Jain, The Art of Computer Systems Performance Analysis, Wiley, 1991, Chapter 14
Construct a model that estimates (predicts) a random variable
Example: Predict time required to run "LaTeX" as function of input file size
Some possible models:
- The sample mean
- A distribution fit to the observations
- An equation that is a function of other variables (e.g., input file size)
- response variable
  - estimated quantity (e.g., run time)
- predictor variable/predictor/factor
  - known quantities that are varied (e.g., file size)
Least squares fit of straight line to graph of response variable versus one predictor variable
- simple linear regression
  - yhat = b0 + b1*x
- multiple linear regression
  - yhat = b0 + b1*x1 + b2*x2 + ...
Given: n observation pairs { (x1,y1), ..., (xn,yn) }
Determine: regression parameters b0 and b1 in yhat = b0 + b1 * x
- y-intercept
  - This is the value of b0
- slope
  - This is the value of b1
- predicted value of i-th observation
  - yhat_i = b0 + b1 * x_i
- error for i-th observation
  - e_i = y_i - yhat_i
We could choose the line so that the sum of errors e_1 + ... + e_n = 0. But many lines satisfy this equation.
Only one line satisfies:
- Minimize:
  - sum of squared errors (SSE) = e_1^2 + ... + e_n^2
- Subject to constraint:
  - mean error = (e_1 + ... + e_n)/n = 0.
So to find b1, solve d(SSE)/db1=0 for b1. Result:
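b1 = (sum of x_i*y_i - n*xbar*ybar) / (sum of x_i^2 - n*xbar^2)
The zero-mean-error constraint then gives b0 = ybar - b1*xbar, where xbar and ybar are the sample means of the x_i and y_i.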
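The fit can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard closed-form least squares solution (written out in the comments); the function name `fit_simple_linear` and the file-size/run-time data are hypothetical and are not taken from Jain.

```python
def fit_simple_linear(xs, ys):
    """Return (b0, b1) that minimize SSE for yhat = b0 + b1*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # b1 = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2)
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    sxx = sum(x * x for x in xs) - n * xbar * xbar
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b1 = sxy / sxx
    # b0 follows from the zero-mean-error constraint: ybar = b0 + b1*xbar
    b0 = ybar - b1 * xbar
    return b0, b1


# Hypothetical measurements: input file size (KB) and run time (seconds).
sizes = [10, 20, 30, 40, 50]
times = [1.2, 1.9, 3.2, 3.8, 5.1]

b0, b1 = fit_simple_linear(sizes, times)
errors = [y - (b0 + b1 * x) for x, y in zip(sizes, times)]
print(f"yhat = {b0:.3f} + {b1:.4f} * x")
print(f"sum of errors = {sum(errors):.2e}")
```

The printed sum of errors should be zero up to floating-point rounding, which is the constraint the fit is required to satisfy.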
Variation is like variance, but without dividing by DF (degrees of freedom).
How much of the variation in the response is due to:
- the regression model (ideally 100%)
- the errors between the model and the observed data (ideally 0%)
To compute variation, we use the sum of squares of various quantities:
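- SST (total variation) = sum of (y_i - ybar)^2
- SSE (variation due to errors) = sum of e_i^2
- SSR (variation explained by the regression) = SST - SSE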
The ratio SSR/SST is a measure of the goodness of the regression model, and is known as the coefficient of determination, or R^2.
What does R^2=1 mean? R^2=0.64? R^2=0?
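As an illustration, the allocation of variation can be computed directly from the data. This is a sketch only: the function name `allocate_variation` and the data are hypothetical, and SST is taken here to be the sum of squared deviations of y from its mean, with SSR = SST - SSE.

```python
def allocate_variation(xs, ys):
    """Return (SST, SSR, SSE, R^2) for a simple linear regression of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * xbar * xbar
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / sxx
    b0 = ybar - b1 * xbar

    sst = sum((y - ybar) ** 2 for y in ys)  # total variation
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    if sst == 0:
        raise ValueError("y is constant; R^2 is undefined")
    ssr = sst - sse  # explained by the regression
    return sst, ssr, sse, ssr / sst


# Hypothetical measurements: input file size (KB) and run time (seconds).
sizes = [10, 20, 30, 40, 50]
times = [1.2, 1.9, 3.2, 3.8, 5.1]

sst, ssr, sse, r2 = allocate_variation(sizes, times)
print(f"SST={sst:.3f} SSR={ssr:.3f} SSE={sse:.3f} R^2={r2:.4f}")
```

An R^2 close to 1 here says that almost all of the variation in run time is accounted for by the straight-line model in file size.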
The variance of errors is the variation of errors divided by the DF. This variance is a useful measure, and is given a special name: Mean squared error (MSE):
MSE = SSE/(n-2)
(See [Jain, p. 228] for an explanation of why the DF is n-2: two parameters, b0 and b1, are estimated from the data.)
Regression coefficients b0 and b1 are really estimates from a single sample. Other samples might yield different estimates.
We assume there exist population parameters beta_0 and beta_1 such that the true model is
y = beta_0 + beta_1 * x
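We can therefore compute confidence intervals for beta_0 and beta_1 from the estimates b0 and b1:
b0 ± t*s_b0
b1 ± t*s_b1
where s_b0 and s_b1 are the standard deviations of b0 and b1, and t is the t-distribution quantile with n-2 degrees of freedom at the chosen confidence level. See [Jain, p. 229] for the formulas for s_b0 and s_b1.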
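A possible sketch of the interval computation follows. It assumes the usual formulas s_e = sqrt(MSE), s_b0 = s_e * sqrt(1/n + xbar^2/Sxx) and s_b1 = s_e / sqrt(Sxx), with Sxx = sum(x_i^2) - n*xbar^2. The function name, the data, and the choice to pass the t quantile in as an argument (looked up in a t table) are illustrative assumptions.

```python
import math


def parameter_confidence_intervals(xs, ys, t_value):
    """Return ((b0_lo, b0_hi), (b1_lo, b1_hi)) for the given t quantile."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three (x, y) pairs of equal length")
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * xbar * xbar
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / sxx
    b0 = ybar - b1 * xbar

    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))  # square root of MSE
    s_b0 = s_e * math.sqrt(1.0 / n + xbar * xbar / sxx)
    s_b1 = s_e / math.sqrt(sxx)
    return ((b0 - t_value * s_b0, b0 + t_value * s_b0),
            (b1 - t_value * s_b1, b1 + t_value * s_b1))


# Hypothetical measurements: input file size (KB) and run time (seconds).
sizes = [10, 20, 30, 40, 50]
times = [1.2, 1.9, 3.2, 3.8, 5.1]

# 90% confidence with n-2 = 3 degrees of freedom: t quantile from a t table.
T_90_DF3 = 2.353
ci_b0, ci_b1 = parameter_confidence_intervals(sizes, times, T_90_DF3)
print(f"b0 in ({ci_b0[0]:.3f}, {ci_b0[1]:.3f})")
print(f"b1 in ({ci_b1[0]:.4f}, {ci_b1[1]:.4f})")
```

If an interval includes zero, the corresponding parameter is not significantly different from zero at that confidence level.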
Because b0 and b1 are estimates of population parameters, we can also compute a CI for the response variable yhat that we are predicting. The CI is tightest at the sample mean xbar.
So be careful if you use the regression model for prediction far away from the mean!
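The widening of the interval away from xbar can be seen numerically. The sketch below assumes the usual formula for the standard deviation of the predicted mean of m future observations at x_p, s_e * sqrt(1/m + 1/n + (x_p - xbar)^2 / Sxx), with the 1/m term dropped when predicting the mean of a very large number of observations. The function name and the data are hypothetical.

```python
import math


def predicted_response_std(xs, ys, x_p, m=None):
    """Standard deviation of the predicted response at x_p.

    m is the number of future observations being averaged; m=None means the
    mean of a very large number of future observations.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three (x, y) pairs of equal length")
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * xbar * xbar
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    future_term = 0.0 if m is None else 1.0 / m
    return s_e * math.sqrt(future_term + 1.0 / n + (x_p - xbar) ** 2 / sxx)


# Hypothetical measurements: input file size (KB) and run time (seconds).
sizes = [10, 20, 30, 40, 50]
times = [1.2, 1.9, 3.2, 3.8, 5.1]

T_90_DF3 = 2.353  # 90% confidence, n-2 = 3 degrees of freedom (t table)
half_widths = {}
for x_p in (30, 50, 100):  # 30 is xbar; 100 is far outside the measured range
    half_widths[x_p] = T_90_DF3 * predicted_response_std(sizes, times, x_p)
    print(f"x = {x_p:3d}: yhat ± {half_widths[x_p]:.3f}")
```

The half-width is smallest at x = 30 (the sample mean) and grows with the distance from it, so a prediction at x = 100 is much less certain than one inside the measured range.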
Test visually that four assumptions underlying linear regression hold:
Also plot error as a function of experiment number. Any trend would indicate that some factor not accounted for affected the observed values. (Figures shown above for ei versus file size can be used to illustrate this point; just re-label x-axis to be "i".)
Sometimes the predictor variables are categorical (non-numeric). Yet regression is still possible.
Example: Predict execution time as a function of CPU type (Pentium, i860).
Solution: Define binary variable x to encode values:
x=0: CPU = Pentium
x=1: CPU = i860
Regression equation: yhat = b0 + b1 * x
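With a single 0/1 predictor, the least squares line passes through the two group means, so b0 is the mean time on the Pentium and b1 is the difference between the i860 mean and the Pentium mean. A small sketch with made-up execution times (the function name and the numbers are hypothetical):

```python
def fit_simple_linear(xs, ys):
    """Return (b0, b1) that minimize SSE for yhat = b0 + b1*x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * xbar * xbar
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / sxx
    return ybar - b1 * xbar, b1


# Hypothetical execution times (seconds) on each CPU type.
pentium_times = [4.0, 4.4, 4.2]
i860_times = [3.0, 3.1, 3.2]

# Encode the category: x=0 for Pentium, x=1 for i860.
xs = [0] * len(pentium_times) + [1] * len(i860_times)
ys = pentium_times + i860_times

b0, b1 = fit_simple_linear(xs, ys)
pentium_mean = sum(pentium_times) / len(pentium_times)
i860_mean = sum(i860_times) / len(i860_times)
print(f"b0 = {b0:.3f} (Pentium mean = {pentium_mean:.3f})")
print(f"b1 = {b1:.3f} (i860 mean - Pentium mean = {i860_mean - pentium_mean:.3f})")
```

A confidence interval for b1 that excludes zero would then indicate a significant difference between the two CPU types.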
Suppose there are three types of CPUs: Pentium, i860, MIPS R4000.
Using one variable requires an artificial ordering. The choice of order affects the success of regression, and makes regression parameters uninterpretable.
Instead, use two variables:
x_1 = 1 ===> Pentium
x_1 = 0 ===> not a Pentium
x_2 = 1 ===> i860
x_2 = 0 ===> not an i860
Thus the three CPUs are represented by (x_1,x_2):
(x_1,x_2) = (1,0) ===> Pentium
(x_1,x_2) = (0,1) ===> i860
(x_1,x_2) = (0,0) ===> MIPS R4000
(x_1,x_2) = (1,1) ===> illegal
This requires multiple linear regression [Jain, 15.1]:
yhat = b0 + b1*x_1 + b2*x_2
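The two-variable encoding can be tried out with a small multiple regression. The sketch below solves the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination, which is one common way to obtain the least squares coefficients and is not necessarily the procedure Jain uses. All function names and the execution times are hypothetical.

```python
def encode_cpu(cpu):
    """Map a CPU name to the indicator pair (x_1, x_2); MIPS R4000 is (0, 0)."""
    codes = {"Pentium": (1, 0), "i860": (0, 1), "MIPS R4000": (0, 0)}
    return codes[cpu]


def solve_linear_system(a, b):
    """Solve a*z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple_linear(rows, ys):
    """Least squares fit of yhat = b0 + b1*x_1 + ... via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical execution times (seconds) per CPU type.
observations = [
    ("Pentium", 4.0), ("Pentium", 4.4),
    ("i860", 3.0), ("i860", 3.2),
    ("MIPS R4000", 2.0), ("MIPS R4000", 2.2),
]
rows = [encode_cpu(cpu) for cpu, _ in observations]
ys = [t for _, t in observations]

b0, b1, b2 = fit_multiple_linear(rows, ys)
print(f"yhat = {b0:.2f} + {b1:.2f}*x_1 + {b2:.2f}*x_2")
```

With this encoding the parameters stay interpretable: b0 is the mean time of the baseline CPU (MIPS R4000), while b1 and b2 are the differences of the Pentium and i860 means from that baseline.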
Please read section 15.6 in Jain on your own. Whenever you use regression in CS5014, be sure that you do not make any of the listed mistakes.