References:
Raj Jain, The Art of Computer Systems Performance Analysis, Wiley, 1991, Chapter 12
L. Ott, An Introduction to Statistical Methods and Data Analysis, PWS-Kent, 1988 (a highly readable and practical book to have in your library)
Question: What is the mean time required to complete HW1 in CS5014 this semester?
Answer: Pick 10 students (the sample). Record their completion times. Compute the sample mean, which is an estimate of the mean for the entire population (all CS5014 students).
A statistic is a number that summarizes data. Jain asks [p. 177]
Example: X maps the server named in a URL to its 32-bit IP address.
We think of p_x as the probability that the experiment outcome is s such that X(s) = x.
Example: X maps student to time required to complete HW1.
F_X(a) is the probability that if we perform an experiment, the corresponding random variable's value does not exceed a:
F_X(a) = P(X <= a)
Example: For a Uniform(0,1) R.V., the CDF is a line with slope one.
Example: For a Uniform(0,1) R.V., the pdf is a line with slope zero.
P(X <= x_q) = F(x_q) = q
Example: For a Uniform(0,10) R.V., what is the
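For the Uniform(0,10) case, the CDF is F(x) = x/10, so inverting it gives any quantile directly. A minimal sketch (the function name is mine):

```python
# Quantile of a Uniform(a, b) R.V.: invert F(x) = (x - a) / (b - a).
def uniform_quantile(q, a=0.0, b=10.0):
    return a + q * (b - a)

print(uniform_quantile(0.5))  # the median of Uniform(0, 10)
print(uniform_quantile(0.9))
```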
Note: We should distinguish numerical from categorical discrete RV's in CS5014. Certain statistical terms are meaningless for categorical RV's.
Which of the basic terms listed above are meaningless for categorical data?
Consider: 1, 1, 2, 3, 4
What is the:
Here's a list of hosts in the .cs.vt.edu domain to which someone ftp'd in the last 5 minutes:
csgrad, vtopus, csgrad, ei
What is the:
Consider: 10, 20, 30, 40
What is the:
Consider: 1, 2, 2, 3, 4, 4, 5
What is the:
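One way to check answers to the exercises above is Python's standard statistics module; a quick sketch:

```python
import statistics

print(statistics.mean([1, 1, 2, 3, 4]))    # 2.2
print(statistics.median([1, 1, 2, 3, 4]))  # 2
print(statistics.mode([1, 1, 2, 3, 4]))    # 1

print(statistics.median([1, 2, 2, 3, 4, 4, 5]))  # 3

# Categorical data: mode is meaningful, but mean and median are not.
hosts = ["csgrad", "vtopus", "csgrad", "ei"]
print(statistics.mode(hosts))  # csgrad
```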
What is the difference between sample mean and population mean?
Which one of the two is the expected value of the R.V.?
See [Jain 12.3] for guidance on when to use mean vs. median vs. mode.
Example 1: What is DF for the sample mean of 10 observations?
Example 2: Why is DF for the sample variance of n observations equal to n-1?
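Python's statistics module reflects this distinction between dividing by n-1 and by n; a small illustration using the sample 10, 20, 30, 40 from above:

```python
import statistics

data = [10, 20, 30, 40]

# Sample variance divides by n - 1: one degree of freedom is "used up"
# because deviations are measured from the sample mean, not the
# (unknown) population mean.
print(statistics.variance(data))   # 500 / 3

# Population variance divides by n.
print(statistics.pvariance(data))  # 500 / 4 = 125
```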
Example: Is this good or bad: "The sample standard deviation of response time is 2."
If the sample mean response time is 0.2, then a standard deviation of 2 is terrible! But if the mean is 200, then 2 is great!
The corresponding statement with CV is:
- "The CV is 10." (Terrible!)
- "The CV is 0.01" (Great!)
Additional reference:
Law and Kelton, Simulation Modeling and Analysis, 2nd edition, McGraw-Hill, 1991, pp. 374-379.
Recall that a Cumulative Distribution Function (CDF) of a random variable X is
F_X(a) = Prob[X<=a].
Let F_M(x) denote CDF of the theoretical distribution which you believe fits your data. This distribution is the model of your data.
Let F_S(x) denote the CDF that truly underlies your sample data. This is the sample distribution.
We want to know if F_M = F_S.
The most straightforward procedure is simply to compare, graphically, the pdfs of the empirical and theoretical distributions.
Procedure:
Advantage:
Fast
Disadvantage:
Deceptive. Consider the following samples (to be fit by a continuous distribution):

1.1, 2.1, 2.3, 3.0, 3.9, 3.9, 4.1, 4.2, 4.3, 4.8, 5.5, 5.5, 5.6, 6.7, 6.8, 7.0, 8.0, 8.2, 9.1, 10.9, 12.0

A histogram with bin size = 1 suggests a function that increases, then decreases (e.g., a normal, lognormal, or beta distribution).
A histogram with bin size = 4 suggests a monotonically decreasing function (e.g., an exponential distribution).
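The effect can be reproduced in a few lines of Python (bins here are assumed to start at 1.0, the smallest sample):

```python
from collections import Counter

samples = [1.1, 2.1, 2.3, 3.0, 3.9, 3.9, 4.1, 4.2, 4.3, 4.8,
           5.5, 5.5, 5.6, 6.7, 6.8, 7.0, 8.0, 8.2, 9.1, 10.9, 12.0]

def histogram(data, width, start=1.0):
    """Count samples per bin of the given width, starting at `start`."""
    counts = Counter(int((x - start) // width) for x in data)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

print(histogram(samples, 1))  # rises to a peak, then falls
print(histogram(samples, 4))  # [10, 8, 3]: monotonically decreasing
```

Same data, two bin widths, two contradictory shapes.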
Moral:
Don't use histograms to fit distributions!
We learned from Method 1 that we cannot directly compare the empirical and theoretical pdfs.
Can we compare CDFs?
Procedure:
First we need the CDF of the samples:
- Sort the samples XS(1), XS(2), ..., XS(n) in ascending order
- The empirical distribution function F_S(x) could be the proportion of samples that are smaller than x:
  F_S(XS(i)) = i/n
- But i/n equals 1 at i=n, while most theoretical distributions tend to 1 only as x goes to infinity. Thus use:
  F_S(XS(i)) = (i-.5)/n
  The advantage is that the .5/n term goes to zero as n approaches infinity.
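The steps above can be sketched as follows (the eight sample values are purely illustrative):

```python
def empirical_cdf_points(samples):
    # Map sorted samples to (x, F_S(x)) pairs via the (i - 0.5)/n rule.
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]

data = [0.14, -0.19, 0.04, -0.04, 0.19, -0.14, 0.09, -0.09]
for x, f in empirical_cdf_points(data):
    print(f"{x:6.2f}  {f:.4f}")
```

Note that the largest sample maps to (n - 0.5)/n < 1, not to 1.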
Illustration:
Advantage:
More accurate than histograms, because there's no free parameter to choose (i.e., bin size)
Disadvantage:
So we can now compare two curves: the theoretical and empirical CDFs. But comparing "S"-shaped curves is hard for the human eye!
Isn't there anything that a human can easily compare "by eye"?
Yes! Humans can see whether points lie on a straight line pretty easily!
So we need a distribution-dependent transform that maps samples to new x,y graph points, such that the points are on a straight line if they fit the distribution.
What can we use as a transform?
Instead of comparing two curves (e.g., two CDFs), plot one curve.

Procedure:
- Sort the observed samples: XS(1), XS(2), ..., XS(n)
- Plot following graph:
- y coordinate:
- samples: XS(1), XS(2), ..., XS(n)
- x coordinate:
- Find the quantile q(i) to which XS(i) corresponds (for 1 <= i <= n); then plot the q(i)-quantile of the theoretical distribution (which is XM(i)).
Illustration:
Example [Jain, Table 12.5]:
Consider samples XS(i) in table below:
i    XS(i)
================
1    -0.19
2    -0.14
3    -0.09
4    -0.04
5     0.04
6     0.09
7     0.14
8     0.19
================

Now compute the empirical quantile to which each XS(i) corresponds:
i    XS(i)    q(i)=(i-.5)/n
================================
1    -0.19    .06
2    -0.14    .19
3    -0.09    .32
4    -0.04    .44
5     0.04    .56
6     0.09    .69
7     0.14    .81
8     0.19    .94
================================
Finally, use a theoretical distribution to compute XM(i). You must find the formula for computing quantiles in a statistics book. Let's try a Normal distribution, since it's the most important for experiment design. Here is an approximate formula for the Normal quantile:
XM(i) = 4.91 [ q(i)**0.14 - (1 - q(i))**0.14 ]
Using the formula:
i    XS(i)    q(i)=(i-.5)/n    XM(i)
========================================
1    -0.19    .06              -1.54
2    -0.14    .19              -0.89
3    -0.09    .32              -0.49
4    -0.04    .44              -0.16
5     0.04    .56               0.16
6     0.09    .69               0.49
7     0.14    .81               0.89
8     0.19    .94               1.54
========================================
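The table can be reproduced with a short script, using the closed-form Normal quantile approximation given above:

```python
samples = [-0.19, -0.14, -0.09, -0.04, 0.04, 0.09, 0.14, 0.19]
n = len(samples)

def normal_quantile(q):
    # Closed-form approximation to the standard normal quantile (Jain).
    return 4.91 * (q ** 0.14 - (1 - q) ** 0.14)

# One (XM(i), XS(i)) point per sample; plot XS against XM for the QQ plot.
for i, xs in enumerate(sorted(samples), start=1):
    q = (i - 0.5) / n
    print(f"{i}  {xs:6.2f}  {q:.2f}  {normal_quantile(q):6.2f}")
```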
Here's the resultant QQ plot:
Notes:
Another approach to deciding whether F_M(x) = F_S(x) is to use a statistical test, called a goodness-of-fit test. The advantage of a test is that it gives a binary true/false answer to the hypothesis that F_M(x) = F_S(x). The disadvantage is that the answer might (with low probability) be wrong!
See Law and Kelton for some common goodness-of-fit tests: Chi-Square, Kolmogorov-Smirnov, Anderson-Darling, ...
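To illustrate the idea behind one such test, here is a minimal sketch of the Kolmogorov-Smirnov statistic: the largest gap between the empirical CDF and the model CDF. The data and the Uniform(0,1) model below are hypothetical.

```python
def ks_statistic(samples, model_cdf):
    # Max gap between the empirical step-function CDF and the model CDF.
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fm = model_cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, abs(i / n - fm), abs((i - 1) / n - fm))
    return d

# Hypothetical data checked against a Uniform(0, 1) model, F_M(x) = x.
data = [0.05, 0.21, 0.44, 0.48, 0.62, 0.79, 0.81, 0.95]
print(ks_statistic(data, lambda x: x))
```

The test then compares this statistic to a critical value (tabulated in Law and Kelton) to accept or reject the hypothesis that F_M = F_S.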
ExpertFit (available on NEXTSTEP-based PCs in the McBryde 118 lab) is a wonderful software tool that helps fit distributions to data. It was written by Law and Kelton.
Review section 12.8 of Jain, along with Fig. 12.4, on statistics to specify the variability of data.
Last modified on 11 September 1996.
Please send inquiries and comments to abrams@vt.edu.