Summarizing Measured Data [Jain, Ch. 12]


References:

Raj Jain, The Art of Computer Systems Performance Analysis, Wiley, 1991, Chapter 12

L. Ott, An Introduction to Statistical Methods and Data Analysis, PWS-Kent, 1988 (a highly readable and practical book to have in your library)


  1. What Is Statistics?

  2. Basic Probability Terms [Jain, 12.1]

  3. Mean, Mode, and Median [Jain, 12.2]

  4. A Few More Terms [Jain, 12.8]

  5. Determining Distribution of Data [Jain, 12.10]

What Is Statistics?

Example:

Question: What is the mean time required to complete HW1 in CS5014 this semester?

Answer: Pick 10 students (the sample). Record their completion times. Compute the sample mean, which is an estimate of the mean for the entire population (all CS5014 students).
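This sampling procedure can be sketched in a few lines of Python (the completion times below are invented purely for illustration):

```python
import random
import statistics

# Hypothetical HW1 completion times in hours for the whole class (invented data).
population = [6.5, 8.0, 7.2, 9.1, 5.5, 10.0, 7.8, 6.9, 8.4, 7.0,
              9.5, 6.1, 8.8, 7.4, 5.9, 10.2, 7.7, 6.3, 8.1, 7.6]

# Draw a sample of 10 students; the sample mean estimates the population mean.
sample = random.sample(population, 10)
sample_mean = statistics.mean(sample)
population_mean = statistics.mean(population)
print(sample_mean, population_mean)
```

Each run draws a different sample, so the sample mean varies around the population mean; later lectures quantify that variation.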

A Statistic

A statistic is a number that summarizes a set of data; see the discussion in Jain [p. 177].


Basic Terms

Discrete random variable (R.V.)

Function X mapping experiment outcome to an integer

Example: X maps the server named in a URL to its 32-bit IP address

Probability mass function (pmf)

Associates each element x in the range of X with a number p_x in [0,1]

We think of p_x as the probability that the experiment outcome is s such that X(s) = x.

Continuous random variable

Function X mapping outcome to real number

Example: X maps student to time required to complete HW1.

Cumulative distribution function (CDF)

Maps each a in the range of X to a number F_X(a) in [0,1]

F_X(a) is the probability that, when we perform the experiment, the random variable's value does not exceed a:

F_X(a) = P(X <= a)

Example: For a Uniform(0,1) R.V., the CDF is a line with slope one on [0,1].

Probability density function (pdf)

For continuous RV's, the derivative of the CDF (denoted "f")

Example: For a uniformly distributed R.V., pdf is a line with slope zero.

q-Quantile

The x value at which the CDF takes the value q, denoted x_q:
P(X <= x_q) = F(x_q) = q

Example: For a Uniform(0,10) R.V., what is the q-quantile? (Since F(x) = x/10 on [0,10], solving F(x_q) = q gives x_q = 10q.)
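The quantile of a Uniform(0,10) R.V. follows by inverting its CDF F(x) = x/10; a minimal Python sketch:

```python
def uniform_quantile(q, a=0.0, b=10.0):
    """q-quantile of Uniform(a, b): invert F(x) = (x - a) / (b - a)."""
    assert 0.0 <= q <= 1.0
    return a + q * (b - a)

print(uniform_quantile(0.5))   # the 0.5-quantile (median) of Uniform(0, 10) is 5.0
```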

Categorical Data

Note: We should distinguish numerical from categorical discrete RV's in CS5014. Certain statistical terms are meaningless for categorical RV's.

Example of Numerical:
Interarrival time, time on task, number of errors
Example of Categorical:
Host names, IP addresses (!), subroutine names, workload names

Which of the basic terms listed above are meaningless for categorical data?


Mean, Mode, and Median

Example 1

Consider: 1, 1, 2, 3, 4

What is the mean, median, and mode?

Definitions

Sample mean
the sum of the values divided by the number of values
Sample median
the 0.5-quantile of an RV
Sample mode
the most likely value (i.e., the value(s) at which the pdf or pmf is maximum)
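Applied to Example 1's data (1, 1, 2, 3, 4), these definitions give the following, computed here with Python's standard statistics module:

```python
import statistics

data = [1, 1, 2, 3, 4]

print(statistics.mean(data))    # 2.2  (sum 11 divided by count 5)
print(statistics.median(data))  # 2    (middle value of the sorted list)
print(statistics.mode(data))    # 1    (the most frequent value)
```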

Example 2

Here's a list of hosts in the .cs.vt.edu domain to which someone ftp'd in the last 5 minutes:

csgrad, vtopus, csgrad, ei

What is the mean, median, and mode?

Example 3

Consider: 10, 20, 30, 40

What is the mean, median, and mode?

Example 4

Consider: 1, 2, 2, 3, 4, 4, 5

What is the mean, median, and mode?

What is the difference between sample mean and population mean?

Which one of the two is the expected value of the R.V.?

See [Jain 12.3] for guidance on when to use mean vs. median vs. mode.


A Few More Terms

Degrees of Freedom (DF)

The number of independent terms in a sum

Example 1: What is DF for the sample mean of 10 observations?

Example 2: Why is the DF for the sample variance of n observations equal to n-1?
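The n-1 answer can be seen numerically: deviations from the sample mean always sum to zero, so once any n-1 of them are known the last one is determined. A small sketch:

```python
import statistics

data = [4.0, 7.0, 1.0, 9.0, 3.0]        # any sample will do
xbar = statistics.mean(data)
deviations = [x - xbar for x in data]

# The deviations are constrained to sum to zero (up to rounding) ...
print(sum(deviations))

# ... so the last deviation is recoverable from the other n-1.
print(-sum(deviations[:-1]), deviations[-1])
```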

Coefficient of Variation (COV or CV)

The ratio of the sample standard deviation to the sample mean

Example: Is this good or bad: "The sample standard deviation of response time is 2."

If the sample mean response time is 0.2, then 2 is terrible! But if mean is 200, 2 is great!

The corresponding statement with CV is:

  • "The CV is 10." (Terrible!)

  • "The CV is 0.01" (Great!)
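The CV computation itself is a one-liner; a sketch using Python's standard library (the data is made up for illustration):

```python
import statistics

def cv(data):
    """Coefficient of variation: sample standard deviation over sample mean."""
    return statistics.stdev(data) / statistics.mean(data)

print(cv([1, 2, 3, 4, 5]))   # ~0.527: stdev sqrt(2.5) over mean 3
```

Because CV is a ratio of two quantities with the same units, it is dimensionless, which is exactly why it is comparable across experiments.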

Determining Distributions of Data

Additional reference:

Law and Kelton, Simulation Modeling and Analysis, 2nd edition, McGraw-Hill, 1991, pp. 374-379.

Why Fit a Theoretical Distribution to Data?

How can we tell if observed data fits a theoretical distribution?

Recall that a Cumulative Distribution Function (CDF) of a random variable X is

F_X(a) = Prob[X<=a].

Let F_M(x) denote CDF of the theoretical distribution which you believe fits your data. This distribution is the model of your data.

Let F_S(x) denote the CDF that truly underlies your sample data. This is the sample distribution.

We want to know if F_M = F_S.

Method 1: Compare pdf's (i.e., Histograms)

The most straightforward procedure is merely to compare, graphically, the pdf of the empirical and theoretical distributions.

Procedure:

Advantage:

Fast

Disadvantage:

Deceptive

Consider the following samples (to be fit by a continuous distribution):

            1.1, 
            2.1,  2.3,  
            3.0,  3.9,  3.9,
            4.1,  4.2,  4.3,  4.8,
            5.5,  5.5,  5.6,
            6.7,  6.8,
            7.0,
            8.0,  8.2,
            9.1,
            10.9,
            12.0

A histogram with bin size = 1 suggests a function that increases, then decreases (e.g., a normal, lognormal, or beta distribution)

A histogram with bin size = 4 suggests a monotonically decreasing function (e.g., exponential distribution)

Moral:

Don't use histograms to fit distributions!
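The deception is easy to reproduce. The sketch below bins the 21 samples with widths 1 and 4; placing the bin origin at 1.0 is an arbitrary choice here, and that sensitivity to origin is itself part of the problem:

```python
from collections import Counter

samples = [1.1, 2.1, 2.3, 3.0, 3.9, 3.9, 4.1, 4.2, 4.3, 4.8,
           5.5, 5.5, 5.6, 6.7, 6.8, 7.0, 8.0, 8.2, 9.1, 10.9, 12.0]

def histogram(data, width, origin=1.0):
    """Count samples per bin of the given width; return a list of bin counts."""
    bins = Counter(int((x - origin) // width) for x in data)
    return [bins.get(i, 0) for i in range(max(bins) + 1)]

print(histogram(samples, 1))   # rises then falls: looks bell-shaped
print(histogram(samples, 4))   # [10, 8, 3]: looks monotonically decreasing
```

Same data, two contradictory shapes, just from the free choice of bin size.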

Method 2: Compare CDFs

We learned from Method 1 that we cannot directly compare the empirical and theoretical pdfs.

Can we compare CDFs?

Procedure:

First we need the CDF of the samples:

  1. Sort the samples XS(1), XS(2), ..., XS(n) in ascending order

  2. The empirical distribution function F_S(x) could be the proportion of samples that are smaller than x:
            F_S(XS(i)) = i/n
    

  3. But i/n equals 1 at i=n, while most theoretical distributions approach 1 only as x goes to infinity. Thus use:
            F_S(XS(i)) = (i-.5)/n
    

    The advantage is that the .5/n term goes to zero as n approaches infinity.
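The steps above can be sketched as:

```python
def empirical_cdf(samples):
    """Return sorted (x, value) pairs using the rule value = (i - 0.5) / n."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]

print(empirical_cdf([3.0, 1.0, 2.0]))
# [(1.0, 0.16666...), (2.0, 0.5), (3.0, 0.83333...)]
```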

Illustration:

Advantage:

More accurate than histograms, because there's no free parameter to choose (i.e., bin size)

Disadvantage:

So we can now compare two curves, for theoretical and empirical CDF. But comparing "S" curves is hard for the human eye!

Method 3: QQ Plots

Isn't there anything that a human can easily compare "by eye"?

Yes! Humans can see whether points lie on a straight line pretty easily!

So we need a distribution-dependent transform that maps samples to new x,y graph points, such that the points are on a straight line if they fit the distribution.

What can we use as a transform?

Instead of comparing two curves (e.g., two CDFs), plot one curve.

Procedure:

  1. Sort the observed samples: XS(1), XS(2), ..., XS(n)

  2. Plot following graph:

    y coordinate:

    samples: XS(1), XS(2), ..., XS(n)

    x coordinate:

    Find the quantile q(i) to which XS(i) corresponds (for 1 <= i <= n); then plot the q(i)-quantile of the theoretical distribution (call it XM(i)).

Illustration:

Example [Jain, Table 12.5]:

Consider samples XS(i) in table below:

i	XS(i)
================
1	-0.19
2	-0.14
3	-0.09
4	-0.04
5	 0.04
6	 0.09
7	 0.14
8	 0.19
================

Now compute the empirical quantile to which each XS(i) corresponds:

i	XS(i)	q(i)=(i-.5)/n
================================
1	-0.19	.06
2	-0.14	.19
3	-0.09	.32
4	-0.04	.44
5	 0.04	.56
6	 0.09	.69
7	 0.14	.81
8	 0.19	.94
================================

Finally, use a theoretical distribution to compute XM(i). You can find formulas for computing quantiles in a statistics book. Let's try a Normal distribution, since it's the most important for experiment design. Here is an approximation formula for the standard Normal quantile:

XM(i) = 4.91 [ q(i)**0.14 - (1 - q(i))**0.14 ]

Using the formula:

i	XS(i)	q(i)=(i-.5)/n	XM(i)
========================================
1	-0.19	.06		-1.54
2	-0.14	.19		-0.89
3	-0.09	.32		-0.49
4	-0.04	.44		-0.16
5	 0.04	.56		 0.16
6	 0.09	.69		 0.49
7	 0.14	.81		 0.89
8	 0.19	.94		 1.54
========================================
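The XM column of the table can be reproduced directly from the approximation formula; a short Python sketch:

```python
def normal_quantile(q):
    """Approximate standard Normal q-quantile: 4.91 * (q**0.14 - (1-q)**0.14)."""
    return 4.91 * (q ** 0.14 - (1 - q) ** 0.14)

n = 8
for i in range(1, n + 1):
    q = (i - 0.5) / n
    print(i, q, round(normal_quantile(q), 2))
```

Rounded to two decimals, the computed XM values are -1.54, -0.89, -0.49, -0.16, 0.16, 0.49, 0.89, 1.54, matching the table.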

Here's the resultant QQ plot:

Notes:

  1. Normal distribution parameters mu and sigma are the y-intercept and slope, respectively. Note that mu=0 in this example.

  2. If the line does not have y-intercept zero and slope one, then F_S != F_M. If the line has slope one (and non-zero y-intercept), then the difference XM(i)-XS(i) is always constant, and thus F_M is a translation of distribution F_S. If the y-intercept is zero (but the line has slope other than one), then F_M is a dilation of F_S.

  3. Even if F_S=F_M, the graph will not be a straight line if you have too few samples!

  4. See Fig. 12.6 in Jain for examples of how to interpret QQ plots that do not contain straight lines.

Method 4: Statistical Tests

Another approach to deciding whether F_M(x) = F_S(x) is to use a statistical test, called a goodness-of-fit test. The advantage of a test is that it gives a binary true/false answer to the hypothesis that F_M(x) = F_S(x). The disadvantage is that the answer might (with low probability) be wrong!

See Law and Kelton for some common goodness-of-fit tests: Chi-Square, Kolmogorov-Smirnov, Anderson-Darling, ...
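As an illustration (a sketch, not Law and Kelton's full procedure), the Kolmogorov-Smirnov statistic, the largest vertical gap between F_S and F_M, can be computed in a few lines; the critical values needed for the accept/reject decision are omitted:

```python
def ks_statistic(samples, model_cdf):
    """Largest vertical distance between the empirical CDF and the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fm = model_cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - fm, fm - (i - 1) / n)
    return d

# Compare four samples against a Uniform(0,1) model, whose CDF is F_M(x) = x.
print(ks_statistic([0.1, 0.4, 0.6, 0.9], lambda x: x))   # ~0.15
```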

Method 5: Use a Computer Tool

The ExpertFit software tool (available on NEXTSTEP-based PCs in the McBryde 118 lab) is a wonderful aid for fitting distributions to data. It was written by Law and Kelton.

Index of Dispersion

Review section 12.8 of Jain, along with Fig. 12.4, on statistics that specify the variability of data.


Last modified on 11 September 1996.

Please send inquiries and comments to abrams@vt.edu.