Keywords: Tools, networking and communication, workload modeling, education.
What is the nature of distributed multimedia network traffic - that is, traffic between clients and servers over a communication network? Are the traffic characteristics generated by different users at different times entirely unpredictable, or can one characterize multimedia traffic?
Using the tool described in this paper - CHITRA95 - one can visualize, query, statistically analyze and test, transform, and model collections of traces of client or server requests and replies. The domain used as a case study in this paper to illustrate tool use is educational. In this domain, students use a workstation-equipped classroom for a course on multimedia and retrieve materials from the World Wide Web (WWW) during class. Outside of class, students retrieve course materials on demand from a department-wide server. The material includes articles, video, pictures, and other materials from a digital library that includes IEEE and ACM papers used in courses.
The analysis of multimedia network traffic can answer a variety of questions. These questions are addressed specifically for WWW traffic in the Sample Analysis... section.
A first class of questions addresses how users (clients) make use of multimedia material through a network. What mixes of media types and distributions of file sizes are accessed by users? What file size distributions arise for each media type? How many servers are accessed by clients and with what distribution? Do different groups (e.g., two sections of a college course delivered over a network) of users have similar usage characteristics?
A second class of questions addresses the requests that a particular server receives. What is the time between successive accesses to a file in a server (of interest to cache designers)? What is the distribution of file sizes and media types requested?
A third class of questions addresses how to best use mechanisms to improve network performance, such as caching and prefetching in a network. A variety of caching mechanisms have been proposed (e.g., [13][9][7][5]). Pitkow and Recker, in proposing a cache policy, state, ``Surprisingly, many of these [caching algorithms] lack an empirical or mathematical basis.'' Are there patterns of access that users commonly perform that can be exploited by algorithms that cache or even prefetch information? Are there patterns of user access that permit predictive prefetching? How does cache performance vary with cache size? What is the distribution of the number of times that a file is accessed each day (suggesting a ``working set size'' for a user in a cache)? What performance improvement do users experience with caching?
The most immediate need for multimedia analysis tools is in the WWW due to its rapid growth and widespread use. Tools to analyze WWW trace files of server traffic include wwwstat [3], gwstat [8], and WEBVIZ [12].
The tool wwwstat can generate the following reports about a log from an NCSA httpd server: the number of requests the server receives and the number of bytes in the server responses. Reports can be broken down by day, by hour, by client country, by client domain, or by Uniform Resource Locator (URL). wwwstat can also process queries about the server log, in that it can limit its reports using regular expressions over hostnames, IP addresses, server response codes, dates, hours of the day, and URLs. Graphic presentation of wwwstat output in the form of histograms is provided by gwstat.
Our tool (CHITRA95) can produce essentially the same reports and histograms as wwwstat and gwstat. In addition, our tool provides various time-dependent analyses of trace data, analyzes cache hit rates, uses tcpdump traces [14, Ch. 21, App. A] to analyze URL response times, offers several visualizations rather than just histograms, provides transforms to reduce and aggregate the trace data, analyzes client traffic traces, analyzes ensembles of traces rather than just a single trace, performs statistical tests to determine whether different traces are homogeneous, and generates workload models.
WEBVIZ views the files in the WWW as a database, and visualizes the structure as nodes in a graph with edges representing hypertext links. The frequency and recency of file accesses from an NCSA http log are indicated by color and line thickness in the graph. Queries are supported to restrict what nodes are shown (e.g., clients in a particular domain). CHITRA95 does not provide a view like this.
The case study here has three distinctions from past studies. First, we not only examine the requests that one specific server receives from any client, but also the complement: the requests that any server receives from one specific client. Second, we compare the traffic characteristics of different groups of users (e.g., classroom users versus undergraduate or graduate lab users). Third, we also collect hit rates and use tcpdump to measure improvements in transfer rate due to caching.
This paper is organized as follows. The Requirements section lists requirements for a trace analysis tool and discusses how CHITRA95 addresses these. A workload experiment is described in the Example:... section, and then questions from each question class listed above are addressed for WWW traffic in the Sample Analysis... section. The key conclusions from the analysis are in Conclusions.
We believe that a tool to analyze network traffic traces should provide the following facilities.
It should be possible for a person to use the tool for five minutes and get interesting information about their trace files.
A single trace represents the actions occurring for a single period of observation which may or may not be representative of other observation periods. User behavior can vary dramatically. If a trace analysis tool is to provide robust information to network designers that is likely to characterize user behavior over a variety of conditions, then the tool must analyze not one but a set, or ensemble of traces. The ability to handle ensembles is essential to being able to compare and contrast different types of workloads.
The tool should answer queries about traces that have been collected so that qualitative questions about user behavior can be posed as queries. For example: Does the traffic on all Sundays during a semester look similar? How do the access patterns of students in dormitories differ from users of lab-based computers? Which servers on campus appear to be saturated in their ability to serve client requests?
Visualization can summarize large amounts of trace data - for example, histograms of the sizes and types of files accessed. However, visualization facilities in a tool should also represent time-dependent behavior in a system, such as the variation of requests to a server with time of day or day of the week.
Common tests answer whether a set of traces is homogeneous, for example whether the traces are likely to have been drawn from the same distribution. This permits conclusions such as: Does one group of users with caching see a significantly different response time than a second group without caching? Do two user groups represent the same workload?
Network designers require traffic models to use in analytic modeling and simulation. Thus the ability to fit a model to trace data is essential to generating a workload model.
Common patterns could indicate the need to redesign the interface to a multimedia application to streamline the operations. Patterns could be exploited in data layout on server disks or when designing cache policies.
Transform methods can reduce the volume of trace data to speed analysis. For example, suitably chosen transforms can remove ``noise'' from a trace to reveal patterns or exceptions in traced behavior, or filter out accesses to certain types of media.
A tool should be user extensible in a language of the user's choice so that a user can codify commonly used analysis procedures to save time, or even write new analysis or visualization modules. There should be an easy way for tool users to share new modules in the form of a library.
CHITRA95 is the third generation of a system to analyze trace data from computer and communication networks [2][1]. CHITRA95 takes as input ensembles of trace files. A trace file is a sequence of ordered pairs (t_i, e_i), where t_1, t_2, ... is an ascending sequence of numbers representing some index of time, such as timestamps, and e_1, e_2, ... is a sequence of events. An event is an ordered tuple representing some actions of interest during observation of a system. For CERN or NCSA WWW server (http) log files, each t_i (for i = 1, 2, ...) represents a timestamp in the log, each e_i represents a GET performed by a client, and the ordered tuple contains three components: client name, URL being requested, and file size.
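Concretely, an httpd common-log entry maps onto such a pair. The following sketch converts log lines into (t_i, e_i) pairs; the regular expression, field layout, and sample line are illustrative, not CHITRA95's actual parser:

```python
import re
from datetime import datetime

# NCSA common log format: host ident authuser [date] "request" status bytes
# (this pattern and the sample line below are illustrative assumptions)
LOG_RE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"GET (?P<url>\S+)[^"]*" \d+ (?P<size>\d+)')

def parse_trace(lines):
    """Map httpd log lines to ordered (t_i, e_i) pairs, where e_i is the
    tuple (client name, URL requested, file size)."""
    trace = []
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip malformed or non-GET entries
        t = datetime.strptime(m.group('ts'), '%d/%b/%Y:%H:%M:%S %z')
        trace.append((t.timestamp(),
                      (m.group('client'), m.group('url'),
                       int(m.group('size')))))
    trace.sort(key=lambda pair: pair[0])  # enforce an ascending time index
    return trace

log = ['lab1.cs.vt.edu - - [11/Jan/1995:09:15:02 -0500] '
       '"GET /courses.html HTTP/1.0" 200 512']
print(parse_trace(log))
```

Once traces are in this uniform pair representation, the same queries, transforms, and visualizations can be applied regardless of which server produced the log.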
CHITRA95 provides a library of, at present, 60 commands, each of which visualizes, queries, statistically analyzes or tests, transforms, or models trace data. CHITRA95 is operating system and graphical user interface (GUI) independent. At present there is an automatically generated X-windows interface to the toolkit to provide an integrated analysis tool. A forms interface to permit remote use through the WWW is being implemented.
To satisfy ease of use, CHITRA95 contains a mega-command that invokes many interesting and commonly used analysis and visualization commands on a collection of log files, which each may be either a CERN or NCSA server log or a tcpdump log. The mega-command generates a collection of visualizations and summary statistics on a monitor or writes a file in one of a variety of formats (e.g., PostScript, GIF) and tables of summary statistics.
For each tcpdump log, the mega-command produces the following analysis of file transfers: statistics, such as mean, min, max, and standard deviation of file transfer rate; histogram of transfer rate; transfer rate by source or source-class; transfer rate by destination or destination-class; and a graph of server traffic load versus time. For each CERN httpd [10] trace, including a trace from httpd configured to run as a proxy [9], the mega-command produces several results. These include statistics, such as total number of requests (e.g., URL GETs); histogram of the number of requests per client or client-class; histogram of file types; histogram of file sizes; histogram of request destinations, showing the number of requests that go to each server or server-class; a time-dependent graph of the rate of cache hits and misses; and visualization of the time since last access to a file, as a scatter plot versus time and as a histogram over all time.
We are currently implementing a World Wide Web forms interface to CHITRA95, so that a user need not even take the time to install a copy of CHITRA95 on their machine. The user can electronically mail trace data to a WWW server running CHITRA95, then open a URL that contains a form to select the types of analysis desired, and then view the results of analysis as a set of dynamically-created Web pages containing visualizations and tabular data. With the user's permission, the traces and created pages remain on the server, so that the server acts as an archive of the traffic characteristics of various sites that submit traces. This would facilitate studies that characterize traffic from many sites, rather than from a single site. If the user finds the initial results useful, then they can invest more time to install and learn how to use the full set of CHITRA95 commands to do more detailed analyses.
CHITRA95 allows an ensemble of traces to be operated on as a unit, so that the same analysis, transform, and modeling can be performed on all traces in the ensemble. Some visualization methods combine the data from all traces in an ensemble into one graph; others graph data from each trace separately. We routinely handle ensembles with thousands of short traces.
CHITRA95 provides each of these through a set of queries that retrieves data from trace files matching a query criterion and transforms that apply a function to map the matched data to a new form. For example, there is a query to identify all patterns in an ensemble of trace data, and a transform to replace a pattern by an aggregate representation of the data to simplify the trace data. Pattern matching can identify common user access patterns. There is a query to identify the most often occurring components in trace data, and a transform to eliminate all but these components from the trace data for further analysis. There are also transforms to remove all but certain components in the trace data vector, to perform arithmetic transforms such as scaling the range of vector component values, and to aggregate vector component values (useful to group server ids into categories).
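The query/transform pairing can be pictured as two higher-order operations over a trace. This sketch is illustrative only; the function names and event layout are assumptions, not CHITRA95's interface:

```python
def query(trace, predicate):
    """Retrieve the (time, event) pairs whose event matches a criterion."""
    return [(t, e) for t, e in trace if predicate(e)]

def transform(trace, fn):
    """Apply a function to every event, e.g. to aggregate components."""
    return [(t, fn(e)) for t, e in trace]

# Each event is (client, URL, size), following the trace-file definition.
trace = [(0.0, ('pc1', '/a.gif', 400)),
         (1.5, ('pc2', '/b.html', 2000)),
         (2.0, ('pc1', '/c.gif', 300))]

# Query: keep only GIF requests (a media-type filter).
gifs = query(trace, lambda e: e[1].endswith('.gif'))

# Transform: aggregate client ids into a category (useful for grouping
# client or server ids into classes before further analysis).
by_class = transform(gifs, lambda e: ('classroom', e[1], e[2]))
print(by_class)
```

Chaining a query with a transform in this way is how filtering, scaling, and aggregation can each be expressed as a pass over the trace.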
The visualization methods in CHITRA95 include two-dimensional plots, such as scatter plots and Gantt charts, histograms, periodograms, and correlograms. The tool is unique in its ability to visualize and model categorical time series data. (Categorical data has no total ordering among its values; examples include hostnames, domain names, and URLs. In contrast, numerical data has a total ordering; examples include the sets of integer and real numbers.) Categorical data - but not as a time series - arises in wwwstat, and the histograms for categorical data (e.g., client domain, client sub-domain, and URL) are sorted by the number of requests so that the histogram bars are non-decreasing. CHITRA95 provides novel visualizations to represent the time evolution of categorical data.
The primary test in CHITRA95 is the Kruskal-Wallis (KW) rank sum test [Ott, pp. 422-425], which tests the hypothesis that two or more samples are drawn from the same distribution. The implementation includes the correction for the case when the samples include many ties. The KW test is used because it makes no assumption about the underlying distribution of the data. The KW test is also used in another test for stationarity (as defined for stochastic processes) of trace data, to identify whether one segment of a trace file ``looks like'' another segment.
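For readers who want to reproduce the homogeneity tests, the KW H statistic with tie correction can be computed as follows. This is a minimal sketch of the standard statistic, not CHITRA95's implementation; deriving a p-value from the chi-square approximation is omitted:

```python
def kruskal_wallis(*samples):
    """Kruskal-Wallis H statistic with correction for ties.

    Tests whether the samples are drawn from the same distribution;
    larger H means stronger evidence against homogeneity."""
    pooled = sorted(x for s in samples for x in s)
    n = len(pooled)
    # Midranks: tied values share the average of the ranks they occupy.
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank[x] for x in s) ** 2 / len(s) for s in samples
    ) - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t
                         for t in counts.values()) / float(n ** 3 - n)
    return h / correction if correction else h

# Hypothetical file sizes (bytes) requested by two halves of a class:
print(round(kruskal_wallis([4100, 4200, 3900], [5300, 5500, 5200]), 3))
```

Because the statistic is computed from ranks alone, it needs no assumption about the underlying file-size or transfer-rate distribution, which is why a rank-based test suits this kind of traffic data.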
CHITRA95 can generate one of three types of workload models: a semi-Markov process, a model based on the Chi-square Automatic Interaction Detection (CHAID) procedure [6] to find events in traces that are likely to occur in combination with other events, and a novel time-dependent stochastic process.
CHITRA95 is a toolkit that consists of a set of small programs that communicate through a standard, self-describing file format. The programs form a library to which a user can add new programs. Therefore a user can extend the toolkit by writing a script that combines existing modules to codify analysis procedures, or by writing a new program to add a command to the CHITRA95 library. In either case the user can use any language of their choice.
We use as a case study traces from World Wide Web clients representing three classes of educational traffic at Virginia Tech. The Computer Science Department provides materials for thirteen courses through the WWW (accessible from http://ei.cs.vt.edu/courses.html). Four of these classes are ``paperless'' - all course material is delivered during class using the Web, students retrieve all assignments from the Web, and in one class students take exams and quizzes through the Web. In several other courses the Web is used for lecture delivery during class. In all classes the Web serves as a repository for items such as lecture notes, assignments, course syllabi, and links to departmental documents (e.g., honor code description, computer use handbooks). Students retrieve the course material from networked workstations during class; from campus computer labs; from dormitories with SLIP or Ethernet connections; and, through the Blacksburg Electronic Village, from campus apartments or homes through Ethernet or SLIP over 14.4 kbit/sec modems.
The course use of the Web at Virginia Tech is targeted to exploit multimedia, primarily through a project to create a digital library that allows students to obtain on-line copies of papers used in class [4]. Other multimedia types of class material include scanned images of diagrams, photographs, and links to servers outside of the Virginia Tech campus.
If the use of network delivery grows in popularity on college campuses, then a significant amount of campus network and even Internet related traffic will be education-related. Therefore in this paper we use CHITRA95 to report on preliminary analysis of four classes of educational workloads at Virginia Tech: (1) classroom access by students that each use a network-connected computer during a class on Multimedia; (2) undergraduate access by students in an undergraduate computer lab; (3) graduate access by students in the graduate computer lab; and (4) accesses to the main server for educational materials (host ei.cs.vt.edu).
The following equipment is used. In the classroom workload (1), each client is Netscape running on a 10baseT Ethernet-connected Apple PowerPC 6100/60 AV, and the proxy server runs on a thin-net connected DECstation 3000. In the undergraduate (2) and graduate (3) lab workloads, the clients are Netscape and Mosaic and run on DECstation 3000s or 5000s, the proxy server runs on a DECstation 5000, and all machines are connected by thin-net Ethernet. In the server workload (4), the server is a DECstation 3000 connected to thin-net. The 10baseT and thin-net networks are inter-connected by an FDDI network that is then connected to the Internet by a T1 link (soon to be upgraded).
The objective of our study is to characterize and compare the four workloads, to identify the performance improvement possible by caching off-campus URL gets, to characterize the mean time between accesses to the same URL, and to qualitatively characterize the predictability of URL gets.
We vary two factors during this study: the size of cache used on a proxy server, and the type of workload (i.e., (1) to (4) described above).
Studies in the literature and statistics available on the Internet report on server traces, rather than client traces. That is because it is easy to collect a server trace: just turn tracing on at a server. The resultant log records which clients used that one specific server.
However it is hard to collect traces of client behavior that record how users access any network server. Either one must collect a trace at every host supporting a Web browser client, or one must ask all clients to use a certain proxy server and then record a log on the proxy server. We chose the latter route. On multi-user student machines, we created modified versions of the Mosaic and Netscape commands that caused clients to use one of several proxy servers so that we could record client behavior.
To conduct the experiment, we first installed CERN proxy servers to cache off-campus URLs, and we enabled logging. Therefore the proxy server log traces all URL accesses by clients that have their proxy server set to our proxy server, whether they are actually cached or not.
For the classroom workload, we used two proxy servers. The undergraduate and graduate workloads were each assigned to a unique proxy server. The workstations in the classroom were divided into two groups of 12 machines, and each group was assigned to one server. (The division was done in a way to avoid a bias, such as machines in the front of the room in one group, because there might be a correlation of student participation in class with the seat location.) The use of two groups allows comparison of different cache sizes for similar workloads - students in the same class at the same time.
We also installed tcpdump on each proxy server machine to log all packet traffic to or from the server. We varied the cache sizes (using randomization) each day. During the observation period for our study, we collected a complete log of all URL accesses by each client for every day of the Spring 1995 semester. In this paper, we report on a portion of the Spring semester. Finally, we analyzed the collected traces with CHITRA95.
One problem with the CERN proxy server is that it cannot automatically invalidate outdated files in the cache unless an expiration date in the file is set. In addition, there is no way for Mosaic and Netscape clients to force a cache reload. Therefore, to avoid user complaints about outdated file copies, we configured proxy servers to force file expiration after 24 hours. Furthermore, we cached only off-campus accesses, because students that are accessing course-related Web pages require the latest copy of a page. In addition, caching on-campus accesses would have questionable benefit, given that the time for a client to read the cache is not much different than the time to reach the true server on-campus.
First consider the distribution of file sizes. The histogram in Fig. 1 represents one day of all client requests in the undergraduate lab workload. The histogram suggests an exponential distribution, with the vast majority of files transferred being less than 1kB. The shape of the graph and the 1kB figure are generally representative of all three workloads. However, occasionally another distribution is superimposed upon the general distribution, as shown in Fig. 2; the long bar at 17000 bytes is due to a single URL that was accessed repeatedly within a single day. This illustration comes from the undergraduate workload.
Analysis of media type shows different distributions for the different workloads. For both undergraduate and graduate browsing, the distribution of media types manifested similar trends. First, the number of GIF files far outnumbered other file types (Fig. 3). Second, the number of requests for GIF files was approximately twice the number for HTML files. This fact, along with the file size information, leads us to believe that the majority of HTML files accessed in a browsing environment have inline graphic images. A histogram of the sizes of GIF files for one day of the graduate workload is shown in Fig. 4; the histograms for other days and for days in the undergraduate workload appear similar. The histogram shows that the vast majority of GIF images are under 500 bytes, probably representing buttons or icons. We conjectured that the multimedia class would access HTML files with more GIF files than did the other workloads, but measurement showed the opposite trend (which turns out to relate to instructor authoring habits). In the majority of days of the multimedia class, the number of HTML file requests was approximately twice the number of GIF requests (Fig. 5); however, many more audio files (labeled AU) were requested than in the other workloads.
Finally, the distribution of servers accessed by the multimedia class for one day is shown in Fig. 6. The distribution suggested by the histogram is representative of other days in the multimedia class. The graduate and undergraduate lab workloads tended to have a few long histogram bars, showing that a small number of servers were accessed much more often than average.
In this section we analyze the NCSA httpd log from the main server machine used for educational course materials in Computer Science at Virginia Tech, named ei.cs.vt.edu. The log file contains data for the period of 11-January through 27-March 1995. A total of 229,256 files were accessed by 2522 different clients during that time period. Plotting filesize distribution (not shown) indicates the same exponential distribution as the client logs with most of the files under 1000 bytes. The file type distribution is very near to that of the multimedia class workload with respect to the proportion of HTML files to others and the proportion of HTML to GIF files (about twice as many HTML files as GIF files).
The log file is visualized in Figure 7 using a periodogram unique to CHITRA95 for categorical time series data. The x-axis represents time; 11 January is day 0 and 27 March is near the right end of the axis. Each graph point represents a client request. The y-axis value of a point is the recency, or time since the file requested was last referenced. Moving across the graph along the x-axis, there are vertical bands of white representing the early morning period of each day when no one was accessing the server. The completely white vertical band near day 65 corresponds to a period when the server went down for installation of a peripheral. The density of graph points is greater below a recency of about one hour than above it. Also there is a second density decline between 10 and 100 hours. These breakpoints are indicated by Fig. 8, which shows the data from Fig. 7 as a histogram of the distribution of last-access time through the eleven-week period. The first histogram bar is significantly higher than the rest, showing that the inter-access time of a file is most often on the order of tens of minutes. The height of the histogram bars then slowly declines until about 24 hours on the x-axis. After this, between 24 and 28 hours, there is an increase in the number of accesses, perhaps because students work at the same time each day, and thus are likely to reaccess the same URL after 24 hours elapse. This suggests that a cache policy for this server should remove files after either about 30 minutes or about 28 hours.
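The recency values plotted in Fig. 7 and binned in Fig. 8 can be derived from a trace in a single pass. A minimal sketch, assuming the (time, URL) trace layout from the trace-file definition earlier:

```python
def recency_series(trace):
    """For each request after the first to a given URL, compute the time
    since that URL was last referenced -- the quantity on the y-axis of
    the Fig. 7 style scatter plot and binned in the Fig. 8 histogram.

    trace: list of (time, url) pairs in ascending time order."""
    last_seen = {}
    points = []
    for t, url in trace:
        if url in last_seen:
            points.append((t, t - last_seen[url]))  # (time, recency)
        last_seen[url] = t
    return points

# Hypothetical trace (times in hours):
trace = [(0, '/a'), (10, '/b'), (30, '/a'), (31, '/a'), (100, '/b')]
print(recency_series(trace))
```

Feeding the second components of these points into a histogram with time-scaled bins reproduces the inter-access-time distribution from which the 30-minute and 28-hour cache breakpoints are read.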
Our conjecture before measurement was that caching client requests to off-campus servers should produce a larger performance improvement for the multimedia class, where students tend to request the same files for their workstations, than for the undergraduate and graduate browsing workloads, where client requests would not be likely to follow any patterns.
Table 1: Transfer rates, in bytes/second, for the multimedia class workload. n denotes sample size; s̄ denotes the mean file size transferred. Two hypotheses are tested: that the file sizes requested by the two halves of the class are drawn from the same distribution, and that the resultant transfer rates are drawn from the same distribution. The Kruskal-Wallis test, with significance level α and correction for ties, is used for the hypothesis tests.
The method used for data collection was to analyze tcpdump logs recording packets received by and sent from the ports assigned to the proxy servers for each workload. A CHITRA95 module first converts tcpdump logs into a sequence of transfer rates. Another CHITRA95 module is then used to compute basic statistics, such as mean transfer rate. Table 1 represents four days of multimedia classes. Each row corresponds to one day. Because the workstations in a class are partitioned into two halves assigned to different proxy servers, two transfer rates (with different cache sizes) in the table correspond to each day. The transfer rates are higher with caching for three days and lower for the fourth day. On three of the days, the homogeneity hypothesis that the file sizes of URLs requested by the two halves of the class are drawn from the same distribution is accepted (using CHITRA95's KW test), giving evidence that on these days the two halves of the class generated the same workload with respect to file size. On the other hand, on these same three days the hypothesis that the transfer rates for the halves of the class that did and did not use caching are drawn from the same distribution is also accepted, giving evidence that there is no improvement in performance due to caching. The only significant difference in transfer rate occurs on the fourth day, when the rate decreased with caching; but on that day the hypothesis that the two halves generated the same workload is also rejected.
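The conversion from transfers to the summary statistics reported in Table 1 can be sketched as follows. The tcpdump parsing step is omitted; we assume each transfer has already been reduced to a (bytes, elapsed-seconds) record, which is an assumption about the intermediate format rather than CHITRA95's actual module interface:

```python
def transfer_rates(transfers):
    """Map (bytes transferred, elapsed seconds) records -- e.g. file
    transfers reconstructed from a tcpdump log -- to rates in bytes/second."""
    return [b / s for b, s in transfers if s > 0]

def summary(rates):
    """Basic statistics of the kind reported in Table 1."""
    n = len(rates)
    mean = sum(rates) / n
    var = sum((r - mean) ** 2 for r in rates) / (n - 1) if n > 1 else 0.0
    return {'n': n, 'mean': mean, 'min': min(rates),
            'max': max(rates), 'stddev': var ** 0.5}

# Hypothetical transfers: (bytes, seconds)
rates = transfer_rates([(17000, 2.5), (512, 0.1), (4096, 0.6)])
print(summary(rates))
```

Computing one such rate list per proxy server (with and without caching) yields the two samples per day that the KW test then compares.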
Comparison of transfer rates with no cache and with a 50 Mbyte cache was also performed on the graduate workload. The sample mean of transfer rate was 6288 bytes/second with no caching and 6300 bytes/second with caching. The KW test indicated that the difference in transfer rates is not significant. The mean file size of a URL response was 4122 bytes when no caching was used, and 5376 bytes when caching was used. Because all clients use the same cache on each day, we expected that the workloads for days with and without caching were not homogeneous; applying the KW test on the file sizes of URLs requested indeed rejected the homogeneity hypothesis.
The transfer rates in the classroom workload are lower than those of the graduate workload because in the classroom the client machines are slower than in the labs and the network connection is through bridges. This was confirmed by performing an FTP of a 1 Mbyte file from a classroom PowerPC to the server, yielding a transfer rate of 0.123 Mbytes/second, and from the graduate client host (a DECstation 3000) to the same server, yielding 1.00 Mbytes/second. Also, all classroom machines used the Netscape browser, which does not terminate the network connection until after the entire page has been displayed. The other workloads use both Mosaic and Netscape as browsers.
A primary goal of analyses run on these workloads was to determine the effectiveness of caching WWW pages fetched from outside the university. Visualization of cache hit rate was done on both a cumulative basis and a time basis. It was discovered that the rate of cache hits remains fairly constant throughout the trace (Fig. 9) and therefore the hit-ratio remains constant as well. For the graduate browsing workload the average hit-rate was approximately 17.4%. The classroom workload had an average hit-rate of 34.6%. This indicates that a classroom environment is more likely to benefit from the use of a cache server since many URLs will be accessed by most or all of the students in the class in a short period of time.
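The time-dependent hit-rate view of Fig. 9 amounts to bucketing cache outcomes into fixed time windows. A hedged sketch, where the (time, hit) event representation is an assumption about how proxy-log outcomes are encoded:

```python
def windowed_hit_ratio(events, window):
    """Bucket cache outcomes into fixed-size time windows and return one
    (window start time, hit ratio) point per window -- the kind of
    time-dependent view shown in Fig. 9.

    events: list of (time, hit) pairs, hit being True for a cache hit."""
    if not events:
        return []
    start = events[0][0]
    buckets = {}  # window index -> (hits, total)
    for t, hit in events:
        w = int((t - start) // window)
        hits, total = buckets.get(w, (0, 0))
        buckets[w] = (hits + (1 if hit else 0), total + 1)
    return [(start + w * window, hits / total)
            for w, (hits, total) in sorted(buckets.items())]

# Hypothetical cache outcomes (times in minutes):
events = [(0, True), (5, False), (12, True), (13, True), (25, False)]
print(windowed_hit_ratio(events, 10))
```

A roughly flat sequence of per-window ratios, as observed in the traces, is what makes the single average hit-rate figures (17.4% and 34.6%) meaningful summaries.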
The CHITRA95-based WWW analysis tool is available from http://www.cs.vt.edu/~chitra/www.html. CHITRA95 can analyze any type of trace data, even though we discuss its use only for WWW traces. Our WWW traffic study yielded the following conclusions:
We plan to derive workload models from the entire suite of spring semester traces. Few such models exist in the literature, despite their potential value in succinctly characterizing traffic for multimedia network designers.
Alan Batongbacal wrote much of CHITRA95, with help from Anup Mathur and David Connerley. Carl Harris and Laurie Zirkle helped set up proxy servers to collect trace data.
This work was supported in part through the National Science
Foundation to
Figures
Figure 1: Typical file size distribution.
Figure 2: Outlier for file size distribution.
Figure 3: Media type distribution (undergraduate and graduate lab workloads).
Figure 4: GIF file size distribution.
Figure 5: Media type distribution (multimedia class workload).
Figure 6: Distribution of servers accessed.
Figure 7: Periodogram of server accesses.
Figure 8: Distribution of last-access times.
Figure 9: Time dependent hit and miss rates.
Multimedia Traffic Analysis Using CHITRA95