(prepared by Madhan Subhas and Edward A. Fox)
In this exercise, you will work with the incremental cover-coefficient-based clustering algorithm. This is a partition-based, non-hierarchical clustering algorithm, and it produces statistically valid clusters comparable to those of reclustering algorithms. An added advantage of this scheme is that the number of clusters can be predicted using the cover coefficient concept. This method of prediction agrees with the hypothesis that the number of clusters within a document collection should be high if the individual documents are dissimilar, and low otherwise. The order of document addition does not affect the outcome of the clustering process.
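To make the prediction of the number of clusters concrete, here is a minimal Python sketch of the estimate as it is usually stated for the cover coefficient concept: each document i gets a decoupling coefficient delta_i = alpha_i * sum_k (d_ik * beta_k * d_ik), where alpha_i and beta_k are the reciprocals of the row and column sums of the document-by-term matrix, and the estimated number of clusters is the sum of the delta_i. The function name and the small input matrices are illustrative assumptions; the programs used in this exercise may compute the estimate differently.

```python
def estimate_cluster_count(doc_term):
    """Estimate the number of clusters from a document-by-term matrix.

    doc_term is a list of rows (one per document) of non-negative term
    weights. This is an illustrative sketch, not the code run on "video".
    """
    if not doc_term:
        return 0.0
    n_terms = len(doc_term[0])
    # Column sums: total weight of each term over all documents.
    col_sums = [sum(row[k] for row in doc_term) for k in range(n_terms)]
    total = 0.0
    for row in doc_term:
        row_sum = sum(row)
        if row_sum == 0:
            raise ValueError("every document needs at least one term")
        # Decoupling coefficient: how much the document "covers" itself.
        delta = sum(d * d / col_sums[k] for k, d in enumerate(row) if d) / row_sum
        total += delta
    return total


if __name__ == "__main__":
    identical = [[1, 1], [1, 1], [1, 1]]
    disjoint = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(estimate_cluster_count(identical))
    print(estimate_cluster_count(disjoint))
```

With three identical documents the estimate is approximately 1, and with three documents that share no terms it is 3.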
Virginia Tech has rights to experiment with ACM's "Hypertext Compendium" edited by Rob Akscyn. Later in the course, in the hypertext unit, you will be encouraged to work with Rob to add other works into the Compendium, using the KMS hypertext system. For now, you can use ASCII-only versions of the Compendium articles for experiments with clustering. These files are stored on video.cs.vt.edu and are copyrighted by ACM, so do not copy them from that machine. They are in directory /u3/PRC/compendium and have names of form "htc*.txt".
This exercise will be carried out on video.cs.vt.edu since you will use software that is licensed only to that machine. Also, the clustering programs have been developed for that environment.
You should be able to carry out the exercise with any type of "telnet" connection, and do not need the X Window System. However, it is desirable that at least one person in each group carrying this out has some experience with UNIX.
1. Create files in your home directory for this work.
Run the following commands:
cd;mkdir fultext;mkdir fultext/collection
The directories ~/fultext and ~/fultext/collection will contain the actual files generated by the clustering routines.
2. Create the right environment for the programs.
First, you must include /u3/PRC/bin in your path. This directory contains all the executable programs written for clustering on "video". Second, you must set environment variables before you run the programs. You can do both by editing your ".login" file. A simpler approach is to copy the instructor's file with a command like:
cp ~fox/.login ~
and then log in again for the changes to take effect.
You need to run three programs to get the required result. They should be executed in the order given below:
1. xircoll collection_name
Command Description: Creates a new document collection.
Parameter Description:
collection_name - Name of the collection that is to be created.
N.B.: Long names are not supported.
Example: xircoll info
This command creates a collection called "info". As a result, you can see some of the system files in your ~/fultext and ~/fultext/collection directories.
2. xiradd collection_name document_name doc_key.doc_master=doc_id granularity
Command Description: Adds a document to a document collection.
Parameter Description:
collection_name - Collection to which the document is to be added.
document_name - Name of the document that is to be added to the collection.
doc_id - A unique integer for this document. Any value will do as long as it is not reused.
granularity - Specifies the level of indexing. This parameter can take one of three values.
Example:
xiradd info htc1.txt doc_key.doc_master=1 2
xiradd info htc2.txt doc_key.doc_master=2 2
xiradd info htc3.txt doc_key.doc_master=3 2
In this exercise, sentence-level indexing is preferred. Note that this means vectors are created for sentences as well as for the larger structures. Please add at least three documents before you go on to the next step. This works for documents in the /u3/PRC/compendium directory, so pick any three documents in that directory.
3. xirindex collection_name index_flag delete_flag similarity
Command Description: Indexes a collection and creates the clusters.
Parameter Description: None of the parameters are related to the clustering process.
Form that You Should Always Use: xirindex collection_name 1 1 0.25 > out
Example: xirindex info 1 1 0.25 > out
This command may take a while to complete, so please be patient!
Also, don't worry about the "out" file. It can be renamed as you wish, but some file is required to receive the output of the indexing run.
A log of the indexing operations, giving statistics, is found in the file ~/fultext/info.log. Please report to the instructor the number of documents processed, which should be on the 2nd line of that file.
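The three steps above can be summarized as a command list. The following Python sketch only builds the command lines as strings, using the syntax given in this handout; it does not run anything. The function name, the default granularity of 2 (taken from the xiradd example), and the numbering of documents from 1 are assumptions made for illustration.

```python
def clustering_commands(collection, documents, granularity=2, out_file="out"):
    """Return the command lines for one clustering run, in execution order."""
    if len(documents) < 3:
        raise ValueError("add at least three documents before indexing")
    # Step 1: create the collection.
    commands = [f"xircoll {collection}"]
    # Step 2: add each document with a unique integer id.
    for doc_id, name in enumerate(documents, start=1):
        commands.append(
            f"xiradd {collection} {name} doc_key.doc_master={doc_id} {granularity}"
        )
    # Step 3: index and cluster, using the fixed form given in the handout.
    commands.append(f"xirindex {collection} 1 1 0.25 > {out_file}")
    return commands


if __name__ == "__main__":
    for line in clustering_commands("info", ["htc1.txt", "htc2.txt", "htc3.txt"]):
        print(line)
```

Printing the list first makes it easy to check that every doc_id is unique before typing the commands on "video".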
You are required to study the following files in the ~/fultext/collection directory.
1. collection_name.term - This is the dictionary file. The file is in the following format:
Term Term Frequency Term Term Frequency
The term numbers are implicit from the order. The first term is given the term number 1.
2. collection_name.vect_ascii - Contains the actual vectors. Depending on the indexing level, it holds vectors for documents, paragraphs, and sentences. The format of this file is:
Document ID Term_id Weight
The weights are normalized and they range from 0 to 1.
3. collection_name.clust_ascii - Contains the actual clustering information. The format of this file is as follows.
Cluster Seed
Non-seed
Non-seed
Cluster Seed
....
....
Note that the rag-bag cluster holds vectors that do not "fit" into any of the other clusters.
4. collection_name.id_ascii - Contains the IDs of the documents, paragraphs, and sentences, each followed by the actual text. This file is large, but it will help you understand the clustering better.
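When studying the vector file (item 2 above), it can help to load the vectors and compare a few of them by hand. The Python sketch below assumes that each line of the file holds one whitespace-separated "Document ID Term_id Weight" triple; that layout is an assumption based on the format description, so check it against the real file. The cosine measure is used only as a familiar way to compare two vectors, not as a statement about how xirindex computes similarity, and the sample lines are made up.

```python
import math
from collections import defaultdict


def parse_vectors(lines):
    """Group (doc_id, term_id, weight) triples into sparse vectors per doc_id."""
    vectors = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines or lines in an unexpected layout
        doc_id, term_id, weight = fields
        vectors[doc_id][int(term_id)] = float(weight)
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term_id: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    sample = ["1 1 0.5", "1 2 0.5", "2 1 1.0", "3 3 1.0"]  # hypothetical data
    vecs = parse_vectors(sample)
    print(cosine(vecs["1"], vecs["2"]))
    print(cosine(vecs["1"], vecs["3"]))
```

To use it on a real file, pass an open file object to parse_vectors, for example open("info.vect_ascii") if the collection is named "info".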