CS5604 Projects: Suggestions
There are many possible projects suitable for CS5604. Please
read over the Overview of Related Projects. Then select from the
Suggestion List below, or suggest an adaptation thereto, or propose a
project of your own creation.
To learn about personality issues that may help you work more
effectively with other group members, see
a variant of the
Myers-Briggs personality test that you can take in about 5 minutes.
Overview of Related Projects
- ACM DL
ACM has allowed us to make their content available for research
in CS education. Some of the SIGs (SIGCOMM, SIGIR) would like
to have some of their conference proceedings and other publications
added to our collection.
- CS Courseware DL
NSF has supported our efforts to improve CS education by making
our CS digital library accessible to students, and by funding faculty
to make learning more interactive (adding online interactive
exercises or making tools or packages available).
- CS5604 Modules
As part of our CS Courseware project, and to help with a proposal for
curriculum, course and knowledge development in the broad area of
Multimedia, Hypertext and Information Access (MHIA), we are
collecting and analyzing IR curricula, and preparing small
modules that can be used at a variety of institutions for IR courses.
- Envision Interface
As part of our work on digital libraries, the Envision interface was
developed. It includes a query window, a results list window, and a
results visualization window. Some other features have been
designed but not implemented, such as browsing or zoom in/out.
- IBM DL
IBM is working with Virginia Tech on Digital Library research
and development. This involves use of search engines, rights
management software, DBMS, and image handling software.
Extensions desired relate to making it compatible with NCSTRL and
with the Envision Interface.
- MARIAN
MARIAN is a library catalog search system developed at Virginia
Tech. It now runs with the same content as VTLS, and serves as a
vehicle for research as well as providing an alternative to VTLS.
- NCSTRL
The Networked CS Technical Report Library runs as a distributed
digital library among CS departments. Further research is needed to
improve its performance and reliability.
- NDLTD
Virginia Tech is funded by Southeastern Universities Research
Association and US Dept. of Education to build a National Digital
Library of Theses and Dissertations. This involves developing a
distributed digital library of great size, training large numbers of
students about digital libraries and electronic publishing, and
developing tools and procedures so students can easily add to the
archive.
- Scaling DLs
For systems like NDLTD it is useful to develop simple analytic or
trace-driven simulations so that predictions can be made about the
effect of various architectures on performance and cost-effectiveness.
Some work has been done by Ghaleb Abdulla on scaling issues,
some simulation software tools are available, and the instructor has
started to develop an analytical model for one class of architectures.
- SWAN
Drs. Shaffer and Heath, along with various students, have
developed a system for algorithm visualization. Manuals are
available, and grad student Jeff Nielsen (nielsen@csgrad.cs.vt.edu)
can also help others develop visualizations.
- ZPRISE
The Natural Language Processing and Information Retrieval
Group at NIST makes available a public domain package of source
code and documentation (ZPRISE), which includes a Z39.50
UNIX client/server pair. It runs on Solaris. See online documentation.
Suggestion List
- Title: ACM DL - Rights Management
- Number of people: 2
- Goal: Prepare a beta test report for IBM on the rights
management software being added to its digital library systems.
- Required background: C, C++ programming
- Description: IBM has recently developed software to
manage rights in digital libraries, i.e., restrict access to certain
contents to certain people, using a system of cryptolopes.
We would like to test how well this can restrict the publications
of ACM SIGCOMM or SIGIR so that they are available only to ACM
members. IBM requires a report by the end of the semester on the
beta test software that will be provided.
- Title: ACM DL - Making HQ System Operate
- Number of people: 2
- Goal: ACM now has an IBM RS/6000 system to
work together with the IBM SMP machine at Virginia Tech for
digital library services. These need to become usable by ACM
members.
- Required background: PERL, UNIX
- Description: As part of our CS digital library efforts
we procured and installed an RS/6000 in ACM Headquarters, NYC.
This system runs the same software as our IBM SMP machine, and
mirrors the content on that machine. Both use the software that
underlies NCSTRL. Collaboration with ACM staff is needed to
make these systems easily usable by ACM members, affording
access to page images and other data we have converted and
organized.
- Title: CS Courseware DL - Log Analysis
- Number of people: 1
- Goal: Tools to capture and characterize users' sessions and
accesses to courses.
- Required background: PERL, UNIX, probability and statistics
- Description: We have extensive logs of students
accessing the CS courseware online for almost 2 years. These can
be analyzed with various routines already developed, as well as
statistical analysis packages. We wish to discover how courses can
be improved, prepare tools to help in the process, and generate such
recommendations.
We need to develop tools to extract users' sessions and characterize
them. We will try to answer the following questions:
Can we identify a user session and how?
If yes, can we characterize it?
What is the effect of a single session on the network and server?
Do we have different types of sessions, or are they all the same?
If they are different, what types of sessions do we have, and what do we
need to support every type of session?
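One common heuristic for the first question, offered only as a sketch, is to
group each host's requests into one session until a gap longer than some
timeout appears. The 30-minute timeout, the (host, timestamp) record layout,
and the name split_sessions are assumptions made for illustration; they do not
describe the routines already developed.

```python
from collections import defaultdict


def split_sessions(records, timeout=1800):
    """Group (host, unix_time) records into per-host sessions.

    A new session starts whenever the gap between two consecutive
    requests from the same host exceeds `timeout` seconds. The default
    of 1800 seconds is an assumed threshold, not one taken from the logs.
    """
    by_host = defaultdict(list)
    for host, ts in records:
        by_host[host].append(ts)

    sessions = {}
    for host, times in by_host.items():
        times.sort()
        host_sessions = [[times[0]]]
        for ts in times[1:]:
            if ts - host_sessions[-1][-1] > timeout:
                host_sessions.append([ts])    # gap too long: start a new session
            else:
                host_sessions[-1].append(ts)  # still within the current session
        sessions[host] = host_sessions
    return sessions


if __name__ == "__main__":
    log = [("a", 0), ("a", 100), ("a", 5000), ("b", 50)]
    print(split_sessions(log))
```

Session length and request count per session, both easy to derive from this
output, are possible starting points for characterizing session types.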
- Title: CS5604 Modules
- Number of people: 1-4
- Goal: Modules that can be used in CS5604 in future
years to help students learn, or at other sites where IR courses are
given.
- Required background: various, depending on part
- Description:
- One student can learn SWAN, and prepare 2-3 algorithm
animations, such as of string search routines.
- One student can extend a mini-IR system developed by
the instructor into a larger IR system that can be used to demonstrate
query processing and retrieval.
- The JMP statistical package can be adapted to help
students learn about clustering for IR applications.
- The SMART, ZPRISE and/or INQUERY systems can be
installed on video.cs.vt.edu and documented so that students can use
them to index and search against their own document collection.
- Title: Envision Interface
- Number of people: 2-6
- Goal: Design and, if possible, (partial) implementation of
enhancements to the current Envision interface.
- Required background: C++ or Java
- Contact: Lucy Nowell, nowell@vtcc1.cc.vt.edu
- Description:
- The current Envision interface could be redone in Java.
- The Envision interface could be adapted by connecting it
with NIST's ZPRISE system, to work with any Z39.50 accessible collection.
- A detailed design of browsing for Envision could be
completed, and implementation begun.
- A detailed design of zooming (in, out) for Envision could
be completed, and implementation begun.
- The Envision system could be extended so that
documents can be displayed with Netscape when selected.
- Title: IBM DL Enhancement
- Number of people: 1-3
- Goal: IBM DL capable of fitting in with NCSTRL
and NDLTD.
- Required background: C++
- Description: NCSTRL follows an open architecture,
with particular protocols used to connect the user interface, index
server, and repository. Right now the only software usable with it is
the Dienst package from Cornell. Using the API provided, the IBM
DL system could be extended/adapted so that a site could use the
IBM DL instead of the Dienst software.
- Title: IR Query Log Analysis
- Number of people: 1-3
- Goal: Characterization of IR queries, building of large
test collections, construction of models of query submission.
- Required background: probability and statistics
- Description: We have a large number of queries
available from records of IR systems. One is the full log of
MARIAN queries. Another is the log of NCSTRL queries. The
third can be obtained from InfoSeek, one of the large companies
providing support for searching the WWW. We would like to
characterize these query collections, compare them, build models of
them, and when possible, develop IR test collections that include
queries and relevance judgments.
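As a rough illustration of what characterizing a query collection might
involve, the sketch below computes a few summary statistics (query count, mean
terms per query, a length histogram, and the most frequent terms). It assumes
each log has already been reduced to a list of query strings; the function
name and the choice of statistics are illustrative only.

```python
from collections import Counter


def characterize_queries(queries):
    """Return simple summary statistics for a list of query strings."""
    lengths = [len(q.split()) for q in queries]
    # Case-folded term counts across the whole collection.
    terms = Counter(t.lower() for q in queries for t in q.split())
    return {
        "num_queries": len(queries),
        "mean_terms_per_query": sum(lengths) / len(lengths) if lengths else 0.0,
        "length_histogram": Counter(lengths),
        "top_terms": terms.most_common(3),
    }


if __name__ == "__main__":
    sample = ["digital library", "library catalog search", "Library"]
    print(characterize_queries(sample))
```

Running the same function over each log gives a first basis for comparing
the collections.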
- Title: MARIAN Enhancement
- Number of people: 1-4
- Goal: MARIAN should be more fully functional.
- Required background: C++
- Description: MARIAN is now usable, and will be
more widely used later this year with new hardware being added. It
could be enhanced by completing the relevance feedback subsection.
Design and possible implementation for browsing by author and
subject would be helpful too. Addition of a Z39.50 interface,
possibly from the ZPRISE system, also would help.
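For readers unfamiliar with relevance feedback, the sketch below shows the
classic Rocchio reweighting formula. It is a textbook technique chosen for
illustration, not a description of MARIAN's own design; the dict-based vectors
and the default coefficients are assumptions.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query vector (term -> weight dict).

    new_q = alpha * q + beta * mean(relevant) - gamma * mean(nonrelevant)
    """
    new_query = {t: alpha * w for t, w in query.items()}
    for docs, sign, coef in ((relevant, 1, beta), (nonrelevant, -1, gamma)):
        for doc in docs:
            for t, w in doc.items():
                # Dividing by len(docs) averages each document set.
                new_query[t] = new_query.get(t, 0.0) + sign * coef * w / len(docs)
    # Terms whose weight falls to zero or below are conventionally dropped.
    return {t: w for t, w in new_query.items() if w > 0}


if __name__ == "__main__":
    q = {"library": 1.0}
    rel = [{"library": 1.0, "catalog": 1.0}]
    nonrel = [{"sports": 1.0}]
    print(rocchio(q, rel, nonrel))
```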
- Title: NDLTD - Digital Library Setup
- Number of people: 2-5
- Goal: The NDLTD must operate as a distributed
digital library, with roles for Virginia Tech, UMI, and various
universities.
- Required background: PERL, C++
- Description: As the coordinating institution for the
NDLTD, Virginia Tech must set up the central digital library and
prepare training materials for other universities. First, we must
decide if NCSTRL will work satisfactorily. Second, we must decide
if and when the IBM DL will fit. Third, we need to establish a
handle service for theses. Fourth, we must port our existing
WWW-based software to this new environment. Finally, we must
document this so other universities can deploy our scheme and have it
operate in a distributed fashion.
- Title: NDLTD - Filtering
- Number of people: 1
- Goal: Design of a filtering mechanism for NDLTD.
- Required background: systems analysis
- Description: Users of the NDLTD will want to have
relevant new works brought to their attention, based on a stored
profile that describes their interest. A design is needed for
such filtering, at the level of institution as well as individual. The
architecture of this service, what technology is used, and details of
the user interface are needed.
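To make the idea of profile-based filtering concrete, here is a minimal
sketch that matches a new work against stored keyword profiles. The keyword-set
representation, the overlap threshold, and the name matching_profiles are all
assumptions; a real design might use weighted terms or a full IR engine. A
profile key can stand for an individual or an institution.

```python
def matching_profiles(new_work, profiles, min_overlap=1):
    """Return the names of profiles whose keywords overlap a new work's terms.

    `new_work` is a string (for example a title or abstract) and `profiles`
    maps a profile name to a list of keywords.
    """
    work_terms = {w.lower() for w in new_work.split()}
    hits = []
    for name, keywords in profiles.items():
        overlap = work_terms & {k.lower() for k in keywords}
        if len(overlap) >= min_overlap:
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    profiles = {"alice": ["retrieval", "indexing"], "cs-dept": ["graphics"]}
    print(matching_profiles("Indexing methods for text retrieval", profiles))
```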
- Title: NDLTD - LaTeX and SGML
- Number of people: 1-2
- Goal: Students should be able to author theses and
dissertations in LaTeX and have them converted correctly to SGML,
and SGML-coded theses should be convertible to LaTeX for
formatting and printing.
- Required background: LaTeX, some knowledge of
SGML (or at least HTML)
- Description: Based on the Document Type Definition
for SGML, macros should be chosen so that an author working in
LaTeX can use them to mark up a document. A converter must be
developed to translate from LaTeX to SGML. Another converter
must be developed to translate from SGML to LaTeX.
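A very small sketch of the LaTeX-to-SGML direction is given below. It handles
only one-argument macros, and the macro-to-element mapping is hypothetical:
the real element names would come from the Document Type Definition. It also
does not escape characters such as < or &, which a real converter would need
to do.

```python
import re

# Hypothetical macro-to-element mapping; the real names would come from the DTD.
MACRO_TO_TAG = {
    "title": "TITLE",
    "author": "AUTHOR",
    "chapter": "CHAPTER",
    "emph": "EMPH",
}

# Matches \name{argument} where the argument contains no braces.
_MACRO = re.compile(r"\\(\w+)\{([^{}]*)\}")


def latex_to_sgml(text):
    """Convert known one-argument LaTeX macros to SGML elements."""

    def repl(match):
        tag = MACRO_TO_TAG.get(match.group(1))
        if tag is None:
            return match.group(0)  # leave unknown macros untouched
        return f"<{tag}>{match.group(2)}</{tag}>"

    previous = None
    while previous != text:  # repeat so nested macros are converted inside-out
        previous = text
        text = _MACRO.sub(repl, text)
    return text


if __name__ == "__main__":
    print(latex_to_sgml(r"\chapter{An \emph{old} idea}"))
```

The reverse converter could follow the same table-driven approach, reading
the mapping in the other direction.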
- Title: Scaling DLs
- Number of people: 1-2
- Goal: An analytical model or software simulation of
architectures that will enable DLs to scale to large numbers of users
and large content objects.
- Required background: probability and statistics
- Description: What will happen to digital libraries
when serving millions of users with millions of gigabytes worth of
information? The instructor has a preliminary formulation of a
massively parallel solution, which needs to be completed and tested
to see how various architectures will affect response time. Data from
services such as InfoSeek can be obtained to allow some trace-driven
simulations to help in this process.
After characterizing users, sessions, clients, servers, and classes of
servers and clients, we can use this data to build a predictive model.
The model can be applied at various levels: the server level, the
client group level, the enterprise level, etc. If the model is accurate, then the
invariants and characteristics identified can be used to test and build
models for other Web communities.
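As an indication of what a simple analytic model can look like, the sketch
below uses the standard M/M/1 queueing result for mean response time,
1 / (service rate - arrival rate), and compares one server with several
independent replicas that share the load at random. This is a textbook model
chosen for illustration, not the instructor's formulation; the function names
and the example rates are assumptions.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def replicated_response_time(arrival_rate, service_rate, servers):
    """Mean response time when requests are split evenly at random
    over `servers` independent, identical M/M/1 servers."""
    return mm1_response_time(arrival_rate / servers, service_rate)


if __name__ == "__main__":
    # Assumed rates in requests per second, for illustration only.
    for n in (1, 2, 4):
        print(n, replicated_response_time(8.0, 10.0, n))
```

A trace-driven simulation would replace the assumed arrival rate with
request times taken from real logs.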