INFORMATION RETRIEVAL 2000
WORKPLACE NEEDS AND CURRICULAR IMPLICATIONS
MAY 24, 1996, PHILADELPHIA, PA
KATE McCAIN -- WELCOME
Welcome to Information Retrieval 2000, Workplace Needs and Curricular Implications. Or as Dick labeled it, the Intelligent Information Retrieval Workshop. We have a divergence of opinion of what this is about, apparently. As the person who was given the charge to organize this workshop, identify and contact the guest speakers and other invited participants, and convince our faculty that spending a Friday in spring term in a room at the Marriott Hotel was an appropriate use of their time, I'm extremely gratified to see all of you here. I expect you don't all know each other, and I'd like to begin by asking each of you to introduce yourselves and say where you're from. So maybe we'll start with Karen, because everyone knows Karen.
KF: Hi, Karen Forte, project coordinator.
KM: (Karen is in charge of everything.)
JM: Jackie Mancall, Drexel University.
NH: Nira Herrmann. I'm department head, math and computer science at Drexel.
IS: Il-Yeol Song at Drexel, this college.
JH: John Hall, Drexel.
GH: And Greg Hislop from Drexel.
LE: Lee Ehrhart from Drexel, and Tom Triscari will be joining us eventually.
EF: Ed Fox from Virginia Tech.
BC: Bruce Croft from University of Massachusetts.
EV: Ellen Voorhees from Siemens in Princeton.
DL: Dick Lytle from Drexel.
TC: Tom Childers from Drexel.
MH: Max Hughes from Drexel.
ER: Edie Rasmussen from Pittsburgh.
BH: Brian Heidorn from Illinois.
LF: Larry Fitzpatrick from Open Text.
DT: Dave Toliver from the Institute for Scientific Information.
CD: Carl Drott from Drexel.
GS: Gene Sherron, Florida State University.
HW: Howard White from Drexel.
AW: Amy Warner from University of Michigan.
KM: And I'm Kate McCain and I'm from the College of Information Science and Technology at Drexel as well.
Before turning the floor over to Dick Lytle, who I thought was going to focus us to the task at hand, I want to say a few words about why this meeting is happening in the first place. I suppose it could be said that information retrieval in the sense we think of it today became a problem soon after the first Sumerian archivist, city manager, librarian or scholar accumulated enough clay tablets to discover that the desired tablet and the information it contained wasn't findable by relying on memory of its contents or its location, and had to turn to a contents list and a very hard copy. Accumulation of recorded information in the original or as representative surrogate systems has never stopped. The range of media and the types of recorded information have certainly multiplied, as have the needs of users for access to these documents and/or the information they contain, and as have the technologies that have been developed to help users cope, to assist them and I would say to add to their retrieval problems. In 1994, we were fortunate enough to receive funding from the W.K. Kellogg Foundation to support a multi-year curriculum development effort supporting the education of information and computing professionals. As part of our initial efforts in this area, we identified several topics that we want to explore with an eye to developing both instructional and research strengths in order to prepare our students, both undergrads and grads, with the skills and knowledge they would need to prosper. One of these areas was information retrieval with a particular flavor. Given the nature of our program, our college and our university, we were particularly interested in applied information retrieval, both R&D and practice, contemporary techniques for information retrieval, particularly those dealing with relatively unstructured texts, and those that were applicable in and met the needs of users in the workplace. 
We found precious little in the professional literature, print or electronic, that seemed to deal with this combination of topics in the IR world, the work of Bruce Croft and his colleagues at the Collaborative Research Center for Intelligent Information Retrieval being one notable exception. We wanted to know more, and we wanted to know in greater depth. So as part of the curriculum planning activities, we decided to organize a small, select focus session, workshop, symposium, whatever you want to call it, on this area of applied information retrieval in organizations, and to provide a few knowledgeable speakers and discussants from academia and industry to help us think productively about this topic. The result is what you see here today. The two major themes are outlined in today's agenda. Within these very general interrelated topics, the speakers have been given carte blanche to discuss whatever they felt important from their own perspectives. To the extent that they highlight the same or similar issues, we can be confident that this is indeed a key concern. To the extent that they diverge, we can get a sense of the breadth of the problems and of the possible solutions. In other words, they're in good shape, no matter what they say. The hoped-for outcomes are several. As a faculty, we expect to become more informed in ways that will guide our curriculum development efforts. We have every anticipation that the information shared here today will be useful to all participants and to the broader communities in academia and industry. And we expect to disseminate the results of this workshop/symposium widely over the coming months. Now I'd like to turn the floor over briefly, for a limited number of overheads, to Dick Lytle, Dean of the College of Information Science and Technology.
DICK LYTLE
Thank you Kate, I think. I'm not sure. I worried about this, when Kate thought she couldn't project without a microphone. I said, good grief, how am I going to do it?
Anyway, welcome to this workshop/symposium, or whatever we want to call it. I thought I'd take just a little bit of time, I thought I would just talk a little bit about Drexel and the college. Some of you who might not know about Drexel, and I think it's useful, because it sort of sets a context. Drexel is, considers itself and is a technological university. This means we focus on development of new technology, the application of technology, the management of technology. We certainly have programs in science that study things, but sort of the ethos of Drexel is building things. And the largest college is engineering. The second largest college was business until recently. Now it's arts and sciences. And we're the fourth largest college, even though we have the largest box on the chart. The other characteristic of Drexel, in addition to technological university, is education for careers. They sort of follow, but at the undergraduate level, we are one of the two really outstanding co-operative education schools in the country for undergraduate education. Graduate education, we focus a lot on professional career, education for careers. Professional education. We have other kinds of programs that go on. We're very proud of the fact that many of our graduates go on to graduate schools and get Masters and Doctorates. But this is our primary focus-on education for careers. The College of Information Science and Technology has several degree programs. In addition to the degree programs, which I won't read off, in the interest of keeping Kate happy, we have the Curriculum Reform Project, which is what we're here today talking about. We're trying to broaden our view in a lot of ways, and we're very appreciative of the help of the W.K. Kellogg Foundation. A related effort is delivery of IT, IS delivered Instruction. We have a large grant from the Alfred P. Sloan Foundation for asynchronous learning. 
We're going to start a Master of Science in Information Systems Software Engineering degree, which will be delivered to CIGNA starting this fall, totally on Lotus Notes. That's not the subject of this conference. I just wanted to mention it. The other feature you need to know about the college is, the enrollment's going up. Deans love graphs that are going in the right direction. And this goes way back, because we were kind of interested in it. But essentially what happened, most recently, in 1992, 1993, is that we started offering a Master of Science in Information Systems. The other programs were growing slowly, but that one really took off, and has resulted in a dramatic increase in enrollment. Three years ago we had a little over 500 students. This year we will go over 800 students. This is a pie chart showing the distribution by program of the number of students. This is a body count, it's not an FTE, nor a student credit hour generation-related thing. It's a body count distribution. What you see is the Bachelor of Science in Information Systems is 31%, the M.S., which is our Library and Information Science Degree, is 38% of the total, and the M.S.I.S., the Information Systems Degree, is 27%. Since 1995 and through fall 1996, you'll see in absolute numbers, the M.S. Degree is growing, but the Bachelor of Science and the Master of Science in Information Systems will be a larger share of the total. That's where the growth is in the program. This is a statement of the curriculum objectives of the Sloan Project, oh, excuse me, Kellogg Project. Kate started me on that this morning, telling me not to talk about the Sloan Project. I'm only going to talk about one; that's the first one. You'll have a handout if you're interested. We want a curriculum which derives exclusively from the needs of society. We want, in this particular case, in my terms I guess I would say, we're looking for programs that address the needs of industry for people, for career professionals in the information and computing field.
And that's the purpose, from my perspective, of this conference, to help us think about that. (That was the last slide, which said just what I just got through saying.) Which is, we are really asking you to help us think about the information retrieval field very broadly, as it applies to the programs across the college. And I'm looking forward very much to being enlightened about that. You might be interested to know that every IST faculty member has an assignment, and their assignment is, after we've gone away and thought about it for a day or two, to write up a statement of what we think, what we've learned, and the implication of what we've learned for information retrieval as applied to the development of curriculum. And we are presently involved in actively reviewing and redesigning the curriculum. And we're looking forward to your help. Thank you. Kate, who's next?
ED FOX
Let me give you some handouts that you can, of course, take with you. I'll do this in a different order, but it took about 20 minutes to download this file. So I might as well show it to you while I'm still here, before I go on to something else. I teach a course in the spring on Multimedia, Hypertext and Information Access, which is the closest we have at our institution to an information retrieval class for the undergraduate students. And this is one of the projects that was done by a team in that course, and you can see down...(microphone discussion). This was done by one of the team project groups, and you can see down here on the bottom that they're using Java. Kids love to play with things, play with technology. We have funding from the National Science Foundation.
On the yellow sheet that went around, we have a five-year research infrastructure project, and a piece of that is a room that will be used for distance learning, for computer supported cooperative work, for conferencing in a small sense as well as in a remote sense. And so this student group will be put to good use, and they essentially made a proposal here for how to lay out that room. So that was one of their term project activities. And just to give you a flavor of some of this purpose and scope and background and outline, they interviewed people, they went around and found the needs all over campus for these kinds of activities. They did a survey of industry and company activities. And they came out with room layouts, which are a little bit further down here. So we have a few different conferencing situations that will be supported in this room. So I just wanted to point out that students are really highly capable. If you give them a task that's open and broad and something they know will have an impact in the future, they get excited about that. So, let's come back out of where I am here to my actual presentation. Come back a few levels. This particular presentation I have delivered on a diskette, so you have that and I'll put it up on the Web. I have the fortunate position to be first and last today. I'm grateful for that privilege and so, since I'm trying to figure out what we're talking about today, I hope that I will lead us in a good direction and that others will correct things as we proceed, and then at the end, I can tell you what I should have said afterwards. 
One of the reasons I'm here, actually probably the main reason I'm here, is that I wrote a proposal for SIGIR '96, which is put up here, and it's available through the World Wide Web, for a courseware training and curriculum workshop, a one-day workshop, August 22 I think it is, in Zurich, to discuss what will happen and what should happen in developing worldwide curriculum for library schools, information science, for computer science, in this broad field of information retrieval. We're now at the time, and Edie's actually one of the co-chairs of the SIGIR Education Committee, I'm just a member of that. But I'm organizing this workshop so that we can come together in an international sense and work towards curriculum. There has been some broad impact from people at Drexel, in terms of HCI curriculum. It's had a tremendous impact all over the nation and further into the world. And I think we're at a time when that kind of activity is healthy and important. And actually today is kind of a step towards that. So, I'm going to use what comes out of today in this workshop, so you will see some dissemination going on. I invite anyone who can come to the Zurich workshop to join in discussions as well. My perception is that in many cases, we need to tie in with other areas like hypertext and hypermedia. And also with electronic publishing. So this lays out a little bit of that perspective. There is more information about the conference itself. I want to connect shortly. What I've done is download onto diskette and locally store most of the presentation so we don't have to wait for the 14K modem to connect over the Internet. There are some things I'll grab from other places. But all of this is being presented through Netscape. Another hot button that people have been excited about is under the name of digital libraries, and I'm sure we'll talk about this during the day. But this is an example of a project that's underway in the international sense, called NCSTRL.
How many of you have used this, have used NCSTRL? This is the biggest repository of computer science technical reports in the world right now. We're hoping very soon to have all the major institutions as a part of this project. There are maybe 40 already that are part of this, the major PhD-granting institutions. And we invite information science, library science to join into this. So if you have a tech report series here, at Drexel or any of the other institutions represented, we hope you'll join into this particular project. And it makes use of search technology. (And the computer just crashed. Let's get to..., do you want to do something?) So, while that's happening, let me talk about a few other things. In 1993, Gary Strong from the department here was part of a group that was involved in an NSF task force on developing a framework for academic programs in informatics. And do any of you know about that project? You're familiar a little bit with it. This is almost identically the same topic and focus that we are talking about today. The breadth of this was slightly different in that it was talking about information systems as well as information retrieval. But the focus of that workshop was to bring together people from industry; half of the workshop was from industry, half was from the academic world. And we met and looked at what the needs are of industry for people in information systems and information areas, and what kind of curriculum we need to develop to support that. So almost identical to the kind of purpose that we're reviewing today. We're a little bit more focused today on information retrieval as opposed to this broad sense. But there is an NSF report, back from 1993, I have a version of that here, that I think is a particularly important report. So I want to set that as a context. In the ideal sense it would be better if we had more people here today from industry.
We have some, and that's great, and I hope you'll all listen to them because they actually know a lot about this. I have a few other documents from other places. The European Community, back in 1994, started a program as part of the Fourth Framework effort in information engineering. That was the term that they were using there. With a strong component in information retrieval. Again, to deal with producing curriculum that would be relevant to the needs of industry, to do research that would be relevant to the needs of the future, to build up the European Community in the networking side, in the information retrieval side, and in related kinds of disciplines. So I have a few documents here as well. They call it information engineering on that side, but we're focusing today on retrieval. Now, as a faculty person, you may say why in the world is he telling us what the needs are of industry, so that we can identify what to do in our program. So I'll show you some of the stuff here on this slide. If you could just bring up Netscape. So I'll give a brief sense of why, perhaps, it might be relevant in this case. I spend a third of my time, a third of the nominal time, which is more than probably half the time some people spend, as Associate Director for Research at the Computer Center at Virginia Tech. So this gives me a slightly different hat than I wear when I'm the professor in computer science, teaching courses in this field. I'm really the only faculty person in my department, and indeed the whole university, that's doing IR research. I have a few others who are collaborating with me. But as a focal point of research there, there are really not that many involved. In most institutions around the country, there are small groups of people focusing on this. And yet, we've been successful in pointing out that this should be a key part of the undergraduate and also the graduate curriculum. And from the computing center side, I have connections with campus projects.
We work in the libraries, we work in library automation, work in distance learning, distance education. (Sorry for the technical problems here. Connect back to it.) OK, so, my roles. I've mentioned that in 1993 I was part of this information engineering workshop. Since 1993, up until next year, I'm the Director for an education infrastructure project which is to revamp curriculum in our institution, and to use the Internet and digital libraries and other technologies to improve that. So I'll talk about some of that as we proceed later. At ACM, I've been involved in a number of capacities. Again, this gives a little bit of insight, I think, regarding where things are going. From 1988 to 1991 I was Editor in Chief of electronic publishing for ACM and sort of helped set them towards the mode of SGML and other kinds of archival representations. And I've been involved in research grants relating to that, helping ACM since that point. At SIGIR, I was an officer for a number of years, and now am still on the education committee. I'm running that workshop I showed you before. Also at ACM, I helped found the multimedia conference series, which again had strong connection with industry, trying to serve their needs and develop people who would have the skills to help the field to grow in a broad sense. And I'm co-chair of the education committee at SIGMM. And I've also been involved in digital library conferences. So that's how I might fit into some of this. My sources of information that relate to this are these: I have had lots of students over the last 13 years at Virginia Tech who have come out of our programs, who want jobs. So I've helped them in that regard and have learned a little bit of the situation. Information retrieval has been taught at Virginia Tech since the 70s. So it's one of the few places that has actually had a long-standing tradition in it. And I was actually hired to fill that position when I came there in 1983.
Many of the almost 50 grants that I've had at Virginia Tech since I've been there have been funded from industry. And so I've had some connections with them in terms of their projects. Our projects that are carried out by students, as the one that I showed you a little bit earlier when we began, are projects that serve real users. So, again, you have the sense of that. From the computing center, we're serving campus activities and, of course, I've had connection with other educational institutions in the past in terms of their needs. So let's talk a little bit about projects in that course that I mentioned. Why do we do projects? And what do we learn about the needs of the field? The projects that we've been asked to do, and I've not only taught this course on multimedia, hypertext, and information access, but also a kind of software engineering course that we've developed things for people around campus. We found, in recent years, these are the three kinds of needs that people want our students to fit into. They want students who can collect, organize and present things. We work with digital video, so sometimes we'll produce a video. It was fun because at graduation a few weeks ago, the parents came in, sat down, and after everyone was quiet, the students presented a video they'd done in my class. Which they actually produced and edited, music and visual systems and all. So presenting these in video, also presenting and using the World Wide Web, and then as combinations of those. So have some things on the web, some things that are presented with video, and mixing back and forth, delivering things on the web that include video. So all of these kinds of combinations. Students also need to do things that are a bit more flexible, that are scripted. So hypercard, authorware, Director, Java, these are also the kinds of skills that people want these students to have that are important for these kinds of applications. 
And then, they need not only the technical skills, but some kind of understanding of presentations that will be of value, and I can point some of these out to you from these class projects. One group did a museum, a virtual museum of the history of computing. J.A.N. Lee, who's the editor of the lead journal in this field, is on our faculty and has enormous multimedia resources. So they organized this, they put it together, developed a number of different kinds of interfaces and access points to this. So these are important skills that are well received and this will go on and be fairly widely used. Some provide a tour. So they have a tour of the multimedia facilities on our campus, which was done as another project. [unintelligible] Another angle on this, from the more technical side, is publishing things. Until recently, we've been publishing things as text. We're now moving towards more hypertext. Yesterday we had visitors from Denmark in my place who are trying to move their whole agricultural enterprise from the paper-based mode of delivery of information to farmers out into delivery using electronic schemes. And Virginia Tech has a similar plan of upgrading. The whole agri-business field is moving very rapidly in this regard. We face very serious challenges in that, if the people that we put out from our institutions are not able to do what the end users need, they'll be bypassed. We've had serious problems in terms of funding from the state for our extension services, for people out in the field to help farmers, because they're not necessarily as knowledgeable as some of the farmers are, and the people out in industry. So, we face a serious challenge in our universities, not only in our majors but also in minors, to help people in other fields at our institutions to have these kinds of skills.
Moving into virtual reality, searching and browsing, and since we're in computer science, students have to have, to get these kinds of jobs, the ability to build the tools: to build a tool that will do searching, to build an interface, to build the tool that will take clusters of information and identify the hypertext links in them. These are all research projects that we're involved in, and we'll be able to take information and group it together, cluster it, organize it in different ways, so people can manage it. And I'm giving this at a high level, but this is really what the people want on the user side. The user interface is a serious problem. The other handout that I sent around, we have a new center on Human-Computer Interaction. And on the order of 35 faculty around campus are part of the center. The major thrust of our department of computer science is Human-Computer Interaction. NSF is partly reorganizing itself around the notion of human-centered systems. So, there is large-scale interest in this broad field, and we've had a lot of activity. Among the most popular areas for our students who are getting jobs are multimedia and HCI, and the interests of people coming into our department are the same kinds of things. In computer science we have gone from on the order of 100 freshmen coming in to 250 this year, 250 students. So there's tremendous growth in the last couple of years and demand. People are interested in these fields. Information handling. As part of the Blacksburg Electronic Village, which is the most wired community in this country, one of the student groups in this course put together a multimedia collection of information from newspapers, from going out and shooting video and other kinds of things, as part of a long-standing project we've had, to document what's going on, to understand where things are going, to deal with community groups. And then, of course, we have Ellen and Bruce and others who I'm sure will talk about the notion of evaluative studies.
TREC is a good example of that. Let me go back a second here, and give a little bit broader perspective, which I think is something that is relevant for me to do here. My view has always been, and this I think is a way to think about this discipline. We get very caught up in terms. Information retrieval, resource discovery. You talk to someone out in the field, they don't know what information retrieval is. They don't understand the kind of jobs that might involve this. I had one student who wound up working for the judiciary system in an information retrieval capacity. The only way he got his job, got started, was by claiming that he was a computer science student, instead of having a degree in information systems, which is what he actually had. So he snuck into that situation. That's kind of changed a little bit. Right now, in our Northern Virginia program, that's been the fastest growing area, the most demand in industry is for people in information systems with these kinds of backup. And we view information retrieval as one of the key courses in that area. But part of it is because ultimately, out in industry, people wanted to solve problems. And so I think we should think of a problem orientation system. Fundamentally I view IR as a problem-oriented discipline as opposed to a technically oriented discipline. The problem is to help people find and use information. That's really what IR is about. We need to not lose track of that and get caught up into all the jargon about disciplines. It draws from the fundamental needs of people to survive. And to survive in this information age. It expands, this field, as technology moves along. So we can use those technologies to solve this kind of problem, in whatever capacity it arises. And we should (and this is what the field has been doing for a long time) apply any kind of technologies that will help us solve this kind of problem, which forces us to be interdisciplinary. So, we have to work in an interdisciplinary way. 
It's nice that there are people here from Drexel, from all over the university; that's a great attitude and perspective. And it brings in from the scientific side the notions of measurement and evaluation. These are, again, key things that we have to have and that are often not present in regular computer science degrees. We have all of these connections to other fields. And here's where sometimes people get lost. What is IR? Well, we're doing some computational linguistics, and we're doing some storage things, and we're doing some architecture and digital libraries and we're doing educational technologies. And I sort of bounce around in all of these areas and my students work in all of these fields. So people say, what in the world is he doing? Can't he focus on something? But ultimately, I think that's where we have to move in this regard. Tying all these things together, giving them some kind of exposure. When I look through the curriculum that I saw here at Drexel, I noted that you have a very broad collection of courses in many areas. And I'll make a comment that I hope we'll come back to later in the day, in the afternoon session, that I worry a little bit about the coherence of it. I didn't see that when I looked at the curriculum here. But if this is your kind of view, that this is a problem-oriented discipline, that you're having people who will come out with skills and broad knowledge in these areas, then maybe it will make sense. But I tried to list here some of the key areas that tie in and help in this IR side. And here are some. So, finally, and this is for the morning session, I have a whole thing for the afternoon. Needs and Trends. With the emergence of these new technologies, with the networks, with the alternative publishing, we have enormous growth in the information out there. So, that's a fact that Karen Sparck Jones didn't have to deal with in 1971, as she was doing some of her studies. Things have radically changed.
And this will only increase. One of the projects, and I can talk more about this later, that I'm involved in, is to develop (we already have this in place at Virginia Tech) a mechanism for all of our graduate work to be recorded through electronic theses or dissertations. We have a pilot project in the Southeast to expand this all throughout the Southeast, and then have proposals in the works to turn this into a national service. If we're successful in this, and I think ultimately we will be, there are about 400,000 theses or dissertations that could be produced electronically every year. So building a digital library of research works, this is maybe the most effective and speediest way. So this first point is enormous amounts of information becoming available. The second thing is, unfortunately, the economics have not been such that there is a librarian in every place where this information is being produced. By and large, people who are producing this information, and often using it, have very little training in these fields. This is a serious problem that we face. And so there is a tremendous opportunity, if you cast things well, for people in this field to be well placed and to help create more organization to this generation process. Along the same vein, we have tremendous move towards decentralization, which makes it harder for people to share things, to combine things, to organize things together. And we can do some of that technically, but we need to have people who are trained in these kinds of skills. We have cycles, of course, in the demand side for CS, for people in information systems and so on. There are some new emerging areas that we've noticed. One of them is community networking. Since we're in the Blacksburg Electronic Village, we see this replicating all over the place. People are constantly contacting us. We've seen projects all over. Communities want to have information systems. 
I was in a distance learning discussion on Tuesday at Old Dominion University, and people came from one of the city, county complexes. They have laid fiber in their schools, they are out for their whole county to be wired up for distance learning so they can have a lifelong learning experience. They're out and ready to do this. They need guidance, they want help. So, we see this move towards community networking, and this will have lots of opportunities. Here's some statistics on the Blacksburg Village. 40% is on the Internet, two-thirds of the companies are on the Internet there. And what this means is we'll see a tremendous move towards small companies starting. From the last couple of years, three or four of my students have started companies. They're on the Internet, they can do consulting, they can build these kinds of systems. So some of our graduates will be doing this kind of thing. Information Systems, as was written up in this 1993 report, are often late, are too expensive, they're inflexible, they're hard to change, hard to adapt, they're hard to use. All of these, this litany of problems that relate to HCI and other areas. We need to help solve this kind of problem. At Virginia Tech, in our information systems side, here are some of the main things that we face, and I think the industry reflects a lot of the same thing. Training. We have to train people in all of these new technologies. At Virginia Tech we have a four-year project so that the entire faculty gets trained over a four-year cycle, in a repetitive mode. A week-long training program where you're left with a computer workstation on the Internet, know how to do it, all ready to run, and this has transformed the whole educational enterprise on our campus. This kind of retraining, reengineering, is going on in companies all over the place. So, some of our graduates will go out there and be involved in training activities. 
We're building digital libraries, we're increasing this scholarly publication activity, we're developing new courseware, and this distance learning venue is going to move what we're all doing out to the broader fields, and this is again something we should address, to meet the actual needs out in the workplace. And finally, this whole notion of client-server and distributed processing, people don't know how to deal with this, and all the information side of it. So, that's what I wanted to do this morning. In the afternoon, I have another set of this, focused around the curriculum side, which I'll come back to later on. Sort of highlight some of this. And I've gone around and collected curriculum in IR from all over the place, and put it up here so we can take a look and see what other people are doing as well as how this connects to what we are doing. So I'll stop. KATE McCAIN: Thanks, Ed. The way we've got this structured, we've got two speakers, a break, two speakers and then time for general discussion. But we're pretty loose about this. They brought break stuff up early. So, if anybody has anything they want to take up with Ed right now, a couple of questions? Or something that you want to say, that's fine. Do you have a question? LF: Maybe, maybe not. One of the things that strikes me here, I'll get into this, is the horizontal, very horizontal-like nature of some of this technology. And the discipline as well. And it's interesting that in the marketplace, the market's exactly the same way. And I'm wondering what it is that sort of unifies the whole thing, other than the technological approach, at the bottom. EF: Is that a question? I think the point is, people want to accomplish certain needs. They have problems they want addressed. And so we have to apply whatever it takes to do that. So, I think that's what you're getting at with this horizontal notion. 
LF: In the specific, though, it seems like, I mean it's refreshing to me to see that in an academic setting, it's the same thing that we're experiencing in a commercial setting. The problems that people are actually trying to solve and get concretely at them, are very different, right? But in tying it all together at the bottom is sort of a suite of techniques, or a bag of tricks, being brought to the gate and applied in new and interesting ways for each sort of different thing. And I'm wondering if there is something fundamental that kind of holds it all together, other than just a bag of tricks, or managing information. Or is there a higher plane on which we could think of this? EF: What I did in the stuff about curriculum, I'll point out sort of a way to organize it on the curriculum side. But on the needs side, I'll throw out one idea which is kind of strange, since I'm in the College of Arts and Sciences. But these other meetings that I've talked about, and some of the European community activities, have talked about information engineering. So maybe we're producing an engineer who can deal with information, who has to have a broad training like a civil engineer would have, or a mechanical or electrical engineer. So maybe that's what's actually happening here. We're doing the research, we're calling ourselves scientists. But we're producing a broadly-based engineer to solve the suite of problems. Yes? DL: I have a question, and I was wondering if this is the same question, but it sounds different to start out with. Virtually everything you described that your students do, our students are also involved in. But I would not have connected it with information retrieval. Necessarily. I think it's the technique; you're taking a look at the techniques. Then if you start focusing on that, then that seems to be..., you can see how that solves problems, in individual instances with what is holding is the frame or a framework LF: ... 
a coherent model. At the bottom, an information engineering program might be trying to build sort of a higher-level metaphor and primitives for information processing. I get the sense that we're not there. EF: It's a little funny, because when I was involved as the Chair of SIGIR, it was very much my intention to give us a broader visibility and perception, and to put IR squarely in the middle of a lot of activities like multimedia, college-based information retrieval and other kinds of things. I think that's the right way to do it. I don't think we want to be in the closet. I think we have things to contribute to many different groups and activities. So, that's again another way to think of this. Others will disagree, and certainly this talk was not focused on state of the art research in particular technical areas. But I thought this perspective, this breadth, was important to start off with. Yes, Brian? BH?: I think sort of an answer to the perspective of the question is, you can look at all of the HCI work as knowing how to do those things as tools. Like a cabinetmaker would have a set of tools in his toolbox. The tools don't necessarily look like each other. They're not directly connected. But they're all designed to serve a particular task, to make cabinets, to do woodwork. So the tools themselves aren't related. It's the task that holds the tools together. And that's what a lot of these things are for information systems as well. They don't relate directly to one another, so you can't say, what's he doing? If he's doing computer interaction, and... DL: Part of the problem is, we're focusing on tools too much. BH: Yes, maybe. And it's the task at hand, it's the retrieval system that they're really working on. EF: I have two contacts here with us who are in the job market this week. One of them is a search firm that was looking for someone to write proposals in distance learning and multimedia education. 
And he said, "Do I hire someone from education to do this, or hire someone from the technical side?" So of course I argued for the technical side, because they were actually better grounded in this. Another case, I had a student call me who has a degree in video production. And he said, "It's a dead end. I can't get a job. I want to go back and get retrained. What should I do? Should I go to a school of art, and learn it from that side, or go to a school of technology and learn it from that side?" And I said, "To be honest, I think you'll be much better in the long term if you get a computer science degree. With this kind of background, you'll be able to deal with all of these in the future." This is, in part, the kind of argument that the Media Lab has espoused, that we have this convergence of the telecommunication and the publishing and the entertainment and intellectual information areas all together. And we have to prepare people to deal with those things. KM: Bruce, and then we'll segue to the next speaker. BC: One of the problems of this area, which is often a criticism directed at us, is what are the foundations, what is the discipline that we're talking about here. And the criticism is leveled at us by computer scientists often, although eventually they ask the same questions of the people in the database field, for example. It's exactly the same question. In the database systems area, there are some partial theories, some models, like collections of tools, problems that need to get solved, things like that. And people are groping towards general approaches and trying to find out what the right approaches are. And it's no different from information retrieval either. We have some theories and partial models, bits and pieces, and it is an important issue, though, to keep trying to focus on what are the fundamentals, what is the discipline, if you like, that underlies these different things. 
It's certainly true to say that we're not there in terms of understanding what it is. But that doesn't differentiate us from a lot of computer science. Except for the theorists, who sit there and say, we're doing theories, so... KM: Gene? GS: I've spent the last couple of years developing a bachelor's program in information studies. And the problem has always been, how do you name the baby? If you say it's going to be library studies, that's a turn off for young people. If you say information studies, then Richard has been very good at pointing out things like this, what does it mean? You know, studies, that's a wonderful word. Then you put the word systems with it, well, that's sort of computer science. If you put one of the catchwords, science, then oh, you can't work in the real world. So the problem as I see it is, what do we name our programs? How do we demonstrate to those who buy our students as workers, something that makes sense to them, and parents who are sending their kids off to school, and understand CS... but if you're saying information "blah", what is it? So I think that's one of our biggest issues, is how do we demonstrate this to the public? Students we produce are capable of doing these things, and today, it's if you use a hook. We did a pilot course on a Novell program, Novell administrators. Certified Novell Network Administrator courses. And we did it over a semester. Everyone of those students got jobs, and they got a job that was worth more than that course ever cost them. Because it raised them up a level above their competitors who were just plain old whatever, you know? So, I think we need to work on, how do we sell the discipline, whatever name we use. KM: Well, that's certainly something we are going to be struggling with all day. Maybe it would be useful if we got a perspective from the other side of the fence, so to speak. Ed has told us what is going on at Virginia Tech. 
Ellen Voorhees is from Siemens, and she might be able to address some of Gene's concerns about who is it that wants what it is we're producing, and what are they looking for. And if we have the right technology up there for you or what? ELLEN VOORHEES (I just need a straight overhead.) As the people have said here, I'm from Siemens. I feel like perhaps I'm somewhat masquerading here. Siemens is a company, and it's actually a very big company. But I'm in a research division of this company. And part of... so my perspective here is as a researcher within a large company. Before I get into the actual talk, let me give some background about Siemens. Yeah, German counterpart's fine. It's very hard to believe, but in fact, in the U.S. Siemens is not a household name. In fact, frequently I get asked, if I'm just talking to people, I've been asked two or three times, why a furniture company needs a research scientist. So, we, in fact, do not sell furniture. One of the few things Siemens does not sell. Siemens is probably best understood as the European equivalent of GE. A very, very wide ranging company. It sells nuclear power plants, it sells dental equipment. In the U.S. most of its market, or its biggest market, is probably in the medical equipment field. My interests are... the parts of the company which are most closely aligned with my interests would probably be the telecommunication parts of the company. Siemens Stromberg Carlson, which is down in Boca Raton, makes public network switching, so you can... Sprint, for instance, or is it MCI, one of the two, I think it's MCI, uses Siemens Stromberg Carlson switches. They also have a large division in private switches, or PBXs. That's Siemens Rolm, out in Santa Clara, CA. Siemens has also in the past couple of years bought Nixdorf, which is a software company. So Siemens is involved in these sorts of areas. And that is the sort of... 
part of the company that we're looking to service, at my group in Princeton. I'm going to be focusing here on a particular topic which I think is a trend where information retrieval research is going. And it's in the information agents. Every large telecommunication or information technology company that I know about, including Siemens, has some variant of this vision of the future. There's going to be just massive computation and communication power to the user. So, as you're walking down the street, you whip out from your holster your palm top computer and order tomorrow night's movie tickets, OK? This is sort of a vision of just completely ubiquitous communication and computation to the user. Now, this is all fine and dandy as a vision. But this vision always sort of, mostly implicitly, very seldom do people explicitly make it, but implicitly provides a lot of challenges to the information management community. If, in fact, you're going to have this ubiquitous access, well, then that means you're actually going to have to be able to find the information that's out there in this massive amount of information. You're going to have to be able to deal with a wide variety of media types. It's not just text, it's going to be images. There's going to be video, there's going to be audio. Probably combinations of all of them in the same object. And you're going to be dealing with, at very different levels and sizes, of collections. People will sort of..., the canonical use actually, at the moment, of agents, is to schedule meetings, calendar agents. The next most canonical use is mail filtering. Or, because people don't like other people to experiment with their mail, news group filtering. There are a wide variety of different types. They have all sorts of different characteristics. And by and large, the people are at least making up a vision, are assuming that this is all going to be accessible and, in fact, it will all work fine and dandy, thank you very much. 
So, part of what we need to do there is to make this a reality. OK, so if that's the vision, then why in particular agents? I am not the only one, by any means, to think that the information agents are sort of the wave of the future. There is a..., in September of this past year, Scientific American had its 150th anniversary issue. And they decided that within this issue they were going to make predictions about what the next major technologies were going to be, in a variety of different fields. One of which was information technologies. And they picked five key items within each of those fields. And in the information technologies field, information management by intelligent agents was picked as one of those five ones. The author of that article in Scientific American was Pattie Maes at MIT, who is probably the person most well known for this work. In an earlier report, in 1994, Ovum, which is a British company that does market analysis for the computer and communication industry, released a report on intelligent agents. And this is in 1994, the research for that report preceded that. So this was actually just slightly before the real advent of the web. But even so, they predicted by the year 2000 a very large market for information agents. And in particular, IR agents were predicted to be the biggest share of that agent market. And as I said, as far as I know, just about every information technology company has, within its research division, some project which is targeted at agents and information. Now, there's a running catchphrase in the software agent community. Everyone asks, what's a software agent? And no one will give you an answer. No one will agree on an answer to what a software agent is. This is my definition. It's encapsulation of the processing required to do a task, but lots of other programs are also intelligent, so where's the agent? Most people will also... to be considered an agent, it should have the connotation of autonomy. 
This processing can take effect without direct user involvement. Now, the user had to be involved, probably at some point to set it up or start it or something. But after that, this software can act on its own. So that it can act on its own, there's usually some implication here of some amount of intelligence or knowledge of the particular task inside the agent. The agent can act on its own because it has a broader idea of what the real goals are. Its main use, at least so far, or projected initial uses, would be to alleviate tedium. So this is where the filtering comes in. When you don't want to see the ninth notice of a conference or the offer to make $500,000 in three days. Another major aspect of it, and the one that I want to concentrate on here, because I think it's really the key to why it might be useful in information technologies, is there's usually this idea that the agent contains some sort of model of the user. That in order, this goes back to its intelligence, that it has some idea, in fact, of what the user wants, or why it should be doing this task in terms of the user's goals. So this then modifies the common vision picture. So that now you have your network, but you also have these little guys running around within the network. The particular project at Siemens which I've been dealing with, we call our little agents "scouts", so it's the Infoscout Project. That's why they're wearing Davy Crockett hats. The idea here, though, is that what you'd like to be able to do is have these agents find information, find resources, let me go into what the actual uses for the agents. As I said earlier, I think the main reason why agents may, in the end, prove useful in information technology, is the idea of the user model, being able to adapt to individual users, and thereby personalize the access to the information. 
As the amount of information out there grows, to the point where you can't find anything because there's so much, and nobody's interested in the vast majority that's out there. You have to find those particular pieces that are of interest to you. It's within this agent, as a repository of data and processing and whatever else you want to stick in your agent, but it's particular to an individual user. And I think that's where the leverage is going to come. Within one agent, you can capture what is of interest to this person, and within a separate agent you can capture like what's of interest to this other person. Once these profiles have been built up on a particular user, the profiles can then provide context for the retrieval systems. END OF TAPE ONE, SIDE ONE ...used for other things too, but here we'll concentrate on the retrieval system. And I think there are at least three different, very different ways in which this context can be used. It will be different contexts, but this idea of personalization. One is in being able to learn the appropriate data sources. There are a zillion data sources out there, where do you go? Within, if you keep track of the history of where you've gone before, and how useful it has been to be there, all of which can be stored in this agent, you have a better..., you will be able to track over time, what data sources are good for particular types of queries from this user. And in fact, it was this learning appropriate sources within such a context which has sort of inspired the database merging work that we've been doing at Siemens and there have been other people also doing it-there's a group at U Mass. Where you have different databases, they're physically separated for whatever reason, and you want to try to figure out to get the performance, as if it were a single database. An agent is a good place to keep track of that sort of information, is my claim here. 
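[Editorial illustration, not part of the spoken record.] The idea of an agent that keeps a history of which data sources were useful for which kinds of queries can be sketched in a few lines. This is a minimal sketch under stated assumptions: the class name SourceTracker, the notion of a "query type" label, the yes/no usefulness signal, and the smoothed success-rate ranking are all hypothetical choices made for illustration. It is not the Infoscout design and not the database merging method mentioned in the talk.

```python
from collections import defaultdict


class SourceTracker:
    """Hypothetical per-user record of how useful each data source has been."""

    def __init__(self):
        # Maps (query_type, source) -> [useful_count, total_count].
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, query_type, source, was_useful):
        """Store one observation: the user consulted `source` for this kind of query."""
        entry = self._counts[(query_type, source)]
        entry[1] += 1
        if was_useful:
            entry[0] += 1

    def rank_sources(self, query_type):
        """Return the sources seen for this query type, most promising first.

        The score is a smoothed success rate, (useful + 1) / (total + 2), so a
        source tried once does not automatically outrank one with a long record.
        Ties are broken alphabetically to keep the ordering deterministic.
        """
        scored = [
            ((useful + 1) / (total + 2), source)
            for (qtype, source), (useful, total) in self._counts.items()
            if qtype == query_type
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [source for _score, source in scored]


# Example use with made-up source names.
tracker = SourceTracker()
for _ in range(3):
    tracker.record("company", "newswire", True)
for _ in range(2):
    tracker.record("company", "patents", False)
tracker.record("company", "filings", True)
ranking = tracker.rank_sources("company")
```

Because the record is kept per user, two people asking the same kind of question can end up with different rankings, which is the personalization point being made in the talk.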
A different type of context is to provide the sort of background information for queries. Sort of the typical interest of the user. For example, if a user wants to find information about a certain company. If the information he recorded from house queries was recorded in this agent, about whether or not, in fact, this user usually wants, say, the financial data for the company, or the technical data for the company. The profiles can provide that sort of context and target the query appropriately, without the user having to go through all that explicitly. And sort of a subcase of that is it can provide the context for ambiguous words, for sense disambiguation. In most retrieval systems, I can't speak from experience, I've been told that in most retrieval systems, and certainly when I use Web search engines myself, the query is one word, maybe two. If you, in fact, have information about the user and the typical things that this user looks for and wants to do, then the user can continue to use the one word query. But you can, in the background, provide the context that the retrieval systems actually need in order to do a decent job of retrieving information. As a sort of a completely separate use of the agents, I see them also as being able to offer sort of, you can think of it as maintenance duties. If you're building up these profiles, and we'll get to them in a minute, the problem of how you get these profiles, but say you had these profiles. This is then, since the agent needs these explicitly, right then you have been forced to codify your experience with different retrieval systems, with different sources. This information can then be shared with other users. Which, if you share common, sort of interesting place to go things, it builds up a sense of community, users can have this, you can provide a user base with a common groundwork by sharing these profiles. 
It also preserves the experience of people who have gone through a long training process here of where the really good sources are for finding the best restaurants, or whatever. If that's part of your profile, then that can be shared with other users. Which leverages the cost of creating that knowledge to begin with. But also, I know there's, Siemens has a, through Siemens Stromberg Carlson, has a relationship with Ameritech, which is the baby Bell in the Illinois area. And one of the problems Ameritech has had recently is that a lot of the people who have worked in the past on their billing systems and on other systems like that, their main parts of their business, are retiring. And they're actually losing sort of never documented knowledge about their systems. If these sorts of agents were used to actually be able to capture that knowledge over the long term, you'd have to make it sort of a gradual process so that people don't actually realize that they're doing something extra. But then it's there, and as new users come in, as new into a community, they can benefit and the knowledge will be retained by the organization. Sort of a completely separate use, if you actually do envision these agents as running around in say, web pages, or like the spiders or wanderers..., they can be used to detect old or modified links. I actually, in preparation for this talk, did a web search on Kellogg Foundation and various other things. I came up with things; I'm always astounded at what you can find on the World Wide Web. I also came across an awful lot of links which now point to nowhere. And even more dangerous, I think, are those links which still exist but have actually been modified since the last time the person pointed in there. In fact, it's unlikely that they point to something completely different, but you don't know that they're actually now pointing to what the original pointer wanted you to see. And I think this is going to be an increasingly important problem. 
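[Editorial illustration, not part of the spoken record.] The distinction drawn here between links that point nowhere and links whose target has silently changed can be made concrete by storing a fingerprint of each page at the time it was linked and comparing it later. This is a minimal sketch under stated assumptions: the function names, the three status labels, and the `fetch` callable (which stands in for whatever retrieval mechanism an agent would really use) are hypothetical. An exact hash also flags trivial edits, so a real system would likely need a more tolerant comparison.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest of the page content as recorded at link time."""
    return hashlib.sha256(content).hexdigest()


def check_links(recorded, fetch):
    """Classify each recorded link as 'ok', 'modified', or 'dead'.

    recorded: dict mapping URL -> fingerprint stored when the link was made.
    fetch:    callable taking a URL and returning the current content as
              bytes, or None if the page can no longer be retrieved.
              (Assumed interface; supply any retrieval function you like.)
    """
    status = {}
    for url, old_digest in recorded.items():
        content = fetch(url)
        if content is None:
            status[url] = "dead"        # the link now points to nowhere
        elif fingerprint(content) == old_digest:
            status[url] = "ok"          # same bytes as when it was linked
        else:
            status[url] = "modified"    # still resolves, but the content changed
    return status


# Example with an in-memory stand-in for the web (made-up URLs).
then = {"http://a.example/": b"grant guidelines", "http://b.example/": b"old page",
        "http://c.example/": b"gone soon"}
recorded = {url: fingerprint(body) for url, body in then.items()}
now = {"http://a.example/": b"grant guidelines", "http://b.example/": b"new page"}
report = check_links(recorded, now.get)
```

Keeping the stored fingerprints, or the page copies themselves, is also one simple answer to the question of preserving what a link originally pointed to.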
How do we maintain the history of what was there, I mean, this is for the archivists, how do you actually keep around what was there, and maintain that sort of continuity. So again, I presented here mostly still just a vision, of what's real and what's not. This is what I think is necessary to produce these agent-based information systems. Some of it's there already. In terms of the infrastructure, I think, actually, a lot of the infrastructure is already there. However, there's this major sociological or political problem of, are you going to let somebody else's agent run on your machine? By and large so far, the answer is no. You want to be able to run your agent on their machine. But there's firewalls and there's all these other problems. I have no suggestions for what you're going to do there. This isn't going to fly unless something happens there. And in terms of technological problems, I think there's still a lot of problem with what I have loosely here called indexing. We have somewhat of a handle on text. We sort of know what to do in terms of retrieval of text, how you represent text for retrieval purposes. But there's an awful lot of other media types out there, which are much less developed, at least in terms of automatic indexing or for these sorts of agent accesses. We're going to need some way for these agents to communicate with one another. Now, there's an NSF-sponsored effort going on. The KQML and KIF efforts. KQML is Knowledge Query and Manipulation Language, and KIF is Knowledge Interchange Format. These are a start at being able to let these agents communicate. But you have this problem, you also have the problem that you need ontologies to describe what's in these data resources. How are they going to be, how is that going to happen? 
And you have these same sorts of problems in the federated database case here, and when you say age, when your database here says age, and my database in here says age, mine could be age of trees. And yours could be age of employees. You've got to be able to make sure that you really are talking about the same things. And I think this is going to be a very difficult problem. But also a problem in which I think the sort of information schools have a history of being able to do, in categorization, etc. And I think this will be a strength from which to work. There's also this problem of actually getting the user models. Where are they going to come from? My research has been trying to develop them automatically. The department that I'm in at Siemens was called the Learning Department, so I'm a little biased there. I'm trying to make sure that I fit within the rest of the community so that it was sort of based on automatically learning. In terms of, or at least what I've started, is sort of straight, relevance feedback, build up your profiles sort of things. It's not sufficient. It might be necessary, that sort of stuff. But that's not going to be sufficient. What else is going to be out there? And of course, to really be useful, to get tied in with this communicating with the other agents, it's going to have to be intimately connected with whatever types of indexing on these resources are out there. This afternoon I was going to talk about what I thought the curriculum to support that type of development of those source systems were. Are there questions? KM: I think we have time for some questions before we take our break. It looks like Carl has his hand up. CD: Yeah. I wasn't going to so much ask a question as just note one of the problems with agent technology. I've been watching spiders at my worksite. Which is to say, they tend to remove context. 
That is to say, they're hitting pages which I perceive as people getting to only by following a certain number of other pages that put them into context. And I've really discovered now that I have to link back to the context-providing pages, because otherwise these are going to retrieve things for people, and they're going to have no idea what they fit into. And so, it seems to me that the ability to rip books apart and distribute individual pages, which is what these systems do, is going to have to change our whole idea of authorship and construction. So it's got to work from both directions. LF: When you use the term agents or scouts for your project, where you have these agents who act throughout the system, have you thought about a concept of an intelligent shield that can perhaps protect access to the server? So the data is (inaudible)? EV: The short answer to that question, have I thought about it, is no. There's this..., on the robot or wanderer mailing list, there's a lot of talk, a lot of traffic devoted to, how do we stop these spiders from attacking web sites or, I mean, lots of people are interested in what they see as an unfair use of their web site by these robots. I frankly think that's a more, you have this tension in there between wanting to make your information available, and yet not making it too available, I guess. Is the... ??: Of course, to have it taken out of context.... EV: Yes. I think in the long run, the answer there is actually going to be the taking out of context, which is going to solve this problem. That people are going to stop using systems which just willy-nilly rip things out of the air, and give you some random page. Unfortunately, since right at the moment that's more useful than not being able to find anything at all, they're going to stay around. I don't have, off the top of my head, any real suggestions for protecting information that you still want to be accessed in other ways. 
LF: I'd like to come back to this point a little bit. I'm actually kind of relieved that Ellen is leaving the industry with the competitors (inaudible). Because we're doing some very similar things that will lead to product in the short term. I think the issue of context is profound, but it comes back to personal context. So, in the case where one's personalizing, one could argue that you are what you read. And that you provide the context. If you have not been there, then it's probably a lot less relevant to you than the places you tend to go. And rather than think of it as a process that's occurring out there somewhere on your behalf, it helps to think of it as a process that's occurring locally where the most it is possible to know about you is known, right, on your behalf, and I think it enriches the process as opposed to not enriching it. CD: Well, I don't see, you're talking about knowing about you the user, rather than the author. LF: Yes. CD: I mean, if you think about all the work in reader response theory, you can come down on both sides of it. I was thinking, let me give you a specific example. I noticed the other day that something had hit on one of my Pascal examples. And the start of that says, something like, now we extend this in the way that I talked about above, OK? Because this is example two, and you should have read example one. But it doesn't seem to me that it matters what the user's knowledge is, that the user is better off knowing the context I put it in, rather than having to invent their own. Even if it's only to know that they can ignore my comment. EF?: Getting back to the issue of industry needs and jobs, agents, if successful, will decrease the demand for the students that are being produced in these programs. We won't need them anymore. What's your counterargument? EV: I think it will just change the sorts of needs. EF: What will they be then? Do you have a feeling? 
EV: I think there's still going to be a very large demand for people who actually understand the idea of how one organizes knowledge. And agents are only going to benefit by having that be done well. And are not going to preclude humans from doing that. At least certainly not in the medium term. So, my particular emphasis, I think, is on being able to categorize and organize collections of knowledge. It might very well be the case that the ways, the goals that you're trying to meet in the organization may be different, if you know that there are these agents out there that are going to interact with these things, and it's not just going to humans. In fact, I would guess that it would be different. But it still comes down to somebody, I think it's going to have to be a somebody, that's going to have to organize the vast majority of this. KM: I don't want to break up questions, but for those of you who do want to take a break, and run laps around the floor or something, there are soft drinks and coffee and there are still a few blueberry muffins, and those of you who want to continue the conversation can do so, and we'll start back up in about 15 minutes, formally, with Bruce Croft. MIDMORNING BREAK KATE McCAIN Our third speaker is Bruce Croft from the University of Massachusetts. (And what are we doing with the computer?) BRUCE CROFT So the title that I have for my presentation is Corporate Intranets and Digital Libraries. And the title basically says one or two of the buzzphrases going around in the industry these days. Ed's already mentioned digital libraries. And corporate intranets is suddenly something you come across every time you open a brochure these days. And they sort of form the framework, and we're going to talk about what I see as some trends in employment issues and the information industry. So, the information industry has grown really enormously in the last three or four years.
We've always called ourselves the information industry in ASIS, things like that. But there really has been an explosion of activities centered around access to unstructured information in the last few years. The first cause of that was simply the decrease in the cost of online storage. That was before the web, that was early 90's when that really had an impact, and all of a sudden, all these databases. The information explosion we've been talking about forever, which had existed on paper, was actually available online. And that's when we started to see increasing demand from the corporate world for techniques to deal with this information, because they were creating it online, they had the large databases, and of course, they didn't know what to do with them. And then not long after that, the impact of ... along came the web, and that has had an obviously explosive impact on the industry and is a dominant, has to be a dominant factor in considering what do we do with our curriculum and things like that. Because that's where a lot of the employment is. And will be. Of course, the big question, in terms of structuring a curriculum, defining issues and trends, is what are the trends with regard to the web? What is going to be happening there? And of course, I think the web is solid proof that nobody has the faintest idea what things are going to look like in two or three years' time. Because most of you know about Infoseek as being one of the major web searching companies. When they came to us two years ago, what they were talking about was Infoseek Professional, which is sort of an online access service somewhat like a refined version of Dialog. A smaller version of Dialog, more upbeat version of Dialog. That was what they wanted to do. And they started offering the web searching as just sort of a novelty to get people interested, and before long it took over their entire business. And so, what's going to happen on the web?
Well certainly, there's a lot of companies around, a lot of students, as Ed mentioned, making money out of just doing webpages for people. That's going to become more and more humdrum, just going to be probably just your regular word processing and all that sort of thing. So, that doesn't seem to be a long term thing to be putting into curriculums and stuff like that. So, what I think we can say is, this gets back to the fundamentals issue, is that there are fundamentals of information access, the way humans access information, there are some fundamentals about on-line access of information that the web doesn't change. And in some sense, the web has just sort of caused more opportunities rather than a fundamental shift in the types of activities. I haven't really seen anything being mentioned that wasn't being talked about before. Just that there's the curve of interest, the amount of interest has gone up enormously. So, to make that clear, then, let's talk about corporate intranets, which is certainly where just about every company who's involved with the Internet, every company is involved with information access, database companies, information retrieval companies, etc., they all say, corporate intranets is the way to work. But where are we going to make our money? Because nobody's really certain how they're going to make their money out of just doing, sharing all the advertising revenue on the web applications currently. So what does that mean? Obviously, a phrase like that means a lot of different things to different people. But I think there are some common characteristics that we'll see. Heterogeneous information and information systems, meaning, as Ellen said, lots of different types of information. Structured data, unstructured data in various forms, a lot of text, other types of media, images, etc. And lots of different information systems also, in these environments. 
Not a single information retrieval system or a single database system, lots of database systems, lots of information retrieval systems. Lots of other types of systems. File systems or legacies, whatever you want to call them. But lots of different types of systems out there. So a messy world, which is somewhat like Ellen's slide as well. Obviously distributed locally and remotely. And so this brings up the issues that Ellen also mentioned of having to deal with resource location. How do you find the information or the sources of the information in the first place, before we start searching that source? And how do you merge the results when you've got a little bit from here, a little bit from there, little bit from there, and you have to present the user with the best five things, and you've found 500 from different sources, and you have to merge those 500 to come up with your estimate of which are the best ones to look at initially. And that may actually be merging the sources to recommend. So your top five may be the top five sources that you look at, rather than the actual data initially. Information retrieval and text is really of critical importance in these corporate intranets. Text tends to be the glue that holds a lot of things together. And the techniques that we've developed for text are more flexible and forgiving than the techniques that are being developed with the database systems, for example. The database systems are fine if you're operating with your Oracle database system using an SQL standard and things like that. But you get into the semantic problems about how you change the objects in the schemas, and it becomes very messy. Heterogeneous database systems have been an area of research in the database community for quite a while, and I don't think there's any real convincing answer, not the right answer, to that problem yet.
Information retrieval, on the other hand, because it's more flexible, doesn't expect certainty, etc., it certainly provides a framework for starting to glue together, starting to deal with some of the heterogeneous information problems. It certainly doesn't solve them all. But it's an important component. And of course, a lot of the information that's out there is textual anyway. So you have text as just an important data type. So information retrieval, in the general sense, is going to include all the other types of functions that you want to do with text. Just to stay with text for the time being. Some of them we mentioned, filtering, for example, or we call that routing, filtering. A lot of work. Emphasis on visualization. So if you find the information that's fine, but what do I do with the information? So database systems make a lot of their money, if you like, out of analyzing the data once you've found it. Information access is just one thing you do in a database system, and it's considered a pretty mundane thing, actually, to find the data. But then you want to do things like data mining and sort of analyze the data, look for trends and all that sort of other stuff. And we'd like to be able to do that stuff in general. So, from a text point of view, people are looking at how to visualize information that you retrieve, visualize and browse large databases without formulating explicit queries, data mining in the context of text, for example, as well as in some structured data. And more long distance, information extraction. Pulling facts out of text. So populating databases from text. And there's other things too, like summarization, summarizing lots of text into small amounts of text.
And the point of that is, that there are a lot of different functions associated with text, and there's other data types and when we're talking about a powerful integrated information system, provided on the corporate intranet, then we're going to be talking about integrating a lot of these different functionalities. And integration, in fact, is obviously the name of the game all the way through this. Because integrating heterogeneous systems, heterogeneous data, integrating different functionality with database systems, obviously that's one big one. How do you integrate text retrieval functionality with database systems? There's lots of pretty bad solutions to that problem out there at the moment. That's an area which is going to continue to develop and improve. Integration with workflow, obviously a key component in terms of what you actually do with the stuff once you've found it. And that's a lot of what people ask in the context of these corporate systems. Other types of integration issues, integrating with image, imaging in the sense of OCR imaging. We're capturing lots of data on scan, scanned documents and doing OCR. Speech which will be increasingly important because the speech-understanding technology is coming along in leaps and bounds really, in terms of workstation technology. There are still big problems in terms of telephone quality speech, but the large vocabulary continuous speech recognizers with workstation quality microphones are really good right now. And there will be a lot of good commercial products of that type pretty soon. Multimedia is one of those things where everybody likes it, and says that they want it. It's not clear exactly what the applications are, or how many there are. But still, there's going to be a lot more work in that area in terms of image databases, meaning photographic databases and things like that. And video. 
So there already are these systems out there, and now we're starting to combine these things, and this is another aspect of this merging problem, that you have techniques which say here's the best image data, and here's the best text data. How do you merge those results? So it's another research area that people are looking at. You also have to integrate with security systems and a variety of other things. So, these types of integrated systems running on a corporate intranet are going to be, well, that's what people need to solve most of the corporate problems. And there's a lot of interesting research issues there. And a lot of stuff to cover in terms of training people. And there's a lot of jobs potentially in this general area. The other area which is getting a lot of air time, both academically and industrially, in the sense that IBM has a line they call the Digital Libraries Products. I've got the mouse pad and other things like that from IBM which is the digital libraries mouse pad, the digital libraries logo, to prove there's a product out there. So, what does that mean? It's really not that different from the sort of stuff we were discussing as corporate intranets, just different context, some slightly different emphasis. It certainly doesn't mean, to these companies it does not mean putting libraries online. That's one small aspect of digital libraries, as far as the company is concerned. But it's really, how do you make a ....?? There's all shapes and sizes of information providers out there. There are lots and lots of people that have information that they want to provide to other people. What are the tools that you give those people to make their information available, and enable them to charge money, make money off it . Blockbuster Video is an information provider, and could be a digital library. It doesn't have to be the Library of Congress in order to be a digital library. 
So some of the main issues there, coming back to distributing information. How do you get access, distributed access to video and stuff like that, is particularly interesting. Multimedia is really big in this area, because obviously I just mentioned things like video on demand, as one of the big areas there that people see money. And that's what drives us, after all. Also, when you look at real libraries that are using, talking about digital library technology, most of the stuff that the real libraries are putting online are special collections, and a lot of that is going to be multimedia stuff in image collections. For example, speeches, things like that. Motion pictures even. And the reason for that is, that unfortunately, people get awfully bored with text. They find text boring and multimedia interesting. And the libraries doing this digital library stuff, when they often have less of a commercial aspect than sort of making these special collections available, the high profile stuff is the special collections. And so there's a lot of push to get the special collections, which are usually photographic or something interesting, or the handwritten archives. In fact, there are a number of manuscripts and things like that, which is text thought of as a special type of image, because it's not ASCII text. So there's a lot of work going on in that area. In the multimedia area, that comes under digital libraries. And of course, they are concerned with charging. Some people want to give it away for free. But most people are concerned with at least getting their money back to pay for these systems. And copyright is a big issue and a lot of technology is being developed around the issue of copyright as well. So, what sort of people are going to be needed in that environment? Well, we're going to need system builders, a lot of system builders.
The reason for that is there are an awful lot of new companies appearing in the general area around those areas that I mentioned, and other things like that. Certainly, and the older companies, the ancient companies, maybe 10 years old or 5 years old, but, they're expanding very rapidly, going public, getting lots of money to do new things, etc. And so there's a lot of demand for people who know something about text retrieval and associated technologies in the context of this corporate intranet. And as we'll talk about this afternoon, actually there's hardly anybody out there who knows about these things. So that leads to an interesting situation. So the demand is for people who can build these types of systems. It's very heavy. There's a lot of focus on web-based publications, I guess you can call them digital libraries or whatever, or corporate intranets. But a lot of the companies going public now don't say that they're doing web searching. They're saying we're doing publication on the web. And we are going to build the tools for publication. Information retrieval groups, either by that name or slightly different names like "information access" or something like that, there's a lot of them in industrial research labs. So it's a hot area in industry in general. There's the new companies, there's the companies which are expanding, and then there's the mega-companies which have big industrial research labs, who have also got groups. And there really aren't enough computer scientists, technology oriented people, trained to fill the demand in this area currently. And the other thing which I'm going to talk about this afternoon too is, there's absolutely no academic demand for Ph.D.s in this area, from the computer science point of view. So a lot of industrial demand, as I'm sure you're aware, from the sorts of degrees that you have in Drexel. But you can train people and they'll be snapped up very quickly.
So, sorts of people that I think are going to be really important to train, interface designers for both web and corporate access and stuff like that. As Ed said, interfaces are crucial aspects of all this. There aren't that many places producing people who really know how to do good interfaces. It's still very much a seat of the pants type thing. And the number of places that have good HCI curriculum and training are very small, really. System integrators are really, we talked about the corporate intranet stuff, most of that is integration, putting things together. And knowing how to build applications by pulling things together, stuff like that. That's a really critical skill. So using all these tools that Ed mentioned, and knowing how to put them together and getting training in how to put them together and knowing what the capabilities are is obviously very important. We need people... you know the database administrator role in the old database world? We were talking about these corporate intranet systems, etc., it's really that, plus. People who know something about how to manage these heterogeneous environments, how to manage the systems that are going to be built, in corporate environments. And people who know how to build applications using software and API-level tools. So these are all technology, titled as system builders. So I'm not saying that there's going to be a million people out there building information retrieval systems. Although the previous slide was that type of person. People who know how to develop new techniques and new systems. And these are people who know how to use the tools that are out there, how to put them together and how to build applications. All (NOT?) the sorts of people that the database systems areas have been producing for a while now. We're talking about a more general type of person, that knows how to use a wider range of tools. Information scientists and library people. Intermediaries and indexers.
Let's say that there are two groups, they are two of the types of groups of people who are being produced by library & information science departments and schools. Some interesting facts. Roughly 80% of the queries in systems like West and Dialog and LEXIS/NEXIS are Boolean. That's despite the fact that West in particular, and the other two as well, have full text search capabilities based for simple queries. OK? So, the reason for that is that intermediaries trained in Boolean searching are still very dominant in the industries that use those types of systems. They're often just the people making the choice of system, and they are still being used as intermediaries in a lot of these environments. And so that's what they're trained and familiar with. On the other hand, when you look at the web, like take Infoseek, 8 million queries a day or something, and similar statistics for AltaVista, and that is what Ellen was saying. Most of them, the vast majority, 90%, are two words long. Very small percentage make use of the ability..., all the systems have the ability to do more structured queries. But a very small percentage of people actually make the effort to do that. So what does that say? I don't know. It's just interesting, the contrast between these two things. Intermediaries, people who know how to formulate queries, and certainly you can train people to formulate good Boolean queries, I don't doubt that. But at issue, for interaction, they need to be more part of the teams who are designing applications and interfaces. You can't insist, OK, we need a Boolean template form, because that's what I'm used to. It's really saying OK, as we move this stuff out to other people, what is it about the way to formulate queries that should be captured in the interfaces? So the intermediaries, I think, can become part of the teams developing systems. So that's a talent they can bring to what's going on in the industry.
Multimedia databases, no matter, even if all the research programs in image indexing which are out there at the moment in various image analysis groups and media labs and places like that, even if they're all completely successful (which they're not going to be, or anything like that), they're still going to need manual indexing, because the sorts of things that people retrieve images on often, in many applications, are things that are not accessible on the image. Art is a perfect example of that where we don't care if there are naked babies floating around in the sky or something like that. I want to know who painted it, and what's the mythological theme, and what's the style of the painting, all the sorts of things that could never be captured by image recognition. So manual indexing is critical, especially in the multimedia image arena. However, and Ellen mentioned this, and she was right to say that the library people have concentrated, obviously that's what they've been doing for hundreds of years, on how you structure knowledge and index things. However, when you actually talk to people, like the Library of Congress, New York Public Library, etc., who do picture indexing, image indexing as part of their job, they have no idea how to index images, or what the right way of indexing images is, or what to say about photographs. OK, I'll give you a photograph. Index it. What do I say about it? Do I describe exactly the layout of everything there, or do I,... OK, so a controlled vocabulary. There are no good controlled vocabularies for images. These people have been doing it for 40 years or something like that, and the New York Public Library will tell you in a second, that the controlled vocabulary that they have that was sort of developed for images is hopeless, and they can't claim anything good about it at all. So yes, there is a lot of knowledge in the library world about indexing.
But I claim that we need to put a lot more effort into understanding new types of media and how to index them, and tools that go with that. Can we develop computer tools to help the task of manual indexing? There's very little stuff out there to help manual indexing. Certainly there's very little that has some intelligence to it, or something, a flair, or something, you need a systems person to do that. And developing controlled vocabularies is fine, but we need controlled vocabularies that come from corpora and things like that, more interesting ways of developing them. And heterogeneous data also requires indexing in the sense that, if you've got a database and you want to access it as part of a heterogeneous system, that has to be indexed in some more flexible way than a relational schema. So it tends to get back to the text and things like that. And also potentially, you've met a lot of discussion about what types of meta-level descriptions of information systems are required in order to access the right information system. That's an indexing problem, again. You can think of designing a schema as an indexing problem too. So, my summary is that these trends in the development of corporate intranet-based information systems and digital libraries provide enormous opportunities for computer scientists trained in information retrieval and related areas, application and interface designers who know how to use the types of tools that are needed for that environment, and information scientists trained in querying and indexing. But for that to be true, that is generally true, but the training and skills of these types of people really have to be relevant to the environment that we're talking about here today. And certainly, when you're talking about the sorts of indexing and querying training, you're getting a lot of library skills. A lot of that isn't particularly relevant to the types of problems I was just hinting at. That tends to need to be brought up to date.
I'm not saying anything about Drexel's curriculum. I haven't looked at it in that much detail. But that certainly is something that can be addressed. But also applies to the other stuff in the sense that most computer science programs did not produce a lot of the people, a lot of people with the sort of training I described. That can sort of readily build applications out of this range of tools and know how to put things together, understand information retrieval of unstructured data. Even database systems, there's actually surprisingly few courses on how to build applications using a database system, as opposed to, let's go talk about a relational theory and Nth-normal form and all that type of stuff, and do 7,000 exercises of SQL queries and stuff like that. So, computer science needs some revamping in terms of its curriculum, if it's going to address this area, and of course a lot of people don't think it's necessary for computer science to address that area. And we mentioned interface design and all that sort of thing just being often a neglected area in computer science curriculum. And it needs to be picked up by somebody. And so, my last point in the summary is that the current state of academia when talking about computer science and information science in general, relative to the demand of industry, is that it's a bottleneck or worse. Meaning that by bottleneck I mean, we're not producing enough people in academia. There are not enough academic departments producing the right sort of people. Worse, I mean, the people at some places are producing people who claim to have the right training who don't have the right training, and that's worse than saying I don't know anything about this, train me. So, I think the idea of working on the curriculum is obviously something that we've been worried about for a long time at SIG-IR. Haven't found the right answers yet, but it's certainly something that needs to be addressed very quickly. 
KM: Well, that was provocative, Bruce! Questions? IS: In that case, what's the current status of industry actually using databases in information retrieval? BC: Well, this is one of the key things that industry wants people to address. And obviously so, naturally database system vendors and IR system vendors, to some extent, have produced solutions to this, because it's such a major problem. The solutions range from things like Oracle's text capability, which addresses some issues, but it's not a truly integrated system. It's a bit of a hodgepodge in terms of the techniques that underlie it. And it's not really efficient enough for text retrieval applications and things, etc., etc. So, the more modern solution is something like the Illustra people have done with opening up the capability to insert information retrieval libraries that they call "blades," and they provide one as well. But essentially, systems moving to where you can..., information retrieval companies can produce a library for an Illustra system or an Informix system or eventually for an Oracle system, and therefore, you can choose between which database and which information retrieval system you want to combine, and then you'll have a way in the query language for the database system of combining those tools, so that they'll fit together fairly easily. It's still not really an integrated system, though. It's more integrated than older solutions which are essentially two, used to be two separate systems hidden under a common interface. But indexes and things like that, for information retrieval systems, are actually not even in the database system. There's little integration. The open systems libraries approach is more integrated and gives you the possibility of putting the information retrieval capability on top of database structures, and making it efficient enough, and some access to information retrieval capability and query language. 
But it's not truly integrated in the sense that, typically, there's no system that can really handle the models of uncertainty that information retrieval in unstructured data may have to deal with. You're producing things like scores, probabilities, rankings, whatever you want to call them, and you're producing them in a variety of places, and that has a big impact on query optimization strategies, or the query language itself, just as a small example of that.... A lot of extensible standards like open ODB. The database query languages, something like object-oriented SQL or something like that, are extensible with external functions or methods that you can define on text data type. But the extensibility stops at the point where it still has only ands, ors and nots as combinations of external functions. So that means that all your probabilities, all your ranking, then has to stop when it hits those Boolean functions. And you cannot extend, in most systems, those Boolean functions, without sort of building a whole sort of pseudo-interface on top of their query language. So there's still interesting research issues there to get truly integrated systems. But the library approach, and having libraries that you can put into open database systems, are starting to get to a more reasonable solution. IS: Does industry actually use that kind of a system for their business needs? BC: Ever since the "year dot", at least since the mid-80s, the thing I heard the most was, we need to combine text and database, data, you know...structured and unstructured data. And so, it's been a demand in industry for a long time. And what they've been doing in the past is they END OF TAPE ONE, SIDE TWO BC: dealt with some resistance. But on the other hand, you have people like, I think it's Time Life or something, with this enormous video library. Now after looking at this technology around, they said what we need to do is hire people to index this stuff.
That's the only way to get effective access right now. So if that's the solution that's required, that's what people will do. But I think that certainly computer scientists are not used to working with general interface design and people like indexers and people who deal with the human side of stuff. And some computer programs, some degree programs can address that, and make that better. But certainly, the average computer scientist undergraduate is not used to working in that sort of environment for that type of person. So I think you will meet resistance there. And certainly I think it's true that most places would resist saying .... I mean, that's why we're selling full text systems, is that people don't want to hire an army of indexers to start manually indexing everything that they've got, if they can do automatic approaches to it. GS: I was thinking back on that, and say, what we need is that person we hope to produce, that's one foot in the user camp and one foot in the technology camp, to be that interface. But let me ask you about your intranet. It seems to me that many organizations go through an internal network set of operations. And they also want to have access to the world, the Internet and the web. And whether or not those two can come together and avoid the duplication of systems. I guess really what I'm talking about is, many companies, because of security reasons, want to protect their internal information. And so I think... BC: Right. That's why corporate intranet-based systems, that's really the future of the web in the sense that, you know, like these web crawlers. One solution to this problem is that the tools that you buy will have the capability of indexing information on your site, on your local intranet. And you also utilize the capability of deciding which parts of that you want to make public. 
And there will be standard places that you send your index, your representation of what's interesting, which you want to make public, you send that out to a public site, which serves as the pointer. And that way it gives people control over what they want to have available, what they index, what they don't, security levels and things like that. Rather than having these web crawlers trying to poke into the data all the time, which is mostly a problem. KM: Do you have a quick question? OK, Larry seems to be up at the moment, and since we know how fragile this wonderful new technology is, maybe we should just let him go ahead. LARRY FITZPATRICK: I think today I bring some kind of a focus from the software product point of view. Where we're trying to package in industry some of these techniques and technologies up into ways that are suitable for the mass market. And I think that adds a slightly different flavor to some of the perceptions. Today, in this morning's talk, what I'd like to discuss is perceptions of the state of the market, and how it breaks out in terms of retrospective searching, which is the traditional... hitting the database, versus prospective searching, which is our term for filtering or agents or what have you. And then I'd kind of like to break away from that a little bit and sort of like drop right to the ground, and talk a little bit about experience with dealing as a product company with organizations who are trying to integrate some of this technology, and looking at two aspects of that. One is, how well do organizations understand it, and what parts of the organizations are we dealing with in terms of the roles of people that are involved? And the challenges that those roles have as they interact with information retrieval type technology. I'm not that concerned about the other aspects of those roles. Just how they sort of glue together, and are affected by information retrieval technology.
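(Editor's illustration of the arrangement Bruce Croft describes above, in which a site indexes its own intranet and sends out only the part of the index it chooses to make public. This is a minimal sketch under stated assumptions, not a description of any product mentioned in the discussion: the document records, the `public` flag, the example URLs, and the function names `build_index` and `export_public_index` are all hypothetical.)

```python
from collections import defaultdict

# Hypothetical intranet documents. The "public" flag is an assumed
# stand-in for whatever security policy a site actually applies.
DOCS = [
    {"url": "http://intranet.example/press/1", "text": "new product launch", "public": True},
    {"url": "http://intranet.example/hr/7", "text": "salary review schedule", "public": False},
    {"url": "http://intranet.example/docs/3", "text": "product manual", "public": True},
]


def build_index(docs):
    """Build an inverted index: term -> sorted list of URLs containing it."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            index[term].add(doc["url"])
    return {term: sorted(urls) for term, urls in index.items()}


def export_public_index(docs):
    """Index only the documents the site has chosen to make public."""
    return build_index([d for d in docs if d["public"]])


local_index = build_index(DOCS)           # used inside the organization
public_index = export_public_index(DOCS)  # what would be sent to the public pointer site
```

In this sketch the site, not an outside crawler, decides what leaves the building: the term "salary" is searchable in `local_index` but never appears in `public_index`.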
(So we wait for a little bit of paging activity.) I think it's really important to ask what problem is being solved by information retrieval, and we saw some of that this morning, which I actually thought was sort of spot on from an abstract point of view. But I'd like to say that the maturity of the software market measures either the importance of the technology in that marketplace, or the level of understanding of that discipline. And right now, everybody has trouble finding things within corporations, and within their personal existence. But the markets where businesses are actually willing to pay for sort of like off-the-shelf technology are really small. Ten years ago (maybe not ten years ago, maybe five years ago) there were numbers like 90% of all the information in corporate environments was free text. And RDBMS is a $5 billion or $10 billion a year industry. Therefore, we can expect to see this huge, this is going to be a $100 billion industry, we're in the right place at the right time kind of thing. But the fact of the matter is that it hasn't panned out that way. Two years ago the Puritech search database marketplace was less than $100 million. So, where is the disconnect? I think what it really comes down to ultimately is when people actually need some kind of information retrieval technology, it's in conjunction with some other business process. And that there are lots of these, and they're all very different. They treat the information differently, they process it differently. It's in different formats, it goes through different individuals within the organization, it's on different computing platforms. Some of it's not on computing platforms. And so, I think it's probably not too harsh to say that information retrieval does not solve a business problem, it's a technology in search of a business problem, or a suite of business problems. Alright, so what business problems? Well, fundamentally, the technology is very horizontal.
I think that my experience in the industry mirrors to some degree some of the stuff we heard this morning about academics, is that you've got sort of a theory of information retrieval or information management or multiple theories, and a bunch of technologies or tools that one can use to apply, but those things, in and of themselves, really have no value. They have to be coupled to some other thing that's going on within the organization. And it turns out that the solutions have to be really very, very vertical. When you happen to stumble across them. And again, there are two general problems that we tend to see. The first is retrospective searching, which is sort of like interacting with reservoirs of information to discover new things. And the second is prospective searching, which is the accumulation or the discovery of information in flows, as opposed to reservoirs. We call sometimes the retrospective searching problem the "tell me now" problem, because looking at it from a user's point of view, what does a user do when he's searching? He's got some information need, some need to discover some piece of information, and he wants to know the answer now, and the only way he can get that is to actually engage at that moment, OK, with extracting information from reservoirs. I think that if we look at sort of like where the industry comes from, and where the history of this technology has found itself, and certainly in the on-line systems. And just about every on-line out there built their own system 20 or 30 years ago or transferred it from some government-funded activity, Dialog, Dow Jones, and all these major systems. And there's been sort of a rebirth in that industry as these people try to downsize off of mainframes, want to be more adaptive and flexible in terms of dealing with information. So there's been a market created for information retrieval technology to deal with the new on-lines, or the rebuilding of the old on-lines. 
And Infoseek and Altavista and places like that could actually be considered to be examples of those, digital libraries, if you will. Historically, CD pubs has actually been somewhat of an interesting marketplace. People doing rich technical documentation or reference material. Completely different kind of application, exactly the same technology, fundamentally. But the marketing effort, and the sales effort, and the support effort, and the packaging effort, and the kind of developers you need in order to deliver a solution to that marketplace are different. Embeddable, or OEM type applications, there's been actually a fair amount of demand in some quarters for taking full-text retrieval technology and gluing it to some other stand-alone process like help desk applications, or customer support or human resources, resume management. And that in and of itself essentially means providing a really raw tool that gets bundled into maybe a relational database system or a workflow system. In almost a one-off capacity. So dealing with that kind of a marketplace is actually rather interesting. There has been some activity in the MIS departments in terms of full-text retrieval. Amazingly small amounts when you actually consider how much data there is out there. But it has found niches in some places. Document management, for instance, where the retrieval by content happens to be important. Litigation support is another area, where there are lots of documents, where you have to get access to content, and lawyers talk about being able to find a smoking gun, and information retrieval technology is actually able to do that with its fuzzy ability to find things and match things and find new terms and so forth. So there's been some penetration in that area. And then of course there's the Internet. The growth in sort of like on-line material and the Internet has really changed the shape of information management.
And really there's been the public indexes, like the Infoseeks, the Altavistas, the Open Text, whatever, Lycos. And there's been lots of public data such as the Thomas System that Bruce's system back-ends. The SEC Edgar stuff. And now, we're coming to discover, as Bruce mentioned, the Intranet, which is basically, now that you've got this substrate computing, this network computing platform based on HTTP and HTML, information can be exchanged, shared, collaborated, distributed in interesting ways and it's creating opportunity. But we're just beginning to actually figure out how and why, and exactly what people are buying within corporations. But fundamentally, none of them are actually enormous markets. They're not billion dollar markets at this point. And I'm not enough of a marketeer, nor do I want to be, to predict whether they, in fact, will be. So from a technologist's point of view, retrospective searching has actually been really fragmented, much like the academic stuff where you're pulling from all of these different aspects. The second problem we talked about, I mentioned, was filtering a profile. And we called this the "keep an eye out" problem. There are lots of other manifestations of it. But basically the motivation from a user's point of view is, I have a long-standing information need, and it's fairly predictable. My past behavior is sort of predictive of exactly what that is. And information is flowing around me through flows somewhere, and I would like to tap onto that and be notified or at least have the ability to have stopped somewhere information that may be interesting that's flowing through the system. Because I can't go on guard everyday, I can't set aside 8:30 to 9:30 A.M. to do my searching of all of the data resources that I know. 
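(Editor's illustration of the "keep an eye out" problem described above: a standing profile is matched against documents as they flow past, instead of a query being run once against a stored collection. A minimal sketch, assuming a profile is just a set of weighted terms and that a simple overlap score with a fixed threshold is adequate; the names `Profile`, `score`, and `filter_stream`, the weights, and the threshold are hypothetical and are not drawn from any system mentioned in the talk.)

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """A user's long-standing interest, as weighted terms (assumed form)."""
    owner: str
    weights: dict          # term -> weight
    threshold: float = 1.0


def score(profile, text):
    """Sum the weights of the profile terms that occur in the document."""
    terms = set(text.lower().split())
    return sum(w for t, w in profile.weights.items() if t in terms)


def filter_stream(profiles, stream):
    """Yield (owner, document) for each document that clears a profile's threshold."""
    for doc in stream:
        for p in profiles:
            if score(p, doc) >= p.threshold:
                yield p.owner, doc


profiles = [Profile("analyst", {"merger": 1.0, "acquisition": 1.0, "earnings": 0.5})]
stream = [
    "quarterly earnings call scheduled",
    "merger talks confirmed after earnings report",
    "cafeteria menu updated",
]
alerts = list(filter_stream(profiles, stream))
```

Here only the second item clears the threshold, so the user is notified once without having to search at all. Real filtering systems would need far better matching than term overlap; the point of the sketch is only the direction of flow, documents arriving at a stored profile.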
I'm not that disciplined, and I think that if you go into corporations today, especially with the Internet, there's this expression running around that says, One man year is seven dog years is 49 Internet years, or something like that, right? So, people are running around like nuts. They don't have time to plan, at 8:30 in the morning. If they come in at 5:30 in the morning, when nobody's there, maybe they'll be able to get something done. Sort of what's unique about filtering, or "keeping an eye out," is that it really is personalized. And it has to really originate with the habits and trends of the individual originating the profile request. And it's predicated on the fact that your past behavior is a predictor of your future interests, which I think, in many cases, within organizations, is appropriate. Where has this found play in the past? For one thing, the intelligence community. Tremendous amount of effort. People with "anal" in their name. Analysts, right? And the reason is the technology is really hard to use in a lot of ways. These guys created, basically, big monster Boolean queries in environments where people have maintained 1,200 word queries with Boolean connectives and 7-deep nesting and so forth. And this is like a major investment of intellectual effort. And people in corporations will not do this, OK? Nor could they do this. The financial community, where the cost of information is, or the value of information is really high, they'll go to great lengths to do this, OK? What's happening with the companies that I'm investing $100 million in every other day? Now, there have been lots of business specific wires that have popped up.
And as I started asking around about applications of filtering technology, I'm amazed to learn that in really weird niches like people who do government contracting for civilian organizations in the Washington area, there are like five specialized newswires packaged and distributed as a bundle to people who want to sort of tap that pipe. Your Commerce Business Daily, etc. And I suspect that there are lots of little niches like this otherwhere as well. In the Internet, there have been a couple of examples of technologies that people are pushing out there to try and help. Individual, Inc, pretty credible example, kind of brute force technology. Sift, the Stanford stuff, sort of a public service. Net News Filtering, mediocre. And there's bunches of other things. If you look, you'll find them. And then there's the Intranet. And it's my opinion that, the "keep an eye out" problem, is bigger than the retrospective searching problem, the "tell me now" problem. Most of us acquire information by having it bombard us, right. We set up these rather crude mechanisms for letting flows impact upon us. We do listservs. We subscribe to news magazines. We have people around us, in our communities, forwarding things to us. And there's a huge opportunity for automating this process. Whether the technology's there or not yet, is really kind of open to question. But I think there's a mass of opportunities. Most people let information wash over them. And it's a virtual world, right? I mean, you can't possibly see all that you could see. So, it has to be filtered in some way. Why let it just wash over you? Why not take control of it, and let what you see be reflective of what your interests are. Now I'm going to drop down, right down to the ground, rubber hits the road. In organizations, last few years, a number of years, I've been involved with companies that are trying to sell text retrieval, information retrieval products into corporations. 
And I think almost universally, I don't want to be too harsh, but information retrieval, or IR, from a statistical information retrieval point of view, is a paradigm, it's a way of thinking about information acquisition, and nobody gets it, right, except us? I mean, we're like the only people who get it. It's really hard to explain to other people when you go into corporations what this is all about. When you walk into a corporation you start talking about this- the effectiveness expectations are way too high. People think, I'm going to get 100% recall. I need 100% recall, OK? No, sorry. And you don't dare, you're in a sales situation, you don't dare say no, unless the other guy has said, the competitor has said, he can do 100%, and you say oh no, he can't do 100%. So, the effort expectations are way too low. And the effort in terms of the effort that the user in terms of querying with this thing, has to observe. Bruce mentioned, and I think I'm going to mention it later on, OpenText has a web index. The average query length is 1.4 words. And it's treated as an event, not a process. OK, 1.4 words, I should be able to do it. Short story. Somebody was in a show with OpenText showing off their OpenText index. A woman walks up to the booth and sort of like milling around, and the salesperson says, would you like to try and search the web index for something? And she kind of gets up there casually and says, "OK, Colorado." And back comes like 40,000 items, and the salesperson who's there helping says, "what about Colorado were you thinking?" Because she looked kind of disappointed. She said, "well, I actually attend college at Mary Hartman Community College outside Boulder." And she said, "why don't we try that, and type that in," and there it was, first thing, right? People just don't have somewhat of a clue about how to approach this technology. The resource expectations are way off in corporations. 
People walk in, you walk in with a text retrieval system, and they think database. Why can't you do 300 queries per second? You know, on this Sparc 10? You say, well, because the underlying technology has to do all this work. If you want to do like one word queries, we can maybe get 10, alright, on a huge database. If you want to do 50 word queries which some of the research has shown, with iterations and feedbacks start to become important and effective, no, we're not going to do anywhere near that. So, I think there's a lot of education that goes on. Sometimes also, I think the benefits of having it are not that quantifiable. And I guess I don't quite know what I mean by this, because you can take it two ways. One is, they think they're going to solve a real problem by bringing this technology in, but it's not clear that they actually are, and other times they don't bring the technology in when, in fact, it would solve a problem for some other unknown reason, like they don't buy the paradigm, or something. And again, like education, everybody has an opinion about it because they've experienced it. They think they know how it should work, right? So you get into discussions sometimes where, well, when I say fly, you should be able to figure out that I mean the acronym, you know, Friends of the Library Yesterday. You sit there and say, OK. So, in terms of, the next slide, what I essentially did was, OK, I'm not an academic, so what can I bring to you folks? Well, almost a dozen years, ten, nine years of working with people. I basically made a list of all the people I've worked with closely in organizations as consumers of the technology, and I tried to characterize them in terms of their roles. And said, now, how does that role impact their collision with information retrieval, and what's the challenge for that role in dealing with this technology. 
The first, going sort of like highest level to lowest level, is the chief information officer within an organization, and I've seen really good ones, and I've seen mediocre ones. I haven't seen any bad ones. But the challenge, I think, for these people, really is in deciding how to support, let's see, deciding which reservoirs and flows within the organization are the appropriate ones upon which to devote resources, right? I've seen CIOs come in and say, we've got to get everybody with a policy and procedures manual on CD-ROM out in the field with a Windows machine, OK? And it turns out this was like Windows II, and it had to be running Excel at the same time. And if Excel loaded first, it used up all memory. So, there's some appropriateness questions that come in there. Now, the second is the evaluator expert. Sometimes there's an individual within an organization that, in fact, has some knowledge about this technology. And whether he does or he doesn't, he's been asked to sort of evaluate it. And I think, I feel really sorry for some of these people, because they have absolutely nothing to go on in many cases. They've got these vendors coming in, making claims like you wouldn't believe, and they have nothing to peg it to. And I think something like TREC really helps in a lot of ways because it's sort of like a benchmarking environment, an evaluation scheme, where people can know what works and what doesn't work. And they can look at the results and say, yeah, these top ten search engines all have about the same performance, right? Not 90% recall, or 100% recall, or 100% precision, which is sometimes the pitch these guys get. And I think benchmarking would really be of value to these people. And then there's the infrastructure architects. These are the people who are responsible for actually putting together the system within the organization. And these guys have to have a lot of technology breadth.
Their challenge is to know how information retrieval has an effect on the network performance, on the disk drive consumption, on the back-up requirements, on the administrative load, like how many administrators is it going to take to get this thing up and running and maintained and kept and held together on a regular basis. And again, without having experience with this technology, they really don't have a clue. And often in the sales cycle, the vendor's not actually going to lie to them, but he's not going to tell them what it's really going to cost them. Integrators. A lot of times you get integrator-type programmers who have to glue together some kind of a workflow system and a text retrieval system. And these guys don't understand the paradigm, and the thing I see most often, which I just, I cringe every time I see it, is they glue a search and retrieval system together with some other system, and search consists of a 40-character long box with a submit button. I mean, that's it. You know? You can't do anything with that. I'm sorry. The data administrators. These guys probably of all the people, you know, a lab based belief, these guys kind of have a tough life. System gets up, basically, how do they keep it maintained and functioning. But I think one of the things I've seen, the challenge there, is building architectures and administrative infrastructures that are nimble, because we're at an incredible knee of the growth curve right here in this technology. And it changes very rapidly. And so, I've seen organizations, in fact, one of the earliest on-line customers that one of the companies I was working with had, in 1989, built this massive infrastructure to maintain that system. And it was very specific to the particular system that we had at the time, and we had three generations of technology beyond that, and they still haven't upgraded to any of those generations, because the cost of migration of the administrative load is too high.
So, I think maybe there's some paradigm building that has to go on at that level. Innovators, I won't spend much time on them. I think there's a role, and I hope when you're looking at this you're thinking about, geez, the students coming through here may end up in one of these roles. What can we teach them to sort of like affect their ability to deal with this challenge on that particular dimension. The innovators, you know, this product company. You hire some guy, he's bright, he sees market problems and he crafts solutions. One of the biggest problems I see with these people is in the product companies, and I don't know how you affect this, but they don't have the discipline to wait, right? They're on Internet time, you know? If you can't do it in three months, you know? So I don't know what you do about that. But that's their challenge, that's the challenge for them, and it's a challenge within their organization. Product developers. I'm amazed. I've hired a lot of developers, and I'm amazed at how many people come out of school and can't write production code. Don't know how to design, don't know how to write production code, don't know what it means to think about error conditions, don't know what it means to write defensively or code defensively, don't know what it means to think about the fact that some guy coming along three months after them might have to read it, and that guy might be himself, OK? And I think that I was really, really thrilled when I read the Drexel Home Page and saw the heavy emphasis on co-op programs, because I've worked with Virginia Tech co-ops before, and it's been just a marvelous experience. I mean, these guys, they come into an organization and their first cycle through, and they're wet behind the ears and they don't know this stuff. But by the time they graduate, they have been grounded in this stuff. Now, one might argue that you're foisting onto commerce to do that kind of education. But it works.
UI designers. Bruce mentioned this one as well. Where are they? Nobody, you know, people say "I'm a UI designer." No, you're a UI programmer. There's a different thing. Just because somebody writes UI code doesn't make him a UI designer. And I think that the UI design art is, maybe it is a science, I haven't found it to be a science. It still seems to be largely an art. But, there's not a lot of these people around. And if there's methodology here that we can train these people, I think it would have tremendous value. And we heard from Ed the HCI emphasis as well. Researchers. I've dealt with universities in the past in terms of working with researchers there, and I think until about three years ago, the IR community was very, and I'm going to kind of do this in this afternoon's talk, was very tight. And it didn't sort of like have its hands on the problems that we were dealing with. In 1989, I put together a system that actually did some design work for a system that addressed the issue of concurrency control in a full-text data base. And went through the literature over a period of about six months, and found nothing. I mean nothing, right? And eventually found threads of useful information in the ODB literature. And tied it back to the RDB literature and said, ah ha, knowing what I know about information retrieval, and this piece of information here, and this piece of information here, we can do this. But there was nothing there. So I think that at some level there needs to be a focus on real world problems. But at the same time, we can't neglect sort of like the model underpinnings. For example, what does it mean to have a text data model? What are the right query primitives? And I know the folks up at Waterloo whom I'm working with now have done a lot of work in structured data in building data models for interacting with it. But there are no real primitives at a high enough level that users can consume them for dealing with the information search process. 
And I think that some focus on information, some focus on sort of like underlying models, has to come. And that in many ways is at odds with the real world stuff. So how you trade those off, I'm not quite sure. Too fast? Too much? KM: A couple of questions for Larry? Perhaps at this point we can open up to a general discussion. END OF TAPE TWO (Question period) MH: I recognize the familiar problem, and I go back to a former reincarnation of myself as a pharmaceutical company executive. I realize now I wouldn't have hired an IR trained person. But with the kind of stuff you're talking about, I'd have taken a Ph.D. chemist and sent him somewhere for three or four months to learn the thing. The educational opportunity I now see is to take people who have domain expertise, which you get through 15 years working in the field, and teach them this stuff in three months. I think that is the opportunity. LF: I think in some ways I would agree with that. I think that this paradigm hasn't been communicated outside of a small circle. And it needs to be communicated. I don't know why that is. I don't think it's for lack of trying in some ways. But it hasn't been. GS: I'd like you to talk a little bit more about the technology being horizontal, and the solution being vertical. I think I understand what you're saying is we tend to operate at one plane, when we're used to the technology, and not looking at it as a way of moving up and down in the organization. Is that what you're getting at? LF: I was thinking about it from a product company's point of view. That I'm trying to trade off the fact that the technology is fundamentally very versatile, that the way that you construct queries, the way that you build your indices to support those queries, and all the access methods and whatever else you have underneath, can solve lots of problems. But that you don't just sell this thing as an entity to an organization or to some consumer.
You end up focusing on selling it to maybe help desk applications, right, where you work with people who understand what "help desk" means, and you integrate this technology into an infrastructure that is a total package, in the help desk arena. And that becomes a vertical application within the product world. Not the underlying technology. GS: Is the goal then to get the user to become their own help desk? Because, as many speakers have said, we can't do it all. We don't have enough graduates to come out of the program. So at some point, I think we've got to get them beyond... LF: Well, I think what happens is that if you look at it from the point of view of an architect or a developer, who is going to be building this help desk application, if he has some grounding in this technology, then he can work with vendors or purveyors of this technology to integrate it appropriately within his domain. But if he has no understanding of what this means, he doesn't know that he needs it, he doesn't know that he would benefit from it, and so he doesn't integrate it. I'm amazed at the number of sort of like production database systems I've seen that would benefit from full text retrieval. Sometimes in like a resume area, and they don't have it, right? Because they just don't think it might be important, because they don't understand what the benefits of the technology could be to their application. And I think that maybe understanding that at that level, for people who, and maybe this is appropriate for an undergraduate curriculum in computer science, where people are exposed to what's possible with this technology. So if they go out there in the world, they know it exists. KM: What I'm curious to hear from Ellen, as the other person who's coming to us really directly from industry, is how those niches and roles that Larry identified fit into what you see at Siemens. What is going on where you are? EV: Larry's in a very different position than I am at Siemens.
My experience at Siemens is actually in trying to interest Siemens' operating companies with my research. And that's really my job at Siemens. KM: You're the researcher? EV: Right. And I have not had experience other than Siemens. So I don't know if this is Siemens-specific or industry-wide. Information retrieval even in Siemens, even now, is still a hard sell. I find this incredible, but it's true. It's, I think, on the one hand, people want it. And I think it gets back to this idea that they really underestimate the problems involved with it. That they want it, they assume it must be there because we can do it. And so, there's not a real interest in the problems, figuring that it will be there. GS: It goes back to asking the question, how valuable it is. For example, the help desk. We can put in the database all the frequently asked questions, and maybe do 90% of the responses. But the critical issues are those that are not in the database, and therefore the human has to filter that out and give them their response. And I think that is the hard sell, is to say yes, we need a person that runs this kind of knowledge. But we don't want to pay for it. We'd rather get an on-line system where we can just upload the software around here. LF: I worked with Apple a few years ago, and they basically automated their support desk. So that the people answering the telephone calls coming in would have access to a query interface to the underlying sort of like, I forget what they called them, problem reports or whatever it was, or notes to the field, or whatever. So that basically, when a call came in, they would do the full text query. I've also seen a lot of organizations where they don't do any full text retrieval at all. They use some kind of categorization or classification scheme where they say oh, it's in the dryer division, oh, it's the door part. Oh, it's the ..., alright?
And very often, they don't, you miss out on the fact that creating the classification system is very much just one snapshot of one way of dealing with it, and sometimes there's information that falls into multiple categories, and it becomes a mess then. And they would benefit from having this, but they don't do it. I don't know why. EF: You have a list there of roles, and Bruce also indicated a number of places that there were a lot of openings in different kinds of areas. I wonder if any of the speakers or any of the other people in this audience here have any real statistics. Do we have real numbers on the demand for these things, or if we had people trained in this, how many would be soaked up by the industries? GS: I could respond to that, because I was trying to figure out, as Richard knows, how do we sell our graduates? What do we market them as? And the government has wonderful classifications for analysts and systems programmers, but they don't have a knowledge worker, knowledge broker, information agent. So they're not even capturing the data that we know would be out there. There are people doing these jobs, but we call them librarians or we call them programmers, or a systems analyst. But we don't have the right title. TC: They haven't been decomposed enough by then. LF: The fact that's actually surprising to me is that I would have expected, I forgot to put on this list, but the reason is when I made this basically, each category here is like the body parts from a handful of people that I've known. And what strikes me is that I've never met an information scientist within an organization that I've dealt with on an information retrieval problem. I mean, the information scientists I know are in marketing and sales. Seriously. And I think it's wrong. And I don't know what to change about it, because there's tremendous value in having that knowledge. NH: That's probably an age-old problem.
I think from statistics, and the time that I got my doctorate, I was part of the first wave of people where to be a statistician, you had to have a degree in statistics. So if you look at the generation older than me, everybody who did statistics came from somewhere else. And in fact, I was a part of the transition which delegitimized them, I think somewhat unfairly, because they didn't have the degree in the right field. And I think that's part of what you're seeing now. And the field is similar to statistics in the other sense that you come with a set of techniques, but in order to apply them you really have to acquire for yourself sufficient domain knowledge that what you're doing works in that field and doesn't cause more trouble than it solves because of your misunderstanding of what their real goal is. And I think in the old days, what people did was, what you were talking about, take the chemistry person and give them three months of training, and that's how statistics used to be done. That doesn't really work either. What you really need is ..., people have to learn that to survive in these fields that combine more than one discipline, you have to combine more than one discipline. You have to learn to be willing to be cross-disciplinary. And I think that's the other place where ..., you know, I was asking that question before with Bruce, about the bottleneck issue; that is one of the bottlenecks that the university has, being unable to really effectively handle interdisciplinary work in a way that rewards the practitioners of it. And I've seen that as a problem from the statistics aspect, and I see it again now that I've kind of gone into computer science somewhat too, in some of these fields where we are working across college boundaries. So no matter where you put us, I mean, I work across college boundaries in every college on my campus. 
So you can't say oh, well good, we'll move you to another college, because you can't move me into every college, and so there's the problem. KM: Questions for the other speakers? NH: Can I ask my original question? KM: Of course. NH: One of the things you said we have to do is to train students with the skills that are relevant today. And one of the problems that my department has on the computer science side is, what does that really mean, because you don't even talk about half lives in this business. The entire life of a version of software, let's say, is less than a year. And frequently the new version... I've gone through educational stuff with Excel where I actually use it in the classroom and I have handouts based on it. And within a year my handouts are completely obsolete, the interface looks different, the capabilities are different. Very little is reusable as is. When my students go out to companies, the companies say, well, we want you to know how to do FoxPro. Well, nobody even remembers what that is a year later. So defining what it means to educate students with the skills they need today that aren't irrelevant tomorrow, and with the ability to stay on top of this curve that is constantly peaking, I think is the major challenge. And I just wondered how you see us doing that in a way that is effective and prepares them to be relevant ten years from now as well as today. BC: Well, I completely agree with you that instances of technology change very quickly. But, I mean, Larry said this too. If people never hear about information retrieval, what its goals are, how it works, what the pros and cons are, and what the paradigms are, which most people in computer science never hear, then they haven't got the faintest idea about how to approach an information retrieval technology. It doesn't matter what generation. 
If you don't understand what the thing's supposed to be doing, you don't know how to evaluate it or how to use it in the system. And it's the same with building interfaces, designing interfaces, as you said. And with workflow, or collaborative work, or something like that. Interfaces and CSCW are almost never touched in most computer science departments. And yet, those things are all extremely important parts of tackling real business problems. And I think that you need to have courses that talk about the issues and the general approaches and solutions and things like that. And then train people with instances that exist at the current time. And then hopefully they're smart enough to learn the new generation of those tools, because they've got some understanding of the underlying principles behind those things. That's the only solution that I have to that. LE: I think one of the things that I've been hearing at the same time is that it's more than just the way that we can deal with technological approaches. In other words, in order to get breadth, so that we're flexible, the technological solutions drop down to the lowest common denominator. In other words, teach people C++ or C programming so that they will be able to go for a longer period of time, because the single applications programs don't work. But what in fact, I think, to a certain extent we're missing ..., the point of that is, what is the need, and how do we teach people, the analysts, what the real need is, and then turn around and look at current technology and apply the context to the real need. And that subsumes these issues with HCI and work and also needs to address what it is to understand the time value of information in organizations. 
And a lot of this knowledge has existed for a long time in the literature, and certainly since the 70s in the business literature, because that's been an active area of interest in MIS, DSS, economics to a certain extent, all the econometric-type literatures that decision science deals with, some of those kinds of things. We need to tack onto that, which I think is part of it too. But it's got to be concept-based as well. I'm an engineer, so I'm not necessarily a high content person. So the analyst's notion is not ..., this is not a low-level skill. This is one of the most difficult things to train. EF: There are a few concepts that I think answer some of your question, though. One of them is, we in part have to move more towards lab-based courses than we have in the past, where we have conceptual areas of fundamental understanding coupled with current practical limitations, so that they can understand those things. And those will change each year. For the examples, what I try to do in my courses is have two or three different examples of tools that represent this kind of concept and that are rather radically different, as different as I can find. The web actually is helping in this regard, because I can fairly quickly connect to the latest one. And the students are very good at going in and finding new ones which I don't know about. And in some cases they have even prepared instructional modules to help others learn this. So we're leveraging the student behavior and their excitement to work with this. HW: We've been mentioning that most computer science departments don't prepare people to enter this world very well. I wanted to just add as a kind of a footnote that the Information Systems 95 curriculum, the IS-95 curriculum designed for information systems people, doesn't even touch on information retrieval. It completely doesn't. 
And there really does seem to be a need to pay some attention to what you might call linguistic engineering, the problems of working with large language structures, which gets us into things like classification and indexing, and that's where everybody in computer science and information systems seems to draw the line. They don't want to get into the big language structures. So that's where I think we need to build the bridges. TC: But where we are moving in that direction is in the digital libraries arena, where we've got computer scientists talking to people who know indexing, and classification and the like. And I see some hope there. HW: Oh sure. There are faint hopes. But it's in the world, in commerce and industry, that things aren't coming together. KM: Before we come to blows, we might want to think about breaking for lunch. And in your folders is a list of restaurants. The Reading Terminal is right across the street, and it has stalls and restaurants. There are two in the hotel proper. TC: Well, I was just going to say that we decided to site this meeting here so you could go to the Reading Terminal. If you have any respect for that, I think that it is the oldest farmers' market in continuous operation in the country. It's about 105 years old or so, and has a lot of prepared food stalls, from Mexican to sushi to health food, and whatever. It's an exciting place. And if you don't know how to get there, just follow one of the Drexel people and we'll end you up there. KM: There's the Reading Terminal Market. There's a list of other restaurants in the area. And there are two restaurants in the hotel. The Champion Sports Bar does have a good roast pork sandwich. TC: And save your receipt please. We need that receipt to reimburse you. KM: To those of you who are filing travel reimbursement requests, or just doing lunch, just save your receipt. According to the agenda, we will reconvene here at 1:30, which should give you time for a nice leisurely lunch. 
The room is secured, there will be security. There will be somebody posted outside. Or Karen and I will hang around until that person gets here. Most of this equipment is the hotel's equipment, and they don't want to lose it either. But if you're really compulsive and you want to take your computer with you, I won't tell you not to. END OF SESSION I IR 2000 workshop/symposium