INFORMATION RETRIEVAL 2000 WORKPLACE NEEDS AND CURRICULAR IMPLICATIONS
SESSION II
RICHARD LYTLE: I sort of skimmed over this this morning, and from the conversations and what people said in some places, it seemed to me it would be worth just describing our program to you. Because words mean different things in different places. And we have basically two programs: information systems, and library and information science. The information systems program is a Bachelor of Science in Information Systems, which is a five year program at Drexel, with 18 months of cooperative education in the five years. A Master of Science in Information Systems is a professional graduate degree. It is primarily populated by IS professionals who, for one reason or another, want the program.
BC: How long is that degree?
RL: It takes about three years to finish it, if you take one course. Or three and a half.
TC: It can be done in a calendar year.
RL: It can be done faster, but it's very unusual, because it's almost overwhelming in a part time program. Very few people could take two or three courses a quarter and still keep their job. If you were to ask me, just talking about both information systems programs together, if I had to say in a phrase what it is, I would say it's applied computer science in a systems engineering framework. If I were giving it two more phrases, I would say, user-centered approach, user-centered design. Heavy emphasis on the user requirements part of the process. It is a software oriented degree. Both of them are software oriented degrees. It's analysis, design, creation, evaluation of software intensive information systems. Lots of software development methodology. Another descriptor is database. It's heavily database oriented in terms of what the implementation is.
BC: Does it have any MIS component at all?
RL: It's relatively low on management. That's one of the things. Is that what you mean?
BC: Yeah.
RL: Yeah, relatively little management.
There's one management course at the graduate level at the end. It's a capstone course. It cuts across the more technical curriculum.
LE: Software project management?
RL: OK. Software project management. Two courses. Two out of 15, at the graduate level. At the undergraduate level, there's a significant requirement for business courses. They're not necessarily management courses. The software engineering concentration in the MSIS is an adaptation of the Software Engineering Institute curriculum. And we worked on that quite a bit. It's a concentration, not a full degree at this point. We did have experts, Dick Fairley and so forth, SEI types, in to consult with us on what you would expect of a software engineering concentration and focus.
The Master of Science in Library and Information Science is the oldest degree program in the college. It was just named among the top ten by U.S. News and World Report, so it's a very well known and well respected program. It has been known for years as one of the leading, maybe the leading, program in terms of applied computing for the library field. That's one of the things it's known for. One of the questions that I will address in a minute. I just really want to delineate what this program consists of. The library program, I mean the information systems program, is quite different from some other things at other schools that were primarily or initially library programs. That's one of the things I wanted to emphasize. The library science students take a lot of information systems courses. That varies with the type of student. One of the things that we're not really satisfied with is the other way around: there's very little impact of the library and information science program on the graduate MSIS in particular. At the undergraduate level, there's a little bit more. Kate is grinning because she teaches the information resources course to undergraduate information systems students. We have a Ph.D. in information studies.
Probably the predominant percentage of that now is in the information systems field. We have very strong library and information science graduates in the doctoral programs as well. I just thought I would do that before we started off. Are there questions about it? Is that fairly clear in terms of the philosophy and rationale? I gave you the relative size of the programs earlier. OK.
TOM CHILDERS: I'm on the agenda to set the agenda for the afternoon. But I think frankly, the people speaking this morning did that for themselves, and did it exactly on target. This afternoon we're going to spend our time on educational issues and the implications of what was said this morning for learning units, learning components.
I have a couple of logistic things that I am bound to talk about. First off, I'm using these flipcharts because we paid for them, and I invite all of you to come up and scribble on them as you wish. We paid for them and I don't believe in leaving them behind. If you parked in the lot out here, you can get parking coupons from Karen. Just see her at the end of the meeting. And for those of you who need to get to the airport quickly after the meeting, how many of those are there? Why don't you three meet in the back of the room. It probably will pay you to do a cab rather than a shuttle. It will be more direct and faster. And we do plan to end at 4:30 promptly. So you don't have to worry about what's happening.
In thinking about this conference, I don't know why, a cartoon that I saw a month or so ago in the New Yorker came to mind. And I have loved it ever since I saw it. If you can't read it, it says "Tightening the Buttocks." And that I thought should be a conference title, and it fit for this. And I think "Tightening the Buttocks," getting it all together for the new century, is a great title for this conference, and Kate has promised this will be the title of the publication.
Without further ado, I will turn it over first to Ellen. We'll then hear from Bruce and then Larry and then Ed. I think I have that in the right order.
KM: You might want to check and see if someone wants to follow something that tasteless.
TC: Kate, only you would draw the direct reference to Ellen.
CD: But think of what it will do to all the webcrawlers in the world.
ELLEN VOORHEES: I'm going to concentrate on what I think are the sorts of courses that will be necessary to support the agent part that I was talking about. So I'm not suggesting that this should be the entire program or anything. Just the sorts of things that I think people will need in order to both develop and then participate in using agent information systems. I think there are three main things. One is cognitive science, and I think this is the biggie, because this is where I'm putting lots of knowledge processing sorts of things. There's also distributed applications and machine learning.
I tried to think about the sorts of tasks that people would be doing to develop and use these systems, and tried to come up with one common phrase under which they would be classified. And basically I eventually got myself to cognitive science. Although if someone had asked me for a definition of cognitive science, I'm not sure I would have been able to give it. So I went to Princeton's home page. My connection with Princeton is just that I've done some stuff with the WordNet group there, which is in the cognitive science lab, so I knew their home page. Anyway, their definition of it is, "the study of cognitive processes involved in the acquisition, representation and use of human knowledge."
And as I hope at least that I left the impression at the end of my talk this morning, I think this knowledge representation issue, in both the profile creation and the description of the resources, and how agents are going to intercommunicate, is going to be the main factor in whether these systems are actually ever going to be deployed. So, some of the issues under cognitive science that I think need to be addressed are knowledge representation, both for the user models and for the ontology, the descriptions of the data resources.
This is my part for human computer interaction. I didn't mention it this morning, but I knew it was coming here on this slide. Clearly, the human computer interface needs to be worked on. Both in terms of just the general question of designing a good interface to a normal information system, and because I think that the agent paradigm might produce some other interface issues. In particular, in at least a lot of the scenarios that I think of in terms of using the agents, the agents are sort of asynchronously searching, or asynchronously doing things for the users, and the primary focus of the user is not on what this agent out there is doing. And if the agent gets results, there's got to be some way for this agent to inform the user of results without interrupting the processing that the user is actually doing, because it might be of a higher priority. How you actually integrate those sorts of issues of this asynchronous communication between your computer and you, I think, is going to need to be studied.
And also under the general rubric of cognitive science are the language or linguistics issues. And this gets back again into the indexing schemes. So indexing schemes in sort of the broad sense, as we were using it this morning: being able to index various media types.
There's also issues within traditional IR indexing, text indexing, that can still be addressed in terms of what a computational linguist does. We've been trying for a long time to actually get that to help us, and
END OF TAPE THREE, SIDE ONE
...so far, but it might still come.
Distributed applications. By this I basically just mean development of the computer infrastructure that will enable the agents to do their stuff. So it's sort of abstracted from the particular IR task. There's the technology to allow the agents to move among computers. It's there, in very rudimentary, or very initial form. It's sort of a programming paradigm which I don't think lots of people are comfortable with yet. And with Java, or Telescript, General Magic's Telescript product... This idea of having the program go to the resource instead of the resource coming to you, through procedure calls and stuff. Plus there's the whole issue, which I basically labeled here, of distributed AI. Negotiation among agents, multi-agent systems, all this sort of work will need to be involved in this sort of system. I can't really be very specific here, or I'm not being very specific here. I realize that. I'm not sure what particular parts are going to end up being important. But I think a general background within these issues is going to be necessary.
And finally, machine learning. I think in terms of a particular agent aspect, machine learning will be necessary for this automatic improvement of the user profiles and other resource descriptions. But there's also a lot of machine learning going on in IR circles which isn't agent specific at all. Things such as LSI indexing, that's a type of machine learning. There's the data fusion work that's going on. This type of work in applying these methods to IR problems probably implies a more sophisticated mathematical background than I think is traditional in many library schools.
I'm not sure how that is here, in information science, but I think that's another issue, another place that needs to be looked at. And actually I realize that that's not 20 minutes' worth, but that's really what I came up with. So I'd be happy to argue with someone.
KM: I'm curious, Ellen. If you think about the kinds of training, you know, the levels of training. You know, as Dick pointed out, at IST, we have bachelor's degrees, we have master's degrees, we have Ph.D.s. At what level might people come into a company like Siemens, or come into an environment where you're developing these agents that you're talking about, and might they be expected to make the contributions over the long term?
EV: Within Siemens as a research institute, in developing this system, Siemens would be looking at at least a master's level, if not a Ph.D. On the other hand, I mean, eventually, I think there will be employment opportunities for people even after the original systems have been developed. I think these systems themselves will then require people who will be able to work with them. And that I think can spread through the entire range. I've never been particularly happy in the past with the term knowledge engineering, because I never really understood what it meant. And I find myself here in the uncomfortable position of sort of recommending that people study it, when I still can't really tell you what it means. So, I'm not being specific here, mostly because I can't be specific here. That raised lots of hands, OK. Yes?
CD: I'd like to ask a question that really has to do in part with Siemens' marketing plans. It seems to me that if the people developing intelligent agents have their way, there's only going to be about three to five places in the world where intelligent agents are developed, and they will be selling them to everybody else.
So one of the questions I have is, you might talk about training students to use something that we don't even know the shape of yet. But are there really going to be that many slots for developers of intelligent agents?
EV: I think, if the systems actually do materialize like they've been predicted to, if agent systems actually do catch on, which I don't think is a given. Lots of people are predicting it to happen, and I don't think it's a given. If they do, I think yes, there will be lots of places for people to develop agents, not the agent maybe per se, but different pieces of these. OK, let me give you a little bit more background on the sorts of things that, in particular, I've been working on at Siemens. Within these agents they have programs which I call scripts. Basically it's a program that the agent follows. They have a whole set of scripts. These scripts are arbitrary programs, so they can do arbitrary things. I think with the development of those sorts of scripts, you could actually get into a real market of these sorts of scripts. You develop the script for the way to find such and such information, while my script finds other types of information, or does other things. I think these script writers actually could be a whole market. In that you would sell your scripts, or maybe in a more altruistic environment, you might share your scripts. But I think there is room for development there.
HW: What about the content that the agents will bear? The development of the profiles. Do you see that as being a self-service thing, or are you going to need a librarian, an information intermediary, to do it? These profiles have been around for ages as selective dissemination of information programs. I'm just wondering, do you think these agents will be capable of being loaded by somebody sitting at a computer and you just fill in the blanks, the way we now do with Firefly or something, where you just put in a few terms?
Or is it going to require a more sophisticated, in-depth examination of somebody's interests, such that only a trained intermediary could do it?
EV: In my research proposals and things that I have done, my goal is to have these profiles be automatically created. So that over time, as the system watches the user interact with what's interesting to that user and what's not interesting to that user, it would extract pieces and use that as the profile.
HW: You're assuming then something like ongoing examination of incoming electronic mail, or newsgroup stuff?
EV: Right.
HW: What about other things where you don't have a steady stream that's being monitored?
EV: I think even in the case where you do have a steady stream, it's going to take a long time before you can actually have that be automatic, which is my goal. So, I think there is a place for the intermediaries creating these profiles. However, what I actually think is going to happen is that there's a chicken and egg problem here, is what I'm trying to say. You're not going to have these agents be useful until they have good profiles. But nobody's going to hire a profile specialist until the agents can do something, until they're convinced there's a real market there. What are you going to do about that? I don't know. But one other point, though. Even if you forget the user profiles, say they are automatically created or the users do them themselves, you're going to need specialists, I believe, to describe the data resources. In order to usefully exploit them, I think you're going to need humans in that loop there somewhere. Kate?
KM: I was just (me and then Ed, rank has its privileges). When you were talking about what's going to come first, in that people wouldn't hire profile specialists until they had agents that did something.
Then when the agents know when to do something, again, it makes me think about, going to ISI, the idea of the generic Research Alert versus the customized Research Alert, or BIOSIS, where they have sort of generic SDI profiles. I could conceive that, in a corporation, even a widely spread one, there might be a way of developing sort of generic programs and then generic profiles for certain tasks or for certain parts of the company, that then would be expected to be quickly tailorable to the users as the users start using them.
EV: Right, I certainly agree with that. Yes?
EF: The notion of these agents is something that has often fit into the AI paradigm. And the way that you've presented it, the curriculum related issues, one might say that everything you put there would fit into an AI kind of degree. I wonder if there are not other disciplines that can contribute to this, that we should consider? Just a few things that come to mind, sort of on opposite ends. In one sense, many people have raised the question of scalability of agents. And some have proposed using modeling or simulation techniques to begin to understand them. In another vein, some people have talked about the economic concerns of these, thinking about spending money and resources and time and so forth. And on a third, some people have likened agents to people and other kinds of intelligent beings, and argue that we should do a sociological study. How do you see these other disciplines, and others as well, fitting into this?
EV: The simulation, I think, is a perfectly reasonable avenue to take. Having sociological studies of systems that don't yet exist I think is somewhat putting the cart before the horse. I think we need to actually see whether we can develop these systems. It still comes back to..., there are a couple of problems here. One is that, in order for the agents to be useful, you need to have the profiles, etc., which we've been talking about somewhat.
But also, if you're really going to have these agents moving around within an environment like the Web, you need people to be willing to provide the resources to have them execute, and the search engines there, and right at the moment, your average web page is not a search engine. It's a..., it's a set of whatever. And that needs to be addressed also. What are you going to do with all of this information which isn't necessarily a collection, and isn't set up to be searched as such? I wouldn't at all claim that the list of things that I put there was exclusive, that you don't need anything other than that. Those were the main points that occurred to me. And in particular, I think, this whole business of the knowledge, the indexing and knowledge representation problems, are going to have to be faced, even though I'd rather they didn't have to be. I would personally really rather that we could do something to ignore them. But I don't see how the systems are going to work if we do ignore it. Yes?
BH?: If you're trying to make a curriculum to teach people to write agents, what kind of background would their profile be? What would their background be? What sort of a strong end? I mean, would you want linguists or computer scientists?
EV: I think, well, remember my biases here. I'm a computer scientist, and I came out of the Cornell Department of Computer Science, but with an information retrieval background. I think the actual building of the infrastructure, getting these processes moving from one machine to another, etc., is a computer science problem. And a lot of the people who are currently doing various sorts of agents, or agents for various sorts of things, are in general computer scientists. However, trying to apply agents to do information management tasks, I think, does also need somebody who has some understanding of the problems there. And I think that's missing in a lot of the agent research.
There's not a lot of other IR people, at least that I've seen. Hard core IR people doing agent stuff. And I think you can tell that in the assumptions that they make and what they're going to be able to do. Does that answer your question?
BH?: Yes. I was expecting to hear something about psychology, linguistics. I mean, since cognitive science...
EV: Cognitive science is clearly..., right, OK. I think it's very possible, yes. The knowledge engineering part of it, there are people who will call themselves at least computer scientists (the AI crowd will usually admit to being computer scientists) who would do that sort of thing. But it could just as easily be the psychologists. I don't know enough psychologists to know how much they think of this problem, or what they think of this problem.
KM: But George Miller's a psychologist. In cognitive science. His cognitive science group at Princeton is...
EV: Is cognitive science and computer science. It has a very strong psych aspect to it. Yes, that's certainly true. However, they're not paying much attention to this agent problem.
IS: I'd like to ask two questions. To my understanding, this software agent is still theory, in the realm of research rather than widespread or practical application. In that sense, is any university teaching a course just focusing on agents? Otherwise, if there is no course, then do you think that we have to embed this topic in some other course? Then what? And my second question is, do you see some kind of a job market or niche for our graduates from a course of study of agents?
EV: I believe that at the University of Maryland, Baltimore County, Tim Finin at that center, I think he teaches a course on agents. I think there might also be some out in, I can't remember now if it was Berkeley or Stanford. It was some California university. I don't remember now which one it was.
I also think that in some circles, saying that you took a course in agents, agents is enough of a buzzword at the moment that it would attract attention. Should it? Is it? It is mostly research right now. I don't know how many actual products are going to be out there soon. I actually expect there will be some, because everybody's trying to get the agent word in their product. Because it is sexy now. So all sorts of things are being called agents. I think there will be a market for four year graduates. But I'm a proponent of this technology, so...
KM: I think that would be an interesting challenge. I'll be interested to see what the other speakers have to say about this problem of buzzwords. And this problem of educating for something that is very now, and perhaps rather narrowly circumscribed, and we all know what happened to AI, and expert system shells, and neural networks doing X and Y. It will be interesting to see what people think about the danger of educating for the focus as opposed to educating for the kind of breadth and adaptability. So if we're educating people to go out and write intelligent agents, or design profiles for them to use, it seems we have got to give them the tools that they need to go beyond that when that dries up. Or when those problems become solved. Or something.
GS: One observation regarding curriculum: it seems like we want all of everything in this program or whatever. In the design of a curriculum, I think we've got to reach outside and build some partnerships with the psych department, with communications and with computer science and so forth, so that there would be minors that the students could elect, and if you've got a 60-hour gap to fill, 36 of it is your major. What are the other minors that you could get into? And I think if the individual wanted to do that type of thing, then we should have, as the faculty, modules that they can select, which will equate to something, some sort of skill sets.
It's tough to deal with it all. And so, at the first two year level, we're recommending courses like linguistics and psychology and philosophy and logic and things of that nature.
GH: We're looking ahead to that. There's a focus group working on the MSIS curriculum now, and the problem we have is that all these topics... we think they all have to be in there. There are not enough credits in the degree to cover them. And every time I see a list like this, I have that same kind of reaction. Where you've got all these different businesses, and anytime you try to do something that's fairly broad, you have this issue. That is, can you give enough in each of those areas that people can go out and do something, or do you just give enough so that they know about those areas, but they're not capable of doing very much?
KM: Or perhaps what you need to do is develop skill sets so that people can partner. And so what you're really doing is building part of larger work teams that are going to address these tasks. You cannot possibly educate somebody broadly enough to do all this themselves. And that implies a different kind of curriculum, and a different kind of education, perhaps.
GH: And also implies a different kind of awareness on the part of employers. To understand what the product is.
KM: Carl maybe would want to...
CD: Yeah. Not directly this, but on the topic that's come up, I mean, there's this constant tension between offering a university education and offering vocational education. A vocational school trains somebody so they go out and immediately start doing a job. University education trains somebody so that in 20 years they are still using the things that they were taught. Now, you know, I think it's pretty obvious that part of the source of this tension is industry, who would be just as happy to be supplied with people that they could plug in and systematically make money off of tomorrow.
And I think as we look at curriculum, we have to resolve this kind of tension to do well by our students in both the long run and the short run. And I guess, to me that's tied in with the question of, do we teach a course in agents? And the answer may well be, if we recognize where that fits in the spectrum, we do. That is, we might do it very crassly to say, as you said, agents are sexy, it will get people jobs immediately.
KM: But so does becoming a Novell networks trainer.
EF: I guess I want to bring up a conceptual point, or make a comment here, which is, we're already into the vein of a course of this and a course of that. That's not the only way for people to learn. There are other approaches where we can define 100 knowledge modules, where 40, 30 of them are the core of this, and others are available. They're all constructed by people at many institutions perhaps, we share in some ways, and part of the student experience may be semester-long team projects where they do nothing else but a really in-depth project. And they learn these things as they need them, a sort of project-based approach to learning. So there are many other ways that we can do this. So let's not just say we're going to teach courses in something.
KM: What you see behind you is your screen.... So we can turn the lavalier over to Bruce Croft, who is also using overheads, and after the break we'll have dueling systems.
BRUCE CROFT: So here's my title, IR in Academia. With a subtitle of Defining Identity in the Post-Salton Era. It's not meant to be humorous as such, it's not really humorous, but Gerry's no longer with us. But it's really what we have to do. Gerry's books and his influence on the area sort of defined information retrieval for a long time. So, now what we're talking about is how do we define it in this new era we find ourselves in, with the Internet and Web and all this other stuff going on. And that's what I want to talk about.
The current state of information retrieval is where I want to start. And I'm from computer science, so we'll be biased towards computer science with the occasional comment about what I know about information science departments, but since I haven't visited Drexel, most of these comments about information science or library science come from the number of other places that I've visited. So, in fact, having looked at your curriculum, a lot of the comments don't apply to your curriculum. But I think it's still worth saying in general.
First of all, few computer science departments teach IR courses. That's just one observation. Second observation (I would claim these are all fact, not opinion): information retrieval is not regarded as part of mainstream computer science. Maybe it used to be in the early 70s. Curriculum proposals from ACM... it was there, but it's not anymore. And there is actually a bias against information retrieval in mainstream computer science. And apologies to anybody here with their name on a book, but there are no adequate information retrieval texts. That's one of the core problems, and it affects a lot of the things that I'm saying here.
And this is a comment which applies less in your case, I believe. Information and library science students just don't know enough about technology or systems. They learn a little bit about how to use a system, how to write queries for system A or system B, and a little bit about technology. And in a couple of places they learn some programming. But they can't fill a lot of those roles that Larry mentioned. They don't have the technology background to do it.
Good computer science information retrieval Ph.D.s, there aren't many of them, because there aren't many departments that have faculty working in that area. But there are some good people being produced. And they can't get hired for love or money in academic jobs in computer science.
It's a sweeping statement, but in general it's true, and I think Ellen would vouch for that, and a number of other people I know, who are excellent computer science Ph.D.s, but it's just hard to find jobs, despite this incredible amount of industrial interest. It doesn't translate into academic interest in computer science. Associated with that is that other parts of computer science, and there's a lot of them, database people, natural language processing people, machine learning people, and a few others as well, think that they can do information retrieval, digital libraries, all that sort of stuff, that they just inherit all the knowledge that they need to be able to do this stuff.
CD: And they think they can do it without knowing libraries.
BC: Of course, the library stuff here. The indexing and all that. So, as I said, I think all these are facts that define the environment we have to live in when we start defining curriculums and things like that. I haven't finished yet. There is a good side. That was the downside, if you like. Some of it's repeating what I said earlier. I don't have the facts that Ed wants, but I just know from the number of e-mail messages that I get, the ease with which our undergraduates who have done work in my group get hired, and the salaries they get, and just the constant demands for Ph.D.s which I can't satisfy, despite having one of the largest groups around doing information retrieval. You know, we don't produce enough, and if you summed up all the good computer science Ph.D.s in information retrieval produced per year in the United States, it's less than one handful. Whereas there are a lot of job openings for people who know the technology, who build technology at the moment, with all these companies starting up. So that means if you've got a computer science degree, and you know information retrieval, as long as you don't want to work in academia, you can find a lot of good places to work.
And I think that you've already vouched for this, and I've seen this in other places too, that are being creative with redefining information science degrees and training people the right way, and they also are snapped up pretty readily. The other thing, which also makes the academic scene less understandable (there's lots of reasons why, isn't there?) is that there are just any number of research funding opportunities in this area at the moment. Virtually every ARPA (DARPA-have to keep up with this) BAA that comes out mentions information access or retrieval or filtering or multimedia, or something, just about every one of them, has got some component which has got something to do with the stuff that we're doing. And then there's NSF, and NIH, and industry. And so there's just lots and lots of opportunities, including infrastructure money, because one of the questions was, what sort of resources do you need to teach in this area? Well, there's lots of funding opportunities for that as well. So, it is a boom time, and hopefully we won't go the way of AI in the 80s, but that's just a matter of being able to deliver something at the end. KM: Small problem. BC: OK, so that's my slide for the current state of information retrieval. Now I've got some action items. How can we address, mostly they are bad things in the current state, except for the fact that there's this incredible demand, and you can get research money. But just that there's nobody in academia, you can't get a job in academia. In computer science. I should have, I thought there was another thing. I must have skipped something there somewhere. Because I had one information science related one, I just remembered. 
END OF TAPE THREE, SIDE B
...salary, because library and information science departments are not really regarded as engineering departments; they're over in social sciences or something like that, which traditionally have a lower salary scale. They don't have an engineering salary scale, so, you know, it's a big hit that we have to take, and that's another thing that prevents building up the computers, the technology side, of the information science department. As I said, I'm speaking generally here and, you know, you don't have the same problems as some other library and information science departments. So what can we do? Well, number one, we have to produce some better textbooks; the current state of the textbooks is just atrocious. I mean, Gerry Salton's textbooks are fine but they are old, hopelessly out of date, and, you know, virtually everybody I know is used to teaching through some combination of photocopies of Keith van Rijsbergen's book and a collection of papers and maybe Gerry's last book, with heavy modifications to say, don't read that chapter, don't read this chapter. It's really hopeless. Compare this to the database field, where there are three or four excellent texts and about 25 others, you know, to choose from as well. But there are three or four that are as good as each other, which are really texts, and anybody in Podunk U can get that text and teach a good database course based on that text. There's no book like that in information retrieval... one where somebody that doesn't know IR but is a good computer scientist, or a good information scientist, can get that book, read it themselves and, with the teaching aids and that, can do a good IR course and start training some IR people. So how do we expect to produce these people if we don't have texts? And one of the reasons is that it's a cyclical chicken and egg thing, isn't it? 
There are not enough people in academia, and those are the people that write the books, after all, mostly. And it's difficult, we can't get these people into academia for all the reasons I mentioned before; so in the database field, for example, there are hundreds of those guys and, as you know, a lot of the best textbooks are not written by the best researchers. They are written by people who are good practitioners, they are good teachers, they know what it takes to write a good textbook, they've got a little bit more time because they're not focusing on getting the next ARPA grant all the time, and they produce better textbooks. You know, that's not always true; there are good database texts by top researchers as well, but certainly in other fields the textbooks are often not written by the best researchers. We don't have that number of people around, so, you know, maybe there's somebody over here who has got time, he's not doing research, he's writing a good textbook, but there's just not that many people in academia. So I recently became a Kluwer series editor and I'm trying to do something about this, because I don't have time to write a textbook myself. But I thought, well, I'll keep encouraging people to write things, maybe by getting some things like readings and monographs, certainly monographs on some topics and things like that; anything is better than what we have now. Just having a selection of monographs, like a monograph on information filtering, for example, which is a collection of works by some of the authors around that particular topic. It would be better than what we have now, which is nothing, in most of these areas. You know, papers are just not good to teach out of. So, you know, I don't have a solution to that except for... I'm trying to do a little bit by starting up this information retrieval series and trying to encourage people to write something that's not as big as a full textbook. 
Because obviously it takes a lot of time to develop a 600 page text with all the associated assignments and things but they are steps towards that to improve things. Another possible action item is, since industry is so interested in all this stuff, let's get them to put their money where their mouth is and instead of just doing research, ask them to find some stuff in the universities for teaching in that, get them to fund a chair or two in a couple places. That's when the departments will hire people if some industry says, here's money for a chair of information retrieval, they'll go and get one. They are not going to be..., you know, some database person might sneak in there but in general you are going to have to have some IR credentials. This is a standard model in other areas, you know, so if industry is interested, why can't they support some chairs in some places. Introduce information retrieval to undergraduates obviously. Get them involved in projects, you know, this is stuff we've been saying already. So we've got to get people more familiar with the paradigms, what the content is, you know, what is it that's special about information retrieval that makes an interesting subject. Most of the undergraduates I teach, when they start doing this stuff, they love it. You know, they really get into it. It's a lot more interesting than database stuff to a lot of them but they are just not exposed to it in general except in a few places like Virginia Tech and U Mass, Cornell, I guess they'll still be teaching that course in Cornell. (No? Is Claire teaching it? Oh, there's another one lost.) Improve computer science focused IR journals. 
So obviously there's JASIS and there's IPM, which publish a little bit more towards the information science side, but a lot of the information scientists are not too happy with too much technology stuff in those journals either, and on the computer science end there's nothing that's an information retrieval journal as such. There have been some attempts to start one; the closest... the ACM Transactions on Information Systems was a result of trying to start an information retrieval transactions a long time ago. Since I recently became the editor of that, I'm trying to refocus it more heavily on information retrieval, to make certain that there's at least an IR paper in every issue and that there's a tendency to have the paper up the front instead of down at the last paper or something like that. But we need to do that, even though it's just a little thing; you know, we need to improve the visibility of information retrieval, instead of always being hidden away. And you know, information retrieval people do this, but there are not enough of us around publishing in other computer science related conferences, just to improve visibility, and publishing in different places like machine learning conferences, things like that. (I've got some more action items. I'll be spending 10 minutes picking slides off the floor.) We've been talking about this. I think that in computer science on its own, it's really hard to establish a power base for information retrieval because of all these facts that I've been describing. You're fighting against all these established parts of computer science. Industry can help us by establishing chairs and things like that, but really I think that the best possibility of a power base for information retrieval is these innovative collaborations between departments. 
Like you mentioned collaborations before in the context of your degree..., your degree itself is the sort of thing I could see as a collaboration in other places rather than the result of a single school. So collaborations between computer science, business schools, information science and library schools, and other schools too; in some places there's a bit more emphasis on policy and economic policy and things like that, or journalism schools, for example, they're actually interested in some places in some collaboration there. Other departments like psychology, etc. are mentioned because you do find people scattered around these departments who have a real interest in the cognitive aspects, etc. So some places are just beginning to try this, or otherwise they've started to define degrees like the ones you have here, which tend to merge some of those things, and it seems like, given the industrial demand, the best way to go is with good professional masters degrees to increase visibility. There's a core of the teaching in this area, so I'm preaching to the converted here, obviously, so you've already got that. But there is..., I did these slides before I read your curriculum; they were really designed with other library and information science schools in mind. We wanted..., there are lots of people out there that want to know about this stuff, they want retraining, they are good people. They are mature people, and so these types of collaborative masters degrees, I think they have a real potential. So information retrieval but with less CS and more of other types of courses, which is like your information systems masters. Except, when I say business information systems, I guess I meant more applying technology rather than learning how to build the technology, which is what computer science focuses on; more like applied computer science is a good way of putting it. Information science with computer science, but we discussed that one too. 
You mentioned a lot of your students take a lot of the technology courses, and the more the better as far as I'm concerned, in terms of the types of jobs and positions that are out there. Information science and library science with more business is something that other schools certainly are trying. There are at least a couple of other library and information science schools, I'm sure you're aware of that, becoming pseudo-business schools. And that's another model. It doesn't solve the technology problem, obviously, but you know, thinking of Larry's roles, if you've got the CIO-type people, they do need some business training, management training and things like that, so there is certainly the scope for concentration on that side of things. I want to address the issue of information retrieval being not mainstream. We are not actually alone in that. Information retrieval, natural language processing, HCI, CSCW, information science, they are all non-mainstream as far as mainstream computer scientists are concerned. In fact, to some people anything that isn't, you know, theory of computation is non-mainstream; they'll grudgingly accept operating systems and a couple of other things. But you know, these things are all just crucial to the development of information systems, that's the funny thing. So you know, one way to approach this is to turn that on its head, which is essentially what you've done here, so let's take that as an advantage. You don't want to do that stuff, then let's make it..., these things are all related to each other, they tie in around information systems, they're core for building good information systems, so you can build good degrees around those things. And I think a couple of places are looking at how you construct these degree programs around things that computer scientists don't consider core, but which are still technology-oriented topics. 
I too put cognitive science in there, funnily enough, not because of the topic itself but because it's an example of a, we would call it a discipline, but it's something which has now become more accepted. You find it in a number of different places, and it is really a conglomeration of other areas. And some places it's a center, some places it's a school, some places it's a department, some places it's just a collaboration, so there are lots of possible ways to create these valid degree programs out of multiple places on the campus. One of the questions that was asked was about hardware and software resources, and this is all pretty obvious stuff. I don't think there is anything unusual about teaching IR in terms of hardware and software requirements, except that you need a lot of disk. So sometimes people just forget about that. It's just getting worse and worse in the sense that you'd prefer to have hundreds of gigabytes of disk really, but a hundred would be a good amount to start with, because if you want people to build realistic programs and play around with realistic amounts of text, and you've got 0-40 people in the class, and you give them a few hundred megabytes each to play around with (because with anything less than that they are not even experiencing anything about how long it takes to process text), then you rapidly chew up disks like it's going out of style. Disks are getting bigger but still they're not getting bigger that fast, if you haven't anticipated it in the budget. So that's one really unusual thing relative to database systems or any of these other technology things. So we use multi-processor Unix servers, whatever flavor. NT servers are becoming really, really popular in industry, so they should really be part of the platforms which applied information retrieval is taught on as well, because it's just going to be essential that people know about those things. 
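The disk arithmetic described above can be made concrete with a small back-of-the-envelope sketch. The figures used here (40 students, 300 megabytes each, and a factor of two for indexes and scratch files) are illustrative assumptions rather than numbers given in the talk, and the function name is hypothetical.

```python
def class_disk_budget_gb(students, mb_per_student, overhead_factor=2.0):
    """Rough shared-disk estimate, in gigabytes, for a class working with text.

    overhead_factor is an assumed multiplier covering indexes and
    temporary files built on top of each student's raw text allocation.
    """
    raw_mb = students * mb_per_student
    return raw_mb * overhead_factor / 1000.0


# Assumed class: 40 students, 300 MB of raw text each.
print(f"{class_disk_budget_gb(40, 300):.1f} GB")
```

Even with these modest assumptions a single class uses tens of gigabytes, which is the budgeting point being made.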
Interface tool kits: you know, we need to be teaching how to build interfaces, so obviously Java and Visual Basic and other things like that, that people typically build interfaces with out in the industry, should be taught. People need to know the basics of client-server architectures and how you build distributed systems out of client-server things that already exist, using APIs from IT software and database systems, document management, workflow, OCR, imaging packages; the more exposure to that stuff the better if we're going to produce these systems integration people. We do have this problem about "what if you'll just be learning...", you know, you've got to put it into a framework where these are the basics. What does electronic document management mean? What are the common themes between document management packages, and now let's use a particular document management package as an illustration of that. So it's not just teaching one tool after another; the underlying issues that are being addressed by each tool are being taught as well. So it depends on how many of these things..., obviously it requires..., OCR in particular requires a little bit of extra equipment but not too much. I've got a summary slide now, and then I just want to tell you what courses we teach for information retrieval. But the summary of the high level stuff: industry and government demand is high. Information retrieval is in an in-between position with regard to computer science and information science. In computer science it is non-mainstream, and information science in other places is not..., they don't teach you information retrieval technology, they teach you how to use an information retrieval system. And so there are all sorts of visibility and public relations problems, which I've been describing. 
Defining a graduate discipline that bears on topics centered around information systems, that's essentially what you've done there and what I am doing, and I think that any information systems degree has to have a core information retrieval component, a technology component. Blah, blah. The other really crucial thing is improving acceptance of IR in both fields, primarily through better texts and other sorts of visibility things. I really think this idea of establishing chaired professorships is something worth following up on. (Since you paid for these boards I'm going to use them, okay. I am actually going to use them.) In terms of what we actually teach, you know, I've been at U Mass 15 or 16 years now, so what sort of things have we been able to teach and develop in the information retrieval area? And the answer to that is very little, because of the constraints of the computer science degree, which has lots and lots of requirements, and the sort of general attitude that I've mentioned about information retrieval; even in the department where I'm regarded as one of the good citizens and all that sort of thing, still it's not mainstream, so it gets relegated to an option in the third level or something like that. So in the courses I've taught, this is not part of our regular curriculum, but I think you can cover most things that are interesting about IR, the fundamentals, in two courses, a sequence of two courses. The first course is an advanced undergraduate course which can serve as one course if you like; this is how to build IR systems. Not that everybody is going to go out and build IR systems. But it teaches students enough about the technology that they end up building their own IR systems. So they understand it, they can evaluate systems. They know really how things tick. And it also gives you enough scope to look at things like machine learning and other things like that. 
So we have one real core course and an advanced information retrieval course, so I'm just describing what our experience is. Other courses that information retrieval appears in: we have an internet course, tools on the internet, that's a low level course for the masters, actually. Obviously as many masters as you can take without destroying your internet, because there is an unlimited demand for this course, and part of this course is the IR tools, in the sense of giving a little bit of discussion of the web browsing tools and, you know, what's good and what's bad about retrieval techniques as they're exposed on the web. We have a course which is an advanced database and IR course, in the sense that it is building an application using both database systems and information retrieval systems together. So it sort of caps off both the database course, which is separate to this, and the IR courses; in the IR courses they actually build IR systems. This one is making use of a commercial level product and building stuff around it and on top of it and things like that. So that course we've done. And other courses, well, we do have an NLP course which is focused heavily on information extraction. This information extraction stuff is probably even a little ahead of the agent technology in the sense that companies are pushing it hard; it takes a lot more work to produce an information extraction system. There's off-the-shelf stuff that allows you to build an information extraction system, but it still takes a lot of work to build something. But there are going to be things out there, commercially, in the not-too-distant future for building information extraction, so this course is a good way of doing both NLP (the interesting part of that) and a more practical aspect, which is the one of extraction and potentially summarization, which are both things like ConText, Oracle's ConText system. 
There's another practical application of simple NLP technology that can be brought into a course like that. And so we teach all of these courses. I usually don't get to teach two IR courses, I have to make do with one. I'd prefer to have two, and the one other course I'd like to have, which I haven't taught yet, is interfaces for information [inaudible] access and visualization. Not that I'd have a course title like that. That's the course I'd like to have, not just the general interface course, but one where the course talks about interfaces for information access and visualizing information, with practical work that goes with it. There's a lot in the literature about what interfaces need to look like to support effective searching, and there is beginning to be a lot more about how to support visualizing the stuff that you're bringing back. So it's a very different topic than just designing interfaces in general. So anyway, that's the sort of courses, a mixture of what we teach and what I would like to add, but that's within the constraint of a computer science curriculum. Ed and I have spent an awful lot more time worrying about information retrieval curricula in general, so I'll leave that up to him. But if I was putting information retrieval courses in something like your master's degree, then I wouldn't expect to have much more than this. It's not like you need six IR courses. There is only so much you can teach, you know, about information retrieval; from my point of view, two courses are enough. The whole thing is really difficult without the text. Question? DL: Maybe Il-Yeol is going to ask the same question. You didn't say what you would do in database systems along these lines. What part would you begin to meld in IR, if you could design your database courses the way you want? BC: Well in our department we have one data base course. 
We have two database courses. One is the introductory one: relational database design and entity-relationship models, SQL, all that sort of stuff. The second course is object-oriented database systems, query optimization, all the more advanced technical issues in databases, and then, as I said, we have this sort of, once we played around with this course as a cap to both of them, which is building an application out of both of these technologies together. A very applied course after the two more theoretical technology courses. DL: You're saying that works well? BC: From our limited experience. I believe that that is a good scope, a good number of courses to devote to that; it enables you to cover the technology in enough depth from a computer science perspective, as well as doing enough applied work so you've got the training you need. And a lot of people who are interested in this area only end up taking one of these; they take the database course and an IR course, and we only get to teach one in both things, but theoretically the curriculum has two in both (two to three). IS: I proposed a course in the masters program titled intelligent retrieval. So I am very interested in knowing a little more detail of the contents of your DB/IR combination course. Most of the course is trying to cover advanced DB/IR integrated ... So I am proposing to cover these four ambitious materials. I know this is too much for a 10 week course, but as an initial starter I am proposing to cover these kinds of things. Would you like to tell us a little bit more about your DB/IR course in terms of actual topics and the collection of hardware and software tools and textbooks? BC: Well, it's not well developed enough to really use it as a model; we've only taught it once. 
But we think it's a good idea, and its focus is on database and IR integration and what are the ways to integrate those kinds of systems in practice. Digital libraries is the application we use; you know, it's an application that people build, it's a very project oriented course. There's a lot of work on building essentially a digital library application using INQUERY, which is our information retrieval engine, and what do they use? It's the..., the open database, open object-oriented database system, it's from the Texas ARPA database group, which is a public domain, object-oriented database environment. So in this course we teach both; usually, because of the constraints on how many courses we can teach, we end up teaching some stuff about object-oriented databases and some advanced IR and putting them together in this digital libraries application. As I said, we taught it like that once; other times it's had more database or more IR, depending on who is teaching it. Your lineup there looks perfectly reasonable too. It really just depends on things like whether you've covered the essentials of things like object-oriented database systems somewhere else, so you don't have to cover that again in this course. Question? NH: I'm interested in what the gender breakdown is in your department as a whole and in the IR courses in particular? I perhaps can add to that; Lee and Kate and I served on a panel of SIGCSE at the ACM meetings, no, it was the Computing in Society group. One of the issues that came up there was that the percentage of women has been increasing in terms of the proportion of them getting Bachelors degrees, and it's also increasing in what have been considered traditionally male disciplines, except for one. BC: Right. NH: Computer science. BC: Computer science has been getting worse in the time I've been there. NH: Computer science is getting worse; it peaked in the 80's. 
BC: But it's down to levels which would still be considered great in the other sciences and engineering. I mean in our department, it's down to one in four, or something like that. Which for a lot of departments.... NH: That's high. It's down to about 13% I believe. BC: We have a lot of female faculty; it's about a quarter, it's a little bit higher than a quarter in our faculty, and they spend a lot of time, you know, working on mentoring women to make certain that they don't drop out prematurely or stuff like that. So we do reasonably well in that respect. In terms of IR, I would guess it's always about a third. So it's always fairly..., I haven't seen any differences between whether it's a database course or an information retrieval course or compilers and operating systems. Of my female graduate students, some of them leave to go to work on operating systems, some of them stay and want to go more in the AI direction. So it's not like all the women flock to artificial intelligence and the men go into networks and operating systems; it seems to be pretty uniform. The biggest drop that we've seen is a continuous drop in domestic graduates, and we've been pretty good in that respect. We get a lot of applications, but this year there was a big jump in the ratio of foreign to domestic, and so for the first time we had less than 50% domestic coming into the department. But, you know, there are a lot of foreign women applying as well, and we accept a number of these, so that keeps the ratios up. KM: Gene? GS: I was just going to ask, on your, you know, NLP course, whether you see that as a sort of capstone course or do you have that so that a student can take it without too much preparation? BC: Well in our department it's just a stand-alone course. In other words it covers.... GS: I must assume you must have certain languages and so forth? Systems analysis? BC: Yes. 
It's done in, well it used to be done in LISP but now it's switched over to awk and perl. And so basically it teaches the basics of natural language processing and does information extraction as the main project, so it's fairly surface level language processing that you end up programming, but you learn some of the, you know, the standard picture of syntax, semantics, and pragmatics along the way. KM: Brian and then? BH: What kind of entry level requirements do you have for your first IR course? Are they...? BC: The way it's set up at the moment, well, it was initially, I think it's still on the books as having a database course as a prerequisite. So you've got some idea about information systems before you come into it. We could get rid of that as well; we could do it either way, it's just set up to have the database course, which is a sort of sophomore level course, as a prerequisite. That one could be waived; data structures is the absolute minimum, so that people understand file organization and things like that. KM: Well I don't want to break up the questions but we do have wonderful desserts in the back and those of you that want to take a break, have some coffee, you can formally reconvene at about 3:15.
coffee break
KM: Our last two speakers of the afternoon have duelling information systems, so after Larry finishes, we'll probably be answering a lot of questions while Ed tries to boot up his system and get back on Netscape.
LARRY FITZPATRICK
Okay thanks. This session is ostensibly about curriculum. I'm not in the academic circle, involved with curriculum development in this area. And also it's a bit of a stressor in terms of being wedged between the two academic gurus in this industry, so please excuse me if this doesn't deal directly with curriculum. 
What I hope to do is sort of lay out a few observations from the commercial side, from the product side, from somebody who spends a reasonable amount of time trying to pick nuggets out of the research literature. And basically give you some sort of stories from the trenches, to give you an idea of what you can expect if you want to sort of play in this game at some level with resources, since that was one of the questions that sort of seemed to be the focus. And then last I'll try to give some boo-hiss type opinions, I hope. You can take them as hyperbolic statements. First I'd like to sort of address what I think commercially may be the hardest problems before us, then the kind of resources we devote to this exercise, and then again the research community. (As we wait for the disk to come on.) Basically one of the hot problems is most big online systems are... [request from the audience to read the slide entries] basically the hot problem is one terabyte of data, a 40 character entry window with a submit button, and 1.4 words per query. How can you possibly hope to sift, right? And answer a question. And that's been addressed a little bit before, but I think that there are opportunities to solve this problem via both user modeling and task modeling and in getting some library science knowledge into it, but I'm not sure. Okay, so I leave that up to you guys. I also think a really big issue which hasn't, at least on the commercial side of things, been addressed well is the personal differences in information consumption styles. Some people just like to browse. Some people like to build complex queries. Some people like to pick from a list. And there are very few information systems out there, or search and retrieval systems anyway, that accommodate all those modalities. 
And having dabbled in education a little bit, I have opinions, so, you know, in understanding that there are different learning models for people, it shouldn't be surprising to us that, since information acquisition is essentially a learning problem, there would be different styles, and I think that some focus in this area is needed. And maybe you could point me in the right direction; I just may not be aware of it. A third thing, sort of related to that, is a higher level modeling of well understood tasks. It seems like we're operating at very low levels when we talk about searching. You know, it's like using these operators, constructing these things, but I suspect, in certain problem domains, it's possible to build up primitives of a higher level of abstraction. I wish I had an example to give you. I think I had one but I'm kind of not remembering it at the moment. But I do think that a lot can be gained by studying, at higher levels of abstraction, how people attempt to locate information in certain task paradigms, and translating those from one problem domain to another. Ellen spent a tremendous amount of time on this already, but I think it's also very important: modeling personal and organizational information consuming habits for better filtering. And the statement that I'd make here, and I already made it, is that there is so much information out there already. It's a virtual world. What you see is largely out of your control, so the more control we can inject into the process on behalf of the consumer and his personal taste, the better off we're going to be downstream. And again, you know, I really think Ellen is on track. Collaboration in practice: sort of a big, hot buzz word these days is collaboration. 
Opentext itself has, since it's merged with the workflow document management company, has shoved information retrieval into the mix and has created this new "collaboration" tool for managing the information resource within the organization. Now we're beginning to discover some of the primitives that organizations use. But I don't, I'm not that versed in CSCW stuff but I suspect that organizations, at the practical side, the commercial side, aren't capable in the new distributed intranet of articulating what the right models are for them, and we need to sort of investigate that. Another one which comes back to the visualization issue, on the one hand and perhaps summarization on the other is communicating the information space. It's been known for a very long time that people who understand the information space do much better with search and retrieval activities than people who don't. Yet the issue is with huge information spaces, how do we guide people through them. Can we come up with other ways of representing compactly or getting people to learn models for doing this. I think are very important and the reason I mention them is that I am kind of clueless about them so. Scale. I think that things change when they get big. We can all talk about relevance feedback and 200 word queries but in practice, at several million queries a day on several dozen gigabytes of data given to the world for free, the performance hit is just a little too, too hard to be tolerated and I think that focusing on how problems change through scale is probably a really critical thing. I was fortunate enough to share an office with Donna Harman 10-15 years ago and was involved with building information retrieval systems and she was an evaluation nut even then and she had, you know, major evaluation suites like the Cranfield collection of 273 documents or something like that. Okay. 
And I know that TREC is moving in the right direction in a major way, I mean, there's this huge data track this time, which is sort of experimental, trying to look at 10-20 gigabytes but the problem, the major problem doesn't change. I was involved in a commercial effort a few years ago where we said we have been doing 500 megabyte databases, what's 500 gigabytes? and we scaled three orders of magnitude in the space of about 18 months of development time and man it was, it was like that ad for I don't know, Raymond DB or something, in one of those database ads, the guy on the rocket sled, you know, face peeled back, it was an experience. We learned a lot. You learn a lot from that and I know Bruce will attest to that, with his Infoseek experience. Finally evaluation. Coming back to the issue of how do we know what's good and what's not good and I think that from the commercial point of view (I touched on it this morning) certain classes and roles within the organization of individuals need to be able to evaluate what's good and what's... sort the wheat from the chaff and to the extent that, you know, the academic community can prepare people to do this or provide models for doing this, that's a good thing. So that's sort of like my short list of hot problems. I'd like to touch a little bit on a product company's resources. The first one is clients, we interact with clients, that's incredibly valuable. The hard part though, is clients in general; and I don't mean to make this sound pejorative, but they don't know what they need, right, frequently, yet interacting with them they are telling you subliminally many times what they need and if you understand your stuff profoundly and attempt to understand their stuff profoundly then wonderful things happen. Secondly, we have on the order of 50 to 100 developers in the organization developing applications. 
Developers can be clients for tool technologies and when you're tightly, in a tight community, lots of wonderful things happen where different people try different things, stress things in different ways and you can observe patterns and usage which allow you to build higher level constructs which make the next go-round a whole lot easier for other people. And to the extent that you can do this in an academic environment. I don't know if it's possible where you might have, you know, bodies of students building on the work of other bodies of students over time through a program. I think there is tremendous feedback and learning that goes on in that kind of environment. Opentext has about 6-10 people, I say between 6 and 10 because some of them are graduate students and either part time or on co-op and cycle through the organization and we also have a pretty strong relationship with the University of Waterloo where in fact, we subsidize research. We're setting up programs to subsidize the research at other organizations as well. So if the hard R&D stuff is not necessarily within the walls of Opentext but we've got a reasonable number of serious developers sort of focused on just the IR part and this doesn't count the data base, the test engine people and the workflow people and the document management people, those are the other, you know, 40 to 90 developers. Opentext is participating in TREC in order to do that, you know, the resources that we have are on the order of a developer and a half, ten gigs of disk and a multiheaded workstation of various kinds and depending on the availability it migrates between Alphas and NT's and various other boxes. 
But when we sit down and we talk about this, if we are going to, you know, if we want to participate in the large data track, it obviously goes up which we plan on doing since we think we scale pretty well and if we talk about the kinds of things we would like to do we could easily envision putting a lot more people on this, on this kind of an effort. The last thing, or maybe not the last thing. I think it's the last thing is production systems. Opentext runs a web index. It really is run out of the marketing department and it has generated a tremendous amount of useful information. I mean, I said databases, you know, data sizes, data [inaudible] on the order of 100 gigabytes but that doesn't account for replication in order to get throughput there need to be replications involved so data [inaudible] on the order of a terabyte, multiple site stuff starts coming into play. Ten nodes, you know, big machines with multiheads, 40 processors kind of stuff. You know, if you're going to play in this scale game, I mean, there are serious resources that need to be committed. And echoing what Bruce said, IO is a major one there. Based on sort of the nature of the problem, there are special IO boxes out there that make this problem a lot easier to work on but they are insanely expensive. I mean they're like, huge arrays of 400 gigabytes, you know multi-six-figure kind of cost things. But in production, that's what you end up using so you need to play with these things. And I don't know if relationships with an industry would work well. I do know that some of these vendors of these things are desperately seeking validations of the effectiveness of those devices and markets for them and to have a proof that, I guess, I'm not saying this very articulately, what I want to say is, if you can provide them a reason for selling more boxes, then it's likely that you can get their equipment for very little or no money. But then you probably already know this. 
Some perspectives on the research side of things, I think I would encourage some of you to think about what if (and not that you don't already), but what if resources were unlimited and to a large degree in order to play those what if games, you kind of have to have unlimited resources. Again, which means you know, big, big machines, multiple processors. I think we know fairly well that you know, multiple head machines are becoming much more common, big memory spaces and huge IO spaces, are becoming more common. And one of the things that I noticed in sort of cruising the literature is that lots of people are messing with complex technologies like clustering or LSI or some of these other things but very often the algorithms that are developed aren't geared towards sort of like the multistage architectures you need in practice. Like in a production system, you can't just say "oops out of memory." You know, go buy more memory, you need to have fallback strategies and that changes the dynamics of algorithm building when you've got to deal with fixed amount of memory, fixed amount of disk and maybe slow media, somewhere else in the equation. Cross-pollinate. I think that for a long while back in the middle 80's I was much more close to sort of the IR literature and then dropped out for a period of about five years and dropped back in about a couple of years ago, two to three years ago, and noticed a radical change. There are a lot more people playing in this sphere. And it was a very, I think it was a very, very good thing. Coupled with TREC and some of that stuff where lots of people are getting thrown together, there has just been this heat generated out of all of these things. I can remember my years at National Library of Medicine, we had computational linguists on one side of the floor and IR people on the other and these people didn't even eat lunch together. 
I mean, you know, there was just no discussion going on and then when I dropped back in a few years ago, I was surprised to find that there were computational linguists attending the IR conferences and, in fact, IR people going to the computational linguistics conferences and you know, we're endowed with having some of the people that spearheaded this in the room with us. I think that I hear sometimes from certain communities that, you know, why don't people just, you know, like learn what was already done. Well I think there's an obligation to teach people what was already done and part of that obligation is the transfer of what is essentially paper knowledge into technology. Because that demonstrates it more effectively than just about anything else. I mean there are I don't know how many papers, anybody know how many papers there are that relate to IR going back? 10,000? I mean, it's enormous. I mean people can't get their head into that. But people will play with software. So I think that transferring the past into today if it can be done through software vehicles is actually a good idea. I would like to sort of encourage the notion, and I don't know how feasible this is, of treating some of this stuff as big science. I have a brother-in-law who is a physicist and he gets involved in projects that last seven years with 60 researchers and some kind of collaboration where you might have, you know, create this mega architecture, right, that involves some computational linguists doing their thing and some IR people doing their thing, some data base people doing their thing. That spans multiple years, might yield some rather interesting fruit. And it could be done given our interconnect now, not in the, not with the kind of funding vehicle that physics researchers need but perhaps by partitioning the research. 
I was encouraged when I read the Drexel Home Page to look at this multi-disciplinary program and I keep forgetting the acronym for it, MY or something like that [MISE] where you know, something like that could be the beginnings of a big science type of approach to doing something like this. Obviously, setting what the problem is to be solved, might engender a lot of argument. It may be places like CIIR are already engaged in this and I just don't know it but the cross-discipline, the cross-organization stuff ought to be part of it. And the last thing is, why is there such lousy freeware with this stuff. I know, you know, you go out there and you scout the net and I mean you can get great compilers for free, you get great DBMS packages for free and most of the IR systems that are out there now, Bruce told me at lunch that Inquery is available to researchers for free but that doesn't solve the commercial problem where folks want to get educated about this technology and will do so at a very low cost. On the other hand, we also discussed the fact that there are lots of commercial people out there giving away commercial grade software for free too on the Internet. The likes of Excite and Verity and so forth. So, but I think that producing things that are distributed through the internet free software vehicles as a byproduct of work, is pretty useful. I know the MG stuff is out there too but and that's kind of a first. Before that there really didn't appear to be much. Question? LF: The Managing Gigabytes people, Witten, Australia? BC: That group. LF: That group. The Australians in Australia. BC: In Melbourne. LF: Okay thank you. KM: Questions for Larry? I want to ask you kind of the same question that I asked Ellen, as you are also in industry. If you were hiring people to come and work for you, you said that you would get co-op students from Ed's group at Virginia Tech and maybe you'd like to look at some IST co-op students somewhere along the line. LF: Yeah. 
I very much would. KM: What are you looking for in terms of the kind of background that people who come work at Opentext? LF: It varies but being a product company people with strong algorithm and data structures experience. They tend to be rare. We talked about this briefly at lunch. (I think it was at lunch. I don't quite remember when.) But one of the issues with that is that a lot of the good data structures and algorithm people tend not to stay in data structures and algorithms. It kind of seems like a blocking/tackling kind of life right now; they want to be one of the wide receivers and build user interfaces. So, but there is definite value in it. And I think that folks who come through co-op programs in my experience have much more use to a commercial organization in the sense that there is less of a training curve coming out and they tend to have more of an eye on a market sensitivity as opposed to "oh I implemented a new memory B tree once and so I know data structures" kind of thing. If there were sort of human factors type folks I think that there is tremendous value in that. KM: How about software engineers? Maybe, you said that already thinking about programs like ours or other programs that are SEI-influenced. Do those play any part in your part of the world? LF: Oh they do in a company like Opentext although I'm not sure what you mean by software engineers? KM: I'm not either, maybe Lee can tell you? LE: Oh no. Greg... Hot potato. GH: People who are trained in clearly-defined procedural approaches to... LF: I think most of the companies that are selling product in this area have come to the conclusion that professional services have to be a large part of the equation. And people with that set of skills who are versed in this technology would be very valuable. 
To populate the professional services because a lot of times what happens, you go into an organization, you've got a tool technology, they have a problem, they don't need the tool; they need a solution and they are willing to pay for it and they need people who can take the tool and turn it into a solution. And so software engineers in that capacity would be extremely useful. In fact, Open Text is engaged in actively buying companies who do this because we can't find people. LE: This has been a kind of ongoing concern and I'll try not to beat a dead horse. But when I hear conversations about how the users don't know what they need... That's entirely untrue. Users have a lot of concepts about what they need, they don't express them in the terms that computer scientists necessarily use. And it's not their job to do that. If they could do that, they wouldn't need us. LF: I think that's a fair statement. LE: Yeah, and so what we need, we need a lot of different skills: some people have very deep knowledge about the construction issues, architectural issues from the software and hardware standpoint; we need people who have the ability to understand the problems in context and apply the options. LF: I think one of the reasons that you might be hearing that, one of the reasons why I find myself saying that is that I tend to see... END OF TAPE 4 LF: If you really looked under the covers, what he might need is an automatic tool that takes it out of Word and dumps it into HTML pages. So, this guy goes off and builds what the user asks for. And I think that some critical questioning on the part of people interacting with clients is pretty key. LE: It's not clear where we put that in the curriculum. That's been something we've been talking about. It doesn't fit easily into the disciplinary curricula at all. NH: Again, I come from a statistics background and what we ended up doing is biostatistics or any statistical consultant has a lot of those same things. 
I mean, in fact, I will tell you that the worst possible client is the one who comes in supposedly speaking your jargon. Because first of all, they're sure they know everything there is to know and you should just go do it, and secondly they misrepresent it because you hear your jargon and you interpret it in certain ways, not realizing that they really don't understand what they're talking about at all. After a while, you do know that, but, at first, you're naive. You know, one of the things we ended up doing is we had something called the Biostatistics Seminar and all it did was provide students with faculty opportunities to do consulting with real clients. And that was interdisciplinary but within a controlled environment. It wasn't totally interdisciplinary. It wasn't something I've seen that much at Drexel, but it really worked fairly well and was done as a seminar course, but the thing is that -- one thing I'm beginning to see is a sense that I don't think we're going to be able to teach students everything anyway. I feel like I came out of my doctorate and I learned a tremendous amount on the job and I think that really, what you have to create, is somebody who knows a little bit about what is out there that they could learn more about and an attitude about it that that's what their life is gonna be. LE: Well, yeah, I think you're right. I think they're gonna at least -- this is gonna have to be part of the way they understand their role in the future. At least that's the path. Now, not everybody in information technology is in that particular path. For a lot of reasons, like... 
LF: One comment I didn't make in the slides which sort of reverberated around for a while is that in the context of teaching all of these very specific techniques and skills is that if I find a developer or a candidate coming out of a graduate program, the first thing I look at is his math background because this industry is getting -- this field is getting much more complex and strong mathematical skills in terms of how they -- understanding mathematics, being able to read, having math involved in some of these papers and being able to translate them into data structures and algorithms is insanely valuable. There are very few people who have that capacity, I find. And end up coming into the product industry, at least on the east coast. And again, keep in mind that I see probably a very tiny slice of the world. GS: This may be a simplistic question but, in your potpourri of what you would like out of a graduate, we've talked about programming languages, what do you think would be good or do you see all of them as transitory? LF: I think teaching design skills with latest methodologies are far more appropriate than any particular language. I was involved -- I put together a course for the UMUC at University of Maryland years ago, on object-oriented technology, taught a couple of seminar courses and eventually they asked me to turn it into a regular course in their degree program and they insisted on doing it in C++ and I did it against my objections. It was the worst thing for these people, although it did train them in C++ skills, but it didn't train them -- they spent more time worrying about the arcane syntax and edges of that language than the design problem which is what they really needed which were the skills that were gonna carry them five and ten years from now, rather than C++ which is probably fading in popularity right before our eyes. As a sense, things like Java, actually which solved those problems, becomes more popular. 
KM: Mentioning languages makes me think about something we -- everybody's been pussyfooting around so far around what I like to worry about so I'd like to mention it which is content. I didn't hear anybody say you need content experts, really. I haven't heard you say ... EF: I'll do that, I'll do that. KM: Okay. I've got three of them pinned down so far, Ed, so I want to pick on them right now. Then there's a "yeah, well, you library science types, we kind of need you too or something like that," but where does content fit into it all? LF: I think the challenge for me was since that community... since not many people outside that community understand what it does either, okay, it behooves that community to inject itself into the process. Go figure out how you can help. And if you can prove your value, then it will pay back. I have a deep suspicion that there's tremendous value there that's being untapped. I can't articulate what it is. We were talking about this after lunch. There's got to be some value there, but I can't -- maybe the needs analysis side in some areas. KM: It's a sort of late reaction to what Bruce was saying. He had this nice little suite of IR courses and then there's this natural language processing course. What about the controlled vocabulary side of the world? I mean, what about that kind of content representation? Do you feel that information retrieval students need to worry about that? (inaudible voice). KM: Salton doesn't rule. I'm sorry. BC: I mentioned in the first talk that controlled indexing needs to be -- there's lots and lots of stuff in digital library applications, multimedia stuff where manual indexing's gonna have to happen and the companies are hiring indexers now to do these things. But, I also agree with your comment that when you -- as I said, when I go and talk to these people who supposedly know all about image indexing and controlled vocabulary, they actually know very little. 
And they haven't attacked it in a very scientific way. They can't show me any results they've come up with over the last 30 years that apply to how to index images and so I think it's up to that community to inject itself into this process which is -- starting out now -- which is how do you make manual indexing. Absolutely manual indexing is needed and there's some sort of controlled vocabulary is obviously a very important aspect for some of those applications as well. So, you know, that needs to come into the development of digital library systems and other types of modern systems as part of the process. It also means that software tools need to get built with the input of indexers because you can't expect computer scientists to sort of magically osmose -- what it is, that you need to have a controlled vocabulary tool to support the development and use of a controlled vocabulary. There needs to be some of that going on. I mean, Yahoo is probably the most successful controlled vocabulary recently and that was done just by a bunch of yahoos. KM: No, I think there's a librarian... BC: On OCLC, they're trying to use -- they've been trying to stick the Library of Congress categorization in as a substitute for that, but they're too far behind the curve already. Anyway, that's an example at least. At least you got a very strong example of some people still really like to go down their controlled vocabulary for searching even when you've got all these search tools around. They'll still go down that controlled vocabulary. A lot of people do. So, it obviously appeals for browsing and looking at information, but it needs to be brought up to date in terms of how do you incorporate the whole process of building these types of systems. HW: Bruce, they don't call it "controlled vocabulary" at Yahoo, at Yahoo they call it Ontology of Data Resources. The guy that runs Yahoo talks about the ontology of Yahoo and he means like the subject categories: medicine, law. 
BC: It's hierarchical. LF: The business card of their librarian, her title is ontologist. Which is not to be confused with ornithologist. BC: Sounds like an unpleasant medical specialty. KM: Okay, thank you. EF: Since we're overtime... KM: No, we're fine. EF: I have to react to some -- is this working? BC: No. ED FOX: Is there a switch? I have to react to some of these last statements. I certainly have, in experimental studies, shown that combining manual indexing with automatic indexing gives better results than either one of them, so I'm very much in favor of these kinds of schemes. However, I think it's somewhat dangerous in that many of the library science community people haven't made this jazzy to appeal to people and their disciplines. The courses have these old titles and no one takes them from outside. Virginia doesn't have a library school, which is an interesting statement. So, that means we have to work together in some ways and we have to solve some of these problems. And just as an example, this notion of agents that we were talking about before, the concept of teaching people about content, may be the way to do this is to tie it into building an agent for that particular domain area. So, that's jazzy and that's something that will get people excited about it. And it'll also teach them the concept. So, it's a practical application instead of just teaching the domain area. Anyway, so those are just a few miscellaneous comments. I want to talk a little bit about our education infrastructure project, which is essentially an effort to apply digital libraries to improve education. But, before I dive into that and show you some of the courseware, I want to mention a few related things. In Arts and Sciences we have what they call Cyberschool, which is a very good name. 
It gets a lot of publicity and really the implication here is that we had this faculty development initiative that has trained lots of faculty so excited about these new technologies, are trying to put them into their courses. So, instead of starting where you have to get pushed into this and otherwise you can't deliver your course to the other side, you're wanting to improve the course locally to the residential students and in doing that, in developing asynchronous learning kinds of methods, automatically it becomes available for distance learning so that's sort of the route we've been taking and that's a valuable point and that ties in with networking and applications to education. So, that's some of the context of things in which the things I will talk about relate. There was a question about computing infrastructure. At Virginia Tech, we follow Drexel fairly shortly -- '84 engineering, '85 computer science required computers so we're sort of at the same bent on having our students with these machines and so we have this elaborate infrastructure with dorms wired up with Ethernet and the village around so people in many places in the community can have Ethernet connections and there's a qualitative difference in the learning experience of our students who have Ethernet and a fast computer and access to all this web of information. It's a significant difference and we've done interviews and focus groups and other kinds of things. So, we're going towards that and we should predict in our new curriculum to move into that kind of domain. So, we have that as part of our infrastructure. On the other side, I want to point out that we've engaged in the last year in particular, in significant evaluation studies and I think that has to be a part of these kinds of programs that we're developing. 
Because we're seeing an enormous influx into the education world of this kind of information and these kinds of knowledge and some of the people that we're training will go out and be helping in this education activity. That's gonna be a new area of occupation for them as well. And they need to learn evaluation methods. Not only for the sake that we've already talked about and there are several parts of that. There's -- if they're dealing with education application, you have to understand education evaluation. Do people learn more effectively and how do we measure that qualitatively as well as attempt to do it quantitatively. Usability evaluation. We know that that's a fairly well refined science and here at Drexel you have an HCI group and people involved in various those kinds of areas, too. A third area I want to add in that part, which is something we just started doing last year, and it's very exciting, is taking server logs -- that's sort of the first step, but we have a better method using network packet monitoring and other schemes, proxy servers, to determine client behavior. So we now know what our students are doing when they use the World Wide Web to access our course materials and we monitor them in labs, they're taking quizzes. I can see how a student does a quiz and we found enormously exciting support for people learning in different ways. The person who's a good researcher and when they're given a problem, know how to find the answer, do really well. The person who has really done the course thoroughly, understands it, doesn't even have to look up stuff. They know the answer very well, too. So we're beginning to get these kinds of results and these kinds of evaluation studies that involve monitoring, modelling and simulation will allow us to build new systems. So I argue that those disciplines also fit into the paradigm of some of the people we're trying to build. 
A scaling problem which has been brought up all today, people don't know how to deal with that problem. We're just trying things, okay. It's a really terrible approach to do that. We have scientific efforts that will help in that area. Okay, now, let's talk briefly about this education infrastructure project. We have over 25 courses on the Web, over 4,500 Web pages. Here's pointers to all kinds of project information and so on. Let me briefly look at some of those to show you -- and I really encourage all of you at whatever institution you're located, use this stuff. Connect into it. Give us some comments. Let's share together. I think this is an exciting opportunity. Yes? LF: Give us a start point. EF: Sure. I'll show you all that later. And I'm gonna leave this whole packet -- I've already delivered this -- I'm gonna leave this whole thing, but I'll do that in a second. Actually, the easiest thing is FOX.CS.VT.EDU. That's my workstation and that points to everything, so, you can just start from my name. We have this 1604, which is analogous to what Bruce said before, an Internet course, which we've been offering for a while and that has components in it of digital library and information retrieval so I'm helping with that course. And that's open to the campus and wildly excited people are involved in that, so, this is another way that we can improve the visibility of IR in the CS world. If we start to get people to see this as their first course, or early on in their careers. We've decided that it's important to have IR present in the undergraduate curriculum and so this course, Multimedia, Hypertext and Information Access, is a senior level course which we teach instead of the database course that is often taught in other universities. We may cycle it with database sometimes, but we actually cycle it a little bit. But this course is something that I encourage you to take a look at as well. There are over 300 Web pages for this. 
There's an enormous list of tools that they get exposed to. So, there's this long list of things that they've become familiar with. Systems we've written, systems available on the Web. And this is another area to work with industry. We need to have more collaboration with industry for them to put up demo versions of their things. Many of them already do this, but more detailed. And actually instructional materials. There's no reason that Opentext and others couldn't put up nice instructional materials. I've done some of my own here so we could collaborate on these kinds of things as well. But to show how does Opentext work and where are the algorithms and so on. That would be quite invaluable for all of us and it would help the customers understand it, too. So, those are some of the examples of things we might want to work towards. So, computers and tools and I believe we really have to liven up our education. I go on field trips. I take my seniors to the library to see the information resources. I have the librarians show them the CD-ROM searches. Many times, that's the first time they've been in the library in several years, if not their whole time in the university. I mean, these are seniors. So, it's sad, but this is actually the reality, especially in areas like computer science. I take them to media centers and other places around campus and this really helps widen their experience. So, I think we have to view this as maturing our students, as well as projects -- I've already talked briefly about projects. And so people, just to give you a sense, students are actually developing computer literacy materials, tours on the Web and in video of multimedia labs, developing virtual realities for the campus. These are exciting things and the kids really love this so they can get exposure to many of these disciplines. You don't even have to teach them very much. They'll just dive into this and get exposure and pick up things along the way that's very valuable. 
Okay, so I guess what I should do is I should show you what's the content. You're probably interested in that. The on-line page is a little bit better organized than this. This is one I had on my disk here. But the course has an introductory component which gets them involved in additional libraries in other areas. And I try to break it up in a number of ways. There are notes, there are exercises, objectives, study questions, issues that are dealt with, readings, and there are readings on the Web, there are readings from textbooks and then each unit has systems or demonstrations. There are commercial packages or demos I produced of various kinds, so they get to see these concepts reflected in real tools and they work with those as part of the exercise activities. So this course, Multimedia, Hypertext and Information Access -- BC: What textbook do you use, Ed? EF: I use a multimedia textbook, lightly. But the rest, there's a lot of course material online. A lot of material's online for most of the course. And I've put a lot into this. So, there's a unit -- and I thought very deeply on how to organize this. There's a unit on application construction. There's a unit on capture and representation. There's a unit on models and, in this case, compression. Presentation and interaction, networking and communication. So, that's how I organized this course. It's a different kind of organization in many other areas, but I'm trying to cover these three fields and I think, even in CS where there's so much blockage to do anything new, to couple Multimedia, Hypertext and Information Access or IR kinds of stuff into one course, there's enough legitimacy to that and enough excitement to that. This sells out. I have higher enrollments than almost any of our other senior courses, even though it's an elective. So, I think we can win the battle if we work this right. Bruce did talk before about textbooks. 
I'm pleased to announce that Morgan Kaufmann is moving along very well and if you do want to have a readings book in IR, let me know and we'll get you involved in some of the early testing of a new book that's coming out in that area. Okay. So, I have that one and then I have my graduate level course on Information Storage and Retrieval, which is similarly laid out. These courses are almost actually ready to be self-study. All the exercises and quizzes are on-line and they're graded by e-mail so this is widely available. (What I'm trying to do here...). KM: I notice you have a northern Virginia campus. Do you do a certain amount of distance education with your courses? EF: Yeah, I taught this in distance learning mode in the fall last year. Both places, on campus and in northern Virginia. KM: It helps that you have everything on a Web page. EF: Yeah, it really did make a difference although there were still technical problems in this field. So, here are the units for this course: digital libraries, sort of general information retrieval, the old inverted file, Boolean, string searching using PAT-type approaches, PAT trees and so on. Clustering. A unit on SGML and document translation. I think that's something we often forget and that's crucial. It's feeding into all of this. Students have to manage documents. Hypertext, Multimedia and a unit on knowledge-based information retrieval. So that's how I've carved this up as far as an IR course. I mean, one course, I try to cover this and make it exciting as much as I can. Now, let's go back out of my stuff so you see these lots of other courses there. And I went briefly through the Web and I want to point out a few points from that. 
Edie's still..., I went through Pitt -- and as I said this will all be on the Web so you can browse through all of it and you can find all of it any other way, but I tried to quickly put together an assemblage of a number of places that have IR courses and to put it in the context of the different programs. So here is Pitt, the school, the lab stuff, the graduate program and course information. So, here's a course on information retrieval, which I don't know who teaches that one. ER: I don't know. EF: And then another one on on-line retrieval. And then under the graduate library science courses, here's another one on information search and retrieval. From Rutgers, from their MLS program, here's one on information retrieval theory. From Glasgow -- Glasgow started an interesting notion where they have a course, which they mean a kind of program with different strands which are like our different specializations. The different degree programs you might say. And then modules which are components of courses and then outlines for those. So you can go through those in more detail. Here's Bruce's courses so I'm making it available for the group here. (You need a comma there.) So, there's an IR course, here's one on topics. From Michigan, they just recently reorganized. Actually, they have what appears to be one of the richest collections of courses. I don't know if these are being offered or they're all new, but here's Concepts in IR, Image Databases, Impact of New Information Resources, Multimedia Networks. Amy, do you want to -- you can pitch into here. Information Industry. Internet Resource Discovery, Reorganization and Design, Making Digital Libraries. So quite a range of courses that if you think towards the future, they've done this in a large..., so I encourage people to take a look at some of the programs, they may be of appeal. Also, Chapel Hill has quite an interesting collection. A lot of courses there. There's -- this one is Information Retrieval. 
Here's Information Systems Effectiveness. Cluster Analysis in Classification, Retrieval and Communication. AI in Information Retrieval. Seminar in Information Retrieval and Research in Information Retrieval. So quite a rich collection of things being offered there. UNLV -- and this is just sort of by way of contrast -- has a course on Document Image Understanding. So you see kind of the range of what people are really offering right now. We don't agree at all, do we? We just do radically different things in all these places. If someone wrote a textbook, would anyone buy it? BC: Absolutely. EF: What's the market for this? I think people buy readings books. I'm not sure about textbooks. I guess they will. Now let's go abstract a little bit and talk about general preparation. And this, in part, comes out of that earlier study I mentioned, what they call Information Engineering. We don't have a good name for this. Do we call it Informatics? Do we call it Information Studies? What do we call this? But here are some of the key things. People need to have skills for problem solving, for analysis and diagnostic skills, specifying user requirements, design, and there are all kinds of new innovations in design that are important for people to convey. A scenario-based design is done at my place. And revising Systems, Synthesis, Integration. We talked about system integration, but building systems, too. And then there are all these pieces of collaboration where in the world today, we're building things in teams and in projects. Working in groups as teams, communicating, discussing, consulting, learning together. Which is not new in education. Revising, iteration, CSCW. How to manage collaboration activities? And ones that are interdisciplinary or at a distance. So these are all things that we need to be able to cover because that's gonna be the reality of the work place. From this information engineering program, here's a few points that might summarize and be valuable for us. 
First of all, there's this terminological problem. What do we call ourselves? We face this identity problem. Here's aspects from the European Union of Research in electronic publishing, information dissemination in IR, here's some bullets on the IR research components. Talking about design, implementation and management. Then from this actual workshop that was run and after a long period of time, a report came out and there were a number of dissemination activities. This actually does provide kind of a framework and I'd like to spend a little time on this component. People need problem solving and design ability. It's really crucial for going out into the workplace and industry. They have to understand systems and software engineering, the life cycle of systems and usability concerns. They have to understand objects. That's a key thing. The whole paradigm of it. The programming, the databases and that -- they have to understand architectures. Because they're building things -- we have an information architecture, we have a network architecture, we have a system architecture, we have an application architecture. So we have to know how to put things together and the concepts, the models, the frameworks that these function in. We've already talked about data and information bases and management, evaluation, HCI and I think here, from a number of discussions from this group, here are the core things from CS that people felt needed to be conveyed. So they have to focus at some point. We can't just say we do everything. Here are some of the core areas that were dealt with. So data structures, software design and methods, designing algorithms, computer organization, architecture, operating systems, programming languages, do that in different levels and maybe even with scripting of Java or other kinds of things. On the general background, math. Many of the universities say just take a science course, but lab science is important for people to have that kind of background. 
And we argue that we need some kind of field experience with people and human behavior situations so that was another suggestion. So taking a psychology kind of course ... BC: Like visiting the mall or what? EF: I'm sorry? BC: Like visiting the local mall? TC: Yeah. EF: Social psychology field studies perhaps. Communication groups. Project management. Now, to think of the overall program -- and here's sort of a good way to end this section -- there has to be kind of a core knowledge. There needs to be some kind of a practicum and we're gonna put people out in an applied kind of field. They have to have some kind of practicum. They have to understand about organizations, management and the human issues on that. They have to understand systems concepts and the last thing, as we said before, they have to have domain knowledge. Whether it's in health or education or whatever discipline. So, in many cases, the practicum and the domain knowledge can get coupled together. So they get experience in that, and it's driven in a practical sense. Now, I argue that we need to be careful and talk about multimedia. We're moving into content-based retrieval. There are funded projects now in this. There's a fairly good amount of work. We can't do very well, but we're making some progress and we can have some initial results in that. We're moving gradually or more slowly towards object recognition. People can now recognize faces in these kinds of video streams and images and so there will be some automatic production, as well as the manual stuff we've been talking about. How do you take the video streams and segment them? Combining analysis from video/audio/text. We're beginning to see that people need to understand the new innovations in speech and understanding, how that can be applied to analyzing video and indexing it and searching later on. Using context. We have to move towards theories and this last... (end of Tape 5, side 1) We have a building going up. 
A $25 million advanced communication, information and technology center where the groups -- and this is absolutely amazing -- on the campus, engineering, education, information systems, arts and sciences, they actually said "we want to work together. We want to be on the same floor. We want to be near you." So, we have labs in all these areas. Visualization, animation, multimedia, human factors, HCI and so on -- and education technology -- they want to be together and work together on some of these and actually have a building that will break ground next year. So multimedia is something else we should cover -- at least at some level, and I argue that and that's what I'm doing in that course I mentioned before. And finally, there's this new, integrated principle that's maybe a rehash of some other things from before but it's caught the imagination of people around the nation of working on digital libraries. So, it's a good way to bring things together and to bring us together in this area, focusing on supporting tasks and so one of our projects in digital libraries was user-centered database from the CS literature. We face interesting problems in naming, archiving, preservation. The library world has a lot to contribute to that. And the rest of the community desperately needs that assistance. We have problems of naming. URLs to URNs. Handles, PURLs and so forth. Again, the library community has a lot to say in that. And we talked before about filtering, we talked about classifying and so on. We have issues of distribution and we're radically changing that whole paradigm. How do we deal with repositories? Some of this is from the computer science end but that should be touched on as well. The notion of scaling is crucial. Interfaces and visualization and there's a strong commercial angle now. IBM, Xerox putting a lot of resources into this. As are many other new companies. 
Rights management is a key thing and so here we can tie in with business and other kinds of activities. We're really rethinking publishing and libraries. So, I think that's a good point, too, and I want to make a couple of short, other comments that -- another piece we haven't mentioned is an IR bibliography. That's something for us to think about putting together. And finally, we need more collaboration between the library and information science places and the CS places in this field. Bruce and other CS-type of institutions have been moving aggressively and doing a lot of publishing, a lot of development work. But when you actually look at the numbers of faculty who are doing information retrieval, who are teaching IR courses, a large percentage of them are in library & information science programs. We have to bring ourselves together. We have to understand and work on curriculum issues together. So, that's why I'm so pleased that this meeting is taking place and I hope that this will just be the first step of many that we'll work together on. Thank you. KM: Before there are questions for Ed, I want to say something about the agenda. Three other people looked at this agenda, including Tom Childers, who modified it. Me, who assembled it, and Karen, who Maced it up nicely, and somehow, I managed to say that Larry and Ed were going to speak in a half an hour. But now this time has stretched out as appropriate, so I'll ask for questions for Ed or general questions and then maybe spend a few minutes wrapping things up because we did promise people we would be done by about 4:30. I just wanted to say that -- it's my fault, basically. But questions for Ed. BC: I put everyone to sleep. EF: It was the dessert. GS: If you could pick one thing that's significant about what our students need to take away from these programs, what might it be? In all of your curriculum researching, is there a void? EF: Is there something that's missing from all of them? GS: Yes. 
Or if that's not a good question, then what would be the key course or the key kind of knowledge that a student would get out of the program? It's on two ends of the spectrum. Absolutely essential and we don't do it yet. EF: One thing that doesn't happen at all, anywhere that I know of, is for library & information science and a CS group to work together and share in their learning experience. To have a team where you have people from both sides who solve some kind of problem together, that doesn't happen anywhere that I know of, which is a criminal shame because there are institutions that have departments in both areas of the same institution. I haven't heard of a single case where that happens. Maybe it does. So that's the one thing that I would say. Working together and having the students understand this experience. Why should a CS person understand library and information science? Why should a library and information science person understand issues in CS? How could they help each other in that? If we could solve that problem, we'd get our own act together and the rest of the world would understand more effectively. So, that's my one comment I'd like to make. Yes? LF: As an industry representative, what can you hope industry will do to help? EF: Well, I think that's an important area and I'm glad that you said that. That's important. Bruce had some important comments before in this area, but I'd like to add some others. One of them is this enterprise has to turn into a working group of maybe ten people over a couple of years who are serious about forming curriculum for this field that will be widely accepted. And we have to get to that stage so a serious commitment from several people in industry and government to be a part of that process is crucial. So I would say that. Secondly, I would say that we need to have a new paradigm of collaboration. The Web is something that will enable that, to a large extent. Where we have smaller modules of things. 
The courses I have here are broken into sort of week long or sometimes two-week long units. With a number of exercises and activities, there's no reason that we can't have the demonstrations and simulations, the visualizations, the exercises, that we teach in these courses all around the country, created by people from industry. To reinforce, to revitalize, to expose our people to the practical stuff. Let's start to work closely in this regard and I think it'll make it a lot more relevant and more exciting and prepare them more for your needs, too. So, two things to suggest. KM: Any comments or questions? Okay, thanks Ed. I want to just say, I was making a list and Ed touched on most of these points, but it seems to me that there are five things that came up. Information retrieval is too important to be left in individual departments and disciplines. It's too important to be left to computer science. Too important to be left to information science. It's too important to go up in MIS, so we've got to find ways of working together and, of course, Ed just said that so I don't need to say it. We need -- our graduates need vigorous multi-disciplinary training in order to be successful, whether it's Bachelor's, Master's or Ph.D.'s. We need to orient the students and the faculty. We need orientation for collaboration. Ed just said that, too. We need real world problems, tools and resources to educate the students. In terms of size, in terms of modelling or having access to what's out there now. Not just a question of getting co-ops or internships, but having access through the Web or industry making demos available or whatever. And Larry asked the last question, which brought up the last point that I had, which was: in order to be successful, we need support. From industry and government. We need funding support. We need that, and access to resources. 
And it may be that with some kind of coalition of people from industry and from the various parts of academia we can work toward that, and maybe this is just the beginning of that. So, I want to thank you for coming and since the Dean was the one who gave us the money to do this through the Kellogg Foundation, I will let him have the last brief word. DL: Well, I'm gonna be even briefer than usual. I had very high expectations for this meeting and they've been met. They've given us a lot of things to think about and I think people on all sides of the different tables that we might have from a disciplinary point of view have been excited by. Now the culmination of things really excites me. We have presentations for presenters, for our visitors. All out-of-town people? TC: Yes. All out-of-town people. DL: Are they all the same size? I'll show them off and then everybody can come up and get them. You can't show one off. Thank you very much. I really appreciate it. Let's have a round of applause. End of Conference IR 2000 workshop/symposium pt. 2