FURN87 Article Summary, Unit KB, 5604n

From: (Group 5) Shirley Carr, Mike Joyce, Zakia Khan, Vas Madhava

The Vocabulary Problem In Human-System Communication

Most computer functions require the user to type in keywords, but all too often users choose the wrong keyword. This problem arises because people use a wide variety of words to refer to the same thing. This is the vocabulary problem.

When it comes to keywords, the word chosen by the designer most often becomes the "official" one, and everybody is forced to use it. This is especially true in manually indexed IR systems, where the terms the indexers use are not the same as those used by searchers. This is why intermediaries are often needed between the users and the data.

Experiments in this area were done by collecting data from people in 5 domains:

1) Editor - Words used by typists to describe editing operations.
2) Decoder - Words used by system designers for a message decoder.
3) Common Objects - Words used by college students for various objects.
4) Classifieds - Names for 64 classified ad items.
5) Recipe Keywords - Words for a computerized file of recipes.

Analysis of the terms people use shows that they follow a Zipf distribution, i.e. a few words are used a lot and the others are used very little.

"Armchair naming" is the term used when a designer picks a name and everybody is forced to follow it. It is highly unsatisfactory because the probability that the term agrees with what people actually use is very small (0-10%). This is true even if experts are used to come up with the terms.

At first glance, it would seem that the most popular term would be the most appropriate. It is better than the armchair method but still fails 65-85% of the time.

Results from the analyses show that there is no one good name for most objects. The only hope is to create guidelines for choosing a good name, especially one that would be easy for unfamiliar users. One solution is to use aliasing.
But results with aliasing show that many aliases are needed to cover all attempts. Even with 15 aliases, only 60-80% of the attempts will be successful. Unlimited aliasing could be tried, but having too many aliases would create ambiguity (and precision would decrease). The cost of aliasing also needs to be considered: what if ambiguity resolution caused the execution of a wrong command? Finally, it is very tedious to build the alias list.

The best ways to resolve this ambiguity are to:

1) Make the user memorize precise system meanings.
2) Have an interactive interface where users can refine their selections. One good way to do this is to return a list of choices, possibly ordered by frequency.

Adaptive indices are another option. When users fail to get something, the entry is saved, and when they finally succeed, all the failed attempts become future aliases for other users. Unlimited aliasing should only be used in help facilities.

================================================================

KB Article Summaries by Group I: Fitzgerald, Kalafut, Klein, and Muhlenburg.

"The Vocabulary Problem in Human-System Communication" by Furnas, et al.

The Vocabulary Problem is that users seldom select the same words when trying to refer to a particular concept. In fact, no single access word, however chosen, can retrieve more than a small percentage of desired output. In information retrieval, indexers seldom choose the same words as researchers (the users).

A study was conducted over 5 different application domains to find a solution to the Vocabulary Problem. The most frequently used words for a particular concept were usually very few in number. "Armchair" names, even those provided by supposed domain experts, were still of little use. The best possible name was a little better but still had a failure rate between 65 and 85 percent. 3 "armchair" aliases performed about the same. The 3 best aliases did show some improvement, so multiple aliases were proposed.
The apparent solution is unlimited aliasing, where the nature of unlimitedness is system-dependent. Even though precision could be lowered, extensive aliasing was found to be beneficial. The problem of disambiguation was handled by presenting a list of choices to the user for selection. Finally, it was noted that performance could suffer when there was incomplete data.
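
Both summaries describe the same two mechanisms: aliases that are learned from users' failed attempts (the adaptive index), and disambiguation by showing a list of choices ordered by frequency. The sketch below is one possible way to combine them in Python. It is not the implementation from the article; the class name AdaptiveIndex, its methods, and the sample command names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class AdaptiveIndex:
    """Toy alias index that learns from failed lookups (hypothetical design)."""

    def __init__(self, official_names):
        # alias -> {target: times chosen}; each official name is its own alias.
        self.aliases = defaultdict(lambda: defaultdict(int))
        for name in official_names:
            self.aliases[name][name] = 1
        # Words tried without success since the last confirmed choice.
        self.failed = []

    def lookup(self, word):
        """Return candidate targets for word, most frequently chosen first."""
        targets = self.aliases.get(word)
        if not targets:
            self.failed.append(word)  # remember the miss for later adoption
            return []
        return sorted(targets, key=lambda t: (-targets[t], t))

    def confirm(self, word, target):
        """Record the user's final choice; earlier misses become aliases."""
        self.aliases[word][target] += 1
        for miss in self.failed:
            self.aliases[miss][target] += 1
        self.failed = []


# Assumed sample data: two "official" command names picked by a designer.
index = AdaptiveIndex(["delete", "copy"])
index.lookup("erase")              # unknown word, saved as a failed attempt
index.lookup("remove")             # unknown word, saved as a failed attempt
index.lookup("delete")             # finally succeeds
index.confirm("delete", "delete")  # "erase" and "remove" now point to "delete"
print(index.lookup("erase"))       # prints ['delete']
```

Because each alias keeps a count per target, a word that different users end up mapping to different commands returns every candidate, with the most frequently chosen one first. That is the "list of choices" form of disambiguation, and it avoids silently executing a wrong command.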