Marcia J. Bates
Graduate School of Education and Information Studies
University of California, Los Angeles
mjbates@ucla.edu
DRAFT 9/25/96
Copyright (c)1996
by Marcia J. Bates
Popular discussion in computer science, information retrieval, and information science regarding content indexing (especially subject indexing) and access to digital resources and the Internet has tended to ignore a variety of factors that are important in the design of such access mechanisms. Some of these factors and issues are reviewed and implications drawn for information system design in the era of electronic access. Specifically, the following are discussed: Human factors: Description of information vs. access, Subject searching vs. indexing, Multiple terms of access, Folk classification, Basic-level terms, and Folk access; Database factors: Bradford's Law, Vocabulary scalability, and the Resnikoff-Dolby 30:1 Rule; Domain factors: Role of domain in indexing.
It is argued that, to date, a "Naive Indexing Model" has dominated thinking on digital access. While we may not have enough research to posit a "Sophisticated Indexing Model," we can nonetheless characterize a "Less Naive Indexing Model," which is described.
Introduction
Objectives.
Popular discussion in the fields of computer science, information retrieval, and information science regarding the provision of access to digital libraries and Internet resources has proceeded largely in the absence of consideration of several key factors that are vital to the development of effective means of access to the subject content of digitized resources. These factors range from the unique character of the human-system interaction in an information seeking situation, to linguistic factors, subject domain factors, and the statistical properties of databases. In some cases we have extensive research results supporting the points I will be making; in other cases, we know only enough to know there is a problem that needs attending to. In either case, little attention has been given to these factors in the rush of enthusiasm over the new capabilities we have with the worldwide network, powerful automatic search engines, and digital resources available online.
The purpose here is to identify each of these problem areas, and describe and discuss the issues. As noted, some of these points have been discussed extensively, though often not in the context of digital resources, and others have been given less attention. However, I am not aware of their having been brought together in one article, nor related to one another within a single framework of discussion. It is my thesis that once the blush of enthusiasm over the new power to browse digital resources has passed, users will become very frustrated by the ineffectuality of current subject access. By that time, we need to have in place the beginnings of more powerful and user-friendly means of subject access. I believe that by the end of this article it will be evident that the next steps in the provision of subject access to digital resources will need to proceed in a different manner than has generally been assumed to date.
Fully Automated Subject Access.
Before discussing those factors, however, it is necessary to address a prior issue. Many people in our society, including many in the Internet and digital resources environments, assume that subject access to digital resources is a problem that has been solved, or is about to be solved, with a few more small modifications of current full-text indexing systems. Therefore, why worry about the various factors that will be addressed in this article?
There are at least two reasons. First, the human, domain, and other factors would still be operative in a fully automated environment, and need to be dealt with in order to optimize the effectiveness of information retrieval (IR) systems. Whatever information systems we develop, human beings still will come in the same basic model; products of human activity, such as databases, still will have the same statistical properties, and so on. As should become evident, failure to work with these factors will almost certainly sub-optimize the resulting product.
Second, it may be a little longer than we think before fully automated systems are developed. At conferences, researchers present their new system and say "We are 70 percent there with our prototype system. Just a little more work, and we will have it solved." This happens because, indeed, it is not difficult to get that first 70 percent in retrieval systems--especially with small, prototype systems.
The last 30 percent, however, is infinitely more difficult. Researchers have been making the latter discovery for at least the last thirty years. (Many of the retrieval formulas that have been tried recently were first used in the 1960s.) Information retrieval has looked deceptively simple to generations of newcomers to the field. But IR involves language and cognitive processing, and is therefore as difficult to automate as language translation and the other language processes based on real-world knowledge that researchers have been trying to automate virtually since the computer was invented. There are serious scalability problems as well with information retrieval; small prototype systems are often not like their larger cousins. We will discover, further, that user needs vary not just from one time to another, but from one subject domain to another. Optimal indexing and retrieval mechanisms may vary substantially from field to field.
We can do an enormous number of powerful things with computers, but effective, completely automated indexing of, and access to, textual and text-linked databases eludes us still, just as 100 percent accurate automatic translation does. Meanwhile, many other things that computers could help us with in information retrieval go untried, or not even invented, in deference to our collective assumption that we will soon find a way for computers to do everything for us in information retrieval.
One of the main points of this article will be that the people side of the information retrieval process needs attention too--and that the really sophisticated use of computers will require designs shaped to how our minds and information needs actually work, not to how our formal, analytical models might assume they work. If there is any validity to the thrust of the various points in this article, then attention to these points will be productive for effective information retrieval, no matter how soon, or whether, we find a way to develop good, fully automated IR.
Naive Indexing Model.
Reference will be made at various points to a "naive indexing model." By this is meant the cluster of largely unexamined assumptions that lie behind the more common approaches to the development of indexing and access systems, automated or otherwise. These are perfectly reasonable assumptions in every way; they are logical places to start. We know just enough, however, to know that the naive model is wrong, sub-optimal, or distorted in one way or another. We do not know enough, in my view, to posit a contrary "Sophisticated Indexing Model," but we do know enough to talk about a "Less Naive Indexing Model," and so the latter rubric will be used at points in the text.
In the following, the topics of human factors, database factors, and domain factors are discussed seriatim, the naive and less naive indexing models are then summarized, and finally, possible solutions are discussed and conclusions drawn.
Human Factors
Description vs. Access
Almost universally, access to digital resources has been assumed to operate under the naive indexing model. A fundamental assumption under that model is that whatever elements you recognize as being subject-indicative about a record, whether full text words or phrases, abstracts, titles, index terms, or classification categories, will, in turn, be the basis for retrieval on the record. To put it differently, in the naive model the contents or indexing of a record is what the system directly searches on to find a match for the user's query. This has, indeed, been the historical approach in indexes and card catalogs in the print environment. But automated systems need not be so restricted. Description and access are two different functions, and can be handled differently by an automated retrieval system.
The hub system in air travel provides a partial analogy. For many years, flights were scheduled to be direct, as much as possible, between two different cities. Eventually, it was realized that, if the customer would accept a stop along the way, much more efficient, inexpensive, and frequent flights would be possible. Instead of scheduling flights for every possible combination of two cities, flights could be scheduled between the originating city and a hub city, and between the hub city and the destination. The total number of flights needed was thus smaller, and yet, because all of an airline's flights had the hub city as either destination or origin, more frequent flights could be scheduled between each city and the hub, leading to more options for the customer.
The same two-step process in access can provide benefits of several sorts for the information system user and provider. Suppose the access term the user inputs leads to a "hub" of vocabulary, whence the user is then directed to indexed items. Why the extra step?
In an information system, the vocabulary hub can be either visible or invisible to the user. The user may have entered a different term than that used to index records of interest. (Or, in full text environments, the term the user entered may appear nowhere in some highly relevant records.) The hub can be a switching mechanism from the user's term to the actual terms indexing the record. If invisible to the user, the switch can lead to the retrieval of records under different terms without the user being aware of it. If visible to the user, the hub can be a place for the user to identify better terms to use, or additional terms to search on, besides the term the user started with. As we shall see in later sections, problems of vocabulary matching and finding the best or all relevant terms for searching are much greater than the naive indexing model assumes. In other words, this approach might be preferable, even if it did not save money.
However, there are also benefits to the system provider in the way of potentially dramatic cost reductions. Suppose a new name for an old concept becomes popular, or a section of a classification system has been updated and reorganized. The old cataloging principle (honored mostly in the breach) was to re-index all records under the new term or classification category. Now, instead of changing the old term in 10,000 records, we can change it in just one place--in the hub vocabulary--by linking the new term to the old.
Indexing and classification systems have historically been quite conservative, in part because of the enormous costs associated with any re-indexing of all the individual records. Change a couple of pieces of the classification, and all the previously classified items will be in a different location in the system than all the newly classified items. But suppose the changes were made in one place, in the hub. The behind-the-scenes links might look messy, but the user would get what the user wants--both the old set of records and the new set together on the screen--possibly without realizing there had been two sets. Once we break ourselves of the habit of assuming that the access and the description are pivoting on the same piece of data in each record, and can instead be separated into two distinct functions linked at the hub, many new possibilities in system design open up. We will return to some of these possibilities after some of the other problem areas are discussed.
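To make the two-step arrangement concrete, here is a minimal sketch, in Python, of a vocabulary hub. The terms and record identifiers are invented for illustration; this is a sketch of the idea, not a description of any operational system.

    # Hypothetical hub vocabulary: any entry term a user might supply is
    # switched to the preferred term under which records are indexed.
    hub = {
        "woods": "forests",
        "forest": "forests",
        "forests": "forests",
        "woodlands": "forests",
        "management of forests": "forest management",
        "forest management": "forest management",
    }

    # Records are indexed (described) under preferred terms only.
    index = {
        "forests": ["rec014", "rec088", "rec101"],
        "forest management": ["rec007"],
    }

    def search(entry_term):
        # Switch through the hub, then retrieve; the switch can be
        # invisible to the user, or displayed as suggested vocabulary.
        preferred = hub.get(entry_term.lower())
        return index.get(preferred, [])

    # When terminology changes, one hub entry is edited instead of
    # re-indexing thousands of records:
    hub["old growth"] = "forests"

Whether the hub is shown to the user or kept behind the scenes, description (the index) and access (the hub) remain separate functions, which is the point of the analogy.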
Subject Searching vs. Indexing
Another, related, part of the naive indexing model assumes that the searching and indexing processes are mirror images of each other. In indexing, the contents are described or represented, or, in full-text searching, indicative words or phrases are matched or otherwise identified. Likewise, the user formulates a statement of the query. Then these two representations, of document and of query, are matched to retrieve the results.
But, in fact, this is only superficially a symmetrical relationship. The user's experience is phenomenologically different from the indexer's experience. The user's task is to describe something that, by definition, he or she does not know (cf. Belkin, 1982). (Knowledge specifically of what is wanted would lead to a "known-item" search.) The user, in effect, describes the fringes of a gap in knowledge, and can only guess what the "filler" for the gap would look like. Or, the user describes a broader, more general topic area than the specific question of interest, and says, in effect, "Get me some stuff that falls in this general area and I'll pick what looks good to me." Usually, the user has no tools available to help with that problem of describing the fringes of the gap, or the broader subject area.
In many cases, the problem for the user is even more difficult than indicated above. In years of studies, Kuhlthau (1993) has documented that the very process of coming to know what one wants in the first place is seldom straightforward. In a search of any complexity, as for a student term paper, one does not so much "pick a topic," as is usually assumed, but rather discovers, develops, and shapes it over time through exploration in materials in an area of interest. One can use an information system at any point in this gradually developing process. Use early on will naturally come out of a much less well specified and articulated information need--yet the searcher must nonetheless find a way to get the information system to respond usefully.
The indexer, on the other hand, has the record in hand. It is all there in front of him or her. There is no gap. Here, ideally, the challenge for the indexer is to try to anticipate what terms people with information gaps of various descriptions might search for in those cases where the record in hand would, in fact, go part way in satisfying the user's information need. This is a very peculiar challenge, when you think about it. What kinds of information needs would people have that might lead them to want some information that this record would, in fact, satisfy?
As Harter (1992) points out, discovering that a particular article is relevant to one's concerns, and therefore a good find, does not necessarily mean that the article is "about" the same subject as one's originating interest. As he notes, regarding his own article on the concept of psychological relevance:
...the present article [on psychological relevance] may be found relevant, by some readers, to the topics of designing and evaluating information retrieval systems, and to bibliometrics; I hope that it will. However, this article is not about these topics. (p. 603, emphasis in the original)
Conceivably, infinitely many queries could be satisfied by the record in hand. Imagining even the more likely ones is a major challenge. (See extensive discussions in Wilson, 1968; Soergel, 1985; Green & Bean, 1995; Ellis, 1996; and O'Connor, 1996.)
But in fact, historically, and often still today, catalogers and indexers do not index based on these infinitely many possible anticipated needs. Instead, and perhaps much more practically, they simply index what is in the record. (See also discussion in Fidel, 1994.) In other words, they attempt to provide the most careful and accurate possible description or representation of the contents of the record. In this situation, there are a great many differences from the phenomenological circumstances of the user. We should not be surprised, then, if the user and the indexer use different terminology to describe the record, or, more generally, conceptualize the nature and character of the record differently from each other.
For the indexer, there is no mystery. The record is known, visible before him or her. Factual information can be checked, directly and immediately, to create an absolutely accurate record. The user, on the other hand, is seeking something unknown, about which only guesses can be made. What happens in retrieval if the searcher's guesses have little (or big) inaccuracies in them, and do not match with the precise and accurate description provided by the indexer?
Further, the indexer is experienced with the indexing system and vocabulary. Over the years, with any given system, fine distinctions are worked out regarding when one term is to be used and when another for two closely related concepts. Rules are created by the indexers to cover these debatable situations. Eventually, mastery of these rules of application comes to constitute a substantial body of expertise in its own right. For example, the subject cataloging manual (Library of Congress, 1991-) for the Library of Congress Subject Headings (Library of Congress, 1996) runs to two volumes and hundreds of pages. This manual is apart from the subject headings themselves, and consists not so much of general indexing principles as of rules for applying individual headings or types of headings. Thesaural indexing systems frequently have "scope notes" whose purpose is to tell the indexer how to decide which term to use under debatable circumstances. Often, these rules are not available to searchers, but even when they are, the naive searcher--and that includes Ph.D.'s, as long as they are naive about indexing--will usually not realize there are any ambiguities or problems with a term, and will not feel a need to check it. The user has in mind the sense of a term that interests him or her, not the other senses that the indexer is aware of. Only upon retrieving false drops will the user realize there is even any problem.
In short, the user almost always knows less about the indexing issues in a topic area than the indexer does. The user approaches the system with an information need that may be formulated out of the first words that come to mind (see Markey, 1984b, p. 70, on the many queries she found that "could be categorized as 'whatever popped into the searcher's mind'"). Consequently, the user's input is liable not to be a good match with the indexer's labeling, which is derived from years of experience and analysis. The indexer, on the other hand, cannot undo the far greater knowledge that he/she has of the indexing issues. After indexers have been at work for one day, or attended one training session, they know more, and have thought more, about indexing issues than even the most highly educated typical user has. Already, an expertise gap is forming between the user and the indexer that nearly guarantees some mismatches between user search terms and indexing terms on records.
The same phenomenological gap will hold true for the match between the system user and bodies of data that have been automatically indexed by some algorithm as well. The creator of the algorithm has likewise thought more and experimented more with the retrieval effects of various algorithms than the typical user has, before selecting the particular algorithm made available in a given system. Yet, at the same time, the full statistical consequences of such algorithms can never be anticipated for every circumstance.
Results achieved through such algorithms will seldom dovetail exactly with what humans would do in similar circumstances. What emerges is a peculiar mix of expert human understanding of indexing with non-sentient statistical techniques, producing a result that is never just like interacting with another human. Again, we have a phenomenologically different experience for the researcher/designer on one side and the user on the other. Further, no retrieval algorithm has been found to be anywhere near perfectly suitable (more on this later), yet the principles by which the algorithm is designed are seldom made fully available to end users--just as the indexing principles seldom are--which might otherwise allow the user to try to find ways around their inadequacies.
There is still another factor that is likely to operate in the behavior of catalogers, indexers, and system designers for manual and automated systems. The information professional has an understandable desire to create, in an indexing vocabulary, classification system, or statistical algorithm, a beautiful edifice. He or she wants a system that is consistent in its internal structure, that is logical and rigorous, that can be defended among other professionals as meeting all the usual expectations for creations of human endeavor. After years of working with problems of description, the creators of such systems have become aware of every problem area in the system, and have determinedly found some solution for each such problem. Therefore, the better developed the typical system, the more arcane its fine distinctions and rules are likely to be, and the less likely to match the unconsidered, inchoate attempts of the average user to find material of interest.
This is by no means to suggest that information retrieval systems should be inchoate or unconsidered! Instead, the question to be asked is a different one than is usually assumed. That question should not be: "How can we produce the most elegant, rigorous, complete system of indexing or classification?," but rather, "How can we produce a system whose front-end feels natural and compatible for the searcher, and which, by whatever infinitely clever internal means we devise, helps the searcher find his or her way to the desired information?"
Multiple Terms of Access
On this next matter there is a huge body of available research. Several people have written extensively about it (Furnas and others, 1983; Bates, 1986a, 1989; Gomez, Lochbaum, & Landauer, 1990), and yet the results from these studies are so counter-intuitive to our naive view of the indexing/retrieval process, that little has been done to act on this information in information retrieval system design. It is as if, collectively, we in the field just cannot believe, and, therefore, do not act upon, this data. Here, a couple of examples of these results will be described; otherwise the reader is encouraged to explore the full range of available data on this matter in the above-cited references.
In study after study, across a wide range of environments, it has been found that for any target topic people will use a very wide range of different terms, and no one of those terms will be very frequent. These variants will be morphological (forest, forests), syntactic (forest management, management of forests) and semantic (forest, woods).
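A trivial sketch makes plain why this variety defeats one-to-one matching. The exact-string test below, using the example terms just given, stands in for any scheme that requires the user's term to equal the indexing term; the record term is invented for the illustration.

    # One interest, three phrasings, drawn from the examples above.
    user_terms = ["forests",                 # morphological variant
                  "management of forests",   # syntactic variant
                  "woods"]                   # semantic variant

    record_term = "forest management"        # term the record is indexed under

    # Naive one-to-one matching: only an exact string match retrieves.
    for term in user_terms:
        print(term, "->", "match" if term == record_term else "no match")
    # All three users miss the record, though all would find it relevant.
    # Stemming would reconcile only the morphological variants
    # (forest/forests), never forest/woods.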
One example result can be found in Saracevic & Kantor (1988). In a carefully designed and controlled study on real queries being searched online by experienced searchers, when search formulations by pairs of searchers for the identical query were compared, in only 1.5 percent of the 800 comparisons were the search formulations identical. In 56 percent of the comparisons the overlap in terms used was 25 percent or less; in 94 percent of the comparisons the overlap was 60 percent or less. (p. 204).
In another study, by Lilley (1954), in which 340 library students were given books and asked to suggest subject headings for them, they produced an average of 62 different headings for each of the six test books. Most of Lilley's examples were simple, the easiest being The Complete Dog Book, for which the correct heading was "Dogs." By my calculation, the most frequent term suggested by Lilley's students averaged 29 percent of total mentions across the six books.
Dozens of indexer consistency studies have also shown that even trained experts in indexing still produce a surprisingly wide array of terms within the context of indexing rules in a subject description system (Leonard, 1977; Markey, 1984a). See references for many other examples of this pattern in many different environments.
To check out this pattern on the World Wide Web, a couple of small trial samples were run. The first topic, searched with the Infoseek search engine, is one of the best known topics in the social sciences, one that is generally described by an established set of terms: the effects of television violence on children. This query was searched on five different expressions that were minimally variant. The search was not varied at all by use of different search capabilities in Infoseek, such as quotation marks, brackets, or hyphens to change the searching algorithm. Only the words themselves were altered--slightly. (Change in word order alone did not alter retrievals.) These were the first five searches run:
violent TV children
children television violence
media violence children
violent media children
children TV violence
Each search, as is standard, produced ten addresses and associated descriptions as the first response for the query, for a total of 50 responses. Each of these queries could easily have been input by a person interested in the identical topic. If each query yielded the same results, there would be only ten different entries--the same ten--across all five searches. Yet comparison of these hits found 23 different entries among the 50, and in varying order on the screen. For instance, the response labeled "Teen Violence: The Myths and the Realities," appeared as #1 for the query "violent media children" and #10 for "violent TV children." I then went a little farther afield and searched "mass media effects children." That query yielded nine new sites among the ten retrieved, for a total of 32 different sites--instead of the ten that might have been predicted--across the six searches.
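The tallying behind such figures is mechanical, and a small sketch may make it concrete. The result lists below are invented placeholders; in the actual trial, each query returned ten addresses.

    # Placeholder stand-ins for the top-ten result lists of each query.
    top_tens = [
        ["urlA", "urlB", "urlC"],   # e.g., violent TV children
        ["urlB", "urlD", "urlE"],   # e.g., children television violence
        ["urlA", "urlF", "urlG"],   # e.g., media violence children
    ]

    # If all queries behaved identically, the union would equal one list.
    slots = sum(len(hits) for hits in top_tens)   # 50 in the trial above
    unique = len(set().union(*top_tens))          # 23 in the trial above
    print(unique, "different entries in", slots, "slots")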
The previous example varied the search words in small, almost trivial ways. The variation could easily have been much greater, while still reflecting the same interests from the searcher. Let us suppose, for example, that someone interested in freedom of speech issues on the Internet enters this query:
+"freedom of speech" +Internet
This time the search was run on the Alta Vista search engine. The plus signs signal the system that the terms so marked must be present in the record and the quotation marks require the contained words to be found as a phrase, rather than as individual, possibly separated, words in the record. So, once again, we have a query that is about as straightforward as possible; both terms, as written above, must be present.
Let us suppose, however, that three other people, interested in the very same topic, happen to think of it in just a little different way, and, using the identical system search capabilities, so that only the vocabulary differs, they enter, respectively:
+"First Amendment" +Web
+"free speech" +cyberspace
+"intellectual freedom" +Net
All four of these queries were run, in rapid succession, on Alta Vista. The first screen of 10 retrievals in each case was compared to the first 10 for the other 3 queries. The number of different addresses could vary from 10 (same set across all four queries) to 40 (completely different set of 10 retrievals for each of the four queries). Result: There were 40 different addresses altogether for the four queries. Not a single entry in any of the retrieved sets appeared in any of the other sets.
I then looked to see what would happen when the 8 different terms used were combined in all the logical pairings, that is, each first term appearing with each second term (free speech and Internet, free speech and Web, First Amendment and cyberspace, etc.). There are 16 such pairings, for a total of 160 "slots" for addresses on the first ten retrievals in each case. All 16 of these combinations could easily have been input by a person interested in the very same issue (as well as dozens, if not hundreds, of other combinations and small variations on the component terms).
The result: Out of the 160 slots, there were 138 different entries produced. Thus, if each of the 16 queries had been entered by a different person, each person would have missed 128 other "top ten" entries on essentially the same topic, not to speak of the additional results that could be produced by the dozens of other terminological and search syntax variations possible on this topic.
However, the true variation among entries is actually slightly greater than even these figures indicate. Sometimes the same title, or entry, appeared with more than one address (URL) listed for it. Alta Vista counts each address as part of the 10 retrievals, so in such cases fewer different entries would appear, but with a total of 10 addresses. Altogether, there were 10 such cases where a given address constituted the second or third address for the same entry. So, in sum, there were a total of only 150 entries produced, containing a total of 160 addresses. Thus, the 138 different entries actually appeared out of a set of 150 entries, for a redundancy rate of only 8 percent (12/150). In sum, the data from these small tests of the World Wide Web conform well with all the other data we have about the wide range of vocabulary people use to describe information and to search on information.
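The sixteen pairings, and the redundancy arithmetic above, can likewise be reproduced mechanically. A sketch, using the eight terms from the trial; the counts at the end are simply those reported above, restated.

    from itertools import product

    first_terms = ['"freedom of speech"', '"First Amendment"',
                   '"free speech"', '"intellectual freedom"']
    second_terms = ["Internet", "Web", "cyberspace", "Net"]

    # The 16 query combinations described above (4 x 4).
    queries = ["+{} +{}".format(a, b)
               for a, b in product(first_terms, second_terms)]
    assert len(queries) == 16

    # Redundancy rate as computed in the text: 150 entries retrieved in
    # all, 138 of them distinct, so 12 repeats.
    entries, distinct = 150, 138
    print("redundancy rate: {:.0%}".format((entries - distinct) / entries))  # 8%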
If 85 or 90 percent of users employed the same term for a given topic, and only the remainder used an idiosyncratic variety of other terms, we could, with a moderate amount of comfort, endeavor to satisfy just the 85 or 90 percent, by finding that most popular term and using it in indexing the topic. But description of information by people just does not work this way. Even the most frequently used term for a topic is employed by a minority of people. There are generally a large number of terms used, many with non-trivial numbers of uses, and yet no one term is used by most searchers or indexers.
For search engines designed for browsing this may be a good thing. The slight variations yield different sets for different searchers, and thus spread the hits around better across the possible sites, thereby promoting serendipity. But for people making a directed search, it is illusory to think that entering that single just-right formulation of the query, if one can only find it, will retrieve the best sites, nicely ranked, with the best matches first.
Under these circumstances, the naive one-to-one matching assumption about the nature of information retrieval does not hold. We can surmise some reasons why these peculiarities of information retrieval have not been noticed and acted upon earlier. In million-item databases, even spelling errors will usually retrieve something, and reasonable, correctly-spelled terms will often retrieve a great many hits. (In one of my search engine queries, I accidentally input the misspelling "chidlren" instead of "children." The first four retrievals all had "chidlren" in the title.) Users may simply not realize that the 300 hits they get on a search--far more than they really want anyway--are actually a small minority of the 10,500 records available on their topic, some of which may be far more useful to the user than any of the 300 actually retrieved.
Another possible reason why we have not readily absorbed this counter-intuitive study data: interaction with a system is often compared to a conversation. Whether or not a person consciously thinks of the interaction that way, the unconscious assumptions can be presumed to derive from our mental model of conversations, because that is, quite simply, the principal kind of interaction model we language-using humans come equipped with. In a conversation, if I say "forest" and you say "forests," or I say "forest management" and you say "management of forests," or I say "forest" and you say "woods," we do not normally even notice that different terms are used between us. We both understand what the other says, each member of the example pairs of terms taps into the same area of understanding in our minds, and we proceed quite happily and satisfactorily with our conversation. It does not occur to us that we routinely use this variety and are still understood. Computer matching algorithms, of course, do not generally build in this variety, except for some stemming, because it does not occur to us that we need it.
Experienced online database searchers have long understood the need for variety in vocabulary when they do a thorough search. In the early days of online searching, searchers would carefully identify descriptors from the thesaurus of the database they were searching. The better designed the thesaurus and indexing system, the more useful this practice is. However, searchers soon realized that, in many cases where high recall was wanted, the best retrieval set would come from using as many different terms and term variants as possible, including the official descriptors. They would do this by scanning several thesauri from the subject area of the database and entering all the relevant terms they could find, whether or not they were official descriptors in the target database.
In some cases, where they had frequent need to search a certain topic, or a concept element within a topic, they would develop a "hedge," a sometimes-lengthy list of OR'd terms, which they would store and call up from the database vendor as needed. (See, e.g., Klatt, 1994.)
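In effect, a hedge is a stored disjunction of every term variant the searcher can anticipate for a concept. A minimal sketch follows; the term list is invented for illustration, and any real hedge would be far longer and tuned to a particular database.

    # Hypothetical hedge for one concept element of a recurring topic.
    child_development_hedge = [
        "child development",
        "childhood development",
        "developmental psychology",
        "early childhood",
    ]

    def as_query(terms):
        # Render the hedge as an OR'd search statement.
        return " OR ".join('"{}"'.format(t) for t in terms)

    print(as_query(child_development_hedge))
    # "child development" OR "childhood development" OR ...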
Sara Knapp, one of the pioneers in the online searching area, has published an unusual thesaurus--not the kind used by indexers to identify the best term to index with, but one that searchers can use to cover the many terms needed for a thorough search in an area (Knapp, 1993). Figure 1 displays the same topic, "Child development," as it appears in a conventional thesaurus, the Thesaurus of ERIC Descriptors (Houston, 1995), and in Knapp's searcher thesaurus. It can be seen that Knapp's thesaurus provides far more variants on a core concept than the conventional indexer thesaurus does, including likely different term endings and possible good Boolean combinations.
[Figure 1 about here.]
A popular approach in information retrieval research has been to develop ranking algorithms, so that the user is not swamped with hundreds of undifferentiated hits in response to a query. Ranking will help with the 300 items that are retrieved on the term-that-came-to-mind for the user--but what about the dozens of other terms and term variations that would also retrieve useful material--some of it far better--for the searcher as well? Information retrieval system design must take into account these well-attested characteristics of human search term use and matching, or continue to create systems that operate in ignorance of how human linguistic interaction in searching actually functions.
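A toy illustration of the limit just described: ranking, however it is computed, can only reorder the documents that the literal query terms retrieved in the first place. The collection, query, and raw term-frequency score below are all invented for the purpose, the score standing in for any ranking formula.

    # Toy collection: doc3 is relevant but uses only the synonym "woods".
    docs = {
        "doc1": "forest fire management in national forest lands",
        "doc2": "a survey of forest ecology",
        "doc3": "managing old-growth woods",
    }
    query = "forest"

    # Retrieve on the literal term, score by raw term frequency.
    scored = {d: text.split().count(query)
              for d, text in docs.items() if query in text.split()}
    for doc, score in sorted(scored.items(), key=lambda kv: -kv[1]):
        print(doc, score)
    # doc3 never appears at any rank, however good the ranking formula:
    # ranking presupposes retrieval, and retrieval presupposes a
    # vocabulary match.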