Yoelle S. Maarek
Research Staff Member
IBM Haifa Research Lab
Haifa 31905, ISRAEL
ph: +972-4-296-336
FAX: +972-4-296-114
yoelle@haifa.vnet.ibm.com
With the advent of digital libraries and wide area networks, enormous amounts of textual information are being made available all over the world; a typical example is the World Wide Web on the Internet. Searching and browsing are the two resource discovery paradigms most often used to access this information [Bowman 94]. Information retrieval (IR) provides numerous sophisticated automatic indexing and storing techniques to support efficient searching and retrieval. In contrast, the organizing process that makes browsing possible is typically done manually. Advances in automatic techniques for supporting the browsing paradigm do not compare to those that support searching, even though browsing is classically the first paradigm used, before searching.
The issue we would like to address at the Institute is document organization, which seems to have been neglected by the research community. This relates to the proposed discussion issue: "What do we need to know about how people use electronic texts and how can we gain this knowledge and apply it to the development of digital libraries?" We would like to discuss organization by content as well as by other criteria such as fixed fields (author?, subject?, date?, etc.).
Document clustering has long been an area of interest in IR [Voorhees 86], [Willett 88], but it has been used primarily for improving the effectiveness and efficiency of the retrieval process [Rasmussen 92]. We are more concerned here with the primary use of cluster analysis, which is the visualization of underlying structures for explanatory purposes. Since the hierarchy is going to be shown to the user, and not simply used internally by the retrieval engine, it has to be more precise.
The clustering process is based upon pairwise similarity between documents, which is itself inferred from a profile of representative indices for each document. We use multiple-word indices that represent lexical affinities and thus get a richer and more precise indexing unit than single words, as has been described elsewhere [Maarek 89].
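To make the idea of a profile of multiple-word indices concrete, here is a minimal sketch. It is not the Librarian's Assistant implementation: the five-word window, the tiny stop-word list, the raw co-occurrence counts, and the cosine measure are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

# Illustrative stop-word list only; a real indexer would use a much larger one.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "to", "and", "is", "for"})


def affinity_profile(text, window=5):
    """Count unordered word pairs that co-occur within `window` words.

    Each pair stands in for a lexical affinity; the window size is an assumption.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    profile = Counter()
    for i, left in enumerate(words):
        for right in words[i + 1 : i + 1 + window]:
            if left != right:
                profile[tuple(sorted((left, right)))] += 1
    return profile


def similarity(p, q):
    """Cosine similarity between two profiles, between 0 and 1."""
    dot = sum(count * q[pair] for pair, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0
```

Two documents that share many word pairs score close to 1, while documents with no pair in common score 0, even if they share isolated single words.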
The Librarian's Assistant identifies underlying structures in collections of documents. These structures are visualized via a hierarchy of folders or clusters, each cluster gathering conceptually similar documents. The user can interact with the structure via a graphical user interface in order to obtain, for each cluster, a list of key concepts as well as an automatically generated title that helps the user understand what the cluster is about.
In this example, an administrator has asked the Librarian's Assistant to cluster about 100 man pages of AIX (IBM's Unix) commands. By default, every cluster is associated with an "intra-cluster similarity" percentage indicating how similar the documents within that cluster are (100% = conceptually identical, 0% = conceptually independent). A portion of the HTML output (the Web interface of the tool) is included below. Note that the hot-button links from documents to their sources are not enabled here. The full-fledged Web interface also lets the user get a title and a list of key concepts for a cluster by clicking on its folder icon (not provided here).
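The text does not say how the intra-cluster similarity percentage is computed. One plausible reading, sketched below purely as an assumption, is the mean pairwise similarity of the cluster's documents scaled to 0-100. The function names are hypothetical, and Jaccard overlap of term sets is only a stand-in for the tool's own similarity measure.

```python
from itertools import combinations


def jaccard(a, b):
    """Stand-in similarity: overlap of two term sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def intra_cluster_similarity(members, sim=jaccard):
    """Mean pairwise similarity of a cluster's documents, as a percentage.

    Assumption: a cluster with fewer than two documents is reported as 100%.
    """
    pairs = list(combinations(members, 2))
    if not pairs:
        return 100.0
    return 100.0 * sum(sim(a, b) for a, b in pairs) / len(pairs)
```

Under this reading, a cluster of identical documents scores 100% and a cluster of documents with nothing in common scores 0%, matching the two endpoints described above.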
In summary, whenever collections of documents are available, the
Librarian's Assistant provides the organization needed to
browse and search, or simply to get a global understanding of the
structure of a collection. A Motif interface on Unix and a PM
interface on OS/2 are available, as well as a custom-made version for
the World Wide Web.
....
[ 26%] contains 4 documents
[ 48%] contains 2 documents
[ 40%] contains 2 documents
[ 26%] contains 2 documents
[ 26%] contains 2 documents
[ 25%] contains 7 documents
comb.cmd
[ 49%] contains 2 documents
[ 36%] contains 4 documents
sccsdiff.cmd
sccshelp.cmd
[ 50%] contains 2 documents
...
Clustering of search results
The example below shows a retrieval session in which the user queries a
collection of patent abstracts and clusters the top ten candidates in
order to understand why two abstracts with similar titles (but
different patent IDs) get such different scores.
By clustering, the user notices that the top two candidates belong to
two different clusters, which means that they must be drastically
different. By examining sample documents from each cluster, the user
discovers that his/her query was not precise enough and that one top
candidate talks about feeding physical documents (in the context of
copying machines) whereas the second talks about online documents.
The clustering process reveals this ambiguity in the query,
as it identifies one cluster for physical documents and one for online
documents. The user then does not need to reissue a more precise query
to filter out the kind of documents s/he is not interested in, but needs
only to look at the cluster relevant to him/her.
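The session described above can be imitated with a small sketch. The four term sets below are invented toy data, not real patent abstracts, and the average-link merging, the Jaccard measure, and the 0.3 threshold are assumptions made for illustration rather than the tool's actual method.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two term sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_link(c1, c2, docs):
    """Mean similarity over all cross-cluster document pairs."""
    total = sum(jaccard(docs[x], docs[y]) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def cluster(docs, threshold=0.3):
    """Merge the two closest clusters until none are closer than `threshold`."""
    clusters = [[name] for name in docs]
    while len(clusters) > 1:
        best_pair, best_score = None, -1.0
        for i, j in combinations(range(len(clusters)), 2):
            score = average_link(clusters[i], clusters[j], docs)
            if score > best_score:
                best_pair, best_score = (i, j), score
        if best_score < threshold:
            break
        i, j = best_pair  # i < j, so removing j leaves index i valid
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters


# Hypothetical search results: two about paper feeders, two about online text.
toy_results = {
    "A": {"document", "feeder", "copier", "paper", "tray"},
    "B": {"document", "feeder", "copier", "sheet", "paper"},
    "C": {"document", "online", "display", "screen", "text"},
    "D": {"document", "online", "retrieval", "screen", "text"},
}
groups = cluster(toy_results)
```

With this toy data, the results that share only the word "document" end up in separate groups, which is the kind of split that lets the user spot an ambiguous query.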
References: