Content Analysis Via Document Clustering

Bruce L. Lambert
Department of Pharmacy Administration
University of Illinois at Chicago
(312) 996-2411
(312) 996-3272 fax
lambertb@uic.edu


I have designed and implemented an automatic content analysis system, based on document clustering techniques, that may be of interest to some people in the Allerton group. The system groups documents into thematic clusters based on the co-occurrence of words in the documents. In my research, the independent clause has been the unit of analysis, but any unit of analysis could be chosen. We have used the system to analyze a fairly wide range of text data, including job descriptions from Ft. Stewart Air Force Base, messages from pharmacists to patients, pharmacists' views about robotics in pharmacy, students' comforting messages to other students, and so on.

Although there are still improvements to be made, the program does quite a nice job of extracting themes from (manually unitized) computer-readable text (a la the paper in Science 1994 by Allan, Salton, Buckley, and Singhal).
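
For readers who want a concrete picture of what grouping units by word co-occurrence can look like, here is a minimal sketch in Python. It is not the Theme Machine's code. The choice of cosine similarity over raw word counts, the group-average merging rule, the 0.3 threshold, and the names term_vector, cosine, and cluster are all assumptions made for illustration, and the four sample clauses are invented.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(unit):
    """Bag-of-words counts for one text unit (e.g., an independent clause)."""
    return Counter(re.findall(r"[a-z']+", unit.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(units, threshold=0.3):
    """Group-average agglomerative clustering (an assumed method).

    Starts with one cluster per unit and repeatedly merges the most
    similar pair until no pair reaches the (hypothetical) threshold.
    Returns a list of clusters, each a list of unit indices.
    """
    vectors = [term_vector(u) for u in units]
    clusters = [[i] for i in range(len(units))]

    def avg_sim(c1, c2):
        # Mean pairwise similarity between members of the two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), avg_sim(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        # x < y always holds here, so deleting y leaves x in place.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Invented sample clauses, for illustration only.
units = [
    "take your medication every day",
    "take the medication every morning",
    "high blood pressure can cause a stroke",
    "untreated blood pressure may cause a stroke",
]
themes = cluster(units, threshold=0.3)
print(themes)
```

With these sample clauses the two medication clauses end up in one cluster and the two blood-pressure clauses in another, since the pairs share no words across themes. A real run would presumably also need stop-word handling and term weighting, which this sketch leaves out.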

People who evaluate DL systems often ask users for free-response replies to open-ended evaluative questions (e.g., what features of the system do you like the most? the least?). Focus groups can also be used in the evaluation or design stages of a DL project. Both free-response answers and transcribed recordings from focus groups would be amenable to the kind of analysis I am suggesting. I'd be happy to present this work, either formally or informally, to interested parties at this year's Allerton meeting.

A brief technical report (called ThemeMachine.pdf), describing the development of the system and offering some theoretical justification, can be had via anonymous ftp from

ftp://ludwig.pmad.uic.edu/pub/ThemeMachine.pdf

or via the Web at

http://ludwig.pmad.uic.edu/~bruce/

The report was prepared for an audience of communication researchers who knew very little about info retrieval, so parts of it will be quite basic for this audience.

An example of the Theme Machine's output can be had from the same ftp server. The raw data are in /pub/combined-htn-units.cln, and the output of two separate clustering runs is in /pub/htn.t1000.gavg.out and /pub/htn.t800.clink.out. These raw data are written messages, produced by pharmacists, in response to a hypothetical hypertension compliance-gaining situation. They have been segmented by hand into independent clauses.