|Abstract:||The performance of statistical language modeling retrieval is directly determined by the estimation of the document model θ_D, the query model θ_Q, and the similarity between these two models. In this thesis, we propose to improve the estimation of θ_D and θ_Q through local corpus structures. We also evaluate the optimality of the similarity measure and examine its variations to achieve better retrieval performance.
The accuracy of model estimation depends on the size of the sample. In information retrieval tasks, a typical document model is estimated from a single document, which is clearly too small a sample for accurate estimation. In this thesis, we develop a new document expansion technique, which expands the original small sampling space (a single document) into a much larger one (a pseudo document) by constructing a probabilistic neighborhood around the original document. We then propose to estimate document models over the pseudo document, on the hypothesis that a much larger sampling space can lead to much better model estimation. Our experimental results support this hypothesis: the new estimation outperforms the old estimation on all test collections.
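The document expansion idea can be illustrated with a minimal sketch. The abstract does not specify how the probabilistic neighborhood is built, so everything concrete here is an assumption made for illustration: neighbors are weighted by cosine similarity over raw term counts, the corpus passed in excludes the document itself, and the hypothetical parameter `alpha` controls how much weight the original document keeps.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_document(doc, corpus, alpha=0.5):
    """Build pseudo-document counts from a document and its neighbors.

    Assumption: each neighbor contributes its counts in proportion to
    its normalized cosine similarity to `doc`; `alpha` is the share of
    weight kept by the original document.
    """
    weights = [cosine(doc, other) for other in corpus]
    total = sum(weights)
    pseudo = Counter({t: alpha * c for t, c in doc.items()})
    if total > 0:
        for other, w in zip(corpus, weights):
            for t, c in other.items():
                pseudo[t] += (1 - alpha) * (w / total) * c
    return pseudo


doc = Counter({"a": 1, "b": 1})
corpus = [Counter({"a": 1, "c": 1}), Counter({"d": 2})]
pseudo = expand_document(doc, corpus)
```

In this toy example the pseudo document gains mass for the term `c`, which the original document never contained but its only similar neighbor does. A document model would then be estimated from `pseudo` instead of `doc`.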
In typical information retrieval tasks, queries are very short and cannot support a reliable estimate of the query model. Traditionally, a feedback process is used to expand queries, which usually leads to a much better estimate of the query model θ_Q. Many pseudo feedback models have been proposed. However, each of them introduces one or more extra parameters to control the feedback process, and they therefore suffer from parameter sensitivity: parameter values that work well on one collection may perform very badly on others. We observe that the reason for this sensitivity is that the learning process on feedback documents cannot fully utilize all of the query information. In this thesis, we therefore propose a new feedback model and an automatic parameter tuning algorithm. The new model integrates feedback documents and queries into a unified framework, so that the original query can guide query model learning by gradually collecting relevant information from feedback documents. We develop a new EM learning algorithm, which dynamically changes its model prior to adapt to newly added information from feedback documents. At the point where the algorithm stops, we believe the learning process reaches a good balance between information from the query side and from the feedback document side. The new model and its learning algorithm do not require any predefined parameter; the parameter is learned automatically by adapting to the amount of feedback information added to the query model. Experiments show that the new model is much more robust than other feedback models on our text data sets.
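To make the parameter sensitivity concrete, the following sketch shows a conventional two-component mixture feedback model trained with EM. This is not the model proposed in the thesis, whose details the abstract does not give; it is the kind of baseline the paragraph criticizes. The function name and the fixed noise parameter `lam` are assumptions for illustration, and `lam` is exactly the sort of hand-set value that may not transfer across collections.

```python
def mixture_feedback(feedback_counts, collection_model, lam=0.5, iters=30):
    """Estimate a feedback topic model with a fixed background weight.

    Each word occurrence in the feedback documents is assumed to come
    from the collection model with probability `lam`, or from the
    unknown topic model otherwise.
    """
    vocab = list(feedback_counts)
    total = sum(feedback_counts.values())
    # Start from the maximum-likelihood estimate on the feedback text.
    theta = {w: feedback_counts[w] / total for w in vocab}
    for _ in range(iters):
        expected = {}
        for w in vocab:
            topic = (1 - lam) * theta[w]
            background = lam * collection_model.get(w, 1e-9)
            # E-step: expected count of w attributed to the topic model.
            expected[w] = feedback_counts[w] * topic / (topic + background)
        # M-step: renormalize the expected counts.
        norm = sum(expected.values())
        theta = {w: expected[w] / norm for w in vocab}
    return theta


feedback_counts = {"the": 10, "retrieval": 10}
collection_model = {"the": 0.2, "retrieval": 0.001}
theta = mixture_feedback(feedback_counts, collection_model)
```

With equal counts in the feedback text, the common word `the` is largely explained by the background, so the topic model favors `retrieval`. How strongly it does so depends on `lam`, which is the dependency the proposed model aims to remove.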
We further study how best to compute the similarity between a query model and a document model. We propose and study several new similarity functions, and show that certain variations of KL-divergence can significantly improve retrieval accuracy in relevance feedback.
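For reference, the standard KL-divergence ranking function that these variations start from can be sketched as follows. The abstract does not state a smoothing method, so Dirichlet prior smoothing with a hypothetical default `mu=2000.0` is an assumption here, as are the function name and the fallback probability for unseen terms.

```python
import math


def kl_score(query_model, doc_counts, collection_model, mu=2000.0):
    """Score a document by the negative KL-divergence from the query model.

    The document model is smoothed with a Dirichlet prior (assumed).
    Higher scores mean the document model is closer to the query model.
    """
    doc_len = sum(doc_counts.values())
    score = 0.0
    for term, p_q in query_model.items():
        if p_q <= 0:
            continue
        p_c = collection_model.get(term, 1e-9)
        p_d = (doc_counts.get(term, 0) + mu * p_c) / (doc_len + mu)
        score -= p_q * math.log(p_q / p_d)
    return score


query_model = {"cat": 1.0}
collection_model = {"cat": 0.01, "dog": 0.01}
with_term = kl_score(query_model, {"cat": 5, "dog": 5}, collection_model)
without_term = kl_score(query_model, {"dog": 10}, collection_model)
```

Here `with_term` is greater than `without_term`, since the first document's smoothed model assigns more probability to the query term.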
Both the query model and document model estimation described above follow the bag-of-words model, which assumes term independence and ignores term positions in a document. In this thesis, we break this assumption by adding query term proximity to ranking functions. In particular, we propose five different features, each of which models proximity from a different perspective. Experiments show that one of these features, Minimum Pairwise Distance (MinDist), is indeed highly correlated with document relevance. However, this feature alone is not sufficient; it must be integrated into a retrieval formula to be usable for ranking. We therefore follow the heuristic constraint framework and design two heuristics that limit the choice of proximity modeling functions. We then choose a function that satisfies both constraints and develop a new retrieval formula on top of it. Experimental results show the effectiveness of this new retrieval formula.
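A proximity feature of this kind can be sketched in a few lines. The abstract does not define MinDist formally, so the sketch assumes it is the smallest positional distance between occurrences of any two distinct query terms that appear in the document, and that it is undefined (returned as `None`) when fewer than two distinct query terms match.

```python
from itertools import combinations


def min_pairwise_distance(doc_tokens, query_terms):
    """Smallest token distance between any two distinct matched query terms."""
    positions = {}
    for i, tok in enumerate(doc_tokens):
        if tok in query_terms:
            positions.setdefault(tok, []).append(i)
    if len(positions) < 2:
        return None  # Assumption: undefined with fewer than two matched terms.
    best = None
    for t1, t2 in combinations(positions, 2):
        d = min(abs(i - j) for i in positions[t1] for j in positions[t2])
        if best is None or d < best:
            best = d
    return best


tokens = "the quick brown fox jumps over the lazy dog".split()
dist = min_pairwise_distance(tokens, {"quick", "fox", "dog"})
```

In the example, `quick` and `fox` are two positions apart, which is closer than any other matched pair. A retrieval formula would then pass this distance through a transformation that satisfies the chosen heuristic constraints before combining it with the base score.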
In summary, this thesis studies KL-divergence retrieval from several perspectives and proposes new models to address existing problems in KL-divergence language models. The result is a set of more effective retrieval models, which could benefit retrieval applications in general.