Synthetic pre-training for robustness in information retrieval
Gangi Reddy, Revanth
Permalink
https://hdl.handle.net/2142/116283
Description
- Title
- Synthetic pre-training for robustness in information retrieval
- Author(s)
- Gangi Reddy, Revanth
- Issue Date
- 2022-07-21
- Advisor
- Ji, Heng
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Neural Information Retrieval
- Data Augmentation
- Weak Supervision
- Abstract
- Research on neural information retrieval has so far been focused primarily on standard supervised learning settings, where it outperforms traditional term matching baselines. Many practical use cases of such models, however, may involve previously unseen target domains. In this thesis, we first improve the out-of-domain generalization of Dense Passage Retrieval (DPR)—a popular choice for neural information retrieval (IR)—through synthetic data augmentation only in the source domain. We empirically show that pre-training DPR with additional synthetic data in its source domain (Wikipedia), which we generate using a fine-tuned sequence-to-sequence generator, can be a low-cost yet effective first step towards its generalization. Across five different test sets, our augmented model shows more robust performance than DPR in both in-domain and zero-shot out-of-domain evaluation. We then show that supervised neural IR models are prone to learning sparse attention patterns over passage tokens, which can result in key phrases including named entities receiving low attention weights, eventually leading to model under-performance. Using a novel targeted synthetic data generation method that identifies poorly attended entities and conditions the generation episodes on those, we teach neural IR to attend more uniformly and robustly to all entities in a given passage. On two public IR benchmarks, we empirically show that the proposed method helps improve both the model’s attention patterns and retrieval performance, including in zero-shot settings.
- Graduation Semester
- 2022-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Revanth Gangi Reddy
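The abstract's second contribution describes detecting passage entities that receive disproportionately low attention weight, then conditioning synthetic data generation on them. As a rough illustration of the detection step only, here is a minimal sketch in plain Python; the function name `find_weak_entities`, the span format, and the 0.5 threshold ratio are illustrative assumptions, not the thesis's actual method or code.

```python
# Hedged sketch (not the thesis implementation): given per-token attention
# weights for one passage and the token spans of its named entities, flag
# entities whose average attention falls below a fraction of the
# passage-wide mean. Such entities would then be targets for conditioned
# synthetic query generation.

def find_weak_entities(attn, entity_spans, ratio=0.5):
    """attn: per-token attention weights for one passage.
    entity_spans: {entity: (start, end)} token ranges, end exclusive.
    Returns entities whose mean attention < ratio * passage mean."""
    passage_mean = sum(attn) / len(attn)
    weak = []
    for entity, (start, end) in entity_spans.items():
        entity_mean = sum(attn[start:end]) / (end - start)
        if entity_mean < ratio * passage_mean:
            weak.append(entity)
    return weak

# Toy example: the "Marie Curie" tokens get almost no attention mass.
attn = [0.30, 0.25, 0.02, 0.03, 0.20, 0.20]
spans = {"Marie Curie": (2, 4), "Paris": (4, 5)}
print(find_weak_entities(attn, spans))  # ['Marie Curie']
```

In a real retriever the weights would come from the model's attention over passage tokens rather than a hand-written list, and the flagged entities would seed the targeted generation episodes the abstract describes.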
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)