Entity-based long document summarization using LLMs

Potluri, Abhilash

Permalink

https://hdl.handle.net/2142/124661

Description

Title

Entity-based long document summarization using LLMs

Author(s)

Potluri, Abhilash

Issue Date

2024-04-16

Director of Research (if dissertation) or Advisor (if thesis)

Han, Jiawei

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Keyword(s)

Summarization
Long Documents
NLP
LLM
Entity Extraction
Chain-of-Density

Language

eng

Abstract

Recent studies have found that summaries generated by Large Language Models (LLMs) such as OpenAI's Generative Pre-trained Transformer (GPT) tend to be ranked as the most fluent abstractive summaries. Existing long document summarization research has focused on changing model architecture (such as different attention modules). However, since LLMs seem to be the best at outputting fluent summaries, especially now that recent models have very large context windows, we seek to understand whether we can augment LLMs with information so that they produce the most accurate summaries. Specifically, in this project, we investigate whether a tandem approach of entity extraction and LLM prompting can generate the highest quality summary possible for scientific papers (long documents). We compare three settings: summarization using GPT only, using GPT with an entity extraction approach, and using a GPT Chain-of-Density based approach with the extracted entities. We find that providing the entities improves summary quality. Although the long documents contain over 6000 tokens on average, our Chain-of-Density method generates an adequate to good summary in over half the cases (nearly 80% of inputs in two of the datasets). We also show that our entity extraction method performs better in this setting than some contemporary approaches, and we experiment with variations of early stopping and entity decay in the Chain-of-Density based prompting. While this still leaves significant room for improvement, our results are promising first steps towards a new methodology for long document summarization of scientific papers.

Graduation Semester

2024-05

Type of Resource

Text

Handle URL

https://hdl.handle.net/2142/124661

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
