Towards transit data accessibility: Large language models and software tools for GTFS
Devunuri, Saipraneeth
This item is only available for download by members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with their NetID and password to view the item. If you are trying to access an Illinois-restricted dissertation or thesis, you can request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.
Permalink
https://hdl.handle.net/2142/129497
Description
- Title
- Towards transit data accessibility: Large language models and software tools for GTFS
- Author(s)
- Devunuri, Saipraneeth
- Issue Date
- 2025-03-06
- Director of Research (if dissertation) or Advisor (if thesis)
- Lehe, Lewis
- Doctoral Committee Chair(s)
- Lehe, Lewis
- Committee Member(s)
- Ouyang, Yanfeng
- Meidani, Hadi
- Talebpour, Alireza
- Department of Study
- Civil & Environmental Eng
- Discipline
- Civil Engineering
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- GTFS
- Public Transportation
- Transit
- Stop spacings
- Large Language Models
- Generative AI
- Abstract
- In an era characterized by data-driven decision-making, the General Transit Feed Specification (GTFS) has emerged as a global standard for publishing public transit data, enabling unprecedented transparency and accessibility. Despite its widespread adoption, extracting and analyzing transit data from GTFS remains challenging due to its complexity, optional components, and varying agency adherence to the standard. This dissertation addresses these challenges by proposing new tools and methods that make transit data more accessible using software and large language model (LLM)-based techniques. The dissertation begins with a systematic survey of errors in GTFS data across 632 U.S. transit feeds. Approximately 21% of the feeds contain at least one error. The analysis identifies the most common issues, with errors related to the optional "shape_dist_traveled" field accounting for the majority, and fare-related discrepancies forming a secondary cluster. The analysis also demonstrates the limits of identifying errors programmatically, showing that manual inspection is necessary to catch some of the most severe errors. Subsequently, this dissertation addresses the absence of tools for calculating bus stop spacings from GTFS feeds by introducing "gtfs-segments," a Python package that computes summary statistics and visualizes spacing distributions. In addition, it establishes terminology and various weighting schemes for calculating stop spacing statistics. Using "gtfs-segments," stop spacings were computed for 539 U.S. transit providers and 83 Canadian providers, while detailed statistics were produced for 30 U.S. providers, 10 Canadian providers, and a sample of 38 international providers. The analysis shows that different weighting schemes yield distinct "average" spacing values on both a hypothetical sample network and actual transit networks. Notably, the weighted spacings in the U.S. and Canada are narrower than those observed in other regions, yet remain broader than what references in the literature suggest from anecdotal evidence. GTFS data is intricate, comprising over 20 interlinked files with 250+ attributes, each having a description, presence condition, and data type. This dissertation investigates the potential of LLMs in extracting information from GTFS feeds by introducing the "GTFS Semantics" and "GTFS Retrieval" benchmarks to evaluate their comprehension and retrieval capabilities. Benchmarking ChatGPT (GPT-3.5 Turbo and GPT-4) reveals that LLMs exhibit a reasonable understanding of GTFS semantics and can perform "simple" extraction tasks by generating Python code. However, they are prone to hallucinations, particularly in distinguishing attribute-file associations and enumerated attribute types. Furthermore, this leads to poor performance on "complex" tasks that involve multiple files and attributes. The culmination of this dissertation is the creation of "TransitGPT," a chatbot that leverages LLMs to answer natural language queries about GTFS data, such as "What is the longest bus route in Chicago?" TransitGPT helps guide the LLM to generate Python code that extracts and manipulates relevant transit data, which is then executed on a server hosting the GTFS feeds. This framework supports a wide range of tasks, including data retrieval, calculations, and interactive visualizations, without requiring users to have extensive knowledge of GTFS or programming. The LLMs are guided entirely by prompts (through prompt engineering techniques) without the need for fine-tuning or direct access to the feeds, allowing any LLM to serve as a drop-in replacement. Evaluations using GPT-4o and Claude-3.5-Sonnet on a benchmark dataset of 100 tasks demonstrate that TransitGPT significantly enhances the accessibility and usability of transit data, empowering planners, researchers, and the public with an intuitive interface for complex data analysis.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129497
- Copyright and License Information
- Copyright 2025 Saipraneeth Devunuri
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois