Automatic Metadata Extraction Using Machine Learning
Wei, Qin; Heidorn, P. Bryan
Permalink
https://hdl.handle.net/2142/9139
Description
- Title
- Automatic Metadata Extraction Using Machine Learning
- Author(s)
- Wei, Qin
- Heidorn, P. Bryan
- Issue Date
- 2008-10-17
- Keyword(s)
- Metadata Extraction
- Machine Learning
- Date of Ingest
- 2008-10-26T00:05:05Z
- Abstract
- The information on museum specimen labels is neither well recognized nor widely used, mainly because it is not part of information retrieval systems, and entering the data by hand is very expensive. We use machine learning tools to automatically extract Darwin Core (DwC) and other metadata from these labels after they have been processed through Optical Character Recognition (OCR). DwC is a metadata profile describing the core set of access points for search and retrieval of natural history collections and observation databases. Using the HERBIS Learning System (HLS), we extract 74 independent elements from these labels. The automated text extraction tools are provided as a web service, so users can reference digital images of specimens and receive back an extended Darwin Core XML representation of the label content. This automated extraction task is made more difficult by the high variability of museum label formats, OCR errors, and the open-class nature of some elements. Here, we introduce our overall system architecture and variability-robust solutions, including the application of Hidden Markov and Naïve Bayes machine learning models, data cleaning, the use of field element identifiers, and specialist learning models. The techniques developed here could be adapted to any metadata extraction situation with noisy text and weakly ordered elements.
- Type of Resource
- text
- Genre of Resource
- Presentation / Lecture / Speech
- Language
- en
- Permalink
- http://hdl.handle.net/2142/9139
Owning Collections
Student Publications and Research - Information Sciences PRIMARY
Publications, conference papers, and other research and scholarship of iSchool students.