
The application of file identification, validation, and characterization tools in digital curation


Bookmark or cite this item: http://hdl.handle.net/2142/24301

Files in this item

File: Ford_Kevin.pdf (1MB)
Description: (no description provided)
Format: PDF
Title: The application of file identification, validation, and characterization tools in digital curation
Author(s): Ford, Kevin M.
Advisor(s): Cragin, Melissa H.; McDonough, Jerome P.
Department / Program: Library & Information Science
Discipline: Library & Information Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: M.S.
Genre: Thesis
Subject(s): digital curation digital preservation file identification file validation file characterization preservation tools preservation software
Abstract: File format identification, characterization, and validation are considered essential processes for digital preservation and, by extension, long-term data curation. These actions are performed on data objects by humans or computers in an attempt to identify the type of a given file, derive characterizing information that is specific to the file, and validate that the file conforms to its type specification. The present research reviews the literature surrounding these digital preservation activities, including their theoretical basis and the publications that accompanied the formal release of tools and services designed in response to that theoretical foundation. It also reports the results of extensive tests designed to evaluate the coverage of some of the software tools developed to perform file format identification, characterization, and validation. Tests of these tools demonstrate that more work is needed, particularly in terms of scalable solutions, to address the expanse of digital data to be preserved and curated. The breadth of file types these tools are expected to handle is so great as to call into question whether a scalable solution is feasible and, more broadly, whether such efforts will offer a meaningful return on investment. In addition, these tools, which serve to provide a type of baseline reading of a file in a repository, can be easily tricked. It is possible to generate a file with nothing more than a proper file extension and a correct magic number and have the tools "positively" identify it. Such a file is not the same as one that conforms to its specification and could therefore be considered valid. The ability to manipulate the results returned by these tools raises issues of identity, trust, security, and risk.
Issue Date: 2011-05-25
URI: http://hdl.handle.net/2142/24301
Rights Information: Copyright 2011 Kevin Ford. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. A copy of the license is available at http://creativecommons.org/licenses/by-nc-nd/3.0/
Date Available in IDEALS: 2011-05-25
Date Deposited: 2011-05
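
The abstract's point that a file with only a proper extension and a correct magic number can be "positively" identified can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the signature table, the function names (identify, looks_like_complete_pdf, make_stub_pdf), and the rough structural check are hypothetical and do not reproduce the tools or test procedures evaluated in the thesis.

```python
import os
import tempfile

# Illustrative signature table (an assumption, not taken from any real registry):
# format name -> (leading magic bytes, expected file extension).
SIGNATURES = {
    "PDF": (b"%PDF-", ".pdf"),
    "PNG": (b"\x89PNG\r\n\x1a\n", ".png"),
}


def identify(path):
    """Return a format name if both extension and leading bytes match, else None."""
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as fh:
        head = fh.read(16)
    for name, (magic, expected_ext) in SIGNATURES.items():
        if ext == expected_ext and head.startswith(magic):
            return name
    return None


def looks_like_complete_pdf(path):
    """Very rough structural check, not real validation:
    the file must start with the PDF header and contain an %%EOF marker
    somewhere in its last 1024 bytes."""
    with open(path, "rb") as fh:
        data = fh.read()
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def make_stub_pdf(directory):
    """Write a file that has only a PDF extension and a PDF magic number."""
    path = os.path.join(directory, "stub.pdf")
    with open(path, "wb") as fh:
        fh.write(b"%PDF-1.4\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stub = make_stub_pdf(tmp)
        print("identified as:", identify(stub))
        print("passes rough structure check:", looks_like_complete_pdf(stub))
```

In this sketch the stub file is identified as PDF even though it contains no document structure at all, while even a crude structural check rejects it. That gap between identification and validation is the one the abstract describes.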
 


Item Statistics

  • Total Downloads: 782
  • Downloads this Month: 17
  • Downloads Today: 1
