First Monday, Volume 11, Number 8 — 7 August 2006

Moving towards shareable metadata by Sarah L. Shreeves, Jenn Riley, and Liz Milewicz



Abstract
A focus of digital libraries, particularly since the advent of the Open Archives Initiative Protocol for Metadata Harvesting, has been aggregating metadata that describes digital content from multiple collections. However, problems with the quality and interoperability of the metadata often prevent such aggregations from offering much more than very simple search and discovery services. Shareable metadata is metadata that can be understood and used outside of its local environment by aggregators to provide more advanced services. This paper describes shareable metadata, its characteristics, and its importance to digital library development, as well as barriers and challenges to its implementation.

Contents

Introduction
What is shareable metadata?
The six C’s and lots S’s of shareable metadata
Challenges to creating shareable metadata
Conclusions

 


 

Introduction

Libraries, museums and other cultural heritage institutions (with varying degrees of comfort) are seeing their digital content and metadata showing up everywhere these days. Search engines like Google and Yahoo! are getting more adept at spidering deep into databases so that formerly hidden content is now appearing in searches. Digital images from library and museum collections appear in Google Image searches. Libraries share digital content and/or its metadata with OCLC and RLG, each of which produces a large union catalog of the holdings of multiple libraries. In the United States, many states have efforts underway to pull together digital content produced by their cultural heritage institutions.

Add an Open Archives Initiative data provider (by which an institution can expose metadata to whoever would like to harvest it) to the mix, and the metadata increasingly appears farther and farther away from its original context. An aggregator might also turn around and re–expose the metadata it has harvested to another service or make it available for a federated search through SRU or OpenSearch [1]. As disconcerting as it may seem to libraries and museums that have in the past maintained fairly tight control over catalogs and collection management systems, our metadata can now appear in the most unexpected places.

Digital library development over the past six years has focused particularly on exposing and sharing metadata describing digital content and aggregating that metadata (and increasingly the content itself) from a range of disparate providers. Such aggregations serve a variety of purposes.

Communication protocols like the Open Archives Initiative Protocol for Metadata Harvesting (OAI–PMH) and common metadata encoding schemas such as Dublin Core (DC) make it easier to pool metadata from multiple sources.

Sharing metadata and the resultant aggregations benefit users, particularly those users whose subject interest cuts across disciplinary boundaries. Not only do these aggregations minimize the time and effort expended on searching for all the resources on a particular topic, but they can yield higher quality resources in a variety of formats than would typically be found through an Internet search engine’s crawl of the Web.

Aggregations also benefit the institutions sharing the metadata. Institutions can no longer assume that users know about their online collections and remember to visit them. By allowing their metadata to appear in places outside of the original collection, institutions increase the number of access points to the items in their collection and expose their collection to a broader audience.

Despite recent advances in the field of metadata sharing, the full potential of metadata harvesting and services on aggregated metadata has yet to be realized. Numerous studies, particularly within the OAI–PMH context, have discussed the difficulty of building services beyond basic search and access over metadata aggregations because of poor metadata quality and shareability (Arms, et al., 2003; Dushay and Hillmann, 2003; Halbert, 2003; Hagedorn, 2003; Hutt and Riley, 2005; Shreeves, et al., 2005). Typical problems include:

  • Lack of consistency within a single collection.
    Example: The use of both the Dublin Core <date> and <coverage> elements to record some variant of the resource creation date.

  • Too much information.
    Example: Inclusion of technical information such as date digitized and type of scanner used.

  • Lack of key contextual information.
    Example: Exclusion of a collection name that is essential to make sense of the record.

  • Lack of conformance to technical standards.
    Example: Metadata encoded in XML with character encoding problems.

Services such as topical browsing or focused exploration of places and times are not easily accomplished because of problems such as these. Though some processing is always required in order to create services geared towards a particular audience, subject area, or use, the intensity and extent of this work is minimized and the quality of the results enhanced when metadata providers do the work to make metadata shareable.

This article describes shareable metadata and its characteristics, as well as barriers and challenges to its implementation. Our discussions are primarily grounded in experiences in cultural heritage institutions and focus particularly on issues found to be problematic in aggregations and not on issues in federated search environments. We try to generalize beyond specific metadata formats. We base our recommendations primarily on the guidelines documented in the Best Practices for Shareable Metadata, sponsored by the Digital Library Federation and the National Science Digital Library [3].

 

++++++++++

What is shareable metadata?

While sharing metadata is an essential first step towards creating useful aggregations, the quality of the resulting services is limited when the metadata used is not interoperable. So what are the qualities of shareable or interoperable metadata? How is this really different from creating metadata to be used “in–house”?

We believe that truly shareable metadata is different from the metadata that is used strictly “in–house”. Carl Lagoze has argued that “[m]etadata is not monolithic ... it is helpful to think of metadata as multiple views that can be projected from a single information object” (Lagoze, 2001). We agree with Lagoze that metadata should be simply a view of the resource, and that view may change depending on audience, use, and context. Unfortunately many libraries, museums, and other cultural heritage institutions have treated a metadata record as a monolithic item — a single record with all descriptive, technical, and administrative information about the resource included — and share this single record rather than a version of it most appropriate for the intended use.

Monolithic metadata records are problematic for aggregators for multiple reasons. Metadata schemas used for sharing often lack the semantic complexity to adequately communicate all of the information stuffed into them. End users and aggregators can be confused when search results are diluted by extraneous or ambiguous information. For example, when presented with a <dc:date>1922</dc:date> and <dc:date>2005-04-25</dc:date> in the same Dublin Core record, neither the user nor an indexing program will know that the former represents the date a photograph was taken and the latter the date the photograph was scanned. We encourage institutions to think carefully about how they might generate multiple views of resources using the metadata already created rather than simply sharing a single record describing everything about a resource.
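As a minimal sketch of what generating a view might look like, the Python below projects a shared discovery view from a fuller local record. The local field names, the sample values, and the mapping to simple Dublin Core elements (including the use of relation for the collection name) are all assumptions made for illustration, not a recommended crosswalk.

```python
# Hypothetical local record: every field name and value is made up for illustration.
LOCAL_RECORD = {
    "title": "Main Street, looking north",
    "date_created": "1922",
    "date_digitized": "2005-04-25",
    "scanner": "Flatbed scanner, 600 dpi",
    "subject": ["Streets", "Storefronts"],
    "collection": "Example County Photograph Collection",
}

# Assumed mapping from local fields to simple Dublin Core elements for a
# discovery view. Fields needed only for local management are left out.
DISCOVERY_VIEW = {
    "title": "title",
    "date_created": "date",
    "subject": "subject",
    "collection": "relation",
}


def project(record, mapping):
    """Return (element, value) pairs for the local fields named in the mapping."""
    pairs = []
    for field, element in mapping.items():
        if field not in record:
            continue
        values = record[field]
        if not isinstance(values, list):
            values = [values]
        # Repeat the element once per value rather than packing values together.
        pairs.extend((element, value) for value in values)
    return pairs


shared = project(LOCAL_RECORD, DISCOVERY_VIEW)
```

In this sketch the shared view carries a single date, the creation date, so the scan date can no longer be mistaken for it. A different mapping over the same local record could produce a second view for another purpose.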

At its most basic level shareable metadata should be human understandable. If a person unfamiliar with the resource described cannot state what a metadata record describes, the metadata is not shareable. The ultimate goal for shareable metadata, of course, is that it is machine processable so that computer programs can automatically parse and use the metadata for whatever service is needed. While this latter goal is not realistic for many institutions given limitations of digital content management systems in use today and technical resources available, making metadata understandable to human readers is an attainable and important goal for improving the quality of metadata in aggregations.

Shareable metadata should be an appropriate representation or view of the resource for its use. That is, shareable metadata should be useful and usable to services outside of its local context given the resource described. We are not arguing that institutions need to share metadata that can be used in all circumstances and by all services, or that institutions should create separate records tailored for every aggregator that may use their metadata. We are arguing that institutions need to think carefully about the uses and services they would like to support through their metadata. For example, basic cross-domain resource discovery is one use that almost all institutions with OAI data providers are trying to support given the basic purpose of the OAI protocol. In addition, an institution may also want to provide metadata to an aggregation with a specific audience focus, for example, K–12 teachers. An institution should understand what that aggregator needs included in the metadata (learning standards? audience level?) to support its service and, when possible, work to meet those needs.

Shareable metadata is fundamental to cross–domain resource discovery. Shareable metadata should support search interoperability. Priscilla Caplan defined search interoperability as “the ability to perform a search over diverse sets of metadata records and obtain meaningful results” [4].

Shareable metadata should exhibit the characteristics of quality metadata (Bruce and Hillmann, 2004) [5]. However, high quality metadata may or may not be truly shareable metadata. That is, metadata may be of high quality within its local context, but may be compromised when taken out of this context for various reasons. We should also clarify that high quality does not necessarily mean extremely complex or hand–crafted metadata; automatically generated metadata or a simple Dublin Core record can be quality metadata and can be shareable metadata. In general, we have found that in addition to characteristics of quality metadata, the following characteristics are particularly important:

  • Content is optimized for sharing.

  • Metadata within shared collections reflects consistent practices.

  • Metadata is coherent.

  • Context is provided.

  • The metadata provider communicates with aggregators through direct or indirect means.

  • Metadata and sharing mechanisms conform to standards.

More specific recommendations for each of these are outlined below.

 

++++++++++

The six C’s and lots S’s of shareable metadata

Content

Ensuring the content of metadata records is optimized for sharing is the most important task a metadata provider can perform. The record as a whole should describe the resource with a granularity appropriate for the materials and their intended use. Item–level description is most often used, although in some cases describing collections or parts of items may be more appropriate. The record should include only those metadata elements that serve a defined purpose in the shared environment: indexing, display to users so that they may determine whether the resource meets their need, or metadata enhancement by the aggregator.

In addition to the content of the record as a whole, the content of specific metadata elements can affect aggregators’ ability to make use of shared metadata records. For any element that makes use of a controlled vocabulary, a good shareable metadata record provides an unambiguous indication of the vocabulary from which the term provided was chosen. This allows aggregators to make use of defined structures for controlled vocabularies to improve searching, and to reconcile different vocabularies. Similarly, any links in the record should provide a machine–readable indication of what the link will resolve to — a representation of the resource itself, the resource in context with metadata and institutional branding, a Web site devoted to a collection of resources, or any of a number of other possibilities.
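One way to picture an unambiguous indication of vocabulary is the authority attribute that MODS allows on a subject element. The Python sketch below builds such subjects and then groups terms by declared vocabulary, as an aggregator might. The sample terms are illustrative only and have not been checked against the vocabularies named.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def subject_element(term, authority):
    """Build a MODS subject whose authority attribute names the source vocabulary."""
    subject = ET.Element(f"{{{MODS_NS}}}subject", {"authority": authority})
    topic = ET.SubElement(subject, f"{{{MODS_NS}}}topic")
    topic.text = term
    return subject


def terms_by_vocabulary(subjects):
    """Aggregator side: group topic terms by the vocabulary each subject declares."""
    groups = {}
    for subject in subjects:
        vocabulary = subject.get("authority", "unspecified")
        for topic in subject.findall(f"{{{MODS_NS}}}topic"):
            groups.setdefault(vocabulary, []).append(topic.text)
    return groups


# Illustrative terms only; not verified against the named vocabularies.
subjects = [
    subject_element("Horsemanship", "lcsh"),
    subject_element("Horses", "lcsh"),
    subject_element("equestrian portraits", "aat"),
]
grouped = terms_by_vocabulary(subjects)
```

Because each term carries its vocabulary, the aggregator in this sketch can treat the two groups differently, for example by applying a thesaurus structure to one and not the other.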

Consistency

Metadata aggregators can more effectively normalize records from metadata providers if all records within a defined set are consistent both semantically and syntactically. The consistent presence or absence of a given field in all records allows an aggregator to more easily determine which fields to display (such as a title) or to index (such as a subject). When fields are not used for the same type of value consistently throughout a single collection (for example, in a Dublin Core record the use of both <dc:date> and <dc:coverage> for the date a resource was created), aggregators must index several metadata fields together, which dilutes search results.

The consistent use of a controlled vocabulary within a given field, especially if the metadata format chosen does not allow an indication of which vocabulary is in use, will help the aggregator better interpret this information and reconcile different vocabularies. A similar principle applies to syntax encoding schemes — if a given field consistently uses the same encoding (for example, dates encoded using the W3CDTF format), the aggregator can more easily integrate that data into their internal data model. For aggregators, predictability is the key. When records are consistent, the aggregator can develop and apply enhancement logic to large groups of records at once, a practice that would not be cost–effective for small sets or individual records.
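A provider could get a rough picture of consistency before sharing a set by profiling it. The sketch below is one possible approach, assuming records are held as lists of (element, value) pairs: it counts how many records use each element and how many date values match the date-only forms of W3CDTF (YYYY, YYYY-MM, YYYY-MM-DD). The sample records are invented.

```python
import re
from collections import Counter

# Date-only W3CDTF forms; the time-of-day forms are omitted in this sketch.
W3CDTF_DATE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def profile(records):
    """Count records using each element and date values that look like W3CDTF."""
    element_use = Counter()
    date_total = 0
    date_ok = 0
    for record in records:
        # Count each element once per record, however often it repeats.
        for element in {element for element, _ in record}:
            element_use[element] += 1
        for element, value in record:
            if element == "date":
                date_total += 1
                if W3CDTF_DATE.fullmatch(value):
                    date_ok += 1
    return element_use, date_ok, date_total


# Invented records as (element, value) pairs.
RECORDS = [
    [("title", "Main Street"), ("date", "1922")],
    [("title", "Courthouse"), ("date", "ca. 1930")],
    [("title", "Depot"), ("coverage", "1925-06")],
]
element_use, date_ok, date_total = profile(RECORDS)
```

Here the profile shows that only two of three records use date, that one record appears to carry a date in coverage, and that one of the two date values does not follow the expected encoding. These are the kinds of inconsistency an aggregator would otherwise have to discover and work around.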

Coherence

Shared metadata records should be self–explanatory; they should make sense at a glance to relatively naïve observers. At its most basic, this means that values should appear in appropriate elements, and that each instance of an element should contain one and only one value (no “packing” of multiple values into a single element and expecting an aggregator to figure out how to separate them). When multiple values are needed, the metadata element should be repeated. Description for specialized resources, of course, is necessarily just as specialized; however, shared records should include some high–level indication of the type of resource being described so that aggregators can determine if the resource is appropriate for the aggregation, and, if so, understand how to interpret the specialized data.
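The recommendation to repeat an element rather than pack values into it can be sketched as a small export step. The separators (semicolon and vertical bar) and the choice of which elements to unpack are assumptions about one hypothetical provider's data; commas are deliberately not treated as separators because they occur inside personal names.

```python
import re

# Assumption: this provider packs multiple values with semicolons or vertical bars.
SEPARATORS = re.compile(r"\s*[;|]\s*")


def unpack(record, elements=("subject", "creator")):
    """Repeat the listed elements once per value instead of packing values together."""
    unpacked = []
    for element, value in record:
        if element in elements:
            parts = [part for part in SEPARATORS.split(value.strip()) if part]
            unpacked.extend((element, part) for part in parts)
        else:
            # Other elements are left alone; a semicolon in a title may be intentional.
            unpacked.append((element, value))
    return unpacked


packed = [
    ("title", "Parade; Main Street"),
    ("subject", "Parades; Horses ; Streets"),
]
repeated = unpack(packed)
```

A provider knows its own packing conventions, so this step is far more reliable when done before sharing than when an aggregator has to guess at it afterwards.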

Context

Providing appropriate context in shared metadata records is perhaps the biggest change from metadata records intended for local use. First and foremost, shared metadata records should ensure all appropriate contextual information necessary to make sense of the resource is included. Metadata records in a local environment often omit information common to every record in the collection; however, this information is often the most important feature of the resource, and as such is essential to include in a shared environment. Wendler (2004) succinctly articulated this as the “On a horse” problem: a record for a photo simply titled “On a horse,” with no subject information, offers little suggestion of its potential relevance to a researcher. Within its local context — a collection of photographs of Theodore Roosevelt — this detail was unnecessary; outside this context, however, the collection–level information (for example, The Theodore Roosevelt Collection) can be vital to understanding what a record describes and how it should be indexed, particularly in the absence of other coherent description.

The reverse is also true: metadata essential in a local environment for the management of digital resources (for example, the date on which an item was digitized) is not appropriate for inclusion in a shared environment. Roy Tennant (n.d.) pointed to the inclusion of “[electronic resource]” in a DC <title> element: “a hold–over from MARC, and outside the context of library catalogs ... is, at least, misplaced.” In the current metadata aggregation landscape, it is safe to assume that users search and browse for resources at an aggregator’s site, then follow a link back to the home institution for access to the resource itself and any additional metadata. Therefore, when creating metadata for the purposes of inclusion in these aggregations, one can afford to be selective about the data elements included, with the understanding that users will find their way to the local records for full contextual information. As always, the context provided in a shared metadata record should be driven by its intended use. As the nature and common practices of metadata aggregators change, so will the contextual information appropriate for inclusion in shared metadata records.
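For the “On a horse” problem, restoring collection–level context can be as simple as adding the collection name to each item record at export time. The sketch below assumes records are lists of (element, value) pairs and uses the Dublin Core relation element for the collection name; that element choice is an assumption for illustration, and a provider might reasonably choose another.

```python
def add_context(record, collection_title):
    """Return a copy of an item record that also names its collection."""
    values = {value for _, value in record}
    if collection_title in values:
        # The collection is already named somewhere in the record.
        return list(record)
    return list(record) + [("relation", collection_title)]


item = [("title", "On a horse"), ("type", "Image")]
shared = add_context(item, "The Theodore Roosevelt Collection")
```

The local record is left unchanged; only the shared copy gains the extra element, which keeps the added context out of a local display where it would be redundant.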

Communication

Communication between metadata providers and aggregators can be of great benefit to both parties. Information on how records are created and distributed, such as the format in which the records are stored natively, the vocabulary and content standards used, how often and under what circumstances records are added or updated, analytical or supplementary materials, and provenance of resources, can all be useful to an aggregator in providing appropriate services based on a given set of metadata records. Such a relationship also benefits the metadata provider. While metadata providers cannot tailor shared records specifically for the needs of every aggregator, knowing how aggregators in general use those records can help the metadata provider create better shareable records. Communication between the two parties can similarly help each of them to better understand what they can do for the other.

Conformance to standards

Perhaps the most obvious but still overlooked responsibility of a metadata provider is to ensure its records conform to recognized standards. Conformance to the standard format the record uses is the most basic of these: field names, order, and repeatability should match the standard to which the records claim to conform. Yet there are other standards to which conformance is just as important. Consistent use and indication to the aggregator of a descriptive content standard, such as the Anglo–American Cataloging Rules (AACR2) for the library community, Cataloging Cultural Objects (CCO) for the museum community, or Describing Archives: A Content Standard (DACS) for the archives community, makes records more predictable and thus more shareable, as does the use of standard vocabularies and encoding standards.

Within a record, conformance to technical standards for character encoding and for record structures such as XML is absolutely essential to providing a shareable metadata record. In addition, conformance to the standard transfer protocol used to share records, for example OAI–PMH, Z39.50, or SRU, is a core competency for metadata providers. If the record doesn’t conform to the transfer protocol, it will not be retrievable by the aggregator at all.
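Two of the most basic technical checks, valid character encoding and well–formed XML, can be run before records are ever exposed. The sketch below assumes records are serialized as UTF-8, which OAI–PMH requires for its responses; the sample records are invented, and a real workflow would also validate against the schema of the metadata format in use.

```python
import xml.etree.ElementTree as ET


def check_record(raw_bytes):
    """Return a list of problems found in one serialized XML record."""
    problems = []
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError as error:
        # No point parsing bytes that are not in the expected encoding.
        problems.append(f"not valid UTF-8: {error.reason}")
        return problems
    try:
        ET.fromstring(raw_bytes)
    except ET.ParseError as error:
        problems.append(f"not well-formed XML: {error}")
    return problems


# Invented sample records.
good = "<record><title>Caf\u00e9 interior</title></record>".encode("utf-8")
bad_encoding = "<record><title>Caf\u00e9 interior</title></record>".encode("latin-1")
bad_markup = b"<record><title>Unclosed</record>"
```

In this sketch `check_record(good)` returns an empty list, while the other two samples each yield one problem message, the first for a Latin-1 byte that is not valid UTF-8 and the second for a mismatched tag.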

 

++++++++++

Challenges to creating shareable metadata

Implementation of shareable metadata is not a trivial task. Changing metadata practices within an organization — particularly those that are well established — will require investments of time and potentially financial resources to retool workflows and retrain staff. Considering how your metadata will appear outside of your local context and making appropriate changes to it can be difficult. Implementation of technical standards such as XML may not be easily accomplished for small institutions with limited technical expertise.

In addition, the digital collection management and repository software and other tools in use by cultural heritage institutions to manage and provide access to their digital resources do not always facilitate the creation of shareable metadata. For example, only one metadata format (Dublin Core) might be exposed via an OAI data provider in a digital repository system; this eliminates the possibility of exposing metadata which might better represent the resource and tends to encourage the overstuffing of the Dublin Core record. Support of (more) standards and a modular approach to digital collection management systems will be necessary for better shareable metadata. Of course, open source software systems can be changed to overcome some of the limitations, but commercial vendors will need to be convinced by their customers of the value of the changes necessary.

For some institutions, sharing metadata, much less creating shareable metadata, is simply not a priority. Some institutions, particularly those that are smaller and/or less technically adept, are simply unfamiliar with how or why they might expose their metadata to aggregations. This barrier is directly related to the value that can be demonstrated by building services around aggregated metadata, and that value in turn relies in many ways on the shareability of the metadata that can be harvested.

Finally, while we have focused here primarily on what metadata providers can do to make their metadata more interoperable, the aggregators and service providers also have an important role to play. Generally speaking, aggregators tend to have more access to technical resources that can be used to process the collected metadata. Aggregators can write normalization scripts for dates and geographic locations, for example, and can work on subject clustering using data mining algorithms and techniques. The balance between what metadata providers and service providers should do and where resources on each side are best spent is an area still under exploration.
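As an illustration of the kind of normalization script an aggregator might write for dates, the sketch below maps a few common free–text patterns to W3CDTF–style strings. The patterns covered, including the assumption that slash dates are month/day/year, are choices made for this example; real harvested data would need many more rules and a review step for the values that no rule matches.

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def normalize_date(value):
    """Return a W3CDTF-style date string, or None when no rule applies."""
    text = value.strip().rstrip(".")
    # Already YYYY, YYYY-MM, or YYYY-MM-DD: pass through unchanged.
    if re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", text):
        return text
    # M/D/YYYY, assuming US-style ordering for this provider.
    match = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if match:
        month, day, year = (int(group) for group in match.groups())
        if 1 <= month <= 12 and 1 <= day <= 31:
            return f"{year:04d}-{month:02d}-{day:02d}"
        return None
    # "April 25, 2005"
    match = re.fullmatch(r"([A-Za-z]+) (\d{1,2}), (\d{4})", text)
    if match and match.group(1).lower() in MONTHS:
        month = MONTHS[match.group(1).lower()]
        return f"{match.group(3)}-{month:02d}-{int(match.group(2)):02d}"
    # "ca. 1922", "c1922", "[1922]", "1922?": keep the year, drop the uncertainty marker.
    match = re.fullmatch(r"(?:ca?\.?\s*|\[)?(\d{4})\]?\??", text)
    if match:
        return match.group(1)
    return None
```

Note that the last rule discards the provider's indication of uncertainty, which is a loss of information. This is one reason the balance of work between providers and aggregators matters: the provider is usually better placed to encode such dates well in the first place.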

 

++++++++++

Conclusions

We have discussed in this paper the importance and characteristics of interoperable or shareable metadata based on our experiences as both metadata providers and metadata aggregators. Efforts such as the Best Practices for Shareable Metadata, RLG’s Descriptive Metadata Guidelines for RLG Cultural Materials, and the MODS Implementation Guidelines for Cultural Heritage Materials from the Digital Library Federation’s Aquifer Initiative play an important role in offering specific guidelines for creating shareable metadata. At the most basic level, institutions that contribute metadata through whatever means should consider the content and consistency of their metadata. Implementing shareable metadata may be a slow process that is conducted as institutions work with new collections, but the ability to think critically about the shareability of one’s own metadata and the commitment to make the necessary changes will be key for the next stage of effective digital library services.

 

About the authors

Sarah Shreeves is the Coordinator for the University of Illinois at Urbana–Champaign’s institutional repository, the Illinois Digital Environment for Access to Learning and Scholarship (IDEALS). She is the chair of the Metadata Working Group of the Digital Library Federation’s Aquifer Initiative and a co–editor of the Best Practices for Open Archives Initiative Data Provider Implementations and Shareable Metadata. Sarah has written and presented on the Open Archives Initiative, metadata quality issues in aggregations, and collection–level description. She has an MS in Library and Information Science from the University of Illinois at Urbana–Champaign, an MA in Children’s Literature from Simmons College, and a B.A. in Medieval Studies from Bryn Mawr College.

Jenn Riley is the Metadata Librarian with the Digital Library Program at Indiana University–Bloomington, where she is responsible for planning metadata strategy for digital library projects and participates in the collaborative design of digital library systems. Jenn’s research interests include “shareable metadata,” the incorporation of thesaurus structures into search and browse systems, music digital libraries, and FRBR. Jenn is a member of the Metadata Working Group of the Digital Library Federation’s Aquifer Initiative and is an active contributor to the Best Practices for Open Archives Initiative Data Provider Implementations and Shareable Metadata. Jenn is the author of the blog Inquiring Librarian (at http://inquiringlibrarian.blogspot.com), where her posts frequently center around improving intellectual access to library materials, and a contributor to the collaborative Blog and Wiki TechEssence (http://www.techessence.info), a technology resource for library administrators. She holds a B.M. in Music Education from the University of Miami (Fla.), and an M.A. in Musicology and an M.L.S. from Indiana University–Bloomington.

Liz Milewicz is a Research Associate on the Digital Programs team of Emory University Libraries’ Digital Programs and Systems division, where she works on a range of digital library projects. She is currently Project Manager on a two–year project funded by the National Endowment for the Humanities to create an expanded online version of the Trans–Atlantic Slave Trade Database. She received her MLS degree from the University of Alabama and is pursuing a Ph.D. at Emory University, where she is studying academic library culture.

 

Acknowledgements

The authors would like to thank the Institute of Museum and Library Services (IMLS) for sponsoring the 2006 WebWise pre–conference workshop on “Creating Shareable Metadata.” We would also like to acknowledge the support of the Digital Library Federation and the work of everyone who has contributed to the Best Practices for Shareable Metadata on which much of the substantive content of this article and workshop were based: Caroline Arms, Naomi Dushay, Muriel Foulonneau, Kat Hagedorn, Diane Hillmann, Arwen Hutt, Bill Landis, Jewel Ward, and Simeon Warner.

 

Notes

1. It is beyond the scope of this paper to describe the technologies available for sharing metadata, but briefly, SRU (Search and Retrieve via URL) and OpenSearch are both protocols that allow distributed or federated searching over a collection of metadata records. The SRU protocol is available at http://www.loc.gov/standards/sru/, and the OpenSearch documentation is available at http://opensearch.a9.com/. This is in contrast to the OAI protocol, which enables aggregated searching. The OAI protocol documentation is available at http://www.openarchives.org/.

2. OAI data and service providers are too numerous to list here. Brogan’s (2003) overview of the aggregation services, currently being updated for re–publication in 2006, offers a comprehensive assessment of the different types of services being developed through shared metadata.

3. Best Practices for Shareable Metadata are available in draft form at http://webservices.itcs.umich.edu/mediawiki/oaibp/?OAI_Best_Practices as of Spring 2006. These are part of a larger set of guidelines for OAI data provider implementations and represent the experiences of both OAI data providers and service providers. However, in this article we are speaking of shareable metadata no matter how it is shared, whether by OAI, file transfer protocol (FTP), or sending a CD–ROM with an MS Access database to an aggregator.

4. Caplan, 2003, p.33, emphasis added.

5. The metrics described by Bruce and Hillmann (2004) to measure metadata quality are: accuracy, completeness, provenance, conformance to expectations, logical consistency and coherence, timeliness, and accessibility.

 

References

William Y. Arms, Naomi Dushay, David Fulker, and Carl Lagoze, 2003. “A case study in metadata harvesting: the NSDL,” Library Hi Tech, volume 21, number 2, pp. 228–237.

Martha Brogan, 2003. A Survey of Digital Library Aggregation Services. Washington, D.C.: Digital Library Federation, and at http://www.diglib.org/pubs/brogan/, accessed 1 June 2006.

Thomas R. Bruce and Diane I. Hillmann, 2004. “The continuum of metadata quality: defining, expressing, exploiting,” In: Diane I. Hillmann and Elaine Westbrooks (editors). Metadata in Practice. Chicago: ALA Editions, pp. 238–256.

“Best Practices for Shareable Metadata,” at http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?PublicTOC, accessed 1 June 2006.

Priscilla Caplan, 2003. Metadata Fundamentals for All Librarians. Chicago: ALA Editions.

Digital Library Federation, 2005. MODS Implementation Guidelines for Cultural Heritage Materials: Draft for Public Comment and Review, at http://www.diglib.org/aquifer/DLF_MODS_ImpGuidelines_ver4.pdf, accessed 2 June 2006.

Naomi Dushay and Diane I. Hillmann, 2003. “Analyzing metadata for effective use and re–use,” DC–2003: Proceedings of the International DCMI Metadata Conference and Workshop, pp. 161–170, and at http://www.siderean.com/dc2003/501_Paper24.pdf, accessed 1 June 2006.

Kat Hagedorn, 2003. “OAIster: A ‘no dead ends’ OAI service provider,” Library Hi Tech, volume 21, number 2, pp. 170–181.

Martin Halbert, 2003. “The metascholar initiative: AmericanSouth.org and MetaArchive.org,” Library Hi Tech, volume 21, number 2, pp. 182–198.

Arwen Hutt and Jenn Riley, 2005. “Semantics and Syntax of Dublin Core Usage in Open Archives Initiative Data Providers of Cultural Heritage Materials,” Proceedings of the 5th ACM/IEEE–CS Joint Conference on Digital Libraries, Denver, Colo. (7–11 June). New York: ACM Press, pp. 262–270.

Carl Lagoze, 2001. “Keeping Dublin Core Simple: Cross Domain Discovery or Resource Description?” D–Lib Magazine, volume 7, number 1 (January), at http://dlib.anu.edu.au/dlib/january01/lagoze/01lagoze.html, accessed 2 June 2006.

Sarah L. Shreeves, Ellen M. Knutson, Besiki Stvilia, Carole L. Palmer, Michael B. Twidale, and Timothy W. Cole, 2005. “Is ‘quality’ metadata ‘shareable’ metadata? The implications of local metadata practice on federated collections,” In: Hugh A. Thompson (editor). Proceedings of the Twelfth National Conference of the Association of College and Research Libraries, April 7–10 2005, Minneapolis, MN. Chicago: Association of College and Research Libraries, pp. 223–237, and at http://www.ala.org/ala/acrl/acrlevents/shreeves05.pdf, accessed 1 June 2006.

RLG, 2004. Descriptive Metadata Guidelines for RLG Cultural Materials, at http://www.rlg.org/en/pdfs/RLG_desc_metadata.pdf, accessed 2 June 2006.

Roy Tennant, n.d. “Bitter harvest: Problems and suggested solutions for OAI–PMH data and service providers,” at http://www.cdlib.org/inside/projects/harvesting/bitter_harvest.html, accessed 1 June 2006.

Robin Wendler, 2004. “The eye of the beholder: Challenges of image description and access at Harvard,” In: Diane I. Hillmann and Elaine Westbrooks (editors). Metadata in Practice. Chicago: ALA Editions, pp. 51–69.


Editorial history

Paper received 5 June 2006; accepted 21 July 2006.


Copyright ©2006, First Monday.

Copyright ©2006, Sarah L. Shreeves, Jenn Riley, and Liz Milewicz.

Moving towards shareable metadata by Sarah L. Shreeves, Jenn Riley, and Liz Milewicz
First Monday, volume 11, number 8 (August 2006),
URL: http://firstmonday.org/issues/issue11_8/shreeves/index.html