Files in this item



UIUCLIS_2007_1_EARCH.pdf (application/pdf, 110kB)
Main article (PDF)


Title: Successful Scalability Techniques for Illinois Web Archive Search
Author(s): Jackson, Larry S.; Yuan, Huamin
Subject(s): web archive search Illinois state library
Abstract: The Capturing Electronic Publications (CEP) web archive, assembled since 2002 by the Electronic Archive Project group of the Graduate School of Library and Information Science (GSLIS) at the University of Illinois, Urbana-Champaign (UIUC) for the Illinois State Library (ISL), currently contains over 37 million files and is growing by over 900,000 files per month. For ISL to use this collection effectively in identifying, selecting, and migrating specific documents to permanent storage, some form of search mechanism had to be provided. However, the file inventory far exceeded the capacity of open-source or freeware search tools. Detecting files that had not changed between harvests allows search surrogate generation to be suppressed for those files. With that substantial reduction in search surrogate count accomplished, the existing provisions of the SWISH-E open-source search engine for using multiple search databases sequentially did not impose noticeable delays on search engine users. Combined, these approaches enable SWISH-E search across the entire collection, despite an assumed initial design limit of one million files.
Issue Date: 2007-04-27
Genre: Technical Report
Publication Status: unpublished
Peer Reviewed: not peer reviewed
Sponsor: Illinois State Library
Date Available in IDEALS: 2011-12-23
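
The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the abstract does not say how unchanged files were detected, so the SHA-256 content digest here is an assumption, the function names are hypothetical, and plain dictionaries stand in for SWISH-E search databases.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content fingerprint; SHA-256 is an assumption, not the report's method."""
    return hashlib.sha256(content).hexdigest()


def files_needing_surrogates(previous, current):
    """Return files that are new or changed since the previous harvest.

    previous: dict mapping URL -> digest recorded at the last harvest
    current:  dict mapping URL -> raw bytes captured in this harvest
    Unchanged files are skipped, so no new search surrogate is generated.
    """
    changed = []
    new_digests = {}
    for url, content in current.items():
        d = digest(content)
        new_digests[url] = d
        if previous.get(url) != d:
            changed.append(url)
    return changed, new_digests


def search_sequentially(indexes, term):
    """Query several small indexes one after another and merge the hits.

    Each index is a hypothetical dict mapping term -> list of URLs,
    standing in for one search database.
    """
    hits = []
    for index in indexes:
        hits.extend(index.get(term, []))
    return hits
```

In this sketch, a second harvest in which only one file changed yields a single entry in the changed list, so only that file would be indexed again; searching then walks the per-harvest indexes in order and concatenates the results.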
