Technology Infrastructure Supporting Digital Library Developments


Nancy K. Dennis

Collections and Technology Services

University of New Mexico General Library

Albuquerque, NM 87131

ndennis@unm.edu


Abstract

This chapter provides an overview of the technology infrastructure upon which digital libraries are created and accessed. It summarizes host servers, system and application software, desktop workstation requirements, interconnecting networks, and the creation and conversion of data. A case study of the Online Archive of New Mexico describes how more than one thousand finding aids to manuscript and oral history collections from four institutions were converted to SGML-formatted documents according to the EAD standard.



Digital libraries typically are large collections of objects stored and maintained in digital or electronic formats within complex computer systems that an end-user can access and manipulate. Digital objects can include text documents, databases, multimedia components such as still images, audio and video files, and vast collections of unorganized machine-readable data from experimental or scientific observations (such as satellites, geological activity, network logs). The challenge to developers of digital libraries is to create and support a stable, sustainable and scalable technical environment within which digital collections can grow.


A digital library includes five component parts:

  1. The host computer system, or server, where data is stored.

  2. System and application software that facilitates the organization, searching, display and maintenance of the digital objects.

  3. The end-user desktop workstation where the digital collections are displayed and manipulated.

  4. The network that delivers digital objects from the host server to the end-user.

  5. The creation and conversion of data.

The interactions among these component parts determine the success of a digital library implementation.



Host Servers


Servers are computers that store and process digital objects and manage communication with the end-user. Server configurations can range from a single PC-based computer, to clusters of networked workstations, to multiple-processor mainframe systems. The base computations are carried out by the central processor(s) of the server. Processors are rated according to the clock speed, measured in megahertz (MHz), at which they can manipulate data. The requirements for servers processing plain text are relatively low. However, processing and serving audio and video objects may require fast, dedicated multimedia servers.


The capacity to process data is determined by the amount of primary or main memory, called RAM (random access memory), available to the processor to complete the base computations. Memory is measured in bytes; one byte is typically equivalent to one typewriter keystroke. In recent years advances in server hardware have moved the standard RAM configuration from megabytes (millions of bytes) to gigabytes (billions of bytes).


Auxiliary or secondary memory is secure disk storage within the server. Unlike RAM, where the total memory capacity is accessible to the processor at the same time, disk memory must be accessed through a movable magnetic head device. Digital collections are commonly stored on the secondary or disk storage devices and then called into RAM for processing. RAID (Redundant Array of Independent Disks) configurations provide for uninterrupted operation and rapid restoration in the event of disk failure.


The third form of storage is remote or offline storage such as tape, Zip disk, CD-ROM or DVD. Historical data files or backup files are captured on these remote storage devices. Noerr (2000, p. 58) provides examples of storage requirements for a small collection of 100,000 text documents and for audio and video collections.


A network card installed in the server is required to connect to the global or local network. In some specialized applications where communication speed is critical, dedicated network servers are recommended.


Dramatic increases in the implementation of e-business applications have contributed to technological advances that directly benefit the infrastructure needed to support digital libraries. Overall hardware costs continue to decline; however, demands for computing capacity continue to increase, whether for processor speed, memory or network connectivity. In planning for the implementation of a digital library, expandability, flexibility and redundancy are key requirements of a robust and secure server platform. No matter how carefully the platform is planned at the beginning, demands for increased processor speed and for RAM and storage upgrades are inevitable.


Any discussion of hardware requirements for a production server must include security and backup systems. Ideally the server should be placed in a secure room with supplemental cooling. Systematic procedures should be outlined and followed to ensure that regular backups of data are performed (daily, weekly and monthly). Those backup tapes or disks should be stored offsite in the event of an environmental disaster affecting the physical location of the server. Hardware can be replaced; however, the investment in data creation and maintenance can span years, and in some cases the data could never be restored or replaced if no backup copies exist. An uninterruptible power supply (UPS) provides consistent electrical current to the server and, in the event of power loss, will initiate a controlled shutdown that protects data and hardware.
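
The daily, weekly and monthly backup schedule mentioned above can be expressed as a simple rotation rule. The Python sketch below is illustrative only: running the monthly backup on the first day of the month and the weekly backup on Sundays is an assumed policy, and the function name backup_tier is hypothetical. Neither is prescribed by this chapter.

```python
from datetime import date


def backup_tier(day: date) -> str:
    """Return which backup set to write on the given day.

    Assumed (hypothetical) policy: monthly on the 1st of the month,
    weekly on Sundays, daily on every other day.
    """
    if day.day == 1:
        return "monthly"
    if day.weekday() == 6:  # Monday is 0, Sunday is 6
        return "weekly"
    return "daily"


if __name__ == "__main__":
    for d in (date(2001, 3, 1), date(2001, 3, 4), date(2001, 3, 5)):
        print(d.isoformat(), backup_tier(d))
```

A real procedure would also record where each tape or disk is stored offsite and would periodically test that the backups can be restored.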



System and Application Software


The most important consideration during infrastructure planning is to determine the software that will be required to support the functions of the digital library. Those software decisions will drive the hardware and network specifications. Software running on the servers can be divided into two major categories: the operating system of the server and the application software. The operating system controls the server at the hardware level (e.g. processors and RAM). UNIX (in vendor-specific versions such as Solaris or AIX), Microsoft Windows NT Server and Linux are by far the most common operating systems or platforms in use. Hardware manufacturers will often bundle the operating system with the purchase of the server. The application software interacts with the operating system to manage and manipulate the data as instructed by the end-user. Database programs, web server software and electronic mail are typical application software programs. Although digital library development is in its infancy, integrated software packages are available and are being customized for this application.



End-user desktop workstations


The standard end-user workstation or PC is the desktop device upon which the end-user or patron depends to access, display, and download the collections and services of the digital library. The typical PC includes a processor, RAM, a floppy or Zip drive, a CD-ROM or DVD drive, a hard drive, a keyboard and mouse, a video card and monitor for display, a sound card and speakers or headset for audio, and a network card. The hardware is managed by the operating system (such as Windows 9x/NT, Linux, or Mac OS). At a minimum, the required client or application software includes a communication or network suite to manage the network connection to the digital collections, and a web browser with plug-ins or helper applications that support the display and downloading of documents (e.g. Adobe Acrobat Reader) and the playing of video or audio files. Processor speed, the amount of RAM and the capacity of the network card will determine how successfully the end-user can use a digital library.



Interconnecting networks


Providing access to the electronic collections that have been created is inherent in the development of digital libraries. Unlike physical libraries with books and journals sitting on a shelf, digital libraries occupy space on computers that are accessible from end-user workstations connected through local and global networks. For example, a signal traveling through a simple network connection between the host server and the end-user follows this typical path: from the server through a network card, to a cable connected to a series of routers and switches comprising the local, regional, national and global Internet backbones. Depending on the destination, the signal is then directed back down the hierarchy through the national, regional, and local backbones, via routers and switches, to a cable connected to the network card in the end-user's desktop computer. A modem-connected PC would use the public telephone system to connect to the Internet backbones.


The communication between so many layers of networking equipment and software is facilitated by a set of complex protocols. That set of protocols, TCP/IP (Transmission Control Protocol/Internet Protocol), is based on international open standards, and the networks that use it are commonly referred to as the Internet. The dominant application protocol on the Internet is HTTP (Hypertext Transfer Protocol), upon which the World Wide Web is based.
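
To make the layering concrete, the following Python sketch formats the text of a minimal HTTP request, which is what a web browser hands to TCP/IP for delivery to the server. It is an illustration under stated assumptions: the function name build_get_request is hypothetical, HTTP/1.0 is chosen only for brevity, and no network connection is opened. The host and path are taken from the project web site address listed under RESOURCES.

```python
def build_get_request(host: str, path: str) -> bytes:
    """Format a minimal HTTP/1.0 GET request as the bytes carried over TCP."""
    lines = [
        f"GET {path} HTTP/1.0",  # request line: method, resource, protocol version
        f"Host: {host}",         # names the web server being addressed
        "",                      # blank line marks the end of the headers
        "",
    ]
    # HTTP separates lines with a carriage return and line feed.
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(build_get_request("eLibrary.unm.edu", "/oanm").decode("ascii"))
```

The server answers with a status line, headers and the requested document, all carried back over the same TCP connection.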


The capacity, bandwidth and speed of the local and global networks that deliver data from the server to the end-user are major concerns when designing and planning for digital libraries. Just as a chain is only as strong as its weakest link, a network is only as fast as its slowest link. Fortunately, commercial and e-business development entities share these concerns, and extensive capital investments are being made in global network systems. For example, a PC connected to the Internet with a 56 Kbps (kilobits per second) modem can communicate at approximately 5.7 KB (kilobytes) per second. At that rate, transmitting a 10 KB text article over a public network would take only a few seconds, whereas a 600 KB-1,500 KB audio or video clip could take several minutes to reach the end-user. Streaming audio and video with efficient compression techniques can reduce transmission times; however, future developments that increase network capacity, such as Internet2 and xDSL, will contribute more to the success of digital libraries. Noerr (2000, p. 63) provides an excellent discussion of networking issues in the Digital Library Toolkit.
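
The transmission times quoted above follow from dividing file size by throughput. The short Python sketch below reproduces that arithmetic using the chapter's figure of approximately 5.7 KB per second; the function name transfer_seconds is hypothetical, and the calculation is a best case that ignores latency, congestion and protocol overhead.

```python
def transfer_seconds(size_kb: float, rate_kb_per_s: float = 5.7) -> float:
    """Estimate the time in seconds to move size_kb kilobytes at a steady rate."""
    return size_kb / rate_kb_per_s


if __name__ == "__main__":
    examples = [
        ("10 KB text article", 10),
        ("600 KB audio or video clip", 600),
        ("1,500 KB audio or video clip", 1500),
    ]
    for label, size_kb in examples:
        seconds = transfer_seconds(size_kb)
        print(f"{label}: about {seconds:.0f} seconds ({seconds / 60:.1f} minutes)")
```

Under these assumptions the text article arrives in about 2 seconds, while the clips take roughly 105 to 263 seconds, or about 2 to 4 minutes.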



Creation and conversion of data


The most significant contribution of digital libraries is the creation of digital content, whether "born digital" or converted from print-based formats. Several representative digital library projects have been initiated to advance developments in both arenas. Noerr's Digital Library Toolkit and Fox's chapters 4 and 5 of this book provide extensive descriptions of current projects.


Most library ventures into the creation of digital libraries include the conversion of local collections from print-based formats to digital formats. There are two major methods for conversion of text-based collections: manual re-keying of text and Optical Character Recognition (OCR) scanning. Before any text is converted a document structure or schema must be determined and the markup method developed. Of course the application software used to produce the collection must also be considered.


If re-keying is chosen, the materials are removed from their shelves or cabinets, data entry is performed, and errors are detected and corrected. This process is labor intensive and slow, and it can be expensive if executed in-house by the library. Commercial service bureaus or conversion vendors with highly skilled data entry operators located in several countries can provide cost-effective, high-quality conversions. Typically, re-keying is considered to be the most expensive means of conversion.


OCR scanning is a viable option for some text data conversion projects. Rather than re-keying the text, scanners are used to "read" the characters and convert them into digitally encoded text. Materials to be scanned must be removed from their storage containers. Bound materials may need to be photocopied or unbound. Scanning speeds can vary depending on the capacity of the scanners and the PC hardware and software selected. The quality of the original document ultimately will determine the quality of the scanned document.


Documents that include images (e.g. photos, drawings, maps, graphs) must be converted using scanners and document imaging techniques. Integrated commercial packages bundle the hardware (servers, scanners, and workstations) with the software needed to index and retrieve the processed collections.


Considerable progress has been made in recent years in increasing the functionality and productivity of conversion techniques. Open standards should be used where available, affordable and feasible. Avoid investing in conversions that will leave data in long-term archival storage in proprietary formats such as Adobe's Acrobat. Will a reader program exist in 20 years that can read that data? Or will that data have to be repeatedly converted to keep pace with future versions of reader programs?


Cornell University Library has published a Digital Imaging Tutorial in both English and Spanish (Cornell, 2000). The Northeast Document Conservation Center conducts the School for Scanning; see its web site for more information (http://www.nedcc.org).

Saffady (1999, p. 291) and Lesk (1997, p. 48) offer comprehensive chapters on digital libraries and text conversions. Archives Builders provides extensive information on document conversion techniques; see its web site for more information (http://www.ArchiveBuilders.com).


While the focus of this chapter has been on the technical aspects of digital libraries, perhaps the most valuable investment an organization can make is in the human resources necessary to create and support the hardware infrastructure and the creation and conversion of content. The recruitment and hiring of knowledgeable technicians is crucial, but the commitment does not end there. Ongoing training is a must. The constant surveying of technological developments and new digital library projects will be required to maintain an awareness of new techniques.



A Case Study of the Online Archive of New Mexico


The University of New Mexico General Library was the recipient of a National Endowment for the Humanities grant in 1999. The purpose of the grant was to create a technology center to support the conversion of finding aids to more than one thousand manuscript and oral history collections held at four libraries and museums in New Mexico. The four participating institutions are the UNM Center for Southwest Research, Rio Grande Historical Collections at New Mexico State University, New Mexico State Archives and Records Center and the History Library at the Palace of the Governors in Santa Fe. The collections offer a representative sampling of more than 400 years of rich cultural history held in these four institutions. The project is scheduled for completion in May 2001. The participating institutions will contribute more than half of the costs required to complete this project.


More than 13,000 pages of finding aids will be encoded in SGML (Standard Generalized Markup Language) using the standard Encoded Archival Description (EAD) Document Type Definition (DTD). The EAD is jointly maintained by the Society of American Archivists and the Network Development and MARC Standards Office of the Library of Congress (see the EAD web site at http://www.loc.gov/ead for more information).


Approximately half of the original finding aids are in a paper-only format. Most of these finding aids were sent to a service bureau for conversion to an electronic format using SGML. The other half were already in an electronic form (mostly ASCII, word processing or database formats). Those documents were converted in-house by encoding specialists assigned to the project, using SGML authoring tools and ASCII text editors.


To support searching and display for the end-user, the EAD finding aids in SGML format are converted to HTML by a Perl script. The archival version of each finding aid will remain in SGML for future use. The published version will be in HTML, ensuring accessibility from standard web browsers at the end-user's desktop. Long-term goals include converting the SGML to XML (Extensible Markup Language).
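
The project's conversion script was written in Perl and is not reproduced in this chapter. As a rough illustration of what such a conversion involves, the Python sketch below turns a tiny EAD-style fragment into HTML. It rests on several assumptions: the sample record is invented, the fragment is well-formed XML (real SGML may omit end tags and would need an SGML-aware parser), and the function name ead_to_html and the HTML layout are hypothetical. Only the element names (ead, archdesc, did, unittitle, unitdate, scopecontent, p) come from the EAD DTD.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented sample record; not taken from the Online Archive of New Mexico.
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Example Family Papers</unittitle>
      <unitdate>1880-1920</unitdate>
    </did>
    <scopecontent>
      <p>Correspondence and photographs.</p>
    </scopecontent>
  </archdesc>
</ead>"""


def ead_to_html(ead_text: str) -> str:
    """Build a minimal HTML page from the title, dates and scope note."""
    root = ET.fromstring(ead_text)
    title = root.findtext("archdesc/did/unittitle", default="Untitled")
    dates = root.findtext("archdesc/did/unitdate", default="")
    paragraphs = [p.text or "" for p in root.findall("archdesc/scopecontent/p")]

    # Escape text so characters such as & and < display correctly in HTML.
    heading = escape(f"{title}, {dates}" if dates else title)
    body = "\n".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return (
        "<html>\n"
        f"<head><title>{heading}</title></head>\n"
        f"<body>\n<h1>{heading}</h1>\n{body}\n</body>\n"
        "</html>"
    )


if __name__ == "__main__":
    print(ead_to_html(SAMPLE_EAD))
```

Keeping the encoded source as the archival copy and deriving the HTML from it, as the project does, means the display can be regenerated whenever the presentation or the target format changes.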


A demonstration project within the larger project will provide access to more than 400 digital facsimiles of selected documents, photographs and/or audio clips from the collections being indexed. Images from the collections will be scanned and linked from URLs within the container lists of the digital finding aids. These images will be scanned at 600 dpi and saved as TIFF images for archival storage. They will be converted to smaller JPEG and/or GIF files for public display.
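
Scanning at 600 dpi produces large archival files, which is why smaller derivatives are made for public display. The Python sketch below estimates the uncompressed pixel data for one scan. The page size (8.5 by 11 inches) and the bit depths are assumptions chosen for illustration, not project specifications, and the function name uncompressed_bytes is hypothetical. Actual TIFF files may be somewhat larger because of header data or smaller if compression is applied.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int = 600, bits_per_pixel: int = 24) -> int:
    """Estimate the raw pixel data, in bytes, for one scanned image."""
    width_px = round(width_in * dpi)    # pixels across
    height_px = round(height_in * dpi)  # pixels down
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Assumed example: a letter-size page in 24-bit color and 8-bit grayscale.
    for label, depth in (("24-bit color", 24), ("8-bit grayscale", 8)):
        size = uncompressed_bytes(8.5, 11, bits_per_pixel=depth)
        print(f"{label}: {size:,} bytes (about {size / 1_000_000:.0f} MB)")
```

Under these assumptions a single color page is roughly 100 MB uncompressed, so storage for even a few hundred archival images needs to be planned alongside the server's disk capacity.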


Open access to the finding aids will be facilitated by the project web site as well as by collection-level MARC records. Those MARC records will be created and loaded into the project union catalog maintained by the University of New Mexico (http://LIBROS.unm.edu) and in the OCLC WorldCat database. URL links to the full text of the finding aids are embedded within the 856 field of the MARC record.
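
As an illustration of the link embedded in the MARC record, the Python sketch below formats an 856 field in a common human-readable display form. Several details are assumptions rather than project facts: the indicator values (first indicator 4 for HTTP access, second indicator 2 for a related resource), the use of subfield $3 for a label, and the function name marc_856. The URL shown is the project home page from RESOURCES; an actual record would point to the individual finding aid.

```python
def marc_856(url: str, materials: str = "") -> str:
    """Format a MARC 856 (electronic location and access) field for display.

    Assumed indicators: 4 = access by HTTP, 2 = related resource.
    """
    parts = ["856", "42"]
    if materials:
        parts.append(f"$3 {materials}")  # $3: materials specified (a label)
    parts.append(f"$u {url}")            # $u: the URL itself
    return " ".join(parts)


if __name__ == "__main__":
    print(marc_856("http://eLibrary.unm.edu/oanm", "Finding aid"))
```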


One project manager, 1.5 FTE encoding specialists (actually three employees assigned part time to the project) and the half-time assignment of a systems analyst comprise the central technical team. At the beginning of the project, intensive training in SGML and the EAD was required for the project staff at the central technology center and for the project coordinators at the participating institutions. Four days were devoted to training and to developing "best practices" among the project participants. Experts from the Online Archive of California and the University of Virginia conducted the training.


A new web server was acquired and installed in early 1999 to host all of the UNM General Library web services as well as host the online archive project.


The OANM web server and web environment consist of the following:

  1. Dell PowerEdge 6300 with two 500 MHz Pentium III processors and 512 MB of memory.

  2. Storage space on the server is controlled by an ICP Vortex GDT6528RD PCI-to-Ultra2 SCSI RAID controller and consists of five hot-swappable 9.1 GB drives operating as a RAID 5 disk array.

  3. 100BASE-T switched Ethernet network connection.

  4. Operating system: SuSE Linux v. 6.1 running Apache server v. 1.3.6.

  5. Currently SGML is being converted to HTML with a Perl script written by Alvin Pollock using Perl v. 5.005_02. Both the SGML file and the derived HTML files are stored on the server.

  6. Full-text searching is done using sgrep v. 1.92a.
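
The usable capacity of the disk array in item 2 can be estimated with a short calculation. RAID 5 reserves the equivalent of one drive for parity, so usable space is the drive count minus one, multiplied by the drive size. The Python sketch below assumes that all five drives are active members of the array (no hot spare) and ignores file system overhead; the function name raid5_usable_gb is hypothetical.

```python
def raid5_usable_gb(drive_count: int, drive_gb: float) -> float:
    """Return the approximate usable capacity of a RAID 5 array in gigabytes."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    # One drive's worth of space is consumed by parity information.
    return (drive_count - 1) * drive_gb


if __name__ == "__main__":
    print(f"Usable capacity: {raid5_usable_gb(5, 9.1):.1f} GB")
```

Under these assumptions the five 9.1 GB drives provide about 36.4 GB of usable storage, and the array can continue operating if any single drive fails.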


A public launch of the OANM was made at the annual meeting of the New Mexico Library Association in March of 2001. The New Mexico State Library will be leading an end-user and reference evaluation process to determine the effectiveness of the web site (e.g. navigation, collections, presentation). An Advisory Board of historians and librarians from around the state has been formed to help guide the future development of the OANM. A requirement of the grant is to articulate a plan for the ongoing addition of new participants and collections. Thus far, unofficial previews of the web site have resulted in very positive comments from teachers and from public, school and academic librarians.


This collaborative project has provided four institutions in the state of New Mexico with the opportunity to acquire and apply the technical expertise required to add current and future unique collections resources to the growing national and international digital library initiatives.



RESOURCES:


Archives Builders. This site provides extensive information on document conversion techniques. [Online at]: http://www.ArchiveBuilders.com.

Arms, W. Y. (2000). Automated digital libraries: How effectively can computers be used for the skilled tasks of professional librarianship? D-Lib Magazine, 6 (7/8). [Online at]: http://www.dlib.org/dlib/june00/hughes/06/hughes.html.

Cornell University Library/Department of Preservation and Conservation. (2000-2001) Moving Theory into Practice; Digital Imaging Tutorial. [Online at]: http://www.library.cornell.edu/preservation/tutorial/contents.html.

Encoded Archival Description web site. [Online at]: http://www.loc.gov/ead.

Hughes, C. A. (2000). Lessons learned: Digitization of special collections at the University of Iowa Libraries. D-Lib Magazine, 6 (6). [Online at]: http://www.dlib.org/dlib/june00/hughes/06/hughes.html

Lesk, M. (1997). Practical digital libraries. San Francisco, CA: Morgan Kaufmann Publishers.

Noerr, P. (2000, March) The Digital Library Toolkit. Palo Alto, CA: Sun Microsystems. [Online at]: http://www.sun.com/products-n-solutions/edu/whitepapers/pdf/digital_library_toolkit.pdf.

Sitts, M. K. (Ed.). (2000). Handbook for digital projects: a management tool for preservation and access, First edition. Andover, MA: Northeast Document Conservation Center. [Also online at]: http://www.nedcc.org/digital/dighome.htm.

[This Online monograph is intended to meet the needs of libraries and museums for basic information on planning and managing digital projects. The Northeast Document Conservation Center conducts frequent training sessions. School for Scanning. Online at http://www.nedcc.org].

Online Archive of New Mexico web site. Albuquerque, New Mexico. [Online at]: http://eLibrary.unm.edu/oanm

Saffady, W. (1999). Introduction to automation for libraries, Fourth edition. Chicago, IL: American Library Association.