Digital Library for International Materials: Guidelines for Digitization Projects

The DLIR has more than 30 active participants that contribute catalog records, digital objects, and project resources to the DLIR web site. Each digitization project presents its own challenges. Given variations in environment, equipment quality, personnel training, and project objectives, it is difficult to maintain one precise standard for digitization across all projects. Instead, DLIR provides guidelines and minimum standards to help project designers make decisions that match local capabilities.

Copyright

Make sure you have the correct permissions before you begin any project. Some helpful links are below:

Columbia University Libraries. Information Services. Copyright Advisory Office. (2009)

Copyright Protection in the Arab World (2010)

For more information on guidelines for Arab world resources, contact the DLIR.

Balance Preservation and Access

The two basic goals of any digitization project are preservation and access. Limited resources and time mean that every project must balance them. The goals are not exactly complementary: increases in access generally translate into decreases in preservation quality, and vice versa. A project manager therefore has to decide how much preservation quality to sacrifice in order to facilitate access to the objects being digitized. Because the variables involved are technical and precise, striking this balance can look like a science; in practice it is as much art as science.

Creating a Master File

At the ideal end of the preservation scale, every project should produce a master file for each digital object at the highest resolution possible given equipment and time constraints. The Arizona State Library, Archives and Public Records (ASLAPR) provides a handy table of minimum standards for different object types.

Textual Photographic Documents Maps, Drawings, Bitonal

Scan resolution: 200-300 dpi grayscale TEXT only min. 11” on long dimension
File format & resolution: Uncompressed TIFF, 200 dpi at original size

Scan resolution: 4000 pixels on long dimension OR 600 dpi
File format & resolution: Uncompressed TIFF, 4000 pixels on long dimension OR 600 dpi at original size

Scan resolution: 4000 pixels on long dimension OR 600 dpi
File format & resolution: Uncompressed TIFF, 4000 pixels on long dimension OR 600 dpi at original size

Scan resolution: 300 dpi
File format & resolution: Uncompressed TIFF, 200-300 dpi at original size

Source: Arizona State Library, Archives and Public Records Digital Project Guidelines, pg. 12. http://www.lib.az.us/digitalProjectsGuidelines.pdf. Access date 4/6/2011.
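The "4000 pixels on long dimension OR 600 dpi" figure in the table can be turned into a quick calculation for a given object. The sketch below is only an illustration: the function names and the sample object sizes are hypothetical, and it assumes the object's dimensions are measured in inches.

```python
import math


def dpi_for_long_edge(width_in, height_in, target_pixels=4000):
    """Smallest whole-number dpi giving at least target_pixels on the long edge."""
    long_edge_in = max(width_in, height_in)
    return math.ceil(target_pixels / long_edge_in)


def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a scan made at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical 8 x 10 inch photograph: 400 dpi reaches 4000 pixels.
print(dpi_for_long_edge(8, 10), pixel_dimensions(8, 10, 400))

# Hypothetical 4 x 6 inch print: 600 dpi gives only 3600 pixels on the long edge,
# so the two alternatives in the table are not equivalent for small originals.
print(dpi_for_long_edge(4, 6), pixel_dimensions(4, 6, 600))
```

For small originals the pixel target demands a higher dpi than 600, while for large originals it demands less, which is presumably why the table offers the two as alternatives.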

Uncompressed TIFF is the preferred format for master files because it captures the object in digital form without discarding potentially important image information. The trade-off is larger files on disk and, often, longer scanning times. Formats such as JPEG use lossy compression: image information is discarded when the file is saved, and each further conversion or re-save can degrade the image again. In return, JPEG files are much smaller and quicker to produce.
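To get a feel for the storage side of this trade-off, the size of an uncompressed image can be estimated from its pixel dimensions and bit depth. This is a rough sketch, not a statement about any particular scanner or TIFF writer: the function name is hypothetical, and the estimate ignores file headers and row padding.

```python
def uncompressed_bytes(width_px, height_px, channels=3, bits_per_channel=8):
    """Approximate size of the raw pixel data, ignoring headers and padding."""
    return width_px * height_px * channels * bits_per_channel // 8


# Hypothetical 8 x 10 inch photograph at 400 dpi, 24-bit color.
photo = uncompressed_bytes(3200, 4000)

# Hypothetical 8.5 x 11 inch page at 300 dpi, 8-bit grayscale.
page = uncompressed_bytes(2550, 3300, channels=1)

print(f"photo: {photo / 1_000_000:.1f} MB, page: {page / 1_000_000:.1f} MB")
print(f"1,000 such photos: about {photo * 1000 / 1_000_000_000:.1f} GB")
```

Multiplying the per-object figure by the number of objects in a collection gives a first estimate of the disk space a project's master files will need.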

The choice of file format and resolution depends on the circumstances of a particular project. In many cases, it may not be possible or even desirable to scan or photograph each object at the archival resolutions described above. Even so, project managers should think carefully about how the digitized objects will be used in the future, not only in the immediate project. A helpful way to think of resolution is as wiggle room built into the master files for changes in technology, shifting priorities, or unforeseen uses. The larger the image and the greater the resolution, the more options one has for modifying the image later. If an image is captured at low resolution or heavily compressed from the beginning, future uses that require higher resolution are ruled out. In other words, it is always possible to reduce the resolution of a high-quality image, but it is impossible to restore detail that a low-resolution image never captured.

To summarize that idea in a simple rule: if a project cannot meet the archival standards presented above, always scan or photograph at the maximum resolution one can reasonably foresee ever needing.
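The point that resolution can be reduced but not restored can be seen with a toy example. The sketch below treats one row of grayscale pixel values as a plain list; the helper names and the averaging method are illustrative assumptions, not a description of how any imaging software resamples.

```python
def downsample(row, factor=2):
    """Reduce resolution by averaging each group of `factor` pixels."""
    return [sum(row[i:i + factor]) // factor for i in range(0, len(row), factor)]


def upsample(row, factor=2):
    """Increase pixel count by repeating each pixel `factor` times."""
    return [value for value in row for _ in range(factor)]


# A row of alternating black (0) and white (255) pixels, e.g. fine line detail.
original = [0, 255, 0, 255, 0, 255, 0, 255]

low_res = downsample(original)
restored = upsample(low_res)

print(low_res)               # the fine detail has merged into uniform gray
print(restored == original)  # enlarging again does not bring the detail back
```

The enlarged row has the original number of pixels, but the alternating pattern is gone; only a master file that recorded the detail in the first place can supply it.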

   

Metadata

There are four primary types of metadata produced during a digitization project. The first type is the technical data attached to each image automatically by the scanning or photographic device. Examples of this kind of data are the image's dimensions in pixels, bit-depth, and time of creation. This data is generally more important to technicians handling the preservation of the digital objects, and it is less important to viewers and users of the object. The second type, descriptive data, is more important to viewers and users. This metadata is usually collected by the image processor or an expert archivist, and it includes information such as the creator of the original object, its date of creation, its dimensions, notes, and descriptive title. A third type, structural metadata, is used to articulate the parts of a whole item, e.g. the volumes, issues/supplements, and pages of a journal run. Last, administrative metadata tracks rights and permissions for intellectual property, contact information, and location of documentation.

Dublin Core Metadata Standard

There are numerous standards for metadata, but DLIR encourages the use of Dublin Core as a minimum metadata standard for digital projects. Each object should have information for as many of the following fifteen fields as possible (required elements are highlighted in red). Controlled vocabularies may dictate the field contents. Consistency of descriptive terms improves results from search and retrieval actions.

  • Creator: An entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the author of a text object, or the artist or photographer of an image object. Names may be cited as they appear on the object or checked against an authorized list of name entries, such as the Library of Congress Name Authorities.
  • Contributor: An entity responsible for making contributions to the resource. Examples of a Contributor include a person, an organization, or a service. Typically, the Contributor should be used to indicate the co-author, editor, translator, or illustrator of a text object. Names may be cited as they appear on the object or checked against an authorized list of name entries, such as the Library of Congress Name Authorities.
  • Title: A name given to the resource. Copy the title as given on the title page of a text object. Assign titles to images or archival materials, e.g., Cityscape or Correspondence.
  • Type: The nature or genre of the resource, e.g., Collection, Dataset, Event, Image, InteractiveResource, MovingImage, PhysicalObject, Service, Software, Sound, StillImage, Text. To describe the file format, physical medium, or dimensions of the resource, use the Format element.
  • Coverage: The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant. Spatial topic and spatial applicability may be a named place or a location specified by its geographic coordinates. Temporal topic may be a named period, date, or date range. A jurisdiction may be a named administrative entity or a geographic place to which the resource applies. Recommended best practice is to use a controlled vocabulary such as the Thesaurus of Geographic Names [TGN]. Where appropriate, named places can be used in preference to numeric identifiers such as sets of coordinates.
  • Date: A point or period of time associated with an event in the lifecycle of the resource. Date may be used to express temporal information at any level of granularity, e.g. Date: 2003-02-15 Time: 13:50:05-05:00
  • Publisher: An entity responsible for making the original resource available. Examples of a Publisher include a person, an organization, or a service. Cite the publisher as given on the original object. For born digital works, cite the web address.
  • Description: An account of the resource. Description may include but is not limited to: an abstract, a table of contents, a graphical representation, or a free-text account of the resource.
  • Identifier: An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string conforming to a formal identification system, such as a unique Library of Congress call number or a persistent URL (a PURL).
  • Relation: A related resource. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system. E.g., Forms part of [collection title + identifier/persistent URL]
  • Source: A related resource from which the described resource is derived. The described resource may be derived from the related resource in whole or in part. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system. E.g., [full citation for the source item + identifier/permanent URL]
  • Rights: Information about rights held in and over the resource. Typically, rights information includes a statement about various property rights associated with the resource, including intellectual property rights for the original resource and the digital versions. Include contact information for usage rights and permissions.
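As an illustration of how a record built from the fields above might be stored, the sketch below serializes a Python dictionary as simple Dublin Core XML using the standard `http://purl.org/dc/elements/1.1/` namespace. The `record` wrapper element, the function name, and all sample values are hypothetical; this is not the DLIR's submission format, which should be confirmed with the DLIR.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_xml(record):
    """Serialize a dict of Dublin Core fields; list values become repeated elements."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element
    for field, values in record.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample record for an image object.
sample = {
    "title": "Cityscape",
    "creator": ["Hansen, J", "Smith, P"],
    "type": "StillImage",
    "date": "2003-02-15",
    "rights": "Contact the holding institution for usage rights and permissions.",
}

print(dublin_core_xml(sample))
```

Repeating an element, as with the two creators here, keeps each name in its own field instead of packing several values into one string.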

Creating Academic Quality Data

Dublin Core is designed to be fairly flexible, so that different types of objects can be described appropriately. The DLIR also has a slightly flexible framework for the display of objects on its web site. Because these frameworks are not rigid, project managers are encouraged to submit metadata samples to ensure conformity with both the Dublin Core and existing DLIR web design.

The key to collecting any data is consistency and integrity. With just a few objects, inconsistent data can be corrected in post-processing. With thousands of objects, misspelled and inconsistent entries yield inaccurate or misleading information and can take an inordinate amount of time to correct; revisions can take longer than the original data collection. Unless you are prepared to revise the data at a later time, it is highly recommended that you take steps to ensure good data collection at the outset of the project.
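One inexpensive safeguard is to run simple automated checks on the data sheet while it is still small. The sketch below assumes each row is a dictionary keyed by column name; the function names, the choice of required columns, and the date pattern are assumptions made for illustration and should be adapted to the project's own rules.

```python
import re
from collections import defaultdict

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD (an assumed project convention).
ISO_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def check_rows(rows, required=("Title", "Identifier")):
    """Return (row number, column, message) for each problem found."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for column in required:
            if not row.get(column, "").strip():
                problems.append((number, column, "missing required value"))
        date = row.get("Date", "").strip()
        if date and not ISO_DATE.fullmatch(date):
            problems.append((number, "Date", f"unexpected date format: {date!r}"))
    return problems


def spelling_variants(rows, column):
    """Group values that differ only in letter case or spacing."""
    groups = defaultdict(set)
    for row in rows:
        value = row.get(column, "")
        if value.strip():
            groups[" ".join(value.lower().split())].add(value)
    return [sorted(values) for values in groups.values() if len(values) > 1]


# Hypothetical rows from a data sheet.
rows = [
    {"Title": "Cityscape", "Identifier": "img-001", "Date": "2003-02-15", "Publisher": "Example Press"},
    {"Title": "", "Identifier": "img-002", "Date": "Feb 15, 2003", "Publisher": "example  press"},
]

print(check_rows(rows))
print(spelling_variants(rows, "Publisher"))
```

Checks like these catch only mechanical inconsistencies; misspellings and wrong attributions still need a controlled vocabulary or a human reviewer.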

   

Parsing data

Data should be treated as atomic elements of information whenever possible. Data parsed (i.e., broken down to its elemental components) into highly refined columns is analogous to a high resolution image with similar trade-offs in terms of input efforts and potential future usability. It is always easy to merge highly parsed data in a spreadsheet into more general cells of information, but it is much harder to parse general cells into refined data. Take these two examples of unparsed and parsed data for books with one or more authors:

Unparsed Data

Author(s)
Hansen, J; Smith, P; Ried, E
Jones, S; Munson, D
Smith, J
Stewart, J; Colbert, S; Sedaris, A

Parsed Data

Author One     Author Two     Author Three
Hansen, J      Smith, P       Ried, E
Jones, S       Munson, D
Smith, J
Stewart, J     Colbert, S     Sedaris, A

The parsed data can be converted into the unparsed form simply by concatenating the three columns, but parsing the unparsed data into three columns takes more effort. It is possible to parse this data further into first and last names for each author, but whether this is appropriate is determined by balancing the effort required to parse the data during processing against the future value of having highly parsed data.
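The asymmetry between merging and parsing can be sketched in a few lines. The function names are hypothetical, and the splitting functions assume that every cell uses the same separators as the examples above (a semicolon between authors, a comma between last name and initial), which real data collected without a convention often does not.

```python
def merge_authors(columns, separator="; "):
    """Join parsed author columns into one cell, skipping empty columns."""
    return separator.join(name.strip() for name in columns if name and name.strip())


def split_authors(cell, separator=";"):
    """Split an unparsed cell into authors; relies on a consistent separator."""
    return [name.strip() for name in cell.split(separator) if name.strip()]


def split_name(author):
    """Split 'Last, First' into its two parts at the first comma."""
    last, _, first = author.partition(",")
    return last.strip(), first.strip()


print(merge_authors(["Jones, S", "Munson, D", ""]))
print(split_authors("Hansen, J; Smith, P; Ried, E"))
print(split_name("Hansen, J"))
```

Merging works on any input, while splitting silently produces wrong results as soon as one row departs from the separator convention, for example by joining authors with "and".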

To summarize this idea with a simple rule: Data should be consistently arrayed in columns and parsed as atomically as necessary to meet all reasonable potential uses for the data.

Of the available metadata standards, the DLIR has selected Dublin Core for its simplicity and adaptability. The Dublin Core data elements can be mapped into USMARC format, the data standard used by the DLIR catalog, OCLC’s WorldCat, and most academic libraries. Project managers should avoid creating custom standards unless no available general standard is appropriate for the project. Always consult with the DLIR technical team at the outset of a project to ensure compatibility with existing resources, functionality, and display.

Download a template and sample data collection: Dublin Core Data Sheet Template