Friday, April 25, 2008

Digital Preservation Matters - 25 April 2008

Preserving the Data Explosion: Using PDF. Betsy Fanning. Digital Preservation Coalition & AIIM. February 2008. [PDF]

This report looks at PDF standards activities and their relevance to digital preservation. The PDF Reference is an open specification made freely available by Adobe. The various versions are listed; in 2000 subsets were created, including PDF/A for archiving, which is being developed by AIIM and an ISO group. They looked at a variety of formats for long term preservation and "PDF was chosen as the file format best suited for long-term preservation due to its wide adoption in numerous applications and ease of creating PDF files from digitally born documents." Long term is defined as "the period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository, which may extend into the indefinite future."

PDF is an open file format but is considered proprietary because Adobe Systems owns patents on the format. However, Adobe allows developers to use the specification royalty-free. The objective is to find a format that:

  • is device-independent
  • is self-contained for rendering and description
  • does not have restrictive elements that prevent rendering the document long term
  • is in widespread use

PDF does not meet all of these objectives and has issues that need to be resolved. PDF/A limits some functions, and there are two levels:

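  • PDF/A-1a: must meet all requirements of the specification; a file may include any feature of PDF version 1.4 except those forbidden by the specification,
  • PDF/A-1b: must meet the minimal requirements needed to preserve the visual appearance of the document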

Adobe products will conform to the ISO PDF standard when it is approved. But the PDF format alone is not enough to ensure accurate preservation. Organizations must have appropriate policies, procedures and records management in place. It is important to know that files conform to PDF/A, so tools are needed. "It is safe to say that correctly implementing the PDF/A file format should result in reliable, predictable, and unambiguous access to the full information content of electronic documents long-term." Education and training on PDF/A are needed. "Due to the specific nature of long-term preservation of electronic documents, the field of available file formats that can be used for preservation purposes is very small." Other formats often considered are TIFF, XML, ODF, OOXML, and XPS.
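
To illustrate the point about needing tools to know whether files conform to PDF/A, here is a minimal sketch in Python. It only reports the PDF/A level a file claims in its XMP metadata (the pdfaid:part and pdfaid:conformance fields); it does not validate the file, so a full validator is still needed. The function name claimed_pdfa_level is hypothetical, and the sketch assumes the metadata is stored uncompressed in an ASCII-compatible encoding.

```python
import re

# XMP identification fields used by PDF/A. They may be written as XML
# elements (<pdfaid:part>1</pdfaid:part>) or as attributes (pdfaid:part="1").
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d+)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(data: bytes):
    """Return the claimed level, e.g. '1b', or None if no claim is found.

    Hypothetical helper: it reads the claim only and proves nothing
    about whether the file actually conforms.
    """
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conf = _CONF.search(data)
    if conf:
        level += conf.group(1).decode("ascii").lower()
    return level
```

For a file on disk, the bytes could be read with pathlib, for example claimed_pdfa_level(Path("report.pdf").read_bytes()), where report.pdf is a placeholder name.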

Significant Properties of Digital Objects. Andrew Wilson. JISC Workshop. 7 April 2008.

The fundamental challenge is to preserve the accessibility and authenticity of digital objects over time and across changing technical environments. We must accept the separation of logical information of an object from its physical environment. There are different models of digital preservation that focus on the technology, the data, the processes, or restoring objects later (digital archaeology). Authenticity comes from integrity and accuracy (no unauthorized changes), being able to trust that the item is what it is supposed to be, and the ability to use and view it later. That does not mean that it has not been changed, but that the message it was meant to communicate is unaltered. The model needs to ensure that the essence or significant properties are preserved.

Investigating the significant properties of electronic content over time. Stephen Grace. JISC Workshop. 7 April 2008.

The project looks at the properties of digital content. The framework is to catalog the significant properties of a digital object, determine the relative value of each property for the re-creation of the object, designate the level of significance, and determine the user community and restrictions. Some properties are more important than others and a judgment has to be made on the value. A numbered scale measures the significance, from essential to not important.
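
The framework described above could be recorded in a structure like the following Python sketch. The source says only that the scale is numbered and runs from essential to not important, so the 0 to 4 range, the field names, and the example properties are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed scale endpoints; the actual numbering used by the project
# is not given in the summary above.
ESSENTIAL = 4
NOT_IMPORTANT = 0


@dataclass(frozen=True)
class PropertyRecord:
    name: str            # the property being catalogued
    significance: int    # NOT_IMPORTANT (0) through ESSENTIAL (4)
    community: str       # user community the judgment was made for
    restrictions: str = ""

    def __post_init__(self):
        if not NOT_IMPORTANT <= self.significance <= ESSENTIAL:
            raise ValueError(f"significance out of range: {self.significance}")


def must_preserve(records, threshold=ESSENTIAL):
    """Return names of properties at or above the threshold, most significant first."""
    kept = [r for r in records if r.significance >= threshold]
    return [r.name for r in sorted(kept, key=lambda r: -r.significance)]


# Invented example values, not taken from the workshop.
catalog = [
    PropertyRecord("text content", 4, "researchers"),
    PropertyRecord("font face", 1, "researchers"),
    PropertyRecord("page breaks", 3, "researchers"),
]
print(must_preserve(catalog, threshold=3))  # ['text content', 'page breaks']
```

Recording the user community next to each score keeps the judgment tied to the audience it was made for, since the same property may matter more to one community than to another.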

The Significant Properties of Vector Images. David A. Duce. JISC Workshop. 7 April 2008.

They use the data-centric approach which focuses on maintaining digital objects in the current formats rather than the process-centric approach that keeps objects in their original form and attempts to emulate the original environment. The strategy is to transform the original object with related information to create a transformed source that retains the essence of the original. It is a challenge to identify the significant properties and keep them through the transformation process. We need to document why something is being preserved and why the particular methods were used. Some possible formats for these types of graphics are WebCGM (mostly engineering), SVG (an XML application with font and animation capability) and PDF/A. More research is needed.

Friday, April 18, 2008

Digital Preservation Matters - 18 April 2008

Definitions of Digital Preservation (updated link). American Library Association. April 15, 2008.
A working group within the Preservation and Reformatting Section has drafted a definition of ‘digital preservation’ to promote an understanding of digital preservation within the library community. They created a short, medium, and long version to accommodate a variety of needs. They express “the need for a declared intention to preserve, a plan for doing so, and engagement in measurable activities to realize that plan.”
Short Definition: Digital preservation combines policies, strategies and actions that ensure access to digital content over time.
Medium Definition: Digital preservation combines policies, strategies and actions to ensure access to reformatted and born digital content regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.
Long Definition: Digital preservation combines policies, strategies and actions to ensure the accurate rendering of authenticated content over time, regardless of the challenges of media failure and technological change. Digital preservation applies to both born digital and reformatted content.
Digital preservation policies document an organization’s commitment to preserve digital content for future use; specify file formats to be preserved and the level of preservation to be provided; and ensure compliance with standards and best practices for responsible stewardship of digital information.
Digital preservation strategies and actions address content creation, integrity and maintenance, which are listed in the definition.

The PREMIS editorial committee has updated the data dictionary. It is a resource for preservation metadata in digital archiving systems. Preservation metadata is defined as “the information a repository uses to support the digital preservation process” and includes administrative (including rights and permissions), technical, and structural metadata. It defines core metadata as “things that most working preservation repositories are likely to need to know in order to support digital preservation.”
PREMIS schema are available from The Library of Congress website.

The newsletter includes three items:
A report of the Section 108 copyright study group. Some highlights from that:
  • Museums should be eligible for section 108
  • A new exception should permit qualified libraries and archives to make preservation copies of at-risk published works prior to any damage or loss. Access to these “preservation-only” copies will be limited.
  • A new exception should permit libraries and archives to capture and reproduce publicly available online content for preservation purposes and to make those copies accessible to users for private study, research or scholarship.
  • Libraries and archives should be permitted to make a limited number of copies as reasonably necessary to create and maintain a single replacement or preservation copy.
The Chronopolis project is a datagrid framework being developed by the San Diego Supercomputer Center and others, for preserving content and developing best practices.
The Washington State Digital Archives is leading a multi-state government project for archiving local state government data.

Windows Life-Cycle Policy for XP. Microsoft. Updated: April 3, 2008.
Microsoft has updated the end of life-cycle information for XP license availability and support to June 30, 2008. The end date for XP Home on Ultra Low-Cost PCs is extended to June 30, 2010, or one year after the general availability of the next version of Windows. A final service pack, Windows XP SP3, is expected by the end of April.

Thursday, April 17, 2008

Digital Preservation Matters - 11 April 2008

Section 108 Study Group Releases Report. George H. Pike. Information Today. April 10, 2008.

An advisory study group has been created to make recommendations about copyright issues and the role of libraries and archives in preserving information. Section 108, part of the Copyright Act of 1976, does not adequately define archiving web content, preservation of analog and digital works, and digital copies. It currently only recognizes "published" and "unpublished" works. The study group identified a new category of "publicly disseminated" works which includes copyrighted works transmitted by broadcast, online streaming, etc. The group recommended changes to Section 108 to allow libraries and archives to make "a preservation copy of any at-risk" publicly disseminated work.

This new exemption would be limited only to non-commercial unique or rare "at-risk" works that may be lost due to an unstable or ephemeral format or medium. Only libraries and archives that have comprehensive preservation programs would be allowed to make and preserve these copies. Access to these preservation copies would be restricted and not part of a library’s general collection. Only publicly accessible content could be captured. [The full report is available here; it is a 212 page PDF.]

In Storing 1’s and 0’s, the Question Is $. John Schwartz. The New York Times. April 9, 2008.

The amount of digital material is increasing, but much of the data is ephemeral. It is very fragile; “there’s no one-size-fits-all model for preserving data in the digital age,” and the biggest problem is how to pay for it. The National Science Foundation has started a $100 million program (DataNet) to help develop methods and technologies to preserve data that make economic sense. Choices have to be made about what to keep. It is just as important to keep the right information.

Sun fixes Java SE for a fee. Gavin Clarke. Register Developer. April 7, 2008.

Sun is extending the support program for Java Standard Edition 1.4, which will officially retire this summer. The support program will require payment and will extend to 2017. Otherwise, users must upgrade to the latest edition of Java SE; free support for the software will be three years now instead of six.

Agency under fire for decision not to save federal Web content. Heather Havenstein. Computerworld. April 11, 2008.

NARA has discontinued its policy of taking a "digital snapshot" of all federal agency and congressional public Web sites at the end of congressional and presidential terms, because it believes the content is already saved by each agency as permanent records. "The fact that digital preservation is done by others outside NARA isn't an excuse for NARA to abdicate their responsibility, but an argument that they should be capable of fulfilling it." "As members of Congress and federal agencies increasingly move their work online, robust digital archiving will only become more important."

Seagate Delivers World's First 1TB Drive with SAS Interface. Press Release. April 7, 2008.

Seagate announced it is now shipping 1-terabyte enterprise-class hard drives with a Serial Attached SCSI interface. The drives include a five-year limited warranty.

Library of Congress Groans Under Data Strain. James Rogers. Byte and Switch. April 9, 2008.

The Library of Congress has to find a way of dealing with an unbelievable amount of information. The library currently has more than 500 TB of digital data, split across three data centers and many different storage technologies; most of it is online or nearline, and some is on tape. They also need help deciding which digital data needs to be preserved. “This is all about preservation and future-proofing.” They estimate the information produced every 15 minutes is equivalent to all information currently in the Library of Congress.

Friday, April 04, 2008

Digital Preservation Matters - 04 April 2008

Audio and Video Carriers: Recording Principles, Storage and Handling, Maintenance of Equipment, Format and Equipment Obsolescence. Dietrich Schuller. TAPE. February 2008.
This is an introduction for those working with sound and video collections. It outlines the history of various types of audio recordings, including CDs and DVDs, how they were made and how stable they are. It also gives an overview of the passive preservation factors, particularly environment, handling and storage. Humidity and oxidation affect the physical surfaces. Other factors are dust, pollution, light, and magnetic fields. It includes a section on the maintenance of equipment and the obsolescence of formats.

IMLS Will Sponsor Second Conservation Forum for Collecting Institutions. Jill Collins. IMLS. Press Release. March 20, 2008.
This forum, “Collaboration in the Digital Age,” is intended to help museums and libraries think strategically about digital preservation. It is to be held June 24-25 in Denver. It will cover the fundamentals of digital content creation and preservation, emphasizing practical approaches to planning digital projects, increasing access to collections, enabling digital resources to serve multiple purposes, and protecting digital investments. In 2006, online visits accounted for 310 million of the 1.2 billion adult visits to museums and 560 million of the 1.3 billion adult visits to libraries. Yet 60% of collecting institutions do not include digital preservation in their mission.

Audio Tape Digitisation Workflow: Digitisation Workflow for Analogue Open Reel Tapes. Juha Henriksson, Nadja Wallaszkovits. TAPE. March 2008.
This is a practical web-based workflow for audio tape digitization. It looks at physical factors, such as tape problems, equipment, and conversion. The standard CD sampling rate of 44.1 kHz is outdated and may be inadequate for many types of material. Currently 96 kHz is regarded as a widely accepted standard. IASA recommends a minimum sampling rate of 48 kHz, though some types of material may need 192 kHz. They also recommend a bit depth of at least 24 bits to capture analog items. Other topics are metadata, recording level, format, and archival masters. After digitization the digital file becomes the preservation format. For preservation purposes an asset register should be kept and updated, and it should also record the checksum for each file.
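
As a rough illustration of an asset register that records a checksum for each file, here is a minimal Python sketch. The CSV layout, the function names, and the choice of SHA-256 are assumptions; the workflow summarized above does not prescribe a particular algorithm or register format. The register file should be kept outside the directory of masters so that it is not listed as an asset itself.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_register(master_dir, register_path):
    """Write one CSV row (file name, size in bytes, checksum) per file."""
    rows = []
    for path in sorted(Path(master_dir).iterdir()):
        if path.is_file():
            rows.append((path.name, path.stat().st_size, sha256_of(path)))
    with open(register_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "bytes", "sha256"])
        writer.writerows(rows)
    return rows


def changed_files(master_dir, register_path):
    """Return names of registered files that are missing or no longer match."""
    with open(register_path, newline="") as handle:
        recorded = {row["file"]: row["sha256"] for row in csv.DictReader(handle)}
    changed = []
    for name, checksum in recorded.items():
        path = Path(master_dir) / name
        if not path.is_file() or sha256_of(path) != checksum:
            changed.append(name)
    return changed
```

Running build_register once after digitization and changed_files on a schedule afterwards gives a simple fixity check: an empty list means every registered master still matches its recorded checksum.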

White Paper: Representation Information Registries. Adrian Brown. PLANETS. 29 January 2008.
A report on Representation Information Registries. These are a critical component of digital preservation architecture, containing the technical knowledge necessary to support access to digital objects. “Any meaningful digital preservation activity requires some form of knowledge base regarding the technical environments necessary to support access to digital objects.” This is expressed in the OAIS model. Key reasons for the registries are: efficiency of description; knowledge sharing; sustainability. “Preservation planning encompasses all activities which identify the need to perform preservation actions, and the most appropriate actions to perform in order to meet specified objectives.”

Developing Practical Approaches to Active Preservation. Adrian Brown. National Archives, UK. June 2007.
The active preservation methodology comprises three main functions:
  1. characterization: measures the properties of digital objects needed for long-term preservation;
  2. preservation planning: determines the appropriate preservation actions to be undertaken; and
  3. preservation action: carries out the results of preservation planning by transforming the objects.
The PRONOM technical registry supports these functions and is the core of the preservation system. The preservation planning framework determines what preservation actions should be applied to which objects, and the appropriate time to apply them.