Saturday, February 27, 2016

Back in a Flash

Back in a Flash. Edith Halvarsson. Open Preservation Foundation, ehalvarsson's Blog. 27 Jan 2016.
     Flashback is a proof-of-concept project run by the British Library’s Digital Preservation Team that examines emulation and migration solutions as methods for preserving the content on CDs, DVDs, and 3.5” and 5.25” floppy disks. The team acquired original hardware for their legacy lab to analyze and deal with content from those formats, and they have found that the old hardware itself can have problems. The first step is a capture process which extracts data from the storage media, characterizes its physical components, and lists the files on the media. The content can then be placed in a controlled environment that ensures the bits are retained regardless of deteriorating storage media. The technical information about the content is important for preservation planning.

For less complex content, such as text, the solution is to migrate files from old or obsolete formats to more contemporary and reliable formats. The large majority of the content, though, is so "tightly bound up with its original environment that it cannot be migrated", which is the case for software. For these, the option is to emulate the item’s original hardware and software environment, using emulation environments supplied by the University of Freiburg via bwFLA – Emulation as a Service. Flashback is gathering data about the performance and viability of emulation, comparing characteristics of the software on original hardware and on emulators.

Friday, February 26, 2016

Having FITS Over Digital Preservation?

Having FITS Over Digital Preservation? Jeffrey Erickson. NDSR Boston. February 11, 2016.
     FITS (File Information Tool Set) is an open source digital preservation tool designed to identify and validate a wide assortment of file formats, determine technical characteristics, and extract embedded metadata. The technical metadata from FITS is output as XML and can be converted into standard metadata schemas. Digital preservation repositories contain a growing number of file formats. "Proper identification of a file’s format and the extraction of embedded technical metadata are key aspects of preserving digital objects. Proper identification helps determine how digital objects will be managed and extracting embedded technical metadata provides information that future repository staff or users need to render, transform, access and use the digital objects." The current version of FITS bundles many other tools together and makes them all easier to use; these include Droid, ExifTool, ffident, Jhove, MediaInfo (for video files), and the New Zealand Metadata Extractor Tool. Using multiple tools can help verify the file information.

FITS consolidates and normalizes the output, providing a homogenized data set that is easier to interpret. The output can be inserted into other files, such as METS files, that provide digital preservation documentation about the file. FITS can assist with quality control, improving metadata, and format migration. FITS sites: fitstool.org and GitHub.
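
As a minimal sketch of how FITS might be used in a workflow (assuming the fits.sh command-line script is installed and on the PATH; the -i/-o flags and the XML element names below reflect typical FITS output and should be checked against the installed version):

    import subprocess
    import xml.etree.ElementTree as ET
    from pathlib import Path

    FITS_NS = "{http://hul.harvard.edu/ois/xml/ns/fits/fits_output}"

    def characterize(path: Path, out_dir: Path) -> str:
        """Run FITS on one file and return the consolidated format name."""
        out_xml = out_dir / (path.name + ".fits.xml")
        subprocess.run(["fits.sh", "-i", str(path), "-o", str(out_xml)], check=True)
        root = ET.parse(out_xml).getroot()
        identity = root.find(f"{FITS_NS}identification/{FITS_NS}identity")
        return identity.get("format") if identity is not None else "unknown"

    out = Path("fits-output")
    out.mkdir(exist_ok=True)
    for f in sorted(Path("ingest").iterdir()):
        if f.is_file():
            print(f.name, "->", characterize(f, out))

The XML written to the output directory could then be transformed or embedded in METS records as described above.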

Thursday, February 25, 2016

Where in the Org Chart is Digital Preservation?

Where in the Org Chart is Digital Preservation? Shannon Virginia Zachary. Bits and Pieces blog. January 21, 2016.
     Where on the organizational hierarchy does digital preservation fit in a research library: IT, collections management, or preservation? Often it is bundled with other preservation responsibilities, digital creation, curation, and delivery tasks. Early on, "digital activities were often assigned to a specific person, probably in a pilot project", then collected into a single department. As the programs grew, "policies and practices for preservation were developed in silos—or not at all in the intense focus on creation and delivery." For the library, there were questions about creating a Digital Preservation Position; they realized the need for a position with responsibility for the development and management of preservation policies. Questions that need to be answered are: where should the position reside; is the job a management or a technical job; and is it a digital library responsibility or a library-wide responsibility? Discussions with stakeholders made it sound like it belongs in preservation; it needs to work with others throughout the library, and policies and technical solutions need to be developed to make preservation happen.
  • Above all, preservation requires communication and partnership between those with specialized technical knowledge and those with collection knowledge and responsibility: how the collection grows, how it is used, what materials are likely to be wanted ten years from now or fifty or a hundred. 
  • The preservation specialist needs to be able to talk management with managers and technicalities with technical experts.
Where digital preservation work is located on the org chart doesn’t matter as long as the "library recognizes that preservation needs to happen and that responsibility for preservation must be assigned somewhere". Recognizing the similarities between digital collections and what libraries have always done, however, will provide a strong foundation for sustainability.

Tuesday, February 23, 2016

Preserving Social Media

New Technology Watch report: Preserving Social Media. Sara Day Thomson. Digital Preservation Coalition and Charles Beagrie Ltd. 16 Feb 2016. [PDF]
     This report looks at the issues related to preserving social media. Institutions collecting this type of media need new approaches and methods. The report looks at "preserving social media for long-term access by presenting practical solutions for harvesting and managing the data generated by the interactions of users on web-based networking platforms such as Facebook or Twitter." It does not consider blogs. "Helen Hockx-Yu defines social media as: ‘the collective name given to Internet-based or mobile applications which allow users to form online networks or communities’."

Web 1.0 media can be harvested by web crawlers such as Heritrix; Web 2.0 content, like social media platforms, is more effectively archived through APIs. This is often an extension of an institution's web archiving. Transparency and openness will be important when archiving content. APIs allow developers to call raw data, content and metadata directly from the platform, all transferred together in formats like JSON or XML.
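
A minimal sketch of that API-based approach, keeping the returned JSON (content and metadata together) as the archived object; the endpoint, parameters, and token below are hypothetical placeholders rather than any real platform's API:

    import json
    import requests

    API_URL = "https://api.example-platform.com/v1/posts"  # hypothetical endpoint
    TOKEN = "YOUR-ACCESS-TOKEN"                             # hypothetical credential

    def harvest(query: str, outfile: str) -> None:
        """Fetch matching posts and store the unmodified JSON response."""
        resp = requests.get(
            API_URL,
            params={"q": query, "count": 100},
            headers={"Authorization": "Bearer " + TOKEN},
            timeout=30,
        )
        resp.raise_for_status()
        with open(outfile, "w", encoding="utf-8") as f:
            json.dump(resp.json(), f, ensure_ascii=False, indent=2)

    harvest("digital preservation", "harvest-2016-02-23.json")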

Maintaining long-term access to social media data faces a number of challenges, such as working with user-generated content, continued access to social media data, privacy issues, copyright infringement issues, and having a way to maintain the linked, interactive nature of most social media platforms. There is also "the challenge of maintaining the meaning of the social media over time, which means ensuring that an archive contains enough metadata to provide meaningful context."  There are also third-party services and self-archiving services available.

Social media is vulnerable to potential loss. The report quotes one study which looked at "the lifespan of resources shared on social media and found that ‘after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day’."

Some other quotes:
  • Overall, the capture and preservation of social media data requires adequate context.
  • Capturing data, metadata, and documentation may not provide enough context to convey user experiences with these platforms and technologies.
  • When considering the big picture, however, the preservation of social media may best be undertaken by a large, centralized provider, or a few large centralized providers, rather than linking smaller datasets or collections from many different institutions.

Monday, February 22, 2016

Filling the Digital Preservation Gap

Filling the Digital Preservation Gap. A Jisc Research Data Spring project; Phase Two report. Jenny Mitcham, et al. February 2016. [PDF]
     The report is a collaboration between the universities of Hull and York. During phase 2 of the project they worked closely with Artefactual Systems to enhance Archivematica so that it works better as part of a wider system for managing and preserving research data. The report also includes implementation plans for establishing Archivematica at each institution. (Phase 1 of the project (Filling the Digital Preservation Gap. A Jisc Research Data Spring project. Phase One report. Jenny Mitcham, et al. July 2015) looked at digital preservation as part of a wider infrastructure for research data management.)

Some items of interest from the report:
  • In order to manage research data effectively for the long term we need to consider how we incorporate digital preservation functionality into our Research Data Management workflows.
  • Research datasets can be large, of mixed formats, and their value may not be fully understood.
  • Creating access copies may be unnecessary as some datasets will never be requested for reuse.
  • One of the potential bottlenecks in the current Archivematica pipeline is checksum generation.
  • Populating PRONOM is not a one-off exercise. We need to find ways to continue to engage and submit samples in order that new file signatures can be created as the need arises.

The report also included a glossary and an appendix discussing "Hydra in Hull Preservation workflows".

Saturday, February 20, 2016

Avoid Jitter! Measuring the Performance of Audio Analog-to-Digital Converters

Avoid Jitter! Measuring the Performance of Audio Analog-to-Digital Converters. Carl Fleischhauer; Erin Engle. The Signal. February 19, 2016.    
     An article on audio recordings and their conversion to digital files. There are references to a revision of a FADGI guideline (Audio Analog-to-Digital Converter Performance Specification and Test Method, Feb 2016) and an accompanying explanatory report (ADC Performance Testing Report on Project Development During 2015, Feb 2016). Audio tapes deteriorate over time and some institutions are digitizing them in order to save the recorded content.

Most preservation specialists recommend that the waveform be sampled 96,000 times per second (96 kHz), which will capture the “horizontal” frequency of the wave movement. To capture the amplitude, the “vertical” movement, each sample should be 24 bits long. Doing this well can be a challenge due to the many factors involved, which is why the FADGI guideline specifies performance tests for the converters.
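
As a rough illustration (my arithmetic, not the article's) of what 96 kHz / 24-bit capture implies for file sizes:

    SAMPLE_RATE = 96_000   # samples per second
    BIT_DEPTH = 24         # bits per sample
    CHANNELS = 2           # stereo

    bytes_per_second = SAMPLE_RATE * BIT_DEPTH * CHANNELS // 8
    print(f"{bytes_per_second:,} bytes per second")            # 576,000
    print(f"{bytes_per_second * 3600 / 1e9:.2f} GB per hour")  # about 2.07 GB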


Friday, February 19, 2016

When Your Floppies Flop, Make Them Digital At DCPL's Memory Lab

When Your Floppies Flop, Make Them Digital At DCPL's Memory Lab. Rachel Kurzius. DCist. Feb 18, 2016.
     A new resource in the Digital Commons at the Martin Luther King Jr. Memorial Library gives people the technology and know-how to digitize their personal collections. Changing hardware and software make older media, such as floppy disks, audio cassettes, and VHS tapes, difficult to use. The Memory Lab will be developing a curriculum that teaches people how to preserve their digital history. "We want to have some slice of the culture that's happening now for the future. As an archivist, we've been really concerned that people's personal collections are changing and are no longer just physical." They believe it is "the only project of its kind in the country with a specific eye on stewardship and preservation." The preservation actions are "mainly about how to name files, add descriptions, store them in the best possible format and location, and how to be an informed consumer."

Wednesday, February 17, 2016

Eternal 5D data storage could record the history of humankind

Eternal 5D data storage could record the history of humankind. Press Release. Optoelectronics Research Centre, University of Southampton. February 16, 2016.
     Scientists have developed recording and retrieval processes for five dimensional (5D) digital data storage using femtosecond lasers, which may be capable of storing digital data for billions of years on nanostructured glass. "The storage allows unprecedented properties including 360 TB/disc data capacity, thermal stability up to 1,000°C and virtually unlimited lifetime at room temperature (13.8 billion years at 190°C)". This encoding on the ‘Superman memory crystal’ is in five dimensions: the size and orientation in addition to the three dimensional position of the nanostructures.


Tuesday, February 16, 2016

A Digital ‘Library of Alexandria’

A Digital ‘Library of Alexandria’. Katie McNally. UVAToday. February 10, 2016.
     "Scholars often lament the knowledge that might have been preserved if the great Library of Alexandria had been better protected."  Digital collections face a similar threat of "steady extinction" because of technological obsolescence. One safeguard is the Academic Preservation Trust (APTrust) which is "a large-scale solution that preserves digital scholarship by storing it across multiple technologies and physical locations." The primary goal is to package and preserve information in a way so it will be accessible to future generations.  Besides proper description, “Deep dark preservation” refers to all the pieces needed to "effectively archive a digital file and the technology it runs on for future use."  So the digital preservation is really a phased thing: address those items which can be done quickly, then work on the more difficult problems, such as finding ways to preserve the software that makes digital objects accessible. Emulation environments are being worked on to keep old software running. 

APTrust stores files in two separate Amazon data centers, one in Virginia and one in Oregon, and each of these uses different technologies to store the data, to help protect against the "failure of future and modern technologies.” [APTrust is also one of the DPN nodes.]

Monday, February 15, 2016

Presentations From LC/NARA Symposium on Archiving Email Made Available Online

Presentations From LC/NARA Symposium on Archiving Email Made Available Online. Gary Price. Library Journal. February 2, 2016. [Video and text].
      Videos of six presentations from the “Archiving Email Symposium” held at the Library of Congress in June 2015 are now available online. This was a symposium of federal agencies, academic and research libraries, technologists, curators, archivists, and records managers who are directly working on collecting and preserving email archives, held in order to discuss challenges and solutions. The videos are embedded in the article and links point to transcripts.

Archiving Email: Welcome & Introduction  [text]
  • Mark Sweeney (Library of Congress)
  • Paul Wester (National Archives and Records Administration)

Institutional Approaches to Archiving Email  [text]
  • Ricc Ferrante (Smithsonian Institution Archives)
  • Lynda Schmitz Fuhrig (Smithsonian Institution Archives)
  • Jaime Schumacher (Northern Illinois University and Digital POWRR Project)

Challenges of Email as a Record  [text]  
  • Lisa Haralampus (U.S. Office of the Chief Records Officer)
  • Deborah Armentrout (Nuclear Regulatory Commission)
  • Jeanette Plante (U.S. Department of Justice)
  • Edwin McCeney (U.S. Department of the Interior)

Policy Development for Archiving Email  [text]
  • David Kirsch (University of Maryland)
  • Anthony Cocciolo (Pratt Institute School of Information and Library Science)
  • Kenneth Hawkins (National Archives and Records Administration)
  • Kathleen O’Neill (Library of Congress)
  • Margaret McAleer (Library of Congress)
  • Christopher Hartten (Library of Congress)

Practical Approaches to Processing Email  [text]  
  • Roger Christman (Library of Virginia)
  • Aprille Cooke McKay (University of Michigan)
  • Dorothy Waugh (Emory University)

Archiving Email: Closing Summary   [text]
  • Chris Prom (University of Illinois Urbana-Champaign)

Thursday, February 11, 2016

To ZIP or not to ZIP, that is the (web archiving) question

To ZIP or not to ZIP, that is the (web archiving) question. Kristinn Sigurðsson. Kris's blog. January 28, 2016.
     This post looks at the question: Do you use uncompressed (W)ARC files? Many files on the Internet are already compressed and there is "little additional benefit gained from compressing these files again (it may even increase the size very slightly)."  For other files, such as text, tremendous storage savings can be realized using compression, usually about 60% of the uncompressed size. Compression has an effect on disk or network access and on memory. But "the additional overhead of compressing a file, as it is written to disk, is trivial."

On the access side, the bottleneck is disk access but "compression can actually help!" It can save time and money and performance is barely affected. One exception may be HTTP Range Requests, which when accessing a WARC record would have to decompress the entire payload until the requested item is found. A hybrid approach may be the best solution: "compress everything except files whose content type indicates an already compressed format."  This would also avoid a lot of unneeded compression and decompression.
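
A minimal sketch of that hybrid rule: gzip-compress a record payload only when its content type does not already indicate a compressed format (the content-type list is illustrative, not taken from the post):

    import gzip

    ALREADY_COMPRESSED = {
        "image/jpeg", "image/png", "image/gif",
        "video/mp4", "audio/mpeg",
        "application/zip", "application/gzip",
    }

    def store_payload(payload: bytes, content_type: str) -> bytes:
        """Return the bytes to write out for this (W)ARC record."""
        if content_type.lower() in ALREADY_COMPRESSED:
            return payload             # recompressing gains little and may even grow the record
        return gzip.compress(payload)  # text and similar formats shrink substantially

    html = b"<html>" + b"hello world " * 1000 + b"</html>"
    print(len(html), "->", len(store_payload(html, "text/html")), "bytes")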


Wednesday, February 10, 2016

“High-res audio”

“High-res audio”. Gary McGath. Mad File Format Science Blog. February 8, 2016.
     High-res audio, sound digitized at 192,000 samples per second, is not necessarily better than the usual 44,100. We can hear sounds only in a certain frequency range, generally 20 to 20,000 Hertz.

"The sampling rate of a digital recording determines the highest audio frequency it can capture. To be exact, it needs to be twice the highest audio frequency it records." Delivering playback audio at a higher rate offers no benefit and may introduce problems. Another important part of audio is the number of bits per sample, usually 16 bits, but higher-res audio often offers 24 bits. This "isn’t likely to cause any problems", but it doesn't necessarily provide a benefit.   A bigger problem is "over-compressed and otherwise badly processed files".  It is important to not skimp on quality.

Monday, February 08, 2016

Keep Your Data Safe

Love Your Data Week: Keep Your Data Safe. Bits and Pieces.  Scott A Martin. February 8, 2016.
     The post reflects on a 2013 survey of 360 respondents:
  • 14.2% indicated that a data loss had forced them to re-collect data for a project.  
  • 17.2% indicated that they had lost a file and could not re-collect the data.
If this is indicative of the total population of academic researchers, then a lot of research time and money is being lost to lost data. Some simple guidelines, if included in your own research workflow, can greatly reduce the chances of catastrophic loss:
  1. Follow the 3-2-1 rule for backing up your data: store at least 3 copies of each file (1 working copy and 2 backups), on at least 2 different storage media, with at least 1 copy offsite
  2. Perform regular backups
  3. Test your backups periodically (see the sketch after this list for one way to do this)
  4. Consider encrypting your backups.  Just make sure that you’ve got a spare copy of your encryption password stored in a secure location!  
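
A minimal sketch of step 3, assuming a mirrored directory layout and placeholder paths:

    import filecmp
    from pathlib import Path

    def verify_backup(working_dir: Path, backup_dir: Path) -> list:
        """Return files whose backup copy is missing or no longer byte-identical."""
        problems = []
        for original in working_dir.rglob("*"):
            if original.is_file():
                copy = backup_dir / original.relative_to(working_dir)
                if not copy.exists() or not filecmp.cmp(original, copy, shallow=False):
                    problems.append(original)
        return problems

    print(verify_backup(Path("research-data"), Path("/mnt/backup/research-data")))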

New digital preservation solution from Arkivum

New digital preservation solution from Arkivum, shaped to grow with your data. Nik Stanbridge. Arkivum Press release. January 21, 2016.
     Arkivum is launching a new cloud-based digital preservation and archiving service with Artefactual Systems Inc. of Vancouver. "Arkivum/Perpetua is a cost-effective, comprehensive, fully hosted and managed digital preservation and public access solution that uses Archivematica and AtoM (Access to Memory) services in the cloud."

In a survey of archivists and data curators, 87% said "file format preservation and data integrity were important elements to their digital preservation workflow. And a third of respondents stated that they would be using a cloud-based solution for their digital preservation data."

Saturday, February 06, 2016

MRF for large images

MRF for large images. Gary McGath. Mad File Format Science Blog. January 21, 2016.
NASA, Esri speed delivery of cloud-based imagery data. Patrick Marshall. GCN. Jan 20, 2016.
     NASA and Esri are releasing to the public a jointly developed raster file format and a compression algorithm designed to deliver large volumes of image data from cloud storage. The format, called MRF (Meta Raster Format), together with a patented compression algorithm called LERC, can deliver online images ten to fifteen times faster than JPEG2000. The MRF format breaks files into three parts which can be cached separately. The metadata files can be stored locally so users can "examine data on file contents and download the data-heavy portions only when needed". This helps minimize the number of files that are transferred. The compression gives users faster performance and lower storage requirements, and they estimate the cloud storage costs would be about one-third those of traditional file-based enterprise storage. An implementation of MRF from NASA is available on GitHub, as is an implementation of LERC from Esri.

Friday, February 05, 2016

Developing a Born-Digital Preservation Workflow

Developing a Born-Digital Preservation Workflow. Jack Kearney, Bill Donovan. April 8, 2014.
     Presentation that looks at developing a systematic approach to preserving born-digital collections. The example from Boston College is the Mary O’Hara papers. This was an opportunity for a collaborative project involving the Digital Libraries, Archives, and the Irish Music Center.
Important elements of the workflow:
  • Chain of Custody
  • Digital Forensics
  • Computed initial checksums
  • File/folder names
  • Local Archival Copies
  • Distributed Digital Preservation
“Digital forensics focuses on the use of hardware and software tools to collect, analyze, interpret, and present information from digital sources, and ensuring that the collected information has not been altered in the process.” The presentation gives some specific steps and procedures for ensuring the information is not altered, including multiple copies, write blockers, and the like. When working with external drives, they would build an inventory with this Unix command:
     find directory-name -type f -exec ls -l {} \; > c:\data\MOH\inventory.txt
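
A rough Python sketch of the same inventory step (mine, not from the presentation), extended with the initial checksums the workflow calls for; the source and output paths are placeholders:

    import csv
    import hashlib
    from pathlib import Path

    def md5(path: Path) -> str:
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_inventory(source: Path, outfile: Path) -> None:
        """Record path, size, and checksum for every file found on the drive."""
        with open(outfile, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["path", "size_bytes", "md5"])
            for p in sorted(source.rglob("*")):
                if p.is_file():
                    writer.writerow([str(p), p.stat().st_size, md5(p)])

    build_inventory(Path("/media/external-drive"), Path("inventory.csv"))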

Local conventions regarding naming files and folders:
  • Use English alphabet and numbers 0 - 9
  • Avoid punctuation marks other than underscores or hyphens.
  • Do not use spaces.
  • Limit file/folder names to 31 characters, including the 3-digit extension. Prefer shorter names.
  • Decision: They may remediate folder and file names, but only for the working copies.
They also look for files that need actions taken:
  • Any files off-limits or expendable? (e.g., system files)
  • Personally Identifiable Information (PII)
  • Unsupported Formats (Can normalize using Xena)
  • They also use a variety of tools, such as FITS and JHove.
Important to keep track of digital preservation actions:
  • File migrations
  • Obsolete file formats
  • Proprietary file formats
  • Metadata changes

Wednesday, February 03, 2016

Policy Planning from iPres

Policy Planning from iPres. Alice Prael. Blog. November 5, 2015.
     Report on the Policy and Practice Documentation Clinic at iPres organized by Maureen Pennock and Nancy McGovern. SCAPE has a collection of published preservation policies that she is using to create a policy framework. The Policy and Practice Clinic showed the importance of taking the time to create a policy and not trying to do it all at once. Some notes from the post:
  • Create a Digital Preservation Principles document
  • Include key stakeholders when working with the principles document, early and often
  • Write what digital preservation actions are happening now
  • Start writing a Digital Preservation Plan. Nancy McGovern: A policy is ‘what we do’ and a plan is ‘what we will do.’
  • Create Procedure Documents to show how to follow the principles
  • Have the key stakeholders decide if the procedures are realistic
Some additional notes on the Policy and Practice Clinic:
  • If your institution doesn’t like the word ‘preservation’ then use ‘long term access.’ 
  • Do what is needed to get the buy-in from the stakeholders. 
  • Make the technology enforce the digital preservation policy for you.
  • People are much more likely to perform these preservation tasks if the system doesn’t give them a choice.

Tuesday, February 02, 2016

MDISC Archive Service

MDISC Archive Service. Website. Millenniata. January 29, 2016.
    The MDISC Archival Beta tool is now available. It automatically preserves all your photos and videos, past, present, and future, by engraving them on M-Discs. The service archives files held in a designated internet service, currently Google Photos, and writes them to M-Discs, which are then delivered to the owner. [This is a service that I have been trying out for a few months - Chris.]

The free limited-time beta service gives you three months to try out unlimited file archiving on M-Discs. "Ultimate peace-of-mind comes when you can hold your data in your hands.  Your photos can't be lost, corrupted, hacked, or erased, and they'll last forever."

Included on the website is a video that shows people opening digital files they had saved for 8 years or more. One-third of Americans have lost photos and video and don't even know it yet. The automated archival service makes it easier to archive the files.

Monday, February 01, 2016

Preserving and Emulating Digital Art Objects

Preserving and Emulating Digital Art Objects. Oya Rieger, et al. National Endowment for the Humanities White Paper. November 2015, posted December 11, 2015. 202pp. [PDF]
     This white paper describes the media archiving project's findings, discoveries, and challenges. The goal is the creation of a preservation and access practice as well as sustainable, realistic, and cost-efficient service frameworks and policies. The project was looking at new media art but it should also help inform other types of complex born-digital collections. It aims to develop scalable technical frameworks and associated tools to facilitate enduring access to complex, born-digital media objects.

Interactive digital assets are much more complex to preserve and manage than regular digital media files. A single interactive work can include a range of digital objects, dependencies, different types and formats, applications and operating systems.  The artwork can consist of "sound recordings, digital paintings, short video clips, densely layered audiovisual essays that the user navigates and explores with the clicks and movements of a computer mouse. Expansive and complex, the artwork may include many sections, each with its own distinct aesthetic, expressed through rich sound and video quality and intuitive but non-standard modes of interactivity." The interactive and technological nature of these assets poses serious challenges to digital media collections.

About 70 percent of the project artworks could not be accessed at all without using legacy hardware. The project team realized that operating system emulation could be a viable access strategy for those complex digital media holdings.

Project Goals
  1. Identify significant properties needed to preserve and provide access to new media objects.
  2. Define a metadata framework to support capture of technical and descriptive information for preservation and reuse.
  3. Create SIPs that can be ingested into a preservation repository.
  4. Explore resource requirements, staff skills, equipment needs, and associated costs.
  5. Help understand “preservation viability” for complex digital assets
The project team analyzed content to determine classes of material, and set up a digital forensics workstation using BitCurator and the AVPreserve Fixity tool to monitor the stability of directories. The final metadata structure consisted of a combination of MARCXML for the descriptive metadata, Digital Forensics XML (DFXML) for the technical metadata, PREMIS XML for the preservation metadata, and unstructured descriptive files.

"Emulation seems an excellent and flexible approach to providing fully interactive access to obsolete artworks at very reasonable quality." However there are issues with using emulation as an archival access strategy:
  • emulators must be preserved as well as artworks.
  • creating archival identities for emulators is difficult and documentation tends to be inconsistent.
  • emulators will eventually become obsolete with new operating systems 
  • new emulators must be created
  • no emulator can provide a fully “authentic” rendering of a software-based artwork.
"The key to digital media preservation is variability, not fixity." It is important to find ways to capture the experience so that future generations can see how the digital artworks were created, experienced, and interpreted.

Artists have increasing access to tools for creating complex art exhibits and objects, but it is "nearly impossible to preserve these works through generations of technology and context changes." Digital curation is more important than ever. Access is the keystone of preservation. The appendices include Emulation Documentation, the Pre-Ingest Work Plan, and Artwork Classifications:
  • Structure of the classifications 
  • Browser-Based Works 
  • Virtual Reality Components 
  • Executables in Works 
  • Macromedia and Related Executables
  • HFS File System