Wednesday, December 14, 2016

PDF/A as a preferred, sustainable format for spreadsheets?

PDF/A as a preferred, sustainable format for spreadsheets?  Johan van der Knijff. johan's Blog. 9 Dec 2016.
     The National Archives of the Netherlands published a report on preferred file formats, with an overview of their ‘preferred’ and ‘acceptable’ formats for 9 categories. The blog post concerns the ‘spreadsheet’ category, for which the report lists the following ‘preferred’ and ‘acceptable’ formats:
  • Preferred:  ODS, CSV, PDF/A     
  • Acceptable: XLS, XLSX
And the justification / explanation for using PDF:
PDF/A – PDF/A is a widely used open standard and a NEN/ISO standard (ISO:19005). PDF/A-1 and PDF/A-2 are part of the ‘act or explain’ list. Note: some (interactive) functionality will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A
There are some problems with the choice of PDF/A and its justification:
  • Displayed precision not equal to stored precision
  • Loss of precision after exporting to PDF/A
    • Also loss of precision after exporting to CSV
    • Use of cell formatting to display more precise data is possible but less than ideal
  • Interactive content
  • Reading PDF/A spreadsheets: This may be difficult without knowing the intended users, the target software, the context, or how the user intends to use the spreadsheet. 
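The precision problem in the list above can be illustrated with a minimal sketch. It assumes an export that writes each cell as it is displayed rather than as it is stored; the value, the two-decimal display format, and the function names are made up for illustration and are not taken from the report or from any spreadsheet application.

```python
import csv
import io

# Hypothetical cell: the stored value has more digits than the display shows.
stored_value = 1.23456789
display_format = "{:.2f}"  # assumed cell format: two decimal places


def export_displayed(values, fmt):
    """Write values to CSV text as they would appear on screen."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([fmt.format(v) for v in values])
    return buffer.getvalue()


def read_back(csv_text):
    """Parse the first CSV row back into floats."""
    reader = csv.reader(io.StringIO(csv_text))
    return [float(cell) for cell in next(reader)]


csv_text = export_displayed([stored_value], display_format)
recovered = read_back(csv_text)[0]

print(csv_text.strip())           # the displayed value, not the stored one
print(recovered == stored_value)  # False: the extra digits are gone
```

Under this assumption the same loss applies to any export that captures only the rendered value, which is the concern raised for both PDF/A and CSV.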
The justification states that some interactive functionality "will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A." However, deciding what functionality is ‘essential’ depends on the context and the intended user base. In addition, the emphasis on interactivity may imply that any spreadsheet that does not involve interaction with a user can be safely converted to PDF/A. It may be better to make a distinction between ‘static’ and ‘dynamic’ spreadsheets.

There may be situations where PDF/A is a good choice, or maybe even the best one, but choosing a preferred format should "take into account the purpose for which a spreadsheet was created, its content, its intended use and the intended (future) user(s)."


Thorsted said...

We chose not to use PDF/A as a preferred migration of spreadsheets because of two factors.
PDFs need a media size, which will clip content.
PDFs won't maintain any formulas or other advanced cell formatting.

Pepijn Lucker (National Archives of The Netherlands) said...

First of all, thanks to Johan for the interest taken in our preferred formats document and the feedback given in this blog. All of the arguments he makes about loss of precision are correct. That’s why we selected ODS as the preferred format and XLS and XLSX as acceptable formats. Incidentally, the finding that CSV has precision issues was new to us and is certain to be part of our evaluation of our preferred formats document.

In this document we provide 2 lists: preferred and acceptable formats for different information categories (not only spreadsheets, but also text documents, image files etc). A format is preferred when it’s an open standard as defined by the so-called Forum Standaardisatie, a government agency dedicated to promoting usage of open standards in the Dutch central administration. Acceptable means it’s not (fully) open and documented, but we already have the experience and the strategies in place to ensure long-term archiving.

The reason we included PDF/A – one of the formats included on the Forum Standaardisatie list – as one of our preferred formats is that in our experience spreadsheets are not only used for, say, complicated calculations but also to store plain text in a table format. In these kinds of spreadsheets, the individual cells do not interact with one another and what you see on your screen is all there is to it. In these cases – exclusively – we feel that PDF/A is an appropriate format for long-term archiving. That’s what is meant by the phrase “Note: some (interactive) functionality will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A”. We call it interactive, Johan calls this dynamic, but I think we mean the same thing. To prevent misunderstandings we’ll probably explain in more detail what is meant by ‘interactive’ in future versions.

I had a chance yesterday to talk to Johan and we both concluded that spreadsheets are tricky. Feedback like his is indeed appreciated and will be used to improve and update our preferred formats document in future versions.

Chris Erickson said...

Thanks for the comments. This information helps all of us with an involved topic.