Wednesday, August 09, 2006

Long term document format portability is not important

A post by David Perry on the Freeform Comment blog, "ODF Debate: A real world view", caught my eye. It discusses the competition between the Open Document Format (ODF) and Microsoft's new Open XML format for Office 2007.

An interesting point is raised:
We must also remember that Microsoft has serious plans to build a developer community around Office 2007 so, just as with .Net, and Visual Basic, we can anticipate a growing level of support for Open XML from ISVs that is likely outstrip ODF, at least in the short to medium term. If you have an application or service that you think should be integrated with or accessible through an “office like” application, or has the ability to manipulate an office style document, should you build around Open XML and reach 90% plus of the market, or ODF and reach a minority - no-brainer really. Perhaps it ain’t fair, possible it ain’t right, but that’s the real world.

This makes sense for document editing applications. As David also says, setting yourself up for document compatibility problems is unthinkable in a business sense, when you suddenly can't read mission critical documents 25 years after their initial creation. Being able to view documents long-term is essential, and ODF v. Open XML presents a challenge to that.

I look at the problem of document standards over the lifecycle of the document:

In this lifecycle, the draft, review or 'work in progress' phase is typically short compared to the period after publishing, when, in a well-controlled organization, the document is made an official record.

ODF and Open XML apply to the document in its 'work in progress' state. PDF/A should be the published format: it provides a perfectly repeatable rendition of the document on every view, at a stage when the document no longer requires editing.

To my mind the most important task for the work in progress formats (ODF and Open XML) is enabling editing in whichever application the user chooses. That said, because the early part of the document lifecycle is fairly short, file format portability matters only within the limited set of application versions available at that time. In an ideal world Open Office should not have to provide support for a MSFT Office format version that is not current. Vice versa, MSFT Office should not have to provide support for an ODF format that is not current. By current I mean with a significant number of users authoring documents. In both cases I am only concerned with the editing of work in progress documents, which happens over a fairly short period of time and therefore with a limited set of available application versions (nobody in the real world uses MS Word prior to v6, do they?).

After publishing, my primary concern as a user is being able to read the document, exactly as published, time after time. PDF/A is the enabling format for this, supported by almost everyone. Whether this will be achieved depends a little on whether MSFT gets over its spat with Adobe and just uses PDF/A, rather than Adobe's proprietary PDF format.

This does not mean that organizations do not need the ability to edit published documents year in, year out. These types of vital documents are handled by retaining an editable version of the document alongside the published version. If the document is edited over time, the portability between tools will remain current, and changes to the standard tool used in an organization will be handled by saving to the new format on the next round of editing.


Document format portability is essential to allow organizations to select their editing application of choice, and to be sure that their partners can collaborate with them in the editing of work in progress documents. The portability of every combination of document format version across every version of the tool is not required, since editing should be over a relatively short period of time compared to the overall document lifecycle.

PDF has been adopted by almost every organization for publishing final documents, so there is no fear that they will not be able to read those documents into the future.

The ODF v. Open XML argument for long term viewing of documents is moot: do not rely on document formats designed for editing to provide long term viewing capability - use PDF/A instead.


Joshua Drake said...

At first I disagreed with you altogether, but sitting here thinking about it, we agree on one point:

Documents should be edited in the latest tools. If that is done consistently then the format does not matter.

However, in the day-to-day grind how often does that happen? I have personally made at least 3 copies of a document that I held the last existing version of, and that was for a system that was already two years out of date. So I have to question whether PDF is a great storage format. I must admit that it would have "solved" the problem I had, but so would any current electronic format.

I can see where having a malleable format may cause problems, where you end up with so many copies that the authoritative one is hard to determine. But I am not sure how publishing both the editable and PDF versions together would solve that issue.

Lastly, I would like to point out that PDF is not the most backward compatible format. There are versions of documents in PDF that I can no longer open in the latest version of Adobe. So far only ASCII text files have truly stood the test of time.

Phil Ayres said...

Joshua, thanks for the comment. Just two points of clarification from my side:

1) I wouldn't expect any organization to publish both PDF and (e.g.) ODF. The ODF is for internal use only, for editing year after year. The PDF is what your customers and partners see, is published to your WWW site, or sent through email. It's also captured and classified according to your organization's retention policies in a records management or other system of record. Any decent doc management system should be able to associate the working versions with the published versions to make this easy to handle over time.

2) Adobe PDF is proprietary, I agree, and so should be treated with care. PDF/A is a standard for archiving that (in theory, and more and more proven in practice) is going to pass the test of time. Almost every government organization in the world (with an e-strategy) has approved PDF/A for long term retention, therefore I think there is a good chance that Adobe and other viewer vendors will support it.

I'd agree that there is a risk in following this approach to the letter. But even in your example, you managed to open that two-year-out-of-date document. As long as all apps support at least one previous version of all formats (effectively 4 years for most mainstream packages) you would be in good shape.

I still think this would work in a business setting, and am ignoring the consumer setting, where users are slower to upgrade - hi mum & dad! ;).

Publishing to PDF before distributing documents is good practice for any organization due to its largely read-only nature. Retaining that PDF means that you have a record for legal and compliance purposes long into the future (working copies of documents should in fact be destroyed periodically anyway). The single working copy that forks from the published copy is the ODF or whatever that is updated again to eventually produce the next edition of your published doc (another PDF for retention). And so the cycle continues, with the ODF always being saved in a reasonably up to date format.

Even if you then move from Open Office to MS Office 2010 (haha), hopefully Office 2010 will have the ability to read the previous Open Office version and you can continue to update your document. And the PDF record that is published to the world never has to be touched!

I still think this is workable. If I have failed to convince you, is there any chance you could highlight a business use case, following best practices, where this wouldn't work?

Thanks for the discussion!