Digital formats for long-term preservation
- Posted: May 22, 2008 at 5:42AM
Caroline Arms is an information technologist who came to the Library of Congress to work on the American Memory project. The challenge of preserving digital content captured her interest, and her work since has focused on understanding and promoting formats that raise the probability that content will be usefully available to future generations. She is the co-compiler, with Carl Fleischhauer, of the Digital Formats website, and a member of the committee to standardize Office Open XML.
JU: I'm interested in your perspective on XML's role in the preservation of documents for the long term.
CA: I'd like to be able to go broader than XML. It's one aspect, but it's not the only answer. When we're talking about the challenge of preserving digital content we usually think more broadly.
JU: Great point. Of course there's a whole range of issues, from how you keep the disks spinning to... well, let's step back and talk about acid-free paper, which may be a more durable format than anything we've done electronically.
JU: So, OK, give us the broad view of how you have approached this problem at the Library of Congress.
CA: The Library's mission is to make its resources useful and available to Congress and to the American people, and to sustain and preserve a universal collection of knowledge and creativity for future generations.
Congress funded the National Digital Information Infrastructure and Preservation Program (NDIIPP), and I've been working as part of that since the early 2000s.
The program looks for every opportunity to raise the probability that content created today will be usable by those future generations.
I first came to the library to work on American Memory, which was digitizing out-of-copyright materials and making them available to everybody.
JU: Of course that project isn't just a resource for future generations...
CA: Right. So, there are many ways to think about raising that probability. The program is trying to build a network of organizations committed to the stewardship of digital content. Not just traditional libraries and archives, but certainly including them.
You mentioned spinning disks. We try to have conversations with storage vendors, and try to explain how we see the requirements for long-term cultural archives as being a little different from those for business continuity.
You also mentioned acid-free paper. In the book age, we can take in a book, make sure it's on acid-free paper, and it will still be there a hundred years from now. The phrase "benign neglect" gets used. Paper survives benign neglect. Digital content doesn't.
JU: It's a paradox. Recently I visited my parents, and we found a box of correspondence they had written from a yearlong trip to India many years ago. I realized that my own correspondence is probably a lot less likely to be available to my kids or grandkids.
CA: I have exactly the same experience. My father was away at the battlefront in World War II, and he wrote as frequently as he could. My mother still has all those letters. Today's forces are using email and cellphones and other ephemeral means of keeping in touch.
It's amazing to read the letters discussing what my name would be, because I was on the way.
JU: Of course once that box of letters is lost, it's lost. There is no backup, there are no perfect copies. It's a paradox that we're in an era when you can make perfect copies, and distribute them as widely as you want, so you'd think that superabundance would save the day, but that's not necessarily true.
CA: No. You have to act at the time of creation in order to up the probability. This is true for your own digital photographs, and for libraries. So we try to influence the early stages of content creation.
CA: Working on standardization efforts is one way. Another is to form partnerships that try to exploit synergies with content creators. We just look for opportunities in different industries.
For example, the scholarly publications industry has an interest in preserving their own content, and they also want it to be accessible through libraries, so we find synergies there.
JU: Of all the businesses I know, that one is most sophisticated in its thinking, and in its efforts toward long-term preservation. Those folks really get it, and have done a lot of good work to enable a level of fidelity and persistence that is unheard of elsewhere.
CA: Another community working toward this is the professional photographers. These are mainly individuals, not corporations, but they're realizing that for their own business purposes they need to have good practices. And the practices that are good for them are pretty much aligned with the practices that we believe will be helpful.
JU: What are some of those practices, and how do you interact with that group to help foster them?
CA: In the NDIIPP program, we've had some money we've been able to give out as awards. A couple of recent awards are to associations of photographers, and they're all to do with exploring what the good practices should be, and promoting them. So, discussion of formats, and in particular for photographs, the capturing and recording of metadata. Understanding what the tools that photographers use do, or don't do, about accumulating and retaining metadata.
Wise choice of format is important, but we don't think there's a single best format. In thinking about formats, the two key factors are disclosure -- that is, are specifications available -- and adoption. The more widely used a format is, the less likely that archival institutions will have to foot the bill for migrating it, or maintaining tools to render it.
We are interested in understanding the formats that are widely used, and promoting practices that will use those formats in good ways.
JU: Does this boil down to recommendations that the Library of Congress has made to photographers?
CA: No, it's working with photographers to find the synergies between the requirements, have the photographers promote the best practices, and perhaps to suggest what will be even better for us. But we don't have that much influence over what formats creators and publishers use. We have to learn to be able to handle the most widely used formats.
JU: Given that, what practices do you find most useful, and why?
CA: With photographs, there is value to us and to photographers in retaining as much color and spatial information as possible. The Library will be accepting a variety of formats. If your camera takes only JPEG, there's no point in going for anything else. But in general libraries and photographers have liked to keep full-resolution images without lossy compression. TIFF has been a standby, but it's not very good for embedding metadata.
There are explorations at the moment on formats and tools for getting metadata into images. Many photographers are positive about Adobe's DNG format, with XMP metadata, which is XML-based. An advantage of XMP is that you can embed it in images, or handle it as XML outside the image. XMP as a vehicle is now being supported by more and more tools.
But then within it, you have to have practices about what elements you record. In the photography world, the leading community is photography for journalism, so IPTC (International Press Telecommunications Council) is the leading metadata standard as far as elements are concerned.
This is a case where the Library has its own metadata standards, and we don't want to lose all the experience and compatibility with our own systems and tools, but clearly the commercial market and the equipment is gathering around the IPTC metadata schema. So we need to adjust our practices so we can take advantage of that.
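Editor's aside: the point above, that XMP can be handled as plain XML outside the image, can be illustrated with a minimal sketch. It assumes a sidecar packet that uses the RDF and Dublin Core namespaces; the sample packet, its values, and the function name `read_xmp_fields` are hypothetical and chosen only for illustration. It is not a description of the Library's own tooling.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by XMP packets for RDF and Dublin Core properties.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sidecar packet; the creator and description are made up.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>
        <rdf:Seq><rdf:li>Example Photographer</rdf:li></rdf:Seq>
      </dc:creator>
      <dc:description>
        <rdf:Alt><rdf:li xml:lang="x-default">Street market at dawn</rdf:li></rdf:Alt>
      </dc:description>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def _list_values(root, prop):
    """Collect the rdf:li values stored under one property, e.g. dc:creator."""
    values = []
    for element in root.iterfind(f".//{prop}", NS):
        values.extend(li.text for li in element.iter(f"{{{NS['rdf']}}}li"))
    return values


def read_xmp_fields(xmp_text):
    """Return the creator and description values found in an XMP packet."""
    root = ET.fromstring(xmp_text)
    return {
        "creator": _list_values(root, "dc:creator"),
        "description": _list_values(root, "dc:description"),
    }


if __name__ == "__main__":
    print(read_xmp_fields(SAMPLE_XMP))
```

Only the standard library is needed here, which is part of the appeal: the metadata stays readable with general-purpose XML tools. Which elements to record, such as those in the IPTC schema, remains a matter of practice rather than of format.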
JU: You've said that you look to leading practitioners, like professional photographers and journalists, but of course anyone can produce something which -- though we won't know it at the time -- could prove to be of great cultural significance. So we have to hope the standards and practices trickle down to everybody, right?
CA: Yes. The standards and practices supported in cameras and software, or in Flickr and the other management services, those are all part of the environment that we're working in, and that we have to be conscious of.
The rapidity of change is a real challenge for us. The book in its hard cover on the shelf has been there for a long time, and will continue to be. But in the digital world things change very quickly.
JU: It's a huge challenge, and we've yet to see the emergence of a way of dealing with this that would separate various concerns. Storage, for example, is a separable concern. It should be possible for individuals and organizations to choose from a range of storage options which would offer a range of preservation guarantees.
JU: And that wouldn't necessarily be tied to other kinds of arrangements. You mentioned Flickr. On the one hand people are using it for archival purposes. But it's also a catalog, it's also a database, it's also an environment for sharing and use. We're bundling all those concerns together right now, and that makes it difficult to get at what really matters to you in a rational way.
CA: I agree entirely. Flickr is not making any commitments to the way it's archiving the content. It is tricky. In the last few years, these big services provided by Amazon and Google are a complete change in the business model for these things. But it's interesting that the storage service from Amazon has taken off in a way that some other attempts failed to. There were several others, but they couldn't build the market and the trust. I think that somehow Amazon has the trust of people because it clearly has a big problem of its own. People trust that it will take good care of its own content, and that somehow it will solve these problems. So although as you say things aren't separate, in a way the building of trust can't necessarily be separate.
JU: Of course there are no long-term guarantees. This is where the scholarly publication folks have done the most thoughtful and intense work. They've even thought through what happens when the organization hosting the content fades away, and have seen that there needs to be a federation of cooperating businesses that transcends any individual organization.
I should be able to swap out Flickr's storage backend for a service that offered long-term guarantees, for which I'd pay a premium. That's not an option for anyone yet, but there's a whole slew of interesting business opportunities there for lots of players in lots of niches.
CA: What's unpredictable is quite how they will develop. It's a mixture of general moves in the technology and particular organizations deciding to go in a certain direction. And then the market, whether it's consumers or industry sectors, coming together to create critical mass.
What we found in NDIIPP is that it's very hard to drive this process. You can nudge, and promote awareness of problems, but what has actually emerged in the last year or two is probably quite different from what people were talking about in 2001 when the program got started.
JU: Where do you feel you have been successful in doing some nudging and promotion?
CA: I think some of the standardization efforts, for PDF/A, the archival format for PDF, and Office Open XML, are examples of where we've been able to play a role in moving in the right direction.
JU: The Library of Congress has been involved in both of those standardization efforts?
CA: Yes. In the PDF/A case, which happened first, this was an activity stimulated by the wishes of archival institutions and especially the legal and judicial community to have an archival document format that could substitute for paper.
The standard came out I think in 2004, and there are an increasing number of tools which can save in this format. It primarily outlaws features which are difficult for preservation.
JU: I was going to ask you to clarify that, because I think many people would say that PDF itself is a good archival format.
CA: The PDF/A format outlaws embedded audio and video, it requires that the text in the PDF be in reading order, it requires that the fonts used be embedded -- because in many cases PDF relies on the fonts you have on your computer -- and it requires that the fonts be legally embeddable. It also outlaws encrypting, and mandates XMP metadata.
JU: Do you think these restrictions tend to be easy to meet, or are they onerous?
CA: My guess is that in ordinary office documents, and documents that get submitted for court cases, it probably is not onerous.
JU: So in terms of Office Open XML, how did you approach that?
CA: We joined that effort after it was already underway. We learned that the British Library was actively involved, and we shared their interest. The general move to XML-based formats for text documents, and for the other office productivity documents, seemed to us like a very good move.
XML files, which you can look at with simple tools and whose tag names you can hopefully understand, offer inherent advantages.
As I said, the two most important factors for preservability are disclosure and adoption. By disclosure we mean that the specification exists, in a public way, that will continue to be available. Clearly to have it exist as an international standard by a known standards organization raises the probability that it will continue to be available and used.
As to adoption, clearly the Microsoft products are widely adopted, and libraries will be collecting content produced by those applications. So this seemed like a good opportunity to influence the public availability of the specification.
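Editor's aside: as a rough illustration of looking at an XML-based office document with simple tools, the sketch below treats a word processing file as a ZIP package and reads the text out of its main XML part. The in-memory package built by `build_sample_package` is a hypothetical stand-in containing only `word/document.xml`; a real Office Open XML file has additional parts (content types, relationships) that are omitted here. The function names and sample text are assumptions made for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by Office Open XML text documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical document body: two paragraphs, the second split into two runs.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Dear family,</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>We arrived </w:t></w:r><w:r><w:t>safely.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def build_sample_package():
    """Build a minimal stand-in package (not a complete .docx) in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()


def paragraphs_from_package(data):
    """Return the plain text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread across one or more w:t elements.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    for line in paragraphs_from_package(build_sample_package()):
        print(line)
```

The relevance to preservation is that nothing beyond a ZIP reader and an XML parser is required to recover the text, provided the specification that explains the tag names remains publicly available.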
JU: It's an interesting question as to what extent the Library will wind up interacting directly with documents produced by those applications, versus receiving content from organizations like scholarly publications, who are now for example beginning to be able to accept articles that were authored in Microsoft Word, but delivered in the NLM -- National Library of Medicine -- XML formats.
CA: Yes, you're right. Our traditional collecting has mainly been of published materials, and we expect they'll be in some form other than what your word processor creates. But we also collect the personal papers of famous individuals, so I'm sure we already have quite a lot of documents in word processing formats.
We believe it's important to be involved early in the content creation life cycle. If the tools begin to record more information about the transformations that go on, that's of value.
And beyond standard text documents, a phenomenal amount of valuable information is currently stored in PowerPoint files. Or, information that we might have collected on paper may be available as spreadsheets. We can't afford to assume that things will remain the way they are.
We're harvesting lots of documents from the web that may not have been published through traditional channels, and those are likely to be word processing documents or PDFs.
I'm confident we'll have plenty of documents in word processor formats that we will have to try to preserve.
JU: Of course the preponderance of what would be the equivalent of personal papers, at least for a certain era, will be email. And unfortunately we don't have any XML standards governing email.
CA: Right. So email is not something the Library of Congress spends a lot of time thinking about, but another government organization, NARA, the National Archives, for them email is very important. They capture the records of government agencies, and of each administration as it transfers power to the next.
So, I must mention that there are other XML formats. The Open Document Format is also a very important development for us, and we hope that it will be adopted. We have to keep an open mind and see where the marketplace moves.
We see that the general movement to XML-based formats, wherever they are appropriate, is a good thing.
JU: Yes. Whatever the XML format, there's a huge amount of untapped potential in the interweaving of content and metadata and, actually, data -- rows and columns sorts of data which are well represented in XML formats. The numbers in spreadsheets and databases are a form of content that is merging with documents, and should.
CA: Absolutely. One of the projects I've been involved with goes under the name Data-PASS. It's a consortium of social science data archives. They have a descriptive standard, a multi-level standard with a rich XML structure that supports the online subsetting of the data.
JU: So I think we're having this conversation in the nick of time, because you're retiring next month, right?
CA: Yes, in late May, actually. But I expect still to be engaged in the area. It's been a very exciting time, and I hope still to be involved even if I'm trying to have more time for family and travel.
JU: I hope so too. So, the challenges are daunting, but I think you're mostly optimistic about the future.
CA: Yes. I've learned to take a long-term perspective. You do see that even though the steps are small, there are lots of steps being taken in hopeful directions. Eventually these problems will be worked out. And as people become aware that this is not just a problem for libraries and archives, but also, as you've pointed out, for their own correspondence, their own photographs -- and also that businesses share the same problems -- I'm confident we're moving in the right direction. And I'm glad to have helped.