The following entry is a highly condensed, abridged version of a paper I wrote for a class this past semester.
The year is 2038. An archivist working at a digital repository has acquired an old storage device filled with image files from about 30 years earlier. The better part of the past year was spent creating a hardware emulator that would allow current computing devices to read the archaic storage device. Now, the archivist eagerly sifts through the files with the purpose of converting them into a newer format so that the archive can eventually add them to the digital collection. Unfortunately, the original creators of these files had saved them in a proprietary format unique to the software they used to create them, in hopes of preserving them for future use. Thirty years later, that software no longer exists, and the archive has no other software capable of reading the cryptic files. Paper prints do exist for some of the images, but many remain only in their original digital form, their information locked away in a capsule that no modern key can open.
This scenario, while hypothetical, illustrates what could occur when file formats become obsolete. File format obsolescence is a silent threat to individuals who manage and preserve digital materials for the long term.
What is a File Format?
The common understanding of file formats is that they are types of documents, such as a Microsoft Word document or an MP3 music file. Each file format is identified with an extension, such as .doc and .mp3, respectively. In order for a computer to make sense of a format, it must have software installed that is capable of understanding it. In this sense, it is useful to think of file formats as similar to languages, all using the same alphabet of ones and zeroes. The ones and zeroes are arranged in a manner consistent with the format’s specification. A format specification is a document that outlines the format, bit by bit. The specification contains instructions for the software designed to “understand” the format, such as displaying a color, executing a CPU calculation or pulling data from another file.
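To make the "language" analogy concrete, here is a minimal Python sketch of how software can recognize a format from the first few bytes of a file rather than from its extension. The table covers only a handful of well-known leading byte signatures, and the function names (identify_format, identify_file) are my own for illustration, not part of any standard tool.

```python
# Leading byte signatures ("magic numbers") for a few formats named in this article.
# This short table is illustrative, not a complete registry.
SIGNATURES = [
    (b"%PDF", "PDF document"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"ID3", "MP3 audio (with ID3v2 tag)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (e.g. legacy .doc)"),
    (b"8BPS", "Adobe Photoshop PSD"),
    (b"PK\x03\x04", "ZIP container (e.g. ODT)"),
]


def identify_format(data: bytes) -> str:
    """Return a best-guess format label from the first bytes of a file."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first 16 bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

A file renamed from photo.jpg to photo.txt would still be reported as a JPEG image, because the check looks at the bytes themselves. Recognizing a format is only the first step, though: actually rendering the file still requires software that implements the rest of the specification.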
The Cornell University Library’s Digital Preservation Tutorial outlines three main classes of file format specifications: closed proprietary, open proprietary and open non-proprietary. Typically, commercial software vendors develop and own the rights to closed proprietary formats. The products of these formats are available to all who use them, but the creators restrict access to their specifications. In fact, closed proprietary format specifications may always remain just that: closed, and optimized for use in their vendor’s own software.
Open proprietary formats are also owned and developed by a commercial entity, but unlike their closed counterparts, open proprietary format specifications are publicly available. As a result, they are a “compromise between business interests and open standards.” Adobe’s PDF and SWF are examples of open proprietary formats.
Finally, open non-proprietary format specifications are, as the name suggests, not owned by a commercial organization. Generally, international standards bodies publish these specifications. Examples of open non-proprietary formats include JPEG for images and ODT (OpenDocument Text) for word-processing documents. It is important to distinguish between these classes of format specifications because, on the whole, closed format specifications tend to become obsolete faster than open ones.
What makes a format obsolete?
Just as languages die out due to lack of native speakers or evolve to a point where they morph into a new language entirely, so file format specifications eventually reach a point where it is no longer possible to decode them. The definition of obsolete in this context is complex and multi-layered. According to David Rosenthal in Transparent Format Migration of Preserved Web Content:
A format is said to be obsolete when current hardware and software are no longer able to render information represented in it understandable to readers.
Frank Hayes, in Obsolete Defined, says that the perception of obsolescence depends on perspective:
Vendors call something obsolete when they can no longer make money selling it. IT shops say the same thing isn’t obsolete until we can no longer make money using it. And end users, the people at their desks? Many of them believe a familiar IT system isn’t obsolete until the pain involved in getting it to do what’s needed is a lot greater than the pain of migrating to something new.
Software obsolescence often runs in tandem with file format obsolescence, because many file formats depend on the availability of certain software environments. This is especially the case for closed proprietary formats. Whenever a commercial vendor updates a software product, it may “evolve” the associated format specifications as well, so that they support the software’s newest features. This practice helps to keep the software competitive with similar applications vying for the same market. Format evolution does not always indicate impending obsolescence, but it can motivate users to upgrade their software and, with it, adopt the upgraded format, abandoning the previous version of the format.
Gradually, the old format becomes obsolete. Some companies may go as far as to purposely limit backward compatibility for legacy software versions as part of their business model. Saved computer game files and native image files are also vulnerable in this respect.
Game developers create unique format specifications that function exclusively within the game software environment and depend heavily on external files. Similarly, native image formats (such as Adobe Photoshop PSD) support multiple layers and elaborate image filters. These files will often fail to function outside of their vendor’s software environment. Should the company stop supporting or maintaining this software for business reasons, files remaining in those native formats are at high risk of falling through the cracks.
So far, many of the above reasons for obsolescence apply to closed proprietary formats. What about open proprietary and open non-proprietary formats? In general, these two format types are less vulnerable to obsolescence. Few commercial incentives motivate their developers to change the specifications on a regular basis. Because user-centered organizations create the specifications for open non-proprietary formats, the development plug is less likely to be pulled due to lack of profits.
However, the longevity of all open formats, proprietary or not, still depends on wide user adoption. Widely adopted formats, such as JPEG, PDF and SWF, are supported by many software applications and can function properly in different hardware environments. In addition, these formats offer greater support for backward compatibility.
Another potential danger of open formats is their slower development. While it can be an asset, slow development can lead to stagnation. A format that sees little change over time runs the risk of being surpassed by newer formats optimized to take advantage of the latest software and hardware environments.
How much of a problem is file format obsolescence?
On the surface, digital files appear unaffected by the problems that plague physical media. After all, a manuscript typed up in MS Word will not yellow with age, an album of songs in MP3 format will never crack or accumulate dust, and thousands of photos saved as JPEGs promise to remain perpetually clear. Yet beneath the surface, a different story emerges: format obsolescence is an inevitable, looming threat, made worse by the tendency to believe that anything digital is permanent. In fact, software changes at a high rate.
“Overconfidence” in digital formats may prevent proactive efforts toward preservation, especially among home users. The chief archivist at the UK National Archives suggests that long-term preservation may actually prove harder for digital files than for paper materials, because there is no guarantee that modern computers will be able to access something stored on a digital device even 3 or 4 years ago, as is the case with the floppy disk.
Indeed, many documents created with 2008 software would not fit on the 3.5-inch floppy disks that were a standard just 10 years ago. At the same time, computer manufacturers have eliminated built-in floppy drives. As a result, files stored on floppy disks are at risk of becoming lost, their formats growing increasingly outdated with each passing year. If past advances in technology are any indication, it seems that it is only a matter of time before files stored on CD-ROMs and the newer USB portable drives meet a similar fate.
As has been shown, file format obsolescence is a threat. But whom does it threaten the most? Most likely, it is anyone involved in the preparation and storage of digital materials over the long term, especially for 15 years or more. Archivists and businesses fall under this category, as do personal users to a smaller extent. However, personal users, unlike archivists and businesses, are not in the business of long-term storage, and so the “sting” of losing information due to obsolescence may not be as severe as it would be for a digital newspaper archive or a business that keeps financial records.
The problem of preserving files over the long term has not received widespread attention from businesses and individuals at this point, but there have been efforts to confront it. The Digital Futures Alliance was formed to consider solutions to the obsolescence problem.
Strategies to combat file format obsolescence
While there is no way to prevent file format obsolescence, there are certain strategies that could lessen its impact. Two strategies, emulation and migration, center on dealing with obsolescence after and before it has occurred, respectively. Sustainability assessment centers on looking at specific properties of a file format to judge its potential for becoming obsolete, a technique that can assist individuals who are storing data over the long term.
Emulation is useful for accessing files that have already become obsolete. The process involves writing a special program, an emulator, that allows a current computer system to run archaic software. Emulators trick the old software into behaving as if it were running in its native environment. This allows the information to be accessed in its original form, which may be important from an archival standpoint. However, emulation is expensive, and it can become very complex when the emulators themselves must be emulated due to inevitable equipment upgrades.
Migration involves converting a file from an archaic or aging format to a current one. This can be more cost-effective than emulation, but it does come with drawbacks. Once a target format is selected, all existing files must be normalized and then saved in that format. This process can often result in glitches and corruption when format specifications do not match up perfectly. In effect, data can be “lost in translation.” Migration also does not offer the benefit of experiencing a file in its original state, compromising the archival integrity of the information.
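Before any migration can happen, someone has to know which files in a collection are in aging formats. The following is a rough Python sketch of that inventory step, using only the standard library. The watch list AT_RISK_EXTENSIONS and the function names are hypothetical; a real archive would decide its own list, and would likely identify formats by content rather than by extension alone.

```python
from collections import Counter
from pathlib import Path

# Hypothetical watch list: extensions an archive has decided to migrate.
AT_RISK_EXTENSIONS = {".psd", ".doc"}


def inventory(root: str) -> Counter:
    """Count the files under root, grouped by lowercase extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


def migration_candidates(root: str, at_risk=AT_RISK_EXTENSIONS) -> list:
    """Return the sorted paths of files whose extension is on the watch list."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in at_risk
    )
```

The sketch only finds candidates. The conversion itself depends entirely on the formats involved and is deliberately left out.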
While emulation and migration serve as ways to work with existing files, it is possible to reduce the chances of having to resort to these methods in the first place. Much attention has been given to the concept of file format sustainability. A sustainable file format has a good chance of remaining accessible and usable over a long period of time. Characteristics of sustainable formats include wide adoption, backward compatibility, open specifications (disclosure), support for descriptive metadata, good feature support while retaining simplicity, built-in error checking, stability, external independence and platform independence.
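One simple way to apply these characteristics is as a checklist. The Python sketch below scores a format by the fraction of the nine characteristics it meets. Weighting every criterion equally is my own simplifying assumption, and the two sample assessments are hypothetical placeholders, not ratings of any real format.

```python
# Criteria taken from the list of sustainable-format characteristics above.
CRITERIA = (
    "wide_adoption",
    "backward_compatibility",
    "open_specification",
    "descriptive_metadata",
    "simple_feature_set",
    "error_checking",
    "stability",
    "external_independence",
    "platform_independence",
)


def sustainability_score(assessment: dict) -> float:
    """Return the fraction of criteria a format meets, from 0.0 to 1.0.

    Criteria missing from the assessment count as not met.
    """
    unknown = set(assessment) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    met = sum(1 for name in CRITERIA if assessment.get(name, False))
    return met / len(CRITERIA)


# Hypothetical assessments for illustration only; not ratings of real formats.
example_open_format = {name: True for name in CRITERIA}
example_closed_format = {"wide_adoption": True, "backward_compatibility": True}

print(sustainability_score(example_open_format))
print(sustainability_score(example_closed_format))
```

A real assessment would probably weight some criteria more heavily than others (an open specification, for instance) and would rest on evidence gathered for each format rather than on yes/no guesses.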
The issue of file format obsolescence has the potential to rework the traditional modes of archival thinking. Traditional archivists work with materials that have proven able to withstand centuries. However, with digital materials, there seems to be no such thing as long-term. Yes, digital files seem to carry an air of permanence, but in reality, the opposite is true. While the files themselves may last “forever,” they may not always be accessible. It will thus be important for archivists working with digital materials to view file formats as “fleeting” and to expect change. Programs could be developed for regular file overhauls and inspections. It may even be useful to create regular positions in which people watch the development of file formats and provide continuous updates on their viability as a storage option.
Ultimately, the current digital revolution is still new and unprecedented in history. Much like the early developers of the printed book, today’s generations are pioneers of the digital age. Most of today’s file formats are not dead languages; rather, they are living, ever evolving in tandem with the rest of the technological developments of the age. Because technology evolves at a faster rate than the human eye and ear, the death of digital “languages” will occur much more often than that of their spoken counterparts. Therefore, it is crucial to maintain a steady watch to ensure that future generations will be able to translate today’s digital languages.
Images from Wikimedia Commons.
Text © 2008 Michelle of Pondersphere.com.