Today’s NYT has an article on digital archiving. Let me admit that I have long made fun of the Grey Lady’s forays into technical writing (if their damn archives weren’t off-Google, I would link here to their articles on the “information superhighway”, “email” and, more recently, “web logs”). That said, this is not a terrible piece, even though it conflates two very distinct aspects: personal archiving and national (state/library) archiving.
Let me just focus on the technical issues facing the individual user: How does one keep data alive through changes in storage devices (mechanical, optical, flash memory, ether) and, more crucially, formats (WordStar, Nota Bene, WordPerfect, Word, PDF, JPEG, JPEG 2000)? And, furthermore, will there be an interpreter for the data down the road (how will we read Word files once Apple iDroid annihilates Microsoft HomeAibo in the Great Tech War of 2121)?
The thing about data storage: Everything will fail. Hard disks (spinning platters) can fail. They are also formatted in a particular way and come with particular connectors, and either of those can change within a 3-5 year period. Optical media (CD, DVD) scratch easily and no one can actually say how long they will last (estimates range from 2 years to 200 years). Flash memory uses the word “flash” as in “gone in a flash”. One way to overcome all this is redundancy. Burn your data to CDs and DVDs and keep it on hard drives as well. For example, my files are duplicated via rsync between my work machine, my home machine, an external FireWire drive, and my laptop, on top of the DVD burns that take place around New Year’s Eve. I am thinking of setting up a personal file server (which will double as my TiVo). All that is a pain, albeit a necessary one.
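To make the redundancy idea concrete, here is a rough sketch in Python of copying one folder to several places at once. It is only a stand-in for what rsync does far better, and the function name `mirror` and the size-plus-timestamp check are my own assumptions for illustration, not a description of rsync's algorithm.

```python
import shutil
from pathlib import Path
from typing import Iterable


def mirror(source: Path, destinations: Iterable[Path]) -> int:
    """Copy every file under `source` into each destination folder.

    Files whose size and modification time (to the second) already match
    are skipped. Returns the number of copies actually made.
    Hypothetical helper: a simplified illustration, not rsync.
    """
    dest_roots = list(destinations)
    copied = 0
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        relative = src_file.relative_to(source)
        for dest_root in dest_roots:
            target = dest_root / relative
            if target.exists():
                src_stat, dst_stat = src_file.stat(), target.stat()
                if (src_stat.st_size == dst_stat.st_size
                        and int(src_stat.st_mtime) == int(dst_stat.st_mtime)):
                    continue  # looks unchanged, skip it
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_file, target)  # copy2 also preserves timestamps
            copied += 1
    return copied
```

Calling something like `mirror(Path("Documents"), [Path("/Volumes/external/Documents")])` (paths invented for the example) would fill the second location with a copy. The destinations should live outside the source folder, and nothing here deletes files that have disappeared from the source.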
Next is the issue of software. Obviously, standards rule: .txt is a good standard. .pdf is a good standard. .jpeg is a good standard. .mp3 is a good standard. Keep your files in the broadest applicable standard and you have a fighting chance of opening them in 25 years. Word 2025 will NOT read Word 2003. I can guarantee that. But it might read a plain text file. I always save a copy of my Word files (the few I have, go LaTeX) in plain text. As you migrate to newer technologies, you must TAKE your data with you. When you install Office XP, open all of your Word 5.0 files and re-save them. Stay AWAY from proprietary formats, like HP ScanSoftShite. This goes for hardware as well. An IDE or ATA hard drive will never make you look like the idiot who is holding on to an Iomega Jaz drive with a forlorn look.
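A quick way to see how much migration work is waiting for you is to list every file that is not already in one of the open formats. The sketch below does that in Python. The set of "safe" extensions is the list from the paragraph above plus `.jpg`, which I am assuming counts as the same thing as `.jpeg`; the function name is made up for the example.

```python
from pathlib import Path

# The formats named above as good standards; ".jpg" is an added assumption.
OPEN_FORMATS = {".txt", ".pdf", ".jpeg", ".jpg", ".mp3"}


def needs_migration(root, open_formats=OPEN_FORMATS):
    """Return a sorted list of files under `root` whose extension is not
    in `open_formats`. These are candidates for re-saving in a broader
    standard. The extension check is case-insensitive."""
    flagged = [
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() not in open_formats
    ]
    return sorted(flagged)
```

It only looks at file extensions, so it will not catch a proprietary file hiding behind a `.txt` name, but it is enough to produce a to-do list before the next upgrade.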
As to a program to read your files 25 years from now, again, standards and constant migration are the only things that will help your grandchildren.
Now, here comes the non-technical issue: what to archive? My advisor has two tall metal cabinets filled with papers and a cataloguing cabinet with handwritten index cards. He can find a reference in about 2 minutes. It is amazing. I have a FileMaker database that houses all my articles and books with searchable keywords and abstracts (if available). Most of the articles are actual scanned/downloaded PDFs. I can search on a keyword and find and print anything in 2 minutes. Of course, my dissertation exists in 20 formats in a remote location destined to survive a direct nuclear hit. It is important for the next 6 months, after which I will personally set fire to it.
What about emails? I have seen faculty inboxes that are 2 GB with NO filters whatsoever. Everything they have ever gotten sits in a giant blob. Of course, people who are pack rats in real life are pack rats in their digital life. No surprise there. Organize. Make filters. Send emails to appropriate folders. Use labels to know what to keep. And DELETE IMMEDIATELY that which needs to be deleted. After all that, back up your email by year.
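For the "back up by year" step, here is a minimal sketch that assumes your mail client can export a folder as a standard mbox file (many can, but check yours). It uses Python's built-in `mailbox` module to sort messages into one mbox per year based on the `Date` header. The function name and the `mail-YYYY.mbox` naming are invented for the example.

```python
import mailbox
from email.utils import parsedate_to_datetime
from pathlib import Path


def split_mbox_by_year(mbox_path, out_dir):
    """Copy each message from `mbox_path` into `out_dir/mail-<year>.mbox`.

    Messages with a missing or unparseable Date header go to
    `mail-undated.mbox`. Returns a dict mapping year -> message count.
    The source mbox is only read, never modified.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives, counts = {}, {}
    source = mailbox.mbox(str(mbox_path), create=False)
    try:
        for message in source:
            date_header = message["Date"]
            year = "undated"
            if date_header is not None:
                try:
                    year = str(parsedate_to_datetime(str(date_header)).year)
                except (TypeError, ValueError):
                    pass  # leave it in the "undated" pile
            if year not in archives:
                archives[year] = mailbox.mbox(str(out_dir / f"mail-{year}.mbox"))
            archives[year].add(message)
            counts[year] = counts.get(year, 0) + 1
    finally:
        for archive in archives.values():
            archive.flush()
            archive.close()
        source.close()
    return counts
```

Run it against a fresh output folder each time, since running it twice into the same folder would append duplicates. The yearly files can then go onto the same DVDs and drives as everything else.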
Pictures. Oh boy, do I have a ton. I think the archiving of pictures is a tough one. Mainly because, unlike all other types of data, a picture increases in value with time, yet it can lie forgotten in My Pictures/Folder0012/M676876.jpg. Who knows what that is? Here, things like the iPod Photo can really revolutionize things by making pictures mobile (free from the computer or camera) and viewable on any TV screen.
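One small defense against the M676876.jpg problem is to give every picture a name that at least says when it was taken. The sketch below only proposes new names and does not move anything. It uses the file's modification time as a stand-in for the date the photo was taken, which is an assumption that breaks if the file has been edited or copied carelessly; reading the camera's embedded date would be better but needs a third-party library. The function name and the `YYYY/YYYY-MM-DD_name` layout are my own invention.

```python
from datetime import datetime
from pathlib import Path

PHOTO_SUFFIXES = {".jpg", ".jpeg"}


def plan_photo_names(folder):
    """Map each JPEG under `folder` to a proposed relative path of the
    form <year>/<YYYY-MM-DD>_<original name>, based on the file's
    modification time (local time). Nothing is renamed or moved."""
    plan = {}
    for photo in sorted(Path(folder).rglob("*")):
        if not photo.is_file() or photo.suffix.lower() not in PHOTO_SUFFIXES:
            continue
        stamp = datetime.fromtimestamp(photo.stat().st_mtime)
        plan[photo] = Path(str(stamp.year)) / f"{stamp:%Y-%m-%d}_{photo.name}"
    return plan
```

Looking over the plan before acting on it is the whole point: a date in the filename is not a caption, but it narrows "who knows what that is" down to a particular day.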
This blog is also an archive. A public one, but an archive nonetheless. Whenever I decide to call it quits, I would like to keep it around with the many conversations that went on here. It may be needed for a social history of academic bloggers written 80 years from now. The server has it right now, but I have a cron script that backs up everything daily to my personal server. Eventually, all these files will become part of my own archive and live happily forever.
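For the curious, a daily backup job of this kind can be quite small. What follows is not my actual script, just a sketch of the idea in Python: pack the site folder into a dated archive and throw away all but the most recent few. The function name, the `blog-YYYY-MM-DD.tar.gz` naming, and the default of keeping 30 snapshots are all assumptions made up for the example.

```python
import tarfile
from datetime import date
from pathlib import Path


def snapshot(site_dir, backup_dir, today=None, keep=30):
    """Write `backup_dir/blog-<YYYY-MM-DD>.tar.gz` containing `site_dir`,
    then delete the oldest snapshots so that at most `keep` remain.
    Returns the path of the archive just written."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    today = today or date.today()
    site_dir, backup_dir = Path(site_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive = backup_dir / f"blog-{today.isoformat()}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(site_dir), arcname=site_dir.name)
    # ISO dates sort correctly as plain strings, so oldest come first.
    snapshots = sorted(backup_dir.glob("blog-*.tar.gz"))
    for old in snapshots[:-keep]:
        old.unlink()
    return archive
```

A crontab entry along the lines of `0 3 * * * python3 /path/to/backup.py` (path hypothetical) would run it every night at 3 a.m. Getting the archive onto a different machine is a separate step that this sketch leaves out.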
Issues facing national archivists some other time.