About four years ago I interviewed Ray Kurzweil and asked him about the problem of old data formats. In The Singularity Is Near or one of his earlier papers or books, he had lamented the inaccessibility of old data, which prompted me to ask about the role of standards. His response:
We do use standard formats, and the standard formats are continually changed, and the formats are not always backwards compatible. It's a nice goal, but it actually doesn't work.
I have in fact electronic information that in fact goes back through many different computer systems. Some of it now I cannot access. In theory I could, or with enough effort, find people to decipher it, but it's not readily accessible. The more backwards you go, the more of a challenge it becomes.
And despite the goal of maintaining standards, or maintaining forward compatibility, or backwards compatibility, it doesn't really work out that way. Maybe we will improve that. Hard documents are actually the easiest to access. Fairly crude technologies like microfilm or microfiche which basically has documents are very easy to access.
So ironically, the most primitive formats are the ones that are easiest.
So something like acrobat documents, which are basically trying to preserve a flat document, is actually a pretty good format, and is likely to last a pretty long time. But I am not confident that these standards will remain. I think the philosophical implication is that we have to really care about knowledge. If we care about knowledge it will be preserved. And this is true knowledge in general, because knowledge is not just information. Because each generation is preserving the knowledge it cares about and of course a lot of that knowledge is preserved from earlier times, but we have to sort of re-synthesize it and re-understand it, and appreciate it anew.
Related thought: The "Rosetta Project" (http://rosettaproject.org/) aims to build a publicly accessible library of human languages. Besides the website, part of the original concept was to seed the world with "Rosetta Discs": metal discs with an approximately 2,000-year lifespan, etched with optically readable samples of over 1,500 written languages. Future humans who found the discs would be able to review and understand the dead languages on them, as long as at least one of the languages on a disc was still known. The website (http://rosettaproject.org/disk/concept/) describes how it would work:
The Disk surface shown here, meant to be a guide to the contents, is etched with a central image of the earth and a message written in eight major world languages: “Languages of the World: This is an archive of over 1,500 human languages assembled in the year 02008 C.E. Magnify 1,000 times to find over 13,000 pages of language documentation.” The text begins at eye-readable scale and spirals down to nano-scale. This tapered ring of languages is intended to maximize the number of people that will be able to read something immediately upon picking up the Disk, as well as implying the directions for using it—‘get a magnifier and there is more.’
On the reverse side of the disk from the globe graphic are over 13,000 microetched pages of language documentation. Since each page is a physical rather than digital image, there is no platform or format dependency. Reading the Disk requires only optical magnification. Each page is .019 inches, or half a millimeter, across. This is about equal in width to 5 human hairs, and can be read with a 650X microscope (individual pages are clearly visible with 100X magnification).
I like the way the disk begins at eye-readable scale and tapers down. But why not continue this even further? At some point, once enough exposition has been done that the corpus is fairly understandable, you could introduce the concept of binary encoding. Then, after a few short examples and mappings, the rest of the disc could continue in binary form.
One could theoretically take this further still by explaining how we built our primitive computers, working through some simple math, and continuing from there.
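As a rough sketch of what those "few short examples and mappings" might look like, here is a toy illustration in Python; the 5-bit alphabet and the choice of teaching samples are my own assumptions, not anything taken from the actual Rosetta Project.

    # Toy illustration of the idea: first show plain-text / binary pairs as a
    # key, then switch to binary-only content at much higher density.
    # The 5-bit alphabet below is an arbitrary choice for this sketch.

    ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ.,'-"   # 31 symbols -> 5 bits each
    CODE = {ch: format(i, "05b") for i, ch in enumerate(ALPHABET)}

    def encode(text: str) -> str:
        """Encode text as space-separated 5-bit groups."""
        return " ".join(CODE[ch] for ch in text.upper())

    # 1. Teaching section: a handful of worked examples shown next to their
    #    plain-text originals, so a reader can infer the mapping.
    for sample in ["A", "AB", "CAB", "A CAB"]:
        print(f"{sample!r:10} -> {encode(sample)}")

    # 2. Once the key is established, the remaining corpus could be etched
    #    in binary only.
    print(encode("THE REST OF THE DISC CONTINUES IN BINARY."))

Five bits per character is enough for a minimal Latin alphabet and keeps the worked examples short, which is the only reason this sketch uses it.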
I would love an accompanying Rosetta project in just one language that laid out our understanding of math, physics, and computer science, so that a civilization discovering the twin discs could use the first disc to learn English (as long as it knew, or could decipher, at least one of the languages on it), use the second to reconstruct our understanding of math, physics, and computer science and rebuild a 2000 AD era computer, and finally feed that computer a tar.gz dump of all of Wikipedia.
Oddly enough, I dreamt last night about the old 9-track, half-inch, reel-to-reel tapes we used to use, back in the day, and woke up wondering how much a reel actually held.
Looks like it was 140MB at highest density.
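For what it's worth, a back-of-the-envelope sketch lands in the same ballpark; the reel length, recording density, gap size, and block size below are assumed "typical" 9-track values, not the specs of any particular tape or drive.

    # Rough capacity of a 2400-foot 9-track reel. All constants are assumed
    # typical values, not measurements of a specific drive.

    REEL_FEET   = 2400          # full-size reel
    DENSITY_BPI = 6250          # bytes per inch at the highest common density (GCR)
    GAP_INCHES  = 0.3           # approximate inter-block gap at that density
    BLOCK_BYTES = 8 * 1024      # assumed block size; bigger blocks waste less tape

    tape_inches  = REEL_FEET * 12
    block_inches = BLOCK_BYTES / DENSITY_BPI + GAP_INCHES
    blocks       = int(tape_inches // block_inches)

    print(f"raw, ignoring gaps: {tape_inches * DENSITY_BPI / 1e6:.0f} MB")
    print(f"with 8 KB blocks:  ~{blocks * BLOCK_BYTES / 1e6:.0f} MB")
    # raw, ignoring gaps: 180 MB
    # with 8 KB blocks:  ~146 MB  (in the same ballpark as the 140 MB figure)

Smaller blocks mean more inter-block gaps and push the usable figure down toward 100 MB, which is probably why quoted capacities for a reel vary so much.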
Now, excuse me a moment, while I try to get those pesky kids off of my lawn.
The unit had an average seek time of 30 ms in 1973. Today, in common 7200 RPM drives, it seems to be around 8-9 ms. That's an improvement factor of only 3 to 4.
Latency has been scaling worse than almost any other factor for nearly the entire history of computing. The latency/throughput ratio for storage, networking, memory, cache and almost everything else has been rising continuously since the 70s and is likely to carry on rising.
This is going to have big consequences for the way that we design systems in the future. Transferring a large amount of data is going to be cheap and fast. Seeking, handshaking, back-and-forth chatter and any other latency-sensitive operations are going to be slow and expensive. This has already played a big role in algorithm design in HPC, and is going to start being felt to a much larger degree in the wider field over the next decade. As the latency/throughput ratio gets bigger, the tradeoffs behind optimal system design will change.
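A toy calculation makes the tradeoff concrete: if every request pays one seek, achieved bandwidth collapses for small requests and only approaches the sequential rate for large ones. The 8 ms seek and 120 MB/s streaming rate below are illustrative guesses, not benchmarks of any particular drive.

    # Toy model of the latency/throughput tradeoff for a spinning disk.
    # SEEK_S and STREAM_MBPS are illustrative figures, not measurements.

    SEEK_S      = 0.008     # ~8 ms average positioning time
    STREAM_MBPS = 120.0     # sustained sequential transfer rate, MB/s

    def effective_rate(request_mb: float) -> float:
        """Achieved MB/s when each request of this size pays one full seek."""
        return request_mb / (SEEK_S + request_mb / STREAM_MBPS)

    for size_mb in (0.004, 0.064, 1.0, 16.0, 256.0):
        print(f"{size_mb:8.3f} MB requests -> {effective_rate(size_mb):6.1f} MB/s")

    # Break-even request size, where seek time equals transfer time:
    print(f"break-even: {SEEK_S * STREAM_MBPS:.2f} MB")

With these numbers, 4 KB random reads deliver well under 1 MB/s while 16 MB reads get over 90% of the streaming rate, which is exactly the pressure that pushes system designs toward large sequential transfers.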
30 ms on the Cray-1 was only about 2.4 million clock cycles; 8 ms on a modern CPU is around 30 million. That's a big change.
That was the average seek time across a disk that was physically much larger than the disks in use today.
It wasn't rare at all to find fair-sized servo actuators (as opposed to the steppers in consumer-grade stuff in the 80s) in those old disk pack units the size of a washing machine.
You can't really compare today's 'common 7200 RPM drives' with a top-of-the-line medium from the 70s; the physics didn't change at all in that time. That's why there are 'servo tracks' on the drive: they help with finding the right track (in a stepper setup you don't actually need those, since the stepper resolution defines where the tracks are).
Today enterprise-level drives achieve < 4 ms average seek times at a price point that is a small fraction of what that drive cost in the 70s.
That's where the real improvement factor sits: performance (capacity, seek time and transfer rate alike) versus cost.
Well, physics didn't change, but our understanding of it did. Modern read heads are based on the giant magnetoresistive (GMR) effect, which was only discovered in 1988. It's actually one of the fastest transitions from a fundamental physics discovery to wide use (GMR read heads were in wide use from 2001-2002 or so).
The actual mass of the heads and arms was an issue as well; you had to be careful not to seek back and forth at the natural resonance frequency of the cabinet or the "washing machines" would start walking across the floor.
That's why they say that disk is the new tape. Relative to the sequential data rate, disk seeks are almost as expensive as a robot changing a tape reel used to be.
We aren't there yet, not by a long shot. Taking clock cycles as the base unit, a disk seek on a modern drive (8 ms seek, 4 GHz CPU) takes 32 million cycles. On the Cray-1's 80 MHz CPU, 32 million cycles is 400 ms, which would be really quick for a robot.
Tape seek times (for half a tape) are closer to 40 s, which is still 100x more. We'll have to wait a while yet before disk is the new tape in terms of the cost of seeks, even when comparing the 70s to today.
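Putting the cycle arithmetic from the last few comments in one place (an 80 MHz Cray-1, a 4 GHz modern core, 30 ms vs 8 ms disk seeks, and a rough 40 s half-tape seek; all figures are the approximate ones quoted above):

    # Seek cost measured in CPU clock cycles, using the rough figures above.

    CRAY1_HZ  = 80e6        # Cray-1 clock rate
    MODERN_HZ = 4e9         # a modern ~4 GHz core

    SEEK_1973_S  = 0.030    # disk seek on the early-70s drive
    SEEK_TODAY_S = 0.008    # disk seek on a current 7200 RPM drive
    TAPE_SEEK_S  = 40.0     # rough time to wind through half a tape

    print(f"1973 seek, Cray-1 cycles:   {SEEK_1973_S  * CRAY1_HZ:16,.0f}")
    print(f"today's seek, modern CPU:   {SEEK_TODAY_S * MODERN_HZ:16,.0f}")
    print(f"tape half-seek, modern CPU: {TAPE_SEEK_S  * MODERN_HZ:16,.0f}")
    # 1973 seek, Cray-1 cycles:          2,400,000
    # today's seek, modern CPU:         32,000,000   (~13x worse, relatively)
    # tape half-seek, modern CPU:  160,000,000,000   (~5,000x a disk seek)

So in relative terms a disk seek has become roughly an order of magnitude more expensive than it was in the Cray-1 era, but it is still thousands of times cheaper than seeking on tape.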
That's only considering latency. The cost consideration is a completely different story.
Because comparisons are usually things like "we have 1000x faster such-and-such and 100x smaller since such-and-such a time." This brings home that there were amazing technological advances even in the 70s worth noting. Sure, we have SSDs now and so on. But yeah, it's just kind of cool to see.
Didn't I read about someone who built a replica of a Cray but struggled to find actual software for it? This could be exactly what they needed. I think the story of the Cray replica (a scale model) was featured on HN.
That's pretty impressive. Not only did they image the disk; the disk had suffered a head crash, and the drive had one damaged head that they managed to restore to working well enough to capture the data.
Now of course the big question: what is on that disk?
Unfortunately, within 30 seconds of the heads being loaded a high-pitched whining noise began to be emitted from the drive, implying a potential head-to-disk contact was taking place. The drive was then powered down and the disk pack and heads were carefully examined. Thorough examination revealed that Head #4 on the drive (which reads the bottom surface of the lowest data platter) had 'crashed' into the disk surface and scraped away a concentric ring of oxide material, permanently damaging the platter. This is a good time to point out the advantages of not experimenting with your primary source material when performing digital archeology experiments!
This reminds me of 'Digital Needle', a hacking project by Ofer Springer back in 2002. The idea was to use a flatbed scanner to play a vinyl record.
I can imagine how difficult this was to pull off. I came across some old Colorado tapes with backups of BBS-related stuff, files, etc. from 10 years ago. I figured I'd try to dump the data off of them using a Linux live CD, only to find that Linux dropped floppy-interface tape drive support a long time ago. Next time around, I'll pull an ancient live CD version and get the data off before it goes the way of the Cray. Constantly migrate your data forward to avoid these hassles!
About a year ago, I needed to get some data off some old PDP-11 8" diskettes. Fortunately, an old friend had an LSI-11 in his closet with the disk drive. He hadn't run it in years, but it fired right up, the disks read perfectly, and he sent me the images.
My PDP-10 software, unfortunately, was gone for good. It was all on a magtape, and the drive that wrote it was way out of spec and the tapes were unreadable on any other drive. (sob)
Source of the Kurzweil quote above: http://blogs.computerworld.com/the_kurzweil_interview_contin...