Sun's storage team would work really well either as an autonomous unit within Oracle or as a separate company (unfortunately the latter is unlikely unless the staff themselves do a 'JRuby').
Solaris is doing poorly as a general-purpose Unix OS to sell Sun support and hardware around, but as an embedded hardware/software appliance Sun kit is both better and cheaper than its counterparts, and they have some extremely talented engineers (though they've bled 27% of their staff since last year).
(Longer version: 3Par was founded by ex-Sun types who wanted to make storage appliances based on Solaris. Sun responded by charging them outrageous licensing fees. They went with Linux and now their business is doing just fine...)
Now that Solaris is open source as OpenSolaris, it's possible for someone other than Sun to build a storage appliance on it without having to worry about license fees.
Why would this be very useful? How often do you store two copies of the same data on one disk?
Edit: let me be more clear. When do you have situations where duplicate data is stored on the same disk and the best way to deal with it is through the filesystem?
If you're talking about a webapp with lots of users uploading the same photo or something, isn't that better handled before you hit the filesystem, so that you have dedupe over a number of independent disks/locations?
It's block-level dedupe, not file-level. This is similar to how Data Domain (www.datadomain.com, recently acquired by EMC for $2.2B) does it. In their customer base, mostly medium to large storage operations, the average customer compression is more than 20x. For a small (one-disk) system your mileage will obviously vary, but even a compression factor of 3-4x saves a lot of bits.
One person doesn't often store two copies of the same file for no good reason.
But the great thing about this is that it operates at the block level - so if two people take a CAD drawing, change a small bit and save it to their home directories, most of the shared data can be stored once and only the changed blocks stored separately.
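A toy sketch of the idea - not how ZFS implements it internally, just fixed-size blocks keyed by their SHA256 hash, so duplicates collapse into one stored copy:

    import hashlib, os

    BLOCK = 4096                                   # fixed-size blocks, roughly like ZFS records
    pool = {}                                      # hash -> block: each unique block stored once

    def store(data):
        """Store a file as a list of block references into the shared pool."""
        refs = []
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            digest = hashlib.sha256(block).digest()
            pool.setdefault(digest, block)         # only previously unseen blocks consume space
            refs.append(digest)
        return refs

    drawing = os.urandom(512 * 1024)               # stand-in for the original CAD file
    edited  = drawing[:-20] + b"tweaked by user two!"
    store(drawing)
    store(edited)

    logical = len(drawing) + len(edited)
    stored  = sum(len(b) for b in pool.values())
    print(f"{logical} logical bytes held in {stored} stored bytes")
    # Only the edited copy's final (changed) block needs its own storage.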
There are a lot of use cases where dedup helps, and it can be done in a number of ways. For example: say there's a huge email attachment that needs to be sent to everybody in the company. Instead of saving a copy of the attachment in everybody's account, only one copy could be saved and the others could point to it. Mail systems probably already do this. However, if this support existed natively in the filesystem layer, application developers could take advantage of it instead of rolling their own dedup.
Unless you use version control. Perforce, SourceSafe, git, and Subversion all "dedupe" their own data. They're not all brilliant about it (try committing a copy of a large file in Subversion without using `svn cp`), but ZFS dedupe will be of little value on these servers. (I left out CVS because I have absolutely no idea what it does on the server.)
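Git, for instance, content-addresses everything, so identical file contents collapse to a single stored blob - easy to see from how blob IDs are computed:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        """The object ID git gives file content: SHA-1 over a 'blob <size>\\0' header plus the bytes."""
        return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

    # Identical content always yields the same object ID, so the repository
    # stores that blob exactly once no matter how many paths contain it.
    print(git_blob_id(b"hello\n"))   # ce013625030ba8dba906f756967f9e9ca394464a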
The real value (and the content of many EMC sales pitches) is with backing up and archiving data. EMC often argues that most "corporate" data seems not to change very much over time.
I don't see how version control avoids file duplication. When you work on multiple branches at the same time, you necessarily have to check out each branch in a different directory.
Not having to worry about it at the application level is nice.
E.g., if you store dated snapshots of content, it's much easier on you if you can just store the whole directory, rather than manually maintaining hard links back to previous versions.
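For comparison, the manual hard-link route looks roughly like this - a deliberately minimal sketch, assuming a flat directory of files, whole-file granularity, and made-up paths:

    import filecmp, os, shutil

    def snapshot(source, new_snap, prev_snap=None):
        """Copy source into new_snap, hard-linking any file unchanged since prev_snap."""
        os.makedirs(new_snap, exist_ok=True)
        for name in os.listdir(source):
            src, dst = os.path.join(source, name), os.path.join(new_snap, name)
            if not os.path.isfile(src):
                continue                            # keep the sketch flat: skip subdirectories
            old = os.path.join(prev_snap, name) if prev_snap else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)                   # unchanged: share the earlier snapshot's copy
            else:
                shutil.copy2(src, dst)              # new or changed: store it again in full

    # snapshot("content", "snaps/2009-11-02", prev_snap="snaps/2009-11-01")

With block-level dedup in the filesystem, you'd just copy the whole tree each time and let the storage layer collapse whatever didn't change.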
Does anyone have real-world data for the likelihood of SHA256 hash collisions for actual user data (music, movies, documents, source code, etc)?
In my toy backup app I'm doing pretty much what they're doing - I assume that each 1MB block of data will have a unique hash. I haven't tested it on serious amounts of data though, so I'm very curious to know how likely this scheme is to survive an encounter with the real world.
I once read that anyone who actually finds a SHA256 collision should publish it in a paper. SHA256 collisions are supposed to be just shy of impossible to find (you know, the old trope that it would take millions of machines, millions of times faster than the ones we have now, working for millions of years to find a collision). Some confirmation of this would be welcome.
I don't like the way they present how unlikely hash collisions are... The number (2^-256) gives much more comfort than saying:
If you hash 4KiB blocks, then each block is 32,768 bits, so there are 2^32768 possible blocks mapped onto only 2^256 hash values - every hash value is shared by (on average) about 2^32512 different 4KiB blocks. And on a standard 200GiB disk you can fit (more or less) 52,400,000 blocks.
This explanation is a bit less reassuring. Now consider the fact that your data is never random and you hit the same patterns all the time (loads of zeros / ASCII letters / x86 code).
The "same patterns" concern is mitigated by the property that changing one bit in the input is supposed to change, on average, half of the bits in the hash.
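It's easy to check for yourself (flipping one bit of a random 4KiB block; any bit would do):

    import hashlib, os

    block = bytearray(os.urandom(4096))
    h1 = hashlib.sha256(block).digest()

    block[0] ^= 0x01                                 # flip a single bit of the input
    h2 = hashlib.sha256(block).digest()

    differing = sum(bin(a ^ b).count("1") for a, b in zip(h1, h2))
    print(f"{differing} of 256 output bits differ")  # typically right around 128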
You can also be reassured that there's lots of research going on about how likely these collisions are and how to find them. People are actively trying to break these hash algorithms, so it's not just in theory.
They do, but what I'm interested in is whether it's something that actually occurs in real data with SHA256 (in the article he says that it's of more use with "worse but faster" hashing algorithms).
My initial remedy is to add another hash method and name the data based on the results of both; that way problematic data would need to trigger a collision in two different algorithms at the same time, which "should" be next to impossible. Currently my file fragments are named "{SHA256}.dat", in v2 I could instead name them "{SHA256}{SOME_OTHER_HASH}.dat".
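Something like this sketch, with MD5 standing in for the hypothetical second hash (any algorithm independent of SHA256 would do) and a made-up fragment directory:

    import hashlib, os

    FRAGMENT_DIR = "fragments"              # hypothetical layout, just for illustration

    def store_fragment(data: bytes) -> str:
        """Write a fragment named by two independent hashes; an existing fragment is reused."""
        name = hashlib.sha256(data).hexdigest() + hashlib.md5(data).hexdigest() + ".dat"
        path = os.path.join(FRAGMENT_DIR, name)
        if not os.path.exists(path):        # a duplicate fragment is only ever written once
            os.makedirs(FRAGMENT_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
        return name                         # the backup index records this name for each chunk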
NetApp and Data Domain are both enterprise-level dedupe vendors who use similar hash functions to find unique strings of bits. Check out their whitepapers. Short answer: not in a million years (and there are further checks in case the hash alg somehow failed.)
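To put a rough number on "not in a million years", assuming SHA256 behaves like a random 256-bit function: the birthday bound puts the chance of any collision among n distinct blocks at roughly n^2 / 2^257. A quick sanity check:

    from math import log2

    def p_collision(n_blocks, hash_bits=256):
        """Birthday bound: chance of any collision among n distinct hashed blocks."""
        return n_blocks * (n_blocks - 1) / 2 ** (hash_bits + 1)

    for label, n in [("200 GiB of 4 KiB blocks", 52_400_000),
                     ("1 PiB of 4 KiB blocks", 2 ** 50 // 4096)]:
        print(f"{label}: ~2^{log2(p_collision(n)):.0f}")
    # Roughly 2^-206 and 2^-181 respectively - vanishingly small either way.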
Compress the folder with your favorite zip program and the compression ratio will be a very good approximation of the performance gain you'll get out of ZFS dedup.
Web archiving is another application that can benefit from this. You can crawl the same website 10 times a day and just store all of the files as they are.
Compression works within relatively short chunks of a file, whereas dedup looks across huge amounts of storage. Plus, if you use zip, you're only looking at similarities within individual files. I don't think the results would be useful.
Incidentally, ZFS also has an option to compress the data. You have a choice of a fast but not so wonderful algorithm, or gzip level 1 to 9. Since dedup is at the block level, I believe you can combine the two.
While it might be true for some compression formats/programs, it is not the case for .tar.*, as the directory is archived into a single file first (tar) and then compressed. So if you have similarities between two different files, they will be exploited.
I think making something that already does a good job on "small chunks" work for "large blocks" is just a matter of finding the right parameter values for the compression algorithm used.
It's not a useful measure because it's not the same thing at all. Dedup only works if the duplicate data is aligned to a ZFS block boundary. Compression will find matches that have no particular alignment.
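A quick illustration of the alignment point, with fixed 4KiB blocks and zlib standing in for the compressor (real ZFS records and compressors differ, but the effect is the same):

    import hashlib, os, zlib

    BLOCK = 4096
    payload = os.urandom(16 * 1024)          # kept small so zlib's 32 KiB window still sees the repeat

    def dedup_blocks(data):
        """Return (unique, total) fixed-size block counts - what naive block dedup sees."""
        hashes = [hashlib.sha256(data[i:i + BLOCK]).digest()
                  for i in range(0, len(data), BLOCK)]
        return len(set(hashes)), len(hashes)

    aligned = payload + payload                  # second copy starts on a block boundary
    shifted = payload + b"x" * 100 + payload     # second copy shifted off the boundary

    for name, data in (("aligned", aligned), ("shifted", shifted)):
        unique, total = dedup_blocks(data)
        print(f"{name}: dedup stores {unique} of {total} blocks, "
              f"zlib shrinks {len(data)} -> {len(zlib.compress(data))} bytes")
    # Block dedup collapses the aligned copy but not the shifted one,
    # while zlib roughly halves both, since its matches ignore block boundaries.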
While it might seem flippant to compare VAX to x86, it was found (on VAX) that a programmer could potentially get better aggregate performance by avoiding that instruction; having an instruction doesn't automatically mean the code that uses it is faster.
I can't bring myself to trust dedupe without verification after every hash match, so I plan to use it with a fast hash algorithm with full verify enabled.
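The general idea of verify, sketched abstractly here rather than as ZFS's actual code path: when the fast checksum matches an existing block, compare the bytes before treating it as a duplicate.

    import zlib

    class VerifyingDedupStore:
        """Toy dedup table keyed by a fast, weak checksum, with a full byte compare on every match."""

        def __init__(self):
            self.by_checksum = {}                # checksum -> list of distinct blocks sharing it

        def write(self, block: bytes):
            candidates = self.by_checksum.setdefault(zlib.crc32(block), [])
            for existing in candidates:
                if existing == block:            # verify: genuinely a duplicate, store nothing new
                    return
            candidates.append(block)             # checksum collision or new data: keep the block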
CRC32 is faster than ZFS's default of Fletcher2 and has less frequent collisions.
Shame ZFS has a slightly indeterminate future.