Sun's storage team would work really well either as an autonomous unit within Oracle or as a separate company (unfortunately the latter is unlikely unless the staff themselves do a 'JRuby').
Solaris is doing poorly as a general-purpose Unix OS to sell Sun support and hardware around, but as an embedded hardware/software appliance Sun kit is both better and cheaper than its counterparts, and they have some extremely talented engineers (though they've bled 27% of their staff since last year).
(Longer version: 3Par was founded by ex-Sun types who wanted to make storage appliances based on Solaris. Sun responded by charging them outrageous licensing fees. They went with Linux and now their business is doing just fine...)
Now that Solaris is open source as OpenSolaris, it's possible for someone other than Sun to build a storage appliance on it without having to worry about license fees.
Why would this be very useful? How often do you store two copies of the same data on one disk?
Edit: let me be more clear. When do you have situations where duplicate data is stored on the same disk and the best way to deal with it is through the filesystem?
If you're talking about a webapp with lots of users uploading the same photo or something, isn't that better handled before you hit the filesystem, so that you have dedupe over a number of independent disks/locations?
It's block-level dedupe, not file-level. This is similar to how Data Domain (www.datadomain.com, recently acquired by EMC for $2.2B) does it. In their customer base, mostly medium to large storage operations, the average customer compression is more than 20x. For a small (one-disk) system your mileage will obviously vary, but even a compression factor of 3-4x saves a lot of bits.
One person doesn't often store two copies of the same file for no good reason.
But the great thing about this is that it operates at the block level - so if two people take a CAD drawing, change a small bit and save it to their home directories, most of the shared data can be stored once and only the changed blocks stored separately.
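A toy sketch of the idea - not how ZFS implements it internally, just fixed-size blocks keyed by their SHA256 hash, so duplicates collapse into one stored copy:

    import hashlib, os

    BLOCK = 4096                                   # fixed-size blocks, roughly like ZFS records
    pool = {}                                      # hash -> block: each unique block stored once

    def store(data):
        """Store a file as a list of block references into the shared pool."""
        refs = []
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            digest = hashlib.sha256(block).digest()
            pool.setdefault(digest, block)         # only previously unseen blocks consume space
            refs.append(digest)
        return refs

    drawing = os.urandom(512 * 1024)               # stand-in for the original CAD file
    edited  = drawing[:-20] + b"tweaked by user two!"
    store(drawing)
    store(edited)

    logical = len(drawing) + len(edited)
    stored  = sum(len(b) for b in pool.values())
    print(f"{logical} logical bytes held in {stored} stored bytes")
    # Only the edited copy's final (changed) block needs its own storage.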
There are a lot of use cases where dedup helps, and it can be done in a number of ways. For example: say there's a huge email attachment that needs to be sent to everybody in the company. Instead of saving a copy of the attachment in everybody's account, only one copy could be saved and the others could point to it. Mail systems probably already do this. However, if this support existed natively in the filesystem layer, application developers could take advantage of it instead of rolling their own dedup.
Unless you use version control. Perforce, SourceSafe, git, and Subversion all "dedupe" their own data. They're not all brilliant about it (try committing a copy of a large file in Subversion without using `svn cp`), but ZFS dedupe will be of little value on these servers. (I left out CVS because I have absolutely no idea what it does on the server.)
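Git, for instance, content-addresses everything, so identical file contents collapse to a single stored blob - easy to see from how blob IDs are computed:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        """The object ID git gives file content: SHA-1 over a 'blob <size>\\0' header plus the bytes."""
        return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

    # Identical content always yields the same object ID, so the repository
    # stores that blob exactly once no matter how many paths contain it.
    print(git_blob_id(b"hello\n"))   # ce013625030ba8dba906f756967f9e9ca394464a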
The real value (and the content of many EMC sales pitches) is with backing up and archiving data. EMC often argues that most "corporate" data seems not to change very much over time.
I don't see how version control avoids file duplication. When you work on multiple branches at the same time, you necessarily have to check out each branch in a different directory.
Not having to worry about it at the application level is nice.
E.g., if you store dated snapshots of content, it's much easier on you if you can just store the whole directory, rather than manually maintaining hard links back to previous versions.
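For comparison, the manual hard-link route looks roughly like this - a deliberately minimal sketch, assuming a flat directory of files, whole-file granularity, and made-up paths:

    import filecmp, os, shutil

    def snapshot(source, new_snap, prev_snap=None):
        """Copy source into new_snap, hard-linking any file unchanged since prev_snap."""
        os.makedirs(new_snap, exist_ok=True)
        for name in os.listdir(source):
            src, dst = os.path.join(source, name), os.path.join(new_snap, name)
            if not os.path.isfile(src):
                continue                            # keep the sketch flat: skip subdirectories
            old = os.path.join(prev_snap, name) if prev_snap else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)                   # unchanged: share the earlier snapshot's copy
            else:
                shutil.copy2(src, dst)              # new or changed: store it again in full

    # snapshot("content", "snaps/2009-11-02", prev_snap="snaps/2009-11-01")

With block-level dedup in the filesystem, you'd just copy the whole tree each time and let the storage layer collapse whatever didn't change.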
Does anyone have real-world data for the likelihood of SHA256 hash collisions for actual user data (music, movies, documents, source code, etc)?
In my toy backup app I'm doing pretty much what they're doing - I assume that each 1MB block of data will have a unique hash. I haven't tested it on serious amounts of data though, so I'm very curious to know how likely this scheme is to survive an encounter with the real world.
I once read that anyone who actually finds a SHA256 collision should publish it in a paper. SHA256 collisions are supposed to be just shy of impossible to find (you know, the old trope that it would take millions of machines, millions of times faster than the ones we have now, working for millions of years to find a collision). Some confirmation of this would be welcome.
I don't like the way they present how unlikely hash collisions are... The number (2^-256) gives much more comfort than saying:
If you hash 4KiB blocks, then each block is 32,768 bits, so there are 2^32768 possible blocks mapped onto only 2^256 hash values - every hash value is shared by (on average) about 2^32512 different 4KiB blocks. And on a standard 200GiB disk you can fit (more or less) 52,400,000 blocks.
This explanation is a bit less reassuring. Now consider the fact that your data is never random and you hit the same patterns all the time (loads of zeros / ASCII letters / x86 code).
The "same patterns" concern is mitigated by the property that changing one bit in the input is supposed to change, on average, half of the bits in the hash.
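It's easy to check for yourself (flipping one bit of a random 4KiB block; any bit would do):

    import hashlib, os

    block = bytearray(os.urandom(4096))
    h1 = hashlib.sha256(block).digest()

    block[0] ^= 0x01                                 # flip a single bit of the input
    h2 = hashlib.sha256(block).digest()

    differing = sum(bin(a ^ b).count("1") for a, b in zip(h1, h2))
    print(f"{differing} of 256 output bits differ")  # typically right around 128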
You can also be reassured that there's lots of research going on about how likely these collisions are and how to find them. People are actively trying to break these hash algorithms, so it's not just in theory.
They do, but what I'm interested in is whether it's something that actually occurs in real data with SHA256 (in the article he says that it's of more use with "worse but faster" hashing algorithms).
My initial remedy is to add another hash method and name the data based on the results of both; that way problematic data would need to trigger a collision in two different algorithms at the same time, which "should" be next to impossible. Currently my file fragments are named "{SHA256}.dat", in v2 I could instead name them "{SHA256}{SOME_OTHER_HASH}.dat".
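Something like this sketch, with MD5 standing in for the hypothetical second hash (any algorithm independent of SHA256 would do) and a made-up fragment directory:

    import hashlib, os

    FRAGMENT_DIR = "fragments"              # hypothetical layout, just for illustration

    def store_fragment(data: bytes) -> str:
        """Write a fragment named by two independent hashes; an existing fragment is reused."""
        name = hashlib.sha256(data).hexdigest() + hashlib.md5(data).hexdigest() + ".dat"
        path = os.path.join(FRAGMENT_DIR, name)
        if not os.path.exists(path):        # a duplicate fragment is only ever written once
            os.makedirs(FRAGMENT_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
        return name                         # the backup index records this name for each chunk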
NetApp and Data Domain are both enterprise-level dedupe vendors who use similar hash functions to find unique strings of bits. Check out their whitepapers. Short answer: not in a million years (and there are further checks in case the hash alg somehow failed.)
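To put a rough number on "not in a million years", assuming SHA256 behaves like a random 256-bit function: the birthday bound puts the chance of any collision among n distinct blocks at roughly n^2 / 2^257. A quick sanity check:

    from math import log2

    def p_collision(n_blocks, hash_bits=256):
        """Birthday bound: chance of any collision among n distinct hashed blocks."""
        return n_blocks * (n_blocks - 1) / 2 ** (hash_bits + 1)

    for label, n in [("200 GiB of 4 KiB blocks", 52_400_000),
                     ("1 PiB of 4 KiB blocks", 2 ** 50 // 4096)]:
        print(f"{label}: ~2^{log2(p_collision(n)):.0f}")
    # Roughly 2^-206 and 2^-181 respectively - vanishingly small either way.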
Compress the folder with your favorite zip program and the compression ratio will be a very good approximation of the performance gain you'll get out of ZFS dedup.
Web archiving is another application that can benefit from this. You can crawl the same website 10 times a day and just store all of the files as they are.
Compression works within relatively short chunks of a file, whereas dedup looks across huge amounts of storage. Plus, if you use zip, you're only looking at similarities within individual files. I don't think the results would be useful.
Incidentally, ZFS also has an option to compress the data. You have a choice of a fast but not so wonderful algorithm, or gzip level 1 to 9. Since dedup is at the block level, I believe you can combine the two.
While it might be true for some compression formats/programs, it is not the case for .tar.*, as the directory is archived into a single file first (tar) and then compressed. So if you have similarities between two different files, they will be exploited.
I think making something that already does a good job on "small chunks" work for "large blocks" is just a matter of finding the right parameter values for the compression algorithm used.
It's not a useful measure because it's not the same thing at all. Dedup only works if the duplicate data is aligned to a ZFS block boundary. Compression will find matches that have no particular alignment.
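A quick illustration of the alignment point, with fixed 4KiB blocks and zlib standing in for the compressor (real ZFS records and compressors differ, but the effect is the same):

    import hashlib, os, zlib

    BLOCK = 4096
    payload = os.urandom(16 * 1024)          # kept small so zlib's 32 KiB window still sees the repeat

    def dedup_blocks(data):
        """Return (unique, total) fixed-size block counts - what naive block dedup sees."""
        hashes = [hashlib.sha256(data[i:i + BLOCK]).digest()
                  for i in range(0, len(data), BLOCK)]
        return len(set(hashes)), len(hashes)

    aligned = payload + payload                  # second copy starts on a block boundary
    shifted = payload + b"x" * 100 + payload     # second copy shifted off the boundary

    for name, data in (("aligned", aligned), ("shifted", shifted)):
        unique, total = dedup_blocks(data)
        print(f"{name}: dedup stores {unique} of {total} blocks, "
              f"zlib shrinks {len(data)} -> {len(zlib.compress(data))} bytes")
    # Block dedup collapses the aligned copy but not the shifted one,
    # while zlib roughly halves both, since its matches ignore block boundaries.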
While it might seem flippant to compare VAX to x86, it was found (on VAX) that a programmer could potentially get better aggregate performance by avoiding that instruction; having an instruction doesn't automatically mean the code that uses it is faster.
I can't bring myself to trust dedupe without verification after every hash match, so I plan to use it with a fast hash algorithm with full verify enabled.
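The general idea of verify, sketched abstractly here rather than as ZFS's actual code path: when the fast checksum matches an existing block, compare the bytes before treating it as a duplicate.

    import zlib

    class VerifyingDedupStore:
        """Toy dedup table keyed by a fast, weak checksum, with a full byte compare on every match."""

        def __init__(self):
            self.by_checksum = {}                # checksum -> list of distinct blocks sharing it

        def write(self, block: bytes):
            candidates = self.by_checksum.setdefault(zlib.crc32(block), [])
            for existing in candidates:
                if existing == block:            # verify: genuinely a duplicate, store nothing new
                    return
            candidates.append(block)             # checksum collision or new data: keep the block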
CRC32 is faster than ZFS's default of Fletcher2 and has less frequent collisions.
Shame ZFS has a slightly indeterminate future.