Curt wrote:
> On 2022-11-08, DdB <debianl...@potentially-spam.de-bruyn.de> wrote:
> >>
> > Your wording likely confuses 2 different concepts:
> >
> > Deduplication avoids storing identical data more than once.
> > whereas
> > Redundancy stores information in more than one place on purpose to avoid
> > loss of data in case of havoc.
>
> So they're antithetical concepts? Redundancy sounds a lot like a back
> up.
Think of it this way: you have some data that you want to protect
against the machine dying, so you copy it to another machine. Now you
have a backup.

You need to do this repeatedly, or else your backup is stale: lacking
information that was recently changed. If you do it repeatedly to the
same target, that's a lot of data to move. Maybe you can send only the
changes? rsync, zfs send, and some other methods make that pretty easy
(rough command sketches below).

But what if you accidentally deleted a file a week ago, and the backups
are done every night? You're out of luck... unless you have somehow
kept a record of all the changes, or you have a second backup that
happened before the deletion. Snapshots (rsnapshot, ZFS snapshots,
others...) make it easy to go back in time to any snapshot and retrieve
the state of the data then, while not storing full copies of all the
data all the time.

Now, let's suppose that you want your live data -- the source -- to
withstand a disk dying. If all the data is on one disk, that's not
going to happen. You can stripe the data across N disks, but since
there's still only one copy of any given chunk of data, that doesn't
help with resiliency to a disk failure. Instead, you can write multiple
complete copies every time you write: disk mirroring, or RAID 1. This
is very fast, but eats twice the disk space.

If you can accept slower performance, you can write the data in chunks
across N disks, plus parity calculated from that data on one more disk,
such that any 1 disk of the N+1 can fail and you can still reconstruct
the whole data. That's RAID 5. With a second, slightly more complicated
parity calculation, any 2 disks of the N+2 can fail: that's RAID 6. ZFS
even has a three-disk resiliency mode (raidz3).

Depending on your risk tolerance and performance needs, you might use
RAID 10 (striping plus mirroring) on your main machine, and back up to
a more space-efficient but slower RAID 6 on your backup target.

What we've left out is compression and deduplication. On modern CPUs,
compression is really fast. So fast that it usually makes sense for the
filesystem to try compressing all the data it is about to write, and to
store the compressed data with a flag saying it will need to be
decompressed when read. This not only increases your available storage
capacity, it can make some reads and writes faster, because less has to
be transferred to/from the relatively slow disk. There is more of an
impact on rotating disks than on SSDs.

Deduplication tries to match data that has already been written and
store a pointer to the existing data instead. That's easy as long as
you have two things: a fast way to match the data perfectly, and a very
fast way to look up everything that has previously been written. It
turns out that both of those subproblems scale badly. The main use case
is storing many virtual machine images, or something similar, where you
can expect every one of them to have a large percentage of identical
files stemming from the operating system installation.

-dsr-
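
To make the "send only the changes" part concrete, here's a rough
sketch with rsync. The hostname and paths (backuphost, /backup/...) are
made up, and this is just the hard-link trick that rsnapshot automates,
not anyone's exact setup:

  # Plain incremental backup: only files that changed cross the wire.
  rsync -a --delete /home/ backuphost:/backup/home/

  # Snapshot-style: each night gets its own directory, and unchanged
  # files are hard-linked to the previous night's copy, so they take
  # no extra space on the backup target.
  rsync -a --delete --link-dest=/backup/home-2022-11-07/ \
        /home/ backuphost:/backup/home-2022-11-08/

rsnapshot is essentially that second form plus scheduling and rotation.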
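
The same idea with ZFS, assuming an invented source dataset tank/home
and a pool called backup on the other machine:

  # A named snapshot every night; it only consumes space for blocks
  # that change afterwards.
  zfs snapshot tank/home@2022-11-08

  # First run: send the whole dataset to the backup machine.
  zfs send tank/home@2022-11-07 | ssh backuphost zfs receive backup/home

  # After that: send only the difference between two snapshots.
  zfs send -i tank/home@2022-11-07 tank/home@2022-11-08 | \
        ssh backuphost zfs receive backup/home

  # Going back in time: last week's copy of a deleted file is still
  # readable under the hidden .zfs/snapshot directory.
  cp /tank/home/.zfs/snapshot/2022-11-01/notes.txt /tank/home/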
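
For the RAID shapes, the device names below are placeholders; this is
roughly what mirroring, double parity and striped mirrors look like
with Linux md and ZFS:

  # RAID 1: two complete copies of every write.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

  # RAID 6: data striped across the set plus two disks' worth of
  # parity; any 2 of the 6 can die.
  mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[d-i]

  # RAID 10: striping over mirrors, for the fast main machine.
  mdadm --create /dev/md2 --level=10 --raid-devices=4 /dev/sd[j-m]

  # ZFS equivalents: a mirror, double parity (raidz2), and the
  # three-disk resiliency mode mentioned above (raidz3).
  zpool create tank mirror /dev/sdb /dev/sdc
  zpool create backup raidz2 /dev/sd[d-i]
  zpool create vault raidz3 /dev/sd[n-u]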
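
And the compression and dedup knobs, again in ZFS terms with invented
dataset names. Compression is the cheap win; dedup is the one whose
lookup table has to stay fast, which in practice means a lot of RAM:

  # Try to compress every block before writing it; blocks that don't
  # shrink get stored as-is.
  zfs set compression=lz4 tank

  # See how much it actually saved.
  zfs get compressratio tank

  # Dedup only on the dataset full of nearly identical VM images,
  # where the duplicate hit rate justifies the cost.
  zfs set dedup=on tank/vmimages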