On 03/11/2011 01:09, Alex Waite wrote:
> I apologize if this has already been discussed before, but as of
> yet I have been unable to find any info on the topic.
>
> I have a very simple (and common) disk based backup system using
> rsync, hard links, and a little bit of perl to glue it together.
> Remote machines are backed up regularly using hardlinks across each
> snapshot to reduce disk usage.
>
> Recently I learned that rsync does a checksum of every file
> transferred. I thought it might be interesting to record the path and
> checksum of each file in a table. On future backups, the checksum of
> a file being backed up could be looked up in the table. If there's a
> matching checksum, a hard link will be created to the match instead of
> storing a new copy. This means that the use of hard link won't be
> limited to just the immediately preceding snapshot (as is the case
> with my current setup). Instead a hard link could be created to an
> identical file located in a different machine's snapshot.
>
> My initial concerns were that doing the checksums would be too CPU
> expensive, but if rsync is already doing them then that isn't a
> concern. My next thought was that the checksums would be susceptible
> to collisions, thus leading to potential data loss by linking to a
> non-identical file. However, from what I've read on wikipedia, rsync
> does both a MD5 and a rolling checksum. These two together make it
> /very/ unlikely to have a collision, thus accidentally linking to a
> non-identical file is unlikely.
>
> Is this approach even possible, or am I missing something? I know
> my labs have a lot of duplicate data across many machines, so this
> could save me hundreds of GiBs, maybe even a TiB or two.
>
> If this is possible, how can I save the resulting checksum of a
> file from rsync?
>
> Thank you for your time. I look forward to hearing your thoughts.
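
For illustration, the table-and-hard-link idea quoted above could be
sketched roughly as follows in Python. This is only a sketch under
stated assumptions, not how rsync or any existing tool does it: it
computes its own SHA-256 digests rather than reusing rsync's checksums,
the "table" is a plain in-memory dict, and the names file_digest and
dedupe_snapshot are made up for the example. To address the collision
worry, it compares the two files byte for byte before linking.

```python
import filecmp
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_snapshot(snapshot_dir, table):
    """Replace files in snapshot_dir with hard links to known duplicates.

    `table` maps digest -> path of an already stored file. It is a
    hypothetical in-memory stand-in for a persistent checksum table.
    Returns the number of files that were replaced by hard links.
    """
    linked = 0
    for root, _dirs, files in os.walk(snapshot_dir):
        for name in files:
            path = os.path.join(root, name)
            # Only regular files are candidates; skip symlinks and others.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = file_digest(path)
            known = table.get(digest)
            if known is None or not os.path.exists(known):
                table[digest] = path  # first copy seen: remember it
                continue
            if os.path.samefile(known, path):
                continue  # already the same inode, nothing to do
            # Guard against digest collisions with a full byte comparison.
            if not filecmp.cmp(known, path, shallow=False):
                continue
            # Link under a temporary name, then rename over the original,
            # so the path never disappears part-way through.
            tmp = path + ".dedupe-tmp"
            os.link(known, tmp)
            os.replace(tmp, path)
            linked += 1
    return linked
```

Calling dedupe_snapshot() on each machine's snapshot directory with the
same table would link duplicates across machines. Note that hard links
only work within one filesystem, and that all links to a file share one
set of ownership, permissions and timestamps, so files with identical
contents but different metadata would need separate handling.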

Check out http://backuppc.sourceforge.net/. It's a Perl-based backup
tool that uses rsync and does exactly what you ask for.

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html