Apologies to the list: the title of this thread is completely wrong. It should be something like "Question about --partial-dir and aborted transfers of large files". Let's see if this mailing list program will allow me to change it...
--
T.J.

On 10 August 2012 15:28, T.J. Crowder <t...@crowdersoftware.com> wrote:
> Hi all,
>
> rsync is a fantastic tool. :-) I'm blown away with what I've seen so far.
>
> I have a question about --partial-dir transfers. I've read through this
> thread:
> http://lists.samba.org/archive/rsync/2011-July/026575.html
> ...but while similar, I don't think it's quite the same, and I didn't find
> my answer there.
>
> The short(ish) version:
>
> 1. Am I correct in inferring that when rsync sees data for a file in the
> --partial-dir directory, it applies its delta transfer algorithm to the
> partial file?
>
> 2. And that this is _instead of_ applying it to the real target file? (Not
> a nifty three-way combination.)
>
> If so, it would appear that this means a large amount of unnecessary data
> may end up being transferred in the second sync of a large file if you
> interrupt the first sync. Is there an option or some such to address this?
> If not, would it be feasible to add? (Details on how I see that working
> below, and I may be able to pitch in.)
>
> The long version:
>
> Sometimes I need to sync very large files (VM disk images) using ssh,
> during an eight-hour time window. With my connection to the target server,
> eight hours is unlikely to be enough, so I'll have to interrupt the sync
> and continue it in the next day's window. Sometimes, the VM disk image will
> be changed again in the meantime, but this isn't necessary to trigger the
> behavior I mentioned above. (It is a case I'll have to handle.)
>
> I've run a few experiments with rsync in this area, and it looks like it
> causes a fair bit of unnecessary data transfer.
>
> Here's how I caused that:
>
> 1. I created a file with 100,000 lines of text with exactly the same
> length, and put it in both the source and destination.
>
> 2. In the source copy, I modified the first 20K lines. So roughly 20% of
> the file has been changed.
> I didn't change the *length* of the lines (in any of these experiments),
> because I'm trying to emulate a VM disk file which is conveniently
> organized into fixed-size blocks.
>
> 3. I started a sync:
>
>     rsync -avr --partial-dir=.rstmp src username@server:/dest/
>
> ...and cancelled it part-way through. This leaves a partial file in my
> .rstmp directory as expected. (In my case, just the first few hundred
> lines.)
>
> 4. I restarted the sync, allowing it to complete.
>
> The second sync ended up transferring nearly the entire file, basically
> the whole 100K lines minus the few hundred from the first sync. The 80K of
> unchanged lines were transferred, whereas if I hadn't interrupted the first
> sync, they wouldn't have been.
>
> I followed up with this experiment:
>
> 1. Starting with a synced file, I changed 20K lines in the *middle* of the
> file rather than at the beginning.
>
> 2. I started a sync and cancelled it part-way through, after about the
> same amount of time as the previous experiment. This leaves a partial file
> in my .rstmp directory as expected -- but it's a LOT bigger, rsync has
> quite intelligently copied the unchanged beginning of the file locally on
> the target machine, up until the first change, and then transferred the
> changed data after that -- which is when I interrupted it.
>
> 3. I started the sync again and let it continue, and it sent all of the
> rest of the file, the vast majority of which was already present in the
> original target file.
>
> In subsequent experiments, I was able to determine that if I changed part
> of the file that had already been transferred into the partial file (say,
> changing line 1 between steps 2 and 3 above), rsync was very smart about
> that, just transferring the changed bit without re-transferring everything
> in-between. That's why it seems to me it uses the full delta-transfer
> algorithm on the partial -- or at least some version of it.
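
For anyone who wants to reproduce the experiments above, here is a rough sketch of how test data of that shape could be built. It is not the script I actually used; the 64-byte line width, the helper names make_lines and modify_range, and the 40,000-60,000 range for "the middle" are all assumptions made for illustration. The only properties that matter are that every line has the same length and that the modified lines keep that length.

```python
def make_lines(count, width=64):
    """Return `count` lines, each exactly `width` bytes including the newline.

    The width of 64 is an arbitrary choice for this sketch.
    """
    lines = []
    for i in range(count):
        body = f"line {i:07d} ".encode("ascii")
        # Pad with dots so every line has the same length.
        lines.append(body.ljust(width - 1, b".") + b"\n")
    return lines


def modify_range(lines, start, stop, fill=b"#"):
    """Return a copy in which lines[start:stop] differ but keep their length."""
    out = list(lines)
    for i in range(start, stop):
        # Swap the padding character; the byte count is unchanged.
        out[i] = out[i][:-1].replace(b".", fill) + b"\n"
    return out


original = make_lines(100_000)
# First experiment: the first 20K lines are changed.
changed_head = modify_range(original, 0, 20_000)
# Second experiment: 20K lines in the middle (exact range is an assumption).
changed_middle = modify_range(original, 40_000, 60_000)
```

Writing a list out with open(path, "wb").writelines(...) gives the file for the destination (original) and for the source (changed_head or changed_middle).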
>
> All of this seems to suggest that the partial file is created by copying
> the target file up to the first change and then applying changes -- but
> that if you interrupt it, because the partial file is shorter than the
> source file, all of the remaining source file is transferred.
>
> Armed with that information, I tried to box clever: I thought "If I know
> I'm going to be doing one of these big files, maybe I could just copy the
> target to the .rstmp on the target machine in advance, so the
> delta-transfer applies to it." Unfortunately, though, cancelling the
> transfer early truncates the partial file. Drat. It wouldn't have been
> particularly elegant, but still would have been a workaround for now.
>
> If I'm right about all of the above (which I wouldn't put money on), it
> seems like it would be possible to address this in a logically simple way.
> Logically simple doesn't equate to being simple in code, of course. :-) The
> idea being, basically, that when referring to blocks in the target partial
> file (whether for determining the checksum of the block or transferring the
> data), if the target partial file is missing the block entirely, use the
> equivalent block from the actual target file -- so for checksum purposes,
> that tells us whether it changed, and for data transfer purposes if it
> didn't change, we know we can copy it locally on the target server.
>
> If there isn't already an option to address this, would it be feasible to
> do? I may be able to pitch in if so.
>
> Thanks in advance,
> --
> T.J. Crowder
>
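
To make the proposed fallback a bit more concrete, here is a toy sketch of the idea in Python. It is emphatically not rsync's implementation: real rsync matches blocks by rolling checksum at arbitrary offsets, whereas this sketch only compares blocks at the same offset and compares the bytes directly instead of checksums. The names read_block, basis_block and plan_transfer and the 4-byte block size are made up for illustration, and file lengths are assumed to be whole multiples of the block size.

```python
BLOCK_SIZE = 4  # tiny on purpose; purely illustrative


def read_block(data, index, block_size=BLOCK_SIZE):
    """Return block `index` of `data`, or None if it is not fully present."""
    start = index * block_size
    end = start + block_size
    if end > len(data):
        return None
    return data[start:end]


def basis_block(partial, target, index, block_size=BLOCK_SIZE):
    """Prefer the partial file's block; fall back to the real target's block."""
    block = read_block(partial, index, block_size)
    if block is not None:
        return block, "partial"
    block = read_block(target, index, block_size)
    if block is not None:
        return block, "target"
    return None, "missing"


def plan_transfer(source, partial, target, block_size=BLOCK_SIZE):
    """For each source block, decide between a local copy and a network send."""
    plan = []
    for index in range(len(source) // block_size):
        wanted = read_block(source, index, block_size)
        have, origin = basis_block(partial, target, index, block_size)
        plan.append("copy:" + origin if have == wanted else "send")
    return plan


target = b"AAAABBBBCCCCDDDD"   # old file on the destination
source = b"XXXXBBBBCCCCDDDD"   # new file; only the first block changed
partial = b"XXXX"              # first sync was interrupted after one block

print(plan_transfer(source, partial, target))
# ['copy:partial', 'copy:target', 'copy:target', 'copy:target']
```

Without the fallback to the target file, blocks 1 to 3 in this example would all have to be sent again, which matches what I saw in the experiments.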
--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html