Hi,
I just had some thoughts about improving rsync performance (reducing
the amount of data transferred) when dealing with packed files and would
like some comments.
First what is the problem with packed files?
--------------------------------------------
Consider two nearly identical files that differ on the first
character. rsync will do a great job syncing those files.
Now pack both files with gzip and rsync again. The files will differ
in most, if not all, places and the whole file will be transferred.
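To see the effect, here is a quick Python sketch (mine, not rsync's; the sample data is made up): two plaintexts differing in exactly one byte are gzipped, and we count where the compressed streams diverge.

```python
import gzip

# Two plaintexts that differ only in the very first byte.
text = b"".join(b"%08d %08d\n" % (i, i * i) for i in range(5000))
plain_a = b"A" + text[1:]
plain_b = b"B" + text[1:]

# mtime=0 so the gzip headers are identical and only the
# compressed payload can differ.
packed_a = gzip.compress(plain_a, mtime=0)
packed_b = gzip.compress(plain_b, mtime=0)

first_diff = next(i for i, (x, y) in enumerate(zip(packed_a, packed_b))
                  if x != y)
same = sum(x == y for x, y in zip(packed_a, packed_b))
print("plaintexts differ at 1 byte; compressed streams first differ at "
      "offset %d, agree at %d of %d byte positions"
      % (first_diff, same, min(len(packed_a), len(packed_b))))
```

Running this shows the compressed streams start disagreeing close to the beginning, which is exactly why rsync's block matching gets no traction.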
How can we improve the performance?
-----------------------------------
First, both sides could unpack the file and rsync that. This would use
up a huge amount of CPU time and disk space on the server (just think
about 100 people each downloading a 100 MB gz file). It would be
possible to do, but I would rather not give the server that much CPU
time and disk space, let alone the right to write/delete files.
So what else could be done?
Here's what I thought about:
----------------------------
At first rsync runs as normal. The client sends checksums and the
server responds. At some point the server will send a block of plain
data where the source and destination files differ (presumably the
first block).
The client then has the old file (A) and a new file (B) with the one
block replacing the old one. The client then unpacks both A and B up
to the end of the changed block and tries to determine the changes
made to the unpacked data (like a line being deleted or inserted). If
such a change is detected and both files are identical after that (as
far as B can be unpacked), the client repacks the unpacked file (A),
stops the server from sending any more data, and recalculates the
checksums for the remainder of the file.
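Here is a rough sketch of that client-side step in Python, assuming gzip as the packer (the helper names and the prefix-unpacking trick are my own illustration, not rsync code): zlib can unpack a truncated stream as far as the data allows, and a single inserted/deleted/changed run can be found by stripping the common prefix and suffix of the two plaintexts.

```python
import gzip
import zlib

def unpack_prefix(packed: bytes) -> bytes:
    # Unpack as much plaintext as a possibly truncated gzip stream
    # allows; wbits=31 selects gzip framing.
    d = zlib.decompressobj(wbits=31)
    try:
        return d.decompress(packed)
    except zlib.error:
        return b""

def find_simple_edit(old: bytes, new: bytes):
    # Locate one contiguous changed region by stripping the common
    # prefix and suffix; returns (start, end_in_old, replacement).
    p = 0
    limit = min(len(old), len(new))
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return p, len(old) - s, new[p:len(new) - s]

# Demo: one line changed between the old and new plaintexts.
old_plain = b"line one\nline two\nline three\n"
new_plain = b"line one\nline 2\nline three\n"

start, end, repl = find_simple_edit(old_plain, new_plain)
patched = old_plain[:start] + repl + old_plain[end:]
print("patched plaintext matches the new one:", patched == new_plain)

# Partial unpacking: feed only the first half of a gzip stream.
packed = gzip.compress(bytes(range(256)) * 64, mtime=0)
partial = unpack_prefix(packed[:len(packed) // 2])
print(len(partial), "plaintext bytes recovered from half the stream")
```

One caveat the sketch glosses over: repacking A locally only helps if the client can reproduce the server's exact gzip parameters, otherwise the recomputed checksums won't line up with the server's file.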
In this mode it might be a good idea to reverse the direction of the
data flow. Let the server calculate the checksums for each block and
let the client compare that at every position and order the changed
blocks to be sent by the server. That way there is an extra round
trip, so response times double, but checksums won't have to be
retransmitted and the server doesn't have to be told to stop and wait
for new checksums.
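The reversed scheme could look roughly like this toy Python sketch (block size and helper names are made up; the weak/strong checksum pair just mimics rsync's idea): the server sends one checksum pair per block, and the client slides a rolling weak checksum over its old file at every byte offset to see which blocks it already has, then orders the rest.

```python
import hashlib

BLOCK = 1024

def weak_sum(data: bytes) -> int:
    # rsync-style rolling checksum: a = plain byte sum,
    # b = position-weighted byte sum, both kept to 16 bits.
    a = sum(data) & 0xFFFF
    b = sum((len(data) - i) * c for i, c in enumerate(data)) & 0xFFFF
    return (b << 16) | a

def roll(s: int, out_byte: int, in_byte: int, blocklen: int) -> int:
    # Slide the window one byte in O(1): drop out_byte, append in_byte.
    a = ((s & 0xFFFF) - out_byte + in_byte) & 0xFFFF
    b = ((s >> 16) - blocklen * out_byte + a) & 0xFFFF
    return (b << 16) | a

def server_signature(data: bytes):
    # One (weak, strong) checksum pair per fixed-size block.
    return [(weak_sum(data[i:i + BLOCK]),
             hashlib.md5(data[i:i + BLOCK]).hexdigest())
            for i in range(0, len(data), BLOCK)]

def client_match(old: bytes, signature):
    # Slide over the old file at every offset and record which of the
    # server's blocks already exist locally (toy version: one weak sum
    # per dict slot, full-size blocks only).
    wanted = {w: (idx, strong) for idx, (w, strong) in enumerate(signature)}
    found = {}
    if len(old) < BLOCK:
        return found
    s = weak_sum(old[:BLOCK])
    for off in range(len(old) - BLOCK + 1):
        if s in wanted:
            idx, strong = wanted[s]
            if hashlib.md5(old[off:off + BLOCK]).hexdigest() == strong:
                found.setdefault(idx, off)
        if off + BLOCK < len(old):
            s = roll(s, old[off], old[off + BLOCK], BLOCK)
    return found

# Demo: the server's file is the client's file with one inserted run.
old = b"".join(b"%05d-" % i for i in range(3000))
new = old[:5000] + b"<<<inserted line>>>\n" + old[5000:]

found = client_match(old, server_signature(new))
missing = [i for i in range(len(server_signature(new))) if i not in found]
print("client must request blocks:", missing)
```

Only the block containing the insertion (plus the undersized final block, which this toy version never matches) has to travel over the wire; everything else is found in the old file at shifted offsets.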
Please note that I always presume that we have a streaming format
like ar, tar, gzip, bzip2 and not something like zip, where a file
index at the end of the file is needed to unpack the data.
Also note that in the case of, for example, deb files, which are ar
files in a special form, the method could also work, provided that the
repacking is done by reusing the beginning and end of the original
file and changing only the parts that differ in the middle. If you
don't know what I'm talking about here, ignore this, or try "ar -x"
and "ar -c" on a debian package.
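To make the splice idea concrete, a toy sketch (the minimal ar helpers are my own; real debs have stricter conventions): since ar is a streaming format without an index, the bytes before and after a changed member can be reused verbatim, even if the replacement member's length differs.

```python
def ar_member(name: bytes, data: bytes) -> bytes:
    # Minimal ar member: 60-byte fixed-width header, data padded to
    # an even length. Fields: name, mtime, uid, gid, mode, size.
    header = b"%-16s%-12d%-6d%-6d%-8o%-10d\x60\n" % (
        name, 0, 0, 0, 0o100644, len(data))
    return header + data + (b"\n" if len(data) % 2 else b"")

def ar_members(blob: bytes):
    # Walk the archive sequentially; no index is needed.
    assert blob[:8] == b"!<arch>\n"
    pos, out = 8, []
    while pos < len(blob):
        name = blob[pos:pos + 16].rstrip()
        size = int(blob[pos + 48:pos + 58])
        out.append((name, blob[pos + 60:pos + 60 + size]))
        pos += 60 + size + (size % 2)
    return out

m1 = ar_member(b"debian-binary", b"2.0\n")
m2 = ar_member(b"control.tar", b"old control data")
m3 = ar_member(b"data.tar", b"payload that did not change")
old_archive = b"!<arch>\n" + m1 + m2 + m3

# Splice: reuse the original head and tail bytes, replace only the
# middle member -- its new length may differ, since nothing in the
# format references absolute offsets.
head = old_archive[:8 + len(m1)]
tail = old_archive[8 + len(m1) + len(m2):]
new_archive = head + ar_member(b"control.tar",
                               b"new, longer control data") + tail

print([name for name, _ in ar_members(new_archive)])
```

The spliced archive still walks cleanly from front to back, which is what makes this trick workable for deb files without unpacking the unchanged members at all.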
So, what do you think?
Regards,
Goswin