Re: Rsyncing really large files

2005-03-09 Thread Shachar Shemesh
Kevin Day wrote: Shachar- I think Wayne is mostly pointing you to the correct location here. If you look at the code where the checksum is computed, you'll find that the rolling checksum actually consists of two parts - one is the sum of all of the byte values in the window, the other is the

re[2]: Rsyncing really large files

2005-03-09 Thread Kevin Day
Shachar- I think Wayne is mostly pointing you to the correct location here. If you look at the code where the checksum is computed, you'll find that the rolling checksum actually consists of two parts - one is the sum of all of the byte values in the window, the other is the offset weig
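
For readers following along, here is a minimal sketch of the two-part rolling checksum being described. The names are illustrative, not taken from the rsync source:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of the two-part weak checksum described above: s1 is the
     * plain sum of the byte values in the window, s2 is the offset-weighted
     * sum.  Both halves are kept modulo 2^16 and packed into 32 bits. */
    static uint32_t weak_checksum(const unsigned char *buf, size_t len)
    {
        uint32_t s1 = 0, s2 = 0;
        for (size_t i = 0; i < len; i++) {
            s1 += buf[i];                        /* sum of byte values   */
            s2 += (uint32_t)(len - i) * buf[i];  /* offset-weighted sum  */
        }
        return (s1 & 0xffff) | ((s2 & 0xffff) << 16);
    }

    /* Sliding the window one byte to the right only needs the byte that
     * leaves and the byte that enters - this is what makes it "rolling". */
    static void roll(uint32_t *s1, uint32_t *s2, size_t len,
                     unsigned char out, unsigned char in)
    {
        *s1 = *s1 - out + in;
        *s2 = *s2 - (uint32_t)len * out + *s1;
    }

The cheap roll step is why the sender can test for a match at every byte offset without recomputing the whole sum each time.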

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
On Sat, Mar 05, 2005 at 10:58:12PM +0200, Shachar Shemesh wrote: > What about my proposal, though? Should I send in a patch so it can be > evaluated (hoping I'll manage to find my way around, that is). I'm always open to potential patches. If we can make things faster without taking up large chunks

Re: Rsyncing really large files

2005-03-05 Thread Shachar Shemesh
Wayne Davison wrote: But how will you find it there? If you are going to have 740K blocks (i.e. - 740,000 strong hashes) in a 16bit hash table, you are going to have lots of collisions there (190 per bucket, on average), and you gained nothing. You'll note that a no-match bitmap has no effe

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
On Sat, Mar 05, 2005 at 08:07:20PM +0200, Shachar Shemesh wrote: > Definitely not [block-size]! I was talking about the hash table load. Ah yes, I can see that I misread what the "it" was referring to. Thanks for the clarification. > But [the bitmap lookup] only works if the checksum function a

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
> Wayne Davison wrote: > >He's talking about potentially losing an even distribution of values > >if the lowest order bits aren't random enough. On Sat, Mar 05, 2005 at 08:10:37PM +0200, Shachar Shemesh wrote: > Even for N which is not a power of 2? In the case of the weak checksum, it is compute

Re: Rsyncing really large files

2005-03-05 Thread Shachar Shemesh
Wayne Davison wrote: On Sat, Mar 05, 2005 at 02:18:01PM +0200, Shachar Shemesh wrote: However, if you choose to start with the 32 bit rolling hash and mod that, you will have problems. The rolling checksum has two distinct parts, and modding will only pull info from the low order bits,

Re: Rsyncing really large files

2005-03-05 Thread Shachar Shemesh
Wayne Davison wrote: On Thu, Mar 03, 2005 at 10:18:01AM +0200, Shachar Shemesh wrote: And I'm suggesting making it static, by adjusting the hash table's size according to the number of blocks. The block-size? Definitely not! I was talking about the hash table load. I.e. - the ratio bet

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
[I accidentally sent this reply only to Lars instead of the list.] On Fri, Mar 04, 2005 at 06:44:48PM +0100, Lars Karlslund wrote: > I read somewhere about the default blocksize of 700 bytes. Now I'm > told the blocksize is calculated automatically. Which is it? 700 is the minimum block-size. Th
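
To show how the two statements fit together: 700 bytes is the floor, and above that the block size grows with the file (the square-root rule is quoted elsewhere in the thread). The helper below is a hedged sketch; the name and the exact expression are illustrative, not copied from rsync's generator code:

    #include <math.h>
    #include <stdint.h>

    #define MIN_BLOCK_SIZE 700  /* the 700-byte minimum mentioned above */

    /* Illustrative default block-size rule: roughly the rounded square
     * root of the file size, but never below 700 bytes.  A 500GB file
     * therefore gets blocks of roughly 700KB, not 700 bytes. */
    static unsigned int default_block_size(int64_t file_len)
    {
        if (file_len <= (int64_t)MIN_BLOCK_SIZE * MIN_BLOCK_SIZE)
            return MIN_BLOCK_SIZE;
        return (unsigned int)(sqrt((double)file_len) + 0.5);
    }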

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
On Thu, Mar 03, 2005 at 10:18:01AM +0200, Shachar Shemesh wrote: > And I'm suggesting making it static, by adjusting the hash table's > size according to the number of blocks. The block-size? No thanks. The smaller the block-size for a particular file size, the more checksum data needs to be gen

Re: Rsyncing really large files

2005-03-05 Thread Wayne Davison
On Sat, Mar 05, 2005 at 02:18:01PM +0200, Shachar Shemesh wrote: > That's not what I read into it. It seems to me that the checksum > function gives a 32bit result, and we are squashing that into a 16bit > hash table. Can you point me to the code? Wayne? Look in match.c. The hash table is "tag_

Re: Rsyncing really large files

2005-03-05 Thread Shachar Shemesh
Kevin Day wrote: As a quick FYI, the block size absolutely has an impact on the effectiveness of the checksum table - larger blocks means fewer blocks, which means fewer hash collisions. Since you wouldn't expect that many blocks in the file, a 32bit weak checksum would only produce about 4 or 5

Re: Rsyncing really large files

2005-03-05 Thread Shachar Shemesh
Lars Karlslund wrote: And I'm suggesting making it static, by adjusting the hash table's size according to the number of blocks. Just do "hashtablesize=(numblocks/8+1)*10;", and you should be set. Or maybe it should really be dynamic. I'm talking about the hash table load. I.e. - the ratio b
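
To make the quoted rule concrete, a minimal sketch (hypothetical helper, not a patch against rsync): sizing the table from the block count keeps the load factor near a constant instead of letting it grow with the file.

    #include <stddef.h>

    /* Sketch of the sizing rule quoted above: about 1.25 buckets per
     * block (a load factor of roughly 0.8), however many blocks the
     * file produces, instead of a fixed 2^16-entry table. */
    static size_t choose_hashtable_size(size_t numblocks)
    {
        return (numblocks / 8 + 1) * 10;
    }

For the roughly 740,000 blocks discussed earlier this gives on the order of 925,000 buckets, so the average chain length stays below one regardless of file size.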

re[2]: Rsyncing really large files

2005-03-04 Thread Kevin Day
Shachar- I agree with what you are saying about trying to identify opportunities to improve the algorithm - sorry it seemed like a troll - that was certainly not my intention. And certainly, my apologies for getting the identities mixed up! As a quick FYI, the block size absolutely ha

Re: Rsyncing really large files

2005-03-04 Thread Lars Karlslund
Hi, On tor, 2005-03-03 at 10:18 +0200, Shachar Shemesh wrote: Thanks for all the attention you are giving my problem. I'm really glad you all seem so interested in this problem. My lack of responses is the combination of having lots of work to do this last week, and also a realisation of whe

Re: Rsyncing really large files

2005-03-03 Thread Shachar Shemesh
Kevin Day wrote: Shachar- True enough - with one additional thought - if the block size is set to be the square root of the file size, then the load factor on the hash table becomes dynamic in and of itself (bigger block size = less master table entries = fewer hash collisions). And I'm sugges

re[2]: Rsyncing really large files

2005-03-01 Thread Kevin Day
Shachar- True enough - with one additional thought - if the block size is set to be the square root of the file size, then the load factor on the hash table becomes dynamic in and of itself (bigger block size = less master table entries = fewer hash collisions). In the case of relative
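
Spelling that out with a quick back-of-the-envelope derivation (assuming the square-root block-size rule and the fixed 2^16-bucket table discussed in this thread):

    B = \sqrt{L}, \qquad N = \frac{L}{B} = \sqrt{L}, \qquad
    \text{load} = \frac{N}{2^{16}} = \frac{\sqrt{L}}{65536}

So the load factor does fall as blocks get bigger, but it still grows with the square root of the file size: it passes one entry per bucket around the 4GB mark and keeps climbing for files of the size discussed here.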

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Kevin Day wrote: I would *strongly* recommend that you dig into the thesis a bit (just the section that describes the rsync algorithm itself). I tried a few weeks ago. I started to print it, and my printer ran out of ink :-). I will read it electronically eventually (I hope). Now, if you have hu

Re: Rsyncing really large files

2005-02-28 Thread Craig Barratt
Lars Karlslund writes: > Also the numbers speak for themselves, as the --whole-file option is > *way* faster than the block-copy method on our setup. At the risk of jumping into the middle of this thread without remembering everything that was discussed... Remember that by default rsync writes a

Re: Rsyncing really large files

2005-02-28 Thread Wayne Davison
On Mon, Feb 28, 2005 at 08:33:52PM +0200, Shachar Shemesh wrote: > If so, we can probably make it much much (much much much) more > efficient by using a hash table instead. That's what "tag_table" is -- it's an array of 65536 pointers into the checksum data sorted by weak checksum. The code then

re[2]: Rsyncing really large files

2005-02-28 Thread Kevin Day
Shachar, It does use a hash table. rsync adds the two components of the rolling checksum together to come up with a 16 bit hash, and performs a table lookup. The table contains an offset into the sorted rsync checksum table (which contains both weak and strong checksums), and does a line
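
A rough sketch of the lookup being described, with made-up names (the real code lives in match.c and differs in detail):

    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_SIZE 65536              /* 2^16 buckets, as described above */

    struct block_sum { uint32_t weak; size_t index; };

    /* Fold the two 16-bit halves of the weak checksum into a 16-bit tag. */
    static unsigned int gettag(uint32_t weak)
    {
        return ((weak & 0xffff) + (weak >> 16)) & 0xffff;
    }

    /* tag_table[t] holds the position of the first entry with tag t in a
     * block list sorted by tag (or -1 if no block hashes to t).  From
     * there a short linear scan compares the full weak checksum; only a
     * weak hit is then verified with the strong checksum (omitted here). */
    static long find_weak_match(uint32_t weak,
                                const struct block_sum *sorted, size_t nblocks,
                                const long tag_table[TABLE_SIZE])
    {
        unsigned int t = gettag(weak);
        long j = tag_table[t];
        if (j < 0)
            return -1;
        for (; (size_t)j < nblocks && gettag(sorted[j].weak) == t; j++) {
            if (sorted[j].weak == weak)
                return (long)sorted[j].index;
        }
        return -1;
    }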

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Wayne Davison wrote: However, you should be sure to have measured what is causing the slowdown first to know how much that will help. If it is not memory that is swapping on the sender, it may be that the computing of the checksums is maxing out your CPU, and removing the caching of the remote che

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Shachar Shemesh wrote: No, because the rsync algorithm can detect single byte moves of this 700 bytes block. I will just mention that I opened the ultimate documentation for rsync (the source), and it says that the default block size is the rounded square root of the file's size. This means that

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Lars Karlslund wrote: Maybe I didn't express myself thoroughly enough :-) Or me. Yes, a block is a minimum storage unit, which is considered for transfer. In size, yes. Not in position. But it's a fact that the rsync algorithm as it is now checks to see if a block should have moved. And in that c

Re: Rsyncing really large files

2005-02-28 Thread Lars Karlslund
On man, 2005-02-28 at 12:23 +0200, Shachar Shemesh wrote: > Also as far as I could read, the default block size is 700 bytes? What > kind of application would default to moving data around 700 bytes at a > time internally in a file? I'm not criticizing rsync, merely > questioning the funct

Re: Rsyncing really large files

2005-02-28 Thread Shachar Shemesh
Lars Karlslund wrote: Also as far as I could read, the default block size is 700 bytes? What kind of application would default to moving data around 700 bytes at a time internally in a file? I'm not criticizing rsync, merely questioning the functionality of this feature. I believe you may have m

Re: Rsyncing really large files

2005-02-28 Thread Lars Karlslund
Hi everyone, Thank you for your replies. On tor, 2005-02-24 at 17:59 +0100, Paul Slootman wrote: > It would certainly be possible to change the algorithm to not cache the > data (and thus only allow the current block to be compared), but I don't > think that idea has general enough interest

Re: Rsyncing really large files

2005-02-24 Thread Paul Slootman
On Thu 24 Feb 2005, Wayne Davison wrote: > > It would certainly be possible to change the algorithm to not cache the > data (and thus only allow the current block to be compared), but I don't > think that idea has general enough interest for me to work on for > inclusion in rsync. You might want

Re: Rsyncing really large files

2005-02-24 Thread Wayne Davison
On Thu, Feb 24, 2005 at 02:01:59PM +0100, Lars Karlslund wrote: > As I understand it, the way rsync works is: > - remote server calculates all checksum blocks and transmits these to client > - local client calculates all checksum blocks > - local client compares local blocks to remote blocks, check
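
As a self-contained toy illustration of those three steps (simple weak sums and a linear lookup only - it shows the flow, not the rolling update, hash table, or strong-checksum verification discussed elsewhere in the thread; all names are made up):

    #include <stdint.h>
    #include <stdio.h>

    #define BLK 8   /* tiny block size, just for the demo */

    /* Same style of two-part weak sum as discussed earlier in the thread. */
    static uint32_t weak_sum(const unsigned char *p, size_t n)
    {
        uint32_t s1 = 0, s2 = 0;
        for (size_t i = 0; i < n; i++) {
            s1 += p[i];
            s2 += (uint32_t)(n - i) * p[i];
        }
        return (s1 & 0xffff) | ((s2 & 0xffff) << 16);
    }

    int main(void)
    {
        const unsigned char basis[] = "The quick brown fox jumps over the lazy dog";
        const unsigned char newer[] = "The quick brown cat jumps over the lazy dog";

        /* Steps 1-2: checksum the existing copy block by block. */
        size_t nblocks = (sizeof(basis) - 1) / BLK;
        uint32_t sums[sizeof(basis) / BLK];
        for (size_t b = 0; b < nblocks; b++)
            sums[b] = weak_sum(basis + b * BLK, BLK);

        /* Step 3: scan the new file; emit "copy block N" on a match,
         * otherwise emit one literal byte and slide forward by one. */
        size_t i = 0, len = sizeof(newer) - 1;
        while (i < len) {
            long hit = -1;
            if (i + BLK <= len) {
                uint32_t w = weak_sum(newer + i, BLK);
                for (size_t b = 0; b < nblocks; b++)
                    if (sums[b] == w) { hit = (long)b; break; }
            }
            if (hit >= 0) {
                printf("COPY block %ld\n", hit);
                i += BLK;
            } else {
                printf("LITERAL '%c'\n", newer[i]);
                i++;
            }
        }
        return 0;
    }

In real rsync a weak match is only accepted after the strong checksum also agrees, and the lookup goes through the 16-bit tag table rather than a linear scan.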

Rsyncing really large files

2005-02-24 Thread Lars Karlslund
Hello everyone, I have a question regarding syncing of large files. We have a setup where we need to sync a 500GB file with changes in random blocks. The changes are made in a manner, so moving blocks is not probable (actually it will never happen). As I understand it, the way rsync works is: -