On Sun, 4 Jan 2004 16:23:17 -0800, jw schultz <[EMAIL PROTECTED]> wrote: > On Sun, Jan 04, 2004 at 05:30:03AM -0800, jw schultz wrote: >> On Sun, Jan 04, 2004 at 06:35:03AM -0600, John Van Essen wrote: >> > I've modified hlink.c to use a list of file struct pointers instead of >> > copies of the actual file structs themselves, so that will save memory. >> > I'll submit that patch for review in a day or two after I've tested it. >> >> I've just done the same. It reduces the memory requirements >> of the hlink list to 1/18th. It is also somewhat faster to >> build that way because we don't have to walk the list. >> >> If we built the hlink_list one element at a time the way we >> do the file_list only putting those files that we might link >> in it it would be smaller but building it would be slower. > > I'm noodling on the idea of purging the hlink_list of files > that don't have links.
I'm all for it. > Start by getting rid of the !IS_REG files because we don't > link those anyway. Right. > After the hlink_list is sorted hardlinked files are adjacent > to one another in the list so a single pass walk could do an > implace purge (collapse) of non-linked files. > > At one pointer per file space savings wouldn't be that large. > But for every regular file we to a binary search of the > hlink_list. 100,000 binary searches of a 200 member list > would be a lot faster than 100,000 searches of a 105,000 > member list. Indeed. > If a flag were added to the file_struct the purge pass could > set it if the file had any links so the 100,000 binary > searches of a 105,000 member list could be reduced to 200 > binary searches of a 200 member list. That's what I was thinking when I had proposed that the sender include a flag bit that indicates that a regular file has more than one node, and is thus hardlinked. But your idea works without needing that. I like it. > We do init_hard_links > prior to forking so the flag wouldn't suffer COW. > > My current version of the patch does get rid of the > non-regular files but i don't do the non-linked purge. > It is just an idea at this point. I'm not quite motivated > enough to do it yet. > > Hmm, 2.6.1 is shaping up to be a performance release. Yes. Maybe it can be released before another 11 months go by? :) Remember that there is also the issue of hardlinked files being transferred unnecessarily when the first file in the alphabetically sorted hardlink group does not exist, but a later one does that can be used as the 'master'. I've thought about this some more... We need to make rsync flexible enough to choose a different file as a master if the first one in the sorted list does not exist. Specifically, it needs to choose the first file in a group that it discovers _does_ exist on the receiver (not just the first file in the sorted list). That file becomes the master for the hardlinks (and might have to be transferred if it doesn't "match"). If it gets to processing the final file in the group, and still none of them has been found to exist on the receiver, then that final file becomes the master file and must be transferred. So we'd need these things: - a flag passed to check_hard_link() indicating whether or not the file being checked already exists on the receiver (it doesn't have to match, neccessarily) to avoid unnecessary testing within this routine. - a method to indicate that a particular file is the 'master' for a hardlink group. If a master has been found, the remaining files can be skipped (not transferred). - a method to indicate that a file has already been processed by check_hard_link (so we can determine when the last remaining file is being checked). We could use extra bits in the flag word in file_struct to indicate 'master' and 'already checked', but recent revelations about CoW optimization make me reluctant to do that. Another possibility is to use a byte array corresponding to the index array. Further comments, thoughts, and suggestions (by anyone) are invited. -- John Van Essen Univ of MN Alumnus <[EMAIL PROTECTED]> -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html