Oliver,

You are making a good point, but you are connecting two different subjects.
The awk algorithm is still the fastest in theory.  It uses a hash
table, which gives O(1) per lookup, ASSUMING you have a good hash
function that actually yields that behaviour.  The total number of
comparisons is then O(n), or O(C1 * n) where C1 is a constant that
depends on how well the hash function performs.  A comparison sort,
on the other hand, needs O(n * log(n)) comparisons, and the second
pass that compares the two sorted lists is O(n), so the total is
still O(n * log(n)), or O(C2 * n * log(n)) where C2 is a constant
that depends on the algorithm and its implementation.

That is the theory.  In practice it may not hold, because C1 may be
so large that the sort-based solution wins, or because n is small
enough that it does not matter which solution you take.  In this
case the awk implementation was poor, so C1 turned out large and
awk ended up slower than sort(1).  That is why experiments are
still important: they let you determine the best algorithm in your
TEST environment.  This is the point you are making.  If you have
control over these conditions, you do not have to choose the
theoretically best algorithm.  I agree, and I had in fact expected
your outcome, since awk implementations are known to be inefficient.

From that practical point of view, the original implementation by
cperciva@ was good enough that searching for a faster algorithm was
not necessary for anyone except Bert.  In the end, it is up to the
implementer to pick the solution.

A few rough sketches following up on points from your mail are
appended after the quoted text below.

Regards,
Hiro

On Fri, 23 Jan 2009 12:09:03 +0100 (CET)
Oliver Fromme <o...@lurza.secnetix.de> wrote:

> 
> Yoshihiro Ota wrote:
> > Oliver Fromme wrote:
> > > It would be much better to generate two lists:
> > > - The list of hashes, as already done ("filelist")
> > > - A list of gzipped files present, stripped to the hash:
> > > 
> > > (cd files; echo *.gz) |
> > > tr ' ' '\n' |
> > > sed 's/\.gz$//' > filespresent
> > > 
> > > Note we use "echo" instead of "ls", in order to avoid the
> > > kern.argmax limit.  64000 files would certainly exceed that
> > > limit.  Also note that the output is already sorted because
> > > the shell sorts wildcard expansions.
> > > 
> > > Now that we have those two files, we can use comm(1) to
> > > find out whether there are any hashes in filelist that are
> > > not in filespresent:
> > > 
> > > if [ -n "$(comm -23 filelist filespresent)" ]; then
> > >         echo -n "Update files missing -- "
> > >         ...
> > > fi
> > > 
> > > That solution scales much better because no shell loop is
> > > required at all.
> > 
> > This will probably be the fastest.
> 
> Are you sure?  I'm not.
> Only a benchmark can answer that.  See below.
> 
> > awk -F "|" '
> >     $2 ~ /^f/{required[$7]=$7; count++}
> >     END{FS="[/.]";
> >         while("find files -name *.gz" | getline>0)
> >             if($2 in required)
> >                 if(--count<=0)
> >                     exit(0);
> >         exit(count)}' "$@"
> 
> I think this awk solution is more difficult to read and
> understand, which means that it is also more prone to
> introduce errors.  In fact, there are several problems
> already:
> 
> First, you didn't quote the *.gz wildcard, so it will
> fail if there happens to be a file "*.gz" in the current
> directory.
> 
> Second, the script makes assumptions about the format of
> the hash strings, e.g. that they don't contain dots.
> (Currently they only contain hex digits, AFAICT, but what
> if the format changes in the future.)
> 
> Third, exit(count) is a bad idea, because exit codes are
> truncated to 8 bits.  If 256 files happen to be missing,
> the script will seem to exit successfully.
> All these flaws could be fixed, of course, but it will
> introduce more complexity.  The shell code using sort(1)
> and comm(1) doesn't have those flaws and appears more
> robust.
> 
> > It verifies entries using hashtable instead of sort.
> > Therefore, it is O(n+m) instead of O(n*log(n))+O(m*log(m)) in theory.
> > I am not well aware how good awk's associate array is, though.
> 
> It is pretty simple.  It's a hash list that starts with
> 50 buckets.  The number of buckets is doubled (and all
> entries re-hashed!) when the average number of elements
> per bucket exceeds 2.  The hash function is very simple,
> it's just "hash = hash * 31 + c" for every character c
> in the string, the result is modulo the current number
> of buckets.  Note that freebsd-update uses SHA256 hashes
> which are fairly long (64 characters).
> 
> BTW, you can easily make benchmarks.  The following command
> will create 64000 files of the format "%64x.gz".  Be sure
> to have enough free inodes on your file system ("df -i").
> 
> jot -rw "%08x" 64000 0 2000000000 | sed 's/.*/&&&&&&&&.gz/' | xargs touch
> 
> This took 2 minutes on my notebook (3 years old).  I also
> noticed that the first 47000 inodes were created very
> quickly (about 5 seconds), but then the speed dropped down
> suddenly to about 150 inodes per second for the rest of
> the two minutes.  CPU was 100% system according to top.
> Removing the files is equally slow.
> 
> So there seems to be a limit of about 47000 inodes per
> directory -- any more than that and it gets horribly
> inefficient.  Therefore it would probably be a good idea
> to change freebsd-update to create subdirectories and
> distribute the files among them.  For example, make 16
> subdirectories [0-9a-f] and put the files into them
> according to the first digit of the hash.  This will
> probably boost performance noticeably.
> 
> Sorting those 64000 files is really *not* an issue.  The
> difference between "ls" and "ls -f" is only 0.2 seconds on
> my notebook.  When using "ls -f | sort", sort(1) uses only
> 0.12 seconds of CPU time.  This is negligible.
> 
> Next I made a simple test with awk within that directory:
> 
> awk 'BEGIN{while("find . -name \\*.gz" | getline>0)required[$1]=$1}'
> 
> That script (which does only half of the required work)
> takes 4 seconds, reproducible.  So I didn't bother to
> write and test the full awk solution.
> 
> Bottom line:  The awk solution is less robust, less readable,
> and much slower.
> 
> Best regards
>    Oliver
> 
> -- 
> Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
> Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
> secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
> chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart
> 
> FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd
> 
> "The most important decision in [programming] language design
>  concerns what is to be left out."  --  Niklaus Wirth
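For the record, the flaws you list in the awk one-liner could be
addressed roughly as in the untested sketch below.  The field layout
($2 as the entry type, $7 as the hash) and the "files" directory are
the same assumptions as in the original script; whether this would
actually be faster is exactly the kind of question only a benchmark
can answer.

  awk -F "|" '
      # remember every required hash; guard against duplicate index lines
      $2 ~ /^f/ && !($7 in required) { required[$7] = 1; count++ }
      END {
          # \* keeps the shell that runs the command from expanding the glob
          cmd = "find files -name \\*.gz"
          while ((cmd | getline path) > 0) {
              sub(/^.*\//, "", path)    # strip the directory part
              sub(/\.gz$/, "", path)    # strip only the trailing ".gz"
              if (path in required) {
                  delete required[path]
                  if (--count <= 0)
                      exit 0            # everything found
              }
          }
          exit(count ? 1 : 0)           # plain yes/no, no 8-bit truncation
      }' "$@"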
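Your jot test can also be turned into a small, repeatable benchmark
of just the set-difference step.  The sketch below is only an
illustration of how I would measure it; the directory and file names
(testdir, filelist, filespresent, missing.*) are made up, and it
deliberately times only the comparison, not the directory scan.

  #!/bin/sh
  # Rough benchmark sketch (illustration only): create dummy .gz files,
  # then time the comm(1) step against a single awk hash-table pass.

  mkdir -p testdir
  cd testdir || exit 1

  # Create 64000 dummy files, as in your test (check "df -i" first).
  jot -rw "%08x" 64000 0 2000000000 | sed 's/.*/&&&&&&&&.gz/' | xargs touch

  # The list of "required" hashes; here simply every file just created.
  ls -f | sed -n 's/\.gz$//p' | sort > ../filelist

  # The list of hashes actually present (shell wildcard order is sorted).
  (echo *.gz) | tr ' ' '\n' | sed 's/\.gz$//' > ../filespresent

  # Candidate 1: comm(1) on the two sorted lists.
  time comm -23 ../filelist ../filespresent > ../missing.comm

  # Candidate 2: one awk pass using an associative array, no sorting needed.
  time awk '
      NR == FNR { required[$1] = 1; next }   # first file: required hashes
      $1 in required { delete required[$1] } # second file: hashes present
      END { for (h in required) print h }    # whatever remains is missing
  ' ../filelist ../filespresent > ../missing.awk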
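Your subdirectory idea also looks easy to prototype.  Here is a rough
sketch of a one-time migration; "files" is just an assumed name for
the flat directory, and this is of course not a tested patch for
freebsd-update.

  #!/bin/sh
  # Sketch: spread hash-named files over 16 subdirectories 0-9a-f, keyed
  # on the first hex digit, so no single directory comes near the
  # ~47000-files-per-directory knee observed above.

  cd files || exit 1

  for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
      mkdir -p "$d"
      # The glob is expanded by the shell itself, so kern.argmax is not
      # an issue; one mv per file is slow but fine for a one-time move.
      for f in "$d"*.gz; do
          [ -e "$f" ] && mv "$f" "$d/"
      done
  done

  # A later lookup for hash $h would then use files/<first digit>/$h.gz.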