Hi, I'm using a bash script (bash 4.1) to parse Debian package indices (it's essentially a mirroring script). The script stores all package fields of every package in one big hash table (associative array).
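For context, the storage pattern looks roughly like this (a minimal sketch, not my actual script; the key scheme "<package>,<field>", the array name PKG and the input filename are just illustrative):

  #!/bin/bash
  # Sketch: read an RFC-822-style "Sources" stanza stream and store every
  # field of every package into one associative array, keyed "<package>,<field>".
  declare -A PKG
  pkg=''
  while IFS= read -r line; do
      if [[ -z $line ]]; then                  # blank line ends a package stanza
          pkg=''
          continue
      fi
      [[ $line == [[:space:]]* ]] && continue  # ignore continuation lines here
      if [[ $line == Package:* ]]; then
          pkg=${line#Package: }
      fi
      if [[ -n $pkg && $line == *:* ]]; then
          field=${line%%:*}
          value=${line#*:}
          PKG["$pkg,$field"]=${value# }        # one hash element per package field
      fi
  done < "${1:-Sources}"
  echo "stored ${#PKG[@]} field entries"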
That's about 300,000 hash array elements for Debian sid's "Sources" index file. Unfortunately, performance doesn't scale at all: parsing that Sources file takes about 2m30s, and it takes exactly the same time after I completely rewrote the parser to use a sed preprocessing script.

Looking at the hashlib.c source code, it's somewhat apparent why it's so slow: bash's hash tables implement open hashing with only 64 hash chains. So performance for search and insert is linear in the number N of stored elements as soon as N is much larger than 64. Insertion in particular is extremely slow, as it has to scan the full chain from start to end.

Are there any plans to change that? It would also be helpful if the man page stated that bash arrays have O(N) performance for large N. Had I known that, I wouldn't have relied on bash at all for this kind of script.
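For what it's worth, the behaviour is easy to reproduce without any parsing at all. Here is a minimal timing sketch (batch size and key names are arbitrary); if each insert scans a hash chain that grows with N, successive batches of the same size should take progressively longer:

  #!/bin/bash
  # Insert keys into an associative array in equal-sized batches and print
  # the wall-clock seconds per batch (SECONDS has whole-second granularity).
  declare -A table
  batch=50000
  for round in 1 2 3 4 5; do
      start=$SECONDS
      for ((i = 0; i < batch; i++)); do
          table["key_${round}_${i}"]=x
      done
      echo "batch $round (${#table[@]} elements total): $(( SECONDS - start ))s"
  done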
cheers, David

--
GnuPG public key: http://user.cs.tu-berlin.de/~dvdkhlng/dk.gpg
Fingerprint: B17A DC95 D293 657B 4205 D016 7DEF 5323 C174 7D40