Dear Scott, thank you for your excellent email. I also thought about using some zip() function to compress the strings first before using them as keys in a dict.
But I don't think I can use one-way hashes, as I need to reconstruct the string later. I will have to study the proposed code carefully to understand what it really does.
Scott David Daniels wrote:
Tim Peters wrote:
[Martin MOKREJŠ]
Just imagine you want to compare how many words are in the English, German,
Czech, and Polish dictionaries. You collect words from every language and record
them in a dict or Set, as you wish.
Call the set of all English words E; G, C, and P similarly.
Once you have those sets or dicts for those 4 languages, you ask for common words
This Python expression then gives the set of words common to all 4: E & G & C & P
and for those unique to Polish.
P - E - G - C
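With tiny invented word sets, both expressions can be tried directly (a sketch; the words and set contents are made up, and real sets would hold millions of words):

```python
# Toy stand-ins for the four language sets (invented words).
E = {'house', 'sport', 'film'}           # English
G = {'haus', 'sport', 'film'}            # German
C = {'dum', 'sport', 'film'}             # Czech
P = {'dom', 'sport', 'film', 'pies'}     # Polish

common = E & G & C & P        # words appearing in all four languages
only_polish = P - E - G - C   # Polish words appearing in none of the others

print(common)        # {'sport', 'film'}
print(only_polish)   # {'dom', 'pies'}
```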
One attack is to define a hash to get sizes smaller. Suppose you find your word sets are 10**8 large, and you find you need to deal with sets of size 10**6. Then if you define a hash that produces an integer below 100, you'll find:

    E & G & C & P == Union(E_k & G_k & C_k & P_k) for k in range(100)
    P - E - G - C == Union(P_k - E_k - G_k - C_k) for k in range(100)

where X_k = [v for v in X if somehash(v) == k]
I don't understand what the result of the hashing function would be here. :( Can you clarify this in more detail, please? Is P the largest dictionary in this example, or was it shifted by one to the right?
This means, you can separate the calculation into much smaller buckets, and combine the buckets back at the end (without any comparison on the combining operations).
Yes.
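To make the bucketing concrete, here is a small sketch (the words, the toy hash function, and the use of 4 buckets instead of the 100 above are all invented for illustration):

```python
# Partition two word sets into buckets by a deterministic hash, intersect
# bucket-by-bucket, and union the results. The only requirement on the
# hash is that the same word always lands in the same bucket number.
def somehash(word, nbuckets=4):
    return sum(ord(c) for c in word) % nbuckets

def buckets(words, nbuckets=4):
    parts = [set() for _ in range(nbuckets)]
    for w in words:
        parts[somehash(w, nbuckets)].add(w)
    return parts

E = {'house', 'dog', 'cat', 'sport'}
P = {'dom', 'kot', 'pies', 'sport', 'house'}

E_parts, P_parts = buckets(E), buckets(P)

# Per-bucket intersections, combined with a plain union at the end.
common = set().union(*(e & p for e, p in zip(E_parts, P_parts)))
assert common == E & P   # same answer, but each '&' touched a small bucket
```

Each bucket is roughly 1/nbuckets of the original size, so each `&` works on a set small enough to fit in memory.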
For calculating these two expressions, you can dump values out to a file per hash per language, sort and dupe-elim the contents of the various files (maybe dupe-elim on the way out). Then hash-by-hash, you can calculate parts of the results by combining iterators like inboth and leftonly below on iterators producing words from files.
def dupelim(iterable):
    source = iter(iterable)
    former = source.next()   # Raises StopIteration if empty
    yield former
    for element in source:
        if element != former:
            yield element
            former = element
def inboth(left, right):
    '''Take ordered iterables and yield matching cases.'''
    lsource = iter(left)
    lhead = lsource.next()   # May StopIteration (and we're done)
    rsource = iter(right)
    for rhead in rsource:
        while lhead < rhead:
            lhead = lsource.next()
        if lhead == rhead:
            yield lhead
            lhead = lsource.next()
def leftonly(left, right):
    '''Take ordered iterables and yield cases found only in left.'''
    lsource = iter(left)
    rsource = iter(right)
    try:
        rhead = rsource.next()
    except StopIteration:   # empty right side.
        for lhead in lsource:
            yield lhead
    else:
        for lhead in lsource:
            try:
                while rhead < lhead:
                    rhead = rsource.next()
                if lhead < rhead:
                    yield lhead
            except StopIteration:   # Ran out of right side.
                yield lhead
                for lhead in lsource:
                    yield lhead
                break
Aaaah, so the functions just walk one line at a time in the left file and one in the right; if the values don't match, the value on the left is unique, so they advance one more line on the left and check whether it now matches the value at the current position in the right file, and so on until they find the same value in the right file?
So it doesn't scan the file on the right n times, but only once? Yes, a nice thing. Then I really don't need an index, and I finally believe that flat files will do just fine.
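Exactly: each input is consumed strictly left to right, so each file is read once. For what it's worth, the walk can be tried on tiny lists; here is a sketch with the generators respelled for modern Python 3 (where `it.next()` became `next(it)`, StopIteration must be handled explicitly inside generators, and `leftonly` is simplified slightly):

```python
def dupelim(iterable):
    '''Yield each run of equal adjacent values once (input assumed sorted).'''
    source = iter(iterable)
    try:
        former = next(source)
    except StopIteration:
        return               # empty input
    yield former
    for element in source:
        if element != former:
            yield element
            former = element

def inboth(left, right):
    '''Take sorted, dupe-free iterables and yield values present in both.'''
    lsource = iter(left)
    try:
        lhead = next(lsource)
        for rhead in right:
            while lhead < rhead:
                lhead = next(lsource)
            if lhead == rhead:
                yield lhead
                lhead = next(lsource)
    except StopIteration:
        return               # left side exhausted; nothing more can match

def leftonly(left, right):
    '''Take sorted, dupe-free iterables and yield values only in left.'''
    rsource = iter(right)
    rhead = next(rsource, None)
    for lhead in left:
        while rhead is not None and rhead < lhead:
            rhead = next(rsource, None)
        if rhead is None or lhead < rhead:
            yield lhead

# Sorted word lists standing in for sorted files (invented words).
P = ['dom', 'house', 'kot', 'pies', 'pies', 'sport']   # note the dupe
E = ['cat', 'dog', 'house', 'sport']
print(list(inboth(dupelim(P), dupelim(E))))     # ['house', 'sport']
print(list(leftonly(dupelim(P), dupelim(E))))   # ['dom', 'kot', 'pies']
```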
Real-world dictionaries shouldn't be a problem. I recommend you store each as a plain text file, one word per line. Then, e.g., to convert that into a set of words, do
f = open('EnglishWords.txt')
set_of_English_words = set(f)
f.close()
You'll have a trailing newline character in each word, but that doesn't really matter.
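If the newlines ever do get in the way, they are cheap to strip on the way in. A sketch, using an in-memory stand-in for the (hypothetical) word file:

```python
import io

# io.StringIO stands in here for open('EnglishWords.txt').
f = io.StringIO('cat\ndog\nhouse\n')
set_of_English_words = set(line.strip() for line in f)
f.close()

print('dog' in set_of_English_words)   # True -- no trailing '\n' this time
```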
Note that if you sort the word-per-line text files first, the Unix `comm` utility can be used to perform intersection and difference on a pair at a time with virtually no memory burden (and regardless of file size).
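For example (a sketch; the filenames and words are invented): `comm -12` suppresses columns 1 and 2 of comm's output, leaving only the lines common to both files, while `comm -23` leaves only the lines unique to the first file.

```shell
# Two small word lists standing in for real dictionaries.
printf 'dog\ncat\nhouse\n'       > english.txt
printf 'dom\nkot\npies\nhouse\n' > polish.txt

# comm requires sorted input.
sort english.txt -o english.sorted
sort polish.txt  -o polish.sorted

comm -12 english.sorted polish.sorted   # words in both lists
comm -23 polish.sorted english.sorted   # words only in the Polish list
```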
In fact, once you've sorted these files, you can use the iterators above to combine those sorted files.
For example:
Ponly = open('polishonly.txt', 'w')
every = open('every.txt', 'w')
for hashcode in range(100):
    English = open('english_%s.txt' % hashcode)
The 'english_%s.txt' % hashcode part -- is this some kind of eval?
    German = open('german_%s.txt' % hashcode)
    Czech = open('czech_%s.txt' % hashcode)
    Polish = open('polish_%s.txt' % hashcode)
    for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
                                                dupelim(English)),
                                       dupelim(German)),
                              dupelim(Czech)):
        Ponly.write(unmatched)
    English.seek(0)
    German.seek(0)
    Czech.seek(0)
    Polish.seek(0)
    for matched in inboth(inboth(dupelim(Polish), dupelim(English)),
                          inboth(dupelim(German), dupelim(Czech))):
        every.write(matched)
    English.close()
    German.close()
    Czech.close()
    Polish.close()
Ponly.close()
every.close()
--Scott David Daniels [EMAIL PROTECTED]
-- http://mail.python.org/mailman/listinfo/python-list