Dear Scott, thank you for your excellent email. I also thought about using some zip() function to compress the strings first before using them as keys in a dict.
But I don't think I can use one-way hashes, as I need to reconstruct the string later. I will have to study the proposed code carefully to understand what it really does.
Scott David Daniels wrote:
Tim Peters wrote:
[Martin MOKREJŠ]
Just imagine you want to compare how many words are in the English, German,
Czech, and Polish dictionaries. You collect words from every language and record
them in a dict or Set, as you wish.
Call the set of all English words E; G, C, and P similarly.
Once you have those sets or dicts for those 4 languages, you ask for common words
This Python expression then gives the set of words common to all 4: E & G & C & P
and for those unique to Polish.
P - E - G - C
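With tiny invented word sets, both expressions can be tried directly (a sketch; the words and set contents are made up, and real sets would hold millions of words):

```python
# Toy stand-ins for the four language sets (invented words).
E = {'house', 'sport', 'film'}           # English
G = {'haus', 'sport', 'film'}            # German
C = {'dum', 'sport', 'film'}             # Czech
P = {'dom', 'sport', 'film', 'pies'}     # Polish

common = E & G & C & P        # words appearing in all four languages
only_polish = P - E - G - C   # Polish words appearing in none of the others

print(common)        # {'sport', 'film'}
print(only_polish)   # {'dom', 'pies'}
```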
One attack is to define a hash to get sizes smaller. Suppose you find your word sets are 10**8 large, and you find you need to deal with sets of size 10**6. Then if you define a hash that produces an integer below 100, you'll find:

    E & G & C & P == Union(E_k & G_k & C_k & P_k) for k in range(100)
    P - E - G - C == Union(P_k - E_k - G_k - C_k) for k in range(100)

where X_k = [v for v in X if somehash(v) == k]
I don't understand what the result of the hashing function would be here. :( Can you clarify this in more detail, please? Is P the largest dictionary in this example, or was it shifted by one to the right?
This means, you can separate the calculation into much smaller buckets, and combine the buckets back at the end (without any comparison on the combining operations).
Yes.
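To make the bucketing concrete, here is a small sketch (the words, the toy hash function, and the use of 4 buckets instead of the 100 above are all invented for illustration):

```python
# Partition two word sets into buckets by a deterministic hash, intersect
# bucket-by-bucket, and union the results. The only requirement on the
# hash is that the same word always lands in the same bucket number.
def somehash(word, nbuckets=4):
    return sum(ord(c) for c in word) % nbuckets

def buckets(words, nbuckets=4):
    parts = [set() for _ in range(nbuckets)]
    for w in words:
        parts[somehash(w, nbuckets)].add(w)
    return parts

E = {'house', 'dog', 'cat', 'sport'}
P = {'dom', 'kot', 'pies', 'sport', 'house'}

E_parts, P_parts = buckets(E), buckets(P)

# Per-bucket intersections, combined with a plain union at the end.
common = set().union(*(e & p for e, p in zip(E_parts, P_parts)))
assert common == E & P   # same answer, but each '&' touched a small bucket
```

Each bucket is roughly 1/nbuckets of the original size, so each `&` works on a set small enough to fit in memory.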
For calculating these two expressions, you can dump values out to a file per hash per language, sort and dupe-elim the contents of the various files (maybe dupe-elim on the way out). Then hash-by-hash, you can calculate parts of the results by combining iterators like inboth and leftonly below on iterators producing words from files.
def dupelim(iterable):
    source = iter(iterable)
    former = source.next()   # Raises StopIteration if empty
    yield former
    for element in source:
        if element != former:
            yield element
            former = element
def inboth(left, right):
    '''Take ordered iterables and yield matching cases.'''
    lsource = iter(left)
    lhead = lsource.next()   # May StopIteration (and we're done)
    rsource = iter(right)
    for rhead in rsource:
        while lhead < rhead:
            lhead = lsource.next()
        if lhead == rhead:
            yield lhead
            lhead = lsource.next()
def leftonly(left, right):
    '''Take ordered iterables and yield cases found only in left.'''
    lsource = iter(left)
    rsource = iter(right)
    try:
        rhead = rsource.next()
    except StopIteration:   # empty right side.
        for lhead in lsource:
            yield lhead
    else:
        for lhead in lsource:
            try:
                while rhead < lhead:
                    rhead = rsource.next()
                if lhead < rhead:
                    yield lhead
            except StopIteration:   # Ran out of right side.
                yield lhead
                for lhead in lsource:
                    yield lhead
                break
Aaaah, so the functions just walk one line at a time in the left file and one in the right; if the values don't match, the value on the left is unique, so they advance one more line on the left and check whether it now matches the value at the current position in the right file, and so on until they find the same value in the right file?
So it doesn't scan the file on the right n times, but only once? Yes, a nice thing. Then I really don't need an index, and I finally believe that flat files will do just fine.
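Exactly: each input is consumed strictly left to right, so each file is read once. For what it's worth, the walk can be tried on tiny lists; here is a sketch with the generators respelled for modern Python 3 (where `it.next()` became `next(it)`, StopIteration must be handled explicitly inside generators, and `leftonly` is simplified slightly):

```python
def dupelim(iterable):
    '''Yield each run of equal adjacent values once (input assumed sorted).'''
    source = iter(iterable)
    try:
        former = next(source)
    except StopIteration:
        return               # empty input
    yield former
    for element in source:
        if element != former:
            yield element
            former = element

def inboth(left, right):
    '''Take sorted, dupe-free iterables and yield values present in both.'''
    lsource = iter(left)
    try:
        lhead = next(lsource)
        for rhead in right:
            while lhead < rhead:
                lhead = next(lsource)
            if lhead == rhead:
                yield lhead
                lhead = next(lsource)
    except StopIteration:
        return               # left side exhausted; nothing more can match

def leftonly(left, right):
    '''Take sorted, dupe-free iterables and yield values only in left.'''
    rsource = iter(right)
    rhead = next(rsource, None)
    for lhead in left:
        while rhead is not None and rhead < lhead:
            rhead = next(rsource, None)
        if rhead is None or lhead < rhead:
            yield lhead

# Sorted word lists standing in for sorted files (invented words).
P = ['dom', 'house', 'kot', 'pies', 'pies', 'sport']   # note the dupe
E = ['cat', 'dog', 'house', 'sport']
print(list(inboth(dupelim(P), dupelim(E))))     # ['house', 'sport']
print(list(leftonly(dupelim(P), dupelim(E))))   # ['dom', 'kot', 'pies']
```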
Real-world dictionaries shouldn't be a problem. I recommend you store each as a plain text file, one word per line. Then, e.g., to convert that into a set of words, do
f = open('EnglishWords.txt')
set_of_English_words = set(f)
f.close()
You'll have a trailing newline character in each word, but that doesn't really matter.
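If the newlines ever do get in the way, they are cheap to strip on the way in. A sketch, using an in-memory stand-in for the (hypothetical) word file:

```python
import io

# io.StringIO stands in here for open('EnglishWords.txt').
f = io.StringIO('cat\ndog\nhouse\n')
set_of_English_words = set(line.strip() for line in f)
f.close()

print('dog' in set_of_English_words)   # True -- no trailing '\n' this time
```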
Note that if you sort the word-per-line text files first, the Unix `comm` utility can be used to perform intersection and difference on a pair at a time with virtually no memory burden (and regardless of file size).
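For example (a sketch; the filenames and words are invented): `comm -12` suppresses columns 1 and 2 of comm's output, leaving only the lines common to both files, while `comm -23` leaves only the lines unique to the first file.

```shell
# Two small word lists standing in for real dictionaries.
printf 'dog\ncat\nhouse\n'       > english.txt
printf 'dom\nkot\npies\nhouse\n' > polish.txt

# comm requires sorted input.
sort english.txt -o english.sorted
sort polish.txt  -o polish.sorted

comm -12 english.sorted polish.sorted   # words in both lists
comm -23 polish.sorted english.sorted   # words only in the Polish list
```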
In fact, once you've sorted these files, you can use the iterators above to combine those sorted files.
For example:
Ponly = open('polishonly.txt', 'w')
every = open('every.txt', 'w')
for hashcode in range(100):
    English = open('english_%s.txt' % hashcode)
The 'english_%s.txt' % hashcode part -- is this some kind of eval?
    German = open('german_%s.txt' % hashcode)
    Czech = open('czech_%s.txt' % hashcode)
    Polish = open('polish_%s.txt' % hashcode)
    for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
                                                dupelim(English)),
                                       dupelim(German)),
                              dupelim(Czech)):
        Ponly.write(unmatched)
    English.seek(0)
    German.seek(0)
    Czech.seek(0)
    Polish.seek(0)
    for matched in inboth(inboth(dupelim(Polish), dupelim(English)),
                          inboth(dupelim(German), dupelim(Czech))):
        every.write(matched)
    English.close()
    German.close()
    Czech.close()
    Polish.close()
Ponly.close()
every.close()
--Scott David Daniels [EMAIL PROTECTED]
-- http://mail.python.org/mailman/listinfo/python-list