Re: Writing huge Sets() to disk

Scott David Daniels Mon, 10 Jan 2005 14:20:18 -0800

Tim Peters wrote:

[Martin MOKREJÅ]

just imagine, you want to compare how many words are in English, German,
Czech, Polish disctionary. You collect words from every language and record
them in dict or Set, as you wish.


Call the set of all English words E; G, C, and P similarly.

Once you have those Set's or dict's for those 4 languages, you ask
for common words


This Python expression then gives the set of words common to all 4:
    E & G & C & P

and for those unique to Polish.

P - E - G - C


One attack is to define a hash to get sizes smaller.  Suppose you find
your word sets are 10**8 large, and you find you need to deal with sets
of size 10**6.  Then if you define a hash that produces an integer below
100, you'll find:
    E & G & C & P == Union(E_k & G_k & C_k & P_k) for k in range(100)
    P -  E - G  - C == Union(P_k - E_k - G_k - C_k) for k in range(100)
where X_k = [v for v in X if somehash(v) == k]
This means, you can separate the calculation into much smaller buckets,
and combine the buckets back at the end (without any comparison on the
combining operations).

For calculating these two expressions, you can dump values out to a
file per hash per language, sort and dupe-elim the contents of the
various files (maybe dupe-elim on the way out).  Then hash-by-hash,
you can calculate parts of the results by combining iterators like
inboth and leftonly below on iterators producing words from files.


    def dupelim(iterable):
        source = iter(iterable)
        former = source.next()  # Raises StopIteration if empty
        yield former
        for element in source:
            if element != former:
                yield element
                former = element

    def inboth(left, right):
        '''Take ordered iterables and yield matching cases.'''
        lsource = iter(left)
        lhead = lsource.next()  # May StopIteration (and we're done)
        rsource = iter(right)
        for rhead in rsource:
            while lhead < rhead:
                lhead = lsource.next()
            if lhead == rhead:
                yield lhead
                lhead = lsource.next()

    def leftonly(left, right):
        '''Take ordered iterables and yield matching cases.'''
        lsource = iter(left)
        rsource = iter(right)
        try:
            rhead = rsource.next()
        except StopIteration: # empty right side.
            for lhead in lsource:
                yield lhead
        else:
            for lhead in lsource:
                try:
                    while rhead < lhead:
                        rhead = rsource.next()
                    if lhead < rhead:
                        yield lhead
                except StopIteration: # Ran out of right side.
                    yield lhead
                    for lhead in lsource:
                        yield lhead
                    break

Real-word dictionaries shouldn't be a problem.  I recommend you store
each as a plain text file, one word per line.  Then, e.g., to convert
that into a set of words, do

    f = open('EnglishWords.txt')
    set_of_English_words = set(f)
    f.close()

You'll have a trailing newline character in each word, but that
doesn't really matter.

Note that if you sort the word-per-line text files first, the Unix
`comm` utility can be used to perform intersection and difference on a
pair at a time with virtually no memory burden (and regardless of file
size).

In fact, once you've sorted these files, you can use the iterators
above to combine those sorted files.

For example:

    Ponly = open('polishonly.txt', 'w')
    every = open('every.txt', 'w')
    for hashcode in range(100):
        English = open('english_%s.txt' % hashcode)
        German = open('german_%s.txt' % hashcode)
        Czech = open('czech_%s.txt' % hashcode)
        Polish = open('polish_%s.txt' % hashcode)
        for unmatched in leftonly(leftonly(leftonly(dupelim(Polish),
                dupelim(English)), dupelim(German)), dupelim(Czech)):
            Ponly.write(unmatched)
        English.seek(0)
        German.seek(0)
        Czech.seek(0)
        Polish.seek(0)
        for matched in inboth(inboth(dupelim(Polish), dupelim(English)),
                              inboth(dupelim(German), dupelim(Czech))):
            every.write(matched)
        English.close()
        German.close()
        Czech.close()
        Polish.close()
    Ponly.close()
    every.close()


--Scott David Daniels
[EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list

Re: Writing huge Sets() to disk

Reply via email to