Tim Peters wrote:
[Martin MOKREJŠ]

just imagine you want to compare how many words are in the English, German,
Czech, and Polish dictionaries. You collect words from every language and record
them in a dict or Set, as you wish.

Call the set of all English words E; G, C, and P similarly.

Once you have those Sets or dicts for those 4 languages, you ask
for common words

This Python expression then gives the set of words common to all 4:

    E & G & C & P

and for those unique to Polish.

    P - E - G - C

is a reasonably efficient way to compute that.
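
A minimal sketch of the two expressions above, using tiny made-up word
lists (the sample words are illustrative assumptions, not real dictionary
data). It also checks that the method spellings `intersection` and
`difference` give the same results as the `&` and `-` operators.

```python
# Tiny made-up samples standing in for the real word sets E, G, C, P.
E = {"hotel", "taxi", "radio", "window"}
G = {"hotel", "taxi", "radio", "fenster"}
C = {"hotel", "taxi", "radio", "okno"}
P = {"hotel", "taxi", "radio", "okno", "szczur"}

# Words common to all four languages.
common = E & G & C & P

# Words found only in the Polish set.
only_polish = P - E - G - C

# The method spellings are equivalent to the operators.
assert common == E.intersection(G, C, P)
assert only_polish == P.difference(E, G, C)

print(sorted(common))
print(sorted(only_polish))
```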

Nice, is it equivalent to common / unique methods of Sets?

I have no estimates
of real-world numbers, but we might be in range of 1E6 or 1E8?
I believe in any case, huge.

No matter how large, it's utterly tiny compared to the number of
character strings that *aren't* words in any of these languages.
English has a lot of words, but nobody estimates it at over 2 million
(including scientific jargon, like names for chemical compounds):


As I've said, what I really analyze is something other than languages.
However, it can be described with the example of words in different
languages.

But nevertheless, imagine 1E8 words of size 15. That's maybe 1.5GB of raw
data. Will sets be appropriate, do you think?

My concern is actually purely scientific, not really related to analysis
of these 4 languages, but I believe it describes my intent quite well.

I wanted to be able to get a list of words NOT found in, say, Polish,
and therefore wanted to have a list of all theoretically existing words.
In principle, I can drop this idea of having an ideal, theoretical lexicon.
But I have to store those real-world dictionaries on the hard drive anyway.
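
One way to get by without the theoretical lexicon, sketched here as an
assumption rather than something proposed in the thread: take "all words"
to mean the union of the dictionaries actually collected, and subtract
the Polish set from it. The sample words are made up.

```python
# Made-up samples standing in for the real word sets.
E = {"hotel", "window"}
G = {"hotel", "fenster"}
C = {"hotel", "okno"}
P = {"hotel", "okno"}

# "All known words" means only the union of the collected sets,
# not every theoretically possible string.
all_known = E | G | C | P

# Words known from some dictionary but absent from the Polish one.
not_in_polish = all_known - P

print(sorted(not_in_polish))
```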

Real-word dictionaries shouldn't be a problem.  I recommend you store
each as a plain text file, one word per line.  Then, e.g., to convert
that into a set of words, do

    with open('EnglishWords.txt') as f:
        # strip the trailing newline so each element is the bare word
        set_of_English_words = set(line.strip() for line in f)
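
As a rough sketch of that recipe wrapped in a reusable function: the
name `load_words` and the demo file are hypothetical, and UTF-8 encoding
is an assumption about how the word lists are stored. It strips line
endings and skips blank lines so that the same word compares equal
across files.

```python
import os
import tempfile

def load_words(path):
    """Return the set of non-empty, whitespace-stripped lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Demo with a throwaway file; a real run would pass e.g. 'EnglishWords.txt'.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.txt")
    with open(path, "w", encoding="utf-8") as f:
        # Includes a duplicate, a blank line, and no final newline.
        f.write("apple\nbanana\napple\n\ncherry")
    words = load_words(path)

print(sorted(words))
```

Duplicates collapse automatically because the result is a set.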

I'm aware I can't keep set_of_English_words in memory.


