Tim Peters wrote:
[Tim Peters]

As I mentioned before, if you store keys in sorted text files,
you can do intersection and difference very efficiently just by using
the Unix `comm` utility.


[Martin MOKREJŠ]

Now I got your point. I understand that comm(1) is written in C, but it still
has to scan file1 once and file2 n times, where n is the number of lines
in file1, right? Without any index ... I'll consider it, actually I will test it,
thanks!


`comm` is much more efficient than that.  Note that the input files
have to be sorted.  Then, in a single pass over both files (not 2
passes, not 3, not n, just 1), it can compute any subset of these
three (do `man comm`):

1. The words common to both files.
2. The words unique to "the first" file.
3. The words unique to "the second" file.

I actually read the manpage before answering to the list.

It's essentially just doing a merge on two sorted lists, and how it
works should be obvious if you give it some thought.  It takes time
proportional to the sum of the lengths of the input files, and nothing
*can* be faster than that.
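
A minimal sketch of that merge in Python, for concreteness (not the code posted
elsewhere in this thread, just an illustration; it assumes both files are already
sorted, one word per line):

    def compare_sorted(path_a, path_b):
        """Single pass over two sorted word files, like comm(1):
        yields (word, 'common'), (word, 'only_a') or (word, 'only_b')."""
        fa, fb = open(path_a), open(path_b)
        a, b = fa.readline(), fb.readline()
        while a and b:
            if a == b:
                yield a.strip(), 'common'
                a, b = fa.readline(), fb.readline()
            elif a < b:
                yield a.strip(), 'only_a'
                a = fa.readline()
            else:
                yield b.strip(), 'only_b'
                b = fb.readline()
        while a:                        # leftovers in file a
            yield a.strip(), 'only_a'
            a = fa.readline()
        while b:                        # leftovers in file b
            yield b.strip(), 'only_b'
            b = fb.readline()
        fa.close()
        fb.close()

The running time is proportional to the total number of lines read, which is
the "single pass over both files" described above.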

Well, I think I understand now, because "Scott David Daniels" wrote probably
the very same logic in Python code in this thread.

I was really hoping I'd get an answer on how to alter the indexes for
dictionaries in Python.


Sorry, I don't have a guess for what that might mean.

I'm not an expert; MySQL, for example, gives you the ability to index, say, the
first 10 characters of a text column, which is typically a varchar. Just out of
curiosity, I'd like to know how to do that in Python.

When importing data from a flatfile into a MySQL table, there's an
option to delay indexing until the very last moment, when all keys are
loaded (it doesn't make sense to re-create the index after each new
row is added to the table). I believe it's exactly the same waste of cpu/io
in this case - when I create a dictionary and fill it with data,
I want to create the index afterwards, not after every key/value pair
is recorded.
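
For comparison, a sketch of the dict side of that analogy (the file name below
is made up): a Python dict is a hash table, so each key is hashed as it is
inserted, and there is no separate index-building pass that could be postponed.

    counts = {}                              # word -> number of occurrences
    with open('language_a.txt') as f:        # hypothetical flat file, one word per line
        for line in f:
            word = line.strip()
            counts[word] = counts.get(word, 0) + 1
    # Nothing left to "index": every insert above already hashed its key,
    # and a later lookup such as counts.get('zebra') is a single hash probe.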

You convinced me not to even try to construct the theoretical dictionary,
as it would take ages just to create. Even if I managed to, I couldn't
save it (neither the theoretical dict nor possibly even the
dict(theoret) - dict(real) result).


Worse, it didn't sound like a good approach even if you could save it.


Still, before I give up on the whole project, once more:

I'll parse some text files, isolate the separate words and add them to
either a Set(), list, dict, or flatfile, line by line.

Depending on the above, I'll compare them and look for those unique
to some "language". I need to keep track of the frequencies of the words used
in every such language,


Frequencies of what?  "A language" here can contain some words
multiple times each?

Yes. To compute the frequency of each word used in, say, "language A", I'll count
the number of occurrences and then compute its frequency. The frequency should be
recorded as a value in the dictionary, where the keys are unique and represent the
word itself (or its hash, as recommended already).

Once more, the dictionary will contain every word only once; it will really be
a unique key.

so the dict approach would be the best. The number stored as a value would
be a float ^H^H^H^H^H^H Decimal() type - a very small number.


Integers are the natural type for counting, and require less memory
than floats or decimals.
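
One way to reconcile the two views, sketched with made-up data: keep plain
integer counts while parsing, and derive the small per-word frequencies only
once at the very end, when the totals are known (whether those end up as
floats or Decimals is left open):

    counts = {'zebra': 3, 'ant': 40}         # toy stand-in: word -> integer count
    total = float(sum(counts.values()))
    frequencies = {}
    for word, n in counts.items():
        frequencies[word] = n / total        # tiny float, computed once at the end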

I hoped I was clear ... when I said I'll compute the frequency of those words. The
algorithm to compute it will be subject to change during development. ;)


Once more, I expect to have between E4 or E5 and E8??? words
stored in 20 dictionaries (remember the words of sizes in the range 1-20?).
Each of those 20 dictionaries should be, I believe, indexed just once.
The indexing method should know that all entries in a given file are of the same
size, i.e. 5 chars, 15 chars, 20 chars etc.
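
A sketch of that 20-dictionaries-by-word-length bookkeeping, with made-up data
(whether splitting by length actually helps is a separate question):

    counts = {'ant': 40, 'zebra': 3, 'hippopotamus': 1}   # word -> count, one language
    by_length = {}                                        # word length -> {word: count}
    for word, n in counts.items():
        if 1 <= len(word) <= 20:
            by_length.setdefault(len(word), {})[word] = n
    # by_length[5] now holds only the 5-character words, and so on.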


I think you're making this more complicated than it needs to be.

I hope the algorithm can save some logic. For example, it can turn off locking
support, index only part of the key, etc.



I already did implement the logic to walk through those 20 different
dictionaries from language a and from language b and find out those
unique to a or common to both of them. Why I went to ask on this list
was to make sure I took the right approach. Sets seemed to be the better solution
for the simple comparison (common, unique). To keep track of those
very small frequencies, I have to have dictionaries anyway. I say
it again: how can I improve the speed of dictionary indexing?
It doesn't matter here if I have 10E4 keys in a dictionary or
10E8 keys in a dictionary.


What reason do you have for believing that the speed of dictionary
indexing is worth any bother at all to speed up?  Dict lookup is
generally considered to be extremely fast already.  If you're just
*guessing* that indexing is the bottleneck, you're probably guessing
wrong -- run a profiler to find out where the real bottleneck is.
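
A minimal way to check that, assuming the parsing-and-counting work is wrapped
in a function (build_tables below is a made-up name):

    import cProfile
    import pstats

    cProfile.run('build_tables()', 'profile.out')    # build_tables is hypothetical
    pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)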

For example, looking up a key and its value only once in the whole for loop tells
me I don't need an index. Yes, I'll do this 4 times for those 4 languages, but I
still think it's faster to live without an index when I can sort the records. I
think I like the sorted text file approach, and the dictionary approach without an
index would be almost the same, especially if I manage to tell the db layer not to
move the cursor randomly but just to walk down the pre-sorted data.

As an extra, I could record those frequencies within a dictionary.
Actually, I need them; that's why I do all this work.

The idea of using Sets() was just to find the unique/common words more quickly
because of the possible speed gain. But I don't see a way to solve
this without dictionaries.
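
For what it's worth, a sketch of how the two needs can be combined, with made-up
names: set() accepts a dictionary and uses only its keys, so the common/unique
comparisons can be done with set arithmetic while the counts stay in the dicts:

    counts_a = {'zebra': 3, 'ant': 40}       # word -> count for language a (made up)
    counts_b = {'ant': 7, 'bee': 2}          # word -> count for language b (made up)

    common = set(counts_a) & set(counts_b)   # words in both languages
    only_a = set(counts_a) - set(counts_b)   # words unique to language a
    only_b = set(counts_b) - set(counts_a)   # words unique to language b

    for word in only_a:
        print(word, counts_a[word])          # the count is still right there in the dict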


