Martin MOKREJŠ wrote:

Tim Peters wrote:

...

I was really hoping I'd get an answer on how to alter the indexes for dictionaries
in Python.



Sorry, I don't have a guess for what that might mean.


I'm not an expert; MySQL, for example, gives you the ability to index, say, the
first 10 characters of a text column, which is typically a varchar.
Just out of curiosity, I'd like to know how to do it in Python.

When importing data from a flatfile into a MySQL table, there's an
option to delay indexing until the very last moment, when all keys are
loaded (it doesn't make sense to re-create the index after each new
row is added to the table). I believe it's exactly the same waste of cpu/io
in this case - when I create a dictionary and fill it with data,
I want to create the index afterward, not after every key/value pair
is recorded.

Okay, you seem to be missing this key idea:

   A hash-table (dict) is approximately the same level of abstraction
   as a btree index.

Your MySQL "index" is likely implemented as a btree. A hash table could just as readily be used to implement the index. When you insert into either of these structures (btree or hash), you are not creating an "index" separate from the structure; the structure *is* an "index" of the type you are thinking about. Both structures are efficient representations that map from one data-value to another. Hashes with a good hash function (such as Python's dicts) are basically O(1) (worst case O(n), as Tim notes), while btrees (such as common database indices) are O(log n) (or something of that sort; basically, lookup cost grows much more slowly than n).
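To make the point concrete, here is a minimal sketch (with made-up words) contrasting the two shapes of "index": a dict lookup is a single hash probe, while a sorted list searched with the standard `bisect` module approximates the O(log n) ordered access of a btree:

```python
import bisect

# A dict *is* the index: inserting a key/value pair updates the
# hash table in place; there is no separate "create index" step.
words = {}
for w in ["apple", "banana", "cherry"]:
    words[w] = len(w)          # amortised O(1) per insert
print(words["banana"])         # average-case O(1) lookup -> 6

# A sorted list searched with bisect behaves like a flattened btree:
# each lookup costs O(log n) comparisons instead of O(1).
keys = sorted(words)           # ['apple', 'banana', 'cherry']
i = bisect.bisect_left(keys, "banana")
print(keys[i])                 # binary search finds the same key -> banana
```

Either way, every insert maintains the structure as it goes; there is nothing left to "index afterward".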

Once more, I expect to have between E4 or E5 to E8??? words
stored in 20 dictionaries (remember, words of sizes in the range 1-20?).
Each of those 20 dictionaries should, I believe, be indexed just once.
The indexing method should know that all entries in a given file are of the
same size, i.e. 5 chars, 15 chars, 20 chars, etc.



I think you're making this more complicated than it needs to be.


I hope the algorithm can save some logic. For example, it can turn off locking support,
index only part of the key, etc.

I'd tend to agree with Tim. You're making this all far too complex in what appears to be the absence of any real need. There's a maxim in computer programming that you avoid, wherever possible, what is called "premature optimisation". You are here trying to optimise away a bottleneck that doesn't exist (indexing overhead and locking support are basically nil for a dictionary). It is almost a certainty that these are not *real* bottlenecks in your code (what with not existing), so your time would be better spent writing a "dumb" working version of the code and profiling it to see where the *real* bottlenecks are.
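As a hedged illustration of "measure before optimising": the cost of filling a dict can be timed directly with the standard `timeit` module (the word list here is synthetic, just for the measurement):

```python
import timeit

# Time 100,000 dict inserts end to end; the setup builds a synthetic
# word list so that only the insertion loop itself is measured.
setup = "words = ['w%d' % i for i in range(100000)]"
stmt = """
d = {}
for w in words:
    d[w] = len(w)
"""
elapsed = timeit.timeit(stmt, setup=setup, number=1)
print("100000 inserts took %.4f seconds" % elapsed)
```

If a run like this shows the inserts taking a tiny fraction of your total runtime, the "indexing overhead" was never the bottleneck to begin with.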


For example, looking up a key with its value only once in the whole for loop tells me
I don't need an index. Yes, I'll do this 4 times for those 4 languages, but
I still think it's faster to live without an index, when I can sort the
records. I think I like the sorted text file approach, and the dictionary
approach without an index would be almost the same, especially if I manage
to tell the db layer not to move the cursor randomly but just to walk down the
pre-sorted data.

Again, you don't really have a cursor with a dictionary (and since it's randomly ordered, a cursor wouldn't mean anything). A *btree* has an order, but not a dictionary. You could construct a sorted list in memory that would approximate what I *think* you're thinking of as a dictionary-without-an-index.
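If that's the goal, a rough sketch (with hypothetical records) of that dictionary-without-an-index idea: keep (key, value) pairs in one sorted list, walk it sequentially, and fall back to `bisect` for the occasional point lookup:

```python
import bisect

# Pre-sorted (key, value) records; the "cursor" is just list order.
records = sorted([("cat", 1), ("ant", 2), ("dog", 3)])

# Sequential walk down the pre-sorted data -- no index needed:
for key, value in records:
    print(key)                 # ant, cat, dog -- in sorted order

# A point lookup is still O(log n) via binary search on the keys:
keys = [k for k, _ in records]
i = bisect.bisect_left(keys, "cat")
found = records[i]             # ("cat", 1)
```

Unlike a dict, this keeps an order, so a merge-style pass over two pre-sorted files maps onto it naturally.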


Good luck,
Mike

________________________________________________
 Mike C. Fletcher
 Designer, VR Plumber, Coder
 http://www.vrplumber.com
 http://blog.vrplumber.com

--
http://mail.python.org/mailman/listinfo/python-list
