Hi David,

thanks for your comments and hints. The proposed approach with a lookup dict for the list of dicts is indeed much faster than my previous attempts with a database (even without psyco). I used a slightly different structure with sets of indices, since the indices should be unique anyway and the values are later used for the intersection.
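A minimal sketch of this kind of structure follows. It is not the actual code: the function names and the sample items are made up for illustration, and it assumes each item is a dict mapping tag names to hashable values.

```python
from collections import defaultdict


def build_tags_lookups(items):
    """Map each tag to {tag value: set of indices of items having that value}."""
    tags_lookups = defaultdict(lambda: defaultdict(set))
    for idx, item_dict in enumerate(items):
        for tag, value in item_dict.items():
            # Sets keep the indices unique and are ready for intersection later.
            tags_lookups[tag][value].add(idx)
    return tags_lookups


def find_indices(tags_lookups, **wanted):
    """Return the indices of items matching all given tag=value pairs."""
    index_sets = []
    for tag, value in wanted.items():
        # .get() avoids creating empty entries for unknown tags or values.
        index_sets.append(tags_lookups.get(tag, {}).get(value, set()))
    if not index_sets:
        return set()
    return set.intersection(*index_sets)


# Hypothetical sample data, only to show the shape of the lookups.
items = [
    {"chapter": "1", "page": "10"},
    {"chapter": "1", "page": "11"},
    {"chapter": "2", "page": "11"},
]
lookups = build_tags_lookups(items)
print(sorted(find_indices(lookups, chapter="1")))             # [0, 1]
print(sorted(find_indices(lookups, chapter="1", page="11")))  # [1]
```

Using defaultdict(set) with .add() updates the existing set in place instead of building a new one for every item, which might also keep the inner loop easier to read.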
This way the lookups (for indices given the tag values) seem to be about 20 times faster than the sqlite query (at least for my limited test case; there might be some peculiarities with the real data). However, the (visible) code is quite a bit more complex for my taste. Looking back at the following line in the inner loop of the lookup function:

    tags_lookups[tag][item_dict[tag]] = tags_lookups[tag].get(item_dict[tag], set()) | set([idx])

I wondered whether I am overestimating myself with respect to maintaining the code in the future ... :-) I assume that it can most likely be written in a better way, but I tend to like the simplicity of the sql version, as its speed is fully acceptable too. I will have to recheck these approaches as soon as I have more complete real data available.

Thanks for reminding me about mxTextTools; I looked at this package very quickly several months ago and it seemed quite complex and heavy-weight, but maybe I will reconsider this after some investigation ...

The suggested XML structure is actually almost the one I use to prepare and check the input data before converting it to the one presented in the previous mail :-). The main problem is that I can't seem to make it fully valid XML without deforming the structure of the text itself: it can't be easily decided what CUSTOM_TAG should be in some places, due to the overlapping etc. Furthermore, the redundancy is actually greater than it might seem from the sample given here: there are sometimes more tags, some of them having the same values for several dozen, sometimes hundreds of, subsequent lines. I also sometimes need to access portions of text spanning multiple "tags", not just single elements.

Thanks for your time and effort, I'll have to check the alternatives now and test them a bit further,
regards,
Vlasta