Jack Diederich wrote:
On Thu, Jun 11, 2009 at 12:03 AM, David M. Wilson<d...@botanicus.net> wrote:
[snip]
I found my answer: Python 2.6 introduces heapq.merge(), which is
designed exactly for this.

Thanks, I knew Raymond added something like that but I couldn't find
it in itertools.
That said, it doesn't help.  As an aside, heapq.merge fits better in
itertools (it uses heaps internally but doesn't require them to be
passed in).  The other function that almost helps is
itertools.groupby() and it doesn't return an iterator so is an odd fit
for itertools.

More specifically (and less curmudgeonly) heapq.merge doesn't help for
this particular case because you can't tell where the merged values
came from.  You want all the iterators to yield the same thing at once
but heapq.merge muddles them all together (but in an orderly way!).
Unless I'm reading your tokenizer func wrong it can yield the same
value many times in a row.  If that happens you don't know if four
"The"s are once each from four iterators or four times from one.
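
To make the point above concrete, here is a minimal sketch. The two token
lists and the tag() helper are made up for illustration; they are not taken
from David's tokenizer. It shows that the output of heapq.merge carries no
record of its source, and that one possible workaround is to pair each value
with an id for its iterator before merging.

```python
import heapq

# Hypothetical sorted token streams; the first repeats "The".
stream_a = ["The", "The", "cat"]
stream_b = ["The", "dog"]

# Plain merge: three "The"s come out, with no way to tell their origin.
merged = list(heapq.merge(stream_a, stream_b))

def tag(stream, source_id):
    """Pair every value with the id of the stream it came from."""
    for value in stream:
        yield (value, source_id)

# Tagged merge: tuples still sort by value first, then by source id.
tagged = list(heapq.merge(tag(stream_a, 0), tag(stream_b, 1)))
```

With the tags in place you can tell that two of the "The"s came from stream 0
and one from stream 1, which the plain merge cannot show.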

David is looking to intersect sorted lists of document numbers, with duplicates removed, in order to find documents that contain worda and wordb and wordc ... But you are right that duplicates are a possible fly in the ointment, to be removed before merging.
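
A rough sketch of that approach, assuming each word maps to an ascending
list of document numbers. The names dedupe, intersect_sorted, and the
docs_* lists are hypothetical, not from David's code. Once each list has
its duplicates removed, a document number that shows up exactly N times in
the merge of N lists must have come once from each list.

```python
import heapq
from itertools import groupby

def dedupe(sorted_iter):
    """Yield each distinct value of an already-sorted iterable once."""
    for key, _group in groupby(sorted_iter):
        yield key

def intersect_sorted(*sorted_iters):
    """Yield values present in every input; inputs must be sorted ascending."""
    n = len(sorted_iters)
    merged = heapq.merge(*(dedupe(it) for it in sorted_iters))
    for key, group in groupby(merged):
        # After dedupe, a run of length n means one hit per input.
        if sum(1 for _ in group) == n:
            yield key

# Hypothetical postings lists, some with repeated document numbers.
docs_worda = [1, 3, 3, 5, 8]
docs_wordb = [3, 5, 5, 9]
docs_wordc = [2, 3, 5, 8]

result = list(intersect_sorted(docs_worda, docs_wordb, docs_wordc))
```

Here result holds the documents containing all three words. Everything stays
lazy, so the inputs could just as well be iterators over large index files.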

--
http://mail.python.org/mailman/listinfo/python-list
