On Jun 11, 1:54 pm, Terry Reedy <tjre...@udel.edu> wrote:
> Jack Diederich wrote:
> > On Thu, Jun 11, 2009 at 12:03 AM, David M. Wilson<d...@botanicus.net> wrote:
> > [snip]
> >> I found my answer: Python 2.6 introduces heap.merge(), which is
> >> designed exactly for this.
> >
> > Thanks, I knew Raymond added something like that but I couldn't find
> > it in itertools.
> > That said .. it doesn't help.  Aside, heapq.merge fits better in
> > itertools (it uses heaps internally but doesn't require them to be
> > passed in).  The other function that almost helps is
> > itertools.groupby() and it doesn't return an iterator so is an odd fit
> > for itertools.
> >
> > More specifically (and less curmudgeonly) heap.merge doesn't help for
> > this particular case because you can't tell where the merged values
> > came from.  You want all the iterators to yield the same thing at once
> > but heapq.merge muddles them all together (but in an orderly way!).
> > Unless I'm reading your tokenizer func wrong it can yield the same
> > value many times in a row.  If that happens you don't know if four
> > "The"s are once each from four iterators or four times from one.
>
> David is looking to intersect sorted lists of document numbers with
> duplicates removed in order to find documents that contain worda and
> wordb and wordc ... .  But you are right that duplicate are a possible
> fly in the ointment to be removed before merging.

Removing the duplicates could be a big problem.

With SQL, the duplicates do not have to be removed beforehand. All I
have to do is change "SELECT" to "SELECT DISTINCT" to turn

100 100 100 322 322 322 322 322 322 322 322

into

100 322
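
For comparison, here is a rough sketch of how the same thing might look in Python, assuming each word's document numbers arrive already sorted. The helper names unique_sorted and intersect_sorted, and the wordb/wordc sample data, are made up for illustration; only itertools.groupby and heapq.merge (Python 2.6+) come from the standard library. Collapsing runs with groupby plays the role of SELECT DISTINCT, and once every stream is duplicate-free, a value that appears N times in the merged output must have come from all N streams, which sidesteps the "where did it come from" problem mentioned above.

```python
from heapq import merge
from itertools import groupby


def unique_sorted(iterable):
    """Yield each value of an already-sorted iterable once (like SELECT DISTINCT)."""
    for key, _group in groupby(iterable):
        yield key


def intersect_sorted(*iterables):
    """Yield the values present in every sorted input; inputs may hold duplicates."""
    n = len(iterables)
    deduped = [unique_sorted(it) for it in iterables]
    # After de-duplication each stream contributes a value at most once, so a
    # value seen n times in the merged stream was present in all n streams.
    for key, group in groupby(merge(*deduped)):
        if sum(1 for _ in group) == n:
            yield key


# Illustrative data: worda is the list from the post, the others are invented.
worda = [100, 100, 100, 322, 322, 322, 322, 322, 322, 322, 322]
wordb = [7, 100, 322, 322, 500]
wordc = [100, 100, 250, 322]

print(list(unique_sorted(worda)))                   # [100, 322]
print(list(intersect_sorted(worda, wordb, wordc)))  # [100, 322]
```

Both helpers are lazy generators, so nothing is held in memory beyond the current value from each input. Whether that is fast enough for large posting lists is a separate question this sketch does not try to answer.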