D11828: Simplify orPostingIterator and make it faster

Stefan Brüns Mon, 09 Apr 2018 10:07:01 -0700

bruns added a comment.

  In D11828#242402 <https://phabricator.kde.org/D11828#242402>, @michaelh wrote:

  > I'm not very familiar with the concept of iterators (yet). To me it looks 
like `auto i = new OrPostingIterator(iters); i->docId();` will return 0 and 
`i->next()` returns a valid docId. After that `i->docId();` is also valid.
  >  Is this how iterators work? Naively I would expect `i->docId();` to be 
valid after construction or both `i->docId();` and `i->next()` to be invalid.
  >
  > I have to admit, it's very difficult to understand (and I don't, really) 
what baloo is doing here with all this iterating over vectors of inherited 
iterator classes, pointer-to-pointer stuff of which some might not even exist. 
My brain hurts, terribly! All I can presently do is ask some simple questions. 
Sorry, I'm of no help here. Please take my comments as a device for better 
understanding and I hope it's not to much of a bother.

  Each PostingIterator represents a "set" of document ids. PostingIterators are 
not related to STL or Qt container iterators (although the "subiterators" are 
implemented using QVectors<>).

  The `PI::next()` call returns the smallest element and removes it from the 
set, `PI::docId()` just returns the smallest element.

  The result set of the OrPostingIterator is the union of its "subsets" 
(represented by its "subiterators", i.e. the QVector<PostingIterator*> 
constructor argument). Or works a little bit like an insertion sort. To return 
the smallest element, it creates the union of the smallest element of all 
subsets, and yields the smallest element of the union. It then calls 
`PI::next()` on all subsets which returned the smallest element, removing it 
from all subsets and thus from the union.

  In C++, you can use iterators in two different ways, either PreIncrement or 
PostIncrement. `PostingIterator::next()` will behave like `*(++it)`, i.e. it is 
incremented first, then the current value is returned.

  I don't know the original justification for this design. One possible idea is 
to delay the actual database query until `next()` is called the first time. For 
e.g. a query "a AND b AND c", one can execute "a", if not empty do "b", 
intersect (i.e. calling the Iterators of Result("a") and Result("b") until both 
return the same value), and if the intersection is non-empty, do "c".

  Currently, it does not implement this optimization. D12025 
<https://phabricator.kde.org/D12025> implements one case, i.e. stopping if a 
leaf query returns an empty set, and propagating it upwards.

REPOSITORY
  R293 Baloo

REVISION DETAIL
  https://phabricator.kde.org/D11828

To: bruns, #baloo, michaelh
Cc: #frameworks, ashaposhnikov, michaelh, astippich, spoorun, ngraham, bruns, 
alexeymin

D11828: Simplify orPostingIterator and make it faster

Reply via email to