Thank you very much. I think I need to play a bit with the code before asking more questions. Here is the context for my questions:
I was at Facebook until recently and worked extensively on the Unicorn search backend. Unicorn orders documents by a static rank in the posting lists, so retrieval (based on a query) returns documents in decreasing order of importance (by static rank). We only retrieve a limited number of these documents, so static rank is the first filter. These documents are then scored, which becomes the second filter. This two-level filtering ended up working very well.

I am trying to see how we can do the same with Lucene. It looks like we can build an index offline (say in Hadoop) and use SortingAtomicReader, etc. to get the posting lists in static rank order. Then everything works great. However, as we update the index with new data, we end up with additional segments. To get the same functionality as Unicorn, it looks like we will have to sort these additional segments by the same static rank, and then traverse the posting lists in these multiple segments simultaneously (rather than segment by segment). I am trying to see how easy this will be to achieve.

To get more specific, all documents in Unicorn have two attributes: an id (called fbid), which is unique and never changes - for example, the fbid for the Lucene page is 104032849634513 (so you can always get it as www.facebook.com/104032849634513) - and a static rank. The static rank may not be unique (multiple docs may share the same static rank, in which case we could order them arbitrarily so long as they are ordered the same way in all posting lists). It looks like we will need an fbid to docid mapping in Lucene for each segment so that the simultaneous calls to advance() in each segment can be kept in sync.

This is how far I've thought things out - let me play with the code a bit and I may have more questions in a couple of days. I have put a couple of rough sketches of what I have in mind at the bottom of this mail, below the quoted reply. Thanks again for all your help!

Sriram

On Thu, Jun 13, 2013 at 12:20 AM, Adrien Grand <jpou...@gmail.com> wrote:
> Hi,
>
> On Thu, Jun 13, 2013 at 8:24 AM, Denis Bazhenov <dot...@gmail.com> wrote:
> > Document id on the index level is the offset of the document in the index.
> > It can change over time for the same document, for example when merging
> > several segments. They are also stored in order in posting lists. This
> > allows fast posting list intersection. Some Lucene APIs explicitly state
> > that they operate on the document ids in order (like TermDocs), some allow
> > out-of-order processing (like Collector). So it really depends.
> >
> > In the case of SortingAtomicReader, as far as I know, it calculates a
> > document permutation, which allows having sorted docIDs on the output. So
> > it basically relabels documents.
>
> This is correct. The org.apache.lucene.index.sorter.Sorter.sort method
> computes a permutation of the doc IDs which makes doc IDs sorted
> according to the sort order. SortingAtomicReader is just a view over
> an AtomicReader which uses this permutation to relabel doc IDs and
> give the impression that the index is sorted. But this class is not
> very interesting by itself and can be very slow to decode postings:
> for each term it needs to load all postings into memory and sort them
> before returning an enumeration of the doc IDs (see the
> SortingDocsEnum class); it is only useful to sort indices offline with
> IndexWriter.addIndexes or online with SortingMergePolicy.
>
> --
> Adrien
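Sketch 1: the offline sort I described above, plus a SortingMergePolicy so that segments produced by later updates come out in the same order. The class and method names are taken from the org.apache.lucene.index.sorter package in 4.3 as I understand them, so the exact signatures may well be off. The "rankkey" field, the directory paths, and the Long.MAX_VALUE - staticRank trick are just my own conventions (I am assuming NumericDocValuesSorter sorts in ascending order of the doc values, so I index a reversed key to get decreasing static rank).

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.sorter.NumericDocValuesSorter;
import org.apache.lucene.index.sorter.Sorter;
import org.apache.lucene.index.sorter.SortingAtomicReader;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class StaticRankSortSketch {
  public static void main(String[] args) throws Exception {
    // Documents are assumed to carry NumericDocValuesField("rankkey",
    // Long.MAX_VALUE - staticRank), so ascending order on "rankkey"
    // corresponds to decreasing static rank.
    Sorter byStaticRank = new NumericDocValuesSorter("rankkey");

    Directory unsorted = FSDirectory.open(new File("/path/to/unsorted"));
    Directory sorted = FSDirectory.open(new File("/path/to/sorted"));

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));
    // Keep segments produced by future updates/merges in the same order.
    iwc.setMergePolicy(new SortingMergePolicy(iwc.getMergePolicy(), byStaticRank));

    IndexWriter writer = new IndexWriter(sorted, iwc);
    DirectoryReader reader = DirectoryReader.open(unsorted);
    try {
      // View the source index as a single AtomicReader, wrap it in a sorted
      // view, and bulk-copy it into the new index in static-rank order.
      AtomicReader sortedView = SortingAtomicReader.wrap(
          SlowCompositeReaderWrapper.wrap(reader), byStaticRank);
      writer.addIndexes(sortedView);
      writer.commit();
    } finally {
      reader.close();
      writer.close();
    }
  }
}

As far as I understand, the SortingMergePolicy only guarantees order for segments it merges; freshly flushed segments would still be in insertion order until they get merged, which is part of what I want to experiment with.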
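Sketch 2: the retrieval side, i.e. the first (static rank) filter. It walks one term's posting list in every segment at the same time and pops documents in decreasing static-rank order across segments, stopping at the retrieval limit. This assumes each segment is already sorted by static rank and that every document has NumericDocValues for its raw rank (higher is better) and its fbid; the field names "staticrank" and "fbid" are again just my own convention, not anything Lucene defines.

import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.Directory;

public class StaticRankRetrievalSketch {

  /** Cursor over one segment's posting list, tracking the current doc's rank. */
  private static class SegmentCursor {
    final DocsEnum docs;
    final NumericDocValues rank;
    final NumericDocValues fbid;
    long currentRank;

    SegmentCursor(DocsEnum docs, NumericDocValues rank, NumericDocValues fbid) {
      this.docs = docs;
      this.rank = rank;
      this.fbid = fbid;
    }

    boolean next() throws IOException {
      int doc = docs.nextDoc();
      if (doc == DocIdSetIterator.NO_MORE_DOCS) {
        return false;
      }
      currentRank = rank.get(doc);
      return true;
    }
  }

  /** Returns the fbids of up to `limit` documents matching `term`, best static rank first. */
  public static long[] retrieve(Directory dir, Term term, int limit) throws IOException {
    DirectoryReader reader = DirectoryReader.open(dir);
    try {
      // One cursor per segment; the heap always exposes the cursor whose
      // current document has the highest static rank.
      PriorityQueue<SegmentCursor> heap = new PriorityQueue<SegmentCursor>(
          Math.max(1, reader.leaves().size()),
          new Comparator<SegmentCursor>() {
            public int compare(SegmentCursor a, SegmentCursor b) {
              return a.currentRank > b.currentRank ? -1
                   : a.currentRank < b.currentRank ? 1 : 0;
            }
          });
      for (AtomicReaderContext ctx : reader.leaves()) {
        AtomicReader leaf = ctx.reader();
        DocsEnum docs = leaf.termDocsEnum(term);
        if (docs == null) {
          continue; // term absent in this segment
        }
        // Assumes every segment has these doc values fields.
        SegmentCursor cursor = new SegmentCursor(docs,
            leaf.getNumericDocValues("staticrank"),
            leaf.getNumericDocValues("fbid"));
        if (cursor.next()) {
          heap.add(cursor);
        }
      }

      long[] results = new long[limit];
      int n = 0;
      while (n < limit && !heap.isEmpty()) {
        SegmentCursor top = heap.poll();
        results[n++] = top.fbid.get(top.docs.docID());
        if (top.next()) {      // advance within that segment, then re-insert
          heap.add(top);
        }
      }
      long[] trimmed = new long[n];
      System.arraycopy(results, 0, trimmed, 0, n);
      return trimmed; // these candidates would then go to the second (scoring) filter
    } finally {
      reader.close();
    }
  }
}

This only does the first filter; the returned candidates would then be rescored, which is the second filter. The early termination at the limit relies on each segment's postings coming back in decreasing static-rank order, which is exactly what the sorted segments are supposed to give us.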