Thank you very much. I think I need to play a bit with the code before asking more questions. Here is the context for my questions:
I was at Facebook until recently and worked extensively on the Unicorn search backend. Unicorn orders documents by a static rank in the posting lists, so retrieval (based on a query) returns documents in decreasing order of importance (by static rank). We only retrieve a limited number of these documents, so static rank is the first filter. These documents are then scored, which becomes the second filter. This two-level filtering ended up working very well.

I am trying to see how we can do the same with Lucene. It looks like we can build an index offline (say in Hadoop) and use SortingAtomicReader, etc. to get the posting lists in static rank order. Then everything works great. However, as we update the index with new data, we end up with additional segments. To get the same functionality as Unicorn, it looks like we will have to sort these additional segments by the same static rank, and then traverse the posting lists in these multiple segments simultaneously (rather than segment by segment). I am trying to see how easy this will be to achieve.

To get more specific, all documents in Unicorn have two attributes: an id (called fbid), which is unique and never changes - for example, the fbid for the Lucene page is 104032849634513 (so you can always get it as www.facebook.com/104032849634513) - and a static rank. The static rank may not be unique (multiple docs may share the same static rank, in which case we could order them arbitrarily so long as they are ordered the same way in all posting lists). It looks like we will need an fbid to docid mapping in Lucene for each segment so that the simultaneous calls to advance() in each segment can be kept in sync.

This is how far I've thought things out - let me play with the code a bit and I may have more questions in a couple of days. I have put a couple of rough sketches of what I have in mind at the bottom of this mail, below the quoted reply. Thanks again for all your help!

Sriram

On Thu, Jun 13, 2013 at 12:20 AM, Adrien Grand <jpou...@gmail.com> wrote:
> Hi,
>
> On Thu, Jun 13, 2013 at 8:24 AM, Denis Bazhenov <dot...@gmail.com> wrote:
> > Document id on the index level is the offset of the document in the index.
> > It can change over time for the same document, for example when merging
> > several segments. They are also stored in order in posting lists. This
> > allows fast posting list intersection. Some Lucene APIs explicitly state
> > that they operate on the document ids in order (like TermDocs), some allow
> > out-of-order processing (like Collector). So it really depends.
> >
> > In the case of SortingAtomicReader, as far as I know, it calculates a
> > document permutation, which allows having sorted docIDs on the output. So
> > it basically relabels documents.
>
> This is correct. The org.apache.lucene.index.sorter.Sorter.sort method
> computes a permutation of the doc IDs which makes doc IDs sorted
> according to the sort order. SortingAtomicReader is just a view over
> an AtomicReader which uses this permutation to relabel doc IDs and
> give the impression that the index is sorted. But this class is not
> very interesting by itself and can be very slow to decode postings:
> for each term it needs to load all postings into memory and sort them
> before returning an enumeration of the doc IDs (see the
> SortingDocsEnum class); it is only useful to sort indices offline with
> IndexWriter.addIndexes or online with SortingMergePolicy.
>
> --
> Adrien
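Sketch 1: the offline sort I described above, plus a SortingMergePolicy so that segments produced by later updates come out in the same order. The class and method names are taken from the org.apache.lucene.index.sorter package in 4.3 as I understand them, so the exact signatures may well be off. The "rankkey" field, the directory paths, and the Long.MAX_VALUE - staticRank trick are just my own conventions (I am assuming NumericDocValuesSorter sorts in ascending order of the doc values, so I index a reversed key to get decreasing static rank).

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.sorter.NumericDocValuesSorter;
import org.apache.lucene.index.sorter.Sorter;
import org.apache.lucene.index.sorter.SortingAtomicReader;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class StaticRankSortSketch {
  public static void main(String[] args) throws Exception {
    // Documents are assumed to carry NumericDocValuesField("rankkey",
    // Long.MAX_VALUE - staticRank), so ascending order on "rankkey"
    // corresponds to decreasing static rank.
    Sorter byStaticRank = new NumericDocValuesSorter("rankkey");

    Directory unsorted = FSDirectory.open(new File("/path/to/unsorted"));
    Directory sorted = FSDirectory.open(new File("/path/to/sorted"));

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));
    // Keep segments produced by future updates/merges in the same order.
    iwc.setMergePolicy(new SortingMergePolicy(iwc.getMergePolicy(), byStaticRank));

    IndexWriter writer = new IndexWriter(sorted, iwc);
    DirectoryReader reader = DirectoryReader.open(unsorted);
    try {
      // View the source index as a single AtomicReader, wrap it in a sorted
      // view, and bulk-copy it into the new index in static-rank order.
      AtomicReader sortedView = SortingAtomicReader.wrap(
          SlowCompositeReaderWrapper.wrap(reader), byStaticRank);
      writer.addIndexes(sortedView);
      writer.commit();
    } finally {
      reader.close();
      writer.close();
    }
  }
}

As far as I understand, the SortingMergePolicy only guarantees order for segments it merges; freshly flushed segments would still be in insertion order until they get merged, which is part of what I want to experiment with.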
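Sketch 2: the retrieval side, i.e. the first (static rank) filter. It walks one term's posting list in every segment at the same time and pops documents in decreasing static-rank order across segments, stopping at the retrieval limit. This assumes each segment is already sorted by static rank and that every document has NumericDocValues for its raw rank (higher is better) and its fbid; the field names "staticrank" and "fbid" are again just my own convention, not anything Lucene defines.

import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.Directory;

public class StaticRankRetrievalSketch {

  /** Cursor over one segment's posting list, tracking the current doc's rank. */
  private static class SegmentCursor {
    final DocsEnum docs;
    final NumericDocValues rank;
    final NumericDocValues fbid;
    long currentRank;

    SegmentCursor(DocsEnum docs, NumericDocValues rank, NumericDocValues fbid) {
      this.docs = docs;
      this.rank = rank;
      this.fbid = fbid;
    }

    boolean next() throws IOException {
      int doc = docs.nextDoc();
      if (doc == DocIdSetIterator.NO_MORE_DOCS) {
        return false;
      }
      currentRank = rank.get(doc);
      return true;
    }
  }

  /** Returns the fbids of up to `limit` documents matching `term`, best static rank first. */
  public static long[] retrieve(Directory dir, Term term, int limit) throws IOException {
    DirectoryReader reader = DirectoryReader.open(dir);
    try {
      // One cursor per segment; the heap always exposes the cursor whose
      // current document has the highest static rank.
      PriorityQueue<SegmentCursor> heap = new PriorityQueue<SegmentCursor>(
          Math.max(1, reader.leaves().size()),
          new Comparator<SegmentCursor>() {
            public int compare(SegmentCursor a, SegmentCursor b) {
              return a.currentRank > b.currentRank ? -1
                   : a.currentRank < b.currentRank ? 1 : 0;
            }
          });
      for (AtomicReaderContext ctx : reader.leaves()) {
        AtomicReader leaf = ctx.reader();
        DocsEnum docs = leaf.termDocsEnum(term);
        if (docs == null) {
          continue; // term absent in this segment
        }
        // Assumes every segment has these doc values fields.
        SegmentCursor cursor = new SegmentCursor(docs,
            leaf.getNumericDocValues("staticrank"),
            leaf.getNumericDocValues("fbid"));
        if (cursor.next()) {
          heap.add(cursor);
        }
      }

      long[] results = new long[limit];
      int n = 0;
      while (n < limit && !heap.isEmpty()) {
        SegmentCursor top = heap.poll();
        results[n++] = top.fbid.get(top.docs.docID());
        if (top.next()) {      // advance within that segment, then re-insert
          heap.add(top);
        }
      }
      long[] trimmed = new long[n];
      System.arraycopy(results, 0, trimmed, 0, n);
      return trimmed; // these candidates would then go to the second (scoring) filter
    } finally {
      reader.close();
    }
  }
}

This only does the first filter; the returned candidates would then be rescored, which is the second filter. The early termination at the limit relies on each segment's postings coming back in decreasing static-rank order, which is exactly what the sorted segments are supposed to give us.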