Hi Nicola,

What I had in mind is something similar to this, which is possible starting with Lucene 4.1, due to changes done to facets (per-segment faceting):
DirectoryTaxonomyWriter master = new DirectoryTaxonomyWriter(masterDir);

Directory[] origTaxoDirs = new Directory[numTaxoDirs]; // open the Directories and store them in this array
OrdinalMap[] ordinalMaps = new OrdinalMap[numTaxoDirs]; // initialize an OrdinalMap (e.g. a MemoryOrdinalMap) per taxonomy and store them in this array

// now do the merge
for (int i = 0; i < origTaxoDirs.length; i++) {
  master.addTaxonomy(origTaxoDirs[i], ordinalMaps[i]);
}

// now open your readers, and create the important map
Map<AtomicReader, OrdinalMap> readerOrdinals = new HashMap<AtomicReader, OrdinalMap>();
DirectoryReader[] readers = new DirectoryReader[origTaxoDirs.length];
for (int i = 0; i < origTaxoDirs.length; i++) {
  readers[i] = DirectoryReader.open(contentDirectories[i]); // keep the reader for the MultiReader below
  OrdinalMap ordMap = ordinalMaps[i];
  for (AtomicReaderContext ctx : readers[i].leaves()) {
    readerOrdinals.put(ctx.reader(), ordMap);
  }
}
MultiReader mr = new MultiReader(readers);

// create your FacetRequest (CountFacetRequest) with a custom Aggregator
FacetRequest fr = new CountFacetRequest(cp, topK) {
  @Override
  public Aggregator createAggregator(...) {
    return new OrdinalMappingAggregator() {
      int[] ordMap;

      @Override
      public void setNextReader(AtomicReaderContext context) {
        // the ordinal map matching the AtomicReader we're about to aggregate
        ordMap = readerOrdinals.get(context.reader()).getMap();
      }

      @Override
      public void aggregate(int docID, float score, IntsRef ordinals) {
        int upto = ordinals.offset + ordinals.length;
        for (int i = ordinals.offset; i < upto; i++) {
          int ordinal = ordinals.ints[i]; // original ordinal read for the AtomicReader given to setNextReader
          int mappedOrdinal = ordMap[ordinal]; // mapped ordinal, following the taxonomy merge
          counts[mappedOrdinal]++; // count the mapped ordinal instead, so all AtomicReaders count that ordinal
        }
      }
    };
  }
};

While it may look like I wrote actual code to do it, I didn't :). So I guess it should work, but I haven't tried it. That way, you don't touch the content indexes at all, just the taxonomy ones.

Note, however, that you'll need to redo this step every time the taxonomy index is updated and you refresh the TaxoReader instance. Also, this will only work if all your indexes are opened in the same JVM (which I assume is the case, since you use MultiReader).

If you still don't want to do that, then what Denis wrote above is another way to do distributed faceted search, either inside the same JVM or across multiple JVMs: you obtain the FacetResult from each search and merge the results (unfortunately, there's still no tool in Lucene to do that for you). Just make sure to ask for a larger K, to ensure that the correct top-K is returned (see my previous notes).
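To make that merge step a bit more concrete, here is another rough, untested sketch, this time of the "merge by hand" part only. It deliberately stays away from the facet classes: it assumes you've already flattened each index's top categories into a Map<String, Integer> keyed by the category path string (extracting those pairs from each FacetResult is up to you), and the names FacetResultMergeSketch, shardCounts and finalK are just placeholders:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetResultMergeSketch {

  /**
   * Sum the per-index counts of each category path, then keep the overall
   * top finalK. Each index should have been asked for more than finalK
   * categories, otherwise the merged top-K may be wrong.
   */
  public static List<Map.Entry<String, Integer>> merge(
      List<Map<String, Integer>> shardCounts, int finalK) {
    // sum counts for the same category path across all indexes
    Map<String, Integer> merged = new HashMap<String, Integer>();
    for (Map<String, Integer> shard : shardCounts) {
      for (Map.Entry<String, Integer> e : shard.entrySet()) {
        Integer prev = merged.get(e.getKey());
        merged.put(e.getKey(), prev == null ? e.getValue() : prev + e.getValue());
      }
    }
    // sort by merged count, descending, and truncate to the final top-K
    List<Map.Entry<String, Integer>> sorted =
        new ArrayList<Map.Entry<String, Integer>>(merged.entrySet());
    Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        return b.getValue().compareTo(a.getValue());
      }
    });
    return sorted.size() > finalK ? sorted.subList(0, finalK) : sorted;
  }
}

How you key the map doesn't really matter, as long as the same category ends up under the same key in every index, and again, ask each index for more than finalK categories so the merged top-K is trustworthy.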
Shai

On Tue, Jan 22, 2013 at 4:32 AM, Denis Bazhenov <dot...@gmail.com> wrote:

> We have a similar distributed search system and we ended up with the
> following scheme. The search replicas (the machines where the index
> resides) build FacetResults based on their index chunk (top N categories
> with document counts). Later the results are merged "by hand", summing
> the relevant categories from the different replicas.
>
> On Jan 22, 2013, at 3:08 AM, Nicola Buso <nb...@ebi.ac.uk> wrote:
>
> > Hi Shai,
> >
> > I was thinking of that too, but I'm indexing all the indexes in a custom
> > distributed environment, so at the moment I can't have a single
> > categories index for all the content indexes at indexing time.
> > A solution would be to merge all the categories indexes into a single
> > index and use your solution, but the merge code I see in the examples
> > also merges the content index, and I can't do that.
> >
> > I could share the taxonomy if merging is possible (I see the resulting
> > categories indexes are currently not that big), but I would prefer a
> > solution where I can collect the facets over multiple categories
> > indexes; that way I can be sure the solution will scale better.
> >
> > Nicola.
> >
> > On Mon, 2013-01-21 at 17:54 +0200, Shai Erera wrote:
> >> Hi Nicola,
> >>
> >> I think that what you're describing corresponds to distributed faceted
> >> search. I.e., you have N content indexes, alongside N taxonomy indexes.
> >>
> >> The information that's indexed in each of those sub-indexes does not
> >> correlate with the other ones. For example, say that you index the
> >> category "Movie/Drama"; it may receive ordinal 12 in index1 and 23 in
> >> index2. If you try to count ordinals using MultiReader, you'll just
> >> mess up everything.
> >>
> >> If you can share a single taxonomy index for all N content indexes,
> >> then you'll be in a super-simple position:
> >>
> >> 1) Open one TaxonomyReader
> >> 2) Execute the search with MultiReader and FacetsCollector
> >>
> >> It doesn't get simpler than that! :)
> >>
> >> Before I go into great length describing what you should do if you
> >> cannot share the taxonomy, let me know if that's not an option for you.
> >>
> >> Shai
> >>
> >> On Mon, Jan 21, 2013 at 5:39 PM, Nicola Buso <nb...@ebi.ac.uk> wrote:
> >> Thanks for the reply Uwe,
> >>
> >> We can currently search with a MultiReader over all the indexes we
> >> have. Now I want to add faceted search, so I created a categories
> >> index for every index I currently have.
> >> To accumulate the faceted results I now have a MultiReader pointing
> >> at all the indexes, and I can create a TaxonomyReader for every
> >> categories index I have; the only ways I see to obtain FacetResults
> >> are:
> >> 1 - FacetsCollector
> >> 2 - a FacetsAccumulator implementation
> >>
> >> Suppose I use the second option. I should:
> >> - search as usual using the MultiReader
> >> - then try to collect all the FacetResults by iterating over my
> >>   TaxonomyReaders; at every iteration:
> >>   - create a FacetsAccumulator using the MultiReader and a
> >>     TaxonomyReader
> >>   - get a list of FacetResult from the accumulator
> >> - when I finish, somehow merge all the List<FacetResult> I have.
> >>
> >> I think this solution is not correct, because the docids from the
> >> search point to the MultiReader, while each TaxonomyReader points to
> >> the categories index of a single reader.
> >> I also don't like merging all the lists of FacetResult I retrieve
> >> from the accumulators.
> >>
> >> Probably I'm missing something. Can somebody clarify how I should
> >> collect the facets in this case?
> >>
> >> Nicola.
> >>
> >> On Mon, 2013-01-21 at 16:22 +0100, Uwe Schindler wrote:
> >>> Just use MultiReader, it extends IndexReader, so you can pass it
> >>> anywhere where IndexReader can be passed.
> >>>
> >>> -----
> >>> Uwe Schindler
> >>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>> http://www.thetaphi.de
> >>> eMail: u...@thetaphi.de
> >>>
> >>>> -----Original Message-----
> >>>> From: Nicola Buso [mailto:nb...@ebi.ac.uk]
> >>>> Sent: Monday, January 21, 2013 3:59 PM
> >>>> To: java-user@lucene.apache.org
> >>>> Subject: FacetedSearch and MultiReader
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I'm trying to develop faceted search using the Lucene 4.0 faceting
> >>>> framework. In our project we are searching multiple indexes using
> >>>> Lucene's MultiReader. How should we use the faceting framework to
> >>>> obtain FacetResults starting from a MultiReader? All the examples I
> >>>> see use a "single" IndexReader.
> >>>>
> >>>> Nicola.
>
> ---
> Denis Bazhenov <dot...@gmail.com>