[
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823304#comment-13823304
]
Shai Erera commented on LUCENE-5339:
------------------------------------
bq. I think that's an acceptable risk in exchange for the simpler index-time
API.
What if someone already extends IW? Or what if we like that approach so much that
we want to apply it elsewhere too (in other modules, I mean)? I'm just saying
that extending IW is a pretty big thing, compared to FacetFields.
I was thinking maybe we can wrap an IW, and present an interface with only
add/updateDoc, so on one hand you get to add {{new FacetField()}} easily, but on
the other you get to use whatever IW you want. It's like RandomIW, only we
may not want to copy the entire set of APIs, just the document-related ones?
And in the future we can think about TaxoFacetIW which also owns the commit()?
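To make the wrapper idea concrete, here's a minimal sketch. To keep it self-contained, all the types below (PlainIndexWriter, FacetIndexerWrapper, the "$facets" field name) are stand-ins I made up, not the real Lucene classes - the point is only the shape: expose just the document-related methods, translate FacetFields before delegating, and stay oblivious to which IW is underneath:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the real Lucene types, so the sketch is self-contained.
class Field {
    final String name;
    final String value;
    Field(String name, String value) { this.name = name; this.value = value; }
}

class FacetField extends Field {
    FacetField(String dim, String label) { super(dim, label); }
}

class Document {
    final List<Field> fields = new ArrayList<>();
    void add(Field f) { fields.add(f); }
}

// Stand-in for IndexWriter: only the one call the sketch needs.
class PlainIndexWriter {
    final List<Document> indexed = new ArrayList<>();
    void addDocument(Document doc) { indexed.add(doc); }
}

// The wrapper: exposes only document-related methods, and rewrites
// FacetFields into their indexable form before delegating.
class FacetIndexerWrapper {
    private final PlainIndexWriter delegate;
    FacetIndexerWrapper(PlainIndexWriter delegate) { this.delegate = delegate; }

    void addDocument(Document doc) {
        Document rewritten = new Document();
        for (Field f : doc.fields) {
            if (f instanceof FacetField) {
                // In the real module this would add drill-down terms and encoded ordinals.
                rewritten.add(new Field("$facets", f.name + "/" + f.value));
            } else {
                rewritten.add(f);
            }
        }
        delegate.addDocument(rewritten);
    }
}
```

The app keeps constructing whatever IW it wants, and only the add/update path goes through the wrapper.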
Another crazy idea is to implement a FacetDocument which extends Document (we'd
need to allow that first!) and does the same trick on addField() /
indexableFields(). I think if we can pull this off (extending Document is
preferred IMO), it's far less intrusive than a new FacetIW.
bq. Wait, the app cannot compute this (accurately) by summing the child counts?
It will overcount in general, right?
Oops, you're right. So in that case, we can open up the extension point (I
think it's FacetIW.dedupAndEncode) to let the app add the DIM ordinal too? Then
I think the rest of the search code will just work?
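As a rough sketch of what such an extension point could let the app do (the class and method here are hypothetical, just to illustrate "add the dimension's own ordinal during dedup"):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical dedupAndEncode-style hook where the app includes the dim ordinal.
class DimOrdinalEncoder {
    // Dedup the ordinals and make sure the dimension's own ordinal is present,
    // so the dim's count can be read directly at search time instead of being
    // (over)estimated by summing child counts.
    static int[] dedupWithDimOrd(int dimOrd, int[] ords) {
        return IntStream.concat(IntStream.of(dimOrd), Arrays.stream(ords))
                        .distinct()
                        .sorted()
                        .toArray();
    }
}
```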
{quote}
There was a user on the ML (I don't have the link) who just wanted to
get the facet count for a specific label after faceting was done, and
the hoops s/he had to jump through (custom FacetResultHandler I
think??) to achieve that was just crazy.
{quote}
Maybe it's crazy, I don't know. All I'm saying is that you've now taken that user's
request and made it a top-level API. I don't know if that qualifies as
simplifying the APIs, or whether similar requests in the future will be easy to
implement by extension, or whether we'd need to add them to our core code. For
instance, this API makes Range and Count not share the same API (not saying
it's a bad thing, just an example). But I already +1'ed the idea of having that on
the API.
bq. Seriously? What abstractions are we missing?
Well, FacetArrays commits to an int[], while we have someone who wants to
use a Map, because he has millions of facets, yet queries typically hit very
few of them. And he uses both the int[] and float[] arrays, which for the most
part are allocated for no good reason. In order to use a Map he needs
to write a FacetResultsHandler which computes top-K from the map, as well as
extend FacetsAccumulator to override createFRH. Just an example. If we had a
FacetValues abstraction w/ get/set/iterator methods, all he'd need to do is
override the FacetValues that should be used.
But I already told him that this is something he can just do w/
FacetResultsHandler. And just to point out that this patch won't make things
worse - since he already overrides the aggregation method, he can just
implement a MapBasedTaxoFacetsSomething. So I only gave it as an example
(FacetValues abstraction, I think, will hurt the performance in general, but we
could benchmark).
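To illustrate the FacetValues idea, here's a rough sketch - the interface and both implementation names are hypothetical, not existing classes. The dense impl is roughly what FacetArrays does today (allocate everything up front), while the sparse impl is what the millions-of-facets user would plug in:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

// Hypothetical abstraction over the per-ordinal aggregation values.
interface FacetValues {
    double get(int ord);
    void increment(int ord, double delta);
    PrimitiveIterator.OfInt ordinals();  // iterate the ordinals that hold values
}

// Dense impl: allocate the full array up front, like FacetArrays does today.
class ArrayFacetValues implements FacetValues {
    private final double[] values;
    ArrayFacetValues(int size) { this.values = new double[size]; }
    public double get(int ord) { return values[ord]; }
    public void increment(int ord, double delta) { values[ord] += delta; }
    public PrimitiveIterator.OfInt ordinals() {
        return IntStream.range(0, values.length).iterator();
    }
}

// Sparse impl: millions of facets overall, but a query touches only a few.
class MapFacetValues implements FacetValues {
    private final Map<Integer, Double> values = new HashMap<>();
    public double get(int ord) { return values.getOrDefault(ord, 0.0); }
    public void increment(int ord, double delta) { values.merge(ord, delta, Double::sum); }
    public PrimitiveIterator.OfInt ordinals() {
        return values.keySet().stream().mapToInt(Integer::intValue).iterator();
    }
}
```

With that in place, the aggregators and FacetResultsHandler would work against FacetValues, and overriding the impl is the only extension point the user needs (this is also where the general-case performance cost I mentioned would show up: a virtual call per ordinal instead of a raw array access).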
Another abstraction is LUCENE-5316, which we're struggling to get to perform
(really frustrating!).
{quote}
If this is too much change for the facets module, we could, instead,
leave the facets module as is, and break this effort out as a
different module (facets2, simplefacets, something?). We also have
many queryparsers, many highlighters, etc., and I think that's
healthy: all options can be explored.
{quote}
-1 to that. I think it delivers the wrong message. Why have two faceting
modules? A branch is the perfect choice here, since it allows us to move the
entire module to the new API. And along the way we benefit by verifying that the
new API can really support associations, complements and sampling.
To give an example - the API at first was overwhelming, and in the last round of
changes I removed and moved things (I thought I was simplifying it!). So for instance,
once upon a time FacetRequest let you specify the FacetsAggregator and
FacetResultsHandler. I moved them both to FacetsAccumulator, but recently moved
FacetsAggregator back to FacetRequest (it makes it easier to define which
aggregator a certain FR needs to use). And on LUCENE-5333 I suggested moving
FacetResultsHandler back to FR, to implement an AllDimFR.
So wearing the "experienced" hat, I just want to make sure we won't move
everything out only to gradually swap it back in. I'm not against changing the
APIs, but because this module is rich with features (which I, and apparently
our users too, think are important), I'd like to make sure the new API works
for all of them. I may compromise on complements and sampling because: (1)
complements is not per-segment and I think it might require a full rewrite
anyway (there's an issue for it), and (2) sampling is under the *.old package,
meaning it was never fully migrated to the new APIs, and I think it could use
some serious simplification there too.
If you're worried that you'll do this work alone, I promise to help. On the
branch.
{quote}
It's a crazy expert thing to create another faceting impl, so I think
such developers can handle changing their rollup to be lazy if it's
beneficial/necessary for their use case.
{quote}
Let's use MapReduce as an analogy. The framework says "you worry about the Map
and Reduce phase, I'll take care of the distribution stuff". Nice and clean. In
the beginning the facets API said "you worry about aggregating a *single*
ordinal, I'll worry about the rest". Then we changed it to "you worry about
aggregating all ordinals in a segment, I'll worry about the rest".
Now, some users we have - well, think of them as data scientists. All they're
interested in is ranking facets to drive some insights out of the data. I
feel that we're putting more and more burden on them. Cutting the API
from the single-ord level to the segment level was done to gain sizable improvements
to faceted search. But letting them do the rollup ... anyway, let's proceed
with that, we can see as we go.
I personally like code reuse, and it kills me to see code that's duplicated.
Often that calls for better API design, not necessarily simplification.
{quote}
I'm not sure it's so easily separated out; the DimConfig is common to
index time and search time, and we're still iterating on that (I just
moved the indexedFieldName into it).
{quote}
I was thinking to develop it still with FacetIndexingParams in mind - just
cut over from FacetFields to FacetIW/FacetDocument. This makes the change
gradual and allows us to investigate it separately. It's also, I think, much
lower-hanging fruit. But it looks like the changes in this patch will require
rewriting large parts of it, so maybe we should do it here.
{quote}
We'd have to change 3 places if we did that right now: FacetIndexWriter
(where we encode) and TaxonomyFacetCounts/SumValueSource (where we
decode).
{quote}
Wrong? I mean, you still haven't written a version which pulls the ords from
OrdinalsCache (and I assume you don't want to get rid of it!). With those two
TFCs, that already means we'll have 4 classes? Though I think the SumValueSource
could be oblivious to how the ords are encoded/fetched out of the box, since
it's less commonly used (we only optimize counting), and someone can rewrite it
to pull the ords from OrdinalsCache directly (instead of going through the CLI
abstraction).
bq. So I think its fine to bake in the encoding, but with the two key methods
in those 20 classes 'protected' in the appropriate places so that an expert
user could subclass them:
I think you point out two problems - one is the fact that we have the
*.encoding package at all, and the other is how to override
encoding/decoding. The nice thing about IntEncoder/Decoder (let's think of it
as PackedInts, ok?) is that you can swap in an implementation without affecting
other classes, while with protected encode/decode methods you need to extend
every class that reads the data. I don't think that qualifies as
good software design. Why do we have PackedInts then?
Now the question is whether we should keep the different IntEncoders or not - I
don't know. Why do we have so many PackedInts versions? Do we use all of them,
or are they there so that someone can make a pick? If they don't hurt anyone,
and require almost no maintenance, why remove them?
The argument is whether we should open up encoding/decoding at all. What Mike
is saying is "if you want to encode differently, you're an expert and you can
rewrite all of your and Lucene's aggregators (count, sum VS) too". What you're
suggesting - adding protected encode/decode methods - is more like how the API
looks today, only instead of a method we have an IntEncoder/Decoder class (a la
PackedInts). That's just easier to use in conjunction with other classes
(especially if we do the CLI abstraction).
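For illustration, here's what a pluggable codec pair might look like. OrdEncoder/OrdDecoder are simplified stand-ins I made up for IntEncoder/IntDecoder (the real ones work over streams), and the dgap+vint scheme is in the spirit of the module's default encoding for sorted ordinals:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for IntEncoder/IntDecoder: a pluggable codec pair,
// so swapping the encoding doesn't require subclassing every reader.
interface OrdEncoder { byte[] encode(int[] sortedOrds); }
interface OrdDecoder { int[] decode(byte[] bytes); }

// One interchangeable implementation: delta-gaps between sorted ordinals,
// each gap written as a VInt (7 bits per byte, high bit means "more bytes").
class DGapVIntCodec implements OrdEncoder, OrdDecoder {
    public byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int gap = ord - prev;  // gaps between sorted ords are small
            prev = ord;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    public int[] decode(byte[] bytes) {
        List<Integer> ords = new ArrayList<>();
        int prev = 0, i = 0;
        while (i < bytes.length) {
            int gap = 0, shift = 0;
            byte b;
            do {
                b = bytes[i++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap;
            ords.add(prev);
        }
        return ords.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

An expert plugs in a different codec instance; none of the classes that read the data need to be extended.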
Just to clarify, I don't mean we have to have the CLI abstraction and use it
everywhere. I think it's a good API for someone to use if he doesn't care about
how ordinals are fetched (BDV, CachedOrds, whatever) as well as how they are
encoded/decoded. It could very well be a CLITaxoFacetCounts impl which lets you
pass the CLI you want to use (BDVCLI, CachedOrdsCLI, MySuperExpertDecodingCLI).
That's it. Most users won't know about it, but the few who do won't need to
rewrite so many classes.
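To show the kind of decoupling I mean, here's a toy sketch. CategoryListIterator below is a simplified one-method stand-in (the real one works per-segment over streams), and CLIFacetCounts is hypothetical - the point is that the counting code is written only against the abstraction, oblivious to whether the ords come from BDV, CachedOrds, or a custom decoder:

```java
// Simplified stand-in for a CategoryListIterator-style abstraction:
// "give me the ordinals for this doc", however they are fetched/decoded.
interface CategoryListIterator {
    int[] ordinalsForDoc(int docId);
}

// A counts impl that only knows the abstraction.
class CLIFacetCounts {
    private final int[] counts;
    CLIFacetCounts(int numOrds) { counts = new int[numOrds]; }

    void countDocs(CategoryListIterator cli, int... matchingDocs) {
        for (int doc : matchingDocs) {
            for (int ord : cli.ordinalsForDoc(doc)) {
                counts[ord]++;
            }
        }
    }

    int count(int ord) { return counts[ord]; }
}
```

BDVCLI, CachedOrdsCLI or MySuperExpertDecodingCLI would each just be a different implementation of the interface passed to the same counts class.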
> Simplify the facet module APIs
> ------------------------------
>
> Key: LUCENE-5339
> URL: https://issues.apache.org/jira/browse/LUCENE-5339
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-5339.patch, LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder). I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
> * Instead of *FacetRequest/Params/Result, you directly instantiate
> the classes that do facet counting (currently TaxonomyFacetCounts,
> RangeFacetCounts or SortedSetDVFacetCounts), passing in the
> SimpleFacetsCollector, and then you interact with those classes to
> pull labels + values (topN under a path, sparse, specific labels).
> * At index time, no more FacetIndexingParams/CategoryListParams;
> instead, you make a new SimpleFacetFields and pass it the field it
> should store facets + drill downs under. If you want more than
> one CLI you create more than one instance of SimpleFacetFields.
> * I added a simple schema, where you state which dimensions are
> hierarchical or multi-valued. From this we decide how to index
> the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc. This is just a start ...
--
This message was sent by Atlassian JIRA
(v6.1#6144)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]