[
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823304#comment-13823304
]
Shai Erera commented on LUCENE-5339:
------------------------------------
bq. I think that's an acceptable risk in exchange for the simpler index-time
API.
What if someone already extends IW? Or what if we like that approach so much that
we want to apply it elsewhere too (in other modules, I mean)? I'm just saying
that extending IW is a pretty big thing, compared to FacetFields.
I was thinking maybe we can wrap an IW, and present an interface with only
add/updateDoc, so on one hand you get to add {{new FacetField()}} easily, but on
the other you get to use whatever IW you want. It's like RandomIW, only we
may not want to copy the entire set of APIs, just the document-related ones?
And in the future we can think about TaxoFacetIW which also owns the commit()?
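To make the wrapper idea concrete, here's a minimal sketch. To keep it self-contained, all the types below (PlainIndexWriter, FacetIndexerWrapper, the "$facets" field name) are stand-ins I made up, not the real Lucene classes - the point is only the shape: expose just the document-related methods, translate FacetFields before delegating, and stay oblivious to which IW is underneath:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the real Lucene types, so the sketch is self-contained.
class Field {
    final String name;
    final String value;
    Field(String name, String value) { this.name = name; this.value = value; }
}

class FacetField extends Field {
    FacetField(String dim, String label) { super(dim, label); }
}

class Document {
    final List<Field> fields = new ArrayList<>();
    void add(Field f) { fields.add(f); }
}

// Stand-in for IndexWriter: only the one call the sketch needs.
class PlainIndexWriter {
    final List<Document> indexed = new ArrayList<>();
    void addDocument(Document doc) { indexed.add(doc); }
}

// The wrapper: exposes only document-related methods, and rewrites
// FacetFields into their indexable form before delegating.
class FacetIndexerWrapper {
    private final PlainIndexWriter delegate;
    FacetIndexerWrapper(PlainIndexWriter delegate) { this.delegate = delegate; }

    void addDocument(Document doc) {
        Document rewritten = new Document();
        for (Field f : doc.fields) {
            if (f instanceof FacetField) {
                // In the real module this would add drill-down terms and encoded ordinals.
                rewritten.add(new Field("$facets", f.name + "/" + f.value));
            } else {
                rewritten.add(f);
            }
        }
        delegate.addDocument(rewritten);
    }
}
```

The app keeps constructing whatever IW it wants, and only the add/update path goes through the wrapper.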
Another crazy idea is to implement a FacetDocument which extends Document (we'd
need to allow that first!) and does the same trick on addField() /
indexableFields(). I think if we can pull this off (extending Document is
preferred IMO), it's far less intrusive than a new FacetIW.
bq. Wait, the app cannot compute this (accurately) by summing the child counts?
It will overcount in general, right?
Oops, you're right. So in that case, we can open up the extension point (I
think it's FacetIW.dedupAndEncode) to let the app add the DIM ordinal too? Then
I think the rest of the search code will just work?
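As a rough sketch of what such an extension point could let the app do (the class and method here are hypothetical, just to illustrate "add the dimension's own ordinal during dedup"):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical dedupAndEncode-style hook where the app includes the dim ordinal.
class DimOrdinalEncoder {
    // Dedup the ordinals and make sure the dimension's own ordinal is present,
    // so the dim's count can be read directly at search time instead of being
    // (over)estimated by summing child counts.
    static int[] dedupWithDimOrd(int dimOrd, int[] ords) {
        return IntStream.concat(IntStream.of(dimOrd), Arrays.stream(ords))
                        .distinct()
                        .sorted()
                        .toArray();
    }
}
```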
{quote}
There was a user on the ML (I don't have the link) who just wanted to
get the facet count for a specific label after faceting was done, and
the hoops s/he had to jump through (custom FacetResultHandler I
think??) to achieve that was just crazy.
{quote}
Maybe it's crazy, I don't know. All I'm saying is that you've now taken that user's
request and made it a top-level API. I don't know if that qualifies as
simplifying the APIs, or whether similar requests in the future will be easy to
implement by extension, or whether we'd need to add them to our core code. For
instance, this API makes Range and Count not share the same API (not saying
it's a bad thing, just an example). But I already +1'ed the idea of having that on
the API.
bq. Seriously? What abstractions are we missing?
Well, FacetArrays commits to an int[], while we have someone who wants to
use a Map, because he has millions of facets, yet queries typically hit very
few of them. And he uses both the int[] and float[] arrays, which for the most
part are allocated for no good reason. In order to use a Map he needs
to write a FacetResultsHandler which computes top-K from the map, as well as
extend FacetsAccumulator to override createFRH. Just an example. If we had a
FacetValues abstraction w/ get/set/iterator methods, all he'd need to do is
override the FacetValues that should be used.
But I already told him that this is something he can just do w/
FacetResultsHandler. And just to point out that this patch won't make things
worse - since he already overrides the aggregation method, he can just
implement a MapBasedTaxoFacetsSomething. So I only gave it as an example
(FacetValues abstraction, I think, will hurt the performance in general, but we
could benchmark).
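To illustrate the FacetValues idea, here's a rough sketch - the interface and both implementation names are hypothetical, not existing classes. The dense impl is roughly what FacetArrays does today (allocate everything up front), while the sparse impl is what the millions-of-facets user would plug in:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

// Hypothetical abstraction over the per-ordinal aggregation values.
interface FacetValues {
    double get(int ord);
    void increment(int ord, double delta);
    PrimitiveIterator.OfInt ordinals();  // iterate the ordinals that hold values
}

// Dense impl: allocate the full array up front, like FacetArrays does today.
class ArrayFacetValues implements FacetValues {
    private final double[] values;
    ArrayFacetValues(int size) { this.values = new double[size]; }
    public double get(int ord) { return values[ord]; }
    public void increment(int ord, double delta) { values[ord] += delta; }
    public PrimitiveIterator.OfInt ordinals() {
        return IntStream.range(0, values.length).iterator();
    }
}

// Sparse impl: millions of facets overall, but a query touches only a few.
class MapFacetValues implements FacetValues {
    private final Map<Integer, Double> values = new HashMap<>();
    public double get(int ord) { return values.getOrDefault(ord, 0.0); }
    public void increment(int ord, double delta) { values.merge(ord, delta, Double::sum); }
    public PrimitiveIterator.OfInt ordinals() {
        return values.keySet().stream().mapToInt(Integer::intValue).iterator();
    }
}
```

With that in place, the aggregators and FacetResultsHandler would work against FacetValues, and overriding the impl is the only extension point the user needs (this is also where the general-case performance cost I mentioned would show up: a virtual call per ordinal instead of a raw array access).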
Another abstraction is LUCENE-5316, which we're struggling to get to perform
(really frustrating!).
{quote}
If this is too much change for the facets module, we could, instead,
leave the facets module as is, and break this effort out as a
different module (facets2, simplefacets, something?). We also have
many queryparsers, many highlighters, etc., and I think that's
healthy: all options can be explored.
{quote}
-1 to that. I think it delivers the wrong message. Why have two faceting
modules? A branch is the perfect choice here, since it allows us to move the
entire module to the new API. And along the way we benefit by verifying that the
new API can really support associations, complements and sampling.
To give an example - the API at first was overwhelming, and in the last round of
changes I removed and moved things (I thought I was simplifying it!). So for instance,
once upon a time FacetRequest let you specify the FacetsAggregator and
FacetResultsHandler. I moved them both to FacetsAccumulator, but recently moved
FacetsAggregator back to FacetRequest (it makes it easier to define which
aggregator a certain FR needs to use). And on LUCENE-5333 I suggested moving
FacetResultsHandler back to FR, to implement an AllDimFR.
So wearing the "experienced" hat, I just want to make sure we won't move
everything out only to gradually swap it back in. I'm not against changing the
APIs, but because this module is rich with features (which I, and apparently
our users too, think are important), I'd like to make sure the new API works
for all of them. I may compromise on complements and sampling because: (1)
complements is not per-segment and I think it might require a full rewrite
anyway (there's an issue for it), and (2) sampling is under the *.old package,
meaning it was never fully migrated to the new APIs, and I think it could use
some serious simplification there too.
If you're worried that you'll do this work alone, I promise to help. On the
branch.
{quote}
It's a crazy expert thing to create another faceting impl, so I think
such developers can handle changing their rollup to be lazy if it's
beneficial/necessary for their use case.
{quote}
Let's use MapReduce as an analogy. The framework says "you worry about the Map
and Reduce phase, I'll take care of the distribution stuff". Nice and clean. In
the beginning the facets API said "you worry about aggregating a *single*
ordinal, I'll worry about the rest". Then we changed it to "you worry about
aggregating all ordinals in a segment, I'll worry about the rest".
Now, some users we have - well, think of them as data scientists. All they're
interested in is ranking facets to drive some insights out of the data. I
feel that we're putting more and more burden on them. Cutting the API
from the single-ord level to the segment level was done to gain sizable improvements
to faceted search. But letting them do the rollup ... anyway, let's proceed
with that, we can see as we go.
I personally like code reuse, and it kills me to see code that's duplicated.
Often that calls for better API design, not necessarily simplification.
{quote}
I'm not sure it's so easily separated out; the DimConfig is common to
index time and search time, and we're still iterating on that (I just
moved the indexedFieldName into it).
{quote}
I was thinking to develop it still with FacetIndexingParams in mind - just
cut over from FacetFields to FacetIW/FacetDocument. This makes the change
gradual and allows us to investigate it separately. It's also, I think, much
lower-hanging fruit. But it looks like the changes in this patch will require
rewriting large parts of it, so maybe we should do it here.
{quote}
We'd have to change 3 places if we did that right now: FacetIndexWriter
(where we encode) and TaxonomyFacetCounts/SumValueSource (where we
decode).
{quote}
Wrong? I mean, you still haven't written a version which pulls the ords from
OrdinalsCache (and I assume you don't want to get rid of it!). With those two
TFCs, that already means we'll have 4 classes? Though I think the SumValueSource
could be oblivious to how the ords are encoded/fetched out of the box, since
it's less commonly used (we only optimize counting), and someone can rewrite it
to pull the ords from OrdinalsCache directly (instead of going through the CLI
abstraction).
bq. So I think its fine to bake in the encoding, but with the two key methods
in those 20 classes 'protected' in the appropriate places so that an expert
user could subclass them:
I think you point out two problems - one is the fact that we have the
*.encoding package at all, and the other is how to override
encoding/decoding. The nice thing about IntEncoder/Decoder (let's think of it
as PackedInts, ok?) is that you can swap in an implementation without affecting
other classes, while with protected encode/decode methods you need to extend
every class that reads the data. I don't think that qualifies as
good software design. Why do we have PackedInts then?
Now the question is whether we should keep the different IntEncoders or not - I
don't know. Why do we have so many PackedInts versions? Do we use all of them,
or are they there so that someone can make a pick? If they don't hurt anyone,
and require almost no maintenance, why remove them?
The argument is whether we should open up encoding/decoding at all. What Mike
is saying is "if you want to encode differently, you're an expert and you can
rewrite all of your and Lucene's aggregators (count, sum VS) too". What you're
suggesting - adding protected encode/decode methods - is more like how the API
looks today, only instead of a method we have an IntEncoder/Decoder class (a la
PackedInts). That's just easier to use in conjunction with other classes
(especially if we do the CLI abstraction).
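For illustration, here's what a pluggable codec pair might look like. OrdEncoder/OrdDecoder are simplified stand-ins I made up for IntEncoder/IntDecoder (the real ones work over streams), and the dgap+vint scheme is in the spirit of the module's default encoding for sorted ordinals:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for IntEncoder/IntDecoder: a pluggable codec pair,
// so swapping the encoding doesn't require subclassing every reader.
interface OrdEncoder { byte[] encode(int[] sortedOrds); }
interface OrdDecoder { int[] decode(byte[] bytes); }

// One interchangeable implementation: delta-gaps between sorted ordinals,
// each gap written as a VInt (7 bits per byte, high bit means "more bytes").
class DGapVIntCodec implements OrdEncoder, OrdDecoder {
    public byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int gap = ord - prev;  // gaps between sorted ords are small
            prev = ord;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    public int[] decode(byte[] bytes) {
        List<Integer> ords = new ArrayList<>();
        int prev = 0, i = 0;
        while (i < bytes.length) {
            int gap = 0, shift = 0;
            byte b;
            do {
                b = bytes[i++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap;
            ords.add(prev);
        }
        return ords.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

An expert plugs in a different codec instance; none of the classes that read the data need to be extended.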
Just to clarify, I don't mean we have to have the CLI abstraction and use it
everywhere. I think it's a good API for someone to use if he doesn't care about
how ordinals are fetched (BDV, CachedOrds, whatever) as well as how they are
encoded/decoded. It could very well be a CLITaxoFacetCounts impl which lets you
pass the CLI you want to use (BDVCLI, CachedOrdsCLI, MySuperExpertDecodingCLI).
That's it. Most users won't know about it, but the few who do won't need to
rewrite so many classes.
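To show the kind of decoupling I mean, here's a toy sketch. CategoryListIterator below is a simplified one-method stand-in (the real one works per-segment over streams), and CLIFacetCounts is hypothetical - the point is that the counting code is written only against the abstraction, oblivious to whether the ords come from BDV, CachedOrds, or a custom decoder:

```java
// Simplified stand-in for a CategoryListIterator-style abstraction:
// "give me the ordinals for this doc", however they are fetched/decoded.
interface CategoryListIterator {
    int[] ordinalsForDoc(int docId);
}

// A counts impl that only knows the abstraction.
class CLIFacetCounts {
    private final int[] counts;
    CLIFacetCounts(int numOrds) { counts = new int[numOrds]; }

    void countDocs(CategoryListIterator cli, int... matchingDocs) {
        for (int doc : matchingDocs) {
            for (int ord : cli.ordinalsForDoc(doc)) {
                counts[ord]++;
            }
        }
    }

    int count(int ord) { return counts[ord]; }
}
```

BDVCLI, CachedOrdsCLI or MySuperExpertDecodingCLI would each just be a different implementation of the interface passed to the same counts class.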
> Simplify the facet module APIs
> ------------------------------
>
> Key: LUCENE-5339
> URL: https://issues.apache.org/jira/browse/LUCENE-5339
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-5339.patch, LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder). I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
> * Instead of *FacetRequest/Params/Result, you directly instantiate
> the classes that do facet counting (currently TaxonomyFacetCounts,
> RangeFacetCounts or SortedSetDVFacetCounts), passing in the
> SimpleFacetsCollector, and then you interact with those classes to
> pull labels + values (topN under a path, sparse, specific labels).
> * At index time, no more FacetIndexingParams/CategoryListParams;
> instead, you make a new SimpleFacetFields and pass it the field it
> should store facets + drill downs under. If you want more than
> one CLI you create more than one instance of SimpleFacetFields.
> * I added a simple schema, where you state which dimensions are
> hierarchical or multi-valued. From this we decide how to index
> the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc. This is just a start ...
--
This message was sent by Atlassian JIRA
(v6.1#6144)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]