[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs

Shai Erera (JIRA) Wed, 13 Nov 2013 02:02:12 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821139#comment-13821139
 ]


Shai Erera commented on LUCENE-5339:
------------------------------------

Some comments about the patch:

* You have a TODO file which seems to have been 'svn added' - are you aware of 
it? Just making sure, so it's not accidentally committed :)

* Maybe we should do this work in a branch and avoid the .simple package? It 
seems that the majority of the patch is about copy-pasting classes over to 
.simple, which I assume you did so you can work in isolation and have tests 
pass?

* FieldTypes:
** Can FT.FieldType provide a ctor taking these arguments and then 
DEFAULT_FIELD_TYPE pass (false,false)? Or, init those two members to false.
** I wonder if FieldTypes won't confuse users w/ Field.FieldType, so maybe you 
should name it DimType or something? And the upper class FacetsConfig?
** Maybe instead of setHierarchical + setMultiValued you do 
setDimType(DimType)? Then you could avoid the synchronization?
** Not sure, but I think that SimpleFacetFields adds the dimension's ordinal if 
the dim is both hierarchical and multi-valued? That's a step back from the 
default ALL_BUT_DIM that we have today. I think we might want to have a 
requiresDimValue or something, because I do think the dimension's value (count) 
is most often unneeded, and it's a waste to encode its ordinal?

* Constants isn't documented, not sure if 'precommit' will like that, but in 
general I think the constants should have jdocs. Maybe put a nocommit?

* TaxonomyFacetCounts
** .getAllDims() -- If a dimension is not hierarchical, I think 
SimpleFacetResult.count == 0? In that case, sorting by its count is useless?
*** I think it's also relevant for SortedSet?

* LabelAndValue has a Number as the value rather than double. I see this class 
is used only in the final step (labeling), but what's wrong w/ the previous 
'double' primitive? Is it to avoid casting or something?

On the approach in general - I personally don't think the current API is 
overwhelming, but looking at your patch, I do think it could use some 
simplification. But all in all, it allows an app extend the facets capabilities 
in ways that do not force it to duplicate a lot of code. Let me explain:

+CategoryListIterator+

That's the piece of code which is responsible for decoding the category list. 
It's currently used for two purposes: (1) allow you to encode the categories in 
different formats (using IntEncoder/Decoder abstraction) as well as load the 
categories from a different source. E.g. that's how we were able to test fast 
cutting over facets to DV from Payload, that's a nice abstraction for letting 
you load the values from a cache, and so forth.

If you want to specialize code, you can do what FastCountingFacetsAggregator 
does -- if you encoded w/ DGapVInt and you want counting, it doesn't go through 
the CLI API. But if you look at SumScoreFacetsAggregator, there's no reason for 
it to be limited to just DGapVInt encoding, or that you (the app) won't be able 
to pull the ordinals from a cache, without rewriting SumScoreFacetsAggregator.

Hack, even though it's not implemented yet, we could implement a SortedSetCLI 
(as far as I can tell) which uses SortedSetDVReaderState to pull the ordinals 
of document. It's really a useful interface.

So IMO the CLI interface is important. Most apps, I believe, don't extend the 
facets module (they'll use counting and maybe aggregate by ValueSource 
pre-built classes). But those that do want to add another aggregation method 
(e.g. ValueSourceFacetsAggregator), can simply use the CLI API, or copy-paste 
the DGapVInt decode logic if they want to specialize. Especially that that code 
is not that trivial, and we've spent a lot of time optimizing it. And the few 
apps that want to explore other encoding/decoding logic can write their own CLI.

Without that interface, if e.g. you want to encode the ordinals differently, or 
load them from a cache (e.g. CachedOrds), you'd have to rewrite *every* 
aggregation method that you want to use. And most probably you'll copy-paste 
the majority of the code. So why let go of the interface?

+FacetsCollector+

It's pretty much your SimpleFC, only it requires to get a FacetsAccumulator. 
That was done as a convenience so that apps can send in a collector and then 
call fc.getFacetResults. But its essense is really the fact that it collects 
MatchingDocs per segment, which the aggregators later use to aggreate the faces 
in those matching docs.

Given the changes in this patch, +1 for making it just return the 
List<MatchingDocs> and let whatever API we decide on to use that list, instead 
of FC.getFacetResults().

+FacetsAggregator+

This class lets you implement the aggregation function, be it count, sum-score, 
value-source or by-association. I think it's equivalent to TaxoFacetCounts, 
e.g. in your approach, if anyone will want to aggregate the ordinals by other 
than count, that's what they will need to write.

What I like about the approach in your patch is that (and I had similar 
thoughts on the last round of API changes) it "reverses" the flow. The app says 
"I want to count the ords in that field, and sum-score the ords in that field" 
and then the app can ask for the aggregated values from each such aggregator. 
Whereas today, app does "new CFR(), new CFR(), new CFR(), new SumScore()" and 
then receives back the List<FacetResult>. Today's API is convenient in that it 
allows you to pass a List<FR> around in your code and get back a 
List<FacetResult>. But I don't see this convenience getting away (you'll just 
pass a List<FacetsAggregator> and then pull the requested values later on). 
Also, you sort of pulled the specific requests, such as .getSpecificCount(CP) 
and .getAllDims() up to FacetsAggregator, where today this can be done by 
FacetResultsHandler (you'd have to write one, as I did on LUCENE-5333).

What I like less about it is that it folds in the logic coded by 
FacetsAccumulator and FacetResultsHandler:

* FacetsAccumulator
** Invokes FacetsAggregator to aggregate the facets in MatchingDocs
** Whether the request specifies a CP for which we need to rollupValues, it 
asks the aggregator to do so
** It uses a FacetResultsHandler to compute the top-K facets for each request.

* FacetResultsHandler
** Computes top-K values for each parent category
** Has variants that return a full sub-tree (TopKInEachNodeHandler)
** By means of extension, allows you to get the value of a specific 
CategoryPath, as well as get values for all dimensions (as I've shown on 
LUCENE-5333).

So I'm torn here. I really like how the API lets you extend things, and only 
the things that you care about. If you only want to implement a 
ValueSourceFacetsAggregator, you shouldn't be worried about whether or not to 
call rollupValues or how to compute top-K. That seems redundant. You should 
focus on what you want to extend. I also like less the fact that now every 
TaxoFacetsSomething will need to be aware of the facets configuration...

But I do like the approach presented in the patch, so I wonder if there's 
anything we can do to preserve the focused extension points on one hand, yet 
simplify the app's integration with the API on the other hand. Like, if an app 
did something like this:

{code}
FacetsAccumulator fa = new TaxoFA(..., matchingDocs, new 
CountingFacetsAggregator("dvFacets1", [CategoryListIterator]), 
[FacetResultsHandler]); // or SortedSetFA()
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions(); // does not depend on the aggregation function, only 
needs the Taxonomy and int[]/float[] array source.

fa = new TaxoFA(..., matchingDocs, new SumScoreFacetsAggregator("dvFacets2", 
[CategoryListIterator]), [FacetResultsHandler]); // or SortedSetFA()
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions();

fa = new TaxoFA(..., matchingDocs, new 
SumIntAssociationFacetsAggregator("dvFacets3"), [FacetResultsHandler]); // no 
CLI or SortedSetFA
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions();

RangeAccumulator range = new RangeAccumulator(..., matchingDocs, ranges); // 
RangeAccumulator might not even extend FacetsAccumulator
range.getTopK(dimension);
// I think the rest of the getters make no sense?
{code}

The idea is that internally, FacetsAccumulator uses FacetsAggregator to do the 
aggregation and rollup if needed, and you can optionally pass a 
FacetResultsHandler, while it defaults to a FlatFacetResultsHandler (today's 
silly name DepthOneFRH). You can also pass your CategoryListIterator to the 
FacetsAggregator (those that need it). That way, app's code looks similar to 
the one in the patch, only we allow more code reuse between different 
aggregation functions.

I hope this will allow us to support other aggregation functions by SortedSet 
too, as there's really no reason why not to do it. There are two differences 
between SortedSet and TaxoIndex: (1) the Taxonomy implementation (where you 
pull the children of an ordinal from, how you do labeling etc.) and (2) where 
the ordinals are encoded in the search index (BinaryDV in TaxoIndex case, 
SortedSetDV in SortedSet case). The latter can easily be solved by a CLI 
implementation of SortedSet and former by an abstract Taxonomy (read-only) API 
which both SortedSet and TaxoIndex implement. We should explore these on a 
separate issue though.

I think that the problem with your current patch is that the aggregation is 
done in the constructor, so it sort of eliminates reasonable ways to extend 
things. But if things were done differently, we could have an abstract 
FacetsAggregator instead of FacetsAccumulator which let you implement only the 
.aggregate() method, and take care of rollup + top-K computation itself. But 
that means you'd need to initialize the class and then call a .aggregate() or 
.compute() before you call any of the .getTopK() for instance (unless we check 
in every .getTopK() if it's already computed...).

+FacetArrays+

Unfortunately, there is no efficient way to hold either an int or float in 
Java, without specifying the type of the array (i.e. long[] or double[] are too 
expensive), so this class abstracts the way facets are aggregated, so that we 
can have different FacetAggregators implementations, again, reusing the 
majority of the irrelevant code (top-K computation, rollup, 
CategoryListIterator...).

I don't know if we can get rid of it... especially as there might be 
aggregators that are interested in both arrays (e.g. avgOfSomething). Maybe if 
we have a FacetsAccumulator abstract class, we can have an IntFA and FloatFA 
abstract classes which give you access to an int[]/float[] arrays, and matching 
FacetResultsHandler?

+FacetIndexingParams+

By all means, I'm +1 for simplifying this class and make it more user-friendly. 
I like how you can specify a DimType (hierarchical, singleValue...) instead of 
setting the more arbitrary OrdinalPolicy. I do think this class has some value 
today, e.g. in CategoryListParams (which you nuked), the app doesn't need to 
know about the field under which facets are indexed at all, only if it cares to 
change it or index multiple category lists. Not sure it's a blocker, just 
pointing that out. On the plus side - it makes app more aware about what's 
going on in its index...

I don't have strong feelings about this class, we can rename it to FacetsConfig 
or whatever, but let's put the configuration parameters under it (e.g. DimType 
and drillDownDelimiter)?

> Simplify the facet module APIs
> ------------------------------
>
>                 Key: LUCENE-5339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5339
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder).  I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
>   * Instead of *FacetRequest/Params/Result, you directly instantiate
>     the classes that do facet counting (currently TaxonomyFacetCounts,
>     RangeFacetCounts or SortedSetDVFacetCounts), passing in the
>     SimpleFacetsCollector, and then you interact with those classes to
>     pull labels + values (topN under a path, sparse, specific labels).
>   * At index time, no more FacetIndexingParams/CategoryListParams;
>     instead, you make a new SimpleFacetFields and pass it the field it
>     should store facets + drill downs under.  If you want more than
>     one CLI you create more than one instance of SimpleFacetFields.
>   * I added a simple schema, where you state which dimensions are
>     hierarchical or multi-valued.  From this we decide how to index
>     the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc.  This is just a start ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs

Reply via email to