[
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821139#comment-13821139
]
Shai Erera commented on LUCENE-5339:
------------------------------------
Some comments about the patch:
* You have a TODO file which seems to have been 'svn added' - are you aware of
it? Just making sure, so it's not accidentally committed :)
* Maybe we should do this work in a branch and avoid the .simple package? It
seems that the majority of the patch is about copy-pasting classes over to
.simple, which I assume you did so you can work in isolation and have tests
pass?
* FieldTypes:
** Can FT.FieldType provide a ctor taking these arguments and then
DEFAULT_FIELD_TYPE pass (false,false)? Or, init those two members to false.
** I wonder if FieldTypes won't confuse users w/ Field.FieldType, so maybe you
should name it DimType or something? And the upper class FacetsConfig?
** Maybe instead of setHierarchical + setMultiValued you do
setDimType(DimType)? Then you could avoid the synchronization?
** Not sure, but I think that SimpleFacetFields adds the dimension's ordinal if
the dim is both hierarchical and multi-valued? That's a step back from the
default ALL_BUT_DIM that we have today. I think we might want to have a
requiresDimValue or something, because I do think the dimension's value (count)
is most often unneeded, and it's a waste to encode its ordinal?
* Constants isn't documented, not sure if 'precommit' will like that, but in
general I think the constants should have jdocs. Maybe put a nocommit?
* TaxonomyFacetCounts
** .getAllDims() -- If a dimension is not hierarchical, I think
SimpleFacetResult.count == 0? In that case, sorting by its count is useless?
*** I think it's also relevant for SortedSet?
* LabelAndValue has a Number as the value rather than double. I see this class
is used only in the final step (labeling), but what's wrong w/ the previous
'double' primitive? Is it to avoid casting or something?
On the approach in general - I personally don't think the current API is
overwhelming, but looking at your patch, I do think it could use some
simplification. But all in all, it allows an app extend the facets capabilities
in ways that do not force it to duplicate a lot of code. Let me explain:
+CategoryListIterator+
That's the piece of code which is responsible for decoding the category list.
It's currently used for two purposes: (1) allow you to encode the categories in
different formats (using IntEncoder/Decoder abstraction) as well as load the
categories from a different source. E.g. that's how we were able to test fast
cutting over facets to DV from Payload, that's a nice abstraction for letting
you load the values from a cache, and so forth.
If you want to specialize code, you can do what FastCountingFacetsAggregator
does -- if you encoded w/ DGapVInt and you want counting, it doesn't go through
the CLI API. But if you look at SumScoreFacetsAggregator, there's no reason for
it to be limited to just DGapVInt encoding, or that you (the app) won't be able
to pull the ordinals from a cache, without rewriting SumScoreFacetsAggregator.
Hack, even though it's not implemented yet, we could implement a SortedSetCLI
(as far as I can tell) which uses SortedSetDVReaderState to pull the ordinals
of document. It's really a useful interface.
So IMO the CLI interface is important. Most apps, I believe, don't extend the
facets module (they'll use counting and maybe aggregate by ValueSource
pre-built classes). But those that do want to add another aggregation method
(e.g. ValueSourceFacetsAggregator), can simply use the CLI API, or copy-paste
the DGapVInt decode logic if they want to specialize. Especially that that code
is not that trivial, and we've spent a lot of time optimizing it. And the few
apps that want to explore other encoding/decoding logic can write their own CLI.
Without that interface, if e.g. you want to encode the ordinals differently, or
load them from a cache (e.g. CachedOrds), you'd have to rewrite *every*
aggregation method that you want to use. And most probably you'll copy-paste
the majority of the code. So why let go of the interface?
+FacetsCollector+
It's pretty much your SimpleFC, only it requires to get a FacetsAccumulator.
That was done as a convenience so that apps can send in a collector and then
call fc.getFacetResults. But its essense is really the fact that it collects
MatchingDocs per segment, which the aggregators later use to aggreate the faces
in those matching docs.
Given the changes in this patch, +1 for making it just return the
List<MatchingDocs> and let whatever API we decide on to use that list, instead
of FC.getFacetResults().
+FacetsAggregator+
This class lets you implement the aggregation function, be it count, sum-score,
value-source or by-association. I think it's equivalent to TaxoFacetCounts,
e.g. in your approach, if anyone will want to aggregate the ordinals by other
than count, that's what they will need to write.
What I like about the approach in your patch is that (and I had similar
thoughts on the last round of API changes) it "reverses" the flow. The app says
"I want to count the ords in that field, and sum-score the ords in that field"
and then the app can ask for the aggregated values from each such aggregator.
Whereas today, app does "new CFR(), new CFR(), new CFR(), new SumScore()" and
then receives back the List<FacetResult>. Today's API is convenient in that it
allows you to pass a List<FR> around in your code and get back a
List<FacetResult>. But I don't see this convenience getting away (you'll just
pass a List<FacetsAggregator> and then pull the requested values later on).
Also, you sort of pulled the specific requests, such as .getSpecificCount(CP)
and .getAllDims() up to FacetsAggregator, where today this can be done by
FacetResultsHandler (you'd have to write one, as I did on LUCENE-5333).
What I like less about it is that it folds in the logic coded by
FacetsAccumulator and FacetResultsHandler:
* FacetsAccumulator
** Invokes FacetsAggregator to aggregate the facets in MatchingDocs
** Whether the request specifies a CP for which we need to rollupValues, it
asks the aggregator to do so
** It uses a FacetResultsHandler to compute the top-K facets for each request.
* FacetResultsHandler
** Computes top-K values for each parent category
** Has variants that return a full sub-tree (TopKInEachNodeHandler)
** By means of extension, allows you to get the value of a specific
CategoryPath, as well as get values for all dimensions (as I've shown on
LUCENE-5333).
So I'm torn here. I really like how the API lets you extend things, and only
the things that you care about. If you only want to implement a
ValueSourceFacetsAggregator, you shouldn't be worried about whether or not to
call rollupValues or how to compute top-K. That seems redundant. You should
focus on what you want to extend. I also like less the fact that now every
TaxoFacetsSomething will need to be aware of the facets configuration...
But I do like the approach presented in the patch, so I wonder if there's
anything we can do to preserve the focused extension points on one hand, yet
simplify the app's integration with the API on the other hand. Like, if an app
did something like this:
{code}
FacetsAccumulator fa = new TaxoFA(..., matchingDocs, new
CountingFacetsAggregator("dvFacets1", [CategoryListIterator]),
[FacetResultsHandler]); // or SortedSetFA()
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions(); // does not depend on the aggregation function, only
needs the Taxonomy and int[]/float[] array source.
fa = new TaxoFA(..., matchingDocs, new SumScoreFacetsAggregator("dvFacets2",
[CategoryListIterator]), [FacetResultsHandler]); // or SortedSetFA()
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions();
fa = new TaxoFA(..., matchingDocs, new
SumIntAssociationFacetsAggregator("dvFacets3"), [FacetResultsHandler]); // no
CLI or SortedSetFA
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions();
RangeAccumulator range = new RangeAccumulator(..., matchingDocs, ranges); //
RangeAccumulator might not even extend FacetsAccumulator
range.getTopK(dimension);
// I think the rest of the getters make no sense?
{code}
The idea is that internally, FacetsAccumulator uses FacetsAggregator to do the
aggregation and rollup if needed, and you can optionally pass a
FacetResultsHandler, while it defaults to a FlatFacetResultsHandler (today's
silly name DepthOneFRH). You can also pass your CategoryListIterator to the
FacetsAggregator (those that need it). That way, app's code looks similar to
the one in the patch, only we allow more code reuse between different
aggregation functions.
I hope this will allow us to support other aggregation functions by SortedSet
too, as there's really no reason why not to do it. There are two differences
between SortedSet and TaxoIndex: (1) the Taxonomy implementation (where you
pull the children of an ordinal from, how you do labeling etc.) and (2) where
the ordinals are encoded in the search index (BinaryDV in TaxoIndex case,
SortedSetDV in SortedSet case). The latter can easily be solved by a CLI
implementation of SortedSet and former by an abstract Taxonomy (read-only) API
which both SortedSet and TaxoIndex implement. We should explore these on a
separate issue though.
I think that the problem with your current patch is that the aggregation is
done in the constructor, so it sort of eliminates reasonable ways to extend
things. But if things were done differently, we could have an abstract
FacetsAggregator instead of FacetsAccumulator which let you implement only the
.aggregate() method, and take care of rollup + top-K computation itself. But
that means you'd need to initialize the class and then call a .aggregate() or
.compute() before you call any of the .getTopK() for instance (unless we check
in every .getTopK() if it's already computed...).
+FacetArrays+
Unfortunately, there is no efficient way to hold either an int or float in
Java, without specifying the type of the array (i.e. long[] or double[] are too
expensive), so this class abstracts the way facets are aggregated, so that we
can have different FacetAggregators implementations, again, reusing the
majority of the irrelevant code (top-K computation, rollup,
CategoryListIterator...).
I don't know if we can get rid of it... especially as there might be
aggregators that are interested in both arrays (e.g. avgOfSomething). Maybe if
we have a FacetsAccumulator abstract class, we can have an IntFA and FloatFA
abstract classes which give you access to an int[]/float[] arrays, and matching
FacetResultsHandler?
+FacetIndexingParams+
By all means, I'm +1 for simplifying this class and make it more user-friendly.
I like how you can specify a DimType (hierarchical, singleValue...) instead of
setting the more arbitrary OrdinalPolicy. I do think this class has some value
today, e.g. in CategoryListParams (which you nuked), the app doesn't need to
know about the field under which facets are indexed at all, only if it cares to
change it or index multiple category lists. Not sure it's a blocker, just
pointing that out. On the plus side - it makes app more aware about what's
going on in its index...
I don't have strong feelings about this class, we can rename it to FacetsConfig
or whatever, but let's put the configuration parameters under it (e.g. DimType
and drillDownDelimiter)?
> Simplify the facet module APIs
> ------------------------------
>
> Key: LUCENE-5339
> URL: https://issues.apache.org/jira/browse/LUCENE-5339
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder). I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
> * Instead of *FacetRequest/Params/Result, you directly instantiate
> the classes that do facet counting (currently TaxonomyFacetCounts,
> RangeFacetCounts or SortedSetDVFacetCounts), passing in the
> SimpleFacetsCollector, and then you interact with those classes to
> pull labels + values (topN under a path, sparse, specific labels).
> * At index time, no more FacetIndexingParams/CategoryListParams;
> instead, you make a new SimpleFacetFields and pass it the field it
> should store facets + drill downs under. If you want more than
> one CLI you create more than one instance of SimpleFacetFields.
> * I added a simple schema, where you state which dimensions are
> hierarchical or multi-valued. From this we decide how to index
> the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc. This is just a start ...
--
This message was sent by Atlassian JIRA
(v6.1#6144)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]