[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs

Michael McCandless (JIRA) Tue, 19 Nov 2013 04:37:53 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826439#comment-13826439
 ]


Michael McCandless commented on LUCENE-5339:
--------------------------------------------

{quote}
bq. I'm not sure how this can work, since in order to write the ords we need to 
see all FacetFields? Ie, at what point would we compile all the FacetFields 
into the BDV field?

I was thinking when Doc.indexableFields() is called?
{quote}

Hmm, that's iffy.  First off, IW also calls .storableFields(), and
it's not defined which will be called first, and we need to add both
storable and indexable fields.

Second off, Document has a longer lifecycle, e.g. one can reuse it,
reuse Field instances in it, etc., and I don't think we should alter
it in-place (remove FacetFields, add new fields).

Maybe Document should have a "rewrite" method that IW calls to get
the "actual" document to index?  The default would just "return this".

bq. Maybe that's the wrong extension point, but what I had in mind is something 
similar to what FacetFields does today – it adds the categories to the 
TaxoIndex and receives their ordinal. Then it calls a CategoryListBuilder which 
asks for the parent of an ordinal until it hits ROOT (depending on OrdPolicy of 
course). I mentioned dedupAndEncode because I thought it does something like 
that (i.e. that you've inlined CategoryListBuilder in FacetIW). If it's not, 
then whatever method that does that ... and if there is none, let's wrap it in 
an overridable method?

Really this is just adding complexity for a minor gain?
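
For reference, the parent walk described in the quoted text (ask for
the parent of an ordinal until ROOT is hit) is roughly the following.
This is a sketch under assumptions: the taxonomy is reduced to a
plain int[] of parent ordinals with ROOT at ordinal 0, every ancestor
is kept (i.e. no OrdinalPolicy filtering), and ParentWalkSketch and
withAncestors are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the taxonomy is just an int[] of parent ordinals.
final class ParentWalkSketch {
  static final int ROOT = 0;

  /** Returns ord followed by its ancestors, stopping before ROOT. */
  static List<Integer> withAncestors(int ord, int[] parents) {
    List<Integer> out = new ArrayList<>();
    for (int cur = ord; cur != ROOT; cur = parents[cur]) {
      out.add(cur);
    }
    return out;
  }

  public static void main(String[] args) {
    // 0 = ROOT, 1 = Date, 2 = Date/2013, 3 = Date/2013/Nov
    int[] parents = {-1, 0, 1, 2};
    System.out.println(withAncestors(3, parents));
  }
}
```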

{quote}
As I said, let's divide that into two problems: API and optimization. For API, 
we can stick w/ CategoryListIterator and implement both a 
DGapVIntBinaryDVIterator as well as OrdinalsCacheIterator. That way, 
FacetsSomething (do we have a name yet? Is it just Facets?) can use a CLI if 
they don't care where the ordinals come from.

For optimization, we do a FastFacetCounts which inlines dgap+vint and reads 
from BDV, and we can also do a CachedOrdsFacetCounts which inlines the 
interaction with OrdinalsCache. Actually, if we provide these two, we can skip 
the third FacetCounts (uses CLI), as it will be for demo purposes only given 
current encoding. If anyone changes the encoding, he can write a FacetCounts. 
Also, we can always add it later ...

The rest of the Facets (i.e. non-counts) should IMO at this point use the CLI 
abstraction. If anyone wants to optimize a SumValueSourceFacets, he can do so 
however he wants. But the CLI is the abstraction I'm thinking – it only has two 
methods: setNextReader and getOrdinals(int doc).
{quote}

OK, I added an OrdinalsReader abstraction, and CachedOrdinalsReader
(holds all decoded ords in shared int[]), and a
DocValuesOrdinalsReader with a protected decode() method that a
subclass could customize.  FastTaxonomyFacetCounts specializes the DV
decode.
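
A rough, self-contained sketch of the shape of this follows.  It is
not the code in the patch: the class and method names
(BytesOrdinalsReader, CachedOrdinals, encode, getOrdinals) are
invented, per-document bytes are a plain byte[][] instead of a
BinaryDocValues field, and the byte layout (low-order 7 bits first,
high bit as continuation flag) is an assumption that may not match
the facet module's actual encoding.  It only shows a dgap+vint decode
behind a protected decode() and a cache that holds all decoded ords
in one shared int[].

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch: names and byte layout are assumptions.
final class OrdinalsDecodeSketch {

  /** Writes sorted, non-negative ordinals as deltas, each as a variable-length int. */
  static byte[] encode(int[] sortedOrds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrds) {
      int delta = ord - prev;
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80); // 7 payload bits + continuation flag
        delta >>>= 7;
      }
      out.write(delta);
      prev = ord;
    }
    return out.toByteArray();
  }

  /** Reader whose decode step a subclass could replace for another encoding. */
  static class BytesOrdinalsReader {
    private final byte[][] perDocBytes;

    BytesOrdinalsReader(byte[][] perDocBytes) {
      this.perDocBytes = perDocBytes;
    }

    final int[] getOrdinals(int doc) {
      return decode(perDocBytes[doc]);
    }

    /** Reverses encode(): reads vints and undoes the delta coding. */
    protected int[] decode(byte[] bytes) {
      int[] ords = new int[bytes.length]; // each ordinal takes at least one byte
      int count = 0, prev = 0, value = 0, shift = 0;
      for (byte b : bytes) {
        value |= (b & 0x7F) << shift;
        if ((b & 0x80) != 0) {
          shift += 7; // more bytes follow for this value
        } else {
          prev += value; // delta -> absolute ordinal
          ords[count++] = prev;
          value = 0;
          shift = 0;
        }
      }
      return Arrays.copyOf(ords, count);
    }
  }

  /** Decodes every document once into one shared int[] plus per-doc offsets. */
  static final class CachedOrdinals {
    final int[] ords;
    final int[] offsets; // ords[offsets[doc] .. offsets[doc + 1]) belong to doc

    CachedOrdinals(BytesOrdinalsReader reader, int maxDoc) {
      offsets = new int[maxDoc + 1];
      int[] all = new int[0];
      for (int doc = 0; doc < maxDoc; doc++) {
        int[] docOrds = reader.getOrdinals(doc);
        int start = offsets[doc];
        all = Arrays.copyOf(all, start + docOrds.length);
        System.arraycopy(docOrds, 0, all, start, docOrds.length);
        offsets[doc + 1] = all.length;
      }
      ords = all;
    }

    int[] getOrdinals(int doc) {
      return Arrays.copyOfRange(ords, offsets[doc], offsets[doc + 1]);
    }
  }

  public static void main(String[] args) {
    byte[][] docs = {encode(new int[] {5, 300, 100000}), encode(new int[] {7})};
    BytesOrdinalsReader reader = new BytesOrdinalsReader(docs);
    CachedOrdinals cached = new CachedOrdinals(reader, docs.length);
    System.out.println(Arrays.toString(reader.getOrdinals(0)));
    System.out.println(Arrays.toString(cached.getOrdinals(1)));
  }
}
```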

{quote}
bq. I think this is a precarious balance. If a little code dup can greatly 
simplify the APIs, then that's the better tradeoff.

In general I agree. It then becomes what's considered little vs a lot of code 
dup. I think that dgap+vint + rollup is not little (put together), as well as 
making the decision to rollup. But at this point I don't mind .. let's force 
code dup, and then simplify if users are angry.
{quote}

I pulled out a base class (TaxonomyFacets) for all the taxonomy based
facet methods, to share some code.
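
Exactly what gets shared is in the patch, not in this comment, so the
following is only a guess at the kind of code involved: a base class
that owns the parent array and the rollup, with subclasses deciding
what each hit contributes.  All names here (TaxoFacetsBase,
CountsSketch, SumSketch, collect, rollup) are hypothetical, and the
rollup assumes a parent always has a smaller ordinal than its
children and that only the leaf ordinals were indexed.

```java
// Hypothetical sketch: not the TaxonomyFacets class from the patch.
final class TaxonomyBaseSketch {

  abstract static class TaxoFacetsBase {
    protected final int[] parents; // parents[ord] = parent ordinal, ROOT is 0
    protected final int[] values;

    TaxoFacetsBase(int[] parents) {
      this.parents = parents;
      this.values = new int[parents.length];
    }

    /** Each subclass decides what one hit on an ordinal contributes. */
    abstract void collect(int ord, int weight);

    /** Shared: folds each ordinal's value into its parent, children first. */
    final void rollup() {
      for (int ord = parents.length - 1; ord > 0; ord--) {
        values[parents[ord]] += values[ord];
      }
    }

    final int value(int ord) {
      return values[ord];
    }
  }

  /** Counts hits, ignoring the weight. */
  static final class CountsSketch extends TaxoFacetsBase {
    CountsSketch(int[] parents) {
      super(parents);
    }

    @Override
    void collect(int ord, int weight) {
      values[ord]++;
    }
  }

  /** Sums a per-hit weight instead of counting. */
  static final class SumSketch extends TaxoFacetsBase {
    SumSketch(int[] parents) {
      super(parents);
    }

    @Override
    void collect(int ord, int weight) {
      values[ord] += weight;
    }
  }

  public static void main(String[] args) {
    // 0 = ROOT, 1 = Date, 2 = Date/2013, 3 = Date/2014
    int[] parents = {-1, 0, 1, 1};
    CountsSketch counts = new CountsSketch(parents);
    counts.collect(2, 10);
    counts.collect(2, 10);
    counts.collect(3, 5);
    counts.rollup();
    System.out.println("Date count = " + counts.value(1));
  }
}
```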


> Simplify the facet module APIs
> ------------------------------
>
>                 Key: LUCENE-5339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5339
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5339.patch, LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder).  I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
>   * Instead of *FacetRequest/Params/Result, you directly instantiate
>     the classes that do facet counting (currently TaxonomyFacetCounts,
>     RangeFacetCounts or SortedSetDVFacetCounts), passing in the
>     SimpleFacetsCollector, and then you interact with those classes to
>     pull labels + values (topN under a path, sparse, specific labels).
>   * At index time, no more FacetIndexingParams/CategoryListParams;
>     instead, you make a new SimpleFacetFields and pass it the field it
>     should store facets + drill downs under.  If you want more than
>     one CLI you create more than one instance of SimpleFacetFields.
>   * I added a simple schema, where you state which dimensions are
>     hierarchical or multi-valued.  From this we decide how to index
>     the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc.  This is just a start ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)

