[jira] [Updated] (SOLR-5963) Finalize interface and backport analytics component to 4x

Hoss Man (JIRA) Mon, 14 Apr 2014 15:28:44 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hoss Man updated SOLR-5963:
---------------------------

    Attachment: SOLR-5963.patch

Erick: I'm pretty sure i'm more blind than you as far as this code -- i just 
started looking at it today :)

In any case, last Friday i was reading the docs on this feature and playing 
around with some analytics requests using the example data.  Here's an 
example URL that shows what i was talking about as far as the NamedList stuff 
(note: you have to modify the schema to make "cat" use docValues for the facet 
stuff in the AnalyticsComponent to work) ...

http://localhost:8983/solr/select?o.hoss.qf=qfacet&o.hoss.qf.qfacet.query=name:apple&facet.field=cat&indent=true&wt=json&q=popularity:[*%20TO%20*]&facet=true&o.hoss.s.price_count=count%28price%29&fl=id,price,popularity&rows=1&olap=true&o.hoss.rangefacet=price&o.hoss.s.min_price=min%28price%29&o.hoss.s.max_price=max%28price%29&o.hoss.s.stddev_price=stddev%28price%29&o.hoss.rangefacet.price.gap=result%28stddev_price%29&o.hoss.rangefacet.price.start=result%28min_price%29&o.hoss.rangefacet.price.end=result%28max_price%29&o.hoss.fieldfacet.cat.ss=yak&o.hoss.fieldfacet=cat&o.hoss.s.foo=sum%28price%29&o.hoss.s.yak=sum%28popularity%29

With your existing "use SimpleOrderedMap everywhere" approach, the results 
include sections like this...

{noformat}
...
      "fieldFacets":{
        "cat":{
          "electronics":{
            "foo":2772.3200187683105,
            "max_price":649.99,
            "min_price":11.5,
            "price_count":11,
            "stddev_price":196.77880239587424,
            "yak":63.0},
          "graphics card":{
            "foo":1129.9400024414062,
            "max_price":649.99,
...
      "rangeFacets":{
        "price":{
          "[0.0 TO 539.1075)":{
            "foo":2402.280040740967,
            "max_price":479.95,
            "min_price":0.0,
...
{noformat}

...which means that clients parsing that JSON will wind up with unordered maps 
for each of those \{...\} blocks -- so the facet term results lose their 
ordering (the sort requested by "o.hoss.fieldfacet.cat.ss=yak" is effectively 
ignored) and the ranges aren't guaranteed to be in order.

i started poking around in the code and cobbled together this revised patch, 
which uses NamedLists for those two situations, and makes the corresponding 
sections of the results look like this...

{noformat}
...
      "fieldFacets":{
        "cat":[
          "electronics",{
            "foo":2772.3200187683105,
            "max_price":649.99,
            "min_price":11.5,
            "price_count":11,
            "stddev_price":196.77880239587424,
            "yak":63.0},
          "graphics card",{
            "foo":1129.9400024414062,
            "max_price":649.99,
...
      "rangeFacets":{
        "price":[
          "[0.0 TO 539.1075)",{
            "foo":2402.280040740967,
            "max_price":479.95,
            "min_price":0.0,
...
{noformat}

Based on my reading of the docs, i think those are the only 2 places where 
using a NamedList to preserve the order is really important.
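
To illustrate what the flat form buys a client, here is a minimal sketch (not part of the patch) of reading the alternating key/value array back into an insertion-ordered map. It assumes the JSON has already been parsed into a java.util.List by whatever JSON library the client uses; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side helper; not part of Solr or the attached patch. */
final class FlatFacetList {

    /**
     * Converts [key1, value1, key2, value2, ...] into a map that iterates
     * in the same order the server wrote the entries.
     * Note: a NamedList may repeat keys; in this sketch a later duplicate
     * simply overwrites the earlier value.
     */
    static Map<String, Object> toOrderedMap(List<?> flat) {
        if (flat.size() % 2 != 0) {
            throw new IllegalArgumentException(
                "Expected alternating key/value entries, got " + flat.size() + " elements");
        }
        Map<String, Object> ordered = new LinkedHashMap<>();
        for (int i = 0; i < flat.size(); i += 2) {
            Object key = flat.get(i);
            if (!(key instanceof String)) {
                throw new IllegalArgumentException("Element " + i + " is not a string key: " + key);
            }
            ordered.put((String) key, flat.get(i + 1));
        }
        return ordered;
    }
}
```

Calling FlatFacetList.toOrderedMap on the parsed "cat" array from the output above would give a map whose first key is "electronics" and second is "graphics card", i.e. the order the server chose, which a plain JSON object does not promise to preserve.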

----

I do however have some other broader concerns about the user API based on my 
limited experimentation so far...

* input validation: when trying to build up requests, i ran into several 
situations where i got NullPointerExceptions, or 
ArrayIndexOutOfBoundsException, or other confusing error messages that weren't 
helpful for figuring out what i did wrong, because the component assumes 
certain params will exist if other params exist w/o actually validating the 
input.  For example, in this URL when trying to do field facet sorting on a 
stat that doesn't exist (because of a typo) you get an AIOOBE: 
http://localhost:8983/solr/select?q=*:*&rows=0&olap=true&o.hoss.fieldfacet=cat&o.hoss.fieldfacet.cat.ss=fooo&o.hoss.s.foo=sum%28price%29
* error handling -- while looking into the NamedList thing, i found some code 
like this which definitely scares me:{code}
      } catch (IOException e) {
        log.warn("Analytics request '"+areq.getName()+"' failed", e);
        continue;
      }
{code}
* I'm a bit concerned about the "Statistical Expressions" syntax that's added 
here -- for a couple of different reasons:
** "Expressions" is a concept that's already been added to Lucene that means 
something else, and there's other work in progress in SOLR-4787 to bring that 
into Solr - i anticipate some terminology confusion.
** regardless of what we call these expressions, the subtleties of the 
"Aggregate Mapping Operations/Expressions" vs the "Aggregations" vs the "Field 
Mapping Operations" seem ripe for a lot of confusion about when/how you can 
wrap one in the other -- especially since the list of "Field Mapping 
Operations" looks very similar to the list of "Aggregate Mapping 
Operations/Expressions" but is evidently not exactly the same list. (I haven't 
delved into the code enough to be clear if that's just a doc mistake)  I'm 
wondering if the syntax shouldn't have some sort of more explicit visual cue to 
make it clear what's an "aggregation" vs what are "operations"
** In general, the syntax _looks_ just like the valuesource syntax -- even some 
of the function names are identical -- but it's not the same thing, and those 
"functions" work very differently, which is also very confusing.
*** perhaps something as simple as changing the "Aggregations" to always use 
uppercase only would help address the above 2 points?
** once i finally understood the distinction between "Field Mapping Operations" 
and "Aggregations" i now find myself wondering why "Field Mapping Operations" 
exist at all given the large number of ValueSourceParsers available?  Why not 
just allow "Aggregations" to wrap any ValueSources by delegating to 
ValueSourceParsers using the existing syntax?
* Distributed support -- If we focus solely on the API questions for a moment, 
w/o worrying about the implementation details (because I'm not a math guy and i 
don't understand the details enough to even pretend i'm a math guy for the 
purposes of this conversation): From what i understand based on other comments, 
it sounds like _some_ of these "Aggregations" can be efficiently supported in a 
distributed/sharded setup, but others just plain can't ever be supported 
because of what's involved in computing them in the aggregate across nodes.  If 
that's the case, then we need to think about how the component will behave in a 
distributed setup if you try to use an aggregate that isn't supported.  The 
last thing we want is for it to silently appear to work but compute nothing -- we want to 
make sure there is some sort of error message that makes it clear to the user 
what part of their analytics request is/isn't supported.
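
As a rough illustration of the kind of up-front input validation meant in the first bullet, here is a minimal, self-contained sketch. It is not code from the patch: the class and method names are hypothetical, it treats params as a simple single-valued map, and it throws IllegalArgumentException only to stay free of Solr dependencies (inside Solr a SolrException with ErrorCode.BAD_REQUEST would presumably be the better fit). The param naming (o.<request>.s.<stat> and o.<request>.fieldfacet.<field>.ss) is taken from the example URLs above.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical validator; a sketch of fail-fast checking, not the component's actual code. */
final class AnalyticsParamCheck {

    /**
     * Verifies that every field facet sort stat ("...fieldfacet.<field>.ss")
     * names a statistic declared via "o.<reqName>.s.<stat>" in the same request.
     */
    static void checkSortStats(String reqName, Map<String, String> params) {
        String statPrefix = "o." + reqName + ".s.";
        String facetPrefix = "o." + reqName + ".fieldfacet.";
        String sortSuffix = ".ss";

        // Collect the names of all statistics the request declares.
        Set<String> declared = new LinkedHashSet<>();
        for (String key : params.keySet()) {
            if (key.startsWith(statPrefix)) {
                declared.add(key.substring(statPrefix.length()));
            }
        }

        // Reject any sort stat that was never declared, with a message naming the param.
        for (Map.Entry<String, String> e : params.entrySet()) {
            String key = e.getKey();
            boolean isSortParam = key.startsWith(facetPrefix)
                && key.endsWith(sortSuffix)
                && key.length() > facetPrefix.length() + sortSuffix.length();
            if (isSortParam && !declared.contains(e.getValue())) {
                throw new IllegalArgumentException(
                    "Param '" + key + "' sorts on unknown statistic '" + e.getValue()
                    + "'; declared statistics: " + declared);
            }
        }
    }
}
```

With the params from the AIOOBE URL above (ss=fooo, but only a stat named foo declared), this check rejects the request with a message naming the bad param instead of failing later with an index error. The same "check first, then report exactly which part is unsupported" shape would also apply to the distributed case.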



> Finalize interface and backport analytics component to 4x
> ---------------------------------------------------------
>
>                 Key: SOLR-5963
>                 URL: https://issues.apache.org/jira/browse/SOLR-5963
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.9, 5.0
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>         Attachments: SOLR-5963.patch, SOLR-5963.patch
>
>
> Now that we seem to have fixed up the test failures for trunk for the 
> analytics component, we need to solidify the API and back-port it to 4x. For 
> history, see SOLR-5302 and SOLR-5488.
> As far as I know, these are the merges that need to occur to do this (plus 
> any that this JIRA brings up)
> svn merge -c 1543651 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1545009 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1545053 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1545054 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1545080 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1545143 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1545417 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1545514 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1545650 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1546074 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1546263 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1559770 https://svn.apache.org/repos/asf/lucene/dev/trunk
> svn merge -c 1583636 https://svn.apache.org/repos/asf/lucene/dev/trunk
> The only remaining thing I think needs to be done is to solidify the 
> interface, see comments from [[email protected]] on the two JIRAs mentioned, 
> although SOLR-5488 is the most relevant one.
> [~sbower], [~houstonputman] and [[email protected]] might be particularly 
> interested here.
> I really want to put this to bed, so if we can get agreement on this soon I 
> can make it march.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
