[jira] [Commented] (SOLR-6349) LocalParams for enabling/disabling individual stats

Hoss Man (JIRA) Thu, 16 Oct 2014 16:43:47 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174476#comment-14174476
 ]


Hoss Man commented on SOLR-6349:
--------------------------------


Tomas: thanks a lot for working on this.

I have mixed feelings about your approach...

* -1 to the way StatsField nows has to directly know about every possible stat 
for every possible data type, & the stat dependency logic is spread between the 
StatsValue classes and the new Enum in StatsField.
* -0 to making it harder to add new types of stats in the future (one of the 
nice side effects of my hypothetical idea was that if we did go an SPI route or 
something like that it would have made it trivial for people to add new types 
of stats as plugins, even depending on other stats for distributed logic)
* -1 for the need to have that ...{code}
if (statsField.calculateStat(X)) { 
  X = calculateX() 
}
{code}...pattern you mentioned in so much code -- that's one of the reasons i 
abandomed my last patch (and before i abandoned it, i was focusingon trying to 
ensure that it was at least always a comarison with a final boolean in the hops 
that the JVM could optimize the if away)
* +99 to the fact that you have something that actually works as opposed to my 
half-thought out idea that i never tried to implement.


On balance, i think we should move forward with your idea -- if nothing else, 
then as a straw man to help us flesh out the exact expected behavior & write 
more tests. If down the road we come up with a better internal implementation, 
then so be it.

A few specific comments on what you've got so far...

* kind of weird the way your patch's API uses a variety of Stat[], 
EnumSet<Stat> and Set<Stat> ... probably best just to use EnumSet everywhere?
* be careful about your "{{statsInResponse.isEmpty() || 
statsInResponse.contains(stat)}}" logic ... we need to make sure we don't break 
existing behavior for things like {{stats.field=foo&stats.calcDistinct=true}}
** speaking of calcDistinct, it needs to be accounted for in your new enum so 
we can start supporting it as a localparam and deprecate the top level 
{{stats.calcDistinct}}, maybe along the lines of...{code}
if (statsInResponse.isEmpty()) {
  statsInResponse.addAll(LEGACY_DEFAULT_STATS); // static final EnumSet
  statsToCalculate.addAll(LEGACY_DEFAULT_STATS_DEPENDS); // static {} built 
EnumSet looping over LEGACY_DEFAULT_STATS 
}
if (params.getFieldBool(f, STATS_CALCDISTINCT, false)) { // top level req params
  statsInResponse.add(Stat.calcDistinct);
  statsToCalculate.add(Stat.calcDistinct);
}
{code}
* i'm not positive - but i don't think your patch currently accounts for the 
idea that in a distributed request, we may need to calculate & return a stats 
dependencies, but not compute & return the stat itself (ie: for mean, we need 
each shard to compute a sum & count, but we don't wnat each shard to compute or 
return the per-shard mean)
** we should be abl to easily test this behavior by "faking" an isShard request 
to a single node and asserting which keys we do/don't get back
* rather then a static dependsOn(Stat) method with a case statement, why not 
make it a property of of the enum objects themselves?
** maybe something like this, which also shows one idea for dealing with the 
"stat doesn't depend on itself in distributed calculations"...{code}
  static enum Stat {
    min(true),
    max(true),
    missing(true),
    sum(true),
    count(true),
    mean(false, sum, count),
    sumOfSquares(true),
    stddev(false,sum,count,sumOfSquares);
    
    private final List<Stat> distribDeps;

    Stat(boolean selfDep, Stat... deps) {
      distribDeps = new ArrayList<Stat>(deps.length+1);
      distribDeps.addAll(Arrays.asList(deps));
      if (selfDep) { 
        distribDeps.add(this);
      }
    }
    public EnumSet<Stat> getDistribDeps() {
      return EnumSet.copyOf(this.distribDeps);
    }
  }
{code}
* please help me fight against the trend of distributed tests that only do 
comparisons against single node w/o asserting specific results, ie...{noformat}
+    // only ask for "min" and "mean"
+    query("q","*:*", "sort",i1+" desc", "stats", "true",
+          "stats.field", "{!min=true mean=true}" + i1);
+    
+    // only ask for "min", "mean" and "stddev"
+    query("q","*:*", "sort",i1+" desc", "stats", "true",
+          "stats.field", "{!min=true mean=true stddev=true}" + i1);
+    
+    String[] stats = new String[]{"min", "max", "sum", "sumOfSquares", 
"stddev", "mean", "missing", "count"};
+    
+    for (String stat:stats) {
+      for (String innerStat:stats) {
+        query("q","*:*", "sort",i1+" desc", "stats", "true",
+            "stats.field", "{!" + stat + "=true " + innerStat + "=true}" + i1);
+      }
+    }
+    
{noformat}...all of those can & should include xpaths asserting that the 
count() of keys in the stats is only N, and that the expected keys exist (and 
in the case of the first 2: we should be able to assert the expected value)

----

bq. The failing tests were related to the fact that I now include stats even 
for count=0. For now I modified them to pass, but should be fixed based on what 
we decide to do in that case.

What's your opinion on this?  mine hasn't really changed...


bq. ... the "correct" behavior for this query would be to get a stats block 
back for those fields where things like min/max/mean are "null", count==0, and 
missing=1 ...

tht would be really easy to do -- the only minor hitch is that i think the code 
right now (assuming count==0) winds up giving you things like -/+Infinity for 
min/max instead of null -- so we'd have to start tracking a boolean of wether 
we'd collected any values at all.



> LocalParams for enabling/disabling individual stats
> ---------------------------------------------------
>
>                 Key: SOLR-6349
>                 URL: https://issues.apache.org/jira/browse/SOLR-6349
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Hoss Man
>         Attachments: SOLR-6349-tflobbe.patch, SOLR-6349-tflobbe.patch, 
> SOLR-6349-tflobbe.patch, SOLR-6349-xu.patch, SOLR-6349___bad_idea_broken.patch
>
>
> Stats component currently computes all stats (except for one) every time 
> because they are relatively cheap, and in some cases dependent on eachother 
> for distrib computation -- but if we start layering stats on other things it 
> becomes unnecessarily expensive to compute all the stats when they just want 
> the "sum" (and it will definitely become excessively verbose in the 
> responses).  
> The plan here is to use local params to make this configurable.  All of the 
> existing stat options could be modeled as a simple boolean param, but future 
> params (like percentiles) might take in a more complex param value...
> Example:
> {noformat}
> stats.field={!min=true max=true percentiles='99,99.999'}price
> stats.field={!mean=true}weight
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-6349) LocalParams for enabling/disabling individual stats

Reply via email to