[
https://issues.apache.org/jira/browse/SOLR-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174476#comment-14174476
]
Hoss Man commented on SOLR-6349:
--------------------------------
Tomas: thanks a lot for working on this.
I have mixed feelings about your approach...
* -1 to the way StatsField nows has to directly know about every possible stat
for every possible data type, & the stat dependency logic is spread between the
StatsValue classes and the new Enum in StatsField.
* -0 to making it harder to add new types of stats in the future (one of the
nice side effects of my hypothetical idea was that if we did go an SPI route or
something like that it would have made it trivial for people to add new types
of stats as plugins, even depending on other stats for distributed logic)
* -1 for the need to have that ...{code}
if (statsField.calculateStat(X)) {
X = calculateX()
}
{code}...pattern you mentioned in so much code -- that's one of the reasons i
abandomed my last patch (and before i abandoned it, i was focusingon trying to
ensure that it was at least always a comarison with a final boolean in the hops
that the JVM could optimize the if away)
* +99 to the fact that you have something that actually works as opposed to my
half-thought out idea that i never tried to implement.
On balance, i think we should move forward with your idea -- if nothing else,
then as a straw man to help us flesh out the exact expected behavior & write
more tests. If down the road we come up with a better internal implementation,
then so be it.
A few specific comments on what you've got so far...
* kind of weird the way your patch's API uses a variety of Stat[],
EnumSet<Stat> and Set<Stat> ... probably best just to use EnumSet everywhere?
* be careful about your "{{statsInResponse.isEmpty() ||
statsInResponse.contains(stat)}}" logic ... we need to make sure we don't break
existing behavior for things like {{stats.field=foo&stats.calcDistinct=true}}
** speaking of calcDistinct, it needs to be accounted for in your new enum so
we can start supporting it as a localparam and deprecate the top level
{{stats.calcDistinct}}, maybe along the lines of...{code}
if (statsInResponse.isEmpty()) {
statsInResponse.addAll(LEGACY_DEFAULT_STATS); // static final EnumSet
statsToCalculate.addAll(LEGACY_DEFAULT_STATS_DEPENDS); // static {} built
EnumSet looping over LEGACY_DEFAULT_STATS
}
if (params.getFieldBool(f, STATS_CALCDISTINCT, false)) { // top level req params
statsInResponse.add(Stat.calcDistinct);
statsToCalculate.add(Stat.calcDistinct);
}
{code}
* i'm not positive - but i don't think your patch currently accounts for the
idea that in a distributed request, we may need to calculate & return a stats
dependencies, but not compute & return the stat itself (ie: for mean, we need
each shard to compute a sum & count, but we don't wnat each shard to compute or
return the per-shard mean)
** we should be abl to easily test this behavior by "faking" an isShard request
to a single node and asserting which keys we do/don't get back
* rather then a static dependsOn(Stat) method with a case statement, why not
make it a property of of the enum objects themselves?
** maybe something like this, which also shows one idea for dealing with the
"stat doesn't depend on itself in distributed calculations"...{code}
static enum Stat {
min(true),
max(true),
missing(true),
sum(true),
count(true),
mean(false, sum, count),
sumOfSquares(true),
stddev(false,sum,count,sumOfSquares);
private final List<Stat> distribDeps;
Stat(boolean selfDep, Stat... deps) {
distribDeps = new ArrayList<Stat>(deps.length+1);
distribDeps.addAll(Arrays.asList(deps));
if (selfDep) {
distribDeps.add(this);
}
}
public EnumSet<Stat> getDistribDeps() {
return EnumSet.copyOf(this.distribDeps);
}
}
{code}
* please help me fight against the trend of distributed tests that only do
comparisons against single node w/o asserting specific results, ie...{noformat}
+ // only ask for "min" and "mean"
+ query("q","*:*", "sort",i1+" desc", "stats", "true",
+ "stats.field", "{!min=true mean=true}" + i1);
+
+ // only ask for "min", "mean" and "stddev"
+ query("q","*:*", "sort",i1+" desc", "stats", "true",
+ "stats.field", "{!min=true mean=true stddev=true}" + i1);
+
+ String[] stats = new String[]{"min", "max", "sum", "sumOfSquares",
"stddev", "mean", "missing", "count"};
+
+ for (String stat:stats) {
+ for (String innerStat:stats) {
+ query("q","*:*", "sort",i1+" desc", "stats", "true",
+ "stats.field", "{!" + stat + "=true " + innerStat + "=true}" + i1);
+ }
+ }
+
{noformat}...all of those can & should include xpaths asserting that the
count() of keys in the stats is only N, and that the expected keys exist (and
in the case of the first 2: we should be able to assert the expected value)
----
bq. The failing tests were related to the fact that I now include stats even
for count=0. For now I modified them to pass, but should be fixed based on what
we decide to do in that case.
What's your opinion on this? mine hasn't really changed...
bq. ... the "correct" behavior for this query would be to get a stats block
back for those fields where things like min/max/mean are "null", count==0, and
missing=1 ...
tht would be really easy to do -- the only minor hitch is that i think the code
right now (assuming count==0) winds up giving you things like -/+Infinity for
min/max instead of null -- so we'd have to start tracking a boolean of wether
we'd collected any values at all.
> LocalParams for enabling/disabling individual stats
> ---------------------------------------------------
>
> Key: SOLR-6349
> URL: https://issues.apache.org/jira/browse/SOLR-6349
> Project: Solr
> Issue Type: Sub-task
> Reporter: Hoss Man
> Attachments: SOLR-6349-tflobbe.patch, SOLR-6349-tflobbe.patch,
> SOLR-6349-tflobbe.patch, SOLR-6349-xu.patch, SOLR-6349___bad_idea_broken.patch
>
>
> Stats component currently computes all stats (except for one) every time
> because they are relatively cheap, and in some cases dependent on eachother
> for distrib computation -- but if we start layering stats on other things it
> becomes unnecessarily expensive to compute all the stats when they just want
> the "sum" (and it will definitely become excessively verbose in the
> responses).
> The plan here is to use local params to make this configurable. All of the
> existing stat options could be modeled as a simple boolean param, but future
> params (like percentiles) might take in a more complex param value...
> Example:
> {noformat}
> stats.field={!min=true max=true percentiles='99,99.999'}price
> stats.field={!mean=true}weight
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]