[ 
https://issues.apache.org/jira/browse/SOLR-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-6349:
---------------------------
    Attachment: make-data-and-queries.pl

SOLR-6349


I did a bit of crude benchmarking this morning with the following two use 
cases in mind:
* user currently asks for stats on fields, cares about all 8 of the stats
* user currently asks for stats on fields, only cares about 4 of the 8

The attached script shows my methodology -- it generates a CSV file with 10 
million docs + 2 bash files that use curl to hit Solr with 300 *:* query urls 
using randomly selected stats.field.  The sequence of stats.field requests is 
identical between the 2 bash files, but in one the URLs include localparams to 
only compute min/max/mean/stddev for the field.
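
For anyone who wants to sanity check the methodology without reading the Perl, here is a rough Python sketch of the query-generation half. It is not the attached script: the base URL, the field names, the rows=0 param, and the fixed seed are all assumptions made up for illustration. The only parts taken from this issue are the 300 queries, the identical field sequence in both files, and the localparams syntax.

```python
import random
from urllib.parse import quote

# Hypothetical values: the base URL and field names are illustrative only.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"
FIELDS = ["price", "weight", "popularity"]
FOUR_STATS = "{!min=true max=true mean=true stddev=true}"


def make_query_pairs(num_queries=300, seed=42):
    """Return (old_urls, new_urls) requesting the same fields in the same order."""
    rng = random.Random(seed)  # one random sequence shared by both URL lists
    old_urls, new_urls = [], []
    for _ in range(num_queries):
        field = rng.choice(FIELDS)
        # rows=0 is an assumption; the original script may fetch documents too.
        common = f"{SOLR_SELECT}?q=*:*&rows=0&stats=true&stats.field="
        old_urls.append(common + quote(field))
        # Local params contain braces, spaces and '=' so they are URL-encoded.
        new_urls.append(common + quote(FOUR_STATS + field))
    return old_urls, new_urls


def write_curl_script(path, urls):
    """Write one curl command per URL; curl still reads the full response."""
    with open(path, "w") as out:
        out.write("#!/bin/bash\n")
        for url in urls:
            out.write(f"curl -sS -o /dev/null '{url}'\n")
```

Calling write_curl_script("queries-old.sh", old_urls) and write_curl_script("queries-new.sh", new_urls), then timing each file with "time bash <file>", would give client-side wall clock numbers comparable in spirit to the ones reported here.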

Here are the results...

{noformat}
NOW     BASELINE: 126.008 seconds (ie: all stats ... queries-old.sh)

PATCH  ALL STATS: 133.571 seconds (6% slower ... queries-old.sh)
PATCH FOUR STATS: 130.515 seconds (3% slower ... queries-new.sh)
{noformat}
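
The percentages are just the relative increase in wall clock time over the baseline. A small sketch of that arithmetic, using only the three timings reported in this comment (the function name is made up for illustration):

```python
BASELINE = 126.008    # current code, all stats (queries-old.sh)
PATCH_ALL = 133.571   # patched code, all stats (queries-old.sh)
PATCH_FOUR = 130.515  # patched code, four stats (queries-new.sh)


def slowdown_pct(patched, baseline):
    """Relative increase in elapsed time, as a percentage of the baseline."""
    return (patched - baseline) / baseline * 100.0


print(f"{slowdown_pct(PATCH_ALL, BASELINE):.1f}%")   # 6.0%
print(f"{slowdown_pct(PATCH_FOUR, BASELINE):.1f}%")  # 3.6%
```

To one decimal place these come out at 6.0% and 3.6%; the 3% figure in the table is the second value with the fraction dropped.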

So not only has asking for all stats on a field gotten slower with this patch, 
but even asking for only 4 of the 8 possible numeric stats on a field is still 
slower than the existing code when all of them are returned.

A key thing to note here is that this is the total wall clock time from the 
perspective of the client, including reading the response from Solr.  Not only 
are we (in theory) doing only 1/2 as much math per request in the 
"FOUR STATS" situation, the XML response size of each query is only ~3/4ths the 
size of the original queries.  This should mean a lot less time both in 
processing the results and in writing/reading the data over the wire ... and 
yet instead of seeing some perf improvements, we see performance suffer.

I suspect a key factor here goes back to one of the concerns i mentioned 
earlier...

{quote}
{code}
if (statsField.calculateStat(X)) { 
  X = calculateX() 
}
{code}
...pattern you mentioned in so much code - that's one of the reasons i 
abandoned my last patch (and before i abandoned it, i was focusing on trying to 
ensure that it was at least always a comparison with a final boolean in the hopes 
that the JVM could optimize the if away)
{quote}

...the cumulative overhead of those method calls for every possible stat is 
probably counteracting any gains made by reducing the stats that are computed.
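
To make the structural difference concrete, here is a minimal sketch of the two shapes being contrasted: asking a method "is this stat wanted?" for every stat on every value, versus resolving that once up front and checking a plain boolean. It is written in Python only to keep it short, the class and method names are hypothetical and are not the patch's code, and it says nothing about what the JVM actually does with either shape.

```python
class PerValueCheck:
    """Calls calculate_stat() for every stat on every value."""

    def __init__(self, requested):
        self.requested = set(requested)
        self.min = None
        self.sum = 0.0

    def calculate_stat(self, name):
        # A lookup that runs once per stat per accumulated value.
        return name in self.requested

    def accumulate(self, value):
        if self.calculate_stat("min"):
            self.min = value if self.min is None else min(self.min, value)
        if self.calculate_stat("sum"):
            self.sum += value


class HoistedFlags:
    """Resolves the requested stats once, then tests fixed booleans."""

    def __init__(self, requested):
        requested = set(requested)
        # Decided at construction time and never changed afterwards.
        self._want_min = "min" in requested
        self._want_sum = "sum" in requested
        self.min = None
        self.sum = 0.0

    def accumulate(self, value):
        if self._want_min:
            self.min = value if self.min is None else min(self.min, value)
        if self._want_sum:
            self.sum += value
```

Both classes compute the same results for the same input; the only difference is how much work each accumulate() call does to decide which stats to update.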

----

My next step is to focus on fixing the current patch code so the few remaining 
nocommit assertions in the test start passing (see earlier comments re 
"min='false'") -- but once the behavior is locked down and solid i think we 
really need to re-assess and re-factor the code to see some perf gains before 
there's any point in moving towards adding this feature.

(NOTE: if anyone spots any flaws in my little mini-benchmark, please speak up 
-- i would be very happy to be wrong)



> LocalParams for enabling/disabling individual stats
> ---------------------------------------------------
>
>                 Key: SOLR-6349
>                 URL: https://issues.apache.org/jira/browse/SOLR-6349
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Hoss Man
>         Attachments: SOLR-6349-tflobbe.patch, SOLR-6349-tflobbe.patch, 
> SOLR-6349-tflobbe.patch, SOLR-6349-xu.patch, SOLR-6349-xu.patch, 
> SOLR-6349-xu.patch, SOLR-6349-xu.patch, SOLR-6349.patch, SOLR-6349.patch, 
> SOLR-6349.patch, SOLR-6349.patch, SOLR-6349___bad_idea_broken.patch, 
> make-data-and-queries.pl
>
>
> Stats component currently computes all stats (except for one) every time 
> because they are relatively cheap, and in some cases dependent on each other 
> for distrib computation -- but if we start layering stats on other things it 
> becomes unnecessarily expensive to compute all the stats when the user just 
> wants the "sum" (and it will definitely become excessively verbose in the 
> responses).  
> The plan here is to use local params to make this configurable.  All of the 
> existing stat options could be modeled as a simple boolean param, but future 
> params (like percentiles) might take in a more complex param value...
> Example:
> {noformat}
> stats.field={!min=true max=true percentiles='99,99.999'}price
> stats.field={!mean=true}weight
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

