[ 
https://issues.apache.org/jira/browse/SOLR-16531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652100#comment-17652100
 ] 

Jason Gerlowski commented on SOLR-16531:
----------------------------------------

Hey all, merry Christmas and happy holidays!

I've been too lax in posting updates here, but I have been working on this and 
have some results to share.
h3. Solr-Bench Stuff

I've run the solr-bench suite we've been discussing a number of times, but 
haven't been able to reproduce anything in the "33%" ballpark that Ishan 
mentioned a few comments up. I do see a slowdown of around 15% pretty reliably 
with the JAX-RS change, but obviously that's only about half of what Ishan 
observed. Maybe this is hardware-related? I'm running solr-bench "locally". 
Maybe his performance trials use cloud VMs or containers and the different 
results make sense, but it seems like a large deviation... Anyway, my numbers 
are in the google sheet 
[here|https://docs.google.com/spreadsheets/d/1i_jAHNOhy-LxvjCJglQqsxJtW9pPnRjlA1U1hhi0rdY/edit?usp=sharing].

On a maybe related note, there seems to be a bug in the 
{{ishan/repeatable-jenkins}} solr-bench branch that I was pointed to in a 
comment pretty early-on in this thread. See 
[here|https://github.com/gerlowskija/solr-bench/commit/f94708148ff2a13fb010c323304c3b6b4015b382]
 for the details, but the gist: the start-time calculation on 
"repeatable-jenkins" accidentally subtracts the recovery time instead of adding 
it (at least, as of 1d1a36). I mentioned this in a {{fullstorydev/solr-bench}} 
issue 
[here|https://github.com/fullstorydev/solr-bench/issues/31#issuecomment-1344667693]
 and asked for some clarification on how (or whether) to submit a fix. Take a 
look when you get a chance please!
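
As a rough illustration of the kind of sign error described above, here is a minimal sketch. The class, method names, and numbers are made up for the example and are not taken from the solr-bench code. It only shows how subtracting the recovery time instead of adding it makes a restart look faster than it was.

```java
import java.time.Duration;

// Hypothetical illustration only: none of these names come from solr-bench.
public class RestartTiming {

    // Intended behavior (assumed): a restart counts as finished only once
    // recovery is done, so recovery time is added to the node start time.
    public static Duration totalRestartTime(Duration nodeStart, Duration recovery) {
        return nodeStart.plus(recovery);
    }

    // The sign error: recovery time is subtracted, so the reported number
    // shrinks as recovery gets slower.
    public static Duration buggyTotalRestartTime(Duration nodeStart, Duration recovery) {
        return nodeStart.minus(recovery);
    }

    public static void main(String[] args) {
        Duration nodeStart = Duration.ofSeconds(40); // made-up sample values
        Duration recovery = Duration.ofSeconds(10);

        System.out.println("correct: " + totalRestartTime(nodeStart, recovery).getSeconds() + "s"); // correct: 50s
        System.out.println("buggy:   " + buggyTotalRestartTime(nodeStart, recovery).getSeconds() + "s"); // buggy:   30s
    }
}
```

With these sample values the two results differ by twice the recovery time, so a bug of this shape could noticeably skew comparisons between runs whose recovery times differ.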

Additionally, on runs that create a large number of collections (i.e. 
sporadically with 700 but frequently with 800 or more colls) the 
"clusterstatus" task runs into problems creating these collections. The CREATE 
request times out and the Solr logs have a suspicious-looking error message 
about an expected overseer node being missing. Running with the full 1000 
collections that Ishan and Noble's numbers were based on, it takes me somewhere 
between 5 and 10 benchmark attempts to get a clean, usable run. Again, maybe 
this is something specific to running "locally"? Curious if any of this rings a 
bell for anyone more familiar with running this suite, maybe Ishan can chime in.
h3. Some Improvement!

I also spent some time with the [benchmarking module already built into 
Solr|https://github.com/apache/solr/tree/main/solr/benchmark], which relies 
heavily on JMH. As I gather, this is less suited for the sort of macro-testing 
that solr-bench offers, but it does offer the ability to profile the code being 
run and produce (e.g.) CPU flamegraphs and other visualizations that help 
determine where time is actually being spent. I created a JMH benchmark (see 
[here|https://github.com/gerlowskija/solr/blob/benchmark_attempt/solr/benchmark/src/java/org/apache/solr/bench/index/SolrStartup.java])
 to cover Solr startup and collected flamegraphs with JAX-RS enabled and 
disabled, and with varying numbers of collections. From looking at those I was 
able to come up with some decent improvements, mostly consisting of some tweaks 
in how resources register and in disabling some unused Jersey features that 
were consuming CPU.

Profiling [those 
changes|https://github.com/gerlowskija/solr/commit/d393f4d5f6d6eef78a4734edc44f6a1570d86680]
 and comparing them to the original flamegraphs, the changes cut the CPU cycles 
spent in Jersey code in half! Cross-checking this against the solr-bench 
benchmark shows some improvement, though admittedly a good deal less than the 
CPU-cycle halving initially led me to hope for. Again, see [this Google 
Sheet|https://docs.google.com/spreadsheets/d/1i_jAHNOhy-LxvjCJglQqsxJtW9pPnRjlA1U1hhi0rdY/edit?usp=sharing]
 for some details. I'd love a sanity check from anyone else familiar with 
running these benchmarks (probably once the time-calculation bug on 
ishan/repeatable-jenkins is cleared up)!
h3. Next Steps

I'm optimistic that the improvements to our Jersey integration take a good bite 
out of the perf degradation. But whatever the solr-bench results from others 
might show, I expect we probably still aren't quite "there" yet and more 
improvement will be needed. So let's talk next steps.

My immediate goal in the next week or so is to try swapping out Jersey for a 
number of alternative libraries, to see if any of those offer any better 
performance. (Jersey is only one implementation of the JAX-RS spec, in much 
the same way that log4j is only one backend for the slf4j API. Spiking 
out a switch should be doable for perf testing.)

If that doesn't pan out though, I'll turn my attention to the suggestion that's 
been made several times now to change our JAX-RS integration and existing v1 
API handling to create an entity (i.e. JAX-RS application, PluginBag, etc.) on 
a per-configset basis instead of a per-core basis. IMO this has always been the 
most promising option. I haven't tried it yet because of the scope involved: 
it's a large change in its own right, and one that'd need broader consensus 
from the community. But I'll at least try to spike it out in time for the 
January 11th date.
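
To make the per-configset idea concrete, here is a minimal sketch of sharing one expensive object across all cores that use the same configset. Everything in it is an assumption made for illustration: {{PerConfigsetCache}}, {{forCore}}, and the factory function are hypothetical names, not Solr APIs, and the real change would also have to deal with lifecycle, reloads, and per-core overrides.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: not a Solr class. T stands in for whatever is
// expensive to build (e.g. a JAX-RS application or a PluginBag).
public class PerConfigsetCache<T> {
    private final Map<String, T> byConfigset = new ConcurrentHashMap<>();
    private final Function<String, T> factory;
    private final AtomicInteger builds = new AtomicInteger();

    public PerConfigsetCache(Function<String, T> factory) {
        this.factory = factory;
    }

    // The core name is accepted only to mirror a per-core call site; the
    // lookup key is the configset, so cores sharing a configset share T.
    public T forCore(String coreName, String configsetName) {
        return byConfigset.computeIfAbsent(configsetName, name -> {
            builds.incrementAndGet(); // count how often the expensive build runs
            return factory.apply(name);
        });
    }

    public int buildCount() {
        return builds.get();
    }

    public static void main(String[] args) {
        PerConfigsetCache<Object> cache = new PerConfigsetCache<>(name -> new Object());
        for (int i = 0; i < 1000; i++) {
            cache.forCore("coll" + i + "_shard1_replica_n1", "_default");
        }
        System.out.println(cache.buildCount()); // prints 1
    }
}
```

In a benchmark shaped like the 1000-collection run, where every collection uses the same configset, the expensive build would happen once rather than once per core. That is the saving this option is after.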

> Performance degradation due to introduction of JAX-RS
> -----------------------------------------------------
>
>                 Key: SOLR-16531
>                 URL: https://issues.apache.org/jira/browse/SOLR-16531
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Priority: Blocker
>             Fix For: 9.2
>
>         Attachments: Screenshot from 2022-11-09 11-20-44.png, 
> results-with-patch.tar.gz
>
>
> During performance benchmarking on branch_9x, I observed a slowdown in 
> restart performance since commits in SOLR-16347. See attached screenshot.
> CC [~gerlowskija].
> http://mostly.cool/cluster-test-with-patch.html
> The benchmark is here: 
> https://github.com/fullstorydev/solr-bench/blob/ishan/repeatable-jenkins/suites/cluster-test.json.
>  This suite was run after retro-actively applying the parallelStream patch 
> from SOLR-16414: 
> https://github.com/apache/solr/commit/b33161d0cdd976fc0c3dc78c4afafceb4db671cf.diff
>  
> Effort to automate these benchmarks is WIP and tracked here: SOLR-16525.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
