[ 
https://issues.apache.org/jira/browse/SOLR-16622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677868#comment-17677868
 ] 

Michael Gibney commented on SOLR-16622:
---------------------------------------

Thanks for this extra context, it's really helpful.

{quote}this just shows that our testing is inadequate at the moment{quote}

That makes sense broadly, IMO, with some caveats (below). To state the obvious: 
these are basically integration tests, and by nature they are going to be 
difficult to reproduce reliably, no matter how we proceed.

On the one hand, I agree it is fair to characterize this particular case as a 
functional regression. On the other hand, "our testing is inadequate" could 
easily be read as suggesting that the existing unit tests and bats integration 
tests should do a better job of covering these types of issues, which I think 
would be misleading given the inherent challenges of regularly running 
integration tests. The existing test suite is simply not designed to catch 
these kinds of "integration test" issues, and even the bats integration tests 
would be difficult to adapt to catch issues that only crop up when running at 
scale.

"Straw man" argument: we could just lean into periodic benchmarks as a way to 
catch these types of issues, since the overhead of running dedicated 
integration tests at scale would be significant. Even if the original intention 
of periodic benchmarks is to evaluate performance, it may be OK (not really a 
problem) that we end up catching some "integration test"-style issues as a 
consequence. (To be clear, I'm kinda just thinking out loud; neither assuming 
you agree nor disagree, Ishan!)
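
To make the straw man a bit more concrete, here is a minimal sketch of the 
kind of post-restart check a periodic benchmark run could include. It assumes 
the JSON shape returned by the Collections API CLUSTERSTATUS action 
(cluster -> collections -> shards -> replicas -> state). The function name 
inactive_replicas and the trimmed-down SAMPLE response are hypothetical and 
only for illustration; this is not something solr-bench is known to do today.

```python
import json


def inactive_replicas(cluster_status, collection):
    """Return (shard, replica, state) for each replica that is not 'active'.

    cluster_status is the parsed JSON body of a CLUSTERSTATUS response.
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    problems = []
    for shard_name, shard in sorted(shards.items()):
        for replica_name, replica in sorted(shard.get("replicas", {}).items()):
            state = replica.get("state", "unknown")
            if state != "active":
                problems.append((shard_name, replica_name, state))
    return problems


# Hypothetical, heavily trimmed response; a real one carries many more fields.
SAMPLE = json.loads("""
{"cluster": {"collections": {"ecommerce-events": {"shards": {
  "shard1": {"state": "active",
             "replicas": {"core_node2": {"state": "active"}}},
  "shard2": {"state": "active",
             "replicas": {"core_node4": {"state": "down"}}}
}}}}}
""")

if __name__ == "__main__":
    for shard, replica, state in inactive_replicas(SAMPLE, "ecommerce-events"):
        print(f"{shard}/{replica} is {state}")
```

In a real run the input would come from something like 
/solr/admin/collections?action=CLUSTERSTATUS&collection=ecommerce-events on 
the restarted cluster, and a non-empty result would fail the run regardless 
of what the performance numbers look like.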

> Replicas don't come up active after node restart
> ------------------------------------------------
>
>                 Key: SOLR-16622
>                 URL: https://issues.apache.org/jira/browse/SOLR-16622
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>             Fix For: 9.1.1
>
>         Attachments: Screenshot from 2023-01-17 15-03-05.png
>
>
> While benchmarking for performance, we saw a sharp change in the graphs:
> https://issues.apache.org/jira/browse/SOLR-16525?focusedCommentId=17676725&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17676725
> Turns out there was a commit (SOLR-16414) that escaped all testing and caused 
> a regression where restarted nodes didn't have the replicas coming up as 
> active.
> This affects the 9.1 release, so opening a new JIRA issue to track it.
> Here's how to reproduce it:
> {code}
> git clone https://github.com/fullstorydev/solr-bench
> cd solr-bench
> # prerequisites on ubuntu:
> sudo apt install openjdk-11-jdk
> sudo apt install wget unzip zip ant ivy lsof git netcat make maven jq
> # this is a patch to comment out the cleanup/final shutdown
> wget https://termbin.com/yuu95
> git apply yuu95
> mvn clean compile assembly:single
> ./cleanup.sh && ./stress.sh -c aa4f3d98ab19c201e7f3c74cd14c99174148616d suites/stress-facets-local.json
> {code}
> If the 95th percentile is <10 or so, we have a problem. It should be >300 or 
> so. Since we disabled cleanup, we can hit http://localhost:50000/solr/ to 
> open the Solr UI. In this case, I see that querying the ecommerce-events 
> collection shows shard2 is down.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

