Re: [PR] SOLR-17890 LeaderElectionTest flaky [solr]

via GitHub Wed, 17 Dec 2025 04:03:15 -0800


Copilot commented on code in PR #3957:
URL: https://github.com/apache/solr/pull/3957#discussion_r2626794303



##########
solr/core/src/test/org/apache/solr/cloud/LeaderElectionTest.java:
##########
@@ -579,23 +581,33 @@ public void run() {
       connLossThread.join();
       killThread.join();
 
-      int seq = threads.get(getLeaderThread()).getSeq();
+      // Retry getting leader with extended timeout to handle edge cases where
+      // getLeaderUrl() gets an unexpected exception and throws 
RuntimeException
+      RetryUtil.retryOnException(
+          Exception.class,
+          60000, // 60 seconds total timeout
+          100,   // 100ms between retries
+          () -> {
+            int seq = threads.get(getLeaderThread()).getSeq();
+            log.info("Leader election stress test completed, leader seq: {}", 
seq);

Review Comment:
   The original code retrieved the leader thread and its sequence number to 
verify that a leader exists and is accessible. However, the new retry wrapper 
only logs the information without any assertion or verification. The test no 
longer validates that leader election succeeded - it just retries until the 
code executes without throwing an exception. Consider adding an assertion to 
verify the leader state, or at minimum verify that getLeaderThread() returns a 
valid index and that the sequence number is non-negative.
   ```suggestion
               int leaderThread = getLeaderThread();
               assertTrue("Leader election did not produce a valid leader 
thread index",
                   leaderThread >= 0 && leaderThread < threads.size());
               int seq = threads.get(leaderThread).getSeq();
               assertTrue("Leader election produced an invalid (negative) 
sequence number", seq >= 0);
               log.info("Leader election stress test completed, leader thread: 
{}, leader seq: {}",
                   leaderThread, seq);
   ```



##########
solr/core/src/test/org/apache/solr/cloud/LeaderElectionTest.java:
##########
@@ -579,23 +581,33 @@ public void run() {
       connLossThread.join();
       killThread.join();
 
-      int seq = threads.get(getLeaderThread()).getSeq();
+      // Retry getting leader with extended timeout to handle edge cases where
+      // getLeaderUrl() gets an unexpected exception and throws 
RuntimeException
+      RetryUtil.retryOnException(
+          Exception.class,
+          60000, // 60 seconds total timeout
+          100,   // 100ms between retries
+          () -> {
+            int seq = threads.get(getLeaderThread()).getSeq();
+            log.info("Leader election stress test completed, leader seq: {}", 
seq);
+          });
 
-      // we have a leader we know, TODO: lets check some other things
     } finally {
       // cleanup any threads still running
       for (ClientThread thread : threads) {
-        thread.close();
-      }
-
-      // cleanup any threads still running
-      for (ClientThread thread : threads) {
-        thread.es.zkClient.close();
-        thread.close();
+        try {
+          thread.close();
+        } catch (Exception e) {
+          // ignore cleanup errors

Review Comment:
   The cleanup logic is incomplete. The original code had two separate loops: 
the first called thread.close() and the second called both 
thread.es.zkClient.close() and thread.close() again. The modified code only has 
the first loop, which means thread.es.zkClient.close() is never called, leading 
to a resource leak of ZooKeeper client connections. The second loop that 
explicitly closed zkClient should be preserved.
   ```suggestion
             // explicitly close the ZooKeeper client used by this thread
             if (thread.es != null && thread.es.zkClient != null) {
               thread.es.zkClient.close();
             }
           } catch (Exception e) {
             // ignore cleanup errors for zkClient
           }
           try {
             thread.close();
           } catch (Exception e) {
             // ignore cleanup errors for thread
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] SOLR-17890 LeaderElectionTest flaky [solr]

Reply via email to