Erick Erickson created SOLR-7936:
------------------------------------

             Summary: Bogus failure when deleting collections.
                 Key: SOLR-7936
                 URL: https://issues.apache.org/jira/browse/SOLR-7936
             Project: Solr
          Issue Type: Bug
            Reporter: Erick Erickson
            Assignee: Erick Erickson


While looking at the CDCR test failures, we began to wonder whether the problem 
was in
1> the CDCR code itself
2> the test framework
3> Solr

Some of the failures should be "impossible" if collection creation and 
deletion work correctly.

So I wrote a little program to exercise collection creation/deletion outside 
the test framework by creating and deleting the same collection over and 
over again. It started failing regularly in 
OverseerCollectionMessageHandler.deleteCollection, at about line 780, where it 
throws the "Could not fully remove collection" exception:

{code}
      TimeOut timeout = new TimeOut(30, TimeUnit.SECONDS);
      boolean removed = false;
      while (! timeout.hasTimedOut()) {
        Thread.sleep(100);
        // WORKS SO FAR IF UNCOMMENTED zkStateReader.updateClusterState();
        removed = !zkStateReader.getClusterState().hasCollection(collection);
        if (removed) {
          Thread.sleep(500); // just a bit of time so it's more likely other
                             // readers see on return
          break;
        }
      }
      if (!removed) {
        throw new SolrException(ErrorCode.SERVER_ERROR,
            "Could not fully remove collection: " + collection);
      }
{code}

However, the collection really is gone from clusterstate. With the 
updateClusterState() call uncommented in the loop above, the failure doesn't 
seem to occur. Is the fix as simple as adding the updateClusterState() call?
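
To illustrate the suspicion, here is a minimal, self-contained sketch of a polling loop reading from a locally cached view of the state. It assumes the root cause is a stale cache, which is not confirmed. CachedStateReader and StaleCacheDemo are hypothetical names made up for this example; they are not Solr classes and do not reproduce ZkStateReader's actual behavior.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a reader that caches cluster state locally.
// NOT Solr's ZkStateReader; names and behavior are illustrative assumptions.
class CachedStateReader {
    private final Set<String> source;     // authoritative state (plays the role of ZooKeeper)
    private volatile Set<String> cached;  // local snapshot, only replaced by refresh()

    CachedStateReader(Set<String> source) {
        this.source = source;
        this.cached = new HashSet<>(source);
    }

    // Re-read the authoritative state into the local snapshot.
    void refresh() {
        cached = new HashSet<>(source);
    }

    // Answers from the local snapshot only, so it can be stale.
    boolean hasCollection(String name) {
        return cached.contains(name);
    }
}

class StaleCacheDemo {
    // Polls until the collection disappears from the reader's view or the
    // deadline passes. Returns true if the removal was observed in time.
    static boolean waitForRemoval(CachedStateReader reader, String name,
                                  long timeoutMs, boolean refreshEachPass) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() - deadline < 0) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            if (refreshEachPass) {
                reader.refresh();
            }
            if (!reader.hasCollection(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> zk = ConcurrentHashMap.newKeySet();
        zk.add("test");
        CachedStateReader reader = new CachedStateReader(zk);
        zk.remove("test"); // the delete really happened at the source

        System.out.println("without refresh: " + waitForRemoval(reader, "test", 200, false));
        System.out.println("with refresh:    " + waitForRemoval(reader, "test", 200, true));
    }
}
```

In this toy model the first call times out even though the collection is already gone at the source, and the second call returns as soon as the snapshot is refreshed. Whether the real reader behaves this way depends on how and when its watchers fire, which this sketch does not model.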

Without the update in place, it failed within 20 reps very regularly. So far, 
with the update in place we're at 132 and counting. Any comments?
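
For anyone who wants to try the same kind of run, the create/delete loop can be sketched roughly as follows. This is not the program I actually ran. CollectionOps and CreateDeleteLoop are hypothetical names, and the wiring to the real Collections API is deliberately left out; the fake backend in main() exists only to show the harness reporting a failure.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical abstraction over "create a collection" and "delete a collection".
// A real run would back this with Collections API calls; that part is omitted.
interface CollectionOps {
    void create(String name) throws Exception;
    void delete(String name) throws Exception;
}

class CreateDeleteLoop {
    // Creates and deletes the same collection up to `reps` times.
    // Returns the 1-based repetition of the first failure, or -1 if all passed.
    static int firstFailure(CollectionOps ops, String name, int reps) {
        for (int i = 1; i <= reps; i++) {
            try {
                ops.create(name);
                ops.delete(name);
            } catch (Exception e) {
                System.out.println("rep " + i + " failed: " + e.getMessage());
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fake backend that fails on its 7th delete, purely for demonstration.
        AtomicInteger deletes = new AtomicInteger();
        CollectionOps flaky = new CollectionOps() {
            public void create(String name) { }
            public void delete(String name) throws Exception {
                if (deletes.incrementAndGet() == 7) {
                    throw new Exception("Could not fully remove collection: " + name);
                }
            }
        };
        System.out.println("first failure at rep " + firstFailure(flaky, "test", 20));
    }
}
```

Stopping at the first failure keeps the cluster in the failing state for inspection, which is what made it possible to see that the collection was in fact gone from clusterstate.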

If this runs 1,000 times tonight, I'll check it in if there are no objections. 
I don't know what it means for CDCR yet though.

I'm also suspicious of the 500ms sleep. Does anyone know why it's in 
there?


