[ https://issues.apache.org/jira/browse/SOLR-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707396#comment-17707396 ]
Alex Deparvu commented on SOLR-7609: ------------------------------------ (pasting my PR comment here as well, for future reference https://github.com/apache/solr/pull/1504#issuecomment-1492441639) I have pushed a tentative patch for review and more importantly for discussion. I have ran this test extensively (hundreds of beast jobs) and it looks like this is the way to go, all 'add' exceptions are gone once the isLeader flag is updated. BUT I am not completely convinced this is the correct patch. possibly something could be made better in the DistributedZkUpdateProcessor area instead (where a lot of leader calculations are being made but the result is not correct in this edge case). While I am still evaluating a patch for DistributedZkUpdateProcessor, I would love some feedback on the observations so far. More on the isLeader update: I have decided to only do this in the case where it can be used to avoid throwing an exception. this was to add a best-effort way correct an inconsistent state before bailing out, possibly a cleaner fix would compute the 'isLeader' flag in a way that this approach is no longer needed. Together with the patch I have added a check which will not allow additions with no version. this is now consistent with the delete flows. although I must admit I am confused by the fact that this check was missing on this execution path, while the delete path had this from at least 8 years ago. Other points * in the current test, there are a few operations (add and deletes) that run parallel to the split operation. these can fail and this is where the bug is. because the exceptions are not correctly accounted for, the NPE popped up at a weird place. I added tracking and assertions on this part too (except for deletes, see below) * deletes can also cause exceptions, I don't have a good handle on how to fix this, so I removed the strict error check. this can be added once the confidence level is higher and this flow is accounted for. until then I will focus on the addition parts. * DistributedZkUpdateProcessor will call setupRequest twice. this is probably a minor nuisance for some future improvement PR. * finding the code very hard to read I took the decision to remove one variable that is not really needed (the updateCommand), this change is purely cosmetic and I can revert it if it adds too much noise to the current patch. > ShardSplitTest NPE > ------------------ > > Key: SOLR-7609 > URL: https://issues.apache.org/jira/browse/SOLR-7609 > Project: Solr > Issue Type: Bug > Reporter: Steven Rowe > Priority: Minor > Attachments: ShardSplitTest.NPE.log > > Time Spent: 10m > Remaining Estimate: 0h > > I'm guessing this is a test bug, but the seed doesn't reproduce for me (tried > on the same Linux machine it occurred on and on OS X): > {noformat} > [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=ShardSplitTest > -Dtests.method=test -Dtests.seed=9318DDA46578ECF9 -Dtests.slow=true > -Dtests.locale=is -Dtests.timezone=America/St_Vincent -Dtests.asserts=true > -Dtests.file.encoding=US-ASCII > [junit4] ERROR 55.8s J6 | ShardSplitTest.test <<< > [junit4] > Throwable #1: java.lang.NullPointerException > [junit4] > at > __randomizedtesting.SeedInfo.seed([9318DDA46578ECF9:1B4CE27ECB848101]:0) > [junit4] > at > org.apache.solr.cloud.ShardSplitTest.logDebugHelp(ShardSplitTest.java:547) > [junit4] > at > org.apache.solr.cloud.ShardSplitTest.checkDocCountsAndShardStates(ShardSplitTest.java:438) > [junit4] > at > org.apache.solr.cloud.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:222) > [junit4] > at > org.apache.solr.cloud.ShardSplitTest.test(ShardSplitTest.java:84) > [junit4] > at > org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:960) > [junit4] > at > org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:935) > [junit4] > at java.lang.Thread.run(Thread.java:745) > {noformat} > Line 547 of {{ShardSplitTest.java}} is: > {code:java} > idVsVersion.put(document.getFieldValue("id").toString(), > document.getFieldValue("_version_").toString()); > {code} > Skimming the code, it's not obvious what could be null. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org