[ 
https://issues.apache.org/jira/browse/SOLR-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707396#comment-17707396
 ] 

Alex Deparvu commented on SOLR-7609:
------------------------------------

(pasting my PR comment here as well, for future reference 
https://github.com/apache/solr/pull/1504#issuecomment-1492441639)

I have pushed a tentative patch for review and more importantly for discussion.

I have ran this test extensively (hundreds of beast jobs) and it looks like 
this is the way to go, all 'add' exceptions are gone once the isLeader flag is 
updated. BUT I am not completely convinced this is the correct patch. possibly 
something could be made better in the DistributedZkUpdateProcessor area instead 
(where a lot of leader calculations are being made but the result is not 
correct in this edge case).
While I am still evaluating a patch for DistributedZkUpdateProcessor, I would 
love some feedback on the observations so far.

More on the isLeader update: I have decided to only do this in the case where 
it can be used to avoid throwing an exception. this was to add a best-effort 
way correct an inconsistent state before bailing out, possibly a cleaner fix 
would compute the 'isLeader' flag in a way that this approach is no longer 
needed.

Together with the patch I have added a check which will not allow additions 
with no version. this is now consistent with the delete flows. although I must 
admit I am confused by the fact that this check was missing on this execution 
path, while the delete path had this from at least 8 years ago.

Other points
* in the current test, there are a few operations (add and deletes) that run 
parallel to the split operation. these can fail and this is where the bug is. 
because the exceptions are not correctly accounted for, the NPE popped up at a 
weird place. I added tracking and assertions on this part too (except for 
deletes, see below)
* deletes can also cause exceptions, I don't have a good handle on how to fix 
this, so I removed the strict error check. this can be added once the 
confidence level is higher and this flow is accounted for. until then I will 
focus on the addition parts.
* DistributedZkUpdateProcessor will call setupRequest twice. this is probably a 
minor nuisance for some future improvement PR.
* finding the code very hard to read I took the decision to remove one variable 
that is not really needed (the updateCommand), this change is purely cosmetic 
and I can revert it if it adds too much noise to the current patch.

> ShardSplitTest NPE
> ------------------
>
>                 Key: SOLR-7609
>                 URL: https://issues.apache.org/jira/browse/SOLR-7609
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Steven Rowe
>            Priority: Minor
>         Attachments: ShardSplitTest.NPE.log
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm guessing this is a test bug, but the seed doesn't reproduce for me (tried 
> on the same Linux machine it occurred on and on OS X):
> {noformat}
>    [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ShardSplitTest 
> -Dtests.method=test -Dtests.seed=9318DDA46578ECF9 -Dtests.slow=true 
> -Dtests.locale=is -Dtests.timezone=America/St_Vincent -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
>    [junit4] ERROR   55.8s J6  | ShardSplitTest.test <<<
>    [junit4]    > Throwable #1: java.lang.NullPointerException
>    [junit4]    >      at 
> __randomizedtesting.SeedInfo.seed([9318DDA46578ECF9:1B4CE27ECB848101]:0)
>    [junit4]    >      at 
> org.apache.solr.cloud.ShardSplitTest.logDebugHelp(ShardSplitTest.java:547)
>    [junit4]    >      at 
> org.apache.solr.cloud.ShardSplitTest.checkDocCountsAndShardStates(ShardSplitTest.java:438)
>    [junit4]    >      at 
> org.apache.solr.cloud.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:222)
>    [junit4]    >      at 
> org.apache.solr.cloud.ShardSplitTest.test(ShardSplitTest.java:84)
>    [junit4]    >      at 
> org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:960)
>    [junit4]    >      at 
> org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:935)
>    [junit4]    >      at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Line 547 of {{ShardSplitTest.java}} is:
> {code:java}
>       idVsVersion.put(document.getFieldValue("id").toString(), 
> document.getFieldValue("_version_").toString());
> {code}
> Skimming the code, it's not obvious what could be null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to