[ 
https://issues.apache.org/jira/browse/IGNITE-27345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-27345:
---------------------------------------
    Description: 
 
{noformat}
Caused by: java.lang.NullPointerException
at org.apache.ignite.internal.partition.replicator.handlers.WriteIntentSwitchRequestHandler.lambda$invokeTableWriteIntentSwitchReplicaRequest$4(WriteIntentSwitchRequestHandler.java:172)
... 24 more{noformat}
 

 

The node was dying under heavy heap pressure. At some point, a large number of 
stack traces of this kind were written to the log.

The exception means that the corresponding table processor either had not been 
added yet or had already been removed.
 # If it was not added yet, this is a bug and we probably lack a 
happens-before relationship between adding a table processor and processing 
messages for this table. This seems unlikely, though: for 
TableWriteIntentSwitchReplicaRequest, the same mechanism is used as for other 
TableAware messages in PartitionReplicaListener to make sure that table 
resources are ready to process the corresponding TableAware request. The 
mechanism takes the current time from the clock (updated with the requester's 
clock time passed via request.timestamp) and then does a schema sync with that 
time (as the table was already created earlier, this makes sure that table 
resources are prepared and installed). Nevertheless, 
WriteIntentSwitchRequestHandler implements this trick itself and its code might 
differ from PartitionReplicaListener's, so it makes sense to verify that we 
don't have a bug here.
 # If the table was removed, then the corresponding table processor was 
removed as well. In PartitionReplicaListener (PRL), we explicitly check for 
null. We probably have to do the same in WriteIntentSwitchRequestHandler, and 
this seems to be the actual reason (and the candidate fix). Also, please take a 
look at IGNITE-26819 and the corresponding 
[PR|https://github.com/apache/ignite-3/pull/6944]; it seems that the NPE we see 
here is a consequence of the fix for IGNITE-26819 being incomplete. One thing 
is worrying about this explanation: the user says that they did not drop any 
tables (but they could be wrong).
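
To illustrate the two items above, here is a minimal sketch of the order of operations (advance the clock with the request timestamp, wait for schema sync, then look up the table processor) with an explicit null check for a removed table. It is not the actual Ignite code: TableProcessorSketch, WriteIntentSwitchSketch, the plain long clock, and the schemaSync function are all hypothetical stand-ins, and returning an empty result for a missing processor is only an assumed way of handling a destroyed table.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongFunction;

/** Hypothetical stand-in for a per-table processor; not the real Ignite interface. */
interface TableProcessorSketch {
    CompletableFuture<String> switchWriteIntents(String txId);
}

/** Hypothetical handler that tolerates a table processor that was already removed. */
final class WriteIntentSwitchSketch {
    private final Map<Integer, TableProcessorSketch> processors = new ConcurrentHashMap<>();

    /** Toy clock: a plain long instead of a hybrid timestamp (assumption). */
    private final AtomicLong clock = new AtomicLong();

    /** Hypothetical schema sync: completes once metadata is complete up to the given time. */
    private final LongFunction<CompletableFuture<Void>> schemaSync;

    WriteIntentSwitchSketch(LongFunction<CompletableFuture<Void>> schemaSync) {
        this.schemaSync = schemaSync;
    }

    void addProcessor(int tableId, TableProcessorSketch processor) {
        processors.put(tableId, processor);
    }

    void removeProcessor(int tableId) {
        processors.remove(tableId);
    }

    CompletableFuture<Optional<String>> handleSwitch(int tableId, String txId, long requestTimestamp) {
        // Move the local clock past the requester's time, then sync schemas up to that moment.
        long now = clock.updateAndGet(current -> Math.max(current, requestTimestamp) + 1);

        return schemaSync.apply(now).thenCompose(unused -> lookupAndSwitch(tableId, txId));
    }

    private CompletableFuture<Optional<String>> lookupAndSwitch(int tableId, String txId) {
        TableProcessorSketch processor = processors.get(tableId);

        if (processor == null) {
            // The table was already destroyed: nothing to switch, so do not dereference null.
            return CompletableFuture.completedFuture(Optional.empty());
        }

        return processor.switchWriteIntents(txId).thenApply(Optional::of);
    }
}
```

With this shape, a request for a destroyed table completes normally with an empty result instead of failing inside the lambda.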

The second item seems to be the culprit, but it would be great to write the 
corresponding test this time.

The scenario for it is:
 # Some external transaction creates a write intent (WI)
 # The transaction is committed, but WI cleanup is not performed
 # The table is dropped
 # The low watermark (LWM) rises enough to cause the table destruction
 # Only now is another attempt made to switch the WI
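
The scenario can be modelled with a small self-contained sketch before writing the real integration test. Everything here is hypothetical (the DroppedTableScenarioSketch class, its methods, and plain long timestamps); in particular, destroying a table once the LWM reaches its drop timestamp is an assumption made for the model, not a statement about the exact Ignite rule.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the scenario; none of these names exist in Ignite. */
final class DroppedTableScenarioSketch {
    /** Pending write intents (transaction IDs) per table whose resources still exist. */
    private final Map<Integer, Set<String>> intentsByTable = new HashMap<>();

    /** Drop timestamps of tables that were dropped but not destroyed yet. */
    private final Map<Integer, Long> dropTimestamps = new HashMap<>();

    void createTable(int tableId) {
        intentsByTable.put(tableId, new HashSet<>());
    }

    void writeIntent(int tableId, String txId) {
        intentsByTable.get(tableId).add(txId);
    }

    void dropTable(int tableId, long dropTimestamp) {
        dropTimestamps.put(tableId, dropTimestamp);
    }

    /** Destroys resources of every dropped table whose drop timestamp is not above the LWM (assumed rule). */
    void raiseLowWatermark(long lowWatermark) {
        dropTimestamps.entrySet().removeIf(entry -> {
            if (entry.getValue() <= lowWatermark) {
                intentsByTable.remove(entry.getKey());
                return true;
            }
            return false;
        });
    }

    boolean tableResourcesExist(int tableId) {
        return intentsByTable.containsKey(tableId);
    }

    /** Returns false instead of throwing when the table resources are already gone. */
    boolean switchWriteIntent(int tableId, String txId) {
        Set<String> intents = intentsByTable.get(tableId);

        if (intents == null) {
            return false;
        }

        intents.remove(txId);
        return true;
    }
}
```

A test following the five steps would create the table, write an intent, drop the table, raise the LWM past the drop timestamp, and then check that the late switch attempt is handled without an NPE.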

 

  was:
Caused by: java.lang.NullPointerException
at org.apache.ignite.internal.partition.replicator.handlers.WriteIntentSwitchRequestHandler.lambda$invokeTableWriteIntentSwitchReplicaRequest$4(WriteIntentSwitchRequestHandler.java:172)
... 24 more


> NullPointerException in WriteIntentSwitchRequestHandler
> -------------------------------------------------------
>
>                 Key: IGNITE-27345
>                 URL: https://issues.apache.org/jira/browse/IGNITE-27345
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Filipp Shergalis
>            Priority: Major
>              Labels: ignite-3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)