[
https://issues.apache.org/jira/browse/IGNITE-27345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-27345:
---------------------------------------
Description:
{noformat}
Caused by: java.lang.NullPointerException
at
org.apache.ignite.internal.partition.replicator.handlers.WriteIntentSwitchRequestHandler.lambda$invokeTableWriteIntentSwitchReplicaRequest$4(WriteIntentSwitchRequestHandler.java:172)
... 24 more{noformat}
The node was dying due to huge heap pressure. At some point, a lot of stack
traces of this kind were written to the log.
The exception means that either the corresponding processor was not added yet,
or that it had already been removed.
# If it was not added yet, this is a bug and we probably lack some
happens-before between adding a table processor and processing messages for
this table. But this seems unlikely as for
TableWriteIntentSwitchReplicaRequest, the same mechanism as for other
TableAware messages in PartitionReplicaListener is used to make sure that table
resources are ready to process the corresponding TableAware request. The
mechanism is taking current time from the clock (updated with the requester's
clock time passed via request.timestamp) and then doing a schema sync with that
time (as the table was already created earlier, it makes sure that table
resources are prepared and installed). Nevertheless,
WriteIntentSwitchRequestHandler does this trick itself, and its code might
differ from PartitionReplicaListener's, so it makes sense to make sure we
don't have a bug here.
# If the table was removed, then the corresponding table processor was
removed too. In PartitionReplicaListener (PRL), we explicitly check for null.
We probably have to do the same in WriteIntentSwitchRequestHandler, and this
seems to be the actual reason (and the candidate fix). Also, please take a look
at IGNITE-26819 and the corresponding
[PR|https://github.com/apache/ignite-3/pull/6944]; it seems that the NPE we see
here is a consequence of the fix for IGNITE-26819 being incomplete. One thing
is worrying about this explanation: the user says that they did not drop any
tables (but they could be wrong).
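The null check proposed in the second item could look roughly like the sketch below. This is a self-contained model, not the actual Ignite code: TableProcessor, SwitchResult and WriteIntentSwitchSketch are hypothetical names introduced for illustration, and treating a missing processor as "table is gone, nothing to switch" is an assumption about the desired behavior.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a per-table processor; not an Ignite class. */
interface TableProcessor {
    CompletableFuture<Void> switchWriteIntents(String txId);
}

/** Hypothetical outcome of a write intent switch attempt. */
enum SwitchResult { SWITCHED, TABLE_GONE }

/** Sketch of a handler that tolerates a missing table processor. */
class WriteIntentSwitchSketch {
    private final Map<Integer, TableProcessor> processors = new ConcurrentHashMap<>();

    void addProcessor(int tableId, TableProcessor processor) {
        processors.put(tableId, processor);
    }

    void removeProcessor(int tableId) {
        processors.remove(tableId);
    }

    CompletableFuture<SwitchResult> handle(int tableId, String txId) {
        TableProcessor processor = processors.get(tableId);

        if (processor == null) {
            // Assumption: the table was already destroyed, so there is nothing
            // to switch. Dereferencing the processor here would throw an NPE.
            return CompletableFuture.completedFuture(SwitchResult.TABLE_GONE);
        }

        return processor.switchWriteIntents(txId)
                .thenApply(unused -> SwitchResult.SWITCHED);
    }
}
```

With this shape, a request that arrives after the processor was removed completes normally instead of failing inside the lambda.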
The second item seems to be the culprit, but it would be great to write the
corresponding test this time.
The scenario for it is:
# Some external transaction creates a write intent (WI)
# The transaction is committed, but WI cleanup is not performed
# The table is dropped
# The low watermark (LWM) rises enough to cause the table destruction
# Only now do we make another attempt to switch the WI
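The order of events in this scenario can be walked through with a toy in-memory model. It is only a sketch of the sequence, not a test against Ignite: every class and method name below is hypothetical, and the rule "a dropped table is destroyed once the LWM reaches its drop timestamp" is a simplifying assumption. A real test would have to drive the same steps through the actual node and handler.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Toy model of the scenario; all names are hypothetical, none are Ignite APIs. */
class DroppedTableScenarioSketch {
    /** Table id -> transactions with unswitched write intents. No key means the table was destroyed. */
    private final Map<Integer, Set<String>> pendingIntents = new HashMap<>();

    /** Table id -> timestamp at which the table was dropped. */
    private final Map<Integer, Long> dropTimestamps = new HashMap<>();

    void createTable(int tableId) {
        pendingIntents.put(tableId, new HashSet<>());
    }

    void writeIntent(int tableId, String txId) {
        pendingIntents.get(tableId).add(txId);
    }

    void dropTable(int tableId, long dropTimestamp) {
        dropTimestamps.put(tableId, dropTimestamp);
    }

    /** Assumption: a dropped table is destroyed once the LWM reaches its drop timestamp. */
    void raiseLowWatermark(long lwm) {
        Iterator<Map.Entry<Integer, Long>> it = dropTimestamps.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (entry.getValue() <= lwm) {
                pendingIntents.remove(entry.getKey());
                it.remove();
            }
        }
    }

    /** Returns true if an intent was switched; false if there was none or the table is gone. */
    boolean trySwitchWriteIntent(int tableId, String txId) {
        Set<String> intents = pendingIntents.get(tableId);
        if (intents == null) {
            // The table was destroyed; this is the state in which the NPE was observed.
            return false;
        }
        return intents.remove(txId);
    }

    /** Runs the five steps in order and returns the result of the late switch attempt. */
    static boolean runScenario() {
        DroppedTableScenarioSketch node = new DroppedTableScenarioSketch();
        node.createTable(1);
        node.writeIntent(1, "tx1");      // 1. a transaction creates a write intent
        // 2. the transaction commits, cleanup is skipped: the intent stays pending
        node.dropTable(1, 100L);         // 3. the table is dropped
        node.raiseLowWatermark(150L);    // 4. LWM passes the drop timestamp
        return node.trySwitchWriteIntent(1, "tx1"); // 5. late switch attempt
    }
}
```

In this model the late switch attempt finds no table state and returns false rather than throwing, which is the behavior the test would need to assert on the real handler.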
was:
Caused by: java.lang.NullPointerException
at
org.apache.ignite.internal.partition.replicator.handlers.WriteIntentSwitchRequestHandler.lambda$invokeTableWriteIntentSwitchReplicaRequest$4(WriteIntentSwitchRequestHandler.java:172)
... 24 more
> NullPointerException in WriteIntentSwitchRequestHandler
> -------------------------------------------------------
>
> Key: IGNITE-27345
> URL: https://issues.apache.org/jira/browse/IGNITE-27345
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Assignee: Filipp Shergalis
> Priority: Major
> Labels: ignite-3
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)