[jira] [Commented] (IGNITE-2219) ClassCastException from NodeIdMessage to AffinityTopologyVersion

Noam Liran (JIRA) Tue, 22 Dec 2015 11:59:05 -0800

    [ https://issues.apache.org/jira/browse/IGNITE-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068644#comment-15068644 ]


Noam Liran commented on IGNITE-2219:
------------------------------------

I think we know what the issue is.

The oldest node frequently sends full partition maps to other nodes in the 
cluster. 
This is done in {{sendAllPartitions()}} in 
{{GridCachePartitionExchangeManager}} which is called on several occasions.

{{sendAllPartitions()}} creates a {{GridDhtPartitionsFullMessage}} and 
populates it with the full partition maps.
Note that these maps are complex objects stored by reference.

It then iterates over all nodes in the cluster and sends the message to them, 
one by one, asynchronously using {{cctx.io().sendNoRetry()}}.

This, in turn, calls {{GridDhtPartitionsFullMessage.prepareMarshal()}} for each 
node separately (on the same message object).
If the maps change between these calls, {{partsBytes}} (the serialized version 
of the maps) is regenerated *even though* some writers might have already 
started sending the previous bytes to other nodes.
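A minimal, single-threaded sketch of this interleaving follows. It assumes a simplified message class whose map is held by reference and re-serialized into {{partsBytes}} on every {{prepareMarshal()}} call. {{SharedFullMessage}}, {{ReMarshalRace}} and the use of {{toString()}} as a serializer are hypothetical stand-ins, not Ignite code. It also assumes the write to node A happens in two steps (length prefix first, body later) and re-reads the field when it resumes; in the real code the sends run asynchronously on NIO threads.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for a full-map message: the map is held
// by reference and re-serialized into partsBytes on every prepareMarshal().
class SharedFullMessage {
    final Map<Integer, String> parts; // no defensive copy
    byte[] partsBytes;                // overwritten on each call

    SharedFullMessage(Map<Integer, String> parts) {
        this.parts = parts;
    }

    void prepareMarshal() {
        partsBytes = parts.toString().getBytes(StandardCharsets.UTF_8);
    }
}

public class ReMarshalRace {
    /** Returns {length prefix on node A's wire, number of body bytes that followed it}. */
    static int[] simulate() {
        Map<Integer, String> parts = new TreeMap<>();
        parts.put(0, "OWNING");
        SharedFullMessage msg = new SharedFullMessage(parts);
        ByteArrayOutputStream wireToA = new ByteArrayOutputStream();

        // Send to node A: marshal, then only the 4-byte length prefix goes out
        // (e.g. the socket buffer filled up and the write resumes later).
        msg.prepareMarshal();
        wireToA.write(ByteBuffer.allocate(4).putInt(msg.partsBytes.length).array(), 0, 4);

        // Meanwhile the map changes and the SAME object is marshalled for node B.
        parts.put(1, "MOVING");
        msg.prepareMarshal();

        // The pending write to node A resumes and sends the bytes of the new array.
        wireToA.write(msg.partsBytes, 0, msg.partsBytes.length);

        byte[] wire = wireToA.toByteArray();
        int declared = ByteBuffer.wrap(wire).getInt();
        return new int[] {declared, wire.length - 4};
    }

    public static void main(String[] args) {
        int[] r = simulate();
        System.out.println("length prefix = " + r[0] + ", body bytes = " + r[1]);
    }
}
```

Running it prints a length prefix of 10 followed by 20 body bytes, so the header and the payload on node A's wire no longer agree.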

Specifically, if the byte array length was already written in one packet and 
the array then changes to a different size, the written size will reflect the 
size of the *new* array rather than the old one, essentially corrupting the 
message.
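On the receiving side, a mismatched length prefix shifts every field that follows it. The toy decoder below assumes a type-tagged encoding in which a nested message starts with a type byte and {{readFrom()}}-style code casts the result to the type it expects. {{MisframedRead}}, {{NodeIdMsg}}, {{TopVerMsg}} and the tag values are hypothetical. When the declared length is shorter than the body actually sent, the decoder lands inside the payload, reads a stray byte as a type tag, and the cast fails with a {{ClassCastException}}, which would be consistent with the stack trace quoted in the issue description.

```java
import java.nio.ByteBuffer;

// Toy stand-ins for two message types that a type-tagged decoder can create.
interface Msg {}

final class NodeIdMsg implements Msg {}

final class TopVerMsg implements Msg {
    final long ver;

    TopVerMsg(long ver) {
        this.ver = ver;
    }
}

public class MisframedRead {
    static final byte NODE_ID_TYPE = 1; // hypothetical type tags
    static final byte TOP_VER_TYPE = 2;

    // Reads one nested message: the type tag decides which class is instantiated.
    static Msg readMessage(ByteBuffer buf) {
        byte type = buf.get();
        switch (type) {
            case NODE_ID_TYPE: return new NodeIdMsg();
            case TOP_VER_TYPE: return new TopVerMsg(buf.getLong());
            default: throw new IllegalStateException("Unknown type: " + type);
        }
    }

    // Decodes [int len][len bytes][nested TopVerMsg], trusting the field order
    // and casting the nested message to the expected type.
    static long decode(ByteBuffer buf) {
        int len = buf.getInt();
        buf.position(buf.position() + len); // skip the serialized maps
        TopVerMsg topVer = (TopVerMsg) readMessage(buf);
        return topVer.ver;
    }

    // Builds a wire image whose length prefix comes from one array and body from another.
    static ByteBuffer wire(byte[] prefixSource, byte[] body, long ver) {
        ByteBuffer buf = ByteBuffer.allocate(4 + body.length + 1 + 8);
        buf.putInt(prefixSource.length).put(body).put(TOP_VER_TYPE).putLong(ver);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        byte[] oldBody = {9, 9, 9};
        byte[] newBody = {9, 9, 9, NODE_ID_TYPE, 9, 9};

        // Consistent framing: prefix and body come from the same array.
        System.out.println(decode(wire(newBody, newBody, 42L)));

        // Misframed: prefix from the old array, body from the new one.
        try {
            decode(wire(oldBody, newBody, 42L));
        } catch (ClassCastException e) {
            System.out.println("Misframed: " + e.getMessage());
        }
    }
}
```

With consistent framing the first call prints 42; the misframed call creates a {{NodeIdMsg}} where a {{TopVerMsg}} was expected.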

We suspect the same problem may affect other messages as well, since this 
pattern is easy to overlook.

We think creating a separate message instance for each node might be a quick 
fix, but we are not sure whether it has side effects.
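One possible shape of that quick fix, sketched under assumptions: take a single snapshot of the maps before the send loop and give each destination node its own message instance, so {{prepareMarshal()}} never overwrites bytes that another writer is still sending. {{FullMessage}}, {{PerNodeMessages}} and {{buildMessages()}} are hypothetical names. The shallow {{TreeMap}} copy is only sufficient here because the values are immutable strings; real partition maps are mutable objects and would presumably need a deep copy.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simplified message: each instance owns its serialized bytes.
final class FullMessage {
    private final Map<Integer, String> parts; // read-only snapshot
    private byte[] partsBytes;

    FullMessage(Map<Integer, String> parts) {
        this.parts = parts;
    }

    byte[] prepareMarshal() {
        if (partsBytes == null) // marshal at most once per instance
            partsBytes = parts.toString().getBytes(StandardCharsets.UTF_8);
        return partsBytes;
    }
}

public class PerNodeMessages {
    // Snapshots the live maps once, then builds one message per destination node.
    static List<FullMessage> buildMessages(Map<Integer, String> liveParts, int nodeCount) {
        Map<Integer, String> snapshot = Collections.unmodifiableMap(new TreeMap<>(liveParts));
        List<FullMessage> msgs = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++)
            msgs.add(new FullMessage(snapshot));
        return msgs;
    }

    public static void main(String[] args) {
        Map<Integer, String> live = new TreeMap<>();
        live.put(0, "OWNING");
        List<FullMessage> msgs = buildMessages(live, 2);

        byte[] toA = msgs.get(0).prepareMarshal(); // write to node A starts
        live.put(1, "MOVING");                     // live map changes mid-send
        byte[] toB = msgs.get(1).prepareMarshal(); // marshalling for node B

        // Node A's bytes were not overwritten, and both nodes get the same snapshot.
        System.out.println(new String(toA, StandardCharsets.UTF_8) + " / "
            + new String(toB, StandardCharsets.UTF_8));
    }
}
```

Both messages carry the same bytes even though the live map changed mid-send. Copying the maps on every exchange would cost extra memory and CPU, which is one of the side effects that would need checking; marshalling the snapshot once and sharing the resulting bytes read-only is an alternative.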

Regards and Happy Christmas / Novy God!
Noam and [~avihai.berkov...@microsoft.com]

> ClassCastException from NodeIdMessage to AffinityTopologyVersion
> ----------------------------------------------------------------
>
>                 Key: IGNITE-2219
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2219
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: ignite-1.4
>         Environment: Ubuntu 12.04 64 bit
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> Ignite 1.4.0
>            Reporter: Avihai Berkovitz
>         Attachments: message-hex.txt
>
>
> We had a cluster up and running for a couple of days. Without doing anything 
> new, we got the following error in one of the nodes:
> {noformat}
> Caught unhandled exception in NIO worker thread (restart the node). java.lang.ClassCastException: org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$NodeIdMessage cannot be cast to org.apache.ignite.internal.processors.affinity.AffinityTopologyVersion
>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage.readFrom(GridDhtPartitionsFullMessage.java:176) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.direct.DirectByteBufferStream.readMessage(DirectByteBufferStream.java:963) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.direct.DirectMessageReader.readMessage(DirectMessageReader.java:252) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.managers.communication.GridIoMessage.readFrom(GridIoMessage.java:249) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridDirectParser.decode(GridDirectParser.java:79) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:104) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onMessageReceived(GridConnectionBytesVerifyFilter.java:78) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:2124) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:173) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:898) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1437) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1379) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1263) ~[ignite-core-1.4.0.jar:1.4.0]
>       at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-1.4.0.jar:1.4.0]
>       at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> It has happened only once so far, but it killed communication from this node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
