[ https://issues.apache.org/jira/browse/IGNITE-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068644#comment-15068644 ]
Noam Liran commented on IGNITE-2219:
------------------------------------

I think we know what the issue is.

The oldest node frequently sends full partition maps to the other nodes in the cluster. This is done in {{sendAllPartitions()}} in {{GridCachePartitionExchangeManager}}, which is called on several occasions. {{sendAllPartitions()}} creates a {{GridDhtPartitionsFullMessage}} and populates it with the full partition maps. Note that these are complex objects that are stored by reference. It then iterates over all nodes in the cluster and sends the message to them, one by one, asynchronously, using {{cctx.io().sendNoRetry()}}. This, in turn, calls {{GridDhtPartitionsFullMessage.prepareMarshal()}} for each node separately (on the same message object).

If the maps somehow change, this will cause {{partsBytes}} (the serialized version of the maps) to change *even though* some writers might have already started sending it to other nodes. Specifically, if the byte array length was already written in one packet and the array changes to a different size, the written size will reflect the size of the *new* array rather than the old one, which essentially corrupts the message.

We think this issue might actually happen frequently with other messages as well, since it is easy to overlook. We thought that creating a separate message instance for each node might be a quick fix, but we're not sure whether this has side effects.

Regards and Happy Christmas / Novy God!
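For illustration, the interleaving described in this comment can be reproduced deterministically in a small Java sketch. This is not Ignite code: {{FullMapMessage}}, {{Writer}} and the string partition states are hypothetical stand-ins, and the two-step write only assumes that the length prefix and the payload can end up in different packets. It also shows the proposed quick fix (a separate message instance per node) under the same assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

class SharedMessageSketch {
    /** Hypothetical message: holds the map by reference and serializes it lazily. */
    static final class FullMapMessage {
        final Map<Integer, String> parts;
        byte[] partsBytes; // overwritten on every prepareMarshal() call

        FullMapMessage(Map<Integer, String> parts) { this.parts = parts; }

        void prepareMarshal() {
            partsBytes = parts.toString().getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical writer: emits the length and the body in two separate steps. */
    static final class Writer {
        final FullMapMessage msg;
        final ByteArrayOutputStream wire = new ByteArrayOutputStream();

        Writer(FullMapMessage msg) { this.msg = msg; }

        void writeLength() {
            byte[] len = ByteBuffer.allocate(4).putInt(msg.partsBytes.length).array();
            wire.write(len, 0, len.length);
        }

        void writeBody() {
            // Re-reads the field, so it sees whatever prepareMarshal() stored last.
            wire.write(msg.partsBytes, 0, msg.partsBytes.length);
        }

        /** True if the declared length matches the number of payload bytes sent. */
        boolean frameIsConsistent() {
            byte[] frame = wire.toByteArray();
            return ByteBuffer.wrap(frame).getInt() == frame.length - 4;
        }
    }

    /** One message object shared by all sends: the frame for node A gets corrupted. */
    static boolean sendShared() {
        Map<Integer, String> parts = new TreeMap<>();
        parts.put(0, "OWNING");
        FullMapMessage msg = new FullMapMessage(parts);

        msg.prepareMarshal();            // send to node A begins
        Writer toA = new Writer(msg);
        toA.writeLength();               // first packet: length of the old array

        parts.put(1, "MOVING");          // the map changes in the meantime
        msg.prepareMarshal();            // send to node B re-marshals the same object

        toA.writeBody();                 // second packet: body of the new array
        return toA.frameIsConsistent();
    }

    /** Quick fix from the comment: a separate message (and map copy) per node. */
    static boolean sendPerNodeCopy() {
        Map<Integer, String> parts = new TreeMap<>();
        parts.put(0, "OWNING");

        FullMapMessage msgA = new FullMapMessage(new TreeMap<>(parts));
        msgA.prepareMarshal();
        Writer toA = new Writer(msgA);
        toA.writeLength();

        parts.put(1, "MOVING");
        FullMapMessage msgB = new FullMapMessage(new TreeMap<>(parts));
        msgB.prepareMarshal();           // does not touch msgA.partsBytes

        toA.writeBody();
        return toA.frameIsConsistent();
    }

    public static void main(String[] args) {
        System.out.println("shared message consistent:   " + sendShared());
        System.out.println("per-node message consistent: " + sendPerNodeCopy());
    }
}
```

In this sketch the shared-message send produces a frame whose length prefix disagrees with its payload, while the per-node copy stays consistent. Whether copying per node is acceptable in Ignite itself (memory, extra marshalling cost) is exactly the open question about side effects.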
Noam and [~avihai.berkov...@microsoft.com]

> ClassCastException from NodeIdMessage to AffinityTopologyVersion
> ----------------------------------------------------------------
>
> Key: IGNITE-2219
> URL: https://issues.apache.org/jira/browse/IGNITE-2219
> Project: Ignite
> Issue Type: Bug
> Affects Versions: ignite-1.4
> Environment: Ubuntu 12.04 64 bit
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> Ignite 1.4.0
> Reporter: Avihai Berkovitz
> Attachments: message-hex.txt
>
>
> We had a cluster up and running for a couple of days. Without doing anything
> new, we got the following error in one of the nodes:
> {noformat}
> Caught unhandled exception in NIO worker thread (restart the node).
> java.lang.ClassCastException: org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$NodeIdMessage cannot be cast to org.apache.ignite.internal.processors.affinity.AffinityTopologyVersion
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage.readFrom(GridDhtPartitionsFullMessage.java:176) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.direct.DirectByteBufferStream.readMessage(DirectByteBufferStream.java:963) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.direct.DirectMessageReader.readMessage(DirectMessageReader.java:252) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.managers.communication.GridIoMessage.readFrom(GridIoMessage.java:249) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridDirectParser.decode(GridDirectParser.java:79) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:104) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onMessageReceived(GridConnectionBytesVerifyFilter.java:78) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:2124) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:173) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:898) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1437) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1379) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1263) ~[ignite-core-1.4.0.jar:1.4.0]
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-1.4.0.jar:1.4.0]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> It happened only once so far, but killed the communication from this node.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)