[ https://issues.apache.org/jira/browse/IGNITE-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068213#comment-15068213 ]
Noam Liran commented on IGNITE-2219:
------------------------------------

We added debug prints in org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage.readFrom(GridDhtPartitionsFullMessage.java) and now have the problematic buffer. We have attached the full message buffer at this point in hex. We also think these log lines are relevant:

{noformat}
2015-12-22 14:21:11.800 WARN o.a.i.s.c.tcp.TcpCommunicationSpi tcp-comm-worker-#1%null ctx: actor: - Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/0:0:0:0:0:0:0:1%lo:47105, failureDetectionTimeout=10000]
2015-12-22 14:21:11.800 WARN o.a.i.s.c.tcp.TcpCommunicationSpi tcp-comm-worker-#1%null ctx: actor: - Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47105, failureDetectionTimeout=10000]
2015-12-22 14:21:11.801 WARN o.a.i.s.c.tcp.TcpCommunicationSpi tcp-comm-worker-#1%null ctx: actor: - Failed to connect to a remote node (make sure that destination node is alive and operating system firewall is disabled on local and remote hosts) [addrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47105, /0:0:0:0:0:0:0:1%lo:47105, /127.0.0.1:47105]]
2015-12-22 14:21:11.807 WARN o.a.i.i.p.c.GridCachePartitionExchangeManager ignite-#23%sys-null% ctx: actor: - Failed to send partitions full message [node=TcpDiscoveryNode [id=fe615af4-4935-4edb-a947-cc431e97ea48, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.1.62, 127.0.0.1], sockAddrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47504, /0:0:0:0:0:0:0:1%lo:47504, /10.0.1.62:47504, /127.0.0.1:47504], discPort=47504, order=10, intOrder=10, lastExchangeTime=1450793971531, loc=false, ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], err=class o.a.i.IgniteCheckedException: Failed to send message (node may have left the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode [id=fe615af4-4935-4edb-a947-cc431e97ea48, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.1.62, 127.0.0.1], sockAddrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47504, /0:0:0:0:0:0:0:1%lo:47504, /10.0.1.62:47504, /127.0.0.1:47504], discPort=47504, order=10, intOrder=10, lastExchangeTime=1450793971531, loc=false, ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], topic=TOPIC_CACHE, msg=GridDhtPartitionsFullMessage [parts={218603958=GridDhtPartitionFullMap [nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=451, size=18], 816937872=GridDhtPartitionFullMap [nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=446, size=18], -2100569601=GridDhtPartitionFullMap [nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=464, size=18], 689859866=GridDhtPartitionFullMap [nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=467, size=18], -423204530=GridDhtPartitionFullMap [nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=462, size=18], 1325947219=GridDhtPartitionFullMap [nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=464, size=18], 1059659892=GridDhtPartitionFullMap [nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=459, size=18]}, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], partCnt=7, super=GridDhtPartitionsAbstractMessage [exchId=null, lastVer=null, super=GridCacheMessage [msgId=909, depInfo=null, err=null, skipPrepare=false]]], policy=2]]
{noformat}
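(Editor's note: incidental to the bug itself, but since the warnings above suggest raising {{failureDetectionTimeout}}, here is a minimal sketch of doing that programmatically. It assumes the property is exposed through {{IgniteConfiguration#setFailureDetectionTimeout}}; the 30 s value is only an example, the log shows the default of 10000 ms.)

{noformat}
// Minimal sketch: start a node with a longer failure detection timeout.
// Assumes IgniteConfiguration#setFailureDetectionTimeout maps to the
// 'failureDetectionTimeout' property mentioned in the warnings above.
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartNodeWithLongerFailureDetection {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setFailureDetectionTimeout(30_000L); // default is 10_000 ms per the warning above

        try (Ignite ignite = Ignition.start(cfg)) {
            // Node runs here; the longer timeout gives slow links more time
            // before connections are considered failed.
        }
    }
}
{noformat}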
The *really* interesting thing we found is this: in the full buffer, the first field is the length of partsBytes: {{521F0100}} = {{0x00011f52}} = {{73554}}. If you skip it (+ 4 bytes for the length itself) you get to this part of the buffer: {{FF0002000000000000000000FFFFFFFF08000000}}. Since the first byte is 0xFF, it is being deserialized as a {{NodeIdMessage}}, even though {{GridDhtPartitionsFullMessage}} mandates that at this point only an {{AffinityTopologyVersion}} should be found. The thing is that, if you look a bit earlier in the buffer, you can actually find the {{AffinityTopologyVersion}}:

{noformat}
... 41FC41FD41FE41FF41 6F00000000FFFFFFFFFFFFFFFF 0002000000000000000000FFFFFFFF08000000
{noformat}

Obviously something is amiss here: either partsBytes' length is 12 bytes too long, or 12 bytes went missing somewhere.

Regards,
Noam and [~avihai.berkov...@microsoft.com]
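(Editor's note: as a quick cross-check of the arithmetic in the comment, the following self-contained snippet, plain JDK code rather than Ignite internals, reproduces the length calculation and the 0xFF marker test. It only assumes the 4-byte prefix is a little-endian int, which matches the quoted values.)

{noformat}
// Cross-check of the partsBytes length prefix and the 0xFF marker described above.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PartsBytesLengthCheck {
    public static void main(String[] args) {
        // First four bytes of the field: the little-endian length prefix 52 1F 01 00.
        byte[] lenPrefix = {(byte) 0x52, (byte) 0x1F, (byte) 0x01, (byte) 0x00};
        int len = ByteBuffer.wrap(lenPrefix).order(ByteOrder.LITTLE_ENDIAN).getInt();
        System.out.println(Integer.toHexString(len)); // 11f52
        System.out.println(len);                      // 73554

        // The byte found after skipping len + 4 bytes in the dump. Because it is 0xFF,
        // the reader treats what follows as a NodeIdMessage instead of the expected
        // AffinityTopologyVersion.
        byte marker = (byte) 0xFF;
        System.out.println((marker & 0xFF) == 0xFF);  // true
    }
}
{noformat}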
> ClassCastException from NodeIdMessage to AffinityTopologyVersion
> ----------------------------------------------------------------
>
> Key: IGNITE-2219
> URL: https://issues.apache.org/jira/browse/IGNITE-2219
> Project: Ignite
> Issue Type: Bug
> Affects Versions: ignite-1.4
> Environment: Ubuntu 12.04 64 bit
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> Ignite 1.4.0
> Reporter: Avihai Berkovitz
> Attachments: message-hex.txt
>
>
> We had a cluster up and running for a couple of days. Without doing anything
> new, we got the following error in one of the nodes:
> {noformat}
> Caught unhandled exception in NIO worker thread (restart the node).
> java.lang.ClassCastException: org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$NodeIdMessage cannot be cast to org.apache.ignite.internal.processors.affinity.AffinityTopologyVersion
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage.readFrom(GridDhtPartitionsFullMessage.java:176) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.direct.DirectByteBufferStream.readMessage(DirectByteBufferStream.java:963) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.direct.DirectMessageReader.readMessage(DirectMessageReader.java:252) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.managers.communication.GridIoMessage.readFrom(GridIoMessage.java:249) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridDirectParser.decode(GridDirectParser.java:79) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:104) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onMessageReceived(GridConnectionBytesVerifyFilter.java:78) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:2124) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:173) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:898) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1437) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1379) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1263) ~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-1.4.0.jar:1.4.0]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> It happened only once so far, but killed the communication from this node.
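(Editor's note: for context on the stack trace above. Per the comment, the reader picks the message class from a single type byte, 0xFF being read as a {{NodeIdMessage}}, and {{GridDhtPartitionsFullMessage.readFrom}} then casts the result to {{AffinityTopologyVersion}}. The following is a deliberately simplified illustration of that failure mode using hypothetical stand-in classes, not Ignite's actual reader.)

{noformat}
// Simplified illustration of the failure mode; hypothetical classes, not Ignite code.
interface Message { }

class NodeIdMessage implements Message { }

class AffinityTopologyVersion implements Message { }

final class TinyDirectReader {
    // Chooses the message class from a marker byte, as the report describes:
    // 0xFF is interpreted as a NodeIdMessage.
    static Message readMessage(byte typeMarker) {
        return (typeMarker & 0xFF) == 0xFF ? new NodeIdMessage() : new AffinityTopologyVersion();
    }
}

public class ClassCastDemo {
    public static void main(String[] args) {
        // The reader lands on 0xFF because the length prefix overshoots (or bytes were lost),
        // so this unconditional cast throws the same ClassCastException as in the report.
        AffinityTopologyVersion topVer =
            (AffinityTopologyVersion) TinyDirectReader.readMessage((byte) 0xFF);
        System.out.println(topVer);
    }
}
{noformat}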