[ 
https://issues.apache.org/jira/browse/IGNITE-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068213#comment-15068213
 ] 

Noam Liran commented on IGNITE-2219:
------------------------------------

We added debug prints in 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage.readFrom(GridDhtPartitionsFullMessage.java)
 and now have the problematic buffer;

We've attached the full message buffer at this point in hex; we also think 
these log lines are relevant:
{noformat}
2015-12-22 14:21:11.800 WARN  o.a.i.s.c.tcp.TcpCommunicationSpi    
tcp-comm-worker-#1%null ctx:             actor:             - Connect timed out 
(consider increasing 'failureDetectionTimeout' configuration property) 
[addr=/0:0:0:0:0:0:0:1%lo:47105, failureDetectionTimeout=10000]
2015-12-22 14:21:11.800 WARN  o.a.i.s.c.tcp.TcpCommunicationSpi    
tcp-comm-worker-#1%null ctx:             actor:             - Connect timed out 
(consider increasing 'failureDetectionTimeout' configuration property) 
[addr=/127.0.0.1:47105, failureDetectionTimeout=10000]
2015-12-22 14:21:11.801 WARN  o.a.i.s.c.tcp.TcpCommunicationSpi    
tcp-comm-worker-#1%null ctx:             actor:             - Failed to connect 
to a remote node (make sure that destination node is alive and operating system 
firewall is disabled on local and remote hosts) 
[addrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47105, 
/0:0:0:0:0:0:0:1%lo:47105, /127.0.0.1:47105]]
2015-12-22 14:21:11.807 WARN  o.a.i.i.p.c.GridCachePartitionExchangeManager 
ignite-#23%sys-null% ctx:             actor:             - Failed to send 
partitions full message [node=TcpDiscoveryNode 
[id=fe615af4-4935-4edb-a947-cc431e97ea48, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.1.62, 
127.0.0.1], sockAddrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47504, 
/0:0:0:0:0:0:0:1%lo:47504, /10.0.1.62:47504, /127.0.0.1:47504], discPort=47504, 
order=10, intOrder=10, lastExchangeTime=1450793971531, loc=false, 
ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], err=class 
o.a.i.IgniteCheckedException: Failed to send message (node may have left the 
grid or TCP connection cannot be established due to firewall issues) 
[node=TcpDiscoveryNode [id=fe615af4-4935-4edb-a947-cc431e97ea48, 
addrs=[0:0:0:0:0:0:0:1%lo, 10.0.1.62, 127.0.0.1], 
sockAddrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47504, 
/0:0:0:0:0:0:0:1%lo:47504, /10.0.1.62:47504, /127.0.0.1:47504], discPort=47504, 
order=10, intOrder=10, lastExchangeTime=1450793971531, loc=false, 
ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], topic=TOPIC_CACHE, 
msg=GridDhtPartitionsFullMessage [parts={218603958=GridDhtPartitionFullMap 
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=451, 
size=18], 816937872=GridDhtPartitionFullMap 
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=446, 
size=18], -2100569601=GridDhtPartitionFullMap 
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=464, 
size=18], 689859866=GridDhtPartitionFullMap 
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=467, 
size=18], -423204530=GridDhtPartitionFullMap 
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=462, 
size=18], 1325947219=GridDhtPartitionFullMap 
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=464, 
size=18], 1059659892=GridDhtPartitionFullMap 
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=459, 
size=18]}, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], 
partCnt=7, super=GridDhtPartitionsAbstractMessage [exchId=null, lastVer=null, 
super=GridCacheMessage [msgId=909, depInfo=null, err=null, 
skipPrepare=false]]], policy=2]]
{noformat}

The *really* interesting thing we found is this: in the full buffer, the first 
field is the length of partsBytes: {{521F0100}} = {{0x00011f52}} = {{73554}}.
If you skip it (+ 4 bytes for the length itself) you get to this part of the 
buffer: 
{{FF0002000000000000000000FFFFFFFF08000000}}
Since the first byte is 0xFF, it is being deserialized as {{NodeIdMessage}} 
even though {{GridDhtPartitionsFullMessage}} mandates that at this point only 
{{AffinityTopologyVersion}} should be found.

The thing is that, if you look a bit before that in the buffer you can actually 
find {{AffinityTopologyVersion}}: {noformat} 
... 41FC41FD41FE41FF41 6F00000000FFFFFFFFFFFFFFFF 
0002000000000000000000FFFFFFFF08000000 
{noformat}

Obviously something is a miss here: either partsBytes' length is 12 bytes too 
long, or 12 bytes went missing somewhere.

Regards,
Noam and [~avihai.berkov...@microsoft.com]

> ClassCastException from NodeIdMessage to AffinityTopologyVersion
> ----------------------------------------------------------------
>
>                 Key: IGNITE-2219
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2219
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: ignite-1.4
>         Environment: Ubuntu 12.04 64 bit
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> Ignite 1.4.0
>            Reporter: Avihai Berkovitz
>         Attachments: message-hex.txt
>
>
> We had a cluster up and running for a couple of days. Without doing anything 
> new, we got the following error in one of the nodes:
> {noformat}
> Caught unhandled exception in NIO worker thread (restart the node). 
> java.lang.ClassCastException: 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$NodeIdMessage 
> cannot be cast to 
> org.apache.ignite.internal.processors.affinity.AffinityTopologyVersion
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage.readFrom(GridDhtPartitionsFullMessage.java:176)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.direct.DirectByteBufferStream.readMessage(DirectByteBufferStream.java:963)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.direct.DirectMessageReader.readMessage(DirectMessageReader.java:252)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.managers.communication.GridIoMessage.readFrom(GridIoMessage.java:249)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridDirectParser.decode(GridDirectParser.java:79)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:104)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onMessageReceived(GridConnectionBytesVerifyFilter.java:78)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:2124)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:173)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:898)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1437)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1379)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1263)
>  ~[ignite-core-1.4.0.jar:1.4.0]
>       at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) 
> [ignite-core-1.4.0.jar:1.4.0]
>       at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> It happened only once so far, but killed the communication from this node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to