[ https://issues.apache.org/jira/browse/CASSANDRA-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560301#comment-14560301 ]
Yuki Morishita commented on CASSANDRA-9458:
-------------------------------------------

Thanks for providing the log and the patch.

I don't think the problem here is a race, though. StreamSession's {{state}} is guarded by synchronized methods.

From the log, I think both ends are in the {{WAIT_COMPLETE}} state, and {{/11.22.33.44}} is waiting for {{/11.22.33.55}} to send {{CompleteMessage}} after it finishes finalizing received files (in {{StreamReceiveTask.OnCompletionRunnable}}).

Do you have secondary indexes? Right now, streaming is considered complete only after secondary indexes are built in that finalize phase (CASSANDRA-9308).

> Race condition causing StreamSession to get stuck in WAIT_COMPLETE
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-9458
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9458
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Omid Aladini
>            Assignee: Omid Aladini
>            Priority: Critical
>             Fix For: 2.1.x, 2.0.x
>
>         Attachments: 9458-v1.txt
>
>
> I think there is a race condition in StreamSession where one side of the
> stream could get stuck in WAIT_COMPLETE although both have sent COMPLETE
> messages. Consider a scenario in which node B is being bootstrapped and
> only receives files during the session:
> 1- During a stream session A sends some files to B and B sends no files to A.
> 2- Once B completes the last task (receiving), StreamSession::maybeComplete
> is invoked.
> 3- While B is sending the COMPLETE message via StreamSession::maybeComplete,
> it also receives the COMPLETE message from A and therefore
> StreamSession::complete() is invoked.
> 4- Therefore both maybeComplete() and complete() functions have branched into
> the state != State.WAIT_COMPLETE case and both set the state to WAIT_COMPLETE.
> 5- Now B is waiting to receive COMPLETE although it has already received it,
> and nothing triggers checking the state again until it times out after
> streaming_socket_timeout_in_ms.
> In the log below:
> https://gist.github.com/omidaladini/003de259958ad8dfb07e
> although the node has received COMPLETE, "SocketTimeoutException" is thrown
> after streaming_socket_timeout_in_ms (30 minutes here).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
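
To illustrate the point that synchronized methods rule out the interleaving in steps 3-4 of the description, here is a minimal sketch of the two-step completion handshake. It is a simplified, hypothetical model and not the actual StreamSession code: the class {{StreamStateModel}}, its methods {{localTasksDone()}} and {{peerCompleteReceived()}}, and the three-value {{State}} enum are assumptions made for this example. Because both transitions hold the same monitor, they cannot both observe a state other than WAIT_COMPLETE, so whichever runs second should move the model to COMPLETE.

```java
import java.util.concurrent.CountDownLatch;

// Simplified, hypothetical model of the completion handshake.
// NOT the real StreamSession: all names here are illustrative assumptions.
class StreamStateModel {
    enum State { STREAMING, WAIT_COMPLETE, COMPLETE }

    private State state = State.STREAMING;

    // Models the local side finishing its last task (step 2 in the description).
    synchronized void localTasksDone() {
        if (state == State.WAIT_COMPLETE) {
            state = State.COMPLETE;       // peer's COMPLETE was already seen
        } else {
            // A real session would send COMPLETE to the peer at this point.
            state = State.WAIT_COMPLETE;
        }
    }

    // Models receiving the peer's COMPLETE message (step 3 in the description).
    synchronized void peerCompleteReceived() {
        if (state == State.WAIT_COMPLETE) {
            state = State.COMPLETE;       // local tasks had already finished
        } else {
            state = State.WAIT_COMPLETE;
        }
    }

    synchronized State state() {
        return state;
    }

    // Fires both transitions from two threads released at the same moment
    // and returns the state once both have finished.
    static State runOnce() {
        StreamStateModel session = new StreamStateModel();
        CountDownLatch start = new CountDownLatch(1);
        Thread local = new Thread(() -> { awaitQuietly(start); session.localTasksDone(); });
        Thread peer = new Thread(() -> { awaitQuietly(start); session.peerCompleteReceived(); });
        local.start();
        peer.start();
        start.countDown();
        try {
            local.join();
            peer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for threads", e);
        }
        return session.state();
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            if (runOnce() != State.COMPLETE) {
                throw new AssertionError("stuck in WAIT_COMPLETE at iteration " + i);
            }
        }
        System.out.println("all runs reached COMPLETE");
    }
}
```

In this model the only way to remain in WAIT_COMPLETE is for one of the two events never to happen, which is consistent with the explanation that one end is still waiting for the other to send {{CompleteMessage}}.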