[ https://issues.apache.org/jira/browse/HIVE-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903422#comment-15903422 ]
Chaoyu Tang edited comment on HIVE-16071 at 3/9/17 5:24 PM:
------------------------------------------------------------

So we reached the consensus that hive.spark.client.server.connect.timeout should not be used for the cancelTask on the RpcServer side; the proposed value is hive.spark.client.connect.timeout. [~xuefuz] The reason I previously suggested considering a separate timeout for the cancelTask (a little longer than hive.spark.client.connect.timeout) is to give the RemoteDriver a little more time than the RpcServer to time out the handshake. If the timeouts on both sides are set to exactly the same value, we might often see the SASL handshake terminated by the cancelTask on the RpcServer side simply because the RemoteDriver-side timeout fired slightly later for whatever reason, and during that short window the handshake could still succeed if the cancelTask had not terminated it.

To my understanding, shortening the cancelTask timeout is mainly so that the RpcServer detects a handshake timeout (fired by the RemoteDriver) sooner; we still want the RemoteDriver to primarily control the SASL handshake timeout, and most handshake timeouts should be fired from the RemoteDriver, right?
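For illustration only, here is a minimal, self-contained sketch of the two-timeout arrangement described above (hypothetical class and variable names, made-up delay values; this is not the actual Rpc/RpcServer code): the RemoteDriver-side handshake timeout is scheduled first, and the RpcServer-side cancelTask slightly later, so the driver side normally reports the timeout.

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class HandshakeTimeoutSketch {
  public static void main(String[] args) {
    // Hypothetical values: the driver times out the SASL handshake after the
    // client connect timeout, the server's cancelTask waits a little longer.
    long driverHandshakeTimeoutMs = 1000;   // e.g. hive.spark.client.connect.timeout
    long serverCancelTaskTimeoutMs = driverHandshakeTimeoutMs + 200;

    ScheduledExecutorService timer = Executors.newScheduledThreadPool(2);
    AtomicBoolean handshakeDone = new AtomicBoolean(false);

    // RemoteDriver side: this fires first, so the handshake timeout is
    // normally reported from the driver.
    ScheduledFuture<?> driverTimeout = timer.schedule(() -> {
      if (!handshakeDone.get()) {
        System.err.println("RemoteDriver: SASL handshake timed out");
      }
    }, driverHandshakeTimeoutMs, TimeUnit.MILLISECONDS);

    // RpcServer side: cancelTask fires slightly later, closing the pending
    // client only if the driver has not already given up.
    ScheduledFuture<?> cancelTask = timer.schedule(() -> {
      if (!handshakeDone.get()) {
        System.err.println("RpcServer: cancelTask closing the pending client");
      }
    }, serverCancelTaskTimeoutMs, TimeUnit.MILLISECONDS);

    // Simulate a handshake that completes in time; both tasks are cancelled.
    handshakeDone.set(true);
    driverTimeout.cancel(false);
    cancelTask.cancel(false);
    timer.shutdown();
  }
}
{code}

With equal delays on both sides, which side terminates the handshake becomes a race, which is exactly the situation described above.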
> Spark remote driver misuses the timeout in RPC handshake
> --------------------------------------------------------
>
>                 Key: HIVE-16071
>                 URL: https://issues.apache.org/jira/browse/HIVE-16071
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Chaoyu Tang
>            Assignee: Chaoyu Tang
>         Attachments: HIVE-16071.patch
>
> Based on its property description in HiveConf and the comments in HIVE-12650 (https://issues.apache.org/jira/browse/HIVE-12650?focusedCommentId=15128979&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15128979), hive.spark.client.connect.timeout is the timeout for the Spark remote driver to make a socket connection (channel) to the RPC server. But currently it is also used by the remote driver for the RPC client/server handshake, which is not right. Instead, hive.spark.client.server.connect.timeout should be used; it is already used by the RpcServer for the handshake.
> An error like the following is usually caused by this issue, since the default hive.spark.client.connect.timeout value (1000ms) used by the remote driver for the handshake is a little too short.
> {code}
> 17/02/20 08:46:08 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
>     at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>     at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
>     at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
>     at org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:453)
>     at org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
> {code}
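For reference, a minimal sketch (not part of the attached patch) of setting the two properties explicitly through HiveConf. The property names come from this issue; the 90s handshake value is only an illustrative choice, not necessarily the shipped default.

{code}
import org.apache.hadoop.hive.conf.HiveConf;

public class SparkRpcTimeoutSettings {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Timeout for the remote driver's socket connection back to the RPC
    // server (the 1000ms default mentioned above).
    conf.set("hive.spark.client.connect.timeout", "1000ms");
    // Timeout intended for the client/server handshake; per this issue it is
    // the value the remote driver should use for SASL negotiation as well.
    // The 90s value here is illustrative only.
    conf.set("hive.spark.client.server.connect.timeout", "90s");
    System.out.println("handshake timeout = "
        + conf.get("hive.spark.client.server.connect.timeout"));
  }
}
{code}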