[ https://issues.apache.org/jira/browse/HIVE-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900607#comment-15900607 ]
Chaoyu Tang commented on HIVE-16071:
------------------------------------

I agree with [~xuefuz] that we need a timeout for SASL handshaking on the RPC server side for the case he raised. This timeout should be shorter than the client.server.connect.timeout used by RegisterClient. Ideally, though, it should be a little longer than the client.connect.timeout used for handshaking by RemoteDriver, so that the server is not the side that times out the handshake first, given that starting a RemoteDriver is quite expensive.

If so, I would suggest introducing a new configuration such as hive.spark.rpc.handshake.server.timeout, and renaming hive.spark.client.connect.timeout to hive.spark.rpc.handshake.client.timeout (although it would still also serve as the socket connect timeout on the RemoteDriver side, as it does now). The hive.spark.client.server.connect.timeout could also be renamed to something like hive.spark.register.remote.driver.timeout if necessary.

What do you guys think about it?

> Spark remote driver misuses the timeout in RPC handshake
> --------------------------------------------------------
>
>                 Key: HIVE-16071
>                 URL: https://issues.apache.org/jira/browse/HIVE-16071
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Chaoyu Tang
>            Assignee: Chaoyu Tang
>         Attachments: HIVE-16071.patch
>
>
> Based on its property description in HiveConf and the comments in HIVE-12650
> (https://issues.apache.org/jira/browse/HIVE-12650?focusedCommentId=15128979&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15128979),
> hive.spark.client.connect.timeout is the timeout for the Spark remote
> driver making a socket connection (channel) to the RPC server. But currently it is
> also used by the remote driver for RPC client/server handshaking, which is
> not right. Instead, hive.spark.client.server.connect.timeout should be used,
> and it is already used by the RPCServer in the handshaking.
> An error like the following is usually caused by this issue, since the default
> hive.spark.client.connect.timeout value (1000ms) used by the remote driver for
> handshaking is a little too short.
> {code}
> 17/02/20 08:46:08 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
>         at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>         at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
>         at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
>         at org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:453)
>         at org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
> {code}
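
To make the ordering suggested in the comment concrete, here is a minimal sketch of a sanity check, assuming the three timeouts were exposed under the names proposed in the comment. Those property names are only proposals and are not existing Hive settings. The HandshakeTimeouts class, its helper methods, and the 5000 and 90000 values are hypothetical and chosen for illustration; only the 1000ms client value comes from the issue description. A plain Map stands in for HiveConf so the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

public class HandshakeTimeouts {
    // Property names as proposed in the comment; they are NOT existing Hive settings.
    static final String CLIENT_HANDSHAKE = "hive.spark.rpc.handshake.client.timeout";
    static final String SERVER_HANDSHAKE = "hive.spark.rpc.handshake.server.timeout";
    static final String REGISTER_DRIVER = "hive.spark.register.remote.driver.timeout";

    /** Reads a millisecond value from the map; fails loudly if the key is absent. */
    static long millis(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing setting: " + key);
        }
        return Long.parseLong(value.trim());
    }

    /**
     * Returns true when client handshake < server handshake < driver registration,
     * which is the ordering the comment argues for.
     */
    static boolean isOrdered(Map<String, String> conf) {
        long client = millis(conf, CLIENT_HANDSHAKE);
        long server = millis(conf, SERVER_HANDSHAKE);
        long register = millis(conf, REGISTER_DRIVER);
        return client < server && server < register;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CLIENT_HANDSHAKE, "1000");   // default quoted in the issue description
        conf.put(SERVER_HANDSHAKE, "5000");   // illustrative value only
        conf.put(REGISTER_DRIVER, "90000");   // illustrative value only
        System.out.println("timeouts ordered as proposed: " + isOrdered(conf));
    }
}
```

With this ordering the remote driver, rather than the RPC server, would normally be the first side to give up on a stalled handshake, while the server-side timeout still bounds how long a stuck SASL negotiation can hold a channel open.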