[ https://issues.apache.org/jira/browse/HIVE-9094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247911#comment-14247911 ]
Chengxiang Li commented on HIVE-9094:
-------------------------------------

Here is the TimeoutException on the Spark client side:
{noformat}
2014-12-15 12:14:09,062 ERROR [main]: ql.Driver (SessionState.java:printError(838)) - FAILED: SemanticException Failed to get spark memory/core info: java.util.concurrent.TimeoutException
org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get spark memory/core info: java.util.concurrent.TimeoutException
    at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:120)
    at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:94)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:78)
    at org.apache.hadoop.hive.ql.lib.ForwardWalker.walk(ForwardWalker.java:79)
    at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:109)
    at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:134)
    at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:99)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10202)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:221)
    at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:74)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:221)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:420)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:306)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1108)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1045)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1035)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:199)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:151)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:362)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:297)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:837)
    at org.apache.hadoop.hive.cli.TestSparkCliDriver.runTest(TestSparkCliDriver.java:234)
    at org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_join16(TestSparkCliDriver.java:166)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at junit.framework.TestCase.runTest(TestCase.java:176)
    at junit.framework.TestCase.runBare(TestCase.java:141)
    at junit.framework.TestResult$1.protect(TestResult.java:122)
    at junit.framework.TestResult.runProtected(TestResult.java:142)
    at junit.framework.TestResult.run(TestResult.java:125)
    at junit.framework.TestCase.run(TestCase.java:129)
    at junit.framework.TestSuite.runTest(TestSuite.java:255)
    at junit.framework.TestSuite.run(TestSuite.java:250)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Caused by: java.util.concurrent.TimeoutException
    at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:49)
    at org.apache.hive.spark.client.JobHandleImpl.get(JobHandleImpl.java:74)
    at org.apache.hive.spark.client.JobHandleImpl.get(JobHandleImpl.java:35)
    at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.getExecutorCount(RemoteHiveSparkClient.java:92)
    at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.getMemoryAndCores(SparkSessionImpl.java:77)
    at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:118)
    ... 43 more
{noformat}

And I found the following log in spark.log, 5 seconds before the timeout on the client side:
{noformat}
2014-12-15 12:14:04,159 INFO [Driver-RPC-Handler-0]: client.RemoteDriver (RemoteDriver.java:handle(255)) - Received job request 13ca3a2b-0719-42fb-8201-0a645cea992f
2014-12-15 12:14:04,166 INFO [Driver-RPC-Handler-0]: client.RemoteDriver (RemoteDriver.java:submit(179)) - SparkContext not yet up, queueing job request.
{noformat}

Since I set the timeout to 5s in RemoteHiveSparkClient::getExecutorCount, the root cause here should be that getExecutorCount timed out after 5s because the Spark cluster had not launched yet. I carelessly ignored the Spark cluster launch time previously. I think we should make the timeout value configurable and set the default to a more reasonable value, one that accounts for the Spark cluster launch time. If launching the Spark cluster takes longer than that, it makes sense to fail the current query, and the user should check why it happened. 60s should be enough as a default value, [~vanzin], what do you think?
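To make the proposal concrete, here is a rough sketch of what a configurable timeout around the executor-count future could look like. The property name hive.spark.client.future.timeout, the 60s default, the ExecutorCountTimeoutSketch wrapper class, and the getExecutorCount signature are illustrative assumptions, not the actual patch:
{code}
// Illustrative sketch only (not the actual HIVE-9094 patch): replace the
// hard-coded 5s wait with a timeout read from configuration.
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.hadoop.conf.Configuration;

public class ExecutorCountTimeoutSketch {

  // Hypothetical property name and default; the real patch may differ.
  private static final String TIMEOUT_KEY = "hive.spark.client.future.timeout";
  private static final long DEFAULT_TIMEOUT_SECONDS = 60L;

  private final Configuration conf;

  public ExecutorCountTimeoutSketch(Configuration conf) {
    this.conf = conf;
  }

  /**
   * Waits for the executor count from the remote driver, bounded by a
   * configurable timeout so that Spark cluster launch time is covered.
   */
  public int getExecutorCount(Future<Integer> executorCountFuture) throws Exception {
    long timeoutSeconds = conf.getLong(TIMEOUT_KEY, DEFAULT_TIMEOUT_SECONDS);
    try {
      return executorCountFuture.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      // Fail the current query with a clear message; the user can then check
      // why the Spark cluster took longer than the configured limit to start.
      throw new RuntimeException(
          "Failed to get executor count within " + timeoutSeconds + "s", e);
    }
  }
}
{code}
With something along these lines, a slow cluster launch only fails the query once the configured limit (e.g. the 60s default) is exceeded, and users who need more headroom can raise the value themselves.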
> TimeoutException when trying to get executor count from RSC [Spark Branch]
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-9094
>                 URL: https://issues.apache.org/jira/browse/HIVE-9094
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>
> In http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/532/testReport, join25.q failed because:
> {code}
> 2014-12-12 19:14:50,084 ERROR [main]: ql.Driver (SessionState.java:printError(838)) - FAILED: SemanticException Failed to get spark memory/core info: java.util.concurrent.TimeoutException
> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get spark memory/core info: java.util.concurrent.TimeoutException
>     at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:120)
>     at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:94)
>     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:78)
>     at org.apache.hadoop.hive.ql.lib.ForwardWalker.walk(ForwardWalker.java:79)
>     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:109)
>     at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:134)
>     at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:99)
>     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10202)
>     at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:221)
>     at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:74)
>     at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:221)
>     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:420)
>     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:306)
>     at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1108)
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1045)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1035)
>     at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:199)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:151)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:362)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:297)
>     at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:837)
>     at org.apache.hadoop.hive.cli.TestSparkCliDriver.runTest(TestSparkCliDriver.java:234)
>     at org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join25(TestSparkCliDriver.java:162)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at junit.framework.TestCase.runTest(TestCase.java:176)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
>     at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
>     at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
>     at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
> Caused by: java.util.concurrent.TimeoutException
>     at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:49)
>     at org.apache.hive.spark.client.JobHandleImpl.get(JobHandleImpl.java:74)
>     at org.apache.hive.spark.client.JobHandleImpl.get(JobHandleImpl.java:35)
>     at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.getExecutorCount(RemoteHiveSparkClient.java:92)
>     at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.getMemoryAndCores(SparkSessionImpl.java:77)
>     at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:118)
>     ... 43 more
> {code}
> The timeout was introduced in HIVE-9079. Previously the driver might hang. This seems to be a robustness issue of RSC. Hanging isn't good, but a timeout isn't good either unless there is some network issue, which doesn't seem to be the case here. [~chengxiang li]/[~vanzin], could we get to the bottom of this? Increase the timeout value if necessary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)