[ https://issues.apache.org/jira/browse/HIVE-17908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dere updated HIVE-17908:
------------------------------
    Attachment: HIVE-17908.2.patch

Pre-commit tests never ran. Re-attaching patch.

> LLAP External client not correctly handling killTask for pending requests
> -------------------------------------------------------------------------
>
>                 Key: HIVE-17908
>                 URL: https://issues.apache.org/jira/browse/HIVE-17908
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Jason Dere
>            Assignee: Jason Dere
>         Attachments: HIVE-17908.1.patch, HIVE-17908.2.patch
>
>
> Hitting "Timed out waiting for heartbeat for task ID" errors with the LLAP
> external client.
> HIVE-17393 fixed some of these errors, however it is also occurring because
> the client is not correctly handling the killTask notification when the
> request is accepted but still waiting for the first task heartbeat. In this
> situation the client should retry the request, similar to what the LLAP AM
> does. Current logic is ignoring the killTask in this situation, which results
> in a heartbeat timeout - no heartbeats are sent by LLAP because of the
> killTask notification.
> {noformat}
> 17/08/09 05:36:02 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 14, cn114-10.l42scl.hortonworks.com, executor 5): java.io.IOException: Received reader event error: Timed out waiting for heartbeat for task ID attempt_7739111832518812959_0005_0_00_000010_0
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:178)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:50)
>         at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>         at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>         at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: LlapTaskUmbilicalExternalClient(attempt_7739111832518812959_0005_0_00_000010_0): Error while attempting to read chunk length
>         at org.apache.hadoop.hive.llap.io.ChunkedInputStream.read(ChunkedInputStream.java:82)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.hasInput(LlapBaseRecordReader.java:267)
>         at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:142)
>         ... 22 more
> Caused by: java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
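
For readers following the issue, the behavior the description asks for (retry on killTask while the request is accepted but no heartbeat has arrived yet) can be sketched as a small state machine. This is only an illustration and not the code in the attached patch: the class `KillTaskRetrySketch`, its `State` and `Action` enums, the retry budget, and the choice to fail a task that is already running are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; NOT the actual LlapTaskUmbilicalExternalClient logic. */
final class KillTaskRetrySketch {

    /** Assumed client-side view of one submitted request. */
    enum State { PENDING, RUNNING, FAILED }

    /** What the client decides to do when a killTask notification arrives. */
    enum Action { RETRY, FAIL, IGNORE }

    private final String taskAttemptId;
    private final int maxRetries;            // assumed retry budget
    private int retries = 0;
    private State state = State.PENDING;     // accepted, first heartbeat not yet seen
    private final List<String> events = new ArrayList<>();

    KillTaskRetrySketch(String taskAttemptId, int maxRetries) {
        this.taskAttemptId = taskAttemptId;
        this.maxRetries = maxRetries;
    }

    /** The first heartbeat moves the request from PENDING to RUNNING. */
    void onHeartbeat() {
        if (state == State.PENDING) {
            state = State.RUNNING;
        }
    }

    /** Decide how to react to killTask instead of silently ignoring it. */
    Action onKillTask() {
        switch (state) {
            case PENDING:
                // Request was accepted but never started heartbeating:
                // resubmit rather than wait for a heartbeat timeout.
                if (retries < maxRetries) {
                    retries++;
                    events.add("resubmit " + taskAttemptId + " (retry " + retries + ")");
                    return Action.RETRY;
                }
                state = State.FAILED;
                return Action.FAIL;
            case RUNNING:
                // Assumption for this sketch: a running task that is killed is reported as failed.
                state = State.FAILED;
                return Action.FAIL;
            default:
                return Action.IGNORE;        // already failed, nothing left to do
        }
    }

    State state() { return state; }
    int retries() { return retries; }
    List<String> events() { return events; }

    public static void main(String[] args) {
        KillTaskRetrySketch request = new KillTaskRetrySketch("attempt_example_0", 2);
        System.out.println(request.onKillTask()); // killTask while pending
        request.onHeartbeat();                    // the resubmitted request starts heartbeating
        System.out.println(request.state());
    }
}
```

The point of the sketch is only that a killTask received in the pending state leads to a resubmission instead of being dropped, so the request no longer sits idle until the heartbeat timeout fires.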