[ https://issues.apache.org/jira/browse/HIVE-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039831#comment-13039831 ]
jirapos...@reviews.apache.org commented on HIVE-2156:
-----------------------------------------------------

bq. On 2011-05-24 20:49:24, Ning Zhang wrote:
bq. > ql/src/java/org/apache/hadoop/hive/ql/exec/HadoopJobExecHelper.java, line 571
bq. > <https://reviews.apache.org/r/777/diff/2/?file=19556#file19556line571>
bq. >
bq. > Error code -101 is also used in TaskRunner.java to indicate an OOM exception. We should define all these error codes in a centralized place.
bq.
bq. Syed Albiz wrote:
bq. This was just used as something to initialize exitVal to; that specific value should never be returned unless the call to runningJob.waitFor() returns the same value. I can change it to something else just to avoid the collision, but should we do both the consolidation of exit codes and the change to showJobDebugInfo in the same patch? They seem like different changes, and consolidating the exit codes would require touching several other parts of MapredLocalTask, MapRedTask and ExecDriver. Would these changes fit better in a separate patch?

Yes, changing it to something else will be fine for now. We should probably consider consolidating all error codes into a centralized place in a separate JIRA.

bq. On 2011-05-24 20:49:24, Ning Zhang wrote:
bq. > ql/src/java/org/apache/hadoop/hive/ql/exec/JobDebugger.java, line 110
bq. > <https://reviews.apache.org/r/777/diff/2/?file=19557#file19557line110>
bq. >
bq. > Do you have some numbers on how long it takes to get all the TaskCompletionEvents? There are cases where a job may have more than 10k tasks and all of them failed with the same error.
bq. >
bq. > If it takes too long, you may want to consider adding a threshold on the time spent getting all the TaskCompletionEvents.
bq.
bq. Syed Albiz wrote:
bq. I have only tested it on some of the queries in the NegativeCliDriver tests, where it usually takes <10s running in miniMR cluster mode.
bq. There is a coarse timeout (default 5 minutes, configurable in HiveConf.ConfVars.JOB_DEBUG_TIMEOUT), enforced by HadoopJobExecHelper, on getting all TaskCompletionEvents before we stop. It would make sense, though, to time out the fetching of TaskCompletionEvents specifically and then print the information obtained so far, instead of what this patch does, which is to throw away the TaskCompletionEvents gathered so far and return the "could not obtain debugging info" message. Does that sound reasonable, or do you think the coarse timeout would be sufficient?

I think 5 mins is too long for getting the TaskCompletionEvents. And if the timeout happens, we won't get any error message from the task tracker. Can you get a sense of how long it takes to get a small number of TaskCompletionEvents in a real cluster, and then extrapolate to a large number (say 30k) of mappers? If that's too long, we should restrict the time spent fetching TaskCompletionEvents to a few seconds and spend some time retrieving the task logs.

- Ning

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/777/#review711
-----------------------------------------------------------

On 2011-05-24 04:29:32, Syed Albiz wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/777/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-05-24 04:29:32)
bq.
bq.
bq. Review request for hive and John Sichi.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. - Add local error messages to point to job logs and provide TaskIDs
bq. - Add a timeout to the fetching of task logs and errors
bq.
bq.
bq. This addresses bug HIVE-2156.
bq. https://issues.apache.org/jira/browse/HIVE-2156
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. build-common.xml 00c3680
bq. common/src/java/org/apache/hadoop/hive/conf/HiveConf.java dc96a1f
bq. conf/hive-default.xml 159d825
bq. ql/build.xml 449b47a
bq. ql/src/java/org/apache/hadoop/hive/ql/exec/HadoopJobExecHelper.java 4717c25
bq. ql/src/java/org/apache/hadoop/hive/ql/exec/JobDebugger.java PRE-CREATION
bq. ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0
bq. ql/src/java/org/apache/hadoop/hive/ql/exec/MapredLocalTask.java 691f038
bq. ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 9cb407c
bq. ql/src/test/queries/clientnegative/minimr_broken_pipe.q PRE-CREATION
bq. ql/src/test/results/clientnegative/dyn_part3.q.out 5f4df65
bq. ql/src/test/results/clientnegative/minimr_broken_pipe.q.out PRE-CREATION
bq. ql/src/test/results/clientnegative/script_broken_pipe1.q.out d33d2cc
bq. ql/src/test/results/clientnegative/script_broken_pipe2.q.out afbaa44
bq. ql/src/test/results/clientnegative/script_broken_pipe3.q.out fe8f757
bq. ql/src/test/results/clientnegative/script_error.q.out c72d780
bq. ql/src/test/results/clientnegative/udf_reflect_neg.q.out f2082a3
bq. ql/src/test/results/clientnegative/udf_test_error.q.out 5fd9a00
bq. ql/src/test/results/clientnegative/udf_test_error_reduce.q.out ddc5e5b
bq. ql/src/test/templates/TestNegativeCliDriver.vm ec13f79
bq.
bq. Diff: https://reviews.apache.org/r/777/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. Tested TestNegativeCliDriver in both local and miniMR mode
bq.
bq.
bq. Thanks,
bq.
bq. Syed
bq.
bq.

> Improve error messages emitted during task execution
> ----------------------------------------------------
>
> Key: HIVE-2156
> URL: https://issues.apache.org/jira/browse/HIVE-2156
> Project: Hive
> Issue Type: Improvement
> Reporter: Syed S. Albiz
> Assignee: Syed S. Albiz
> Attachments: HIVE-2156.1.patch, HIVE-2156.2.patch
>
>
> Follow-up to HIVE-1731
> A number of issues were related to reporting errors from task execution and
> surfacing these in a more useful form.
> Currently a cryptic message with "Execution Error" and a return code and
> class name of the task is emitted.
> The most useful log messages here are emitted to the local logs, which can be
> found through jobtracker. Having either a pointer to these logs as part of
> the error message or the actual content would improve the usefulness
> substantially. It may also warrant looking into how the underlying error
> reporting through Hadoop is done and if more information can be propagated up
> from there.
> Specific issues raised in HIVE-1731:
> FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.MapRedTask
> * issue was in regexp_extract syntax
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask
> * tried: desc table_does_not_exist;
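
The behaviour proposed in the review thread (stop fetching TaskCompletionEvents after a short time budget and report whatever has been gathered so far, rather than discarding it) could be sketched roughly as below. This is an illustration under stated assumptions, not the code in the HIVE-2156 patch: BoundedEventFetch, Result, fetch and the injected clock are hypothetical names, and the batch source merely stands in for a paged call such as Hadoop's RunningJob.getTaskCompletionEvents(int startFrom).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration only; not part of Hive or Hadoop. */
final class BoundedEventFetch {

    /** Events gathered so far, plus whether the time budget cut the fetch short. */
    static final class Result<T> {
        final List<T> events;
        final boolean timedOut;

        Result(List<T> events, boolean timedOut) {
            this.events = events;
            this.timedOut = timedOut;
        }
    }

    /**
     * Pulls batches of events until the source returns an empty batch or the
     * budget is used up. On timeout the partial list is returned, not dropped.
     *
     * @param fetchBatch   returns the next batch, given the number of events already fetched
     * @param budgetMillis maximum time to spend starting new batch requests
     * @param clockMillis  time source in milliseconds (injected so it can be faked in tests)
     */
    static <T> Result<T> fetch(IntFunction<List<T>> fetchBatch,
                               long budgetMillis,
                               LongSupplier clockMillis) {
        long deadline = clockMillis.getAsLong() + budgetMillis;
        List<T> collected = new ArrayList<>();
        // The deadline is checked between batches only; a batch request that is
        // already in flight is not interrupted by this sketch.
        while (clockMillis.getAsLong() < deadline) {
            List<T> batch = fetchBatch.apply(collected.size());
            if (batch == null || batch.isEmpty()) {
                return new Result<>(collected, false); // source exhausted
            }
            collected.addAll(batch);
        }
        return new Result<>(collected, true); // budget exhausted, keep partial data
    }

    public static void main(String[] args) {
        // Fake source: five events, served one per request.
        IntFunction<List<String>> fakeSource = from -> from < 5
                ? Arrays.asList("event-" + from)
                : Collections.<String>emptyList();
        Result<String> r = fetch(fakeSource, 2000L, System::currentTimeMillis);
        System.out.println(r.events.size() + " events, timedOut=" + r.timedOut);
    }
}
```

With a result of this shape, the caller can still print the task IDs and log pointers for the events it did receive, and use the timedOut flag to note that the list may be incomplete.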