[jira] [Commented] (HIVE-2156) Improve error messages emitted during task execution

[email protected] (JIRA) Thu, 26 May 2011 11:00:33 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039831#comment-13039831
 ]

[email protected] commented on HIVE-2156:
-----------------------------------------------------

bq.  On 2011-05-24 20:49:24, Ning Zhang wrote:
bq.  > ql/src/java/org/apache/hadoop/hive/ql/exec/HadoopJobExecHelper.java, line 571
bq.  > <https://reviews.apache.org/r/777/diff/2/?file=19556#file19556line571>
bq.  >
bq.  >     Error code -101 is also used in TaskRunner.java to indicate an OOM
bq.  >     exception. We should define all these error codes in a centralized place.
bq.  
bq.  Syed Albiz wrote:
bq.      This was just used as something to initialize exitVal to; that
bq.      specific value should never be returned unless the call to
bq.      runningJob.waitFor() returns the same value. I can change it to
bq.      something else just to avoid the collision, but should we do both the
bq.      consolidation of exit codes and the change to showJobDebugInfo in the
bq.      same patch? They seem like different changes, and consolidating the
bq.      exit codes would require touching several other parts of
bq.      MapredLocalTask, MapRedTask and ExecDriver. Would these changes fit
bq.      better in a separate patch?

Yes, changing it to something else will be fine for now. We should probably 
consider consolidating all error codes into a centralized place in a separate 
JIRA. 
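
As a rough illustration of what a centralized place for exit codes could look 
like, here is a minimal sketch. The enum TaskExitCode and all of its names and 
values are hypothetical; only the use of -101 for OOM comes from the review 
comment above, and nothing here is taken from the Hive code base.

```java
import java.util.Optional;

// Hypothetical central registry of task exit codes. Only -101 (OOM) comes from
// the review discussion; every other name and value here is an assumption.
enum TaskExitCode {
    SUCCESS(0, "task completed"),
    OUT_OF_MEMORY(-101, "task ran out of memory"),
    UNSET(-102, "exit value was never assigned");

    private final int code;
    private final String description;

    TaskExitCode(int code, String description) {
        this.code = code;
        this.description = description;
    }

    int code() {
        return code;
    }

    String description() {
        return description;
    }

    // Look up a raw exit value; empty when the value is not registered.
    static Optional<TaskExitCode> fromCode(int code) {
        for (TaskExitCode candidate : values()) {
            if (candidate.code == code) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Initialize with a dedicated sentinel instead of reusing -101.
        int exitVal = UNSET.code();
        System.out.println(fromCode(exitVal)
                .map(TaskExitCode::description)
                .orElse("unknown exit code " + exitVal));
    }
}
```

With a single definition site, a sentinel used to initialize exitVal cannot 
silently collide with the value that signals OOM.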

bq.  On 2011-05-24 20:49:24, Ning Zhang wrote:
bq.  > ql/src/java/org/apache/hadoop/hive/ql/exec/JobDebugger.java, line 110
bq.  > <https://reviews.apache.org/r/777/diff/2/?file=19557#file19557line110>
bq.  >
bq.  >     Do you have some numbers on how long it takes to get all the
bq.  >     TaskCompletionEvents? There are cases where a job may have more than
bq.  >     10k tasks and all of them failed with the same error.
bq.  >     
bq.  >     If it takes too long you may want to consider adding a threshold to
bq.  >     the time spent in getting all the TaskCompletionEvents.
bq.  
bq.  Syed Albiz wrote:
bq.      I have only tested it on some of the queries in the NegativeCliDriver
bq.      tests, where it usually takes less than 10s running in miniMR cluster
bq.      mode. There is a coarse timeout (default 5 minutes, configurable in
bq.      HiveConf.ConfVars.JOB_DEBUG_TIMEOUT), enforced by HadoopJobExecHelper,
bq.      on getting all TaskCompletionEvents before we stop. It would make
bq.      sense, though, to time out grabbing TaskCompletionEvents specifically
bq.      and then print out the information obtained so far, instead of what
bq.      this patch does, which is to throw away the TaskCompletionEvents
bq.      gathered so far and return the "could not obtain debugging info"
bq.      message. Does that sound reasonable, or do you think the coarse
bq.      timeout would be sufficient?

I think 5 minutes is too long for getting the TaskCompletionEvents. And if the 
timeout happens, we won't get any error message from the task tracker. Can you 
get a sense of how long it takes to get a small number of TaskCompletionEvents 
in a real cluster, and then extrapolate to a large number (say 30k) of mappers? 
If that's too long, we should restrict the time spent fetching 
TaskCompletionEvents to a few seconds and spend some time retrieving the task 
logs. 
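
To make the suggestion concrete, here is a minimal sketch of fetching events 
under a time budget and keeping whatever was gathered when the budget runs out. 
The EventSource interface, the fetchWithBudget method and the injected clock are 
all hypothetical stand-ins for illustration; this is not the Hadoop API and not 
what the patch currently does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.LongSupplier;

final class BoundedEventFetch {

    /** Hypothetical stand-in for a paged job-tracker call; not a Hadoop API. */
    interface EventSource {
        // Returns the next batch starting at startIndex; empty means no more events.
        List<String> fetch(int startIndex);
    }

    /**
     * Collects events until the source is exhausted or the budget is spent,
     * then returns everything gathered so far instead of discarding it.
     */
    static List<String> fetchWithBudget(EventSource source, long budgetMillis,
                                        LongSupplier clockMillis) {
        long deadline = clockMillis.getAsLong() + budgetMillis;
        List<String> collected = new ArrayList<>();
        while (clockMillis.getAsLong() < deadline) {
            List<String> batch = source.fetch(collected.size());
            if (batch.isEmpty()) {
                break; // all events retrieved before the deadline
            }
            collected.addAll(batch);
        }
        return collected;
    }

    public static void main(String[] args) {
        // A source that never runs dry, to force the timeout path.
        EventSource endless = start -> Collections.singletonList("event-" + start);
        List<String> partial = fetchWithBudget(endless, 50, System::currentTimeMillis);
        System.out.println(partial.size() + " events kept after the budget expired");
    }
}
```

The deadline is only checked between batches, so a single slow fetch can still 
overrun the budget; bounding an individual call would need a separate 
mechanism. The clock is passed in so the timeout path can be tested without 
sleeping.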

- Ning

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/777/#review711
-----------------------------------------------------------

On 2011-05-24 04:29:32, Syed Albiz wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/777/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-05-24 04:29:32)
bq.  
bq.  
bq.  Review request for hive and John Sichi.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  - Add local error messages to point to job logs and provide TaskIDs
bq.  - Add a timeout to the fetching of task logs and errors
bq.  
bq.  
bq.  This addresses bug HIVE-2156.
bq.      https://issues.apache.org/jira/browse/HIVE-2156
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    build-common.xml 00c3680 
bq.    common/src/java/org/apache/hadoop/hive/conf/HiveConf.java dc96a1f 
bq.    conf/hive-default.xml 159d825 
bq.    ql/build.xml 449b47a 
bq.    ql/src/java/org/apache/hadoop/hive/ql/exec/HadoopJobExecHelper.java 4717c25 
bq.    ql/src/java/org/apache/hadoop/hive/ql/exec/JobDebugger.java PRE-CREATION 
bq.    ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 53769a0 
bq.    ql/src/java/org/apache/hadoop/hive/ql/exec/MapredLocalTask.java 691f038 
bq.    ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 9cb407c 
bq.    ql/src/test/queries/clientnegative/minimr_broken_pipe.q PRE-CREATION 
bq.    ql/src/test/results/clientnegative/dyn_part3.q.out 5f4df65 
bq.    ql/src/test/results/clientnegative/minimr_broken_pipe.q.out PRE-CREATION 
bq.    ql/src/test/results/clientnegative/script_broken_pipe1.q.out d33d2cc 
bq.    ql/src/test/results/clientnegative/script_broken_pipe2.q.out afbaa44 
bq.    ql/src/test/results/clientnegative/script_broken_pipe3.q.out fe8f757 
bq.    ql/src/test/results/clientnegative/script_error.q.out c72d780 
bq.    ql/src/test/results/clientnegative/udf_reflect_neg.q.out f2082a3 
bq.    ql/src/test/results/clientnegative/udf_test_error.q.out 5fd9a00 
bq.    ql/src/test/results/clientnegative/udf_test_error_reduce.q.out ddc5e5b 
bq.    ql/src/test/templates/TestNegativeCliDriver.vm ec13f79 
bq.  
bq.  Diff: https://reviews.apache.org/r/777/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Tested TestNegativeCliDriver in both local and miniMR mode
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Syed
bq.  
bq.

> Improve error messages emitted during task execution
> ----------------------------------------------------
>
>                 Key: HIVE-2156
>                 URL: https://issues.apache.org/jira/browse/HIVE-2156
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Syed S. Albiz
>            Assignee: Syed S. Albiz
>         Attachments: HIVE-2156.1.patch, HIVE-2156.2.patch
>
>
> Follow-up to HIVE-1731
> A number of issues were related to reporting errors from task execution and 
> surfacing these in a more useful form.
> Currently a cryptic message with "Execution Error" and a return code and 
> class name of the task is emitted.
> The most useful log messages here are emitted to the local logs, which can be 
> found through jobtracker. Having either a pointer to these logs as part of 
> the error message or the actual content would improve the usefulness 
> substantially. It may also warrant looking into how the underlying error 
> reporting through Hadoop is done and if more information can be propagated up 
> from there.
> Specific issues raised in  HIVE-1731:
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.MapRedTask
> * issue was in regexp_extract syntax
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> * tried: desc table_does_not_exist;

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
