Sahil Takiar created HIVE-20512:
-----------------------------------

             Summary: Improve record and memory usage logging in 
SparkRecordHandler
                 Key: HIVE-20512
                 URL: https://issues.apache.org/jira/browse/HIVE-20512
             Project: Hive
          Issue Type: Sub-task
          Components: Spark
            Reporter: Sahil Takiar


We currently log memory usage and the number of records processed in Spark tasks, but the logic that decides how frequently this info is logged could be improved. Currently we use the following code:

{code:java}
private long getNextLogThreshold(long currentThreshold) {
  // A simple counter to keep track of the number of rows processed.
  // Below 1,000,000 rows the threshold grows 10x each time, so logging
  // happens frequently early on; after that it advances by a fixed
  // 1,000,000 rows per log line.
  if (currentThreshold >= 1000000) {
    return currentThreshold + 1000000;
  }
  return 10 * currentThreshold;
}
{code}

The issue is that, because the threshold keeps growing, a long-running task eventually has to process a huge number of records before the next log line is triggered.
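To make the spacing concrete, here is a small standalone demo (the class and {{main}} driver are illustrative, not Hive code) that prints the sequence of log points produced by the threshold function above:

{code:java}
// Illustrates how getNextLogThreshold spaces out log points: thresholds
// grow 10x up to 1,000,000, then advance by a fixed 1,000,000 step.
public class ThresholdDemo {

  static long getNextLogThreshold(long currentThreshold) {
    if (currentThreshold >= 1000000) {
      return currentThreshold + 1000000;
    }
    return 10 * currentThreshold;
  }

  public static void main(String[] args) {
    long t = 1;
    for (int i = 0; i < 10; i++) {
      System.out.print(t + " ");
      t = getNextLogThreshold(t);
    }
    // prints: 1 10 100 1000 10000 100000 1000000 2000000 3000000 4000000
  }
}
{code}

So a reducer that stalls at, say, record 1,500,000 produces no log output between the 1,000,000 and 2,000,000 marks, no matter how long it hangs there.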

A better approach would be to log this info at a fixed time interval. That way a task that is seemingly hung still emits periodic log lines, which makes such tasks much easier to debug.
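A minimal sketch of what time-interval-based logging could look like (class name, getter, and the plain {{System.out}} logging are placeholders for illustration, not the proposed SparkRecordHandler implementation):

{code:java}
// Sketch: log record count and memory usage once per elapsed interval,
// independent of how many records have been processed.
public class IntervalLogger {

  private final long logIntervalMs;
  private long lastLogTimeMs;
  private long recordsProcessed;

  public IntervalLogger(long logIntervalMs) {
    this.logIntervalMs = logIntervalMs;
    this.lastLogTimeMs = System.currentTimeMillis();
  }

  public void recordProcessed() {
    recordsProcessed++;
    long now = System.currentTimeMillis();
    if (now - lastLogTimeMs >= logIntervalMs) {
      long usedMem = Runtime.getRuntime().totalMemory()
          - Runtime.getRuntime().freeMemory();
      System.out.println("processed " + recordsProcessed
          + " records; used memory = " + usedMem + " bytes");
      lastLogTimeMs = now;
    }
  }

  public long getRecordsProcessed() {
    return recordsProcessed;
  }
}
{code}

Checking {{System.currentTimeMillis()}} on every record is cheap relative to record processing, and unlike the threshold scheme the gap between log lines is bounded by wall-clock time rather than by record count.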



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
