[ 
https://issues.apache.org/jira/browse/HIVE-13288?focusedWorklogId=822458&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-822458
 ]

ASF GitHub Bot logged work on HIVE-13288:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Nov/22 20:22
            Start Date: 01/Nov/22 20:22
    Worklog Time Spent: 10m 
      Work Description: cnauroth opened a new pull request, #3722:
URL: https://github.com/apache/hive/pull/3722

   ### What changes were proposed in this pull request?
   
   Improve logging in `DagUtils#localizeResource` to clarify root cause for why 
localizing a resource has failed.
   
   ### Why are the changes needed?
   
   While creating a Tez session, `DagUtils#localizeResource` is responsible for 
copying the client's hive-exec.jar into HDFS (`hive.jar.directory`). This 
process can be triggered from multiple threads concurrently, in which case one 
thread performs the copy while the others wait, polling for arrival of the 
destination file.
   
   If there is an `IOException` during this process, it's assumed that the 
thread attempting the write failed, and all others abort. No information about 
the underlying `IOException` is logged. Instead, the log states "previous 
writer likely failed to write." In some cases though, the `IOException` can 
occur on a polling thread for reasons unrelated to what happened in a writing 
thread. For example, in a production incident, the root cause was really that 
an external process had corrupted the copy of hive-exec.jar in 
`hive.jar.directory`, causing failure of the file length validation check in 
`DagUtils#checkPreExisting`. Since the logs didn't say anything about this, it 
made it much more difficult to troubleshoot.
   
   This patch clarifies the logging by stating that a failure on the writing 
thread is just one possible reason for the error. It also logs the exception 
stack trace to make it easier to find the real root cause. This is a patch I 
ran to help recover from the production incident.
   
   ### Does this PR introduce _any_ user-facing change?
   
   There is no behavior change, but it does change the logging output.
   
   ### How was this patch tested?
   
   This patch was deployed as part of resolving the production incident that I 
mentioned. I was also able to create a reproduction in a test environment by 
externally overwriting the hive-exec.jar in `hive.jar.directory` to simulate 
the production incident.




Issue Time Tracking
-------------------

            Worklog Id:     (was: 822458)
    Remaining Estimate: 0h
            Time Spent: 10m

> Confusing exception message in DagUtils.localizeResource
> --------------------------------------------------------
>
>                 Key: HIVE-13288
>                 URL: https://issues.apache.org/jira/browse/HIVE-13288
>             Project: Hive
>          Issue Type: Improvement
>          Components: Clients
>    Affects Versions: 1.2.1
>            Reporter: Jeff Zhang
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I got the following exception when query through hive server2. And check the 
> source code, it it due to some error when copying data from local to hdfs. 
> But the IOException is ignored and assume that it is due to another thread is 
> also writing. I don't think it make sense to assume that, at least should log 
> the IOException. 
> {code}
> LOG.info("Localizing resource because it does not exist: " + src + " to dest: 
> " + dest);
>       try {
>         destFS.copyFromLocalFile(false, false, src, dest);
>       } catch (IOException e) {
>         LOG.info("Looks like another thread is writing the same file will 
> wait.");
>         int waitAttempts =
>             
> conf.getInt(HiveConf.ConfVars.HIVE_LOCALIZE_RESOURCE_NUM_WAIT_ATTEMPTS.varname,
>                 
> HiveConf.ConfVars.HIVE_LOCALIZE_RESOURCE_NUM_WAIT_ATTEMPTS.defaultIntVal);
>         long sleepInterval = HiveConf.getTimeVar(
>             conf, HiveConf.ConfVars.HIVE_LOCALIZE_RESOURCE_WAIT_INTERVAL,
>             TimeUnit.MILLISECONDS);
>         LOG.info("Number of wait attempts: " + waitAttempts + ". Wait 
> interval: "
>             + sleepInterval);
>         boolean found = false;
> {code}
> {noformat}
> 2016-03-15 11:25:39,921 INFO  [HiveServer2-Background-Pool: Thread-249]: 
> tez.DagUtils (DagUtils.java:getHiveJarDirectory(876)) - Jar dir is 
> null/directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/jeff/.hiveJars
> 2016-03-15 11:25:40,058 INFO  [HiveServer2-Background-Pool: Thread-249]: 
> tez.DagUtils (DagUtils.java:localizeResource(952)) - Localizing resource 
> because it does not exist: 
> file:/usr/hdp/2.3.2.0-2950/hive/lib/hive-exec-1.2.1.2.3.2.0-2950.jar to dest: 
> hdfs://sandbox.hortonworks.com:8020/user/jeff/.hiveJars/hive-exec-1.2.1.2.3.2.0-2950-a97c953db414a4f792d868e2b0417578a61ccfa368048016926117b641b07f34.jar
> 2016-03-15 11:25:40,063 INFO  [HiveServer2-Background-Pool: Thread-249]: 
> tez.DagUtils (DagUtils.java:localizeResource(956)) - Looks like another 
> thread is writing the same file will wait.
> 2016-03-15 11:25:40,064 INFO  [HiveServer2-Background-Pool: Thread-249]: 
> tez.DagUtils (DagUtils.java:localizeResource(963)) - Number of wait attempts: 
> 5. Wait interval: 5000
> 2016-03-15 11:25:53,548 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> thrift.ThriftCLIService (ThriftCLIService.java:OpenSession(294)) - Client 
> protocol version: HIVE_CLI_SERVICE_PROTOCOL_V8
> 2016-03-15 11:25:53,548 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> metastore.HiveMetaStore (HiveMetaStore.java:logInfo(747)) - 1: Shutting down 
> the object store...
> 2016-03-15 11:25:53,549 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(372)) - 
> ugi=hive/sandbox.hortonworks....@example.com       ip=unknown-ip-addr      
> cmd=Shutting down the object store...
> 2016-03-15 11:25:53,549 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> metastore.HiveMetaStore (HiveMetaStore.java:logInfo(747)) - 1: Metastore 
> shutdown complete.
> 2016-03-15 11:25:53,549 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(372)) - 
> ugi=hive/sandbox.hortonworks....@example.com       ip=unknown-ip-addr      
> cmd=Metastore shutdown complete.
> 2016-03-15 11:25:53,573 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> session.SessionState (SessionState.java:createPath(641)) - Created local 
> directory: /tmp/e43fbaab-a659-4331-90cb-0ea0b2098e25_resources
> 2016-03-15 11:25:53,577 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> session.SessionState (SessionState.java:createPath(641)) - Created HDFS 
> directory: /tmp/hive/ambari-qa/e43fbaab-a659-4331-90cb-0ea0b2098e25
> 2016-03-15 11:25:53,582 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> session.SessionState (SessionState.java:createPath(641)) - Created local 
> directory: /tmp/hive/e43fbaab-a659-4331-90cb-0ea0b2098e25
> 2016-03-15 11:25:53,587 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> session.SessionState (SessionState.java:createPath(641)) - Created HDFS 
> directory: 
> /tmp/hive/ambari-qa/e43fbaab-a659-4331-90cb-0ea0b2098e25/_tmp_space.db
> 2016-03-15 11:25:53,592 INFO  [HiveServer2-Handler-Pool: Thread-48]: 
> session.HiveSessionImpl (HiveSessionImpl.java:setOperationLogSessionDir(236)) 
> - Operation log session directory is created: 
> /home/hive/${system:java.io.tmpdir}/${system:user.name}/operation_logs/e43fbaab-a659-4331-90cb-0ea0b2098e25
> 2016-03-15 11:26:05,077 ERROR [HiveServer2-Background-Pool: Thread-249]: 
> tez.DagUtils (DagUtils.java:localizeResource(979)) - Could not find the jar 
> that was being uploaded
> 2016-03-15 11:26:05,077 ERROR [HiveServer2-Background-Pool: Thread-249]: 
> exec.Task (TezTask.java:execute(198)) - Failed to execute tez graph.
> java.io.IOException: Previous writer likely failed to write 
> hdfs://sandbox.hortonworks.com:8020/user/jeff/.hiveJars/hive-exec-1.2.1.2.3.2.0-2950-a97c953db414a4f792d868e2b0417578a61ccfa368048016926117b641b07f34.jar.
>  Failing because I am unlikely to write too.
>       at 
> org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:980)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.createJarLocalResource(TezSessionState.java:350)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:152)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:257)
>       at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:140)
>       at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>       at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
>       at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1655)
>       at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1414)
>       at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1195)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1054)
>       at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
>       at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
>       at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>       at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to