I'm trying to load a table using an INSERT query [1], but not all of the data makes it from the original table into the new one. The Hadoop task tracker logs show the query running without error until the very end of the job: it typically takes about 45 minutes, and in its final second a number of IOExceptions appear [2]. The exceptions are caused by temporary Hive files disappearing during a map task.

The INSERT query actually spawns two Hadoop jobs: one that takes the aforementioned ~45 minutes, and a second that takes roughly 10 seconds. Both jobs have the same mapred.job.name and hive.query.string in their job configs. Judging from the task tracker logs, the second job is simply renaming the very temporary files that the first job errors on. According to the Hadoop job tracker the two jobs don't overlap (the second starts immediately after the first completes), yet something's amiss.
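For context, the query has roughly this shape. The real table and column names are omitted here; this is only a minimal sketch, and the month=2012-01 component of the temporary file path makes me assume a dynamic-partition insert is involved:

```sql
-- Hypothetical sketch of the failing query; actual table/column names differ.
-- The month=... partition directory in the temp path suggests dynamic partitioning.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target_table PARTITION (month)
SELECT col_a, col_b, month
FROM source_table;
```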
What's the purpose of the second job? How can I fix this?

Thanks,
Jim Krehl

[1] https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-InsertingdataintoHiveTablesfromqueries

[2] ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-hive/hive_2012-10-15_13-45-21_245_1936216192130095423/_task_tmp.-ext-10002/month=2012-01/_tmp.000000_1
    org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-hive/hive_2012-10-15_13-45-21_245_1936216192130095423/_task_tmp.-ext-10002/month=2012-01/_tmp.000000_1 File does not exist. Holder DFSClient_NONMAPREDUCE_-672101740_1 does not have any open files.
