[jira] [Work logged] (HIVE-24936) Fix file name parsing and copy file move.

ASF GitHub Bot (Jira) Thu, 20 May 2021 01:48:06 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-24936?focusedWorklogId=599680&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-599680
 ]


ASF GitHub Bot logged work on HIVE-24936:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/May/21 08:47
            Start Date: 20/May/21 08:47
    Worklog Time Spent: 10m 
      Work Description: kishendas commented on a change in pull request #2120:
URL: https://github.com/apache/hive/pull/2120#discussion_r635874792



##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractFileMergeOperator.java
##########
@@ -275,39 +275,33 @@ public void closeOp(boolean abort) throws HiveException {
           throw new HiveException("Incompatible files should not happen in MM 
tables.");
         }
         Path destDir = finalPath.getParent();
-        Path destPath = destDir;
         // move any incompatible files to final path
         if (incompatFileSet != null && !incompatFileSet.isEmpty()) {
           for (Path incompatFile : incompatFileSet) {
-            // check if path conforms to Hive's file name convention. Hive 
expects filenames to be in specific format
+            // Hive expects filenames to be in specific format
             // like 000000_0, but "LOAD DATA" commands can let you add any 
files to any partitions/tables without
-            // renaming. This can cause MoveTask to remove files in some cases 
where MoveTask assumes the files are
-            // are generated by speculatively executed tasks.
+            // renaming.
+            // This can cause a few issues:
+            // MoveTask will remove files in some cases where MoveTask assumes 
the files are are generated by

Review comment:
       files are generated by

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractFileMergeOperator.java
##########
@@ -275,39 +275,33 @@ public void closeOp(boolean abort) throws HiveException {
           throw new HiveException("Incompatible files should not happen in MM 
tables.");
         }
         Path destDir = finalPath.getParent();
-        Path destPath = destDir;
         // move any incompatible files to final path
         if (incompatFileSet != null && !incompatFileSet.isEmpty()) {
           for (Path incompatFile : incompatFileSet) {
-            // check if path conforms to Hive's file name convention. Hive 
expects filenames to be in specific format
+            // Hive expects filenames to be in specific format
             // like 000000_0, but "LOAD DATA" commands can let you add any 
files to any partitions/tables without
-            // renaming. This can cause MoveTask to remove files in some cases 
where MoveTask assumes the files are
-            // are generated by speculatively executed tasks.
+            // renaming.
+            // This can cause a few issues:
+            // MoveTask will remove files in some cases where MoveTask assumes 
the files are are generated by
+            // speculatively executed tasks.
             // Example: MoveTask thinks the following files are same
             // part-m-00000_1417075294718
             // part-m-00001_1417075294718
             // Assumes 1417075294718 as taskId and retains only large file 
supposedly generated by speculative execution.
-            // This can result in data loss in case of CONCATENATE/merging. 
Filter out files that does not match Hive's
-            // filename convention.
-            if (!Utilities.isHiveManagedFile(incompatFile)) {
-              // rename un-managed files to conform to Hive's naming standard
-              // Example:
-              // /warehouse/table/part-m-00000_1417075294718 will get renamed 
to /warehouse/table/.hive-staging/000000_0
-              // If staging directory already contains the file, taskId_copy_N 
naming will be used.
-              final String taskId = Utilities.getTaskId(jc);
-              Path destFilePath = new Path(destDir, new Path(taskId));
-              for (int counter = 1; fs.exists(destFilePath); counter++) {
-                destFilePath = new Path(destDir, taskId + 
(Utilities.COPY_KEYWORD + counter));
-              }
-              LOG.warn("Path doesn't conform to Hive's expectation. Renaming 
{} to {}", incompatFile, destFilePath);
-              destPath = destFilePath;
-            }
+            // This can result in data loss in case of CONCATENATE/merging.
 
+            // If filename is consistent with XXXXXX_N and another task with 
same task-id runs after this move, then
+            // the same file name is used in the other task which will result 
in task failure and retry of task and
+            // subsequent removal of this file as duplicate.
+            // Example: if the file name is 000001_0 and another task runs 
with taskid 000001_0, it will fail to create
+            // the file and next attempt will create 000001_1, both the files 
will be considered as output of same task
+            // and only 000001_1 will be picked resulting it loss of existing 
file 000001_0.

Review comment:
       resulting in loss of 

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/AbstractFileMergeOperator.java
##########
@@ -275,39 +275,33 @@ public void closeOp(boolean abort) throws HiveException {
           throw new HiveException("Incompatible files should not happen in MM 
tables.");
         }
         Path destDir = finalPath.getParent();
-        Path destPath = destDir;
         // move any incompatible files to final path
         if (incompatFileSet != null && !incompatFileSet.isEmpty()) {
           for (Path incompatFile : incompatFileSet) {
-            // check if path conforms to Hive's file name convention. Hive 
expects filenames to be in specific format
+            // Hive expects filenames to be in specific format
             // like 000000_0, but "LOAD DATA" commands can let you add any 
files to any partitions/tables without
-            // renaming. This can cause MoveTask to remove files in some cases 
where MoveTask assumes the files are
-            // are generated by speculatively executed tasks.
+            // renaming.
+            // This can cause a few issues:
+            // MoveTask will remove files in some cases where MoveTask assumes 
the files are are generated by
+            // speculatively executed tasks.
             // Example: MoveTask thinks the following files are same
             // part-m-00000_1417075294718
             // part-m-00001_1417075294718
             // Assumes 1417075294718 as taskId and retains only large file 
supposedly generated by speculative execution.
-            // This can result in data loss in case of CONCATENATE/merging. 
Filter out files that does not match Hive's
-            // filename convention.
-            if (!Utilities.isHiveManagedFile(incompatFile)) {
-              // rename un-managed files to conform to Hive's naming standard
-              // Example:
-              // /warehouse/table/part-m-00000_1417075294718 will get renamed 
to /warehouse/table/.hive-staging/000000_0
-              // If staging directory already contains the file, taskId_copy_N 
naming will be used.
-              final String taskId = Utilities.getTaskId(jc);
-              Path destFilePath = new Path(destDir, new Path(taskId));
-              for (int counter = 1; fs.exists(destFilePath); counter++) {
-                destFilePath = new Path(destDir, taskId + 
(Utilities.COPY_KEYWORD + counter));
-              }
-              LOG.warn("Path doesn't conform to Hive's expectation. Renaming 
{} to {}", incompatFile, destFilePath);
-              destPath = destFilePath;
-            }
+            // This can result in data loss in case of CONCATENATE/merging.
 
+            // If filename is consistent with XXXXXX_N and another task with 
same task-id runs after this move, then
+            // the same file name is used in the other task which will result 
in task failure and retry of task and
+            // subsequent removal of this file as duplicate.
+            // Example: if the file name is 000001_0 and another task runs 
with taskid 000001_0, it will fail to create
+            // the file and next attempt will create 000001_1, both the files 
will be considered as output of same task
+            // and only 000001_1 will be picked resulting it loss of existing 
file 000001_0.
+            final String destFileName = Utilities.getTaskId(jc) + 
Utilities.COPY_KEYWORD + 1;

Review comment:
       Will appending "+1" work in all cases?
   What happens if concat partially succeeds and another file with the same 
name shows up after the previous one has already been moved?
   If we run concat again, won't that result in data loss?
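
   A minimal sketch of the alternative this question hints at: probe for the 
first unused copy index instead of always using 1. The class and method names 
are hypothetical, COPY_KEYWORD is assumed to be "_copy_" (matching the 
taskId_copy_N naming in the removed lines), and a Predicate stands in for 
fs.exists so the sketch runs without Hadoop.

```java
import java.util.function.Predicate;

public class CopyNameSketch {
  // Assumption: stands in for Utilities.COPY_KEYWORD from the diff.
  static final String COPY_KEYWORD = "_copy_";

  /**
   * Returns the first name of the form taskId_copy_N (N = 1, 2, ...) for which
   * the exists check returns false. The predicate is a hypothetical stand-in
   * for a file system lookup such as fs.exists(new Path(destDir, name)).
   */
  static String nextFreeCopyName(String taskId, Predicate<String> exists) {
    int counter = 1;
    String candidate = taskId + COPY_KEYWORD + counter;
    while (exists.test(candidate)) {
      counter++;
      candidate = taskId + COPY_KEYWORD + counter;
    }
    return candidate;
  }
}
```

   Note that a check followed by a rename is not atomic, so this only narrows 
the window for a collision; it does not by itself rule one out when several 
tasks write to the same directory.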

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
##########
@@ -1093,40 +1093,69 @@ public static void rename(FileSystem fs, Path src, Path 
dst) throws IOException,
     }
   }
 
-  private static void moveFile(FileSystem fs, FileStatus file, Path dst) 
throws IOException,
+  private static void moveFileOrDir(FileSystem fs, FileStatus file, Path dst) 
throws IOException,

Review comment:
       Can we add tests to cover this method as well?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ParsedOutputFileName.java
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+/**
+ * Helper class to match hive filenames and extract taskId, taskAttemptId, 
copyIndex.
+ *
+ * Matches following:
+ * 00001_02
+ * 00001_02.gz
+ * 00001_02.zlib.gz
+ * 00001_02_copy_1
+ * 00001_02_copy_1.gz
+ * <p>
+ * All the components are here:
+ * tmp_(taskPrefix)00001_02_copy_1.zlib.gz
+ */
+public class ParsedOutputFileName {
+  private static final Pattern COPY_FILE_NAME_TO_TASK_ID_REGEX = 
Pattern.compile(
+      "^(.*?)?" + // any prefix
+      "(\\(.*\\))?" + // taskId prefix
+      "([0-9]+)" + // taskId
+      "(?:_([0-9]{1,6}))?" + // _<attemptId> (limited to 6 digits)

Review comment:
       Can attemptId be more than 6 digits in the future? Is it possible to 
refer to the regex for attemptId from the place where it's generated?
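
   One possible shape for this, as a sketch only: keep the attempt-id width in 
a single constant that both the name generator and the parser refer to. The 
pattern below is a simplified reconstruction from the Javadoc examples in the 
hunk, not the PR's actual regex (which is cut off in the quote above). 
NameParseSketch and MAX_ATTEMPT_DIGITS are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class NameParseSketch {
  // Hypothetical shared constant; the generating code would reference it too.
  static final int MAX_ATTEMPT_DIGITS = 6;

  private static final Pattern NAME = Pattern.compile(
      "^(.*?)" +                                      // any prefix, e.g. "tmp_"
      "(\\(.*\\))?" +                                 // optional task prefix in parentheses
      "([0-9]+)" +                                    // taskId
      "(?:_([0-9]{1," + MAX_ATTEMPT_DIGITS + "}))?" + // optional _attemptId
      "(?:_copy_([0-9]+))?" +                         // optional _copy_N
      "(\\..*)?$");                                   // optional extension(s)

  final String taskId;
  final String attemptId; // null when the name carries no attempt id
  final String copyIndex; // null when the name is not a copy file

  private NameParseSketch(String taskId, String attemptId, String copyIndex) {
    this.taskId = taskId;
    this.attemptId = attemptId;
    this.copyIndex = copyIndex;
  }

  /** Returns the parsed components, or null if the name does not match. */
  static NameParseSketch parse(String fileName) {
    Matcher m = NAME.matcher(fileName);
    if (!m.matches()) {
      return null;
    }
    return new NameParseSketch(m.group(3), m.group(4), m.group(5));
  }
}
```

   With this sketch, parsing tmp_(taskPrefix)00001_02_copy_1.zlib.gz gives 
taskId 00001, attemptId 02 and copyIndex 1, and changing the width later means 
touching one constant.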




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 599680)
    Time Spent: 0.5h  (was: 20m)

> Fix file name parsing and copy file move.
> -----------------------------------------
>
>                 Key: HIVE-24936
>                 URL: https://issues.apache.org/jira/browse/HIVE-24936
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Harish JP
>            Assignee: Harish JP
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The taskId and taskAttemptId are not extracted correctly for copy files 
> (00001_02_copy_3), and when an incompatible copy file is moved, the rename 
> utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to 
> 00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be 
> 00001_02_copy_N.
>  
> Incompatible files should always be renamed using the current task; otherwise 
> they can get deleted if the file name conflicts with another task's output 
> file. Ex: if the input file for a task is named 00005_01 and is incompatible, 
> then moving it as-is makes it look like the output of task id 5, attempt 1. 
> If that task attempt runs, it will try to create the same file and fail, and 
> another attempt will be made. There will then be 2 files, 00005_01 and 
> 00005_02, and the deduping code will remove 00005_01, resulting in data loss. 
> There are other scenarios where the same can happen.
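
The data-loss scenario in the description can be illustrated with a toy 
dedup step. This is a sketch under stated assumptions, not Hive's actual 
deduping code: it assumes names are exactly taskId_attemptId and that the 
highest attempt per task id wins. DedupSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DedupSketch {
  // Extracts the numeric attempt id from a name of the form taskId_attemptId.
  static int attempt(String name) {
    return Integer.parseInt(name.split("_", 2)[1]);
  }

  /**
   * Keeps, for each task id, only the file with the highest attempt id.
   * Assumption: this is a simplification; a real rule may also weigh file size.
   */
  static List<String> dedup(List<String> names) {
    Map<String, String> best = new TreeMap<>();
    for (String name : names) {
      String taskId = name.split("_", 2)[0];
      String current = best.get(taskId);
      if (current == null || attempt(name) > attempt(current)) {
        best.put(taskId, name);
      }
    }
    return new ArrayList<>(best.values());
  }
}
```

If 00005_01 is really a moved input file and 00005_02 is the retried task's 
output, a rule like this drops 00005_01 even though the two files hold 
different data.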



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
