[ https://issues.apache.org/jira/browse/HIVE-24936?focusedWorklogId=599721&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-599721 ]
ASF GitHub Bot logged work on HIVE-24936:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/May/21 10:28
            Start Date: 20/May/21 10:28
    Worklog Time Spent: 10m
      Work Description:
harishjp commented on a change in pull request #2120:
URL: https://github.com/apache/hive/pull/2120#discussion_r635975315

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ParsedOutputFileName.java
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+/**
+ * Helper class to match hive filenames and extract taskId, taskAttemptId, copyIndex.
+ *
+ * Matches following:
+ * 00001_02
+ * 00001_02.gz
+ * 00001_02.zlib.gz
+ * 00001_02_copy_1
+ * 00001_02_copy_1.gz
+ * <p>
+ * All the components are here:
+ * tmp_(taskPrefix)00001_02_copy_1.zlib.gz
+ */
+public class ParsedOutputFileName {
+  private static final Pattern COPY_FILE_NAME_TO_TASK_ID_REGEX = Pattern.compile(
+      "^(.*?)?" + // any prefix
+      "(\\(.*\\))?" + // taskId prefix
+      "([0-9]+)" + // taskId
+      "(?:_([0-9]{1,6}))?" + // _<attemptId> (limited to 6 digits)

Review comment:
No idea; this was refactored from the existing pattern, which assumed 6 digits, so I kept it the same.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id: (was: 599721)
    Time Spent: 50m (was: 40m)

> Fix file name parsing and copy file move.
> -----------------------------------------
>
>                 Key: HIVE-24936
>                 URL: https://issues.apache.org/jira/browse/HIVE-24936
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Harish JP
>            Assignee: Harish JP
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The taskId and taskAttemptId are not extracted correctly for copy files
> (00001_02_copy_3), and when an incompatible copy file is moved, the
> rename utility generates wrong file names. Ex: 00001_02_copy_3 is renamed to
> 00001_02_copy_3_1 if 00001_02_copy_3 already exists; ideally it should be
> 00001_02_copy_N.
>
> Incompatible files should always be renamed using the current task, or they
> can get deleted if the file name conflicts with another task's output file.
> Ex: if the input file name for a task is 00005_01 and it is incompatible,
> then when we move this file it will be treated as an output file for task id
> 5, attempt 1. If that task exists, it will try to generate the same file and
> fail, and another attempt will be made. There will then be 2 files, 00005_01
> and 00005_02, and the deduping code will remove 00005_01, resulting in data
> loss. There are other scenarios where the same can happen.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
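
For readers following the issue description, here is a rough sketch of the idea in Java: split a file name into its components and, on a name conflict, bump the copy index instead of appending another suffix. This is not the code from the pull request. The class name OutputFileNameSketch, its methods, and the last two regex groups (the _copy_<N> part and the extension) are assumptions made for illustration, since the quoted diff is cut off before that point; only the first four groups mirror the pattern shown in the review.

```java
import java.util.Collections;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the ParsedOutputFileName class from the pull request. */
final class OutputFileNameSketch {
  private static final Pattern NAME = Pattern.compile(
      "^(.*?)?" +                    // any prefix, e.g. "tmp_"
      "(\\(.*\\))?" +                // task prefix in parentheses
      "([0-9]+)" +                   // taskId
      "(?:_([0-9]{1,6}))?" +         // _<attemptId> (limited to 6 digits)
      "(?:_copy_([0-9]{1,9}))?" +    // _copy_<N> (assumed, not in the quoted diff)
      "(\\..*)?$");                  // extension such as ".zlib.gz" (assumed)

  final String prefix;
  final String taskPrefix;
  final String taskId;
  final String attemptId;   // null when the name has no attempt part
  final int copyIndex;      // 0 when the name has no _copy_<N> part
  final String extension;

  private OutputFileNameSketch(Matcher m) {
    prefix = orEmpty(m.group(1));
    taskPrefix = orEmpty(m.group(2));
    taskId = m.group(3);
    attemptId = m.group(4);
    copyIndex = m.group(5) == null ? 0 : Integer.parseInt(m.group(5));
    extension = orEmpty(m.group(6));
  }

  private static String orEmpty(String s) {
    return s == null ? "" : s;
  }

  /** Parses a file name, or throws if it does not look like a task output file. */
  static OutputFileNameSketch parse(String fileName) {
    Matcher m = NAME.matcher(fileName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a task output file name: " + fileName);
    }
    return new OutputFileNameSketch(m);
  }

  /** Rebuilds the name with the given copy index; 0 means no _copy_ part. */
  String withCopyIndex(int n) {
    StringBuilder sb = new StringBuilder(prefix).append(taskPrefix).append(taskId);
    if (attemptId != null) {
      sb.append('_').append(attemptId);
    }
    if (n > 0) {
      sb.append("_copy_").append(n);
    }
    return sb.append(extension).toString();
  }

  /** Increments the copy index until the name is not in the set of existing names. */
  String nextFreeName(Set<String> existing) {
    int n = copyIndex;
    String candidate = withCopyIndex(n);
    while (existing.contains(candidate)) {
      candidate = withCopyIndex(++n);
    }
    return candidate;
  }

  public static void main(String[] args) {
    OutputFileNameSketch parsed = parse("00001_02_copy_3");
    Set<String> existing = Collections.singleton("00001_02_copy_3");
    System.out.println(parsed.nextFreeName(existing));
  }
}
```

Rebuilding the name from parsed components, rather than appending to the old string, is what keeps the result in the 00001_02_copy_N shape that the issue asks for. The sketch does not cover the second point of the issue, which is renaming incompatible files under the current task's id.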