[jira] [Commented] (HIVE-20580) OrcInputFormat.isOriginal() should not rely on hive.acid.key.index

Peter Vary (JIRA) Fri, 08 Mar 2019 06:22:25 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787922#comment-16787922
 ]


Peter Vary commented on HIVE-20580:
-----------------------------------

Based on the description I suspect that the following methods should be checked:
{code:java}
  public static boolean isOriginal(Reader file) {
    return !file.hasMetadataValue(OrcRecordUpdater.ACID_KEY_INDEX_NAME);
  }

  public static boolean isOriginal(Footer footer) {
    for(OrcProto.UserMetadataItem item: footer.getMetadataList()) {
      if (item.hasName() && item.getName().equals(OrcRecordUpdater.ACID_KEY_INDEX_NAME)) {
        return true;
      }
    }
    return false;
  }
{code}
The funny thing is that the first method (with the Reader as a parameter) 
returns {{true}} if we do *not find* {{hive.acid.key.index}} in the 
metadata list, while the second method returns {{true}} if we *find* 
{{hive.acid.key.index}} :) :)

I think the original intention (pun intended :)) was to return {{true}} for a 
non-ACID file and {{false}} for an ACID one.
The second method is used only to set 
{{org.apache.hadoop.hive.llap.io.metadata.OrcFileMetadata.isOriginalFormat}}, 
which is not read anywhere in the code (or if it is, I was not able to find 
where), so I think we should stick to the original meaning of {{isOriginal}} 
and fix the second method.
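
For illustration, the corrected footer-based check could look roughly like the 
sketch below. It is self-contained and is only an assumption about the shape of 
the fix: a plain list of metadata key names stands in for 
{{footer.getMetadataList()}}, and the class name {{IsOriginalSketch}} is 
hypothetical, not part of Hive.

```java
import java.util.List;

public class IsOriginalSketch {
  // The key discussed above (what OrcRecordUpdater.ACID_KEY_INDEX_NAME refers to).
  static final String ACID_KEY_INDEX_NAME = "hive.acid.key.index";

  // Stand-in for the Footer overload: only the metadata item names matter here.
  // Returns true for a non-ACID ("original") file, false for an ACID one,
  // which matches the behaviour of the Reader-based overload.
  public static boolean isOriginal(List<String> metadataNames) {
    for (String name : metadataNames) {
      if (ACID_KEY_INDEX_NAME.equals(name)) {
        return false; // key present: the file was written by the ACID writer
      }
    }
    return true; // key absent: original file
  }

  public static void main(String[] args) {
    System.out.println(isOriginal(List.of("hive.acid.key.index")));
    System.out.println(isOriginal(List.of()));
  }
}
```

With this change both overloads would agree on the same file, which is what the 
small test program further down checks.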

I tested the first part of the change (the Reader-based check only) using the 
following commands:
{code:java|title=Non ACID}
0: jdbc:hive2://localhost:10003/default> load data inpath 'original.orc' into table acid;
[..]
INFO  : Completed executing command(queryId=petervary_20190308140915_3e1ee5ef-22ec-4cd5-9353-7b00f0702e4d); Time taken: 10.706 seconds
{code}
{code:java|title=ACID}
0: jdbc:hive2://localhost:10003/default> load data inpath 'acid.orc' into table acid;
Error: Error while compiling statement: FAILED: SemanticException [Error 10413]: "acid.orc" was created by Acid write - it cannot be loaded into anther Acid table (state=42000,code=10413)
{code}

I also wrote a small program to test this on specific files:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.orc.OrcProto;

import java.io.IOException;

public class a {
  public static void main(String[] args) throws IOException {
    // Swap the two path lines to test a non-ACID file instead.
//    String path = "/Users/petervary/tmp/orc_split_elim.orc"; // Non-ACID file
    String path = "/Users/petervary/tmp/bucket_00000"; // ACID file
    Reader reader = OrcFile.createReader(new Path(path),
        OrcFile.readerOptions(new Configuration()));
    OrcProto.Footer footer = reader.getFileTail().getFooter();
    // Run both overloads on the same file; they should agree.
    boolean result1 = OrcInputFormat.isOriginal(reader); // Reader-based check
    boolean result2 = OrcInputFormat.isOriginal(footer); // Footer-based check
    System.out.println("IsOriginal: " + result1 + " " + result2);
  }
}
{code}

[~vgumashta], [~ashutoshc]: Is there an easy way to write a unit test? I think 
the best approach would be to have 3 test files in the {{/data/files/}} directory:
* Non-ACID orc file
* ACID v1 file
* ACID v2 file

The test code above could then be used to check the result of the 
{{isOriginal}} methods. Shall I create the test files myself, or do you know of 
existing files that I can use?

Thanks,
Peter

> OrcInputFormat.isOriginal() should not rely on hive.acid.key.index
> ------------------------------------------------------------------
>
>                 Key: HIVE-20580
>                 URL: https://issues.apache.org/jira/browse/HIVE-20580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>    Affects Versions: 3.1.0
>            Reporter: Eugene Koifman
>            Assignee: Peter Vary
>            Priority: Major
>
> {{org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.isOriginal()}} is checking 
> for presence of {{hive.acid.key.index}} in the footer.  This is only created 
> when the file is written by {{OrcRecordUpdater}}.  It should instead check 
> for presence of Acid metadata columns so that a file can be produced by 
> something other than {{OrcRecordUpdater}}.
> Also, {{hive.acid.key.index}} counts the number of events of each type, which 
> is not really useful for Acid V2 (as of Hive 3) since each file only has 1 
> type of event.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
