[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-3874:
------------------------------

    Attachment: HIVE-3874.D8529.1.patch

omalley requested code review of "HIVE-3874 [jira] Create a new Optimized Row 
Columnar file format for Hive".

Reviewers: JIRA

HIVE-3874. Create ORC File format.

There are several limitations of the current RCFile format that I'd like to
address by creating a new format:

  * each column value is stored as a binary blob, which means:
    - the entire column value must be read, decompressed, and deserialized
    - the file format can't use smarter type-specific compression
    - push-down filters can't be evaluated
  * the start of each row group must be found by scanning
  * user metadata can only be added to the file when the file is created
  * the file doesn't store the number of rows per file or per row group
  * there is no mechanism for seeking to a particular row number, which is
    required for external indexes
  * there is no mechanism for storing lightweight indexes within the file to
    enable push-down filters to skip entire row groups
  * the types of the rows aren't stored in the file
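To make the list above concrete, here is a minimal sketch of a write/read
round trip using the classes this patch introduces (OrcFile, Writer, Reader,
RecordReader, CompressionKind). The method signatures, the MyRow helper
class, the numeric parameters, and the package name are assumptions inferred
from the file paths under AFFECTED FILES, not confirmed details of the patch:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hive.ql.orc.*;   // package assumed from AFFECTED FILES paths
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

  public class OrcRoundTrip {
    // Hypothetical row type; a reflection ObjectInspector lets the writer
    // record the row's type in the file, which RCFile cannot do.
    static class MyRow {
      int x;
      String y;
      MyRow(int x, String y) { this.x = x; this.y = y; }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.getLocal(conf);
      Path path = new Path("/tmp/example.orc");

      ObjectInspector inspector =
          ObjectInspectorFactory.getReflectionObjectInspector(
              MyRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

      // Write side (assumed signature): stripe size, codec, buffer size, and
      // row index stride would control the lightweight indexes that let
      // push-down filters skip whole row groups.
      Writer writer = OrcFile.createWriter(fs, path, inspector,
          256 * 1024 * 1024, CompressionKind.ZLIB, 256 * 1024, 10000);
      writer.addRow(new MyRow(1, "hello"));
      writer.close();

      // Read side (assumed signature): since row counts and stripe offsets
      // are stored in the file, a reader can seek to an absolute row number
      // instead of scanning for row-group boundaries.
      Reader reader = OrcFile.createReader(fs, path);
      RecordReader rows = reader.rows(null);   // null: read every column
      rows.seekToRow(0);
      Object row = null;
      while (rows.hasNext()) {
        row = rows.next(row);
      }
      rows.close();
    }
  }

Whatever the patch's actual entry points look like, the shape should hold:
the writer needs type information up front so it can pick type-specific
encodings and accumulate per-column statistics, and the reader needs the
recorded row counts to support seeking.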

TEST PLAN
  EMPTY

REVISION DETAIL
  https://reviews.facebook.net/D8529

AFFECTED FILES
  build.properties
  build.xml
  ivy/libraries.properties
  ql/build.xml
  ql/ivy.xml
  ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/BitFieldReader.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/BitFieldWriter.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/BooleanColumnStatistics.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/ColumnStatistics.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/ColumnStatisticsImpl.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/CompressionCodec.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/CompressionKind.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/DoubleColumnStatistics.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/DynamicByteArray.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/DynamicIntArray.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/FileDump.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/InStream.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/IntegerColumnStatistics.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/OrcFile.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/OrcInputFormat.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/OrcOutputFormat.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/OrcSerde.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/OrcStruct.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/OrcUnion.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/OutStream.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/PositionProvider.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/PositionRecorder.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/PositionedOutputStream.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/Reader.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/ReaderImpl.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/RecordReader.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/RecordReaderImpl.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/RedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/RunLengthByteReader.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/RunLengthByteWriter.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/RunLengthIntegerReader.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/RunLengthIntegerWriter.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/SerializationUtils.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/SnappyCodec.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/StreamName.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/StringColumnStatistics.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/StripeInformation.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/Writer.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/WriterImpl.java
  ql/src/java/org/apache/hadoop/hive/ql/orc/ZlibCodec.java
  ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestBitFieldReader.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestDynamicArray.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestFileDump.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInStream.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcStruct.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestRunLengthByteReader.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestRunLengthIntegerReader.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestSerializationUtils.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestStreamName.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestStringRedBlackTree.java
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestZlib.java
  ql/src/test/resources/orc-file-dump.out

To: JIRA, omalley

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, 
> OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RCFile format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push-down filters can't be evaluated
> * the start of each row group must be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or per row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes
> * there is no mechanism for storing lightweight indexes within the file to 
> enable push-down filters to skip entire row groups
> * the types of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
