[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544315&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544315 ]
ASF GitHub Bot logged work on HIVE-23553:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Jan/21 15:39
            Start Date: 29/Jan/21 15:39
    Worklog Time Spent: 10m
      Work Description: pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566784743

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
       OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
       counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
       FileTail tail = orcTail.getFileTail();
-      stats = orcTail.getStripeStatisticsProto();
+      CompressionKind compressionKind = orcTail.getCompressionKind();
+      InStream.StreamOptions options = null;
+      if (compressionKind != CompressionKind.NONE) {
+        options = InStream.options()
+            .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+      }
+      InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
   Sure, makes sense -- extracted method above.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
       OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
       counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
       FileTail tail = orcTail.getFileTail();
-      stats = orcTail.getStripeStatisticsProto();
+      CompressionKind compressionKind = orcTail.getCompressionKind();
+      InStream.StreamOptions options = null;
+      if (compressionKind != CompressionKind.NONE) {
+        options = InStream.options()
+            .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+      }
+      InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+          orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+      stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
       stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+      int stripeIdx = 0;
       for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-        stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+        stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
   In ORC-1.5 encryption is not supported -- since ORC-523 stripe information can also be encrypted, hence the extra arguments here. Since we are not yet using encryption in LLAP, the last two params can be null, while I am keeping an incremental stripe index as it is used in a couple of places such as the StripePlanner.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/LlapDataReader.java
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io.orc.encoded;
+
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.OrcIndex;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/** An abstract data reader that IO formats can use to read bytes from underlying storage. */
+public interface LlapDataReader extends AutoCloseable, Cloneable {
+
+  /** Opens the DataReader, making it ready to use. */
+  void open() throws IOException;
+
+  OrcIndex readRowIndex(StripeInformation stripe,
+      TypeDescription fileSchema,
+      OrcProto.StripeFooter footer,
+      boolean ignoreNonUtf8BloomFilter,
+      boolean[] included,
+      OrcProto.RowIndex[] indexes,
+      boolean[] sargColumns,
+      OrcFile.WriterVersion version,
+      OrcProto.Stream.Kind[] bloomFilterKinds,
+      OrcProto.BloomFilterIndex[] bloomFilterIndices
+  ) throws IOException;
+
+  OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws IOException;
+
+  /** Reads the data.
+   *
+   * Note that for the cases such as zero-copy read, caller must release the disk ranges
+   * produced after being done with them. Call isTrackingDiskRanges to find out if this is needed.
+   * @param range List of disk ranges to read. Ranges with data will be ignored.
+   * @param baseOffset Base offset from the start of the file of the ranges in disk range list.
+   * @param doForceDirect Whether the data should be read into direct buffers.
+   * @return New or modified list of DiskRange-s, where all the ranges are filled with data.
+   */
+  DiskRangeList readFileData(
+      DiskRangeList range, long baseOffset, boolean doForceDirect) throws IOException;
+
+  /**
+   * Whether the user should release buffers created by readFileData. See readFileData javadoc.
+   */
+  boolean isTrackingDiskRanges();
+
+  /**
+   * Releases buffers created by readFileData. See readFileData javadoc.
+   * @param toRelease The buffer to release.
+   */
+  void releaseBuffer(ByteBuffer toRelease);
+
+  /**
+   * Clone the entire state of the DataReader with the assumption that the
+   * clone will be closed at a different time. Thus, any file handles in the
+   * implementation need to be cloned.
+   * @return a new instance
+   */
+  LlapDataReader clone();
+
+  @Override
+  void close() throws IOException;
+
+  /**
+   * Returns the compression codec used by this datareader.
+   * We should consider removing this from the interface.
+   * @return the compression codec
+   */
+  CompressionCodec getCompressionCodec();

Review comment:
   The interface is almost identical indeed; however, the main issue is the readFileData method above, which takes a DiskRangeList instead of a BufferChunkList. ORC-1.6 uses a separate StripePlanner that reads RowGroups, converting them to BufferChunks, while in LLAP we are doing our own custom planning with DiskRanges. I am opening another ticket for this (moving LLAP to stripe planning), but for now I believe it makes sense to keep the class.
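   A minimal caller sketch of the LlapDataReader contract above (illustrative only, not code from the PR; the class name, method name, and the fixed [0, 4096) range are hypothetical -- only the interface methods come from the file itself):

   package org.apache.hadoop.hive.ql.io.orc.encoded; // same package as the interface above

   import java.io.IOException;
   import org.apache.hadoop.hive.common.io.DiskRangeList;

   class LlapDataReaderUsageSketch {
     static void readAndConsume(LlapDataReader reader) throws IOException {
       reader.open(); // must be called before any read
       try {
         // Plan a single disk range covering bytes [0, 4096) of the file.
         DiskRangeList planned = new DiskRangeList(0, 4096);
         // Fill the planned ranges; ranges that already carry data are ignored.
         DiskRangeList filled = reader.readFileData(planned, /* baseOffset */ 0L, /* doForceDirect */ false);
         // ... consume the ByteBuffers in the filled list here ...
         // Per the readFileData javadoc, zero-copy implementations require the
         // caller to release the produced buffers explicitly.
         if (reader.isTrackingDiskRanges()) {
           for (DiskRangeList r = filled; r != null; r = r.next) {
             if (r.hasData()) {
               reader.releaseBuffer(r.getData());
             }
           }
         }
       } finally {
         reader.close();
       }
     }
   }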
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java
##########
@@ -47,12 +47,14 @@ public SplitInfos applySargToMetadata(
     OrcTail orcTail = ReaderImpl.extractFileTail(fileMetadata);
     OrcProto.Footer footer = orcTail.getFooter();
     int stripeCount = footer.getStripesCount();
-    boolean writerUsedProlepticGregorian = footer.hasCalendar()
-        ? footer.getCalendar() == OrcProto.CalendarKind.PROLEPTIC_GREGORIAN
-        : OrcConf.PROLEPTIC_GREGORIAN_DEFAULT.getBoolean(conf);
+    // Always convert To PROLEPTIC_GREGORIAN

Review comment:
   Hive is already using the proleptic calendar as the default for processing, and we seem to be converting the StripeStats to it since HIVE-22589:
   https://github.com/apache/orc/blob/32be030290905de9c2cd5b8cd84e210d8c0cf25c/java/core/src/java/org/apache/orc/impl/OrcTail.java#L114
   The main difference here is that the ORC Reader is now doing this transparently:
   https://github.com/apache/orc/blob/7542d04a2fee8437f2c12f72f9c647b4325bc7bb/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L817

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -2003,6 +2005,22 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List<OrcPr
    * @param colStats The statistics array
    * @return The min record key
    */
+  private static OrcRawRecordMerger.KeyInterval getKeyInterval(ColumnStatistics[] colStats) {

Review comment:
   Sure, created a wrapper instead, reusing the same logic.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
     }
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
   This is related to ORC-685, where we added a way for the ReaderImpl to directly extractFileTail from a ByteBuffer. Using the buffer size is the right thing to do here, as there can be a mismatch with the FileLength, leading to bad tail extraction.

##########
File path: ql/src/test/results/clientpositive/llap/schema_evol_orc_nonvec_part_all_primitive.q.out
##########
@@ -687,11 +687,11 @@ POSTHOOK: Input: default@part_change_various_various_timestamp_n6
 POSTHOOK: Input: default@part_change_various_various_timestamp_n6@part=1
 #### A masked pattern was here ####
 insert_num part c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 b
-101 1 1970-01-01 00:00:00.001 1969-12-31 23:59:59.872 NULL 1969-12-07 03:28:36.352 NULL NULL NULL NULL 6229-06-28 02:54:28.970117179 6229-06-28 02:54:28.97011 6229-06-28 02:54:28.97011 1950-12-18 00:00:00 original
-102 1 1970-01-01 00:00:00 1970-01-01 00:00:00.127 1970-01-01 00:00:32.767 1970-01-25 20:31:23.647 NULL NULL NULL NULL 5966-07-09 03:30:50.597 5966-07-09 03:30:50.597 5966-07-09 03:30:50.597 2049-12-18 00:00:00 original
+101 1 1970-01-01 00:00:01 1969-12-31 23:57:52 NULL 1901-12-13 20:45:52 NULL NULL NULL NULL 6229-06-28 02:54:28.970117179 6229-06-28 02:54:28.97011 6229-06-28 02:54:28.97011 1950-12-18 00:00:00 original

Review comment:
   Yeah, I spent some time figuring this out myself; the root cause is a change in integer-to-Timestamp conversion: ORC-1.6 interprets the value as seconds, while ORC-1.5 used milliseconds.
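   A JDK-only sketch of that unit change (illustrative only, not the ORC implementation; the class name is hypothetical and the printed values assume UTC):

   import java.sql.Timestamp;
   import java.time.Instant;

   class EpochUnitsSketch {
     public static void main(String[] args) {
       long stored = 1L; // the integer column value being converted to Timestamp
       // ORC-1.5 interprets the value as epoch milliseconds.
       Timestamp asMillis = Timestamp.from(Instant.ofEpochMilli(stored));   // 1970-01-01 00:00:00.001
       // ORC-1.6 interprets the value as epoch seconds.
       Timestamp asSeconds = Timestamp.from(Instant.ofEpochSecond(stored)); // 1970-01-01 00:00:01.0
       System.out.println(asMillis + " vs " + asSeconds);
     }
   }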
   As a result, casting (+1) to Timestamp yields 1970-01-01 00:00:00.001 in ORC-1.5 but 1970-01-01 00:00:01 in ORC-1.6:
   https://github.com/apache/orc/blob/f7dce53fb39cf9654641edfe2d3ad68c0a8ef7b8/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L1499
   PS: also added this as a user-facing change on the PR.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }

-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
   That's one of the breaking changes that came with encryption support in ORC. As the Tail, and thus the StripeStatistics, may be encrypted, we always need a Reader instance to retrieve them. OrcTail maintained the API call for backwards compatibility, but it still expects a Reader to actually retrieve the stats:
   https://github.com/apache/orc/blob/d78cc39a9299b62bc8a5d8f5c3fac9201e03cb8b/java/core/src/java/org/apache/orc/impl/OrcTail.java#L210

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/sarg/ConvertAstToSearchArg.java
##########
@@ -68,6 +70,10 @@
   private static final int KRYO_OUTPUT_BUFFER_SIZE = 4 * 1024;
   private static final int KRYO_OUTPUT_BUFFER_MAX_SIZE = 10 * 1024 * 1024;

+  private static final GregorianCalendar PROLEPTIC = new GregorianCalendar();

Review comment:
   ACK

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
         .setColumnEncoding(columnEncoding)
         .setVectors(vectors)
         .setContext(context)
+        .setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
   ACK

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java
##########
@@ -282,6 +280,56 @@ public String toString() {
     }
   }

+  public static boolean[] findPresentStreamsByColumn(

Review comment:
   Sure, done.

##########
File path: ql/src/test/results/clientpositive/llap/acid_bloom_filter_orc_file_dump.q.out
##########
@@ -76,7 +76,7 @@ PREHOOK: Input: default@bloomtest
 #### A masked pattern was here ####
 -- BEGIN ORC FILE DUMP --
 #### A masked pattern was here ####
-File Version: 0.12 with ORC_517
+File Version: 0.12 with ORC_14

Review comment:
   Yes, ORC_14 is the latest WriterVersion (for 1.6), as per:
   https://github.com/apache/orc/blob/a1671f1e8abf748178c07e2a03b03f00268c8863/java/core/src/java/org/apache/orc/OrcFile.java#L177

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 544315)
    Time Spent: 4h 20m  (was: 4h 10m)

> Upgrade ORC version to 1.6.7
> ----------------------------
>
>                 Key: HIVE-23553
>                 URL: https://issues.apache.org/jira/browse/HIVE-23553
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Apache Hive is currently on a 1.5.X version, and in order to take advantage of
> the latest ORC improvements, such as column encryption, we have to bump to
> 1.6.X:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288&styleName=&projectId=12318320&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin
> Even though the ORC reader could work out of the box, Hive LLAP depends heavily
> on internal ORC APIs, e.g., to retrieve and store file footers, tails, and streams,
> and to un/compress RG data. As there were many internal changes from 1.5 to 1.6
> (input stream offsets, relative BufferChunks, etc.), the upgrade is not
> straightforward.
> This umbrella Jira tracks the upgrade effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)