[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544315&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544315 ]
ASF GitHub Bot logged work on HIVE-23553:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Jan/21 15:39
            Start Date: 29/Jan/21 15:39
    Worklog Time Spent: 10m
      Work Description: pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566784743

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
       OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
       counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
       FileTail tail = orcTail.getFileTail();
-      stats = orcTail.getStripeStatisticsProto();
+      CompressionKind compressionKind = orcTail.getCompressionKind();
+      InStream.StreamOptions options = null;
+      if (compressionKind != CompressionKind.NONE) {
+        options = InStream.options()
+            .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+      }
+      InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
   Sure, makes sense -- extracted method above.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
       OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
       counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
       FileTail tail = orcTail.getFileTail();
-      stats = orcTail.getStripeStatisticsProto();
+      CompressionKind compressionKind = orcTail.getCompressionKind();
+      InStream.StreamOptions options = null;
+      if (compressionKind != CompressionKind.NONE) {
+        options = InStream.options()
+            .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+      }
+      InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+          orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+      stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
       stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+      int stripeIdx = 0;
       for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-        stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+        stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
   In ORC-1.5 encryption is not supported -- since ORC-523 stripe information can also be encrypted, hence the extra arguments here. Since we are not yet using encryption in LLAP, the last two params can be null, while I am keeping an incremental stripe index as it is used in a couple of places such as the StripePlanner.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/LlapDataReader.java
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io.orc.encoded;
+
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.OrcIndex;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/** An abstract data reader that IO formats can use to read bytes from underlying storage. */
+public interface LlapDataReader extends AutoCloseable, Cloneable {
+
+  /** Opens the DataReader, making it ready to use. */
+  void open() throws IOException;
+
+  OrcIndex readRowIndex(StripeInformation stripe,
+      TypeDescription fileSchema,
+      OrcProto.StripeFooter footer,
+      boolean ignoreNonUtf8BloomFilter,
+      boolean[] included,
+      OrcProto.RowIndex[] indexes,
+      boolean[] sargColumns,
+      OrcFile.WriterVersion version,
+      OrcProto.Stream.Kind[] bloomFilterKinds,
+      OrcProto.BloomFilterIndex[] bloomFilterIndices
+  ) throws IOException;
+
+  OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws IOException;
+
+  /** Reads the data.
+   *
+   * Note that for the cases such as zero-copy read, caller must release the disk ranges
+   * produced after being done with them. Call isTrackingDiskRanges to find out if this is needed.
+   * @param range List of disk ranges to read. Ranges with data will be ignored.
+   * @param baseOffset Base offset from the start of the file of the ranges in disk range list.
+   * @param doForceDirect Whether the data should be read into direct buffers.
+   * @return New or modified list of DiskRange-s, where all the ranges are filled with data.
+   */
+  DiskRangeList readFileData(
+      DiskRangeList range, long baseOffset, boolean doForceDirect) throws IOException;
+
+  /**
+   * Whether the user should release buffers created by readFileData. See readFileData javadoc.
+   */
+  boolean isTrackingDiskRanges();
+
+  /**
+   * Releases buffers created by readFileData. See readFileData javadoc.
+   * @param toRelease The buffer to release.
+   */
+  void releaseBuffer(ByteBuffer toRelease);
+
+  /**
+   * Clone the entire state of the DataReader with the assumption that the
+   * clone will be closed at a different time. Thus, any file handles in the
+   * implementation need to be cloned.
+   * @return a new instance
+   */
+  LlapDataReader clone();
+
+  @Override
+  void close() throws IOException;
+
+  /**
+   * Returns the compression codec used by this datareader.
+   * We should consider removing this from the interface.
+   * @return the compression codec
+   */
+  CompressionCodec getCompressionCodec();

Review comment:
   The interface is almost identical indeed; however, the main issue is the readFileData method above, which takes a DiskRangeList instead of a BufferChunkList. ORC-1.6 uses a separate StripePlanner that reads RowGroups, converting them to BufferChunks, while in LLAP we are doing our own custom planning with DiskRanges. I am opening another ticket for this (moving LLAP to stripe planning), but for now I believe it makes sense to keep the class.
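   A minimal caller sketch of the LlapDataReader contract above (illustrative only, not code from the PR; the class name, method name, and the fixed [0, 4096) range are hypothetical -- only the interface methods come from the file itself):

   package org.apache.hadoop.hive.ql.io.orc.encoded; // same package as the interface above

   import java.io.IOException;
   import org.apache.hadoop.hive.common.io.DiskRangeList;

   class LlapDataReaderUsageSketch {
     static void readAndConsume(LlapDataReader reader) throws IOException {
       reader.open(); // must be called before any read
       try {
         // Plan a single disk range covering bytes [0, 4096) of the file.
         DiskRangeList planned = new DiskRangeList(0, 4096);
         // Fill the planned ranges; ranges that already carry data are ignored.
         DiskRangeList filled = reader.readFileData(planned, /* baseOffset */ 0L, /* doForceDirect */ false);
         // ... consume the ByteBuffers in the filled list here ...
         // Per the readFileData javadoc, zero-copy implementations require the
         // caller to release the produced buffers explicitly.
         if (reader.isTrackingDiskRanges()) {
           for (DiskRangeList r = filled; r != null; r = r.next) {
             if (r.hasData()) {
               reader.releaseBuffer(r.getData());
             }
           }
         }
       } finally {
         reader.close();
       }
     }
   }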
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java
##########
@@ -47,12 +47,14 @@ public SplitInfos applySargToMetadata(
     OrcTail orcTail = ReaderImpl.extractFileTail(fileMetadata);
     OrcProto.Footer footer = orcTail.getFooter();
     int stripeCount = footer.getStripesCount();
-    boolean writerUsedProlepticGregorian = footer.hasCalendar()
-        ? footer.getCalendar() == OrcProto.CalendarKind.PROLEPTIC_GREGORIAN
-        : OrcConf.PROLEPTIC_GREGORIAN_DEFAULT.getBoolean(conf);
+    // Always convert To PROLEPTIC_GREGORIAN

Review comment:
   Hive is already using the proleptic calendar as the default for processing, and we seem to be converting the StripeStats to it since HIVE-22589:
   https://github.com/apache/orc/blob/32be030290905de9c2cd5b8cd84e210d8c0cf25c/java/core/src/java/org/apache/orc/impl/OrcTail.java#L114
   The main difference here is that the ORC Reader is now doing this transparently:
   https://github.com/apache/orc/blob/7542d04a2fee8437f2c12f72f9c647b4325bc7bb/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L817

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -2003,6 +2005,22 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List<OrcPr
    * @param colStats The statistics array
    * @return The min record key
    */
+  private static OrcRawRecordMerger.KeyInterval getKeyInterval(ColumnStatistics[] colStats) {

Review comment:
   Sure, created a wrapper instead, reusing the same logic.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
     }
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
   This is related to ORC-685, where we added a way for the ReaderImpl to directly extractFileTail from a ByteBuffer. Using the buffer size is the right thing to do here, as there can be a mismatch with the FileLength, leading to bad tail extraction.

##########
File path: ql/src/test/results/clientpositive/llap/schema_evol_orc_nonvec_part_all_primitive.q.out
##########
@@ -687,11 +687,11 @@ POSTHOOK: Input: default@part_change_various_various_timestamp_n6
 POSTHOOK: Input: default@part_change_various_various_timestamp_n6@part=1
 #### A masked pattern was here ####
 insert_num part c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 b
-101 1 1970-01-01 00:00:00.001 1969-12-31 23:59:59.872 NULL 1969-12-07 03:28:36.352 NULL NULL NULL NULL 6229-06-28 02:54:28.970117179 6229-06-28 02:54:28.97011 6229-06-28 02:54:28.97011 1950-12-18 00:00:00 original
-102 1 1970-01-01 00:00:00 1970-01-01 00:00:00.127 1970-01-01 00:00:32.767 1970-01-25 20:31:23.647 NULL NULL NULL NULL 5966-07-09 03:30:50.597 5966-07-09 03:30:50.597 5966-07-09 03:30:50.597 2049-12-18 00:00:00 original
+101 1 1970-01-01 00:00:01 1969-12-31 23:57:52 NULL 1901-12-13 20:45:52 NULL NULL NULL NULL 6229-06-28 02:54:28.970117179 6229-06-28 02:54:28.97011 6229-06-28 02:54:28.97011 1950-12-18 00:00:00 original

Review comment:
   Yeah, I spent some time figuring this out myself; the root cause is a change in integer-to-Timestamp conversion: ORC-1.6 interprets the value as seconds, while ORC-1.5 used milliseconds.
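   A JDK-only sketch of that unit change (illustrative only, not the ORC implementation; the class name is hypothetical and the printed values assume UTC):

   import java.sql.Timestamp;
   import java.time.Instant;

   class EpochUnitsSketch {
     public static void main(String[] args) {
       long stored = 1L; // the integer column value being converted to Timestamp
       // ORC-1.5 interprets the value as epoch milliseconds.
       Timestamp asMillis = Timestamp.from(Instant.ofEpochMilli(stored));   // 1970-01-01 00:00:00.001
       // ORC-1.6 interprets the value as epoch seconds.
       Timestamp asSeconds = Timestamp.from(Instant.ofEpochSecond(stored)); // 1970-01-01 00:00:01.0
       System.out.println(asMillis + " vs " + asSeconds);
     }
   }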
   As a result, casting (+1) to Timestamp yields 1970-01-01 00:00:00.001 in ORC-1.5 but 1970-01-01 00:00:01 in ORC-1.6:
   https://github.com/apache/orc/blob/f7dce53fb39cf9654641edfe2d3ad68c0a8ef7b8/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L1499
   PS: also added this as a user-facing change on the PR.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }

-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
   That's one of the breaking changes that came with encryption support in ORC. As the Tail, and thus the StripeStatistics, may be encrypted, we always need a Reader instance to retrieve them. OrcTail maintained the API call for backwards compatibility, but it still expects a Reader to actually retrieve the stats:
   https://github.com/apache/orc/blob/d78cc39a9299b62bc8a5d8f5c3fac9201e03cb8b/java/core/src/java/org/apache/orc/impl/OrcTail.java#L210

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/sarg/ConvertAstToSearchArg.java
##########
@@ -68,6 +70,10 @@
   private static final int KRYO_OUTPUT_BUFFER_SIZE = 4 * 1024;
   private static final int KRYO_OUTPUT_BUFFER_MAX_SIZE = 10 * 1024 * 1024;

+  private static final GregorianCalendar PROLEPTIC = new GregorianCalendar();

Review comment:
   ACK

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
         .setColumnEncoding(columnEncoding)
         .setVectors(vectors)
         .setContext(context)
+        .setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
   ACK

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java
##########
@@ -282,6 +280,56 @@ public String toString() {
     }
   }

+  public static boolean[] findPresentStreamsByColumn(

Review comment:
   Sure, done.

##########
File path: ql/src/test/results/clientpositive/llap/acid_bloom_filter_orc_file_dump.q.out
##########
@@ -76,7 +76,7 @@ PREHOOK: Input: default@bloomtest
 #### A masked pattern was here ####
 -- BEGIN ORC FILE DUMP --
 #### A masked pattern was here ####
-File Version: 0.12 with ORC_517
+File Version: 0.12 with ORC_14

Review comment:
   Yes, ORC_14 is the latest WriterVersion (for 1.6), as per:
   https://github.com/apache/orc/blob/a1671f1e8abf748178c07e2a03b03f00268c8863/java/core/src/java/org/apache/orc/OrcFile.java#L177

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 544315)
    Time Spent: 4h 20m  (was: 4h 10m)

> Upgrade ORC version to 1.6.7
> ----------------------------
>
>                 Key: HIVE-23553
>                 URL: https://issues.apache.org/jira/browse/HIVE-23553
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Apache Hive is currently on a 1.5.X version, and in order to take advantage of
> the latest ORC improvements, such as column encryption, we have to bump to
> 1.6.X:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288&styleName=&projectId=12318320&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin
> Even though the ORC reader could work out of the box, Hive LLAP depends heavily
> on internal ORC APIs, e.g., to retrieve and store file footers, tails, and streams,
> and to un/compress RG data. As there were many internal changes from 1.5 to 1.6
> (input stream offsets, relative BufferChunks, etc.), the upgrade is not
> straightforward.
> This umbrella Jira tracks the upgrade effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)