[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=542604&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-542604 ]
ASF GitHub Bot logged work on HIVE-23553: ----------------------------------------- Author: ASF GitHub Bot Created on: 27/Jan/21 03:28 Start Date: 27/Jan/21 03:28 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r564992785 ########## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java ########## @@ -47,12 +47,14 @@ public SplitInfos applySargToMetadata( OrcTail orcTail = ReaderImpl.extractFileTail(fileMetadata); OrcProto.Footer footer = orcTail.getFooter(); int stripeCount = footer.getStripesCount(); - boolean writerUsedProlepticGregorian = footer.hasCalendar() - ? footer.getCalendar() == OrcProto.CalendarKind.PROLEPTIC_GREGORIAN - : OrcConf.PROLEPTIC_GREGORIAN_DEFAULT.getBoolean(conf); + // Always convert To PROLEPTIC_GREGORIAN Review comment: Why is it OK to use proleptic calendar always here? Could we leave short explanation in the comment for when we need to revisit this code? ########## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java ########## @@ -282,6 +280,56 @@ public String toString() { } } + public static boolean[] findPresentStreamsByColumn( Review comment: Can we add javadoc for these public static utility methods? If they are used only in this class, should we change their visibility? ########## File path: ql/src/test/results/clientpositive/llap/schema_evol_orc_nonvec_part_all_primitive.q.out ########## @@ -687,11 +687,11 @@ POSTHOOK: Input: default@part_change_various_various_timestamp_n6 POSTHOOK: Input: default@part_change_various_various_timestamp_n6@part=1 #### A masked pattern was here #### insert_num part c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 b -101 1 1970-01-01 00:00:00.001 1969-12-31 23:59:59.872 NULL 1969-12-07 03:28:36.352 NULL NULL NULL NULL 6229-06-28 02:54:28.970117179 6229-06-28 02:54:28.97011 6229-06-28 02:54:28.97011 1950-12-18 00:00:00 original -102 1 1970-01-01 00:00:00 1970-01-01 00:00:00.127 1970-01-01 00:00:32.767 1970-01-25 20:31:23.647 NULL NULL NULL NULL 5966-07-09 03:30:50.597 5966-07-09 03:30:50.597 5966-07-09 03:30:50.597 2049-12-18 00:00:00 original +101 1 1970-01-01 00:00:01 1969-12-31 23:57:52 NULL 1901-12-13 20:45:52 NULL NULL NULL NULL 6229-06-28 02:54:28.970117179 6229-06-28 02:54:28.97011 6229-06-28 02:54:28.97011 1950-12-18 00:00:00 original Review comment: This shifting for timestamp values does not seem right (or at least I cannot make sense of it). Could you explain what is going on here? Some of the shifting is significant: For those, I remember there were some backwards incompatible changes in schema evolution in 1.6.x, it may be related to that? However, other shifting seems a bit more suspicious, e.g., 1 second, ~2 minutes? ########## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java ########## @@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex, .setColumnEncoding(columnEncoding) .setVectors(vectors) .setContext(context) + .setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT) Review comment: As @mustafaiman mentioned, I think this should be always false indeed: TIMESTAMP_INSTANT is equivalent to TIMESTAMP_WITH_LOCAL_TIME_ZONE type in Hive. AFAIK support to read/write timestamp with local time zone in ORC is not implemented yet. ########## File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java ########## @@ -325,7 +326,7 @@ public void testReadFormat_0_11() throws Exception { + "binary,string1:string,middle:struct<list:array<struct<int1:int," + "string1:string>>>,list:array<struct<int1:int,string1:string>>," + "map:map<string,struct<int1:int,string1:string>>,ts:timestamp," - + "decimal1:decimal(38,18)>", readerInspector.getTypeName()); + + "decimal1:decimal(38,10)>", readerInspector.getTypeName()); Review comment: Change in decimal scale. Expected? ########## File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out ########## @@ -249,15 +249,15 @@ Stripes: Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 182 loadFactor: 0.029 expectedFpp: 7.090246E-7 Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 1772 loadFactor: 0.2825 expectedFpp: 0.0063713384 Row group indices for column 9: - Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,0,0,0,0,0 - Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,7,488,0,1538,488 + Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.70307 max: 2013-03-01 09:11:58.703325 positions: 0,0,0,0,0,0 + Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703076 max: 2013-03-01 09:11:58.703325 positions: 0,7,488,0,1538,488 Bloom filters for column 9: Entry 0: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13 Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13 Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13 Row group indices for column 10: - Entry 0: count: 1000 hasNull: false min: 8 max: 9994 sum: 5118211 positions: 0,0,0,0,0 - Entry 1: count: 49 hasNull: false min: 248 max: 9490 sum: 246405 positions: 0,2194,0,4,488 + Entry 0: count: 1000 hasNull: false min: 0.08 max: 99.94 sum: 51182.11 positions: 0,0,0,0,0 Review comment: min/max changed. Expected? ########## File path: ql/src/test/results/clientpositive/llap/dynamic_semijoin_reduction_multicol.q.out ########## @@ -355,7 +355,7 @@ Stage-1 HIVE COUNTERS: RECORDS_OUT_OPERATOR_TS_3: 800 TOTAL_TABLE_ROWS_WRITTEN: 0 Stage-1 LLAP IO COUNTERS: - CACHE_HIT_BYTES: 138344 + CACHE_MISS_BYTES: 138342 Review comment: Expected? Was it a typo? ########## File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out ########## @@ -111,7 +111,7 @@ Stripe Statistics: Column 6: count: 1049 hasNull: false bytesOnDisk: 3323 min: 0.02 max: 49.85 sum: 26286.349999999977 Column 7: count: 1049 hasNull: false bytesOnDisk: 137 true: 526 Column 8: count: 1049 hasNull: false bytesOnDisk: 3430 min: max: zach zipper sum: 13443 - Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 + Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 09:11:58.70307 max: 2013-03-01 09:11:58.703325 Review comment: Timestamp precision change. Expected? ########## File path: ql/src/test/results/clientpositive/llap/acid_bloom_filter_orc_file_dump.q.out ########## @@ -76,7 +76,7 @@ PREHOOK: Input: default@bloomtest #### A masked pattern was here #### -- BEGIN ORC FILE DUMP -- #### A masked pattern was here #### -File Version: 0.12 with ORC_517 +File Version: 0.12 with ORC_14 Review comment: Is this change expected? ########## File path: ql/src/test/results/clientpositive/tez/orc_merge12.q.out ########## @@ -162,8 +162,8 @@ Stripe Statistics: Column 6: count: 9174 hasNull: true min: -16379.0 max: 9763215.5639 sum: 5.62236530305E7 Column 7: count: 12288 hasNull: false min: 00020767-dd8f-4f4d-bd68-4b7be64b8e44 max: fffa3516-e219-4027-b0d3-72bb2e676c52 sum: 442368 Column 8: count: 12288 hasNull: false min: 000976f7-7075-4f3f-a564-5a375fafcc101416a2b7-7f64-41b7-851f-97d15405037e max: fffd0642-5f01-48cd-8d97-3428faee49e9b39f2b4c-efdc-4e5f-9ab5-4aa5394cb156 sum: 884736 - Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808 - Column 10: count: 9174 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808 + Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808999999 Review comment: Change in timestamp precision in stats. Expected? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking ------------------- Worklog Id: (was: 542604) Time Spent: 4h (was: 3h 50m) > Upgrade ORC version to 1.6.7 > ---------------------------- > > Key: HIVE-23553 > URL: https://issues.apache.org/jira/browse/HIVE-23553 > Project: Hive > Issue Type: Improvement > Reporter: Panagiotis Garefalakis > Assignee: Panagiotis Garefalakis > Priority: Major > Labels: pull-request-available > Time Spent: 4h > Remaining Estimate: 0h > > Apache Hive is currently on 1.5.X version and in order to take advantage of > the latest ORC improvements such as column encryption we have to bump to > 1.6.X. > https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288&styleName=&projectId=12318320&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin > Even though ORC reader could work out of the box, HIVE LLAP is heavily > depending on internal ORC APIs e.g., to retrieve and store File Footers, > Tails, streams – un/compress RG data etc. As there ware many internal changes > from 1.5 to 1.6 (Input stream offsets, relative BufferChunks etc.) the > upgrade is not straightforward. > This Umbrella Jira tracks this upgrade effort. -- This message was sent by Atlassian Jira (v8.3.4#803005)