[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7

ASF GitHub Bot (Jira) Tue, 26 Jan 2021 19:29:06 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=542604&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-542604
 ]


ASF GitHub Bot logged work on HIVE-23553:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Jan/21 03:28
            Start Date: 27/Jan/21 03:28
    Worklog Time Spent: 10m 
      Work Description: jcamachor commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r564992785



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java
##########
@@ -47,12 +47,14 @@ public SplitInfos applySargToMetadata(
     OrcTail orcTail = ReaderImpl.extractFileTail(fileMetadata);
     OrcProto.Footer footer = orcTail.getFooter();
     int stripeCount = footer.getStripesCount();
-    boolean writerUsedProlepticGregorian = footer.hasCalendar()
-        ? footer.getCalendar() == OrcProto.CalendarKind.PROLEPTIC_GREGORIAN
-        : OrcConf.PROLEPTIC_GREGORIAN_DEFAULT.getBoolean(conf);
+    // Always convert To PROLEPTIC_GREGORIAN

Review comment:
       Why is it OK to use proleptic calendar always here? Could we leave short 
explanation in the comment for when we need to revisit this code?

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java
##########
@@ -282,6 +280,56 @@ public String toString() {
     }
   }
 
+  public static boolean[] findPresentStreamsByColumn(

Review comment:
       Can we add javadoc for these public static utility methods? If they are 
used only in this class, should we change their visibility?

##########
File path: 
ql/src/test/results/clientpositive/llap/schema_evol_orc_nonvec_part_all_primitive.q.out
##########
@@ -687,11 +687,11 @@ POSTHOOK: Input: 
default@part_change_various_various_timestamp_n6
 POSTHOOK: Input: default@part_change_various_various_timestamp_n6@part=1
 #### A masked pattern was here ####
 insert_num     part    c1      c2      c3      c4      c5      c6      c7      
c8      c9      c10     c11     c12     b
-101    1       1970-01-01 00:00:00.001 1969-12-31 23:59:59.872 NULL    
1969-12-07 03:28:36.352 NULL    NULL    NULL    NULL    6229-06-28 
02:54:28.970117179   6229-06-28 02:54:28.97011       6229-06-28 02:54:28.97011  
     1950-12-18 00:00:00     original
-102    1       1970-01-01 00:00:00     1970-01-01 00:00:00.127 1970-01-01 
00:00:32.767 1970-01-25 20:31:23.647 NULL    NULL    NULL    NULL    5966-07-09 
03:30:50.597 5966-07-09 03:30:50.597 5966-07-09 03:30:50.597 2049-12-18 
00:00:00     original
+101    1       1970-01-01 00:00:01     1969-12-31 23:57:52     NULL    
1901-12-13 20:45:52     NULL    NULL    NULL    NULL    6229-06-28 
02:54:28.970117179   6229-06-28 02:54:28.97011       6229-06-28 02:54:28.97011  
     1950-12-18 00:00:00     original

Review comment:
       This shifting for timestamp values does not seem right (or at least I 
cannot make sense of it). Could you explain what is going on here? Some of the 
shifting is significant: For those, I remember there were some backwards 
incompatible changes in schema evolution in 1.6.x, it may be related to that? 
However, other shifting seems a bit more suspicious, e.g., 1 second, ~2 minutes?

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final 
int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory()  == 
TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       As @mustafaiman  mentioned, I think this should be always false indeed: 
TIMESTAMP_INSTANT is equivalent to TIMESTAMP_WITH_LOCAL_TIME_ZONE type in Hive. 
AFAIK support to read/write timestamp with local time zone in ORC is not 
implemented yet.

##########
File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java
##########
@@ -325,7 +326,7 @@ public void testReadFormat_0_11() throws Exception {
         + "binary,string1:string,middle:struct<list:array<struct<int1:int,"
         + "string1:string>>>,list:array<struct<int1:int,string1:string>>,"
         + "map:map<string,struct<int1:int,string1:string>>,ts:timestamp,"
-        + "decimal1:decimal(38,18)>", readerInspector.getTypeName());
+        + "decimal1:decimal(38,10)>", readerInspector.getTypeName());

Review comment:
       Change in decimal scale. Expected?

##########
File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out
##########
@@ -249,15 +249,15 @@ Stripes:
       Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 182 loadFactor: 
0.029 expectedFpp: 7.090246E-7
       Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 1772 
loadFactor: 0.2825 expectedFpp: 0.0063713384
     Row group indices for column 9:
-      Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.703 max: 
2013-03-01 09:11:58.703 positions: 0,0,0,0,0,0
-      Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703 max: 
2013-03-01 09:11:58.703 positions: 0,7,488,0,1538,488
+      Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.70307 max: 
2013-03-01 09:11:58.703325 positions: 0,0,0,0,0,0
+      Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703076 max: 
2013-03-01 09:11:58.703325 positions: 0,7,488,0,1538,488
     Bloom filters for column 9:
       Entry 0: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 
0.0006 expectedFpp: 1.6543056E-13
       Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 
0.0006 expectedFpp: 1.6543056E-13
       Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 4 
loadFactor: 0.0006 expectedFpp: 1.6543056E-13
     Row group indices for column 10:
-      Entry 0: count: 1000 hasNull: false min: 8 max: 9994 sum: 5118211 
positions: 0,0,0,0,0
-      Entry 1: count: 49 hasNull: false min: 248 max: 9490 sum: 246405 
positions: 0,2194,0,4,488
+      Entry 0: count: 1000 hasNull: false min: 0.08 max: 99.94 sum: 51182.11 
positions: 0,0,0,0,0

Review comment:
       min/max changed. Expected?

##########
File path: 
ql/src/test/results/clientpositive/llap/dynamic_semijoin_reduction_multicol.q.out
##########
@@ -355,7 +355,7 @@ Stage-1 HIVE COUNTERS:
    RECORDS_OUT_OPERATOR_TS_3: 800
    TOTAL_TABLE_ROWS_WRITTEN: 0
 Stage-1 LLAP IO COUNTERS:
-   CACHE_HIT_BYTES: 138344
+   CACHE_MISS_BYTES: 138342

Review comment:
       Expected? Was it a typo?

##########
File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out
##########
@@ -111,7 +111,7 @@ Stripe Statistics:
     Column 6: count: 1049 hasNull: false bytesOnDisk: 3323 min: 0.02 max: 
49.85 sum: 26286.349999999977
     Column 7: count: 1049 hasNull: false bytesOnDisk: 137 true: 526
     Column 8: count: 1049 hasNull: false bytesOnDisk: 3430 min:  max: zach 
zipper sum: 13443
-    Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 
09:11:58.703 max: 2013-03-01 09:11:58.703
+    Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 
09:11:58.70307 max: 2013-03-01 09:11:58.703325

Review comment:
       Timestamp precision change. Expected?

##########
File path: 
ql/src/test/results/clientpositive/llap/acid_bloom_filter_orc_file_dump.q.out
##########
@@ -76,7 +76,7 @@ PREHOOK: Input: default@bloomtest
 #### A masked pattern was here ####
 -- BEGIN ORC FILE DUMP --
 #### A masked pattern was here ####
-File Version: 0.12 with ORC_517
+File Version: 0.12 with ORC_14

Review comment:
       Is this change expected?

##########
File path: ql/src/test/results/clientpositive/tez/orc_merge12.q.out
##########
@@ -162,8 +162,8 @@ Stripe Statistics:
     Column 6: count: 9174 hasNull: true min: -16379.0 max: 9763215.5639 sum: 
5.62236530305E7
     Column 7: count: 12288 hasNull: false min: 
00020767-dd8f-4f4d-bd68-4b7be64b8e44 max: fffa3516-e219-4027-b0d3-72bb2e676c52 
sum: 442368
     Column 8: count: 12288 hasNull: false min: 
000976f7-7075-4f3f-a564-5a375fafcc101416a2b7-7f64-41b7-851f-97d15405037e max: 
fffd0642-5f01-48cd-8d97-3428faee49e9b39f2b4c-efdc-4e5f-9ab5-4aa5394cb156 sum: 
884736
-    Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 
1969-12-31 16:00:30.808
-    Column 10: count: 9174 hasNull: true min: 1969-12-31 15:59:30.929 max: 
1969-12-31 16:00:30.808
+    Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 
1969-12-31 16:00:30.808999999

Review comment:
       Change in timestamp precision in stats. Expected?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 542604)
    Time Spent: 4h  (was: 3h 50m)

> Upgrade ORC version to 1.6.7
> ----------------------------
>
>                 Key: HIVE-23553
>                 URL: https://issues.apache.org/jira/browse/HIVE-23553
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4h
>  Remaining Estimate: 0h
>
>  Apache Hive is currently on 1.5.X version and in order to take advantage of 
> the latest ORC improvements such as column encryption we have to bump to 
> 1.6.X.
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288&styleName=&projectId=12318320&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin
> Even though ORC reader could work out of the box, HIVE LLAP is heavily 
> depending on internal ORC APIs e.g., to retrieve and store File Footers, 
> Tails, streams – un/compress RG data etc. As there ware many internal changes 
> from 1.5 to 1.6 (Input stream offsets, relative BufferChunks etc.) the 
> upgrade is not straightforward.
> This Umbrella Jira tracks this upgrade effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7

Reply via email to