Rei Mai created HIVE-21069: ------------------------------ Summary: Timestamp statistics in orc is wrong if read with useUTCTimestamp=true Key: HIVE-21069 URL: https://issues.apache.org/jira/browse/HIVE-21069 Project: Hive Issue Type: Bug Affects Versions: 3.1.0 Environment: timezone for both client and server "Europe/Moscow" (UTC+3) hive version 3.1.0.3.0.1.0-187 Reporter: Rei Mai Attachments: 000000_0
We're using external orc tables and a timezone "Europe/Moscow" (UTC+3) for both client and server. After switching to hive 3.1.0 which uses orc 1.5.1 we've got an issue with predicate push down filtering out matching stripes by timestamp. E.g. consider a table (it's orc data is in the attachment): {quote}{{create external table test_ts (ts timestamp) stored as orc;}} {{insert into test_ts values ("2018-12-24 18:30:00");}} {{// No rows selected}} {{select * from test_ts where ts < "2018-12-24 19:00:00";}} // the lowest filter to return the value {{select * from test_ts where ts <= "2018-12-24 21:30:00";}} {quote} The issue only affects external orc tables statistics. Turning ppd off with _set hive.optimize.index.filter=false;_ helps. We believe it was the https://jira.apache.org/jira/browse/ORC-341, which introduced it. org.apache.orc.impl.SerializationUtils utc convertion is rather strange: {quote}public static long convertToUtc(TimeZone local, long time){ {color:#cc7832} int {color}offset = local.getOffset(time - local.getRawOffset()){color:#cc7832};{color}{color:#cc7832} return {color}time - offset{color:#cc7832};{color} } {quote} This adds a 3 hour offset to our timestamp in UTC+3 timezone (shouldn't it substract 3 hours, btw?). If org.apache.orc.impl.TimestampStatisticsImpl is used with useUTCTimestamp=false, the timestamp is converted back in a compatible way via SerializationUtils.convertFromUtc. But hive seems to override default org.apache.orc.OrcFile.ReaderOptions with org.apache.hadoop.hive.ql.io.orc.ReaderOptions which have useUTCTimestamp(true) in it's constructor. With useUTCTimestamp=true evaluatePredicateProto predictate is using TimestampStatisticsImpl.getMaximumUTC(), which returns the timestamp as is, i.e. in the example it's "2018-12-24 21:30:00 UTC+3". At the same time the search predicate is not shifted (the value in this tez log is in UTC+3): {quote}2018-12-24 22:12:16,205 [INFO] [InputInitializer \{Map 1} #0|#0] |orc.OrcInputFormat|: ORC pushdown predicate: leaf-0 = (LESS_THAN ts 2018-12-24 19:00:00.0), expr = leaf-0 {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)