Abhinav Koul created SPARK-51734:
------------------------------------

             Summary: Wrong results when reading ORC Timestamp type with different Reader/Writer Timezones
                 Key: SPARK-51734
                 URL: https://issues.apache.org/jira/browse/SPARK-51734
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.1
            Reporter: Abhinav Koul
When reading ORC TimestampLTZ (timestamp with local time zone), Spark returns incorrect values if the reader and writer time zones differ.

How to replicate:

{code:scala}
import java.util.TimeZone

TimeZone.setDefault(TimeZone.getTimeZone("Europe/Berlin"))
sql("SET spark.sql.session.timeZone = Europe/Berlin")

sql("DROP TABLE IF EXISTS t")
sql("CREATE TABLE t (tz TIMESTAMP) USING hive OPTIONS(fileFormat 'orc')")
sql("INSERT INTO t VALUES (TIMESTAMP('1996-08-02 09:00:00.723100809'))")

TimeZone.setDefault(TimeZone.getTimeZone("Asia/Kolkata"))
sql("SET spark.sql.session.timeZone = Asia/Kolkata")

spark.table("t").collect()
{code}

Comparing the results of the query above for Parquet and ORC:

|| ||Parquet (ms)||ORC (ms)||Parquet (timestamp)||ORC (timestamp)||
|Spark to file-format writer|838969200723|838969200723|1996-08-02 09:00:00.723100809|1996-08-02 09:00:00.723100809|
|File-format reader to Spark|838969200723|838956600723|1996-08-02 12:30:00.723100809|1996-08-02 09:00:00.723100809|

Inside the ORC reader I found that ORC does read the correct millisecond value of 838969200723, but deliberately adds the WriterTZ - ReaderTZ offset to it: -12600000 ms, i.e. exactly -3h 30m, since Berlin is UTC+2 in August and Kolkata is UTC+5:30. In effect, ORC preserves the writer's wall-clock time instead of the instant.

Parquet's behaviour seems correct to my understanding: a timestamp with local time zone stores an instant, so the value should be adjusted to the reader's session time zone rather than displaying the same wall-clock time, as ORC currently does.

Please suggest what can be done further here.
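For what it's worth, the offset arithmetic can be checked outside Spark with plain java.time. The sketch below is a minimal standalone illustration (not Spark or ORC code; the variable names and printed values are mine, derived from the table above): it shows that the WriterTZ - ReaderTZ offset at this instant is exactly -12600000 ms, and that adding it to the written value 838969200723 yields the 838956600723 ms the ORC reader returns.

{code:scala}
import java.time.{Instant, ZoneId, ZonedDateTime}

// Hypothetical standalone check: verify the offset the ORC reader
// applies, using the millisecond values from the table above.
val writerZone = ZoneId.of("Europe/Berlin")
val readerZone = ZoneId.of("Asia/Kolkata")

// Millisecond value Spark handed to both writers (row 1 of the table).
val writtenMillis = 838969200723L
val instant = Instant.ofEpochMilli(writtenMillis)

// Zone offsets at that instant: Berlin is UTC+2 (summer time), Kolkata UTC+5:30.
val writerOffsetMs = writerZone.getRules.getOffset(instant).getTotalSeconds * 1000L
val readerOffsetMs = readerZone.getRules.getOffset(instant).getTotalSeconds * 1000L

// WriterTZ - ReaderTZ = 7200000 - 19800000 = -12600000 ms, i.e. -3h 30m.
println(writerOffsetMs - readerOffsetMs)                   // -12600000
println(writtenMillis + (writerOffsetMs - readerOffsetMs)) // 838956600723, what ORC returns

// Rendering the same instant in each zone shows the two behaviours:
// keeping the instant shifts the wall clock (Parquet), while keeping
// the wall clock shifts the instant (ORC's current behaviour).
println(ZonedDateTime.ofInstant(instant, writerZone)) // 1996-08-02T09:00:00.723+02:00[Europe/Berlin]
println(ZonedDateTime.ofInstant(instant, readerZone)) // 1996-08-02T12:30:00.723+05:30[Asia/Kolkata]
{code}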