github-actions[bot] commented on code in PR #61396:
URL: https://github.com/apache/doris/pull/61396#discussion_r2944683735


##########
be/src/core/data_type_serde/data_type_datetimev2_serde.cpp:
##########
@@ -355,12 +355,24 @@ Status DataTypeDateTimeV2SerDe::write_column_to_arrow(const IColumn& column,
             DateV2Value<DateTimeV2ValueType> datetime_val = col_data[i];
             datetime_val.unix_timestamp(&timestamp, real_ctz);
 
-            if (_scale > 3) {
+            switch (timestamp_type->unit()) {

Review Comment:
   **[Parallel code path not updated]** `DataTypeTimeStampTzSerDe::write_column_to_arrow` in `data_type_timestamptz_serde.cpp:192-223` still uses the old `_scale`-based branching:
   ```cpp
   if (_scale > 3) {
       timestamp = (timestamp * 1000000) + microsecond;
   } else if (_scale > 0) {
       timestamp = (timestamp * 1000) + millisecond;
   }
   ```
   Since `TYPE_TIMESTAMPTZ` shares the same `convert_to_arrow_type` code path (lines 100-101 of `arrow_row_batch.cpp`), when `enable_int96_timestamps=true` the Arrow type will be `TimestampType(NANO)`, but this serde does not handle the NANO case: it falls into the `default` branch and leaves the timestamp in seconds. It should be updated with the same switch-on-unit pattern.
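   The switch-on-unit scaling can be modeled standalone; this is a minimal sketch, using illustrative names (`TimeUnit`, `scale_epoch`) rather than the actual Doris/Arrow types, to show the pattern the serde should dispatch on:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for arrow::TimeUnit; not the real Arrow enum.
enum class TimeUnit { SECOND, MILLI, MICRO, NANO };

// Scale an epoch-seconds value plus its sub-second microsecond component
// to the target Arrow unit, mirroring the switch-on-unit pattern above.
int64_t scale_epoch(TimeUnit unit, int64_t seconds, int64_t microsecond) {
    switch (unit) {
    case TimeUnit::NANO:
        return seconds * 1000000000LL + microsecond * 1000LL;
    case TimeUnit::MICRO:
        return seconds * 1000000LL + microsecond;
    case TimeUnit::MILLI:
        return seconds * 1000LL + microsecond / 1000;
    default:
        return seconds; // SECOND: no scaling needed
    }
}
```

   Dispatching on the Arrow type's unit rather than `_scale` keeps the serde correct even when the Arrow type is forced to NANO by `enable_int96_timestamps`.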



##########
regression-test/suites/external_table_p0/export/hive_read/parquet/test_hive_read_parquet_datetime.groovy:
##########
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+import org.codehaus.groovy.runtime.IOGroovyMethods
+
+import java.nio.charset.StandardCharsets
+import java.nio.file.Files
+import java.nio.file.Paths
+
+suite("test_hive_read_parquet_datetime", "p0,external") {
+
+    String enabled = context.config.otherConfigs.get("enableHiveTest")
+    if (enabled == null || !enabled.equalsIgnoreCase("true")) {
+        logger.info("disable Hive test.")
+        return;
+    }
+
+    for (String hivePrefix : ["hive3"]) {
+        setHivePrefix(hivePrefix)
+
+        // open nereids
+        sql """ set enable_nereids_planner=true """
+        sql """ set enable_fallback_to_original_planner=false """
+
+        String hdfs_port = context.config.otherConfigs.get(hivePrefix + "HdfsPort")
+        String externalEnvIp = context.config.otherConfigs.get("externalEnvIp")
+
+        // It's okay to use random `hdfsUser`, but can not be empty.
+        def hdfsUserName = "doris"
+        def format = "parquet"
+        def defaultFS = "hdfs://${externalEnvIp}:${hdfs_port}"
+        def defaultFS_with_postfix = "hdfs://${externalEnvIp}:${hdfs_port}/"
+        def outfile_path = "/user/doris/tmp_data"
+        def uri = "${defaultFS}" + "${outfile_path}/exp_"
+        def export_table_name = "outfile_hive_read_parquet_datetime_type_test"
+        def hive_database = "test_hive_read_parquet_datetime_type"
+        def hive_table = "outfile_hive_read_parquet_datetime_type_test"
+
+        def create_table = {table_name ->
+            sql """ DROP TABLE IF EXISTS ${table_name} """
+            sql """
+            CREATE TABLE IF NOT EXISTS ${table_name} (
+                `id` INT NOT NULL,
+                `ts` DATETIME NOT NULL
+                )
+                DISTRIBUTED BY HASH(id) PROPERTIES("replication_num" = "1");
+            """
+        }
+
+        def create_hive_table = {table_name ->
+            def drop_table_str = """ drop table if exists ${hive_database}.${table_name} """
+            def drop_database_str = """ drop database if exists ${hive_database}"""
+            def create_database_str = """ create database ${hive_database}"""
+            def create_table_str = """ CREATE EXTERNAL TABLE ${hive_database}.${table_name} (
+                                            id INT,
+                                            ts TIMESTAMP
+                                        )
+                                        stored as ${format}
+                                        LOCATION "${outfile_path}"
+                                    """
+
+            logger.info("hive sql: " + drop_table_str)
+            hive_docker """ ${drop_table_str} """
+
+            logger.info("hive sql: " + drop_database_str)
+            hive_docker """ ${drop_database_str} """
+
+            logger.info("hive sql: " + create_database_str)
+            hive_docker """ ${create_database_str}"""
+
+            logger.info("hive sql: " + create_table_str)
+            hive_docker """ ${create_table_str} """
+        }
+
+        def outfile_to_HDFS = { enable_int96 = false ->

Review Comment:
   **[Dead code / misleading]** The `enable_int96` parameter is accepted but never used: line 102 has the `enable_int96_timestamps` property commented out. The comment at line 122 says "outfile to hdfs with enable_int96_timestamps=true", but line 123 passes `false`. The test only works because the FE default for `enableInt96Timestamps` is `true` (see `ParquetFileFormatProperties.java:64`).
   
   Please either:
   1. Uncomment the property on line 102 and pass `true` on line 123 to make the intent explicit, or
   2. Remove the unused `enable_int96` parameter entirely.
   
   Relying on the FE default makes the test fragile: if the default changes, the test silently exercises a different code path.
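   Option 1 would look roughly like the following (a sketch only; the property name matches the line commented out in the test, and the surrounding OUTFILE clause is taken from the existing closure):

```sql
SELECT * FROM ${export_table_name} t ORDER BY id
INTO OUTFILE "${uri}"
FORMAT AS ${format}
PROPERTIES (
    "hadoop.username" = "${hdfsUserName}",
    "enable_int96_timestamps" = "${enable_int96}"
);
```

   With the property wired through, `outfile_to_HDFS(true)` and `outfile_to_HDFS(false)` would test the INT96 and non-INT96 paths explicitly.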



##########
regression-test/suites/external_table_p0/export/hive_read/parquet/test_hive_read_parquet_datetime.groovy:
##########
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+import org.codehaus.groovy.runtime.IOGroovyMethods
+
+import java.nio.charset.StandardCharsets
+import java.nio.file.Files
+import java.nio.file.Paths
+
+suite("test_hive_read_parquet_datetime", "p0,external") {
+
+    String enabled = context.config.otherConfigs.get("enableHiveTest")
+    if (enabled == null || !enabled.equalsIgnoreCase("true")) {
+        logger.info("disable Hive test.")
+        return;
+    }
+
+    for (String hivePrefix : ["hive3"]) {
+        setHivePrefix(hivePrefix)
+
+        // open nereids
+        sql """ set enable_nereids_planner=true """
+        sql """ set enable_fallback_to_original_planner=false """
+
+        String hdfs_port = context.config.otherConfigs.get(hivePrefix + "HdfsPort")
+        String externalEnvIp = context.config.otherConfigs.get("externalEnvIp")
+
+        // It's okay to use random `hdfsUser`, but can not be empty.
+        def hdfsUserName = "doris"
+        def format = "parquet"
+        def defaultFS = "hdfs://${externalEnvIp}:${hdfs_port}"
+        def defaultFS_with_postfix = "hdfs://${externalEnvIp}:${hdfs_port}/"
+        def outfile_path = "/user/doris/tmp_data"
+        def uri = "${defaultFS}" + "${outfile_path}/exp_"
+        def export_table_name = "outfile_hive_read_parquet_datetime_type_test"
+        def hive_database = "test_hive_read_parquet_datetime_type"
+        def hive_table = "outfile_hive_read_parquet_datetime_type_test"
+
+        def create_table = {table_name ->
+            sql """ DROP TABLE IF EXISTS ${table_name} """
+            sql """
+            CREATE TABLE IF NOT EXISTS ${table_name} (
+                `id` INT NOT NULL,
+                `ts` DATETIME NOT NULL
+                )
+                DISTRIBUTED BY HASH(id) PROPERTIES("replication_num" = "1");
+            """
+        }
+
+        def create_hive_table = {table_name ->
+            def drop_table_str = """ drop table if exists ${hive_database}.${table_name} """
+            def drop_database_str = """ drop database if exists ${hive_database}"""
+            def create_database_str = """ create database ${hive_database}"""
+            def create_table_str = """ CREATE EXTERNAL TABLE ${hive_database}.${table_name} (
+                                            id INT,
+                                            ts TIMESTAMP
+                                        )
+                                        stored as ${format}
+                                        LOCATION "${outfile_path}"
+                                    """
+
+            logger.info("hive sql: " + drop_table_str)
+            hive_docker """ ${drop_table_str} """
+
+            logger.info("hive sql: " + drop_database_str)
+            hive_docker """ ${drop_database_str} """
+
+            logger.info("hive sql: " + create_database_str)
+            hive_docker """ ${create_database_str}"""
+
+            logger.info("hive sql: " + create_table_str)
+            hive_docker """ ${create_table_str} """
+        }
+
+        def outfile_to_HDFS = { enable_int96 = false ->
+            def uuid = UUID.randomUUID().toString()
+            outfile_path = "/user/doris/tmp_data/${uuid}"
+            uri = "${defaultFS}" + "${outfile_path}/exp_"
+
+            def res = sql """
+                SELECT * FROM ${export_table_name} t ORDER BY id
+                INTO OUTFILE "${uri}"
+                FORMAT AS ${format}
+                PROPERTIES (
+                    "hadoop.username" = "${hdfsUserName}"
+                );
+            """
+                    // "enable_int96_timestamps" = "${enable_int96}"
+            logger.info("outfile success path: " + res[0][3]);
+            return res[0][3]
+        }
+
+        try {
+            // create doris table
+            create_table(export_table_name)
+
+            // insert 3 rows
+            sql """ INSERT INTO ${export_table_name} VALUES
+                (1, '2024-01-01 10:00:00'),
+                (2, '2024-06-15 12:30:45'),
+                (3, '2024-12-31 23:59:59');
+            """
+
+            // verify doris data
+            qt_select_base_desc """ desc ${export_table_name}; """
+            qt_select_base """ SELECT * FROM ${export_table_name} t ORDER BY id; """
+
+            // outfile to hdfs with enable_int96_timestamps=true

Review Comment:
   **[Comment contradicts code]** This comment says "outfile to hdfs with enable_int96_timestamps=true", but line 123 calls `outfile_to_HDFS(false)`. Even though the parameter is dead code (the property is commented out on line 102), the comment is misleading.



##########
regression-test/suites/external_table_p0/export/hive_read/parquet/test_hive_read_parquet_datetime.groovy:
##########
@@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+import org.codehaus.groovy.runtime.IOGroovyMethods

Review Comment:
   **[Unused imports]** `IOGroovyMethods`, `StandardCharsets`, `Files`, and `Paths` are imported but none are used in this test. Please remove them.



##########
regression-test/data/external_table_p0/export/hive_read/parquet/test_hive_read_parquet_datetime.out:
##########
@@ -0,0 +1,28 @@
+-- This file is automatically generated. You should know what you did if you want to edit this
+-- !select_base_desc --
+id     int     No      true    \N      
+ts     datetime        No      true    \N      
+
+-- !select_base --
+1      2024-01-01T10:00
+2      2024-06-15T12:30:45
+3      2024-12-31T23:59:59
+
+-- !select_tvf_desc --
+id     int     Yes     false   \N      NONE
+ts     datetime(6)     Yes     false   \N      NONE
+
+-- !select_tvf --
+1      2024-01-01T10:00
+2      2024-06-15T12:30:45
+3      2024-12-31T23:59:59
+
+-- !select_parquet_tvf --
+id     INT32   \N      REQUIRED        0       \N      \N      \N      \N      \N
+ts     INT96   \N      REQUIRED        0       \N      \N      \N      \N      \N
+
+-- !hive_docker --
+1      2024-01-01 02:00:00.0

Review Comment:
   **[Hive results show timezone shift - potential data correctness issue]** The expected Hive results show an 8-hour offset: `10:00:00` → `02:00:00`, `12:30:45` → `04:30:45`, `23:59:59` → `15:59:59`. This looks like the session timezone (likely `Asia/Shanghai`, UTC+8) is being applied on the write path.
   
   INT96 in Parquet is a wall-clock (local) timestamp; Hive treats it as-is, with no timezone conversion. If the written INT96 nanoseconds embed a UTC-converted epoch, Hive will show the wrong time. The -8h shift suggests the `unix_timestamp(&timestamp, real_ctz)` call in `write_column_to_arrow` converts the wall-clock time to a UTC epoch, and that UTC epoch is then stored as INT96 nanoseconds.
   
   Is this the expected/correct behavior? If so, please add a comment in the test explaining why Hive shows shifted values. If not, the write path may need to use `utc_time_zone()` instead of `real_ctz` for the INT96/NANO case (which was attempted in commit 1 but reverted in commit 2).
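   The arithmetic behind the observed shift can be sketched in isolation (illustrative helper names; seconds-of-day stand in for the full epoch value for clarity):

```cpp
#include <cassert>
#include <cstdint>

// Seconds-of-day for a wall-clock time, standing in for a full epoch value.
int64_t wall_clock_s(int h, int m, int s) { return h * 3600LL + m * 60LL + s; }

// Model of what unix_timestamp(&timestamp, real_ctz) effectively does:
// interpret the wall-clock value in the session zone and yield a UTC epoch.
int64_t to_utc_epoch(int64_t wall, int64_t tz_offset) { return wall - tz_offset; }

// Hive reads INT96 as a wall-clock timestamp: no conversion back on read.
int64_t hive_displays(int64_t int96_value) { return int96_value; }
```

   With a UTC+8 session, a value written as `10:00:00` comes back from Hive as `02:00:00`, matching the shifts in this `.out` file.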



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

