Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20681 )

Change subject: [WIP] IMPALA-12322: Support converting UTC timestamps read from 
Kudu to local time
......................................................................


Patch Set 5:

(5 comments)

Left a few comments mainly about DST change troubles, as the patch doesn't seem 
to be prepared for those. Generally reading the values from Kudu and converting 
them to local time should work, but handling predicates/runtime filters 
correctly can be very tricky.

http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exec/kudu/kudu-scanner.cc
File be/src/exec/kudu/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exec/kudu/kudu-scanner.cc@288
PS5, Line 288:         const Timezone* local_tz = 
state_->query_options().convert_kudu_utc_timestamps ?
             :             state_->local_time_zone() : UTCPTR;
             :
             :         KuduValue* min_value;
             :         RETURN_IF_ERROR(CreateKuduValue(col_type, min, 
&min_value, local_tz));
             :         KUDU_RETURN_IF_ERROR(scanner_->AddConjunctPredicate(
             :           scanner_->GetKuduTable()->NewComparisonPredicate(
             :               col_name, 
KuduPredicate::ComparisonOp::GREATER_EQUAL, min_value)),
             :             BuildErrorString("Failed to add min predicate"));
             :
             :         KuduValue* max_value;
             :         RETURN_IF_ERROR(CreateKuduValue(col_type, max, 
&max_value, local_tz));
This can filter out values incorrectly during dst changes. The issue is that 
two utc micro values 1 hour apart can map to the same local timestamp, e.g. 
e.g. both 01:10:00 (UTC) and 02:10:00 (UTC) may be to 00:10:00 (local). A 
filter like 00:05:00 < ts < 00:15:00 would be either converted to 02:05:00 < ts 
< 02:15:00 or 01:05:00 < ts < 01:15:00, and both would drop one of the values 
incorrectly.

Generally we should use the smaller possible conversion for min value and the 
larger one for max value if they are ambiguous. A similar logic was implemented 
for Parquet stat filtering at 
https://github.com/apache/impala/blob/3be9b82207ddf32ca48ea1b2afbc88d8dffad8a4/be/src/exec/parquet/parquet-common.cc#L266
 , but it works the other way, we have the UTC min/max stats from the Parquet 
file, and try to get their min/max local values.


http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exprs/timestamp-functions-ir.cc@139
PS5, Line 139: local_time_zone
This should use time_zone_for_unix_time_conversions() instead of 
local_time_zone to assume utc if use_local_tz_for_unix_timestamp_conversions is 
false.

If you need a function that behaves differently than other functions, then it 
could have a special name like ToUnixMicrosForKudu


http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/service/impala-server.cc@371
PS5, Line 371: DEFINE_bool(convert_kudu_utc_timestamps, false,
I don't think that we need a flag if we already have a query option, as the 
default of the query option can be set with flag default_query_options. 
convert_legacy_hive_parquet_utc_timestamps flag was kept for backwards 
compatibility, as it existed before the query option.


http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java@573
PS5, Line 573:           kuduPredicate = 
KuduPredicate.newComparisonPredicate(column, op, unixMicros);
This may not work correctly during DST change, see my other comments.


http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@485
PS5, Line 485:         String functionName = 
analyzer.getQueryOptions().isConvert_kudu_utc_timestamps() ?
             :             "to_unix_micros" : "utc_to_unix_micros";
             :         Expr toUnixTimeExpr =
             :             new FunctionCallExpr(functionName, 
Lists.newArrayList(srcExpr));
Bloom filters with timezone conversion are tricky - the problem is that two 
different utc micro timestamps can be mapped to the same local timestamp, e.g. 
in case of dst change. If we map the local timestamp to a single utc timestamp 
during bloom filter creation, then it will only match with one of the possible 
values that would be converted to this timestamp, so we will incorrectly filter 
the other value.



--
To view, visit http://gerrit.cloudera.org:8080/20681
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3
Gerrit-Change-Number: 20681
Gerrit-PatchSet: 5
Gerrit-Owner: Zihao Ye <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>
Gerrit-Comment-Date: Mon, 13 Nov 2023 18:33:23 +0000
Gerrit-HasComments: Yes

Reply via email to