Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20681 )

Change subject: [WIP] IMPALA-12322: Support converting UTC timestamps read from 
Kudu to local time
......................................................................


Patch Set 5:

(5 comments)

Thanks for your review! I indeed didn't consider the issue of daylight saving 
time, as the region I'm in doesn't observe it. I appreciate your reminder and 
will take it into consideration.

http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exec/kudu/kudu-scanner.cc
File be/src/exec/kudu/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exec/kudu/kudu-scanner.cc@288
PS5, Line 288:         const Timezone* local_tz = 
state_->query_options().convert_kudu_utc_timestamps ?
             :             state_->local_time_zone() : UTCPTR;
             :
             :         KuduValue* min_value;
             :         RETURN_IF_ERROR(CreateKuduValue(col_type, min, 
&min_value, local_tz));
             :         KUDU_RETURN_IF_ERROR(scanner_->AddConjunctPredicate(
             :           scanner_->GetKuduTable()->NewComparisonPredicate(
             :               col_name, 
KuduPredicate::ComparisonOp::GREATER_EQUAL, min_value)),
             :             BuildErrorString("Failed to add min predicate"));
             :
             :         KuduValue* max_value;
             :         RETURN_IF_ERROR(CreateKuduValue(col_type, max, 
&max_value, local_tz));
> This can filter out values incorrectly during dst changes. The
 > issue is that two utc micro values 1 hour apart can map to the same
 > local timestamp, e.g. e.g. both 01:10:00 (UTC) and 02:10:00 (UTC)
 > may be to 00:10:00 (local). A filter like 00:05:00 < ts < 00:15:00
 > would be either converted to 02:05:00 < ts < 02:15:00 or 01:05:00 <
 > ts < 01:15:00, and both would drop one of the values incorrectly.
 >
 > Generally we should use the smaller possible conversion for min
 > value and the larger one for max value if they are ambiguous. A
 > similar logic was implemented for Parquet stat filtering at
 > https://github.com/apache/impala/blob/3be9b82207ddf32ca48ea1b2afbc88d8dffad8a4/be/src/exec/parquet/parquet-common.cc#L266
 > , but it works the other way, we have the UTC min/max stats from
 > the Parquet file, and try to get their min/max local values.

I tried using a similar approach to solve this problem, and after a simple 
test, it seems worked fine when converting UTC ambiguities.


http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exprs/timestamp-functions-ir.cc@139
PS5, Line 139: local_time_zone
> This should use time_zone_for_unix_time_conversions() instead of
 > local_time_zone to assume utc if use_local_tz_for_unix_timestamp_conversions
 > is false.
 >
 > If you need a function that behaves differently than other
 > functions, then it could have a special name like ToUnixMicrosForKudu

I might consider removing this function later and implementing the conversion 
in a different way.


http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/service/impala-server.cc@371
PS5, Line 371: DEFINE_bool(convert_kudu_utc_timestamps, false,
> I don't think that we need a flag if we already have a query option, as the
Done


http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java@573
PS5, Line 573:           kuduPredicate = 
KuduPredicate.newComparisonPredicate(column, op, unixMicros);
> This may not work correctly during DST change, see my other
 > comments.

After testing, it is indeed problematic, and I'm working on finding a solution. 
For the "in list" predicate, perhaps we can add both ambiguous values for the 
conversion. As for binary predicates, we could handle them similar to the 
min_max runtime filter by using a broader range of conversion values. Do you 
have any suggestions regarding this?


http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@485
PS5, Line 485:         String functionName = 
analyzer.getQueryOptions().isConvert_kudu_utc_timestamps() ?
             :             "to_unix_micros" : "utc_to_unix_micros";
             :         Expr toUnixTimeExpr =
             :             new FunctionCallExpr(functionName, 
Lists.newArrayList(srcExpr));
> Bloom filters with timezone conversion are tricky - the problem is
 > that two different utc micro timestamps can be mapped to the same
 > local timestamp, e.g. in case of dst change. If we map the local
 > timestamp to a single utc timestamp during bloom filter creation,
 > then it will only match with one of the possible values that would
 > be converted to this timestamp, so we will incorrectly filter the
 > other value.

The timezone conversion for Bloom filters is indeed tricky. Enabling both 
ambiguous values in the filter could potentially solve the problem, but it 
would require significant modifications. Currently, I'm considering disabling 
the Bloom filter for timezone conversion in Kudu and adding an additional query 
option to re-enable it in such cases. This would allow regions that do not 
observe daylight saving time to function properly. Another option is to discard 
the Bloom filter when ambiguous values are detected, although the extent of its 
impact is uncertain. Do you have any suggestions regarding this?



--
To view, visit http://gerrit.cloudera.org:8080/20681
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3
Gerrit-Change-Number: 20681
Gerrit-PatchSet: 5
Gerrit-Owner: Zihao Ye <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>
Gerrit-Comment-Date: Wed, 15 Nov 2023 12:32:55 +0000
Gerrit-HasComments: Yes

Reply via email to