2010YOUY01 commented on code in PR #20008:
URL: https://github.com/apache/datafusion/pull/20008#discussion_r2730273058
##########
datafusion/expr/src/udf.rs:
##########
@@ -709,20 +709,49 @@ pub trait ScalarUDFImpl: Debug + DynEq + DynHash + Send + Sync {
Ok(ExprSimplifyResult::Original(args))
}
- /// Returns the [preimage] for this function and the specified scalar value, if any.
+ /// Returns a single contiguous preimage for this function and the specified
+ /// scalar expression, if any.
///
- /// A preimage is a single contiguous [`Interval`] of values where the function
- /// will always return `lit_value`
+ /// # Return Value
///
- /// Implementations should return intervals with an inclusive lower bound and
- /// exclusive upper bound.
+ /// Implementations should return a half-open interval: inclusive lower
+ /// bound and exclusive upper bound. Note that this is slightly different
+ /// from normal [`Interval`] semantics where the upper bound is closed. The
+ /// upper endpoint should be adjusted to the next value.
///
- /// This rewrite is described in the [ClickHouse Paper] and is particularly
- /// useful for simplifying expressions `date_part` or equivalent functions. The
- /// idea is that if you have an expression like `date_part(YEAR, k) = 2024` and you
- /// can find a [preimage] for `date_part(YEAR, k)`, which is the range of dates
- /// covering the entire year of 2024. Thus, you can rewrite the expression to `k
- /// >= '2024-01-01' AND k < '2025-01-01' which is often more optimizable.
+ /// # Background
+ ///
+ /// A [preimage] here is a single contiguous [`Interval`] of the function's
+ /// argument(s) where the function will return a single literal (constant)
+ /// value. This can also be thought of as a form of interval containment.
+ ///
+ /// Using a preimage to rewrite predicates is described in the [ClickHouse
+ /// Paper]:
+ ///
+ /// > some functions can compute the preimage of a given function result.
+ /// > This is used to replace comparisons of constants with function calls
+ /// > on the key columns by comparing the key column value with the preimage.
+ /// > For example, `toYear(k) = 2024` can be replaced by
+ /// > `k >= 2024-01-01 && k < 2025-01-01`
+ ///
+ /// As mentioned above, the preimage can be used to simplify certain types of
+ /// expressions such as `date_part` into a form that is more optimizable.
+ ///
+ /// For example, given an expression like
+ /// ```sql
+ /// date_part(YEAR, k) = 2024
Review Comment:
```suggestion
/// date_part('YEAR', k) = 2024
```
Nit: SQL syntax issue. The date part should be a quoted string literal.
##########
datafusion/expr/src/udf.rs:
##########
@@ -709,20 +709,49 @@ pub trait ScalarUDFImpl: Debug + DynEq + DynHash + Send + Sync {
Ok(ExprSimplifyResult::Original(args))
}
- /// Returns the [preimage] for this function and the specified scalar value, if any.
+ /// Returns a single contiguous preimage for this function and the specified
+ /// scalar expression, if any.
///
- /// A preimage is a single contiguous [`Interval`] of values where the function
- /// will always return `lit_value`
+ /// # Return Value
///
- /// Implementations should return intervals with an inclusive lower bound and
- /// exclusive upper bound.
+ /// Implementations should return a half-open interval: inclusive lower
+ /// bound and exclusive upper bound. Note that this is slightly different
+ /// from normal [`Interval`] semantics where the upper bound is closed. The
+ /// upper endpoint should be adjusted to the next value.
///
- /// This rewrite is described in the [ClickHouse Paper] and is particularly
- /// useful for simplifying expressions `date_part` or equivalent functions. The
- /// idea is that if you have an expression like `date_part(YEAR, k) = 2024` and you
- /// can find a [preimage] for `date_part(YEAR, k)`, which is the range of dates
- /// covering the entire year of 2024. Thus, you can rewrite the expression to `k
- /// >= '2024-01-01' AND k < '2025-01-01' which is often more optimizable.
+ /// # Background
+ ///
+ /// A [preimage] here is a single contiguous [`Interval`] of the function's
+ /// argument(s) where the function will return a single literal (constant)
+ /// value. This can also be thought of as a form of interval containment.
+ ///
+ /// Using a preimage to rewrite predicates is described in the [ClickHouse
+ /// Paper]:
+ ///
+ /// > some functions can compute the preimage of a given function result.
+ /// > This is used to replace comparisons of constants with function calls
+ /// > on the key columns by comparing the key column value with the preimage.
+ /// > For example, `toYear(k) = 2024` can be replaced by
+ /// > `k >= 2024-01-01 && k < 2025-01-01`
+ ///
+ /// As mentioned above, the preimage can be used to simplify certain types of
+ /// expressions such as `date_part` into a form that is more optimizable.
+ ///
+ /// For example, given an expression like
+ /// ```sql
+ /// date_part(YEAR, k) = 2024
+ /// ```
+ ///
+ /// There is a single preimage [`2024-01-01`, `2025-01-01`), which is the
+ /// range of dates covering the entire year of 2024 for which
+ /// `date_part(YEAR, k)` evaluates to `2024`. Using this preimage the
+ /// expression can be rewritten to
+ ///
+ /// ```sql
+ /// k >= '2024-01-01' AND k < '2025-01-01'
+ /// ```
+ ///
+ /// which is often more optimizable, such as being used in min/max pruning.
///
Review Comment:
```suggestion
/// which is often more optimizable: the predicate is rewritten into a simpler
/// and more canonical form, making it easier for different optimizer passes
/// to recognize and apply further transformations. For example:
///
/// Case 1:
///
/// Original:
/// ```sql
/// date_part('YEAR', k) = 2024 AND k >= '2024-06-01'
/// ```
///
/// After preimage rewrite:
/// ```sql
/// k >= '2024-01-01' AND k < '2025-01-01' AND k >= '2024-06-01'
/// ```
///
/// Since this form is much simpler, the optimizer can combine and simplify
/// sub-expressions further into:
/// ```sql
/// k >= '2024-06-01' AND k < '2025-01-01'
/// ```
///
/// Case 2:
///
/// For min/max pruning, simpler predicates such as:
/// ```sql
/// k >= '2024-01-01' AND k < '2025-01-01'
/// ```
/// are much easier for the pruner to reason about. See [PruningPredicate]
/// for background on predicate pruning.
///
/// The trade-off is that evaluating the preimage form can be slightly more
/// expensive than evaluating the original expression. In practice, this cost
/// is usually outweighed by the more aggressive optimization opportunities it
/// enables.
///
/// [PruningPredicate]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
```
I think we should explain what 'more optimizable' means.
I haven't verified whether `Case 1` would actually be triggered, but it
should be technically doable.
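
To make the `Case 1` reasoning concrete, here is a minimal standalone sketch of why the rewritten form is easy to combine. It is not DataFusion code: `Date`, `HalfOpen`, `year_preimage`, `intersect`, and `case1` are hypothetical names invented for this illustration, dates are plain `(year, month, day)` tuples, and the "no upper bound" side of `k >= '2024-06-01'` is modelled with a sentinel. Once both conjuncts are half-open ranges on `k`, combining them is just a max of the lower bounds and a min of the upper bounds.

```rust
/// Dates encoded as (year, month, day); tuples compare lexicographically,
/// which matches chronological order. Hypothetical encoding for this sketch.
type Date = (i32, u32, u32);

/// A half-open range `[lower, upper)`. This is NOT DataFusion's `Interval`;
/// it only mirrors the inclusive-lower / exclusive-upper convention.
#[derive(Debug, Clone, Copy, PartialEq)]
struct HalfOpen {
    lower: Date, // inclusive
    upper: Date, // exclusive
}

/// Sentinel standing in for "no upper bound" (an assumption of this sketch).
const UNBOUNDED: Date = (i32::MAX, 1, 1);

/// Preimage of `date_part('YEAR', k) = year`: all dates in that calendar year.
fn year_preimage(year: i32) -> HalfOpen {
    HalfOpen { lower: (year, 1, 1), upper: (year + 1, 1, 1) }
}

/// Intersect two half-open ranges; `None` means the conjunction is empty.
fn intersect(a: HalfOpen, b: HalfOpen) -> Option<HalfOpen> {
    let lower = a.lower.max(b.lower);
    let upper = a.upper.min(b.upper);
    if lower < upper {
        Some(HalfOpen { lower, upper })
    } else {
        None
    }
}

/// `date_part('YEAR', k) = 2024 AND k >= '2024-06-01'`, expressed as ranges.
fn case1() -> Option<HalfOpen> {
    let from_year = year_preimage(2024);
    let from_bound = HalfOpen { lower: (2024, 6, 1), upper: UNBOUNDED };
    intersect(from_year, from_bound)
}
```

Under these assumptions, `case1()` yields the range from `2024-06-01` (inclusive) to `2025-01-01` (exclusive), which corresponds to `k >= '2024-06-01' AND k < '2025-01-01'`. Whether an existing optimizer pass actually performs this merge still needs to be checked.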
##########
datafusion/expr/src/udf.rs:
##########
@@ -709,20 +709,49 @@ pub trait ScalarUDFImpl: Debug + DynEq + DynHash + Send + Sync {
Ok(ExprSimplifyResult::Original(args))
}
- /// Returns the [preimage] for this function and the specified scalar value, if any.
+ /// Returns a single contiguous preimage for this function and the specified
+ /// scalar expression, if any.
///
- /// A preimage is a single contiguous [`Interval`] of values where the function
- /// will always return `lit_value`
+ /// # Return Value
///
- /// Implementations should return intervals with an inclusive lower bound and
- /// exclusive upper bound.
+ /// Implementations should return a half-open interval: inclusive lower
+ /// bound and exclusive upper bound. Note that this is slightly different
+ /// from normal [`Interval`] semantics where the upper bound is closed. The
+ /// upper endpoint should be adjusted to the next value.
///
- /// This rewrite is described in the [ClickHouse Paper] and is particularly
- /// useful for simplifying expressions `date_part` or equivalent functions. The
- /// idea is that if you have an expression like `date_part(YEAR, k) = 2024` and you
- /// can find a [preimage] for `date_part(YEAR, k)`, which is the range of dates
- /// covering the entire year of 2024. Thus, you can rewrite the expression to `k
- /// >= '2024-01-01' AND k < '2025-01-01' which is often more optimizable.
+ /// # Background
+ ///
+ /// A [preimage] here is a single contiguous [`Interval`] of the function's
+ /// argument(s) where the function will return a single literal (constant)
+ /// value. This can also be thought of as a form of interval containment.
+ ///
+ /// Using a preimage to rewrite predicates is described in the [ClickHouse
+ /// Paper]:
+ ///
+ /// > some functions can compute the preimage of a given function result.
+ /// > This is used to replace comparisons of constants with function calls
+ /// > on the key columns by comparing the key column value with the preimage.
+ /// > For example, `toYear(k) = 2024` can be replaced by
+ /// > `k >= 2024-01-01 && k < 2025-01-01`
+ ///
+ /// As mentioned above, the preimage can be used to simplify certain types of
+ /// expressions such as `date_part` into a form that is more optimizable.
+ ///
+ /// For example, given an expression like
+ /// ```sql
+ /// date_part(YEAR, k) = 2024
+ /// ```
+ ///
+ /// There is a single preimage [`2024-01-01`, `2025-01-01`), which is the
+ /// range of dates covering the entire year of 2024 for which
+ /// `date_part(YEAR, k)` evaluates to `2024`. Using this preimage the
Review Comment:
```suggestion
/// `date_part('YEAR', k)` evaluates to `2024`. Using this preimage the
```