kosiew commented on code in PR #22274:
URL: https://github.com/apache/datafusion/pull/22274#discussion_r3292904935
##########
datafusion/functions-nested/src/repeat.rs:
##########
@@ -175,9 +175,10 @@ fn general_repeat<O: OffsetSizeTrait>(
array: &ArrayRef,
count_array: &Int64Array,
) -> Result<ArrayRef> {
- let total_repeated_values: usize = (0..count_array.len())
- .map(|i| get_count_with_validity(count_array, i))
- .sum();
+ let total_repeated_values = checked_sum_counts(
+ (0..count_array.len()).map(|i| get_count_with_validity(count_array,
i)),
+ "array_repeat: total repeated values overflowed usize",
+ )?;
let mut take_indices = Vec::with_capacity(total_repeated_values);
Review Comment:
The scalar/non-list path still allocates `take_indices` before confirming
that the result fits in the returned `List` offset type.
For example, on the default `List<i32>` path, a count like `2147483648` does
not overflow `usize`, so `checked_sum_counts` succeeds. After that,
`Vec::with_capacity(total_repeated_values)` can try to reserve around 16 GiB
before the later `O::from_usize(running_offset)` error is reached.
Could we validate `O::from_usize(total_repeated_values)` first, and return
the existing offset error before doing any capacity reservation or extension?
##########
datafusion/functions-nested/src/repeat.rs:
##########
Review Comment:
The list-input path still has a couple of unchecked capacity and offset
construction spots.
`outer_total + 1` can overflow when the checked sum is exactly `usize::MAX`,
and the outer offsets are later rebuilt with
`OffsetBuffer::<O>::from_lengths(...)`, which can panic on `usize` or offset
overflow in Arrow.
Since this function is aiming to make capacity and offset calculations
overflow-safe, could we avoid the unchecked `+ 1` and avoid `from_lengths` for
untrusted counts? It looks like we can reuse the already checked totals and
build or validate the outer offsets with `checked_add` plus `O::from_usize`
before allocating.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]