kosiew commented on code in PR #20787:
URL: https://github.com/apache/datafusion/pull/20787#discussion_r2965086191
##########
datafusion/functions/src/strings.rs:
##########
@@ -75,6 +75,11 @@ impl StringArrayBuilder {
.extend_from_slice(array.value(i).as_bytes());
}
}
+ ColumnarValueRef::NullableBinaryArray(array) => {
Review Comment:
I think there is a safety issue here. By accepting `BinaryArray` directly in
the string builders, `concat()` can now pass unchecked bytes into
`StringArrayBuilder::finish`, which still relies on `build_unchecked()` and
assumes all data is valid UTF-8.
The scalar path and the `||` operator both validate binary inputs first, but
this new array path skips that step. That means malformed binary values could
cause a panic or produce an invalid string array instead of a proper execution
error.
Would it make sense to validate or cast binary arrays before they reach
these builders? Also, it would be great to add a regression test that covers
invalid UTF-8 input so we lock in the expected behavior.
##########
datafusion/functions/src/string/concat.rs:
##########
@@ -120,26 +122,35 @@ impl ScalarUDFImpl for ConcatFunc {
}
});
- let array_len = args.iter().find_map(|x| match x {
- ColumnarValue::Array(array) => Some(array.len()),
- _ => None,
- });
+ let array_len = args
+ .iter()
+ .filter_map(|x| match x {
+ ColumnarValue::Array(array) => Some(array.len()),
+ _ => None,
+ })
+ .next();
// Scalar
if array_len.is_none() {
- let mut values = Vec::with_capacity(args.len());
+ let mut values: Vec<&str> = Vec::with_capacity(args.len());
for arg in &args {
let ColumnarValue::Scalar(scalar) = arg else {
return internal_err!("concat expected scalar value, got
{arg:?}");
};
-
- match scalar.try_as_str() {
- Some(Some(v)) => values.push(v),
- Some(None) => {} // null literal
- None => plan_err!(
- "Concat function does not support scalar type {}",
- scalar
- )?,
+ if let ScalarValue::Binary(Some(value)) = scalar {
+ let s: &str = std::str::from_utf8(value).map_err(|_| {
+ exec_datafusion_err!("invalid UTF-8 in binary literal:
{value:?}")
+ })?;
+ values.push(s);
+ } else {
+ match scalar.try_as_str() {
+ Some(Some(v)) => values.push(v),
+ Some(None) => {} // null literal
+ None => plan_err!(
+ "Concat function does not support scalar type {}",
+ scalar
+ )?,
+ }
}
}
let result = values.concat();
Review Comment:
Small suggestion. The return type precedence (`Utf8View` > `LargeUtf8` >
`Utf8`) is computed both in `return_type` and again in `invoke_with_args`.
It might be worth pulling this into a small helper so we do not accidentally
update one path and forget the other if new types get added later.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]