kosiew commented on code in PR #20787:
URL: https://github.com/apache/datafusion/pull/20787#discussion_r2965086191


##########
datafusion/functions/src/strings.rs:
##########
@@ -75,6 +75,11 @@ impl StringArrayBuilder {
                         .extend_from_slice(array.value(i).as_bytes());
                 }
             }
+            ColumnarValueRef::NullableBinaryArray(array) => {

Review Comment:
   I think there is a safety issue here. By accepting `BinaryArray` directly in 
the string builders, `concat()` can now pass unchecked bytes into 
`StringArrayBuilder::finish`, which still relies on `build_unchecked()` and 
assumes all data is valid UTF-8.
   
   The scalar path and the `||` operator both validate binary inputs first, but 
this new array path skips that step. That means malformed binary values could 
cause a panic or produce an invalid string array instead of a proper execution 
error.
   
   Would it make sense to validate or cast binary arrays before they reach 
these builders? Also, it would be great to add a regression test that covers 
invalid UTF-8 input so we lock in the expected behavior.



##########
datafusion/functions/src/string/concat.rs:
##########
@@ -120,26 +122,35 @@ impl ScalarUDFImpl for ConcatFunc {
             }
         });
 
-        let array_len = args.iter().find_map(|x| match x {
-            ColumnarValue::Array(array) => Some(array.len()),
-            _ => None,
-        });
+        let array_len = args
+            .iter()
+            .filter_map(|x| match x {
+                ColumnarValue::Array(array) => Some(array.len()),
+                _ => None,
+            })
+            .next();
 
         // Scalar
         if array_len.is_none() {
-            let mut values = Vec::with_capacity(args.len());
+            let mut values: Vec<&str> = Vec::with_capacity(args.len());
             for arg in &args {
                 let ColumnarValue::Scalar(scalar) = arg else {
                     return internal_err!("concat expected scalar value, got 
{arg:?}");
                 };
-
-                match scalar.try_as_str() {
-                    Some(Some(v)) => values.push(v),
-                    Some(None) => {} // null literal
-                    None => plan_err!(
-                        "Concat function does not support scalar type {}",
-                        scalar
-                    )?,
+                if let ScalarValue::Binary(Some(value)) = scalar {
+                    let s: &str = std::str::from_utf8(value).map_err(|_| {
+                        exec_datafusion_err!("invalid UTF-8 in binary literal: 
{value:?}")
+                    })?;
+                    values.push(s);
+                } else {
+                    match scalar.try_as_str() {
+                        Some(Some(v)) => values.push(v),
+                        Some(None) => {} // null literal
+                        None => plan_err!(
+                            "Concat function does not support scalar type {}",
+                            scalar
+                        )?,
+                    }
                 }
             }
             let result = values.concat();

Review Comment:
   Small suggestion. The return type precedence (`Utf8View` > `LargeUtf8` > 
`Utf8`) is computed both in `return_type` and again in `invoke_with_args`.
   
   It might be worth pulling this into a small helper so we do not accidentally 
update one path and forget the other if new types get added later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to