alamb opened a new issue, #13796: URL: https://github.com/apache/datafusion/issues/13796
### Is your feature request related to a problem or challenge? Similarly to https://github.com/apache/datafusion/issues/13759, we found another potential code vulnerability from a security audit performed by InfluxData > The public method `with_capacity` of `StringArrayBuilder` accepts a parameter `item_capacity: usize`. It is used to create a `MutableBuffer` with a capacity calculated as `(item_capacity + 1)*size_of::<i32>()`, which can overflow undetected, leaving the buffer > too small. The subsequent unsafe call to push a value into the buffer can lead to an out-of-bounds memory access. > > In datafusion/functions/src/strings.rs, lines 125ff: ```rust impl StringArrayBuilder { pub fn with_capacity(item_capacity: usize, data_capacity: usize) -> Self { let mut offsets_buffer = MutableBuffer::with_capacity((item_capacity + 1) * size_of::<i32>()); // SAFETY: the first offset value is definitely not going to exceed the bounds. unsafe { offsets_buffer.push_unchecked(0_i32) }; Self { offsets_buffer, value_buffer: MutableBuffer::with_capacity(data_capacity), } } ``` I analyzed the potential risk, and I agree there is a risk of memory unsafety but I do not think it is exploitable via DataFusion APIs. Specifically, the only callsites are: https://github.com/apache/datafusion/blob/3ee9b3dfb6d9c4e95a93d694b6aaf5c21ab61354/datafusion/functions/src/string/concat_ws.rs#L228-L229 The argument is taken from https://github.com/apache/datafusion/blob/3ee9b3dfb6d9c4e95a93d694b6aaf5c21ab61354/datafusion/functions/src/string/concat_ws.rs#L149-L150 So to trigger this code you would have to provide an input record batch with `usize::MAX` (aka the maximum value of a 64bit unsigned integer) rows (the default is 8k) which would require an allocation of 4 * usize::MAX bytes in the source array. I don't think such an allocation is possible (it would be rejected by the OS) ### Reproducer showing segfault: Here is a test case I wrote: ```diff andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ git diff diff --git a/datafusion/functions/src/strings.rs b/datafusion/functions/src/strings.rs index d2fb5d585..8e7c05fa1 100644 --- a/datafusion/functions/src/strings.rs +++ b/datafusion/functions/src/strings.rs @@ -422,3 +422,13 @@ impl ColumnarValueRef<'_> { } } } + + +#[cfg(test)] +mod test { + use super::*; + #[test] + fn test_overflow() { + let builder = StringArrayBuilder::with_capacity(usize::MAX, usize::MAX); + } +} ``` It must be run in release mode ```shell cargo test --release -p datafusion-functions --lib -- overflow ``` ``` error: test failed, to rerun pass `-p datafusion-functions --lib` Caused by: process didn't exit successfully: `/Users/andrewlamb/Software/datafusion/target/release/deps/datafusion_functions-e3de98ad3d4dc27c overflow` (signal: 11, SIGSEGV: invalid memory reference) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ ``` ### Describe the solution you'd like > Recommendations: Check for integer overflow when calculating the capacity. ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org