Jefffrey commented on issue #4030: URL: https://github.com/apache/datafusion/issues/4030#issuecomment-3315002759
I think to achieve the expected behaviour you would need to mark the UDF as volatile. An updated example using latest main:

```rust
use arrow::array::{ArrayRef, BooleanArray};
use datafusion::arrow::datatypes::DataType;
use datafusion::common::cast::as_float32_array;
use datafusion::error::Result;
use datafusion::logical_expr::{ColumnarValue, Volatility};
use datafusion::prelude::*;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv(
        "csv",
        "/Users/jeffrey/Downloads/test.csv",
        CsvReadOptions::new(),
    )
    .await
    .unwrap();
    let udf = {
        create_udf(
            "rand_bool",
            vec![DataType::Float32],
            DataType::Boolean,
            Volatility::Volatile, // From Stable to Volatile
            Arc::new(|args: &[ColumnarValue]| {
                let ColumnarValue::Array(l) = &args[0] else {
                    panic!("should be array")
                };
                const BOOLS: [bool; 4] = [true, true, false, false];
                let x = as_float32_array(l)?;
                println!("udf in: {x:?}");
                Ok(ColumnarValue::from(Arc::new(BooleanArray::from(Vec::from(
                    &BOOLS[..x.len()],
                ))) as ArrayRef))
            }),
        )
    };
    ctx.register_udf(udf.clone());

    let query = ctx
        .sql("SELECT * FROM (SELECT *, rand_bool(num) AS rand FROM csv) WHERE NOT rand")
        .await?;
    query.clone().show_limit(10).await.unwrap();
    query.explain(false, false).unwrap().show().await.unwrap();

    Ok(())
}
```

- Switching back to `Stable` gives the same output as described in the issue, with two evaluations

The `Volatile` version gives this output:

```sh
udf in: PrimitiveArray<Float32>
[
  100.0,
  200.0,
  150.0,
  300.0,
]
+--------+-----+-------+
| name_1 | num | rand  |
+--------+-----+-------+
| andy   | 150 | false |
| paul   | 300 | false |
+--------+-----+-------+
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                          |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: NOT rand                                                                                                                              |
|               |   Projection: csv.name_1, csv.num, rand_bool(CAST(csv.num AS Float32)) AS rand                                                                |
|               |     TableScan: csv projection=[name_1, num]                                                                                                   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                   |
|               |   FilterExec: NOT rand@2                                                                                                                      |
|               |     ProjectionExec: expr=[name_1@0 as name_1, num@1 as num, rand_bool(CAST(num@1 AS Float32)) as rand]                                        |
|               |       RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                                                   |
|               |         DataSourceExec: file_groups={1 group: [[Users/jeffrey/Downloads/test.csv]]}, projection=[name_1, num], file_type=csv, has_header=true |
|               |                                                                                                                                               |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
```

You can see that the UDF is evaluated only once.

I guess we could try to update the docs around `Volatility` to see if we can make this clearer 🤔

https://github.com/apache/datafusion/blob/1488e1010a670ee5973fc621af1ec73fd92c9b71/datafusion/expr-common/src/signature.rs#L46-L86

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: github-h...@datafusion.apache.org