adriangb commented on code in PR #22919:
URL: https://github.com/apache/datafusion/pull/22919#discussion_r3448077383
##########
benchmarks/sql_benchmarks/predicate_eval/load/corrproxy.sql:
##########
@@ -0,0 +1,30 @@
+-- Correlated-proxy dataset: a cheap integer predicate that is a perfect proxy
+-- for three string predicates, plus one independent string predicate.
+--
+-- c0 = 1 for ~30% of rows (cheap proxy)
+-- s1 contains 'aaa', 'ccc' and 'ddd' exactly where c0 = 1 (correlated)
+-- s2 contains 'bbb' for an independent ~30% of rows (independent)
+--
+-- Marginally, the four regex predicates are indistinguishable: similar cost,
+-- the same ~30% selectivity. Their *conditional* selectivities behind the
+-- proxy differ completely: after `c0 = 1`, the three s1 regexes keep every
+-- survivor (each re-tests the proxy's condition) while the s2 regex still
+-- discards ~70%. Only joint statistics can see that; an independence
+-- assumption prices all four regexes identically in every position.
+--
+-- PRED_FILL sets the filler width around each marker (a non-matching
+-- `regexp_like` must scan the whole value), and PRED_ROWS sizes the table.
+CREATE TABLE t AS
+SELECT
+ CASE WHEN (value * 7) % 100 < 30 THEN 1 ELSE 0 END AS c0,
+ repeat('q', ${PRED_FILL:-30})
+ || CASE WHEN (value * 7) % 100 < 30 THEN 'aaa' ELSE 'zzz' END
+ || repeat('q', ${PRED_FILL:-30})
+ || CASE WHEN (value * 7) % 100 < 30 THEN 'ccc' ELSE 'zzz' END
+ || repeat('q', ${PRED_FILL:-30})
+ || CASE WHEN (value * 7) % 100 < 30 THEN 'ddd' ELSE 'zzz' END
+ || repeat('q', ${PRED_FILL:-30}) AS s1,
+ repeat('q', ${PRED_FILL:-30})
+ || CASE WHEN (value * 13) % 100 < 30 THEN 'bbb' ELSE 'zzz' END
+ || repeat('q', ${PRED_FILL:-30}) AS s2
+FROM generate_series(1, ${PRED_ROWS:-1000000});
Review Comment:
Good catch, fixed in the latest push. The dataset now uses four columns of
identical shape (`s1`/`s2`/`s3`/`s4`): each has the same width, holds a single
marker at the same offset, and is matched by an equally cheap regex with the
same ~30% selectivity. The four regexes are therefore marginally
indistinguishable in any position: neither a per-predicate cost estimate nor
runtime timing can prefer one over another; only their joint distribution with
the proxy can.
Verified on 1M rows: equal widths (63), marginal selectivities all 0.30, and
conditional-on-`c0` selectivity 1.0 for `s1`/`s2`/`s3` vs 0.30 for `s4`.
Re-measured headroom is ~1.7x (16.5 ms vs 9.7 ms median).
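
For reference, a check along these lines could reproduce the width and selectivity figures. This is only a sketch: the marker strings (`'aaa'` for `s1`, `'bbb'` for `s4`) are assumed from the earlier revision of the file and may not match the pushed version, and `s2`/`s3` would be checked the same way as `s1`.

```sql
-- Sketch only: marker strings are assumptions carried over from the
-- earlier two-column revision, not confirmed against the latest push.
SELECT
  -- Width check: min and max should agree if every value has the same length.
  min(character_length(s1)) AS min_width_s1,
  max(character_length(s1)) AS max_width_s1,
  min(character_length(s4)) AS min_width_s4,
  max(character_length(s4)) AS max_width_s4,
  -- Marginal selectivity: fraction of all rows the regex keeps.
  avg(CASE WHEN regexp_like(s1, 'aaa') THEN 1.0 ELSE 0.0 END) AS sel_s1,
  avg(CASE WHEN regexp_like(s4, 'bbb') THEN 1.0 ELSE 0.0 END) AS sel_s4,
  -- Conditional selectivity behind the proxy: rows with c0 <> 1 yield NULL,
  -- which avg() ignores, so the mean is taken over proxy survivors only.
  avg(CASE WHEN c0 = 1
           THEN CASE WHEN regexp_like(s1, 'aaa') THEN 1.0 ELSE 0.0 END
      END) AS sel_s1_given_c0,
  avg(CASE WHEN c0 = 1
           THEN CASE WHEN regexp_like(s4, 'bbb') THEN 1.0 ELSE 0.0 END
      END) AS sel_s4_given_c0
FROM t;
```

The gap between `sel_s4` and `sel_s4_given_c0` versus `sel_s1` and `sel_s1_given_c0` is the signal that only joint statistics expose.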
##########
benchmarks/sql_benchmarks/predicate_eval/load/corrproxy.sql:
##########
@@ -0,0 +1,30 @@
+-- Correlated-proxy dataset: a cheap integer predicate that is a perfect proxy
+-- for three string predicates, plus one independent string predicate.
+--
+-- c0 = 1 for ~30% of rows (cheap proxy)
+-- s1 contains 'aaa', 'ccc' and 'ddd' exactly where c0 = 1 (correlated)
+-- s2 contains 'bbb' for an independent ~30% of rows (independent)
+--
+-- Marginally, the four regex predicates are indistinguishable: similar cost,
+-- the same ~30% selectivity. Their *conditional* selectivities behind the
+-- proxy differ completely: after `c0 = 1`, the three s1 regexes keep every
+-- survivor (each re-tests the proxy's condition) while the s2 regex still
+-- discards ~70%. Only joint statistics can see that; an independence
+-- assumption prices all four regexes identically in every position.
+--
+-- PRED_FILL sets the filler width around each marker (a non-matching
+-- `regexp_like` must scan the whole value), and PRED_ROWS sizes the table.
+CREATE TABLE t AS
+SELECT
+ CASE WHEN (value * 7) % 100 < 30 THEN 1 ELSE 0 END AS c0,
+ repeat('q', ${PRED_FILL:-30})
+ || CASE WHEN (value * 7) % 100 < 30 THEN 'aaa' ELSE 'zzz' END
+ || repeat('q', ${PRED_FILL:-30})
+ || CASE WHEN (value * 7) % 100 < 30 THEN 'ccc' ELSE 'zzz' END
+ || repeat('q', ${PRED_FILL:-30})
+ || CASE WHEN (value * 7) % 100 < 30 THEN 'ddd' ELSE 'zzz' END
Review Comment:
Done. The proxy condition and the independent control are now each defined
once in a `WITH base` CTE as named booleans (`proxy`, `indep`) and reused, so
the perfect-proxy and independence invariants have a single source of truth
and cannot drift if a threshold changes.
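
Roughly, the shape is as follows. This is a sketch of the idea rather than the exact pushed file: the column-to-marker assignment (`s1`..`s3` correlated, `s4` independent, markers reused from the earlier revision) is assumed for illustration.

```sql
-- Sketch only: column layout and marker strings are illustrative assumptions.
CREATE TABLE t AS
WITH base AS (
  SELECT
    (value * 7) % 100 < 30  AS proxy,  -- cheap-proxy condition, defined once
    (value * 13) % 100 < 30 AS indep   -- independent control, defined once
  FROM generate_series(1, ${PRED_ROWS:-1000000})
)
SELECT
  CASE WHEN proxy THEN 1 ELSE 0 END AS c0,
  -- Three columns whose marker appears exactly where the proxy holds.
  repeat('q', ${PRED_FILL:-30})
    || CASE WHEN proxy THEN 'aaa' ELSE 'zzz' END
    || repeat('q', ${PRED_FILL:-30}) AS s1,
  repeat('q', ${PRED_FILL:-30})
    || CASE WHEN proxy THEN 'ccc' ELSE 'zzz' END
    || repeat('q', ${PRED_FILL:-30}) AS s2,
  repeat('q', ${PRED_FILL:-30})
    || CASE WHEN proxy THEN 'ddd' ELSE 'zzz' END
    || repeat('q', ${PRED_FILL:-30}) AS s3,
  -- One column of the same shape driven by the independent control.
  repeat('q', ${PRED_FILL:-30})
    || CASE WHEN indep THEN 'bbb' ELSE 'zzz' END
    || repeat('q', ${PRED_FILL:-30}) AS s4
FROM base;
```

Changing a threshold then means editing one line in `base`, and every dependent column follows.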
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]