adriangb commented on PR #22919: URL: https://github.com/apache/datafusion/pull/22919#issuecomment-4761276969
@kosiew thanks for the review. I addressed both points and rebased on latest `main`:

- **Marginal indistinguishability:** the dataset now uses four columns of identical shape (`s1`–`s4`): equal width, one marker at the same offset, and an equally cheap regex each. A marginal cost estimator or runtime timing therefore cannot tell the four regexes apart; only their joint distribution with the proxy can. Verified on 1M rows (equal widths 63, marginal selectivities all 0.30, conditional-on-`c0` 1.0 for `s1`/`s2`/`s3` vs 0.30 for `s4`).
- **Readability:** the proxy and independent conditions are factored into a `WITH base` CTE as named booleans.

Re-measured headroom: written vs hand-optimal `[c0, s4, s1/s2/s3]` is ~1.7x (16.5 ms vs 9.7 ms median). PR description updated. PTAL.
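For readers who want to see why only the joint distribution separates the four regexes, here is a minimal Python sketch. It is not the benchmark's generator or query. It assumes the simplest model consistent with the numbers above: `c0` holds for 30% of rows, `s1`/`s2`/`s3` match exactly when `c0` holds, and `s4` matches independently with probability 0.30. The `written` order, the helper names, and the cost model (each regex costs 1, the proxy `c0` is treated as free) are hypothetical, so the counts illustrate the effect and do not reproduce the measured timings.

```python
import random

# Hypothetical cost model: every regex column costs 1 per evaluation,
# the proxy c0 is treated as free.
COST = {"c0": 0, "s1": 1, "s2": 1, "s3": 1, "s4": 1}


def make_rows(n, seed=0):
    """Build rows of booleans saying whether each predicate passes.

    Assumed model: s1..s3 pass exactly when c0 passes, s4 is independent.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        proxy = rng.random() < 0.30   # c0 and the columns correlated with it
        indep = rng.random() < 0.30   # s4, independent of c0
        rows.append({"c0": proxy, "s1": proxy, "s2": proxy,
                     "s3": proxy, "s4": indep})
    return rows


def selectivity(rows, col, given=None):
    """Fraction of rows passing `col`, optionally restricted to rows passing `given`."""
    pool = [r for r in rows if given is None or r[given]]
    return sum(r[col] for r in pool) / len(pool)


def regex_cost(rows, order):
    """Total cost of a short-circuit AND evaluated in `order`."""
    total = 0
    for r in rows:
        for col in order:
            total += COST[col]
            if not r[col]:
                break  # the conjunction is already false for this row
    return total


rows = make_rows(100_000)

# Marginals look the same for all four columns; conditionals on c0 do not.
for col in ("s1", "s2", "s3", "s4"):
    print(col,
          "marginal:", round(selectivity(rows, col), 2),
          "given c0:", round(selectivity(rows, col, given="c0"), 2))

written = ["s1", "s2", "s3", "s4", "c0"]   # hypothetical "as written" order
optimal = ["c0", "s4", "s1", "s2", "s3"]   # hand-optimal order from the comment
print("written cost:", regex_cost(rows, written))
print("optimal cost:", regex_cost(rows, optimal))
```

Under these assumptions, once `c0` has passed, `s1`/`s2`/`s3` never reject a row, so `s4` is the only regex worth running early. That is the ordering signal a purely marginal estimator cannot see.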
