sdf-jkl commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3712166688
Hey @adriangb, I've been thinking about something like this since the New Year. It's really cool to see you putting together a draft for it.

I haven't had a chance to go through your code fully, but I wanted to share some research I did earlier that might be relevant:

- ClickHouse release blog about adaptive filter selectivity: https://clickhouse.com/blog/clickhouse-release-23-11#column-statistics-for-prewhere
- How ClickHouse measures adaptive filter selectivity: https://clickhouse.com/docs/optimize/prewhere#how-to-measure-prewhere-impact
- First ClickHouse column statistics PR: https://github.com/ClickHouse/ClickHouse/pull/53240

Before seeing your PR and the comments in #3463, I was thinking about using simpler heuristics for sorting predicates:

- col type -> size
- cardinality of the predicate operator -> (=, !=) > (>, <) > (>=, <=) etc.
- how simple/complex the predicate is -> how long / how much CPU it takes to evaluate
- col encoding -> if it supports random access, we could filter without decoding (https://github.com/apache/arrow-rs/issues/8842)

From a quick skim of the original ClickHouse PR, they still rely on some simple heuristics when column statistics aren't available.

I would like to give your PR a proper review once I'm home, but I already love the direction you're taking.
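To make the simple heuristics listed above a bit more concrete, here is a rough sketch of what a static predicate ordering could look like. Everything in it is hypothetical: `PredicateInfo`, `Op`, `sort_key`, and `order_predicates` are illustration-only names, not DataFusion or arrow-rs APIs, and combining the four signals as a lexicographic key (operator group first, then value width, then CPU cost, then encoding) is just one assumed way to do it.

```rust
/// Comparison operators, grouped as in the heuristic list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    Eq,
    NotEq,
    Lt,
    Gt,
    LtEq,
    GtEq,
}

impl Op {
    /// Lower rank means "evaluate earlier".
    /// Mirrors the ordering (=, !=) > (>, <) > (>=, <=) from the list.
    fn rank(self) -> u8 {
        match self {
            Op::Eq | Op::NotEq => 0,
            Op::Lt | Op::Gt => 1,
            Op::LtEq | Op::GtEq => 2,
        }
    }
}

/// Hypothetical per-predicate facts that are known without column statistics.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PredicateInfo {
    /// Label used only to identify the predicate in this sketch.
    name: &'static str,
    /// Approximate width of one value of the column type, in bytes.
    value_width_bytes: usize,
    /// Comparison operator used by the predicate.
    op: Op,
    /// Rough relative CPU cost of evaluating the expression (assumed unit).
    eval_cost: u32,
    /// Whether the column encoding allows filtering without a full decode.
    random_access: bool,
}

impl PredicateInfo {
    /// Assumed combination rule: compare operator group first, then width,
    /// then CPU cost, and finally prefer random-access encodings.
    fn sort_key(&self) -> (u8, usize, u32, bool) {
        (
            self.op.rank(),
            self.value_width_bytes,
            self.eval_cost,
            !self.random_access, // `false` sorts first, so random access wins ties
        )
    }
}

/// Returns the predicates in the order they would be evaluated.
/// The sort is stable, so ties keep their original relative order.
fn order_predicates(mut preds: Vec<PredicateInfo>) -> Vec<PredicateInfo> {
    preds.sort_by_key(|p| p.sort_key());
    preds
}
```

With a key like this, an equality predicate on a 4-byte column would run before an equality predicate on an 8-byte column, and both would run before any range predicate. An adaptive approach based on measured selectivity could then override this static order once runtime numbers are available.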
