sdf-jkl commented on PR #19639:
URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3712166688

   Hey @adriangb, I've been thinking about something like this since the New 
Year. It's really cool to see you putting together a draft for it.
   
   I haven't had a chance to go through your code in full yet, but I wanted to 
share some research I did earlier that might be relevant:
   
   - ClickHouse release blog about adaptive filter selectivity:
   https://clickhouse.com/blog/clickhouse-release-23-11#column-statistics-for-prewhere
   - How ClickHouse measures adaptive filter selectivity:
   https://clickhouse.com/docs/optimize/prewhere#how-to-measure-prewhere-impact
   - First ClickHouse column statistics PR:
   https://github.com/ClickHouse/ClickHouse/pull/53240
   
   Before seeing your PR and the comments in #3463, I was thinking about using 
simpler heuristics for sorting predicates:
   - column type -> size
   - predicate operator -> how selective it is likely to be: 
(=, !=) > (>, <) > (>=, <=), etc.
   - how simple or complex the predicate is -> how long / how much CPU it takes 
to evaluate
   - column encoding -> if it supports random access, we could filter without 
decoding (https://github.com/apache/arrow-rs/issues/8842)
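
   As a rough sketch of how the heuristics above could be combined into a 
single sort key (not code from the PR or from DataFusion): the `PredicateInfo` 
struct, its fields, and the priority order of the key are all hypothetical and 
chosen only for illustration.

```rust
/// Operator classes, following the ranking in the list above:
/// (=, !=) before (>, <) before (>=, <=). Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum OpClass {
    Equality,       // =, !=
    StrictRange,    // >, <
    InclusiveRange, // >=, <=
    Other,          // anything else
}

/// Static facts about one filter predicate. Hypothetical type; a real
/// planner would derive these from the schema and the expression tree.
#[derive(Debug, Clone)]
struct PredicateInfo {
    name: &'static str,
    /// Approximate per-value size of the filtered column, in bytes.
    column_width_bytes: u32,
    op: OpClass,
    /// Number of nodes in the expression, as a crude proxy for CPU cost.
    expr_nodes: u32,
    /// True if the column encoding allows filtering without decoding.
    filter_without_decoding: bool,
}

/// Sort key: smaller sorts first. The priority order is an assumption:
/// encoding, then column size, then operator class, then complexity.
fn rank_key(p: &PredicateInfo) -> (bool, u32, OpClass, u32) {
    (
        !p.filter_without_decoding, // false (can skip decoding) sorts first
        p.column_width_bytes,
        p.op,
        p.expr_nodes,
    )
}

/// Returns the predicates in the order they would be evaluated.
/// The sort is stable, so ties keep their original relative order.
fn order_predicates(mut preds: Vec<PredicateInfo>) -> Vec<PredicateInfo> {
    preds.sort_by_key(rank_key);
    preds
}
```

   With this key, an equality filter on a column that can be filtered without 
decoding would run first, and a filter on a wide column that must be decoded 
would run last. The relative weight of the four signals is exactly the part 
that measured statistics could replace.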
   
   From a quick skim of the original ClickHouse PR, they still rely on some 
simple heuristics when column statistics aren't available.
   
   I would like to give your PR a proper review once I'm home, but I already 
love the direction you're taking.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

