alamb commented on issue #19487:
URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3699162074

   > Could we also use a vectorized approach for distribution statistics? I think we should be able to store them as a union of structs and use UDFs to compute intersections, etc.?
   > 
   > For set statistics, at least for the `HashSet<ScalarValue>` type, we could have a simple size-based heuristic: in my experience these sorts of statistics are most useful when the sets are small. Larger sets are less useful and much more expensive to manage, e.g. distinguishing a cardinality of 1 from 1M is useful, but 1M vs. 2M much less so. So maybe we cap it at 128 elements or something like that and drop it / stop building it beyond that?
   
   I agree that the value of a set distribution is low once it has many members. Maybe we could convert the set to a min/max range once it grows past a small threshold.
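   As a rough sketch of that idea (all names, and the `i64` value type, are hypothetical placeholders, not DataFusion's actual statistics API), a per-column statistic could track exact distinct values while the set is small and degrade to a min/max range once it passes a cap:

   ```rust
   use std::collections::BTreeSet;

   /// Hypothetical sketch, not DataFusion's actual statistics types:
   /// a per-column distinct-value statistic that degrades to a min/max
   /// range once the set grows past a cap.
   #[derive(Debug)]
   pub enum ColumnStat {
       /// Exact distinct values, kept only while the set is small.
       Distinct(BTreeSet<i64>),
       /// Fallback once the set exceeds the cap: just the bounds.
       Range { min: i64, max: i64 },
   }

   impl ColumnStat {
       pub fn new() -> Self {
           ColumnStat::Distinct(BTreeSet::new())
       }

       pub fn observe(&mut self, v: i64, cap: usize) {
           match self {
               ColumnStat::Distinct(set) => {
                   set.insert(v);
                   if set.len() > cap {
                       // Convert the set into a min/max range; BTreeSet keeps
                       // its elements ordered, so the bounds are the endpoints.
                       let min = *set.iter().next().expect("non-empty");
                       let max = *set.iter().next_back().expect("non-empty");
                       *self = ColumnStat::Range { min, max };
                   }
               }
               ColumnStat::Range { min, max } => {
                   *min = (*min).min(v);
                   *max = (*max).max(v);
               }
           }
       }
   }
   ```

   With a cap of, say, 128, a low-cardinality column keeps exact set semantics while a high-cardinality column silently becomes a cheap range statistic, so the cost of maintaining the statistic stays bounded.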
   
   > I imagine for larger sets estimated set sizes and membership would be more 
useful, e.g. a bloom filter.
   
   100%
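   
   For the larger-set case, the shape of such a filter can be sketched in a few lines (a minimal illustration using only the standard library; this is not a proposal for the actual implementation, which would presumably reuse an existing Bloom filter such as Parquet's):

   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   /// Hypothetical sketch: a tiny Bloom filter giving approximate set
   /// membership (no false negatives, tunable false-positive rate) for
   /// value sets too large to store exactly.
   pub struct BloomFilter {
       bits: Vec<bool>,
       num_hashes: u64,
   }

   impl BloomFilter {
       pub fn new(num_bits: usize, num_hashes: u64) -> Self {
           Self { bits: vec![false; num_bits], num_hashes }
       }

       // Derive the i-th bit index by hashing (i, value).
       fn index<T: Hash>(&self, value: &T, i: u64) -> usize {
           let mut h = DefaultHasher::new();
           i.hash(&mut h);
           value.hash(&mut h);
           (h.finish() as usize) % self.bits.len()
       }

       pub fn insert<T: Hash>(&mut self, value: &T) {
           for i in 0..self.num_hashes {
               let idx = self.index(value, i);
               self.bits[idx] = true;
           }
       }

       /// False positives are possible; false negatives are not.
       pub fn may_contain<T: Hash>(&self, value: &T) -> bool {
           (0..self.num_hashes).all(|i| self.bits[self.index(value, i)])
       }
   }
   ```

   A fixed-size bit vector like this keeps the memory cost constant regardless of set cardinality, which is exactly the property the exact `HashSet` statistic loses on large sets.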
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
