Hi everyone,

I would like to discuss adding incremental computation support to the compute_table_stats action.

Currently, compute_table_stats performs a full table scan every time it runs. The procedure loads all data files and computes Theta sketches over the entire dataset. For larger tables with frequent appends, this makes regular stats maintenance expensive.

Since Apache DataSketches Theta sketches are mergeable via the union operation, we can compute stats incrementally: build a sketch over only the data files added since the last run, then union it with the previously computed sketch.

Handling deletes:
Theta sketches support union but not set difference against the underlying data, so values cannot be removed from an existing sketch. After row deletes, the sketch may slightly overestimate NDV because it still includes the deleted values. For tables with deletes, this overestimate is bounded: it cannot exceed the number of distinct values deleted since the last full computation (sketch error aside). A periodic full recomputation can be triggered when needed (for example, after a major compaction).

I would appreciate your thoughts and feedback on this, especially on how to handle the delete/overestimate trade-off. I would be happy to work on this.

Thanks,
Hemanth Boyina
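
P.S. To make the idea concrete, here is a minimal Python sketch of the merge and delete behavior. It uses a simplified K-Minimum-Values sketch as a stand-in for a Theta sketch; it is not the DataSketches API or Iceberg code, and the names (hash64, build_sketch, union, estimate_ndv) and the choice of k=256 are hypothetical, chosen only for illustration.

```python
import hashlib

HASH_SPACE = 2 ** 64  # hashes are treated as integers in [0, 2**64)


def hash64(value):
    """Map a value to a stable 64-bit integer hash."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_sketch(values, k=256):
    """Keep the k smallest distinct hashes seen in `values` (sorted)."""
    return sorted({hash64(v) for v in values})[:k]


def union(a, b, k=256):
    """Merge two sketches; the result equals the sketch of the combined data."""
    return sorted(set(a) | set(b))[:k]


def estimate_ndv(sketch, k=256):
    """Estimate the number of distinct values summarized by a sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct hashes: count is exact
    return (k - 1) * HASH_SPACE / sketch[-1]


# Appends: sketch only the new rows and union with the stored sketch.
stored = build_sketch(range(0, 6000))         # result of the previous run
appended = build_sketch(range(6000, 10000))   # rows added since then
incremental = union(stored, appended)
full_scan = build_sketch(range(0, 10000))
assert incremental == full_scan               # same sketch, without rescanning

# Deletes: the stored sketch cannot forget values that were removed.
before_delete = build_sketch(range(100))
recomputed = build_sketch(range(10, 100))     # rows 0..9 deleted, full rescan
print(estimate_ndv(before_delete), estimate_ndv(recomputed))  # 100.0 90.0
```

The first part shows that the unioned sketch matches a full recomputation for append-only changes. The second part shows the overestimate: the maintained sketch still counts the 10 deleted values until a full recomputation replaces it.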
