Hi everyone,

I would like to propose adding incremental computation support to the
compute_table_stats action.

Currently, compute_table_stats performs a full table scan every time it
runs. The procedure loads all data files and computes Theta sketches over
the entire dataset. For larger tables with frequent appends, this makes
regular stats maintenance expensive.

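Since Apache DataSketches Theta sketches are mergeable via the union
operation, we can compute stats incrementally: sketch only the data files
added since the last stats computation and union the result with the
sketch from the previous run.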
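
To illustrate the mergeability argument, here is a minimal Python sketch
that uses a toy K-minimum-values structure as a stand-in for a Theta
sketch. It is not the DataSketches API; build_sketch, union, estimate and
the size K are hypothetical names chosen for illustration. It shows that
unioning a sketch of the existing data with a sketch of the appended data
gives the same sketch as a full scan.

```python
import hashlib

K = 1024  # hypothetical sketch size: number of smallest hashes retained


def _hash64(value):
    """Map a value to a deterministic 64-bit integer hash."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_sketch(values, k=K):
    """Toy sketch: the k smallest distinct hashes, in ascending order."""
    return sorted({_hash64(v) for v in values})[:k]


def union(a, b, k=K):
    """Merge two toy sketches without touching the underlying rows."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch, k=K):
    """Estimate the number of distinct values (NDV) from a toy sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k hashes seen: count is exact
    return (k - 1) * 2**64 / sketch[-1]


# Assumed scenario: values 0..49999 already in the table,
# values 40000..59999 arrive in newly appended files.
existing = build_sketch(range(0, 50000))
appended = build_sketch(range(40000, 60000))

merged = union(existing, appended)      # incremental path
full = build_sketch(range(0, 60000))    # full-scan path

print(merged == full, round(estimate(merged)))
```

In this toy model the incremental path only hashes the appended values,
yet ends up with exactly the sketch a full scan would produce.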

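Handling deletes: Theta sketches can be unioned, but row deletes cannot be
applied to an existing sketch. DataSketches does provide an A-not-B set
difference, but it removes a value entirely, whereas deleting a row only
removes the value if no other row still contains it. After row deletes, the
sketch may therefore overestimate NDV since it still includes deleted
values. The overestimate is bounded by the number of distinct values
deleted since the last full computation. Periodic full recomputation can be
triggered when needed (for example, after major compaction).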
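
As a small worked example of that overestimate, the following Python
snippet uses exact sets in place of sketches (a real Theta sketch would add
its own approximation error on top). The rows and deleted rows are made-up
data for illustration only.

```python
from collections import Counter

# Made-up column values for the rows ever appended (duplicates allowed).
rows = ["a", "a", "b", "c", "d", "d", "e"]
# Made-up deleted rows: one "a" survives, "c" and "d" disappear entirely.
deleted_rows = ["a", "c", "d", "d"]

# Counter subtraction keeps only values with a positive remaining count.
live = Counter(rows) - Counter(deleted_rows)

union_only_ndv = len(set(rows))   # what a union-only sketch keeps tracking
true_ndv = len(live)              # distinct values still present
overestimate = union_only_ndv - true_ndv
distinct_deleted = len(set(deleted_rows))

print(union_only_ndv, true_ndv, overestimate, distinct_deleted)
# prints: 5 3 2 3
```

Here the overestimate (2) stays below the number of distinct deleted values
(3) because "a" is still present in another row.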

I would appreciate your thoughts and feedback on this, especially on how
to handle the delete/overestimate trade-off. I would be happy to work on
this.

Thanks
Hemanth Boyina
