Hi,

I’m exploring schema inference from columnar storage, where tuple compaction infers the schema and stores it in the LSM tree. I’ve found a way to aggregate the inferred schemas from all LSM trees across every NC and data partition.
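
For concreteness, the kind of union-style aggregation I mean can be sketched roughly as below. This is illustrative Python only, not the actual storage-layer code: the dict-based schema representation and the names `infer_schema`, `merge_schemas`, and `aggregate` are hypothetical, and the sketch assumes a schema only needs to track field names, observed value types, and nested objects.

```python
from functools import reduce


def infer_schema(record):
    """Infer a schema from one JSON-like record (hypothetical representation).

    Each field maps to a node holding the set of observed type names and,
    for nested objects, the schema of its child fields.
    """
    schema = {}
    for name, value in record.items():
        if isinstance(value, dict):
            schema[name] = {"types": {"object"}, "fields": infer_schema(value)}
        else:
            schema[name] = {"types": {type(value).__name__}, "fields": {}}
    return schema


def merge_schemas(left, right):
    """Union two schemas: keep every field, union its types, recurse into children."""
    empty = {"types": set(), "fields": {}}
    merged = {}
    for name in left.keys() | right.keys():
        a = left.get(name, empty)
        b = right.get(name, empty)
        merged[name] = {
            "types": a["types"] | b["types"],
            "fields": merge_schemas(a["fields"], b["fields"]),
        }
    return merged


def aggregate(schemas):
    """Fold any number of per-partition schemas into one; an empty input gives {}."""
    return reduce(merge_schemas, schemas, {})


# Example: three partitions whose inferred schemas only partly overlap.
partition_schemas = [
    infer_schema({"id": 1, "name": "a"}),
    infer_schema({"id": 2, "loc": {"x": 1.0}}),
    infer_schema({"id": "3", "loc": {"y": 2.0}}),
]
combined = aggregate(partition_schemas)
```

Since the merge is just a set union applied field by field, it is commutative and associative, so the order in which per-partition schemas arrive should not matter, and a schema from any additional source could be folded in the same way.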
The concern now is handling unflushed data. There seem to be two possible approaches:

1. Force a flush and then aggregate all inferred schemas.
2. Infer the schema from unflushed data and aggregate it with the existing schema.

Would this be the right direction, or is there a better alternative? Also, for option 2, is there a mechanism to efficiently read only unflushed records?

Looking forward to your thoughts.

Best regards,
Calvin Dani
