Thoughts on Schema Inference from Columnar Storage

Calvin Dani Thu, 03 Apr 2025 12:40:19 -0700

Hi,

I’m exploring schema inference from columnar storage, where tuple
compaction infers the schema and stores it in the LSM Tree. I’ve found a
way to aggregate the inferred schemas from all LSM Trees across each NC and
data partition.


The remaining concern is unflushed data, which the stored schemas do not
yet reflect. There seem to be two possible approaches:

1. Force a flush and then aggregate all inferred schemas.

2. Infer schema from unflushed data and aggregate it with the existing schema.

Would this be the right direction, or is there a better alternative? Also,
for option 2, is there a mechanism to efficiently read only unflushed
records?
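
As a rough illustration of the merge step in option 2, here is a minimal
Python sketch. It is not the actual tuple compactor or any existing API: the
names infer_schema and merge_schemas are hypothetical, a schema is modeled as
a flat mapping from field name to the set of observed type names, and the
unflushed records are stood in for by a plain list of dicts. Nested fields
are not handled.

```python
from typing import Any, Dict, Iterable, Set

# Assumption: a schema is a flat map of field name -> observed type names.
Schema = Dict[str, Set[str]]


def infer_schema(records: Iterable[Dict[str, Any]]) -> Schema:
    """Infer a flat schema from records (hypothetical stand-in for
    scanning the unflushed, in-memory data)."""
    schema: Schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def merge_schemas(*schemas: Schema) -> Schema:
    """Union any number of schemas field by field."""
    merged: Schema = {}
    for schema in schemas:
        for field, types in schema.items():
            merged.setdefault(field, set()).update(types)
    return merged


# Schema already aggregated from flushed components (made-up example data).
flushed = {"id": {"int"}, "name": {"str"}}
# Records that have not been flushed yet (made-up example data).
unflushed_records = [{"id": 7, "age": 30}, {"id": "7a"}]

combined = merge_schemas(flushed, infer_schema(unflushed_records))
```

Because the merge is a plain set union per field, it does not matter in
which order the partitions or the unflushed portion are combined.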

Looking forward to your thoughts.

Best regards,

Calvin Dani
