At Amazon, we need to produce regular metrics on how much disk storage each field consumes. We manage an index with data contributed by many teams and business units, and we are often asked to produce reports attributing index storage usage to those customers. The best tool we have for this today is based on a custom Codec that separates storage by field: to get the statistics, we read an existing index and write it out via addIndexes plus a force merge, using the custom codec. This is time-consuming and inefficient, so it tends not to get done.
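For context, that workflow looks roughly like the sketch below. PerFieldSizeTrackingCodec is a stand-in name for our custom codec (not a real class); the IndexWriter calls are stock Lucene. Since addIndexes(Directory...) copies the source segments verbatim, it is the forceMerge(1) that actually rewrites everything through the configured codec:

    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class FieldUsageReport {
      public static void main(String[] args) throws Exception {
        try (Directory source = FSDirectory.open(Paths.get(args[0]));
             Directory dest = FSDirectory.open(Paths.get(args[1]))) {
          IndexWriterConfig config = new IndexWriterConfig();
          // Hypothetical codec that tallies bytes written per field.
          config.setCodec(new PerFieldSizeTrackingCodec());
          try (IndexWriter writer = new IndexWriter(dest, config)) {
            writer.addIndexes(source); // copies segments as-is
            writer.forceMerge(1);      // rewrites them through our codec
          }
          // The codec then reports the per-field byte counts it recorded.
        }
      }
    }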
I wonder if it would make sense to add methods to *some* API that would expose a per-field disk usage metric. If we don't want to add it to IndexReader, which would imply lots of intermediate methods and API additions, maybe we could have CheckIndex compute it? (Implementation note: for the current formats, the information for each field is always segregated by field, I think. I suppose that in theory we might want some shared data structure across fields some day, but that seems like an edge case we could handle in some exceptional way.)
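To make the shape of that concrete, a caller might end up with something like the sketch below. The fieldBytesUsage member on CheckIndex.Status is hypothetical (it does not exist today); CheckIndex itself and checkIndex() are existing APIs:

    import java.nio.file.Paths;
    import java.util.Map;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class PrintFieldUsage {
      public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             CheckIndex checker = new CheckIndex(dir)) {
          CheckIndex.Status status = checker.checkIndex();
          // Hypothetical: per-field byte totals gathered during the check.
          for (Map.Entry<String, Long> e : status.fieldBytesUsage.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue() + " bytes");
          }
        }
      }
    }

That would leave IndexReader untouched and piggyback on code that already walks every segment anyway.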