At Amazon, we need to produce regular metrics on how much disk storage each field consumes. We manage an index with data contributed by many teams and business units, and we are often asked to produce reports attributing index storage usage to those customers. The best tool we have for this today is based on a custom Codec that separates storage by field: to get the statistics, we read an existing index and write it out via addIndexes plus a force merge, using the custom codec. This is time-consuming and inefficient, so it tends not to get done.
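For context, that workflow looks roughly like the sketch below. PerFieldSizeTrackingCodec is a stand-in name for our custom codec (not a real class); the IndexWriter calls are stock Lucene. Since addIndexes(Directory...) copies the source segments verbatim, it is the forceMerge(1) that actually rewrites everything through the configured codec:

    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class FieldUsageReport {
      public static void main(String[] args) throws Exception {
        try (Directory source = FSDirectory.open(Paths.get(args[0]));
             Directory dest = FSDirectory.open(Paths.get(args[1]))) {
          IndexWriterConfig config = new IndexWriterConfig();
          // Hypothetical codec that tallies bytes written per field.
          config.setCodec(new PerFieldSizeTrackingCodec());
          try (IndexWriter writer = new IndexWriter(dest, config)) {
            writer.addIndexes(source); // copies segments as-is
            writer.forceMerge(1);      // rewrites them through our codec
          }
          // The codec then reports the per-field byte counts it recorded.
        }
      }
    }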
I wonder if it would make sense to add methods to *some* API that would expose a per-field disk usage metric. If we don't want to add it to IndexReader, which would imply lots of intermediate methods and API additions, maybe we could have CheckIndex compute it? (Implementation note: for the current formats, the information for each field is always segregated by field, I think. I suppose that in theory we might want some shared data structure across fields some day, but that seems like an edge case we could handle in some exceptional way.)
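To make the shape of that concrete, a caller might end up with something like the sketch below. The fieldBytesUsage member on CheckIndex.Status is hypothetical (it does not exist today); CheckIndex itself and checkIndex() are existing APIs:

    import java.nio.file.Paths;
    import java.util.Map;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class PrintFieldUsage {
      public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             CheckIndex checker = new CheckIndex(dir)) {
          CheckIndex.Status status = checker.checkIndex();
          // Hypothetical: per-field byte totals gathered during the check.
          for (Map.Entry<String, Long> e : status.fieldBytesUsage.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue() + " bytes");
          }
        }
      }
    }

That would leave IndexReader untouched and piggyback on code that already walks every segment anyway.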