Oh, yes, that's a clever idea. It seems it would still take quite a while
(tens of minutes?) for a larger index, though it would certainly be much
faster than the force-merge solution. I guess to get any faster we would
have to instrument each format. They generally do know how much space
each field occupies, but perhaps it's too much API change to expose that.
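
To make the byte-tracking idea from the quoted thread below concrete, here
is a rough sketch of the counting part only. It deliberately uses no Lucene
API: a real tool would presumably wrap the Directory (e.g. with
FilterDirectory) and the inputs it opens, whereas this stand-in wraps a plain
java.io.InputStream. The names ByteTracker, track, drain, bytesRead and
snapshot are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a FilterDirectory-style wrapper: counts bytes read per label. */
final class ByteTracker {
    private final Map<String, Long> bytesByLabel = new TreeMap<>();

    /** Wraps a stream so every byte read through it is attributed to the label (e.g. a field name). */
    InputStream track(String label, InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public int read() throws IOException {
                int b = super.read();
                if (b >= 0) add(label, 1);
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                int n = super.read(buf, off, len);
                if (n > 0) add(label, n);
                return n;
            }
        };
    }

    /** Reads the data to the end through a tracked stream; returns the label's running total. */
    long drain(String label, byte[] data) {
        byte[] buf = new byte[512];
        try (InputStream in = track(label, new ByteArrayInputStream(data))) {
            while (in.read(buf) != -1) {
                // contents are discarded; only the byte count matters here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytesRead(label);
    }

    /** Total bytes attributed to the label so far, or 0 if nothing was read. */
    long bytesRead(String label) {
        return bytesByLabel.getOrDefault(label, 0L);
    }

    /** Copy of the per-label totals, sorted by label. */
    Map<String, Long> snapshot() {
        return new TreeMap<>(bytesByLabel);
    }

    private void add(String label, long n) {
        bytesByLabel.merge(label, n, Long::sum);
    }

    public static void main(String[] args) {
        ByteTracker tracker = new ByteTracker();
        tracker.drain("body", new byte[2048]);
        tracker.drain("title", new byte[100]);
        System.out.println(tracker.snapshot()); // prints {body=2048, title=100}
    }
}
```

The wrapper only sees bytes that are actually read, so the cost of such a
tool is whatever it takes to enumerate each structure once.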

On Tue, Jun 14, 2022 at 12:09 AM Nhat Nguyen
<nhat.ngu...@elastic.co.invalid> wrote:
>
>
>> Also, the tool can be much more efficient than checkindex, e.g. for
>> stored fields and vectors it can just retrieve the first and last
>> documents, whereas checkindex should verify all of the documents
>> slowly.
>
>
> Yes, we implemented a similar heuristic in the DiskUsage API in Elasticsearch.
>
> On Mon, Jun 13, 2022 at 11:27 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>> On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
>> <nhat.ngu...@elastic.co.invalid> wrote:
>> >
>> > Hi Michael,
>> >
>> > We developed similar functionality in Elasticsearch. The DiskUsage API
>> > estimates the storage of each field by iterating its structures (e.g.,
>> > inverted index, doc-values, stored fields) and tracking the number
>> > of bytes read. The result is pretty fast and accurate.
>> >
>> > I am +1 to the proposal.
>> >
>>
>> I like an approach such as this: enumerate the index, using something
>> like FilterDirectory to track the bytes. It doesn't require you to
>> force-merge all the data through addIndexes, and at the same time it
>> doesn't invade the codec APIs.
>> The user can always force-merge the data themselves for situations
>> such as benchmarks or tracking space over time, where the
>> fluctuations from merges could otherwise create too much noise.
>> Personally, I would suggest a separate API/tool from CheckIndex, since
>> this tracking could perhaps mask bugs. No reason to mix the two concerns.
>> Also, the tool can be much more efficient than checkindex, e.g. for
>> stored fields and vectors it can just retrieve the first and last
>> documents, whereas checkindex should verify all of the documents
>> slowly.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

