> Also, the tool can be much more efficient than CheckIndex: for
> stored fields and vectors it can just retrieve the first and last
> documents, whereas CheckIndex has to slowly verify every document.


Yes, we implemented a similar heuristic in the DiskUsage API in
Elasticsearch.
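
For context, the core of it looks roughly like the sketch below: wrap the
Directory so every IndexInput counts the bytes it reads, walk one field's
structures, and then look at the per-file counters. This is a minimal
illustration in plain Lucene, not the actual Elasticsearch DiskUsage code;
the class names and command-line arguments are made up, and it glosses over
details a real tool needs (for example, with compound files everything shows
up under a single .cfs counter).

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.*;

/** Directory wrapper that counts the bytes read from each index file. */
class BytesTrackingDirectory extends FilterDirectory {
  final Map<String, AtomicLong> bytesRead = new ConcurrentHashMap<>();

  BytesTrackingDirectory(Directory in) {
    super(in);
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    AtomicLong counter = bytesRead.computeIfAbsent(name, k -> new AtomicLong());
    return new TrackingIndexInput(super.openInput(name, context), counter);
  }

  /** Delegating IndexInput that adds every read to the file's counter. */
  private static final class TrackingIndexInput extends IndexInput {
    private final IndexInput in;
    private final AtomicLong counter;

    TrackingIndexInput(IndexInput in, AtomicLong counter) {
      super("tracking(" + in + ")");
      this.in = in;
      this.counter = counter;
    }

    @Override
    public byte readByte() throws IOException {
      counter.incrementAndGet();
      return in.readByte();
    }

    @Override
    public void readBytes(byte[] b, int off, int len) throws IOException {
      counter.addAndGet(len);
      in.readBytes(b, off, len);
    }

    // Slices and clones keep feeding the same counter as their parent.
    @Override
    public IndexInput slice(String desc, long off, long len) throws IOException {
      return new TrackingIndexInput(in.slice(desc, off, len), counter);
    }

    @Override
    public IndexInput clone() {
      return new TrackingIndexInput(in.clone(), counter);
    }

    @Override public void close() throws IOException { in.close(); }
    @Override public long getFilePointer() { return in.getFilePointer(); }
    @Override public void seek(long pos) throws IOException { in.seek(pos); }
    @Override public long length() { return in.length(); }
  }
}

/** Walks one field's postings and prints the bytes pulled from each file. */
public class FieldDiskUsageSketch {
  public static void main(String[] args) throws IOException {
    // args[0] = path to an index, args[1] = field name (both placeholders)
    try (BytesTrackingDirectory dir =
             new BytesTrackingDirectory(FSDirectory.open(Paths.get(args[0])));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      for (LeafReaderContext ctx : reader.leaves()) {
        Terms terms = ctx.reader().terms(args[1]);
        if (terms == null) {
          continue;
        }
        TermsEnum termsEnum = terms.iterator();
        PostingsEnum postings = null;
        while (termsEnum.next() != null) {
          postings = termsEnum.postings(postings, PostingsEnum.FREQS);
          while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            // just advancing; the wrapper does the accounting
          }
        }
      }
      dir.bytesRead.forEach((file, n) ->
          System.out.println(file + ": " + n.get() + " bytes read"));
    }
  }
}

In practice you would repeat the walk for each structure (postings, doc
values, points, stored fields, vectors) and diff the counters between
passes to attribute the bytes to each field.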

On Mon, Jun 13, 2022 at 11:27 PM Robert Muir <rcm...@gmail.com> wrote:

> On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
> <nhat.ngu...@elastic.co.invalid> wrote:
> >
> > Hi Michael,
> >
> > We developed similar functionality in Elasticsearch. The DiskUsage API
> > estimates the storage of each field by iterating its structures (inverted
> > index, doc values, stored fields, etc.) and tracking the number of bytes
> > read. It is fast and the resulting estimates are accurate.
> >
> > I am +1 to the proposal.
> >
>
> I like an approach such as this: enumerate the index, using something
> like FilterDirectory to track the bytes. It doesn't require you to
> force-merge all the data through addIndexes, and at the same time it
> doesn't invade the codec APIs.
> The user can always force-merge the data themselves for situations
> such as benchmarks or tracking space over time; otherwise the
> fluctuations from merges could create too much noise.
> Personally, I would suggest a separate API/tool from CheckIndex;
> perhaps this tracking could mask bugs? There is no reason to mix the
> two concerns.
> Also, the tool can be much more efficient than CheckIndex: for
> stored fields and vectors it can just retrieve the first and last
> documents, whereas CheckIndex has to slowly verify every document.
>
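
On the stored-fields point in the quoted message: one possible way to take
that shortcut (my reading of it, not necessarily what Robert had in mind and
not the actual Elasticsearch implementation) is to fetch only the first and
the last document, so the stored-fields reader is exercised at both ends,
and then charge the on-disk length of the stored-fields files instead of
decompressing every block. The class name and the file extensions below are
assumptions and presume the default codec with compound files disabled.

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;

/** Very rough stored-fields cost: touch doc 0 and maxDoc-1, then charge file lengths. */
final class StoredFieldsEstimate {
  static long estimateBytes(Directory dir, DirectoryReader reader) throws IOException {
    if (reader.maxDoc() > 0) {
      reader.document(0);                   // exercise the stored-fields reader once...
      reader.document(reader.maxDoc() - 1); // ...and once at the other end of the file
    }
    long total = 0;
    for (String file : dir.listAll()) {
      // .fdt/.fdx/.fdm are the stored-fields files of the default codec (no CFS)
      if (file.endsWith(".fdt") || file.endsWith(".fdx") || file.endsWith(".fdm")) {
        total += dir.fileLength(file);
      }
    }
    return total;
  }
}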
