> Also, the tool can be much more efficient than checkindex, e.g. for
> stored fields and vectors it can just retrieve the first and last
> documents, whereas checkindex should verify all of the documents
> slowly.
Yes, we implemented a similar heuristic in the DiskUsage API in Elasticsearch.

On Mon, Jun 13, 2022 at 11:27 PM Robert Muir <rcm...@gmail.com> wrote:
>
> On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
> <nhat.ngu...@elastic.co.invalid> wrote:
> >
> > Hi Michael,
> >
> > We developed a similar functionality in Elasticsearch. The DiskUsage API
> > estimates the storage of each field by iterating its structures (i.e.,
> > inverted index, doc-values, stored fields, etc.) and tracking the number of
> > read-bytes. The result is pretty fast and accurate.
> >
> > I am +1 to the proposal.
> >
>
> I like an approach such as this, enumerate the index, using something
> like FilterDirectory to track the bytes. It doesn't require you to
> force-merge all the data through addIndexes, and at the same time it
> doesn't invade the codec apis.
>
> The user can always force-merge the data themselves for situations
> such as benchmarks/tracking space over time, otherwise the
> fluctuations from merges could create too much noise.
>
> Personally, I would suggest separate api/tool from CheckIndex, perhaps
> this tracking could mask bugs? No reason to mix the two concerns.
>
> Also, the tool can be much more efficient than checkindex, e.g. for
> stored fields and vectors it can just retrieve the first and last
> documents, whereas checkindex should verify all of the documents
> slowly.
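
For anyone curious what the FilterDirectory idea mentioned above could look like in practice, here is a minimal sketch (not the actual Elasticsearch DiskUsage implementation, and not an existing Lucene class): a hypothetical ByteTrackingDirectory that counts the bytes read from each index file, so that exercising one field's structures through a reader opened on it gives a rough size estimate for those structures.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

/**
 * Illustrative sketch only: a FilterDirectory that tracks how many bytes are
 * read from each file. The class and field names are made up for this example.
 */
public final class ByteTrackingDirectory extends FilterDirectory {

  private final Map<String, LongAdder> bytesReadPerFile = new ConcurrentHashMap<>();

  public ByteTrackingDirectory(Directory in) {
    super(in);
  }

  /** Bytes read so far, keyed by index file name (e.g. "_0.doc", "_0.fdt"). */
  public Map<String, LongAdder> getBytesReadPerFile() {
    return bytesReadPerFile;
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    LongAdder counter = bytesReadPerFile.computeIfAbsent(name, k -> new LongAdder());
    return new TrackingIndexInput("tracking(" + name + ")", super.openInput(name, context), counter);
  }

  /** Delegating IndexInput that adds every read to the shared counter. */
  private static final class TrackingIndexInput extends IndexInput {
    private final IndexInput in;
    private final LongAdder counter;

    TrackingIndexInput(String resourceDescription, IndexInput in, LongAdder counter) {
      super(resourceDescription);
      this.in = in;
      this.counter = counter;
    }

    @Override
    public byte readByte() throws IOException {
      counter.increment();
      return in.readByte();
    }

    @Override
    public void readBytes(byte[] b, int offset, int len) throws IOException {
      counter.add(len);
      in.readBytes(b, offset, len);
    }

    @Override
    public long getFilePointer() {
      return in.getFilePointer();
    }

    @Override
    public void seek(long pos) throws IOException {
      in.seek(pos);
    }

    @Override
    public long length() {
      return in.length();
    }

    @Override
    public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
      // Codecs read most structures through slices; keep feeding the same counter.
      return new TrackingIndexInput(sliceDescription, in.slice(sliceDescription, offset, length), counter);
    }

    @Override
    public IndexInput clone() {
      return new TrackingIndexInput(toString(), in.clone(), counter);
    }

    @Override
    public void close() throws IOException {
      in.close();
    }
  }
}
```

A tool built on this would open a DirectoryReader over the wrapped directory, enumerate one field's structures at a time (terms/postings, doc values, stored fields, vectors), and diff the counters before and after each pass; as noted in the thread, per-segment results will still fluctuate with merges unless the user force-merges first.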