Oh yes, that's a clever idea. It seems like it would still take quite a while (tens of minutes?) on a larger index, though it would certainly be much faster than the force-merge solution. I guess to get any faster we would have to instrument each format. The formats generally do know how much space each field occupies, but exposing that is perhaps too much of an API change.
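
To make the byte-tracking idea concrete, here is a rough sketch of how I understand it. It is not Lucene or Elasticsearch code: FieldByteTracker is a made-up name, and plain java.io streams stand in for what would really be IndexInput instances handed out by a FilterDirectory. The assumption is that the tool says which field it is about to enumerate and every byte read after that is charged to that field.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: charges every byte read through a wrapped stream to the current field. */
class FieldByteTracker {
    private final Map<String, Long> bytesByField = new TreeMap<>();
    private String currentField = "(unattributed)";

    /** Called by the enumerating tool before it starts reading a field's structures. */
    void setCurrentField(String field) {
        currentField = field;
    }

    private void record(long n) {
        if (n > 0) {
            bytesByField.merge(currentField, n, Long::sum);
        }
    }

    /** Read-bytes per field seen so far. */
    Map<String, Long> usage() {
        return bytesByField;
    }

    /** Wraps a stream so that successful reads are counted (skipped bytes are not). */
    InputStream wrap(InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public int read() throws IOException {
                int b = super.read();
                if (b >= 0) {
                    record(1);
                }
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                int n = super.read(buf, off, len);
                record(n); // n is -1 at end of stream and is ignored by record()
                return n;
            }
        };
    }
}

class DiskUsageSketch {
    public static void main(String[] args) throws IOException {
        byte[] fakeSegment = new byte[100]; // stand-in for an index file
        FieldByteTracker tracker = new FieldByteTracker();
        try (InputStream in = tracker.wrap(new ByteArrayInputStream(fakeSegment))) {
            tracker.setCurrentField("title");
            in.read(new byte[30]); // pretend to enumerate the "title" field
            tracker.setCurrentField("body");
            in.read(new byte[70]); // pretend to enumerate the "body" field
        }
        System.out.println(tracker.usage());
    }
}
```

The cost of this approach is what I was worried about above: the numbers only appear after the tool has actually read through the structures, so the running time grows with the index size unless shortcuts like the first/last document trick apply.
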

On Tue, Jun 14, 2022 at 12:09 AM Nhat Nguyen <nhat.ngu...@elastic.co.invalid> wrote:
>
>> Also, the tool can be much more efficient than checkindex, e.g. for
>> stored fields and vectors it can just retrieve the first and last
>> documents, whereas checkindex should verify all of the documents
>> slowly.
>
> Yes, we implemented a similar heuristic in the DiskUsage API in Elasticsearch.
>
> On Mon, Jun 13, 2022 at 11:27 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>> On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
>> <nhat.ngu...@elastic.co.invalid> wrote:
>> >
>> > Hi Michael,
>> >
>> > We developed a similar functionality in Elasticsearch. The DiskUsage API
>> > estimates the storage of each field by iterating its structures (i.e.,
>> > inverted index, doc-values, stored fields, etc.) and tracking the number
>> > of read-bytes. The result is pretty fast and accurate.
>> >
>> > I am +1 to the proposal.
>> >
>>
>> I like an approach such as this, enumerate the index, using something
>> like FilterDirectory to track the bytes. It doesn't require you to
>> force-merge all the data through addIndexes, and at the same time it
>> doesn't invade the codec apis.
>> The user can always force-merge the data themselves for situations
>> such as benchmarks/tracking space over time, otherwise the
>> fluctuations from merges could create too much noise.
>> Personally, I would suggest separate api/tool from CheckIndex, perhaps
>> this tracking could mask bugs? No reason to mix the two concerns.
>> Also, the tool can be much more efficient than checkindex, e.g. for
>> stored fields and vectors it can just retrieve the first and last
>> documents, whereas checkindex should verify all of the documents
>> slowly.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org