Hi,

There are two related but orthogonal parts to this:

1. The refactoring to IOContext and hints, that Simon has described.
2. The default advice that Lucene should use out-of-the-box.

I believe that we are in good shape to complete no. 1.

For no. 2, we discussed this in the following issue
https://github.com/apache/lucene/issues/14408 - the conclusion is that we
revert the default back to NORMAL. With this, Lucene does not set
MADV_RANDOM unless the user opts in, which is greatly improved by no. 1.

-Chris.

> On 8 Aug 2025, at 09:40, Simon Cooper <simon.coo...@elastic.co.INVALID> wrote:
>
> As I've been working in this area, here's my 2c...
>
> The move from ReadAdvice to IOContext hints is as yet unfinished;
> https://github.com/apache/lucene/pull/14977 and
> https://github.com/apache/lucene/pull/14844 will finish it off. Once those
> are merged, ReadAdvice will only be used as an implementation detail of
> MMapDirectory and related classes; core Lucene classes will only deal with
> IOContext and hints. By subclassing MMapDirectory, you can modify the hints
> that are passed down to the base implementation as you need to, and/or
> specify your own hints or IOContext implementations to help refine the
> behaviour you need.
>
> It will then be up to each directory implementation to look at the hints
> specified, and use those to inform how it should open the files. At the
> moment, MMapDirectory is the only one which does this, and it does this using
> different ReadAdvices based on the hints. Exactly which ReadAdvice is used
> for a particular combination of hints can be modified. I'm also not sure
> where NORMAL or RANDOM is best used, but I've tried to keep current behaviour
> unchanged as much as possible so far.
>
> SimonC
>
> On Thu, 7 Aug 2025 at 22:03, Michael Sokolov <soko...@falutin.net.invalid>
> wrote:
> I want to raise an issue here that has come up before which is about the
> choices we have made to apply madvise flags in an opinionated way.
>
> In our environment, the choices Lucene is making are really detrimental to
> our indexing throughput. In the past we had disabled this by subclassing
> MMapDirectory (a super expert workaround). Somehow we missed the fact that
> changes in Lucene 10 made this workaround ineffective and it took us a while
> to find the new recommended workaround, which is a system property setting.
> In an excess (perhaps) of caution, instead of the sysprop we've opted to
> modify a Lucene fork to disable this in a more fundamental way (cauterizing
> PosixNativeAccess.madvise), I think hoping that this might insulate us
> against future changes in this area? But we don't want to have to engage in
> this kind of paranoid programming!
>
> Lucene has made a choice that may be good for some environments or operating
> conditions, but not for others, and the difference can be pretty dramatic.
> I'm not sure how we came to decide that the current default is better than
> the old one? I'll also say I don't really understand why the MADV_RANDOM is
> hurting us so much, but it does cause our merge operations to get much
> slower, fall behind, and pile up to the extent that low-resource environments
> (that used to work fine with MADV_NORMAL) are crumbling under the weight of
> pending merges.
>
> Another thread is that the multiple layers of abstraction we have today
> (IOContext + ReadAdvice + DataAccessHint + FileDataHint + madvise) make it
> quite difficult to reason about what OS behavior is happening for any given
> IO operation. I read the IOContext javadocs but they only give general
> information and don't explain how hints are used to determine an actual MADV
> flag. In what circumstance should I use a hint vs an advice? The
> IndexInput.updateReadAdvice javadoc actually says "provide a hint" but
> accepts an advice.
>
> So to summarize:
>
> • Selfishly, I don't like the current default MADV setting Lucene has
> chosen, although I recognize it's possible it may work for some use case.
> But I do wonder at some level if the OS's default shouldn't be a good default
> setting?
> • I find the Lucene API in this area confusing and not well-documented.
> Understanding that the IO contexts are many and varied and could profitably
> be tuned differently, I wonder if we could have a centralized and first-class
> API (not a system property) that can be used to set a memory access profile
> of some sort?
>
> I think some evidence supporting the choices we have made today (why is the
> default MADV_RANDOM) would be helpful as a starting point. Maybe there is a
> past thread I overlooked?
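
For readers trying to picture the layering discussed in this thread (hints are
passed down, and the directory implementation decides which advice to apply,
with NORMAL as the fallback), here is a minimal toy model in Java. It is only a
sketch under stated assumptions: the enums, the method names, the property name
"sketch.defaultAdvice" and the hint-to-advice policy are all hypothetical and
are not Lucene's actual classes, API or mapping.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model only: nothing in this class is part of Lucene's API.
final class ReadAdviceSketch {

  /** Stand-in for the advice a directory would eventually hand to madvise. */
  enum Advice { NORMAL, RANDOM, SEQUENTIAL }

  /** Stand-in for the access hints a caller attaches when opening a file. */
  enum Hint { RANDOM_ACCESS, SEQUENTIAL_ACCESS, MERGE }

  /** Parses a configured default; anything unset falls back to NORMAL. */
  static Advice defaultAdvice(String configured) {
    if (configured == null || configured.isBlank()) {
      return Advice.NORMAL;
    }
    return Advice.valueOf(configured.trim().toUpperCase(Locale.ROOT));
  }

  /**
   * Maps a set of hints to one advice. The policy here is an arbitrary
   * illustration: sequential-style hints win, then random access, and with
   * no relevant hint the supplied default is used.
   */
  static Advice resolve(Set<Hint> hints, Advice fallback) {
    if (hints.contains(Hint.MERGE) || hints.contains(Hint.SEQUENTIAL_ACCESS)) {
      return Advice.SEQUENTIAL;
    }
    if (hints.contains(Hint.RANDOM_ACCESS)) {
      return Advice.RANDOM;
    }
    return fallback;
  }

  public static void main(String[] args) {
    // "sketch.defaultAdvice" is a made-up property name for this example.
    Advice fallback = defaultAdvice(System.getProperty("sketch.defaultAdvice"));
    System.out.println(resolve(EnumSet.noneOf(Hint.class), fallback));
    System.out.println(resolve(EnumSet.of(Hint.RANDOM_ACCESS), fallback));
    System.out.println(resolve(EnumSet.of(Hint.MERGE), fallback));
  }
}
```

The point of the sketch is the separation of concerns: callers only describe
how they intend to access a file, and a single resolve step owns the decision
about which advice that implies, so changing the default does not require
touching the callers.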