> I mean my benchmarks show up
> to 300% improvement with 4.x versus older versions so something is
> weird ie. non-realistic here or there is a bug so lets figure this
> out. Can you profile you app and see if you find something suspicious?
> I'll try now and report back.

It seems to be largely my mistake: Maven enables assertions automatically
when running tests. Executing it as a normal public main class results in
faster indexing times for 4.0 compared to 3.5.

Conclusion:

1. Execution with assertions enabled is slower for 4.0 than for 3.5
   (that's what I mainly measured :/).
2. Luc 4.0 execution times vary more than 3.5 when using the reopen thread
   (with one single indexing thread; other setups not tested).
3. Luc 4.0 is then still slower, but for 5 million of my items it's less
   than 5%.

The hot spots are:

* 30% ThreadAffinityDocumentsWriterThreadPool -> java.util.concurrent.ConcurrentHashMap.get(Object) -> threadBindings.get
* 26% BufferedDeletesStream.applyTermDeletes(Iterable, SegmentReader)
* 16% FreqProxTermsWriterPerField.flush(String, FieldsConsumer, SegmentWriteState)
* 10% DocFieldProcessor.processDocument

Now, when reusing BytesRef in 4.0 (and reusing the char array in 3.5),
Luc 4 is >20% faster than 3.5 for 5 million docs!

But at some point I had problems because a thread concurrently modified
the docs. Can this happen, e.g. from the reopen thread? Or is it safe to
reuse BytesRef?

Regards,
Peter.

> Hi Simon,
>
> answers below.
>
>>> It does not seem to be an 'IO related issue' because using RAMDirectory
>>> results in the same times.
>>> And indexing via Luc4 with only one thread shouldn't be slower than 3.5 (?)
>> it could be since we use a different term dictionary impl which is
>> more expensive in building than the previous versions; thats just a
>> guess.
>> What I am really wondering is why you are using the NRT manager and
>> reopen during indexing - are you measuring the NRT reopen times too?
> My project requires reopening as it will then clear some caches.
>
> Reopening isn't that frequent (every 5 seconds). When disabling it the
> difference even increases slightly, but the big variation for luc4 goes
> away!
>
>
>> What merge policies are you using for 3x and 4x?
> The default ones.
> I'm now using LogByteSizeMergePolicy for both but it
> is nearly the same difference.
>
>
>>>> You should add some more randomness or reality to your test.
>>> Hmmh, ok. The uid and type is the reality in my other (experimental)
>>> project as it uses a generated and incremented id from AtomicLong and
>>> two types.
>>> Or do you have an explanation why luc4 can be slower on such 'simple'
>>> fields?
>> you reported that indexing only the ID is faster in 4.x but the other
>> fields AFAIK are likely always the same for all docs, no?
> no, the _uid field is different: it's the id field converted to string.
>
>
>> you are indexing with one thread right?
> yes.
>
>
>> I mean my benchmarks show up
>> to 300% improvement with 4.x versus older versions so something is
>> weird ie. non-realistic here or there is a bug so lets figure this
>> out. Can you profile you app and see if you find something suspicious?
> I'll try now and report back.
>
>
>> I'd also try to index way more documents to make your benchmarks run
>> little longer just to be sure.
> For ~5 times more docs (5 mio) it is nearly the same difference.
>
>
> Regards,
> Peter.
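
Regarding the assertion pitfall described in this message: below is a minimal sketch of how a benchmark main class could detect that the JVM was started with assertions enabled (for example by a test runner) before its timings are trusted. The class name AssertionCheck and the placeholder for the indexing run are hypothetical and not taken from the original benchmark. The check only reports the assertion status of this one class, which usually matches a global -ea flag but would not catch assertions enabled for selected packages only.

```java
public class AssertionCheck {

    /** Returns true if this class runs with assertions enabled (e.g. via -ea). */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment inside the assert is only evaluated when assertions are on.
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        if (assertionsEnabled()) {
            System.err.println("WARNING: assertions are enabled; timings are not representative.");
        }
        long start = System.nanoTime();
        // Placeholder (assumption): run the indexing benchmark here.
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```

Running it once with `java AssertionCheck` and once with `java -ea AssertionCheck` shows the difference. When the benchmark has to stay inside a Maven test run, the Surefire plugin's `enableAssertions` setting is the usual place to turn assertions off.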