Hi all, I'm seeing a sporadic NullPointerException (and a 500 response from Solr) on both Solr 8.11.2 and 9.4.0 while performing lots of updates (full documents and partial updates, some in-place, most not) over a long period of time. This happens during the last step of building an index from scratch: after the full reindex, which uses only whole (not partial) documents, has finished, I replay the mostly partial updates that have accumulated since the primary reindex started. I'm hoping you'll have suggestions or advice on how to investigate further, avoid the error, or file a bug report.
Please forgive the length below, but I wanted to include all the info that I think might be relevant. First, the stack trace is the following (this is on 8.11.2, but it's also a NullPointerException in `UpdateLog.applyOlderUpdates` in 9.4.0, and I haven't yet tested other versions):

    java.lang.NullPointerException
        at org.apache.solr.update.UpdateLog.applyOlderUpdates(UpdateLog.java:962)
        at org.apache.solr.update.UpdateLog.applyPartialUpdates(UpdateLog.java:934)
        at org.apache.solr.handler.component.RealTimeGetComponent.resolveFullDocument(RealTimeGetComponent.java:450)
        at org.apache.solr.handler.component.RealTimeGetComponent.getInputDocumentFromTlog(RealTimeGetComponent.java:677)
        at org.apache.solr.handler.component.RealTimeGetComponent.getInputDocumentFromTlog(RealTimeGetComponent.java:616)
        at org.apache.solr.handler.component.RealTimeGetComponent.getInputDocument(RealTimeGetComponent.java:725)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.getUpdatedDocument(DistributedUpdateProcessor.java:705)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:380)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:343)
        at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:343)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:229)
        at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:521)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:180)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:156)
        at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:84)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:82)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2637)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:791)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:564)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:357)
        at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)
        at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
        at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
        at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.Server.handle(Server.java:516)
        at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
        at java.base/java.lang.Thread.run(Thread.java:829)

These are what I think are the key facts:

- single solr instance (not solr cloud), with 1 core, with the index using about 20GB for about 30 million docs
- solrconfig is configured with auto hard commit every 300s and no auto soft commit (set to `-1` in solrconfig.xml)
- updates are happening in a single thread from Python via the default `update` endpoint using JSON, with no other solr usage at all (no queries, no other updates, etc.)
- the index is brand new and was built from scratch for this version of Solr; once the full reindex completes (it sends only updates containing whole documents), I start this "catch up" process that runs through a few days of updates that were missed
- the error only happens during partial updates
- it only seems to happen when using `commitWithin` (always supplied as a query param on the update URL)
- I've seen it occur with `commitWithin` values between 1000ms and 600,000ms and haven't found any value that works reliably; the time between errors ranges from minutes to 16 hours
- once the error happens, retrying the same partial updates that triggered the failure produces the same error; but if I immediately send an empty update call with `commit=true`, those same documents can then be added with `commitWithin` successfully, and I can continue with other updates until the next failure (a sketch of this pattern follows the examples below)
- every time the error happens, the trigger seems to be that the partial-update docs in the failing batch had been updated recently (within seconds or tens of seconds)

Here's an example of two batches of document updates that triggered a failure. First, there was a successful update (commitWithin=600000) of:

    [
      { "id": 16688142, "playlist": { "set": [ 348469 ] }, "playlist_index_348469": { "set": 0 } },
      { "id": 16688143, "playlist": { "set": [ 348469 ] }, "playlist_index_348469": { "set": 1 } }
    ]

and within 20 seconds, after some updates of other docs had succeeded, the following batch triggered the 500 error that yielded the stack trace above:

    [
      { "id": "16688142", "playlist": { "set": [ 348469 ] }, "playlist_index_348469": { "set": 1 } },
      { "id": "16688143", "playlist": { "set": [ 348469 ] }, "playlist_index_348469": { "set": 0 } }
    ]

As you can see, the doc ids are the same, and it's always the case that some of the docs in the error batch had been updated recently in a previous successful update (within both the `commitWithin` window and the hard autocommit window). The `playlist` and `playlist_index_*` fields are defined as (both with `docValues="true"`):

    <field name="playlist" type="ints" indexed="true" stored="false"/>
    <dynamicField name="playlist_index_*" type="int" indexed="false" stored="false"/>

I also observed the error when the schema for `playlist_index_*` had `indexed="true"` (the index has always been rebuilt from scratch whenever the schema changed, so it's definitely not in a bad state from a schema change without reindexing).

It's not always these exact fields that trigger the failure. I've also seen an initial update of

    [ { "id": 16688522, "playlist_index_338289": { "set": 4 } }, ... ]

followed shortly (but not immediately) thereafter by

    [ { "id": 16688522, "collection": { "add": "/1667690/1670103/1670113" } } ]

and then

    [ { "id": 16688522, "collection": { "add": "/1667690/1670103/1670113" } } ]

with the last of the three triggering the error. In this case the last update was identical to the second one and didn't need to happen, but that's not usually the case. The `collection` field is defined as:

    <field name="collection" type="descendent_path" indexed="true" stored="true" omitNorms="true" multiValued="true"/>

I'm not sure whether the error happens if all the updates use `softCommit=true` and no `commitWithin`, because that's too slow to be usable (1 to 2 seconds for an update of 1 doc). It didn't happen during the roughly one hour I let it run that way, though, so I would guess that `commitWithin` is essential.
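In case it helps clarify how the updates are being issued, here's a minimal sketch of the per-batch call and of the empty-commit workaround from the key facts above. It's Python with `requests`; the host, core name, and function names are placeholders rather than the actual script:

    import requests

    # Placeholder URL/core name; the real setup is a single Solr instance with one core.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"

    def send_batch(docs, commit_within_ms=600000):
        """POST one JSON batch of (mostly partial) updates, with commitWithin as a query param."""
        return requests.post(
            SOLR_UPDATE_URL,
            params={"commitWithin": commit_within_ms},
            json=docs,
        )

    def send_with_recovery(docs):
        """Send a batch; if Solr returns a 500, issue an empty commit=true update and retry once.

        This mirrors the workaround described above: after the NPE, an empty update with
        commit=true lets the same documents be added with commitWithin again.
        """
        resp = send_batch(docs)
        if resp.status_code == 500:
            requests.post(SOLR_UPDATE_URL, params={"commit": "true"}, json=[])
            resp = send_batch(docs)
        resp.raise_for_status()

    # A batch shaped like the partial updates that trigger the error.
    batch = [
        {"id": "16688142", "playlist": {"set": [348469]}, "playlist_index_348469": {"set": 1}},
        {"id": "16688143", "playlist": {"set": [348469]}, "playlist_index_348469": {"set": 0}},
    ]
    send_with_recovery(batch)

The actual replay loop does essentially this: one batch at a time from a single thread, waiting for each Solr response before sending the next request.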
The exact conditions that trigger the failure aren't as simple as the above, though: if I set up a manual test with updates like the ones shown above that modify the same documents, I can't reproduce it. It only happens in the course of making tons of updates (as fast as the single thread can send them, one batch at a time, and Solr can process them; i.e., the thread waits for each Solr response before sending the next request), and then only after it's been running for some period of time, and even then only on some of the occasions when batches touch the same documents, so I'm not sure what the other requirements/triggers are.

If you have any suggestions on settings changes I might try, additional information I could collect, or any other suggestions or feedback, I'd very much appreciate it.

Thanks for your time,
Calvin