Hi all,

I'm seeing a sporadic NullPointerException (and a 500 Solr response) with
Solr 8.11.2 and 9.4.0 when performing lots of updates (full documents and
partial updates, some in-place, most not) over a long period of time. This
happens during the last step of building an index from scratch: after the
full reindex, which sends only whole documents, has finished, I replay the
mostly-partial updates that accumulated while the primary reindex was
running. I'm hoping you'll have suggestions or advice on how to investigate
further, how I might avoid the error, or whether I should file a bug report.

Please forgive the length below, but I wanted to include all the info that
I think might be relevant.

First, the stack trace is the following (this is on 8.11.2, but it's also a
NullPointerException in `UpdateLog.applyOlderUpdates` in 9.4.0, and I
haven't yet tested other versions):

java.lang.NullPointerException
at org.apache.solr.update.UpdateLog.applyOlderUpdates(UpdateLog.java:962)
at org.apache.solr.update.UpdateLog.applyPartialUpdates(UpdateLog.java:934)
at org.apache.solr.handler.component.RealTimeGetComponent.resolveFullDocument(RealTimeGetComponent.java:450)
at org.apache.solr.handler.component.RealTimeGetComponent.getInputDocumentFromTlog(RealTimeGetComponent.java:677)
at org.apache.solr.handler.component.RealTimeGetComponent.getInputDocumentFromTlog(RealTimeGetComponent.java:616)
at org.apache.solr.handler.component.RealTimeGetComponent.getInputDocument(RealTimeGetComponent.java:725)
at org.apache.solr.update.processor.DistributedUpdateProcessor.getUpdatedDocument(DistributedUpdateProcessor.java:705)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:380)
at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:343)
at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:343)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:229)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:521)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:180)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:156)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:84)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:82)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2637)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:791)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:564)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:357)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.base/java.lang.Thread.run(Thread.java:829)

These are what I think are the key facts:

 - single Solr instance (not SolrCloud) with 1 core; the index uses about
20GB for about 30 million docs
 - solrconfig.xml is configured with auto hard commit every 300s and auto
soft commit disabled (set to `-1`)
 - updates are sent in a single thread from Python via the default
`update` endpoint using JSON (see the sketch after this list), with no
other Solr usage at all (no queries, no other updates, etc.)
 - the index is brand new and was built from scratch for this version of
Solr; once the full build, which sends only whole documents, completes, I
start this "catch up" process that replays a few days of updates that were
missed in the meantime
 - the error only happens doing partial updates
 - it only seems to happen when using `commitWithin` (always supplied as a
query param in the update URL)
 - I've seen it occur with `commitWithin` values between 1000ms and
600,000ms, and I haven't found any values that work reliably, but the time
between errors ranges from minutes to 16 hours
 - once the error happens, trying again to update with those same partial
updates that triggered the failure results in the same error; if I
immediately do an empty update call with `commit=true`, then those same
documents can be added with `commitWithin` again successfully and I can
continue with other updates until the next time it fails
 - every time the error happens, the docs in the failing partial-update
batch had themselves been updated recently (within seconds or tens of
seconds)
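
To make the setup concrete, here's a minimal sketch of the update loop (not
the actual code; the URL and names are made up, but the request shape and
the recovery step match what I described above):

import requests

# Hypothetical host/core; the real ones differ.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"

def send_batch(batch, commit_within_ms):
    # One batch of (mostly partial) updates as a JSON array, with
    # commitWithin supplied as a query param; the thread waits for the
    # response before sending the next batch.
    return requests.post(
        SOLR_UPDATE_URL,
        params={"commitWithin": commit_within_ms},
        json=batch,
    )

def replay(batches, commit_within_ms=600000):
    for batch in batches:
        resp = send_batch(batch, commit_within_ms)
        if resp.status_code == 500:
            # Retrying the same batch fails the same way; an empty update
            # with commit=true first lets the same docs go through with
            # commitWithin again.
            requests.post(SOLR_UPDATE_URL, params={"commit": "true"}, json=[]).raise_for_status()
            send_batch(batch, commit_within_ms).raise_for_status()
        else:
            resp.raise_for_status()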

Here's an example of two batches of document updates that triggered a
failure.

First, there was a successful update (commitWithin=600000) of

[
  {
    "id": 16688142,
    "playlist": {
      "set": [
        348469
      ]
    },
    "playlist_index_348469": {
      "set": 0
    }
  },
  {
    "id": 16688143,
    "playlist": {
      "set": [
        348469
      ]
    },
    "playlist_index_348469": {
      "set": 1
    }
  }
]

and within 20 seconds, after some updates of other docs had succeeded, the
following batch triggered the 500 error that yielded the stack trace above:

[
  {
    "id": "16688142",
    "playlist": {
      "set": [
        348469
      ]
    },
    "playlist_index_348469": {
      "set": 1
    }
  },
  {
    "id": "16688143",
    "playlist": {
      "set": [
        348469
      ]
    },
    "playlist_index_348469": {
      "set": 0
    }
  }
]

As you can see, the doc ids are the same, and it's always the case that
some of the docs in the error batch have been recently updated in a
previous successful update (within the `commitWithin` window and the hard
autocommit window).

The `playlist` and `playlist_index_*` fields are defined as (both with
`docValues="true"`):

  <field name="playlist" type="ints" indexed="true" stored="false"/>
  <dynamicField name="playlist_index_*" type="int" indexed="false" stored="false"/>

I also observed the error when the schema for `playlist_index_*` was set to
`indexed=true` (the index has always been built from scratch any time the
schema has been changed, so it's definitely not in a bad state due to a
schema change without reindexing).

It's not always these exact fields that trigger the failure. I've also seen
an initial update of

[
  {
    "id": 16688522,
    "playlist_index_338289": {
      "set": 4
    }
  }
  , ...
]

followed shortly but not immediately thereafter by

[
  {
    "id": 16688522,
    "collection": {
      "add": "/1667690/1670103/1670113"
    }
  }
]

and then

[
  {
    "id": 16688522,
    "collection": {
      "add": "/1667690/1670103/1670113"
    }
  }
]

with the last of the three triggering the error. In this case, the last
update was identical to the second one and didn't need to happen, but
that's not usually the case.

The `collection` field is defined as:

<field name="collection" type="descendent_path" indexed="true"
stored="true" omitNorms="true" multiValued="true"/>

I'm not sure whether it happens when all the updates use `softCommit=true`
and no `commitWithin`, but that mode is too slow to be usable (1 to 2
seconds per single-doc update). It didn't happen when I let it run with
`softCommit=true` and no `commitWithin` for about an hour, though, so my
guess is that `commitWithin` is essential to triggering it.
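
For concreteness, the two request styles I'm comparing differ only in the
query params (again just a sketch, reusing the hypothetical URL from the
snippet above):

# fails sporadically: commitWithin on every request
requests.post(SOLR_UPDATE_URL, params={"commitWithin": 600000}, json=batch)

# ran about an hour without the error, but takes 1 to 2 seconds per single-doc update
requests.post(SOLR_UPDATE_URL, params={"softCommit": "true"}, json=batch)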

The exact conditions that trigger the failure aren't as simple as the above,
though, because if I set up a manual test with updates like the ones I
showed, hitting the same documents, I can't reproduce it. It only happens in
the course of making tons of updates (as fast as the single thread can send
them one batch at a time and Solr can process them, i.e. the thread waits
for each Solr response before sending the next request), and then only after
it's been running for some period of time, and only on some of the occasions
when the same documents are involved, so I'm not sure what the other
requirements/triggers are.

If you have suggestions on settings changes I might try, additional
information I could collect, or any other feedback, I'd very much
appreciate it.

Thanks for your time,
Calvin
