Jay Hacker created SOLR-4090:
--------------------------------
Summary: Indexing mangles documents under load
Key: SOLR-4090
URL: https://issues.apache.org/jira/browse/SOLR-4090
Project: Solr
Issue Type: Bug
Affects Versions: 4.0
Environment: Red Hat 5.8 x86_64, Oracle JDK 1.7.0_09
Reporter: Jay Hacker
Sometimes when indexing documents with multiple concurrent processes, I get
intermittent data corruption errors. The data submitted for indexing is valid,
but Solr will complain that it is malformed and return an HTTP 500 or 400
error. The errors are similar whether submitting as XML, JSON, or CSV. The
problem has not occurred using a single process (i.e., one process POSTing to
Solr), is rare with four processes, and is pretty reproducible with 16 (or 64,
or 128). I've seen this problem since at least Solr 1.4; we generally just use
four threads and hope for the best. It seems a bit more common with Solr 4, so
I decided to track it down.
I have dummy documents for the example ("collection1") schema that, when posted
using many simultaneous processes, often generate parsing errors. I use
something like this to pound Solr with repeated POSTs of the same document:
{code}
yes docs.xml | xargs -P64 -I{} curl -s --data-binary @{} --header
'Content-type: text/xml' 'http://localhost:8983/solr/collection1/update'
{code}
Which fairly reliably gives this error:
{noformat}
SEVERE: org.apache.solr.common.SolrException: Unexpected '<' in attribute value
at [row,col {unknown-source}]: [2765,74]
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:159)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:351)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
at java.lang.Thread.run(Thread.java:722)
Caused by: com.ctc.wstx.exc.WstxParsingException: Unexpected '<' in attribute
value
at [row,col {unknown-source}]: [2765,74]
at
com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
at
com.ctc.wstx.sr.BasicStreamReader.parseNormalizedAttrValue(BasicStreamReader.java:1951)
at
com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3037)
at
com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2936)
at
com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:370)
at
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:229)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157)
... 31 more
{noformat}
When posting the same data as JSON:
{code}
yes docs.json | xargs -P64 -I{} curl -s --data-binary @{} --header
'Content-type: application/json' 'http://localhost:8983/solr/collection1/update'
{code}
The errors look like this:
{noformat}
SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=3] multiple values
encountered for non multiValued field f0008_t:
[388888888888888888888888888888888888888888888888888,
9888888888888888888888888888888888888888888888888888]
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:243)
at
org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:208)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:432)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:557)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:386)
at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:111)
at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:95)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:59)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:351)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
at java.lang.Thread.run(Thread.java:722)
{noformat}
I would like to emphasize that the stack traces are misleading: there is no
problem with the data. Often the exact same document will index just fine a
thousand times before throwing an error. Somewhere between being submitted and
parsed, the data is being occasionally corrupted. The errors are intermittent,
but when the do happen, they often complain about exactly the same byte in the
input.
Here's what I know:
* Small POSTs, or single large documents, don't seem to trigger the problem; it
seems to take multiple documents totaling ~150K or more.
* More threads trigger the error more often.
* Restricting the JVM heap to say -Xmx384M seems to trigger the problem more
often.
* Setting {{commit=true}} with each post seems to alleviate the problem.
* I can't seem to reproduce it with 3.6.1, or 4.1-2012-11-16_01-01-43 nightly
-- maybe related to [SOLR-3621|https://issues.apache.org/jira/browse/SOLR-3621]?
* To test if it was Jetty, I wrote a small servlet that just decodes JSON and
pounded it under both Jetty 8.1.2 and 8.1.7 without problems -- however,
[SOLR-4031|https://issues.apache.org/jira/browse/SOLR-4031] sounds awfully
similar, and...
* If I unzip the 4.0 tarball, and replace jetty (example/{lib/,start.jar}) with
the jetty 8.1.7 from 4.1 nightly, it seems to fix the problem.
I'm posting this in case anyone runs into similar problems, and meanwhile I
will keep testing 4.1.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]