[
https://issues.apache.org/jira/browse/SOLR-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jay Hacker resolved SOLR-4090.
------------------------------
Resolution: Duplicate
Fix Version/s: 4.1
> Indexing mangles documents under load
> -------------------------------------
>
> Key: SOLR-4090
> URL: https://issues.apache.org/jira/browse/SOLR-4090
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.0
> Environment: Red Hat 5.8 x86_64, Oracle JDK 1.7.0_09
> Reporter: Jay Hacker
> Fix For: 4.1
>
> Attachments: docs.json, docs.xml
>
>
> Sometimes when indexing documents with multiple concurrent processes, I get
> intermittent data corruption errors. The data submitted for indexing is
> valid, but Solr will complain that it is malformed and return an HTTP 500 or
> 400 error. The errors are similar whether submitting as XML, JSON, or CSV.
> The problem has not occurred using a single process (i.e., one process
> POSTing to Solr), is rare with four processes, and is pretty reproducible
> with 16 (or 64, or 128). I've seen this problem since at least Solr 1.4; we
> generally just use four threads and hope for the best. It seems a bit more
> common with Solr 4, so I decided to track it down.
> I have dummy documents for the example ("collection1") schema that, when
> posted using many simultaneous processes, often generate parsing errors. I
> use something like this to pound Solr with repeated POSTs of the same
> document:
> {code}
> yes docs.xml | xargs -P64 -I{} curl -s --data-binary @{} --header
> 'Content-type: text/xml' 'http://localhost:8983/solr/collection1/update'
> {code}
> Which fairly reliably gives this error:
> {noformat}
> SEVERE: org.apache.solr.common.SolrException: Unexpected '<' in attribute
> value
> at [row,col {unknown-source}]: [2765,74]
> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:159)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
> at org.eclipse.jetty.server.Server.handle(Server.java:351)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
> at
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
> at
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
> at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: com.ctc.wstx.exc.WstxParsingException: Unexpected '<' in
> attribute value
> at [row,col {unknown-source}]: [2765,74]
> at
> com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
> at
> com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
> at
> com.ctc.wstx.sr.BasicStreamReader.parseNormalizedAttrValue(BasicStreamReader.java:1951)
> at
> com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3037)
> at
> com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2936)
> at
> com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2848)
> at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
> at
> org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:370)
> at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:229)
> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157)
> ... 31 more
> {noformat}
> When posting the same data as JSON:
> {code}
> yes docs.json | xargs -P64 -I{} curl -s --data-binary @{} --header
> 'Content-type: application/json'
> 'http://localhost:8983/solr/collection1/update'
> {code}
> The errors look like this:
> {noformat}
> SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=3] multiple values
> encountered for non multiValued field f0008_t:
> [388888888888888888888888888888888888888888888888888,
> 9888888888888888888888888888888888888888888888888888]
> at
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:243)
> at
> org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
> at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:208)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:432)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:557)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325)
> at
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> at
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:386)
> at
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:111)
> at
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:95)
> at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:59)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
> at org.eclipse.jetty.server.Server.handle(Server.java:351)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
> at
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
> at
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
> at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
> at java.lang.Thread.run(Thread.java:722)
> {noformat}
> I would like to emphasize that the stack traces are misleading: there is no
> problem with the data. Often the exact same document will index just fine a
> thousand times before throwing an error. Somewhere between being submitted
> and parsed, the data is being occasionally corrupted. The errors are
> intermittent, but when the do happen, they often complain about exactly the
> same byte in the input.
> Here's what I know:
> * Small POSTs, or single large documents, don't seem to trigger the problem;
> it seems to take multiple documents totaling ~150K or more.
> * More threads trigger the error more often.
> * Restricting the JVM heap to say -Xmx384M seems to trigger the problem more
> often.
> * Setting {{commit=true}} with each post seems to alleviate the problem.
> * I can't seem to reproduce it with 3.6.1, or 4.1-2012-11-16_01-01-43 nightly
> -- maybe related to
> [SOLR-3621|https://issues.apache.org/jira/browse/SOLR-3621]?
> * To test if it was Jetty, I wrote a small servlet that just decodes JSON and
> pounded it under both Jetty 8.1.2 and 8.1.7 without problems -- however,
> [SOLR-4031|https://issues.apache.org/jira/browse/SOLR-4031] sounds awfully
> similar, and...
> * If I unzip the 4.0 tarball, and replace jetty (example/{lib/,start.jar})
> with the jetty 8.1.7 from 4.1 nightly, it seems to fix the problem.
> * If I take Jetty 8.1.2 from 4.0 and use it with 4.1, it breaks 4.1.
> I'm posting this in case anyone runs into similar problems, and meanwhile I
> will keep testing 4.1.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]