[
https://issues.apache.org/jira/browse/SOLR-13360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17916790#comment-17916790
]
Alex Deparvu edited comment on SOLR-13360 at 2/1/25 2:01 PM:
-------------------------------------------------------------
Just trying to make some progress here: I was able to add a PR with some
minimal tests that reproduce the issue
[https://github.com/apache/solr/pull/3112]
There is no proposed solution yet because I wanted to discuss options here
first, but I did leave some TODO notes where I think changes could go.
Just a disclaimer: I don't have a lot of knowledge in this area, so I spent
some time trying to understand what the issue is, and some of this might not
be 100% correct. Also, thanks to the wealth of details on this ticket, I was
able to come up with a relatively simple test (see
SpellCheckCollatorWithSynonymTest) but also a lower-level unit test that
unpacks all the complexity of setting up a Solr instance with correct field
types and all (see SpellCheckCollatorCollationOnlyTest).
To me this looks like an overlapping interval problem. Given a sufficiently
complicated analyzer, the tokens can have a lot of overlapping start/end
indexes, and this can cause chaos in the collation code, which seems to assume
tokens come in with strictly increasing start indexes. The twist relative to
the PR that was posted above is that not only can the start index repeat, but
you can also have tokens inside other tokens, which is a gap that also needs to
be fixed. See the unit test here
[https://github.com/apache/solr/pull/3112/files#diff-a85c16ca4fe0d8747a0c76e21460c7ec5ede698a4a83eed47e39cdc197af0c48R126-R130]
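To illustrate why overlapping offsets are a problem, here is a minimal sketch
of a replace loop that shifts every later token by a running length delta. This
is only my assumption about the general shape of the logic, based on the stack
trace (StringBuilder.replace called from getCollation); it is not the actual
SpellCheckCollator code, and the NaiveCollationDemo and Fix names are made up
for the example.

```java
final class NaiveCollationDemo {

  /** Hypothetical correction: replacement text plus the query offsets it covers. */
  static final class Fix {
    final String text;
    final int start; // inclusive offset into the original query
    final int end;   // exclusive offset into the original query

    Fix(String text, int start, int end) {
      this.text = text;
      this.start = start;
      this.end = end;
    }
  }

  /** Applies the fixes in the given order, assuming they never overlap. */
  static String collate(String query, Fix... fixes) {
    StringBuilder sb = new StringBuilder(query);
    int offset = 0; // how far earlier replacements have shifted the text
    for (Fix f : fixes) {
      sb.replace(f.start + offset, f.end + offset, f.text);
      offset += f.text.length() - (f.end - f.start);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Non-overlapping tokens with increasing start offsets work as expected.
    System.out.println(
        collate("panthera pardus", new Fix("panthera", 0, 8), new Fix("pardu", 9, 15)));

    // Two tokens covering the same interval: the first replacement shrinks the
    // text by 8 characters, so the second one is applied at start index -8.
    try {
      collate("panthera pardus", new Fix("leopard", 0, 15), new Fix("0", 0, 15));
    } catch (StringIndexOutOfBoundsException e) {
      System.out.println("failed: " + e.getMessage());
    }
  }
}
```

The first call prints `panthera pardu`; the second one ends up in the catch
block because the running offset is only valid when each token starts after
the previous one ends.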
Not really sure what the solution is. I am leaning towards a cleanup of the
correction tokens (basically remove any overlapping intervals): sort by start
index first, then remove anything that is inside the previous token's interval
(this will favor the first token over the others, and I am not sure it is a
good approach).
Example:
- search `panthera pardus`
- synonym definition `panthera pardus, leopard|0.6`
- corrections are: leopard(0, 15), 0(0, 15), 6(0, 15), panthera(0, 8),
pardu(9, 15)
- note the `0` and `6` tokens were generated from the synonym definition; even
if this is possibly a bug, I am leaving the example in because it shows what
can happen
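The cleanup idea described above could be sketched roughly like this. It is a
standalone illustration, not the code in the PR: the CorrectionCleanup and
Correction names are hypothetical, and I am assuming that "inside the previous
token's interval" can be generalized to "starts before the last kept token
ends", which also drops partial overlaps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CorrectionCleanup {

  /** Hypothetical stand-in for a correction token: replacement text plus query offsets. */
  static final class Correction {
    final String text;
    final int start; // inclusive offset into the original query
    final int end;   // exclusive offset into the original query

    Correction(String text, int start, int end) {
      this.text = text;
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return text + "(" + start + "," + end + ")";
    }
  }

  /** Sorts by start offset, then drops every token that starts before the last kept one ends. */
  static List<Correction> removeOverlaps(List<Correction> corrections) {
    List<Correction> sorted = new ArrayList<>(corrections);
    // List.sort is stable, so tokens sharing a start offset keep their incoming order.
    sorted.sort(Comparator.comparingInt((Correction c) -> c.start));

    List<Correction> kept = new ArrayList<>();
    int lastEnd = Integer.MIN_VALUE;
    for (Correction c : sorted) {
      if (c.start >= lastEnd) {
        kept.add(c);
        lastEnd = c.end;
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Correction> corrections = Arrays.asList(
        new Correction("leopard", 0, 15),
        new Correction("0", 0, 15),
        new Correction("6", 0, 15),
        new Correction("panthera", 0, 8),
        new Correction("pardu", 9, 15));
    System.out.println(removeOverlaps(corrections)); // prints [leopard(0,15)]
  }
}
```

With the corrections from the example only `leopard` survives, because it comes
first among the tokens starting at 0 and covers the whole query. A different
incoming order (say `panthera` first) would keep `panthera` and `pardu`
instead, which is exactly the "favor the first token" concern.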
More open questions:
1. I added a log capturing the data in case of a future
StringIndexOutOfBoundsException. Curious what people think: is this useful or
not? I can remove it if it is too intrusive or can leak sensitive data into
the logs.
2. I can put this behind a system property (on by default) so that if anything
happens this can be reverted to the previous behavior.
3. There is some issue with boosts in synonyms. I tried documenting it here
[https://github.com/apache/solr/pull/3112/files#diff-8a4f8ed7cdb05bd73fcaaa4b688a166db1c28ab32eca3179ec19ec43138384ccR39]
4. I think there might be an issue with the whitespace correction. I moved this
code into a dedicated method but did not have time to add any tests.
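For question 2, a system property that is on by default could look something
like the sketch below. The property name and the CollationCleanupToggle class
are placeholders I made up for illustration; nothing with this name exists in
Solr or in the PR.

```java
final class CollationCleanupToggle {

  // Hypothetical property name, used only for this example.
  static final String PROPERTY = "example.spellcheck.cleanupOverlaps";

  /** Returns true unless the property is explicitly set to something other than "true". */
  static boolean isEnabled() {
    return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
  }

  public static void main(String[] args) {
    System.out.println("cleanup enabled: " + isEnabled());
  }
}
```

Running with `-Dexample.spellcheck.cleanupOverlaps=false` would switch back to
the previous behavior without a code change.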
> StringIndexOutOfBoundsException: String index out of range: -3
> --------------------------------------------------------------
>
> Key: SOLR-13360
> URL: https://issues.apache.org/jira/browse/SOLR-13360
> Project: Solr
> Issue Type: Bug
> Affects Versions: 7.2.1
> Environment: Solr 7.2.1 - SAP Hybris 6.7.0.8
> Reporter: Ahmed Ghoneim
> Priority: Critical
> Labels: pull-request-available
> Attachments: managed-schema, managed-schema, resources.json,
> solr-config.zip
>
> Time Spent: 2h
> Remaining Estimate: 0h
>
> *I cannot execute the following query:*
> {noformat}
> http://localhost:8983/solr/master_Project_Product_flip/suggest?q=duotop&spellcheck.q=duotop&qt=/suggest&spellcheck.dictionary=de&spellcheck.collate=true{noformat}
> 4/1/2019, 1:16:07 PM ERROR true RequestHandlerBase
> java.lang.StringIndexOutOfBoundsException: String index out of range: -3
> {code:java}
> java.lang.StringIndexOutOfBoundsException: String index out of range: -3
> at java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:851)
> at java.lang.StringBuilder.replace(StringBuilder.java:262)
> at org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:252)
> at org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:94)
> at org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:297)
> at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:209)
> at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:251)
> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> 4/1/2019, 1:16:07 PM ERROR true HttpSolrCall
> null:java.lang.StringIndexOutOfBoundsException: String index out of range: -3
> *However the following query works:*
> {noformat}
> http://localhost:8983/solr/master_Project_Product_flip/suggest?q=duotop&spellcheck.q=duotop&qt=/suggest&spellcheck.dictionary=de&spellcheck.collate=false{noformat}
> Note: there's a synonym
> {noformat}
> duotop -> Duo Top
> {noformat}