[ https://issues.apache.org/jira/browse/SOLR-13360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17916790#comment-17916790 ]
Alex Deparvu edited comment on SOLR-13360 at 2/1/25 2:15 PM: ------------------------------------------------------------- Just trying to make some progress here, I was able to add a PR with some minimal tests that reproduce the issue [https://github.com/apache/solr/pull/3112] There is no proposed solution yet because I wanted to discuss options here first. but I did leave some TODO notes where I think changes could do. just a disclaimer I don't really have a lot of knowledge in this area so I did spend some time trying to understand what the issue is so some of this might not be 100% correct. also thanks to the wealth of details on this ticket I was able to come up with a relatively simple test (see SpellCheckCollatorWithSynonymTest) but also a more lower level unit test that unpacks all the complexity of setting up a Solr instance with correct field types and all (see SpellCheckCollatorCollationOnlyTest). To me this looks like an overlapping interval problem. given a sufficiently complicated analyzer, the tokens can have a lot of overlapping start/end indexes and this can cause chaos on the collation code which seems to assume tokens come in with strictly increasing start indexes. The twist to the PR that was posted above is that not only can start index repeat BUT you can have tokens inside other tokens which is a gap that also needs to be fixed. see the unit test here [https://github.com/apache/solr/pull/3112/files#diff-a85c16ca4fe0d8747a0c76e21460c7ec5ede698a4a83eed47e39cdc197af0c48R126-R130] Not really sure what the solution is, I am leaning towards a cleanup of the correction tokens (basically remove any overlapping intervals): sort by start index first, remove anything that is inside the previous token's interval (this will favor the first token over the others and I am not sure it is a good approach). example - search `panthera pardus` - synonim definition `panthera pardus, leopard|0.6` - corrections are: leopard (0, 15), 0(0,15), 6(0,15), panthera(0, 8), pardu(9, 15) - note the `0` and `6` tokens were generated from the syonym definition so even if a possible bug - I am leaving the example in because it shows what can happen More open questions 1. I added a log capturing the data in case of a future StringIndexOutOfBoundsException. curious what people think, is this useful or not. I can remove it if it is too intrusive or can leak sensitive data in the logs. 2. I can put this behind a system property (on by default) so if anything happens this can be reverted to previous behavior. 3. there is some issue with boosts in synonyms. I tried documenting here [https://github.com/apache/solr/pull/3112/files#diff-8a4f8ed7cdb05bd73fcaaa4b688a166db1c28ab32eca3179ec19ec43138384ccR39] 4. I think there might be an issue with the whitespace correction parts but did not have time to add any tests. was (Author: alex.parvulescu): Just trying to make some progress here, I was able to add a PR with some minimal tests that reproduce the issue [https://github.com/apache/solr/pull/3112] There is no proposed solution yet because I wanted to discuss options here first. but I did leave some TODO notes where I think changes could do. just a disclaimer I don't really have a lot of knowledge in this area so I did spend some time trying to understand what the issue is so some of this might not be 100% correct. also thanks to the wealth of details on this ticket I was able to come up with a relatively simple test (see SpellCheckCollatorWithSynonymTest) but also a more lower level unit test that unpacks all the complexity of setting up a Solr instance with correct field types and all (see SpellCheckCollatorCollationOnlyTest). To me this looks like an overlapping interval problem. given a sufficiently complicated analyzer, the tokens can have a lot of overlapping start/end indexes and this can cause chaos on the collation code which seems to assume tokens come in with strictly increasing start indexes. The twist to the PR that was posted above is that not only can start index repeat BUT you can have tokens inside other tokens which is a gap that also needs to be fixed. see the unit test here [https://github.com/apache/solr/pull/3112/files#diff-a85c16ca4fe0d8747a0c76e21460c7ec5ede698a4a83eed47e39cdc197af0c48R126-R130] Not really sure what the solution is, I am leaning towards a cleanup of the correction tokens (basically remove any overlapping intervals): sort by start index first, remove anything that is inside the previous token's interval (this will favor the first token over the others and I am not sure it is a good approach). example - search `panthera pardus` - synonim definition `panthera pardus, leopard|0.6` - corrections are: leopard (0, 15), 0(0,15), 6(0,15), panthera(0, 8), pardu(9, 15) - note the `0` and `6` tokens were generated from the syonym definition so even if a possible bug - I am leaving the example in because it shows what can happen More open questions 1. I added a log capturing the data in case of a future StringIndexOutOfBoundsException. curious what people think, is this useful or not. I can remove it if it is too intrusive or can leak sensitive data in the logs. 2. I can put this behind a system property (on by default) so if anything happens this can be reverted to previous behavior. 3. there is some issue with boosts in synonyms. I tried documenting here [https://github.com/apache/solr/pull/3112/files#diff-8a4f8ed7cdb05bd73fcaaa4b688a166db1c28ab32eca3179ec19ec43138384ccR39] 4. I think there might be an issue with the whitespace correction. I move this code into a dedicated metod but did not have time to add any tests. > StringIndexOutOfBoundsException: String index out of range: -3 > -------------------------------------------------------------- > > Key: SOLR-13360 > URL: https://issues.apache.org/jira/browse/SOLR-13360 > Project: Solr > Issue Type: Bug > Affects Versions: 7.2.1 > Environment: Solr 7.2.1 - SAP Hybris 6.7.0.8 > Reporter: Ahmed Ghoneim > Priority: Critical > Labels: pull-request-available > Attachments: managed-schema, managed-schema, resources.json, > solr-config.zip > > Time Spent: 2h > Remaining Estimate: 0h > > *{color:#ff0000}I cannot execute the following query:{color}* > {noformat} > http://localhost:8983/solr/master_Project_Product_flip/suggest?q=duotop&spellcheck.q=duotop&qt=/suggest&spellcheck.dictionary=de&spellcheck.collate=true{noformat} > 4/1/2019, 1:16:07 PM ERROR true RequestHandlerBase > java.lang.StringIndexOutOfBoundsException: String index out of range: -3 > {code:java} > java.lang.StringIndexOutOfBoundsException: String index out of range: -3 > at > java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:851) > at java.lang.StringBuilder.replace(StringBuilder.java:262) > at > org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:252) > at > org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:94) > at > org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:297) > at > org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:209) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > {code} > 4/1/2019, 1:16:07 PM ERROR true HttpSolrCall > null:java.lang.StringIndexOutOfBoundsException: String index out of range: -3 > {code:java} > null:java.lang.StringIndexOutOfBoundsException: String index out of range: -3 > at > java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:851) > at java.lang.StringBuilder.replace(StringBuilder.java:262) > at > org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:252) > at > org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:94) > at > org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:297) > at > org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:209) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748){code} > *{color:#14892c}However the following query works:{color}* > {noformat} > http://localhost:8983/solr/master_Project_Product_flip/suggest?q=duotop&spellcheck.q=duotop&qt=/suggest&spellcheck.dictionary=de&spellcheck.collate=false{noformat} > Note: there's a synonym > {noformat} > duotop -> Duo Top > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org