Ah ok! Thanks Adam and Xiefeng

On Sat, Jan 9, 2021 at 6:02 PM Adam Walz <[email protected]> wrote:

> It is expected that the StandardTokenizer will not break on underscores.
> The StandardTokenizer follows the Unicode UAX #29 word-boundary standard
> <https://unicode.org/reports/tr29/#Word_Boundaries>, which classifies the
> underscore as ExtendNumLet (an "extender"), and rule WB13a
> <https://unicode.org/reports/tr29/#WB13a> says not to break from
> extenders.
> This is why xiefengchang was suggesting to use a
> PatternReplaceFilterFactory after the StandardTokenizer in order to
> further split on underscores, if that is your use case.
>
> On Sat, Jan 9, 2021 at 2:58 PM Rahul Goswami <[email protected]> wrote:
>
> > Nope. The underscore is preserved right after tokenization, even before
> > it reaches any filters. You can choose the type "text_general" and try
> > an index-time analysis through the "Analysis" page on the Solr Admin UI.
> >
> > Thanks,
> > Rahul
> >
> > On Sat, Jan 9, 2021 at 8:22 AM xiefengchang <[email protected]> wrote:
> >
> > > Did you configure PatternReplaceFilterFactory?
> > >
> > > At 2021-01-08 12:16:06, "Rahul Goswami" <[email protected]> wrote:
> > > > Hello,
> > > > So recently I was debugging a problem on Solr 7.7.2 where the query
> > > > wasn't returning the desired results. It turned out that the indexed
> > > > terms were underscore-separated, but the query terms weren't. I was
> > > > under the impression that terms separated by underscores are also
> > > > tokenized by StandardTokenizerFactory, but it turns out that's not
> > > > the case. E.g., 'hello-world' would be tokenized into 'hello' and
> > > > 'world', but 'hello_world' is treated as a single token.
> > > > Is this a bug or designed behavior?
> > > >
> > > > If this is by design, it would be helpful if this behavior were
> > > > included in the documentation, since it is similar to the behavior
> > > > with periods:
> > > > https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
> > > > "Periods (dots) that are not followed by whitespace are kept as part
> > > > of the token, including Internet domain names."
> > > >
> > > > Thanks,
> > > > Rahul
>
> --
> Adam Walz
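[Editor's note] The suggestion in the thread can be sketched as a field-type definition. This is only a sketch under assumptions: the field-type name "text_underscore" is hypothetical, and it uses solr.PatternReplaceCharFilterFactory (a char filter applied before tokenization) rather than the token-level PatternReplaceFilterFactory mentioned above, since replacing underscores with spaces ahead of the tokenizer is a common way to give StandardTokenizer a break opportunity:

```xml
<!-- Hypothetical field type; the name "text_underscore" is illustrative. -->
<fieldType name="text_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Replace underscores with spaces BEFORE tokenization, so the
         StandardTokenizer sees a word-boundary opportunity there. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this sketch, 'hello_world' would be rewritten to 'hello world' before tokenization and emitted as two tokens; verify the exact behavior on the "Analysis" page of the Solr Admin UI, as suggested above.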
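[Editor's note] A rough, non-Solr illustration of the same idea: Python's regex `\w` class also counts the underscore as a word character, loosely mirroring how UAX #29 (and hence StandardTokenizer) keeps '_' inside a word while '-' breaks it:

```python
import re

# '\w+' matches runs of word characters; '_' is a word character, '-' is not.
print(re.findall(r"\w+", "hello-world"))  # ['hello', 'world']
print(re.findall(r"\w+", "hello_world"))  # ['hello_world']
```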
