Ah ok! Thanks Adam and Xiefeng

On Sat, Jan 9, 2021 at 6:02 PM Adam Walz <[email protected]> wrote:

> It is expected that the StandardTokenizer will not break on underscores.
> The StandardTokenizer follows the Unicode UAX #29 word-boundary standard
> <https://unicode.org/reports/tr29/#Word_Boundaries>, which classifies the
> underscore as ExtendNumLet (an "extender"), and rule WB13a
> <https://unicode.org/reports/tr29/#WB13a> says not to break from
> extenders.
> This is why xiefengchang was suggesting to use a
> PatternReplaceFilterFactory after the StandardTokenizer in order to
> further split on underscores, if that is your use case.
>
> On Sat, Jan 9, 2021 at 2:58 PM Rahul Goswami <[email protected]> wrote:
>
> > Nope. The underscore is preserved right after tokenization, even before
> > it reaches any filters. You can choose the type "text_general" and try
> > an index-time analysis through the "Analysis" page on the Solr Admin UI.
> >
> > Thanks,
> > Rahul
> >
> > On Sat, Jan 9, 2021 at 8:22 AM xiefengchang <[email protected]> wrote:
> >
> > > Did you configure PatternReplaceFilterFactory?
> > >
> > > At 2021-01-08 12:16:06, "Rahul Goswami" <[email protected]> wrote:
> > > > Hello,
> > > > So recently I was debugging a problem on Solr 7.7.2 where the query
> > > > wasn't returning the desired results. It turned out that the indexed
> > > > terms were underscore-separated, but the query terms weren't. I was
> > > > under the impression that terms separated by underscores are also
> > > > tokenized by StandardTokenizerFactory, but it turns out that's not
> > > > the case. E.g., 'hello-world' would be tokenized into 'hello' and
> > > > 'world', but 'hello_world' is treated as a single token.
> > > > Is this a bug or designed behavior?
> > > >
> > > > If this is by design, it would be helpful if this behavior were
> > > > included in the documentation, since it is similar to the behavior
> > > > with periods:
> > > > https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
> > > > "Periods (dots) that are not followed by whitespace are kept as part
> > > > of the token, including Internet domain names."
> > > >
> > > > Thanks,
> > > > Rahul
>
> --
> Adam Walz
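[Editor's note] The suggestion in the thread can be sketched as a field-type definition. This is only a sketch under assumptions: the field-type name "text_underscore" is hypothetical, and it uses solr.PatternReplaceCharFilterFactory (a char filter applied before tokenization) rather than the token-level PatternReplaceFilterFactory mentioned above, since replacing underscores with spaces ahead of the tokenizer is a common way to give StandardTokenizer a break opportunity:

```xml
<!-- Hypothetical field type; the name "text_underscore" is illustrative. -->
<fieldType name="text_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Replace underscores with spaces BEFORE tokenization, so the
         StandardTokenizer sees a word-boundary opportunity there. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this sketch, 'hello_world' would be rewritten to 'hello world' before tokenization and emitted as two tokens; verify the exact behavior on the "Analysis" page of the Solr Admin UI, as suggested above.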
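[Editor's note] A rough, non-Solr illustration of the same idea: Python's regex `\w` class also counts the underscore as a word character, loosely mirroring how UAX #29 (and hence StandardTokenizer) keeps '_' inside a word while '-' breaks it:

```python
import re

# '\w+' matches runs of word characters; '_' is a word character, '-' is not.
print(re.findall(r"\w+", "hello-world"))  # ['hello', 'world']
print(re.findall(r"\w+", "hello_world"))  # ['hello_world']
```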
