Re: standard tokenizer seemingly splitting on dot

Bill Tantzen Thu, 04 May 2023 11:29:36 -0700

Rahul,
No, I do not, but note that this behavior has been observed by others and
reported as a possible issue.
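
For anyone following along, here is a rough Python sketch of why the digit
before the dot seems to matter. It assumes that StandardTokenizer follows the
Unicode UAX #29 word-break rules, under which "." keeps a token together only
between two letters or between two digits. The function name
uax29_like_tokens and the ASCII-only handling are hypothetical
simplifications for illustration, not Lucene code.

```python
def uax29_like_tokens(text):
    """Tokenize ASCII text with a simplified subset of UAX #29 word rules.

    Assumption (not Lucene code): letters and digits always stick together,
    and "." joins its neighbours only when it sits between two letters or
    between two digits. Every other character splits and is dropped.
    """
    def is_letter(ch):
        return ch.isascii() and ch.isalpha()

    def is_digit(ch):
        return ch.isascii() and ch.isdigit()

    tokens, current = [], []
    for i, ch in enumerate(text):
        if is_letter(ch) or is_digit(ch):
            current.append(ch)
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # "." bridges letter.letter or digit.digit, but not a mixed pair.
        joins = bool(current) and ch == "." and (
            (is_letter(prev_ch) and is_letter(next_ch))
            or (is_digit(prev_ch) and is_digit(next_ch))
        )
        if joins:
            current.append(ch)
        else:
            if current:
                tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(uax29_like_tokens("ab00c.tif"))  # ['ab00c.tif']
print(uax29_like_tokens("ab003.tif"))  # ['ab003', 'tif']
```

If that assumption holds, "test.com" stays whole and "test7.com" splits for
the same reason: the character before the dot is a digit and the one after it
is a letter.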
Thank you!
~~Bill


On Thu, May 4, 2023 at 1:07 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> Bill,
> Do you have a WordDelimiterFilterFactory in the analysis chain (with the
> "preserveOriginal" attribute likely set to 0)?
> That would split the token on the period downstream in the analysis chain
> even if StandardTokenizer doesn't.
>
> -Rahul
>
> On Thu, May 4, 2023 at 6:22 AM Mikhail Khludnev <m...@apache.org> wrote:
>
> > Raised https://github.com/apache/lucene/issues/12264.
> > Let's look at what devs say.
> >
> > On Wed, May 3, 2023 at 6:13 PM Bill Tantzen <tantz...@umn.edu.invalid>
> > wrote:
> >
> > > Shawn,
> > > No, email addresses are not preserved -- from the docs:
> > >
> > >    - The "@" character is among the set of token-splitting punctuation,
> > >      so email addresses are not preserved as single tokens.
> > >
> > > but the non-split on "test.com" vs the split on "test7.com" is
> > > unexpected!
> > > ~~Bill
> > >
> > >
> > > On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <apa...@elyograg.org>
> > > wrote:
> > >
> > > > On 5/2/23 15:30, Bill Tantzen wrote:
> > > > > This works as I expected:
> > > > > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> > > > >
> > > > > This doesn't work as I expected
> > > > > ab003.tif -- tokenizes with a result of ab003 and tif
> > > >
> > > > I got the same behavior with ICUTokenizer, which uses ICU4J for
> > > > Unicode handling.  I am pretty sure ICU4J is IBM's implementation of
> > > > Unicode.  I think StandardTokenizer is using a different
> > > > implementation.
> > > >
> > > > I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> > > > reference icu4j version 70.1, which is dated Oct 28, 2021 on maven
> > > > central.
> > > >
> > > > Two different Unicode implementations are doing exactly the same
> > > > thing.  Is it a bug, or expected behavior?  It does mean filenames
> > > > are sometimes not being handled in the way you expect.
> > > >
> > > > I ran another check ... I had thought that StandardTokenizer
> > > > preserved email addresses as a single token ... but I am seeing that
> > > > t...@test.com is split into two terms.  It splits t...@test7.com
> > > > into three terms.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> > >
> > >
> > > --
> > > Human wheels spin round and round
> > > While the clock keeps the pace... -- John Mellencamp
> > > ________________________________________________________________
> > > Bill Tantzen    University of Minnesota Libraries
> > > 612-626-9949 (U of M)    612-325-1777 (cell)
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)
