Re: standard tokenizer seemingly splitting on dot

Bill Tantzen Wed, 03 May 2023 08:12:42 -0700

Shawn,
No, email addresses are not preserved -- from the docs:


   - The "@" character is among the set of token-splitting punctuation, so
     email addresses are not preserved as single tokens.


But the lack of a split on "test.com" versus the split on "test7.com" is unexpected!
~~Bill
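
A possible explanation, offered as a sketch and not as Lucene's actual code:
StandardTokenizer follows the Unicode UAX #29 word-boundary rules, and in
those rules "." only joins its neighbours when it sits between two letters
or between two digits. A dot between a digit and a letter ("3.t", "7.c") is a
break. The function name sketch_tokenize and the address user@test.com below
are made up for illustration, and the logic covers only "." and plain
alphanumerics, nothing else from the spec.

```python
def sketch_tokenize(text):
    """Simplified illustration of how UAX #29 treats '.'; NOT Lucene's code."""
    tokens = []
    current = []
    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Assumption: '.' is kept only in letter.letter or digit.digit context.
        keeps_dot = ch == "." and (
            (prev_ch.isalpha() and next_ch.isalpha())
            or (prev_ch.isdigit() and next_ch.isdigit())
        )
        if keeps_dot:
            current.append(ch)
        else:
            # Any other punctuation (including '@') ends the current token.
            if current:
                tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


for sample in ["ab00c.tif", "ab003.tif", "user@test.com", "user@test7.com"]:
    print(sample, "->", sketch_tokenize(sample))
```

If that reading is right, the filename and email results in this thread are
the same rule at work, and it would also be why ICUTokenizer agrees: both
tokenizers implement the same specification.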


On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/2/23 15:30, Bill Tantzen wrote:
> > This works as I expected:
> > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> >
> > This doesn't work as I expected
> > ab003.tif -- tokenizes with a result of ab003 and tif
>
> I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
> handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.  I
> think StandardTokenizer is using a different implementation.
>
> I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> reference icu4j version 70.1, which is dated Oct 28, 2021 on maven central.
>
> Two different Unicode implementations are doing exactly the same thing.
> Is it a bug, or expected behavior?  It does mean filenames are sometimes
> not being handled in the way you expect.
>
> I ran another check ... I had thought that StandardTokenizer preserved
> email addresses as a single token ... but I am seeing that t...@test.com
> is split into two terms.  It splits t...@test7.com into three terms.
>
> Thanks,
> Shawn
>


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)
