Re: standard tokenizer seemingly splitting on dot

Shawn Heisey Wed, 03 May 2023 08:04:55 -0700

On 5/2/23 15:30, Bill Tantzen wrote:

> This works as I expected:
> ab00c.tif -- tokenizes as it should with a value of ab00c.tif
>
> This doesn't work as I expected:
> ab003.tif -- tokenizes with a result of ab003 and tif

I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode handling. I am pretty sure ICU4J is IBM's implementation of Unicode. I think StandardTokenizer is using a different implementation.

I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses reference icu4j version 70.1, which is dated Oct 28, 2021 on Maven Central.

Two different Unicode implementations are doing exactly the same thing. Is it a bug, or expected behavior? Either way, it does mean filenames are sometimes not tokenized the way you expect.

I ran another check ... I had thought that StandardTokenizer preserved email addresses as a single token ... but I am seeing that t...@test.com is split into two terms. It splits t...@test7.com into three terms.
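
For what it's worth, all four results are consistent with the word-boundary rules in Unicode Standard Annex #29 (rules WB6/WB7 and WB11/WB12): a dot does not break a word when it sits between two letters or between two digits, but it does break when it sits between a digit and a letter. Below is a rough Python sketch of just that one rule. It is not Lucene's or ICU4J's actual code, the function name sketch_tokenize is made up, it ignores nearly all of the real specification, and user@test.com / user@test7.com are placeholder addresses standing in for the obscured ones above.

```python
def sketch_tokenize(text):
    """Very rough approximation of how a '.' is treated; NOT the real tokenizer."""
    tokens, current = [], []
    for i, ch in enumerate(text):
        if ch.isalnum():
            # Letters and digits always extend the current token.
            current.append(ch)
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Assumed simplification: a dot stays inside a token only when it is
        # flanked by two letters or by two digits.
        keeps_token = ch == "." and (
            (prev_ch.isalpha() and next_ch.isalpha())
            or (prev_ch.isdigit() and next_ch.isdigit())
        )
        if keeps_token:
            current.append(ch)
        elif current:
            # Any other character (including '@') ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


for sample in ["ab00c.tif", "ab003.tif", "user@test.com", "user@test7.com"]:
    print(sample, "->", sketch_tokenize(sample))
```

Under those assumptions the sketch gives one token for ab00c.tif, two for ab003.tif, two for user@test.com, and three for user@test7.com, which matches what I saw.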


Thanks,
Shawn
