On 5/2/23 15:30, Bill Tantzen wrote:
> This works as I expected:
> ab00c.tif -- tokenizes as it should with a value of ab00c.tif
>
> This doesn't work as I expected:
> ab003.tif -- tokenizes with a result of ab003 and tif

I got the same behavior with ICUTokenizer, which uses ICU4J for its Unicode handling. I am pretty sure ICU4J is the Java version of ICU, the Unicode library that originated at IBM. I think StandardTokenizer uses a different implementation of the Unicode word-break rules.

I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses reference icu4j version 70.1, which is dated Oct 28, 2021 on Maven Central.

Two different Unicode implementations are producing exactly the same split, which makes me lean toward expected behavior rather than a bug, though I can't say for sure. Either way, it does mean filenames are sometimes not tokenized the way you expect.

I ran another check ... I had thought that StandardTokenizer preserved email addresses as a single token ... but I am seeing that t...@test.com is split into two terms, and t...@test7.com is split into three.
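
For what it's worth, both results look consistent with the word-break rules in Unicode Standard Annex #29 as I read them: a period joins two letters or two digits, but not a digit and a letter. Here is a rough Python sketch of just that rule, ASCII only. It is my own simplification for illustration, not the code that StandardTokenizer or ICUTokenizer actually runs. The tokenize function and the example.com addresses are made up for the demonstration.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
# Simplified, ASCII-only stand-ins for the UAX #29 "mid" classes (assumption):
MID_FOR_LETTERS = set(".':")   # may sit between two letters
MID_FOR_DIGITS = set(".',;")   # may sit between two digits


def tokenize(text):
    """Split text roughly the way the UAX #29 word rules treat ASCII input.

    Hypothetical helper: it ignores underscores, non-ASCII text, and every
    other rule in the real specification.
    """
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        if ch in LETTERS or ch in DIGITS:
            # Letters and digits always extend the current token.
            current += ch
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        joins_letters = ch in MID_FOR_LETTERS and prev in LETTERS and nxt in LETTERS
        joins_digits = ch in MID_FOR_DIGITS and prev in DIGITS and nxt in DIGITS
        if joins_letters or joins_digits:
            # Punctuation flanked by two letters or two digits stays in the token.
            current += ch
        elif current:
            # Anything else ends the token and is dropped.
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    for sample in ("ab00c.tif", "ab003.tif", "user@example.com", "user@example7.com"):
        print(sample, "->", tokenize(sample))
```

With this sketch, ab00c.tif stays whole because the period sits between "c" and "t", while ab003.tif splits because the period sits between "3" and "t". The same thing happens to the domain part of an address once a digit lands right before the dot.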

Thanks,
Shawn
