On 5/2/23 15:30, Bill Tantzen wrote:
> This works as I expected:
> ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> This doesn't work as I expected:
> ab003.tif -- tokenizes with a result of ab003 and tif
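I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
handling. ICU4J is the Java version of ICU (International Components
for Unicode), which originated at IBM. StandardTokenizer does not use
ICU4J; as far as I know it has its own implementation of the Unicode
word break rules (UAX #29).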
I'm on Solr 9.3.0-SNAPSHOT. The ICU analysis components it uses
reference icu4j version 70.1, which is dated Oct 28, 2021 on Maven Central.
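So two independent implementations split the filename in exactly the
same way. Is it a bug, or expected behavior? I can't say for certain,
but the agreement suggests it comes from the Unicode rules rather than
from either tokenizer. Either way, filenames are sometimes not
tokenized the way you expect.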
I ran another check. I had thought that StandardTokenizer preserved
email addresses as a single token, but I am seeing that t...@test.com
is split into two terms, and t...@test7.com is split into three.
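For what it's worth, here is a minimal sketch of why a digit in front of
the period changes the result. It assumes both tokenizers follow the word
break rules in Unicode Standard Annex #29 (WB5 through WB12), and it only
covers ASCII letters, digits and a few punctuation marks. It is not
Lucene's or ICU's actual code. The names char_class, is_boundary and
tokenize are mine, and user@example.com / user@example7.com are made-up
stand-ins for the addresses above.

```python
# Simplified ASCII-only model of UAX #29 word break rules WB5-WB12.
# Hypothetical helper names; not the Lucene or ICU implementation.

WORD = {"ALetter", "Numeric"}


def char_class(ch):
    """Map a character to a (simplified) Word_Break class."""
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    if ch.isascii() and ch.isdigit():
        return "Numeric"
    if ch in ".'":
        # The apostrophe is really Single_Quote, but these rules
        # treat it the same way as MidNumLet.
        return "MidNumLet"
    if ch in ",;":
        return "MidNum"
    return "Other"  # '@', '-', spaces, everything else


def is_boundary(classes, i):
    """True if there is a word break between position i-1 and i."""
    before, after = classes[i - 1], classes[i]
    before2 = classes[i - 2] if i >= 2 else None
    after2 = classes[i + 1] if i + 1 < len(classes) else None

    # WB5, WB8, WB9, WB10: letters and digits stick together.
    if before in WORD and after in WORD:
        return False
    # WB6 / WB7: letter, mid-punctuation, letter (e.g. "c.t").
    if before == "ALetter" and after == "MidNumLet" and after2 == "ALetter":
        return False
    if before2 == "ALetter" and before == "MidNumLet" and after == "ALetter":
        return False
    # WB11 / WB12: digit, mid-punctuation, digit (e.g. "3.1" or "3,1").
    mid_num = {"MidNum", "MidNumLet"}
    if before2 == "Numeric" and before in mid_num and after == "Numeric":
        return False
    if before == "Numeric" and after in mid_num and after2 == "Numeric":
        return False
    # No rule covers digit + "." + letter, so "3.t" breaks on both sides.
    return True


def tokenize(text):
    """Split text at word boundaries, keeping segments with a letter or digit."""
    classes = [char_class(c) for c in text]
    tokens, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or is_boundary(classes, i):
            segment = text[start:i]
            if any(c.isalnum() for c in segment):
                tokens.append(segment)
            start = i
    return tokens


for sample in ["ab00c.tif", "ab003.tif", "user@example.com", "user@example7.com"]:
    print(sample, "->", tokenize(sample))
```

Under these rules a period stays inside a token only when it has letters
on both sides or digits on both sides. That would explain why c.tif holds
together while 3.tif breaks, and why the address whose domain ends in a
digit gets one more term than the other.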
Thanks,
Shawn