Shawn,

No, email addresses are not preserved -- from the docs:
- The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

But the non-split on "test.com" vs the split on "test7.com" is unexpected!
~~Bill

On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/2/23 15:30, Bill Tantzen wrote:
> > This works as I expected:
> > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> >
> > This doesn't work as I expected
> > ab003.tif -- tokenizes with a result of ab003 and tif
>
> I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
> handling. I am pretty sure ICU4J is IBM's implementation of Unicode. I
> think StandardTokenizer is using a different implementation.
>
> I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> reference icu4j version 70.1, which is dated Oct 28, 2021 on maven central.
>
> Two different Unicode implementations are doing exactly the same thing.
> Is it a bug, or expected behavior? It does mean filenames are sometimes
> not being handled in the way you expect.
>
> I ran another check ... I had thought that StandardTokenizer preserved
> email addresses as a single token ... but I am seeing that t...@test.com
> is split into two terms. It splits t...@test7.com into three terms.
>
> Thanks,
> Shawn

--
Human wheels spin round and round
While the clock keeps the pace...
-- John Mellencamp
________________________________________________________________
Bill Tantzen
University of Minnesota Libraries
612-626-9949 (U of M)
612-325-1777 (cell)
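
P.S. One possible explanation for the "test.com" vs "test7.com" difference: StandardTokenizer is documented as implementing the word-break rules of Unicode Standard Annex #29, and under those rules a period between two letters does not break a word, a period between two digits does not break a word, but a period between a digit and a letter does. The sketch below is a rough, hand-written approximation of just those rules in Python. It is not Lucene's or ICU4J's code, the function names are made up for illustration, it only treats "." and "," specially, and "user@..." is a placeholder address rather than the one from the thread.

```python
LETTER, DIGIT, MID_NUM_LET, MID_NUM, OTHER = "L", "N", "MNL", "MN", "X"


def kind(ch):
    """Very rough stand-in for the UAX #29 Word_Break property."""
    if ch.isalpha():
        return LETTER
    if ch.isdigit():
        return DIGIT
    if ch == ".":
        return MID_NUM_LET   # may sit inside a word or inside a number
    if ch == ",":
        return MID_NUM       # may sit inside a number only
    return OTHER             # everything else, including "@", always splits


def no_break(text, i):
    """True if there is no word boundary between text[i-1] and text[i]."""
    prev, cur = kind(text[i - 1]), kind(text[i])
    prev2 = kind(text[i - 2]) if i >= 2 else OTHER
    nxt = kind(text[i + 1]) if i + 1 < len(text) else OTHER
    if prev in {LETTER, DIGIT} and cur in {LETTER, DIGIT}:
        return True                      # WB5, WB8, WB9, WB10
    if prev == LETTER and cur == MID_NUM_LET and nxt == LETTER:
        return True                      # WB6: letter x "." letter
    if prev2 == LETTER and prev == MID_NUM_LET and cur == LETTER:
        return True                      # WB7: letter "." x letter
    if prev == DIGIT and cur in {MID_NUM, MID_NUM_LET} and nxt == DIGIT:
        return True                      # WB12: digit x "." digit
    if prev2 == DIGIT and prev in {MID_NUM, MID_NUM_LET} and cur == DIGIT:
        return True                      # WB11: digit "." x digit
    return False


def tokenize(text):
    """Split at word boundaries and drop punctuation-only segments."""
    tokens, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or not no_break(text, i):
            segment = text[start:i]
            if any(c.isalnum() for c in segment):
                tokens.append(segment)
            start = i
    return tokens


for sample in ["ab00c.tif", "ab003.tif", "user@test.com", "user@test7.com"]:
    print(sample, "->", tokenize(sample))
```

If this reading is right, "ab003.tif" and "test7.com" split because the character before the period is a digit and the one after it is a letter, which none of the "do not break" rules cover. That would also explain why StandardTokenizer and ICUTokenizer agree: both would be following the same specification rather than sharing a bug.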