Rahul,
No, I do not have a WordDelimiterFilterFactory in the analysis chain, but note that this behavior has been observed by others and reported as a possible issue.
Thank you!
~~Bill

On Thu, May 4, 2023 at 1:07 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> Bill,
> Do you have a WordDelimiterFilterFactory in the analysis chain (with
> "preserveOriginal" attribute likely set to 0)?
> That would split the token on the period downstream in the analysis chain
> even if StandardTokenizer doesn't.
>
> -Rahul
>
> On Thu, May 4, 2023 at 6:22 AM Mikhail Khludnev <m...@apache.org> wrote:
>
> > Raised https://github.com/apache/lucene/issues/12264.
> > Let's look at what devs say.
> >
> > On Wed, May 3, 2023 at 6:13 PM Bill Tantzen <tantz...@umn.edu.invalid>
> > wrote:
> >
> > > Shawn,
> > > No, email addresses are not preserved -- from the docs:
> > >
> > > - The "@" character is among the set of token-splitting punctuation, so
> > >   email addresses are not preserved as single tokens.
> > >
> > > but the non-split on "test.com" vs the split on "test7.com" is
> > > unexpected!
> > > ~~Bill
> > >
> > > On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <apa...@elyograg.org>
> > > wrote:
> > >
> > > > On 5/2/23 15:30, Bill Tantzen wrote:
> > > > > This works as I expected:
> > > > > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> > > > >
> > > > > This doesn't work as I expected
> > > > > ab003.tif -- tokenizes with a result of ab003 and tif
> > > >
> > > > I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
> > > > handling. I am pretty sure ICU4J is IBM's implementation of Unicode. I
> > > > think StandardTokenizer is using a different implementation.
> > > >
> > > > I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> > > > reference icu4j version 70.1, which is dated Oct 28, 2021 on maven
> > > > central.
> > > >
> > > > Two different Unicode implementations are doing exactly the same thing.
> > > > Is it a bug, or expected behavior? It does mean filenames are sometimes
> > > > not being handled in the way you expect.
> > > >
> > > > I ran another check ... I had thought that StandardTokenizer preserved
> > > > email addresses as a single token ... but I am seeing that t...@test.com
> > > > is split into two terms. It splits t...@test7.com into three terms.
> > > >
> > > > Thanks,
> > > > Shawn
> > >
> > > --
> > > Human wheels spin round and round
> > > While the clock keeps the pace... -- John Mellencamp
> > > ________________________________________________________________
> > > Bill Tantzen University of Minnesota Libraries
> > > 612-626-9949 (U of M) 612-325-1777 (cell)
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!

--
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen University of Minnesota Libraries
612-626-9949 (U of M) 612-325-1777 (cell)
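
A possible explanation, offered as a sketch and not confirmed by anyone in this thread: the StandardTokenizer documentation says it follows the word-break rules of Unicode Standard Annex #29. Under those rules a period is kept inside a token only when it sits between two letters or between two digits, so a digit followed by a period and then a letter is a break point. The Python below is a simplified, ASCII-only illustration of that subset of rules. It is not Lucene or ICU code, and `category` and `tokenize` are hypothetical helper names chosen for this example.

```python
# Simplified, ASCII-only sketch of a few UAX #29 word-break rules.
# Assumption: this mirrors only the rules relevant to the filenames above;
# it is NOT the Lucene StandardTokenizer or ICU implementation.

MID_NUM_LET = {"."}  # in UAX #29, FULL STOP has Word_Break = MidNumLet


def category(ch):
    """Map a character to a (very reduced) Word_Break category."""
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    if ch.isascii() and ch.isdigit():
        return "Numeric"
    if ch in MID_NUM_LET:
        return "MidNumLet"
    return "Other"


def tokenize(text):
    """Split text into tokens using the reduced rule set."""
    tokens, current = [], ""
    for i, ch in enumerate(text):
        cat = category(ch)
        # Letters and digits always extend the current token
        # (no break between letters, digits, or letter/digit mixes).
        if cat in ("ALetter", "Numeric"):
            current += ch
            continue
        # A period stays inside the token only when both neighbours are
        # letters, or both neighbours are digits.
        if cat == "MidNumLet" and current and i + 1 < len(text):
            prev_cat = category(text[i - 1])
            next_cat = category(text[i + 1])
            if prev_cat == next_cat and prev_cat in ("ALetter", "Numeric"):
                current += ch
                continue
        # Anything else ends the current token and is itself discarded.
        if current:
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


print(tokenize("ab00c.tif"))  # letter . letter: period is kept
print(tokenize("ab003.tif"))  # digit . letter: break at the period
```

With these rules, `ab00c.tif` stays a single token while `ab003.tif` becomes `ab003` and `tif`, and `test.com` stays whole while `test7.com` splits, which matches the behavior reported above. Whether Lucene should deviate from the annex for such cases is the question raised in the linked issue.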