OK, I see what's going on. I should not have used a generic example like XYZ.
In my specific case, as you can see, I'm working with filenames.

This works as I expected:
ab00c.tif -- tokenizes as it should, with a value of ab00c.tif

This doesn't work as I expected:
ab003.tif -- tokenizes with a result of ab003 and tif

That is, the standard tokenizer treats a dot as described in the docs when it is preceded by an alpha character. It treats a dot like any other delimiter when it is preceded by a numeric character; that is, it creates two tokens. (This may be documented in the linked unicode.org page in that section of the docs, but honestly that page went way over my head...)

So at least it works as advertised, except in the edge case where the dot is preceded by a numeric character. I don't know why that is the case, but I can work with that!

Thanks to everybody who weighed in on this!
~~Bill

On Tue, May 2, 2023 at 3:56 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/2/23 13:16, Bill Tantzen wrote:
> > This tokenizer splits the text field into tokens, treating whitespace and
> > punctuation as delimiters.
> > Delimiter characters are discarded, with the following exceptions:
> > Periods (dots) that are not followed by whitespace are kept as part of the
> > token, including Internet domain names.
>
> I checked on a dev version (9.3.0-SNAPSHOT) and StandardTokenizer does
> indeed do exactly what the docs say.
>
> The analysis definition in the fieldType probably has things beyond the
> StandardTokenizer, one or more filters that DO break up terms on a period.
>
> Thanks,
> Shawn

--
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen
University of Minnesota Libraries
612-626-9949 (U of M)
612-325-1777 (cell)
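
For anyone reading this thread later: the two filename results reported in this message are consistent with the word-break rules in Unicode UAX #29, where a dot is kept only between two letters or between two digits, and a digit followed by a dot and a letter is split. The Python sketch below is a simplified model of just that rule, not the StandardTokenizer implementation; the function name `tokenize` is hypothetical, and the assumption that these rules are what produce the split here has not been confirmed against Solr itself.

```python
def tokenize(text):
    """Simplified model of how a dot is treated inside a token.

    Assumption: a dot stays inside a token only when it sits between
    two letters or between two digits. Every other non-alphanumeric
    character ends the current token and is discarded.
    """
    tokens = []
    current = []
    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        dot_is_kept = ch == "." and (
            (prev_ch.isalpha() and next_ch.isalpha())
            or (prev_ch.isdigit() and next_ch.isdigit())
        )
        if dot_is_kept:
            current.append(ch)
        elif current:
            # Delimiter: close the token built so far.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    for name in ("ab00c.tif", "ab003.tif"):
        print(name, "->", tokenize(name))
```

Under this model, ab00c.tif stays whole because the dot has a letter on both sides, while ab003.tif splits because the dot has a digit on the left and a letter on the right.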