That looks like a bug. Seems to be splitting if the character class before and after differ, but not if they are the same.

ST  XYZ123 tif
SF  XYZ123 tif
LCF xyz123 tif

and

ST  XYZ 123tif
SF  XYZ 123tif
LCF xyz 123tif

But...

ST  XYZ123.123tif
SF  XYZ123.123tif
LCF xyz123.123tif

On Tue, May 2, 2023 at 5:30 PM Bill Tantzen <tantz...@umn.edu.invalid> wrote:

> OK, I see what's going on. I should not have used a generic example like
> XYZ.
>
> In my specific case, as you can see, I'm working with filenames.
>
> This works as I expected:
> ab00c.tif -- tokenizes as it should with a value of ab00c.tif
>
> This doesn't work as I expected
> ab003.tif -- tokenizes with a result of ab003 and tif
>
> That is, the standard tokenizer treats dot as described in the docs when it
> is preceded by an alpha character.
> It treats dot as any other delimiter when it is preceded by a numeric
> character, that is, it creates two tokens.
>
> (This is maybe documented in the linked unicode.org page in that section of
> the docs, but honestly that page went way over my head...)
>
> So at least it works as advertised except in the edge case where the dot is
> preceded by a numeric. I don't know why that is the case, but I can work
> with that!
>
> Thanks to everybody who weighed in on this!
> ~~Bill
>
> On Tue, May 2, 2023 at 3:56 PM Shawn Heisey <apa...@elyograg.org> wrote:
>
> > On 5/2/23 13:16, Bill Tantzen wrote:
> > > This tokenizer splits the text field into tokens, treating whitespace and
> > > punctuation as delimiters.
> > > Delimiter characters are discarded, with the following exceptions:
> > > Periods (dots) that are not followed by whitespace are kept as part of the
> > > token, including Internet domain names.
> >
> > I checked on a dev version (9.3.0-SNAPSHOT) and StandardTokenizer does
> > indeed do exactly what the docs say.
> >
> > The analysis definition in the fieldType probably has things beyond the
> > StandardTokenizer, one or more filters that DO break up terms on a period.
> >
> > Thanks,
> > Shawn
>
> --
> Human wheels spin round and round
> While the clock keeps the pace... -- John Mellencamp
> ________________________________________________________________
> Bill Tantzen University of Minnesota Libraries
> 612-626-9949 (U of M) 612-325-1777 (cell)

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
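
For anyone who wants to play with the pattern described at the top of this message (a dot splits when the characters on either side are of different classes, and is kept when they are the same), here is a rough Python sketch. It is only a toy model of the observed behaviour, not Lucene's StandardTokenizer and not a full implementation of the Unicode word-break rules. The function name `split_on_dots` is made up for illustration, and the `XYZ...` inputs are my assumption, since the analysis output above shows only the resulting tokens.

```python
def split_on_dots(text):
    """Toy model: keep a dot only when both neighbours are letters
    or both neighbours are digits; otherwise treat it as a delimiter.
    This is an illustration, not Lucene's actual algorithm."""
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        if ch != ".":
            current += ch
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        same_class = (prev_ch.isalpha() and next_ch.isalpha()) or (
            prev_ch.isdigit() and next_ch.isdigit()
        )
        if same_class:
            current += ch          # dot stays inside the token
        else:
            if current:
                tokens.append(current)
            current = ""           # dot is discarded, token ends here
    if current:
        tokens.append(current)
    return tokens


# Filenames from Bill's message
print(split_on_dots("ab00c.tif"))      # ['ab00c.tif']
print(split_on_dots("ab003.tif"))      # ['ab003', 'tif']

# Assumed inputs behind the analysis output at the top of this message
print(split_on_dots("XYZ123.tif"))     # ['XYZ123', 'tif']
print(split_on_dots("XYZ.123tif"))     # ['XYZ', '123tif']
print(split_on_dots("XYZ123.123tif"))  # ['XYZ123.123tif']
```

The sketch only looks at the single character on each side of the dot, which is enough to reproduce the five cases in this thread. Real input with other punctuation or non-ASCII text would need the actual tokenizer.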