Re: standard tokenizer seemingly splitting on dot

Bill Tantzen Tue, 02 May 2023 14:30:29 -0700

OK, I see what's going on.  I should not have used a generic example like
XYZ.

In my specific case, as you can see, I'm working with filenames.

This works as I expected:
ab00c.tif -- tokenizes as it should with a value of ab00c.tif

This doesn't work as I expected:
ab003.tif -- tokenizes with a result of ab003 and tif

That is, the standard tokenizer keeps the dot inside the token, as described
in the docs, when the dot is preceded by an alpha character.
When the dot is preceded by a numeric character, it treats the dot like any
other delimiter and creates two tokens.
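
The behavior described above is consistent with the word-break rules in
Unicode UAX #29, which StandardTokenizer follows: a dot does not break a
token when it sits between two letters or between two digits, but nothing
holds the token together when the dot sits between a digit and a letter.
The Python sketch below is only a simplified model of those two cases,
not Lucene's implementation; the function name sketch_tokenize is
hypothetical, and every other character class is ignored.

```python
def sketch_tokenize(text: str) -> list[str]:
    """Simplified model of how a dot is handled between letters and digits.

    Assumption: only the letter.letter and digit.digit cases keep the dot;
    every other non-alphanumeric character is treated as a delimiter.
    """
    tokens = []
    current = []
    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
            continue
        if ch == "." and 0 < i < len(text) - 1:
            prev, nxt = text[i - 1], text[i + 1]
            both_letters = prev.isalpha() and nxt.isalpha()
            both_digits = prev.isdigit() and nxt.isdigit()
            if both_letters or both_digits:
                # The dot stays inside the current token.
                current.append(ch)
                continue
        # Any other character (or an unprotected dot) ends the token.
        if current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(sketch_tokenize("ab00c.tif"))  # ['ab00c.tif']
print(sketch_tokenize("ab003.tif"))  # ['ab003', 'tif']
```

Under this model the two filenames above behave as observed: the dot in
ab00c.tif has letters on both sides, while the dot in ab003.tif has a digit
on the left and a letter on the right.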

(This is maybe documented in the linked unicode.org page in that section of
the docs, but honestly that page went way over my head...)

So at least it works as advertised except in the edge case where the dot is
preceded by a numeric.  I don't know why that is the case, but I can work
with that!

Thanks to everybody who weighed in on this!
~~Bill

On Tue, May 2, 2023 at 3:56 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/2/23 13:16, Bill Tantzen wrote:
> > This tokenizer splits the text field into tokens, treating whitespace and
> > punctuation as delimiters.
> > Delimiter characters are discarded, with the following exceptions:
> > Periods (dots) that are not followed by whitespace are kept as part of the
> > token, including Internet domain names.
>
> I checked on a dev version (9.3.0-SNAPSHOT) and StandardTokenizer does
> indeed do exactly what the docs say.
>
> The analysis definition in the fieldType probably has things beyond the
> StandardTokenizer, one or more filters that DO break up terms on a period.
>
> Thanks,
> Shawn
>

-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)
