Re: standard tokenizer seemingly splitting on dot

Bill Tantzen Tue, 02 May 2023 12:17:23 -0700

Thanks Dave!
Using a string field instead would work fine for my purposes, I think...
I'm just trying to understand why it doesn't work with a field of type
text_general which uses the standard tokenizer in both the index and the
query analyzer.  The docs state:


This tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters.
Delimiter characters are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the
token, including Internet domain names.

That's what is confusing me...  Meanwhile, I'm going to take your
suggestion and convert the field to a string!
~~Bill
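
A note on the behavior discussed here: as far as I know, the Lucene StandardTokenizer follows the Unicode UAX #29 word-break rules rather than the "dot not followed by whitespace" rule quoted above. Under those rules a dot stays inside a token only when it sits between two letters or between two digits. In ab00001.tif the dot sits between a digit and a letter, so the text breaks there. The sketch below is a rough approximation of that one rule. The function name `tokenize` and the simplified logic are assumptions made for illustration; this is not Lucene's implementation and it ignores the rest of UAX #29.

```python
def tokenize(text):
    """Rough approximation of how a UAX #29 style tokenizer treats dots.

    Hypothetical helper for illustration only: letters and digits form
    tokens, and a dot is kept inside a token only when it is flanked by
    two letters or by two digits. Everything else is a delimiter.
    """
    tokens = []
    current = []
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if ch.isalnum():
            # Letters and digits always extend the current token.
            current.append(ch)
        elif ch == "." and current and (
            (prev_ch.isalpha() and next_ch.isalpha())
            or (prev_ch.isdigit() and next_ch.isdigit())
        ):
            # Dot between letter/letter or digit/digit: no break.
            current.append(ch)
        else:
            # Any other character ends the current token and is discarded.
            if current:
                tokens.append("".join(current))
                current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("ab00001.tif"))   # ['ab00001', 'tif']
print(tokenize("example.com"))   # ['example.com']
print(tokenize("3.14"))          # ['3.14']
```

If this approximation holds, a file name whose base ends in a digit will always be split from its extension, which is consistent with a search for ab00001 matching ab00001.tif, ab00001.jpg and ab00001.png. The Analysis screen in the Solr admin UI is the authoritative way to check what a given field type actually produces.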

On Tue, May 2, 2023 at 1:40 PM Dave <hastings.recurs...@gmail.com> wrote:

> You’re not doing anything wrong; a dot is not a word character, so it splits
> the field in both the index and the query. If you used a string instead, it
> would theoretically preserve the non-word characters, but it would also lead
> to stricter search constraints. If you try this, you need to reindex a couple
> of documents to make sure you are getting what you want.
>
> -Dave
>
> > On May 2, 2023, at 2:22 PM, Bill Tantzen <tantz...@umn.edu.invalid> wrote:
> >
> > I'm using the solrconfig.xml from the distribution,
> > ./server/solr/configsets/_default/conf/solrconfig.xml
> >
> > But this problem extends to the index as well; using the initial example,
> > if I search for <str name="parsedquery">metadata_txt:ab00001</str> (instead
> > of ab00001.tif), my result set includes ab00001.tif, ab00001.jpg,
> > ab00001.png, etc so the tokens in the index are split on dot as well, not
> > just the query.
> >
> > I'm doing something wrong, or I'm misunderstanding something!!
> > ~~Bill
> >
> >> On Tue, May 2, 2023 at 1:02 PM Mikhail Khludnev <m...@apache.org> wrote:
> >>
> >> Analyzer is configured in schema.xml. But literally, splitting on dot is
> >> what I expect from StandardTokenizer.
> >>
> >> On Tue, May 2, 2023 at 8:48 PM Bill Tantzen <tantz...@umn.edu.invalid>
> >> wrote:
> >>
> >>> Mikhail,
> >>> Thanks for the quick reply.  Here is the parser info:
> >>>
> >>> <str name="QParser">LuceneQParser</str>
> >>>
> >>> ~~Bill
> >>>
> >>> On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev <m...@apache.org> wrote:
> >>>
> >>>> Hello Bill,
> >>>> Which analyzer is configured for metadata_txt?  Perhaps you need to tune it
> >>>> accordingly.
> >>>>
> >>>> On Tue, May 2, 2023 at 7:40 PM Bill Tantzen <tantz...@umn.edu.invalid> wrote:
> >>>>
> >>>>> In my solr 9.2 schema, I am leveraging the dynamicField
> >>>>>
> >>>>> <dynamicField name="*_txt" type="text_general" indexed="true"
> >>>>> stored="true"/>
> >>>>>
> >>>>> which tokenizes with solr.StandardTokenizerFactory for index and query.
> >>>>>
> >>>>> However, when I query with, for example,
> >>>>> <str name="q">metadata_txt:XYZ.tif</str>
> >>>>>
> >>>>> I see many more hits than I expect.  When I add debug=true to the query, I
> >>>>> see:
> >>>>> <str name="rawquerystring">metadata_txt:XYZ.tif</str>
> >>>>> <str name="querystring">metadata_txt:XYZ.tif</str>
> >>>>> <str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>
> >>>>>
> >>>>> But I expect that dots not followed by whitespace will be kept as part of
> >>>>> the token, that is, the parsed query should remain "metadata_txt:XYZ.tif"
> >>>>> but solr appears to be splitting into two tokens.
> >>>>>
> >>>>> Can somebody point out what I am misunderstanding?
> >>>>> Thanks,
> >>>>> ~~Bill
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Sincerely yours
> >>>> Mikhail Khludnev
> >>>> https://t.me/MUST_SEARCH
> >>>> A caveat: Cyrillic!
> >>>>


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)
