Re: standard tokenizer seemingly splitting on dot

Dave Tue, 02 May 2023 11:40:18 -0700

You’re not doing anything wrong. The tokenizer is treating the dot as a
delimiter here rather than as part of the token, so the value is split both
in the index and in the query. If you used a string field instead, it would
in theory keep the punctuation, but it would also make matching much
stricter, since the whole value has to match exactly. If you try this, you
need to reindex a couple of documents to make sure you are getting what you
want.
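
A minimal sketch of the string option, in case it helps. The field name
metadata_s is hypothetical (it is not from your schema), and I am assuming
the "string" type (solr.StrField) from the _default configset is still
defined in your schema. Treat it as a starting point to test, not a
verified config:

```xml
<!-- Hypothetical field: StrField is not tokenized, so "ab00001.tif"
     is indexed as one term, dot included. -->
<!-- multiValued is an assumption; it is only needed if metadata_txt
     can hold more than one value per document. -->
<field name="metadata_s" type="string" indexed="true" stored="true"
       multiValued="true"/>

<!-- Keep the analyzed field for loose matching and copy the raw value
     into the string field for exact matching. -->
<copyField source="metadata_txt" dest="metadata_s"/>
```

After reindexing, a query such as metadata_s:ab00001.tif should only match
documents whose whole value is exactly ab00001.tif, which is the stricter
behavior I mentioned above.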


-Dave

> On May 2, 2023, at 2:22 PM, Bill Tantzen <tantz...@umn.edu.invalid> wrote:
> 
> I'm using the solrconfig.xml from the distribution,
> ./server/solr/configsets/_default/conf/solrconfig.xml
> 
> But this problem extends to the index as well; using the initial example,
> if I search for <str name="parsedquery">metadata_txt:ab00001</str> (instead
> of ab00001.tif), my result set includes ab00001.tif, ab00001.jpg,
> ab00001.png, etc so the tokens in the index are split on dot as well, not
> just the query.
> 
> I'm doing something wrong, or I'm misunderstanding something!!
> ~~Bill
> 
>> On Tue, May 2, 2023 at 1:02 PM Mikhail Khludnev <m...@apache.org> wrote:
>> 
>> Analyzer is configured in schema.xml. But literally, splitting on dot is
>> what I expect from StandardTokenizer.
>> 
>> On Tue, May 2, 2023 at 8:48 PM Bill Tantzen <tantz...@umn.edu.invalid>
>> wrote:
>> 
>>> Mikhail,
>>> Thanks for the quick reply.  Here is the parser info:
>>> 
>>> <str name="QParser">LuceneQParser</str>
>>> 
>>> ~~Bill
>>> 
>>> On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev <m...@apache.org> wrote:
>>> 
>>>> Hello Bill,
>>>> Which analyzer is configured for metadata_txt?  Perhaps you need to tune it
>>>> accordingly.
>>>> 
>>>> On Tue, May 2, 2023 at 7:40 PM Bill Tantzen <tantz...@umn.edu.invalid>
>>>> wrote:
>>>> 
>>>>> In my solr 9.2 schema, I am leveraging the dynamicField
>>>>> 
>>>>> <dynamicField name="*_txt" type="text_general" indexed="true"
>>>>> stored="true"/>
>>>>> 
>>>>> which tokenizes with solr.StandardTokenizerFactory for index and query.
>>>>> 
>>>>> However, when I query with, for example,
>>>>> <str name="q">metadata_txt:XYZ.tif</str>
>>>>> 
>>>>> I see many more hits than I expect.  When I add debug=true to the query, I
>>>>> see:
>>>>> <str name="rawquerystring">metadata_txt:XYZ.tif</str>
>>>>> <str name="querystring">metadata_txt:XYZ.tif</str>
>>>>> <str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>
>>>>> 
>>>>> But I expect that dots not followed by whitespace will be kept as part of
>>>>> the token, that is, the parsed query should remain "metadata_txt:XYZ.tif"
>>>>> but solr appears to be splitting into two tokens.
>>>>> 
>>>>> Can somebody point out what I am misunderstanding?
>>>>> Thanks,
>>>>> ~~Bill
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>> https://t.me/MUST_SEARCH
>>>> A caveat: Cyrillic!
>>>> 
>>> 
>>> 
>>> --
>>> Human wheels spin round and round
>>> While the clock keeps the pace... -- John Mellencamp
>>> ________________________________________________________________
>>> Bill Tantzen    University of Minnesota Libraries
>>> 612-626-9949 (U of M)    612-325-1777 (cell)
>>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> https://t.me/MUST_SEARCH
>> A caveat: Cyrillic!
>> 
> 
> 
> -- 
> Human wheels spin round and round
> While the clock keeps the pace... -- John Mellencamp
> ________________________________________________________________
> Bill Tantzen    University of Minnesota Libraries
> 612-626-9949 (U of M)    612-325-1777 (cell)
