Re: Terms with hyphens and fuzzy search

Dave Tue, 23 Aug 2022 09:05:56 -0700

Yeah, now I think you’re getting the concept. The dash is effectively whitespace 
and means nothing, like a period or comma, so it’s now three separate words. 
And to quote:


Once the list of matching documents has been identified using the fq and qf 
parameters, the pf parameter can be used to "boost" the score of documents in 
cases where all of the terms in the q parameter appear in close proximity.
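
To make that concrete, here is a rough Python sketch of the two ideas in this 
thread: adding the fuzzy operator to each term separately, and sending an 
edismax request with a pf boost. The field name title, the boost of 10 and the 
slop of 2 are made-up values for illustration, not anything from your setup, 
and the helper names are hypothetical. Only defType, q, qf, pf and ps are 
actual Solr parameter names.

```python
import re
from urllib.parse import urlencode


def fuzzy_per_term(text, max_edits=2):
    """Split on whitespace and hyphens, then append ~N to every piece.

    This mimics the idea that the hyphenated input becomes separate words.
    It does not escape other query-syntax characters.
    """
    terms = [t for t in re.split(r"[\s\-]+", text) if t]
    return " ".join(f"{t}~{max_edits}" for t in terms)


def edismax_params(user_query, field="title", phrase_boost=10, slop=2):
    """Build edismax request parameters with a phrase boost on one field.

    `field`, `phrase_boost` and `slop` are illustrative assumptions.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": field,                      # field(s) used for matching
        "pf": f"{field}^{phrase_boost}",  # boost docs where the q terms are close together
        "ps": str(slop),                  # phrase slop used for the pf boost
    }


if __name__ == "__main__":
    # Fuzzy operator on each term rather than on the hyphenated string.
    print(fuzzy_per_term("term-with-hyphens"))
    # Query string you could append to a /select URL.
    print(urlencode(edismax_params("term with hyphens")))
```

The first print gives term~2 with~2 hyphens~2, which is the "add ~2 to each 
term" form. The second is just the encoded parameter string; whether the boost 
value suits you is something to tune against your own data.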

There is a lot of power in the pf parameter; it might be closer to what you’re 
looking for. On a side note, there is a whole concept of shingles, which combine 
adjacent words into single tokens and could help you further. For example, with 
a shingle size of two,
Dark storm rising
can turn into
Dark_storm
Storm_rising
Dark
Storm
Rising
It can get really fun when you do this and mix in stop words.
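
If it helps to see the shingle idea outside of Solr, here is a small 
plain-Python sketch. It is only an illustration of the concept, not how Solr 
implements it; in Solr this is done at analysis time by ShingleFilterFactory 
(maxShingleSize, outputUnigrams and tokenSeparator are the relevant settings). 
The function name and the underscore separator are my own choices for the 
example.

```python
def shingles(text, max_size=2, sep="_"):
    """Return word n-grams from max_size down to single words.

    Words are split on whitespace only; no lowercasing or stop-word
    handling is done here.
    """
    words = text.split()
    out = []
    for size in range(max_size, 0, -1):          # larger shingles first
        for i in range(len(words) - size + 1):   # slide a window of `size` words
            out.append(sep.join(words[i:i + size]))
    return out


if __name__ == "__main__":
    print(shingles("Dark storm rising"))
    # ['Dark_storm', 'storm_rising', 'Dark', 'storm', 'rising']
```

Solr emits the tokens in position order rather than grouped by size, so treat 
the ordering here as a convenience of the sketch.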

> On Aug 23, 2022, at 11:50 AM, Morten Ernebjerg 
> <morten.ernebj...@data4life.care> wrote:
> 
> Hi again
> 
> OK, so I think this is starting to make sense. What was confusing us was
> that we indeed thought of a hyphenated term (like: term-with-hyphens) as
> just a single term, meaning that fuzzy search should apply as usual.
> However, if I understand you correctly, it sounds like the correct
> statement is actually that fuzzy search applies to *terms that result in a
> single token after indexing*. Since the standard tokenizer splits on
> hyphens, fuzzy search would then not apply. Did I get that right?
> 
>> phrase query fields
> 
> I'm not sure I quite follow - do you mean using the qf query parameter or
> setting up separate "parallel" fields of some sort?
> 
> Best,
> 
> Morten
> 
>> On Tue, 23 Aug 2022 at 17:29, Dave <hastings.recurs...@gmail.com> wrote:
>> 
>> Ok so from what I’m looking at you have a proximity search so the terms
>> have to be within the distance value of each other. In my example, 2, which
>> obviously won’t work since there are three terms.  A fuzzy search is based
>> on a single term/token. So you need to add ~2 to each term if that’s what
>> you want. There’s really good documentation about the difference and why
>> it’s not working as you expected here:
>> 
>> https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/
>> 
>> Also try to make use of phrase query fields and boosting them.
>> 
>> 
>> 
>>> On Aug 23, 2022, at 11:18 AM, Morten Ernebjerg
>>> <morten.ernebj...@data4life.care> wrote:
>>> 
>>> (replying on behalf of my colleague Julius, who wrote this question and is
>>> unable to reply for technical reasons)
>>> Hi David,
>>> 
>>> Thanks for the reply! I think your question may point to something we
>>> overlooked. We are actually using Solr 8.11 and we want to use fuzzy search
>>> (https://solr.apache.org/guide/8_11/the-standard-query-parser.html#fuzzy-searches),
>>> i.e. find words that differ from the query by one or a few characters. Our
>>> understanding was that to get matches that differ by max two chars from
>>> (using separate line to avoid adding confusing quotation marks)
>>> 
>>> term-with-hyphens
>>> 
>>> we should send the following query (without any quotation marks):
>>> 
>>> term-with-hyphens~2
>>> 
>>> Our thinking was that the hyphenated term is one word so there is no need
>>> to quote it. We had a quick try quoting the hyphenated term in the query as
>>> you suggested and it looks like it works (i.e. returns matches). Since the
>>> standard tokenizer splits on hyphens, I'm wondering whether the unquoted
>>> query somehow gets converted to the *proximity search* query
>>> 
>>> "term with hyphens"~2
>>> 
>>> which then fails (though it looks like it should still match
>>> term-with-hyphens). Would be great to understand what is happening.
>>> 
>>> Best,
>>> 
>>> Morten
>>> 
>>> 
>>> 
>>>> On Tue, 23 Aug 2022 at 16:30, David Hastings <hastings.recurs...@gmail.com>
>>>> wrote:
>>>> 
>>>> I’m not certain of course of your tokenizer but shouldn’t it be
>>>> "term-with-hyphens"~1
>>>> 
>>>> ? Just a syntax thing that may not have translated over email but curious
>>>> 
>>>> On Tue, Aug 23, 2022 at 10:12 AM Julian Hugo <julian.h...@data4life.care>
>>>> wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> I am getting peculiar results when querying for a term containing hyphens
>>>>> and adding fuzzy search
>>>>> <https://solr.apache.org/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches>.
>>>>> 
>>>>> I have indexed two items (1) "term-with-hyphens" and (2) "term with
>>>>> hyphens". When I query ("q") for "term-with-hyphens" or "term with hyphens"
>>>>> both items are returned as expected. The same is the case for escaped
>>>>> hyphens "term\-with\-hyphens".
>>>>> 
>>>>> The problem: When I add the fuzzy search parameter (i.e.,
>>>>> "term-with-hyphens~1" or "term\-with\-hyphens~1"), I get zero results back.
>>>>> 
>>>>> I struggle to understand the results, or how to solve this problem. My
>>>>> intuition tells me that adding a fuzzy search parameter should surely
>>>>> increase the size of the set of results. I am happy for any help on this!
>>>>> 
>>>>> Our current setup is using the "Extended DisMax Query Parser"
>>>>> <https://solr.apache.org/guide/6_6/the-extended-dismax-query-parser.html>,
>>>>> however we observe the same behaviour using the "Standard Query Parser"
>>>>> <https://solr.apache.org/guide/6_6/the-standard-query-parser.html>. We are
>>>>> using the "Standard Tokenizer"
>>>>> <https://solr.apache.org/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer>,
>>>>> which splits at hyphens. Does this relate to this problem?
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> --
>>>>> 
>>>>> *Julian Hugo*
>>>>> 
>>>>> Working Student
>>>>> Backend Development
>>>>> 
>>>>> (he/his)
>>>>> 
>>>>> 
>>>>> julian.h...@data4life.care
>>>>> 
>>>>> 
>>>>> D4L data4life gGmbH
>>>>> Charlottenstraße 109
>>>>> 14467 Potsdam, Germany
>>>>> 
>>>>> www.data4life.care
>>>>> 
>>>>> 
>>>>> Amtsgericht Potsdam, HRB 30667
>>>>> 
>>>>> Managing Director: Christian-Cornelius Weiß
>>>>> 
>>>>> 
>>>>> We are Data4Life. We've been certified by the German Federal Office for
>>>>> Information Security (BSI) in accordance with ISO 27001 on the basis of
>>>>> "IT-Grundschutz".
>>>>> 
>>>>> 
>>>>> Diversity is the driving force behind our work towards a society where
>>>>> digital health improves quality of life for everyone.
>>>>> Data4Life warmly welcomes applicants from the LGBTQI+ community, people
>>>>> with a migration background, People of Color, and individuals with
>>>>> disabilities or chronic illnesses to the team.
>>>>> 
>>>>> 
>>>>> Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> 
>>> *Morten Ernebjerg, Ph.D.*
>>> 
>>> Senior Developer
>>> 
>>> 
>>> morten.ernebj...@data4life.care
>>> 
>>> D4L data4life gGmbH
>>> 
>>> Charlottenstraße 109
>>> 
>>> 14467 Potsdam, Germany
>>> 
>>> www.data4life.care
>>> 
>>> Amtsgericht Potsdam, HRB 30667
>>> 
>>> Managing Director: Christian-Cornelius Weiß
>>> 
>>> 
>>> We are Data4Life. We've been certified by the German Federal Office for
>>> Information Security (BSI) in accordance with ISO 27001 on the basis of
>>> "IT-Grundschutz".
>>> 
>>> 
>>> Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>
>> 
> 
> 
