Re: Terms with hyphens and fuzzy search

Morten Ernebjerg Wed, 24 Aug 2022 00:38:02 -0700

Hi David & Markus
Thanks for the input! I think we now have the tools to work out a solution
for this.
Best,
Morten


On Tue, 23 Aug 2022 at 18:19, David Hastings <hastings.recurs...@gmail.com>
wrote:

> And if you want to get really fun, run natural language/entity extraction,
> mix just those values into an index field with stop words killed, then
> bring in shingles, up the shingle size to about four, and boost it with the
> pf parameter. I promise you won’t get bored. Your index size will grow, but
> you should already have some metal behind you when you start doing that.
>
> On Tue, Aug 23, 2022 at 12:05 PM Dave <hastings.recurs...@gmail.com>
> wrote:
>
> > Yeah, now I think you’re getting the concept. The dash is effectively white
> > space and means nothing, like a period or comma. So it’s now three separate
> > words. And to quote:
> >
> > Once the list of matching documents has been identified using the fq and
> > qf parameters, the pf parameter can be used to "boost" the score of
> > documents in cases where all of the terms in the q parameter appear in
> > close proximity.
> >
> > There is a lot of power in the pf parameter; it might be more what you’re
> > looking for. On a side note, there is a whole concept of shingles, which
> > combine adjacent words into single tokens and could help you further. With
> > a shingle size of two,
> > Dark storm rising
> > can turn into
> > Dark_storm
> > Storm_rising
> > Dark
> > Storm
> > Rising
> > It can get really fun when you do this and mix in stop words.
> >
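A minimal sketch of the shingling idea described above, in Python. It only illustrates how adjacent words are combined; it is not Solr's shingle filter. The function name `shingles` and the underscore separator are assumptions that mirror the notation in the message, and the separator Solr actually emits may differ.

```python
def shingles(text, max_size=2, sep="_"):
    """Return word n-grams (shingles) from max_size words down to single words."""
    words = text.split()
    out = []
    # Emit the longest shingles first, then shorter ones, ending with unigrams.
    for size in range(max_size, 0, -1):
        for start in range(len(words) - size + 1):
            out.append(sep.join(words[start:start + size]))
    return out


if __name__ == "__main__":
    for token in shingles("Dark storm rising", max_size=2):
        print(token)
```

With a shingle size of two, the three-word input yields two word pairs plus the three single words, which is why the index grows as the shingle size goes up.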
> > On Aug 23, 2022, at 11:50 AM, Morten Ernebjerg
> > <morten.ernebj...@data4life.care> wrote:
> >
> > Hi again
> >
> > OK, so I think this is starting to make sense. What was confusing us was
> > that we indeed thought of a hyphenated term (like: term-with-hyphens) as
> > just a single term, meaning that fuzzy search should apply as usual.
> > However, if I understand you correctly, it sounds like the correct
> > statement is actually that fuzzy search applies to *terms that result in a
> > single token after indexing*. Since the standard tokenizer splits on
> > hyphens, fuzzy search would then not apply. Did I get that right?
> >
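If fuzzy matching applies per token, one workaround suggested elsewhere in this thread is to attach the fuzzy operator to each token separately. The sketch below shows that rewrite on the client side, in Python. The helper name `fuzzy_per_token` is hypothetical, and splitting on hyphens and whitespace is only a rough stand-in for what the standard tokenizer does. The limit of two edits reflects the maximum edit distance that Solr's fuzzy operator accepts.

```python
import re


def fuzzy_per_token(term, max_edits=2):
    """Rewrite 'term-with-hyphens' into 'term~2 with~2 hyphens~2'."""
    if max_edits not in (0, 1, 2):
        raise ValueError("Solr fuzzy search supports at most 2 edits")
    # Rough approximation of the tokenizer: split on hyphens and whitespace.
    tokens = [t for t in re.split(r"[\s\-]+", term) if t]
    return " ".join(f"{t}~{max_edits}" for t in tokens)


if __name__ == "__main__":
    print(fuzzy_per_token("term-with-hyphens", 2))
```

Whether all of the resulting clauses must match depends on the default operator or minimum-match settings of the request, which this sketch does not set.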
> > > phrase query fields
> >
> > I'm not sure I quite follow - do you mean using the qf query parameter or
> > setting up separate "parallel" fields of some sort?
> >
> > Best,
> >
> > Morten
> >
> > On Tue, 23 Aug 2022 at 17:29, Dave <hastings.recurs...@gmail.com> wrote:
> >
> > Ok so from what I’m looking at you have a proximity search so the terms
> > have to be within the distance value of each other. In my example, 2, which
> > obviously won’t work since there are three terms. A fuzzy search is based
> > on a single term/token. So you need to add ~2 to each term if that’s what
> > you want. There’s really good documentation about the difference and why
> > it’s not working as you expected here:
> >
> > https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/
> >
> > Also try to make use of phrase query fields and boosting them.
> >
> >
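A hedged sketch of what "phrase query fields and boosting them" could look like as request parameters for the Extended DisMax parser, in Python. `defType`, `q`, `qf`, `pf` and `ps` are eDisMax parameter names; the field name `title_txt`, the boost of 10 and the slop of 2 are placeholders chosen for illustration and do not come from this thread.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field="title_txt", phrase_boost=10, slop=2):
    """Build eDisMax parameters that boost documents matching the whole phrase."""
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": field,                      # field(s) searched for the terms
        "pf": f"{field}^{phrase_boost}",  # phrase field(s) with a boost
        "ps": str(slop),                  # slop allowed for the pf phrase match
    }


if __name__ == "__main__":
    params = edismax_params("term with hyphens")
    # The encoded string can be appended to a select URL of your collection.
    print(urlencode(params))
```

The `qf` field decides which documents match at all; `pf` only raises the score of documents in which the query terms also appear close together.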
> >
> > On Aug 23, 2022, at 11:18 AM, Morten Ernebjerg
> > <morten.ernebj...@data4life.care> wrote:
> >
> > (replying on behalf of my colleague Julian, who wrote this question and is
> > unable to reply for technical reasons)
> >
> > Hi David,
> >
> > Thanks for the reply! I think your question may point to something we
> > overlooked. We are actually using Solr 8.11 and we want to use fuzzy search
> > (https://solr.apache.org/guide/8_11/the-standard-query-parser.html#fuzzy-searches),
> > i.e. find words that differ from the query by one or a few characters. Our
> > understanding was that to get matches that differ by max two chars from
> > (using separate line to avoid adding confusing quotation marks)
> >
> > term-with-hyphens
> >
> > we should send the following query (without any quotation marks):
> >
> > term-with-hyphens~2
> >
> > Our thinking was that the hyphenated term is one word so there is no need
> > to quote it. We had a quick try quoting the hyphenated term in the query as
> > you suggested and it looks like it works (i.e. returns matches). Since the
> > standard tokenizer splits on hyphens, I'm wondering whether the unquoted
> > query somehow gets converted to the *proximity search* query
> >
> > "term with hyphens"~2
> >
> > which then fails (though it looks like it should still match
> > term-with-hyphens). Would be great to understand what is happening.
> >
> > Best,
> >
> > Morten
> >
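One plausible reading of the zero-result behaviour, stated here as an assumption rather than something confirmed in this thread: a fuzzy term is not split by the tokenizer at query time, so term-with-hyphens~2 is compared as a single string against the indexed tokens term, with and hyphens. The Python sketch below computes plain Levenshtein distance to show how far apart those strings are. Lucene's fuzzy matching may also count a transposition as a single edit, which this sketch does not model.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    query_term = "term-with-hyphens"
    # Tokens assumed to be in the index after splitting on hyphens.
    for token in ["term", "with", "hyphens"]:
        print(token, edit_distance(query_term, token))
```

Under that assumption, every indexed token is far more than two edits away from the unsplit query string, which would be consistent with the empty result set.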
> >
> >
> >
> > On Tue, 23 Aug 2022 at 16:30, David Hastings
> > <hastings.recurs...@gmail.com> wrote:
> >
> > I’m not certain of course of your tokenizer but shouldn’t it be
> >
> > “terms-with-hyphens”~1
> >
> > ? Just a syntax thing that may not have translated over email but curious
> >
> >
> > On Tue, Aug 23, 2022 at 10:12 AM Julian Hugo <julian.h...@data4life.care>
> > wrote:
> >
> >
> > Hello,
> >
> > I am getting peculiar results when querying for a term containing hyphens
> > and adding fuzzy search
> > <https://solr.apache.org/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches>.
> >
> > I have indexed two items (1) "term-with-hyphens" and (2) "term with
> > hyphens". When I query ("q") for "term-with-hyphens" or "term with hyphens"
> > both items are returned as expected. The same is the case for escaped
> > hyphens "term\-with\-hyphens".
> >
> > The problem: When I add the fuzzy search parameter (i.e.,
> > "term-with-hyphens~1" or "term\-with\-hyphens~1"), I get zero results back.
> >
> > I struggle to understand the results, or how to solve this problem. My
> > intuition tells me that adding a fuzzy search parameter should surely
> > increase the size of the set of results. I am happy for any help on this!
> >
> > Our current setup is using the "Extended DisMax Query Parser"
> > <https://solr.apache.org/guide/6_6/the-extended-dismax-query-parser.html>;
> > however, we observe the same behaviour using the "Standard Query Parser"
> > <https://solr.apache.org/guide/6_6/the-standard-query-parser.html>. We are
> > using the "Standard Tokenizer"
> > <https://solr.apache.org/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer>,
> > which splits at hyphens. Does this relate to this problem?
> >
> > Thank you!
> >
> >
> > --
> >
> > *Julian Hugo*
> > Working Student
> > Backend Development
> > (he/his)
> >
> > julian.h...@data4life.care
> >
> > D4L data4life gGmbH
> > Charlottenstraße 109
> > 14467 Potsdam, Germany
> >
> > www.data4life.care
> >
> > Amtsgericht Potsdam, HRB 30667
> > Managing Director: Christian-Cornelius Weiß
> >
> > We are Data4Life. We've been certified by the German Federal Office for
> > Information Security (BSI) in accordance with ISO 27001 on the basis of
> > "IT-Grundschutz".
> >
> > Diversity is the driving force behind our work towards a society where
> > digital health improves quality of life for everyone.
> > Data4Life warmly welcomes applicants from the LGBTQI+ community, people
> > with a migration background, People of Color, and individuals with
> > disabilities or chronic illnesses to the team.
> >
> > Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>
> >
> >
> >
> >
> >
> > --
> >
> > *Morten Ernebjerg, Ph.D.*
> > Senior Developer
> >
> > morten.ernebj...@data4life.care
> >
> > D4L data4life gGmbH
> > Charlottenstraße 109
> > 14467 Potsdam, Germany
> >
> > www.data4life.care
> >
> > Amtsgericht Potsdam, HRB 30667
> > Managing Director: Christian-Cornelius Weiß
> >
> > We are Data4Life. We've been certified by the German Federal Office for
> > Information Security (BSI) in accordance with ISO 27001 on the basis of
> > "IT-Grundschutz".
> >
> > Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>
> >
>


