Hi David & Markus,

Thanks for the input! I think we should now have the tools to work out a
solution for this.

Best,
Morten
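A minimal sketch of the per-term approach suggested in the quoted thread below (treat the hyphens as separators and append ~N to each resulting term). The helper names fuzzy_query and proximity_query are hypothetical, and the regex split is only a rough stand-in for how the standard tokenizer handles hyphens; it is not Solr's actual analysis chain.

```python
import re


def _split_terms(text: str) -> list[str]:
    # Rough approximation (assumption): split on any run of non-word
    # characters, so "term-with-hyphens" becomes three separate terms.
    return [t for t in re.split(r"\W+", text) if t]


def fuzzy_query(text: str, max_edits: int = 2) -> str:
    """Build a query string with a fuzzy operator on every single term."""
    # Solr's fuzzy operator accepts an edit distance between 0 and 2.
    if not 0 <= max_edits <= 2:
        raise ValueError("edit distance must be 0, 1 or 2")
    return " ".join(f"{term}~{max_edits}" for term in _split_terms(text))


def proximity_query(text: str, slop: int = 2) -> str:
    """Build a quoted phrase with a slop value (proximity search)."""
    return '"' + " ".join(_split_terms(text)) + f'"~{slop}'


print(fuzzy_query("term-with-hyphens", 2))      # fuzzy match per term
print(proximity_query("term-with-hyphens", 2))  # phrase with slop
```

Whether all three fuzzy terms must match depends on the q.op / mm settings of the request, which this sketch does not set.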
On Tue, 23 Aug 2022 at 18:19, David Hastings <hastings.recurs...@gmail.com> wrote:

> And if you want to get really fun, use a natural language/entity
> extraction, mix just those values into an index field with stop words
> killed, then bring in shingles, up the shingle size to about four, and
> boost it with the pf. I promise you won’t get bored. Your index size will
> grow, but you should already have some metal behind you when you start
> doing that.
>
> On Tue, Aug 23, 2022 at 12:05 PM Dave <hastings.recurs...@gmail.com> wrote:
>
> > Yea, now I think you’re getting the concept. The dash is effectively
> > white space and means nothing, like a period or comma, so it’s now three
> > separate words. And to quote:
> >
> > Once the list of matching documents has been identified using the fq and
> > qf parameters, the pf parameter can be used to "boost" the score of
> > documents in cases where all of the terms in the q parameter appear in
> > close proximity
> >
> > There is a lot of power in the pf parameter; it might be more what you’re
> > looking for. On a side note, there is a whole concept of shingles, which
> > combines words together and could further help you out. Like:
> > Dark storm rising
> > Can turn into
> > Dark_storm
> > Storm_rising
> > Dark
> > Storm
> > Rising
> > If you set it to two. It can get really fun when you do this and mix in
> > stop words.
> >
> > On Aug 23, 2022, at 11:50 AM, Morten Ernebjerg
> > <morten.ernebj...@data4life.care> wrote:
> >
> > > Hi again
> > >
> > > OK, so I think this is starting to make sense. What was confusing us
> > > was that we indeed thought of a hyphenated term (like:
> > > term-with-hyphens) as just a single term, meaning that fuzzy search
> > > should apply as usual. However, if I understand you correctly, it
> > > sounds like the correct statement is actually that fuzzy search
> > > applies to *terms that result in a single token after indexing*.
> > > Since the standard tokenizer splits on hyphens, fuzzy search would
> > > then not apply. Did I get that right?
> > >
> > > > phrase query fields
> > >
> > > I'm not sure I quite follow - do you mean using the qf query parameter
> > > or setting up separate "parallel" fields of some sort?
> > >
> > > Best,
> > >
> > > Morten
> > >
> > > On Tue, 23 Aug 2022 at 17:29, Dave <hastings.recurs...@gmail.com> wrote:
> > >
> > > > Ok, so from what I’m looking at, you have a proximity search, so the
> > > > terms have to be within the distance value of each other. In my
> > > > example, 2, which obviously won’t work since there are three terms.
> > > > A fuzzy search is based on a single term/token. So you need to add
> > > > ~2 to each term if that’s what you want. There’s really good
> > > > documentation about the difference and why it’s not working as you
> > > > expected here:
> > > >
> > > > https://examples.javacodegeeks.com/apache-solr-fuzzy-search-example/
> > > >
> > > > Also try to make use of phrase query fields and boosting them.
> > > >
> > > > On Aug 23, 2022, at 11:18 AM, Morten Ernebjerg
> > > > <morten.ernebj...@data4life.care> wrote:
> > > >
> > > > > (replying on behalf of my colleague Julius, who wrote this
> > > > > question and is unable to reply for technical reasons)
> > > > >
> > > > > Hi David,
> > > > >
> > > > > Thanks for the reply! I think your question may point to something
> > > > > we overlooked. We are actually using Solr 8.11 and we want to use
> > > > > fuzzy search
> > > > > (https://solr.apache.org/guide/8_11/the-standard-query-parser.html#fuzzy-searches),
> > > > > i.e. find words that differ from the query by one or a few
> > > > > characters. Our understanding was that to get matches that differ
> > > > > by max two chars from (using separate line to avoid adding
> > > > > confusing quotation marks)
> > > > >
> > > > > term-with-hyphens
> > > > >
> > > > > we should send the following query (without any quotation marks):
> > > > >
> > > > > term-with-hyphens~2
> > > > >
> > > > > Our thinking was that the hyphenated term is one word, so there is
> > > > > no need to quote it. We had a quick try quoting the hyphenated
> > > > > term in the query as you suggested and it looks like it works
> > > > > (i.e. returns matches). Since the standard tokenizer splits on
> > > > > hyphens, I'm wondering whether the unquoted query somehow gets
> > > > > converted to the *proximity search* query
> > > > >
> > > > > "term with hyphens"~2
> > > > >
> > > > > which then fails (though it looks like it should still match
> > > > > term-with-hyphens). Would be great to understand what is
> > > > > happening.
> > > > >
> > > > > Best,
> > > > >
> > > > > Morten
> > > > >
> > > > > On Tue, 23 Aug 2022 at 16:30, David Hastings
> > > > > <hastings.recurs...@gmail.com> wrote:
> > > > >
> > > > > > I’m not certain of course of your tokenizer, but shouldn’t it be
> > > > > >
> > > > > > “terms-with-hyphens”~1
> > > > > >
> > > > > > ? Just a syntax thing that may not have translated over email,
> > > > > > but curious.
> > > > > >
> > > > > > On Tue, Aug 23, 2022 at 10:12 AM Julian Hugo
> > > > > > <julian.h...@data4life.care> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am getting peculiar results when querying for a term
> > > > > > > containing hyphens and adding fuzzy search
> > > > > > > <https://solr.apache.org/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches>.
> > > > > > >
> > > > > > > I have indexed two items: (1) "term-with-hyphens" and (2)
> > > > > > > "term with hyphens". When I query ("q") for
> > > > > > > "term-with-hyphens" or "term with hyphens", both items are
> > > > > > > returned as expected. The same is the case for escaped
> > > > > > > hyphens, "term\-with\-hyphens".
> > > > > > >
> > > > > > > The problem: when I add the fuzzy search parameter (i.e.,
> > > > > > > "term-with-hyphens~1" or "term\-with\-hyphens~1"), I get zero
> > > > > > > results back.
> > > > > > >
> > > > > > > I struggle to understand the results, or how to solve this
> > > > > > > problem. My intuition tells me that adding a fuzzy search
> > > > > > > parameter should surely increase the size of the set of
> > > > > > > results. I am happy for any help on this!
> > > > > > >
> > > > > > > Our current setup is using the "Extended DisMax Query Parser"
> > > > > > > <https://solr.apache.org/guide/6_6/the-extended-dismax-query-parser.html>;
> > > > > > > however, we observe the same behaviour using the "Standard
> > > > > > > Query Parser"
> > > > > > > <https://solr.apache.org/guide/6_6/the-standard-query-parser.html>.
> > > > > > > We are using the "Standard Tokenizer"
> > > > > > > <https://solr.apache.org/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer>,
> > > > > > > which splits at hyphens. Does this relate to this problem?
> > > > > > >
> > > > > > > Thank you!
> > > > > > >
> > > > > > > --
> > > > > > > *Julian Hugo*
> > > > > > > Working Student
> > > > > > > Backend Development
> > > > > > > (he/his)
> > > > > > >
> > > > > > > julian.h...@data4life.care
> > > > > > >
> > > > > > > D4L data4life gGmbH
> > > > > > > Charlottenstraße 109
> > > > > > > 14467 Potsdam, Germany
> > > > > > > www.data4life.care
> > > > > > >
> > > > > > > Amtsgericht Potsdam, HRB 30667
> > > > > > > Managing Director: Christian-Cornelius Weiß
> > > > > > >
> > > > > > > We are Data4Life. We've been certified by the German Federal
> > > > > > > Office for Information Security (BSI) in accordance with ISO
> > > > > > > 27001 on the basis of "IT-Grundschutz".
> > > > > > >
> > > > > > > Diversity is the driving force behind our work towards a
> > > > > > > society where digital health improves quality of life for
> > > > > > > everyone. Data4Life warmly welcomes applicants from the
> > > > > > > LGBTQI+ community, people with a migration background, People
> > > > > > > of Color, and individuals with disabilities or chronic
> > > > > > > illnesses to the team.
> > > > > > >
> > > > > > > Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>
> > > > >
> > > > > --
> > > > > *Morten Ernebjerg, Ph.D.*
> > > > > Senior Developer
> > > > >
> > > > > morten.ernebj...@data4life.care
> > > > >
> > > > > D4L data4life gGmbH
> > > > > Charlottenstraße 109
> > > > > 14467 Potsdam, Germany
> > > > > www.data4life.care
> > > > >
> > > > > Amtsgericht Potsdam, HRB 30667
> > > > > Managing Director: Christian-Cornelius Weiß
> > > > >
> > > > > We are Data4Life. We've been certified by the German Federal
> > > > > Office for Information Security (BSI) in accordance with ISO 27001
> > > > > on the basis of "IT-Grundschutz".
> > > > >
> > > > > Climate neutral since 2019 <https://wtca.lfca.earth/e/data4life>
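
To illustrate the shingle example from Dave's message in the thread ("Dark storm rising"), here is a rough Python sketch of word shingling. The function name shingles and the underscore separator are assumptions taken from the example for illustration; in Solr this would be done by a shingle filter in the field's analysis chain, not by client code.

```python
def shingles(text: str, max_size: int = 2, sep: str = "_") -> list[str]:
    """Return single words plus all word n-grams up to max_size words."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    result = []
    for size in range(1, max_size + 1):
        # Slide a window of `size` words across the token list.
        for start in range(len(tokens) - size + 1):
            result.append(sep.join(tokens[start:start + size]))
    return result


# Shingle size two, as in the example from the thread.
print(shingles("Dark storm rising", max_size=2))
```

Raising max_size to four, as suggested in the thread, adds three- and four-word shingles as well, which is why the index size grows.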