Re: Best fuzzy match on multiple terms

2019-06-14 Thread Tomoko Uchida
Hi Boris, Query parsing and scoring/ranking are completely separate processes, so I'd debug those problems separately. For debugging a fuzzy query, the Query.rewrite() method would be a good first step (with it you can see all the unrolled terms generated by the fuzzy query). I'm not sure about what is your pr

Re: Best fuzzy match on multiple terms

2019-06-14 Thread Matthias Müller
Hi Boris, "Acer campestre 'Rozi'" now receives a higher score with DFISimilarity and BM25Similarity (with tuned 'b') instead of the standard BM25. It really was a scoring/normalization issue: While "Rozi" gets a higher score, "Acer" and "campestre" received lower values and the combined result

Re: Best fuzzy match on multiple terms

2019-06-14 Thread baris . kazar
These are great suggestions; I was going to suggest the explain plan of the query, too. I really wonder why, in your case, the 'Rozi' entry does not get a higher score. Is there any effect from the " ' " chars? In my case I have sort of the reverse situation: my query is maink~2 (mains was a special case where i st
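To make the ~2 in a query such as maink~2 concrete: a fuzzy query accepts indexed terms within a maximum edit distance of the query term. The following is a minimal sketch in plain Java, not Lucene's implementation. It computes plain Levenshtein distance, whereas Lucene's FuzzyQuery by default also counts a transposition of two adjacent characters as a single edit. The class name, method name, and candidate terms are hypothetical and chosen only for illustration.

```java
import java.util.List;

class EditDistanceSketch {

    /** Plain Levenshtein distance: insertions, deletions, substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String query = "maink";
        int maxEdits = 2; // the "~2" in maink~2

        // Hypothetical indexed terms, for illustration only.
        List<String> candidates = List.of("main", "mains", "maine", "marina", "campestre");

        for (String term : candidates) {
            int d = levenshtein(query, term);
            System.out.println(term + " distance=" + d + " matches=" + (d <= maxEdits));
        }
    }
}
```

All terms within the edit budget match, so which one ranks first is decided by scoring, not by the fuzzy expansion itself.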

Re: Best fuzzy match on multiple terms

2019-06-14 Thread Matthias Müller
Hi Namgyu and Tomoko, your hint towards Explanation was very helpful and I was not aware of this feature. I have now experimented with different scoring functions and it seems that DFISimilarity and BM25Similarity (with lower 'b') produce results in the direction I prefer, though not perfect for

Re: Best fuzzy match on multiple terms

2019-06-14 Thread Tomoko Uchida
Hi Matthias, What similarity class are you using? Just a guess... but possibly one reason is document (field) length normalization. Generally speaking, shorter documents would get higher scores than longer documents. (I saw that classic TFIDF similarity tends to give much higher scores to shorter
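The effect of length normalization, and of the 'b' parameter mentioned elsewhere in this thread, can be illustrated with the textbook Okapi BM25 formula from the Wikipedia article linked in the thread. This is a sketch in plain Java under stated assumptions: it is not Lucene's BM25Similarity code (Lucene's internal formula may differ in constant factors and in how it encodes field length), and the class name, method name, and index statistics are hypothetical.

```java
class Bm25Sketch {

    /**
     * Textbook Okapi BM25 score of a single term in a single document.
     * b = 0 disables length normalization; b = 1 applies it fully.
     */
    static double score(double tf, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k1, double b) {
        // Rare terms (small docFreq) get a larger idf.
        double idf = Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
        // Documents longer than average get a norm > 1, which lowers the score.
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Hypothetical statistics: 1000 documents, the term occurs in 10 of them,
        // average field length is 3 tokens.
        long docCount = 1000;
        long docFreq = 10;
        double avgDocLen = 3.0;
        double k1 = 1.2;

        for (double b : new double[] {0.75, 0.3, 0.0}) {
            double twoTokens = score(1, 2, avgDocLen, docCount, docFreq, k1, b);
            double threeTokens = score(1, 3, avgDocLen, docCount, docFreq, k1, b);
            System.out.println("b=" + b
                    + " twoTokenField=" + twoTokens
                    + " threeTokenField=" + threeTokens);
        }
    }
}
```

With the same term frequency, the two-token field outscores the three-token field whenever b > 0, and the gap shrinks as b is lowered. That is consistent with a longer name such as "Acer campestre 'Rozi'" ranking better once 'b' is reduced.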

Re: Best fuzzy match on multiple terms

2019-06-13 Thread Namgyu Kim
Dear Matthias, First you need to know about Lucene's ranking concept. Lucene's default ranking is BM25 and it depends on your index status. (https://en.wikipedia.org/wiki/Okapi_BM25) There can be many reasons. One thing I can guess is that your index has a lot of 'rozi' terms so it is getting

Re: Best fuzzy match on multiple terms

2019-06-13 Thread baris . kazar
I would suggest trying (indexing and searching) without the ' characters and seeing whether you can find it first. Thanks On 6/13/19 11:25 AM, Matthias Müller wrote: I am currently matching botanic names (with possible mis-spellings) against an indexed reference list with Lucene. After quick progress in the

Best fuzzy match on multiple terms

2019-06-13 Thread Matthias Müller
I am currently matching botanic names (with possible mis-spellings) against an indexed reference list with Lucene. After quick progress in the beginning, I am struggling with the proper query design to achieve the ranking result I want. Here is an example: Search term: Acer campestre 'Rozi' Toke