Wow Karl, thank you so much for writing this up! It was a great help!
I have the ngram tokenizing working as you described. Searches are
very good!
In order to verify the hits are of high quality, I use the
Smith-Waterman algorithm. Other approximate string comparisons I
evaluated didn't work well
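
For readers who have not met Smith-Waterman before, here is a minimal sketch of how a local-alignment score could be used to re-check a hit against the query. The scoring values (match=2, mismatch=-1, gap=-1), the linear gap penalty, and the `similarity` normalisation are assumptions made for illustration, not the settings used in this thread.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the score matrix.
    The default scores are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + step, # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a, b, match=2):
    """Hypothetical helper: score scaled to 0..1 by the shorter string."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman(a.lower(), b.lower(), match=match) / (match * shorter)
```

A hit could then be kept only if `similarity(query, hit_name)` is above some cutoff chosen by experiment.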
On 8 May 2009 at 13:13, Nate wrote:
Is it possible to get a count for how many terms a result matched?
Currently I think you can only do that by using Searcher.explain().
But that is not a very nice solution. A better solution is being
worked on and might be available in a few months or so.
Ngrams can be used for lots of stuff. In your case it has nothing to do
with spellchecking, it was the "until" vs. "'till" that made me think
of them as they would allow you to get at least partial matching of
the text. Also, ngrams give you a bit of phrase functionality.
Create the grams
Is it possible to get a count for how many terms a result matched?
Googling, it doesn't appear to be done easily. I tried it out by
breaking my query into words myself, then doing a search for each one
and keeping track of the results and counts. This way I know if 4 out
of 5 terms matched a docume
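
The one-search-per-word approach described above might look roughly like the sketch below. The `search` callable and the in-memory `make_toy_search` stand-in are hypothetical; a real setup would run each term against the actual index instead.

```python
from collections import Counter


def make_toy_search(docs):
    """Hypothetical stand-in for a real index: docs maps doc id -> text."""
    def search(term):
        return [doc_id for doc_id, text in docs.items()
                if term in text.lower().split()]
    return search


def count_matched_terms(query, search):
    """Run one search per distinct query term and tally hits per document.

    Returns (counts, total_terms), where counts[doc_id] is the number of
    distinct query terms that matched that document.
    """
    terms = set(query.lower().split())
    counts = Counter()
    for term in terms:
        for doc_id in set(search(term)):
            counts[doc_id] += 1
    return counts, len(terms)


docs = {
    1: "don't stop till you get enough",
    2: "love me tender",
}
counts, total = count_matched_terms("stop till you love", make_toy_search(docs))
for doc_id, matched in counts.most_common():
    print(f"doc {doc_id}: {matched} of {total} terms matched")
```

This costs one search per term, so it is only a stopgap compared with getting the count from the scorer itself.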
Hi Karl,
No, sometimes there will not be a matching MP3 for a note file. When
this happens, the results I get are very poor. For example, if a song
with a common song word like "love" in the name does not have a
matching note file, then I get a handful of results that contain the
word "love" but a
Nate,
Will there always be a corresponding mp3 for any given note sheet?
As for analysis, I'd try using ngrams of the complete untokenized file
name if I were you.
"Michael Jackson Don't Stop 'till You Get Enough" ->
"^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so
on
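
A small sketch of how such grams could be produced from the whole, untokenized file name. The gram size of 4, the lowercasing, and the "^" start marker are read off the example above; the function name `char_ngrams` is made up for illustration, and no end-of-string marker is added because the example does not show one.

```python
def char_ngrams(text, n=4, start_marker="^"):
    """Return overlapping character n-grams of the whole string.

    The string is lowercased and prefixed with a start marker, and spaces
    are kept so that grams can span word boundaries.
    """
    s = start_marker + text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


name = "Michael Jackson Don't Stop 'till You Get Enough"
grams = char_ngrams(name)
print(grams[:8])
```

Because the grams run across spaces, two file names that share a run of words also share the grams that straddle those words, which is where the bit of phrase-like matching comes from.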
Thanks Anshum.
What happens if a search returns only one match, and that match is not
very "good"? If scores are only comparable to the scores of other
matches in the same search, then the score is effectively meaningless
if there is only one match.
It seems like a very common need to want to pro
Hi Nate,
The scores are only comparable within the same search and not across different
searches, as the scores are affected by the query as well as the docs.
About the threshold, I guess you could have a count cutoff to get the 'x' best
matches. I say that because I'm not really able to recollect anything which could
use