Ngrams can be use for lots of stuff. In your case it has nothing to do
with spellchecking, it was the "until" vs. "'till" that made me think
of them as they would allow you to get at least partial matching of
the text. Also, ngrams gives you a bit of phrase functionallity.
Create the grams by passing a token containing the complete text to
NgramTokenFilter. Something like this:
TokenStream ts = new SingleTokenTokenStream(new Token("Michael Jackson
Don't Stop 'till You Get Enough"));
ts = new LowerCaseFilter(ts);
ts = new NgramTokenFilter(ts, 4, 4);
document.add(new Field("ngrams", ts));
At query time you create a query the same way, perhaps something like
this:
TokenStream ts = new SingleTokenTokenStream(new Token("Michael Jackson
Don't Stop 'till You Get Enough"));
ts = new LowerCaseFilter(ts);
ts = new NgramTokenFilter(ts, 4, 4);
BooleanQuery bq = new BooleanQuery();
Token token;
while ((token = ts.next(new Token()) != null) {
bq.add(new BooleanClause(new TermQuery("ngrams", token.text()),
BooleanClause.SHOULD);
}
You might want to fiddle around with the gram sizes.
In order to detect a match you might want to take a look at the
distance between scores.
0.98 : Michael Jackson, Don't stop 'till you get enough
0.96 : Michael Jackson, Don't stop until you get enough
0.07 : Barry White, Can't get enough of your love babe
0.06 : Depeche Mode, Just can't get enough
Manual inspection of a bunch of queries that match the correct
document vs. queries that looks for something that does not exist in
your index will give you an indication of what distance from the top
score is intersting. It is very probable that if you search for
something that does not exist then the top hits will all have very
similar score, and if you search for something that does exist then
there will be a rather large gap between the correct hit and the first
non correct hit.
In order to fortify the response with knowledge that it really was a
good hit in the top you might want to use some edit distance measure
such as Levenstein or perhaps even something like the Jaccard index.
I've actually done something similar to this. It worked almost
flawless but as my data required 100% certainty someone had to
manually check the data to avoid errors. I think that if you can
settle with something like 95% certainty then it should be possible to
have this automated.
karl
8 maj 2009 kl. 06.57 skrev Nate:
Hi Karl,
No, sometimes there will not be a matching MP3 for a note file. When
this happens, the results I get are very poor. For example, if a song
with a common song word like "love" in the name does not have a
matching note file, then I get a handful of results that contain the
word "love" but are otherwise obviously not a good match. I need some
way to judge the quality of the matches, or possible some other
approach to doing the search that helps avoid false positives.
On your clue, I have been reading about ngrams. Very interesting! I
see it is very useful for spell checking. However, how would I
leverage ngrams for my needs? Would the Lucene SpellChecker classes be
of any use?
I really feel like I'm floundering here. I am more than willing to put
in the work, I just need a push or two in the right directions. :)
Thanks!
-Nate
On Thu, May 7, 2009 at 7:50 AM, Karl Wettin <karl.wet...@gmail.com>
wrote:
Nate,
will there always be a correspodning mp3 for any given note sheet?
As for analysis, I'd try using ngrams of the complete untokenized
file name
if I was you.
"Michael Jackson Don't Stop 'till You Get Enough" ->
"^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and
so on.
See
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/ngram/package-summary.html
karl
7 maj 2009 kl. 08.28 skrev Nate:
Thanks Anshum.
What happens if a search returns only one match, and that match is
not
very "good"? If scores are only comparable to the scores of other
matches in the same search, then the score is effectively
meaningless
if there is only one match.
It seems like a very common need to want to provide a "relevance"
metric along with search results. I somewhat understand the
complexities after reading this thread and the threads it links...
http://www.gossamer-threads.com/lists/lucene/java-user/75002
My case is slightly better since I don't care to show users the
metric. My queries are simple term and boolean queries.
This thread talks about "theoretical maximum score" but quickly
loses
me. Does this seem like the road to go down, given my needs?
http://www.gossamer-threads.com/lists/lucene/java-user/61075#61075
Say I do a search like:
Michael Jackson Don't stop until you get enough
And this is the top match:
Michael Jackson Don't Stop 'till You Get Enough
Would it make any sense to do a query with the exact contents of the
top match to get a maximum score for that document? Would the
resulting percentage be meaningful?
-Nate
On Wed, May 6, 2009 at 10:08 PM, Anshum <ansh...@gmail.com> wrote:
Hi Nate,
The scores are only comparable within the same search and not over
different
searches as the scores are affected by query as well as docs.
About the threshold, I guess you could have count cutoff to get
'x' best
matches. Said so coz I'm not really able to recollect anything
which
could
use score as a metric to absolutely cluster 'good' and 'not good'
matches.
--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com
The facts expressed here belong to everybody, the opinions to me.
The
distinction is yours to draw............
On Thu, May 7, 2009 at 6:27 AM, Nate <n...@n4te.com> wrote:
Hi all,
First, the problem I'm trying to solve: I have two folders, each
containing files. I need to match files in one folder with files
in
the other. Eg:
notes/Michael Jackson - Don't Stop 'till You Get Enough.notes
songs/Michael Jackson Don't stop until you get enough.mp3
I provide the notes files, but the song files come from a user's
music
library, so often are not named well. I am attempting to use
Lucene to
find the most likely note file for each song file.
I index the note files, then I use the StandardAnalyzer with
carefully
chosen stop words to search the index. The query uses each word
in the
song file name (w/o extension) as a term. Fuzzy matching is used
for
words with > 4 characters, and the fuzzy percentage is set to be
1 /
termlength. This works ok so far, though I would love to hear
opinions
on any improvements I could make. This is my first use of
Lucene, so
I'm not sure I've chosen the best approach.
The problem I'm having is: Sometimes there is a song file that
has no
matching note file. In this case I get back results with "low"
scores,
such as 0.2 or 0.05. A "really good" match gives me 7 or 8. I
don't
really understand what the scoring means, so I don't know what
would
be a reasonable threshold to ignore scores.
I understand scores are not relevance percentages. I think the
scores
are only useful relative to other scores. Is this right? Are
they only
relative to scores from the same search, or from any search
against
the same index? How can I know if a score is "low", so I can
ignore
matches that aren't very good?
Sorry if this has been discussed before. I have searched around a
great deal and was unable to find a straight answer.
Thanks!
-Nate
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org