Re: Advice regarding fuzzy phrase searching

Mark Miller Tue, 11 Dec 2007 15:00:28 -0800

Take a look at: https://issues.apache.org/jira/browse/LUCENE-794

This is an extension to the Highlighter that highlights span and proximity queries. If you rewrite the query it will also do fuzzy queries. I am sure you can easily steal some of the code to do what you want.

Keep in mind, because of how Lucene's SpanQuery works, if you say to find 'mark within 4 of ball', Lucene will not find all occurrences. For example, take 'mark close to ball ball': even if you say find mark within 20 of ball, a Span query will only find the first occurrence of ball even though both occurrences are within 20. If ball was on both sides of mark, both would match, but after finding the first ball within 20 of mark, Span doesn't continue looking for another.
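
If you do need every occurrence, one option is to post-process the term positions yourself. Here is a minimal sketch of that idea in plain Java. It does not use any Lucene API: the class ProximityPairs and both of its methods are hypothetical names made up for illustration, and the whitespace tokenizer stands in for whatever analyzer you index with. In a real setup the position arrays would presumably come from the index (e.g. TermPositions) rather than from re-tokenizing the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: enumerates every pair of
// positions for two terms that lie within a maximum distance.
public class ProximityPairs {

    /** Returns every {a, b} pair of positions with |a - b| <= maxDistance. */
    public static List<int[]> pairsWithin(int[] firstPositions,
                                          int[] secondPositions,
                                          int maxDistance) {
        List<int[]> pairs = new ArrayList<>();
        for (int a : firstPositions) {
            for (int b : secondPositions) {
                if (Math.abs(a - b) <= maxDistance) {
                    pairs.add(new int[] {a, b});
                }
            }
        }
        return pairs;
    }

    /** Collects the token positions of a term in whitespace-tokenized text. */
    public static int[] positionsOf(String term, String text) {
        String[] tokens = text.trim().split("\\s+");
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equalsIgnoreCase(term)) {
                found.add(i);
            }
        }
        return found.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        String text = "mark close to ball ball";
        int[] marks = positionsOf("mark", text);
        int[] balls = positionsOf("ball", text);
        // Report every (mark, ball) pair within 20 positions of each other.
        for (int[] pair : pairsWithin(marks, balls, 20)) {
            System.out.println("mark@" + pair[0] + " ball@" + pair[1]);
        }
    }
}
```

The nested loop is quadratic in the number of positions, which is fine for a sketch. Since positions normally come back sorted, a two-pointer sweep would be the obvious improvement for very large documents.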


- Mark

Jose Luna wrote:

Hello,
I am looking for some advice regarding which tools I might use to solve my problem. I apologize ahead of time for the long explanation.
Problem Description: I would like to index a set of very large HTML documents. I would then be able to run two different kinds of queries: proximity queries, and fuzzy phrase queries. I would like to get the exact positions of the matching results from the query (I need to modify the original documents at these positions.) I will only need to search one document at a time, i.e., I already know which document I'll be looking in, so what's important is finding the positions of the hits within that document.
For example, for a fuzzy search, I may want to search for "arterial oxygen saturation". I would want this to match "arterial oxygen saturate", and I would want to get the position of where it matches. I would also like to do proximity searches, with these broken into separate terms. So, I may be searching for "arterial", "oxygen", and "saturate" all within 10 terms of each other, and get the positions of the cases that match.
To the best of my understanding, Lucene is not a good choice to solve this problem (please correct me if I'm wrong). As far as I can tell, Lucene breaks up a document into a set of terms, and indexes these in some sort of structure. My guess is a B+ tree, but I'm curious to learn more about it -- I couldn't find much in the documentation about the underlying index structure. Anyway, this means that the key->pointer pairs in the index are basically term->documentID pairs. So this isn't very suitable for my problem. I already know which document I want to search, I'm interested in the position of hits.
If I were to search for the phrase "arterial oxygen saturation", this would be broken into terms and I could iterate through all of the TermPositions for a given term in the document, and try to find out where these terms are adjacent in the document. Considering that my document is very large, the phrases can be 10+ terms, and I need to do this hundreds of times, this doesn't sound like a very good solution. If we introduce the idea of fuzzy matches and proximity searches, it seems like this task of iterating through TermPositions becomes very complicated.
I've spent time reading the docs, creating a test program, and reading the mailing list. As far as I can tell, Lucene is geared towards document-based queries, and isn't the ideal tool for my problem. I think an index based on a suffix tree (or a variation of one) would better meet my needs, but I'm not sure how well these perform with fuzzy and proximity searches. I've looked around, and I can't seem to find a good open-source indexing framework like Lucene that's based on a suffix tree. Are there any suggestions for tools that would help with this problem? Does anyone have any suggestions on how I might bend Lucene to meet my needs?
Thanks in advance,

JLuna

