Advice regarding fuzzy phrase searching

Jose Luna Tue, 11 Dec 2007 10:30:47 -0800

Hello,

I am looking for some advice regarding which tools I might use to solve my problem. I apologize ahead of time for the long explanation.

Problem Description: I would like to index a set of very large HTML documents. I would then like to run two different kinds of queries: proximity queries and fuzzy phrase queries. I would like to get the exact positions of the matching results from the query (I need to modify the original documents at these positions). I will only need to search one document at a time, i.e., I already know which document I'll be looking in, so what's important is finding the positions of the hits within that document.

For example, for a fuzzy search, I may want to search for "arterial oxygen saturation". I would want this to match "arterial oxygen saturate", and I would want to get the position of where it matches. I would also like to do proximity searches, with these broken into separate terms. So, I may be searching for "arterial", "oxygen", and "saturate" all within 10 terms of each other, and get the positions of the cases that match.
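To make the proximity case concrete, here is a minimal Python sketch of the result I'm after. It is not tied to Lucene or any other library: the tokenizer is a naive assumption, `proximity_hits` is a hypothetical helper name, and I'm reading "within 10 terms of each other" as "a span covering at most `window` token positions".

```python
import re

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def proximity_hits(tokens, terms, window):
    """Return (start, end) token positions of minimal spans that contain
    every term in `terms` and cover at most `window` token positions."""
    wanted = set(terms)
    # Keep only the occurrences of the terms we care about, in document order.
    events = [(pos, tok) for pos, tok in enumerate(tokens) if tok in wanted]
    hits = []
    counts = {}   # how many times each wanted term occurs in the current span
    left = 0      # index into `events` of the span's left edge
    for pos_r, tok_r in events:
        counts[tok_r] = counts.get(tok_r, 0) + 1
        # While every term is covered, try to shrink the span from the left.
        while len(counts) == len(wanted):
            pos_l, tok_l = events[left]
            left += 1
            if counts[tok_l] > 1:
                counts[tok_l] -= 1   # redundant occurrence, drop it
                continue
            # tok_l is the last copy in the span, so [pos_l, pos_r] is minimal.
            if pos_r - pos_l + 1 <= window:
                hits.append((pos_l, pos_r))
            del counts[tok_l]
    return hits

text = ("Arterial oxygen saturate was low. The arterial line showed "
        "oxygen levels that saturate slowly.")
tokens = tokenize(text)
print(proximity_hits(tokens, ["arterial", "oxygen", "saturate"], window=10))
```

The positions here are token offsets; to modify the original document I would also need to record each token's character offset, which this sketch leaves out.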

To the best of my understanding, Lucene is not a good choice to solve this problem (please correct me if I'm wrong). As far as I can tell, Lucene breaks up a document into a set of terms, and indexes these in some sort of structure. My guess is a B+ tree, but I'm curious to learn more about it -- I couldn't find much in the documentation about the underlying index structure. Anyway, this means that the key->pointer pairs in the index are basically term->documentID pairs. So this isn't very suitable for my problem. I already know which document I want to search; I'm interested in the positions of hits.

If I were to search for the phrase "arterial oxygen saturation", this would be broken into terms and I could iterate through all of the TermPositions for a given term in the document, and try to find out where these terms are adjacent in the document. Considering that my document is very large, the phrases can be 10+ terms, and I need to do this hundreds of times, this doesn't sound like a very good solution. If we introduce the idea of fuzzy matches and proximity searches, it seems like this task of iterating through TermPositions becomes very complicated.

I've spent time reading the docs, creating a test program, and reading the mailing list. As far as I can tell, Lucene is geared towards document-based queries, and isn't the ideal tool for my problem. I think an index based on a suffix tree (or a variation of one) would better meet my needs, but I'm not sure how well these perform with fuzzy and proximity searches. I've looked around, and I can't seem to find a good open-source indexing framework like Lucene that's based on a suffix tree. Are there any suggestions for tools that would help with this problem? Does anyone have any suggestions on how I might bend Lucene to meet my needs?
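For reference, the "iterate through term positions and look for adjacency" approach I describe above looks roughly like the following Python sketch. It does not use Lucene's API: `build_positions`, `similar`, and `fuzzy_phrase_hits` are hypothetical names, the tokenizer is naive, and the fuzzy test uses `difflib.SequenceMatcher` with an arbitrary 0.75 threshold rather than whatever similarity measure a real engine would apply.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

def build_positions(text):
    """Map each term to the list of token positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        index[tok].append(pos)
    return index

def similar(a, b, threshold=0.75):
    # Stand-in fuzzy test (an assumption): similarity ratio over characters.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def fuzzy_phrase_hits(index, phrase, threshold=0.75):
    """Return start positions where consecutive document tokens fuzzily
    match the terms of `phrase`, in order."""
    query = re.findall(r"[a-z0-9]+", phrase.lower())
    if not query:
        return []
    # For each query term, pool the positions of every indexed term that
    # is similar enough to it.
    slots = []
    for q in query:
        positions = set()
        for term, plist in index.items():
            if similar(q, term, threshold):
                positions.update(plist)
        slots.append(positions)
    # A hit starts at p if query term i has a match at position p + i.
    return sorted(p for p in slots[0]
                  if all(p + i in slots[i] for i in range(1, len(query))))

doc = ("The arterial oxygen saturate reading dropped. Oxygen was given. "
       "Arterial oxygen saturation recovered.")
index = build_positions(doc)
print(fuzzy_phrase_hits(index, "arterial oxygen saturation"))
```

The adjacency check itself is a set lookup per query term; the part that worries me is the scan over the whole vocabulary for each fuzzy term, repeated for hundreds of long phrases.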


Thanks in advance,

JLuna
