Look into SpanNearQuery. Its slop parameter lets you say how close you
want the terms to be. If you are going to be doing a lot of these
searches against a single document, I recommend using a MemoryIndex.
Russ
Jose Luna wrote:
Hello,
I am looking for some advice regarding which tools I might use to
solve my problem. I apologize ahead of time for the long explanation.
Problem Description: I would like to index a set of very large HTML
documents. I would then like to run two different kinds of
queries: proximity queries and fuzzy phrase queries. I would like
to get the exact positions of the matching results from the query (I
need to modify the original documents at these positions.) I will
only need to search one document at a time, i.e., I already know which
document I'll be looking in, so what's important is finding the
positions of the hits within that document.
For example, for a fuzzy search, I may want to search for "arterial
oxygen saturation". I would want this to match "arterial oxygen
saturate", and I would want to get the position of where it matches.
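
To make the fuzzy case concrete, here is a minimal plain-Java sketch
(no Lucene involved) that finds the token positions whose term is
within a given edit distance of a query term. The class and method
names, the whitespace tokenization, and the threshold of 3 edits are
all assumptions made for illustration; they are not taken from Lucene
or from the original message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: fuzzy term matching by Levenshtein edit distance.
class FuzzyTermMatch {
    /** Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Token positions whose term is within maxEdits of the query term. */
    static List<Integer> fuzzyPositions(String[] tokens, String term, int maxEdits) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (editDistance(tokens[i], term) <= maxEdits) hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] tokens = "the arterial oxygen saturate reading was normal".split(" ");
        System.out.println(editDistance("saturation", "saturate"));    // prints 3
        System.out.println(fuzzyPositions(tokens, "saturation", 3));   // prints [3]
    }
}
```

Here "saturation" and "saturate" differ by three edits, so a threshold
of 3 lets the query term match the token at position 3.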
I would also like to do proximity searches, with these broken into
separate terms. So, I may be searching for "arterial", "oxygen", and
"saturate" all within 10 terms of each other, and get the positions of
the cases that match.
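
The proximity case can be sketched the same way. The plain-Java code
below (again without Lucene) slides a window over the token stream and
reports the minimal windows that contain every query term. It assumes
that "within N terms of each other" means the last matching position
minus the first is at most N, and that the query terms are distinct;
the names ProximitySearch and findWindows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: unordered proximity matching over token positions.
class ProximitySearch {
    /**
     * Returns {start, end} token positions of minimal windows that contain
     * every query term and satisfy end - start <= maxDistance.
     */
    static List<int[]> findWindows(String[] tokens, String[] terms, int maxDistance) {
        List<String> termList = Arrays.asList(terms);
        int[] counts = new int[terms.length];   // occurrences of each term in the window
        int covered = 0;                        // number of distinct terms in the window
        int left = 0;
        List<int[]> windows = new ArrayList<>();
        for (int right = 0; right < tokens.length; right++) {
            int r = termList.indexOf(tokens[right]);
            if (r < 0) continue;                // not a query term
            if (counts[r] == 0) covered++;
            counts[r]++;
            // Shrink from the left while the window still covers all terms.
            while (covered == terms.length) {
                int l = termList.indexOf(tokens[left]);
                if (l >= 0) {
                    counts[l]--;
                    if (counts[l] == 0) {
                        // tokens[left] was essential, so [left, right] is minimal.
                        covered--;
                        if (right - left <= maxDistance) {
                            windows.add(new int[] {left, right});
                        }
                    }
                }
                left++;
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        String[] tokens = ("arterial blood gas showed oxygen levels near "
                + "saturate while arterial pressure").split(" ");
        String[] terms = {"arterial", "oxygen", "saturate"};
        for (int[] w : findWindows(tokens, terms, 10)) {
            System.out.println(Arrays.toString(w));   // prints [0, 7] then [4, 9]
        }
    }
}
```

Each token is visited at most twice, so the scan is linear in the
number of tokens regardless of how many windows match.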
To the best of my understanding, Lucene is not a good choice to solve
this problem (please correct me if I'm wrong). As far as I can tell,
Lucene breaks up a document into a set of terms, and indexes these in
some sort of structure. My guess is a B+ tree, but I'm curious to
learn more about it -- I couldn't find much in the documentation about
the underlying index structure. Anyway, this means that the
key->pointer pairs in the index are basically term->documentID pairs.
So this isn't very suitable for my problem. I already know which
document I want to search; I'm interested in the positions of the hits.
If I were to search for the phrase "arterial oxygen saturation", this
would be broken into terms and I could iterate through all of the
TermPositions for a given term in the document, and try to find out
where these terms are adjacent in the document. Considering that my
document is very large, the phrases can be 10+ terms, and I need to do
this hundreds of times, this doesn't sound like a very good solution.
If we introduce the idea of fuzzy matches and proximity searches, it
seems like this task of iterating through TermPositions becomes very
complicated.
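
For reference, the position-walking approach described above can be
sketched in plain Java as follows. This is not how Lucene implements
phrase matching; it only illustrates the idea with a per-document map
from each term to its sorted positions. The names PhrasePositions,
buildIndex, and findPhrase are hypothetical, and tokenization is
assumed to be a simple whitespace split.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: exact phrase matching over per-term position lists.
class PhrasePositions {
    /** Maps each term to the ascending list of token positions where it occurs. */
    static Map<String, List<Integer>> buildIndex(String[] tokens) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return index;
    }

    /** Start positions at which the phrase terms occur at consecutive positions. */
    static List<Integer> findPhrase(Map<String, List<Integer>> index, String[] phrase) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.length == 0) return starts;
        List<Integer> candidates =
                index.getOrDefault(phrase[0], Collections.<Integer>emptyList());
        for (int start : candidates) {
            boolean match = true;
            for (int i = 1; i < phrase.length && match; i++) {
                List<Integer> positions =
                        index.getOrDefault(phrase[i], Collections.<Integer>emptyList());
                // Position lists are sorted, so a binary search is enough.
                match = Collections.binarySearch(positions, start + i) >= 0;
            }
            if (match) starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        String[] tokens = ("low arterial oxygen saturation and normal "
                + "arterial oxygen saturation").split(" ");
        Map<String, List<Integer>> index = buildIndex(tokens);
        String[] phrase = {"arterial", "oxygen", "saturation"};
        System.out.println(findPhrase(index, phrase));   // prints [1, 6]
    }
}
```

In this sketch the work per query is bounded by the number of
occurrences of the first phrase term times the phrase length, with a
logarithmic lookup for each remaining term, rather than by the size of
the document.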
I've spent time reading the docs, creating a test program, and reading
the mailing list. As far as I can tell, Lucene is geared towards
document-based queries, and isn't the ideal tool for my problem. I
think an index based on a suffix tree (or a variation of one) would better
meet my needs, but I'm not sure how well these perform with fuzzy and
proximity searches. I've looked around, and I can't seem to find a
good open-source indexing framework like Lucene that's based on a
suffix tree. Are there any suggestions for tools that would help with
this problem? Does anyone have any suggestions on how I might bend
Lucene to meet my needs?
Thanks in advance,
JLuna