Hey! I'm actually looking all on my own. Anyway, option 2<b> gives me
TooManyClauses. It looks like what I want is to use
IndexReader.termPositions to aggregate the positions of all the terms that
match each wildcard on a per-document basis, then walk the lists for
proximity. Something like...

for each wildcard term wt
  for each term t enumerated by a WildcardTermEnum over wt
    for each doc in t's term positions
      merge t's positions into wt's position list for that doc ID


Now I have, for each original wildcard term, a list of all doc IDs and all
positions that any term matching the wildcard occupies. For any doc that
appears in all lists, compare the positions for proximity and, if proximity
is met, add it to my filter.
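
To make the comparison step concrete, here is a rough sketch of just the proximity walk, with no Lucene calls in it. Everything here is an assumption for illustration: the class and method names (WildcardProximity, withinSlop, buildFilter) are made up, the per-wildcard position lists are assumed to have already been collected into a map of doc ID to sorted positions, and "proximity" is taken to mean an unordered window of at most slop positions. Collecting the positions from the index is not shown.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class WildcardProximity {

    // True if one position can be picked from every list so that all picks
    // fall inside a window of at most `slop` positions.
    // Assumes each array is sorted ascending.
    static boolean withinSlop(List<int[]> lists, int slop) {
        if (lists.isEmpty()) return false;
        int[] idx = new int[lists.size()];      // current cursor into each list
        while (true) {
            int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE, minList = -1;
            for (int k = 0; k < lists.size(); k++) {
                int[] p = lists.get(k);
                if (idx[k] >= p.length) return false;   // a list is exhausted
                int pos = p[idx[k]];
                if (pos < min) { min = pos; minList = k; }
                if (pos > max) max = pos;
            }
            if (max - min <= slop) return true;
            idx[minList]++;                     // advance the cursor that lags behind
        }
    }

    // perTerm holds one map per original wildcard term: doc ID -> sorted
    // positions of any term matching that wildcard in that doc.
    static BitSet buildFilter(List<Map<Integer, int[]>> perTerm, int slop, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        if (perTerm.isEmpty()) return bits;
        for (Integer doc : perTerm.get(0).keySet()) {
            List<int[]> lists = new ArrayList<>();
            for (Map<Integer, int[]> m : perTerm) {
                int[] p = m.get(doc);
                if (p == null) break;           // doc is missing from one list
                lists.add(p);
            }
            if (lists.size() == perTerm.size() && withinSlop(lists, slop)) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

The walk always advances the cursor sitting at the smallest position, so each doc costs roughly the total number of positions times the number of wildcard terms. Positions merged from several terms matching the same wildcard would need to be sorted before this step.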

And away we go. Of course, I have no idea what the speed here is, but I
guess that's what testing is for.

Am I on the right path?

Erick


On 8/2/06, Erick Erickson <[EMAIL PROTECTED]> wrote:

I'm back, with another flavor of wildcards. What direction would you point
a poor boy whose project lead wants wildcard queries and spans? Here's the
problem....

I cannot use any of the classes that throw a "TooManyClauses" exception
(e.g. SpanRegexQuery, or SpanNearQuery with, say, WildcardQuery). The corpus
is big enough that this is guaranteed to be thrown. So, currently I'm using
a filter for wildcard queries, populating it via WildcardTermEnum and
TermDocs... Works like a champ. But I don't see how to combine this with
spans...

It seems to me that spans are incompatible with filters, they're just
different beasts. I see no way to incorporate spans and filters without doing
actual work myself. So, it seems I'm left with several alternatives.

1> figure it out when creating the filter. Conceptually, for each document
find the offsets of the terms I want to span, and find out if the distance
between them fits my criteria and only add the doc to the filter if the
distance is within my parameters.

2> Look at the docs returned by the current filtered process and, for each
doc returned,
  a> don't add if it doesn't fit my span criteria by examining the term
positions.
  b> re-query with a wildcard span, restricted by doc ID. I *think* that
by restricting the query by (lucene) doc_id I'll be able to avoid the "too
many clauses" issue. Assuming that I remember correctly and that the
most-restrictive clause is honored when trying this....

Guys, feel free to hop in here with just the names of the classes I really
want to pay attention to <G>....

I know this is scanty info, what I'm looking for is a very quick
pointer.... What I'm especially looking for is "Just use the
contrib/JustWhatYouWanted class" <G> although I poked around and didn't see
anything...

Thanks
Erick
