Re: Positions in SpanFirst

Erick Erickson Wed, 21 Feb 2007 14:00:22 -0800

See below..

On 2/21/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:


Hi Erick,

> I'm not sure you can, since all the interfaces I use alter the increment
> between successive terms, but I'll be the first to admit that there are
> many
> nooks and crannies that I don't know about... But I suspect that a
negative
> increment is not supported intentionally....

I read your other interesting post about omitting termvector info and this
led
me to find Analyzer.getPositionIncrementGap.  The javadocs state

"Invoked before indexing a Field instance if terms have already been added
to
that field..."

so I thought that sounded good, but there does not seem to be a way to set
it
and most of the Analyzers just seem to use the base Analyzer method which
returns 0, so I'm now confused as to what this actually does in practice.




What this does is allow you to put gaps between successive sets of terms
indexed in the same field. For instance...
doc.add("field", "some stuff");
doc.add("field", "bunch hooey");
doc.add("field", "what is this");
writer.add(doc);

In this case, there would be the following positions, assuming that the
IncrementGap was 1000....
some 0
stuff 1
bunch 1002
hooey 1003
what 2004
is 2005
this 2006

It was a little hard to get my head around. The purpose is to be able to
increment things in a single field in a document, but have some sense of
grouping.

But I really doubt you want to do this due to the consequences. Consider
in
> your example the terms would have the following offsets
> first 0
> bit 1
> second 0
> part 1
> third 0
> section 1
>
> Now think about a proximity query "first section"~1. This would produce
a
> hit because you've changed the whole sense of what offsets mean. Is this
> really a good thing?

That's a good point.  The field is used to index mail recipients and
currently
has a "starts with" search (non Lucene implementation).  Unless I can set
the
position increment gap, it is only ever possible to search for the first
indexed
recipient with proxity queries.\



This is confusing me. You can easily use proximity queries with the above
scenario. For instance, searching for "bunch hooey"~4 would generate a hit.
As would "bunch hooey"~10000. But "some this"~10 would not generate a hit.
Whether that does what you need is another question <G>... So it's time to
ask "what are you really trying to do?" In other words, what behavior are
you trying to mimic from the old code? It's not clear to me what the
behavior you need is. It'd help if you gave a concrete example of the raw
data, and what you want returned...

In your first example, using the above scheme, you'd get hits (using
SpanNear rather than SpanFirst) if you searched on
"first bit" in a SpanNear query with a slop of 2. You'd also get a hit if
you searched on
"second part" in a SpanNear with a slop of 2. Does this mimic the behavior
you need?

NOTE:, my "first bit" with slop shorthand above would actually be
constructed by instantiating a SpanNear query with two SpanTermQuerys in the
consctructor....

Best
Erick


I'm trying to ensure the Lucene implementation provides at least the

original
functionality.  If I can't achieve it I can just document the
limitation.  If I
can, I may get false hits, but I still have the choice to filter the hits
and
weed out the false ones before being given to the client.  It's not a
showstopper, it would be good it it could be done.

Thanks
Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Positions in SpanFirst

Reply via email to