Hi Erick,
Thanks for the reply. The use case I have is this:
Say you have a synonym expansion like this:
ac -> air conditioning
And to keep it simple, a document where the first term is ac. When
analyzing the document I currently create a token stream that looks
something like this for the 'ac' term:
ac -> Token(positionIncrement = 1, startOffset = 0, endOffset = 2,
type = 'word')
air -> Token(positionIncrement = 0, startOffset = 0, endOffset = 2,
type = 'synonym')
conditioning -> Token(positionIncrement = 0, startOffset = 0,
endOffset = 2, type = 'synonym')
Now what I'd like to do is be able to do is to display the fact that
this expansion took place to the user when they query. However, now I
don't know how to reconstruct the word 'air conditioning'. If however,
I do this:
ac -> Token(positionIncrement = 1, startOffset = 0, endOffset = 2,
type = 'word')
air -> Token(positionIncrement = 0, startOffset = 0, endOffset = 3,
type = 'synonym')
conditioning -> Token(positionIncrement = 0, startOffset = 3,
endOffset = 15, type = 'synonym')
I can reconstruct the fact that ac maps to air conditioning.
Thanks
Brendan
On May 12, 2008, at 12:23 PM, Erick Erickson wrote:
Is this a theoretical question or is there a use-case you're trying
to support? If the latter, a statement of the problem you're trying
to solve would be helpful.
If the former, setting all your start offsets to 0 seems wrong. You're
essentially saying that all tokens are at the beginning of the
document
and each one is some ever-increasing distance from the start. Some
of your later words will look really, really long (like the length
of your document).
Offhand, I expect this will affect up span queries, phrase
queries, and who knows what else? Maybe scoring?
Or maybe not, maybe some (or all) of these work off of
termpositions rather than offsets. So unless you're trying to
accomplish something specific, I sure wouldn't do this
given the problems it *could* create.
That said, I'm not much of an expert on how the offsets are
used, but I'd be really leery of changing them "just for the fun of
it" <G>......
Best
Erick
On Mon, May 12, 2008 at 12:06 PM, Brendan Grainger <
[EMAIL PROTECTED]> wrote:
Hi,
I have a TokenStream that inserts synonym tokens into the stream when
matched. One thing I am wondering about is what is the effect of the
startOffset and endOffset. I have something like this:
Token synonymToken = new Token(originalToken.startOffset(),
originalToken.endOffset(), "SYNONYM");
synonymToken.setPositionIncrement(0);
What I am wondering is if I set the startOffset and endOffset to 0
and the
endOffset to the length of the synonym string what effect will this
have?
eg
Token synonymToken = new Token(0, repTok.endOffset(), "SYNONYM");
synonymToken.setPositionIncrement(0);
Thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]