Re: Question about startOffset and endOffset

Brendan Grainger Mon, 12 May 2008 14:12:01 -0700

Hi Erick,

Thanks for the reply. The use case I have is this:


Say you have a synonym expansion like this:

ac -> air conditioning

And to keep it simple, a document where the first term is ac. When analyzing the document I currently create a token stream that looks something like this for the 'ac' term:

ac -> Token(positionIncrement = 1, startOffset = 0, endOffset = 2, type = 'word')
air -> Token(positionIncrement = 0, startOffset = 0, endOffset = 2, type = 'synonym')
conditioning -> Token(positionIncrement = 0, startOffset = 0, endOffset = 2, type = 'synonym')

Now what I'd like to do is display the fact that this expansion took place to the user when they query. However, now I don't know how to reconstruct the phrase 'air conditioning'. If, however, I do this:

ac -> Token(positionIncrement = 1, startOffset = 0, endOffset = 2, type = 'word')
air -> Token(positionIncrement = 0, startOffset = 0, endOffset = 3, type = 'synonym')
conditioning -> Token(positionIncrement = 0, startOffset = 3, endOffset = 15, type = 'synonym')


I can reconstruct the fact that ac maps to air conditioning.
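
A minimal sketch of that reconstruction, not tied to the real Lucene classes: Tok is a hypothetical stand-in for Token that carries only the fields listed above, and expansions is a made-up helper name. It assumes synonym tokens follow their original term with positionIncrement = 0, that their start offsets are relative to the expansion string (as in the second listing), and that the expanded words are separated by a single space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SynonymOffsetSketch {

    /** Hypothetical stand-in for a Lucene Token; only the fields discussed here. */
    static final class Tok {
        final String term;
        final int positionIncrement;
        final int startOffset;
        final int endOffset;
        final String type;

        Tok(String term, int positionIncrement, int startOffset, int endOffset, String type) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.type = type;
        }
    }

    /**
     * Maps each original term to the phrase rebuilt from the synonym tokens
     * stacked on it. Synonyms are ordered by startOffset and joined with a
     * single space (an assumption of this sketch). A term that occurs twice
     * in the stream overwrites its earlier entry.
     */
    static Map<String, String> expansions(List<Tok> stream) {
        Map<String, List<Tok>> grouped = new LinkedHashMap<>();
        List<Tok> current = null;
        for (Tok t : stream) {
            if (t.positionIncrement > 0) {
                // A new position starts here: this is an original term.
                current = new ArrayList<>();
                grouped.put(t.term, current);
            } else if (current != null && "synonym".equals(t.type)) {
                // Same position as the last original term: a stacked synonym.
                current.add(t);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Tok>> e : grouped.entrySet()) {
            List<Tok> synonyms = e.getValue();
            if (synonyms.isEmpty()) {
                continue; // no expansion took place for this term
            }
            synonyms.sort(Comparator.comparingInt((Tok t) -> t.startOffset));
            StringBuilder phrase = new StringBuilder();
            for (Tok s : synonyms) {
                if (phrase.length() > 0) {
                    phrase.append(' ');
                }
                phrase.append(s.term);
            }
            result.put(e.getKey(), phrase.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tok> stream = new ArrayList<>();
        stream.add(new Tok("ac", 1, 0, 2, "word"));
        stream.add(new Tok("air", 0, 0, 3, "synonym"));
        stream.add(new Tok("conditioning", 0, 3, 15, "synonym"));
        System.out.println(expansions(stream)); // prints {ac=air conditioning}
    }
}
```

With the first listing, where every synonym carries offsets 0 and 2, the sort has nothing to order by and the result depends only on the order in which the tokens were emitted.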

Thanks
Brendan

On May 12, 2008, at 12:23 PM, Erick Erickson wrote:

Is this a theoretical question or is there a use-case you're trying
to support? If the latter, a statement of the problem you're trying
to solve would be helpful.

If the former, setting all your start offsets to 0 seems wrong. You're
essentially saying that all tokens are at the beginning of the document
and each one is some ever-increasing distance from the start. Some
of your later words will look really, really long (like the length
of your document).

Offhand, I expect this will affect span queries, phrase
queries, and who knows what else? Maybe scoring?
Or maybe not, maybe some (or all) of these work off of
term positions rather than offsets. So unless you're trying to
accomplish something specific, I sure wouldn't do this
given the problems it *could* create.
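
On the positions-versus-offsets point: as far as I know, Lucene derives a token's position as the running sum of position increments, and phrase and span queries match on those positions, while offsets are recorded for uses such as highlighting. The sketch below only illustrates that arithmetic; PositionSketch and positions are hypothetical names, and the fourth token ("unit") is invented for the example.

```java
import java.util.Arrays;

class PositionSketch {

    /**
     * Turns position increments into zero-based absolute positions.
     * Offsets are not an input: changing startOffset or endOffset
     * leaves these positions untouched.
     */
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = -1; // the first token with increment 1 lands on position 0
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // ac (1), air (0), conditioning (0), then a hypothetical next word "unit" (1)
        int[] increments = {1, 0, 0, 1};
        System.out.println(Arrays.toString(positions(increments))); // prints [0, 0, 0, 1]
    }
}
```

Both synonym tokens share position 0 with 'ac' regardless of which offsets they are given.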

That said, I'm not much of an expert on how the offsets are
used, but I'd be really leery of changing them "just for the fun of
it" <G>......

Best
Erick

On Mon, May 12, 2008 at 12:06 PM, Brendan Grainger <
[EMAIL PROTECTED]> wrote:

Hi,

I have a TokenStream that inserts synonym tokens into the stream when
matched. One thing I am wondering about is what is the effect of the
startOffset and endOffset. I have something like this:

Token synonymToken = new Token(originalToken.startOffset(),
originalToken.endOffset(), "SYNONYM");
synonymToken.setPositionIncrement(0);

What I am wondering is: if I set the startOffset to 0 and the endOffset to the length of the synonym string, what effect will this have?


e.g.
Token synonymToken = new Token(0, repTok.endOffset(), "SYNONYM");
synonymToken.setPositionIncrement(0);

Thanks



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
