Hi Simon,
On 25.11.2010 10:40, ext Simon Willnauer wrote:
Hi Jan,
On Wed, Nov 24, 2010 at 9:12 AM, <jan.kure...@nokia.com> wrote:
Of course:
We are trying to search in documents that contain text in several languages. We
are also investigating other approaches*, so this is not about finding other
variants.
The goal is to match only tokens from one or more given languages, and not to
match a token that happens to be spelled the same in another language.
For the payloads, my plan is to add the correct language to each and every token
during indexing (I'm not sure how best to solve this, but I'm sure it can be
solved, at least with Lucene directly).
On the search side, my current idea is to wrap a TermPositions and skip all
docs where the current payload is not one of the requested languages.
I probably need to use my own Query/Weight for this?
You don't need to start from nothing here. I suggest you look at
SpanTermQuery and at TermSpans, which uses DocsAndPositionsEnum (or rather
TermPositions in non-trunk versions). TermSpans gives you the ability
to override #next() and #skipTo(), which, from what I understand, is what
you are looking for, right?
Just to get it right: I only subclass SpanTermQuery to override the
getSpans(Reader) method so that it returns MyTermSpans.
MyTermSpans is a subclass of TermSpans where I just extend #next() and
#skipTo() to advance until my desired payload is found.
Sounds pretty easy and straightforward.
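A minimal sketch of that skipping logic, in plain Java with no Lucene
dependency. SimpleSpans is a hypothetical stand-in for the part of TermSpans
that matters here, not the Lucene API, and ListSpans, LanguageFilteredSpans
and the string payloads are names and simplifications assumed purely for
illustration. A real MyTermSpans would read the payload from the underlying
positions enumeration instead.

```java
import java.util.Set;

// Hypothetical stand-in for a positional enumeration (NOT Lucene's Spans).
interface SimpleSpans {
    boolean next();              // advance to the next match; false when exhausted
    boolean skipTo(int target);  // advance to the first later match with doc >= target
    int doc();                   // document of the current match
    String payload();            // language tag stored with the current match
}

// In-memory enumeration over parallel arrays, used only to drive the example.
final class ListSpans implements SimpleSpans {
    private final int[] docs;
    private final String[] payloads;
    private int i = -1;

    ListSpans(int[] docs, String[] payloads) {
        this.docs = docs;
        this.payloads = payloads;
    }

    public boolean next() {
        if (i < docs.length) i++;
        return i < docs.length;
    }

    public boolean skipTo(int target) {
        // Always advances at least once, then keeps going until doc >= target.
        do {
            if (!next()) return false;
        } while (docs[i] < target);
        return true;
    }

    public int doc() { return docs[i]; }

    public String payload() { return payloads[i]; }
}

// Wrapper that only stops on matches whose payload is a requested language.
final class LanguageFilteredSpans implements SimpleSpans {
    private final SimpleSpans in;
    private final Set<String> languages;

    LanguageFilteredSpans(SimpleSpans in, Set<String> languages) {
        this.in = in;
        this.languages = languages;
    }

    public boolean next() {
        // Keep advancing until a match carries one of the requested languages.
        while (in.next()) {
            if (languages.contains(in.payload())) return true;
        }
        return false;
    }

    public boolean skipTo(int target) {
        if (!in.skipTo(target)) return false;
        // The match we landed on may be in the wrong language; if so, move on.
        return languages.contains(in.payload()) || next();
    }

    public int doc() { return in.doc(); }

    public String payload() { return in.payload(); }
}
```

Wrapping rather than subclassing keeps the sketch independent of any Lucene
version; in the subclass variant the same two loops would sit inside the
overridden #next() and #skipTo().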
Another approach would be to just override the Similarity, but this would only
influence scoring and, depending on the underlying query, not completely skip the
token. I have to test the difference in the final score between these
approaches.
Well, as you figured correctly, this is really rather for scoring.
So if I'm going to use the scoring stuff as well, I would rather subclass
PayloadTermQuery then.
This one blog made me curious whether there is already something similar that
skips TermPositions based on given attributes. I could imagine something similar
to the current Tokenattribute concept at index time, but also available during
search and controlled by a similarity...
Actually, in Lucene 4.0 each Flex-Enum has an AttributeSource that
allows you to add custom attributes to your enumerations. There is
no logic yet that skips based on them, though.
Simon
Lucene 4.0 is still a little far away, isn't it? If the above approach performs
well (and it sounds like it will), it should be good enough for now.
Jan