On Thu, Nov 25, 2010 at 3:25 PM, Jan Kurella <jan.kure...@nokia.com> wrote:
> Hi Simon,
>
> On 25.11.2010 10:40, ext Simon Willnauer wrote:
>>
>> Hi Jan,
>>
>> On Wed, Nov 24, 2010 at 9:12 AM, <jan.kure...@nokia.com> wrote:
>>>
>>> Of course:
>>>
>>> We are trying to search in documents that contain text in several
>>> languages. We are also investigating other approaches*, so this is not about
>>> finding other variants.
>>> The goal is to only match tokens from 1 or more given languages and not
>>> to match the token if it is by accident the same in another language.
>>>
>>> For the payloads my plan is to add the correct language to each and every
>>> token during indexing (I'm not sure how to solve this best, but I'm sure
>>> this can be solved at least with Lucene directly).
>>> On the search side my current idea is to wrap around a TermPositions and skip
>>> all docs where the current payload is not one of the requested languages.
>>> I probably need to use my own Query/Weight for this?
>>
>> You don't need to start from nothing here. I suggest you look at
>> SpanTermQuery and TermSpans, which uses DocsAndPositionsEnum (or rather
>> TermPositions in non-trunk versions). TermSpans gives you the ability
>> to override #next() and #skipTo(), which is, from what I understand, what
>> you are looking for, right?
>
> Just to get it right: I only subclass SpanTermQuery to override the
> getSpans(Reader) method to return MyTermSpans().
> MyTermSpans is a subclass of TermSpans where I just extend #next() and
> #skipTo() to go further until my desired payload is found.
that sounds about right...

>
> Sounds pretty easy and straightforward.
>>>
>>> Another approach would be to just overwrite the Similarity, but this will
>>> only influence scoring and, depending on the underlying query, not completely
>>> skip the token - I have to test the difference for the final score between
>>> these approaches.
>>
>> Well, as you figured correctly, this is rather for scoring really.
>
> So if I'm going to use the scoring stuff also, I'd rather subclass
> PayloadTermQuery then

hmm, I am not a span expert but I guess that would make it easier though.

>>>
>>> This one blog made me curious if there is already something similar that
>>> skips TermPositions based on given attributes? I could imagine something
>>> similar to the current TokenAttribute concept during index time, but also
>>> available during search and controlled by a similarity...
>>
>> Actually in Lucene 4.0 each Flex-Enum has an AttributeSource that
>> allows you to add custom attributes to your enumerations. Yet there is
>> no logic that skips based on that though.
>>
>> Simon
>
> Lucene 4.0 is a little far away today? If the above approach performs well
> (and it sounds like it will) it should be good enough for now

I was just saying that this is on the way... and yeah, you might need to wait a bit until 4.0 :)

simon

>
> Jan
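
For illustration only, here is a minimal sketch of the approach agreed on in the thread: an enumeration whose next() and skipTo() keep advancing until the payload language is one of the requested ones. It deliberately does not use Lucene. Every name in it (LanguageFilterSketch, Posting, PositionEnum, LanguageFilteredEnum, matchingDocs, samplePostings) is hypothetical. PositionEnum only stands in for the role TermSpans plays, and the language is kept as a plain string rather than a real byte payload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, Lucene-free model of the idea; none of these types are Lucene classes.
class LanguageFilterSketch {

    /** One term occurrence: document id, token position, and the language kept as payload. */
    static final class Posting {
        final int doc;
        final int position;
        final String language;

        Posting(int doc, int position, String language) {
            this.doc = doc;
            this.position = position;
            this.language = language;
        }
    }

    /** Unfiltered enumeration over a doc-ordered posting list (the role TermSpans plays). */
    static class PositionEnum {
        private final List<Posting> postings;
        private int index = -1;

        PositionEnum(List<Posting> postings) { this.postings = postings; }

        private boolean advance() { return ++index < postings.size(); }

        boolean next() { return advance(); }

        /** Moves to the first occurrence with doc >= targetDoc, always advancing at least once. */
        boolean skipTo(int targetDoc) {
            do {
                if (!advance()) return false;
            } while (doc() < targetDoc);
            return true;
        }

        int doc() { return postings.get(index).doc; }
        int position() { return postings.get(index).position; }
        String payloadLanguage() { return postings.get(index).language; }
    }

    /** Subclass that goes further until the payload language is one of the accepted ones. */
    static class LanguageFilteredEnum extends PositionEnum {
        private final Set<String> accepted;

        LanguageFilteredEnum(List<Posting> postings, Set<String> accepted) {
            super(postings);
            this.accepted = accepted;
        }

        @Override
        boolean next() {
            while (super.next()) {
                if (accepted.contains(payloadLanguage())) return true;
            }
            return false;
        }

        @Override
        boolean skipTo(int targetDoc) {
            if (!super.skipTo(targetDoc)) return false;
            // The landing occurrence may be in the wrong language; keep going if so.
            return accepted.contains(payloadLanguage()) || next();
        }
    }

    /** Collects the distinct documents that have at least one occurrence in an accepted language. */
    static List<Integer> matchingDocs(List<Posting> postings, String... languages) {
        PositionEnum e = new LanguageFilteredEnum(postings, new HashSet<>(Arrays.asList(languages)));
        List<Integer> docs = new ArrayList<>();
        while (e.next()) {
            if (docs.isEmpty() || docs.get(docs.size() - 1) != e.doc()) docs.add(e.doc());
        }
        return docs;
    }

    /** Made-up postings for one term that occurs in several languages. */
    static List<Posting> samplePostings() {
        return Arrays.asList(
            new Posting(0, 3, "en"), new Posting(1, 5, "de"), new Posting(2, 1, "fr"),
            new Posting(2, 7, "en"), new Posting(4, 2, "de"));
    }

    public static void main(String[] args) {
        System.out.println(matchingDocs(samplePostings(), "en")); // prints [0, 2]
    }
}
```

In a real index the same loop would presumably sit in a TermSpans subclass returned from an overridden SpanTermQuery#getSpans, with the language decoded from the payload bytes at the current position. That wiring is an assumption and is not shown here.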