OK, so I thought some more concrete evidence might be helpful to make the
case here and did a quick POC. To get access to precise within-token
offsets we do need to make some changes to the public API, but the profile
could be kept small. In the version I worked up, I extracted the character
offset
Given that character transformations do happen in TokenFilters, shouldn't
we strive to have an API that supports correct offsets (i.e. highlighting)
for any combination of token filters? Currently we can't do that. For
example because of the current situation, WordDelimiterGraphFilter,
decompounding
The problem is not one of performance; it's one of complexity. Really I
think only the tokenizer should be messing with the offsets...
Tokenizers are the ones actually parsing the original content, so it makes
sense that they would produce the pointers back into it.
I know there are some tokenfilters out there
Yes, in fact Tokenizer already provides correctOffset which just delegates
to CharFilter. We could expand on this, moving correctOffset up to
TokenStream, and also adding correct() so that TokenFilters can add to the
character offset data structure (two int arrays) and share it across the
analysis
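
To make the "two int arrays" idea above concrete, here is a minimal, self-contained sketch of such an offset-correction structure. It does not use Lucene; the class name OffsetCorrectionMap and the methods addCorrection and correctOffset are hypothetical names chosen for illustration, and the cumulative-difference bookkeeping is an assumption about how such a map could work, not the API proposed in this thread.

```java
import java.util.Arrays;

// Hypothetical sketch: maps offsets in filtered text back to offsets in the
// original text using two parallel int arrays. Not a Lucene class.
final class OffsetCorrectionMap {
    private int[] offsets = new int[4]; // filtered-text offsets where the diff changes
    private int[] diffs = new int[4];   // cumulative (original - filtered) at that offset
    private int size = 0;

    // Record that from filteredOffset onward, original = filtered + cumulativeDiff.
    // Corrections must be added in strictly increasing offset order.
    void addCorrection(int filteredOffset, int cumulativeDiff) {
        if (size > 0 && filteredOffset <= offsets[size - 1]) {
            throw new IllegalArgumentException("corrections must be added in increasing order");
        }
        if (size == offsets.length) {
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = filteredOffset;
        diffs[size] = cumulativeDiff;
        size++;
    }

    // Translate an offset in the filtered text to an offset in the original text.
    int correctOffset(int filteredOffset) {
        int index = Arrays.binarySearch(offsets, 0, size, filteredOffset);
        if (index < 0) {
            index = -index - 2; // last recorded offset <= filteredOffset, or -1 if none
        }
        return index < 0 ? filteredOffset : filteredOffset + diffs[index];
    }

    public static void main(String[] args) {
        // Original text "a&amp;b" was filtered to "a&b": four characters were
        // dropped, so every filtered offset from 2 onward shifts by 4.
        OffsetCorrectionMap map = new OffsetCorrectionMap();
        map.addCorrection(2, 4);
        System.out.println(map.correctOffset(1)); // offset of '&' in the original
        System.out.println(map.correctOffset(2)); // offset of 'b' in the original
    }
}
```

In this sketch a TokenFilter that changes characters would call addCorrection as it goes, and any later stage could call correctOffset to recover positions in the original text. Whether the shared structure should live on TokenStream is the open design question in this thread.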
How would a fixup API work? We would try to provide correctOffset
throughout the full analysis chain?
Mike McCandless
http://blog.mikemccandless.com
On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote:
> I've run into some difficulties with offsets in some TokenFilters I've been
> writing,
I think you see it correctly. Currently, only tokenizers can really
safely modify offsets, because only they have access to the correction
logic from the charfilter.
Doing it from a tokenfilter just means you will have bugs...
>
> The second question is what I should put in place of "???". The API says
> "pass a prior PostingsEnum for possible reuse", but I don't get how to create
> an instance of it.
You can just pass null.
Alan Woodward
www.flax.co.uk
>
> Many thanks!
>
>
> ---
Dear Carsten,
your question was about the purpose of the offset-Attribute and the
reader.getTermFreqVector method.
You asked because this method is not very fast.
IMHO the main reason for TermFreqVectors is highlighting.
(FastVectorHighlighter and DefaultSolrHighlighter#doHighlightingByHighligh
On 16.07.2012 13:07, karsten-s...@gmx.de wrote:
Dear Karsten,
> abstract of your post:
> you need the offset to perform your search/ranking like the position is
> needed for phrase queries.
> You are using reader.getTermFreqVector to get the offset.
> This is too slow for your application and
Dear Carsten,
abstract of your post:
you need the offset to perform your search/ranking like the position is needed
for phrase queries.
You are using reader.getTermFreqVector to get the offset.
This is too slow for your application and you are thinking about switching to version 4.0
imho you should usi