Re: offsets

2018-08-04 Thread Michael Sokolov
OK, so I thought some more concrete evidence might be helpful to make the case here and did a quick POC. To get access to precise within-token offsets we do need to make some changes to the public API, but the profile could be kept small. In the version I worked up, I extracted the character offset

Re: offsets

2018-08-01 Thread Michael Sokolov
Given that character transformations do happen in TokenFilters, shouldn't we strive to have an API that supports correct offsets (i.e. highlighting) for any combination of token filters? Currently we can't do that. For example, because of the current situation, WordDelimiterGraphFilter, decompounding

Re: offsets

2018-07-31 Thread Robert Muir
The problem is not one of performance, it's one of complexity. Really I think only the tokenizer should be messing with the offsets... Tokenizers are the ones actually parsing the original content, so it makes sense that they would produce the pointers back to it. I know there are some tokenfilters out there

Re: offsets

2018-07-30 Thread Michael Sokolov
Yes, in fact Tokenizer already provides correctOffset which just delegates to CharFilter. We could expand on this, moving correctOffset up to TokenStream, and also adding correct() so that TokenFilters can add to the character offset data structure (two int arrays) and share it across the analysis
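
As a rough illustration of the "two int arrays" idea mentioned above, here is a minimal, self-contained sketch of an offset-correction table. The class name `OffsetCorrectionMap` and its methods `add` and `correct` are hypothetical and are not part of the Lucene API; the sketch only assumes that each recorded entry pairs an offset in the transformed text with the cumulative difference needed to map it back to the original text.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch (not a Lucene class): maps offsets in transformed
 * text back to offsets in the original text using two parallel int arrays.
 */
final class OffsetCorrectionMap {
    // offsets[i]: position in the transformed text from which diffs[i] applies
    private int[] offsets = new int[4];
    // diffs[i]: cumulative amount to add to get the original-text offset
    private int[] diffs = new int[4];
    private int size = 0;

    /** Records that, from transformed offset {@code off} onward, {@code cumulativeDiff} applies. */
    void add(int off, int cumulativeDiff) {
        if (size > 0 && off < offsets[size - 1]) {
            throw new IllegalArgumentException("offsets must be added in non-decreasing order");
        }
        if (size == offsets.length) {
            // Grow both arrays together so they stay parallel.
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = off;
        diffs[size] = cumulativeDiff;
        size++;
    }

    /** Returns the original-text offset for a transformed-text offset. */
    int correct(int off) {
        int lo = 0, hi = size - 1, found = -1;
        // Binary search for the last entry whose offset is <= off.
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offsets[mid] <= off) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found < 0 ? off : off + diffs[found];
    }
}
```

For example, if the original text `a&amp;b` is transformed to `a&b`, a single call `add(2, 4)` records that everything from transformed offset 2 onward sits four characters later in the original, so `correct(2)` yields 6 while `correct(1)` stays 1. Whether such a table should be shared across the whole analysis chain is exactly the design question under discussion in this thread.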

Re: offsets

2018-07-29 Thread Michael McCandless
How would a fixup API work? Would we try to provide correctOffset throughout the full analysis chain? Mike McCandless http://blog.mikemccandless.com On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote: > I've run into some difficulties with offsets in some TokenFilters I've been > writing,

Re: offsets

2018-07-25 Thread Robert Muir
I think you see it correctly. Currently, only tokenizers can really safely modify offsets, because only they have access to the correction logic from the charfilter. Doing it from a tokenfilter just means you will have bugs... On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote: > I've run in

Re: offsets of a term in a document

2015-09-21 Thread Alan Woodward
> > The second question is what I should put in place of "???". The API says > "pass a prior PostingsEnum for possible reuse", but I don't get how to create > an instance of it. You can just pass null. Alan Woodward www.flax.co.uk > > Many thanks! > > > ---

Re: Offsets in 3.6/4.0

2012-07-17 Thread karsten-solr
Dear Carsten, your question was about the purpose of the offset-Attribute and the reader.getTermFreqVector method. You asked because this method is not very fast. imho the main reason for TermFreqVectors is highlighting. (FastVectorHighlighter and DefaultSolrHighlighter#doHighlightingByHighligh

Re: Offsets in 3.6/4.0

2012-07-17 Thread Carsten Schnober
On 16.07.2012 13:07, karsten-s...@gmx.de wrote: Dear Karsten, > abstract of your post: > you need the offset to perform your search/ranking like the position is > needed for phrase queries. > You are using reader.getTermFreqVector to get the offset. > This is too slow for your application and

Re: Offsets in 3.6/4.0

2012-07-16 Thread karsten-solr
Dear Carsten, abstract of your post: you need the offset to perform your search/ranking like the position is needed for phrase queries. You are using reader.getTermFreqVector to get the offset. This is too slow for your application and you are thinking about switching to version 4.0. imho you should usi