Re: offsets

2018-07-30 Thread Michael Sokolov
Yes, in fact Tokenizer already provides correctOffset which just delegates to CharFilter. We could expand on this, moving correctOffset up to TokenStream, and also adding correct() so that TokenFilters can add to the character offset data structure (two int arrays) and share it across the analysis
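For readers unfamiliar with the "two int arrays" idea mentioned above, here is a minimal sketch of how such an offset-correction structure could work. The class name OffsetCorrectionSketch and its methods add() and correct() are hypothetical, chosen only for illustration; this is not Lucene's API and not necessarily what the proposal would look like. One array holds offsets in the transformed text, the other holds the cumulative difference to add from that offset onward to get back to the original text.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration only: a tiny offset-correction map backed by
 * two parallel int arrays. Not part of Lucene.
 */
public class OffsetCorrectionSketch {
    // offsets[i]: position in the transformed (output) text
    // diffs[i]:   cumulative amount to add, from offsets[i] onward,
    //             to map an output offset back to the original input
    private int[] offsets = new int[4];
    private int[] diffs = new int[4];
    private int size = 0;

    /** Records a correction; entries must be added in non-decreasing offset order. */
    public void add(int off, int cumulativeDiff) {
        if (size > 0 && off < offsets[size - 1]) {
            throw new IllegalArgumentException("offsets must not decrease");
        }
        if (size > 0 && off == offsets[size - 1]) {
            // Same position recorded again: the later cumulative diff wins.
            diffs[size - 1] = cumulativeDiff;
            return;
        }
        if (size == offsets.length) {
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = off;
        diffs[size] = cumulativeDiff;
        size++;
    }

    /** Maps an offset in the transformed text back to the original text. */
    public int correct(int currentOff) {
        int idx = Arrays.binarySearch(offsets, 0, size, currentOff);
        if (idx < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1, so this
            // yields the last entry whose offset is <= currentOff (or -1).
            idx = -idx - 2;
        }
        return idx < 0 ? currentOff : currentOff + diffs[idx];
    }
}
```

As a usage example, suppose a filter rewrites the input "a&amp;b" to "a&b". The output is 4 characters shorter from output offset 2 onward, so a single call add(2, 4) records it; correct(1) then stays 1, while correct(2) gives 6, the position of "b" in the original input. Sharing one such structure across the analysis chain would let every stage append its own corrections, at the cost of requiring entries to arrive in offset order.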

Practical usages of arbitrary Shingles when using a query parser?

2018-07-30 Thread Chris Hostetter
Although I've been aware of Shingles and some of their useful applications for a long time, today is the first time I really sat down and tried to do something non-trivial with them myself. My objective seems relatively straightforward: given a corpus of text and some analyzer (for sake of dis