Yes, in fact Tokenizer already provides correctOffset, which just delegates
to CharFilter. We could expand on this, moving correctOffset up to
TokenStream, and also adding correct() so that TokenFilters can add to the
character offset data structure (two int arrays) and share it across the
analysis
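
To make the "two int arrays" idea concrete, here is a rough sketch of what such a shared offset-correction structure might look like. This is only an illustration under my own assumptions: the class name OffsetCorrections and the methods add() and correct() are hypothetical and are not existing Lucene API. One array holds offsets in the filtered text, the other holds the cumulative difference to apply from that offset onward. For example, if a filter turns "a&amp;b" into "a&b", everything from filtered offset 2 onward sits 4 characters further along in the original, so a caller would record add(2, 4), after which correct(2) maps back to original offset 6.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene API: a shared offset-correction map
// backed by two parallel int arrays.
final class OffsetCorrections {
    private int[] offsets = new int[8]; // offsets in the filtered text, strictly increasing
    private int[] diffs = new int[8];   // cumulative (original - filtered) difference from that offset on
    private int size = 0;

    // Record that from filtered offset `off` onward,
    // original offset = filtered offset + cumulativeDiff.
    void add(int off, int cumulativeDiff) {
        if (size > 0 && off <= offsets[size - 1]) {
            throw new IllegalArgumentException("offsets must be strictly increasing");
        }
        if (size == offsets.length) {
            // Grow both arrays together so they stay parallel.
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = off;
        diffs[size] = cumulativeDiff;
        size++;
    }

    // Map an offset in the filtered text back to the original text.
    int correct(int off) {
        int idx = Arrays.binarySearch(offsets, 0, size, off);
        if (idx < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1,
            // so this is the index of the last entry with offset <= off.
            idx = -idx - 2;
        }
        // idx is still negative when `off` precedes every recorded correction.
        return idx < 0 ? off : off + diffs[idx];
    }
}
```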

Although I've been aware of shingles and some of their useful applications for
a long time, today is the first time I really sat down and tried to do
something non-trivial with them myself.
My objective seems relatively straightforward: given a corpus of text and
some analyzer (for sake of dis