Hey everyone,
I'm trying to do the following:

1. Run a string through a simple tokenizer (i.e. WhitespaceTokenizer).
2. Run the resulting tokens through both my current tokenizer and StandardTokenizer, in order to isolate the tokens on which the two disagree.

(Background: I want to do this so that I can keep my current tokenization scheme while also allowing matches against the StandardTokenizer scheme.)

Example, starting from the original string "Bob v2document":

1. "Bob v2document" => ["Bob", "v2document"]
2. Run each of these tokens through my current tokenizer and StandardTokenizer. For "v2document", let's say that my tokenizer spits out ["v", "2", "document"] and StandardTokenizer spits out ["v2document"]. I would take the latter token and append it, and get something like ["Bob", "v", "2", "document", GAP_SO_PHRASE_QUERIES_SKIP, "v2document"].

I'll be doing this at scale whenever I index a document, and I'm concerned about performance: for every single original token, I'll need to build a StringReader and pass it through both tokenizers.

I know that in general, what I'm doing seems more appropriate for TokenFilters, but I see two issues:

- It looks like there's no TokenFilter version of StandardTokenizer that applies its rules to each token individually. Maybe I'm overlooking something?
- If I pass ["Bob", "v2document"] through two TokenFilters separately, I'll get something like ["Bob", "v", "2", "document"] and ["Bob", "v2document"], and then I won't know which output tokens belong to which original token.

Any suggestions for doing this in a performant way?

Thanks!
Tavi
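P.S. To make this concrete, below is a rough, untested sketch of the kind of filter I'm imagining, fed by a WhitespaceTokenizer. Everything in it is my own guesswork: DualSchemeFilter is a made-up name, PHRASE_GAP stands in for my GAP_SO_PHRASE_QUERIES_SKIP placeholder above, and I'm assuming both inner Tokenizers can be reused once per token via the setReader()/reset()/end()/close() cycle.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Sketch only: expects whitespace-separated tokens as input. Each one is
 * re-tokenized with (a) my current tokenizer and (b) a reused
 * StandardTokenizer. My scheme's tokens are emitted at normal positions;
 * when StandardTokenizer disagrees, its tokens are appended after a large
 * position gap so phrase queries can't match across the seam.
 */
public final class DualSchemeFilter extends TokenFilter {

  private static final int PHRASE_GAP = 100; // "GAP_SO_PHRASE_QUERIES_SKIP"

  private final Tokenizer current;                      // my existing tokenizer
  private final Tokenizer standard = new StandardTokenizer();

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  private final List<String> queuedText = new ArrayList<>();
  private final List<Integer> queuedPosInc = new ArrayList<>();
  private int queued = 0;

  public DualSchemeFilter(TokenStream input, Tokenizer current) {
    super(input);
    this.current = current;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (queued >= queuedText.size()) {
      // Queue exhausted; pull the next whitespace token and re-tokenize it.
      if (!input.incrementToken()) {
        return false;
      }
      String word = termAtt.toString();
      int firstPosInc = posIncAtt.getPositionIncrement();

      queuedText.clear();
      queuedPosInc.clear();
      queued = 0;

      List<String> mine = retokenize(current, word);
      List<String> std = retokenize(standard, word);

      // My scheme's tokens go at normal positions.
      for (int i = 0; i < mine.size(); i++) {
        queuedText.add(mine.get(i));
        queuedPosInc.add(i == 0 ? firstPosInc : 1);
      }
      // Append StandardTokenizer's tokens only when they differ.
      if (!std.equals(mine)) {
        for (int i = 0; i < std.size(); i++) {
          queuedText.add(std.get(i));
          queuedPosInc.add(i == 0 ? PHRASE_GAP : 1);
        }
      }
    }
    // Emit the next queued token. (This sketch ignores offsets; they still
    // point at the original whitespace token.)
    termAtt.setEmpty().append(queuedText.get(queued));
    posIncAtt.setPositionIncrement(queuedPosInc.get(queued));
    queued++;
    return true;
  }

  /** Runs text through a reused Tokenizer and collects the terms. */
  private static List<String> retokenize(Tokenizer t, String text) throws IOException {
    List<String> out = new ArrayList<>();
    CharTermAttribute term = t.addAttribute(CharTermAttribute.class);
    t.setReader(new StringReader(text));
    t.reset();
    while (t.incrementToken()) {
      out.add(term.toString());
    }
    t.end();
    t.close(); // close() is what allows setReader() to be called again
    return out;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    queuedText.clear();
    queuedPosInc.clear();
    queued = 0;
  }
}

The idea is that reusing the two Tokenizer instances should leave the per-token StringReader as the only allocation, which I'd hope is cheap; and since each original token is re-tokenized in isolation, I always know which output tokens came from which input token. I haven't measured any of this, though.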