Hello, I want to align the output of two different analysis pipelines, but I don't know how. We are using Lucene for text analysis. First, every input text is normalized using StandardTokenizer, StandardFilter and LowerCaseFilter. This yields a list of tokens (list1). Second, the same input text is also stemmed and stopwords are removed, yielding list2:
list1: [this text contains stopwords i need to align them]
list2: [---- text contain stopword -- need -- align ----]

To align both lists, I need to know which tokens were removed by the StopFilter. The following code works, but not for the last token ("them"):

    while (tokenStream.incrementToken()) {
        int skippedTokens = tokenStream.getAttribute(PositionIncrementAttribute.class)
                .getPositionIncrement() - 1;
        // process the current token, e.g. we know that "need" is the 6th
        // element in the list because the previous token was removed
    }

For stopwords at the end of the tokenStream (e.g. "them"), the positionIncrement is not updated: no further token is emitted after the loop ends, so the trailing gap is never reported and skippedTokens comes out as 0.

My workaround is to append a unique number to every input text, so that every text ends with a non-stopword. Can you think of a more reasonable approach?

Thank you,
Hannes
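For illustration, here is a minimal sketch of the alignment bookkeeping described above, written in plain Java without any Lucene calls. It assumes the position increments have already been collected inside the loop and that the length of list1 is known, so a trailing gap can be filled from that length instead of from a sentinel token. The class name TokenAligner, the method align and the "--" marker are hypothetical names chosen for this example, and the sketch assumes every increment is at least 1 (no stacked tokens such as synonyms).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenAligner {
    /** Hypothetical marker for a position whose token was removed. */
    static final String GAP = "--";

    /**
     * Rebuilds a list of length fullLength from the surviving tokens and
     * their position increments. Assumes every increment is >= 1.
     */
    static List<String> align(int fullLength, List<String> kept, int[] positionIncrements) {
        if (kept.size() != positionIncrements.length) {
            throw new IllegalArgumentException("one increment per kept token is required");
        }
        List<String> aligned = new ArrayList<>();
        int position = -1; // 0-based position of the last token seen
        for (int i = 0; i < kept.size(); i++) {
            position += positionIncrements[i];
            // Fill every position that was skipped before this token.
            while (aligned.size() < position) {
                aligned.add(GAP);
            }
            aligned.add(kept.get(i));
        }
        // Trailing removals (e.g. "them") never produce an increment inside
        // the loop, so pad up to the length of the unfiltered list.
        while (aligned.size() < fullLength) {
            aligned.add(GAP);
        }
        return aligned;
    }

    public static void main(String[] args) {
        List<String> list1 = Arrays.asList(
                "this", "text", "contains", "stopwords", "i", "need", "to", "align", "them");
        List<String> kept = Arrays.asList("text", "contain", "stopword", "need", "align");
        int[] increments = {2, 1, 1, 2, 2}; // as the stop filter would report them
        System.out.println(align(list1.size(), kept, increments));
    }
}
```

Because list1 comes from the same input text, its length already tells us how many positions follow the last surviving token. Depending on the Lucene version, calling tokenStream.end() after the loop and reading the PositionIncrementAttribute again may also report the trailing skipped positions; I have not verified this for the version used here.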