Hello Lucene list,

I have a question about a custom Analyzer we are trying to write. The intention is that it tokenizes on whitespace and abstracts over upper/lower case and accented characters. It is used both when indexing documents and when creating Lucene queries from search terms.
I have two implementations. The first one only seems to work correctly if the index is rebuilt after we add something to it. If we do not rebuild, the newly added document is not found when you search for it. I have no idea what could cause this behaviour. Its code is posted below as "Attempt A".

The second implementation seems to work better: with it, newly indexed documents are immediately findable without first rebuilding the index. It also seems to abstract over upper/lower case, and in my colleague's tests (but not in mine) it seems to abstract over accented characters. Its code is posted below as "Attempt B".

We do not understand why our first implementation "Attempt A" behaves the way it does. We also do not understand why the second implementation "Attempt B" improves on it, nor whether it actually fulfills our goals (given the different test results we got). So I'd very much appreciate it if someone could help us understand this, and tell us whether we are taking the right approach to achieve this seemingly simple goal. A third variant we are considering is sketched after Attempt B.

Kind regards,
Heikki Doeleman

===============================================
Attempt A:

public final class GeoNetworkAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // chain: whitespace tokenizing -> lowercasing -> accent folding
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new LowerCaseFilter(ts);
        ts = new ASCIIFoldingFilter(ts);
        return ts;
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        TokenStream ts = (TokenStream) getPreviousTokenStream();
        if (ts == null) {
            // first use on this thread: build and cache the full chain
            ts = tokenStream(null, reader);
            setPreviousTokenStream(ts);
        } else {
            // subsequent uses: reset the cached stream
            ts.reset();
        }
        return ts;
    }
}

=================================================
Attempt B:

public final class GeoNetworkAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new ASCIIFoldingFilter(new LowerCaseFilter(new WhitespaceTokenizer(reader)));
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
        if (tokenizer == null) {
            // first use on this thread: create and cache the tokenizer
            tokenizer = new WhitespaceTokenizer(reader);
            setPreviousTokenStream(tokenizer);
        } else {
            // subsequent uses: re-point the cached tokenizer at the new reader
            tokenizer.reset(reader);
        }
        return tokenizer;
    }
}

=================================================
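P.S. While reading the source of Lucene's own analyzers (e.g. StandardAnalyzer in 2.9/3.0), we noticed a pattern where both the Tokenizer and the fully filtered stream are cached together, so that reuse re-points the tokenizer at the new Reader while still returning the complete filter chain. Below is a minimal sketch of how we think our analyzer would look with that pattern; the SavedStreams holder class name is our own, and we are not certain this is the correct approach. Is it?

Attempt C (sketch):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public final class GeoNetworkAnalyzer extends Analyzer {

    // holder for the reusable tokenizer plus the filter chain built on top of it
    private static final class SavedStreams {
        Tokenizer source;    // the underlying WhitespaceTokenizer
        TokenStream result;  // source wrapped in LowerCaseFilter + ASCIIFoldingFilter
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // chain: whitespace tokenizing -> lowercasing -> accent folding
        return new ASCIIFoldingFilter(new LowerCaseFilter(new WhitespaceTokenizer(reader)));
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
        if (streams == null) {
            // first use on this thread: build the chain once and cache both ends of it
            streams = new SavedStreams();
            streams.source = new WhitespaceTokenizer(reader);
            streams.result = new ASCIIFoldingFilter(new LowerCaseFilter(streams.source));
            setPreviousTokenStream(streams);
        } else {
            // subsequent uses: re-point the cached tokenizer at the new reader
            streams.source.reset(reader);
        }
        // return the full filter chain, not just the tokenizer
        return streams.result;
    }
}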