Hi, I am trying to understand why I cannot retrieve documents that I indexed with my custom ShinglesAnalyzer. The setup is as follows.

During indexing I do the following:

    PerFieldAnalyzerWrapper wrapper = DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper(Stopwords);
    writer = new IndexWriter(_lucenedir, new IndexWriterConfig(Version.LUCENE_32, wrapper));

where DocFieldAnalyzerWrapper returns an instance of PerFieldAnalyzerWrapper:

    public static PerFieldAnalyzerWrapper getDocFieldAnalyzerWrapper(HashSet<String> Stopwords) {
        // KeywordAnalyzer is the default for any field not listed below
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
        wrapper.addAnalyzer("title", new KeywordAnalyzer());
        wrapper.addAnalyzer("titleSynonyms", new KeywordAnalyzer());
        wrapper.addAnalyzer("date", new KeywordAnalyzer());
        wrapper.addAnalyzer("about", new KeywordAnalyzer());
        wrapper.addAnalyzer("titleAnalyzed", new StandardAnalyzer(Version.LUCENE_32, Stopwords));
        wrapper.addAnalyzer("content", new LimitTokenCountAnalyzer(
                new StandardAnalyzer(Version.LUCENE_32, Stopwords), Integer.MAX_VALUE));
        wrapper.addAnalyzer("contentForSpelling", new ShinglesAnalyzer(2, Stopwords));
        return wrapper;
    }

and the custom ShinglesAnalyzer is defined as follows:

    public class ShinglesAnalyzer extends Analyzer {
        private HashSet<String> Stopwords;
        private Integer shingleSize;

        public ShinglesAnalyzer(Integer shingleSize, HashSet<String> Stopwords) {
            this.shingleSize = shingleSize;
            this.Stopwords = Stopwords;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // tokenize -> standard filter -> lower-case -> remove stopwords -> shingles
            TokenStream filter = new ShingleFilter(
                    new StopFilter(Version.LUCENE_32,
                            new LowerCaseFilter(Version.LUCENE_32,
                                    new StandardFilter(Version.LUCENE_32,
                                            new StandardTokenizer(Version.LUCENE_32, reader))),
                            Stopwords),
                    shingleSize);
            return filter;
        }
    }

Then I index as follows. Note that every field is set to ANALYZED; the fields that should not be tokenized are mapped to KeywordAnalyzer in the wrapper above.

    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("titleAnalyzed", title, Field.Store.YES, Field.Index.ANALYZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));
    doc.add(new Field("titleSynonyms", pageSynonmy.toString(), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("about", article.getAbout().toString(), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("date", article.getDateCreated(), Field.Store.NO, Field.Index.ANALYZED));

    String content = article.getCleanContent();
    Field contentField = new Field("content", content, Field.Store.NO, Field.Index.ANALYZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS);
    doc.add(contentField);
    Field contentSpellingField = new Field("contentForSpelling", content, Field.Store.YES,
            Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
    doc.add(contentSpellingField);

Looking at the index with Luke, the field "contentForSpelling" contains both unigrams and bigrams (the shingle size is set to 2).

At search time, given a query q, which is a sentence provided by the user, I do the following:

    Analyzer analyzer = new ShinglesAnalyzer(2, Stopwords);
    QueryParser parser = new QueryParser(Version.LUCENE_32, "contentForSpelling", analyzer);
    Query query = parser.parse(q);
    TopDocs hits = searcher.search(query);

This is what I get when I run the analyzer directly on the query text:

    query: $13 for any of season package at Dallas
    ShinglesAnalyzer:
    1: [13:1->3:<NUM>] [13 _:1->15:shingle]
    2: [_ season:15->21:shingle]
    3: [season:15->21:<ALPHANUM>] [season package:15->29:shingle]
    4: [package:22->29:<ALPHANUM>] [package _:22->33:shingle]
    5: [_ dallas:33->39:shingle]
    6: [dallas:33->39:<ALPHANUM>]

But when I print the parsed query (query.toString()), it looks like this:

    analyzed query: contentForSpelling:13 contentForSpelling:season contentForSpelling:package contentForSpelling:dallas

This query looks wrong to me: it contains only the unigrams and none of the shingles that the analyzer produces.

Thank you,
Peyman
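
P.S. In case it helps to see exactly which terms I expect, here is a minimal plain-Java sketch (no Lucene involved) that reproduces the unigram and bigram list shown in the analyzer output above. It is only an approximation under my own assumptions: the class name ShingleSketch, the stopword list (for, any, of, at), the "_" filler, the collapsing of a run of stopwords into a single filler, and the dropping of leading and trailing stopwords are all choices made to match that output. It is not a description of how ShingleFilter is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: lists the unigrams and bigrams
// I would expect for a piece of text when the shingle size is 2.
final class ShingleSketch {
    static final String FILLER = "_";

    static List<String> shingles(String text, Set<String> stopwords) {
        // Step 1: lower-case, split on non-alphanumerics, and replace each
        // run of stopwords by a single filler slot (assumption, see above).
        List<String> slots = new ArrayList<String>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.isEmpty()) {
                continue; // split() yields "" when the text starts with a delimiter
            }
            if (stopwords.contains(tok)) {
                boolean lastIsFiller = !slots.isEmpty()
                        && slots.get(slots.size() - 1).equals(FILLER);
                if (!slots.isEmpty() && !lastIsFiller) {
                    slots.add(FILLER); // leading stopwords are simply dropped
                }
            } else {
                slots.add(tok);
            }
        }
        // Trailing stopwords are dropped as well (assumption).
        if (!slots.isEmpty() && slots.get(slots.size() - 1).equals(FILLER)) {
            slots.remove(slots.size() - 1);
        }

        // Step 2: emit each real token as a unigram, then the bigram that
        // starts at this slot, if there is a following slot.
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < slots.size(); i++) {
            String cur = slots.get(i);
            if (!cur.equals(FILLER)) {
                out.add(cur);
            }
            if (i + 1 < slots.size()) {
                out.add(cur + " " + slots.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("for", "any", "of", "at"));
        // prints [13, 13 _, _ season, season, season package, package, package _, _ dallas, dallas]
        System.out.println(shingles("$13 for any of season package at Dallas", stop));
    }
}
```

These nine terms are what I expected to find in the parsed query for "contentForSpelling", rather than only the four unigrams.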