Hi all, I am pretty new to Lucene and PyLucene. I ran into a problem while using PyLucene to write a custom analyzer that tokenizes text into bigrams.
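To make the goal concrete, here is a minimal pure-Python sketch of the kind of output I mean by bigrams. It is not Lucene code: word_bigrams is a hypothetical helper written only for illustration, and it assumes that tokens are lowercased runs of letters.

```python
def word_bigrams(text):
    """Return adjacent word pairs (hypothetical helper, not a Lucene API)."""
    # Assumption: a token is a maximal run of letters, lowercased.
    words = []
    current = []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    # Pair every word with its right-hand neighbour.
    return [" ".join(pair) for pair in zip(words, words[1:])]


print(word_bigrams("divide this sentence"))  # ['divide this', 'this sentence']
print(word_bigrams("divide"))                # [] (one word has no neighbour)
```

Note that a single word on its own produces no bigram at all, since there is nothing to pair it with.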
The code for the analyzer class is:

    class BiGramShingleAnalyzer(PythonAnalyzer):
        def __init__(self, outputUnigrams=False):
            PythonAnalyzer.__init__(self)
            self.outputUnigrams = outputUnigrams

        def tokenStream(self, field, reader):
            result = ShingleFilter(LowerCaseTokenizer(Version.LUCENE_35, reader))
            result.setOutputUnigrams(self.outputUnigrams)
            #print 'result is', result
            return result

I used ShingleFilter on the TokenStream produced by LowerCaseTokenizer. When I call the tokenStream function directly, it works just fine:

    str = 'divide this sentence'
    bi = BiGramShingleAnalyzer(False)
    sf = bi.tokenStream('f', StringReader(str))
    while sf.incrementToken():
        print sf

    (divide this,startOffset=0,endOffset=11,positionIncrement=1,type=shingle)
    (this sentence,startOffset=7,endOffset=20,positionIncrement=1,type=shingle)

The analyzer also works fine with QueryTermVector. But when I tried to build a query parser using this analyzer, a problem occurred:

    parser = QueryParser(Version.LUCENE_35, 'f', bi)
    query = parser.parse(str)

The resulting query is empty. After I added a print statement to the tokenStream function, I found that when I call parser.parse(str), the print statement in tokenStream actually gets called 3 times (there are 3 words in my str variable). It seems to me that the parser pre-processes the string I pass to it and then calls the tokenStream function on the result of that pre-processing.

Any thoughts on how I should make the analyzer work, so that when I pass it to the query parser, the parser can parse a string into bigrams?

Thanks in advance!
Ke Wu