Hi All,

I am pretty new to Lucene and PyLucene. I ran into a problem while using
PyLucene to write a customized analyzer that tokenizes text into bigrams.

The code for the analyzer class is:

from lucene import PythonAnalyzer, ShingleFilter, LowerCaseTokenizer, Version

class BiGramShingleAnalyzer(PythonAnalyzer):

    def __init__(self, outputUnigrams=False):
        PythonAnalyzer.__init__(self)
        self.outputUnigrams = outputUnigrams

    def tokenStream(self, field, reader):
        # wrap a lower-casing tokenizer in a ShingleFilter to emit bigrams
        result = ShingleFilter(LowerCaseTokenizer(Version.LUCENE_35, reader))
        result.setOutputUnigrams(self.outputUnigrams)
        #print 'result is', result
        return result


I used ShingleFilter on the TokenStream produced by LowerCaseTokenizer. When I
call the tokenStream function directly, it works just fine:
str = 'divide this sentence'
bi = BiGramShingleAnalyzer(False)
sf = bi.tokenStream('f', StringReader(str))
while sf.incrementToken():
    print sf
(divide this,startOffset=0,endOffset=11,positionIncrement=1,type=shingle)
(this sentence,startOffset=7,endOffset=20,positionIncrement=1,type=shingle)

The analyzer also works fine with QueryTermVector.
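
For reference, the QueryTermVector check I mean is roughly the following
sketch; it assumes PyLucene 3.5's flat lucene module and the
QueryTermVector(queryString, analyzer) constructor:

from lucene import QueryTermVector

qtv = QueryTermVector('divide this sentence', bi)
print qtv.getTerms()   # prints the bigrams, e.g. ['divide this', 'this sentence']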

But when I tried to build a query parser using this analyzer, a problem occurred:

parser = QueryParser(Version.LUCENE_35, 'f', bi)
query = parser.parse(str)

The resulting query is empty.

After adding a print statement to the tokenStream function, I found that when I
call parser.parse(str), the print statement in tokenStream actually gets called
3 times (once for each of the 3 words in my str variable). It seems the parser
pre-processes the str I pass to it and calls the tokenStream function
separately on each piece of the pre-processed result.
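
If that is right, the behaviour would be roughly equivalent to this
(an illustrative sketch only, not the actual QueryParser code):

for chunk in str.split():                        # 'divide', 'this', 'sentence'
    ts = bi.tokenStream('f', StringReader(chunk))
    while ts.incrementToken():                   # each chunk is a single word, so
        print ts                                 # ShingleFilter emits nothing when
                                                 # outputUnigrams is False

which would explain why the resulting query ends up empty.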

Any thoughts on how I should make the analyzer work, so that when I pass it to
the query parser, the parser can parse a string into bigrams?

Thanks in advance!

Ke Wu
