On Fri, 3 Sep 2010, technology inspired wrote:
How can one define the list of stop words used by StandardAnalyzer? According to the Lucene Java API doc, a set containing the stop words should be passed to the constructor. I want to keep a few words like "The", "on", "off" from being skipped during indexing when using StandardAnalyzer. How would one call such a constructor in PyLucene?
Stop words can be passed to StandardAnalyzer via a Set instance. To do this you can either:

  - add java.util.HashSet to PyLucene's jcc invocation in Makefile, rebuild PyLucene and then use a HashSet instance (in the Makefile, look for java.util.Arrays and add java.util.HashSet below).

  - use the JavaSet class in the collections.py module that is installed with PyLucene.

The JavaSet class is a Python class that extends PythonSet, a Java class that implements the java.util.Set interface. JavaSet takes a set instance, wraps it and makes its elements accessible to Java via the java.util.Set interface.

For example:

    >>> from lucene import *
    >>> from lucene.collections import JavaSet
    >>> initVM()
    <jcc.JCCEnv object at 0x10040a0d8>
    >>> a=set(['foo', 'bar', 'baz'])
    >>> b=JavaSet(a)
    >>> b
    <JavaSet: org.apache.pylucene.util.python...@424ecfdd>
    >>> StandardAnalyzer(Version.LUCENE_CURRENT, b)
    <StandardAnalyzer: org.apache.lucene.analysis.standard.standardanaly...@4430d82d>

Andi..
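For the case in the question (keeping "The", "on" and "off" indexed), one possible approach is to start from the default English stop words and remove the words that should stay searchable. The following is only a sketch under some assumptions: the word list is a copy of what I believe Lucene's default English stop set contains and should be checked against your Lucene version, and build_stop_words and make_analyzer are hypothetical helper names, not part of PyLucene. The words are lowercased because StandardAnalyzer lowercases tokens before stop words are filtered.

```python
# Assumption: copied from Lucene's default English stop word set;
# verify against the Lucene version you actually run.
DEFAULT_ENGLISH_STOP_WORDS = frozenset([
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with",
])


def build_stop_words(keep, defaults=DEFAULT_ENGLISH_STOP_WORDS):
    """Return the default stop words minus the words that must stay indexed.

    Hypothetical helper, pure Python, usable without PyLucene.
    """
    keep_lower = set(word.lower() for word in keep)
    return set(defaults) - keep_lower


def make_analyzer(stop_words):
    """Wrap a Python set in JavaSet and pass it to StandardAnalyzer.

    Hypothetical helper. Requires lucene.initVM() to have been called.
    """
    # Imported here so build_stop_words() works without PyLucene installed.
    import lucene
    from lucene.collections import JavaSet

    return lucene.StandardAnalyzer(lucene.Version.LUCENE_CURRENT,
                                   JavaSet(stop_words))
```

After initVM(), something like make_analyzer(build_stop_words(["The", "on", "off"])) should give an analyzer that no longer drops those words. Passing an empty set instead would disable stop word filtering altogether.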