Re: Stopwords in StandardAnalyzer; Constructor

Andi Vajda Fri, 03 Sep 2010 10:10:33 -0700


On Fri, 3 Sep 2010, technology inspired wrote:

How can one define the list of stop words used by StandardAnalyzer?
According to the Lucene Java API doc, a set passed to the constructor
defines the list of stop words. I want to keep a few words such as "The",
"on" and "off" from being skipped at indexing time when using StandardAnalyzer.

How would one call such a constructor in PyLucene?


Stop words can be passed to StandardAnalyzer via a Set instance.
To do this you can either:

  - add java.util.HashSet to PyLucene's jcc invocation in the Makefile,
    rebuild PyLucene and then use a HashSet instance (in the Makefile, look
    for java.util.Arrays and add java.util.HashSet below it).

  - use the JavaSet class in the collections.py module that is installed
    with PyLucene. The JavaSet class is a Python class that extends
    PythonSet, a Java class that implements the java.util.Set interface.
    JavaSet takes a set instance, wraps it and makes its elements
    accessible to Java via the java.util.Set interface.
    For example:
        >>> from lucene import *
        >>> from lucene.collections import JavaSet
        >>> initVM()
        <jcc.JCCEnv object at 0x10040a0d8>
        >>> a=set(['foo', 'bar', 'baz'])
        >>> b=JavaSet(a)
        >>> b
        <JavaSet: org.apache.pylucene.util.python...@424ecfdd>
        >>> StandardAnalyzer(Version.LUCENE_CURRENT, b)
        <StandardAnalyzer: org.apache.lucene.analysis.standard.standardanaly...@4430d82d>
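
To keep words such as "The", "on" and "off" in the index, one approach is to
start from the default English stop words, drop the words to keep, and pass
the result through JavaSet as above. What follows is only a sketch under some
assumptions: the default list is copied by hand from Lucene 3.x's
StopAnalyzer.ENGLISH_STOP_WORDS_SET and should be checked against your
version, and custom_stop_words and make_analyzer are hypothetical helper
names, not part of PyLucene. Words are lowercased because StandardAnalyzer
lowercases tokens before it filters stop words.

```python
# Assumed copy of Lucene 3.x's default English stop words; verify it
# against StopAnalyzer.ENGLISH_STOP_WORDS_SET in your Lucene version.
DEFAULT_ENGLISH_STOP_WORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with",
]


def custom_stop_words(keep, defaults=DEFAULT_ENGLISH_STOP_WORDS):
    """Return the default stop words minus the words that must stay indexed."""
    # Stop words are matched against lowercased tokens, so "The" means "the".
    keep_lower = set(word.lower() for word in keep)
    return set(defaults) - keep_lower


def make_analyzer(stop_words):
    """Hypothetical helper: build a StandardAnalyzer from a Python set.

    Needs PyLucene; initVM() must already have been called by the caller.
    """
    from lucene import StandardAnalyzer, Version
    from lucene.collections import JavaSet

    return StandardAnalyzer(Version.LUCENE_CURRENT, JavaSet(stop_words))


# "off" is not in the default list, so only "the" and "on" are removed.
stop_words = custom_stop_words(["The", "on", "off"])
```

After initVM(), make_analyzer(stop_words) should give an analyzer that no
longer drops "the" and "on". Passing an empty set instead would disable stop
word removal altogether.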

Andi..
