Traceback (most recent call last):
  File "C:\index1.py", line 94, in <module>
    IndexFiles(sys.argv[1], os.path.join(base_dir, INDEX_DIR),
EnglishLemmaAnalyzer("english-bidirectional-distsim.tagger"))
  File "C:\index1.py", line 48, in __init__
    self.indexDocs(root, writer)
  File "C:\index1.py", line 81, in indexDocs
    writer.addDocument(doc)
JavaError: org.apache.jcc.PythonException: ('while calling', 'tokenStream',
<class '__main__.EnglishLemmaTokenizer'>)
TypeError: ('while calling', 'tokenStream', <class
'__main__.EnglishLemmaTokenizer'>)

    Java stacktrace:
org.apache.jcc.PythonException: ('while calling', 'tokenStream', <class
'__main__.EnglishLemmaTokenizer'>)
TypeError: ('while calling', 'tokenStream', <class
'__main__.EnglishLemmaTokenizer'>)


    at org.apache.pylucene.analysis.PythonAnalyzer.tokenStream(Native
Method)

    at
org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:80)

    at
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:137)

    at
org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)

    at
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)

    at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2060)

    at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2034)


Then I tried to change the return object and runit as index2.py, again I
have the following errors:


Traceback (most recent call last):
  File "C:\newIndexfiles.py", line 94, in <module>
    IndexFiles(sys.argv[1], os.path.join(base_dir, INDEX_DIR),
EnglishLemmaAnalyzer("english-bidirectional-distsim.tagger"))
  File "C:\newIndexfiles.py", line 48, in __init__
    self.indexDocs(root, writer)
  File "C:\newIndexfiles.py", line 81, in indexDocs
    writer.addDocument(doc)
JavaError: java.lang.NullPointerException
    Java stacktrace:
java.lang.NullPointerException

    at
org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:141)

    at
org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)

    at
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)

    at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2060)

    at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2034)


I cannot figure out the issues here. Thanks



On Sat, Oct 18, 2014 at 10:11 PM, Alexander Alex <
greatalexander4r...@gmail.com> wrote:

> Thanks Andi. am going to try these suggestions out.
>
> On Sat, Oct 18, 2014 at 9:55 PM, Andi Vajda <va...@apache.org> wrote:
>
>>
>> On Sat, 18 Oct 2014, Alexander Alex wrote:
>>
>>  The init file in the pylucene egg. Below is it:
>>>
>>>
>>> import os, sys
>>>
>>> if sys.platform == 'win32':
>>>  from jcc.windows import add_jvm_dll_directory_to_path
>>>  add_jvm_dll_directory_to_path()
>>>  import jcc, _lucene
>>> else:
>>>  import _lucene
>>>
>>> __dir__ = os.path.abspath(os.path.dirname(__file__))
>>>
>>> class JavaError(Exception):
>>>  def getJavaException(self):
>>>    return self.args[0]
>>>  def __str__(self):
>>>    writer = StringWriter()
>>>    self.getJavaException().printStackTrace(PrintWriter(writer))
>>>    return "\n".join((super(JavaError, self).__str__(), "    Java
>>> stacktrace:", str(writer)))
>>>
>>> class InvalidArgsError(Exception):
>>>  pass
>>>
>>> _lucene._set_exception_types(JavaError, InvalidArgsError)
>>>
>>> VERSION = "3.6.2"
>>> CLASSPATH = [os.path.join(__dir__, "lucene-core-3.6.2.jar"),
>>> os.path.join(__dir__, "lucene-analyzers-3.6.2.jar"),
>>> os.path.join(__dir__,
>>> "lucene-memory-3.6.2.jar"), os.path.join(__dir__,
>>> "lucene-highlighter-3.6.2.jar"), os.path.join(__dir__,
>>> "extensions.jar"),
>>> os.path.join(__dir__, "lucene-queries-3.6.2.jar"), os.path.join(__dir__,
>>> "lucene-grouping-3.6.2.jar"), os.path.join(__dir__,
>>> "lucene-join-3.6.2.jar"), os.path.join(__dir__,
>>> "lucene-facet-3.6.2.jar"),
>>> os.path.join(__dir__, "lucene-spellchecker-3.6.2.jar")]
>>> CLASSPATH = os.pathsep.join(CLASSPATH)
>>> _lucene.CLASSPATH = CLASSPATH
>>> _lucene._set_function_self(_lucene.initVM, _lucene)
>>>
>>> from _lucene import *
>>>
>>
>> Thanks. This looks like the vanilla __init__.py file in the pylucene egg.
>> I see no modifications from you for, I quote "path of the dependencies to
>> classpath in the init.py file".
>>
>> To be sure there is no misunderstanding here, this is what I understand
>> from you so far:
>>   - you downloaded, built and installed PyLucene 3.6.2
>>     (with what Python version and what Java version ?)
>>   - you then compiled a new class and added it to two JAR files,
>>     lucene-core-3.6.2.jar and lucene-analyzers-3.6.2.jar
>>     (with that Java version ?, why did you modify two JAR files ?
>>      why not create your own JAR file with your extra stuff ?)
>>   - you then edited __init__.py to reflect this change but I don't see
>>     any change in the file you pasted nor why the change is needed if you
>>     just modified existing JAR files (in the right location, inside the
>>     PyLucene egg, right ?)
>>   - you did not rebuild PyLucene itself after making any of these changes
>>
>> If this mental picture is correct then this is not the right way to go
>> about it. The proper way to modify Lucene Core and then PyLucene is to:
>>   - compile and build your new classes using the same version of Java (and
>>     Lucene)
>>   - create a new JAR file containing your extra stuff
>>   - test that it all works with a simple Java program that uses Lucene
>> core
>>     and your new code together
>>   - _then_ rebuild PyLucene including your new JAR file either by:
>>      - adding it to the list of JAR files being wrapped by JCC via --jar
>>        in the PyLucene Makefile
>>      - OR pass it to JCC via --include instead so that it just becomes
>> part
>>        of the new PyLucene egg (ensuring it being inside the egg and on
>> the
>>        classpath but no Python wrappers for it are generated)
>>
>> To get command line argument help from JCC run python -m jcc --help (or
>> whatever the correct invocation is for your version of Python).
>>
>> Andi..
>>
>>
>>  On Sat, Oct 18, 2014 at 12:29 AM, Andi Vajda <va...@apache.org> wrote:
>>>
>>>
>>>> On Sat, 18 Oct 2014, Alexander Alex wrote:
>>>>
>>>>  ok. I built the class files for the java files attached herein, add
>>>> them
>>>>
>>>>> to
>>>>> lucene-core-3.6.2.jar at org.apache.lucene.analysis and
>>>>> lucene-analyzers-3.6.2.jar at org.apache.lucene.analysis. I then added
>>>>> the
>>>>> path of the dependencies to classpath in the init.py file.
>>>>>
>>>>>
>>>> What init.py file ?
>>>> Can you paste the contents of that file here, please ?
>>>>
>>>> Andi..
>>>>
>>>>
>>>>  I ran the
>>>>
>>>>> typical index file using this customized analyzer through
>>>>> PythonAnalyzer
>>>>> and got the above error. Meanwhile, I had earlier ran the index file
>>>>> using
>>>>> standard analyzer before adding the classes and it worked. After
>>>>> running
>>>>> the index file with the customized analyzer failed, I tried again with
>>>>> the
>>>>> standard analyzer which had earlier worked before adding the classes
>>>>> but
>>>>> failed this time around with same error message as above. I guess the
>>>>> problem has to do with array compatibility in java and python but I
>>>>> don't
>>>>> really know. Thanks.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 17, 2014 at 7:23 PM, Andi Vajda <va...@apache.org> wrote:
>>>>>
>>>>>
>>>>>  On Fri, 17 Oct 2014, Alexander Alex wrote:
>>>>>>
>>>>>>  Meanwhile, am using lucene 3.6.2 version. The problem is jvm
>>>>>> instantiation
>>>>>>
>>>>>>  from any python code using lucene caused as a result of the classes I
>>>>>>> added
>>>>>>> to lucene core.
>>>>>>>
>>>>>>> ---------- Forwarded message ----------
>>>>>>>
>>>>>>> I added a customized lucene analyzer class to lucene core in
>>>>>>> Pylucene.
>>>>>>>
>>>>>>>
>>>>>>>  Please explain in _detail_ the steps you followed to accomplish
>>>>>> this.
>>>>>> A log of all the commands you ran would be ideal.
>>>>>>
>>>>>> Thanks !
>>>>>>
>>>>>> Andi..
>>>>>>
>>>>>>
>>>>>>  This class is google guava as a dependency because of the array
>>>>>> handling
>>>>>>
>>>>>>  function available in com.google.common.collect.Iterables in guava.
>>>>>>> When
>>>>>>> I tried to index using this analyzer, I got the following error:
>>>>>>>
>>>>>>> Traceback (most recent call last): File "C:\IndexFiles.py", line 78,
>>>>>>> in
>>>>>>> lucene.initVM() JavaError: java.lang.NoClassDefFoundError:
>>>>>>> org/apache/lucene/analysis/CharArraySet Java stacktrace:
>>>>>>> java.lang.NoClassDefFoundError: org/apache/lucene/analysis/
>>>>>>> CharArraySet
>>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>>> org.apache.lucene.analysis.CharArraySet at
>>>>>>> java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
>>>>>>> java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
>>>>>>> java.security.AccessController.doPrivileged(Native Method) at
>>>>>>> java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
>>>>>>> java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
>>>>>>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at
>>>>>>> java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>>>>>
>>>>>>> Even the example indexing code in Lucene in Action that I tried
>>>>>>> earlier
>>>>>>> and
>>>>>>> worked, when I retried it after adding this class is returning the
>>>>>>> same
>>>>>>> error above. Am not too familiar with CharArraySet class as I can see
>>>>>>> the
>>>>>>> problem is from it. How do i handle this? Attached is the java files
>>>>>>> whose
>>>>>>> class were added to lucene core in pylucene. Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
>
#!/usr/bin/env python
INDEX_DIR = "index"
import sys, os, lucene, threading, time
from datetime import datetime
from lucene import*

"""
This class is loosely based on the Lucene (java implementation) demo class 
org.apache.lucene.demo.IndexFiles.  It will take a directory as an argument
and will index all of the files in that directory and downward recursively.
It will index on the file path, the file name and the file contents.  The
resulting Lucene index will be placed in the current directory and called
'index'.
"""

class EnglishLemmaAnalyzer(PythonAnalyzer):
    def tokenStream(self, fieldName, reader):
        class EnglishLemmaTokenizer(PythonTokenStream):
            def __init__(self):
                super(EnglishLemmaTokenizer, self_).__init__()
                result = EnglishLemmaTokenizer(reader)
                result = LowerCaseFilter(result)
        return EnglishLemmaTokenizer
    
class Ticker(object):

    def __init__(self):
        self.tick = True

    def run(self):
        while self.tick:
            sys.stdout.write('.')
            sys.stdout.flush()
            time.sleep(1.0)

class IndexFiles(object):
    """Usage: python IndexFiles <doc_directory>"""

    def __init__(self, root, storeDir, analyzer):

        if not os.path.exists(storeDir):
            os.mkdir(storeDir)
        store = lucene.SimpleFSDirectory(lucene.File(storeDir))
        analyzer = EnglishLemmaAnalyzer("english-bidirectional-distsim.tagger")
        writer = lucene.IndexWriter(store, analyzer, True,
                                    lucene.IndexWriter.MaxFieldLength.UNLIMITED)
        writer.MaxFieldLength #writer.setMaxFieldLength(1048576)
        self.indexDocs(root, writer)
        ticker = Ticker()
        print 'optimizing index',
        threading.Thread(target=ticker.run).start()
        writer.optimize()
        writer.close()
        ticker.tick = False
        print 'done'

    def indexDocs(self, root, writer):
        for root, dirnames, filenames in os.walk(root):
            for filename in filenames:
                if not filename.endswith('.txt'):
                    continue
                print "adding", filename
                try:
                    path = os.path.join(root, filename)
                    file = open(path)
                    contents = unicode(file.read(), 'iso-8859-1')
                    file.close()
                    doc = lucene.Document()
                    doc.add(lucene.Field("name", filename,
                                         lucene.Field.Store.YES,
                                         lucene.Field.Index.NOT_ANALYZED))
                    doc.add(lucene.Field("path", path,
                                         lucene.Field.Store.YES,
                                         lucene.Field.Index.NOT_ANALYZED))
                    if len(contents) > 0:
                        doc.add(lucene.Field("contents", contents,
                                             lucene.Field.Store.NO,
                                             lucene.Field.Index.ANALYZED))
                    else:
                        print "warning: no content in %s" % filename
                    writer.addDocument(doc)
                except Exception, e:
                    print "Failed in indexDocs:", e

if __name__ == '__main__':
    sys.argv=['IndexFiles.py','C:/kk']
    print IndexFiles.__doc__
    #sys.exit(1)
    lucene.initVM()
    print 'lucene', lucene.VERSION
    start = datetime.now()
    try:
        base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
        IndexFiles(sys.argv[1], os.path.join(base_dir, INDEX_DIR), EnglishLemmaAnalyzer("english-bidirectional-distsim.tagger"))
        end = datetime.now()
        print end - start
    except Exception, e:
        print "Failed: ", e
#!/usr/bin/env python
INDEX_DIR = "index"
import sys, os, lucene, threading, time
from datetime import datetime
from lucene import*

"""
This class is loosely based on the Lucene (java implementation) demo class 
org.apache.lucene.demo.IndexFiles.  It will take a directory as an argument
and will index all of the files in that directory and downward recursively.
It will index on the file path, the file name and the file contents.  The
resulting Lucene index will be placed in the current directory and called
'index'.
"""

class EnglishLemmaAnalyzer(PythonAnalyzer):
    def tokenStream(self, fieldName, reader):
        class EnglishLemmaTokenizer(PythonTokenStream):
            def __init__(self):
                super(EnglishLemmaTokenizer, self_).__init__()
                result = EnglishLemmaTokenizer(reader)
                result = LowerCaseFilter(result)
                return result
    
class Ticker(object):

    def __init__(self):
        self.tick = True

    def run(self):
        while self.tick:
            sys.stdout.write('.')
            sys.stdout.flush()
            time.sleep(1.0)

class IndexFiles(object):
    """Usage: python IndexFiles <doc_directory>"""

    def __init__(self, root, storeDir, analyzer):

        if not os.path.exists(storeDir):
            os.mkdir(storeDir)
        store = lucene.SimpleFSDirectory(lucene.File(storeDir))
        analyzer = EnglishLemmaAnalyzer("english-bidirectional-distsim.tagger")
        writer = lucene.IndexWriter(store, analyzer, True,
                                    lucene.IndexWriter.MaxFieldLength.UNLIMITED)
        writer.MaxFieldLength #writer.setMaxFieldLength(1048576)
        self.indexDocs(root, writer)
        ticker = Ticker()
        print 'optimizing index',
        threading.Thread(target=ticker.run).start()
        writer.optimize()
        writer.close()
        ticker.tick = False
        print 'done'

    def indexDocs(self, root, writer):
        for root, dirnames, filenames in os.walk(root):
            for filename in filenames:
                if not filename.endswith('.txt'):
                    continue
                print "adding", filename
                try:
                    path = os.path.join(root, filename)
                    file = open(path)
                    contents = unicode(file.read(), 'iso-8859-1')
                    file.close()
                    doc = lucene.Document()
                    doc.add(lucene.Field("name", filename,
                                         lucene.Field.Store.YES,
                                         lucene.Field.Index.NOT_ANALYZED))
                    doc.add(lucene.Field("path", path,
                                         lucene.Field.Store.YES,
                                         lucene.Field.Index.NOT_ANALYZED))
                    if len(contents) > 0:
                        doc.add(lucene.Field("contents", contents,
                                             lucene.Field.Store.NO,
                                             lucene.Field.Index.ANALYZED))
                    else:
                        print "warning: no content in %s" % filename
                    writer.addDocument(doc)
                except Exception, e:
                    print "Failed in indexDocs:", e

if __name__ == '__main__':
    sys.argv=['IndexFiles.py','C:/kk']
    print IndexFiles.__doc__
    #sys.exit(1)
    lucene.initVM()
    print 'lucene', lucene.VERSION
    start = datetime.now()
    try:
        base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
        IndexFiles(sys.argv[1], os.path.join(base_dir, INDEX_DIR), EnglishLemmaAnalyzer("english-bidirectional-distsim.tagger"))
        end = datetime.now()
        print end - start
    except Exception, e:
        print "Failed: ", e

Reply via email to