On Thu, 8 Sep 2016, Dirk Rothe wrote:
On 08.09.2016, 11:10, Andi Vajda <va...@apache.org> wrote:
On Thu, 8 Sep 2016, Dirk Rothe wrote:
On 05.09.2016, 21:27, Andi Vajda <va...@apache.org> wrote:
class _Tokenizer(PythonTokenizer):
    def __init__(self, INPUT):
        super(_Tokenizer, self).__init__(INPUT)
        # prepare INPUT

    def incrementToken(self):
        # stuff into termAtt/offsetAtt/posIncrAtt

class Analyzer6(PythonAnalyzer):
    def createComponents(self, fieldName):
        return Analyzer.TokenStreamComponents(_Tokenizer())
The PositionIncrementTestCase is pretty similar, but it is initialized with
static input. It would be a nice place for an example with dynamic input, I
think.
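For reference, the attribute plumbing behind those "stuff into
termAtt/offsetAtt/posIncrAtt" placeholders could look something like this
(just a sketch; the token tuple layout and the names are made up for
illustration):

from org.apache.pylucene.analysis import PythonTokenizer
from org.apache.lucene.analysis.tokenattributes import \
    CharTermAttribute, OffsetAttribute, PositionIncrementAttribute

class _SketchTokenizer(PythonTokenizer):
    def __init__(self, tokens):
        super(_SketchTokenizer, self).__init__()
        # tokens: list of (term, startOffset, endOffset, posIncr) tuples
        self.tokens = tokens
        self.i = 0
        self.termAtt = self.addAttribute(CharTermAttribute.class_)
        self.offsetAtt = self.addAttribute(OffsetAttribute.class_)
        self.posIncrAtt = self.addAttribute(PositionIncrementAttribute.class_)

    def incrementToken(self):
        if self.i == len(self.tokens):
            return False
        term, start, end, posIncr = self.tokens[self.i]
        self.clearAttributes()
        self.termAtt.append(term)
        self.offsetAtt.setOffset(start, end)
        self.posIncrAtt.setPositionIncrement(posIncr)
        self.i += 1
        return True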
This was our 3.6 approach:
class Analyzer3(PythonAnalyzer):
    def tokenStream(self, fieldName, reader):
        data = data_from_reader(reader)

        class _tokenStream(PythonTokenStream):
            def __init__(self):
                super(_tokenStream, self).__init__()
                # prepare termAtt/offsetAtt/posIncrAtt

            def incrementToken(self):
                # stuff from data into termAtt/offsetAtt/posIncrAtt

        return _tokenStream()
Any hints on how to get Analyzer6 working?
I've lost track of the countless API changes since 3.x.
The Lucene project does a good job of tracking them in the CHANGES.txt
file, usually pointing at the issue that tracked each change, often with
examples of how to accomplish the same thing the new way and the rationale
behind the change.
I guess we are here:
https://issues.apache.org/jira/browse/LUCENE-5388
https://svn.apache.org/viewvc?view=revision&revision=1556801
You can also look at the PyLucene tests I just ported to 6.x. For example,
in test_Analyzers.py, you can see that Tokenizer no longer takes a reader;
one is set with setReader() after construction.
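For example, the 6.x construction pattern looks roughly like this (a
sketch; WhitespaceTokenizer just stands in for any tokenizer):

from java.io import StringReader
from org.apache.lucene.analysis.core import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()                # no reader at construction time
tokenizer.setReader(StringReader("some input"))  # set it afterwards
tokenizer.reset()
while tokenizer.incrementToken():
    pass  # consume the token via its attributes
tokenizer.end()
tokenizer.close()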
Yes, I've done that pretty carefully. I think this quote points in the right
direction: "The tokenStream method takes a String or Reader and will pass
this to Tokenizer#setReader()."
from:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201502.mbox/%3C021701d04f86$55331f10$ff995d30$@thetaphi.de%3E
I've checked the Lucene source, and this happens automatically and cannot be
overridden (Analyzer.tokenStream() is final).
So I've hacked something ugly together which seems to work.
class _Tokenizer(PythonTokenizer):
    def __init__(self, getReader):
        super(_Tokenizer, self).__init__()
        self.getReader = getReader
        self.i = 0
        self.data = []

    def incrementToken(self):
        if self.i == 0:
            self.data = data_from_reader(self.getReader())
        if self.i == len(self.data):
            # we are reused - reset
            self.i = 0
            return False
        # stuff from self.data into termAtt/offsetAtt/posIncrAtt
        self.i += 1
        return True

class Analyzer6(PythonAnalyzer):
    def createComponents(self, fieldName):
        return Analyzer.TokenStreamComponents(
            _Tokenizer(lambda: self._reader))

    def initReader(self, fieldName, reader):
        # capture reader
        self._reader = reader
        return reader
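And this is roughly how I drive it (a sketch; the field name and input are
placeholders, and data_from_reader is our own helper):

from java.io import StringReader
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

analyzer = Analyzer6()
# tokenStream() calls initReader() and then setReader() internally,
# so _reader is captured before _Tokenizer pulls data from it
stream = analyzer.tokenStream("content", StringReader("some input"))
termAtt = stream.addAttribute(CharTermAttribute.class_)
stream.reset()
while stream.incrementToken():
    print(termAtt.toString())
stream.end()
stream.close()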
I've made initReader() python-overridable (see patch). What do you think?
Not sure what to think. While your change looks fine, if Lucene decided to
make this 'hard', it may be a sign that you're doing something wrong or
going about it the wrong way.
I suggest you ask on the java-u...@lucene.apache.org list as you're probably
not the first one to transition from 3.x to something more recent.
Please let pylucene-dev@ know what you find out...
Andi..
--dirk