Re: Field.setStringValue

Andi Vajda Wed, 09 Oct 2019 14:33:29 -0700


On Wed, 9 Oct 2019, Andi Vajda wrote:

On Wed, 9 Oct 2019, Marc Jeurissen wrote:
Good day to you,
I have the following issue when setting the value of a field when the value contains a character > 160 (PyLucene 8.1.1, Python 3.7.2)
...
(Pdb) field
<Field: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:>>
(Pdb) value = '«Volgende facturen werden verstuurd aan de financiëledienst.»'
(Pdb) type(value)
<class 'str'>
(Pdb) field.setStringValue(value)
(Pdb) field
<Field: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende facturen werden verstuurd aan de financiële dienst>>
The field value has lost 2 characters.

But when I encode value:

(Pdb) value = value.encode('utf-8')
(Pdb) value
b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
(Pdb) field.setStringValue(value)
(Pdb) field
<Field: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende facturen werden verstuurd aan de financiële dienst.»>>
The field value is correct.
So what does field.setStringValue expect: a string (as the Lucene documentation says) or a byte sequence?

Indeed, there is a problem. I was able to reproduce it with just StringBuffer, no Lucene involved at all:
from lucene import initVM
initVM()
b=b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
a=b.decode('utf-8')
from java.lang import StringBuffer
StringBuffer(b)
<StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst.»>
StringBuffer(a)
<StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst>
StringBuffer(a).length()
59
StringBuffer(b).length()
61
type(a)
<class 'str'>
type(b)
<class 'bytes'>
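
For reference, the two test values can also be inspected in plain Python, without the JVM. The sketch below only uses the byte string from the session above; the variable names char_count, byte_count and non_ascii are made up for illustration. It shows that the str has 61 characters (the length StringBuffer(b) reports) and that its only non-ASCII characters are the three with code points above 160.

```python
# Plain Python only; no lucene/JCC import is needed for this check.
b = b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
a = b.decode('utf-8')

# Number of characters in the str versus number of bytes in its UTF-8 encoding.
char_count = len(a)
byte_count = len(b)

# Position, character and code point of every non-ASCII character in the str.
non_ascii = [(i, ch, ord(ch)) for i, ch in enumerate(a) if ord(ch) > 127]

print(char_count, byte_count)
print(non_ascii)
```

This prints 61 and 64, followed by the three characters «, ë and » at positions 0, 50 and 60 with code points 171, 235 and 187. It says nothing about where the conversion goes wrong; it only pins down what the input looks like.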

There must be a bug in the Python 'str' -> Java 'String' conversion code.
Any Java API such as field.setStringValue() that expects a java.lang.String can be passed a 'str' or 'bytes'; JCC auto-converts as needed. This is very likely where the bug is.
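
Until the conversion is fixed, the behaviour reported above suggests a possible workaround: pass UTF-8 encoded bytes instead of a str. A minimal sketch, assuming the bytes path keeps working as it did in the sessions above; to_utf8_bytes is a hypothetical helper name, not part of PyLucene or JCC, and the actual setStringValue call is left as a comment so the snippet runs without a JVM.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes; bytes are passed through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode('utf-8')


value = '«Volgende facturen werden verstuurd aan de financiëledienst.»'
encoded = to_utf8_bytes(value)

# With PyLucene loaded, the call from the original report would then be:
#     field.setStringValue(encoded)
print(encoded)
```

Calling the helper on a value that is already bytes returns it as is, so it can be applied unconditionally before any call that expects a java.lang.String.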

Digging a bit further, it doesn't seem to be a problem when using Python 2. I'm not implying this is a Python bug; strings are just very different between Python 2 and 3.


Andi..


Andi..


Thank you very much.


Met vriendelijke groeten,
Marc Jeurissen

Bibliotheek UAntwerpen
Stadscampus - Ve35.303
Venusstraat 35 - 2000 Antwerpen
marc.jeuris...@uantwerpen.be
T +32 3 265 49 71
