It's fixed now in JCC's trunk.

Andi..
> On Oct 10, 2019, at 05:18, Marc Jeurissen <marc.jeuris...@uantwerpen.be> wrote:
>
> Ok thank you Andi.
> I’ll use the workaround with the bytes for the moment.
> Hope it will get solved soon though.
>
> Kind regards,
> Marc Jeurissen
>
> Bibliotheek UAntwerpen
> Stadscampus – Ve35.303
> Venusstraat 35 – 2000 Antwerpen
> marc.jeuris...@uantwerpen.be
> T +32 3 265 49 71
>
> From: Andi Vajda
> Sent: Wednesday, 9 October 2019 23:33
> To: Andi Vajda
> Cc: pylucene-dev@lucene.apache.org
> Subject: Re: Field.setStringValue
>
> On Wed, 9 Oct 2019, Andi Vajda wrote:
>
> > On Wed, 9 Oct 2019, Marc Jeurissen wrote:
> >
> >> Good day to you,
> >>
> >> I have the following issue when setting the value of a field to a
> >> string containing a character > 160 (PyLucene 8.1.1, Python 3.7.2):
> >>
> >> ...
> >> (Pdb) field
> >> <Field: stored,indexed,tokenized,omitNorms indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:>>
> >> (Pdb) value = '«Volgende facturen werden verstuurd aan de financiële dienst.»'
> >> (Pdb) type(value)
> >> <class 'str'>
> >> (Pdb) field.setStringValue(value)
> >> (Pdb) field
> >> <Field: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende facturen werden verstuurd aan de financiële dienst>>
> >>
> >> The field value has lost 2 characters.
> >>
> >> But when I encode value:
> >>
> >> (Pdb) value = value.encode('utf-8')
> >> (Pdb) value
> >> b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xable dienst.\xc2\xbb'
> >>
> >> (Pdb) field.setStringValue(value)
> >> (Pdb) field
> >> <Field: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende facturen werden verstuurd aan de financiële dienst.»>>
> >>
> >> The field value is correct.
> >>
> >> So what does field.setStringValue expect: a string (as the Lucene
> >> documentation says) or a byte sequence?
> >
> > Indeed, there is a problem. I was able to reproduce it with just
> > StringBuffer, no Lucene involved at all:
> >
> > >>> from lucene import initVM
> > >>> initVM()
> > >>> b=b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
> > >>> a=b.decode('utf-8')
> > >>> from java.lang import StringBuffer
> > >>> StringBuffer(b)
> > <StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst.»>
> > >>> StringBuffer(a)
> > <StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst>
> > >>> StringBuffer(a).length()
> > 59
> > >>> StringBuffer(b).length()
> > 61
> > >>> type(a)
> > <class 'str'>
> > >>> type(b)
> > <class 'bytes'>
> >
> > There must be a bug in the Python 'str' -> Java 'String' conversion code.
> > Any Java API such as field.setStringValue() that expects a
> > java.lang.String can be passed a 'str' or 'bytes'; JCC auto-converts as
> > needed. This is very likely where the bug is.
>
> Digging a bit further, it doesn't seem to be a problem when using Python 2.
> I'm not implying this is a Python bug; strings are just very different
> between Python 2 and 3.
>
> Andi..
>
> >
> > Andi..
> >
> >>
> >> Thank you very much.
> >>
> >> Kind regards,
> >> Marc Jeurissen
> >>
> >> Bibliotheek UAntwerpen
> >> Stadscampus – Ve35.303
> >> Venusstraat 35 – 2000 Antwerpen
> >> marc.jeuris...@uantwerpen.be
> >> T +32 3 265 49 71
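
For anyone still on a JCC build without the fix, the interim workaround from the thread above (passing UTF-8 bytes instead of a str) might be sketched roughly as follows. The helper name as_utf8_bytes is hypothetical and is not part of PyLucene or JCC. The sketch runs without a JVM and only shows the encoding step and the expected lengths; the actual setStringValue() call is left as a comment because it needs PyLucene and initVM().

```python
def as_utf8_bytes(value):
    """Hypothetical helper (not a PyLucene/JCC API): encode a str as UTF-8
    bytes and pass bytes through unchanged."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


# The string from the StringBuffer reproduction, written with escapes so the
# example does not depend on the source file's encoding.
text = ("\u00abVolgende facturen werden verstuurd aan de "
        "financi\u00ebledienst.\u00bb")
payload = as_utf8_bytes(text)

# Three characters are non-ASCII and each takes two bytes in UTF-8, so the
# byte length exceeds the character length by three.
non_ascii = sum(1 for ch in text if ord(ch) > 127)
print(len(text), len(payload), non_ascii)  # prints: 61 64 3

# With PyLucene installed and initVM() already called, the workaround from
# the thread would then read:
#     field.setStringValue(as_utf8_bytes(value))
```

The character count of 61 agrees with StringBuffer(b).length() in the reproduction, which is what a correct str conversion should also yield once the fix in JCC's trunk is in use.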