It's fixed now in JCC's trunk.

Andi..
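
For anyone still on a JCC build without the fix, the bytes workaround from the quoted thread below can be wrapped in a small helper. This is only a sketch: as_utf8_bytes is a hypothetical name, not part of PyLucene or JCC, and it relies on the behaviour reported below that UTF-8 'bytes' are converted correctly.

```python
def as_utf8_bytes(value):
    """Return value as UTF-8 bytes; bytes pass through unchanged.

    Hypothetical helper for illustration, not part of PyLucene or JCC.
    """
    if isinstance(value, bytes):
        return value
    return value.encode('utf-8')


text = '«Volgende facturen werden verstuurd aan de financiële dienst.»'
encoded = as_utf8_bytes(text)

# '«', 'ë' and '»' each take two bytes in UTF-8, so the encoded form is
# three bytes longer than the number of characters in the str.
extra_bytes = len(encoded) - len(text)

# With PyLucene imported and initVM() called, the workaround would then be:
#     field.setStringValue(as_utf8_bytes(text))
```

Once a JCC release with the fix is installed, passing the 'str' directly should work and the helper can be dropped.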

> On Oct 10, 2019, at 05:18, Marc Jeurissen <marc.jeuris...@uantwerpen.be> 
> wrote:
> 
> Ok thank you Andi.
> I'll use the workaround with the bytes for the moment.
> Hope it will get solved soon though.
>  
>  
> Kind regards,
> Marc Jeurissen
> 
> 
> 
> Bibliotheek UAntwerpen
> Stadscampus – Ve35.303
> Venusstraat 35 – 2000 Antwerpen
> marc.jeuris...@uantwerpen.be
> T +32 3 265 49 71
>  
> 
>  
> From: Andi Vajda
> Sent: Wednesday, 9 October 2019 23:33
> To: Andi Vajda
> Cc: pylucene-dev@lucene.apache.org
> Subject: Re: Field.setStringValue
>  
>  
> On Wed, 9 Oct 2019, Andi Vajda wrote:
>  
> > 
> > On Wed, 9 Oct 2019, Marc Jeurissen wrote:
> > 
> >> Good day to you,
> >>
> >> I have the following issue when setting the value of a field to a value
> >> containing a character with a code point > 160 (PyLucene 8.1.1, Python 3.7.2)
> >>
> >> ...
> >> (Pdb) field
> >> <Field: stored,indexed,tokenized,omitNorms
> >> indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:>>
> >> (Pdb) value = '«Volgende facturen werden verstuurd aan de financiële
> >> dienst.»'
> >> (Pdb) type(value)
> >> <class 'str'>
> >> (Pdb) field.setStringValue(value)
> >> (Pdb) field
> >> <Field:
> >> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende
> >> facturen werden verstuurd aan de financiële dienst>>
> >>
> >> The field value has lost 2 characters.
> >>
> >> But when I encode value:
> >>
> >> (Pdb) value = value.encode('utf-8')
> >> (Pdb) value
> >> b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xable
> >> dienst.\xc2\xbb'
> >>
> >> (Pdb) field.setStringValue(value)
> >> (Pdb) field
> >> <Field:
> >> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende
> >> facturen werden verstuurd aan de financiële dienst.»>>
> >>
> >> The field value is correct.
> >>
> >> So what does field.setStringValue expect: a string (as the Lucene
> >> documentation says) or a byte sequence?
> > 
> > Indeed, there is a problem. I was able to reproduce it with just
> > StringBuffer, no lucene involved at all:
> > 
> > >>> from lucene import initVM
> > >>> initVM()
> > >>> b=b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
> > >>> a=b.decode('utf-8')
> > >>> from java.lang import StringBuffer
> > >>> StringBuffer(b)
> > <StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst.»>
> > >>> StringBuffer(a)
> > <StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst>
> > >>> StringBuffer(a).length()
> > 59
> > >>> StringBuffer(b).length()
> > 61
> > >>> type(a)
> > <class 'str'>
> > >>> type(b)
> > <class 'bytes'>
> > 
> > There must be a bug in the Python 'str' -> Java 'String' conversion code.
> > Any Java API that expects a java.lang.String, such as field.setStringValue(),
> > can be passed a 'str' or 'bytes'; JCC auto-converts as needed. This is very
> > likely where the bug is.
>  
> Digging a bit further, it doesn't seem to be a problem when using Python 2.
> I'm not implying this is a Python bug; strings are just very different
> between Python 2 and 3.
>  
> Andi..
>  
> > 
> > Andi..
> > 
> >>
> >> Thank you very much.
> >>
> >>
> >> Kind regards,
> >> Marc Jeurissen
> >>
> >> Bibliotheek UAntwerpen
> >> Stadscampus – Ve35.303
> >> Venusstraat 35 – 2000 Antwerpen
> >> marc.jeuris...@uantwerpen.be
> >> T +32 3 265 49 71
> >>
> >>
> >>
> > 
>  
