Re: Store byte array in StoredField using zlib compression

Prashant Saxena Sat, 26 Oct 2024 22:50:41 -0700

OK, everything about the original problem is clear now. Please let me know
how to access this:


*from org.apache.lucene.codecs.lucene100 import Lucene100Codec*
*print(Lucene100Codec.Mode.BEST_COMPRESSION)*

Error

AttributeError: type object 'Lucene100Codec$Mode' has no attribute
'BEST_COMPRESSION'

I need it here:

config = IndexWriterConfig(analyzer)
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
config.setCodec(Lucene100Codec(Lucene100Codec.Mode.BEST_COMPRESSION))

Prashant

On Sat, Oct 26, 2024 at 8:13 PM Andi Vajda <[email protected]> wrote:

>
> > On Oct 26, 2024, at 16:21, Prashant Saxena <[email protected]> wrote:
> >
> > There must be an explanation for 83 MB of compressed data almost
> > doubling in size. It doesn't make sense at all.
>
> When not using a JArray('byte') your python byte array is converted into a
> partial java string and is being corrupted, probably at the first utf-8
> conversion error. I didn't actually verify this, I'm not near my computer
> but you're comparing a working solution with a non-working one 😊
>
> Andi..
>
> >
> >> On Sat, Oct 26, 2024 at 7:03 PM Andi Vajda <[email protected]> wrote:
> >>
> >>
> >>> On Oct 26, 2024, at 14:50, Prashant Saxena <[email protected]> wrote:
> >>>
> >>> I just need to store compressed strings to save space. If it can be
> >>> done in any other way, I'm OK with that.
> >>
> >> The JArray('byte') is the way.
> >>
> >> Andi..
> >>
> >>>
> >>>
> >>>> On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <[email protected]> wrote:
> >>>>
> >>>>
> >>>>> On Sat, 26 Oct 2024, Prashant Saxena wrote:
> >>>>>
> >>>>> PyLucene 10.0.0
> >>>>>
> >>>>> I'm trying to store a long text by compressing it first using zlib
> >>>>>
> >>>>> *doc.add(StoredField("contents", zlib.compress(ftext.encode('utf-8'))))*
> >>>>>
> >>>>> The resulting index size is *~83 MB*. When reading its value back using
> >>>>>
> >>>>> *c = doc.getBinaryValue("contents")*
> >>>>>
> >>>>> It's returning 'NoneType' and when using
> >>>>>
> >>>>> *c = doc.get("contents")*
> >>>>>
> >>>>> It's returning a string which cannot be decompressed.
> >>>>>
> >>>>> When using
> >>>>>
> >>>>> *doc.add(StoredField("contents",
> >>>>> JArray('byte')(zlib.compress(ftext.encode('utf-8')))))*
> >>>>>
> >>>>> The resulting index size is *~160 MB*. There is no problem in getting
> >>>>> its value using
> >>>>>
> >>>>>
> >>>>>
> >>>>> *c = doc.getBinaryValue("contents")*
> >>>>> *cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')*
> >>>>>
> >>>>> *Question 1:* Why does the index size almost double when using JArray?
> >>>>
> >>>> Because the value you're passing is actually processed correctly ?
> >>>>
> >>>>> *Question 2:* How do you correctly create and store compressed binary
> >>>>> data in StoredField?
> >>>>
> >>>> If you want a python byte object, like b'abcd', to be seen by Lucene
> >>>> (Java) as a byte array, you should wrap it with a JArray('byte') like
> >>>> you did. Otherwise, it's seen as a string (I need to double-check) and
> >>>> not handled correctly.
> >>>>
> >>>>> I am using PyLucene in my current project. Please advise me if I
> >>>>> should post my questions on the java-user list instead of here.
> >>>>
> >>>> This particular question is specific to PyLucene and should be asked
> >>>> here, like you did ;-)
> >>>>
> >>>> Andi..
> >>>>
> >>
> >>
>
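
The explanation quoted above (a Python bytes value passed without
JArray('byte') being treated as a string and corrupted at the first UTF-8
conversion error) can be illustrated in plain Python. This is only an
analogue, assuming a lossy bytes -> string -> bytes conversion; it does not
call PyLucene and does not claim to reproduce what the Java bridge actually
does. The sample text and variable names are hypothetical.

```python
import zlib

# Hypothetical stand-in for the long document text.
text = "some long text to be stored in the index " * 1000
compressed = zlib.compress(text.encode("utf-8"))

# zlib output is arbitrary binary data, not text: strict UTF-8 decoding fails.
try:
    compressed.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Analogue of storing the value as a string: invalid bytes are replaced,
# so the bytes that come back are no longer the bytes that went in.
lossy = compressed.decode("utf-8", errors="replace").encode("utf-8")

try:
    zlib.decompress(lossy)
    lossy_decompresses = True
except zlib.error:
    lossy_decompresses = False

# Keeping the value as bytes end to end round-trips exactly.
restored = zlib.decompress(compressed).decode("utf-8")

print("valid UTF-8:", is_valid_utf8)
print("lossy copy decompresses:", lossy_decompresses)
print("byte round trip intact:", restored == text)
```

Since the string path does not preserve the data, the index size it produces
is not a meaningful baseline for the size obtained with the byte-array path.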
