OK, everything about the earlier problem has been cleared up. Please let me
know how to get this:

    from org.apache.lucene.codecs.lucene100 import Lucene100Codec
    print(Lucene100Codec.Mode.BEST_COMPRESSION)

Error:

    AttributeError: type object 'Lucene100Codec$Mode' has no attribute 'BEST_COMPRESSION'

I need it here:

    config = IndexWriterConfig(analyzer)
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    config.setCodec(Lucene100Codec(Lucene100Codec.Mode.BEST_COMPRESSION))

Prashant

On Sat, Oct 26, 2024 at 8:13 PM Andi Vajda <va...@apache.org> wrote:

> > On Oct 26, 2024, at 16:21, Prashant Saxena <animator...@gmail.com> wrote:
> >
> > There must be an explanation for 83 MB of compressed data growing to
> > almost double its size. It doesn't make sense at all.
>
> When not using a JArray('byte') your python byte array is converted into a
> partial java string and is being corrupted, probably at the first utf-8
> conversion error. I didn't actually verify this, I'm not near my computer
> but you're comparing a working solution with a non-working one 😊
>
> Andi..
>
> >> On Sat, Oct 26, 2024 at 7:03 PM Andi Vajda <va...@apache.org> wrote:
> >>
> >>> On Oct 26, 2024, at 14:50, Prashant Saxena <animator...@gmail.com> wrote:
> >>>
> >>> I just need to store compressed strings to save space. If it can be
> >>> done in any other way, I'm OK with that.
> >>
> >> The JArray('byte') is the way.
> >>
> >> Andi..
> >>
> >>>> On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <va...@apache.org> wrote:
> >>>>
> >>>>> On Sat, 26 Oct 2024, Prashant Saxena wrote:
> >>>>>
> >>>>> PyLucene 10.0.0
> >>>>>
> >>>>> I'm trying to store a long text by compressing it first using zlib:
> >>>>>
> >>>>>     doc.add(StoredField("contents", zlib.compress(ftext.encode('utf-8'))))
> >>>>>
> >>>>> The resulting index size is *~83 MB*. When reading its value back
> >>>>> using
> >>>>>
> >>>>>     c = doc.getBinaryValue("contents")
> >>>>>
> >>>>> it returns 'NoneType', and when using
> >>>>>
> >>>>>     c = doc.get("contents")
> >>>>>
> >>>>> it returns a string which cannot be decompressed.
> >>>>>
> >>>>> When using
> >>>>>
> >>>>>     doc.add(StoredField("contents",
> >>>>>         JArray('byte')(zlib.compress(ftext.encode('utf-8')))))
> >>>>>
> >>>>> the resulting index size is *~160 MB*. There is no problem in getting
> >>>>> its value using
> >>>>>
> >>>>>     c = doc.getBinaryValue("contents")
> >>>>>     cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')
> >>>>>
> >>>>> *Question 1:* Why does the index size almost double when using JArray?
> >>>>
> >>>> Because the value you're passing is actually processed correctly?
> >>>>
> >>>>> *Question 2:* How do you correctly create and store compressed binary
> >>>>> data in StoredField?
> >>>>
> >>>> If you want a python byte object, like b'abcd', to be seen by Lucene
> >>>> (Java) as a byte array, you should wrap it with a JArray('byte') like
> >>>> you did. Otherwise, it's seen as a string (I need to double-check) and
> >>>> not handled correctly.
> >>>>
> >>>>> I am using PyLucene in my current project. Please advise me if I
> >>>>> should post my questions on the java-user list instead of here.
> >>>>
> >>>> This particular question is specific to PyLucene and should be asked
> >>>> here, like you did ;-)
> >>>>
> >>>> Andi..
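
Illustration (not part of the original messages): the thread's explanation that compressed bytes get corrupted when they are treated as a string was explicitly unverified, and the sketch below does not reproduce what PyLucene/JCC actually does. It only shows, in pure Python with the standard library, why zlib output cannot survive a trip through a UTF-8 text conversion. The helper names and the sample text are hypothetical, and the lossy decode with errors="replace" is an assumed stand-in for whatever conversion happens on the Java side.

```python
import zlib


def roundtrip_as_bytes(text: str) -> str:
    """Compress, keep the result as raw bytes, then decompress."""
    compressed = zlib.compress(text.encode("utf-8"))
    return zlib.decompress(compressed).decode("utf-8")


def roundtrip_as_text(text: str) -> bytes:
    """Compress, then force the bytes through a lossy UTF-8 text conversion.

    This is only a stand-in for an unintended bytes -> string conversion;
    it is not the conversion PyLucene performs.
    """
    compressed = zlib.compress(text.encode("utf-8"))
    as_text = compressed.decode("utf-8", errors="replace")
    return as_text.encode("utf-8")


# Hypothetical sample standing in for the long text being indexed.
sample = "PyLucene stores compressed text in a StoredField. " * 200

# Raw bytes round-trip without loss.
print(roundtrip_as_bytes(sample) == sample)

# Bytes that went through a text conversion no longer decompress.
damaged = roundtrip_as_text(sample)
try:
    zlib.decompress(damaged)
except zlib.error as exc:
    print("cannot decompress:", exc)
```

This fits the symptoms reported in the thread: the value stored without JArray('byte') came back as a string that could not be decompressed, while the JArray('byte') version round-tripped correctly.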