https://github.com/pharo-project/pharo/issues/4806
PR will follow

> On 3 Oct 2019, at 13:49, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Sven, Tomo
>
> Thanks for this discussion. I shall bear in mind Sven's proposed extension to ByteArray - this is exactly the sort of neater solution I was hoping for. Any chance this might make it into standard Pharo (perhaps in P8)?
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Tomohiro Oda
> Sent: 03 October 2019 12:22
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] How to zip a WideString
>
> Sven,
>
> Yes, ByteArray>>zipped/unzipped are a simple, neat and intuitive way of zipping/unzipping binary data.
> I also love the new idioms. They look clean and concise.
>
> Best Regards,
> ---
> tomo
>
> 2019/10/3 (Thu) 20:14 Sven Van Caekenberghe <s...@stfx.eu>:
>>
>> Actually, thinking about this a bit more, why not add #zipped #unzipped to ByteArray ?
>>
>> ByteArray>>#zipped
>>     "Return a GZIP compressed version of the receiver as a ByteArray"
>>
>>     ^ ByteArray streamContents: [ :out |
>>         (GZipWriteStream on: out) nextPutAll: self; close ]
>>
>> ByteArray>>#unzipped
>>     "Assuming the receiver contains GZIP encoded data,
>>     return the decompressed data as a ByteArray"
>>
>>     ^ (GZipReadStream on: self) upToEnd
>>
>> The original one-liner then becomes
>>
>>     'string' utf8Encoded zipped.
>>
>> and
>>
>>     data unzipped utf8Decoded
>>
>> which is pretty clear, simple and intention-revealing, IMHO.
>>
>>> On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>>
>>> Hi Tomo,
>>>
>>> Indeed, I stand corrected, it does indeed seem possible to use the existing gzip classes to work from bytes to bytes, this works fine:
>>>
>>>     data := ByteArray streamContents: [ :out |
>>>         (GZipWriteStream on: out) nextPutAll: 'foo 10 €' utf8Encoded; close ].
>>>
>>>     (GZipReadStream on: data) upToEnd utf8Decoded.
>>>
>>> Now regarding the encoding option, I am not so sure that is really necessary (though nice to have). Why would anyone use anything except UTF8 (today)?
>>>
>>> Thanks again for the correction !
>>>
>>> Sven
>>>
>>>> On 3 Oct 2019, at 12:41, Tomohiro Oda <tomohiro.tomo....@gmail.com> wrote:
>>>>
>>>> Peter and Sven,
>>>>
>>>> The zip API from string to string works fine, except that aWideString zipped generates a malformed zip string.
>>>> I think it might be a good idea to define String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:, such as
>>>>
>>>> String>>zippedWithEncoding: encoder
>>>>     ^ ByteArray
>>>>         streamContents: [ :stream |
>>>>             | gzstream |
>>>>             gzstream := GZipWriteStream on: stream.
>>>>             encoder
>>>>                 next: self size
>>>>                 putAll: self
>>>>                 startingAt: 1
>>>>                 toStream: gzstream.
>>>>             gzstream close ]
>>>>
>>>> and
>>>>
>>>> ByteArray>>unzippedWithEncoding: encoder
>>>>     | byteStream |
>>>>     byteStream := GZipReadStream on: self.
>>>>     ^ String
>>>>         streamContents: [ :stream |
>>>>             [ byteStream atEnd ]
>>>>                 whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
>>>>
>>>> Then, you can write something like
>>>>
>>>>     zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
>>>>     unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>>>>
>>>> This will not affect the existing zipped/unzipped API and you can handle other encodings.
>>>> This zippedWithEncoding: generates a ByteArray, which is kind of conformant to the encoding API.
>>>> And you don't have to create many intermediate byte arrays and byte strings.
>>>>
>>>> I hope this helps.
>>>> ---
>>>> tomo
>>>>
>>>> 2019/10/3 (Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> About #zipped / #unzipped and the inflate / deflate classes: your observation is correct, these work from string to string, while clearly the compressed representation should be binary.
>>>>>
>>>>> The contents (input, what is inside the compressed data) can be anything, it is not necessarily a string (it could be an image, so also something binary). Only the creator of the compressed data knows, you cannot assume to know in general.
>>>>>
>>>>> It would be possible (and it would be very nice) to change this, however that will have serious impact on users (as the contract changes).
>>>>>
>>>>> About your use case: why would your DB not be capable of storing large strings ? A good DB should be capable of storing any kind of string (full unicode) efficiently.
>>>>>
>>>>> What DB and what sizes are we talking about ?
>>>>>
>>>>> Sven
>>>>>
>>>>>> On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>>>>>
>>>>>> Hello
>>>>>>
>>>>>> I have a problem with text storage, to which I seem to have found a solution, but it’s a bit clumsy-looking. I would be grateful for confirmation that (a) there is no neater solution, (b) I can rely on this to work – I only know that it works in a few test cases.
>>>>>>
>>>>>> I need to store a large number of text strings in a database. To avoid the database files becoming too large, I am thinking of zipping the strings, or at least the less frequently accessed ones. Depending on the source, some of the strings will be instances of ByteString, some of WideString (because they contain characters not representable in one byte). Storing a WideString uncompressed seems to occupy 4 bytes per character, so I decided, before thinking of compression, to store the strings utf8Encoded, which yields a ByteArray. But zipped can only be applied to a String, not a ByteArray.
>>>>>>
>>>>>> So my proposed solution is:
>>>>>>
>>>>>> For compression:   myZipString := myWideString utf8Encoded asString zipped.
>>>>>> For decompression: myOutputString := myZipString unzipped asByteArray utf8Decoded.
>>>>>>
>>>>>> As I said, it works in all the cases I tried, whether WideString or not, but the chains of transformations look clunky somehow. Can anyone see a neater way of doing it? And can I rely on it working, especially when I am handling foreign texts with many multi-byte characters?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>> Peter Kenny
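For reference, a minimal Playground sketch of the round trip discussed above, assuming Sven's proposed ByteArray>>#zipped and ByteArray>>#unzipped extensions are installed (they rely on Pharo's GZipWriteStream / GZipReadStream and on #utf8Encoded / #utf8Decoded, all used in the thread):

    | source compressed restored |
    source := 'foo 10 €'.                         "a WideString, because of the euro sign"
    compressed := source utf8Encoded zipped.      "String -> ByteArray -> compressed ByteArray"
    restored := compressed unzipped utf8Decoded.  "compressed ByteArray -> ByteArray -> String"
    restored = source                             "=> true"

This keeps the compressed representation binary end to end, which is exactly what the original String-to-String #zipped could not guarantee for a WideString.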
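The same round trip with Tomo's encoder-parameterised variant, again only a sketch and assuming String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding: are defined exactly as in his message (ZnCharacterEncoder utf8 is the Zinc UTF-8 encoder used there):

    | source compressed restored |
    source := 'foo 10 €'.
    compressed := source zippedWithEncoding: ZnCharacterEncoder utf8.      "String -> compressed ByteArray"
    restored := compressed unzippedWithEncoding: ZnCharacterEncoder utf8.  "compressed ByteArray -> String"
    restored = source                                                      "=> true"

The encoder argument only earns its extra ceremony if something other than UTF-8 is needed; otherwise the #zipped / #unzipped one-liners above are simpler, as Sven notes in the thread.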