Thanks Sven. Just 5 hours from when I raised the question, there is a solution in place for everyone. This group is amazing!
-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van Caekenberghe
Sent: 03 October 2019 15:28
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] How to zip a WideString

https://github.com/pharo-project/pharo/pull/4812

> On 3 Oct 2019, at 14:05, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>
> https://github.com/pharo-project/pharo/issues/4806
>
> PR will follow
>
>> On 3 Oct 2019, at 13:49, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>
>> Sven, Tomo
>>
>> Thanks for this discussion. I shall bear in mind Sven's proposed extension
>> to ByteArray - this is exactly the sort of neater solution I was hoping for.
>> Any chance this might make it into standard Pharo (perhaps in P8)?
>>
>> Peter Kenny
>>
>> -----Original Message-----
>> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Tomohiro Oda
>> Sent: 03 October 2019 12:22
>> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] How to zip a WideString
>>
>> Sven,
>>
>> Yes, ByteArray>>zipped/unzipped are a simple, neat and intuitive way of
>> zipping/unzipping binary data.
>> I also love the new idioms. They look clean and concise.
>>
>> Best Regards,
>> ---
>> tomo
>>
>> Thu, 3 Oct 2019, 20:14 Sven Van Caekenberghe <s...@stfx.eu>:
>>>
>>> Actually, thinking about this a bit more, why not add #zipped #unzipped to
>>> ByteArray ?
>>>
>>>
>>> ByteArray>>#zipped
>>>   "Return a GZIP compressed version of the receiver as a ByteArray"
>>>
>>>   ^ ByteArray streamContents: [ :out |
>>>       (GZipWriteStream on: out) nextPutAll: self; close ]
>>>
>>> ByteArray>>#unzipped
>>>   "Assuming the receiver contains GZIP encoded data, return the
>>>   decompressed data as a ByteArray"
>>>
>>>   ^ (GZipReadStream on: self) upToEnd
>>>
>>>
>>> The original one-liner then becomes
>>>
>>>   'string' utf8Encoded zipped.
>>>
>>> and
>>>
>>>   data unzipped utf8Decoded
>>>
>>> which is pretty clear, simple and intention-revealing, IMHO.
>>>
>>>> On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>>>
>>>> Hi Tomo,
>>>>
>>>> Indeed, I stand corrected, it does indeed seem possible to use the
>>>> existing gzip classes to work from bytes to bytes, this works fine:
>>>>
>>>> data := ByteArray streamContents: [ :out | (GZipWriteStream on: out)
>>>> nextPutAll: 'foo 10 €' utf8Encoded; close ].
>>>>
>>>> (GZipReadStream on: data) upToEnd utf8Decoded.
>>>>
>>>> Now regarding the encoding option, I am not so sure that is really
>>>> necessary (though nice to have). Why would anyone use anything except UTF8
>>>> (today)?
>>>>
>>>> Thanks again for the correction !
>>>>
>>>> Sven
>>>>
>>>>> On 3 Oct 2019, at 12:41, Tomohiro Oda <tomohiro.tomo....@gmail.com> wrote:
>>>>>
>>>>> Peter and Sven,
>>>>>
>>>>> zip API from string to string works fine except that aWideString
>>>>> zipped generates a malformed zip string.
>>>>> I think it might be a good guidance to define
>>>>> String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding: .
>>>>> Such as
>>>>> String>>zippedWithEncoding: encoder
>>>>> zippedWithEncoding: encoder
>>>>>    ^ ByteArray
>>>>>        streamContents: [ :stream |
>>>>>            | gzstream |
>>>>>            gzstream := GZipWriteStream on: stream.
>>>>>            encoder
>>>>>                next: self size
>>>>>                putAll: self
>>>>>                startingAt: 1
>>>>>                toStream: gzstream.
>>>>>            gzstream close ]
>>>>>
>>>>> and ByteArray>>unzippedWithEncoding: encoder
>>>>> unzippedWithEncoding: encoder
>>>>>    | byteStream |
>>>>>    byteStream := GZipReadStream on: self.
>>>>>    ^ String
>>>>>        streamContents: [ :stream |
>>>>>            [ byteStream atEnd ]
>>>>>                whileFalse: [ stream nextPut: (encoder nextFromStream:
>>>>> byteStream) ] ]
>>>>>
>>>>> Then, you can write something like
>>>>>    zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
>>>>>    unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>>>>>
>>>>> This will not affect the existing zipped/unzipped API and you can
>>>>> handle other encodings.
>>>>> This zippedWithEncoding: generates a ByteArray, which is kind of
>>>>> conformant to the encoding API.
>>>>> And you don't have to create many intermediate byte arrays and byte
>>>>> strings.
>>>>>
>>>>> I hope this helps.
>>>>> ---
>>>>> tomo
>>>>>
>>>>> 2019/10/3(Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> About #zipped / #unzipped and the inflate / deflate classes: your
>>>>>> observation is correct, these work from string to string, while clearly
>>>>>> the compressed representation should be binary.
>>>>>>
>>>>>> The contents (input, what is inside the compressed data) can be
>>>>>> anything, it is not necessarily a string (it could be an image, so also
>>>>>> something binary). Only the creator of the compressed data knows, you
>>>>>> cannot assume to know in general.
>>>>>>
>>>>>> It would be possible (and it would be very nice) to change this, however
>>>>>> that will have serious impact on users (as the contract changes).
>>>>>>
>>>>>> About your use case: why would your DB not be capable of storing large
>>>>>> strings ? A good DB should be capable of storing any kind of string
>>>>>> (full unicode) efficiently.
>>>>>>
>>>>>> What DB and what sizes are we talking about ?
>>>>>>
>>>>>> Sven
>>>>>>
>>>>>>> On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>>>>>>
>>>>>>> Hello
>>>>>>>
>>>>>>> I have a problem with text storage, to which I seem to have found a
>>>>>>> solution, but it’s a bit clumsy-looking. I would be grateful for
>>>>>>> confirmation that (a) there is no neater solution, (b) I can rely on
>>>>>>> this to work - I only know that it works in a few test cases.
>>>>>>>
>>>>>>> I need to store a large number of text strings in a database. To avoid
>>>>>>> the database files becoming too large, I am thinking of zipping the
>>>>>>> strings, or at least the less frequently accessed ones. Depending on
>>>>>>> the source, some of the strings will be instances of ByteString, some
>>>>>>> of WideString (because they contain characters not representable in one
>>>>>>> byte). Storing a WideString uncompressed seems to occupy 4 bytes per
>>>>>>> character, so I decided, before thinking of compression, to store the
>>>>>>> strings utf8Encoded, which yields a ByteArray. But zipped can only be
>>>>>>> applied to a String, not a ByteArray.
>>>>>>>
>>>>>>> So my proposed solution is:
>>>>>>>
>>>>>>> For compression: myZipString := myWideString utf8Encoded
>>>>>>> asString zipped.
>>>>>>> For decompression: myOutputString := myZipString unzipped
>>>>>>> asByteArray utf8Decoded.
>>>>>>>
>>>>>>> As I said, it works in all the cases I tried, whether WideString or
>>>>>>> not, but the chains of transformations look clunky somehow. Can anyone
>>>>>>> see a neater way of doing it? And can I rely on it working, especially
>>>>>>> when I am handling foreign texts with many multi-byte characters?
>>>>>>>
>>>>>>> Thanks in advance for any help.
>>>>>>>
>>>>>>> Peter Kenny
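
For anyone reading the thread from the archive, the bytes-to-bytes round trip discussed above can be tried in a Playground with a sketch like the one below. It only recombines the expressions Sven quoted (GZipWriteStream, GZipReadStream, utf8Encoded, utf8Decoded); the variable names original, compressed and restored are made up for illustration, and it assumes an image where those classes behave as described in the thread.

```smalltalk
"Playground sketch assembled from the expressions quoted in the thread.
 Variable names are illustrative only."
| original compressed restored |

"The euro sign does not fit in one byte, so this exercises the WideString case."
original := 'foo 10 €'.

"Encode to UTF-8 bytes first, then gzip the bytes into a ByteArray."
compressed := ByteArray streamContents: [ :out |
	(GZipWriteStream on: out)
		nextPutAll: original utf8Encoded;
		close ].

"Gunzip back to bytes, then decode the UTF-8 bytes into a String."
restored := (GZipReadStream on: compressed) upToEnd utf8Decoded.

"Inspect or print this to check that the round trip preserved the text."
restored = original
```

With the ByteArray>>#zipped and ByteArray>>#unzipped extensions proposed in the thread, the two middle steps would presumably shorten to original utf8Encoded zipped and compressed unzipped utf8Decoded; whether an image has those methods depends on whether it includes the change from the linked pull request.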