Re: [Pharo-users] How to zip a WideString

PBKResearch Thu, 03 Oct 2019 08:15:16 -0700

Thanks Sven. Just 5 hours from when I raised the question, there is a solution 
in place for everyone. This group is amazing!


-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van 
Caekenberghe
Sent: 03 October 2019 15:28
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] How to zip a WideString

https://github.com/pharo-project/pharo/pull/4812

> On 3 Oct 2019, at 14:05, Sven Van Caekenberghe <s...@stfx.eu> wrote:
> 
> https://github.com/pharo-project/pharo/issues/4806
> 
> PR will follow
> 
>> On 3 Oct 2019, at 13:49, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>> 
>> Sven, Tomo
>> 
>> Thanks for this discussion. I shall bear in mind Sven's proposed extension 
>> to ByteArray - this is exactly the sort of neater solution I was hoping for. 
>> Any chance this might make it into standard Pharo (perhaps in P8)?
>> 
>> Peter Kenny
>> 
>> -----Original Message-----
>> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
>> Tomohiro Oda
>> Sent: 03 October 2019 12:22
>> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] How to zip a WideString
>> 
>> Sven,
>> 
>> Yes, ByteArray>>zipped/unzipped are a simple, neat and intuitive way of 
>> zipping/unzipping binary data.
>> I also love the new idioms. They look clean and concise.
>> 
>> Best Regards,
>> ---
>> tomo
>> 
>> On Thu, 3 Oct 2019 at 20:14, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>> 
>>> Actually, thinking about this a bit more, why not add #zipped #unzipped to 
>>> ByteArray ?
>>> 
>>> 
>>> ByteArray>>#zipped
>>> "Return a GZIP compressed version of the receiver as a ByteArray"
>>> 
>>> ^ ByteArray streamContents: [ :out |
>>>     (GZipWriteStream on: out) nextPutAll: self; close ]
>>> 
>>> ByteArray>>#unzipped
>>> "Assuming the receiver contains GZIP encoded data, return the 
>>> decompressed data as a ByteArray"
>>> 
>>> ^ (GZipReadStream on: self) upToEnd
>>> 
>>> 
>>> The original oneliner then becomes
>>> 
>>> 'string' utf8Encoded zipped.
>>> 
>>> and
>>> 
>>> data unzipped utf8Decoded
>>> 
>>> which is pretty clear, simple and intention-revealing, IMHO.
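>>> 
>>> A minimal round-trip check, as a sketch only: it assumes the two ByteArray 
>>> methods above (#zipped and #unzipped) are installed in the image, and the 
>>> variable names are purely illustrative.
>>> 
>>> ```smalltalk
>>> | original data |
>>> original := 'foo 10 €'.
>>> "Encode to UTF-8 bytes first, then GZIP those bytes"
>>> data := original utf8Encoded zipped.
>>> "Undo the two steps in reverse order and compare with the input"
>>> data unzipped utf8Decoded = original.
>>> ```
>>> 
>>> The test string contains a character outside the one-byte range, so it 
>>> exercises the WideString case that started this thread.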
>>> 
>>>> On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>>> 
>>>> Hi Tomo,
>>>> 
>>>> I stand corrected: it does seem possible to use the existing gzip 
>>>> classes to work from bytes to bytes. This works fine:
>>>> 
>>>> data := ByteArray streamContents: [ :out | (GZipWriteStream on: out) 
>>>> nextPutAll: 'foo 10 €' utf8Encoded; close ].
>>>> 
>>>> (GZipReadStream on: data) upToEnd utf8Decoded.
>>>> 
>>>> Now regarding the encoding option, I am not so sure that is really 
>>>> necessary (though nice to have). Why would anyone use anything except 
>>>> UTF-8 today?
>>>> 
>>>> Thanks again for the correction !
>>>> 
>>>> Sven
>>>> 
>>>>> On 3 Oct 2019, at 12:41, Tomohiro Oda <tomohiro.tomo....@gmail.com> wrote:
>>>>> 
>>>>> Peter and Sven,
>>>>> 
>>>>> The zip API from string to string works fine, except that aWideString 
>>>>> zipped generates a malformed zip string.
>>>>> I think it might be a good idea to define
>>>>> String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:, 
>>>>> such as
>>>>> String>>zippedWithEncoding: encoder
>>>>> zippedWithEncoding: encoder
>>>>> ^ ByteArray
>>>>>     streamContents: [ :stream |
>>>>>         | gzstream |
>>>>>         gzstream := GZipWriteStream on: stream.
>>>>>         encoder
>>>>>             next: self size
>>>>>             putAll: self
>>>>>             startingAt: 1
>>>>>             toStream: gzstream.
>>>>>         gzstream close ]
>>>>> 
>>>>> and ByteArray>>unzippedWithEncoding: encoder
>>>>> unzippedWithEncoding: encoder
>>>>> | byteStream |
>>>>> byteStream := GZipReadStream on: self.
>>>>> ^ String
>>>>>     streamContents: [ :stream |
>>>>>         [ byteStream atEnd ]
>>>>>             whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
>>>>> 
>>>>> Then, you can write something like
>>>>> zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
>>>>> unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>>>>> 
>>>>> This will not affect the existing zipped/unzipped API and you can 
>>>>> handle other encodings.
>>>>> This zippedWithEncoding: generates a ByteArray, which is kind of 
>>>>> conformant to the encoding API.
>>>>> And you don't have to create many intermediate byte arrays and byte 
>>>>> strings.
>>>>> 
>>>>> I hope this helps.
>>>>> ---
>>>>> tomo
>>>>> 
>>>>> 2019/10/3(Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
>>>>>> 
>>>>>> Hi Peter,
>>>>>> 
>>>>>> About #zipped / #unzipped and the inflate / deflate classes: your 
>>>>>> observation is correct, these work from string to string, while clearly 
>>>>>> the compressed representation should be binary.
>>>>>> 
>>>>>> The contents (input, what is inside the compressed data) can be 
>>>>>> anything, it is not necessarily a string (it could be an image, so also 
>>>>>> something binary). Only the creator of the compressed data knows, you 
>>>>>> cannot assume to know in general.
>>>>>> 
>>>>>> It would be possible (and it would be very nice) to change this; however, 
>>>>>> that would have a serious impact on users (as the contract changes).
>>>>>> 
>>>>>> About your use case: why would your DB not be capable of storing large 
>>>>>> strings ? A good DB should be capable of storing any kind of string 
>>>>>> (full unicode) efficiently.
>>>>>> 
>>>>>> What DB and what sizes are we talking about ?
>>>>>> 
>>>>>> Sven
>>>>>> 
>>>>>>> On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>>>>>> 
>>>>>>> Hello
>>>>>>> 
>>>>>>> I have a problem with text storage, to which I seem to have found a 
>>>>>>> solution, but it’s a bit clumsy-looking. I would be grateful for 
>>>>>>> confirmation that (a) there is no neater solution, (b) I can rely on 
>>>>>>> this to work – I only know that it works in a few test cases.
>>>>>>> 
>>>>>>> I need to store a large number of text strings in a database. To avoid 
>>>>>>> the database files becoming too large, I am thinking of zipping the 
>>>>>>> strings, or at least the less frequently accessed ones. Depending on 
>>>>>>> the source, some of the strings will be instances of ByteString, some 
>>>>>>> of WideString (because they contain characters not representable in one 
>>>>>>> byte). Storing a WideString uncompressed seems to occupy 4 bytes per 
>>>>>>> character, so I decided, before thinking of compression, to store the 
>>>>>>> strings utf8Encoded, which yields a ByteArray. But zipped can only be 
>>>>>>> applied to a String, not a ByteArray.
>>>>>>> 
>>>>>>> So my proposed solution is:
>>>>>>> 
>>>>>>> For compression:
>>>>>>>     myZipString := myWideString utf8Encoded asString zipped.
>>>>>>> For decompression:
>>>>>>>     myOutputString := myZipString unzipped asByteArray utf8Decoded.
>>>>>>> 
>>>>>>> As I said, it works in all the cases I tried, whether WideString or 
>>>>>>> not, but the chains of transformations look clunky somehow. Can anyone 
>>>>>>> see a neater way of doing it? And can I rely on it working, especially 
>>>>>>> when I am handling foreign texts with many multi-byte characters?
>>>>>>> 
>>>>>>> Thanks in advance for any help.
>>>>>>> 
>>>>>>> Peter Kenny
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>
