https://github.com/pharo-project/pharo/issues/4806
PR will follow

> On 3 Oct 2019, at 13:49, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Sven, Tomo
>
> Thanks for this discussion. I shall bear in mind Sven's proposed extension to ByteArray - this is exactly the sort of neater solution I was hoping for. Any chance this might make it into standard Pharo (perhaps in P8)?
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Tomohiro Oda
> Sent: 03 October 2019 12:22
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] How to zip a WideString
>
> Sven,
>
> Yes, ByteArray>>zipped/unzipped are a simple, neat and intuitive way of zipping/unzipping binary data.
> I also love the new idioms. They look clean and concise.
>
> Best Regards,
> ---
> tomo
>
> 2019/10/3 (Thu) 20:14 Sven Van Caekenberghe <s...@stfx.eu>:
>>
>> Actually, thinking about this a bit more, why not add #zipped #unzipped to ByteArray ?
>>
>> ByteArray>>#zipped
>>     "Return a GZIP compressed version of the receiver as a ByteArray"
>>
>>     ^ ByteArray streamContents: [ :out |
>>         (GZipWriteStream on: out) nextPutAll: self; close ]
>>
>> ByteArray>>#unzipped
>>     "Assuming the receiver contains GZIP encoded data,
>>     return the decompressed data as a ByteArray"
>>
>>     ^ (GZipReadStream on: self) upToEnd
>>
>> The original one-liner then becomes
>>
>>     'string' utf8Encoded zipped.
>>
>> and
>>
>>     data unzipped utf8Decoded
>>
>> which is pretty clear, simple and intention-revealing, IMHO.
>>
>>> On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>>
>>> Hi Tomo,
>>>
>>> Indeed, I stand corrected, it does indeed seem possible to use the existing gzip classes to work from bytes to bytes, this works fine:
>>>
>>>     data := ByteArray streamContents: [ :out |
>>>         (GZipWriteStream on: out) nextPutAll: 'foo 10 €' utf8Encoded; close ].
>>>
>>>     (GZipReadStream on: data) upToEnd utf8Decoded.
>>>
>>> Now regarding the encoding option, I am not so sure that is really necessary (though nice to have). Why would anyone use anything except UTF8 (today)?
>>>
>>> Thanks again for the correction !
>>>
>>> Sven
>>>
>>>> On 3 Oct 2019, at 12:41, Tomohiro Oda <tomohiro.tomo....@gmail.com> wrote:
>>>>
>>>> Peter and Sven,
>>>>
>>>> The zip API from string to string works fine, except that aWideString zipped generates a malformed zip string.
>>>> I think it might be a good idea to define String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:, such as
>>>>
>>>> String>>zippedWithEncoding: encoder
>>>>     ^ ByteArray
>>>>         streamContents: [ :stream |
>>>>             | gzstream |
>>>>             gzstream := GZipWriteStream on: stream.
>>>>             encoder
>>>>                 next: self size
>>>>                 putAll: self
>>>>                 startingAt: 1
>>>>                 toStream: gzstream.
>>>>             gzstream close ]
>>>>
>>>> and
>>>>
>>>> ByteArray>>unzippedWithEncoding: encoder
>>>>     | byteStream |
>>>>     byteStream := GZipReadStream on: self.
>>>>     ^ String
>>>>         streamContents: [ :stream |
>>>>             [ byteStream atEnd ]
>>>>                 whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
>>>>
>>>> Then, you can write something like
>>>>
>>>>     zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
>>>>     unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>>>>
>>>> This will not affect the existing zipped/unzipped API and you can handle other encodings.
>>>> This zippedWithEncoding: generates a ByteArray, which is kind of conformant to the encoding API.
>>>> And you don't have to create many intermediate byte arrays and byte strings.
>>>>
>>>> I hope this helps.
>>>> ---
>>>> tomo
>>>>
>>>> 2019/10/3 (Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> About #zipped / #unzipped and the inflate / deflate classes: your observation is correct, these work from string to string, while clearly the compressed representation should be binary.
>>>>>
>>>>> The contents (input, what is inside the compressed data) can be anything, it is not necessarily a string (it could be an image, so also something binary). Only the creator of the compressed data knows, you cannot assume to know in general.
>>>>>
>>>>> It would be possible (and it would be very nice) to change this, however that will have serious impact on users (as the contract changes).
>>>>>
>>>>> About your use case: why would your DB not be capable of storing large strings ? A good DB should be capable of storing any kind of string (full unicode) efficiently.
>>>>>
>>>>> What DB and what sizes are we talking about ?
>>>>>
>>>>> Sven
>>>>>
>>>>>> On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>>>>>
>>>>>> Hello
>>>>>>
>>>>>> I have a problem with text storage, to which I seem to have found a solution, but it’s a bit clumsy-looking. I would be grateful for confirmation that (a) there is no neater solution, (b) I can rely on this to work – I only know that it works in a few test cases.
>>>>>>
>>>>>> I need to store a large number of text strings in a database. To avoid the database files becoming too large, I am thinking of zipping the strings, or at least the less frequently accessed ones. Depending on the source, some of the strings will be instances of ByteString, some of WideString (because they contain characters not representable in one byte). Storing a WideString uncompressed seems to occupy 4 bytes per character, so I decided, before thinking of compression, to store the strings utf8Encoded, which yields a ByteArray. But zipped can only be applied to a String, not a ByteArray.
>>>>>>
>>>>>> So my proposed solution is:
>>>>>>
>>>>>> For compression:   myZipString := myWideString utf8Encoded asString zipped.
>>>>>> For decompression: myOutputString := myZipString unzipped asByteArray utf8Decoded.
>>>>>>
>>>>>> As I said, it works in all the cases I tried, whether WideString or not, but the chains of transformations look clunky somehow. Can anyone see a neater way of doing it? And can I rely on it working, especially when I am handling foreign texts with many multi-byte characters?
>>>>>>
>>>>>> Thanks in advance for any help.
>>>>>>
>>>>>> Peter Kenny
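For reference, a minimal Playground sketch of the round trip discussed above, assuming Sven's proposed ByteArray>>#zipped and ByteArray>>#unzipped extensions are installed (they rely on Pharo's GZipWriteStream / GZipReadStream and on #utf8Encoded / #utf8Decoded, all used in the thread):

    | source compressed restored |
    source := 'foo 10 €'.                         "a WideString, because of the euro sign"
    compressed := source utf8Encoded zipped.      "String -> ByteArray -> compressed ByteArray"
    restored := compressed unzipped utf8Decoded.  "compressed ByteArray -> ByteArray -> String"
    restored = source                             "=> true"

This keeps the compressed representation binary end to end, which is exactly what the original String-to-String #zipped could not guarantee for a WideString.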
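The same round trip with Tomo's encoder-parameterised variant, again only a sketch and assuming String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding: are defined exactly as in his message (ZnCharacterEncoder utf8 is the Zinc UTF-8 encoder used there):

    | source compressed restored |
    source := 'foo 10 €'.
    compressed := source zippedWithEncoding: ZnCharacterEncoder utf8.      "String -> compressed ByteArray"
    restored := compressed unzippedWithEncoding: ZnCharacterEncoder utf8.  "compressed ByteArray -> String"
    restored = source                                                      "=> true"

The encoder argument only earns its extra ceremony if something other than UTF-8 is needed; otherwise the #zipped / #unzipped one-liners above are simpler, as Sven notes in the thread.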