Hi Tomo,

Indeed, I stand corrected: it does seem possible to use the existing gzip classes to work from bytes to bytes. This works fine:
data := ByteArray streamContents: [ :out |
	(GZipWriteStream on: out)
		nextPutAll: 'foo 10 €' utf8Encoded;
		close ].
(GZipReadStream on: data) upToEnd utf8Decoded.

Now, regarding the encoding option, I am not so sure that it is really necessary (though it would be nice to have). Why would anyone use anything except UTF-8 today?

Thanks again for the correction!

Sven

> On 3 Oct 2019, at 12:41, Tomohiro Oda <tomohiro.tomo....@gmail.com> wrote:
>
> Peter and Sven,
>
> The zip API from string to string works fine, except that aWideString
> zipped generates a malformed zip string.
> I think it might be good guidance to define
> String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding: .
> Such as:
>
> String>>zippedWithEncoding: encoder
>     ^ ByteArray
>         streamContents: [ :stream |
>             | gzstream |
>             gzstream := GZipWriteStream on: stream.
>             encoder
>                 next: self size
>                 putAll: self
>                 startingAt: 1
>                 toStream: gzstream.
>             gzstream close ]
>
> and:
>
> ByteArray>>unzippedWithEncoding: encoder
>     | byteStream |
>     byteStream := GZipReadStream on: self.
>     ^ String
>         streamContents: [ :stream |
>             [ byteStream atEnd ]
>                 whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
>
> Then you can write something like:
>
> zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
> unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>
> This will not affect the existing zipped/unzipped API, and you can
> handle other encodings.
> This zippedWithEncoding: generates a ByteArray, which is conformant
> to the encoding API.
> And you don't have to create many intermediate byte arrays and byte strings.
>
> I hope this helps.
> ---
> tomo
>
> 2019/10/3 (Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
>>
>> Hi Peter,
>>
>> About #zipped / #unzipped and the inflate / deflate classes: your
>> observation is correct; these work from string to string, while clearly the
>> compressed representation should be binary.
>>
>> The contents (the input, what is inside the compressed data) can be
>> anything; it is not necessarily a string (it could be an image, so also
>> something binary). Only the creator of the compressed data knows; you
>> cannot assume to know in general.
>>
>> It would be possible (and it would be very nice) to change this; however,
>> that would have serious impact on users (as the contract changes).
>>
>> About your use case: why would your DB not be capable of storing large
>> strings? A good DB should be capable of storing any kind of string (full
>> Unicode) efficiently.
>>
>> What DB and what sizes are we talking about?
>>
>> Sven
>>
>>> On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>>
>>> Hello
>>>
>>> I have a problem with text storage, to which I seem to have found a
>>> solution, but it's a bit clumsy-looking. I would be grateful for
>>> confirmation that (a) there is no neater solution, and (b) I can rely on
>>> this to work; I only know that it works in a few test cases.
>>>
>>> I need to store a large number of text strings in a database. To avoid
>>> the database files becoming too large, I am thinking of zipping the
>>> strings, or at least the less frequently accessed ones. Depending on the
>>> source, some of the strings will be instances of ByteString and some of
>>> WideString (because they contain characters not representable in one
>>> byte). Storing a WideString uncompressed seems to occupy 4 bytes per
>>> character, so I decided, before thinking of compression, to store the
>>> strings utf8Encoded, which yields a ByteArray. But zipped can only be
>>> applied to a String, not a ByteArray.
>>>
>>> So my proposed solution is:
>>>
>>> For compression:
>>>     myZipString := myWideString utf8Encoded asString zipped.
>>> For decompression:
>>>     myOutputString := myZipString unzipped asByteArray utf8Decoded.
>>>
>>> As I said, it works in all the cases I tried, whether WideString or not,
>>> but the chains of transformations look clunky somehow. Can anyone see a
>>> neater way of doing it? And can I rely on it working, especially when I
>>> am handling foreign texts with many multi-byte characters?
>>>
>>> Thanks in advance for any help.
>>>
>>> Peter Kenny
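
[Editor's note] The byte-to-byte round trip described at the top of the thread can be collected into one pair of expressions. This is a sketch assuming Pharo's GZipWriteStream / GZipReadStream classes and Zinc's utf8Encoded / utf8Decoded extensions, as used in the thread; myWideString is a placeholder variable:

```smalltalk
"Compress: encode the (possibly wide) string to UTF-8 bytes, then gzip them.
The result is a ByteArray, suitable for storing in a binary database column."
zipped := ByteArray streamContents: [ :out |
	(GZipWriteStream on: out)
		nextPutAll: myWideString utf8Encoded;
		close ].

"Decompress: gunzip back to a ByteArray, then decode the UTF-8 bytes."
restored := (GZipReadStream on: zipped) upToEnd utf8Decoded.
```

Since the compressed data stays binary end to end, this avoids the asString / asByteArray detour in Peter's original proposal.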