Re: [Pharo-users] How to zip a WideString

Tomohiro Oda Thu, 03 Oct 2019 04:23:46 -0700

Sven,

Yes, ByteArray>>zipped/unzipped are a simple, neat and intuitive way of
zipping/unzipping binary data.
I also love the new idioms. They look clean and concise.
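
For example, a quick round-trip check in a Playground could look like the
sketch below. It assumes that ByteArray>>zipped and ByteArray>>unzipped have
been installed as proposed in the quoted message; the variable names are only
illustrative.

```smalltalk
"Round-trip sketch. Assumes ByteArray>>zipped and ByteArray>>unzipped
 exist as proposed in this thread."
| original compressed restored |
original := 'foo 10 €'.                        "a WideString, because of the euro sign"
compressed := original utf8Encoded zipped.     "String -> UTF-8 bytes -> gzip bytes"
restored := compressed unzipped utf8Decoded.   "gzip bytes -> UTF-8 bytes -> String"
restored = original                            "should answer true if the round trip is lossless"
```

Both the input and the output of the compression step are ByteArrays here, so
no intermediate asString / asByteArray conversions are needed.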


Best Regards,
---
tomo

On Thu, 3 Oct 2019 at 20:14, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>
> Actually, thinking about this a bit more, why not add #zipped and #unzipped to
> ByteArray?
>
>
> ByteArray>>#zipped
>   "Return a GZIP compressed version of the receiver as a ByteArray"
>
>   ^ ByteArray streamContents: [ :out |
>       (GZipWriteStream on: out) nextPutAll: self; close ]
>
> ByteArray>>#unzipped
>   "Assuming the receiver contains GZIP encoded data,
>    return the decompressed data as a ByteArray"
>
>   ^ (GZipReadStream on: self) upToEnd
>
>
> The original oneliner then becomes
>
>   'string' utf8Encoded zipped.
>
> and
>
>   data unzipped utf8Decoded
>
> which is pretty clear, simple and intention-revealing, IMHO.
>
> > On 3 Oct 2019, at 13:04, Sven Van Caekenberghe <s...@stfx.eu> wrote:
> >
> > Hi Tomo,
> >
> > Indeed, I stand corrected, it does indeed seem possible to use the existing 
> > gzip classes to work from bytes to bytes, this works fine:
> >
> > data := ByteArray streamContents: [ :out |
> >     (GZipWriteStream on: out) nextPutAll: 'foo 10 €' utf8Encoded; close ].
> >
> > (GZipReadStream on: data) upToEnd utf8Decoded.
> >
> > Now regarding the encoding option, I am not so sure that is really 
> > necessary (though nice to have). Why would anyone use anything except UTF8 
> > (today)?
> >
> > Thanks again for the correction !
> >
> > Sven
> >
> >> On 3 Oct 2019, at 12:41, Tomohiro Oda <tomohiro.tomo....@gmail.com> wrote:
> >>
> >> Peter and Sven,
> >>
> >> The zip API from string to string works fine, except that aWideString
> >> zipped generates a malformed zip string.
> >> I think it might be good guidance to define
> >> String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:,
> >> such as
> >> String>>zippedWithEncoding: encoder
> >> zippedWithEncoding: encoder
> >>   ^ ByteArray
> >>       streamContents: [ :stream |
> >>           | gzstream |
> >>           gzstream := GZipWriteStream on: stream.
> >>           encoder
> >>               next: self size
> >>               putAll: self
> >>               startingAt: 1
> >>               toStream: gzstream.
> >>           gzstream close ]
> >>
> >> and ByteArray>>unzippedWithEncoding: encoder
> >> unzippedWithEncoding: encoder
> >>   | byteStream |
> >>   byteStream := GZipReadStream on: self.
> >>   ^ String
> >>       streamContents: [ :stream |
> >>           [ byteStream atEnd ]
> >>               whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
> >>
> >> Then, you can write something like
> >> zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
> >> unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
> >>
> >> This will not affect the existing zipped/unzipped API and you can
> >> handle other encodings.
> >> This zippedWithEncoding: generates a ByteArray, which is kind of
> >> conformant to the encoding API.
> >> And you don't have to create many intermediate byte arrays and byte 
> >> strings.
> >>
> >> I hope this helps.
> >> ---
> >> tomo
> >>
> >> 2019/10/3(Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
> >>>
> >>> Hi Peter,
> >>>
> >>> About #zipped / #unzipped and the inflate / deflate classes: your 
> >>> observation is correct, these work from string to string, while clearly 
> >>> the compressed representation should be binary.
> >>>
> >>> The contents (input, what is inside the compressed data) can be anything, 
> >>> it is not necessarily a string (it could be an image, so also something 
> >>> binary). Only the creator of the compressed data knows, you cannot assume 
> >>> to know in general.
> >>>
> >>> It would be possible (and it would be very nice) to change this, however 
> >>> that will have serious impact on users (as the contract changes).
> >>>
> >>> About your use case: why would your DB not be capable of storing large 
> >>> strings ? A good DB should be capable of storing any kind of string (full 
> >>> unicode) efficiently.
> >>>
> >>> What DB and what sizes are we talking about ?
> >>>
> >>> Sven
> >>>
> >>>> On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> >>>>
> >>>> Hello
> >>>>
> >>>> I have a problem with text storage, to which I seem to have found a 
> >>>> solution, but it’s a bit clumsy-looking. I would be grateful for 
> >>>> confirmation that (a) there is no neater solution, (b) I can rely on 
> >>>> this to work – I only know that it works in a few test cases.
> >>>>
> >>>> I need to store a large number of text strings in a database. To avoid 
> >>>> the database files becoming too large, I am thinking of zipping the 
> >>>> strings, or at least the less frequently accessed ones. Depending on the 
> >>>> source, some of the strings will be instances of ByteString, some of 
> >>>> WideString (because they contain characters not representable in one 
> >>>> byte). Storing a WideString uncompressed seems to occupy 4 bytes per 
> >>>> character, so I decided, before thinking of compression, to store the 
> >>>> strings utf8Encoded, which yields a ByteArray. But zipped can only be 
> >>>> applied to a String, not a ByteArray.
> >>>>
> >>>> So my proposed solution is:
> >>>>
> >>>> For compression:    myZipString := myWideString utf8Encoded asString zipped.
> >>>> For decompression:  myOutputString := myZipString unzipped asByteArray utf8Decoded.
> >>>>
> >>>> As I said, it works in all the cases I tried, whether WideString or not, 
> >>>> but the chains of transformations look clunky somehow. Can anyone see a 
> >>>> neater way of doing it? And can I rely on it working, especially when I 
> >>>> am handling foreign texts with many multi-byte characters?
> >>>>
> >>>> Thanks in advance for any help.
> >>>>
> >>>> Peter Kenny
> >>>
> >>>
> >>
> >
>
>
