Peter and Sven,

The string-to-string zip API works fine, except that zipping a
WideString produces a malformed zip string.
I think a good approach would be to define
String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:.
For example:
String>>zippedWithEncoding: encoder
zippedWithEncoding: encoder
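    "Encode the receiver using encoder, gzip-compress the encoded bytes, and answer them as a ByteArray."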
    ^ ByteArray
        streamContents: [ :stream |
            | gzstream |
            gzstream := GZipWriteStream on: stream.
            encoder
                next: self size
                putAll: self
                startingAt: 1
                toStream: gzstream.
            gzstream close ]

and ByteArray>>unzippedWithEncoding: encoder
unzippedWithEncoding: encoder
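    "Answer a String obtained by gzip-decompressing the receiver and decoding the bytes with encoder."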
    | byteStream |
    byteStream := GZipReadStream on: self.
    ^ String
        streamContents: [ :stream |
            [ byteStream atEnd ]
                whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]

Then, you can write something like
zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.

This does not affect the existing zipped/unzipped API, and it lets you
handle other encodings as well.
zippedWithEncoding: answers a ByteArray, which is consistent with the
rest of the encoding API, and you avoid creating intermediate byte
arrays and byte strings.
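
For example, a quick round-trip check (just a sketch, untested; it
assumes the two methods above are installed, and any other
ZnCharacterEncoder, e.g. ZnCharacterEncoder utf16, should work the same
way):

| original zipped unzipped |
original := 'Grüße, Pharo! こんにちは'.    "a WideString"
zipped := original zippedWithEncoding: ZnCharacterEncoder utf8.    "a ByteArray"
unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
unzipped = original.    "true"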

I hope this helps.
---
tomo

2019/10/3(Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
>
> Hi Peter,
>
> About #zipped / #unzipped and the inflate / deflate classes: your observation 
> is correct, these work from string to string, while clearly the compressed 
> representation should be binary.
>
> The contents (input, what is inside the compressed data) can be anything, it 
> is not necessarily a string (it could be an image, so also something binary). 
> Only the creator of the compressed data knows, you cannot assume to know in 
> general.
>
> It would be possible (and it would be very nice) to change this, however that 
> will have serious impact on users (as the contract changes).
>
> About your use case: why would your DB not be capable of storing large 
> strings ? A good DB should be capable of storing any kind of string (full 
> unicode) efficiently.
>
> What DB and what sizes are we talking about ?
>
> Sven
>
> > On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> >
> > Hello
> >
> > I have a problem with text storage, to which I seem to have found a 
> > solution, but it’s a bit clumsy-looking. I would be grateful for 
> > confirmation that (a) there is no neater solution, (b) I can rely on this 
> > to work – I only know that it works in a few test cases.
> >
> > I need to store a large number of text strings in a database. To avoid the 
> > database files becoming too large, I am thinking of zipping the strings, or 
> > at least the less frequently accessed ones. Depending on the source, some 
> > of the strings will be instances of ByteString, some of WideString (because 
> > they contain characters not representable in one byte). Storing a 
> > WideString uncompressed seems to occupy 4 bytes per character, so I 
> > decided, before thinking of compression, to store the strings utf8Encoded, 
> > which yields a ByteArray. But zipped can only be applied to a String, not a 
> > ByteArray.
> >
> > So my proposed solution is:
> >
> > For compression:             myZipString := myWideString utf8Encoded 
> > asString zipped.
> > For decompression:         myOutputString := myZipString unzipped 
> > asByteArray utf8Decoded.
> >
> > As I said, it works in all the cases I tried, whether WideString or not, 
> > but the chains of transformations look clunky somehow. Can anyone see a 
> > neater way of doing it? And can I rely on it working, especially when I am 
> > handling foreign texts with many multi-byte characters?
> >
> > Thanks in advance for any help.
> >
> > Peter Kenny
>
>
