Peter and Sven,

The zip API from string to string works fine, except that zipping a WideString generates a malformed zip string. I think it might be a good idea to define String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding:, for example:

String >> zippedWithEncoding: encoder
    ^ ByteArray streamContents: [ :stream |
        | gzstream |
        gzstream := GZipWriteStream on: stream.
        encoder next: self size putAll: self startingAt: 1 toStream: gzstream.
        gzstream close ]

and

ByteArray >> unzippedWithEncoding: encoder
    | byteStream |
    byteStream := GZipReadStream on: self.
    ^ String streamContents: [ :stream |
        [ byteStream atEnd ] whileFalse: [
            stream nextPut: (encoder nextFromStream: byteStream) ] ]

Then you can write something like:

    zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
    unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.

This will not affect the existing zipped/unzipped API, and you can handle other encodings. This zippedWithEncoding: generates a ByteArray, which is more or less conformant with the encoding API, and you don't have to create many intermediate byte arrays and byte strings.
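For example, once the two extension methods above are installed, a quick Playground round trip (the sample text is only an illustration) confirms that the compressed result is a ByteArray and that wide characters survive:

    | original zipped unzipped |
    original := 'Grüße, 日本語のテキスト'.    "an illustrative WideString"
    zipped := original zippedWithEncoding: ZnCharacterEncoder utf8.    "a ByteArray"
    unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
    unzipped = original    "expected: true"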
I hope this helps.

---
tomo

2019/10/3(Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
>
> Hi Peter,
>
> About #zipped / #unzipped and the inflate / deflate classes: your observation is correct, these work from string to string, while clearly the compressed representation should be binary.
>
> The contents (the input, what is inside the compressed data) can be anything; it is not necessarily a string (it could be an image, so also something binary). Only the creator of the compressed data knows, you cannot assume to know in general.
>
> It would be possible (and it would be very nice) to change this, however that will have a serious impact on users (as the contract changes).
>
> About your use case: why would your DB not be capable of storing large strings? A good DB should be capable of storing any kind of string (full Unicode) efficiently.
>
> What DB and what sizes are we talking about?
>
> Sven
>
> > On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> >
> > Hello
> >
> > I have a problem with text storage, to which I seem to have found a solution, but it's a bit clumsy-looking. I would be grateful for confirmation that (a) there is no neater solution, (b) I can rely on this to work – I only know that it works in a few test cases.
> >
> > I need to store a large number of text strings in a database. To avoid the database files becoming too large, I am thinking of zipping the strings, or at least the less frequently accessed ones. Depending on the source, some of the strings will be instances of ByteString, some of WideString (because they contain characters not representable in one byte). Storing a WideString uncompressed seems to occupy 4 bytes per character, so I decided, before thinking of compression, to store the strings utf8Encoded, which yields a ByteArray. But zipped can only be applied to a String, not a ByteArray.
> >
> > So my proposed solution is:
> >
> > For compression: myZipString := myWideString utf8Encoded asString zipped.
> > For decompression: myOutputString := myZipString unzipped asByteArray utf8Decoded.
> >
> > As I said, it works in all the cases I tried, whether WideString or not, but the chains of transformations look clunky somehow. Can anyone see a neater way of doing it? And can I rely on it working, especially when I am handling foreign texts with many multi-byte characters?
> >
> > Thanks in advance for any help.
> >
> > Peter Kenny
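For reference, the workaround described in Peter's message can be checked with a similar Playground round trip; the sample string below is illustrative only, and the intermediate #asString step is what makes the UTF-8 ByteArray acceptable to #zipped:

    | original roundTripped |
    original := 'naïve 多言語テスト'.    "an illustrative WideString"
    roundTripped := original utf8Encoded asString zipped unzipped asByteArray utf8Decoded.
    roundTripped = original    "expected: true"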