Hi Tomo,

Indeed, I stand corrected: it does seem possible to use the existing gzip classes to work from bytes to bytes. This works fine:
data := ByteArray streamContents: [ :out |
	(GZipWriteStream on: out)
		nextPutAll: 'foo 10 €' utf8Encoded;
		close ].
(GZipReadStream on: data) upToEnd utf8Decoded.

Now, regarding the encoding option, I am not so sure that it is really necessary (though it would be nice to have). Why would anyone use anything except UTF-8 today?

Thanks again for the correction!

Sven

> On 3 Oct 2019, at 12:41, Tomohiro Oda <tomohiro.tomo....@gmail.com> wrote:
>
> Peter and Sven,
>
> The zip API from string to string works fine, except that aWideString
> zipped generates a malformed zip string.
> I think it might be good guidance to define
> String>>zippedWithEncoding: and ByteArray>>unzippedWithEncoding: .
> Such as:
>
> String>>zippedWithEncoding: encoder
>     ^ ByteArray
>         streamContents: [ :stream |
>             | gzstream |
>             gzstream := GZipWriteStream on: stream.
>             encoder
>                 next: self size
>                 putAll: self
>                 startingAt: 1
>                 toStream: gzstream.
>             gzstream close ]
>
> and:
>
> ByteArray>>unzippedWithEncoding: encoder
>     | byteStream |
>     byteStream := GZipReadStream on: self.
>     ^ String
>         streamContents: [ :stream |
>             [ byteStream atEnd ]
>                 whileFalse: [ stream nextPut: (encoder nextFromStream: byteStream) ] ]
>
> Then you can write something like:
>
> zipped := yourLongWideString zippedWithEncoding: ZnCharacterEncoder utf8.
> unzipped := zipped unzippedWithEncoding: ZnCharacterEncoder utf8.
>
> This will not affect the existing zipped/unzipped API, and you can
> handle other encodings.
> This zippedWithEncoding: generates a ByteArray, which is conformant
> to the encoding API.
> And you don't have to create many intermediate byte arrays and byte strings.
>
> I hope this helps.
> ---
> tomo
>
> 2019/10/3 (Thu) 18:56 Sven Van Caekenberghe <s...@stfx.eu>:
>>
>> Hi Peter,
>>
>> About #zipped / #unzipped and the inflate / deflate classes: your
>> observation is correct; these work from string to string, while clearly the
>> compressed representation should be binary.
>>
>> The contents (the input, what is inside the compressed data) can be
>> anything; it is not necessarily a string (it could be an image, so also
>> something binary). Only the creator of the compressed data knows; you
>> cannot assume to know in general.
>>
>> It would be possible (and it would be very nice) to change this; however,
>> that would have serious impact on users (as the contract changes).
>>
>> About your use case: why would your DB not be capable of storing large
>> strings? A good DB should be capable of storing any kind of string (full
>> Unicode) efficiently.
>>
>> What DB and what sizes are we talking about?
>>
>> Sven
>>
>>> On 3 Oct 2019, at 11:06, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>>
>>> Hello
>>>
>>> I have a problem with text storage, to which I seem to have found a
>>> solution, but it's a bit clumsy-looking. I would be grateful for
>>> confirmation that (a) there is no neater solution, and (b) I can rely on
>>> this to work; I only know that it works in a few test cases.
>>>
>>> I need to store a large number of text strings in a database. To avoid
>>> the database files becoming too large, I am thinking of zipping the
>>> strings, or at least the less frequently accessed ones. Depending on the
>>> source, some of the strings will be instances of ByteString and some of
>>> WideString (because they contain characters not representable in one
>>> byte). Storing a WideString uncompressed seems to occupy 4 bytes per
>>> character, so I decided, before thinking of compression, to store the
>>> strings utf8Encoded, which yields a ByteArray. But zipped can only be
>>> applied to a String, not a ByteArray.
>>>
>>> So my proposed solution is:
>>>
>>> For compression:
>>>     myZipString := myWideString utf8Encoded asString zipped.
>>> For decompression:
>>>     myOutputString := myZipString unzipped asByteArray utf8Decoded.
>>>
>>> As I said, it works in all the cases I tried, whether WideString or not,
>>> but the chains of transformations look clunky somehow. Can anyone see a
>>> neater way of doing it? And can I rely on it working, especially when I
>>> am handling foreign texts with many multi-byte characters?
>>>
>>> Thanks in advance for any help.
>>>
>>> Peter Kenny
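
[Editor's note] The byte-to-byte round trip described at the top of the thread can be collected into one pair of expressions. This is a sketch assuming Pharo's GZipWriteStream / GZipReadStream classes and Zinc's utf8Encoded / utf8Decoded extensions, as used in the thread; myWideString is a placeholder variable:

```smalltalk
"Compress: encode the (possibly wide) string to UTF-8 bytes, then gzip them.
The result is a ByteArray, suitable for storing in a binary database column."
zipped := ByteArray streamContents: [ :out |
	(GZipWriteStream on: out)
		nextPutAll: myWideString utf8Encoded;
		close ].

"Decompress: gunzip back to a ByteArray, then decode the UTF-8 bytes."
restored := (GZipReadStream on: zipped) upToEnd utf8Decoded.
```

Since the compressed data stays binary end to end, this avoids the asString / asByteArray detour in Peter's original proposal.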