On Tue, Oct 01, 2019 at 11:20:39AM +0500, Andrey Borodin wrote:
30 сент. 2019 г., в 22:29, Tomas Vondra <tomas.von...@2ndquadrant.com>
написал(а):
On Mon, Sep 30, 2019 at 09:20:22PM +0500, Andrey Borodin wrote:
30 сент. 2019 г., в 20:56, Tomas Vondra <tomas.von...@2ndquadrant.com>
написал(а):
I mean this:
/*
* Use int64 to prevent overflow during calculation.
*/
compressed_size = (int32) ((int64) rawsize * 9 + 8) / 8;
I'm not very familiar with pglz internals, but I'm a bit puzzled by
this. My first instinct was to compare it to this:
#define PGLZ_MAX_OUTPUT(_dlen) ((_dlen) + 4)
but clearly that's a very different (much simpler) formula. So why
shouldn't pglz_maximum_compressed_size simply use this macro?
compressed_size accounts for possible increase of size during
compression. pglz can consume up to 1 control byte for each 8 bytes of
data in worst case.
OK, but does that actually translate in to the formula? We essentially
need to count 8-byte chunks in raw data, and multiply that by 9. Which
gives us something like
nchunks = ((rawsize + 7) / 8) * 9;
which is not quite what the patch does.
I'm afraid neither formula is correct, but all this is hair-splitting
differences.
Sure. I just want to be sure the formula is safe and we won't end up
using too low value in some corner case.
Your formula does not account for the fact that we may not need all bytes from
last chunk.
Consider desired decompressed size of 3 bytes. We may need 1 control byte and 3
literals, 4 bytes total
But nchunks = 9.
OK, so essentially this means my formula works with whole chunks, i.e.
if we happen to need just a part of a decompressed chunk, we still
request enough data to decompress it whole. This way we may request up
to 7 extra bytes, which seems fine.
Binguo's formula is appending 1 control bit per data byte and one extra
control byte. Consider size = 8 bytes. We need 1 control byte, 8
literals, 9 total. But compressed_size = 10.
Mathematically correct formula is compressed_size = (int32) ((int64)
rawsize * 9 + 7) / 8; Here we take one bit for each data byte, and 7
control bits for overflow.
But this equations make no big difference, each formula is safe. I'd
pick one which is easier to understand and document (IMO, its nchunks =
((rawsize + 7) / 8) * 9).
I'd use the *mathematically correct* formula, it doesn't seem to be any
more complex, and the "one bit per byte, complete bytes" explanation
seems quite understandable.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services