On 8/5/21 1:05 AM, Andres Freund wrote:
Hi,
On 2021-08-04 23:44:10 +0200, Tomas Vondra wrote:
How is that better than the two varint flavors that are already out there,
i.e. the bitcoin [1] and protocol buffers [2]?
The protobuf one is *terrible* for CPU efficiency. You need to go through each
byte, do masking and shifting for each byte and then have a conditional
branch. That's bad from the the amount of instructions you need to execute,
and *really* bad for the branch predictor.
Yeah, probably true - particularly for longer values. No argument here.
The first one seems quite efficient in how it encodes the length into very
few bits (which matters especially for small values). It's designed for
integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary lengths
fairly easily, I think:
Look at the first byte, and
0 - 243 - encoded as is
244 - 1 byte
245 - 2 bytes
246 - 3 bytes
247 - 4 bytes
248 - 5 bytes
249 - 6 bytes
250 - 7 bytes
251 - 8 bytes
252 - next 1 byte is length
253 - next 2 bytes are length
254 - next 3 bytes are length
255 - next 4 bytes are length
If we want to support longer lengths, we'd have to reserve an extra value
(which reduces the number of values that require a single byte).
I think that's not a bad scheme. I think it may end up being a bit more
expensive to decode because you need more branches instead of using
find-first-set type instructions.
I don't think it requires many branches, because you can essentially do
if (byte[0] <= 243)
length = 0
else if (byte[0] <= 251)
length = byte[0] - 243
else
{
length_bytes = byte[0] - 251;
... read length_bytes, decode length
}
but I haven't tried implementing it and maybe my intuition is wrong.
Or maybe it'd be a good scheme for on-disk format, but poor for memory.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company