On 12/19/2013 08:37 AM, Oleg Bartunov wrote:
Guys,

before digging deep into the art of comp/decomp world I'd like to know
if you familiar with results of
http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and
some newer research ?

Yeah, I saw that paper.

Do we agree in what we really want ? Basically,
there are three main features: size, compression, decompression speed
- we should take two :)

According to that Zhang et al paper you linked, the Vbyte method actually performs the worst on all of those measures. The other algorithms are quite similar in terms of size (PForDelta being the most efficient), while PForDelta is significantly faster to compress/decompress.

Just by looking at those numbers, PForDelta looks like a clear winner. However, it operates on much bigger batches than the other algorithms; I haven't looked at it in detail, but Zhang et al used 128 integer batches, and they say that 32 integers is the minimum batch size. If we want to use it for the inline posting lists stored in entry tuples, that would be quite wasteful if there are only a few item pointers on the tuple.

Also, in the tests I've run, the compression/decompression speed is not a significant factor in total performance, with either varbyte encoding or Simple9-like encoding I hacked together.

Actually, now that I think about this a bit more, maybe we should go with Rice encoding after all? It's the most efficient in terms of size, and I believe it would be fast enough.

Should we design sort of plugin, which could support independent
storage on disk, so users can apply different techniques, depending on
data.

What I want to say is that we certainly can play with this very
challenged task, but we have limited time  before 9.4 and we should
think in positive direction.

Once we have the code in place to deal with one encoding, it's easy to switch the implementation. Making it user-configurable or pluggable would be overkill IMHO.

What I'm saying is that we should make sure we get the page format right (in particular, I strongly feel we should use the self-contained PostingListSegment struct instead of item-indees that I mentioned in the other post), with the implementation details hidden in the functions in ginpostinglist.c. Then we can easily experiment with different algorithms.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to