Re: [HACKERS] GIN improvements part 1: additional information

Heikki Linnakangas Thu, 19 Dec 2013 00:45:34 -0800

On 12/19/2013 08:37 AM, Oleg Bartunov wrote:

Guys,


before digging deep into the art of comp/decomp world I'd like to know
if you familiar with results of
http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and
some newer research ?


Yeah, I saw that paper.

Do we agree in what we really want ? Basically,
there are three main features: size, compression, decompression speed
- we should take two :)

According to that Zhang et al paper you linked, the Vbyte methodactually performs the worst on all of those measures. The otheralgorithms are quite similar in terms of size (PForDelta being the mostefficient), while PForDelta is significantly faster to compress/decompress.

Just by looking at those numbers, PForDelta looks like a clear winner.However, it operates on much bigger batches than the other algorithms; Ihaven't looked at it in detail, but Zhang et al used 128 integerbatches, and they say that 32 integers is the minimum batch size. If wewant to use it for the inline posting lists stored in entry tuples, thatwould be quite wasteful if there are only a few item pointers on the tuple.

Also, in the tests I've run, the compression/decompression speed is nota significant factor in total performance, with either varbyte encodingor Simple9-like encoding I hacked together.

Actually, now that I think about this a bit more, maybe we should gowith Rice encoding after all? It's the most efficient in terms of size,and I believe it would be fast enough.

Should we design sort of plugin, which could support independent
storage on disk, so users can apply different techniques, depending on
data.

What I want to say is that we certainly can play with this very
challenged task, but we have limited time  before 9.4 and we should
think in positive direction.

Once we have the code in place to deal with one encoding, it's easy toswitch the implementation. Making it user-configurable or pluggablewould be overkill IMHO.

What I'm saying is that we should make sure we get the page format right(in particular, I strongly feel we should use the self-containedPostingListSegment struct instead of item-indees that I mentioned in theother post), with the implementation details hidden in the functions inginpostinglist.c. Then we can easily experiment with different algorithms.


- Heikki


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GIN improvements part 1: additional information

Reply via email to