Hi,

On 11/21/2017 03:47 PM, Ildus Kurbangaliev wrote:
> On Mon, 20 Nov 2017 00:04:53 +0100
> Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>
> ...
>
>> 6) I'm rather confused by AttributeCompression vs.
>> ColumnCompression. I mean, attribute == column, right? Of course,
>> one is for data from the parser, the other one is for internal
>> info. But can we make the naming clearer?
>
> For now I have renamed AttributeCompression to CompressionOptions;
> not sure that's a good name, but at least it causes less confusion.
>
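OK. For the record, what I have in mind is the FdwRoutine-style
pattern, i.e. a handler function returning a struct of callbacks,
roughly like this (just a sketch to anchor the naming discussion -
the callback names and signatures below are mine, not the patch's):

    #include "postgres.h"
    #include "catalog/pg_attribute.h"
    #include "nodes/pg_list.h"

    /* hypothetical sketch, not the patch's actual definition */
    typedef struct CompressionMethodRoutine
    {
        NodeTag     type;

        /* lifecycle callbacks for a column using this method */
        void        (*configure) (Form_pg_attribute att, List *options);
        void        (*drop) (Form_pg_attribute att);

        /* the actual (de)compression callbacks */
        struct varlena *(*compress) (const struct varlena *value,
                                     List *options);
        struct varlena *(*decompress) (const struct varlena *value,
                                       List *options);
    } CompressionMethodRoutine;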
I propose to use either CompressionMethodOptions (and
CompressionMethodRoutine) or CompressionOptions (and
CompressionRoutine).

>>
>> 7) The docs in general are somewhat unsatisfactory, TBH. For
>> example the ColumnCompression node has no comments, unlike
>> everything else in parsenodes. Similarly for the SGML docs - I
>> suggest expanding them to resemble the FDW docs
>> (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
>> also follows the handler/routines pattern.
>
> I've added more comments. I think I'll add more documentation if
> the committers approve the current syntax.
>

OK. Haven't reviewed this yet.

>>
>> 8) One of the unclear things is why we even need a 'drop' routine.
>> It seems that if it's defined, DropAttributeCompression does
>> something. But what should it do? I suppose dropping the options
>> should be done using dependencies (just like we drop columns in
>> this case).
>>
>> BTW why does DropAttributeCompression mess with att->attisdropped
>> in this way? That seems a bit odd.
>
> A 'drop' routine could be useful. An extension could do something
> related to the attribute, like removing extra tables or something
> else. The compression options will not be removed after unlinking
> the compression method from a column, because compressed data will
> still be stored in that column.
>

OK. So something like a "global" dictionary used for the column, or
something like that? Sure, seems useful, and I've been thinking about
that, but I think we badly need some extension using it, even if in a
very simple way. Firstly, we need a "how to" example; secondly, we
need some way to test it.

>>
>> 13) When writing the experimental extension, I was extremely
>> confused about the regular varlena headers, custom compression
>> headers, etc. In the end I stole the code from tsvector.c and
>> whacked it a bit until it worked, but I wouldn't dare to claim I
>> understand how it works.
>>
>> This needs to be documented somewhere. For example postgres.h has
>> a bunch of paragraphs about varlena headers, so perhaps it should
>> be there? I see the patch tweaks some of the constants, but does
>> not update the comment at all.
>
> This is a good point; I'm not sure what this documentation should
> look like. I've just assumed that people should have a deep
> understanding of varlenas if they're going to compress them. But as
> it is, it's easy to make a mistake there. Maybe I should add some
> functions that help construct varlenas with different headers. I
> like the way jsonb is constructed: it uses StringInfo and there are
> a few helper functions (reserveFromBuffer, appendToBuffer and
> others). Maybe they should not be static.
>

Not sure. My main problem was not understanding how this affects the
varlena header, etc. And I had no idea where to look.

>>
>> Perhaps it would be useful to provide some additional macros
>> making access to custom-compressed varlena values easier. Or
>> perhaps VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already
>> support that? This part is not very clear to me.
>
> These macros will work; custom-compressed varlenas behave like old
> compressed varlenas.
>

OK. But then I don't understand why tsvector.c does things like

    VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
    VARRAWSIZE_4B_C(data) - arrsize

instead of

    VARSIZE_ANY_EXHDR(data) - arrsize
    VARSIZE_ANY(data) - arrsize

Seems somewhat confusing.
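To spell out my confusion, this is roughly my mental model (a sketch,
not code from the patch; VARHDRSZ_CUSTOM_COMPRESSED and
VARRAWSIZE_4B_C are the patch's definitions, while the expectations
in the comments are my assumptions):

    #include "postgres.h"   /* with the patch applied, this should
                             * define VARHDRSZ_CUSTOM_COMPRESSED etc. */

    static void
    compare_size_macros(struct varlena *data)
    {
        /* what I'd expect to give the payload size, header excluded */
        Size    generic_payload = VARSIZE_ANY_EXHDR(data);

        /* what tsvector.c computes instead */
        Size    manual_payload = VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED;

        /*
         * And for the total it uses the raw (uncompressed?) size
         * instead of VARSIZE_ANY, which is where I get lost.
         */
        Size    raw_size = VARRAWSIZE_4B_C(data);
        Size    any_size = VARSIZE_ANY(data);

        elog(DEBUG1, "payload %zu vs %zu, size %zu vs %zu",
             generic_payload, manual_payload, any_size, raw_size);
    }

If those pairs are supposed to be equivalent for custom-compressed
values, tsvector.c should probably just use the generic macros; if
they are not, that difference is exactly what needs documenting.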
>>> Still it's a problem if the user used for example `SELECT
>>> <compressed_column> INTO * FROM *`, because postgres will copy
>>> compressed tuples, and there will not be any dependencies
>>> between the destination and the options.
>>>
>>
>> This seems like a rather fatal design flaw, though. I'd say we
>> need to force recompression of the data in such cases. Otherwise
>> all the dependency tracking is rather pointless.
>
> Fixed this problem too. I've added recompression for datums that
> use custom compression.
>

Hmm, it still doesn't work for me. See this:

    test=# create extension pg_lz4 ;
    CREATE EXTENSION
    test=# create table t_lz4 (v text compressed lz4);
    CREATE TABLE
    test=# create table t_pglz (v text);
    CREATE TABLE
    test=# insert into t_lz4 select repeat(md5(1::text),300);
    INSERT 0 1
    test=# insert into t_pglz select * from t_lz4;
    INSERT 0 1
    test=# drop extension pg_lz4 cascade;
    NOTICE:  drop cascades to 2 other objects
    DETAIL:  drop cascades to compression options for lz4
    drop cascades to table t_lz4 column v
    DROP EXTENSION
    test=# \c test
    You are now connected to database "test" as user "user".
    test=# insert into t_lz4 select repeat(md5(1::text),300);^C
    test=# select * from t_pglz ;
    ERROR:  cache lookup failed for compression options 16419

That suggests no recompression happened: the value copied into t_pglz
still references the dropped compression options.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services