On Wed, Sep 2, 2020 at 4:57 AM Mark Dilger <mark.dil...@enterprisedb.com> wrote:
>
> > On Aug 13, 2020, at 4:48 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> >
> > v1-0001: As suggested by Robert, it provides the syntax support for
> > setting the compression method for a column while creating a table and
> > adding columns. However, we don't support changing the compression
> > method for an existing column. As part of this patch, there is only
> > one built-in compression method that can be set (pglz). In this, we
> > have one built-in AM (pglz), and the compressed attributes will
> > directly store the OID of the AM. In this patch, I have removed
> > pg_attr_compression, as we don't support changing the compression
> > for an existing column, so we don't need to preserve the old
> > compressions.
>
> I do not like the way pglz compression is handled in this patch. After
> upgrading PostgreSQL to the first version with this patch included,
> pre-existing on-disk compressed data will not include any custom
> compression Oid in the header, and toast_decompress_datum will notice
> that and decompress the data directly using pglz_decompress. If the
> same data were then written back out, perhaps to another table, into a
> column with no custom compression method defined, it would get
> compressed by toast_compress_datum using DefaultCompressionOid, which
> is defined as PGLZ_COMPRESSION_AM_OID. That isn't a proper round-trip
> for the data: when it gets re-compressed, PGLZ_COMPRESSION_AM_OID gets
> written into the header, which makes the data a bit longer, but also
> means that it is not byte-for-byte the same as it was, which is
> counter-intuitive. Given that any given pglz-compressed datum now has
> two totally different formats that might occur on disk, code may have
> to consider both of them, which increases code complexity, and
> regression tests will need to be written with coverage for both of
> them, which increases test complexity.
>
> It's also not easy to write the extra tests, as there isn't any way
> (that I see) to intentionally write out the traditional shorter form
> from a newer database server; you'd have to do something like a
> pg_upgrade test where you install an older server to write the older
> format, upgrade, and then check that the new server can handle it.
>
> The cleanest solution would seem to be removing the compression AM's
> Oid from the header for all compression AMs, so that pre-patch and
> post-patch written data look exactly the same. The other solution is
> to give pglz pride of place as the original compression mechanism and
> just say that when pglz is the compression method, no Oid gets written
> to the header; only when other compression methods are used does the
> Oid get written. This second option seems closer to the implementation
> you already have, because you already handle decompression of data
> where the Oid is lacking, so all you have to do is intentionally not
> write the Oid when compressing with pglz.
>
> Or did I misunderstand your implementation?
Thanks for looking into it. Actually, I am planning to change this patch
so that we use the upper 2 bits of the size field, instead of storing
the AM OID, for the built-in compression methods: 00 = pglz, 01 = zlib,
10 = other built-in, 11 = custom compression method. Only when the bits
are 11 will we store the AM OID in the toast header. I think I will
make these changes and post an updated patch in a week or two.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com