On Mon, Mar 15, 2021 at 8:14 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
> In the attached patches I have changed this, ...
OK, so just looking over this patch series, here's what I think:

- 0001 and 0002 are now somewhat independent of the rest of this work, and could be dropped, but I think they're a good idea, so I'd like to commit them. I went over 0001 carefully this morning and didn't find any problems. I still need to do some more review of 0002.

- 0003 through 0005 are the core of this patch set. I'd like to get them into this release, but I think we're likely to run out of time.

- I don't think we want to proceed with 0006 at this time. It needs broader buy-in, I think, and I think it also needs some other improvements, at the least to the comments.

- 0007 is not intended for commit, but just exists to fool the CommitFest bot into testing the feature.

Regarding 0003:

The biggest thing that jumps out at me while looking at this with fresh eyes is that the patch doesn't touch varatt_external.va_extsize at all. In a varatt_external, we can't use va_rawsize to indicate the compression method, because there are no bits free: the 2 bits not required to store the size are used to indicate what type of varlena we've got. But the size of a varlena is limited to 1GB, so there are 2 bits free in varatt_external.va_extsize, just like there are in va_compressed.va_rawsize. We could store the same two bits in varatt_external.va_extsize that we're storing in va_compressed.va_rawsize, a.k.a. va_tcinfo. That's a big deal, because then toast_get_compression_method() doesn't have to call toast_fetch_datum_slice() any more, which is a rather large savings. If it only impacted pg_column_compression() then whatever, but that's not the case: we've got calls to CompareCompressionMethodAndDecompress() in places like intorel_receive() and ExecModifyTable() that look pretty performance-critical.

I think that CompareCompressionMethodAndDecompress() could be redesigned to be more efficient by moving more of the per-tuple work into a separate setup phase.
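To make the bit-stealing idea concrete, here's a minimal sketch (all names invented here, not taken from the patch) of how the two free bits in a 1GB-capped 32-bit size field could carry a compression method, mirroring what va_tcinfo does for inline-compressed datums:

```c
#include <stdint.h>

/*
 * Sketch only: external TOAST sizes are capped at 1GB (2^30), so the top
 * two bits of a 32-bit va_extsize-style field are free to hold a 2-bit
 * compression method identifier.
 */
#define EXTSIZE_BITS 30
#define EXTSIZE_MASK ((UINT32_C(1) << EXTSIZE_BITS) - 1)

static inline uint32_t
extsize_pack(uint32_t extsize, uint32_t cmethod)
{
	/* cmethod must fit in 2 bits; extsize must be < 1GB */
	return (cmethod << EXTSIZE_BITS) | (extsize & EXTSIZE_MASK);
}

static inline uint32_t
extsize_size(uint32_t packed)
{
	return packed & EXTSIZE_MASK;
}

static inline uint32_t
extsize_cmethod(uint32_t packed)
{
	return packed >> EXTSIZE_BITS;
}
```

With the method recoverable from the TOAST pointer itself, toast_get_compression_method() could answer without fetching any chunk data.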
Consider a case where the tuple has 300 columns. 299 of them are fixed-width, but column 100 is a varlena. In an ideal world, we would do slot_getsomeattrs(slot, 100). Then we would check whether the column is null, whether it is compressed, and whether the compression method is the one we want. If recompression is required, then we must slot_getallattrs(slot), memcpy all the values to the virtual slot created for this purpose, and decompress and recompress column 100. But if column 100 does not need recompression, then we need not ever deform beyond column 100, and in no case do we need to iterate over all 300 attributes. The current code, however, will do just that: for every new tuple, it loops again over every attribute and re-discovers which ones are varlenas. That's kinda the pits.

I got thinking about this after looking at ExecFilterJunk(). That function is extremely simple precisely because all the work of figuring out what should be done has been precomputed. All the smarts are in cleanmap[], which is set up before we actually begin execution.

In a similar way, you can imagine creating some sort of object, let's call it a CompressedAttributeFilter, that figures out from the tupledesc which columns we need to consider recompressing and puts them in an array. Then you have a struct that stores a pointer to the array, the number of elements in the array, and the value of the last array element. You pass this struct to what is now CompareCompressionMethodAndDecompress() and it can now run more like what I described above.

It's possible to imagine doing even better. Imagine that for every column we maintain an attcompression value and an attpreservecompression value. The former indicates the compression type for the column, or '\0' if it cannot be compressed, and the latter indicates whether any other compression type might be present.
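A precomputed filter along these lines might look roughly like the following self-contained sketch, using a simplified stand-in for the tupledesc (an attlen array where -1 marks a varlena); the names are hypothetical, not from the patch:

```c
#include <stdlib.h>

/*
 * Sketch only: compute once, at setup time, which attributes are varlenas,
 * so the per-tuple path iterates only over those instead of re-discovering
 * them for every tuple.
 */
typedef struct CompressedAttributeFilter
{
	int	   *attnums;	/* indexes of varlena columns, ascending */
	int		natts;		/* number of entries in attnums */
	int		maxattnum;	/* last entry: how far we must deform at most */
} CompressedAttributeFilter;

static CompressedAttributeFilter *
build_compressed_attribute_filter(const int *attlen, int natts)
{
	CompressedAttributeFilter *f = malloc(sizeof(CompressedAttributeFilter));

	f->attnums = malloc(natts * sizeof(int));
	f->natts = 0;
	f->maxattnum = 0;
	for (int i = 0; i < natts; i++)
	{
		if (attlen[i] == -1)	/* varlena */
		{
			f->attnums[f->natts++] = i;
			f->maxattnum = i;
		}
	}
	return f;
}
```

The per-tuple code would then deform only through maxattnum and walk attnums, matching the 300-column example above: a single varlena at column 100 means deforming through column 100, not iterating over all 300 attributes.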
Then, when we build the CompressedAttributeFilter object, we can exclude varlena attributes if attpreservecompression is false and attcompression is '\0' or matches the attcompression value for the corresponding attribute in the table into which ExecModifyTable() or intorel_receive() will be putting the tuple. This seems quite complex in terms of bookkeeping, but it would allow us to elide basically all of the per-tuple bookkeeping in a lot of common cases, such as UPDATE, or an INSERT or CTAS into a table that's using the same compression method as the source data. You could probably contrive it so that you have a CompressedAttributeFilter pointer that's NULL if no such treatment is required, just like we already do for junkfilter.

There's another, rather brute-force approach to this problem, too. We could just decide that lz4 will only be used for external data, and that there's no such thing as an inline-compressed lz4 varlena. toast_fetch_datum() would just notice that the value is lz4'd and de-lz4 it before returning it, since an inline-compressed lz4 datum would be impossible.

I'm open to being convinced that we don't need to do either of these things, and that the cost of iterating over all varlenas in the tuple is not so bad as to preclude doing things as you have them here. But, I'm afraid it's going to be too expensive.

--
Robert Haas
EDB: http://www.enterprisedb.com