Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-02-28 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi

I am glad to find this patch here because it helps with my current development
work, which involves a lot of debugging of WAL records. This patch definitely
makes that much easier than grepping externally.

I have tried all of the new options added by the patch and every combination
seems to produce the correct result.

The only comment I would have is in the documentation, where I would replace:
"Display only records touching the given block" with "Display only records 
associated with the given block" 
"Display only records touching the given relation" with " Display only records 
associated with the given relation"

just to make it sound more formal. :)

best

Cary Huang
--
HighGo Software Canada
www.highgo.ca

Re: pg_rewind copies

2020-12-15 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hello

The patch seems to work as described, and the regression and TAP tests are passing. One small comment on the following new comment:
+   /*
+    * A local source is not expected to change while we're rewinding, so check
+    * that we size of the file matches our earlier expectation.
+    */
Is this a typo?

thanks
Cary

Re: Fix typo about WalSndPrepareWrite

2021-02-17 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi,

thanks for the patch; however, I don't think it is a typo. The comment describes
a technique that is used throughout many functions in walsender.c, which is to
fill in the send timestamp as late as possible so it is more accurate. This is
done by first reserving a spot in the write stream, writing the actual data, and
then finally writing the current timestamp into the reserved spot. This technique
is used in WalSndWriteData() and also XLogSendPhysical()... so it really
doesn't matter which function name is put in the comment.
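
For illustration, here is a simplified sketch of that pattern (not the exact walsender.c code); assume "out" and "tmpbuf" are StringInfo buffers and "start_lsn" / "end_lsn" are already known:

    /* prepare phase: reserve a spot for the timestamp with a zero placeholder */
    pq_sendbyte(out, 'w');          /* WAL data message */
    pq_sendint64(out, start_lsn);   /* dataStart */
    pq_sendint64(out, end_lsn);     /* walEnd */
    pq_sendint64(out, 0);           /* send time, placeholder for now */

    /* ... append the actual WAL payload to "out" ... */

    /* send phase: fill the reserved spot right before flushing the message */
    resetStringInfo(&tmpbuf);
    pq_sendint64(&tmpbuf, GetCurrentTimestamp());
    memcpy(&out->data[1 + sizeof(int64) + sizeof(int64)],
           tmpbuf.data, sizeof(int64));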

thank you!

---
Cary Huang
HighGo Software (Canada)

Re: [PATCH] Add --create-only option to pg_dump/pg_dumpall

2021-03-29 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hi

I have tried the patch, and the new option is able to restrict the contents of the
pg_dump output to only the CREATE DATABASE related commands. I also agree that
the option name is a little misleading to the user, so instead of "create-only" I
would suggest something like "createdb-only", because this option only applies to
CREATE DATABASE related commands, not CREATE TABLE or other objects. In the help
output, you can then elaborate that this option "dumps only the commands related
to creating the database, such as ALTER, GRANT, etc.".

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca

Re: Terminate the idle sessions

2020-08-10 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

I applied this patch to the PG13 branch and generally this feature works as
described. The new "idle_session_timeout" parameter that controls the idle
session disconnection is not in the default postgresql.conf, and I think it
should be included there with a default value of 0, which means disabled.
There is currently no enforced minimum value for "idle_session_timeout" (except
for the value 0 that disables the feature), so a user can put in any value larger
than 0, and it could be very small, like 500 or even 50 milliseconds. This would
make any psql connection disconnect shortly after it has connected, which may not
be ideal. Many systems I have worked with have a 30-minute inactivity timeout by
default, and I think it would be better and safer to enforce a reasonable
minimum timeout value.

Re: Any objections to implementing LogicalDecodeMessageCB for pgoutput?

2020-09-04 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi

I have tried the patch and it functions as described. The attached TAP test
case is comprehensive and passes. However, the patch does not apply cleanly to
the current master; I had to check out a much earlier commit to be able to apply
it correctly. The patch will need to be rebased onto the current master.

Thanks

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca

Re: [PATCH] Proof of concept for GUC improvements

2021-10-15 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hi

I quite like the feature this patch provides; it makes special int values more
meaningful to users. I like that you added a configuration to choose whether the
original int value or the special value is output, as this adds some flexibility
in case returning the special value breaks some old applications. I scanned
through the GUC list and found that the following parameters can potentially be
categorized in the "special_disabled0" group,
just for your reference.

max_prepared_transactions
vacuum_cost_delay
effective_io_concurrency
max_parallel_workers
max_parallel_maintenance_workers
max_parallel_workers_per_gather
max_logical_replication_workers
max_sync_workers_per_subscription
jit_above_cost
jit_inline_above_cost
jit_optimize_above_cost
log_rotation_age
log_rotation_size
log_transaction_sample_rate

Cary Huang
-
HighGo Software Canada
www.highgo.ca

Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2021-11-29 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hi

The patch applies and tests fine. I don't think there is any harm in having a
little extra statistical information about the checkpoint process. In fact, it
could be useful in identifying a bottleneck during the checkpoint process, as the
stats show exactly the time taken to do the file IO in the pg_logical dir.

best

Cary Huang

Re: Setting min/max TLS protocol in clientside libpq

2020-01-02 Thread cary huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            tested, passed

Hello

I have applied the patch and did some basic testing with various combinations of
min and max TLS protocol versions. Overall the functionality works, and the
chosen TLS protocol aligns with the min/max TLS protocol settings on the PG
server side.

I agree with Arthur that it makes sense to check the validity of
"conn->sslmaxprotocolversion" first, before checking whether it is larger than
"conn->sslminprotocolversion".

A small suggestion here. I see that the PG server defaults the TLS min version to
TLSv1.2 and the max version to none, so by default the server accepts TLSv1.2 and
above. I think it also makes sense for the client side to have the same defaults
as the server. In the patch, if the client does not supply
"sslminprotocolversion", it will run into the "else" statement and set the TLS min
version to "INT_MIN", which is a huge negative number and of course OpenSSL
won't set it. I think this else statement can be enhanced a little to set
"sslminprotocolversion" to TLSv1.2 by default to match the server, and to emit
a warning message that the TLS minimum has defaulted to TLSv1.2.
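
For illustration only, a minimal sketch of what that else branch could do, assuming OpenSSL 1.1.0+ where SSL_CTX_set_min_proto_version() is available ("ssl_ctx" here is just an illustrative variable name, not necessarily the one used in fe-secure-openssl.c):

    else
    {
        /*
         * No sslminprotocolversion supplied: default to TLSv1.2 to match the
         * server-side default, instead of effectively setting no minimum.
         */
        if (SSL_CTX_set_min_proto_version(ssl_ctx, TLS1_2_VERSION) != 1)
            return -1;      /* report "could not set minimum TLS protocol version" */
    }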

Cary
HighGo Software Canada

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2020-01-03 Thread cary huang
Hello Sawada and all

I would like to elaborate more on Sehrope and Sawada's discussion about passing a
NULL IV to the "pg_cipher_encrypt/decrypt" functions during the kmgr_wrap_key and
kmgr_unwrap_key routines in kmgr_utils.c. OpenSSL implements key wrap according
to RFC 3394, as Sawada mentioned, and passing NULL makes OpenSSL use the default
IV, which equals A6A6A6A6A6A6A6A6. I have confirmed this on my side; a key
wrapped with a NULL IV can be unwrapped successfully with IV=A6A6A6A6A6A6A6A6,
and unwrap will fail if the IV is set to anything other than NULL or
A6A6A6A6A6A6A6A6.

I would like to provide some comments on the encryption and decryption routines
provided by cipher_openssl.c, which cipher.c and kmgr_utils.c are using. I see
that "ossl_cipher_encrypt" calls only "EVP_EncryptInit_ex" and
"EVP_EncryptUpdate" to complete the encryption. The same applies to the
decryption routines. From my past experience with OpenSSL and the usage examples
online, it is highly recommended to use the "init-update-final" cycle to complete
the encryption, and I see that the "final" part (EVP_EncryptFinal) is missing.
This call properly handles the last block of data, especially when padding is
taken into account. The functions still work now because the input is an
encryption key whose size is a multiple of the cipher block size and no padding
is used. I think it would be safer to use the proper "init-update-final" cycle
for encryption/decryption.

According to the OpenSSL EVP documentation, "EVP_EncryptUpdate" can be called
multiple times at different offsets into the input data to be encrypted. I see
that "pg_cipher_encrypt" calls "EVP_EncryptUpdate" only once, which makes me
assume that the application is expected to invoke "pg_cipher_encrypt" multiple
times until the entire data block is encrypted? I am asking because, if we were
to use "EVP_EncryptFinal" to complete the encryption cycle, it would be better to
let "pg_cipher_encrypt" figure out how many times "EVP_EncryptUpdate" should be
called and finalize it with "EVP_EncryptFinal" on the last block.

Lastly, I think we are missing a cleanup routine that calls 
"EVP_CIPHER_CTX_free()" to free up the EVP_CIPHER_CTX when encryption is done. 

Thank you

Cary Huang
HighGo Software Canada

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2020-01-09 Thread cary huang
On Mon, Jan 6, 2020 at 4:43 AM Masahiko Sawada <masahiko.saw...@2ndquadrant.com> wrote:

> On Sat, 4 Jan 2020 at 15:11, cary huang  wrote:
> >>
> >> Hello Sawada and all
> >>
> >> I would like to elaborate more on Sehrope and Sawada's discussion on
> passing NULL IV in "pg_cipher_encrypt/decrypt" functions during
> kmgr_wrap_key and kmgr_unwrap_key routines in kmgr_utils.c. Openssl
> implements key wrap according to RFC3394 as Sawada mentioned and passing
> NULL will make openssl to use default IV, which equals to A6A6A6A6A6A6A6A6.
> I have confirmed this on my side; A key wrapped with "NULL" IV can be
> unwrapped successfully with IV=A6A6A6A6A6A6A6A6, and unwrap will fail if IV
> is set to anything else other than NULL or A6A6A6A6A6A6A6A6.
> >>
>
> >Sehrope also suggested me not to use the fixed IV in order to avoid
> >getting the same result from the same value. I'm researching it now.
> >Also, currently it's using key wrap algorithm[1] but it accepts only
> >multiple of 8 bytes as input. Since it's not good for some cases it's
> >better to use key wrap with padding algorithm[2] instead, which seems
> >available in OpenSSL 1.1.0 or later.
>

Since the current kmgr only supports AES128 and AES256, the master key
generated by kmgr during bootstrap will always be a multiple of 8 bytes. I
believe AES keys in general must always be a multiple of 8 bytes. I have not done
enough research into which encryption algorithms involve keys that are not a
multiple of 8, so I think the current key wrap algorithm is fine. With the key
wrap algorithm defined in RFC 3394, the IV is used only as an "initial value"
and it has to be static; either we randomize one or we use the default
A6A6A6... by passing NULL. This is different from the CTR block cipher mode
which has been selected to encrypt WAL and buffers. In CTR mode, each block
requires a different and unique IV as input in order to be secure, and we
have agreed to use segment IDs as IVs. For this reason, I think the current
key wrap implementation is fine. The least we can do is to generate an IV
during bootstrap and store it in the control file, and this generated IV will
be used for all key wrapping / unwrapping purposes instead of using the
default one.

Best regards
Cary Huang
HighGo Software Canada


Re: [DOC] Add detail regarding resource consumption wrt max_connections

2024-01-12 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

I think it is good to warn the user about the increased memory allocation for
certain parameters so that they do not abuse them by setting them to a huge
number without knowing the consequences.

It is true that max_connections can increase the size of the proc array and other
resources, which are allocated in shared memory, which also means less shared
memory to perform regular data operations. I am sure this is not the
only parameter that affects the memory allocation. "max_prepared_xacts" can
also affect the shared memory allocation, so the same warning message
applies there as well. Maybe there are other parameters with similar effects.

Instead of stating that a higher max_connections results in a higher allocation,
it may be better to tell the user that if the value needs to be set much higher,
they should consider increasing the "shared_buffers" setting as well.

thank you

---
Cary Huang
Highgo Software Canada
www.highgo.ca

Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-02-24 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Hi

the patch applies fine on the current master branch and it works as described.
However, I would suggest changing the new option name from "--with-childs" to
"--with-partitions" for several reasons.

"childs" is grammatically incorrect, and in the PG community the term
"partitioned table" is normally used to denote a parent table while the term
"partition" is used to denote a child table under the parent table. We should
use these terms to stay consistent with the community.

Also, I would rephrase the documentation as:

Used in conjunction with -t/--table or 
-T/--exclude-table options to include or 
exclude partitions of the specified tables if any.

thank you

Cary Huang

HighGo Software Canada
www.highgo.ca

Re: pgbench - adding pl/pgsql versions of tests

2023-03-24 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Hi

thank you for the patch. It applies to the current master branch and compiles
fine.

The feature works as described; I am able to run the plpgsql-tpcb-like and
plpgsql-simple-update scripts that you have added.

However, I am not sure about the purpose of --no-function, which prevents the
creation of the pl/pgsql functions even though the new plpgsql test scripts need
them.

I initialized the pgbench database with --no-function, and the plpgsql-tpcb-like
and plpgsql-simple-update scripts then fail to run.

thanks

Cary Huang
===
Highgo Software Canada
www.highgo.ca

Re: sslinfo extension - add notbefore and notafter timestamps

2024-03-18 Thread Cary Huang
Hi Jacob

> Hi Cary, did you have any thoughts on the timestamptz notes from my last mail?
> 
> > It might also be nice to rename
> > ASN1_TIME_to_timestamp().
> >
> > Squinting further at the server backend implementation, should that
> > also be using TimestampTz throughout, instead of Timestamp? It all
> > goes through float8_timestamptz at the end, so I guess it shouldn't
> > have a material impact, but it's a bit confusing.

Sorry, I kind of missed this review comment from your last email. Thanks for
bringing it up again, though. I think it is right to change the backend
references of "timestamp" to "timestampTz" for consistency reasons, and I have
gone ahead and made the changes.

I have also reviewed the wording on the documentation and removed "UTC" from 
the descriptions. Since sslinfo extension and pg_stat_ssl both return 
timestampTz in whatever timezone PostgreSQL is running on, they do not always 
return UTC timestamps.

Attached is the v10 patch with the above changes. Thanks again for the review.

Best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v10-0001-Add-notBefore-and-notAfter-to-SSL-cert-info-displ.patch
Description: Binary data


Re: sslinfo extension - add notbefore and notafter timestamps

2024-03-19 Thread Cary Huang
Thank you for your review again. 

 > ...but I think Timestamp[Tz]s are stored as microseconds, so we're off
 > by a factor of a million. This still works because later we cast to
 > double and pass it back through float8_timestamptz, which converts it:

In my test, if I make ASN1_TIME_to_timestamptz return microseconds, the
float8_timestamptz() function raises a "timestamp out of range" error, because
that function treats its input as seconds.

> But anyone who ends up inspecting the value of
 > st_sslstatus->ssl_not_before directly is going to find an incorrect
 > timestamp. I think it'd be clearer to store microseconds to begin
 > with, and then just use TimestampTzGetDatum rather than the
 > DirectFunctionCall1 we have now. 

I have also tried TimestampTzGetDatum with ASN1_TIME_to_timestamptz outputting
microseconds. No error is raised, but pg_stat_ssl displays incorrect results:
the timestamps are far too large because the extra factor of one million is
carried into the timestamp computation as well.

The comments for TimestampTz says:

 * Timestamps, as well as the h/m/s fields of intervals, are stored as
 * int64 values with units of microseconds.  (Once upon a time they were
 * double values with units of seconds.)

but it seems to me that many of the timestamp-related functions still treat
timestamp or timestampTz as "double values with units of seconds".
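
For what it is worth, a hedged sketch of converting a Unix epoch time directly into a TimestampTz in microseconds (so that TimestampTzGetDatum could be used without the float8_timestamptz round trip); the helper name is hypothetical, and it assumes a backend compilation unit where postgres.h and datatype/timestamp.h are available:

    /* hypothetical helper: convert a Unix time_t into a TimestampTz (microseconds) */
    static TimestampTz
    unix_time_to_timestamptz(time_t t)
    {
        /* shift the epoch from 1970-01-01 to 2000-01-01, then scale to microseconds */
        return (TimestampTz) (t - (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY)
            * USECS_PER_SEC;
    }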

Best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca





Re: Add last_commit_lsn to pg_stat_database

2024-04-05 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Hello

I think it is convenient to know the last commit LSN of a database for
troubleshooting and monitoring purposes, similar to the "pd_lsn" field in a page
header that records the last LSN that modified the page. I am not sure if it can
help determine the WAL location to resume / catch up in logical replication, as
the "confirmed_flush_lsn" and "restart_lsn" in a logical replication slot are
already there to figure out the resume / catchup point as I understand it.
Anyhow, I had a look at the patch and in general it looks good to me. I also ran
it against the CI bot, which turned out fine. Just one small comment: the patch
supports recording the last commit LSN from two-phase commit as well, but the
tests do not seem to cover two-phase commit. In my opinion, it should test
whether the last commit LSN increments when a prepared transaction is committed,
in addition to a regular transaction.

thank you
-
Cary Huang
www.highgo.ca

Re: Encoding protection for pgcrypto

2024-02-09 Thread cary huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

Hello

I had a look at your patch, which applies fine to PostgreSQL master. I noticed
that the new regression tests you have added to test UTF-8 encoding fail on my
setup (make check) with the following diffs:

---
@@ -13,6 +13,7 @@
 =ivrD
 -END PGP MESSAGE-
 '), '0123456789abcdefghij'), 'sha1');
+ERROR:  invalid byte sequence for encoding "UTF8": 0x91
 -- Database encoding protection.  Ciphertext source:
 -- printf '\xe0\xe0\xbff' | gpg --batch --passphrase mykey --textmode --armor 
--symmetric
 select pgp_sym_decrypt(dearmor('
@@ -23,5 +24,5 @@
 =QKy4
 -END PGP MESSAGE-
 '), 'mykey', 'debug=1');
+ERROR:  invalid byte sequence for encoding "UTF8": 0xe0 0xe0 0xbf
 \quit
-\endif
---

I am not sure, but it seems that you intentionally provide a text input that
produces a non-UTF-8-compliant decrypted output, which triggers the error from
the "pg_verifymbstr()" call that you have added in pgp-pgsql.c? Are these errors
expected in your new test case? If so, then the tests should pass instead,
because the invalid encoding in the decrypted output has been caught as intended.

Generally, I am OK with the extra encoding check after text decryption, but I do
not think it is a good idea to just error out and abort the transaction when an
invalid encoding is detected. Text decryption routines may be used quite
frequently, and users generally do not expect them to abort the transaction.
It may be OK to just give them a warning about the invalid character encoding.
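
As a rough illustration of that suggestion, the check could call pg_verifymbstr() in its non-erroring mode and raise a WARNING instead; a minimal sketch, where "res" and "reslen" stand for the decrypted buffer and its length (illustrative names, not necessarily the patch's):

    /* verify the encoding without erroring out; warn the user on failure */
    if (!pg_verifymbstr(res, reslen, true))
        ereport(WARNING,
                (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
                 errmsg("decrypted text contains characters invalid in database encoding \"%s\"",
                        GetDatabaseEncodingName())));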

thanks

Cary Huang
Highgo Software - Canada
www.highgo.ca

[Patch] add multiple client certificate selection feature

2024-02-12 Thread Cary Huang
Hello 



I would like to share a patch that adds a feature to libpq to automatically
select the best client certificate to send to the server (if it requests one).
This feature is inspired by this email discussion from years ago:
https://www.postgresql.org/message-id/200905081539.n48Fdl2Y003286%40no.baka.org.
This feature is useful if a libpq client needs to communicate with multiple
TLS-enabled PostgreSQL servers with different TLS certificate setups. Instead
of letting the application figure out the right certificate for the right
server, the patch allows the libpq library itself to pick the most suitable
client certificate to send to the server.



Currently, we rely on the “sslcert” and “sslkey” parameters on the client side
to select a client certificate + private key to send to the server. The patch
adds 2 new options, “sslcertdir” and “sslkeydir”, to specify directories where
all possible certificate and private key files are stored. The new options
cannot be used together with “sslcert” and “sslkey” at the same time.



The most suitable certificate selection is based on the trusted CA names sent by
the server in the “Certificate Request” handshake message, obtained by the client
through a call to the “SSL_get0_peer_CA_list()” function. This list of trusted CA
names tells the client the list of “issuers” that this server can trust. Inside
“sslcertdir”, if a candidate client certificate's issuer name equals one of the
trusted CA names, then that is the certificate to use. Once a candidate
certificate is identified, the patch will then look for a matching private key
in “sslkeydir”. These actions are performed in the certificate callback function
(cert_cb), which gets called when the server requests a client certificate during
the TLS handshake.
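
To give an idea of the mechanism, here is a minimal, hypothetical sketch of such a certificate callback, assuming OpenSSL 1.1.1+ (the candidate arrays and their loading from "sslcertdir" / "sslkeydir" are assumed to exist elsewhere; none of this is the patch's actual code):

    #include <openssl/ssl.h>
    #include <openssl/x509.h>

    /* assumed to be populated from sslcertdir / sslkeydir ahead of time */
    extern X509       *candidates[];
    extern EVP_PKEY   *keys[];
    extern int         ncandidates;

    static int
    select_cert_cb(SSL *ssl, void *arg)
    {
        const STACK_OF(X509_NAME) *ca_names = SSL_get0_peer_CA_list(ssl);
        int         i, j;

        if (ca_names == NULL)
            return 1;               /* no CA list sent; proceed without a match */

        for (i = 0; i < ncandidates; i++)
        {
            X509_NAME  *issuer = X509_get_issuer_name(candidates[i]);

            for (j = 0; j < sk_X509_NAME_num(ca_names); j++)
            {
                if (X509_NAME_cmp(issuer, sk_X509_NAME_value(ca_names, j)) == 0)
                {
                    /* found a certificate issued by a CA the server trusts */
                    if (SSL_use_certificate(ssl, candidates[i]) == 1 &&
                        SSL_use_PrivateKey(ssl, keys[i]) == 1)
                        return 1;
                    return 0;       /* fatal error while setting cert/key */
                }
            }
        }
        return 1;                   /* no match; continue without a client cert */
    }

    /* registered once on the context: SSL_CTX_set_cert_cb(ctx, select_cert_cb, NULL); */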



This patch requires OpenSSL version 1.1.1 or later to work. The feature will be 
disabled with older OpenSSL versions. Attached is a POC patch containing the 
described feature.



Limitations:



One limitation of this feature is that it does not quite support the case where 
multiple private key files inside “sslkeydir” are encrypted with different 
passwords. When the client wants to find a matching private key from 
“sslkeydir”, it will always use the same password supplied by the client (via 
“sslpassword” option) to decrypt the private key it tries to access.





Also, no tap tests have been added to the patch to test this feature yet. So, 
to test this feature, we will need to prepare the environment manually:



1. generate 2 root CA certificates (ca1 and ca2), which sign 2 sets of client 
and server certificates.

2. configure the server to use a server certificate signed by either ca1 or ca2.

3. put all client certificates and private keys (signed by both ca1 and ca2)
into a directory (we will point "sslcertdir" and "sslkeydir" to this directory)

4. based on the root CA certificate configured at the server side, the client 
will pick the certificate that the server can trust from specified "sslcertdir" 
and "sslkeydir" directories



Please let me know what you think. Any comments / feedback are greatly 
appreciated.



Best regards




Cary Huang

Highgo Software (Canada)

www.highgo.ca

v1-0001-multiple_client_certificate_selection_support.patch
Description: Binary data


Re: [Patch] add multiple client certificate selection feature

2024-03-01 Thread Cary Huang
Hello

I would like to share a version 2 patch for the multiple client certificate
selection feature with several enhancements over v1. I removed the extra
parameters "sslcertdir" and "sslkeydir". Instead, I reuse the existing sslcert,
sslkey and sslpassword parameters but allow multiple entries to be supplied,
separated by commas. This way, we are able to use a different sslpassword to
decrypt different sslkey files based on the selected certificate. This was not
possible in v1.

When a client is doing a TLS handshake with a server that requires a client
certificate, the client will obtain a list of trusted CA names from the server
and try to match it against the list of certificates provided via the sslcert
option. A client certificate is chosen if its issuer matches one of the server’s
trusted CA names. Once a certificate is chosen, the corresponding private key
and sslpassword (if required) will be used to establish a secured TLS
connection.

The feature is useful when a libpq client needs to communicate with multiple
TLS-enabled PostgreSQL server instances with different TLS certificate setups.
Instead of letting the application figure out what certificate to send to which
server, we can configure all possible certificate candidates in libpq and have
it choose the best one to use.

 

Hello Daniel

Sorry to bother you. I am just wondering what your opinion is about this feature.
Should this be added to the commitfest for review? This feature involves
certificates issued by different root CAs to test its ability to pick the right
certificate, so the existing SSL TAP test’s certificate generation script needs
an update to test this. I have not done so yet, because I would like to discuss
it with you first.

Any comments and recommendations are welcome. Thank you!





Best regards

Cary Huang

v2-0001-multiple_client_certificate_selection_support.patch
Description: Binary data


Re: sslinfo extension - add notbefore and notafter timestamps

2024-03-08 Thread Cary Huang
Hello

Thank you for the review and your patch. I have tested with minimum OpenSSL 
version 1.0.2 support and incorporated your changes into the v9 patch as 
attached. 

 > In my -08 timezone, the date doesn't match what's recorded either
 > (it's my "tomorrow"). I think those probably just need to be converted
 > to UTC explicitly? I've attached a sample diff on top of v8 that
 > passes tests on my machine.

Yes, I noticed this in the SSL test too. I am also in GMT-8, so for me the
tests would fail too due to the time zone difference from GMT. It should be
okay to specifically set the outputs of pg_stat_ssl, ssl_client_get_notbefore,
and ssl_client_get_notafter to be in the GMT time zone. The not-before and
not-after timestamps in a client certificate are generally expressed in GMT.


Thank you!

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v9-0001-Add-notBefore-and-notAfter-to-SSL-cert-info-displ.patch
Description: Binary data


typo in paths.h

2024-03-12 Thread Cary Huang
Hello



I noticed that the comment for declaring create_tidscan_paths() in 
src/include/optimizer/paths.h has a typo. The function is implemented in 
tidpath.c, not tidpath.h as stated, which does not exist.



Made a small patch to correct it.



Thank you





Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca

correct_source_filename.patch
Description: Binary data


Re: Patch: Global Unique Index

2023-01-12 Thread Cary Huang

On 2022-11-29 6:16 p.m., Tom Lane wrote:

Assuming that you are inserting into index X, and you've checked
index Y to find that it has no conflicts, what prevents another
backend from inserting a conflict into index Y just after you look?
AIUI the idea is to prevent that by continuing to hold an exclusive
lock on the whole index Y until you've completed the insertion.
Perhaps there's a better way to do that, but it's not what was
described.


During inserts, the global unique index patch does not acquire an exclusive
lock on the whole index Y while checking it for uniqueness; it
acquires a low-level AccessShareLock on Y and releases it after
checking. So while it is checking, another backend can still insert a
duplicate into index Y. If this happens, a "transaction level lock"
will be triggered.


For example.

Say backend A inserts into index X and checks index Y to find no
conflict, and backend B inserts a conflict into index Y right after. In
this case, backend B still has to check index X for conflicts, and it will
fetch the duplicate tuple that has been inserted by A, but it cannot
declare a duplicate error yet. This is because the transaction in backend A
that inserted this conflicting tuple is still in progress. At this
moment, backend B has to wait for backend A to commit / abort before it
can continue. This is how the "transaction level lock" prevents concurrent
insert conflicts.


There is a chance of deadlock if the conflicting insertions done by A
and B happen at roughly the same time, where both backends trigger a
"transaction level lock" to wait for each other to commit/abort. If this
happens, PG's deadlock detection code will error out one of the
backends. This should be okay, because it means one of the backends tried
to insert a conflict. The purpose of a global unique index is also to
error out backends trying to insert duplicates. In the end the effects
are the same; it's just that the error says "deadlock detected" instead of
"duplicate detected".


If backend B did not insert a conflicting tuple, no transaction lock 
wait will be triggered, and therefore no deadlock will happen.


Regards
Cary Huang
---
HighGo Software Canada







Re: Patch: Global Unique Index

2023-01-13 Thread Cary Huang

On 2022-11-30 2:30 p.m., Greg Stark wrote:

On Tue, 29 Nov 2022 at 21:16, Tom Lane  wrote:

I actually think that that problem should be soluble with a
slightly different approach.  The thing that feels insoluble
is that you can't do this without acquiring sufficient locks
to prevent addition of new partitions while the insertion is
in progress.  That will be expensive in itself, and it will
turn ATTACH PARTITION into a performance disaster.

I think there`s a lot of room to manoeuvre here. This is a new feature
that doesn't need to be 100% complete or satisfy any existing
standard. There are lots of options for compromises that leave room
for future improvements.

1) We could just say sure ATTACH is slow if you're attaching an
non-empty partition
2) We could invent a concept like convalidated and let people attach a
partition without validating the uniqueness and then validate it later
concurrently
3) We could say ATTACH doesn't work now and come up with a better
strategy in the future

Also, don't I vaguely recall something in exclusion constraints about
having some kind of in-memory "intent" list where you declared that
you're about to insert a value, you validate it doesn't violate the
constraint and then you're free to insert it because anyone else will
see your intent in memory? There might be a need for some kind of
global object that only holds inserted keys long enough that other
sessions are guaranteed to see the key in the correct index. And that
could maybe even be in memory rather than on disk.

This isn't a simple project but I don't think it's impossible as long
as we keep an open mind about the requirements.


In the current global unique index implementation, ATTACH can be slow if
there are concurrent inserts happening. ATTACH tries to acquire a
ShareLock on all existing partitions and the partition-to-be before it scans
and sorts them for the uniqueness check. It releases them only after all
partitions have been checked. If there are concurrent inserts, ATTACH
has to wait for all inserts to complete. Likewise, if ATTACH is in
progress, inserts have to wait as well. This is an issue now.


If we were to make ATTACH acquire a lower-level lock (AccessShareLock),
scan a partition, and then release it, there would be nothing stopping a
concurrent insert from inserting a conflict right after ATTACH finishes
checking that partition. This is another issue. There is no transaction level
lock being triggered here like in the multiple concurrent inserts case.


Another email thread called "create index concurrently on partitioned
index" discusses some approaches that may be used to solve the attach
issue here, basically to allow ATTACH PARTITION CONCURRENTLY...



regards

Cary Huang
-
HighGo Software Canada










Re: pg_recvlogical prints bogus error when interrupted

2023-04-06 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hello

The patch applies and tests fine. I like the way both ready_to_exit and
time_to_abort variables are used to control the exit sequence. I think the
(void) casts in front of PQputCopyEnd() and PQflush() can be removed for
consistency, as leaving them out does not give warnings and such casts are not
used elsewhere.

thank you
Cary

Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-05-05 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, failed
Spec compliant:           not tested
Documentation:            not tested

Hello

This is one of those features that is beneficial to a handful of people in
specific test cases. It may not benefit the majority of users, but it is
certainly not useless either. As long as it can be disabled and enough tests
have been run to ensure it won't have a significant impact on working
components while disabled, it should be fine in my opinion. Regarding where the
selected port shall be saved (postmaster.pid, read by pg_ctl, or saved in a
dedicated file), I see that postmaster.pid already contains a port number on
line 4, so writing the port number there is nothing new; we can simply replace
the configured port number with the one selected by the system.

I applied and tested the patch and found that the system can indeed start when
port is set to 0, but line 4 of postmaster.pid does not store the port number
selected by the system; rather, it stores 0, which is the same as configured.
So I am actually not able to find out the port number that my PG is running on,
at least not in a straightforward way.

thank you
==
Cary Huang
HighGo Software
www.highgo.ca

Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory

2022-03-31 Thread Cary Huang
Hello

pg_receivewal creates this .partial WAL file during WAL streaming and already
treats this file as a temporary file. It will fill this .partial file with
zeroes up to 16777216 bytes (16 MB) by default before streaming real WAL data
into it.

If your .partial file is only 8396800 bytes, then this could mean that
pg_receivewal was terminated abruptly while it was appending zeroes, or that
your system ran out of disk space. Do you have any error message?

If this is the case, the uninitialized .partial file should still be all zeroes,
so it should be okay to delete it and have pg_receivewal recreate a new .partial
file.

Also, in your patch, you are using the pad_to_size argument of the
dir_open_for_write function to determine if it needs to create a temp file, but
I see that this function is always given pad_to_size = 16777216, and never 0.
Am I missing something?
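
As a side note, below is a rough, hypothetical sketch of that zero-padding step (not pg_receivewal's actual dir_open_for_write() code); it also shows why an abrupt termination or a full disk leaves a short, all-zero .partial file:

    #include <unistd.h>

    #define WAL_SEGMENT_SIZE (16 * 1024 * 1024)     /* 16777216 bytes by default */

    static int
    pad_partial_file(int fd)
    {
        char        zerobuf[8192] = {0};
        size_t      written = 0;

        while (written < WAL_SEGMENT_SIZE)
        {
            ssize_t     rc = write(fd, zerobuf, sizeof(zerobuf));

            if (rc < 0)
                return -1;          /* e.g. out of disk space: the file stays short */
            written += rc;
        }
        /* an fsync() here would ensure the padding survives a crash */
        return 0;
    }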

Cary Huang
===
HighGo Software Canada

Re: Add support for AT LOCAL

2023-09-22 Thread cary huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hello

I think this feature can be a useful addition in dealing with time zones. I have
applied and tried out the patch; the feature works as described and seems
promising. The compilation failure was probably reported by Cirrus CI when
compiling on different platforms. I have run the latest patch in my own Cirrus
CI environment and everything checked out fine.

Thank you

Cary Huang
--
Highgo Software Canada
www.highgo.ca

Include sequence relation support in logical replication

2020-03-24 Thread Cary Huang
Hi



From the PG logical replication documentation, I see that there is a listed
limitation that sequence relations are not replicated logically. After some
examination, I see that retrieving the next value from a sequence using the
nextval() call emits a WAL update only every 32 calls to nextval(). In fact,
when it emits a WAL update, it writes a future value 32 increments from now and
maintains an internal cache for delivering sequence numbers. It is done this way
to minimize writes to the WAL at the risk of losing some values during a crash.
So if we were to replicate the sequence, the subscriber would receive a future
value (32 calls to nextval() from now), which obviously does not reflect the
current status. Sequence changes caused by other sequence-related SQL functions
like setval() or ALTER SEQUENCE xxx will always emit a WAL update, so
replicating changes caused by these should not be a problem.



I have shared a patch that allows sequence relations to be supported in logical
replication via the decoding plugin ( test_decoding for example ); it does not
support sequence relations in logical replication between a PG publisher and a
PG subscriber via the pgoutput plugin, as that will require much more work. For
the replication to make sense, the patch actually disables the WAL update at
every 32 nextval() calls, so every call to nextval() will emit a WAL update for
proper replication. This is done by setting SEQ_LOG_VALS to 0 in sequence.c



I think the question is: should we minimize the WAL update frequency (every 32
calls) for getting the next value in a sequence at the cost of losing values
during a crash, or be able to replicate a sequence relation properly at the cost
of more WAL updates?





Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca

sequence_replication.patch
Description: Binary data


Re: Include sequence relation support in logical replication

2020-03-26 Thread Cary Huang
Hi Andres



thanks for your reply and your patch review. Please see my comments below



>On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
>> I have shared a patch that allows sequence relation to be supported in
>> logical replication via the decoding plugin ( test_decoding for
>> example ); it does not support sequence relation in logical
>> replication between a PG publisher and a PG subscriber via pgoutput
>> plugin as it will require much more work.
>
> Could you expand on "much more work"? Once decoding support is there,
> that shouldn't be that much?

By much more work, I meant that more source files will need to be changed to
have sequence replication supported between a PG subscriber and publisher using
the pgoutput plugin; about 10 more source file changes. The idea is similar
though.




>> Sequence changes caused by other sequence-related SQL functions like
>> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so
>> replicating changes caused by these should not be a problem.
>
> I think this really would need to handle at the very least setval to
> make sense.

yes, sure



>> For the replication to make sense, the patch actually disables the WAL
>> update at every 32 nextval() calls, so every call to nextval() will
>> emit a WAL update for proper replication. This is done by setting
>> SEQ_LOG_VALS to 0 in sequence.c
>
> Why is that needed? ISTM updating in increments of 32 is fine for
> replication purposes? It's good imo, because sending out more granular
> increments would increase the size of the WAL stream?

Yes, updating the WAL only every 32 increments is good and has huge performance
benefits according to Michael, but when it is replicated logically to
subscribers, the sequence value they receive would not make much sense to them.

For example, if I have a sequence called "seq" with current value = 100 and
increment = 5, the nextval('seq') call will return 105 to the client but it will
write 260 to the WAL record ( 100 + (5*32) ), because that is the value after 32
increments. Internally it also maintains a "log_cnt" counter that tracks how
many nextval() calls have been invoked since the last WAL write, so it can kind
of derive backwards to find the proper next value to return to the client.

But the subscriber for this sequence will receive a change of 260 instead of
105, and that does not represent the current sequence status. The subscriber is
not able to derive backwards because it does not know the increment size in its
schema.

Setting SEQ_LOG_VALS to 0 in sequence.c basically disables this 32-increment
behavior and makes the WAL update happen at every nextval() call, and therefore
the subscriber to this sequence will receive the same value (105) as the client,
at the cost of more WAL writes.

I would like to ask if you have some suggestions or ideas that could let the
subscriber receive the current value without the need to disable the
32-increment behavior?



>> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
>> index 93c948856e..7a7e572d6c 100644
>> --- a/contrib/test_decoding/test_decoding.c
>> +++ b/contrib/test_decoding/test_decoding.c
>> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>>                              &change->data.tp.oldtuple->tuple,
>>                              true);
>>      break;
>> +    case REORDER_BUFFER_CHANGE_SEQUENCE:
>> +        appendStringInfoString(ctx->out, " SEQUENCE:");
>> +        if (change->data.sequence.newtuple == NULL)
>> +            appendStringInfoString(ctx->out, " (no-tuple-data)");
>> +        else
>> +            tuple_to_stringinfo(ctx->out, tupdesc,
>> +                                &change->data.sequence.newtuple->tuple,
>> +                                false);
>> +        break;
>>      default:
>>          Assert(false);
>>  }
>
> You should also add tests - the main purpose of contrib/test_decoding is
> to facilitate writing those...

thanks, I will add




>> +ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);
>
> Huh, why are you doing this? That's going to increase overhead of logical
> decoding by many times?

This is needed to allow a snapshot to be built inside the DecodeCommit() function.
Regular changes caused by

Re: Internal key management system

2020-03-30 Thread Cary Huang
Hi

I had a look on kms_v9 patch and have some comments



--> pg_upgrade.c

keys are copied correctly, but as pg_upgrade progresses further, it will try to
start the new_cluster from the "issue_warnings_and_set_wal_level()" function,
which is called after the key copy. The new cluster will fail to start due to
the mismatch between cluster_passphrase_command and the newly copied keys. This
causes pg_upgrade to always finish with a failure. We could move
"copy_master_encryption_key()" to be called after
"issue_warnings_and_set_wal_level()", and this will make pg_upgrade finish with
success, but the user will still have to manually correct the
"cluster_passphrase_command" parameter on the new cluster in order for it to
start up correctly. Should pg_upgrade also take care of copying the
"cluster_passphrase_command" parameter from the old to the new cluster after it
has copied the encryption keys, so users don't have to do this step? If the
expectation is for users to manually correct "cluster_passphrase_command" after
a successful pg_upgrade and key copy, then there should be a message to remind
the users to do so.



-->Kmgr.c 

+   /*
+    * If there is only temporary directory, it means that the previous
+    * rotation failed after wrapping the all internal keys by the new
+    * passphrase.  Therefore we use the new cluster passphrase.
+    */
+   if (stat(KMGR_DIR, &st) != 0)
+   {
+       ereport(DEBUG1,
+               (errmsg("both directories %s and %s exist, use the newly wrapped keys",
+                       KMGR_DIR, KMGR_TMP_DIR)));



I think the error message should say that only the temporary directory exists,
instead of "both directories exist".



thanks!



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca






 On Wed, 25 Mar 2020 01:51:08 -0700 Masahiko Sawada 
 wrote 



On Tue, 24 Mar 2020 at 23:15, Bruce Momjian <br...@momjian.us> wrote: 
> 
> On Tue, Mar 24, 2020 at 02:29:57PM +0900, Masahiko Sawada wrote: 
> > That seems to work fine. 
> > 
> > So we will have pg_cryptokeys within PGDATA and each key is stored 
> > into separate file named the key id such as "sql", "tde-wal" and 
> > "tde-block". I'll update the patch and post. 
> 
> Yes, that makes sense to me. 
> 
 
I've attached the updated patch. With the patch, we have three 
internal keys: SQL key, TDE-block key and TDE-wal key. Only SQL key 
can be used so far to wrap and unwrap user secret via pg_wrap and 
pg_unwrap SQL functions. Each keys is saved to the single file located 
at pg_cryptokeys. After initdb with enabling key manager, the 
pg_cryptokeys directory has the following files: 
 
$ ll data/pg_cryptokeys 
total 12K 
-rw--- 1 masahiko staff 132 Mar 25 15:45  
-rw--- 1 masahiko staff 132 Mar 25 15:45 0001 
-rw--- 1 masahiko staff 132 Mar 25 15:45 0002 
 
I used the integer id rather than string id to make the code simple. 
 
When cluster passphrase rotation, we update all keys atomically using 
temporary directory as follows: 
 
1. Derive the new passphrase 
2. Wrap all internal keys with the new passphrase 
3. Save all internal keys to the temp directory 
4. Remove the original directory, pg_cryptokeys 
5. Rename the temp directory to pg_cryptokeys 
 
In case of failure during rotation, pg_cyrptokeys and 
pg_cyrptokeys_tmp can be left in an incomplete state. We recover it by 
checking if the temporary directory exists and the wrapped keys in the 
temporary directory are valid. 
 
Regards, 
 
-- 
Masahiko Sawada http://www.2ndQuadrant.com/ 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Let people set host(no)ssl settings from initdb

2020-04-06 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi 
I applied the patch
"v3-0001-Enable-setting-pg_hba.conf-permissions-from-initd.patch" and did some
verification with it. The intended feature works overall, and I think it is
quite useful to support default auth methods for the ssl and gss host types. I
have also found some minor things in the patch that I would like to share below:

> +"# CAUTION: Configuring the system for \"trust\" authentication\n" \
> +"# allows any user who can reach the databse on the route specified\n" \
> +"# to connect as any PostgreSQL user, including the database\n" \
> +"# superuser.  If you do not trust all the users who could\n" \
> +"# reach the database on the route specified, use a more restrictive\n" \
> +"# authentication method.\n"

Found a typo: should be 'database' instead of 'databse'

>  * a sort of poor man's grep -v
>  */
> -#ifndef HAVE_UNIX_SOCKETS
>  static char **
>  filter_lines_with_token(char **lines, const char *token)
>  {
> @@ -461,7 +466,6 @@ filter_lines_with_token(char **lines, const char *token)
>  
>   return result;
> }
> -#endif

I see that you have removed the "#ifndef HAVE_UNIX_SOCKETS" around the
filter_lines_with_token() function definition so that it is always available; it
is used to remove the @tokens@ in case the user does not specify a default auth
method for the new hostssl, hostgss options. I think you should also remove the
"#ifndef HAVE_UNIX_SOCKETS" around its declaration as well, so that both the
function definition and the declaration are consistent.

#ifndef HAVE_UNIX_SOCKETS
static char **filter_lines_with_token(char **lines, const char *token);
#endif

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca

Re: warn if GUC set to an invalid shared library

2022-01-28 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hello

I tested the patches on the master branch on Ubuntu 18.04 and the regression
tests turn out fine. I did a manual test following the query examples in this
email thread, and I do see the warnings when I ran these queries:

postgres=# SET local_preload_libraries=xyz.so;
2022-01-28 15:11:00.592 PST [13622] WARNING:  could not access file "xyz.so"
WARNING:  could not access file "xyz.so"
SET
postgres=# ALTER SYSTEM SET shared_preload_libraries=abc.so;
2022-01-28 15:11:07.729 PST [13622] WARNING:  could not access file 
"$libdir/plugins/abc.so"
WARNING:  could not access file "$libdir/plugins/abc.so"
ALTER SYSTEM

This is fine, as this is what these patches are aiming to provide. However, when
I try to restart the server, it fails to start because abc.so and xyz.so do not
exist. Setting the parameters "shared_preload_libraries" and
"local_preload_libraries" to something else in postgresql.conf does not seem to
take effect either.
It still complains that shared_preload_libraries abc.so does not exist even
though I have already set shared_preload_libraries to something else in
postgresql.conf. This seems a little strange to me.

thank you
Cary

Patch: Global Unique Index

2022-11-17 Thread Cary Huang
Patch: Global Unique Index 



“Global unique index” in our definition is a unique index on a partitioned
table that can ensure cross-partition uniqueness using a non-partition key.
This work is inspired by the email thread “Proposal: Global Index” started
back in 2019 (link below). My colleague David and I took a different approach
to implement a feature that ensures the uniqueness constraint spans multiple
partitions. We achieved this mainly by using application logic without heavy
modification to Postgres’s current partitioned table/index structure. In other
words, a global unique index and a regular partitioned index are essentially
the same in terms of their storage structure, except that one can do a
cross-partition uniqueness check and the other cannot.



https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com



- Patch -

The attached patches were generated based on commit 
`85d8b30724c0fd117a683cc72706f71b28463a05` on master branch.



- Benefits of global unique index -

1. Ensure uniqueness spanning all partitions using a non-partition key column

2. Allow user to create a unique index on a non-partition key column without 
the need to include partition key (current Postgres enforces this)

3. Query performance increase when using a single unique non-partition key 
column





- Supported Features -

1. Global unique index is supported only on btree index type

2. Global unique index is useful only when created on a partitioned table.

3. Cross-partition uniqueness check with CREATE UNIQUE INDEX in serial and 
parallel mode

4. Cross-partition uniqueness check with ATTACH in serial and parallel mode

5. Cross-partition uniqueness check when INSERT and UPDATE





- Not-supported Features -

1. Global uniqueness check with sub-partitioned tables is not yet supported, as
we do not have an immediate use case and it may involve major changes to the
current implementation





- Global unique index syntax -

A global unique index can be created with "GLOBAL" and "UNIQUE" clauses in a 
"CREATE INDEX" statement run against a partitioned table. For example,



CREATE UNIQUE INDEX global_index ON idxpart(bid) GLOBAL;





- New Relkind: RELKIND_GLOBAL_INDEX -

When a global unique index is created on a partitioned table, its relkind is
RELKIND_PARTITIONED_INDEX (I). This is the same as creating a regular index.
Then Postgres will recursively create an index on each child partition, except
that now the relkind will be set to RELKIND_GLOBAL_INDEX (g) instead of
RELKIND_INDEX (i). This new relkind, along with the uniqueness flag, is needed
for the cross-partition uniqueness check later.





- Create a global unique index -

To create a regular unique index on a partitioned table, Postgres has to
perform a heap scan and a sort on every child partition. The uniqueness check
happens during the sorting phase and will raise an error if multiple tuples
with the same index key are sorted together. To achieve a global uniqueness
check, we make Postgres perform the sorting after all of the child partitions
have been scanned, instead of in the "sort per partition" fashion. In
other words, the sorting happens only once at the very end, and it sorts the
tuples coming from all the partitions and therefore can ensure global
uniqueness.



In parallel index build case, the idea is the same, except that the tuples will 
be put into shared file set (temp files) on disk instead of in memory to ensure 
other workers can share the sort results. At the end of the very last partition 
build, we make Postgres take over all the temp files and perform a final merge 
sort to ensure global uniqueness.



Example:



> CREATE TABLE gidx_part(a int, b int, c text) PARTITION BY RANGE (a);

> CREATE TABLE gidx_part1 PARTITION OF gidx_part FOR VALUES FROM (0) to (10);

> CREATE TABLE gidx_part2 PARTITION OF gidx_part FOR VALUES FROM (10) to (20);

> INSERT INTO gidx_part values(5, 5, 'test');

> INSERT INTO gidx_part values(15, 5, 'test');

> CREATE UNIQUE INDEX global_unique_idx ON gidx_part(b) GLOBAL;

ERROR:  could not create unique index "gidx_part1_b_idx"

DETAIL:  Key (b)=(5) is duplicated.





- INSERT and UPDATE -

For every new tuple inserted or updated, Postgres attempts to fetch the same 
tuple from current partition to determine if a duplicate already exists. In the 
global unique index case, we make Postgres attempt to fetch the same tuple from 
other partitions as well as the current partition. If a duplicate is found, 
global uniqueness is violated and an error is raised.



Example:



> CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a);

> CREATE TABLE gidx_part1 partition of gidx_part FOR VALUES FROM (0) TO (10);

> CREATE TABLE gidx_part2 partition of gidx_part FOR VALUES FROM (10) TO (20);

> CREATE UNIQUE INDEX global_unique_idx ON gidx_part USING BTREE(b) GLOBAL;

> INSERT INTO gidx_part values(5, 5, 'test');

> INSERT INTO gidx_part values(15, 5, 'test');

Re: Patch: Global Unique Index

2022-11-23 Thread Cary Huang
6 partition of test for values from (500) to (600);

-> create unique index myindex on test(b) global;

-> insert into test values(generate_series(0,599), 
generate_series(0,599), 'test'); /* record timing */

-> delete from test;/* record timing */

-> drop index myindex;

-> insert into test values(generate_series(0,599), 
generate_series(0,599), 'test');

-> create unique index myindex on test(b) global;   /* record timing */

-> create table test7 (a int, b int, c text);

-> insert into test7 values(generate_series(600, 699), 
generate_series(600, 699), 'test');

-> alter table test attach partition test7 for values from (600) TO 
(700);  /* record timing */

-> alter table test detach partition test7; /* record timing */





As you can see, the insert operation suffers the largest performance penalty. In 
fact, it takes roughly 6 times as long to complete the insertion, which 
matches the number of partitions in the test.



The attach operation also takes roughly 6 times as long to complete, 
because it has to perform a uniqueness check on all 6 existing partitions to 
determine global uniqueness. Detach takes the same time to complete in both 
cases.



Creating the global unique index takes 35% longer to build.



We also ran some tests for random SELECT and UPDATE using a non-partition key 
with pgbench to compare the performance among 3 conditions: no index, regular 
unique index (with the partition key involved), and global unique index:



Test 1: scale=100, 10 partitions, 1 million tuples/partition



SELECT:

-> No partitioned index:tps = 3.827886

-> regular unique index:tps = 14.713099

-> global unique index: tps = 23791.314238

UPDATE mixed with SELECT:

-> No partitioned index:tps = 1.926013

-> regular unique index:tps = 7.087059

-> global unique index: tps = 2253.098335



Test 2: scale=1,000, 100 partitions, 1 million tuples/partition



SELECT:

-> No partitioned index:tps = 0.110029

-> regular unique index:tps = 0.268199

-> global unique index: tps = 2334.811682

UPDATE mixed with SELECT:

-> No partitioned index:tps = 0.115329

-> regular unique index:tps = 0.197052

-> global unique index: tps = 541.488621



Test 3: scale=10,000, 1,000 partitions, 1 million tuples/partition



SELECT:

-> No partitioned index:tps = 0.011047

-> regular unique index:tps = 0.036812

-> global unique index: tps = 147.189456

UPDATE mixed with SELECT:

-> No partitioned index:tps = 0.008209

-> regular unique index:tps = 0.054367

-> global unique index: tps = 57.740432





Thank you very much, and we hope this information helps clarify some 
concerns about this approach.



David and Cary





HighGo Software Canada

http://www.highgo.ca











 On Mon, 21 Nov 2022 05:33:30 -0700  Simon Riggs  wrote ---

> On Thu, 17 Nov 2022 at 22:01, Cary Huang mailto:cary.hu...@highgo.ca> wrote:

> >

> > Patch: Global Unique Index

>

> Let me start by expressing severe doubt on the usefulness of such a

> feature, but also salute your efforts to contribute.

>

> > In other words, a global unique index and a regular partitioned index are 
> > essentially the same in terms of their storage structure except that one 
> > can do cross-partition uniqueness check, the other cannot.

>

> This is the only workable architecture, since it allows DETACH to be

> feasible, which is essential.

>

> You don't seem to mention that this would require a uniqueness check

> on each partition. Is that correct? This would result in O(N) cost of

> uniqueness checks, severely limiting load speed. I notice you don't

> offer any benchmarks on load speed or the overhead associated with

> this, which is not good if you want to be taken seriously, but at

> least it is recoverable.

>

> (It might be necessary to specify some partitions as READ ONLY, to

> allow us to record their min/max values for the indexed cols, allowing

> us to do this more quickly.)

>

> > - Supported Features -

> > 1. Global unique index is supported only on btree index type

>

> Why? Surely any index type that supports uniqueness is g

Re: Patch: Global Unique Index

2022-11-24 Thread Cary Huang
  On Thu, 24 Nov 2022 08:00:59 -0700  Thomas Kellerer  wrote --- 
 > Pavel Stehule schrieb am 24.11.2022 um 07:03:
 > > There are many Oracle users that find global indexes useful despite
 > > their disadvantages.
 > >
 > > I have seen this mostly when the goal was to get the benefits of
 > > partition pruning at runtime which turned the full table scan (=Seq 
 > > Scan)
 > > on huge tables to partition scans on much smaller partitions.
 > > Partition wise joins were also helpful for query performance.
 > > The substantially slower drop partition performance was accepted in 
 > > thos cases
 > >
 > >
 > > I think it would be nice to have the option in Postgres as well.
 > >
 > > I do agree however, that the global index should not be created 
 > > automatically.
 > >
 > > Something like CREATE GLOBAL [UNIQUE] INDEX ... would be a lot better
 > >
 > >
 > > Is it necessary to use special marks like GLOBAL if this index will
 > > be partitioned, and uniqueness will be ensured by repeated
 > > evaluations?
 > >
 > > Or you think so there should be really forced one relation based
 > > index?
 > >
 > > I can imagine a unique index on partitions without a special mark,
 > > that will be partitioned,  and a second variant classic index created
 > > over a partitioned table, that will be marked as GLOBAL.
 > 
 > 
 > My personal opinion is, that a global index should never be created
 > automatically.
 > 
 > The user should consciously decide on using a feature
 > that might have a serious impact on performance in some areas.


Agreed. If a unique index is created on non-partition-key columns without 
including the special mark (the partition key columns), it may be a mistake by 
the user (at least I make this mistake all the time). Current PG will give you a 
warning to include the partition keys, which is good. 

If we were to automatically turn that into a global unique index, the user may end 
up using the feature without knowing it and experiencing some performance impact 
(to account for the extra uniqueness check in all partitions).





Re: [PATCH] Simple code cleanup in tuplesort.c.

2022-11-25 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Removing "state->status = TSS_SORTEDINMEM" should be fine as it is already set 
in sort_bounded_heap(state) few lines before.

Cary Huang

HighGo Software Canada
www.highgo.ca

Re: Authentication fails for md5 connections if ~/.postgresql/postgresql.{crt and key} exist

2023-01-27 Thread Cary Huang

> I think the sslcertmode=disable option that I introduced in [1] solves
> this issue too; would it work for your case? That whole patchset is
> meant to tackle the general case of the problem you've described.
>
> (Eventually I'd like to teach the server not to ask for a client
>  certificate if it's not going to use it.)

There is an option in pg_hba.conf on the server side called "clientcert" 
that can be specified alongside the auth method; it controls whether certain 
client connections are required to send a client certificate for 
additional verification. The value of "clientcert" can be "verify-ca" or 
"verify-full". For example:


hostssl    all all 127.0.0.1/32 md5 
clientcert=verify-full


If clientcert is not requested by the server, yet the client still 
sends a certificate, the server will still verify it. This is the case 
in this discussion.


I agree that it is a more elegant approach to add "sslcertmode=disable" 
on the client side to prevent sending default certificate.


However, if the server does request clientcert but the client uses 
"sslcertmode=disable" to connect and does not give a certificate, it would 
also result in an authentication failure. In this case, we actually would 
want to ignore "sslcertmode=disable" and send the default certificate if found.


It would perhaps be better to change the parameter to 
"defaultclientcert=on-demand" on the client side, which will:


1. not send the existing default certificate if the server does not request 
a certificate
2. send the existing default certificate if the server does request a 
certificate while the client does not use the "sslcert" parameter to specify 
another non-default certificate


I put "default" in the parameter name to indicate that it only applies 
to default certificate. If user specifies a non-default certificate 
using "sslcert" parameter, "defaultclientcert" should not be used and 
client should give error if both exists.



Cary Huang

HighGo Software Canada
www.highgo.ca






Re: Avoid memory leaks during base backups

2022-10-14 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hello

I applied your v5 patch on the current master and ran valgrind on it while 
doing a basebackup with a simulated error. No memory leak related to backup was 
observed. Regression tests are also passing.

thank you

Cary Huang
HighGo Software Canada

Re: allow specifying action when standby encounters incompatible parameter settings

2022-04-29 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hello

The patch applied and tested fine. I think for this kind of exception, it 
really is up to the administrator how he/she should proceed to resolve it, 
depending on his/her business application. Leaving things configurable by the 
user is generally a nice and modest change. I also like that you leave the 
parameter as an enum entry, so it is possible to extend it with other possible 
actions, such as automatically adjusting to match the new value. 


-
Cary Huang
HighGo Software Canada

Re: Allowing REINDEX to have an optional name

2022-05-27 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hello

The patch applies and tests fine, and I think it has good intentions: to 
prevent the default behavior of REINDEX DATABASE from causing a deadlock. However, 
I am not in favor of simply omitting the database name after the DATABASE clause, 
for consistency reasons. Almost all other commands involving a DATABASE clause 
require a database name to be given after it, for example ALTER DATABASE 
[dbname].

Being able to omit the database name for REINDEX DATABASE seems inconsistent to me.

The documentation states that REINDEX DATABASE only works on the current 
database, but it still requires the user to provide a database name and requires 
that it match the current database. Not a very useful option, is it? But 
it is still required from the user to stay consistent with the other DATABASE 
clauses.

Maybe the best way is to keep the query clause as is (with the database name 
still required) and simply not let it reindex the system catalogs, to prevent the 
deadlock. At the end, give the user a notification that the system catalogs have 
not been reindexed and tell them to use REINDEX SYSTEM instead.
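
To illustrate the current behavior I am referring to (a sketch, not from the 
patch):

-- connected to database "mydb"
REINDEX DATABASE mydb;      -- accepted: the name must match the current database
REINDEX DATABASE otherdb;   -- rejected: can only reindex the currently open database
REINDEX SYSTEM mydb;        -- reindexes the system catalogs explicitly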

Cary Huang
-
HighGo Software Canada
www.highgo.ca

Re: Switching XLog source from archive to streaming when primary available

2022-06-24 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hello

I tested this patch in a setup where the standby is in the middle of 
replicating and REDOing the primary's WAL files during a very large data insertion. 
During this time, I kept killing the walreceiver process to cause a stream 
failure and force the standby to read from the archive. The system will restore from 
the archive for "wal_retrieve_retry_interval" seconds before it attempts to stream 
again. Without this patch, once streaming is interrupted, it keeps reading 
from the archive until the standby reaches the same consistent state as the primary, 
and then it switches back to streaming again. So it seems that the patch does 
the job as described and does bring some benefit during a very large REDO job, 
where it will try to re-stream after restoring some WALs from the archive to speed 
up this "catch up" process. But if the recovery job is not a large one, PG 
already switches back to streaming once it hits a consistent state.

thank you

Cary Huang
HighGo Software Canada

Re: Add last failed connection error message to pg_stat_wal_receiver

2022-07-22 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hello

The patch can be applied to the PG master branch without problems, and it passed 
the regression and tap tests. I manually tested this feature too, and the last 
connection error is correctly shown in the pg_stat_get_wal_receiver output, exactly 
as described. I think this feature is nice to have for troubleshooting 
replication issues on the standby side.

thank you

Cary Huang

Highgo Software Canada

Re: Automatic notification of top transaction IDs

2021-09-17 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hello

This patch applies fine to the master branch and the regression tests are passing. 

Regarding the parallel worker case, the AssignTransactionId() function is 
already handling that, and it will error out if IsParallelWorker() is true. In the 
normal case, this function is only called by the main backend; the parallel 
workers synchronize the transaction ID when they are spawned and will 
not call this function anyway.

thank you

Cary Huang

HighGo Software Canada
www.highgo.ca

Re: CREATE SEQUENCE with RESTART option

2021-07-23 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi

I have applied and run your patch, which works fine in my environment. 
Regarding your comments in the patch:

/*
 * Restarting a sequence while defining it doesn't make any sense
 * and it may override the START value. Allowing both START and
 * RESTART option for CREATE SEQUENCE may cause confusion to user.
 * Hence, we throw error for CREATE SEQUENCE if RESTART option is
 * specified. However, it can be used with ALTER SEQUENCE.
 */

I would remove the first sentence, because it seems like a personal opinion to 
me. I am sure someone, somewhere may think it makes total sense :).

I would rephrase like this:

/* 
 * Allowing both START and RESTART option for CREATE SEQUENCE 
 * could override the START value and cause confusion to user. Hence, 
 * we throw an error for CREATE SEQUENCE if RESTART option is
 * specified; it can only be used with ALTER SEQUENCE.
 */

just a thought.
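
For illustration, the behavior the comment describes would look something like 
this under the patch (a sketch, not taken from the patch itself):

CREATE SEQUENCE seq1 START 5 RESTART 10;   -- rejected with an error under the patch
CREATE SEQUENCE seq1 START 5;              -- allowed
ALTER SEQUENCE seq1 RESTART WITH 10;       -- RESTART remains valid here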

thanks!

-
Cary Huang
HighGo Software Canada
www.highgo.ca

Re: postgres_fdw: Handle boolean comparison predicates

2021-08-20 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

Hello

I tried to apply the patch to the master branch and got a couple of errors, so I 
think the patch needs a rebase. 

I also agree with Ashutosh that the "IS NOT TRUE" case can be simplified to 
just "IS FALSE"; it's simpler to understand.

Based on this, I think we should restructure the switch-case statement in 
deparseBooleanTest, because some of the cases in there evaluate to the same 
result but are handled differently. 

For example, "IS TRUE" and "IS NOT FALSE" both evaluate to true, so can be 
handled in the same way

something like:
switch (node->booltesttype)
{
case IS_TRUE:
case IS_NOT_FALSE:
appendStringInfoChar(buf, '(');
deparseExpr(node->arg, context);
appendStringInfoString(buf, ")");
break;
case IS_FALSE:
case IS_NOT_TRUE:
appendStringInfoChar(buf, '(');
deparseExpr(node->arg, context);
appendStringInfoString(buf, " IS FALSE)");
break;
case IS_UNKNOWN:
appendStringInfoChar(buf, '(');
deparseExpr(node->arg, context);
appendStringInfoString(buf, " IS NULL)");
break;
case IS_NOT_UNKNOWN:
appendStringInfoChar(buf, '(');
deparseExpr(node->arg, context);
    appendStringInfoString(buf, " IS NOT NULL)");
break;
}

just a thought
thanks!

---
Cary Huang
HighGo Software Canada
www.highgo.ca

Re: sslinfo extension - add notbefore and notafter timestamps

2023-07-10 Thread Cary Huang
 > Thanks for the new version!  It doesn't fail the ssl tests, but the Kerberos
 > test now fails.  You can see the test reports from the CFBot here:

Yes, the kerberos tests failed due to the addition of the notbefore and notafter 
values. The values array within the "pg_stat_get_activity" function related to 
"pg_stat_gssapi" was not set correctly. It is now fixed.


 > This runs on submitted patches, you can also run the same CI checks in your 
 > own
 > Github clone using the supplied CI files in the postgres repo.

Thank you for pointing this out. I followed the CI instruction as suggested and 
am able to run the same CI checks to reproduce the test failures.


> There are also some trivial whitespace issues shown with "git diff --check",
> these can of course easily be addressed by a committer in a final-version 
> patch
> but when sending a new version you might as well fix those.

Yes, the white spaces issues should be addressed in the attached patches.


> X509_getm_notBefore() and X509_getm_notAfter() are only available in OpenSSL
> 1.1.1 and onwards, but postgres support 1.0.2 (as of today with 8e278b6576).
> X509_get_notAfter() is available in 1.0.2 but deprecated in 1.1.1 and turned
> into an alias for X509_getm_notAfter() (same with _notBefore of course), and
> since we set 1.0.2 as the API compatibility we should be able to use that
> without warnings instead.

Thank you so much for catching this openssl function compatibility issue. I 
have changed the function calls to:
-  X509_get_notBefore()
-  X509_get_notAfter()

which are compatible in OpenSSL v1.0.2 and also v1.1.1 where they will get 
translated to X509_getm_notBefore() and X509_getm_notAfter() respectively


 > These functions should IMO return timestamp data types to save the user from
 > having to convert them. Same with the additions to pg_stat_get_activity.

Yes, agreed, the attached patches have the output changed to timestamp datatype 
instead of text.


 > You should add tests for the new functions in src/test/ssl/t/003_sslinfo.pl.

Yes, agreed, I added 2 additional tests in src/test/ssl/t/003_sslinfo.pl to 
compare the notbefore and notafter outputs from sslinfo extension and 
pg_stat_ssl outputs. Both should be tested equal.


Also added related documentation about the new not before and not after 
timestamps in pg_stat_ssl.

thank you

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca



v4-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


v4-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


Re: sslinfo extension - add notbefore and notafter timestamps

2023-07-14 Thread Cary Huang
Hello 

 > The way we typically ship extensions in contrib/ is to not create a new base
 > version .sql file for smaller changes like adding a few functions.  For this
 > patch we should keep --1.2.sql and instead supply a 1.2--1.3.sql with the new
 > functions.

Thank you for pointing this out. It makes sense to me.


 > +errmsg("failed to convert tm to timestamp")));
 > 
 > I think this error is too obscure for the user to act on, what we use 
 > elsewhere
 > is "timestamp out of range" and I think thats more helpful.  I do wonder if
 > there is ever a legitimate case when this can fail while still having a
 > authenticated client connection?

My bad here, you are right. "timestamp out of range" is a much better error 
message. However, in an authenticated client connection, there should not be a 
legitimate case where the not before and not after timestamps can fall out of range. 
The "not before" and "not after" timestamps in an X509 certificate are normally 
represented in ASN.1, which has a maximum timestamp of December 31, . The 
timestamp data structure in PostgreSQL, on the other hand, can support years up to 
June 3, 5874898. Assuming the X509 certificate is generated correctly and no 
data corruption happens (which is unlikely), the conversion from ASN.1 to 
timestamp shall not result in an out-of-range error.

Perhaps calling "tm2timestamp(&pgtm_time, 0, NULL, &ts)" without checking the 
return code would be just fine. I see some other usages of tm2timstamp() in 
other code areas also skip checking the return code.

 > I have addressed the issues above in a new v5 patchset which includes a new
 > patch for setting stable validity on the test certificates (the notBefore 
 > time
 > was arbitrarily chosen to match the date of opening up the tree for v17 - we
 > just need a date in the past).  Your two patches are rolled into a single one
 > with a commit message added to get started on that part as well.

Thank you so much for addressing the ssl tests to make the "not before" and "not 
after" timestamps static in the test certificate, and also for adjusting 
003_sslinfo.pl to expect the new static timestamps in the v5 patches. I am able 
to apply both and all tests are passing. I did not know this test certificate 
could be changed by `cd src/test/ssl && make -f sslfiles.mk`, but now I know, 
thanks to you :p.

Best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca






Re: sslinfo extension - add notbefore and notafter timestamps

2023-07-17 Thread Cary Huang
Hello

 > > Perhaps calling "tm2timestamp(&pgtm_time, 0, NULL, &ts)" without checking 
 > > the return code would be just fine. I see some other usages of 
 > > tm2timstamp() in other code areas also skip checking the return code.
 > 
 > I think we want to know about any failures, btu we can probably make it into 
 > an
 > elog() instead, as it should never fail.

Yes, sure. I have corrected the error message to elog(ERROR, "timestamp out of 
range") on a rare tm2timestamp() failure. Please see the attached patch based 
on your v5. "v6-0001-Set-fixed-dates-for-test-certificates-validity.patch" is 
exactly the same as 
"v5-0001-Set-fixed-dates-for-test-certificates-validity.patch"; I just bumped the 
version to be consistent. 

thank you very much


Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca



v6-0001-Set-fixed-dates-for-test-certificates-validity.patch
Description: Binary data


v6-0002-Add-notBefore-and-notAfter-to-SSL-cert-info-displ.patch
Description: Binary data


Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2023-07-28 Thread Cary Huang
 Hello

I had a look at the patch and tested it on the CI bot; it compiles and tests fine 
with both autoconf and meson. I noticed that the v2 patch contains the v1 patch file 
as well; not sure if that was intended or if it was put there by accident.

> I don't think "invalidating" is the right terminology. Note that we already
 > have InvalidateBuffer() - but it's something we can't allow users to do, as 
 > it
 > throws away dirty buffer contents (it's used for things like dropping a
 > table).
 >
 > How about using "discarding" for this functionality?

I think "invalidating" is the right terminology here, it is exactly what the 
feature is doing, it tries to invalidate a buffer ID by calling 
InvalidateBuffer() routine inside buffer manager and calls FlushBuffer() before 
invalidating if marked dirty. 

The problem here is that InvalidateBuffer() could be dangerous because it 
allows a user to invalidate buffer that may have data in other tables not owned 
by the current user, 

I think it all comes down to the purpose of this feature. Based on the 
description in this email thread, I feel like this feature should be 
categorized as a developer-only feature, to be used by PG developers to 
experiment and observe development work by invalidating one or more 
specific buffers. If this is the case, it may be helpful to add a 
"DEVELOPER_OPTIONS" GUC, which allows or disallows TryInvalidateBuffer() 
to run, or makes it return an error if the user does not have this developer 
option enabled.

If the purpose of this feature is for general users, then it would make sense 
to have something like pg_unwarm (exactly the opposite of pg_prewarm) that takes 
a table name (instead of a buffer ID) and drops all buffers associated with that 
table. There would be permission checks as well, so a user cannot pg_unwarm 
a table owned by someone else. The user in this case won't be able to invalidate a 
particular buffer, but he/she should not have to as a regular user anyway.
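
As a rough sketch of what that could look like (pg_unwarm is a hypothetical name 
and signature here, not an existing function):

-- hypothetical: drop all shared buffers belonging to a table the caller owns
SELECT pg_unwarm('my_table');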

thanks!

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca





Re: [ psql - review request ] review request for \d+ tablename, \d+ indexname indenting

2023-08-25 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hello

Thank you for the patch and the effort to enhance \d+'s output on partitioned 
tables that contain sub-partitions. However, the patch does not apply, and I 
notice that it is generated as a diff of 2 files, describe.c 
and describe_change.c. You should use git diff to generate a patch rather than 
maintaining 2 files yourself. Also, I noticed that you include a 
"create_object.sql" file to illustrate the feature, which is not necessary. 
Instead, you should add regression test cases to the existing 
regression test suite under "src/test/regress", so they get run as tests 
that illustrate the feature. This patch changes the output of \d+ and could 
potentially break other test cases, so you should fix those in the patch in 
addition to providing the feature.

Now, regarding the feature, I see that you intend to print the sub-partitions' 
partitions in the output, which is okay in my opinion. However, a sub-partition 
can also contain another sub-partition, which contains another sub-partition, 
and so on, so it is possible that sub-partitions can nest very, very deep. Your 
example assumes only 1 level of sub-partitions. Are you going to print all of 
them in \d+? If so, it would definitely clutter the output so much that it 
starts to become annoying. Are you planning to set a limit on how many levels 
of sub-partitions to print, or just let it print as many as it needs?

thank you

Cary Huang
---
Highgo Software Canada
www.highgo.ca

Re: [Patch] add multiple client certificate selection feature

2024-04-11 Thread Cary Huang
Hello 



I would like to share an updated patch that adds a feature to libpq to 
automatically select the best client certificate to send to the server (if it 
requests one). This feature is inspired by this email discussion years ago: 
https://www.postgresql.org/message-id/200905081539.n48Fdl2Y003286%40no.baka.org,
 which makes it easier for a single client to communicate over TLS with multiple 
TLS-enabled PostgreSQL servers with different certificate setups.



Instead of specifying just one sslcert, sslkey, or sslpassword, this patch 
allows multiple to be specified and libpq is able to pick the matching one to 
send to the PostgreSQL server based on the trusted CA names sent during TLS 
handshake.



If anyone finds it useful and would like to give it as try, I wrote a blog on 
how to test and verify this feature here: 
https://www.highgo.ca/2024/03/28/procedure-to-multiple-client-certificate-feature/



thank you



Best regards



Cary Huang

v3-0001-multiple_client_certificate_selection_support.patch
Description: Binary data


Support tid range scan in parallel?

2024-04-29 Thread Cary Huang
Hello

 

When using ctid as a restriction clause with lower and upper bounds, PostgreSQL's 
planner will use a TID range scan plan to handle such a query. This works and is 
generally fine. However, if the ctid range covers a huge amount of data, the 
planner will not use parallel workers to perform the ctid range scan because it is 
not supported. It could, however, still choose to use a parallel sequential scan to 
complete the scan if it costs less. 
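
For reference, the kind of query being discussed looks like this (a sketch; the 
table name and ctid bounds are illustrative, matching the test shown further 
below):

EXPLAIN
SELECT a FROM test
WHERE ctid >= '(1,0)' AND ctid <= '(540540,100)';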

 

In one of our migration scenarios, we rely on tid range scans to migrate huge 
tables from one database to another once the lower and upper ctid bounds are 
determined. With support for parallel ctid range scan, this process could be 
done much more quickly.

 

The attached patch is my approach to adding parallel ctid range scan to 
PostgreSQL's planner and executor. In my tests, I do see an increase in 
performance using a parallel tid range scan over the single-worker tid range 
scan, and it is also faster than a parallel sequential scan covering similar 
ranges. Of course, the table needs to be large enough to reflect the performance 
increase.

 

Below are the timings to complete a select query covering all the records in a 
simple 2-column table with 40 million records:

 

 - tid range scan takes 10216ms

 - tid range scan with 2 workers takes 7109ms

 - sequential scan with 2 workers takes 8499ms

 

Having support for parallel ctid range scan is definitely helpful in our 
migration case, and I am sure it could be useful in other cases as well. I am 
sharing the patch here, and if someone could provide quick feedback or a review, 
that would be greatly appreciated.

 

Thank you!

 

Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca

v1-0001-add-parallel-tid-rangescan.patch
Description: Binary data


Re: Logging which interface was connected to in log_line_prefix

2024-04-29 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi 

I did a quick test on this patch and it seems to work as expected. Originally I 
thought the patch would add the name of the "local interface", such as "eth0", 
"eth1", "lo", etc., as the %L log prefix format. Instead, it formats the local 
interface IP address, but I think that is fine too. 

I have tested this new addition with various types of IPs, including IPv4, IPv4 
and IPv6 local loopback addresses, a global IPv6 address, and a link-local IPv6 
address with an interface specifier, and it seems to format these IPs correctly.

There is a comment in the patch that states:

/* We do not need clean_ipv6_addr here: just report verbatim */

I am not quite sure what it means, but I am guessing it means that the patch 
does not need to format the IPv6 addresses in any specific way, for example 
removing leading zeros or compressing consecutive zeros to make an IPv6 address 
shorter. It may not be necessary to indicate this in a comment, because in my 
test, if any of my interface's IPv6 addresses have consecutive zeroes like this: 
2000::::::200:cafe/64, my network driver (Ubuntu 18.04) 
will format it as 2000::200:cafe, and the patch of course will read it as 
2000::200:cafe, which is ... correct and clean.

thank you
Cary Huang
www.highgo.ca

Re: Support tid range scan in parallel?

2024-04-30 Thread Cary Huang
Range Scan on test  (cost=0.01..511262.43 rows=28603575 
width=4) (actual time=0.332..4906.262 rows=3272 loops=3)
 TID Cond: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
 Planning Time: 0.213 ms
 Execution Time: 12265.873 ms
(7 rows)

[parallel seq scan with 2 workers]
set enable_tidscan = 'off';

EXPLAIN ANALYZE SELECT a FROM test WHERE ctid >= '(1,0)' AND ctid <= 
'(540540,100)';
QUERY PLAN
---
 Gather  (cost=0.00..969595.42 rows=68648581 width=4) (actual 
time=4.489..9713.299 rows=9815 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test  (cost=0.00..969595.42 rows=28603575 width=4) 
(actual time=0.995..5541.178 rows=3272 loops=3)
 Filter: ((ctid >= '(1,0)'::tid) AND (ctid <= '(540540,100)'::tid))
 Rows Removed by Filter: 62
 Planning Time: 0.129 ms
 Execution Time: 12675.681 ms
(8 rows)


Best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca





Re: Support tid range scan in parallel?

2024-05-01 Thread Cary Huang
 > This isn't a complete review. It's just that this seems enough to keep
 > you busy for a while. I can look a bit harder when the patch is
 > working correctly. I think you should have enough feedback to allow
 > that now.

Thanks for the test, review and feedback. They are greatly appreciated! 
I will polish the patch some more following your feedback and share new
results / patch when I have them. 

Thanks again!

Cary






Re: Support tid range scan in parallel?

2024-05-03 Thread Cary Huang
< '(216216,40)';
   QUERY PLAN

 Aggregate  (cost=716221.63..716221.64 rows=1 width=8) (actual 
time=12931.695..12931.696 rows=1 loops=1)
   Buffers: shared read=216217
   I/O Timings: shared read=599.331
   ->  Tid Range Scan on test  (cost=0.01..616221.31 rows=4130 width=4) 
(actual time=0.079..6800.482 rows=3999 loops=1)
 TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(216216,40)'::tid))
 Buffers: shared read=216217
 I/O Timings: shared read=599.331
 Planning:
   Buffers: shared hit=1 read=2
   I/O Timings: shared read=0.124
 Planning Time: 0.917 ms
 Execution Time: 12932.348 ms
(12 rows)

===parallel tid rangescan with aggregate===

set max_parallel_workers_per_gather=2;
EXPLAIN (ANALYZE, BUFFERS) select count(a) from test where ctid >= '(0,0)' and 
ctid < '(216216,40)';
 QUERY PLAN
-
 Finalize Aggregate  (cost=298425.70..298425.71 rows=1 width=8) (actual 
time=4842.512..4847.863 rows=1 loops=1)
   Buffers: shared read=216217
   I/O Timings: shared read=1155.321
   ->  Gather  (cost=298425.68..298425.69 rows=2 width=8) (actual 
time=4842.020..4847.851 rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 Buffers: shared read=216217
 I/O Timings: shared read=1155.321
 ->  Partial Aggregate  (cost=298425.68..298425.69 rows=1 width=8) 
(actual time=4824.730..4824.731 rows=1 loops=3)
   Buffers: shared read=216217
   I/O Timings: shared read=1155.321
   ->  Parallel Tid Range Scan on test  (cost=0.01..256758.88 
rows=1721 width=4) (actual time=0.098..2614.108 rows=1333 loops=3)
 TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < 
'(216216,40)'::tid))
 Buffers: shared read=216217
 I/O Timings: shared read=1155.321
 Planning:
   Buffers: shared read=3
   I/O Timings: shared read=3.323
 Planning Time: 4.124 ms
 Execution Time: 4847.992 ms
(20 rows)


Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v2-0001-add-parallel-tid-rangescan.patch
Description: Binary data


Re: Support tid range scan in parallel?

2024-05-08 Thread Cary Huang
63..0.565 rows=1 loops=4)
   Buffers: shared hit=11
   ->  Parallel Tid Range Scan on test  (cost=0.01..11.08 rows=656 
width=0) (actual time=0.018..0.338 rows=509 loops=4)
 TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid < 
'(11,0)'::tid))
 Buffers: shared hit=11


thank you!

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v3-0001-add-parallel-tid-rangescan.patch
Description: Binary data


Re: Support tid range scan in parallel?

2024-05-09 Thread Cary Huang
 > I've not looked at the patch, but please add it to the July CF.  I'll
 > try and look in more detail then.

Thanks David, I have added this patch to the July commitfest under the
server feature category. 

I understand that the regression tests for parallel ctid range scan are a
bit lacking now. There are only a few EXPLAIN clauses to ensure parallel 
workers are used when tid ranges are specified. They are added as
part of the select_parallel.sql test. I am not sure if it is more appropriate
to have them as part of the tidrangescan.sql test instead, so basically
re-running the same test cases in tidrangescan.sql but in parallel? 

thank you

Cary









Re: Support tid range scan in parallel?

2024-05-14 Thread Cary Huang
> You should add a test that creates a table with a very low fillfactor,
> low enough so only 1 tuple can fit on each page and insert a few dozen
> tuples. The test would do SELECT COUNT(*) to ensure you find the
> correct subset of tuples. You'd maybe want MIN(ctid) and MAX(ctid) in
> there too for extra coverage to ensure that the correct tuples are
> being counted.  Just make sure and EXPLAIN (COSTS OFF) the query first
> in the test to ensure that it's always testing the plan you're
> expecting to test.

Thank you for the test suggestion. I moved the regression tests from 
select_parallel to tidrangescan instead. Following the existing test table 
creation in tidrangescan with the lowest fillfactor of 10, I am able to get a 
consistent 5 tuples per page instead of 1. It should be okay as long as it is 
consistently 5 tuples per page, so the tuple counts from the parallel tests 
would be in multiples of 5.
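
A rough sketch of the kind of test this describes (not the actual regression 
test from the patch; the table name is illustrative, and the parallel-cost 
settings are lowered only to force a parallel plan on a tiny table):

CREATE TABLE parallel_tidrange_tst (a int) WITH (fillfactor = 10);
INSERT INTO parallel_tidrange_tst SELECT generate_series(1, 100);
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 2;
-- check the plan first, then verify the correct subset of tuples is counted
EXPLAIN (COSTS OFF)
  SELECT count(*), min(ctid), max(ctid)
  FROM parallel_tidrange_tst WHERE ctid < '(10,0)';
SELECT count(*), min(ctid), max(ctid)
  FROM parallel_tidrange_tst WHERE ctid < '(10,0)';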

The attached v4 patch includes the improved regression tests.

Thank you very much!

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca








v4-0001-add-parallel-tid-rangescan.patch
Description: Binary data


Re: Serverside SNI support in libpq

2024-05-24 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

This is an interesting feature on the PostgreSQL server side, where it can swap the
certificate settings based on the incoming hostname in the SNI field of the client
hello message.

I think this patch resonates with a patch I shared a while ago
( https://commitfest.postgresql.org/48/4924/ ) that adds multiple certificate
support on the libpq client side, while this patch adds multiple certificate
support on the server side. My patch allows the user to supply multiple certs, keys,
and sslpasswords in comma-separated format, and the libpq client will pick the one
that matches the CA issuer names sent by the server. In relation to your patch,
this CA issuer name would match the CA certificate configured in pg_hosts.conf.

I had a look at the patch and here's my comments:

+   
+PostgreSQL can be configured for
+SNI using the pg_hosts.conf
+configuration file. PostgreSQL inspects the TLS
+hostname extension in the SSL connection handshake, and selects the right
+SSL certificate, key and CA certificate to use for the connection.
+   

pg_hosts should also have sslpassword_command, just like in postgresql.conf, in
case the sslkey for a particular host is encrypted with a different password.

+   /*
+* Install SNI TLS extension callback in case the server is configured 
to
+* validate hostnames.
+*/
+   if (ssl_snimode != SSL_SNIMODE_OFF)
+   SSL_CTX_set_tlsext_servername_callback(context, 
sni_servername_cb);

If the libpq client does not provide SNI, this callback will not be called, so there
is no chance to check for a hostname match from pg_hosts, swap the TLS context,
or possibly reject the connection even in strict mode. The TLS handshake in such a
case shall proceed, and the server will use the certificate specified in
postgresql.conf (if loaded) to complete the handshake with the client.
There is a comment in the patch that reads:
There is a comment in the patch that reads:

>  - strict: only pg_hosts.conf is loaded and the TLS extension hostname
>   MUST be passed and MUST have a match in the configuration, else the
>  connection is refused.

I am not sure if it implies that if ssl_snimode is strict, then the normal 
ssl_cert, ssl_key, ca_cert, etc. settings in postgresql.conf are ignored?

thank you

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca

sslinfo extension - add notbefore and notafter timestamps

2022-08-19 Thread Cary Huang
Hello



I noticed that the sslinfo extension does not have functions to return the current 
client certificate's notbefore and notafter timestamps, which are also quite 
important attributes of an X509 certificate. The attached patch adds 2 functions 
to get the notbefore and notafter timestamps from the currently connected client 
certificate.
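
Usage from a client connected over SSL with a certificate would look roughly 
like this (a sketch; ssl_client_get_notbefore()/ssl_client_get_notafter() are 
the function names this patch proposes):

CREATE EXTENSION IF NOT EXISTS sslinfo;
SELECT ssl_client_get_notbefore(), ssl_client_get_notafter();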



thank you!







Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca

v1-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


Re: Mark a transaction uncommittable

2023-06-06 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hello

It is one of those features that a handful of people would find useful in 
specific use cases. I think it is a nice-to-have feature that safeguards your 
production database against unwanted commits when troubleshooting production 
problems. Your patch applies fine on master, and I was able to run a couple of 
tests on it; it seems to do as described. I noticed that the patch has a 
per-session variable "default_transaction_committable" that could make all 
transactions committable or uncommittable on the session even without specifying 
"begin transaction not committable;". I am wondering if we should have a 
configurable default at all, as I think it should always default to true and be 
unchangeable. If a user wants an uncommittable transaction, he/she will need to 
explicitly specify that during "begin". Having another option to change the default 
behavior for all transactions may be a little unsafe; it is possible someone 
could purposely change this default to false on a production session that needs 
transactions to absolutely commit, causing damage there. 

thank you
Cary Huang
--
Highgo Software Canada
www.highgo.ca

Re: sslinfo extension - add notbefore and notafter timestamps

2023-06-23 Thread Cary Huang

 > Off the cuff that doesn't seem like a bad idea, but I wonder if we should add
 > them to pg_stat_ssl (or both) instead if we deem them valuable?

I think the same information should be available in pg_stat_ssl as well.  
pg_stat_ssl can show the client certificate information for all connecting 
clients, and having it show the not_before and not_after timestamps of every client 
certificate is important in my opinion. The attached patch 
"v2-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch" adds this 
support.
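
With that patch applied, the validity window for every SSL connection could then 
be inspected with something like the following (a sketch; the exact column names, 
shown here as not_before/not_after, depend on the patch version):

SELECT pid, client_dn, not_before, not_after FROM pg_stat_ssl;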
 
 > Re the patch, it would be nice to move the logic in ssl_client_get_notafter 
 > and
 > the _notbefore counterpart to a static function since they are copies of
 > eachother.

Yes, agreed. I have made the adjustment, attached as 
"v2-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch".

Would this feature be suitable to be added to the commitfest? What do you think?

thank you

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca



v2-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


v2-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


Re: sslinfo extension - add notbefore and notafter timestamps

2023-06-23 Thread Cary Huang
 > Yes, please add it to the July commitfest and feel free to set me as 
 > Reviewer,
 > I intend to take a look at it.

Thank you Daniel, I have added this patch to the July commitfest under the security 
category and added you as reviewer. 

best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca





Re: sslinfo extension - add notbefore and notafter timestamps

2023-06-30 Thread Cary Huang
 > This needs to adjust the tests in src/test/ssl which now fails due to SELECT 
 > *
 > returning a row which doesn't match what the test was coded for.

Thank you so much for pointing that out. I have adjusted the extra ssl test to 
account for the extra columns returned. It should not fail now. 
 
> The new patchset isn't updating contrib/sslinfo/meson with the 1.3 update so 
> it
 > fails to build with Meson. 

Thanks again for pointing that out; I have adjusted the meson build file to include 
the 1.3 update.

Please see attached patches for the fixes. 
Thank you so much!

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v3-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


v3-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


Re: [Patch] remove duplicated smgrclose

2024-08-23 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi

The patch looks good to me as well. Calling smgrclose() right after calling 
smgrdounlinkall() does seem unnecessary, as it is already done inside 
smgrdounlinkall(), as you mentioned. I checked the commit logs and it seems the 
code has been like this for over 10 years. One difference is 
that smgrdounlinkall() does not reset smgr_cached_nblocks and smgr_targblock, but 
that does not seem to matter as it is about to remove the physical files.

While leaving them like this does no harm, because smgrclose() simply does 
nothing if the relation has already been closed, it does look weird that the code 
tries to close the relation after smgrdounlinkall(): the physical files have just 
been deleted when smgrdounlinkall() completes, so we would be trying to close 
something that has already been deleted?!

---
Cary Huang
Highgo software Canada
www.highgo.ca

Re: Internal key management system

2020-06-18 Thread Cary Huang
Hi all



Having read through the discussion, I have some comments and suggestions that I 
would like to share. 



I think it is still quite early to even talk about an external key management 
system, even if it is running on the same host as PG. That would most likely be 
achieved as an extension that provides communication with an external key server, 
and it would be a separate project/discussion. I think the focus now is to 
finalize the internal KMS design, and we can discuss how to migrate 
internally managed keys to an external system when the time is right.



A key management system is generally built to manage the life cycle of 
cryptographic keys, so our KMS, in my opinion, needs to be built with the key life 
cycle in mind, such as:



* Key generation
* key protection
* key storage
* key rotation

* key rewrap
* key disable/enable
* key destroy



KMS should not perform the above life cycle management by itself, automatically 
or in a hardcoded way; instead it should expose interfaces for the end user, or even 
a backend process like TDE, to trigger the above. 

The only key KMS should manage by itself is the KEK, which is derived from the 
cluster passphrase value. This is fine in my opinion. This KEK should exist 
only within KMS, to perform key protection (by wrapping) and key storage (saving 
it as a file).



The other life cycle stages should just be left as interfaces, waiting for 
somebody to request that KMS perform them. That somebody could be an end user or a 
backend process like TDE.



Using TDE as an example: when TDE initializes, it calls KMS's key_generation 
interface to get however many keys it needs. KMS should not return the 
keys in clear text or hex; it can return something like a key ID.

Using a regular user as an example: each user can also call KMS's key_generation 
interface to create a cryptographic key for their own purpose. KMS should likewise 
return just a key ID, and this key should be bound to the user. We can limit 
each user to having only one managed key, and a regular user can only manage 
his/her own key with KMS (rotate, destroy, disable, etc.); he/she cannot manage 
others' keys.
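
To make the idea concrete, at the SQL level such an interface could look roughly 
like this (purely a sketch; the function names are hypothetical and not taken 
from the patch):

-- hypothetical interface, for illustration only
SELECT kms_generate_key() AS key_id;    -- returns a key ID, never key material
SELECT kms_rotate_key('key-id-1234');   -- a user managing only his/her own key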



A superuser (or key admin), however, can perform all kinds of management on all 
keys (generated by TDE or by other users). He or she can do key rotation, key 
rewrap, disable, or destroy. Here we will need to think about how to prevent 
this user from misusing the key management functions. 



A superuser should also be able to view the status of all the managed keys, 
with information such as: 

* date of generation

* key ID

* owner

* status

* key length

* suggested date of rotation, etc.

* expiry date??


To actually perform encryption with keys managed by the internal KMS, I suggest 
adding a set of wrapper functions within KMS that take the key ID as an input 
argument. So, for example, when TDE wants to encrypt some data, it calls KMS's 
wrapper encryption function with the key ID supplied; KMS looks up the key by 
key ID, verifies the caller's permission, and translates these parameters to feed to 
pgcrypto, for example. This may be a little slower due to the lookup, but we can 
have a variation of the function where KMS looks up the key with the supplied 
key ID, converts it to an encryption context, and returns it back to TDE. TDE 
can then use this context to call another wrapper function for encryption without 
looking up the key every time. If an end user wants to encrypt something, they will 
also call KMS's wrapper function and supply the key ID in the same way. 
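
A rough sketch of what such wrapper functions might look like at the SQL level 
(again, the names are hypothetical, not the patch's actual API):

-- hypothetical wrapper functions; the key is looked up by key ID inside KMS
SELECT kms_encrypt('some secret data', 'key-id-1234');
SELECT kms_decrypt(kms_encrypt('some secret data', 'key-id-1234'), 'key-id-1234');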



I know there is a discussion on moving the cryptographic functions into an 
extension. In an already running PG, that is fine, but when TDE and XLOG 
bootstrap during initdb, cryptographic functions are required to encrypt the 
initial WAL file, and during initdb I don't think there is access to any 
cryptographic function provided by an extension. We may need to include 
limited cryptographic functions within KMS and TDE so that initdb can finish 
with the initial WAL encrypted.



This is just my thought on how this KMS and TDE should look. 


Best


Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca




 On Tue, 16 Jun 2020 23:52:03 -0700 Masahiko Sawada 
 wrote 



On Sun, 14 Jun 2020 at 19:39, Fabien COELHO <mailto:coe...@cri.ensmp.fr> wrote: 
> 
> 
> Hello Masahiko-san, 
> 
> >>> * The external place needs to manage more encryption keys than the 
> >>> current patch does. 
> >> 
> >> Why? If the external place is just a separate process on the same host, 
> >> probably it would manage the very same amount as what your patch. 
> > 
> > In the current patch, the external place needs to manage only one key 
> > whereas postgres needs to manages multiple DEKs. But with your idea, 
> > the external place needs to manage both KEK and DEKs. 
> 
> Hmmm. I do not see a good use case for a "managem

Re: pg_dump --where option

2020-07-09 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   tested, failed
Spec compliant:   tested, failed
Documentation:tested, failed

Hi

I had a look at the patch and it applies cleanly to the postgres master branch. I 
tried to do a quick test on the new "where clause" functionality, and for the 
most part it does the job as described; I'm sure some people will find this 
feature useful for their database dump needs. However, I tried the feature with a 
case where I have a subquery in the where clause, and it seems to fail to 
dump the data. I ran pg_dump like:

  $ pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > 
testdump2
  $ pg_dump: error: processing of table "public.test1" failed

Both test1 and test2 exist in the database, and the same subquery works under 
psql.
 
I also notice that the regression tests for pg_dump are failing due to the 
patch. I think it is worth looking into the failure messages and also adding some 
test cases for the new "where" clause to ensure that it covers as many use 
cases as possible.

thank you
Best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca

Re: Internal key management system

2020-04-07 Thread Cary Huang
ions 
keys and store the new results in pg_cryptokeys directory.



To rotate the cluster passphrase, the user first needs to update 
cluster_passphrase_command in postgresql.conf and then execute the 
pg_rotate_cluster_passphrase() SQL function to initiate the rotation.
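
For example (a sketch; pg_rotate_cluster_passphrase() and 
cluster_passphrase_command are the names used in this patch set):

-- after updating cluster_passphrase_command in postgresql.conf and reloading:
SELECT pg_rotate_cluster_passphrase();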




Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca





 On Mon, 30 Mar 2020 21:30:19 -0700 Masahiko Sawada 
<mailto:masahiko.saw...@2ndquadrant.com> wrote 


On Tue, 31 Mar 2020 at 09:36, Cary Huang <mailto:cary.hu...@highgo.ca> wrote: 
> 
> Hi 
> I had a look on kms_v9 patch and have some comments 
> 
> --> pg_upgrade.c 
> keys are copied correctly, but as pg_upgrade progresses further, it will try 
> to start the new_cluster from "issue_warnings_and_set_wal_level()" function, 
> which is called after key copy. The new cluster will fail to start due to the 
> mismatch between cluster_passphrase_command and the newly copied keys. This 
> causes pg_upgrade to always finish with failure. We could move 
> "copy_master_encryption_key()" to be called after 
> "issue_warnings_and_set_wal_level()" and this will make pg_upgrade to finish 
> with success, but user will still have to manually correct the 
> "cluster_passphrase_command" param on the new cluster in order for it to 
> start up correctly. Should pg_upgrade also take care of copying 
> "cluster_passphrase_command" param from old to new cluster after it has 
> copied the encryption keys so users don't have to do this step? If the 
> expectation is for users to manually correct "cluster_passphrase_command" 
> param after successful pg_upgrade and key copy, then there should be a 
> message to remind the users to do so. 
 
I think both the old cluster and the new cluster must be initialized 
with the same passphrase at initdb. Specifying the different 
passphrase command to the new cluster at initdb and changing it after 
pg_upgrade doesn't make sense. Also I don't think we need to copy 
cluster_passphrase_command same as other GUC parameters. 
 
I've changed the patch so that pg_upgrade copies the crypto keys only 
if both new and old cluster enable the key management. User must 
specify the same passphrase command to both old and new cluster, which 
is not cumbersome, I think. I also added the description about this to 
the doc. 
 
> 
> -->Kmgr.c 
> + /* 
> + * If there is only temporary directory, it means that the previous 
> + * rotation failed after wrapping the all internal keys by the new 
> + * passphrase.  Therefore we use the new cluster passphrase. 
> + */ 
> + if (stat(KMGR_DIR, &st) != 0) 
> + { 
> + ereport(DEBUG1, 
> + (errmsg("both directories %s and %s exist, use the newly wrapped keys", 
> + KMGR_DIR, KMGR_TMP_DIR))); 
> 
> I think the error message should say "there is only temporary directory 
> exist" instead of "both directories exist" 
 
You're right. Fixed. 
 
I've attached the new version patch. 
 
Regards, 
 
-- 
Masahiko Sawada http://www.2ndQuadrant.com/ 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Include sequence relation support in logical replication

2020-04-16 Thread Cary Huang
Hi Craig, Andres



Thank you guys so much for your reviews and comments. Really helpful. Yes, you 
are right: sequences do not guarantee freedom from gaps, and replicating 
sequences is useful for failover cases, so there will be no problem for a 
subscriber to get a future value 32 increments ahead. I will do more analysis 
on my end based on your comments and refine the patch with better test cases. 
Your help is much appreciated.



Best regards



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca




 On Wed, 15 Apr 2020 22:18:28 -0700 Craig Ringer  
wrote 



On Thu, 16 Apr 2020 at 07:44, Andres Freund <mailto:and...@anarazel.de> wrote:


> I would like to ask if you have some suggestions or ideas that can make 
> subscriber receives the current value without the need to
 > 
 > disabling the 32 increment behavior?
 
 It simply shouldn't expect that to be the case.  What do you need it
 for?
 
 As far as I can tell replicating sequence values is useful to allow
 failover, by ensuring all sequences will return sensible values going
 forward. That does not require to now precise values.


Totally agree. Code that relies on getting specific sequence values is broken 
code. Alas, very common, but still broken.



Cary, by way of background a large part of why this wasn't supported by logical 
decoding back in 9.4 is that until the pg_catalog.pg_sequence relation was 
introduced in PostgreSQL 10, the sequence relfilenode intermixed a bunch of 
transactional and non-transactional state in a very messy way. This made it 
very hard to achieve sensible behaviour for logical decoding.



As it is, make sure your regression tests carefully cover the following cases, 
as TAP tests in src/test/recovery, probably a new module for logical decoding 
of sequences:



1.



* Begin txn

* Create sequence

* Call nextval() on sequence over generate_series() and discard results

* Rollback

* Issue a dummy insert+commit to some other table to force logical decoding to 
send something

* Ensure subscription catches up successfully



This checks that we cope with advances for a sequence that doesn't get created.
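
For illustration, a rough psql sketch of case 1 (object names here are hypothetical):

BEGIN;
CREATE SEQUENCE seq1;
SELECT nextval('seq1') FROM generate_series(1, 100);
ROLLBACK;
-- dummy change on some other table so the decoder has something to send
INSERT INTO other_tbl VALUES (1);
-- then verify that the subscription catches up successfully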



2.

 

* Begin 1st txn

* Create a sequence

* Use the sequence to populate a temp table with enough rows to ensure sequence 
updates are written

* Begin a 2nd txn

* Issue a dummy insert+commit to some other table to force logical decoding to 
send something



* Commit the 2nd txn

* Commit the 1st txn

* Wait for subscription catchup

* Check that the sequence value on the subscriber reflects the value after 
sequence advance, not the value at creation time



This makes sure that sequence advances are handled sensibly when they arrive 
for a sequence that does not yet exist in the catalogs.
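
A rough sketch of case 2 as two interleaved psql sessions (object names again
hypothetical; in the TAP test this interleaving is what the IPC::Run background
psql is for):

-- session 1:
BEGIN;
CREATE SEQUENCE seq2;
CREATE TEMP TABLE tmp_seq AS
    SELECT nextval('seq2') FROM generate_series(1, 200);

-- session 2:
BEGIN;
INSERT INTO other_tbl VALUES (2);
COMMIT;

-- session 1:
COMMIT;
-- wait for subscription catchup, then check that the subscriber's sequence value
-- reflects the advanced value, not the value at creation time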



You'll need to run psql in an IPC::Run background session for that. We should 
really have a helper for this. I'll see if I'm allowed to post the one I use 
for my own TAP tests to the list.

Re: Internal key management system

2020-05-01 Thread Cary Huang
Hi all



I am sharing here a document patch based on top of kms_v10 that was shared 
awhile back. This document patch aims to cover more design details of the 
current KMS design and to help people understand KMS better. Please let me know 
if you have any more comments.



thank you



Best regards



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca




 On Tue, 07 Apr 2020 20:56:12 -0700 Ahsan Hadi 
<mailto:ahsan.h...@gmail.com> wrote 



Hi Bruce/Joe,



In the last meeting we discussed the need for improving the documentation for 
KMS so it is easier to understand the interface. Cary from highgo had a go at 
doing that, please see the previous email on this thread from Cary and let us 
know if it looks good...?



-- Ahsan 




On Wed, Apr 8, 2020 at 3:46 AM Cary Huang <mailto:cary.hu...@highgo.ca> wrote:








-- 

Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca/
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: mailto: 





Hello



Thanks a lot for the patch. In terms of functionality, the patch provides very 
straightforward functionality regarding key management. In terms of 
documentation, I think the patch is still lacking some pieces of information 
that prevent people from fully understanding how KMS works, how it can be used, 
and why (at least that is the impression I got from the zoom meeting 
recordings :p). I spent some time today revisiting the key-management 
documentation in the patch and rephrased and restructured it based on my 
current understanding of the latest KMS design. I mentioned all 3 
application-level keys that we have agreed on and emphasized explaining the SQL 
level encryption key, because that is the key that can be used right now. For 
the block- and WAL-level keys, we can add more information here once they are 
actually used in the TDE development. 



Please see below the KMS documentation that I have revised; I hope it will 
be clearer and make KMS easier for people to understand. Feel free to make 
adjustments. Please note that we use the term "wrap" and "unwrap" a lot in our 
past discussions. Originally we used the terms within a context involving Key 
encryption keys (KEK). For example, "KMS wraps a master key with KEK". Later, 
we used the same term in a context involving encrypting user secret /password. 
For example, "KMS wraps a user secret with SQL key". In my opinion, both make 
sense but it may be confusing to people having the same term used differently. 
So in my revision below, the terms "wrap" and "unwrap" refer to encrypting or 
decrypting user secret / password as they are used in "pg_wrap() and 
pg_unwrap()". I use the terms "encapsulate" and "restore" when KEK is used to 
encrypt or decrypt a key.







Chapter 32: Encryption Key Management 

--



PostgreSQL supports an internal Encryption Key Management System, which is 
designed to manage the life cycle of cryptographic keys within the PostgreSQL 
system. This includes their generation, storage, usage and rotation.



Encryption Key Management is enabled when PostgreSQL is built with 
--with-openssl and a cluster passphrase command is specified during initdb. The 
cluster passphrase provided by the --cluster-passphrase-command option during 
initdb and the one generated by cluster_passphrase_command in 
postgresql.conf must match; otherwise, the database cluster will not start up.



32.1 Key Generations and Derivations

--



When the cluster_passphrase_command option is specified to initdb, the process 
derives the cluster passphrase into a Key Encryption Key (KEK) and an HMAC 
Key using a key derivation protocol before the actual generation of the 
application-level cryptographic keys.



-Key Encryption Key (KEK)

KEK is primarily used to encapsulate or restore a given application level 
cryptographic key



-HMAC Key

HMAC key is used to compute the HASH of a given application level cryptographic 
key for integrity check purposes



These 2 keys are not stored physically within the PostgreSQL cluster as they 
are designed to be derived from the correctly configured cluster passphrase.



The Encryption Key Management System currently manages 3 application-level 
cryptographic keys that have different purposes and usages within the 
PostgreSQL system. These are generated using pg_strong_random() after the KEK 
and HMAC key derivation during the initdb process.



The 3 keys are:



-SQL Level Key

SQL Level Key is used to wrap and unwrap a user secret / passphrase via 
pg_wrap() and pg_unwrap() SQL functions. These 2 functions are designed to be 
used in conjunction with the cryptographic functions provided by pgcrypto 
extension to perform column level encryption/decryption without having to 
supply a clear

Re: Include sequence relation support in logical replication

2020-05-08 Thread Cary Huang
Hi Craig



I have added more regression test cases to the sequence replication patch, with 
emphasis on transactions and rollback per your suggestions. I find that when a 
transaction is aborted with a rollback, the decoder plugin will not receive the 
change, but the sequence value will in fact advance if nextval() or setval() 
were called. I have also made sequence replication an optional parameter in 
test_decoding so other test_decoding regression test cases will not need an 
update due to the new sequence replication function. The sequence update in 
this patch will emit a WAL update every 32 increments, and each update carries 
a future value 32 increments ahead, as it did originally, so getting a specific 
sequence value is no longer required.
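
To illustrate the prefetch behaviour (a minimal sketch; the 32 comes from the
default SEQ_LOG_VALS prefetch in sequence.c):

CREATE SEQUENCE s1;
SELECT nextval('s1');   -- returns 1
-- the WAL record written for this call already advances the logged value to 33,
-- so the decoded/replicated value is a future value, which is sufficient for failover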


Could you elaborate more on your second case, which requires psql in an 
IPC::Run background session, and check whether the regression tests need more 
coverage of certain cases?
thank you!


Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca



 On Thu, 16 Apr 2020 09:45:06 -0700 Cary Huang 
<mailto:cary.hu...@highgo.ca> wrote 


Hi Craig, Andres



Thank you guys so much for your reviews and comments. Really helpful. Yes you 
guys are right, Sequence does not guarantee free of gaps and replicating 
sequence is useful for failover cases, then there will be no problem for a 
subscriber to get a future value 32 increments after. I will do more analysis 
on my end based on your comments and refine the patch with better test cases. 
Much appreciated of your help.



Best regards



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca




 On Wed, 15 Apr 2020 22:18:28 -0700 Craig Ringer 
<mailto:cr...@2ndquadrant.com> wrote 











On Thu, 16 Apr 2020 at 07:44, Andres Freund <mailto:and...@anarazel.de> wrote:


> I would like to ask if you have some suggestions or ideas that can make 
> subscriber receives the current value without the need to
 > 
 > disabling the 32 increment behavior?
 
 It simply shouldn't expect that to be the case.  What do you need it
 for?
 
 As far as I can tell replicating sequence values is useful to allow
 failover, by ensuring all sequences will return sensible values going
 forward. That does not require knowing precise values.


Totally agree. Code that relies on getting specific sequence values is broken 
code. Alas, very common, but still broken.



Cary, by way of background a large part of why this wasn't supported by logical 
decoding back in 9.4 is that until the pg_catalog.pg_sequence relation was 
introduced in PostgreSQL 10, the sequence relfilenode intermixed a bunch of 
transactional and non-transactional state in a very messy way. This made it 
very hard to achieve sensible behaviour for logical decoding.



As it is, make sure your regression tests carefully cover the following cases, 
as TAP tests in src/test/recovery, probably a new module for logical decoding 
of sequences:



1.



* Begin txn

* Create sequence

* Call nextval() on sequence over generate_series() and discard results

* Rollback

* Issue a dummy insert+commit to some other table to force logical decoding to 
send something

* Ensure subscription catches up successfully



This checks that we cope with advances for a sequence that doesn't get created.



2.

 

* Begin 1st txn

* Create a sequence

* Use the sequence to populate a temp table with enough rows to ensure sequence 
updates are written

* Begin a 2nd txn

* Issue a dummy insert+commit to some other table to force logical decoding to 
send something



* Commit the 2nd txn

* Commit the 1st txn

* Wait for subscription catchup

* Check that the sequence value on the subscriber reflects the value after 
sequence advance, not the value at creation time



This makes sure that sequence advances are handled sensibly when they arrive 
for a sequence that does not yet exist in the catalogs.



You'll need to run psql in an IPC::Run background session for that. We should 
really have a helper for this. I'll see if I'm allowed to post the one I use 
for my own TAP tests to the list.

sequence_replication_v3.patch
Description: Binary data


Re: Internal key management system

2020-06-02 Thread Cary Huang
Hi

 

I took a step back today and started to think about the purpose of the internal
KMS, what it is supposed to do, and how it compares to an external KMS. Both are
intended to manage the life cycle of encryption keys, including their generation,
protection, storage and rotation. An external KMS, on the other hand, is a more
centralized but expensive way to manage encryption key life cycles, and many
deployments actually start with an internal KMS and later migrate to an external one.

 

Anyhow, the design and implementation of the internal KMS should revolve around
these stages of the key lifecycle.

 

1. Key Generation - Yes, the internal KMS module generates keys using a
pseudo-random function, but only the TDE and SQL level keys. Users cannot
request new key generation.

2. Key Protection - Yes, the internal KMS wraps all keys with a KEK and an HMAC
hash derived from a cluster passphrase.

3. Key Storage - Yes, the wrapped keys are stored in the cluster.

4. Key Rotation - Yes, the internal KMS has a SQL method to swap out the cluster
passphrase, which rotates the KEK and HMAC key.

 

I am saying this because I want to make sure we can all agree on the scope of
the internal KMS. Without a clear scope, this KMS development will seem to go on
forever.

 

In this patch, the internal KMS exposes pg_encrypt() and pg_decrypt() (formerly
pg_wrap() and pg_unwrap()) to the user to turn a clear-text password into some
sort of key material based on the SQL level key generated at initdb. This is
used so the user does not have to provide a clear-text password to
pgp_sym_encrypt() provided by the pgcrypto extension. The intention is good, I
understand, but I don't think it is within the scope of KMS, and it is
definitely not within the scope of TDE either.

 

Even if the password can be passed into pgp_sym_encrypt() securely by using the
pg_decrypt() function, pgp_sym_encrypt() still has to take this password and
derive it into an encryption key using algorithms that the internal KMS does
not currently manage. This kind of defeats the purpose of the internal KMS. So
simply using pg_encrypt() and pg_decrypt() is not really a solution to
pgcrypto's limitation. This should be another topic/project aimed at improving
pgcrypto by integrating it with the internal KMS, similar to TDE, which also
has to integrate with the internal KMS later.

 

So for the internal KMS, the only cryptographic functions needed for now are
kmgr_wrap_key() and kmgr_unwrap_key(), to encapsulate and restore the
encryption keys and satisfy the "key protection" lifecycle stage. I don't think
pg_encrypt() and pg_decrypt() should be part of the internal KMS.

 

Anyway, I have also reviewed the patch and have a few comments below:

 

(1)

The cipher algorithm, in my opinion, should contain the algorithm name, key
length and block cipher mode, similar to OpenSSL's definitions.

 

Instead of defining a cipher as PG_CIPHER_AES_CBC and having the key length as
a separate parameter, I would define them as:

#define PG_CIPHER_AES128_CBC 0
#define PG_CIPHER_AES256_CBC 1
#define PG_CIPHER_AES128_CTR 2
#define PG_CIPHER_AES256_CTR 3

 

I know PG_CIPHER_AES128_CTR and PG_CIPHER_AES256_CTR are not being used now, as
these are for TDE in the future, but we might as well list them here because
this KMS is made to work specifically for TDE, as I understand it.

---

/*
 * Supported symmetric encryption algorithm. These identifiers are passed
 * to pg_cipher_ctx_create() function, and then actual encryption
 * implementations need to initialize their context of the given encryption
 * algorithm.
 */
#define PG_CIPHER_AES_CBC	0
#define PG_MAX_CIPHER_ID	1

---

 

(2)

If the cipher algorithms are defined as in (1), then there is no need to pass
the key length as an argument to the ossl_cipher_ctx_create() function, because
it already knows the key length from the cipher definition. The fewer arguments
the better.

 

---

PgCipherCtx *
pg_cipher_ctx_create(int cipher, uint8 *key, int klen)
{
	PgCipherCtx *ctx = NULL;

	if (cipher >= PG_MAX_CIPHER_ID)
		return NULL;

#ifdef USE_OPENSSL
	ctx = (PgCipherCtx *) palloc0(sizeof(PgCipherCtx));

	ctx->encctx = ossl_cipher_ctx_create(cipher, key, klen, true);
	ctx->decctx = ossl_cipher_ctx_create(cipher, key, klen, false);
#endif

	return ctx;
}

---



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca



 On Sun, 31 May 2020 23:58:31 -0700 Masahiko Sawada 
 wrote 


On

Re: Internal key management system

2020-02-06 Thread Cary Huang
Since the user does not need to know the master secret key used to cipher the 
data, I don't think we should expose "pg_kmgr_unwrap("")" SQL function to 
the user at all.

The wrapped key "" is stored in control data and it is possible to obtain 
by malicious user and steal the key by running SELECT pg_kmgr_unwrap(""). 

Even if the user is righteous, it may not be very straightforward for that user to 
obtain the wrapped key "" to use in the unwrap function.



The pg_kmgr_(un)wrap functions are in discussion because the encrypt and decrypt 
functions require the master secret key as an input argument. 

I would suggest using cluster passphrase as input instead of master key, so the 
user does not have to obtain the master key using pg_kmgr_unwrap("") in 
order to use the encrypt and decrypt function. 

The passphrase is in fact not stored anywhere in the system and we have to be 
careful that this passphrase is not shown in any activity log



so instead of:

--


INSERT INTO tbl VALUES (pg_encrypt('user data', pg_kmgr_unwrap('x'));

SELECT pg_decrypt(secret_column, pg_kmgr_unwrap('x')) FROM tbl;



it would become:

--

INSERT INTO tbl VALUES (pg_encrypt('user data', 'cluster_pass_phrase');

SELECT pg_decrypt(secret_column, 'cluster_pass_phrase') FROM tbl;



pg_decrypt will then have to:



1. derive the cluster pass phrase into KEK and HMAC key 

2. verify pass phrase by comparing MAC

3. unwrap the key - Sehrope suggests a good approach to make the wrap/unwrap 
functions more secure by adding MAC verification and a random IV instead of the 
default. I think it is good

4. decrypt the data

5. return



Using the passphrase instead of the master key in the encrypt and decrypt 
functions will also make front-end tool integration simpler, as the front-end 
tool does not need to know the master key, so it does not need to derive the 
KEK or unwrap the key, etc. 

Not sure if you guys agree?



Thanks!



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca






 On Thu, 06 Feb 2020 12:30:02 -0800 Robert Haas  
wrote 



On Mon, Feb 3, 2020 at 10:18 PM Masahiko Sawada 
<mailto:masahiko.saw...@2ndquadrant.com> wrote: 
> > I'm lost. If pg_{en,de}crypt and pg_kmgr_unwrap are functions, what 
> > prevent users to: 
> > 
> >SELECT pg_kmgr_unwrap(''); 
> > 
> > so as to recover the key, somehow in contradiction to "allows user to 
> > encrypt and decrypt data without knowing the actual key". 
> 
> I might be missing your point but the above '' is the wrapped key 
> wrapped by the master key stored in PostgreSQL server. So user doesn't 
> need to know the raw secret key to encrypt/decrypt the data. Even if a 
> malicious user gets '' they cannot know the actual secret key 
> without the master key. pg_kmgr_wrap and pg_kmgr_unwrap are functions 
> and it's possible for user to know the raw secret key by using 
> pg_kmgr_unwrap(). The master key stored in PostgreSQL server never be 
> revealed. 
 
I think I have the same confusion as Fabien. Isn't it bad if somebody 
just runs pg_kmgr_unwrap() and records the return value? Now they've 
stolen your encryption key, it seems. 
 
-- 
Robert Haas 
EnterpriseDB: http://www.enterprisedb.com 
The Enterprise PostgreSQL Company

Re: Internal key management system

2020-02-18 Thread Cary Huang
Hi 



I have tried the attached kms_v3 patch and have some comments:



1) In the comments, I think you meant hmac + iv + encrypted data instead of iv 
+ hmac + encrypted data? 



---> in kmgr_wrap_key( ):

+   /*

+    * Assemble the wrapped key. The order of the wrapped key is iv, hmac 
and

+    * encrypted data.

+    */





2) I see that create_keywrap_ctx function in kmgr_utils.c and regular cipher 
context init will both call ossl_aes256_encrypt_init to initialise context for 
encryption and key wrapping. In ossl_aes256_encrypt_init, the cipher method 
always initialises to aes-256-cbc, which is ok for keywrap because under CBC 
block cipher mode, we only had to supply one unique IV as initial value. But 
for actual WAL and buffer encryption that will come in later, I think the 
discussion is to use CTR block cipher mode, which requires one unique IV for 
each block, and the sequence id from WAL and buffer can be used to derive 
unique IV for each block for better security? I think it would be better to 
allow caller to decide which EVP_CIPHER to initialize? EVP_aes_256_cbc(), 
EVP_aes_256_ctr() or others?



+ossl_aes256_encrypt_init(pg_cipher_ctx *ctx, uint8 *key)



+{


+   if (!EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, NULL, NULL))

+   return false;

+   if (!EVP_CIPHER_CTX_set_key_length(ctx, PG_AES256_KEY_LEN))

+   return false;

+   if (!EVP_EncryptInit_ex(ctx, NULL, NULL, key, NULL))

+   return false;

+

+   /*

+    * Always enable padding. We don't need to check the return

+    * value as EVP_CIPHER_CTX_set_padding always returns 1.

+    */

+   EVP_CIPHER_CTX_set_padding(ctx, 1);

+

+   return true;

+}



3) Following up point 2), I think we should enhance the enum to include not 
only the Encryption algorithm and key size, but also the block cipher mode as 
well because having all 3 pieces of information can describe exactly how KMS is 
performing the encryption and decryption. So when we call 
"ossl_aes256_encrypt_init", we can include the new enum as input parameter and 
it will initialise the EVP_CIPHER_CTX with either EVP_aes_256_cbc() or 
EVP_aes_256_ctr() for different purposes (key wrapping, or WAL encryption..etc).



---> kmgr.h

+/* Value of key_management_cipher */






+enum

+{

+   KMGR_CIPHER_OFF = 0,

+   KMGR_CIPHER_AES256

+};

+



so it would become 

+enum

+{

+   KMGR_CIPHER_OFF = 0,

+   KMGR_CIPHER_AES256_CBC = 1,

+       KMGR_CIPHER_AES256_CTR = 2

+};



If you agree with this change, several other places will need to be changed as 
well, such as "kmgr_cipher_string", "kmgr_cipher_value" and the initdb code



4) the pg_wrap_key and pg_unwrap_key SQL functions defined in kmgr.c

I tried these new SQL functions and found that pg_unwrap_key produces 
the original key with 4 bytes missing. This is because the result length is not 
set long enough to accommodate the 4-byte VARHDRSZ header used by the 
variable-length type.



The len variable in SET_VARSIZE(res, len) should also include the variable-length 
header VARHDRSZ. Right now it is 4 bytes short, so it produces incomplete output.



---> pg_unwrap_key function in kmgr.c

+   if (!kmgr_unwrap_key(UnwrapCtx, (uint8 *) VARDATA_ANY(data), datalen,


+    (uint8 *) VARDATA(res), &len))

+   ereport(ERROR,

+   (errmsg("could not unwrap the given secret")));

+

+   /*

+    * The size of unwrapped key can be smaller than the size estimated

+    * before unwrapping since the padding is removed during unwrapping.

+    */

+   SET_VARSIZE(res, len);

+   PG_RETURN_BYTEA_P(res);



I am only testing their functionality with a random key as input data. It is 
currently not possible for a user to obtain the wrapped key from the server in 
order to use these wrap/unwrap functions. I personally don't think it is a good 
idea to expose these functions to the user.



thank you






Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca



 On Fri, 14 Feb 2020 08:00:45 -0800 Robert Haas  
wrote 


On Thu, Feb 6, 2020 at 9:19 PM Masahiko Sawada 
<mailto:masahiko.saw...@2ndquadrant.com> wrote: 
> This feature protects data from disk thefts but cannot protect data 
> from attackers who are able to access PostgreSQL server. In this 
> design application side still is responsible for managing the wrapped 
> secret in order to protect it from attackers. This is the same as when 
> we use pgcrypto now. The difference is that data is safe even if 
> attackers steal the wrapped secret and the disk. The data cannot be 
> decrypted either without the passphrase which can be stored to other 
> key management systems or without accessing postgres server

Re: Internal key management system

2020-02-25 Thread Cary Huang
Hi 

I would like to share with you a front end patch based on kms_v4.patch that you 
have shared, called "kms_v4_fe.patch". 



The patch integrates the front-end tool pg_waldump with KMS v4 and obtains 
encryption and decryption cipher contexts from the KMS backend. These cipher 
contexts can then be used in subsequent encryption and decryption operations 
provided by cipher.h when TDE is enabled. I added two common functions in your 
kmgr_utils that other front-end tools affected by TDE can also use to obtain 
the cipher contexts. Do let me know if this is how you would envision KMS APIs 
being used by a front end. 



cheers





Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca




 On Mon, 24 Feb 2020 17:55:09 -0800 Masahiko Sawada 
 wrote 



On Thu, 20 Feb 2020 at 16:16, Masahiko Sawada 
<mailto:masahiko.saw...@2ndquadrant.com> wrote: 
> 
> On Wed, 19 Feb 2020 at 09:29, Cary Huang <mailto:cary.hu...@highgo.ca> wrote: 
> > 
> > Hi 
> > 
> > I have tried the attached kms_v3 patch and have some comments: 
> > 
> > 1) In the comments, I think you meant hmac + iv + encrypted data instead of 
> > iv + hmac + encrypted data? 
> > 
> > ---> in kmgr_wrap_key( ): 
> > +   /* 
> > +* Assemble the wrapped key. The order of the wrapped key is iv, 
> > hmac and 
> > +* encrypted data. 
> > +*/ 
> 
> Right, will fix. 
> 
> > 
> > 
> > 2) I see that create_keywrap_ctx function in kmgr_utils.c and regular 
> > cipher context init will both call ossl_aes256_encrypt_init to initialise 
> > context for encryption and key wrapping. In ossl_aes256_encrypt_init, the 
> > cipher method always initialises to aes-256-cbc, which is ok for keywrap 
> > because under CBC block cipher mode, we only had to supply one unique IV as 
> > initial value. But for actual WAL and buffer encryption that will come in 
> > later, I think the discussion is to use CTR block cipher mode, which 
> > requires one unique IV for each block, and the sequence id from WAL and 
> > buffer can be used to derive unique IV for each block for better security? 
> > I think it would be better to allow caller to decide which EVP_CIPHER to 
> > initialize? EVP_aes_256_cbc(), EVP_aes_256_ctr() or others? 
> > 
> > +ossl_aes256_encrypt_init(pg_cipher_ctx *ctx, uint8 *key) 
> > +{ 
> > +   if (!EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, NULL, NULL)) 
> > +   return false; 
> > +   if (!EVP_CIPHER_CTX_set_key_length(ctx, PG_AES256_KEY_LEN)) 
> > +   return false; 
> > +   if (!EVP_EncryptInit_ex(ctx, NULL, NULL, key, NULL)) 
> > +   return false; 
> > + 
> > +   /* 
> > +* Always enable padding. We don't need to check the return 
> > +* value as EVP_CIPHER_CTX_set_padding always returns 1. 
> > +*/ 
> > +   EVP_CIPHER_CTX_set_padding(ctx, 1); 
> > + 
> > +   return true; 
> > +} 
> 
> It seems good. We can expand it to make caller decide the block cipher 
> mode of operation and key length. I removed such code from the 
> previous patch to make it simple since currently we support only 
> AES-256 CBC. 
> 
> > 
> > 3) Following up point 2), I think we should enhance the enum to include not 
> > only the Encryption algorithm and key size, but also the block cipher mode 
> > as well because having all 3 pieces of information can describe exactly how 
> > KMS is performing the encryption and decryption. So when we call 
> > "ossl_aes256_encrypt_init", we can include the new enum as input parameter 
> > and it will initialise the EVP_CIPHER_CTX with either EVP_aes_256_cbc() or 
> > EVP_aes_256_ctr() for different purposes (key wrapping, or WAL 
> > encryption..etc). 
> > 
> > ---> kmgr.h 
> > +/* Value of key_management_cipher */ 
> > +enum 
> > +{ 
> > +   KMGR_CIPHER_OFF = 0, 
> > +   KMGR_CIPHER_AES256 
> > +}; 
> > + 
> > 
> > so it would become 
> > +enum 
> > +{ 
> > +   KMGR_CIPHER_OFF = 0, 
> > +   KMGR_CIPHER_AES256_CBC = 1, 
> > +   KMGR_CIPHER_AES256_CTR = 2 
> > +}; 
> > 
> > If you agree with this change, several other places will need to be changed 
> > as well, such as "kmgr_cipher_string", "kmgr_cipher_value" and the initdb 
> > code 
> 
> KMGR_CIPHER_XXX is relevant with cipher mode used by KMS and KMS would 
> still use AES256 CBC even if we had TDE which would use AES256 CTR. 
> 
> After more t

[PATCH] Documentation bug related to client authentication using TLS certificate

2020-03-02 Thread Cary Huang
Hi



I found a document bug about client authentication using TLS certificate. When 
clientcert authentication is enabled in pg_hba.conf, libpq does not verify that 
the common name in certificate matches database username like it is described 
in the documentation before allowing client connection.

Instead, when sslmode is set to “verify-full”, libpq will verify if the server 
host name matches the common name in client certificate. When sslmode is set to 
“verify-ca”, libpq will verify that the client is trustworthy by checking the 
certificate trust chain up to the root certificate and it does not verify 
server hostname and certificate common name match in this case.



The attached patch corrects the clientcert authentication description in the 
documentation



cheers













Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca

client_cert_auth.patch
Description: Binary data


Re: Internal key management system

2020-03-02 Thread Cary Huang
Rephrased based on (2), using pg_(un)wrap_secret instead:



=
The encryption key management system provides several functions, described in 
Table 9.97, to wrap and unwrap user secrets with the Master Encryption Key, which 
is uniquely and securely stored inside the database cluster.



These functions allow the user to encrypt and decrypt user data without having to 
provide the user encryption secret in plain text. One possible use case is to use 
encryption key management together with pgcrypto. The user wraps the user 
encryption secret with pg_wrap_secret() and passes the wrapped encryption 
secret to the pgcrypto encryption functions. The wrapped secret can be stored 
in the application server or somewhere secured and should be obtained promptly 
for cryptographic operations with pgcrypto.

[same examples follow after...]

=====


Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca



 On Tue, 25 Feb 2020 12:50:18 -0800 Cary Huang 
<mailto:cary.hu...@highgo.ca> wrote 


Hi 

I would like to share with you a front end patch based on kms_v4.patch that you 
have shared, called "kms_v4_fe.patch". 



The patch integrates front end tool pg_waldump with the KMSv4 and obtain 
encryption and decryption cipher contexts from the KMS backend. These cipher 
contexts can then be used in subsequent encryption and decryption operations 
provided by cipher.h when TDE is enabled. I added two common functions in your 
kmgr_utils that other front end tools affected by TDE can also use to obtain 
the cipher contexts. Do let me know if this is how you would envision KMS APIs 
to be used by a front end. 



cheers





Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca




 On Mon, 24 Feb 2020 17:55:09 -0800 Masahiko Sawada 
<mailto:masahiko.saw...@2ndquadrant.com> wrote 











On Thu, 20 Feb 2020 at 16:16, Masahiko Sawada 
<mailto:masahiko.saw...@2ndquadrant.com> wrote: 
> 
> On Wed, 19 Feb 2020 at 09:29, Cary Huang <mailto:cary.hu...@highgo.ca> wrote: 
> > 
> > Hi 
> > 
> > I have tried the attached kms_v3 patch and have some comments: 
> > 
> > 1) In the comments, I think you meant hmac + iv + encrypted data instead of 
> > iv + hmac + encrypted data? 
> > 
> > ---> in kmgr_wrap_key( ): 
> > +   /* 
> > +* Assemble the wrapped key. The order of the wrapped key is iv, 
> > hmac and 
> > +* encrypted data. 
> > +*/ 
> 
> Right, will fix. 
> 
> > 
> > 
> > 2) I see that create_keywrap_ctx function in kmgr_utils.c and regular 
> > cipher context init will both call ossl_aes256_encrypt_init to initialise 
> > context for encryption and key wrapping. In ossl_aes256_encrypt_init, the 
> > cipher method always initialises to aes-256-cbc, which is ok for keywrap 
> > because under CBC block cipher mode, we only had to supply one unique IV as 
> > initial value. But for actual WAL and buffer encryption that will come in 
> > later, I think the discussion is to use CTR block cipher mode, which 
> > requires one unique IV for each block, and the sequence id from WAL and 
> > buffer can be used to derive unique IV for each block for better security? 
> > I think it would be better to allow caller to decide which EVP_CIPHER to 
> > initialize? EVP_aes_256_cbc(), EVP_aes_256_ctr() or others? 
> > 
> > +ossl_aes256_encrypt_init(pg_cipher_ctx *ctx, uint8 *key) 
> > +{ 
> > +   if (!EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, NULL, NULL)) 
> > +   return false; 
> > +   if (!EVP_CIPHER_CTX_set_key_length(ctx, PG_AES256_KEY_LEN)) 
> > +   return false; 
> > +   if (!EVP_EncryptInit_ex(ctx, NULL, NULL, key, NULL)) 
> > +   return false; 
> > + 
> > +   /* 
> > +* Always enable padding. We don't need to check the return 
> > +* value as EVP_CIPHER_CTX_set_padding always returns 1. 
> > +*/ 
> > +   EVP_CIPHER_CTX_set_padding(ctx, 1); 
> > + 
> > +   return true; 
> > +} 
> 
> It seems good. We can expand it to make caller decide the block cipher 
> mode of operation and key length. I removed such code from the 
> previous patch to make it simple since currently we support only 
> AES-256 CBC. 
> 
> > 
> > 3) Following up point 2), I think we should enhance the enum to include not 
> > only the Encryption algorithm and key size, but also the block cipher mode 
> > as well because having all 3 pieces of information can describe exactly how 
> > KMS is performing the encryption and decryption. So when

Re: [PATCH] Documentation bug related to client authentication using TLS certificate

2020-03-03 Thread Cary Huang
Hi Chris



Thank you for your feedback. You are right, libpq verifies that the server is 
trustworthy by checking the server certificate, and checks that the hostname 
matches the server common name when sslmode is verify-full; this is already 
explained in another documentation page: 
https://www.postgresql.org/docs/current/libpq-ssl.html



Having done another investigation, I found that the original documentation 
(https://www.postgresql.org/docs/current/auth-cert.html) is actually right. The 
server does indeed check that the client certificate CN matches the database 
user name if the authentication method is set to "cert". 


Please disregard this patch.



thanks!

Cary


 On Mon, 02 Mar 2020 19:23:37 -0800 Chris Bandy  
wrote 


Hi, Cary. 
 
On 3/2/20 1:06 PM, Cary Huang wrote: 
> Hi 
> 
> I found a document bug about client authentication using TLS 
> certificate. When clientcert authentication is enabled in pg_hba.conf, 
> libpq does not verify that the *common name* in certificate 
> matches *database username* like it is described in the documentation 
> before allowing client connection. 
> 
> Instead, when sslmode is set to “verify-full”, libpq will verify if the 
> *server host name* matches the *common name* in client certificate. 
 
This sounds incorrect. My understanding is that the *server* host name 
is always matched with the *server* common name. 
 
> When sslmode is set to “verify-ca”, libpq will verify that the client is 
> trustworthy by checking the certificate trust chain up to the root 
> certificate and it does not verify *server hostname* and 
> certificate *common name* match in this case. 
 
Similarly, libpq will verify the *server* is trustworthy by checking the 
*server* certificate up to the root. It does not verify that the host 
name matches the common name in the *server* certificate. 
 
In all cases, libpq is responsible for verifying the *server* is who it 
claims to be. 
 
-- Chris

Re: Improve pg_dump dumping publication tables

2020-10-13 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

Hi

I applied the patch to the PG master branch successfully, and the patch seems to do 
as described. If there are a lot of tables created in Postgres and only a 
few of them have publications, the getPublicationTables() function will loop 
through all the tables and check each one to see if it is the desired relation 
type and if it contains one or more publications, so potentially a lot of 
looping and checking may happen. In John's patch, all this looping and 
checking is replaced with one query that returns all the tables that have 
publications. One problem, though, is that if there are a million tables 
created in the server, like you say, and all of them have publications, the query 
can return a million rows, which will require a lot of memory on the client side, 
and the pg_dump client may need to handle this extreme case with proper 
pagination, which adds complexity to the function's logic. The existing logic does 
not have this problem, because it simply queries a table's publications one at a 
time and does it a million times.
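
For reference, a single query of roughly this shape (a sketch, not necessarily
the exact query in the patch) is what lets all publication member tables be
fetched in one round trip:

SELECT p.pubname, c.oid, c.relname, n.nspname
FROM pg_publication_rel pr
JOIN pg_publication p ON p.oid = pr.prpubid
JOIN pg_class c ON c.oid = pr.prrelid
JOIN pg_namespace n ON n.oid = c.relnamespace;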

thank you
Cary Huang
HighGo Software

minor problem in boolean cast

2020-10-23 Thread Cary Huang
Hi



I noticed that when casting a string to a boolean value with the input 'of', it 
is still cast to 'f'. I think 'of' should give an error because 'off' is the 
expected candidate. This may not be intended, so I made a simple patch to 
address it. 


```
postgres=# select cast('of' as boolean);
 bool 
------
 f
(1 row)
```
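
With the proposed patch, the same cast is expected to be rejected instead, for
example:

```
postgres=# select cast('of' as boolean);
ERROR:  invalid input syntax for type boolean: "of"
```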



Cary Huang

-

HighGo Software Inc. (Canada)

mailto:cary.hu...@highgo.ca

http://www.highgo.ca

0001-boolean-type-cast-fix.patch
Description: Binary data


Re: sslinfo extension - add notbefore and notafter timestamps

2024-12-19 Thread Cary Huang
 > > The recent bump in minimum required versions of OpenSSL and LibreSSL made me
 > > remember to revisit this patch which was previously reverted due to library
 > > incompatibility (with *both* OpenSSL and LibreSSL on different APIs).
 > > 
 > > The attached removes the timestamp conversion workaround which is no longer
 > > needed.
 > 
 > The patch was marked as ready for committer and is currently failing
 > in the CI.  I've moved it to next CF waiting on author.  Could you
 > provide a rebase?

Since the minimum OpenSSL version is now 1.1.1, the v13 patch would fail the CI 
because
it uses the old APIs to obtain notBefore and notAfter timestamps:

- X509_get_notBefore
- X509_get_notAfter

which were deprecated in OpenSSL 1.1.0.
Instead, it should use:

- X509_get0_notBefore
- X509_get0_notAfter

which are available in version 1.1.1 and beyond. These APIs now return a "const 
ASN1_TIME*"
instead of "ASN1_TIME*".

The changes below should fix the CI failure when applied to the v13 patch:

---   a/contrib/sslinfo/sslinfo.c
+++ b/contrib/sslinfo/sslinfo.c

-static Datum ASN1_TIME_to_timestamptz(ASN1_TIME *time);
+static Datum ASN1_TIME_to_timestamptz(const ASN1_TIME *time);

-ASN1_TIME_to_timestamptz(ASN1_TIME *ASN1_cert_ts)
+ASN1_TIME_to_timestamptz(const ASN1_TIME *ASN1_cert_ts)

-   return ASN1_TIME_to_timestamptz(X509_get_notBefore(cert));
+  return ASN1_TIME_to_timestamptz(X509_get0_notBefore(cert));

-   return ASN1_TIME_to_timestamptz(X509_get_notAfter(cert));
+  return ASN1_TIME_to_timestamptz(X509_get0_notAfter(cert));


---   a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c

-static TimestampTz ASN1_TIME_to_timestamptz(ASN1_TIME *time);
+static TimestampTz ASN1_TIME_to_timestamptz(const ASN1_TIME *time);

-ASN1_TIME_to_timestamptz(ASN1_TIME *ASN1_cert_ts)
+ASN1_TIME_to_timestamptz(const ASN1_TIME *ASN1_cert_ts)

-*ptr = ASN1_TIME_to_timestamptz(X509_get_notBefore(port->peer));
+*ptr = ASN1_TIME_to_timestamptz(X509_get0_notBefore(port->peer));

-  *ptr = ASN1_TIME_to_timestamptz(X509_get_notAfter(port->peer));
+ *ptr = ASN1_TIME_to_timestamptz(X509_get0_notAfter(port->peer));

can you make a rebase with the above changes?


Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca









Re: sslinfo extension - add notbefore and notafter timestamps

2025-03-10 Thread Cary Huang
The attached v14 is the rebased patch that updates the OpenSSL API calls to be 
compatible with version 1.1.1 (and beyond), which is the new minimum OpenSSL 
version supported in PostgreSQL. 
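
For reference, once the patch is applied the new information can be inspected
roughly like this (a sketch; the function and column names follow my
understanding of the patch and may differ):

SELECT ssl_client_get_notbefore(), ssl_client_get_notafter();
SELECT not_before, not_after FROM pg_stat_ssl WHERE pid = pg_backend_pid();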


Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v14-0001-Add-notBefore-and-notAfter-to-SSL-cert-info-disp.patch
Description: Binary data


Re: encode/decode support for base64url

2025-03-11 Thread Cary Huang
> Oh well - you're probably right.
> I guess I was blinded by my convenience.
> Adding a 'base64url' option there is more appropriate.

I agree with that too. It is neater to add "base64url" as a new option for
the encode() and decode() SQL functions in encode.c.

In addition, you may also want to add the C versions of the base64url encode
and decode functions to "src/common/base64.c" as new API calls so that
frontend and backend applications and extensions can also have access
to these base64url conversions. 
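
For illustration, usage at the SQL level could then look like this
(hypothetical, assuming 'base64url' is accepted as a format name by encode()
and decode()):

SELECT encode('\xfbff'::bytea, 'base64url');   -- e.g. '-_8' instead of '+/8=' from plain base64
SELECT decode('-_8', 'base64url');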


Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca







Re: Support tid range scan in parallel?

2025-06-09 Thread Cary Huang
Hi Junwang 

Thank you so much for the review!

> + /*
 > + * if parallel mode is used, store startblock and numblocks in parallel
 > + * scan descriptor as well.
 > + */
 > + if (scan->rs_base.rs_parallel != NULL)
 > + {
 > + ParallelBlockTableScanDesc bpscan = NULL;
 > +
 > + bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
 > + bpscan->phs_startblock = scan->rs_startblock;
 > + bpscan->phs_numblock = scan->rs_numblocks;
 > + }
 > 
 > It would be more intuitive and clearer to directly use startBlk and numBlks
 > to set these values. Since scan->rs_startblock and scan->rs_numblocks
 > are already set using these variables, using the same approach for bpscan
 > would make the code easier to understand.
 > 
 > Another nitty-gritty is that you might want to use a capital `If` in the
 > comments to maintain the same style.

Agreed, made the adjustment in the attached patch.


 > + if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock != 0 &&
 > + nallocated >= pbscan->phs_numblock))
 > 
 > I'd suggest explictly setting phs_numblock to InvalidBlockNumber in
 > table_block_parallelscan_initialize, and compare with InvalidBlockNumber
 > here.

Also agreed, phs_numblock should be initialized in
table_block_parallelscan_initialize just like all other parameters in parallel 
scan
context. You are right, it is much neater to use InvalidBlockNumber rather
than 0 to indicate if an upper bound has been specified in the TID range scan.

I have addressed your comment in the attached v6 patch. Thank you again for
the review.

Best regards
Cary Huang
 


v6-0001-add-parallel-tid-rangescan.patch
Description: Binary data


Re: sslinfo extension - add notbefore and notafter timestamps

2025-06-05 Thread Cary Huang
  On Mon, 10 Mar 2025 12:29:28 -0700  Cary Huang  
wrote --- 
 > The attached v14 is the rebased patch that updated the OpenSSL API calls to 
 > be compatible in
 > version 1.1.1 (and beyond), which is the new minimum OpenSSL version 
 > supported in PostgreSQL. 
 > 
 > 
 > Cary Huang
 > -
 > HighGo Software Inc. (Canada)
 > cary.hu...@highgo.ca
 > www.highgo.ca
 
attached is the rebased (v15) patch

Regards

Cary Huang


v15-0001-Add-notBefore-and-notAfter-to-SSL-cert-info-disp.patch
Description: Binary data


Re: Support tid range scan in parallel?

2025-06-05 Thread Cary Huang
Hello 

Sorry David and all who have reviewed the patch, it's been a while since the patch
was last worked on :(. Thank you all for the reviews and comments! Attached is
the rebased patch that adds support for parallel TID range scans. This feature is
particularly useful when scanning large tables where the data needs to be scanned
in sizable segments, using a TID range in the WHERE clause to define each segment.
By enabling parallelism, this approach can improve performance compared to
both non-parallel TID range scans and traditional sequential scans.
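
For example, a segmented scan of this form is the kind of query that can
benefit (table name is hypothetical):

SELECT count(*) FROM big_table WHERE ctid >= '(0,1)' AND ctid < '(100000,1)';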

Regards

Cary Huang

v5-0001-add-parallel-tid-rangescan.patch
Description: Binary data


Re: Support tid range scan in parallel?

2025-06-27 Thread Cary Huang
Hi Steven

thanks for the review!

 > I have two comments:
 > 1. Does table_beginscan_parallel_tidrange() need an assert of relid,
 > like what table_beginscan_parallel() did?
 > Assert(RelationGetRelid(relation) == pscan->phs_relid);

In the v6 rebased patch, the assert has become:

Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));

rather than:

Assert(RelationGetRelid(relation) == pscan->phs_relid);

table_beginscan_parallel_tidrange() already has the proper assert line
similar to what table_beginscan_parallel() has.

 > 2. The new field phs_numblock in ParallelBlockTableScanDescData
 > structure has almost the same name as another field phs_nblocks. Would
 > you consider changing it to another name, for example,
 > phs_maxnumblocktoscan?

I actually had a similar thought too; phs_nblocks and phs_numblock are very
similar names but quite different in meaning. But I still left the name as
phs_numblock because I want to keep it consistent (kind of) with the 'numBlks'
used in heap_set_tidrange() in heapam.c. The comments beside their declarations
should be enough to describe their differences without causing confusion.

Best regards
Cary




Re: Support tid range scan in parallel?

2025-07-24 Thread Cary Huang
Hi David

Thank you so much for the review! I have addressed the comments in the
attached v7 patch.

 > 1. In cost_tidrangescan() you're dividing the total costs by the
 > number of workers yet the comment is claiming that's CPU cost. I think
 > this needs to follow the lead of cost_seqscan() and separate out the
 > CPU and IO cost then add the IO cost at the end, after the divide by
 > the number of workers.

I have separated the costs into disk and CPU costs similar to the style in
cost_seqscan().

 > 2. In execParallel.c, could you move the case for T_TidRangeScanState
 > below T_ForeignScanState? What you have right now is now quite
 > following the unwritten standard set out by the other nodes, i.e
 > non-parallel aware nodes are last. A good spot seems to be putting it
 > at the end of the scan types... Custom scan seems slightly misplaced,
 > but probably can ignore that one and put it after T_ForeignScanState

Yes, it's been done.

 > 3. The following comment should mention what behaviour occurs when the
 > field is set to InvalidBlockNumber:

Also addressed 

 > 4. I think the following would be clearer if written using an else if:
 > 
 > if (nallocated >= pbscan->phs_nblocks || (pbscan->phs_numblock !=
 >   InvalidBlockNumber && nallocated >= pbscan->phs_numblock))
 > page = InvalidBlockNumber; /* all blocks have been allocated */
 > else
 > page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
 > 
 > e.g:
 > 
 > if (nallocated >= pbscan->phs_nblocks)
 > page = InvalidBlockNumber; /* all blocks have been allocated */
 > else if (pbscan->phs_numblock != InvalidBlockNumber &&
 > nallocated >= pbscan->phs_numblock)
 > page = InvalidBlockNumber; /* upper scan limit reached */
 > else
 > page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;
 > 
 > That way the comment after the assignment is accurate.

Agreed, and also addressed.

 > 5. For the tests, is there any reason not to reuse the tidrangescan table?

To test the TID range scan in parallel, I created a new table with a very low
fill factor so that more pages are created, each with a small and fixed number
of tuples. The test then does a SELECT COUNT(*) to ensure the correct number of
tuples is counted by the parallel workers during the parallel TID range scan.
With this new table, it is easy to ensure the count is correct, since we know
how many tuples exist in each page and how many pages are to be scanned based
on the WHERE predicates. With the reused tidrangescan table, it would be hard
to tell whether the count is correct in my case. 
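
A rough sketch of the kind of test setup described (not the exact regression
test; names and numbers are made up):

CREATE TABLE parallel_tidrange_test (id int, pad text) WITH (fillfactor = 10);
INSERT INTO parallel_tidrange_test
    SELECT i, repeat('x', 200) FROM generate_series(1, 2000) i;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
-- with a known, fixed number of tuples per page, the expected count for a given
-- ctid range is easy to compute and compare against the parallel workers' result
SELECT count(*) FROM parallel_tidrange_test WHERE ctid < '(10,1)';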


thank you!
Cary.


v7-0001-add-parallel-tid-rangescan.patch
Description: Binary data