Re: [PATCH v1] pg_ls_tmpdir to show directories
Hello Justin,

I'm trying to think about how to get rid of the strange structure and hacks, and the arbitrary-looking size-2 array. Also the recursion is one step, but I'm not sure why; ISTM it could/should go on always?

> Because tmpfiles only go one level deep.

I'm not sure that is a general rule. ISTM that extensions can use tmp files, and we would have no control over what they do there. Looking at the code, ISTM that relying on a stack/list would be much cleaner and easier to understand. The code could look like:

> I'm willing to change the implementation, but only after there's an
> agreement about the desired behavior (extra column, one level, etc).

For the level, ISTM that the implementation should not make this assumption. If in practice there is just one level, then the function will not recurse deeply; no problem.

For the column, I'm not sure that "isdir" is necessary. You could put it implicitly in the file name by ending it with "/", and/or showing the directory contents is enough of a hint that there is a directory?

Also, I'm not fully sure why ".*" files should be skipped; maybe it should be an option? Or the user can filter them with SQL if they do not want them?

--
Fabien.
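As an illustration of the stack-based walk sketched above, a minimal standalone C version might look like the following. This is a hypothetical sketch using plain POSIX calls rather than the backend's own directory primitives, and it is not the code from the patch:

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustration only: list a directory tree iteratively, keeping a stack of
 * pending directories instead of recursing a fixed number of levels.
 * (d_type is not available on every platform; a stat() call would be the
 * portable alternative.)
 */
static void
list_tree(const char *root)
{
    size_t  cap = 8, depth = 0;
    char  **stack = malloc(cap * sizeof(char *));

    stack[depth++] = strdup(root);

    while (depth > 0)
    {
        char       *path = stack[--depth];
        DIR        *dir = opendir(path);
        struct dirent *de;

        if (dir == NULL)
        {
            free(path);
            continue;           /* unreadable entry: just skip it here */
        }

        while ((de = readdir(dir)) != NULL)
        {
            char        child[4096];

            if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                continue;

            snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
            printf("%s\n", child);

            /* directories go back on the stack, so nesting depth is unbounded */
            if (de->d_type == DT_DIR)
            {
                if (depth == cap)
                    stack = realloc(stack, (cap *= 2) * sizeof(char *));
                stack[depth++] = strdup(child);
            }
        }

        closedir(dir);
        free(path);
    }

    free(stack);
}

int
main(void)
{
    list_tree("/tmp");
    return 0;
}
```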
Re: Setting min/max TLS protocol in clientside libpq
> On 16 Jan 2020, at 04:22, Michael Paquier wrote:
>
> On Wed, Jan 15, 2020 at 02:58:09PM +0900, Michael Paquier wrote:
>> On Tue, Jan 14, 2020 at 11:01:00PM +0100, Daniel Gustafsson wrote:
>>> Files renamed to match existing naming convention, the rest of the patch left
>>> unchanged.
>>
>> [previous review]
>
> One thing I remembered after sleeping on it is that we can split the
> patch into two parts: the refactoring pieces and the addition of the
> options for libpq.

Correct, they are mostly independent (the refactoring doesn't make a lot of sense without the follow-up patch, but the min/max patch can be kept more readable without the refactoring in it as well).

> The previous review mostly impacts the libpq part,
> and the split is straight-forward, so attached is a patch for only the
> refactoring pieces with some fixes and tweaks. I have tested it with
> and without OpenSSL, using 1.0.2 and 1.1.0 on Linux and Windows
> (MSVC). Those tests have allowed me to find an error in the previous
> patch that I missed: the new files openssl.h and protocol_openssl.c
> still declared SSL_CTX_set_min/max_proto_version as static functions,
> so compilation was broken when trying to use OpenSSL <= 1.0.2.

Doh .. thanks.

> If that looks fine, I would like to get that part committed first.
> Daniel, any thoughts?

The patch looks fine to me; I don't see an issue with splitting it into a refactoring patch and a TLS min/max version patch.

cheers ./daniel
Re: Option to dump foreign data in pg_dump
On Tue, Jan 14, 2020 at 5:22 PM Luis Carril wrote:
> Can you have a look at dump with parallel option. Parallel option will
> take a lock on table while invoking lockTableForWorker. May be this is
> not required for foreign tables.
> Thoughts?
>
> I tried with -j and found no issue. I guess that the foreign table needs
> locking anyway to prevent anyone to modify it while is being dumped.
>

I'm able to get the problem with the following steps:

Bring up a postgres setup with servers running on ports 5432 & 5433.

Execute the following commands in Server1 configured on port 5432:
- CREATE EXTENSION postgres_fdw;
- CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '127.0.0.1', port '5433', dbname 'postgres');
- create user user1 password '123';
- alter user user1 with superuser;
- CREATE USER MAPPING FOR user1 SERVER foreign_server OPTIONS (user 'user1', password '123');

Execute the following commands in Server2 configured on port 5433:
- create user user1 password '123';
- alter user user1 with superuser;

Execute the following commands in Server2 configured on port 5433 as user user1:
- create schema test;
- create table test.test1(id int);
- insert into test.test1 values(10);

Execute the following commands in Server1 configured on port 5432 as user user1:
- CREATE FOREIGN TABLE foreign_table1 (id integer NOT NULL) SERVER foreign_server OPTIONS (schema_name 'test', table_name 'test1');

Without the parallel option, the operation is successful:
- ./pg_dump -d postgres -f dumpdir -U user1 -F d --include-foreign-data foreign_server

With the parallel option it fails:
- ./pg_dump -d postgres -f dumpdir1 -U user1 -F d -j 5 --include-foreign-data foreign_server
pg_dump: error: could not obtain lock on relation "public.foreign_table1"
This usually means that someone requested an ACCESS EXCLUSIVE lock on the table after the pg_dump parent process had gotten the initial ACCESS SHARE lock on the table.
pg_dump: error: a worker process died unexpectedly

There may be simpler steps than this to reproduce the issue; I have not tried to optimize it.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Amcheck: do rightlink verification with lock coupling
> On 14 Jan 2020, at 9:47, Andrey Borodin wrote:
>
> Page updates may be lost due to a bug in backup software with incremental
> backups, a bug in the storage layer of an Aurora-style system, a bug in the
> page cache, incorrect fsync error handling, a bug in SSD firmware, etc. And
> our data checksums do not detect this kind of corruption. BTW I think that
> it would be better if our checksums were not stored on the page itself;
> then they could detect this kind of fault.

Observed it just now. There is one HA cluster where a node was marked dead. The node was disconnected from the cluster, but due to human error postgres was still running on it. The node managed to install a block-level incremental backup into the chain, and the backup software did not detect that the backup step was taken from a part of the timeline that is not in the actual timeline's history.

Result of the restoration is:

man-w%/%db R # select bt_index_check('%.pk_%');
 bt_index_check
(1 row)
Time: 1411.065 ms (00:01.411)

man-w%/%db R # select patched_index_check('%.pk_%');
ERROR:  XX002: left link/right link pair in index "pk_labels" not in agreement
DETAIL:  Block=42705 left block=42707 left link from block=45495.
LOCATION:  bt_recheck_block_rightlink, verify_nbtree.c:621
Time: 671.336 ms

('%' is replacing removed chars)

I understand that this corruption was not introduced by postgres itself, but by a combination of bugs in two 3rd-party tools and human error. But I can imagine similar corruptions with different root causes.

Best regards, Andrey Borodin.
Re: Implementing Incremental View Maintenance
An error occurs when updating user-defined type columns. Create an INCREMENTAL MATERIALIZED VIEW by specifying a query that includes user-defined type columns. After the view is created, an error occurs when inserting into the view's source table (including the user-defined type column).

```
ERROR: operator does not exist
```

An execution example is shown below.

```
[ec2-user@ip-10-0-1-10 ivm]$ psql testdb -a -f extension-insert.sql
--
-- pg_fraction: https://github.com/nuko-yokohama/pg_fraction
--
DROP EXTENSION IF EXISTS pg_fraction CASCADE;
psql:extension-insert.sql:4: NOTICE:  drop cascades to column data of table foo
DROP EXTENSION
DROP TABLE IF EXISTS foo CASCADE;
DROP TABLE
CREATE EXTENSION IF NOT EXISTS pg_fraction;
CREATE EXTENSION
\dx
                 List of installed extensions
    Name     | Version |   Schema   |         Description
-------------+---------+------------+------------------------------
 pg_fraction | 1.0     | public     | fraction data type
 plpgsql     | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

\dT+ fraction
                              List of data types
 Schema |   Name   | Internal name | Size | Elements |  Owner   | Access privileges | Description
--------+----------+---------------+------+----------+----------+-------------------+-------------
 public | fraction | fraction      | 16   |          | postgres |                   |
(1 row)

CREATE TABLE foo (id int, data fraction);
CREATE TABLE
INSERT INTO foo (id, data) VALUES (1,'2/3'),(2,'1/3'),(3,'1/2');
INSERT 0 3
SELECT id, data FROM foo WHERE data >= '1/2';
 id | data
----+------
  1 | 2/3
  3 | 1/2
(2 rows)

CREATE INCREMENTAL MATERIALIZED VIEW foo_imv AS SELECT id, data FROM foo WHERE data >= '1/2';
SELECT 2
TABLE foo_imv;
 id | data
----+------
  1 | 2/3
  3 | 1/2
(2 rows)

INSERT INTO foo (id, data) VALUES (4,'2/3'),(5,'2/5'),(6,'3/6'); -- error
psql:extension-insert.sql:17: ERROR:  operator does not exist: fraction pg_catalog.= fraction
LINE 1: ...(mv.id IS NULL AND diff.id IS NULL)) AND (mv.data OPERATOR(p...
                                                             ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
QUERY:  WITH updt AS (UPDATE public.foo_imv AS mv SET __ivm_count__ = mv.__ivm_count__ OPERATOR(pg_catalog.+) diff.__ivm_count__ FROM pg_temp_3.pg_temp_73900 AS diff WHERE (mv.id OPERATOR(pg_catalog.=) diff.id OR (mv.id IS NULL AND diff.id IS NULL)) AND (mv.data OPERATOR(pg_catalog.=) diff.data OR (mv.data IS NULL AND diff.data IS NULL)) RETURNING mv.id, mv.data) INSERT INTO public.foo_imv SELECT * FROM pg_temp_3.pg_temp_73900 AS diff WHERE NOT EXISTS (SELECT 1 FROM updt AS mv WHERE (mv.id OPERATOR(pg_catalog.=) diff.id OR (mv.id IS NULL AND diff.id IS NULL)) AND (mv.data OPERATOR(pg_catalog.=) diff.data OR (mv.data IS NULL AND diff.data IS NULL)));
TABLE foo;
 id | data
----+------
  1 | 2/3
  2 | 1/3
  3 | 1/2
(3 rows)

TABLE foo_imv;
 id | data
----+------
  1 | 2/3
  3 | 1/2
(2 rows)

DROP MATERIALIZED VIEW foo_imv;
DROP MATERIALIZED VIEW
INSERT INTO foo (id, data) VALUES (4,'2/3'),(5,'2/5'),(6,'3/6');
INSERT 0 3
TABLE foo;
 id | data
----+------
  1 | 2/3
  2 | 1/3
  3 | 1/2
  4 | 2/3
  5 | 2/5
  6 | 1/2
(6 rows)
```

Best regards.

On Thu, 27 Dec 2018 at 21:57, Yugo Nagata wrote:
> Hi,
>
> I would like to implement Incremental View Maintenance (IVM) on PostgreSQL.
> IVM is a technique to maintain materialized views which computes and applies
> only the incremental changes to the materialized views rather than
> recomputing the contents as the current REFRESH command does.
>
> I had a presentation on our PoC implementation of IVM at PGConf.eu 2018 [1].
> Our implementation uses row OIDs to compute deltas for materialized views.
> The basic idea is that if we have information about which rows in base tables
> are contributing to generate a certain row in a matview then we can identify
> the affected rows when a base table is updated. This is based on an idea of
> Dr. Masunaga [2] who is a member of our group and inspired from the ID-based
> approach [3].
>
> In our implementation, the mapping of the row OIDs of the materialized view
> and the base tables is stored in an "OID map". When a base relation is
> modified, an AFTER trigger is executed and the delta is recorded in delta
> tables using the transition table feature. The actual update of the matview
> is triggered by the REFRESH command with the INCREMENTALLY option.
>
> However, we realize problems of our implementation. First, WITH OIDS will
> be removed since PG12, so OIDs are no longer available. Besides this, it would
> be hard to implement this since it needs many changes of executor nodes to
> collect base tables' OIDs during executing a query. Also, the cost of
> maintaining the OID map would be high.
>
> For these reasons, we started to think to implement IVM without relying on
> OIDs and made a bit more surveys.
>
> We also looked at Kevin Grittner's discussion [4]
empty range
Hello,

The default constructor for ranges uses a closed lower bound '[' and an open upper bound ')'. This is correct behavior, but when the upper bound is the same as the lower bound, the range is empty. Mathematically that is correct again, but in the database the information about the range bounds is lost (lower/upper is NULL).

To prevent this situation we must check whether the lower and upper arguments are the same and add some 0.1s to the upper bound, or use another constructor like tstzrange(now(),now(),'[]').

Is there a chance to change the behavior of storing ranges? It would be possible to store the range bounds in the internal structure, so that lower(tstzrange(now(),now())) shows a non-NULL value, or to change the default behavior of tstzrange(timestamptz,timestamptz): if both args are the same, then store as '[]', else '[)', and only tstzrange(timestamptz,timestamptz,'[)') and tstzrange(timestamptz,timestamptz,'()') would store an empty range.

It's only a suggestion; I don't know if somebody wants to store an empty range without bounds. We must have some checks to prevent storing empty values in every place where this empty range can occur, because we don't want to lose the bound information.

Best regards,

--
Ing. David TUROŇ
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava
tel.: +420 591 166 224
fax: +420 596 621 273
mobil: +420 732 589 152
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
Re: remove some STATUS_* symbols
At Thu, 16 Jan 2020 14:50:01 +0900, Michael Paquier wrote in
> On Sat, Jan 11, 2020 at 08:14:17AM +0100, Peter Eisentraut wrote:
> > OK, pushed as it was then.
>
> Thanks, that looks fine. I am still not sure whether the second patch
> adding an enum via ProcWaitStatus improves the code readability
> though, so my take would be to discard it for now. Perhaps others
> think differently, I don't know.

I feel the same about the second patch. Actually, STATUS_WAITING is used only by ProcSleep and related functions, and likewise STATUS_EOF is seen only in auth.c/h. The other files (pqcomm.c, crypt.c, postmaster.c, hba.c, fe-auth.c, fe-connect.c, fe-gssapi-common.c) use only STATUS_OK and STATUS_ERROR. I haven't had a close look, but all of those usages would be equivalent to bool. On the other hand, many functions in fe-*.c and pqcomm.c return EOF(-1)/0 instead of STATUS_EOF(-2)/STATUS_OK(0). We could reorganize the values and their usage, but it doesn't seem to be a big win.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Block level parallel vacuum
On Thu, 16 Jan 2020 at 14:11, Amit Kapila wrote:
>
> On Thu, Jan 16, 2020 at 10:11 AM Mahendra Singh Thalor wrote:
> >
> > On Thu, 16 Jan 2020 at 08:22, Amit Kapila wrote:
> > >
> > > > 2.
> > > > I checked the time taken by the vacuum.sql test. Execution time is
> > > > almost the same with and without the v45 patch.
> > > >
> > > > Without v45 patch:
> > > > Run1) vacuum ... ok 701 ms
> > > > Run2) vacuum ... ok 549 ms
> > > > Run3) vacuum ... ok 559 ms
> > > > Run4) vacuum ... ok 480 ms
> > > >
> > > > With v45 patch:
> > > > Run1) vacuum ... ok 842 ms
> > > > Run2) vacuum ... ok 808 ms
> > > > Run3) vacuum ... ok 774 ms
> > > > Run4) vacuum ... ok 792 ms
> > > >
> > >
> > > I see some variance in results; have you run with autovacuum off?
> > > I was expecting that this might speed up some cases where parallel
> > > vacuum is used by default.
> >
> > I think this is an expected difference in timing because we are adding
> > some vacuum-related tests. I am not starting the server manually (I am
> > starting the server with only the default settings).
> >
>
> Can you once test by setting autovacuum = off? The autovacuum leads
> to variability in test timing.
>

I've also run the regression tests with and without the patch:
* w/o patch and autovacuum = on: 255 ms
* w/o patch and autovacuum = off: 258 ms
* w/ patch and autovacuum = on: 370 ms
* w/ patch and autovacuum = off: 375 ms

> > If we start the server with default settings, then the vacuum-related
> > test cases will not run in parallel, because the index relations are very
> > small, so we will not do a parallel vacuum.

Right. Most indexes (all?) of tables that are used in the regression tests are smaller than min_parallel_index_scan_size. And we set min_parallel_index_scan_size to 0 in vacuum.sql, but VACUUM would not be sped up much because of the relation size. Since we instead populate a new table for parallel vacuum testing, the regression test for vacuum would take a longer time.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thoughts on "killed tuples" index hint bits support on standby
Hello, hackers.

Currently, hint bits in index pages (dead tuples) are set and taken into account only on the primary server. The standby just ignores them. This is done for reasons, of course (see RelationGetIndexScan and [1]):

    * We do this because the xmin on the primary node could easily be
    * later than the xmin on the standby node, so that what the primary
    * thinks is killed is supposed to be visible on standby. So for correct
    * MVCC for queries during recovery we must ignore these hints and check
    * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea to support "ignore_killed_tuples" on standby. I hope I know a way to support it correctly with a reasonable amount of changes.

The first thing we need to consider is that checksums and wal_log_hints are widely used these days. So, at any moment the master could send an FPW page with new "killed tuples" hints and overwrite hints set by the standby. Moreover, it is not possible to distinguish whether hints were set by the primary or the standby.

And this is where hot_standby_feedback comes into play. The master node considers the xmin of hot_standby_feedback replicas (RecentGlobalXmin) while setting "killed tuples" bits. So, if hot_standby_feedback has been enabled on a standby for a while, the standby could safely trust hint bits from the master. Also, the standby could set its own hints using the xmin it sends to the primary during feedback (but without marking the page as dirty).

Of course all is not so easy; there are a few things and corner cases to care about:

* Looks like RecentGlobalXmin could be moved backwards in case a new replica with a lower xmin is connected (or by switching some replica to hot_standby_feedback=on). We must ensure RecentGlobalXmin is moved strictly forward (see the sketch below).

* hot_standby_feedback could be enabled on the fly. In such a case we need to distinguish transactions which are safe or unsafe to deal with hints. The standby could receive a fresh RecentGlobalXmin as a response to its feedback message. All standby transactions with xmin >= RecentGlobalXmin are safe to use hints.

* hot_standby_feedback could be disabled on the fly. In such a situation the standby needs to continue to send feedback while canceling all queries with ignore_killed_tuples=true. Once all such queries are canceled, feedback is no longer needed and should be disabled.

Could someone validate my thoughts please? If the idea is mostly correct, I could try to implement and test it.

[1] - https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org
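To illustrate the "strictly forward" point above: a guard of roughly this shape would refuse to move the trusted horizon backwards when a replica with an older xmin shows up. This is a hypothetical standalone C sketch (the wraparound-aware comparison mirrors how PostgreSQL compares normal transaction IDs; it is not actual backend code):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* true if xid a is logically later than b, using wraparound-aware comparison */
static bool
xid_follows(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) > 0;
}

/* Hypothetical horizon the standby trusts for index hint bits (0 = unset). */
static TransactionId trusted_hint_xmin = 0;

/*
 * Only ever advance the horizon.  A report that would move it backwards
 * (for example, after a new replica with an older xmin connects) is ignored.
 */
static void
advance_trusted_hint_xmin(TransactionId reported_xmin)
{
    if (trusted_hint_xmin == 0 || xid_follows(reported_xmin, trusted_hint_xmin))
        trusted_hint_xmin = reported_xmin;
}
```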
Re: [HACKERS] Block level parallel vacuum
On Thu, Jan 16, 2020 at 4:46 PM Masahiko Sawada wrote: > > Right. Most indexes (all?) of tables that are used in the regression > tests are smaller than min_parallel_index_scan_size. And we set > min_parallel_index_scan_size to 0 in vacuum.sql but VACUUM would not > be speeded-up much because of the relation size. Since we instead > populate new table for parallel vacuum testing the regression test for > vacuum would take a longer time. > Fair enough and I think it is good in a way that it won't change the coverage of existing vacuum code. I have fixed all the issues reported by Mahendra and have fixed a few other cosmetic things in the attached patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com v49-0001-Allow-vacuum-command-to-process-indexes-in-parallel.patch Description: Binary data
Re: [HACKERS] Block level parallel vacuum
Hi all,

I would like to share my observations on this PG feature "Block-level parallel vacuum". I have tested the earlier patch (i.e. v48) with the below high-level test scenarios, and those are working as expected.

- I have played around with these GUC parameters while testing:
    max_worker_processes
    autovacuum = off
    shared_buffers
    max_parallel_workers
    max_parallel_maintenance_workers
    min_parallel_index_scan_size
    vacuum_cost_limit
    vacuum_cost_delay
- Tested parallel vacuum with tables and partitioned tables having all possible datatypes, and columns having various indexes (like btree, gist, etc.) on part of / the full table.
- Tested the pgbench tables data with multiple indexes created manually, and ran the script (vacuum_test.sql) with DMLs and VACUUM for multiple clients, jobs, and durations as below:
    ./pgbench -c 8 -j 16 -T 900 postgres -f vacuum_test.sql
  We observe the usage of parallel workers during VACUUM.
- Ran a few isolation schedule test cases (in regression) with huge data and indexes, perform DMLs -> VACUUM
- Tested with PARTITION TABLEs -> global/local indexes -> DMLs -> VACUUM
- Tested with PARTITION TABLE having a different TABLESPACE in a different location -> global/local indexes -> DMLs -> VACUUM
- With changing STORAGE options for columns (as PLAIN / EXTERNAL / EXTENDED) -> DMLs -> VACUUM
- Create index with the CONCURRENTLY option / changing storage_parameter for the index as below -> DMLs -> VACUUM
    with(buffering=auto) / with(buffering=on) / with(buffering=off) / with(fillfactor=30);
- Tested with creating simple and partitioned tables -> DMLs -> pg_dump/pg_restore/pg_upgrade -> VACUUM
  Verified the data after restore / upgrade / VACUUM.
- Indexes on UUID-OSSP data -> DMLs -> pg_upgrade -> VACUUM
- Verified with various test scenarios for better performance of parallel VACUUM as compared to non-parallel VACUUM:
    Time taken by VACUUM on PG HEAD+PATCH (with PARALLEL) < Time taken by VACUUM on PG HEAD (without PARALLEL)

Machine configuration: (16 VCPUs / RAM: 16GB / Disk size: 640GB)

*PG HEAD:*
VACUUM tab1;
Time: 38915.384 ms (00:38.915)
Time: 48389.006 ms (00:48.389)
Time: 41324.223 ms (00:41.324)
*Time: 37640.874 ms (00:37.641) --median*
Time: 36897.325 ms (00:36.897)
Time: 36351.022 ms (00:36.351)
Time: 36198.890 ms (00:36.199)

*PG HEAD + v48 Patch:*
VACUUM tab1;
Time: 37051.589 ms (00:37.052)
*Time: 33647.459 ms (00:33.647) --median*
Time: 31580.894 ms (00:31.581)
Time: 34442.046 ms (00:34.442)
Time: 31335.960 ms (00:31.336)
Time: 34441.245 ms (00:34.441)
Time: 31159.639 ms (00:31.160)

--
With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com
Re: remove some STATUS_* symbols
On Thu, Jan 16, 2020 at 12:50 AM Michael Paquier wrote: > Thanks, that looks fine. I am still not sure whether the second patch > adding an enum via ProcWaitStatus improves the code readability > though, so my take would be to discard it for now. Perhaps others > think differently, I don't know. IMHO, custom enums for each particular case would be a big improvement over supposedly-generic STATUS codes. It makes it clearer which values are possible in each code path, and it comes out nicer in the debugger, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making psql error out on output failures
David Zhang wrote:

> If I change the error log message like below, where "%m" is used to pass the
> value of strerror(errno), and "could not write to output file:" is copied from
> function "WRITE_ERROR_EXIT",
> - pg_log_error("Error printing tuples");
> + pg_log_error("could not write to output file: %m");
> then the output message is something like below, which, I believe, is more
> consistent with pg_dump.

The problem is that errno may not be reliable to tell us what the problem was that led to ferror(fout) being nonzero, since it isn't saved at the point of the error, and the output code may have called many libc functions between the first occurrence of the output error and the point when pg_log_error() is called.

The linux manpage on errno warns specifically about this:

  NOTES
  A common mistake is to do

      if (somecall() == -1) {
          printf("somecall() failed\n");
          if (errno == ...) { ... }
      }

  where errno no longer needs to have the value it had upon return from
  somecall() (i.e., it may have been changed by the printf(3)).
  If the value of errno should be preserved across a library call, it
  must be saved:

This other bit from the POSIX spec [1] is relevant:

  "The value of errno shall be defined only after a call to a function for
  which it is explicitly stated to be set and until it is changed by the next
  function call or if the application assigns it a value."

To use errno in a way that complies with the above, the psql code should be refactored. I don't know if having a more precise error message justifies that refactoring. I've elaborated a bit about that upthread with the initial submission. Besides, I'm not even sure that errno is necessarily set on non-POSIX platforms when fputc or fputs fails. That's why this patch uses the safer approach of emitting a generic error message.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/errno.html

Best regards,

--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
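As an aside, the pattern the manpage describes is simply to capture errno immediately after the failing call, before any other library call can clobber it. A minimal sketch (hypothetical helper, not psql's actual code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch only: report a write failure using the errno value captured
 * right after the call that failed, before printf/ferror/etc. have a
 * chance to change it.
 */
static void
write_line(FILE *fout, const char *line)
{
    if (fputs(line, fout) == EOF)
    {
        int saved_errno = errno;    /* save it immediately */

        fprintf(stderr, "could not write to output file: %s\n",
                strerror(saved_errno));
    }
}
```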
Re: Patch to document base64 encoding
Hello Karl,

> New patch attached: doc_base64_v13.patch
>
> This required surprisingly little re-wording. Added the word "binary" into
> the descriptions of convert(), substring(), convert_from(), and convert_to().
> I also added data types to the call syntax of set_bit() and set_byte().
>
> And this patch adds hyperlinks from the get_bit(), get_byte(), set_bit(),
> and set_byte() descriptions to the note that offsets are zero-based.
>
> I also removed the hyperlinked asterisks about the hash function results
> and instead hyperlinked the word "hash" in the descriptions. (Links to the
> note about md5() returning hex text and the others returning bytea, and how
> to convert between the two.)

Patch applies cleanly and compiles.

My 0.02€: The overall restructuring and cross references are an improvement.

Some comments about v13:

The note about get_byte reads:

  get_byte and set_byte number the first byte of a binary string as byte 0.
  get_bit and set_bit number bits from the right within each byte; for example
  bit 0 is the least significant bit of the first byte, and bit 15 is the most
  significant bit of the second byte.

The two sentences start with a lower-case letter, which looks strange to me. I'd suggest putting "Functions" at the beginning of the sentences:

  Functions get_byte and set_byte number the first byte of a binary string as
  byte 0. Functions get_bit and set_bit number bits from the right within each
  byte; for example bit 0 is the least significant bit of the first byte, and
  bit 15 is the most significant bit of the second byte.

The note about hash provides an example for getting the hex representation out of sha*. I'd add an example to get the bytea representation from md5, e.g. "DECODE(MD5('hello world'), 'hex')"…

Maybe the encode/decode in the note could be linked to the function descriptions? Well, they are just after; maybe it is not very useful.

The "Binary String Functions and Operators" section 9.5 has only one subsection, "9.5.1", which starts about two thirds of the way down the page. This structure looks weird. ISTM that either a subsection is missing for the beginning of the page, or the subsection should just be dropped, because it is somehow redundant with the table title. The "9.4" section has the same structural weirdness. Either remove the subsection, or add some for the other parts?

--
Fabien.
Re: SlabCheck leaks memory into TopMemoryContext
On Wed, Jan 15, 2020 at 10:41:43PM -0800, Andres Freund wrote:
> Hi,
>
> On 2020-01-16 01:25:00 -0500, Tom Lane wrote:
> > Andres Freund writes:
> > > On 2020-01-16 00:09:53 -0500, Tom Lane wrote:
> > > > It's basically assuming that the memory management mechanism is sane,
> > > > which makes the whole thing fundamentally circular, even if it's
> > > > relying on some other context to be sane. Is there a way to do the
> > > > checking without extra allocations?
> > >
> > > Probably not easily.
> >
> > In the AllocSet code, we don't hesitate to expend extra space per-chunk
> > for debug support:
> >
> > typedef struct AllocChunkData
> > {
> >     ...
> > #ifdef MEMORY_CONTEXT_CHECKING
> >     Size        requested_size;
> > #endif
> >     ...
> >
> > I don't see a pressing reason why SlabContext couldn't do something
> > similar, either per-chunk or per-context, whatever makes sense.
>
> Well, what I suggested upthread was to just have two globals like
>
>     int slabcheck_freechunks_size;
>     bool *slabcheck_freechunks;
>
> and realloc that to the larger size in SlabContextCreate() if necessary,
> based on the computed chunksPerBlock? I thought you were asking whether
> any additional memory could just be avoided...

I don't see why not to just do what Tom proposed, i.e. allocate this as part of the memory context in SlabContextCreate(), when memory context checking is enabled. It seems much more convenient / simpler than the globals (no repalloc, ...).

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: empty range
> It's only a suggestion; I don't know if somebody wants to store an empty
> range without bounds.

I thought about the same while developing the BRIN inclusion operator class. I am not sure how useful empty ranges are in practice, but keeping their bounds would only bring more flexibility, and eliminate special cases in most of the range operators. For reference, we allow empty boxes, and none of the geometric operators has to handle them specially.
Re: Duplicate Workers entries in some EXPLAIN plans
> Sounds good, I'll try that format. Any idea how to test YAML? With the
> JSON format, I was able to rely on Postgres' own JSON-manipulating
> functions to strip or canonicalize fields that can vary across
> executions--I can't really do that with YAML.

Yes, this approach was clear in the patch and works great with JSON. Also you are correct, this cannot be done with YAML. I spent a bit of time looking around and I could not find any tests on the YAML format.

> Or should I run EXPLAIN
> with COSTS OFF, TIMING OFF, SUMMARY OFF and assume that for simple
> queries the BUFFERS output (and other fields I can't turn off like
> Sort Space Used) *is* going to be stable?

I have to admit that with the current diff tool used in pg_regress, this is not possible. I am pretty certain that it *is not* going to be stable, not for long anyway. I withdraw my suggestion for YAML and am currently waiting for the TEXT format only.
Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node
On Tue, Dec 17, 2019 at 2:19 PM Michael Paquier wrote: > > On Mon, Dec 16, 2019 at 12:22:18PM +0900, Fujii Masao wrote: > > > +Detection of WAL records having references to invalid pages > > > during > > > +recovery causes PostgreSQL to report > > > +an error, aborting the recovery. Setting > > > Well, that's not really an error. This triggers a PANIC, aka crashes > > > the server. And in this case the actual problem is that you may not > > > be able to move on with recovery when restarting the server again, > > > except if luck is on your side because you would continuously face > > > it.. > > > > So you're thinking that "report an error" should be changed to > > "trigger a PANIC"? Personally "report an error" sounds ok because > > PANIC is one of "error", I think. But if that misleads people, > > I will change the sentence. > > In the context of a recovery, an ERROR is promoted to a FATAL, but > here are talking about something that bypasses the crash of the > server. So this could bring confusion. I think that the > documentation should be crystal clear about that, with two aspects > outlined when the parameter is disabled, somewhat like data_sync_retry > actually: > - A PANIC-level error is triggered. > - It crashes the cluster. OK, I updated the patch that way. Attached is the updated version of the patch. Regards, -- Fujii Masao ignore_invalid_pages_v3.patch Description: Binary data
Re: [PATCH v1] pg_ls_tmpdir to show directories
On Thu, Jan 16, 2020 at 09:34:32AM +0100, Fabien COELHO wrote:
> Also, I'm not fully sure why ".*" files should be skipped, maybe it should
> be an option? Or the user can filter it with SQL if it does not want them?

I think if someone wants the full generality, they can do this:

postgres=# SELECT name, s.size, s.modification, s.isdir
           FROM (SELECT 'base/pgsql_tmp'p)p, pg_ls_dir(p)name, pg_stat_file(p||'/'||name)s;
 name | size |      modification      | isdir
------+------+------------------------+-------
 .foo | 4096 | 2020-01-16 08:57:04-05 | t

In my mind, pg_ls_tmpdir() is for showing tmpfiles, not just a shortcut to:

SELECT pg_ls_dir((SELECT 'base/pgsql_tmp'p));

-- or, for all tablespaces:
WITH x AS (
    SELECT format('/PG_%s_%s', split_part(current_setting('server_version'), '.', 1), catalog_version_no) suffix
    FROM pg_control_system()),
y AS (
    SELECT a, pg_ls_dir(a) AS d
    FROM (SELECT DISTINCT COALESCE(NULLIF(pg_tablespace_location(oid),'')||suffix, 'base') a
          FROM pg_tablespace,x)a)
SELECT a, pg_ls_dir(a||'/pgsql_tmp') FROM y WHERE d='pgsql_tmp';

I think changing dotfiles is a topic for another patch. That would also affect pg_ls_dir, and everything else that uses the backing function pg_ls_dir_files_recurse. I'd have to ask why not also show . and .. ? (In fact, if I were to change anything, I would propose to limit pg_ls_tmpdir() to files matching PG_TEMP_FILE_PREFIX.)

Justin
Re: Add pg_file_sync() to adminpack
On Tue, Jan 14, 2020 at 10:08 AM Atsushi Torikoshi wrote: > fails we can get error messages. So it seems straightforward for me to > 'return true on success and emit an ERROR otherwise'. I agree with reporting errors using ERROR, but it seems to me that we ought to then make the function return 'void' rather than 'bool'. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Code cleanup for build_regexp_split_result
Hi hackers,

I find that build_regexp_split_result() has some redundant code; we can move it before the condition check, can't we?

Best regards.

Japin Li

0001-Code-cleanup-for-build_regexp_split_result.patch
Description: 0001-Code-cleanup-for-build_regexp_split_result.patch
Re: progress report for ANALYZE
On Wed, Jan 15, 2020 at 02:11:10PM -0300, Alvaro Herrera wrote: > I just pushed this after some small extra tweaks. > > Thanks, Yamada-san, for seeing this to completion! Find attached minor fixes to docs - sorry I didn't look earlier. Possibly you'd also want to change the other existing instances of "preparing to begin". >From de108e69b5d33c881074b0a04697d7061684f823 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 15 Jan 2020 23:10:29 -0600 Subject: [PATCH v1] Doc review for ANALYZE progress (a166d408) --- doc/src/sgml/monitoring.sgml | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 8b44fb1..10871b7 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -3525,7 +3525,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid, Whenever ANALYZE is running, the pg_stat_progress_analyze view will contain a - row for each backend that is currently running that command. The tables + row for each backend currently running ANALYZE. The tables below describe the information that will be reported and provide information about how to interpret it. @@ -3635,7 +3635,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid, initializing - The command is preparing to begin scanning the heap. This phase is + The command is preparing to scan the heap. This phase is expected to be very brief. @@ -3643,7 +3643,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid, acquiring sample rows The command is currently scanning the table given by - current_relid to obtain sample rows. + relid to obtain sample rows. @@ -3659,14 +3659,14 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid, computing statistics - The command is computing statistics from the samples rows obtained during + The command is computing statistics from the sample rows obtained during the table scan. computing extended statistics - The command is computing extended statistics from the samples rows obtained + The command is computing extended statistics from the sample rows obtained durring the table scan. -- 2.7.4
Re: empty range
Emre Hasegeli writes:
>> It's only a suggestion; I don't know if somebody wants to store an empty
>> range without bounds.

> I thought about the same while developing the BRIN inclusion operator
> class. I am not sure how useful empty ranges are in practice, but
> keeping their bounds would only bring more flexibility, and eliminate
> special cases in most of the range operators. For reference, we allow
> empty boxes, and none of the geometric operators has to handle them
> specially.

I think it'd just move the special cases somewhere else. Consider

regression=# select int4range(4,4) = int4range(5,5);
 ?column?
----------
 t
(1 row)

How do you preserve that behavior ... or if you don't, how much damage does that do to the semantics of ranges? Right now there's a pretty solid set-theoretic basis for understanding what a range is, i.e. two ranges are the same if they include the same sets of elements. It seems like that goes out the window if we don't consider that all empty ranges are the same.

BTW, I think the main reason for all the bound-normalization pushups is to try to have a rule that ranges that are set-theoretically equal will look the same. That also goes out the window if we make empty ranges look like this.

regards, tom lane
Re: [HACKERS] WIP: Aggregation push-down
Tomas Vondra wrote: > Hi, > > I've been looking at the last version (v14) of this patch series, > submitted way back in July and unfortunately quiet since then. Antonin > is probably right one of the reasons for the lack of reviews is that it > requires interest/experience with planner. > > Initially it was also a bit hard to understand what are the benefits > (and the patch shifted a bit), which is now mostly addressed by the > README in the last patch. The trouble is that's hidden in the patch and > so not immediately obvious to people considering reviewing it :-( Tom > posted a nice summary in November 2018, but it was perhaps focused on > the internals. > > So my first suggestion it to write a short summary with example how it > affects practical queries (plan change, speedup) etc. I think README plus regression test output should be enough for someone who is about to review patch as complex as this. > My second suggestion is to have meaningful commit messages for each part > of the patch series, explaining why we need that particular part. It > might have been explained somewhere in the thread, but as a reviewer I > really don't want to reverse-engineer the whole thread. ok, done. > Now, regarding individual parts of the patch: > > > 1) v14-0001-Introduce-RelInfoList-structure.patch > - > > - I'm not entirely sure why we need this change. We had the list+hash > before, so I assume we do this because we need the output functions? I believe that this is what Tom proposed in [1], see "Maybe an appropriate preliminary patch is to refactor the support code ..." near the end of that message. The point is that now we need two lists: one for "plain" relations and one for grouped ones. > - The RelInfoList naming is pretty confusing, because it's actually not > a list but a combination of list+hash. And the embedded list is called > items, to make it yet a bit more confusing. I suggest we call this > just RelInfo or RelInfoLookup or something else not including List. I think it can be considered a list by the caller of add_join_rel() etc. The hash table is hidden from user and is empty until the list becomes long enough. The word "list" in the structure name may just indicate that new elements can be added to the end, which shouldn't be expected if the structure was an array. I searched a bit in tools/pgindent/typedefs.list and found at least a few structures that also do have "list" in the name but are not lists internally: CatCList, FuncCandidateList, MCVList. Actually I'm not opposed to renaming the structure, but don't have better idea right now. As for *Lookup: following is the only use of such a structure name in PG code and it's not something to store data in: typedef enum { IDENTIFIER_LOOKUP_NORMAL, /* normal processing of var names */ IDENTIFIER_LOOKUP_DECLARE, /* In DECLARE --- don't look up names */ IDENTIFIER_LOOKUP_EXPR /* In SQL expression --- special case */ } IdentifierLookup; > - RelInfoEntry seems severely under-documented, particularly the data > part (which is void* making it even harder to understand what's it for > without having to read the functions). AFAICS it's just simply a link > to the embedded list, but maybe I'm wrong. This is just JoinHashEntry (which was also undocumented) renamed. I've added a short comment now. > - I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c > right before build_join_rel. This makes diff clearer/smaller and > visual diffs nicer. 
Hm, it might improve readability of the diff, but this API is exactly where I still need feedback from Tom. I'm not eager to optimize the diff as long as there's a risk that these functions will be removed or renamed. > 2) v14-0002-Introduce-make_join_rel_common-function.patch > - > > I see no point in keeping this change in a separate patch, as it prety > much just renames make_join_rel to make_join_rel_common and then adds > > make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2) > { > return make_join_rel_common(root, rel1, rel2); > } > > which seems rather trivial and useless on it's own. I'd just merge it > into 0003 where we use the _common function more extensively. ok. I thought that this improves readability of the diffs, but it doesn't look that bad if this is included in 0003. > > 3) v14-0003-Aggregate-push-down-basic-functionality.patch > - > > I haven't done much review on this yet, but I've done some testing and > benchmarking so let me share at least that. The queries I used were of > the type I mentioned earlier in this thread - essentially a star schema, > i.e. fact table referencing dimensions, with aggregation per columns in > the dimension. So something like > > SELECT d.c, sum(f) FROM fact JOIN dimension d ON (
Re: SlabCheck leaks memory into TopMemoryContext
Andres Freund writes: > ... I thought you were asking whether > any additional memory could just be avoided... Well, I was kind of wondering that, but if it's not practical then preallocating the space instead will do. regards, tom lane
Re: SlabCheck leaks memory into TopMemoryContext
On Thu, Jan 16, 2020 at 10:27:01AM -0500, Tom Lane wrote: Andres Freund writes: ... I thought you were asking whether any additional memory could just be avoided... Well, I was kind of wondering that, but if it's not practical then preallocating the space instead will do. I don't think it's practical to rework the checks in a way that would not require allocations. Maybe it's possible, but I think it's not worth the extra complexity. The attached fix should do the trick - it pre-allocates the space when creating the context. There is a bit of complexity because we want to allocate the space as part of the context header, but nothin too bad. We might optimize it a bit by using a regular bitmap (instead of just an array of bools), but I haven't done that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/utils/mmgr/slab.c b/src/backend/utils/mmgr/slab.c index c5866d9cc3..7f9749a73f 100644 --- a/src/backend/utils/mmgr/slab.c +++ b/src/backend/utils/mmgr/slab.c @@ -70,6 +70,9 @@ typedef struct SlabContext int chunksPerBlock; /* number of chunks per block */ int minFreeChunks; /* min number of free chunks in any block */ int nblocks;/* number of blocks allocated */ +#ifdef MEMORY_CONTEXT_CHECKING + char *freechunks; /* bitmap of free chunks on a block */ +#endif /* blocks with free space, grouped by number of free chunks: */ dlist_head freelist[FLEXIBLE_ARRAY_MEMBER]; } SlabContext; @@ -229,6 +232,15 @@ SlabContextCreate(MemoryContext parent, /* Size of the memory context header */ headerSize = offsetof(SlabContext, freelist) + freelistSize; +#ifdef MEMORY_CONTEXT_CHECKING + /* +* With memory checking, we need to allocate extra space for the bitmap +* of free chunks. The space is allocated at the end, and we need proper +* alignment (it's an array of bools, so maybe MAXALIGN is not needed). 
+*/ + headerSize = MAXALIGN(headerSize) + chunksPerBlock * sizeof(bool); +#endif + slab = (SlabContext *) malloc(headerSize); if (slab == NULL) { @@ -258,6 +270,12 @@ SlabContextCreate(MemoryContext parent, for (i = 0; i < (slab->chunksPerBlock + 1); i++) dlist_init(&slab->freelist[i]); +#ifdef MEMORY_CONTEXT_CHECKING + /* set the freechunks pointer after the freelists array (aligned) */ + slab->freechunks = (char *) slab + + MAXALIGN(offsetof(SlabContext, freelist) + freelistSize); +#endif + /* Finally, do the type-independent part of context creation */ MemoryContextCreate((MemoryContext) slab, T_SlabContext, @@ -701,14 +719,10 @@ SlabCheck(MemoryContext context) int i; SlabContext *slab = castNode(SlabContext, context); const char *name = slab->header.name; - char *freechunks; Assert(slab); Assert(slab->chunksPerBlock > 0); - /* bitmap of free chunks on a block */ - freechunks = palloc(slab->chunksPerBlock * sizeof(bool)); - /* walk all the freelists */ for (i = 0; i <= slab->chunksPerBlock; i++) { @@ -731,7 +745,7 @@ SlabCheck(MemoryContext context) name, block->nfree, block, i); /* reset the bitmap of free chunks for this block */ - memset(freechunks, 0, (slab->chunksPerBlock * sizeof(bool))); + memset(slab->freechunks, 0, (slab->chunksPerBlock * sizeof(bool))); idx = block->firstFreeChunk; /* @@ -748,7 +762,7 @@ SlabCheck(MemoryContext context) /* count the chunk as free, add it to the bitmap */ nfree++; - freechunks[idx] = true; + slab->freechunks[idx] = true; /* read index of the next free chunk */ chunk = SlabBlockGetChunk(slab, block, idx); @@ -759,7 +773,7 @@ SlabCheck(MemoryContext context) for (j = 0; j < slab->chunksPerBlock; j++) { /* non-zero bit in the bitmap means chunk the chunk is used */ - if (!freechunks[j]) + if (!slab->freechunks[j]) { SlabChunk *chunk = SlabBlockGetChunk(slab, block, j);
Re: Code cleanup for build_regexp_split_result
Li Japin writes: > I find the build_regexp_split_result() has redundant codes, we can move it to > before the condition check, can we? Hm, yeah, that looks a bit strange. It was less strange before c8ea87e4bd950572cba4575e9a62284cebf85ac5, I think. Pushed with some additional simplification to get rid of the rather ugly (IMO) PG_USED_FOR_ASSERTS_ONLY variable. regards, tom lane
Re: SlabCheck leaks memory into TopMemoryContext
Tomas Vondra writes: > The attached fix should do the trick - it pre-allocates the space when > creating the context. There is a bit of complexity because we want to > allocate the space as part of the context header, but nothin too bad. We > might optimize it a bit by using a regular bitmap (instead of just an > array of bools), but I haven't done that. Hmm ... so if this is an array of bools, why isn't it declared bool* rather than char* ? (Pre-existing ugliness, sure, but we might as well fix it while we're here. Especially since you used sizeof(bool) in the space calculation.) I agree that maxaligning the start point of the array is pointless. I'd write "free chunks in a block" not "free chunks on a block", the latter seems rather shaky English. But that's getting picky. LGTM otherwise. regards, tom lane
Re: SlabCheck leaks memory into TopMemoryContext
Hi, On 2020-01-16 17:25:00 +0100, Tomas Vondra wrote: > On Thu, Jan 16, 2020 at 10:27:01AM -0500, Tom Lane wrote: > > Andres Freund writes: > > > ... I thought you were asking whether > > > any additional memory could just be avoided... > > > > Well, I was kind of wondering that, but if it's not practical then > > preallocating the space instead will do. > > > > I don't think it's practical to rework the checks in a way that would > not require allocations. Maybe it's possible, but I think it's not worth > the extra complexity. > > The attached fix should do the trick - it pre-allocates the space when > creating the context. There is a bit of complexity because we want to > allocate the space as part of the context header, but nothin too bad. We > might optimize it a bit by using a regular bitmap (instead of just an > array of bools), but I haven't done that. I don't get why it's advantageous to allocate this once for each slab, rather than having it as a global once for all slabs. But anyway, still clearly better than the current situation. - Andres
Re: SlabCheck leaks memory into TopMemoryContext
On Thu, Jan 16, 2020 at 08:48:49AM -0800, Andres Freund wrote:
> Hi,
>
> On 2020-01-16 17:25:00 +0100, Tomas Vondra wrote:
> > On Thu, Jan 16, 2020 at 10:27:01AM -0500, Tom Lane wrote:
> > > Andres Freund writes:
> > > > ... I thought you were asking whether
> > > > any additional memory could just be avoided...
> > >
> > > Well, I was kind of wondering that, but if it's not practical then
> > > preallocating the space instead will do.
> >
> > I don't think it's practical to rework the checks in a way that would
> > not require allocations. Maybe it's possible, but I think it's not worth
> > the extra complexity.
> >
> > The attached fix should do the trick - it pre-allocates the space when
> > creating the context. There is a bit of complexity because we want to
> > allocate the space as part of the context header, but nothing too bad.
> > We might optimize it a bit by using a regular bitmap (instead of just an
> > array of bools), but I haven't done that.
>
> I don't get why it's advantageous to allocate this once for each slab,
> rather than having it as a global once for all slabs. But anyway, still
> clearly better than the current situation.

It's largely a matter of personal preference - I agree there are cases when global variables are the best solution, but I kinda dislike them. It seems cleaner to just allocate it as part of the slab, not having to deal with a different number of chunks per block between slabs.

Plus we don't have all that many slabs (like 2), and it's only really used in debug builds anyway. So I'm not all that worried about this wasting a couple extra kB of memory. YMMV of course ...

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SlabCheck leaks memory into TopMemoryContext
On Thu, Jan 16, 2020 at 11:43:34AM -0500, Tom Lane wrote: Tomas Vondra writes: The attached fix should do the trick - it pre-allocates the space when creating the context. There is a bit of complexity because we want to allocate the space as part of the context header, but nothin too bad. We might optimize it a bit by using a regular bitmap (instead of just an array of bools), but I haven't done that. Hmm ... so if this is an array of bools, why isn't it declared bool* rather than char* ? (Pre-existing ugliness, sure, but we might as well fix it while we're here. Especially since you used sizeof(bool) in the space calculation.) True. Will fix. I agree that maxaligning the start point of the array is pointless. I'd write "free chunks in a block" not "free chunks on a block", the latter seems rather shaky English. But that's getting picky. LGTM otherwise. OK. Barring objections I'll push and backpatch this later today. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Physical replication slot advance is not persistent
On 09.01.2020 09:36, Kyotaro Horiguchi wrote: Hello. At Sun, 29 Dec 2019 15:12:16 +0300, Alexey Kondratov wrote in On 2019-12-26 16:35, Alexey Kondratov wrote: Another concern is that ReplicationSlotIsDirty is added with the only one user. It also cannot be used by SaveSlotToPath due to the simultaneous usage of both flags dirty and just_dirtied there. In that way, I hope that we should call ReplicationSlotSave unconditionally in the pg_replication_slot_advance, so slot will be saved or not automatically based on the slot->dirty flag. In the same time, ReplicationSlotsComputeRequiredXmin and ReplicationSlotsComputeRequiredLSN should be called by anyone, who modifies xmin and LSN fields in the slot. Otherwise, currently we are getting some leaky abstractions. Sounds reasonable. Great, so it seems that we have reached some agreement about who should mark slot as dirty, at least for now. If someone will utilise old WAL and after that crash will happen between steps 2) and 3), then we start with old value of restart_lsn, but without required WAL. I do not know how to properly reproduce it without gdb and power off, so the chance is pretty low, but still it could be a case. In the first place we advance required LSN for every reply message but save slot data only at checkpoint on physical repliation. Such a strict guarantee seems too much. ... I think we shouldn't touch the paths used by replication protocol. And don't we focus on how we make a change of a replication slot from SQL interface persistent? It seems to me that generaly we don't need to save dirty slots other than checkpoint, but the SQL function seems wanting the change to be saved immediately. As the result, please find the attached, which is following only the first paragraph cited above. OK, I have definitely overthought that, thanks. This looks like a minimal subset of changes that actually solves the bug. I would only prefer to keep some additional comments (something like the attached), otherwise after half a year it will be unclear again, why we save slot unconditionally here. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index bb69683e2a..084e0c2960 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) MyReplicationSlot->data.restart_lsn = moveto; SpinLockRelease(&MyReplicationSlot->mutex); retlsn = moveto; + + ReplicationSlotMarkDirty(); + + /* We moved retart_lsn, update the global value. */ + ReplicationSlotsComputeRequiredLSN(); } return retlsn; @@ -564,7 +569,10 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) (uint32) (moveto >> 32), (uint32) moveto, (uint32) (minlsn >> 32), (uint32) minlsn))); - /* Do the actual slot update, depending on the slot type */ + /* + * Do the actual slot update, depending on the slot type. Slot will be + * marked as dirty by pg_*_replication_slot_advance if changed. + */ if (OidIsValid(MyReplicationSlot->data.database)) endlsn = pg_logical_replication_slot_advance(moveto); else @@ -573,14 +581,11 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) values[0] = NameGetDatum(&MyReplicationSlot->data.name); nulls[0] = false; - /* Update the on disk state when lsn was updated. 
*/ - if (XLogRecPtrIsInvalid(endlsn)) - { - ReplicationSlotMarkDirty(); - ReplicationSlotsComputeRequiredXmin(false); - ReplicationSlotsComputeRequiredLSN(); - ReplicationSlotSave(); - } + /* + * Update the on disk state. No work here if + * pg_*_replication_slot_advance call was a no-op. + */ + ReplicationSlotSave(); ReplicationSlotRelease();
Re: our checks for read-only queries are not great
On Tue, Jan 14, 2020 at 1:46 PM Stephen Frost wrote: > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > Robert Haas writes: > > > Speaking of sensible progress, I think we've drifted off on a tangent > > > here about ALTER SYSTEM. > > > > Agreed, that's not terribly relevant for the proposed patch. > > I agree that the proposed patch seems alright by itself, as the changes > it's making to existing behavior seem to all be bug-fixes and pretty > clear improvements not really related to 'read-only' transactions. There seems to be no disagreement on this point, so I have committed the patch. > It's unfortunate that we haven't been able to work through to some kind > of agreement around what "SET TRANSACTION READ ONLY" means, so that > users of it can know what to expect. I at least feel like we have a pretty good handle on what it was intended to mean; that is, "doesn't cause semantically significant changes to pg_dump output." I do hear some skepticism as to whether that's the best definition, but it has pretty good explanatory power relative to the current state of the code, which is something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: our checks for read-only queries are not great
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Tue, Jan 14, 2020 at 1:46 PM Stephen Frost wrote: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > > Robert Haas writes: > > > > Speaking of sensible progress, I think we've drifted off on a tangent > > > > here about ALTER SYSTEM. > > > > > > Agreed, that's not terribly relevant for the proposed patch. > > > > I agree that the proposed patch seems alright by itself, as the changes > > it's making to existing behavior seem to all be bug-fixes and pretty > > clear improvements not really related to 'read-only' transactions. > > There seems to be no disagreement on this point, so I have committed the > patch. Works for me. > > It's unfortunate that we haven't been able to work through to some kind > > of agreement around what "SET TRANSACTION READ ONLY" means, so that > > users of it can know what to expect. > > I at least feel like we have a pretty good handle on what it was > intended to mean; that is, "doesn't cause semantically significant > changes to pg_dump output." I do hear some skepticism as to whether > that's the best definition, but it has pretty good explanatory power > relative to the current state of the code, which is something. I think I agree with you regarding the original intent, though even there, as discussed elsewhere, it seems like there's perhaps either a bug or a disagreement about the specifics of what that means when it relates to committing a 2-phase transaction. Still, setting that aside for the moment, do we feel like this is enough to be able to update our documentation with? Thanks, Stephen signature.asc Description: PGP signature
Re: SlabCheck leaks memory into TopMemoryContext
Tomas Vondra writes: > On Thu, Jan 16, 2020 at 08:48:49AM -0800, Andres Freund wrote: >> I don't get why it's advantageous to allocate this once for each slab, >> rather than having it as a global once for all slabs. But anyway, still >> clearly better than the current situation. > It's largely a matter of personal preference - I agree there are cases > when global variables are the best solution, but I kinda dislike them. > It seems cleaner to just allocate it as part of the slab, not having to > deal with different number of chunks per block between slabs. > Plus we don't have all that many slabs (like 2), and it's only really > used in debug builds anyway. So I'm not all that woried about this > wasting a couple extra kB of memory. A positive argument for doing it like this is that the overhead goes away when the SlabContexts are all deallocated, while a global variable would presumably stick around indefinitely. But I concur that in current usage, there's hardly any point in worrying about the relative benefits. We should just keep it simple, and this seems marginally simpler than the other way. regards, tom lane
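To make the trade-off concrete, here is a small standalone C sketch of the per-slab approach under discussion; the struct and field names (Slab, freechunks) are invented for illustration and this is not the PostgreSQL slab allocator. The scratch array used for checking lives in the same allocation as the container, so the check routine never allocates, and the space vanishes when the container is freed, which is the property noted above.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Standalone illustration (not PostgreSQL code): keep the consistency-check
 * scratch array inside the container it belongs to, so the check routine
 * itself never allocates and the memory goes away with the container.
 */
typedef struct Slab
{
	int			chunksPerBlock;
	bool	   *freechunks;		/* scratch space used only while checking */
} Slab;

static Slab *
slab_create(int chunksPerBlock)
{
	/* a single allocation holds both the header and the scratch array */
	Slab	   *slab = malloc(sizeof(Slab) + chunksPerBlock * sizeof(bool));

	slab->chunksPerBlock = chunksPerBlock;
	slab->freechunks = (bool *) (slab + 1);
	return slab;
}

static void
slab_check(const Slab *slab)
{
	/* no allocation here; just reuse the per-slab scratch array */
	memset(slab->freechunks, 0, slab->chunksPerBlock * sizeof(bool));
	/* ... walk the free lists and blocks, marking and verifying chunks ... */
}

int
main(void)
{
	Slab	   *slab = slab_create(64);

	slab_check(slab);
	printf("check ran without touching any global context\n");
	free(slab);
	return 0;
}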
Re: psql - add SHOW_ALL_RESULTS option
Hi, This is one of the patches already marked as RFC (since September by Alvaro). Anyone interested in actually pushing it, so that it does not fall through to yet another commitfest? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Thoughts on "killed tuples" index hint bits support on standby
Hi, On 2020-01-16 14:30:12 +0300, Michail Nikolaev wrote: > First thing we need to consider - checksums and wal_log_hints are > widely used these days. So, at any moment master could send FPW page > with new "killed tuples" hints and overwrite hints set by standby. > Moreover it is not possible to distinguish hints are set by primary or > standby. Note that the FPIs are only going to be sent for the first write to a page after a checksum. I don't think you're suggesting we rely on them for correctness (this far into the email at least), but it still seems worthwhile to point out. > And there is where hot_standby_feedback comes to play. Master node > considers xmin of hot_standy_feedback replicas (RecentGlobalXmin) > while setting "killed tuples" bits. So, if hot_standby_feedback is > enabled on standby for a while - it could safely trust hint bits from > master. Well, not easily. There's no guarantee that the node it reports hot_standby_feedback to is actually the primary. It could be an cascading replica setup, that doesn't report hot_standby_feedback upstream. Also hot_standby_feedback only takes effect while a streaming connection is active, if that is even temporarily interrupted, the primary will loose all knowledge of the standby's horizon - unless replication slots are in use, that is. Additionally, we also need to handle the case where the replica currently has restarted, and is recovering using local WAL, and/or archive based recovery. In that case the standby could already have sent a *newer* horizon as feedback upstream, but currently again have an older view. It is entirely possible that the standby is consistent and queryable in such a state (if nothing forced disk writes during WAL replay, minRecoveryLSN will not be set to something too new). > Also, standby could set own hints using xmin it sends to primary > during feedback (but without marking page as dirty). We do something similar for heap hint bits already... > Of course all is not so easy, there are a few things and corner cases > to care about > * Looks like RecentGlobalXmin could be moved backwards in case of new > replica with lower xmin is connected (or by switching some replica to > hot_standby_feedback=on). We must ensure RecentGlobalXmin is moved > strictly forward. I'm not sure this is a problem. If that happens we cannot rely on the different xmin horizon anyway, because action may have been taken on the old RecentGlobalXmin. Thus we need to be safe against that anyway. > * hot_standby_feedback could be enabled on the fly. In such a case we > need distinguish transactions which are safe or unsafe to deal with > hints. Standby could receive fresh RecentGlobalXmin as response to > feedback message. All standby transactions with xmin >= > RecentGlobalXmin are safe to use hints. > * hot_standby_feedback could be disabled on the fly. In such situation > standby needs to continue to send feedback while canceling all queries > with ignore_killed_tuples=true. Once all such queries are canceled - > feedback are no longer needed and should be disabled. I don't think we can rely on hot_standby_feedback at all. We can to avoid unnecessary cancellations, etc, and even assume it's setup up reasonably for some configurations, but there always needs to be an independent correctness backstop. I think it might be more promising to improve the the kill logic based on the WAL logged horizons from the primary. 
All I think we need to do is to use a more conservatively computed RecentGlobalXmin when determining whether tuples are dead. We already regularly log an xl_running_xacts record; adding information about the primary's horizon to that, and stashing it in shared memory on the standby, shouldn't be too hard. Then we can make correct, albeit likely overly pessimistic, visibility determinations about tuples, and go on to set LP_DEAD. There are some complexities around how to avoid unnecessary query cancellations. We'd not want to trigger recovery conflicts based on the new xl_running_xacts field, as that'd make the conflict rate go through the roof - but I think we could safely use the logical minimum of the local RecentGlobalXmin and the primary's. That should allow us to set additional LP_DEAD bits safely, I believe. We could even rely on those LP_DEAD bits. But I'm less clear on how we can make sure that we can *rely* on LP_DEAD to skip over entries during scans. The bits set as described above would be safe, but we can also see LP_DEAD set by the primary (and even by upstream cascading standbys, at least in the case of new base backups taken from them), due to them not being WAL logged. As we don't WAL log, there is no conflict associated with the LP_DEADs being set. My gut feeling is that it's going to be very hard to get around this without adding WAL logging for _bt_killitems et al (including an interface for kill_prior_tuple to report the used horizon to the index). I'm wondering if we could recycle BTPageOpaqueData.xact to store the horizon applying to killed tuples on the page. We don't need to store the level for leaf pages, because we have BTP_LEAF, so we could make space for that (potentially signalled by a new BTP flag). Obviously we have to be careful with storing xids in the index, due to the potential wraparound danger - but I think such a page would have to be vacuumed anyway, before a potential wraparound. I think we could safely unset the xid during nbtree single page cleanup, and vacuum, by making sure no LP_DEAD entries survive, and by including the horizon in the generated WAL record. That however still doesn't fully allow us to set LP_DEAD on standbys - but it'd allow us to take the primary's LP_DEADs into account on a standby. I think we'd run into torn page issues if we were to do so without WAL logging, because we'd rely on the LP_DEAD bits and BTPageOpaqueData.xact being in sync. I *think* we might be safe to do so *iff* the page's LSN indicates that there has been a WAL record covering it since the last redo location. Greetings, Andres Freund
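To make the "logical minimum of the two horizons" idea concrete, here is a small standalone C sketch. The comparison mimics the wraparound-aware behaviour of the backend's TransactionIdPrecedes(), but the helper names and types are invented for illustration; this is not proposed code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/*
 * Wraparound-aware "is xid1 older than xid2?", in the spirit of the
 * backend's TransactionIdPrecedes(); toy version that ignores the
 * special/permanent XIDs.
 */
static bool
xid_precedes(TransactionId xid1, TransactionId xid2)
{
	int32_t		diff = (int32_t) (xid1 - xid2);

	return diff < 0;
}

/*
 * Hypothetical helper for the scheme sketched above: when deciding whether
 * a standby may set LP_DEAD for a tuple, use the older (more conservative)
 * of the locally computed horizon and the horizon the primary advertised
 * via xl_running_xacts.
 */
static TransactionId
conservative_horizon(TransactionId local_horizon, TransactionId primary_horizon)
{
	return xid_precedes(local_horizon, primary_horizon) ?
		local_horizon : primary_horizon;
}

int
main(void)
{
	/* plain case: the numerically smaller horizon wins */
	printf("%u\n", (unsigned) conservative_horizon(1000, 900));
	/* wraparound case: 4294967000 is treated as older than 10 */
	printf("%u\n", (unsigned) conservative_horizon(10, 4294967000u));
	return 0;
}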
Re: SlabCheck leaks memory into TopMemoryContext
Hi, On 2020-01-16 18:01:53 +0100, Tomas Vondra wrote: > Plus we don't have all that many slabs (like 2) FWIW, I have a local patch that adds additional ones, for the relcache and catcache, that's how I noticed the leak. Because a test pgbench absolutely *tanked* in performance. Just for giggles. With leak: pgbench -n -M prepared -P1 -c20 -j20 -T6000 -S progress: 1.0 s, 81689.4 tps, lat 0.242 ms stddev 0.087 progress: 2.0 s, 51228.5 tps, lat 0.390 ms stddev 0.107 progress: 3.0 s, 42297.4 tps, lat 0.473 ms stddev 0.141 progress: 4.0 s, 34885.9 tps, lat 0.573 ms stddev 0.171 progress: 5.0 s, 31211.2 tps, lat 0.640 ms stddev 0.182 progress: 6.0 s, 27307.9 tps, lat 0.732 ms stddev 0.216 progress: 7.0 s, 25698.9 tps, lat 0.778 ms stddev 0.228 without: pgbench -n -M prepared -P1 -c20 -j20 -T6000 -S progress: 1.0 s, 144119.1 tps, lat 0.137 ms stddev 0.047 progress: 2.0 s, 148092.8 tps, lat 0.135 ms stddev 0.039 progress: 3.0 s, 148757.0 tps, lat 0.134 ms stddev 0.032 progress: 4.0 s, 148553.7 tps, lat 0.134 ms stddev 0.038 I do find the size of the impact quite impressive. It's all due to the TopMemoryContext's AllocSetCheck() taking longer and longer. > and it's only really used in debug builds anyway. So I'm not all that > woried about this wasting a couple extra kB of memory. IDK, making memory usage look different makes optimizing it harder. Not a hard rule, obviously, but ... Greetings, Andres Freund
Re: psql - add SHOW_ALL_RESULTS option
Tomas Vondra writes: > This is one of the patches already marked as RFC (since September by > Alvaro). Anyone interested in actually pushing it, so that it does not > fall through to yet another commitfest? TBH, I think we'd be better off to reject it. This makes a nontrivial change in a very long-standing psql behavior, with AFAICS no way to get back the old semantics. (The thread title is completely misleading about that; there's no "option" in the patch as it stands.) Sure, in a green field this behavior would likely be more sensible ... but that has to be weighed against the fact that it's behaved the way it does for a long time, and any existing scripts that are affected by that behavior have presumably deliberately chosen to use it. I can't imagine that changing this will make very many people happier. It seems much more likely that people who are affected will be unhappy. The compatibility issue could be resolved by putting in the option that I suppose was there at the beginning. But then we'd have to have a debate about which behavior would be default, and there would still be the question of who would find this to be an improvement. If you're chaining together commands with \; then it's likely that you are happy with the way it behaves today. Certainly there's been no drumbeat of bug reports about it. regards, tom lane
Re: making the backend's json parser work in frontend code
Hi Robert, On 1/15/20 2:02 PM, Robert Haas wrote: > The discussion on the backup manifest thread has gotten bogged down on > the issue of the format that should be used to store the backup > manifest file. I want something simple and ad-hoc; David Steele and > Stephen Frost prefer JSON. That is problematic because our JSON parser > does not work in frontend code, and I want to be able to validate a > backup against its manifest, which involves being able to parse the > manifest from frontend code. The latest development over there is that > David Steele has posted the JSON parser that he wrote for pgbackrest > with an offer to try to adapt it for use in front-end PostgreSQL code, > an offer which I genuinely appreciate. I'll write more about that over > on that thread. However, I decided to spend today doing some further > investigation of an alternative approach, namely making the backend's > existing JSON parser work in frontend code as well. I did not solve > all the problems there, but I did come up with some patches which I > think would be worth committing on independent grounds, and I think > the whole series is worth posting. So here goes. I was starting to wonder if it wouldn't be simpler to go back to the Postgres JSON parser and see if we can adapt it. I'm not sure that it *is* simpler, but it would almost certainly be more acceptable. > 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm > missing something, this seems like an overdue cleanup. It's long been > the case that wchar.c is actually compiled and linked into both > frontend and backend code. Commit > 60f11b87a2349985230c08616fa8a34ffde934c8 added code into src/common > that depends on wchar.c being available, but didn't actually make > wchar.c part of src/common, which seems like an odd decision: the > functions in the library are dependent on code that is not part of any > library but whose source files get copied around where needed. Eh? This looks like an obvious improvement to me. > 0002 does some basic header cleanup to make it possible to include the > existing header file jsonapi.h in frontend code. The state of the JSON > headers today looks generally poor. There seems not to have been much > attempt to get the prototypes for a given source file, say foo.c, into > a header file with the same name, say foo.h. Also, dependencies > between various header files seem to be have added somewhat freely. > This patch does not come close to fixing all that, but I consider it a > modest down payment on a cleanup that probably ought to be taken > further. Agreed that these header files are fairly disorganized. In general the names json, jsonapi, jsonfuncs don't tell me a whole lot. I feel like I'd want to include json.h to get a json parser but it only contains one utility function before these patches. I can see that json.c primarily contains SQL functions so that's why. So the idea here is that json.c will have the JSON SQL functions, jsonb.c the JSONB SQL functions, and jsonapi.c the parser, and jsonfuncs.c the utility functions? > 0003 splits json.c into two files, json.c and jsonapi.c. All the > lexing and parsing stuff (whose prototypes are in jsonapi.h) goes into > jsonapi.c, while the stuff that pertains to the 'json' data type > remains in json.c. 
This also seems like a good cleanup, because to me, > at least, it's not a great idea to mix together code that is used by > both the json and jsonb data types as well as other things in the > system that want to generate or parse json together with things that > are specific to the 'json' data type. This seems like a good first step. I wonder if the remainder of the SQL json/jsonb functions should be moved to json.c/jsonb.c respectively? That does represent a lot of code churn though, so perhaps not worth it. > As far as I know all three of the above patches are committable as-is; > review and contrary opinions welcome. Agreed, with some questions as above. > On the other hand, 0004, 0005, and 0006 are charitably described as > experimental or WIP. 0004 and 0005 hack up jsonapi.c so that it can > still be compiled even if #include "postgres.h" is changed to #include > "postgres-fe.h" and 0006 moves it into src/common. Note that I say > that they make it compile, not work. It's not just untested; it's > definitely broken. But it gives a feeling for what the remaining > obstacles to making this code available in a frontend environment are. > Since I wrote my very first email complaining about the difficulty of > making the backend's JSON parser work in a frontend environment, one > obstacle has been knocked down: StringInfo is now available in > front-end code (commit 26aaf97b683d6258c098859e6b1268e1f5da242f). The > remaining problems (that I know about) have to do with error reporting > and multibyte character support; a read of the patches is suggested > for those wanting further details. Well, with the caveat that it doesn't work, it's less than I expected. Obviously ereport() is a pretty big deal and I agree with Michael downthread that we should port this to the frontend code.
Re: making the backend's json parser work in frontend code
Hi Robert, On 1/16/20 11:37 AM, David Steele wrote: The next question in my mind is given the caveat that the error handing is questionable in the front end, can we at least render/parse valid JSON with the code? Hrm, this bit was from an earlier edit. I meant: The next question in my mind is what will it take to get this working in a limited form so we can at least prototype it with pg_basebackup. I can hack on this with some static strings in front end code tomorrow to see what works and what doesn't if that makes sense. Regards, -- -David da...@pgmasters.net
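For reference, such a throwaway harness might look roughly like the sketch below. It assumes the post-split parser keeps the entry points the backend has today (makeJsonLexContextCstringLen(), pg_parse_json() and nullSemAction from jsonapi.h) and that they become reachable from frontend code; the header location and exact signatures are assumptions, and whether such a program even builds and survives an error case is precisely what the experiment is meant to find out.

/*
 * Hypothetical frontend smoke test; signatures are assumed to match the
 * current backend jsonapi.h and may well change as part of the move.
 */
#include "postgres_fe.h"

#include "jsonapi.h"			/* wherever the split parser's header lands */

int
main(void)
{
	char	   *json = "{\"files\": [{\"path\": \"base/1/1259\", \"size\": 8192}]}";
	JsonLexContext *lex;

	/* need_escapes = true, as a caller interested in the values would use */
	lex = makeJsonLexContextCstringLen(json, strlen(json), true);

	/* nullSemAction: no semantic callbacks, just validate the input */
	pg_parse_json(lex, &nullSemAction);

	printf("parsed OK\n");
	return 0;
}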
Re: SlabCheck leaks memory into TopMemoryContext
Andres Freund writes: > On 2020-01-16 18:01:53 +0100, Tomas Vondra wrote: >> and it's only really used in debug builds anyway. So I'm not all that >> woried about this wasting a couple extra kB of memory. > IDK, making memory usage look different makes optimizing it harder. Not > a hard rule, obviously, but ... Well, if you're that excited about it, make a patch so we can see how ugly it ends up being. regards, tom lane
Re: Thoughts on "killed tuples" index hint bits support on standby
On Thu, Jan 16, 2020 at 9:54 AM Andres Freund wrote: > I don't think we can rely on hot_standby_feedback at all. We can to > avoid unnecessary cancellations, etc, and even assume it's setup up > reasonably for some configurations, but there always needs to be an > independent correctness backstop. +1 > I'm less clear on how we can make sure that we can *rely* on LP_DEAD to > skip over entries during scans, however. The bits set as described above > would be safe, but we also can see LP_DEAD set by the primary (and even > upstream cascading standbys at least in case of new base backups taken > from them), due to them being not being WAL logged. As we don't WAL log, > there is no conflict associated with the LP_DEADs being set. My gut > feeling is that it's going to be very hard to get around this, without > adding WAL logging for _bt_killitems et al (including an interface for > kill_prior_tuple to report the used horizon to the index). I agree. What about calling _bt_vacuum_one_page() more often than strictly necessary to avoid a page split on the primary? The B-Tree deduplication patch sometimes does that, albeit for completely unrelated reasons. (We don't want to have to unset an LP_DEAD bit in the case when a new/incoming duplicate tuple has a TID that overlaps with the posting list range of some existing duplicate posting list tuple.) I have no idea how you'd determine that it was time to call _bt_vacuum_one_page(). Seems worth considering. > I'm wondering if we could recycle BTPageOpaqueData.xact to store the > horizon applying to killed tuples on the page. We don't need to store > the level for leaf pages, because we have BTP_LEAF, so we could make > space for that (potentially signalled by a new BTP flag). Obviously we > have to be careful with storing xids in the index, due to potential > wraparound danger - but I think such page would have to be vacuumed > anyway, before a potential wraparound. You would think that, but unfortunately we don't currently do it that way. We store XIDs in deleted leaf pages that can sometimes be missed until the next wraparound. We need to do something like commit 6655a7299d835dea9e8e0ba69cc5284611b96f29, but for B-Tree. It's somewhere on my TODO list. > I think we could safely unset > the xid during nbtree single page cleanup, and vacuum, by making sure no > LP_DEAD entries survive, and by including the horizon in the generated > WAL record. > > That however still doesn't really fully allow us to set LP_DEAD on > standbys, however - but it'd allow us to take the primary's LP_DEADs > into account on a standby. I think we'd run into torn page issues, if we > were to do so without WAL logging, because we'd rely on the LP_DEAD bits > and BTPageOpaqueData.xact to be in sync. I *think* we might be safe to > do so *iff* the page's LSN indicates that there has been a WAL record > covering it since the last redo location. That sounds like a huge mess. -- Peter Geoghegan
Re: making the backend's json parser work in frontend code
On Thu, Jan 16, 2020 at 1:37 PM David Steele wrote: > I was starting to wonder if it wouldn't be simpler to go back to the > Postgres JSON parser and see if we can adapt it. I'm not sure that it > *is* simpler, but it would almost certainly be more acceptable. That is my feeling also. > So the idea here is that json.c will have the JSON SQL functions, > jsonb.c the JSONB SQL functions, and jsonapi.c the parser, and > jsonfuncs.c the utility functions? Uh, I think roughly that, yes. Although I can't claim to fully understand everything that's here. > This seems like a good first step. I wonder if the remainder of the SQL > json/jsonb functions should be moved to json.c/jsonb.c respectively? > > That does represent a lot of code churn though, so perhaps not worth it. I don't have an opinion on this right now. > Well, with the caveat that it doesn't work, it's less than I expected. > > Obviously ereport() is a pretty big deal and I agree with Michael > downthread that we should port this to the frontend code. Another possibly-attractive option would be to defer throwing the error: i.e. set some flags in the lex or parse state or something, and then just return. The caller notices the flags and has enough information to throw an error or whatever it wants to do. The reason I think this might be attractive is that it dodges the whole question of what exactly throwing an error is supposed to do in a world without transactions, memory contexts, resource owners, etc. However, it has some pitfalls of its own, like maybe being too much code churn or hurting performance in non-error cases. > It would also be nice to unify functions like PQmblen() and pg_mblen() > if possible. I don't see how to do that at the moment, but I agree that it would be nice if we can figure it out. > The next question in my mind is given the caveat that the error handing > is questionable in the front end, can we at least render/parse valid > JSON with the code? That's a real good question. Thanks for offering to test it; I think that would be very helpful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
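To show the "set a flag and return" idea in isolation, here is a tiny standalone C sketch; the enum, struct, and function names are invented for the example and are not the PostgreSQL jsonapi. The parser only records what went wrong, and each caller decides whether that becomes an ereport(), a printed message, or an exit.

#include <stdio.h>
#include <stdlib.h>

/* Result codes the "parser" can hand back instead of throwing. */
typedef enum JsonParseResult
{
	JSON_PARSE_OK,
	JSON_PARSE_UNEXPECTED_TOKEN,
	JSON_PARSE_END_OF_INPUT
} JsonParseResult;

typedef struct JsonParseState
{
	const char *input;
	int			error_offset;	/* where the problem was found */
	JsonParseResult result;		/* set instead of throwing */
} JsonParseState;

/* Stand-in parser: records failure in the state and returns. */
static JsonParseResult
parse_json(JsonParseState *state)
{
	if (state->input[0] != '{')
	{
		state->result = JSON_PARSE_UNEXPECTED_TOKEN;
		state->error_offset = 0;
		return state->result;
	}
	state->result = JSON_PARSE_OK;
	return state->result;
}

int
main(void)
{
	JsonParseState state = {"]oops", -1, JSON_PARSE_OK};

	/* The caller, not the parser, chooses the error behaviour. */
	if (parse_json(&state) != JSON_PARSE_OK)
	{
		/* frontend-style handling: report and bail out */
		fprintf(stderr, "parse error at offset %d\n", state.error_offset);
		return 1;
	}
	printf("ok\n");
	return 0;
}

A backend caller could map the same status onto a regular ereport(ERROR), which is what makes this style attractive for code shared between the two environments.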
Re: Enabling B-Tree deduplication by default
On Wed, Jan 15, 2020 at 6:38 PM Peter Geoghegan wrote: > There are some outstanding questions about how B-Tree deduplication > [1] should be configured, and whether or not it should be enabled by > default. I'm starting this new thread in the hopes of generating > discussion on these high level questions. It seems like the issue here is that you're pretty confident that deduplication will be a win for unique indexes, but not so confident that this will be true for non-unique indexes. I don't know that I understand why. It does seem odd to me to treat them differently, but it's possible that this is a reflection of my own lack of understanding. What do other database systems do? I wonder whether we could avoid the downside of having only one LP_DEAD bit per line pointer by having a bit per TID within the compressed tuples themselves. I assume you already thought about that, though. What are the characteristics of this system if you have an index that is not declared as UNIQUE but actually happens to be UNIQUE? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: making the backend's json parser work in frontend code
On 1/15/20 4:40 PM, Andres Freund wrote: > > I'm not sure where I come down between using json and a simple ad-hoc > format, when the dependency for the former is making the existing json > parser work in the frontend. But if the alternative is to add a second > json parser, it very clearly shifts towards using an ad-hoc > format. Having to maintain a simple ad-hoc parser is a lot less > technical debt than having a second full blown json parser. Maybe at first, but it will grow and become more complex as new features are added. This has been our experience with pgBackRest, at least. > Imo even > when an external project or three also has to have that simple parser. I don't agree here. Especially if we outgrow the format and they need two parsers, depending on the version of PostgreSQL. To do page-level incrementals (which this feature is intended to enable) the user will need to be able to associate full and incremental backups and the only way I see to do that (currently) is to read the manifests, since the prior backup should be stored there. I think this means that parsing the manifest is not really optional -- it will be required to do any kind of automation with incrementals. It's easy enough for a tool like pgBackRest to do something like that, much harder for a user hacking together a tool in bash based on pg_basebackup. > If the alternative were to use that newly proposed json parser to > *replace* the backend one too, the story would again be different. That was certainly not my intention. Regards, -- -David da...@pgmasters.net
Re: making the backend's json parser work in frontend code
On 1/15/20 7:39 PM, Robert Haas wrote: > On Wed, Jan 15, 2020 at 6:40 PM Andres Freund wrote: >> It's not obvious why the better approach here wouldn't be to just have a >> very simple ereport replacement, that needs to be explicitly included >> from frontend code. It'd not be meaningfully harder, imo, and it'd >> require fewer adaptions, and it'd look more familiar. > > I agree that it's far from obvious that the hacks in the patch are > best; to the contrary, they are hacks. That said, I feel that the > semantics of throwing an error are not very well-defined in a > front-end environment. I mean, in a backend context, throwing an error > is going to abort the current transaction, with all that this implies. > If the frontend equivalent is to do nothing and hope for the best, I > doubt it will survive anything more than the simplest use cases. This > is one of the reasons I've been very reluctant to go do down this > whole path in the first place. The way we handle this in pgBackRest is to put a TRY ... CATCH block in main() to log and exit on any uncaught THROW. That seems like a reasonable way to start here. Without memory contexts that almost certainly will mean memory leaks but I'm not sure how much that matters if the action is to exit immediately. Regards, -- -David da...@pgmasters.net
Re: making the backend's json parser work in frontend code
David Steele writes: > On 1/15/20 7:39 PM, Robert Haas wrote: >>> I agree that it's far from obvious that the hacks in the patch are >>> best; to the contrary, they are hacks. That said, I feel that the >>> semantics of throwing an error are not very well-defined in a >>> front-end environment. I mean, in a backend context, throwing an error >>> is going to abort the current transaction, with all that this implies. >>> If the frontend equivalent is to do nothing and hope for the best, I >>> doubt it will survive anything more than the simplest use cases. This >>> is one of the reasons I've been very reluctant to go do down this >>> whole path in the first place. > The way we handle this in pgBackRest is to put a TRY ... CATCH block in > main() to log and exit on any uncaught THROW. That seems like a > reasonable way to start here. Without memory contexts that almost > certainly will mean memory leaks but I'm not sure how much that matters > if the action is to exit immediately. If that's the expectation, we might as well replace backend ereport(ERROR) with something that just prints a message and does exit(1). The question comes down to whether there are use-cases where a frontend application would really want to recover and continue processing after a JSON syntax problem. I'm not seeing that that's a near-term requirement, so maybe we could leave it for somebody to solve when and if they want to do it. regards, tom lane
Re: making the backend's json parser work in frontend code
Hi, On 2020-01-16 14:20:28 -0500, Tom Lane wrote: > David Steele writes: > > The way we handle this in pgBackRest is to put a TRY ... CATCH block in > > main() to log and exit on any uncaught THROW. That seems like a > > reasonable way to start here. Without memory contexts that almost > > certainly will mean memory leaks but I'm not sure how much that matters > > if the action is to exit immediately. > > If that's the expectation, we might as well replace backend ereport(ERROR) > with something that just prints a message and does exit(1). Well, the process might still want to do some cleanup of half-finished work. You'd not need to be resistant against memory leaks to do so, if followed by an exit. Obviously you can also do all the necessary cleanup from within the ereport(ERROR) itself, but that doesn't seem appealing to me (not composable, harder to reuse for other programs, etc). Greetings, Andres Freund
Re: Patch to document base64 encoding
I just wanted to throw this in the archives; this doesn't need to affect your patch. Because of how the new tables look in the PDF docs, I thought it might be a good time to research how to make each function-entry occupy two rows: one for prototype, return type and description, and the other for the example and its result. Below is a first cut of how you'd implement that idea -- see colspec/spanspec/spanname ... only the output looks almost as bad (though the benefit is that it doesn't overwrite cell contents anymore). I think we have two choices. One is to figure out how to make this work (ie. make it pretty; maybe by using alternate cell backgrounds, say one white and one very light gray; maybe by using thinner/thicker inter-cell lines); the other is to forget tables altogether and format the info in some completely different way. Binary/String Conversion Functions Function Return Type Description Example Result convert_from convert_from(bytes bytea, src_encoding name) text Convert binary string to the database encoding. The original encoding is specified by src_encoding. The bytes must be valid in this encoding. See for available conversions. convert_from('text_in_utf8', 'UTF8') text_in_utf8 represented in the current database encoding -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SlabCheck leaks memory into TopMemoryContext
On Thu, Jan 16, 2020 at 12:33:03PM -0500, Tom Lane wrote: Tomas Vondra writes: On Thu, Jan 16, 2020 at 08:48:49AM -0800, Andres Freund wrote: I don't get why it's advantageous to allocate this once for each slab, rather than having it as a global once for all slabs. But anyway, still clearly better than the current situation. It's largely a matter of personal preference - I agree there are cases when global variables are the best solution, but I kinda dislike them. It seems cleaner to just allocate it as part of the slab, not having to deal with different number of chunks per block between slabs. Plus we don't have all that many slabs (like 2), and it's only really used in debug builds anyway. So I'm not all that woried about this wasting a couple extra kB of memory. A positive argument for doing it like this is that the overhead goes away when the SlabContexts are all deallocated, while a global variable would presumably stick around indefinitely. But I concur that in current usage, there's hardly any point in worrying about the relative benefits. We should just keep it simple, and this seems marginally simpler than the other way. I think the one possible argument against this approach might be that it adds a field to the struct, so if you have an extension using a Slab context, it'll break if you don't rebuild it. But that only matters if we want to backpatch it (which I think is not the plan) and with memory context checking enabled (which does not apply to regular packages). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Enabling B-Tree deduplication by default
On Thu, Jan 16, 2020 at 10:55 AM Robert Haas wrote: > On Wed, Jan 15, 2020 at 6:38 PM Peter Geoghegan wrote: > > There are some outstanding questions about how B-Tree deduplication > > [1] should be configured, and whether or not it should be enabled by > > default. I'm starting this new thread in the hopes of generating > > discussion on these high level questions. > > It seems like the issue here is that you're pretty confident that > deduplication will be a win for unique indexes, but not so confident > that this will be true for non-unique indexes. Right. > I don't know that I understand why. The main reason that I am confident about unique indexes is that we only do a deduplication pass in a unique index when we observe that the incoming tuple (the one that might end up splitting the page) is a duplicate of some existing tuple. Checking that much is virtually free, since we already have the information close at hand today (we cache the _bt_check_unique() binary search bounds for reuse within _bt_findinsertloc() today). This seems to be an excellent heuristic, since we really only want to target unique index leaf pages where all or almost all insertions must be duplicates caused by non-HOT updates -- this category includes all the pgbench indexes, and includes all of the unique indexes in TPC-C. Whereas with non-unique indexes, we aren't specifically targeting version churn (though it will help with that too). Granted, the fact that the incoming tuple happens to be a duplicate is not a sure sign that the index is in this informal "suitable for deduplication" category of mine. The incoming duplicate could just be a once off. Even still, it's extremely unlikely to matter -- a failed deduplication pass really isn't that expensive anyway, since it takes place just before we split the page (we'll need the page in L1 cache anyway). If we consistently attempt deduplication in a unique index, then we're virtually guaranteed to consistently benefit from it. In general, the way that deduplication is only considered at the point where we'd otherwise have to split the page buys *a lot*. The idea of delaying page splits by doing something like load balancing or compression in a lazy fashion has a long history -- it was not my idea. I'm not talking about the LP_DEAD bit set deletion stuff here -- this goes back to the 1970s. > It does seem odd to me to treat them differently, but it's possible > that this is a reflection of my own lack of understanding. What do > other database systems do? Other database systems treat unique indexes very differently, albeit in a way that we're not really in a position to take too much away from -- other than the general fact that unique indexes can be thought of as very different things. In general, the unique indexes in other systems are expected to be unique in every sense, even during an "UPDATE foo SET unique_key = unique_key + 1" style query. Index tuples are slightly smaller in a unique index compared to an equivalent non-unique index in the case of one such system. Also, that same system has something called a "unique index scan" that can only be used with a unique index (and only when all columns appear in the query qual). > I wonder whether we could avoid the downside of having only one > LP_DEAD bit per line pointer by having a bit per TID within the > compressed tuples themselves. I assume you already thought about that, > though. So far, this lack of LP_DEAD bit granularity issue is only a theoretical problem. 
I haven't been able to demonstrate in any meaningful way. Setting LP_DEAD bits is bound to be correlated, and we only deduplicate to avoid a page split. Just last night I tried a variant pgbench workload with a tiny accounts table, an extremely skewed Zipf distribution, and lots of clients relative to the size of the machine. I used a non-unique index instead of a unique index, since that is likely to be where the patch was weakest (no _bt_check_unique() LP_DEAD bit setting that way). The patch still came out ahead of the master branch by about 3%. It's very hard to prove that there is no real downside to having only one LP_DEAD bit per posting list tuple, since absence of evidence isn't evidence of absence. I believe that it's much easier to make the argument that it's okay to one have one LP_DEAD bit per posting list within unique indexes specifically, though (because we understand that there can be no duplicates in the long run there). Throughout this work, and the v12 B-Tree work, I consistently made conservative decisions about space accounting in code like nbtsplitloc.c (the new nbtdedup.c code has to think about space in about the same way). My intuition is that space accounting is one area where we really ought to be conservative, since it's so hard to test. That's the main reason why I find the idea of having LP_DEAD bits within posting list tuples unappealing, whatever the benefits may be -- it adds complexity in the one area that I really d
Re: making the backend's json parser work in frontend code
On 1/16/20 12:26 PM, Andres Freund wrote: Hi, On 2020-01-16 14:20:28 -0500, Tom Lane wrote: David Steele writes: The way we handle this in pgBackRest is to put a TRY ... CATCH block in main() to log and exit on any uncaught THROW. That seems like a reasonable way to start here. Without memory contexts that almost certainly will mean memory leaks but I'm not sure how much that matters if the action is to exit immediately. If that's the expectation, we might as well replace backend ereport(ERROR) with something that just prints a message and does exit(1). Well, the process might still want to do some cleanup of half-finished work. You'd not need to be resistant against memory leaks to do so, if followed by an exit. Obviously you can also do all the necessarily cleanup from within the ereport(ERROR) itself, but that doesn't seem appealing to me (not composable, harder to reuse for other programs, etc). In pgBackRest we have a default handler that just logs the message to stderr and exits (though we consider it a coding error if it gets called). Seems like we could do the same here. Default message and exit if no handler, but optionally allow a handler (which could RETHROW to get to the default handler afterwards). It seems like we've been wanting a front end version of ereport() for a while so I'll take a look at that and see what it involves. Regards, -- -David da...@pgmasters.net
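As a rough illustration of that shape (a default "log and exit" path plus an optional catch point that can fall back to it), here is a standalone C toy using setjmp/longjmp. It is neither pgBackRest's TRY/CATCH macros nor a proposal for the eventual frontend ereport(); all names are invented.

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf *current_handler = NULL;
static const char *last_error = NULL;

/* Throw: jump to the innermost handler if any, else log and exit. */
static void
fe_throw(const char *msg)
{
	last_error = msg;
	if (current_handler)
		longjmp(*current_handler, 1);
	/* default handler: print the message and bail out */
	fprintf(stderr, "error: %s\n", msg);
	exit(1);
}

int
main(void)
{
	jmp_buf		handler;

	if (setjmp(handler) == 0)
	{
		current_handler = &handler;
		fe_throw("something went wrong mid-operation");
	}
	else
	{
		/* caught: clean up half-finished work, then rethrow to the default */
		current_handler = NULL;
		fprintf(stderr, "cleaning up half-finished work\n");
		fe_throw(last_error);
	}
	return 0;
}

A real version would presumably route messages through the frontend logging in common/logging.h rather than raw fprintf, but the control flow is the point here.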
Re: making the backend's json parser work in frontend code
Robert Haas writes: > 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm > missing something, this seems like an overdue cleanup. Here's a reviewed version of 0001. You missed fixing the MSVC build, and there were assorted comments and other things referencing wchar.c that needed to be cleaned up. Also, it seemed to me that if we are going to move wchar.c, we should also move encnames.c, so that libpq can get fully out of the symlinking-source-files business. It makes initdb less weird too. I took the liberty of sticking proper copyright headers onto these two files, too. (This makes the diff a lot more bulky :-(. Would it help to add the headers in a separate commit?) Another thing I'm wondering about is if any of the #ifndef FRONTEND code should get moved *back* to src/backend/utils/mb. But that could be a separate commit, too. Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should migrate over to src/include/common. But that'd be far more invasive to other source files, so I've not touched the issue here. regards, tom lane diff --git a/src/backend/utils/mb/Makefile b/src/backend/utils/mb/Makefile index cd4a016..b19a125 100644 --- a/src/backend/utils/mb/Makefile +++ b/src/backend/utils/mb/Makefile @@ -14,10 +14,8 @@ include $(top_builddir)/src/Makefile.global OBJS = \ conv.o \ - encnames.o \ mbutils.o \ stringinfo_mb.o \ - wchar.o \ wstrcmp.o \ wstrncmp.o diff --git a/src/backend/utils/mb/README b/src/backend/utils/mb/README index 7495ca5..ef36626 100644 --- a/src/backend/utils/mb/README +++ b/src/backend/utils/mb/README @@ -3,12 +3,8 @@ src/backend/utils/mb/README Encodings = -encnames.c: public functions for both the backend and the frontend. conv.c: static functions and a public table for code conversion -wchar.c: mostly static functions and a public table for mb string and - multibyte conversion mbutils.c: public functions for the backend only. - requires conv.c and wchar.c stringinfo_mb.c: public backend-only multibyte-aware stringinfo functions wstrcmp.c: strcmp for mb wstrncmp.c: strncmp for mb @@ -16,6 +12,12 @@ win866.c: a tool to generate KOI8 <--> CP866 conversion table iso.c: a tool to generate KOI8 <--> ISO8859-5 conversion table win1251.c: a tool to generate KOI8 <--> CP1251 conversion table +See also in src/common/: + +encnames.c: public functions for encoding names +wchar.c: mostly static functions and a public table for mb string and + multibyte conversion + Introduction http://www.cprogramming.com/tutorial/unicode.html diff --git a/src/backend/utils/mb/encnames.c b/src/backend/utils/mb/encnames.c deleted file mode 100644 index 12b61cd..000 --- a/src/backend/utils/mb/encnames.c +++ /dev/null @@ -1,629 +0,0 @@ -/* - * Encoding names and routines for work with it. All - * in this file is shared between FE and BE. - * - * src/backend/utils/mb/encnames.c - */ -#ifdef FRONTEND -#include "postgres_fe.h" -#else -#include "postgres.h" -#include "utils/builtins.h" -#endif - -#include -#include - -#include "mb/pg_wchar.h" - - -/* -- - * All encoding names, sorted: *** A L P H A B E T I C *** - * - * All names must be without irrelevant chars, search routines use - * isalnum() chars only. It means ISO-8859-1, iso_8859-1 and Iso8859_1 - * are always converted to 'iso88591'. All must be lower case. - * - * The table doesn't contain 'cs' aliases (like csISOLatin1). It's needed? 
- * - * Karel Zak, Aug 2001 - * -- - */ -typedef struct pg_encname -{ - const char *name; - pg_enc encoding; -} pg_encname; - -static const pg_encname pg_encname_tbl[] = -{ - { - "abc", PG_WIN1258 - }, /* alias for WIN1258 */ - { - "alt", PG_WIN866 - }, /* IBM866 */ - { - "big5", PG_BIG5 - }, /* Big5; Chinese for Taiwan multibyte set */ - { - "euccn", PG_EUC_CN - }, /* EUC-CN; Extended Unix Code for simplified - * Chinese */ - { - "eucjis2004", PG_EUC_JIS_2004 - }, /* EUC-JIS-2004; Extended UNIX Code fixed - * Width for Japanese, standard JIS X 0213 */ - { - "eucjp", PG_EUC_JP - }, /* EUC-JP; Extended UNIX Code fixed Width for - * Japanese, standard OSF */ - { - "euckr", PG_EUC_KR - }, /* EUC-KR; Extended Unix Code for Korean , KS - * X 1001 standard */ - { - "euctw", PG_EUC_TW - }, /* EUC-TW; Extended Unix Code for - * - * traditional Chinese */ - { - "gb18030", PG_GB18030 - }, /* GB18030;GB18030 */ - { - "gbk", PG_GBK - }, /* GBK; Chinese Windows CodePage 936 - * simplified Chinese */ - { - "iso88591", PG_LATIN1 - }, /* ISO-8859-1; RFC1345,KXS2 */ - { - "iso885910", PG_LATIN6 - }, /* ISO-8859-10; RFC1345,KXS2 */ - { - "iso885913", PG_LATIN7 - }, /* ISO-8859-13; RFC1345,KXS2 */ - { - "iso885914", PG_LATIN8 - }, /* ISO-8859-14; RFC1345,KXS2 */ - { - "iso885915", PG_LATIN9 - }, /* ISO-8859-15; RFC1345,KXS2 */ - { - "is
Re: SlabCheck leaks memory into TopMemoryContext
Tomas Vondra writes: > I think the one possible argument against this approach might be that it > adds a field to the struct, so if you have an extension using a Slab > context, it'll break if you don't rebuild it. But that only matters if > we want to backpatch it (which I think is not the plan) and with memory > context checking enabled (which does not apply to regular packages). Huh? That struct is private in slab.c, no? Any outside code relying on its contents deserves to break. I do think we ought to back-patch this, given the horrible results Andres showed. regards, tom lane
Re: SlabCheck leaks memory into TopMemoryContext
On Thu, Jan 16, 2020 at 03:15:41PM -0500, Tom Lane wrote: Tomas Vondra writes: I think the one possible argument against this approach might be that it adds a field to the struct, so if you have an extension using a Slab context, it'll break if you don't rebuild it. But that only matters if we want to backpatch it (which I think is not the plan) and with memory context checking enabled (which does not apply to regular packages). Huh? That struct is private in slab.c, no? Any outside code relying on its contents deserves to break. Ah, right. Silly me. I do think we ought to back-patch this, given the horrible results Andres showed. OK. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: making the backend's json parser work in frontend code
On Thu, Jan 16, 2020 at 3:11 PM Tom Lane wrote: > Robert Haas writes: > > 0001 moves wchar.c from src/backend/utils/mb to src/common. Unless I'm > > missing something, this seems like an overdue cleanup. > > Here's a reviewed version of 0001. You missed fixing the MSVC build, > and there were assorted comments and other things referencing wchar.c > that needed to be cleaned up. Wow, thanks. > Also, it seemed to me that if we are going to move wchar.c, we should > also move encnames.c, so that libpq can get fully out of the > symlinking-source-files business. It makes initdb less weird too. OK. > I took the liberty of sticking proper copyright headers onto these > two files, too. (This makes the diff a lot more bulky :-(. Would > it help to add the headers in a separate commit?) I wouldn't bother making it a separate commit, but please do whatever you like. > Another thing I'm wondering about is if any of the #ifndef FRONTEND > code should get moved *back* to src/backend/utils/mb. But that > could be a separate commit, too. +1 for moving that stuff to a separate backend-only file. > Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should > migrate over to src/include/common. But that'd be far more invasive > to other source files, so I've not touched the issue here. I don't have a view on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: our checks for read-only queries are not great
On Thu, Jan 16, 2020 at 12:22 PM Stephen Frost wrote: > I think I agree with you regarding the original intent, though even > there, as discussed elsewhere, it seems like there's perhaps either a > bug or a disagreement about the specifics of what that means when it > relates to committing a 2-phase transaction. Still, setting that aside > for the moment, do we feel like this is enough to be able to update our > documentation with? I think that would be possible. What did you have in mind? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: making the backend's json parser work in frontend code
On Thu, Jan 16, 2020 at 1:58 PM David Steele wrote: > To do page-level incrementals (which this feature is intended to enable) > the user will need to be able to associate full and incremental backups > and the only way I see to do that (currently) is to read the manifests, > since the prior backup should be stored there. I think this means that > parsing the manifest is not really optional -- it will be required to do > any kind of automation with incrementals. My current belief is that enabling incremental backup will require extending the manifest format either not at all or by adding one additional line with some LSN info. If we could foresee a need to store a bunch of additional *per-file* details, I'd be a lot more receptive to the argument that we ought to be using a more structured format like JSON. And it doesn't seem impossible that such a thing could happen, but I don't think it's at all clear that it actually will happen, or that it will happen soon enough that we ought to be worrying about it now. It's possible that we're chasing a real problem here, and if there's something we can agree on and get done I'd rather do that than argue, but I am still quite suspicious that there's no actually serious technical problem here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: psql - add SHOW_ALL_RESULTS option
On Thu, Jan 16, 2020 at 01:08:16PM -0500, Tom Lane wrote: Tomas Vondra writes: This is one of the patches already marked as RFC (since September by Alvaro). Anyone interested in actually pushing it, so that it does not fall through to yet another commitfest? TBH, I think we'd be better off to reject it. This makes a nontrivial change in a very long-standing psql behavior, with AFAICS no way to get back the old semantics. (The thread title is completely misleading about that; there's no "option" in the patch as it stands.) Sure, in a green field this behavior would likely be more sensible ... but that has to be weighed against the fact that it's behaved the way it does for a long time, and any existing scripts that are affected by that behavior have presumably deliberately chosen to use it. I can't imagine that changing this will make very many people happier. It seems much more likely that people who are affected will be unhappy. The compatibility issue could be resolved by putting in the option that I suppose was there at the beginning. But then we'd have to have a debate about which behavior would be default, and there would still be the question of who would find this to be an improvement. If you're chaining together commands with \; then it's likely that you are happy with the way it behaves today. Certainly there's been no drumbeat of bug reports about it. I don't know, really, I only pinged this as a CFM who sees a patch marked as RFC for months ... The current behavior certainly seems strange/wrong to me - if I send multiple queries to psql, I'd certainly expect results for all of them, not just the last one. So the current behavior seems pretty surprising. I'm unable to make any judgments about risks/benefits of this change. I can't imagine anyone intentionally relying on the current behavior, so I'd say the patch is unlikely to break anything (which is not already broken). But I don't have any data to support this ... Essentially, I'm just advocating to make a decision - we should either commit or reject the patch, not just move it to the next commitfest over and over. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgindent && weirdness
On Thu, Jan 16, 2020 at 3:59 PM Thomas Munro wrote: > On Wed, Jan 15, 2020 at 11:30 AM Tom Lane wrote: > > Yeah, it's been doing that for decades. I think the triggering > > factor is the typedef name (Var, here) preceding the &&. Here's a better fix:

diff --git a/indent.c b/indent.c
index 9faf57a..51a60a6 100644
--- a/indent.c
+++ b/indent.c
@@ -570,8 +570,11 @@ check_type:
 			ps.in_or_st = false;	/* turn off flag for structure decl or
 									 * initialization */
 		}
-		/* parenthesized type following sizeof or offsetof is not a cast */
-		if (ps.keyword == 1 || ps.keyword == 2)
+		/*
+		 * parenthesized type following sizeof or offsetof is not a cast;
+		 * likewise for function-like macros that take a type
+		 */
+		if (ps.keyword == 1 || ps.keyword == 2 || ps.last_token == ident)
 			ps.not_cast_mask |= 1 << ps.p_l_follow;
 		break;
Re: psql - add SHOW_ALL_RESULTS option
On 2020-Jan-16, Tom Lane wrote: > The compatibility issue could be resolved by putting in the option > that I suppose was there at the beginning. But then we'd have to > have a debate about which behavior would be default, and there would > still be the question of who would find this to be an improvement. > If you're chaining together commands with \; then it's likely that > you are happy with the way it behaves today. Certainly there's been > no drumbeat of bug reports about it. The patch originally submitted did indeed have the option (defaulting to "off", that is, the original behavior), and it was removed at request of reviewers Daniel Vérité, Peter Eisentraut and Kyotaro Horiguchi. My own opinion is that any scripts that rely heavily on the current behavior are stepping on shaky ground anyway. I'm not saying we should break them on every chance we get -- just that keeping them unharmed is not necessarily a priority, and that if this patch enables other psql features, it might be a good step forward. My own vote would be to use the initial patch (after applying any unrelated changes per later review), ie. add the feature with its disable button. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgindent && weirdness
On 2020-Jan-17, Thomas Munro wrote: > On Thu, Jan 16, 2020 at 3:59 PM Thomas Munro wrote: > > On Wed, Jan 15, 2020 at 11:30 AM Tom Lane wrote: > > > Yeah, it's been doing that for decades. I think the triggering > > > factor is the typedef name (Var, here) preceding the &&. > > Here's a better fix: This is indeed a very good fix! Several badly formatted sites in our code are improved with this change. Thanks, -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: making the backend's json parser work in frontend code
Robert Haas writes: > On Thu, Jan 16, 2020 at 3:11 PM Tom Lane wrote: >> Here's a reviewed version of 0001. You missed fixing the MSVC build, >> and there were assorted comments and other things referencing wchar.c >> that needed to be cleaned up. > Wow, thanks. Pushed that. >> Another thing I'm wondering about is if any of the #ifndef FRONTEND >> code should get moved *back* to src/backend/utils/mb. But that >> could be a separate commit, too. > +1 for moving that stuff to a separate backend-only file. After a brief look, I propose the following: * I think we should just shove the "#ifndef FRONTEND" stuff in wchar.c into mbutils.c. It doesn't seem worth inventing a whole new file for that code, especially when it's arguably within the remit of mbutils.c anyway. * Let's remove the "#ifndef FRONTEND" restriction on the ICU-related stuff in encnames.c. Even if we don't need that stuff in frontend today, it's hardly unlikely that we will need it tomorrow. And there's not that much bulk there anyway. * The one positive reason for that restriction is the ereport() in get_encoding_name_for_icu. We could change that to be the usual #ifdef-ereport-or-printf dance, but I think there's a better way: put the ereport at the caller, by redefining that function to return NULL for an unsupported encoding. There's only one caller today anyhow. * PG_char_to_encoding() and PG_encoding_to_char() can be moved to mbutils.c; they'd fit reasonably well beside getdatabaseencoding and pg_client_encoding. (I also thought about utils/adt/misc.c, but that's not obviously better.) Barring objections I'll go make this happen shortly. >> Lastly, it strikes me that maybe pg_wchar.h, or parts of it, should >> migrate over to src/include/common. But that'd be far more invasive >> to other source files, so I've not touched the issue here. > I don't have a view on this. If anyone is hot to do this part, please have at it. I'm not. regards, tom lane
Re: psql - add SHOW_ALL_RESULTS option
Hello Tom, This is one of the patches already marked as RFC (since September by Alvaro). Anyone interested in actually pushing it, so that it does not fall through to yet another commitfest? TBH, I think we'd be better off to reject it. This makes a nontrivial change in a very long-standing psql behavior, with AFAICS no way to get back the old semantics. (The thread title is completely misleading about that; there's no "option" in the patch as it stands.) The thread title was not misleading, the initial version of the patch did offer an option. Then I was said "the current behavior is stupid (which I agree), let us change it to the sane behavior without option", then I'm told the contrary. Sigh. I still have the patch with the option, though. Sure, in a green field this behavior would likely be more sensible ... but that has to be weighed against the fact that it's behaved the way it does for a long time, and any existing scripts that are affected by that behavior have presumably deliberately chosen to use it. I cannot imagine many people actually relying on the current insane behavior. I can't imagine that changing this will make very many people happier. It seems much more likely that people who are affected will be unhappy. The compatibility issue could be resolved by putting in the option that I suppose was there at the beginning. Indeed. But then we'd have to have a debate about which behavior would be default, The patch was keeping current behavior as the default because people do not like a change whatever. and there would still be the question of who would find this to be an improvement. If you're chaining together commands with \; then it's likely that you are happy with the way it behaves today. Certainly there's been no drumbeat of bug reports about it. Why would there be bug report if this is a feature? :-) The behavior has been irritating me for a long time. It is plain stupid to be able to send queries but not see their results. -- Fabien.
Re: FETCH FIRST clause PERCENT option
Hi, This patch is marked as RFC since September. Since then there was no discussion on this thread, but Andrew proposed an alternative approach based on window functions in a separate thread [1] (both for this and for the WITH TIES case). I'll set this patch back to "needs review" - at this point we need to decide which of the approaches is the right one. regards [1] https://www.postgresql.org/message-id/flat/87o8wvz253@news-spur.riddles.org.uk -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: psql - add SHOW_ALL_RESULTS option
My own vote would be to use the initial patch (after applying any unrelated changes per later review), ie. add the feature with its disable button. I can do that, but not if there is a veto from Tom on the feature. I wish definite negative opinions by senior committers would be expressed earlier, so that people do not spend time reviewing dead code and developing even deader code. -- Fabien.
Re: Recognizing superuser in pg_hba.conf
Hi, I see this patch is marked as RFC since 12/30, but there seems to be quite a lot of discussion about the syntax, keywords and how exactly to identify the superuser. So I'll switch it back to needs review, which I think is a better representation of the current state. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: remove separate postgres.(sh)description files
On Wed, Jan 08, 2020 at 02:33:23PM +0200, Heikki Linnakangas wrote: On 31/12/2019 02:08, John Naylor wrote: I'm guessing the initial data for pg_(sh)description is output into separate files because it was too difficult for the traditional shell script to maintain enough state to do otherwise. Yeah, I guess so. The roots of postgres.description go all the way back to 1997, when not only genbki was a shell script, but also initdb. With Perl, it's just as easy to assemble the data into the same format as the rest of the catalogs and then let the generic code path output it into postgres.bki. The attached patch does that and simplifies the catalog makefile and initdb.c. Nice cleanup! Looks like we didn't have any mention of the postgres.(sh)description files in the docs, so no doc updates needed. Grepping around, there are a few stray references to postgres.description still: $ git grep -r -I postgres.shdescript . src/backend/catalog/.gitignore:/postgres.shdescription src/backend/catalog/Makefile:# postgres.bki, postgres.description, postgres.shdescription, src/tools/msvc/clean.bat:if %DIST%==1 if exist src\backend\catalog\postgres.shdescription del /q src\backend\catalog\postgres.shdescription Barring objections, I'll remove those too, and commit this. +1 from me. Let's get these small RFC patches out of the way. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Run-time pruning for ModifyTable
On Wed, Nov 27, 2019 at 05:17:06PM +0900, Michael Paquier wrote: On Tue, Nov 05, 2019 at 04:04:25PM +1300, Thomas Munro wrote: On Thu, Sep 12, 2019 at 10:10 AM Alvaro Herrera wrote: > Here's a rebased version of this patch (it had a trivial conflict). Hi, FYI partition_prune.sql currently fails (maybe something to do with commit d52eaa09?): David, perhaps you did not notice that? For now I have moved this patch to next CF waiting on author to look after the failure. Amit, Kato-san, both of you are marked as reviewers of this patch. Are you planning to look at it? David, this patch is marked as "waiting on author" since 11/27, and there have been no updates or responses since then. Do you plan to submit a new patch version in this CF? We're already half-way through, so there's not much time ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: row filtering for logical replication
On Thu, Nov 28, 2019 at 11:32:01AM +0900, Michael Paquier wrote: On Mon, Nov 25, 2019 at 11:48:29AM +0900, Amit Langote wrote: On Mon, Nov 25, 2019 at 11:38 AM Amit Langote wrote: Needed to be rebased, which I did, to be able to test them; patches attached. Oops, really attached this time. Euler, this thread is waiting for input from you regarding the latest comments from Amit. Euler, this patch is still in "waiting on author" since 11/25. Do you plan to review changes made by Amit in the patches he submitted, or what are your plans with this patch? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] ltree, lquery, and ltxtquery binary protocol support
On Fri, Nov 29, 2019 at 11:29:03AM +0900, Michael Paquier wrote: On Mon, Nov 11, 2019 at 03:44:54PM +0100, Nino Floris wrote: Alright, as usual life got in the way. I've attached a new version of the patch with pgindent changes. > What do you think about writing patch for ALTER TYPE? I'd rather not :$ > 1) Write migration script, which directly updates pg_type. This sounds like the best option right now, if we don't want people to do manual migrations to ltree 1.2. How would I best go at this? > Travis has a small complaint: Should be fixed! The latest patch provided fails to apply for me on HEAD. Please provide a rebase. For now I am bumping this patch to next CF with "waiting on author" as status. Nino, any plans to submit a rebased/fixed patch, so that people can review it? Not sure if this needs a simple rebase or something more complex, all I know is cputube can't apply it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improve search for missing parent downlinks in amcheck
On Fri, Nov 29, 2019 at 03:03:01PM +0900, Michael Paquier wrote: On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote: The revised patch seems to fix all of above. The latest patch is failing to apply. Please provide a rebase. This still does not apply (per cputube). Can you provide a fixed patch? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: making the backend's json parser work in frontend code
Robert Haas writes: > It's possible that we're chasing a real problem here, and if there's > something we can agree on and get done I'd rather do that than argue, > but I am still quite suspicious that there's no actually serious > technical problem here. It's entirely possible that you're right. But if this is a file format that is meant to be exposed to user tools, we need to take a very long view of the requirements for it. Five or ten years down the road, we might be darn glad we spent extra time now. regards, tom lane
Re: row filtering for logical replication
On Thu, 16 Jan 2020 at 18:57, Tomas Vondra wrote: > > Euler, this patch is still in "waiting on author" since 11/25. Do you > plan to review changes made by Amit in the patches he submitted, or what > are your plans with this patch? > Yes, I'm working on Amit's suggestions. I'll post a new patch as soon as possible. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consulting, Development, 24x7 Support and Training
Re: Verify true root on replicas with amcheck
On Thu, Jan 9, 2020 at 12:55 AM godjan • wrote: > Hi, we have trouble to detect true root corruptions on replicas. I made a > patch for resolving it with the locking meta page and potential root page. What do you mean by true root corruption? What is the cause of the problem? What symptom does it have in your application? While I was the one that wrote the existing !readonly/parent check for the true root (a check which your patch makes work with the regular bt_check_index() function), I wasn't thinking of any particular corruption scenario at the time. I wrote the check simply because it was easy to do so (with a heavyweight ShareLock on the index). > I heard that amcheck has an invariant about locking no more than 1 page at a > moment for avoiding deadlocks. Is there possible a deadlock situation? This is a conservative principle that I came up with when I wrote the original version of amcheck. It's not strictly necessary, but it seemed like a good idea. It should be safe to "couple" buffer locks in a way that matches the B-Tree code -- as long as it is thought through very carefully. I am probably going to relax the rule for one specific case soon -- see: https://postgr.es/m/f7527087-6e95-4077-b964-d2cafef62...@yandex-team.ru Your patch looks like it gets it right (it won't deadlock with other sessions that access the metapage), but I hesitate to commit it without a strong justification. Acquiring multiple buffer locks concurrently is worth avoiding wherever possible. -- Peter Geoghegan
Re: Setting min/max TLS protocol in clientside libpq
On Thu, Jan 16, 2020 at 09:56:01AM +0100, Daniel Gustafsson wrote: > The patch looks fine to me, I don't see an issue with splitting it into a > refactoring patch and a TLS min/max version patch. Thanks, committed the refactoring part then. If the buildfarm breaks for one reason or another, the window to look at is narrower than if we had the full set of changes, and the git history is cleaner. I noticed as well a compilation warning when compiling with OpenSSL 1.0.2 from protocol_openssl.c, caused by missing declarations of the two routines because the header declarations were incorrect. Could you please rebase and fix the remaining pieces of the patch? -- Michael signature.asc Description: PGP signature
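For readers following along, a minimal sketch of the declaration problem being referred to; this is an illustration under assumptions (the exact header name, guard, and wording used by the real refactoring may differ), not the committed code:

#include <openssl/ssl.h>

/*
 * Sketch only: fallback declarations for OpenSSL < 1.1.0, which lacks
 * these two routines.  The key point is that they must be plain extern
 * declarations in the shared header (not static), so that the fallback
 * implementations compiled in protocol_openssl.c are visible to both
 * backend and frontend callers.
 */
#if OPENSSL_VERSION_NUMBER < 0x10100000L
extern int	SSL_CTX_set_min_proto_version(SSL_CTX *ctx, int version);
extern int	SSL_CTX_set_max_proto_version(SSL_CTX *ctx, int version);
#endif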
Re: Amcheck: do rightlink verification with lock coupling
On Mon, Jan 13, 2020 at 8:47 PM Andrey Borodin wrote: > > On 11 Jan 2020, at 7:49, Peter Geoghegan wrote: > > I'm curious why Andrey's corruption problems were not detected by the > > cross-page amcheck test, though. We compare the first non-pivot tuple > > on the right sibling leaf page with the last one on the target page, > > towards the end of bt_target_page_check() -- isn't that almost as good > > as what you have here in practice? I probably would have added > > something like this myself earlier, if I had reason to think that > > verification would be a lot more effective that way. > > We were dealing with corruption caused by a lost page update. Consider two > pages: > A->B > If A is split into A` and C we get: > A`->C->B > But if the update of A is lost we still have > A->B, but B's backward pointer points to C. > B's smallest key is bigger than the hikey of A`, but this does not violate the > cross-page invariant. > > Page updates may be lost due to a bug in backup software with incremental > backups, a bug in the storage layer of an Aurora-style system, a bug in the page cache, > incorrect > fsync error handling, a bug in SSD firmware, etc. And our data checksums do not > detect this kind of corruption. BTW I think that it would be better if our > checksums were not stored on the page itself; then they could detect this kind of > fault. I find this argument convincing. I'll try to get this committed soon. While you could have used bt_index_parent_check() or heapallindexed to detect the issue, those two options are a lot more expensive (plus the former option won't work on a standby). Relaxing the principle that says that we shouldn't hold multiple buffer locks concurrently doesn't seem like that much to ask for to get such a clear benefit. I think that this is safe, but page deletion/half-dead pages need more thought. In general, the target page could have become "ignorable" when rechecked. > We were able to detect all those problems on primaries in a timely manner in our testing > environment. But much later we found out that some standbys were corrupted; > the problem appeared only when they were promoted. > Also, in a nearby thread Grygory Rylov (g0djan) is trying to enable one more > invariant in standby checks. I looked at that thread just now, but Grygory didn't say why this true root check was particularly important, so I can't see much upside. Plus that seems riskier than what you have in mind here. Does it have something to do with the true root looking like a deleted page? The details matter. -- Peter Geoghegan
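To make the lock-coupling idea easier to follow, here is a minimal sketch of the kind of right-sibling cross-check involved, written as if it ran inside amcheck's bt_target_page_check() with the existing BtreeCheckState fields (state->rel, state->targetblock, state->checkstrategy) and the target page's special area in topaque. It is an illustration under those assumptions, not Andrey's actual patch, which also has to deal with how amcheck copies and unlocks the target page and with half-dead/deleted siblings:

/*
 * Sketch only: while the target page is still locked, couple a read
 * lock onto its right sibling and check that the sibling points back
 * at us.  A lost update of the left page, as described above, shows
 * up as a btpo_prev that points at the "newer" left sibling instead.
 */
if (!P_RIGHTMOST(topaque))
{
	Buffer		rbuf;
	Page		rpage;
	BTPageOpaque ropaque;

	rbuf = ReadBufferExtended(state->rel, MAIN_FORKNUM,
							  topaque->btpo_next, RBM_NORMAL,
							  state->checkstrategy);
	LockBuffer(rbuf, BT_READ);	/* second buffer lock held concurrently */
	rpage = BufferGetPage(rbuf);
	ropaque = (BTPageOpaque) PageGetSpecialPointer(rpage);

	/* ignorable (half-dead/deleted) siblings need more careful handling */
	if (!P_IGNORE(ropaque) && ropaque->btpo_prev != state->targetblock)
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("left link of right sibling %u does not point to block %u in index \"%s\"",
						topaque->btpo_next, state->targetblock,
						RelationGetRelationName(state->rel))));

	UnlockReleaseBuffer(rbuf);
}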
Re: pgindent && weirdness
On Thu, Jan 16, 2020 at 06:13:36PM -0300, Alvaro Herrera wrote: > This is indeed a very good fix! Several badly formatted sites in our > code are improved with this change. Nice find! Could you commit that? I can see many places improved as well, among explain.c, tablecmds.c, typecmds.c, and much more. -- Michael signature.asc Description: PGP signature
Re: making the backend's json parser work in frontend code
On Thu, Jan 16, 2020 at 7:33 AM Robert Haas wrote: > > > 0002 does some basic header cleanup to make it possible to include the > existing header file jsonapi.h in frontend code. The state of the JSON > headers today looks generally poor. There seems not to have been much > attempt to get the prototypes for a given source file, say foo.c, into > a header file with the same name, say foo.h. Also, dependencies > between various header files seem to have been added somewhat freely. > This patch does not come close to fixing all that, but I consider it a > modest down payment on a cleanup that probably ought to be taken > further. > > 0003 splits json.c into two files, json.c and jsonapi.c. All the > lexing and parsing stuff (whose prototypes are in jsonapi.h) goes into > jsonapi.c, while the stuff that pertains to the 'json' data type > remains in json.c. This also seems like a good cleanup, because to me, > at least, it's not a great idea to mix together code that is used by > both the json and jsonb data types as well as other things in the > system that want to generate or parse json together with things that > are specific to the 'json' data type. > I'm probably responsible for a good deal of the mess, so let me say Thank you. I'll have a good look at these. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Reorderbuffer crash during recovery
On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar wrote: > > One minor comment. Otherwise, the patch looks fine to me. > + /* > + * We set final_lsn on a transaction when we decode its commit or abort > + * record, but we never see those records for crashed transactions. To > + * ensure cleanup of these transactions, set final_lsn to that of their > + * last change; this causes ReorderBufferRestoreCleanup to do the right > + * thing. Final_lsn would have been set with commit_lsn earlier when we > + * decode it commit, no need to update in that case > + */ > + if (txn->final_lsn < change->lsn) > + txn->final_lsn = change->lsn; > > /decode it commit,/decode its commit, > Thanks Dilip for reviewing. I have fixed the comments you have suggested. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com From 6c5de8db4b6ac4da9d4ada0e9fa56133af0ed008 Mon Sep 17 00:00:00 2001 From: vignesh Date: Fri, 17 Jan 2020 07:34:31 +0530 Subject: [PATCH] Reorder buffer crash while aborting old transactions. While aborting aborted transactions it crashed as there are no reorderbuffer changes present in the list because all the reorderbuffer changes were serialized. Fixing it by storing the last change's lsn while serializing the reorderbuffer changes. This lsn will be used as final_lsn which will help in cleaning of spilled files in pg_replslot. --- src/backend/replication/logical/reorderbuffer.c | 26 +++-- src/include/replication/reorderbuffer.h | 4 ++-- 2 files changed, 13 insertions(+), 17 deletions(-) diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index bbd908a..fa8a6b0 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1974,21 +1974,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid) if (TransactionIdPrecedes(txn->xid, oldestRunningXid)) { - /* - * We set final_lsn on a transaction when we decode its commit or - * abort record, but we never see those records for crashed - * transactions. To ensure cleanup of these transactions, set - * final_lsn to that of their last change; this causes - * ReorderBufferRestoreCleanup to do the right thing. - */ - if (rbtxn_is_serialized(txn) && txn->final_lsn == 0) - { -ReorderBufferChange *last = -dlist_tail_element(ReorderBufferChange, node, &txn->changes); - -txn->final_lsn = last->lsn; - } - elog(DEBUG2, "aborting old transaction %u", txn->xid); /* remove potential on-disk data, and deallocate this tx */ @@ -2697,6 +2682,17 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, } pgstat_report_wait_end(); + /* + * We set final_lsn on a transaction when we decode its commit or abort + * record, but we never see those records for crashed transactions. To + * ensure cleanup of these transactions, set final_lsn to that of their + * last change; this causes ReorderBufferRestoreCleanup to do the right + * thing. 
Final_lsn would have been set with commit_lsn earlier when we + * decode its commit, no need to update in that case + */ + if (txn->final_lsn < change->lsn) + txn->final_lsn = change->lsn; + Assert(ondisk->change.action == change->action); } diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h index 4f6c65d..5b67457 100644 --- a/src/include/replication/reorderbuffer.h +++ b/src/include/replication/reorderbuffer.h @@ -208,8 +208,8 @@ typedef struct ReorderBufferTXN * * plain abort record * * prepared transaction abort * * error during decoding - * * for a crashed transaction, the LSN of the last change, regardless of - * what it was. + * * for a transaction with serialized changes, the LSN of the latest + * serialized one, unless one of the above cases. * */ XLogRecPtr final_lsn; -- 1.8.3.1 From e9a9983ac250f2e795bfdfe78cf50e9c6c7dc5a0 Mon Sep 17 00:00:00 2001 From: vignesh Date: Fri, 17 Jan 2020 07:00:14 +0530 Subject: [PATCH] Reorder buffer crash while aborting old transactions. While aborting aborted transactions it crashed as there are no reorderbuffer changes present in the list because all the reorderbuffer changes were serialized. Fixing it by storing the last change's lsn while serializing the reorderbuffer changes. This lsn will be used as final_lsn which will help in cleaning of spilled files in pg_replslot. --- src/backend/replication/logical/reorderbuffer.c | 26 +++-- src/include/replication/reorderbuffer.h | 4 ++-- 2 files changed, 13 insertions(+), 17 deletions(-) diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 6dcd973..fb2bf70 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -1877,21 +1877,6 @@ ReorderBufferAbortOld(ReorderBuffer *rb, Transa
Re: pgindent && weirdness
On 2020-Jan-17, Michael Paquier wrote: > On Thu, Jan 16, 2020 at 06:13:36PM -0300, Alvaro Herrera wrote: > > This is indeed a very good fix! Several badly formatted sites in our > > code are improved with this change. > > Nice find! Could you commit that? I can see many places improved as > well, among explain.c, tablecmds.c, typecmds.c, and much more. I think Tom is the only one who can commit that, https://git.postgresql.org/gitweb/?p=pg_bsd_indent.git -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Setting min/max TLS protocol in clientside libpq
On Fri, Jan 17, 2020 at 10:09:54AM +0900, Michael Paquier wrote: > Could you please rebase and fix the remaining pieces of the patch? And while I remember, you may want to add checks for incorrect bounds when validating the values in fe-connect.c... The same arguments as for the backend part apply because we'd want to make the implementation as pluggable as possible with all SSL libraries. -- Michael signature.asc Description: PGP signature
Re: Run-time pruning for ModifyTable
On Thu, Jan 16, 2020 at 10:45:25PM +0100, Tomas Vondra wrote: > David, this patch is marked as "waiting on author" since 11/27, and > there have been no updates or responses since then. Do you plan to > submit a new patch version in this CF? We're already half-way through, > so there's not much time ... The reason why I moved it to 2020-01 is that there was not enough time for David to reply back. At this stage, it seems more appropriate to me to mark it as returned with feedback and move on. -- Michael signature.asc Description: PGP signature
Re: pgindent && weirdness
Alvaro Herrera writes: > On 2020-Jan-17, Michael Paquier wrote: >> Nice find! Could you commit that? I can see many places improved as >> well, among explain.c, tablecmds.c, typecmds.c, and much more. > I think Tom is the only one who can commit that, > https://git.postgresql.org/gitweb/?p=pg_bsd_indent.git I don't *think* that repo is locked down that hard --- IIRC, PG committers should have access to it. But I was hoping to hear Piotr's opinion before moving forward. regards, tom lane
Re: pgindent && weirdness
On 2020-Jan-16, Tom Lane wrote: > Alvaro Herrera writes: > > On 2020-Jan-17, Michael Paquier wrote: > >> Nice find! Could you commit that? I can see many places improved as > >> well, among explain.c, tablecmds.c, typecmds.c, and much more. > > > I think Tom is the only one who can commit that, > > https://git.postgresql.org/gitweb/?p=pg_bsd_indent.git > > I don't *think* that repo is locked down that hard --- IIRC, > PG committers should have access to it. But I was hoping to > hear Piotr's opinion before moving forward. FWIW, this code predates Piotr's involvement, I think; at least, it was already there in the FreeBSD code he imported: https://github.com/pstef/freebsd_indent/commit/55c29a8774923f2d40fef7919b9490f61e57e7bb#diff-85c94ae15198235e2363f96216b9a1b2R565 -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgindent && weirdness
Alvaro Herrera writes: > On 2020-Jan-16, Tom Lane wrote: >> ... But I was hoping to >> hear Piotr's opinion before moving forward. > FWIW I think this code predates Piotr's involvement, I think; at least, > it was already there in the FreeBSD code he imported: > https://github.com/pstef/freebsd_indent/commit/55c29a8774923f2d40fef7919b9490f61e57e7bb#diff-85c94ae15198235e2363f96216b9a1b2R565 The roots of that code are even older than Postgres, I believe, and there may not be anybody left who understands it completely. But Piotr has certainly spent more time looking at it than I have, so I'd still like to hear what he thinks. regards, tom lane
Re: Reorderbuffer crash during recovery
On Fri, Jan 17, 2020 at 7:42 AM vignesh C wrote: > > On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar wrote: > > > > One minor comment. Otherwise, the patch looks fine to me. > > + /* > > + * We set final_lsn on a transaction when we decode its commit or abort > > + * record, but we never see those records for crashed transactions. To > > + * ensure cleanup of these transactions, set final_lsn to that of their > > + * last change; this causes ReorderBufferRestoreCleanup to do the right > > + * thing. Final_lsn would have been set with commit_lsn earlier when we > > + * decode it commit, no need to update in that case > > + */ > > + if (txn->final_lsn < change->lsn) > > + txn->final_lsn = change->lsn; > > > > /decode it commit,/decode its commit, > > > > Thanks Dilip for reviewing. > I have fixed the comments you have suggested. > Thanks for the updated patch. It looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Unnecessary delay in streaming replication due to replay lag
Hi, A standby does not start the walreceiver process until the startup process finishes WAL replay. The more WAL there is to replay, the longer the delay in starting streaming replication. If the replication connection is temporarily disconnected, this delay becomes a major problem and we are proposing a solution to avoid the delay. WAL replay is likely to fall behind when the master is processing a write-heavy workload, because WAL is generated by concurrently running backends on the master while only one startup process on the standby replays WAL records in sequence as new WAL is received from the master. The replication connection between walsender and walreceiver may break due to reasons such as a transient network issue, the standby going through a restart, etc. The delay in resuming the replication connection leads to a lack of high availability - only one copy of WAL is available during this period. The problem worsens when the replication is configured to be synchronous. Commits on the master must wait until WAL replay is finished on the standby; the walreceiver is then started and it confirms flush of WAL up to the commit LSN. If the synchronous_commit GUC is set to remote_write, this behavior is equivalent to tacitly changing it to remote_apply until the replication connection is re-established! Has anyone encountered such a problem with streaming replication? We propose to address this by starting the walreceiver without waiting for the startup process to finish replay of WAL. Please see the attached patchset. It can be summarized as follows: 0001 - TAP test to demonstrate the problem. 0002 - The standby startup sequence is changed such that the walreceiver is started by the startup process before it begins to replay WAL. 0003 - Postmaster starts the walreceiver if it finds that a walreceiver process is no longer running and the state indicates that it is operating as a standby. This is a POC; we are looking for early feedback on whether the problem is worth solving and if it makes sense to solve it along this route. Hao and Asim 0001-Test-that-replay-of-WAL-logs-on-standby-does-not-aff.patch Description: Binary data 0003-Start-WAL-receiver-when-it-is-found-not-running.patch Description: Binary data 0002-Start-WAL-receiver-before-startup-process-replays-ex.patch Description: Binary data
Re: [HACKERS] Block level parallel vacuum
On Thu, Jan 16, 2020 at 5:34 PM Amit Kapila wrote: > > On Thu, Jan 16, 2020 at 4:46 PM Masahiko Sawada > wrote: > > > > Right. Most indexes (all?) of tables that are used in the regression > > tests are smaller than min_parallel_index_scan_size. And we set > > min_parallel_index_scan_size to 0 in vacuum.sql but VACUUM would not > > be speeded-up much because of the relation size. Since we instead > > populate new table for parallel vacuum testing the regression test for > > vacuum would take a longer time. > > > > Fair enough and I think it is good in a way that it won't change the > coverage of existing vacuum code. I have fixed all the issues > reported by Mahendra and have fixed a few other cosmetic things in the > attached patch. > I have few small comments. 1. logical streaming for large in-progress transactions+ + /* Can't perform vacuum in parallel */ + if (parallel_workers <= 0) + { + pfree(can_parallel_vacuum); + return lps; + } why are we checking parallel_workers <= 0, Function compute_parallel_vacuum_workers only returns 0 or greater than 0 so isn't it better to just check if (parallel_workers == 0) ? 2. +/* + * Macro to check if we are in a parallel vacuum. If true, we are in the + * parallel mode and the DSM segment is initialized. + */ +#define ParallelVacuumIsActive(lps) (((LVParallelState *) (lps)) != NULL) (LVParallelState *) (lps) -> this typecast is not required, just (lps) != NULL should be enough. 3. + shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes))); + prepare_index_statistics(shared, can_parallel_vacuum, nindexes); + pg_atomic_init_u32(&(shared->idx), 0); + pg_atomic_init_u32(&(shared->cost_balance), 0); + pg_atomic_init_u32(&(shared->active_nworkers), 0); I think it will look cleaner if we can initialize in the order they are declared in structure. 4. + VacuumSharedCostBalance = &(lps->lvshared->cost_balance); + VacuumActiveNWorkers = &(lps->lvshared->active_nworkers); + + /* + * Set up shared cost balance and the number of active workers for + * vacuum delay. + */ + pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance); + pg_atomic_write_u32(VacuumActiveNWorkers, 0); + + /* + * The number of workers can vary between bulkdelete and cleanup + * phase. + */ + ReinitializeParallelWorkers(lps->pcxt, nworkers); + + LaunchParallelWorkers(lps->pcxt); + + if (lps->pcxt->nworkers_launched > 0) + { + /* + * Reset the local cost values for leader backend as we have + * already accumulated the remaining balance of heap. + */ + VacuumCostBalance = 0; + VacuumCostBalanceLocal = 0; + } + else + { + /* + * Disable shared cost balance if we are not able to launch + * workers. + */ + VacuumSharedCostBalance = NULL; + VacuumActiveNWorkers = NULL; + } + I don't like the idea of first initializing the VacuumSharedCostBalance with lps->lvshared->cost_balance and then uninitialize if nworkers_launched is 0. I am not sure why do we need to initialize VacuumSharedCostBalance here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance);? I think we can initialize it only if nworkers_launched > 0 then we can get rid of the else branch completely. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Block level parallel vacuum
On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar wrote: > > On Thu, Jan 16, 2020 at 5:34 PM Amit Kapila wrote: > > > > On Thu, Jan 16, 2020 at 4:46 PM Masahiko Sawada > > wrote: > > > > > > Right. Most indexes (all?) of tables that are used in the regression > > > tests are smaller than min_parallel_index_scan_size. And we set > > > min_parallel_index_scan_size to 0 in vacuum.sql but VACUUM would not > > > be speeded-up much because of the relation size. Since we instead > > > populate new table for parallel vacuum testing the regression test for > > > vacuum would take a longer time. > > > > > > > Fair enough and I think it is good in a way that it won't change the > > coverage of existing vacuum code. I have fixed all the issues > > reported by Mahendra and have fixed a few other cosmetic things in the > > attached patch. > > > I have few small comments. > > 1. > logical streaming for large in-progress transactions+ > + /* Can't perform vacuum in parallel */ > + if (parallel_workers <= 0) > + { > + pfree(can_parallel_vacuum); > + return lps; > + } > > why are we checking parallel_workers <= 0, Function > compute_parallel_vacuum_workers only returns 0 or greater than 0 > so isn't it better to just check if (parallel_workers == 0) ? > > 2. > +/* > + * Macro to check if we are in a parallel vacuum. If true, we are in the > + * parallel mode and the DSM segment is initialized. > + */ > +#define ParallelVacuumIsActive(lps) (((LVParallelState *) (lps)) != NULL) > > (LVParallelState *) (lps) -> this typecast is not required, just (lps) > != NULL should be enough. > > 3. > > + shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes))); > + prepare_index_statistics(shared, can_parallel_vacuum, nindexes); > + pg_atomic_init_u32(&(shared->idx), 0); > + pg_atomic_init_u32(&(shared->cost_balance), 0); > + pg_atomic_init_u32(&(shared->active_nworkers), 0); > > I think it will look cleaner if we can initialize in the order they > are declared in structure. > > 4. > + VacuumSharedCostBalance = &(lps->lvshared->cost_balance); > + VacuumActiveNWorkers = &(lps->lvshared->active_nworkers); > + > + /* > + * Set up shared cost balance and the number of active workers for > + * vacuum delay. > + */ > + pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance); > + pg_atomic_write_u32(VacuumActiveNWorkers, 0); > + > + /* > + * The number of workers can vary between bulkdelete and cleanup > + * phase. > + */ > + ReinitializeParallelWorkers(lps->pcxt, nworkers); > + > + LaunchParallelWorkers(lps->pcxt); > + > + if (lps->pcxt->nworkers_launched > 0) > + { > + /* > + * Reset the local cost values for leader backend as we have > + * already accumulated the remaining balance of heap. > + */ > + VacuumCostBalance = 0; > + VacuumCostBalanceLocal = 0; > + } > + else > + { > + /* > + * Disable shared cost balance if we are not able to launch > + * workers. > + */ > + VacuumSharedCostBalance = NULL; > + VacuumActiveNWorkers = NULL; > + } > + > > I don't like the idea of first initializing the > VacuumSharedCostBalance with lps->lvshared->cost_balance and then > uninitialize if nworkers_launched is 0. > I am not sure why do we need to initialize VacuumSharedCostBalance > here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance, > VacuumCostBalance);? > I think we can initialize it only if nworkers_launched > 0 then we can > get rid of the else branch completely. 
I missed one of my comments: + /* Carry the shared balance value to heap scan */ + if (VacuumSharedCostBalance) + VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance); + + if (nworkers > 0) + { + /* Disable shared cost balance */ + VacuumSharedCostBalance = NULL; + VacuumActiveNWorkers = NULL; + } It doesn't make sense to keep them as two conditions; we can combine them as below /* If shared costing is enabled, carry the shared balance value to heap scan and disable the shared costing */ if (VacuumSharedCostBalance) { VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance); VacuumSharedCostBalance = NULL; VacuumActiveNWorkers = NULL; } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Add pg_file_sync() to adminpack
On Thu, Jan 16, 2020 at 09:51:24AM -0500, Robert Haas wrote: > On Tue, Jan 14, 2020 at 10:08 AM Atsushi Torikoshi wrote: >> fails we can get error messages. So it seems straightforward for me to >> 'return true on success and emit an ERROR otherwise'. > > I agree with reporting errors using ERROR, but it seems to me that we > ought to then make the function return 'void' rather than 'bool'. Yeah, that should be either ERROR and return a void result, or issue a WARNING/ERROR (depending on the code path, maybe PANIC?) with a boolean status returned if there is a WARNING. Mixing both concepts with an ERROR all the time and always a true status is just weird, because you know that if no errors are raised then the status will always be true. So there is no point in having a boolean status to begin with. -- Michael signature.asc Description: PGP signature
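For illustration, a minimal sketch of the "void result, error on failure" shape being argued for. This is not the adminpack patch itself: path validation, permission checks and the file-versus-directory distinction are all elided, and the behavior described in the comment is how I understand fd.c's fsync_fname() to work.

#include "postgres.h"
#include "fmgr.h"
#include "storage/fd.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;				/* once per loadable module */

PG_FUNCTION_INFO_V1(pg_file_sync);

/*
 * Sketch only.  fsync_fname() reports failure by itself (at ERROR,
 * promoted to PANIC under the default data_sync_retry setting), so on
 * success there is nothing left worth returning and the SQL-level
 * return type can simply be void.
 */
Datum
pg_file_sync(PG_FUNCTION_ARGS)
{
	char	   *filename = text_to_cstring(PG_GETARG_TEXT_PP(0));

	fsync_fname(filename, false);

	PG_RETURN_VOID();
}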
Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node
On Thu, Jan 16, 2020 at 11:17:36PM +0900, Fujii Masao wrote: > OK, I updated the patch that way. > Attached is the updated version of the patch. Thanks. I have a few tweaks to propose for the docs. +raise a PANIC-level error, aborting the recovery. Setting Instead of "PANIC-level error", I would just use "PANIC error", and instead of "aborting the recovery" just "crashing the server". +causes the system to ignore those WAL records WAL records are not ignored, but errors caused by incorrect page references in those WAL records are. The current phrasing sounds like the WAL records are not applied. Another thing that I just recalled: do you think that it would be better to mention that invalid page references can only be seen after reaching the consistent point during recovery? The information given looks sufficient, but I was just wondering if that's worth documenting or not. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] Block level parallel vacuum
On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar wrote: > > I have few small comments. > > 1. > logical streaming for large in-progress transactions+ > + /* Can't perform vacuum in parallel */ > + if (parallel_workers <= 0) > + { > + pfree(can_parallel_vacuum); > + return lps; > + } > > why are we checking parallel_workers <= 0, Function > compute_parallel_vacuum_workers only returns 0 or greater than 0 > so isn't it better to just check if (parallel_workers == 0) ? > Why to have such an assumption about compute_parallel_vacuum_workers()? The function compute_parallel_vacuum_workers() returns int, so such a check (<= 0) seems reasonable to me. > 2. > +/* > + * Macro to check if we are in a parallel vacuum. If true, we are in the > + * parallel mode and the DSM segment is initialized. > + */ > +#define ParallelVacuumIsActive(lps) (((LVParallelState *) (lps)) != NULL) > > (LVParallelState *) (lps) -> this typecast is not required, just (lps) > != NULL should be enough. > I think the better idea would be to just replace it PointerIsValid like below. I see similar usage in other places. #define ParallelVacuumIsActive(lps) PointerIsValid(lps) > 3. > > + shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes))); > + prepare_index_statistics(shared, can_parallel_vacuum, nindexes); > + pg_atomic_init_u32(&(shared->idx), 0); > + pg_atomic_init_u32(&(shared->cost_balance), 0); > + pg_atomic_init_u32(&(shared->active_nworkers), 0); > > I think it will look cleaner if we can initialize in the order they > are declared in structure. > Okay. > 4. > + VacuumSharedCostBalance = &(lps->lvshared->cost_balance); > + VacuumActiveNWorkers = &(lps->lvshared->active_nworkers); > + > + /* > + * Set up shared cost balance and the number of active workers for > + * vacuum delay. > + */ > + pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance); > + pg_atomic_write_u32(VacuumActiveNWorkers, 0); > + > + /* > + * The number of workers can vary between bulkdelete and cleanup > + * phase. > + */ > + ReinitializeParallelWorkers(lps->pcxt, nworkers); > + > + LaunchParallelWorkers(lps->pcxt); > + > + if (lps->pcxt->nworkers_launched > 0) > + { > + /* > + * Reset the local cost values for leader backend as we have > + * already accumulated the remaining balance of heap. > + */ > + VacuumCostBalance = 0; > + VacuumCostBalanceLocal = 0; > + } > + else > + { > + /* > + * Disable shared cost balance if we are not able to launch > + * workers. > + */ > + VacuumSharedCostBalance = NULL; > + VacuumActiveNWorkers = NULL; > + } > + > > I don't like the idea of first initializing the > VacuumSharedCostBalance with lps->lvshared->cost_balance and then > uninitialize if nworkers_launched is 0. > I am not sure why do we need to initialize VacuumSharedCostBalance > here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance, > VacuumCostBalance);? > I think we can initialize it only if nworkers_launched > 0 then we can > get rid of the else branch completely. > No, we can't initialize after nworkers_launched > 0 because by that time some workers would have already tried to access the shared cost balance. So, it needs to be done before launching the workers as is done in code. We can probably add a comment. 
> > + /* Carry the shared balance value to heap scan */ > + if (VacuumSharedCostBalance) > + VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance); > + > + if (nworkers > 0) > + { > + /* Disable shared cost balance */ > + VacuumSharedCostBalance = NULL; > + VacuumActiveNWorkers = NULL; > + } > > Doesn't make sense to keep them as two conditions, we can combine them as > below > > /* If shared costing is enable, carry the shared balance value to heap > scan and disable the shared costing */ > if (VacuumSharedCostBalance) > { > VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance); > VacuumSharedCostBalance = NULL; > VacuumActiveNWorkers = NULL; > } > makes sense to me, will change. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Block level parallel vacuum
On Fri, Jan 17, 2020 at 10:44 AM Amit Kapila wrote: > > On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar wrote: > > > > I have few small comments. > > > > 1. > > logical streaming for large in-progress transactions+ > > + /* Can't perform vacuum in parallel */ > > + if (parallel_workers <= 0) > > + { > > + pfree(can_parallel_vacuum); > > + return lps; > > + } > > > > why are we checking parallel_workers <= 0, Function > > compute_parallel_vacuum_workers only returns 0 or greater than 0 > > so isn't it better to just check if (parallel_workers == 0) ? > > > > Why to have such an assumption about > compute_parallel_vacuum_workers()? The function > compute_parallel_vacuum_workers() returns int, so such a check > (<= 0) seems reasonable to me. Okay so I should probably change my statement to why compute_parallel_vacuum_workers is returning "int" instead of uint? I mean when this function is designed to return 0 or more worker why to make it return int and then handle extra values on caller. Am I missing something, can it really return negative in some cases? I find the below code in "compute_parallel_vacuum_workers" a bit confusing +static int +compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested, + bool *can_parallel_vacuum) +{ .. + /* The leader process takes one index */ + nindexes_parallel--;--> nindexes_parallel can become -1 + + /* No index supports parallel vacuum */ + if (nindexes_parallel == 0) . -> Now if it is 0 then return 0 but if its -1 then continue. seems strange no? I think here itself we can handle if (nindexes_parallel <= 0), that will make code cleaner. + return 0; + + /* Compute the parallel degree */ + parallel_workers = (nrequested > 0) ? + Min(nrequested, nindexes_parallel) : nindexes_parallel; > > > 2. > > +/* > > + * Macro to check if we are in a parallel vacuum. If true, we are in the > > + * parallel mode and the DSM segment is initialized. > > + */ > > +#define ParallelVacuumIsActive(lps) (((LVParallelState *) (lps)) != NULL) > > > > (LVParallelState *) (lps) -> this typecast is not required, just (lps) > > != NULL should be enough. > > > > I think the better idea would be to just replace it PointerIsValid > like below. I see similar usage in other places. > #define ParallelVacuumIsActive(lps) PointerIsValid(lps) Make sense to me. > > > 3. > > > > + shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes))); > > + prepare_index_statistics(shared, can_parallel_vacuum, nindexes); > > + pg_atomic_init_u32(&(shared->idx), 0); > > + pg_atomic_init_u32(&(shared->cost_balance), 0); > > + pg_atomic_init_u32(&(shared->active_nworkers), 0); > > > > I think it will look cleaner if we can initialize in the order they > > are declared in structure. > > > > Okay. > > > 4. > > + VacuumSharedCostBalance = &(lps->lvshared->cost_balance); > > + VacuumActiveNWorkers = &(lps->lvshared->active_nworkers); > > + > > + /* > > + * Set up shared cost balance and the number of active workers for > > + * vacuum delay. > > + */ > > + pg_atomic_write_u32(VacuumSharedCostBalance, VacuumCostBalance); > > + pg_atomic_write_u32(VacuumActiveNWorkers, 0); > > + > > + /* > > + * The number of workers can vary between bulkdelete and cleanup > > + * phase. > > + */ > > + ReinitializeParallelWorkers(lps->pcxt, nworkers); > > + > > + LaunchParallelWorkers(lps->pcxt); > > + > > + if (lps->pcxt->nworkers_launched > 0) > > + { > > + /* > > + * Reset the local cost values for leader backend as we have > > + * already accumulated the remaining balance of heap. 
> > + */ > > + VacuumCostBalance = 0; > > + VacuumCostBalanceLocal = 0; > > + } > > + else > > + { > > + /* > > + * Disable shared cost balance if we are not able to launch > > + * workers. > > + */ > > + VacuumSharedCostBalance = NULL; > > + VacuumActiveNWorkers = NULL; > > + } > > + > > > > I don't like the idea of first initializing the > > VacuumSharedCostBalance with lps->lvshared->cost_balance and then > > uninitialize if nworkers_launched is 0. > > I am not sure why do we need to initialize VacuumSharedCostBalance > > here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance, > > VacuumCostBalance);? > > I think we can initialize it only if nworkers_launched > 0 then we can > > get rid of the else branch completely. > > > > No, we can't initialize after nworkers_launched > 0 because by that > time some workers would have already tried to access the shared cost > balance. So, it needs to be done before launching the workers as is > done in code. We can probably add a comment. I don't think so, VacuumSharedCostBalance is a process local which is just pointing to the shared memory variable right? and each process has to point it to the shared memory and that we are already doing in parallel_vacuum_main. So we can initialize it after worker is launched. Basically code will look like below pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance); pg_atomic_write_u32(&(lps->lvshared->
Re: Unnecessary delay in streaming replication due to replay lag
On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote: > A standby does not start the walreceiver process until the startup process > finishes WAL replay. The more WAL there is to replay, the longer the > delay in starting streaming replication. If the replication connection is > temporarily disconnected, this delay becomes a major problem and we > are proposing a solution to avoid the delay. Yeah, that's documented: https://www.postgresql.org/message-id/20190910062325.gd11...@paquier.xyz > We propose to address this by starting the walreceiver without waiting for > the startup process to finish replay of WAL. Please see the attached > patchset. It can be summarized as follows: > > 0001 - TAP test to demonstrate the problem. There is no real need for debug_replay_delay because we already have recovery_min_apply_delay, no? That would count only after consistency has been reached, and only for COMMIT records, but your test would be enough with that. > 0002 - The standby startup sequence is changed such that > the walreceiver is started by the startup process before it begins > to replay WAL. See below. > 0003 - Postmaster starts the walreceiver if it finds that a > walreceiver process is no longer running and the state > indicates that it is operating as a standby. I have not checked in detail, but I smell some race conditions between the postmaster and the startup process here. > This is a POC; we are looking for early feedback on whether the > problem is worth solving and if it makes sense to solve it along this > route. You are not the first person interested in this problem; we have a patch registered in this CF to control the timing when a WAL receiver is started at recovery: https://commitfest.postgresql.org/26/1995/ https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f...@postgrespro.ru I am pretty sure that we should not change the default behavior, which is to start the WAL receiver only after replaying everything available from the archives so as to avoid copying some WAL segments for nothing, so being able to use a GUC switch should be the way to go, and Konstantin's latest patch was using this approach. Your patch 0002 visibly adds a third mode, start immediately, on top of the two already proposed: - Start after replaying all WAL available locally and in the archives. - Start after reaching a consistent point. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] Block level parallel vacuum
On Fri, Jan 17, 2020 at 11:00 AM Dilip Kumar wrote: > > On Fri, Jan 17, 2020 at 10:44 AM Amit Kapila wrote: > > > > On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar wrote: > > > > > > I have few small comments. > > > > > > 1. > > > logical streaming for large in-progress transactions+ > > > + /* Can't perform vacuum in parallel */ > > > + if (parallel_workers <= 0) > > > + { > > > + pfree(can_parallel_vacuum); > > > + return lps; > > > + } > > > > > > why are we checking parallel_workers <= 0, Function > > > compute_parallel_vacuum_workers only returns 0 or greater than 0 > > > so isn't it better to just check if (parallel_workers == 0) ? > > > > > > > Why to have such an assumption about > > compute_parallel_vacuum_workers()? The function > > compute_parallel_vacuum_workers() returns int, so such a check > > (<= 0) seems reasonable to me. > > Okay so I should probably change my statement to why > compute_parallel_vacuum_workers is returning "int" instead of uint? > Hmm, I think the number of workers at most places is int, so it is better to return int here which will keep it consistent with how we do at other places. See, the similar usage in compute_parallel_worker. I > mean when this function is designed to return 0 or more worker why to > make it return int and then handle extra values on caller. Am I > missing something, can it really return negative in some cases? > > I find the below code in "compute_parallel_vacuum_workers" a bit confusing > > +static int > +compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested, > + bool *can_parallel_vacuum) > +{ > .. > + /* The leader process takes one index */ > + nindexes_parallel--;--> nindexes_parallel can become -1 > + > + /* No index supports parallel vacuum */ > + if (nindexes_parallel == 0) . -> Now if it is 0 then return 0 but > if its -1 then continue. seems strange no? I think here itself we can > handle if (nindexes_parallel <= 0), that will make code cleaner. > + return 0; > + I think this got recently introduce by one of my changes based on the comment by Mahendra, we can adjust this check. > > > > > > I don't like the idea of first initializing the > > > VacuumSharedCostBalance with lps->lvshared->cost_balance and then > > > uninitialize if nworkers_launched is 0. > > > I am not sure why do we need to initialize VacuumSharedCostBalance > > > here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance, > > > VacuumCostBalance);? > > > I think we can initialize it only if nworkers_launched > 0 then we can > > > get rid of the else branch completely. > > > > > > > No, we can't initialize after nworkers_launched > 0 because by that > > time some workers would have already tried to access the shared cost > > balance. So, it needs to be done before launching the workers as is > > done in code. We can probably add a comment. > I don't think so, VacuumSharedCostBalance is a process local which is > just pointing to the shared memory variable right? > > and each process has to point it to the shared memory and that we are > already doing in parallel_vacuum_main. So we can initialize it after > worker is launched. > Basically code will look like below > > pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance); > pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0); > oh, I thought you were telling to initialize the shared memory itself after launching the workers. However, you are asking to change the usage of the local variable, I think we can do that. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Block level parallel vacuum
On Fri, Jan 17, 2020 at 11:34 AM Amit Kapila wrote: > > On Fri, Jan 17, 2020 at 11:00 AM Dilip Kumar wrote: > > > > On Fri, Jan 17, 2020 at 10:44 AM Amit Kapila > > wrote: > > > > > > On Fri, Jan 17, 2020 at 9:36 AM Dilip Kumar wrote: > > > > > > > > I have few small comments. > > > > > > > > 1. > > > > logical streaming for large in-progress transactions+ > > > > + /* Can't perform vacuum in parallel */ > > > > + if (parallel_workers <= 0) > > > > + { > > > > + pfree(can_parallel_vacuum); > > > > + return lps; > > > > + } > > > > > > > > why are we checking parallel_workers <= 0, Function > > > > compute_parallel_vacuum_workers only returns 0 or greater than 0 > > > > so isn't it better to just check if (parallel_workers == 0) ? > > > > > > > > > > Why to have such an assumption about > > > compute_parallel_vacuum_workers()? The function > > > compute_parallel_vacuum_workers() returns int, so such a check > > > (<= 0) seems reasonable to me. > > > > Okay so I should probably change my statement to why > > compute_parallel_vacuum_workers is returning "int" instead of uint? > > > > Hmm, I think the number of workers at most places is int, so it is > better to return int here which will keep it consistent with how we do > at other places. See, the similar usage in compute_parallel_worker. Okay, I see. > > I > > mean when this function is designed to return 0 or more worker why to > > make it return int and then handle extra values on caller. Am I > > missing something, can it really return negative in some cases? > > > > I find the below code in "compute_parallel_vacuum_workers" a bit confusing > > > > +static int > > +compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int > > nrequested, > > + bool *can_parallel_vacuum) > > +{ > > .. > > + /* The leader process takes one index */ > > + nindexes_parallel--;--> nindexes_parallel can become -1 > > + > > + /* No index supports parallel vacuum */ > > + if (nindexes_parallel == 0) . -> Now if it is 0 then return 0 but > > if its -1 then continue. seems strange no? I think here itself we can > > handle if (nindexes_parallel <= 0), that will make code cleaner. > > + return 0; > > + > > I think this got recently introduce by one of my changes based on the > comment by Mahendra, we can adjust this check. Ok > > > > > > > > > I don't like the idea of first initializing the > > > > VacuumSharedCostBalance with lps->lvshared->cost_balance and then > > > > uninitialize if nworkers_launched is 0. > > > > I am not sure why do we need to initialize VacuumSharedCostBalance > > > > here? just to perform pg_atomic_write_u32(VacuumSharedCostBalance, > > > > VacuumCostBalance);? > > > > I think we can initialize it only if nworkers_launched > 0 then we can > > > > get rid of the else branch completely. > > > > > > > > > > No, we can't initialize after nworkers_launched > 0 because by that > > > time some workers would have already tried to access the shared cost > > > balance. So, it needs to be done before launching the workers as is > > > done in code. We can probably add a comment. > > I don't think so, VacuumSharedCostBalance is a process local which is > > just pointing to the shared memory variable right? > > > > and each process has to point it to the shared memory and that we are > > already doing in parallel_vacuum_main. So we can initialize it after > > worker is launched. 
> > Basically code will look like below > > > > pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance); > > pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0); > > > > oh, I thought you were telling to initialize the shared memory itself > after launching the workers. However, you are asking to change the > usage of the local variable, I think we can do that. Okay. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Errors when update a view with conditional-INSTEAD rules
Thanks a lot, Dean, for looking into this, and also sorry for the late reply. On Sun, Jan 5, 2020 at 12:08 AM Dean Rasheed wrote: > Tracing it through, this seems to be a result of > cab5dc5daf2f6f5da0ce79deb399633b4bb443b5 which added support for > updatable views with a mix of updatable and non-updatable columns. > That included a change to rewriteTargetListIU() to prevent it from > adding dummy targetlist entries for unassigned-to attributes for > auto-updatable views, in case they are no longer simple references to > the underlying relation. Instead, that is left to expand_targetlist(), > as for a normal table. However, in this case (an UPDATE on a view with > a conditional rule), the target relation of the original query isn't > rewritten (we leave it to the executor to report the error), and so > expand_targetlist() ends up adding a new targetlist entry that > references the target relation, which is still the original view. But > when the planner builds the simple_rel_array, it only adds entries for > relations referenced in the query's jointree, which only includes the > base table by this point, not the view. Thus the new targetlist entry > added by expand_targetlist() refers to a NULL slot in the > simple_rel_array, and it blows up. > > That's a great analysis of this issue. > Given that this is a query that's going to fail anyway, I'm inclined > to think that the right thing to do is to throw the error sooner, in > rewriteQuery(), rather than attempting to plan a query that cannot be > executed. > I am wondering whether a simple auto-updatable view can have a conditional update instead rule. For the test case I added, does the plan below look reasonable? gpadmin=# explain UPDATE v1 SET b = 2 WHERE a = 1; QUERY PLAN --- Insert on t2 (cost=0.00..49.55 rows=1 width=8) -> Seq Scan on t1 (cost=0.00..49.55 rows=1 width=8) Filter: ((b > 100) AND (a > 2) AND (a = 1)) Update on t1 (cost=0.00..49.55 rows=3 width=14) -> Seq Scan on t1 (cost=0.00..49.55 rows=3 width=14) Filter: (((a > 2) IS NOT TRUE) AND (b > 100) AND (a = 1)) (7 rows) gpadmin=# UPDATE v1 SET b = 2 WHERE a = 1; UPDATE 1 The documentation also says: "There is a catch if you try to use conditional rules for *complex view* updates: there must be an unconditional INSTEAD rule for each action you wish to allow on the view" which makes me think a simple view can have a conditional INSTEAD rule. And the documentation was explicitly changed in commit a99c42f291421572aef2: - There is a catch if you try to use conditional rules for view + There is a catch if you try to use conditional rules for complex view Does that mean we should support conditional rules for a simple view? Regards, Pengzhou Tang
Re: FETCH FIRST clause PERCENT option
Thank you for the notification, Tomas. At Thu, 16 Jan 2020 22:33:58 +0100, Tomas Vondra wrote in > This patch is marked as RFC since September. Since then there was no > discussion on this thread, but Andrew proposed an alternative approach > based on window functions in a separate thread [1] (both for this and > for the WITH TIES case). > > I'll set this patch back to "needs review" - at this point we need to > decide which of the approaches is the right one. Even considering that we have only two usages of the mechanism for now, WITH TIES and PERCENT, I like [1] for its generic nature, simplicity and smaller invasiveness to the executor. The fact that a cursor needs to materialize to allow a backward scan would not be a big problem. It seems that PERCENT could easily be implemented on top of that. However, isn't there a possibility that we will allow FETCH FIRST x PERCENT WITH TIES, though I'm not sure what the SQL spec says about that? I can't come up right now with a way to do that using window functions. regards. -- Kyotaro Horiguchi NTT Open Source Software Center