Shmem queue is not flushed if receiver is not yet attached
Hello, While testing on the current PG master, I noticed a problem between backends communicating over a shared memory queue. I think `shm_mq_sendv()` fails to flush the queue, even if `force_flush` is set to true, if the receiver is not yet attached to the queue. This simple fix solves the problem for me. On another note, `shm_mq.h` declares `shm_mq_flush()`, but I don't see it being implemented. Maybe just a leftover from the previous work? Though it seems useful to implement that API. Thanks, Pavan -- Pavan Deolasee EnterpriseDB: https://www.enterprisedb.com 0001-Flush-the-queue-even-if-receiver-has-not-attached.patch Description: Binary data
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
On Mon, Jan 18, 2021 at 3:02 AM Tomas Vondra wrote: > > > Pushed. Thanks everyone for the effort put into this patch. The first > version was sent in 2015, so it took quite a bit of time. > > Thanks Tomas, Anastasia and everyone else who worked on the patch and ensured that it got into the tree. Thanks, Pavan -- Pavan Deolasee EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
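For readers coming to this thread without the earlier context, the committed feature lets COPY FREEZE mark the freshly loaded pages as all-visible/all-frozen (PD_ALL_VISIBLE plus the visibility map bits) during the load itself, so a later VACUUM does not have to rewrite them. A minimal sketch of the workload, assuming the pg_visibility extension is available; the table name and file path are illustrative only:

BEGIN;
CREATE TABLE freeze_demo (id int, payload text);   -- must be created (or truncated) in this transaction
COPY freeze_demo FROM '/tmp/freeze_demo.csv' (FORMAT csv, FREEZE);
COMMIT;

-- With the patch in place, the loaded pages should already be flagged:
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT blkno, all_visible, all_frozen, pd_all_visible
FROM pg_visibility('freeze_demo'::regclass)
LIMIT 5;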
Re: some pointless HeapTupleHeaderIndicatesMovedPartitions calls
Hi Alvaro, On Tue, Sep 29, 2020 at 10:14 PM Alvaro Herrera wrote: > Hello > > Pavan Deolasee recently noted that a few of the > HeapTupleHeaderIndicatesMovedPartitions calls added by commit > 5db6df0c0117 are useless, since they are done after comparing t_self > with t_ctid. That's because t_self can never be set to the magical > values that indicate that the tuple moved partition. If the first > test fails (so we know t_self equals t_ctid), necessarily the second > test will also fail. > > So these checks can be removed and no harm is done. > > The patch looks good to me. The existing coding pattern was a bit confusing and that's how I noticed it. So +1 for fixing it. Thanks, Pavan -- Pavan Deolasee EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Mar 23, 2018 at 10:00 AM, Amit Langote < langote_amit...@lab.ntt.co.jp> wrote: > On 2018/03/23 3:42, Pavan Deolasee wrote: > > A slightly improved version attached. Apart from doc cleanup based on > > earlier feedback, fixed one assertion failure based on Rahila's report. > > This was happening when target relation is referenced in the source > > subquery. Fixed that and added a test case to test that situation. > > > > Rebased on current master. > > I tried these patches (applied 0002 on top of 0001). When applying 0002, > I got some apply errors: > > The next patch would create the file > src/test/isolation/expected/merge-delete.out, > which already exists! Assume -R? [n] > > I managed to apply it by ignoring the errors, but couldn't get make check > to pass; attached regressions.diffs if you want to take a look. > Thanks. Are you sure you're using a clean repo? I suspect you had a previous version of the patch applied, and hence the apply errors now. I also suspect that you may have made a mistake while resolving the conflicts while applying the patch (since a file at the same path existed). The failures also seem related to a past version of the patch. I just checked with a freshly checked out repo and the patches apply correctly on the current master, and regression passes too. http://commitfest.cputube.org/ also reported success overnight. > > Btw, is 0001 redundant with the latest patch on ON CONFLICT DO UPDATE > thread? Can I apply just 0002 on top of that patch? So, I tried that -- > that is, skipped your 0001 and instead applied ON CONFLICT DO UPDATE > patch, and then applied your 0002. Yes. I should probably rebase my patch on your v9, or just include the relevant changes in the MERGE patch itself to avoid any dependency right now. Will check. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
On Fri, Mar 9, 2018 at 8:49 PM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2/1/18 19:21, Simon Riggs wrote: > > If we really can't persuade you of that, it doesn't sink the patch. We > > can have the WAL pointer itself - it wouldn't save space but it would > > at least alleviate the spinlock. > > Do you want to send in an alternative patch that preserves the WAL > pointer and only changes the locking? > > Sorry for the delay. Here is an updated patch which now replaces xl_prev with xl_curr, thus providing similar safeguards against corrupted or torn WAL pages, but still providing the benefits of atomic operations. I repeated the same set of tests and the results are broadly similar. These tests were done on a different AWS instance though and hence are not comparable to the previous tests. What we do in these tests is essentially call ReserveXLogInsertLocation() 1M times to reserve 256 bytes each time, in a single SELECT statement (the function is exported and a SQL-callable routine is added for the purpose of the tests).

Patched:
==
#clients    #tps
1           34.195311
2           29.001584
4           31.712009
8           35.489272
16          41.950044

Master:
==
#clients    #tps
1           24.128004
2           12.326135
4            8.334143
8           16.035699
16           8.502794

So that's a pretty good improvement across the spectrum. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services pg_reduce_wal_contention_v2.patch Description: Binary data
Re: Faster inserts with mostly-monotonically increasing values
Hi Andrew and Claudio, Thanks for the review! On Fri, Mar 23, 2018 at 10:19 AM, Andrew Dunstan < andrew.duns...@2ndquadrant.com> wrote: > On Fri, Mar 23, 2018 at 10:20 AM, Claudio Freire > wrote: > > > This patch looks in pretty good shape. I have been trying hard to > think of some failure mode but haven't been able to come up with one. > > Great! > > > Some comments > > > > +/* > > + * It's very common to have an index on an auto-incremented or > > + * monotonically increasing value. In such cases, every insertion > happens > > + * towards the end of the index. We try to optimise that case by > caching > > + * the right-most block of the index. If our cached block is still > the > > + * rightmost block, has enough free space to accommodate a new > entry and > > + * the insertion key is greater or equal to the first key in this > page, > > + * then we can safely conclude that the new key will be inserted in > the > > + * cached block. So we simply search within the cached page and > insert the > > + * key at the appropriate location. We call it a fastpath. > > > > It should say "the insertion key is strictly greater than the first key" > > Good catch. Fixed. > > > > > Also, "rightmost block" != "rightmost leaf page" ("leaf" being the key > > difference). So it should say "rightmost leaf page". > > > right. > > Fixed. > [...] > > > > Setting "offset = InvalidOffsetNumber" in that contorted way is > > unnecessary. You can remove the first assignment and instead > > initialize unconditionally right after the fastpath block (that > > doesn't make use of offset anyway): > > > Yes, I noticed that and it's confusing, Just set it at the top. > > Good idea. Changed that way. > > > > Having costs in explain tests can be fragile. Better use "explain > > (costs off)". If you run "make check" continuously in a loop, you > > should get some failures related to that pretty quickly. > > > > Agree about costs off, but I'm fairly dubious of the value of using > EXPLAIN at all here. Nothing in the patch should result in any change > in EXPLAIN output. > I agree. I initially added EXPLAIN to ensure that we're doing index-only scans. But you're correct, we don't need it in the tests themselves. > > I would probably just have a few regression lines that should be sure > to exercise the code path and leave it at that. > > I changed the regression tests to include a few more scenarios, basically using multi-column indexes in different ways and then querying rows with different orderings. I did not take away the vacuum; I believe it will actually help by introducing some fuzziness in the tests, i.e. if the vacuum does not do its job, we might execute a different plan, and the tests still verify that the output remains unchanged. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services pg_btree_target_block_v4.patch Description: Binary data
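As a rough illustration of the workload this fastpath targets, and of the multi-column-index checks described above (a sketch only, not taken from the regression tests in the patch; all names are illustrative):

-- Monotonically increasing key: every insertion lands on the rightmost
-- leaf page, which the patch caches to skip the descent from the root.
CREATE TABLE fastpath_demo (id bigserial PRIMARY KEY, ts timestamptz DEFAULT now(), val text);
INSERT INTO fastpath_demo (val)
SELECT 'row ' || g FROM generate_series(1, 100000) g;

-- Multi-column index variant, queried with different orderings, in the
-- spirit of the extended regression tests mentioned above.
CREATE INDEX fastpath_demo_multi ON fastpath_demo (id, ts);
VACUUM fastpath_demo;
SELECT id, ts FROM fastpath_demo ORDER BY id DESC, ts DESC LIMIT 5;
SELECT id, ts FROM fastpath_demo ORDER BY id ASC, ts ASC LIMIT 5;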
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Mar 23, 2018 at 6:45 AM, Peter Geoghegan wrote: > On Thu, Mar 22, 2018 at 6:02 PM, Alvaro Herrera > wrote: > > Incremental development is a good thing. Trying to do everything in a > > single commit is great when time is infinite or even merely very long, > > but if you run out of it, which I'm sure is common, leaving some things > > out that can be reasonable implemented in a separate patch is perfectly > > acceptable. > > We're talking about something that took me less than an hour to get > working. AFAICT, it's just a matter of tweaking the grammar, and > adding a bit of transformWithClause() boilerplate to the start of > transformMergeStmt(). >> >> >> I quickly implemented CTE support myself (not wCTE support, since >> MERGE doesn't use RETURNING), and it wasn't tricky. It seems to work >> when I mechanically duplicate the approach taken with other types of >> DML statement in the parser. I have written a few tests, and so far it >> holds up. >> > > Ok, thanks. I started doing something similar, but great if you have > already implemented. I will focus on other things for now. > > I am sorry. I was under the impression that you're actually writing this piece of code and hence did not pay much attention till now. I should have confirmed with you instead of assuming. I think it's a bit too late now, but I will give it a fair try tomorrow. I don't want to spend too much time on it though given how close we are to the deadline. As Alvaro said, we can always revisit this for pg12. > As I've pointed out on this thread already, I'm often concerned about > supporting functionality like this because it increases my overall > confidence in the design. If it was genuinely hard to add WITH clause > support, then that would probably tell us something about the overall > design that likely creates problems elsewhere. It's easy to say that > it isn't worth holding the patch up for WITH clause support, because > that's true, but it's also beside the point. > Understood. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Mar 23, 2018 at 4:43 AM, Peter Geoghegan wrote: > On Thu, Mar 22, 2018 at 11:42 AM, Pavan Deolasee > wrote: > > A slightly improved version attached. > > You still need to remove this change: > > > diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h > > index a4574cd533..dbfb5d2a1a 100644 > > --- a/src/include/miscadmin.h > > +++ b/src/include/miscadmin.h > > @@ -444,5 +444,6 @@ extern bool has_rolreplication(Oid roleid); > > /* in access/transam/xlog.c */ > > extern bool BackupInProgress(void); > > extern void CancelBackup(void); > > +extern int64 GetXactWALBytes(void); > Sigh. Fixed in the recent version. > > I see that we're still using two target RTEs in this latest revision, > v23e -- the new approach to partitioning, which I haven't had time to > study in more detail, has not produced a change there. Yes, we continue to use two RTEs because I don't have any brighter idea than that to handle the case of a partitioned table and right outer join. As I explained some time back, this is necessary to ensure that we don't produce duplicate rows when a partition is joined with the source and then a second partition is joined again with the source. Now I don't know if we can run a join query and still have a single RTE, but that looks improbable and wrong. This causes > weird effects, such as the following: > > """ > pg@~[20658]=# create table foo(bar int4); > CREATE TABLE > pg@~[20658]=# merge into foo f using (select 1 col) dd on f.bar=dd.col > when matched then update set bar = f.barr + 1; > ERROR: column f.barr does not exist > LINE 1: ...n f.bar=dd.col when matched then update set bar = f.barr + 1... > ^ > HINT: Perhaps you meant to reference the column "f.bar" or the column > "f.bar". > > """ > > While I think this particular HINT buglet is pretty harmless, I > continue to be concerned about the unintended consequences of having > multiple RTEs for MERGE's target table. Each RTE comes from a > different lookup path -- the first one goes through setTargetTable()'s > parserOpenTable() + addRangeTableEntryForRelation() calls. The second > one goes through transformFromClauseItem(), for the join, which > ultimately ends up calling transformTableEntry()/addRangeTableEntry(). > How is it different from running an INSERT query with the target table again specified in a subquery producing the rows to be inserted? For example,

postgres=# insert into target as t select sid from source s join target t on t.ttid = s.sid;
ERROR:  column t.ttid does not exist
LINE 1: ...rget as t select sid from source join target t on t.ttid = s...
^
HINT:  Perhaps you meant to reference the column "t.tid" or the column "t.tid".
postgres=#

This produces a HINT very similar to the one in your test above. I am certain that the "target" table gets two RTEs, exactly via the same code paths as you discussed above. So if this is not a problem for INSERT, why would it be a problem for MERGE? Maybe I am missing a point here. > Consider commit 5f173040, which fixed a privilege escalation security > bug around multiple name lookup. Could the approach taken by MERGE > here introduce a similar security issue? > > Using GDB, I see two calls to RangeVarGetRelidExtended() when a simple > MERGE is executed.
They both have identical relation arguments, that > look like this: > > (gdb) p *relation > $4 = { > type = T_RangeVar, > catalogname = 0x0, > schemaname = 0x0, > relname = 0x5600ebdcafb0 "foo", > inh = 1 '\001', > relpersistence = 112 'p', > alias = 0x5600ebdcb048, > location = 11 > } > > This seems like something that needs to be explained, at a minimum. > Even if I'm completely wrong about there being a security hazard, > maybe the suggestion that there might be still gives you some idea of > what I mean about unintended consequences. > Ok. I will try to explain it better and also think about the security hazards. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Sat, Mar 24, 2018 at 1:36 AM, Peter Geoghegan wrote: > > Fair enough. Attached patch shows what I'm on about. This should be > applied on top of 0001_merge_v23e_onconflict_work.patch + > 0002_merge_v23e_main.patch. I'm not expecting an authorship credit for > posting this patch. > Thanks for the patch. I will study and integrate this into the main patch. > > One thing that the test output shows that is interesting is that there > is never a "SubPlan 1" or "InitPlan 1" in EXPLAIN output -- it seems > to always start at "SubPlan 2". This probably has nothing to do with > CTEs in particular. I didn't notice this before now, although there > were no existing tests of EXPLAIN in the patch that show subplans or > initplans. > This query, for example, correctly starts at InitPlan 1:

postgres=# EXPLAIN MERGE INTO m USING (SELECT 1 a, 'val' b) s ON m.k = s.a WHEN NOT MATCHED THEN INSERT VALUES ((select count(*) from pg_class), s.b);
                                QUERY PLAN
---------------------------------------------------------------------------
 Merge on m  (cost=16.30..43.83 rows=6 width=106)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=16.26..16.27 rows=1 width=8)
           ->  Seq Scan on pg_class  (cost=0.00..15.41 rows=341 width=0)
   ->  Hash Right Join  (cost=0.03..27.55 rows=6 width=106)
         Hash Cond: (m_1.k = s.a)
         ->  Seq Scan on m m_1  (cost=0.00..22.70 rows=1270 width=14)
         ->  Hash  (cost=0.02..0.02 rows=1 width=96)
               ->  Subquery Scan on s  (cost=0.00..0.02 rows=1 width=96)
                     ->  Result  (cost=0.00..0.01 rows=1 width=36)
(10 rows)

> > Is this somehow related to the issue of using two RTEs for the target > relation? That's certainly why we always see unaliased target table > "m" with the alias "m_1" in EXPLAIN output, so I would not be > surprised if it caused another EXPLAIN issue. > I don't think it's related to using two RTEs. The following EXPLAIN for a regular UPDATE query also shows a SubPlan starting at 2. I think it's just down to how the planner assigns the plan_id.

postgres=# EXPLAIN WITH cte_basic AS (SELECT 1 a, 'cte_basic val' b) UPDATE m SET v = (SELECT b || ' merge update' FROM cte_basic WHERE cte_basic.a = m.k LIMIT 1) ;
                              QUERY PLAN
----------------------------------------------------------------------
 Update on m  (cost=0.01..54.46 rows=1270 width=42)
   CTE cte_basic
     ->  Result  (cost=0.00..0.01 rows=1 width=36)
   ->  Seq Scan on m  (cost=0.00..54.45 rows=1270 width=42)
         SubPlan 2
           ->  Limit  (cost=0.00..0.02 rows=1 width=32)
                 ->  CTE Scan on cte_basic  (cost=0.00..0.02 rows=1 width=32)
                       Filter: (a = m.k)
(8 rows)

A quick gdb trace shows that the CTE itself is assigned plan_id 1 and the SubPlan then gets plan_id 2. I can investigate further, but given that we see similar behaviour with a regular UPDATE, I don't think it's worth it. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
On Sat, Mar 24, 2018 at 5:28 PM, Robert Haas wrote: > > This is a patch about which multiple people have expressed concerns. > You're trying to jam a heavily redesigned version in at the last > minute without adequate review. Please don't do that. > > TBH this is not a heavily redesigned version. There were two parts to the original patch:

1. Replacing the 8-byte xl_prev with a 2-byte xl_walid in order to reduce the size of the WAL record header.
2. Changing XLogCtlInsert.CurrBytePos to use atomic operations in order to reduce the spinlock contention.

Most people expressed concerns regarding 1, but there were none regarding 2. Now it's possible that the entire attention got diverted to 1 and nobody really studied 2 in detail. Or maybe 2 is mostly non-contentious, given the results of the micro benchmarks. So what I've done in v2 is to deal with just part 2, i.e. replace access to CurrBytePos with atomic operations, based on the following suggestion by Simon. On 2/1/18 19:21, Simon Riggs wrote: > If we really can't persuade you of that, it doesn't sink the patch. We > can have the WAL pointer itself - it wouldn't save space but it would > at least alleviate the spinlock. This gives us the same level of guarantee that xl_prev used to offer, yet helps us use atomic operations. I'll be happy if we can look at that particular change and see if there are any holes in it. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Exclude unlogged tables from base backups
On Fri, Mar 23, 2018 at 9:51 PM, David Steele wrote: > On 3/23/18 12:14 PM, Teodor Sigaev wrote: > > > > Thank you, pushed > > Is it just me, or is the newly added test in 010_pg_basebackup.pl failing for others too?

#   Failed test 'unlogged main fork not in backup'
#   at t/010_pg_basebackup.pl line 112.
t/010_pg_basebackup.pl ... 86/87 # Looks like you failed 1 test of 87.

I manually ran pg_basebackup and it correctly excludes the main fork of an unlogged table from the backup, but it consistently copies the main fork while running "make check" and thus fails the test. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
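For context, a quick way to see which relation file the exclusion logic is about (a minimal sketch; the table name is illustrative): the main fork of an unlogged table should be skipped by the base backup, while the matching init fork is kept so the table can be reset during recovery.

CREATE UNLOGGED TABLE unlogged_demo (id int);
INSERT INTO unlogged_demo SELECT generate_series(1, 1000);

-- Path of the main fork relative to the data directory; a base backup should
-- exclude this file but keep the corresponding "<relfilenode>_init" fork.
SELECT pg_relation_filepath('unlogged_demo');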
Re: Faster inserts with mostly-monotonically increasing values
On Sun, Mar 25, 2018 at 6:00 AM, Andrew Dunstan < andrew.duns...@2ndquadrant.com> wrote: > On Fri, Mar 23, 2018 at 8:27 PM, Pavan Deolasee > wrote: > >> > >> > >> I would probably just have a few regression lines that should be sure > >> to exercise the code path and leave it at that. > >> > > > > I changed the regression tests to include a few more scenarios, basically > > using multi-column indexes in different ways and they querying rows by > > ordering rows in different ways. I did not take away the vacuum and I > > believe it will actually help the tests by introducing some fuzziness in > the > > tests i.e. if the vacuum does not do its job, we might execute a > different > > plan and ensure that the output remains unchanged. > > > > > If we're going to keep the vacuums in there, do we need to add a wait > barrier like Claudio suggested upthread? > > I don't think we need the wait barrier since we're no longer printing the explain plan. In the worst case, the vacuum may not get to set pages all-visible, thus planner choosing something other than an index-only-scan, but I guess that fuzziness actually helps the regression tests. That way we get confirmation regarding the final result irrespective of the plan chosen. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PATCH: Exclude unlogged tables from base backups
On Mon, Mar 26, 2018 at 1:03 PM, Pavan Deolasee wrote: > On Fri, Mar 23, 2018 at 9:51 PM, David Steele wrote: > >> On 3/23/18 12:14 PM, Teodor Sigaev wrote: >> > >> > Thank you, pushed >> >> > Is it just me or the newly added test in 010_pg_basebackup.pl failing for > others too? > > # Failed test 'unlogged main fork not in backup' > # at t/010_pg_basebackup.pl line 112. > t/010_pg_basebackup.pl ... 86/87 # Looks like you failed 1 test of 87. > > This one-liner patch fixes it for me. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services 0001_pg_basebackup_fix.patch Description: Binary data
Re: PATCH: Exclude unlogged tables from base backups
On Mon, Mar 26, 2018 at 5:16 PM, Masahiko Sawada wrote: > On Mon, Mar 26, 2018 at 4:52 PM, Pavan Deolasee > > >> > > > > This one-liner patch fixes it for me. > > > > Isn't this issue already fixed by commit > d0c0c894533f906b13b79813f02b2982ac675074? Ah, right. Thanks for pointing out and sorry for the noise. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Tue, Mar 27, 2018 at 1:54 PM, Simon Riggs wrote: > On 26 March 2018 at 17:06, Simon Riggs wrote: > > On 26 March 2018 at 15:39, Pavan Deolasee > wrote: > > > > > That's all I can see so far. > > * change comment “once to” to “once” in src/include/nodes/execnodes.h > * change comment “and to run” to “and once to run” > * change “result relation” to “target relation” > Fixed all of that in the patch v26 set I just sent. > > * XXX we probably need to check plan output for CMD_MERGE also > Yeah. Added those checks for MERGE action's target lists in v26. > > * Spurious line feed in src/backend/optimizer/prep/preptlist.c > Couldn't spot it. Will look closer, but any hint will be appreciated. > > * No need to remove whitespace in src/backend/optimizer/util/relnode.c > Fixed in v26. > > * README should note that the TABLEOID junk column is not strictly > needed when joining to a non-partitioned table but we don't try to > optimize that away. Is that an XXX to fix in future or do we just > think the extra 4 bytes won't make much difference so we leave it? > I actually took the opportunity to conditionally fetch tableoid only if we are dealing with partitioned table. > > * Comment in rewriteTargetListMerge() should mention TABLEOID exists > to allow us to find the correct relation, not the correct row, comment > just copied from CTID above it. > > Fixed in v26. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] A design for amcheck heapam verification
On Tue, Mar 27, 2018 at 8:50 AM, Peter Geoghegan wrote: > On Fri, Mar 23, 2018 at 7:13 AM, Andrey Borodin > wrote: > > I've just flipped patch to WoA. But if above issues will be fixed I > think that patch is ready for committer. > > Attached is v7, which has the small tweaks that you suggested. > > Thank you for the review. I hope that this can be committed shortly. > > Sorry for coming a bit too late on this thread, but I started looking at the 0002 patch.

+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure. A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *

Is that correct? Aren't we actually verifying the presence in the index of all heap tuples?

@@ -116,37 +140,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);

 /*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)

Can we come up with a better name for heapallindexed? Maybe "check_heapindexed"?

+   /*
+    * Register our own snapshot in !readonly case, rather than asking
+    * IndexBuildHeapScan() to do this for us later. This needs to happen
+    * before index fingerprinting begins, so we can later be certain that
+    * index fingerprinting should have reached all tuples returned by
+    * IndexBuildHeapScan().
+    */
+   if (!state->readonly)
+       snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ }

So this looks safe. In !readonly mode, we take the snapshot *before* fingerprinting the index. Since we're using an MVCC snapshot, any tuple which is visible to the heap scan must be reachable via the index scan too. So we must find the index entry for every heap tuple returned by the scan. What I am not sure about is whether we can really examine an index which is valid but whose indcheckxmin hasn't crossed our xmin horizon. Notice that this amcheck might be running in a transaction block, probably at the repeatable-read isolation level, and hence GetTransactionSnapshot() might actually return a snapshot which can't yet read the index consistently. In practice, this is quite unlikely, but I think we should check for that case if we agree that it could be a problem.

The case with readonly mode is also interesting. Since we're scanning the heap with SnapshotAny, the heap scan may return tuples which are RECENTLY_DEAD. So the question is: can this happen?

- some concurrent index scan sees a heap tuple as DEAD and marks the index entry as LP_DEAD
- our index fingerprinting sees the index tuple as LP_DEAD
- our heap scan still sees the heap tuple as RECENTLY_DEAD

Now that looks improbable given that we compute OldestXmin after our index fingerprinting is done, i.e. between steps 2 and 3, and hence if a tuple looked DEAD to some OldestXmin/RecentGlobalXmin computed before we computed our OldestXmin, then surely our OldestXmin should also see the tuple DEAD. Or is there a corner case that we're missing?

Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS tuples, especially if those tuples were inserted/deleted by our own transaction? It's probably worth thinking about.

Apart from that, comments in IndexBuildHeapRangeScan() claim that the function is called only with ShareLock and above, which is no longer true. We should check if that has any side-effects. I can't think of any, but it's better to verify and update the comments to reflect the new reality. The partial indexes look fine since the non-interesting tuples never get called back.

One thing that is worth documenting/noting is the fact that a !readonly check will run with a long-duration registered snapshot, thus holding OldestXmin back. Is there anything we can do to lessen that burden, like telling other backends to ignore our xmin while computing OldestXmin (like vacuum does)? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
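To make the interface under discussion concrete, this is roughly how the new heap-vs-index verification would be invoked once the patch is in (the index name is illustrative; the two-argument bt_index_check() signature is the one shown in the quoted diff, and bt_index_parent_check() gains the same flag):

CREATE EXTENSION IF NOT EXISTS amcheck;

-- Structural check plus the new heapallindexed verification:
SELECT bt_index_check('some_table_pkey'::regclass, true);

-- The parent/child variant accepts the same flag; it takes stronger locks
-- and corresponds to the "readonly" mode discussed above.
SELECT bt_index_parent_check('some_table_pkey'::regclass, true);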
Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
On Tue, Mar 27, 2018 at 7:28 PM, Tom Lane wrote: > > > I had not noticed that in the kerfuffle over the more extreme version, > but I think this is a different route to breaking the same guarantees. > The point of the xl_prev links is, essentially, to allow verification > that we are reading a coherent series of WAL records, ie that record B > which appears to follow record A actually is supposed to follow it. > This throws away that cross-check just as effectively as the other patch > did, leaving us reliant solely on the per-record CRCs. CRCs are good, > but they do have a false-positive rate. > > An example of a case where xl_prev makes one feel a lot more comfortable > is the WAL-segment-switch option. The fact that the first record of the > next WAL file links back to the XLOG_SWITCH record is evidence that > ignoring multiple megabytes of WAL was indeed the intended thing to do. > An xl_curr field cannot provide any evidence as to whether WAL records > were missed. > TBH I still don't see why this does not provide the same guarantee that the current code provides, but given the concerns expressed by others, I am not going to pursue it beyond a point. But one last time :-)

The current code uses xl_prev to cross-verify that record B, read after record A, indeed follows A and has a valid back-link to A. This deals with problems where B might actually be an old WAL record, carried over from a stale WAL file. Now if we store xl_curr, we can definitely guarantee that B is ahead of A, because B->xl_curr will be greater than A->xl_curr (otherwise we bail out). So that deals with the problem of stale WAL records. In addition, we also know where A ends (we can deduce that even for XLOG_SWITCH records, knowing where the next record will start after the switch) and hence we know where B should start. So we read at B and also confirm that B->xl_curr matches B's position. If it does not, we declare end-of-WAL and bail out. So where is the problem?

> > 2. Does the new logic in pg_rewind to search backward for a checkpoint > > work reliably, and will it be slow? > > If you have to search backwards, this breaks it. Full stop. > We don't really need to fetch the previous record. We really need to find the last checkpoint prior to a given LSN. That can be done by reading WAL segments forward. It can be a little slow, but hopefully not a whole lot.

A <- B <- C <- CHKPT <- D <- E <- F <- G

So today, if we want to find the last checkpoint prior to G, we go through the back-links until we find the first checkpoint record. In the proposed code, we read forward through the current WAL segment, remember the last CHKPT record seen, and once we see G, we know we have found the prior checkpoint. If the checkpoint does not exist in the current WAL segment, we read forward through the previous WAL segment and return the last checkpoint record in that segment, and so on. So in the worst case, we might read one extra WAL segment before we find the checkpoint record. That's not ideal, but not too bad given that only pg_rewind needs this, and that too only once. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
On Wed, Mar 28, 2018 at 7:36 AM, Michael Paquier wrote: > On Tue, Mar 27, 2018 at 02:02:10PM -0400, Tom Lane wrote: > > > > > Well, the point of checkpoints is that WAL data before the last one > > should no longer matter anymore, isn't it? > > I have to agree with Tom here. If you force pg_rewind to replay more > WAL records from a checkpoint older than the checkpoint prior to where > WAL has forked at promotion then you have a risk of losing data. > Yeah, we should not do that. The patch surely does not intend to replay any more WAL than what we do today. The idea is to just use a different mechanism to find the prior checkpoint. But we should surely find the latest prior checkpoint. In the rare scenario that Tom showed, we should just throw an error and fix the patch if it's not doing that already. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Wed, Mar 28, 2018 at 8:28 AM, Peter Geoghegan wrote: > On Tue, Mar 27, 2018 at 2:28 AM, Pavan Deolasee > wrote: > > (Version 26) > > I have some feedback on this version: > > * ExecMergeMatched() needs to determine tuple lock mode for > EvalPlanQual() in a way that's based on how everything else works; > it's not okay to just use LockTupleExclusive in all cases. That will > often lead to lock escalation, which can cause unprincipled deadlocks. > You need to pass back the relevant info from routines like > heap_update(), which means more state needs to come back to > ExecMergeMatched() from routines like ExecUpdate(). > You're right. I am thinking about what would be a good way to pass that information back. Should we add a new out parameter to ExecUpdate() or a new member to HeapUpdateFailureData? It seems only ExecUpdate() would need the change, so maybe it's fine to change the API, but HeapUpdateFailureData doesn't look bad either, since it deals with failure cases and we are indeed dealing with an ExecUpdate() failure. Any preference? > > * Doesn't ExecUpdateLockMode(), which is called from places like > ExecBRUpdateTriggers(), also need to be taught about > GetEPQRangeTableIndex() (i.e. the ri_mergeTargetRTI/ri_RangeTableIndex > divide)? You should audit everything like that carefully. Maybe > GetEPQRangeTableIndex() is not the best choke-point to do this kind of > thing. Not that I have a clearly better idea. > > * Looks like there is a similar problem in > ExecPartitionCheckEmitError(). I don't really understand how that > works, so I might be wrong here. > > * More or less the same issue seems to exist within ExecConstraints(), > including where GetInsertedColumns() is used. > They all look fine to me. Remember that we always resolve column references in the WHEN targetlist from the target relation, and hence things like updatedCols/insertedCols get set on the target RTE. All of these routines read from the target RTE as far as I can see. But I will check in more detail. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan wrote: > On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee > wrote: > > + * When index-to-heap verification is requested, a Bloom filter is used > to > > + * fingerprint all tuples in the target index, as the index is > traversed to > > + * verify its structure. A heap scan later verifies the presence in the > > heap > > + * of all index tuples fingerprinted within the Bloom filter. > > + * > > > > Is that correct? Aren't we actually verifying the presence in the index > of > > all > > heap tuples? > > I think that you could describe it either way. We're verifying the > presence of heap tuples in the heap that ought to have been in the > index (that is, most of those that were fingerprinted). > Hmm Ok. It still sounds backwards to me, but then English is not my first or even second language. "A heap scan later verifies the presence in the heap of all index tuples fingerprinted" sounds as if we expect to find all fingerprinted index tuples in the heap. But in reality, we check if all heap tuples have a fingerprinted index tuple. > > You're right - there is a narrow window for REPEATABLE READ and > SERIALIZABLE transactions. This is a regression in v6, the version > removed the TransactionXmin test. > > I am tempted to fix this by calling GetLatestSnapshot() instead of > GetTransactionSnapshot(). However, that has a problem of its own -- it > won't work in parallel mode, and we're dealing with parallel > restricted functions, not parallel unsafe functions. I don't want to > change that to fix such a narrow issue. IMV, a better fix is to treat > this as a serialization failure. Attached patch, which applies on top > of v7, shows what I mean. > Ok. I am happy with that. > > I think that this bug is practically impossible to hit, because we use > the xmin from the pg_index tuple during "is index safe to use?" > indcheckxmin/TransactionXmin checks (the patch that I attached adds a > similar though distinct check), which raises another question for a > REPEATABLE READ xact. That question is: How is a REPEATABLE READ > transaction supposed to see the pg_index entry to get the index > relation's oid to call a verification function in the first place? Well pg_index entry will be visible and should be visible. Otherwise how do you even maintain a newly created index. I am not sure, but I guess we take fresh MVCC snapshots for catalog searches, even in RR transactions. > My > point is that there is no need for a more complicated solution than > what I propose. > I agree on that. > > I don't think so. The way we compute OldestXmin for > IndexBuildHeapRangeScan() is rather like a snapshot acquisition -- > GetOldestXmin() locks the proc array in shared mode. As you pointed > out, the fact that it comes after everything else (fingerprinting) > means that it must be equal to or later than what index scans saw, > that allowed them to do the kill_prior_tuple() stuff (set LP_DEAD > bits). > > That's probably true. > > > Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_ > PROGRESS > > tuples, especially if those tuples were inserted/deleted by our own > > transaction? It probably worth thinking. > Anything here that you would like to check? I understand that you may see such tuples only for catalog tables. 
> > IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap > ShareUpdateExclusiveLock (it just says SharedLock), because that lock > strength doesn't have anything to do with IndexBuildHeapRangeScan() > when it operates with an MVCC snapshot. I think that this means that > this patch doesn't need to update comments within > IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems > independent. > Ok, I agree. But note that we are now invoking that code with AccessShareLock() whereas the existing callers use either ShareLock or ShareUpdateExclusiveLock. That probably does not matter, but it's a change worth noting. > > > Is > > there anything we can do to lessen that burden like telling other > backends > > to > > ignore our xmin while computing OldestXmin (like vacuum does)? > > I don't think so. The transaction involved is still an ordinary user > transaction. > While that's true, I am wondering if there is anything we can do to stop a consistency-checking utility from creating other non-trivial side effects on the state of the database, even if that means we cannot check all heap tuples. For example, can there be a way by which we allow concurrent vacuum or HOT prune to continue to prune away dead tuples, eve
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan wrote: > > I don't think so. The transaction involved is still an ordinary user > transaction. > > Mostly a nitpick, but I guess we should leave a comment after IndexBuildHeapScan() saying heap_endscan() is not necessary since IndexBuildHeapScan() does that internally. I stumbled upon that while looking for any potential leaks. I know at least one other caller of IndexBuildHeapScan() doesn't bother to say anything either, but it's helpful. FWIW I also looked at the 0001 patch and it looks fine to me. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Tue, Mar 27, 2018 at 5:00 PM, Simon Riggs wrote: > > In terms of further performance optimization, if there is just one > WHEN AND condition and no unconditional WHEN clauses then we can add > the WHEN AND easily to the join query. > > That seems like an easy thing to do for PG11 > > I think we need to be careful in terms of what can be pushed down to the join, in the presence of WHEN NOT MATCHED actions. If we push the WHEN AND qual to the join, then I am worried that some rows which should have been reported "matched" and later filtered out as part of the WHEN quals will get reported as "not matched", thus triggering the WHEN NOT MATCHED action. For example,

postgres=# select * from target ;
 a | b
---+----
 1 | 10
 2 | 20
(2 rows)

postgres=# select * from source ;
 a | b
---+-----
 2 | 200
 3 | 300
(2 rows)

postgres=# BEGIN;
BEGIN
postgres=# EXPLAIN ANALYZE MERGE INTO target t USING source s ON t.a = s.a WHEN MATCHED AND t.a < 2 THEN UPDATE SET b = s.b WHEN NOT MATCHED THEN INSERT VALUES (s.a, -1);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Merge on target t  (cost=317.01..711.38 rows=25538 width=46) (actual time=0.104..0.104 rows=0 loops=1)
   Tuples Inserted: 1
   Tuples Updated: 0
   Tuples Deleted: 0
   Tuples Skipped: 1
   ->  Merge Left Join  (cost=317.01..711.38 rows=25538 width=46) (actual time=0.071..0.074 rows=2 loops=1)
         Merge Cond: (s.a = t_1.a)
         ->  Sort  (cost=158.51..164.16 rows=2260 width=40) (actual time=0.042..0.043 rows=2 loops=1)
               Sort Key: s.a
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on source s  (cost=0.00..32.60 rows=2260 width=40) (actual time=0.027..0.031 rows=2 loops=1)
         ->  Sort  (cost=158.51..164.16 rows=2260 width=10) (actual time=0.019..0.020 rows=2 loops=1)
               Sort Key: t_1.a
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on target t_1  (cost=0.00..32.60 rows=2260 width=10) (actual time=0.012..0.014 rows=2 loops=1)
 Planning Time: 0.207 ms
 Execution Time: 0.199 ms
(17 rows)

postgres=# abort;
ROLLBACK
postgres=# BEGIN;
BEGIN
postgres=# EXPLAIN ANALYZE MERGE INTO target t USING source s ON t.a = s.a AND t.a < 2 WHEN MATCHED THEN UPDATE SET b = s.b WHEN NOT MATCHED THEN INSERT VALUES (s.a, -1);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Merge on target t  (cost=232.74..364.14 rows=8509 width=46) (actual time=0.128..0.128 rows=0 loops=1)
   Tuples Inserted: 2
   Tuples Updated: 0
   Tuples Deleted: 0
   Tuples Skipped: 0
   ->  Merge Right Join  (cost=232.74..364.14 rows=8509 width=46) (actual time=0.070..0.072 rows=2 loops=1)
         Merge Cond: (t_1.a = s.a)
         ->  Sort  (cost=74.23..76.11 rows=753 width=10) (actual time=0.038..0.039 rows=1 loops=1)
               Sort Key: t_1.a
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on target t_1  (cost=0.00..38.25 rows=753 width=10) (actual time=0.026..0.028 rows=1 loops=1)
                     Filter: (a < 2)
                     Rows Removed by Filter: 1
         ->  Sort  (cost=158.51..164.16 rows=2260 width=40) (actual time=0.024..0.025 rows=2 loops=1)
               Sort Key: s.a
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on source s  (cost=0.00..32.60 rows=2260 width=40) (actual time=0.014..0.017 rows=2 loops=1)
 Planning Time: 0.218 ms
 Execution Time: 0.234 ms
(19 rows)

postgres=# abort;
ROLLBACK

If you look at the first MERGE statement, we filter one matched source row (2,200) using (t.a < 2) and do not take any action for that row (note "Tuples Skipped: 1"). This filtering happens after the RIGHT JOIN has reported it as "matched". But if we push down the qual to the join, then the join will see that the source row has no match and hence send that row for NOT MATCHED processing, thus inserting it into the table again (note "Tuples Inserted: 2" in the second plan). I am not saying there is no scope for improvement.
But we need to be careful about what can be pushed down to the join and what must be applied after the join. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
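For anyone who wants to reproduce the two plans above, a sketch of the setup implied by the psql output (the column types are assumed to be plain integers; the MERGE statements themselves are the ones shown in the message):

CREATE TABLE target (a int, b int);
CREATE TABLE source (a int, b int);
INSERT INTO target VALUES (1, 10), (2, 20);
INSERT INTO source VALUES (2, 200), (3, 300);

-- With this data, source row (2, 200) matches target row (2, 20). Keeping
-- "t.a < 2" in WHEN MATCHED AND reports that row as matched and merely skips
-- it, whereas pushing the qual into the ON clause makes the row look
-- unmatched and triggers the WHEN NOT MATCHED insert.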
Re: [HACKERS] MERGE SQL Statement for PG11
On Tue, Apr 3, 2018 at 8:31 AM, Robert Haas wrote: > On Mon, Apr 2, 2018 at 10:40 PM, Peter Geoghegan wrote: > > On Mon, Apr 2, 2018 at 7:18 PM, Andres Freund > wrote: > >> I did a scan through this, as I hadn't been able to keep with the thread > >> previously. Sorry if some of the things mentioned here have been > >> discussed previously. I am just reading through the patch in its own > >> order, so please excuse if there's things I remark on that only later > >> fully make sense. > >> > >> > >> later update: TL;DR: I don't think the parser / executor implementation > >> of MERGE is architecturally sound. I think creating hidden joins during > >> parse-analysis to implement MERGE is a seriously bad idea and it needs > >> to be replaced by a different executor structure. > > > > +1. I continue to have significant misgivings about this. It has many > > consequences that we know about, and likely quite a few more that we > > don't. > > +1. I didn't understand from Peter's earlier comments that we were > doing that, and I agree that it isn't a good design choice. > > Honestly I don't think Peter ever raised concerns about the join, though I could be missing early discussions when I wasn't paying attention. It's there from day 1. Peter raised concerns about the two RTE stuff which was necessitated when we added support for partitioned table. We discussed that at some length, with your inputs and agreed that it's not necessarily a bad thing and probably the only way to deal with partitioned tables. Personally, I don't see why an internal join is bad. That's what MERGE is doing anyways, so it closely matches with the overall procedure. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Missing parse_merge.h?
On Tue, Apr 3, 2018 at 2:37 PM, Satoshi Nagayasu wrote: > Hi, > > I just got a compile error as below on the latest HEAD. > --- > gcc -Wall -Wmissing-prototypes -Wpointer-arith > -Wdeclaration-after-statement -Wendif-labels > -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing > -fwrapv -fexcess-precision=standard -g -O2 -I. -I. > -I../../../src/include -D_GNU_SOURCE -c -o analyze.o analyze.c > > analyze.c:41:32: fatal error: parser/parse_merge.h: No such file or > directory > > #include "parser/parse_merge.h" > >^ > > compilation terminated. > > make[3]: *** [analyze.o] Error 1 > --- > > The latest commit [1] looks missing parse_merge.h. > Or did I make some mistake on building? > > Looks like Simon forgot to add new files in that commit. I am trying to get in touch with him so that he can add the missing files and correct the situation. Sorry for the trouble. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pgsql: New files for MERGE
On Wed, Apr 4, 2018 at 10:40 PM, Andres Freund wrote: > Hi, > > On 2018-04-03 08:32:45 -0700, Andres Freund wrote: > > Hi, > > > > On 2018-04-03 09:24:12 +, Simon Riggs wrote: > > > New files for MERGE > > > src/backend/executor/nodeMerge.c | 575 +++ > > > src/backend/parser/parse_merge.c | 660 > > > src/include/executor/nodeMerge.h | 22 + > > > src/include/parser/parse_merge.h | 19 + > > > > Getting a bit grumpy here. So you pushed this, without responding in > > any way to the objections I made in > > http://archives.postgresql.org/message-id/20180403021800. > b5nsgiclzanobiup%40alap3.anarazel.de > > and did it in a manner that doesn't even compile? > > This needs at the very least a response to the issues pointed out in the > referenced email that you chose to ignore without any sort of comment. > > Apologies from my end. Simon checked with me regarding your referenced email. I was in the middle of responding to it (with a add-on patch to take care of your review comments), but got side tracked by some high priority customer escalation. I shall respond soon. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pgsql: New files for MERGE
On Thu, Apr 5, 2018 at 12:16 AM, Andres Freund wrote: > Hi, > > On 2018-04-05 00:02:06 +0530, Pavan Deolasee wrote: > > Apologies from my end. Simon checked with me regarding your referenced > > email. I was in the middle of responding to it (with a add-on patch to > take > > care of your review comments), but got side tracked by some high priority > > customer escalation. I shall respond soon. > > Hows that an explanation for just going ahead and committing? Without > even commenting on why one thinks the pointed out issues are something > that can be resolved later or somesuch? This has an incredibly rushed > feel to it. > While I don't want to answer that on Simon's behalf, my feeling is that he may not have seen your email, since it came pretty late. He had probably planned to commit the patch again first thing in the morning with the fixes I'd sent. Anyway, I think your review comments are useful and I've incorporated most of them. Obviously, certain things like creating a completely new executor machinery are not practical given where we are in the release cycle, and I am not sure that has any significant advantages over what we have today. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
ised that I couldn't find a ready way to get a string representation of the relation kind. Am I missing something? Of course, we can write one for the purpose, but I wanted to ensure we are not duplicating something already available.

> +        /*
> +         * Do basic expression transformation (same as a ROW()
> +         * expr, but allow SetToDefault at top level)
> +         */
> +        exprList = transformExpressionList(pstate,
> +                                           (List *) linitial(valuesLists),
> +                                           EXPR_KIND_VALUES_SINGLE,
> +                                           true);
> +
> +        /* Prepare row for assignment to target table */
> +        exprList = transformInsertRow(pstate, exprList,
> +                                      istmt->cols,
> +                                      icolumns, attrnos,
> +                                      false);
> +    }
>
> Can't we handle this with a little bit less code duplication?

Hmm, yeah, that would be good. But given Tom's suggestions on the other thread, I would like to postpone any refactoring here.

> typedef struct HeapUpdateFailureData
> {
> +   HTSU_Result result;
>     ItemPointerData ctid;
>     TransactionId xmax;
>     CommandId cmax;
> +   LockTupleMode lockmode;
> } HeapUpdateFailureData;
>
> These new fields seem not really related to HUFD, but rather just fields
> the merge code should maintain?

As I said, we can possibly track those separately. But they all arise as a result of an update/delete/lock failure, so I would prefer to keep them along with other fields such as ctid and xmax/cmax. But if you or others insist, I can move them and pass them in/out of the various function calls where we need to maintain them. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services 0001-Take-care-of-Andres-s-comments.patch Description: Binary data
Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund wrote: > > > I've attached a noticeably editorialized patch: > > + /* +* As long as we don't support an UPDATE of INSERT ON CONFLICT for +* a partitioned table we shouldn't reach to a case where tuple to +* be lock is moved to another partition due to concurrent update +* of the partition key. +*/ + Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid)); + This is no longer true; at least not entirely. We still don't support ON CONFLICT DO UPDATE to move a row to a different partition, but otherwise it works now. See 555ee77a9668e3f1b03307055b5027e13bf1a715. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
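A minimal sketch (all names illustrative) of the row movement that makes the quoted assertion too strong: since PG 11 a plain UPDATE of the partition key can relocate a tuple to another partition, so a concurrent operation can encounter a ctid that indicates a moved row; only the ON CONFLICT DO UPDATE path still cannot move rows.

CREATE TABLE pt (k int, v text) PARTITION BY LIST (k);
CREATE TABLE pt_1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt_2 PARTITION OF pt FOR VALUES IN (2);

INSERT INTO pt VALUES (1, 'starts in pt_1');

-- The update below deletes the row from pt_1 and inserts it into pt_2 under
-- the hood, which is the case the moved-partitions ctid marker describes.
UPDATE pt SET k = 2 WHERE k = 1;
SELECT tableoid::regclass, * FROM pt;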
Re: Add support for printing/reading MergeAction nodes
On Wed, Apr 4, 2018 at 11:21 PM, Tom Lane wrote: > Simon Riggs writes: > > On 4 April 2018 at 17:19, Tom Lane wrote: > >> If the MERGE patch has broken this, I'm going to push back on that > >> and push back on it hard, because it probably means there are a > >> whole bunch of other raw-grammar-output-only node types that can > >> now get past the parser stage as well, as children of these nodes. > >> The answer to that is not to add a ton of new infrastructure, it's > >> "you did MERGE wrong". > > > MERGE contains multiple actions of Insert, Update and Delete and these > > could be output in various debug modes. I'm not clear what meaning we > > might attach to them if we looked since that differs from normal > > INSERTs, UPDATEs, DELETEs, but lets see. > > What I'm complaining about is that that's a very poorly designed parsetree > representation. It may not be the worst one I've ever seen, but it's > got claims in that direction. You're repurposing InsertStmt et al to > do something that's *not* an INSERT statement, but by chance happens > to share some (not all) of the same fields. This is bad because it > invites confusion, and then bugs of commission or omission (eg, assuming > that some particular processing has happened or not happened to subtrees > of that parse node). The most egregious way in which it's a bad idea > is that, as I said, InsertStmt doesn't even exist in post-parse-analysis > trees so far as the normal type of INSERT is concerned. This just opens > a whole batch of ways to screw up. We have some types of raw parse nodes > that are replaced entirely during parse analysis, and we have some others > where it's convenient to use the same node type before and after parse > analysis, but we don't have any that are one way in one context and the > other way in other contexts. And this is not the place to start. > > I think you'd have been better off to fold all of those fields into > MergeAction, or perhaps make several variants of MergeAction. > > Hi Tom, Thanks for the review comments. The attached patch refactors the grammar/parser side per your comments. We no longer use InsertStmt/UpdateStmt/DeleteStmt/SelectStmt as part of MergeAction. Instead we only collect the necessary information for running the INSERT/UPDATE/DELETE actions. Speaking of MergeAction itself, I decided to use a new parser-only node named MergeWhenClause and removed unnecessary members from the MergeAction node, which is what now gets to the planner/executor. Regarding the original report by Marina, I suspect she may have turned debug_print_parse=on while running the regression tests. I could reproduce the failures in the isolation tests by doing the same. The attached patch, however, passes all tests with the following additional GUCs, so I am more confident that we now have the outfuncs.c support right:

debug_print_parse=on
debug_print_rewritten=on
debug_print_plan=on

Also, I am now running tests with -DCOPY_PARSE_PLAN_TREES -DRAW_EXPRESSION_COVERAGE_TEST, since the buildfarm had earlier uncovered some issues with those flags. No problems there either. This now also enforces a single VALUES clause in the grammar itself instead of doing that check at parse-analysis time. So that's a net improvement too. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services 0001-Separate-raw-parse-representation-from-rest.patch Description: Binary data
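For anyone who wants to reproduce Marina's setup or inspect the node output for themselves, the same GUCs can be flipped per session; the parse, rewritten and plan trees are then dumped to the server log for every statement, which is where a missing outfuncs.c entry shows up (the sample MERGE assumes the target/source tables used elsewhere in this thread):

SET debug_print_parse = on;
SET debug_print_rewritten = on;
SET debug_print_plan = on;

-- With the patch applied, the MergeWhenClause/MergeAction nodes in these
-- dumps should print (and, with -DCOPY_PARSE_PLAN_TREES, copy) cleanly.
MERGE INTO target t USING source s ON t.a = s.a
  WHEN MATCHED THEN UPDATE SET b = s.b
  WHEN NOT MATCHED THEN INSERT VALUES (s.a, -1);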
Re: [HACKERS] MERGE SQL Statement for PG11
On Thu, Apr 5, 2018 at 5:08 PM, Jesper Pedersen wrote: > Hi Simon and Pavan, > > On 04/04/2018 08:46 AM, Jesper Pedersen wrote: > >> On 03/30/2018 07:10 AM, Simon Riggs wrote: >> >>> No problems found, but moving proposed commit to 2 April pm >>> >>> >> There is a warning for this, as attached. >> >> > Updated version due to latest refactoring. > Hi Jesper, The variable would become unused in non-assert builds. I see that. But simply removing it is not a solution and I don't think the code will compile that way. We should either rewrite that assertion, put it inside an #ifdef USE_ASSERT_CHECKING block, or simply remove the assertion altogether, because we already check for relkind in parse_merge.c. Will check. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Apr 6, 2018 at 1:30 AM, Andres Freund wrote: > > My impression is that this simply shouldn't be going through > nodeModifyTuple, but be its own nodeMerge node. The trigger handling > would need to be abstracted into execTrigger.c or such to avoid > duplication. We're now going from nodeModifyTable.c:ExecModifyTable() > into execMerge.c:ExecMerge(), back to > nodeModifyTable.c:ExecUpdate/Insert(). To avoid ExecInsert() doing > things that aren't appropriate for merge, we then pass an actionState, > which neuters part of ExecUpdate/Insert(). This is just bad. > But wouldn't this lead to a lot of code duplication? For example, ExecInsert/ExecUpdate do a bunch of supporting work (firing triggers and inserting into indexes, just to name a few) that MERGE's INSERT/UPDATE needs as well. Now we could possibly move these support routines to a new file, say execModify.c, and then let both the Merge and ModifyTable nodes make use of that. But the fact is that ExecInsert/ExecUpdate know a lot about ModifyTable already, so separating ExecInsert/ExecUpdate from ModifyTable will require significant refactoring AFAICS. I am not saying we can't do that, but it will have its own consequences. If we had not refactored to move ExecMerge and friends to a new file, it might not have looked so odd (as you describe above). But I think moving the code to a new file was a net improvement. Maybe we can move ExecInsert/Update etc. to a new file as I suggested, but still use ModifyTable to run MERGE. There are many things common between them: ModifyTable executes all DMLs, and MERGE is just another DML which can run all three. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Apr 6, 2018 at 1:30 AM, Andres Freund wrote: > Hi, > > On 2018-04-05 11:31:48 +0530, Pavan Deolasee wrote: > > > +/*- > > > > > > + * > > > + * nodeMerge.c > > > + * routines to handle Merge nodes relating to the MERGE command > > > > > > Isn't this file misnamed and it should be execMerge.c? The rest of the > > > node*.c files are for nodes that are invoked via execProcnode / > > > ExecProcNode(). This isn't an actual executor node. > > > > > > > Makes sense. Done. (Now that the patch is committed, I don't know if > there > > would be a rethink about changing file names. May be not, just raising > that > > concern) > > It absolutely definitely needed to be renamed. But that's been done, so > ... > > > > > What's the reasoning behind making this be an anomaluous type of > > > executor node? > > > Didn't quite get that. I think naming of the file was bad (fixed now), > but > > I think it's a good idea to move the new code in a new file, from > > maintainability as well as coverage perspective. If you've something very > > different in mind, can you explain in more details? > > Well, it was kinda modeled as an executor node, including the file > name. That's somewhat fixed now. I still am extremely suspicious of > the codeflow here. > > My impression is that this simply shouldn't be going through > nodeModifyTuple, but be it's own nodeMerge node. The trigger handling > would need to be abstraced into execTrigger.c There is nothing called execTrigger.c. You probably mean creating something new or trigger.c. That being said, I don't think the current code is bad. There is already a switch to handle different command types. We add one more for CMD_MERGE and then fire individual BEFORE/AFTER statement triggers based on what actions MERGE may take. The ROW trigger handling doesn't require any change. > or such to avoid > duplication. We're now going from nodeModifyTable.c:ExecModifyTable() > into execMerge.c:ExecMerge(), back to > nodeModifyTable.c:ExecUpdate/Insert(). To avoid ExecInsert() doing > things that aren't appropriate for merge, we then pass an actionState, > which neuters part of ExecUpdate/Insert(). This is just bad. > I have spent most part of today trying to hack around the executor with the goal to do what you're suggesting. Having done so, I must admit I am actually quite puzzled why you think having a new node is going to be any significant improvement. I first tried to split nodeModifyTable.c into two parts. One that deals with just ModifyTable node (the exported Init/Exec/End functions to start with) and the other which abstracts out the actual Insert/Update/Delete operations. The idea was that the second part shouldn't know anything about ModifyTable and it being just one of the users. I started increasingly disliking the result as I continued hacking. The knowledge of ModifyTable node is quite wide-spread, including execPartition.c and FDW. The ExecInsert/Update, partitioning, ON CONFLICT handling all make use of ModifyTable extensively. While MERGE does not need ON CONFLICT, it needs everything else, even FDW at some point in future. Do you agree that this is a bad choice or am I getting it completely wrong or do you really expect MERGE to completely refactor nodeModifyTable.c, including the partitioning related code? I gave up on this approach. I then started working on other possibility where we keep a ModifyTable node inside a MergeTable (that's the name I am using, for now). The paths/plans/executor state gets built that way. 
So all common members still belong to the ModifyTable (and friends) and we only add MERGE-specific information to MergeTable (and friends). This, at least, looks somewhat better. We can have:
- ExecInitMergeTable(): does some basic initialisation of the embedded ModifyTable node so that it can be passed around
- ExecMergeTable(): executes the (only) subplan, fetches tuples, fires BR triggers, does the work for handling transition tables (so mostly duplication of what ExecModifyTable does already) and then executes the MERGE actions. It will mostly use the ModifyTable node created during initialisation when calling ExecUpdate/Insert etc.
- ExecEndMergeTable(): ends the executor
This is probably far better than the first approach. But is it really a huge improvement over the committed code? Or even an improvement at all? If that's not what you have in mind, can you please explain in more detail how you see the final design and how exactly it's better than what we have today? Once there is clarity, I can work on it in a fairly quick manner. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
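To make the shape of that second approach concrete, here is a hedged sketch; MergeTable/MergeTableState, the member layout and the signatures are hypothetical names from this discussion, not actual PostgreSQL structures.

/* Hypothetical executor-state node for the MergeTable idea. */
typedef struct MergeTableState
{
	PlanState	ps;					/* standard executor-node header */
	ModifyTableState *mtstate;		/* embedded ModifyTable state, reused
									 * when calling ExecInsert/ExecUpdate/
									 * ExecDelete for the chosen action */
	List	   *matchedActions;		/* WHEN MATCHED action states */
	List	   *notMatchedActions;	/* WHEN NOT MATCHED action states */
} MergeTableState;

/* The three entry points described above (signatures are illustrative). */
extern MergeTableState *ExecInitMergeTable(MergeTable *node,
										   EState *estate, int eflags);
extern TupleTableSlot *ExecMergeTable(MergeTableState *node);
extern void ExecEndMergeTable(MergeTableState *node);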
Re: pgsql: New files for MERGE
On Fri, Apr 6, 2018 at 1:51 PM, Simon Riggs wrote: > Given that the executor > manifestly works and has been re-engineered according to PeterG's > requests and that many performance concerns have already been > addressed prior to commit, Pavan and I were happy with it. My proposal > to commit the patch was given 5 days ahead of time and no comments > were received by anyone, not even PeterG. There was no rush and I > personally performed extensive reviews before final commit. I think it's quite unfair to say that Simon rushed into this. He said this on 29th March: On Thu, Mar 29, 2018 at 3:20 PM, Simon Riggs wrote: > On 28 March 2018 at 12:00, Pavan Deolasee > wrote: > > > v27 attached, though review changes are in > > the add-on 0005 patch. > > This all looks good now, thanks for making all of those changes. > > I propose [v27 patch1+patch3+patch5] as the initial commit candidate > for MERGE, with other patches following later before end CF. > > I propose to commit this tomorrow, 30 March, about 26 hours from now. > That will allow some time for buildfarm fixing/reversion before the > Easter weekend, then other patches to follow starting 2 April. That > then gives reasonable time to follow up on other issues that we will > no doubt discover fairly soon after commit, such as additional runs by > SQLsmith and more eyeballs. And he finally committed the patch on 2nd April late in the night. In between, there were zero objections and no comments at all. I don't know why this is considered as rushed. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Bugs in TOAST handling, OID assignment and redo recovery
tart. I don't think this is an unusual case and I'm surprised that we didn't notice this before. Most likely the other protections in the system that guard against wrapped-around OIDs masked the bug. The TOAST table, though, escapes those protections in the specific case we discussed above and brought the bug out into the open. I think the right fix is to simply ignore the nextOid counter while replaying an ONLINE checkpoint record. We must have already initialised ShmemVariableCache->nextOid to the value stored in this (or some previous) checkpoint record when redo recovery was started. As and when we see an XLOG_NEXTOID record, we will maintain the shared memory counter correctly. If we don't see any XLOG_NEXTOID record, the value we started with must be the correct value. I see no problem even when the OID counter wraps around during redo recovery; XLOG_NEXTOID should record that correctly. Curiously, neither REINDEX nor VACUUM would fix it until the OldestXmin counter advances enough so that RECENTLY_DEAD tuples are ignored by the REINDEX or removed by VACUUM. But after OldestXmin progresses, a REINDEX or a VACUUM should fix these transient errors. Also, once manifested, the errors stay until a REINDEX or VACUUM is performed. We tested this at least up to 9.1 and the bugs exist there too. In fact, these bugs have probably existed forever, though I did not check very old releases. Attached is a simple reproducer and a proposed fix to address both bugs. We should consider backpatching it to all supported releases. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services pg_toast_corruption_v3.patch Description: Binary data

export PGDATA=/tmp/toast_corrupt
export PGPORT=5440
export PGDATABASE=postgres

# turn checkpoint logging on
start_options="-c checkpoint_timeout=30s -c checkpoint_completion_target=0.9 -c log_checkpoints=on -c max_prepared_transactions=10"

initdb -D $PGDATA
pg_ctl -l logfile -o "$start_options" -w start

psql -c 'create table testtoast (a int, b text)'
psql -c 'create table auxtable (a int, b text)'

# Do clean shutdown and restart to ensure we start with no cached OIDs
pg_ctl -w stop
rm logfile
pg_ctl -l logfile -o "$start_options" -w start

# Generate some non-trivial work for the upcoming checkpoint.
pg_controldata $PGDATA | grep NextOID
psql -c "insert into auxtable select generate_series(1,10), 'foo'"

# wait for the "time" checkpoint to start
echo "waiting for checkpoint start"
while ( true ); do
    tail logfile | grep "checkpoint starting"
    if [ "$?" == "0" ]; then
        echo "checkpoint is starting"
        break
    fi
    sleep 1
done

# now consume some OIDs.. insert them into the toast table. Since we started
# with zero cached OIDs, this should generate a XLOG_NEXTOID WAL record.
#
# This command must finish before the checkpoint finishes
psql -c "insert into testtoast select 1, (select string_agg('', md5(random()::text)) from generate_series(1,100)) from generate_series(1,20);"

# now wait for the checkpoint to finish.
echo "waiting for checkpoint complete"
while ( true ); do
    tail logfile | grep "checkpoint complete"
    if [ "$?" == "0" ]; then
        echo "checkpoint complete"
        break
    fi
    sleep 1
done

# the checkpoint should have the NextOID prior to last block allocation
pg_controldata $PGDATA | grep NextOID

# do some work and then crash
psql -c "insert into auxtable select generate_series(1,100), 'foo'"
pg_ctl -m immediate stop
sleep 2

# restart
pg_ctl -l logfile -o "$start_options" -w start

# the NextOID should have been advanced to include the OIDs that we consumed.
# But that doesn't happen, leading to duplicate OIDs and problems thereafter.
pg_controldata $PGDATA | grep NextOID

# create a prepared transaction to hold back OldestXmin
psql -c "begin; insert into auxtable select generate_series(1,100), 'foo'; prepare transaction 'p1';"

# now delete the old tuples. That will lead to deletion from the TOAST table as
# well. But since we're holding back OldestXmin, these tuples can't be cleaned
# up
psql -c "delete from testtoast"

# this should now insert duplicate OIDs while the previous RECENTLY_DEAD tuples
# are still around
psql -c "insert into testtoast select 1, (select string_agg('', md5(random()::text)) from generate_series(1,100)) from generate_series(1,20);"

### ERROR ###
psql -c "\copy (select t.*::text from testtoast t) to /dev/null"

# prove that REINDEX won't fix the problem unless OldestXmin advances.
oid=`psql -At -c "select oid from pg_class where relname = 'testtoast'"`
psql -c "REINDEX TABLE pg_toast.pg_toast_$oid"
p
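To illustrate the proposed fix described above (ignoring the checkpoint's nextOid during online-checkpoint replay), here is a minimal, hedged sketch; it is not the attached patch, and the rest of the redo logic is elided.

/* Sketch only: the XLOG_CHECKPOINT_ONLINE branch of xlog_redo(). */
if (info == XLOG_CHECKPOINT_ONLINE)
{
	CheckPoint	checkPoint;

	memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));

	/*
	 * Deliberately do NOT copy checkPoint.nextOid into
	 * ShmemVariableCache->nextOid.  The counter was initialised from the
	 * checkpoint redo started from, and any OIDs consumed since then are
	 * covered by XLOG_NEXTOID records, which keep the shared-memory counter
	 * correct on their own.  Overwriting it here could move the counter
	 * backwards and hand out already-used OIDs after recovery.
	 */

	/* ... the rest of the online-checkpoint replay is unchanged ... */
}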
Re: Bugs in TOAST handling, OID assignment and redo recovery
Hi Heikki, On Tue, Apr 10, 2018 at 7:07 PM, Heikki Linnakangas wrote: > >> > It would seem more straightforward to add a snapshot parameter to > GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast, > while others pass SnapshotDirty. With your patch, you check the index > twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with > SnapshotToast, in the caller. > Hmm. I actually wrote my first patch exactly like that. I am trying to remember why I discarded that approach. Was that to do with the fact that SnapshotToast can't see all toast tuples either? Yeah, I think so. For example, it won't see tuples with uncommitted-xmin, leading to different issues. Now it's unlikely that we will have a OID conflict where the old tuple has uncommitted-xmin, but not sure if we can completely rule that out. For example, if we did not uncover the redo recovery bug, we could have had a prepared transaction having inserted the old tuple, which now conflicts with new tuple and not detected by SnapshotToast. > > If I'm reading the rewrite case correctly, it's a bit different and quite > special. In the loop with GetNewOidWithIndex(), it needs to check that the > OID is unused in two tables, the old TOAST table, and the new one. You can > only pass one index to GetNewOidWithIndex(), so it needs to check the > second index manually. It's not because of the snapshot issue. Although I > wonder if we should be using SnapshotToast in that GetNewOidWithIndex() > call, too. I.e. if we should be checking both the old and the new toast > table with SnapshotToast. As I said, I am not sure if checking with just SnapshotToast is enough because it can't see "dirty" tuples. > > Agreed. With nextXid, we advance ShmemVariableCache->nextXid if the value > in the online checkpoint record is greater than > ShmemVariableCache->nextXid. But we don't have such a wraparound-aware > concept of "greater than" for OIDs. Ignoring the online checkpoint record's > nextOid value seem fine to me. > Ok. Thanks for checking. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Faster inserts with mostly-monotonically increasing values
On Tue, Apr 10, 2018 at 7:49 PM, Claudio Freire wrote: > On Tue, Apr 10, 2018 at 11:10 AM, Heikki Linnakangas > wrote: > > > > > Why is this RelationSetTargetBlock() call inside the "XLOG stuff" block? > > ISTM that we're failing to take advantage of this optimization for > unlogged > > tables, for no particular reason. Just an oversight? > > > > - Heikki > > Indeed. > > Maybe Pavan knows of one, but I don't see any reason not to apply this > to unlogged tables as well. It slipped the review. > Yes, looks like an oversight :-( I will fix it along with the other changes that Peter requested. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
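For illustration (not the actual nbtree code or the eventual fix), the change being discussed amounts to moving the cache update out of the WAL-only branch; rel and buf here stand in for the relation and buffer in the insert path.

/* WAL-logging happens only for relations that need it ... */
if (RelationNeedsWAL(rel))
{
	/* ... build and insert the xlog record for this page modification ... */
}

/*
 * ... but the target-block cache should be updated regardless, so that
 * inserts into unlogged and temporary relations can use the fast path too.
 */
RelationSetTargetBlock(rel, BufferGetBlockNumber(buf));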
Re: Bugs in TOAST handling, OID assignment and redo recovery
On Tue, Apr 10, 2018 at 7:24 PM, Pavan Deolasee wrote: > Hi Heikki, > > On Tue, Apr 10, 2018 at 7:07 PM, Heikki Linnakangas > wrote: > >> >>> >> It would seem more straightforward to add a snapshot parameter to >> GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast, >> while others pass SnapshotDirty. With your patch, you check the index >> twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with >> SnapshotToast, in the caller. >> > > Hmm. I actually wrote my first patch exactly like that. I am trying to > remember why I discarded that approach. Was that to do with the fact that > SnapshotToast > can't see all toast tuples either? Yeah, I think so. For example, it won't > see tuples with uncommitted-xmin, leading to different issues. Now it's > unlikely that we will have a OID conflict where the old tuple has > uncommitted-xmin, but not sure if we can completely rule that out. > Or may be we simply err on the side of caution and scan the toast table with SnapshotAny while looking for a duplicate? That might prevent us from reusing an OID for a known-dead tuple, but should save us a second index scan and still work. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Bugs in TOAST handling, OID assignment and redo recovery
On Wed, Apr 11, 2018 at 8:20 PM, Tom Lane wrote: > Pavan Deolasee writes: > > Or may be we simply err on the side of caution and scan the toast table > > with SnapshotAny while looking for a duplicate? That might prevent us > from > > reusing an OID for a known-dead tuple, but should save us a second index > > scan and still work. > > +1. We really don't want to expend two indexscans on this. > > Ok. I propose the attached patches, more polished this time. I also changed toastrel_valueid_exists() to use SnapshotAny, but I don't have any proof to claim that's a needed change. But it seems harmless, and given the number of TOAST-related, hard-to-chase bugs we have seen over the years, maybe it's worth making it foolproof too. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services 0002-Do-more-extensive-search-while-looking-for-duplicate.patch Description: Binary data 0001-Do-not-overwrite-the-nextOid-counter-while-replaying.patch Description: Binary data
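For illustration, a minimal sketch of what searching for an OID collision with SnapshotAny looks like; it mirrors the shape of the existing GetNewOidWithIndex() loop but is not the attached patch, and the relation/indexId/oidcolumn variables are assumptions from the caller's context.

Oid			newOid;
bool		collides;

do
{
	SysScanDesc scan;
	ScanKeyData key;

	newOid = GetNewObjectId();

	ScanKeyInit(&key,
				oidcolumn,		/* e.g. ObjectIdAttributeNumber */
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(newOid));

	/*
	 * SnapshotAny sees every tuple, including dead-but-not-yet-removed
	 * ones, so we never hand out an OID that still physically exists in
	 * the table -- at the cost of occasionally skipping a reusable OID.
	 */
	scan = systable_beginscan(relation, indexId, true,
							  SnapshotAny, 1, &key);
	collides = HeapTupleIsValid(systable_getnext(scan));
	systable_endscan(scan);
} while (collides);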
Re: Bugs in TOAST handling, OID assignment and redo recovery
On Thu, Apr 12, 2018 at 1:58 AM, Tom Lane wrote: > So while looking at this, it suddenly occurred to me that probing with > SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment > either. > Yeah it occurred to me as well, but when I looked at the code, I couldn't find a case that is broken. I even tried a few test cases with DDLs etc. But I think what you did is fine and more bullet proof. So +1 to that. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Bugs in TOAST handling, OID assignment and redo recovery
On Thu, Apr 12, 2018 at 5:21 AM, Tom Lane wrote: > Michael Paquier writes: > > I have not really checked this thread in details, but one thing that > > strikes me is that it would be rather easy to add a TAP test based on > > the initial script that Pavan has sent. Would that be worth testing > > cycles or not? > > I doubt it --- that test is so specialized that it'd be unlikely to > catch any other bug. > > I agree; it's not worth adding a TAP test, except maybe as a demonstration of how to write future tests for catching such issues. This was a very specialised test case written after getting a full grasp on the bug, and it just tests the thing that I knew was broken based on code reading. Also, with the duplicate-OID issue fixed, hitting more bugs in this area is going to be even more difficult. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Bugs in TOAST handling, OID assignment and redo recovery
On Thu, Apr 12, 2018 at 5:53 AM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 4/10/18 06:29, Pavan Deolasee wrote: > > One of our 2ndQuadrant support customers recently reported a sudden rush > > of TOAST errors post a crash recovery, nearly causing an outage. Most > > errors read like this: > > > > ERROR: unexpected chunk number 0 (expected 1) for toast value > > While researching this, I found that the terminology in this code is > quite inconsistent. It talks about chunks ids, chunk indexes, chunk > numbers, etc. seemingly interchangeably. The above error is actually > about the chunk_seq, not about the chunk_id, as one might think. > > The attached patch is my attempt to clean this up a bit. Thoughts? > While I agree that we should clean it up, I wonder if changing the error text is a good idea. These errors have been reported for a very long time, and if we change the text, we might lose the accumulated knowledge from past reports. Also, "toast value" is the same as "chunk_id". Should we clean up something there too? "chunk_seq number" -- should that be just "chunk_seq"? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: VM map freeze corruption
On Wed, Apr 18, 2018 at 7:37 AM, Wood, Dan wrote: > > > My analysis is that heap_prepare_freeze_tuple->FreezeMultiXactId() > returns FRM_NOOP if the MultiXACT locked rows haven't committed. This > results in changed=false and totally_frozen=true(as initialized). When > this returns to lazy_scan_heap(), no rows are added to the frozen[] array. > Yet, tuple_totally_frozen is true. This means the page is marked frozen in > the VM, even though the MultiXACT row wasn't left untouch. > > A fix to heap_prepare_freeze_tuple() that seems to do the trick is: > else > { > Assert(flags & FRM_NOOP); > + totally_frozen = false; > } > That's a great find! This can definitely lead to various problems and could be one of the reasons behind the issue reported here [1]. For example, if we change the script slightly at the end, we can get the same error reported in the bug report.

sleep 4;

# Wait for share locks to be released
# See if another vacuum freeze advances relminmxid beyond xmax present in the
# heap
echo "vacuum (verbose, freeze) t;" | $p
echo "select pg_check_frozen('t');" | $p

# See if a vacuum freeze scanning all pages corrects the problem
echo "vacuum (verbose, freeze, disable_page_skipping) t;" | $p
echo "select pg_check_frozen('t');" | $p

Thanks, Pavan [1] https://www.postgresql.org/message-id/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K%2BA%40mail.gmail.com -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
On Tue, Apr 17, 2018 at 11:05 PM, Fujii Masao wrote: > Hi, > > I'd like to propose to add $SUBJECT for performance improvement. > > When VACUUM tries to truncate the trailing empty pages, it scans > shared_buffers > to invalidate the pages-to-truncate during holding an AccessExclusive lock > on > the relation. So if shared_buffers is huge, other transactions need to > wait for > a very long time before accessing to the relation. Which would cause the > response-time spikes, for example, I observed such spikes several times on > the server with shared_buffers = 300GB while running the benchmark. > Therefore, I'm thinking to propose $SUBJECT and enable it to avoid such > spikes > for that relation. > Alvaro reminded me that we already have a mechanism in place which forces VACUUM to give up the exclusive lock if another backend is waiting on the lock for more than a certain pre-defined duration. AFAICS we give up the lock, but then retry truncation from the previously left-off position. What if we make that lock-wait duration configurable on a per-table basis? And maybe add a special value to never truncate (though that seems quite excessive to me and a possible footgun). I was actually thinking in the other direction. Between the time VACUUM figures out it can possibly truncate the last K pages and the actual truncation, some backend may insert a tuple into one of those pages and make the truncation impossible. What if we truncate the FSM before starting the backward scan, so that new inserts go into the pages prior to the truncation point, if possible? That would increase the chances of VACUUM being able to truncate all the empty pages, though I think in some cases it might lead to unnecessary further extension of the relation. Maybe we could use some heuristic based on the available free space in the table prior to the truncation point? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
On Wed, Apr 18, 2018 at 10:50 PM, Fujii Masao wrote: > > > I'm not sure if it's safe to cancel forcibly VACUUM's truncation during > scaning shared_buffers. That scan happens after WAL-logging and before > the actual truncation. > > Ah ok. I misread your proposal. This is about the shared_buffers scan in DropRelFileNodeBuffers() and we can't cancel that operation. What if we remember the buffers as seen by count_nondeletable_pages() and then just discard those specific buffers instead of scanning the entire shared_buffers again? Surely we revisit all to-be-truncated blocks before actual truncation. So we already know which buffers to discard. And we're holding exclusive lock at that point, so nothing can change underneath. Of course, we can't really remember a large number of buffers, so we can do this in small chunks. Scan last K blocks, remember those K buffers, discard those K buffers, truncate the relation and then try for next K blocks. If another backend requests lock on the table, we give up or retry after a while. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: VM map freeze corruption
On Wed, Apr 18, 2018 at 7:06 PM, Alvaro Herrera wrote: > > > IMO the cause is the totally_frozen variable, which starts life in a > bogus state (true) and the different code paths can set it to the right > state, or by inaction end up deciding that the initial bogus state was > correct in the first place. Errors of omission are far too easy in that > kind of model, ISTM, so I propose this slightly different patch, which > albeit yet untested seems easier to verify and easier to get right. > I wonder if we should just avoid initialising those variables at the top and take the compiler's help to detect any missed branches. That way, we will know exactly why and where we are making them true/false. I also think that we should differentiate between scenarios where xmin/xmax is already frozen and scenarios where we are freezing them now. I came up with the attached. It's not fully polished (and there is an XXX that I left for comments), but hopefully it's enough to show what I am thinking. Do you think it's an improvement, or does it actually make the problem worse? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services frozen_v2.patch Description: Binary data
Re: Toast issues with OldestXmin going backwards
On Sat, 21 Apr 2018 at 6:46 PM, Andrew Gierth wrote: > > > So I dug into this one and it looks to me like the best approach. Here's > a draft patch against 10-stable, if nobody can think of any showstoppers > then I'll do the rest of the versions and commit them. > How does this guard against the case where the same OID gets inserted into the toast table again, with matching lengths etc.? Rare, but it seems possible, no? I think we should look for a more complete solution, given how hard these bugs are to detect and fix. Thanks, Pavan > -- > Andrew (irc:RhodiumToad) > > -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
On Mon, Mar 11, 2019 at 1:37 PM Masahiko Sawada wrote: > > > I might be missing something but why do we need to recheck whether > each pages is all-frozen after insertion? I wonder if we can set > all-frozen without checking all tuples again in this case. > It's possible that the user may have inserted unfrozen rows (via regular INSERT or COPY without FREEZE option) before inserting frozen rows. So we can't set the all-visible/all-frozen flag unconditionally. I also find it safer to do an explicit check to ensure we never accidentally mark a page as all-frozen. Since the check is performed immediately after the page becomes full and only once per page, there shouldn't be any additional IO cost and the check should be quite fast. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
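A minimal sketch of the kind of per-page check described above (illustrative only; the helper name is hypothetical and this is not the patch's actual code):

/*
 * Return true only if every live tuple on the page has a frozen xmin.
 * A page containing even one unfrozen row (e.g. from a plain INSERT
 * earlier in the same transaction) must not be marked all-visible or
 * all-frozen.
 */
static bool
page_is_all_frozen(Page page)
{
	OffsetNumber offnum,
				maxoff = PageGetMaxOffsetNumber(page);

	for (offnum = FirstOffsetNumber;
		 offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		HeapTupleHeader htup;

		if (!ItemIdIsNormal(itemid))
			continue;

		htup = (HeapTupleHeader) PageGetItem(page, itemid);
		if (!HeapTupleHeaderXminFrozen(htup))
			return false;
	}

	return true;
}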
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
On Wed, Mar 13, 2019 at 11:37 AM Masahiko Sawada wrote: > > > I think that since COPY FREEZE can be executed only when the table is > created or truncated within the transaction other users cannot insert > any rows during COPY FREEZE. > > Right. But the truncating transaction can insert unfrozen rows into the table before inserting more rows via COPY FREEZE.

postgres=# CREATE EXTENSION pageinspect ;
CREATE EXTENSION
postgres=# BEGIN;
BEGIN
postgres=# TRUNCATE testtab ;
TRUNCATE TABLE
postgres=# INSERT INTO testtab VALUES (100, 200);
INSERT 0 1
postgres=# COPY testtab FROM STDIN WITH (FREEZE);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1 2
>> 2 3
>> \.
COPY 2
postgres=# COMMIT;
postgres=# SELECT lp, to_hex(t_infomask) FROM heap_page_items(get_raw_page('testtab', 0));
 lp | to_hex
----+--------
  1 | 800
  2 | b00
  3 | b00
(3 rows)

The first row is inserted by a regular INSERT and it's not frozen. The next two are frozen. We can't mark such a page all-visible, all-frozen. > > I'd suggest to measure performance overhead. I can imagine one use > case of COPY FREEZE is the loading a very large table. Since in > addition to set visibility map bits this patch could scan a very large > table I'm concerned that how much performance is degraded by this > patch. > Ok. I will run some tests. But please note that this patch is a bug fix to address the performance issue that is caused by having to rewrite the entire table when the all-visible bit is set on the page during the first vacuum. So while we may do some more work during COPY FREEZE, we're saving a lot of page writes during the next vacuum. Also, since the scan that we are doing in this patch is done on a page that should be in the buffer cache, we will pay a bit in terms of CPU cost, but not anything in terms of IO cost. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: A separate table level option to control compression
Hi, On Thu, Mar 21, 2019 at 3:10 AM Shaun Thomas wrote: > > I can't really speak for the discussion related to `storage.sgml`, but > I based my investigation on the existing patch to `create_table.sgml`. > About the only thing I would suggest there is to possibly tweak the > wording. > > * "The compress_tuple_target ... " for example should probably read > "The toast_tuple_target parameter ...". > * "the (blocksize - header)" can drop "the". > * "If the value is set to a value" redundant wording should be rephrased; > "If the specified value is greater than toast_tuple_target, then > we will substitute the current setting of toast_tuple_target instead." > would work. > Thanks Shaun. Attached patch makes these adjustments. > * I'd recommend a short discussion on what negative consequences can be > expected by playing with this value. As an example in my tests, setting > it very high may result in extremely sparse pages that could have an > adverse impact on HOT updates. > Setting compress_tuple_target to a higher value won't be negative because the toast_tuple_target is used for compression anyways when compress_tuple_target is higher than toast_tuple_target. May be some discussion in the paragraph related to toast_tuple_target can be added to explain the negative impact of the high value. I added a small discussion about negative effects of setting compress_tuple_target lower though, per your suggestion. Also added some details in storage.sgml as recommended by Sawada-san. Hope this helps. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services 0001-Add-a-table-level-option-to-control-compression.patch Description: Binary data
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
On Thu, Mar 14, 2019 at 3:54 PM Masahiko Sawada wrote: > > > > > > > Ok. I will run some tests. But please note that this patch is a bug fix > to address the performance issue that is caused by having to rewrite the > entire table when all-visible bit is set on the page during first vacuum. > So while we may do some more work during COPY FREEZE, we're saving a lot of > page writes during next vacuum. Also, since the scan that we are doing in > this patch is done on a page that should be in the buffer cache, we will > pay a bit in terms of CPU cost, but not anything in terms of IO cost. > > Agreed. I had been misunderstanding this patch. The page scan during > COPY FREEZE is necessary and it's very cheaper than doing in the first > vacuum. > Thanks for agreeing to the need of this bug fix. I ran some simple tests anyways and here are the results. The test consists of a simple table with three columns, two integers and one char(100). I then ran COPY (FREEZE), loading 7M rows, followed by a VACUUM. The total size of the raw data is about 800MB and the table size in Postgres is just under 1GB. The results for 3 runs in milliseconds are:

Master:
  COPY FREEZE: 40243.725  40309.675  40783.836
  VACUUM:       2685.871   2517.445   2508.452

Patched:
  COPY FREEZE: 40942.410  40495.303  40638.075
  VACUUM:         25.067     35.793     25.390

So there is a slight increase in the time to run COPY FREEZE, but a significant reduction in time to VACUUM the table. The benefits will only go up if the table is vacuumed much later when most of the pages are already written to the disk and removed from shared buffers and/or kernel cache. I hope this satisfies your doubts regarding performance implications of the patch. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
On Fri, Mar 22, 2019 at 12:19 PM Masahiko Sawada wrote: > I've looked at the patch and have comments and questions. > > +/* > + * While we are holding the lock on the page, check if all tuples > + * in the page are marked frozen at insertion. We can safely mark > + * such page all-visible and set visibility map bits too. > + */ > +if (CheckPageIsAllFrozen(relation, buffer)) > +PageSetAllVisible(page); > + > +MarkBufferDirty(buffer); > > Maybe we don't need to mark the buffer dirty if the page is not set > all-visible. > > Yeah, makes sense. Fixed. > If we have CheckAndSetPageAllVisible() for only this > situation we would rather need to check that the VM status of the page > should be 0 and then set two flags to the page? The 'flags' will > always be (VISIBILITYMAP_ALL_FROZEN | VISIBILITYMAP_ALL_VISIBLE) in > copy freeze case. I'm confused that this function has both code that > assumes some special situations and code that can be used in generic > situations. > If a second COPY FREEZE is run within the same transaction and it starts inserting into the page used by the previous COPY FREEZE, then the page will already be marked all-visible/all-frozen, so we can skip repeating the operation. While it's quite unlikely that someone will do that, and I can't think of a situation where only one of those flags will be set, I don't see any harm in keeping the code as is. This code is borrowed from vacuumlazy.c and at some point we can even move it to some common location. > Perhaps we can add some tests for this feature to pg_visibility module. > > That's a good idea. Please see if the tests included in the attached patch are enough. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services copy_freeze_v4.patch Description: Binary data
Re: Faster inserts with mostly-monotonically increasing values
On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan wrote: > On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee > wrote: > > Here is a patch that implements the idea. If the last insert happens to > be > > in the rightmost block of an index, then we cache the block and check > that > > first for the next insert. For the cached block to be usable for the > insert, > > it should still be the rightmost, the to-be-inserted key should go into > the > > cached block and there is enough free space in the block. If any of these > > conditions do not meet, we fall back to the regular code path, i.e. > descent > > down the index from the top. > > I'd be particularly concerned about page deletion/page recycling still > being correct with the patch, especially because it's hard to test > anything there. The P_ISLEAF() part of your fastpath's test hints at > this -- why should it not *always* be a leaf page (surely you should > be sure that the page cannot be recycled anyway)? I thought we can throw in a bunch of checks there to ensure that we are still looking at the correct rightmost page or else fallback to the regular path. Do you see something missing there? I don't claim that I've got everything correct, but since we're holding an exclusive lock on the page at that point, wouldn't we be able to guard against any concurrent splits/deletions? I will reread the documentation/code again, but I must admit, it's not particularly easy to understand all the concurrency issues. > I also have my > doubts about unique index enforcement remaining correct with the patch > when there are many physical duplicates, to the extent that more than > a single leaf page is needed for a single value. > Why would that happen? We are checking if the to-be-inserted key is strictly greater than the last key in the rightmost block and only then proceeding with the fast path. And we do this while holding an exclusive lock on the page. Are you suggesting that someone can concurrently still split the page? > > Maybe you should do something with page LSN here -- cache LSN > alongside block number, and have a non-matching LSN invalidate the > test. > This is a good idea, but we might need to track the block/LSN in shared memory. Otherwise with multiple backends doing inserts, we might never be able to match LSNs (i.e. they will mostly vary, thus falling back to slow path for majority of the time) > > How many clients are involved with your benchmark? > Just a single client for these benchmarks. > > > So as the size of the index increases, the benefits of the patch also > tend > > to increase. This is most likely because as the index becomes taller and > > taller, the costs associated with index descent becomes higher. > > FWIW, I think that int4 indexes have to be extremely large before they > exceed 4 levels (where the leaf level counts as a level). My > conservative back-of-an-envelope estimate is 9 billion index tuples. > > Hmm Ok. I am not entirely sure whether it's the depth or just purely binary searching through 3-4 index pages and/or pinning/unpinning buffers result in much overhead. I will run some more tests and collect evidences. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
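For what it's worth, a hedged sketch of the block+LSN caching idea discussed above; the cached_block/cached_lsn variables and the overall shape are assumptions for illustration, not the actual patch.

/* Illustrative only: validate a cached rightmost block via its LSN. */
BlockNumber cached_block = RelationGetTargetBlock(rel);

if (cached_block != InvalidBlockNumber)
{
	Buffer		buf = ReadBuffer(rel, cached_block);
	Page		page;

	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
	page = BufferGetPage(buf);

	if (PageGetLSN(page) == cached_lsn)
	{
		/*
		 * Nobody has touched the page since our previous insert, so the
		 * earlier checks (rightmost leaf, key ordering, free space) are
		 * still valid and we can reuse the fast path.
		 */
	}
	else
	{
		/* Page changed underneath us: fall back to a normal root descent. */
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
		ReleaseBuffer(buf);
	}
}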
Re: Faster inserts with mostly-monotonically increasing values
On Tue, Jan 2, 2018 at 7:15 PM, Tels wrote: > Moin, > > > >> > >> > > Hmm Ok. I am not entirely sure whether it's the depth or just purely > > binary > > searching through 3-4 index pages and/or pinning/unpinning buffers result > > in much overhead. I will run some more tests and collect evidences. > > Just a question trying to understand how btree indexes work: > > If one inserts ever-increasing value, is the tree a balanced tree with a > minimum (or at least not-as-high) number of levels, or does it increase in > height every insert and creates a "tall stack"? > AFAIK all pages will be half-filled in this case. Imagine one index page can accommodate N entries; the page will be split when the (N+1)th entry is added. The left page will keep half the entries and the right page will get the remaining half. The next split will happen when the right page is full, i.e. when another N/2 entries are added. Thus there will be a split every N/2 inserts, creating an index of half-filled pages. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Faster inserts with mostly-monotonically increasing values
On Thu, Jan 4, 2018 at 6:05 PM, Alvaro Herrera wrote: > Pavan Deolasee wrote: > > On Tue, Jan 2, 2018 at 7:15 PM, Tels > wrote: > > > > > Just a question trying to understand how btree indexes work: > > > > > > If one inserts ever-increasing value, is the tree a balanced tree with > a > > > minimum (or at least not-as-high) number of levels, or does it > increase in > > > height every insert and creates a "tall stack"? > > > > AFAIK all pages will be half-filled in this case. Imagine if one index > page > > can accommodate N entries, the page will be split when (N+1)th entry is > > added. The left page will have half the entries and the right page will > get > > the remaining. The next split will happen when the right page is full > i.e. > > when another N/2 entries are added. Thus there will be a split at every > N/2 > > inserts, creating an index with half filled pages. > > Not really -- quoth _bt_findsplitloc(): > > * If the page is the rightmost page on its level, we instead try to > arrange > * to leave the left split page fillfactor% full. In this way, when we are > * inserting successively increasing keys (consider sequences, timestamps, > * etc) we will end up with a tree whose pages are about fillfactor% full, > * instead of the 50% full result that we'd get without this special case. > > Ah ok. Thanks for pointing that out. Makes a lot more sense. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
On Thu, Feb 1, 2018 at 11:05 AM, Michael Paquier wrote: > On Sat, Jan 13, 2018 at 03:40:01PM +, Simon Riggs wrote: > > The new two byte value is protected by CRC. The 2 byte value repeats > > every 32768 WAL files. Any bit error in that value that made it appear > > to be a current value would need to have a rare set of circumstances. > > If you use the two lower bytes of the segment number, then this gets > repeated every 64k segments, no? In terms of space this represents > 500GB worth of WAL segments with a default segment size. Hence the more > PostgreSQL scales, the more there is a risk of collision, and I am > pretty sure that there are already deployments around allocating > hundreds of gigs worth of space for the WAL partition. There are no > problems of this class if using the 8-byte field xl_prev. It seems to > me that we don't want to take any risks here. > IMHO we're missing a point here. When xl_prev is changed to a 2-byte value (as the patch suggests), the risk only arises when an old WAL file is recycled for some future WAL file and the old and the future WAL file's segment numbers share the same low-order 16 bits. Now, as the patch stands, we take the precaution of never reusing a WAL file with conflicting low-order bits. This particular piece of the patch deals with that:

@@ -3496,6 +3495,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 			return false;
 		}
 		(*segno)++;
+
+		/*
+		 * If avoid_conflicting_walid is true, then we must not recycle the
+		 * WAL files so that the old and the recycled WAL segnos have the
+		 * same lower order 16-bits.
+		 */
+		if (avoid_conflicting_walid &&
+			XLogSegNoToSegID(tmpsegno) == XLogSegNoToSegID(*segno))
+			(*segno)++;
+
 		XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
 	}
 }

For example, old WAL file 00010001 will NEVER be reused as 000100010001. So even if there are any intact old WAL records in the recycled file, they will be detected correctly during replay, since the SegID stored in the WAL record will not match the SegID derived from the WAL file's segment number. The replay code does this check for every WAL record it sees. There were some concerns about bit-flipping, which might inadvertently change the SegID stored in the carried-over WAL record so that it now matches the current WAL file's SegID, but TBH I don't see why we can't trust the CRC to detect that. Because if we can't, then there would be other problems during WAL replay. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Tue, Feb 6, 2018 at 9:50 AM, Peter Geoghegan wrote: > On Mon, Feb 5, 2018 at 7:56 PM, Robert Haas wrote: > > I don't think you get to make a unilateral decision to exclude > > features that work everywhere else from the scope of this patch. If > > there is agreement that those features can be left out of scope, then > > that is one thing, but so far all the commentary about the things that > > you've chosen to exclude has been negative. Nor have you really given > > any reason why they should be exempt. You've pointed out that > > parallel query doesn't handle everything (which is certainly true, but > > does not mean that any feature from now and the end of time is allowed > > to exclude from scope whatever seems inconvenient regardless of > > contrary community consensus) and you've pointed out here and > > elsewhere that somebody could go add the features you omitted later > > (which is also true, but misses the general point that we want > > committed patches to be reasonably complete already, not have big gaps > > that someone will have to fix later). > > For me, the concern is not really the omission of support for certain > features as such. The concern is that those omissions hint that there > is a problem with the design itself, particularly in the optimizer. > Allowing subselects in the UPDATE part of a MERGE do not seem like > they could be written as a neat adjunct to what Simon already came up > with. If that was possible, Simon probably already would have done it. > > As someone who's helping Simon with that part of the code, I must say that omission of sub-selects in the UPDATE targetlist and WHEN quals is not because of some known design problems. So while it may be true that we've a design problem, it's also quite likely that we are missing some planner/optimiser trick and once we add those missing pieces, it will start working. Same is the case with RLS. Partitioned table is something I am actively working on. I must say that the very fact that INSERT and UPDATE/DELETE take completely different paths in partitioned/inherited table, makes MERGE quite difficult because it has to carry out both the operations and hence require all the required machinery. If I understand correctly, INSERT ON CONFLICT must have faced similar problems and hence DO UPDATE does not work with partitioned table. I am not sure if that choice was made when INSERT ON CONFLICT was implemented or when partitioned table support was added. But the challenges look similar. I first tried to treat MERGE similar to UPDATE/DELETE case and ensure that the INSERTs go through the root partition. That mostly works, but the RIGHT OUTER join between the child tables and the source relation ends up emitting duplicate rows, if the partitioned table is the resultRelation and when it gets expanded in inheritance_planner(). That's a blocker. So what I am trying now is to push the join between the Append relation and the source relation below the ModifyTable node, so that we get the final join result. We can then look up the tableoid in the row returned from the join, find the corresponding result relation and then carry out MERGE actions. Note that unlike regular ExecModifyTable(), here we must execute just one subplan as that will return all the required tuples. Does anyone see a potential blocker with this approach, except that it may not be the most elegant way? 
I think EvalPlanQual might need some treatment because when the plan is re-executed, it will expect to find the updated tuple in the slot of the underlying query's RTE and not in the resultRelation's RTE, which does not participate in the join at all. Is there anything else I could be missing completely? Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Wed, Feb 7, 2018 at 10:52 PM, Peter Geoghegan wrote: > > > Apparently there is a pending patch to do better there - the commit > message of 2f178441 refers to this. > > Thanks. Will look at it. > > The revised version also supports subqueries in SET targetlist as well as > > WHEN AND conditions. As expected, this needed a minor tweak in > > query/expression mutators. So once I worked on that for partitioned > tables, > > subqueries started working with very minimal adjustments elsewhere. Other > > things such as whole-var references are still broken. A few new tests are > > also added. > > Great! > > Wholerow references are expected to be a bit trickier. See commit > ad227837 for some hints on how you could fix this. > Thanks again. > > > Next I am going to look at RLS. While I've no prior experience with RLS, > I > > am expecting it to be less cumbersome in comparison to partitioning. I am > > out of action till Monday and may not be able to respond in time. But any > > reviews and comments are appreciated as always. > > I don't think that RLS support will be particularly challenging. It > might take a while. > Ok. I would start from writing a test case to check what works and what doesn't with the current patch and work from there. My understanding of RLS is limited right now, but from what I've seen in the code (while hacking other stuff), my guess is it will require us evaluating a set of quals and then deciding on the action. > > If your rapid progress here is any indication, most open items are not > likely to be particularly challenging. Once again, I suggest that a > good area for us to focus on is the semantics of READ COMMITTED > conflict handling. I understand getting EPQ semantics right is very important. Can you please (once again) summarise your thoughts on what you think is the *most* appropriate behaviour? I can then think how much efforts might be involved in that. If the efforts are disproportionately high, we can discuss if settling for some not-so-nice semantics, like we apparently did for partition key updates. I am sorry, I know you and Simon have probably done that a few times already and I should rather study those proposals first. So it's okay if you don't want to repeat those; I will work on them next week once I am back from holidays. Maybe you'd prefer to just blast through the > simpler open items, which is fine, but do bear in mind that the EPQ > and EPQ-adjacent stuff is probably going to be the thing that makes or > breaks this patch for v11. > TBH I did not consider partitioning any less complex and it was indeed very complex, requiring at least 3 reworks by me. And from what I understood, it would have been a blocker too. So is subquery handling and RLS. That's why I focused on addressing those items while you and Simon were still debating EPQ semantics. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Feb 9, 2018 at 6:53 AM, Peter Geoghegan wrote: > On Wed, Feb 7, 2018 at 7:51 PM, Pavan Deolasee > wrote: > > I understand getting EPQ semantics right is very important. Can you > please > > (once again) summarise your thoughts on what you think is the *most* > > appropriate behaviour? I can then think how much efforts might be > involved > > in that. If the efforts are disproportionately high, we can discuss if > > settling for some not-so-nice semantics, like we apparently did for > > partition key updates. > > I personally believe that the existing EPQ semantics are already > not-so-nice. They're what we know, though, and we haven't actually had > any real world complaints, AFAIK. > I agree. > > My concern is mostly just that MERGE manages to behave in a way that > actually "provides a single SQL statement that can conditionally > INSERT, UPDATE or DELETE rows, a task that would otherwise require > multiple procedural language statements", as the docs put it. As long > as MERGE manages to do something as close to that high level > description as possible in READ COMMITTED mode (with our current > semantics for multiple statements in RC taken as the baseline), then > I'll probably be happy. > IMO it will be quite hard, if not impossible, to guarantee the same semantics to a single statement MERGE and multi statement UPDATE/DELETE/INSERT in RC mode. For example, the multi statement model will execute each statement with a new MVCC snapshot and hence the rows visible to individual statement may vary. Whereas in MERGE, everything runs with a single snapshot. There could be other such subtle differences. > > Some novel new behavior -- "EPQ with a twist"-- is clearly necessary. > I feel a bit uneasy about it because anything that anybody suggests is > likely to be at least a bit arbitrary (EPQ itself is kind of > arbitrary). We only get to make a decision on how "EPQ with a twist" > will work once, and that should be a decision that is made following > careful deliberation. Ambiguity is much more likely to kill a patch > than a specific technical defect, at least in my experience. Somebody > can usually just fix a technical defect. > While I agree, I think we need to make these decisions in a time bound fashion. If there is too much ambiguity, then it's not a bad idea to settle for throwing appropriate errors instead of providing semantically wrong answers, even in some remote corner case. > > > > > TBH I did not consider partitioning any less complex and it was indeed > very > > complex, requiring at least 3 reworks by me. And from what I understood, > it > > would have been a blocker too. So is subquery handling and RLS. That's > why I > > focused on addressing those items while you and Simon were still debating > > EPQ semantics. > > Sorry if I came across as dismissive of that effort. That was > certainly not my intention. I am pleasantly surprised that you've > managed to move a number of things forward rather quickly. > > I'll rephrase: while it would probably have been a blocker in theory > (I didn't actually weigh in on that), I doubted that it would actually > end up doing so in practice (and it now looks like I was right to > doubt that, since you got it done). It was a theoretical blocker, as > opposed to an open item that could drag on indefinitely despite > everyone's best efforts. Obviously details matter, and obviously there > are a lot of details to get right outside of RC semantics, but it > seems wise to focus on the big risk that is EPQ/RC conflict handling. > Ok. 
I am now back from holidays and I too will start thinking about this. I've also requested a colleague to help us compare the behaviour against Oracle's. That's not a gold standard for us, but knowing how other major databases handle RC conflicts is not a bad idea. I see the following important areas, and as long as we have a consistent and coherent handling of these cases, we should not have difficulty agreeing on an outcome.
1. Concurrent UPDATE does not affect the MATCHED case. The WHEN conditions may or may not be affected.
2. A concurrently UPDATEd tuple fails the join qual and the current source tuple no longer matches the updated target tuple that the EPQ is set for. It matches no other target tuple either. So a MATCHED case is turned into a NOT MATCHED case.
3. A concurrently UPDATEd tuple fails the join qual and the current source tuple no longer matches the updated target tuple that the EPQ is set for. But it matches some other target tuple. So it's still a MATCHED case, but with different target tuple(s).
4. Concurrent UPDATE/INSERT creates a matching target tuple for a source tuple, t
Re: [HACKERS] MERGE SQL Statement for PG11
On Sat, Feb 10, 2018 at 7:19 AM, Tomas Vondra wrote: > Hi, > > On 02/07/2018 10:24 AM, Pavan Deolasee wrote: > > > >if (startWAL < GetXactWALBytes()) >ereport(ERROR, >(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot write to database ..."))); > > I think this actually fails to enforce the rule, because some writes may > not produce WAL (think of unlogged tables). I also suspect it may be > incorrect "in the opposite direction" because a query may not do any > changes and yet it may produce WAL (e.g. due to wal_hint_bits=true). > > So we may need to think of a different way to enforce this ... Yes, this does look problematic. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Feb 9, 2018 at 8:06 PM, Robert Haas wrote: > > > On that basis, of the options I listed in > http://postgr.es/m/CA+TgmoZDL-caukHkWet7sr7sqr0-e2T91+ > devhqen5sfqsm...@mail.gmail.com > I like #1 least. > > I also dislike #4 from that list for the reasons stated there. For > example, if you say WHEN MATCHED AND x.some_boolean and then WHEN > MATCHED, you expect that every tuple that hits the latter clause will > have that Boolean as false or null, but #4 makes that not true. > > I think the best options are #2 and #5 -- #2 because it's simple, and > #5 because it's (maybe) more intuitive, albeit at the risk of > livelock. As you said, #5 seems the best and that's what the patch does. But ISTM that the options you listed are not really the main concern. As the patch stands today, we evaluate the WHEN AND conditions separately, outside the EPQ. The problem arises when the join qual returns a different result with the updated tuple. I listed those cases in my earlier email today. To me (and I assume to Peter and Simon too), those are the more interesting cases. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
On Fri, Feb 2, 2018 at 9:07 PM, Robert Haas wrote: > On Thu, Feb 1, 2018 at 7:21 PM, Simon Riggs wrote: > > Yes, it would be about 99% of the time. > > > > But you have it backwards - we are not assuming that case. That is the > > only case that has risk - the one where an old WAL record starts at > > exactly the place the latest one stops. Otherwise the rest of the WAL > > record will certainly fail the CRC check, since it will effectively > > have random data in it, as you say. > > OK, I get it now. Thanks for explaining. I think I understand now > why you think this problem can be solved just by controlling the way > we recycle segments, but I'm still not sure if that can really be made > fully reliable. Michael seems concerned about what might happen after > multiple recyclings, and Tom has raised the issue of old data > reappearing after a crash. > I'm not sure if Michael has spotted a real problem or was that just a concern. He himself later rightly pointed out that when a WAL file is switched, the old file is filled with zeros. So I don't see a problem there. May be I am missing something and Michael can explain further. Regarding Tom's concerns, that could be a problem if a file system crash survives a name change, but not the subsequent data written to the file. For this to be a problem, WAL file A is renamed to B and then renamed to C. File A and C share the same low order bits. Further upon file system crash, the file is correctly named as C, but the data written *before* the rename operation is lost. Is that a real possibility? Can we delay reusing low order bits a little further to address this problem? Of course, if the file system crash can survive many renames and still resurrect old data several renames before, then we shall have the same problem. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
Hi Stephen, On Tue, Feb 6, 2018 at 3:37 PM, Stephen Frost wrote: > > > Coming out of that, my understanding is that Simon is planning to have a > patch which implements RLS and partitioning (though the query plans for > partitioning may be sub-par and not ideal) as part of MERGE and I've > agreed to review at least the RLS bits (though my intention is to at > least go through the rest of the patch as well, though likely in less > detail). Of course, I encourage others to review it as well. > > Thanks for volunteering to review the RLS bits. I am planning to send a revised version soon. As I work through it, I am faced with some semantic questions again. Would appreciate if you or anyone have answers to those. While executing MERGE, for existing tuples in the target table, we may end up doing an UPDATE or DELETE, depending on the WHEN MATCHED AND conditions. So it seems unlikely that we would be able to push down USING security quals down to the scan. For example, if the target row is set for deletion, it seems wrong to exclude the row from the join based on UPDATE policy's USING quals. So I am thinking that we should apply the respective USING quals *after* the decision to either update, delete or do nothing for the given target tuple is made. The question I have is, if the USING qual evaluates to false or NULL, should we silently ignore the tuple (like regular UPDATE does) or throw an error (like INSERT ON CONFLICT DO UPDATE)? ISTM that we might have decided to throw an error in case of INSERT ON CONFLICT to avoid any confusion where the tuple is neither inserted nor updated. Similar situation may arise with MERGE because for a source row, we may neither do UPDATE (because of RLS) nor INSERT because a matching tuple already exists. But someone may argue that we should stay closer to regular UPDATE/DELETE. Apart from that, are there any security angles that we need to be mindful of and would those impact the choice? SELECT policies will be applied to the target table during the scan and rows which do not pass SELECT quals will not be processed at all. If there are NOT MATCHED actions, we might end up inserting duplicate rows in that case or throw errors, but I don't see anything wrong with that. Similar things would have happened if someone tried to insert rows into the table using regular INSERT. Similarly, INSERT policies will be applied when MERGE attempts to INSERT a row into the table and error will be thrown if the row does not satisfy INSERT policies. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
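For reference, here is a minimal sketch of the kind of setup the question is about (the tables, policies and roles are invented; this is not taken from the patch's regression tests). The UPDATE and DELETE policies below have USING quals that a target row can fail even though it passed the SELECT policy, which is exactly the case where we must choose between silently ignoring the row and throwing an error.

CREATE TABLE target (id int PRIMARY KEY, val text, owner name);
CREATE TABLE source (id int, val text, op text);

ALTER TABLE target ENABLE ROW LEVEL SECURITY;
CREATE POLICY target_sel ON target FOR SELECT USING (true);
CREATE POLICY target_upd ON target FOR UPDATE USING (owner = current_user);
CREATE POLICY target_del ON target FOR DELETE USING (owner = current_user);

-- Run as an unprivileged role: rows whose owner column names some other user
-- are still visible (the SELECT policy passes) and the join reports them as
-- MATCHED, but the UPDATE/DELETE policy's USING qual then fails for them.
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT (id, val, owner) VALUES (s.id, s.val, current_user);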
Re: [HACKERS] MERGE SQL Statement for PG11
Hi Andreas, On Sun, Feb 4, 2018 at 5:45 PM, Andreas Seltenreich wrote: > > > Attached are testcases that trigger the assertions when run against an > empty database instead of the one left behind by make installcheck. The > "unrecognized node type" one appears to be a known issue according to > the wiki, so I didn't bother doing the work for this one as well. > Thanks for doing those tests. I've just sent v16a version of the patch and I think it fixes the issues reported so far. Can you please recheck? Please let me know if there are other issues detected by sqlsmith or otherwise. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
On Wed, Mar 27, 2019 at 9:47 AM Masahiko Sawada wrote: > > The patch looks good to me. There is no comment from me. > > Thanks for your review! Updated patch attached, since the previous patch failed to apply after recent changes on master. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services copy_freeze_v5.patch Description: Binary data
Re: Re: A separate table level option to control compression
On Wed, Apr 3, 2019 at 10:19 AM Michael Paquier wrote: > > > I have been looking at this patch more, and here are some notes: > - The tests can be really simplified using directly reltoastrelid, so > I changed the queries this way. I am aware that the surroundings > hardcode directly the relation name, but that's not really elegant in > my opinion. And I am really tempted to adjust these as well to > directly use reltoastrelid. > I agree using reltoastrelid is a better way. I just followed the surrounding tests, but +1 to change all of these to use reltoastrelid. > - The docs had a weird indentation. > Sorry about that. I was under the impression that it doesn't matter since the doc builder ultimately chooses the correct indentation. I had a patch ready after Sawada's review, but since you already fixed that, I am not sending it. > - I think that we should be careful with the portability of > pg_column_size(), so I have added comparisons instead of the direct > values in a way which does not change the meaning of the tests nor > their coverage. > Ok, understood. > - Having RelationGetCompressTupleTarget use directly > toast_tuple_threshold as default argument is I think kind of > confusing, so let's use a different static value, named > COMPRESS_TUPLE_TARGET in the attached. This is similar to > TOAST_TUPLE_TARGET for the toast tuple threshold. > Sounds good. This also takes care of the other issue you brought up about "hoff" getting subtracted twice. > - The comments in tuptoaster.h need to be updated to outline the > difference between the compression invocation and the toast invocation > thresholds. The wording could be better though. > I see that you've done this already. But please let me know if more is needed. > - Better to avoid comments in the middle of the else/if blocks in my > opinion. > This is probably a personal preference and I've seen many other places where we do that (see e.g. lazy_scan_heap()). But I don't have any strong views on this. I see that you've updated comments in tuptoaster.h, so no problem if you want to remove the comments from the code block. > > Also, the previous versions of the patch do that when doing a heap > insertion (heapam.c and rewriteheap.c): > +toast_tuple_threshold = RelationGetToastTupleTarget(relation, > +TOAST_TUPLE_THRESHOLD); > +compress_tuple_threshold = > RelationGetCompressTupleTarget(relation, > +toast_tuple_threshold); > +compress_tuple_threshold = Min(compress_tuple_threshold, > + toast_tuple_threshold); > [...] > need_toast = (HeapTupleHasExternal(&oldtup) || > HeapTupleHasExternal(newtup) || > -newtup->t_len > TOAST_TUPLE_THRESHOLD); > +newtup->t_len > compress_tuple_threshold); > > This means that the original code always uses the compilation-time, > default value of toast_tuple_target for all relations. But this gets > changed so as we would use the value set at relation level for > toast_tuple_target if the reloption is changed, without touching > compress_tuple_threshold. This is out of the scope of this patch, but > shouldn't we always use the relation-level value instead of the > compiled one? Perhaps there is something I am missing? > Yeah, this is an issue with the existing code. Even though we allow setting toast_tuple_target to a value less than the compile-time TOAST_TUPLE_THRESHOLD, the code doesn't really honour a value less than TOAST_TUPLE_THRESHOLD. In other words, setting toast_tuple_target below TOAST_TUPLE_THRESHOLD doesn't have any effect.
We don't even create a toast table if the estimated length of tuple is not greater than TOAST_TUPLE_THRESHOLD. The change introduced by this patch will now trigger the tuptoaster code when the compression or toast threshold is set to a value lower than TOAST_TUPLE_THRESHOLD. But as far as I can see, that doesn't have any bad effect on the toasting since toast_insert_or_update() is capable of handling the case when the toast table is missing. There will be a behavioural change though. e.g. CREATE TABLE testtab (a text) WITH (toast_tuple_target=256); INSERT INTO testtab VALUES ; Earlier this tuple would not have been toasted, even though toast_tuple_target is set to 256. But with the new code, the tuple will get toasted. So that's a change, but in the right direction as far as I can see. Also, this is a bit unrelated to what this patch is trying to achieve. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
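To illustrate the behavioural change described above, here is a rough sketch (the table name, the EXTERNAL storage setting and the 512-byte value are arbitrary choices for the demonstration, not taken from the patch's tests). On unpatched code the row stays inline because the hard-coded TOAST_TUPLE_THRESHOLD (roughly 2KB) still governs the decision; with the fix, the 256-byte toast_tuple_target should trigger the toaster.

CREATE TABLE testtab (a text) WITH (toast_tuple_target = 256);
ALTER TABLE testtab ALTER COLUMN a SET STORAGE EXTERNAL;  -- rule out inline compression for the demo

INSERT INTO testtab SELECT repeat('x', 512);

-- If the reloption is honoured, the ~512-byte value is pushed out to the
-- toast table and its size becomes non-zero; otherwise it stays empty.
SELECT pg_relation_size(reltoastrelid::regclass) AS toast_bytes
FROM pg_class
WHERE relname = 'testtab';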
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
Hi, On Fri, Apr 5, 2019 at 9:05 AM Andres Freund wrote: > > > I think the right approach would be to do all of this in heap_insert and > heap_multi_insert. Whenever starting to work on a page, if INSERT_FROZEN > is specified, remember whether it is either currently empty, or is > already marked as all-visible. If previously empty, mark it as all > visible at the end. If already all visible, there's no need to change > that. If not yet all-visible, no need to do anything, since it can't > have been inserted with COPY FREEZE. We're doing roughly the same. If we are running INSERT_FROZEN, whenever we're about to switch to a new page, we check if the previous page should be marked all-frozen and do it that way. The special code in copy.c is necessary to take care of the last page which we don't get to handle in the regular code path. Or are you suggesting that we don't even rescan the page for all-frozen tuples at the end and just simply mark it all-frozen at the start, when the first tuple is inserted and then don't touch the PD_ALL_VISIBLE/visibility map bit as we go on inserting more tuples in the page? Anyways, if major architectural changes are required then it's probably too late to consider this for PG12, even though it's more of a bug fix and a candidate for back-patching too. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
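As a concrete way to observe what is being discussed, here is a rough sketch using the pg_visibility extension (the table, row count and data source are arbitrary; COPY FREEZE requires that the table be created or truncated in the same transaction).

BEGIN;
CREATE TABLE freeze_test (a int);
COPY freeze_test FROM PROGRAM 'seq 1 100000' WITH (FREEZE);
COMMIT;

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- With the proposed fix, the pages written by COPY FREEZE should already be
-- all-visible and all-frozen, without running VACUUM first.
SELECT count(*) AS total_pages,
       count(*) FILTER (WHERE all_visible) AS all_visible_pages,
       count(*) FILTER (WHERE all_frozen)  AS all_frozen_pages
FROM pg_visibility_map('freeze_test');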
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
On Fri, Apr 5, 2019 at 8:37 AM Andres Freund wrote: > Hi, > > On 2019-04-05 00:06:04 -0300, Alvaro Herrera wrote: > > > > > Hmm, isn't there already a critical section in visibilitymap_set itself? > > There is, but the proposed code sets all visible on the page, and marks > the buffer dirty, before calling visibilitymap_set. > How's it any different than what we're doing at vacuumlazy.c:1322? We set the page-level bit, mark the buffer dirty and then call visibilitymap_set(), all outside a critical section.

1300     /* mark page all-visible, if appropriate */
1301     if (all_visible && !all_visible_according_to_vm)
1302     {
1303         uint8 flags = VISIBILITYMAP_ALL_VISIBLE;
1304
1305         if (all_frozen)
1306             flags |= VISIBILITYMAP_ALL_FROZEN;
1307
1308         /*
1309          * It should never be the case that the visibility map page is set
1310          * while the page-level bit is clear, but the reverse is allowed
1311          * (if checksums are not enabled). Regardless, set both bits
1312          * so that we get back in sync.
1313          *
1314          * NB: If the heap page is all-visible but the VM bit is not set,
1315          * we don't need to dirty the heap page. However, if checksums
1316          * are enabled, we do need to make sure that the heap page is
1317          * dirtied before passing it to visibilitymap_set(), because it
1318          * may be logged. Given that this situation should only happen in
1319          * rare cases after a crash, it is not worth optimizing.
1320          */
1321         PageSetAllVisible(page);
1322         MarkBufferDirty(buf);
1323         visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
1324                           vmbuffer, visibility_cutoff_xid, flags);
1325     }

As the first para in that comment says, I thought it's ok for the page-level bit to be set but the visibility bit to be clear, but not the vice versa. The proposed code does not introduce any new behaviour AFAICS. But I might be missing something. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Caveats from reloption toast_tuple_target
On Thu, Apr 4, 2019 at 11:36 AM Michael Paquier wrote: > > > I mean that toast_tuple_target is broken as-is, because it should be > used on the new tuples of a relation as a threshold to decide if this > tuple should be toasted or not, but we don't actually use the > reloption value for that decision-making: the default threshold > TOAST_TUPLE_THRESHOLD gets used instead all the time! The code does > not even create a toast table in some cases. > I think the problem with the existing code is that while it allows users to set toast_tuple_target to be less than TOAST_TUPLE_THRESHOLD, the same is not honoured while deciding whether to toast a row or not. AFAICS it works ok when toast_tuple_target is greater than or equal to TOAST_TUPLE_THRESHOLD i.e. it won't toast the rows unless they are larger than toast_tuple_target. IMV it makes sense to simply cap the lower limit of toast_tuple_target to the compile time default and update docs to reflect that. Otherwise, we need to deal with the possibility of dynamically creating the toast table if the relation option is lowered after creating the table. Your proposed patch handles the case when the toast_tuple_target is specified at CREATE TABLE, but we would still have problem with ALTER TABLE, no? But there might be side effects of changing the lower limit for pg_dump/pg_restore. So we would need to think about that too. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
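A rough sketch of the ALTER TABLE case referred to above (the table definition and values are invented): the relation is narrow enough at creation time that no toast table is made, and the reloption is only lowered afterwards, so rows exceeding the new target have nowhere to go out of line.

-- Narrow columns: the estimated maximum tuple size is well below
-- TOAST_TUPLE_THRESHOLD, so no toast table is created.
CREATE TABLE narrow_tab (id int, tag varchar(100));

SELECT reltoastrelid FROM pg_class WHERE relname = 'narrow_tab';  -- expect 0

-- Lowering the target afterwards does not create a toast table today, so an
-- oversized row still cannot be moved out of line even if the new threshold
-- were honoured.
ALTER TABLE narrow_tab SET (toast_tuple_target = 256);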
Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
On Tue, 28 May 2019 at 4:36 PM, Andres Freund wrote: > Hi, > > On 2019-04-07 18:04:27 -0700, Andres Freund wrote: > > Here's a *prototype* patch for this. It only implements what I > > described for heap_multi_insert, not for plain inserts. I wanted to see > > what others think before investing additional time. > > Pavan, are you planning to work on this for v13 CF1? Or have you lost > interest on the topic? > Yes, I plan to work on it. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: MERGE SQL statement for PG12
Hi Tomas, Sorry for a delayed response. On Mon, Oct 29, 2018 at 4:59 PM Tomas Vondra wrote: > Hi Pavan, > > On 10/29/2018 10:23 AM, Pavan Deolasee wrote: > > > > ... > > > > Thanks for keeping an eye on the patch. I've rebased the patch > > against the current master. A new version is attached. > > > > Thanks, > > Pavan > > > > It's good to see the patch moving forward. What are your thoughts > regarding things that need to be improved/fixed to make it committable? > > I briefly discussed the patch with a couple of people at pgconf.eu last > week, and IIRC the main concern was how the merge is represented in > parser/executor. > > Yes, Andres and to some extent Peter G had expressed concerns about that. Andres suggested that we should rework the parser and executor side. Here are some excerpts from his review comments. "I don't think the parser / executor implementation of MERGE is architecturally sound. I think creating hidden joins during parse-analysis to implement MERGE is a seriously bad idea and it needs to be replaced by a different executor structure." +* As top-level statements INSERT, UPDATE and DELETE have a Query, whereas +* with MERGE the individual actions do not require separate planning, +* only different handling in the executor. See nodeModifyTable handling +* of commandType CMD_MERGE. +* +* A sub-query can include the Target, but otherwise the sub-query cannot +* reference the outermost Target table at all. +*/ + qry->querySource = QSRC_PARSER; Why is this, and not building a proper executor node for merge that knows how to get the tuples, the right approach? We did a rough equivalent for matview updates, and I think it turned out to be a pretty bad plan. On Fri, Apr 6, 2018 at 1:30 AM, Andres Freund wrote: > > > My impression is that this simply shouldn't be going through > nodeModifyTuple, but be it's own nodeMerge node. The trigger handling > would need to be abstraced into execTrigger.c or such to avoid > duplication. We're now going from nodeModifyTable.c:ExecModifyTable() > into execMerge.c:ExecMerge(), back to > nodeModifyTable.c:ExecUpdate/Insert(). To avoid ExecInsert() doing > things that aren't appropriate for merge, we then pass an actionState, > which neuters part of ExecUpdate/Insert(). This is just bad. > To be honest, I need more directions on how to address these major architectural concerns. I'd tried to rework the executor part and had commented on that. But I know that was probably done in a hurry since we were trying to save a revert. Having said that, I am still not very sure how exactly the executor side should look like without causing too much duplication of handling nodeModifyTable() and proposed nodeMerge(). If Andres can give me further inputs, I will be more than happy to figure out the details and improve the patch. As far as the parser side goes, do I understand correctly that instead of building a special join in transformMergeStmt() as the patch does today, we should follow what transformUpdateStmt() does for a potential join? i.e. just have a single RTE for the source relation and present it as a left hand side for the implicit join? If I get a little more direction on this, I can try to address the issues. It looks very likely that the patch won't get much review in the current state. But if I get inputs, I can work out the details and try to get the patch in committable state. BTW I am aware that the patch is bit-rotten because of the partitioning related changes. I will address those and post a revised patch soon. 
Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Feb 16, 2018 at 6:37 AM, Peter Geoghegan wrote: > > ISTM that a MERGE isn't really a thing that replaces 2 or 3 other DML > statements, at least in most cases. It's more like a replacement for > procedural code with an outer join, with an INSERT, UPDATE or DELETE > that affects zero or one rows inside the procedural loop that > processes matching/non-matching rows. The equivalent procedural code > could ultimately perform *thousands* of snapshot acquisitions for > thousands of RC DML statements. MERGE is sometimes explained in terms > of "here is the kind of procedural code that you don't have to write > anymore, thanks to MERGE" -- that's what the code looks like. > > I attach a rough example of this, that uses plpgsql. > Thanks for writing the sample code. I understand you probably don't mean to suggest that we need to mimic the behaviour of the plpgsql code and the semantics offered by MERGE would most likely be different than what the plpgsql sample does. Because there are several problems with the plpgsql code: - It would never turn a MATCHED case into a NOT MATCHED case because of concurrent UPDATE/DELETE - The WHERE clauses attached to the UPDATE/DELETE statement should be using the quals attached to the WHEN clauses to ensure they are evaluated on the new version of the row, if needed. > > >> Some novel new behavior -- "EPQ with a twist"-- is clearly necessary. > >> I feel a bit uneasy about it because anything that anybody suggests is > >> likely to be at least a bit arbitrary (EPQ itself is kind of > >> arbitrary). We only get to make a decision on how "EPQ with a twist" > >> will work once, and that should be a decision that is made following > >> careful deliberation. Ambiguity is much more likely to kill a patch > >> than a specific technical defect, at least in my experience. Somebody > >> can usually just fix a technical defect. > TBH that's one reason why I like Simon's proposed behaviour of throwing errors in case of corner cases. I am not suggesting that's what we do at the end, but it's definitely worth considering. > > > > > > While I agree, I think we need to make these decisions in a time bound > > fashion. If there is too much ambiguity, then it's not a bad idea to > settle > > for throwing appropriate errors instead of providing semantically wrong > > answers, even in some remote corner case. > > Everything is still on the table, I think. > Ok. > > > Ok. I am now back from holidays and I will too start thinking about this. > > I've also requested a colleague to help us with comparing it against > > Oracle's behaviour. N That's not a gold standard for us, but knowing how > > other major databases handle RC conflicts, is not a bad idea. > > The fact that Oracle doesn't allow WHEN MATCHED ... AND quals did seem > like it might be significant to me. > > Here are some observations from Rahila's analysis so far. I must say, Oracle's handling seems quite inconsistent, especially the conditions under which it sometimes re-evaluates the join and sometimes don't. - Oracle does not support multiple WHEN MATCHED clauses. So the question of re-checking all WHEN clauses does not arise. - Only one UPDATE and one DELETE clause is supported. The DELETE must be used in conjunction with UPDATE. - The DELETE clause is invoked iff the UPDATE clause is invoked. It works on the updated rows. 
Since the row is already updated (and locked) by the MERGE, the DELETE action never blocks on a concurrent update/delete.

- MERGE does not allow updating the column used in the JOIN's ON qual.

- In case of concurrent UPDATE, the join is re-evaluated iff the concurrent UPDATE updates (modifies?) the same column that MERGE is updating OR a column that MERGE is referencing in the WHERE clause is updated by the concurrent update. IOW, if the MERGE and the concurrent UPDATE are operating on different columns, the join is NOT re-evaluated, thus possibly invoking a WHEN MATCHED action on a row which no longer matches the join condition.

- In case of concurrent DELETE, the join is re-evaluated and the action may change from MATCHED to NOT MATCHED.

I am curiously surprised by its behaviour of re-evaluating the join only when certain columns are updated. It looks to me that, irrespective of what we choose, our implementation would be much superior to what Oracle offers. BTW I've sent v17a of the patch, which is very close to being complete from my perspective (except some documentation fixes/improvements). The only thing pending is the decision to accept or change the currently implemented concurrency semantics. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Thu, Mar 1, 2018 at 3:50 AM, Peter Geoghegan wrote: > On Fri, Feb 9, 2018 at 6:36 AM, Robert Haas wrote: > > Here's my $0.02: I think that new concurrency errors thrown by the > > merge code itself deserve strict scrutiny and can survive only if they > > have a compelling justification, but if the merge code happens to > > choose an action that leads to a concurrency error of its own, I think > > that deserves only mild scrutiny. > > > > On that basis, of the options I listed in > > http://postgr.es/m/CA+TgmoZDL-caukHkWet7sr7sqr0-e2T91+ > devhqen5sfqsm...@mail.gmail.com > > I like #1 least. > > > > I also dislike #4 from that list for the reasons stated there. For > > example, if you say WHEN MATCHED AND x.some_boolean and then WHEN > > MATCHED, you expect that every tuple that hits the latter clause will > > have that Boolean as false or null, but #4 makes that not true. > > > > I think the best options are #2 and #5 -- #2 because it's simple, and > > #5 because it's (maybe) more intuitive, albeit at the risk of > > livelock. But I'm willing to convinced of some other option; like > > you, I'm willing to be open-minded about this. But, as you say, we > > need a coherent, well-considered justification for the selected > > option, not just "well, this is what we implemented so you should like > > it". The specification should define the implementation, not the > > reverse. > > At first I hated option #2. I'm warming to #2 a lot now, though, > because I've come to understand the patch's approach a little better. > (Pavan and Simon should still verify that I got things right in my > mail from earlier today, though.) > > It now seems like the patch throws a RC serialization error more or > less only due to concurrent deletions (rarely, it will actually be a > concurrent update that changed the join qual values of our MERGE). > We're already not throwing the error (we just move on to the next > input row from the join) when we happen to not have a WHEN NOT MATCHED > case. But why even make that distinction? Why not just go ahead and > WHEN NOT MATCHED ... INSERT when the MERGE happens to have such a > clause? The INSERT will succeed, barring any concurrent conflicting > insertion by a third session -- a hazard that has nothing to do with > RC conflict handling in particular. > > Just inserting without throwing a RC serialization error is almost > equivalent to a new MVCC snapshot being acquired due to a RC conflict, > so all of this now looks okay to me. Especially because every other > MERGE implementation seems to have serious issues with doing anything > too fancy with the MERGE target table's quals within the main ON join. > I think that SQL Server actually assumes that you're using the > target's PK in a simple equi-join. All the examples look like that, > and this assumption is heavily suggested by the "Do not attempt to > improve query performance by filtering out rows in the target table in > the ON clause" weasel words from their docs, that I referenced in my > mail from earlier today. > > I think you've fairly accurately described what the patch does today. I take your point that we can very well just execute the WHEN NOT MATCHED action if the join condition fails for the updated tuple. There is one case we ought to think about though and that might explain why executing the WHEN NOT MATCHED action may not be best choice. Or for that matter even skipping the target when no NOT MATCHED action exists, as the patch does today. 
What if the updated tuple fails the join qual with respect to the current tuple from the source relation but it now matches some other tuple from the source relation? I described this case in one of the earlier emails too. In this case, we might end up doing an INSERT (if we decide to execute WHEN NOT MATCHED action), even though a MATCH exists. If there is no WHEN NOT MATCHED action, the current patch will just skip the updated tuple even though a match exists, albeit it's not the current source tuple. Oracle behaves differently and it actually finds a new matching tuple from the source relation and executes the WHEN MATCHED action, using that source tuple. But I am seriously doubtful that we want to go down that path and whether it's even feasible. Our regular UPDATE .. FROM does not do that either. Given that, it seems better to just throw an error (even when no NOT MATCHED action exists) and explain to the users that MERGE will work as long as concurrent updates don't modify the columns used in the join condition. Concurrent deletes should be fine and we may actually even invoke WHEN NOT MATCHED action in that case. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
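To make that schedule easier to follow, here is a rough sketch of the scenario (table names and values are invented, and the MERGE syntax is as proposed in the patch).

CREATE TABLE target (key int, val text);
CREATE TABLE source (skey int, sval text);
INSERT INTO target VALUES (1, 'old');
INSERT INTO source VALUES (1, 's1'), (2, 's2');

-- Session 1: move the target row so that it matches a *different* source row.
BEGIN;
UPDATE target SET key = 2 WHERE key = 1;

-- Session 2 (READ COMMITTED): blocks on the row locked by session 1.
MERGE INTO target t
USING source s ON t.key = s.skey
WHEN MATCHED THEN UPDATE SET val = s.sval
WHEN NOT MATCHED THEN INSERT VALUES (s.skey, s.sval);

-- Session 1:
COMMIT;

Once session 1 commits, the target row no longer joins to the source row with skey = 1, but it does join to skey = 2. Whether session 2 should now insert a row for skey = 1, re-find the new match, or throw an error is precisely the open question.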
Re: ON CONFLICT DO UPDATE for partitioned tables
On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera wrote: > > > Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused: > expand_targetlist() runs *after*, not before, so how could it have > affected the result? > If I understand correctly, planner must have called expand_targetlist() once for the parent relation's descriptor and added any dropped columns from the parent relation. So we may not find mapped attributes for those dropped columns in the parent. I haven't actually tested this case though. I wonder if it makes sense to actually avoid expanding on-conflict-set targetlist in case the target is a partition table and deal with it during execution, like we are doing now. > I'm obviously confused about what > expand_targetlist call this comment is talking about. Anyway I wanted > to make it use resjunk entries instead, but that broke some other case > that I didn't have time to research yesterday. I'll get back to this > soon, but in the meantime, here's what I have. > + conv_setproj = + adjust_and_expand_partition_tlist(RelationGetDescr(firstResultRel), + RelationGetDescr(partrel), + RelationGetRelationName(partrel), + firstVarno, + conv_setproj); Aren't we are adding back Vars referring to the parent relation, after having converted the existing ones to use partition relation's varno? May be that works because missing attributes are already added during planning and expand_targetlist() here only adds dropped columns, which are just NULL constants. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
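For context, here is a rough sketch of the user-visible case the targetlist conversion has to get right (names are invented; it assumes the ON CONFLICT DO UPDATE support for partitioned tables being discussed). Because the parent has a dropped column, the partition's attribute numbers differ from the parent's, so the SET list built against the parent must be mapped, and dropped columns accounted for, before it can run against the partition.

CREATE TABLE parted (id int, junk int, val text) PARTITION BY LIST (id);
ALTER TABLE parted DROP COLUMN junk;
CREATE TABLE parted_p1 PARTITION OF parted FOR VALUES IN (1);  -- attnums differ from parent
CREATE UNIQUE INDEX ON parted (id);

INSERT INTO parted VALUES (1, 'a')
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;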
Re: Faster inserts with mostly-monotonically increasing values
On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan wrote: > On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire > wrote: > > > I believe PKs are a prime candidate for this optimization, and > > expecting it to apply only when no concurrency is involved is severely > > dumbing down the optimization. > > Pavan justified the patch using a benchmark that only involved a > single client -- hardly typical for a patch that changes the B-Tree > code. If the benefits with many clients can be shown to matter, that > will make this much more interesting to me. Ok. I will repeat those tests with more number of clients and report back. Regarding your suggestion about using page LSN to detect intermediate activity, my concern is that unless we store that value in shared memory, concurrent backends, even if inserting values in an order, will make backend-local cached page LSN invalid and the optimisation will not kick-in. I am yet to digest the entire conversation between you and Claudio; you guys clearly understand b-tree internals better than me. It seems while you're worried about missing out on something, Claudio feels that we can find a safe way just looking at the information available in the current page. I feel the same way, but will need to re-read the discussion carefully again. Simon had raised concerns about DESC indexes and whether we need to do the checks for leftmost page in that case. I haven't yet figured out if DESC indexes are actually stored in the reverse order. I am gonna look at that too. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
On Wed, Feb 28, 2018 at 12:38 PM, Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote: > On Wed, Feb 14, 2018 at 5:44 PM, Amit Kapila > wrote: > >> +# Concurrency error from GetTupleForTrigger >> +# Concurrency error from ExecLockRows >> >> I think you don't need to mention above sentences in spec files. >> Apart from that, your patch looks good to me. I have marked it as >> Ready For Committer. >> > > I too have tested this feature with isolation framework and this look good > to me. > > It looks to me that we are trying to fix only one issue here with concurrent updates. What happens if a non-partition key is first updated and then a second session updates the partition key? For example, with your patches applied:

CREATE TABLE pa_target (key integer, val text) PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
INSERT INTO pa_target VALUES (1, 'initial1');

session1:
BEGIN;
UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
UPDATE 1
postgres=# SELECT * FROM pa_target ;
 key |              val
-----+-------------------------------
   1 | initial1 *updated by update1*
(1 row)

session2:
UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE key = 1

session1:
postgres=# COMMIT;
COMMIT
postgres=# SELECT * FROM pa_target ;
 key |             val
-----+------------------------------
   2 | initial1 updated by update2
(1 row)

Ouch. The committed updates by session1 are overwritten by session2. This clearly violates the rules that rest of the system obeys and is not acceptable IMHO. Clearly, ExecUpdate() while moving rows between partitions is missing out on re-constructing the to-be-updated tuple, based on the latest tuple in the update chain. Instead, it's simply deleting the latest tuple and inserting a new tuple in the new partition based on the old tuple. That's simply wrong. I haven't really thought carefully to see if this should be a separate patch, but it warrants attention. We should at least think through all different concurrency aspects of partition key updates and think about a holistic solution, instead of fixing one problem at a time. This probably also shows that isolation tests for partition key updates are either missing (I haven't checked) or they need more work. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
On Tue, Feb 13, 2018 at 12:41 PM, amul sul wrote: > > Thanks for the confirmation, updated patch attached. > > I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not do anything to deal with the fact that t_ctid may no longer point to itself to mark the end of the chain. I just can't see how that would work. But if it does, it needs a good amount of comments explaining why, and most likely updating comments at other places where chain following is done. For example, how is this code in heap_get_latest_tid() still valid? Aren't we setting "ctid" to some invalid value here?

2302         /*
2303          * If there's a valid t_ctid link, follow it, else we're done.
2304          */
2305         if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
2306             HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
2307             ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
2308         {
2309             UnlockReleaseBuffer(buffer);
2310             break;
2311         }
2312
2313         ctid = tp.t_data->t_ctid;

This is just one example. I am almost certain there are many such cases that will require careful attention. What happens if a partition key update deletes a row, but the operation is aborted? Do we need any special handling for that case? I am actually worried that we're tinkering with ip_blkid to handle one corner case of detecting partition key update. This is going to change the on-disk format and probably needs more careful attention. Are we certain that we would never require update-chain following when partition keys are updated? If so, can we think about some other mechanism which actually even leaves behind ? I am not saying we should do that, but it warrants a thought. May be it was discussed somewhere else and ruled out. I happened to notice this patch because of the bug I encountered. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila wrote: > On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee > wrote: > > > > On Tue, Feb 13, 2018 at 12:41 PM, amul sul wrote: > >> > >> Thanks for the confirmation, updated patch attached. > >> > > > > I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch > does not > > do anything to deal with the fact that t_ctid may no longer point to > itself > > to mark end of the chain. I just can't see how that would work. > > > > I think it is not that patch doesn't care about the end of the chain. > For example, ctid pointing to itself is used to ensure that for > deleted rows, nothing more needs to be done like below check in the > ExecUpdate/ExecDelete code path. > Yeah, but it only looks for places where it needs to detect deleted tuples and thus wants to throw an error. I am worried about other places where it is assumed that the ctid points to a valid looking tid, self or otherwise. I see no such places being either updated or commented. Now may be there is no danger because of other protections in place, but it looks hazardous. > > I have not tested, but it seems this could be problematic, but I feel > we could deal with such cases by checking invalid block id in the > above if check. Another one such case is in EvalPlanQualFetch > > Right. > > > What happens if a partition key update deletes a row, but the operation > is > > aborted? Do we need any special handling for that case? > > > > If the transaction is aborted than future updates would update the > ctid to a new row, do you see any problem with it? > I don't know. May be there is none. But it needs to explained why it's not a problem. > > > I am actually worried that we're tinkering with ip_blkid to handle one > > corner case of detecting partition key update. This is going to change > > on-disk format and probably need more careful attention. Are we certain > that > > we would never require update-chain following when partition keys are > > updated? > > > > I think we should never need update-chain following when the row is > moved from one partition to another partition, otherwise, we don't > change anything on the tuple. > I am not sure I follow. I understand that it's probably a tough problem to follow update chain from one partition to another. But why do you think we would never need that? What if someone wants to improve on the restriction this patch is imposing and actually implement partition key UPDATEs the way we do for regular tables i.e. instead of throwing error, we actually update/delete the row in the new partition? > > > If so, can we think about some other mechanism which actually even > > leaves behind ? I am not saying we should do > that, > > but it warrants a thought. > > > > Oh, this would much bigger disk-format change and need much more > thoughts, where will we store new partition information. > Yeah, but the disk format will change probably change just once. Or may be this can be done local to a partition table without requiring any disk format changes? Like adding a nullable hidden column in each partition to store the forward pointer? Thanks, Pavan
Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
On Thu, Mar 8, 2018 at 10:27 PM, Robert Haas wrote: > > However, there's no such thing as a free lunch. We can't use the CTID > field to point to a CTID in another table because there's no room to > include the identify of the other table in the field. We can't widen > it to make room because that would break on-disk compatibility and > bloat our already-too-big tuple headers. So, we cannot make it work > like it does when the updates are confined to a single partition. > Therefore, the only options are (1) ignore the problem, and let a > cross-partition update look entirely like a delete+insert, (2) try to > throw some error in the case where this introduces user-visible > anomalies that wouldn't be visible otherwise, or (3) revert update > tuple routing entirely. I voted for (1), but the consensus was (2). > I think that (3) will make a lot of people sad; it's a very good > feature. I am definitely not suggesting to do #3, though I agree with Tom that the option is on table. May be two back-to-back bugs in the area makes me worried and raises questions about the amount of testing the feature has got. In addition, making such a significant on-disk change for one corner case, for which even #1 might be acceptable, seems a lot. If we at all want to go in that direction, I would suggest considering a patch that I wrote last year to free-up additional bits from the ctid field (as part of the WARM). I know Tom did not like that either, but at the very least, it provides us a lot more room for future work, with the same amount of risk. > If we want to have (2), then we've got to have some way to > mark a tuple that was deleted as part of a cross-partition update, and > that requires a change to the on-disk format. > I think the question is: isn't there an alternate way to achieve the same result? One alternate way would be to do what I suggested above i.e. free up more bits and use one of those. Another way would be to add a hidden column to the partition table, when it is created or when it is attached as a partition. This only penalises the partition tables, but keeps rest of the system out of it. Obviously, if this column is added when the table is attached as a partition, as against at table creation time, then the old tuple may not have room to store this additional field. May be we can handle that by double updating the tuple? That seems bad, but then it only impacts the case when a partition key is updated. And we can clearly document performance implications of that operation. I am not sure how common this case is going to be anyways. With this hidden column, we can even store a pointer to another partition and do something with that, if at all needed. That's just one idea. Of course, I haven't thought about it for more than 10mins, so most likely I may have missed out on details and it's probably a stupid idea afterall. But there could be other ideas too. And even if we can't find one, my vote would be to settle for #1 instead of trying to do #2. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Faster inserts with mostly-monotonically increasing values
On Tue, Mar 6, 2018 at 10:10 AM, Pavan Deolasee wrote: > > > On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan wrote: > >> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire >> wrote: >> >> > I believe PKs are a prime candidate for this optimization, and >> > expecting it to apply only when no concurrency is involved is severely >> > dumbing down the optimization. >> >> Pavan justified the patch using a benchmark that only involved a >> single client -- hardly typical for a patch that changes the B-Tree >> code. If the benefits with many clients can be shown to matter, that >> will make this much more interesting to me. > > > Ok. I will repeat those tests with more number of clients and report back. > > So I repeated the tests with 1, 2, 4 and 8 clients, each running the following statement and a total of 1024 transactions. So roughly 100M rows are inserted.

INSERT INTO testtab(b) SELECT generate_series(1,10);

The table definition is:

postgres=# \d+ testtab
                                   Table "public.testtab"
 Column |  Type  | Collation | Nullable |              Default               | Storage | Stats target | Description
--------+--------+-----------+----------+------------------------------------+---------+--------------+-------------
 a      | bigint |           | not null | nextval('testtab_a_seq'::regclass) | plain   |              |
 b      | bigint |           |          |                                    | plain   |              |
Indexes:
    "testtab_a_key" UNIQUE CONSTRAINT, btree (a)

After taking average of 3 runs:

+---------+-----------------------+----------------------+
| clients | Patched - time in sec | Master - time in sec |
+---------+-----------------------+----------------------+
|       1 |           311.8643602 |           411.832757 |
+---------+-----------------------+----------------------+
|       2 |              252.5433 |          300.7875613 |
+---------+-----------------------+----------------------+
|       4 |           337.0414279 |          350.9636766 |
+---------+-----------------------+----------------------+
|       8 |           444.2035582 |          477.1903417 |
+---------+-----------------------+----------------------+

So yes, the benefits of the patch go down with higher number of clients, but it does not entirely vanish. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
ing until we came to support partitioning. Even with partitioning I could get everything to work, without duplicating the RTE, except the duplicate rows issue. I don't know how to solve that without doing what I've done or completely rewriting UPDATE/DELETE handling for inheritance/partition table. If you or others have better ideas, they are most welcome. > This is an extract from header comments for inheritance_planner() > (master branch): > > * We have to handle this case differently from cases where a source > relation > * is an inheritance set. Source inheritance is expanded at the bottom of > the > * plan tree (see allpaths.c), but target inheritance has to be expanded at > * the top. The reason is that for UPDATE, each target relation needs a > * different targetlist matching its own column set. Fortunately, > * the UPDATE/DELETE target can never be the nullable side of an outer > join, > * so it's OK to generate the plan this way. > > Of course, that isn't true of MERGE: The MERGE target *can* be the > nullable side of an outer join. That's probably a big complication for > using one target RTE. Your approach to implementing partitioning [1] > seems to benefit from having two different RTEs, in a way -- you > sidestep the restriction. Right. The entire purpose of having two different RTEs is to work around this problem. I explained this approach here [1]. I didn't receive any objections then, but that's mostly because nobody read it carefully. As I said, if we have an alternate feasible and better mechanism, let's go for it as long as efforts are justifiable. Thanks, Pavan [1] https://www.postgresql.org/message-id/CABOikdM%2Bc1vB_%2B3tYEjO%3DJ6U2uNHzKU_b%3DU72tadD5-9xQcbHA%40mail.gmail.com -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Faster inserts with mostly-monotonically increasing values
On Sat, Mar 10, 2018 at 12:11 AM, Claudio Freire wrote: > On Fri, Mar 9, 2018 at 2:54 PM, Pavan Deolasee > wrote: > > > > > > > > > So yes, the benefits of the patch go down with higher number of clients, > but > > it does not entirely vanish. > > What if you implement my suggestion? > > That should improve the multi-client case considerably. > Yes, I will try that next - it seems like a good idea. So the idea would be: check if the block is still the rightmost block and the insertion-key is greater than the first key in the page. If those conditions are satisfied, then we do a regular binary search within the page to find the correct location. This might add an overhead of binary search when keys are strictly ordered and a single client is inserting the data. If that becomes a concern, we might be able to look for that special case too and optimise for it too. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Sun, Mar 11, 2018 at 11:18 AM, Peter Geoghegan wrote: > > It sounds like we should try to thoroughly understand why these > duplicates arose. Did you actually call EvalPlanQualSetPlan() for all > subplans at the time? > > The reason for duplicates or even wrong answers is quite simple. The way UPDATE/DELETE currently works for a partition table is that we expand the inheritance tree for the parent result relation and then create a subplan for each partition separately. This works fine, even when there exists a FROM/USING clause in the UPDATE/DELETE statement, because the final result does not change irrespective of whether you first do a UNION ALL between all partitions and then find the candidate rows, or whether you find candidate rows from individual partitions separately. In case of MERGE though, since we are performing a RIGHT OUTER JOIN between the result relation and the source relation, we may conclude that a matching target row does not exist for a source row, whereas it actually exists but in some other partition. For example,

CREATE TABLE target (key int, val text) PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF target FOR VALUES IN (1, 2, 3);
CREATE TABLE part2 PARTITION OF target FOR VALUES IN (4, 5, 6);
CREATE TABLE source (skey integer);
INSERT INTO source VALUES (1), (4), (7);
INSERT INTO part1 VALUES (1, 'v1'), (2, 'v2'), (3, 'v3');
INSERT INTO part2 VALUES (4, 'v4'), (5, 'v5'), (6, 'v6');

postgres=# SELECT * FROM target RIGHT OUTER JOIN source ON key = skey;
 key | val | skey
-----+-----+------
   1 | v1  |    1
   4 | v4  |    4
     |     |    7
(3 rows)

This gives the right answer. But if we join individual partitions and then do a UNION ALL,

postgres=# SELECT * FROM part1 RIGHT OUTER JOIN source ON key = skey
           UNION ALL
           SELECT * FROM part2 RIGHT OUTER JOIN source ON key = skey;
 key | val | skey
-----+-----+------
   1 | v1  |    1
     |     |    4
     |     |    7
     |     |    1
   4 | v4  |    4
     |     |    7
(6 rows)

This is what nodeModifyTable does and hence we end up getting duplicates or even incorrectly declared NOT MATCHED rows, whereas they are matched in a different partition. I don't think not calling EvalPlanQualSetPlan() on all subplans is a problem because we really never execute those subplans. In fact, we should fix that so that those subplans are not even initialised. > As you know, there is an ON CONFLICT DO UPDATE + partitioning patch in > the works from Alvaro. In your explanation about that approach that > you cited, you wondered what the trouble might have been with ON > CONFLICT + partitioning, and supposed that the issues were similar > there. Are they? Has that turned up much? > > Well, I initially thought that ON CONFLICT DO UPDATE on a partition table may have the same challenges, but that's probably not the case. For INSERT ON CONFLICT it's still just an INSERT path, with some special handling for UPDATEs. Currently, for a partition or inherited table, UPDATEs/DELETEs go via inheritance_planner(), thus expanding inheritance for the result relation, whereas INSERTs go via simple grouping_planner(). For MERGE, we do all three DMLs. That doesn't mean we could not re-implement MERGE on the lines of INSERTs, but that would most likely mean complete re-writing of the UPDATEs/DELETEs for partition/inheritance tables. The challenges would just be the same in both cases. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Faster inserts with mostly-monotonically increasing values
On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire wrote: > On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee > > > > > Yes, I will try that next - it seems like a good idea. So the idea would > be: > > check if the block is still the rightmost block and the insertion-key is > > greater than the first key in the page. If those conditions are > satisfied, > > then we do a regular binary search within the page to find the correct > > location. This might add an overhead of binary search when keys are > strictly > > ordered and a single client is inserting the data. If that becomes a > > concern, we might be able to look for that special case too and optimise > for > > it too. > > Yeah, pretty much that's the idea. Beware, if the new item doesn't > fall in the rightmost place, you still need to check for serialization > conflicts. > So I've been toying with this idea since yesterday and I am quite puzzled by the results. See the attached patch, which compares the insertion key with the last key inserted by this backend, if the cached block is still the rightmost block in the tree. I initially only compared with the first key in the page, but I tried this version because of the strange performance regression for which I still have no answer. For a small number of clients, the patched version does better. But as the number of clients goes up, the patched version significantly underperforms master. I roughly counted the number of times the fastpath is taken and I noticed that almost 98% of inserts take the fastpath. I first thought that the "firstkey" location in the page might be becoming a hot-spot for concurrent processes and hence changed that to track the per-backend last offset and compare against that the next time. But that did not help much.

+---------+-------------------------------+--------------------------------+
| clients | Master - Avg load time in sec | Patched - Avg load time in sec |
+---------+-------------------------------+--------------------------------+
|       1 |                   500.0725203 |                     347.632079 |
+---------+-------------------------------+--------------------------------+
|       2 |                   308.4580771 |                    263.9120163 |
+---------+-------------------------------+--------------------------------+
|       4 |                   359.4851779 |                    514.7187444 |
+---------+-------------------------------+--------------------------------+
|       8 |                   476.4062592 |                    780.2540855 |
+---------+-------------------------------+--------------------------------+

The perf data does not show anything interesting either. I mean, there is a reduction in CPU time spent in btree-related code in the patched version, but the overall execution time to insert the same number of records goes up significantly.

Perf (master):
===
+   72.59%  1.81%  postgres  postgres  [.] ExecInsert
+   47.55%  1.27%  postgres  postgres  [.] ExecInsertIndexTuples
+   44.24%  0.48%  postgres  postgres  [.] btinsert
-   42.40%  0.87%  postgres  postgres  [.] _bt_doinsert
   - 41.52% _bt_doinsert
      + 21.14% _bt_search
      + 12.57% _bt_insertonpg
      + 2.03% _bt_binsrch
        1.60% _bt_mkscankey
        1.20% LWLockAcquire
      + 1.03% _bt_freestack
        0.67% LWLockRelease
        0.57% _bt_check_unique
   + 0.87% _start
+   26.03%  0.95%  postgres  postgres  [.] ExecScan
+   21.14%  0.82%  postgres  postgres  [.] _bt_search
+   20.70%  1.31%  postgres  postgres  [.] ExecInterpExpr
+   19.05%  1.14%  postgres  postgres  [.] heap_insert
+   18.84%  1.16%  postgres  postgres  [.] nextval_internal
+   18.08%  0.84%  postgres  postgres  [.] ReadBufferExtended
+   17.24%  2.03%  postgres  postgres  [.] ReadBuffer_common
+   12.57%  0.59%  postgres  postgres  [.] _bt_insertonpg
+   11.12%  1.63%  postgres  postgres  [.] XLogInsert
+    9.90%  0.10%  postgres  postgres  [.] _bt_relandgetbuf
+    8.97%  1.16%  postgres  postgres  [.] LWLockAcquire
+    8.42%  2.03%  postgres  postgres  [.] XLogInsertRecord
+    7.26%  1.01%  postgres  postgres  [.] _bt_binsrch
+    7.07%  1.20%  postgres  postgres  [.] RelationGetBufferForTuple
+    6.27%  4.92%  postgres  postgres  [.] _bt_compare
+    5.97%  0.63%  postgres  postgres  [.] read_seq_tuple.isra.3
+    5.70%  4.89%  postgres  postgres  [.] hash_search_with_hash_value
+    5.44%  5.44%  postgres  postgres  [.] LWLockAttemptLock

Perf (Patched):
+   69.33%  2.36%  postgres  postgres  [.] ExecInsert
+   35
Re: Faster inserts with mostly-monotonically increasing values
On Wed, Mar 14, 2018 at 10:58 AM, Claudio Freire wrote: > > > I'm thinking there could be contention on some lock somewhere. > > Can you attach the benchmark script you're using so I can try to reproduce > it? > I am using a very ad-hoc script.. But here it is.. It assumes a presence of a branch named "btree_rightmost" with the patched code. You will need to make necessary adjustments of course. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

start_options="-c shared_buffers=16GB -c max_wal_size=64GB -c min_wal_size=16GB -c checkpoint_timeout=60min -c bgwriter_delay=100 -c bgwriter_lru_maxpages=100 -c bgwriter_lru_multiplier=3 -c bgwriter_flush_after=256 -c checkpoint_completion_target=0.9"
PGDATA=/data/pgsql

echo "INSERT INTO testtab(b) SELECT generate_series(1,10);" > s1.sql

for b in master btree_rightmost; do
    git checkout $b > /dev/null 2>&1
    make -s clean > /dev/null 2>&1
    ./configure --prefix=$HOME/pg-install/$b
    make -s -j8 install > /dev/null 2>&1
    export PATH=$HOME/pg-install/$b/bin:$PATH

    for c in 1 2 4 8; do
        pg_ctl -D $PGDATA -w stop
        rm -rf $PGDATA
        initdb -D $PGDATA
        pg_ctl -D $PGDATA -o "$start_options" -w start -l logfile.$b
        psql -c "CREATE TABLE testtab (a bigserial UNIQUE, b bigint);" postgres
        echo "$b $c" >> results.txt
        num_txns_per_client=$((1024 / $c))
        time pgbench -n -l -c $c -j 1 -t $num_txns_per_client -f s1.sql postgres >> results.txt
    done
done
Re: Faster inserts with mostly-monotonically increasing values
On Wed, Mar 14, 2018 at 4:53 PM, Simon Riggs wrote: > On 14 March 2018 at 04:36, Pavan Deolasee > wrote: > > I wonder if the shortened code path actually leads to > > heavier contention for EXCLUSIVE lock on the rightmost page, which in > turn > > causes the slowdown. But that's just a theory. Any ideas how to prove or > > disprove that theory or any other pointers? > > Certainly: dropping performance with higher sessions means increased > contention is responsible. Your guess is likely correct. > > Suggest making the patch attempt a ConditionalLock on the target > block, so if its contended we try from top of index. So basically only > attempt the optimization if uncontended. This makes sense because in > many cases if the block is locked it is filling up and the RHS block > is more likely to change anyway. Hmm. I can try that. It's quite puzzling though that slowing down things actually make them better. I can also try to reduce the amount of time EX lock is held by doing some checks with a SHARE lock and then trade it for EX lock if we conclude that fast path can be taken. We can do page lsn check to confirm nothing changed when the lock was traded. Will check both approaches. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Faster inserts with mostly-monotonically increasing values
On Wed, Mar 14, 2018 at 7:21 PM, Simon Riggs wrote:
> On 14 March 2018 at 13:33, Pavan Deolasee wrote:
> > On Wed, Mar 14, 2018 at 4:53 PM, Simon Riggs <simon.ri...@2ndquadrant.com> wrote:
> >> On 14 March 2018 at 04:36, Pavan Deolasee wrote:
> >> > I wonder if the shortened code path actually leads to
> >> > heavier contention for EXCLUSIVE lock on the rightmost page, which in turn
> >> > causes the slowdown. But that's just a theory. Any ideas how to prove or
> >> > disprove that theory or any other pointers?
> >>
> >> Certainly: dropping performance with higher sessions means increased
> >> contention is responsible. Your guess is likely correct.
> >>
> >> Suggest making the patch attempt a ConditionalLock on the target
> >> block, so if its contended we try from top of index. So basically only
> >> attempt the optimization if uncontended. This makes sense because in
> >> many cases if the block is locked it is filling up and the RHS block
> >> is more likely to change anyway.
> >
> > Hmm. I can try that. It's quite puzzling though that slowing down things
> > actually make them better.
>
> That's not what is happening though.
>
> The cache path is 1) try to lock cached block, 2) when got lock check
> relevance, if stale 3) recheck from top
>
> The non-cached path is just 3) recheck from top
>
> The overall path length is longer in the cached case but provides
> benefit if we can skip step 3 in high % of cases. The non-cached path
> is more flexible because it locates the correct RHS block, even if it
> changes dynamically between starting the recheck from top.

So, as I noted in one of the previous emails, the revised patch still takes the fast path in 98% of cases. So it's not clear why taking steps 1, 2 and 3 in just 2% of cases should cause such a dramatic slowdown.

> If there is enough delay in step 1 then any benefit is lost anyway, so
> there is no point taking the cached path even when successful, so its
> better to ignore in that case.

The non-cached path is also going to land on the same page; it's just that _bt_search() will ensure that it doesn't happen quickly. So my understanding is that the additional time spent in _bt_search() accidentally reduces contention on the EX lock and ends up improving net performance. I know it sounds weird.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
Hi Stephen,

On Fri, Mar 16, 2018 at 7:28 AM, Stephen Frost wrote:
> Greetings Pavan, all,
>
> * Pavan Deolasee (pavan.deola...@gmail.com) wrote:
> > On 9 March 2018 at 08:29, Peter Geoghegan wrote:
> > > My #1 concern has become RLS, and
> > > perhaps only because I haven't studied it in enough detail.
> >
> > Sure. I've done what I thought is the right thing to do, but please check.
> > Stephen also wanted to review RLS related code; I don't know if he had a
> > chance to do so.
>
> I've started looking through the code from an RLS perspective and, at
> least on my initial review, it looks alright.

Thanks for taking time out to review the patch. It certainly helps a lot.

> A couple of tests that aren't included that I think should be in the
> regression suite are where both tables have RLS policies and
> where the RLS tables have actual SELECT policies beyond just 'true'.
>
> I certainly see SELECT policies which limit the records that
> individuals are allowed to see very frequently and it's an
> important case to ensure works correctly. I did a few tests here myself
> and they behaved as I expected, and reading through the code it looks
> reasonable, but they should be simple to write tests which run very
> quickly.

Ok. I will add those tests.

> I'm a bit on the fence about it, but as MERGE is added in quite a few
> places which previously mentioned UPDATE and DELETE throughout the
> system, I wonder if we shouldn't do better than this:
>
> =*# create policy p1 on t1 for merge using ((c1 % 2) = 0);
> ERROR:  syntax error at or near "merge"
> LINE 1: create policy p1 on t1 for merge using ((c1 % 2) = 0);
>
> Specifically, perhaps we should change that to pick up on MERGE being
> asked for there and return an error saying that policies for the MERGE
> command aren't supported with a HINT that MERGE respects
> INSERT/UPDATE/DELETE policies and that users should use those instead.

Hmm. I am not sure if that would be a good idea just for RLS. We might then also want to change several other places in the grammar to accept the INSERT/UPDATE/DELETE keywords and handle that during parse analysis. We can certainly do that, but I am not sure if it adds a lot of value, and it certainly adds a lot more code. We should surely document these things, if we haven't already.

> Also, in nodeModifyTable.c, there's a comment:
>
>  * The WITH CHECK quals are applied in ExecUpdate() and hence we need
>  * not do anything special to handle them.
>
> Which I believe is actually getting at the fact that ExecUpdate() will
> run ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK ...) and therefore in
> ExecMergeMatched() we don't need to check WCO_RLS_UPDATE_CHECK, but we
> do still need to check WCO_RLS_MERGE_UPDATE_CHECK (and that's what the
> code does).

Right.

> One thing I wonder about there though is if we really need
> to segregate those..? What's actually making WCO_RLS_MERGE_UPDATE_CHECK
> different from WCO_RLS_UPDATE_CHECK? I get that the DELETE case is
> different, because in a regular DELETE we'll never even see the row, but
> for MERGE, we will see the row (assuming it passes SELECT policies, of
> course) and then will check if it matches and that's when we realize
> that we've been asked to run a DELETE, so we have the special-case of
> WCO_RLS_MERGE_DELETE_CHECK, but that's not true of UPDATE, so I think
> this might be adding a bit of unnecessary complication by introducing
> WCO_RLS_MERGE_UPDATE_CHECK.

I've modelled this code on the lines of ON CONFLICT DO NOTHING.
And quite similar to that, I believe we need separate handling for MERGE as well, because we don't (and can't) push down the USING security quals of UPDATE/DELETE to the scan. So we need to separately check that the target row actually passes the USING quals. WCO_RLS_MERGE_UPDATE_CHECK and WCO_RLS_MERGE_DELETE_CHECK are used for that purpose.

One point to deliberate though is whether it's a good idea to throw an error when the USING quals fail, or whether we should silently ignore such rows. Regular UPDATE/DELETE does the latter and ON CONFLICT DO UPDATE does the former. I chose to throw an error because otherwise it may get confusing to the user, since a row would neither be updated (meaning it would be seen as a NOT MATCHED case) nor be inserted. I hope that's not a problem.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
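For context, a minimal sketch of the check described above, with the caveat that the helper name and call site are assumptions rather than the patch's actual code; ExecWithCheckOptions() is the existing executor entry point and raises an error when a row fails the policy quals.

```c
#include "postgres.h"
#include "executor/executor.h"

/*
 * Sketch only: re-check the UPDATE/DELETE policies' USING quals for the
 * row MERGE is about to act on.  These quals are not pushed down to the
 * scan, hence the dedicated WCO_RLS_MERGE_* kinds.
 */
static void
merge_check_rls_using_quals(CmdType commandType,
							ResultRelInfo *resultRelInfo,
							TupleTableSlot *slot, EState *estate)
{
	if (resultRelInfo->ri_WithCheckOptions == NIL)
		return;					/* no RLS policies to enforce */

	if (commandType == CMD_UPDATE)
		ExecWithCheckOptions(WCO_RLS_MERGE_UPDATE_CHECK,
							 resultRelInfo, slot, estate);
	else if (commandType == CMD_DELETE)
		ExecWithCheckOptions(WCO_RLS_MERGE_DELETE_CHECK,
							 resultRelInfo, slot, estate);

	/* ExecWithCheckOptions() ereport()s if the row fails the policy. */
}
```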
Re: ON CONFLICT DO UPDATE for partitioned tables
On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera wrote:

> @@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
>  	int			num_subplan_partition_offsets;
>  	TupleTableSlot *partition_tuple_slot;
>  	TupleTableSlot *root_tuple_slot;
> +	List	  **partition_arbiter_indexes;
> +	TupleTableSlot **partition_conflproj_slots;
> +	TupleTableSlot **partition_existing_slots;
>  } PartitionTupleRouting;

I am curious why you decided to add these members to the PartitionTupleRouting structure. Wouldn't ResultRelInfo be a better place to track these, or is there some rule that we follow?

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: ON CONFLICT DO UPDATE for partitioned tables
On Fri, Mar 16, 2018 at 5:13 PM, Etsuro Fujita wrote:
> (2018/03/16 19:43), Pavan Deolasee wrote:
> > On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera <mailto:alvhe...@2ndquadrant.com> wrote:
> >
> > @@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
> >  	int			num_subplan_partition_offsets;
> >  	TupleTableSlot *partition_tuple_slot;
> >  	TupleTableSlot *root_tuple_slot;
> > +	List	  **partition_arbiter_indexes;
> > +	TupleTableSlot **partition_conflproj_slots;
> > +	TupleTableSlot **partition_existing_slots;
> >  } PartitionTupleRouting;
> >
> > I am curious why you decided to add these members to the
> > PartitionTupleRouting structure. Wouldn't ResultRelInfo be a better
> > place to track these or is there some rule that we follow?
>
> I just started reviewing the patch, so maybe I'm missing something, but I
> think it would be a good idea to have these in that structure, not in
> ResultRelInfo, because these would be required only for partitions chosen
> via tuple routing.

Hmm, yeah, probably you're right. The reason I got confused is that the patch uses ri_onConflictSetProj/ri_onConflictSetWhere to store the converted projection info/WHERE clause for a given partition in its ResultRelInfo. So I was wondering if we can move mt_arbiterindexes, mt_existing and mt_conflproj to ResultRelInfo and then just use those per-partition structures too.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MERGE SQL Statement for PG11
On Mon, Mar 12, 2018 at 5:43 PM, Pavan Deolasee wrote:
> On Sun, Mar 11, 2018 at 11:18 AM, Peter Geoghegan wrote:
> >
> > As you know, there is an ON CONFLICT DO UPDATE + partitioning patch in
> > the works from Alvaro. In your explanation about that approach that
> > you cited, you wondered what the trouble might have been with ON
> > CONFLICT + partitioning, and supposed that the issues were similar
> > there. Are they? Has that turned up much?
>
> Well, I initially thought that ON CONFLICT DO UPDATE on a partitioned table
> may have the same challenges, but that's probably not the case. For INSERT
> ON CONFLICT it's still just an INSERT path, with some special handling for
> UPDATEs. Currently, for a partitioned or inherited table, UPDATEs/DELETEs go
> via inheritance_planner(), thus expanding inheritance for the result
> relation, whereas INSERTs go via the simple grouping_planner().
>
> For MERGE, we do all three DMLs. That doesn't mean we could not
> re-implement MERGE on the lines of INSERTs, but that would most likely mean
> completely rewriting the UPDATE/DELETE handling for partitioned/inheritance
> tables. The challenges would just be the same in both cases.

Having thought more about this in the last couple of days, I am actually inclined to try rewriting the UPDATE handling of MERGE on the lines of what the ON CONFLICT DO UPDATE patch is doing. This might help us to completely eliminate invoking inheritance_planner() for partitioned tables, and that will be a huge win for tables with several hundred partitions. The code might also look much cleaner that way. I am gonna give it a try for the next couple of days and see if it's doable.

Thanks,
Pavan
Re: Faster inserts with mostly-monotonically increasing values
On Thu, Mar 15, 2018 at 7:51 AM, Claudio Freire wrote:
> I applied the attached patches on top of your patch, so it would be
> nice to see if you can give it a try in your test hardware to see
> whether it helps.

I can confirm that I no longer see any regression at 8 or even 16 clients. In fact, the patched version consistently does better than master even at a higher number of clients.

The only thing I am a bit puzzled about is that I am no longer able to reproduce the 40-50% gains that I'd earlier observed with a single client. I now get 20-25% with a smaller number of clients and 10-15% with a larger number of clients. I haven't been able to establish why that's happening, but since it's a different AWS instance (though of the same type), I am inclined to blame it on that for now.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: ON CONFLICT DO UPDATE for partitioned tables
eck to confirm that the resultant list is actually ordered?

@@ -66,7 +67,8 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 					   EState *estate,
 					   PartitionTupleRouting *proute,
 					   ResultRelInfo *targetRelInfo,
-					   TupleTableSlot *slot);
+					   TupleTableSlot *slot,
+					   int *partition_index);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
 static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);

@@ -264,6 +266,7 @@ ExecInsert(ModifyTableState *mtstate,
 		   TupleTableSlot *slot,
 		   TupleTableSlot *planSlot,
 		   EState *estate,
+		   int partition_index,
 		   bool canSetTag)
 {
 	HeapTuple	tuple;

If we move the list of arbiter indexes and the tuple desc to ResultRelInfo, as suggested above, I think we can avoid making any API changes to ExecPrepareTupleRouting and ExecInsert.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Faster inserts with mostly-monotonically increasing values
On Thu, Mar 22, 2018 at 7:22 AM, Claudio Freire wrote: > > If you're not planning on making any further changes, would you mind > posting a coalesced patch? > > Notice that I removed the last offset thing only halfway, so it would > need some cleanup. > Here is an updated patch. I removed the last offset caching thing completely and integrated your changes for conditional lock access. Some other improvements to test cases too. I realised that we must truncate and re-insert to test index fastpath correctly. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services pg_btree_target_block_v3.patch Description: Binary data
Re: ON CONFLICT DO UPDATE for partitioned tables
On Thu, Mar 22, 2018 at 3:01 PM, Amit Langote wrote:
> > Why do we need to pin the descriptor? If we do need, why don't we need a
> > corresponding ReleaseTupleDesc() call?
>
> PinTupleDesc was added in the patch as Alvaro had submitted it upthread,
> and it wasn't clear to me either why it was needed. Looking at it
> closely, it seems we can get rid of it for the lack of a corresponding
> ReleaseTupleDesc(). ExecSetSlotDescriptor() seems to take care of pinning
> and releasing tuple descriptors that are passed to it. If some
> partition's tupDesc remains pinned because it was the last one that was
> passed to it, the final ExecResetTupleTable will take care of releasing it.
>
> I have removed the instances of PinTupleDesc in the updated patch, but I'm
> not sure why the loose PinTupleDesc() in the previous version of the patch
> didn't cause reference leak warnings or some such.

Yeah, it wasn't clear to me either. But I did not investigate. Maybe Alvaro knows better.

> > I am still confused if the partition_conflproj_tdescs really belongs to
> > PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW for
> > the MERGE patch, I moved everything to a new struct and made it part of the
> > ResultRelInfo. If no re-mapping is necessary, we can just point to the struct
> > in the root relation's ResultRelInfo. Otherwise create and populate a new one
> > in the partition relation's ResultRelInfo.
> >
> > +		leaf_part_rri->ri_onConflictSetWhere =
> > +			ExecInitQual(onconflwhere, &mtstate->ps);
> > +	}
> >
> > So ri_onConflictSetWhere and ri_onConflictSetProj are part of the
> > ResultRelInfo. Why not move mt_conflproj_tupdesc, partition_conflproj_tdescs,
> > partition_arbiter_indexes etc to the ResultRelInfo as well?
>
> I have moved both the projection tupdesc and the arbiter indexes list into
> ResultRelInfo as I wrote above.

Thanks. It's looking much better now. I think we can possibly move all ON CONFLICT related members to a separate structure and just copy the pointer to the structure if (map == NULL). That might make the code a bit more tidy.

Is there anything that needs to be done for transition tables? I checked and didn't see anything, but please check.

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
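As a rough illustration of that idea (the struct name, field names and the build_translated_state() helper are hypothetical, not taken from the patch), the "separate structure" could look something like this:

```c
#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * Hypothetical container for the ON CONFLICT DO UPDATE state that currently
 * lives as individual ResultRelInfo/ModifyTableState members.
 */
typedef struct OnConflictSetState
{
	List	   *oc_ArbiterIndexes;	/* OIDs of arbiter indexes */
	TupleDesc	oc_ProjTupdesc;		/* tupdesc of the SET projection */
	TupleTableSlot *oc_Existing;	/* slot for the existing tuple */
	ProjectionInfo *oc_ProjInfo;	/* SET clause projection */
	ExprState  *oc_WhereClause;		/* WHERE clause, if any */
} OnConflictSetState;

/*
 * During partition initialization, a partition whose tuple descriptor
 * matches the root's (i.e. map == NULL) could simply share the root's
 * state instead of building translated copies:
 *
 *     if (map == NULL)
 *         leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 *     else
 *         leaf_part_rri->ri_onConflict = build_translated_state(...);
 */
```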
Re: [HACKERS] MERGE SQL Statement for PG11
On Thu, Mar 22, 2018 at 1:15 AM, Peter Geoghegan wrote:
> > No, it did not. We only support VALUES clause with INSERT action.
>
> But can't you have a subselect in the VALUES()? Support for subselects
> seems like a totally distinct thing to the restriction that only a
> (single row) VALUES() is allowed in INSERT actions.

Ah, right. That works even today.

postgres=# CREATE TABLE target (a int, b text);
CREATE TABLE
postgres=# MERGE INTO target USING (SELECT 1) s ON false WHEN NOT MATCHED THEN INSERT VALUES ((SELECT count(*) FROM pg_class), (SELECT relname FROM pg_class LIMIT 1));
MERGE 1
postgres=# SELECT * FROM target;
  a  |       b
-----+----------------
 755 | pgbench_source
(1 row)

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Assertion failure while streaming toasted data
Hi,

While working on an output plugin that uses the streaming protocol, I hit an assertion failure. Further investigation revealed a possible bug in core Postgres. This must be new to PG14 since streaming support is new to this release. I extended the test_decoding regression test to demonstrate the failure. PFA

```
2021-05-25 11:32:19.493 IST client backend[68321] pg_regress/stream STATEMENT:  SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
TRAP: FailedAssertion("txn->size == 0", File: "reorderbuffer.c", Line: 3476, PID: 68321)
```

From my preliminary analysis, it looks like we fail to adjust the memory accounting after streaming toasted tuples. More concretely, after `ReorderBufferProcessPartialChange()` processes the in-progress transaction, `ReorderBufferTruncateTXN()` truncates the accumulated changes in the transaction, but fails to adjust the buffer size for toast chunks. Maybe we are missing a call to `ReorderBufferToastReset()` somewhere?

From what I see, the assertion only triggers when data is inserted via COPY (multi-insert). Let me know if anything else is needed to reproduce this.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com

stream_toasted_txn_test.patch
Description: Binary data
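To make the suspicion concrete, the kind of change I have in mind looks roughly like the untested fragment below, placed wherever the streamed changes are truncated. `ReorderBufferToastReset()` and `txn->toast_hash` are existing reorderbuffer internals; exactly where the call belongs is the open question.

```c
/*
 * Untested sketch: at the end of the code path that discards a streamed
 * transaction's changes (ReorderBufferTruncateTXN() or one of its callers),
 * also free any partially assembled toast chunks, since their memory is
 * accounted for in txn->size and would otherwise trip the txn->size == 0
 * assertion.
 */
if (txn->toast_hash != NULL)
	ReorderBufferToastReset(rb, txn);
```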
Re: Assertion failure while streaming toasted data
On Tue, May 25, 2021 at 1:26 PM Dilip Kumar wrote: > > > I have identified the cause of the issue, basically, the reason is if > we are doing a multi insert operation we don't set the toast cleanup > until we get the last tuple of the xl_multi_insert [1]. Now, with > streaming, we can process the transaction in between the multi-insert > but while doing that the "change->data.tp.clear_toast_afterwards" is > set to false in all the tuples in this stream. And due to this we will > not clean up the toast. > Thanks. That matches my understanding too. > > One simple fix could be that we can just clean the toast memory if we > are processing in the streaming mode (as shown in the attached patch). > I am not entirely sure if it works correctly. I'd tried something similar, but the downstream node using my output plugin gets NULL values for the toast columns. It's a bit hard to demonstrate that with the test_decoding plugin, but if you have some other mechanism to test that change with an actual downstream node receiving and applying changes, it will be useful to test with that. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Assertion failure while streaming toasted data
On Tue, May 25, 2021 at 1:49 PM Dilip Kumar wrote:
> On Tue, May 25, 2021 at 1:45 PM Pavan Deolasee wrote:
> >
> > I am not entirely sure if it works correctly. I'd tried something similar, but the downstream node using
> > my output plugin gets NULL values for the toast columns. It's a bit hard to demonstrate that with the
> > test_decoding plugin, but if you have some other mechanism to test that change with an actual downstream
> > node receiving and applying changes, it will be useful to test with that.
>
> Okay, I will test that. Thanks.

I modified test_decoding to print the tuples (like in the non-streamed case) and included your proposed fix. PFA. When the transaction is streamed, I see:

```
+ opening a streamed block for transaction
+ table public.toasted: INSERT: id[integer]:9001 other[text]:'bbb' data[text]:'ccc'
+ table public.toasted: INSERT: id[integer]:9002 other[text]:'ddd' data[text]:'eee'
+ table public.toasted: INSERT: id[integer]:9003 other[text]:'bar' data[text]:unchanged-toast-datum
```

In the non-streamed case, the `data[text]` column shows the actual data. That probably manifests as NULL data when the downstream applies it.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com

stream_toasted_txn_print_tuples.patch
Description: Binary data
Re: Assertion failure while streaming toasted data
On Tue, May 25, 2021 at 2:57 PM Dilip Kumar wrote: > > > > > Yes, I am able to reproduce this, basically, until we get the last > > tuple of the multi insert we can not clear the toast data otherwise we > > can never form a complete tuple. So the only possible fix I can think > > of is to consider the multi-insert WAL without the final multi-insert > > tuple as partial data then we will avoid streaming until we get the > > complete WAL of one multi-insert. I am working on the patch to fix > > this, I will share that in some time. > > The attached patch should fix the issue, now the output is like below > > Thanks. This looks fine to me. We should still be able to stream multi-insert transactions (COPY) as and when the copy buffer becomes full and is flushed. That seems to be a reasonable restriction to me. We should incorporate the regression test in the final patch. I am not entirely sure if what I have done is acceptable (or even works in all scenarios). We could possibly have a long list of tuples instead of doing the exponential magic. Or we should consider lowering the min value for logical_decoding_work_mem and run these tests with a much lower value. In fact, that's how I caught the problem in the first place. I had deliberately lowered the value to 1kB so that streaming code kicks in very often and even for small transactions. Thanks, Pavan -- Pavan Deolasee EnterpriseDB: https://www.enterprisedb..com
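For anyone skimming the thread, the rule being agreed on here could look roughly like the fragment below. The flag name and exact placement are assumptions of this sketch rather than the final patch, while `change->data.tp.clear_toast_afterwards` is the existing field mentioned earlier; it is set only on the final tuple of a multi-insert record.

```c
/*
 * Sketch only: while queuing decoded changes, mark/unmark the top-level
 * transaction as containing a partial change depending on whether this
 * insert is the last tuple of its xl_multi_insert record.  Streaming is
 * then skipped while the flag is set, so an incomplete multi-insert is
 * never streamed without its toast data.
 */
if (change->action == REORDER_BUFFER_CHANGE_INSERT &&
	!change->data.tp.clear_toast_afterwards)
	toptxn->txn_flags |= RBTXN_HAS_PARTIAL_CHANGE;
else
	toptxn->txn_flags &= ~RBTXN_HAS_PARTIAL_CHANGE;
```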
Re: Assertion failure while streaming toasted data
On Tue, May 25, 2021 at 6:43 PM Dilip Kumar wrote: > > I have also added a test case for this. > > Is that test good enough to trigger the original bug? In my experience, I had to add a lot more tuples before the logical_decoding_work_mem threshold was crossed and the streaming kicked in. I would suggest running the test without the fix and check if the assertion hits. If so, we are good to go. Thanks, Pavan -- Pavan Deolasee EnterpriseDB: https://www.enterprisedb..com