Shmem queue is not flushed if receiver is not yet attached

2022-03-17 Thread Pavan Deolasee
Hello,

While testing on the current PG master, I noticed a problem between
backends communicating over a shared memory queue. I think `shm_mq_sendv()`
fails to flush the queue when the receiver is not yet attached, even if
`force_flush` is set to true. This simple fix solves the problem for me.

On another note, `shm_mq.h` declares `shm_mq_flush()`, but I don't see it
being implemented anywhere. Maybe it's just a leftover from previous work?
Though it does seem like a useful API to implement.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com


0001-Flush-the-queue-even-if-receiver-has-not-attached.patch
Description: Binary data


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2021-01-17 Thread Pavan Deolasee
On Mon, Jan 18, 2021 at 3:02 AM Tomas Vondra 
wrote:

>
>
> Pushed. Thanks everyone for the effort put into this patch. The first
> version was sent in 2015, so it took quite a bit of time.
>
>
Thanks Tomas, Anastasia and everyone else who worked on the patch and
ensured that it got into the tree.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: some pointless HeapTupleHeaderIndicatesMovedPartitions calls

2021-01-17 Thread Pavan Deolasee
Hi Alvaro,

On Tue, Sep 29, 2020 at 10:14 PM Alvaro Herrera 
wrote:

> Hello
>
> Pavan Deolasee recently noted that a few of the
> HeapTupleHeaderIndicatesMovedPartitions calls added by commit
> 5db6df0c0117 are useless, since they are done after comparing t_self
> with t_ctid.  That's because t_self can never be set to the magical
> values that indicate that the tuple moved partition.  If the first
> test fails (so we know t_self equals t_ctid), necessarily the second
> test will also fail.
>
> So these checks can be removed and no harm is done.
>
>
The patch looks good to me. The existing coding pattern was a bit confusing
and that's how I noticed it. So +1 for fixing it.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-22 Thread Pavan Deolasee
On Fri, Mar 23, 2018 at 10:00 AM, Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:

> On 2018/03/23 3:42, Pavan Deolasee wrote:
> > A slightly improved version attached. Apart from doc cleanup based on
> > earlier feedback, fixed one assertion failure based on Rahila's report.
> > This was happening when target relation is referenced in the source
> > subquery. Fixed that and added a test case to test that situation.
> >
> > Rebased on current master.
>
> I tried these patches (applied 0002 on top of 0001).  When applying 0002,
> I got some apply errors:
>
> The next patch would create the file
> src/test/isolation/expected/merge-delete.out,
> which already exists!  Assume -R? [n]
>
> I managed to apply it by ignoring the errors, but couldn't get make check
> to pass; attached regressions.diffs if you want to take a look.
>


Thanks. Are you sure you're using a clean repo? I suspect you had a previous
version of the patch applied and hence the apply errors now. I also suspect
that you may have made a mistake while resolving the conflicts while
applying the patch (since a file at the same path already existed). The
failures also seem related to a past version of the patch.

I just checked with a freshly checked out repo and the patches apply
correctly on the current master, and the regression tests pass too.
http://commitfest.cputube.org/ also reported success overnight.



>
> Btw, is 0001 redundant with the latest patch on ON CONFLICT DO UPDATE
> thread?  Can I apply just 0002 on top of that patch?  So, I tried that --
> that is, skipped your 0001 and instead applied ON CONFLICT DO UPDATE
> patch, and then applied your 0002.


Yes.  I should probably rebase my patch on your v9 or just include the
relevant changes in the MERGE patch itself to avoid any dependency right
now. Will check.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-23 Thread Pavan Deolasee
On Fri, Mar 9, 2018 at 8:49 PM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2/1/18 19:21, Simon Riggs wrote:
> > If we really can't persuade you of that, it doesn't sink the patch. We
> > can have the WAL pointer itself - it wouldn't save space but it would
> > at least alleviate the spinlock.
>
> Do you want to send in an alternative patch that preserves the WAL
> pointer and only changes the locking?
>
>
Sorry for the delay. Here is an updated patch which now replaces xl_prev
with xl_curr, thus providing similar safeguards against corrupted or torn
WAL pages while still providing the benefits of atomic operations.

I repeated the same set of tests and the results are broadly similar. These
tests were done on a different AWS instance though, and hence are not
comparable to the previous tests. What we do in these tests is essentially
call ReserveXLogInsertLocation() 1M times to reserve 256 bytes each time,
in a single SELECT statement (the function is exported and a SQL-callable
routine is added for the purpose of the tests).

Patched:
==
#clients  #tps
1 34.195311
2 29.001584
4 31.712009
8 35.489272
16 41.950044

Master:
==
#clients  #tps
1 24.128004
2 12.326135
4 8.334143
8 16.035699
16 8.502794

So that's a pretty good improvement across the spectrum.
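
For concreteness, such a test could be driven with a pgbench custom script
along the following lines and run with different client counts (e.g.
pgbench -n -f reserve.sql -c 16). The name of the SQL-callable wrapper below
is hypothetical; the actual routine used for these numbers isn't shown in
this thread.

-- reserve.sql: hypothetical driver; the wrapper function name is an
-- assumption standing in for the exported test routine mentioned above.
SELECT test_reserve_xlog_insert_location(1000000, 256);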

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_reduce_wal_contention_v2.patch
Description: Binary data


Re: Faster inserts with mostly-monotonically increasing values

2018-03-23 Thread Pavan Deolasee
Hi Andrew and Claudio,

Thanks for the review!

On Fri, Mar 23, 2018 at 10:19 AM, Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

> On Fri, Mar 23, 2018 at 10:20 AM, Claudio Freire 
> wrote:
>
>
> This patch looks in pretty good shape. I have been trying hard to
> think of some failure mode but haven't been able to come up with one.
>
>
Great!


>
> > Some comments
> >
> > +/*
> > + * It's very common to have an index on an auto-incremented or
> > + * monotonically increasing value. In such cases, every insertion
> happens
> > + * towards the end of the index. We try to optimise that case by
> caching
> > + * the right-most block of the index. If our cached block is still
> the
> > + * rightmost block, has enough free space to accommodate a new
> entry and
> > + * the insertion key is greater or equal to the first key in this
> page,
> > + * then we can safely conclude that the new key will be inserted in
> the
> > + * cached block. So we simply search within the cached page and
> insert the
> > + * key at the appropriate location. We call it a fastpath.
> >
> > It should say "the insertion key is strictly greater than the first key"
>
>
Good catch. Fixed.


>
> >
> > Also, "rightmost block" != "rightmost leaf page" ("leaf" being the key
> > difference). So it should say "rightmost leaf page".
>
>
> right.
>
>
Fixed.


> [...]
> >
> > Setting "offset = InvalidOffsetNumber" in that contorted way is
> > unnecessary. You can remove the first assignment and instead
> > initialize unconditionally right after the fastpath block (that
> > doesn't make use of offset anyway):
>
>
> Yes, I noticed that and it's confusing. Just set it at the top.
>
>
Good idea. Changed that way.


> >
> > Having costs in explain tests can be fragile. Better use "explain
> > (costs off)". If you run "make check" continuously in a loop, you
> > should get some failures related to that pretty quickly.
> >
>
> Agree about costs off, but I'm fairly dubious of the value of using
> EXPLAIN at all here.Nothing in the patch should result in any change
> in EXPLAIN output.
>

I agree. I initially added EXPLAIN to ensure that we're doing index-only
scans. But you're correct, we don't need them in the tests themselves.


>
> I would probably just have a few regression lines that should be sure
> to exercise the code path and leave it at that.
>
>
I changed the regression tests to include a few more scenarios, basically
using multi-column indexes in different ways and then querying rows with
different orderings. I did not take away the vacuum, and I believe it will
actually help the tests by introducing some fuzziness: if the vacuum does
not do its job, we might execute a different plan and still ensure that the
output remains unchanged.
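
To give a flavour of it, the new scenarios are along these lines (table and
index names here are illustrative, not necessarily the ones in the patch):

CREATE TABLE fastpath_test (a int, b int, c text);
CREATE INDEX fastpath_test_i ON fastpath_test (a, b);

-- monotonically increasing keys, so inserts keep landing on the rightmost
-- leaf page and should exercise the cached-block fastpath
INSERT INTO fastpath_test
  SELECT g, g % 10, 'val ' || g FROM generate_series(1, 10000) g;

VACUUM fastpath_test;

-- read the rows back with different orderings
SELECT a, b FROM fastpath_test ORDER BY a DESC, b DESC LIMIT 5;
SELECT a, b FROM fastpath_test ORDER BY a, b LIMIT 5;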

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_btree_target_block_v4.patch
Description: Binary data


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-23 Thread Pavan Deolasee
On Fri, Mar 23, 2018 at 6:45 AM, Peter Geoghegan  wrote:

> On Thu, Mar 22, 2018 at 6:02 PM, Alvaro Herrera 
> wrote:
> > Incremental development is a good thing.  Trying to do everything in a
> > single commit is great when time is infinite or even merely very long,
> > but if you run out of it, which I'm sure is common, leaving some things
> > out that can be reasonable implemented in a separate patch is perfectly
> > acceptable.
>
> We're talking about something that took me less than an hour to get
> working. AFAICT, it's just a matter of tweaking the grammar, and
> adding a bit of transformWithClause() boilerplate to the start of
> transformMergeStmt().
>>
>>
>> I quickly implemented CTE support myself (not wCTE support, since
>> MERGE doesn't use RETURNING), and it wasn't tricky. It seems to work
>> when I mechanically duplicate the approach taken with other types of
>> DML statement in the parser. I have written a few tests, and so far it
>> holds up.
>>
>
> Ok, thanks. I started doing something similar, but great if you have
> already implemented. I will focus on other things for now.
>
>

I am sorry. I was under the impression that you were actually writing this
piece of code and hence did not pay much attention until now. I should have
confirmed with you instead of assuming. I think it's a bit too late now,
but I will give it a fair try tomorrow. I don't want to spend too much time
on it though, given how close we are to the deadline. As Alvaro said, we can
always revisit this for pg12.


> As I've pointed out on this thread already, I'm often concerned about
> supporting functionality like this because it increases my overall
> confidence in the design. If it was genuinely hard to add WITH clause
> support, then that would probably tell us something about the overall
> design that likely creates problems elsewhere. It's easy to say that
> it isn't worth holding the patch up for WITH clause support, because
> that's true, but it's also beside the point.
>

Understood.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-23 Thread Pavan Deolasee
On Fri, Mar 23, 2018 at 4:43 AM, Peter Geoghegan  wrote:

> On Thu, Mar 22, 2018 at 11:42 AM, Pavan Deolasee
>  wrote:
> > A slightly improved version attached.
>
> You still need to remove this change:
>
> > diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
> > index a4574cd533..dbfb5d2a1a 100644
> > --- a/src/include/miscadmin.h
> > +++ b/src/include/miscadmin.h
> > @@ -444,5 +444,6 @@ extern bool has_rolreplication(Oid roleid);
> >  /* in access/transam/xlog.c */
> >  extern bool BackupInProgress(void);
> >  extern void CancelBackup(void);
> > +extern int64 GetXactWALBytes(void);
>

Sigh. Fixed in the recent version.


>
> I see that we're still using two target RTEs in this latest revision,
> v23e -- the new approach to partitioning, which I haven't had time to
> study in more detail, has not produced a change there.


Yes, we continue to use two RTEs because I don't have any brighter idea
than that to handle the case of a partitioned table and a right outer join.
As I explained some time back, this is necessary to ensure that we don't
produce duplicate rows when a partition is joined with the source and then
a second partition is joined again with the source.

Now I don't know if we can run a join query and still have a single RTE,
but that looks improbable and wrong.


> This causes
> weird effects, such as the following:
>
> """
> pg@~[20658]=# create table foo(bar int4);
> CREATE TABLE
> pg@~[20658]=# merge into foo f using (select 1 col) dd on f.bar=dd.col
> when matched then update set bar = f.barr + 1;
> ERROR:  column f.barr does not exist
> LINE 1: ...n f.bar=dd.col when matched then update set bar = f.barr + 1...
>  ^
> HINT:  Perhaps you meant to reference the column "f.bar" or the column
> "f.bar".
>
> """
>
> While I think this this particular HINT buglet is pretty harmless, I
> continue to be concerned about the unintended consequences of having
> multiple RTEs for MERGE's target table. Each RTE comes from a
> different lookup path -- the first one goes through setTargetTable()'s
> parserOpenTable() + addRangeTableEntryForRelation() calls. The second
> one goes through transformFromClauseItem(), for the join, which
> ultimately ends up calling transformTableEntry()/addRangeTableEntry().
>


How is it different from running an INSERT query with the target table again
specified in a subquery producing the rows to be inserted? For example,

postgres=# insert into target as t select sid from source s join target t
on t.ttid = s.sid;
ERROR:  column t.ttid does not exist
LINE 1: ...rget as t select sid from source join target t on t.ttid = s...
 ^
HINT:  Perhaps you meant to reference the column "t.tid" or the column
"t.tid".
postgres=#

This produces a very similar looking HINT to your test above. I am certain
that the "target" table gets two RTEs, exactly via the same code paths as you
discussed above. So if this is not a problem for INSERT, why would it be a
problem for MERGE? Maybe I am missing a point here.



> Consider commit 5f173040, which fixed a privilege escalation security
> bug around multiple name lookup. Could the approach taken by MERGE
> here introduce a similar security issue?
>
> Using GDB, I see two calls to RangeVarGetRelidExtended() when a simple
> MERGE is executed. They both have identical relation arguments, that
> look like this:
>
> (gdb) p *relation
> $4 = {
>   type = T_RangeVar,
>   catalogname = 0x0,
>   schemaname = 0x0,
>   relname = 0x5600ebdcafb0 "foo",
>   inh = 1 '\001',
>   relpersistence = 112 'p',
>   alias = 0x5600ebdcb048,
>   location = 11
> }
>
> This seems like something that needs to be explained, at a minimum.
> Even if I'm completely wrong about there being a security hazard,
> maybe the suggestion that there might be still gives you some idea of
> what I mean about unintended consequences.
>

Ok. I will try to explain it better and also think about the security
hazards.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-23 Thread Pavan Deolasee
On Sat, Mar 24, 2018 at 1:36 AM, Peter Geoghegan  wrote:

>
> Fair enough. Attached patch shows what I'm on about. This should be
> applied on top of 0001_merge_v23e_onconflict_work.patch +
> 0002_merge_v23e_main.patch. I'm not expecting an authorship credit for
> posting this patch.
>

Thanks for the patch. I will study and integrate this into the main patch.


>
> One thing that the test output shows that is interesting is that there
> is never a "SubPlan 1" or "InitPlan 1" in EXPLAIN output -- it seems
> to always start at "SubPlan 2". This probably has nothing to do with
> CTEs in particular. I didn't notice this before now, although there
> were no existing tests of EXPLAIN in the patch that show subplans or
> initplans.
>

This query, for example, correctly starts at InitPlan 1:

postgres=# EXPLAIN MERGE INTO m USING (SELECT 1 a, 'val' b) s ON m.k = s.a
WHEN NOT MATCHED THEN INSERT VALUES ((select count(*) from pg_class), s.b);
   QUERY PLAN
-
 Merge on m  (cost=16.30..43.83 rows=6 width=106)
   InitPlan 1 (returns $0)
 ->  Aggregate  (cost=16.26..16.27 rows=1 width=8)
   ->  Seq Scan on pg_class  (cost=0.00..15.41 rows=341 width=0)
   ->  Hash Right Join  (cost=0.03..27.55 rows=6 width=106)
 Hash Cond: (m_1.k = s.a)
 ->  Seq Scan on m m_1  (cost=0.00..22.70 rows=1270 width=14)
 ->  Hash  (cost=0.02..0.02 rows=1 width=96)
   ->  Subquery Scan on s  (cost=0.00..0.02 rows=1 width=96)
 ->  Result  (cost=0.00..0.01 rows=1 width=36)
(10 rows)



>
> Is this somehow related to the issue of using two RTEs for the target
> relation? That's certainly why we always see unaliased target table
> "m" with the alias "m_1" in EXPLAIN output, so I would not be
> surprised if it caused another EXPLAIN issue.
>

I don't think it's related to using two RTEs. The following EXPLAIN for a
regular UPDATE query also shows a SubPlan starting at 2. I think it's just
to do with how the planner assigns the plan_id.

postgres=# EXPLAIN WITH cte_basic AS (SELECT 1 a, 'cte_basic val' b) UPDATE
m SET v = (SELECT b || ' merge update' FROM cte_basic WHERE cte_basic.a =
m.k LIMIT 1) ;
  QUERY PLAN

--
 Update on m  (cost=0.01..54.46 rows=1270 width=42)
   CTE cte_basic
 ->  Result  (cost=0.00..0.01 rows=1 width=36)
   ->  Seq Scan on m  (cost=0.00..54.45 rows=1270 width=42)
 SubPlan 2
   ->  Limit  (cost=0.00..0.02 rows=1 width=32)
 ->  CTE Scan on cte_basic  (cost=0.00..0.02 rows=1
width=32)
   Filter: (a = m.k)
(8 rows)

A quick gdb trace shows that the CTE itself is assigned plan_id 1 and the
SubPlan then gets plan_id 2. I can investigate further, but given that we
see similar behaviour with a regular UPDATE, I don't think it's worth it.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-24 Thread Pavan Deolasee
On Sat, Mar 24, 2018 at 5:28 PM, Robert Haas  wrote:

>
> This is a patch about which multiple people have expressed concerns.
> You're trying to jam a heavily redesigned version in at the last
> minute without adequate review.  Please don't do that.
>
>
TBH this is not a heavily redesigned version. There were two parts to the
original patch:

1. Replacing the 8-byte xl_prev with a 2-byte xl_walid in order to reduce the
size of the WAL record header.
2. Changing XLogCtlInsert.CurrBytePos to use atomic operations in order to
reduce the spinlock contention.

Most people expressed concerns regarding 1, but there were none regarding
2. Now it's possible that all the attention got diverted to 1 and nobody
really studied 2 in detail. Or maybe 2 is mostly non-contentious, given
the results of the micro-benchmarks.

So what I've done in v2 is to deal with just part 2, i.e. replace access to
CurrBytePos with atomic operations, based on the following suggestion by Simon:

On 2/1/18 19:21, Simon Riggs wrote:
> If we really can't persuade you of that, it doesn't sink the patch. We
> can have the WAL pointer itself - it wouldn't save space but it would
> at least alleviate the spinlock.

This gives us the same level of guarantee that xl_prev used to offer, yet
helps us use atomic operations. I'll be happy if we can look at that
particular change and see if there are any holes in it.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: Exclude unlogged tables from base backups

2018-03-26 Thread Pavan Deolasee
On Fri, Mar 23, 2018 at 9:51 PM, David Steele  wrote:

> On 3/23/18 12:14 PM, Teodor Sigaev wrote:
> >
> > Thank you, pushed
>
>
Is it just me, or is the newly added test in 010_pg_basebackup.pl failing for
others too?

#   Failed test 'unlogged main fork not in backup'
#   at t/010_pg_basebackup.pl line 112.
t/010_pg_basebackup.pl ... 86/87 # Looks like you failed 1 test of 87.

I manually ran pg_basebackup and it correctly excludes the main fork of an
unlogged table from the backup, but it consistently copies the main fork
while running "make check" and thus fails the test.
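
For anyone trying to reproduce the manual check, it was essentially of this
form (object names here are mine, not the test's); the file reported by
pg_relation_filepath() is the main fork that a base backup should now skip,
while the matching _init fork should still be copied:

CREATE UNLOGGED TABLE unlogged_test (a int);
INSERT INTO unlogged_test VALUES (1);

-- path of the main fork that should be excluded from the base backup
SELECT pg_relation_filepath('unlogged_test');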

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-03-26 Thread Pavan Deolasee
On Sun, Mar 25, 2018 at 6:00 AM, Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

> On Fri, Mar 23, 2018 at 8:27 PM, Pavan Deolasee
>  wrote:
> >>
> >>
> >> I would probably just have a few regression lines that should be sure
> >> to exercise the code path and leave it at that.
> >>
> >
> > I changed the regression tests to include a few more scenarios, basically
> > using multi-column indexes in different ways and then querying rows by
> > ordering rows in different ways. I did not take away the vacuum and I
> > believe it will actually help the tests by introducing some fuzziness in
> the
> > tests i.e. if the vacuum does not do its job, we might execute a
> different
> > plan and ensure that the output remains unchanged.
> >
>
>
> If we're going to keep the vacuums in there, do we need to add a wait
> barrier like Claudio suggested upthread?
>
>
I don't think we need the wait barrier since we're no longer printing the
explain plan. In the worst case, the vacuum may not get to set pages
all-visible and the planner may then choose something other than an
index-only scan, but I guess that fuzziness actually helps the regression
tests: that way we get confirmation regarding the final result irrespective
of the plan chosen.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: Exclude unlogged tables from base backups

2018-03-26 Thread Pavan Deolasee
On Mon, Mar 26, 2018 at 1:03 PM, Pavan Deolasee 
wrote:

> On Fri, Mar 23, 2018 at 9:51 PM, David Steele  wrote:
>
>> On 3/23/18 12:14 PM, Teodor Sigaev wrote:
>> >
>> > Thank you, pushed
>>
>>
> Is it just me or the newly added test in 010_pg_basebackup.pl failing for
> others too?
>
> #   Failed test 'unlogged main fork not in backup'
> #   at t/010_pg_basebackup.pl line 112.
> t/010_pg_basebackup.pl ... 86/87 # Looks like you failed 1 test of 87.
>
>
This one-liner patch fixes it for me.

Thanks,
Pavan



-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001_pg_basebackup_fix.patch
Description: Binary data


Re: PATCH: Exclude unlogged tables from base backups

2018-03-26 Thread Pavan Deolasee
On Mon, Mar 26, 2018 at 5:16 PM, Masahiko Sawada 
wrote:

> On Mon, Mar 26, 2018 at 4:52 PM, Pavan Deolasee
>
> >>
> >
> > This one-liner patch fixes it for me.
> >
>
> Isn't this issue already fixed by commit
> d0c0c894533f906b13b79813f02b2982ac675074?


Ah, right. Thanks for pointing that out, and sorry for the noise.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-27 Thread Pavan Deolasee
On Tue, Mar 27, 2018 at 1:54 PM, Simon Riggs  wrote:

> On 26 March 2018 at 17:06, Simon Riggs  wrote:
> > On 26 March 2018 at 15:39, Pavan Deolasee 
> wrote:
> >
>
> > That's all I can see so far.
>
> * change comment “once to” to “once” in src/include/nodes/execnodes.h
> * change comment “and to run” to “and once to run”
> * change “result relation” to “target relation”
>

Fixed all of that in the v26 patch set I just sent.


>
> * XXX we probably need to check plan output for CMD_MERGE also
>

Yeah. Added those checks for MERGE action's target lists in v26.


>
> * Spurious line feed in src/backend/optimizer/prep/preptlist.c
>

Couldn't spot it. Will look closer, but any hint would be appreciated.


>
> * No need to remove whitespace in src/backend/optimizer/util/relnode.c
>

Fixed in v26.


>
> * README should note that the TABLEOID junk column is not strictly
> needed when joining to a non-partitioned table but we don't try to
> optimize that away. Is that an XXX to fix in future or do we just
> think the extra 4 bytes won't make much difference so we leave it?
>

I actually took the opportunity to fetch tableoid conditionally, only if we
are dealing with a partitioned table.


>
> * Comment in rewriteTargetListMerge() should mention TABLEOID exists
> to allow us to find the correct relation, not the correct row, comment
> just copied from CTID above it.
>
>
Fixed in v26.

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] A design for amcheck heapam verification

2018-03-27 Thread Pavan Deolasee
On Tue, Mar 27, 2018 at 8:50 AM, Peter Geoghegan  wrote:

> On Fri, Mar 23, 2018 at 7:13 AM, Andrey Borodin 
> wrote:
> > I've just flipped patch to WoA. But if above issues will be fixed I
> think that patch is ready for committer.
>
> Attached is v7, which has the small tweaks that you suggested.
>
> Thank you for the review. I hope that this can be committed shortly.
>
>
Sorry for coming in a bit late on this thread, but I started looking at the
0002 patch.

  *
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure.  A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *

Is that correct? Aren't we actually verifying the presence in the index of
all heap tuples?

@@ -116,37 +140,47 @@ static inline bool
invariant_leq_nontarget_offset(BtreeCheckState *state,
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber
blocknum);

 /*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)

Can we come up with a better name for heapallindexed? Maybe
"check_heapindexed"?

+
+ /*
+ * Register our own snapshot in !readonly case, rather than asking
+ * IndexBuildHeapScan() to do this for us later.  This needs to happen
+ * before index fingerprinting begins, so we can later be certain that
+ * index fingerprinting should have reached all tuples returned by
+ * IndexBuildHeapScan().
+ */
+ if (!state->readonly)
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ }
+

So this looks safe. In !readonly mode, we take the snapshot *before*
fingerprinting the index. Since we're using an MVCC snapshot, any tuple which
is visible to the heap scan must be reachable via an index scan too. So we
must find the index entry for every heap tuple returned by the scan.

What I am not sure about is whether we can really examine an index which is
valid but whose indcheckxmin hasn't yet crossed our xmin horizon. Notice that
this amcheck might be running in a transaction block, probably at the
repeatable-read isolation level, and hence GetTransactionSnapshot() might
actually return a snapshot which can't yet read the index consistently. In
practice this is quite unlikely, but I think we should check for that case if
we agree that it could be a problem.
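
That is, a scenario of roughly this shape (table and index names illustrative):

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM some_table;               -- transaction snapshot taken here
-- idx could have become usable (with indcheckxmin set) only after that snapshot
SELECT bt_index_check('idx'::regclass, true);
COMMIT;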

The case with readonly mode is also interesting. Since we're scanning the heap
with SnapshotAny, the heap scan may return tuples which are RECENTLY_DEAD. So
the question is: can this happen?

- some concurrent index scan sees a heaptuple as DEAD and marks the index
  entry as LP_DEAD
- our index fingerprinting sees index tuple as LP_DEAD
- our heap scan still sees the heaptuple as RECENTLY_DEAD

Now that looks improbable, given that we compute OldestXmin after our index
fingerprinting is done, i.e. between steps 2 and 3; hence if a tuple looked
DEAD to some OldestXmin/RecentGlobalXmin computed before we computed our
OldestXmin, then surely our OldestXmin should also see the tuple as DEAD. Or
is there a corner case that we're missing?

Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
tuples, especially if those tuples were inserted/deleted by our own
transaction? It's probably worth thinking about.

Apart from that, comments in IndexBuildHeapRangeScan() claim that the function
is called only with ShareLock and above, which is no longer true. We should
check whether that has any side effects. I can't think of any, but it's better
to verify and update the comments to reflect the new reality.

The partial indexes look fine since the non-interesting tuples never get
called
back.

One thing worth documenting/noting is the fact that a !readonly check will run
with a long-duration registered snapshot, thus holding OldestXmin back. Is
there anything we can do to lessen that burden, like telling other backends to
ignore our xmin while computing OldestXmin (as vacuum does)?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-27 Thread Pavan Deolasee
On Tue, Mar 27, 2018 at 7:28 PM, Tom Lane  wrote:

>
>
> I had not noticed that in the kerfuffle over the more extreme version,
> but I think this is a different route to breaking the same guarantees.
> The point of the xl_prev links is, essentially, to allow verification
> that we are reading a coherent series of WAL records, ie that record B
> which appears to follow record A actually is supposed to follow it.
> This throws away that cross-check just as effectively as the other patch
> did, leaving us reliant solely on the per-record CRCs.  CRCs are good,
> but they do have a false-positive rate.
>
> An example of a case where xl_prev makes one feel a lot more comfortable
> is the WAL-segment-switch option.  The fact that the first record of the
> next WAL file links back to the XLOG_SWITCH record is evidence that
> ignoring multiple megabytes of WAL was indeed the intended thing to do.
> An xl_curr field cannot provide any evidence as to whether WAL records
> were missed.
>

TBH I still don't see why this does not provide the same guarantee that the
current code provides, but given the concerns expressed by others, I am not
going to pursue it beyond a point. But one last time :-)

The current code uses xl_prev to cross-verify that record B, read after
record A, indeed follows A and has a valid back-link to A. This deals with
problems where B might actually be an old WAL record, carried over from a
stale WAL file.

Now if we store xl_curr, we can definitely guarantee that B is ahead of A
because B->xl_curr will be greater than A->xl_curr (otherwise we bail out).
So that deals with the problem of stale WAL records. In addition, we also
know where A ends (we can deduce that even for XLOG_SWITCH records knowing
where the next record will start after the switch) and hence we know where
B should start. So we read at B and also confirm that B->xl_curr matches
B's position. If it does not, we declare end-of-WAL and bail out. So where
is the problem?


> > 2. Does the new logic in pg_rewind to search backward for a checkpoint
> > work reliably, and will it be slow?
>
> If you have to search backwards, this breaks it.  Full stop.
>

We don't really need to fetch the previous record. We really need to find
the last checkpoint prior to a given LSN. That can be done by reading WAL
segments forward. It can be a little slow, but hopefully not a whole lot.

A <- B <- C <- CHKPT <- D <- E <- F <- G

So today, if we want to find the last checkpoint prior to G, we go through the
back-links until we find the first checkpoint record. In the proposed code,
we read the current WAL segment forward, remember the last CHKPT record
seen, and once we see G, we know we have found the prior checkpoint. If no
checkpoint exists in the current WAL segment, we read the previous WAL segment
forward and return the last checkpoint record in that segment, and so on. So
in the worst case, we might read one extra WAL segment before we find the
checkpoint record. That's not ideal, but not too bad given that only pg_rewind
needs this, and that too only once.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-27 Thread Pavan Deolasee
On Wed, Mar 28, 2018 at 7:36 AM, Michael Paquier 
wrote:

> On Tue, Mar 27, 2018 at 02:02:10PM -0400, Tom Lane wrote:
>
> >
> > Well, the point of checkpoints is that WAL data before the last one
> > should no longer matter anymore, isn't it?
>
> I have to agree with Tom here.  If you force pg_rewind to replay more
> WAL records from a checkpoint older than the checkpoint prior to where
> WAL has forked at promotion then you have a risk of losing data.
>

Yeah, we should not do that. The patch surely does not intend to replay any
more WAL than what we do today. The idea is to just use a different
mechanism to find the prior checkpoint. But we should surely find the
latest prior checkpoint. In the rare scenario that Tom showed, we should
just throw an error and fix the patch if it's not doing that already.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-27 Thread Pavan Deolasee
On Wed, Mar 28, 2018 at 8:28 AM, Peter Geoghegan  wrote:

> On Tue, Mar 27, 2018 at 2:28 AM, Pavan Deolasee
>  wrote:
> > (Version 26)
>
> I have some feedback on this version:
>
> * ExecMergeMatched() needs to determine tuple lock mode for
> EvalPlanQual() in a way that's based on how everything else works;
> it's not okay to just use LockTupleExclusive in all cases. That will
> often lead to lock escalation, which can cause unprincipled deadlocks.
> You need to pass back the relevant info from routines like
> heap_update(), which means more state needs to come back to
> ExecMergeMatched() from routines like ExecUpdate().
>

You're right. I am wondering what would be a good way to pass that
information back. Should we add a new out parameter to ExecUpdate() or a
new member to HeapUpdateFailureData? It seems only ExecUpdate() would need
the change, so maybe it's fine to change the API, but HeapUpdateFailureData
doesn't look bad either, since it deals with failure cases and we are indeed
dealing with an ExecUpdate() failure. Any preference?


>
> * Doesn't ExecUpdateLockMode(), which is called from places like
> ExecBRUpdateTriggers(), also need to be taught about
> GetEPQRangeTableIndex() (i.e. the ri_mergeTargetRTI/ri_RangeTableIndex
> divide)? You should audit everything like that carefully. Maybe
> GetEPQRangeTableIndex() is not the best choke-point to do this kind of
> thing. Not that I have a clearly better idea.
>
> * Looks like there is a similar problem in
> ExecPartitionCheckEmitError(). I don't really understand how that
> works, so I might be wrong here.
>
> * More or less the same issue seems to exist within ExecConstraints(),
> including where GetInsertedColumns() is used.
>

They all look fine to me. Remember that we always resolve column references
in the WHEN targetlist from the target relation, and hence things like
updatedCols/insertedCols get set on the target RTE. All of these routines
read from the target RTE as far as I can see. But I will check in more detail.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Pavan Deolasee
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan  wrote:

> On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee
>  wrote:
> > + * When index-to-heap verification is requested, a Bloom filter is used
> to
> > + * fingerprint all tuples in the target index, as the index is
> traversed to
> > + * verify its structure.  A heap scan later verifies the presence in the
> > heap
> > + * of all index tuples fingerprinted within the Bloom filter.
> > + *
> >
> > Is that correct? Aren't we actually verifying the presence in the index
> of
> > all
> > heap tuples?
>
> I think that you could describe it either way. We're verifying the
> presence of heap tuples in the heap that ought to have been in the
> index (that is, most of those that were fingerprinted).
>

Hmm Ok. It still sounds backwards to me, but then English is not my first
or even second language. "A heap scan later verifies the presence in the
heap of all index tuples fingerprinted" sounds as if we expect to find all
fingerprinted index tuples in the heap. But in reality, we check if all
heap tuples have a fingerprinted index tuple.


>
> You're right - there is a narrow window for REPEATABLE READ and
> SERIALIZABLE transactions. This is a regression in v6, the version
> removed the TransactionXmin test.
>
> I am tempted to fix this by calling GetLatestSnapshot() instead of
> GetTransactionSnapshot(). However, that has a problem of its own -- it
> won't work in parallel mode, and we're dealing with parallel
> restricted functions, not parallel unsafe functions. I don't want to
> change that to fix such a narrow issue. IMV, a better fix is to treat
> this as a serialization failure. Attached patch, which applies on top
> of v7, shows what I mean.
>

Ok. I am happy with that.


>
> I think that this bug is practically impossible to hit, because we use
> the xmin from the pg_index tuple during "is index safe to use?"
> indcheckxmin/TransactionXmin checks (the patch that I attached adds a
> similar though distinct check), which raises another question for a
> REPEATABLE READ xact. That question is: How is a REPEATABLE READ
> transaction supposed to see the pg_index entry to get the index
> relation's oid to call a verification function in the first place?


Well, the pg_index entry will be visible, and should be visible; otherwise how
would you even maintain a newly created index? I am not sure, but I guess we
take fresh MVCC snapshots for catalog searches, even in RR transactions.


> My
> point is that there is no need for a more complicated solution than
> what I propose.
>

I agree on that.


>
> I don't think so. The way we compute OldestXmin for
> IndexBuildHeapRangeScan() is rather like a snapshot acquisition --
> GetOldestXmin() locks the proc array in shared mode. As you pointed
> out, the fact that it comes after everything else (fingerprinting)
> means that it must be equal to or later than what index scans saw,
> that allowed them to do the kill_prior_tuple() stuff (set LP_DEAD
> bits).
>
>
That's probably true.


>
> > Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_
> PROGRESS
> > tuples, especially if those tuples were inserted/deleted by our own
> > transaction? It probably worth thinking.
>

Anything here that you would like to check? I understand that you may see
such tuples only for catalog tables.


>
> IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
> ShareUpdateExclusiveLock (it just says SharedLock), because that lock
> strength doesn't have anything to do with IndexBuildHeapRangeScan()
> when it operates with an MVCC snapshot. I think that this means that
> this patch doesn't need to update comments within
> IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
> independent.
>

Ok, I agree. But note that we are now invoking that code with AccessShareLock,
whereas the existing callers use either ShareLock or ShareUpdateExclusiveLock.
That probably does not matter, but it's a change worth noting.


>
> > Is
> > there anything we can do to lessen that burden like telling other
> backends
> > to
> > ignore our xmin while computing OldestXmin (like vacuum does)?
>
> I don't think so. The transaction involved is still an ordinary user
> transaction.
>

While that's true, I am wondering if there is anything we can do to stop a
consistency-checking utility from creating other non-trivial side effects on
the state of the database, even if that means we cannot check all heap tuples.
For example, can there be a way by which we allow concurrent vacuum or HOT
prune to continue to prune away dead tuples, eve

Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Pavan Deolasee
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan  wrote:

>
> I don't think so. The transaction involved is still an ordinary user
> transaction.
>
>
Mostly a nitpick, but I guess we should leave a comment after the
IndexBuildHeapScan() call saying that heap_endscan() is not necessary, since
IndexBuildHeapScan() does that internally. I stumbled upon that while looking
for any potential leaks. I know at least one other caller of
IndexBuildHeapScan() doesn't bother to say anything either, but it would be
helpful.

FWIW I also looked at the 0001 patch and it looks fine to me.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-28 Thread Pavan Deolasee
On Tue, Mar 27, 2018 at 5:00 PM, Simon Riggs  wrote:

>
> In terms of further performance optimization, if there is just one
> WHEN AND condition and no unconditional WHEN clauses then we can add
> the WHEN AND easily to the join query.
>
> That seems like an easy thing to do for PG11
>
>
I think we need to be careful in terms of what can be pushed down to the
join in the presence of WHEN NOT MATCHED actions. If we push the WHEN AND
qual to the join, then I am worried that some rows which should have been
reported as "matched" and later filtered out by the WHEN quals will instead
get reported as "not matched", thus triggering the WHEN NOT MATCHED action.
For example,

postgres=# select * from target ;
 a | b
---+
 1 | 10
 2 | 20
(2 rows)

postgres=# select * from source ;
 a |  b
---+-
 2 | 200
 3 | 300
(2 rows)


postgres=# BEGIN;
BEGIN
postgres=# EXPLAIN ANALYZE MERGE INTO target t USING source s ON t.a = s.a
WHEN MATCHED AND t.a < 2 THEN UPDATE SET b = s.b WHEN NOT MATCHED THEN
INSERT VALUES (s.a, -1);
QUERY PLAN

---
 Merge on target t  (cost=317.01..711.38 rows=25538 width=46) (actual
time=0.104..0.104 rows=0 loops=1)
*   Tuples Inserted: 1*
   Tuples Updated: 0
   Tuples Deleted: 0
*   Tuples Skipped: 1*
   ->  Merge Left Join  (cost=317.01..711.38 rows=25538 width=46) (actual
time=0.071..0.074 rows=2 loops=1)
 Merge Cond: (s.a = t_1.a)
 ->  Sort  (cost=158.51..164.16 rows=2260 width=40) (actual
time=0.042..0.043 rows=2 loops=1)
   Sort Key: s.a
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on source s  (cost=0.00..32.60 rows=2260
width=40) (actual time=0.027..0.031 rows=2 loops=1)
 ->  Sort  (cost=158.51..164.16 rows=2260 width=10) (actual
time=0.019..0.020 rows=2 loops=1)
   Sort Key: t_1.a
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on target t_1  (cost=0.00..32.60 rows=2260
width=10) (actual time=0.012..0.014 rows=2 loops=1)
 Planning Time: 0.207 ms
 Execution Time: 0.199 ms
(17 rows)


postgres=# abort;
ROLLBACK


postgres=# BEGIN;
BEGIN
postgres=# EXPLAIN ANALYZE MERGE INTO target t USING source s ON t.a = s.a
AND t.a < 2 WHEN MATCHED THEN UPDATE SET b = s.b WHEN NOT MATCHED THEN
INSERT VALUES (s.a, -1);
QUERY PLAN

--
 Merge on target t  (cost=232.74..364.14 rows=8509 width=46) (actual
time=0.128..0.128 rows=0 loops=1)
*   Tuples Inserted: 2*
   Tuples Updated: 0
   Tuples Deleted: 0
   Tuples Skipped: 0
   ->  Merge Right Join  (cost=232.74..364.14 rows=8509 width=46) (actual
time=0.070..0.072 rows=2 loops=1)
 Merge Cond: (t_1.a = s.a)
 ->  Sort  (cost=74.23..76.11 rows=753 width=10) (actual
time=0.038..0.039 rows=1 loops=1)
   Sort Key: t_1.a
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on target t_1  (cost=0.00..38.25 rows=753
width=10) (actual time=0.026..0.028 rows=1 loops=1)
 Filter: (a < 2)
 Rows Removed by Filter: 1
 ->  Sort  (cost=158.51..164.16 rows=2260 width=40) (actual
time=0.024..0.025 rows=2 loops=1)
   Sort Key: s.a
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on source s  (cost=0.00..32.60 rows=2260
width=40) (actual time=0.014..0.017 rows=2 loops=1)
 Planning Time: 0.218 ms
 Execution Time: 0.234 ms
(19 rows)

postgres=# abort;
ROLLBACK


If you look at the first MERGE statement, we filter one matched source row
(2,200) using (t.a < 2) and do not take any action for that row. This
filtering happens after the RIGHT JOIN has reported it as "matched". But if
we push down the qual to the join, then the join will see that the source
row has no match and hence send that row for NOT MATCHED processing, thus
inserting it into the table again.

I am not saying there is no scope for improvement. But we need to be
careful about what can be pushed down to the join and what must be applied
after the join.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-02 Thread Pavan Deolasee
On Tue, Apr 3, 2018 at 8:31 AM, Robert Haas  wrote:

> On Mon, Apr 2, 2018 at 10:40 PM, Peter Geoghegan  wrote:
> > On Mon, Apr 2, 2018 at 7:18 PM, Andres Freund 
> wrote:
> >> I did a scan through this, as I hadn't been able to keep with the thread
> >> previously. Sorry if some of the things mentioned here have been
> >> discussed previously. I am just reading through the patch in its own
> >> order, so please excuse if there's things I remark on that only later
> >> fully make sense.
> >>
> >>
> >> later update: TL;DR: I don't think the parser / executor implementation
> >> of MERGE is architecturally sound.  I think creating hidden joins during
> >> parse-analysis to implement MERGE is a seriously bad idea and it needs
> >> to be replaced by a different executor structure.
> >
> > +1. I continue to have significant misgivings about this. It has many
> > consequences that we know about, and likely quite a few more that we
> > don't.
>
> +1.  I didn't understand from Peter's earlier comments that we were
> doing that, and I agree that it isn't a good design choice.
>
>
Honestly, I don't think Peter ever raised concerns about the join, though I
could be missing early discussions when I wasn't paying attention. It's been
there from day 1. Peter raised concerns about the two-RTE stuff, which was
necessitated when we added support for partitioned tables. We discussed that
at some length, with your inputs, and agreed that it's not necessarily a bad
thing and probably the only way to deal with partitioned tables.

Personally, I don't see why an internal join is bad. That's what MERGE is
doing anyway, so it closely matches the overall procedure.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Missing parse_merge.h?

2018-04-03 Thread Pavan Deolasee
On Tue, Apr 3, 2018 at 2:37 PM, Satoshi Nagayasu  wrote:

> Hi,
>
> I just got a compile error as below on the latest HEAD.
> ---
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -fexcess-precision=standard -g -O2 -I. -I.
> -I../../../src/include  -D_GNU_SOURCE   -c -o analyze.o analyze.c
>
> analyze.c:41:32: fatal error: parser/parse_merge.h: No such file or
> directory
>
> #include "parser/parse_merge.h"
>
>^
>
> compilation terminated.
>
> make[3]: *** [analyze.o] Error 1
> ---
>
> The latest commit [1] looks missing parse_merge.h.
> Or did I make some mistake on building?
>
>
Looks like Simon forgot to add new files in that commit. I am trying to get
in touch with him so that he can add the missing files and correct the
situation. Sorry for the trouble.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pgsql: New files for MERGE

2018-04-04 Thread Pavan Deolasee
On Wed, Apr 4, 2018 at 10:40 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-04-03 08:32:45 -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2018-04-03 09:24:12 +, Simon Riggs wrote:
> > > New files for MERGE
> > > src/backend/executor/nodeMerge.c   |  575 +++
> > > src/backend/parser/parse_merge.c   |  660 
> > > src/include/executor/nodeMerge.h   |   22 +
> > > src/include/parser/parse_merge.h   |   19 +
> >
> > Getting a bit grumpy here.  So you pushed this, without responding in
> > any way to the objections I made in
> > http://archives.postgresql.org/message-id/20180403021800.
> b5nsgiclzanobiup%40alap3.anarazel.de
> > and did it in a manner that doesn't even compile?
>
> This needs at the very least a response to the issues pointed out in the
> referenced email that you chose to ignore without any sort of comment.
>
>
Apologies from my end. Simon checked with me regarding your referenced
email. I was in the middle of responding to it (with an add-on patch to take
care of your review comments), but got sidetracked by a high-priority
customer escalation. I shall respond soon.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pgsql: New files for MERGE

2018-04-04 Thread Pavan Deolasee
On Thu, Apr 5, 2018 at 12:16 AM, Andres Freund  wrote:

> Hi,
>
> On 2018-04-05 00:02:06 +0530, Pavan Deolasee wrote:
> > Apologies from my end. Simon checked with me regarding your referenced
> > email. I was in the middle of responding to it (with a add-on patch to
> take
> > care of your review comments), but got side tracked by some high priority
> > customer escalation. I shall respond soon.
>
> Hows that an explanation for just going ahead and committing? Without
> even commenting on why one thinks the pointed out issues are something
> that can be resolved later or somesuch?  This has an incredibly rushed
> feel to it.
>

While I don't want to answer that on Simon's behalf, my feeling is that he
may not have seen your email, since it came in pretty late. He had probably
planned to commit the patch again first thing in the morning with the fixes
I'd sent.

Anyway, I think your review comments are useful and I've incorporated most
of them. Obviously, certain things like creating completely new executor
machinery are not practical given where we are in the release cycle, and I
am not sure that would have any significant advantage over what we have
today.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-04 Thread Pavan Deolasee
ised that I couldn't find a ready way to get a string representation of the
relation kind. Am I missing something? Of course, we can write one for the
purpose, but I wanted to ensure we are not duplicating something already
available.


>
>
> +   /*
> +* Do basic expression transformation (same as a
> ROW()
> +* expr, but allow SetToDefault at top level)
> +        */
> +   exprList = transformExpressionList(pstate,
> +  (List *)
> linitial(valuesLists),
> +
> EXPR_KIND_VALUES_SINGLE,
> +  true);
> +
> +   /* Prepare row for assignment to target table */
> +   exprList = transformInsertRow(pstate, exprList,
> + istmt->cols,
> + icolumns, attrnos,
> + false);
> +   }
>
> Can't we handle this with a littlebit less code duplication?
>

Hmm, yeah, that will be good. But given Tom's suggestions on the other
thread, I would like to postpone any refactoring here.


>
>  typedef struct HeapUpdateFailureData
>  {
> +   HTSU_Result result;
> ItemPointerData ctid;
> TransactionId xmax;
> CommandId   cmax;
> +   LockTupleMode   lockmode;
>  } HeapUpdateFailureData;
>
>
> These new fields seem not really related to HUFD, but rather just fields
> the merge code should maintain?
>

As I said, we can possibly track those separately. But they all arise as a
result of an update/delete/lock failure, so I would prefer to keep them along
with the other fields such as ctid and xmax/cmax. But if you or others insist,
I can move them and pass them in/out of the various function calls where we
need to maintain them.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Take-care-of-Andres-s-comments.patch
Description: Binary data


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-04-04 Thread Pavan Deolasee
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund  wrote:

>
>
> I've attached a noticeably editorialized patch:
>
>

+   /*
+    * As long as we don't support an UPDATE of INSERT ON CONFLICT for
+    * a partitioned table we shouldn't reach to a case where tuple to
+    * be lock is moved to another partition due to concurrent update
+    * of the partition key.
+    */
+   Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));
+

This is no longer true; at least not entirely. We still don't support ON
CONFLICT DO UPDATE to move a row to a different partition, but otherwise it
works now. See 555ee77a9668e3f1b03307055b5027e13bf1a715.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Add support for printing/reading MergeAction nodes

2018-04-05 Thread Pavan Deolasee
On Wed, Apr 4, 2018 at 11:21 PM, Tom Lane  wrote:

> Simon Riggs  writes:
> > On 4 April 2018 at 17:19, Tom Lane  wrote:
> >> If the MERGE patch has broken this, I'm going to push back on that
> >> and push back on it hard, because it probably means there are a
> >> whole bunch of other raw-grammar-output-only node types that can
> >> now get past the parser stage as well, as children of these nodes.
> >> The answer to that is not to add a ton of new infrastructure, it's
> >> "you did MERGE wrong".
>
> > MERGE contains multiple actions of Insert, Update and Delete and these
> > could be output in various debug modes. I'm not clear what meaning we
> > might attach to them if we looked since that differs from normal
> > INSERTs, UPDATEs, DELETEs, but lets see.
>
> What I'm complaining about is that that's a very poorly designed parsetree
> representation.  It may not be the worst one I've ever seen, but it's
> got claims in that direction.  You're repurposing InsertStmt et al to
> do something that's *not* an INSERT statement, but by chance happens
> to share some (not all) of the same fields.  This is bad because it
> invites confusion, and then bugs of commission or omission (eg, assuming
> that some particular processing has happened or not happened to subtrees
> of that parse node).  The most egregious way in which it's a bad idea
> is that, as I said, InsertStmt doesn't even exist in post-parse-analysis
> trees so far as the normal type of INSERT is concerned.  This just opens
> a whole batch of ways to screw up.  We have some types of raw parse nodes
> that are replaced entirely during parse analysis, and we have some others
> where it's convenient to use the same node type before and after parse
> analysis, but we don't have any that are one way in one context and the
> other way in other contexts.  And this is not the place to start.
>
> I think you'd have been better off to fold all of those fields into
> MergeAction, or perhaps make several variants of MergeAction.
>
>
Hi Tom,

Thanks for the review comments.

The attached patch refactors the grammar/parser side per your comments. We no
longer use InsertStmt/UpdateStmt/DeleteStmt/SelectStmt as part of
MergeAction. Instead we only collect the necessary information for running
the INSERT/UPDATE/DELETE actions. Speaking of MergeAction itself, I decided
to use a new parser-only node named MergeWhenClause and removed unnecessary
members from the MergeAction node, which is what now reaches the
planner/executor.

Regarding the original report by Marina, I suspect she may have turned on
debug_print_parse while running the regression tests. I could reproduce the
failures in the isolation tests by doing the same. The attached patch,
however, passes all tests with the following additional GUCs, so I am more
confident that we have got the outfuncs.c support right.

debug_print_parse=on
debug_print_rewritten=on
debug_print_plan=on

Also, I am now running tests with -DCOPY_PARSE_PLAN_TREES
-DRAW_EXPRESSION_COVERAGE_TEST since the buildfarm had earlier uncovered
some issues with those flags. No problems there either.

This now also enforces a single VALUES clause in the grammar itself instead
of doing that check at parse-analysis time. So that's a net improvement too.
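
For example, using tables like the target/source ones from the earlier MERGE
examples, only the first of these statements is now accepted; the second is
rejected by the grammar itself rather than at parse-analysis time:

-- single VALUES row in the INSERT action: accepted
MERGE INTO target t USING source s ON t.a = s.a
WHEN NOT MATCHED THEN INSERT VALUES (s.a, -1);

-- multiple VALUES rows: now a syntax error
MERGE INTO target t USING source s ON t.a = s.a
WHEN NOT MATCHED THEN INSERT VALUES (s.a, -1), (s.a, -2);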

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Separate-raw-parse-representation-from-rest.patch
Description: Binary data


Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-05 Thread Pavan Deolasee
On Thu, Apr 5, 2018 at 5:08 PM, Jesper Pedersen 
wrote:

> Hi Simon and Paven,
>
> On 04/04/2018 08:46 AM, Jesper Pedersen wrote:
>
>> On 03/30/2018 07:10 AM, Simon Riggs wrote:
>>
>>> No problems found, but moving proposed commit to 2 April pm
>>>
>>>
>> There is a warning for this, as attached.
>>
>>
> Updated version due to latest refactoring.
>


Hi Jesper,

The variable would become unused in non-assert builds. I see that. But
simply removing it is not a solution and I don't think the code will
compile that way. We should either rewrite that assertion, put it inside
an #ifdef USE_ASSERT_CHECKING block, or simply remove the assertion
altogether because we already check for relkind in parse_merge.c. Will check.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-05 Thread Pavan Deolasee
On Fri, Apr 6, 2018 at 1:30 AM, Andres Freund  wrote:

>
>
> My impression is that this simply shouldn't be going through
> nodeModifyTuple, but be it's own nodeMerge node. The trigger handling
> would need to be abstraced into execTrigger.c or such to avoid
> duplication.  We're now going from nodeModifyTable.c:ExecModifyTable()
> into execMerge.c:ExecMerge(), back to
> nodeModifyTable.c:ExecUpdate/Insert(). To avoid ExecInsert() doing
> things that aren't appropriate for merge, we then pass an actionState,
> which neuters part of ExecUpdate/Insert(). This is just bad.
>
>
But wouldn't this lead to a lot of code duplication? For example,
ExecInsert/ExecUpdate does a bunch of supporting work (firing triggers,
inserting into indexes, just to name a few) that MERGE's INSERT/UPDATE needs
as well. Now we could possibly move these support routines to a new file, say
execModify.c, and then let both the Merge and ModifyTable nodes make use
of that. But the fact is that ExecInsert/ExecUpdate already knows a lot about
ModifyTable. So separating ExecInsert/ExecUpdate from ModifyTable
will require significant refactoring AFAICS. I am not saying we can't do
that, but it will have its own consequences.

If we had not refactored to move ExecMerge and friends to a new
file, then it may not have looked so odd (as you describe above). But I
think moving the code to a new file was a net improvement. Maybe we can
move ExecInsert/Update etc. to a new file as I suggested, but still use
ModifyTable to run MERGE. There is a lot in common between them:
ModifyTable executes all DMLs, and MERGE is just another DML which can run
all three.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-06 Thread Pavan Deolasee
On Fri, Apr 6, 2018 at 1:30 AM, Andres Freund  wrote:

> Hi,
>
> On 2018-04-05 11:31:48 +0530, Pavan Deolasee wrote:
> > > +/*-
> > > 
> > > + *
> > > + * nodeMerge.c
> > > + *   routines to handle Merge nodes relating to the MERGE command
> > >
> > > Isn't this file misnamed and it should be execMerge.c?  The rest of the
> > > node*.c files are for nodes that are invoked via execProcnode /
> > > ExecProcNode().  This isn't an actual executor node.
> > >
> >
> > Makes sense. Done. (Now that the patch is committed, I don't know if
> there
> > would be a rethink about changing file names. May be not, just raising
> that
> > concern)
>
> It absolutely definitely needed to be renamed. But that's been done, so
> ...
>
>
> > > What's the reasoning behind making this be an anomaluous type of
> > > executor node?
>
> > Didn't quite get that. I think naming of the file was bad (fixed now),
> but
> > I think it's a good idea to move the new code in a new file, from
> > maintainability as well as coverage perspective. If you've something very
> > different in mind, can you explain in more details?
>
> Well, it was kinda modeled as an executor node, including the file
> name. That's somewhat fixed now.   I still am extremely suspicious of
> the codeflow here.
>
> My impression is that this simply shouldn't be going through
> nodeModifyTuple, but be it's own nodeMerge node. The trigger handling
> would need to be abstraced into execTrigger.c


There is nothing called execTrigger.c. You probably mean creating something
new, or else trigger.c. That being said, I don't think the current code is bad.
There is already a switch to handle the different command types; we add one
more case for CMD_MERGE and then fire the individual BEFORE/AFTER statement
triggers based on what actions the MERGE may take. The ROW trigger handling
doesn't require any change.
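
For concreteness, the statement-trigger part I have in mind is roughly the
following (a sketch only, not the exact patch; the merge_has_* flags are
placeholders for however we end up inspecting the action list):

        case CMD_MERGE:
            /*
             * Fire BEFORE STATEMENT triggers for each action type that the
             * MERGE statement contains; AFTER STATEMENT triggers would be
             * handled symmetrically when execution finishes.
             */
            if (merge_has_insert_action)
                ExecBSInsertTriggers(estate, resultRelInfo);
            if (merge_has_update_action)
                ExecBSUpdateTriggers(estate, resultRelInfo);
            if (merge_has_delete_action)
                ExecBSDeleteTriggers(estate, resultRelInfo);
            break;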


> or such to avoid
> duplication.  We're now going from nodeModifyTable.c:ExecModifyTable()
> into execMerge.c:ExecMerge(), back to
> nodeModifyTable.c:ExecUpdate/Insert(). To avoid ExecInsert() doing
> things that aren't appropriate for merge, we then pass an actionState,
> which neuters part of ExecUpdate/Insert(). This is just bad.
>

I have spent most of today trying to hack on the executor with the
goal of doing what you're suggesting. Having done so, I must admit I am
quite puzzled as to why you think having a new node is going to be a
significant improvement.

I first tried to split nodeModifyTable.c into two parts: one that deals
with just the ModifyTable node (the exported Init/Exec/End functions, to start
with) and another that abstracts out the actual Insert/Update/Delete
operations. The idea was that the second part shouldn't know anything about
ModifyTable, with ModifyTable being just one of its users. I increasingly
disliked the result as I continued hacking. Knowledge of the ModifyTable
node is quite widespread, including in execPartition.c and the FDW code.
ExecInsert/ExecUpdate, partitioning and ON CONFLICT handling all make use of
ModifyTable extensively. While MERGE does not need ON CONFLICT, it needs
everything else, even FDW support at some point in the future. Do you agree
that this is a bad choice, or am I getting it completely wrong, or do you
really expect MERGE to completely refactor nodeModifyTable.c, including the
partitioning-related code?

I gave up on this approach.

I then started working on the other possibility, where we keep a ModifyTable
node inside a MergeTable (that's the name I am using for now). The
paths/plans/executor state get built that way. So all common members still
belong to ModifyTable (and friends) and we only add MERGE-specific
information to MergeTable (and friends).

This, at least looks somewhat better. We can have:

- ExecInitMergeTable(): does some basic initialisation of the embedded
ModifyTable node so that it can be passed around
- ExecMergeTable(): executes the (only) subplan, fetches tuples, fires BR
triggers, does work for handling transition tables (so mostly duplication
of what ExecModifyTable does already) and then executes the MERGE actions.
It will mostly use the ModifyTable node created during initialisation when
calling ExecUpdate/Insert etc
- ExecEndMergeTable(): ends the executor

This is probably far better than the first approach. But is it really a
huge improvement over the committed code? Or even an improvement at all?

If that's not what you've in mind, can you please explain in more detail
how to you see the final design and how exactly it's better than what we
have today? Once there is clarity, I can work on it in a fairly quick
manner.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pgsql: New files for MERGE

2018-04-06 Thread Pavan Deolasee
On Fri, Apr 6, 2018 at 1:51 PM, Simon Riggs  wrote:

>  Given that the executor
> manifestly works and has been re-engineered according to PeterG's
> requests and that many performance concerns have already been
> addressed prior to commit, Pavan and I were happy with it. My proposal
> to commit the patch was given 5 days ahead of time and no comments
> were received by anyone, not even PeterG. There was no rush and I
> personally performed extensive reviews before final commit.


 I think it's quite unfair to say that Simon rushed into this. He said this
on 29th March:

On Thu, Mar 29, 2018 at 3:20 PM, Simon Riggs  wrote:

> On 28 March 2018 at 12:00, Pavan Deolasee 
> wrote:
>
> > v27 attached, though review changes are in
> > the add-on 0005 patch.
>
> This all looks good now, thanks for making all of those changes.
>
> I propose [v27 patch1+patch3+patch5] as the initial commit candidate
> for MERGE, with other patches following later before end CF.
>
> I propose to commit this tomorrow, 30 March, about 26 hours from now.
> That will allow some time for buildfarm fixing/reversion before the
> Easter weekend, then other patches to follow starting 2 April. That
> then gives reasonable time to follow up on other issues that we will
> no doubt discover fairly soon after commit, such as additional runs by
> SQLsmith and more eyeballs.



And he finally committed the patch late in the night on 2nd April. In
between, there were zero objections and no comments at all. I don't know
why this is considered rushed.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Bugs in TOAST handling, OID assignment and redo recovery

2018-04-10 Thread Pavan Deolasee
tart.

I don't think this is an unusual case and I'm surprised that we didn't
notice it before. Most likely the other protections in the system that
guard against wrapped-around OIDs masked the bug. The TOAST table, though,
escapes those protections in the specific case discussed above, which
brought the bug out into the open.

I think the right fix is to simply ignore the nextOid counter while
replaying an ONLINE checkpoint record. We must have already initialised
ShmemVariableCache->nextOid to the value stored in this (or some previous)
checkpoint record when redo recovery started. As and when we see an
XLOG_NEXTOID record, we maintain the shared memory counter correctly. If we
don't see any XLOG_NEXTOID record, the value we started with must be the
correct one. I see no problem even when the OID counter wraps around during
redo recovery; XLOG_NEXTOID should record that correctly.
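
To make that concrete, the XLOG_CHECKPOINT_ONLINE case in xlog_redo() would
conceptually reduce to something like this (an illustration only, not the
attached patch; locking is elided):

    CheckPoint  checkPoint;

    memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));

    /* The XID counter is treated as a minimum, i.e. never moved backwards */
    if (TransactionIdPrecedes(ShmemVariableCache->nextXid,
                              checkPoint.nextXid))
        ShmemVariableCache->nextXid = checkPoint.nextXid;

    /*
     * Deliberately do NOT copy checkPoint.nextOid into
     * ShmemVariableCache->nextOid.  The counter was initialised from the
     * checkpoint redo started from and is kept current by XLOG_NEXTOID
     * records, which may already have advanced it past the stale value
     * recorded in this online checkpoint.
     */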

Curiously neither REINDEX nor VACUUM would fix it until the OldestXmin
counter advances enough so that RECENTLY_DEAD tuples are ignored by the
REINDEX or removed by VACUUM. But after OldestXmin progresses, a REINDEX or
a VACUUM should fix these transient errors. Also once manifested, the
errors would stay until a REINDEX or VACUUM is performed.

We tested this at least as far back as 9.1 and the bugs exist there too. In
fact, these bugs have probably existed forever, though I did not check very
old releases.

Attached is a simple reproducer and a proposed fix to address both bugs. We
should consider backpatching it to all supported releases.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_toast_corruption_v3.patch
Description: Binary data

export PGDATA=/tmp/toast_corrupt
export PGPORT=5440
export PGDATABASE=postgres

# turn checkpoint logging on
start_options="-c checkpoint_timeout=30s -c checkpoint_completion_target=0.9 -c log_checkpoints=on -c max_prepared_transactions=10"
initdb -D $PGDATA
pg_ctl -l logfile -o "$start_options" -w start
psql -c 'create table testtoast (a int, b text)'
psql -c 'create table auxtable (a int, b text)'

# Do clean shutdown and restart to ensure we start with no cached OIDs
pg_ctl -w stop
rm logfile
pg_ctl -l logfile -o "$start_options" -w start

# Generate some non-trivial work for the upcoming checkpoint.
pg_controldata $PGDATA | grep NextOID
psql -c "insert into auxtable select generate_series(1,10), 'foo'"

# wait for the "time" checkpoint to start
echo "waiting for checkpoint start"
while ( true );
do
	tail logfile | grep "checkpoint starting"
	if [ "$?" == "0" ]; then
		echo "checkpoint is starting"
		break
	fi
	sleep 1
done

# now consume some OIDs.. insert them into the toast table. Since we started
# with zero cached OIDs, this should generate a XLOG_NEXTOID WAL record.
#
# This command must finish before the checkpoint finishes
psql -c "insert into testtoast select 1, (select string_agg('', md5(random()::text)) from generate_series(1,100)) from generate_series(1,20);"

# now wait for the checkpoint to finish.
echo "waiting for checkpoint complete"
while ( true );
do
	tail logfile | grep "checkpoint complete"
	if [ "$?" == "0" ]; then
		echo "checkpoint complete"
		break
	fi
	sleep 1
done

# the checkpoint should have the NextOID prior to last block allocation
pg_controldata $PGDATA | grep NextOID

# do some work and then crash
psql -c "insert into auxtable select generate_series(1,100), 'foo'"
pg_ctl -m immediate stop
sleep 2

# restart
pg_ctl -l logfile -o "$start_options" -w start

# the NextOID should have been advanced to include the OIDs that we consumed.
# But that doesn't happen, leading to duplicate OIDs and problems there after.
pg_controldata $PGDATA | grep NextOID

# create a prepare transaction to hold back OldestXmin
psql -c "begin; insert into auxtable select generate_series(1,100), 'foo'; prepare transaction 'p1';"

# now delete the old tuples. That will lead to deletion from the TOAST table as
# well. But since we're holding back OldestXmin, these tuples can't be cleaned
# up
psql -c "delete from testtoast"

# this should now insert duplicate OIDs while the previous RECENTLY_DEAD tuples
# are still around
psql -c "insert into testtoast select 1, (select string_agg('', md5(random()::text)) from generate_series(1,100)) from generate_series(1,20);"

### ERRROR ###
psql -c "\copy (select t.*::text from testtoast t) to /dev/null"

# prove that REINDEX won't fix the problem unless OldestXmin advances.
oid=`psql -At -c "select oid from pg_class where relname = 'testtoast'"`
psql -c "REINDEX TABLE pg_toast.pg_toast_$oid"
p

Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-10 Thread Pavan Deolasee
Hi Heikki,

On Tue, Apr 10, 2018 at 7:07 PM, Heikki Linnakangas  wrote:

>
>>
> It would seem more straightforward to add a snapshot parameter to
> GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast,
> while others pass SnapshotDirty. With your patch, you check the index
> twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with
> SnapshotToast, in the caller.
>

Hmm. I actually wrote my first patch exactly like that. I am trying to
remember why I discarded that approach. Was it to do with the fact that
SnapshotToast can't see all toast tuples either? Yeah, I think so. For
example, it won't see tuples with an uncommitted xmin, leading to different
issues. Now it's unlikely that we will have an OID conflict where the old
tuple has an uncommitted xmin, but I am not sure we can completely rule that
out. For example, had we not uncovered the redo recovery bug, we could have
had a prepared transaction that inserted the old tuple, which would now
conflict with the new tuple and not be detected by SnapshotToast.



>
> If I'm reading the rewrite case correctly, it's a bit different and quite
> special. In the loop with GetNewOidWithIndex(), it needs to check that the
> OID is unused in two tables, the old TOAST table, and the new one. You can
> only pass one index to GetNewOidWithIndex(), so it needs to check the
> second index manually. It's not because of the snapshot issue. Although I
> wonder if we should be using SnapshotToast in that GetNewOidWithIndex()
> call, too. I.e. if we should be checking both the old and the new toast
> table with SnapshotToast.


As I said, I am not sure if checking with just SnapshotToast is enough
because it can't see "dirty" tuples.


>
> Agreed. With nextXid, we advance ShmemVariableCache->nextXid if the value
> in the online checkpoint record is greater than
> ShmemVariableCache->nextXid. But we don't have such a wraparound-aware
> concept of "greater than" for OIDs. Ignoring the online checkpoint record's
> nextOid value seem fine to me.
>

Ok. Thanks for checking.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-04-10 Thread Pavan Deolasee
On Tue, Apr 10, 2018 at 7:49 PM, Claudio Freire 
wrote:

> On Tue, Apr 10, 2018 at 11:10 AM, Heikki Linnakangas 
> wrote:
>
> >
> > Why is this RelationSetTargetBlock() call inside the "XLOG stuff" block?
> > ISTM that we're failing to take advantage of this optimization for
> unlogged
> > tables, for no particular reason. Just an oversight?
> >
> > - Heikki
>
> Indeed.
>
> Maybe Pavan knows of one, but I don't see any reason not to apply this
> to unlogged tables as well. It slipped the review.
>

Yes, looks like an oversight :-( I will fix it along with the other changes
that Peter requested.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-11 Thread Pavan Deolasee
On Tue, Apr 10, 2018 at 7:24 PM, Pavan Deolasee 
wrote:

> Hi Heikki,
>
> On Tue, Apr 10, 2018 at 7:07 PM, Heikki Linnakangas 
> wrote:
>
>>
>>>
>> It would seem more straightforward to add a snapshot parameter to
>> GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast,
>> while others pass SnapshotDirty. With your patch, you check the index
>> twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with
>> SnapshotToast, in the caller.
>>
>
> Hmm. I actually wrote my first patch exactly like that. I am trying to
> remember why I discarded that approach. Was that to do with the fact that  
> SnapshotToast
> can't see all toast tuples either? Yeah, I think so. For example, it won't
> see tuples with uncommitted-xmin, leading to different issues. Now it's
> unlikely that we will have a OID conflict where the old tuple has
> uncommitted-xmin, but not sure if we can completely rule that out.
>

Or maybe we simply err on the side of caution and scan the toast table
with SnapshotAny while looking for a duplicate? That might prevent us from
reusing the OID of a known-dead tuple, but it should save us a second index
scan and still work.
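
For illustration, the check I have in mind would look roughly like this (a
sketch only; the function name is mine, and it is essentially what
toastrel_valueid_exists() already does, just with SnapshotAny):

static bool
toast_valueid_exists_any(Relation toastrel, Oid toastidxid, Oid valueid)
{
    bool        result;
    ScanKeyData toastkey;
    SysScanDesc toastscan;

    /* chunk_id is attribute 1 of the toast table */
    ScanKeyInit(&toastkey,
                (AttrNumber) 1,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(valueid));

    /*
     * SnapshotAny treats dirty as well as recently-dead chunks as
     * conflicts, so we never hand out an OID that still has chunks
     * lying around in the toast table.
     */
    toastscan = systable_beginscan(toastrel, toastidxid, true,
                                   SnapshotAny, 1, &toastkey);
    result = HeapTupleIsValid(systable_getnext(toastscan));
    systable_endscan(toastscan);

    return result;
}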

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-11 Thread Pavan Deolasee
On Wed, Apr 11, 2018 at 8:20 PM, Tom Lane  wrote:

> Pavan Deolasee  writes:
> > Or may be we simply err on the side of caution and scan the toast table
> > with SnapshotAny while looking for a duplicate? That might prevent us
> from
> > reusing an OID for a known-dead tuple, but should save us a second index
> > scan and still work.
>
> +1.  We really don't want to expend two indexscans on this.
>
>
Ok. I propose the attached patches, more polished this time. I also
changed toastrel_valueid_exists() to use SnapshotAny, but I don't have any
proof that that change is strictly needed. It seems harmless though, and
given the number of TOAST-related, hard-to-chase bugs we have seen over the
years, maybe it's worth making that path foolproof too.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0002-Do-more-extensive-search-while-looking-for-duplicate.patch
Description: Binary data


0001-Do-not-overwrite-the-nextOid-counter-while-replaying.patch
Description: Binary data


Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-12 Thread Pavan Deolasee
On Thu, Apr 12, 2018 at 1:58 AM, Tom Lane  wrote:

> So while looking at this, it suddenly occurred to me that probing with
> SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment
> either.
>

Yeah, it occurred to me as well, but when I looked at the code, I couldn't
find a case that is actually broken. I even tried a few test cases with DDLs
etc. But I think what you did is fine and more bulletproof. So +1 to that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-12 Thread Pavan Deolasee
On Thu, Apr 12, 2018 at 5:21 AM, Tom Lane  wrote:

> Michael Paquier  writes:
> > I have not really checked this thread in details, but one thing that
> > strikes me is that it would be rather easy to add a TAP test based on
> > the initial script that Pavan has sent.  Would that be worth testing
> > cycles or not?
>
> I doubt it --- that test is so specialized that it'd be unlikely to
> catch any other bug.
>
>
I agree; it's not worth adding a TAP test, except maybe as a demonstration
of how to write future tests for catching such issues. This was a very
specialised test case, written after getting a full grasp on the bug, and it
tests exactly the thing that I knew was broken based on code reading. Also,
with the duplicate-OID issue fixed, hitting more bugs in this area is going
to be even more difficult.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Bugs in TOAST handling, OID assignment and redo recovery

2018-04-17 Thread Pavan Deolasee
On Thu, Apr 12, 2018 at 5:53 AM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 4/10/18 06:29, Pavan Deolasee wrote:
> > One of our 2ndQuadrant support customers recently reported a sudden rush
> > of TOAST errors post a crash recovery, nearly causing an outage. Most
> > errors read like this:
> >
> > ERROR: unexpected chunk number 0 (expected 1) for toast value 
>
> While researching this, I found that the terminology in this code is
> quite inconsistent.  It talks about chunks ids, chunk indexes, chunk
> numbers, etc. seemingly interchangeably.  The above error is actually
> about the chunk_seq, not about the chunk_id, as one might think.
>
> The attached patch is my attempt to clean this up a bit.  Thoughts?
>

While I agree that we should clean it up, I wonder if changing the error
text is a good idea. These errors have been reported in the field for a very
long time, and if we change the text we might lose the accumulated knowledge
about past reports.

Also, "toast value" is same as "chunk_id". Should we clean up something
there too? "chunk_seq number" -- should that be just "chunk_seq"?


Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: VM map freeze corruption

2018-04-18 Thread Pavan Deolasee
On Wed, Apr 18, 2018 at 7:37 AM, Wood, Dan  wrote:

>
>
> My analysis is that heap_prepare_freeze_tuple->FreezeMultiXactId()
> returns FRM_NOOP if the MultiXACT locked rows haven't committed.  This
> results in changed=false and totally_frozen=true(as initialized).  When
> this returns to lazy_scan_heap(), no rows are added to the frozen[] array.
> Yet, tuple_totally_frozen is true.  This means the page is marked frozen in
> the VM, even though the MultiXACT row wasn't left untouch.
>
> A fix to heap_prepare_freeze_tuple() that seems to do the trick is:
> else
> {
> Assert(flags & FRM_NOOP);
> +  totally_frozen = false;
> }
>

That's a great find! This can definitely lead to various problems and could
be one of the reasons behind the issue reported here [1]. For example, if
we change the script slightly at the end, we can get the same error
reported in the bug report.

sleep 4;  # Wait for share locks to be released

# See if another vacuum freeze advances relminmxid beyond xmax present in
the
# heap
echo "vacuum (verbose, freeze) t;" | $p
echo "select pg_check_frozen('t');" | $p

# See if a vacuum freeze scanning all pages corrects the problem
echo "vacuum (verbose, freeze, disable_page_skipping) t;" | $p
echo "select pg_check_frozen('t');" | $p

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K%2BA%40mail.gmail.com

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Pavan Deolasee
On Tue, Apr 17, 2018 at 11:05 PM, Fujii Masao  wrote:

> Hi,
>
> I'd like to propose to add $SUBJECT for performance improvement.
>
> When VACUUM tries to truncate the trailing empty pages, it scans
> shared_buffers
> to invalidate the pages-to-truncate during holding an AccessExclusive lock
> on
> the relation. So if shared_buffers is huge, other transactions need to
> wait for
> a very long time before accessing to the relation. Which would cause the
> response-time spikes, for example, I observed such spikes several times on
> the server with shared_buffers = 300GB while running the benchmark.
> Therefore, I'm thinking to propose $SUBJECT and enable it to avoid such
> spikes
> for that relation.
>

Alvaro reminded me that we already have a mechanism in place which forces
VACUUM to give up the exclusive lock if another backend has been waiting on
it for more than a certain predefined duration. AFAICS we give up the
lock, but later retry truncation from where we previously left off.
What if we make that lock-wait duration configurable on a per-table basis?
And maybe add a special value to never truncate (though that seems quite
excessive to me and a possible footgun).

I was actually thinking in the other direction. Between the time VACUUM
figures out it can possibly truncate the last K pages and the actual
truncation, some backend may insert a tuple into one of those pages and make
the truncation impossible. What if we truncate the FSM before starting the
backward scan, so that new inserts go into pages prior to the truncation
point, if possible? That would increase the chances of VACUUM being able to
truncate all the empty pages. Though I think in some cases it might lead to
unnecessary further extension of the relation. Maybe we could use some
heuristic based on the available free space in the table prior to the
truncation point?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Pavan Deolasee
On Wed, Apr 18, 2018 at 10:50 PM, Fujii Masao  wrote:

>
>
> I'm not sure if it's safe to cancel forcibly VACUUM's truncation during
> scaning shared_buffers. That scan happens after WAL-logging and before
> the actual truncation.
>
>
Ah ok. I misread your proposal. This is about the shared_buffers scan
in DropRelFileNodeBuffers() and we can't cancel that operation.

What if we remember the buffers as seen by count_nondeletable_pages() and
then just discard those specific buffers, instead of scanning the entire
shared_buffers again? Surely we revisit all to-be-truncated blocks before
the actual truncation, so we already know which buffers to discard. And
we're holding an exclusive lock at that point, so nothing can change
underneath us. Of course, we can't really remember a large number of
buffers, so we could do this in small chunks: scan the last K blocks,
remember those K buffers, discard them, truncate the relation, and then try
the next K blocks. If another backend requests a lock on the table, we give
up or retry after a while.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: VM map freeze corruption

2018-04-19 Thread Pavan Deolasee
On Wed, Apr 18, 2018 at 7:06 PM, Alvaro Herrera 
wrote:

>
>
> IMO the cause is the totally_frozen variable, which starts life in a
> bogus state (true) and the different code paths can set it to the right
> state, or by inaction end up deciding that the initial bogus state was
> correct in the first place.  Errors of omission are far too easy in that
> kind of model, ISTM, so I propose this slightly different patch, which
> albeit yet untested seems easier to verify and easier to get right.
>

I wonder if we should just avoid initialising those variables at the top
and take the compiler's help to detect any missed branches. That way, we
will know exactly why and where we set them to true/false. I also think that
we should differentiate between scenarios where xmin/xmax is already frozen
and scenarios where we are freezing them now.

I came up with the attached. It's not fully polished (and there is an XXX
that I left for comments), but hopefully enough to show what I am thinking.
Do you think it's an improvement, or does it actually make the problem worse?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


frozen_v2.patch
Description: Binary data


Re: Toast issues with OldestXmin going backwards

2018-04-21 Thread Pavan Deolasee
On Sat, 21 Apr 2018 at 6:46 PM, Andrew Gierth 
wrote:

>
>
> So I dug into this one and it looks to me like the best approach. Here's
> a draft patch against 10-stable, if nobody can think of any showstoppers
> then I'll do the rest of the versions and commit them.
>

How does this guard against the case where the same OID gets inserted into
the toast table again, with matching lengths etc.? Rare, but it seems
possible, no?

I think we should look for a more complete solution, given how hard these
bugs are to detect and fix.

Thanks,
Pavan


> --
> Andrew (irc:RhodiumToad)
>
> --
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-03-12 Thread Pavan Deolasee
On Mon, Mar 11, 2019 at 1:37 PM Masahiko Sawada 
wrote:

>
>
> I might be missing something but why do we need to recheck whether
> each pages is all-frozen after insertion? I wonder if we can set
> all-frozen without checking all tuples again in this case.
>

It's possible that the user may have inserted unfrozen rows (via a regular
INSERT or a COPY without the FREEZE option) before inserting frozen rows. So
we can't set the all-visible/all-frozen flag unconditionally. I also find it
safer to do an explicit check to ensure we never accidentally mark a page
as all-frozen. Since the check is performed immediately after the page
becomes full, and only once per page, there shouldn't be any additional IO
cost and the check should be quite fast.
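
To illustrate, the per-page check amounts to no more than a walk over the
line pointers of a page we already have pinned and locked (a simplified
sketch, not the exact code in the patch):

static bool
page_is_all_frozen(Page page)
{
    OffsetNumber offnum,
                maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);
        HeapTupleHeader tuphdr;

        if (!ItemIdIsNormal(itemid))
            continue;

        tuphdr = (HeapTupleHeader) PageGetItem(page, itemid);

        /* xmin must be frozen and there must be no live xmax */
        if ((tuphdr->t_infomask & HEAP_XMIN_FROZEN) != HEAP_XMIN_FROZEN)
            return false;
        if (!(tuphdr->t_infomask & HEAP_XMAX_INVALID))
            return false;
    }

    return true;
}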

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-03-14 Thread Pavan Deolasee
On Wed, Mar 13, 2019 at 11:37 AM Masahiko Sawada 
wrote:

>
>
> I think that since COPY FREEZE can be executed only when the table is
> created or truncated within the transaction other users cannot insert
> any rows during COPY FREEZE.
>
>
Right. But the truncating transaction can insert unfrozen rows into the
table before inserting more rows via COPY FREEZE.

postgres=# CREATE EXTENSION pageinspect ;
CREATE EXTENSION
postgres=# BEGIN;
BEGIN
postgres=# TRUNCATE testtab ;
TRUNCATE TABLE
postgres=# INSERT INTO testtab VALUES (100, 200);
INSERT 0 1
postgres=# COPY testtab FROM STDIN WITH (FREEZE);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1 2
>> 2 3
>> \.
COPY 2
postgres=# COMMIT;

postgres=# SELECT lp, to_hex(t_infomask) FROM
heap_page_items(get_raw_page('testtab', 0));
 lp | to_hex
+
  1 | 800
  2 | b00
  3 | b00
(3 rows)

The first row is inserted by a regular INSERT and is not frozen (t_infomask
0x800 has only HEAP_XMAX_INVALID set). The next two are frozen (0xb00
additionally has HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID set, i.e. a frozen
xmin). We can't mark such a page all-visible/all-frozen.


>
> I'd suggest to measure performance overhead. I can imagine one use
> case of COPY FREEZE is the loading a very large table. Since in
> addition to set visibility map bits this patch could scan a very large
> table I'm concerned that how much performance is degraded by this
> patch.
>

Ok. I will run some tests. But please note that this patch is a bug fix to
address the performance issue caused by having to rewrite the entire table
when the all-visible bit is set on each page during the first vacuum.
So while we may do some more work during COPY FREEZE, we're saving a lot of
page writes during the next vacuum. Also, since the scan that we are doing in
this patch is done on a page that should be in the buffer cache, we will
pay a bit in CPU cost, but nothing in IO cost.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Re: A separate table level option to control compression

2019-03-21 Thread Pavan Deolasee
Hi,

On Thu, Mar 21, 2019 at 3:10 AM Shaun Thomas 
wrote:

>
> I can't really speak for the discussion related to `storage.sgml`, but
> I based my investigation on the existing patch to `create_table.sgml`.
> About the only thing I would suggest there is to possibly tweak the
> wording.
>
> * "The compress_tuple_target ... " for example should probably read
>   "The toast_tuple_target parameter ...".
> * "the (blocksize - header)" can drop "the".
> * "If the value is set to a value" redundant wording should be rephrased;
>   "If the specified value is greater than toast_tuple_target, then
>   we will substitute the current setting of toast_tuple_target instead."
>   would work.
>

Thanks Shaun. Attached patch makes these adjustments.


> * I'd recommend a short discussion on what negative consequences can be
>   expected by playing with this value. As an example in my tests, setting
>   it very high may result in extremely sparse pages that could have an
>   adverse impact on HOT updates.
>

Setting compress_tuple_target to a higher value won't have a negative
effect, because toast_tuple_target is used for compression anyway when
compress_tuple_target is higher than toast_tuple_target. Maybe some
discussion could be added to the paragraph about toast_tuple_target to
explain the negative impact of a high value.

I did add a small discussion about the negative effects of setting
compress_tuple_target lower, though, per your suggestion.

Also added some details in storage.sgml as recommended by Sawada-san. Hope
this helps.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Add-a-table-level-option-to-control-compression.patch
Description: Binary data


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-03-21 Thread Pavan Deolasee
On Thu, Mar 14, 2019 at 3:54 PM Masahiko Sawada 
wrote:

>
> >
> >
> > Ok. I will run some tests. But please note that this patch is a bug fix
> to address the performance issue that is caused by having to rewrite the
> entire table when all-visible bit is set on the page during first vacuum.
> So while we may do some more work during COPY FREEZE, we're saving a lot of
> page writes during next vacuum. Also, since the scan that we are doing in
> this patch is done on a page that should be in the buffer cache, we will
> pay a bit in terms of CPU cost, but not anything in terms of IO cost.
>
> Agreed. I had been misunderstanding this patch. The page scan during
> COPY FREEZE is necessary and it's very cheaper than doing in the first
> vacuum.
>

Thanks for agreeing on the need for this bug fix. I ran some simple tests
anyway and here are the results.

The test consists of a simple table with three columns, two integers and
one char(100). I then ran COPY (FREEZE), loading 7M rows, followed by a
VACUUM. The total size of the raw data is about 800MB and the table size in
Postgres is just under 1GB. The results for 3 runs in milliseconds are:

Master:
 COPY FREEZE: 40243.725   40309.675   40783.836
 VACUUM:       2685.871    2517.445    2508.452

Patched:
 COPY FREEZE: 40942.410   40495.303   40638.075
 VACUUM:         25.067      35.793      25.390

So there is a slight increase in the time to run COPY FREEZE, but a
significant reduction in the time to VACUUM the table. The benefits only
go up if the table is vacuumed much later, when most of the pages have
already been written to disk and evicted from shared buffers and/or the
kernel cache.

I hope this satisfies your doubts regarding performance implications of the
patch.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-03-26 Thread Pavan Deolasee
On Fri, Mar 22, 2019 at 12:19 PM Masahiko Sawada 
wrote:

> I've looked at the patch and have comments and questions.
>
> +/*
> + * While we are holding the lock on the page, check if all tuples
> + * in the page are marked frozen at insertion. We can safely mark
> + * such page all-visible and set visibility map bits too.
> + */
> +if (CheckPageIsAllFrozen(relation, buffer))
> +PageSetAllVisible(page);
> +
> +MarkBufferDirty(buffer);
>
> Maybe we don't need to mark the buffer dirty if the page is not set
> all-visible.
>
>
Yeah, makes sense. Fixed.


>  If we have CheckAndSetPageAllVisible() for only this
> situation we would rather need to check that the VM status of the page
> should be 0 and then set two flags to the page? The 'flags' will
> always be (VISIBILITYMAP_ALL_FROZEN | VISIBILITYMAP_ALL_VISIBLE) in
> copy freeze case. I'm confused that this function has both code that
> assumes some special situations and code that can be used in generic
> situations.
>

If a second COPY FREEZE is run within the same transaction and it starts
inserting into the page used by the previous COPY FREEZE, then the page
will already be marked all-visible/all-frozen, so we can skip repeating the
operation. While it's quite unlikely that someone will do that, and I can't
think of a situation where only one of those flags would be set, I don't see
any harm in keeping the code as is. This code is borrowed from vacuumlazy.c
and at some point we could even move it to some common location.


> Perhaps we can add some tests for this feature to pg_visibility module.
>
>
That's a good idea. Please see if the tests included in the attached patch
are enough.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


copy_freeze_v4.patch
Description: Binary data


Re: Faster inserts with mostly-monotonically increasing values

2018-01-02 Thread Pavan Deolasee
On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan  wrote:

> On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
>  wrote:
> > Here is a patch that implements the idea. If the last insert happens to
> be
> > in the rightmost block of an index, then we cache the block and check
> that
> > first for the next insert. For the cached block to be usable for the
> insert,
> > it should still be the rightmost, the to-be-inserted key should go into
> the
> > cached block and there is enough free space in the block. If any of these
> > conditions do not meet, we fall back to the regular code path, i.e.
> descent
> > down the index from the top.
>
> I'd be particularly concerned about page deletion/page recycling still
> being correct with the patch, especially because it's hard to test
> anything there. The P_ISLEAF() part of your fastpath's test hints at
> this -- why should it not *always* be a leaf page (surely you should
> be sure that the page cannot be recycled anyway)?


I thought we could throw in a bunch of checks there to ensure that we are
still looking at the correct rightmost page, or else fall back to the regular
path. Do you see something missing there? I don't claim that I've got
everything correct, but since we're holding an exclusive lock on the page
at that point, wouldn't we be able to guard against any concurrent
splits/deletions? I will reread the documentation/code, but I must admit
it's not particularly easy to understand all the concurrency issues.



> I also have my
> doubts about unique index enforcement remaining correct with the patch
> when there are many physical duplicates, to the extent that more than
> a single leaf page is needed for a single value.
>

Why would that happen? We are checking if the to-be-inserted key is
strictly greater than the last key in the rightmost block and only then
proceeding with the fast path. And we do this while holding an exclusive
lock on the page. Are you suggesting that someone can concurrently still
split the page?
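
For reference, the gist of the fastpath test is something like this (a
simplified sketch of the checks described above, not the exact patch;
itup_scankey, itemsz and natts are assumed to have been built by the caller,
as in _bt_doinsert()):

buf = ReadBuffer(rel, cachedBlock);
LockBuffer(buf, BT_WRITE);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);

if (P_ISLEAF(opaque) && P_RIGHTMOST(opaque) && !P_IGNORE(opaque) &&
    PageGetFreeSpace(page) > itemsz &&
    PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(opaque) &&
    _bt_compare(rel, natts, itup_scankey, page,
                PageGetMaxOffsetNumber(page)) > 0)
{
    /* new key sorts after every key on the rightmost leaf: insert here */
}
else
{
    /* conditions not met: release the buffer and descend from the root */
    _bt_relbuf(rel, buf);
}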


>
> Maybe you should do something with page LSN here -- cache LSN
> alongside block number, and have a non-matching LSN invalidate the
> test.
>

This is a good idea, but we might need to track the block/LSN in shared
memory. Otherwise, with multiple backends doing inserts, we might never be
able to match LSNs (i.e. they will mostly vary, thus falling back to the
slow path the majority of the time).


>
> How many clients are involved with your benchmark?
>

Just a single client for these benchmarks.


>
> > So as the size of the index increases, the benefits of the patch also
> tend
> > to increase. This is most likely because as the index becomes taller and
> > taller, the costs associated with index descent becomes higher.
>
> FWIW, I think that int4 indexes have to be extremely large before they
> exceed 4 levels (where the leaf level counts as a level). My
> conservative back-of-an-envelope estimate is 9 billion index tuples.
>
>
Hmm, ok. I am not entirely sure whether it's the depth itself, or simply the
binary searches through 3-4 index pages and/or the pinning/unpinning of
buffers, that results in most of the overhead. I will run some more tests
and collect evidence.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-01-04 Thread Pavan Deolasee
On Tue, Jan 2, 2018 at 7:15 PM, Tels  wrote:

> Moin,
>
>
> >>
> >>
> > Hmm Ok. I am not entirely sure whether it's the depth or just purely
> > binary
> > searching through 3-4 index pages and/or pinning/unpinning buffers result
> > in much overhead. I will run some more tests and collect evidences.
>
> Just a question trying to understand how btree indexes work:
>
> If one inserts ever-increasing value, is the tree a balanced tree with a
> minimum (or at least not-as-high) number of levels, or does it increase in
> height every insert and creates a "tall stack"?
>

AFAIK all pages will be half-filled in this case. Imagine one index page
can accommodate N entries; the page will be split when the (N+1)th entry is
added. The left page will keep half the entries and the right page will get
the rest. The next split will happen when the right page is full, i.e. when
another N/2 entries have been added. Thus there will be a split every N/2
inserts, creating an index with half-filled pages.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-01-04 Thread Pavan Deolasee
On Thu, Jan 4, 2018 at 6:05 PM, Alvaro Herrera 
wrote:

> Pavan Deolasee wrote:
> > On Tue, Jan 2, 2018 at 7:15 PM, Tels 
> wrote:
> >
> > > Just a question trying to understand how btree indexes work:
> > >
> > > If one inserts ever-increasing value, is the tree a balanced tree with
> a
> > > minimum (or at least not-as-high) number of levels, or does it
> increase in
> > > height every insert and creates a "tall stack"?
> >
> > AFAIK all pages will be half-filled in this case. Imagine if one index
> page
> > can accommodate N entries, the page will be split when (N+1)th entry is
> > added. The left page will have half the entries and the right page will
> get
> > the remaining. The next split will happen when the right page is full
> i.e.
> > when another N/2 entries are added. Thus there will be a split at every
> N/2
> > inserts, creating an index with half filled pages.
>
> Not really -- quoth _bt_findsplitloc():
>
>  * If the page is the rightmost page on its level, we instead try to
> arrange
>  * to leave the left split page fillfactor% full.  In this way, when we are
>  * inserting successively increasing keys (consider sequences, timestamps,
>  * etc) we will end up with a tree whose pages are about fillfactor% full,
>  * instead of the 50% full result that we'd get without this special case.
>
>
Ah ok. Thanks for pointing that out. Makes a lot more sense.

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-31 Thread Pavan Deolasee
On Thu, Feb 1, 2018 at 11:05 AM, Michael Paquier 
wrote:

> On Sat, Jan 13, 2018 at 03:40:01PM +, Simon Riggs wrote:
> > The new two byte value is protected by CRC. The 2 byte value repeats
> > every 32768 WAL files. Any bit error in that value that made it appear
> > to be a current value would need to have a rare set of circumstances.
>
> If you use the two lower bytes of the segment number, then this gets
> repeated every 64k segments, no?  In terms of space this represents
> 500GB worth of WAL segments with a default segment size.  Hence the more
> PostgreSQL scales, the more there is a risk of collision, and I am
> pretty sure that there are already deployments around allocating
> hundreds of gigs worth of space for the WAL partition.  There are no
> problems of this class if using the 8-byte field xl_prev.  It seems to
> me that we don't want to take any risks here.
>

IMHO we're missing a point here. When xl_prev is changed to a 2-byte value
(as the patch suggests), the risk only arises when an old WAL file is
recycled as some future WAL file and the old and future segment numbers
share the same low-order 16 bits. As the patch stands, we take the
precaution of never reusing a WAL file with conflicting low-order bits.

This particular piece of the patch deals with that:

@@ -3496,6 +3495,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 			return false;
 		}
 		(*segno)++;
+
+		/*
+		 * If avoid_conflicting_walid is true, then we must not recycle the
+		 * WAL files so that the old and the recycled WAL segnos have the
+		 * same lower order 16-bits.
+		 */
+		if (avoid_conflicting_walid &&
+			XLogSegNoToSegID(tmpsegno) == XLogSegNoToSegID(*segno))
+			(*segno)++;
+
 		XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
 	}
 }

For example, the old WAL file 00010001 will NEVER be reused
as 000100010001. So even if there are intact old WAL records
in the recycled file, they will be detected correctly during replay, since
the SegID stored in the WAL record will not match the SegID derived from the
WAL file's segment number. The replay code performs this check for every WAL
record it sees.
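
To spell that out, the replay-side check is conceptually no more than this
(a sketch only; XLogSegNoToSegID() is the helper introduced by the patch,
and the stored SegID is the 2-byte value that replaces xl_prev):

static bool
record_segid_matches(XLogRecPtr recptr, uint16 stored_segid)
{
    XLogSegNo   segno;

    /* which segment are we currently reading? */
    XLByteToSeg(recptr, segno, wal_segment_size);

    /* a leftover record from a recycled file cannot pass this test */
    return stored_segid == (uint16) (segno & 0xFFFF);   /* XLogSegNoToSegID(segno) */
}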

There were some concerns about bit-flipping, which may inadvertently change
the SegID stored in the carried-over WAL record so that it matches the
current WAL file's SegID, but TBH I don't see why we can't trust the CRC to
detect that. If we can't, then there would be other problems during WAL
replay as well.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-05 Thread Pavan Deolasee
On Tue, Feb 6, 2018 at 9:50 AM, Peter Geoghegan  wrote:

> On Mon, Feb 5, 2018 at 7:56 PM, Robert Haas  wrote:
> > I don't think you get to make a unilateral decision to exclude
> > features that work everywhere else from the scope of this patch.  If
> > there is agreement that those features can be left out of scope, then
> > that is one thing, but so far all the commentary about the things that
> > you've chosen to exclude has been negative.  Nor have you really given
> > any reason why they should be exempt.  You've pointed out that
> > parallel query doesn't handle everything (which is certainly true, but
> > does not mean that any feature from now and the end of time is allowed
> > to exclude from scope whatever seems inconvenient regardless of
> > contrary community consensus) and you've pointed out here and
> > elsewhere that somebody could go add the features you omitted later
> > (which is also true, but misses the general point that we want
> > committed patches to be reasonably complete already, not have big gaps
> > that someone will have to fix later).
>
> For me, the concern is not really the omission of support for certain
> features as such. The concern is that those omissions hint that there
> is a problem with the design itself, particularly in the optimizer.
> Allowing subselects in the UPDATE part of a MERGE do not seem like
> they could be written as a neat adjunct to what Simon already came up
> with. If that was possible, Simon probably already would have done it.
>
>
As someone who's helping Simon with that part of the code, I must say that
the omission of sub-selects in the UPDATE targetlist and WHEN quals is not
because of any known design problem. So while it may be true that we have a
design problem, it's also quite likely that we are merely missing some
planner/optimiser trick, and once we add the missing pieces it will start
working. The same is the case with RLS.

Partitioned table support is something I am actively working on. I must say
that the very fact that INSERT and UPDATE/DELETE take completely different
paths for partitioned/inherited tables makes MERGE quite difficult, because
it has to carry out both kinds of operations and hence needs all of that
machinery. If I understand correctly, INSERT ON CONFLICT must have faced
similar problems, which is why DO UPDATE does not work with partitioned
tables. I am not sure whether that choice was made when INSERT ON CONFLICT
was implemented or when partitioned table support was added, but the
challenges look similar.

I first tried to treat MERGE similarly to the UPDATE/DELETE case and ensure
that the INSERTs go through the root partition. That mostly works, but the
RIGHT OUTER join between the child tables and the source relation ends up
emitting duplicate rows if the partitioned table is the resultRelation and
it gets expanded in inheritance_planner(). That's a blocker. So what I
am trying now is to push the join between the Append relation and the
source relation below the ModifyTable node, so that we get the final join
result. We can then look up the tableoid in the row returned from the join,
find the corresponding result relation and carry out the MERGE actions.
Note that unlike regular ExecModifyTable(), here we must execute just one
subplan, since that returns all the required tuples.
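
As a sketch of that routing step (illustration only; tableoidAttno,
numResultRels and resultRelInfos are placeholders for whatever the real
patch ends up using):

Datum       datum;
bool        isNull;
Oid         tableoid;
int         i;

/* the join emits tableoid as a junk attribute */
datum = ExecGetJunkAttribute(slot, tableoidAttno, &isNull);
Assert(!isNull);
tableoid = DatumGetObjectId(datum);

/* route the MERGE actions to the matching result relation */
for (i = 0; i < numResultRels; i++)
{
    ResultRelInfo *rri = &resultRelInfos[i];

    if (RelationGetRelid(rri->ri_RelationDesc) == tableoid)
    {
        estate->es_result_relation_info = rri;
        break;
    }
}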

Does anyone see a potential blocker with this approach, other than that it
may not be the most elegant way? I think EvalPlanQual might need some
treatment, because when the plan is re-executed it will expect to find the
updated tuple in the slot of the underlying query's RTE and not in the
resultRelation's RTE, which does not participate in the join at all.
Anything else I could be missing completely?

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-07 Thread Pavan Deolasee
On Wed, Feb 7, 2018 at 10:52 PM, Peter Geoghegan  wrote:

>
>
> Apparently there is a pending patch to do better there - the commit
> message of 2f178441 refers to this.
>
>
Thanks. Will look at it.


> > The revised version also supports subqueries in SET targetlist as well as
> > WHEN AND conditions. As expected, this needed a minor tweak in
> > query/expression mutators. So once I worked on that for partitioned
> tables,
> > subqueries started working with very minimal adjustments elsewhere. Other
> > things such as whole-var references are still broken. A few new tests are
> > also added.
>
> Great!
>
> Wholerow references are expected to be a bit trickier. See commit
> ad227837 for some hints on how you could fix this.
>

Thanks again.


>
> > Next I am going to look at RLS. While I've no prior experience with RLS,
> I
> > am expecting it to be less cumbersome in comparison to partitioning. I am
> > out of action till Monday and may not be able to respond in time. But any
> > reviews and comments are appreciated as always.
>
> I don't think that RLS support will be particularly challenging. It
> might take a while.
>

Ok. I will start by writing a test case to check what works and what
doesn't with the current patch and work from there. My understanding of RLS
is limited right now, but from what I've seen in the code (while hacking on
other stuff), my guess is that it will require us to evaluate a set of quals
and then decide on the action.


>
> If your rapid progress here is any indication, most open items are not
> likely to be particularly challenging. Once again, I suggest that a
> good area for us to focus on is the semantics of READ COMMITTED
> conflict handling.


I understand that getting EPQ semantics right is very important. Can you
please (once again) summarise your thoughts on what you think is the *most*
appropriate behaviour? I can then think about how much effort might be
involved. If the effort is disproportionately high, we can discuss settling
for some not-so-nice semantics, like we apparently did for partition key
updates.

I am sorry, I know you and Simon have probably gone over that a few times
already and I should really study those proposals first. So it's okay if
you don't want to repeat them; I will work through them next week once I am
back from holidays.

Maybe you'd prefer to just blast through the
> simpler open items, which is fine, but do bear in mind that the EPQ
> and EPQ-adjacent stuff is probably going to be the thing that makes or
> breaks this patch for v11.
>

TBH I did not consider partitioning any less complex, and it was indeed very
complex, requiring at least three reworks by me. And from what I understood,
it would have been a blocker too. So are subquery handling and RLS. That's
why I focused on addressing those items while you and Simon were still
debating the EPQ semantics.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-11 Thread Pavan Deolasee
On Fri, Feb 9, 2018 at 6:53 AM, Peter Geoghegan  wrote:

> On Wed, Feb 7, 2018 at 7:51 PM, Pavan Deolasee 
> wrote:
> > I understand getting EPQ semantics right is very important. Can you
> please
> > (once again) summarise your thoughts on what you think is the *most*
> > appropriate behaviour? I can then think how much efforts might be
> involved
> > in that. If the efforts are disproportionately high, we can discuss if
> > settling for some not-so-nice semantics, like we apparently did for
> > partition key updates.
>
> I personally believe that the existing EPQ semantics are already
> not-so-nice. They're what we know, though, and we haven't actually had
> any real world complaints, AFAIK.
>

I agree.


>
> My concern is mostly just that MERGE manages to behave in a way that
> actually "provides a single SQL statement that can conditionally
> INSERT, UPDATE or DELETE rows, a task that would otherwise require
> multiple procedural language statements", as the docs put it. As long
> as MERGE manages to do something as close to that high level
> description as possible in READ COMMITTED mode (with our current
> semantics for multiple statements in RC taken as the baseline), then
> I'll probably be happy.
>

IMO it will be quite hard, if not impossible, to guarantee the same
semantics for a single-statement MERGE and a multi-statement
UPDATE/DELETE/INSERT sequence in RC mode. For example, the multi-statement
model will execute each statement with a new MVCC snapshot, and hence the
rows visible to the individual statements may vary; whereas in MERGE
everything runs with a single snapshot. There could be other such subtle
differences.


>
> Some novel new behavior -- "EPQ with a twist"-- is clearly necessary.
> I feel a bit uneasy about it because anything that anybody suggests is
> likely to be at least a bit arbitrary (EPQ itself is kind of
> arbitrary). We only get to make a decision on how "EPQ with a twist"
> will work once, and that should be a decision that is made following
> careful deliberation. Ambiguity is much more likely to kill a patch
> than a specific technical defect, at least in my experience. Somebody
> can usually just fix a technical defect.
>

While I agree, I think we need to make these decisions in a time-bound
fashion. If there is too much ambiguity, then it's not a bad idea to settle
for throwing appropriate errors instead of providing semantically wrong
answers, even in some remote corner case.


>
>
>
> > TBH I did not consider partitioning any less complex and it was indeed
> very
> > complex, requiring at least 3 reworks by me. And from what I understood,
> it
> > would have been a blocker too. So is subquery handling and RLS. That's
> why I
> > focused on addressing those items while you and Simon were still debating
> > EPQ semantics.
>
> Sorry if I came across as dismissive of that effort. That was
> certainly not my intention. I am pleasantly surprised that you've
> managed to move a number of things forward rather quickly.
>
> I'll rephrase: while it would probably have been a blocker in theory
> (I didn't actually weigh in on that), I doubted that it would actually
> end up doing so in practice (and it now looks like I was right to
> doubt that, since you got it done). It was a theoretical blocker, as
> opposed to an open item that could drag on indefinitely despite
> everyone's best efforts. Obviously details matter, and obviously there
> are a lot of details to get right outside of RC semantics, but it
> seems wise to focus on the big risk that is EPQ/RC conflict handling.
>

Ok. I am now back from holidays and I too will start thinking about this.
I've also requested a colleague to help us with comparing it against
Oracle's behaviour. That's not a gold standard for us, but knowing how
other major databases handle RC conflicts is not a bad idea.

I see the following important areas, and as long as we have a consistent and
coherent handling of these cases, we should not have difficulty agreeing on
an outcome.

1. Concurrent UPDATE does not affect the MATCHED case. The WHEN conditions may
or may not be affected.
2. Concurrently UPDATEd tuple fails the join qual and the current source
tuple no longer matches with the updated target tuple that the EPQ is set
for. It matches no other target tuple either. So a MATCHED case is turned
into a NOT MATCHED case.
3. Concurrently UPDATEd tuple fails the join qual and the current source
tuple no longer matches with the updated target tuple that the EPQ is set
for. But it matches some other target tuple. So it's still a MATCHED case,
but with different target tuple(s).
4. Concurrent UPDATE/INSERT creates a matching target tuple for a source
tuple, turning a NOT MATCHED case into a MATCHED case.

Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-12 Thread Pavan Deolasee
On Sat, Feb 10, 2018 at 7:19 AM, Tomas Vondra 
wrote:

> Hi,
>
> On 02/07/2018 10:24 AM, Pavan Deolasee wrote:
> >
>
>if (startWAL < GetXactWALBytes())
>ereport(ERROR,
>(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> errmsg("cannot write to database ...")));
>
> I think this actually fails to enforce the rule, because some writes may
> not produce WAL (think of unlogged tables). I also suspect it may be
> incorrect "in the opposite direction" because a query may not do any
> changes and yet it may produce WAL (e.g. due to wal_hint_bits=true).
>
> So we may need to think of a different way to enforce this ...


Yes, this does look problematic.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-12 Thread Pavan Deolasee
On Fri, Feb 9, 2018 at 8:06 PM, Robert Haas  wrote:

>
>
> On that basis, of the options I listed in
> http://postgr.es/m/CA+TgmoZDL-caukHkWet7sr7sqr0-e2T91+
> devhqen5sfqsm...@mail.gmail.com
> I like #1 least.
>
> I also dislike #4 from that list for the reasons stated there.  For
> example, if you say WHEN MATCHED AND x.some_boolean and then WHEN
> MATCHED, you expect that every tuple that hits the latter clause will
> have that Boolean as false or null, but #4 makes that not true.
>
> I think the best options are #2 and #5 -- #2 because it's simple, and
> #5 because it's (maybe) more intuitive, albeit at the risk of
> livelock.


As you said, #5 seems the best and that's what the patch does. But ISTM
that the options you listed are not really the concerning points. As the
patch stands today, we evaluate WHEN AND conditions separately, outside the
EPQ. The problem arises when the join qual returns a different result with
the updated tuple. I listed down those cases in my earlier email in the
day. To me (and I assume to Peter and Simon too), those are the more
interesting cases.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-02-12 Thread Pavan Deolasee
On Fri, Feb 2, 2018 at 9:07 PM, Robert Haas  wrote:

> On Thu, Feb 1, 2018 at 7:21 PM, Simon Riggs  wrote:
> > Yes, it would be about 99% of the time.
> >
> > But you have it backwards - we are not assuming that case. That is the
> > only case that has risk - the one where an old WAL record starts at
> > exactly the place the latest one stops. Otherwise the rest of the WAL
> > record will certainly fail the CRC check, since it will effectively
> > have random data in it, as you say.
>
> OK, I get it now.  Thanks for explaining.  I think I understand now
> why you think this problem can be solved just by controlling the way
> we recycle segments, but I'm still not sure if that can really be made
> fully reliable.  Michael seems concerned about what might happen after
> multiple recyclings, and Tom has raised the issue of old data
> reappearing after a crash.
>

I'm not sure if Michael has spotted a real problem or whether that was just a
concern. He himself later rightly pointed out that when a WAL file is
switched, the old file is filled with zeros. So I don't see a problem
there. Maybe I am missing something and Michael can explain further.

Regarding Tom's concern, that could be a problem if a file system crash
survives a name change, but not the subsequent data written to the file.
For this to be a problem, WAL file A is renamed to B and then renamed to C.
Files A and C share the same low-order bits. Further, upon a file system crash,
the file is correctly named C, but the data written *before* the rename
operation is lost. Is that a real possibility? Can we delay reusing low
order bits a little further to address this problem? Of course, if the file
system crash can survive many renames and still resurrect old data from several
renames back, then we shall have the same problem.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-13 Thread Pavan Deolasee
Hi Stephen,

On Tue, Feb 6, 2018 at 3:37 PM, Stephen Frost  wrote:

>
>
> Coming out of that, my understanding is that Simon is planning to have a
> patch which implements RLS and partitioning (though the query plans for
> partitioning may be sub-par and not ideal) as part of MERGE and I've
> agreed to review at least the RLS bits (though my intention is to at
> least go through the rest of the patch as well, though likely in less
> detail).  Of course, I encourage others to review it as well.
>
>
Thanks for volunteering to review the RLS bits. I am planning to send a
revised version soon. As I work through it, I am faced with some semantic
questions again. I would appreciate it if you or anyone else has answers to those.

While executing MERGE, for existing tuples in the target table, we may end
up doing an UPDATE or DELETE, depending on the WHEN MATCHED AND conditions.
So it seems unlikely that we would be able to push USING security
quals down to the scan. For example, if the target row is set for deletion,
it seems wrong to exclude the row from the join based on UPDATE policy's
USING quals. So I am thinking that we should apply the respective USING
quals *after* the decision to either update, delete or do nothing for the
given target tuple is made.

The question I have is, if the USING qual evaluates to false or NULL,
should we silently ignore the tuple (like regular UPDATE does) or throw an
error (like INSERT ON CONFLICT DO UPDATE)? ISTM that we might have decided
to throw an error in case of INSERT ON CONFLICT to avoid any confusion
where the tuple is neither inserted nor updated. Similar situation may
arise with MERGE because for a source row, we may neither do UPDATE
(because of RLS) nor INSERT because a matching tuple already exists. But
someone may argue that we should stay closer to regular UPDATE/DELETE.
Apart from that, are there any security angles that we need to be mindful
of and would those impact the choice?
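
For illustration, here is a hypothetical sketch of the kind of setup I have
in mind; the table, the policies and the new_balances source table are all
made up, and the MERGE syntax is the one from the patch:

CREATE TABLE accounts (id int PRIMARY KEY, owner text, balance int);
CREATE TABLE new_balances (id int, balance int);
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

CREATE POLICY acc_select ON accounts FOR SELECT USING (true);
CREATE POLICY acc_update ON accounts FOR UPDATE USING (owner = current_user);
CREATE POLICY acc_delete ON accounts FOR DELETE
    USING (owner = current_user AND balance = 0);
CREATE POLICY acc_insert ON accounts FOR INSERT
    WITH CHECK (owner = current_user);

-- A single target row may be updated, deleted or left alone depending on
-- which WHEN clause fires, so the applicable USING qual (UPDATE's or
-- DELETE's) is only known after the action has been chosen.
MERGE INTO accounts a
USING new_balances n ON a.id = n.id
WHEN MATCHED AND n.balance = 0 THEN DELETE
WHEN MATCHED THEN UPDATE SET balance = n.balance
WHEN NOT MATCHED THEN INSERT VALUES (n.id, current_user, n.balance);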

SELECT policies will be applied to the target table during the scan and
rows which do not pass SELECT quals will not be processed at all. If there
are NOT MATCHED actions, we might end up inserting duplicate rows in that
case or throwing errors, but I don't see anything wrong with that. Similar
things would have happened if someone tried to insert rows into the table
using regular INSERT.

Similarly, INSERT policies will be applied when MERGE attempts to INSERT a
row into the table and error will be thrown if the row does not satisfy
INSERT policies.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-15 Thread Pavan Deolasee
Hi Andreas,

On Sun, Feb 4, 2018 at 5:45 PM, Andreas Seltenreich 
wrote:

>
>
> Attached are testcases that trigger the assertions when run against an
> empty database instead of the one left behind by make installcheck.  The
> "unrecognized node type" one appears to be a known issue according to
> the wiki, so I didn't bother doing the work for this one as well.
>

Thanks for doing those tests. I've just sent the v16a version of the patch and
I think it fixes the issues reported so far. Can you please recheck? Please
let me know if there are other issues detected by sqlsmith or otherwise.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-04-02 Thread Pavan Deolasee
On Wed, Mar 27, 2019 at 9:47 AM Masahiko Sawada 
wrote:

>
> The patch looks good to me. There is no comment from me.
>
>
Thanks for your review! Updated patch attached, since the patch failed to apply
after recent changes in master.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


copy_freeze_v5.patch
Description: Binary data


Re: Re: A separate table level option to control compression

2019-04-02 Thread Pavan Deolasee
On Wed, Apr 3, 2019 at 10:19 AM Michael Paquier  wrote:

>
>
> I have been looking at this patch more, and here are some notes:
> - The tests can be really simplified using directly reltoastrelid, so
> I changed the queries this way.  I am aware that the surroundings
> hardcode directly the relation name, but that's not really elegant in
> my opinion.  And I am really tempted to adjust these as well to
> directly use reltoastrelid.
>

I agree using reltoastrelid is a better way. I just followed the
surrounding tests, but +1 to change all of these to use reltoastrelid.


> - The docs had a weird indentation.
>

Sorry about that. I was under the impression that it doesn't matter since
the doc builder ultimately chooses the correct indentation. I had a patch
ready after Sawada's review, but since you already fixed that, I am not
sending it.


> - I think that we should be careful with the portability of
> pg_column_size(), so I have added comparisons instead of the direct
> values in a way which does not change the meaning of the tests nor
> their coverage.
>

Ok, understood.


> - Having RelationGetCompressTupleTarget use directly
> toast_tuple_threshold as default argument is I think kind of
> confusing, so let's use a different static value, named
> COMPRESS_TUPLE_TARGET in the attached.  This is similar to
> TOAST_TUPLE_TARGET for the toast tuple threshold.
>

Sounds good. This also takes care of the other issue you brought up about
"hoff" getting subtracted twice.


> - The comments in tuptoaster.h need to be updated to outline the
> difference between the compression invocation and the toast invocation
> thresholds.  The wording could be better though.
>

I see that you've done this already. But please let me know if more is needed.


> - Better to avoid comments in the middle of the else/if blocks in my
> opinion.
>

This is probably a personal preference and I've seen many other places
where we do that (see e.g. lazy_scan_heap()). But I don't have any strong
views on this. I see that you've updated comments in tuptoaster.h, so no
problem if you want to remove the comments from the code block.


>
> Also, the previous versions of the patch do that when doing a heap
> insertion (heapam.c and rewriteheap.c):
> +toast_tuple_threshold = RelationGetToastTupleTarget(relation,
> +TOAST_TUPLE_THRESHOLD);
> +compress_tuple_threshold =
> RelationGetCompressTupleTarget(relation,
> +toast_tuple_threshold);
> +compress_tuple_threshold = Min(compress_tuple_threshold,
> +   toast_tuple_threshold);
> [...]
> need_toast = (HeapTupleHasExternal(&oldtup) ||
>   HeapTupleHasExternal(newtup) ||
> -newtup->t_len > TOAST_TUPLE_THRESHOLD);
> +newtup->t_len > compress_tuple_threshold);
>
> This means that the original code always uses the compilation-time,
> default value of toast_tuple_target for all relations.  But this gets
> changed so as we would use the value set at relation level for
> toast_tuple_target if the reloption is changed, without touching
> compress_tuple_threshold.  This is out of the scope of this patch, but
> shouldn't we always use the relation-level value instead of the
> compiled one?  Perhaps there is something I am missing?
>

Yeah, this is an issue with the existing code. Even though we allow setting
toast_tuple_target to a value less than the compile-time TOAST_TUPLE_THRESHOLD,
the code doesn't really honour such a value. In other words, setting
toast_tuple_target to less than TOAST_TUPLE_THRESHOLD has no effect. We don't
even create a toast table if the estimated length of the tuple is not greater
than TOAST_TUPLE_THRESHOLD.

The change introduced by this patch will now trigger the tuptoaster code
when the compression or toast threshold is set to a value lower than
TOAST_TUPLE_THRESHOLD. But as far as I can see, that doesn't have any bad
effect on the toasting since toast_insert_or_update() is capable of
handling the case when the toast table is missing. There will be a
behavioural change though, e.g.:

CREATE TABLE testtab (a text) WITH (toast_tuple_target=256);
INSERT INTO testtab VALUES ;

Earlier this tuple would not have been toasted, even though
toast_tuple_target is set to 256. But with the new code, the tuple will get
toasted. So that's a change, but in the right direction as far as I can
see. Also, this is a bit unrelated to what this patch is trying to achieve.
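
To spell that scenario out, here is a sketch; the 1000-byte value is only an
illustration of something larger than the per-relation target but smaller
than the ~2KB compile-time threshold:

CREATE TABLE testtab (a text) WITH (toast_tuple_target = 256);
INSERT INTO testtab VALUES (repeat('x', 1000));

-- Without the patch the ~1KB tuple stays below TOAST_TUPLE_THRESHOLD and is
-- stored untouched; with the patch the relation-level threshold is honoured,
-- so the value is considered for compression/out-of-line storage and
-- pg_column_size() should report a much smaller on-disk size.
SELECT pg_column_size(a) FROM testtab;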

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-04-04 Thread Pavan Deolasee
Hi,

On Fri, Apr 5, 2019 at 9:05 AM Andres Freund  wrote:

>
>
> I think the right approach would be to do all of this in heap_insert and
> heap_multi_insert. Whenever starting to work on a page, if INSERT_FROZEN
> is specified, remember whether it is either currently empty, or is
> already marked as all-visible. If previously empty, mark it as all
> visible at the end. If already all visible, there's no need to change
> that. If not yet all-visible, no need to do anything, since it can't
> have been inserted with COPY FREEZE.


We're doing roughly the same. If we are running INSERT_FROZEN, whenever
we're about to switch to a new page, we check whether the previous page should
be marked all-frozen and mark it if so. The special code in copy.c is
necessary to take care of the last page, which we don't get to handle in the
regular code path.
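
For reference, this is the behaviour the patch is meant to produce, and it
can be observed with the pg_visibility extension; the data file path below is
of course just a placeholder:

CREATE EXTENSION IF NOT EXISTS pg_visibility;

BEGIN;
CREATE TABLE freeze_test (a int, b text);
-- /tmp/data.csv stands in for whatever data is being loaded.
COPY freeze_test FROM '/tmp/data.csv' WITH (FREEZE, FORMAT csv);
COMMIT;

-- With the patch, the pages filled by COPY FREEZE should already show up as
-- all-visible and all-frozen here, without an intervening VACUUM.
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('freeze_test'::regclass);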

Or are you suggesting that we don't even rescan the page for all-frozen
tuples at the end, and instead simply mark it all-frozen at the start, when the
first tuple is inserted, and then don't touch the PD_ALL_VISIBLE/visibility
map bit as we go on inserting more tuples into the page?

Anyway, if major architectural changes are required then it's probably too
late to consider this for PG12, even though it's more of a bug fix and a
candidate for back-patching too.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-04-04 Thread Pavan Deolasee
On Fri, Apr 5, 2019 at 8:37 AM Andres Freund  wrote:

> Hi,
>
> On 2019-04-05 00:06:04 -0300, Alvaro Herrera wrote:
>
> >
> > Hmm, isn't there already a critical section in visibilitymap_set itself?
>
> There is, but the proposed code sets all visible on the page, and marks
> the buffer dirty, before calling visibilitymap_set.
>

How is it any different from what we're doing at vacuumlazy.c:1322? We set
the page-level bit, mark the buffer dirty and then call
visibilitymap_set(), all outside a critical section.

1300         /* mark page all-visible, if appropriate */
1301         if (all_visible && !all_visible_according_to_vm)
1302         {
1303             uint8       flags = VISIBILITYMAP_ALL_VISIBLE;
1304 
1305             if (all_frozen)
1306                 flags |= VISIBILITYMAP_ALL_FROZEN;
1307 
1308             /*
1309              * It should never be the case that the visibility map page is set
1310              * while the page-level bit is clear, but the reverse is allowed
1311              * (if checksums are not enabled).  Regardless, set the both bits
1312              * so that we get back in sync.
1313              *
1314              * NB: If the heap page is all-visible but the VM bit is not set,
1315              * we don't need to dirty the heap page.  However, if checksums
1316              * are enabled, we do need to make sure that the heap page is
1317              * dirtied before passing it to visibilitymap_set(), because it
1318              * may be logged.  Given that this situation should only happen in
1319              * rare cases after a crash, it is not worth optimizing.
1320              */
1321             PageSetAllVisible(page);
1322             MarkBufferDirty(buf);
1323             visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
1324                               vmbuffer, visibility_cutoff_xid, flags);
1325         }

As the first para in that comment says, I thought it's OK for the page-level
bit to be set while the visibility map bit is clear, but not vice versa.
The proposed code does not introduce any new behaviour AFAICS. But I might
be missing something.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Caveats from reloption toast_tuple_target

2019-04-04 Thread Pavan Deolasee
On Thu, Apr 4, 2019 at 11:36 AM Michael Paquier  wrote:

>
>
> I mean that toast_tuple_target is broken as-is, because it should be
> used on the new tuples of a relation as a threshold to decide if this
> tuple should be toasted or not, but we don't actually use the
> reloption value for that decision-making: the default threshold
> TOAST_TUPLE_THRESHOLD gets used instead all the time!  The code does
> not even create a toast table in some cases.
>

I think the problem with the existing code is that while it allows users to
set toast_tuple_target to be less than TOAST_TUPLE_THRESHOLD, the same is
not honoured while deciding whether to toast a row or not. AFAICS it works
OK when toast_tuple_target is greater than or equal to
TOAST_TUPLE_THRESHOLD, i.e. it won't toast the rows unless they are larger
than toast_tuple_target.

IMV it makes sense to simply cap the lower limit of toast_tuple_target to
the compile-time default and update the docs to reflect that. Otherwise, we
need to deal with the possibility of dynamically creating the toast table
if the relation option is lowered after creating the table. Your proposed
patch handles the case where toast_tuple_target is specified at CREATE
TABLE, but we would still have a problem with ALTER TABLE, no? But there
might be side effects of changing the lower limit for pg_dump/pg_restore.
So we would need to think about that too.
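
As a sketch of the ALTER TABLE side of it (the table name and sizes are
hypothetical), consider a table whose estimated maximum row size is below
TOAST_TUPLE_THRESHOLD, so no toast table is created, and the reloption is
lowered only afterwards:

-- Estimated max tuple size is well below TOAST_TUPLE_THRESHOLD, so no toast
-- table is created for this relation.
CREATE TABLE narrow_tab (a varchar(500));
SELECT reltoastrelid <> 0 AS has_toast_table
FROM pg_class WHERE relname = 'narrow_tab';

-- Lowering the threshold afterwards does not create a toast table, which is
-- why honouring values below the compile-time threshold would require
-- creating the toast table on the fly (or capping the lower limit instead).
ALTER TABLE narrow_tab SET (toast_tuple_target = 256);
SELECT reltoastrelid <> 0 AS has_toast_table
FROM pg_class WHERE relname = 'narrow_tab';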

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-05-29 Thread Pavan Deolasee
On Tue, 28 May 2019 at 4:36 PM, Andres Freund  wrote:

> Hi,
>
> On 2019-04-07 18:04:27 -0700, Andres Freund wrote:
> > Here's a *prototype* patch for this.  It only implements what I
> > described for heap_multi_insert, not for plain inserts. I wanted to see
> > what others think before investing additional time.
>
> Pavan, are you planning to work on this for v13 CF1? Or have you lost
> interest on the topic?
>

Yes, I plan to work on it.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: MERGE SQL statement for PG12

2018-11-21 Thread Pavan Deolasee
Hi Tomas,

Sorry for a delayed response.

On Mon, Oct 29, 2018 at 4:59 PM Tomas Vondra 
wrote:

> Hi Pavan,
>
> On 10/29/2018 10:23 AM, Pavan Deolasee wrote:
> >
> > ...
> >
> > Thanks for keeping an eye on the patch. I've rebased the patch
> > against the current master. A new version is attached.
> >
> > Thanks,
> > Pavan
> >
>
> It's good to see the patch moving forward. What are your thoughts
> regarding things that need to be improved/fixed to make it committable?
>
> I briefly discussed the patch with a couple of people at pgconf.eu last
> week, and IIRC the main concern was how the merge is represented in
> parser/executor.
>
>
Yes, Andres and to some extent Peter G had expressed concerns about that.
Andres suggested that we should rework the parser and executor side. Here
are some excerpts from his review comments.


"I don't think the parser / executor implementation
of MERGE is architecturally sound.  I think creating hidden joins during
parse-analysis to implement MERGE is a seriously bad idea and it needs
to be replaced by a different executor structure."

+* As top-level statements INSERT, UPDATE and DELETE have a Query,
whereas
+* with MERGE the individual actions do not require separate planning,
+* only different handling in the executor. See nodeModifyTable handling
+* of commandType CMD_MERGE.
+*
+* A sub-query can include the Target, but otherwise the sub-query
cannot
+* reference the outermost Target table at all.
+*/
+   qry->querySource = QSRC_PARSER;

Why is this, and not building a proper executor node for merge that
knows how to get the tuples, the right approach?  We did a rough
equivalent for matview updates, and I think it turned out to be a pretty
bad plan.



On Fri, Apr 6, 2018 at 1:30 AM, Andres Freund  wrote:

>
>
> My impression is that this simply shouldn't be going through
> nodeModifyTuple, but be it's own nodeMerge node. The trigger handling
> would need to be abstraced into execTrigger.c or such to avoid
> duplication.  We're now going from nodeModifyTable.c:ExecModifyTable()
> into execMerge.c:ExecMerge(), back to
> nodeModifyTable.c:ExecUpdate/Insert(). To avoid ExecInsert() doing
> things that aren't appropriate for merge, we then pass an actionState,
> which neuters part of ExecUpdate/Insert(). This is just bad.
>



To be honest, I need more direction on how to address these major
architectural concerns. I had tried to rework the executor part and had
commented on that. But I know that was probably done in a hurry since we
were trying to save a revert. Having said that, I am still not very sure
how exactly the executor side should look without causing too much
duplication between the handling in nodeModifyTable() and the proposed nodeMerge(). If
Andres can give me further inputs, I will be more than happy to figure out
the details and improve the patch.

As far as the parser side goes, do I understand correctly that instead of
building a special join in transformMergeStmt() as the patch does today, we
should follow what transformUpdateStmt() does for a potential join, i.e.
just have a single RTE for the source relation and present it as the left
hand side for the implicit join? If I get a little more direction on this,
I can try to address the issues.

It looks very likely that the patch won't get much review in the current
state. But if I get inputs, I can work out the details and try to get the
patch into a committable state.

BTW I am aware that the patch is bit-rotten because of the partitioning
related changes. I will address those and post a revised patch soon.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-27 Thread Pavan Deolasee
On Fri, Feb 16, 2018 at 6:37 AM, Peter Geoghegan  wrote:

>
> ISTM that a MERGE isn't really a thing that replaces 2 or 3 other DML
> statements, at least in most cases. It's more like a replacement for
> procedural code with an outer join, with an INSERT, UPDATE or DELETE
> that affects zero or one rows inside the procedural loop that
> processes matching/non-matching rows. The equivalent procedural code
> could ultimately perform *thousands* of snapshot acquisitions for
> thousands of RC DML statements. MERGE is sometimes explained in terms
> of "here is the kind of procedural code that you don't have to write
> anymore, thanks to MERGE" -- that's what the code looks like.
>
> I attach a rough example of this, that uses plpgsql.
>

Thanks for writing the sample code. I understand you probably don't mean to
suggest that we need to mimic the behaviour of the plpgsql code, and the
semantics offered by MERGE would most likely be different from what the
plpgsql sample does, because there are several problems with the plpgsql
code:

- It would never turn a MATCHED case into a NOT MATCHED case because of
concurrent UPDATE/DELETE
- The WHERE clauses attached to the UPDATE/DELETE statement should be using
the quals attached to the WHEN clauses to ensure they are evaluated on the
new version of the row, if needed.


>
> >> Some novel new behavior -- "EPQ with a twist"-- is clearly necessary.
> >> I feel a bit uneasy about it because anything that anybody suggests is
> >> likely to be at least a bit arbitrary (EPQ itself is kind of
> >> arbitrary). We only get to make a decision on how "EPQ with a twist"
> >> will work once, and that should be a decision that is made following
> >> careful deliberation. Ambiguity is much more likely to kill a patch
> >> than a specific technical defect, at least in my experience. Somebody
> >> can usually just fix a technical defect.
>

TBH that's one reason why I like Simon's proposed behaviour of throwing
errors in case of corner cases. I am not suggesting that's what we do at
the end, but it's definitely worth considering.


> >
> >
> > While I agree, I think we need to make these decisions in a time bound
> > fashion. If there is too much ambiguity, then it's not a bad idea to
> settle
> > for throwing appropriate errors instead of providing semantically wrong
> > answers, even in some remote corner case.
>
> Everything is still on the table, I think.
>

Ok.


>
> > Ok. I am now back from holidays and I will too start thinking about this.
> > I've also requested a colleague to help us with comparing it against
> > Oracle's behaviour. N That's not a gold standard for us, but knowing how
> > other major databases handle RC conflicts, is not a bad idea.
>
> The fact that Oracle doesn't allow WHEN MATCHED ... AND quals did seem
> like it might be significant to me.
>
>
Here are some observations from Rahila's analysis so far. I must say,
Oracle's handling seems quite inconsistent, especially the conditions under
which it sometimes re-evaluates the join and sometimes doesn't.

- Oracle does not support multiple WHEN MATCHED clauses. So the question of
re-checking all WHEN clauses does not arise.

- Only one UPDATE and one DELETE clause is supported. The DELETE must be
used in conjunction with UPDATE.

- The DELETE clause is invoked iff the UPDATE clause is invoked. It works
on the updated rows. Since the row is already updated (and locked) by the
MERGE, DELETE action never blocks on a concurrent update/delete

- MERGE does not allow updating the column used in the JOIN's ON qual

- In case of concurrent UPDATE, the join is re-evaluated iff the concurrent
  UPDATE updates (modifies?) the same column that MERGE is updating, OR a
  column that MERGE is referencing in the WHERE clause is updated by the
  concurrent update. IOW if the MERGE and the concurrent UPDATE are operating
  on different columns, the join is NOT re-evaluated, thus possibly invoking a
  WHEN MATCHED action on a row which no longer matches the join condition.

- In case of concurrent DELETE, the join is re-evaluated and the action may
change from MATCHED to NOT MATCHED

I am curiously surprised by its behaviour of re-evaluating the join only when
certain columns are updated. It looks to me that irrespective of what we choose,
our implementation would be much superior to what Oracle offers.

BTW I've sent v17a of the patch, which is very close to being complete from
my perspective (except some documentation fixes/improvements). The only
thing pending is the decision to accept or change the currently implemented
concurrency semantics.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-01 Thread Pavan Deolasee
On Thu, Mar 1, 2018 at 3:50 AM, Peter Geoghegan  wrote:

> On Fri, Feb 9, 2018 at 6:36 AM, Robert Haas  wrote:
> > Here's my $0.02: I think that new concurrency errors thrown by the
> > merge code itself deserve strict scrutiny and can survive only if they
> > have a compelling justification, but if the merge code happens to
> > choose an action that leads to a concurrency error of its own, I think
> > that deserves only mild scrutiny.
> >
> > On that basis, of the options I listed in
> > http://postgr.es/m/CA+TgmoZDL-caukHkWet7sr7sqr0-e2T91+
> devhqen5sfqsm...@mail.gmail.com
> > I like #1 least.
> >
> > I also dislike #4 from that list for the reasons stated there.  For
> > example, if you say WHEN MATCHED AND x.some_boolean and then WHEN
> > MATCHED, you expect that every tuple that hits the latter clause will
> > have that Boolean as false or null, but #4 makes that not true.
> >
> > I think the best options are #2 and #5 -- #2 because it's simple, and
> > #5 because it's (maybe) more intuitive, albeit at the risk of
> > livelock.  But I'm willing to convinced of some other option; like
> > you, I'm willing to be open-minded about this.  But, as you say, we
> > need a coherent, well-considered justification for the selected
> > option, not just "well, this is what we implemented so you should like
> > it".  The specification should define the implementation, not the
> > reverse.
>
> At first I hated option #2. I'm warming to #2 a lot now, though,
> because I've come to understand the patch's approach a little better.
> (Pavan and Simon should still verify that I got things right in my
> mail from earlier today, though.)
>
> It now seems like the patch throws a RC serialization error more or
> less only due to concurrent deletions (rarely, it will actually be a
> concurrent update that changed the join qual values of our MERGE).
> We're already not throwing the error (we just move on to the next
> input row from the join) when we happen to not have a WHEN NOT MATCHED
> case. But why even make that distinction? Why not just go ahead and
> WHEN NOT MATCHED ... INSERT when the MERGE happens to have such a
> clause? The INSERT will succeed, barring any concurrent conflicting
> insertion by a third session -- a hazard that has nothing to do with
> RC conflict handling in particular.
>
> Just inserting without throwing a RC serialization error is almost
> equivalent to a new MVCC snapshot being acquired due to a RC conflict,
> so all of this now looks okay to me. Especially because every other
> MERGE implementation seems to have serious issues with doing anything
> too fancy with the MERGE target table's quals within the main ON join.
> I think that SQL Server actually assumes that you're using the
> target's PK in a simple equi-join. All the examples look like that,
> and this assumption is heavily suggested by the "Do not attempt to
> improve query performance by filtering out rows in the target table in
> the ON clause" weasel words from their docs, that I referenced in my
> mail from earlier today.
>
>
I think you've fairly accurately described what the patch does today. I
take your point that we can very well just execute the WHEN NOT MATCHED
action if the join condition fails for the updated tuple. There is one case
we ought to think about though, and it might explain why executing the
WHEN NOT MATCHED action may not be the best choice. The same applies to
skipping the target tuple when no NOT MATCHED action exists, as the patch
does today.

What if the updated tuple fails the join qual with respect to the current
tuple from the source relation but it now matches some other tuple from the
source relation? I described this case in one of the earlier emails too. In
this case, we might end up doing an INSERT (if we decide to execute WHEN
NOT MATCHED action), even though a MATCH exists. If there is no WHEN NOT
MATCHED action, the current patch will just skip the updated tuple even
though a match exists, albeit it's not the current source tuple.

Oracle behaves differently and it actually finds a new matching tuple from
the source relation and executes the WHEN MATCHED action, using that source
tuple. But I am seriously doubtful that we want to go down that path and
whether it's even feasible. Our regular UPDATE .. FROM does not do that
either. Given that, it seems better to just throw an error (even when no
NOT MATCHED action exists) and explain to the users that MERGE will work as
long as concurrent updates don't modify the columns used in the join
condition. Concurrent deletes should be fine and we may actually even
invoke WHEN NOT MATCHED action in that case.
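
As a concrete sketch of that case (the schema and values are made up, and
the MERGE syntax is the one from the patch):

CREATE TABLE target (key int, val text);
CREATE TABLE source (key int, val text);
INSERT INTO target VALUES (1, 'old');
INSERT INTO source VALUES (1, 's1'), (2, 's2');

-- Session 1: move the target row so it joins to source row 2 instead of 1.
BEGIN;
UPDATE target SET key = 2 WHERE key = 1;

-- Session 2: blocks on the row locked by session 1.
MERGE INTO target t
USING source s ON t.key = s.key
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT VALUES (s.key, s.val);

-- Session 1: COMMIT.  After EPQ rechecking, the target row no longer matches
-- source row 1 but does match source row 2; treating the case as NOT MATCHED
-- would insert (1, 's1') even though a match for the row exists.
COMMIT;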

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ON CONFLICT DO UPDATE for partitioned tables

2018-03-05 Thread Pavan Deolasee
On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera 
wrote:

>
>
> Re. the "ugly hack" comments in adjust_inherited_tlist(), I'm confused:
> expand_targetlist() runs *after*, not before, so how could it have
> affected the result?
>

If I understand correctly, the planner must have called expand_targetlist()
once for the parent relation's descriptor and added any dropped columns
from the parent relation. So we may not find mapped attributes for those
dropped columns in the parent. I haven't actually tested this case though.

I wonder if it makes sense to actually avoid expanding the on-conflict-set
targetlist in case the target is a partitioned table and deal with it during
execution, like we are doing now.


> I'm obviously confused about what
> expand_targetlist call this comment is talking about.  Anyway I wanted
> to make it use resjunk entries instead, but that broke some other case
> that I didn't have time to research yesterday.  I'll get back to this
> soon, but in the meantime, here's what I have.
>

+   conv_setproj =
+       adjust_and_expand_partition_tlist(RelationGetDescr(firstResultRel),
+                                         RelationGetDescr(partrel),
+                                         RelationGetRelationName(partrel),
+                                         firstVarno,
+                                         conv_setproj);

Aren't we adding back Vars referring to the parent relation, after
having converted the existing ones to use the partition relation's varno? Maybe
that works because missing attributes are already added during planning
and expand_targetlist() here only adds dropped columns, which are just NULL
constants.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-03-05 Thread Pavan Deolasee
On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan  wrote:

> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire 
> wrote:
>
> > I believe PKs are a prime candidate for this optimization, and
> > expecting it to apply only when no concurrency is involved is severely
> > dumbing down the optimization.
>
> Pavan justified the patch using a benchmark that only involved a
> single client -- hardly typical for a patch that changes the B-Tree
> code. If the benefits with many clients can be shown to matter, that
> will make this much more interesting to me.


Ok. I will repeat those tests with a larger number of clients and report back.

Regarding your suggestion about using page LSN to detect intermediate
activity, my concern is that unless we store that value in shared memory,
concurrent backends, even if inserting values in order, will make the
backend-local cached page LSN invalid and the optimisation will not
kick in.

I am yet to digest the entire conversation between you and Claudio; you
guys clearly understand b-tree internals better than me. It seems while
you're worried about missing out on something, Claudio feels that we can
find a safe way just looking at the information available in the current
page. I feel the same way, but will need to re-read the discussion
carefully again.

Simon had raised concerns about DESC indexes and whether we need to do the
checks for leftmost page in that case. I haven't yet figured out if DESC
indexes are actually stored in the reverse order. I am gonna look at that
too.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-07 Thread Pavan Deolasee
On Wed, Feb 28, 2018 at 12:38 PM, Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> wrote:

> On Wed, Feb 14, 2018 at 5:44 PM, Amit Kapila 
> wrote:
>
>> +# Concurrency error from GetTupleForTrigger
>> +# Concurrency error from ExecLockRows
>>
>> I think you don't need to mention above sentences in spec files.
>> Apart from that, your patch looks good to me.  I have marked it as
>> Ready For Committer.
>>
>
> I too have tested this feature with isolation framework and this look good
> to me.
>
>
It looks to me that we are trying to fix only one issue here with
concurrent updates. What happens if a non-partition key is first updated
and then a second session updates the partition key?

For example, with your patches applied:

CREATE TABLE pa_target (key integer, val text)
PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
INSERT INTO pa_target VALUES (1, 'initial1');

session1:
BEGIN;
UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
UPDATE 1
postgres=# SELECT * FROM pa_target ;
 key |             val
-----+------------------------------
   1 | initial1 *updated by update1*
(1 row)

session2:
UPDATE pa_target SET val = val || ' updated by update2', key = key + 1
WHERE key = 1


session1:
postgres=# COMMIT;
COMMIT



postgres=# SELECT * FROM pa_target ;
 key |             val
-----+-----------------------------
   2 | initial1 updated by update2
(1 row)

Ouch. The committed updates by session1 are overwritten by session2. This
clearly violates the rules that the rest of the system obeys and is not
acceptable IMHO.

Clearly, ExecUpdate() while moving rows between partitions is missing out
on re-constructing the to-be-updated tuple, based on the latest tuple in
the update chain. Instead, it's simply deleting the latest tuple and
inserting a new tuple in the new partition based on the old tuple. That's
simply wrong.

I haven't really thought carefully to see if this should be a separate
patch, but it warrants attention. We should at least think through all
different concurrency aspects of partition key updates and think about a
holistic solution, instead of fixing one problem at a time. This probably
also shows that isolation tests for partition key updates are either
missing (I haven't checked) or they need more work.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-07 Thread Pavan Deolasee
On Tue, Feb 13, 2018 at 12:41 PM, amul sul  wrote:

>
> Thanks for the confirmation, updated patch attached.
>
>
I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
not do anything to deal with the fact that t_ctid may no longer point to
itself to mark the end of the chain. I just can't see how that would work. But
if it does, it needs a good amount of comments explaining why, and most likely
updating comments at other places where chain following is done. For
example, how is this code in heap_get_latest_tid() still valid? Aren't we
setting "ctid" to some invalid value here?

2302         /*
2303          * If there's a valid t_ctid link, follow it, else we're done.
2304          */
2305         if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
2306             HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
2307             ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
2308         {
2309             UnlockReleaseBuffer(buffer);
2310             break;
2311         }
2312 
2313         ctid = tp.t_data->t_ctid;

This is just one example. I am almost certain there are many such cases
that will require careful attention.

What happens if a partition key update deletes a row, but the operation is
aborted? Do we need any special handling for that case?

I am actually worried that we're tinkering with ip_blkid to handle one
corner case of detecting a partition key update. This is going to change the
on-disk format and probably needs more careful attention. Are we certain
that we would never require update-chain following when partition keys are
updated? If so, can we think about some other mechanism which actually even
leaves behind ? I am not saying we should do that,
but it warrants a thought. Maybe it was discussed somewhere else and ruled
out. I happened to notice this patch because of the bug I encountered.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-07 Thread Pavan Deolasee
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila 
wrote:

> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>  wrote:
> >
> > On Tue, Feb 13, 2018 at 12:41 PM, amul sul  wrote:
> >>
> >> Thanks for the confirmation, updated patch attached.
> >>
> >
> > I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch
> does not
> > do anything to deal with the fact that t_ctid may no longer point to
> itself
> > to mark end of the chain. I just can't see how that would work.
> >
>
> I think it is not that patch doesn't care about the end of the chain.
>  For example, ctid pointing to itself is used to ensure that for
> deleted rows, nothing more needs to be done like below check in the
> ExecUpdate/ExecDelete code path.
>

Yeah, but it only looks for places where it needs to detect deleted tuples
and thus wants to throw an error. I am worried about other places where it
is assumed that the ctid points to a valid-looking tid, self or otherwise.
I see no such places being either updated or commented.

Now maybe there is no danger because of other protections in place, but it
looks hazardous.


>
> I have not tested, but it seems this could be problematic, but I feel
> we could deal with such cases by checking invalid block id in the
> above if check.  Another one such case is in EvalPlanQualFetch
>
>
Right.


>
> > What happens if a partition key update deletes a row, but the operation
> is
> > aborted? Do we need any special handling for that case?
> >
>
> If the transaction is aborted than future updates would update the
> ctid to a new row, do you see any problem with it?
>

I don't know. Maybe there is none. But it needs to be explained why it's not
a problem.


>
> > I am actually worried that we're tinkering with ip_blkid to handle one
> > corner case of detecting partition key update. This is going to change
> > on-disk format and probably need more careful attention. Are we certain
> that
> > we would never require update-chain following when partition keys are
> > updated?
> >
>
> I think we should never need update-chain following when the row is
> moved from one partition to another partition, otherwise, we don't
> change anything on the tuple.
>

I am not sure I follow. I understand that it's probably a tough problem to
follow the update chain from one partition to another. But why do you think we
would never need that? What if someone wants to improve on the restriction
this patch is imposing and actually implement partition key UPDATEs the way
we do for regular tables, i.e. instead of throwing an error, we actually
update/delete the row in the new partition?


>
> > If so, can we think about some other mechanism which actually even
> > leaves behind ? I am not saying we should do
> that,
> > but it warrants a thought.
> >
>
> Oh, this would much bigger disk-format change and need much more
> thoughts, where will we store new partition information.
>

Yeah, but the disk format will probably change just once. Or maybe
this can be done locally to a partition table without requiring any disk
format changes? Like adding a nullable hidden column in each partition to
store the forward pointer?

Thanks,
Pavan


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-08 Thread Pavan Deolasee
On Thu, Mar 8, 2018 at 10:27 PM, Robert Haas  wrote:

>
> However, there's no such thing as a free lunch.  We can't use the CTID
> field to point to a CTID in another table because there's no room to
> include the identify of the other table in the field.  We can't widen
> it to make room because that would break on-disk compatibility and
> bloat our already-too-big tuple headers.  So, we cannot make it work
> like it does when the updates are confined to a single partition.
> Therefore, the only options are (1) ignore the problem, and let a
> cross-partition update look entirely like a delete+insert, (2) try to
> throw some error in the case where this introduces user-visible
> anomalies that wouldn't be visible otherwise, or (3) revert update
> tuple routing entirely.  I voted for (1), but the consensus was (2).
> I think that (3) will make a lot of people sad; it's a very good
> feature.


I am definitely not suggesting we do #3, though I agree with Tom that the
option is on the table. Maybe two back-to-back bugs in the area make me
worried and raise questions about the amount of testing the feature has
got. In addition, making such a significant on-disk change for one corner
case, for which even #1 might be acceptable, seems like a lot. If we at all want
to go in that direction, I would suggest considering a patch that I wrote
last year to free up additional bits from the ctid field (as part of the
WARM). I know Tom did not like that either, but at the very least, it
provides us a lot more room for future work, with the same amount of risk.


> If we want to have (2), then we've got to have some way to
> mark a tuple that was deleted as part of a cross-partition update, and
> that requires a change to the on-disk format.
>

I think the question is: isn't there an alternate way to achieve the same
result? One alternate way would be to do what I suggested above i.e. free
up more bits and use one of those. Another way would be to add a hidden
column to the partition table, when it is created or when it is attached as
a partition. This only penalises the partition tables, but keeps rest of
the system out of it. Obviously, if this column is added when the table is
attached as a partition, as against at table creation time, then the old
tuple may not have room to store this additional field. May be we can
handle that by double updating the tuple? That seems bad, but then it only
impacts the case when a partition key is updated. And we can clearly
document performance implications of that operation. I am not sure how
common this case is going to be anyways. With this hidden column, we can
even store a pointer to another partition and do something with that, if at
all needed.

That's just one idea. Of course, I haven't thought about it for more than
10mins, so most likely I may have missed out on details and it's probably a
stupid idea afterall. But there could be other ideas too. And even if we
can't find one, my vote would be to settle for #1 instead of trying to do
#2.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-03-09 Thread Pavan Deolasee
On Tue, Mar 6, 2018 at 10:10 AM, Pavan Deolasee 
wrote:

>
>
> On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan  wrote:
>
>> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire 
>> wrote:
>>
>> > I believe PKs are a prime candidate for this optimization, and
>> > expecting it to apply only when no concurrency is involved is severely
>> > dumbing down the optimization.
>>
>> Pavan justified the patch using a benchmark that only involved a
>> single client -- hardly typical for a patch that changes the B-Tree
>> code. If the benefits with many clients can be shown to matter, that
>> will make this much more interesting to me.
>
>
> Ok. I will repeat those tests with more number of clients and report back.
>
>
So I repeated the tests with 1, 2, 4 and 8 clients, each running the
following statement and a total of 1024 transactions. So roughly 100M rows
are inserted.

INSERT INTO testtab(b) SELECT generate_series(1,10);

The table definition is:
postgres=# \d+ testtab
   Table "public.testtab"
 Column |  Type  | Collation | Nullable |  Default
 | Storage | Stats target | Description
++---+--++-+--+-
 a  | bigint |   | not null |
nextval('testtab_a_seq'::regclass) | plain   |  |
 b  | bigint |   |  |
  | plain   |  |
Indexes:
"testtab_a_key" UNIQUE CONSTRAINT, btree (a)


After taking the average of 3 runs:

+---------+-----------------------+----------------------+
| clients | Patched - time in sec | Master - time in sec |
+---------+-----------------------+----------------------+
| 1       | 311.8643602           | 411.832757           |
+---------+-----------------------+----------------------+
| 2       | 252.5433              | 300.7875613          |
+---------+-----------------------+----------------------+
| 4       | 337.0414279           | 350.9636766          |
+---------+-----------------------+----------------------+
| 8       | 444.2035582           | 477.1903417          |
+---------+-----------------------+----------------------+

So yes, the benefits of the patch go down with higher number of clients,
but it does not entirely vanish.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-10 Thread Pavan Deolasee
Everything was working until we came to support partitioning.
Even with partitioning I could get everything to work, without duplicating
the RTE, except the duplicate rows issue. I don't know how to solve that
without doing what I've done or completely rewriting UPDATE/DELETE handling
for inheritance/partitioned tables. If you or others have better ideas, they
are most welcome.


> This is an extract from header comments for inheritance_planner()
> (master branch):
>
>  * We have to handle this case differently from cases where a source
> relation
>  * is an inheritance set. Source inheritance is expanded at the bottom of
> the
>  * plan tree (see allpaths.c), but target inheritance has to be expanded at
>  * the top.  The reason is that for UPDATE, each target relation needs a
>  * different targetlist matching its own column set.  Fortunately,
>  * the UPDATE/DELETE target can never be the nullable side of an outer
> join,
>  * so it's OK to generate the plan this way.
>
> Of course, that isn't true of MERGE: The MERGE target *can* be the
> nullable side of an outer join. That's probably a big complication for
> using one target RTE. Your approach to implementing partitioning [1]
> seems to benefit from having two different RTEs, in a way -- you
> sidestep the restriction.


Right. The entire purpose of having two different RTEs is to work around
this problem. I explained this approach here [1]. I didn't receive any
objections then, but that's mostly because nobody read it carefully. As I
said, if we have an alternate, feasible and better mechanism, let's go for
it as long as the effort is justifiable.

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/CABOikdM%2Bc1vB_%2B3tYEjO%3DJ6U2uNHzKU_b%3DU72tadD5-9xQcbHA%40mail.gmail.com

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-03-10 Thread Pavan Deolasee
On Sat, Mar 10, 2018 at 12:11 AM, Claudio Freire 
wrote:

> On Fri, Mar 9, 2018 at 2:54 PM, Pavan Deolasee 
> wrote:
> >
> >
>
> >
> > So yes, the benefits of the patch go down with higher number of clients,
> but
> > it does not entirely vanish.
>
> What if you implement my suggestion?
>
> That should improve the multi-client case considerably.
>


Yes, I will try that next - it seems like a good idea. So the idea would
be: check if the block is still the rightmost block and the insertion-key
is greater than the first key in the page. If those conditions are
satisfied, then we do a regular binary search within the page to find the
correct location. This might add an overhead of binary search when keys are
strictly ordered and a single client is inserting the data. If that becomes
a concern, we might be able to look for that special case too and optimise
for it too.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-12 Thread Pavan Deolasee
On Sun, Mar 11, 2018 at 11:18 AM, Peter Geoghegan  wrote:

>
> It sounds like we should try to thoroughly understand why these
> duplicates arose. Did you actually call EvalPlanQualSetPlan() for all
> subplans at the time?
>
>
The reason for duplicates or even wrong answers is quite simple. The way
UPDATE/DELETE currently works for partition table is that we expand the
inheritance tree for the parent result relation and then create a subplan
for each partition separately. This works fine, even when there exists a
FROM/USING clause in the UPDATE/DELETE statement because the final result
does not change irrespective of whether you first do a UNION ALL between
all partitions and then find the candidate rows or whether you find
candidate rows from individual partitions separately.

In case of MERGE though, since we are performing a RIGHT OUTER JOIN between
the result relation and the source relation, we may conclude that a
matching target  row does not exist for a source row, whereas it actually
exists but in some other partition. For example,

CREATE TABLE target (key int, val text) PARTITION BY LIST ( key);
CREATE TABLE part1 PARTITION OF target FOR VALUES IN (1, 2, 3);
CREATE TABLE part2 PARTITION OF target FOR VALUES IN (4, 5, 6);
CREATE TABLE source (skey integer);
INSERT INTO source VALUES (1), (4), (7);
INSERT INTO part1 VALUES (1, 'v1'), (2, 'v2'), (3, 'v3');
INSERT INTO part2 VALUES (4, 'v4'), (5, 'v5'), (6, 'v6');

postgres=# SELECT * FROM target RIGHT OUTER JOIN source ON key = skey;
 key | val | skey
-----+-----+------
   1 | v1  |    1
   4 | v4  |    4
     |     |    7
(3 rows)

This gives the right answer. But if we join individual partitions and then
do a UNION ALL,

postgres=# SELECT * FROM part1 RIGHT OUTER JOIN source ON key = skey UNION
ALL SELECT * FROM part2 RIGHT OUTER JOIN source ON key = skey;
 key | val | skey
-----+-----+------
   1 | v1  |    1
     |     |    4
     |     |    7
     |     |    1
   4 | v4  |    4
     |     |    7
(6 rows)

This is what nodeModifyTable does and hence we end up getting duplicates or
even incorrectly declared NOT MATCHED rows, whereas they are matched in a
different partition.

I don't think not calling EvalPlanQualSetPlan() on all subplans is a
problem because we really never execute those subplans. In fact, we should
fix that so that those subplans are not even initialised.


> As you know, there is an ON CONFLICT DO UPDATE + partitioning patch in
> the works from Alvaro. In your explanation about that approach that
> you cited, you wondered what the trouble might have been with ON
> CONFLICT + partitioning, and supposed that the issues were similar
> there. Are they? Has that turned up much?
>
>
Well, I initially thought that ON CONFLICT DO UPDATE on a partitioned table may
have the same challenges, but that's probably not the case. For INSERT ON
CONFLICT it's still just an INSERT path, with some special handling for
UPDATEs. Currently, for a partitioned or inherited table, UPDATEs/DELETEs go
via inheritance_planner(), thus expanding inheritance for the result
relation, whereas INSERTs go via the simple grouping_planner().

For MERGE, we do all three DMLs. That doesn't mean we could not
re-implement MERGE along the lines of INSERT, but that would most likely mean
completely rewriting UPDATE/DELETE handling for partitioned/inheritance
tables. The challenges would just be the same in both cases.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-03-13 Thread Pavan Deolasee
On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire 
wrote:

> On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
>
> >
> > Yes, I will try that next - it seems like a good idea. So the idea would
> be:
> > check if the block is still the rightmost block and the insertion-key is
> > greater than the first key in the page. If those conditions are
> satisfied,
> > then we do a regular binary search within the page to find the correct
> > location. This might add an overhead of binary search when keys are
> strictly
> > ordered and a single client is inserting the data. If that becomes a
> > concern, we might be able to look for that special case too and optimise
> for
> > it too.
>
> Yeah, pretty much that's the idea. Beware, if the new item doesn't
> fall in the rightmost place, you still need to check for serialization
> conflicts.
>

So I've been toying with this idea since yesterday and I am quite puzzled by
the results. See the attached patch, which compares the insertion key with
the last key inserted by this backend, if the cached block is still the
rightmost block in the tree. I initially only compared with the first key in
the page, but I tried this version because of the strange performance
regression, for which I still have no answer.

For a small number of clients, the patched version does better. But as the
number of clients goes up, the patched version significantly underperforms
master. I roughly counted the number of times the fastpath is taken and
noticed that almost 98% of inserts take the fastpath. I first thought that
the "firstkey" location in the page might be becoming a hot-spot for
concurrent processes, and hence changed that to track the per-backend last
offset and compare against that the next time. But that did not help much.

+---------+-------------------------------+--------------------------------+
| clients | Master - Avg load time in sec | Patched - Avg load time in sec |
+---------+-------------------------------+--------------------------------+
| 1       | 500.0725203                   | 347.632079                     |
+---------+-------------------------------+--------------------------------+
| 2       | 308.4580771                   | 263.9120163                    |
+---------+-------------------------------+--------------------------------+
| 4       | 359.4851779                   | 514.7187444                    |
+---------+-------------------------------+--------------------------------+
| 8       | 476.4062592                   | 780.2540855                    |
+---------+-------------------------------+--------------------------------+

The perf data does not show anything interesting either. There is a
reduction in CPU time spent in btree-related code in the patched version,
but the overall execution time to insert the same number of records goes up
significantly.

Perf (master):
==============

+   72.59%  1.81%  postgres  postgres  [.] ExecInsert
+   47.55%  1.27%  postgres  postgres  [.] ExecInsertIndexTuples
+   44.24%  0.48%  postgres  postgres  [.] btinsert
-   42.40%  0.87%  postgres  postgres  [.] _bt_doinsert
   - 41.52% _bt_doinsert
      + 21.14% _bt_search
      + 12.57% _bt_insertonpg
      + 2.03% _bt_binsrch
        1.60% _bt_mkscankey
        1.20% LWLockAcquire
      + 1.03% _bt_freestack
        0.67% LWLockRelease
        0.57% _bt_check_unique
   + 0.87% _start
+   26.03%  0.95%  postgres  postgres  [.] ExecScan
+   21.14%  0.82%  postgres  postgres  [.] _bt_search
+   20.70%  1.31%  postgres  postgres  [.] ExecInterpExpr
+   19.05%  1.14%  postgres  postgres  [.] heap_insert
+   18.84%  1.16%  postgres  postgres  [.] nextval_internal
+   18.08%  0.84%  postgres  postgres  [.] ReadBufferExtended
+   17.24%  2.03%  postgres  postgres  [.] ReadBuffer_common
+   12.57%  0.59%  postgres  postgres  [.] _bt_insertonpg
+   11.12%  1.63%  postgres  postgres  [.] XLogInsert
+    9.90%  0.10%  postgres  postgres  [.] _bt_relandgetbuf
+    8.97%  1.16%  postgres  postgres  [.] LWLockAcquire
+    8.42%  2.03%  postgres  postgres  [.] XLogInsertRecord
+    7.26%  1.01%  postgres  postgres  [.] _bt_binsrch
+    7.07%  1.20%  postgres  postgres  [.] RelationGetBufferForTuple
+    6.27%  4.92%  postgres  postgres  [.] _bt_compare
+    5.97%  0.63%  postgres  postgres  [.] read_seq_tuple.isra.3
+    5.70%  4.89%  postgres  postgres  [.] hash_search_with_hash_value
+    5.44%  5.44%  postgres  postgres  [.] LWLockAttemptLock


Perf (Patched):
===============

+   69.33%  2.36%  postgres  postgres  [.] ExecInsert
+   35

Re: Faster inserts with mostly-monotonically increasing values

2018-03-13 Thread Pavan Deolasee
On Wed, Mar 14, 2018 at 10:58 AM, Claudio Freire 
wrote:

>
>
> I'm thinking there could be contention on some lock somewhere.
>
> Can you attach the benchmark script you're using so I can try to reproduce
> it?
>


I am using a very ad-hoc script, but here it is. It assumes the presence of
a branch named "btree_rightmost" with the patched code.

You will need to make necessary adjustments of course.

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

start_options="-c shared_buffers=16GB -c max_wal_size=64GB -c min_wal_size=16GB -c checkpoint_timeout=60min -c bgwriter_delay=100 -c bgwriter_lru_maxpages=100 -c bgwriter_lru_multiplier=3 -c bgwriter_flush_after=256 -c checkpoint_completion_target=0.9"
PGDATA=/data/pgsql

# Each pgbench transaction runs this statement once.
echo "INSERT INTO testtab(b) SELECT generate_series(1,10);" > s1.sql

for b in master btree_rightmost; do
    # Build and install the branch under its own prefix.
    git checkout $b > /dev/null 2>&1
    make -s clean > /dev/null 2>&1
    ./configure --prefix=$HOME/pg-install/$b
    make -s -j8 install > /dev/null 2>&1
    export PATH=$HOME/pg-install/$b/bin:$PATH
    for c in 1 2 4 8; do
        # Start each run from a freshly initialised cluster.
        pg_ctl -D $PGDATA -w stop
        rm -rf $PGDATA
        initdb -D $PGDATA
        pg_ctl -D $PGDATA -o "$start_options" -w start -l logfile.$b
        psql -c "CREATE TABLE testtab (a bigserial UNIQUE, b bigint);" postgres
        echo "$b $c" >> results.txt
        # Keep the total number of transactions constant across client counts.
        num_txns_per_client=$((1024 / $c))
        time pgbench -n -l -c $c -j 1 -t $num_txns_per_client -f s1.sql postgres >> results.txt
    done
done



Re: Faster inserts with mostly-monotonically increasing values

2018-03-14 Thread Pavan Deolasee
On Wed, Mar 14, 2018 at 4:53 PM, Simon Riggs 
wrote:

> On 14 March 2018 at 04:36, Pavan Deolasee 
> wrote:
> > I wonder if the shortened code path actually leads to
> > heavier contention for EXCLUSIVE lock on the rightmost page, which in
> turn
> > causes the slowdown. But that's just a theory. Any ideas how to prove or
> > disprove that theory or any other pointers?
>
> Certainly: dropping performance with higher sessions means increased
> contention is responsible. Your guess is likely correct.
>
> Suggest making the patch attempt a ConditionalLock on the target
> block, so if its contended we try from top of index. So basically only
> attempt the optimization if uncontended. This makes sense because in
> many cases if the block is locked it is filling up and the RHS block
> is more likely to change anyway.


Hmm. I can try that. It's quite puzzling though that slowing things down
actually makes them better.

I can also try to reduce the amount of time the EX lock is held by doing
some checks with a SHARE lock and then trading it for an EX lock if we
conclude that the fast path can be taken. We can do a page LSN check to
confirm that nothing changed while the lock was traded. Will check both
approaches.
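
For reference, the conditional-lock variant I have in mind is roughly the
following (only a sketch, as it would sit in nbtinsert.c with the usual
nbtree/bufmgr headers; the function name and signature are mine, and the
free-space, unique-index and serialization-conflict handling that the real
insert path needs are omitted):

/*
 * Sketch: try to reuse the cached rightmost block for a key that sorts
 * after everything inserted so far.  Returns true with an exclusive lock
 * held on *buf if the fastpath can be used; otherwise the caller descends
 * from the root via _bt_search() as usual.
 */
static bool
bt_try_cached_rightmost(Relation rel, BlockNumber cached_blkno,
                        ScanKey itup_scankey, int natts, Buffer *buf)
{
    Page        page;
    BTPageOpaque opaque;

    if (cached_blkno == InvalidBlockNumber)
        return false;

    *buf = ReadBuffer(rel, cached_blkno);

    /* Don't wait if the block is contended; just fall back. */
    if (!ConditionalLockBuffer(*buf))
    {
        ReleaseBuffer(*buf);
        return false;
    }

    page = BufferGetPage(*buf);
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    /*
     * The page must still be the rightmost leaf, must not be half-dead or
     * deleted, must not be empty, and the new key must sort after the last
     * item on the page.  The caller still checks free space before actually
     * inserting.
     */
    if (P_ISLEAF(opaque) && P_RIGHTMOST(opaque) && !P_IGNORE(opaque) &&
        PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(opaque) &&
        _bt_compare(rel, natts, itup_scankey, page,
                    PageGetMaxOffsetNumber(page)) > 0)
        return true;

    UnlockReleaseBuffer(*buf);
    return false;
}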

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-03-14 Thread Pavan Deolasee
On Wed, Mar 14, 2018 at 7:21 PM, Simon Riggs 
wrote:

> On 14 March 2018 at 13:33, Pavan Deolasee 
> wrote:
> >
> >
> > On Wed, Mar 14, 2018 at 4:53 PM, Simon Riggs <
> simon.ri...@2ndquadrant.com>
> > wrote:
> >>
> >> On 14 March 2018 at 04:36, Pavan Deolasee 
> >> wrote:
> >> > I wonder if the shortened code path actually leads to
> >> > heavier contention for EXCLUSIVE lock on the rightmost page, which in
> >> > turn
> >> > causes the slowdown. But that's just a theory. Any ideas how to prove
> or
> >> > disprove that theory or any other pointers?
> >>
> >> Certainly: dropping performance with higher sessions means increased
> >> contention is responsible. Your guess is likely correct.
> >>
> >> Suggest making the patch attempt a ConditionalLock on the target
> >> block, so if its contended we try from top of index. So basically only
> >> attempt the optimization if uncontended. This makes sense because in
> >> many cases if the block is locked it is filling up and the RHS block
> >> is more likely to change anyway.
> >
> >
> > Hmm. I can try that. It's quite puzzling though that slowing down things
> > actually make them better.
>
> That's not what is happening though.
>
> The cache path is 1) try to lock cached block, 2) when got lock check
> relevance, if stale 3) recheck from top
>
> The non-cached path is just 3) recheck from top
>
> The overall path length is longer in the cached case but provides
> benefit if we can skip step 3 in high % of cases. The non-cached path
> is more flexible because it locates the correct RHS block, even if it
> changes dynamically between starting the recheck from top.
>
>
So as I noted in one of the previous emails, the revised patch still takes
the fast path in 98% of cases. So it's not clear why taking steps 1, 2 and 3
in just 2% of cases should cause such a dramatic slowdown.

> If there is enough delay in step 1 then any benefit is lost anyway, so
> there is no point taking the cached path even when successful, so its
> better to ignore in that case.


The non-cached path is also going to land on the same page, it's just that
_bt_search() will ensure that it doesn't happen quickly. So my understanding
is that the additional time spent in _bt_search() accidentally reduces
contention on the EX lock and ends up improving net performance. I know it
sounds weird.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-15 Thread Pavan Deolasee
Hi Stephen,


On Fri, Mar 16, 2018 at 7:28 AM, Stephen Frost  wrote:

> Greetings Pavan, all,
>
> * Pavan Deolasee (pavan.deola...@gmail.com) wrote:
> > On 9 March 2018 at 08:29, Peter Geoghegan  wrote:
> > >  My #1 concern has become RLS, and
> > > perhaps only because I haven't studied it in enough detail.
> >
> > Sure. I've done what I thought is the right thing to do, but please
> check.
> > Stephen also wanted to review RLS related code; I don't know if he had
> > chance to do so.
>
> I've started looking through the code from an RLS perspective and, at
> least on my initial review, it looks alright.
>


Thanks for taking time out to review the patch. It certainly helps a lot.


>
> A couple of tests that aren't included that I think should be in the
> regression suite are where both tables have RLS policies and
> where the RLS tables have actual SELECT policies beyond just 'true'.
>
> I certainly see SELECT policies which limit the records that
> individuals are allowed to see very frequently and it's an
> important case to ensure works correctly.  I did a few tests here myself
> and they behaved as I expected, and reading through the code it looks
> reasonable, but they should be simple to write tests which run very
> quickly.
>

Ok. I will add those tests.
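
Roughly along these lines (a sketch only; the table, role and policy names
are made up, and the expected outputs still need to be worked out for the
regression test):

CREATE ROLE regress_rls_user;
CREATE TABLE rls_target (a int, b text);
CREATE TABLE rls_source (a int, b text);
ALTER TABLE rls_target ENABLE ROW LEVEL SECURITY;
ALTER TABLE rls_source ENABLE ROW LEVEL SECURITY;
GRANT ALL ON rls_target, rls_source TO regress_rls_user;

-- non-trivial SELECT policies on both tables, plus per-command policies
-- on the target
CREATE POLICY p_target_sel ON rls_target FOR SELECT USING (a % 2 = 0);
CREATE POLICY p_target_upd ON rls_target FOR UPDATE USING (b <> 'locked');
CREATE POLICY p_target_ins ON rls_target FOR INSERT WITH CHECK (a > 0);
CREATE POLICY p_source_sel ON rls_source FOR SELECT USING (a < 100);

-- run as a non-owner role so that the policies actually apply
SET SESSION AUTHORIZATION regress_rls_user;
MERGE INTO rls_target t
USING rls_source s ON t.a = s.a
WHEN MATCHED THEN UPDATE SET b = s.b
WHEN NOT MATCHED THEN INSERT VALUES (s.a, s.b);
RESET SESSION AUTHORIZATION;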


>
> I'm a bit on the fence about it, but as MERGE is added in quite a few
> places which previously mentioned UPDATE and DELETE throughout the
> system, I wonder if we shouldn't do better than this:
>
> =*# create policy p1 on t1 for merge using ((c1 % 2) = 0);
> ERROR:  syntax error at or near "merge"
> LINE 1: create policy p1 on t1 for merge using ((c1 % 2) = 0);
>
> Specifically, perhaps we should change that to pick up on MERGE being
> asked for there and return an error saying that policies for the MERGE
> command aren't supported with a HINT that MERGE respects
> INSERT/UPDATE/DELETE policies and that users should use those instead.
>

Hmm. I am not sure if that would be a good idea just for RLS. We might then
also want to change several other places in the grammar that currently
accept the INSERT/UPDATE/DELETE keywords to also recognise MERGE and handle
that during parse analysis. We can certainly do that, but I am not sure it
adds a lot of value, and it certainly adds a lot more code. We should surely
document these things, if we are not doing so already.


>
> Also, in nodeModifyTable.c, there's a comment:
>
>  * The WITH CHECK quals are applied in ExecUpdate() and hence we need
>  * not do anything special to handle them.
>
> Which I believe is actually getting at the fact that ExecUpdate() will
> run ExecWithCheckOptions(WCO_RLS_UPDATE_CHECK ...) and therefore in
> ExecMergeMatched() we don't need to check WCO_RLS_UPDATE_CHECK, but we
> do still need to check WCO_RLS_MERGE_UPDATE_CHECK (and that's what the
> code does).


Right.


>   One thing I wonder about there though is if we really need
> to segregate those..?  What's actually making WCO_RLS_MERGE_UPDATE_CHECK
> different from WCO_RLS_UPDATE_CHECK?  I get that the DELETE case is
> different, because in a regular DELETE we'll never even see the row, but
> for MERGE, we will see the row (assuming it passes SELECT policies, of
> course) and then will check if it matches and that's when we realize
> that we've been asked to run a DELETE, so we have the special-case of
> WCO_RLS_MERGE_DELETE_CHECK, but that's not true of UPDATE, so I think
> this might be adding a bit of unnecessary complication by introducing
> WCO_RLS_MERGE_UPDATE_CHECK.
>

I've modelled this code along the lines of ON CONFLICT DO NOTHING. And quite
similar to that, I believe we need separate handling for MERGE as well,
because we don't (and can't) push down the USING security quals of
UPDATE/DELETE to the scan. So we need to separately check that the target
row actually passes the USING quals. WCO_RLS_MERGE_UPDATE_CHECK and
WCO_RLS_MERGE_DELETE_CHECK are used for that purpose.

One point to deliberate though is whether it's a good idea to throw an error
when the USING quals fail, or whether we should silently ignore such rows.
Regular UPDATE/DELETE does the latter and ON CONFLICT DO UPDATE does the
former. I chose to throw an error because otherwise it may get confusing to
the user, since a row would neither be updated (it would appear as if it
were NOT MATCHED) nor be inserted. I hope that's not a problem.
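
To illustrate the case I mean (sketch only; the table and policy names are
made up, and again this assumes a non-owner role so that the policies
apply):

CREATE TABLE pt (a int, b text);
ALTER TABLE pt ENABLE ROW LEVEL SECURITY;
CREATE POLICY pt_sel ON pt FOR SELECT USING (true);
CREATE POLICY pt_upd ON pt FOR UPDATE USING (b <> 'locked');
INSERT INTO pt VALUES (1, 'locked');

-- the row is visible via the SELECT policy, so it is MATCHED, but the
-- UPDATE's USING qual rejects it; with the patch this errors out instead
-- of leaving the row neither updated nor inserted
MERGE INTO pt USING (SELECT 1 AS a, 'new' AS b) s ON pt.a = s.a
WHEN MATCHED THEN UPDATE SET b = s.b;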

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ON CONFLICT DO UPDATE for partitioned tables

2018-03-16 Thread Pavan Deolasee
On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera 
wrote:

>
>
@@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
int num_subplan_partition_offsets;
TupleTableSlot *partition_tuple_slot;
TupleTableSlot *root_tuple_slot;
+   List  **partition_arbiter_indexes;
+   TupleTableSlot **partition_conflproj_slots;
+   TupleTableSlot **partition_existing_slots;
 } PartitionTupleRouting;


I am curious why you decided to add these members to the
PartitionTupleRouting structure. Wouldn't ResultRelInfo be a better place to
track these, or is there some rule that we follow?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ON CONFLICT DO UPDATE for partitioned tables

2018-03-16 Thread Pavan Deolasee
On Fri, Mar 16, 2018 at 5:13 PM, Etsuro Fujita 
wrote:

> (2018/03/16 19:43), Pavan Deolasee wrote:
>
>> On Fri, Mar 2, 2018 at 9:06 PM, Alvaro Herrera > <mailto:alvhe...@2ndquadrant.com>> wrote:
>>
>
> @@ -106,6 +120,9 @@ typedef struct PartitionTupleRouting
>>  int num_subplan_partition_offsets;
>>  TupleTableSlot *partition_tuple_slot;
>>  TupleTableSlot *root_tuple_slot;
>> +   List  **partition_arbiter_indexes;
>> +   TupleTableSlot **partition_conflproj_slots;
>> +   TupleTableSlot **partition_existing_slots;
>>   } PartitionTupleRouting;
>>
>
> I am curious why you decided to add these members to
>> PartitionTupleRouting structure. Wouldn't ResultRelationInfo be a better
>> place to track these or is there some rule that we follow?
>>
>
> I just started reviewing the patch, so maybe I'm missing something, but I
> think it would be a good idea to have these in that structure, not in
> ResultRelInfo, because these would be required only for partitions chosen
> via tuple routing.
>

Hmm, yeah, you're probably right. The reason I got confused is that the
patch uses ri_onConflictSetProj/ri_onConflictSetWhere to store the converted
projection info/where clause for a given partition in its ResultRelInfo. So
I was wondering if we can move mt_arbiterindexes, mt_existing and
mt_conflproj to ResultRelInfo and then just use those per-partition
structures too.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-17 Thread Pavan Deolasee
On Mon, Mar 12, 2018 at 5:43 PM, Pavan Deolasee 
wrote:

>
>
> On Sun, Mar 11, 2018 at 11:18 AM, Peter Geoghegan  wrote:
>
>>
>>
>> As you know, there is an ON CONFLICT DO UPDATE + partitioning patch in
>> the works from Alvaro. In your explanation about that approach that
>> you cited, you wondered what the trouble might have been with ON
>> CONFLICT + partitioning, and supposed that the issues were similar
>> there. Are they? Has that turned up much?
>>
>>
> Well, I initially thought that ON CONFLICT DO UPDATE on partition table
> may have the same challenges, but that's probably not the case. For INSERT
> ON CONFLICT it's still just an INSERT path, with some special handling for
> UPDATEs. Currently, for partition or inherited table, UPDATEs/DELETEs go
> via inheritance_planner() thus expanding inheritance for the result
> relation where as INSERTs go via simple grouping_planner().
>
> For MERGE, we do all three DMLs. That doesn't mean we could not
> re-implement MERGE on the lines of INSERTs, but that would most likely mean
> complete re-writing of the UPDATEs/DELETEs for partition/inheritance
> tables. The challenges would just be the same in both cases.
>
>
Having thought more about this in the last couple of days, I am actually
inclined to try rewriting the UPDATE handling of MERGE along the lines of
what the ON CONFLICT DO UPDATE patch is doing. This might help us completely
eliminate invoking inheritance_planner() for partitioned tables, and that
will be a huge win for tables with several hundred partitions. The code
might also look much cleaner that way. I am going to give it a try for the
next couple of days and see if it's doable.

Thanks,
Pavan


Re: Faster inserts with mostly-monotonically increasing values

2018-03-19 Thread Pavan Deolasee
On Thu, Mar 15, 2018 at 7:51 AM, Claudio Freire 
wrote:

>
>
> I applied the attached patches on top of your patch, so it would be
> nice to see if you can give it a try in your test hardware to see
> whether it helps.
>

I can confirm that I no longer see any regression at 8 or even 16 clients.
In fact, the patched version consistently does better than master even at
higher numbers of clients.

The only thing I am a bit puzzled about is that I am no longer able to
reproduce the 40-50% gains that I'd earlier observed with a single client. I
now get 20-25% with a smaller number of clients and 10-15% with a larger
number of clients. I haven't been able to establish why that's happening,
but since it's a different AWS instance (though of the same type), I am
inclined to blame it on that for now.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: ON CONFLICT DO UPDATE for partitioned tables

2018-03-21 Thread Pavan Deolasee
eck to confirm that the
resultant list is actually ordered?

@@ -66,7 +67,8 @@ static TupleTableSlot
*ExecPrepareTupleRouting(ModifyTableState *mtstate,
  EState *estate,
  PartitionTupleRouting *proute,
  ResultRelInfo *targetRelInfo,
- TupleTableSlot *slot);
+ TupleTableSlot *slot,
+ int *partition_index);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
 static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
@@ -264,6 +266,7 @@ ExecInsert(ModifyTableState *mtstate,
 TupleTableSlot *slot,
 TupleTableSlot *planSlot,
 EState *estate,
+int partition_index,
 bool canSetTag)
 {
  HeapTuple tuple;

If we move the list of arbiter indexes and the tuple desc to ResultRelInfo,
as
suggested above, I think we can avoid making any API changes to
ExecPrepareTupleRouting and ExecInsert.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Faster inserts with mostly-monotonically increasing values

2018-03-21 Thread Pavan Deolasee
On Thu, Mar 22, 2018 at 7:22 AM, Claudio Freire 
wrote:

>
> If you're not planning on making any further changes, would you mind
> posting a coalesced patch?
>
> Notice that I removed the last offset thing only halfway, so it would
> need some cleanup.
>

Here is an updated patch. I removed the last-offset caching thing completely
and integrated your changes for conditional lock access. Some other
improvements to the test cases too. I realised that we must truncate and
re-insert to test the index fastpath correctly.
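
In other words, something along these lines (just an illustrative sketch,
not the exact test from the patch):

CREATE TABLE fastpath (a bigint PRIMARY KEY, b text);

-- monotonically increasing keys: after the first few inserts, new keys
-- should keep landing on the cached rightmost leaf block
INSERT INTO fastpath SELECT g, 'x' FROM generate_series(1, 10000) g;

-- truncate and re-insert, so the insertions also run against a rebuilt
-- index where any previously cached block is no longer valid
TRUNCATE fastpath;
INSERT INTO fastpath SELECT g, 'x' FROM generate_series(1, 10000) g;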

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_btree_target_block_v3.patch
Description: Binary data


Re: ON CONFLICT DO UPDATE for partitioned tables

2018-03-22 Thread Pavan Deolasee
On Thu, Mar 22, 2018 at 3:01 PM, Amit Langote  wrote:

>
> >
> > Why do we need to pin the descriptor? If we do need, why don't we need
> > corresponding ReleaseTupDesc() call?
>
> PinTupleDesc was added in the patch as Alvaro had submitted it upthread,
> which it wasn't clear to me either why it was needed.  Looking at it
> closely, it seems we can get rid of it if for the lack of corresponding
> ReleaseTupleDesc().  ExecSetSlotDescriptor() seems to take care of pinning
> and releasing tuple descriptors that are passed to it.  If some
> partition's tupDesc remains pinned because it was the last one that was
> passed to it, the final ExecResetTupleTable will take care of releasing it.
>
> I have removed the instances of PinTupleDesc in the updated patch, but I'm
> not sure why the loose PinTupleDesc() in the previous version of the patch
> didn't cause reference leak warnings or some such.
>

Yeah, it wasn't clear to me either. But I did not investigate. Maybe Alvaro
knows better.


> > I am still confused if the partition_conflproj_tdescs really belongs to
> > PartitionTupleRouting or should it be a part of the ResultRelInfo. FWIW
> for
> > the
> > MERGE patch, I moved everything to a new struct and made it part of the
> > ResultRelInfo. If no re-mapping is necessary, we can just point to the
> > struct
> > in the root relation's ResultRelInfo. Otherwise create and populate a new
> > one
> > in the partition relation's ResultRelInfo.
> >
> > + leaf_part_rri->ri_onConflictSetWhere =
> > + ExecInitQual(onconflwhere, &mtstate->ps);
> > + }
> >
> > So ri_onConflictSetWhere and ri_onConflictSetProj are part of the
> > ResultRelInfo. Why not move mt_conflproj_tupdesc,
> > partition_conflproj_tdescs,
> > partition_arbiter_indexes etc to the ResultRelInfo as well?
>
> I have moved both the projection tupdesc and the arbiter indexes list into
> ResultRelInfo as I wrote above.
>
>
Thanks. It's looking much better now. I think we can possibly move all the
ON CONFLICT-related members to a separate structure and just copy the
pointer to the structure if (map == NULL). That might make the code a bit
tidier.
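
Something like the following is roughly what I had in mind (purely
illustrative; the struct and field names are made up, and it assumes the
usual executor headers such as nodes/execnodes.h):

/* group the per-result-relation ON CONFLICT DO UPDATE state together */
typedef struct OnConflictSetState
{
    List           *oc_ArbiterIndexes; /* list of arbiter index OIDs */
    TupleTableSlot *oc_Existing;       /* slot for the existing target tuple */
    TupleTableSlot *oc_ProjSlot;       /* slot holding the SET projection result */
    ProjectionInfo *oc_ProjInfo;       /* projection for the SET clause */
    ExprState      *oc_WhereClause;    /* WHERE clause of DO UPDATE, if any */
} OnConflictSetState;

/* in ResultRelInfo */
OnConflictSetState *ri_onConflict;     /* NULL unless ON CONFLICT DO UPDATE */

A partition whose tuple descriptor needs no conversion (map == NULL) could
then simply point at the parent's instance, and only partitions that need
conversion would build their own copy.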

Is there anything that needs to be done for transition tables? I checked
and didn't see anything, but please check.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-22 Thread Pavan Deolasee
On Thu, Mar 22, 2018 at 1:15 AM, Peter Geoghegan  wrote:

>
> >
> >
> > No, it did not. We only support VALUES clause with INSERT action.
>
> But can't you have a subselect in the VALUES()? Support for subselects
> seems like a totally distinct thing to the restriction that only a
> (single row) VALUES() is allowed in INSERT actions.
>
>
Ah, right. That works even today.

postgres=# CREATE TABLE target (a int, b text);
CREATE TABLE

postgres=# MERGE INTO target USING (SELECT 1) s ON false WHEN NOT MATCHED
THEN INSERT VALUES ((SELECT count(*) FROM pg_class), (SELECT relname FROM
pg_class LIMIT 1));
MERGE 1

postgres=# SELECT * FROM target;
  a  |       b
-----+----------------
 755 | pgbench_source
(1 row)


Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Assertion failure while streaming toasted data

2021-05-24 Thread Pavan Deolasee
Hi,

While working on an output plugin that uses streaming protocol, I hit an
assertion failure. Further investigations revealed a possible bug in core
Postgres. This must be new to PG14 since streaming support is new to this
release. I extended the test_decoding regression test to demonstrate the
failure. PFA

```
2021-05-25 11:32:19.493 IST client backend[68321] pg_regress/stream
STATEMENT:  SELECT data FROM pg_logical_slot_get_changes('regression_slot',
NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '
1', 'stream-changes', '1');
TRAP: FailedAssertion("txn->size == 0", File: "reorderbuffer.c", Line:
3476, PID: 68321)
```

From my preliminary analysis, it looks like we fail to adjust the memory
accounting after streaming toasted tuples. More concretely, after
`ReorderBufferProcessPartialChange()` processes the in-progress transaction,
`ReorderBufferTruncateTXN()` truncates the accumulated changes in the
transaction, but fails to adjust the buffer size for the toast chunks. Maybe
we are missing a call to `ReorderBufferToastReset()` somewhere?

From what I see, the assertion only triggers when data is inserted via COPY
(multi-insert).

Let me know if anything else is needed to reproduce this.
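
In case it helps, this is roughly the shape of the reproduction as a
standalone session (not the attached test itself; it assumes wal_level =
logical and a superuser session for the server-side COPY, and the names and
values are just what I happened to use):

```
-- create the slot before generating the changes we want to decode
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');

CREATE TABLE toasted (id int, other text, data text);
ALTER TABLE toasted ALTER COLUMN data SET STORAGE EXTERNAL;  -- force out-of-line storage

-- stage wide rows and reload them with COPY, so that the inserts go
-- through the multi-insert path
INSERT INTO toasted SELECT g, 'bbb', repeat(md5(g::text), 200)
  FROM generate_series(1, 1000) g;
COPY toasted TO '/tmp/toasted.dat';
TRUNCATE toasted;
COPY toasted FROM '/tmp/toasted.dat';

-- decode with streaming enabled and a low memory limit so streaming kicks in
SET logical_decoding_work_mem = '64kB';
SELECT count(*) FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
       'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', '1');
```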

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com


stream_toasted_txn_test.patch
Description: Binary data


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 1:26 PM Dilip Kumar  wrote:

>
>
> I have identified the cause of the issue, basically, the reason is if
> we are doing a multi insert operation we don't set the toast cleanup
> until we get the last tuple of the xl_multi_insert [1].  Now, with
> streaming, we can process the transaction in between the multi-insert
> but while doing that the "change->data.tp.clear_toast_afterwards" is
> set to false in all the tuples in this stream. And due to this we will
> not clean up the toast.
>

Thanks. That matches my understanding too.


>
> One simple fix could be that we can just clean the toast memory if we
> are processing in the streaming mode (as shown in the attached patch).
>

I am not entirely sure if it works correctly. I'd tried something similar,
but the downstream node using my output plugin gets NULL values for the
toast columns. It's a bit hard to demonstrate that with the test_decoding
plugin, but if you have some other mechanism to test that change with an
actual downstream node receiving and applying changes, it would be useful to
test with that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 1:49 PM Dilip Kumar  wrote:

> On Tue, May 25, 2021 at 1:45 PM Pavan Deolasee 
> wrote:
> >
> > I am not entirely sure if it works correctly. I'd tried something
> similar, but the downstream node using
> > my output plugin gets NULL values for the toast columns.  It's a bit
> hard to demonstrate that with the
> > test_decoding plugin, but if you have some other mechanism to test that
> change with an actual downstream
> > node receiving and applying changes, it will be useful to test with that.
>
> Okay, I will test that.  Thanks.
>
>
I modified test_decoding to print the tuples (like in the non-streamed
case) and included your proposed fix. PFA

When the transaction is streamed, I see:
```
+ opening a streamed block for transaction
+ table public.toasted: INSERT: id[integer]:9001 other[text]:'bbb'
data[text]:'ccc'
+ table public.toasted: INSERT: id[integer]:9002 other[text]:'ddd'
data[text]:'eee'
+ table public.toasted: INSERT: id[integer]:9003 other[text]:'bar'
data[text]:unchanged-toast-datum

```

For the non-streamed case, the `data[text]` column shows the actual data.
That probably manifests as NULL data when the downstream node applies the
change.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com


stream_toasted_txn_print_tuples.patch
Description: Binary data


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 2:57 PM Dilip Kumar  wrote:

>
> >
> > Yes, I am able to reproduce this, basically, until we get the last
> > tuple of the multi insert we can not clear the toast data otherwise we
> > can never form a complete tuple.  So the only possible fix I can think
> > of is to consider the multi-insert WAL without the final multi-insert
> > tuple as partial data then we will avoid streaming until we get the
> > complete WAL of one multi-insert.  I am working on the patch to fix
> > this, I will share that in some time.
>
> The attached patch should fix the issue, now the output is like below
>
>
Thanks. This looks fine to me. We should still be able to stream
multi-insert transactions (COPY) as and when the copy buffer becomes full
and is flushed. That seems to be a reasonable restriction to me.

We should incorporate the regression test in the final patch. I am not
entirely sure if what I have done is acceptable (or even works in
all scenarios). We could possibly have a long list of tuples instead of
doing the exponential magic. Or we should consider lowering the min value
for logical_decoding_work_mem and run these tests with a much lower value.
In fact, that's how I caught the problem in the first place. I had
deliberately lowered the value to 1kB so that the streaming code kicks in
very often, even for small transactions.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com


Re: Assertion failure while streaming toasted data

2021-05-25 Thread Pavan Deolasee
On Tue, May 25, 2021 at 6:43 PM Dilip Kumar  wrote:

>
> I have also added a test case for this.
>
>
Is that test good enough to trigger the original bug? In my experience, I
had to add a lot more tuples before the logical_decoding_work_mem
threshold was crossed and the streaming kicked in. I would suggest running
the test without the fix and checking whether the assertion hits. If so, we
are good to go.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com

