Re: Reducing power consumption on idle servers

2022-02-21 Thread Simon Riggs
On Sat, 19 Feb 2022 at 17:03, Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-19 14:10:39 +, Simon Riggs wrote:
> > Some years ago we did a pass through the various worker processes to
> > add hibernation as a mechanism to reduce power consumption on an idle
> > server. Replication never got the memo, so power reduction on an
> > idle server is not very effective on standbys or logical subscribers.
> > The code and timing for hibernation is also different for each worker,
> > which is confusing.
>
> Yea, I think it's high time that we fix this.

Good to have your support. There are millions of servers running idle
for some part of their duty cycle.

This patch seeks to change the situation for the better in PG15, i.e.
soon, so the changes proposed are deliberately light. It also seeks to
provide a framework that writers of background worker processes can
follow, since we can't just fix core, we need to fix all the various
bgworkers in use as well.
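
As a rough illustration of the kind of hibernating wait loop being
described (a sketch only, not the patch: HIBERNATE_DELAY_SEC comes from
the discussion later in this thread, while DoSomeWork() and
normal_naptime_ms are placeholder names):

  #define HIBERNATE_DELAY_SEC 60

  for (;;)
  {
      bool    work_done;
      long    timeout_ms;

      ResetLatch(MyLatch);
      CHECK_FOR_INTERRUPTS();

      work_done = DoSomeWork();

      /* sleep on the normal cycle while busy, hibernate once idle */
      timeout_ms = work_done ? normal_naptime_ms
                             : HIBERNATE_DELAY_SEC * 1000L;

      (void) WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       timeout_ms,
                       PG_WAIT_EXTENSION);
  }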

> IMO we should instead consider either deprecating file based promotion, or
> adding an optional dependency on filesystem monitoring APIs (i.e. inotify etc)
> that avoid the need to poll for file creation.

Deprecating explicit file-based promotion is possible and simple, so
that is the approach in the latest version of the patch.

Thanks for the suggestion.

I've re-analyzed all of the code around startup/walreceiver
interactions and, IMHO, there isn't any need for the 5s delay in the
startup process, so the startup process can sleep for longer.

> IMO we should actually consider moving *away* from hibernation for the cases
> where we currently use it and rely much more heavily on being notified when
> there's work to do, without a timeout when not.

I don't think that is a practical approach this close to the end of PG15.
It would require changing the behavior of bgwriter, walwriter and pgarch
even though they are not broken. The likelihood that we would break
something is too high.

The real issue is that the sleep times of the various procs are on
completely different cycles, which is why I am proposing to normalize
them so that Postgres can actually sleep effectively.

> > * autovac launcher - autovacuum_naptime
>
> On production systems autovacuum_naptime often can't be a large value,
> otherwise it's easy to not keep up on small busy tables. That's fine for
> actually busy servers, but with the increase in hosted PG offerings, the
> defaults in those offerings need to cater to a broader audience.

Autovacuum varies its wakeup cycle according to how much work is being
done, so autovacuum_naptime can be set without affecting power
consumption when idle.

Idle for autovacuum is defined slightly differently, since even when all
user work completes there may still be a lot of vacuuming to do before
it goes fully idle. But my observation is that many servers go idle for
more than 50% of each week, operating 8-12 hours per day, 5 days per
week, so we can still save a lot of power.

This patch doesn't change how autovac works, it just uses a common
setting for the hibernation that eventually occurs.

> > These servers don't try to hibernate at all:
> > * logical worker - 1s
>
> Not great.

Agreed; the patch improves this to roughly the same as walreceiver.

> > * logical launcher - wal_retrieve_retry_interval (undocumented)
>
> I think it's actually 180s in the happy path. The wal_retrieve_retry_interval
> is about how often workers get restarted. But if there's no need to restart,
> it sleeps longer.

I propose normalizing all of the various hibernation times to the same value.


> > * wal_receiver - 100ms, currently gets woken when WAL arrives
>
> This is a fairly insane one. We should compute a precise timeout based on
> wal_receiver_timeout.

That is exactly what the patch does, when it hibernates.


I wasn't aware of Thomas' work, but now that I am, we can choose which
of those approaches to use for the WAL receiver. I hope that we can fix
the logical worker and WAL receiver to use the same algorithm. The rest
of this patch would still be valid, whatever we do for those two procs.

--
Simon Riggs    http://www.EnterpriseDB.com/


hibernate_to_reduce_power_consumption.v2.patch
Description: Binary data


Re: Reducing power consumption on idle servers

2022-02-21 Thread Simon Riggs
On Mon, 21 Feb 2022 at 16:49, Chapman Flack  wrote:

> Shouldn't the comment be "with work_done=false" ?

Good catch, thanks.

I've also added docs to say that "promote_trigger_file" is now
deprecated. There were no tests for that functionality, so it is just
as well that it is being removed.

v3 attached.


--
Simon Riggs    http://www.EnterpriseDB.com/


hibernate_to_reduce_power_consumption.v3.patch
Description: Binary data


Buffer Manager and Contention

2022-02-24 Thread Simon Riggs
Thinking about poor performance in the case where the data fits in
RAM, but the working set is too big for shared_buffers, I notice a
couple of things that seem bad in BufMgr, but I don't understand why
they are like that.

1. If we need to allocate a buffer to a new block we do this in one
step, while holding both partition locks for the old and the new tag.
Surely it would cause less contention to make the old block/tag
invalid (after flushing), drop the old partition lock and then switch
to the new one? i.e. just hold one mapping partition lock at a time.
Is there a specific reason we do it this way?

2. Possibly connected to the above, we issue BufTableInsert() BEFORE
we issue BufTableDelete(). That means we need extra entries in the
buffer mapping hash table to allow us to hold both the old and the new
at the same time, for a short period. The way dynahash.c works, we try
to allocate an entry from the freelist and if that doesn't work, we
begin searching ALL the freelists for free entries to steal. So if we
get enough people trying to do virtual I/O at the same time, then we
will hit a "freelist storm" where everybody is chasing the last few
entries. It would make more sense if we could do BufTableDelete()
first, then hold onto the buffer mapping entry rather than add it to
the freelist, so we can use it again when we do BufTableInsert() very
shortly afterwards. That way we wouldn't need to search the freelist
at all. What is the benefit or reason of doing the Delete after the
Insert?

Put another way, it looks like the BufTable functions are used in a
way that mismatches how dynahash is designed.
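
To make the question concrete, here is a greatly simplified sketch of
the two orderings (based on the behavior described above for
BufferAlloc() in bufmgr.c; pinning, buffer header locking, error paths
and the exact lock acquisition/release order are omitted, and variable
names are illustrative):

  /* Current order (simplified): both mapping partition locks held at
   * once, and the new entry inserted before the old one is deleted.
   * In reality the locks are taken in a deadlock-avoiding order and
   * released at different points. */
  LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
  LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
  BufTableInsert(&newTag, newHash, buf_id);  /* may need a spare freelist entry */
  BufTableDelete(&oldTag, oldHash);          /* old entry goes back onto a freelist */
  LWLockRelease(oldPartitionLock);
  LWLockRelease(newPartitionLock);

  /* Ordering being suggested above: one partition lock at a time,
   * delete before insert. */
  LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
  BufTableDelete(&oldTag, oldHash);
  LWLockRelease(oldPartitionLock);
  LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
  BufTableInsert(&newTag, newHash, buf_id);
  LWLockRelease(newPartitionLock);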

Thoughts?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Logical insert/update/delete WAL records for custom table AMs

2022-02-24 Thread Simon Riggs
On Mon, 17 Jan 2022 at 09:05, Jeff Davis  wrote:
>
> On Wed, 2022-01-05 at 20:19 +, Simon Riggs wrote:
> > I spoke with Jeff in detail about this patch in NYC Dec 21, and I now
> > think it is worth pursuing. It seems much more likely that this would
> > be acceptable than fully extensible rmgr.
>
> Thank you. I had some conversations with others who felt this approach
> is wasteful, which it is. But if this patch still has potential, I'm
> happy to pursue it along with the extensible rmgr approach.

It's self-contained and generally useful for a range of applications,
so the code would be long-lived.

Calling it wasteful presumes a particular way of using it. If you make
logical log entries you don't need to make physical ones, so it's
actually the physical log entries that would be wasteful.

Logical log entries don't need to be decoded, so it's actually more
efficient, plus we could skip index entries altogether.

I would argue that this is the way we should be doing WAL, with
occasional divergence to physical records for performance, rather than
the other way around.

> > So I'm signing up as a reviewer and we'll see if this is good to go.
>
> Great, I attached a rebased version.

The approach is perfectly fine; it just needs to be finished and rebased.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Buffer Manager and Contention

2022-02-24 Thread Simon Riggs
On Fri, 25 Feb 2022 at 01:20, Kyotaro Horiguchi  wrote:

> Yura Sokolov is proposing a patch to separate the two partition locks.
>
> https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru
>
> And it seems viable to me, and the benchmarking in the thread
> showed a good result.  I'd appreciate your input on that thread.

Certainly, I will stop this thread and contribute to Yura's thread.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: BufferAlloc: don't take two simultaneous locks

2022-02-24 Thread Simon Riggs
On Mon, 21 Feb 2022 at 08:06, Yura Sokolov  wrote:
>
> Good day, Kyotaro Horiguchi and hackers.
>
> > On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> > At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov  
> > wrote in
> > > Hello, all.
> > >
> > > I thought about patch simplification, and tested version
> > > without BufTable and dynahash api change at all.
> > >
> > > It performs surprisingly well. It is just a bit worse
> > > than v1 since there is more contention around dynahash's
> > > freelist, but most of the improvement remains.
> > >
> > > I'll finish benchmarking and will attach graphs with
> > > next message. Patch is attached here.
> >
> > Thanks for the new patch.  The patch as a whole looks fine to me. But
> > some comments need to be revised.
>
> Thank you for review and remarks.

v3 gets the buffer partition locking right, well done, great results!

In v3, the comment at line 1279 still implies we take both locks
together, which is no longer the case.

Further dynahash improvements are still possible. You now have the
BufTableDelete before the BufTableInsert, which opens up the
possibility I discussed here:
http://postgr.es/m/CANbhV-F0H-8oB_A+m=55hP0e0QRL=rdddqusxmtft6jprdx...@mail.gmail.com
(Apologies for raising a similar topic, I hadn't noticed this thread
before; thanks to Horiguchi-san for pointing this out).

v1 had a horrible API (sorry!) where you returned the entry and then
explicitly re-used it. I think we *should* make changes to dynahash,
but not with the API you proposed.

Proposal for a new BufTable API:
BufTableReuse() - similar to BufTableDelete(), but does NOT put the
entry back on the freelist; instead we remember it in a private
single-item cache in dynahash.
BufTableAssign() - similar to BufTableInsert(), but can only be
executed directly after BufTableReuse(), failing with ERROR otherwise.
It takes the entry from the single-item cache and re-assigns it to the
new tag.

In dynahash we have two new modes that match the above:
HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
places the entry in the single-item cache, avoiding the freelist.
HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
uses the entry from the single-item cache rather than asking the
freelist. This last call can fail if someone else has already inserted
the tag, in which case it puts the single-item cache entry back onto
the freelist.

Notice that the single-item cache is not in shared memory, so on abort
we should give the entry back, which probably means we need an extra
API call for that too, to avoid leaking an entry.
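
To illustrate, the buffer replacement path might then look roughly like
this (a sketch only: the two function names and hash modes are from the
proposal above, while the exact signatures and surrounding code are
assumptions, not an implementation):

  int     existing_id;

  /* Drop the old mapping, keeping its dynahash entry in the private cache */
  LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
  BufTableReuse(&oldTag, oldHash);        /* HASH_REUSE: entry cached, not freelisted */
  LWLockRelease(oldPartitionLock);

  /* Re-assign the cached entry to the new tag */
  LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
  existing_id = BufTableAssign(&newTag, newHash, buf_id);   /* HASH_ASSIGN */
  if (existing_id >= 0)
  {
      /* somebody else inserted the tag first; the cached entry goes
       * back onto the freelist and we use their buffer instead */
  }
  LWLockRelease(newPartitionLock);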

Doing it this way allows us to:
* avoid touching freelists altogether in the common path - we know we
are about to reassign the entry, so we just remember it: no contention
from other backends, no borrowing, etc.
* avoid sharing the private details outside of the dynahash module
* use the same technique elsewhere, wherever we have partitioned hash
tables

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

With these changes, I think we will be able to *reduce* the number of
freelists for partitioned dynahash from 32 to maybe 8, as originally
speculated by Robert in 2016:
https://www.postgresql.org/message-id/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
since the freelists will be much less contended with the above approach.

It would be useful to see performance with a higher number of connections, >400.

--
Simon Riggs    http://www.EnterpriseDB.com/




Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Simon Riggs
On Fri, 25 Feb 2022 at 09:24, Yura Sokolov  wrote:

> > This approach is cleaner than v1, but should also perform better
> > because there will be a 1:1 relationship between a buffer and its
> > dynahash entry, most of the time.
>
> Thank you for suggestion. Yes, it is much clearer than my initial proposal.
>
> Should I incorporate it to v4 patch? Perhaps, it could be a separate
> commit in new version.

I don't insist that you do that, but since the API changes are a few
hours' work, ISTM better to include them in one patch for combined perf
testing. It would be better to put all the changes in this area into
PG15 than to split them across multiple releases.

> Why there is need for this? Which way backend could be forced to abort
> between BufTableReuse and BufTableAssign in this code path? I don't
> see any CHECK_FOR_INTERRUPTS on the way, but may be I'm missing
> something.

Sounds reasonable.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: suboverflowed subtransactions concurrency performance optimize

2022-03-07 Thread Simon Riggs
On Mon, 7 Mar 2022 at 09:49, Julien Rouhaud  wrote:
>
> Hi,
>
> On Mon, Jan 17, 2022 at 01:44:02PM +, Simon Riggs wrote:
> >
> > Re-attached, so that the CFapp isn't confused between the multiple
> > patches on this thread.
>
> Thanks a lot for working on this!
>
> The patch is simple and overall looks good to me.  A few comments though:
>
>
> +/*
> + * Single-item cache for results of SubTransGetTopmostTransaction.  It's 
> worth having
> + * such a cache because we frequently find ourselves repeatedly checking the
> + * same XID, for example when scanning a table just after a bulk insert,
> + * update, or delete.
> + */
> +static TransactionId cachedFetchXid = InvalidTransactionId;
> +static TransactionId cachedFetchTopmostXid = InvalidTransactionId;
>
> The comment is above the 80 chars after
> s/TransactionLogFetch/SubTransGetTopmostTransaction/, and I don't think this
> comment is valid for subtrans.c.

What aspect makes it invalid? The comment seems exactly applicable to
me; Andrey thinks so also.

> Also, maybe naming the first variable cachedFetchSubXid would make it a bit
> clearer?

Sure, that can be done.

> It would be nice to see some benchmarks, for both when this change is
> enough to avoid a contention (when there's a single long-running overflowed
> backend) and when it's not enough.  That will also be useful if/when working 
> on
> the "rethink_subtrans" patch.

The patch doesn't do anything about the case where there's a single
long-running overflowed backend, nor does it claim to.

The patch will speed up calls to SubTransGetTopmostTransaction(), which occur in
src/backend/access/heap/heapam.c
src/backend/utils/time/snapmgr.c
src/backend/storage/lmgr/lmgr.c
src/backend/storage/ipc/procarray.c

The patch was posted because TransactionLogFetch() has a cache while
SubTransGetTopmostTransaction() does not, yet the argument should be
identical in both cases.
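
To show the shape of the change, the cached fast path would look
roughly like this (a sketch based on the patch fragment quoted above,
using the renamed variable suggested; the loop body paraphrases the
existing function and is not the patch itself):

  static TransactionId cachedFetchSubXid = InvalidTransactionId;
  static TransactionId cachedFetchTopmostXid = InvalidTransactionId;

  TransactionId
  SubTransGetTopmostTransaction(TransactionId xid)
  {
      TransactionId parentXid = xid,
                    previousXid = xid;

      /* Fast path: repeated checks of the same XID hit the one-item cache */
      if (TransactionIdEquals(xid, cachedFetchSubXid))
          return cachedFetchTopmostXid;

      /* existing logic: walk up the parent chain in pg_subtrans */
      while (TransactionIdIsValid(parentXid))
      {
          previousXid = parentXid;
          if (TransactionIdPrecedes(parentXid, TransactionXmin))
              break;
          parentXid = SubTransGetParent(parentXid);
      }

      cachedFetchSubXid = xid;
      cachedFetchTopmostXid = previousXid;

      return previousXid;
  }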

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Reducing power consumption on idle servers

2022-03-09 Thread Simon Riggs
On Sat, 26 Feb 2022 at 17:44, Tom Lane  wrote:
>
> Magnus Hagander  writes:
> >> Deprecating explicit file-based promotion is possible and simple, so
> >> that is the approach in the latest version of the patch.
>
> > Is there any actual use-case for this other than backwards
> > compatibility?
>
> The fundamental problem with signal-based promotion is that it's
> an edge-triggered rather than level-triggered condition, ie if you
> miss the single notification event you're screwed.  I'm not sure
> why we'd want to go there, considering all the problems we're having
> with edge-triggered APIs in nearby threads.
>
> We could combine the two approaches, perhaps: have "pg_ctl promote"
> create a trigger file and then send a signal.  The trigger file
> would record the existence of the promotion request, so that if the
> postmaster isn't running right now or misses the signal, it'd still
> promote when restarted or otherwise at the next opportunity.
> But with an approach like this, we could expect quick response to
> the signal in normal conditions, without need for constant wakeups
> to check for the file.  Also, this route would allow overloading
> of the signal, since it would just mean "wake up, I asked you to
> do something" rather than needing to carry the full semantics of
> what is to be done.

The above is how this works now, luckily, but few comments explain why.

The current state evolved because I first added file-based promotion,
for the reasons you give, but it was slow, so the signal approach was
added later, with a new API.

What we are discussing deprecating is the ability to create the
trigger file directly (the original API). Once that is gone the
mechanism would be to request promotion via pg_ctl promote (the new
API), which then creates the trigger file and sends a signal. So in
both APIs the trigger file is still used.

In this patch, if you create the trigger file
directly/explicitly/yourself (the old API) then it will still trigger
promotion while hibernating, but more slowly. The new API is unaffected
by this change.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: support for MERGE

2022-03-10 Thread Simon Riggs
On Sun, 27 Feb 2022 at 17:35, Alvaro Herrera  wrote:

> > > +   /*
> > > +    * Project the tuple.  In case of a partitioned table, the
> > > +    * projection was already built to use the root's descriptor,
> > > +    * so we don't need to map the tuple here.
> > > +    */
> > > +   actionInfo.actionState = action;
> > > +   insert_slot = ExecProject(action->mas_proj);
> > > +
> > > +   (void) ExecInsert(mtstate, rootRelInfo,
> > > +                     insert_slot, slot,
> > > +                     estate, &actionInfo,
> > > +                     mtstate->canSetTag);
> > > +   InstrCountFiltered1(&mtstate->ps, 1);
> >
> > What happens if somebody concurrently inserts a conflicting row?
>
> An open question to which I don't have any good answer RN.

Duplicate rows should produce a uniqueness violation error in one of
the transactions, assuming there is a constraint to define the
conflict. Without such a constraint there is no conflict.

Concurrent inserts are checked by merge-insert-update.spec, which
correctly raises an ERROR in this case, as documented.

Various cases were not tested in the patch - additional patch
attached, but nothing surprising there.

ExecInsert() does not return from such an ERROR, so the code fragment
appears correct to me.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


new_insert_tests_for_merge.patch
Description: Binary data


Re: Parameter for planner estimate of recursive queries

2022-03-10 Thread Simon Riggs
On Fri, 28 Jan 2022 at 14:07, Robert Haas  wrote:
>
> On Tue, Jan 25, 2022 at 4:44 AM Peter Eisentraut
>  wrote:
> > On the one hand, this smells like a planner hint.  But on the other
> > hand, it doesn't look like we will come up with proper graph-aware
> > selectivity estimation system any time soon, so just having all graph
> > OLTP queries suck until then because the planner hint is hardcoded
> > doesn't seem like a better solution.  So I think this setting can be ok.
>
> I agree. It's a bit lame, but seems pretty harmless, and I can't see
> us realistically doing a lot better with any reasonable amount of
> work.

Shall I set this as Ready For Committer?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Reducing power consumption on idle servers

2022-03-10 Thread Simon Riggs
On Wed, 9 Mar 2022 at 01:16, Zheng Li  wrote:

> > 1. Standardize the hibernation time at 60s, using a #define
> > HIBERNATE_DELAY_SEC 60
>
> I notice in patch 3 HIBERNATE_DELAY_SEC has been increased to 300
> seconds, what’s the reasoning behind it? Is longer hibernation delay
> better? If so can we set it to INT_MAX (the max timeout allowed by
> WaitLatch()) in which case a worker in hibernation only relies on
> wakeup? I think it would be nice to run experiments to verify that the
> patch reduces power consumption while varying the value of
> HIBERNATE_DELAY_SEC.

Setting it to INT_MAX would be the same as not allowing a timeout,
which changes a lot of current behavior and makes it less robust.

Waking once per minute is what we do in various cases, so 60sec is a
good choice.

In the case of logical rep launcher we currently use 300sec, so using
60s would decrease this.

I don't see much difference in power consumption between timeouts of
60s and 300s.

In the latest patch, I chose 300s. Does anyone have an opinion on the
value here?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Parameter for planner estimate of recursive queries

2022-03-23 Thread Simon Riggs
On Tue, 22 Mar 2022 at 00:04, Andres Freund  wrote:

> On 2022-03-10 17:42:14 +0000, Simon Riggs wrote:
> > Shall I set this as Ready For Committer?
>
> Currently this CF entry fails on cfbot: 
> https://cirrus-ci.com/task/4531771134967808?logs=test_world#L1158
>
> [16:27:35.772] #   Failed test 'no parameters missing from 
> postgresql.conf.sample'
> [16:27:35.772] #   at t/003_check_guc.pl line 82.
> [16:27:35.772] #  got: '1'
> [16:27:35.772] # expected: '0'
> [16:27:35.772] # Looks like you failed 1 test of 3.
> [16:27:35.772] [16:27:35] t/003_check_guc.pl ..
>
> Marked as waiting on author.

Thanks.

I've corrected that by adding a line to postgresql.conf.sample, as
well as adding docs.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


recursive_worktable_estimate.v2.patch
Description: Binary data


Re: Parameter for planner estimate of recursive queries

2022-03-23 Thread Simon Riggs
On Wed, 23 Mar 2022 at 17:36, Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Tue, Jan 25, 2022 at 4:44 AM Peter Eisentraut
> >  wrote:
> >> On the one hand, this smells like a planner hint.  But on the other
> >> hand, it doesn't look like we will come up with proper graph-aware
> >> selectivity estimation system any time soon, so just having all graph
> >> OLTP queries suck until then because the planner hint is hardcoded
> >> doesn't seem like a better solution.  So I think this setting can be ok.
>
> > I agree. It's a bit lame, but seems pretty harmless, and I can't see
> > us realistically doing a lot better with any reasonable amount of
> > work.
>
> Yeah, agreed on all counts.  The thing that makes it lame is that
> there's no reason to expect that the same multiplier is good for
> every recursive query done in an installation, or even in a session.
>
> One could imagine dealing with that by adding custom syntax to WITH,
> as we have already done once:
>
> WITH RECURSIVE cte1 AS SCALE 1.0 (SELECT ...
>
> But I *really* hesitate to go there, mainly because once we do
> something like that we can't ever undo it.  I think Simon's
> proposal is a reasonable low-effort compromise.
>
> Some nitpicks:
>
> * The new calculation needs clamp_row_est(), since the float
> GUC could be fractional or even zero.

True, will do.
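
Something like the following, in cost_recursive_union() (a sketch only;
the GUC name shown here is illustrative):

  /* replaces the hardcoded "total_rows += 10 * rterm->rows" */
  total_rows += clamp_row_est(recursive_worktable_factor * rterm->rows);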

> * Do we want to prevent the GUC value from being zero?  It's not
> very sensible, plus I think we might want to reserve that value
> to mean "use the built-in calculation", in case we ever do put
> in some smarter logic here.  But I'm not sure what a reasonable
> non-zero lower bound would be.

Agreed, makes sense.

> * The proposed docs claim that a smaller setting works by biasing
> the planner towards fast-start plans, but I don't think I believe
> that explanation.  I'd venture that we want text more along the
> lines of "This may help the planner choose the most appropriate
> method for joining the work table to the query's other tables".

OK, will improve.

[New patch version pending]

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Reducing power consumption on idle servers

2022-03-23 Thread Simon Riggs
On Tue, 22 Mar 2022 at 00:54, Andres Freund  wrote:
>
> On 2022-02-21 21:04:19 +, Simon Riggs wrote:
> > On Mon, 21 Feb 2022 at 16:49, Chapman Flack  wrote:
> >
> > > Shouldn't the comment be "with work_done=false" ?
> >
> > Good catch, thanks.
> >
> > I've also added docs to say that "promote_trigger_file" is now
> > deprecated. There were no tests for that functionality, so just as
> > well it is being removed.
>
> Doesn't pass tests, and hasn't for weeks from what I can see: 
> https://cirrus-ci.com/task/5925633648754688?logs=test_world#L1153
>
> Marked as waiting-on-author.

Hmm, just the not-in-sample test again. Fixed by marking GUC_NOT_IN_SAMPLE.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


hibernate_to_reduce_power_consumption.v4.patch
Description: Binary data


Re: Parameter for planner estimate of recursive queries

2022-03-23 Thread Simon Riggs
On Wed, 23 Mar 2022 at 18:20, Simon Riggs  wrote:

> [New patch version pending]

-- 
Simon Riggs    http://www.EnterpriseDB.com/


recursive_worktable_estimate.v3.patch
Description: Binary data


Re: Huge memory consumption on partitioned table with FKs

2020-12-02 Thread Simon Riggs
On Tue, 1 Dec 2020 at 00:03, Alvaro Herrera  wrote:
> cache entry is reused in the common case where they are identical.

Does a similar situation exist for partition statistics accessed
during planning? Or planning itself?

It would be useful to avoid repeated access to similar statistics and
repeated planning of sub-plans for similar partitions.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: SQL/JSON: functions

2020-12-15 Thread Simon Riggs
On Fri, 17 Jul 2020 at 21:26, Nikita Glukhov  wrote:
>
> Attached 50th version of the patches. Only the documentation was changed
> since the previous version.

I can imagine the effort required to get to v50, so I salute your efforts.

The document for SQL Standard has now been published as CD
9075-2-Foundation (Thanks Peter).

That gives us a clearer picture of what is being voted on and should
allow Nikita to complete his work.

I suggest we move forwards on this now, but if anyone objects to
including this in PG14 in favour of waiting for the vote, please say
so clearly so we can skip to PG15.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Reloptions for table access methods

2020-12-15 Thread Simon Riggs
On Tue, 1 Sept 2020 at 18:21, Jeff Davis  wrote:

> I went with the simple approach because fixing that problem seemed a
> bit over-engineered. Here are some thoughts on what we could do:

The simple patch is admirable, but not something we should put into core.

I definitely don't agree with the premise that all existing heap
options should be common across all new or extension tableAMs. There
are dozens of such options and I don't believe that they would all
have meaning in all future storage plugins.

I think options should just work exactly the same for Indexes and Tables.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Discarding DISCARD ALL

2020-12-23 Thread Simon Riggs
Currently, session poolers operating in transaction mode need to send
a "server_reset_query" which is mostly DISCARD ALL.

It seems strange to me that we put this work onto the pooler, forcing
poolers to repeatedly issue the same command, at some cost in
performance. Measuring the overhead with pgbench might miss the points
that poolers are frequently configured on different network hosts and
that monitoring tools used in production will record the DISCARD
statement. YMMV, but the overhead is measurably non-zero.

The proposal is to have a simple new parameter:
  transaction_cleanup = off (default) | on
A setting of "on" will issue the equivalent of a DISCARD ALL as soon
as the transaction has been ended by a COMMIT, ROLLBACK or PREPARE.

Poolers such as pgbouncer would then be able to connect transaction-mode
pools with transaction_cleanup=on at connection time, avoiding any need
to issue a server_reset_query and removing the DISCARD ALL command from
the normal execution path, while still achieving the same thing.

This has an additional side benefit: if we know we will clean up at
the end of the transaction, then all temp tables become effectively ON
COMMIT DROP and we are able to allow temp tables in prepared
transactions. There are likely other side benefits from this
knowledge, allowing us to further tune the PostgreSQL server to the
common use case of transaction session poolers. I think it should be
possible to avoid looking for holdable portals if we are dropping them
all anyway.

Patch attached, passes make check with new tests added.

Comments welcome.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


transaction_cleanup.v4.patch
Description: Binary data


Re: Discarding DISCARD ALL

2020-12-23 Thread Simon Riggs
On Wed, 23 Dec 2020 at 15:19, Andreas Karlsson  wrote:
>
> On 12/23/20 3:47 PM, Vladimir Sitnikov wrote:
> > Simon>It seems strange to me that we put this work onto the pooler, forcing
> > Simon>poolers to repeatedly issue the same command
> >
> > What if poolers learn to manage connections and prepared statements better?
> > Then poolers won't have to reset the session every time, and everyone wins.
>
> While that would be possible to implement, since some client libraries
> implement this in their pools (e.g. Sequel for Ruby) this patch would
> help connection poolers which are not aware of prepared statements, for
> example PgBouncer, so it is worthwhile as long as there are connection
> poolers out there which are not aware of prepared statements. And even
> the connection poolers which are aware might want to automatically drop
> temporary tables and reset GUCs. So I do not think that this feature
> would become pointless even if people write a patch for PgBouncer.

The whole premise of the patch is tighter integration, with the server
providing the facilities that poolers need.

The patch can be enhanced to do whatever else we agree is desirable.

Do we need something like DISCARD ALL EXCEPT PREPARED STATEMENTS;  ??

If there are different requirements for each pooler, what are they?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Disable WAL logging to speed up data loading

2020-12-28 Thread Simon Riggs
On Fri, 25 Dec 2020 at 07:09, Michael Paquier  wrote:
>
> On Thu, Dec 03, 2020 at 03:52:47AM +, tsunakawa.ta...@fujitsu.com wrote:
> > The code looks good, and the performance seems to be nice, so I
> > marked this ready for committer.
>
> FWIW, I am extremely afraid of this proposal because this is basically
> a footgun able to corrupt customer instances, and I am ready to bet
> that people would switch wal_level to none because they see a lot of
> benefits in doing so at first sight, until the host running the server
> is plugged off and they need to use pg_resetwal in urgency to bring an
> instance online.

Agreed, it is a footgun. -1 to commit the patch as-is.

The patch to avoid WAL is simple but it is dangerous for both the user
and the PostgreSQL project.

In my experience, people will use this option and when it crashes and
they lose their data, they will claim PostgreSQL broke and that they
were not stupid enough to use this option. Data loss has always been
the most serious error for PostgreSQL and our reputation for
protecting data has been hard won; it can easily be lost in a moment
of madness. Please consider how the headlines will read, bearing in
mind past incidents and negative press. Yes, we did think of this
feature already and rejected it.

If we ever did allow such an option, it must contain these things (IMHO):
* the option should be called "unsafe" or "allows_data_loss", not just
"none" (anything less than "minimal" must be insufficient or
unsafe...)
* the option must be set in the control file and be part of the same
patch, so users cannot easily edit things to hide their unsafe usage
* we must not support the option of going down to "unsafe" and then
back up again. It must be a one-way transition from "unsafe" to a
higher level, so if people want to use this for temp reporting servers
or initial loading, great, but they can't use it as a quick speed-up
for databases containing needs-to-be-safe data. Possibly the state
change might be "unsafe" -> "needs_backup" -> "minimal"... or some
other way to signal to backup.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: create table like: ACCESS METHOD

2020-12-30 Thread Simon Riggs
On Tue, 29 Dec 2020 at 23:08, Justin Pryzby  wrote:
>
> On Fri, Dec 25, 2020 at 03:41:46PM +0900, Michael Paquier wrote:
> > On Wed, Dec 09, 2020 at 02:13:29PM -0600, Justin Pryzby wrote:
> > > I thought this was a good idea, but didn't hear back when I raised it 
> > > before.
> > >
> > > Failing to preserve access method is arguably a bug, reminiscent of CREATE
> > > STATISTICS and 5564c1181.  But maybe it's not important to backpatch a 
> > > fix in
> > > this case, since access methods are still evolving.
> >
> > Interesting.  Access methods for tables are released for more than one
> > year now, so my take about a backpatch is that this boat has already
> > sailed.  This may give a reason to actually not introduce this
> > feature.
>
> Are you saying that since v12/13 didn't preserve the access method, it might 
> be
> preferred to never do it ?  I think it's reasonable to not change v12/13 but
> the behavior seems like an omission going forward.  It's not so important 
> right
> now, since AMs aren't widely used.

Failing to copy the AM seems like a bug during
  CREATE TABLE likeamlike(LIKE likeam INCLUDING ALL);
But this does allow you to specify the table AM by using
default_table_access_method, and to use
  CREATE TABLE likeamlike(LIKE likeam INCLUDING ALL) USING heapdup;
if you wish to set the AM explicitly, so I don't see this as needing a backpatch.

Which means that the emphasis for the earlier functionality was
towards one "preferred AM" rather than using multiple AMs at the same
time. Allowing this change in later releases makes sense.

Please make sure this is marked as an incompatibility in the release notes.

> > This patch should have more tests.  Something could be added in
> > create_am.sql where there is a fake heap2 as table AM.
>
> Yes, I had already done that locally.

There are no tests for the new functionality; please could you add some?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Allow CURRENT_ROLE in GRANTED BY

2020-12-30 Thread Simon Riggs
On Thu, 10 Dec 2020 at 18:40, Peter Eisentraut
 wrote:
>
> On 2020-06-24 20:21, Peter Eisentraut wrote:
> > On 2020-06-24 10:12, Vik Fearing wrote:
> >> On 6/24/20 8:35 AM, Peter Eisentraut wrote:
> >>> I was checking some loose ends in SQL conformance, when I noticed: We
> >>> support GRANT role ... GRANTED BY CURRENT_USER, but we don't support
> >>> CURRENT_ROLE in that place, even though in PostgreSQL they are
> >>> equivalent.  Here is a trivial patch to add that.
> >>
> >>
> >> The only thing that isn't dead-obvious about this patch is the commit
> >> message says "[PATCH 1/2]".  What is in the other part?
> >
> > Hehe.  The second patch is some in-progress work to add the GRANTED BY
> > clause to the regular GRANT command.  More on that perhaps at a later date.
>
> Here is the highly anticipated and quite underwhelming second part of
> this patch set.

Looks great, but no test to confirm it works. I would suggest adding a
test and committing directly since I don't see any cause for further
discussion.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Disable WAL logging to speed up data loading

2020-12-30 Thread Simon Riggs
On Wed, 30 Dec 2020 at 13:04, Michael Paquier  wrote:

> > Yes, we did think of this feature already and rejected it.
>
> I don't recall seeing such a discussion.  Do you have some references?
> Just a matter of curiosity :)

Off-list discussions.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-12-30 Thread Simon Riggs
On Mon, 16 Nov 2020 at 15:32, Bharath Rupireddy
 wrote:
>
> On Mon, Nov 16, 2020 at 8:02 PM Paul Guo  wrote:
> >
> > > On Nov 13, 2020, at 7:21 PM, Bharath Rupireddy 
> > >  wrote:
> > >
> > > On Tue, Nov 10, 2020 at 3:47 PM Paul Guo  wrote:
> > >>
> > >> Thanks for doing this. There might be another solution - use raw insert 
> > >> interfaces (i.e. raw_heap_insert()).
> > >> Attached is the test (not formal) patch that verifies this idea. 
> > >> raw_heap_insert() writes the page into the
> > >> table files directly and also write the FPI xlog when the tuples filled 
> > >> up the whole page. This seems be
> > >> more efficient.
> > >>
> > >
> > > Thanks. Will the new raw_heap_insert() APIs scale well (i.e. extend
> > > the table parallelly) with parallelism? The existing
> > > table_multi_insert() API scales well, see, for instance, the benefit
> > > with parallel copy[1] and parallel multi inserts in CTAS[2].
> >
> > Yes definitely some work needs to be done to make raw heap insert 
> > interfaces fit the parallel work, but
> > it seems that there is no hard blocking issues for this?
> >
>
> I may be wrong here. If we were to allow raw heap insert APIs to
> handle parallelism, shouldn't we need some sort of shared memory to
> allow coordination among workers? If we do so, at the end, aren't
> these raw insert APIs equivalent to current table_multi_insert() API
> which uses a separate shared ring buffer(bulk insert state) for
> insertions?
>
> And can we think of these raw insert APIs similar to the behaviour of
> table_multi_insert() API for unlogged tables?

I found the additional performance of Paul Guo's work to be compelling
and the idea workable for very large loads.

Surely LockRelationForExtension() is all the inter-process
coordination we need to make this work for parallel loads?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Reloptions for table access methods

2020-12-30 Thread Simon Riggs
On Tue, 15 Dec 2020 at 23:59, Jeff Davis  wrote:
>
> On Tue, 2020-12-15 at 17:37 +, Simon Riggs wrote:
> >
> > I definitely don't agree with the premise that all existing heap
> > options should be common across all new or extension tableAMs. There
> > are dozens of such options and I don't believe that they would all
> > have meaning in all future storage plugins.
>
> I agree in principle, but looking at the options that are present
> today, a lot of them are integrated into general database features,
> like autovacuum, toast, logical replication, and parellel query
> planning.
>
> We need to have some answer about how these features should interact
> with a custom AM if we can't assume anything about the reloptions it
> understands.
>
> > I think options should just work exactly the same for Indexes and
> > Tables.
>
> How should that work with the existing code? Should we have separate AM
> hooks for each of these interaction points, and then have the AM figure
> out how to handle its options? What about the toast.* options?
>
> It feels like some common options would make sense to avoid too much
> code duplication.
>
> I am not trying to push this in a specific direction, but I don't have
> a lot of good answers, so I went for the simplest thing I could think
> of that would allow an extension to have its own options, even if it's
> a bit hacky. I'm open to alternatives.

Your argument seems to be that all new Table AMs should offer some
kind of support for these "standard" options, even if only to accept
and then ignore them, which would presumably require them to document
that. If that is the case, then at the very least offer a doc patch
that says so, so people know what to do, and doc comments that describe
how we should treat new parameters in the future.

But you cannot seriously argue that a one-line patch that changes
user-visible behavior can be acceptable in Postgres core without tests,
docs or code comments. You know as a Committer that that position is
not sustainable.

Minor changes could be OK if they are complete. Please either push
forward to a usable commit or withdraw, so other options can be
considered.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Zedstore - compressed in-core columnar storage

2020-12-31 Thread Simon Riggs
On Wed, 18 Nov 2020 at 00:31, Jacob Champion  wrote:
>
> So that's in need of resolution. I'd expect gin and gist to be pretty
> flaky until we fix that.

Jacob and Soumyadeep,

Thanks for submitting this. I think a fix is still outstanding, and
the patch fails to apply on HEAD in two places.
Please can you submit the next version?

Do you mind if we add this for review to the Jan CF?

It is a lot of code and I think it would be significantly difficult for
the community to accept it as-is, even though it looks to be a very
high-quality submission. So I would like to suggest a strategy for
commit: we accept Zedstore as "Beta" or "Experimental" in PG14,
perhaps with a WARNING/Caution similar to the one that used to be
given by Postgres in earlier versions when you created a hash index.
We keep Zedstore in "Beta" mode until a later release, PG15 or later,
when we can declare Zedstore fully safe. That approach allows us to
get this into the repo asap, and then fix and improve it
incrementally from here.

e.g.

"NOTICE:  Caution: Zedstore is an experimental feature in PostgreSQL14
intended for robustness and performance testing only. Your data and/or
query accuracy may be at risk if you rely on this."

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Safety/validity of resetting permissions by updating system tables

2021-01-03 Thread Simon Riggs
On Fri, Jan 1, 2021 at 7:35 PM Isaac Morland  wrote:

> The use case is to ensure that after doing my GRANTs the permissions are in a 
> known state, no matter what they were before. Typically, one would follow a 
> reset command with some GRANTs. So maybe my permissions script contains:
>
> GRANT UPDATE ON TABLE t1, t2 TO u1, u2;
>
> Later, I revise this to:
>
> GRANT UPDATE ON TABLE t1, t2 TO u1;
>
> But the obsolete permissions will still be available to u2. I would like to 
> be able to put something like this at the top of the permissions script:
>
> RESET PERMISSIONS ON ALL TABLES IN SCHEMA test;
>
> Or in a different context:
>
> RESET PERMISSIONS ON TABLE t1, t2;
>
> Note: I'm not particularly fond of "RESET PERMISSIONS" as the syntax; I just 
> wrote that as an example of what it might look like.
>
> If the tables are newly created this would have no effect; if they were 
> existing tables it would change the permissions to what newly created tables 
> would have.
>
> In the absence of default privileges, I think it's clear that this means 
> setting the acl column (relacl, proacl, ...) to NULL; with default 
> privileges, I think it probably means resetting acl to NULL and then applying 
> the current default privileges as if the object had just been created by its 
> owner. As you point out, it's possible the object never had this privilege 
> set, which is an argument against using the word "reset" in describing the 
> feature. Maybe "GRANT DEFAULT"? But it's weird for GRANT to actually revoke 
> privileges, as it would for most object types.

Exactly what's wrong with "REVOKE ALL ON ALL TABLES IN SCHEMA test" at
the top of your script? You say there is a problem, but don't describe
the precise problem. Can you give a fully worked example so we can
understand how to resolve it?

The meaning of GRANT and REVOKE is now defined by the SQL Standard, so
it is not something we can easily change.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-07 Thread Simon Riggs
On Mon, Jan 4, 2021 at 2:24 PM Masahiko Sawada  wrote:
> Please also note the v7 patch cannot be applied to the current HEAD. I'm 
> switching the patch as Waiting on Author.

Surafel, please say whether you are working on this or not. If you
need help, let us know.

On Tue, 7 Jan 2020 at 10:33, Kyotaro Horiguchi  wrote:
> I think that the start and end timestamps represent the period where
> that version of the row was active. So UPDATE should set both the start
> timestamp of the new version and the end timestamp of the old version
> to the update time. Thus, I don't think adding the start timestamp to
> the PK works as expected. That hinders us from rejecting rows with the
> same user-defined unique key, because the start timestamp is different
> for each insert. I think what Surafel is doing in the current patch is
> correct. Only end_timestamp = +infinity rejects another non-historical
> (= live) version with the same user-defined unique key.

The end_time needs to be updated when a row is updated, so it cannot
form part of the PK. If you try to force that to happen, then logical
replication will not work with system versioned tables, which would be
a bad thing. So *only* start_time should be added to the PK to make
this work. (A later comment also says the start_time needs to be
updated, which makes no sense!)

On Wed, 23 Oct 2019 at 21:03, Vik Fearing  wrote:
> I don't see any error handling for transaction anomalies.  In READ
> COMMITTED, you can easily end up with a case where the end time comes
> before the start time.  I don't even see anything constraining start
> time to be strictly inferior to the end time.  Such a constraint will be
> necessary for application-time periods (which your patch doesn't address
> at all but that's okay).

I don't see how it can have meaning to have an end_time earlier than a
start_time, so yes that should be checked. Having said that, if we use
a statement timestamp on row insertion then, yes, the end_time could
be earlier than start time, so that is just wrong. Ideally we would
use commit timestamp and fill the values in later. So the only thing
that makes sense for me is to use the dynamic time of insertion while
we hold the buffer lock, otherwise we will get anomalies.

The work looks interesting and I will be doing a longer review.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Reloptions for table access methods

2021-01-07 Thread Simon Riggs
On Sat, Jan 2, 2021 at 6:44 PM Jeff Davis  wrote:
>
> On Wed, 2020-12-30 at 21:23 +, Simon Riggs wrote:
> > But you cannot seriously argue that a one line patch that changes
> > user
> > visible behavior can be acceptable in Postgres core without tests or
> > docs or code comments.
>
> Hi Simon,
>
> Often, good documented APIs come about after a few extensions gain
> experience hacking things together using undocumented assumptions and
> internal APIs. But in this case, extension authors have no opportunity
> to hack in reloptions for a TableAM, because of this needless extra
> check that my patch would eliminate.
>
> The patch does not have any user-visible impact. To have any effect, an
> extension would need to use these internal APIs in a specific way that
> is not documented externally. I see my tiny patch as a tiny improvement
> to the backend code because it eliminates a useless extra check, and
> therefore doesn't need much justification. If others see it as a burden
> on the engine code, that's a different story.
>
> If we insist that this must be a fully-supported feature or nothing at
> all, then I'll withdraw the patch. But I doubt that will result in a
> proper feature for v14, it just means that when we do get around to
> supporting extensible reloptions, there will be no hacks in the wild to
> learn from.

Thanks for the reply. I'm trying to get my head around this before a
longer reply.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-07 Thread Simon Riggs
On Thu, Jan 7, 2021 at 5:59 PM Simon Riggs  wrote:
>
> On Mon, Jan 4, 2021 at 2:24 PM Masahiko Sawada  wrote:
> > Please also note the v7 patch cannot be applied to the current HEAD. I'm 
> > switching the patch as Waiting on Author.
>
> Surafel, please say whether you are working on this or not. If you
> need help, let us know.

I've minimally rebased the patch to current head so that it compiles
and passes current make check.

From here, I will add further docs and tests to enhance review and
discover issues.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


system-versioned-temporal-table_2021_v8.patch
Description: Binary data


Re: WIP: System Versioned Temporal Table

2021-01-09 Thread Simon Riggs
On Fri, Jan 8, 2021 at 9:19 PM Ryan Lambert  wrote:

>> Updated v11 with additional docs and some rewording of messages/tests
>> to use "system versioning" correctly.
>>
>> No changes on the points previously raised.
>>
> Thank you!  The v11 applies and installs.  I tried a simple test, 
> unfortunately it appears the versioning is not working. The initial value is 
> not preserved through an update and a new row does not appear to be created.

Agreed. I already noted this in my earlier review comments.

I will send in a new version with additional tests so we can see more
clearly that the tests fail on the present patch.

I will post more on this next week.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: NOT VALID for Unique Indexes

2021-02-26 Thread Simon Riggs
On Mon, Jan 18, 2021 at 12:34 AM David Fetter  wrote:
>
> On Thu, Jan 14, 2021 at 04:22:17PM +, Simon Riggs wrote:
> > As you may be aware the NOT VALID qualifier currently only applies to
> > CHECK and FK constraints, but not yet to unique indexes. I have had
> > customer requests to change that.
>
> This is a great feature!
>
> Not exactly on point with this, but in a pretty closely related
> context, is there some way we could give people the ability to declare
> at their peril that a constraint is valid without incurring the full
> scan that VALIDATE currently does? This is currently doable by
> fiddling directly with the catalog, which operation is broadly more
> dangerous and ill-advised.

That is what NOT VALID allows, but it can't be relied on for optimization.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: NOT VALID for Unique Indexes

2021-02-26 Thread Simon Riggs
On Mon, Jan 18, 2021 at 11:19 PM japin  wrote:
>
>
> On Fri, 15 Jan 2021 at 00:22, Simon Riggs  
> wrote:
> > As you may be aware the NOT VALID qualifier currently only applies to
> > CHECK and FK constraints, but not yet to unique indexes. I have had
> > customer requests to change that.
> >
> > It's a reasonably common requirement to be able to change an index
> > to/from a unique index, i.e. Unique -> NonUnique or NonUnique to
> > Unique. Previously, it was easy enough to do that using a catalog
> > update, but with security concerns  and the fact that the optimizer
> > uses the uniqueness to optimize queries means that there is a gap in
> > our support. We obviously need to scan the index to see if it actually
> > can be marked as unique.
> >
> > In terms of locking we need to exclude writes while we add uniqueness,
> > so scanning the index to check it is unique would cause problems. So
> > we need to do the same thing as we do with other constraint types: add
> > the constraint NOT VALID in one transaction and then later validate it
> > in a separate transaction (if ever).
> >
> > I present a WIP patch to show it's a small patch to change Uniqueness
> > for an index, with docs and tests.
> >
> > ALTER INDEX SET [NOT] UNIQUE [NOT VALID]
> > ALTER INDEX VALIDATE UNIQUE
> >
> > It doesn't do the index validation scan (yet), but I wanted to check
> > acceptability, syntax and requirements before I do that.
> >
> > I can also add similar syntax for UNIQUE and PK constraints.
> >
> > Thoughts please?
>
> Great! I have some questions.
>
> 1. In the patch, you add a new attribute named "induniquevalid" in pg_index,
>however, there is a "indisvalid" in pg_index, can we use "indisvalid"?

indisvalid already has a defined meaning related to creating indexes
concurrently, so I was forced to create another column with a similar
name.

Thanks for reviewing the code in detail.

> 2. The foreign key and CHECK constraints are valid by using
>ALTER TABLE .. ADD table_constraint [ NOT VALID ]
>ALTER TABLE .. VALIDATE CONSTRAINT constraint_name
>
>Should we implement unique index valid/not valid same as foreign key and
>CHECK constraints?

Yes, that is possible. (I wrote the NOT VALID patch for FKs, so was
aware of that).

The syntax I presented was for ALTER INDEX. Not all UNIQUE indexes are
constraints, so it is important to add the option on ALTER INDEX.
Adding the ALTER TABLE syntax can be done later.

> 3. If we use the syntax to valid/not valid the unique, should we support
>other constraints, such as foreign key and CHECK constraints?

I'm sorry, I don't understand that question. FKs and CHECK constraints
are already supported, as you point out above.


I won't be able to finish this patch in time for the next CF, but
thanks for your interest; I will complete it for PG15 later this year.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Deleting older versions in unique indexes to avoid page splits

2020-10-27 Thread Simon Riggs
On Mon, 26 Oct 2020 at 21:15, Peter Geoghegan  wrote:

> Now for the not-so-good news. The TPS numbers looked like this
> (results in original chronological order of the runs, which I've
> interleaved):

While it is important that we investigate the worst cases, I don't see
this as necessarily bad.

HOT was difficult to measure, but on a 2+ hour run on a larger table,
the latency graph was what showed it was a winner. Short runs and
in-memory data masked the benefits in our early analyses.

So I suggest looking not at the totals and averages but at the peaks
and the long-term trend. Showing that in graphical form is best.

> The patch adds a backstop. It seems to me that that's really what we
> need here. Predictability over time and under a large variety of
> different conditions. Real workloads constantly fluctuate.

Yeh, agreed. This is looking like a winner now, but let's check.

> Even if people end up not buying my argument that it's worth it for
> workloads like this, there are various options. And, I bet I can
> further improve the high contention cases without losing the valuable
> part -- there are a number of ways in which I can get the CPU costs
> down further that haven't been fully explored (yes, it really does
> seem to be CPU costs, especially due to TID sorting). Again, this
> patch is all about extreme pathological workloads, system stability,
> and system efficiency over time -- it is not simply about increasing
> system throughput. There are some aspects of this design (that come up
> with extreme workloads) that may in the end come down to value
> judgments. I'm not going to tell somebody that they're wrong for
> prioritizing different things (within reason, at least). In my opinion
> almost all of the problems we have with VACUUM are ultimately
> stability problems, not performance problems per se. And, I suspect
> that we do very well with stupid benchmarks like this compared to
> other DB systems precisely because we currently allow non-HOT updaters
> to "live beyond their means" (which could in theory be great if you
> frame it a certain way that seems pretty absurd to me). This suggests
> we can "afford" to go a bit slower here as far as the competitive
> pressures determine what we should do (notice that this is a distinct
> argument to my favorite argument, which is that we cannot afford to
> *not* go a bit slower in certain extreme cases).
>
> I welcome debate about this.

Agreed, we can trade initial speed for long term consistency. I guess
there are some heuristics there on that tradeoff.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Background writer and checkpointer in crash recovery

2020-11-11 Thread Simon Riggs
On Sun, 30 Aug 2020 at 01:39, Tom Lane  wrote:
>
> Thomas Munro  writes:
> > Once we had the checkpointer running, we could also consider making
> > the end-of-recovery checkpoint optional, or at least the wait for it
> > to complete.  I haven't shown that in this patch, but it's just
> > different checkpoint request flags and possibly an end-of-recovery
> > record.  What problems do you foresee with that?  Why should we have
> > "fast" promotion but not "fast" crash recovery?
>
> I think that the rationale for that had something to do with trying
> to reduce the costs of repeated crashes.  If you've had one crash,
> you might well have another one in your near future, due to the same
> problem recurring.  Therefore, avoiding multiple replays of the same WAL
> is attractive; and to do that we have to complete a checkpoint before
> giving users a chance to crash us again.

Agreed. That rationale is shown in comments and in the commit message.

"We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful."

Having said that, we did raise the checkpoint_timeout by a lot, so the
situation today might be quite different. A large checkpoint_timeout
could eventually overflow shared buffers, with the right workload.

We don't have any stats to show whether this patch is worthwhile or
not, so I suggest adding the attached instrumentation patch as well so
we can see on production systems whether checkpoint_timeout is too
high by comparison with pg_stat_bgwriter. The patch is written in the
style of log_checkpoints.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


crash_recovery_stats.v1.patch
Description: Binary data


Detecting File Damage & Inconsistencies

2020-11-11 Thread Simon Riggs
I would like to propose a few points that will help us detect file
damage, inconsistencies in files and track actions of users.

Optionally, we could add these fields onto the WAL commit record:
* 8-byte session start time (from MyStartTime)
* 2-byte pid (from MyProcPid)
* 4-byte user oid
Identified by a new flag bit, XACT_XINFO_HAS_USERINFO.
That allows us to match up transactions with the server log %c option.
Another option might allow the text username to be added to each commit as well.

XLogLongPageHeaderData has 4 bytes of unused data because of
alignment. We could use that space for 1) the Xid Epoch value or 2) a
CRC value - since only WAL records are covered by a CRC, not page or
file headers. Perhaps we should add both?

There are also 2 bytes unused in the XLogRecord header (with 8 byte
alignment). We could optionally use that space for the pid that wrote
the record, but that's not compelling. What can we use those 2 bytes
for?

REINDEX VERIFY
After the new index is created, but before we drop the old index:
Check whether the two indexes match:
* checks whether the previous index had pointers to row versions that
don't exist
* checks whether the heap has rows that were not in the old index
This approach piggybacks on existing operations. AccessShareLock is
held on both indexes before the old one is dropped.

Other ideas are welcome.

Thoughts?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Detecting File Damage & Inconsistencies

2020-11-11 Thread Simon Riggs
On Thu, 12 Nov 2020 at 06:42, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Simon Riggs 
> > I would like to propose a few points that will help us detect file
> > damage, inconsistencies in files and track actions of users.
>
> Hello, Simon san.  Long time no see.  I'm happy to see you be back here 
> recently.

Thank you, happy to be back. It's good to have the time to contribute again.

> What kind of improvement do you expect?  What problems would this make 
> detectable?

If a rogue user/process is suspected, this would allow you to identify
more easily the changes made by specific sessions/users.

> > * 2-byte pid (from MyProcPid)
>
> pid is 4 bytes on Windows.  Isn't it also 4 byte on Linux when some kernel 
> parameter is set to a certain value?

4 bytes is no problem, thanks for pointing that out.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Detecting File Damage & Inconsistencies

2020-11-13 Thread Simon Riggs
On Fri, 13 Nov 2020 at 00:50, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Simon Riggs 
> > If a rogue user/process is suspected, this would allow you to identify
> > more easily the changes made by specific sessions/users.
>
> Isn't that kind of auditing a job of pgAudit or log_statement = mod?  Or, 
> does "more easily" mean that you find pgAudit complex to use and/or 
> log_statement's overhead is big?

Well, I designed pgaudit, so yes, I think pgaudit is useful.

However, pgaudit works at the statement level, not the data level. So
using pgaudit to locate data rows that have changed is fairly hard.

What I'm proposing is an option to add 16 bytes onto each COMMIT
record, which is considerably less than turning on full auditing in
pgaudit. This option would allow identifying data at the row level, so
you could for example find all rows changed by specific sessions.
Also, because it is stored in WAL it will show updates that might no
longer exist in the database because the changed row versions might
have been vacuumed away. So pgaudit will tell you that happened, but
having extra info in WAL is important also.

So thank you for the question because it has allowed me to explain why
it is useful and important.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-16 Thread Simon Riggs
On Tue, 10 Nov 2020 at 03:02, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > Yeah ... it would be much better if we can make it use atomics instead.
>
> I was thinking more like "do we need any locking at all".
>
> Assuming that a proc's vacuumFlags can be set by only the process itself,
> there's no write conflicts to worry about.  On the read side, there's a
> hazard that onlookers will not see the PROC_IN_SAFE_IC flag set; but
> that's not any different from what the outcome would be if they looked
> just before this stanza executes.  And even if they don't see it, at worst
> we lose the optimization being proposed.
>
> There is a question of whether it's important that both copies of the flag
> appear to update atomically ... but that just begs the question "why in
> heaven's name are there two copies?"

Agreed to all of the above, but I think the issue is minuscule because
ProcArrayLock is acquired in LW_EXCLUSIVE mode at transaction commit.
So it doesn't seem like there is much need to optimize this particular
aspect of the patch.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-16 Thread Simon Riggs
The docs are misleading for this feature, since they say:
"This option disables all page-skipping behavior, and is
intended to be used only when the contents of the visibility map are
suspect, which should happen only if there is a hardware or software
issue causing database corruption."

The docs do correctly say "Pages where all tuples are known to be
frozen can always be skipped". Checking the code, lazy_scan_heap()
comments say
"we can still skip pages that are all-frozen, since such pages do not
need freezing".

The code is quite clear: DISABLE_PAGE_SKIPPING makes the vacuum into
an aggressive vacuum. Line 487, heap_vacuum_rel().  Aggressive vacuums
can still skip a page that is frozen, and rely on the visibility map
for that information.

So the docs are wrong - we don't disable *all* page-skipping and it is
not appropriate to warn users away from this feature by saying "is
intended to be used only when the contents of the visibility map are
suspect".

Reworded docs patch attached.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


aggressive_rewording.v1.patch
Description: Binary data


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-17 Thread Simon Riggs
On Mon, 16 Nov 2020 at 22:53, Masahiko Sawada  wrote:
> On Tue, Nov 17, 2020 at 5:52 AM Simon Riggs  wrote:

> I don't think the doc is wrong. If DISABLE_PAGE_SKIPPING is specified,
> we not only set aggressive = true but also skip checking visibility
> map. For instance, see line 905 and line 963, lazy_scan_heap().

OK, so you're saying that the docs illustrate the true intention of
the patch, which I immediately accept since I know you were the
author. Forgive me for not discussing it with you first, I thought
this was a clear cut case.

But that then highlights another area where the docs are wrong...

> On Tue, Nov 17, 2020 at 5:52 AM Simon Riggs  wrote:
> > The docs do correctly say "Pages where all tuples are known to be
> > frozen can always be skipped". Checking the code, lazy_scan_heap()
> > comments say
> > "we can still skip pages that are all-frozen, since such pages do not
> > need freezing".

The docs say this:
"Pages where all tuples are known to be frozen can always be skipped."
Why bother to say that if the feature then ignores that point and
scans them anyway?
May I submit a patch to remove that sentence?

Anyway, we're back to where I started: looking for a user-initiated
command option that allows a table to be scanned aggressively so that
relfrozenxid can be moved forward as quickly as possible. This is what
I thought that you were trying to achieve with DISABLE_PAGE_SKIPPING
option, my bad.

Now David J, above, says this would be VACUUM FREEZE, but I don't
think that is right. Setting VACUUM FREEZE has these effects: 1) makes
a vacuum aggressive, but it also 2) moves the freeze limit so high
that it freezes mostly everything. (1) allows the vacuum to reset
relfrozenxid, but (2) actually slows down the scan by making it freeze
more blocks than it would do normally.

So we have 3 ways to reset relfrozenxid by a user action:
VACUUM (DISABLE_PAGE_SKIPPING ON) - scans all blocks, deliberately
ignoring the ones it could have skipped. This certainly slows it down.
VACUUM (FREEZE ON) - freezes everything in its path, slowing down the
scan by writing too many blocks.
VACUUM (FULL on) - rewrites table and rebuilds index, so very slow

What I think we need is a 4th option which aims to move relfrozenxid
forwards as quickly as possible
* initiates an aggressive scan, so it does not skip blocks because of
busy buffer pins
* skip pages that are all-frozen, as we are allowed to do
* uses normal freeze limits, so we avoid writing to blocks if possible

If we do all 3 of those things, the scan will complete as quickly as
possible and reset relfrozenxid quickly. It would make sense to use
that in conjunction with index_cleanup=off
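
For illustration, roughly what this would look like in practice (table name
invented). The first form is the closest we can get today using the
freeze-table-age GUCs; the second is only a sketch of the proposed option,
with a placeholder name (the patches below call it ANTI_WRAPAROUND and
later FAST_FREEZE):

-- today: force an aggressive scan without raising the freeze limits,
-- and skip index cleanup so the scan completes as quickly as possible
SET vacuum_freeze_table_age = 0;
SET vacuum_multixact_freeze_table_age = 0;
VACUUM (INDEX_CLEANUP OFF) big_table;

-- proposed: a single option that does the above in one step
VACUUM (FAST_FREEZE ON) big_table;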

As an additional optimization, if we do find a row that needs freezing
on a data block, we should simply freeze *all* row versions on the
page, not just the ones below the selected cutoff. This is justified
since writing the block is the biggest cost and it doesn't make much
sense to leave a few rows unfrozen on a block that we are dirtying.

Thoughts?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Detecting File Damage & Inconsistencies

2020-11-17 Thread Simon Riggs
On Fri, 13 Nov 2020 at 11:24, Simon Riggs  wrote:
>
> On Fri, 13 Nov 2020 at 00:50, tsunakawa.ta...@fujitsu.com
>  wrote:
> >
> > From: Simon Riggs 
> > > If a rogue user/process is suspected, this would allow you to identify
> > > more easily the changes made by specific sessions/users.
> >
> > Isn't that kind of auditing a job of pgAudit or log_statement = mod?  Or, 
> > does "more easily" mean that you find pgAudit complex to use and/or 
> > log_statement's overhead is big?
>
> Well, I designed pgaudit, so yes, I think pgaudit is useful.
>
> However, pgaudit works at the statement level, not the data level. So
> using pgaudit to locate data rows that have changed is fairly hard.
>
> What I'm proposing is an option to add 16 bytes onto each COMMIT
> record, which is considerably less than turning on full auditing in
> pgaudit. This option would allow identifying data at the row level, so
> you could for example find all rows changed by specific sessions.
> Also, because it is stored in WAL it will show updates that might no
> longer exist in the database because the changed row versions might
> have been vacuumed away. So pgaudit will tell you that happened, but
> having extra info in WAL is important also.
>
> So thank you for the question because it has allowed me to explain why
> it is useful and important.

Patch attached to implement "wal_sessioninfo" option.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


wal_sessioninfo.v2.patch
Description: Binary data


Re: Detecting File Damage & Inconsistencies

2020-11-18 Thread Simon Riggs
On Wed, 18 Nov 2020 at 06:42, Craig Ringer
 wrote:
>
> On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs  wrote:
>>
>>
>> What I'm proposing is an option to add 16 bytes onto each COMMIT
>> record
>
>
> Would it make sense to write this at the time we write a topxid assignment to 
> WAL instead?
>
> Otherwise it won't be accessible to streaming-mode logical decoding.

Do you mean extend the xl_xact_assignment record? My understanding is
that is not sent in all cases, so not sure what you mean by "instead".

--
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-18 Thread Simon Riggs
On Wed, 18 Nov 2020 at 10:28, Masahiko Sawada  wrote:

> > So we have 3 ways to reset relfrozenxid by a user action:
> > VACUUM (DISABLE_PAGE_SKIPPING ON) - scans all blocks, deliberately
> > ignoring the ones it could have skipped. This certainly slows it down.
> > VACUUM (FREEZE ON) - freezes everything in its path, slowing down the
> > scan by writing too many blocks.
> > VACUUM (FULL on) - rewrites table and rebuilds index, so very slow
> >
> > What I think we need is a 4th option which aims to move relfrozenxid
> > forwards as quickly as possible
> > * initiates an aggressive scan, so it does not skip blocks because of
> > busy buffer pins
> > * skip pages that are all-frozen, as we are allowed to do
> > * uses normal freeze limits, so we avoid writing to blocks if possible
> >
>
> This can be done with VACUUM today by vacuum_freeze_table_age and
> vacuum_multixact_freeze_table_age to 0. Adding an option for this
> behavior would be good for users to understand and would work well
> with the optimization.
>
> > If we do all 3 of those things, the scan will complete as quickly as
> > possible and reset relfrozenxid quickly. It would make sense to use
> > that in conjunction with index_cleanup=off
>
> Agreed.

Patches attached.
1. vacuum_anti_wraparound.v2.patch
2. vacuumdb_anti_wrap.v1.patch - depends upon (1)

> > As an additional optimization, if we do find a row that needs freezing
> > on a data block, we should simply freeze *all* row versions on the
> > page, not just the ones below the selected cutoff. This is justified
> > since writing the block is the biggest cost and it doesn't make much
> > sense to leave a few rows unfrozen on a block that we are dirtying.
>
> +1

I'll work on that.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


vacuumdb_anti_wrap.v1.patch
Description: Binary data


vacuum_anti_wraparound.v2.patch
Description: Binary data


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-19 Thread Simon Riggs
On Wed, 18 Nov 2020 at 17:59, Robert Haas  wrote:
>
> On Wed, Nov 18, 2020 at 12:54 PM Simon Riggs  wrote:
> > Patches attached.
> > 1. vacuum_anti_wraparound.v2.patch
> > 2. vacuumdb_anti_wrap.v1.patch - depends upon (1)
>
> I don't like the use of ANTI_WRAPAROUND as a name for this new option.
> Wouldn't it make more sense to call it AGGRESSIVE? Or maybe something
> else, but I dislike anti-wraparound.

-1 for using the term AGGRESSIVE, which seems likely to offend people.
I'm sure a more descriptive term exists.

> It's neither the most aggressive
> thing we can do to prevent wraparound (that's FREEZE),

The new option is not the same thing as the FREEZE option, as discussed above.

> nor is it the
> case that vacuums without this option (or indeed any options) can't
> help prevent wraparound, because the aggressive strategy  may be
> chosen anyway.

Maybe.

The "aim [is] to move relfrozenxid forwards as quickly as possible" so
as to avoid wraparound, so having an unambiguous command that does
that is important for usability. It also allows us to rely on the
user's explicit intention to optimize vacuum towards that goal.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-19 Thread Simon Riggs
On Wed, 18 Nov 2020 at 02:04, Alvaro Herrera  wrote:
>
> On 2020-Nov-17, Simon Riggs wrote:
>
> > As an additional optimization, if we do find a row that needs freezing
> > on a data block, we should simply freeze *all* row versions on the
> > page, not just the ones below the selected cutoff. This is justified
> > since writing the block is the biggest cost and it doesn't make much
> > sense to leave a few rows unfrozen on a block that we are dirtying.
>
> Yeah.  We've had earlier proposals to use high and low watermarks: if any
> tuple is past the high watermark, then freeze all tuples that are past
> the low watermark.  However this is ancient thinking (prior to
> HEAP_XMIN_FROZEN) and we don't need the low watermark to be different
> from zero, since the original xid is retained anyway.
>
> So +1 for this idea.

Patch to do this attached, for discussion.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


one_freeze_then_max_freeze.v5.patch
Description: Binary data


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Simon Riggs
On Fri, 20 Nov 2020 at 01:40, Masahiko Sawada  wrote:
>
> On Thu, Nov 19, 2020 at 8:02 PM Simon Riggs  wrote:
> >
> > On Wed, 18 Nov 2020 at 17:59, Robert Haas  wrote:
> > >
> > > On Wed, Nov 18, 2020 at 12:54 PM Simon Riggs  
> > > wrote:
> > > > Patches attached.
> > > > 1. vacuum_anti_wraparound.v2.patch
> > > > 2. vacuumdb_anti_wrap.v1.patch - depends upon (1)
> > >
> > > I don't like the use of ANTI_WRAPAROUND as a name for this new option.
> > > Wouldn't it make more sense to call it AGGRESSIVE? Or maybe something
> > > else, but I dislike anti-wraparound.
> >
> > -1 for using the term AGGRESSIVE, which seems likely to offend people.
> > I'm sure a more descriptive term exists.
>
> Since we use the term aggressive scan in the docs, I personally don't
> feel unnatural about that. But since this option also disables index
> cleanup when not enabled explicitly, I’m concerned a bit if user might
> get confused. I came up with some names like FEEZE_FAST and
> FREEZE_MINIMAL but I'm not sure these are better.

FREEZE_FAST seems good.

> BTW if this option also disables index cleanup for faster freezing,
> why don't we disable heap truncation as well?

Good idea

-- 
Simon Riggs    http://www.2ndQuadrant.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Simon Riggs
On Fri, 20 Nov 2020 at 03:54, Masahiko Sawada  wrote:

> > > So +1 for this idea.
> >
> > Patch to do this attached, for discussion.
>
> Thank you for the patch!
>
> +*
> +* Once we decide to dirty the data block we may as well 
> freeze
> +* any tuples that are visible to all, since the additional
> +* cost of freezing multiple tuples is low.
>
> I'm concerned that always freezing all tuples when we're going to make
> the page dirty would affect the existing vacuum workload much. The
> additional cost of freezing multiple tuples would be low but if we
> freeze tuples we would also need to write WAL, which is not negligible
> overhead I guess. In the worst case, if a table has dead tuples on all
> pages we process them, but with this patch, in addition to that, we
> will end up not only freezing all live tuples but also writing
> XLOG_HEAP2_FREEZE_PAGE WAL for all pages. So I think it would be
> better either to freeze all tuples if we find a tuple that needs to be
> frozen or to make this behavior work only if the new VACUUM option is
> specified.

The additional cost of freezing is sizeof(xl_heap_freeze_tuple) = 11 bytes

I guess there is some overhead for writing the WAL record itself, the
only question is how much. If that is a concern then we definitely
don't want to do that only when using FAST_FREEZE, since that would
slow it down when we want to speed it up.

I've updated the patch to match your proposal, so we can compare. It
seems a shorter patch.

(This patch is an optimization that is totally independent to the
other proposals on this thread).

-- 
Simon Riggs    http://www.EnterpriseDB.com/


one_freeze_then_max_freeze.v6.patch
Description: Binary data


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Simon Riggs
On Fri, 20 Nov 2020 at 10:15, Simon Riggs  wrote:
>
> On Fri, 20 Nov 2020 at 01:40, Masahiko Sawada  wrote:
> >
> > On Thu, Nov 19, 2020 at 8:02 PM Simon Riggs  wrote:
> > >
> > > On Wed, 18 Nov 2020 at 17:59, Robert Haas  wrote:
> > > >
> > > > On Wed, Nov 18, 2020 at 12:54 PM Simon Riggs  
> > > > wrote:
> > > > > Patches attached.
> > > > > 1. vacuum_anti_wraparound.v2.patch
> > > > > 2. vacuumdb_anti_wrap.v1.patch - depends upon (1)
> > > >
> > > > I don't like the use of ANTI_WRAPAROUND as a name for this new option.
> > > > Wouldn't it make more sense to call it AGGRESSIVE? Or maybe something
> > > > else, but I dislike anti-wraparound.
> > >
> > > -1 for using the term AGGRESSIVE, which seems likely to offend people.
> > > I'm sure a more descriptive term exists.
> >
> > Since we use the term aggressive scan in the docs, I personally don't
> > feel unnatural about that. But since this option also disables index
> > cleanup when not enabled explicitly, I’m concerned a bit if user might
> > get confused. I came up with some names like FEEZE_FAST and
> > FREEZE_MINIMAL but I'm not sure these are better.
>
> FREEZE_FAST seems good.
>
> > BTW if this option also disables index cleanup for faster freezing,
> > why don't we disable heap truncation as well?
>
> Good idea

Patch attached, using the name "FAST_FREEZE" instead.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


vacuum_fast_freeze.v3.patch
Description: Binary data


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-20 Thread Simon Riggs
On Fri, 20 Nov 2020 at 14:07, Alvaro Herrera  wrote:
>
> On 2020-Nov-20, Masahiko Sawada wrote:
>
> > I'm concerned that always freezing all tuples when we're going to make
> > the page dirty would affect the existing vacuum workload much. The
> > additional cost of freezing multiple tuples would be low but if we
> > freeze tuples we would also need to write WAL, which is not negligible
> > overhead I guess. In the worst case, if a table has dead tuples on all
> > pages we process them, but with this patch, in addition to that, we
> > will end up not only freezing all live tuples but also writing
> > XLOG_HEAP2_FREEZE_PAGE WAL for all pages. So I think it would be
> > better either to freeze all tuples if we find a tuple that needs to be
> > frozen or to make this behavior work only if the new VACUUM option is
> > specified.
>
> There are two costs associated with this processing.  One is dirtying
> the page (which means it needs to be written down when evicted), and the
> other is to write WAL records for each change.  The cost for the latter
> is going to be the same in both cases (with this change and without)
> because the same WAL will have to be written -- the only difference is
> *when* do you pay it.  The cost of the former is quite different; with
> Simon's patch we dirty the page once, and without the patch we may dirty
> it several times before it becomes "stable" and no more writes are done
> to it.
>
> (If you have tables whose pages change all the time, there would be no
> difference with or without the patch.)
>
> Dirtying the page less times means less full-page images to WAL, too,
> which can be significant.

Summary of patch effects:

one_freeze_then_max_freeze.v5.patch
doesn't increase/decrease the number of pages that are dirtied by
freezing, but when a page will be dirtied **by any activity** it
maximises the number of rows frozen, so in some cases it may write a
WAL freeze record when it would not have done previously.

one_freeze_then_max_freeze.v6.patch
doesn't increase/decrease the number of pages that are dirtied by
freezing, but when a page will be dirtied **by freezing only** it
maximises the number of rows frozen, so the number of WAL records for
freezing is the same, but they may contain more tuples than before and
will increase the probability that the page is marked all_frozen.

So yes, as you say, the net effect will be to reduce the number of
write I/Os in subsequent vacuums required to move forward
relfrozenxid.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-29 Thread Simon Riggs
On Fri, 20 Nov 2020 at 11:47, Simon Riggs  wrote:
>
> On Fri, 20 Nov 2020 at 10:15, Simon Riggs  wrote:
> >
> > On Fri, 20 Nov 2020 at 01:40, Masahiko Sawada  wrote:
> > >
> > > On Thu, Nov 19, 2020 at 8:02 PM Simon Riggs  wrote:
> > > >
> > > > On Wed, 18 Nov 2020 at 17:59, Robert Haas  wrote:
> > > > >
> > > > > On Wed, Nov 18, 2020 at 12:54 PM Simon Riggs  
> > > > > wrote:
> > > > > > Patches attached.
> > > > > > 1. vacuum_anti_wraparound.v2.patch
> > > > > > 2. vacuumdb_anti_wrap.v1.patch - depends upon (1)
> > > > >
> > > > > I don't like the use of ANTI_WRAPAROUND as a name for this new option.
> > > > > Wouldn't it make more sense to call it AGGRESSIVE? Or maybe something
> > > > > else, but I dislike anti-wraparound.
> > > >
> > > > -1 for using the term AGGRESSIVE, which seems likely to offend people.
> > > > I'm sure a more descriptive term exists.
> > >
> > > Since we use the term aggressive scan in the docs, I personally don't
> > > feel unnatural about that. But since this option also disables index
> > > cleanup when not enabled explicitly, I’m concerned a bit if user might
> > > get confused. I came up with some names like FEEZE_FAST and
> > > FREEZE_MINIMAL but I'm not sure these are better.
> >
> > FREEZE_FAST seems good.
> >
> > > BTW if this option also disables index cleanup for faster freezing,
> > > why don't we disable heap truncation as well?
> >
> > Good idea
>
> Patch attached, using the name "FAST_FREEZE" instead.

And the additional patch also.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


vacuumdb_fast_freeze.v2.patch
Description: Binary data


Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-29 Thread Simon Riggs
On Fri, 20 Nov 2020 at 15:29, Alvaro Herrera  wrote:
>
> Note on heap_prepare_freeze_tuple()'s fifth parameter, it's not valid to
> pass OldestXmin; you need a multixact limit there, not an Xid limit.  I
> think the return value of GetOldestMultiXactId is a good choice.  AFAICS
> this means that you'll need to add a new output argument to
> vacuum_set_xid_limits (You *could* call GetOldestMultiXactId again, but
> it seems a waste).
>
> Maybe it's time for vacuum_set_xid_limits to have a caller's-stack-
> allocated struct for input and output values, rather than so many
> arguments and output pointers to fill.

The idea was to maximise freezing because we are already dirtying this
data block, so the reasoning doesn't extend directly to multixacts.
Being more active with multixacts could easily cause more churn there.

So I'm changing the patch to only work with xids, not both xids and
multixacts. If this gets committed we can think more about multixacts
and whether to optimize freezing them in the same situation or not.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


one_freeze_then_max_freeze.v7.patch
Description: Binary data


Re: support for MERGE

2021-01-13 Thread Simon Riggs
On Sun, Jan 10, 2021 at 1:44 AM Tomas Vondra
 wrote:

> 5) WHEN AND
>
> I admit the "WHEN AND" conditions sounds a bit cryptic - it took me a
> while to realize what this refers to. Is that a term established by SQL
> Standard, or something we invented?

As Vik notes, this refers to the WHEN [NOT] MATCHED AND when-and-clause,
so in that case I was referring to the "when-and-clause" portion.
Yes, that is part of the standard.
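
For readers following along, a minimal sketch of the clause in question
(table and column names invented):

MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED AND s.qty = 0 THEN      -- "WHEN ... AND": extra condition on the action
    DELETE
WHEN MATCHED THEN
    UPDATE SET qty = s.qty
WHEN NOT MATCHED THEN
    INSERT (id, qty) VALUES (s.id, s.qty);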

> 6) walsender.c
>
> Huh, why does this patch touch this at all?

Nothing I added, IIRC, nor am I aware of why that would exist.

> 7) rewriteHandler.c
>
> I see MERGE "doesn't support" rewrite rules in the sense that it simply
> ignores them. Shouldn't it error-out instead? Seems like a foot-gun to
> me, because people won't realize this limitation and may not notice
> their rules don't fire.

Simply ignoring rules is consistent with COPY, that was the only
reason for that choice. It could certainly throw an error instead.

> 8) varlena.c
>
> Again, why are these changes to length checks in a MERGE patch?

Nothing I added, IIRC, nor am I aware of why that would exist.

> 9) parsenodes.h
>
> Should we rename mergeTarget_relation to mergeTargetRelation? The
> current name seems like a mix between two naming schemes.

+1

We've had code from 4-5 people in the patch now, so I will re-review
myself to see if I can shed light on anything.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




NOT VALID for Unique Indexes

2021-01-14 Thread Simon Riggs
As you may be aware the NOT VALID qualifier currently only applies to
CHECK and FK constraints, but not yet to unique indexes. I have had
customer requests to change that.

It's a reasonably common requirement to be able to change an index
to/from a unique index, i.e. Unique -> NonUnique or NonUnique to
Unique. Previously, it was easy enough to do that using a catalog
update, but security concerns and the fact that the optimizer
uses the uniqueness to optimize queries mean that there is now a gap in
our support. We obviously need to scan the index to see if it actually
can be marked as unique.

In terms of locking we need to exclude writes while we add uniqueness,
so scanning the index to check it is unique would cause problems. So
we need to do the same thing as we do with other constraint types: add
the constraint NOT VALID in one transaction and then later validate it
in a separate transaction (if ever).

I present a WIP patch to show that changing uniqueness for an index
is a small patch, with docs and tests.

ALTER INDEX SET [NOT] UNIQUE [NOT VALID]
ALTER INDEX VALIDATE UNIQUE
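
For example (index name invented; this is only the intended workflow for
the WIP syntax above, not committed behaviour):

-- claim uniqueness now, without the validating scan
ALTER INDEX orders_customer_idx SET UNIQUE NOT VALID;

-- later, in a separate transaction, scan the index and validate the claim
ALTER INDEX orders_customer_idx VALIDATE UNIQUE;

-- or go the other way and drop uniqueness again
ALTER INDEX orders_customer_idx SET NOT UNIQUE;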

It doesn't do the index validation scan (yet), but I wanted to check
acceptability, syntax and requirements before I do that.

I can also add similar syntax for UNIQUE and PK constraints.

Thoughts please?

-- 
Simon Riggs    http://www.EnterpriseDB.com/


alter_index_set_unique_not_valid.v4.patch
Description: Binary data


Re: WIP: System Versioned Temporal Table

2021-01-14 Thread Simon Riggs
On Thu, Jan 14, 2021 at 5:46 PM Surafel Temesgen  wrote:

> On Fri, Jan 8, 2021 at 7:50 PM Ryan Lambert  wrote:
>>
>> I prefer to have them hidden by default.  This was mentioned up-thread with 
>> no decision, it seems the standard is ambiguous.  MS SQL appears to have 
>> flip-flopped on this decision [1].

I think the default should be like this:

SELECT * FROM foo FOR SYSTEM_TIME AS OF ...
should NOT include the Start and End timestamp columns
because this acts like a normal query just with a different snapshot timestamp

SELECT * FROM foo FOR SYSTEM_TIME BETWEEN x AND y
SHOULD include the Start and End timestamp columns
since this form of query can include multiple row versions for the
same row, so it makes sense to see the validity times
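
To make that concrete (assuming a system-versioned table foo with period
columns start_timestamp/end_timestamp; this is the proposed behaviour, not
anything committed):

-- acts like a normal query at an earlier snapshot, so * omits the period columns
SELECT * FROM foo FOR SYSTEM_TIME AS OF TIMESTAMP '2021-01-01 00:00:00';

-- can return several versions of the same row, so * includes
-- start_timestamp and end_timestamp
SELECT * FROM foo
FOR SYSTEM_TIME BETWEEN TIMESTAMP '2021-01-01' AND TIMESTAMP '2021-02-01';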

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-14 Thread Simon Riggs
On Thu, Jan 14, 2021 at 5:42 PM Surafel Temesgen  wrote:
>
> Hi Simon,
> Thank you for all the work you does

No problem.

> On Mon, Jan 11, 2021 at 5:02 PM Simon Riggs  
> wrote:
>>
>>
>>
>> * Anomalies around use of CURRENT_TIMESTAMP are not discussed or resolved.
>> Probably need to add a test that end_timestamp > start_timestamp or ERROR,
>> which effectively enforces serializability.
>>
>
>
> This scenario doesn't happen.

Yes, I think it can. The current situation is that the Start or End is
set to the Transaction Start Timestamp.
So if t2 starts before t1, then if t1 creates a row and t2 deletes it
then we will have start=t1 end=t2, but t2 < t1.
Your tests don't show that because it must happen concurrently.
We need to add an isolation test to show this, or to prove it doesn't happen.

> There are no possibility of a record being deleted or updated before inserting

Agreed, but that was not the point.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-15 Thread Simon Riggs
On Fri, Jan 15, 2021 at 4:46 PM Surafel Temesgen  wrote:
>
>
>
> On Fri, Jan 15, 2021 at 12:27 AM Simon Riggs  
> wrote:
>>
>>
>> Yes, I think it can. The current situation is that the Start or End is
>> set to the Transaction Start Timestamp.
>> So if t2 starts before t1, then if t1 creates a row and t2 deletes it
>> then we will have start=t1 end=t2, but t2 < t1.
>> Your tests don't show that because it must happen concurrently.
>> We need to add an isolation test to show this, or to prove it doesn't happen.
>>
>
>
> Does MVCC allow that? i am not expert on MVCC but i don't
> think t2 can see the row create by translation started before
> itself

Yeh. Read Committed mode can see anything committed prior to the start
of the current statement, but UPDATEs always update the latest version
even if they can't see it.
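
A short sketch of that behaviour, since it is the heart of the anomaly
(two concurrent sessions at Read Committed; table and columns invented):

-- Session 1
BEGIN;
UPDATE t SET v = v + 1 WHERE id = 1;

-- Session 2, concurrently
BEGIN;
UPDATE t SET v = v + 10 WHERE id = 1;   -- blocks waiting for session 1

-- Session 1
COMMIT;

-- Session 2 now proceeds: it updates the latest row version, the one
-- session 1 just committed, even though its own snapshot cannot see it.
COMMIT;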

Anyway, an isolationtester spec file is needed to test this. See
src/test/isolation and the examples in the specs/ directory.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-15 Thread Simon Riggs
On Fri, Jan 15, 2021 at 4:56 PM Surafel Temesgen  wrote:
>
>
>
> On Fri, Jan 15, 2021 at 12:22 AM Simon Riggs  
> wrote:
>>
>> SELECT * FROM foo FOR SYSTEM_TIME AS OF ...
>> should NOT include the Start and End timestamp columns
>> because this acts like a normal query just with a different snapshot 
>> timestamp
>>
>> SELECT * FROM foo FOR SYSTEM_TIME BETWEEN x AND y
>> SHOULD include the Start and End timestamp columns
>> since this form of query can include multiple row versions for the
>> same row, so it makes sense to see the validity times
>>
>
> One disadvantage of returning system time columns is it
> breaks upward compatibility. if an existing application wants to
> switch to system versioning it will break.

There are no existing applications, so for PostgreSQL, it wouldn't be an issue.

If you mean compatibility with other databases, that might be an
argument to do what others have done. What have other databases done
for SELECT * ?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Discarding DISCARD ALL

2021-01-20 Thread Simon Riggs
On Wed, 20 Jan 2021 at 14:21, James Coleman  wrote:

> An alternative approach that occurred to me while typing this reply: a
> setting in Postgres that would disallow setting session level GUCs
> (i.e., enforce `SET LOCAL` transaction level usage instead) would
> remove a large chunk of our need to set server_reset_query_always=1
> (and more interestingly it'd highlight when broken code gets pushed).
> But even with that, I see some value in the proposed setting since
> there is additional session state beyond GUCs.

With transaction_cleanup=on we could force all SETs to be SET LOCAL.

The point is that if we declare ahead of time that the transaction
will be reset then we can act differently and more easily for various
circumstances, for SETs, for Temp tables and others.
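
To illustrate the SET side of that (existing behaviour today, plus the
proposed setting, which is hypothetical):

-- today: a plain SET survives COMMIT, which is why poolers need reset queries
BEGIN;
SET work_mem = '256MB';
COMMIT;
SHOW work_mem;            -- still 256MB

-- SET LOCAL reverts at end of transaction
BEGIN;
SET LOCAL work_mem = '256MB';
COMMIT;
SHOW work_mem;            -- back to the previous value

-- with the proposed transaction_cleanup = on, the first form would be
-- treated like the second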

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-26 Thread Simon Riggs
On Tue, Jan 26, 2021 at 11:33 AM Vik Fearing  wrote:
>
> On 1/11/21 3:02 PM, Simon Riggs wrote:
> > * UPDATE foo SET start_timestamp = DEFAULT should fail but currently doesn't
>
> I'm still in the weeds of reviewing this patch, but why should this
> fail?  It should not fail.

It should not be possible for the user to change the start or end
timestamp of a system_time time range, by definition.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-26 Thread Simon Riggs
On Tue, Jan 26, 2021 at 12:51 PM Vik Fearing  wrote:
>
> On 1/26/21 1:16 PM, Simon Riggs wrote:
> > On Tue, Jan 26, 2021 at 11:33 AM Vik Fearing  
> > wrote:
> >>
> >> On 1/11/21 3:02 PM, Simon Riggs wrote:
> >>> * UPDATE foo SET start_timestamp = DEFAULT should fail but currently 
> >>> doesn't
> >>
> >> I'm still in the weeds of reviewing this patch, but why should this
> >> fail?  It should not fail.
> >
> > It should not be possible for the user to change the start or end
> > timestamp of a system_time time range, by definition.
>
> Correct, but setting it to DEFAULT is not changing it.
>
> See also SQL:2016 11.5  General Rule 3.a.

Thanks for pointing this out. Identity columns don't currently work that way.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2021-01-28 Thread Simon Riggs
On Thu, 28 Jan 2021 at 12:53, Masahiko Sawada  wrote:

> This entry has been "Waiting on Author" status and the patch has not
> been updated since Nov 30. Are you still planning to work on this?

Yes, new patch version tomorrow. Thanks for the nudge and the review.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: suboverflowed subtransactions concurrency performance optimize

2021-12-03 Thread Simon Riggs
On Fri, 3 Dec 2021 at 01:27, Dilip Kumar  wrote:

> > On review, I think it is also possible that we update subtrans ONLY if
> > someone uses >PGPROC_MAX_CACHED_SUBXIDS.
> > This would make subtrans much smaller and avoid one-entry-per-page
> > which is a major source of cacheing.
> > This would means some light changes in GetSnapshotData().
> > Let me know if that seems interesting also?
>
> Do you mean to say avoid setting the sub-transactions parent if the
> number of sun-transactions is not crossing PGPROC_MAX_CACHED_SUBXIDS?
> But the TransactionIdDidCommit(), might need to fetch the parent if
> the transaction status is TRANSACTION_STATUS_SUB_COMMITTED, so how
> would we handle that?

TRANSACTION_STATUS_SUB_COMMITTED is set as a transient state during
final commit.
In that case, the top-level xid is still in the procarray when nsubxids <
PGPROC_MAX_CACHED_SUBXIDS, so we need not consult pg_subtrans in that case;
see step 4 of TransactionIdIsInProgress().

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: suboverflowed subtransactions concurrency performance optimize

2021-12-08 Thread Simon Riggs
On Wed, 1 Dec 2021 at 06:41, Andrey Borodin  wrote:

> > On review, I think it is also possible that we update subtrans ONLY if
> > someone uses >PGPROC_MAX_CACHED_SUBXIDS.
> > This would make subtrans much smaller and avoid one-entry-per-page
> > which is a major source of cacheing.
> > This would means some light changes in GetSnapshotData().
> > Let me know if that seems interesting also?
>
> I'm afraid of unexpected performance degradation. When the system runs fine, 
> you provision a VM of some vCPU\RAM, and then some backend uses a little more 
> than 64 subtransactions and all the system is stuck. Or will it affect only 
> backend using more than 64 subtransactions?

That is the objective: to isolate the effect to only those that
overflow. It seems possible.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: suboverflowed subtransactions concurrency performance optimize

2021-12-08 Thread Simon Riggs
On Fri, 3 Dec 2021 at 06:27, Dilip Kumar  wrote:
>
> On Tue, Nov 30, 2021 at 5:49 PM Simon Riggs
>  wrote:
>
> > transam.c uses a single item cache to prevent thrashing from repeated
> > lookups, which reduces problems with shared access to SLRUs.
> > multitrans.c also has similar.
> >
> > I notice that subtrans. doesn't have this, but could easily do so.
> > Patch attached, which seems separate to other attempts at tuning.
>
> Yeah, this definitely makes sense.
>
> > On review, I think it is also possible that we update subtrans ONLY if
> > someone uses >PGPROC_MAX_CACHED_SUBXIDS.
> > This would make subtrans much smaller and avoid one-entry-per-page
> > which is a major source of cacheing.
> > This would means some light changes in GetSnapshotData().
> > Let me know if that seems interesting also?
>
> Do you mean to say avoid setting the sub-transactions parent if the
> number of sun-transactions is not crossing PGPROC_MAX_CACHED_SUBXIDS?

Yes.

This patch shows where I'm going, with changes in GetSnapshotData()
and XidInMVCCSnapshot() and XactLockTableWait().
Passes make check, but needs much more, so this is review-only at this
stage to give a flavour of what is intended.

(Nowhere near replacing the subtrans module, which I envisage as the
final outcome, meaning we don't need ExtendSUBTRANS().)

-- 
Simon Riggs    http://www.EnterpriseDB.com/


rethink_subtrans.v4.patch
Description: Binary data


Documenting when to retry on serialization failure

2021-12-09 Thread Simon Riggs
"Applications using this level must be prepared to retry transactions
due to serialization failures."
...
"When an application receives this error message, it should abort the
current transaction and retry the whole transaction from the
beginning."

I note that the specific error codes this applies to are not
documented, so let's discuss what the docs for that would look like.

I had a conversation with Kevin Grittner about retry some years back
and it seemed clear that the application should re-execute application
logic from the beginning, rather than just slavishly re-execute the
same SQL. But that is not documented either.
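
As a concrete example of the failure being discussed (this is essentially
the write-skew scenario from the existing docs; schema invented, and the
interleaving must be concurrent):

-- Session 1
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(value) FROM mytab WHERE class = 1;
INSERT INTO mytab VALUES (2, 100);

-- Session 2, concurrently
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(value) FROM mytab WHERE class = 2;
INSERT INTO mytab VALUES (1, 100);
COMMIT;

-- Session 1
COMMIT;
-- one of the two transactions is aborted; here the second COMMIT fails:
-- ERROR:  could not serialize access due to read/write dependencies among transactions
-- SQLSTATE 40001 (serialization_failure): retry the whole transaction
-- logic from the beginning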

Is *automatic* retry possible? In all cases? None? Or maybe some?

ISTM that we can't retry anything where a transaction has replied to a
user and then the user issued a subsequent SQL statement, since we
have no idea whether the subsequent SQL was influenced by the initial
reply.

But what about the case of a single statement transaction? Can we just
re-execute then? I guess if it didn't run anything other than
IMMUTABLE functions then it should be OK, assuming the inputs
themselves were immutable, though we have no way for the user to declare that.
Could we allow a user-defined auto_retry parameter?

We don't mention that a transaction might just repeatedly fail either.

Anyway, I know y'all would have some opinions on this. Happy to document
whatever we agree.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Documenting when to retry on serialization failure

2021-12-29 Thread Simon Riggs
On Wed, 29 Dec 2021 at 03:30, Thomas Munro  wrote:
>
> On Fri, Dec 10, 2021 at 1:43 AM Simon Riggs
>  wrote:
> > "Applications using this level must be prepared to retry transactions
> > due to serialization failures."
> > ...
> > "When an application receives this error message, it should abort the
> > current transaction and retry the whole transaction from the
> > beginning."
> >
> > I note that the specific error codes this applies to are not
> > documented, so lets discuss what the docs for that would look like.
>
> +1 for naming the error.

I've tried to sum up the various points from everybody into this doc
patch. Thanks all for replies.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


retryable_error_docs.v1.patch
Description: Binary data


Re: Parameter for planner estimate of recursive queries

2021-12-31 Thread Simon Riggs
On Wed, 27 Oct 2021 at 15:58, Simon Riggs  wrote:

> The poor performance is traced to the planner cost estimates for
> recursive queries. Specifically, the cost of the recursive arm of the
> query is evaluated based upon both of these hardcoded assumptions:
>
> 1. The recursion will last for 10 loops
> 2. The average size of the worktable will be 10x the size of the
> initial query (non-recursive term).
>
> Taken together these assumptions lead to a very poor estimate of the
> worktable activity (in this case), which leads to the plan changing as
> a result.
>
> The factor 10 is a reasonably safe assumption and helps avoid worst
> case behavior in bigger graph queries. However, the factor 10 is way
> too large for many types of graph query, such as where the path
> through the data is tight, and/or the query is written to prune bushy
> graphs, e.g. shortest path queries. The factor 10 should not be
> hardcoded in the planner, but should be settable, just as
> cursor_tuple_fraction is.

If you think this should be derived without parameters, then we would
want a function that starts at 1 for 1 input row and gets much larger
for larger input. The thinking here is that Graph OLTP is often a
shortest path between two nodes, whereas Graph Analytics and so the
worktable will get much bigger.

So I'm thinking we use

rel->tuples = Max(1, cte_rows * cte_rows / 2);
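
For illustration, the kind of tight-path recursive query this is aimed at
(schema invented): the worktable carries roughly one row per loop, so the
hardcoded 10x estimate is far too large, while the formula above still
starts at 1 for a single input row:

WITH RECURSIVE path AS (
    SELECT id, parent_id, 1 AS depth
    FROM nodes
    WHERE id = 42                       -- start node
  UNION ALL
    SELECT n.id, n.parent_id, p.depth + 1
    FROM nodes n
    JOIN path p ON n.id = p.parent_id   -- follow one edge per step
)
SELECT * FROM path;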

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Documenting when to retry on serialization failure

2022-01-04 Thread Simon Riggs
On Thu, 16 Dec 2021 at 06:05, Greg Stark  wrote:

> So a lot of users are probably looking at something like "BEGIN;
> SELECT create_customer_order(); COMMIT" and wondering why the
> server can't handle automatically retrying the query if they get an
> isolation failure.

I agree with you that it would be desirable to retry for the simple
case of an autocommit/single statement transaction run with
default_transaction_isolation = 'serializable'.
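
That is, the case where the client has sent just one statement, so no
intermediate results could have influenced later SQL (the setting is real,
the table is invented):

SET default_transaction_isolation = 'serializable';
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- if this single statement fails with SQLSTATE 40001, the server could in
-- principle simply re-execute it, because no earlier results were sent to
-- the client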

The most important question before we take further action is whether
it would be correct to do so in all cases.

Some problem cases would help us decide either way.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: SKIP LOCKED assert triggered

2022-01-04 Thread Simon Riggs
On Tue, 4 Jan 2022 at 16:15, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > Pushed, thanks Simon for reporting this problem.

And causing another; my bad, apologies.

> Umm ...
>
>Assert(TM_WouldBlock || !(tuple->t_data->t_infomask & 
> HEAP_XMAX_INVALID));
>
> AFAICS, this assertion condition is constant-true,
> because TM_WouldBlock is a nonzero constant.  Perhaps you meant
>
>Assert(result == TM_WouldBlock || !(tuple->t_data->t_infomask & 
> HEAP_XMAX_INVALID));

Yes, I think that's what I meant.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: support for MERGE

2022-01-04 Thread Simon Riggs
On Wed, 22 Dec 2021 at 11:35, Simon Riggs  wrote:
>
> On Mon, 15 Nov 2021 at 22:45, Alvaro Herrera  wrote:
> >
> > On 2021-Nov-15, Alvaro Herrera wrote:
> >
> > > Thanks everyone for the feedback.  I attach a version with the fixes
> > > that were submitted, as well as some additional changes:
> >
> > Attachment failure.
>
> I rebased this, please check.
>
> And 2 patch-on-patches that resolve a few minor questions/docs wording.

Here is another patch-on-patch to fix a syntax error in the MERGE docs.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


merge_fix_docs_into.v1.patch
Description: Binary data


Re: Pluggable toaster

2022-01-05 Thread Simon Riggs
On Thu, 30 Dec 2021 at 16:40, Teodor Sigaev  wrote:

> We are working on custom toaster for JSONB [1], because current TOAST is
> universal for any data type and because of that it has some disadvantages:
> - "one toast fits all"  may be not the best solution for particular
>   type or/and use cases
> - it doesn't know the internal structure of data type, so it  cannot
>   choose an optimal toast strategy
> - it can't  share common parts between different rows and even
>   versions of rows

Agreed, Oleg has made a very clear analysis of the value of having
a higher degree of control over toasting from within the datatype.

In my understanding, we want to be able to
1. Access data from a toasted object one slice at a time, by using
knowledge of the structure
2. If toasted data is updated, then update a minimum number of
slices, without rewriting the existing slices
3. If toasted data is expanded, then allow new slices to be appended to
the object without rewriting the existing slices

> Modification of current toaster for all tasks and cases looks too
> complex, moreover, it  will not works for  custom data types. Postgres
> is an extensible database,  why not to extent its extensibility even
> further, to have pluggable TOAST! We  propose an idea to separate
> toaster from  heap using  toaster API similar to table AM API etc.
> Following patches are applicable over patch in [1]

ISTM that we would want the toast algorithm to be associated with the
datatype, not the column?
Can you explain your thinking?

We already have Expanded toast format, in-memory, which was designed
specifically to allow us to access sub-structure of the datatype
in-memory. So I was expecting to see an Expanded, on-disk, toast
format that roughly matched that concept, since Tom has already shown
us the way. (varatt_expanded). This would be usable by both JSON and
PostGIS.


Some other thoughts:

I imagine the data type might want to keep some kind of dictionary
inside the main toast pointer, so we could make allowance for some
optional datatype-specific private area in the toast pointer itself,
allowing a mix of inline and out-of-line data, and/or a table of
contents to the slices.

I'm thinking we could also tackle these things at the same time:
* We want to expand TOAST to 64-bit pointers, so we can have more
pointers in a table
* We want to avoid putting the data length into the toast pointer, so
we can allow the toasted data to be expanded without rewriting
everything (to avoid O(N^2) cost)

--
Simon Riggs    http://www.EnterpriseDB.com/




Re: Add 64-bit XIDs into PostgreSQL 15

2022-01-05 Thread Simon Riggs
On Thu, 30 Dec 2021 at 13:19, Maxim Orlov  wrote:
>
> Hi, hackers!
>
> Long time wraparound was a really big pain for highly loaded systems. One 
> source of performance degradation is the need to vacuum before every 
> wraparound.
> And there were several proposals to make XIDs 64-bit like [1], [2], [3] and 
> [4] to name a few.

Very good to see this revived.

> PFA updated working patch v6 for PG15 development cycle.
> It is based on a patch by Alexander Korotkov version 5 [5] with a few fixes, 
> refactoring and was rebased to PG15.
>
> Main changes:
> - Change TransactionId to 64bit

This sounds like a good route to me.

> - Disk tuple format (HeapTupleHeader) is unchanged: xmin and xmax
>   remains 32bit
> -- 32bit xid is named ShortTransactionId now.
> -- Exception: see "double xmax" format below.
> - Heap page format is changed to contain xid and multixact base value,
>   tuple's xmin and xmax are offsets from.
> -- xid_base and multi_base are stored as a page special data. PageHeader
>remains unmodified.
> -- If after upgrade page has no free space for special data, tuples are
>converted to "double xmax" format: xmin became virtual
>FrozenTransactionId, xmax occupies the whole 64bit.
>Page converted to new format when vacuum frees enough space.
> - In-memory tuples (HeapTuple) were enriched with copies of xid_base and
>   multi_base from a page.

I think we need more Overview of Behavior than is available with this
patch, perhaps in the form of a README, such as in
src/backend/access/heap/README.HOT.

Most people's comments are about the opportunities and problems this
causes, and mine are too, i.e. they are about the user-visible behavior.
Please explain the various new states that pages can be in and what
their effects are.

My understanding is this would be backwards compatible, so we can
upgrade to it. Please confirm.

Thanks

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Logical insert/update/delete WAL records for custom table AMs

2022-01-05 Thread Simon Riggs
On Wed, 10 Nov 2021 at 03:17, Amit Kapila  wrote:
>
> On Tue, Nov 9, 2021 at 5:12 AM Jeff Davis  wrote:
> >
> > On Fri, 2021-11-05 at 10:00 +0530, Amit Kapila wrote:
> > > I am not talking about decoding plugin but rather decoding itself,
> > > basically, the work we do in reorderbuffer.c, decode.c, etc. The two
> > > things I remember were tuple format and transaction ids as mentioned
> > > in my previous email.
> >
> > If it's difficult to come up with something that will work for all
> > table AMs, then it suggests that we might want to go towards fully
> > extensible rmgr, and have a decoding method in the rmgr.
> >
> > I started a thread (with a patch) here:
> >
> >
> > https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.ca...@j-davis.com
> >
> > > I think we should try to find a solution for
> > > tuple format as the current decoding code relies on it if we want
> > > decoding to deal with another table AMs transparently.
> >
> > Agreed, but it seems like we need basic extensibility first. For now,
> > we'll need to convert to a heap tuple,
> >
>
> Okay, but that might have a cost because we need to convert it before
> WAL logging it, and then we probably also need to convert back to the
> original table AM format during recovery/standby apply.

I spoke with Jeff in detail about this patch in NYC Dec 21, and I now
think it is worth pursuing. It seems much more likely that this would
be acceptable than fully extensible rmgr.

Amit asks a good question: should we be writing a message that seems
to presume the old heap tuple format? My answer is that we clearly
need it to be written in *some* common format, and the current heap
format works, so de facto it is the format we should use.
Right now, TAMs have to reformat back into this same format, so it is
the way the APIs work. Put it another way: I don't see any other
format that makes sense to use, now, but that could change in the
future.

So I'm signing up as a reviewer and we'll see if this is good to go.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Collecting statistics about contents of JSONB columns

2022-01-05 Thread Simon Riggs
On Fri, 31 Dec 2021 at 22:07, Tomas Vondra
 wrote:

> The patch does something far more
> elegant - it simply uses stavalues to store an array of JSONB documents,
> each describing stats for one path extracted from the sampled documents.

Sounds good.

> I'm sure there's plenty open questions - for example I think we'll need
> some logic to decide which paths to keep, otherwise the statistics can
> get quite big, if we're dealing with large / variable documents. We're
> already doing something similar for MCV lists.
>
> One of Nikita's patches not included in this thread allow "selective"
> statistics, where you can define in advance a "filter" restricting which
> parts are considered interesting by ANALYZE. That's interesting, but I
> think we need some simple MCV-like heuristics first anyway.
>
> Another open question is how deep the stats should be. Imagine documents
> like this:
>
>{"a" : {"b" : {"c" : {"d" : ...
>
> The current patch build stats for all possible paths:
>
>   "a"
>   "a.b"
>   "a.b.c"
>   "a.b.c.d"
>
> and of course many of the paths will often have JSONB documents as
> values, not simple scalar values. I wonder if we should limit the depth
> somehow, and maybe build stats only for scalar values.

The user interface for designing filters sounds hard, so I'd hope we
can ignore that for now.

--
Simon Riggs    http://www.EnterpriseDB.com/




Re: Add 64-bit XIDs into PostgreSQL 15

2022-01-06 Thread Simon Riggs
On Thu, 30 Dec 2021 at 13:19, Maxim Orlov  wrote:

> Your opinions are very much welcome!

This is a review of the Int64 options patch,
"v6-0001-Add-64-bit-GUCs-for-xids.patch"

Applies cleanly, with some fuzz, compiles cleanly and passes make check.
Patch eyeballs OK, no obvious defects.
Tested using the attached test, so seems to work correctly.
On review of docs, no additions or changes required.
Perhaps add something to README? If so, minor doc patch attached.

Otherwise, this sub-patch is READY FOR COMMITTER.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


int64opts_test.patch
Description: Binary data


int64_minor_README.v1.patch
Description: Binary data


Re: Logical insert/update/delete WAL records for custom table AMs

2022-01-06 Thread Simon Riggs
On Sun, 31 Oct 2021 at 18:10, Jeff Davis  wrote:

> I'm looking for some review on the approach and structure before I
> polish and test it.

Repurposing the logical msg rmgr into a general purpose logical rmgr
seems right.

Structure looks obvious, which is good.

Please pursue this and I will review with you as you go.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Add 64-bit XIDs into PostgreSQL 15

2022-01-06 Thread Simon Riggs
On Thu, 6 Jan 2022 at 13:15, Finnerty, Jim  wrote:
>
> (Maxim) Re: -- If after upgrade page has no free space for special data, 
> tuples are
>converted to "double xmax" format: xmin became virtual
>FrozenTransactionId, xmax occupies the whole 64bit.
>Page converted to new format when vacuum frees enough space.
>
> A better way would be to prepare the database for conversion to the 64-bit 
> XID format before the upgrade so that it ensures that every page has enough 
> room for the two new epochs (E bits).

Most software has a one-stage upgrade model. What you propose would
have us install 2 things, with a step in-between, which makes it
harder to manage.

> 1. Enforce the rule that no INSERT or UPDATE to an existing page will leave 
> less than E bits of free space on a heap page
>
> 2. Run an online and restartable task, analogous to pg_repack, that rewrites 
> and splits any page that has less than E bits of free space. This needs to be 
> run on all non-temp tables in all schemas in all databases.  DDL operations 
> are not allowed on a target table while this operation runs, which is 
> enforced by taking an ACCESS SHARE lock on each table while the process is 
> running. To mitigate the effects of this restriction, the restartable task 
> can be restricted to run only in certain hours.  This could be implemented as 
> a background maintenance task that runs for X hours as of a certain time of 
> day and then kicks itself off again in 24-X hours, logging its progress.
>
> When this task completes, the database is ready for upgrade to 64-bit XIDs, 
> and there is no possibility that any page has insufficient free space for the 
> special data.
>
> Would you agree that this approach would completely eliminate the need for a 
> "double xmax" representation?

I agree about the idea for scanning existing data blocks, but why not
do this AFTER upgrade?

1. Upgrade, with the important aspect not enabled yet, but everything else
working - all required software is delivered in one shot, with a fast
upgrade
2. As each table is VACUUMed, we confirm/clean/groom data blocks so
each table is individually confirmed as being ready. The pace that
this happens at is under user control.
3. When all tables have been prepared, then restart to allow xid64 format usage

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Add 64-bit XIDs into PostgreSQL 15

2022-01-07 Thread Simon Riggs
On Fri, 7 Jan 2022 at 16:09, Justin Pryzby  wrote:
>
> On Fri, Jan 07, 2022 at 03:53:51PM +, Finnerty, Jim wrote:
> > I'd still like a plan to retire the "double xmax" representation 
> > eventually.  Previously I suggested that this could be done as a 
> > post-process, before upgrade is complete, but that could potentially make 
> > upgrade very slow.
> >
> > Another way to retire the "double xmax" representation eventually could be 
> > to disallow "double xmax" pages in subsequent major version upgrades (e.g. 
> > to PG16, if "double xmax" pages are introduced in PG15).  This gives the 
> > luxury of time after a fast upgrade to convert all pages to contain the 
> > epochs, while still providing a path to more maintainable code in the 
> > future.
>
> Yes, but how are you planning to rewrite it?  Is vacuum enough?

Probably not, but VACUUM is the place to add such code.

> I suppose it'd need FREEZE + DISABLE_PAGE_SKIPPING ?

Yes

> This would preclude upgrading "across" v15.  Maybe that'd be okay, but it'd be
> a new and atypical restriction.

I don't see that restriction. Anyone upgrading from before PG15 would
apply the transform. Just because we introduce a transform in PG15
doesn't mean we can't apply that transform in later releases as well,
to allow say PG14 -> PG16.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Add 64-bit XIDs into PostgreSQL 15

2022-01-12 Thread Simon Riggs
On Sat, 8 Jan 2022 at 08:21, Maxim Orlov  wrote:
>>
>>Perhaps we can merge some of the code cleanup that it contained, such as 
>> using XID_FMT everywhere and creating a type for the kind of page returned 
>> by TransactionIdToPage() to make the code cleaner.
>
>
> Agree, I think this is a good idea.

Looks to me like the best next actions would be:

1. Submit a patch that uses XID_FMT everywhere, as a cosmetic change.
This looks like it will reduce the main patch size considerably and
make it much less scary. That can be cleaned up and committed while we
discuss the main approach.

2. Write up the approach in a detailed README, so people can
understand the proposal and assess if there are problems. A few short
notes and a link back to old conversations isn't enough to allow wide
review and give confidence on such a major patch.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2021-03-17 Thread Simon Riggs
On Fri, 12 Mar 2021 at 22:16, Tomas Vondra
 wrote:
>
> On 1/28/21 2:33 PM, Simon Riggs wrote:
> > On Thu, 28 Jan 2021 at 12:53, Masahiko Sawada  wrote:
> >
> >> This entry has been "Waiting on Author" status and the patch has not
> >> been updated since Nov 30. Are you still planning to work on this?
> >
> > Yes, new patch version tomorrow. Thanks for the nudge and the review.
> >
>
> So, is it tomorrow already? ;-)

Been a long day. ;-)

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2021-04-08 Thread Simon Riggs
On Thu, 8 Apr 2021 at 16:23, David Steele  wrote:
>
> On 3/17/21 4:50 PM, Simon Riggs wrote:
> > On Fri, 12 Mar 2021 at 22:16, Tomas Vondra
> >  wrote:
> >>
> >> On 1/28/21 2:33 PM, Simon Riggs wrote:
> >>> On Thu, 28 Jan 2021 at 12:53, Masahiko Sawada  
> >>> wrote:
> >>>
> >>>> This entry has been "Waiting on Author" status and the patch has not
> >>>> been updated since Nov 30. Are you still planning to work on this?
> >>>
> >>> Yes, new patch version tomorrow. Thanks for the nudge and the review.
> >>>
> >>
> >> So, is it tomorrow already? ;-)
> >
> > Been a long day. ;-)
>
> It has been five months since this patch was updated, so marking
> Returned with Feedback.
>
> Please resubmit to the next CF when you have a new patch.

There are 2 separate patch-sets on this thread, with separate CF entries.

One has requested changes that have not been made. I presume this one
has been RwF'd.

What is happening about the other patch?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2021-04-08 Thread Simon Riggs
On Thu, 8 Apr 2021 at 16:58, David Steele  wrote:

> >> It has been five months since this patch was updated, so marking
> >> Returned with Feedback.
> >>
> >> Please resubmit to the next CF when you have a new patch.
> >
> > There are 2 separate patch-sets on this thread, with separate CF entries.
> >
> > One has requested changes that have not been made. I presume this one
> > has been RwF'd.
> >
> > What is happening about the other patch?
>
> Hmmm, well https://commitfest.postgresql.org/32/2908 and
> https://commitfest.postgresql.org/32/2909 are both linked to the same
> thread with the same patch, so I thought it was a duplicate.

Nobody had mentioned any confusion. Having multiple patches on one thread is common.

> It's not clear to me which patch is which, so perhaps move one CF entry
> to next CF and clarify which patch is current?

Entry: Maximize page freezing
has this patch, which is perfectly fine and has been awaiting review from
Alvaro since 29 Nov: one_freeze_then_max_freeze.v7.patch

The other patch is awaiting changes.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2021-04-08 Thread Simon Riggs
On Thu, 8 Apr 2021 at 17:44, David Steele  wrote:
>
> On 4/8/21 12:29 PM, Simon Riggs wrote:
> > On Thu, 8 Apr 2021 at 16:58, David Steele  wrote:
> >
> >>>> It has been five months since this patch was updated, so marking
> >>>> Returned with Feedback.
> >>>>
> >>>> Please resubmit to the next CF when you have a new patch.
> >>>
> >>> There are 2 separate patch-sets on this thread, with separate CF entries.
> >>>
> >>> One has requested changes that have not been made. I presume this one
> >>> has been RwF'd.
> >>>
> >>> What is happening about the other patch?
> >>
> >> Hmmm, well https://commitfest.postgresql.org/32/2908 and
> >> https://commitfest.postgresql.org/32/2909 are both linked to the same
> >> thread with the same patch, so I thought it was a duplicate.
> >
> > Nobody had mentioned any confusion. Multiple patches on one thread is 
> > common.
>
> Yes, but these days they generally get updated in the same reply so the
> cfbot can test them all. In your case only the latest patch was being
> tested and it was not applying cleanly.
>
> >> It's not clear to me which patch is which, so perhaps move one CF entry
> >> to next CF and clarify which patch is current?
> >
> > Entry: Maximize page freezing
> > has this patch, perfectly fine, awaiting review from Alvaro since 29 Nov.
> > one_freeze_then_max_freeze.v7.patch
>
> That CF entry was marked Waiting for Author on 3/11, so I thought it was
> for this patch. In fact, both CF entries were Waiting for Author for
> some time.

That was done by someone who hadn't mentioned it to me, or to the list.

It is not Waiting on Author.

> Moved to the next CF in Waiting for Author state.

That is clearly an incorrect state.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2021-04-08 Thread Simon Riggs
On Thu, 8 Apr 2021 at 18:15, Alvaro Herrera  wrote:
>
> On 2021-Apr-08, Simon Riggs wrote:
>
> > On Thu, 8 Apr 2021 at 16:58, David Steele  wrote:
>
> > > It's not clear to me which patch is which, so perhaps move one CF entry
> > > to next CF and clarify which patch is current?
> >
> > Entry: Maximize page freezing
> > has this patch, perfectly fine, awaiting review from Alvaro since 29 Nov.
> > one_freeze_then_max_freeze.v7.patch
>
> Oh dear, was this waiting on me?  It was not in my to-do, so I didn't
> pay close attention.

No problem. If I felt the idea deserved priority attention, I would
have pinged you.

I think I'll open a new thread later to allow it to be discussed
without confusion.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-08-06 Thread Simon Riggs
On Thu, 6 Aug 2020 at 02:07, Andres Freund  wrote:

>
> On 2020-08-05 19:55:49 -0400, Alvaro Herrera wrote:
> > ... which means the flag I had added two days earlier has never been
> > used for anything.  We've carried the flag forward to this day for
> > almost 13 years, dutifully turning it on and off ... but never checking
> > it anywhere.
> >
> > I propose to remove it, as in the attached patch.
>
> I'm mildly against that, because I'd really like to start making use of
> the flag. Not so much for cancellations, but to avoid the drastic impact
> analyze has on bloat.  In OLTP workloads with big tables, and without
> disabled cost limiting for analyze (or slow IO), the snapshot that
> analyze holds is often by far the transaction with the oldest xmin.
>
> It's not entirely trivial to fix (just ignoring it could lead to
> detoasting issues), but also not that.
>
> Only mildly against because it'd not be hard to reintroduce once we need
> it.
>

Good points, both.

The most obvious way to avoid long analyze snapshots is to make the
analysis take multiple snapshots as it runs, rather than try to invent some
clever way of ignoring the analyze snapshots (which, as Alvaro points out,
we never did). All we need to do is have an analyze snapshot last for at
most N rows, but keep scanning until we have the desired sample size. Doing
that would mean the analyze sample wouldn't come from a single snapshot,
but then who cares? There is no requirement for consistency - the sample
would arguably be *more* stable because it comes from multiple points in
time, not just one.
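
To make that concrete, here is a rough sketch of the idea (illustrative
only - the constant and the row-fetcher function are invented names, not
actual analyze.c code): refresh the active snapshot every N sampled rows,
so ANALYZE never pins one old xmin for the whole scan.

#include "postgres.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

#define ROWS_PER_SNAPSHOT 100000            /* the "N rows" above; value is illustrative */

extern bool analyze_next_sample_row(Relation rel);  /* hypothetical row fetcher */

static void
analyze_sample_with_rolling_snapshots(Relation rel, int targrows)
{
    int         sampled = 0;

    PushActiveSnapshot(GetTransactionSnapshot());

    while (sampled < targrows)
    {
        /* let each snapshot last for at most ROWS_PER_SNAPSHOT rows */
        if (sampled > 0 && sampled % ROWS_PER_SNAPSHOT == 0)
        {
            PopActiveSnapshot();
            PushActiveSnapshot(GetTransactionSnapshot());
        }

        if (!analyze_next_sample_row(rel))  /* stop when the scan runs out of rows */
            break;
        sampled++;
    }

    PopActiveSnapshot();
}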

-- 
Simon Riggs    http://www.2ndQuadrant.com/
Mission Critical Databases


Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-08-06 Thread Simon Riggs
On Thu, 6 Aug 2020 at 22:35, Tom Lane  wrote:

> Robert Haas  writes:
> > ... how
> > important is stability to ANALYZE? If you *either* retake your MVCC
> > snapshots periodically as you re-scan the table *or* use a non-MVCC
> > snapshot for the scan, you can get those same kinds of artifacts: you
> > might see two copies of a just-updated row, or none. Maybe this would
> > actually *break* something - e.g. could there be code that would get
> > confused if we sample multiple rows for the same value in a column
> > that has a UNIQUE index? But I think mostly the consequences would be
> > that you might get somewhat different results from the statistics.
>
> Yeah, that's an excellent point.  I can imagine somebody complaining
> "this query clearly matches a unique index, why is the planner estimating
> multiple rows out?".  But most of the time it wouldn't matter much.
> (And I think you can get cases like that anyway today.)
>
> > It's not clear to me that it would even be correct to categorize those
> > somewhat-different results as "less accurate."
>
> Estimating two rows where the correct answer is one row is clearly
> "less accurate".  But I suspect you'd have to be quite unlucky to
> get such a result in practice from Simon's proposal, as long as we
> weren't super-aggressive about changing ANALYZE's snapshot a lot.
>

Seems like we're agreed we can use more than one snapshot; the only
discussion is "how many?"

The more you take, the more weirdness you will see, so adopting an approach
of one-snapshot-per-row seems like the worst case for accuracy, even if it
does make analyze faster.

(If we do want to speed up ANALYZE, we should use the system block sampling
approach, but the argument against that is less independence of rows.)

Keeping the discussion on reducing the impact of Bernoulli-sampled ANALYZE, I
was imagining we would do one snapshot for each block of rows sized by the
default statistics_target, so that default behavior would be unaffected. Larger
settings would be chunked according to the default, so
stats_target=10k (the max) would take 10000/100 = 100 snapshots. That approach
allows people to vary the behavior using an existing parameter if needed.
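
As a rough sketch of that arithmetic (my reading of the proposal, not code
from any patch), the number of snapshots would simply scale with the ratio
of the chosen target to the default:

#include "postgres.h"

#define DEFAULT_STATS_TARGET 100    /* assumed chunk size: the default statistics_target */

static int
analyze_snapshot_count(int stats_target)
{
    /* one snapshot per default-sized block of the sample */
    int         nsnapshots = stats_target / DEFAULT_STATS_TARGET;

    return Max(1, nsnapshots);      /* default target => one snapshot, behavior unchanged */
}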

-- 
Simon Riggs    http://www.2ndQuadrant.com/
Mission Critical Databases


Re: massive FPI_FOR_HINT load after promote

2020-08-14 Thread Simon Riggs
On Mon, 10 Aug 2020 at 23:56, Alvaro Herrera  wrote:
>
> The problem was simply that when a page is
> examined by a seqscan, we do HeapTupleSatisfiesVisibility of each tuple
> in isolation; and for each tuple we call SetHintBits().  And only the
> first time the FPI happens; by the time we get to the second tuple, the
> page is already dirty, so there's no need to emit an FPI.  But the FPI
> we sent only had the bit on the first tuple ... so the standby will not
> have the bit set for any subsequent tuple.  And on promotion, the
> standby will have to have the bits set for all those tuples, unless you
> happened to dirty the page again later for other reasons.

Which probably means that pg_rewind is broken because it won't be able
to rewind correctly.

> One simple idea to try to forestall this problem would be to modify the
> algorithm so that all tuples are scanned and hinted if the page is going
> to be dirtied -- then send a single FPI setting bits for all tuples,
> instead of just on the first tuple.

This would make latency much worse for non-seqscan cases.

Certainly for seqscans it would make sense to emit a message that sets
all tuples at once, or possibly emit an FPI and then follow that with
a second message that sets all other hints on the page.
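
For illustration, a sketch of the "hint the whole page in one go" variant
being discussed (the helper name is hypothetical; this is not existing
heapam code): walk all line pointers and set whatever hint bits can be set
before the page is first dirtied, so the single FPI carries hints for every
tuple rather than just the first. Per the latency concern above, a caller
would presumably only do this on a seqscan path.

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/* hypothetical: resolve commit status and set XMIN/XMAX hint bits where cheap to do so */
extern void set_tuple_hints_if_possible(HeapTupleHeader htup, Buffer buffer);

static void
heap_page_prehint(Buffer buffer)
{
    Page        page = BufferGetPage(buffer);
    OffsetNumber offnum,
                maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
    {
        ItemId      lp = PageGetItemId(page, offnum);

        if (!ItemIdIsNormal(lp))
            continue;

        set_tuple_hints_if_possible((HeapTupleHeader) PageGetItem(page, lp), buffer);
    }
}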

-- 
Simon Riggs    http://www.2ndQuadrant.com/
Mission Critical Databases




Re: Next Steps with Hash Indexes

2021-10-13 Thread Simon Riggs
 work well for *all*
cases before we allow multiple columns for some cases. The user will
already get to compare btree-vs-hash before they use them in a
particular use case. The perfect should not be the enemy of the good.

Storing multiple hashes uses more space and is more complex. It
doesn't feel like a good solution to me, given the purpose of an index
is not completeness but optimality.
Storing 2 4-byte hashes uses 20% more space than one 4-byte hash.
Storing 2 8-byte hashes uses 40% more space than one 4-byte hash.

> > IMHO, multi-column hash collisions are a secondary issue, given that
> > we can already select the column order for an index and hash indexes
> > would only be used by explicit user choice. If there are some minor
> > sub-optimal aspects of using hash indexes, then btree was already the
> > default and a great choice for many cases.
> >
>
> But we can't select arbitrary column order, because the first column is
> used to select the bucket. Which means it's also about what columns are
> used by the queries. If the query is not using the first column, it
> can't use the index.

Neither approach works in that case, so it is moot, i.e. you cannot
use a first-column hash, nor an all-column hash.


--
Simon Riggs    http://www.EnterpriseDB.com/




Re: Next Steps with Hash Indexes

2021-10-14 Thread Simon Riggs
On Wed, 13 Oct 2021 at 20:16, Peter Geoghegan  wrote:
>
> On Wed, Oct 13, 2021 at 3:44 AM Simon Riggs
>  wrote:
> > > IMO it'd be nice to show some numbers to support the claims that storing
> > > the extra hashes and/or 8B hashes is not worth it ...
> >
> > Using an 8-byte hash is possible, but only becomes effective when
> > 4-byte hash collisions get hard to manage. 8-byte hash also makes the
> > index 20% bigger, so it is not a good default.
>
> Are you sure? I know that nbtree index tuples for a single-column int8
> index are exactly the same size as those from a single column int4
> index, due to alignment overhead at the tuple level. So my guess is
> that hash index tuples (which use the same basic IndexTuple
> representation) work in the same way.

The hash index tuples are 20 bytes each. If that were rounded up to
8-byte alignment, they would be 24 bytes.

Using pageinspect, the max(live_items) on any data page (bucket or
overflow) is 407 items, so they can't be 24 bytes long.
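
A quick back-of-the-envelope check of that figure, using the page-layout
sizes I believe apply here (8 kB block, 24-byte page header, 16-byte
MAXALIGN'd hash special space, 4-byte line pointer per item):

#include <stdio.h>

int
main(void)
{
    const int   blcksz = 8192;      /* BLCKSZ */
    const int   page_header = 24;   /* SizeOfPageHeaderData */
    const int   hash_special = 16;  /* MAXALIGN'd HashPageOpaqueData */
    const int   index_tuple = 16;   /* 8-byte IndexTuple header + 4-byte hash, MAXALIGN'd */
    const int   line_pointer = 4;

    /* (8192 - 24 - 16) / (16 + 4) = 407, matching the pageinspect observation */
    printf("max items per page = %d\n",
           (blcksz - page_header - hash_special) / (index_tuple + line_pointer));
    return 0;
}

That is consistent with 20 bytes per item, i.e. no rounding up to 24.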


Other stats of interest would be that the current bucket design/page
splitting is very effective at maintaining distribution. On a hash
index for a table with 2 billion rows in it, with integer values from
1 to 2billion, there are 3670016 bucket pages and 524286 overflow
pages, distributed so that 87.5% of buckets have no overflow pages,
and 12.5% of buckets have only one overflow page; there are no buckets
with >1 overflow page. The most heavily populated overflow page has
209 items.

The CREATE INDEX time is fairly poor at present. That can be optimized
easily enough, but I expect to do it after uniqueness is added, since
doing that work in a different order would complicate the code.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Next Steps with Hash Indexes

2021-10-17 Thread Simon Riggs
On Thu, 14 Oct 2021 at 16:09, Peter Geoghegan  wrote:
>
> On Thu, Oct 14, 2021 at 12:48 AM Simon Riggs
>  wrote:
> > The hash index tuples are 20-bytes each. If that were rounded up to
> > 8-byte alignment, then that would be 24 bytes.
> >
> > Using pageinspect, the max(live_items) on any data page (bucket or
> > overflow) is 407 items, so they can't be 24 bytes long.
>
> That's the same as an nbtree page, which confirms my suspicion. The 20
> bytes consists of a 16 byte tuple, plus a 4 byte line pointer. The
> tuple-level alignment overhead gets you from 12 bytes to 16 bytes with
> a single int4 column. So the padding is there for the taking.

Thank you for nudging me to review the tuple length.

Since hash indexes never store Nulls, and the hash is always fixed
length, ISTM that we can compress the hash index entries down to
ItemPointerData (6 bytes) plus any hashes.

That doesn't change any arguments about size differences between
approaches, but we can significantly reduce index size (by up to 50%).
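
For illustration, the compact entry would conceptually be just this
(illustrative struct only, not the on-disk IndexTuple format or anything
from a patch):

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct CompactHashEntry
{
    ItemPointerData heapptr;    /* 6 bytes: heap TID; hash indexes never store NULLs */
    uint32          hashvalue;  /* 4 bytes: fixed-length hash code */
} CompactHashEntry;
/*
 * 10 bytes of payload, versus the 16-byte MAXALIGN'd IndexTuple used today.
 * (A C compiler would pad this struct to 12 bytes unless packed; an on-disk
 * format could store the fields unpadded.)
 */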

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Parameter for planner estimate of recursive queries

2021-10-27 Thread Simon Riggs
I've been investigating the poor performance of a WITH RECURSIVE
query, which I've recreated with test data.

The first thing was to re-write the query, which helped improve
performance by about 30%, but the plan was still very bad. With a
small patch I've been able to improve performance by about x100.

The poor performance is traced to the planner cost estimates for
recursive queries. Specifically, the cost of the recursive arm of the
query is evaluated based upon both of these hardcoded assumptions:

1. The recursion will last for 10 loops
2. The average size of the worktable will be 10x the size of the
initial query (non-recursive term).

Taken together these assumptions lead to a very poor estimate of the
worktable activity (in this case), which leads to the plan changing as
a result.

The factor 10 is a reasonably safe assumption and helps avoid worst
case behavior in bigger graph queries. However, the factor 10 is way
too large for many types of graph query, such as where the path
through the data is tight, and/or the query is written to prune bushy
graphs, e.g. shortest path queries. The factor 10 should not be
hardcoded in the planner, but should be settable, just as
cursor_tuple_fraction is.

I've written a short patch to make the estimate of the avg size of the
worktable configurable:

  recursive_worktable_estimate = N (default 10)
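
To show roughly where this plugs in, here is a simplified sketch with
invented names - the real planner code that costs recursive unions and
worktable scans is more involved. Only the worktable-size multiplier
becomes settable; the 10-loop assumption stays hardcoded:

/* proposed GUC; the default keeps today's behavior */
double      recursive_worktable_estimate = 10.0;

static void
estimate_recursive_union(double nonrec_rows, double rec_rows_per_loop,
                         double *total_rows, double *worktable_rows)
{
    /* assumption 1 (unchanged): the recursive term runs ~10 times */
    *total_rows = nonrec_rows + 10.0 * rec_rows_per_loop;

    /* assumption 2: average worktable size, previously a hardcoded 10 x nonrec_rows */
    *worktable_rows = recursive_worktable_estimate * nonrec_rows;
}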

Using this parameter with the test query results in a consistently
repeatable ~100x gain in performance, using
recursive_worktable_estimate = 1 for a shortest path query:

Unpatched: 1775ms
Patched: 17.2ms

This is because the estimated size of the worktable is closer to the
truth and so leads naturally to a more sensible plan. EXPLAINs
attached - please look at the estimated rows for the WorkTable Scan.

There are various options for setting the two estimates: just one or the
other, both values separately, or both together. Note that I haven't
touched the estimate that the recursion will last for 10 loops; I
figured that people would object to two knobs.

Thoughts?


There are two other ways to speed up the query. One is to set
enable_seqscan = off, which helps by about 20% but may have other
consequences. The other is to set work_mem = 512MB, so that the poor
(hash) plan runs as fast as possible, but that is far from optimal either.
Getting the right plan is about 20x faster than either of those
alternatives.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


recursive_worktable_estimate.v1.patch
Description: Binary data


graph_query_explains.out
Description: Binary data


Re: PITR: enhance getRecordTimestamp()

2021-11-03 Thread Simon Riggs
On Wed, 3 Nov 2021 at 13:28, Daniel Gustafsson  wrote:
>
> > On 30 Jun 2021, at 11:59, Simon Riggs  wrote:
>
> > For PITR, getRecordTimestamp() did not include all record types that
> > contain times.
> > Add handling for checkpoints, end of recovery and prepared xact record 
> > types.
>
> + 
> This breaks doc compilation, and looks like a stray tag as you want this entry
> in the currently open variablelist?

Thanks. Fixed and rebased.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


pitr_enhance_getRecordTimestamp.v2.patch
Description: Binary data


Re: lastOverflowedXid does not handle transaction ID wraparound

2021-11-03 Thread Simon Riggs
On Thu, 21 Oct 2021 at 05:01, Kyotaro Horiguchi  wrote:
>
> At Wed, 20 Oct 2021 08:55:12 -0700, Stan Hu  wrote in
> > On Wed, Oct 20, 2021 at 4:00 AM Andrey Borodin  wrote:
> >
> > >
> > >
> > > > 17 окт. 2021 г., в 21:55, Dmitry Dolgov <9erthali...@gmail.com>
> > > написал(а):
> > > > I wonder what would be side
> > > > effects of clearing it when the snapshot is not suboverfloved anymore?
> > >
> > > I think we should just invalidate lastOverflowedXid on every
> > > XLOG_RUNNING_XACTS if subxid_overflow == false. I can't find a reason not
> > > to do so.

I believe that to be an incorrect fix, though it is very nearly correct.
There is a documented race condition in the generation of an
XLOG_RUNNING_XACTS record that means there could be a new overflow event
after the snapshot was taken but before it was logged.

> > On a replica, I think it's possible for lastOverflowedXid to be set even if
> > subxid_overflow is false on the primary and secondary (
> > https://github.com/postgres/postgres/blob/dc899146dbf0e1d23fb24155a5155826ddce34c9/src/backend/storage/ipc/procarray.c#L1327).
> > I thought subxid_overflow only gets set if there are more than
> > PGPROC_MAX_CACHED_SUBXIDS (64) used in a given transaction.
> >
> > Should the replica be invalidating lastOverflowedXid if subxcnt goes to
> > zero in XLOG_RUNNING_XACTS? But if there's an outstanding snapshot with an
> > xmin that precedes lastOverflowedXid we might violate MVCC if we invalidate
> > this, so I wonder if we also need to check the snapshot with the lowest
> > xmin?
>
> lastOverflowedXid is the smallest subxid that possibly exists but
> possiblly not known to the standby. So if all top-level transactions
> older than lastOverflowedXid end, that means that all the
> subtransactions in doubt are known to have been ended.

Agreed

> XLOG_RUNNING_XACTS reports oldestRunningXid, which is the oldest
> running top-transaction.  Standby expires xids in KnownAssignedXids
> array that precede to the oldestRunningXid.  We are sure that all
> possiblly-overflown subtransactions are gone as well if the oldest xid
> is newer than the first overflowed subtransaction.

Agreed

> As a cross check, the following existing code in GetSnapshotData means
> that no overflow is not happening if the smallest xid in the known
> assigned list is larger than lastOverflowedXid, which agrees to the
> consideration above.
>
> procaray.c:2428
> >   subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, 
> > &xmin,
> > 
> > xmax);
> >
> >   if (TransactionIdPrecedesOrEquals(xmin, 
> > procArray->lastOverflowedXid))
> >   suboverflowed = true;
>
>
> If the discussion so far is correct, the following diff will fix the
> issue.
>
> diff --git a/src/backend/storage/ipc/procarray.c 
> b/src/backend/storage/ipc/procarray.c
> index bd3c7a47fe..19682b73ec 100644
> --- a/src/backend/storage/ipc/procarray.c
> +++ b/src/backend/storage/ipc/procarray.c
> @@ -4463,6 +4463,12 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
>  {
> LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> KnownAssignedXidsRemovePreceding(xid);
> +   /*
> +* reset lastOverflowedXid if we know transactions that have been 
> possiblly
> +* running are being gone.
> +*/
> +   if (TransactionIdPrecedes(procArray->lastOverflowedXid, xid))
> +   procArray->lastOverflowedXid = InvalidTransactionId;
> LWLockRelease(ProcArrayLock);
>  }

So I agree with this fix.

It is, however, an undocumented modularity violation. I think that is
acceptable because of the ProcArrayLock traffic, but it needs a
comment to explain this at the call to
ExpireOldKnownAssignedTransactionIds(), i.e. "and potentially reset
lastOverflowedXid", as well as a comment on the
ExpireOldKnownAssignedTransactionIds() function itself.
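
For instance, a sketch of the kind of function comment meant here (wording
is illustrative only, not proposed committed text):

/*
 * ExpireOldKnownAssignedTransactionIds
 *		Remove KnownAssignedXids entries preceding the given xid, and
 *		potentially reset lastOverflowedXid: once every top-level xact
 *		older than lastOverflowedXid has ended, no overflowed subxid can
 *		still be running, so snapshots no longer need to be marked
 *		suboverflowed on its account.
 */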

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: lastOverflowedXid does not handle transaction ID wraparound

2021-11-04 Thread Simon Riggs
On Wed, 3 Nov 2021 at 22:07, Alexander Korotkov  wrote:
>
> Hi!
>
> On Wed, Nov 3, 2021 at 8:51 PM Simon Riggs  
> wrote:
> > It is however, an undocumented modularity violation. I think that is
> > acceptable because of the ProcArrayLock traffic, but needs to have a
> > comment to explain this at the call to
> > ExpireOldKnownAssignedTransactionIds() i.e. " and potentially reset
> > lastOverflowedXid", as well as a comment on the
> > ExpireOldKnownAssignedTransactionIds() function.
>
> Thank you for your feedback.  Please find the revised patch attached.
> It incorporates this function comment changes altogether with minor
> editings and commit message. Let me know if you have further
> suggestions.
>
> I'm going to push this if no objections.

Looks good, go for it.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




SKIP LOCKED assert triggered

2021-11-12 Thread Simon Riggs
The combination of these two statements in a transaction hits an
Assert in heapam.c at line 4770 on REL_14_STABLE

BEGIN;
SELECT * FROM queue LIMIT 1 FOR UPDATE SKIP LOCKED;
...
UPDATE queue SET status = 'UPDATED' WHERE id = :id;
COMMIT;

pgbench reliably finds this, running from inside a PL/pgSQL function.

Alvaro suggests we just ignore the Assert, which on testing appears to
be the right approach. Patch attached, which solves it for me.

There is no formal test that does lock-then-update, so I have
attempted to create one, but it has not successfully reproduced the
issue (attached anyway); the test code path is different from the pgbench test.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


skip_locked_assert.v1.patch
Description: Binary data


skip_locked_then_update_test.v1.patch
Description: Binary data


Re: WIP: System Versioned Temporal Table

2021-11-15 Thread Simon Riggs
On Mon, 15 Nov 2021 at 09:47, Daniel Gustafsson  wrote:
>
> > On 19 Sep 2021, at 20:32, Simon Riggs  wrote:
>
> > My preferred approach would be to do this "for free" in the table
> > access method, but we're a long way from this in terms of actual
> > implementation. When Corey  suggested earlier that we just put the
> > syntax in there, this was the direction I was thinking.
> >
> > After waiting a day since I wrote the above, I think we should go with
> > (2) as Corey suggests, at least for now, and we can always add (3)
> > later.
>
> This patch no longer applies, are there plans on implementing the approaches
> discussed above, or should we close this entry and open a new one when a
> freshly baked pathed is ready?

As I mentioned upthread, there are at least two different ways forward,
(1) and (2), both of which need further work. I don't think that
additional work is impossible, but it is weeks of work, not days, and
it needs to be done in collaboration with other thoughts on the other
threads Corey refers to.

I have no plans on taking this patch further, but will give some help
to anyone that wishes to do that.

I suggest we Return with Feedback.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: increase size of pg_commit_ts buffers

2021-11-30 Thread Simon Riggs
On Fri, 12 Nov 2021 at 17:39, Tomas Vondra
 wrote:

> So +1 to just get this committed, as it is.

This is a real issue affecting Postgres users. Please commit this soon.

-- 
Simon Riggs    http://www.EnterpriseDB.com/



