RE: get_database_name() from background worker

2019-12-12 Thread ROS Didier
Hi,
   With the pg_background extension, it is possible to have an "autonomous 
transaction", which means being able to commit inside a transaction.
 It works like a client connecting to a PostgreSQL instance, so you can 
execute any SQL statements.

Best Regards
Didier ROS
-----Original Message-----
From: tsunakawa.ta...@fujitsu.com [mailto:tsunakawa.ta...@fujitsu.com] 
Sent: Thursday, December 12, 2019 02:04
To: 'Koichi Suzuki' 
Cc: ROS Didier ; pgsql-hackers@lists.postgresql.org
Subject: RE: get_database_name() from background worker

From: Koichi Suzuki 
> I'm not using this.   Is it a must to use get_database_name()?

I don't think pg_background is a must, but the system catalog access done by 
get_database_name() requires a database connection and a transaction.  See 
src/test/modules/worker_spi/worker_spi.c for an example of a background worker  
that uses both.
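
For illustration, here is a minimal sketch (not the worker_spi code itself; 
header and function names follow the usual backend conventions, so please 
double-check them against your version) of a background worker that connects 
to a database and then calls get_database_name() inside a transaction:

#include "postgres.h"

#include "access/xact.h"
#include "commands/dbcommands.h"	/* get_database_name() */
#include "miscadmin.h"				/* MyDatabaseId */
#include "postmaster/bgworker.h"

void
my_worker_main(Datum main_arg)
{
	char	   *dbname;

	/* Establish signal handlers, etc., before unblocking signals. */
	BackgroundWorkerUnblockSignals();

	/* Connect to a database; required before any catalog access. */
	BackgroundWorkerInitializeConnection("postgres", NULL, 0);

	/* Catalog lookups must run inside a transaction. */
	StartTransactionCommand();
	dbname = get_database_name(MyDatabaseId);
	elog(LOG, "connected to database \"%s\"", dbname);
	CommitTransactionCommand();
}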


Regards
Takayuki Tsunakawa







Re: Add .editorconfig

2019-12-12 Thread Peter Eisentraut

On 2019-12-11 18:54, Andreas Karlsson wrote:

I have not used .editorconfig that much, but would it makes sense to add
the below?

[*]
end_of_line = lf


I think that would best be done in response to an actual need.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-12 Thread Amit Kapila
On Thu, Dec 12, 2019 at 11:53 AM Amit Khandekar  wrote:
>
> On Thu, 12 Dec 2019 at 11:34, Amit Khandekar  wrote:
>
> > So max_changes_in_memory is the one
> > that allows us to reduce the number of transactions required, so we
> > can cut down on the outer loop iterations and make the test finish
> > much earlier.
>
> >
> > But also note that, we can't use the test suite in
> > contrib/test_decoding, because max_changes_in_memory needs server
> > restart. So we need to shift this test to src/test/recovery. And
> > there, I guess it is not that critical for the testcase to be very
> > quick because the tests in general are much slower than the ones in
> > contrib/test_decoding, although it would be nice to make it fast. What
> > I propose is to modify  max_changes_in_memory, do a server restart
> > (which takes hardly a sec), run the testcase (3.5 sec) and then
> > restart after resetting the guc. So totally it will be around 4-5
> > seconds.
>
> Sorry I meant max_files_per_process.
>

Okay, how much time do the other individual tests in that directory take on
your machine?  How about providing a separate test patch for this case so
that I can also test it?  I think we can take the opinion of others as
well if they are fine with adding this test, otherwise, we can go
ahead with the main patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Let people set host(no)ssl settings from initdb

2019-12-12 Thread Peter Eisentraut

On 2019-12-12 07:24, David Fetter wrote:

That problem exists even before you get to the question of whether
this specific option is useful or well-designed ... a question I'm
not opining about here, but it would certainly require thought.

I think it was a reasonable extension. We cover lines that start with
local and host, but they can also start with hostssl and hostnossl.


I suspect the real purpose here is to easily reject non-SSL connections 
altogether.  This is currently quite cumbersome and requires careful 
ongoing maintenance of pg_hba.conf.  But I see two problems with the 
proposed approach: (1) initdb doesn't support setting up SSL, so the 
only thing you can achieve here is to reject all TCP/IP connections, 
until you have set up SSL. (2) The default pg_hba.conf only covers 
localhost connections.  The value of enforcing SSL connections to 
localhost is probably quite low.  You still need ongoing careful 
pg_hba.conf maintenance as you add more host entries.


Maybe we just need something like libpq's sslmode on the server side. 
Probably not quite the same, perhaps just ssl = require.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Minimal logical decoding on standbys

2019-12-12 Thread Rahila Syed
Hi Amit,

Please see following comments:

1.  0002-Add-info-in-WAL-records-in-preparation-for-logical-s.patch

--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@

 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"

The above header inclusion is not necessary as the code compiles fine
without it.
Also, this patch does not apply cleanly on latest master due to the above
line.

2.  Following test fails with error.
make -C src/test/recovery/ check PROVE_TESTS=t/
018_standby_logical_decoding_xmins.pl
#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 120.
#  got: ''
# expected: anything else

#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 141.
#  got: ''
# expected: anything else

#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 159.
#  got: ''
# expected: anything else
t/018_standby_logical_decoding_xmins.pl .. 20/27 # poll_query_until timed
out executing this query:
#

Physical catalog_xmin is NULL on the master after logical slot creation on
the standby.

Due to this, the command below in the test fails with a syntax error, as it
constructs the SQL query using the catalog_xmin value obtained above:

# SELECT catalog_xmin::varchar::int >
# FROM pg_catalog.pg_replication_slots
# WHERE slot_name = 'master_physical';
#
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# ERROR:  syntax error at or near "FROM"
# LINE 3:   FROM pg_catalog.pg_replication_slots

Thank you,
-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-12-12 Thread Amit Kapila
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas  wrote:
>
> On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar  wrote:
> > I have rebased the patch set on the latest head.
>
> 0001 looks like a clever approach, but are you sure it doesn't hurt
> performance when many small XLOG records are being inserted? I think
> XLogRecordAssemble() can get pretty hot in some workloads.
>

I don't think we have evaluated it yet, but we should do it.  The
point to note is that it is only for the case when wal_level is
'logical' (see IsSubTransactionAssignmentPending) in which case we
already log more WAL, so this might not impact much.  I guess that it
might be better to have that check in XLogRecordAssemble for the sake
of clarity.

>
> Regarding 0005, it seems to me that this is no good:
>
> + errmsg("improper heap_getnext call")));
>
> I think we should be using elog() rather than ereport() here, because
> this should only happen if there's a bug in a logical decoding plugin.
> At first, I thought maybe this should just be an Assert(), but since
> there are third-party logical decoding plugins available, checking
> this even in non-assert builds seems like a good idea. However, I
> think making it translatable is overkill; users should never see this,
> only developers.
>

makes sense.  I think we should change it.

>
> + if (prev_lsn != InvalidXLogRecPtr)
> + Assert(prev_lsn <= change->lsn);
>
> There is no reason to ever write an if statement that contains only an
> Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
> || prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.
>

Agreed.
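
For reference, a minimal sketch of the two equivalent forms suggested above
(variable names taken from the quoted snippet; XLogRecPtrIsInvalid() is the
helper macro from access/xlogdefs.h):

/* equivalent spellings of the same invariant */
Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
Assert(XLogRecPtrIsInvalid(prev_lsn) || prev_lsn <= change->lsn);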

> The purpose and mechanism of the is_schema_sent flag is not clear to
> me. The word "schema" here seems to be being used to mean "snapshot,"
> which is rather confusing.
>

I have explained this flag below along with invalidations as both are
slightly related.

> I'm also somewhat unclear on what's happening here with invalidations.
> Perhaps that's as much a defect in my understanding as it is
> reflective of any problem with the patch, but I also don't see any
> comments either in 0002 or later patches explaining the theory of
> operation. If I've missed some, please point me in the right
> direction. Hypothetically speaking, it seems to me that if you just
> did InvalidateSystemCaches() every time the snapshot changed, you
> wouldn't need anything else (unless we're concerned with
> non-transactional invalidation messages like smgr and relmapper
> invalidations; not quite sure how those are handled). And, on the
> other hand, if we don't do InvalidateSystemCaches() every time the
> snapshot changes, then I don't understand why this works now, even
> without streaming.
>

I think the way invalidations work for logical replication is that,
normally, we always start a new transaction before decoding each
commit, which allows us to accept the invalidations (via
AtStart_Cache).  However, if there are catalog changes within the
transaction being decoded, we need to reflect those before trying to
decode the WAL of any operation that happened after that catalog
change.  As we are not logging WAL for each invalidation, we need to
execute all the invalidation messages for this transaction at each
catalog change.  We are able to do that now because we decode the
entire WAL for a transaction only once we get the commit's WAL, which
contains all the invalidation messages.  So, we queue them up and
execute them at each catalog change, which we identify by the WAL
record XLOG_HEAP2_NEW_CID.

The second, related concept is that before sending each change
downstream (via pgoutput), we check whether we need to send the
schema.  We decide this based on the local map entry
(RelationSyncEntry), which indicates whether the schema for the
relation has already been sent or not.  Once the schema of the
relation is sent, the entry for that relation in the map will indicate
it.  During invalidation processing we also blow away this map, so it
always reflects the correct state.
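
For context, the map entry being discussed looks roughly like this
(simplified sketch; the real RelationSyncEntry in pgoutput.c has
additional fields):

typedef struct RelationSyncEntry
{
	Oid		relid;			/* hash key: relation OID */
	bool	schema_sent;	/* has the relation's schema already been
							 * sent to the downstream? */
} RelationSyncEntry;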

Now, to decode an in-progress transaction, we need to ensure that we
have received the WAL for all the invalidations before decoding the
WAL of action that happened immediately after that catalog change.
This is the reason we started WAL logging individual Invalidations.
So, with this change we don't need to execute all the invalidations
for each catalog change, rather execute them as and when their WAL is
being decoded.

The current mechanism to send schema changes won't work for streaming
transactions because, after sending the change, the subtransaction
might abort.  On subtransaction abort, the downstream will simply
discard the changes, in which case we lose the previously sent schema
change.  There is no such problem currently because we process all the
aborts before sending any change.  So, the current idea of having a
schema_sent flag in each map entry (RelationSyncEntry) won't work for
streaming transactions.  To solve this problem, the patch initially
keeps a flag 'is_schema_sent

Re: backup manifests

2019-12-12 Thread Suraj Kharage
Thanks, Robert for the review.

On Wed, Dec 11, 2019 at 1:10 AM Robert Haas  wrote:

> On Tue, Dec 10, 2019 at 6:40 AM Suraj Kharage
>  wrote:
> > Please find attached patch for backup validator implementation (0004
> patch). This patch is based
> > on Rushabh's latest patch for backup manifest.
> >
> > There are some functions required at client side as well, so I have
> moved those functions
> > and some data structure at common place so that they can be accessible
> for both. (0003 patch).
> >
> > My colleague Rajkumar Raghuwanshi has prepared the WIP patch (0005) for
> tap test cases which
> > is also attached. As of now, test cases related to the tablespace and
> tar backup  format are missing,
> > will continue work on same and submit the complete patch.
> >
> > With this mail, I have attached the complete patch stack for backup
> manifest and backup
> > validate implementation.
> >
> > Please let me know your thoughts on the same.
>
> Well, for the second time on this thread, please don't take a bunch of
> somebody else's code and post it in a patch that doesn't attribute
> that person as one of the authors. For the second time on this thread,
> the person is me, but don't borrow *anyone's* code without proper
> attribution. It's really important!
>
> On a related note, it's a very good idea to use git format-patch and
> git rebase -i to maintain patch stacks like this. Rushabh seems to
> have done that, but the files you're posting look like raw 'git diff'
> output. Notice that this gives him a way to include authorship
> information and a tentative commit message in each patch, but you
> don't have any of that.
>

Sorry, I have corrected this in the attached v2 patch set.


> Also on a related note, part of the process of adapting existing code
> to a new purpose is adapting the comments. You haven't done that:
>
> + * Search a result-set hash table for a row matching a given filename.
> ...
> + * Insert a row into a result-set hash table, provided no such row is
> already
> ...
> + * Most of the values
> + * that we're hashing are short integers formatted as text, so there
> + * shouldn't be much room for pathological input.
>
Corrected in v2 patch.


> I think that what we should actually do here is try to use simplehash.
> Right now, it won't work for frontend code, but I posted some patches
> to try to address that issue:
>
>
> https://www.postgresql.org/message-id/CA%2BTgmob8oyh02NrZW%3DxCScB%2B5GyJ-jVowE3%2BTWTUmPF%3DFsGWTA%40mail.gmail.com
>
> That would have a few advantages. One, we wouldn't need to know the
> number of elements in advance, because simplehash can grow
> dynamically. Two, we could use the iteration interfaces to walk the
> hash table.  Your solution to that is pgrhash_seq_search, but that's
> actually not well-designed, because it's not a generic iterator
> function but something that knows specifically about the 'touch' flag.
> I incidentally suggest renaming 'touch' to 'matched;' 'touch' is not
> bad, but I think 'matched' will be a little more recognizable.
>

Thanks for the suggestion. Will try to implement the same and update
accordingly.
I am assuming that I need to build the patch based on the changes that you
proposed on the mentioned thread.


> Please run pgindent. If needed, first add locally defined types to
> typedefs.list, so that things indent properly.
>
> It's not a crazy idea to try to share some data structures and code
> between the frontend and the backend here, but I think
> src/common/backup.c and src/include/common/backup.h is a far too
> generic name given what the code is actually doing. It's mostly about
> checksums, not backup, and I think it should be named accordingly. I
> suggest removing "manifestinfo" and renaming the rest to just talk
> about checksums rather than manifests. That would make it logical to
> reuse this for any other future code that needs a configurable
> checksum type. Also, how about adding a function like:
>
> extern bool parse_checksum_algorithm(char *name, ChecksumAlgorithm *algo);
>
> ...which would return true and set *algo if name is recognized, and
> return false otherwise. That code could be used on both the client and
> server sides of this patch, and by any future patches that want to
> reuse this scaffolding.
>

Corrected the filename and implemented the function as suggested.
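
For illustration, a minimal sketch of what such a helper could look like
(hypothetical enum values, not the actual patch; it assumes the usual
postgres.h / postgres_fe.h preamble used for src/common code):

typedef enum ChecksumAlgorithm
{
	CHECKSUM_NONE,
	CHECKSUM_CRC32C,
	CHECKSUM_SHA256
} ChecksumAlgorithm;

/* Returns true and sets *algo if name is recognized, false otherwise. */
bool
parse_checksum_algorithm(char *name, ChecksumAlgorithm *algo)
{
	if (pg_strcasecmp(name, "none") == 0)
		*algo = CHECKSUM_NONE;
	else if (pg_strcasecmp(name, "crc32c") == 0)
		*algo = CHECKSUM_CRC32C;
	else if (pg_strcasecmp(name, "sha256") == 0)
		*algo = CHECKSUM_SHA256;
	else
		return false;
	return true;
}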


> The file header for backup.h has the wrong filename (string.h). The
> header format looks somewhat atypical compared to what we normally do,
> too.


My bad, corrected the header format as well.


>
>
It's arguable, but I tend to think that it would be better to
> hex-encode the CRC rather than printing it as an integer.  Maybe
> hex_encode() is another thing that could be moved into the new
> src/common file.


We are already encoding the CRC checksum as well. Please let me know if I
misunderstood anything.
Moved hex_encode into src/common.


> As I said before about Rushabh's patch set, it's very confusing that
> we have so many pat

Re: Wrong assert in TransactionGroupUpdateXidStatus

2019-12-12 Thread Alvaro Herrera
On 2019-Dec-11, Amit Kapila wrote:

> On Wed, Dec 11, 2019 at 4:02 AM Andres Freund  wrote:
> > On 2019-12-10 13:55:40 +0530, Dilip Kumar wrote:

> > /*
> >  * We don't try to do group update optimization if a 
> > process has
> >  * overflowed the subxids array in its PGPROC, since in 
> > that case we
> >  * don't have a complete list of XIDs for it.
> >  */
> > Assert(THRESHOLD_SUBTRANS_CLOG_OPT <= 
> > PGPROC_MAX_CACHED_SUBXIDS);
> >
> > Even if these weren't redundant, it can't make sense to test such a
> > static condition only within an if?
> 
> I don't remember exactly the reason for this, but now I don't find the
> Assert within if () meaningful.  I think we should remove the Assert
> inside if() unless Robert or someone see any use of it.

The more I look at both these asserts, the less sense they make.  Why
does clog.c care about PGPROC at all?  Looking at the callers of that
routine, nowhere do they concern themselves with whether the overflowed
flag has been set or not.  It seems to me that the StaticAssert() should
be near the PGPROC_MAX_CACHED_SUBXIDS definition, not the SUBTRANS
definition (maybe as StaticAssertDecl, as in
201DD0641B056142AC8C6645EC1B5F62014B8E8030@SYD1217 )

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Append with naive multiplexing of FDWs

2019-12-12 Thread Kyotaro Horiguchi
Hello.

I think I can say that this patch doesn't slow down non-AsyncAppend,
non-postgres_fdw scans.


At Mon, 9 Dec 2019 12:18:44 -0500, Bruce Momjian  wrote in 
> Certainly any overhead on normal queries would be unacceptable.

I took performance numbers on the current shape of the async execution
patch for the following scan cases.

t0   : single local table (parallel disabled)
pll  : local partitioning (local Append, parallel disabled)
ft0  : single foreign table
pf0  : inheritance on 4 foreign tables, single connection
pf1  : inheritance on 4 foreign tables, 4 connections
ptf0 : partition on 4 foreign tables, single connection
ptf1 : partition on 4 foreign tables, 4 connections

The benchmarking system is configured as follows, on a single
machine.

       [ benchmark client ]
          |            |
  (localhost:5433) (localhost:5432)
          |            |
   [master server]  [async server]
       |     ^          |     ^
       +-fdw-+          +-fdw-+


The patch works roughly in the following steps.

1. The planner decides how many children of an Append can run
  asynchronously (called async-capable).

2. At ExecInit time, if an Append has no async-capable children,
  ExecAppend (exactly the same function as before) is set as
  ExecProcNode. Otherwise ExecAppendAsync is used.

If the infrastructure part of the patch causes any degradation, the
"t0" (scan on a single local table) and/or "pll" (scan on a locally
partitioned table) tests get slower.

3. postgres_fdw always runs the async-capable code path.

If the postgres_fdw part causes degradation, ft0 reflects that.


Each table has two integer columns, and the query does sum(a) over all tuples.

With the default fetch_size = 100, the numbers below are run times in ms.
Each number is the average of 14 runs.

      master  patched    gain
t0      7325     7130   +2.7%
pll     4558     4484   +1.7%
ft0     3670     3675   -0.1%
pf0     2322     1550  +33.3%
pf1     2367     1475  +37.7%
ptf0    2517     1624  +35.5%
ptf1    2343     1497  +36.2%

With a larger fetch_size (200), the gain mysteriously decreases for the
cases sharing a single connection (pf0, ptf0), but the others don't seem
to change much.

      master  patched    gain
t0      7212     7252   -0.6%
pll     4546     4397   +3.3%
ft0     3712     3731   -0.5%
pf0     2131     1570  +26.4%
pf1     1926     1189  +38.3%
ptf0    2001     1557  +22.2%
ptf1    1903     1193  +37.4%

FWIW, attached are the test scripts.

gentblr2.sql: Table creation script.
testrun.sh  : Benchmarking script.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
SELECT :scale  * 0 as th0,
   :scale  * 1 as th1,
   :scale  * 2 as th2,
   :scale  * 3 as th3,
   :scale  * 4 as th4,
   :scale  * 10 as th10,
   :scale  * 20 as th20,
   :scale  * 30 as th30,
   :scale  * 40 as th40,
   :scale  * 100 as th100,
   :scale  * 1000 as th1000
\gset

DROP TABLE IF EXISTS t0 CASCADE;
DROP TABLE IF EXISTS pl CASCADE;
DROP TABLE IF EXISTS pll CASCADE;
DROP TABLE IF EXISTS pf0 CASCADE;
DROP TABLE IF EXISTS pf1 CASCADE;
DROP TABLE IF EXISTS ptf0;
DROP TABLE IF EXISTS ptf1;

CREATE TABLE pl (a int, b int);
CREATE TABLE cl1 (LIKE pl) INHERITS (pl);
CREATE TABLE cl2 (LIKE pl) INHERITS (pl);
CREATE TABLE cl3 (LIKE pl) INHERITS (pl);
CREATE TABLE cl4 (LIKE pl) INHERITS (pl);
INSERT INTO cl1 (SELECT a, a FROM generate_series(:th0, :th1 - 1) a);
INSERT INTO cl2 (SELECT a, a FROM generate_series(:th1, :th2 - 1) a);
INSERT INTO cl3 (SELECT a, a FROM generate_series(:th2, :th3 - 1) a);
INSERT INTO cl4 (SELECT a, a FROM generate_series(:th3, :th4 - 1) a);

CREATE TABLE pll (a int, b int);
CREATE TABLE cll1 (LIKE pl) INHERITS (pll);
CREATE TABLE cll2 (LIKE pl) INHERITS (pll);
CREATE TABLE cll3 (LIKE pl) INHERITS (pll);
CREATE TABLE cll4 (LIKE pl) INHERITS (pll);
INSERT INTO cll1 (SELECT a, a FROM generate_series(:th0, :th10 - 1) a);
INSERT INTO cll2 (SELECT a, a FROM generate_series(:th10, :th20 - 1) a);
INSERT INTO cll3 (SELECT a, a FROM generate_series(:th20, :th30 - 1) a);
INSERT INTO cll4 (SELECT a, a FROM generate_series(:th30, :th40 - 1) a);


CREATE TABLE t0  (LIKE pl);
INSERT INTO t0 (SELECT a, a FROM generate_series(0, :th100 - 1) a);

DROP SERVER IF EXISTS svl CASCADE;
DROP SERVER IF EXISTS sv0 CASCADE;
DROP SERVER IF EXISTS sv1 CASCADE;
DROP SERVER IF EXISTS sv2 CASCADE;
DROP SERVER IF EXISTS sv3 CASCADE;
DROP SERVER IF EXISTS sv4 CASCADE;

CREATE SERVER svl FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host :'svhost', 
port :'svport', dbname :'svdbname', fetch_size :'fetchsize');
CREATE SERVER sv0 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host :'svhost', 
port :'svport', dbname :'svdbname', fetch_size :'fetchsize');
CREATE SERVER sv1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host :'svhost', 
port :'svport', dbname :'svdbname', fetch_size :'fetchsize');
CREATE SERVER sv2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host :'svhost', 
port :'svport', dbname :

Re: non-exclusive backup cleanup is mildly broken

2019-12-12 Thread Magnus Hagander
On Thu, Dec 12, 2019 at 5:58 AM Kyotaro Horiguchi 
wrote:

> Hello.
>
> At Wed, 11 Dec 2019 17:32:05 -0500, Robert Haas 
> wrote in
> > While reviewing the first patch in Asif Rehman's series of patches for
> > parallel backup over at
> >
> http://postgr.es/m/CADM=Jeg3ZN+kPQpiSfeWCXr=xgplrq4cbqe5zviuxygkq3v...@mail.gmail.com
> > I discovered that commit 7117685461af50f50c03f43e6a622284c8d54694
> > introduced a use of cancel_before_shmem_exit which falsifies the
> > comment for that function. So you can cause a spurious WARNING in the
> > logs by doing something like this, with max_prepared_transactions set
> > to a non-zero value:
> >
> > select pg_start_backup('one', false, false);
> > begin;
> > prepare transaction 'nothing';
> > select pg_stop_backup(false);
> > \q
> >
> > in the server log:
> > WARNING:  aborting backup due to backend exiting before pg_stop_backup
> > was called
> >
> > And you can cause an assertion failure like this:
> >
> > select pg_start_backup('one', false, false);
> > begin;
> > prepare transaction 'nothing';
> > select pg_stop_backup(false);
> > select pg_start_backup('two');
> > \q
> >
> > We've discussed before the idea that it might be good to remove the
> > limitation that before_shmem_exit() can only remove the
> > most-recently-added callback, which would be one way to fix this
> > problem, but I'd like to propose an alternative solution which I think
> > will work out more nicely for the patch mentioned above: don't use
> > cancel_before_shmem_exit, and just leave the callback registered for
> > the lifetime of the backend. That requires some adjustment of the
> > callback, since it needs to tolerate exclusive backup mode being in
> > effect.
> >
> > The attached patch takes that approach. Thoughts welcome on (1) the
> > approach and (2) whether to back-patch. I think there's little doubt
> > that this is formally a bug, but it's a pretty minor one.
>
> The direction seems reasonable, but the patch doesn't free up the
> before_shmem_exit slot nor avoid duplicate registration of the
> callback. Actually, before_shmem_exit_list gets bloated with multiple
> do_pg_abort_backup entries through repeated rounds of non-exclusive
> backups.
>
> As a result, if one ends a session while a non-exclusive backup is
> active after closing a previous non-exclusive backup,
> do_pg_abort_backup aborts with an assertion failure.
>

+1, I also think the direction seems perfectly reasonable, but we should
avoid re-adding the callback since we're not removing it. Leaving it around
seems cheap enough as long as there is only one.

My first reaction would be to just disallow the combination of prepared
transactions and start/stop backups. But looking at it, it seems like an
unnecessary restriction, and an approach like this one seems better.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Wrong assert in TransactionGroupUpdateXidStatus

2019-12-12 Thread Amit Kapila
On Thu, Dec 12, 2019 at 6:10 PM Alvaro Herrera  wrote:
>
> On 2019-Dec-11, Amit Kapila wrote:
>
> > On Wed, Dec 11, 2019 at 4:02 AM Andres Freund  wrote:
> > > On 2019-12-10 13:55:40 +0530, Dilip Kumar wrote:
>
> > > /*
> > >  * We don't try to do group update optimization if a 
> > > process has
> > >  * overflowed the subxids array in its PGPROC, since in 
> > > that case we
> > >  * don't have a complete list of XIDs for it.
> > >  */
> > > Assert(THRESHOLD_SUBTRANS_CLOG_OPT <= 
> > > PGPROC_MAX_CACHED_SUBXIDS);
> > >
> > > Even if these weren't redundant, it can't make sense to test such a
> > > static condition only within an if?
> >
> > I don't remember exactly the reason for this, but now I don't find the
> > Assert within if () meaningful.  I think we should remove the Assert
> > inside if() unless Robert or someone see any use of it.
>
> The more I look at both these asserts, the less sense they make.  Why
> does clog.c care about PGPROC at all?
>

It is mainly for group updates.  Basically, we want to piggyback the
procs that are trying to update clog at the same time on the proc
which got the CLogControlLock. This avoids taking/releasing that lock
multiple times.  See TransactionGroupUpdateXidStatus.

>  Looking at the callers of that
> routine, nowhere do they concern themselves with whether the overflowed
> flag has been set or not.  It seems to me that the StaticAssert() should
> be near the PGPROC_MAX_CACHED_SUBXIDS definition, not the SUBTRANS
> definition (maybe as StaticAssertDecl, as in
> 201DD0641B056142AC8C6645EC1B5F62014B8E8030@SYD1217 )
>

Sounds reasonable.  We can do that once the patch mentioned by you got
committed.  For now, we are planning to just remove the Assert inside
if() condition. Do you see any problem with that?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Index corruption / planner issue with one table in my pg 11.6 instance

2019-12-12 Thread Jeremy Finzel
On Tue, Dec 10, 2019 at 8:25 AM Jeremy Finzel  wrote:

> Is there a way for me to test this theory?  I tried the following with no
> change in behavior:
>
>1. Disable write load to table
>2. Vacuum analyze table (not vac full)
>3. Create index
>4. Explain
>
> Still did not pick up the index.
>

Just another followup: with no other intervention on our part, after many
hours the planner is picking up the index.

I still don't quite know what is causing it, but is this behavior actually
desired?  It's pretty inconvenient when trying to build an index for a
query and use it immediately, which used to work :).

Thanks,
Jeremy


Re: tuplesort test coverage

2019-12-12 Thread Tom Lane
Andres Freund  writes:
> I pushed this now. We'll see what the slower buildfarm animals say. I'll
> try to see how long they took in a few days.

friarbird (a CLOBBER_CACHE_ALWAYS animal) just showed a failure in this:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird&dt=2019-12-12%2006%3A20%3A02

== pgsql.build/src/test/regress/regression.diffs 
===
diff -U3 
/pgbuild/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/tuplesort.out 
/pgbuild/root/HEAD/pgsql.build/src/test/regress/results/tuplesort.out
--- 
/pgbuild/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/tuplesort.out 
2019-11-13 19:54:11.0 -0500
+++ /pgbuild/root/HEAD/pgsql.build/src/test/regress/results/tuplesort.out   
2019-12-12 08:25:23.0 -0500
@@ -625,13 +625,13 @@
Group Key: a.col12
Filter: (count(*) > 1)
->  Merge Join
- Merge Cond: (a.col12 = b.col12)
- ->  Sort
-   Sort Key: a.col12 DESC
-   ->  Seq Scan on test_mark_restore a
+ Merge Cond: (b.col12 = a.col12)
  ->  Sort
Sort Key: b.col12 DESC
->  Seq Scan on test_mark_restore b
+ ->  Sort
+   Sort Key: a.col12 DESC
+   ->  Seq Scan on test_mark_restore a
 (14 rows)
 
 :qry;

Since a and b are exactly the same table, in principle it's a matter of
chance which one the planner will put on the outside of the join.
I think what happened here is that the test ran long enough for
autovacuum/autoanalyze to come along and scan the table, changing its
stats in between where the planner picked up the stats for a and those
for b, and we ended up making the opposite join order choice.

I considered fixing this by adding some restriction clause on b so
that the join order choice isn't such a coin flip.  But it's not
clear that the problem couldn't recur anyway --- the table stats
would change significantly on auto-analyze, since the test script
isn't bothering to create any stats itself.

What seems like a simpler and more reliable fix is to make
test_mark_restore a temp table, thus keeping autovac away from it.
Is there a reason in terms of the test's goals not to do that?

Also ... why in the world does the script drop its tables at the end
with IF EXISTS?  They'd better exist at that point.  I object
to the DROP IF EXISTS up at the top, too.  The regression tests
do not need to be designed to deal with an unpredictable start state,
and coding them to do so can have no effect other than possibly
masking problems.

regards, tom lane




Duplicate function call on timestamp2tm

2019-12-12 Thread Li Japin
Hi,

I find there is a duplicate function call on timestamp2tm in timestamptz_part 
and timestamp_part.
Is that necessary? I removed the latter one and it still works.

Best,
Japin.



remove-duplicate-timestamp2tm-function-call.patch
Description: remove-duplicate-timestamp2tm-function-call.patch


Re: Collation versioning

2019-12-12 Thread Julien Rouhaud
Hello Thomas,

Thanks for looking at it!

On Thu, Dec 12, 2019 at 5:01 AM Thomas Munro  wrote:
>
> We create duplicate pg_depend records:
>
> [...]
>
> I wondered if that was harmless

That's the assumed behavior of recordMultipleDependencies:

/*
* Record the Dependency.  Note we don't bother to check for
* duplicate dependencies; there's no harm in them.
*/

We could add a check to skip duplicates for the "track_version ==
true" path, or switch to flags if we want to also skip duplicates in
other cases, but it'll make recordMultipleDependencies a little bit
more specialised.

> but for one thing it causes duplicate warnings:

Yes, that should be avoided.

> Here's another way to get a duplicate, and in this example you also
> get an unnecessary dependency on 100 "default" for this index:
>
> postgres=# create index on t(x collate "fr_FR") where x > 'helicopter'
> collate "fr_FR";
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid | refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
>     1259 | 16460 |        0 |       3456 |    12603 |           0 | 0:34.0        | n
>     1259 | 16460 |        0 |       3456 |    12603 |           0 | 0:34.0        | n
>     1259 | 16460 |        0 |       3456 |      100 |           0 | 0:34.0        | n
> (3 rows)
>
> Or... maybe 100 should be there, by simple analysis of the x in the
> WHERE clause, but it's the same if you write x collate "fr_FR" >
> 'helicopter' collate "fr_FR", and in that case there are no
> expressions of collation "default" anywhere.

Ah good point.  That's because expression_tree_walker() will dig into
CollateExpr->args and eventually reach the underlying Var.  I don't
see an easy way to avoid that while still properly recording the
required dependency for an even more realistic index such as

CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" >
'helicopter' COLLATE "en_US")::text) collate "fr_FR";

and for instance not for

CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" ||
'helicopter' COLLATE "en_US")) collate "fr_FR";


> The indirection through composite types works nicely:
>
> postgres=# create type foo_t as (en text collate "en_CA", fr text
> collate "fr_CA");
> CREATE TYPE
> postgres=# create table t (foo foo_t);
> CREATE TABLE
> postgres=# create index on t(foo);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_foo_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid | refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
>     1259 | 16444 |        0 |       3456 |    12554 |           0 | 0:34.0        | n
>     1259 | 16444 |        0 |       3456 |    12597 |           0 | 0:34.0        | n
> (2 rows)
>
> ... but again it shows the extra and technically unnecessary
> dependencies (only 12603 "fr_FR" is really needed):
>
> postgres=# create index on t(((foo).fr collate "fr_FR"));
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_fr_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid | refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
>     1259 | 16445 |        0 |       3456 |    12603 |           0 | 0:34.0        | n
>     1259 | 16445 |        0 |       3456 |    12597 |           0 | 0:34.0        | n
>     1259 | 16445 |        0 |       3456 |    12554 |           0 | 0:34.0        | n
> (3 rows)

Yes :(

> I check that nested types are examined recursively, as appropriate.  I
> also tested domains, arrays, arrays of domains, expressions extracting
> an element from an array of a domain with an explicit collation, and
> the only problem I could find was more ways to get duplicates.  Hmm...
> what else is there that can contain a collatable type...?  Ranges!
>
> postgres=# create type myrange as range (subtype = text);
> CREATE TYPE
> postgres=# drop table t;
> DROP TABLE
> postgres=# create table t (x myrange);
> CREATE TABLE
> postgres=# create index on t(x);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid | refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
> (0 rows)
>
> ... or perhaps, more realistically, a GIST index might actually be
> useful for range queries, and we're not capturing the dependency:
>
> postgres=# create index t_x_idx on t using gist (x);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid 

Re: Duplicate function call on timestamp2tm

2019-12-12 Thread Tom Lane
Li Japin  writes:
> I find there is a duplicate function call on timestamp2tm in timestamptz_part 
> and timestamp_part.
> Is that necessary? I remove the latter one and it also works.

Huh.  I do believe you're right.  Must be an ancient copy-and-paste
mistake?

regards, tom lane




Re: Wrong assert in TransactionGroupUpdateXidStatus

2019-12-12 Thread Alvaro Herrera
On 2019-Dec-12, Amit Kapila wrote:

> On Thu, Dec 12, 2019 at 6:10 PM Alvaro Herrera  
> wrote:

> > The more I look at both these asserts, the less sense they make.  Why
> > does clog.c care about PGPROC at all?
> 
> It is mainly for group updates.  Basically, we want to piggyback the
> procs that are trying to update clog at the same time on the proc
> which got the CLogControlLock. This avoids taking/releasing that lock
> multiple times.  See TransactionGroupUpdateXidStatus.

Yeah, I (think I) understand that.  My point is that conceptually, the
fact that a PGPROC has overflowed does not really affect clog.c in any
way.

> > Looking at the callers of that routine, nowhere do they concern
> > themselves with whether the overflowed
> > flag has been set or not.  It seems to me that the StaticAssert() should
> > be near the PGPROC_MAX_CACHED_SUBXIDS definition, not the SUBTRANS
> > definition (maybe as StaticAssertDecl, as in
> > 201DD0641B056142AC8C6645EC1B5F62014B8E8030@SYD1217 )
> 
> Sounds reasonable.  We can do that once the patch mentioned by you got
> committed.  For now, we are planning to just remove the Assert inside
> if() condition. Do you see any problem with that?

Nope.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Let people set host(no)ssl settings from initdb

2019-12-12 Thread David Fetter
On Thu, Dec 12, 2019 at 10:47:52AM +0100, Peter Eisentraut wrote:
> On 2019-12-12 07:24, David Fetter wrote:
> > > That problem exists even before you get to the question of whether
> > > this specific option is useful or well-designed ... a question I'm
> > > not opining about here, but it would certainly require thought.
> > I think it was a reasonable extension. We cover lines that start with
> > local and host, but they can also start with hostssl and hostnossl.
> 
> I suspect the real purpose here is to easily reject non-SSL connections
> altogether.  This is currently quite cumbersome and requires careful ongoing
> maintenance of pg_hba.conf.

Yes, and kinda.  It's certainly possible to put lines high up in
pg_hba.conf that read:

hostnossl all all 0.0.0.0/0 reject
hostnossl all all ::/0  reject

and then the only ongoing maintenance is not to put lines above them
that contradict it.

> But I see two problems with the proposed approach: (1) initdb
> doesn't support setting up SSL, so the only thing you can achieve
> here is to reject all TCP/IP connections, until you have set up SSL.

I don't believe any special setup is needed to require TLS for the
connection, which is what this patch handles in a straightforward way.

Setting up cert-based auth is the hassle you describe.

> (2) The default pg_hba.conf only covers localhost connections.

As of this patch, it can be asked to cover all connections.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Duplicate function call on timestamp2tm

2019-12-12 Thread Tom Lane
I wrote:
> Li Japin  writes:
>> I find there is a duplicate function call on timestamp2tm in 
>> timestamptz_part and timestamp_part.
>> Is that necessary? I remove the latter one and it also works.

> Huh.  I do believe you're right.  Must be an ancient copy-and-paste
> mistake?

Ah, after looking in the git history, not quite that ancient:
this duplication dates to commit 258ee1b63, which moved these
switch cases from the "if (type == RESERV)" switches in the
same functions.  In the previous coding these function calls
were actually necessary, but here they're redundant.  I guess
that's just additional ammunition for Greg's point that the
keywords were misclassified ;-).

I see from the code coverage report that we're missing coverage
for these and some other paths in timestamp[tz]_part.  Think
I'll go add some more test cases while I'm at it.

Thanks again for the report!

regards, tom lane




Re: Wrong assert in TransactionGroupUpdateXidStatus

2019-12-12 Thread Robert Haas
On Tue, Dec 10, 2019 at 4:32 PM Andres Freund  wrote:
> and then, within an if():
>
> /*
>  * We don't try to do group update optimization if a process 
> has
>  * overflowed the subxids array in its PGPROC, since in that 
> case we
>  * don't have a complete list of XIDs for it.
>  */
> Assert(THRESHOLD_SUBTRANS_CLOG_OPT <= 
> PGPROC_MAX_CACHED_SUBXIDS);
>
> Even if these weren't redundant, it can't make sense to test such a
> static condition only within an if? Is it possible this was actually
> intended to test something different?

Based on the comment, I imagine it might've been intended to read
Assert(nsubxids <= PGPROC_MAX_CACHED_SUBXIDS).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Make autovacuum sort tables in descending order of xid_age

2019-12-12 Thread Mark Dilger




On 11/30/19 2:23 PM, David Fetter wrote:

On Sat, Nov 30, 2019 at 10:04:07AM -0800, Mark Dilger wrote:

On 11/29/19 2:21 PM, David Fetter wrote:

On Fri, Nov 29, 2019 at 07:01:39PM +0100, David Fetter wrote:

Folks,

Per a suggestion Christophe made, please find attached a patch to
$Subject:

Apart from carefully fudging with pg_resetwal, and short of running in
production for a few weeks, what would be some good ways to test this?


Per discussion on IRC with Sehrope Sarkuni, please find attached a
patch with one fewer bug, this one in the repalloc() calls.


Hello David,

Here are my initial thoughts.

Although you appear to be tackling the problem of vacuuming tables
with older Xids first *per database*,


Yes, that's what's come up for me in production, but lately,
production has consisted of a single active DB maxing out hardware. I
can see how in other situations--multi-tenant, especially--it would
make more sense to sort the DBs first.


I notice you don't address that in your latest patch.  Do you have
any thoughts on whether that needs to be handled in this patch?
Should tackling that problem be left for later?


have you considered changing the logic in building and sorting the
database list in get_database_list and rebuild_database_list?  I'm
just curious what your thoughts might be on this subject.


I hadn't, but now that you mention it, it seems like a reasonable
thing to try.


As far as sorting the list of tables in an array and then copying
that array into a linked list, I think there is no need.  The
copying of table_ages into table_oids is followed immediately by

   foreach(cell, table_oids)

and then table_oids seems not to serve any further purpose.  Perhaps
you can just iterate over table_ages directly and avoid the extra
copying.


I hadn't looked toward any optimizations in this section, given that
the vacuums in question can take hours or days, but I can see how that
would make the code cleaner, so please find that change attached.


That looks better, thanks!


I have not tested this change, but I may do so later today or perhaps
on Monday.


The code compiles cleanly and passes all regression tests, but I don't
think those tests really cover what you are changing.  Have you been
using any test framework for this?

I wonder if you might add information about table size, table changes,
and bloat to your RelFrozenXidAge struct and modify rfxa_comparator to
use a heuristic to cost the (age, size, bloat, changed) grouping and
sort on that cost, such that really large bloated tables with old xids
might get vacuumed before smaller, less bloated tables that have
even older xids. Sorting the tables based purely on xid_age seems to
ignore other factors that are worth considering.  I do not have a
formula for how those four factors should be weighted in the heuristic,
but you are implicitly assigning three of them a weight of zero in
your current patch.

relation_needs_vacanalyze currently checks the reltuples, n_dead_tuples
and changes_since_analyze along with vac_scale_factor and
anl_scale_factor for the relation, but only returns booleans dovacuum,
doanalyze, and wraparound.  If you pass your RelFrozenXidAge struct
(perhaps renamed) into relation_needs_vacanalyze, it could store those
values for the relation so that you don't need to look it up again when
sorting.

--
Mark Dilger




Re: Duplicate function call on timestamp2tm

2019-12-12 Thread Li Japin
Thanks for your confirmation. Is there anything I can do?

On Dec 12, 2019, at 11:13 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

Ah, after looking in the git history, not quite that ancient:
this duplication dates to commit 258ee1b63, which moved these
switch cases from the "if (type == RESERV)" switches in the
same functions.  In the previous coding these function calls
were actually necessary, but here they're redundant.  I guess
that's just additional ammunition for Greg's point that the
keywords were misclassified ;-).



Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-12 Thread Amit Khandekar
On Thu, 12 Dec 2019 at 14:18, Amit Kapila  wrote:
>
> On Thu, Dec 12, 2019 at 11:53 AM Amit Khandekar  
> wrote:
> >
> > On Thu, 12 Dec 2019 at 11:34, Amit Khandekar  wrote:
> >
> > > So max_changes_in_memory is the one
> > > that allows us to reduce the number of transactions required, so we
> > > can cut down on the outer loop iterations and make the test finish
> > > much earlier.
> >
> > >
> > > But also note that, we can't use the test suite in
> > > contrib/test_decoding, because max_changes_in_memory needs server
> > > restart. So we need to shift this test to src/test/recovery. And
> > > there, I guess it is not that critical for the testcase to be very
> > > quick because the tests in general are much slower than the ones in
> > > contrib/test_decoding, although it would be nice to make it fast. What
> > > I propose is to modify  max_changes_in_memory, do a server restart
> > > (which takes hardly a sec), run the testcase (3.5 sec) and then
> > > restart after resetting the guc. So totally it will be around 4-5
> > > seconds.
> >
> > Sorry I meant max_files_per_process.
> >
>
> Okay, what time other individual tests take in that directory on your
> machine?
For src/test/recovery directory, on average, a test takes about 4-5 seconds.

> How about providing a separate test patch for this case so
> that I can also test it?
Attached is a v4 patch that also addresses your code comments so far.
I have included the test case in 006_logical_decoding.pl. I observed
that the test case just adds only about 0.5 to 1 sec time. Please
verify on your env also, and also whether the test reproduces the
issue without the code changes.

> I think we can take the opinion of others as
> well if they are fine with adding this test, otherwise, we can go
> ahead with the main patch.
Sure.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


use_vfd_for_logrep_v4.patch
Description: Binary data


Re: Duplicate function call on timestamp2tm

2019-12-12 Thread Tom Lane
Li Japin  writes:
> Thanks for your confirm. Is there anything I can do?

No, I've got it.

In adding the test coverage I spoke of, I thought we should allow
the date_part tests to check all the entries in timestamp[tz]_tbl
not just those around current time, and I found an independent
problem:

          timestamp          | isoyear | week | isodow | dow | doy 
-----------------------------+---------+------+--------+-----+-----
...
 Tue Feb 16 17:32:01 0097 BC |     -96 |    7 |      2 |   2 |  47
 Sat Feb 16 17:32:01 0097    |      97 |    7 |      6 |   6 |  47

that is, the ISOYEAR case is failing to correct for BC years.

We could imagine fixing this in date2isoyear() but I think it's
safer to leave that function alone and do the corrections
in timestamp[tz]_part.  Note for example that formatting.c
already applies a BC correction to the result; and I think the
usage in date2isoyearday() requires sticking to the year-zero-exists
convention, too.

regards, tom lane




shared tempfile was not removed on statement_timeout (unreproducible)

2019-12-12 Thread Justin Pryzby
I have a nagios check on ancient tempfiles, intended to catch debris left by
crashed processes.  But triggered on this file:

$ sudo find /var/lib/pgsql/12/data/base/pgsql_tmp -ls
1429774 drwxr-x---   3 postgres postgres 4096 Dec 12 11:32 /var/lib/pgsql/12/data/base/pgsql_tmp
1698684 drwxr-x---   2 postgres postgres 4096 Dec  7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset
169347 5492 -rw-r-   1 postgres postgres  5619712 Dec  7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/0.0
169346 5380 -rw-r-   1 postgres postgres  5505024 Dec  7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/1.0

I found:
 2019-12-07 01:35:56 | 11025 | postgres | canceling statement due to statement timeout   | CLUSTER pg_stat_database_snap USI
 2019-12-07 01:35:56 | 11025 | postgres | temporary file: path "base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/2.0", size 5455872 | CLUSTER pg_stat_database_snap USI

I don't have a saved log entry for sharedfileset/0.0 - that may be significant,
or may be a bug in my log-archiving script.

The process has not crashed since it started:

postgres 18145 1  0 Nov18 ?00:51:39 /usr/pgsql-12/bin/postmaster -D 
/var/lib/pgsql/12/data
postgres 18147 18145  0 Nov18 ?00:00:53 postgres: logger
  
postgres 18149 18145  0 Nov18 ?00:38:42 postgres: checkpointer  
  

version | PostgreSQL 12.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
20150623 (Red Hat 4.8.5-39), 64-bit

Just to be sure:
$ ps -wwf 11025 || echo not running
UIDPID  PPID  C STIME TTY  STAT   TIME CMD
not running

I wasn't able to reproduce it like this.
PGOPTIONS='-c maintenance_work_mem=128MB -c client_min_messages=debug' psql 
postgres -c 'CREATE TABLE t (i int unique); INSERT INTO t SELECT 
generate_series(1,99)' -c 'SET statement_timeout=4999' -c 'CLUSTER t USING 
t_i_key'

Actually, I tried using pg_ls_tmpdir(), but it unconditionally masks
non-regular files and thus shared filesets.  Maybe that's worth discussing on a
new thread?

src/backend/utils/adt/genfile.c
/* Ignore anything but regular files */
if (!S_ISREG(attrib.st_mode))
continue;

BTW there's no other tablespaces.

Justin




force_parallel_mode = regress has a blind spot

2019-12-12 Thread Tom Lane
I have just found out the hard way (cf commits 22864f6e0, 776a2c887)
that if one uses EXPLAIN with both ANALYZE and VERBOSE selected,
the output is not the same between force_parallel_mode = off and
force_parallel_mode = regress.  This seems to me to be quite broken;
what's the point of force_parallel_mode = regress if it doesn't
produce the same output?

The reason there's a problem is that ExplainNode() will show
per-worker detail if both es->analyze and es->verbose are set,
even when the only reason there's a worker process is that
force_parallel_mode injected a supposedly-invisible Gather.

I don't see any way to fix this that doesn't involve some sort
of "action at a distance".  One could imagine hiding the per-worker
detail if we're underneath a Gather that has invisible set to
true, but it's not really clear to me that that would do the
right things in a plan with multiple Gather nodes.  Any thoughts
about that?

regards, tom lane




Re: VACUUM memory management

2019-12-12 Thread Ibrar Ahmed
On Wed, Dec 11, 2019 at 9:29 PM Ibrar Ahmed  wrote:

>
>
> On Wed, Dec 11, 2019 at 7:29 PM Robert Haas  wrote:
>
>> On Mon, Dec 9, 2019 at 2:02 PM Ibrar Ahmed  wrote:
>> >> Did you see this thread?
>> >>
>> https://postgr.es/m/CAGTBQpbDCaR6vv9=scXzuT8fSbckf=a3ngzdwfwzbdvugvh...@mail.gmail.com
>> >>
>> > Yes, and somehow did what is explained.
>>
>> Did you modify Claudio's patch or write a totally new one?
>
>
> I wrote a completely new patch. I tried multiple techniques, like using a
> list instead of a fixed-size array, which I thought was most suitable here,
> but left that out because of conflicts with Parallel Vacuum.
>
>
>> In either case, why did you choose that approach?
>
>
> This is the simplest technique. I just divided maintenance_work_mem into
> chunks and allocate the chunks as needed. This technique changes a minimal
> amount of code and does what we want to achieve.
>
>
>> If you wrote a totally new one, have you compared your work with
>> Claudio's, to see if he covered
>> anything you might need to cover?
>
>
> No, this part I missed, I will do that and will share my thoughts.
>
I checked the patch, and it does not do anything special that my patch is
not doing, except one thing: that patch also claims to increase the 1GB
limit, which I have not touched. In my case, we are still under the limit
of maintenance_work_mem but allocate the memory in chunks. That gives you
the leverage to set a big value of maintenance_work_mem (even if you don't
need it all), because it will not allocate all the memory at the start.
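
For illustration, a minimal sketch of the chunked-allocation idea described
above (hypothetical structure and names; not the actual patch):

#define DEAD_TUPLES_PER_CHUNK	(256 * 1024)	/* ItemPointers per chunk */

typedef struct DeadTupleChunk
{
	int				num_tuples;		/* used slots in this chunk */
	ItemPointerData	tuples[DEAD_TUPLES_PER_CHUNK];
} DeadTupleChunk;

typedef struct DeadTupleStore
{
	int				nchunks;		/* chunks allocated so far */
	int				maxchunks;		/* cap derived from maintenance_work_mem */
	DeadTupleChunk **chunks;		/* maxchunks pointer slots, palloc'd at start */
} DeadTupleStore;

static void
record_dead_tuple(DeadTupleStore *store, ItemPointer itemptr)
{
	DeadTupleChunk *chunk;

	/* Allocate a new chunk only when the current one is full. */
	if (store->nchunks == 0 ||
		store->chunks[store->nchunks - 1]->num_tuples >= DEAD_TUPLES_PER_CHUNK)
	{
		if (store->nchunks >= store->maxchunks)
			return;			/* at the cap; caller must vacuum indexes first */
		store->chunks[store->nchunks] = palloc(sizeof(DeadTupleChunk));
		store->chunks[store->nchunks]->num_tuples = 0;
		store->nchunks++;
	}

	chunk = store->chunks[store->nchunks - 1];
	chunk->tuples[chunk->num_tuples++] = *itemptr;
}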

Secondly, that patch refactors the whole area of code, which makes it
larger than expected; its code changes are almost double those of my
patch. By the way, I have now taken the test cases from that patch and
included them in my patch (credit: Claudio).

 Please explain why your patch is
>
>> better/different than his.
>>
>>
>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>
> --
> Ibrar Ahmed
>


-- 
Ibrar Ahmed


vacuum_dyn_mem_ibrar_v2.patch
Description: Binary data


Re: Make autovacuum sort tables in descending order of xid_age

2019-12-12 Thread David Fetter
On Thu, Dec 12, 2019 at 08:02:25AM -0800, Mark Dilger wrote:
> On 11/30/19 2:23 PM, David Fetter wrote:
> > On Sat, Nov 30, 2019 at 10:04:07AM -0800, Mark Dilger wrote:
> > > On 11/29/19 2:21 PM, David Fetter wrote:
> > > > On Fri, Nov 29, 2019 at 07:01:39PM +0100, David Fetter wrote:
> > > > > Folks,
> > > > > 
> > > > > Per a suggestion Christophe made, please find attached a patch to
> > > > > $Subject:
> > > > > 
> > > > > Apart from carefully fudging with pg_resetwal, and short of running in
> > > > > production for a few weeks, what would be some good ways to test this?
> > > > 
> > > > Per discussion on IRC with Sehrope Sarkuni, please find attached a
> > > > patch with one fewer bug, this one in the repalloc() calls.
> > > 
> > > Hello David,
> > > 
> > > Here are my initial thoughts.
> > > 
> > > Although you appear to be tackling the problem of vacuuming tables
> > > with older Xids first *per database*,
> > 
> > Yes, that's what's come up for me in production, but lately,
> > production has consisted of a single active DB maxing out hardware. I
> > can see how in other situations--multi-tenant, especially--it would
> > make more sense to sort the DBs first.
> 
> I notice you don't address that in your latest patch.  Do you have
> any thoughts on whether that needs to be handled in this patch?

My thought is that it doesn't.

> > > I have not tested this change, but I may do so later today or perhaps
> > > on Monday.
> 
> The code compiles cleanly and passes all regression tests, but I don't
> think those tests really cover what you are changing.  Have you been
> using any test framework for this?

I don't have one :/

> I wonder if you might add information about table size, table changes,
> and bloat to your RelFrozenXidAge struct and modify rfxa_comparator to
> use a heuristic to cost the (age, size, bloat, changed) grouping and
> sort on that cost, such that really large bloated tables with old xids
> might get vacuumed before smaller, less bloated tables that have
> even older xids. Sorting the tables based purely on xid_age seems to
> ignore other factors that are worth considering.  I do not have a
> formula for how those four factors should be weighted in the heuristic,
> but you are implicitly assigning three of them a weight of zero in
> your current patch.

I think it's vastly premature to come up with complex sorting systems
right now.  Just sorting in descending order of age should either have
or not have positive effects.

> relation_needs_vacanalyze currently checks the reltuples, n_dead_tuples
> and changes_since_analyze along with vac_scale_factor and
> anl_scale_factor for the relation, but only returns booleans dovacuum,
> doanalyze, and wraparound.

Yeah, I looked at that. It's for a vastly different purpose, namely
deciding what's an emergency and what's probably not, but needs
attention anyhow.  My goal was something a little finer-grained and, I
hope, a little easier to establish the (lack of) benefits because only
one thing is getting changed.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: allowing broader use of simplehash

2019-12-12 Thread Andres Freund
Hi,

On 2019-12-11 10:05:00 -0500, Robert Haas wrote:
> On Tue, Dec 10, 2019 at 4:59 PM Andres Freund  wrote:
> > > A significant problem in either case is that a simplehash wants to
> > > live in a memory context; no such thing exists either for data in
> > > shared memory nor in frontend code. However, it seems to be quite easy
> > > to provide a way for simplehash to be defined so that it doesn't care
> > > about memory contexts. See 0001.
> >
> > I wonder if we shouldn't instead just go for an "implicit" memory
> > context instead. It's a bit ugly to have a growing set of different
> > signatures.
> 
> I don't really know what you mean by this. I don't actually think the
> different signatures are a big deal. It affects a pretty limited
> number of functions, and that seems less ugly than trying to create
> some sort of dummy not-really-a-context object that can live in
> frontend code, and a lot easier than actually making contexts work in
> frontend code. The latter might be the better long-term solution, but
> I don't think we should insist on doing it first.

I was basically just thinking that we could pass the context to use via
CurrentMemoryContext, instead of explicitly passing it in.
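
Roughly like this (just a sketch -- "myhash" stands in for an arbitrary
simplehash instantiation, and a create signature without a context
argument is hypothetical, not existing code):

    MemoryContext oldcxt;
    myhash_hash *ht;

    /* caller picks the allocation context implicitly, via CurrentMemoryContext */
    oldcxt = MemoryContextSwitchTo(hash_table_context);
    ht = myhash_create(1024, NULL);     /* would palloc() in hash_table_context */
    MemoryContextSwitchTo(oldcxt);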


> Another way forward would be to replace the MemoryContext references
> with a void * that happens, in the case of the backend, to be a
> MemoryContext, and could be NULL when none is required. However, that
> would give up some type-checking for no current benefit. If simplehash
> becomes more widely used and at some point it's clear that this would
> be a net win, we can change it then.

Yea, that seems worse. I'd rather work on the MemoryContext
infrastructure being available for frontend code.

Greetings,

Andres Freund




Re: allowing broader use of simplehash

2019-12-12 Thread Andres Freund
Hi,

On 2019-12-11 10:50:16 -0500, Robert Haas wrote:
> On Tue, Dec 10, 2019 at 4:59 PM Andres Freund  wrote:
> > 3) For lots of one-off uses of hashtables that aren't performance
> >critical, we want a *simple* API. That IMO would mean that key/value
> >end up being separately allocated pointers, and that just a
> >comparator is provided when creating the hashtable.
> 
> I think the simplicity of the API is a key point. Some things that are
> bothersome about dynahash:
> 
> - It knows about memory contexts and insists on having its own.

Which is a waste, in a good number of cases.


> - You can't just use a hash table in shared memory; you have to
> "attach" to it first and have an object in backend-private memory.

I'm not quite sure there's all that good an alternative to this,
tbh. For efficiency it's useful to have backend-local state, I
think. And I don't really see how to have that without needing to attach.


> - The usual way of getting a shared hash table is ShmemInitHash(), but
> that means that the hash table has its own named chunk and that it's
> in the main shared memory segment. If you want to put it inside
> another chunk or put it in DSM or whatever, it doesn't work.

I don't think it's quite realistic for the same implementation - although
the code could partially be shared and just specialized for both cases -
to be used for DSM and "normal" shared memory. That's however not an
excuse to have drastically different interfaces for both.



> - It knows about LWLocks and if it's a shared table it needs its own
> tranche of them.
> - hash_search() is hard to wrap your head around.
>

> One thing I dislike about simplehash is that the #define-based
> interface is somewhat hard to use. It's not that it's a bad design.

I agree. It's the best I could come up with, taking the limitations of C
into account, when focusing on speed and type safety.  I really think this
type of hack is a stopgap measure, and we ought to upgrade to a subset
of C++.


> It's just you have to sit down and think for a while to figure out
> which things you need to #define in order to get it to do what you
> want. I'm not sure that's something that can or needs to be fixed, but
> it's something to consider. Even dynahash, as annoying as it is, is in
> some ways easier to get up and running.

I have been wondering about providing one simplehash wrapper in a
central place that uses simplehash to store a {key*, value*}, and has a
creation interface that just accepts a comparator. Plus a few wrapper
creation functions for specific types (e.g. string, oid, int64).  While
we'd not want to use that for really performance critical paths, for 80%
of the cases it'd be sufficient.
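
The interface could be as simple as something like this (entirely
hypothetical names, only meant to show the shape of such a wrapper; a
hash function would presumably have to accompany the comparator):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct pghash pghash;       /* opaque; simplehash underneath */

    typedef uint32_t (*pghash_hash_fn) (const void *key);
    typedef bool (*pghash_eq_fn) (const void *a, const void *b);

    extern pghash *pghash_create(pghash_hash_fn hash, pghash_eq_fn eq);
    extern void *pghash_lookup(pghash *ht, const void *key);
    extern bool pghash_insert(pghash *ht, void *key, void *value); /* false if key already present */
    extern bool pghash_delete(pghash *ht, const void *key);
    extern void pghash_destroy(pghash *ht);

    /* convenience constructors for common key types, as suggested above */
    extern pghash *pghash_create_string(void);
    extern pghash *pghash_create_oid(void);
    extern pghash *pghash_create_int64(void);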


Greetings,

Andres Freund




Re: global / super barriers (for checksums)

2019-12-12 Thread Andres Freund
Hi,

On 2019-12-11 13:35:26 -0500, Robert Haas wrote:
> While I have passionate philosophical feelings about this topic, for
> purposes of the present thread the really important question (IMV,
> anyway) is whether there's any way of getting a patch for global
> barriers committed in advance of the first user of such barriers.

Right. I think there is.


> If not, then I guess we'll need to decide which of checksum
> enable/disable and ALTER SYSTEM READ ONLY is going to go first and
> commit this only when that patch is ready to go as well. Or, I
> suppose, commit it with a dummy placeholder that then gets replaced by
> whichever patch goes first, but I'm not sure whether people would be
> OK with that.

I'd either add a test (if we have some) or some kind of placeholder
initially. But I'd also be ok with going for either of the other
versions directly - it just seems harder to tackle the patches together.

Greetings,

Andres Freund




Re: On disable_cost

2019-12-12 Thread Greg Stark
On Wed, 11 Dec 2019 at 01:24, Laurenz Albe  wrote:
>
> On Tue, 2019-12-10 at 15:50 -0700, Jim Finnerty wrote:
> > As a proof of concept, I hacked around a bit today to re-purpose one of the
> > bits of the Cost structure to mean "is_disabled" so that we can distinguish
> > 'disabled' from 'non-disabled' paths without making the Cost structure any
> > bigger.  In fact, it's still a valid double.  The obvious choice would have
> > been to re-purpose the sign bit, but I've had occasion to exploit negative
> > costs before so for this POC I used the high-order bit of the fractional
> > bits of the double.  (see Wikipedia for double precision floating point for
> > the layout).
> >
> > The idea is to set a special bit when disable_cost is added to a cost.
> > Dedicating multiple bits instead of just 1 would be easily done, but as it
> > is we can accumulate many disable_costs without overflowing, so just
> > comparing the cost suffices.
>
> Doesn't that rely on a specific implementation of double precision (IEEE)?
> I thought that we don't want to limit ourselves to platforms with IEEE floats.

We could always implement it again in another format

However, I wouldn't have expected to be bit twiddling. I would have
expected to use standard functions like ldexp to do this. In fact I
think if you use the high bit of the exponent you could do it entirely
using ldexp and regular double comparisons (with fabs).

Ie, to set the bit you set cost = ldexp(cost, __DBL_MAX_EXP__/2). And
to check for the bit being set you compare ilogb(cost) against
__DBL_MAX_EXP__/2. Hm, that doesn't handle it if the cost is already < 1
in which case I guess you would have to set it to 1 first. Or reserve
the two high bits of the cost so you can represent disabled values
that had negative exponents before being disabled.
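
Something along these lines, roughly (a sketch only, assuming IEEE-style
doubles where DBL_MAX_EXP is 1024; the helper names are invented, not
planner API):

    #include <math.h>
    #include <float.h>
    #include <stdbool.h>

    #define DISABLE_EXP     (DBL_MAX_EXP / 2)       /* 512 for IEEE doubles */

    static double
    cost_mark_disabled(double cost)
    {
        if (cost < 1.0)
            cost = 1.0;                     /* keep the exponent test unambiguous */
        return ldexp(cost, DISABLE_EXP);    /* multiply by 2^512 */
    }

    static bool
    cost_is_disabled(double cost)
    {
        /* ordinary plan costs have exponents nowhere near DBL_MAX_EXP/2 */
        return ilogb(cost) >= DISABLE_EXP;
    }

    static double
    cost_without_disabled_bit(double cost)
    {
        return cost_is_disabled(cost) ? ldexp(cost, -DISABLE_EXP) : cost;
    }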

I wonder if it wouldn't be a lot cleaner and more flexible to just go
with a plain float for Cost and use the other 32 bits for counters and
bitmasks and still be ahead of the game. A double can store 2^1024 but
a float 2^128 which still feels like it should be more than enough to
store the kinds of costs plans have without the disabled costs. 2^128
milliseconds is still 10^28 years which is an awfully expensive
query
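
The layout might look something like this (purely illustrative of the
idea, not a proposal for the actual Path/Cost structs):

    #include <stdint.h>

    typedef struct PathCost
    {
        float       cost;           /* up to ~2^128: ample for real plan costs */
        uint16_t    disabled_count; /* how many disabled nodes contributed */
        uint16_t    flags;          /* spare bits for other planner state */
    } PathCost;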

-- 
greg




MSYS2 support

2019-12-12 Thread Peter Eisentraut
There were a number of recent threads about building PostgreSQL on 
MSYS2.  This has been confusing on occasion; see for example [0].  MSYS2 
is actually a derivative of Cygwin.  What most people are actually doing 
is using MSYS2 as the host environment for doing a kind of 
cross-compilation to MinGW.


You can also build natively on MSYS2, using the existing Cygwin support. 
 Except that it won't work because configure doesn't recognize the 
config.guess output.  Attached are a couple of small patches to fix that 
up.  The first patch fixes configure as explained.  The second patch 
fixes some warnings in ps_status.c.  It's curious why the existing build 
farm members don't issue warnings there, but maybe their compilers are 
too old.  The third patch fixes another warning; again, not sure why 
original Cygwin doesn't warn.  It might be a bit too broad to apply like 
that.


MSYS2 doesn't ship with cygserver AFAICT, so you can't run a PostgreSQL 
server, but everything else should work.



[0]: 
https://www.postgresql.org/message-id/6672cebd-0c07-ce1e-36f8-6ae82c496...@2ndquadrant.com


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 676e1392705e017cc5ae17e0f35952f8a1405eab Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 18 Aug 2019 00:17:16 +0200
Subject: [PATCH 1/3] Add support for MSYS2

It's basically a variant of Cygwin, so use that template.
---
 configure| 12 +---
 configure.in | 12 +---
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/configure b/configure
index 3d9bd0bdf8..7f86a7e5ff 100755
--- a/configure
+++ b/configure
@@ -2946,7 +2946,7 @@ else
 
 case $host_os in
  aix*) template=aix ;;
-  cygwin*) template=cygwin ;;
+  cygwin*|msys*) template=cygwin ;;
   darwin*) template=darwin ;;
 dragonfly*) template=netbsd ;;
  freebsd*) template=freebsd ;;
@@ -15843,24 +15843,22 @@ fi
 
 
 
-case $host_os in
+if test "$PORTNAME" = "win32" -o "$PORTNAME" = "cygwin"; then
# Cygwin and (apparently, based on test results) Mingw both
# have a broken strtof(), so substitute the same replacement
# code we use with VS2013. That's not a perfect fix, since
# (unlike with VS2013) it doesn't avoid double-rounding, but
# we have no better options. To get that, though, we have to
# force the file to be compiled despite HAVE_STRTOF.
-   mingw*|cygwin*)
-   case " $LIBOBJS " in
+   case " $LIBOBJS " in
   *" strtof.$ac_objext "* ) ;;
   *) LIBOBJS="$LIBOBJS strtof.$ac_objext"
  ;;
 esac
 
-   { $as_echo "$as_me:${as_lineno-$LINENO}: On $host_os we will 
use our strtof wrapper." >&5
+   { $as_echo "$as_me:${as_lineno-$LINENO}: On $host_os we will use our 
strtof wrapper." >&5
 $as_echo "$as_me: On $host_os we will use our strtof wrapper." >&6;}
-   ;;
-esac
+fi
 
 case $host_os in
 
diff --git a/configure.in b/configure.in
index 34aea0b4ac..797a9db01d 100644
--- a/configure.in
+++ b/configure.in
@@ -59,7 +59,7 @@ PGAC_ARG_REQ(with, template, [NAME], [override operating 
system template],
 
 case $host_os in
  aix*) template=aix ;;
-  cygwin*) template=cygwin ;;
+  cygwin*|msys*) template=cygwin ;;
   darwin*) template=darwin ;;
 dragonfly*) template=netbsd ;;
  freebsd*) template=freebsd ;;
@@ -1743,18 +1743,16 @@ AC_REPLACE_FUNCS(m4_normalize([
strtof
 ]))
 
-case $host_os in
+if test "$PORTNAME" = "win32" -o "$PORTNAME" = "cygwin"; then
# Cygwin and (apparently, based on test results) Mingw both
# have a broken strtof(), so substitute the same replacement
# code we use with VS2013. That's not a perfect fix, since
# (unlike with VS2013) it doesn't avoid double-rounding, but
# we have no better options. To get that, though, we have to
# force the file to be compiled despite HAVE_STRTOF.
-   mingw*|cygwin*)
-   AC_LIBOBJ([strtof])
-   AC_MSG_NOTICE([On $host_os we will use our strtof wrapper.])
-   ;;
-esac
+   AC_LIBOBJ([strtof])
+   AC_MSG_NOTICE([On $host_os we will use our strtof wrapper.])
+fi
 
 case $host_os in
 
-- 
2.24.0

From bf5be8c29babcd86e12d9f3acfb87d391d61e187 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 11 Dec 2019 08:52:50 +0100
Subject: [PATCH 2/3] Fix compiler warnings

The PS_USE_NONE case left a couple of unused variables exposed.
---
 src/backend/utils/misc/ps_status.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/src/backend/utils/misc/ps_status.c 
b/src/backend/utils/misc/ps_status.c
index 6c851dd498..288d71d4d2 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -89,6 +89,8 @@ bool  update_process_title = true;
 #endif
 
 
+#ifndef PS_USE_NONE
+
 #ifndef PS_USE_CLOBBER_ARGV
 /* all but one option need a buffer to write their ps line in */
 #define PS_BUFFER_SIZE 256
@@ -104,6 +106,8 @@ static

What constrains the range of SERIALIZABLEXACT xmin values?

2019-12-12 Thread Thomas Munro
Hi,

Every SERIALIZABLEXACT holds an xmin that comes from a Snapshot, and
SxactGlobalXmin holds the oldest of them.  But a SERIALIZABLEXACT can
live longer than a transaction and snapshot, so now I'm wondering if
it's possible to reach a state where there exist SERIALIZABLEXACT
objects with a range of xmin values that break the modular
TransactionIdPrecedes()-based logic in SetNewSxactGlobalXmin(), which
relies on the set of values not spanning more than half of the 2^32
clock.  If that state is reachable, then I think the effect would be
to call ClearOldPredicateLocks() at the wrong times (too much and
we'll waste CPU in that function, not enough and we'll "leak"
predicate locks by not cleaning them up as eagerly as we intended).
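
For reference, the modular comparison in question boils down to roughly
this (a simplified sketch; the real TransactionIdPrecedes() also
special-cases permanent XIDs):

    #include <stdint.h>
    #include <stdbool.h>

    static bool
    xid_precedes(uint32_t id1, uint32_t id2)
    {
        int32_t     diff = (int32_t) (id1 - id2);

        return diff < 0;
    }

So xid_precedes(0xFFFFFFF0, 10) is true because of wraparound, but once
the live xmins span more than 2^31 of the clock, pairs of values start
comparing inconsistently and "the oldest" is no longer well defined.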




Re: Make autovacuum sort tables in descending order of xid_age

2019-12-12 Thread Mark Dilger




On 12/12/19 11:26 AM, David Fetter wrote:

On Thu, Dec 12, 2019 at 08:02:25AM -0800, Mark Dilger wrote:

On 11/30/19 2:23 PM, David Fetter wrote:

On Sat, Nov 30, 2019 at 10:04:07AM -0800, Mark Dilger wrote:

On 11/29/19 2:21 PM, David Fetter wrote:

On Fri, Nov 29, 2019 at 07:01:39PM +0100, David Fetter wrote:

Folks,

Per a suggestion Christophe made, please find attached a patch to
$Subject:

Apart from carefully fudging with pg_resetwal, and short of running in
production for a few weeks, what would be some good ways to test this?


Per discussion on IRC with Sehrope Sarkuni, please find attached a
patch with one fewer bug, this one in the repalloc() calls.


Hello David,

Here are my initial thoughts.

Although you appear to be tackling the problem of vacuuming tables
with older Xids first *per database*,


Yes, that's what's come up for me in production, but lately,
production has consisted of a single active DB maxing out hardware. I
can see how in other situations--multi-tenant, especially--it would
make more sense to sort the DBs first.


I notice you don't address that in your latest patch.  Do you have
any thoughts on whether that needs to be handled in this patch?


My thought is that it doesn't.


I can live with that for now.  I'd like the design to be compatible with
revisiting that in a subsequent patch.


I have not tested this change, but I may do so later today or perhaps
on Monday.


The code compiles cleanly and passes all regression tests, but I don't
think those tests really cover what you are changing.  Have you been
using any test framework for this?


I don't have one :/


We need to get that fixed.


I wonder if you might add information about table size, table changes,
and bloat to your RelFrozenXidAge struct and modify rfxa_comparator to
use a heuristic to cost the (age, size, bloat, changed) grouping and
sort on that cost, such that really large bloated tables with old xids
might get vacuumed before smaller, less bloated tables that have
even older xids. Sorting the tables based purely on xid_age seems to
ignore other factors that are worth considering.  I do not have a
formula for how those four factors should be weighted in the heuristic,
but you are implicitly assigning three of them a weight of zero in
your current patch.


I think it's vastly premature to come up with complex sorting systems
right now.  Just sorting in descending order of age should either have
or not have positive effects.


I hear what you are saying, but I'm going to argue the other side.

  Let C = 1.00000002065
  Let x = xid_age for a table
  Let v = clamp(n_dead_tuples / reltuples*2) to max 0.5
  Let a = clamp(changes_since_analyze / reltuples) to max 0.5

  Let score = C**x + v + a

With x = 1 million=> C**x = 1.02
 x = 200 million  => C**x = 62.2
 x = 2**32=> C**x = FLT_MAX - delta

The maximum contribution to the score that n_dead_tuples and
changes_since_analyze can make is 1.0.  Once the xid age reaches one
million, it will start to be the dominant factor.  By the time it
reaches the default value of 200 million for freeze_max_age it is
far and away the dominant factor, and the xid age of one table vs.
another never overflows FLT_MAX given that 2**32 is the largest
xid age your current system can store in the uint32 you are using.

The computed score is a 32 bit float, which takes no more memory
to store than the xid_age field you are storing.  So storing the
score rather than the xid age is memory-wise equivalent to your
patch.

I doubt the computation time for the exponential is relevant
compared to the n*log(n) average sorting time of the quicksort.
It is even less relevant compared to the time it takes to vacuum
the tables.  I doubt my proposal has a measurable run-time impact.
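
To be concrete, the scoring could be packaged up roughly like this (an
illustrative sketch only, not the patch; it reads reltuples*2 as
2 * reltuples, and the struct fields are placeholders):

    #include <math.h>
    #include <stdint.h>

    typedef struct
    {
        uint32_t    xid_age;
        double      reltuples;
        double      n_dead_tuples;
        double      changes_since_analyze;
        float       score;
    } RelVacScore;

    /* chosen so that C**x reaches roughly FLT_MAX at x = 2**32 */
    #define XID_AGE_BASE    1.00000002065

    static double
    clampmax(double x, double maxval)
    {
        return x > maxval ? maxval : x;
    }

    static float
    compute_vac_score(const RelVacScore *r)
    {
        double      v = 0.0;
        double      a = 0.0;

        if (r->reltuples > 0)
        {
            v = clampmax(r->n_dead_tuples / (2.0 * r->reltuples), 0.5);
            a = clampmax(r->changes_since_analyze / r->reltuples, 0.5);
        }

        /* the exponential term dominates once xid_age reaches ~100M+ */
        return (float) (pow(XID_AGE_BASE, (double) r->xid_age) + v + a);
    }

    /* qsort comparator: highest score (most urgent) first */
    static int
    vac_score_cmp(const void *p1, const void *p2)
    {
        float       s1 = ((const RelVacScore *) p1)->score;
        float       s2 = ((const RelVacScore *) p2)->score;

        return (s1 < s2) - (s1 > s2);
    }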

On the upside, if you have a database with autovacuum configured
aggressively, you can get the tables with the most need vacuumed
first, with need computed relative to vac_scale_factor and
anl_scale_factor, which helps for a different use case than yours.
The xid age problem might not exist for databases where autovacuum
has enough resources to never fall behind.  Those databases will
have other priorities for where autovacuum spends its time.

I'm imagining coming back with two patches later, one that does
something more about choosing which database to vacuum first, and
another that recomputes which table to vacuum next when a worker
finishes vacuuming a table.  These combined could help keep tables
that are sensitive to statistics changes vacuumed more frequently
than others.


relation_needs_vacanalyze currently checks the reltuples, n_dead_tuples
and changes_since_analyze along with vac_scale_factor and
anl_scale_factor for the relation, but only returns booleans dovacuum,
doanalyze, and wraparound.


Yeah, I looked at that. It's for a vastly different purpose, namely
deciding what's an emergency and what's probably not, but needs
attention anyhow.  My goal was something a little finer-grained and, I
hope, a little easier to establish the (lack of) benefits because only
one thing is getting changed.

Re: What constrains the range of SERIALIZABLEXACT xmin values?

2019-12-12 Thread Andres Freund
Hi,

On 2019-12-13 10:30:19 +1300, Thomas Munro wrote:
> Every SERIALIZABLEXACT holds an xmin that comes from a Snapshot, and
> SxactGlobalXmin holds the oldest of them.  But a SERIALIZABLEXACT can
> live longer than a transaction and snapshot, so now I'm wondering if
> it's possible to reach a state where there exist SERIALIZABLEXACT
> objects with a range of xmin values that break the modular
> TransactionIdPrecedes()-based logic in SetNewSxactGlobalXmin(), which
> relies on the set of values not spanning more than half of the 2^32
> clock.  If that state is reachable, then I think the effect would be
> to call ClearOldPredicateLocks() at the wrong times (too much and
> we'll waste CPU in that function, not enough and we'll "leak"
> predicate locks by not cleaning them up as eagerly as we intended).

I have only a weak grasp of the details of serializable, so maybe I'm
entirely off base here: Can there actually be SERIALIZABLEXACT entries
with xmins that don't also exist in the table? I'd think that the fact
that rows with the relevant xmins won't commonly be removable would
possibly provide enough interlock?

Greetings,

Andres Freund




Re: Make autovacuum sort tables in descending order of xid_age

2019-12-12 Thread Mark Dilger




On 12/12/19 1:35 PM, Mark Dilger wrote:

   Let C = 1.00000002065
   Let x = xid_age for a table
   Let v = clamp(n_dead_tuples / reltuples*2) to max 0.5
   Let a = clamp(changes_since_analyze / reltuples) to max 0.5

   Let score = C**x + v + a


I should hasten to add that this is just a proof of concept
formula, not one that I'm specifically advocating.  The point
is that we can devise a scoring system in which the xid age
is the dominant factor whenever it could possibly matter,
while still letting other factors prevail when xid age is
of little consequence.

--
Mark Dilger




Re: Remove configure --disable-float4-byval and --disable-float8-byval

2019-12-12 Thread Peter Geoghegan
On Fri, Nov 1, 2019 at 1:19 PM Robert Haas  wrote:
> On Fri, Nov 1, 2019 at 3:15 PM Peter Geoghegan  wrote:
> > I don't think that those two things are equivalent at all. There may
> > even be workloads that will benefit when run on 32-bit hardware.
> > Having to palloc() and pfree() with 8 byte integers is probably very
> > slow.
>
> Yeah! I mean, users who are using only 4-byte or smaller pass-by-value
> quantities will be harmed, especially in cases where they are storing
> a lot of them at the same time (e.g. sorting) and especially if they
> double their space consumption and run out of their very limited
> supply of memory.

Apparently Linux has almost no upstream resources for testing 32-bit
x86, and it shows:

https://lwn.net/ml/oss-security/calcetrw1z0gclfjz-1jwj_wct3+axxkp_wocxy8jkbslzv8...@mail.gmail.com/

I think that this kind of thing argues for minimizing the amount of
code that can only be tested on a small minority of the computers that
are in general use today. If no Postgres hacker regularly runs the
code, then its chances of having bugs are far higher. Having coverage
in the buildfarm certainly helps, but it's no substitute.

Sticking with the !USE_FLOAT8_BYVAL example, it's easy to imagine
somebody forgetting to add a !USE_FLOAT8_BYVAL block, that contains
the required pfree(). Now you have a memory leak that only affects a
small minority of platforms. How likely is it that buildfarm coverage
will help somebody detect that problem?

-- 
Peter Geoghegan




Re: Make autovacuum sort tables in descending order of xid_age

2019-12-12 Thread Mark Dilger




On 12/12/19 1:35 PM, Mark Dilger wrote:

Once the xid age reaches one
million, it will start to be the dominant factor.


Actually, it doesn't change much from x = 1 to x = 1,000,000
but I was planning to add another factor to the formula and
forgot before sending the email.  I'll leave that as an
exercise for the reader.

--
Mark Dilger




Re: log bind parameter values on error

2019-12-12 Thread Andres Freund
Hi,

I see that you pushed this patch. I'm unfortunately unhappy with the
approach taken.  As previously said, handling a lot of this from exec_*
is a mistake in my opinion. Pretty much the same problem exists for
parametrized query execution from other contexts, e.g. for queries
executed inside plpgsql. By putting the responsibility to manage error
contexts etc from with exec_*, we'd need to add a copy of that to pretty
much every place executing queries. This should all be in the portal
code, with an optimization for computing the parameter values from
strings inside postgres.c, when the input is text.

Greetings,

Andres Freund




Re: log bind parameter values on error

2019-12-12 Thread Alvaro Herrera
Hello

On 2019-Dec-12, Andres Freund wrote:

> I see that you pushed this patch.

Yeah, a version of it -- significantly different from what Alexey
submitted last.

> I'm unfortunately unhappy with the
> approach taken.  As previously said, handling a lot of this from exec_*
> is a mistake in my opinion. Pretty much the same problem exists for
> parametrized query execution from other contexts, e.g. for queries
> executed inside plpgsql. By putting the responsibility to manage error
> contexts etc from with exec_*, we'd need to add a copy of that to pretty
> much every place executing queries. This should all be in the portal
> code, with an optimization for computing the parameter values from
> strings inside postgres.c, when the input is text.

Hmm.  I think there are two pieces of interest for this feature.  One
saves a text version of the params; the other is setting the error
context that reports them.  I agree that possibly it would work to have
PortalStart be in charge of the former.

However, AFAICS the error context must be set up in each place that's
executing the query; I don't think it would work to try to hack it in
SPI_cursor_open_internal, for example which is what plpgsql uses; rather
it has to be in plpgsql/pl_exec.c itself.  (For example, you cannot even
put it directly in exec_dynquery_with_params -- it has to be in the
callers of that routine, and surround all code until after
exec_for_query.)

I think the current state of affairs is already a good improvement over
what was there before.  Can it be improved?  Sure.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Corruption with duplicate primary key

2019-12-12 Thread Tomas Vondra

On Wed, Dec 11, 2019 at 11:46:40PM +, Alex Adriaanse wrote:

On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:

At first I thought maybe this might be due to collations changing and
breaking the index silently. What collation are you using?


We're using en_US.utf8. We did not make any collation changes to my
knowledge.



Well, the idea was more that glibc got updated and the collations
changed because of that (without PostgreSQL having a chance to even
notice that).


1) When you do the queries, do they use index scan or sequential
scan?  Perhaps it does sequential scan, and if you force index scan
(e.g. by rewriting the query) it'll only find one of those rows.


By default it used an index scan. When I re-ran the query today (and
confirmed that the query used an index only scan) I did not see any
duplicates. If I force a sequential scan using "SET
enable_index[only]scan = false" the duplicates reappear.



Hmmm, that's probably a sign of some sort of index corruption. Clearly,
when a row can't be found through an index, it's invisible to code
enforcing the unique constraint (relying on the index).


However, using a backup from a week ago I see duplicates in both the
query that uses an index only scan as well as the query that uses the
sequential scan. So somehow over the past week the index got changed to
eliminate duplicates.



Hmmm, that's interesting ... and confusing.

The good thing is that this is not an upgrade issue, because there was
no corruption right after the upgrade.

But then apparently the corruption appeared, and then disappeared for
some unknown reason, but only from the index. Puzzling.


2) Can you check in backups if this data corruption was present in
the PG10 cluster, before running pg_upgrade?


Sure. I just checked and did not see any corruption in the PG10
pre-upgrade backup. I also re-upgraded that PG10 backup to PG12, and
right after the upgrade I did not see any corruption either. I checked
using both index scans and sequential scans.



OK, thanks. That's a valuable piece of information.

How active is the system and can you do PITR? That is, can you try
restoring it to different points in time by replaying WAL? Then we
could narrow down when the corruption appeared and inspect the WAL
from that period.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: tuplesort test coverage

2019-12-12 Thread Andres Freund
Hi,

On 2019-12-12 09:27:04 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I pushed this now. We'll see what the slower buildfarm animals say. I'll
> > try to see how long they took in a few days.
> 
> friarbird (a CLOBBER_CACHE_ALWAYS animal) just showed a failure in this:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=friarbird&dt=2019-12-12%2006%3A20%3A02
> 
> == pgsql.build/src/test/regress/regression.diffs 
> ===
> diff -U3 
> /pgbuild/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/tuplesort.out
>  /pgbuild/root/HEAD/pgsql.build/src/test/regress/results/tuplesort.out
> --- 
> /pgbuild/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/tuplesort.out
>2019-11-13 19:54:11.0 -0500
> +++ /pgbuild/root/HEAD/pgsql.build/src/test/regress/results/tuplesort.out 
> 2019-12-12 08:25:23.0 -0500
> @@ -625,13 +625,13 @@
> Group Key: a.col12
> Filter: (count(*) > 1)
> ->  Merge Join
> - Merge Cond: (a.col12 = b.col12)
> - ->  Sort
> -   Sort Key: a.col12 DESC
> -   ->  Seq Scan on test_mark_restore a
> + Merge Cond: (b.col12 = a.col12)
>   ->  Sort
> Sort Key: b.col12 DESC
> ->  Seq Scan on test_mark_restore b
> + ->  Sort
> +   Sort Key: a.col12 DESC
> +   ->  Seq Scan on test_mark_restore a
>  (14 rows)
>  
>  :qry;
> 
> Since a and b are exactly the same table, in principle it's a matter of
> chance which one the planner will put on the outside of the join.

Yea.


> I think what happened here is that the test ran long enough for
> autovacuum/autoanalyze to come along and scan the table, changing its
> stats in between where the planner picked up the stats for a and those
> for b, and we ended up making the opposite join order choice.

Sounds reasonable.


> What seems like a simpler and more reliable fix is to make
> test_mark_restore a temp table, thus keeping autovac away from it.
> Is there a reason in terms of the test's goals not to do that?

I can't see any reason. The sorting code shouldn't care about the source
of tuples. I guess there could at some point be tests for parallel
sorting, but that'd just use a different table.


> Also ... why in the world does the script drop its tables at the end
> with IF EXISTS?  They'd better exist at that point.  I object
> to the DROP IF EXISTS up at the top, too.  The regression tests
> do not need to be designed to deal with an unpredictable start state,
> and coding them to do so can have no effect other than possibly
> masking problems.

Well, it makes it a heck of a lot easier to run tests in isolation while
evolving them. While I personally think it's good to leave cleanup for
partial states in for cases where it was helpful during development, I
also don't care about it strongly.

Greetings,

Andres Freund




Errors "failed to construct the join relation" and "failed to build any 2-way joins"

2019-12-12 Thread Will Leinweber
On 12.1, fresh initdb the following query gives me the error
"ERROR:  failed to construct the join relation"

  SELECT FROM (
SELECT FROM pg_catalog.pg_stat_bgwriter AS ref_0
LEFT JOIN pg_catalog.pg_stat_bgwriter AS ref_1 ON (true), LATERAL (
  SELECT FROM pg_catalog.pg_publication AS ref_2, LATERAL (
SELECT FROM pg_catalog.pg_class
WHERE ref_1.buffers_alloc IS NOT NULL
  ) AS subq_0
  WHERE true
  LIMIT 1
) AS subq_1
WHERE true
  ) AS subq_2

If you move the limit up into subq_0, then the error changes to
"ERROR:  failed to build any 2-way joins"

  SELECT FROM (
SELECT FROM pg_catalog.pg_stat_bgwriter AS ref_0
LEFT JOIN pg_catalog.pg_stat_bgwriter AS ref_1 ON (true), LATERAL (
  SELECT FROM pg_catalog.pg_publication AS ref_2, LATERAL (
SELECT FROM pg_catalog.pg_class
WHERE ref_1.buffers_alloc IS NOT NULL
LIMIT 1
  ) AS subq_0
  WHERE true
) AS subq_1
WHERE true
  ) AS subq_2

I'm unable to reproduce either of the errors on 11.6 or 11.4. I haven't tried
any other versions. The actual value of the limit doesn't appear to matter,
just if it's present or not.

— Will




Re: shared tempfile was not removed on statement_timeout (unreproducible)

2019-12-12 Thread Thomas Munro
On Fri, Dec 13, 2019 at 7:05 AM Justin Pryzby  wrote:
> I have a nagios check on ancient tempfiles, intended to catch debris left by
> crashed processes.  But triggered on this file:
>
> $ sudo find /var/lib/pgsql/12/data/base/pgsql_tmp -ls
> 1429774 drwxr-x---   3 postgres postgres 4096 Dec 12 11:32 
> /var/lib/pgsql/12/data/base/pgsql_tmp
> 1698684 drwxr-x---   2 postgres postgres 4096 Dec  7 01:35 
> /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset
> 169347 5492 -rw-r-   1 postgres postgres  5619712 Dec  7 01:35 
> /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/0.0
> 169346 5380 -rw-r-   1 postgres postgres  5505024 Dec  7 01:35 
> /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/1.0
>
> I found:
>  2019-12-07 01:35:56 | 11025 | postgres | canceling statement due to 
> statement timeout   | CLUSTER 
> pg_stat_database_snap USI
>  2019-12-07 01:35:56 | 11025 | postgres | temporary file: path 
> "base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/2.0", size 5455872 | CLUSTER 
> pg_stat_database_snap USI

Hmm.  I played around with this and couldn't reproduce it, but I
thought of something.  What if the statement timeout is reached while
we're in here:

#0  PathNameDeleteTemporaryDir (dirname=0x7fffd010
"base/pgsql_tmp/pgsql_tmp28884.31.sharedfileset") at fd.c:1471
#1  0x00a32c77 in SharedFileSetDeleteAll (fileset=0x80182e2cc)
at sharedfileset.c:177
#2  0x00a327e1 in SharedFileSetOnDetach (segment=0x80a6e62d8,
datum=34385093324) at sharedfileset.c:206
#3  0x00a365ca in dsm_detach (seg=0x80a6e62d8) at dsm.c:684
#4  0x0061621b in DestroyParallelContext (pcxt=0x80a708f20) at
parallel.c:904
#5  0x005d97b3 in _bt_end_parallel (btleader=0x80fe9b4b0) at
nbtsort.c:1473
#6  0x005d92f0 in btbuild (heap=0x80a7bc4c8,
index=0x80a850a50, indexInfo=0x80fec1ab0) at nbtsort.c:340
#7  0x0067445b in index_build (heapRelation=0x80a7bc4c8,
indexRelation=0x80a850a50, indexInfo=0x80fec1ab0, isreindex=true,
parallel=true) at index.c:2963
#8  0x00677bd3 in reindex_index (indexId=16532,
skip_constraint_checks=true, persistence=112 'p', options=0) at
index.c:3591
#9  0x00678402 in reindex_relation (relid=16508, flags=18,
options=0) at index.c:3807
#10 0x0073928f in finish_heap_swap (OIDOldHeap=16508,
OIDNewHeap=16573, is_system_catalog=false,
swap_toast_by_content=false, check_constraints=false,
is_internal=true, frozenXid=604, cutoffMulti=1, newrelpersistence=112
'p') at cluster.c:1409
#11 0x007389ab in rebuild_relation (OldHeap=0x80a7bc4c8,
indexOid=16532, verbose=false) at cluster.c:622
#12 0x0073849e in cluster_rel (tableOid=16508, indexOid=16532,
options=0) at cluster.c:428
#13 0x00737f22 in cluster (stmt=0x800cfcbf0, isTopLevel=true)
at cluster.c:185
#14 0x00a7cc5c in standard_ProcessUtility (pstmt=0x800cfcf40,
queryString=0x800cfc120 "cluster t USING t_i_idx ;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x800cfd030, completionTag=0x7fffe0b0 "") at utility.c:654

The CHECK_FOR_INTERRUPTS() inside the walkdir() loop could ereport()
out of there after deleting some but not all of your files, but the
code in dsm_detach() has already popped the callback (which it does
"to avoid infinite error recursion"), so it won't run again on error
cleanup.  Hmm.  But then... maybe the two log lines you quoted should
be the other way around for that.

> Actually, I tried using pg_ls_tmpdir(), but it unconditionally masks
> non-regular files and thus shared filesets.  Maybe that's worth discussion on 
> a
> new thread ?
>
> src/backend/utils/adt/genfile.c
> /* Ignore anything but regular files */
> if (!S_ISREG(attrib.st_mode))
> continue;

+1, that's worth fixing.




Re: Memory-Bounded Hash Aggregation

2019-12-12 Thread Jeff Davis
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
> 13) As for this:
> 
>  /* make sure that we don't exhaust the hash bits */
>  if (partition_bits + input_bits >= 32)
>  partition_bits = 32 - input_bits;
> 
> We already ran into this issue (exhausting bits in a hash value) in
> hashjoin batching, we should be careful to use the same approach in
> both
> places (not the same code, just general approach).

I assume you're talking about ExecHashIncreaseNumBatches(), and in
particular, commit 8442317b. But that's a 10-year-old commit, so
perhaps you're talking about something else?

It looks like that code in HJ is protecting against having a very large
number of batches, such that we can't allocate an array of pointers for
each batch. And it seems like the concern is more related to a planner
error causing such a large nbatch.

I don't quite see the analogous case in HashAgg. npartitions is already
constrained to a maximum of 256. And the batches are individually
allocated, held in a list, not an array.

It could perhaps use some defensive programming to make sure that we
don't run into problems if the max is set very high.

Can you clarify what you're looking for here?

Perhaps I can also add a comment saying that we can have less than
HASH_MIN_PARTITIONS when running out of bits.

Regards,
Jeff Davis






Re: shared tempfile was not removed on statement_timeout (unreproducible)

2019-12-12 Thread Justin Pryzby
On Fri, Dec 13, 2019 at 03:03:47PM +1300, Thomas Munro wrote:
> On Fri, Dec 13, 2019 at 7:05 AM Justin Pryzby  wrote:
> > I have a nagios check on ancient tempfiles, intended to catch debris left by
> > crashed processes.  But triggered on this file:
> >
> > $ sudo find /var/lib/pgsql/12/data/base/pgsql_tmp -ls
> > 1429774 drwxr-x---   3 postgres postgres 4096 Dec 12 11:32 
> > /var/lib/pgsql/12/data/base/pgsql_tmp
> > 1698684 drwxr-x---   2 postgres postgres 4096 Dec  7 01:35 
> > /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset
> > 169347 5492 -rw-r-   1 postgres postgres  5619712 Dec  7 01:35 
> > /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/0.0
> > 169346 5380 -rw-r-   1 postgres postgres  5505024 Dec  7 01:35 
> > /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/1.0
> >
> > I found:
> >  2019-12-07 01:35:56 | 11025 | postgres | canceling statement due to 
> > statement timeout   | CLUSTER 
> > pg_stat_database_snap USI
> >  2019-12-07 01:35:56 | 11025 | postgres | temporary file: path 
> > "base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/2.0", size 5455872 | CLUSTER 
> > pg_stat_database_snap USI
> 
> Hmm.  But then... maybe the two log lines you quoted should
> be the other way around for that.

And, it's actually the other way around, when I order BY something better than
left(log_time::text,19).

postgres=# SELECT log_time pid, session_line ln, pid, database db, 
left(message,99), left(query,33) FROM jrn_postgres_log WHERE user_name IS NOT 
NULL AND log_time BETWEEN '2019-12-07 01:35' AND '2019-12-07 01:36' ORDER BY 
1,2;

 2019-12-07 01:35:56.626-06 |  1 | 11025 | postgres | temporary file: path 
"base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/2.0", size 5455872 | CLUSTER 
pg_stat_database_snap USI
 2019-12-07 01:35:56.626-06 |  2 | 11025 | postgres | canceling statement due 
to statement timeout   | CLUSTER 
pg_stat_database_snap USI




Re: Wrong assert in TransactionGroupUpdateXidStatus

2019-12-12 Thread Amit Kapila
On Thu, Dec 12, 2019 at 8:44 PM Robert Haas  wrote:
>
> On Tue, Dec 10, 2019 at 4:32 PM Andres Freund  wrote:
> > and then, within an if():
> >
> > /*
> >  * We don't try to do group update optimization if a 
> > process has
> >  * overflowed the subxids array in its PGPROC, since in 
> > that case we
> >  * don't have a complete list of XIDs for it.
> >  */
> > Assert(THRESHOLD_SUBTRANS_CLOG_OPT <= 
> > PGPROC_MAX_CACHED_SUBXIDS);
> >
> > Even if these weren't redundant, it can't make sense to test such a
> > static condition only within an if? Is it possible this was actually
> > intended to test something different?
>
> Based on the comment, I imagine it might've been intended to read
> Assert(nsubxids <= PGPROC_MAX_CACHED_SUBXIDS).
>

Do you think we need such an Assert after having StaticAssert for
(THRESHOLD_SUBTRANS_CLOG_OPT <= PGPROC_MAX_CACHED_SUBXIDS)  and then
an if statement containing (nsubxids <= THRESHOLD_SUBTRANS_CLOG_OPT)
just before this Assert?  Sure, we can keep this for extra safety, but
I don't see the need for it.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: shared tempfile was not removed on statement_timeout (unreproducible)

2019-12-12 Thread Thomas Munro
On Fri, Dec 13, 2019 at 3:13 PM Justin Pryzby  wrote:
> On Fri, Dec 13, 2019 at 03:03:47PM +1300, Thomas Munro wrote:
> > On Fri, Dec 13, 2019 at 7:05 AM Justin Pryzby  wrote:
> > >  2019-12-07 01:35:56 | 11025 | postgres | canceling statement due to 
> > > statement timeout   | CLUSTER 
> > > pg_stat_database_snap USI
> > >  2019-12-07 01:35:56 | 11025 | postgres | temporary file: path 
> > > "base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/2.0", size 5455872 | 
> > > CLUSTER pg_stat_database_snap USI
> >
> > Hmm.  But then... maybe the two log lines you quoted should
> > be the other way around for that.
>
> And, it's actually the other way around, when I order BY something better than
> left(log_time::text,19).

Hah.

Ok, so it looks like we shouldn't be relying on the same code path for
'happy' and 'error' cleanup.  This could probably be fixed with a well
placed explicit call to SharedFileSetDeleteAll() or a new function
SharedFileSetDestroy(), and perhaps a flag in shmem to say it's been
done so the callback doesn't do it again needlessly.  I don't think
this problem is specific to parallel index creation.




Re: On disable_cost

2019-12-12 Thread Thomas Munro
On Wed, Dec 11, 2019 at 7:24 PM Laurenz Albe  wrote:
> Doesn't that rely on a specific implementation of double precision (IEEE)?
> I thought that we don't want to limit ourselves to platforms with IEEE floats.

Just by the way, you might want to read the second last paragraph of
the commit message for 02ddd499.  The dream is over, we're never going
to run on Vax.




Re: error context for vacuum to include block number

2019-12-12 Thread Justin Pryzby
On Wed, Dec 11, 2019 at 12:33:53PM -0300, Alvaro Herrera wrote:
> On 2019-Dec-11, Justin Pryzby wrote:
> > +   cbarg.blkno = 0; /* Not known yet */
> Shouldn't you use InvalidBlockNumber for this initialization?
..
> > pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, 
> > blkno);
> > +   cbarg.blkno = blkno;
> I would put this before pgstat_progress_update_param, just out of
> paranoia.
..
> Lose this hunk?

Addressed those.

> Or do you intend that this is going to be used for lazy_vacuum_heap too?
> Maybe it should.

Done in a separate patch.

On Wed, Dec 11, 2019 at 08:54:25AM -0800, Andres Freund wrote:
> Hm, wonder if could be worthwhile to not use a separate struct here, but
> instead extend one of the existing structs to contain the necessary
> information. Or perhaps have one new struct with all the necessary
> information. There's already quite a few places that do
> get_namespace_name(), for example.

Didn't find a better struct to use yet.

> > +   vacuum_error_callback_arg cbarg;
> Not a fan of "cbarg", too generic.
..
> I think this will misattribute errors that happen when in the:

Probably right.  Attached should address it. 
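
For anyone following along, the mechanism being wired up is the usual
error-context-callback pattern, roughly like this (a sketch of the shape
only, not the attached patch; the errcontext message text is illustrative):

    typedef struct
    {
        char       *relname;
        char       *relnamespace;
        BlockNumber blkno;
    } vacuum_error_callback_arg;

    static void
    vacuum_error_callback(void *arg)
    {
        vacuum_error_callback_arg *errcbarg = (vacuum_error_callback_arg *) arg;

        errcontext("while vacuuming block %u of relation \"%s.%s\"",
                   errcbarg->blkno, errcbarg->relnamespace, errcbarg->relname);
    }

    /* in lazy_scan_heap(), around the main loop over heap blocks: */
    ErrorContextCallback errcallback;

    errcallback.callback = vacuum_error_callback;
    errcallback.arg = &errcbarg;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* ... scan heap pages, keeping errcbarg.blkno up to date ... */

    error_context_stack = errcallback.previous;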

On Wed, Dec 11, 2019 at 08:54:25AM -0800, Andres Freund wrote:
> > +typedef struct
> > +{
> > +   char *relname;
> > +   char *relnamespace;
> > +   BlockNumber blkno;
> > +} vacuum_error_callback_arg;
> 
> Hm, wonder if could be worthwhile to not use a separate struct here, but
> instead extend one of the existing structs to contain the necessary
> information. Or perhaps have one new struct with all the necessary
> information. There's already quite a few places that do
> get_namespace_name(), for example.

> Not a fan of "cbarg", too generic.

> I think this will misattribute errors that happen when in the:

I think that's addressed after deduplicating in attached.

Deduplication revealed a 2nd progress call, which seems to have been included
redundantly at c16dc1aca.

-   /* Remove tuples from heap */
-   pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-
PROGRESS_VACUUM_PHASE_VACUUM_HEAP);

Justin
>From db8a62da96a7591345bb4dc2a7c725bd0d4878d1 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 27 Nov 2019 20:07:10 -0600
Subject: [PATCH v4 1/5] Rename buf to avoid shadowing buf of another type

---
 src/backend/access/heap/vacuumlazy.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..043ebb4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -517,7 +517,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
 	xl_heap_freeze_tuple *frozen;
-	StringInfoData buf;
+	StringInfoData sbuf;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -1481,33 +1481,33 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * This is pretty messy, but we split it up so that we can skip emitting
 	 * individual parts of the message when not applicable.
 	 */
-	initStringInfo(&buf);
-	appendStringInfo(&buf,
+	initStringInfo(&sbuf);
+	appendStringInfo(&sbuf,
 	 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
 	 nkeep, OldestXmin);
-	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
+	appendStringInfo(&sbuf, _("There were %.0f unused item identifiers.\n"),
 	 nunused);
-	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
+	appendStringInfo(&sbuf, ngettext("Skipped %u page due to buffer pins, ",
 	"Skipped %u pages due to buffer pins, ",
 	vacrelstats->pinskipped_pages),
 	 vacrelstats->pinskipped_pages);
-	appendStringInfo(&buf, ngettext("%u frozen page.\n",
+	appendStringInfo(&sbuf, ngettext("%u frozen page.\n",
 	"%u frozen pages.\n",
 	vacrelstats->frozenskipped_pages),
 	 vacrelstats->frozenskipped_pages);
-	appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
+	appendStringInfo(&sbuf, ngettext("%u page is entirely empty.\n",
 	"%u pages are entirely empty.\n",
 	empty_pages),
 	 empty_pages);
-	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
+	appendStringInfo(&sbuf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
 			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
 	RelationGetRelationName(onerel),
 	tups_vacuumed, num_tuples,
 	vacrelstats->scanned_pages, nblocks),
-			 errdetail_internal("%s", buf.data)));
-	pfree(buf.data);
+			 errdetail_internal("%s", sbuf.data)));
+	pfree(sbuf.data);
 }
 
 
-- 
2.7.4

>From 640d771b6a698189ba43886d368855afd35488fd Mon Sep 17 00:00:00 2001
From: Justin P

Re: archive status ".ready" files may be created too early

2019-12-12 Thread Kyotaro Horiguchi
Hello.

At Thu, 12 Dec 2019 22:50:20 +, "Bossart, Nathan"  
wrote in 
> Hi hackers,
> 
> I believe I've uncovered a bug that may cause archive status ".ready"
> files to be created too early, which in turn may cause an incorrect
> version of the corresponding WAL segment to be archived.
> 
> The crux of the issue seems to be that XLogWrite() does not wait for
> the entire record to be written to disk before creating the ".ready"
> file.  Instead, it just waits for the last page of the segment to be
> written before notifying the archiver.  If PostgreSQL crashes before
> it is able to write the rest of the record, it will end up reusing the
> ".ready" segment at the end of crash recovery.  In the meantime, the
> archiver process may have already processed the old version of the
> segment.

Yeah, that can happen if the server restarted after the crash.

> This issue seems to most often manifest as WAL corruption on standby
> servers after the primary server has crashed because it ran out of
> disk space.

In the first place, it's quite bad to set restart_after_crash to on, or
to simply restart a crashed master in a replication setup. The standby
can be inconsistent at the time of the master crash, so it should be
fixed using pg_rewind or recreated from a base backup.

Even without that archiving behavior, a standby may receive WAL bytes
inconsistent with the bytes from the same master just before the crash.
It is not limited to segment boundaries; it can happen on every block
boundary, and could happen anywhere with more complicated sequences of
events.

What you are calling a "problem" seems to come from allowing the
restart_after_crash behavior. On the other hand, as recommended in the
documentation, archive_command can refuse to overwrite an existing
segment, but we don't require that.

As a result, the patch doesn't seem to buy anything beyond setting things
up and operating them correctly.

> Another thing I am exploring is whether a crash in between writing the
> last page of a segment and creating the ".ready" file could cause the
> archiver process to skip processing it altogether.  In the scenario I
> mention earlier, the server seems to recreate the ".ready" file since
> it rewrites a portion of the segment.  However, if a WAL record fits
> perfectly into the last section of the segment, I am not sure whether
> the ".ready" file would be created after restart.

Why would that segment need a .ready file after restart, even though
nothing more could be written to the old segment?

> I am admittedly in the early stages of working on this problem, but I
> thought it would be worth reporting to the community early on in case
> anyone has any thoughts on or past experiences with this issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Block level parallel vacuum

2019-12-12 Thread Masahiko Sawada
Sorry for the late reply.

On Fri, 6 Dec 2019 at 14:20, Amit Kapila  wrote:
>
> On Thu, Dec 5, 2019 at 7:44 PM Robert Haas  wrote:
> >
> > I think it might be a good idea to change what we expect index AMs to
> > do rather than trying to make anything that they happen to be doing
> > right now work, no matter how crazy. In particular, suppose we say
> > that you CAN'T add data on to the end of IndexBulkDeleteResult any
> > more, and that instead the extra data is passed through a separate
> > parameter. And then you add an estimate method that gives the size of
> > the space provided by that parameter (and if the estimate method isn't
> > defined then the extra parameter is passed as NULL) and document that
> > the data stored there might get flat-copied.
> >
>
> I think this is a good idea and serves the purpose we are trying to
> achieve currently.  However, if there are any IndexAM that is using
> the current way to pass stats with additional information, they would
> need to change even if they don't want to use parallel vacuum
> functionality (say because their indexes are too small or whatever
> other reasons).  I think this is a reasonable trade-off and the
> changes on their end won't be that big.  So, we should do this.
>
> > Now, you've taken the
> > onus off of parallel vacuum to cope with any crazy thing a
> > hypothetical AM might be doing, and instead you've defined the
> > behavior of that hypothetical AM as wrong. If somebody really needs
> > that, it's now their job to modify the index AM machinery further
> > instead of your job to somehow cope.
> >
>
> makes sense.
>
> > > Here, we have a need to reduce the number of workers.  Index Vacuum
> > > has two different phases (index vacuum and index cleanup) which uses
> > > the same parallel-context/DSM but both could have different
> > > requirements for workers.  The second phase (cleanup) would normally
> > > need fewer workers as if the work is done in the first phase, second
> > > wouldn't need it, but we have exceptions like gin indexes where we
> > > need it for the second phase as well because it takes the pass
> > > over-index again even if we have cleaned the index in the first phase.
> > > Now, consider the case where we have 3 btree indexes and 2 gin
> > > indexes, we would need 5 workers for index vacuum phase and 2 workers
> > > for index cleanup phase.  There are other cases too.
> > >
> > > We also considered to have a separate DSM for each phase, but that
> > > appeared to have overhead without much benefit.
> >
> > How about adding an additional argument to ReinitializeParallelDSM()
> > that allows the number of workers to be reduced? That seems like it
> > would be less confusing than what you have now, and would involve
> > modify code in a lot fewer places.
> >
>
> Yeah, we can do that.  We can maintain some information in
> LVParallelState which indicates whether we need to reinitialize the
> DSM before launching workers.  Sawada-San, do you see any problem with
> this idea?

I think the number of workers could be increased in cleanup phase. For
example, if we have 1 brin index and 2 gin indexes then in bulkdelete
phase we need only 1 worker but in cleanup we need 2 workers.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-12-12 Thread Thomas Munro
On Sat, Nov 30, 2019 at 10:57 AM Thomas Munro  wrote:
> On Fri, Nov 29, 2019 at 12:34 PM Thomas Munro  wrote:
> > ... or stop using
> > _mdfd_getseg() for this so that you can remove segments independently
> > without worrying about sync requests for other segments (it was
> > actually like that in an earlier version of the patch for commit
> > 3eb77eba, but someone complained that it didn't benifit from fd
> > caching).
>
> Not sure which approach I prefer yet, but here's a patch showing that
> alternative.

Here's a better version: it uses the existing fd if we have it already
in md_seg_fds, but opens and closes a transient one if not.


0001-Don-t-use-_mdfd_getseg-in-mdsyncfiletag-v2.patch
Description: Binary data


Re: get_database_name() from background worker

2019-12-12 Thread Craig Ringer
On Thu, 12 Dec 2019 at 16:21, ROS Didier  wrote:

> Hi
>With pg_background extension ,it is possible to make  "autonomous
> transaction" which means possibility to commit in a transaction.
>  It is like a client which connects to a postgresql instance. So you can
> execute any sql orders .
>
>
Yes, that's possible. It's not easy though and I strongly suggest you look
into existing approaches like using dblink instead.

Please start a new thread rather than following an unrelated existing one.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: xact_start for walsender & logical decoding not updated

2019-12-12 Thread Craig Ringer
On Wed, 11 Dec 2019 at 02:08, Alvaro Herrera 
wrote:

> On 2019-Dec-10, Tomas Vondra wrote:
>
> > On Tue, Dec 10, 2019 at 09:42:17AM +0900, Kyotaro Horiguchi wrote:
> > > At Tue, 10 Dec 2019 00:44:09 +0100, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote in
>
> > > I'm not sure how much xact_start for walsender is useful and we really
> > > is not running a statement there.  Also autovac launcher starts
> > > transaction without a valid statement timestamp perhaps for the same
> > > reason.
> >
> > Maybe, but then maybe we should change it so that we don't report any
> > timestamps for such processes.
>
> Yeah, I think we should to that.


Agreed. Don't report a transaction start timestamp at all if we're not in a
read/write txn in the walsender, which we should never be when using a
historic snapshot.

It's not interesting or relevant.

Reporting the commit timestamp of the current or last-processed xact would
likely just be confusing. I'd rather see that in pg_stat_replication if
we're going to show it; that way we can label it usefully.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: [HACKERS] Block level parallel vacuum

2019-12-12 Thread Amit Kapila
On Fri, Dec 13, 2019 at 10:03 AM Masahiko Sawada
 wrote:
>
> Sorry for the late reply.
>
> On Fri, 6 Dec 2019 at 14:20, Amit Kapila  wrote:
> >
> >
> > > > Here, we have a need to reduce the number of workers.  Index Vacuum
> > > > has two different phases (index vacuum and index cleanup) which uses
> > > > the same parallel-context/DSM but both could have different
> > > > requirements for workers.  The second phase (cleanup) would normally
> > > > need fewer workers as if the work is done in the first phase, second
> > > > wouldn't need it, but we have exceptions like gin indexes where we
> > > > need it for the second phase as well because it takes the pass
> > > > over-index again even if we have cleaned the index in the first phase.
> > > > Now, consider the case where we have 3 btree indexes and 2 gin
> > > > indexes, we would need 5 workers for index vacuum phase and 2 workers
> > > > for index cleanup phase.  There are other cases too.
> > > >
> > > > We also considered to have a separate DSM for each phase, but that
> > > > appeared to have overhead without much benefit.
> > >
> > > How about adding an additional argument to ReinitializeParallelDSM()
> > > that allows the number of workers to be reduced? That seems like it
> > > would be less confusing than what you have now, and would involve
> > > modify code in a lot fewer places.
> > >
> >
> > Yeah, we can do that.  We can maintain some information in
> > LVParallelState which indicates whether we need to reinitialize the
> > DSM before launching workers.  Sawada-San, do you see any problem with
> > this idea?
>
> I think the number of workers could be increased in cleanup phase. For
> example, if we have 1 brin index and 2 gin indexes then in bulkdelete
> phase we need only 1 worker but in cleanup we need 2 workers.
>

I think it shouldn't be more than the number with which we have
created a parallel context, no?  If that is the case, then I think it
should be fine.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: A varint implementation for PG?

2019-12-12 Thread Craig Ringer
On Tue, 10 Dec 2019 at 09:51, Andres Freund  wrote:

> Hi,
>
> I several times, most recently for the record format in the undo
> patchset, wished for a fast variable width integer implementation for
> postgres. Using very narrow integers, for space efficiency, solves the
> space usage problem, but leads to extensibility / generality problems.
>

Yes. I've wanted flexible but efficiently packed integers quite a bit too,
especially when working with wire protocols.

Am I stabbing completely in the dark when wondering if this might be a step
towards a way to lift the size limit on VARLENA Datums like bytea ?

There are obvious practical concerns with doing so, given that our protocol
offers no handle based lazy fetching for big VARLENA values, but that too
needs a way to represent sizes sensibly and flexibly.


> Even with those caveats, I think that's a pretty good result. Other
> encodings were more expensive. And I think there's definitely some room
> for optimization left.


I don't feel at all qualified to question your analysis of the appropriate
representation. But your explanation certainly makes a lot of sense as
someone approaching the topic mostly fresh - I've done a bit with BCD but
not much else.

I assume we'd be paying a price in padding and alignment in most cases, and
probably more memory copying, but these representations would likely be
appearing mostly in places where other costs are overwhelmingly greater
like network or disk I/O.

> If data lengths longer than that are required for a use case


If baking a new variant integer format now, I think limiting it to 64 bits
is probably a mistake given how long-lived PostgreSQL is, and how hard it
can be to change things in the protocol, on disk, etc.


> it
> probably is better to either a) use the max-representable 8 byte integer
> as an indicator that the length is stored or b) sacrifice another bit to
> represent whether the integer is the data itself or the length.
>

I'd be inclined to suspect that (b) is likely worth doing. If nothing else
because not being able to represent the full range of a 64-bit integer in
the variant type is potentially going to be a seriously annoying hassle at
points where we're interacting with places that could use the full width.
We'd then have the potential for variant integers of > 2^64 but at least
that's wholly under our control.

I also routinely underestimate how truly huge a 64-bit integer really is.
But even now 8 exabytes isn't as inconceivable as it used to be.

It mostly depends on how often you expect you'd be coming up on the
boundaries where the extra bit would push you up a variant size.

> Do others see use in this?


Yes. Very, very much yes.

I'd be quick to want to expose it to SQL too.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Questions about PostgreSQL implementation details

2019-12-12 Thread Craig Ringer
On Tue, 10 Dec 2019 at 01:21, Tom Lane  wrote:

> Mark Dilger  writes:
> > [ useful tips about finding the code that implements a SQL command ]
>
> BTW, if it wasn't obvious already, you *really* want to have some kind
> of tool that easily finds the definition of a particular C symbol.
> You can fall back on "grep -r" or "git grep", but lots of people use
> ctags or etags or some other C-aware indexing tool.
>
>
I strongly recommend cscope with editor integration for your preferred
editor btw.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: [HACKERS] Block level parallel vacuum

2019-12-12 Thread Masahiko Sawada
On Fri, 13 Dec 2019 at 14:19, Amit Kapila  wrote:
>
> On Fri, Dec 13, 2019 at 10:03 AM Masahiko Sawada
>  wrote:
> >
> > Sorry for the late reply.
> >
> > On Fri, 6 Dec 2019 at 14:20, Amit Kapila  wrote:
> > >
> > >
> > > > > Here, we have a need to reduce the number of workers.  Index Vacuum
> > > > > has two different phases (index vacuum and index cleanup) which uses
> > > > > the same parallel-context/DSM but both could have different
> > > > > requirements for workers.  The second phase (cleanup) would normally
> > > > > need fewer workers as if the work is done in the first phase, second
> > > > > wouldn't need it, but we have exceptions like gin indexes where we
> > > > > need it for the second phase as well because it takes the pass
> > > > > over-index again even if we have cleaned the index in the first phase.
> > > > > Now, consider the case where we have 3 btree indexes and 2 gin
> > > > > indexes, we would need 5 workers for index vacuum phase and 2 workers
> > > > > for index cleanup phase.  There are other cases too.
> > > > >
> > > > > We also considered to have a separate DSM for each phase, but that
> > > > > appeared to have overhead without much benefit.
> > > >
> > > > How about adding an additional argument to ReinitializeParallelDSM()
> > > > that allows the number of workers to be reduced? That seems like it
> > > > would be less confusing than what you have now, and would involve
> > > > modify code in a lot fewer places.
> > > >
> > >
> > > Yeah, we can do that.  We can maintain some information in
> > > LVParallelState which indicates whether we need to reinitialize the
> > > DSM before launching workers.  Sawada-San, do you see any problem with
> > > this idea?
> >
> > I think the number of workers could be increased in cleanup phase. For
> > example, if we have 1 brin index and 2 gin indexes then in bulkdelete
> > phase we need only 1 worker but in cleanup we need 2 workers.
> >
>
> I think it shouldn't be more than the number with which we have
> created a parallel context, no?  If that is the case, then I think it
> should be fine.

Right. I thought that ReinitializeParallelDSM() with an additional
argument would shrink the DSM, but I understand that it doesn't
actually shrink the DSM; it just keeps a variable for the number of
workers to launch, is that right? And we would also need to call
ReinitializeParallelDSM() at the beginning of index vacuum or index
cleanup, since at the end of an index vacuum we don't know which of
the two we will do next.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pg_ls_tmpdir to show shared filesets

2019-12-12 Thread Justin Pryzby
On Thu, Dec 12, 2019, Justin Pryzby wrote in 
20191212180506.gr2...@telsasoft.com:
> Actually, I tried using pg_ls_tmpdir(), but it unconditionally masks
> non-regular files and thus shared filesets.  Maybe that's worth discussion
> on a new thread?
> 
> src/backend/utils/adt/genfile.c
> /* Ignore anything but regular files */
> if (!S_ISREG(attrib.st_mode))
> continue;

I suggested that pg_ls_tmpdir should show shared filesets like
> 169347 5492 -rw-r-   1 postgres postgres  5619712 Dec  7 01:35 
> /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/0.0

Should it have an extra column for the parent dir (null for non-shared
filesets)?  Maybe it would only show 1) files, 2) parents named
pgsql_tmp[0-9.]+.sharedfileset, and maybe 3) files directly underneath (2).

Or should it require an argument to show them?  
pg_ls_tmpdir(sharedfileset=False)

That allows enumerating an entire directory and its subdirs, except hidden
files, and probably except files more than one level deep.  I guess pg_ls_dir
already allows that.

Actually, my suggestion would be to make pg_ls_tmpdir expose "isdir", same as
pg_stat_file.

That's already possible using pg_ls_dir:

postgres=# SELECT name, s.size, s.modification, s.isdir FROM (SELECT 
'base/pgsql_tmp'p)p, pg_ls_dir(p)name, pg_stat_file(p||'/'||name)s;
              name              | size |      modification      | isdir 
--------------------------------+------+------------------------+-------
 pgsql_tmp11025.0.sharedfileset | 4096 | 2019-12-07 01:35:56-06 | t

pg_ls_tmpdir() might return (name,size,modtime), or perhaps
(name,isdir,size,modtime); the latter would be more likely to cause deliberate
breakage if someone assumed that record[1] was the size and the rest of the
result was the same...
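
In the meantime, the contents that pg_ls_tmpdir currently hides can be pulled
out with the existing functions; a rough sketch, assuming the default
tablespace and that base/pgsql_tmp exists (it errors out otherwise, or if a
temp file vanishes mid-query):

  SELECT d.name AS fileset, f.name AS file, s.size, s.modification
  FROM (SELECT name
        FROM pg_ls_dir('base/pgsql_tmp') AS t(name)
        WHERE name LIKE '%.sharedfileset') AS d,
       pg_ls_dir('base/pgsql_tmp/' || d.name) AS f(name),
       pg_stat_file('base/pgsql_tmp/' || d.name || '/' || f.name) AS s;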

Justin




Re: A varint implementation for PG?

2019-12-12 Thread Andres Freund
Hi,

On 2019-12-13 13:31:55 +0800, Craig Ringer wrote:
> Am I stabbing completely in the dark when wondering if this might be a step
> towards a way to lift the size limit on VARLENA Datums like bytea ?

It could be - but I think it'd be a pretty small piece of it. But yes, I
have mused about that.



> > Even with those caveats, I think that's a pretty good result. Other
> > encodings were more expensive. And I think there's definitely some room
> > for optimization left.
> 
> 
> I don't feel at all qualified to question your analysis of the appropriate
> representation. But your explanation certainly makes a lot of sense as
> someone approaching the topic mostly fresh - I've done a bit with BCD but
> not much else.
> 
> I assume we'd be paying a price in padding and alignment in most cases, and
> probably more memory copying, but these representations would likely be
> appearing mostly in places where other costs are overwhelmingly greater
> like network or disk I/O.

I don't really see where padding/alignment costs come into play here?



> If data lengths longer than that are required for a use case
> 
> 
> If baking a new variant integer format now, I think limiting it to 64 bits
> is probably a mistake given how long-lived PostgreSQL is, and how hard it
> can be to change things in the protocol, on disk, etc.

I don't think it's ever going to be sensible to transport 64bit quanta
of data. Also, uh, it'd be larger than the data a postgres instance
could really contain, given LSNs are 64 bit.



> > it
> > probably is better to either a) use the max-representable 8 byte integer
> > as an indicator that the length is stored or b) sacrifice another bit to
> > represent whether the integer is the data itself or the length.

> I'd be inclined to suspect that (b) is likely worth doing. If nothing else
> because not being able to represent the full range of a 64-bit integer in
> the variant type is potentially going to be a seriously annoying hassle at
> points where we're interacting with places that could use the full width.
> We'd then have the potential for variant integers of > 2^64 but at least
> that's wholly under our control.

I'm very very staunchly against doing either of these for the varints
used widely. Throwing away even a bit is quite painful, as it
e.g. reduces the range representable in a single byte from 0 - 127/-64 -
63 to 0 - 63/-32 - 31.  Without ever being useful, given what kind of
things varints are commonly going to describe. There's e.g. simply no
practical use of describing a single WAL record length that's bigger
than 63 bit can represent.

I *can* see a separate varint type, probably sharing some code, that
supports storing arbitrarily large numbers. But using that everywhere
would be pointless.
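
To put rough numbers on that range cost: assuming roughly 7 usable payload
bits per encoded byte (which matches the single-byte figures above; the exact
multi-byte boundaries depend on the encoding chosen), losing one bit moves
every size boundary down:

  SELECT n AS bytes,
         (2::numeric ^ (7 * n)) - 1     AS max_unsigned_keeping_the_bit,
         (2::numeric ^ (7 * n - 1)) - 1 AS max_unsigned_after_losing_a_bit
  FROM generate_series(1, 9) AS t(n);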


> I'd be quick to want to expose it to SQL too.

It'll be a bit problematic to deal with all the casting necessary, and
with the likely resulting overload resolution issues.  I'm wondering
whether it'd be worthwhile to have an ALTER TABLE ... STORAGE ... option
that encodes int2/4/8 as varints when inside a tuple, but otherwise just
let it be a normal integer.

Greetings,

Andres Freund




Re: Questions about PostgreSQL implementation details

2019-12-12 Thread Craig Ringer
On Mon, 9 Dec 2019 at 23:35, Julien Delplanque 
wrote:

> Hello PostgreSQL hackers,
>
> I hope I am posting on the right mailing-list.
>
> I am actually doing a PhD related to relational databases and software
> engineering.
>
> I use PostgreSQL for my research.
>
> I have a few questions about the internals of PostgreSQL and I think they
> require experts knowledge.
>
> I could not find documentation about that in the nice PostgreSQL
> documentation but maybe I missed something? Tell me if it is the case.
>

There are a bunch of README files in the source tree that concern various
innards of PostgreSQL. They're not always referred to by any comments etc,
so you have to know they exist. They're usually well worth reading, though
it can take a while before you understand enough of PostgreSQL's
architecture for them to make sense...

Try

find src/ -name README\*


> Q1. Are PostgreSQL's meta-description tables (such as pg_class) the
> "reality" concerning the state of the DB or are they just a virtual
> representation ?
>

That's been largely answered. But I want to point out an important caveat
that isn't obvious to new people: The oid of a relation (pg_class.oid) is
not the same thing as the pg_class.relfilenode, which is usually the base
of the filename of the on-disk storage for the relation. On an idle or new
database most relations are created with an equal oid and relfilenode, so
it's easy to think the oid maps to the on-disk name of a relation, but it
doesn't. The relation oid will not change so long as the relation exists,
but the relfilenode may change if the table contents are rewritten, etc.
Additionally, there are special tables that are "relmapped" such that they
don't have a normal relfilenode at all, instead access is indirected via a
separate mapping. IIRC that's mainly necessary so we can bootstrap access
to the catalog tables that tell us how to read the catalogs.
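
A quick way to see this for yourself in a scratch database (the table name is
just an example):

  CREATE TABLE demo (i int);
  SELECT oid, relfilenode FROM pg_class WHERE relname = 'demo';
  VACUUM FULL demo;   -- rewrites the table, allocating a new relfilenode
  SELECT oid, relfilenode FROM pg_class WHERE relname = 'demo';
  -- Mapped catalogs store relfilenode = 0; the real value comes from the map:
  SELECT relname, relfilenode, pg_relation_filenode(oid::regclass)
  FROM pg_class WHERE relname IN ('demo', 'pg_class');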

> What I would like to know with this question is: would it be possible to
> implement DDL queries (e.g. CREATE TABLE, DROP TABLE, CREATE VIEW, ALTER
> TABLE, etc.) as DML queries that modify the meta-data stored in
> meta-description tables?
>

Not really.

PostgreSQL has a caching layer - syscache, relcache, catcache - and
invalidation scheme that it relies on. It doesn't execute regular queries
on the system catalogs. It also has simplifying rules around how they are
updated and accessed. See the logic in genam.c etc. Catalogs may also
represent things that aren't just other DB rows - for example, pg_class
entries are associated with files on disk for individual database tables.

You can't just insert into pg_class, pg_attribute, etc and expect that to
safely create a table. Though it's surprising how much you can get away
with by hacking the catalogs if you're very careful and you trick
PostgreSQL into firing appropriate invalidations. I'd quite like to have a
SQL-exposed way to do a forced global cache flush and invalidation for use
in emergency scary catalog hacking situations.

So you can do quite a bit with direct catalog surgery, but it's dangerous
and if you break the database, you get to keep the pieces.

> Q1.1 If it is possible, is that what is done in reality? I have the feeling that
> it is not the case and that DDL queries are implemented in C directly.
>

Right. See standard_ProcessUtility() and friends.

> Q1.2 If it is possible and not done, what is the reason?
>

Speed - no need to run the full executor. Simplification of catalog access.
Caching and invalidations. Chicken/egg problems: how do you "CREATE TABLE
pg_class"? . Lots more.


> Q2. Are PostgreSQL's "meta-constraints" (i.e. constraints related to
> database structure such as "a table can only have a single primary key")
> implemented in C code or via data constraints on PostgreSQL's
> meta-description tables?
>

System catalogs are not permitted to have CONSTRAINTs (CHECK constraints,
UNIQUE constraints, PRIMARY KEY constraints, FOREIGN KEY constraints, etc).

All such management is done in C level logic with the assistance of the
pg_depend catalog and the relationships it tracks.
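
You can see both halves of that from SQL (scratch database; demo2 is just an
example name):

  -- No declared constraints on a system catalog:
  SELECT count(*) FROM pg_constraint WHERE conrelid = 'pg_class'::regclass;
  -- What is tracked instead: dependency links, e.g. for a new table's PK
  CREATE TABLE demo2 (i int PRIMARY KEY);
  SELECT classid::regclass AS dependent_catalog, objid, deptype
  FROM pg_depend
  WHERE refclassid = 'pg_class'::regclass
    AND refobjid = 'demo2'::regclass;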


> Q2.1 If they are not implemented via data constraints on meta-description
> tables, why ?
>

Same as above.


> Q2.2 Is there somewhere in the documentation a list of such
> "meta-constraints" implemented by PostgreSQL?
>

Not AFAIK.

Why?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Conflict handling for COPY FROM

2019-12-12 Thread Surafel Temesgen
Hi Asaba,

On Thu, Dec 12, 2019 at 7:51 AM asaba.takan...@fujitsu.com <
asaba.takan...@fujitsu.com> wrote:

> Hello Surafel,
>
> I'm very interested in this patch.
> Although I'm a beginner,I would like to participate in the development of
> PostgreSQL.
>
>
> 1. I want to suggest new output format.
> In my opinion, it's kind to display description of output and add "line
> number" and "error" to output.
> For example,
>
> error lines
>
> line number | first | second | third |  error
> -------------+-------+--------+-------+----------
>            1 |     1 |     10 |   0.5 | UNIQUE
>            2 |     2 |     42 |   0.1 | CHECK
>            3 |     3 |   NULL |     0 | NOT NULL
> (3 rows)
>
> Although only unique or exclusion constraint violation returned back to
> the caller currently,
> I think that column "error" will be useful when it becomes possible to
> handle other types of errors(check, not-null and so on).
>

Currently we can't get the violation kind in speculative insertion.


> If you assume that users re-execute COPY FROM with the output lines as
> input, these columns are obstacles.
> Therefore I think that this output format should be displayed only when we
> set new option(for example ERROR_VERBOSE) like "COPY FROM ...
> ERROR_VERBOSE;".
>
>
I agree that adding an optional feature for this would be useful in some
scenarios, but I think it's material for a future improvement after the basic
feature is done.


>
> 2. I have a question about copy meta-command.
> When I executed copy meta-command, output wasn't displayed.
> Does it correspond to copy meta-command?
>
>
Okay, I will look at it.
Thank you.

regards
Surafel


Re: xact_start for walsender & logical decoding not updated

2019-12-12 Thread Kyotaro Horiguchi
At Fri, 13 Dec 2019 13:05:41 +0800, Craig Ringer  wrote 
in 
> On Wed, 11 Dec 2019 at 02:08, Alvaro Herrera 
> wrote:
> 
> > On 2019-Dec-10, Tomas Vondra wrote:
> >
> > > On Tue, Dec 10, 2019 at 09:42:17AM +0900, Kyotaro Horiguchi wrote:
> > > > At Tue, 10 Dec 2019 00:44:09 +0100, Tomas Vondra <
> > tomas.von...@2ndquadrant.com> wrote in
> >
> > > > I'm not sure how much xact_start for walsender is useful and we really
> > > > is not running a statement there.  Also autovac launcher starts
> > > > transaction without a valid statement timestamp perhaps for the same
> > > > reason.
> > >
> > > Maybe, but then maybe we should change it so that we don't report any
> > > timestamps for such processes.
> >
> > Yeah, I think we should to that.
> 
> 
> Agreed. Don't report a transaction start timestamp at all if we're not in a
> read/write txn in the walsender, which we should never be when using a
> historic snapshot.
> 
> It's not interesting or relevant.
> 
> Reporting the commit timestamp of the current or last-processed xact would
> likely just be fonfusing. I'd rather see that in pg_stat_replication if
> we're going to show it, that way we can label it usefully.

Sounds reasonable.  By the way, what started this thread is that
xact_start holds a valid value for a moment at the start of logical
replication.  (I couldn't see it unless I inserted a sleep() in
IdentifySystem().)  I'm not sure, but as far as I can see it is the only
such instance in the walsender.  Should we take the trouble to stop that?
(I put -1 for it.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Errors "failed to construct the join relation" and "failed to build any 2-way joins"

2019-12-12 Thread Tom Lane
Will Leinweber  writes:
> On 12.1, fresh initdb the following query gives me the error
> "ERROR:  failed to construct the join relation"

I'm getting an assertion failure in an assert-enabled build, here:

(gdb) f 3
#3  0x006f382a in create_lateral_join_info (root=0x2d380c8)
at initsplan.c:637
637 Assert(!bms_is_member(rti, lateral_relids));

Eyeing the plan produced by v11, I'm suspecting some oversight in
the RTE_RESULT changes (4be058fe9); but I haven't actually bisected.
Too tired to look closer right now.

regards, tom lane




Re: segmentation fault when cassert enabled

2019-12-12 Thread vignesh C
On Fri, Dec 6, 2019 at 5:30 PM Amit Kapila  wrote:
>
> On Mon, Nov 25, 2019 at 8:25 PM Jehan-Guillaume de Rorthais
>  wrote:
> >
> > On Wed, 6 Nov 2019 14:34:38 +0100
> > Peter Eisentraut  wrote:
> >
> > > On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote:
> > > > My best bet so far is that logicalrep_relmap_invalidate_cb is not called
> > > > after the DDL on the subscriber so the relmap cache is not invalidated. 
> > > > So
> > > > we end up with slot->tts_tupleDescriptor->natts superior than
> > > > rel->remoterel->natts in slot_store_cstrings, leading to the overflow on
> > > > attrmap and the sigsev.
> > >
> > > It looks like something like that is happening.  But it shouldn't.
> > > Different table schemas on publisher and subscriber are well supported,
> > > so this must be an edge case of some kind.  Please continue investigating.
> >
> > I've been able to find the origin of the crash, but it was a long journey.
> >
> > 
> >
> >   I was unable to debug using gdb record because of this famous error:
> >
> > Process record does not support instruction 0xc5 at address 
> > 0x1482758a4b30.
> >
> > Program stopped.
> > __memset_avx2_unaligned_erms ()
> > at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168
> > 168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
> >   file or directory.
> >
> >   Trying with rr, I had constant "stack depth limit exceeded", even with
> >   unlimited one. Does it worth opening a discussion or a wiki page about 
> > these
> >   tools? Peter, it looks like you have some experience with rr, any 
> > feedback?
> >
> >   Finally, Julien Rouhaud spent some time with me after work hours,
> > answering
> >   my questions about some parts of the code and pointed me to the excellent
> >   backtrace_functions GUC addition few days ago. This finally did the trick 
> > to
> >   find out what was happening. Many thanks Julien!
> >
> > 
> >
> > Back to the bug itself. Consider a working logical replication with constant
> > update/insert activity, eg. pgbench running against provider.
> >
> > Now, on the subscriber side, a session issue an "ALTER TABLE ADD
> > COLUMN" on a subscribed table, eg. pgbench_branches. A cache invalidation
> > message is then pending for this table.
> >
> > In the meantime, the logical replication worker receive an UPDATE to apply. 
> > It
> > opens the local relation using "logicalrep_rel_open". It finds the related
> > entry in LogicalRepRelMap is valid, so it does not update its attrmap
> > and directly opens and locks the local relation:
> >
>
> - /* Try to find and lock the relation by name. */
> + /* Try to find the relation by name */
>   relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,\
> remoterel->relname, -1),
> - lockmode, true);
> + NoLock, true);
>
> I think we can't do this because it could lead to locking the wrong
> reloid.  See RangeVarGetRelidExtended.  It ensures that after locking
> the relation (which includes accepting invalidation messages), that
> the reloid is correct.  I think changing the code in the way you are
> suggesting can lead to locking incorrect reloid.
>

I have made changes to fix the comment provided. The patch for the
same is attached. I could not add a test case for this scenario as it
is based on a timing issue.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From 9bec5e5250c5af034096ed9ab966c2c237518976 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais , vignesh.c 
Date: Fri, 13 Dec 2019 11:59:13 +0530
Subject: [PATCH] Fix subscriber invalid memory access on DDL

Before this patch, the attrmap structure mapping the local fields
to remote ones for a subscribed relation was rebuild before handling
pending invalidation messages and potential relcache updates. This
was leading to an invalid memory access by overflowing the attrmap
when building the related tuple slot received from the provider.
---
 src/backend/replication/logical/relation.c | 53 ++
 1 file changed, 39 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index b386f84..9009ebc 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -220,6 +220,8 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
+	Oid			relid = InvalidOid;
+	LogicalRepRelation	*remoterel;
 
 	if (LogicalRepRelMap == NULL)
 		logicalrep_relmap_init();
@@ -232,20 +234,24 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		elog(ERROR, "no relation map entry for remote relation ID %u",
 			 remoteid);
 
-	/* Need to update the local cache? */
-	if (!OidIsValid(entry->localreloid))
-	{
-		Oid			relid;
-		int			i;
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		LogicalRepRelation *remoterel;
-		MemoryContext oldctx;
+	remoterel = &entry->remoterel;
 
-

Re: [HACKERS] Block level parallel vacuum

2019-12-12 Thread Amit Kapila
On Fri, Dec 13, 2019 at 11:08 AM Masahiko Sawada
 wrote:
>
> On Fri, 13 Dec 2019 at 14:19, Amit Kapila  wrote:
> >
> > > > >
> > > > > How about adding an additional argument to ReinitializeParallelDSM()
> > > > > that allows the number of workers to be reduced? That seems like it
> > > > > would be less confusing than what you have now, and would involve
> > > > > modify code in a lot fewer places.
> > > > >
> > > >
> > > > Yeah, we can do that.  We can maintain some information in
> > > > LVParallelState which indicates whether we need to reinitialize the
> > > > DSM before launching workers.  Sawada-San, do you see any problem with
> > > > this idea?
> > >
> > > I think the number of workers could be increased in cleanup phase. For
> > > example, if we have 1 brin index and 2 gin indexes then in bulkdelete
> > > phase we need only 1 worker but in cleanup we need 2 workers.
> > >
> >
> > I think it shouldn't be more than the number with which we have
> > created a parallel context, no?  If that is the case, then I think it
> > should be fine.
>
> Right. I thought that ReinitializeParallelDSM() with an additional
> argument would reduce DSM but I understand that it doesn't actually
> reduce DSM but just have a variable for the number of workers to
> launch, is that right?
>

Yeah, probably we need to change the nworkers stored in the context,
and it should not be more than the value already stored there.

> And we also would need to call
> ReinitializeParallelDSM() at the beginning of vacuum index or vacuum
> cleanup since we don't know that we will do either index vacuum or
> index cleanup, at the end of index vacum.
>

Right.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Fetching timeline during recovery

2019-12-12 Thread Michael Paquier
On Wed, Dec 11, 2019 at 10:45:25AM -0500, Stephen Frost wrote:
> I'm confused- wouldn't the above approach be a function that's returning
> only one row, if you had a bunch of columns and then had NULL values for
> those cases that didn't apply..?  Or, if you were thinking about the SRF
> approach that you suggested, you could use a WHERE clause to make it
> only one row...  Though I can see how it's nicer to just have one row in
> some cases which is why I was suggesting the "bunch of columns"
> approach.

Oh, sorry.  I see the confusion now and that's my fault.  In
https://www.postgresql.org/message-id/20191211052002.gk72...@paquier.xyz
I mentioned an SRF which takes an input argument, but that makes no
sense.  What I would prefer is a single function, returning one row
(LSN, TLI), taking one input argument to select which WAL information
the caller wants, with five possible cases (write, insert, flush,
receive, replay).

Then, what you are referring to is one function which returns all
(LSN,TLI) for the five cases (write, insert, etc.), so it would return
one row with 10 columns, with NULL for the values which have no
meaning (like replay on a primary).

And on top of that we have a third possibility: one SRF function
returning 5 rows with three attributes (mode, LSN, TLI), where mode
corresponds to one value in the set {write, insert, etc.}.

I actually prefer the first one, and you mentioned the second.  But
there could be a point in doing the third one.  An advantage of the
second and third ones is that you may be able to get a consistent view
of all the data, but it means holding locks to look at the values a
bit longer.  Let's see what others think.
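
For comparison, what can be pieced together today is an LSN from one call and
a TLI from another, read at slightly different instants - i.e. not the
consistent (LSN, TLI) pair discussed above (pg_last_wal_replay_lsn() is NULL
on a primary):

  SELECT pg_last_wal_replay_lsn() AS replay_lsn,
         (pg_control_checkpoint()).timeline_id AS last_checkpoint_tli;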
--
Michael


signature.asc
Description: PGP signature