MemoryContextCreate change in PG 11 how should contexts be created

2017-12-19 Thread Regina Obe
On December 13th this change to context creation was committed, which broke
the PostGIS trunk compile against PostgreSQL 11 head:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9fa6f00b1308dd10da4eca2f31ccbfc7b35bb461

Ticketed in PostGIS here:
https://trac.osgeo.org/postgis/ticket/3946


It had already been broken for a couple of months
(https://trac.osgeo.org/postgis/ticket/3904), with just a core dump at
runtime, but at least it used to compile before December 13th.


In going through the changes, I see that MemoryContextCreate no longer
returns a context and instead takes a pointer to the context it should
initialize, among other changes.

I also notice that MemoryContextCreate is rarely called directly in the
PostgreSQL code in the places I would expect; instead
AllocSetContextCreateExtended is used.

So is the preferred approach for extensions to avoid MemoryContextCreate
entirely, and instead use AllocSetContextCreate for PG < 11 and
AllocSetContextCreateExtended for PG >= 11?

Sorry if my questions don't make any sense.  Still trying to grapple with
all these memory context functions and how each is different from the other.

For reference, one of the pieces of code we have in place that broke looks
something like the snippet below; I think we have at least 5 more such
instances sprinkled through the PostGIS code base.

https://git.osgeo.org/gitea/postgis/postgis/src/branch/svn-trunk/libpgcommon/lwgeom_transform.c#L550



MemoryContext PJMemoryContext;

POSTGIS_DEBUGF(3, "adding SRID %d with proj4text \"%s\" to query cache at index %d",
               srid, proj_str, PROJ4Cache->PROJ4SRSCacheCount);

PJMemoryContext = MemoryContextCreate(T_AllocSetContext, 8192,
                                      &PROJ4SRSCacheContextMethods,
                                      PROJ4Cache->PROJ4SRSCacheContext,
                                      "PostGIS PROJ4 PJ Memory Context");



Thanks,
Regina




Re: Protect syscache from bloating with negative cache entries

2017-12-19 Thread Kyotaro HORIGUCHI
At Mon, 18 Dec 2017 12:14:24 -0500, Robert Haas wrote:
> On Mon, Dec 18, 2017 at 11:46 AM, Andres Freund  wrote:
> > I'm not 100% convinced either - but I also don't think it matters all
> > that terribly much. As long as the overall hash hit rate is decent,
> > minor increases in the absolute number of misses don't really matter
> > that much for syscache imo.  I'd personally go for something like:
> >
> > 1) When about to resize, check if there are entries of generation -2
> >    around.
> >
> >    Don't resize if more than 15% of entries could be freed. Also, stop
> >    reclaiming at that threshold, to avoid unnecessarily purging cache
> >    entries.
> >
> >    Using two generations allows a bit more time for cache entries to be
> >    marked as fresh before the next resize.
> >
> > 2) While resizing increment generation count by one.
> >
> > 3) Once a minute, increment generation count by one.
> >
> >
> > The one thing I don't quite have a good handle on is how much cache
> > reclamation, if any, to do at 3). We don't really want to throw away
> > all the caches just because a connection has been idle for a few
> > minutes; in a connection pool that can happen occasionally. I think I'd
> > for now *not* do any reclamation except at resize boundaries.
> 
> My starting inclination was almost the opposite.  I think that you
> might be right that a minute or two of idle time isn't sufficient
> reason to flush our local cache, but I'd be inclined to fix that by
> incrementing the generation count every 10 minutes or so rather than
> every minute, and still flush things more than 1 generation old.  The
> reason for that is that I think we should ensure that the system
> doesn't sit there idle forever with a giant cache.  If it's not using
> those cache entries, I'd rather have it discard them and rebuild the
> cache when it becomes active again.

I see three kinds of syscache entries.

A. An entry for an actually existing object.

  This is the ordinary kind of syscache entry. It does not need to be
  removed, but it can be removed after it has gone unaccessed for a
  certain period of time.

B. An entry for an object which once existed but no longer does.

  This can be removed any time after the removal of the object, and it
  is a main cause of the stats bloat or relcache bloat that motivated
  this thread. We can tell whether entries of this kind are removable
  using the cache invalidation mechanism (the patch upthread).

  We can queue the OIDs that identify the entries to remove, then
  actually remove them at the next resize. (This could itself become
  another source of bloat, so we could forcibly flush a hash when the
  OID list grows longer than some threshold.)

C. An entry for an object that simply does not exist (a negative entry).

  I'm not sure how we should treat this, since the usefulness of such an
  entry depends purely on whether it will be accessed again at some
  point. But we could apply the same assumption as for A.


> Now, I also see your point about trying to clean up before
> resizing.  That does seem like a good idea, although we have to be
> careful not to be too eager to clean up there, or we'll just end up
> artificially limiting the cache size when it's unwise to do so.  But I
> guess that's what you meant by "Also, stop reclaiming at that
> threshold, to avoid unnecessary purging cache entries."  I think the
> idea you are proposing is that:
> 
> 1. The first time we are due to expand the hash table, we check
> whether we can forestall that expansion by doing a cleanup; if so, we
> do that instead.
> 
> 2. After that, we just expand.
> 
> That seems like a fairly good idea, although it might be a better idea
> to allow cleanup if enough time has passed.  If we hit the expansion
> threshold twice an hour apart, there's no reason not to try cleanup
> again.

A session that intermittently executes queries, each running in a very
short time, could be considered an example workload where cleanup based
on such criteria is unwelcome. But the syscache won't bloat in that
case.
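
To make the mechanism above concrete, here is a hedged, standalone sketch of
the resize-time check being discussed (all names are invented for
illustration; this is not actual catcache code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustration only: a cache entry carrying the generation it was last
 * touched in. */
typedef struct CacheEntry
{
    struct CacheEntry *next;
    uint32_t    generation;
    /* ... payload would go here ... */
} CacheEntry;

/*
 * Before enlarging the hash table, count entries at least two generations
 * old.  If 15% or more of all entries could be freed, report that
 * reclaiming them (instead of resizing) is worthwhile.
 */
static bool
reclaim_instead_of_resize(CacheEntry **buckets, int nbuckets,
                          int nentries, uint32_t current_generation)
{
    int         reclaimable = 0;

    for (int i = 0; i < nbuckets; i++)
    {
        for (CacheEntry *e = buckets[i]; e != NULL; e = e->next)
        {
            if (current_generation - e->generation >= 2)
                reclaimable++;
        }
    }

    return reclaimable * 100 >= nentries * 15;
}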

> Generally, the way I'm viewing this is that a syscache entry means
> paying memory to save CPU time.  Each 8kB of memory we use to store
> system cache entries is one less block we have for the OS page cache
> to hold onto our data blocks.  If we had an oracle (the kind from

Sure

> Delphi, not Redwood City) that told us with perfect accuracy when to
> discard syscache entries, it would throw away syscache entries

Except for kind B above. That logic seems somewhat alien to the
time-based cleanup, but it can be the countermeasure against quick
bloat of some syscaches.

> whenever the marginal execution-time performance we could buy from
> another 8kB in the page cache is greater than the marginal
> execution-time performance we could buy from those syscache entries.
> In reality, it's hard to know which of those things is of greater
> value.  If the system isn't meaningfully memory-constrained, we ought
> to just always hang onto the syscache entries, as we 

Re: access/parallel.h lacks PGDLLIMPORT

2017-12-19 Thread Amit Kapila
On Thu, Dec 14, 2017 at 8:42 PM, Robert Haas  wrote:
> On Wed, Dec 13, 2017 at 8:19 PM, Thomas Munro
>  wrote:
>> I suppose that extensions are supposed to be allowed to use the
>> facilities in access/parallel.h.  I noticed in passing when I wrote a
>> throwaway test harness that my Windows build drone complained:
>>
>> test_sharedtuplestore.obj : error LNK2001: unresolved external symbol
>> ParallelWorkerNumber
>> [C:\projects\postgres\test_sharedtuplestore.vcxproj]
>> .\Release\test_sharedtuplestore\test_sharedtuplestore.dll : fatal
>> error LNK1120: 1 unresolved externals
>> [C:\projects\postgres\test_sharedtuplestore.vcxproj]
>>
>> I suppose that all three of these might need that, if they're part of
>> the API for parallel worker management:
>>
>> extern volatile bool ParallelMessagePending;
>> extern int  ParallelWorkerNumber;
>> extern bool InitializingParallelWorker;
>>
>> I'm less sure about the other two but at least ParallelWorkerNumber is
>> essential for anything that needs to coordinate access to input/output
>> arrays or similar.
>
> I can't really think of a reason for extensions to need to access
> ParallelMessagePending.  InitializingParallelWorker could be useful if
> the extension is doing something strange with custom GUCs.
> ParallelWorkerNumber is useful for the reason you state.
>

I also think it is good to allow ParallelWorkerNumber to be used in
extensions.  Attached is a patch for the same.  I think for the other two we
should wait until there is really a good use case for them.
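
Presumably the attached patch boils down to something like this one-line
change in src/include/access/parallel.h (shown as a sketch here, since the
attachment itself is binary):

-extern int  ParallelWorkerNumber;
+extern PGDLLIMPORT int ParallelWorkerNumber;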

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


dllimport_parallelworkernum_v1.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2017-12-19 Thread Nikhil Sontakke
> I think we would need to fire invalidations at COMMIT PREPARED, yet
> logically decode them at PREPARE.
>
Yes, we need the invalidations in order to logically decode at PREPARE, and
then we need the invalidations to be executed at COMMIT PREPARED time as well.

DecodeCommit() needs to know, when it's processing a COMMIT PREPARED,
whether the transaction was decoded at PREPARE time. The main issue is
that we cannot expect the ReorderBufferTXN structure which was created
at PREPARE time to still be around when the COMMIT PREPARED gets called.
The earlier patch was not cleaning this structure up at PREPARE and was
adding an is_prepared flag to it so that COMMIT PREPARED knew that it
was decoded at PREPARE time. That structure may very well not be around
if you restart between PREPARE and COMMIT PREPARED, for example.

So now, it's the onus of the prepare filter callback to always tell us
whether a given transaction was decoded at PREPARE time or not.
We hand over the ReorderBufferTXN structure (it can be NULL), the xid
and the gid, and the prepare filter tells us what to do. Always. The
is_prepared flag can be cached in the txn structure to aid in
re-lookups, but if it's not set, the filter could do an xid lookup, gid
inspection and other shenanigans to give us the same answer on every
invocation.
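
From the description above, the prepare filter presumably has roughly the
following shape; this is a hypothetical reconstruction for illustration only,
and the real signature in the attached patch may differ:

#include "postgres.h"
#include "replication/reorderbuffer.h"

/*
 * Must return the same answer for a given (xid, gid) on every call, even
 * when txn is NULL, e.g. after a restart between PREPARE and COMMIT
 * PREPARED.
 */
typedef bool (*FilterPrepareCB) (ReorderBufferTXN *txn,     /* may be NULL */
                                 TransactionId xid,
                                 const char *gid);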

Because of the above, we can very well clean up the ReorderBufferTXN at
PREPARE time; it need not hang around until COMMIT PREPARED gets
called, which is a good win in terms of resource management.

My test cases pass (including the scenario described earlier) with the
above code changes in place.

I have also added crash-testing TAP test cases; they uncovered
a bug in the prepare redo restart code path, which I fixed. I believe
this patch is in a very stable state now. Multiple runs of the crash TAP
test pass without issues. Multiple runs of "make check-world" with
cassert enabled also pass without issues.

Note that this patch does not contain the HeapTupleSatisfiesVacuum
changes. I believe we need changes to HeapTupleSatisfiesVacuum given
that logical decoding changes the assumption that catalog tuples
belonging to a transaction which never committed can be reclaimed
immediately. With 2PC logical decoding or streaming logical decoding,
we can always have a split time window in which the ongoing decode
cycle needs those tuples. The solution is that even for aborted
transactions, we do not return HEAPTUPLE_DEAD if the transaction id is
newer than the OldestXmin (same logic we use for deleted tuples of
committed transactions). We can do this only for catalog table rows
(both system and user defined) to limit the scope of impact. In any
case, this needs to be a separate patch along with a separate
discussion thread.

Peter, I will submit a follow-on patch with documentation changes
soon. But this patch is complete IMO, with all the required 2PC
logical decoding functionality.

Comments and feedback are most welcome.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


2pc_logical_19_12_17_without_docs.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2017-12-19 Thread Beena Emerson
Hi David,

Thank you for reviewing and looking at this.

I have attached the WIP patch which incorporates some of Robert's
comments and is rebased over Amit's v14 patch.

On Sat, Dec 16, 2017 at 11:35 AM, David Rowley
 wrote:
> On 13 December 2017 at 00:33, Beena Emerson  wrote:
>> PFA the updated patch, this can be applied over the v13 patches [1]
>> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3
>
> Hi Beena,
>
> Thanks for posting an updated patch.
>
> I've been looking over this and I think that the use of the
> PartitionDispatch in set_append_subplan_indexes is not correct. What
> we need here is the index of the Append's subnode and that's not what
> RelationGetPartitionDispatchInfo() gives you. Remember that some
> partitions could have been pruned away already during planning.
>
> This quick example shows that the partition selection is not correct.
>
> create table p (a int, b int) partition by range (a);
>
> create table p_a_neg partition of p for values from (minvalue) to (0)
> partition by range (b);
> create table p_a_pos partition of p for values from (0) to (maxvalue)
> partition by range (b);
>
> create table p_a_neg_b_neg partition of p_a_neg for values from
> (minvalue) to (0);
> create table p_a_neg_b_pos partition of p_a_neg for values from (0) to
> (maxvalue);
>
> create table p_a_pos_b_neg partition of p_a_pos for values from
> (minvalue) to (0);
> create table p_a_pos_b_pos partition of p_a_pos for values from (0) to
> (maxvalue);
>
> prepare q1 (int, int) as select * from p where a = $1 and b = $1;
>
> explain analyze execute q1 (-1,-1); -- this works.
>   QUERY PLAN
> --
>  Append  (cost=0.00..175.60 rows=4 width=8) (actual time=1.099..1.099
> rows=0 loops=1)
>Runtime Partition Pruning: ((a = $1) AND (b = $1))
>->  Seq Scan on p_a_neg_b_neg  (cost=0.00..43.90 rows=1 width=8)
> (actual time=0.023..0.023 rows=0 loops=1)
>  Filter: ((a = $1) AND (b = $1))
>->  Seq Scan on p_a_neg_b_pos  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>  Filter: ((a = $1) AND (b = $1))
>->  Seq Scan on p_a_pos_b_neg  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>  Filter: ((a = $1) AND (b = $1))
>->  Seq Scan on p_a_pos_b_pos  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>  Filter: ((a = $1) AND (b = $1))
> (12 rows)
>
> explain analyze execute q1 (-1,1); -- should scan p_a_neg_b_pos, but does not.
>   QUERY PLAN
> --
>  Append  (cost=0.00..175.60 rows=4 width=8) (actual
> time=758996.359..758996.359 rows=0 loops=1)
>Runtime Partition Pruning: ((a = $1) AND (b = $1))
>->  Seq Scan on p_a_neg_b_neg  (cost=0.00..43.90 rows=1 width=8)
> (actual time=0.056..0.056 rows=0 loops=1)
>  Filter: ((a = $1) AND (b = $1))
>->  Seq Scan on p_a_neg_b_pos  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>  Filter: ((a = $1) AND (b = $1))
>->  Seq Scan on p_a_pos_b_neg  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>  Filter: ((a = $1) AND (b = $1))
>->  Seq Scan on p_a_pos_b_pos  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>  Filter: ((a = $1) AND (b = $1))
> (12 rows)
>
> So, I started to look at what the best way to put this right might be.
> I see that since Parallel Append was committed that the subnodes are
> now sorted in cost order inside create_append_path(), so likely we'll
> want to delay figuring out the subpath list indexes until after that's
> done since sorting would scramble our index arrays. We could simply
> look at the subpaths at the end of create_append_path() and create
> some sort of new matrix type that can accept the output of Amit's
> get_partitions_from_clauses() and translate that Bitmapset into the
> subpath indexes (another Bitmapset). This will also need to work for
> sub-partitions too, so this matrix must be some sort of tree that we
> can descend into when we see that get_partitions_from_clauses returned
> a bit for a sub-partition instead of a leaf-partition.

Yes, the change in sort order means that the current
append_paths_array cannot be used for Parallel Append and new logic
has to be devised. I have not thought it through yet, but your method
seems like a good way to go. Currently I have worked on the Parallel
bit assuming that the append_paths_array holds the correct
subplan_index.
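
For illustration, here is a hedged sketch of the translation being described,
with invented names (this is not code from either patch): map the partition
indexes that survive pruning onto Append subplan indexes using a map built
once the final order of the subpaths is known.

#include "postgres.h"
#include "nodes/bitmapset.h"

static Bitmapset *
translate_parts_to_subplans(const Bitmapset *live_parts,
                            const int *part_to_subplan, /* -1 = pruned at plan time */
                            int nparts)
{
    Bitmapset  *result = NULL;
    int         i = -1;

    while ((i = bms_next_member(live_parts, i)) >= 0)
    {
        if (i < nparts && part_to_subplan[i] >= 0)
            result = bms_add_member(result, part_to_subplan[i]);
    }
    return result;
}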

>
> I bashed this idea around a bit and I came up with the attached. It's
> very far from complete and in a very WIP state. I've not really done
> anything to make the correct clause list available in nodeAppend.c
> yet, but I think the code that's there is worthy of a look. I've not
> done that much work on the new choose_next_subplan* func

Re: [HACKERS] Runtime Partition Pruning

2017-12-19 Thread Beena Emerson
Hello,

On Mon, Dec 18, 2017 at 4:03 PM, David Rowley
 wrote:
>
>> We could do something similar here using a similar code structure.  Maybe,
>> add a ExecSetupPartitionRuntimePruning() in execPartition.c (mimicking
>> ExecSetupPartitionTupleRouting), that accepts AppendState node.
>> Furthermore, it might be a good idea to have something similar to
>> ExecFindPartition(), say, ExecGetPartitions().  That is, we have new
>> functions for run-time pruning that are counterparts to corresponding
>> functions for tuple routing.
>
> Seems to me in this case we're better to build this structure during
> planning and save it with the plan so that it can be used over and
> over, rather than building it again and again each time the plan is
> executed. Likely a common use case for run-time pruning is when the
> plan is going to be used multiple times with different parameters, so
> we really don't want to repeat any work that we don't have to here.
>

I agree. It would be better to avoid building the structure during execution.
PFA the updated patch.

--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-Implement-runtime-partiton-pruning_v7_WIP.patch
Description: Binary data


Estimate maintenance_work_mem for CREATE INDEX

2017-12-19 Thread Oleksandr Shulgin
(cross-posting admin and hackers)

Hello,

I wonder if I'm alone in my wish to have a way of estimating how much
maintenance work memory would suffice to allocate for a session when
creating an index, so as to avoid spilling to disk?

Recently I had to re-create some indexes on a 9.6 server, and I had some
information about the on-disk index sizes: one was around 30 GB, the other a
bit over 60 GB according to \di+ output.  The total number of live tuples in
the table itself was close to 1.3e+9, and the table had an estimated 25% bloat.

I had some spare memory on the machine, so I gave the session 60 GB for
maintenance_work_mem and expected that at least the smaller of the two would
fit in memory completely.  To my surprise that didn't suffice, and both
indexes were built with some disk spill.

Is anyone aware of a query to estimate the memory requirements for CREATE
INDEX [CONCURRENTLY]?

I've looked in the postgres wiki, but didn't find anything to that end.
Nor searching the archives of pgsql-admin did help.

I understand that there were some changes in recent releases related to
memory allocation (e.g. allowing huge allocation in 9.4), but at least
targeting 9.6 or 10 would make sense.  There are also a lot of ways in which
one CREATE INDEX can differ from another, but in the simplest case, where you
have fixed-width columns and are building the full index (i.e. no WHERE
clause), it should be possible.

I'm not rushing to dig through the source to add up all the sizeof()s yet:
waiting on your replies and suggestions. ;-)
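
For what it's worth, here is a very rough, hedged back-of-the-envelope sketch
of why 60 GB might not have been enough; all of the per-tuple constants below
are assumptions made for illustration, not values taken from the PostgreSQL
source:

#include <stdio.h>

int
main(void)
{
    double  ntuples = 1.3e9;        /* live tuples, from the figures above */
    double  sort_slot = 24;         /* assumed per-tuple sort bookkeeping, bytes */
    double  index_tuple = 16;       /* assumed small fixed-width index tuple, bytes */
    double  alloc_overhead = 16;    /* assumed allocator overhead per tuple, bytes */

    printf("rough in-memory sort estimate: %.0f GB\n",
           ntuples * (sort_slot + index_tuple + alloc_overhead)
           / (1024.0 * 1024 * 1024));
    return 0;
}

With these assumed numbers the estimate already lands around 68 GB, i.e. above
the 60 GB that was configured, which would at least be consistent with both
builds spilling to disk.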

Cheers!
-- 
Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176
127-59-707


Basebackups reported as idle

2017-12-19 Thread Magnus Hagander
AFAICT, base backups running on the replication protocol are always
reported as "idle" in pg_stat_activity. This seems to have been an
oversight in the "include walsender backends in pg_stat_activity" change in
10, which does cover walsenders in general, just not the ones sending base
backups (and was later improved on with the "include all non-standard
backends" patch).

Unlike the regular walsender, it also has to set the state back to IDLE, since
you can actually finish a base backup without disconnecting.

PFA a patch that fixes this. I think this is a bugfix to backpatch; I don't
think it has a large risk of breaking things. Thoughts?

Also, in setting this, there is no real way to differentiate between a
regular walsender and a basebackup walsender, other than looking at the
wait events. They're both listed as walsenders. Should there be?  (That
might not be as easily backpatchable, so keeping that as a separate one)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 05ca95bac2..9fc031b141 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -697,6 +697,9 @@ SendBaseBackup(BaseBackupCmd *cmd)
 
 	WalSndSetState(WALSNDSTATE_BACKUP);
 
+	/* Report to pgstat that this process is running */
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
 	if (update_process_title)
 	{
 		char		activitymsg[50];
@@ -707,6 +710,8 @@ SendBaseBackup(BaseBackupCmd *cmd)
 	}
 
 	perform_base_backup(&opt);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
 static void


Re: Estimate maintenance_work_mem for CREATE INDEX

2017-12-19 Thread Oleksandr Shulgin
On Tue, Dec 19, 2017 at 10:47 AM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> (cross-posting admin and hackers)
>
> Hello,
>
> I wonder if I'm alone in my wish to have a way for estimating how much
> maintenance work memory would suffice to allocate for a session when
> creating an index and avoid spilling to disk?
>
> Recently I had to re-create some indexes on a 9.6 server and I had some
> input on the on-disk index size: one was around 30 GB, the other -- a bit
> over 60 GB according to \di+ output.  The total number of live tuples in
> the table itself was close to 1.3e+9, the table had an estimated 25% bloat.
>
> I had some spare memory on the machine so I've given it 60 GB for
> maintenance_work_mem and expected that at least the smaller of the two will
> fit in memory completely.  To my surprise that didn't suffice and both
> indexes were building with some disk spill.
>
> Is anyone aware of a query to estimate the memory requirements for CREATE
> INDEX [CONCURRENTLY]?
>
> I've looked in the postgres wiki, but didn't find anything to that end.
> Nor searching the archives of pgsql-admin did help.
>
> I understand that there were some changes in recent releases related to
> memory allocation (e.g. allowing huge allocation in 9.4), but at least
> targeting 9.6 or 10 would make sense.  There are also a lot of ways how one
> CREATE INDEX can be different from the other, but in the most simple case
> where you have fixed-width columns and building the full index (i.e. no
> WHERE clause), it should be possible.
>

Now I see I failed to mention that this is the default btree index with all
default options.  Obviously other index types can have very different memory
requirements.

Not hasting to look in the source to calculate all the sizeof()s yet:
> waiting on your reply and suggestions. ;-)
>

If there were an option in the database itself to provide such an estimate,
we wouldn't even need to figure out estimation queries.
"EXPLAIN CREATE INDEX", anyone?

Regards,
-- 
Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176
127-59-707


Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-19 Thread Amit Kapila
On Thu, Dec 14, 2017 at 3:05 AM, Robert Haas  wrote:
> On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila  wrote:
>
>> This also doesn't appear to be completely safe.  If we add
>> proc_exit(1) after attaching to error queue (say after
>> pq_set_parallel_master) in the worker, then it will lead to *hang* as
>> anyone_alive will still be false and as it will find that the sender
>> is set for the error queue, it won't return any failure.  Now, I think
>> even if we check worker status (which will be stopped) and break after
>> the new error condition, it won't work as it will still return zero
>> rows in the case reported by you above.
>
> Hmm, there might still be a problem there.  I was thinking that once
> the leader attaches to the queue, we can rely on the leader reaching
> "ERROR: lost connection to parallel worker" in HandleParallelMessages.
> However, that might not work because nothing sets
> ParallelMessagePending in that case.  The worker will have detached
> the queue via shm_mq_detach_callback, but the leader will only
> discover the detach if it actually tries to read from that queue.
>

I think it would be much easier to fix this problem if we had some way
to differentiate whether the worker has stopped gracefully or not.
Do you think it makes sense to introduce such a state in the background
worker machinery?
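
Purely as an illustration of the idea (invented names, not an actual
PostgreSQL API), the worker could record how it exited somewhere the leader
can see:

/* Hypothetical sketch: value recorded by the worker at exit, e.g. in the
 * shared memory segment it shares with the leader. */
typedef enum WorkerExitKind
{
    WORKER_EXIT_UNKNOWN,        /* never reported, or died abruptly */
    WORKER_EXIT_CLEAN           /* reached a normal shutdown path */
} WorkerExitKind;

The leader could then treat WORKER_EXIT_UNKNOWN plus a detached error queue as
an error rather than as end-of-data.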

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



non-bulk inserts and tuple routing

2017-12-19 Thread Amit Langote
Hi.

I have a patch that rearranges the code around partition tuple-routing,
such that allocation of per-partition objects (ResultRelInfo,
TupleConversionMap, etc.) is delayed until a given partition is actually
inserted into (i.e., a tuple is routed to it).  I can see a good win for
non-bulk inserts with the patch, and it is implemented such that it
doesn't affect the bulk-insert case much.
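
To make that concrete, here is a hedged sketch of the delayed-initialization
pattern, with invented names (this is not the patch's actual code):

/* Purely illustrative: build per-partition objects only on first use. */
typedef struct PartitionInfo PartitionInfo;     /* per-partition state, opaque here */

typedef struct RoutingState
{
    int         nparts;
    PartitionInfo **parts;      /* all NULL until a tuple is first routed there */
} RoutingState;

/* assumed helper: locks the partition and builds its ResultRelInfo,
 * tuple-conversion map, and so on */
extern PartitionInfo *make_partition_info(RoutingState *rs, int partidx);

static PartitionInfo *
get_partition_info(RoutingState *rs, int partidx)
{
    if (rs->parts[partidx] == NULL)
        rs->parts[partidx] = make_partition_info(rs, partidx);
    return rs->parts[partidx];
}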

Performance numbers:

* Uses following hash-partitioned table:

create table t1 (a int, b int) partition by hash (a);
create table t1_x partition of t1 for values with (modulus M, remainder R)
...


* Non-bulk insert uses the following code (insert 100,000 rows one-by-one):

do $$
begin
  for i in 1..100000 loop
    insert into t1 values (i, i+1);
  end loop;
end; $$;

* Times in milliseconds:

#parts        HEAD    Patched

 8   6216.300   4977.670
16   9061.388   6360.093
32  14081.656   8752.405
64  24887.110  13919.384
   128  45926.251  24582.411
   256  88088.084  45490.894

As you can see the performance can be as much as 2x faster with the patch,
although time taken still increases as the number of partitions increases,
because we still lock *all* partitions at the beginning.

* Bulk-inserting 100,000 rows using COPY:

copy t1 from '/tmp/t1.csv' csv;

* Times in milliseconds:

#parts      HEAD    Patched

     8   458.301    450.875
    16   409.271    510.723
    32   500.960    612.003
    64   430.687    795.046
   128   449.314    565.786
   256   493.171    490.187

Not much harm here, although numbers are a bit noisy.

Patch is divided into 4, first 3 of which are refactoring patches.

I know this patch will conflict severely with [1] and [2], so it's fine if
we consider applying these later.  Will add this to next CF.

Thanks,
Amit

[1] https://commitfest.postgresql.org/16/1023/

[2] https://commitfest.postgresql.org/16/1184/
From a87be8a84d467d65cc0b6cf02655fc3b2b9a458f Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 19 Dec 2017 10:43:45 +0900
Subject: [PATCH 1/4] Teach CopyFrom to use ModifyTableState for tuple-routing

This removes all fields of CopyStateData that were meant for
tuple routing and instead uses ModifyTableState that has all those
fields, including transition_tupconv_maps.  In COPY's case,
transition_tupconv_maps is only required if tuple routing is being
used, so it's safe.
---
 src/backend/commands/copy.c | 79 -
 1 file changed, 42 insertions(+), 37 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 254be28ae4..c82103e1c5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -166,14 +166,7 @@ typedef struct CopyStateData
boolvolatile_defexprs;  /* is any of defexprs volatile? 
*/
List   *range_table;
 
-   PartitionDispatch *partition_dispatch_info;
-   int num_dispatch;   /* Number of entries in the 
above array */
-   int num_partitions; /* Number of members in the 
following arrays */
-   ResultRelInfo **partitions; /* Per partition result relation pointers */
-   TupleConversionMap **partition_tupconv_maps;
-   TupleTableSlot *partition_tuple_slot;
TransitionCaptureState *transition_capture;
-   TupleConversionMap **transition_tupconv_maps;
 
/*
 * These variables are used to reduce overhead in textual COPY FROM.
@@ -2289,6 +2282,7 @@ CopyFrom(CopyState cstate)
ResultRelInfo *resultRelInfo;
ResultRelInfo *saved_resultRelInfo = NULL;
EState *estate = CreateExecutorState(); /* for ExecConstraints() */
+   ModifyTableState *mtstate = makeNode(ModifyTableState);
ExprContext *econtext;
TupleTableSlot *myslot;
MemoryContext oldcontext = CurrentMemoryContext;
@@ -2478,22 +2472,28 @@ CopyFrom(CopyState cstate)
TupleTableSlot *partition_tuple_slot;
int num_parted,
num_partitions;
-
-   ExecSetupPartitionTupleRouting(NULL,
+   ModifyTable *node = makeNode(ModifyTable);
+
+   /* Just need make this field appear valid. */
+   node->nominalRelation = 1;
+   mtstate->ps.plan = (Plan *) node;
+   mtstate->ps.state = estate;
+   mtstate->resultRelInfo = resultRelInfo;
+   ExecSetupPartitionTupleRouting(mtstate,
   
cstate->rel,
-  1,
+  
node->nominalRelation,
   
estate,

Re: non-bulk inserts and tuple routing

2017-12-19 Thread Ashutosh Bapat
On Tue, Dec 19, 2017 at 3:36 PM, Amit Langote
 wrote:
>
> * Bulk-inserting 100,000 rows using COPY:
>
> copy t1 from '/tmp/t1.csv' csv;
>
> * Times in milliseconds:
>
> #parts      HEAD    Patched
>
>      8   458.301    450.875
>     16   409.271    510.723
>     32   500.960    612.003
>     64   430.687    795.046
>    128   449.314    565.786
>    256   493.171    490.187

While the earlier numbers were monotonically increasing with the number of
partitions, these numbers aren't. For example, the number on HEAD with 8
partitions is higher than that with 128 partitions. That's kind of weird.
Maybe something is wrong with the measurement. Do we see similar instability
when bulk-inserting into an unpartitioned table?
Also, the numbers against 64 partitions are really bad. That's almost
2x slower.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Add hint about replication slots when nearing wraparound

2017-12-19 Thread Feike Steenbergen
Hi,

While doing some wraparound debugging, I saw the hint regarding upcoming
wraparound did not include the problem of having a stale replication
slot (which I'm actually using to force wraparound issues).

I remember a few discussions where a stale replication slot was actually
the culprit in these situations.

Something like the attached maybe?
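
Since the attachment shows up as binary here, a guess at the shape of such a
change (the surrounding hint text is abbreviated and may not match the source
exactly):

-        errhint("... You might also need to commit or roll back old prepared transactions.")
+        errhint("... You might also need to commit or roll back old prepared transactions,"
+                " or drop stale replication slots.")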

regards,

Feike


0001-Add-hint-about-replication-slots-for-wraparound.patch
Description: Binary data


Re: non-bulk inserts and tuple routing

2017-12-19 Thread Amit Langote
Hi Ashutosh.

On 2017/12/19 19:12, Ashutosh Bapat wrote:
> On Tue, Dec 19, 2017 at 3:36 PM, Amit Langote
>  wrote:
>>
>> * Bulk-inserting 100,000 rows using COPY:
>>
>> copy t1 from '/tmp/t1.csv' csv;
>>
>> * Times in milliseconds:
>>
>> #parts      HEAD    Patched
>>
>>      8   458.301    450.875
>>     16   409.271    510.723
>>     32   500.960    612.003
>>     64   430.687    795.046
>>    128   449.314    565.786
>>    256   493.171    490.187
> 
> While the earlier numbers were monotonically increasing with number of
> partitions, these numbers don't. For example the number on HEAD with 8
> partitions is higher than that with 128 partitions as well. That's
> kind of weird. Maybe something is wrong with the measurement.

In the bulk-insert case, we initialize partitions only once, because the
COPY that loads those 100,000 rows is executed just once.

Whereas in the non-bulk insert case, we initialize partitions (lock,
allocate various objects) 100,000 times, because that's how many times the
INSERT is executed, once for each of 100,000 rows to be inserted.

Without the patch, the object initialization occurs N times per INSERT, where
N is the number of partitions.  With the patch it occurs just once -- only for
the partition to which the row was routed.  The time required, although
smaller with the patch, still increases monotonically, because the patch
didn't do anything about locking all partitions.

Does that make sense?

> Do we see
> similar unstability when bulk inserting in an unpartitioned table?
> Also, the numbers against 64 partitions are really bad. That's almost
> 2x slower.

Sorry, as I said the numbers I initially posted were a bit noisy.  I just
re-ran that COPY against the patched version and got the following numbers:

#parts    Patched

     8    441.852
    16    417.510
    32    435.276
    64    486.497
   128    436.473
   256    446.312

Thanks,
Amit




Re: [HACKERS] [PATCH] Lockable views

2017-12-19 Thread Yugo Nagata
On Tue, 17 Oct 2017 11:59:05 +0900 (JST)
Tatsuo Ishii  wrote:

> > I'm a bit confused. What is the difference between tables and functions
> > in a subquery with regard to view locking? I think non-view queries
> > using a subquery also do not care about changes to the tables in the
> > subquery while executing the query. I might be misunderstanding
> > the problem you mentioned.
> 
> The difference is that in the function case we are concerned with the
> function definition, while the table case needs to care about table
> definitions *and* the contents of the table.
> 
> If we are changing the table definition, AccessExclusiveLock will be
> held on the table and the update will be blocked anyway. So we
> don't need to care about table definition changes.
> 
> On the other hand, table content changes need to be taken care of,
> because no automatic locks are held in some cases. I think whether
> tables in the subquery need locking or not depends on the use case.
> 
> So I dug into the previous candidates a little bit more:
> 
> 1) Leave as it is (ignore tables appearing in a subquery)
> 
> 2) Lock all tables including in a subquery
> 
> 3) Check subquery in the view definition. If there are some tables
>involved, emit an error and abort.
> 
> I think one of the problems with #2 is that we will lock the tables
> involved in the view in random order, which could cause unwanted
> deadlocks. This is not good and I cannot see any easy way to avoid
> it. Also, some tables may not need to be locked.
> 
> The problem with #3 is that it does not help a user who wants to control
> locking by himself/herself.
> 
> So it seems #1 is the most reasonable way to deal with the problem,
> assuming that it's the user's responsibility to take appropriate locks on
> the tables in the subquery.

Thank you for your response. I agree to adopt #1.

> 
> > BTW, I found that if we have to handle subqueries in the where clause, we
> > would also have to care about subqueries in the target list... The view
> > defined below is also updatable.
> > 
> >  =# create view v7 as select (select * from tbl2 limit 1) from tbl;
> 
> The view is not updatable. You will get something like this if you try to
> update v7:
> 
> DETAIL:  Views that have no updatable columns are not automatically updatable.

Although you cannot insert into or update v7, you can delete tuples from v7,
since that just deletes tuples from table tbl regardless of any column.
However, as discussed above, if it is the user's responsibility to take
appropriate locks on the tables in subqueries in the target list, we don't
need to care about these.

> 
> On the other hand this:
> 
> create view v7 as select i, (select j from tbl2 limit 1) from tbl;
> 
> will be updatable. In this case column j of v7 will never be
> updatable. And you should do something like:
> 
> insert into v7(i) values...
> 
> In short, you don't need to care about a subquery appearing in the TLE
> as far as view locking is concerned.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hack...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Yugo Nagata 



Re: New gist vacuum.

2017-12-19 Thread Andrey Borodin
Hi hackers!

Here is the patch that deletes pages during GiST VACUUM.

> On 12 Nov 2017, at 23:20, Andrey Borodin wrote:
> 
> If author and community do not object, I want to continue work on 
> Konstantin's patch.


==Purpose==
Long story short, some time ago Konstantin Kuznetsov hacked out a patch that
made GiST VACUUM scan the index in physical order.
That scan uses a lot of memory to build a map of the whole GiST graph. If
there is not enough maintenance memory, the patch fell back to the old GiST
VACUUM. The new behavior deleted pages, while the old one (still in use) did
not.

I've rebased the patch, fixed some bugs, and decided that it tries to solve
too much in a single step.

Here is a patch which adds just the page-deletion functionality to GiST VACUUM.
If this is committed, porting the physical-scan code will be much easier.

==What is changed==
When GiST VACUUM scans the graph for removed tuples, it remembers internal
pages that reference completely empty leaf pages.
Then, in an additional step, those pages are rescanned to delete the
references and mark the leaf pages as free.

==Limitations==
At least one reference on each internal pages is left undeleted to preserve 
balancing of the tree.
Pages that has FOLLOW-RIGHT flag also are not deleted, even if empty.


Thank you for your attention, any thoughts are welcome.

Best regards, Andrey Borodin.


0001-Delete-pages-during-GiST-VACCUM.patch
Description: Binary data


Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Андрей Жиденков
A few days ago I faced a problem: a PL/pgSQL procedure works slower when
running in parallel threads. I found a correlation between the number of
assignments in the procedure code and performance. I decided to write simple
benchmark procedures and perform some tests on a PostgreSQL 9.6.5 database
installed on a server with 20 CPU cores (2 Xeon E5-2690V2 CPUs).

This benchmark showed me that a simple PL/pgSQL procedure with a simple
loop inside works slower when running in even 2 threads. Here is the
procedure:

CREATE OR REPLACE FUNCTION benchmark_test() RETURNS VOID AS $$
DECLARE
  v INTEGER; i INTEGER;
BEGIN
  for i in 1..1000 loop
v := 1;
  end loop;
END;
$$ LANGUAGE plpgsql;

What is the point? I know that PL/pgSQL performs a SELECT query to
calculate each value for an assignment, but I didn't expect it to produce
side effects like this. Is there some buffer lock or anything else involved?

I've written a post with charts and a detailed explanation of these
side effects:
http://telegra.ph/Notes-about-PlPgSQL-assignment-performance-12-19

Any help would be greatly appreciated.
--


Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Pavel Stehule
Hi

2017-12-19 12:28 GMT+01:00 Андрей Жиденков :

> Few day ago a faced a problem: Pl/PgSQL procedure works slower when
> running in parallel threads. I found the correlation between number of
> assignments in procedure code and performance. I decided to write the
> simple benchmark procedures and perform some test on PostgreSQL 9.6.5
> database installed on the server with 20 CPU cores (2 Xeon E5-2690V2 CPUs).
>
> This benchmark showed me that a simple Pl/PgSQL procedure with a simple
> loop inside works slower when running even in 2 threads. There is a
> procedure:
>
> CREATE OR REPLACE FUNCTION benchmark_test() RETURNS VOID AS $$
> DECLARE
>   v INTEGER; i INTEGER;
> BEGIN
>   for i in 1..1000 loop
> v := 1;
>   end loop;
> END;
> $$ LANGUAGE plpgsql;
>
> What is the point? I know, that Pl/PgSQL performs a SELECT query to
> calculate each value for assignment but I didn't expect that it produce
> side effects like this. If there is some buffer lock or anything else?
>

I am a little bit lost when you speak about threads. Postgres doesn't
use them.

Your test is not correct - benchmark_test should be marked as immutable.
What will the result be?

Regards

Pavel




>
> I've been written a post with charts and detailed explanation to display
> these side effects: http://telegra.ph/Notes-about-PlPgSQL-
> assignment-performance-12-19
>
> Any help would be greatly appreciated.
> --
>
>


Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Hannu Krosing
On 19.12.2017 11:36, Pavel Stehule wrote:
> Hi
>
> 2017-12-19 12:28 GMT+01:00 Андрей Жиденков  >:
>
> Few day ago a faced a problem: Pl/PgSQL procedure works slower
> when running in parallel threads. I found the correlation between
> number of assignments in procedure code and performance. I decided
> to write the simple benchmark procedures and perform some test on
> PostgreSQL 9.6.5 database installed on the server with 20 CPU
> cores (2 Xeon E5-2690V2 CPUs).
>
> This benchmark showed me that a simple Pl/PgSQL procedure with a
> simple loop inside works slower when running even in 2 threads.
> There is a procedure:
>
> CREATE OR REPLACE FUNCTION benchmark_test() RETURNS VOID AS $$
> DECLARE
>   v INTEGER; i INTEGER;
> BEGIN
>   for i in 1..1000 loop
>     v := 1;
>   end loop;
> END;
> $$ LANGUAGE plpgsql;
>
> What is the point? I know, that Pl/PgSQL performs a SELECT query
> to calculate each value for assignment but I didn't expect that it
> produce side effects like this. If there is some buffer lock or
> anything else?
>
>
> I am little bit lost when you are speaking about threads. Postgres
> doesn't use it.
>
> your test is not correct - benchmark_test should be marked as immutable.

Would marking it IMMUTABLE not cache the result and thus bypass the
actual testing?

> What will be result?
>
> Regards
>
> Pavel
>
>
>  
>
>
> I've been written a post with charts and detailed explanation to
> display these side
> effects: 
> http://telegra.ph/Notes-about-PlPgSQL-assignment-performance-12-19
> 
>
> Any help would be greatly appreciated.
> -- 
>
>

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
https://2ndquadrant.com/



Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Andrey Zhidenkov
When I run this test in 2 threads I expect that the running time will be the
same, because PostgreSQL will fork a process for the second connection and
this process will be served by a separate CPU core, since I have more than
2 cores.
Yes, the IMMUTABLE flag helps, but I think it's just because Postgres actually
executes the procedure only once.

On Tue, Dec 19, 2017 at 2:36 PM, Pavel Stehule 
wrote:

> Hi
>
> 2017-12-19 12:28 GMT+01:00 Андрей Жиденков :
>
>> Few day ago a faced a problem: Pl/PgSQL procedure works slower when
>> running in parallel threads. I found the correlation between number of
>> assignments in procedure code and performance. I decided to write the
>> simple benchmark procedures and perform some test on PostgreSQL 9.6.5
>> database installed on the server with 20 CPU cores (2 Xeon E5-2690V2 CPUs).
>>
>> This benchmark showed me that a simple Pl/PgSQL procedure with a simple
>> loop inside works slower when running even in 2 threads. There is a
>> procedure:
>>
>> CREATE OR REPLACE FUNCTION benchmark_test() RETURNS VOID AS $$
>> DECLARE
>>   v INTEGER; i INTEGER;
>> BEGIN
>>   for i in 1..1000 loop
>> v := 1;
>>   end loop;
>> END;
>> $$ LANGUAGE plpgsql;
>>
>> What is the point? I know, that Pl/PgSQL performs a SELECT query to
>> calculate each value for assignment but I didn't expect that it produce
>> side effects like this. If there is some buffer lock or anything else?
>>
>
> I am little bit lost when you are speaking about threads. Postgres doesn't
> use it.
>
> your test is not correct - benchmark_test should be marked as immutable.
> What will be result?
>
> Regards
>
> Pavel
>
>
>
>
>>
>> I've been written a post with charts and detailed explanation to display
>> these side effects: http://telegra.ph/Notes-about-PlPgSQL-assignment-
>> performance-12-19
>>
>> Any help would be greatly appreciated.
>> --
>>
>>
>


-- 
Best regards, Andrey Zhidenkov.


Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Pavel Stehule
2017-12-19 12:40 GMT+01:00 Hannu Krosing :

> On 19.12.2017 11:36, Pavel Stehule wrote:
>
> Hi
>
> 2017-12-19 12:28 GMT+01:00 Андрей Жиденков :
>
>> Few day ago a faced a problem: Pl/PgSQL procedure works slower when
>> running in parallel threads. I found the correlation between number of
>> assignments in procedure code and performance. I decided to write the
>> simple benchmark procedures and perform some test on PostgreSQL 9.6.5
>> database installed on the server with 20 CPU cores (2 Xeon E5-2690V2 CPUs).
>>
>> This benchmark showed me that a simple Pl/PgSQL procedure with a simple
>> loop inside works slower when running even in 2 threads. There is a
>> procedure:
>>
>> CREATE OR REPLACE FUNCTION benchmark_test() RETURNS VOID AS $$
>> DECLARE
>>   v INTEGER; i INTEGER;
>> BEGIN
>>   for i in 1..1000 loop
>> v := 1;
>>   end loop;
>> END;
>> $$ LANGUAGE plpgsql;
>>
>> What is the point? I know, that Pl/PgSQL performs a SELECT query to
>> calculate each value for assignment but I didn't expect that it produce
>> side effects like this. If there is some buffer lock or anything else?
>>
>
> I am little bit lost when you are speaking about threads. Postgres doesn't
> use it.
>
> your test is not correct - benchmark_test should be marked as immutable.
>
>
> Would marking it IMMUTABLE not cache the result and thus bypass the actual
> testing ?
>

CREATE OR REPLACE FUNCTION public.fx1()
 RETURNS void
 LANGUAGE plpgsql
AS $function$
begin
  for i in 1..10
  loop
raise notice '%', i;
  end loop;
end;
$function$

postgres=# do $$
postgres$# begin
postgres$#   for i in 1..2
postgres$#   loop
postgres$# perform fx1();
postgres$#   end loop;
postgres$# end;
postgres$# $$;
NOTICE:  1
NOTICE:  2
NOTICE:  3
NOTICE:  4
NOTICE:  5
NOTICE:  6
NOTICE:  7
NOTICE:  8
NOTICE:  9
NOTICE:  10
NOTICE:  1
NOTICE:  2
NOTICE:  3
NOTICE:  4
NOTICE:  5
NOTICE:  6
NOTICE:  7
NOTICE:  8
NOTICE:  9
NOTICE:  10
DO

test it.

Personally, I think this test is a little bit flawed. What is the goal?
PL/pgSQL is glue for SQL queries - nothing less, nothing more.





>
> What will be result?
>
> Regards
>
> Pavel
>
>
>
>
>>
>> I've been written a post with charts and detailed explanation to display
>> these side effects: http://telegra.ph/Notes-about-PlPgSQL-assignment-
>> performance-12-19
>>
>> Any help would be greatly appreciated.
>> --
>>
>>
>
> --
> Hannu Krosing
> PostgreSQL Consultant
> Performance, Scalability and High Availabilityhttps://2ndquadrant.com/
>
>


Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Pavel Stehule
2017-12-19 12:45 GMT+01:00 Andrey Zhidenkov :

> When I run this test in 2 threads I expect that running time will be the
> same, because PostgreSQL will fork process for the second connection and
> this process will be served by a separate CPU core because I have more than
> 2 cores.
> Yes, IMMUTABLE flag helps, but I think It's just because Postgres actually
> executes procedure only once.
>

Surely not - test it.

I am too lazy to think it through right now, but the real reason is probably
related to the execution of read-only transactions, or possibly write
transactions.

PostgreSQL is primarily an ACID database. You cannot think about it as a
scripting environment only.

Regards

Pavel


> On Tue, Dec 19, 2017 at 2:36 PM, Pavel Stehule 
> wrote:
>
>> Hi
>>
>> 2017-12-19 12:28 GMT+01:00 Андрей Жиденков :
>>
>>> Few day ago a faced a problem: Pl/PgSQL procedure works slower when
>>> running in parallel threads. I found the correlation between number of
>>> assignments in procedure code and performance. I decided to write the
>>> simple benchmark procedures and perform some test on PostgreSQL 9.6.5
>>> database installed on the server with 20 CPU cores (2 Xeon E5-2690V2 CPUs).
>>>
>>> This benchmark showed me that a simple Pl/PgSQL procedure with a simple
>>> loop inside works slower when running even in 2 threads. There is a
>>> procedure:
>>>
>>> CREATE OR REPLACE FUNCTION benchmark_test() RETURNS VOID AS $$
>>> DECLARE
>>>   v INTEGER; i INTEGER;
>>> BEGIN
>>>   for i in 1..1000 loop
>>> v := 1;
>>>   end loop;
>>> END;
>>> $$ LANGUAGE plpgsql;
>>>
>>> What is the point? I know, that Pl/PgSQL performs a SELECT query to
>>> calculate each value for assignment but I didn't expect that it produce
>>> side effects like this. If there is some buffer lock or anything else?
>>>
>>
>> I am little bit lost when you are speaking about threads. Postgres
>> doesn't use it.
>>
>> your test is not correct - benchmark_test should be marked as immutable.
>> What will be result?
>>
>> Regards
>>
>> Pavel
>>
>>
>>
>>
>>>
>>> I've been written a post with charts and detailed explanation to display
>>> these side effects: http://telegra.ph/Notes-about-PlPgSQL-assignment-pe
>>> rformance-12-19
>>>
>>> Any help would be greatly appreciated.
>>> --
>>>
>>>
>>
>
>
> --
> С уважением, Андрей Жиденков.
>


Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Pavel Stehule
2017-12-19 12:46 GMT+01:00 Pavel Stehule :

>
>
> 2017-12-19 12:40 GMT+01:00 Hannu Krosing :
>
>> On 19.12.2017 11:36, Pavel Stehule wrote:
>>
>> Hi
>>
>> 2017-12-19 12:28 GMT+01:00 Андрей Жиденков :
>>
>>> Few day ago a faced a problem: Pl/PgSQL procedure works slower when
>>> running in parallel threads. I found the correlation between number of
>>> assignments in procedure code and performance. I decided to write the
>>> simple benchmark procedures and perform some test on PostgreSQL 9.6.5
>>> database installed on the server with 20 CPU cores (2 Xeon E5-2690V2 CPUs).
>>>
>>> This benchmark showed me that a simple Pl/PgSQL procedure with a simple
>>> loop inside works slower when running even in 2 threads. There is a
>>> procedure:
>>>
>>> CREATE OR REPLACE FUNCTION benchmark_test() RETURNS VOID AS $$
>>> DECLARE
>>>   v INTEGER; i INTEGER;
>>> BEGIN
>>>   for i in 1..1000 loop
>>> v := 1;
>>>   end loop;
>>> END;
>>> $$ LANGUAGE plpgsql;
>>>
>>> What is the point? I know, that Pl/PgSQL performs a SELECT query to
>>> calculate each value for assignment but I didn't expect that it produce
>>> side effects like this. If there is some buffer lock or anything else?
>>>
>>
>> I am little bit lost when you are speaking about threads. Postgres
>> doesn't use it.
>>
>> your test is not correct - benchmark_test should be marked as immutable.
>>
>>
>> Would marking it IMMUTABLE not cache the result and thus bypass the
>> actual testing ?
>>
>
> CREATE OR REPLACE FUNCTION public.fx1()
>  RETURNS void
>  LANGUAGE plpgsql
> AS $function$
> begin
>   for i in 1..10
>   loop
> raise notice '%', i;
>   end loop;
> end;
> $function$
>
> postgres=# do $$
> postgres$# begin
> postgres$#   for i in 1..2
> postgres$#   loop
> postgres$# perform fx1();
> postgres$#   end loop;
> postgres$# end;
> postgres$# $$;
> NOTICE:  1
> NOTICE:  2
> NOTICE:  3
> NOTICE:  4
> NOTICE:  5
> NOTICE:  6
> NOTICE:  7
> NOTICE:  8
> NOTICE:  9
> NOTICE:  10
> NOTICE:  1
> NOTICE:  2
> NOTICE:  3
> NOTICE:  4
> NOTICE:  5
> NOTICE:  6
> NOTICE:  7
> NOTICE:  8
> NOTICE:  9
> NOTICE:  10
> DO
>
> test it.
>
> Personally - this test is little bit bad. What is goal? PLpgSQL is glue
> for SQL queries - nothing less, nothing more.
>

I am wrong - sorry

It needs a fake parameter

postgres=# create or replace function fx1(int)
returns void as $$
begin
  for i in 1..10
  loop
raise notice '%', i;
  end loop;
end;
$$ language plpgsql immutable;

postgres=# do $$
begin
  for i in 1..2
  loop
perform fx1(i);
  end loop;
end;
$$;
NOTICE:  1
NOTICE:  2
NOTICE:  3
NOTICE:  4
NOTICE:  5
NOTICE:  6
NOTICE:  7
NOTICE:  8
NOTICE:  9
NOTICE:  10
NOTICE:  1
NOTICE:  2
NOTICE:  3
NOTICE:  4
NOTICE:  5
NOTICE:  6
NOTICE:  7
NOTICE:  8
NOTICE:  9
NOTICE:  10
DO


>
>
>
>
>>
>> What will be result?
>>
>> Regards
>>
>> Pavel
>>
>>
>>
>>
>>>
>>> I've been written a post with charts and detailed explanation to display
>>> these side effects: http://telegra.ph/Notes-about-PlPgSQL-assignment-pe
>>> rformance-12-19
>>>
>>> Any help would be greatly appreciated.
>>> --
>>>
>>>
>>
>> --
>> Hannu Krosing
>> PostgreSQL Consultant
>> Performance, Scalability and High Availabilityhttps://2ndquadrant.com/
>>
>>
>


Re: Top-N sorts verses parallelism

2017-12-19 Thread Thomas Munro
On Mon, Dec 18, 2017 at 9:29 AM, Robert Haas  wrote:
> I went through the callers to create_sort_path and the only one that
> looks like it can pass a limit is the one you and Jeff already
> identified.  So I think the question is just whether
> create_gather_merge_path needs a similar fix.

I might be missing something, but it looks like there are no cases
where we have a limit_tuples value we could use AND we're relying on
create_gather_merge_path's own ability to create sort paths.  So I
suspect there is no reason to change create_gather_merge_path itself
to deal with tuple limits.  I looked at each of its callers:

1.  create_ordered_paths is the case the patch posted earlier covers:
it has a useful limit_tuples value but it creates the sort path itself
first, so there is no need for create_gather_merge_path to be aware of
it.

2.  create_grouping_paths doesn't have limit_tuples value because
grouping always inhibits limits.

3.  generate_gather_paths is in turn called by:

3.1.  standard_join_search can't use limits (at least in general) since
it's dealing with a join.

3.2.  geqo's merge_clump is also about joins, so ditto.

3.3.  set_rel_pathlist will consider only pathkeys from existing index
scans that set_plain_rel_pathlist found, not creating new pathkeys by
sorting.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Top-N sorts verses parallelism

2017-12-19 Thread Amit Kapila
On Tue, Dec 19, 2017 at 5:24 PM, Thomas Munro
 wrote:
> On Mon, Dec 18, 2017 at 9:29 AM, Robert Haas  wrote:
>> I went through the callers to create_sort_path and the only one that
>> looks like it can pass a limit is the one you and Jeff already
>> identified.  So I think the question is just whether
>> create_gather_merge_path needs a similar fix.
>
> I might be missing something, but it looks like there are no cases
> where we have a limit_tuples value we could use AND we're relying on
> create_gather_merge_path's own ability to create sort paths.  So I
> suspect there is no reason to change create_gather_merge_path itself
> to deal with tuple limits.
>

Exactly.  I was about to post the same, and my analysis results are the
same as yours.  I think this raises the question: do we really need
cost_sort at that place, and if so, for which case?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Add support for tuple routing to foreign partitions

2017-12-19 Thread Etsuro Fujita

(2017/12/18 23:25), Alvaro Herrera wrote:

InitResultRelInfo becomes unintelligible after this patch -- it was
straightforward but adding partition_root makes things shaky.  Please
add a proper comment indicating what each argument is.


I was thinking that the comment I added to the definition of the
ResultRelInfo struct in execnodes.h would make that function
intelligible, but I agree on that point.  Please find attached a new
version of the patch adding such comments.


Best regards,
Etsuro Fujita
*** a/contrib/file_fdw/input/file_fdw.source
--- b/contrib/file_fdw/input/file_fdw.source
***
*** 176,182  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
--- 176,188 
  SELECT tableoid::regclass, * FROM pt;
  SELECT tableoid::regclass, * FROM p1;
  SELECT tableoid::regclass, * FROM p2;
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (1, 'xyzzy');
+ \t off
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (2, 'xyzzy');
+ \t off
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
  SELECT tableoid::regclass, * FROM p1;
*** a/contrib/file_fdw/output/file_fdw.source
--- b/contrib/file_fdw/output/file_fdw.source
***
*** 315,321  SELECT tableoid::regclass, * FROM p2;
  (0 rows)
  
  COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter ','); -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  CONTEXT:  COPY pt, line 2: "1,qux"
  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
  SELECT tableoid::regclass, * FROM pt;
--- 315,321 
  (0 rows)
  
  COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter ','); -- ERROR
! ERROR:  cannot route copied tuples to a foreign table
  CONTEXT:  COPY pt, line 2: "1,qux"
  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
  SELECT tableoid::regclass, * FROM pt;
***
*** 341,348  SELECT tableoid::regclass, * FROM p2;
   p2   | 2 | qux
  (2 rows)
  
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
   tableoid | a |   b   
--- 341,366 
   p2   | 2 | qux
  (2 rows)
  
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (1, 'xyzzy');
+  Insert on public.pt
+Foreign Insert on public.p1
+Insert on public.p2
+->  Result
+  Output: 1, 'xyzzy'::text
+ 
+ \t off
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
! ERROR:  cannot route inserted tuples to foreign table "p1"
! \t on
! EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (2, 'xyzzy');
!  Insert on public.pt
!Foreign Insert on public.p1
!Insert on public.p2
!->  Result
!  Output: 2, 'xyzzy'::text
! 
! \t off
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
   tableoid | a |   b   
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 7088,7093  NOTICE:  drop cascades to foreign table bar2
--- 7088,7295 
  drop table loct1;
  drop table loct2;
  -- ===
+ -- test tuple routing for foreign-table partitions
+ -- ===
+ create table pt (a int, b int) partition by list (a);
+ create table locp partition of pt for values in (1);
+ create table locfoo (a int check (a in (2)), b int);
+ create foreign table remp partition of pt for values in (2) server loopback options (table_name 'locfoo');
+ explain (verbose, costs off)
+ insert into pt values (1, 1), (2, 1);
+QUERY PLAN
+ -
+  Insert on public.pt
+Insert on public.locp
+Foreign Insert on public.remp
+  Remote SQL: INSERT INTO public.locfoo(a, b) VALUES ($1, $2)
+->  Values Scan on "*VALUES*"
+  Output: "*VALUES*".column1, "*VALUES*".column2
+ (6 rows)
+ 
+ insert into pt values (1, 1), (2, 1);
+ select tableoid::regclass, * FROM pt;
+  tableoid | a | b 
+ --+---+---
+  locp | 1 | 1
+  remp | 2 | 1
+ (2 rows)
+ 
+ select tableoid::regclass, * FROM locp;
+  tableoid | a | b 
+ --+---+---
+  locp | 1 | 1
+ (1 row)
+ 
+ select tableoid::regclass, * FROM remp;
+  tableoid | a | b 
+ --+---+---
+  remp | 2 | 1
+ (1 row)
+ 
+ explain (verbose, costs off)
+ insert into pt values (1, 2), (2, 2) returning *;
+QUERY PLAN   
+ 
+  Insert on public.pt
+Output: pt.a, pt.b
+Insert on public.locp
+Foreign Insert on public.remp
+  Remote SQL: 

Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-19 Thread Aleksander Alekseev
Hello Robert,

> I think this doesn't really show much because it's apparently limited
> by the speed of fsync() on your filesystem.  You might try running the
> test with synchronous_commit=off.

You are right, synchronous_commit=off revealed a noticeable performance
degradation. Also I realized that using log_statement=all was not very
smart either. Here are the results.

10.1, ptrack_enable=false, synchronous_commit = off

transaction type: 
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 1713550
latency average = 0.700 ms
latency stddev = 0.434 ms
tps = 5711.822110 (including connections establishing)
tps = 5712.251807 (excluding connections establishing)

10.1, ptrack_enable=true, synchronous_commit = off

transaction type: 
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 1691011
latency average = 0.710 ms
latency stddev = 0.380 ms
tps = 5636.691378 (including connections establishing)
tps = 5636.730514 (excluding connections establishing)

10.1, without ptrack, synchronous_commit = off

transaction type: 
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 1843623
latency average = 0.651 ms
latency stddev = 0.589 ms
tps = 6145.395486 (including connections establishing)
tps = 6145.441431 (excluding connections establishing)

-- 
Best regards,
Aleksander Alekseev




Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Alvaro Herrera
Andrey Zhidenkov wrote:
> When I run this test in 2 threads I expect that running time will be the
> same, because PostgreSQL will fork process for the second connection and
> this process will be served by a separate CPU core because I have more than
> 2 cores.
> Yes, IMMUTABLE flag helps, but I think It's just because Postgres actually
> executes procedure only once.

Just a guess without actually looking at the WaitEvents (which you
should do) is that this is blocking on snapshot acquisition or something
like that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Package version in PG_VERSION and version()

2017-12-19 Thread Christoph Berg
Re: Robert Haas 2017-12-17 

> Unfortunately, actually modifying the main version number breaks large
> numbers of tools and drivers that think they know what a PostgreSQL
> version number looks like, as many people who work for my employer can
> testify to from personal experience with a piece of software that
> displays a non-default version number.  I think --with-extra-version
> is therefore badly-designed and probably mostly useless in its current
> form, and as Christoph's example shows, it's not really adapted for
> the kind of string he wants to add.  I don't really care whether we
> leave --with-extra-version as-is and add something else for the kind
> of thing Christoph wants to do, or whether we add a different thing
> that does what he wants to do, but I think it's a very good idea to
> provide something along the lines of what he wants.

My idea is to put the information in a place where it is accessible,
but not obtrusive. version() seemed to be the only place for that
(server_version is just the short version), and it already has the
compiler version, so it fits, I think.

> In short, "the version number is important enough to show" != "the
> version number is important enough to break compatibility with large
> numbers of tools and drivers".

Exactly.

Added as https://commitfest.postgresql.org/16/1418/ now.

If people think it's worth it (windows?), I'll add a configure
--switch, otherwise with just reading from $VENDOR_VERSION, it's a
one-liner.

Christoph
-- 
Senior Berater, Tel.: +49 2166 9901 187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
pgp fingerprint: 5C48 FE61 57F4 9179 5970  87C6 4C5A 6BAB 12D2 A7AE



Re: [HACKERS] path toward faster partition pruning

2017-12-19 Thread David Rowley
On 19 December 2017 at 17:36, David Rowley  wrote:
> I'm sorry to say this is another micro review per code I'm stumbling
> over when looking at the run-time partition pruning stuff.

Again, another micro review. I apologise for the slow trickle of
review. Again, these are just things I'm noticing while reading
through while thinking of the run-time pruning patch.


1. The following Assert appears to be testing for the presence of
cosmic rays :-)

/*
* Determine set of partitions using provided keys, which proceeds in a
* manner determined by the partitioning method.
*/
if (keys->n_eqkeys == partkey->partnatts)
{
Assert(keys->n_eqkeys == partkey->partnatts);

Perhaps it's misplaced during a rewrite? Should be safe enough to
remove it, I'd say.

2. The following code in classify_partition_bounding_keys() misses
looking under the RelabelType for rightarg:

leftop = (Expr *) get_leftop(clause);
if (IsA(leftop, RelabelType))
leftop = ((RelabelType *) leftop)->arg;
rightop = (Expr *) get_rightop(clause);
if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
constexpr = rightop;
else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
constexpr = leftop;

This breaks the following case:

create table thisthat (a varchar not null) partition by list (a);
create table this partition of thisthat for values in('this');
create table that partition of thisthat for values in('that');
explain select * from thisthat where 'this' = a; -- does not work
 QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..54.00 rows=14 width=32)
   ->  Seq Scan on that  (cost=0.00..27.00 rows=7 width=32)
 Filter: ('this'::text = (a)::text)
   ->  Seq Scan on this  (cost=0.00..27.00 rows=7 width=32)
 Filter: ('this'::text = (a)::text)
(5 rows)


explain select * from thisthat where a = 'this'; -- works as we look
through the RelabelType on left arg.
 QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..27.00 rows=7 width=32)
   ->  Seq Scan on this  (cost=0.00..27.00 rows=7 width=32)
 Filter: ((a)::text = 'this'::text)
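
Presumably the fix is just to mirror what's already done for the left
arg, i.e. something along these lines (untested):

rightop = (Expr *) get_rightop(clause);
if (IsA(rightop, RelabelType))
    rightop = ((RelabelType *) rightop)->arg;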

3. The following code assumes there will be a commutator for the operator:

if (constexpr == rightop)
pc->op = opclause;
else
{
OpExpr   *commuted;

commuted = (OpExpr *) copyObject(opclause);
commuted->opno = get_commutator(opclause->opno);
commuted->opfuncid = get_opcode(commuted->opno);
commuted->args = list_make2(rightop, leftop);
pc->op = commuted;
}

I had to hunt for it, but it appears that you're pre-filtering clauses
with the Consts on the left and no valid commutator in
match_clauses_to_partkey. I think it's likely worth putting a comment
to mention that reversed clauses with no commutator should have been
filtered out beforehand. I'd say it's also worthy of an Assert().
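
For example, something like this right after the get_commutator() call:

commuted->opno = get_commutator(opclause->opno);
Assert(OidIsValid(commuted->opno)); /* reversed clauses without a commutator
                                     * should have been filtered out already */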

4. The spelling of "arbitrary" is incorrect in:

* partattno == 0 refers to arbirtary expressions, which get the

5. I've noticed that partition pruning varies slightly from constraint
exclusion in the following case:

create table ta (a int not null) partition by list (a);
create table ta1 partition of ta for values in(1,2);
create table ta2 partition of ta for values in(3,4);

explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is
not eliminated.
 QUERY PLAN
-------------------------------------------------------------
 Append  (cost=0.00..96.50 rows=5050 width=4)
   ->  Seq Scan on ta1  (cost=0.00..48.25 rows=2525 width=4)
 Filter: ((a <> 1) AND (a <> 2))
   ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
 Filter: ((a <> 1) AND (a <> 2))
(5 rows)


alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a
check constraint to see if can be removed.
explain select * from ta where a <> 1 and a <> 2; -- it can.
 QUERY PLAN
-------------------------------------------------------------
 Append  (cost=0.00..48.25 rows=2525 width=4)
   ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
 Filter: ((a <> 1) AND (a <> 2))
(3 rows)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-19 Thread Craig Ringer
On 18 December 2017 at 10:05, Robert Haas  wrote:

> On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer 
> wrote:
> > On 15 December 2017 at 09:24, Greg Stark  wrote:
> >> Another simpler option would be to open up a new file in the log
> >> directory
> >
> > ... if we have one.
> >
> > We might be logging to syslog, or whatever else.
> >
> > I'd rather keep it simple(ish).
>
> +1.  I still think just printing it to the log is fine.
>
>
Here we go. Implemented pretty much as outlined above. A new
pg_diag_backend(pid) function sends a new
ProcSignalReason PROCSIG_DIAG_REQUEST. It's handled by
CHECK_FOR_INTERRUPTS() and logs MemoryContextStats() output to elog(...).

I didn't want to mess with the MemoryContextMethods and expose a
printf-wrapper style typedef in memnodes.h, so I went with a hook global.
It's a diagnostic routine so I don't think that's going to be a great
bother. By default it's set to write_stderr. That just writes to vfprintf
on unix so the outcome's the same as our direct use of fprintf now.

On Windows, though, using write_stderr will make Pg attempt to write memory
context dumps to the eventlog with ERROR level if running as a service with
no console. That means we vsnprintf the string into a static buffer first.
I'm not 100% convinced of the wisdom of that because it could flood the
event log, which is usually size limited by # of events and recycled. It'd
be better to write one event than write one per memory context line, but
it's not safe to do that when OOM. I lean toward this being a win: at least
Windows users should have some hope of finding out why Pg OOM'd, which
currently they don't when it runs as a service. If we don't, we should look
at using SetStdHandle to write stderr to a secondary logfile instead.

I'm happy to just add a trivial vfprintf wrapper so we preserve exactly the
same behaviour instead, but I figured I'd start with reusing write_stderr.

I'd really like to preserve the writing to elog(...) not fprintf, because
on some systems simply finding where stderr is written can be a challenge,
if it's not going to /dev/null via some detached session. Is it in
journald? In some separate log? Being captured by the logging collector
(and probably silently eaten)? Who knows!

Using elog(...) provides a log_line_prefix and other useful info to
associate the dump with the process of interest and what it's doing at the
time.

Also, an astonishing number of deployed systems I've worked with actually
don't put the pid or anything useful in log_line_prefix to make grouping up
multi-line output practical. Which is insane. But it's only PGC_SIGHUP so
fixing it is easy enough.
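
For reference, the handling path is roughly this shape (a condensed sketch of
the flow described above; the pending-flag and helper names are illustrative,
not necessarily what the attached patch uses):

/* in procsignal_sigusr1_handler(), on SIGUSR1: */
if (CheckProcSignal(PROCSIG_DIAG_REQUEST))
    HandleDiagRequest();    /* sets a pending flag, InterruptPending and the latch */

/* ...later, in ProcessInterrupts() via CHECK_FOR_INTERRUPTS(): */
if (DiagRequestPending)
{
    DiagRequestPending = false;
    MemoryContextStats(TopMemoryContext);   /* output routed to elog() via the hook */
}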

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 663e31ffbe471b5dac18ff6322c0ea45a948350a Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 19 Dec 2017 20:47:36 +0800
Subject: [PATCH v1] Add pg_diag_backend to print memory context info

The new pg_diag_backend(pid) function signals a backend to dump
a summary of its memory context state when it next calls
CHECK_FOR_INTERRUPTS(). The memory context information is
output to the server log.

This works on any backend that uses the standard
procsignal_sigusr1_handler or explicitly handles PROCSIG_DIAG_REQUEST in
its own handler. Currently this includes normal user backends, the
walsender and autovacuum workers but not the other system processes.
Background workers must explicitly install procsignal_sigusr1_handler
for SIGUSR1 in order to handle it.
---
 doc/src/sgml/bgworker.sgml  | 10 -
 doc/src/sgml/func.sgml  | 13 ++
 src/backend/storage/ipc/procsignal.c|  3 ++
 src/backend/tcop/postgres.c | 70 +
 src/backend/utils/adt/misc.c| 16 
 src/backend/utils/init/globals.c|  1 +
 src/backend/utils/mmgr/aset.c   |  4 +-
 src/backend/utils/mmgr/mcxt.c   | 10 +++--
 src/include/catalog/pg_proc.h   |  2 +
 src/include/miscadmin.h |  4 ++
 src/include/storage/procsignal.h|  5 ++-
 src/include/utils/memutils.h|  4 ++
 src/test/regress/sql/misc_functions.sql |  9 +
 13 files changed, 143 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 4bc2b69..50bb54f 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -185,7 +185,8 @@ typedef struct BackgroundWorker
when the process is started or exits.  It should be 0 for workers registered
at postmaster startup time, or when the backend registering the worker does
not wish to wait for the worker to start up.  Otherwise, it should be
-   initialized to MyProcPid.
+   initialized to MyProcPid. Typically this sets the
+   receiving process's latch via its signal handler.
   
 
   Once running, the process can connect to a database by calling
@@ -208,7 +209,12 @@ typedef struct BackgroundWorker
allow the p

Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Andrey Zhidenkov
I've dug into the source code a little bit and found this chain:

PLPGSQL_STMT_ASSIGN -> exec_stmt_assign() -> exec_assign_expr()
-> exec_eval_expr() -> exec_run_select()
-> SPI_execute_plan_with_paramlist() -> _SPI_execute_plan(), which finally
calls PushActiveSnapshot() and PopActiveSnapshot(), which just do memory
context allocations and use malloc() to copy the snapshot.

Maybe I have missed something?

On Tue, Dec 19, 2017 at 4:34 PM, Alvaro Herrera 
wrote:

> Andrey Zhidenkov wrote:
> > When I run this test in 2 threads I expect that running time will be the
> > same, because PostgreSQL will fork process for the second connection and
> > this process will be served by a separate CPU core because I have more
> than
> > 2 cores.
> > Yes, IMMUTABLE flag helps, but I think It's just because Postgres
> actually
> > executes procedure only once.
>
> Just a guess without actually looking at the WaitEvents (which you
> should do) is that this is blocking on snapshot acquisition or something
> like that.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>



-- 
Best regards, Andrey Zhidenkov.


Re: WIP Patch: Pgbench Serialization and deadlock errors

2017-12-19 Thread Fabien COELHO


Hello Marina,


This is the fourth version of the patch for pgbench.


Consider adding the patch to the next commitfest?

--
Fabien.



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-19 Thread Pavel Stehule
Hi

2017-12-19 14:44 GMT+01:00 Craig Ringer :

> On 18 December 2017 at 10:05, Robert Haas  wrote:
>
>> On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer 
>> wrote:
>> > On 15 December 2017 at 09:24, Greg Stark  wrote:
>> >> Another simpler option would be to open up a new file in the log
>> >> directory
>> >
>> > ... if we have one.
>> >
>> > We might be logging to syslog, or whatever else.
>> >
>> > I'd rather keep it simple(ish).
>>
>> +1.  I still think just printing it to the log is fine.
>>
>>
> Here we go. Implemented pretty much as outlined above. A new
> pg_diag_backend(pid) function sends a new ProcSignalReason 
> PROCSIG_DIAG_REQUEST.
> It's handled by CHECK_FOR_INTERRUPTS() and logs MemoryContextStats() output
> to elog(...).
>
> I didn't want to mess with the MemoryContextMethods and expose a
> printf-wrapper style typedef in memnodes.h, so I went with a hook global.
> It's a diagnostic routine so I don't think that's going to be a great
> bother. By default it's set to write_stderr. That just writes to vfprintf
> on unix so the outcome's the same as our direct use of fprintf now.
>
> On Windows, though, using write_stderr will make Pg attempt to write
> memory context dumps to the eventlog with ERROR level if running as a
> service with no console. That means we vsnprintf the string into a static
> buffer first. I'm not 100% convinced of the wisdom of that because it could
> flood the event log, which is usually size limited by # of events and
> recycled. It'd be better to write one event than write one per memory
> context line, but it's not safe to do that when OOM. I lean toward this
> being a win: at least Windows users should have some hope of finding out
> why Pg OOM'd, which currently they don't when it runs as a service. If we
> don't, we should look at using SetStdHandle to write stderr to a secondary
> logfile instead.
>
> I'm happy to just add a trivial vfprintf wrapper so we preserve exactly
> the same behaviour instead, but I figured I'd start with reusing
> write_stderr.
>
> I'd really like to preserve the writing to elog(...) not fprintf, because
> on some systems simply finding where stderr is written can be a challenge,
> if it's not going to /dev/null via some detached session. Is it in
> journald? In some separate log? Being captured by the logging collector
> (and probably silently eaten)? Who knows!
>
> Using elog(...) provides a log_line_prefix and other useful info to
> associate the dump with the process of interest and what it's doing at the
> time.
>
> Also, an astonishing number of deployed systems I've worked with actually
> don't put the pid or anything useful in log_line_prefix to make grouping up
> multi-line output practical. Which is insane. But it's only PGC_SIGHUP so
> fixing it is easy enough.
>

Sorry for a small off-topic question: could this mechanism be used to log the
executed plan or the full query?

Regards

Pavel

>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread David Rowley
On 20 December 2017 at 02:48, Andrey Zhidenkov  wrote:
> PLPGSQL_STMT_ASSIGN -> exec_stmt_assign() -> exec_assign_expr() ->
> exec_eval_expr() -> exec_run_select() -> SPI_execute_plan_with_paramlist()
> -> _SPI_execute_plan() which finally calls PushActiveSnapshot() and
> PopActiveSnapshot() which just do memory context allocations and use malloc()
> to copy the snapshot.

Probably the best thing to do is to look at which functions are taking
the most time by doing a perf record for a single running instance,
then the same again with multiple instances running. Perhaps something
in there might appear in the samples more often with the multiple
instances than it does with a single instance.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP Patch: Pgbench Serialization and deadlock errors

2017-12-19 Thread Marina Polyakova

On 19-12-2017 16:52, Fabien COELHO wrote:

Hello Marina,


This is the fourth version of the patch for pgbench.


Consider adding the patch to the next commitfest?


Hi! Yes, here it is: https://commitfest.postgresql.org/16/1420/

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Notes about Pl/PgSQL assignment performance

2017-12-19 Thread Alvaro Herrera
Andrey Zhidenkov wrote:
> I've dug into the source code a little bit and found this chain:
> 
> PLPGSQL_STMT_ASSIGN -> exec_stmt_assign() -> exec_assign_expr()
> -> exec_eval_expr() -> exec_run_select()
> -> SPI_execute_plan_with_paramlist() -> _SPI_execute_plan(), which finally
> calls PushActiveSnapshot() and PopActiveSnapshot(), which just do memory
> context allocations and use malloc() to copy the snapshot.
> 
> Maybe I have missed something?

Yes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP Patch: Pgbench Serialization and deadlock errors

2017-12-19 Thread Fabien COELHO



Consider adding the patch to the next commitfest?


Hi! Yes, here it is: https://commitfest.postgresql.org/16/1420/


I think you may have to ask the cf app admin to remove the duplicate.

https://commitfest.postgresql.org/16/1419/

--
Fabien.



Re: Estimate maintenance_work_mem for CREATE INDEX

2017-12-19 Thread Greg Stark
On 19 December 2017 at 10:00, Oleksandr Shulgin
 wrote:

> If there would be an option in the database itself to provide those
> estimation, we wouldn't even need to figure out estimation queries.
> "EXPLAIN CREATE INDEX" anyone?

You're not the first to propose something like that. I think an
EXPLAIN ALTER TABLE would also be very handy -- it's currently
impossible to tell without carefully reading the source code whether a
given DDL change will require a full table scan, a full table rewrite,
or just a quick meta data update (and even in that case what strength
lock will be required). I think there are other utility statements
that make interesting heuristic decisions that would be nice to be
able to have some visibility into -- CLUSTER comes to mind.

I'm not clear how you would determine how much memory is needed to
sort a table without actually doing the sort though. So that would be
more of an EXPLAIN ANALYZE wouldn't it?

-- 
greg



Re: WIP Patch: Pgbench Serialization and deadlock errors

2017-12-19 Thread Marina Polyakova

On 19-12-2017 17:11, Fabien COELHO wrote:

Consider adding the patch to the next commitfest?


Hi! Yes, here it is: https://commitfest.postgresql.org/16/1420/


I think you may have to ask the cf app admin to remove the duplicate.

https://commitfest.postgresql.org/16/1419/


Thanks, I'm trying to do this..

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: MemoryContextCreate change in PG 11 how should contexts be created

2017-12-19 Thread Paul Ramsey
On Tue, Dec 19, 2017 at 12:24 AM, Regina Obe  wrote:
> On December 13th this change to context creation was committed, which broke
> PostGIS trunk compile against PostgreSQL 11 head.
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9fa6f00b1308d
> d10da4eca2f31ccbfc7b35bb461
>
> Ticketed in PostGIS here:
> https://trac.osgeo.org/postgis/ticket/3946
>
>
> It's been broken for a couple of months
> https://trac.osgeo.org/postgis/ticket/3904  with just core dumping but at
> least it used to compile before December 13th.
>
>
> In going thru the changes I see that MemoryContextCreate was changed to not
> return a context anymore and to take in pointer to the context that should
> be returned among other changes.
>
> I also notice that MemoryContextCreate seems to be rarely used in PostgreSQL
> code in places I would expect and instead AllocSetContextCreateExtended is
> used.
>
> So is the preferred approach to not use MemoryContextCreate in extensions
> and instead for PG < 11 use AllocSetContextCreate and PG >= use
> AllocSetContextCreateExtended?
>
> Sorry if my questions don't make any sense.  Still trying to grapple with
> all these memory context functions and how each is different from the other.
>
> For reference, one of the slices of code we have in place that broke looks
> something like this and we've got I think at least 5 more such instances
> sprinkled  in PostGIS code base.
>
> https://git.osgeo.org/gitea/postgis/postgis/src/branch/svn-trunk/libpgcommon
> /lwgeom_transform.c#L550
>
>
>
> MemoryContext PJMemoryContext;
> POSTGIS_DEBUGF(3, "adding SRID %d with proj4text \"%s\" to query
> cache at index %d", srid, proj_str, PROJ4Cache->PROJ4SRSCacheCount);
>
> PJMemoryContext = MemoryContextCreate(T_AllocSetContext, 8192,
>   &PROJ4SRSCacheContextMethods,
>
> PROJ4Cache->PROJ4SRSCacheContext,
>   "PostGIS PROJ4 PJ Memory
> Context");

As Regina noted, we're working in https://trac.osgeo.org/postgis/ticket/3946

Our use of MemoryContextCreate is solely in order to get to use
MemoryContextDelete as a callback so that, at the end of a statement,
we can clean up externally allocated memory that we're holding in a
cache. If we had some other callback to use for "the statement is
complete, you can clean up now", we could avoid all this mucking
around with raw MemoryContexts entirely. The MemoryContext trick/hack
is very old, perhaps a callback or hook has been added since then that
we could make use of?



Re: Basebackups reported as idle

2017-12-19 Thread David Steele
Hi Magnus,

On 12/19/17 4:56 AM, Magnus Hagander wrote:
> AFAICT, base backups running on the replication protocol are always
> reported as "idle" in pg_stat_activity. This seems to have been an
> oversight in the "include walsender backends in pg_stat_activity" in 10,
> which does include it for walsenders in general, just not for the ones
> sending base backups. (and was then improved on later with the "include
> all non-standard backends" patch).
> 
> Unlike the regular walsender it also has to set it back to IDLE, since
> you can actually finish a base backup without disconnecting.
> 
> PFA a patch that fixes this. I think this is bugfix-for-backpatch, I
> don't think it has a large risk of breaking things. Thoughts?

+1 for this being a bug, albeit a minor one.

> Also, in setting this, there is no real way to differentiate between a
> regular walsender and a basebackup walsender, other than looking at the
> wait events. They're both listed as walsenders. Should there be?  (That
> might not be as easily backpatchable, so keeping that as a separate one)

Maybe something like "walsender [backup]" or just "basebackup" since
walsender is pretty misleading?  I think it would be nice to be able to
tell them apart, though I don't think it should be back-patched.  People
might be relying on the name in the current versions.

Thanks!
-- 
-David
da...@pgmasters.net



Re: MemoryContextCreate change in PG 11 how should contexts be created

2017-12-19 Thread Alvaro Herrera
Paul Ramsey wrote:

> Our use of MemoryContextCreate is solely in order to get use
> MemoryContextDelete as a callback so that, at the end of a statement,
> we can clean up externally allocated memory that we're holding in a
> cache.

You should not use MemoryContextCreate at all -- it's somewhat of an
internal API, as you could guess by looking at the weird arguments that
you're forced into passing.

Instead, the interface you're supposed to use is AllocSetContextCreate.
Just make sure you attach your new context to one which has the right
lifetime for your usage -- in your case ISTM the parent should be
PortalContext, which makes it go away when the current portal (query) is
gone.

See src/backend/utils/mmgr/README for more.  This applies to all
releases, old and new, though recently the API of these memory context
creation functions has been refined somewhat.
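
Concretely, something like this (just a sketch; ALLOCSET_DEFAULT_SIZES is the
9.6+ spelling, older releases pass the three size parameters explicitly):

MemoryContext cachecxt;

cachecxt = AllocSetContextCreate(PortalContext,
                                 "PostGIS PROJ4 cache",
                                 ALLOCSET_DEFAULT_SIZES);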

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: MemoryContextCreate change in PG 11 how should contexts be created

2017-12-19 Thread Tom Lane
Paul Ramsey  writes:
> Our use of MemoryContextCreate is solely in order to get use
> MemoryContextDelete as a callback so that, at the end of a statement,
> we can clean up externally allocated memory that we're holding in a
> cache. If we had some other callback to use for "the statement is
> complete, you can clean up now", we could avoid all this mucking
> around with raw MemoryContexts entirely. The MemoryContext trick/hack
> is very old, perhaps a callback or hook has been added since then that
> we could make use of?

I'm not managing to wrap my head around how you could use
MemoryContextCreate directly, unless you are implementing your own memory
context type, in which case the API changes aren't that difficult to make
I should think.

However, if the need is to free some external resources when a memory
context is destroyed, seems like what you ought to be using is a memory
context reset callback.  Look at MemoryContextRegisterResetCallback and
its callers (there are just a couple at the moment, though I'm fooling
with a patch that will add more).

regards, tom lane



Re: MemoryContextCreate change in PG 11 how should contexts be created

2017-12-19 Thread Paul Ramsey
On Tue, Dec 19, 2017 at 6:54 AM, Alvaro Herrera
 wrote:
> Paul Ramsey wrote:
>
>> Our use of MemoryContextCreate is solely in order to get use
>> MemoryContextDelete as a callback so that, at the end of a statement,
>> we can clean up externally allocated memory that we're holding in a
>> cache.
>
> You should not use MemoryContextCreate at all -- it's somewhat of an
> internal API, as you could guess by looking at the weird arguments that
> you're forced into passing.
>
> Instead, the interface you're supposed to use is AllocSetContextCreate.
> Just make sure you attach your new context to one which has the right
> lifetime for your usage -- in your case ISTM the parent should be
> PortalContext, which makes it go away when the current portal (query) is
> gone.

Thanks, it looks like PortalContext will serve perfectly.

P



Re: MemoryContextCreate change in PG 11 how should contexts be created

2017-12-19 Thread Paul Ramsey
On Tue, Dec 19, 2017 at 7:00 AM, Tom Lane  wrote:
> Paul Ramsey  writes:
>> Our use of MemoryContextCreate is solely in order to get use
>> MemoryContextDelete as a callback so that, at the end of a statement,
>> we can clean up externally allocated memory that we're holding in a
>> cache. If we had some other callback to use for "the statement is
>> complete, you can clean up now", we could avoid all this mucking
>> around with raw MemoryContexts entirely. The MemoryContext trick/hack
>> is very old, perhaps a callback or hook has been added since then that
>> we could make use of?
>
> However, if the need is to free some external resources when a memory
> context is destroyed, seems like what you ought to be using is a memory
> context reset callback.  Look at MemoryContextRegisterResetCallback and
> its callers (there are just a couple at the moment, though I'm fooling
> with a patch that will add more).

During a query we'll look up a coordinate transformation object, which
is an expensive lookup, and want to keep it around for the duration of
the query. For extra fun, it's externally allocated, not palloc'ed.
When the query ends, we want to clean up the objects associated with
that query.

Right now the trick is to create a custom memory context whose delete
method is dedicated to cleaning out entries in our cache.

If I'm reading right, using MemoryContextRegisterResetCallback on an
AllocSetContext created under our PortalContext should do the trick,
with less direct mucking about in the internals of contexts.
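
Roughly, I'm imagining something like this (untested sketch; the cache type
and the cleanup function are our own, i.e. invented names here):

static void
PROJ4SRSCacheCleanup(void *arg)
{
    /* free the externally allocated PJ objects held in our cache */
    DeletePROJ4Cache((PROJ4PortalCache *) arg);
}

/* when setting up the cache for this portal: */
MemoryContext cxt = AllocSetContextCreate(PortalContext,
                                          "PostGIS PROJ4 PJ Memory Context",
                                          ALLOCSET_DEFAULT_SIZES);
MemoryContextCallback *cb = MemoryContextAlloc(cxt, sizeof(MemoryContextCallback));

cb->func = PROJ4SRSCacheCleanup;
cb->arg = PROJ4Cache;
MemoryContextRegisterResetCallback(cxt, cb);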

P



Re: MemoryContextCreate change in PG 11 how should contexts be created

2017-12-19 Thread David Steele
On 12/19/17 10:11 AM, Paul Ramsey wrote:
> On Tue, Dec 19, 2017 at 7:00 AM, Tom Lane  wrote:
>> Paul Ramsey  writes:
> 
> If I'm reading right, using MemoryContextRegisterResetCallback on a
> AllocSetContext created under our PortalContext should do the trick,
> with less direct mucking about into the internals of contexts.

Be aware that this function is only available in PostgreSQL >= 9.5.

Regards,
-- 
-David
da...@pgmasters.net



Re: access/parallel.h lacks PGDLLIMPORT

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 3:36 AM, Amit Kapila  wrote:
> I also think it is good to allow ParallelWorkerNumber to be used in
> extensions.  Attached is the patch for same.  I think for other two we
> should wait till there is really a good use case for them.

I think waiting for a "really good" use case is too high a bar.  I
think we only need to think that there is a "reasonable" use case.
Accordingly, I pushed a commit adding PGDLLIMPORT to both
ParallelWorkerNumber and InitializingParallelWorker.
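
For extension authors the usage is then straightforward, e.g.:

#include "access/parallel.h"

/* ParallelWorkerNumber is -1 in the leader, 0..n-1 in parallel workers */
elog(DEBUG1, "executing in %s (ParallelWorkerNumber = %d)",
     IsParallelWorker() ? "a parallel worker" : "the leader",
     ParallelWorkerNumber);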

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 5:01 AM, Amit Kapila  wrote:
> I think it would have been much easier to fix this problem if we would
> have some way to differentiate whether the worker has stopped
> gracefully or not.  Do you think it makes sense to introduce such a
> state in the background worker machinery?

I think it makes a lot more sense to just throw an ERROR when the
worker doesn't shut down cleanly, which is currently what happens in
nearly all cases.  It only fails to happen for fork() failure and
other errors that happen very early in startup.  I don't think there's
any point in trying to make this code more complicated to cater to
such cases.  If fork() is failing, the fact that parallel query is
erroring out rather than running with fewer workers is likely to be a
good thing.  Your principal concern in that situation is probably
whether your attempt to log into the machine and kill some processes
is also going to die with 'fork failure', and having PostgreSQL
consume every available process slot is not going to make that easier.
On the other hand, if workers are failing so early in startup that
they never attach to the error queue, then they're probably all
failing the same way and trying to cope with that problem in any way
other than throwing an error is going to result in parallelism being
silently disabled with no notification to the user, which doesn't seem
good to me either.

So basically I think it's right to treat these as error conditions,
not try to continue the work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Top-N sorts verses parallelism

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 6:54 AM, Thomas Munro
 wrote:
> On Mon, Dec 18, 2017 at 9:29 AM, Robert Haas  wrote:
>> I went through the callers to create_sort_path and the only one that
>> looks like it can pass a limit is the one you and Jeff already
>> identified.  So I think the question is just whether
>> create_gather_merge_path needs a similar fix.
>
> I might be missing something, but it looks like there are no cases
> where we have a limit_tuples value we could use AND we're relying on
> create_gather_merge_path's own ability to create sort paths.  So I
> suspect there is no reason to change create_gather_merge_path itself
> to deal with tuple limits.  I looked at each of its callers:
>
> 1.  create_ordered_paths is the case the patch posted earlier covers:
> it has a useful limit_tuples value but it creates the sort path itself
> first, so there is no need for create_gather_merge_path to be aware of
> it.
>
> 2.  create_grouping_paths doesn't have limit_tuples value because
> grouping always inhibits limits.
>
> 3.  generate_gather_paths is in turn called by:
>
> 3.1.  standard_join_search can't use limits (at least in general) since
> it's dealing with a join.
>
> 3.2.  geqo's merge_clump is also about joins, so ditto.
>
> 3.3.  set_rel_pathlist will consider only pathkeys from existing index
> scans that set_plain_rel_pathlist found, not creating new pathkeys by
> sorting.

Well, it might be good future-proofing, but at least it's good to know
that it's not an active bug.  I pushed your earlier fix, which turns
out to be a revert of commit
dc02c7bca4dccf7de278cdc6b3325a829e75b252.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add hint about replication slots when nearing wraparound

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 5:27 AM, Feike Steenbergen
 wrote:
> While doing some wraparound debugging, I saw the hint regarding upcoming
> wraparound did not include the problem of having a stale replication
> slot (which I'm actually using to force wraparound issues).
>
> I remember a few discussions where a stale replication slot was actually the
> culprit in these situations.
>
> Something like the attached maybe?

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Protect syscache from bloating with negative cache entries

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
 wrote:
> I see three kinds of syscache entries.
>
> A. An entry for an actually existing object.
> B. An entry for an object which once existed but no longer.
> C. An entry for a just non-existent objects.

I'm not convinced that it's useful to divide things up this way.
Regardless of whether the syscache entry is a positive entry, a
negative entry for a dropped object, or a negative entry for an
object that never existed in the first place, it's valuable if it's
likely to get used again and worthless if not.  Positive entries may
get used repeatedly, or not; negative entries may get used repeatedly,
or not.

>> Generally, the way I'm viewing this is that a syscache entry means
>> paying memory to save CPU time.  Each 8kB of memory we use to store
>> system cache entries is one less block we have for the OS page cache
>> to hold onto our data blocks.  If we had an oracle (the kind from
>
> Sure
>
>> Delphi, not Redwood City) that told us with perfect accuracy when to
>> discard syscache entries, it would throw away syscache entries
>
> Except for the B in the above. The logic seems somewhat alien to
>   the time-based cleanup but this can be the measure for quick
>   bloat of some syscaches.

I guess I still don't see why B is different.  If somebody sits there
and runs queries against non-existent table names at top speed, maybe
they'll query the same non-existent table entries more than once, in
which case keeping the negative entries for the non-existent table
names around until they stop doing it may improve performance.  If
they are sitting there and running queries against randomly-generated
non-existent table names at top speed, then they'll generate a lot of
catcache bloat, but that's not really any different from a database
with a large number of tables that DO exist which are queried at
random.  Workloads that access a lot of objects, whether those objects
exist or not, are going to use up a lot of cache entries, and I guess
that just seems OK to me.

> Agreed. The following is the whole image of the measure for
> syscache bloat considering "quick bloat". (I still think it is
> wanted under some situations.)
>
> 1. If a removal of any objects that make some syscache entries
>   stale (this cannot be checked without scanning a whole hash so
>   just queue it into, for example, recently_removed_relations
>   OID hash.)

If we just let some sort of cleanup process that generally blows away
rarely-used entries get rid of those entries too, then it should
handle this case, too, because the cache entries pertaining to removed
relations (or schemas) probably won't get used after that (and if they
do, we should keep them).  So I don't see that there is a need for
this, and it drew objections upthread because of the cost of scanning
the whole hash table.  Batching relations together might help, but it
doesn't really seem worth trying to sort out the problems with this
idea when we can do something better and more general.

> 2. If the number of the oid-hash entries reaches 1000 or 1
>   (mmm. quite arbitrary..), immediately clean up syscaches that
>   accept/need removed-reloid cleanup.  (The oid hash might be
>   needed separately for each target cache to avoid a redundant
>   scan, or to get rid of a kind of generation management in the oid
>   hash.)

That is bound to draw a strong negative response from Tom, and for
good reason.  If the number of relations in the working set is 1001
and your cleanup threshold is 1000, cleanups will happen constantly
and performance will be poor.  This is exactly why, as I said in the
second email on this thread, the limit of on the size of the relcache
was removed.

>> 1. The first time we are due to expand the hash table, we check
>> whether we can forestall that expansion by doing a cleanup; if so, we
>> do that instead.
>
>   And if there's any entry in the removed-reloid hash it is
>   considered while cleanup.

As I say, I don't think there's any need for a removed-reloid hash.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [JDBC] [HACKERS] Channel binding support for SCRAM-SHA-256

2017-12-19 Thread Peter Eisentraut
On 11/30/17 21:11, Michael Paquier wrote:
> OK, here is a reworked version with the following changes:
> - renamed saslchannelbinding to scramchannelbinding, with a default
> set to tls-unique.
> - An empty value of scramchannelbinding allows client to not use
> channel binding, or in short use use SCRAM-SHA-256 and cbind-flag set
> to 'n'.
> 
> While reviewing the code, I have found something a bit disturbing with
> the header definitions: the libpq frontend code includes scram.h,
> which references backend-side routines. So I think that the definition
> of the SCRAM mechanisms as well as the channel binding types should be
> moved to scram-common.h. This cleanup is included in 0001.

I have committed 0001 and 0002 (renaming to scram_channel_binding).

The 0003 patch looks mostly fine as well.  The only concern I have is
that the way it is set up now, we make the server compute the channel
binding data for both tls-unique and tls-server-end-point, even though
we only end up using one.  This might need some restructuring so that we
only get the data we need once we have learned which channel binding
type the client requested.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 8:44 AM, Craig Ringer  wrote:
> I didn't want to mess with the MemoryContextMethods and expose a
> printf-wrapper style typedef in memnodes.h, so I went with a hook global.

That looks pretty grotty to me.  I think if you want to elog/ereport
this, you need to pass another argument to MemoryContextStats() or add
another memory context method.  This is pretty much a textbook example
of the wrong way to use a global variable, IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



update portal-related memory context names and API

2017-12-19 Thread Peter Eisentraut
ISTM that some of the portal-related memory context naming is a bit
antiquated and at odds with current terminology.  In this patch, I
propose to rename PortalMemory to TopPortalContext and rename
Portal->heap to Portal->portalContext, and then clean up some
surrounding APIs.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From df21079da9d002c6ba80f5a968dfdf3ce630808c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 16 Dec 2017 17:26:26 -0500
Subject: [PATCH 1/2] Update portal-related memory context names and API

Rename PortalMemory to TopPortalContext, to avoid confusion with
PortalContext and align naming with similar top-level memory contexts.

Rename PortalData's "heap" field to portalContext.  The "heap" naming
seems quite antiquated and confusing.  Also get rid of the
PortalGetHeapMemory() macro and access the field directly, which we do
for other portal fields, so this abstraction doesn't buy anything.
---
 src/backend/commands/portalcmds.c  | 10 +-
 src/backend/commands/prepare.c |  2 +-
 src/backend/executor/spi.c |  6 +++---
 src/backend/tcop/postgres.c|  2 +-
 src/backend/tcop/pquery.c  | 16 
 src/backend/utils/mmgr/portalmem.c | 32 
 src/include/utils/portal.h |  3 +--
 7 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
index 76d6cf154c..e19b2d82a7 100644
--- a/src/backend/commands/portalcmds.c
+++ b/src/backend/commands/portalcmds.c
@@ -96,7 +96,7 @@ PerformCursorOpen(DeclareCursorStmt *cstmt, ParamListInfo params,
 */
portal = CreatePortal(cstmt->portalname, false, false);
 
-   oldContext = MemoryContextSwitchTo(PortalGetHeapMemory(portal));
+   oldContext = MemoryContextSwitchTo(portal->portalContext);
 
plan = copyObject(plan);
 
@@ -363,7 +363,7 @@ PersistHoldablePortal(Portal portal)
ActivePortal = portal;
if (portal->resowner)
CurrentResourceOwner = portal->resowner;
-   PortalContext = PortalGetHeapMemory(portal);
+   PortalContext = portal->portalContext;
 
MemoryContextSwitchTo(PortalContext);
 
@@ -450,10 +450,10 @@ PersistHoldablePortal(Portal portal)
PopActiveSnapshot();
 
/*
-* We can now release any subsidiary memory of the portal's heap context;
+* We can now release any subsidiary memory of the portal's context;
 * we'll never use it again.  The executor already dropped its context,
-* but this will clean up anything that glommed onto the portal's heap via
+* but this will clean up anything that glommed onto the portal's context via
 * PortalContext.
 */
-   MemoryContextDeleteChildren(PortalGetHeapMemory(portal));
+   MemoryContextDeleteChildren(portal->portalContext);
 }
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index be7222f003..2f57d52b05 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -239,7 +239,7 @@ ExecuteQuery(ExecuteStmt *stmt, IntoClause *intoClause,
portal->visible = false;
 
/* Copy the plan's saved query string into the portal's memory */
-   query_string = MemoryContextStrdup(PortalGetHeapMemory(portal),
+   query_string = MemoryContextStrdup(portal->portalContext,
    entry->plansource->query_string);
 
/* Replan if needed, and increment plan refcount for portal */
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index f3da2ddd08..46b8360330 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1183,7 +1183,7 @@ SPI_cursor_open_internal(const char *name, SPIPlanPtr plan,
}
 
/* Copy the plan's query string into the portal */
-   query_string = MemoryContextStrdup(PortalGetHeapMemory(portal),
+   query_string = MemoryContextStrdup(portal->portalContext,
    plansource->query_string);
 
/*
@@ -1213,7 +1213,7 @@ SPI_cursor_open_internal(const char *name, SPIPlanPtr plan,
 * will result in leaking our refcount on the plan, but it doesn't
 * matter because the plan is unsaved and hence transient anyway.
 */
-   oldcontext = MemoryContextSwitchTo(PortalGetHeapMemory(portal));
+   oldcontext = MemoryContextSwitchTo(portal->portalContext);
stmt_list = copyObject(stmt_list);
MemoryContextSwitchTo(oldcontext);
ReleaseCachedPlan(cplan, false);
@@ -1311,7 +1311,7 @@ SPI_cursor_open_internal(const char *name, SPIPlanPtr plan,
 */
if (paramLI)

Letting plpgsql in on the fun with the new expression eval stuff

2017-12-19 Thread Tom Lane
I'm looking at ways to get plpgsql expression evaluation to go faster,
and one thing I'm noticing is the rather large overhead of going through
ExecEvalParamExtern and plpgsql_param_fetch to get to the useful work
(exec_eval_datum).  We've ameliorated that for DTYPE_VAR variables
by keeping a pre-set-up copy of their values in a ParamListInfo struct,
but that's pretty ugly and carries a bunch of costs of its own.

What I'm wondering about, given the v10 redesign of expression evaluation,
is whether we couldn't be smarter about this by allowing plpgsql (or other
external users) to skip the ParamListInfo representation altogether, and
instead compile Param references into calls to evaluation functions that
are better tailored to the problem of fetching the desired value.

In the existing execution infrastructure, what seems to make sense is to
have an ExprEvalStep type that has functionality like EEOP_PARAM_EXTERN,
but includes a function pointer to a plpgsql-supplied function having the
same signature as ExecEvalParamExtern.  So the execution would look more
or less like

EEO_CASE(EEOP_PARAM_CALLBACK)
{
op->eval_param(state, op, econtext);
EEO_NEXT();
}

and there'd need to be some extra fields (at least a void*) in the op
struct where plpgsql could keep private data.
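
In rough terms the extra step data might look about like this (purely a
sketch, names not settled):

typedef void (*ExecEvalParamCallback) (struct ExprState *state,
                                       struct ExprEvalStep *op,
                                       struct ExprContext *econtext);

/* additional members in ExprEvalStep's union for EEOP_PARAM_CALLBACK: */
struct
{
    ExecEvalParamCallback paramfunc;    /* plpgsql-supplied evaluator */
    void       *paramarg;               /* private data for paramfunc */
    int         paramid;                /* param's ID */
    Oid         paramtype;              /* param's datatype */
}           cparam;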

The JIT stuff you're working on could just compile an equivalent of the
above, although in the very long term maybe there would be some advantage
to letting add-on modules compile specialized code for such steps.

The immediate problem is how can ExecInitExpr generate such a step?
It can't itself know what to put into the function ptr or the additional
fields.  There has to be a way for it to call a plpgsql-supplied
support routine that can construct the eval step.  (And we have to
export ExprEvalPushStep, though that doesn't seem like a problem.)

For compiling full-fledged query trees, what I think we could do is
add a method (function pointer) to ParamListInfo and have ExecInitExpr
invoke plan->state->es_param_list_info->compile_param if that's set.
However, that solution doesn't immediately work for compiling simple
expressions because we pass a null "parent" pointer when building
those.

I thought about instantiating a dummy PlanState and EState to use
just for carrying this info, but that seems pretty ugly.  Another
way we could do it is to invent ExecInitExprWithParams() that takes
an additional ParamListInfo pointer, and use that.  Rather than
adding yet one more parameter that has to be passed down through
ExecInitExprRec, I suggest that we could waste a bit of space in
struct ExprState and store that value there.  Maybe do the same
with the parent pointer so as to reduce the number of recursive
parameters.

I've not written any code around this idea yet, and am not sure
if it conflicts with what you're trying to do for JIT or further out.
Comments, better ideas?

regards, tom lane



Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-12-19 Thread Robert Haas
On Wed, Dec 13, 2017 at 9:18 PM, Amit Kapila  wrote:
> Thanks.  I think now we can proceed with
> fix_accum_instr_parallel_workers_v8.patch posted above which will fix
> the original issue and the problem we have found in sort and hash
> nodes.

Committed and back-patched to v10.  While I was doing that, I couldn't
help wondering if this code doesn't also need to be moved:

/*
 * Next, accumulate buffer usage.  (This must wait for the workers to
 * finish, or we might get incomplete data.)
 */
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i]);

It seems that it doesn't, because the way that instrumentation data
gets accumulated is via InstrStartParallelQuery and
InstrEndParallelQuery, the latter of which clobbers the entry in the
buffer_usage array rather than adding to it:

void
InstrEndParallelQuery(BufferUsage *result)
{
memset(result, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(result, &pgBufferUsage, &save_pgBufferUsage);
}

But we could think about choosing to make that work the same way; that
is, move the code block to ExecParallelCleanup, remove the memset()
call from InstrEndParallelQuery, and change the code that allocates
PARALLEL_KEY_BUFFER_USAGE to initialize the space.  That would make
the handling of this more consistent with what we're now doing for the
instrumentation data, although I can't see that it fixes any live bug.
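
In other words, with the PARALLEL_KEY_BUFFER_USAGE space zeroed at allocation
time, the accumulate-instead-of-clobber variant would be roughly just:

void
InstrEndParallelQuery(BufferUsage *result)
{
    /* add this worker's usage to the shared slot rather than overwriting it */
    BufferUsageAccumDiff(result, &pgBufferUsage, &save_pgBufferUsage);
}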

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Protect syscache from bloating with negative cache entries

2017-12-19 Thread Tom Lane
Robert Haas  writes:
> On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
>  wrote:
>> I see three kinds of syscache entries.
>> 
>> A. An entry for an actually existing object.
>> B. An entry for an object which once existed but no longer.
>> C. An entry for a just non-existent objects.

> I'm not convinced that it's useful to divide things up this way.

Actually, I don't believe that case B exists at all; such an entry
should get blown away by syscache invalidation when we commit the
DROP command.  If one were to stick around, we'd risk false positive
lookups later.

> I guess I still don't see why B is different.  If somebody sits there
> and runs queries against non-existent table names at top speed, maybe
> they'll query the same non-existent table entries more than once, in
> which case keeping the negative entries for the non-existent table
> names around until they stop doing it may improve performance.

FWIW, my recollection is that the reason for negative cache entries
is that there are some very common patterns where we probe for object
names (not just table names, either) that aren't there, typically as
a side effect of walking through the search_path looking for a match
to an unqualified object name.  Those cache entries aren't going to
get any less useful than the positive entry for the ultimately-found
object.  So from a lifespan point of view I'm not very sure that it's
worth distinguishing cases A and C.

It's conceivable that we could rewrite all the lookup algorithms
so that they didn't require negative cache entries to have good
performance ... but I doubt that that's easy to do.

regards, tom lane



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-19 Thread Tom Lane
Robert Haas  writes:
> On Tue, Dec 19, 2017 at 8:44 AM, Craig Ringer  wrote:
>> I didn't want to mess with the MemoryContextMethods and expose a
>> printf-wrapper style typedef in memnodes.h, so I went with a hook global.

> That looks pretty grotty to me.  I think if you want to elog/ereport
> this, you need to pass another argument to MemoryContextStats() or add
> another memory context method.  This is pretty much a textbook example
> of the wrong way to use a global variable, IMHO.

Yeah.  But please don't mess with MemoryContextStats per se ---
I dunno about you guys but "call MemoryContextStats(TopMemoryContext)"
is kinda wired into my gdb reflexes.  I think what'd make sense
is a new function "MemoryContextStatsTo(context, function_pointer)".
It's okay to redefine the APIs of the per-context-type functions
these would call, though, because nobody calls those functions directly.
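
Signature-wise I'm thinking of something like (sketch only, names not settled):

typedef void (*MemoryStatsPrintFunc) (const char *fmt,...);

extern void MemoryContextStats(MemoryContext context);     /* unchanged, for gdb */
extern void MemoryContextStatsTo(MemoryContext context,
                                 MemoryStatsPrintFunc printfunc);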

regards, tom lane



Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-12-19 Thread Robert Haas
On Tue, Dec 5, 2017 at 4:23 PM, Thomas Munro
 wrote:
> The hash version of this code is now committed as 5bcf389e.  Here is a
> patch for discussion that adds some extra tests to join.sql to
> exercise rescans of a hash join under a Gather node.  It fails on
> head, because it loses track of the instrumentation pointer after the
> first loop as you described (since the Hash coding is the same as the
> Sort coding), so it finishes up with no instrumentation data.  If you
> move ExecParallelRetrieveInstrumentation() to ExecParallelCleanup() as
> you showed in your patch, then it passes.  The way I'm asserting that
> instrumentation data is making its way back to the leader is by
> turning off leader participation and then checking if it knows how
> many batches there were.

In a later email in this thread, you asked me to consider this patch
for commit, but it doesn't apply.  I thought that might be the result
of conflicts with Amit's patch which I just committed, but I think
that's not the real explanation, because it touches the 'join'
regression test, not 'select_parallel'.  Well, I thought, I'll just
find the place where the SQL should be inserted and stick it in there
-- trivial rebase, right?

Well, not really, because the context surrounding the lines you've
added seems to refer to SQL that I can't find in join.sql or anywhere
else in the tree.  So my suspicion is that this patch is based on your
parallel hash patch set rather than master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



vacuum vs heap_update_tuple() and multixactids

2017-12-19 Thread Andres Freund
Hi,

In [1] I'd discovered an only mildly related bug while reading code to
make sure my fix [2] et al was correct.

Quoting a couple messages by myself:
> Staring at the vacuumlazy hunk I think I might have found a related bug:
> heap_update_tuple() just copies the old xmax to the new tuple's xmax if
> a multixact and still running.  It does so without verifying liveness
> of members.  Isn't that buggy? Consider what happens if we have three
> blocks: 1 has free space, two is being vacuumed and is locked, three is
> full and has a tuple that's key share locked by a live tuple and is
> updated by a dead xmax from before the xmin horizon. In that case afaict
> the multi will be copied from the third page to the first one.  Which is
> quite bad, because vacuum already processed it, and we'll set
> relfrozenxid accordingly.  I hope I'm missing something here?
>
> Trying to write a testcase for that now.
>
> This indeed happens, but I can't quite figure out a way to write an
> isolationtester test for this. The problem is that to have something
> reproducible one has to make vacuum block on a cleanup lock, but that
> currently doesn't register as waiting for the purpose of
> isolationtester's logic.

So what basically happens is that vacuum might be at block X, and a
concurrent update will move a tuple from a block > X to a block < X,
preserving the multixactid in xmax. Which can mean there later is a
multixactid in the table that's from before relminmxid.

I manually reproduced the issue, but it's pretty painful to do so
manually. I've not found any way to reliably do so in isolationtester so
far.  Cleanup locks make it possible to schedule this without race
conditions, but isolationtester currently won't switch sessions when
blocked on a cleanup lock.

Could I perhaps convince somebody to add that as a feature to
isolationtester? I'm willing to work on a bugfix for the bug itself, but
I've already spent tremendous amounts of time, energy and pain on
multixact bugs, and I'm at the moment feeling a bit unenthusiastic about
also working on test infrastructure for it...  If somebody else is
willing to work on both infrastructure *and* a bugfix, that's obviously
even better ;)

I think the bugfix is going to have to essentially be something similar
to FreezeMultiXactId(). I.e. when reusing an old tuple's xmax for a new
tuple version, we need to prune dead multixact members. I think we can
do so unconditionally and rely on the multixact id caching layer to avoid
unnecessarily creating multis when all members are the same.
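
To make the idea concrete, here is a rough, hypothetical sketch of that
pruning step. It is not the actual fix; where it would be called from and how
locker vs. updater members should be treated are glossed over:

#include "postgres.h"
#include "access/multixact.h"
#include "storage/procarray.h"

static MultiXactId
prune_copied_multixact(MultiXactId oldmulti)
{
	MultiXactMember *members;
	MultiXactMember *live;
	int			nmembers;
	int			nlive = 0;
	int			i;

	nmembers = GetMultiXactIdMembers(oldmulti, &members, false, false);
	if (nmembers <= 0)
		return oldmulti;

	live = palloc(sizeof(MultiXactMember) * nmembers);
	for (i = 0; i < nmembers; i++)
	{
		/* keep only members whose transaction is still running */
		if (TransactionIdIsInProgress(members[i].xid))
			live[nlive++] = members[i];
	}

	if (nlive == 0)
	{
		pfree(members);
		pfree(live);
		return InvalidMultiXactId;	/* caller would clear HEAP_XMAX_IS_MULTI */
	}

	/*
	 * The multixact cache keeps this cheap when nothing was pruned: the same
	 * member set maps back to the same multi.
	 */
	oldmulti = MultiXactIdCreateFromMembers(nlive, live);

	pfree(members);
	pfree(live);
	return oldmulti;
}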

Greetings,

Andres Freund

[1] 
http://archives.postgresql.org/message-id/20171103145330.5ycjoje5s6lfwxps%40alap3.anarazel.de
[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9c2f0a6c3cc8bb85b78191579760dbe9fb7814ec



Re: vacuum vs heap_update_tuple() and multixactids

2017-12-19 Thread Alvaro Herrera
Andres Freund wrote:

> I think the bugfix is going to have to essentially be something similar
> to FreezeMultiXactId(). I.e. when reusing an old tuple's xmax for a new
> tuple version, we need to prune dead multixact members. I think we can
> do so unconditionally and rely on multixact id caching layer to avoid
> unnecesarily creating multis when all members are the same.

Actually, isn't the cache subject to the very same problem?  If you use
a value from the cache, it could very well be below whatever the cutoff
multi was chosen in the other process ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-19 Thread Andres Freund
On 2017-12-19 13:17:52 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > On Tue, Dec 19, 2017 at 8:44 AM, Craig Ringer  wrote:
> >> I didn't want to mess with the MemoryContextMethods and expose a
> >> printf-wrapper style typedef in memnodes.h, so I went with a hook global.
> 
> > That looks pretty grotty to me.  I think if you want to elog/ereport
> > this, you need to pass another argument to MemoryContextStats() or add
> > another memory context method.  This is pretty much a textbook example
> > of the wrong way to use a global variable, IMHO.

Agreed.


> Yeah.  But please don't mess with MemoryContextStats per se ---
> I dunno about you guys but "call MemoryContextStats(TopMemoryContext)"
> is kinda wired into my gdb reflexes.  I think what'd make sense
> is a new function "MemoryContextStatsTo(context, function_pointer)".
> It's okay to redefine the APIs of the per-context-type functions
> these would call, though, because nobody calls those functions directly.

We already have MemoryContextStatsDetail() - it seems to make sense to
expand that API and leave MemoryContextStats() alone. I don't think
there's a need for a third variant?

Greetings,

Andres Freund



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-19 Thread Tom Lane
Andres Freund  writes:
> On 2017-12-19 13:17:52 -0500, Tom Lane wrote:
>> Yeah.  But please don't mess with MemoryContextStats per se ---
>> I dunno about you guys but "call MemoryContextStats(TopMemoryContext)"
>> is kinda wired into my gdb reflexes.  I think what'd make sense
>> is a new function "MemoryContextStatsTo(context, function_pointer)".
>> It's okay to redefine the APIs of the per-context-type functions
>> these would call, though, because nobody calls those functions directly.

> We already have MemoryContextStatsDetail() - it seems to make sense to
> expand that API and leave MemoryContextStats() alone. I don't think
> there's a need for a third variant?

Sure, WFM.

regards, tom lane



improve type conversion of SPI_processed in Python

2017-12-19 Thread Peter Eisentraut
Here is a patch that improves how PL/Python deals with very large values
of SPI_processed.  The previous code converts anything that does not fit
into a C long into a Python float.  But Python long has unlimited
precision, so we should be using that instead.  And in Python 3, int and
long are the same, so there is no need to deal with any variations anymore.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From ba45631d66502600034f3d4803e35909f29d8e7c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 12 Sep 2017 22:09:21 -0400
Subject: [PATCH] Improve type conversion of SPI_processed in Python

The previous code converted SPI_processed to a Python float if it didn't
fit into a Python int.  But Python longs have unlimited precision, so
use that instead.

Refactor the code a bit to avoid having to repeat this logic three times.
---
 src/pl/plpython/plpy_cursorobject.c |  4 +---
 src/pl/plpython/plpy_spi.c  |  8 ++--
 src/pl/plpython/plpy_util.c | 25 +
 src/pl/plpython/plpy_util.h |  2 ++
 4 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/src/pl/plpython/plpy_cursorobject.c 
b/src/pl/plpython/plpy_cursorobject.c
index 9467f64808..37baf5fafd 100644
--- a/src/pl/plpython/plpy_cursorobject.c
+++ b/src/pl/plpython/plpy_cursorobject.c
@@ -437,9 +437,7 @@ PLy_cursor_fetch(PyObject *self, PyObject *args)
ret->status = PyInt_FromLong(SPI_OK_FETCH);
 
Py_DECREF(ret->nrows);
-   ret->nrows = (SPI_processed > (uint64) LONG_MAX) ?
-   PyFloat_FromDouble((double) SPI_processed) :
-   PyInt_FromLong((long) SPI_processed);
+   ret->nrows = PLyObject_FromUint64(SPI_processed);
 
if (SPI_processed != 0)
{
diff --git a/src/pl/plpython/plpy_spi.c b/src/pl/plpython/plpy_spi.c
index 0c623a9458..c6b6f73498 100644
--- a/src/pl/plpython/plpy_spi.c
+++ b/src/pl/plpython/plpy_spi.c
@@ -371,9 +371,7 @@ PLy_spi_execute_fetch_result(SPITupleTable *tuptable, 
uint64 rows, int status)
if (status > 0 && tuptable == NULL)
{
Py_DECREF(result->nrows);
-   result->nrows = (rows > (uint64) LONG_MAX) ?
-   PyFloat_FromDouble((double) rows) :
-   PyInt_FromLong((long) rows);
+   result->nrows = PLyObject_FromUint64(rows);
}
else if (status > 0 && tuptable != NULL)
{
@@ -381,9 +379,7 @@ PLy_spi_execute_fetch_result(SPITupleTable *tuptable, 
uint64 rows, int status)
MemoryContext cxt;
 
Py_DECREF(result->nrows);
-   result->nrows = (rows > (uint64) LONG_MAX) ?
-   PyFloat_FromDouble((double) rows) :
-   PyInt_FromLong((long) rows);
+   result->nrows = PLyObject_FromUint64(rows);
 
cxt = AllocSetContextCreate(CurrentMemoryContext,

"PL/Python temp context",
diff --git a/src/pl/plpython/plpy_util.c b/src/pl/plpython/plpy_util.c
index 35d57a9e80..0bb5150de9 100644
--- a/src/pl/plpython/plpy_util.c
+++ b/src/pl/plpython/plpy_util.c
@@ -133,3 +133,28 @@ PLyUnicode_FromString(const char *s)
 }
 
 #endif /* PY_MAJOR_VERSION >= 
3 */
+
+/*
+ * Return a suitable Python object containing a uint64.
+ */
+PyObject *
+PLyObject_FromUint64(uint64 ival)
+{
+#if PY_MAJOR_VERSION < 3
+   /* In Python 2, return int if it fits. */
+   if (ival <= (uint64) LONG_MAX)
+   return PyInt_FromLong((long) ival);
+   else
+#endif
+   {
+   /*
+* Convert to Python long, picking the conversion function that
+* corresponds to the underlying definition of our uint64.
+*/
+#ifdef HAVE_LONG_LONG
+   return PyLong_FromUnsignedLongLong(ival);
+#else
+   return PyLong_FromUnsignedLong(ival);
+#endif
+   }
+}
diff --git a/src/pl/plpython/plpy_util.h b/src/pl/plpython/plpy_util.h
index f990bb0890..1c7e109617 100644
--- a/src/pl/plpython/plpy_util.h
+++ b/src/pl/plpython/plpy_util.h
@@ -14,4 +14,6 @@ extern PyObject *PLyUnicode_FromString(const char *s);
 extern PyObject *PLyUnicode_FromStringAndSize(const char *s, Py_ssize_t size);
 #endif
 
+extern PyObject *PLyObject_FromUint64(uint64 ival);
+
 #endif /* PLPY_UTIL_H */
-- 
2.15.1
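
For context, a minimal PL/Python function exercising the converted value
(assuming the plpythonu language is installed; the function name is invented):

CREATE FUNCTION demo_nrows() RETURNS bigint
LANGUAGE plpythonu AS $$
rv = plpy.execute("SELECT g FROM generate_series(1, 10) AS g")
# rv.nrows() is backed by SPI_processed; with the patch it is always a
# Python int/long, never a float, no matter how large the row count is
return rv.nrows()
$$;

SELECT demo_nrows();  -- returns 10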



Re: vacuum vs heap_update_tuple() and multixactids

2017-12-19 Thread Andres Freund
On 2017-12-19 15:35:12 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > I think the bugfix is going to have to essentially be something similar
> > to FreezeMultiXactId(). I.e. when reusing an old tuple's xmax for a new
> > tuple version, we need to prune dead multixact members. I think we can
> > do so unconditionally and rely on multixact id caching layer to avoid
> > unnecesarily creating multis when all members are the same.
> 
> Actually, isn't the cache subject to the very same problem?  If you use
> a value from the cache, it could very well be below whatever the cutoff
> multi was chosen in the other process ...

That's a potential issue somewhat independent of this bug though (IIRC I
also mentioned it in the other thread). I hit that problem a bunch in
manual testing, but I didn't manage to create an actual testcase for it,
without pre-existing corruption.  I don't think this specific instance
would be more-vulnerable than FreezeMultiXactId() itself - we'd just use
alive multis and pass them to MultiXactIdCreateFromMembers(), which is
exactly what FreezeMultiXactId() does.

I think the cache issue ends up not quite being a live bug, because
every transaction that's a member of a multixact also has done
MultiXactIdSetOldestMember(). Which in turn probably, but I'm not sure,
prevents the existence of multis with just alive members in the cache,
that are below the multi horizon. That relies on the fact that we only
create multis with alive members though, which seems fragile.

It'd be good if we added some assurances to
MultiXactIdCreateFromMembers() that it actually can't happen.

Hope I didn't miss a live version of the above problem?

Greetings,

Andres Freund



Re: WIP: a way forward on bootstrap data

2017-12-19 Thread David Fetter
On Thu, Dec 14, 2017 at 05:59:12PM +0700, John Naylor wrote:
> On 12/13/17, Peter Eisentraut  wrote:
> > On 12/13/17 04:06, John Naylor wrote:
> >> There doesn't seem to be any interest in bootstrap data at the moment,
> >> but rather than give up just yet, I've added a couple features to make
> >> a data migration more compelling:
> >
> > I took a brief look at your patches, and there appear to be a number of
> > good cleanups in there at least.  But could you please send patches in
> > git format-patch format with commit messages, so we don't have to guess
> > what each patch does?
> 
> Thanks for taking a look and for pointing me to git format-patch.
> That's much nicer than trying to keep emails straight. I've attached a
> new patchset.

Thanks for your hard work on this.  It'll really make developing this
part of the code a lot more pleasant.

Unfortunately, it no longer applies to master.  Can we get a rebase?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-12-19 Thread Mark Dilger

> On Nov 27, 2017, at 8:47 AM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> Attached is an updated version of the patch series, fixing the issues
> reported by Mark Dilger:
> 
> 1) Fix fabs() issue in histogram.c.
> 
> 2) Do not rely on extra_data being StdAnalyzeData, and instead lookup
> the LT operator explicitly. This also adds a simple regression tests to
> make sure ANALYZE on arrays works fine, but perhaps we should invent
> some simple queries too.
> 
> 3) I've removed / clarified some of the comments mentioned by Mark.
> 
> 4) I haven't changed how the statistics kinds are defined in relation.h,
> but I agree there should be a comment explaining how STATS_EXT_INFO_*
> relate to StatisticExtInfo.kinds.
> 
> 5) The most significant change happened histograms. There used to be two
> structures for histograms:
> 
>  - MVHistogram - expanded (no deduplication etc.), result of histogram
>build and never used for estimation
> 
>  - MVSerializedHistogram - deduplicated to save space, produced from
>MVHistogram before storing in pg_statistic_ext and never used for
>estimation
> 
> So there wasn't really any reason to expose the "non-serialized" version
> outside histogram.c. It was just confusing and unnecessary, so I've
> moved MVHistogram to histogram.c (and renamed it to MVHistogramBuild),
> and renamed MVSerializedHistogram. And same for the MVBucket stuff.
> 
> So now we only deal with MVHistogram everywhere, except in histogram.c.
> 
> 6) I've also made MVHistogram to include a varlena header directly (and
> be packed as a bytea), which allows us to store it without having to
> call any serialization functions).
> 
> I guess if we should do (5) and (6) for the MCV lists too, it seems more
> convenient than the current approach. And perhaps even for the
> statistics added to 9.6 (it does not change the storage format).

I tested your latest patches on my mac os x laptop and got one test
failure due to the results of 'explain' coming up differently.  For the record,
I followed these steps:

cd postgresql/
git pull
# this got my directory up to 8526bcb2df76d5171b4f4d6dc7a97560a73a5eff with no 
local changes
patch -p 1 < ../0001-multivariate-MCV-lists.patch
patch -p 1 < ../0002-multivariate-histograms.patch
./configure --prefix=/Users/mark/master/testinstall --enable-cassert 
--enable-tap-tests --enable-depend && make -j4 && make check-world

mark



regression.diffs
Description: Binary data


stats_ext.out
Description: Binary data




Re: WIP: BRIN multi-range indexes

2017-12-19 Thread Mark Dilger

> On Nov 18, 2017, at 12:45 PM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> Apparently there was some minor breakage due to duplicate OIDs, so here
> is the patch series updated to current master.
> 
> regards
> 
> -- 
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> <0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>


After applying these four patches to my copy of master, the regression
tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.

mark



regression.diffs
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2017-12-19 Thread Peter Eisentraut
On 12/19/17 03:37, Nikhil Sontakke wrote:
> Note that this patch does not contain the HeapTupleSatisfiesVacuum
> changes. I believe we need changes to HeapTupleSatisfiesVacuum given
> than logical decoding changes the assumption that catalog tuples
> belonging to a transaction which never committed can be reclaimed
> immediately. With 2PC logical decoding or streaming logical decoding,
> we can always have a split time window in which the ongoing decode
> cycle needs those tuples. The solution is that even for aborted
> transactions, we do not return HEAPTUPLE_DEAD if the transaction id is
> newer than the OldestXmin (same logic we use for deleted tuples of
> committed transactions). We can do this only for catalog table rows
> (both system and user defined) to limit the scope of impact. In any
> case, this needs to be a separate patch along with a separate
> discussion thread.

Are you working on that as well?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-19 Thread Peter Eisentraut
On 12/15/17 14:10, Robert Haas wrote:
>> There is an argument for having a big array versus the switch/case
>> approach.  But most existing code around object addresses uses the
>> switch/case approach, so it's better to align it that way, I think.
>> It's weird to have to maintain two different styles.
> Eh, really?  What about the two big arrays at the top of objectaddress.c?

They are not indexed by object type.  I can't find any existing array or
other structure that fits into what this patch is doing (other than the
one this patch is removing).

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: vacuum vs heap_update_tuple() and multixactids

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 1:31 PM, Andres Freund  wrote:
> Could I perhaps convince somebody to add that as a feature to
> isolationtester? I'm willing to work on a bugfix for the bug itself, but
> I've already spent tremendous amounts of time, energy and pain on
> multixact bugs, and I'm at the moment feeling a bit unenthusiastic about
> also working on test infrastructure for it...  If somebody else is
> willing to work on both infrastructure *and* a bugfix, that's obviously
> even better ;)

Hmm.  The problem with trying to make the isolation tester do this is
that pg_isolation_test_session_is_blocked(X, A) is documented to tell
us whether X is waiting for one of the PIDs listed in A.  It's easy
enough to tell whether X is blocked waiting for a cleanup lock by
looking at BackendPidGetProc(pid)->wait_event_info, but that gives us
no information about which processes are holding the buffer pins that
prevent us from acquiring that lock.  That's a hard problem to solve
because that data is not recorded in shared memory and doing so would
probably be prohibitively expensive.



What if we add a hook to vacuum that lets you invoke some arbitrary C
code after each block, and then write a test module that uses that
hook to acquire and release an advisory lock at that point?  Then you
could do:

S1: SELECT pg_advisory_lock(...);
S2: VACUUM;  -- blocks on S1, because of the test module
S3: UPDATE ...;
S1: COMMIT; -- unblocks VACUUM
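
A rough sketch of what the test-module side of such a hook could look like;
no such hook exists in core today, so the hook type and variable names below
are invented purely for illustration:

#include "postgres.h"
#include "fmgr.h"
#include "storage/block.h"
#include "utils/fmgrprotos.h"
#include "utils/rel.h"

/* assumed core-side declarations (would live in a core header; hypothetical) */
typedef void (*vacuum_block_hook_type) (Relation rel, BlockNumber blkno);
extern PGDLLIMPORT vacuum_block_hook_type vacuum_block_hook;

PG_MODULE_MAGIC;

#define TEST_VACUUM_LOCK_ID  INT64CONST(424242)

static void
test_vacuum_block_hook(Relation rel, BlockNumber blkno)
{
	/* VACUUM stalls here while another session holds the advisory lock */
	DirectFunctionCall1(pg_advisory_lock_int8,
						Int64GetDatum(TEST_VACUUM_LOCK_ID));
	DirectFunctionCall1(pg_advisory_unlock_int8,
						Int64GetDatum(TEST_VACUUM_LOCK_ID));
}

void
_PG_init(void)
{
	vacuum_block_hook = test_vacuum_block_hook;
}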

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


crappy-buffer-pin-wait.patch
Description: Binary data


Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-19 Thread Peter Eisentraut
On 12/18/17 02:38, Rushabh Lathia wrote:
> Only motivation is, earlier approach looks more cleaner. Also patch is
> getting bigger - so if we continue with old approach it will make review
> easy. Just in case switch/case approach is a go to, then it can be
> done as part of separate clean up patch.

I find the approach with the giant array harder to maintain because you
typically need to maintain a consistent order between an enum in one
file and arrays in other files, and the only approaches to satisfy this
are hope and 100% test coverage.  And then if you want to reorder or
insert something, you need to do it everywhere at once in a very careful
manner.  In this particular patch, it would also bloat the array even
more, because we don't support grants on all object types, and with the
switch approach we can easily omit those.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-19 Thread Peter Eisentraut
On 12/15/17 17:34, Michael Paquier wrote:
> On Sat, Dec 16, 2017 at 2:39 AM, Peter Eisentraut
>  wrote:
>> On 12/13/17 02:35, Michael Paquier wrote:
>>> Patch 0001 is simply removing EventTriggerSupportsGrantObjectType(),
>>> but shouldn't we keep it and return an error for objects that have no
>>> GRANT support? Returning conditionally true looks like a trap waiting
>>> to take someone in.
>>
>> I don't understand the motivation for this.  It would just be two lists
>> for the same thing.
> 
> Not really. What grant supports is a subset of what event triggers do.
> 
>> I think the potential for omission would be much greater that way.
> 
> That's the whole point of not having "default" in the switches, no?
> Any object additions would be caught at compilation time.

I think the purpose of EventTriggerSupportsGrantObjectType() is to tell
which object types are supported for event triggers.  The purpose is not
to tell which object types are supported by GRANT.

The way I have written it, if GRANT gets support for a new object type,
then event trigger support automatically happens, without having to
update another list.

As a related example, we use the generic
EventTriggerSupportsObjectType() for RenameStmt, even though we don't
actually support RenameStmt on every ObjectType.  And there are probably
more examples like that.  Taken to the extreme, you'd need to remove
EventTriggerSupportsObjectType() altogether.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Bitmap table scan cost per page formula

2017-12-19 Thread Haisheng Yuan
Hi hackers,

This is Haisheng Yuan from Greenplum Database.

We had a query in production showing that the planner favors seqscan over
bitmapscan: executing the seqscan is 5x slower than using the bitmapscan, yet
the estimated cost of the bitmapscan is 2x the cost of the seqscan. The
statistics were up to date and quite accurate.

Bitmap table scan uses a formula to interpolate between random_page_cost
and seq_page_cost to determine the cost per page. In Greenplum Database,
the default value of random_page_cost is 100, the default value of
seq_page_cost is 1. With the original cost formula, random_page_cost
dominates the final cost result: even though the formula is declared to be
non-linear, it still behaves almost linearly, which cannot reflect the real
cost per page when a majority of pages are fetched. Therefore, the cost
formula is updated to a genuinely non-linear function that reflects both
random_page_cost and seq_page_cost at different percentages of pages
fetched.

Even though PostgreSQL has a much smaller default value of random_page_cost
(4), the same problem exists there if we change the default value.

Old cost formula:
cost_per_page = random_page_cost - (random_page_cost - seq_page_cost) *
sqrt(pages_fetched / T);
[Graph: cost per page under the old formula]

New cost formula:
cost_per_page = seq_page_cost * pow(random_page_cost / seq_page_cost, 1 -
sqrt(pages_fetched / T));
[Graph: cost per page under the new formula]

Below is the graph (credit to Heikki) that plots the total estimated cost
of a bitmap heap scan, where the table size is T pages, with
random_page_cost=10 and seq_page_cost=1. The X axis is the number of pages
fetched: on the left, no pages are fetched, and at the right end, at T, all
pages are fetched. The original formula is in black, the new
formula in red, and the horizontal line, in blue, shows the cost of a Seq
Scan.
[Graph: total estimated cost of a bitmap heap scan; original formula in black, new formula in red, Seq Scan cost in blue]
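
To reproduce the curves numerically without the images, here is a small
standalone C program evaluating both formulas; the constants
(random_page_cost=10, seq_page_cost=1, and an assumed table size of 10,000
pages) are purely illustrative, and nothing here is taken from costsize.c:

#include <math.h>
#include <stdio.h>

int
main(void)
{
	const double seq_page_cost = 1.0;
	const double random_page_cost = 10.0;
	const double T = 10000.0;		/* assumed total pages in the table */
	double		pages_fetched;

	for (pages_fetched = 1000; pages_fetched <= T; pages_fetched += 1000)
	{
		double		frac = sqrt(pages_fetched / T);
		double		old_cost = random_page_cost -
			(random_page_cost - seq_page_cost) * frac;
		double		new_cost = seq_page_cost *
			pow(random_page_cost / seq_page_cost, 1.0 - frac);

		printf("%6.0f pages fetched: old %.2f, new %.2f per page\n",
			   pages_fetched, old_cost, new_cost);
	}
	return 0;
}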


Thoughts? Any better ideas?

Thanks!
Haisheng Yuan


Re: genomic locus

2017-12-19 Thread Oleg Bartunov
On Fri, Dec 15, 2017 at 10:49 PM, Gene Selkov  wrote:
> Greetings everyone,

Hello!

>
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.

Why not use a composite type? A simple interval approach has worked for us
(see attached hdate.sql). If you need to specify a distribution function,
then it may be worth looking at the Orion project:
http://orion.cs.purdue.edu/index.html
Six years ago we were thinking about implementing a special UNCERTAINTY data type
(http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but never
started :( It'd be nice if you started this project, which is very interesting for science.

btw, now you can use range data type, check
https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf
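
A minimal SQL sketch of the composite-plus-range idea (the type, table and
column names below are invented):

CREATE TYPE locus AS (chrom text, span int4range);

CREATE TABLE features (id serial PRIMARY KEY, pos locus);
INSERT INTO features (pos) VALUES (ROW('chr7', int4range(140100, 140200)));

-- the interval logic comes for free from the range type:
SELECT id
FROM   features
WHERE  (pos).chrom = 'chr7'
  AND  (pos).span && int4range(140150, 140300);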



>
> I noticed somebody took a good care of seg while I was away for the last 20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?
>
> I have seen a lot of bit rot in other extensions (never contributed) that I
> have not maintained since 2009 and I now I am unable to fix some of them, so
> I wonder how much of old knowledge is still applicable. In other words, is
> what I see in new code just a change of macros or the change of principles?

nothing special, copy, modify, compile




>
> Thanks,
>
> --Gene


hdate.sql
Description: Binary data


Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-12-19 Thread Robert Haas
On Tue, Dec 19, 2017 at 1:29 PM, Robert Haas  wrote:
> Well, not really, because the context surrounding the lines you've
> added seems to refer to SQL that I can't find in join.sql or anywhere
> else in the tree.  So my suspicion is that this patch is based on your
> parallel hash patch set rather than master.

Thomas pinged me off-list and showed patch -p1 < $file working fine
for him and ... I'm dumb.  I was trying to apply the patch to v10, not
master.  It applies fine to master; committed there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: genomic locus

2017-12-19 Thread Andrey Zhidenkov
An uncertain type is a great idea. I needed a type like this to store
information about premiere dates and people's birth dates in a movie database.
In the end I made two columns, a date and a precision. The precision could be
'year', 'month' and so on, and if the precision was 'year' the date always
had to look like 'YYYY-01-01'. I have just read your presentation from
2011 and have a question - is it possible to compare two uncertain dates
like 'January 2017' and 'year 2017'?

On 19 Dec. 2017 23:22, "Oleg Bartunov"  wrote:

On Fri, Dec 15, 2017 at 10:49 PM, Gene Selkov  wrote:
> Greetings everyone,

Hello!

>
> I need a data type to represent genomic positions, which will consist of a
> string and a pair of integers with interval logic and access methods. Sort
> of like my seg type, but more straightforward.

Why not use composite type ? For simple interval approach it's worked for us
(see attached hdate.sql). If you need to specify distribution
function, than it may be
worth to see orion project http://orion.cs.purdue.edu/index.html
6 years ago we was thinking about implementation special UNCERTAINTY data
type
(http://www.sai.msu.su/~megera/postgres/talks/big_uncertain_data.pdf), but
never
started :( It'd be nice if you start this very interesting for science
project.

btw, now you can use range data type, check
https://wiki.postgresql.org/images/7/73/Range-types-pgopen-2012.pdf



>
> I noticed somebody took a good care of seg while I was away for the last
20
> years, and I am extremely grateful for that. I have been using it. In the
> meantime, things have changed and now I am almost clueless about how you
> deal with contributed modules and what steps I should take once I get it
to
> work. Also, is it a good idea to clone and fix seg for this purpose, or is
> there a more appropriate template? Or maybe just building it from scratch
> will be a better idea?
>
> I have seen a lot of bit rot in other extensions (never contributed) that
I
> have not maintained since 2009 and I now I am unable to fix some of them,
so
> I wonder how much of old knowledge is still applicable. In other words, is
> what I see in new code just a change of macros or the change of
principles?

nothing special, copy, modify, compile




>
> Thanks,
>
> --Gene


Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-19 Thread Alvaro Herrera
Peter Eisentraut wrote:
> On 12/15/17 17:34, Michael Paquier wrote:
> > On Sat, Dec 16, 2017 at 2:39 AM, Peter Eisentraut
> >  wrote:

> > That's the whole point of not having "default" in the switches, no?
> > Any object additions would be caught at compilation time.
> 
> I think the purpose of EventTriggerSupportsGrantObjectType() is to tell
> which object types are supported for event triggers.  The purpose is not
> to tell which object types are supported by GRANT.
> 
> The way I have written it, if GRANT gets support for a new object type,
> then event trigger support automatically happens, without having to
> update another list.

That's correct, and using a single implementation as in the posted patch
is desirable.  I was not happy about having to add
EventTriggerSupportsGrantObjectType (c.f.  commit 296f3a605384).

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] postgres_fdw: Add support for INSERT OVERRIDING clause

2017-12-19 Thread Peter Eisentraut
On 11/29/17 19:59, Michael Paquier wrote:
> On Wed, Nov 29, 2017 at 1:53 PM, Michael Paquier
>  wrote:
>> On Wed, Nov 29, 2017 at 8:12 AM, Tom Lane  wrote:
>>> IIRC, this issue was debated at great length back when we first put
>>> in foreign tables, because early drafts of postgres_fdw did what you
>>> propose here, and we ran into very nasty problems.  We eventually decided
>>> that allowing remotely-determined column defaults was a can of worms we
>>> didn't want to open.  I do not think that GENERATED columns really change
>>> anything about that.  They certainly don't do anything to resolve the
>>> problems we were contending with back then.  (Which I don't recall the
>>> details of; you'll need to trawl the archives.  Should be somewhere early
>>> in 2013, though, since we implemented that change in commit 50c19fc76.)
>>
>> So this gives a good reason to do nothing or return an error at
>> postgres_fdw level for OVERRIDING?
> 
> Moving the patch to next CF as the discussion has not settled yet.

I think I'll close this patch.  I was operating under the assumption
that there is a bug of omission in PG10 here.  But it seems this
combination of features just isn't meant to work together at this time.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Boolean partitions syntax

2017-12-19 Thread Mark Dilger

> On Dec 12, 2017, at 10:32 PM, Amit Langote  
> wrote:
> 
> On 2017/12/12 15:39, Amit Langote wrote:
>> On 2017/12/12 15:35, Amit Langote wrote:
>>> Works for me, updated patch attached.
>> 
>> Oops, attached the old one with the last email.
>> 
>> Updated one really attached this time.
> 
> Added to CF: https://commitfest.postgresql.org/16/1410/

This compiles and passes the regression tests for me.

I extended your test a bit to check whether partitions over booleans are useful.
Note specifically the 'explain' output, which does not seem to restrict the scan
to just the relevant partitions.  You could easily argue that this is beyond
the scope of your patch (and therefore not your problem), but I doubt it makes
much sense to have boolean partitions without planner support for skipping
partitions like is done for tables partitioned over other data types.

mark



-- boolean partitions
create table boolspart (a bool, b text) partition by list (a);
create table boolspart_t partition of boolspart for values in (true);
create table boolspart_f partition of boolspart for values in (false);
\d+ boolspart
 Table "public.boolspart"
 Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | 
Description 
+-+---+--+-+--+--+-
 a  | boolean |   |  | | plain|  | 
 b  | text|   |  | | extended |  | 
Partition key: LIST (a)
Partitions: boolspart_f FOR VALUES IN (false),
boolspart_t FOR VALUES IN (true)

insert into boolspart (a, b) values (false, 'false');
insert into boolspart (a, b) values (true, 'true');
explain select * from boolspart where a is true;
 QUERY PLAN  
-
 Append  (cost=0.00..46.60 rows=1330 width=33)
   ->  Seq Scan on boolspart_f  (cost=0.00..23.30 rows=665 width=33)
 Filter: (a IS TRUE)
   ->  Seq Scan on boolspart_t  (cost=0.00..23.30 rows=665 width=33)
 Filter: (a IS TRUE)
(5 rows)

explain select * from boolspart where a is false;
 QUERY PLAN  
-
 Append  (cost=0.00..46.60 rows=1330 width=33)
   ->  Seq Scan on boolspart_f  (cost=0.00..23.30 rows=665 width=33)
 Filter: (a IS FALSE)
   ->  Seq Scan on boolspart_t  (cost=0.00..23.30 rows=665 width=33)
 Filter: (a IS FALSE)
(5 rows)

drop table boolspart;
create table multiboolspart (a bool, b bool, c bool, d float, e text) partition 
by range (a, b, c);
create table multiboolspart_fff partition of multiboolspart for values from 
(minvalue, minvalue, minvalue) to (false, false, false);
create table multiboolspart_fft partition of multiboolspart for values from 
(false, false, false) to (false, false, true);
create table multiboolspart_ftf partition of multiboolspart for values from 
(false, false, true) to (false, true, false);
create table multiboolspart_ftt partition of multiboolspart for values from 
(false, true, false) to (false, true, true);
create table multiboolspart_tff partition of multiboolspart for values from 
(false, true, true) to (true, false, false);
create table multiboolspart_tft partition of multiboolspart for values from 
(true, false, false) to (true, false, true);
create table multiboolspart_ttf partition of multiboolspart for values from 
(true, false, true) to (true, true, false);
create table multiboolspart_ttt partition of multiboolspart for values from 
(true, true, false) to (true, true, true);
create table multiboolspart_max partition of multiboolspart for values from 
(true, true, true) to (maxvalue, maxvalue, maxvalue);
\d+ multiboolspart;
   Table "public.multiboolspart"
 Column |   Type   | Collation | Nullable | Default | Storage  | Stats 
target | Description 
+--+---+--+-+--+--+-
 a  | boolean  |   |  | | plain|
  | 
 b  | boolean  |   |  | | plain|
  | 
 c  | boolean  |   |  | | plain|
  | 
 d  | double precision |   |  | | plain|
  | 
 e  | text |   |  | | extended |
  | 
Partition key: RANGE (a, b, c)
Partitions: multiboolspart_fff FOR VALUES FROM (MINVALUE, MINVALUE, MINVALUE) 
TO (false, false, false),
multiboolspart_fft FOR VALUES FROM (false, false, false) TO (false, 
false, true),
multiboolspart_ftf FOR VALUES FROM (false, false, true) TO (false, 
true, false),
multiboolspart_ftt FOR VALUES FROM (false, true,

Re: File name as application name in regression tests and elsewhere

2017-12-19 Thread Andrew Dunstan


On 12/18/2017 03:00 PM, Peter Eisentraut wrote:
> On 12/18/17 06:59, Andrew Dunstan wrote:
>> I was doing some work over the weekend and it occurred to me that it
>> would be quite nice to have the input file name from regression tests
>> set as the application name, and then use a log_line_prefix setting
>> including %a, so that this would appear on the logs.
> It does do that already, as of commit
> a4327296df7366ecc657b706a9b5e87aa921311a.  Is it not working for you?
>


Hah! Totally missed that! It's not working for me because the buildfarm
is carefully overriding the log_line_prefix without including %a - it's
been like that for ages. I will change the default and go around and
change all my machine settings.

cheers

andrew

-- 

Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Estimate maintenance_work_mem for CREATE INDEX

2017-12-19 Thread Alex Shulgin
On Tue, Dec 19, 2017 at 3:15 PM Greg Stark  wrote:

> On 19 December 2017 at 10:00, Oleksandr Shulgin
>  wrote:
>
> > If there would be an option in the database itself to provide those
> > estimation, we wouldn't even need to figure out estimation queries.
> > "EXPLAIN CREATE INDEX" anyone?
>
> You're not the first to propose something like that. I think an
> EXPLAIN ALTER TABLE would also be very handy -- it's currently
> impossible to tell without carefully reading the source code whether a
> given DDL change will require a full table scan, a full table rewrite,
> or just a quick meta data update (and even in that case what strength
> lock will be required). I think there are other utility statements
> that make interesting heuristic decisions that would be nice to be
> able to have some visibility into -- CLUSTER comes to mind.
>

Yes, that would be pretty handy.

I'm not clear how you would determine how much memory is needed to
> sort a table without actually doing the sort though. So that would be
> more of an EXPLAIN ANALYZE wouldn't it?
>

My idea would be to use statistics, so that EXPLAIN CREATE INDEX (or
whatever the actual interface would look like) could benefit from up-to-date
statistics produced by ANALYZE.

Based on the estimated number of rows in the table, average width of
column(s) to index and taking into account the bookkeeping structures one
should be able to arrive at a good guess for the amount of memory the
backend would end up allocating (assuming it is available).
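
As a back-of-the-envelope illustration of that approach (this is not an
existing feature; the table and column names and the assumed 24 bytes of
per-tuple sort overhead are made up):

SELECT c.reltuples::bigint AS est_rows,
       s.avg_width         AS avg_col_width,
       pg_size_pretty((c.reltuples * (s.avg_width + 24))::bigint)
                           AS rough_sort_input_size
FROM   pg_class c
JOIN   pg_stats s ON s.tablename = c.relname
WHERE  c.relname = 'my_table'
  AND  s.attname = 'my_column';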

Having done that, as the first step, and using statistics again we could
also infer (though, probably with less accuracy) memory requirements for
building partial indexes.  Functional indexes would be harder to tackle; I
would think this is only possible if the return type(s) of the function(s)
all have a fixed width.

I didn't look in the code, but I imagine the procedure to read -> sort
-> spill to tapes, if needed -> merge sort the tapes is generic to all
index types, so this shouldn't be a breaking change for any user-defined
indexes (is this already a thing?).  OK, maybe it's only generic for B-Tree
and BRIN, but not for GIN and GiST, to name a few.  Damn, I gotta look in
the code at some point. ;-)

To let me fantasize a little more, what I would also love to see is the
estimated on-disk size for the resulting index, before starting to create
it.  This is obviously dependent on the actual index type and options, such
as fill-factor, etc.

Cheers,
--
Alex


TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals

2017-12-19 Thread Erik Rijkers

I saw this just now:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, 
snap->xmin))", File: "snapbuild.c", Line: 580)


while running 50 cascading instances on a single machine.

select version():
PostgreSQL 11devel_HEAD_20171219_2158_7d3583ad9ae5 on 
x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18) 6.3.0 20170516, 
64-bit



I don't know if the admittedly somewhat crazy 50 instances make that 
error acceptable/expected but I thought I'd report it anyway.


thanks,

Erik Rijkers




Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-19 Thread Tomas Vondra
Hi,

a couple of months ago there was proposal / patch with the similar
goals, from Andrey Borodin. See these two threads

[1]
https://www.postgresql.org/message-id/flat/843D96CC-7C55-4296-ADE0-622A7ACD4978%40yesql.se#843d96cc-7c55-4296-ade0-622a7acd4...@yesql.se

[2]
https://www.postgresql.org/message-id/flat/449A7A9D-DB58-40F8-B80E-4C4EE7DB47FD%40yandex-team.ru#449a7a9d-db58-40f8-b80e-4c4ee7db4...@yandex-team.ru

I recall there was a long discussion regarding which of the approaches
is the *right* one (although that certainly depends on other factors).

On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote:
> In this thread I would like to raise the issue of incremental backups.
> What I suggest in this thread, is to choose one direction, so we can
> concentrate our community efforts.
> There is already a number of tools, which provide incremental backup.
> And we can see five principle techniques they use:
> 
> 1. Use file modification time as a marker that the file has changed.
> 2. Compute file checksums and compare them.
> 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
> 4. Scan all WAL files in the archive since the previous backup and
> collect information about changed pages.
> 5. Track page changes on the fly. (ptrack)
> 
> They can also be combined to achieve better performance.
> 
> My personal candidate is the last one, since it provides page-level
> granularity, while most of the others approaches can only do file-level
> incremental backups or require additional reads or calculations.
> 

I share the opinion that options 1 and 2 are not particularly
attractive, due to either unreliability, or not really saving that much
CPU and I/O.

I'm not quite sure about 3, because it doesn't really explain how it would
be done - it seems to assume we'd have to reread the files. I'll get
back to this.

Option 4 has some very interesting features. Firstly, relies on WAL and
so should not require any new code (and it could, in theory, support
even older PostgreSQL releases, for example). Secondly, this can be
offloaded to a different machine. And it does even support additional
workflows - e.g. "given these two full backups and the WAL, generate an
incremental backup between them".

So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.


> In a nutshell, using ptrack patch, PostgreSQL can track page changes on
> the fly. Each time a relation page is updated, this page is marked in a
> special PTRACK bitmap fork for this relation. As one page requires just
> one bit in the PTRACK fork, such bitmaps are quite small. Tracking
> implies some minor overhead on the database server operation but speeds
> up incremental backups significantly.
> 

That makes sense, I guess, although I find the "ptrack" acronym somewhat
cryptic, and we should probably look for something more descriptive. But
the naming can wait, I guess.

My main question is whether a bitmap is the right data type. It seems to cause
a lot of complexity later, because it needs to be reset once in a while,
you have to be careful about failed incremental backups etc.

What if we tracked the LSN for each page instead? Sure, it'd require
64x more space (1 bit -> 8 bytes per page), but it would not require
resets, you could take incremental backups from arbitrary point in time,
and so on. That seems like a significant improvement to me, so perhaps
the space requirements are justified (still just 1MB for 1GB segment,
with the default 8kB pages).
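
For reference, the arithmetic behind those numbers:

  1 GB segment / 8 kB page           = 131,072 pages
  131,072 pages * 8 bytes (an LSN)   = 1 MB per segment
  131,072 pages * 1 bit  (bitmap)    = 16 kB per segment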

> Detailed overview of the implementation with all pros and cons,
> patches and links to the related threads you can find here:
> 
> https://wiki.postgresql.org/index.php?title=PTRACK_incremental_backups.
> 
> Patches for v 10.1 and v 9.6 are attached.
> Since ptrack is basically just an API for use in backup tools, it is
> impossible to test the patch independently.
> Now it is integrated with our backup utility, called pg_probackup. You can
> find it here: https://github.com/postgrespro/pg_probackup
> Let me know if you find the documentation too long and complicated, I'll
> write a brief How-to for ptrack backups.
> 
> Spoiler: Please consider this patch and README as a proof of concept. It
> can be improved in some ways, but in its current state PTRACK is a
> stable prototype, reviewed and tested well enough to find many
> non-trivial corner cases and subtle problems. And any discussion of
> change track algorithm must be aware of them. Feel free to share your
> concerns and point out any shortcomings of the idea or the implementation.
> 

Thanks for the proposal and patch!

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals

2017-12-19 Thread Erik Rijkers

On 2017-12-19 23:35, Erik Rijkers wrote:

I saw this just now:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)

while running 50 cascading instances on a single machine.


Sorry, that was probably too terse, I should explain that a little.

After initing 50 instances, I set up and run a pgbench session in the 
master session; the pgbench lines are:


  init: pgbench --port=6515 --quiet --initialize --scale=1 postgres
  run:  pgbench -M prepared -c 16 -j 8 -T 1 -P 1 -n postgres -- scale 1

the other instances then catch up.  The whole thing takes 5 minutes or so.

I vary scale, duration, and number of instances.  I hadn't had it fail
in this way before, but I mostly tried with a lower number of instances (up to
25 or so).



select version():
PostgreSQL 11devel_HEAD_20171219_2158_7d3583ad9ae5 on
x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18) 6.3.0 20170516,
64-bit


I don't know if the admittedly somewhat crazy 50 instances make that
error acceptable/expected but I thought I'd report it anyway.

thanks,

Erik Rijkers




Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-12-19 Thread Tomas Vondra
Hi,

On 12/19/2017 08:17 PM, Mark Dilger wrote:
> 
> I tested your latest patches on my mac os x laptop and got one test
> failure due to the results of 'explain' coming up differently.  For the 
> record,
> I followed these steps:
> 
> cd postgresql/
> git pull
> # this got my directory up to 8526bcb2df76d5171b4f4d6dc7a97560a73a5eff with 
> no local changes
> patch -p 1 < ../0001-multivariate-MCV-lists.patch
> patch -p 1 < ../0002-multivariate-histograms.patch
> ./configure --prefix=/Users/mark/master/testinstall --enable-cassert 
> --enable-tap-tests --enable-depend && make -j4 && make check-world
> 

Yeah, those steps sounds about right.

Apparently this got broken by ecc27d55f4, although I don't quite
understand why - it works fine before that commit. Can you check whether it
works fine on 9f4992e2a9 and fails with ecc27d55f4?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [JDBC] [HACKERS] Channel binding support for SCRAM-SHA-256

2017-12-19 Thread Michael Paquier
On Wed, Dec 20, 2017 at 1:19 AM, Peter Eisentraut
 wrote:
> I have committed 0001 and 0002 (renaming to scram_channel_binding).

Thanks!

> The 0003 patch looks mostly fine as well.  The only concern I have is
> that the way it is set up now, we make the server compute the channel
> binding data for both tls-unique and tls-server-end-point, even though
> we only end up using one.  This might need some restructuring so that we
> only get the data we need once we have learned which channel binding
> type the client requested.

The current patch is focused on simplicity and it has the advantage
that there is no need to depend on any SSL structures or Port* in
fe-auth-scram.c and auth-scram.c. So I would really like to keep the
code simple with this goal in mind.

Speaking of which, making the server-side code do that is, in my opinion,
going to be grotty, because the server only learns the channel binding type
used by the client after reading its first message, so we would need to call
the SSL-related APIs directly in auth-scram.c, and update scram_state with
the appropriate data in the middle of the exchange. This would add two
dependencies (on Port* and on the SSL APIs) to auth-scram.c.

However, it is possible to simply optimize the frontend code, since in
pg_SASL_init() we already know the selected channel binding type when
calling pgtls_get_finished() and pgtls_get_peer_certificate_hash(). So
while I agree with your point, my opinion is to keep the code as
simple as possible, and then just optimize the frontend code. What do
you think?
-- 
Michael



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-19 Thread Craig Ringer
On 20 December 2017 at 02:35, Andres Freund  wrote:


>
>
> > Yeah.  But please don't mess with MemoryContextStats per se ---
> > I dunno about you guys but "call MemoryContextStats(TopMemoryContext)"
> > is kinda wired into my gdb reflexes.  I think what'd make sense
> > is a new function "MemoryContextStatsTo(context, function_pointer)".
> > It's okay to redefine the APIs of the per-context-type functions
> > these would call, though, because nobody calls those functions directly.
>
> We already have MemoryContextStatsDetail() - it seems to make sense to
> expand that API and leave MemoryContextStats() alone. I don't think
> there's a need for a third variant?
>

Cool, can do.

I'll have to expose a typedef for the printf-wrapper callback in memnodes.h
and add it to the stats method, which I thought would be more likely to get
complaints than the global hook. I'm actually happier to do it with a
passed callback.

Will revise when I get a chance in the next couple of days.
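
Roughly the shape such an API could take -- the typedef name and the exact
parameter list below are guesses, not committed code:

/* memnodes.h (sketch only) */
typedef void (*MemoryStatsPrintFunc) (MemoryContext context, void *passthru,
                                      const char *stats_line);

/* mcxt.c (sketch only) */
extern void MemoryContextStatsDetail(MemoryContext context, int max_children,
                                     MemoryStatsPrintFunc printfunc,
                                     void *passthru);

/* the gdb-friendly entry point keeps its current behaviour by passing a
 * stderr-printing callback (print_to_stderr here is hypothetical) */
void
MemoryContextStats(MemoryContext context)
{
	MemoryContextStatsDetail(context, 100, print_to_stderr, NULL);
}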

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: access/parallel.h lacks PGDLLIMPORT

2017-12-19 Thread Craig Ringer
On 19 December 2017 at 23:24, Robert Haas  wrote:

> On Tue, Dec 19, 2017 at 3:36 AM, Amit Kapila 
> wrote:
> > I also think it is good to allow ParallelWorkerNumber to be used in
> > extensions.  Attached is the patch for same.  I think for other two we
> > should wait till there is really a good use case for them.
>
> I think waiting for a "really good" use case is too high a bar.  I
> think we only need to think that there is a "reasonable" use case.
> Accordingly, I pushed a commit adding PGDLLIMPORT to both
> ParallelWorkerNumber and InitializingParallelWorker.
>

Especially since all non-static *functions* are exported unconditionally,
so it's not like we set a high bar for the public API.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Statically linking ICU with Postgres

2017-12-19 Thread leoaaryan
I was able to do it.
I had to build ICU statically first with --enable-static --enable-shared=no
options.

While configuring Postgres I had to use the right flags for ICU:
./configure --prefix=/leoaaryan/postgres-10 ... --with-icu
ICU_CFLAGS="-I/leoaaryan/postgres-10/include -DU_STATIC_IMPLEMENTATION"
ICU_LIBS="-L/leoaaryan/postgres-10/lib -licui18n -licuuc -licudata -lstdc++"
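
One way to sanity-check the result on Linux after make install (assuming ldd
is available; the binary path follows the prefix used above):

$ ldd /leoaaryan/postgres-10/bin/postgres | grep -i icu
# no output means the server binary no longer depends on shared ICU libraries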



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-19 Thread Michael Paquier
On Wed, Dec 20, 2017 at 5:43 AM, Alvaro Herrera  wrote:
> Peter Eisentraut wrote:
>> On 12/15/17 17:34, Michael Paquier wrote:
>> > On Sat, Dec 16, 2017 at 2:39 AM, Peter Eisentraut
>> >  wrote:
>
>> > That's the whole point of not having "default" in the switches, no?
>> > Any object additions would be caught at compilation time.
>>
>> I think the purpose of EventTriggerSupportsGrantObjectType() is to tell
>> which object types are supported for event triggers.  The purpose is not
>> to tell which object types are supported by GRANT.
>>
>> The way I have written it, if GRANT gets support for a new object type,
>> then event trigger support automatically happens, without having to
>> update another list.
>
> That's correct, and using a single implementation as in the posted patch
> is desirable.  I was not happy about having to add
> EventTriggerSupportsGrantObjectType (c.f.  commit 296f3a605384).

Hm. OK. I would have thought that allowing a new object type to work
automatically is actually what we would have liked to avoid for safety. So
complaint withdrawn.

-stringify_adefprivs_objtype(GrantObjectType objtype)
+stringify_adefprivs_objtype(ObjectType objtype)
[...]
+default:
+elog(ERROR, "unrecognized grant object type: %d", (int) objtype);
+return "???";/* keep compiler quiet */
 }
-
-elog(ERROR, "unrecognized grant object type: %d", (int) objtype);
-return "???";/* keep compiler quiet */
Still, this portion in 0001 is something that we try to avoid as much
as possible, no? I would have thought that all object types should be
listed directly so that nothing is missed in the future.
-- 
Michael



Re: WIP: BRIN multi-range indexes

2017-12-19 Thread Tomas Vondra


On 12/19/2017 08:38 PM, Mark Dilger wrote:
> 
>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra  
>> wrote:
>>
>> Hi,
>>
>> Apparently there was some minor breakage due to duplicate OIDs, so here
>> is the patch series updated to current master.
>>
>> regards
>>
>> -- 
>> Tomas Vondra  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> <0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
> 
> 
> After applying these four patches to my copy of master, the regression
> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
> 

D'oh! There was an incorrect OID referenced in pg_opclass, which was
also used by the satisfies_hash_partition() function. Fixed patches
attached.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz
Description: application/gzip


0002-BRIN-bloom-indexes.patch.gz
Description: application/gzip


0003-BRIN-multi-range-minmax-indexes.patch.gz
Description: application/gzip


0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz
Description: application/gzip


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-12-19 Thread Mark Dilger

> On Dec 19, 2017, at 4:31 PM, Tomas Vondra  
> wrote:
> 
> Hi,
> 
> On 12/19/2017 08:17 PM, Mark Dilger wrote:
>> 
>> I tested your latest patches on my mac os x laptop and got one test
>> failure due to the results of 'explain' coming up differently.  For the 
>> record,
>> I followed these steps:
>> 
>> cd postgresql/
>> git pull
>> # this got my directory up to 8526bcb2df76d5171b4f4d6dc7a97560a73a5eff with 
>> no local changes
>> patch -p 1 < ../0001-multivariate-MCV-lists.patch
>> patch -p 1 < ../0002-multivariate-histograms.patch
>> ./configure --prefix=/Users/mark/master/testinstall --enable-cassert 
>> --enable-tap-tests --enable-depend && make -j4 && make check-world
>> 
> 
> Yeah, those steps sounds about right.
> 
> Apparently this got broken by ecc27d55f4, although I don't quite
> understand why - but it works fine before. Can you try if it works fine
> on 9f4992e2a9 and fails with ecc27d55f4?

It succeeds with 9f4992e2a9.  It fails with ecc27d55f4.  The failures look
to be the same as I reported previously.

mark




Re: WIP: BRIN multi-range indexes

2017-12-19 Thread Mark Dilger

> On Dec 19, 2017, at 5:16 PM, Tomas Vondra  
> wrote:
> 
> 
> 
> On 12/19/2017 08:38 PM, Mark Dilger wrote:
>> 
>>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra  
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> Apparently there was some minor breakage due to duplicate OIDs, so here
>>> is the patch series updated to current master.
>>> 
>>> regards
>>> 
>>> -- 
>>> Tomas Vondra  http://www.2ndQuadrant.com
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>> <0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
>> 
>> 
>> After applying these four patches to my copy of master, the regression
>> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
>> 
> 
> D'oh! There was an incorrect OID referenced in pg_opclass, which was
> also used by the satisfies_hash_partition() function. Fixed patches
> attached.

Thanks!  These fix the regression test failures.  On my mac, all tests are now
passing.  I have not yet looked any further into the merits of these patches,
however.

mark


Re: Boolean partitions syntax

2017-12-19 Thread Amit Langote
Hi Mark,

On 2017/12/20 6:46, Mark Dilger wrote:
>> On Dec 12, 2017, at 10:32 PM, Amit Langote  
>> wrote:
>> Added to CF: https://commitfest.postgresql.org/16/1410/
> 
> This compiles and passes the regression tests for me.

Thanks for the review.

> I extended your test a bit to check whether partitions over booleans are 
> useful.
> Note specifically the 'explain' output, which does not seem to restrict the 
> scan
> to just the relevant partitions.  You could easily argue that this is beyond 
> the scope
> of your patch (and therefore not your problem), but I doubt it makes much 
> sense
> to have boolean partitions without planner support for skipping partitions, as
> is done for tables partitioned over other data types.

Yeah.  Actually, I'm aware that the planner doesn't handle this yet.  While
constraint exclusion (planner's current method of skipping partitions)
does not work with IS TRUE/FALSE/UNKNOWN clauses, the new partition
pruning patch [1] addresses that.  In fact, I started this thread prompted
by some discussion about Boolean partitions on that thread [2].

That said, someone might argue that we should also fix constraint
exclusion (the current method of partition pruning) so that partition
skipping works correctly for Boolean partitions.

Thanks,
Amit

[1] https://commitfest.postgresql.org/15/1272/

[2]
https://www.postgresql.org/message-id/9b98fc47-34b8-0ab6-27fc-c8a0889f2e5b%40lab.ntt.co.jp




Shouldn't execParallel.c null-terminate query_string in the parallel DSM?

2017-12-19 Thread Thomas Munro
Hi hackers,

I just saw some trailing garbage in my log file emanating from a
parallel worker when my query happened to be a BUFFERALIGNed length
(in this case 64 characters).  Did commit 4c728f382970 forget to make
sure the null terminator is copied?  See attached proposed fix.
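To illustrate the failure mode in isolation, here is a self-contained toy
(not the execParallel.c code itself, just the same class of bug):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    const char *query = "SELECT 1";               /* stand-in for the query text */
    size_t      len_without_nul = strlen(query);  /* the buggy length */
    char       *shared = malloc(64);

    /* Simulate shared memory that is not zero-initialized. */
    memset(shared, 'X', 64);

    /* Bug: the terminating '\0' is not copied, so trailing bytes leak through. */
    memcpy(shared, query, len_without_nul);
    printf("worker sees: %.64s\n", shared);

    /* Fix: copy strlen(query) + 1 bytes so the terminator travels too. */
    memcpy(shared, query, len_without_nul + 1);
    printf("worker sees: %s\n", shared);

    free(shared);
    return 0;
}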

-- 
Thomas Munro
http://www.enterprisedb.com


null-terminate-query-string.patch
Description: Binary data


Re: Estimate maintenance_work_mem for CREATE INDEX

2017-12-19 Thread Michael Paquier
On Tue, Dec 19, 2017 at 11:14 PM, Greg Stark  wrote:
> You're not the first to propose something like that. I think an
> EXPLAIN ALTER TABLE would also be very handy -- it's currently
> impossible to tell without carefully reading the source code whether a
> given DDL change will require a full table scan, a full table rewrite,
> or just a quick meta data update (and even in that case what strength
> lock will be required). I think there are other utility statements
> that make interesting heuristic decisions that would be nice to be
> able to have some visibility into -- CLUSTER comes to mind.

One application of such things would be estimating the amount of
disk space needed when doing a schema upgrade, so that could be handy.
-- 
Michael



Re: WIP: BRIN multi-range indexes

2017-12-19 Thread Mark Dilger

> On Dec 19, 2017, at 5:16 PM, Tomas Vondra  
> wrote:
> 
> 
> 
> On 12/19/2017 08:38 PM, Mark Dilger wrote:
>> 
>>> On Nov 18, 2017, at 12:45 PM, Tomas Vondra  
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> Apparently there was some minor breakage due to duplicate OIDs, so here
>>> is the patch series updated to current master.
>>> 
>>> regards
>>> 
>>> -- 
>>> Tomas Vondra  http://www.2ndQuadrant.com
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>> <0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz><0002-BRIN-bloom-indexes.patch.gz><0003-BRIN-multi-range-minmax-indexes.patch.gz><0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz>
>> 
>> 
>> After applying these four patches to my copy of master, the regression
>> tests fail for F_SATISFIES_HASH_PARTITION 5028 as attached.
>> 
> 
> D'oh! There was an incorrect OID referenced in pg_opclass, which was
> also used by the satisfies_hash_partition() function. Fixed patches
> attached.

Your use of type ScanKey in src/backend/access/brin/brin.c is a bit confusing.
A ScanKey is defined elsewhere as a pointer to ScanKeyData.  When you define
an array of ScanKeys, you use pointer-to-pointer style:

+   ScanKey   **keys,
+  **nullkeys;

But when you allocate space for the array, you don't treat it that way:

+   keys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts);
+   nullkeys = palloc0(sizeof(ScanKey) * bdesc->bd_tupdesc->natts);

But then again when you use nullkeys, you treat it as a two-dimensional array:

+   nullkeys[keyattno - 1][nnullkeys[keyattno - 1]] = key;

and likewise when you allocate space within keys:

+    keys[keyattno - 1] = palloc0(sizeof(ScanKey) * scan->numberOfKeys);
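
To spell out my reading of those declarations, here is a self-contained
sketch (ScanKeyData reduced to a stub and palloc0 replaced with calloc; this
is not the code from the patch):

#include <stdio.h>
#include <stdlib.h>

typedef struct ScanKeyData
{
    int         sk_attno;       /* stub: the real struct has more fields */
} ScanKeyData;

typedef ScanKeyData *ScanKey;   /* ScanKey is already a pointer */

int
main(void)
{
    int         natts = 4;
    int         nkeys = 2;
    ScanKeyData keydata = {1};
    ScanKey     key = &keydata;

    /*
     * keys[attno][i] holds a ScanKey, so keys itself is ScanKey ** and the
     * outer array holds "natts" elements of type ScanKey *.  sizeof(ScanKey)
     * and sizeof(ScanKey *) are both pointer-sized, which is why the quoted
     * palloc0(sizeof(ScanKey) * natts) works even though it reads oddly.
     */
    ScanKey   **keys = calloc(natts, sizeof(ScanKey *));

    keys[0] = calloc(nkeys, sizeof(ScanKey));   /* per-attribute array of ScanKey */
    keys[0][0] = key;                           /* store one ScanKey */

    printf("attno = %d\n", keys[0][0]->sk_attno);

    free(keys[0]);
    free(keys);
    return 0;
}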

Could you please clarify?  I have been awake a bit too long; hopefully, I am
not merely missing the obvious.

mark




Re: Bitmap table scan cost per page formula

2017-12-19 Thread Justin Pryzby
On Tue, Dec 19, 2017 at 07:55:32PM +, Haisheng Yuan wrote:
> Hi hackers,
> 
> This is Haisheng Yuan from Greenplum Database.
> 
> We had a query in production showing that the planner favors seqscan over
> bitmapscan: the execution of the seqscan is 5x slower than using a
> bitmapscan, but the cost of the bitmapscan is 2x the cost of the seqscan. The
> statistics were up to date and quite accurate.
> 
> Bitmap table scan uses a formula to interpolate between random_page_cost
> and seq_page_cost to determine the cost per page. In Greenplum Database,
> the default value of random_page_cost is 100, the default value of
> seq_page_cost is 1. With the original cost formula, random_page_cost
> dominates the final cost result, even though the formula is declared to be
> non-linear. However, it is still close to linear, which can't reflect the
> real cost per page when a majority of pages are fetched. Therefore, the
> cost formula is updated to a truly non-linear function that reflects both
> random_page_cost and seq_page_cost for different percentages of pages
> fetched.
> 
> Even though PostgreSQL has a much smaller default value of random_page_cost
> (4), the same problem exists there if the default value is raised.
> 
> Old cost formula:
> cost_per_page = random_page_cost - (random_page_cost - seq_page_cost) *
> sqrt(pages_fetched / T);
> 
> New cost formula:
> cost_per_page = seq_page_cost * pow(random_page_cost / seq_page_cost, 1 -
> sqrt(pages_fetched / T));
> 
> Below is the graph (credit to Heikki) that plots the total estimated cost
> of a bitmap heap scan, where table size is 1 pages, and
> random_page_cost=10 and seq_page_cost=1. X axis is the number of pages
> fetched. I.e. on the left, no pages are fetched, and on the right end, at
> 1, all pages are fetched. The original formula is in black, the new
> formula in red, and the horizontal line, in blue, shows the cost of a Seq
> Scan.
> [graph omitted]

Thanks for caring about bitmap scans ;)
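
To make the two quoted formulas concrete, here is a minimal standalone C
sketch (illustrative values only; this is not the planner's costsize.c code):

#include <math.h>
#include <stdio.h>

/* Old interpolation: drifts from random_page_cost toward seq_page_cost. */
static double
cost_per_page_old(double random_page_cost, double seq_page_cost,
                  double pages_fetched, double T)
{
    return random_page_cost -
        (random_page_cost - seq_page_cost) * sqrt(pages_fetched / T);
}

/* Proposed interpolation: a geometric blend that approaches seq_page_cost
 * much sooner as the fraction of fetched pages grows. */
static double
cost_per_page_new(double random_page_cost, double seq_page_cost,
                  double pages_fetched, double T)
{
    return seq_page_cost *
        pow(random_page_cost / seq_page_cost, 1.0 - sqrt(pages_fetched / T));
}

int
main(void)
{
    const double T = 10000.0;   /* illustrative table size, in pages */

    for (int i = 1; i <= 10; i++)
    {
        double      pages_fetched = T * i / 10.0;

        printf("%3d%% fetched: old = %5.2f  new = %5.2f\n", i * 10,
               cost_per_page_old(10.0, 1.0, pages_fetched, T),
               cost_per_page_new(10.0, 1.0, pages_fetched, T));
    }
    return 0;
}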

There are a couple of earlier/ongoing discussions on this:

In this old thread: 
https://www.postgresql.org/message-id/CAGTBQpZ%2BauG%2BKhcLghvTecm4-cGGgL8vZb5uA3%3D47K7kf9RgJw%40mail.gmail.com
..Claudio Freire  wrote:
> Correct me if I'm wrong, but this looks like the planner not
> accounting for correlation when using bitmap heap scans.
> 
> Checking the source, it really doesn't.

..which I think is basically right: the formula does distinguish between the
cases of small or large fraction of pages, but doesn't use correlation.  Our
issue in that case seems to be mostly a failure of cost_index to account for
fine-scale deviations from large-scale correlation; but, if cost_bitmap
accounted for our high correlation metric (>0.99), it might've helped our case.

Note costsize.c:
 * Save amcostestimate's results for possible use in bitmap scan planning.
 * We don't bother to save indexStartupCost or indexCorrelation, because a
 * BITMAP SCAN DOESN'T CARE ABOUT EITHER.

See more at this recent/ongoing discussion (involving several issues, only one
of which is bitmap cost vs index cost):
https://www.postgresql.org/message-id/flat/20171206214652.GA13889%40telsasoft.com#20171206214652.ga13...@telsasoft.com

Consider the bitmap scans in the two different cases:

1) In Vitaliy's case, bitmap was being chosen in preference to index scan (due
in part to random_page_cost>1), but performed poorly, partially because the bitmap
component must complete before the heap reads can begin.  And also because the
heap reads for the test case involving modular division would've read pages
across the entire length of the table, incurring maximum lseek costs.

2) In my case from ~16 months ago, index scan was being chosen in preference to
bitmap, but the index scan was incurring high seek costs.  We would've been happy if
the planner had done a bitmap scan on a weekly-partitioned child table
(with 7 days of data), while querying one day's data (1/7th of the table), 99% of
which would have been strictly sequential page reads, incurring low lseek costs
(plus some component of random costs for the remaining 1% "long tail").

It seems clear to me that sequentially reading 1/7th of the tightly clustered
pages in a table ought to be costed differently than reading 1/7th of the pages
evenly distributed across its entire length.

I started playing with this weeks ago (probably during Vitaliy's problem
report).  Is there any reason cost_bitmap_heap_scan shouldn't interpolate based
on correlation from seq_page_cost to random_page_cost, the same as cost_index does?
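
By way of illustration, one possible shape of that idea -- a rough standalone
sketch that borrows cost_index's correlation-squared weighting; this is not
actual planner code and the names are approximate:

#include <math.h>
#include <stdio.h>

/*
 * Blend the per-page cost between the uncorrelated estimate (the existing
 * density-based interpolation) and the perfectly correlated case (plain
 * sequential reads), weighted by correlation^2 as cost_index does.
 */
static double
bitmap_cost_per_page(double seq_page_cost, double random_page_cost,
                     double pages_fetched, double T, double correlation)
{
    double      uncorrelated = random_page_cost -
        (random_page_cost - seq_page_cost) * sqrt(pages_fetched / T);
    double      correlated = seq_page_cost;
    double      csquared = correlation * correlation;

    return uncorrelated + csquared * (correlated - uncorrelated);
}

int
main(void)
{
    /* e.g. reading 1/7th of a 7000-page table, uncorrelated vs. corr > 0.99 */
    printf("corr = 0.00: %.2f\n", bitmap_cost_per_page(1.0, 4.0, 1000, 7000, 0.0));
    printf("corr = 0.99: %.2f\n", bitmap_cost_per_page(1.0, 4.0, 1000, 7000, 0.99));
    return 0;
}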

Justin



  1   2   >