Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 15:40, Amit Langote wrote:

On 2015-09-02 PM 03:25, Amit Kapila wrote:

On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita 

The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
UPDATE/DELETE does, because inheritance_planner guarantees that all
backends lock inheritance children in the same order to avoid needless
deadlocks.



Will it be able to do it for row-level locks?  Row-level locking occurs
during the update of a row, so will it be possible to ensure the order of
locks on rows?



Will it handle deadlocks across different table partitions. Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1



As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?


Yeah, I think so, too.

Sorry, maybe my explanation above was not enough, but in the inherited 
UPDATEs/DELETEs, the table modification is also ensured to be done in 
the same order.  So, as Amit Langote said, both transactions will do the 
updates in the same order.


Best regards,
Etsuro Fujita





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Albe Laurenz
Amit Langote wrote:
> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>> Will it handle deadlocks across different table partitions. Consider
>> a case as below:
>>
>> T1
>> 1. Updates row R1 of T1 on shard S1
>> 2. Updates row R2 of T2 on shard S2
>>
>> T2
>> 1. Updates row R2 of T2 on shard S2
>> 2. Updates row R1 of T1 on shard S1

> As long as shards are processed in the same order in different
> transactions, ISTM, this issue should not arise? I can imagine it becoming
> a concern if parallel shard processing enters the scene. Am I missing
> something?

That would only hold for a single query, right?

If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same order.

So T1 and T2 could deadlock.

Yours,
Laurenz Albe



Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Langote
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
> Amit Langote wrote:
>> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>>> Will it handle deadlocks across different table partitions. Consider
>>> a case as below:
>>>
>>> T1
>>> 1. Updates row R1 of T1 on shard S1
>>> 2. Updates row R2 of T2 on shard S2
>>>
>>> T2
>>> 1. Updates row R2 of T2 on shard S2
>>> 2. Updates row R1 of T1 on shard S1
> 
>> As long as shards are processed in the same order in different
>> transactions, ISTM, this issue should not arise? I can imagine it becoming
>> a concern if parallel shard processing enters the scene. Am I missing
>> something?
> 
> That would only hold for a single query, right?
> 
> If 1. and 2. in the above example come from different queries within one
> transaction, you cannot guarantee that shards are processed in the same order.
> 
> So T1 and T2 could deadlock.
> 

Sorry, I failed to see why that would be the case. Could you elaborate?

Thanks,
Amit





Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Michael Paquier
On Wed, Sep 2, 2015 at 6:21 AM, Michael Paquier wrote:
> Yes, that's what I have been looking at actually by having some markers in 
> relcache.c. The reference count of this relation get incremented here:

So, I have been playing more with this code, and as mentioned by
Andres this test case goes twice through the PG_CATCH block in
pl_exec.c, causing the crash as the temporary relation does not get
dereferenced. I may be missing something, but isn't it a matter of
moving the switch to the old memory context to the end of the block so
that we can get a correct cleanup of the estate in the first block?
See for example the attached patch, which fixes the problem for me and passes
make check-world, though I expect Tom and Andres to jump in and say
that this is not correct within the next 12 hours :)
I haven't yet written a test case, but I think that we could reproduce
the problem simply by having a relation referenced in the exception block of
a first function, calling a second function that itself raises an
exception, causing the referencing error.
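For what it's worth, a rough sketch of such a test case could look like the one
below (the function and table names are placeholders of mine, and this is only
the shape of the scenario, not a verified reproducer):

CREATE FUNCTION raise_error() RETURNS void LANGUAGE plpgsql AS $$
BEGIN
    RAISE EXCEPTION 'error from the second function';
END;
$$;

CREATE FUNCTION outer_func() RETURNS void LANGUAGE plpgsql AS $$
BEGIN
    RAISE EXCEPTION 'first error';
EXCEPTION WHEN OTHERS THEN
    -- reference a relation inside the exception block ...
    CREATE TEMP TABLE tmp_rel (a int);
    INSERT INTO tmp_rel VALUES (1);
    -- ... then call a second function that itself raises an exception
    PERFORM raise_error();
END;
$$;

SELECT outer_func();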
Regards,
-- 
Michael
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 935fa62..56084c3 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -1253,8 +1253,6 @@ exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
 
 			/* Abort the inner transaction */
 			RollbackAndReleaseCurrentSubTransaction();
-			MemoryContextSwitchTo(oldcontext);
-			CurrentResourceOwner = oldowner;
 
 			/* Revert to outer eval_econtext */
 			estate->eval_econtext = old_eval_econtext;
@@ -1326,6 +1324,9 @@ exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
 ReThrowError(edata);
 			else
 FreeErrorData(edata);
+
+			MemoryContextSwitchTo(oldcontext);
+			CurrentResourceOwner = oldowner;
 		}
 		PG_END_TRY();
 



Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 16:40, Amit Langote wrote:

On 2015-09-02 PM 04:07, Albe Laurenz wrote:

Amit Langote wrote:

On 2015-09-02 PM 03:25, Amit Kapila wrote:

Will it handle deadlocks across different table partitions. Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1



As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?


That would only hold for a single query, right?

If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same order.

So T1 and T2 could deadlock.



Sorry, I failed to see why that would be the case. Could you elaborate?


I think Laurenz would assume that the updates 1. and 2. in the above 
transactions are performed *in a non-inherited manner*.  If that's 
right, T1 and T2 could deadlock, but I think we assume here to run 
transactions over shards *in an inherited manner*.


Best regards,
Etsuro Fujita





Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-09-02 Thread Amit Langote

Peter,

On 2015-08-11 AM 07:37, Peter Geoghegan wrote:
> What I'm going to do is roll this into my own pending patch to fix the
> issue with wholerow vars, which is also down to a problem with the
> excluded targetlist initially generated by calling expandRelAttrs().
> Andres might want to take an alternative approach with that, so a
> consolidation makes sense.

Did you get around to making a patch for this?

Thanks,
Amit





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Albe Laurenz
Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>> Amit Langote wrote:
 On 2015-09-02 PM 03:25, Amit Kapila wrote:
> Will it handle deadlocks across different table partitions. Consider
> a case as below:
>
> T1
> 1. Updates row R1 of T1 on shard S1
> 2. Updates row R2 of T2 on shard S2
>
> T2
> 1. Updates row R2 of T2 on shard S2
> 2. Updates row R1 of T1 on shard S1
>>>
 As long as shards are processed in the same order in different
 transactions, ISTM, this issue should not arise? I can imagine it becoming
 a concern if parallel shard processing enters the scene. Am I missing
 something?
>>>
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same 
>>> order.
>>>
>>> So T1 and T2 could deadlock.
> 
>> Sorry, I failed to see why that would be the case. Could you elaborate?
> 
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*.  If that's
> right, T1 and T2 could deadlock, but I think we assume here to run
> transactions over shards *in an inherited manner*.

Yes, but does every update affect all shards?

If I say "UPDATE t1 SET col = 1 WHERE id = 42" and the row with id 42
happens to be on shard S1, the update would only affect that shard, right?

Now if "UPDATE t2 SET col = 1 WHERE id = 42" would only take place on
shard S2, and two transactions issue both updates in different order,
one transaction would be waiting for a lock on shard S1, while the other
would be waiting for a lock on shard S2, right?

But maybe I'm missing something fundamental.
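
(In concrete terms, the interleaving I have in mind is something like the
sketch below, assuming the id = 42 rows of t1 and t2 live on different shards:)

-- in session 1
BEGIN;
UPDATE t1 SET col = 1 WHERE id = 42;   -- takes a row lock on shard S1

-- in session 2
BEGIN;
UPDATE t2 SET col = 1 WHERE id = 42;   -- takes a row lock on shard S2

-- back in session 1: blocks waiting for the row lock held on shard S2
UPDATE t2 SET col = 1 WHERE id = 42;

-- back in session 2: blocks waiting for the row lock held on shard S1
UPDATE t1 SET col = 1 WHERE id = 42;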

Yours,
Laurenz Albe



Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-02 Thread Shulgin, Oleksandr
On Tue, Sep 1, 2015 at 8:12 PM, Andres Freund  wrote:

> On 2015-09-01 14:07:19 -0400, Robert Haas wrote:
> > But I think it's quite wrong to assume that the infrastructure for
> > this is available and usable everywhere, because in my experience,
> > that's far from the case.
>
> Especially when the alternative is a rather short patch implementing an
> otherwise widely available feature.
>

But that won't actually help in the case described by Robert: if the master
server A failed, the client has no idea if B or C would become the new
master.

Unless it actually tries to connect to them in turn and checks the result
of pg_is_in_recovery().  I think that brings enough complexity to justify keeping
this outside of libpq.  Also think about all the extra flexibility people
will likely want to have: number of retries, delay between retries, delay
backoff, etc., to the point where we'd have to support some sort of client-code
retry_policy_callback.

For read-only clients you might want to include a number of slave
hostnames, and let the connector choose one, but then again you can't
achieve load-balancing on the client side, you're better off using
round-robin DNS.  To add or remove a slave you only need to update DNS, and
not configuration on all the clients.

For the master failover I think a common technique is to just move the
floating IP address from the old master to the new one.  This doesn't
require touching the DNS record.

--
Alex


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Shulgin, Oleksandr
On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule 
wrote:

>
>> But do we really need the slots mechanism?  Would it not be OK to just
>> let the LWLock do the sequencing of concurrent requests?  Given that we
>> only going to use one message queue per cluster, there's not much
>> concurrency you can gain by introducing slots I believe.
>>
>
> I afraid of problems on production. When you have a queue related to any
> process, then all problems should be off after end of processes. One
> message queue per cluster needs restart cluster when some pathological
> problems are - and you cannot restart cluster in production week, sometimes
> weeks. The slots are more robust.
>

Yes, but in your implementation the slots themselves don't have a
queue/buffer.  Did you intend to have a message queue per slot?

What sort of pathological problems are you concerned of?  The communicating
backends should just detach from the message queue properly and have some
timeout configured to prevent deadlocks.  Other than that, I don't see how
having N slots really help the problem: in case of pathological problems
you will just deplete them all sooner or later.

--
Alex


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Pavel Stehule
2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr 
:

> On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule 
> wrote:
>
>>
>>> But do we really need the slots mechanism?  Would it not be OK to just
>>> let the LWLock do the sequencing of concurrent requests?  Given that we
>>> only going to use one message queue per cluster, there's not much
>>> concurrency you can gain by introducing slots I believe.
>>>
>>
>> I afraid of problems on production. When you have a queue related to any
>> process, then all problems should be off after end of processes. One
>> message queue per cluster needs restart cluster when some pathological
>> problems are - and you cannot restart cluster in production week, sometimes
>> weeks. The slots are more robust.
>>
>
> Yes, but in your implementation the slots themselves don't have a
> queue/buffer.  Did you intend to have a message queue per slot?
>

The message queue cannot be reused, so I expect one slot per caller to be
used passing parameters, - message queue will be created/released by demand
by caller.


>
> What sort of pathological problems are you concerned of?  The
> communicating backends should just detach from the message queue properly
> and have some timeout configured to prevent deadlocks.  Other than that, I
> don't see how having N slots really help the problem: in case of
> pathological problems you will just deplete them all sooner or later.
>

I afraid of unexpected problems :) - any part of signal handling or
multiprocess communication is fragile. Slots are simple and simply attached
to any process without necessity to alloc/free some memory.

Pavel


>
> --
> Alex
>
>


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Langote
On 2015-09-02 PM 05:07, Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>>
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same
>>> order.
>>>
>>> So T1 and T2 could deadlock.
> 
>> Sorry, I failed to see why that would be the case. Could you elaborate?
> 
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*.  If that's right,
> T1 and T2 could deadlock, but I think we assume here to run transactions
> over shards *in an inherited manner*.
> 

I think Albe may have a point here...

Even inherited updates case appears to cause a deadlock if they are in
different queries. Demonstrated below:

-- setup
CREATE TABLE t(a int);
CREATE TABLE t1() INHERITS(t);
CREATE TABLE t2() INHERITS(t);

INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);

-- in session 1
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 1;


-- in session 2
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 2;


-- back in session 1
UPDATE t SET a = a + 1 WHERE a = 2;


-- back in session 2
UPDATE t SET a = a + 1 WHERE a = 1;



Thanks,
Amit





Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-02 Thread Nikolay Shaplov
In a message of 3 August 2015 15:35:23, Alvaro Herrera wrote:
> Nikolay Shaplov wrote:
> > This patch adds several new functions, available from SQL queries. All
> > these functions are based on heap_page_items, but accept slightly
> > different arguments and has one additional column at the result set:
> > 
> > heap_page_tuples - accepts relation name, and bulkno, and returns usual
> > heap_page_items set with additional column that contain snapshot of tuple
> > data area stored in bytea.
> 
> I think the API around get_raw_page is more useful generally.  You might
> have table blocks stored in a bytea column for instance, because you
> extracted from some remote server and inserted into a local table for
> examination.  If you only accept relname/blkno, there's no way to
> examine data other than blocks directly in the database dir, which is
> limiting.
> 
> > There is also one strange function: _heap_page_items it is useless for
> > practical purposes. As heap_page_items it accepts page data as bytea, but
> > also get a relation name. It tries to apply tuple descriptor of that
> > relation to that page data.
> 
> For BRIN, I added something similar (brin_page_items), but it receives
> the index OID rather than name, and constructs a tuple descriptor to
> read the data.  I think OID is better than name generally.  (You can
> cast the relation name to regclass).
> 
> It's easy to misuse, but these functions are superuser-only, so there
> should be no security issue at least.  The possibility of a crash
> remains real, though, so maybe we should try to come up with a way to
> harden that.

I've done as you've said: the patch now adds two functions, heap_page_item_attrs 
and heap_page_item_detoast_attrs, and these functions take a raw page and a 
relation OID as arguments. They also have a third optional parameter that 
allows changing the error mode from ERROR to WARNING.
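
Intended usage is roughly as follows ('mytable' being an arbitrary example
relation):

-- pass a raw page from get_raw_page() plus the relation's OID
SELECT * FROM heap_page_item_attrs(get_raw_page('mytable', 0), 'mytable'::regclass);

-- optional third argument: report problems as WARNING instead of ERROR
SELECT * FROM heap_page_item_detoast_attrs(get_raw_page('mytable', 0), 'mytable'::regclass, true);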

Is it ok now?


-- 
Nikolay Shaplov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index aec5258..e4bc1af 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -5,7 +5,7 @@ OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
 		  brinfuncs.o ginfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA = pageinspect--1.3.sql pageinspect--1.2--1.3.sql \
+DATA = pageinspect--1.4.sql pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \
 	pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \
 	pageinspect--unpackaged--1.0.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 8d1666c..59eae0f 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -29,7 +29,12 @@
 #include "funcapi.h"
 #include "utils/builtins.h"
 #include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/rel.h"
+#include "catalog/namespace.h"
+#include "catalog/pg_type.h"
 
+#include "rawpage.h"
 
 /*
  * bits_to_text
@@ -53,6 +58,9 @@ bits_to_text(bits8 *bits, int len)
 	return str;
 }
 
+Datum heap_page_items_internal(FunctionCallInfo fcinfo, bytea *raw_page, Oid rel_oid, int error_level, bool do_detoast);
+void fill_header_data(FuncCallContext *fctx, Datum *values, bool *nulls, bool do_detoast);
+void split_tuple_data(FuncCallContext *fctx, HeapTupleHeader tuphdr, Datum *values, bool *nulls, bool do_detoast);
 
 /*
  * heap_page_items
@@ -66,38 +74,80 @@ typedef struct heap_page_items_state
 	TupleDesc	tupd;
 	Page		page;
 	uint16		offset;
+	TupleDesc	page_tuple_desc;
+	int		raw_page_size;
+	int		error_level;
 } heap_page_items_state;
 
 Datum
 heap_page_items(PG_FUNCTION_ARGS)
 {
 	bytea	   *raw_page = PG_GETARG_BYTEA_P(0);
+	return heap_page_items_internal(fcinfo, raw_page, InvalidOid, ERROR, false);
+}
+
+PG_FUNCTION_INFO_V1(heap_page_item_attrs);
+Datum
+heap_page_item_attrs(PG_FUNCTION_ARGS)
+{
+	bytea		*raw_page = PG_GETARG_BYTEA_P(0);
+	Oid		rel_oid = PG_GETARG_OID(1);
+	int		error_mode = PG_NARGS()==3 ? PG_GETARG_BOOL(2) :NULL;
+	error_mode = error_mode?WARNING:ERROR;
+	if (SRF_IS_FIRSTCALL() && (error_mode == WARNING))
+	{
+		ereport(WARNING,
+			(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+(errmsg("Runnung heap_page_item_attrs in WARNING mode.";
+	}
+
+	return heap_page_items_internal(fcinfo, raw_page, rel_oid, error_mode, false);
+}
+
+PG_FUNCTION_INFO_V1(heap_page_item_detoast_attrs);
+Datum
+heap_page_item_detoast_attrs(PG_FUNCTION_ARGS)
+{
+	bytea		*raw_page = PG_GETARG_BYTEA_P(0);
+	Oid		rel_oid = PG_GETARG_OID(1);
+	int		error_mode = PG_NARGS()==3 ? PG_GETARG_BOOL(2) :NULL;
+	error_mode = error_mode?WARNING:ERROR;
+	if (SRF_IS_FIRSTCALL() && (error_mode == WARNING))
+	{
+		ereport(WARNING,
+			(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+(errmsg("Runnung heap_page_item_detoast_attrs in WARNING mode.";
+	}
+	return heap_page_items_internal(fcinfo, raw_page, rel_oid, error_mode, t

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Ashutosh Bapat
On Wed, Sep 2, 2015 at 12:49 AM, Josh Berkus  wrote:

> On 09/01/2015 11:36 AM, Tomas Vondra wrote:
> >> We want multiple copies of shards created by the sharding system itself.
> >>   Having a separate, and completely orthagonal, redundancy system to the
> >> sharding system is overly burdensome on the DBA and makes low-data-loss
> >> HA impossible.
> >
> > IMHO it'd be quite unfortunate if the design would make it impossible to
> > combine those two features (e.g. creating standbys for shards and
> > failing over to them).
> >
> > It's true that solving HA at the sharding level (by keeping multiple
> > copies of a each shard) may be simpler than combining sharding and
> > standbys, but I don't see why it makes low-data-loss HA impossible.
>
> Other way around, that is, having replication standbys as the only
> method of redundancy requires either high data loss or high latency for
> all writes.
>
> In the case of async rep, every time we fail over a node, the entire
> cluser would need to roll back to the last common known-good replay
> point, hence high data loss.
>
> In the case of sync rep, we are required to wait for at least double
> network lag time in order to do a single write ... making
> write-scalability quite difficult.
>
> Futher, if using replication the sharding system would have no way to
> (a) find out immediately if a copy was bad and (b) fail over quickly to
> a copy of the shard if the first requested copy was not responding.
> With async replication, we also can't use multiple copies of the same
> shard as a way to balance read workloads.
>
> If we write to multiple copies as a part of the sharding feature, then
> that can be parallelized, so that we are waiting only as long as the
> slowest write (or in failure cases, as long as the shard timeout).
> Further, we can check for shard-copy health and update shard
> availability data with each user request, so that the ability to see
> stale/bad data is minimized.
>

XC (and I guess XL and pgPool II as well) did this by firing the same DML
statement to all the copies after resolving any volatile references (e.g.
now()) in the DML, so that all the copies get the same values. That method,
however, needed a row identifier which can identify the same row on all the
replicas. The primary key is usually used as the row identifier, but not all
use cases that require replicated shards have a primary key in their
sharded tables.
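
Roughly speaking (with made-up table and column names):

-- statement as issued by the client
UPDATE orders SET updated_at = now() WHERE order_id = 42;

-- statement as fired to every copy of the shard: now() has been resolved
-- once by the coordinator, and the row is addressed by its identifier
-- (here the primary key) so all replicas change the same row to the same values
UPDATE orders SET updated_at = '2015-09-02 12:00:00+00' WHERE order_id = 42;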


>
> There are obvious problems with multiplexing writes, which you can
> figure out if you knock pg_shard around a bit.  But I really think that
> solving those problems is the only way to go.
>
> Mind you, I see a strong place for binary replication and BDR for
> multi-region redundancy; you really don't want that to be part of the
> sharding system if you're aiming for write scalability.
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
>
>
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Langote
On 2015-09-02 PM 06:41, Amit Langote wrote:
> 
> I think Albe may have a point here...
> 
> Even inherited updates case appears to cause a deadlock if they are in
> different queries. Demonstrated below:
> 
> -- setup
> CREATE TABLE t(a int);
> CREATE TABLE t1() INHERITS(t);
> CREATE TABLE t2() INHERITS(t);
> 
> INSERT INTO t1 VALUES (1);
> INSERT INTO t2 VALUES (2);
> 
> -- in session 1
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 1;
> 
> 
> -- in session 2
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 2;
> 
> 
> -- back in session 1
> UPDATE t SET a = a + 1 WHERE a = 2;
> 
> 
> -- back in session 2
> UPDATE t SET a = a + 1 WHERE a = 1;
> 
> 

Which, I now realize, is not the worry Amit Kapila expresses.

The deadlock was *indeed detected* in this case, with all the locks in the
same PG instance. In a sharded environment with multiple PG instances,
that becomes tricky. DLM (distributed lock manager/deadlock detector)
seems indeed necessary as Amit K. suspects.

Thanks,
Amit





Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Shulgin, Oleksandr
On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule 
wrote:

>
>
> 2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <
> oleksandr.shul...@zalando.de>:
>
>> On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule 
>> wrote:
>>
>>>
 But do we really need the slots mechanism?  Would it not be OK to just
 let the LWLock do the sequencing of concurrent requests?  Given that we
 only going to use one message queue per cluster, there's not much
 concurrency you can gain by introducing slots I believe.

>>>
>>> I afraid of problems on production. When you have a queue related to any
>>> process, then all problems should be off after end of processes. One
>>> message queue per cluster needs restart cluster when some pathological
>>> problems are - and you cannot restart cluster in production week, sometimes
>>> weeks. The slots are more robust.
>>>
>>
>> Yes, but in your implementation the slots themselves don't have a
>> queue/buffer.  Did you intend to have a message queue per slot?
>>
>
> The message queue cannot be reused, so I expect one slot per caller to be
> used passing parameters, - message queue will be created/released by demand
> by caller.
>

I don't believe a message queue cannot really be reused.  What would stop
us from calling shm_mq_create() on the queue struct again?

To give you an idea, in my current prototype I have only the following
struct:

typedef struct {
LWLock   *lock;
/*CmdStatusInfoSlot slots[CMDINFO_SLOTS];*/
pid_t target_pid;
pid_t sender_pid;
int request_type;
int result_code;
shm_mq buffer;
} CmdStatusInfo;

An instance of this is allocated on shared memory once, using BUFFER_SIZE
of 8k.

In pg_cmdstatus() I lock on the LWLock to check if target_pid is 0, then it
means nobody else is using this communication channel at the moment.  If
that's the case, I set the pids and request_type and initialize the mq
buffer.  Otherwise I just sleep and retry acquiring the lock (a timeout
should be added here probably).

What sort of pathological problems are you concerned of?  The communicating
>> backends should just detach from the message queue properly and have some
>> timeout configured to prevent deadlocks.  Other than that, I don't see how
>> having N slots really help the problem: in case of pathological problems
>> you will just deplete them all sooner or later.
>>
>
> I afraid of unexpected problems :) - any part of signal handling or
> multiprocess communication is fragile. Slots are simple and simply attached
> to any process without necessity to alloc/free some memory.
>

Yes, but do slots solve the actual problem?  If there is only one message
queue, you still have the same problem regardless of the number of slots
you decide to have.

--
Alex


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2015-09-02 Thread Fujii Masao
On Tue, Aug 11, 2015 at 2:58 AM, Jeff Janes  wrote:
> On Thu, Oct 30, 2014 at 5:30 AM, Fujii Masao  wrote:
>>
>> On Thu, Oct 30, 2014 at 7:30 PM, Etsuro Fujita
>>  wrote:
>>
>> > +   {
>> > +   {"pending_list_cleanup_size", PGC_USERSET,
>> > CLIENT_CONN_STATEMENT,
>> > +   gettext_noop("Sets the maximum size of the
>> > pending
>> > list for GIN index."),
>> > +NULL,
>> > +   GUC_UNIT_KB
>> > +   },
>> > +   &pending_list_cleanup_size,
>> > +   4096, 0, MAX_KILOBYTES,
>> > +   NULL, NULL, NULL
>> > +   },
>> >
>> > ISTM it'd be better to use RESOURCES_MEM, not CLIENT_CONN_STATEMENT. No?
>>
>> Yes if the pending list always exists in the memory. But not, IIUC.
>> Thought?
>>
>> > Also why not set min to 64, not to 0, in accoradance with that of
>> > work_mem?
>>
>> I'm OK to use 64. But I just chose 0 because I could not think of any
>> reasonable
>> reason why 64k is suitable as the minimum size of the pending list.
>> IOW, I have no idea about whether it's reasonable to use the min value of
>> work_mem as the min size of the pending list.
>
>
>
> I know I am late to the party here, but would like to have the minimum be 0,
> not 64.  As long as by zero, it means it doesn't insert anything into the
> pending list, rather than inserting and promptly cleaning it up.
>
> The reason for this is that if I am trying to decide what
> pending_list_cleanup_size I want to set for the index in the indexes storage
> parameters, the way to find that out is try a bunch of different ones
> through the guc while the index is still at the default of no overriding
> storage parameter.  It would be nice to try the fastupdate=off alternative
> (i.e. the same as pending_list_cleanup_size=0) without having to change the
> index itself and change the syntax used in the testing.

Sounds OK to me. So we should change the minimum values of both the
gin_pending_list_limit GUC and the storage parameter to 0? Or only the GUC?
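
For reference, the two ways of setting it being compared here look like this
(the index name is just an example):

-- session-level GUC, convenient for the kind of experimentation Jeff describes
SET gin_pending_list_limit = '4MB';

-- per-index storage parameter, and the fastupdate switch that a value of 0
-- would effectively emulate
ALTER INDEX my_gin_idx SET (gin_pending_list_limit = 1024);
ALTER INDEX my_gin_idx SET (fastupdate = off);

-- the proposal above: also allow 0, meaning "insert nothing into the pending list"
-- SET gin_pending_list_limit = 0;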

Regards,

-- 
Fujii Masao




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Pavan Deolasee
On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote 
wrote:

> On 2015-09-02 PM 06:41, Amit Langote wrote:
> >
> > I think Albe may have a point here...
> >
> > Even inherited updates case appears to cause a deadlock if they are in
> > different queries. Demonstrated below:
> >
> > -- setup
> > CREATE TABLE t(a int);
> > CREATE TABLE t1() INHERITS(t);
> > CREATE TABLE t2() INHERITS(t);
> >
> > INSERT INTO t1 VALUES (1);
> > INSERT INTO t2 VALUES (2);
> >
> > -- in session 1
> > BEGIN;
> > UPDATE t SET a = a + 1 WHERE a = 1;
> > 
> >
> > -- in session 2
> > BEGIN;
> > UPDATE t SET a = a + 1 WHERE a = 2;
> > 
> >
> > -- back in session 1
> > UPDATE t SET a = a + 1 WHERE a = 2;
> > 
> >
> > -- back in session 2
> > UPDATE t SET a = a + 1 WHERE a = 1;
> > 
> >
>
> Which, I now realize, is not the worry Amit Kapila's expresses.
>
> The deadlock was *indeed detected* in this case, with all the locks in the
> same PG instance. In a sharded environment with multiple PG instances,
> that becomes tricky. DLM (distributed lock manager/deadlock detector)
> seems indeed necessary as Amit K. suspects.
>
>
Right. XC/XL did not address this issue and they rely on statement timeouts
to break distributed deadlocks.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM versus GIN pending list bloat

2015-09-02 Thread Fujii Masao
On Wed, Aug 5, 2015 at 5:50 AM, Simon Riggs  wrote:
> On 4 August 2015 at 21:04, Jeff Janes  wrote:
>
>>>
>>> Couple of questions here...
>>>
>>> * the docs say "it's desirable to have pending-list cleanup occur in the
>>> background", but there is no way to invoke that, except via VACUUM. I think
>>> we need a separate function to be able to call this as a background action.
>>> If we had that, we wouldn't need much else, would we?
>>
>>
>> I thought maybe the new bgworker framework would be a way to have a
>> backend signal a bgworker to do the cleanup when it notices the pending list
>> is getting large.  But that wouldn't directly fix this issue, because the
>> bgworker still wouldn't recycle that space (without further changes), only
>> vacuum workers do that currently.
>>
>> But I don't think this could be implemented as an extension, because the
>> signalling code has to be in core, so (not having studied the matter at all)
>> I don't know if it is good fit for bgworker.
>
>
> We need to expose 2 functions:
>
> 1. a function to perform the recycling directly (BRIN has an equivalent
> function)
>
> 2. a function to see how big the pending list is for a particular index,
> i.e. do we need to run function 1?

Probably you can use pgstatginindex(), which the pgstattuple contrib module
provides, for this case.
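
For example (the index name is illustrative; the pgstattuple extension must be
installed):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- current size of a GIN index's pending list
SELECT version, pending_pages, pending_tuples
  FROM pgstatginindex('my_gin_idx');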

Regards,

-- 
Fujii Masao




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Kapila
On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee 
wrote:
> On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:
>>
>> On 2015-09-02 PM 06:41, Amit Langote wrote:
>>
>> Which, I now realize, is not the worry Amit Kapila's expresses.
>>
>> The deadlock was *indeed detected* in this case, with all the locks in
the
>> same PG instance. In a sharded environment with multiple PG instances,
>> that becomes tricky. DLM (distributed lock manager/deadlock detector)
>> seems indeed necessary as Amit K. suspects.
>>
>
> Right. XC/XL did not address this issue and they rely on statement
timeouts to break distributed deadlocks.
>

I think that will be difficult for application to decide and then it
needs to decide the same for all statements which is tricky because
different statements could take different time.  I think it is better to
have solution for this problem and deadlock should be detected.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] about fsync in CLOG buffer write

2015-09-02 Thread 张广舟(明虚)

Hi Hackers,

We host many PG instances on each of our machines, and we want to minimize
the calls to fsync within each instance, in order to minimize possible
impact on other instances. We found there is a fsync call when CLOG buffer
is written out in SlruPhysicalWritePage(). It is often called when a backend
needs to check transaction status with SimpleLruReadPage().


The code is as follows (in src/backend/access/transam/slru.c line 837 in PG
9.4):
/*
 * If not part of Flush, need to fsync now.  We assume this happens
 * infrequently enough that it's not a performance issue.
 */
if (!fdata)
{
if (ctl->do_fsync && pg_fsync(fd))
{
slru_errcause = SLRU_FSYNC_FAILED;
slru_errno = errno;
CloseTransientFile(fd);
return false;
}

if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
slru_errno = errno;
return false;
}

ctl->do_fsync is true for CLOG.  The question is, could we just disable fsync
for the CLOG buffer write-out here? Is it safe to do so? I understand a
checkpoint will call SimpleLruFlush to flush all CLOG buffers at once, and
the fsync call here (for the buffer write-out) is not necessary.

Thanks for your time!
Guangzhou




Re: [HACKERS] synchronous_commit = apply

2015-09-02 Thread Fujii Masao
On Wed, Sep 2, 2015 at 10:25 AM, Thomas Munro
 wrote:
> Hi
>
> Do you think it's reasonable to want to COMMIT a particular transaction on a
> master node, and then immediately run a read-only query on a hot standby
> node that is guaranteed to see that transaction?
>
> A friend of mine who works with a different RDBMS technology that can do
> that asked me how to achieve this with Postgres, and I suggested waiting for
> the standby's pg_last_xlog_replay_location() to be >= the master's
> pg_current_xlog_location() after COMMIT, which might involve some looping
> and sleeping.
>
> As a quick weekend learning exercise/hack I recently went looking into how
> we could support $SUBJECT.  I discovered we already report the apply
> progress back to the master, and the synchronous waiting facility seemed to
> be all ready to support this.  In fact it seemed a little too easy so
> something tells me it must be wrong!  But anyway, please see the attached
> toy POC patch which does that.
>
> The next problem is that the master can be waiting quite a long time for a
> reply from the remote walreceiver containing the desired apply LSN: in the
> best case it learns of apply progress from replies to subsequent unrelated
> records (which might be very soon on a busy system but still involves
> waiting for the next transaction's WAL flush), and in the worst case it
> needs to wait for wal_receiver_status_interval (10 seconds by default),
> which makes for a long COMMIT delay.  I was thinking that the solution to
> that may be to teach StartupXLOG to signal the walreceiver after it updates
> XLogCtl->lastReplayedEndRecPtr, which should cause walrcv_receive to be
> interrupted and return early, and then walreceiver could send a reply if it
> sees that lastReplayedEndRecPtr has moved.  Maybe that would generate an
> unacceptably high frequency of signals

One idea is to change the standby so that it manages the locations
that the backends in "apply" mode on the master are waiting for,
and to make the startup process wake the walreceiver up whenever
the replay location reaches any of those locations. In this idea,
the walreceiver sends back the "apply" location to the master only when
needed.
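
(For comparison, the client-side polling workaround quoted above would look
roughly like the sketch below, run on the standby after capturing
pg_current_xlog_location() on the master right after COMMIT; the LSN literal is
only an example.)

DO $$
DECLARE
    target_lsn pg_lsn := '0/3000060';   -- value captured on the master
BEGIN
    WHILE pg_last_xlog_replay_location() < target_lsn LOOP
        PERFORM pg_sleep(0.01);
    END LOOP;
END;
$$;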

Regards,

-- 
Fujii Masao




Re: [HACKERS] about fsync in CLOG buffer write

2015-09-02 Thread Andres Freund
On 2015-09-10 19:39:59 +0800, 张广舟(明虚) wrote:
> We found there is a fsync call when CLOG buffer
> is written out in SlruPhysicalWritePage(). It is often called when a backend
> needs to check transaction status with SimpleLruReadPage().

That's when there are not enough buffers available and some other buffer, in
your case a dirty one, needs to be written out.

You could try increasing the max (32) in CLOGShmemBuffers() further.

> ctl->do_fsync is true for CLOG.  The question is, could we just disable fsync
> for the CLOG buffer write-out here? Is it safe to do so? I understand a
> checkpoint will call SimpleLruFlush to flush all CLOG buffers at once, and
> the fsync call here (for the buffer write-out) is not necessary.

No, that'd not be safe. The reason we fsync in SlruPhysicalWritePage()
is that the flush during checkpoint won't necessarily touch those files
at all (as they've been replaced with other buffers in memory).

This could be optimized though - it should be possible to delay the
fsync for slru files that have been evicted from memory till
checkpoint. Using something like ForwardFsyncRequest() except that it
obviously has to be usable for other files than normal relfilenodes.

Greetings,

Andres Freund




[HACKERS] commitfest does not see my real latest patch

2015-09-02 Thread Nikolay Shaplov
Hi!

I've added a patch to commitfest

https://commitfest.postgresql.org/7/363/


but in my mail thread I've committed two version of patch. Old one 
(pageinspect.diff) and a new one (pageinspect_show_tuple_data.diff)

and commitfest software for some reason sees only old patch, not new. I think 
that this may be a bug. 

And also I'd like reviewers to see proper patch as "Latest patch", if someone 
can manually fix that, it would be great!

-- 
Nikolay Shaplov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] synchronous_commit = apply

2015-09-02 Thread Robert Haas
On Tue, Sep 1, 2015 at 9:25 PM, Thomas Munro
 wrote:
> The next problem is that the master can be waiting quite a long time for a
> reply from the remote walreceiver containing the desired apply LSN: in the
> best case it learns of apply progress from replies to subsequent unrelated
> records (which might be very soon on a busy system but still involves
> waiting for the next transaction's WAL flush), and in the worst case it
> needs to wait for wal_receiver_status_interval (10 seconds by default),
> which makes for a long COMMIT delay.  I was thinking that the solution to
> that may be to teach StartupXLOG to signal the walreceiver after it updates
> XLogCtl->lastReplayedEndRecPtr, which should cause walrcv_receive to be
> interrupted and return early, and then walreceiver could send a reply if it
> sees that lastReplayedEndRecPtr has moved.  Maybe that would generate an
> unacceptably high frequency of signals, and maybe there is a better form of
> IPC for this.

Yeah, that could be a problem, as could reply volume. If you've got a
bunch of heap inserts of narrow rows into some table, you don't really
want to send a reply after each one.  That would be a lot of replies,
and nobody can really care about them anyway, at least not for
synchronous_commit purposes.  But what if you only sent a signal when
the just-replayed record was a COMMIT record?  I suppose that could
still be a lot of replies on something like a full-tilt pgbench
workload, but even in that case it would help a lot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] commitfest does not see my real latest patch

2015-09-02 Thread Magnus Hagander
On Wed, Sep 2, 2015 at 2:34 PM, Nikolay Shaplov 
wrote:

> Hi!
>
> I've added a patch to commitfest
>
> https://commitfest.postgresql.org/7/363/
>
>
> but in my mail thread I've committed two version of patch. Old one
> (pageinspect.diff) and a new one (pageinspect_show_tuple_data.diff)
>
> and commitfest software for some reason sees only old patch, not new. I
> think
> that this may be a bug.
>
> And also I'd like reviewers to see proper patch as "Latest patch", if
> someone
> can manually fix that, it would be great!
>


Hi!

There have been network issues at the service provider hosting the
commitfest app today, which means that a number of runs where it has tried
to pull down the updated mailthreads from the archives have failed. My
guess is that it will pick up the thread once those network issues have
been fixed (people are working on it right now). So please give it some
time (the job runs every couple of hours), and with luck it will sort
itself out before too long.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Pavel Stehule
2015-09-02 12:36 GMT+02:00 Shulgin, Oleksandr 
:

> On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule 
> wrote:
>
>>
>>
>> 2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <
>> oleksandr.shul...@zalando.de>:
>>
>>> On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule 
>>> wrote:
>>>

> But do we really need the slots mechanism?  Would it not be OK to just
> let the LWLock do the sequencing of concurrent requests?  Given that we
> only going to use one message queue per cluster, there's not much
> concurrency you can gain by introducing slots I believe.
>

 I afraid of problems on production. When you have a queue related to
 any process, then all problems should be off after end of processes. One
 message queue per cluster needs restart cluster when some pathological
 problems are - and you cannot restart cluster in production week, sometimes
 weeks. The slots are more robust.

>>>
>>> Yes, but in your implementation the slots themselves don't have a
>>> queue/buffer.  Did you intend to have a message queue per slot?
>>>
>>
>> The message queue cannot be reused, so I expect one slot per caller to be
>> used passing parameters, - message queue will be created/released by demand
>> by caller.
>>
>
> I don't believe a message queue cannot really be reused.  What would stop
> us from calling shm_mq_create() on the queue struct again?
>

you cannot to change recipient later


>
> To give you an idea, in my current prototype I have only the following
> struct:
>
> typedef struct {
> LWLock   *lock;
> /*CmdStatusInfoSlot slots[CMDINFO_SLOTS];*/
> pid_t target_pid;
> pid_t sender_pid;
> int request_type;
> int result_code;
> shm_mq buffer;
> } CmdStatusInfo;
>
> An instance of this is allocated on shared memory once, using BUFFER_SIZE
> of 8k.
>
> In pg_cmdstatus() I lock on the LWLock to check if target_pid is 0, then
> it means nobody else is using this communication channel at the moment.  If
> that's the case, I set the pids and request_type and initialize the mq
> buffer.  Otherwise I just sleep and retry acquiring the lock (a timeout
> should be added here probably).
>
> What sort of pathological problems are you concerned of?  The
>>> communicating backends should just detach from the message queue properly
>>> and have some timeout configured to prevent deadlocks.  Other than that, I
>>> don't see how having N slots really help the problem: in case of
>>> pathological problems you will just deplete them all sooner or later.
>>>
>>
>> I afraid of unexpected problems :) - any part of signal handling or
>> multiprocess communication is fragile. Slots are simple and simply attached
>> to any process without necessity to alloc/free some memory.
>>
>
> Yes, but do slots solve the actual problem?  If there is only one message
> queue, you still have the same problem regardless of the number of slots
> you decide to have.
>
> --
> Alex
>
>


Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Etsuro Fujita
On 2015/08/01 23:25, Tom Lane wrote:
> Robert Haas  writes:
>> The problem that was bothering us (or at least what was bothering me)
>> is that the PlannerInfo provides only a list of SpecialJoinInfo
>> structures, which don't directly give you the original join order.  In
>> fact, min_righthand and min_lefthand are intended to constraint the
>> *possible* join orders, and are deliberately designed *not* to specify
>> a single join order.  If you're sending a query to a remote PostgreSQL
>> node, you don't want to know what all the possible join orders are;
>> it's the remote side's job to plan the query.  You do, however, need
>> an easy way to identify one join order that you can use to construct a
>> query.  It didn't seem easy to do that without duplicating
>> make_join_rel(), which seemed like a bad idea.

> In principle it seems like you could traverse root->parse->jointree
> as a guide to reconstructing the original syntactic structure; though
> I'm not sure how hard it would be to ignore the parts of that tree
> that correspond to relations you're not shipping.

I'll investigate this.

>> But maybe there's a good way to do it.  Tom wasn't crazy about this
>> hook both because of the frequency of calls and also because of the
>> long argument list.  I think those concerns are legitimate; I just
>> couldn't see how to make the other way work.

> In my vision you probably really only want one call per build_join_rel
> event (that is, per construction of a new RelOptInfo), not per
> make_join_rel event.
> 
> It's possible that an FDW that wants to handle joins but is not talking to
> a remote query planner would need to grovel through all the join ordering
> possibilities individually, and then maybe hooking at make_join_rel is
> sensible rather than having to reinvent that logic.  But I'd want to see a
> concrete use-case first, and I certainly don't think that that's the main
> case to design the API around.

I'd vote for hooking at standard_join_search.  Here is a use-case:

* When the callback routine is hooked at that function (right after
allpaths.c:1817), an FDW would collect lists of all the available
local-join-path orderings and parameterizations by looking at each path
in rel->pathlist (if the join rel only contains foreign tables that all
belong to the same foreign server).

* Then the FDW would use these as a heuristic to indicate which sort
orderings and parameterizations we should build foreign-join paths for.
 (These would also be used as alternative paths for EvalPlanQual
handling, as discussed upthread.)  It seems reasonable to me to consider
pushed-down versions of these paths as first candidates, but
foreign-join paths to build are not limited to such ones.  The FDW is
allowed to consider any foreign-join paths as long as their alternative
paths are provided.

IMO one thing to consider for the postgres_fdw case would be the
use_remote_estimate option.  In the case when the option is true, I
think we should perform remote EXPLAINs for pushed-down-join queries to
obtain cost estimates.  But it would require too much time to do that
for each of the possible join rel.  So, I think it would be better to
put off the callback routine's work as long as possible.  I think that
that could probably be done by looking at rel->joininfo,
root->join_info_list and/or something like that.  (When considering a
join rel A JOIN B both on the same foreign server, for example, we can
skip the routine's work if the join rel proved to be joined with C on
the same foreign server by looking at rel->joininfo, for example.)
Maybe I'm missing something, though.
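
(For reference, use_remote_estimate is the ordinary postgres_fdw option and can
be set per server or per foreign table; server and table names below are
illustrative.)

ALTER SERVER remote_srv OPTIONS (ADD use_remote_estimate 'true');
ALTER FOREIGN TABLE remote_tab OPTIONS (ADD use_remote_estimate 'true');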

Best regards,
Etsuro Fujita





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 20:42, Amit Kapila wrote:

On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee <pavan.deola...@gmail.com> wrote:
 > On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote
 > <langote_amit...@lab.ntt.co.jp> wrote:
 >> On 2015-09-02 PM 06:41, Amit Langote wrote:
 >> Which, I now realize, is not the worry Amit Kapila's expresses.



 >> The deadlock was *indeed detected* in this case, with all the locks
in the
 >> same PG instance. In a sharded environment with multiple PG instances,
 >> that becomes tricky. DLM (distributed lock manager/deadlock detector)
 >> seems indeed necessary as Amit K. suspects.


Ah, you are right.


 > Right. XC/XL did not address this issue and they rely on statement
timeouts to break distributed deadlocks.



I think that will be difficult for application to decide and then it
needs to decide the same for all statements which is tricky because
different statements could take different time.  I think it is better to
have solution for this problem and deadlock should be detected.


+1

Best regards,
Etsuro Fujita





Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Shulgin, Oleksandr
On Wed, Sep 2, 2015 at 2:56 PM, Pavel Stehule 
wrote:

>
>
> 2015-09-02 12:36 GMT+02:00 Shulgin, Oleksandr <
> oleksandr.shul...@zalando.de>:
>
>> On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule 
>> wrote:
>>
>>>
>>>
>>> 2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <
>>> oleksandr.shul...@zalando.de>:
>>>
 On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule 
 wrote:

>
>> But do we really need the slots mechanism?  Would it not be OK to
>> just let the LWLock do the sequencing of concurrent requests?  Given that
>> we only going to use one message queue per cluster, there's not much
>> concurrency you can gain by introducing slots I believe.
>>
>
> I afraid of problems on production. When you have a queue related to
> any process, then all problems should be off after end of processes. One
> message queue per cluster needs restart cluster when some pathological
> problems are - and you cannot restart cluster in production week, 
> sometimes
> weeks. The slots are more robust.
>

 Yes, but in your implementation the slots themselves don't have a
 queue/buffer.  Did you intend to have a message queue per slot?

>>>
>>> The message queue cannot be reused, so I expect one slot per caller to
>>> be used passing parameters, - message queue will be created/released by
>>> demand by caller.
>>>
>>
>> I don't believe a message queue cannot really be reused.  What would stop
>> us from calling shm_mq_create() on the queue struct again?
>>
>
> you cannot to change recipient later
>

Well, maybe I'm missing something, but sh_mq_create() will just overwrite
the contents of the struct, so it doesn't care about sender/receiver: only
sh_mq_set_sender/receiver() do.


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Pavel Stehule
2015-09-02 15:00 GMT+02:00 Shulgin, Oleksandr 
:

> On Wed, Sep 2, 2015 at 2:56 PM, Pavel Stehule 
> wrote:
>
>>
>>
>> 2015-09-02 12:36 GMT+02:00 Shulgin, Oleksandr <
>> oleksandr.shul...@zalando.de>:
>>
>>> On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule 
>>> wrote:
>>>


 2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <
 oleksandr.shul...@zalando.de>:

> On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule  > wrote:
>
>>
>>> But do we really need the slots mechanism?  Would it not be OK to
>>> just let the LWLock do the sequencing of concurrent requests?  Given 
>>> that
>>> we only going to use one message queue per cluster, there's not much
>>> concurrency you can gain by introducing slots I believe.
>>>
>>
>> I afraid of problems on production. When you have a queue related to
>> any process, then all problems should be off after end of processes. One
>> message queue per cluster needs restart cluster when some pathological
>> problems are - and you cannot restart cluster in production week, 
>> sometimes
>> weeks. The slots are more robust.
>>
>
> Yes, but in your implementation the slots themselves don't have a
> queue/buffer.  Did you intend to have a message queue per slot?
>

 The message queue cannot be reused, so I expect one slot per caller to
 be used passing parameters, - message queue will be created/released by
 demand by caller.

>>>
>>> I don't believe a message queue cannot really be reused.  What would
>>> stop us from calling shm_mq_create() on the queue struct again?
>>>
>>
>> you cannot to change recipient later
>>
>
> Well, maybe I'm missing something, but sh_mq_create() will just overwrite
> the contents of the struct, so it doesn't care about sender/receiver: only
> sh_mq_set_sender/receiver() do.
>

if you create sh_mq from scratch, then you can reuse structure.

Pavel


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Shulgin, Oleksandr
On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule 
wrote:
>
>
>> Well, maybe I'm missing something, but sh_mq_create() will just overwrite
>> the contents of the struct, so it doesn't care about sender/receiver: only
>> sh_mq_set_sender/receiver() do.
>>
>
> if you create sh_mq from scratch, then you can reuse structure.
>

That was my plan.  I'm now close to actually compiling this stuff, we'll
see if it works in the simplest case at least. :-)


Re: [HACKERS] [POC] FETCH limited by bytes.

2015-09-02 Thread Andres Freund
On 2015-02-27 13:50:22 -0500, Corey Huinker wrote:
> +static DefElem*
> +get_option(List *options, char *optname)
> +{
> + ListCell *lc;
> +
> + foreach(lc, options)
> + {
> + DefElem *def = (DefElem *) lfirst(lc);
> +
> + if (strcmp(def->defname, optname) == 0)
> + return def;
> + }
> + return NULL;
> +}


>   /*
>* Do nothing in EXPLAIN (no ANALYZE) case.  node->fdw_state stays NULL.
> @@ -915,6 +933,23 @@ postgresBeginForeignScan(ForeignScanState *node, int 
> eflags)
>   server = GetForeignServer(table->serverid);
>   user = GetUserMapping(userid, server->serverid);
>  
> + /* Reading table options */
> + fsstate->fetch_size = -1;
> +
> + def = get_option(table->options, "fetch_size");
> + if (!def)
> + def = get_option(server->options, "fetch_size");
> +
> + if (def)
> + {
> + fsstate->fetch_size = strtod(defGetString(def), NULL);
> + if (fsstate->fetch_size < 0)
> + elog(ERROR, "invalid fetch size for foreign table 
> \"%s\"",
> +  get_rel_name(table->relid));
> + }
> + else
> + fsstate->fetch_size = 100;

I don't think it's a good idea to make such checks at runtime - and
either way it's something that should be reported back using an
ereport(), not an elog.
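
For instance, something along these lines - the exact wording and errcode
here are only an illustration, not a suggestion for the final message:

    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("invalid fetch size for foreign table \"%s\"",
                    get_rel_name(table->relid))));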

Also, it seems somewhat wrong to determine this at execution
time. Shouldn't this rather be done when creating the foreign scan node?
And be a part of the scan state?

Have you thought about how this option should cooperate with join
pushdown once implemented?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql tabcomplete - minor bugfix - tabcomplete for SET ROLE TO xxx

2015-09-02 Thread Andres Freund
Hi,

On 2015-07-08 14:50:37 +0200, Pavel Stehule wrote:
> - static const char *const my_list[] =
> - {"DEFAULT", NULL};
> + /* fallback for GUC settings */
>  
> - COMPLETE_WITH_LIST(my_list);
> + char *vartype = get_vartype(prev2_wd);
> +
> + if (strcmp(vartype, "enum") == 0)
> + {
> + char querybuf[1024];
> +
> + snprintf(querybuf, 1024, Query_for_enum, 
> prev2_wd);
> + COMPLETE_WITH_QUERY(querybuf);
> + }

Won't that mean that enum variables don't complete to default anymore?

> +static char *
> +get_vartype(const char *varname)
> +{
> + PQExpBufferData query_buffer;
> + char*e_varname;
> + PGresult *result;
> + int string_length;
> + static char resbuf[10];
> +
> + initPQExpBuffer(&query_buffer);
> +
> + string_length = strlen(varname);
> + e_varname = pg_malloc(string_length * 2 + 1);
> + PQescapeString(e_varname, varname, string_length);

Independent of this patch, we really shouldn't do this in several places
:(

> + appendPQExpBuffer(&query_buffer,
> + "SELECT vartype FROM pg_settings WHERE pg_catalog.lower(name) = 
> pg_catalog.lower('%s')",
> +  e_varname);

Missing pg_catalog for pg_settings.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund

Hi,

On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> I didn't know that the thread must exists on -hackers to be able to add
> a commitfest entry, so I transfer the thread here.

Please, in the future, also update the title of the thread to something
fitting.

> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState 
> *estate, int eflags)
>  {
>   BitmapHeapScanState *scanstate;
>   RelationcurrentRelation;
> +#ifdef USE_PREFETCH
> + int new_io_concurrency;
> +#endif
>  
>   /* check for unsupported flags */
>   Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState 
> *estate, int eflags)
>*/
>   currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, 
> eflags);
>  
> +#ifdef USE_PREFETCH
> + /* check if the effective_io_concurrency has been overloaded for the
> +  * tablespace storing the relation and compute the 
> target_prefetch_pages,
> +  * or just get the current target_prefetch_pages
> +  */
> + new_io_concurrency = get_tablespace_io_concurrency(
> + currentRelation->rd_rel->reltablespace);
> +
> +
> + scanstate->target_prefetch_pages = target_prefetch_pages;
> +
> + if (new_io_concurrency != effective_io_concurrency)
> + {
> + double prefetch_pages;
> +if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
> + scanstate->target_prefetch_pages = rint(prefetch_pages);
> + }
> +#endif

Maybe it's just me - but imo there should be as few USE_PREFETCH
dependent places in the code as possible. It'll just be 0 when not
supported, that's fine? Especially changing the size of externally
visible structs depending on a configure detected ifdef seems wrong to
me.

> +bool
> +compute_io_concurrency(int io_concurrency, double *target_prefetch_pages)
> +{
> + double  new_prefetch_pages = 0.0;
> + int i;
> +
> + /* make sure the io_concurrency value is correct, it may have been 
> forced
> +  * with a pg_tablespace UPDATE
> +  */

Nitpick: Wrong comment style (/* stands on its own line).

> + if (io_concurrency > MAX_IO_CONCURRENCY)
> + io_concurrency = MAX_IO_CONCURRENCY;
> +
> + /*--
> +  * The user-visible GUC parameter is the number of drives (spindles),
> +  * which we need to translate to a number-of-pages-to-prefetch target.
> +  * The target value is stashed in *extra and then assigned to the actual
> +  * variable by assign_effective_io_concurrency.
> +  *
> +  * The expected number of prefetch pages needed to keep N drives busy 
> is:
> +  *
> +  * drives |   I/O requests
> +  * ---+
> +  *  1 |   1
> +  *  2 |   2/1 + 2/2 = 3
> +  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
> +  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> +  *  n |   n * H(n)

I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?

Perhaps more relevantly: Imo nodeBitmapHeapscan.c is the wrong place for
this. bufmgr.c maybe?

You also didn't touch
/*
 * How many buffers PrefetchBuffer callers should try to stay ahead of their
 * ReadBuffer calls by.  This is maintained by the assign hook for
 * effective_io_concurrency.  Zero means "never prefetch".
 */
int target_prefetch_pages = 0;
which surely doesn't make sense anymore after these changes.

But do we even need that variable now?

> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
> index dc167f9..57008fc 100644
> --- a/src/include/utils/guc.h
> +++ b/src/include/utils/guc.h
> @@ -26,6 +26,9 @@
>  #define MAX_KILOBYTES(INT_MAX / 1024)
>  #endif
>  
> +/* upper limit for effective_io_concurrency */
> +#define MAX_IO_CONCURRENCY 1000
> +
>  /*
>   * Automatic configuration file name for ALTER SYSTEM.
>   * This file will be used to store values of configuration parameters
> @@ -256,6 +259,8 @@ extern inttemp_file_limit;
>  
>  extern int   num_temp_buffers;
>  
> +extern int   effective_io_concurrency;
> +

target_prefetch_pages is declared in bufmgr.h - that seems like a better
place for these.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Tue, Sep  1, 2015 at 06:11:45PM -0400, Bruce Momjian wrote:
> Let me clearer about what the Citus Data paper shows.  I said originally
> that the data was sent to the coordinator, sorted, then resent to the
> shards, but the document:
> 
>   https://goo.gl/vJWF85
>   
> https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
> 
> has the shards create the groups and the groups are sent to the other
> shards.  For example, to do COUNT(DISTINCT) if you have three shards,
> then each shard breaks its data into 3 buckets (1B in size), then the
> first bucket from each of the three shards goes to the first shard, and
> the second bucket goes to the second shard, etc.
> 
> Basically, they are doing map-reduce, and the shards are creating
> additional batches that get shipped to other shards.  I can see FDWs not
> working well in that case as you are really creating a new data layout
> just for the query.  This explains why the XC/XL people are saying they
> would use FDWs if they existed at the time they started development,
> while the Citus Data people are saying they couldn't use FDWs as they
> currently exist.  They probably both needed FDW improvements, but I
> think the Citus Data features would need a lot more.

To expand on this, using FDWs, it means each shard would create a
temporary table on the other shards and send some of its data to those
shards.  Once a shard gets all its data from the other shards, it will
process the data and send the result to the collector.

That certainly seems like something FDWs would not do well.  Frankly, I
am unclear how Citus Data was able to do this with only backend hooks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Andres Freund
On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> +# PGAC_C_BUILTIN_PREFETCH
> +# -
> +# Check if the C compiler understands __builtin_prefetch(),
> +# and define HAVE__BUILTIN_PREFETCH if so.
> +AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
> +[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
> +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
> +[int i = 0;__builtin_prefetch(&i, 0, 3);])],
> +[pgac_cv__builtin_prefetch=yes],
> +[pgac_cv__builtin_prefetch=no])])
> +if test x"$pgac_cv__builtin_prefetch" = xyes ; then
> +AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
> +  [Define to 1 if your compiler understands __builtin_prefetch.])
> +fi])# PGAC_C_BUILTIN_PREFETCH

Hm. Does a compiler test actually test anything reliably here? Won't this
just throw a warning at compile time about an unknown function?

> +/*
> + * Prefetch support -- Support memory prefetching hints on some platforms.
> + *
> + * pg_rfetch() is specialized for the case where an array is accessed
> + * sequentially, and we can prefetch a pointer within the next element (or an
> + * even later element) in order to hide memory latency.  This case involves
> + * prefetching addresses with low temporal locality.  Note that it's rather
> + * difficult to get any kind of speedup using pg_rfetch();  any use of the
> + * intrinsic should be carefully tested.  Also note that it's okay to pass it
> + * an invalid or NULL address, although it's best avoided.
> + */
> +#if defined(USE_MEM_PREFETCH)
> +#define pg_rfetch(addr)  __builtin_prefetch((addr), 0, 0)
> +#endif

What is that name actually supposed to mean? 'rfetch' doesn't seem to be
particularly clear - or is it meant to be a wordplay combined with the
p?

I don't think you should specify the read/write and locality parameters
if we don't hand-pick them - right now you're saying the memory will
only be read and that it has no temporal locality.
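
Leaving them off would amount to something like this (sketch, not the
patch); GCC documents the defaults as rw = 0 and locality = 3 when the
extra arguments are omitted:

    #define pg_rfetch(addr)  __builtin_prefetch(addr)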

I think there should be an empty fallback definition even if the current
only caller ends up not needing it - not all callers will require it.

> + /*
> +  * Perform memory prefetch of tuple 
> that's three places
> +  * ahead of current (which is returned 
> to caller).
> +  * Testing shows that this 
> significantly boosts the
> +  * performance for TSS_INMEM "forward" 
> callers by
> +  * hiding memory latency behind their 
> processing of
> +  * returned tuples.
> +  */
> +#ifdef USE_MEM_PREFETCH
> + if (readptr->current + 3 < 
> state->memtupcount)
> + 
> pg_rfetch(state->memtuples[readptr->current + 3]);
> +#endif

Hm. Why do we need a branch here? The targets of prefetches don't need to
be valid addresses, and adding a bunch of arithmetic and a branch for the
corner case doesn't sound worthwhile to me.


What worries me about adding explicit prefetching is that it's *highly*
platform and even micro-architecture dependent. Why is looking three
elements ahead the right value?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED

2015-09-02 Thread Fujii Masao
On Wed, Aug 5, 2015 at 2:16 AM, Robert Haas  wrote:
> On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao  wrote:
>> track_commit_timestamp tracks COMMIT PREPARED as expected in standby server,
>> but not in master server. Is this intentional? It should track COMMIT 
>> PREPARED
>> even in master? Otherwise, we cannot use commit_timestamp feature to check
>> the replication lag properly while we use 2PC.
>
> That sounds like it must be a bug.  I think you should add it to the
> open items list.

Yep, added.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Tom Lane
Etsuro Fujita  writes:
> On 2015/08/01 23:25, Tom Lane wrote:
>> In my vision you probably really only want one call per build_join_rel
>> event (that is, per construction of a new RelOptInfo), not per
>> make_join_rel event.

> I'd vote for hooking at standard_join_search.

I think that method would require the FDW to duplicate a whole lot of the
join search mechanism, for not a whole lot of benefit.  It's possible that
there'd be value in doing some initial reconnaissance once we've examined
all the baserels, so I'm not necessarily against providing a hook there.
But if you have in mind that typical FDWs would actually create join paths
at that point, consider that

1. The FDW would have to find all the combinations of its supplied
relations (unless you are only intending to generate one path for the
union of all such rels, which seems pretty narrow-minded from here).

2. The FDW would have to account for join_is_legal considerations.

3. The FDW would have to arrange for creation of joinrel RelOptInfo
structures.  While that's possible, the available infrastructure for it
assumes that joinrels are built up from pairs of simpler joinrels, so
you couldn't go directly to the union of all the FDW's rels anyway.

So I still think that the most directly useful infrastructure here
would involve, when build_join_rel() first creates a given joinrel,
noticing whether both sides belong to the same foreign server and
if so giving the FDW a callback to consider producing pushed-down
joins.  That would be extremely cheap to do and it would not involve
adding overhead for an FDW to discover what the valid sets of joins
are.  In a large join problem, that's *not* going to be a cheap
thing to duplicate.  If there are multiple FDWs involved, the idea
that each one of them would do its own join search is particularly
horrid.
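
(Purely to illustrate the shape of such a callback - no hook like this
exists in the tree today, and every name below is invented:

    typedef void (*GetForeignJoinPaths_function) (PlannerInfo *root,
                                                  RelOptInfo *joinrel,
                                                  RelOptInfo *outerrel,
                                                  RelOptInfo *innerrel,
                                                  JoinType jointype);

i.e. called once from build_join_rel() when both sides of the new joinrel
belong to the same foreign server.)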

One other problem with the proposal is that we might never call
standard_join_search at all: GEQO overrides it, and so can external
users of join_search_hook.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Fujii Masao
On Thu, Aug 27, 2015 at 10:50 AM, Michael Paquier
 wrote:
> Hi all,
>
> As of now, file access functions in genfile.c can only be used by
> superusers. This proposal is to relax those functions so as
> replication users can use them as well. Here are the functions aimed
> by this patch:
> - pg_stat_file
> - pg_read_binary_file
> - pg_read_file
> - pg_ls_dir
> The main argument for this change is that pg_rewind makes use of those
> functions, forcing users to use a superuser role when rewinding a
> node. And with this patch, we could allow replication roles to do the
> same. Another argument in favor of this change is to allow replication
> users to dump directly the contents of PGDATA via SQL, though I don't
> believe that there are many people doing so these days.
>
> Also, replication roles can already have an access to the contents of
> PGDATA by taking a base backup for example, so this change looks
> logical to me, even if we filter out some files in a base backup,
> though I could not find any arguments to not let a replication user
> have a look at them via those functions. A patch is attached, I am
> adding it as well to the next CF.

+1

Did you confirm that replication user can complete pg_rewind
after this patch is applied?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: About CMake v2

2015-09-02 Thread Noah Misch
On Tue, Sep 01, 2015 at 11:46:05AM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2015-09-01 10:32:39 -0400, Noah Misch wrote:
> >> A monolithic patch replacing the GNU make build system with a CMake build
> >> system sounds far too hard to write and review; we should expect to 
> >> maintain
> >> those two build systems in parallel for awhile.
> 
> > I don't buy this.
> 
> Me either.

Okay.  If the author, reviewer(s) and committer deliver a correct monolithic
patch, I won't have anything to complain about.  For the PostgreSQL hackers
not involved in this work, that outcome is better than my suggestion.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] creating extension including dependencies

2015-09-02 Thread Andres Freund
Hi,

I'm looking at committing this patch. I found some nitpick-level things
that I can easily fixup. But I dislike two things:

1) Passing the list of parents through the cascade DefElem strikes me as
incredibly ugly.

For one the cascade option really should take a true/false type option
on the C level (so you can do defGetBoolean()), for another passing
through the list of parents via DefElem->arg seems wrong. You're
supposed to be able to copy parsenodes and at the very least that's
broken by the approach.

2) I don't like the control flow around the schema selection.

It seems to be getting a bit arcane. How about instead moving the
"extension \"%s\" must be installed in schema \"%s\" check into the if
(control->schema != NULL) block and check for d_schema after it? That
should look cleaner.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PSA: Upcoming Linux scheduler changes

2015-09-02 Thread Joshua D. Drake

Folks,

This is something we should be watching for and if people have time, 
testing to see how it affects us:


http://lkml.iu.edu/hypermail/linux/kernel/1508.3/04818.html
--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] creating extension including dependencies

2015-09-02 Thread Andres Freund
On 2015-09-02 17:27:38 +0200, Andres Freund wrote:
> 1) Passing the list of parents through the cascade DefElem strikes me as
> incredibly ugly.
> 
> For one the cascade option really should take a true/false type option
> on the C level (so you can do defGetBoolean()), for another passing
> through the list of parents via DefElem->arg seems wrong. You're
> supposed to be able to copy parsenodes and at the very least that's
> broken by the approach.

I think the fix here is to split off the bulk of CreateExtension() into
an internal function that takes additional parameters.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Oleg Bartunov
On Tue, Sep 1, 2015 at 7:08 PM, Robert Haas  wrote:

> On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
>  wrote:
> > My worry is that if we start implementing them again from scratch, it
> will
> > take a few years before we get them in a usable state. What XC/XL lacked
> is
> > probably a Robert Haas or a Tom Lane who could look at the work and
> suggest
> > major edits. If that had happened, the quality of the product could have
> > been much better today. I don't mean to derate the developers who worked
> on
> > XC/XL, but there is no harm in accepting that if someone with a much
> better
> > understanding of the whole system was part of the team, that would have
> > positively impacted the project. Is that an angle worth exploring? Does
> it
> > make sense to commit some more resources to say XC or XL and try to
> improve
> > the quality of the product even further? To be honest, XL is in far far
> > better shape (haven't really tried XC in a while) and some more
> QA/polishing
> > can make it production ready much sooner.
>
> From my point of view, and EnterpriseDB's point of view, anything that
> doesn't go into the core PostgreSQL distribution isn't really getting
> us where we need to be.  If there's code in XL that would be valuable
> to merge into core PostgreSQL, then let's do it.  If the code cannot
> be used but there are lessons we can learn that will make what does go
> into core PostgreSQL better, let's learn them.  However, I don't think
> it's serving anybody very well that we have the XC fork, and multiple
> forks of the XC fork, floating around out there and people are working
> on those instead of working on core PostgreSQL.  The reality is that
> we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
> different projects and have all of them be good.  The reality is,
> also, that horizontal scalability isn't an optional feature.  There
> was a point in time at which the PostgreSQL project's official policy
> on replication was that it did not belong in core.  That was a bad
> policy; thankfully, it was reversed, and the result was Hot Standby
> and Streaming Replication, incredibly important technologies without
> which we would not be where we are today. Horizontal scalability is
> just as essential.
>

Agree with you, Robert.

One lesson we got from XL is that we need a testing framework for clusters,
so that any cluster project at least passes functional and performance
testing. XL was very easy to break and I'm wondering how many corner cases
still exist. We tried several other approaches, and while reading the
papers was fun, in practice we found many devilish details, which made the
paper just a paper.



>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Andres Freund
On 2015-08-02 19:15:58 -0700, Michael Paquier wrote:
> +psql 'postgres', 'CREATE EXTENSION tables_fk';
> +
> +# Insert some data before running the dump, this is needed to check
> +# consistent data dump of tables with foreign key dependencies
> +psql 'postgres', 'INSERT INTO cc_tab_fkey VALUES (1)';
> +psql 'postgres', 'INSERT INTO bb_tab_fkey VALUES (1)';
> +psql 'postgres', 'INSERT INTO aa_tab_fkey VALUES (1)';
> +
> +# Create a table depending on a FK defined in the extension
> +psql 'postgres', 'CREATE TABLE dd_tab_fkey (id int REFERENCES 
> bb_tab_fkey(id))';
> +psql 'postgres', 'INSERT INTO dd_tab_fkey VALUES (1)';
> +
> +# Take a dump then re-deploy it to a new database
> +command_ok(['pg_dump', '-d', 'postgres', '-f', "$tempdir/dump.sql"],
> +'taken dump with installed extension');
> +command_ok(['createdb', 'foobar1'], 'created new database to redeploy dump');
> +command_ok(['psql', '--set=ON_ERROR_STOP=on', '-d', 'foobar1', '-f',
> + "$tempdir/dump.sql"], 'dump restored');
> +
> +# And check its content
> +my $result = psql 'foobar1',
> + q{SELECT id FROM aa_tab_fkey UNION ALL
> +SELECT id FROM bb_tab_fkey UNION ALL
> +SELECT id FROM cc_tab_fkey UNION ALL
> +SELECT id FROM dd_tab_fkey};
> +is($result, qq(1
> +1
> +1
> +1),
> + 'consistency of data restored');
> diff --git a/src/test/modules/tables_fk/tables_fk--1.0.sql 
> b/src/test/modules/tables_fk/tables_fk--1.0.sql
> new file mode 100644
> index 000..e424610
> --- /dev/null
> +++ b/src/test/modules/tables_fk/tables_fk--1.0.sql
> @@ -0,0 +1,20 @@
> +/* src/test/modules/tables_fk/tables_fk--1.0.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION tables_fk" to load this file. \quit
> +
> +CREATE TABLE IF NOT EXISTS cc_tab_fkey (
> + id int PRIMARY KEY
> +);
> +
> +CREATE TABLE IF NOT EXISTS bb_tab_fkey (
> + id int PRIMARY KEY REFERENCES cc_tab_fkey(id)
> +);
> +
> +CREATE TABLE IF NOT EXISTS aa_tab_fkey (
> + id int REFERENCES bb_tab_fkey(id)
> +);
> +
> +SELECT pg_catalog.pg_extension_config_dump('aa_tab_fkey', '');
> +SELECT pg_catalog.pg_extension_config_dump('bb_tab_fkey', '');
> +SELECT pg_catalog.pg_extension_config_dump('cc_tab_fkey', '');
> diff --git a/src/test/modules/tables_fk/tables_fk.control 
> b/src/test/modules/tables_fk/tables_fk.control

Isn't a full test with a separate initdb, create extension etc. a really
heavy-handed way to test this? I mean that's a test where the setup takes
up to 10s, whereas the actual runtime is in the millisecond range?

Adding tests in this manner doesn't seem scalable to me.

I think we should rather add *one* test that does dump/restore over the
normal regression test database. Something similar to the pg_upgrade
tests. And then work at adding more content to the regression test
database - potentially sourcing from src/test/modules.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra

Hi

On 09/02/2015 03:53 PM, Andres Freund wrote:


Hi,

On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:

I didn't know that the thread must exists on -hackers to be able to add
a commitfest entry, so I transfer the thread here.


Please, in the future, also update the title of the thread to something
fitting.


@@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState 
*estate, int eflags)
  {
BitmapHeapScanState *scanstate;
RelationcurrentRelation;
+#ifdef USE_PREFETCH
+   int new_io_concurrency;
+#endif

/* check for unsupported flags */
Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState 
*estate, int eflags)
 */
currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, 
eflags);

+#ifdef USE_PREFETCH
+   /* check if the effective_io_concurrency has been overloaded for the
+* tablespace storing the relation and compute the 
target_prefetch_pages,
+* or just get the current target_prefetch_pages
+*/
+   new_io_concurrency = get_tablespace_io_concurrency(
+   currentRelation->rd_rel->reltablespace);
+
+
+   scanstate->target_prefetch_pages = target_prefetch_pages;
+
+   if (new_io_concurrency != effective_io_concurrency)
+   {
+   double prefetch_pages;
+  if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
+   scanstate->target_prefetch_pages = rint(prefetch_pages);
+   }
+#endif


Maybe it's just me - but imo there should be as few USE_PREFETCH
dependant places in the code as possible. It'll just be 0 when not
supported, that's fine?


It's not just you. Dealing with code with plenty of ifdefs is annoying - 
it compiles just fine most of the time, until you compile it with some 
rare configuration. Then it either starts producing strange warnings or 
the compilation fails entirely.


It might make a tiny difference on builds without prefetching support 
because of code size, but realistically I think it's noise.



Especially changing the size of externally visible structs depending
on a configure detected ifdef seems wrong to me.


+100 to that


+   /*--
+* The user-visible GUC parameter is the number of drives (spindles),
+* which we need to translate to a number-of-pages-to-prefetch target.
+* The target value is stashed in *extra and then assigned to the actual
+* variable by assign_effective_io_concurrency.
+*
+* The expected number of prefetch pages needed to keep N drives busy 
is:
+*
+* drives |   I/O requests
+* ---+
+*  1 |   1
+*  2 |   2/1 + 2/2 = 3
+*  3 |   3/1 + 3/2 + 3/3 = 5 1/2
+*  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
+*  n |   n * H(n)


I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?


Well, even the comment right after the formula says that:

* Experimental results show that both of these formulas aren't
* aggressive enough, but we don't really have any better proposals.

That's the reason why users generally either use 0 or some rather high 
value (16 or 32 are the most common values I see). The problem is that we 
don't really care about the number of spindles (and not just because 
SSDs don't have them at all), but about the target queue length per 
device. Spinning rust uses TCQ/NCQ to optimize the head movement, SSDs 
are parallel by nature (stacking multiple chips with separate channels).


I doubt we can really improve the formula, except maybe for saying "we 
want 16 requests per device" and multiplying the number by that. We 
don't really have the necessary introspection to determine better values 
(and it's not really possible, because the devices may be hidden behind 
a RAID controller or a SAN). So we can't really do much.


Maybe the best thing we can do is just completely abandon the "number of 
spindles" idea, and just say "number of I/O requests to prefetch". 
Possibly with an explanation of how to estimate it (devices * queue length).
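
For comparison, a rough sketch of the arithmetic involved (illustration
only, not actual tree code):

    /* current GUC meaning: N spindles -> N * H(N) prefetched pages */
    static double
    prefetch_pages_from_spindles(int drives)
    {
        double      pages = 0.0;
        int         i;

        for (i = 1; i <= drives; i++)
            pages += (double) drives / i;

        return pages;
    }

Under the proposed meaning the translation disappears entirely: the GUC
value is used as the prefetch distance directly, and the documentation
would just suggest devices * queue length as a starting point.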


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane  wrote:
> But if you have in mind that typical FDWs would actually create join paths
> at that point, consider that
>
> 1. The FDW would have to find all the combinations of its supplied
> relations (unless you are only intending to generate one path for the
> union of all such rels, which seems pretty narrow-minded from here).

Well, if the remote end is another database server, presumably we can
leave it to optimize the query, so why would we need more than one
path?  I can see that we need more than one path because of sort-order
considerations, which would affect the query we ship to the remote
side.  But I don't see the point of considering multiple join orders
unless the remote end is dumber than our optimizer, which might be
true in some cases, but not if the remote end is PostgreSQL.

> 2. The FDW would have to account for join_is_legal considerations.

I agree with this.

> 3. The FDW would have to arrange for creation of joinrel RelOptInfo
> structures.  While that's possible, the available infrastructure for it
> assumes that joinrels are built up from pairs of simpler joinrels, so
> you couldn't go directly to the union of all the FDW's rels anyway.

And with this.

> So I still think that the most directly useful infrastructure here
> would involve, when build_join_rel() first creates a given joinrel,
> noticing whether both sides belong to the same foreign server and
> if so giving the FDW a callback to consider producing pushed-down
> joins.  That would be extremely cheap to do and it would not involve
> adding overhead for an FDW to discover what the valid sets of joins
> are.  In a large join problem, that's *not* going to be a cheap
> thing to duplicate.  If there are multiple FDWs involved, the idea
> that each one of them would do its own join search is particularly
> horrid.

So, the problem is that I don't think this entirely skirts the
join_is_legal issues, which are a principal point of concern for me.
Say this is a joinrel between (A B) and (C D E).  We need to generate
an SQL query for (A B C D E).  We know that the outermost syntactic
join can be (A B) to (C D E).  But how do we know which join orders
are legal as among (C D E)?  Maybe there's a simple way to handle this
that I'm not seeing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Microvacuum for gist.

2015-09-02 Thread Andres Freund
Hi,

I don't know too much about gist, but did a quick read through. Mostly
spotting some stylistic issues. Please fix those making it easier for
the next reviewer.

> *** a/src/backend/access/gist/gist.c
> --- b/src/backend/access/gist/gist.c
> ***
> *** 36,42  static bool gistinserttuples(GISTInsertState *state, 
> GISTInsertStack *stack,
>bool unlockbuf, bool unlockleftchild);
>   static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
>   GISTSTATE *giststate, List *splitinfo, bool 
> releasebuf);
> ! 

>   #define ROTATEDIST(d) do { \
>   SplitedPageLayout 
> *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
> --- 36,42 
>bool unlockbuf, bool unlockleftchild);
>   static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
>   GISTSTATE *giststate, List *splitinfo, bool 
> releasebuf);
> ! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
>   
>   #define ROTATEDIST(d) do { \
>   SplitedPageLayout
>   *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \

Newline removed.

> + /*
> +  * If leaf page is full, try at first to delete dead tuples. And then
> +  * check again.
> +  */
> + if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))

superfluous parens around is_split

> + /*
> +  * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
> +  * told us were killed.
> +  *
> +  * We match items by heap TID before mark them. If an item has moved off
> +  * the current page due to a split, we'll fail to find it and do nothing
> +  * (this is not an error case --- we assume the item will eventually get
> +  * marked in a future indexscan).
> +  *
> +  * We re-read page here, so it's significant to check page LSN. If page
> +  * has been modified since the last read (as determined by LSN), we dare not
> +  * flag any antries because it is possible that the old entry was vacuumed
> +  * away and the TID was re-used by a completely different heap tuple.

s/significant/important/?.
s/If page/If the page/
s/dare not/cannot/

> +  */
> + static void
> + gistkillitems(IndexScanDesc scan)
> + {
> + GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
> + Buffer  buffer;
> + Pagepage;
> + OffsetNumber minoff;
> + OffsetNumber maxoff;
> + int i;
> + boolkilledsomething = false;
> + 
> + Assert(so->curBlkno != InvalidBlockNumber);
> + 
> + buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
> + if (!BufferIsValid(buffer))
> + return;
> + 
> + LockBuffer(buffer, GIST_SHARE);
> + gistcheckpage(scan->indexRelation, buffer);
> + page = BufferGetPage(buffer);
> + 
> + /*
> +  * If page LSN differs it means that the page was modified since the 
> last read.
> +  * killedItemes could be not valid so LP_DEAD hints applying is not 
> safe.
> +  */
> + if(PageGetLSN(page) != so->curPageLSN)
> + {
> + UnlockReleaseBuffer(buffer);
> + so->numKilled = 0; /* reset counter */
> + return;
> + }
> + 
> + minoff = FirstOffsetNumber;
> + maxoff = PageGetMaxOffsetNumber(page);
> + 
> + maxoff = PageGetMaxOffsetNumber(page);

duplicated line.

> + for (i = 0; i < so->numKilled; i++)
> + {
> + if (so->killedItems != NULL)
> + {
> + OffsetNumber offnum = FirstOffsetNumber;
> + 
> + while (offnum <= maxoff)
> + {

This nested loop deserves a comment.

> + ItemId  iid = PageGetItemId(page, 
> offnum);
> + IndexTuple  ituple = (IndexTuple) 
> PageGetItem(page, iid);
> + 
> + if (ItemPointerEquals(&ituple->t_tid, 
> &(so->killedItems[i])))
> + {
> + /* found the item */
> + ItemIdMarkDead(iid);
> + killedsomething = true;
> + break;  /* out of inner search 
> loop */
> + }
> + offnum = OffsetNumberNext(offnum);
> + }
> + }
> + }

I know it's the same approach nbtree takes, but if there's a significant
number of deleted items this strikes me as a rather suboptimal
approach. The constants are small, but this still essentially is O(n^2).
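
One possible shape for getting away from the quadratic scan - purely a
sketch, not something from the patch, and assuming killedItems is an array
of ItemPointerData as the ItemPointerEquals() call suggests - would be to
sort the killed items once and binary-search them while walking the page
(itemptr_cmp being a hypothetical comparator):

    qsort(so->killedItems, so->numKilled, sizeof(ItemPointerData),
          itemptr_cmp);

    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      iid = PageGetItemId(page, offnum);
        IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

        if (bsearch(&ituple->t_tid, so->killedItems, so->numKilled,
                    sizeof(ItemPointerData), itemptr_cmp))
        {
            ItemIdMarkDead(iid);
            killedsomething = true;
        }
    }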

> ***
> *** 451,456  getNextNearest(IndexScanDesc scan)
> --- 553,575 
>   
>   if (scan->xs_itup)
>   {
> + /*
> +  * If previously returned index tuple is not visible save it 
> into
> +  * so->killedItems. And at the end of the page 

Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-02 Thread Christopher Browne
On 2 September 2015 at 04:52, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:
>
> On Tue, Sep 1, 2015 at 8:12 PM, Andres Freund  wrote:
>>
>> On 2015-09-01 14:07:19 -0400, Robert Haas wrote:
>> > But I think it's quite wrong to assume that the infrastructure for
>> > this is available and usable everywhere, because in my experience,
>> > that's far from the case.
>>
>> Especially when the alternative is a rather short patch implementing an
>> otherwise widely available feature.
>
>
> But that won't actually help in the case described by Robert: if the
master server A failed, the client has no idea if B or C would become the
new master.
>
> Unless it actually tries to connect them in turn and check for the result
of pg_is_in_recovery().  I think that brings enough complexity for keeping
this outside of libpq.  Also think about all the extra flexibility people
will likely want to have: number of retries, delay between retries, delay
backoff, etc., to the point we'll have to support some sort of client code
retry_policy_callback.
>
> For read-only clients you might want to include a number of slave
hostnames, and let the connector choose one, but then again you can't
achieve load-balancing on the client side, you're better off using
round-robin DNS.  To add or remove a slave you only need to update DNS, and
not configuration on all the clients.
>
> For the master failover I think a common technique is to just move the
floating IP address from the old master to the new one.  This doesn't
require touching the DNS record.

It seems to me that any time the answer involves touching the DNS record,
having this be libpq configuration is terribly inappropriate.

I'll note that OpenLDAP tends to accept multiple database URIs, and is
nonspecific as to the relative semantics.  That points me at two things
that may be worth considering for libpq:

1.  It seems to me that it would be ridiculous to try to have semantics
trying to interpret multiple values passed to
PGPORT/PGHOST/PGDATABASE/-p/-h/-d/, but rather, that if there are to be
multiple values, they should be presented as URIs.

I'm finding, by the way, that it is very useful in simplifying application
configuration to use the URIs that have been usable since about 9.2.

I <3
postgresql://postgres@localhost:7093/myfavedatabase

2.  I'd be inclined to go with pretty simple functionality in anything that
would be built-in to libpq.  No great amount of logic, rather

   a) Try the first URI.  If that works (where "connects" = "works"), we're
done.
   b) Try the next URI.  If that works, we're done.
   c) Is there another one?  Try it...  Works... Done.  Repeat as needful.

There's all kinds of functionality that you could add, between...
  - Random selection of URI.
  - Running some query against each URI tried to see if it's a
"master/slave", changing client functionality based on that.
  - Picking URI based on connecting (somewhere) and deciding what URI to
prefer.

I'd suggest implementing the simplest thing (e.g. - try URI 1, then 2, then
...), as that's most likely the most useful.
And more sophisticated functionality probably ought to be in the
application anyways.
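
For what it's worth, the simple behaviour is already a few lines of
application code on top of today's libpq (illustration only, not a
proposed API):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* try each URI in turn; the first successful connection wins */
    static PGconn *
    connect_first_working(const char *const uris[], int nuris)
    {
        int         i;

        for (i = 0; i < nuris; i++)
        {
            PGconn     *conn = PQconnectdb(uris[i]);

            if (PQstatus(conn) == CONNECTION_OK)
                return conn;        /* "connects" == "works" */

            fprintf(stderr, "could not connect to \"%s\": %s",
                    uris[i], PQerrorMessage(conn));
            PQfinish(conn);
        }
        return NULL;
    }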
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
> Maybe the best thing we can do is just completely abandon the "number of
> spindles" idea, and just say "number of I/O requests to prefetch". Possibly
> with an explanation of how to estimate it (devices * queue length).

I think that'd be a lot better.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] about fsync in CLOG buffer write

2015-09-02 Thread 张广舟(明虚)
Hi Hackers,

We host many PG instances on each of our machines, and we want to minimize
the calls to fsync within each instance, in order to minimize possible
impact on other instances. We found there is a fsync call when CLOG buffer
is written out in SlruPhysicalWritePage(). It is often called when a backend
needs to check transaction status with SimpleLruReadPage().


The code is as follows (in src/backend/access/transam/slru.c line 837 in PG
9.4):
/*
 * If not part of Flush, need to fsync now.  We assume this happens
 * infrequently enough that it's not a performance issue.
 */
if (!fdata)
{
if (ctl->do_fsync && pg_fsync(fd))
{
slru_errcause = SLRU_FSYNC_FAILED;
slru_errno = errno;
CloseTransientFile(fd);
return false;
}

if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
slru_errno = errno;
return false;
}

 ctl->do_fsync is true for CLOG.  The question is, could we just disable fsync
for the CLOG buffer write-out here? Is it safe to do so? I understand a
checkpoint calls SimpleLruFlush to flush all CLOG buffers at once, and
the fsync call here (for the buffer write-out) is not necessary.

Thanks for your time!
Guangzhou




Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-02 Thread Andres Freund
On 2015-08-09 12:47:53 +1200, David Rowley wrote:
> I took a bit of weekend time to finish this one off. Patch attached.
> 
> A quick test shows a pretty good performance increase:
> 
> create table ts (ts timestamp not null);
> insert into ts select generate_series('2010-01-01 00:00:00', '2011-01-01
> 00:00:00', '1 sec'::interval);
> vacuum freeze ts;
> 
> Master:
> david=# copy ts to 'l:/ts.sql';
> COPY 31536001
> Time: 20444.468 ms
> 
> Patched:
> david=# copy ts to 'l:/ts.sql';
> COPY 31536001
> Time: 10947.097 ms

Yes, that's pretty cool.

> There's probably a bit more to squeeze out of this.
> 1. EncodeDateTime() could return the length of the string to allow callers
> to use pnstrdup() instead of pstrdup(), which would save the strlen() call.
> 2. Have pg_int2str_zeropad() and pg_int2str() skip appending the NUL char
> and leave this up to the calling function.
> 3. Make something to replace the sprintf(str, " %.*s", MAXTZLEN, tzn); call.
> 
> Perhaps 1 and 3 are worth looking into, but I think 2 is likely too error
> prone for the small gain we'd get from it.

I'm inclined to first get the majority of the optimization - as in
something similar to the current patch. If we then feel a need to
optimize further we can do that. Otherwise we might end up not getting
the 95% performance improvement in 9.6 because we're playing with the
remaining 5 ;)


> Also I was not too sure on if pg_int2str() was too similar to pg_ltoa().
> Perhaps pg_ltoa() should just be modified to return the end of string?

I don't think the benefits are worth breaking pg_ltoa interface.

>  /*
> - * Append sections and fractional seconds (if any) at *cp.
> + * Append seconds and fractional seconds (if any) at *cp.
>   * precision is the max number of fraction digits, fillzeros says to
>   * pad to two integral-seconds digits.
>   * Note that any sign is stripped from the input seconds values.
> + * Note 'precision' must not be a negative number.
>   */
> -static void
> +static char *
>  AppendSeconds(char *cp, int sec, fsec_t fsec, int precision, bool fillzeros)
>  {
> +#ifdef HAVE_INT64_TIMESTAMP

Wouldn't it be better to just normalize fsec to an integer in that case?
Afaics that's the only remaining reason for the alternative path?
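
Roughly along these lines (sketch only): under float timestamps fsec_t is
a double holding fractional seconds, so it could be scaled to integral
microseconds once at the top, letting the rest of the function use a
single code path:

    #ifndef HAVE_INT64_TIMESTAMP
        fsec = rint(fsec * 1000000.0);      /* seconds -> microseconds */
    #endif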

> +/*
> + * pg_int2str
> + *   Converts 'value' into a decimal string representation of the 
> number.
> + *
> + * Caller must ensure that 'str' points to enough memory to hold the result
> + * (at least 12 bytes, counting a leading sign and trailing NUL).
> + * Return value is a pointer to the new NUL terminated end of string.
> + */
> +char *
> +pg_int2str(char *str, int32 value)
> +{
> + char *start;
> + char *end;
> +
> + /*
> +  * Handle negative numbers in a special way. We can't just append a '-'
> +  * prefix and reverse the sign as on two's complement machines negative
> +  * numbers can be 1 further from 0 than positive numbers, we do it this 
> way
> +  * so we properly handle the smallest possible value.
> +  */
> + if (value < 0)
> + {

We could alternatively handle this by special-casing INT_MIN, probably
resulting in a bit less duplicated code.
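
A self-contained sketch of that idea (illustration only, not the patch's
code; it returns a pointer to the trailing NUL, like pg_int2str()):

    #include <limits.h>
    #include <string.h>

    static char *
    int2str_sketch(char *str, int value)
    {
        char       *p = str;
        char       *q;
        char       *a;
        char       *b;

        if (value == INT_MIN)           /* the one value we can't negate */
        {
            strcpy(str, "-2147483648");
            return str + strlen(str);
        }

        if (value < 0)
        {
            *p++ = '-';
            value = -value;
        }

        /* emit digits in reverse, then swap them into place */
        q = p;
        do
        {
            *q++ = '0' + (value % 10);
            value /= 10;
        } while (value > 0);
        *q = '\0';

        for (a = p, b = q - 1; a < b; a++, b--)
        {
            char        tmp = *a;

            *a = *b;
            *b = tmp;
        }

        return q;
    }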
>  
>  /*
>   *   Per-opclass comparison functions for new btrees.  These are
> diff --git a/src/tools/msvc/build.pl b/src/tools/msvc/build.pl
> index e107d41..1e2dd62 100644
> --- a/src/tools/msvc/build.pl
> +++ b/src/tools/msvc/build.pl
> @@ -61,7 +61,7 @@ elsif ($buildwhat)
>  }
>  else
>  {
> - system("msbuild pgsql.sln /verbosity:detailed /p:Configuration=$bconf");
> + system("msbuild pgsql.sln /verbosity:quiet /p:Configuration=$bconf");
>  }

Uh? I assume that's an accidental change?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending clean up is not interruptable

2015-09-02 Thread Andres Freund
On 2015-08-12 11:59:48 -0700, Jeff Janes wrote:
> Attached patch does it that way.  There was also a free-standing
> CHECK_FOR_INTERRUPTS() which had no reason that I could see not to be a
> vacuum_delay_point, so I changed that one as well.

I think we should backpatch this - any arguments against?

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Andres Freund
On 2015-07-30 18:03:56 +0200, Fabien COELHO wrote:
> 
> >v6 is just a rebase after a bug fix by Andres Freund.
> >
> >Also a small question: The patch currently displays pgbench scripts
> >starting numbering at 0. Probably a little too geek... should start at 1?
> 
> v7 is a rebase after another small bug fix in pgbench.
> 
> -- 
> Fabien.

> diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
> index 2517a3a..99670d4 100644
> --- a/doc/src/sgml/ref/pgbench.sgml
> +++ b/doc/src/sgml/ref/pgbench.sgml
> @@ -261,6 +261,23 @@ pgbench  options  
> dbname
>  benchmarking arguments:
>  
>  
> + 
> +  -b scriptname[@weight]
> +  --builtin scriptname[@weight]
> +  
> +   
> +Add the specified builtin script to the list of executed scripts.
> +An optional integer weight after @ allows to adjust the
> +probability of drawing the test.
> +Available builtin scripts are: tpcb-like,
> +simple-update and select-only.
> +The provided scriptname needs only to be a prefix
> +of the builtin name, hence simp would be enough to select
> +simple-update.
> +   
> +  
> + 

Maybe add --builtin list to show them?

> @@ -404,10 +422,7 @@ pgbench  options  
> dbname
>--skip-some-updates
>
> 
> -Do not update pgbench_tellers and
> -pgbench_branches.
> -This will avoid update contention on these tables, but
> -it makes the test case even less like TPC-B.
> +Shorthand for -b simple-update@1.
> 
>
>   

> @@ -511,7 +526,7 @@ pgbench  options  
> dbname
>--select-only
>
> 
> -Perform select-only transactions instead of TPC-B-like test.
> +Shorthand for -b select-only@1.
> 
>
>   

I'm a bit inclined to remove these options.

>
> -   The default transaction script issues seven commands per transaction:
> +   Pgbench executes test scripts chosen randomly from a specified list.
> +   They include built-in scripts with -b and
> +   user-provided custom scripts with -f.
> +   Each script may be given a relative weight specified after a
> +   @ so as to change its drawing probability.
> +   The default weight is 1.
> + 

I'm wondering if percentages instead of weights would be a better
idea. That'd mean you'd be forced to be more careful when adding another
script (having to adjust the percentages of other scripts) but arguably
that's a good thing?
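
Either way the underlying draw is the same; a rough sketch of how a
weighted selection works (names invented, and rand() standing in for
pgbench's own PRNG):

    /* draw r in [0, total_weight) and walk the cumulative weights */
    static int
    choose_script(const int *weights, int nscripts, int total_weight)
    {
        int         r = rand() % total_weight;
        int         i;

        for (i = 0; i < nscripts; i++)
        {
            r -= weights[i];
            if (r < 0)
                return i;
        }
        return nscripts - 1;    /* not reached if weights sum correctly */
    }

With percentages the only difference is that total_weight is pinned to 100.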

> +static SQLScript sql_script[MAX_SCRIPTS];
> +static struct {
> + char *name;   /* very short name for -b ...*/
> + char *desc;   /* short description */
> + char *script; /* actual pgbench script */
> +} builtin_script[]

Can't we put these in the same array?

> + printf("transaction type: %s\n",
> +num_scripts == 1? sql_script[0].name: "multiple scripts");

Seems like it'd be more useful to simply always list the scripts +
weights here.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Alvaro Herrera
Andres Freund wrote:

> Isn't a full test with a separate initdb, create extension etc. a really
> heavyhanded way to test this? I mean that's a test where the setup takes
> up to 10s, whereas the actual runtime is in the millisecond range?

I spent some time looking over this patch yesterday, and that was one
concern I had too.  I was thinking that if we turn this into a single
module that can contain several extensions, or several pg_dump-related
tests, and then test multiple features without requiring an initdb for
each, it would be more palatable.

> Adding tests in this manner doesn't seem scalable to me.

I was thinking of having this be renamed src/test/modules/extensions/
and then the extension contained here would be renamed ext001_fk_tables
or something like that; later we could add ext002_something for testing some
other angle of extensions, not necessarily pg_dump related.

> I think we should rather add *one* test that does dump/restore over the
> normal regression test database. Something similar to the pg_upgrade
> tests. And then work at adding more content to the regression test
> database - potentially sourcing from src/test/modules.

That's another option, but we've had this idea for many years now and it
hasn't materialized.  As I recall, Andrew Dunstan has a module that
tests cross-version pg_upgrade and one thing he does is dump both and
compare; the problem is that there are differences, so he keeps a count
of how many lines he expects to differ between any two releases.  Or
something like that.  While it's a good enough start, I don't think it's
robust enough to be in core.  How would we do it?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Andres Freund
On 2015-09-02 14:30:33 -0300, Alvaro Herrera wrote:
> I was thinking in having this be renamed src/test/modules/extensions/
> and then the extension contained here would be renamed ext001_fk_tables
> or something like that; later we could ext002_something for testing some
> other angle of extensions, not necessarily pg_dump related.

The largest dataset we have for this is the current regression test
database; it seems a waste not to include it...

> That's another option, but we've had this idea for many years now and it
> hasn't materialized.

But that's just minimally more complex than what's currently done in
that test and in the pg_upgrade test?

> As I recall, Andrew Dunstan has a module that
> tests cross-version pg_upgrade and one thing he does is dump both and
> compare; the problem is that there are differences, so he keeps a count
> of how many lines he expect to differ between any two releases.

I'm not suggesting to do anything cross-release - that'd indeed be
another ballpark.

Just that instead of a pg_dump test that tests some individual things we
have one that tests the whole regression test output and then does a
diff?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Alvaro Herrera
Andres Freund wrote:

> > As I recall, Andrew Dunstan has a module that
> > tests cross-version pg_upgrade and one thing he does is dump both and
> > compare; the problem is that there are differences, so he keeps a count
> > of how many lines he expect to differ between any two releases.
> 
> I'm not suggesting to do anything cross-release - that'd indeed be
> another ballpark.

Ah, you're right.

> Just that instead of a pg_dump test that tests some individual things we
> have one that tests the whole regression test output and then does a
> diff?

Sorry if I'm slow -- A diff against what?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Tom Lane
Robert Haas  writes:
> On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane  wrote:
>> But if you have in mind that typical FDWs would actually create join paths
>> at that point, consider that
>> 
>> 1. The FDW would have to find all the combinations of its supplied
>> relations (unless you are only intending to generate one path for the
>> union of all such rels, which seems pretty narrow-minded from here).

> Well, if the remote end is another database server, presumably we can
> leave it to optimize the query, so why would we need more than one
> path?

If you have say 5 relations in the query, 3 of which are foreign, it might
make sense to join all 3 at the remote end, or maybe you should only join
2 of them remotely because it's better to then join to one of the local
rels before joining the last remote rel.  Even if you claim that that
would never make sense from a cost standpoint (a claim easily seen to be
silly), there might not be any legal way to join all 3 directly because of
join order constraints.

The larger point is that we can't expect the remote server to be fully
responsible for optimizing, because it will know nothing of what's being
done on our end.

> I can see that we need more than one path because of sort-order
> considerations, which would affect the query we ship to the remote
> side.  But I don't see the point of considering multiple join orders
> unless the remote end is dumber than our optimizer, which might be
> true in some cases, but not if the remote end is PostgreSQL.

(1) not all remote ends are Postgres, (2) the remote end doesn't have any
access to info about our end.

> So, the problem is that I don't think this entirely skirts the
> join_is_legal issues, which are a principal point of concern for me.
> Say this is a joinrel between (A B) and (C D E).  We need to generate
> an SQL query for (A B C D E).  We know that the outermost syntactic
> join can be (A B) to (C D E).  But how do we know which join orders
> are legal as among (C D E)?  Maybe there's a simple way to handle this
> that I'm not seeing.

Well, if the joins get built up in the way I think should happen, we'd
have already considered (C D E), and we could have recorded the legal join
orders within that at the time.  (I imagine that we should allow FDWs to
store some data within RelOptInfo structs that represent foreign joins
belonging entirely to them, so that there'd be a handy place to keep that
data till later.)  Or we could trawl through the paths associated with the
child joinrel, which will presumably include instances of every reasonable
sub-join combination.  Or the FDW could look at the SpecialJoinInfo data
and determine things for itself (or more likely, ask join_is_legal about
that).

In the case of postgres_fdw, I think the actual requirement will be to be
able to reconstruct a SQL query that correctly expresses the join; that
is, we need to send over something like "from c left join d on (...) full
join e on (...)", not just "from c, d, e", or we'll get totally bogus
estimates as well as bogus execution results.  Offhand I think that the
most likely way to build that text will be to examine the query's jointree
to see where c,d,e appear in it.  But in any case, that's a separate issue
and I fail to see how plopping the join search problem into the FDW's lap
would make it any easier.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> On 2015-09-02 00:09, Josh Berkus wrote:
>> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>>> So while you may be right in single-DC deployments, with multi-DC
>>> deployments the situation is quite different - not only that the network
>>> bandwidth is not unlimited, but because latencies within DC may be a
>>> fraction of latencies between the locations (to the extent that the
>>> increase due to syncrep may be just noise). So the local replication may
>>> be actually way faster.
>>
>> I'm not seeing how the above is better using syncrep than using shard
>> copying?
> 
> Shard copying usually assumes that the origin node does the copy - the
> data has to go twice through the slow connection. With replication you
> can replicate locally over fast connection.

Ah, I was thinking of the case of having a single set of copies in the
remote DC, but of course that isn't going to be the case with a highly
redundant setup.

Basically this seems to be saying that, in an ideal setup, we'd have
some kind of synchronous per-shard replication.  We don't have that at
present (sync rep is whole-node, and BDR is asynchronous).  There's also
the question of how to deal with failures and taking bad nodes out of
circulation in such a setup, especially considering that the writes
could be coming from multiple other nodes.

>> Not really, the mechanism is different and the behavior is different.
>> One critical deficiency in using binary syncrep is that you can't do
>> round-robin redundancy at all; every redundant node has to be an exact
>> mirror of another node.  In a good HA distributed system, you want
>> multiple shards per node, and you want each shard to be replicated to a
>> different node, so that in the event of node failure you're not dumping
>> the full load on one other server.
>>
> 
> This assumes that we use binary replication, but we can reasonably use
> logical replication which can quite easily do filtering of what's
> replicated where.

Is there a way to do logical synchronous replication?  I didn't think
there was.

>>> IMHO the design has to address the multi-DC setups somehow. I think that
>>> many of the customers who are so concerned about scaling to many shards
>>> are also concerned about availability in case of DC outages, no?
>>
>> Certainly.  But users located in a single DC shouldn't pay the same
>> overhead as users who are geographically spread.
>>
> 
> Agreed, so we should support both ways, but I don't think it's necessary
> to support both ways in version 0.1. It's just important to not paint
> ourselves into a corner with design decisions that would make one of the
> ways impossible.

Exactly!

Let me explain why I'm so vocal on this point.  PostgresXC didn't deal
with the redundancy/node replacement at all until after version 1.0.
Then, when they tried to address it, they discovered that the code was
chock full of assumptions that "1 node == 1 shard", and breaking that
assumption would require a total refactor of the code (which never
happened).  I don't want to see a repeat of that mistake.

Even if it's only on paper, any new sharding design needs to address
these questions:

1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?
   2. a. how do we allow an out-of-sync node to "catch up"?
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?

There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems.  Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Merlin Moncure
On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas  wrote:
> It would be a bad idea to cling blindly to the FDW infrastructure if
> it's fundamentally inadequate to do what we want.  On the other hand,
> it would also be a bad idea to set about recreating it without a
> really good reason, and - just to take one example - the fact that it
> doesn't currently push down DML operations to the remote side is not a
> really good reason to rewrite the whole thing.  On the contrary, it's
> a reason to put some energy into the already-written patch which
> implements that optimization.

The problem with FDW for these purposes as I see it is that too much
intelligence is relegated to the implementer of the API.  There needs
to be a mechanism so that the planner can rewrite the remote query and
then do some after the fact processing.  This exactly what citus does;
if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
so that aggregation can be properly weighted to the result.   To do
this (distributed OLAP-type processing) right, the planner needs to
*know* that this table is in fact distributed and also know that it
can make SQL compatible adjustments to the query.
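
For concreteness, the arithmetic behind that rewrite looks something like
this (a minimal C sketch with invented names, not code from citus or any
FDW): each shard returns a (sum, count) pair, and the coordinator folds
them together so the average is weighted by the per-shard row counts.

/* Per-shard partial aggregate for avg(foo), shipped back as a
 * (sum, count) pair rather than a pre-computed average. */
typedef struct PartialAvg
{
	double	sum;
	long	count;
} PartialAvg;

/* Combine the partials; dividing total sum by total count is what keeps
 * the result correctly weighted across shards of different sizes. */
static double
combine_avg(const PartialAvg *parts, int nshards)
{
	double	sum = 0.0;
	long	count = 0;

	for (int i = 0; i < nshards; i++)
	{
		sum += parts[i].sum;
		count += parts[i].count;
	}
	return (count > 0) ? sum / count : 0.0;	/* NULL handling omitted */
}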

This strikes me as a bit of a conflict of interest with FDW which
seems to want to hide the fact that it's foreign; the FDW
implementation makes it's own optimization decisions which might make
sense for single table queries but breaks down in the face of joins.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Tom Lane
I wrote:
> ... I imagine that we should allow FDWs to
> store some data within RelOptInfo structs that represent foreign joins
> belonging entirely to them, so that there'd be a handy place to keep that
> data till later.

Actually, if we do that (ie, provide a "void *fdw_state" field in join
RelOptInfos), then the FDW could use the nullness or not-nullness of
such a field to realize whether or not it had already considered this
join relation.  So I'm now thinking that the best API is to call the
FDW at the end of each make_join_rel call, whether it's the first one
for the joinrel or not.  If the FDW wants a call for each legal pair of
input sub-relations, it's got one.  If it only wants one call per joinrel,
it can just make sure to put something into fdw_state, and then on
subsequent calls for the same joinrel it can just exit immediately if
fdw_state is already non-null.  So we have both use-cases covered.
Also, by doing this at the end, the FDW can look at the "regular" (local
join execution) paths that were already generated, should it wish to.
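
To illustrate, here is a rough sketch of what an FDW-side callback could
look like under that scheme; the hook name, its signature, and the struct
are all invented for illustration (and the snippet assumes the usual
planner headers), not a settled API:

/*
 * Hypothetical per-joinrel private state hung off the proposed
 * RelOptInfo->fdw_state field.
 */
typedef struct PgFdwJoinState
{
	bool		paths_built;	/* already generated foreign-join paths? */
} PgFdwJoinState;

/*
 * Imagined to be called at the end of each make_join_rel(), i.e. once per
 * legal pair of input sub-relations for this joinrel.
 */
static void
postgresConsiderForeignJoin(PlannerInfo *root, RelOptInfo *joinrel,
							RelOptInfo *outerrel, RelOptInfo *innerrel)
{
	PgFdwJoinState *state = (PgFdwJoinState *) joinrel->fdw_state;

	if (state != NULL)
		return;					/* only want one call per joinrel */

	state = (PgFdwJoinState *) palloc0(sizeof(PgFdwJoinState));
	state->paths_built = true;
	joinrel->fdw_state = state;

	/*
	 * Here the FDW would consult SpecialJoinInfo / join_is_legal, look at
	 * the local paths already added to joinrel, and add a ForeignPath for
	 * the pushed-down join if that looks worthwhile.
	 */
}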

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Fabien COELHO


Hello Andres,


Maybe add --builtin list to show them?


Yep, easy enough.


[...]
+Shorthand for -b simple-update@1.
+Shorthand for -b select-only@1.


I'm a bit inclined to remove these options.


Hm...

This is really backward compatibility, and people may find references to 
these in blogs or elsewhere, so I think that it would make sense to be 
upward compatible.

I would certainly be against adding any other of these options, though.


+   Each script may be given a relative weight specified after a
+   @ so as to change its drawing probability.
+   The default weight is 1.


I'm wondering if percentages instead of weights would be a better
idea. That'd mean you'd be forced to be more careful when adding another
script (having to adjust the percentages of other scripts) but arguably
that's a good thing?


If you use only percent, then you have to check that the total is 100, 
probably you have to use floats, to do something when the total is not 
100, checking would complicate the code and test people's mental calculus 
abilities. Not sure this is a good idea:-)


In the use case you outline, when adding a script, maybe you know that it 
runs "as much as" this other script, so you can pick up the same weight 
without bothering.


Also, when testing, there is an issue when you want to remove one script 
for a quick test, and that would mean changing all percentages on the 
command line...


So I would advise not to put such a constraint.
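
For what it's worth, drawing a script from arbitrary integer weights only 
needs a running total, so there is no implementation pressure to force the 
weights to sum to 100 -- a minimal sketch, not the actual pgbench code:

#include <stdlib.h>

/* Pick a script index according to relative weights; the weights need not
 * add up to 100, or to anything in particular.  Sketch only -- pgbench
 * would use its own PRNG rather than rand(). */
static int
choose_script(const int *weights, int nscripts)
{
	int		total = 0;
	int		draw;

	for (int i = 0; i < nscripts; i++)
		total += weights[i];

	draw = rand() % total;

	for (int i = 0; i < nscripts; i++)
	{
		if (draw < weights[i])
			return i;
		draw -= weights[i];
	}
	return nscripts - 1;		/* unreachable if total > 0 */
}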


+static SQLScript sql_script[MAX_SCRIPTS];

+static struct {
+   char *name;   /* very short name for -b ...*/
+   char *desc;   /* short description */
+   char *script; /* actual pgbench script */
+} builtin_script[]


Can't we put these in the same array?


I do not understand.


+   printf("transaction type: %s\n",
+  num_scripts == 1? sql_script[0].name: "multiple scripts");


Seems like it'd be more useful to simply always list the scripts +
weights here.


The detailed list is shown later, with the summary performance figure for 
each scripts, so ISTM that it would be redundant? Maybe the transaction 
type could be moved downwards just before said list?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure  wrote:
> On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas  wrote:
>> It would be a bad idea to cling blindly to the FDW infrastructure if
>> it's fundamentally inadequate to do what we want.  On the other hand,
>> it would also be a bad idea to set about recreating it without a
>> really good reason, and - just to take one example - the fact that it
>> doesn't currently push down DML operations to the remote side is not a
>> really good reason to rewrite the whole thing.  On the contrary, it's
>> a reason to put some energy into the already-written patch which
>> implements that optimization.
>
> The problem with FDW for these purposes as I see it is that too much
> intelligence is relegated to the implementer of the API.  There needs
> to be a mechanism so that the planner can rewrite the remote query and
> then do some after the fact processing.  This exactly what citus does;
> if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
> so that aggregation can be properly weighted to the result.   To do
> this (distributed OLAP-type processing) right, the planner needs to
> *know* that this table is in fact distributed and also know that it
> can make SQL compatible adjustments to the query.
>
> This strikes me as a bit of a conflict of interest with FDW which
> seems to want to hide the fact that it's foreign; the FDW
> implementation makes it's own optimization decisions which might make
> sense for single table queries but breaks down in the face of joins.

Well, I don't think that ALL of the logic should go into the FDW.  In
particular, in this example, parallel aggregate needs the same query
rewrite, so the logic for that should live in core so that both
parallel and distributed queries get the benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 2:20 PM, Fabien COELHO  wrote:
>> I'm wondering if percentages instead of weights would be a better
>> idea. That'd mean you'd be forced to be more careful when adding another
>> script (having to adjust the percentages of other scripts) but arguably
>> that's a good thing?
>
> If you use only percent, then you have to check that the total is 100,
> probably you have to use floats, to do something when the total is not 100,
> checking would complicate the code and test people mental calculus
> abilities. Not sure this is a good idea:-)

I agree.  I don't see a reason to enforce that the total of the
weights must be 100.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus  wrote:
> Even if it's only on paper, any new sharding design needs to address
> these questions:
>
> 1. How do we ensure no/minimal data is lost if we lose a node?
> 2. How do we replace a lost node (without taking the cluster down)?
>2. a. how do we allow an out-of-sync node to "catch up"?
> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
> 4. How do we add nodes to expand the cluster?
>
> There doesn't need to be code for all of the above from version 0.1, but
> there needs to be a plan to tackle those problems.  Otherwise, we'll
> just end up with another dead-end, not-useful-in-production technology.

This is a good point, and I think I agree with it.  Let me make a few
observations:

1. None of this stuff matters very much when the data is strictly
read-only.  You don't lose any data because you made enough copies at
some point in the distant past to ensure that you wouldn't.  You
replace a lost node by taking a new copy.  Nodes never need to catch up
because there are no changes happening.  To bring up a new node,
you make a copy of an existing node (which doesn't change in the
meantime).  So most of these concerns are about how to handle writes.

2. None of this stuff matters when you only have one copy of the data.
Your system is low-availability, but you just don't care for whatever
reason.  The issue arises when you have multiple copies of the data,
and the data is being changed.  Now, you have to worry about the
copies getting out of sync with each other, especially when failures
happen.

3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants.  Then it
generates an SQL statement to be sent to the data node to make the
appropriate change.  If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node.  I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good.  Fundamentally, this model doesn't
allow for many good options in that case.

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-02 Thread Robert Haas
On Tue, Sep 1, 2015 at 6:43 PM, and...@anarazel.de  wrote:
> Why a new tranche for each of these? And it can't be correct that each
> has the same base?

I complained about the same-base problem before.  Apparently, that got ignored.

> I don't really like the tranche model as in the patch right now. I'd
> rather have in a way that we have one tranch for all the individual
> lwlocks, where the tranche points to an array of names alongside the
> tranche's name. And then for the others we just supply the tranche name,
> but leave the name array empty, whereas a name can be generated.

That's an interesting idea.
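
As one concrete reading of that idea (struct and field names invented
here, not the existing LWLock tranche definition):

/* One tranche covers all the individually-named LWLocks and carries a
 * per-lock name array; other tranches supply only a tranche name and
 * leave the array NULL, so individual lock names get generated. */
typedef struct NamedLWLockTranche
{
	const char *tranche_name;		/* e.g. "individual" or "buffer_mapping" */
	const char *const *lock_names;	/* per-lock names, or NULL to generate */
	int			nlocks;
} NamedLWLockTranche;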

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-02 Thread Robert Haas
On Tue, Sep 1, 2015 at 6:13 PM, Jim Nasby  wrote:
> My thought is that there's a fair amount of places where we do string
> comparison for not a great reason. Perhaps a better example is data that
> comes back from a trigger; AFTER/BEFORE, INSERT/UPDATE/..., which is more
> expensive to setup the variables for (strdup a fixed string, which means a
> palloc), and then comparisons are done as text varlena (iirc).
>
> Instead if this information came back as an ENUM the variable would be a
> simple int as would the comparison. We'd still have a raw string being
> parsed in the function body, but that would happen once during initial
> compilation and it would be replaced with an ENUM value.
>
> For RAISE, AFAIK we still end up converting the raw string into a varlena
> CONST, which means a palloc. If it was an ENUM it'd be converted to an int.
>
> If we're worried about the overhead of the enum machinery we could create a
> relcache for system enums, but I suspect that even without that it'd be a
> win over the string stuff. Especially since I bet most people run UTF8.

I agree with Pavel on this one: creating an extra type here is going
to cause more pain than it removes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Greg Stark
On 2 Sep 2015 14:54, "Andres Freund"  wrote:
>
>
> > + /*--
> > +  * The user-visible GUC parameter is the number of drives (spindles),
> > +  * which we need to translate to a number-of-pages-to-prefetch target.
> > +  * The target value is stashed in *extra and then assigned to the actual
> > +  * variable by assign_effective_io_concurrency.
> > +  *
> > +  * The expected number of prefetch pages needed to keep N drives busy is:
> > +  *
> > +  * drives |   I/O requests
> > +  * ---+
> > +  *  1 |   1
> > +  *  2 |   2/1 + 2/2 = 3
> > +  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
> > +  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> > +  *  n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

I can take the blame for this formula.

It's called the "Coupon Collector Problem". If you hit get a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each.
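
For concreteness, the arithmetic behind the quoted table is just n * H(n);
a throwaway program shows it (the helper name is mine -- the server does
the equivalent in the assign hook for effective_io_concurrency):

#include <stdio.h>

/* Expected number of concurrent prefetch requests needed to keep "drives"
 * spindles busy: drives * H(drives), with H the harmonic number. */
static double
expected_prefetch_pages(int drives)
{
	double	pages = 0.0;

	for (int i = 1; i <= drives; i++)
		pages += (double) drives / i;

	return pages;
}

int
main(void)
{
	for (int n = 1; n <= 4; n++)
		printf("%d drive(s): %.2f requests\n", n, expected_prefetch_pages(n));
	return 0;	/* prints 1.00, 3.00, 5.50, 8.33 -- matching the table */
}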

This computation model assumes we have no information about which
spindle each block will hit. That's basically true for the case of
bitmapheapscan for most cases because the idea of bitmapheapscan is to
be picking a sparse set of blocks and there's no reason the blocks
being read will have any regularity that causes them all to fall on
the same spindles. If in fact you're reading a fairly dense set then
bitmapheapscan probably is a waste of time and simply reading
sequentially would be exactly as fast or even faster.

We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I
think the above formula works for essentially random I/O but for more
predictable I/O it might be possible to use a different formula. But
if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that
different access pattern.

I did speak to a dm developer and he suggested that the kernel could
help out with an API. He suggested something of the form "how many
blocks do I have to read before the end of the current device". I
wasn't sure exactly what we would do with something like that but it
would be better than just guessing how many I/O operations we need to
issue to keep all the spindles busy.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM versus GIN pending list bloat

2015-09-02 Thread Robert Haas
On Mon, Aug 10, 2015 at 1:16 PM, Jeff Janes  wrote:
> I have a simple test case that inserts an array of 101 md5 digests into each
> row.  With 10_000 of these rows inserted into an already indexed table, I
> get 40MB for the table and 80MB for the index unpatched.  With the patch, I
> get 7.3 MB for the index.

Uh, wow.  Seems like we should do something about this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 4:52 AM, Shulgin, Oleksandr
 wrote:
> On Tue, Sep 1, 2015 at 8:12 PM, Andres Freund  wrote:
>>
>> On 2015-09-01 14:07:19 -0400, Robert Haas wrote:
>> > But I think it's quite wrong to assume that the infrastructure for
>> > this is available and usable everywhere, because in my experience,
>> > that's far from the case.
>>
>> Especially when the alternative is a rather short patch implementing an
>> otherwise widely available feature.
>
> But that won't actually help in the case described by Robert: if the master
> server A failed, the client has no idea if B or C would become the new
> master.

Sure it does.  You just need to ensure that whichever of those is the
new master accepts connections, and the other one doesn't.  There are
lots of ways to do this; e.g. give the machine a second IP that
accepts connections only when the machine is the designated master,
and have read-write clients connect to that IP, and read-only clients
connect to the machine's main IP.
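
In libpq terms the read-write side of that scheme is nothing exotic -- a
minimal sketch, with the hostname invented for illustration:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/*
	 * Read-write clients point at a floating address that is only up on
	 * whichever machine is currently the designated master; read-only
	 * clients would use the machines' fixed addresses instead.
	 */
	PGconn *conn = PQconnectdb("host=db-master-vip.example.com dbname=app");

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "not the master, or it is down: %s",
				PQerrorMessage(conn));

	PQfinish(conn);
	return 0;
}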

Andres's point is the same as mine: we ought to accept this feature,
in some form, because it's really quite useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/02/2015 11:41 AM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus  wrote:
>> Even if it's only on paper, any new sharding design needs to address
>> these questions:
>>
>> 1. How do we ensure no/minimal data is lost if we lose a node?
>> 2. How do we replace a lost node (without taking the cluster down)?
>>2. a. how do we allow an out-of-sync node to "catch up"?
>> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
>> 4. How do we add nodes to expand the cluster?
>>
>> There doesn't need to be code for all of the above from version 0.1, but
>> there needs to be a plan to tackle those problems.  Otherwise, we'll
>> just end up with another dead-end, not-useful-in-production technology.
> 
> This is a good point, and I think I agree with it.  Let me make a few
> observations:
> 
> 1. None of this stuff matters very much when the data is strictly
> read-only. 

Yep.

> 2. None of this stuff matters when you only have one copy of the data.
> Your system is low-availability, but you just don't care for whatever
> reason. 

Uh-huh.

> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

pg_shard also sends the data to each node, and automatically notices
which nodes are not responding and takes them out of availability.
There isn't a "catch up" feature yet (AFAIK), or any attempt to reduce
volatile functions.

For that matter, last I worked on it Greenplum also did multiplexing via
the writing node (or via the data loader).  So this is a popular
approach; it has a number of drawbacks, though, of which volatile
functions are a major one.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.  

Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one.  What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.

You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries.  For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".

> But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
 I'm not a big fan of a GTM at all, frankly; it makes clusters much
harder to set up, and becomes a SPOF.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Julien Rouhaud

Hi,

On 02/09/2015 18:06, Tomas Vondra wrote:
> Hi
> 
> On 09/02/2015 03:53 PM, Andres Freund wrote:
>> 
>> Hi,
>> 
>> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>>> I didn't know that the thread must exists on -hackers to be
>>> able to add a commitfest entry, so I transfer the thread here.
>> 
>> Please, in the future, also update the title of the thread to
>> something fitting.
>> 

Sorry for that.

>>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>>  {
>>>  	BitmapHeapScanState *scanstate;
>>>  	Relation	currentRelation;
>>> +#ifdef USE_PREFETCH
>>> +	int			new_io_concurrency;
>>> +#endif
>>> 
>>>  	/* check for unsupported flags */
>>>  	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
>>> @@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
>>>  	 */
>>>  	currentRelation = ExecOpenScanRelation(estate,
>>>  										   node->scan.scanrelid, eflags);
>>> 
>>> +#ifdef USE_PREFETCH
>>> +	/* check if the effective_io_concurrency has been overloaded for the
>>> +	 * tablespace storing the relation and compute the target_prefetch_pages,
>>> +	 * or just get the current target_prefetch_pages
>>> +	 */
>>> +	new_io_concurrency = get_tablespace_io_concurrency(
>>> +			currentRelation->rd_rel->reltablespace);
>>> +
>>> +	scanstate->target_prefetch_pages = target_prefetch_pages;
>>> +
>>> +	if (new_io_concurrency != effective_io_concurrency)
>>> +	{
>>> +		double		prefetch_pages;
>>> +
>>> +		if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
>>> +			scanstate->target_prefetch_pages = rint(prefetch_pages);
>>> +	}
>>> +#endif
>> 
>> Maybe it's just me - but imo there should be as few USE_PREFETCH 
>> dependant places in the code as possible. It'll just be 0 when
>> not supported, that's fine?
> 
> It's not just you. Dealing with code with plenty of ifdefs is
> annoying - it compiles just fine most of the time, until you
> compile it with some rare configuration. Then it either starts
> producing strange warnings or the compilation fails entirely.
> 
> It might make a tiny difference on builds without prefetching
> support because of code size, but realistically I think it's
> noise.
> 
>> Especially changing the size of externally visible structs
>> depending ona configure detected ifdef seems wrong to me.
> 
> +100 to that
> 

I totally agree. I'll remove the ifdefs.

>> Nitpick: Wrong comment style (/* stands on its own line).

I did run pgindent before submitting the patch, but apparently I picked
the wrong one. Already fixed in my local branch.

>>> + /*--
>>> +  * The user-visible GUC parameter is the number of drives (spindles),
>>> +  * which we need to translate to a number-of-pages-to-prefetch target.
>>> +  * The target value is stashed in *extra and then assigned to the actual
>>> +  * variable by assign_effective_io_concurrency.
>>> +  *
>>> +  * The expected number of prefetch pages needed to keep N drives busy is:
>>> +  *
>>> +  * drives |   I/O requests
>>> +  * ---+
>>> +  *  1 |   1
>>> +  *  2 |   2/1 + 2/2 = 3
>>> +  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
>>> +  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> +  *  n |   n * H(n)
>> 
>> I know you just moved this code. But: I don't buy this formula.
>> Like at all. Doesn't queuing and reordering entirely invalidate
>> the logic here?
> 
> Well, even the comment right after the formula says that:
> 
> * Experimental results show that both of these formulas aren't
> * aggressive enough, but we don't really have any better proposals.
> 
> That's the reason why users generally either use 0 or some rather
> high value (16 or 32 are the most common values seen). The problem
> is that we don't really care about the number of spindles (and not
> just because SSDs don't have them at all), but about the target
> queue length per device. Spinning rust uses TCQ/NCQ to optimize the
> head movement, SSDs are parallel by nature (stacking multiple chips
> with separate channels).
> 
> I doubt we can really improve the formula, except maybe for saying
> "we want 16 requests per device" and multiplying the number by
> that. We don't really have the necessary introspection to determine
> better values (and it's not really possible, because the devices
> may be hidden behind a RAID controller (or a SAN). So we can't
> really do much.
> 
> Maybe the best thing we can do is just completely abandon the
> "number of spindles" idea, and just say "number of I/O requests to
> prefetch". Possibly with an explanation of how to estimate it
> (devices * queue length).
> 
>> I think that'd be a lot better.

+1 for that too.

If everone's ok with this change, I can submit a patch for that too.
Should I split that into two patches, and/or start a new thread?



-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org

Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Alvaro Herrera
Michael Paquier wrote:

> I haven't written yet a test case but I think that we could reproduce
> that simply by having a relation referenced in the exception block of
> a first function, calling a second function that itself raises an
> exception, causing the referencing error.

Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this?  Jim?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> Hm. Is a compiler test actually test anything reliably here? Won't this
> just throw a warning during compile time about an unknown function?

I'll need to look into that.

>> +/*
>> + * Prefetch support -- Support memory prefetching hints on some platforms.
>> + *
>> + * pg_rfetch() is specialized for the case where an array is accessed
>> + * sequentially, and we can prefetch a pointer within the next element (or an
>> + * even later element) in order to hide memory latency.  This case involves
>> + * prefetching addresses with low temporal locality.  Note that it's rather
>> + * difficult to get any kind of speedup using pg_rfetch();  any use of the
>> + * intrinsic should be carefully tested.  Also note that it's okay to pass it
>> + * an invalid or NULL address, although it's best avoided.
>> + */
>> +#if defined(USE_MEM_PREFETCH)
>> +#define pg_rfetch(addr)  __builtin_prefetch((addr), 0, 0)
>> +#endif
>
> What is that name actually supposed to mean? 'rfetch' doesn't seem to be
> particularly clear - or is it a meant to be a wordplay combined with the
> p?

"Read fetch". One argument past to the intrinsic here specifies that
the variable will be read only. I did things this way because I
imagined that there would be very limited uses for the macro only. I
probably cover almost all interesting cases for explicit memory
prefetching already.

> I don't think you should specify the read/write and locality parameters
> if we don't hand-pick them - right now you're saying the memory will
> only be read and that it has no temporal locality.
>
> I think there should be a empty fallback definition even if the current
> only caller ends up not needing it - not all callers will require it.

It started out that way, but Tom felt that it was better to have a
USE_MEM_PREFETCH because of the branch below...

>> +				/*
>> +				 * Perform memory prefetch of tuple that's three places
>> +				 * ahead of current (which is returned to caller).
>> +				 * Testing shows that this significantly boosts the
>> +				 * performance for TSS_INMEM "forward" callers by
>> +				 * hiding memory latency behind their processing of
>> +				 * returned tuples.
>> +				 */
>> +#ifdef USE_MEM_PREFETCH
>> +				if (readptr->current + 3 < state->memtupcount)
>> +					pg_rfetch(state->memtuples[readptr->current + 3]);
>> +#endif
>
> Hm. Why do we need a branch here? The target of prefetches don't need to
> be valid addresses and adding a bunch of arithmetic and a branch for the
> corner case doesn't sound worthwhile to me.

...which is needed, because I'm still required to not dereference wild
pointers. In other words, although pg_rfetch()/__builtin_prefetch()
does not require that a valid pointer be passed, it is not okay to
read past an array's bounds to read that pointer. The GCC docs are
clear on this -- "Data prefetch does not generate faults if addr is
invalid, but the address expression itself must be valid". Also, for
cases that don't benefit (like datum sorts, which never have a "tuple
proper" -- the above code is for tuplestore, not tuplesort, so you
can't see this), it seemed faster to have the branch than to rely on
__builtin_prefetch() being forgiving about invalid/wild pointers. I
saw a regression without the branch for that case.
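
For reference, the empty-fallback variant being suggested would look
roughly like this (a sketch only, not what the posted patch does):

/* With a no-op fallback that still evaluates the address expression,
 * callers could drop the #ifdef at each call site; the posted patch
 * instead keeps USE_MEM_PREFETCH visible so the bounds-check branch can
 * be skipped entirely where prefetching is unsupported. */
#if defined(USE_MEM_PREFETCH)
#define pg_rfetch(addr)		__builtin_prefetch((addr), 0, 0)
#else
#define pg_rfetch(addr)		((void) (addr))
#endif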

> What worries me about adding explicit prefetching is that it's *highly*
> platform and even micro-architecture dependent. Why is looking three
> elements ahead the right value?

Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT. So, yes,
certainly, more testing is required across a reasonable cross-section
of platforms to justify the patch. But right now, time spent in
_bt_buildadd() is massively reduced with the patch, which makes it
worthwhile in my mind. It's really very effective if you look at the
only part of (say) a CREATE INDEX that is affected one way or another
-- the final fetching of tuples.

BTW, I am very close to posting a patch that makes (say) CREATE INDEX
very fast for external sorts. That uses a memory pool to make final
on-the-fly merge memory access sequential per tape. That's what I'll
do rather than explicitly prefetching for external sorts. It makes a
huge difference on top of the external sort work already posted,

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus  wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.
>
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.

Yep.  It's clear to me we need that in general, not just for sharding.
To me, the key is to make sure there's a way for the cluster-ware to
know about the state transitions.  Currently, when the synchronous
standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".

How is that different from the previous thing?  Just that we'd treat
"lagging" as "down" beyond some threshold?  That doesn't seem like a
mandatory feature.

>> But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.

I don't think so.  If transaction A writes data on X which is
replicated to Y and then commits, a new snapshot which shows A as
committed can't be used on Y until A's changes have been replicated
there.  That could be enforced by having the commit of A wait for
replication, or by having an attempt by a later transaction to use the
snapshot on Y wait until replication completes, or some even more
sophisticated strategy that considers whether the replication backlog
touches the same data that the new transaction will read.  It's
complicated, but it doesn't seem intractable.

>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
> harder to set up, and becomes a SPOF.

I partially agree.  I think it's very important that the GTM is an
optional feature of whatever we end up with, rather than an
indispensable component.  People who don't want it shouldn't have to
pay the price in performance and administrative complexity.  But at
the same time, I think a lot of people will want it, because without
it, the fact that sharding is in use is much less transparent to the
application.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-02 Thread dinesh kumar
On Mon, Aug 31, 2015 at 10:08 PM, Pavel Stehule 
wrote:

>
>
> 2015-09-01 6:59 GMT+02:00 Pavel Stehule :
>
>>
>>
>> 2015-08-31 20:43 GMT+02:00 dinesh kumar :
>>
>>> Hi,
>>>
>>> On Sat, Aug 29, 2015 at 4:22 PM, Pavel Stehule 
>>> wrote:
>>>
 Hi

 I am starting to work review of this patch

 2015-07-13 9:54 GMT+02:00 dinesh kumar :

> Hi All,
>
> Greetings for the day.
>
> Would like to discuss on below feature here.
>
> Feature:
> Having an SQL function, to write messages to log destination.
>
> Justification:
> As of now, we don't have an SQL function to write
> custom/application messages to log destination. We have "RAISE" clause
> which is controlled by
> log_ parameters. If we have an SQL function which works irrespective
> of log settings, that would be a good for many log parsers. What i mean 
> is,
> in DBA point of view, if we route all our native OS stats to log files in 
> a
> proper format, then we can have our log reporting tools to give most
> effective reports. Also, Applications can log their own messages to
> postgres log files, which can be monitored by DBAs too.
>
> Implementation:
> Implemented a new function "pg_report_log" which takes one
> argument as text, and returns void. I took, "LOG" prefix for all the
> reporting messages.I wasn't sure to go with new prefix for this, since
> these are normal LOG messages. Let me know, if i am wrong here.
>
> Here is the attached patch.
>

 This patch is not complex, but the implementation doesn't cover a
 "ereport" well.

 Although this functionality should be replaced by custom function in
 any PL (now or near future), I am not against to have this function in
 core. There are lot of companies with strong resistance against stored
 procedures - and sometimes this functionality can help with SQL debugging.

 Issues:

 1. Support only MESSAGE field in exception - I am expecting to support
 all fields: HINT, DETAIL, ...

>>>
>>> Added these functionalities too.
>>>
>>>
 2. Missing regress tests

>>>
>>> Adding here.
>>>
>>>
 3. the parsing ereport level should be public function shared with
 PLpgSQL and other PL

>>>
>>> Sorry Pavel. I am not getting your point here. Would you give me an
>>> example.
>>>
>>
>> The transformation: text -> error level is common task - and PLpgSQL it
>> does in pl_gram.y. My idea is to add new function to error utils named
>> "parse_error_level" and use it from PLpgSQL and from your code.
>>
>>
>>>
>>>
 4. should be hidestmt mandatory parameter?

>>>
>>> I changed this argument's default value as "true".
>>>
>>>
 5. the function declaration is strange

 postgres=# \sf pg_report_log (text, anyelement, boolean)
 CREATE OR REPLACE FUNCTION pg_catalog.pg_report_log(text, anyelement,
 boolean)
  RETURNS void
  LANGUAGE sql
  STABLE STRICT COST 1
 AS $function$SELECT pg_report_log($1::pg_catalog.text,
 $2::pg_catalog.text, $3::boolean)$function$

 Why polymorphic? It is useless on any modern release


>>> I took quote_ident(anyelement) as referral code, for implementing this.
>>> Could you guide me if I am doing wrong here.
>>>
>>
>> I was wrong - this is ok - it is emulation of force casting to text
>>
>>
>>>
>>>
 postgres=# \sf pg_report_log (text, text, boolean)
 CREATE OR REPLACE FUNCTION pg_catalog.pg_report_log(text, text, boolean)
  RETURNS void
  LANGUAGE internal
  IMMUTABLE STRICT
 AS $function$pg_report_log$function$

 Why stable, why immutable? This function should be volatile.

 Fixed these to volatile.
>>>
>>>
 6. using elog level enum as errcode is wrong idea - errcodes are
 defined in table
 http://www.postgresql.org/docs/9.4/static/errcodes-appendix.html

>>>
>>> You mean, if the elevel is 'ERROR', then we need to allow errcode. Let
>>> me know your inputs.
>>>
>>
>> I was blind, but the code was not good. Yes, error and higher needs error
>> code. From ANSI/SQL anything can has error code "00 is ok", "01 ..
>> warnings" ...
>>
>> There is more possibilities - look to PLpgSQL implementation - it can be
>> optional parameter - it default can use ERRCODE_RAISE_EXCEPTION
>>
>>
>>>
>>> Adding new patch, with the above fixes.
>>>
>>
> the code looks better
>
> my objections:
>
> 1. I prefer default values be NULL
>

Fixed it.


> 2. readability:
>   * parsing error level should be in alone cycle
>   * you don't need to use more ereport calls - one is good enough - look
> on implementation of stmt_raise in PLpgSQL
>

Sorry for my ignorance. I have tried to implement parse_error_level in
pl_gram.y, but was not able to do it. I was not able to parse the given string
with tokens and return the error levels. I looked for referral code, but was
not able to find any. Would you g

Re: [HACKERS] Pg_upgrade remote copy

2015-09-02 Thread Alvaro Herrera
Bruce Momjian wrote:
> On Fri, Aug 28, 2015 at 10:34:38PM -0700, AI Rumman wrote:

> > In pg_upgrade, how about adding a feature to copy data directory over 
> > network.
> > That is, we can run pg_upgrade from our new host, where old host will be a
> > remote machine.

> I think it is much simpler to just copy the old cluster to the remote
> server and run pg_upgrade in --link mode on the remote server.

Couldn't it run pg_basebackup as a first step?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Alvaro Herrera
Michael Paquier wrote:
> Hi all,
> 
> As of now, file access functions in genfile.c can only be used by
> superusers. This proposal is to relax those functions so as
> replication users can use them as well. Here are the functions aimed
> by this patch:
> - pg_stat_file
> - pg_read_binary_file
> - pg_read_file
> - pg_ls_dir
> The main argument for this change is that pg_rewind makes use of those
> functions, forcing users to use a superuser role when rewinding a
> node.

Can this piggyback on Stephen Frost's proposed patch for reserved roles et al?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Jim Nasby

On 9/2/15 2:17 PM, Alvaro Herrera wrote:

Michael Paquier wrote:


I haven't written yet a test case but I think that we could reproduce
that simply by having a relation referenced in the exception block of
a first function, calling a second function that itself raises an
exception, causing the referencing error.


Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this?  Jim?


I had attempted one but the issue is that it's more than just an 
exception block thing. If it was that simple then we'd still get the 
crash without involving pgTap. So I suspect you need to have a named 
cursor in the mix as well.


Let me make another attempt at something simpler.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Tomas Vondra



On 09/02/2015 08:27 PM, Robert Haas wrote:

On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure 
wrote:


This strikes me as a bit of a conflict of interest with FDW which
seems to want to hide the fact that it's foreign; the FDW
implementation makes it's own optimization decisions which might
make sense for single table queries but breaks down in the face of
joins.


+1 to these concerns


Well, I don't think that ALL of the logic should go into the FDW.


Then maybe we shouldn't call this "FDW-based sharding" (or "FDW 
approach" or whatever was used in this thread so far) because that kinda 
implies that the proposal is to build on FDW.


In my mind, FDW is a wonderful tool to integrate PostgreSQL with 
external data sources, and it's nicely shaped for this purpose, which 
implies the abstractions and assumptions in the code.


The truth however is that many current uses of the FDW API are actually 
using it for different purposes because there's no other way to do that, 
not because FDWs are the "right way". And this includes the attempts to 
build sharding on FDW, I think.


Situations like this result in "improvements" of the API that seem to 
improve the API for the second group, but make the life harder for the 
original FDW API audience by making the API needlessly complex. And I 
say "seem to improve" because the second group eventually runs into the 
fundamental abstractions and assumptions the API is based on anyway.


And based on the discussions at pgcon, I think this is the main reason 
why people cringe when they hear "FDW" and "sharding" in the same sentence.


I'm not opposed to reusing the FDW infrastructure, of course.

> In particular, in this example, parallel aggregate needs the same
> query rewrite, so the logic for that should live in core so that
> both parallel and distributed queries get the benefit.

I'm not sure the parallel query is a great example here - maybe I'm 
wrong but I think it's a fairly isolated piece of code, and we have 
pretty clear idea of the two use cases.


I'm sure it's non-trivial to design it well for both cases, but I think 
the questions for FWD/sharding will be much more about abstract concepts 
than particular technical solutions.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-09-02 Thread dinesh kumar
On Tue, Sep 1, 2015 at 10:58 PM, Stefan Kaltenbrunner <
ste...@kaltenbrunner.cc> wrote:

> On 07/25/2015 03:38 AM, dinesh kumar wrote:
> >
> >
> > On Fri, Jul 24, 2015 at 10:22 AM, Robert Haas  wrote:
> >
> > On Thu, Jul 23, 2015 at 8:15 PM, dinesh kumar  wrote:
> > > On Thu, Jul 23, 2015 at 9:21 AM, Robert Haas  wrote:
> > >>
> > >> On Thu, Jul 23, 2015 at 12:19 PM, dinesh kumar
> > >> wrote:
> > >> > Sorry for my  unclear description about the proposal.
> > >> >
> > >> > "WITH PERMISSIVE" is equal to our existing behavior. That is,
> chmod=644
> > >> > on
> > >> > the created files.
> > >> >
> > >> > If User don't specify "PERMISSIVE" as an option, then the
> chmod=600 on
> > >> > created files. In this way, we can restrict the other users
> from reading
> > >> > these files.
> > >>
> > >> There might be some benefit in allowing the user to choose the
> > >> permissions, but (1) I doubt we want to change the default
> behavior
> > >> and (2) providing only two options doesn't seem flexible enough.
> > >>
> > >
> > > Thanks for your inputs Robert.
> > >
> > > 1) IMO, we will keep the exiting behavior as it is.
> > >
> > > 2) As the actual proposal talks about the permissions of
> group/others. So,
> > > we can add few options as below to the WITH clause
> > >
> > > COPY
> > > ..
> > > ..
> > > WITH
> > > [
> > > NO
> > > (READ,WRITE)
> > > PERMISSION TO
> > > (GROUP,OTHERS)
> > > ]
> >
> > If we're going to do anything here, it should use COPY's
> > extensible-options syntax, I think.
> >
> >
> > Thanks Robert. Let me send a patch for this.
>
>
> how are you going to handle windows or unix ACLs here?
> Its permission model is quite different and more powerful than (non-acl
> based) unix in general, handling this in a flexible way might soon get
> very complicated and complex for limited gain...
>
>
Hi Stefan,

I had the same questions too. But, I believe, our initdb works in these
cases, after creating the data cluster. Isn't it?

Regards,
Dinesh
manojadinesh.blogspot.com

>
>
> Stefan
>


Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-09-02 Thread Stefan Kaltenbrunner
On 09/02/2015 10:10 PM, dinesh kumar wrote:
> On Tue, Sep 1, 2015 at 10:58 PM, Stefan Kaltenbrunner
>  wrote:
> 
> On 07/25/2015 03:38 AM, dinesh kumar wrote:
> >
> >
> > On Fri, Jul 24, 2015 at 10:22 AM, Robert Haas  wrote:
> >
> > On Thu, Jul 23, 2015 at 8:15 PM, dinesh kumar  wrote:
> > > On Thu, Jul 23, 2015 at 9:21 AM, Robert Haas  wrote:
> > >>
> > >> On Thu, Jul 23, 2015 at 12:19 PM, dinesh kumar
> > >> wrote:
> > >> > Sorry for my  unclear description about the proposal.
> > >> >
> > >> > "WITH PERMISSIVE" is equal to our existing behavior. That
> is, chmod=644
> > >> > on
> > >> > the created files.
> > >> >
> > >> > If User don't specify "PERMISSIVE" as an option, then the
> chmod=600 on
> > >> > created files. In this way, we can restrict the other
> users from reading
> > >> > these files.
> > >>
> > >> There might be some benefit in allowing the user to choose the
> > >> permissions, but (1) I doubt we want to change the default
> behavior
> > >> and (2) providing only two options doesn't seem flexible
> enough.
> > >>
> > >
> > > Thanks for your inputs Robert.
> > >
> > > 1) IMO, we will keep the exiting behavior as it is.
> > >
> > > 2) As the actual proposal talks about the permissions of
> group/others. So,
> > > we can add few options as below to the WITH clause
> > >
> > > COPY
> > > ..
> > > ..
> > > WITH
> > > [
> > > NO
> > > (READ,WRITE)
> > > PERMISSION TO
> > > (GROUP,OTHERS)
> > > ]
> >
> > If we're going to do anything here, it should use COPY's
> > extensible-options syntax, I think.
> >
> >
> > Thanks Robert. Let me send a patch for this.
> 
> 
> how are you going to handle windows or unix ACLs here?
> Its permission model is quite different and more powerful than (non-acl
> based) unix in general, handling this in a flexible way might soon get
> very complicated and complex for limited gain...
> 
> 
> Hi Stefan,
> 
> I had the same question too. But I believe our initdb already handles these
> cases after creating the data cluster, doesn't it?

maybe - but having a fixed "default" is very different from baking a
classic unix permission concept of user/group/others into actual DDL or
into a COPY option. The proposed syntax might make some sense to an admin
used to a unix-style system, but it is likely utterly incomprehensible to
somebody who is used to (Windows-style) ACLs.

I don't have a good answer on what to do instead at the moment, but I
don't think we should embed traditional/historical unix permission models
in our grammar unless really, really needed...


Stefan




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Julien Rouhaud

On 02/09/2015 15:53, Andres Freund wrote:
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> 
> You also didn't touch
> 
> /*
>  * How many buffers PrefetchBuffer callers should try to stay ahead of
>  * their ReadBuffer calls by.  This is maintained by the assign hook for
>  * effective_io_concurrency.  Zero means "never prefetch".
>  */
> int		target_prefetch_pages = 0;
> 
> which surely doesn't make sense anymore after these changes.
> 
> But do we even need that variable now?
> 

I thought this was related to the effective_io_concurrency GUC
(possibly overridden by the per-tablespace setting), so I didn't make
any change there.

I also just found an issue with my previous patch: the global
effective_io_concurrency GUC was ignored if the tablespace had a
specific seq_page_cost or random_page_cost setting. I have just fixed that
in my local branch.
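
For reference, the intended fallback boils down to something like the sketch
below. This is illustration only, not the actual patch; the cache-entry and
option field names are assumptions modelled on the existing per-tablespace
cost lookups in spccache.c.

/*
 * Sketch only: return the per-tablespace effective_io_concurrency if one
 * was set with ALTER TABLESPACE ... SET, otherwise fall back to the GUC.
 * TableSpaceCacheEntry, get_tablespace() and the opts field are assumed
 * here, mirroring get_tablespace_page_costs().
 */
static int
get_tablespace_io_concurrency(Oid spcid)
{
	TableSpaceCacheEntry *spc = get_tablespace(spcid);

	if (spc->opts == NULL || spc->opts->effective_io_concurrency < 0)
		return effective_io_concurrency;	/* global GUC value */

	return spc->opts->effective_io_concurrency;
}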


>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
>> index dc167f9..57008fc 100644
>> --- a/src/include/utils/guc.h
>> +++ b/src/include/utils/guc.h
>> @@ -26,6 +26,9 @@
>>  #define MAX_KILOBYTES	(INT_MAX / 1024)
>>  #endif
>>  
>> +/* upper limit for effective_io_concurrency */
>> +#define MAX_IO_CONCURRENCY 1000
>> +
>>  /*
>>   * Automatic configuration file name for ALTER SYSTEM.
>>   * This file will be used to store values of configuration parameters
>> @@ -256,6 +259,8 @@ extern int temp_file_limit;
>>  
>>  extern int	num_temp_buffers;
>>  
>> +extern int	effective_io_concurrency;
>> +
> 
> target_prefetch_pages is declared in bufmgr.h - that seems like a
> better place for these.
> 

I was rather sceptical about that too. I'll move these to bufmgr.h.


Regards.


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Robert Haas
On Wed, Aug 26, 2015 at 3:37 AM, Michael Paquier
 wrote:
> On Wed, Aug 26, 2015 at 4:35 PM, Michael Paquier
>  wrote:
>> Only HEAD is impacted, and attached is a patch to fix the problem.
>
> Actually this version is better, I forgot to update a comment.

For so long as this test suite is not run by 'make check-world' or by
the buildfarm, it's likely to keep getting broken, and we're likely to
keep not noticing.  I realize that the decision to exclude this from
'make check-world' was deliberately made for security reasons, but
that doesn't take anything away from the maintenance headache.

Still, that's not a reason not to commit this, so done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 1:18 AM, Amit Langote
 wrote:
> Did you get around to making a patch for this?

I've worked on it inconsistently. I'll pick this up again soon. I may
take the opportunity to talk this over with Andres in person when we
meet at Postgres Open shortly.

-- 
Peter Geoghegan




Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Andrew Dunstan



On 09/02/2015 04:22 PM, Robert Haas wrote:

On Wed, Aug 26, 2015 at 3:37 AM, Michael Paquier
 wrote:

On Wed, Aug 26, 2015 at 4:35 PM, Michael Paquier
 wrote:

Only HEAD is impacted, and attached is a patch to fix the problem.

Actually this version is better, I forgot to update a comment.

For so long as this test suite is not run by 'make check-world' or by
the buildfarm, it's likely to keep getting broken, and we're likely to
keep not noticing.  I realize that the decision to exclude this from
'make check-world' was deliberately made for security reasons, but
that doesn't take anything away from the maintenance headache.

Still, that's not a reason not to commit this, so done.




Tell me what's needed and I'll look at creating a buildfarm test module 
for it.


cheers

andrew





Re: [HACKERS] psql tabcomplete - minor bugfix - tabcomplete for SET ROLE TO xxx

2015-09-02 Thread Pavel Stehule
Hi

2015-09-02 15:23 GMT+02:00 Andres Freund :

> Hi,
>
> On 2015-07-08 14:50:37 +0200, Pavel Stehule wrote:
> > - static const char *const my_list[] =
> > - {"DEFAULT", NULL};
> > + /* fallback for GUC settings */
> >
> > - COMPLETE_WITH_LIST(my_list);
> > + char *vartype = get_vartype(prev2_wd);
> > +
> > + if (strcmp(vartype, "enum") == 0)
> > + {
> > + char querybuf[1024];
> > +
> > + snprintf(querybuf, 1024, Query_for_enum,
> prev2_wd);
> > + COMPLETE_WITH_QUERY(querybuf);
> > + }
>
> Won't that mean that enum variables don't complete to default anymore?
>

no, they still complete to DEFAULT:

#define Query_for_enum \
" SELECT name FROM ( "\
"   SELECT unnest(enumvals) AS name "\
"FROM pg_catalog.pg_settings "\
"   WHERE pg_catalog.lower(name)=pg_catalog.lower('%s') "\
"   UNION SELECT 'DEFAULT' ) ss "\

"  WHERE pg_catalog.substring(name,1,%%d)='%%s'"



>
> > +static char *
> > +get_vartype(const char *varname)
> > +{
> > + PQExpBufferData query_buffer;
> > + char*e_varname;
> > + PGresult *result;
> > + int string_length;
> > + static char resbuf[10];
> > +
> > + initPQExpBuffer(&query_buffer);
> > +
> > + string_length = strlen(varname);
> > + e_varname = pg_malloc(string_length * 2 + 1);
> > + PQescapeString(e_varname, varname, string_length);
>
> Independent of this patch, we really shouldn't do this in several places
> :(
>

fixed

>
> > + appendPQExpBuffer(&query_buffer,
> > + "SELECT vartype FROM pg_settings WHERE
> pg_catalog.lower(name) = pg_catalog.lower('%s')",
> > +  e_varname);
>
> Missing pg_catalog for pg_settings.
>

fixed

>
> Greetings,
>
> Andres Freund
>

I am sending a new version

Regards

Pavel
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
new file mode 100644
index 816deda..61216e1
*** a/src/bin/psql/tab-complete.c
--- b/src/bin/psql/tab-complete.c
*** static const SchemaQuery Query_for_list_
*** 757,762 
--- 757,770 
  "   (SELECT polrelid FROM pg_catalog.pg_policy "\
  " WHERE pg_catalog.quote_ident(polname)='%s')"
  
+ #define Query_for_enum \
+ " SELECT name FROM ( "\
+ "   SELECT unnest(enumvals) AS name "\
+ "FROM pg_catalog.pg_settings "\
+ "   WHERE pg_catalog.lower(name)=pg_catalog.lower('%s') "\
+ "   UNION SELECT 'DEFAULT' ) ss "\
+ "  WHERE pg_catalog.substring(name,1,%%d)='%%s'"
+ 
  /*
   * This is a list of all "things" in Pgsql, which can show up after CREATE or
   * DROP; and there is also a query to get a list of them.
*** static PGresult *exec_query(const char *
*** 847,852 
--- 855,863 
  
  static void get_previous_words(int point, char **previous_words, int nwords);
  
+ static char *get_vartype(const char *varname);
+ static char *escape_string(const char *text);
+ 
  #ifdef NOT_USED
  static char *quote_file_name(char *text, int match_type, char *quote_pointer);
  static char *dequote_file_name(char *text, char quote_char);
*** psql_completion(const char *text, int st
*** 3604,3623 
  
  			COMPLETE_WITH_LIST(my_list);
  		}
- 		else if (pg_strcasecmp(prev2_wd, "IntervalStyle") == 0)
- 		{
- 			static const char *const my_list[] =
- 			{"postgres", "postgres_verbose", "sql_standard", "iso_8601", NULL};
- 
- 			COMPLETE_WITH_LIST(my_list);
- 		}
- 		else if (pg_strcasecmp(prev2_wd, "GEQO") == 0)
- 		{
- 			static const char *const my_list[] =
- 			{"ON", "OFF", "DEFAULT", NULL};
- 
- 			COMPLETE_WITH_LIST(my_list);
- 		}
  		else if (pg_strcasecmp(prev2_wd, "search_path") == 0)
  		{
  			COMPLETE_WITH_QUERY(Query_for_list_of_schemas
--- 3615,3620 
*** psql_completion(const char *text, int st
*** 3627,3636 
  		}
  		else
  		{
! 			static const char *const my_list[] =
! 			{"DEFAULT", NULL};
  
! 			COMPLETE_WITH_LIST(my_list);
  		}
  	}
  
--- 3624,3654 
  		}
  		else
  		{
! 			/* fallback for GUC settings */
  
! 			char *vartype = get_vartype(prev2_wd);
! 
! 			if (strcmp(vartype, "enum") == 0)
! 			{
! char querybuf[1024];
! 
! snprintf(querybuf, 1024, Query_for_enum, prev2_wd);
! COMPLETE_WITH_QUERY(querybuf);
! 			}
! 			else if (strcmp(vartype, "bool") == 0)
! 			{
! static const char *const my_list[] =
! {"ON", "OFF", "DEFAULT", NULL};
! 
! COMPLETE_WITH_LIST(my_list);
! 			}
! 			else
! 			{
! static const char *const my_list[] =
! {"DEFAULT", NULL};
! 
! COMPLETE_WITH_LIST(my_list);
! 			}
  		}
  	}
  
*** _complete_from_query(int is_schema_query
*** 4166,4195 
  		result = NULL;
  
  		/* Set up suitably-escaped copies of textual inputs */
! 		e_text = pg_malloc(string_length * 2 + 1);
! 		PQescapeString(e_text, text, string_length);
  
  		if (completion_

Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 4:50 PM, Andrew Dunstan  wrote:
> Tell me what's needed and I'll look at creating a buildfarm test module for
> it.

You run the tests via: make -C src/test/ssl check

But nota bene security caveats:

commit e39250c644ea7cd3904e4e24570db21a209cf97f
Author: Heikki Linnakangas 
Date:   Tue Dec 9 17:21:18 2014 +0200

Add a regression test suite for SSL support.

It's not run by the global "check" or "installcheck" targets, because the
temporary installation it creates accepts TCP connections from any user
on the same host, which is insecure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Alvaro Herrera
Andrew Dunstan wrote:
> 
> 
> On 09/02/2015 04:22 PM, Robert Haas wrote:

> >For so long as this test suite is not run by 'make check-world' or by
> >the buildfarm, it's likely to keep getting broken, and we're likely to
> >keep not noticing.  I realize that the decision to exclude this from
> >'make check-world' was deliberately made for security reasons, but
> >that doesn't take anything away from the maintenance headache.

> Tell me what's needed and I'll look at creating a buildfarm test module for
> it.

AFAICS it just requires running "make check" in src/test/ssl.  Maybe
this could be part of a generic module that runs "make check" in
specified subdirectories; see
http://www.postgresql.org/message-id/20150813181107.GC5232@alvherre.pgsql
for previous discussion about that.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Robert Haas
On Mon, Aug 31, 2015 at 9:47 PM, Peter Eisentraut  wrote:
> On 8/25/15 11:32 PM, Joe Conway wrote:
>> 1.) pg_controldata() function and pg_controldata view added
>
> I don't think dumping out whatever pg_controldata happens to print as a
> bunch of text fields is very sophisticated.  We have functionality to
> compute with WAL positions, for example, and they won't be of much use
> if this is going to be all text.
>
> Also, the GUC settings tracked in pg_control can already be viewed using
> normal mechanisms, so we don't need a second way to see them.
>
> The fact that some of this is stored in pg_control and some is not is
> really an implementation detail.  We should be thinking of ways to
> expose specific useful information in useful ways, not just dump out
> everything we can find.  Ultimately, I think we would like to move away
> from people parsing textual pg_controldata output, but this proposal is
> not moving very far in that direction.

The nice thing about dumping the information as text is that you can
return every value in the same universal format: text.  There's a lot
to like about that.

But I'm not sure I like the idea of adding a server dependency on the
ability to exec pg_controldata.  That seems like it could be
unreliable at best, and a security vulnerability at worst.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-09-02 Thread Vladimir Borodin
On 25 Aug 2015, at 16:03, Michael Paquier  wrote:

On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:

On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:

To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like

bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}

and then fall back to the normal ReadBuffer() when it's in cache.

Yep, sounds good. Patch with implementation attached.

Patch marked as returned with feedback as input from the author has
been waited for some time now.

Sorry for the delay, I will link it to the current commitfest.

btree_vacuum_v3.patch
Description: Binary data

-- 
Michael

-- 
May the force be with you…
https://simply.name




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra

Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:

On 2 Sep 2015 14:54, "Andres Freund"  wrote:




+ /*--
+  * The user-visible GUC parameter is the number of drives (spindles),
+  * which we need to translate to a number-of-pages-to-prefetch target.
+  * The target value is stashed in *extra and then assigned to the actual
+  * variable by assign_effective_io_concurrency.
+  *
+  * The expected number of prefetch pages needed to keep N drives busy is:
+  *
+  * drives |   I/O requests
+  * ---+
+  *  1 |   1
+  *  2 |   2/1 + 2/2 = 3
+  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
+  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
+  *  n |   n * H(n)


I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?


I can take the blame for this formula.

It's called the "Coupon Collector Problem". If you get a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each?

This computation model assumes we have no information about which
spindle each block will hit. That's basically true for the case of
bitmapheapscan for most cases because the idea of bitmapheapscan is to
be picking a sparse set of blocks and there's no reason the blocks
being read will have any regularity that causes them all to fall on
the same spindles. If in fact you're reading a fairly dense set then
bitmapheapscan probably is a waste of time and simply reading
sequentially would be exactly as fast or even faster.
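
In code terms, the translation described above comes down to roughly the
following. This is a sketch of the arithmetic only (standalone C, not the
actual assign hook):

#include <math.h>

/*
 * Sketch: expected number of outstanding I/O requests needed before each
 * of n_drives spindles has at least one request, i.e. n * H(n) as in the
 * table quoted above.
 */
static int
prefetch_target_for_drives(int n_drives)
{
	double		target = 0.0;
	int			i;

	for (i = 1; i <= n_drives; i++)
		target += (double) n_drives / (double) i;	/* n/1 + n/2 + ... + n/n */

	return (int) rint(target);	/* e.g. 1 -> 1, 2 -> 3, 4 -> 8 */
}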


There are different meanings of "busy". If I get the coupon collector 
problem right (after quickly skimming the wikipedia article today), it 
effectively makes sure that each "spindle" has at least 1 request in the 
queue. Which sucks in practice, because on spinning rust it makes 
queuing (TCQ/NCQ) totally inefficient, and on SSDs it only saturates one 
of the multiple channels.


On spinning drives, it's usually good to keep the iodepth >= 4. For 
example, this 10k Seagate drive [1] can do ~450 random IOPS with 
iodepth=16, while a 10k drive should only be able to do ~150 IOPS (with 
iodepth=1). The other SAS drives behave quite similarly.


[1] 
http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review


On SSDs the good values usually start at 16, depending on the model (and 
controller), and size (large SSDs are basically multiple small ones 
glued together, thus have more channels).


This is why the numbers from the coupon collector formula are way too low 
in many cases. (OTOH this is done per backend, so if there are multiple 
backends doing prefetching ...)




We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I


As I explained, spindles have very little to do with it - you need 
multiple I/O requests per device, to get the benefit. Sure, the DBAs 
should know how many spindles they have and should be able to determine 
optimal IO depth. But we actually say this in the docs:


 A good starting point for this setting is the number of separate
 drives comprising a RAID 0 stripe or RAID 1 mirror being used for
 the database. (For RAID 5 the parity drive should not be counted.)
 However, if the database is often busy with multiple queries
 issued in concurrent sessions, lower values may be sufficient to
 keep the disk array busy. A value higher than needed to keep the
 disks busy will only result in extra CPU overhead.

So we recommend number of drives as a good starting value, and then warn 
against increasing the value further.


Moreover, ISTM it's very unclear what value to use even if you know the 
number of devices and optimal iodepth. Setting (devices * iodepth) 
doesn't really make much sense, because that effectively computes


(devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at 
least one request for each of them", right? I guess we actually want


(devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we need 
track both ndevices and iodepth.


There probably is a value X so that

 X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the docs).
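
To put rough numbers on that (an illustrative calculation only, with
made-up values D = 4 devices and iodepth d = 16):

\[
H(n) = \sum_{k=1}^{n} \frac{1}{k}, \qquad
D\,d \cdot H(D) = 64 \cdot H(4) \approx 133, \qquad
\text{while } 64 \cdot H(64) \approx 303 .
\]

So plugging devices*iodepth straight into the GUC more than doubles the
target, and the X with X * H(X) ~= 133 is only about 33 - neither number
is something a DBA could guess from the hardware alone.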



think the above formula works for essentially random I/O but for
more predictable I/O it might be possible to use a different formula.
But if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that
different access patte

Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Tom Lane
Robert Haas  writes:
> But I'm not sure I like the idea of adding a server dependency on the
> ability to exec pg_controldata.  That seems like it could be
> unreliable at best, and a security vulnerability at worst.

I hadn't been paying attention --- the proposed patch actually depends on
exec'ing pg_controldata?  That's horrid!  There is no expectation that
that's installed.

regards, tom lane




Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Joe Conway

On 09/02/2015 05:25 PM, Tom Lane wrote:
> Robert Haas  writes:
>> But I'm not sure I like the idea of adding a server dependency on
>> the ability to exec pg_controldata.  That seems like it could be 
>> unreliable at best, and a security vulnerability at worst.
> 
> I hadn't been paying attention --- the proposed patch actually
> depends on exec'ing pg_controldata?  That's horrid!  There is no
> expectation that that's installed.

No it doesn't. I'm confused :-/

Joe


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Josh Berkus
On 09/02/2015 02:25 PM, Tomas Vondra wrote:
> 
> As I explained, spindles have very little to do with it - you need
> multiple I/O requests per device, to get the benefit. Sure, the DBAs
> should know how many spindles they have and should be able to determine
> optimal IO depth. But we actually say this in the docs:

My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Alvaro Herrera
Tom Lane wrote:
> Robert Haas  writes:
> > But I'm not sure I like the idea of adding a server dependency on the
> > ability to exec pg_controldata.  That seems like it could be
> > unreliable at best, and a security vulnerability at worst.
> 
> I hadn't been paying attention --- the proposed patch actually depends on
> exec'ing pg_controldata?  That's horrid!  There is no expectation that
> that's installed.

No, it doesn't.  For the pg_controldata output it processes the
pg_control file directly, and for pg_config it relies on compile-time
CPPFLAGS.
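
For context, the direct-read approach amounts to something like the sketch
below (a minimal, hypothetical backend-side illustration, not the proposed
patch; CRC verification and version checks are elided):

#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "access/xlog_internal.h"	/* XLOG_CONTROL_FILE */
#include "catalog/pg_control.h"
#include "storage/fd.h"

/* Sketch only: read the cluster's control file straight from $PGDATA. */
static void
read_control_file(ControlFileData *control)
{
	int			fd;

	fd = OpenTransientFile(XLOG_CONTROL_FILE, O_RDONLY | PG_BINARY, 0);
	if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", XLOG_CONTROL_FILE)));

	if (read(fd, control, sizeof(ControlFileData)) != sizeof(ControlFileData))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not read file \"%s\": %m", XLOG_CONTROL_FILE)));

	CloseTransientFile(fd);

	/* A real implementation would verify control->crc before trusting it. */
}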

I think trying to duplicate the exact strings isn't too nice an
interface.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/02/2015 12:30 PM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus  wrote:
>>> 4. Therefore, I think that we should instead use logical replication,
>>> which might be either synchronous or asynchronous.  When you modify
>>> one copy of the data, that change will then be replicated to all other
>>> nodes.  If you are OK with eventual consistency, this replication can
>>> be asynchronous, and nodes that are off-line will catch up when they
>>> are on-line.  If you are not OK with that, then you must replicate
>>> synchronously to every node before transaction commit; or at least you
>>> must replicate synchronously to every node that is currently on-line.
>>> This presents some challenges: logical decoding currently can't
>>> replicate transactions that are still in process - replication starts
>>> when the transaction commits.  Also, we don't have any way for
>>> synchronous replication to wait for multiple nodes.
>>
>> Well, there is a WIP patch for that, which IMHO would be much improved
>> by having a concrete use-case like this one.  What nobody is working on
>> -- and we've vetoed in the past -- is a way of automatically failing and
>> removing from replication any node which repeatedly fails to sync, which
>> would be a requirement for this model.
> 
> Yep.  It's clear to me we need that in general, not just for sharding.
> To me, the key is to make sure there's a way for the cluster-ware to
> know about the state transitions.  Currently, when the synchronous
> standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

There are many parts of our replication which are still effectively
unmonitorable. For example, there's still no way to tell from the
replica that it's lost contact with the master except by tailing the
log.  If we try to build bigger systems on top of these components,
we'll find that we need to add a lot of instrumentation.

> 
>> You'd also need a way to let the connection nodes know when a replica
>> has fallen behind so that they can be taken out of
>> load-balancing/sharding for read queries.  For the synchronous model,
>> that would be "fallen behind at all"; for asynchronous it would be
>> "fallen more than ### behind".
> 
> How is that different from the previous thing?  Just that we'd treat
> "lagging" as "down" beyond some threshold?  That doesn't seem like a
> mandatory feature.

It's a mandatory feature if you want to load-balance reads.  We have to
know which nodes not to send reads to because they are out of sync.

>> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
> 
> I don't think so.  If transaction A writes data on X which is
> replicated to Y and then commits, a new snapshot which shows A as
> committed can't be used on Y until A's changes have been replicated
> there.  That could be enforced by having the commit of A wait for
> replication, or by having an attempt by a later transaction to use the
> snapshot on Y wait until replication completes, or some even more
> sophisticated strategy that considers whether the replication backlog
> touches the same data that the new transaction will read.  It's
> complicated, but it doesn't seem intractable.

I need to see this on a chalkboard to understand it.

>>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
>> harder to set up, and becomes a SPOF.
> 
> I partially agree.  I think it's very important that the GTM is an
> optional feature of whatever we end up with, rather than an
> indispensable component.  People who don't want it shouldn't have to
> pay the price in performance and administrative complexity.  But at
> the same time, I think a lot of people will want it, because without
> it, the fact that sharding is in use is much less transparent to the
> application.

If it can be optional, then we're pretty close to covering most use
cases with one general infrastructure.  That would be nice.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Josh Berkus
On 09/02/2015 02:34 PM, Alvaro Herrera wrote:
> Tom Lane wrote:
>> Robert Haas  writes:
>>> But I'm not sure I like the idea of adding a server dependency on the
>>> ability to exec pg_controldata.  That seems like it could be
>>> unreliable at best, and a security vulnerability at worst.
>>
>> I hadn't been paying attention --- the proposed patch actually depends on
>> exec'ing pg_controldata?  That's horrid!  There is no expectation that
>> that's installed.
> 
> No, it doesn't.  For the pg_controldata output it processes the
> pg_control file directly, and for pg_config it relies on compile-time
> CPPFLAGS.
> 
> I think trying to duplicate the exact strings isn't too nice an
> interface.

Well, for pg_controldata, no, but what else would you do for pg_config?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-09-02 Thread Greg Stark
On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund  wrote:
> To me it sounds like this shouldn't go through the full ReadBuffer()
> rigamarole. That code is already complex enough, and here it's really
> not needed. I think it'll be much easier to review - and actually faster
> in many cases to simply have something like
>
> bool
> BufferInCache(Relation, ForkNumber, BlockNumber)
> {
> /* XXX: setup tag, hash, partition */
>
> LWLockAcquire(newPartitionLock, LW_SHARED);
> buf_id = BufTableLookup(&newTag, newHash);
> LWLockRelease(newPartitionLock);
> return buf_id != -1;
> }
>
> and then fall back to the normal ReadBuffer() when it's in cache.


I guess I'm missing something, but how is this API useful? As soon as
it's returned, the result might be invalid, since before you actually
make use of the return value another process could come along and read
in and pin the page in question. Is this part of some interlock where
you only have to know that it was unpinned at some point in time since
some other event?
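
For what it's worth, the calling pattern suggested upthread presumably looks
something like this (a hypothetical sketch only, assuming the quoted
BufferInCache() exists; whether the not-in-cache case can race with a
concurrent read-in is exactly the question):

/*
 * Hypothetical sketch: only a resident page can currently be pinned by a
 * scan, so take the cleanup lock only when the block is in shared buffers
 * and skip it otherwise.
 */
if (BufferInCache(reln, MAIN_FORKNUM, blkno))
{
	Buffer		buf;

	buf = ReadBufferExtended(reln, MAIN_FORKNUM, blkno, RBM_NORMAL, NULL);
	LockBufferForCleanup(buf);	/* waits until no other backend holds a pin */
	UnlockReleaseBuffer(buf);
}
/* else: no pin to wait for on this block */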

-- 
greg




Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Alvaro Herrera
Josh Berkus wrote:
> On 09/02/2015 02:34 PM, Alvaro Herrera wrote:

> > I think trying to duplicate the exact strings isn't too nice an
> > interface.
> 
> Well, for pg_controldata, no, but what else would you do for pg_config?

I was primarily looking at pg_controldata, so we agree there.

As for pg_config, I'm confused about its usefulness -- which of these
lines are useful in the SQL interface?  Anyway, I don't see anything
better than a couple of text columns for this case.

BINDIR = /home/alvherre/Code/pgsql/install/master/bin
DOCDIR = /home/alvherre/Code/pgsql/install/master/share/doc
HTMLDIR = /home/alvherre/Code/pgsql/install/master/share/doc
INCLUDEDIR = /home/alvherre/Code/pgsql/install/master/include
PKGINCLUDEDIR = /home/alvherre/Code/pgsql/install/master/include
INCLUDEDIR-SERVER = /home/alvherre/Code/pgsql/install/master/include/server
LIBDIR = /home/alvherre/Code/pgsql/install/master/lib
PKGLIBDIR = /home/alvherre/Code/pgsql/install/master/lib
LOCALEDIR = /home/alvherre/Code/pgsql/install/master/share/locale
MANDIR = /home/alvherre/Code/pgsql/install/master/share/man
SHAREDIR = /home/alvherre/Code/pgsql/install/master/share
SYSCONFDIR = /home/alvherre/Code/pgsql/install/master/etc
PGXS = /home/alvherre/Code/pgsql/install/master/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--enable-debug' '--enable-depend' '--enable-cassert' 
'--enable-nls' '--cache-file=/home/alvherre/tmp/pgconfig.master.cache' 
'--enable-thread-safety' '--with-python' '--with-perl' '--with-tcl' 
'--with-openssl' '--with-libxml' '--with-tclconfig=/usr/lib/tcl8.6' 
'--enable-tap-tests' '--prefix=/pgsql/install/master' '--with-pgport=55432'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g 
-O2
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,--as-needed 
-Wl,-rpath,'/pgsql/install/master/lib',--enable-new-dtags
LDFLAGS_EX = 
LDFLAGS_SL = 
LIBS = -lpgcommon -lpgport -lxml2 -lssl -lcrypto -lz -lreadline -lrt -lcrypt 
-ldl -lm 
VERSION = PostgreSQL 9.6devel


-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



