Re: [HACKERS] Proposal to add connection request Wait-time in PSQL client.
>> in a way, the client terminal won't hang up by throwing
>> *The connection to the server was lost. Attempting reset: Failed.*
>
> The thing is that this just should not be a routine occurrence. It's a
> minor irritation to me when debugging sometimes, but it's not something
> that you should be encountering in production. If you are, changing
> psql's reconnect behaviour is not necessarily the best solution. I'd try
> to work out why you're having so many unrecoverable disconnects and
> restarts and fix that instead.

Yes, I think so. This is a very rare case, and won't have any effect in production. Initially, IMO, I thought there would be no harm if the psql client waited a few seconds until the server had recovered properly and was ready to accept connections. Anyway, I will follow up on your suggestion. Thank you for sharing your concerns and explaining what is actually needed.

Regards,
Amul Sul

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Why there is a union in HeapTupleHeaderData struct
Dear Hackers,

In the fixed part of HeapTupleHeaderData, there is a union named t_choice:

    union
    {
        HeapTupleFields t_heap;
        DatumTupleFields t_datum;
    } t_choice;

I can't figure out why we need t_datum; there is no comment explaining DatumTupleFields.

Regards
Soroosh
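(For reference, the two structs in that union look roughly like this in src/include/access/htup_details.h; quoted approximately, so check the header for the authoritative version:)

    typedef struct HeapTupleFields
    {
        TransactionId t_xmin;       /* inserting xact ID */
        TransactionId t_xmax;       /* deleting or locking xact ID */

        union
        {
            CommandId     t_cid;    /* inserting or deleting command ID, or both */
            TransactionId t_xvac;   /* old-style VACUUM FULL xact ID */
        } t_field3;
    } HeapTupleFields;

    typedef struct DatumTupleFields
    {
        int32       datum_len_;     /* varlena header (do not touch directly!) */
        int32       datum_typmod;   /* -1, or identifier of a record type */
        Oid         datum_typeid;   /* composite type OID, or RECORDOID */
    } DatumTupleFields;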
Re: [HACKERS] [9.3] Automatically updatable views vs writable foreign tables
On 16 May 2013 22:16, Tom Lane wrote:
>> Specifically, for foreign tables
>> information_schema.tables.is_insertable_into and
>> information_schema.columns.is_updatable always say 'NO' even if the
>> foreign table is writable. Fixing that would require new C functions
>> along the same lines as pg_view_is_insertable/updatable(), or those
>> functions could just be renamed and repurposed to do the check for all
>> relation kinds, except those known to be always/never updatable.
>
> I'd vote to rename/extend them to be pg_relation_is_updatable I think.

I remember now just how ugly this code is. The SQL standard has separate concepts of trigger-insertable, trigger-updatable, trigger-deletable, insertable and updatable, but not deletable, for some reason. So we define updatable as supporting both UPDATE and DELETE without triggers. I think what we have implemented is technically correct with regard to the spec in this area, but it is not particularly useful as far as telling whether a relation will actually support a particular query in practice (for example, a simple view on top of a trigger-updatable view is neither updatable nor trigger-updatable).

One place where I think we have diverged from the spec, however, is in information_schema.columns.is_updatable. This should be returning 'YES' if the individual column is updatable, and I see no reason for that to require the relation to support DELETE, which is what we currently do (and always have done).

To implement the information_schema properly per spec, I think we need 3 functions: pg_relation_is_insertable(), pg_relation_is_updatable() and pg_column_is_updatable(), with the latter just checking UPDATE events. It's probably a good idea to add these functions now, since I hope in the future to support more of the SQL spec regarding automatically updatable views, which will involve views for which only a subset of their columns are updatable.

The attached patch does that, and tightens up relation_is_updatable() to support all relation kinds, but it still assumes that if an FDW defines, say, ExecForeignUpdate, then all its foreign tables are updatable. That could be improved upon by defining new FDW API callbacks that select from the remote information_schema, but I'm now starting to doubt whether it's really worth the trouble, given its bizarre definition.

Regards, Dean

writable-fdw-view2.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Hello,

I think the comment just above the HeapTupleFields struct definition has the related details:

/*
 * Heap tuple header. To avoid wasting space, the fields should be
 * laid out in such a way as to avoid structure padding.
 *
 * Datums of composite types (row types) share the same general structure
 * as on-disk tuples, so that the same routines can be used to build and
 * examine them. However the requirements are slightly different: a Datum
 * does not need any transaction visibility information, and it does need
 * a length word and some embedded type information. We can achieve this
 * by overlaying the xmin/cmin/xmax/cmax/xvac fields of a heap tuple
 * with the fields needed in the Datum case. Typically, all tuples built
 * in-memory will be initialized with the Datum fields; but when a tuple is
 * about to be inserted in a table, the transaction fields will be filled,
 * overwriting the datum fields.

Especially the last lines point to the role each of them plays, though I would like to hear additional details from others who might reply.

On Mon, May 20, 2013 at 4:28 PM, Soroosh Sardari wrote:
> Dear Hackers
>
> In the fixed part of HeapTupleHeaderData, there is a union named t_choice:
> union
> {
>     HeapTupleFields t_heap;
>     DatumTupleFields t_datum;
> } t_choice;
>
> I can't figure out why we need t_datum; there is no comment explaining
> DatumTupleFields.
>
> Regards
> Soroosh

-- Amit Langote

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Thanks. If a tuple is constructed in memory, we don't need t_heap. I have another question: how is an in-memory tuple made?

On Mon, May 20, 2013 at 12:46 PM, Amit Langote wrote:
> Hello,
>
> I think the comment just above the HeapTupleFields struct definition
> has the related details.
>
> [...]
>
> Especially the last lines point to the role each of them plays, though
> I would like to hear additional details from others who might reply.

-- Amit Langote
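(For what it's worth, a minimal sketch of how backend code typically builds one, using heap_form_tuple(); the TupleDesc is assumed to come from elsewhere, e.g. RelationGetDescr():)

    /* assumes: TupleDesc tupdesc describing (int4, text), obtained elsewhere */
    Datum       values[2];
    bool        isnull[2];
    HeapTuple   tuple;

    values[0] = Int32GetDatum(42);
    values[1] = CStringGetTextDatum("hello");
    isnull[0] = isnull[1] = false;

    /*
     * heap_form_tuple() fills in the t_datum side of t_choice; the
     * transaction (t_heap) fields are only filled in later, e.g. by
     * heap_insert(), if the tuple is ever written to a table.
     */
    tuple = heap_form_tuple(tupdesc, values, isnull);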
Re: [HACKERS] pgbench vs. SERIALIZABLE
Josh Berkus wrote:
> I recently had a reason to benchmark a database which is default
> SERIALIZABLE mode. I was startled to discover that pgbench is set up to
> abort the client once it hits a serialization failure. You get a bunch
> of these:
>
> Client 7 aborted in state 11: ERROR: could not serialize access due to
> read/write dependencies among transactions
> DETAIL: Reason code: Canceled on identification as a pivot, during write.
> HINT: The transaction might succeed if retried.
> Client 0 aborted in state 11: ERROR: could not serialize access due to
> read/write dependencies among transactions
> DETAIL: Reason code: Canceled on identification as a pivot, during write.
>
> ... which continue until you're down to one client, which then finishes
> out the pgbench run (at very low rates, of course).
>
> The problem is this code here:
>
>     if (commands[st->state]->type == SQL_COMMAND)
>     {
>         /*
>          * Read and discard the query result; note this is not included in
>          * the statement latency numbers.
>          */
>         res = PQgetResult(st->con);
>         switch (PQresultStatus(res))
>         {
>             case PGRES_COMMAND_OK:
>             case PGRES_TUPLES_OK:
>                 break;          /* OK */
>             default:
>                 fprintf(stderr, "Client %d aborted in state %d: %s",
>                         st->id, st->state, PQerrorMessage(st->con));
>                 PQclear(res);
>                 return clientDone(st, false);
>         }
>         PQclear(res);
>         discard_response(st);
>
> The way I read that, if the client encounters any errors at all, it
> gives up and halts that client. This doesn't seem very robust, and it
> certainly won't work with SERIALIZABLE.

Yes, I ran into this and wound up testing with a hacked copy of pgbench. Anyone using SERIALIZABLE transactions needs to be prepared to handle serialization failures. Even REPEATABLE READ can see a lot of serialization failures due to write conflicts in some workloads, and READ COMMITTED can see deadlocks on certain workloads.

> My thinking is that what pgbench should do is:
> * track an error count
> * if it finds an error, don't increment the transaction count, but do
>   increment the error count.
> * then continue to the next transaction.
>
> Does that seem like the right approach?

The main thing is to not consider a transaction complete on a serialization failure. Many frameworks will retry a transaction from the start on a serialization failure. pgbench should do the same. The transaction should be rolled back and retried without incrementing the transaction count. It would not hurt to keep track of the retries, but as long as the transaction is retried until successful you will get meaningful throughput numbers.

Perhaps other errors should also be counted and handled better. For SQLSTATE values *other than* 40001 and 40P01 we should not retry the same transaction, though, because it could well be a case where the same attempt will head-bang forever. Serialization failures, by definition, can be expected to work on retry. (Not always on the *first* retry, but any benchmark tool should keep at it until the transaction succeeds or gets a hard error if you want meaningful numbers, because that's what a software framework should be doing -- and many do.)

I raised this issue near the end of SSI development, but nobody seemed very interested and someone argued that a tool to do that would be good but we shouldn't try to do it in pgbench -- so I let it drop at the time.
-- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
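(To make the proposed behaviour concrete, a rough libpq-level sketch of the retry rule described above; run_script() is a hypothetical helper that replays the transaction script from the top and returns its final result:)

    #include <stdio.h>
    #include <string.h>
    #include "libpq-fe.h"

    extern PGresult *run_script(PGconn *conn);  /* hypothetical helper */

    static void
    run_until_done(PGconn *conn, long *xacts, long *retries)
    {
        for (;;)
        {
            PGresult   *res = run_script(conn);
            ExecStatusType status = PQresultStatus(res);
            const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

            if (status == PGRES_COMMAND_OK || status == PGRES_TUPLES_OK)
            {
                (*xacts)++;         /* completed: count the transaction */
                PQclear(res);
                return;
            }

            if (sqlstate &&
                (strcmp(sqlstate, "40001") == 0 ||  /* serialization_failure */
                 strcmp(sqlstate, "40P01") == 0))   /* deadlock_detected */
            {
                (*retries)++;       /* retry; do NOT count a transaction */
                PQclear(res);
                PQclear(PQexec(conn, "ROLLBACK"));
                continue;
            }

            /* anything else could head-bang forever: treat as a hard error */
            fprintf(stderr, "hard error: %s", PQerrorMessage(conn));
            PQclear(res);
            return;
        }
    }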
Re: [HACKERS] Block write statistics WIP
On 19.05.2013 11:15, Greg Smith wrote:
> I've thought of two paths to get a block write count out of that so far:
>
> -Provide a function to find the Relation from the RelFileNode. There is
> a warning about the perils of assuming you can map that way from a
> buftag value in buf_internals.h though: "Note: the BufferTag data must
> be sufficient to determine where to write the block, without reference
> to pg_class or pg_tablespace entries. It's possible that the backend
> flushing the buffer doesn't even believe the relation is visible yet
> (its xact may have started before the xact that created the rel). The
> storage manager must be able to cope anyway."
>
> -Modify every caller of MarkDirty* to include a relation when that
> information is available. There are about 200 callers of those
> functions around, so that won't be fun. I noted that many of them are
> in the XLog replay functions, which won't have the relation--but those
> aren't important to count anyway.
>
> Neither of these options feels very good to me, so selecting between
> the two feels like picking the lesser of two evils. I'm fine with
> chugging through all of the callers to modify MarkDirty*, but I didn't
> want to do that only to have the whole approach rejected as wrong.
> That's why I stopped here to get some feedback.

Adding a Relation argument to all the Mark* calls seems fine to me. I don't find it ugly at all. The biggest objection would be that if there are extensions out there that call those functions, they would need to be changed, but I think that's fine.

> The way that MarkDirty requires this specific underlying storage
> manager behavior to work properly strikes me as a bit of a layering
> violation too. I'd like the read and write paths to have a similar API,
> but here they don't even operate on the same type of inputs. Addressing
> that is probably harder than just throwing a hack on the existing code
> though.

To be honest, I don't understand what you mean by that.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
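(To illustrate the second option, the signature change being discussed might look something like this; a sketch only, and pgstat_count_buffer_write() is an invented counter, not an existing pgstat call:)

    void
    MarkBufferDirty(Buffer buffer, Relation reln)
    {
        /* ... existing logic that sets BM_DIRTY on the buffer header ... */

        /*
         * reln may be NULL where no Relation is at hand, e.g. in the XLOG
         * replay functions -- which don't need to be counted anyway.
         */
        if (reln != NULL)
            pgstat_count_buffer_write(reln);    /* hypothetical counter */
    }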
Re: [HACKERS] ASYNC Privileges proposal
On 05/20/2013 09:54 AM, Chris Farmiloe wrote:
> Hey all,
>
> I find the current LISTEN / NOTIFY rather limited in the context of
> databases with multiple roles. As it stands it is not possible to restrict
> the use of LISTEN or NOTIFY to specific roles, and therefore notifications
> (and their payloads) cannot really be trusted as coming from any particular
> source.

Just for context, here's the dba.stackexchange.com question where this was previously raised: http://dba.stackexchange.com/q/42496/7788

I agree that this only became relevant since NOTIFY payload support was added, so we can't really just say "nobody's wanted this in 15 years". A couple of years, maybe, but given the lagging client driver support for NOTIFY payloads, maybe not even that.

That said, I personally don't see this as a significant problem, since the approach that worked prior to the introduction of notify payloads -- just using a staging table -- continues to work just fine.

I agree with Tom that much finer-grained control would be required if adding privileges at all were to be particularly useful; such a coarse right as Chris proposes seems unlikely to cover many use cases. We'd likely want control over who can NOTIFY and who can LISTEN to a given channel name. I'm just not convinced it's worth the complexity it'll introduce or any performance consequences.

If it could be done by adding a few hooks that a C extension could use, that might be worthwhile though. The same argument might be applied to advisory locking, after all. I can't really see adding fine-grained rights to _everything_, but hooks might not be harmful.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
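(If the hooks route were taken, something like this hypothetical shape would probably suffice; none of these names exist today:)

    /* hypothetical hook: return false to refuse the operation */
    typedef bool (*async_auth_hook_type) (Oid roleid, const char *channel,
                                          bool is_listen);
    extern PGDLLIMPORT async_auth_hook_type async_auth_hook;

    /* in Async_Notify(), something along these lines: */
    if (async_auth_hook &&
        !async_auth_hook(GetUserId(), channel, false))
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied to notify channel \"%s\"",
                        channel)));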
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Hello,

I think a more appropriate question to ask here would be: at what point (in the life of a typical tuple) does a tuple's header contain t_datum, and at what point t_heap? I would also like that to be answered.

On Mon, May 20, 2013 at 6:06 PM, Soroosh Sardari wrote:
> Thanks. If a tuple is constructed in memory, we don't need t_heap. I
> have another question: how is an in-memory tuple made?
>
> [...]

-- Amit Langote

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On Sun, 2013-05-19 at 09:29 +0300, Heikki Linnakangas wrote:
> On 18.05.2013 03:52, Dickson S. Guedes wrote:
>>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
>>> profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
>>> 32-core Linux machine:
>>>
>>> - 64.09% postgres postgres [.] tas
>>>    - tas
>>>       - 99.83% s_lock
>>>          - 53.22% LWLockAcquire
>>>             + 99.87% GetSnapshotData
>>>          - 46.78% LWLockRelease
>>>               GetSnapshotData
>>>               + GetTransactionSnapshot
>>> + 2.97% postgres postgres [.] tas
>>> + 1.53% postgres libc-2.13.so [.] 0x119873
>>> + 1.44% postgres postgres [.] GetSnapshotData
>>> + 1.29% postgres [kernel.kallsyms] [k] arch_local_irq_enable
>>> + 1.18% postgres postgres [.] AllocSetAlloc
>>> ...
>>
>> I'd like to test this here but I couldn't reproduce that perf output
>> here in a 64-core or 24-core machines, could you post the changes to
>> postgresql.conf and the perf arguments that you used?
>
> Sure, here are the non-default postgresql.conf settings:

Thank you for your information.

> While pgbench was running, I ran this:
>
> perf record -p 6050 -g -e cpu-clock
>
> to connect to one of the backends. (I used cpu-clock, because the
> default cpu-cycles event didn't work on the box)

Hum, I was supposing that I was doing something wrong, but I'm getting the same result as before even using your test case, and my results are still different from yours:

+ 71,27% postgres postgres [.] AtEOXact_Buffers
+ 7,67% postgres postgres [.] AtEOXact_CatCache
+ 6,30% postgres postgres [.] AllocSetCheck
+ 5,34% postgres libc-2.12.so [.] __mcount_internal
+ 2,14% postgres [kernel.kallsyms][k] activate_page

It's a 64-core machine with PGDATA in a SAN.

vendor_id : GenuineIntel
cpu family : 6
model : 47
model name : Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz
stepping : 2
cpu MHz : 1064.000
cache size : 24576 KB
physical id : 3
siblings : 16
core id : 24
cpu cores : 8
apicid : 241
initial apicid : 241
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 4255.87
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:

Would you like me to test some other configuration to try to simulate that expected workload?

[]s
-- Dickson S. Guedes mail/xmpp: gue...@guedesoft.net - skype: guediz http://guedesoft.net - http://www.postgresql.org.br http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
> Hum, I was supposing that I was doing something wrong, but I'm getting
> the same result as before even using your test case, and my results are
> still different from yours:
>
> + 71,27% postgres postgres [.] AtEOXact_Buffers
> + 7,67% postgres postgres [.] AtEOXact_CatCache
> + 6,30% postgres postgres [.] AllocSetCheck
> + 5,34% postgres libc-2.12.so [.] __mcount_internal
> + 2,14% postgres [kernel.kallsyms][k] activate_page

That looks like you have configured with --enable-cassert and probably also --enable-profiling? The former will give completely distorted performance results...

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
I wonder, though, if this question is better asked on pgsql-novice?

On Mon, May 20, 2013 at 9:23 PM, Amit Langote wrote:
> Hello,
>
> I think a more appropriate question to ask here would be: at what point
> (in the life of a typical tuple) does a tuple's header contain t_datum,
> and at what point t_heap? I would also like that to be answered.
>
> [...]

-- Amit Langote

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On Mon, 2013-05-20 at 14:35 +0200, Andres Freund wrote:
> On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
>> Hum, I was supposing that I was doing something wrong, but I'm getting
>> the same result as before even using your test case, and my results
>> are still different from yours:
>>
>> + 71,27% postgres postgres [.] AtEOXact_Buffers
>> + 7,67% postgres postgres [.] AtEOXact_CatCache
>> + 6,30% postgres postgres [.] AllocSetCheck
>> + 5,34% postgres libc-2.12.so [.] __mcount_internal
>> + 2,14% postgres [kernel.kallsyms][k] activate_page
>
> That looks like you have configured with --enable-cassert and probably
> also --enable-profiling? The former will give completely distorted
> performance results...

Ah! Wrong PATH, so wrong binaries. Thanks Andres.

-- Dickson S. Guedes mail/xmpp: gue...@guedesoft.net - skype: guediz http://guedesoft.net - http://www.postgresql.org.br http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Sent from my iPad

On 20-May-2013, at 18:14, Amit Langote wrote:
> I wonder, though, if this question is better asked on pgsql-novice?
>
> [...]

I don't think so, as this involves internal structures. Let's keep it here.

Regards,
Atri

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move unused buffers to freelist
On Thu, May 16, 2013 at 10:18 AM, Amit Kapila wrote:
> Further Performance Data:
>
> Below data is for average 3 runs of 20 minutes
>
> Scale Factor - 1200
> Shared Buffers - 7G

These results are good, but I don't get similar results in my own testing. I ran pgbench tests at a variety of client counts and scale factors, using 30-minute test runs and the following non-default configuration parameters:

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '

Here are the results. The first field in each line is the number of clients. The second number is the scale factor. The numbers after "master" and "patched" are the median of three runs.

01 100 master 1433.297699 patched 1420.306088
01 300 master 1371.286876 patched 1368.910732
01 1000 master 1056.891901 patched 1067.341658
01 3000 master 637.312651 patched 685.205011
08 100 master 10575.017704 patched 11456.043638
08 300 master 9262.601107 patched 9120.925071
08 1000 master 1721.807658 patched 1800.733257
08 3000 master 819.694049 patched 854.333830
32 100 master 26981.677368 patched 27024.507600
32 300 master 14554.870871 patched 14778.285400
32 1000 master 1941.733251 patched 1990.248137
32 3000 master 846.654654 patched 892.554222

And here are the same results for 5-minute, read-only tests:

01 100 master 9361.073952 patched 9049.553997
01 300 master 8640.235680 patched 8646.590739
01 1000 master 8339.364026 patched 8342.799468
01 3000 master 7968.428287 patched 7882.121547
08 100 master 71311.491773 patched 71812.899492
08 300 master 69238.839225 patched 70063.632081
08 1000 master 34794.778567 patched 65998.468775
08 3000 master 60834.509571 patched 61165.998080
32 100 master 203168.264456 patched 205258.283852
32 300 master 199137.276025 patched 200391.633074
32 1000 master 177996.853496 patched 176365.732087
32 3000 master 149891.147442 patched 148683.269107

Something appears to have screwed up my results for 8 clients @ scale factor 1000 on master, but overall, on both the read-only and read-write tests, I'm not seeing anything that resembles the big gains you reported.

Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the PostgreSQL community courtesy of IBM. Fedora 16, Linux kernel 3.2.6.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Heap truncation without AccessExclusiveLock (9.4)
On Fri, May 17, 2013 at 3:38 AM, Heikki Linnakangas wrote: > If we could use the catchup interrupts to speed that up though, that would > be much better. I think vacuum could simply send a catchup interrupt, and > wait until everyone has caught up. That would significantly increase the > traffic of sinval queue and catchup interrupts, compared to what it is > today, but I think it would still be ok. It would still only be a few sinval > messages and catchup interrupts per truncation (ie. per vacuum). Hmm. So your proposal is to only send these sinval messages while a truncation is in progress, not any time the relation is extended? That would certainly be far more appealing from the point of view of not blowing out sinval. It shouldn't be difficult to restrict the set of backends that have to be signaled to those that have the relation open. You could have a special kind of catchup signal that means "catch yourself up, but don't chain" - and send that only to those backends returned by GetConflictingVirtualXIDs. It might be better to disconnect this mechanism from sinval entirely. In other words, just stick a version number on your shared memory data structure and have everyone advertise the last version number they've seen via PGPROC. The sinval message doesn't really seem necessary; it's basically just a no-op message to say "reread shared memory", and a plain old signal can carry that same message more efficiently. One other thought regarding point 6 from your original proposal. If it's true that a scan could hold a pin on a buffer above the current hard watermark, which I think it is, then that means there's a scan in progress which is going to try to fetch pages beyond that also, up to whatever the end of file position was when the scan started. I suppose that heap scans will need to be taught to check some backend-local flag before fetching each new block, to see if a hard watermark change might have intervened. Is that what you have in mind? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Heap truncation without AccessExclusiveLock (9.4)
On 20.05.2013 16:59, Robert Haas wrote:
> On Fri, May 17, 2013 at 3:38 AM, Heikki Linnakangas wrote:
>> If we could use the catchup interrupts to speed that up though, that
>> would be much better. I think vacuum could simply send a catchup
>> interrupt, and wait until everyone has caught up. That would
>> significantly increase the traffic of sinval queue and catchup
>> interrupts, compared to what it is today, but I think it would still
>> be ok. It would still only be a few sinval messages and catchup
>> interrupts per truncation (ie. per vacuum).
>
> Hmm. So your proposal is to only send these sinval messages while a
> truncation is in progress, not any time the relation is extended? That
> would certainly be far more appealing from the point of view of not
> blowing out sinval.

Right.

> It shouldn't be difficult to restrict the set of backends that have to
> be signaled to those that have the relation open. You could have a
> special kind of catchup signal that means "catch yourself up, but
> don't chain"

What does "chain" mean above?

> - and send that only to those backends returned by
> GetConflictingVirtualXIDs.

Yeah, that might be a worthwhile optimization.

> It might be better to disconnect this mechanism from sinval entirely.
> In other words, just stick a version number on your shared memory data
> structure and have everyone advertise the last version number they've
> seen via PGPROC. The sinval message doesn't really seem necessary;
> it's basically just a no-op message to say "reread shared memory", and
> a plain old signal can carry that same message more efficiently.

Hmm. The sinval message makes sure that when a backend locks a relation, it will see the latest value, because of the AcceptInvalidationMessages call in LockRelation. If there is no sinval message, you'd need to always check the shared memory area when you lock a relation.

> One other thought regarding point 6 from your original proposal. If
> it's true that a scan could hold a pin on a buffer above the current
> hard watermark, which I think it is, then that means there's a scan in
> progress which is going to try to fetch pages beyond that also, up to
> whatever the end of file position was when the scan started. I suppose
> that heap scans will need to be taught to check some backend-local
> flag before fetching each new block, to see if a hard watermark change
> might have intervened. Is that what you have in mind?

Yes, a heap scan will need to check the locally cached high-watermark value every time it steps to a new page.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
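(For concreteness, a sketch of the version-number scheme under discussion; all names here are hypothetical:)

    typedef struct TruncationCtlData
    {
        slock_t     lock;
        uint32      version;        /* bumped on every watermark change */
        /* ... per-relation hard/soft watermark entries ... */
    } TruncationCtlData;

    /* reader side, e.g. when a heap scan steps to a new page: */
    SpinLockAcquire(&TruncationCtl->lock);
    if (MyProc->seenTruncationVersion != TruncationCtl->version)
    {
        /* re-read the watermarks into a backend-local cache */
        MyProc->seenTruncationVersion = TruncationCtl->version;
    }
    SpinLockRelease(&TruncationCtl->lock);

    /*
     * Truncating side: bump version, signal the backends that have the
     * relation open, then wait until every PGPROC advertises the new
     * version before physically truncating the file.
     */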
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 13.05.2013 17:21, Merlin Moncure wrote: On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas wrote: The attached patch is still work-in-progress. There needs to be a configure test and fallback to spinlock if a CAS instruction is not available. I used the gcc __sync_val_compare_and_swap() builtin directly, that needs to be abstracted. Also, in the case that the wait queue needs to be manipulated, the code spins on the CAS instruction, but there is no delay mechanism like there is on a regular spinlock; that needs to be added in somehow. These are really interesting results. Why is the CAS method so much faster then TAS? If my theory is right, the steep fall is caused by backends being sometimes context switched while holding the spinlock. The CAS method alleviates that, because the lock is never "held" by anyone. With a spinlock, if a process gets context switched while holding the lock, everyone else will have to wait until the process gets context switched back, and releases the spinlock. With the CAS method, if a process gets switched in the CAS-loop, it doesn't hurt others. With the patch, the CAS-loop still spins, just like a spinlock, when the wait queue has to be manipulated. So when that happens, it will behave more or less like the spinlock implementation. I'm not too concerned about that optimizing that further; sleeping and signaling sleeping backends are heavy-weight operations anyway. Did you see any contention? With the current spinlock implementation? Yeah, per the profile I posted, a lot of CPU time was spent spinning, which is a sign of contention. I didn't see contention with the CAS implemention. Attached is a new version of the patch. I cleaned it up a lot, and added a configure test, and fallback to the good old spinlock implementation if compare-and-swap is not available. Also, if the CAS loops needs to spin, like a spinlock, it now sleeps after a while and updates the "spins_per_delay" like regular spinlock. - Heikki diff --git a/configure b/configure index 98d889a..43faaf8 100755 --- a/configure +++ b/configure @@ -22742,71 +22742,6 @@ fi done -{ $as_echo "$as_me:$LINENO: checking for builtin locking functions" >&5 -$as_echo_n "checking for builtin locking functions... " >&6; } -if test "${pgac_cv_gcc_int_atomics+set}" = set; then - $as_echo_n "(cached) " >&6 -else - cat >conftest.$ac_ext <<_ACEOF -/* confdefs.h. */ -_ACEOF -cat confdefs.h >>conftest.$ac_ext -cat >>conftest.$ac_ext <<_ACEOF -/* end confdefs.h. */ - -int -main () -{ -int lock = 0; - __sync_lock_test_and_set(&lock, 1); - __sync_lock_release(&lock); - ; - return 0; -} -_ACEOF -rm -f conftest.$ac_objext conftest$ac_exeext -if { (ac_try="$ac_link" -case "(($ac_try" in - *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; - *) ac_try_echo=$ac_try;; -esac -eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\"" -$as_echo "$ac_try_echo") >&5 - (eval "$ac_link") 2>conftest.er1 - ac_status=$? - grep -v '^ *+' conftest.er1 >conftest.err - rm -f conftest.er1 - cat conftest.err >&5 - $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5 - (exit $ac_status); } && { - test -z "$ac_c_werror_flag" || - test ! 
-s conftest.err - } && test -s conftest$ac_exeext && { - test "$cross_compiling" = yes || - $as_test_x conftest$ac_exeext - }; then - pgac_cv_gcc_int_atomics="yes" -else - $as_echo "$as_me: failed program was:" >&5 -sed 's/^/| /' conftest.$ac_ext >&5 - - pgac_cv_gcc_int_atomics="no" -fi - -rm -rf conftest.dSYM -rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \ - conftest$ac_exeext conftest.$ac_ext -fi -{ $as_echo "$as_me:$LINENO: result: $pgac_cv_gcc_int_atomics" >&5 -$as_echo "$pgac_cv_gcc_int_atomics" >&6; } -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then - -cat >>confdefs.h <<\_ACEOF -#define HAVE_GCC_INT_ATOMICS 1 -_ACEOF - -fi - # Lastly, restore full LIBS list and check for readline/libedit symbols LIBS="$LIBS_including_readline" @@ -28573,6 +28508,136 @@ _ACEOF fi +# Check for GCC built-in atomic functions +{ $as_echo "$as_me:$LINENO: checking for builtin test-and-set function" >&5 +$as_echo_n "checking for builtin test-and-set function... " >&6; } +if test "${pgac_cv_gcc_int_test_and_set+set}" = set; then + $as_echo_n "(cached) " >&6 +else + cat >conftest.$ac_ext <<_ACEOF +/* confdefs.h. */ +_ACEOF +cat confdefs.h >>conftest.$ac_ext +cat >>conftest.$ac_ext <<_ACEOF +/* end confdefs.h. */ + +int +main () +{ +int lock = 0; + __sync_lock_test_and_set(&lock, 1); + __sync_lock_release(&lock); + ; + return 0; +} +_ACEOF +rm -f conftest.$ac_objext conftest$ac_exeext +if { (ac_try="$ac_link" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\"" +$as_echo "$ac_try_echo") >&5 + (eval "$ac_link") 2>conftest.er1 + ac_status=$? + grep -v '^ *+' conftest.er1 >conftest.err + rm -f conftest.er1 + cat conftest.err >&5 + $
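(Separate from the patch itself, the shared-lock fast path boils down to something like the following sketch; the state packing and EXCLUSIVE_BIT are illustrative, not the patch's actual layout:)

    #define EXCLUSIVE_BIT 0x80000000

    static bool
    try_shared_lock_cas(volatile uint32 *state)
    {
        uint32      oldstate = *state;

        for (;;)
        {
            uint32      current;

            if (oldstate & EXCLUSIVE_BIT)
                return false;       /* exclusively held: take the slow path */

            /* install oldstate+1 iff the word hasn't changed meanwhile */
            current = __sync_val_compare_and_swap(state, oldstate,
                                                  oldstate + 1);
            if (current == oldstate)
                return true;        /* acquired; the lock was never "held" */
            oldstate = current;     /* lost the race: re-evaluate and retry */
        }
    }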
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Hello, Would you guys please trim the quoted text of the emails you're replying to? I understand Gmail is obnoxious w.r.t. quoted text, but this is starting to become excessive. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
On Mon, May 20, 2013 at 8:54 PM, Alvaro Herrera wrote:
> Hello,
>
> Would you guys please trim the quoted text of the emails you're replying
> to? I understand Gmail is obnoxious w.r.t. quoted text, but this is
> starting to become excessive.

Oops, I didn't notice that. Sorry!

Regards,
Atri

-- Regards, Atri l'apprenant

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PGCON meetup FreeNAS/FreeBSD: In Ottawa Tue & Wed.
Hello PostgreSQL Hackers,

I am now in Ottawa; last week we wrapped up the BSDCon, and I was hoping to chat with a few PostgreSQL developers in person about using PostgreSQL in FreeNAS and offering it as an extension to the platform as a plug-in technology. Unfortunately, due to time constraints I cannot attend the entire conference, and I am only in town until Wednesday at noon. I'm hoping there's a good time to talk to a few developers about PostgreSQL + FreeNAS before I have to depart back to the Bay Area.

Some info on me: My name is Alfred Perlstein; I am a FreeBSD developer and the FreeNAS project lead. I am the VP of Software Engineering at iXsystems. I have been a fan of PostgreSQL for many years. In the early 2000s we built a high-speed web-tracking application on top of PostgreSQL and worked closely with the community to shake out performance issues and bugs -- so closely that Tom Lane and Vadim Mikheev had logins on our box. Since that time I have tried to get PostgreSQL into as many places as possible.

Some info on the topics I wanted to briefly discuss:

1) Using PostgreSQL as the config store for FreeNAS. We currently use SQLite; SQLite fits our needs until we get to the point of replication between HA (high availability) units. Then we are forced to manually sync data between configurations. A discussion on how we might do this better using PostgreSQL, while still maintaining our ease of config export (single file) and small footprint, would be interesting.

2) PostgreSQL plugin for FreeNAS. Flip a switch and suddenly your file server is also serving enterprise data. We currently have a plug-in architecture, but would like to discuss the possibility of a tighter integration so that PostgreSQL looks like a more cohesive addition to FreeNAS.

3) Statistics monitoring / EagleEye. In FreeBSD/FreeNAS I have developed a system called EagleEye. EagleEye is a system where all mibs are easily exportable with timestamps in a common format (for now YAML & modified CSV), which is then consumed by a utility that can provide graphs. The entire point of EagleEye is to eventually upstream the modifications to future-proof statistics tracking in the FreeBSD and FreeNAS systems. I have spoken with some Illumos/ZFS developers and they are interested as well.

I think that is all I have; please drop me a note if you'll have some time in Ottawa today, tomorrow, or early Wednesday. I'd love to discuss, and buy some beers for the group.

thank you,
-Alfred Perlstein
VP Software Engineering, iXsystems.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. SERIALIZABLE
On 20.05.2013 14:50, Kevin Grittner wrote:
> I raised this issue near the end of SSI development, but nobody seemed
> very interested and someone argued that a tool to do that would be good
> but we shouldn't try to do it in pgbench -- so I let it drop at the time.

+1 on doing it in pgbench.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. SERIALIZABLE
On Mon, May 20, 2013 at 4:50 AM, Kevin Grittner wrote:
> I raised this issue near the end of SSI development, but nobody
> seemed very interested and someone argued that a tool to do that
> would be good but we shouldn't try to do it in pgbench -- so I let
> it drop at the time.

I think it would be good to do it in pgbench, provided it can be done fairly cleanly. Presumably we would want to repeat all of the ordinary commands in the file, but not any of the backslash set commands that precede any ordinary commands. But what if backslash set commands are sprinkled between ordinary commands?

Cheers, Jeff
Re: [HACKERS] Re: pg_basebackup with -R option and start standby have problems with escaped password
On 17.05.2013 19:03, Boszormenyi Zoltan wrote:

On 2013-05-17 16:05, Heikki Linnakangas wrote:

On 18.02.2013 16:35, Boszormenyi Zoltan wrote:

On 2013-01-29 11:15, Magnus Hagander wrote:

On Thu, Jan 24, 2013 at 7:04 AM, Hari Babu wrote:

On Wed, Jan 23, 2013 11:48 PM, Magnus Hagander wrote:

On Wed, Jan 23, 2013 at 10:18 AM, Hari Babu wrote:

Test scenario to reproduce:
1. Start the server
2. Create the user as follows:
./psql postgres -c "create user user1 superuser login password 'use''1'"
3. Take the backup with the -R option as follows:
./pg_basebackup -D ../../data1 -R -U user1 -W

The following errors occur when the new standby on the backup database starts:

FATAL: could not connect to the primary server: missing "=" after "1'" in connection info string

What does the resulting recovery.conf file look like?

The recovery.conf which is generated is as follows:

standby_mode = 'on'
primary_conninfo = 'user=''user1'' password=''use''1'' port=''5432'' '

I observed that the problem is that while reading primary_conninfo from the recovery.conf file, the function "GUC_scanstr" removes the quotes of the string and also turns each continuous double quote ('') into a single quote ('). When the same connection string is then used to connect to the primary server, the function "conninfo_parse" cannot parse the escaped quotes properly, and that leads to the problem. Please correct me if anything is wrong in my observation.

Well, it's clearly broken at least :O Zoltan, do you have time to look at it?

I won't have time until at least after FOSDEM, unfortunately.

I looked at it shortly. What I tried first is adding another pair of single quotes manually like this:

primary_conninfo = 'user=''user1'' password=''use1'' host=''192.168.1.2'' port=''5432'' sslmode=''disable'' sslcompression=''1'' '

But it doesn't solve the problem either, I got:

FATAL: could not connect to the primary server: missing "=" after "'1'" in connection info string

This worked though:

primary_conninfo = 'user=user1 password=use\'1 host=192.168.1.2 port=5432 sslmode=disable sslcompression=1 '

When I added an elog() to print the conninfo string in libpqrcv_connect(), I saw that the double quotes were properly eliminated by ParseConfigFp() in the first case. So, there is a bug in generating recovery.conf by not double-escaping the values, and another bug in parsing the connection string in libpq when the parameter value starts with a single-quote character.

No, the libpq connection string parser is working as intended. Per the docs on PQconnectdb:

The passed string can be empty to use all default parameters, or it can contain one or more parameter settings separated by whitespace. Each parameter setting is in the form keyword = value. Spaces around the equal sign are optional. To write an empty value, or a value containing spaces, surround it with single quotes, e.g., keyword = 'a value'. Single quotes and backslashes within the value must be escaped with a backslash, i.e., \' and \\.

So, the proper way to escape a quote in a libpq connection string is \', not ''. There are two escaping systems layered on top of each other: the recovery.conf parser's, where you use '', and the libpq system, where you use \'. So we need two different escaping functions in pg_basebackup to get this right.

Why is extending the libpq parser to allow doubling the single quotes not a good solution? It would be consistent in this regard with the recovery.conf parser. From this POV the first patch only needs a little change in the libpq docs.
I guess that would work too, but I don't see any meaningful advantage in doing that.

Apart from that, does it bother anyone else that the primary_conninfo line that pg_basebackup creates is butt-ugly?

primary_conninfo = 'user=''heikki'' host=''localhost'' port=''5432'' sslmode=''prefer'' sslcompression=''1'' '

Not a single bit. It's machine generated and the code is generic. The parser can deal with it.

Oh, ok :-). Well, IMO that's really ugly; the file ought to be human-readable too.

Also, do we really need to include the ssl options when they are the defaults?

I think we already went through this bullet point. IIRC the reasoning and the consensus was that the backup and the generated recovery.conf should also work on another machine with a possibly different set of compilation options for libpq and envvars present on the system.

Ok.

I think the attached patch fixes the original test scenario correctly, without changing libpq's quoting rules, and only quotes when necessary. I didn't do anything about the ssl options. Please take a look.

Your patch should work, too.

Thanks, I've applied that.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
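(For the archives, the two quoting layers involved can be sketched like this; a simplified illustration, not the committed code, using libpq's PQExpBuffer API:)

    /* layer 1: libpq conninfo quoting of a single value */
    appendPQExpBufferChar(&conninfo, '\'');
    for (p = val; *p; p++)
    {
        if (*p == '\\' || *p == '\'')
            appendPQExpBufferChar(&conninfo, '\\');     /* backslash-escape */
        appendPQExpBufferChar(&conninfo, *p);
    }
    appendPQExpBufferChar(&conninfo, '\'');

    /* layer 2: recovery.conf (GUC) quoting of the whole conninfo string */
    appendPQExpBufferChar(&recoveryconf, '\'');
    for (p = conninfo.data; *p; p++)
    {
        if (*p == '\'' || *p == '\\')
            appendPQExpBufferChar(&recoveryconf, *p);   /* double it */
        appendPQExpBufferChar(&recoveryconf, *p);
    }
    appendPQExpBufferChar(&recoveryconf, '\'');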
Re: [HACKERS] Fast promotion failure
On 19.05.2013 17:25, Simon Riggs wrote:
> However, there is a call to RecoveryInProgress() at the top of the main
> loop of the checkpointer, which does explicitly state that it
> "initializes TimeLineID if it's not set yet." The checkpointer makes
> the decision about whether to run a restartpoint or a checkpoint
> directly from the answer to that, modified only in the case of a
> CHECKPOINT_END_OF_RECOVERY. So the appropriate call is made and I don't
> agree with the statement that this "doesn't get executed with fast
> promotion", or the title of the thread.

Hmm, I see.

> So while I believe that the checkpointer might have an incorrect TLI
> and that you've seen a bug, what isn't clear is that the checkpointer
> is the only process that would see an incorrect TLI, or why such
> processes see an incorrect TLI. It seems equally likely at this point
> that the TLI may be set incorrectly somehow and that is why it is
> being read incorrectly.

Yeah. I spent some more time debugging this, and found the culprit. In CreateRestartPoint, we do this:

    /*
     * Update ThisTimeLineID to the timeline we're currently replaying,
     * so that we install any recycled segments on that timeline.
     *
     * There is no guarantee that the WAL segments will be useful on the
     * current timeline; if recovery proceeds to a new timeline right
     * after this, the pre-allocated WAL segments on this timeline will
     * not be used, and will go wasted until recycled on the next
     * restartpoint. We'll live with that.
     */
    (void) GetXLogReplayRecPtr(&ThisTimeLineID);

Here's what happens:

0. System is in recovery.
1. Checkpointer process starts creating a restartpoint.
2. Startup process ends recovery, creates the end-of-recovery record, sets the shared ThisTimeLineID value to 2, and exits.
3. Checkpointer process calls RecoveryInProgress() while performing the restartpoint (there are RecoveryInProgress() calls in many places, e.g. in XLogNeedsFlush()). It sets LocalRecoveryInProgress = false, and initializes ThisTimeLineID to 2.
4. At the end of the restartpoint, the checkpointer process does the above GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites ThisTimeLineID back to 1.
5. Checkpointer calls RecoveryInProgress(), which returns false immediately, as LocalRecoveryInProgress is already cleared.
6. Checkpointer performs the first checkpoint, with ThisTimeLineID set incorrectly to 1.

Not sure what the best fix would be. Perhaps change the code in CreateRestartPoint() to do something like this instead:

    GetXLogReplayRecPtr(&replayTLI);
    if (RecoveryInProgress())
        ThisTimeLineID = replayTLI;

> I see that the comment in InitXLOGAccess() is incorrect
> "ThisTimeLineID doesn't change so we need no lock to copy it", which
> seems worth correcting since there's no need to save time in a once
> only function.

Hmm, the code seems right to me: XLogCtl->ThisTimeLineID indeed doesn't change, once it's set. And InitXLOGAccess() isn't called until it is set. The comment could explain that better, though.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. SERIALIZABLE
> I think it would be good to do it in pgbench, provided it can be done
> fairly cleanly. Presumably we would want to repeat all of the ordinary
> commands in the file, but not any of the backslash set commands that
> precede any ordinary commands. But what if backslash set commands are
> sprinkled between ordinary commands?

As I have spent some time in pgbench recently, ISTM that when restarting the transaction (reset state = 0), you want to keep track of the farthest state that was reached on previous attempts, so as to skip backslash commands for those states and resume executing from where they left off. That would mean one integer for this purpose in the client structure.

-- Fabien

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
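(In code terms, the idea amounts to something like this against pgbench's CState; the "furthest" field is hypothetical:)

    typedef struct
    {
        /* ... existing CState fields ... */
        int         state;          /* command being executed */
        int         furthest;       /* farthest command reached on earlier tries */
    } CState;

    /* on serialization failure: restart the script, but remember progress */
    if (st->state > st->furthest)
        st->furthest = st->state;
    st->state = 0;

    /* while replaying, don't re-run \set-style commands already evaluated */
    if (commands[st->state]->type == META_COMMAND &&
        st->state < st->furthest)
    {
        st->state++;                /* keep the previously computed variables */
        continue;
    }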
Re: [HACKERS] fast promotion and log_checkpoints
On 19.05.2013 17:22, Simon Riggs wrote:
> On 1 May 2013 10:05, Fujii Masao wrote:
>> In HEAD, when the standby is promoted, recovery requests the
>> checkpoint but doesn't wait for its completion. I found the checkpoint
>> starting log message of this checkpoint looks odd as follows:
>>
>> LOG: checkpoint starting:
>>
>> I think something like the following is better.
>>
>> LOG: checkpoint starting: end-of-recovery
>>
>> In 9.2 or before, the "end-of-recovery" part is logged. Even if we
>> changed the behavior of the end-of-recovery checkpoint, I think that
>> it's more intuitive to label it as "end-of-recovery". Thought?
>
> The checkpoint isn't an "end-of-recovery" checkpoint, it's just the
> first checkpoint after the end of recovery. I don't think it should
> say "end-of-recovery".

Agreed.

> The problem is that we've now changed the code to trigger a checkpoint
> in a place that wasn't part of the original design, so the checkpoint
> called at that point isn't supplied with a reason and so has nothing
> to print.
>
> It would be possible to redesign this with a special new reason, or we
> could just use "time" as the reason, or we could just leave it. Doing
> nothing is easy, though so are the others, so we can choose anything
> we want. What do we want it to say?

I'm not sure. Perhaps we should print "(no flags)", so that it wouldn't look like there's something missing in the log message.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
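(E.g., something as small as this in LogCheckpointStart() would do; a sketch only:)

    /* print a placeholder instead of ending the message with a bare colon */
    if (flags == 0)
        ereport(LOG, (errmsg("checkpoint starting: (no flags)")));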
Re: [HACKERS] Fast promotion failure
On 20 May 2013 18:47, Heikki Linnakangas wrote:
> On 19.05.2013 17:25, Simon Riggs wrote:
>> So while I believe that the checkpointer might have an incorrect TLI
>> and that you've seen a bug, what isn't clear is that the checkpointer
>> is the only process that would see an incorrect TLI, or why such
>> processes see an incorrect TLI. It seems equally likely at this point
>> that the TLI may be set incorrectly somehow and that is why it is
>> being read incorrectly.
>
> Yeah. I spent some more time debugging this, and found the culprit. In
> CreateRestartPoint, we do this:
>
>     /*
>      * Update ThisTimeLineID to the timeline we're currently replaying,
>      * so that we install any recycled segments on that timeline.
>      *
>      * There is no guarantee that the WAL segments will be useful on the
>      * current timeline; if recovery proceeds to a new timeline right
>      * after this, the pre-allocated WAL segments on this timeline will
>      * not be used, and will go wasted until recycled on the next
>      * restartpoint. We'll live with that.
>      */
>     (void) GetXLogReplayRecPtr(&ThisTimeLineID);
>
> Here's what happens:
>
> 0. System is in recovery.
> 1. Checkpointer process starts creating a restartpoint.
> 2. Startup process ends recovery, creates the end-of-recovery record,
> sets the shared ThisTimeLineID value to 2, and exits.
> 3. Checkpointer process calls RecoveryInProgress() while performing the
> restartpoint (there are RecoveryInProgress() calls in many places, e.g.
> in XLogNeedsFlush()). It sets LocalRecoveryInProgress = false, and
> initializes ThisTimeLineID to 2.
> 4. At the end of the restartpoint, the checkpointer process does the
> above GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites
> ThisTimeLineID back to 1.
> 5. Checkpointer calls RecoveryInProgress(), which returns false
> immediately, as LocalRecoveryInProgress is already cleared.
> 6. Checkpointer performs the first checkpoint, with ThisTimeLineID set
> incorrectly to 1.

That description looks correct to me from the code.

> Not sure what the best fix would be. Perhaps change the code in
> CreateRestartPoint() to do something like this instead:
>
>     GetXLogReplayRecPtr(&replayTLI);
>     if (RecoveryInProgress())
>         ThisTimeLineID = replayTLI;

Installing a few extra WAL files doesn't seem to be a good reason to mess with such an important variable as ThisTimeLineID, especially since it's such a rare event and hardly worth optimising for.

I would prefer it if we didn't set ThisTimeLineID at all in that way, or anywhere else. If we do, it needs to be done safely, otherwise any caller could make the same mistake. Suggested patch attached.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

avoid_setting_tli.v1.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] FK locking concurrency improvement
As part of 0ac5ad5134f2769ccbaefec73844f8504c4d6182 the permutations in
test/isolation/fk-deadlock2.spec and elsewhere were removed. Is it the
intent that these tests no longer do anything useful? I was expecting a
failure in the test with some work I'm doing and was confused, after a
merge from the upstream 9.3, that the test didn't fail, until I noticed
the test is no longer running the permutations.

FYI, I saw some comments about adding fflush's into isolationtester.c. I
ran into the same problem with debugging tests when they failed/hung in
the middle. A simple "setbuf(stdout, NULL)" at the beginning of main gets
rid of the problem where line buffering becomes block buffering when
redirecting stdout to a file. That block buffering causes problems with
the sequencing of mixed stderr and stdout output, and means you don't see
the last few lines of stdout if the process fails or hangs. The setbuf on
stdout shown above disables buffering of stdout to match the unbuffered
stderr. That way you don't need to fflush after each printf/fprintf.

I'm not sure why the fflush of stderr was added, because stderr isn't
buffered to begin with, so it is unnecessary. The problem was with
stdout. YMMV on Windows, but it might work there too.

- Dan
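Dan's buffering point is easy to demonstrate outside isolationtester. A
minimal sketch (on typical platforms; compile it and run it once on a
terminal and once with stdout redirected to a file, and compare which
lines survive with and without the setbuf call):

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /*
         * stdout is line-buffered on a terminal but block-buffered when
         * redirected to a file; this disables buffering entirely, to
         * match the always-unbuffered stderr.
         */
        setbuf(stdout, NULL);

        printf("step 1 done\n");             /* reaches the file immediately */
        fprintf(stderr, "note on stderr\n"); /* now interleaves correctly */
        printf("step 2 done\n");

        abort(); /* simulate a hung/crashed test: no stdout lines are lost */
    }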
Re: [HACKERS] Proposal to add --single-row to psql
On Sat, May 11, 2013 at 07:22:55PM -0400, Robert Haas wrote:
> On Sat, May 11, 2013 at 12:27 PM, Tom Lane wrote:
>> By the time you've got an expression tree, the problem is mostly solved,
>> at least so far as parser extension is concerned.
>
> Right.
>
>> More years ago than I care to admit, I worked on systems that had
>> run-time-extensible parsers at Hewlett-Packard, so technology for this
>> does exist. But my (vague) memory of those systems is that the parser's
>> language capabilities were more limited than bison's, perhaps only
>> LL(1). Parsing spec-compatible SQL that way might be a challenge.
>
> If I understand bison correctly, it basically looks at the current
> parser state and the next token and decides to either shift that token
> onto the stack or reduce the stack using some rule. If there's no
> matching rule, we error out. If someone wants to inject new rules
> into the grammar, those state tables are all going to change. But if
> we could contrive things so that the state tables are built
> dynamically and can be changed as rules are added and removed, then it
> seems to me that we could let a loadable module supply (or delete)
> whatever grammar rules it likes. Whenever it does this, we recompile
> the grammar on next use (and complain if we get ambiguities). This
> does not sound all that easy to code, but at least in theory it seems
> doable.
>
> We'd also need a way to add keywords. To be quite frank, I think our
> whole approach to keywords is massively screwed up right now. Aside

FYI, one trick that is often used when parsing C code is to allow parsed
code to add to the keyword list, so when the parser goes to get the next
token, it might get a keyword back based on what it has seen already.
This is required to handle C typedefs, and might be a method we could
use.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
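A hedged sketch of the typedef trick Bruce describes (all names here are
invented for illustration): a parser action registers each typedef name,
and the lexer consults that list before deciding which token kind to
return for an identifier:

    #include <string.h>

    #define MAX_TYPEDEFS 128

    /* token kinds; in a bison grammar these would be %token values */
    enum token { TOK_IDENT, TOK_TYPEDEF_NAME };

    static const char *typedef_names[MAX_TYPEDEFS];
    static int n_typedefs = 0;

    /* called from the parser action that reduces "typedef <type> <name>;" */
    static void
    register_typedef(const char *name)
    {
        if (n_typedefs < MAX_TYPEDEFS)
            typedef_names[n_typedefs++] = name;
    }

    /* called by the lexer for every identifier it scans */
    static enum token
    classify_identifier(const char *word)
    {
        for (int i = 0; i < n_typedefs; i++)
            if (strcmp(typedef_names[i], word) == 0)
                return TOK_TYPEDEF_NAME; /* earlier input made this a "keyword" */
        return TOK_IDENT;
    }

    int
    main(void)
    {
        register_typedef("my_type"); /* as if we parsed: typedef int my_type; */
        return classify_identifier("my_type") == TOK_TYPEDEF_NAME ? 0 : 1;
    }

The same feedback loop would, in principle, let previously executed DDL
or a loaded module change how later tokens are classified without
rebuilding the whole grammar.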
Re: [HACKERS] FK locking concurrency improvement
Daniel Wood wrote:
> As part of 0ac5ad5134f2769ccbaefec73844f8504c4d6182 the permutations in
> test/isolation/fk-deadlock2.spec and elsewhere were removed. Is it the
> intent that these tests no longer do anything useful? I was expecting a
> failure in the test with some work I'm doing and was confused, after a
> merge from the upstream 9.3, that the test didn't fail until I noticed
> the test is no longer running the permutations.

No, you're misunderstanding. When the spec file specifies permutations,
then those are exactly the permutations that are run. When the spec does
not list any permutations, then the test driver runs all the possible
permutations.

This was made possible by a change to isolationtester so that an
"impossible" permutation does not cause the test to die; instead it
continues after reporting that the permutation is impossible. That way,
we ensure not only that the listed permutations are running and passing,
but also that the set of permutations that are possible does not change.
(An "impossible" permutation is one that requires running a command in a
session that is in a blocked state, something which is clearly
impossible.) In hindsight, we could have committed this change
separately.

> FYI, I saw some comments about adding fflush's into isolationtester.c.
> I ran into the same problem with debugging tests when they failed/hung
> in the middle. A simple "setbuf(stdout, NULL)" at the beginning of main
> gets rid of the problem where line buffering becomes block buffering
> when redirecting stdout to a file.

Interesting. I'm not sure it's worth messing with this now (given that
the current coding works everywhere), but if there's a strong reason to
do it that way we can certainly change it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
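To illustrate the convention Álvaro describes, here is a skeletal,
made-up spec file (not one from the tree). With the trailing permutation
lines present, exactly those interleavings run; delete them, and
isolationtester generates every possible interleaving, reporting as
invalid any that would require a blocked session to issue a new command:

    setup
    {
      CREATE TABLE parent (i int PRIMARY KEY);
      CREATE TABLE child (i int REFERENCES parent);
    }

    teardown
    {
      DROP TABLE child;
      DROP TABLE parent;
    }

    session "s1"
    setup        { BEGIN; }
    step "s1i"   { INSERT INTO parent VALUES (1); }
    step "s1c"   { COMMIT; }

    session "s2"
    setup        { BEGIN; }
    step "s2i"   { INSERT INTO child VALUES (1); }
    step "s2c"   { COMMIT; }

    # With these lines, only the listed interleavings run; without them,
    # the driver tries all of them and reports the impossible ones.
    permutation "s1i" "s1c" "s2i" "s2c"
    permutation "s1i" "s2i" "s1c" "s2c"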
Re: [HACKERS] Fast promotion failure
On 20.05.2013 22:18, Simon Riggs wrote:
> On 20 May 2013 18:47, Heikki Linnakangas wrote:
>> Not sure what the best fix would be. Perhaps change the code in
>> CreateRestartPoint() to do something like this instead:
>>
>> GetXLogReplayRecPtr(&replayTLI);
>> if (RecoveryInProgress())
>>     ThisTimeLineID = replayTLI;
>
> Installing a few extra WAL files doesn't seem to be a good reason to
> mess with such an important variable as ThisTimeLineID, especially
> since it's such a rare event and hardly worth optimising for.
>
> I would prefer it if we didn't set ThisTimeLineID at all in that way,
> or anywhere else. If we do, it needs to be done safely, otherwise any
> caller could make the same mistake.
>
> Suggested patch attached.

> @@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
>  		 * not be used, and will go wasted until recycled on the next
>  		 * restartpoint. We'll live with that.
>  		 */
> -		(void) GetXLogReplayRecPtr(&ThisTimeLineID);
> -
> -		RemoveOldXlogFiles(_logSegNo, endptr);
> -
> -		/*
> -		 * Make more log segments if needed. (Do this after recycling old
> -		 * log segments, since that may supply some of the needed files.)
> -		 */
>  		PreallocXlogFiles(endptr);
>  	}

That's essentially reverting commit 343ee00b7, resurrecting the bug that
that commit fixed.

> @@ -9098,10 +9092,10 @@ GetXLogReplayRecPtr(TimeLineID *replayTLI)
>  	SpinLockAcquire(&xlogctl->info_lck);
>  	recptr = xlogctl->lastReplayedEndRecPtr;
>  	tli = xlogctl->lastReplayedTLI;
> +	if (replayTLI && xlogctl->SharedRecoveryInProgress)
> +		*replayTLI = tli;
>  	SpinLockRelease(&xlogctl->info_lck);
>
> -	if (replayTLI)
> -		*replayTLI = tli;
>  	return recptr;
>  }

That breaks the only remaining caller that passes a replayTLI pointer,
GetStandbyFlushRecPtr(): *replayTLI points to a local variable in that
function, which is left uninitialized if
!xlogctl->SharedRecoveryInProgress.

- Heikki
Re: [HACKERS] Heap truncation without AccessExclusiveLock (9.4)
On 17.05.2013 12:35, Andres Freund wrote:
> On 2013-05-17 10:45:26 +0300, Heikki Linnakangas wrote:
>> On 16.05.2013 04:15, Andres Freund wrote:
>>> Couldn't we "just" take the extension lock and then walk backwards
>>> from the rechecked end of relation, ConditionalLockBufferForCleanup()
>>> the buffers? For every such locked page we check whether it's still
>>> empty. If we find a page that we couldn't lock, isn't empty, or we
>>> already locked a sufficient number of pages, we truncate.
>>
>> You need an AccessExclusiveLock on the relation to make sure that
>> after you have checked that pages 10-15 are empty, and truncated them
>> away, a backend doesn't come along a few seconds later and try to read
>> page 10 again. There might be an old sequential scan in progress, for
>> example, that thinks that the pages are still there.
>
> But that seems easily enough handled: We know the current page in its
> scan cannot be removed since it's pinned. So make
> heapgettup()/heapgetpage() pass something like RBM_IFEXISTS to
> ReadBuffer, and if the read fails, recheck the length of the relation
> before throwing an error.

Hmm. For the above to work, you'd need to atomically check that the
pages you're truncating away are not pinned, and truncate them. If those
steps are not atomic, a backend might pin a page after you've checked
that it's not pinned, but before you've truncated the underlying file. I
guess that'd be doable; it needs some new infrastructure in the buffer
manager, however.

> There isn't much besides seqscans that can have that behaviour afaics:
>
> - (bitmap)indexscans et al. won't point to completely empty pages
> - there cannot be a concurrent vacuum since we have the appropriate
>   locks
> - if a trigger or something else has a tid referencing a page, there
>   need to be unremovable tuples on it
>
> The only thing that I immediately see are tidscans, which should be
> handleable in a similar manner to seqscans.
>
> Sure, there are some callsites that need to be adapted but it still
> seems noticeably easier than what you proposed upthread.

Yeah. I'll think some more about how the required buffer manager changes
could be done.

- Heikki
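As a concrete reading of Andres's suggestion (RBM_IFEXISTS is
hypothetical; no such mode exists in bufmgr.h), here is a sketch of the
call pattern a sequential scan might use. The scan fields are from the
heap-scan code, but the fragment as a whole is illustrative, not a
working patch:

    /*
     * Hypothetical: ReadBufferExtended() grows an RBM_IFEXISTS mode that
     * returns InvalidBuffer instead of erroring out when the requested
     * block is past the current end of the file.
     */
    buf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, blkno,
                             RBM_IFEXISTS, scan->rs_strategy);
    if (!BufferIsValid(buf))
    {
        /* page vanished: recheck the relation length before complaining */
        BlockNumber nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

        if (blkno >= nblocks)
        {
            /* relation was truncated under us: end the scan early */
            scan->rs_nblocks = nblocks;
            return;
        }
        elog(ERROR, "could not read block %u of relation \"%s\"",
             blkno, RelationGetRelationName(scan->rs_rd));
    }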
Re: [HACKERS] fast promotion and log_checkpoints
On 20 May 2013 20:06, Heikki Linnakangas wrote:
>> It would be possible to redesign this with a special new reason, or we
>> could just use "time" as the reason, or we could just leave it.
>>
>> Doing nothing is easy, though so are the others, so we can choose
>> anything we want. What do we want it to say?
>
> I'm not sure. Perhaps we should print "(no flags)", so that it wouldn't
> look like there's something missing in the log message.

The reason text would still be absent, so it wouldn't really help the
user interpret the log message correctly. I suggest we use
RequestCheckpoint(CHECKPOINT_CAUSE_TIME) instead, since it is literally
time for a checkpoint.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
> Stephen Frost writes:
>> Isn't this the same issue which has prompted multiple people to
>> propose (sometimes with code, as I recall) to rip out our internal
>> spinlock system and replace it with kernel-backed calls which do it
>> better, specifically by dealing with issues like the above? Have you
>> seen those threads in the past? Any thoughts about moving in that
>> direction?
>
> All of the proposals of that sort that I've seen had a flavor of
> "my OS is the only one that matters". While I don't object to
> platform-dependent implementations of spinlocks as such, they're not
> much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
> diff --git a/configure.in b/configure.in
> index 4ea5699..ff8470e 100644
> --- a/configure.in
> +++ b/configure.in
> @@ -1445,17 +1445,6 @@ fi
>  AC_CHECK_FUNCS([strtoll strtoq], [break])
>  AC_CHECK_FUNCS([strtoull strtouq], [break])
>
> -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
> -[AC_TRY_LINK([],
> -  [int lock = 0;
> -   __sync_lock_test_and_set(&lock, 1);
> -   __sync_lock_release(&lock);],
> -  [pgac_cv_gcc_int_atomics="yes"],
> -  [pgac_cv_gcc_int_atomics="no"])])
> -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
> -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have
> -  __sync_lock_test_and_set(int *) and friends.])
> -fi
> -

Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 20.05.2013 23:01, Bruce Momjian wrote:
> On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
>> Stephen Frost writes:
>>> Isn't this the same issue which has prompted multiple people to
>>> propose (sometimes with code, as I recall) to rip out our internal
>>> spinlock system and replace it with kernel-backed calls which do it
>>> better, specifically by dealing with issues like the above? Have you
>>> seen those threads in the past? Any thoughts about moving in that
>>> direction?
>>
>> All of the proposals of that sort that I've seen had a flavor of
>> "my OS is the only one that matters". While I don't object to
>> platform-dependent implementations of spinlocks as such, they're not
>> much of a cure for a generic performance issue.
>
> Uh, is this an x86-64-only optimization? Seems so.

All modern architectures have an atomic compare-and-swap instruction (or
something more powerful that can be used to implement it). That includes
x86, x86-64, ARM, and PowerPC, among others.

There are some differences in how wide a value can be swapped with it;
the 386 only supported 32-bit, until the Pentium, which added a 64-bit
variant. I used the 64-bit variant in the patch, but for lwlocks we
could actually get away with the 32-bit variant if we packed the
booleans and the shared lock counter more tightly.

- Heikki
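For readers unfamiliar with the builtin in question: gcc's
__sync_val_compare_and_swap returns the value it found at the address,
so the usual pattern is a read-modify-CAS retry loop. A sketch of
acquiring a shared lock by bumping a holder count packed into one 32-bit
word (the field layout here is invented for illustration; it is not the
patch's actual layout):

    #include <stdint.h>

    /* invented layout: low 16 bits = shared holders, bit 16 = exclusive */
    #define LW_EXCLUSIVE (1U << 16)

    static int
    lw_try_acquire_shared(volatile uint32_t *state)
    {
        uint32_t old_state, seen;

        do
        {
            old_state = *state;
            if (old_state & LW_EXCLUSIVE)
                return 0;  /* held exclusively; caller must queue/sleep */

            /* try to install old_state + 1; returns the value found */
            seen = __sync_val_compare_and_swap(state, old_state,
                                               old_state + 1);
        } while (seen != old_state);  /* lost the race: reread and retry */

        return 1;
    }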
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 20.05.2013 23:11, Alvaro Herrera wrote:
>> diff --git a/configure.in b/configure.in
>> index 4ea5699..ff8470e 100644
>> --- a/configure.in
>> +++ b/configure.in
>> @@ -1445,17 +1445,6 @@ fi
>>  AC_CHECK_FUNCS([strtoll strtoq], [break])
>>  AC_CHECK_FUNCS([strtoull strtouq], [break])
>>
>> -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
>> -[AC_TRY_LINK([],
>> -  [int lock = 0;
>> -   __sync_lock_test_and_set(&lock, 1);
>> -   __sync_lock_release(&lock);],
>> -  [pgac_cv_gcc_int_atomics="yes"],
>> -  [pgac_cv_gcc_int_atomics="no"])])
>> -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
>> -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have
>> -  __sync_lock_test_and_set(int *) and friends.])
>> -fi
>> -
>
> Careful here --- s_lock.h has some code conditional on
> HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
> removing the definition, unless I'm misreading.

Thanks, good catch. I renamed HAVE_GCC_INT_ATOMICS to
HAVE_GCC_INT_TEST_AND_SET, because "atomics" seems too generic when we
also test for __sync_val_compare_and_swap(p, oldval, newval).

- Heikki
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On Mon, May 20, 2013 at 11:16:41PM +0300, Heikki Linnakangas wrote:
> On 20.05.2013 23:01, Bruce Momjian wrote:
>> On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
>>> Stephen Frost writes:
>>>> Isn't this the same issue which has prompted multiple people to
>>>> propose (sometimes with code, as I recall) to rip out our internal
>>>> spinlock system and replace it with kernel-backed calls which do it
>>>> better, specifically by dealing with issues like the above? Have you
>>>> seen those threads in the past? Any thoughts about moving in that
>>>> direction?
>>>
>>> All of the proposals of that sort that I've seen had a flavor of
>>> "my OS is the only one that matters". While I don't object to
>>> platform-dependent implementations of spinlocks as such, they're not
>>> much of a cure for a generic performance issue.
>>
>> Uh, is this an x86-64-only optimization? Seems so.
>
> All modern architectures have an atomic compare-and-swap instruction
> (or something more powerful that can be used to implement it). That
> includes x86, x86-64, ARM, and PowerPC, among others.
>
> There are some differences in how wide a value can be swapped with it;
> the 386 only supported 32-bit, until the Pentium, which added a 64-bit
> variant. I used the 64-bit variant in the patch, but for lwlocks we
> could actually get away with the 32-bit variant if we packed the
> booleans and the shared lock counter more tightly.

So we are going to need to add this kind of assembly language
optimization for every CPU type?

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Fast promotion failure
On 20 May 2013 20:40, Heikki Linnakangas wrote:
> On 20.05.2013 22:18, Simon Riggs wrote:
>> On 20 May 2013 18:47, Heikki Linnakangas wrote:
>>> Not sure what the best fix would be. Perhaps change the code in
>>> CreateRestartPoint() to do something like this instead:
>>>
>>> GetXLogReplayRecPtr(&replayTLI);
>>> if (RecoveryInProgress())
>>>     ThisTimeLineID = replayTLI;
>>
>> Installing a few extra WAL files doesn't seem to be a good reason to
>> mess with such an important variable as ThisTimeLineID, especially
>> since it's such a rare event and hardly worth optimising for.
>>
>> I would prefer it if we didn't set ThisTimeLineID at all in that way,
>> or anywhere else. If we do, it needs to be done safely, otherwise any
>> caller could make the same mistake.
>>
>> Suggested patch attached.
>
>> @@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
>>  		 * not be used, and will go wasted until recycled on the next
>>  		 * restartpoint. We'll live with that.
>>  		 */
>> -		(void) GetXLogReplayRecPtr(&ThisTimeLineID);
>> -
>> -		RemoveOldXlogFiles(_logSegNo, endptr);
>> -
>> -		/*
>> -		 * Make more log segments if needed. (Do this after recycling old
>> -		 * log segments, since that may supply some of the needed files.)
>> -		 */
>>  		PreallocXlogFiles(endptr);
>>  	}
>
> That's essentially reverting commit 343ee00b7, resurrecting the bug
> that that commit fixed.

Hmm, OK. Then we should be using the correct value, not the one that
patch set. It certainly worked at that time, but not in a principled
way.

When we set the new timeline we should be updating all values that
might be used elsewhere. If we do that, then no matter when or how we
run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

Patch attached.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

set_lastReplayedTLI.v1.patch
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 16.05.2013 01:08, Daniel Farina wrote:
> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile,
>> when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
>> machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>             + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>>
>> So, on this test, a lot of time is wasted spinning on the mutex of
>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>> surprisingly steep drop in performance once you go beyond 29 clients
>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
>> theory is that after that point all the cores are busy, and processes
>> sometimes start to be context-switched while holding the spinlock,
>> which kills performance.
>
> I have; I also used linux perf to come to this conclusion, and my
> determination was similar: a system was undergoing increasingly heavy
> load, in this case with processes >> number of processors. It was also
> a phase-change type of event: at one moment everything would be going
> great, but once a critical threshold was hit, s_lock would consume an
> enormous amount of CPU time. I figured preemption while in the
> spinlock was to blame at the time, given the extreme nature.

Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

 #define TAS(lock) tas(lock)

+#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
+
 static __inline__ int
 tas(volatile slock_t *lock)
 {

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:

/*
 * On Opteron, using a non-locking test before the locking instruction
 * is a huge loss. On EM64T, it appears to be a wash or small loss,
 * so we needn't bother to try to distinguish the sub-architectures.
 */

On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:

    Spinning on volatile read vs. spin on lock attempt

    One common mistake made by developers developing their own spin-wait
    loops is attempting to spin on an atomic instruction instead of
    spinning on a volatile read. Spinning on a dirty read instead of
    attempting to acquire a lock consumes less time and resources. This
    allows an application to only attempt to acquire a lock when it is
    free.

Now, I'm not sure what to do about this. If we put the non-locking test
in there, according to the previous testing that would be a huge loss on
Opterons. Perhaps we should just sleep earlier, i.e. lower
MAX_SPINS_PER_DELAY. That way, even if each TAS_SPIN test is very
expensive, we don't spend too much time spinning if the lock is really
busy, or held by a process that is sleeping.
- Heikki
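The one-liner above is the classic "test and test-and-set" idiom. A
self-contained sketch of its effect (slock_t and tas() are stubbed here
with gcc builtins; the real definitions live in s_lock.h):

    typedef unsigned char slock_t;

    static inline int
    tas(volatile slock_t *lock)
    {
        /* locked xchg: always a write, so the cache line ping-pongs */
        return __sync_lock_test_and_set(lock, 1) != 0;
    }

    static inline void
    spin_acquire(volatile slock_t *lock)
    {
        /* TAS_SPIN: only attempt the atomic op when the lock looks free */
        while (*lock || tas(lock))
        {
            /*
             * Spin on a plain read while the lock is taken: the cache line
             * stays shared in our cache and generates no bus traffic until
             * the holder releases it. A real implementation would also
             * count iterations and eventually sleep (cf.
             * MAX_SPINS_PER_DELAY in s_lock.c).
             */
            while (*lock)
                ;
        }
    }

    static inline void
    spin_release(volatile slock_t *lock)
    {
        __sync_lock_release(lock);  /* write 0 with release semantics */
    }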
Re: [HACKERS] Add more regression tests for dbcommands
Hi,

Attached is an updated patch that does only one CREATE DATABASE and
reuses that database for all other tests. The code coverage with this
patch goes up from 36% to 70%.

--
Robins Tharakan

On 13 May 2013 21:04, Robins Tharakan wrote:
> I believe Tom, Andres and Fabien all have valid points.
>
> Net-net, I believe the tests, although non-negotiable, are not required
> to be in make-check. For now, it's the slow tests that are the pain
> point here, and I would soon try to prune them and commit once again.
>
> Whether it goes in make-check or not is obviously not my discretion,
> but to me their importance is undoubted, since I am sure they increase
> code-coverage. Actually, that is 'how' I create those tests: I look at
> untouched code and create new tests that check untouched code.
>
> Would try to revert with a faster script (preferably with minimal
> CREATE/DROP).
>
> --
> Robins Tharakan
>
> On 13 May 2013 20:30, Andres Freund wrote:
>> On 2013-05-13 16:52:08 +0200, Fabien COELHO wrote:
>>> Hello,
>>>
>>>>> Would you be okay if there is one/a few effective create/drop (some
>>>>> tests check that the create or drop fails, e.g. depending on
>>>>> permissions, which ISTM is not tested anywhere else), so that tests
>>>>> for various ALTER DATABASE commands are combined together onto
>>>>> these databases?
>>>>
>>>> TBH, I do not see that such tests are worth adding, if they are
>>>> going to significantly slow down the core regression tests. Those
>>>> tests are run probably hundreds of times a day, in aggregate across
>>>> all Postgres developers. Adding ten seconds or whatever this would
>>>> add is a major cost, while the benefit appears trivial.
>>>>
>>>> We could consider adding expensive low-value tests like these to
>>>> some alternate regression target that's only exercised by buildfarm
>>>> members, perhaps. But I think there's probably a point of
>>>> diminishing returns even in that context.
>>>
>>> I'm not sure that the tests are "low value", because a commit that
>>> would generate a failure on a permission-check test would be a
>>> potential security issue for Pg.
>>>
>>> As for the cost, if the proposed tests are indeed too costly, which
>>> is not necessarily the case from what I have seen, I do not think
>>> that it would be a great problem to have two sets of tests, one a
>>> superset of the other, with some convention.
>>
>> Well, tests like permission tests aren't the expensive part. The
>> actual CREATE/DROP DATABASE you do is. The latter is essentially
>> already tested by the buildfarm.
>> So, trimming the patch to do only the fast stuff should be less
>> controversial?
>>
>> Greetings,
>>
>> Andres Freund
>>
>> --
>> Andres Freund                 http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Training & Services

regress_db_v3.patch
[HACKERS] Removal of pageinspect--1.0.sql
Hi all,

The contrib module pageinspect has been upgraded to 1.1, but
pageinspect--1.0.sql is still present in the source tree. Shouldn't it
be removed? Please find a patch attached.

Thanks
--
Michael

20130521_pageinspect10_removal.patch
Re: [HACKERS] Proposal to add --single-row to psql
I have actually been working on the task discussed in this thread, most
relevant parts quoted below, for a while now, and hope to have something
concrete that you can use by the end of this summer.

My in-development Muldis D language is homoiconic as a core feature: its
source code is data to it, and the native syntax of the language
resembles an abstract syntax tree. This tree of nodes primarily defines
the behavior of the code, but it also supports arbitrary metadata per
node, which for example can be used to preserve concrete syntax for any
programming language that can be parsed into Muldis D nodes or,
conversely, generated from them.

For example, you can have one type of node defining a function, with its
details defined by its attributes or child nodes, such as its result
type, its parameters, whether it is declared associative/commutative/etc
or not, and the expression(s) defining its body. Another type of node
defines a call to a function, another type defines a text literal,
another a relation literal, and so on. Conversely, a node can define a
schema or package, etc. Example metadata, which is also structured,
could include line numbers, code comments, whitespace, exact numeric
literal formats, quoting formats, or exact keyword choices. Using these
node data types, we have a fairly normalized representation of any
source code (or data) that is easy to introspect or transform with code.

A key design of Muldis D is the clear separation of syntax and behavior.
A parser is just a function that takes (typically) a character string as
input and produces a node (tree) as output. A compiler is just a
function or operation that takes a node (tree) as input and produces
machine code. An optimizer can be a function that derives one node tree
from another, either as its own operation or, typically, as part of the
compiler stage. A compiler or optimizer generally can trivially ignore
the node metadata, but a code generator or debugger can use it; metadata
can be stripped without affecting behavior. The canonical tree form can
also easily be mapped losslessly to relational forms, such as typical
information schemas have.

Practically all behavior is defined in terms of generic type and routine
definitions. Practically all system-defined types and routines are
defined in one or more libraries/modules/packages that have the same
format as those users would write, like extensions. So all the
relational operators have the same syntax as, say, string or numeric or
array operators.

I envision that the most effective way to use Muldis D to handle an
arbitrary programming language, including the native SQL syntax+behavior
of each SQL DBMS, is to have a Muldis D library+parser pair for it. For
example, to represent PostgreSQL 9.2 most directly, we have a library
with an explicitly defined type and routine for every built-in that Pg
9.2 has, and we also have a parser function that takes the SQL syntax
that Pg 9.2 understands and produces a Muldis D node tree consisting of
calls to the routines of that library or value selectors of the types in
that library (things like SELECT and INSERT etc. are each mapped to a
routine too). That way, even with a standard node format that isn't
specific to a particular language or version, the code for parsing Pg
9.2 SQL has the minimal amount of indirection to deal with, as each
syntax element would have a direct library-call counterpart. Similarly,
the existing Pg 9.2 SQL compiler would have the least indirection to
take said nodes and execute them.
(The library would be named e.g. "Postgres_9_2", which makes it easier,
say, to also have one side-by-side for other versions, shims, or legacy
code you want to more easily support compatibility with.)

Where one decides to do cross-compilation, say to make Oracle SQL run on
Pg, that could be done as simply as defining a library for Oracle SQL
with the routines+types that it has, and then mapping it to a Pg one
just in terms of shim calls (which the compiler can optimize away as
applicable), and so parsers or compilers never necessarily have to deal
with behavior compatibility issues.

I am presently working out the exact set of such language nodes, and
making a reference implementation which is mostly self-defined and would
compile to Perl code, and hope to have the first Muldis D executable by
the end of this summer.

I am confident that an adaptation of this design into C or whatever
would serve Postgres greatly in letting it effectively parse multiple
languages, anywhere from application programming languages to multiple
SQL dialects or versions. Even if you don't decide to use my design
specifically (which is open source, and you can influence it), I think
you should find some of the general design principles I stated above to
be useful: representing behavior as libraries, and the AST being
flexible enough for any concrete language without being tied to any one
of them.
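A hedged sketch of the kind of uniform node representation being
described (all names here are invented; per the above, Muldis D's actual
node set is still being worked out), where behavior lives in the
children and concrete-syntax details live in ignorable metadata:

    typedef enum NodeKind
    {
        NODE_FUNC_DEF,      /* function definition: result type, params, body */
        NODE_FUNC_CALL,     /* call to a library-defined routine */
        NODE_TEXT_LIT,      /* text literal value selector */
        NODE_RELATION_LIT   /* relation literal value selector */
    } NodeKind;

    typedef struct NodeMeta
    {
        int         line_no;        /* source position */
        const char *comment;        /* preserved code comment, if any */
        const char *concrete_form;  /* exact original spelling/quoting */
    } NodeMeta;

    typedef struct Node
    {
        NodeKind      kind;
        const char   *name;         /* e.g. the routine being called */
        struct Node **children;     /* operands, parameters, body, ... */
        int           nchildren;
        NodeMeta     *meta;         /* strippable without changing behavior */
    } Node;

    /*
     * A parser is then just: Node *parse(const char *source_text);
     * a code generator walks the tree and may consult node->meta, while
     * a compiler or optimizer ignores node->meta entirely.
     */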
Re: [HACKERS] [PATCH] Correct release notes about DROP TABLE IF EXISTS and add, link.
Thanks, applied.

---

On Fri, May 17, 2013 at 03:12:21PM -0400, Joe Abbate wrote:
> Small release notes correction attached.
>
> Joe
>
> From 330f5af36ffdba8930ea2da8146e8f17e1ec8a68 Mon Sep 17 00:00:00 2001
> From: Joe Abbate
> Date: Fri, 17 May 2013 14:59:03 -0400
> Subject: [PATCH] Correct release notes about DROP TABLE IF EXISTS and add
>  link.
>
> ---
>  doc/src/sgml/release-9.3.sgml | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/doc/src/sgml/release-9.3.sgml b/doc/src/sgml/release-9.3.sgml
> index 519ea49..24e037b 100644
> --- a/doc/src/sgml/release-9.3.sgml
> +++ b/doc/src/sgml/release-9.3.sgml
> @@ -740,8 +740,9 @@
>
>
>
> -Allow DROP TABLE IF NOT EXISTS to succeed when a
> -non-existent schema is specified in the table name (Bruce Momjian)
> +Allow DROP TABLE IF
> +EXISTS to succeed when a non-existent schema is specified
> +in the table name (Bruce Momjian)
>
>
>
> --
> 1.7.10.4

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] [PATCH] Correct release notes about DROP TABLE IF EXISTS and add, link.
On 05/20/2013 10:38 PM, Bruce Momjian wrote:
> -Allow DROP TABLE IF NOT EXISTS to succeed when a
> -non-existent schema is specified in the table name (Bruce Momjian)

*snort*

This would be a rather pointless command!

cheers

andrew
Re: [HACKERS] [PATCH] Correct release notes about DROP TABLE IF EXISTS and add, link.
On Mon, May 20, 2013 at 10:55:19PM -0400, Andrew Dunstan wrote:
> On 05/20/2013 10:38 PM, Bruce Momjian wrote:
>> -Allow DROP TABLE IF NOT EXISTS to succeed when a
>> -non-existent schema is specified in the table name (Bruce Momjian)
>
> *snort*
>
> This would be a rather pointless command!

Yeah. :-) The patch added markup, and fixed the command text.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Fast promotion failure
On 21.05.2013 00:00, Simon Riggs wrote:
> When we set the new timeline we should be updating all values that
> might be used elsewhere. If we do that, then no matter when or how we
> run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -5838,8 +5838,16 @@ StartupXLOG(void)
>  	}
>
>  	/* Save the selected TimeLineID in shared memory, too */
> -	XLogCtl->ThisTimeLineID = ThisTimeLineID;
> -	XLogCtl->PrevTimeLineID = PrevTimeLineID;
> +	{
> +		/* use volatile pointer to prevent code rearrangement */
> +		volatile XLogCtlData *xlogctl = XLogCtl;
> +
> +		SpinLockAcquire(&xlogctl->info_lck);
> +		XLogCtl->ThisTimeLineID = ThisTimeLineID;
> +		XLogCtl->lastReplayedTLI = ThisTimeLineID;
> +		XLogCtl->PrevTimeLineID = PrevTimeLineID;
> +		SpinLockRelease(&xlogctl->info_lck);
> +	}

Hmm. lastReplayedTLI is supposed to be the timeline of the last record
that was replayed, i.e. the timeline corresponding to
lastReplayedEndRecPtr. I think it would work, but it feels like it could
be an easy source of bugs in the future.

It might be a good idea to change walsender to not store that in
ThisTimeLineID, though. It used to make sense for ThisTimeLineID to
track the current recovery timeline in 9.2 and below, but now that
walsender can be sending any older timeline, using ThisTimeLineID to
store the latest one seems confusing.

- Heikki