Re: [HACKERS] Proposal to add connection request Wait-time in PSQL client.
>> in a way, the client terminal won't hang up by throwing
>> *The connection to the server was lost. Attempting reset: Failed.*
>
> The thing is that this just should not be a routine occurrence. It's a
> minor irritation to me when debugging sometimes, but it's not something
> that you should be encountering in production. If you are, changing
> psql's reconnect behaviour is not necessarily the best solution. I'd try
> to work out why you're having so many unrecoverable disconnects and
> restarts and fix that instead.

Yes, I think so. This is a very rare case, and won't have any effect in production. Initially, IMO, I thought there would be no harm if the psql client waited a few seconds until the server had recovered properly and was ready to accept connections. Anyway, I will follow up on your suggestion. Thank you for sharing your concerns and explaining what is actually needed.

Regards,
Amul Sul

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Why there is a union in HeapTupleHeaderData struct
Dear Hackers,

In the fixed part of HeapTupleHeaderData, there is a union named t_choice:

    union
    {
        HeapTupleFields t_heap;
        DatumTupleFields t_datum;
    } t_choice;

I can't figure out why we need t_datum; there is no comment explaining DatumTupleFields.

Regards
Soroosh
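(For reference, the two structs in that union look roughly like this in src/include/access/htup_details.h; quoted approximately, so check the header for the authoritative version:)

    typedef struct HeapTupleFields
    {
        TransactionId t_xmin;       /* inserting xact ID */
        TransactionId t_xmax;       /* deleting or locking xact ID */

        union
        {
            CommandId     t_cid;    /* inserting or deleting command ID, or both */
            TransactionId t_xvac;   /* old-style VACUUM FULL xact ID */
        } t_field3;
    } HeapTupleFields;

    typedef struct DatumTupleFields
    {
        int32       datum_len_;     /* varlena header (do not touch directly!) */
        int32       datum_typmod;   /* -1, or identifier of a record type */
        Oid         datum_typeid;   /* composite type OID, or RECORDOID */
    } DatumTupleFields;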
Re: [HACKERS] [9.3] Automatically updatable views vs writable foreign tables
On 16 May 2013 22:16, Tom Lane wrote:
>> Specifically, for foreign tables
>> information_schema.tables.is_insertable_into and
>> information_schema.columns.is_updatable always say 'NO' even if the
>> foreign table is writable. Fixing that would require new C functions
>> along the same lines as pg_view_is_insertable/updatable(), or those
>> functions could just be renamed and repurposed to do the check for all
>> relation kinds, except those known to be always/never updatable.
>
> I'd vote to rename/extend them to be pg_relation_is_updatable I think.

I remember now just how ugly this code is. The SQL standard has separate concepts of trigger-insertable, trigger-updatable, trigger-deletable, insertable and updatable, but not deletable, for some reason. So we define updatable as supporting both UPDATE and DELETE without triggers. I think what we have implemented is technically correct with regard to the spec in this area, but it is not particularly useful as far as telling whether a relation will actually support a particular query in practice (for example, a simple view on top of a trigger-updatable view is neither updatable nor trigger-updatable).

One place where I think we have diverged from the spec, however, is in information_schema.columns.is_updatable. This should be returning 'YES' if the individual column is updatable, and I see no reason for that to require the relation to support DELETE, which is what we currently do (and always have done).

To implement the information_schema properly per spec, I think we need 3 functions: pg_relation_is_insertable(), pg_relation_is_updatable() and pg_column_is_updatable(), with the latter just checking UPDATE events. It's probably a good idea to add these functions now, since I hope in the future to support more of the SQL spec regarding automatically updatable views, which will involve views for which only a subset of their columns are updatable.

The attached patch does that, and tightens up relation_is_updatable() to support all relation kinds, but it still assumes that if an FDW defines, say, ExecForeignUpdate, then all its foreign tables are updatable. That could be improved upon by defining new FDW API callbacks that select from the remote information_schema, but I'm now starting to doubt whether it's really worth the trouble, given its bizarre definition.

Regards, Dean

writable-fdw-view2.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Hello,

I think the comment just above the HeapTupleFields struct definition has the related details:

/*
 * Heap tuple header. To avoid wasting space, the fields should be
 * laid out in such a way as to avoid structure padding.
 *
 * Datums of composite types (row types) share the same general structure
 * as on-disk tuples, so that the same routines can be used to build and
 * examine them. However the requirements are slightly different: a Datum
 * does not need any transaction visibility information, and it does need
 * a length word and some embedded type information. We can achieve this
 * by overlaying the xmin/cmin/xmax/cmax/xvac fields of a heap tuple
 * with the fields needed in the Datum case. Typically, all tuples built
 * in-memory will be initialized with the Datum fields; but when a tuple is
 * about to be inserted in a table, the transaction fields will be filled,
 * overwriting the datum fields.

Especially the last lines point to the role each of them plays, though I would like to hear additional details from others who might reply.

On Mon, May 20, 2013 at 4:28 PM, Soroosh Sardari wrote:
> Dear Hackers
>
> In the fixed part of HeapTupleHeaderData, there is a union named t_choice:
> union
> {
>     HeapTupleFields t_heap;
>     DatumTupleFields t_datum;
> } t_choice;
>
> I can't figure out why we need t_datum; there is no comment explaining
> DatumTupleFields.
>
> Regards
> Soroosh

-- Amit Langote

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Thanks. If a tuple is constructed in memory, we don't need t_heap. I have another question: how is an in-memory tuple made?

On Mon, May 20, 2013 at 12:46 PM, Amit Langote wrote:
> Hello,
>
> I think the comment just above the HeapTupleFields struct definition
> has the related details.
>
> [...]
>
> Especially the last lines point to the role each of them plays, though
> I would like to hear additional details from others who might reply.

-- Amit Langote
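(For what it's worth, a minimal sketch of how backend code typically builds one, using heap_form_tuple(); the TupleDesc is assumed to come from elsewhere, e.g. RelationGetDescr():)

    /* assumes: TupleDesc tupdesc describing (int4, text), obtained elsewhere */
    Datum       values[2];
    bool        isnull[2];
    HeapTuple   tuple;

    values[0] = Int32GetDatum(42);
    values[1] = CStringGetTextDatum("hello");
    isnull[0] = isnull[1] = false;

    /*
     * heap_form_tuple() fills in the t_datum side of t_choice; the
     * transaction (t_heap) fields are only filled in later, e.g. by
     * heap_insert(), if the tuple is ever written to a table.
     */
    tuple = heap_form_tuple(tupdesc, values, isnull);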
Re: [HACKERS] pgbench vs. SERIALIZABLE
Josh Berkus wrote:
> I recently had a reason to benchmark a database which is default
> SERIALIZABLE mode. I was startled to discover that pgbench is set up to
> abort the client once it hits a serialization failure. You get a bunch
> of these:
>
> Client 7 aborted in state 11: ERROR: could not serialize access due to
> read/write dependencies among transactions
> DETAIL: Reason code: Canceled on identification as a pivot, during write.
> HINT: The transaction might succeed if retried.
> Client 0 aborted in state 11: ERROR: could not serialize access due to
> read/write dependencies among transactions
> DETAIL: Reason code: Canceled on identification as a pivot, during write.
>
> ... which continue until you're down to one client, which then finishes
> out the pgbench run (at very low rates, of course).
>
> The problem is this code here:
>
>     if (commands[st->state]->type == SQL_COMMAND)
>     {
>         /*
>          * Read and discard the query result; note this is not included in
>          * the statement latency numbers.
>          */
>         res = PQgetResult(st->con);
>         switch (PQresultStatus(res))
>         {
>             case PGRES_COMMAND_OK:
>             case PGRES_TUPLES_OK:
>                 break;          /* OK */
>             default:
>                 fprintf(stderr, "Client %d aborted in state %d: %s",
>                         st->id, st->state, PQerrorMessage(st->con));
>                 PQclear(res);
>                 return clientDone(st, false);
>         }
>         PQclear(res);
>         discard_response(st);
>
> The way I read that, if the client encounters any errors at all, it
> gives up and halts that client. This doesn't seem very robust, and it
> certainly won't work with SERIALIZABLE.

Yes, I ran into this and wound up testing with a hacked copy of pgbench. Anyone using SERIALIZABLE transactions needs to be prepared to handle serialization failures. Even REPEATABLE READ can see a lot of serialization failures due to write conflicts in some workloads, and READ COMMITTED can see deadlocks on certain workloads.

> My thinking is that what pgbench should do is:
> * track an error count
> * if it finds an error, don't increment the transaction count, but do
>   increment the error count.
> * then continue to the next transaction.
>
> Does that seem like the right approach?

The main thing is to not consider a transaction complete on a serialization failure. Many frameworks will retry a transaction from the start on a serialization failure. pgbench should do the same. The transaction should be rolled back and retried without incrementing the transaction count. It would not hurt to keep track of the retries, but as long as the transaction is retried until successful you will get meaningful throughput numbers.

Perhaps other errors should also be counted and handled better. For SQLSTATE values *other than* 40001 and 40P01 we should not retry the same transaction, though, because it could well be a case where the same attempt will head-bang forever. Serialization failures, by definition, can be expected to work on retry. (Not always on the *first* retry, but any benchmark tool should keep at it until the transaction succeeds or gets a hard error if you want meaningful numbers, because that's what a software framework should be doing -- and many do.)

I raised this issue near the end of SSI development, but nobody seemed very interested and someone argued that a tool to do that would be good but we shouldn't try to do it in pgbench -- so I let it drop at the time.
-- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
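(To make the proposed behaviour concrete, a rough libpq-level sketch of the retry rule described above; run_script() is a hypothetical helper that replays the transaction script from the top and returns its final result:)

    #include <stdio.h>
    #include <string.h>
    #include "libpq-fe.h"

    extern PGresult *run_script(PGconn *conn);  /* hypothetical helper */

    static void
    run_until_done(PGconn *conn, long *xacts, long *retries)
    {
        for (;;)
        {
            PGresult   *res = run_script(conn);
            ExecStatusType status = PQresultStatus(res);
            const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

            if (status == PGRES_COMMAND_OK || status == PGRES_TUPLES_OK)
            {
                (*xacts)++;         /* completed: count the transaction */
                PQclear(res);
                return;
            }

            if (sqlstate &&
                (strcmp(sqlstate, "40001") == 0 ||  /* serialization_failure */
                 strcmp(sqlstate, "40P01") == 0))   /* deadlock_detected */
            {
                (*retries)++;       /* retry; do NOT count a transaction */
                PQclear(res);
                PQclear(PQexec(conn, "ROLLBACK"));
                continue;
            }

            /* anything else could head-bang forever: treat as a hard error */
            fprintf(stderr, "hard error: %s", PQerrorMessage(conn));
            PQclear(res);
            return;
        }
    }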
Re: [HACKERS] Block write statistics WIP
On 19.05.2013 11:15, Greg Smith wrote:
> I've thought of two paths to get a block write count out of that so far:
>
> -Provide a function to find the Relation from the RelFileNode. There is
> a warning about the perils of assuming you can map that way from a
> buftag value in buf_internals.h though: "Note: the BufferTag data must
> be sufficient to determine where to write the block, without reference
> to pg_class or pg_tablespace entries. It's possible that the backend
> flushing the buffer doesn't even believe the relation is visible yet
> (its xact may have started before the xact that created the rel). The
> storage manager must be able to cope anyway."
>
> -Modify every caller of MarkDirty* to include a relation when that
> information is available. There are about 200 callers of those
> functions around, so that won't be fun. I noted that many of them are
> in the XLog replay functions, which won't have the relation--but those
> aren't important to count anyway.
>
> Neither of these options feels very good to me, so selecting between
> the two feels like picking the lesser of two evils. I'm fine with
> chugging through all of the callers to modify MarkDirty*, but I didn't
> want to do that only to have the whole approach rejected as wrong.
> That's why I stopped here to get some feedback.

Adding a Relation argument to all the Mark* calls seems fine to me. I don't find it ugly at all. The biggest objection would be that if there are extensions out there that call those functions, they would need to be changed, but I think that's fine.

> The way that MarkDirty requires this specific underlying storage
> manager behavior to work properly strikes me as a bit of a layering
> violation too. I'd like the read and write paths to have a similar API,
> but here they don't even operate on the same type of inputs. Addressing
> that is probably harder than just throwing a hack on the existing code
> though.

To be honest, I don't understand what you mean by that.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
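(To illustrate the second option, the signature change being discussed might look something like this; a sketch only, and pgstat_count_buffer_write() is an invented counter, not an existing pgstat call:)

    void
    MarkBufferDirty(Buffer buffer, Relation reln)
    {
        /* ... existing logic that sets BM_DIRTY on the buffer header ... */

        /*
         * reln may be NULL where no Relation is at hand, e.g. in the XLOG
         * replay functions -- which don't need to be counted anyway.
         */
        if (reln != NULL)
            pgstat_count_buffer_write(reln);    /* hypothetical counter */
    }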
Re: [HACKERS] ASYNC Privileges proposal
On 05/20/2013 09:54 AM, Chris Farmiloe wrote:
> Hey all,
>
> I find the current LISTEN / NOTIFY rather limited in the context of
> databases with multiple roles. As it stands it is not possible to restrict
> the use of LISTEN or NOTIFY to specific roles, and therefore notifications
> (and their payloads) cannot really be trusted as coming from any particular
> source.

Just for context, here's the dba.stackexchange.com question where this was previously raised: http://dba.stackexchange.com/q/42496/7788

I agree that this only became relevant since NOTIFY payload support was added, so we can't really just say "nobody's wanted this in 15 years". A couple of years, maybe, but given the lagging client driver support for NOTIFY payloads, maybe not even that.

That said, I personally don't see this as a significant problem, since the approach that worked prior to the introduction of notify payloads -- just using a staging table -- continues to work just fine.

I agree with Tom that much finer-grained control would be required if adding privileges at all were to be particularly useful; such a coarse right as Chris proposes seems unlikely to cover many use cases. We'd likely want control over who can NOTIFY and who can LISTEN to a given channel name. I'm just not convinced it's worth the complexity it'll introduce or any performance consequences.

If it could be done by adding a few hooks that a C extension could use, that might be worthwhile though. The same argument might be applied to advisory locking, after all. I can't really see adding fine-grained rights to _everything_, but hooks might not be harmful.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
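(If the hooks route were taken, something like this hypothetical shape would probably suffice; none of these names exist today:)

    /* hypothetical hook: return false to refuse the operation */
    typedef bool (*async_auth_hook_type) (Oid roleid, const char *channel,
                                          bool is_listen);
    extern PGDLLIMPORT async_auth_hook_type async_auth_hook;

    /* in Async_Notify(), something along these lines: */
    if (async_auth_hook &&
        !async_auth_hook(GetUserId(), channel, false))
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied to notify channel \"%s\"",
                        channel)));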
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Hello,

I think a more appropriate question to ask here would be: at what point (in the life of a typical tuple) does a tuple's header contain t_datum, and at what point t_heap? I would also like that to be answered.

On Mon, May 20, 2013 at 6:06 PM, Soroosh Sardari wrote:
> Thanks. If a tuple is constructed in memory, we don't need t_heap. I
> have another question: how is an in-memory tuple made?
>
> [...]

-- Amit Langote

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On Sun, 2013-05-19 at 09:29 +0300, Heikki Linnakangas wrote:
> On 18.05.2013 03:52, Dickson S. Guedes wrote:
>>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
>>> profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
>>> 32-core Linux machine:
>>>
>>> - 64.09% postgres postgres [.] tas
>>>    - tas
>>>       - 99.83% s_lock
>>>          - 53.22% LWLockAcquire
>>>             + 99.87% GetSnapshotData
>>>          - 46.78% LWLockRelease
>>>               GetSnapshotData
>>>               + GetTransactionSnapshot
>>> + 2.97% postgres postgres [.] tas
>>> + 1.53% postgres libc-2.13.so [.] 0x119873
>>> + 1.44% postgres postgres [.] GetSnapshotData
>>> + 1.29% postgres [kernel.kallsyms] [k] arch_local_irq_enable
>>> + 1.18% postgres postgres [.] AllocSetAlloc
>>> ...
>>
>> I'd like to test this here but I couldn't reproduce that perf output
>> here in a 64-core or 24-core machines, could you post the changes to
>> postgresql.conf and the perf arguments that you used?
>
> Sure, here are the non-default postgresql.conf settings:

Thank you for your information.

> While pgbench was running, I ran this:
>
> perf record -p 6050 -g -e cpu-clock
>
> to connect to one of the backends. (I used cpu-clock, because the
> default cpu-cycles event didn't work on the box)

Hum, I was supposing that I was doing something wrong, but I'm getting the same result as before even using your test case, and my results are still different from yours:

+ 71,27% postgres postgres [.] AtEOXact_Buffers
+ 7,67% postgres postgres [.] AtEOXact_CatCache
+ 6,30% postgres postgres [.] AllocSetCheck
+ 5,34% postgres libc-2.12.so [.] __mcount_internal
+ 2,14% postgres [kernel.kallsyms][k] activate_page

It's a 64-core machine with PGDATA in a SAN.

vendor_id : GenuineIntel
cpu family : 6
model : 47
model name : Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz
stepping : 2
cpu MHz : 1064.000
cache size : 24576 KB
physical id : 3
siblings : 16
core id : 24
cpu cores : 8
apicid : 241
initial apicid : 241
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 4255.87
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:

Would you like me to test some other configuration to try to simulate that expected workload?

[]s
-- Dickson S. Guedes mail/xmpp: gue...@guedesoft.net - skype: guediz http://guedesoft.net - http://www.postgresql.org.br http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
> Hum, I was supposing that I was doing something wrong, but I'm getting
> the same result as before even using your test case, and my results are
> still different from yours:
>
> + 71,27% postgres postgres [.] AtEOXact_Buffers
> + 7,67% postgres postgres [.] AtEOXact_CatCache
> + 6,30% postgres postgres [.] AllocSetCheck
> + 5,34% postgres libc-2.12.so [.] __mcount_internal
> + 2,14% postgres [kernel.kallsyms][k] activate_page

That looks like you have configured with --enable-cassert and probably also --enable-profiling? The former will give completely distorted performance results...

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
I wonder, though, if this question is better asked on pgsql-novice?

On Mon, May 20, 2013 at 9:23 PM, Amit Langote wrote:
> Hello,
>
> I think a more appropriate question to ask here would be: at what point
> (in the life of a typical tuple) does a tuple's header contain t_datum,
> and at what point t_heap? I would also like that to be answered.
>
> [...]

-- Amit Langote

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On Mon, 2013-05-20 at 14:35 +0200, Andres Freund wrote:
> On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
>> Hum, I was supposing that I was doing something wrong, but I'm getting
>> the same result as before even using your test case, and my results
>> are still different from yours:
>>
>> + 71,27% postgres postgres [.] AtEOXact_Buffers
>> + 7,67% postgres postgres [.] AtEOXact_CatCache
>> + 6,30% postgres postgres [.] AllocSetCheck
>> + 5,34% postgres libc-2.12.so [.] __mcount_internal
>> + 2,14% postgres [kernel.kallsyms][k] activate_page
>
> That looks like you have configured with --enable-cassert and probably
> also --enable-profiling? The former will give completely distorted
> performance results...

Ah! Wrong PATH, so wrong binaries. Thanks Andres.

-- Dickson S. Guedes mail/xmpp: gue...@guedesoft.net - skype: guediz http://guedesoft.net - http://www.postgresql.org.br http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Sent from my iPad

On 20-May-2013, at 18:14, Amit Langote wrote:
> I wonder, though, if this question is better asked on pgsql-novice?
>
> [...]

I don't think so, as this involves internal structures. Let's keep it here.

Regards,
Atri

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move unused buffers to freelist
On Thu, May 16, 2013 at 10:18 AM, Amit Kapila wrote:
> Further Performance Data:
>
> Below data is for average 3 runs of 20 minutes
>
> Scale Factor - 1200
> Shared Buffers - 7G

These results are good, but I don't get similar results in my own testing. I ran pgbench tests at a variety of client counts and scale factors, using 30-minute test runs and the following non-default configuration parameters:

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '

Here are the results. The first field in each line is the number of clients. The second number is the scale factor. The numbers after "master" and "patched" are the median of three runs.

01 100 master 1433.297699 patched 1420.306088
01 300 master 1371.286876 patched 1368.910732
01 1000 master 1056.891901 patched 1067.341658
01 3000 master 637.312651 patched 685.205011
08 100 master 10575.017704 patched 11456.043638
08 300 master 9262.601107 patched 9120.925071
08 1000 master 1721.807658 patched 1800.733257
08 3000 master 819.694049 patched 854.333830
32 100 master 26981.677368 patched 27024.507600
32 300 master 14554.870871 patched 14778.285400
32 1000 master 1941.733251 patched 1990.248137
32 3000 master 846.654654 patched 892.554222

And here are the same results for 5-minute, read-only tests:

01 100 master 9361.073952 patched 9049.553997
01 300 master 8640.235680 patched 8646.590739
01 1000 master 8339.364026 patched 8342.799468
01 3000 master 7968.428287 patched 7882.121547
08 100 master 71311.491773 patched 71812.899492
08 300 master 69238.839225 patched 70063.632081
08 1000 master 34794.778567 patched 65998.468775
08 3000 master 60834.509571 patched 61165.998080
32 100 master 203168.264456 patched 205258.283852
32 300 master 199137.276025 patched 200391.633074
32 1000 master 177996.853496 patched 176365.732087
32 3000 master 149891.147442 patched 148683.269107

Something appears to have screwed up my results for 8 clients @ scale factor 1000 on master, but overall, on both the read-only and read-write tests, I'm not seeing anything that resembles the big gains you reported.

Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the PostgreSQL community courtesy of IBM. Fedora 16, Linux kernel 3.2.6.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Heap truncation without AccessExclusiveLock (9.4)
On Fri, May 17, 2013 at 3:38 AM, Heikki Linnakangas wrote: > If we could use the catchup interrupts to speed that up though, that would > be much better. I think vacuum could simply send a catchup interrupt, and > wait until everyone has caught up. That would significantly increase the > traffic of sinval queue and catchup interrupts, compared to what it is > today, but I think it would still be ok. It would still only be a few sinval > messages and catchup interrupts per truncation (ie. per vacuum). Hmm. So your proposal is to only send these sinval messages while a truncation is in progress, not any time the relation is extended? That would certainly be far more appealing from the point of view of not blowing out sinval. It shouldn't be difficult to restrict the set of backends that have to be signaled to those that have the relation open. You could have a special kind of catchup signal that means "catch yourself up, but don't chain" - and send that only to those backends returned by GetConflictingVirtualXIDs. It might be better to disconnect this mechanism from sinval entirely. In other words, just stick a version number on your shared memory data structure and have everyone advertise the last version number they've seen via PGPROC. The sinval message doesn't really seem necessary; it's basically just a no-op message to say "reread shared memory", and a plain old signal can carry that same message more efficiently. One other thought regarding point 6 from your original proposal. If it's true that a scan could hold a pin on a buffer above the current hard watermark, which I think it is, then that means there's a scan in progress which is going to try to fetch pages beyond that also, up to whatever the end of file position was when the scan started. I suppose that heap scans will need to be taught to check some backend-local flag before fetching each new block, to see if a hard watermark change might have intervened. Is that what you have in mind? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Heap truncation without AccessExclusiveLock (9.4)
On 20.05.2013 16:59, Robert Haas wrote:
> On Fri, May 17, 2013 at 3:38 AM, Heikki Linnakangas wrote:
>> If we could use the catchup interrupts to speed that up though, that
>> would be much better. I think vacuum could simply send a catchup
>> interrupt, and wait until everyone has caught up. That would
>> significantly increase the traffic of sinval queue and catchup
>> interrupts, compared to what it is today, but I think it would still
>> be ok. It would still only be a few sinval messages and catchup
>> interrupts per truncation (ie. per vacuum).
>
> Hmm. So your proposal is to only send these sinval messages while a
> truncation is in progress, not any time the relation is extended? That
> would certainly be far more appealing from the point of view of not
> blowing out sinval.

Right.

> It shouldn't be difficult to restrict the set of backends that have to
> be signaled to those that have the relation open. You could have a
> special kind of catchup signal that means "catch yourself up, but
> don't chain"

What does "chain" mean above?

> - and send that only to those backends returned by
> GetConflictingVirtualXIDs.

Yeah, that might be a worthwhile optimization.

> It might be better to disconnect this mechanism from sinval entirely.
> In other words, just stick a version number on your shared memory data
> structure and have everyone advertise the last version number they've
> seen via PGPROC. The sinval message doesn't really seem necessary;
> it's basically just a no-op message to say "reread shared memory", and
> a plain old signal can carry that same message more efficiently.

Hmm. The sinval message makes sure that when a backend locks a relation, it will see the latest value, because of the AcceptInvalidationMessages call in LockRelation. If there is no sinval message, you'd need to always check the shared memory area when you lock a relation.

> One other thought regarding point 6 from your original proposal. If
> it's true that a scan could hold a pin on a buffer above the current
> hard watermark, which I think it is, then that means there's a scan in
> progress which is going to try to fetch pages beyond that also, up to
> whatever the end of file position was when the scan started. I suppose
> that heap scans will need to be taught to check some backend-local
> flag before fetching each new block, to see if a hard watermark change
> might have intervened. Is that what you have in mind?

Yes, a heap scan will need to check the locally cached high-watermark value every time it steps to a new page.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
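(For concreteness, a sketch of the version-number scheme under discussion; all names here are hypothetical:)

    typedef struct TruncationCtlData
    {
        slock_t     lock;
        uint32      version;        /* bumped on every watermark change */
        /* ... per-relation hard/soft watermark entries ... */
    } TruncationCtlData;

    /* reader side, e.g. when a heap scan steps to a new page: */
    SpinLockAcquire(&TruncationCtl->lock);
    if (MyProc->seenTruncationVersion != TruncationCtl->version)
    {
        /* re-read the watermarks into a backend-local cache */
        MyProc->seenTruncationVersion = TruncationCtl->version;
    }
    SpinLockRelease(&TruncationCtl->lock);

    /*
     * Truncating side: bump version, signal the backends that have the
     * relation open, then wait until every PGPROC advertises the new
     * version before physically truncating the file.
     */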
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 13.05.2013 17:21, Merlin Moncure wrote: On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas wrote: The attached patch is still work-in-progress. There needs to be a configure test and fallback to spinlock if a CAS instruction is not available. I used the gcc __sync_val_compare_and_swap() builtin directly, that needs to be abstracted. Also, in the case that the wait queue needs to be manipulated, the code spins on the CAS instruction, but there is no delay mechanism like there is on a regular spinlock; that needs to be added in somehow. These are really interesting results. Why is the CAS method so much faster then TAS? If my theory is right, the steep fall is caused by backends being sometimes context switched while holding the spinlock. The CAS method alleviates that, because the lock is never "held" by anyone. With a spinlock, if a process gets context switched while holding the lock, everyone else will have to wait until the process gets context switched back, and releases the spinlock. With the CAS method, if a process gets switched in the CAS-loop, it doesn't hurt others. With the patch, the CAS-loop still spins, just like a spinlock, when the wait queue has to be manipulated. So when that happens, it will behave more or less like the spinlock implementation. I'm not too concerned about that optimizing that further; sleeping and signaling sleeping backends are heavy-weight operations anyway. Did you see any contention? With the current spinlock implementation? Yeah, per the profile I posted, a lot of CPU time was spent spinning, which is a sign of contention. I didn't see contention with the CAS implemention. Attached is a new version of the patch. I cleaned it up a lot, and added a configure test, and fallback to the good old spinlock implementation if compare-and-swap is not available. Also, if the CAS loops needs to spin, like a spinlock, it now sleeps after a while and updates the "spins_per_delay" like regular spinlock. - Heikki diff --git a/configure b/configure index 98d889a..43faaf8 100755 --- a/configure +++ b/configure @@ -22742,71 +22742,6 @@ fi done -{ $as_echo "$as_me:$LINENO: checking for builtin locking functions" >&5 -$as_echo_n "checking for builtin locking functions... " >&6; } -if test "${pgac_cv_gcc_int_atomics+set}" = set; then - $as_echo_n "(cached) " >&6 -else - cat >conftest.$ac_ext <<_ACEOF -/* confdefs.h. */ -_ACEOF -cat confdefs.h >>conftest.$ac_ext -cat >>conftest.$ac_ext <<_ACEOF -/* end confdefs.h. */ - -int -main () -{ -int lock = 0; - __sync_lock_test_and_set(&lock, 1); - __sync_lock_release(&lock); - ; - return 0; -} -_ACEOF -rm -f conftest.$ac_objext conftest$ac_exeext -if { (ac_try="$ac_link" -case "(($ac_try" in - *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; - *) ac_try_echo=$ac_try;; -esac -eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\"" -$as_echo "$ac_try_echo") >&5 - (eval "$ac_link") 2>conftest.er1 - ac_status=$? - grep -v '^ *+' conftest.er1 >conftest.err - rm -f conftest.er1 - cat conftest.err >&5 - $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5 - (exit $ac_status); } && { - test -z "$ac_c_werror_flag" || - test ! 
-s conftest.err - } && test -s conftest$ac_exeext && { - test "$cross_compiling" = yes || - $as_test_x conftest$ac_exeext - }; then - pgac_cv_gcc_int_atomics="yes" -else - $as_echo "$as_me: failed program was:" >&5 -sed 's/^/| /' conftest.$ac_ext >&5 - - pgac_cv_gcc_int_atomics="no" -fi - -rm -rf conftest.dSYM -rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \ - conftest$ac_exeext conftest.$ac_ext -fi -{ $as_echo "$as_me:$LINENO: result: $pgac_cv_gcc_int_atomics" >&5 -$as_echo "$pgac_cv_gcc_int_atomics" >&6; } -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then - -cat >>confdefs.h <<\_ACEOF -#define HAVE_GCC_INT_ATOMICS 1 -_ACEOF - -fi - # Lastly, restore full LIBS list and check for readline/libedit symbols LIBS="$LIBS_including_readline" @@ -28573,6 +28508,136 @@ _ACEOF fi +# Check for GCC built-in atomic functions +{ $as_echo "$as_me:$LINENO: checking for builtin test-and-set function" >&5 +$as_echo_n "checking for builtin test-and-set function... " >&6; } +if test "${pgac_cv_gcc_int_test_and_set+set}" = set; then + $as_echo_n "(cached) " >&6 +else + cat >conftest.$ac_ext <<_ACEOF +/* confdefs.h. */ +_ACEOF +cat confdefs.h >>conftest.$ac_ext +cat >>conftest.$ac_ext <<_ACEOF +/* end confdefs.h. */ + +int +main () +{ +int lock = 0; + __sync_lock_test_and_set(&lock, 1); + __sync_lock_release(&lock); + ; + return 0; +} +_ACEOF +rm -f conftest.$ac_objext conftest$ac_exeext +if { (ac_try="$ac_link" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\"" +$as_echo "$ac_try_echo") >&5 + (eval "$ac_link") 2>conftest.er1 + ac_status=$? + grep -v '^ *+' conftest.er1 >conftest.err + rm -f conftest.er1 + cat conftest.err >&5 + $
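(Separate from the patch itself, the shared-lock fast path boils down to something like the following sketch; the state packing and EXCLUSIVE_BIT are illustrative, not the patch's actual layout:)

    #define EXCLUSIVE_BIT 0x80000000

    static bool
    try_shared_lock_cas(volatile uint32 *state)
    {
        uint32      oldstate = *state;

        for (;;)
        {
            uint32      current;

            if (oldstate & EXCLUSIVE_BIT)
                return false;       /* exclusively held: take the slow path */

            /* install oldstate+1 iff the word hasn't changed meanwhile */
            current = __sync_val_compare_and_swap(state, oldstate,
                                                  oldstate + 1);
            if (current == oldstate)
                return true;        /* acquired; the lock was never "held" */
            oldstate = current;     /* lost the race: re-evaluate and retry */
        }
    }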
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
Hello, Would you guys please trim the quoted text of the emails you're replying to? I understand Gmail is obnoxious w.r.t. quoted text, but this is starting to become excessive. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why there is a union in HeapTupleHeaderData struct
On Mon, May 20, 2013 at 8:54 PM, Alvaro Herrera wrote:
> Hello,
>
> Would you guys please trim the quoted text of the emails you're replying
> to? I understand Gmail is obnoxious w.r.t. quoted text, but this is
> starting to become excessive.

Oops, I didn't notice that. Sorry!

Regards,
Atri

-- Regards, Atri l'apprenant

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PGCON meetup FreeNAS/FreeBSD: In Ottawa Tue & Wed.
Hello PostgreSQL Hackers,

I am now in Ottawa; last week we wrapped up the BSDCon, and I was hoping to chat with a few PostgreSQL developers in person about using PostgreSQL in FreeNAS and offering it as an extension to the platform as a plug-in technology. Unfortunately, due to time constraints I cannot attend the entire conference, and I am only in town until Wednesday at noon. I'm hoping there's a good time to talk to a few developers about PostgreSQL + FreeNAS before I have to depart back to the Bay Area.

Some info on me: My name is Alfred Perlstein; I am a FreeBSD developer and the FreeNAS project lead. I am the VP of Software Engineering at iXsystems. I have been a fan of PostgreSQL for many years. In the early 2000s we built a high-speed web-tracking application on top of PostgreSQL and worked closely with the community to shake out performance issues and bugs -- so closely that Tom Lane and Vadim Mikheev had logins on our box. Since that time I have tried to get PostgreSQL into as many places as possible.

Some info on the topics I wanted to briefly discuss:

1) Using PostgreSQL as the config store for FreeNAS. We currently use SQLite; SQLite fits our needs until we get to the point of replication between HA (high availability) units. Then we are forced to manually sync data between configurations. A discussion on how we might do this better using PostgreSQL, while still maintaining our ease of config export (single file) and small footprint, would be interesting.

2) PostgreSQL plugin for FreeNAS. Flip a switch and suddenly your file server is also serving enterprise data. We currently have a plug-in architecture, but would like to discuss the possibility of a tighter integration so that PostgreSQL looks like a more cohesive addition to FreeNAS.

3) Statistics monitoring / EagleEye. In FreeBSD/FreeNAS I have developed a system called EagleEye. EagleEye is a system where all mibs are easily exportable with timestamps in a common format (for now YAML & modified CSV), which is then consumed by a utility that can provide graphs. The entire point of EagleEye is to eventually upstream the modifications to future-proof statistics tracking in the FreeBSD and FreeNAS systems. I have spoken with some Illumos/ZFS developers and they are interested as well.

I think that is all I have; please drop me a note if you'll have some time in Ottawa today, tomorrow, or early Wednesday. I'd love to discuss, and buy some beers for the group.

thank you,
-Alfred Perlstein
VP Software Engineering, iXsystems.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. SERIALIZABLE
On 20.05.2013 14:50, Kevin Grittner wrote:
> I raised this issue near the end of SSI development, but nobody seemed
> very interested and someone argued that a tool to do that would be good
> but we shouldn't try to do it in pgbench -- so I let it drop at the time.

+1 on doing it in pgbench.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. SERIALIZABLE
On Mon, May 20, 2013 at 4:50 AM, Kevin Grittner wrote:
> I raised this issue near the end of SSI development, but nobody
> seemed very interested and someone argued that a tool to do that
> would be good but we shouldn't try to do it in pgbench -- so I let
> it drop at the time.

I think it would be good to do it in pgbench, provided it can be done fairly cleanly. Presumably we would want to repeat all of the ordinary commands in the file, but not any of the backslash set commands that precede any ordinary commands. But what if backslash set commands are sprinkled between ordinary commands?

Cheers, Jeff
Re: [HACKERS] Re: pg_basebackup with -R option and start standby have problems with escaped password
On 17.05.2013 19:03, Boszormenyi Zoltan wrote:

On 2013-05-17 16:05, Heikki Linnakangas wrote:

On 18.02.2013 16:35, Boszormenyi Zoltan wrote:

On 2013-01-29 11:15, Magnus Hagander wrote:

On Thu, Jan 24, 2013 at 7:04 AM, Hari Babu wrote:

On Wed, Jan 23, 2013 11:48 PM, Magnus Hagander wrote:

On Wed, Jan 23, 2013 at 10:18 AM, Hari Babu wrote:

Test scenario to reproduce:
1. Start the server
2. Create the user as follows:
./psql postgres -c "create user user1 superuser login password 'use''1'"
3. Take the backup with the -R option as follows:
./pg_basebackup -D ../../data1 -R -U user1 -W

The following errors occur when the new standby on the backup database starts:

FATAL: could not connect to the primary server: missing "=" after "1'" in connection info string

What does the resulting recovery.conf file look like?

The recovery.conf which is generated is as follows:

standby_mode = 'on'
primary_conninfo = 'user=''user1'' password=''use''1'' port=''5432'' '

I observed that the problem is that while reading primary_conninfo from the recovery.conf file, the function "GUC_scanstr" removes the quotes of the string and also turns each continuous double quote ('') into a single quote ('). When the same connection string is then used to connect to the primary server, the function "conninfo_parse" cannot parse the escaped quotes properly, and that leads to the problem. Please correct me if anything is wrong in my observation.

Well, it's clearly broken at least :O Zoltan, do you have time to look at it?

I won't have time until at least after FOSDEM, unfortunately.

I looked at it shortly. What I tried first is adding another pair of single quotes manually like this:

primary_conninfo = 'user=''user1'' password=''use1'' host=''192.168.1.2'' port=''5432'' sslmode=''disable'' sslcompression=''1'' '

But it doesn't solve the problem either, I got:

FATAL: could not connect to the primary server: missing "=" after "'1'" in connection info string

This worked though:

primary_conninfo = 'user=user1 password=use\'1 host=192.168.1.2 port=5432 sslmode=disable sslcompression=1 '

When I added an elog() to print the conninfo string in libpqrcv_connect(), I saw that the double quotes were properly eliminated by ParseConfigFp() in the first case. So, there is a bug in generating recovery.conf by not double-escaping the values, and another bug in parsing the connection string in libpq when the parameter value starts with a single-quote character.

No, the libpq connection string parser is working as intended. Per the docs on PQconnectdb:

The passed string can be empty to use all default parameters, or it can contain one or more parameter settings separated by whitespace. Each parameter setting is in the form keyword = value. Spaces around the equal sign are optional. To write an empty value, or a value containing spaces, surround it with single quotes, e.g., keyword = 'a value'. Single quotes and backslashes within the value must be escaped with a backslash, i.e., \' and \\.

So, the proper way to escape a quote in a libpq connection string is \', not ''. There are two escaping systems layered on top of each other: the recovery.conf parser's, where you use '', and the libpq system, where you use \'. So we need two different escaping functions in pg_basebackup to get this right.

Why is extending the libpq parser to allow doubling the single quotes not a good solution? It would be consistent in this regard with the recovery.conf parser. From this POV the first patch only needs a little change in the libpq docs.
I guess that would work too, but I don't see any meaningful advantage in doing that.

Apart from that, does it bother anyone else that the primary_conninfo line that pg_basebackup creates is butt-ugly?

primary_conninfo = 'user=''heikki'' host=''localhost'' port=''5432'' sslmode=''prefer'' sslcompression=''1'' '

Not a single bit. It's machine generated and the code is generic. The parser can deal with it.

Oh, ok :-). Well, IMO that's really ugly; the file ought to be human-readable too.

Also, do we really need to include the ssl options when they are the defaults?

I think we already went through this bullet point. IIRC the reasoning and the consensus was that the backup and the generated recovery.conf should also work on another machine with a possibly different set of compilation options for libpq and envvars present on the system.

Ok.

I think the attached patch fixes the original test scenario correctly, without changing libpq's quoting rules, and only quotes when necessary. I didn't do anything about the ssl options. Please take a look.

Your patch should work, too.

Thanks, I've applied that.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
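(For the archives, the two quoting layers involved can be sketched like this; a simplified illustration, not the committed code, using libpq's PQExpBuffer API:)

    /* layer 1: libpq conninfo quoting of a single value */
    appendPQExpBufferChar(&conninfo, '\'');
    for (p = val; *p; p++)
    {
        if (*p == '\\' || *p == '\'')
            appendPQExpBufferChar(&conninfo, '\\');     /* backslash-escape */
        appendPQExpBufferChar(&conninfo, *p);
    }
    appendPQExpBufferChar(&conninfo, '\'');

    /* layer 2: recovery.conf (GUC) quoting of the whole conninfo string */
    appendPQExpBufferChar(&recoveryconf, '\'');
    for (p = conninfo.data; *p; p++)
    {
        if (*p == '\'' || *p == '\\')
            appendPQExpBufferChar(&recoveryconf, *p);   /* double it */
        appendPQExpBufferChar(&recoveryconf, *p);
    }
    appendPQExpBufferChar(&recoveryconf, '\'');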
Re: [HACKERS] Fast promotion failure
On 19.05.2013 17:25, Simon Riggs wrote:
> However, there is a call to RecoveryInProgress() at the top of the main
> loop of the checkpointer, which does explicitly state that it
> "initializes TimeLineID if it's not set yet." The checkpointer makes
> the decision about whether to run a restartpoint or a checkpoint
> directly from the answer to that, modified only in the case of a
> CHECKPOINT_END_OF_RECOVERY. So the appropriate call is made and I don't
> agree with the statement that this "doesn't get executed with fast
> promotion", or the title of the thread.

Hmm, I see.

> So while I believe that the checkpointer might have an incorrect TLI
> and that you've seen a bug, what isn't clear is that the checkpointer
> is the only process that would see an incorrect TLI, or why such
> processes see an incorrect TLI. It seems equally likely at this point
> that the TLI may be set incorrectly somehow and that is why it is
> being read incorrectly.

Yeah. I spent some more time debugging this, and found the culprit. In CreateRestartPoint, we do this:

    /*
     * Update ThisTimeLineID to the timeline we're currently replaying,
     * so that we install any recycled segments on that timeline.
     *
     * There is no guarantee that the WAL segments will be useful on the
     * current timeline; if recovery proceeds to a new timeline right
     * after this, the pre-allocated WAL segments on this timeline will
     * not be used, and will go wasted until recycled on the next
     * restartpoint. We'll live with that.
     */
    (void) GetXLogReplayRecPtr(&ThisTimeLineID);

Here's what happens:

0. System is in recovery.
1. Checkpointer process starts creating a restartpoint.
2. Startup process ends recovery, creates the end-of-recovery record, sets the shared ThisTimeLineID value to 2, and exits.
3. Checkpointer process calls RecoveryInProgress() while performing the restartpoint (there are RecoveryInProgress() calls in many places, e.g. in XLogNeedsFlush()). It sets LocalRecoveryInProgress = false, and initializes ThisTimeLineID to 2.
4. At the end of the restartpoint, the checkpointer process does the above GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites ThisTimeLineID back to 1.
5. Checkpointer calls RecoveryInProgress(), which returns false immediately, as LocalRecoveryInProgress is already cleared.
6. Checkpointer performs the first checkpoint, with ThisTimeLineID set incorrectly to 1.

Not sure what the best fix would be. Perhaps change the code in CreateRestartPoint() to do something like this instead:

    GetXLogReplayRecPtr(&replayTLI);
    if (RecoveryInProgress())
        ThisTimeLineID = replayTLI;

> I see that the comment in InitXLOGAccess() is incorrect
> "ThisTimeLineID doesn't change so we need no lock to copy it", which
> seems worth correcting since there's no need to save time in a once
> only function.

Hmm, the code seems right to me: XLogCtl->ThisTimeLineID indeed doesn't change, once it's set. And InitXLOGAccess() isn't called until it is set. The comment could explain that better, though.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench vs. SERIALIZABLE
> I think it would be good to do it in pgbench, provided it can be done
> fairly cleanly. Presumably we would want to repeat all of the ordinary
> commands in the file, but not any of the backslash set commands that
> precede any ordinary commands. But what if backslash set commands are
> sprinkled between ordinary commands?

As I have spent some time in pgbench recently, ISTM that when restarting the transaction (reset state = 0), you want to keep track of the farthest state that was reached on previous attempts, so as to skip backslash commands for those states and resume executing from where they left off. That would mean one integer for this purpose in the client structure.

-- Fabien

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
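(In code terms, the idea amounts to something like this against pgbench's CState; the "furthest" field is hypothetical:)

    typedef struct
    {
        /* ... existing CState fields ... */
        int         state;          /* command being executed */
        int         furthest;       /* farthest command reached on earlier tries */
    } CState;

    /* on serialization failure: restart the script, but remember progress */
    if (st->state > st->furthest)
        st->furthest = st->state;
    st->state = 0;

    /* while replaying, don't re-run \set-style commands already evaluated */
    if (commands[st->state]->type == META_COMMAND &&
        st->state < st->furthest)
    {
        st->state++;                /* keep the previously computed variables */
        continue;
    }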
Re: [HACKERS] fast promotion and log_checkpoints
On 19.05.2013 17:22, Simon Riggs wrote:
> On 1 May 2013 10:05, Fujii Masao wrote:
>> In HEAD, when the standby is promoted, recovery requests the
>> checkpoint but doesn't wait for its completion. I found the checkpoint
>> starting log message of this checkpoint looks odd as follows:
>>
>> LOG: checkpoint starting:
>>
>> I think something like the following is better.
>>
>> LOG: checkpoint starting: end-of-recovery
>>
>> In 9.2 or before, the "end-of-recovery" part is logged. Even if we
>> changed the behavior of the end-of-recovery checkpoint, I think that
>> it's more intuitive to label it as "end-of-recovery". Thought?
>
> The checkpoint isn't an "end-of-recovery" checkpoint, it's just the
> first checkpoint after the end of recovery. I don't think it should
> say "end-of-recovery".

Agreed.

> The problem is that we've now changed the code to trigger a checkpoint
> in a place that wasn't part of the original design, so the checkpoint
> called at that point isn't supplied with a reason and so has nothing
> to print.
>
> It would be possible to redesign this with a special new reason, or we
> could just use "time" as the reason, or we could just leave it. Doing
> nothing is easy, though so are the others, so we can choose anything
> we want. What do we want it to say?

I'm not sure. Perhaps we should print "(no flags)", so that it wouldn't look like there's something missing in the log message.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
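(E.g., something as small as this in LogCheckpointStart() would do; a sketch only:)

    /* print a placeholder instead of ending the message with a bare colon */
    if (flags == 0)
        ereport(LOG, (errmsg("checkpoint starting: (no flags)")));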
Re: [HACKERS] Fast promotion failure
On 20 May 2013 18:47, Heikki Linnakangas wrote:
> On 19.05.2013 17:25, Simon Riggs wrote:
>> So while I believe that the checkpointer might have an incorrect TLI
>> and that you've seen a bug, what isn't clear is that the checkpointer
>> is the only process that would see an incorrect TLI, or why such
>> processes see an incorrect TLI. It seems equally likely at this point
>> that the TLI may be set incorrectly somehow and that is why it is
>> being read incorrectly.
>
> Yeah. I spent some more time debugging this, and found the culprit. In
> CreateRestartPoint, we do this:
>
>     /*
>      * Update ThisTimeLineID to the timeline we're currently replaying,
>      * so that we install any recycled segments on that timeline.
>      *
>      * There is no guarantee that the WAL segments will be useful on the
>      * current timeline; if recovery proceeds to a new timeline right
>      * after this, the pre-allocated WAL segments on this timeline will
>      * not be used, and will go wasted until recycled on the next
>      * restartpoint. We'll live with that.
>      */
>     (void) GetXLogReplayRecPtr(&ThisTimeLineID);
>
> Here's what happens:
>
> 0. System is in recovery.
> 1. Checkpointer process starts creating a restartpoint.
> 2. Startup process ends recovery, creates the end-of-recovery record,
> sets the shared ThisTimeLineID value to 2, and exits.
> 3. Checkpointer process calls RecoveryInProgress() while performing the
> restartpoint (there are RecoveryInProgress() calls in many places, e.g.
> in XLogNeedsFlush()). It sets LocalRecoveryInProgress = false, and
> initializes ThisTimeLineID to 2.
> 4. At the end of the restartpoint, the checkpointer process does the
> above GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites
> ThisTimeLineID back to 1.
> 5. Checkpointer calls RecoveryInProgress(), which returns false
> immediately, as LocalRecoveryInProgress is already cleared.
> 6. Checkpointer performs the first checkpoint, with ThisTimeLineID set
> incorrectly to 1.

That description looks correct to me from the code.

> Not sure what the best fix would be. Perhaps change the code in
> CreateRestartPoint() to do something like this instead:
>
>     GetXLogReplayRecPtr(&replayTLI);
>     if (RecoveryInProgress())
>         ThisTimeLineID = replayTLI;

Installing a few extra WAL files doesn't seem to be a good reason to mess with such an important variable as ThisTimeLineID, especially since it's such a rare event and hardly worth optimising for.

I would prefer it if we didn't set ThisTimeLineID at all in that way, or anywhere else. If we do, it needs to be done safely, otherwise any caller could make the same mistake. Suggested patch attached.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

avoid_setting_tli.v1.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] FK locking concurrency improvement
As part of 0ac5ad5134f2769ccbaefec73844f8504c4d6182 the permutations in
test/isolation/fk-deadlock2.spec and elsewhere were removed. Is it the
intent that these tests no longer do anything useful? I was expecting a
failure in the test with some work I'm doing and was confused, after a
merge from the upstream 9.3, that the test didn't fail, until I noticed
the test is no longer running the permutations.

FYI, I saw some comments about adding fflush's into isolationtester.c. I
ran into the same problem with debugging tests when they failed/hung in
the middle. A simple "setbuf(stdout, NULL)" at the beginning of main gets
rid of the problem where line buffering becomes block buffering when
redirecting stdout to a file. That block buffering causes problems with
the sequencing of mixed stderr and stdout output, and means you don't see
the last few lines of stdout if the process fails or hangs. The setbuf on
stdout shown above disables buffering of stdout to match the unbuffered
stderr. That way you don't need to fflush after each printf/fprintf.

I'm not sure why the fflush of stderr was added, because stderr isn't
buffered to begin with, so it is unnecessary. The problem was with
stdout. YMMV on Windows, but it might work there too.

- Dan
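Dan's buffering point is easy to demonstrate outside isolationtester. A
minimal sketch (on typical platforms; compile it and run it once on a
terminal and once with stdout redirected to a file, and compare which
lines survive with and without the setbuf call):

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /*
         * stdout is line-buffered on a terminal but block-buffered when
         * redirected to a file; this disables buffering entirely, to
         * match the always-unbuffered stderr.
         */
        setbuf(stdout, NULL);

        printf("step 1 done\n");             /* reaches the file immediately */
        fprintf(stderr, "note on stderr\n"); /* now interleaves correctly */
        printf("step 2 done\n");

        abort(); /* simulate a hung/crashed test: no stdout lines are lost */
    }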
Re: [HACKERS] Proposal to add --single-row to psql
On Sat, May 11, 2013 at 07:22:55PM -0400, Robert Haas wrote:
> On Sat, May 11, 2013 at 12:27 PM, Tom Lane wrote:
>> By the time you've got an expression tree, the problem is mostly solved,
>> at least so far as parser extension is concerned.
>
> Right.
>
>> More years ago than I care to admit, I worked on systems that had
>> run-time-extensible parsers at Hewlett-Packard, so technology for this
>> does exist. But my (vague) memory of those systems is that the parser's
>> language capabilities were more limited than bison's, perhaps only
>> LL(1). Parsing spec-compatible SQL that way might be a challenge.
>
> If I understand bison correctly, it basically looks at the current
> parser state and the next token and decides to either shift that token
> onto the stack or reduce the stack using some rule. If there's no
> matching rule, we error out. If someone wants to inject new rules
> into the grammar, those state tables are all going to change. But if
> we could contrive things so that the state tables are built
> dynamically and can be changed as rules are added and removed, then it
> seems to me that we could let a loadable module supply (or delete)
> whatever grammar rules it likes. Whenever it does this, we recompile
> the grammar on next use (and complain if we get ambiguities). This
> does not sound all that easy to code, but at least in theory it seems
> doable.
>
> We'd also need a way to add keywords. To be quite frank, I think our
> whole approach to keywords is massively screwed up right now. Aside

FYI, one trick that is often used when parsing C code is to allow parsed
code to add to the keyword list, so when the parser goes to get the next
token, it might get a keyword back based on what it has seen already.
This is required to handle C typedefs, and might be a method we could
use.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
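A hedged sketch of the typedef trick Bruce describes (all names here are
invented for illustration): a parser action registers each typedef name,
and the lexer consults that list before deciding which token kind to
return for an identifier:

    #include <string.h>

    #define MAX_TYPEDEFS 128

    /* token kinds; in a bison grammar these would be %token values */
    enum token { TOK_IDENT, TOK_TYPEDEF_NAME };

    static const char *typedef_names[MAX_TYPEDEFS];
    static int n_typedefs = 0;

    /* called from the parser action that reduces "typedef <type> <name>;" */
    static void
    register_typedef(const char *name)
    {
        if (n_typedefs < MAX_TYPEDEFS)
            typedef_names[n_typedefs++] = name;
    }

    /* called by the lexer for every identifier it scans */
    static enum token
    classify_identifier(const char *word)
    {
        for (int i = 0; i < n_typedefs; i++)
            if (strcmp(typedef_names[i], word) == 0)
                return TOK_TYPEDEF_NAME; /* earlier input made this a "keyword" */
        return TOK_IDENT;
    }

    int
    main(void)
    {
        register_typedef("my_type"); /* as if we parsed: typedef int my_type; */
        return classify_identifier("my_type") == TOK_TYPEDEF_NAME ? 0 : 1;
    }

The same feedback loop would, in principle, let previously executed DDL
or a loaded module change how later tokens are classified without
rebuilding the whole grammar.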
Re: [HACKERS] FK locking concurrency improvement
Daniel Wood wrote:
> As part of 0ac5ad5134f2769ccbaefec73844f8504c4d6182 the permutations in
> test/isolation/fk-deadlock2.spec and elsewhere were removed. Is it the
> intent that these tests no longer do anything useful? I was expecting a
> failure in the test with some work I'm doing and was confused, after a
> merge from the upstream 9.3, that the test didn't fail until I noticed
> the test is no longer running the permutations.

No, you're misunderstanding. When the spec file specifies permutations,
then those are exactly the permutations that are run. When the spec does
not list any permutations, then the test driver runs all the possible
permutations.

This was made possible by a change to isolationtester so that an
"impossible" permutation does not cause the test to die; instead it
continues after reporting that the permutation is impossible. That way,
we ensure not only that the listed permutations are running and passing,
but also that the set of permutations that are possible does not change.
(An "impossible" permutation is one that requires running a command in a
session that is in a blocked state, something which is clearly
impossible.) In hindsight, we could have committed this change
separately.

> FYI, I saw some comments about adding fflush's into isolationtester.c.
> I ran into the same problem with debugging tests when they failed/hung
> in the middle. A simple "setbuf(stdout, NULL)" at the beginning of main
> gets rid of the problem where line buffering becomes block buffering
> when redirecting stdout to a file.

Interesting. I'm not sure it's worth messing with this now (given that
the current coding works everywhere), but if there's a strong reason to
do it that way we can certainly change it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
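To illustrate the convention Álvaro describes, here is a skeletal,
made-up spec file (not one from the tree). With the trailing permutation
lines present, exactly those interleavings run; delete them, and
isolationtester generates every possible interleaving, reporting as
invalid any that would require a blocked session to issue a new command:

    setup
    {
      CREATE TABLE parent (i int PRIMARY KEY);
      CREATE TABLE child (i int REFERENCES parent);
    }

    teardown
    {
      DROP TABLE child;
      DROP TABLE parent;
    }

    session "s1"
    setup        { BEGIN; }
    step "s1i"   { INSERT INTO parent VALUES (1); }
    step "s1c"   { COMMIT; }

    session "s2"
    setup        { BEGIN; }
    step "s2i"   { INSERT INTO child VALUES (1); }
    step "s2c"   { COMMIT; }

    # With these lines, only the listed interleavings run; without them,
    # the driver tries all of them and reports the impossible ones.
    permutation "s1i" "s1c" "s2i" "s2c"
    permutation "s1i" "s2i" "s1c" "s2c"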
Re: [HACKERS] Fast promotion failure
On 20.05.2013 22:18, Simon Riggs wrote:
> On 20 May 2013 18:47, Heikki Linnakangas wrote:
>> Not sure what the best fix would be. Perhaps change the code in
>> CreateRestartPoint() to do something like this instead:
>>
>> GetXLogReplayRecPtr(&replayTLI);
>> if (RecoveryInProgress())
>>     ThisTimeLineID = replayTLI;
>
> Installing a few extra WAL files doesn't seem to be a good reason to
> mess with such an important variable as ThisTimeLineID, especially
> since it's such a rare event and hardly worth optimising for.
>
> I would prefer it if we didn't set ThisTimeLineID at all in that way,
> or anywhere else. If we do, it needs to be done safely, otherwise any
> caller could make the same mistake.
>
> Suggested patch attached.

> @@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
>  		 * not be used, and will go wasted until recycled on the next
>  		 * restartpoint. We'll live with that.
>  		 */
> -		(void) GetXLogReplayRecPtr(&ThisTimeLineID);
> -
> -		RemoveOldXlogFiles(_logSegNo, endptr);
> -
> -		/*
> -		 * Make more log segments if needed. (Do this after recycling old
> -		 * log segments, since that may supply some of the needed files.)
> -		 */
>  		PreallocXlogFiles(endptr);
>  	}

That's essentially reverting commit 343ee00b7, resurrecting the bug that
that commit fixed.

> @@ -9098,10 +9092,10 @@ GetXLogReplayRecPtr(TimeLineID *replayTLI)
>  	SpinLockAcquire(&xlogctl->info_lck);
>  	recptr = xlogctl->lastReplayedEndRecPtr;
>  	tli = xlogctl->lastReplayedTLI;
> +	if (replayTLI && xlogctl->SharedRecoveryInProgress)
> +		*replayTLI = tli;
>  	SpinLockRelease(&xlogctl->info_lck);
>
> -	if (replayTLI)
> -		*replayTLI = tli;
>  	return recptr;
>  }

That breaks the only remaining caller that passes a replayTLI pointer,
GetStandbyFlushRecPtr(): *replayTLI points to a local variable in that
function, which is left uninitialized if
!xlogctl->SharedRecoveryInProgress.

- Heikki
Re: [HACKERS] Heap truncation without AccessExclusiveLock (9.4)
On 17.05.2013 12:35, Andres Freund wrote:
> On 2013-05-17 10:45:26 +0300, Heikki Linnakangas wrote:
>> On 16.05.2013 04:15, Andres Freund wrote:
>>> Couldn't we "just" take the extension lock and then walk backwards
>>> from the rechecked end of relation, ConditionalLockBufferForCleanup()
>>> the buffers? For every such locked page we check whether it's still
>>> empty. If we find a page that we couldn't lock, isn't empty, or we
>>> already locked a sufficient number of pages, we truncate.
>>
>> You need an AccessExclusiveLock on the relation to make sure that
>> after you have checked that pages 10-15 are empty, and truncated them
>> away, a backend doesn't come along a few seconds later and try to read
>> page 10 again. There might be an old sequential scan in progress, for
>> example, that thinks that the pages are still there.
>
> But that seems easily enough handled: We know the current page in its
> scan cannot be removed since it's pinned. So make
> heapgettup()/heapgetpage() pass something like RBM_IFEXISTS to
> ReadBuffer, and if the read fails, recheck the length of the relation
> before throwing an error.

Hmm. For the above to work, you'd need to atomically check that the
pages you're truncating away are not pinned, and truncate them. If those
steps are not atomic, a backend might pin a page after you've checked
that it's not pinned, but before you've truncated the underlying file. I
guess that'd be doable; it needs some new infrastructure in the buffer
manager, however.

> There isn't much besides seqscans that can have that behaviour afaics:
>
> - (bitmap)indexscans et al. won't point to completely empty pages
> - there cannot be a concurrent vacuum since we have the appropriate
>   locks
> - if a trigger or something else has a tid referencing a page, there
>   need to be unremovable tuples on it
>
> The only thing that I immediately see are tidscans, which should be
> handleable in a similar manner to seqscans.
>
> Sure, there are some callsites that need to be adapted but it still
> seems noticeably easier than what you proposed upthread.

Yeah. I'll think some more about how the required buffer manager changes
could be done.

- Heikki
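As a concrete reading of Andres's suggestion (RBM_IFEXISTS is
hypothetical; no such mode exists in bufmgr.h), here is a sketch of the
call pattern a sequential scan might use. The scan fields are from the
heap-scan code, but the fragment as a whole is illustrative, not a
working patch:

    /*
     * Hypothetical: ReadBufferExtended() grows an RBM_IFEXISTS mode that
     * returns InvalidBuffer instead of erroring out when the requested
     * block is past the current end of the file.
     */
    buf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, blkno,
                             RBM_IFEXISTS, scan->rs_strategy);
    if (!BufferIsValid(buf))
    {
        /* page vanished: recheck the relation length before complaining */
        BlockNumber nblocks = RelationGetNumberOfBlocks(scan->rs_rd);

        if (blkno >= nblocks)
        {
            /* relation was truncated under us: end the scan early */
            scan->rs_nblocks = nblocks;
            return;
        }
        elog(ERROR, "could not read block %u of relation \"%s\"",
             blkno, RelationGetRelationName(scan->rs_rd));
    }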
Re: [HACKERS] fast promotion and log_checkpoints
On 20 May 2013 20:06, Heikki Linnakangas wrote:
>> It would be possible to redesign this with a special new reason, or we
>> could just use "time" as the reason, or we could just leave it.
>>
>> Doing nothing is easy, though so are the others, so we can choose
>> anything we want. What do we want it to say?
>
> I'm not sure. Perhaps we should print "(no flags)", so that it wouldn't
> look like there's something missing in the log message.

The reason text would still be absent, so it wouldn't really help the
user interpret the log message correctly. I suggest we use
RequestCheckpoint(CHECKPOINT_CAUSE_TIME) instead, since it is literally
time for a checkpoint.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
> Stephen Frost writes:
>> Isn't this the same issue which has prompted multiple people to
>> propose (sometimes with code, as I recall) to rip out our internal
>> spinlock system and replace it with kernel-backed calls which do it
>> better, specifically by dealing with issues like the above? Have you
>> seen those threads in the past? Any thoughts about moving in that
>> direction?
>
> All of the proposals of that sort that I've seen had a flavor of
> "my OS is the only one that matters". While I don't object to
> platform-dependent implementations of spinlocks as such, they're not
> much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
> diff --git a/configure.in b/configure.in
> index 4ea5699..ff8470e 100644
> --- a/configure.in
> +++ b/configure.in
> @@ -1445,17 +1445,6 @@ fi
>  AC_CHECK_FUNCS([strtoll strtoq], [break])
>  AC_CHECK_FUNCS([strtoull strtouq], [break])
>
> -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
> -[AC_TRY_LINK([],
> -  [int lock = 0;
> -   __sync_lock_test_and_set(&lock, 1);
> -   __sync_lock_release(&lock);],
> -  [pgac_cv_gcc_int_atomics="yes"],
> -  [pgac_cv_gcc_int_atomics="no"])])
> -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
> -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have
> -  __sync_lock_test_and_set(int *) and friends.])
> -fi
> -

Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 20.05.2013 23:01, Bruce Momjian wrote:
> On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
>> Stephen Frost writes:
>>> Isn't this the same issue which has prompted multiple people to
>>> propose (sometimes with code, as I recall) to rip out our internal
>>> spinlock system and replace it with kernel-backed calls which do it
>>> better, specifically by dealing with issues like the above? Have you
>>> seen those threads in the past? Any thoughts about moving in that
>>> direction?
>>
>> All of the proposals of that sort that I've seen had a flavor of
>> "my OS is the only one that matters". While I don't object to
>> platform-dependent implementations of spinlocks as such, they're not
>> much of a cure for a generic performance issue.
>
> Uh, is this an x86-64-only optimization? Seems so.

All modern architectures have an atomic compare-and-swap instruction (or
something more powerful that can be used to implement it). That includes
x86, x86-64, ARM, and PowerPC, among others.

There are some differences in how wide a value can be swapped with it;
the 386 only supported 32-bit, until the Pentium, which added a 64-bit
variant. I used the 64-bit variant in the patch, but for lwlocks we
could actually get away with the 32-bit variant if we packed the
booleans and the shared lock counter more tightly.

- Heikki
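For readers unfamiliar with the builtin in question: gcc's
__sync_val_compare_and_swap returns the value it found at the address,
so the usual pattern is a read-modify-CAS retry loop. A sketch of
acquiring a shared lock by bumping a holder count packed into one 32-bit
word (the field layout here is invented for illustration; it is not the
patch's actual layout):

    #include <stdint.h>

    /* invented layout: low 16 bits = shared holders, bit 16 = exclusive */
    #define LW_EXCLUSIVE (1U << 16)

    static int
    lw_try_acquire_shared(volatile uint32_t *state)
    {
        uint32_t old_state, seen;

        do
        {
            old_state = *state;
            if (old_state & LW_EXCLUSIVE)
                return 0;  /* held exclusively; caller must queue/sleep */

            /* try to install old_state + 1; returns the value found */
            seen = __sync_val_compare_and_swap(state, old_state,
                                               old_state + 1);
        } while (seen != old_state);  /* lost the race: reread and retry */

        return 1;
    }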
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 20.05.2013 23:11, Alvaro Herrera wrote:
>> diff --git a/configure.in b/configure.in
>> index 4ea5699..ff8470e 100644
>> --- a/configure.in
>> +++ b/configure.in
>> @@ -1445,17 +1445,6 @@ fi
>>  AC_CHECK_FUNCS([strtoll strtoq], [break])
>>  AC_CHECK_FUNCS([strtoull strtouq], [break])
>>
>> -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
>> -[AC_TRY_LINK([],
>> -  [int lock = 0;
>> -   __sync_lock_test_and_set(&lock, 1);
>> -   __sync_lock_release(&lock);],
>> -  [pgac_cv_gcc_int_atomics="yes"],
>> -  [pgac_cv_gcc_int_atomics="no"])])
>> -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
>> -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have
>> -  __sync_lock_test_and_set(int *) and friends.])
>> -fi
>> -
>
> Careful here --- s_lock.h has some code conditional on
> HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
> removing the definition, unless I'm misreading.

Thanks, good catch. I renamed HAVE_GCC_INT_ATOMICS to
HAVE_GCC_INT_TEST_AND_SET, because "atomics" seems too generic when we
also test for __sync_val_compare_and_swap(p, oldval, newval).

- Heikki
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On Mon, May 20, 2013 at 11:16:41PM +0300, Heikki Linnakangas wrote:
> On 20.05.2013 23:01, Bruce Momjian wrote:
>> On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
>>> Stephen Frost writes:
>>>> Isn't this the same issue which has prompted multiple people to
>>>> propose (sometimes with code, as I recall) to rip out our internal
>>>> spinlock system and replace it with kernel-backed calls which do it
>>>> better, specifically by dealing with issues like the above? Have you
>>>> seen those threads in the past? Any thoughts about moving in that
>>>> direction?
>>>
>>> All of the proposals of that sort that I've seen had a flavor of
>>> "my OS is the only one that matters". While I don't object to
>>> platform-dependent implementations of spinlocks as such, they're not
>>> much of a cure for a generic performance issue.
>>
>> Uh, is this an x86-64-only optimization? Seems so.
>
> All modern architectures have an atomic compare-and-swap instruction
> (or something more powerful that can be used to implement it). That
> includes x86, x86-64, ARM, and PowerPC, among others.
>
> There are some differences in how wide a value can be swapped with it;
> the 386 only supported 32-bit, until the Pentium, which added a 64-bit
> variant. I used the 64-bit variant in the patch, but for lwlocks we
> could actually get away with the 32-bit variant if we packed the
> booleans and the shared lock counter more tightly.

So we are going to need to add this kind of assembly language
optimization for every CPU type?

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Fast promotion failure
On 20 May 2013 20:40, Heikki Linnakangas wrote:
> On 20.05.2013 22:18, Simon Riggs wrote:
>> On 20 May 2013 18:47, Heikki Linnakangas wrote:
>>> Not sure what the best fix would be. Perhaps change the code in
>>> CreateRestartPoint() to do something like this instead:
>>>
>>> GetXLogReplayRecPtr(&replayTLI);
>>> if (RecoveryInProgress())
>>>     ThisTimeLineID = replayTLI;
>>
>> Installing a few extra WAL files doesn't seem to be a good reason to
>> mess with such an important variable as ThisTimeLineID, especially
>> since it's such a rare event and hardly worth optimising for.
>>
>> I would prefer it if we didn't set ThisTimeLineID at all in that way,
>> or anywhere else. If we do, it needs to be done safely, otherwise any
>> caller could make the same mistake.
>>
>> Suggested patch attached.
>
>> @@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
>>  		 * not be used, and will go wasted until recycled on the next
>>  		 * restartpoint. We'll live with that.
>>  		 */
>> -		(void) GetXLogReplayRecPtr(&ThisTimeLineID);
>> -
>> -		RemoveOldXlogFiles(_logSegNo, endptr);
>> -
>> -		/*
>> -		 * Make more log segments if needed. (Do this after recycling old
>> -		 * log segments, since that may supply some of the needed files.)
>> -		 */
>>  		PreallocXlogFiles(endptr);
>>  	}
>
> That's essentially reverting commit 343ee00b7, resurrecting the bug
> that that commit fixed.

Hmm, OK. Then we should be using the correct value, not the one that
patch set. It certainly worked at that time, but not in a principled
way.

When we set the new timeline we should be updating all values that
might be used elsewhere. If we do that, then no matter when or how we
run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

Patch attached.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

set_lastReplayedTLI.v1.patch
Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)
On 16.05.2013 01:08, Daniel Farina wrote:
> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile,
>> when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
>> machine:
>>
>> -  64.09%  postgres  postgres           [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>             + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...
>>
>> So, on this test, a lot of time is wasted spinning on the mutex of
>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>> surprisingly steep drop in performance once you go beyond 29 clients
>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
>> theory is that after that point all the cores are busy, and processes
>> sometimes start to be context-switched while holding the spinlock,
>> which kills performance.
>
> I have; I also used linux perf to come to this conclusion, and my
> determination was similar: a system was undergoing increasingly heavy
> load, in this case with processes >> number of processors. It was also
> a phase-change type of event: at one moment everything would be going
> great, but once a critical threshold was hit, s_lock would consume an
> enormous amount of CPU time. I figured preemption while in the
> spinlock was to blame at the time, given the extreme nature.

Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

 #define TAS(lock) tas(lock)

+#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
+
 static __inline__ int
 tas(volatile slock_t *lock)
 {

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:

/*
 * On Opteron, using a non-locking test before the locking instruction
 * is a huge loss. On EM64T, it appears to be a wash or small loss,
 * so we needn't bother to try to distinguish the sub-architectures.
 */

On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:

    Spinning on volatile read vs. spin on lock attempt

    One common mistake made by developers developing their own spin-wait
    loops is attempting to spin on an atomic instruction instead of
    spinning on a volatile read. Spinning on a dirty read instead of
    attempting to acquire a lock consumes less time and resources. This
    allows an application to only attempt to acquire a lock when it is
    free.

Now, I'm not sure what to do about this. If we put the non-locking test
in there, according to the previous testing that would be a huge loss on
Opterons. Perhaps we should just sleep earlier, i.e. lower
MAX_SPINS_PER_DELAY. That way, even if each TAS_SPIN test is very
expensive, we don't spend too much time spinning if the lock is really
busy, or held by a process that is sleeping.
- Heikki
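The one-liner above is the classic "test and test-and-set" idiom. A
self-contained sketch of its effect (slock_t and tas() are stubbed here
with gcc builtins; the real definitions live in s_lock.h):

    typedef unsigned char slock_t;

    static inline int
    tas(volatile slock_t *lock)
    {
        /* locked xchg: always a write, so the cache line ping-pongs */
        return __sync_lock_test_and_set(lock, 1) != 0;
    }

    static inline void
    spin_acquire(volatile slock_t *lock)
    {
        /* TAS_SPIN: only attempt the atomic op when the lock looks free */
        while (*lock || tas(lock))
        {
            /*
             * Spin on a plain read while the lock is taken: the cache line
             * stays shared in our cache and generates no bus traffic until
             * the holder releases it. A real implementation would also
             * count iterations and eventually sleep (cf.
             * MAX_SPINS_PER_DELAY in s_lock.c).
             */
            while (*lock)
                ;
        }
    }

    static inline void
    spin_release(volatile slock_t *lock)
    {
        __sync_lock_release(lock);  /* write 0 with release semantics */
    }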
Re: [HACKERS] Add more regression tests for dbcommands
Hi,

Attached is an updated patch that does only one CREATE DATABASE and
reuses that database for all other tests. The code coverage with this
patch goes up from 36% to 70%.

--
Robins Tharakan

On 13 May 2013 21:04, Robins Tharakan wrote:
> I believe Tom, Andres and Fabien all have valid points.
>
> Net-net, I believe the tests, although non-negotiable, are not required
> to be in make-check. For now, it's the slow tests that are the pain
> point here, and I would soon try to prune them and commit once again.
>
> Whether it goes in make-check or not is obviously not my discretion,
> but to me their importance is undoubted, since I am sure they increase
> code-coverage. Actually, that is 'how' I create those tests: I look at
> untouched code and create new tests that check untouched code.
>
> Would try to revert with a faster script (preferably with minimal
> CREATE/DROP).
>
> --
> Robins Tharakan
>
> On 13 May 2013 20:30, Andres Freund wrote:
>> On 2013-05-13 16:52:08 +0200, Fabien COELHO wrote:
>>> Hello,
>>>
>>>>> Would you be okay if there is one/a few effective create/drop (some
>>>>> tests check that the create or drop fails, e.g. depending on
>>>>> permissions, which ISTM is not tested anywhere else), so that tests
>>>>> for various ALTER DATABASE commands are combined together onto
>>>>> these databases?
>>>>
>>>> TBH, I do not see that such tests are worth adding, if they are
>>>> going to significantly slow down the core regression tests. Those
>>>> tests are run probably hundreds of times a day, in aggregate across
>>>> all Postgres developers. Adding ten seconds or whatever this would
>>>> add is a major cost, while the benefit appears trivial.
>>>>
>>>> We could consider adding expensive low-value tests like these to
>>>> some alternate regression target that's only exercised by buildfarm
>>>> members, perhaps. But I think there's probably a point of
>>>> diminishing returns even in that context.
>>>
>>> I'm not sure that the tests are "low value", because a commit that
>>> would generate a failure on a permission-check test would be a
>>> potential security issue for Pg.
>>>
>>> As for the cost, if the proposed tests are indeed too costly, which
>>> is not necessarily the case from what I have seen, I do not think
>>> that it would be a great problem to have two sets of tests, one a
>>> superset of the other, with some convention.
>>
>> Well, tests like permission tests aren't the expensive part. The
>> actual CREATE/DROP DATABASE you do is. The latter is essentially
>> already tested by the buildfarm.
>> So, trimming the patch to do only the fast stuff should be less
>> controversial?
>>
>> Greetings,
>>
>> Andres Freund
>>
>> --
>> Andres Freund                 http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Training & Services

regress_db_v3.patch
[HACKERS] Removal of pageinspect--1.0.sql
Hi all,

The contrib module pageinspect has been upgraded to 1.1, but
pageinspect--1.0.sql is still present in the source tree. Shouldn't it
be removed? Please find a patch attached.

Thanks
--
Michael

20130521_pageinspect10_removal.patch
Re: [HACKERS] Proposal to add --single-row to psql
I have actually been working on the task discussed in this thread, most
relevant parts quoted below, for a while now, and hope to have something
concrete that you can use by the end of this summer.

My in-development Muldis D language is homoiconic as a core feature: its
source code is data to it, and the native syntax of the language
resembles an abstract syntax tree. This tree of nodes primarily defines
the behavior of the code, but it also supports arbitrary metadata per
node, which for example can be used to preserve concrete syntax for any
programming language that can be parsed into Muldis D nodes or,
conversely, generated from them.

For example, you can have one type of node defining a function, with its
details defined by its attributes or child nodes, such as its result
type, its parameters, whether it is declared associative/commutative/etc
or not, and the expression(s) defining its body. Another type of node
defines a call to a function, another type defines a text literal,
another a relation literal, and so on. Conversely, a node can define a
schema or package, etc. Example metadata, which is also structured,
could include line numbers, code comments, whitespace, exact numeric
literal formats, quoting formats, or exact keyword choices. Using these
node data types, we have a fairly normalized representation of any
source code (or data) that is easy to introspect or transform with code.

A key design of Muldis D is the clear separation of syntax and behavior.
A parser is just a function that takes (typically) a character string as
input and produces a node (tree) as output. A compiler is just a
function or operation that takes a node (tree) as input and produces
machine code. An optimizer can be a function that derives one node tree
from another, either as its own operation or, typically, as part of the
compiler stage. A compiler or optimizer generally can trivially ignore
the node metadata, but a code generator or debugger can use it; metadata
can be stripped without affecting behavior. The canonical tree form can
also easily be mapped losslessly to relational forms, such as typical
information schemas have.

Practically all behavior is defined in terms of generic type and routine
definitions. Practically all system-defined types and routines are
defined in one or more libraries/modules/packages that have the same
format as those users would write, like extensions. So all the
relational operators have the same syntax as, say, string or numeric or
array operators.

I envision that the most effective way to use Muldis D to handle an
arbitrary programming language, including the native SQL syntax+behavior
of each SQL DBMS, is to have a Muldis D library+parser pair for it. For
example, to represent PostgreSQL 9.2 most directly, we have a library
with an explicitly defined type and routine for every built-in that Pg
9.2 has, and we also have a parser function that takes the SQL syntax
that Pg 9.2 understands and produces a Muldis D node tree consisting of
calls to the routines of that library or value selectors of the types in
that library (things like SELECT and INSERT etc. are each mapped to a
routine too). That way, even with a standard node format that isn't
specific to a particular language or version, the code for parsing Pg
9.2 SQL has the minimal amount of indirection to deal with, as each
syntax element would have a direct library-call counterpart. Similarly,
the existing Pg 9.2 SQL compiler would have the least indirection to
take said nodes and execute them.
(The library would be named e.g. "Postgres_9_2", which makes it easier,
say, to also have one side-by-side for other versions, shims, or legacy
code you want to more easily support compatibility with.)

Where one decides to do cross-compilation, say to make Oracle SQL run on
Pg, that could be done as simply as defining a library for Oracle SQL
with the routines+types that it has, and then mapping it to a Pg one
just in terms of shim calls (which the compiler can optimize away as
applicable), and so parsers or compilers never necessarily have to deal
with behavior compatibility issues.

I am presently working out the exact set of such language nodes, and
making a reference implementation which is mostly self-defined and would
compile to Perl code, and hope to have the first Muldis D executable by
the end of this summer.

I am confident that an adaptation of this design into C or whatever
would serve Postgres greatly in letting it effectively parse multiple
languages, anywhere from application programming languages to multiple
SQL dialects or versions. Even if you don't decide to use my design
specifically (which is open source, and you can influence it), I think
you should find some of the general design principles I stated above to
be useful: representing behavior as libraries, and the AST being
flexible enough for any concrete language without being tied to any one
of them.
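A hedged sketch of the kind of uniform node representation being
described (all names here are invented; per the above, Muldis D's actual
node set is still being worked out), where behavior lives in the
children and concrete-syntax details live in ignorable metadata:

    typedef enum NodeKind
    {
        NODE_FUNC_DEF,      /* function definition: result type, params, body */
        NODE_FUNC_CALL,     /* call to a library-defined routine */
        NODE_TEXT_LIT,      /* text literal value selector */
        NODE_RELATION_LIT   /* relation literal value selector */
    } NodeKind;

    typedef struct NodeMeta
    {
        int         line_no;        /* source position */
        const char *comment;        /* preserved code comment, if any */
        const char *concrete_form;  /* exact original spelling/quoting */
    } NodeMeta;

    typedef struct Node
    {
        NodeKind      kind;
        const char   *name;         /* e.g. the routine being called */
        struct Node **children;     /* operands, parameters, body, ... */
        int           nchildren;
        NodeMeta     *meta;         /* strippable without changing behavior */
    } Node;

    /*
     * A parser is then just: Node *parse(const char *source_text);
     * a code generator walks the tree and may consult node->meta, while
     * a compiler or optimizer ignores node->meta entirely.
     */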
Re: [HACKERS] [PATCH] Correct release notes about DROP TABLE IF EXISTS and add, link.
Thanks, applied.

---

On Fri, May 17, 2013 at 03:12:21PM -0400, Joe Abbate wrote:
> Small release notes correction attached.
>
> Joe
>
> From 330f5af36ffdba8930ea2da8146e8f17e1ec8a68 Mon Sep 17 00:00:00 2001
> From: Joe Abbate
> Date: Fri, 17 May 2013 14:59:03 -0400
> Subject: [PATCH] Correct release notes about DROP TABLE IF EXISTS and add
>  link.
>
> ---
>  doc/src/sgml/release-9.3.sgml | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/doc/src/sgml/release-9.3.sgml b/doc/src/sgml/release-9.3.sgml
> index 519ea49..24e037b 100644
> --- a/doc/src/sgml/release-9.3.sgml
> +++ b/doc/src/sgml/release-9.3.sgml
> @@ -740,8 +740,9 @@
>
>
>
> -Allow DROP TABLE IF NOT EXISTS to succeed when a
> -non-existent schema is specified in the table name (Bruce Momjian)
> +Allow DROP TABLE IF
> +EXISTS to succeed when a non-existent schema is specified
> +in the table name (Bruce Momjian)
>
>
>
> --
> 1.7.10.4

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] [PATCH] Correct release notes about DROP TABLE IF EXISTS and add, link.
On 05/20/2013 10:38 PM, Bruce Momjian wrote:
> -Allow DROP TABLE IF NOT EXISTS to succeed when a
> -non-existent schema is specified in the table name (Bruce Momjian)

*snort*

This would be a rather pointless command!

cheers

andrew
Re: [HACKERS] [PATCH] Correct release notes about DROP TABLE IF EXISTS and add, link.
On Mon, May 20, 2013 at 10:55:19PM -0400, Andrew Dunstan wrote:
> On 05/20/2013 10:38 PM, Bruce Momjian wrote:
>> -Allow DROP TABLE IF NOT EXISTS to succeed when a
>> -non-existent schema is specified in the table name (Bruce Momjian)
>
> *snort*
>
> This would be a rather pointless command!

Yeah. :-) The patch added markup, and fixed the command text.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Fast promotion failure
On 21.05.2013 00:00, Simon Riggs wrote:
> When we set the new timeline we should be updating all values that
> might be used elsewhere. If we do that, then no matter when or how we
> run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -5838,8 +5838,16 @@ StartupXLOG(void)
>  	}
>
>  	/* Save the selected TimeLineID in shared memory, too */
> -	XLogCtl->ThisTimeLineID = ThisTimeLineID;
> -	XLogCtl->PrevTimeLineID = PrevTimeLineID;
> +	{
> +		/* use volatile pointer to prevent code rearrangement */
> +		volatile XLogCtlData *xlogctl = XLogCtl;
> +
> +		SpinLockAcquire(&xlogctl->info_lck);
> +		XLogCtl->ThisTimeLineID = ThisTimeLineID;
> +		XLogCtl->lastReplayedTLI = ThisTimeLineID;
> +		XLogCtl->PrevTimeLineID = PrevTimeLineID;
> +		SpinLockRelease(&xlogctl->info_lck);
> +	}

Hmm. lastReplayedTLI is supposed to be the timeline of the last record
that was replayed, i.e. the timeline corresponding to
lastReplayedEndRecPtr. I think it would work, but it feels like it could
be an easy source of bugs in the future.

It might be a good idea to change walsender to not store that in
ThisTimeLineID, though. It used to make sense for ThisTimeLineID to
track the current recovery timeline in 9.2 and below, but now that
walsender can be sending any older timeline, using ThisTimeLineID to
store the latest one seems confusing.

- Heikki