Re: [HACKERS] InvokeObjectPostAlterHook() vs. CommandCounterIncrement()

2013-07-21 Thread Ants Aasma
On Jul 21, 2013 4:06 AM, "Noah Misch"  wrote:
> If these hooks will need to apply to a larger operation, I
> think that mandates a different means to reliably expose the before/after
> object states.

I haven't checked the code to see how it would fit the API, but what about
taking a snapshot before altering the object and passing it to the hook?
Would there be other issues besides performance? If the snapshot is taken
only when a hook is present, then the performance can be fixed later.
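For illustration, a toy C sketch of that shape — copy the object's state only when a hook is registered, then hand the before/after pair to the hook. All of the types and names here are invented for the sketch; they are not PostgreSQL's.

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for an object's catalog state. */
typedef struct ObjectState { char name[64]; int flags; } ObjectState;

typedef void (*post_alter_hook_type)(const ObjectState *before,
                                     const ObjectState *after);

static post_alter_hook_type object_post_alter_hook = NULL;

/* Alter the object; copy its prior state only when a hook is present,
 * so the common no-hook path pays no snapshot cost. */
static void alter_object(ObjectState *obj, int new_flags)
{
    ObjectState before;
    int have_hook = (object_post_alter_hook != NULL);

    if (have_hook)
        before = *obj;          /* the "snapshot" */

    obj->flags = new_flags;     /* the actual alteration */

    if (have_hook)
        object_post_alter_hook(&before, obj);
}

/* Example hook that records whether it observed a state change. */
static int hook_saw_change = 0;
static void demo_hook(const ObjectState *before, const ObjectState *after)
{
    hook_saw_change = (before->flags != after->flags);
}
```

The point of the `have_hook` test is exactly the performance escape hatch above: with no hook installed, no copy is ever taken.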

Regards,
Ants Aasma


[HACKERS] Wal sync odirect

2013-07-21 Thread Миша Тюрин
Hi, list. Here is my proposal: I would like to suggest O_DIRECT for WAL sync
when wal_level is higher than minimal. In my case, WAL traffic is up to 1 GB
per 2-3 minutes, but the disk hardware has a 2 GB BBU cache (or maybe an SSD
could sit under the WAL), so it would be better if WAL traffic could not harm
OS memory eviction. I do not use streaming replication, and my archive
command could read WAL directly, bypassing the OS cache. Just an opinion; I
have not done any tests yet, but I am still seeing a memory eviction anomaly.
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Auto explain target tables

2013-07-21 Thread Миша Тюрин

Hi, list, again. Here is another proposal, for auto_explain: one would be
happy to be able to set a list of target tables and indexes. Sometimes it is
very hard to detect who is using your indexes, but turning on full logging
under thousands of transactions per second does not seem like a nice idea
because of the size of the resulting log files (CPU utilization might not be
so critical).
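For illustration only, a filter like that might be configured roughly as follows. Note that no such auto_explain GUC exists; `auto_explain.target_relations` below is an invented name for the proposed setting.

```
# postgresql.conf -- hypothetical sketch, not actual auto_explain settings
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 0

# imagined filter: only log plans that touch these relations/indexes
auto_explain.target_relations = 'orders, orders_pkey, orders_customer_idx'
```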


Re: [HACKERS] Preventing tuple-table leakage in plpgsql

2013-07-21 Thread Noah Misch
On Thu, Jul 11, 2013 at 09:14:38PM -0400, Chad Wagner wrote:
> It looks like to me when AtEOSubXact_SPI is called the
> _SPI_current->connectSubId is always 1 (since it is only set when
> SPI_connect is called, which is only once for plpgsql), but the
> CurrentSubTransactionId is incremented each time a subtransaction is
> started.

Right.  AtEOSubXact_SPI() cleans up any SPI connections originating in the
ending subtransaction.  It leaves alone connections from higher subtransaction
levels; SPI has no general expectation that those have lost relevance.

> As a result, the memory for procCxt is only freed when I presume the
> TopTransaction is aborted or committed.

In your code from bug #8279, I expect it to be freed when the DO block exits.
The backend might not actually shrink then, but repeated calls to a similar DO
block within the same transaction should not cause successive increases in the
process's memory footprint.

> Should SPI_connect be called again after the subtransaction is created?  And
>  SPI_finish before the subtransaction is committed or aborted?

Hmm.  An SPI_push()+SPI_connect() every time PL/pgSQL starts a subtransaction
would be another way to fix it, yes.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Preventing tuple-table leakage in plpgsql

2013-07-21 Thread Tom Lane
Noah Misch  writes:
> On Thu, Jul 11, 2013 at 09:14:38PM -0400, Chad Wagner wrote:
>> Should SPI_connect be called again after the subtransaction is created?  And
>> SPI_finish before the subtransaction is committed or aborted?

> Hmm.  An SPI_push()+SPI_connect() every time PL/pgSQL starts a subtransaction
> would be another way to fix it, yes.

That sounds like a dangerous idea to me.  The procedure would then be
working actively with queries from two different SPI levels, which I'm
pretty sure would cause issues.  It's possible that plpgsql's SPI access
is sufficiently lexically-local that statements within the BEGIN block
couldn't use any SPI resources created by statements outside it nor vice
versa.  But then again maybe not, and in any case we couldn't imagine
that that would be a workable restriction for non-plpgsql scenarios.

regards, tom lane




Re: [HACKERS] Preventing tuple-table leakage in plpgsql

2013-07-21 Thread Noah Misch
On Fri, Jul 19, 2013 at 07:34:14PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Fri, Jul 05, 2013 at 02:47:06PM -0400, Tom Lane wrote:
> >> So I'm inclined to propose that SPI itself should offer some mechanism
> >> for cleaning up tuple tables at subtransaction abort.  We could just
> >> have it automatically throw away tuple tables made in the current
> >> subtransaction, or we could allow callers to exercise some control,
> >> perhaps by calling a function that says "don't reclaim this tuple table
> >> automatically".  I'm not sure if there's any real use-case for such a
> >> call though.
> 
> > I suppose that would be as simple as making spi_dest_startup() put the
> > tuptabcxt under CurTransactionContext?  The function to prevent reclamation
> > would then just do a MemoryContextSetParent().
> 
> I experimented with this, and found out that it's not quite that simple.
> In a SPI procedure that hasn't created a subtransaction (eg, a plpgsql
> function without an exception block), if we attach tuple tables to the
> outer transaction's CurTransactionContext then they fail to go away
> during SPI_finish(), creating leaks in code that was previously correct.
> 
> However, we can use your idea when running inside a subtransaction,
> while still attaching the tuple table to the procedure's own procCxt
> when no subtransaction is involved.  The attached draft patch does it
> that way, and successfully resolves the complained-of leakage case.
> 
> I like this better than what I originally had in mind, because it
> produces no change in semantics in the case where a SPI procedure
> doesn't create any subtransactions, which I imagine is at least 99.44%
> of third-party SPI-using code.

Reasonable enough.  Code that does use subtransactions will need to be more
careful than before to manually free tuple tables in the non-error case.
Failure to do so has been creating a leak that lasts until SPI_finish(), but
it will now be able to cause a transaction-lifespan leak.

> patch's changes to remove SPI_freetuptable() calls in
> plpy_cursorobject.c are not actually necessary for correctness, it's
> just that we no longer need them.

If PLy_spi_subtransaction_commit() were to throw an error (granted, unlikely),
would we not reference freed memory at those code sites as they stand today?

> Unfortunately, the change in pl_exec.c *is* necessary
> to avoid a coredump, because that call was being made after rolling back
> the subxact.

Brief search for similar patterns in external PLs:

plr - no subtransaction use
plv8 - no SPI_freetuptable()
plphp - uses both, but usage looks compatible
pljava - calls "SPI_freetuptable(SPI_tuptable)", but never a tuptable pointer
  it stored away.  Should be compatible, then.

> All in all, the risk of breaking anything outside core code seems very
> small, and I'm inclined to think that we don't need to provide an API
> function to reparent a tuple table.  But having said that, I'm also
> inclined to not back-patch this further than 9.3, just in case.

I wouldn't be confident in back-patching further than that.  It's not hard to
imagine a non-core PL needing a compensating change like the one you made to
PL/pgSQL.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] InvokeObjectPostAlterHook() vs. CommandCounterIncrement()

2013-07-21 Thread Noah Misch
On Sun, Jul 21, 2013 at 11:44:51AM +0300, Ants Aasma wrote:
> On Jul 21, 2013 4:06 AM, "Noah Misch"  wrote:
> > If these hooks will need to apply to a larger operation, I
> > think that mandates a different means to reliably expose the before/after
> > object states.
> 
> I haven't checked the code to see how it would fit the API, but what about
> taking a snapshot before altering the object and passing it to the hook?
> Would there be other issues besides performance? If the snapshot is taken
> only when a hook is present, then the performance can be fixed later.

That would work.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Preventing tuple-table leakage in plpgsql

2013-07-21 Thread Tom Lane
Noah Misch  writes:
> On Fri, Jul 19, 2013 at 07:34:14PM -0400, Tom Lane wrote:
>> However, we can use your idea when running inside a subtransaction,
>> while still attaching the tuple table to the procedure's own procCxt
>> when no subtransaction is involved.  The attached draft patch does it
>> that way, and successfully resolves the complained-of leakage case.
>> 
>> I like this better than what I originally had in mind, because it
>> produces no change in semantics in the case where a SPI procedure
>> doesn't create any subtransactions, which I imagine is at least 99.44%
>> of third-party SPI-using code.

> Reasonable enough.  Code that does use subtransactions will need to be more
> careful than before to manually free tuple tables in the non-error case.
> Failure to do so has been creating a leak that lasts until SPI_finish(), but
> it will now be able to cause a transaction-lifespan leak.

Hmm ... good point.  The other plan I'd been considering was to add
explicit tracking inside spi.c of all tuple tables created within the
current procedure, and then have AtEOSubXact_SPI flush any that were
created inside a failed subxact.  The tables would still be children of
the procCxt and thus could not be leaked past SPI_finish.  When you
suggested attaching to subtransaction contexts I thought that would let
us get away without this additional bookkeeping logic, but maybe we
should bite the bullet and add the extra logic.  A change that's meant
to remove leak risks really shouldn't be introducing other, new leak
risks.  (An additional advantage is we could detect attempts to free
the same tuptable more than once, which would be a good thing ...)
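A toy sketch of that bookkeeping (invented names, not PostgreSQL code): each tuple table is registered with the subtransaction that created it, abort flushes exactly those entries, and an attempt to free the same table twice is detected.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_TUPTABLES 16

typedef struct ToyTupTable
{
    int   created_in_subxact;  /* subtransaction id at creation */
    int   freed;               /* guards against a double free */
    void *data;
} ToyTupTable;

static ToyTupTable *registry[MAX_TUPTABLES];
static int nregistered = 0;

static ToyTupTable *toy_make_tuptable(int subxact_id)
{
    ToyTupTable *tt = calloc(1, sizeof(*tt));

    tt->created_in_subxact = subxact_id;
    tt->data = malloc(64);
    registry[nregistered++] = tt;
    return tt;
}

/* Returns 0 on success, -1 if the table was already freed. */
static int toy_free_tuptable(ToyTupTable *tt)
{
    if (tt->freed)
        return -1;             /* double free detected */
    free(tt->data);
    tt->freed = 1;
    return 0;
}

/* Analogue of AtEOSubXact_SPI on abort: reclaim tables created in the
 * failed subtransaction, leave older ones alone.  Returns the number
 * of tables flushed. */
static int toy_at_subxact_abort(int aborted_subxact_id)
{
    int nfreed = 0;

    for (int i = 0; i < nregistered; i++)
    {
        ToyTupTable *tt = registry[i];

        if (!tt->freed && tt->created_in_subxact == aborted_subxact_id)
            nfreed += (toy_free_tuptable(tt) == 0);
    }
    return nfreed;
}
```

The tables would still live under the procedure's procCxt, so nothing here can outlive SPI_finish; the registry only adds the per-subxact flush and the double-free check.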

>> patch's changes to remove SPI_freetuptable() calls in
>> plpy_cursorobject.c are not actually necessary for correctness, it's
>> just that we no longer need them.

> If PLy_spi_subtransaction_commit() were to throw an error (granted, unlikely),
> would we not reference freed memory at those code sites as they stand today?

Hm, possibly, depending on just when the error was thrown.

regards, tom lane




[HACKERS] REINDEX checking of index constraints

2013-07-21 Thread Noah Misch
Historically, REINDEX would always revalidate any uniqueness enforced by the
index.  An EDB customer reported that this is not happening, and indeed I
broke it way back in commit 8ceb24568054232696dddc1166a8563bc78c900a.
Specifically, REINDEX TABLE and REINDEX DATABASE no longer revalidate
constraints, but REINDEX INDEX still does so.  As a consequence, REINDEX INDEX
is the only form of REINDEX that fixes a failed CREATE INDEX CONCURRENTLY.

Attached patch just restores the old behavior.  Would it be worth preserving
the ability to fix an index consistency problem with a REINDEX independent
from related heap consistency problems such as duplicate keys?

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
***
*** 1768,1774  ReindexTable(RangeVar *relation)
	heapOid = RangeVarGetRelidExtended(relation, ShareLock, false, false,
									   RangeVarCallbackOwnsTable, NULL);
  
!   if (!reindex_relation(heapOid, REINDEX_REL_PROCESS_TOAST))
ereport(NOTICE,
(errmsg("table \"%s\" has no indexes",
relation->relname)));
--- 1768,1776 
	heapOid = RangeVarGetRelidExtended(relation, ShareLock, false, false,
									   RangeVarCallbackOwnsTable, NULL);
  
! 	if (!reindex_relation(heapOid,
! 						  REINDEX_REL_PROCESS_TOAST |
! 						  REINDEX_REL_CHECK_CONSTRAINTS))
ereport(NOTICE,
(errmsg("table \"%s\" has no indexes",
relation->relname)));
***
*** 1884,1890  ReindexDatabase(const char *databaseName, bool do_system, 
bool do_user)
StartTransactionCommand();
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
!   if (reindex_relation(relid, REINDEX_REL_PROCESS_TOAST))
	ereport(NOTICE,
			(errmsg("table \"%s.%s\" was reindexed",
					get_namespace_name(get_rel_namespace(relid)),
--- 1886,1894 
StartTransactionCommand();
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
! 	if (reindex_relation(relid,
! 						 REINDEX_REL_PROCESS_TOAST |
! 						 REINDEX_REL_CHECK_CONSTRAINTS))
	ereport(NOTICE,
			(errmsg("table \"%s.%s\" was reindexed",
					get_namespace_name(get_rel_namespace(relid)),
*** a/src/test/regress/expected/create_index.out
--- b/src/test/regress/expected/create_index.out
***
*** 2298,2306  COMMIT;
  BEGIN;
  CREATE INDEX std_index on concur_heap(f2);
  COMMIT;
! -- check to make sure that the failed indexes were cleaned up properly and the
! -- successful indexes are created properly. Notably that they do NOT have the
! -- "invalid" flag set.
  \d concur_heap
  Table "public.concur_heap"
   Column | Type | Modifiers 
--- 2298,2310 
  BEGIN;
  CREATE INDEX std_index on concur_heap(f2);
  COMMIT;
! -- Failed builds are left invalid by VACUUM FULL, fixed by REINDEX
! VACUUM FULL concur_heap;
! REINDEX TABLE concur_heap;
! ERROR:  could not create unique index "concur_index3"
! DETAIL:  Key (f2)=(b) is duplicated.
! DELETE FROM concur_heap WHERE f1 = 'b';
! VACUUM FULL concur_heap;
  \d concur_heap
  Table "public.concur_heap"
   Column | Type | Modifiers 
***
*** 2316,2321  Indexes:
--- 2320,2341 
  "concur_index5" btree (f2) WHERE f1 = 'x'::text
  "std_index" btree (f2)
  
+ REINDEX TABLE concur_heap;
+ \d concur_heap
+ Table "public.concur_heap"
+  Column | Type | Modifiers 
+ --------+------+-----------
+  f1 | text | 
+  f2 | text | 
+ Indexes:
+ "concur_index2" UNIQUE, btree (f1)
+ "concur_index3" UNIQUE, btree (f2)
+ "concur_heap_expr_idx" btree ((f2 || f1))
+ "concur_index1" btree (f2, f1)
+ "concur_index4" btree (f2) WHERE f1 = 'a'::text
+ "concur_index5" btree (f2) WHERE f1 = 'x'::text
+ "std_index" btree (f2)
+ 
  --
  -- Try some concurrent index drops
  --
*** a/src/test/regress/sql/create_index.sql
--- b/src/test/regress/sql/create_index.sql
***
*** 721,730  BEGIN;
  CREATE INDEX std_index on concur_heap(f2);
  COMMIT;
  
! -- check to make sure that the failed indexes were cleaned up properly and

Re: [HACKERS] REINDEX checking of index constraints

2013-07-21 Thread Josh Berkus
Noah,

> Attached patch just restores the old behavior.  Would it be worth preserving
> the ability to fix an index consistency problem with a REINDEX independent
> from related heap consistency problems such as duplicate keys?

I would love to have two versions of REINDEX, one which validated and
one which didn't.   Maybe a ( validate off ) type check?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] REINDEX checking of index constraints

2013-07-21 Thread Josh Berkus
On 07/21/2013 11:30 AM, Josh Berkus wrote:
> Noah,
> 
>> Attached patch just restores the old behavior.  Would it be worth preserving
>> the ability to fix an index consistency problem with a REINDEX independent
>> from related heap consistency problems such as duplicate keys?
> 
> I would love to have two versions of REINDEX, one which validated and
> one which didn't.   Maybe a ( validate off ) type check?

Cancel this.  I just did some tests, and the amount of time required
for the validation (at least, in a simple two-column table test) is < 10%
of the time required to reindex in general.  At that difference, we
don't need two options.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-21 Thread Markus Wanner
Salut Dimitri,

On 07/20/2013 01:23 AM, Dimitri Fontaine wrote:
> Markus Wanner  writes:
>>>   - per-installation (not even per-cluster) DSO availability
>>>
>>> If you install PostGIS 1.5 on a system, then it's just impossible to
>>> bring another cluster (of the same PostgreSQL major version), let
>> On Debian, that should be well possible. Certainly installing Postgres
>> 9.1 w/ postgis-1.5 in parallel to Postgres 9.2 w/ postgis-2.0 is. I
>> designed it to be.
>>
>> I think I'm misunderstanding the problem statement, here.
> 
> (of the same PostgreSQL major version)

Not sure what the issue is here, but I agree that should be possible.

>> Can CREATE EXTENSION check if the standbys have the extension installed?
>> And refuse creation, if they don't?
> 
> No, because we don't register standbys, so we don't know who they are,
> and also because some things that we see connected and using the
> replication protocol could well be pg_basebackup or pg_receivexlog.

Can the standby check? In any case, these seem to be problems we can
solve without affecting security.

> Also, it's possible that the standby is only there for High Availability
> purposes and runs no user query.

Requiring the sysadmin to install the extensions there, too, seems
justified to me. Sounds like good advice, anyways.

>> I'm sure you are aware that even without this clear separation of roles,
>> the guarantee means we provide an additional level of security against
>> attackers.
> 
> Given lo_import() to upload a file from the client to the server then
> LOAD with the absolute path where the file ended up imported (or any
> untrusted PL really), this argument carries no sensible weight in my
> opinion.

lo_import() won't write a file for LOAD to load. An untrusted PL (or any
other extension allowing the superuser to do that) is currently required
to do that.

Or to put it another way: Trusted PLs exist for a good reason. And some
people just value security a lot and want that separation of roles.

>> None the less, the "safe by default" has served us well, I think.
> 
> That's true. We need to consider carefully the proposal at hand though.
> 
> It's all about allowing the backend to automatically load a file that it
> finds within its own $PGDATA so that we can have per-cluster and
> per-database modules (DSO files).

As someone mentioned previously, $PGDATA may well be mounted noexec, so
that seems to be a bad choice.

> The only difference with before is the location where the file is read
> from, and the main security danger comes from the fact that we used to
> only consider root-writable places and now propose to consider postgres
> bootstrap user writable places.

FWIW, I only proposed to let postgres check write permissions on
libraries it loads. IIUC we don't currently do that, yet. And Postgres
happily loads a world-writable library, ATM.
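A minimal sketch of such a check (plain POSIX, not code from PostgreSQL, which currently performs no such check): stat() the library before loading it and refuse files that are group- or world-writable.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

/* Returns 1 if the file looks safe to load (not group/world-writable),
 * 0 if its permissions are too loose, -1 if stat() fails.  A real
 * implementation would run this just before dlopen(). */
static int library_perms_ok(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (st.st_mode & (S_IWGRP | S_IWOTH))
        return 0;              /* world- or group-writable: refuse */
    return 1;
}
```

One could argue about also checking ownership (e.g. require root or the bootstrap user as owner), but the mode check alone already rejects the world-writable case mentioned above.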

> Having the modules in different places in the system when it's a
> template and when it's instantiated allows us to solve a problem I
> forgot to list:
> 
>   - upgrading an extension at the OS level
> 
> Once you've done that, any new backend will load the newer module
> (DSO file), so you have to be real quick if installing an hot fix in
> production and the SQL definition must be changed to match the new
> module version…

I agree, that's a problem.

Alternatively, we could solve that problem the other way around: Rather
than template-ify the DSO, we could instead turn the objects created by
the SQL scripts into something that's more linked to the script.
Something that would reload as soon as the file on disk changes.

(Note how this would make out-of-line extensions a lot closer to the
in-line variant your recent patch adds? With the dependency between
"template" and "instantiation"?)

> With the ability to "instantiate" the module in a per-cluster,
> per-database directory within $PGDATA, the new version of the DSO module
> would only be put in place and loaded at ALTER EXTENSION UPDATE time.
> 
> I'm still ok with allowing to fix those problems only when a security
> option that defaults to 'false' has been switched to 'true', by the way,
> so that it's an opt-in,

Okay, good.

For the issues you raised, I'd clearly prefer fixes that maintain
current security standards, though.

> but I will admit having a hard time swallowing
> the threat model we're talking about…

An attacker having access to a libpq connection with superuser rights
cannot currently run arbitrary native code. He can try a DOS by
exhausting system resources, but that's pretty far from being
invisible. Or he can delete valuable data. Maybe other nasty things. But
he should not be able to gain root access and remove its traces.

Dropping this barrier by installing an untrusted PL (or equally insecure
extensions), an attacker with superuser rights can trivially gain
root.

Of course, an attacker shouldn't gain superuser rights in the first
place. But if he did, you better stop him right there wi

Re: [HACKERS] REINDEX checking of index constraints

2013-07-21 Thread Josh Berkus
On 07/21/2013 11:30 AM, Josh Berkus wrote:
> Noah,
> 
>> Attached patch just restores the old behavior.  Would it be worth preserving
>> the ability to fix an index consistency problem with a REINDEX independent
>> from related heap consistency problems such as duplicate keys?
> 
> I would love to have two versions of REINDEX, one which validated and
> one which didn't.   Maybe a ( validate off ) type check?

Cancel this.  I just did some tests, and the amount of time required
for the validation (at least, in a simple two-column table test) is < 10%
of the time required to reindex in general.  At that difference, we
don't need two options.

Unless you're asking if we want a command to check the index validity
without rebuilding it?  That might be more valuable ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-21 Thread Hannu Krosing
On 07/21/2013 10:30 PM, Markus Wanner wrote:
>> but I will admit having a hard time swallowing
>> the threat model we're talking about…
> An attacker having access to a libpq connection with superuser rights
> cannot currently run arbitrary native code. He can try a DOS by
> exhausting system resources, but that's pretty far from being
> invisible. Or he can delete valuable data. Maybe other nasty things. But
> he should not be able to gain root access and remove its traces.
>
> Dropping this barrier by installing an untrusted PL (or equally insecure
> extensions), an attacker with superuser rights can trivially gain
> root.
Could you elaborate?

This is equivalent to claiming that any Linux user can trivially gain root.

>>> If the sysadmin wants to disallow arbitrary execution of native code to
>>> postgres (the process), any kind of embedded compiler likely is equally
>>> unwelcome.
>> You already mentioned untrusted PL languages, and I don't see any
>> difference in between offering PL/pythonu and PL/C on security grounds,
>> really.
> I agree. However, this also means that any kind of solution it offers is
> not a good one for the security conscious sysadmin.
This is usually the case with a "security conscious sysadmin" - they very
seldom want to install anything.

A "cloud style" solution to this problem is installing the whole PostgreSQL
host in its own VM and delegating all security to the developers ;)
>
> Regards
>
> Markus Wanner
>
>


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-07-21 Thread Tatsuo Ishii
Greg,

> Yes, I already took at look at it briefly.  The updates move in the
> right direction, but I can edit them usefully before commit.  I'll
> have that done by tomorrow and send out a new version.  I'm hopeful
> that v18 will finally be the one that everyone likes.

Have you done it?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-21 Thread didier
On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith  wrote:

> On 7/20/13 4:48 AM, didier wrote:
>
>> With your tests did you try to write the hot buffers first? ie buffers
>> with a high  refcount, either by sorting them on refcount or at least
>> sweeping the buffer list in reverse?
>>
>
> I never tried that version.  After a few rounds of seeing that all changes
> I tried were just rearranging the good and bad cases, I got pretty bored
> with trying new changes in that same style.
>
>
>> by writing to the OS the less likely to be recycle buffers first it may
>> have less work to do at fsync time, hopefully they have been written by
>> the OS background task during the spread and are not re-dirtied by other
>> backends.
>>
>
> That is the theory.  In practice write caches are so large now, there is
> almost no pressure forcing writes to happen until the fsync calls show up.
>  It's easily possible to enter the checkpoint fsync phase only to discover
> there are 4GB of dirty writes ahead of you, ones that have nothing to do
> with the checkpoint's I/O.
>
> Backends are constantly pounding the write cache with new writes in
> situations with checkpoint spikes.  The writes and fsync calls made by the
> checkpoint process are only a fraction of the real I/O going on. The volume
> of data being squeezed out by each fsync call is based on total writes to
> that relation since the checkpoint.  That's connected to the writes to that
> relation happening during the checkpoint, but the checkpoint writes can
> easily be the minority there.
>
> It is not a coincidence that the next feature I'm working on attempts to
> quantify the total writes to each 1GB relation chunk.  That's the most
> promising path forward on the checkpoint problem I've found.
>
>
> --
> Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
>


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-21 Thread didier
Hi,

On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith  wrote:

> That is the theory.  In practice write caches are so large now, there is
> almost no pressure forcing writes to happen until the fsync calls show up.
>  It's easily possible to enter the checkpoint fsync phase only to discover
> there are 4GB of dirty writes ahead of you, ones that have nothing to do
> with the checkpoint's I/O.

Isn't adding another layer of cache the usual answer?

The best would be in the OS: a filesystem with a big journal, able to write
a lot of blocks sequentially.

If not, and if you can spare at worst 2 bits in memory per data block, don't
mind preallocated data files (assuming metadata are stable then), and have
working mmap(MAP_NONBLOCK) and mincore() syscalls, you could have a
checkpoint in bounded time; worst case, you sequentially write the whole
server's RAM to a separate disk every checkpoint.
Not sure I would trust such a beast with my data though :)


Didier