Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging a problem today that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Hmm, so what exactly is the sequence of events here?  It's possible
that I'm not thinking clearly just now, but it seems to me that if
we're replaying the same checkpoint we replayed previously, the offset
of the oldest multixact will be the first file that we didn't remove.
However, I can see that there could be a problem if we try to replay
an older checkpoint after having already replayed a new one - for
example, if a standby replays checkpoint A truncating the members
multixact and performs a restartpoint, and then replays checkpoint B
truncating the members multixact again but without performing a
restartpoint, and then is shut down, it will resume replay from
checkpoint A, and trouble will ensue.  Is that the scenario, or is
there something else?

> I think the fix to this is to verify whether the file exists on disk
> before reading it; if it doesn't, assume the truncation has already
> happened and that it's not necessary to remove it.

That might be an OK fix, but this implementation doesn't seem very
clean.  If we're going to remove the invariant that
MultiXactState->oldestOffset will always be valid after replaying a
checkpoint, then we should be explicit about that and add a flag
indicating whether or not it's currently valid.  Shoving nextOffset in
there and hoping that's good enough seems like a bad idea to me.

I think we should modify the API for find_multixact_start.  Let's have
it return a Boolean and return oldestOffset via an out parameter.  If
!InRecovery, it will always return true and set the out parameter; but
if in recovery, it is allowed to return false without setting the out
parameter.  Both values can get stored in MultiXactState, and we can
adjust the logic elsewhere to disregard oldestOffset when the
accompanying flag is false.
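
In code form, the proposal is roughly this (a sketch only; the flag
name is illustrative, not settled):

    static bool find_multixact_start(MultiXactId multi,
                                     MultiXactOffset *result);

    /* ...and in MultiXactStateData: */
    MultiXactOffset oldestOffset;      /* meaningful only when flag is set */
    bool            oldestOffsetValid; /* may be false while in recovery */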

This still leaves open an ugly possibility: can we reach normal
running without a valid oldestOffset?  If so, until the next
checkpoint happens, autovacuum has no clue whether it needs to worry.
There's got to be a fix for that, but it escapes me at the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Robert Haas
On Wed, May 27, 2015 at 10:14 PM, Alvaro Herrera wrote:
> Well I'm not very clear on what's the problematic case.  The scenario I
> actually saw this first reported was a pg_basebackup taken on a very
> large database, so the master could have truncated multixact and the
> standby receives a truncated directory but actually tries to apply a
> checkpoint that is much older than what the master currently has
> transmitted as pg_multixact contents.

OK, that makes sense.

>> That might be an OK fix, but this implementation doesn't seem very
>> clean.  If we're going to remove the invariant that
>> MultiXactState->oldestOffset will always be valid after replaying a
>> checkpoint, then we should be explicit about that and add a flag
>> indicating whether or not it's currently valid.  Shoving nextOffset in
>> there and hoping that's good enough seems like a bad idea to me.
>>
>> I think we should modify the API for find_multixact_start.  Let's have
>> it return a Boolean and return oldestOffset via an out parameter.  If
>> !InRecovery, it will always return true and set the out parameter; but
>> if in recovery, it is allowed to return false without setting the out
>> parameter.  Both values can get stored in MultiXactState, and we can
>> adjust the logic elsewhere to disregard oldestOffset when the
>> accompanying flag is false.
>
> Sounds good.  I think I prefer that multixact creation is rejected
> altogether if the new flag is false.  Is that what you mean when you say
> "adjust the logic"?

No.  I'm not sure quite what you mean here.  We can't reject multixact
creation during normal running, and during recovery, we won't create
any really new multixacts, but we must replay the creation of
multixacts.  What I meant was stuff like this:

    if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
        (MultiXactState->nextOffset - MultiXactState->oldestOffset
         > MULTIXACT_MEMBER_SAFE_THRESHOLD))

I meant that we'd change the second prong of the test to check
MultiXactState->oldestOffsetValid && MultiXactState->nextOffset -
MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD.  And
likewise change anything else that relies on oldestOffset.  Or else we
guarantee that we can't reach those points until the oldestOffset is
valid, and then check that it is with an Assert() or elog().
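
Spelled out, with the validity flag named as in the sketch upthread
(illustrative), the test becomes:

    if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
        (MultiXactState->oldestOffsetValid &&
         MultiXactState->nextOffset - MultiXactState->oldestOffset
         > MULTIXACT_MEMBER_SAFE_THRESHOLD))

and at any point we can prove is unreachable before the flag gets set:

    Assert(MultiXactState->oldestOffsetValid);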

>> This still leaves open an ugly possibility: can we reach normal
>> running without a valid oldestOffset?  If so, until the next
>> checkpoint happens, autovacuum has no clue whether it needs to worry.
>> There's got to be a fix for that, but it escapes me at the moment.
>
> I think the fix to that issue is to set the oldest offset on
> TrimMultiXact.  That way, once WAL replay finished we're certain that we
> have a valid oldest offset to create new multixacts with.
>
> I'm also wondering whether the call to DetermineSafeOldestOffset on
> StartupMultiXact is good.  At that point, we haven't replayed any WAL
> yet, so the oldest multi might be pointing at a file that has already
> been removed -- again considering the pg_basebackup scenario where the
> multixact files are copied much later than pg_control, so the checkpoint
> to replay is old but the pg_multixact contents have already been
> truncated in the master and are copied truncated.

Moving the call from StartupMultiXact() to TrimMultiXact() seems like
a good idea.  I'm not sure why we didn't do that before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging a problem today that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Steve: Can you tell us more about how you shut down the old cluster?
Did you by any chance perform an immediate shutdown?  Do you have the
actual log messages that were written when the system was shut down
for the upgrade?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging a problem today that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Wait a minute, wait a minute.  There's a serious problem with this
theory, at least in Steve's scenario.  This message:

2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
down at 2015-05-27 13:12:55 PDT

That message implies a *clean shutdown*.  If he had performed an
immediate shutdown or just pulled the plug, it would have said
"database system was interrupted" or some such.

There may be bugs in redo, also, but they don't explain what happened to Steve.

Steve, is there any chance we can get your pg_controldata output and a
list of all the files in pg_clog?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:01 AM, Robert Haas  wrote:
> On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera wrote:
>> Steve Kehlet wrote:
>>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>>> just dropped new binaries in place) but it wouldn't start up. I found this
>>> in the logs:
>>>
>>> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>>  database system was shut down at 2015-05-27 13:12:55 PDT
>>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>>> starting up
>>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>>> transaction 1
>>
>> I am debugging a problem today that looks very similar to
>> this.  AFAICT the problem is that WAL replay of an online checkpoint in
>> which multixact files are removed fails because replay tries to read a
>> file that has already been removed.
>
> Wait a minute, wait a minute.  There's a serious problem with this
> theory, at least in Steve's scenario.  This message:
>
> 2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
> down at 2015-05-27 13:12:55 PDT
>
> That message implies a *clean shutdown*.  If he had performed an
> immediate shutdown or just pulled the plug, it would have said
> "database system was interrupted" or some such.
>
> There may be bugs in redo, also, but they don't explain what happened to 
> Steve.
>
> Steve, is there any chance we can get your pg_controldata output and a
> list of all the files in pg_clog?

Err, make that pg_multixact/members, which I assume is at issue here.
You didn't show us the DETAIL line from this message, which would
presumably clarify:

FATAL:  could not access status of transaction 1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:03 AM, Robert Haas  wrote:
>> Steve, is there any chance we can get your pg_controldata output and a
>> list of all the files in pg_clog?
>
> Err, make that pg_multixact/members, which I assume is at issue here.
> You didn't show us the DETAIL line from this message, which would
> presumably clarify:
>
> FATAL:  could not access status of transaction 1

And I'm still wrong, probably.  The new code in 9.4.2 cares about
being able to look at an *offsets* file to find the corresponding
member offset.  So most likely it is an offsets file that is missing
here.  The question is, how are we ending up with an offsets file that
is referenced by the control file but not actually present on disk?

It seems like it would be good to compare the pg_controldata output to
what is actually present in pg_multixact/offsets (hopefully that's the
right directory, now that I'm on my third try) and try to understand
what is going on here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:51 AM, Robert Haas  wrote:
> [ speculation ]

OK, I finally managed to reproduce this, after some off-list help from
Steve Kehlet (the reporter), Alvaro, and Thomas Munro.  Here's how to
do it:

1. Install any pre-9.3 version of the server and generate enough
multixacts to create at least TWO new segments.  When you shut down
the server, all segments except for the most current one will be
removed.  At this point, the only thing in
$PGDATA/pg_multixact/offsets should be a single file, and the name of
that file should not be 0000 or 0001.

2. Use pg_upgrade to upgrade to 9.3.4.  It is possible that versions <
9.3.4 will also work here, but you must not use 9.3.5 or higher,
because 9.3.5 includes Bruce's commit 3d2e18510, which arranged to
preserve relminmxid and datminmxid values.   At this point,
pg_controldata on the new cluster should show an oldestMultiXid value
greater than 1 (copied from the old cluster), but all the datminmxid
values are 1.  Also, initdb will have left behind a bogus 0000 file in
pg_multixact/offsets.

3. Move to 9.3.5 (or 9.3.6), not via pg_upgrade, but just by dropping
in the new binaries.  Follow the instructions in the 9.3.5 release
notes; since you created at least TWO new segments in step one, there
will be no 0001 file, and the query there will say that you should
remove the bogus 0000 file.  So do that, leaving just the good file in
pg_multixact/offsets.  At this point, pg_multixact/offsets is OK, and
pg_controldata still says that oldestMultiXid > 1, so that is also OK.
The only problem is that we've got some bogus datminmxid values
floating around.  Our next step will be to convince vacuum to
propagate the bogus datminmxid values back into pg_controldata.

4. Consume at least one transaction ID (e.g. SELECT txid_current())
and then do this:

postgres=# set vacuum_freeze_min_age = 0;
SET
postgres=# set vacuum_freeze_table_age = 0;
SET
postgres=# vacuum;
VACUUM

Setting the GUCs forces full table scans, so that we advance
relfrozenxid.  But notice that we were careful not to just run VACUUM
FREEZE, which would have also advanced relminmxid, which, for purposes
of reproducing this bug, is not what we want to happen.  So relminmxid
is still (incorrectly) set to 1 for every table, and hence datminmxid
is still 1 for every database.  However, since
the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
which will call SetMultiXactIdLimit, which will propagate the bogus
datminmxid = 1 setting into shared memory.

(In my testing, this step doesn't work if performed on 9.3.4; you have
to do it on 9.3.5.  I think that's because of Tom's commit 78db307bb,
but I believe in a more complex test scenario you might be able to get
this to happen on 9.3.4 also.)

I believe it's the case that an autovacuum of even a single table can
substitute for this step if it happens to advance relfrozenxid but not
relminmxid.

5. The next checkpoint, or the shutdown checkpoint in any event, will
propagate the bogus value of 1 from shared memory back into the
control file.

6. Now try to start 9.3.7.  It will see the bogus oldestMultiXid = 1
value in the control file, attempt to read the corresponding offsets
file, and die.

In the process of investigating this, we found a few other things that
seem like they may also be bugs:

- As noted upthread, replaying an older checkpoint after a newer
checkpoint has already happened may lead to similar problems.  This
may be possible when starting from an online base backup; or when
restarting a standby that did not perform a restartpoint when
replaying the last checkpoint before the shutdown.

- pg_upgrade sets datminmxid =
old_cluster.controldata.chkpnt_nxtmulti, which is correct only if
there are ZERO multixacts in use at the time of the upgrade.  It would
be best, I think, to set this to the same value it had in the old
cluster, but if we're going to use a blanket value, I think it needs
to be chkpnt_oldstMulti.

- There's a third possible problem related to boundary cases in
SlruScanDirCbRemoveMembers, but I don't understand that one well
enough to explain it.  Maybe Thomas can jump in here and explain the
concern.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake  wrote:
> FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Thanks!  I really appreciate the kind words.

So, in thinking this through further, it seems to me that the
situation is pretty dire:

1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid
or datminmxid values which are 1 instead of the correct value.
Setting the value to 1 puts it too far in the past if your MXID
counter is < 2B, and too far in the future if it is > 2B.

2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are too new.

3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will
have the first problem for tables in template databases, and the
second one for the rest. (See 866f3017a.)

4. Wrong relminmxid or datminmxid values can eventually propagate into
the control file, as demonstrated in my previous post.  Therefore, we
can't count on relminmxid to be correct, we can't count on datminmxid
to be correct, and we can't count on the control file to be correct.
That's a sack of sad.

5. If the values are too far in the past, then nothing really terrible
will happen unless you upgrade to 9.3.7 or 9.4.2, at which point the
system will refuse to start.  Forcing a VACUUM FREEZE on every
database, including the unconnectable ones, should fix this and allow
you to upgrade safely - which you want to do, because 9.3.7 and 9.4.2
fix a different set of multixact data loss bugs.

6. If the values are too far in the future, the system may fail to
prevent wraparound, leading to data loss.  I am not totally clear on
whether a VACUUM FREEZE will fix this problem.  It seems like the
chances are better if you are running at least 9.3.5+ or 9.4.X,
because of 78db307bb.  But I'm not sure how complete a fix that is.

So what do we do about this?  I have a few ideas:

A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
that we stop creating new instances of this problem.  That won't get
us out of the hole we've dug for ourselves, but we can at least try to
stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
wrong thing - anyone want to double-check me on that one?)

B. We need to change find_multixact_start() to fail softly.  This is
important because it's legitimate for it to fail in recovery, as
discussed upthread, and also because we probably want to eliminate the
fail-to-start hazard introduced in 9.4.2 and 9.3.7.
find_multixact_start() is used in three places, and they each require
separate handling:

- In SetMultiXactIdLimit, find_multixact_start() is used to set
MultiXactState->oldestOffset, which is used to determine how
aggressively to vacuum.  If find_multixact_start() fails, we don't
know how aggressively we need to vacuum to prevent members wraparound;
it's probably best to decide to vacuum as aggressively as possible.
Of course, if we're in recovery, we won't vacuum either way; the fact
that it fails softly is good enough.

- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState->offsetStopLimit.  If it fails here, we don't know when
to refuse multixact creation to prevent wraparound.  Again, in
recovery, that's fine.  If it happens in normal running, it's not
clear what to do.  Refusing multixact creation is an awfully blunt
instrument.  Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point.  Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails.  It might be more robust than what we
have now.

- In TruncateMultiXact, find_multixact_start() is used to set the
truncation point for the members SLRU.  If it fails here, I'm guessing
the right solution is not to truncate anything - instead, rely on
intense vacuuming to eventually advance oldestMXact to a value whose
member data still exists; truncate then.
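
For the TruncateMultiXact case, the fail-soft handling could be as
simple as this sketch, using the two-value signature proposed
upthread:

    MultiXactOffset oldestOffset;

    if (!find_multixact_start(oldestMXact, &oldestOffset))
        return;     /* member data missing; let vacuum advance oldestMXact
                     * and truncate on a later pass */

    /* otherwise truncate members up to oldestOffset, as now */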

C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members.  As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes.  That seems like an
unnecessary risk.
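
In outline, with the member-removal step standing in for the existing
scan-and-unlink logic (the wrapper name is hypothetical):

    /* Truncate offsets first... */
    SimpleLruTruncate(MultiXactOffsetCtl,
                      MultiXactIdToOffsetPage(oldestMXact));

    /*
     * ...then members.  An interruption here leaves only unreferenced
     * member files behind, never an offset pointing at a missing member.
     */
    TruncateMembers(oldestOffset);      /* hypothetical wrapper */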

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
That might be better than what we're doing right now.
The current logic purports to leave a one-file gap in the member
space, but there's no guarantee that the gap really exists on disk the
way we think it does.  With this approach, we can be certain that
there is a gap.  And that is a darned good thing to be certain about.

>> C. I think we should also change TruncateMultiXact() to truncate
>> offsets first, and then members.  As things stand, if we truncate
>> members first, we increase the risk of seeing an offset that will fail
>> when passed to find_multixact_start(), because TruncateMultiXact()
>> might get interrupted before it finishes.  That seems like an
>> unnecessary risk.
>
> Not sure about this point.  We did it the way you propose previously,
> and found it to be a problem because sometimes we tried to read an
> offset file that was no longer there.  Do we really read member files
> anywhere?  I thought we only tried to read offset files.  If we remove
> member files, what is it that we try to read and find not to be present?

Do you have a link to the previous discussion?

I mean, the problem we're having right now is that sometimes we have
an offset, but the corresponding member isn't there.  So clearly
offsets reference members.  Do members also reference offsets?  I
didn't think so, but life is full of surprises.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 10:17 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> On Fri, May 29, 2015 at 11:24 AM, Robert Haas  wrote:
>>> B. We need to change find_multixact_start() to fail softly.
>
>> Here is an experimental WIP patch that changes StartupMultiXact and
>> SetMultiXactIdLimit to find the oldest multixact that exists on disk
>> (by scanning the directory), and uses that if it is more recent than
>> the oldestMultiXactId from shmem,
>
> Not sure about the details of this patch, but I was planning to propose
> what I think is the same thing: the way to make find_multixact_start()
> fail softly is to have it find the oldest actually existing file if the
> one that should be there isn't.

Working on that now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 12:43 PM, Robert Haas  wrote:
> Working on that now.

OK, here's a patch.  Actually two patches, differing only in
whitespace, for 9.3 and for master (ha!).  I now think that the root
of the problem here is that DetermineSafeOldestOffset() and
SetMultiXactIdLimit() were largely ignorant of the possibility that
they might be called at points in time when the cluster was
inconsistent.  SetMultiXactIdLimit() bracketed certain parts of its
logic with if (!InRecovery), but those guards were ineffective because
it gets called before InRecovery is set in the first place.

It seems pretty clear that we can't effectively determine anything
about member wraparound until the cluster is consistent.  Before then,
there might be files missing from the offsets or members SLRUs which
get put back during replay.  There could even be gaps in the sequence
of files, with some things having made it to disk before the crash (or
having made it into the backup) and others not.  So all the work of
determining what the safe stop points and vacuum thresholds for
members are needs to be postponed until TrimMultiXact() time.  And
that's fine, because we don't need this information in recovery anyway
- it only affects behavior in normal running.

So this patch does the following:

1. Moves the call to DetermineSafeOldestOffset() that appears in
StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
until we're consistent.  Also, instead of passing
MultiXactState->oldestMultiXactId, pass the newer of that value and
the earliest offset that exists on disk.  That way, it won't try to
read data that's not there.  Note that the second call to
DetermineSafeOldestOffset() in TruncateMultiXact() doesn't need a
similar guard, because we already bail out of that function early if
the multixacts we're going to truncate away don't exist.

2. Adds a new flag MultiXactState->didTrimMultiXact to indicate whether
we've finished TrimMultiXact(), and arranges for SetMultiXactIdLimit()
to use that rather than InRecovery to test whether it's safe to do
complicated things that might require that the cluster is consistent.
This is a slight behavior change, since formerly we would have tried
to do that stuff very early in the startup process, and now it won't
happen until somebody completes a vacuum operation.  If that's a
problem, we could consider doing it in TrimMultiXact(), but I don't
think it's safe the way it was.  The new flag also prevents
oldestOffset from being set while in recovery; I think it would be
safe to do that in recovery once we've reached consistency, but I
don't believe it's necessary.

3. Arranges for TrimMultiXact() to set oldestOffset.  This is
necessary because, previously, we relied on SetMultiXactIdLimit doing
that during early startup or during recovery, and that's no longer
true.  Here too we set oldestOffset keeping in mind that our notion of
the oldest multixact may point to something that doesn't exist; if so,
we use the oldest MXID that does.

4. Modifies TruncateMultiXact() so that it doesn't re-scan the SLRU
directory on every call to find the oldest file that exists.  Instead,
it arranges to remember the value from the first scan and then updates
it thereafter to reflect its own truncation activity.  This isn't
absolutely necessary, but because this oldest-file logic is used in
multiple places (TrimMultiXact, SetMultiXactIdLimit, and
TruncateMultiXact all need it directly or indirectly) caching the
value seems like a better idea than recomputing it frequently.
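
The gating in item 2 boils down to something like this inside
SetMultiXactIdLimit() (a sketch; the flag name is from the patch):

    /*
     * Formerly: if (!InRecovery).  That test was useless here because
     * this function can run before InRecovery is set.  The flag below
     * is set only once TrimMultiXact() has run, i.e. once the cluster
     * is known to be consistent.
     */
    if (MultiXactState->didTrimMultiXact)
    {
        /* safe to compute oldestOffset and arm wraparound defenses */
    }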

I have tested that this patch fixes Steve Kehlet's problem, or at
least what I believe to be Steve Kehlet's problem based on the
reproduction scenario I described upthread.  I believe it will also
fix the problems with starting up from a base backup that Alvaro
mentioned upthread.  It won't fix the fact that pg_upgrade is putting
a wrong value into everybody's datminmxid field, which should really
be addressed too, but I've been working on this for about three days
virtually non-stop and I don't have the energy to tackle it right now.
If anyone feels the urge to step into that breach, I think what it
needs to do is: when upgrading from a 9.3-or-later instance, copy over
each database's datminmxid into the corresponding database in the new
cluster.

Aside from that, it's very possible that despite my best efforts this
has serious bugs.  Review and testing would be very much appreciated.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 699497c..8d28a5c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -197,8 +197,9 @@ typedef struct MultiXactStateData
 	Mu

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 3:08 PM, Robert Haas  wrote:
> It won't fix the fact that pg_upgrade is putting
> a wrong value into everybody's datminmxid field, which should really
> be addressed too, but I've been working on this for about three days
> virtually non-stop and I don't have the energy to tackle it right now.
> If anyone feels the urge to step into that breach, I think what it
> needs to do is: when upgrading from a 9.3-or-later instance, copy over
> each database's datminmxid into the corresponding database in the new
> cluster.

Bruce was kind enough to spend some time on IM with me this afternoon,
and I think this may actually be OK.  What pg_upgrade does is:

1. First, put next-mxid into the relminmxid for all tables, including
catalog tables.  This is the correct behavior for upgrades from a
pre-9.3 release, and is correct for catalog tables in general.

2. Next, restoring the schema dump will set the relminmxid values for
all non-catalog tables to the value dumped from the old cluster.  At
this point, everything is fine provided that we are coming from a
release 9.3 or newer.  But if the old cluster is pre-9.3, it will have
dumped *zero* values for all of its relminmxid values; so all of the
user tables go from the correct value they had after step 1 to an
incorrect value.

3. Finally, if the old cluster is pre-9.3, repeat step 1, undoing the
damage done in step 2.

This is a bit convoluted, but I don't know of a reason why it
shouldn't work.  Sorry for the false alarm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 9:46 PM, Andres Freund  wrote:
> On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
>> It seems pretty clear that we can't effectively determine anything
>> about member wraparound until the cluster is consistent.
>
> I wonder if this doesn't actually hint at a bigger problem.  Currently,
> to determine where we need to truncate, SlruScanDirectory() is
> used.  That, afaics, could actually be a problem during recovery when
> we're not consistent.

I agree.  I actually meant to mention this in my previous email, but,
owing to exhaustion and burnout, didn't.

> I think at least for 9.5+ we should a) invent proper truncation records
> for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
> The current hack of scanning the directories to get knowledge we should
> have is a pretty bad hack, and we should not continue using it forever.
> I think we might end up needing to do a) even in the backbranches.

That may be the right thing to do.  I'm concerned that changing the
behavior of master too much will make every subsequent fix twice as
hard, because we'll have to do one fix in master and another fix in
the back-branches.  I'm also concerned that it will create even more
convoluted failure scenarios. The failure-to-start problem discussed
on this thread requires a chain of four (maybe three) different
PostgreSQL versions in order to create it, and the more things we go
change, the harder it's going to be to reason about this stuff.

The diseased and rotting elephant in the room here is that clusters
with bogus relminmxid, datminmxid, and/or oldestMultiXid values may
exist in the wild and we really have no plan to get rid of them.
78db307bb may have helped somewhat - although I haven't grokked what
it's about well enough to be sure - but it's certainly not a complete
solution, as this bug report itself illustrates rather well.  Unless
we figure out some clever solution that is not now apparent to me, or
impose a hard pg_upgrade compatibility break at some point, we
basically can't count on pg_control's "oldest multixact" information
to be correct ever again.  We may be running into clusters 15 years
from now that have problems that are just holdovers from what was
fixed in 9.3.5.

One thing I think we should definitely do is add one or two additional
fields to pg_controldata that get filled in by pg_upgrade.  One of
them should be "the oldest known catversion in the lineage of this
cluster" and the other should be "the most recent catverson in the
lineage of this cluster before this one".   Or maybe we should store
PG_VERSION_NUM values.  Or store both things.  I think that would make
troubleshooting this kind of problem a lot easier - just from the
pg_controldata output, you'd be able to tell whether the cluster had
been pg_upgraded, whether it had been pg_upgraded once or multiple
times, and at least some of the versions involved, without relying on
the user's memory of what they did and when.  Fortunately, Steve
Kehlet had a pretty clear idea of what his history was, but not all
users know that kind of thing, and I've wanted it more than once while
troubleshooting.

Another thing I think we should do is add a field to pg_class that is
propagated by pg_upgrade and stores the most recent PG_VERSION_NUM
that is known to have performed a scan_all vacuum of the table.  This
would allow us to do things in the future like (a) force a full-table
vacuum of any table that hasn't been vacuumed since $BUGGYRELEASE or
(b) advise users to manually inspect the values and manually perform
said vacuum or (c) only believe that certain information about a table
is accurate if it's been full-scanned by a vacuum newer than
$BUGGYRELEASE.  It could also be used as part of a strategy for
reclaiming HEAP_MOVED_IN/HEAP_MOVED_OFF; e.g. you can't upgrade to
10.5, which repurposes those bits, unless you've done a scan_all
vacuum on every table with a release new enough to guarantee that
they're not used for their historical purpose.

> This problem isn't conflicting with most of the fixes you describe, so
> I'll continue with reviewing those.

Thank you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Robert Haas
On Sat, May 30, 2015 at 8:55 PM, Andres Freund  wrote:
> Is oldestMulti, nextMulti - 1 really suitable for this? Are both
> actually guaranteed to exist in the offsets slru and be valid?  Hm. I
> guess you intend to simply truncate everything else, but just in
> offsets?

oldestMulti in theory is the right thing, I think, but in actuality we
know that some people have 1 here instead of the correct value.

>> One argument against this idea is that we may not want to keep a full
>> set of member files on standbys (due to disk space usage), but that's
>> what will happen unless we truncate during replay.
>
> I think that argument is pretty much the death-knell.

Yes.  Truncating on the standby is really not optional.

>> > I think at least for 9.5+ we should a) invent proper truncation records
>> > for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
>> > The current hack of scanning the directories to get knowledge we should
>> > have is a pretty bad hack, and we should not continue using it forever.
>> > I think we might end up needing to do a) even in the backbranches.
>>
>> Definitely agree with WAL-logging truncations; also +1 on backpatching
>> that to 9.3.  We already have experience with adding extra WAL records
>> on minor releases, and it didn't seem to have bitten too hard.
>
> I'm inclined to agree. My only problem is that I'm not sure whether we
> can find a way of doing all this without adding a pg_control field. Let
> me try to sketch this out:
>
> 1) We continue using SlruScanDirectory(SlruScanDirCbFindEarliest)
>    on the master to find the oldest offsets segment to
>    truncate. Alternatively, if we determine it to be safe, we could use
>    oldestMulti to find that.
> 2) SlruScanDirCbRemoveMembers is changed to return the range of members
>    to remove, instead of doing it itself
> 3) We wal log [oldest offset segment guaranteed to not be alive,
>    nextmulti) for offsets, and [oldest members segment guaranteed to not
>    be alive, nextmultioff), and redo truncations for the entire range
>    during recovery.
>
> I'm pretty tired right now, but this sounds doable.

I'm probably biased here, but I think we should finish reviewing,
testing, and committing my patch before we embark on designing this.
So far we have no reports of trouble attributable to the lack of the
WAL-logging support discussed here, as opposed to several reports of
trouble from the status quo within days of release.

I'm having trouble reconstructing the series of events where what
you're worried about here really becomes a problem, and I think we
ought to have a very clear idea about that before back-patching
changes of this type.  It's true that if the state of the SLRU
directory is in the future, because recovery was restarted from an
earlier checkpoint, we might replay a checkpoint and remove some of
those files from the future.  But so what?  By the time we've reached
the minimum recovery point, they will have been recreated by the same
WAL records that created them in the first place.  If, in the previous
replay, we had wrapped all the way around, some of the stuff we keep
may actually already have been overwritten by future WAL records, but
they'll be overwritten again.  Now, that could mess up our
determination of which members to remove, I guess, but I'm not clear
that actually matters either: if the offsets space has wrapped around,
the members space will certainly have wrapped around as well, so we
can remove anything we like at this stage and we're still OK.  I agree
this is ugly the way it is, but where is the actual bug?

As far as your actual outline goes, I think if we do this, we need to
be very careful about step #2.  Right now, we decide what we need to
keep and then remove everything else, but that's kind of wonky because
new stuff may be getting created at the same time, so we keep
adjusting our idea of exactly what needs to be removed.  It would be
far better to invert that logic: decide what needs to be removed -
presumably, everything from the oldest member that now exists up until
some later point - and then remove precisely that stuff and nothing
else.
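
In other words, something like this (a sketch only; both helpers are
hypothetical):

    /* Decide the removal range up front... */
    MultiXactOffset removeStart = GetOldestMemberSegmentOnDisk();
    MultiXactOffset removeEnd = oldestOffsetStillNeeded;

    /* ...then remove precisely that range and nothing else. */
    if (removeStart != removeEnd)
        RemoveMemberSegments(removeStart, removeEnd);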

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch  wrote:
> Incomplete review, done in a relative rush:

Thanks.

> On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
>> OK, here's a patch.  Actually two patches, differing only in
>> whitespace, for 9.3 and for master (ha!).  I now think that the root
>> of the problem here is that DetermineSafeOldestOffset() and
>> SetMultiXactIdLimit() were largely ignorant of the possibility that
>> they might be called at points in time when the cluster was
>> inconsistent.
>
> A cause perhaps closer to the root is commit f741300 moving truncation from
> VACUUM to checkpoints.  CLOG has given us deep experience with VACUUM-time
> truncation.  Commit f6a6c46d and this patch are about bringing CHECKPOINT-time
> truncation up to the same level.
>
> Speaking of commit f6a6c46d, it seems logical that updating allocation stop
> limits should happen proximate to truncation.  That's currently the case for
> CLOG (vac_truncate_clog() does both) and pg_multixact/members (checkpoint's
> TruncateMultiXact() call does both).  However, pg_multixact/offsets is
> truncated from TruncateMultiXact(), but vac_truncate_clog() updates its limit.
> I did not distill an errant test case, but this is fishy.

Good point.  Because we confine ourselves to using half the offset
space, it seems much harder for us to get into trouble here than it is
with members. The first scenario that occurred to me is that the SLRU
might actually wrap.  That seems tough, though: between one checkpoint
and the next, vacuum would need to advance oldest_datminmxid by 2^31
MXIDs while generating 2^31 new ones, or something like that.  That
doesn't seem real plausible. But then it occurred to me that it's
probably sufficient to advance the head of the SLRU far enough that
TruncateMultiXact thinks that the tail is in the future instead of in
the past.  I see no reason why that couldn't happen.  Then we'd end up
leaving some files behind that we should have removed.  I'm not sure
exactly what problem that would cause; would they just get overwritten
on the next pass through the space, or would they cause errors?  I
have not had time to check.
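
The hazard comes from the wraparound-aware comparison; the shape of
the problem is roughly this (a sketch, not the literal code):

    /*
     * If the head has advanced more than 2^31 past the tail, the tail
     * appears to be in the future and truncation is skipped entirely.
     */
    if (MultiXactIdPrecedes(nextMXact, oldestMXact))
        return;     /* leaves the stale files in place */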

>> SetMultiXactIdLimit() bracketed certain parts of its
>> logic with if (!InRecovery), but those guards were ineffective because
>> it gets called before InRecovery is set in the first place.
>
> SetTransactionIdLimit() checks InRecovery for the same things, and it is
> called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
> sense of whether it is subject to similar problems as a result?

Well, I think it's pretty weird that those things will get done before
beginning recovery, even on an inconsistent cluster, but not during
recovery.  That is pretty strange.  I don't see that it can actually
do any worse than emit a few log messages at the start of recovery
that won't show up again until the end of recovery, though.

>> 1. Moves the call to DetermineSafeOldestOffset() that appears in
>> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
>> until we're consistent.  Also, instead of passing
>> MultiXactState->oldestMultiXactId, pass the newer of that value and
>> the earliest offset that exists on disk.  That way, it won't try to
>> read data that's not there.
>
> Perhaps emit a LOG message when we do that, since it's our last opportunity to
> point to the potential data loss?

If the problem is just that somebody's minmxid got set to 1 instead of
the real value, I think that there is no data loss, because none of
those older values are actually present there.  But we could add a LOG
message anyway.  How do you suggest that we phrase that?

>> +  * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
>> +  * had bugs that could allow users who reached those release through
>
> s/release/releases/

Fixed.

>> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
>>   SimpleLruTruncate(MultiXactOffsetCtl,
>> 
>> MultiXactIdToOffsetPage(oldestMXact));
>>
>> + /* Update oldest-on-disk value in shared memory. */
>> + earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
>> + if (earliest < FirstMultiXactId)
>> + earliest = FirstMultiXactId;
>> + LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>> + Assert(MultiXactState->oldestMultiXactOnDiskValid);
>> + MultiXactState->oldestMultiXactOnDiskValid = earliest;
>
> That last line needs s/Valid//, I presume.  Is it okay that
> oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
> Valid indicator remaining true?

Ay yai 

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund  wrote:
>> I'm probably biased here, but I think we should finish reviewing,
>> testing, and committing my patch before we embark on designing this.
>
> Probably, yes. I am wondering whether doing this immediately won't end
> up making some things simpler and more robust though.

I'm open to being convinced of that, but as of this moment I'm not
seeing any clear-cut evidence that we need to go so far.

>> So far we have no reports of trouble attributable to the lack of the
>> WAL-logging support discussed here, as opposed to several reports of
>> trouble from the status quo within days of release.
>
> The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> era, where we didn't do any truncation during recovery...

Right, but now we're piggybacking on the checkpoint records, and I
don't have any evidence that this approach can't be made robust.  It's
possible that it can't be made robust, but that's not currently clear.

>> By the time we've reached the minimum recovery point, they will have
>> been recreated by the same WAL records that created them in the first
>> place.
>
> I'm not sure that's true. I think we could end up errorneously removing
> files that were included in the base backup. Anyway, let's focus on your
> patch for now.

OK, but I am interested in discussing the other thing too.  I just
can't piece together the scenario myself - there may well be one.  The
base backup will begin replay from the checkpoint caused by
pg_start_backup() and remove anything that wasn't there at the start
of the backup.  But all of that stuff should get recreated by the time
we reach the minimum recovery point (end of backup).

>> If, in the previous
>> replay, we had wrapped all the way around, some of the stuff we keep
>> may actually already have been overwritten by future WAL records, but
>> they'll be overwritten again.  Now, that could mess up our
>> determination of which members to remove, I guess, but I'm not clear
>> that actually matters either: if the offsets space has wrapped around,
>> the members space will certainly have wrapped around as well, so we
>> can remove anything we like at this stage and we're still OK.  I agree
>> this is ugly the way it is, but where is the actual bug?
>
> I'm more worried about the cases where we didn't ever actually "badly
> wrap around" (i.e. overwrite needed data); but where that's not clear on
> the standby because the base backup isn't in a consistent state.

I agree. The current patch tries to make it so that we never call
find_multixact_start() while in recovery, but it doesn't quite
succeed: the call in TruncateMultiXact still happens during recovery,
but only once we're sure that the mxact we plan to call it on actually
exists on disk.  That won't be called until we replay the first
checkpoint, but that might still be prior to consistency.

Since I forgot to attach the revised patch with fixes for the points
Noah mentioned to that email, here it is attached to this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9
Author: Robert Haas 
Date:   Fri May 29 14:35:53 2015 -0400

foo

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..aca829d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -199,8 +199,9 @@ typedef struct MultiXactStateData
 	MultiXactOffset nextOffset;
 
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that may still be referenced from a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
@@ -213,6 +214,18 @@ typedef struct MultiXactStateData
 	 */
 	MultiXactId lastCheckpointedOldest;
 
+	/*
+	 * This is the oldest file that actually exists on disk.  This value
+	 * is initialized by scanning pg_multixact/offsets, and subsequently
+	 * updated each time we complete a truncation.  We need a flag to
+	 * indicate whether this has been initialized yet.
+	 */
+	MultiXactId oldestMultiXactOnDisk;
+	bool		oldestMultiXactOnDiskValid;
+
+	/* Has TrimMultiXact been called yet? */
+	bool		didTrimMultiXact;
+
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -344,6 +357,8 @@ static char *mxstatus_to_string(MultiXactStatus status);
 /* management of SLRU infrastructure */
 static int	ZeroMultiXactOffsetPage(int page

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch  wrote:
> On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
>> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch  wrote:
>> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
>> >> SetMultiXactIdLimit() bracketed certain parts of its
>> >> logic with if (!InRecovery), but those guards were ineffective because
>> >> it gets called before InRecovery is set in the first place.
>> >
>> > SetTransactionIdLimit() checks InRecovery for the same things, and it is
>> > called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
>> > sense of whether it is subject to similar problems as a result?
>>
>> Well, I think it's pretty weird that those things will get done before
>> beginning recovery, even on an inconsistent cluster, but not during
>> recovery.  That is pretty strange.  I don't see that it can actually
>> do any worse than emit a few log messages at the start of recovery
>> that won't show up again until the end of recovery, though.
>
> Granted.  Would it be better to update both functions at the same time, and
> perhaps to make that a master-only change?  Does the status quo cause more
> practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

It does in the case of the call to find_multixact_start().  If that
fails, we take the server down for no good reason, as demonstrated by
the original report. I'll revert the changes to the other two things
in this function that I changed to be protected by did_trim.  There's
no special reason to think that's a necessary change.

>> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in
>> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
>> >> until we're consistent.  Also, instead of passing
>> >> MultiXactState->oldestMultiXactId, pass the newer of that value and
>> >> the earliest offset that exists on disk.  That way, it won't try to
>> >> read data that's not there.
>> >
>> > Perhaps emit a LOG message when we do that, since it's our last 
>> > opportunity to
>> > point to the potential data loss?
>>
>> If the problem is just that somebody minmxid got set to 1 instead of
>> the real value, I think that there is no data loss, because none of
>> those older values are actually present there.  But we could add a LOG
>> message anyway.  How do you suggest that we phrase that?
>
> There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
> to 1.  Otherwise, data loss is possible.

Yes, but in that scenario, the log message you propose wouldn't be
triggered.  If true_minmxid > 2^31, then the stored minmxid will not
precede the files on disk; it will follow it (assuming the older stuff
hasn't been truncated, as is likely).  So the message would be
essentially:

LOG: you didn't lose data.  but if exactly the opposite of what this
message is telling you about had happened, then you would have.
DETAIL: Have a nice day.
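
(For anyone following along: the comparisons here are modular, so a
value more than 2^31 away reads as being in the future.  multixact.c's
wraparound-aware comparison is essentially this:)

    bool
    MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2)
    {
        int32       diff = (int32) (multi1 - multi2);

        return (diff < 0);
    }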

> I don't hope for an actionable
> message, but we might want a reporter to grep logs for it when we diagnose
> future reports.  Perhaps this:
>
>   "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"

This seems misleading.  In the known failure case, it's not that the
pg_multixact files are unexpectedly missing; it's that we incorrectly
think that they should still be there.  Maybe:

oldest MultiXactId on disk %u follows expected oldest MultiXact %u

> For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
> more than one segment:
>
>   "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"

So, this scenario will happen whenever the system was interrupted in
the middle of a truncation, or when the system is started from a base
backup and a truncation happened after files were copied.  I'm wary of
giving users the idea that this is an atypical event.  Perhaps a
message at DEBUG1?
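
If we do emit it, presumably via something like this (variable names
illustrative):

    ereport(DEBUG1,
            (errmsg("excess pg_multixact/members files; begins at MultiXactId %u, expected %u",
                    earliest, lastCheckpointedOldest)));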

>> I'm not sure what you mean about it becoming too old.  At least with
>> that fix, it should get updated to exactly the first file that we
>> didn't remove.  Isn't that right?
>
> Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
> only in that it never reads or writes the cache.  I might expect
> oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
> does hold most of the time.  It does not always hold between the start of the
> quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
> That&

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund  wrote:
> But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> this scenario now (since a couple weeks ago) does a
> SimpleLruReadPage_ReadOnly() in the members slru via
> find_multixact_start(). That just won't work acceptably when we're not
> yet consistent. There very well could not be a valid members segment at
> that point?  Am I missing something?

Yes: that code isn't new.

TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
9.3.0 and every subsequent release until 9.3.7/9.4.2.  The only thing
that's changed is that we've moved that logic into a function called
find_multixact_start() instead of having it directly inside that
function.  We did that because we needed to use the same logic in some
other places.  The reason why 9.3.7/9.4.2 cause problems that people
didn't have previously is that those new, additional call sites were
poorly chosen and didn't include adequate protection against calling
that function with an invalid input value.
What this patch is about is getting back to the situation that we were
in from 9.3.0 - 9.3.6 and 9.4.0 - 9.4.1, where TruncateMultiXact() did
the thing that you're complaining about here but no one else did.

From my point of view, I think that you are absolutely right to
question what's going on in TruncateMultiXact().  It's kooky, and
there may well be bugs buried there.  But I don't think fixing that
should be the priority right now, because we have zero reports of
problems attributable to that logic.  I think the priority should be
on undoing the damage that we did in 9.3.7/9.4.2, when we made other
places do the same thing.  We started getting trouble reports
attributable to those changes *almost immediately*, which means that
whether or not TruncateMultiXact() is broken, these new call sites
definitely are.  I think we really need to fix those new places ASAP.

> I think at the very least we'll have to skip this step while not yet
> consistent. That really sucks, because we'll possibly end up with
> multixacts that are completely filled by the time we've reached
> consistency.

That would be a departure from the behavior of every existing release
that includes this code, based on, to my knowledge, zero trouble
reports.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund  wrote:
> On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
>> I'm having trouble figuring out what to do about this.  I mean, the
>> essential principle of this patch is that if we can't count on
>> relminmxid, datminmxid, or the control file to be accurate, we can at
>> least look at what is present on the disk.  If we also cannot count on
>> that to be accurate, we are left without any reliable source of
>> information.  Consider a hypothetical cluster where all our stored
>> minmxids of whatever form are corrupted (say, all change to 1) and in
>> addition there are stray files in pg_multixact.  I don't think there's
>> really any way to get ourselves out of trouble in that scenario.
>
> If we were to truncate after vacuum, and only on the primary (via WAL
> logging), we could, afaics, just rely on all the values to be
> recomputed. I mean relminmxid will be recomputed after a vacuum, and
> thus, after some time, will datminmxid and the control file value.  We
> could just force a value of 1 to always trigger anti-wraparound vacuums
> (or wait for that to happen implicitly, to delay the impact?). That
> should then fix the problem in a relatively short amount of time?

The exact circumstances under which we're willing to replace a
relminmxid with a newly-computed one that differs are not altogether
clear to me, but there's an "if" statement protecting that logic, so
there are some circumstances in which we'll leave the existing value
intact.  If we force non-stop vacuuming in that scenario, autovacuum
will just run like crazy without accomplishing anything, which
wouldn't be good.  It would similarly do so when the oldest MXID
reference in the relation is in fact 1, but that value can't be
vacuumed away yet.

Also, the database might be really big.  Even if it were true that a
full scan of every table would get us out of this state, describing
the time that it would take to do that as "relatively short" seems to
me to be considerably understating the impact of a full-cluster
VACUUM.

With regard to the more general question of WAL-logging this, are you
going to work on that?  Are you hoping Alvaro or I will work on that?
Should we draw straws?  It seems like somebody needs to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund  wrote:
>> That would be a departure from the behavior of every existing release
>> that includes this code, based on, to my knowledge, zero trouble
>> reports.
>
> On the other hand we're now at about bug #5 attributable to the odd way
> truncation works for multixacts. It's obviously complex and hard to get
> right. It makes it harder to cope with the wrong values left in
> datminmxid etc. So I'm still wondering whether fixing this for good isn't
> the better approach.

It may well be.  But I think we should do something more surgical
first.  Perhaps we can justify the pain and risk of making changes to
the WAL format in the back-branches, but let's not do it in a rush.
If we can get this patch to a state where it undoes the damage
inflicted in 9.3.7/9.4.2, then we will be in a state where we have as
much reliability as we had in 9.3.6 plus the protections against
member-space wraparound added in 9.3.7 - which, like the patch I'm
proposing now, were directly motivated by multiple, independent bug
reports.  That seems like a good place to get to.  If nothing else, it
will buy us some time to figure out what else we want to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund  wrote:
> On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
>> The exact circumstances under which we're willing to replace a
>> relminmxid with a newly-computed one that differs are not altogether
>> clear to me, but there's an "if" statement protecting that logic, so
>> there are some circumstances in which we'll leave the existing value
>> intact.
>
> I guess we'd have to change that then.

Yeah, but first we'd need to assess why it's like that.  Tom was the
one who installed the current logic, but I haven't been able to fully
understand it.

>> It would similarly do so when the oldest MXID reference in the
>> relation is in fact 1, but that value can't be vacuumed away yet.
>
> I'd thought of just forcing consumption of one multixact in that
> case. Not pretty, but imo acceptable.

What if multixact 1 still has living members?

>> Also, the database might be really big.  Even if it were true that a
>> full scan of every table would get us out of this state, describing
>> the time that it would take to do that as "relatively short" seems to
>> me to be considerably understating the impact of a full-cluster
>> VACUUM.
>
> Well. We're dealing with a corrupted cluster. Having a way out that's
> done automatically, even if it takes a long while, is pretty good
> already. In many cases the corruption (i.e. pg_upgrade) happened long
> ago, so the table's relminmxid will already have been recomputed.  I
> think that's acceptable.

I'm a long way from being convinced that automated recovery is
possible.  There are so many different scenarios here that it's very
difficult to reason generally about what the "right" thing to do is.
I agree it would be nice if we had it, though.

>> With regard to the more general question of WAL-logging this, are you
>> going to work on that?  Are you hoping Alvaro or I will work on that?
>> Should we draw straws?  It seems like somebody needs to do it.
>
> I'm willing to invest the time to develop an initial version, but I'll
> need help evaluating it. I don't have many testing resources available
> atm, and I'm not going to trust stuff I developed while travelling by
> just looking at the code.

I'm willing to help with that.  Hopefully I'm not the only one, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 4:19 PM, Andres Freund  wrote:
> I'm not really convinced tying things closer to having done trimming is
> easier to understand than tying things to recovery having finished.
>
> E.g.
> if (did_trim)
>     oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
> imo is harder to understand than if (!InRecovery).
>
> Maybe we could just name it finishedStartup and rename the functions 
> accordingly?

Basing that particular call site on InRecovery doesn't work, because
InRecovery isn't set early enough.  But I'm fine to rename it to
whatever.

> Maybe it's worthwhile to add a 'NB: At this stage the data directory is
> not yet necessarily consistent' to StartupMultiXact's comments, to avoid
> reintroducing problems like this?

Sure.

>>   /*
>> +  * We can read this without a lock, because it only changes when nothing
>> +  * else is running.
>> +  */
>> + did_trim = MultiXactState->didTrimMultiXact;
>
> Err, Hot Standby? It might be ok to not lock, but the comment is
> definitely wrong. I'm inclined to simply use locking; this doesn't look
> sufficiently critical performance-wise.

/me nods.  Good point.

> Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> the disk it'll always get one at a segment boundary, right? I'm not sure
> that's actually ok; because the value at the beginning of the segment
> can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> filled the page with zeros.
>
> I think that should be harmless, the worst that can happen is that
> oldestOffset erroneously is 0, which should be correct, even though
> possibly overly conservative, in these cases.

Uh oh.  That seems like a real bad problem for this approach.  What
keeps that from being the opposite of too conservative?  There's no
"safe" value in a circular numbering space.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 4:48 AM, Thomas Munro
 wrote:
> On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera  
> wrote:
>> Thomas Munro wrote:
>>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  
>>> wrote:
>>> > My guess is that the file existed, and perhaps had one or more pages,
>>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>>> > back.  read() returns 0 in this case but doesn't set errno.
>>> >
>>> > I didn't find a way to set things up so that the file exists but has
>>> > shorter contents than oldestMulti by the time the checkpoint record is
>>> > replayed.
>>>
>>> I'm just starting to learn about the recovery machinery, so forgive me
>>> if I'm missing something basic here, but I just don't get this.  As I
>>> understand it, offsets/0046 should either have been copied with that
>>> page present in it if it existed before the backup started (apparently
>>> not in this case), or extended to contain it by WAL records that come
>>> after the backup label but before the checkpoint record that
>>> references it (also apparently not in this case).
>>
>> Exactly --- that's the spot at which I am, also.  I have had this
>> spinning in my head for three days now, and tried every single variation
>> that I could think of, but like you I was unable to reproduce the issue.
>> However, our customer took a second base backup and it failed in exactly
>> the same way, modulo some changes to the counters (the file that
>> didn't exist was 004B rather than 0046).  I'm still at a loss as to what
>> the failure mode is.  We must be missing some crucial detail ...
>
> I have finally reproduced that error!  See attached repro shell script.
>
> The conditions are:
>
> 1.  next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
> 3.  the page must not be the first in a segment (or we'd get the
> read-zeroes case)
>
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint).  That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...
>
> I realise now that this is actually a symptom of a problem spotted by
> Noah recently:
>
> http://www.postgresql.org/message-id/20150601045534.gb23...@tornado.leadboat.com
>
> He noticed the problem for segment boundaries, when not in recovery.
> In recovery, segment boundaries don't raise an error (the read-zeroes
> case applies), but page boundaries do.  The fix is probably to do
> nothing if they are the same, as we do elsewhere, like in the attached
> patch.

Actually, we still need to call DetermineSafeOldestOffset in that
case.  Otherwise, if someone goes from lots of multixacts to none, the
stop point won't advance.
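
In other words, the early exit should look roughly like this (a sketch
only; I'm assuming TruncateMultiXact's local variables are named
oldestMulti and nextMulti):

	/*
	 * If no multixacts exist, there is nothing to truncate, and
	 * find_multixact_start() must not be called for a multixact that
	 * isn't there; but the stop point still has to be recomputed so
	 * that it can advance.
	 */
	if (oldestMulti == nextMulti)
	{
		DetermineSafeOldestOffset(oldestMulti);
		return;
	}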

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund  wrote:
>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>> > that's actually ok; because the value at the beginning of the segment
>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>> > filled the page with zeros.
>> >
>> > I think that should be harmless, the worst that can happen is that
>> > oldestOffset erroneously is 0, which should be correct, even though
>> > possibly overly conservative, in these cases.
>>
>> Uh oh.  That seems like a real bad problem for this approach.  What
>> keeps that from being the opposite of too conservative?  There's no
>> "safe" value in a circular numbering space.
>
> I think it *might* (I'm really jetlagged) be fine because that'll only
> happen after an upgrade from < 9.3. And in that case we initialize
> nextOffset to 0. That ought to save us?

That's pretty much betting the farm on the bugs we know about today
being the only ones there are.  That seems imprudent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas  wrote:
> On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund  wrote:
>>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>>> > that's actually ok; because the value at the beginning of the segment
>>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>>> > filled the page with zeros.
>>> >
>>> > I think that should be harmless, the worst that can happen is that
>>> > oldestOffset erroneously is 0, which should be correct, even though
>>> > possibly overly conservative, in these cases.
>>>
>>> Uh oh.  That seems like a real bad problem for this approach.  What
>>> keeps that from being the opposite of too conservative?  There's no
>>> "safe" value in a circular numbering space.
>>
>> I think it *might* (I'm really jetlagged) be fine because that'll only
>> happen after an upgrade from < 9.3. And in that case we initialize
>> nextOffset to 0. That ought to save us?
>
> That's pretty much betting the farm on the bugs we know about today
> being the only ones there are.  That seems imprudent.

So here's a patch taking a different approach.  In this approach, if
the multixact whose members we want to look up doesn't exist, we don't
use a later one (that might or might not be valid).  Instead, we
attempt to cope with the unknown.  That means:

1. In TruncateMultiXact(), we don't truncate.

2. If setting the offset stop limit (the point where we refuse to
create new multixact space), we don't arm the stop point.  This means
that if you're in this situation, you run without member wraparound
protection until it's corrected.  A message gets logged once per
checkpoint telling you that you have this problem, and another message
gets logged when things get straightened out and the guards are
enabled.

3. If setting the vacuum force point, we assume that it's appropriate
to immediately force vacuum.

I've only tested this very lightly - this is just to see what you and
Noah and others think of the approach.  As compared with the previous
approach, it has the advantage of making minimal assumptions about the
sanity of what's on disk.  It has the disadvantage that, for some
people, the member-wraparound guard won't be enabled at startup -- but
note that those people can't start 9.3.7/9.4.2 *at all*, so currently
they are either running without member wraparound protection anyway
(if they haven't upgraded to those releases) or they're down entirely.
Another disadvantage is that we'll be triggering what may be quite a
bit of autovacuum activity for some people, which could be painful.
On the plus side, they'll hopefully end up with sane relminmxid and
datminmxid guards afterwards.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..4400fc5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -198,13 +198,24 @@ typedef struct MultiXactStateData
 	/* next-to-be-assigned offset */
 	MultiXactOffset nextOffset;
 
+	/* Have we completed multixact startup? */
+	bool		finishedStartup;
+
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that is still potentially referenced by a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
+
+	/*
+	 * Oldest multixact offset that is potentially referenced by a
+	 * multixact referenced by a relation.  We don't always know this value,
+	 * so there's a flag here to indicate whether or not we currently do.
+	 */
 	MultiXactOffset oldestOffset;
+	bool		oldestOffsetKnown;
 
 	/*
 	 * This is what the previous checkpoint stored as the truncate position.
@@ -221,6 +232,7 @@ typedef struct MultiXactStateData
 
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;
+	bool offsetStopLimitKnown;
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -350,10 +362,11 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 		MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
+static void DetermineSafeOldestOffset(MultiXactOffset oldestMXact);
 stat

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 2:42 AM, Noah Misch  wrote:
> I like that change a lot.  It's much easier to seek forgiveness for wasting <=
> 28 GiB of disk than for deleting visibility information wrongly.

I'm glad you like it.  I concur.

>> 2. If setting the offset stop limit (the point where we refuse to
>> create new multixact space), we don't arm the stop point.  This means
>> that if you're in this situation, you run without member wraparound
>> protection until it's corrected.  A message gets logged once per
>> checkpoint telling you that you have this problem, and another message
>> gets logged when things get straightened out and the guards are
>> enabled.
>>
>> 3. If setting the vacuum force point, we assume that it's appropriate
>> to immediately force vacuum.
>
> Those seem reasonable, too.

Cool.

>> I've only tested this very lightly - this is just to see what you and
>> Noah and others think of the approach.  As compared with the previous
>> approach, it has the advantage of making minimal assumptions about the
>> sanity of what's on disk.  It has the disadvantage that, for some
>> people, the member-wraparound guard won't be enabled at startup -- but
>> note that those people can't start 9.3.7/9.4.2 *at all*, so currently
>> they are either running without member wraparound protection anyway
>> (if they haven't upgraded to those releases) or they're down entirely.
>
> That disadvantage is negligible, considering.

All right.

>> Another disadvantage is that we'll be triggering what may be quite a
>> bit of autovacuum activity for some people, which could be painful.
>> On the plus side, they'll hopefully end up with sane relminmxid and
>> datminmxid guards afterwards.
>
> That sounds good so long as each table requires just one successful emergency
> autovacuum.  I'm not seeing code to ensure that the launched autovacuum will
> indeed perform a full-table scan and update relminmxid; is it there?

No.  Oops.

> For sites that can't tolerate an autovacuum storm, what alternative can we
> provide?  Is "SET vacuum_multixact_freeze_table_age = 0; VACUUM " of
> every table, done before applying the minor update, sufficient?

I don't know.  In practical terms, they probably need to ensure that
if pg_multixact/offsets/ does not exist, no relations have
relminmxid = 1 and no remaining databases have datminmxid = 1.
Exactly what it will take to get there is possibly dependent on which
minor release you are running; on current minor releases, I am hopeful
that what you propose is sufficient.

>>  static void
>> -DetermineSafeOldestOffset(MultiXactId oldestMXact)
>> +DetermineSafeOldestOffset(MultiXactOffset oldestMXact)
>
> Leftover change from an earlier iteration?  The values passed continue to be
> MultiXactId values.

Oopsie.

>>   /* move back to start of the corresponding segment */
>> - oldestOffset -= oldestOffset %
>> - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
>> + offsetStopLimit = oldestOffset - (oldestOffset %
>> + (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
>> + /* always leave one segment before the wraparound point */
>> + offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
>> +
>> + /* if nothing has changed, we're done */
>> + if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit)
>> + return;
>>
>>   LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>> - /* always leave one segment before the wraparound point */
>> - MultiXactState->offsetStopLimit = oldestOffset -
>> - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
>> + MultiXactState->offsetStopLimit = oldestOffset;
>
> That last line needs s/oldestOffset/offsetStopLimit/, I presume.

Another oops.

Thanks for the review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas  wrote:
> Thanks for the review.

Here's a new version.  I've fixed the things Alvaro and Noah noted,
and some compiler warnings about set but unused variables.

I also tested it, and it doesn't quite work as hoped.  If started on a
cluster where oldestMultiXid is incorrectly set to 1, it starts up and
indicates that the member wraparound guards are disabled.  But even
after everything is fixed, they don't get enabled until after the next
full restart.  I think that's because TruncateMultiXact() bails out
too early, without calling DetermineSafeOldestOffset.

My attempt at a quick fix for that problem didn't work out, so I'm
posting this version for now to facilitate further review and testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit eb39cf10e4ff853ed4b9d3fab599cf42911e6f70
Author: Robert Haas 
Date:   Thu Jun 4 11:58:49 2015 -0400

bar

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 699497c..209d3e6 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -196,13 +196,24 @@ typedef struct MultiXactStateData
 	/* next-to-be-assigned offset */
 	MultiXactOffset nextOffset;
 
+	/* Have we completed multixact startup? */
+	bool		finishedStartup;
+
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that is still potentially referenced by a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
+
+	/*
+	 * Oldest multixact offset that is potentially referenced by a
+	 * multixact referenced by a relation.  We don't always know this value,
+	 * so there's a flag here to indicate whether or not we currently do.
+	 */
 	MultiXactOffset oldestOffset;
+	bool		oldestOffsetKnown;
 
 	/*
 	 * This is what the previous checkpoint stored as the truncate position.
@@ -219,6 +230,7 @@ typedef struct MultiXactStateData
 
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;
+	bool offsetStopLimitKnown;
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -348,10 +360,11 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 		MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
+static void DetermineSafeOldestOffset(MultiXactOffset oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 		 MultiXactOffset start, uint32 distance);
-static MultiXactOffset find_multixact_start(MultiXactId multi);
+static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int pageno, uint8 info);
 
 
@@ -960,7 +973,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * against catastrophic data loss due to multixact wraparound.  The basic
 	 * rules are:
 	 *
-	 * If we're past multiVacLimit or the safe threshold for member storage space,
+	 * If we're past multiVacLimit or the safe threshold for member storage
+	 * space, or we don't know what the safe threshold for member storage is,
 	 * start trying to force autovacuum cycles.
 	 * If we're past multiWarnLimit, start issuing warnings.
 	 * If we're past multiStopLimit, refuse to create new MultiXactIds.
@@ -969,6 +983,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 *--
 	 */
 	if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
+		!MultiXactState->oldestOffsetKnown ||
 		(MultiXactState->nextOffset - MultiXactState->oldestOffset
 			> MULTIXACT_MEMBER_SAFE_THRESHOLD))
 	{
@@ -1083,7 +1098,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 *--
 	 */
 #define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
+	if (MultiXactState->offsetStopLimitKnown &&
+		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
  nmembers))
 		ereport(ERROR,
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
@@ -1095,7 +,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 		   MultiXactState->offsetStopLimit - nextOffset - 1),
  errhint("Execute a database-wide VACUUM in database with OID %u with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings.",
 		 MultiXactState->oldestMultiXactDB)));
-	else if (MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
+	else if (M

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 1:27 PM, Andres Freund  wrote:
> On 2015-06-04 12:57:42 -0400, Robert Haas wrote:
>> + /*
>> +  * Do we need an emergency autovacuum?  If we're not sure, assume yes.
>> +  */
>> + return !oldestOffsetKnown ||
>> + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
>
> I think without teaching autovac about those rules, this might just lead
> to lots of autovac processes starting without knowing they should do
> something? They know about autovacuum_multixact_freeze_age, but they
> know neither about !oldestOffsetKnown nor, afaics, about pending offset
> wraparounds?

You're right, but that's why the latest patch has changes in
MultiXactMemberFreezeThreshold.
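
For reference, the idea in that function is roughly the following sketch
(patterned on MultiXactMemberFreezeThreshold(); the helper and constant
names are reconstructed from memory and the details are approximate, so
treat this as illustration rather than the committed code):

int
MultiXactMemberFreezeThreshold(void)
{
	MultiXactOffset members;
	uint32		multixacts;
	uint32		victim_multixacts;
	double		fraction;

	ReadMultiXactCounts(&multixacts, &members);

	/* If member space utilization is low, no special action is needed. */
	if (members <= MULTIXACT_MEMBER_SAFE_THRESHOLD)
		return autovacuum_multixact_freeze_max_age;

	/*
	 * Past the safe threshold, scale the effective freeze age down in
	 * proportion to how close we are to the danger threshold, so that
	 * ordinary autovacuum scheduling starts reacting to member-space
	 * pressure.
	 */
	fraction = (double) (members - MULTIXACT_MEMBER_SAFE_THRESHOLD) /
		(MULTIXACT_MEMBER_DANGER_THRESHOLD - MULTIXACT_MEMBER_SAFE_THRESHOLD);
	victim_multixacts = multixacts * fraction;

	/* fraction could be > 1.0, but the lowest possible freeze age is zero */
	if (victim_multixacts > multixacts)
		return 0;
	return multixacts - victim_multixacts;
}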

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 12:57 PM, Robert Haas  wrote:
> On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas  wrote:
>> Thanks for the review.
>
> Here's a new version.  I've fixed the things Alvaro and Noah noted,
> and some compiler warnings about set but unused variables.
>
> I also tested it, and it doesn't quite work as hoped.  If started on a
> cluster where oldestMultiXid is incorrectly set to 1, it starts up and
> indicates that the member wraparound guards are disabled.  But even
> after everything is fixed, they don't get enabled until after the next
> full restart.  I think that's because TruncateMultiXact() bails out
> too early, without calling DetermineSafeOldestOffset.
>
> My attempt at a quick fix for that problem didn't work out, so I'm
> posting this version for now to facilitate further review and testing.

Here's a new version with some more fixes and improvements:

- SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset
when the oldest offset became known if the now-known value happened to
be zero.  Fixed.

- SetOffsetVacuumLimit now logs useful information at the DEBUG1
level, so that you can see that it's doing what it's supposed to.

- TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the
offsetStopLimit even if it can't truncate anything.  This seems
useless, but it's not, because it may be that the last checkpoint
advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real
value, and now we can actually set offsetStopLimit properly.

- TruncateMultiXact no longer calls find_multixact_start when there
are no remaining multixacts.  This is actually a completely separate
bug that goes all the way back to 9.3.0 and can potentially cause
TruncateMultiXact to remove every file in pg_multixact/offsets.
Restarting the cluster becomes impossible because TrimMultiXact barfs.

- TruncateMultiXact now logs a message if the oldest multixact does
not precede the earliest one on disk and is not equal to the next
multixact and yet does not exist.  The value of the log message is
that it discovered the bug mentioned in the previous item, so I think
it's earning its keep.

With this version, I'm able to see that when you start up a
9.3.latest+this patch with a cluster that has a bogus value of 1 in
relminmxid, datminmxid, and the control file, autovacuum vacuums
everything in sight, all the values get set back to the right thing,
and the next checkpoint enables the member-wraparound guards.  This
works with both autovacuum=on and autovacuum=off; the emergency
mechanism kicks in as intended.  We'll want to warn people with big
databases who upgraded to 9.3.0 - 9.3.4 via pg_upgrade that they may
want to pre-vacuum those tables before upgrading to avoid a vacuum
storm.  But generally I'm pretty happy with this: forcing those values
to get fixed so that we can guard against member-space wraparound
seems like the right thing to do.

So, to summarize, this patch does the following:

- Fixes the failure-to-start problems introduced in 9.4.2 in
complicated pg_upgrade scenarios.
- Prevents the new calls to find_multixact_start we added in 9.4.2
from happening during recovery, where they can only create failure
scenarios.  The call in TruncateMultiXact that has been there all
along is not eliminated, but now handles failure more gracefully.
- Fixes possible incorrect removal of every single
pg_multixact/offsets file when no multixacts exist; one file should be
kept.
- Forces aggressive autovacuuming when the control file's
oldestMultiXid doesn't point to a valid MultiXact and enables member
wraparound at the next checkpoint following the correction of that
problem.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit 87aa15fe5257060e0c971e135dd9f460fdc00bd0
Author: Robert Haas 
Date:   Thu Jun 4 11:58:49 2015 -0400

bar

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..7c457a6 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -198,13 +198,24 @@ typedef struct MultiXactStateData
 	/* next-to-be-assigned offset */
 	MultiXactOffset nextOffset;
 
+	/* Have we completed multixact startup? */
+	bool		finishedStartup;
+
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that is still potentially referenced by a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
+
+	/*
+	 * Oldest multixact offset that is potentially referenced by a
+	 * multixact referenced by a relation.  We don't always know this value,
+	 * so there's a flag here to ind

Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 5:29 PM, Robert Haas  wrote:
> - Forces aggressive autovacuuming when the control file's
> oldestMultiXid doesn't point to a valid MultiXact and enables member
> wraparound at the next checkpoint following the correction of that
> problem.

Err, enables member wraparound *protection* at the next checkpoint,
not the wraparound itself.

It's worth noting that every startup will now include one of the
following two messages:

LOG:  MultiXact member wraparound protections are now enabled

Or:

LOG:  MultiXact member wraparound protections are disabled because
oldest checkpointed MultiXact %u does not exist on disk
...where %u is probably 1

If you get the second one, you'll get the first one later after vacuum
has done its thing and a checkpoint has happened.

This is, obviously, some log chatter for people who don't have a
problem and never have, but I think it's worth emitting the first
message at startup even when there's no problem, so that people don't
have to make inferences from the absence of a message.  We can tell
people very simply that (1) if they see the first message, everything
is fine; (2) if they see the second message, autovacuum is going to
clean things up and they will eventually see the first message; and
(3) if they see neither message, they haven't upgraded to a fixed
version yet.
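
In code form, the startup-time decision is roughly this sketch (the
oldestOffsetKnown flag comes from the patch; the name of the variable
holding the checkpointed oldest multixact is an assumption):

	/* Exactly one of the two messages is emitted at every startup. */
	if (MultiXactState->oldestOffsetKnown)
		ereport(LOG,
				(errmsg("MultiXact member wraparound protections are now enabled")));
	else
		ereport(LOG,
				(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
						oldestMulti)));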

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch  wrote:
> On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote:
>> Here's a new version with some more fixes and improvements:
>
> I read through this version and found nothing to change.  I encourage other
> hackers to study the patch, though.  The surrounding code is challenging.

Andres tested this and discovered that my changes to
find_multixact_start() were far more creative than intended.
Committed and back-patched with a trivial fix for that stupidity and a
novel-length explanation of the changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund  wrote:
> On 2015-06-05 11:43:45 -0400, Tom Lane wrote:
>> Robert Haas  writes:
>> > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch  wrote:
>> >> I read through this version and found nothing to change.  I encourage 
>> >> other
>> >> hackers to study the patch, though.  The surrounding code is challenging.
>>
>> > Andres tested this and discovered that my changes to
>> > find_multixact_start() were far more creative than intended.
>> > Committed and back-patched with a trivial fix for that stupidity and a
>> > novel-length explanation of the changes.
>>
>> So where are we on this?  Are we ready to schedule a new set of
>> back-branch releases?  If not, what issues remain to be looked at?
>
> We're currently still doing bad things while the database is in an
> inconsistent state (i.e. read from SLRUs and truncate based on the
> results of that). It's quite easy to reproduce base backup startup
> failures.
>
> On the other hand, that's not new. And the fix requires, afaics, a new
> type of WAL record (issued very infrequently). I'll post a first version
> of the patch, rebased ontop of Robert's commit, tonight or tomorrow. I
> guess we can then decide what we'd like to do.

There are at least two other known issues that seem like they should
be fixed before we release:

1. The problem that we might truncate an SLRU members page away when
it's in the buffers, but not drop it from the buffers, leading to a
failure when we try to write it later.

2. Thomas's bug fix for another longstanding bug that occurs when you
run his checkpoint-segment-boundary.sh script.

I think we might want to try to fix one or both of those before
cutting a new release.  I'm less sold on the idea of installing
WAL-logging in this minor release.  That probably needs to be done,
but right now we've got stuff that worked in early 9.3.X releases and
is now broken, and I'm in favor of fixing that first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund  wrote:
> On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
>> Robert Haas  writes:
>> > 1. The problem that we might truncate an SLRU members page away when
>> > it's in the buffers, but not drop it from the buffers, leading to a
>> > failure when we try to write it later.
>
> I've got a fix for this, and about three other issues I found during
> development of the new truncation codepath.
>
> I'll commit the fix tomorrow.

OK.  Then I think we should release next week, so we get the fixes we
have out before PGCon.  The current situation is not good.

>> > I think we might want to try to fix one or both of those before
>> > cutting a new release.  I'm less sold on the idea of installing
>> > WAL-logging in this minor release.  That probably needs to be done,
>> > but right now we've got stuff that worked in early 9.3.X releases and
>> > is now broken, and I'm in favor of fixing that first.
>
> I've implemented this, and so far it removes more code than it
> adds. It's imo also a pretty clear win in how understandable the code
> is.  The remaining work, besides testing, is primarily going over lots
> of comments and updating them. Some of them are outdated by the patch,
> and some already were.
>
> Will post tonight, together with the other fixes, after I get back from
> climbing.
>
> My gut feeling right now is that it's a significant improvement, and
> that it'll be reasonable to include it. But I'd definitely like some
> independent testing for it, and I'm not sure if that's doable in time
> for the wrap.

I think we would be foolish to rush that part into the tree.  We
probably got here in the first place by rushing the last round of
fixes too much; let's try not to double down on that mistake.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:36 PM, Alvaro Herrera  wrote:
> Tom Lane wrote:
>> Robert Haas  writes:
>
>> > There are at least two other known issues that seem like they should
>> > be fixed before we release:
>>
>> > 1. The problem that we might truncate an SLRU members page away when
>> > it's in the buffers, but not drop it from the buffers, leading to a
>> > failure when we try to write it later.
>>
>> > 2. Thomas's bug fix for another longstanding but that occurs when you
>> > run his checkpoint-segment-boundary.sh script.
>>
>> > I think we might want to try to fix one or both of those before
>> > cutting a new release.  I'm less sold on the idea of installing
>> > WAL-logging in this minor release.  That probably needs to be done,
>> > but right now we've got stuff that worked in early 9.3.X releases and
>> > is now broken, and I'm in favor of fixing that first.
>>
>> Okay, but if we're not committing today to a release wrap on Monday,
>> I don't see it happening till after PGCon.
>
> In that case, I think we should get a release out next week.  The
> current situation is rather badly broken and dangerous, and the above
> two bugs are nowhere as problematic.  If we can get fixes for these over
> the weekend, that would be additional bonus.

Yeah, I think I agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 4:40 PM, Andres Freund  wrote:
>>I think we would be foolish to rush that part into the tree.  We
>>probably got here in the first place by rushing the last round of
>>fixes too much; let's try not to double down on that mistake.
>
> My problem with that approach is that I think the code has gotten 
> significantly more complex in the last few weeks. I have very little trust
> that the interactions between vacuum, the deferred truncations in the 
> checkpointer, the state management in shared memory and recovery are correct. 
> There's just too many non-local subtleties here.
>
> I don't know what the right thing to do here is.

That may be true, but we don't need to get to perfect to be better
than 9.4.2 and 9.4.3, where some people can't start the database.

I will grant you that, if the patch I committed today introduces some
regression that is even worse, life will suck.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Robert Haas
On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera  wrote:
> Andres Freund wrote:
>> On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera 
>>  wrote:
>> >I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER
>> >only causes things to happen (i.e. a new worker to be started) when
>> >autovacuum is disabled.  If autovacuum is enabled, postmaster
>> >receives the signal and doesn't do anything about it, because the
>> >launcher is already running.  Of course, regularly scheduled autovac
>> >workers will eventually start running, but perhaps this is not good
>> >enough.
>>
>> Well that's just the same for the plain xid precedent? I'd not mind
>> improving further, but that seems like a separate thing.
>
> Sure.  I'm just concerned that we might be putting excessive trust in
> emergency workers being launched at a high pace.  With normally
> configured systems (naptime=1min) it shouldn't be a problem, but we have
> seen systems with naptime set to one hour or so, and those might feel
> some pain; and it would get worse the more databases you have, because
> people might feel the need to space the autovac runs even more.
>
> (My personal alarm bells go off when I see autovac_naptime=15min or
> more, but apparently not everybody sees things that way.)

Uh, I'd echo that sentiment if you did s/15min/1min/

I think Andres's patch is just improving the existing mechanism so
that it's reliable, and you're proposing something notably different
which might be better, but which is really a different proposal
altogether.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-15 Thread Robert Haas
On Fri, Jun 12, 2015 at 7:27 PM, Steve Kehlet  wrote:
> Just wanted to report that I rolled back my VM to where it was with 9.4.2
> installed and it wouldn't start. I installed 9.4.4 and now it starts up just
> fine:
>
>> 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG:  database system was shut down
>> at 2015-05-27 13:12:55 PDT
>> 2015-06-12 16:05:58 PDT [6453]: [2-1] LOG:  MultiXact member wraparound
>> protections are disabled because oldest checkpointed MultiXact 1 does not
>> exist on disk
>> 2015-06-12 16:05:58 PDT [6457]: [1-1] LOG:  autovacuum launcher started
>> 2015-06-12 16:05:58 PDT [6452]: [1-1] LOG:  database system is ready to
>> accept connections
>>  done
>> server started
>
> And this is showing up in my serverlog periodically as the emergency
> autovacuums are running:
>
>> 2015-06-12 16:13:44 PDT [6454]: [1-1] LOG:  MultiXact member wraparound
>> protections are disabled because oldest checkpointed MultiXact 1 does not
>> exist on disk
>
> **Thank you Robert and all involved for the resolution to this.**
>
>> With the fixes introduced in this release, such a situation will result in
>> immediate emergency autovacuuming until a correct oldestMultiXid value can
>> be determined
>
> Okay, I notice these vacuums are of the "to prevent wraparound" type (like
> VACUUM FREEZE), that do hold locks preventing ALTER TABLEs and such. Good to
> know, we'll plan our software updates accordingly.
>
> Is there any risk until these autovacuums finish?

As long as you see only a modest number of files in
pg_multixact/members, you're OK.  But in theory, until that emergency
autovacuuming finishes, there's nothing keeping that directory from
wrapping around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [GENERAL] [HACKERS] Change in order of criteria - reg

2016-06-03 Thread Robert Haas
On Wed, Jun 1, 2016 at 5:22 AM, Amit Langote
 wrote:
> On 2016/06/01 13:07, sri harsha wrote:
>> Hi,
>>
>> In PostgreSQL , does the order in which the criteria is given matter ??
>> For example
>>
>> Query 1 : Select * from TABLE where a > 5 and b < 10;
>>
>> Query 2 : Select * from TABLE where b <10 and a > 5;
>>
>> Are query 1 and query 2 the same in PostgreSQL or different ?? If its
>> different , WHY ??
>
> tl;dr they are the same.  They obviously produce the same result and
> invoke the same plan.
>
> Internally, the optimizer will order application of those quals in the
> resulting plan based on the per-tuple cost of individual quals.  So a
> cheaper, more selective qual might result in short-circuiting of
> relatively expensive quals for a large number of rows in the table,
> saving some cost at run-time.  Also, if an index scan is chosen and
> quals are pushed down, the underlying index method might know to order
> quals smartly.
>
> However, the cost markings of operators/functions involved in quals had
> better match reality.  By default, most operators/functions in a database are
> marked with cost of 1 unit.  Stable sorting used in ordering of quals
> would mean the order of applying quals in resulting plan matches the
> original order (ie, the order in which they appear in the query).  So, if
> the first specified qual really happens to be an expensive qual but marked
> as having the same cost as other less expensive quals, one would have to
> pay the price of evaluating it for all the rows.  Whereas, correctly
> marking the costs could have avoided that (as explained above).  Note that
> I am not suggesting that ordering quals in query by their perceived cost
> is the solution.  Keep optimizer informed by setting costs appropriately
> and it will do the right thing more often than not. :)

I think that if the costs are actually identical, the system will keep
the quals in the same order they were written - so then the order does
matter, a little bit.
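
To illustrate, here is a self-contained sketch of that behavior, modeled
loosely on the planner's order_qual_clauses(); the types and names are
illustrative, not PostgreSQL's:

#include <stddef.h>

typedef double Cost;

typedef struct QualItem
{
	void	   *clause;			/* the qual expression (opaque here) */
	Cost		cost;			/* estimated per-tuple evaluation cost */
} QualItem;

/*
 * Stable insertion sort by cost: because the comparison is strict,
 * items with equal cost keep their original (written) order, which is
 * why the textual order of quals still matters when costs tie.
 */
static void
sort_quals_by_cost(QualItem *items, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		QualItem	newitem = items[i];
		size_t		j = i;

		while (j > 0 && items[j - 1].cost > newitem.cost)
		{
			items[j] = items[j - 1];
			j--;
		}
		items[j] = newitem;
	}
}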

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] PgQ and pg_dump

2016-06-21 Thread Robert Haas
On Thu, Jun 16, 2016 at 1:46 PM, Martín Marqués  wrote:
> The comment is accurate on what is going to be dumpable and what's not
> from the code. In our case, as the pgq schema is not dumpable because
> it comes from an extension, other objects it contains will not be
> dumpable either.
>
> That's the reason why the PgQ event tables created by
> pgq.create_queue() are not dumped.

That sucks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [GENERAL] [HACKERS] pg_dumping extensions having sequences with 9.6beta3

2016-07-29 Thread Robert Haas
On Wed, Jul 27, 2016 at 2:24 AM, Michael Paquier
 wrote:
> On Wed, Jul 27, 2016 at 8:07 AM, Stephen Frost  wrote:
>> That'd be great.  It's definitely on my list of things to look into, but
>> I'm extremely busy this week.  I hope to look into it on Friday, would
>> be great to see what you find.
>
> Sequences that are directly defined in extensions do not get dumped,
> and sequences that are part of a serial column in an extension are
> getting dumped. Looking into this problem, getOwnedSeqs() is visibly
> making an incorrect assumption: sequences owned by table columns are
> dumped unconditionally, but this is not true for sequences that are
> part of extensions. More precisely, dobj->dump is being enforced to
> DUMP_COMPONENT_ALL, which makes the sequence definition to be dumped.
> Oops.
>
> The patch attached fixes the problem for me. I have also added tests
> in test_pg_dump in the shape of sequences defined in an extension, and
> sequences that are part of a serial column. This patch also works in
> the case where a sequence is created as part of a serial column and
> gets removed afterwards, say with ALTER EXTENSION
> DROP SEQUENCE. The behavior for sequences and serial columns that are
> not part of extensions is unchanged.
>
> Stephen, it would be good if you could check the correctness of this
> patch as you did all this refactoring of pg_dump to support catalog
> ACLs. I am sure, by the way, that checking for (owning_tab->dobj.dump &
> DUMP_COMPONENT_DEFINITION) != 0 is not good because of, for example, the
> case of a serial column created in an extension where the sequence is
> dropped from the extension afterwards.

Stephen, is this still on your list of things for today?  Please
provide a status update.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] Request - repeat value of \pset title during \watch interations

2016-03-19 Thread Robert Haas
On Thu, Mar 10, 2016 at 1:40 AM, David G. Johnston
 wrote:
> Adding -hackers for consideration in the Commitfest.

I don't much like how this patch uses the arbitrary constant 50 in no
fewer than 5 locations.

Also, it seems like we could arrange for head_title to be "" rather
than NULL when myopt.title is NULL.  Then instead of this:

+	if (head_title)
+		snprintf(title, strlen(myopt.title) + 50,
+				 _("Watch every %lds\t%s\n%s"),
+				 sleep, asctime(localtime(&timer)), head_title);
+	else
+		snprintf(title, 50, _("Watch every %lds\t%s"),
+				 sleep, asctime(localtime(&timer)));

...we could just use the first branch of that if all the time.

 	if (res == -1)
+	{
+		pg_free(title);
+		pg_free(head_title);
 		return false;
+	}

Instead of repeating the cleanup code, how about making this a break?
Then change the return statement at the bottom of the function to
return (res != -1).
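
Putting those suggestions together, the shape I have in mind is roughly
this sketch (variable names follow the patch under review; the elisions
stand for the existing per-iteration work):

	head_title = pg_strdup(myopt.title != NULL ? myopt.title : "");
	title_len = strlen(head_title) + 50;	/* headroom for the timestamp */
	title = pg_malloc(title_len);

	/*
	 * ... inside the loop, a single snprintf branch then works for both
	 * empty and non-empty titles ...
	 */
	snprintf(title, title_len, _("Watch every %lds\t%s\n%s"),
			 sleep, asctime(localtime(&timer)), head_title);

	/*
	 * ... and on failure, break out of the loop instead of returning, so
	 * that cleanup happens exactly once afterwards:
	 */
	pg_free(title);
	pg_free(head_title);
	return (res != -1);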

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] Request - repeat value of \pset title during \watch interations

2016-03-21 Thread Robert Haas
On Sun, Mar 20, 2016 at 9:31 AM, Michael Paquier
 wrote:
> And the patch attached gives the following output:
> With title:
> =# \watch 1
> Watch every 1s    Sun Mar 20 22:28:38 2016
> popo
>  a
> ---
>  1
> (1 row)
>
> And without title:
> Watch every 1s    Sun Mar 20 22:29:31 2016
>
>  a
> ---
>  1
> (1 row)

And does everybody agree that this is a desirable change?

As for the patch itself, you could replace all this:

+   /*
+    * Take into account any title present in the user setup as a part of
+    * what is printed for each iteration by using it as a header.
+    */
+   if (myopt.title)
+   {
+       title_len = strlen(myopt.title);
+       title = pg_malloc(title_len + 50);
+       head_title = pg_strdup(myopt.title);
+   }
+   else
+   {
+       title_len = 0;
+       title = pg_malloc(50);
+       head_title = pg_strdup("");
+   }

...with:

head_title = pg_strdup(myopt.title != NULL ? myopt.title : "");
title_len = strlen(head_title);
title = pg_malloc(title_len + 50);

Better yet, include the + 50 in title_len, and then you don't need to
reference the number 50 again further down.
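
Putting both suggestions together, the hunk would reduce to something
like this (a sketch using psql's pg_strdup/pg_malloc helpers; untested):

head_title = pg_strdup(myopt.title != NULL ? myopt.title : "");
title_len = strlen(head_title) + 50;    /* slack for the watch banner */
title = pg_malloc(title_len);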

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Request - repeat value of \pset title during \watch iterations

2016-03-21 Thread Robert Haas
On Mon, Mar 21, 2016 at 11:17 AM, David G. Johnston
 wrote:
>> And does everybody agree that this is a desirable change?
>
> Adding the title is desirable.  While I'm inclined to bike-shed this
> anything that gets it in I can live with and so I'm content letting the
> author/committer decide where exactly things (including whitespace) appear.
>
> It is a bit odd that the "Watch every %s" gets centered if the result is
> wide but that the title remains left-aligned.

Well, the title isn't normally centered, but yeah, that is odd.  Come
to think of it, I think I might have expected the
title to appear *above* "Watch every %s", not below it.  That might
decrease the oddness.

As for letting the committer decide, I don't care about this
personally at all, so I'm only looking at it to be nice to the people
who do.  Whatever is the consensus is OK with me.  I just don't want
to get yelled at later for committing something here, so it would be
nice to see a few votes for whatever we're gonna do here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Request - repeat value of \pset title during \watch iterations

2016-03-21 Thread Robert Haas
On Mon, Mar 21, 2016 at 2:09 PM, David G. Johnston
 wrote:
> On Monday, March 21, 2016, Tom Lane  wrote:
>> "David G. Johnston"  writes:
>> > I'd rather not omit sleep but removing "Watch every" is fine (preferred
>> > actually), so:
>> > Title Is Here Mon Mar 21 15:05:06 2016 (5s)
>>
>> Meh ... seems a bit awkward to me.  Couldn't you include " (5s)" in the
>> title, if you want that info?  If it's variable, you could still
>> accommodate that:
>
> Actually, only if it's a variable that you set up and repeat and show.  A
> bit cumbersome and mixes the parts that are title and those that are present
> only because you are watching.

Ah, come on.  This doesn't really seem like an issue we should spend
more time quibbling about.  I think Tom's version is fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Large object corruption during 'piped' pg_restore

2011-01-21 Thread Robert Haas
On Fri, Jan 21, 2011 at 12:44 PM, Bosco Rama  wrote:
> Tom Lane wrote:
>>
>> So I'm not sure whether to fix it, or leave it as a known failure case
>> in old branches.  Comments?
>
> I understand the reluctance to fool with stable code.  I have zero insight
> into your installed versions distribution and backward compatibility needs
> so any comment I may have here is purely selfish.
>
> As an end user there is one area of the DB that I want to work correctly
> 100% of the time and that is the dump/restore tool(s).  If it's not going
> to work under certain circumstances it should at least tell me so and
> fail.  I don't think having the tool appear to work while corrupting the
> data (even if documented as doing so) is a viable method of operation.

Yeah, I lean toward saying we should back-patch this.  Yeah, it'll be
lightly tested, but it's a pretty confined change, so it's unlikely to
break anything else.  ISTM the worst case scenario is that it takes
two minor releases to get it right, and even that seems fairly
unlikely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Issues with generate_series using integer boundaries

2011-02-08 Thread Robert Haas
On Tue, Feb 8, 2011 at 8:30 PM, Andrew Dunstan  wrote:
> Quite right, but the commitfest manager isn't meant to be a substitute for
> one. Bug fixes aren't subject to the same restrictions of feature changes.

Another option would be to add this here:

http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Open_Items

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Shared memory changes in 9.4?

2014-05-27 Thread Robert Haas
On Tue, May 27, 2014 at 8:22 PM, Maciek Sakrejda  wrote:
> On Mon, May 26, 2014 at 12:24 AM, Andres Freund 
> wrote:
>>
>> Any chance you're using a 9.3 configuration file instead of the one
>> generated by initdb?
>> dynamic_shared_memory_type defaults to 'posix' if not specified in the
>> config file (on platforms supporting it). If initdb detects that 'posix'
>> can't be used it'll emit a different value. If you're copying the config
>> from 9.3 and your environment doesn't support posix shm that'll cause
>> the above error.
>> I still think dynamic_shared_memory_type should default to 'none'
>> because of such problems
>
> It works with 'none' and 'sysv'--I think the issue is that technically our
> environment does support 'posix', but '/dev/shm' is indeed not mounted in
> the LXC container, leading to a discrepancy between what initdb decides and
> what's actually possible. Thanks for your help.

I think it would be good to understand why initdb isn't getting this
right.  Did you run initdb outside the LXC container, where /dev/shm
would have worked, but then run postgres inside the LXC container,
where /dev/shm does not work?  I ask because initdb is supposed to be
doing the same thing that postgres does, so it really ought to come to
the same conclusion about what will and won't work.

With regard to Andres' proposal, I'm not that keen on setting
dynamic_shared_memory_type='none' by default.  Would we leave it that
way until we get in-core users of the facility, and then change it?  I
guess that'd be OK, but frankly if enabling dynamic_shared_memory_type
by default is causing us many problems, then we'd better reconsider
the design of the facility now, before we start adding more
dependencies on it.  We've already fixed a bunch of DSM-related issues
as a result of the fact that the default *isn't* none, and I dunno how
many of those we would have found if the default had been none.  I
tend to think DSM is an important facility that we're going to be
wanting to build on in future releases, so I'm keen to have it
available by default so that we can iron out any kinks before we get
too far down that path.
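
In the meantime, a workaround for environments like the LXC case above
is to choose a type that actually works there and restart; for example
(9.4+):

-- takes effect at the next server restart
ALTER SYSTEM SET dynamic_shared_memory_type = 'sysv';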

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Question about partial functional indexes and the query planner

2014-06-11 Thread Robert Haas
On Tue, Jun 10, 2014 at 7:19 PM, Tom Lane  wrote:
> Given the lack of previous complaints, I'm not sure this amounts to
> a back-patchable bug, but it does seem like something worth fixing
> going forward.

Agreed, although I'd be willing to see us slip it into 9.4.  It's
doubtful that anyone will get upset if their query plans change
between beta1 and beta2, but the same cannot be said for released
branches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Performance issue with libpq prepared queries on 9.3 and 9.4

2014-11-17 Thread Robert Haas
On Thu, Nov 13, 2014 at 7:34 PM, Tom Lane  wrote:
> One thing that occurs to me is that if the generic plan estimate comes
> out much cheaper than the custom one, maybe we should assume that the
> generic's cost estimate is bogus.  Right offhand I can't think of a reason
> for a custom plan to look worse than a generic one, unless there's a
> statistical quirk like this one.

That's an interesting idea, but what do we do after deciding that it's
bogus?  The generic plan really can't be cheaper than the custom plan,
but it could be the same price, or as close as makes no difference.

I have long thought that maybe the custom vs. generic plan decision
should have something to do with whether the parameters are MCVs of
the table columns they're being compared against.  This case is
interesting because it demonstrates that querying for a *non*-MCV can
require a switch to a custom plan, which is something I don't think
I've seen before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Performance issue with libpq prepared queries on 9.3 and 9.4

2014-11-19 Thread Robert Haas
On Mon, Nov 17, 2014 at 4:27 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Nov 13, 2014 at 7:34 PM, Tom Lane  wrote:
>>> One thing that occurs to me is that if the generic plan estimate comes
>>> out much cheaper than the custom one, maybe we should assume that the
>>> generic's cost estimate is bogus.  Right offhand I can't think of a reason
>>> for a custom plan to look worse than a generic one, unless there's a
>>> statistical quirk like this one.
>
>> That's an interesting idea, but what do we do after deciding that it's
>> bogus?
>
> Keep using custom plans.  It's possible that the estimate that's in error
> is the custom one, but that's not the way to bet IMO, since the custom
> plan estimate is based on better information.
>
>> The generic plan really can't be cheaper than the custom plan,
>> but it could be the same price, or as close as makes no difference.
>
> Right, and what we want to do is use the generic plan as long as it's
> close to the same cost (close enough to not justify replanning effort).
> The trick here is to not be fooled by estimation errors.  Can we assume
> that generic cost < custom cost is always an estimation error?

Maybe.  It seems like kind of a fragile bet to me.  There's going to
be some qual selectivity below which an index scan on a particular
table outperforms a sequential scan, but the selectivity estimated for
a generic plan can be either higher or lower than the selectivity we'd
estimate for some particular value.  And once one of the two plans
decides on an index scan while the other one decides on a sequential
scan, it can cascade through and change the whole plan - e.g. because
it affects whether the tuples emerge with usable pathkeys.  I don't
feel very confident about assuming that applying < to the result of
all that is going to tell us anything useful.

I think what's really going on here is that the generic plan will be
optimal for some range of possible qual selectivities.  Typically, the
qual is of the form col = val, so the plan will be optimal for all
values where the estimated frequency is between some values A and B.
What would be nice is to notice when we see a value that is outside
that range, and switch to a custom plan in that case.  I realize that
the planner isn't really well set up to deliver the information we'd
need to test that for each new supplied value, but that's really what
we need to do.  The current system wastes CPU cycles by replanning up
to 5 times even when there is no benefit to be gained by it, but can
also cause big performance problems when it settles into a generic
plan and then a value with different characteristics shows up later
on.
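
A hypothetical illustration of that failure mode (table and values
invented):

PREPARE q(int) AS SELECT * FROM orders WHERE customer_id = $1;
-- the first five EXECUTEs are planned with the actual value; if the
-- generic plan then looks no worse, it is cached and reused even for
-- later values whose selectivity is wildly different
EXPLAIN EXECUTE q(42);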

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] REFERENCES privilege should not be symmetric (was Re: [GENERAL] Postgres Permissions Article)

2017-03-31 Thread Robert Haas
On Thu, Mar 30, 2017 at 4:45 PM, Tom Lane  wrote:
> In short, it seems like this statement in the docs is correctly describing
> our code's behavior, but said behavior is wrong and should be changed.
> I'd propose fixing it like that in HEAD; I'm not sure if the back branches
> should also be changed.

Sounds reasonable, but I don't see much advantage to changing it in
the back-branches.
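
For reference, a sketch of the asymmetry being discussed (role and table
names hypothetical):

CREATE TABLE parent (id int PRIMARY KEY);
GRANT REFERENCES ON parent TO alice;
-- as alice, who owns the referencing table: per the spec, REFERENCES on
-- the *referenced* table alone should be enough
CREATE TABLE child (parent_id int REFERENCES parent (id));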

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] REFERENCES privilege should not be symmetric (was Re: [GENERAL] Postgres Permissions Article)

2017-03-31 Thread Robert Haas
On Fri, Mar 31, 2017 at 11:29 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Mar 30, 2017 at 4:45 PM, Tom Lane  wrote:
>>> In short, it seems like this statement in the docs is correctly describing
>>> our code's behavior, but said behavior is wrong and should be changed.
>>> I'd propose fixing it like that in HEAD; I'm not sure if the back branches
>>> should also be changed.
>
>> Sounds reasonable, but I don't see much advantage to changing it in
>> the back-branches.
>
> Well, it's a SQL-compliance bug, and we often back-patch bug fixes.

Personally, I'd be more inclined to view it as a definitional change.
Maybe we picked the wrong definition before, but it's well-established
behavior at this point.  The SQL standard also says that identifiers
are supposed to be folded to upper case, so I don't think the theory
that every deviation from the standard should be fixed and
back-patched holds a lot of water.

> The argument for not back-patching a bug fix usually boils down to
> fear of breaking existing applications, but it's hard to see how
> removal of a permission check could break a working application ---
> especially when the permission check is as hard to trigger as this one.
> How many table owners ever revoke their own REFERENCES permission?

Sure, but that argument cuts both ways.  If nobody ever does that, who
will be helped by back-patching this?

I certainly agree that back-patching this change is pretty low risk.
I just don't think it has any real benefits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] contributing patches

2008-03-11 Thread Robert Haas
I have to dig this up and see if I still have it.

...Robert

-Original Message-
From: Bruce Momjian [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 11, 2008 1:05 PM
To: Robert Haas
Cc: Tom Lane; pgsql-general@postgresql.org
Subject: Re: [GENERAL] contributing patches


Great, would you please send your truncate patch to the patches email
list?  Thanks.


---

Robert Haas wrote:
> > We require that all submissions conform to the Postgres BSD license,
> > but we are not picky about requiring paperwork to prove it.  Just
put
> > the same copyright header into any added files as you see in
existing
> > files.
> 
> OK cool.
> 
> > Yeah, the core team does not have a lot of bandwidth right now for
> > thinking about 8.4 development.  You don't have to wait for 8.3
> > final release, but probably after the first couple of betas are out
> > would be better than now.  Bear in mind also that any patches
> developed
> > now are likely to have merge problems after the pgindent run that
will
> > occur late in beta --- so you might want to wait till after that
> before
> > starting any large coding effort.
> 
> OK.  I wrote a patch to implement the following TODO item, which I
> selected on the basis of the fact that it appeared to be easy:
> 
> * %Add a separate TRUNCATE permission
> 
>   Currently only the owner can TRUNCATE a table because triggers are
not
>   called, and the table is locked in exclusive mode.
> 
> It was easy, so I can update the patch as necessary until it can be
> submitted.
> 
> Thanks,
> 
> ...Robert
> 
> ---(end of
broadcast)---
> TIP 4: Have you searched our list archives?
> 
>http://archives.postgresql.org/

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB  http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [HACKERS] libpq port number handling

2009-09-24 Thread Robert Haas
On Thu, Sep 24, 2009 at 8:36 PM, Tom Lane  wrote:
> BTW, are port numbers still limited to 16 bits in IPv6?

Yes.

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [HACKERS] libpq port number handling

2009-09-24 Thread Robert Haas
On Thu, Sep 24, 2009 at 8:59 PM, Tom Lane  wrote:
> Sam Mason  writes:
>> +             if (portnum < 1 || portnum > 65535)
>
> BTW, it strikes me that we could tighten this even more by rejecting
> target ports below 1024.  This is guaranteed safe on all Unix systems
> I know of, because privileged ports can only be listened to by root-owned
> processes and we know the postmaster won't be one.  I am not sure
> whether it would be possible to start the postmaster on a low-numbered
> port on Windows though.  Anyone know?  Even if it's possible, do we
> want to allow it?

I don't think we get much benefit out of artificially limiting libpq
in this way.  In 99.99% of cases it won't matter, and in the other
0.01% it will be a needless annoyance.  I think we should restrict
ourselves to checking what is legal, not what we think is a good idea.

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [HACKERS] Urgent Help Required

2013-10-08 Thread Robert Haas
*Don't* VACUUM FULL.  Just VACUUM.  It's not the same thing.

...Robert


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Error compiling sepgsql in PG9.1

2011-05-24 Thread Robert Haas
2011/5/24 Kohei Kaigai :
> The attached patch makes the configure script abort when it is run with
> the '--with-selinux' option but libselinux is older than the minimum
> version required by SE-PostgreSQL.
>
> As the documentation says, it needs at least libselinux-2.0.93, because
> that and later versions support selabel_lookup(3) for database object
> classes, which is used for initial labeling.
>
> The current configure script checks for the existence of libselinux, but
> performs no version check.  (getpeercon_raw(3) has been a supported API
> for a long time.)  selinux_sepgsql_context_path(3) is a good watermark
> for libselinux-2.0.93 instead.

Looks to me like you need to adjust the wording of the error message.

Maybe "libselinux version 2.0.93 or newer is required", or something like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Issues with generate_series using integer boundaries

2011-06-17 Thread Robert Haas
On Wed, Feb 9, 2011 at 4:50 AM, Thom Brown  wrote:
> On 9 February 2011 02:11, Robert Haas  wrote:
>> On Tue, Feb 8, 2011 at 8:30 PM, Andrew Dunstan  wrote:
>>> Quite right, but the commitfest manager isn't meant to be a substitute for
>>> one. Bug fixes aren't subject to the same restrictions of feature changes.
>>
>> Another option would be to add this here:
>>
>> http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Open_Items
>
> I've removed it from the commitfest because it really doesn't belong
> there, and I've added it to the open items list.

So, I finally got around to look at this, and I think there is a
simpler solution.  When an overflow occurs while calculating the next
value, that just means that the value we're about to return is the
last one that should be generated.  So we just need to frob the
context state so that the next call will decide we're done.  There are
any of number of ways to do that; I just picked what looked like the
easiest one.
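
The boundary case in question: a series that ends exactly at the int4
maximum must emit its last value and then stop, rather than erroring or
wrapping around:

SELECT count(*) FROM generate_series(2147483643, 2147483647);
-- expected: 5 rows, with no error and no infinite loop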

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


generate-series-overflow.patch
Description: Binary data

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Issues with generate_series using integer boundaries

2011-06-17 Thread Robert Haas
On Fri, Jun 17, 2011 at 2:15 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> So, I finally got around to look at this, and I think there is a
>> simpler solution.  When an overflow occurs while calculating the next
>> value, that just means that the value we're about to return is the
>> last one that should be generated.  So we just need to frob the
>> context state so that the next call will decide we're done.  There are
>> any of number of ways to do that; I just picked what looked like the
>> easiest one.
>
> +1 for this solution.
>
> BTW, there was some mention of changing the timestamp versions of
> generate_series as well, but right offhand I'm not convinced that
> those need any change.  I think you'll get overflow detection there
> automatically from the functions being used --- and if not, it's a
> bug in those functions, not in generate_series.

Maybe not, because those functions probably throw an error if an
overflow is detected, and that's not really correct.  By definition,
the second argument to generate_series() is the point at which we
should stop generating, and that point has to be within the range of the
underlying data type, by definition.  So if an overflow occurs, that's
just another way of saying that we've certainly gone past the stop
point and needn't generate anything further.  The error is an artifact
of the method we've used to generate the next point.

I'm not sure how much energy it's worth expending on that case.  Using
really large dates may be less common than using values that strain
the range of a 4-byte integer.  But it might at least be worth a TODO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Issues with generate_series using integer boundaries

2011-06-17 Thread Robert Haas
On Fri, Jun 17, 2011 at 10:39 AM, David Johnston  wrote:
> Tangential comment but have you considered emitting a warning (and/or log
> entry) when you are 10,000-50,000 away from issuing the last available
> number in the sequence so that some recognition exists that any code
> depending on the sequence is going to fail soon?
>
> Also, during sequence creation you know the integer type being used so that
> maximum value is known and an overflow should not need to come into play (I
> guess the trade-off is the implicit "try-catch" [or whatever mechanism C
> uses] performance hit versus the need to store another full integer in the
> data structure).
>
> You could also give access to the "warning threshold" value so that the
> developer can change it to whatever value is desired (with a meaningful
> default of course).

There are already tools out there that can monitor this stuff - for
example, check_postgres.pl.

http://bucardo.org/check_postgres/check_postgres.pl.html#sequence

We tend to avoid emitting warnings for this kind of thing because they
can consume vast amounts of disk space, and a lot of times no one's
looking at them anyway.
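
A rough manual equivalent of that check (assuming PostgreSQL 10 or
later, where the pg_sequences view exists):

SELECT schemaname, sequencename, last_value,
       max_value - last_value AS remaining
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY remaining;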

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [PERFORM] change sample size for statistics

2011-06-28 Thread Robert Haas
On Mon, Jun 13, 2011 at 6:33 PM, Willy-Bas Loos  wrote:
> On Fri, Jun 10, 2011 at 9:58 PM, Josh Berkus  wrote:
>>
>> It's not 10%.  We use a fixed sample size, which is configurable on the
>> system, table, or column basis.
>
> It seems that you are referring to "alter column set statistics" and
> "default_statistics_target", which are the number of percentiles in the
> histogram (and MCVs).
> I mean the number of records that are scanned by analyze to come to the
> statistics for the planner, especially n_distinct.

In 9.0+ you can do ALTER TABLE .. ALTER COLUMN .. SET (n_distinct = ...);
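
For example (hypothetical names; the override takes effect at the next
ANALYZE):

ALTER TABLE orders ALTER COLUMN customer_id SET (n_distinct = 500000);
ANALYZE orders;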

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Creating temp tables inside read only transactions

2011-07-08 Thread Robert Haas
On Fri, Jul 8, 2011 at 2:21 AM, Darren Duncan  wrote:
> I think an even better way to support this is would be based on Postgres
> having support for directly using multiple databases within the same SQL
> session at once, as if namespaces were another level deep, the first level
> being the databases, the second level the schemas, and the third level the
> schema objects.
>
> Kind of like what the SQL standard defines its catalog/schema/object
> namespaces.
>
> This instead of needing to use federating or that contrib module to use
> multiple Pg databases of the same cluster at once.

But if that's what you want, just don't put your data in different
databases in the first place.  That's what schemas are for.

If for some reason we needed to have tables that happened to be called
x.y.z and a.b.c accessible from a single SQL session, we could allow
that much more simply by allowing schemas to be nested.  Then we could
allow arbitrary numbers of levels, not just three.  The whole point of
having databases and schemas as separate objects is that they do
different things: schemas are just containers for names, allowing
common access to data, and databases are completely separate entities,
allowing privilege separation for (say) a multi-tenant hosting
environment.  We're not going to throw out the latter concept just so
people can use two dots in their table names instead of one.
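
That is, within a single database the cross-container case already works
today (names invented):

CREATE SCHEMA app_a;
CREATE SCHEMA app_b;
CREATE TABLE app_a.items (id int);
CREATE TABLE app_b.items (id int);
-- one session, one transaction, two "containers":
SELECT * FROM app_a.items a JOIN app_b.items b ON a.id = b.id;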

> Under this scenario, we make the property of a database being read-only or
> read-write for the current SQL session associated with a database rather
> than the whole SQL session.  A given transaction can read from any database
> but can only make changes to the ones not read-only.
>
> Also, the proper way to do temporary tables would be to put them in another
> database than the main one, where the whole other database has the property
> of being temporary.
>
> Under this scenario, there would be separate system catalogs for each
> database, and so the ones for read-only databases are read-only, and the
> ones for other databases aren't.
>
> Then the system catalog itself fundamentally isn't more complicated, per
> database, and anything extra to handle cross-database queries or whatever,
> if anything, is a separate layer.  Code that only deals with a single
> database at once would be an optimized situation and perform no worse than
> it does now.

I think you should make more of an effort to understand how the system
works now, and why, before proposing radical redesigns.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL][HACKERS] register creation date of table

2011-10-15 Thread Robert Haas
On Fri, Oct 14, 2011 at 6:20 AM, Willy-Bas Loos  wrote:
> 1. I think that there is no such information in the system tables.  Is
> that correct?

Yes.  It's been discussed before but some people (particularly, Tom,
IIRC) are not convinced that it's useful enough to justify its
existence.

> 2. I would like to go back in time. I think that I will just look up
> the creation date for the files in the data directory and translate
> their oid's to the object names and then update their dates. This
> would of course only work from the last restore. Is that a good way to
> do it?

Well, that timestamp will get bumped on TRUNCATE, CLUSTER, VACUUM
FULL, and rewriting versions of ALTER TABLE.
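
The oid-to-name translation described in point 2 can be done from SQL
(9.0 and later):

SELECT oid::regclass AS table_name,
       pg_relation_filepath(oid) AS file_path
FROM pg_class
WHERE relkind = 'r';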

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [RRR] [HACKERS] Commitfest: The Good, The Bad, and the Ugly

2010-09-28 Thread Robert Haas
On Tue, Sep 28, 2010 at 9:11 PM, Itagaki Takahiro
 wrote:
> On Wed, Sep 29, 2010 at 6:03 AM, David Fetter  wrote:
>> The Good:
>> - Most patches still in play have a reviewer.
>
> As far as I remember, there were discussions about the issue
> "A patch has a reviewer, but in Needs Review state for several weeks "
> in 9.0 development.
>
> Do we have any plans for it? According to the commitfest app, one patch
> has only one reviewer at once. A new reviewer might avoid reviewing
> a patch that have another reviewer already.

No, the column is very clearly labelled "Reviewers", not "Reviewer".
And we have certainly had patches with more than one person's name in
that field in the past.  The issue is rather that we don't have enough
people reviewing.  We haven't had enough people volunteer to do
reviews to even assign ONE person to each patch, let alone two.  There
are, as of this writing, SEVEN patches that have no reviewer at all.

Of course, several of the committers, including you, me, and Tom, have
been working our way through the patches.  And that is great.  But the
point of the CommitFest process is that everyone is supposed to pitch
in and help out, not just the committers.  That is not happening, and
it's a problem.  This process does not work and will not scale if the
committers are responsible for doing all the work on every patch from
beginning to end.  That has never worked, and the fact that we have a
few more committers now doesn't change that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [RRR] [HACKERS] Commitfest: The Good, The Bad, and the Ugly

2010-09-28 Thread Robert Haas
On Tue, Sep 28, 2010 at 9:33 PM, Itagaki Takahiro
 wrote:
> On Wed, Sep 29, 2010 at 10:18 AM, Robert Haas  wrote:
>> No, the column is very clearly labelled "Reviewers", not "Reviewer".
>> And we have certainly had patches with more than one person's name in
>> that field in the past.  The issue is rather that we don't have enough
>> people reviewing.  We haven't had enough people volunteer to do
>> reviews to even assign ONE person to each patch, let alone two.  There
>> are, as of this writing, SEVEN patches that have no reviewer at all.
>
> Some of them might be too difficult to review. For example, replication
> or snapshot management requires special skills to review.
>
> I'm worried that new reviewers hesitate to review a patch that already
> has a reviewer, and then, if they think the remaining patches are too
> difficult for them, they will just leave the commitfest page.

That's a legitimate concern, but I am not sure how much of a problem
it is in practice.  Most people who become round-robin reviewers are
getting pulled into the process a little more than just stumbling
across the CF page by happenstance, or at least I hope they are.  Not
all patches can benefit from multiple reviewers, but CF managers can
and should encourage multiple reviews of those that can.  However, at
the moment, the problem is that regardless of who is assigned to do
what, we're not getting enough reviews done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] is not distinct from any(...)

2008-09-19 Thread Robert Haas
I'm trying to write a SQL statement to determine whether a value is in
an array, but I want the comparison to be done using IS NOT DISTINCT
FROM rather than =.

My first thought was that instead of writing:

SELECT value = ANY(array)

...I could simply write:

SELECT value IS NOT DISTINCT FROM ANY(array)

That doesn't seem to work, because IS NOT DISTINCT FROM is not an
operator.  So then I tried creating an operator === (anyelement,
anyelement) that just does IS NOT DISTINCT FROM and writing:

select 1 === any(array[1]);

which got me:

ERROR:  could not find array type for data type anyelement

Grr... any suggestions?
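
One workaround, on versions with unnest() (8.4 and later), is to spell
the test row-wise instead of with ANY:

SELECT EXISTS (
    SELECT 1
    FROM unnest(ARRAY[1, NULL]::int[]) AS u(x)
    WHERE x IS NOT DISTINCT FROM NULL    -- the sought "value" is NULL here
);
-- returns true, because the array contains a NULL element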

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] calling a function that takes a row type and returns a set of rows

2008-10-10 Thread Robert Haas
So, say I have something like this - the actual example is something a
bit more useful:

CREATE TABLE foo (a integer, b integer);
INSERT INTO foo VALUES (1, 1);   -- must have some data to generate the failure

CREATE FUNCTION bar (foo) RETURNS SETOF foo AS $$
DECLARE
f foo;
BEGIN
f.a := 1;
RETURN NEXT f;
f.a := 2;
RETURN NEXT f;
END
$$ LANGUAGE plpgsql;

I can't find any legal way of calling this function.

SELECT bar(f) FROM foo f;
ERROR:  set-valued function called in context that cannot accept a set

SELECT * FROM foo f, bar(f);
ERROR:  function expression in FROM may not refer to other relations
of same query level

Any help appreciated.

Thanks,

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] calling a function that takes a row type and returns a set of rows

2008-10-10 Thread Robert Haas
Hmm, the implicit cursor approach approach won't work for me because I
want to be able to call the function on an arbitrary slice of the rows
in the table, but the explicit cursor approach looks like it might
work.  I'll give that a try, thanks.

...Robert

On Fri, Oct 10, 2008 at 4:01 PM, Pavel Stehule <[EMAIL PROTECTED]> wrote:
> Hello
>
> PostgreSQL doesn't support pipe functions, so you cannot do what you
> want.  But you could use SQL SETOF functions, which can be called in a
> normal context.  I dislike this feature, but it could be useful for you.
>
> try:
>
> create or replace function bar1(foo)
> returns setof foo as $$
>  select 1, $1.b
>  union all
>  select 2, $1.b;
> $$ language sql;
>
> postgres=# select (bar1(foo)).* from foo;
>  a | b
> ---+---
>  1 | 1
>  2 | 1
> (2 rows)
>
> I think a much better and cleaner version uses an explicit or
> implicit cursor in the function
>
> -- implicit cursor
> create or replace function bar() returns setof foo as $$
> declare r record;
> begin
>  for r in select * from foo loop
>r.a := 1;
>return next r;
>r.a := 2;
>return next r;
>  end loop;
>  return;
> end;
> $$ language plpgsql;
>
> postgres=# select * from bar();
>  a | b
> ---+---
>  1 | 1
>  2 | 1
> (2 rows)
>
> -- using explicit cursor (it's a more complicated variant, and I think
> it's better not to use it)
> create or replace function bar(c refcursor) returns setof foo as $$
> declare r record;
> begin
>  loop
>fetch c into r;
>exit when not found;
>r.a := 1;
>return next r;
>r.a := 2;
>return next r;
>  end loop;
>  return;
> end;
> $$ language plpgsql;
>
> begin;
> declare x cursor for select * from foo;
> select * from bar('x'::refcursor);
> commit;
>
> postgres=# declare x cursor for select * from foo;
> DECLARE CURSOR
> postgres=# select * from bar('x'::refcursor);
>  a | b
> ---+---
>  1 | 1
>  2 | 1
> (2 rows)
>
> postgres=# commit;
> COMMIT
>
> Regards
> Pavel Stehule
>
>
> 2008/10/10 Robert Haas <[EMAIL PROTECTED]>:
>> So, say I have something like this - the actual example is something a
>> bit more useful:
>>
>> CREATE TABLE foo (a integer, b integer);
>> INSERT INTO foo VALUES (1, 1);   -- must have some data to generate the 
>> failure
>>
>> CREATE FUNCTION bar (foo) RETURNS SETOF foo AS $$
>> DECLARE
>>f foo;
>> BEGIN
>>f.a := 1;
>>RETURN NEXT f;
>>f.a := 2;
>>RETURN NEXT f;
>> END
>> $$ LANGUAGE plpgsql;
>>
>> I can't find any legal way of calling this function.
>>
>> SELECT bar(f) FROM foo f;
>> ERROR:  set-valued function called in context that cannot accept a set
>>
>> SELECT * FROM foo f, bar(f);
>> ERROR:  function expression in FROM may not refer to other relations
>> of same query level
>>
>> Any help appreciated.
>>
>> Thanks,
>>
>> ...Robert
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] calling a function that takes a row type and returns a set of rows

2008-10-10 Thread Robert Haas
> You need LATERAL support for this:
>  SELECT * FROM foo f LATERAL bar(f);
>
> I'm not sure about the syntax, but LATERAL is a standard JOIN type wherein
> upper "nodes" are visible.

That would be really nice.  Then you could presumably also do:

SELECT f.id, f.name, f.apple, f.banana, bar.apple AS bar_apple,
bar.banana AS bar_banana FROM foo f LATERAL bar(f);

...which I frequently wish to do, and can't.
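
PostgreSQL 9.3 eventually added exactly this; modulo a comma, the
wished-for query became legal:

SELECT f.*, b.*
FROM foo f, LATERAL bar(f) AS b;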

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [HACKERS] Hot Standby utility and administrator functions

2008-10-20 Thread Robert Haas
> * pg_last_recovered_xact_xid()
> Will throw an ERROR if *not* executed in recovery mode.
> returns bigint
>
> * pg_last_completed_xact_xid()
> Will throw an ERROR *if* executed in recovery mode.
> returns bigint

Should these return xid?

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] why hash on the primary key?

2008-11-28 Thread Robert Haas
I'm seeing a lot of plans in my database that look like this:

portal=# explain select * from foo i, foo j where i.id = j.id;
   QUERY PLAN
-
 Hash Join  (cost=769.87..2159.36 rows=13283 width=264)
   Hash Cond: (i.id = j.id)
   ->  Seq Scan on foo i  (cost=0.00..343.83 rows=13283 width=132)
   ->  Hash  (cost=343.83..343.83 rows=13283 width=132)
 ->  Seq Scan on foo j  (cost=0.00..343.83 rows=13283 width=132)

It seems very strange for the planner to decide to build an in-memory
hash table on a column that is already indexed (the primary key, no
less!).  But this is happening A LOT - I often see plans where a
majority of the joins are executed this way (and they're not all
self-joins either...).  It seems like the planner is concluding that
it's going to need most or all of the pages in the table anyway, and
that building a hash table as it goes is quicker than reading the
index pages in from disk.  On a simple query like the above, setting
enable_seqscan to off or random_page_cost to 1 generates the expected
plan:

 QUERY PLAN
-
 Merge Join  (cost=0.00..2534.24 rows=13283 width=264)
   Merge Cond: (i.id = j.id)
   ->  Index Scan using foo_pkey on foo i  (cost=0.00..1167.50
rows=13283 width=132)
   ->  Index Scan using foo_pkey on foo j  (cost=0.00..1167.50
rows=13283 width=132)
(4 rows)

Experimentation shows this is actually about 25% faster.  But, this is
kind of a blunt instrument, and my attempts to fiddle with various
parameters have not been real succesful in generating better plans for
more complicated examples.

Any suggestions/explanations?

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] why hash on the primary key?

2008-11-29 Thread Robert Haas
> Could you send the output of these two queries using "explain analyze"
> instead of plain explain?

portal=# explain analyze select * from foo i, foo j where i.id = j.id;
   QUERY PLAN

 Hash Join  (cost=747.87..2127.36 rows=13283 width=272) (actual
time=68.434..205.536 rows=13283 loops=1)
   Hash Cond: (i.id = j.id)
   ->  Seq Scan on foo i  (cost=0.00..315.83 rows=13283 width=136)
(actual time=0.024..23.349 rows=13283 loops=1)
   ->  Hash  (cost=315.83..315.83 rows=13283 width=136) (actual
time=68.353..68.353 rows=13283 loops=1)
 ->  Seq Scan on foo j  (cost=0.00..315.83 rows=13283
width=136) (actual time=0.009..23.839 rows=13283 loops=1)
 Total runtime: 223.390 ms
(6 rows)
portal=# set enable_seqscan to false;
SET
portal=# explain analyze select * from foo i, foo j where i.id = j.id;
 QUERY PLAN

 Merge Join  (cost=0.00..2310.24 rows=13283 width=272) (actual
time=0.272..149.317 rows=13283 loops=1)
   Merge Cond: (i.id = j.id)
   ->  Index Scan using foo_pkey on foo i  (cost=0.00..1055.50
rows=13283 width=136) (actual time=0.205..29.714 rows=13283 loops=1)
   ->  Index Scan using foo_pkey on foo j  (cost=0.00..1055.50
rows=13283 width=136) (actual time=0.025..37.534 rows=13283 loops=1)
 Total runtime: 166.364 ms
(5 rows)

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] why hash on the primary key?

2008-11-29 Thread Robert Haas
> What's strange about it?  A probe into an in-memory hashtable is a lot
> cheaper than a probe into an index, so this type of plan makes plenty
> of sense if the hashtable will fit in RAM and there are going to be a
> lot of probes.  (Where "a lot" means "enough to amortize the cost of
> building the hashtable", of course.)

Hmm...  it didn't occur to me that the index probe itself might be
more expensive than a hash probe.  Is that due to concurrency control,
or are you talking about the need to possibly read index pages in from
disk?

>> Experimentation shows this is actually about 25% faster.
>
> Well, that just says your cost parameters need a bit of adjustment
> if you'd like the planner to get the crossover point exactly right.

Any sense of which ones might be worth fiddling with?

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] why hash on the primary key?

2008-11-29 Thread Robert Haas
>>> Well, that just says your cost parameters need a bit of adjustment
>>> if you'd like the planner to get the crossover point exactly right.
>
>> Any sense of which ones might be worth fiddling with?
>
> random_page_cost, effective_cache_size, maybe the cpu_xxx parameters.

I fiddled with this some more on a more complex query with more than
20 joins.  It appears that the culprit is from_collapse_limit.  If I
raise from_collapse_limit, most of the hash joins turn into Nested
Loops or Nested Loop Left Joins with Index Scans.  The estimated cost
is about 20% of the cost of the original plan, but the planning takes
so much longer that the actual time is 60% larger.  This is really the
pits.
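
For the record, the knobs involved and their defaults (session-level
sketch):

SET from_collapse_limit = 20;  -- default 8; collapses subquery join lists
SET join_collapse_limit = 20;  -- default 8; limits explicit JOIN reordering
SET random_page_cost = 2.0;    -- default 4.0; shifts the index/seqscan crossover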

I had hoped that the work Simon Riggs was doing on join removal would
hit 8.4, but that doesn't seem likely at this point.  The problem is
that the application is a web app where the user can select which
columns they want to see.  Most of the time only ~6-8 columns of 40+
that are available are selected, and the joins needed to generate the
unselected columns can be tossed (which would hopefully both avoid
unnecessary join steps and also speed up planning the remaining
joins).

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Updating column on row update

2009-11-22 Thread Robert Haas
On Sun, Nov 22, 2009 at 6:51 PM, Tom Lane  wrote:
> Craig Ringer  writes:
>> I do think this comes up often enough that a built-in trigger "update
>> named column with result of expression on insert" trigger might be
>> desirable.
>
> There's something of the sort in contrib already, I believe, though
> it's so old it still uses abstime :-(
>
>> So might "CREATE LANGUAGE ... IF NOT EXISTS". Maybe even "CREATE ROLE
>> ... IF NOT EXISTS" and "CREATE USER ... IF NOT EXISTS" - I know I'd find
>> them really handy.
>
> CREATE IF NOT EXISTS has been proposed and rejected before, more than
> once.  Please see the archives.

Search for CINE to find the discussions.  This is a good place to start:

http://archives.postgresql.org/pgsql-hackers/2009-05/msg00252.php

Despite Tom's assertions to the contrary, I am unable to find a clear
consensus against this feature in the archives, and still think it
would be useful.  MySQL and SQLite both support it, at least in the
specific case of CREATE TABLE IF NOT EXISTS.  But I've exhausted my
quota of beating my head against a brick wall on this issue.

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Updating column on row update

2009-11-23 Thread Robert Haas
On Sun, Nov 22, 2009 at 11:38 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Sun, Nov 22, 2009 at 6:51 PM, Tom Lane  wrote:
>>> CREATE IF NOT EXISTS has been proposed and rejected before, more than
>>> once.  Please see the archives.
>
>> Search for CINE to find the discussions.  This is a good place to start:
>> http://archives.postgresql.org/pgsql-hackers/2009-05/msg00252.php
>
>> Despite Tom's assertions to the contrary, I am unable to find a clear
>> consensus against this feature in the archives,
>
> I think you didn't look back far enough --- that issue was settled years
> ago.  IIRC the killer argument is that after CINE you do not know the
> state of the object: it exists, yes, but what properties has it got?
> If it already existed then it's still got its old definition, which
> might or might not be what you're expecting.
>
> CREATE OR REPLACE has got far safer semantics from the viewpoint of a
> script that wants to bull through without having any actual error
> handling (which is more or less the scenario we are arguing here, no?)
> After successful execution of the command you know exactly what
> properties the object has got.

Sure.  I think that CINE only makes sense for objects for which COR
can't be implemented - things that have internal substructure, like
tables or tablespaces.  I agree that there are pitfalls for the unwary
but I still think it's better than nothing.  I understand that you
disagree.

> Whether it would be sensible to have CREATE OR REPLACE semantics for a
> language is something I'm not very clear on.  It seems like changing any
> of the properties of a pg_language entry could be rather fatal from the
> viewpoint of an existing function using the language.
>
> [ thinks for awhile... ]  Actually, CREATE LANGUAGE is unique among
> creation commands in that the common cases have no parameters, at least
> not since we added pg_pltemplate.  So you could imagine defining CINE
> for a language as disallowing any parameters and having these semantics:
>        * language not present -> create from template
>        * language present, matches template -> OK, do nothing
>        * language present, does not match template -> report error
> This would meet the objection of not being sure what the state is
> after successful execution of the command.  It doesn't scale to any
> other object type, but is it worth doing for this one type?

CREATE OR REPLACE seems like a better fit in this case.  For example,
it seems plausible that someone might want to add an inline handler to
a procedural language that didn't have one without dropping and
recreating the language.  Even changing the call handler seems like it
could be potentially useful in an upgrade scenario.
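
That is roughly where things ended up: plpgsql has been installed by
default since 9.0, and 9.1 accepts the OR REPLACE form:

CREATE OR REPLACE LANGUAGE plpgsql;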

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Updating column on row update

2009-11-24 Thread Robert Haas
On Tue, Nov 24, 2009 at 12:28 PM, Tom Lane  wrote:
> "Kevin Grittner"  writes:
>> So we're conceding that this is a valid need and people will now have
>> a way to meet it.  Is the argument against having CINE syntax that it
>> would be more prone to error than the above, or that the code would be
>> so large and complex as to create a maintenance burden?
>
> The argument against CINE is that it's unsafe.  The fragment proposed
> by Andrew is no safer, of course, but it could be made safe by adding
> additional checks that the properties of the existing object are what
> the script expects.  So in principle that's an acceptable approach,
> whereas CINE will never be safe.

Well, there can be methods extrinsic to the system for controlling
this sort of thing.  For example, I can provide a script, using CINE,
that will either install version 2 of my app into some database or
that will upgrade an existing version 1 installation to version 2.
It's true that if someone has taken the version-1 schema and made
manual modifications to it, then things might blow up.  But, I can
tell people that they shouldn't do that, or the upgrade script might
break.  If they do and it does then they get to keep both pieces.
Even if I do the whole thing in PL/pgsql, I'm still not going to check
for every stupid thing someone might have done to break the schema...
I think the cat is already out of the bag on this one, and it's just a
matter of whether we're willing to provide some convenient syntax or
leave people to hand-code it.
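
For this particular object type the convenient syntax did eventually
ship, in 9.1:

CREATE TABLE IF NOT EXISTS t (id int);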

> But actually I thought we had more or less concluded that CREATE OR
> REPLACE LANGUAGE would be acceptable (perhaps only if it's given
> without any extra args?).

I'm not sure there's any value in that restriction - seems more
confusing than helpful.

> Or for that matter there seems to be enough
> opinion on the side of just installing plpgsql by default.  CINE is
> a markedly inferior alternative to either of those.

For languages, yes.

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] Updating column on row update

2009-11-24 Thread Robert Haas
On Tue, Nov 24, 2009 at 2:07 PM, Kevin Grittner
 wrote:
> Tom Lane  wrote:
>
>> If it did so, that would be outside the apparent meaning of the
>> command, which is to do nothing if an object of that name exists.
>> That's why we've gone with CREATE OR REPLACE instead.
>
> I think that "fail on existence of an object conflicting with given
> definition" is behavior which could be documented and rates fairly
> low on my astonishment scale.  (I can't speak for anyone else.)

I think CINE should create the object if it does not exist and
otherwise do nothing.  It might be useful to have some kind of
consistency-checking behavior, but it would probably be more useful if
decoupled from CINE, and in any case, that's not what "CREATE IF NOT
EXISTS" means to me.

> I am skeptical that, in the absence of built-in support for checking
> the existing object against the supplied definition, people would
> generally go any further than Andrew's example.  When they did, I'm
> skeptical about how often they would get the details exactly right.

Bingo.

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [HACKERS] Fwd: psql+krb5

2009-12-01 Thread Robert Haas
2009/11/30 rahimeh khodadadi :
>
>
> -- Forwarded message --
> From: rahimeh khodadadi 
> Date: 2009/11/29
> Subject: Re: psql+krb5
> To: Denis Feklushkin 

Please review the guidelines for reporting a problem, which you can find here:

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

It seems to me that you've done the exact opposite of nearly
everything suggested there, starting with cross-posting your email to
four mailing lists at least three of which are irrelevant to the
problem that you're attempting to solve.

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [HACKERS] Fwd: psql+krb5

2009-12-01 Thread Robert Haas
On Tue, Dec 1, 2009 at 11:26 AM, Scott Marlowe  wrote:
> Except that he posted a month ago and got no answers...

Gee, I wonder why.

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [HACKERS] Sugerencia de opcion

2010-01-24 Thread Robert Haas
2009/1/22 Informatica-Cooperativa Cnel. Oviedo :
> Good morning everyone,
>
> I am a postgres user from Paraguay; I am asking about the possibility
> of including the following statement in a future version (use of an
> alias in the HAVING condition):
>
>
>     SELECT id, sum(salario) as SumaSalario
>     FROM salarios
>     GROUP BY id
>     HAVING SumaSalario>500;

I've wished for that syntax once or twice myself, but I'm assuming
there's a reason we haven't implemented it?  Part of the problem is
it's inherently ambiguous if salarios happens to contain a column
called sumasalario, which is a problem that seems to arise for me
fairly regularly in practice.  Still, it would be nice for WHERE/GROUP
BY/HAVING clauses to have an explicit way to reference "the target
list column called foo".
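
Meanwhile, the unambiguous spellings are to repeat the aggregate or to
push it down a query level:

SELECT id, sum(salario) AS sumasalario
FROM salarios
GROUP BY id
HAVING sum(salario) > 500;

SELECT *
FROM (SELECT id, sum(salario) AS sumasalario
      FROM salarios
      GROUP BY id) s
WHERE sumasalario > 500;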

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!

2012-01-31 Thread Robert Haas
On Mon, Jan 30, 2012 at 11:18 PM, Tom Lane  wrote:
> I don't recall that we thought very hard about what should happen when
> pg_dump switches are used to produce a selective dump, but ISTM
> reasonable that if it's "user data" then it should be dumped only if
> data in a regular user table would be.

Yep.

> What's not apparent to me is whether there's an argument for doing more
> than that.  It strikes me that the current design is not very friendly
> towards the idea of an extension that creates a table that's meant
> solely to hold user data --- you'd have to mark it as "config" which
> seems a bit unfortunate terminology for that case.  Is it important to
> do something about that, and if so what?

Is this anything more than a naming problem?
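
For context, the facility in question: an extension script calls
pg_extension_config_dump() to mark a table so that pg_dump includes its
rows (table name hypothetical; the empty string means "dump all rows"):

SELECT pg_extension_config_dump('my_config_table', '');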

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [BUGS] [GENERAL] Altering a table with a rowtype column

2012-03-20 Thread Robert Haas
On Wed, Mar 7, 2012 at 3:49 PM, Merlin Moncure  wrote:
> On Wed, Mar 7, 2012 at 2:31 PM, Tom Lane  wrote:
>> Merlin Moncure  writes:
>>> On Wed, Mar 7, 2012 at 11:45 AM, Mike Blackwell  
>>> wrote:
>>>> alter table a add column even_more_stuff boolean not null default false;
>>
>>> aha! that's not what you posted last time.  you appended 'not null
>>> default false', which inexplicably breaks the ALTER.
>>
>>> try this:
>>> ALTER TABLE a ADD COLUMN even_more_stuff text not null;
>>> ALTER TABLE a ALTER even_more_stuff set default false;
>>> ALTER TABLE a DROP COLUMN even_more_stuff;
>>> ALTER TABLE a ADD COLUMN even_more_stuff boolean not null default false;
>>
>>> (this really looks like a bug in postgres, cc-ing to bugs)
>>
>> It is not a bug.  The ALTER ADD ... DEFAULT ... form implies rewriting
>> every existing tuple of the rowtype to insert a non-null value in the
>> added column, and we don't have support for doing that to rowtype
>> columns, only to the target table and descendants.
>
> I'm not buying that... it implies no such thing.  In particular, for
> table-as-rowtype columns, there's no way that I can see to have
> default values be generated.  So why does it follow that the dependent
> table has to be rewritten?  Column constraints are not enforced on the
> rowtype, so it follows that defaults shouldn't be either, considering
> there's no way to get the default to fire.  Composite type (or table
> based composite) defaults are applied to the composite as a whole, not
> to specific fields.

I think Tom's correct about what the right behavior would be if
composite types supported defaults, but they don't, never have, and
maybe never will.  I had a previous argument about this with Tom, and
lost, though I am not sure that anyone other than Tom thinks that the
current behavior is for the best.  But see commits
a06e41deebdf74b8b5109329dc75b2e9d9057962 and
a40b1e0bf32b1da46c1baa9bc7da87f207cd37d8.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [BUGS] [GENERAL] Altering a table with a rowtype column

2012-03-21 Thread Robert Haas
On Tue, Mar 20, 2012 at 2:48 PM, Tom Lane  wrote:
>> I think Tom's correct about what the right behavior would be if
>> composite types supported defaults, but they don't, never have, and
>> maybe never will.  I had a previous argument about this with Tom, and
>> lost, though I am not sure that anyone other than Tom thinks that the
>> current behavior is for the best.
>
> Um, did I say I thought it was for the best?  I thought I said we don't
> have support for doing better.
>
> If we are willing to legislate that column defaults are not and never
> will be applied to composite types, then I think Merlin might be right
> that we could just let an ALTER ADD with DEFAULT ignore the existence of
> composite columns.

I tend to think that's exactly what we should do, and it's what that
patch did, although as you point out my commit message was the product
of confused thinking.
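
A minimal sketch of the case in question (hypothetical names; whether
the ALTER is accepted depends on which of the commits above a given
server has):

CREATE TABLE a (stuff integer);
CREATE TABLE b (a_col a);  -- a_col is of a's rowtype

-- Rewriting the rowtype values stored in b.a_col is unsupported, so
-- historically this ALTER failed outright; with the patch it ignores
-- the composite column instead:
ALTER TABLE a ADD COLUMN even_more_stuff boolean NOT NULL DEFAULT false;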

> I'd always figured that we'd want to try to fix that
> omission eventually, though.

It's mildly tempting, but as Merlin points out, it's hard to know
exactly when you'd apply those rules.  We talked a while back about
domains with NOT NULL constraints; if someone does a left join with a
domain-typed column on the outer side, what are you going to put there
if you don't put NULL?  This case seems somewhat similar.  Defaults
make sense when applied to table columns, because the semantics are
clear: columns not explicitly mentioned get their default value if
any, else NULL.  But if we rule that a composite type with no default
gets the composite type's default values for each column, then we're
overriding the general SQL presumption that unspecified columns are
NULL.  And similarly for temporaries created by uninitialized variables or,
worse, LEFT JOINs.  In languages like C++ or even Perl, there's always
a very clear notion of when an object gets created, and constructors
and so on run at that time.  Defaults logically should run at the same
time that a constructor would, but that concept doesn't really exist
in SQL, which is seemingly deliberately quite murky about when values
spring into existence.

Does the SQL standard say anything on this topic?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-01 Thread Robert Haas
On Tue, Mar 31, 2009 at 10:44 AM, Greg Stark  wrote:
> On Tue, Mar 31, 2009 at 3:42 PM, Sam Mason  wrote:
>>
>>  string_to_array('',',')::INT[]  => invalid input syntax for integer: ""
>
> Oof. That's a good point.

+1.  I find this argument much more compelling than anything else
that's been offered up so far.

...Robert



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-01 Thread Robert Haas
On Wed, Apr 1, 2009 at 12:52 PM, David E. Wheeler  wrote:
> Well, I'd just point out that the return value of string_to_array() is
> text[]. Thus, this is not a problem with string_to_array(), but a casting
> problem from text[] to int[]. Making string_to_array() return a NULL for
> this case to make casting simpler is addressing the problem in the wrong
> place, IMHO. If I want to do this in Perl, for example, I'd do something
> like this:
>
> my @ints = grep { defined $_ && $_ ne '' } split ',', $string;

I've written code that looks a whole lot like this myself, but there's
no easy way to do that in SQL.  SQL, in particular, lacks closures, so
grep {} and map {} don't exist.  I really, really wish they did, but I
believe that our type system is too woefully pathetic to be up to the
job.  So it seems to me that arguing that SQL (which lacks those
primitives) should match Perl (which has them) isn't really getting us
anywhere.

> my @ints = map { $_ || 0 } split ',', $string;
>
> This ensures that I get the proper number of records in the example of 
> something like '1,2,,4'.

I can't see that there's any way to do this in SQL regardless of how
we define this operation.

...Robert



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-01 Thread Robert Haas
On Wed, Apr 1, 2009 at 1:05 PM, justin  wrote:
> I'm still a holdout.  We are taking a string and putting it into an array
> based on a delimiter.  That is very simple and straightforward.  Yet many
> argue that if we want to cast this into another data type, the function
> should deal with it in limited cases.
>
> string_to_array('',',')::INT[]  works as proposed
>
> But
> string_to_array(',,,', ',' )::INT[]   Fails
> or
> string_to_array('1,2,,4', ',' )::INT[] Fails .

But... but... those aren't comma-separated lists of integers.  If they
were, it would work.

string_to_array('cow,dog,horse', ',')::INT[] will also fail.

If you take 0 items of any type whatsoever and join them together with
commas, you will get the empty string.  It is also true that if you
join 1 item together with commas, you will get that item back, and if
that item is the empty string, you will now have the empty string.  I
think it's better to worry more about the first case because it
applies to any type at all, whereas the latter case ONLY applies in
situations where the empty string is a potentially legal value.
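
The round trip is easy to see with array_to_string; a quick sketch:

SELECT array_to_string(ARRAY[]::text[], ',');    -- yields ''
SELECT array_to_string(ARRAY['']::text[], ',');  -- also yields ''

so string_to_array('', ',') cannot tell the two cases apart.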

...Robert



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-01 Thread Robert Haas
On Wed, Apr 1, 2009 at 5:22 PM, Tom Lane  wrote:
> Or we could stick to the current behavior and say "use COALESCE() to
> resolve the ambiguity, if you need to".

If there's no consensus on changing the behavior, it's probably better
to be backward compatible than not.

...Robert



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-02 Thread Robert Haas
On Thu, Apr 2, 2009 at 12:17 PM, David E. Wheeler  wrote:
> On Apr 1, 2009, at 2:22 PM, Tom Lane wrote:
>
>> Another way to state the point is that we can offer people a choice of
>> two limitations: string_to_array doesn't work for zero-length lists,
>> or string_to_array doesn't work for empty strings (except most of the
>> time, it does).  The former is sounding less likely to bite people
>> unexpectedly.
>
> Right, very well put.
>
>> Or we could stick to the current behavior and say "use COALESCE() to
>> resolve the ambiguity, if you need to".
>
> Steve has a point that leaving it as-is makes it impossible to tell the
> difference between string_to_array(NULL, ',') and string_to_array('', ',').
> The former properly handles an unknown value, while the latter, where '' is
> a known value, seems weird to be returning NULL.

*shrug* CASE WHEN blah IS NOT NULL THEN string_to_array(blah, ',') END

More and more I'm leaning toward leaving this alone.  No matter how
you define it, the behavior can be changed to whichever alternative
you prefer with a 1-line case statement.
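
For instance, assuming the historical behavior where
string_to_array('', ',') yields NULL, either camp gets what it wants
in one line (literals stand in for a real column):

-- '' should mean a zero-element array:
SELECT COALESCE(string_to_array('', ','), ARRAY[]::text[]);  -- {}

-- '' should mean an array of one empty string:
SELECT COALESCE(string_to_array('', ','), ARRAY['']);        -- {""}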

...Robert



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-02 Thread Robert Haas
On Thu, Apr 2, 2009 at 12:10 PM, David E. Wheeler  wrote:
> On Apr 1, 2009, at 12:19 PM, Robert Haas wrote:
>
>>> my @ints = map { $_ || 0 } split ',', $string;
>>>
>>> This ensures that I get the proper number of records in the example of
>>> something like '1,2,,4'.
>>
>> I can't see that there's any way to do this in SQL regardless of how
>> we define this operation.
>
> It's easy enough to write a function to do it:
>
> CREATE OR REPLACE FUNCTION trim_blanks (anyarray) RETURNS anyarray AS $$
>    SELECT ARRAY(
>        SELECT CASE WHEN $1[i] IS NULL OR $1[i] = '' THEN '0' ELSE $1[i] END
>          FROM generate_series(1, array_upper($1, 1)) s(i)
>         ORDER BY i
>    );
> $$ LANGUAGE SQL IMMUTABLE;

Ah!  Thanks for the tip.
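
For the archives, a usage sketch (this assumes a text[] input, since
the function compares elements against ''):

SELECT trim_blanks(string_to_array('1,2,,4', ','))::int[];
-- returns {1,2,0,4}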

...Robert



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-02 Thread Robert Haas
On Thu, Apr 2, 2009 at 2:04 PM, Tom Lane  wrote:
> Right at the moment, if we stick with the historical definition
> of the function, *both* camps have to write out their choice of
> the above.  Seems like this is the worst of all possible worlds.
> We should probably pick one or the other.

ISTM there are three camps.

...Robert



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-02 Thread Robert Haas
On Thu, Apr 2, 2009 at 2:18 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Apr 2, 2009 at 2:04 PM, Tom Lane  wrote:
>>> Right at the moment, if we stick with the historical definition
>>> of the function, *both* camps have to write out their choice of
>>> the above.  Seems like this is the worst of all possible worlds.
>>> We should probably pick one or the other.
>
>> ISTM there are three camps.
>
> If there's a camp that actually *wants* a NULL result for this case,
> I missed the reasoning.  AFAICS we can either say that every application
> is going to have to put in a CASE wrapper around this function, or say
> that we'll make it do the right thing for some of them and the rest have
> to put the same wrapper around it.

So that we don't break existing apps because of an issue that is
trivial to work around.

...Robert



Re: [HACKERS] [GENERAL] string_to_array with empty input

2009-04-02 Thread Robert Haas
On Thu, Apr 2, 2009 at 2:50 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Apr 2, 2009 at 2:18 PM, Tom Lane  wrote:
>>> If there's a camp that actually *wants* a NULL result for this case,
>>> I missed the reasoning.
>
>> So that we don't break existing apps because of an issue that is
>> trivial to work around.
>
> We would only be breaking them if a NULL result were actually the
> correct behavior for the application's requirements, which seems
> a bit unlikely.

But that's completely untrue.  If the most useful behavior is either
ARRAY[''] or ARRAY[], then there are presumably lots and lots of
people out there who have apps that do COALESCE(string_to_array(...),
something).  Whichever way you change string_to_array() will break all
of the people doing this who wanted the opposite behavior for no good
reason.

...Robert



[GENERAL] complex referential integrity constraints

2007-02-18 Thread Robert Haas
So, I have the following problem.

Suppose you have two kinds of animals, sheep and wolves.  Since they
have very similar properties, you create a single table to hold both
kinds of animals, and an animal_type table to specify the type of each
animal:

CREATE TABLE animal_type (
    id      integer not null,
    name    varchar(80) not null,
    primary key (id)
);
INSERT INTO animal_type VALUES (1, 'Sheep');
INSERT INTO animal_type VALUES (2, 'Wolf');

CREATE TABLE animal (
    id                serial,
    type_id           integer not null references animal_type (id),
    name              varchar(80) not null,
    age               integer not null,
    weight_in_pounds  integer not null,
    primary key (id)
);

The animal_type table is more or less written in stone, but the animal
table will be updated frequently.  Now, let's suppose that we want to
keep track of all of the cases where one animal is mauled by another
animal:

CREATE TABLE mauling (
    id          serial,
    attacker_id integer not null references animal (id),
    victim_id   integer not null references animal (id),
    attack_time timestamp not null,
    primary key (id)
);

The problem with this is that I have a very unsettled feeling about the
foreign key constraints on this table.  The victim_id constraint is
fine, but the attacker_id constraint is really inadequate, because the
attacker CAN NEVER BE A SHEEP.  I really want a way to write a
constraint that says that the attacker must be an animal, but
specifically, a wolf.

It would be really nice to be able to write:

FOREIGN KEY (attacker_id, 2) REFERENCES animal (id, type_id)

Or:

CREATE UNIQUE INDEX wolves ON animal (id) WHERE type_id = 2;
-- and then
FOREIGN KEY (attacker_id) REFERENCES INDEX wolves

...but that's entirely speculative syntax.  I don't think there's any
easy way to do this.  (Please tell me I'm wrong.)

The problem really comes in when people start modifying the animal
table.  Every once in a while we have a case where we record something
as a wolf, but it turns out to have been a sheep in wolf's clothing.  In
this case, we want to do something like this:

UPDATE animal SET type_id = 1 WHERE id = 572;

HOWEVER, this operation MUST NOT be allowed if it turns out there is a
row in the mauling table where attacker_id = 572, because that would
violate my integrity constraints that says that sheep do not maul.

Any suggestions?  I've thought about creating rules or triggers to check
the conditions, but I'm scared that this could either (a) get really
complicated when there are a lot more tables and constraints involved or
(b) introduce race conditions.

Thanks,

...Robert



Re: [GENERAL] complex referential integrity constraints

2007-02-22 Thread Robert Haas
The ability to make a foreign key reference a specific partial unique
index (rather than just a set of columns that have a unique index) would
solve many problems of this type.  As another example, you might have a
table where one of the columns is "is_deleted boolean not null".  By
creating a partial unique index on the primary key of that table "WHERE
NOT is_deleted" and then pointing a foreign key at it, you could enforce
that each row in the child table references a parent who isn't deleted.
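
The index half of that exists today; only the foreign-key half is
speculative.  A sketch, with hypothetical names:

CREATE TABLE parent (
    id         serial PRIMARY KEY,
    is_deleted boolean NOT NULL DEFAULT false
);

-- This much is already supported:
CREATE UNIQUE INDEX parent_live ON parent (id) WHERE NOT is_deleted;

-- This part is the wished-for syntax, not something that works:
-- FOREIGN KEY (parent_id) REFERENCES INDEX parent_live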

However, this would break down when there's more than one intermediate
step involved.  For example, if you have:

CREATE TABLE animal_type (
    id          serial,
    name        varchar(60) not null,
    is_attacker boolean not null,
    primary key (id)
);

CREATE TABLE animal (
    id      serial,
    type_id integer not null references animal_type (id),
    name    varchar(60) not null,
    primary key (id)
);

CREATE TABLE mauling (
    id          serial,
    attacker_id integer not null references animal (id),
    victim_id   integer not null references animal (id),
    attack_time timestamp with time zone not null,
    primary key (id)
);

It would be easy to enforce the constraint that the attacker must be an
animal of some specific type, but difficult to enforce the constraint
that the attacker must be an animal whose type, in turn, has a true
value for is_attacker.

The best idea that I can think of right now to handle multiple levels of
tables is to allow FOREIGN KEY constraints to references a VIEW, rather
than a table.  Then you could say:

CREATE VIEW attackers AS
SELECT a.id FROM animal a, animal_type t WHERE a.type_id = t.id AND
t.attacker;

...and then FOREIGN KEY (attacker_id) REFERENCES attackers (id).

This syntax would solve a number of other problems as well, such as
requiring that some record in table A has a parent either in table P or
in table Q.  However, I think this would probably require implementing
some kind of materialized view so that you could actually build an index
on the view, and that opens up a whole new can of worms, because it's
not very difficult to define a view that is costly to update
incrementally.

The problem is really that there is a pretty large gap between writing a
foreign key constraint, which is trivial, and enforcing a constraint
using triggers, which is quite a bit more complex (and therefore, easy
to screw up), because the foreign key automatically handles all the
cases (insert into child table, update of child table, update of parent
table, delete from parent table) whereas with triggers you have to
address each of those cases individually.  Unfortunately, something
tells me that implementing a more powerful system for foreign key
constraints is a non-trivial project, however useful it would be.
Still, I'd love to see it in the TODO file, too.

...Robert

-Original Message-
From: Joris Dobbelsteen [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 8:03 AM
To: Robert Haas; elein
Cc: pgsql-general@postgresql.org
Subject: RE: [GENERAL] complex referential integrity constraints

I partially agree:
If people CAN do stupid things, they are 'clever' enough to find a way
to actually do it. I've seen them destroy things just by using a system
in a way it was not intended. They effectively found a way to blow away
the very thing that part was designed for.
But indeed, it's a lot of work, especially if the number of tables that
must be referenced increases. I'm a strong supporter of ensuring
consistency. Postgres has what it takes to do the job, but it doesn't
make my life a lot easier. But it seems to be as good as it gets
today...

Perhaps we should rather define a 'database' constraint along the lines of:
"For every mauling, the attacking animal must be of an attacker type"
(in a computer-understandable manner). From a set-theory standpoint this
should be possible without too many problems; however, doing so
efficiently might be slightly harder.
This might be a fun project and useful for the TODO list. At least it
would make it a lot easier (and more maintainable) to enforce
database-wide constraints.

- Joris

>-Original Message-
>From: Robert Haas [mailto:[EMAIL PROTECTED] 
>Sent: woensdag 21 februari 2007 3:37
>To: Joris Dobbelsteen; elein
>Cc: pgsql-general@postgresql.org
>Subject: RE: [GENERAL] complex referential integrity constraints
>
>Yes, exactly.  And while you might not care about all of those 
>(e.g. I care about the first two but am not worried about the 
>third one because I'm the only one who will ever update that 
>table), writing multiple triggers to enforce each constraint 
>of this type quickly gets old if there are even a few of them.

Re: [GENERAL] complex referential integrity constraints

2007-02-22 Thread Robert Haas
The idea here is that a wolf can attack a sheep, or a wolf can attack
another wolf, but sheep can't attack anything.  I suppose I could list
each wolf in both the predator and prey tables, but that seems a bit
duplicative (and causes other problems).

...Robert

-Original Message-
From: David Fetter [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 19, 2007 1:04 PM
To: Robert Haas
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] complex referential integrity constraints

On Fri, Feb 16, 2007 at 09:58:56AM -0500, Robert Haas wrote:
> So, I have the following problem.
> 
> Suppose you have two kinds of animals, sheep and wolves.  Since they
> have very similar properties, you create a single table to hold both
> kinds of animals, and an animal_type table to specify the type of each
> animal:
> 
> CREATE TABLE animal_type (
>     id      integer not null,
>     name    varchar(80) not null,
>     primary key (id)
> );
> INSERT INTO animal_type VALUES (1, 'Sheep');
> INSERT INTO animal_type VALUES (2, 'Wolf');
> 
> CREATE TABLE animal (
>     id                serial,
>     type_id           integer not null references animal_type (id),
>     name              varchar(80) not null,
>     age               integer not null,
>     weight_in_pounds  integer not null,
>     primary key (id)
> );
> 
> The animal_type table is more or less written in stone, but the animal
> table will be updated frequently.  Now, let's suppose that we want to
> keep track of all of the cases where one animal is mauled by another
> animal:
> 
> CREATE TABLE mauling (
>     id          serial,
>     attacker_id integer not null references animal (id),
>     victim_id   integer not null references animal (id),
>     attack_time timestamp not null,
>     primary key (id)
> );
> 
> The problem with this is that I have a very unsettled feeling about the
> foreign key constraints on this table.  The victim_id constraint is
> fine, but the attacker_id constraint is really inadequate, because the
> attacker CAN NEVER BE A SHEEP.  I really want a way to write a
> constraint that says that the attacker must be an animal, but
> specifically, a wolf.
> 
> It would be really nice to be able to write:
> 
> FOREIGN KEY (attacker_id, 2) REFERENCES animal (id, type_id)
> 
> Or:
> 
> CREATE UNIQUE INDEX wolves ON animal (id) WHERE type_id = 2;
> -- and then
> FOREIGN KEY (attacker_id) REFERENCES INDEX wolves
> 
> ...but that's entirely speculative syntax.  I don't think there's any
> easy way to do this.  (Please tell me I'm wrong.)
> 
> The problem really comes in when people start modifying the animal
> table.  Every once in a while we have a case where we record something
> as a wolf, but it turns out to have been a sheep in wolf's clothing.  In
> this case, we want to do something like this:
> 
> UPDATE animal SET type_id = 1 WHERE id = 572;
> 
> HOWEVER, this operation MUST NOT be allowed if it turns out there is a
> row in the mauling table where attacker_id = 572, because that would
> violate my integrity constraints that says that sheep do not maul.
> 
> Any suggestions?  I've thought about creating rules or triggers to check
> the conditions, but I'm scared that this could either (a) get really
> complicated when there are a lot more tables and constraints involved or
> (b) introduce race conditions.
> 
> Thanks,
> 
> ...Robert

I'd do something like this:

CREATE TABLE animal_type (
    animal_name TEXT PRIMARY KEY,
    CHECK (animal_name = trim(animal_name))
);

/* Only one of {Wolf,wolf} can be in the table. */

CREATE UNIQUE INDEX just_one_animal_name
ON animal_type(LOWER(animal_name));

CREATE TABLE predator (
    animal_name TEXT NOT NULL
        REFERENCES animal_type (animal_name)
        ON DELETE CASCADE,
    PRIMARY KEY (animal_name)
);

CREATE TABLE prey (
    animal_name TEXT NOT NULL
        REFERENCES animal_type (animal_name)
        ON DELETE CASCADE,
    PRIMARY KEY (animal_name)
);

CREATE TABLE mauling (
    id          SERIAL PRIMARY KEY,
    attacker_id TEXT NOT NULL REFERENCES predator (animal_name),
    victim_id   TEXT NOT NULL REFERENCES prey (animal_name),
    attack_time TIMESTAMP WITH TIME ZONE NOT NULL
);

Cheers,
D
-- 
David Fetter <[EMAIL PROTECTED]> http://fetter.org/
phone: +1 415 235 3778    AIM: dfetter666
  Skype: davidfetter

Remember to vote!



Re: [GENERAL] complex referential integrity constraints

2007-02-22 Thread Robert Haas
Yes, exactly.  And while you might not care about all of those (e.g. I
care about the first two but am not worried about the third one because
I'm the only one who will ever update that table), writing multiple
triggers to enforce each constraint of this type quickly gets old if
there are even a few of them.  It is exponentially harder to write a
constraint of this type than it is to write a simple foreign key
constraint.

...Robert 

-Original Message-
From: Joris Dobbelsteen [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 19, 2007 5:59 AM
To: elein; Robert Haas
Cc: pgsql-general@postgresql.org
Subject: RE: [GENERAL] complex referential integrity constraints

>Why don't you add a field in animal_types that is boolean mauler.
>Then you can add a trigger on the mauling table to raise an 
>error when the attacker_id is an animal type mauler.

This is only partial. You need a lot more triggers to guarantee the
constraints are enforced.
Specifically, you need to validate:
* mauling on insert/update of attacker_id
* animal on update of type_id
* animal_type on update of your property

Of course you need to think about the MVCC model, such that:
Transaction 1 executes
INSERT INTO mauling VALUES ('someattacker'),
Transaction 2 executes
UPDATE animal_type SET mauler = false WHERE name = 'someattacker',
such that both transaction happen in parallel.

This is perfectly possible and will make it possible to violate the
constraint, UNLESS locking of the tuples is done correctly.

These constraints are not trivial to implement (unfortunately). It would
be great if they were.

- Joris



Re: [GENERAL] complex referential integrity constraints

2007-02-25 Thread Robert Haas
Actually, what would be really nice is if there were just a button I
could push that would make all of my data automatically correct.  Can
that go into 8.3?  Thanks, ...Robert

-Original Message-
From: Alvaro Herrera [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 23, 2007 9:35 AM
To: Alban Hertroys
Cc: Robert Haas; David Fetter; pgsql-general@postgresql.org
Subject: Re: [GENERAL] complex referential integrity constraints

Alban Hertroys wrote:
> Robert Haas wrote:
> > The idea here is that a wolf can attack a sheep, or a wolf can attack
> > another wolf, but sheep can't attack anything.  I suppose I could list
> > each wolf in both the predator and prey tables, but that seems a bit
> > duplicative (and causes other problems).
> 
> I'm quite certain a wolf is much more likely to attack a sheep than to
> attack another wolf, and even more unlikely to attack for example a
> lion. It seems to me that just the fact that it can isn't enough
> information.
> 
> It looks like you need "weighted constraints"; there's 0 chance that a
> sheep attacks a wolf, but there's >0 chance that a wolf attacks a sheep,
> >0 chance it attacks a wolf and >0 chance it attacks a lion. The exact
> numbers will vary, and I have absolutely no idea what they would be
> like. It probably requires some kind of ranking system that adjusts
> according to the known animals and their likelihood to attack each other.

Depending on what you're modelling, even this could be too simple -- for
example, while a single wolf is unlikely to attack a lion, a pack of
wolves have a lot more probability of doing so.

Do you keep packs of wolves in your barn?  If so, watch your lions.

-- 
Alvaro Herrera
http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [GENERAL] complex referential integrity constraints

2007-02-25 Thread Robert Haas
I don't understand what a weighted constraint would mean.  Either the
attacker_id can be a wolf, or it can't.  Knowing that it is only 1%
likely over the long haul is insufficient to disallow any particular
transaction.

It's certainly true that the constraint as stated is insufficient to
guarantee that the table will contain good data.  For example if we
looked at the maulings table and wolves were always mauling other wolves
but never sheep, we would naturally want to dig into that a little more
and find out why they weren't picking easier targets.  But this is
neither here nor there, because NO constraint (foreign key, check, or
what have you) is ever strong enough to ensure that the data in a table
is completely clean.  At least as I understand it, the purpose of these
constraints is to allow us to write application code which relies on
certain basic invariants being true, i.e. so that we can join animal to
animal_type and not have to worry about rows dropping out because some
animals had an invalid type, or rows getting added because there are two
animal_type records with the same id.

Besides, the problem as stated is a proxy for some real problem which is
part of a non-zoological project the details of which (a) would take too
long to explain and (b) should probably not be posted to a public
mailing list.  :-)

So far, the best ideas I've seen have been:

(a) Tom Lane's idea of denormalizing by copying the animal type column
into the maulings table with ON UPDATE CASCADE, and then adding a CHECK
constraint on that column, and

(b) Creating a separate table called "wolf" and some triggers that
ensure that the wolf table will always contain the subset of IDs from
the animal table where the type_id is that of a wolf, with a foreign key
constraint from that id column back to animal with "on delete cascade".
This ensures that nobody can delete a wolf or change it into a sheep if
it has maulings, but permits it otherwise.

For what it's worth, I've adopted the latter solution for the present.
Unfortunately, it's too much work to do it everywhere it would be nice
to have, so I'm just doing it in some really critical cases and hoping
that the others don't break.
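
For the record, option (a) needs no triggers at all; a minimal sketch
against the schema from my original message (type 2 = wolf):

ALTER TABLE animal ADD UNIQUE (id, type_id);

CREATE TABLE mauling (
    id               serial,
    attacker_id      integer not null,
    attacker_type_id integer not null default 2
                         check (attacker_type_id = 2),
    victim_id        integer not null references animal (id),
    attack_time      timestamp not null,
    primary key (id),
    foreign key (attacker_id, attacker_type_id)
        references animal (id, type_id) on update cascade
);

-- Turning a wolf with recorded maulings into a sheep now cascades
-- into attacker_type_id, where the CHECK constraint rejects it.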

Thanks,

...Robert

-Original Message-
From: Alban Hertroys [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 23, 2007 4:02 AM
To: Robert Haas
Cc: David Fetter; pgsql-general@postgresql.org
Subject: Re: [GENERAL] complex referential integrity constraints

Robert Haas wrote:
> The idea here is that a wolf can attack a sheep, or a wolf can attack
> another wolf, but sheep can't attack anything.  I suppose I could list
> each wolf in both the predator and prey tables, but that seems a bit
> duplicative (and causes other problems).
> 
> ...Robert

I'm quite certain a wolf is much more likely to attack a sheep than to
attack another wolf, and even more unlikely to attack for example a
lion. It seems to me that just the fact that it can isn't enough
information.

It looks like you need "weighted constraints"; there's 0 chance that a
sheep attacks a wolf, but there's >0 chance that a wolf attacks a sheep,
>0 chance it attacks a wolf and >0 chance it attacks a lion. The exact
numbers will vary, and I have absolutely no idea what they would be
like. It probably requires some kind of ranking system that adjusts
according to the known animals and their likelihood to attack each other.

I'm pretty sure you can't get this done without defining some triggers.

-- 
Alban Hertroys
[EMAIL PROTECTED]

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //



Re: [GENERAL] complex referential integrity constraints

2007-02-26 Thread Robert Haas
I sort of think that this kind of stuff belongs in a separate table
somehow.  For example in this case I would want to:

CREATE TABLE attack_probability (
    attacker_type_id integer not null references animal_type (id),
    victim_type_id   integer not null references animal_type (id),
    percent_chance   integer not null,
    primary key (attacker_type_id, victim_type_id)
);

...and then declare that the CONSTRAINT on the "maulings" table is that
if I look up the types of the attacker and victim, that pair of types
must be present in the attack_probability table with percent_chance > 0.
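
Enforcing that still takes a trigger today; a rough sketch (ignoring
the concurrency caveats raised earlier in the thread):

CREATE FUNCTION check_mauling() RETURNS trigger AS $$
BEGIN
    IF NOT EXISTS (
        SELECT 1
          FROM attack_probability p
          JOIN animal att ON att.type_id = p.attacker_type_id
          JOIN animal vic ON vic.type_id = p.victim_type_id
         WHERE att.id = NEW.attacker_id
           AND vic.id = NEW.victim_id
           AND p.percent_chance > 0
    ) THEN
        RAISE EXCEPTION 'no nonzero attack probability for this pair';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mauling_check BEFORE INSERT OR UPDATE ON mauling
    FOR EACH ROW EXECUTE PROCEDURE check_mauling();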

I guess my point here is that I think in your proposal you are letting
domain-specific data creep into the schema.  It's the job of the schema
to enforce integrity constraints, but not to know specifically how
things work.  The fact (if it is a fact) that the chance of one type of
animal attacking another can be captured as a probability (rather than,
say, one probability for the day time and another probability for the
night time, or one probability for each specific animal rather than each
animal type, or I don't know what's going on and want to infer the
probabilities from the data after I've gathered it) is domain-specific.
I don't really want the information about attack probabilities (or
whatever) to be something that's hardcoded in my schema; I want it to be
part of the data in the schema, with the schema enforcing such
constraints on that data as I may see fit to define.  I don't want to
have to select things out of system tables to find out attack
probabilities.  Also, as a practical matter, I suspect that such a setup
would result in an absurdly complex constraint language.

...Robert

-Original Message-
From: Alban Hertroys [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 26, 2007 4:15 AM
To: Robert Haas
Cc: David Fetter; pgsql-general@postgresql.org
Subject: Re: [GENERAL] complex referential integrity constraints

Robert Haas wrote:
> I don't understand what a weighted constraint would mean.  Either the
> attacker_id can be a wolf, or it can't.  Knowing that it is only 1%
> likely over the long haul is insufficient to disallow any particular
> transaction.

Basically I suggested combining the constraint with a probability. If
the probability of one animal attacking another is 0%, it can't attack
the other animal - that's a strict constraint. The other way around, it
can, and you'll also immediately know how likely that is to happen.

An added bonus is that you can generalize certain constraints. If
certain animals are less than, say, 25% likely to attack certain other
animals, you could determine that the attacked animal is not in fact
prey. An example would probably be wolves attacking other wolves (or
predators in general). For relations that depend on an animal being
prey, a constraint would be that this number be <25%.

In this discussion it is also not entirely defined what attacking means.
Is a ram defending his herd from wolves attacking (the wolves)?

I realise this all started from an analogy to a real problem, so most of
this is probably not very relevant to your actual problem. No less
interesting, though.

-- 
Alban Hertroys
[EMAIL PROTECTED]

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //


