Re: [HACKERS] Rework the way multixact truncations work

2015-12-10 Thread Andres Freund
On 2015-12-09 20:23:06 -0500, Noah Misch wrote:
> By the way, it occurs to me that I should also make pg_upgrade blacklist the
> range of catversions that might have data loss.  No sense in putting ourselves
> in the position of asking whether data files of a 9.9.3 cluster spent time in
> a 9.5beta2 cluster.

I can't see any benefit in that. We're talking about a bug that, afaics,
needs another unknown bug to trigger (so find_multixact_start() fails),
and then very likely needs significant amounts of new multixacts
consumed, without a restart and without find_multixact_start()
succeeding later.


What I think would actually help for questions like this, is to add, as
discussed in some other threads, the following:
1) 'creating version' to pg_control
2) 'creating version' to each pg_class entry
3) 'last relation rewrite in version' to each pg_class entry
4) 'last full vacuum in version' to each pg_class entry

I think for this purpose 'version' should be something akin to
$catversion||$numericversion (int64 probably?) - that way development
branches and release branches are handled somewhat usefully.
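(Purely as an illustration, not a concrete proposal: one way to pack such a
combined identifier into an int64 could look like the sketch below, assuming a
YYYYMMDDNN-style catversion and a PG_VERSION_NUM-style numeric version; the
helper name is made up.)

#include <stdint.h>

/*
 * Hypothetical helper: pack a catalog version (e.g. 201511071) and a
 * numeric server version (e.g. 90500 for 9.5.0) into a single int64,
 * so that both development and release branches sort usefully.
 */
static inline int64_t
make_creating_version(uint32_t catversion, uint32_t version_num)
{
    /* catversion in the high 32 bits, numeric version in the low 32 bits */
    return ((int64_t) catversion << 32) | (int64_t) version_num;
}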

I think that'd be useful, both from an investigatory perspective and
from a tooling perspective, because it'd allow reusing things like hint
bits.


> > Ripping it out and replacing it monolithically will not
> > change that; it will only make the detailed history harder to
> > reconstruct, and I *will* want to reconstruct it.
> 
> What's something that might happen six months from now and lead you to inspect
> master or 9.5 multixact.c between 4f627f8 and its revert?

"Hey, what has happened to multixact.c lately? I'm investigating a bug,
and I wonder if it already has been fixed?", "Uh, what was the problem
with that earlier large commit?", "Hey, what has changed between beta2
and the final release?"...




Re: [HACKERS] Making tab-complete.c easier to maintain

2015-12-10 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 9 Dec 2015 10:31:06 -0800, David Fetter  wrote in 
<20151209183106.gc10...@fetter.org>
> On Wed, Dec 09, 2015 at 03:49:20PM +, Greg Stark wrote:
> > On Wed, Dec 9, 2015 at 2:27 PM, David Fetter  wrote:
> > > Agreed that the "whole new language" aspect seems like way too big a
> > > hammer, given what it actually does.
> > 
> > Which would be easier to update when things change?
> 
> This question seems closer to being on point with the patch sets
> proposed here.

I agree to some extent.

I'm unhappy with context matching using previous_words on two
points. First, the current code needs human-readable comments
describing almost all of the matches. That is hard to maintain, and
some of the comments are actually wrong. Thomas's approach largely
alleviates this, except for the complex cases. Second, the
previous_words method is not adaptable enough for optional words in
the syntax. For example, CREATE INDEX has a complex syntax, and the
current, rather complex code does not cover it fully (or well enough).

On the other hand, regexps are quite heavyweight. The current code
does one completion in about 1 millisecond, but regexps naively
substituted for the current matching code take nearly 100 ms in my
environment. Appropriate refactoring brings that down to under 10 ms,
though.
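(Illustration only, not code from either patch: one kind of change that can
cut that sort of cost is compiling each pattern once with regcomp() and
reusing it, instead of recompiling on every completion attempt. The rule and
function names below are made up, and the actual refactoring may of course
look quite different.)

#include <regex.h>
#include <stdbool.h>

/*
 * Hypothetical rule: match input ending in "CREATE [UNIQUE] INDEX ... ON".
 * The pattern is compiled once and cached, so repeated completions only pay
 * the relatively cheap regexec() cost, not regcomp().
 */
static const char create_index_pattern[] =
    "CREATE[[:space:]]+(UNIQUE[[:space:]]+)?INDEX[[:space:]]+"
    "([^[:space:]]+[[:space:]]+)?ON[[:space:]]*$";

static regex_t create_index_re;
static bool    create_index_compiled = false;

static bool
matches_create_index(const char *input)
{
    if (!create_index_compiled)
    {
        if (regcomp(&create_index_re, create_index_pattern,
                    REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
            return false;   /* should not happen for a fixed pattern */
        create_index_compiled = true;
    }
    return regexec(&create_index_re, input, 0, NULL, 0) == 0;
}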

If we need more powerful completion (meaning it covers a wider
area of the syntax, including more optional words), Thomas's
approach would face difficulties of another level of
complexity. I'd like to overcome that.

> > Which would be possible to automatically generate from gram.y?
> 
> This seems like it goes to a wholesale context-aware reworking of tab
> completion rather than the myopic ("What has happened within the past N
> tokens?", for slowly increasing N) versions of tab completions in both
> the current code and in the two proposals here.
> 
> A context-aware tab completion wouldn't care how many columns you were
> into a target list, or a FROM list, or whatever, as it would complete
> based on the (possibly nested) context ("in a target list", e.g.)
> rather than on inferences made from some slightly variable number of
> previous tokens.

I don't know how happy fully context-aware completion would make
me, but it would be another high wall to overcome.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






[HACKERS] Is postgresql on Windows compatible with flex 2.6.0?

2015-12-10 Thread Victor Wagner
Collegues,

Recently (about a month ago) flex 2.6.0 was released.

But when we tried to compile PostgreSQL 9.5 beta 1 from git with it on
Windows, we found that src/tools/msvc/pgflex.pl explicitly
checks that the minor version of flex is equal to 5:

unless ($verparts[0] == 2 && $verparts[1] == 5 && $verparts[2] >= 31)
{
    print "WARNING! Flex install not found, or unsupported Flex version.\n";
    print "echo Attempting to build without.\n";
    exit 0;
}

Is this check intentional, or does it just sit there from the time when
version 2.6.0 didn't exist?

Postgres seems to compile fine with flex 2.6.0.

--

Victor Wagner




Re: [HACKERS] Rework the way multixact truncations work

2015-12-10 Thread Peter Geoghegan
On Thu, Dec 10, 2015 at 12:34 AM, Andres Freund  wrote:
>> > Ripping it out and replacing it monolithically will not
>> > change that; it will only make the detailed history harder to
>> > reconstruct, and I *will* want to reconstruct it.
>>
>> What's something that might happen six months from now and lead you to 
>> inspect
>> master or 9.5 multixact.c between 4f627f8 and its revert?
>
> "Hey, what has happened to multixact.c lately? I'm investigating a bug,
> and I wonder if it already has been fixed?", "Uh, what was the problem
> with that earlier large commit?", "Hey, what has changed between beta2
> and the final release?"...

Quite.

I can't believe we're still having this silly discussion. Can we please move on?

-- 
Peter Geoghegan




Re: [HACKERS] Rework the way multixact truncations work

2015-12-10 Thread Bert
+1

On Thu, Dec 10, 2015 at 9:58 AM, Peter Geoghegan  wrote:

> On Thu, Dec 10, 2015 at 12:34 AM, Andres Freund 
> wrote:
> >> > Ripping it out and replacing it monolithically will not
> >> > change that; it will only make the detailed history harder to
> >> > reconstruct, and I *will* want to reconstruct it.
> >>
> >> What's something that might happen six months from now and lead you to
> inspect
> >> master or 9.5 multixact.c between 4f627f8 and its revert?
> >
> > "Hey, what has happened to multixact.c lately? I'm investigating a bug,
> > and I wonder if it already has been fixed?", "Uh, what was the problem
> > with that earlier large commit?", "Hey, what has changed between beta2
> > and the final release?"...
>
> Quite.
>
> I can't believe we're still having this silly discussion. Can we please
> move on?
>
> --
> Peter Geoghegan
>



-- 
Bert Desmet
0477/305361


Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-10 Thread Andreas Seltenreich
Tom Lane writes:

> [2. transitive-lateral-fixes-1.patch]

I was about to write that sqlsmith likes the patch, but after more than
10^8 ok queries the attached ones were generated.

regards,
Andreas



post-patch-errors.sql
Description: application/sql



Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-10 Thread Peter Geoghegan
On Sun, Dec 6, 2015 at 9:52 AM, Andreas Seltenreich  wrote:
> I've added new grammar rules to sqlsmith and improved some older ones.

Could you possibly teach sqlsmith about INSERT ... ON CONFLICT DO
UPDATE/IGNORE? I think that that could be very helpful, especially if
it could be done in advance of any stable release of 9.5.

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Amit Langote
On 2015/12/10 15:28, Michael Paquier wrote:
> On Thu, Dec 10, 2015 at 4:40 AM, Robert Haas  wrote:
>> It's going to be *really* important that this facility provides a
>> lightweight way of updating progress, so I think this whole API is
>> badly designed.  VACUUM, for example, is going to want a way to update
>> the individual counter for the number of pages it's scanned after
>> every page.  It should not have to pass all of the other information
>> that is part of a complete report.  It should just be able to say
>> pgstat_report_progress_update_counter(1, pages_scanned) or something
>> of this sort.  Don't marshal all of the data and then make
>> pgstat_report_progress figure out what's changed.  Provide a series of
>> narrow APIs where the caller tells you exactly what they want to
>> update, and you update only that.  It's fine to have a
>> pgstat_report_progress() function to update everything at once, for
>> the use at the beginning of a command, but the incremental updates
>> within the command should do something lighter-weight.
> 
> [first time looking really at the patch and catching up with this thread]
> 
> Agreed. As far as I can guess from the code, those should be as
> follows, to bloat the pgstat message queue a minimum:
> 
> +   pgstat_report_activity_flag(ACTIVITY_IS_VACUUM);
> /*
>  * Loop to process each selected relation.
>  */
> Defining a new routine for this purpose is a bit surprising. Cannot we
> just use pgstat_report_activity with a new BackendState STATE_INVACUUM
> or similar if we pursue the progress tracking approach?

ISTM, PgBackendStatus.st_state is normally manipulated at quite different
sites (mostly tcop) than where a backend would be able to report that a
command like VACUUM is running. Also, the value 'active' of
pg_stat_activity.state has an established meaning, as Horiguchi-san
points out in his reply. IMHO, this patch shouldn't change the meaning
of an existing pg_stat_activity field.


> A couple of comments:
> - The relation OID should be reported and not its name. In case of a
> relation rename that would not be cool for tracking, and most users
> are surely going to join with other system tables using it.

+1

> - The progress tracking facility adds a whole level of complexity for
> very little gain, and IMO this should *not* be part of PgBackendStatus
> because in most cases its data ends up wasted. We don't expect
> backends to run such progress reports frequently, do we? My opinion on
> the matter is that we should define separate collector data for
> vacuum, with something like PgStat_StatVacuumEntry, then have on top
> of it a couple of routines dedicated to feeding data into it when
> some work is done on a vacuum job.

I assume your comment here means we should use the stats collector to
track/publish the progress info, is that right?

AIUI, the counts published via the stats collector are updated asynchronously
w.r.t. the operations they count, and mostly as aggregate figures. For example,
PgStat_StatTabEntry.blocks_fetched: we never see
pg_statio_all_tables.heap_blks_read updating as a scan reads blocks. Maybe
that helps keep traffic to the pgstat collector at sane levels. But that is
not to say that I think controlling collector traffic was the only
design consideration behind how such counters are published.

In the case of reporting counters as progress info, it seems we might have to
send too many PgStat_Msg's, for example one for every block we finish
processing during vacuum. That kind of message traffic may swamp the
collector. We would also need to see the updated counters from other backends
in near real time, though that may be possible with a suitable (build?)
configuration.

Moreover, the counters reported as progress info seem to be of a different
nature than those published via the stats collector. I would think of it
in terms of the distinction between track_activities and track_counts. I
find these lines on "The Statistics Collector" page somewhat related:

"PostgreSQL also supports reporting dynamic information about exactly what
is going on in the system right now, such as the exact command currently
being executed by other server processes, and which other connections
exist in the system. This facility is independent of the collector process."

Then,

"When using the statistics to monitor collected data, it is important to
realize that the information does not update instantaneously. Each
individual server process transmits new statistical counts to the
collector just before going idle; so a query or transaction still in
progress does not affect the displayed totals. Also, the collector itself
emits a new report at most once per PGSTAT_STAT_INTERVAL milliseconds (500
ms unless altered while building the server). So the displayed information
lags behind actual activity. However, current-query information collected
by track_activities is always up-to-date."

http://www.postgresql.org/docs/devel/st

Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-10 Thread Andreas Seltenreich
Tom Lane writes:

> Merlin Moncure  writes:
>> Aside from the functional issues, could your changes result in
>> performance regressions?
[...]
> It's a little bit harder to gauge the impact on planner speed.  The
> transitive closure calculation could be expensive in a query with many
> lateral references, but that doesn't seem likely to be common; and anyway
> we'll buy back some of that cost due to simpler tests later.  I'm
> optimistic that we'll come out ahead in HEAD/9.5 after the removal
> of LateralJoinInfo setup.  It might be roughly a wash in the back
> branches.

On the empirical side: I see a speedup of 0.4% in testing speed with the
patch applied.  It could very well be me venting the room one additional
time during the second session, resulting in the CPUs spending more time
in their opportunistic frequency range or something.

regards,
Andreas

smith=# select extract('day' from t),
          avg(generated/extract(epoch from updated - t)) as tps
        from instance natural join stat
        where generated > 100
          and hostname not in ('dwagon','slugbug')
              -- these do stuff besides sqlsmith
          and t > now() - interval '3 days'
        group by 1 order by 1;
 date_part |       tps
-----------+------------------
         8 |  55.494181110456
         9 | 55.6902316869404




Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 4:57 PM, Andres Freund  wrote:
> On December 10, 2015 5:02:27 AM GMT+01:00, Michael Paquier 
>  wrote:
>>On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund 
>>wrote:
>>> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote:
 Oh, OK. I didn't read though your lines correctly. So you basically
 mean that we would look at the init files that are on disk, and
>>check
 if they are empty. If they are, we simply use XLogReadBufferExtended
 to fetch the INIT_FORKNUM content and fill in another buffer for the
 MAIN_FORKNUM. More or less right?
>>>
>>> We would not just do so if they're empty, we would just generally
>>copy the file
>>> via shared buffers, instead of copy_file(). But we'd get the file
>>size
>>> from the filesystem (which is fine, we make sure it is correct during
>>> replay).
>>
>>So, this suggestion is basically implementing the reverse operation of
>>GetRelationPath() to be able to rebuild a RelFileNode from scratch and
>>then look at the shared buffers needed. Isn't it a bit brittle for
>>back-branches? RelFileNode stuff is available easily through records
>>when replaying individually FPIs, but not really in this code path.
>
> Oh, sorry. In definitely not thinking about doing this for the back branches. 
> That was more musing about a way to optimize this.

OK sure. Yeah that may be something worth investigating. The reverse
engineering of GetRelationPath would be interesting perhaps for some
utilities.

So, do we go for something like the patch you attached in
20151208125716.gs4...@alap3.anarazel.de for master and 9.5, and for
~9.4 we use the one I wrote in
cab7npqsxerpzj+bz-mfopzfzp5pabie9jwbucjy6qayertt...@mail.gmail.com?
Note that in both cases the patches are not complete; we also need to fix
copy_relation_data@tablecmds.c so that the INIT_FORKNUM pages are logged
all the time. I am also thinking that the patch changing the WAL format of
XLOG_FPI is overdoing it: as we know that all the INIT_FORKNUM pages will
be generated for unlogged relations, it's fine to flush the INIT_FORKNUM
buffers unconditionally at replay.
-- 
Michael




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 7:23 PM, Amit Langote
 wrote:
> On 2015/12/10 15:28, Michael Paquier wrote:
>> - The progress tracking facility adds a whole level of complexity for
>> very little gain, and IMO this should *not* be part of PgBackendStatus
>> because in most cases its data finishes wasted. We don't expect
>> backends to run frequently such progress reports, do we? My opinion on
>> the matter if that we should define a different collector data for
>> vacuum, with something like PgStat_StatVacuumEntry, then have on top
>> of it a couple of routines dedicated at feeding up data with it when
>> some work is done on a vacuum job.
>
> I assume your comment here means we should use stats collector to the
> track/publish progress info, is that right?

Yep.

> AIUI, the counts published via stats collector are updated asynchronously
> w.r.t. operations they count and mostly as aggregate figures. For example,
> PgStat_StatTabEntry.blocks_fetched. IOW, we never see
> pg_statio_all_tables.heap_blks_read updating as a scan reads blocks. Maybe
> that helps keep traffic to pgstat collector to sane levels. But that is
> not to mean that I think controlling stats collector levels was the only
> design consideration behind how such counters are published.
>
> In case of reporting counters as progress info, it seems we might have to
> send too many PgStat_Msg's, for example, for every block we finish
> processing during vacuum. That kind of message traffic may swamp the
> collector. Then we need to see the updated counters from other counters in
> near real-time though that may be possible with suitable (build?)
> configuration.

As far as I understand it, the basic reason why this patch exists is
to allow a DBA to have a hint of the progress of a VACUUM that may be
taking minutes, or say hours, which is something we don't have now. So
it seems perfectly fine to me to report this information
asynchronously with a bit of lag. Why would we need so much precision
in the report?

>> In short, it seems to me that this patch needs a rework, and should be
>> returned with feedback. Other opinions?
>
> Yeah, some more thought needs to be put into design of the general
> reporting interface. Then we also need to pay attention to another
> important aspect of this patch - lazy vacuum instrumentation.

This patch has received a lot of feedback, and it is not in a
committable state, so I marked it as "Returned with feedback" for this
CF.
Regards,
-- 
Michael




Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
> So, do we go for something like the patch you attached in
> 20151208125716.gs4...@alap3.anarazel.de for master and 9.5, and for
> ~9.4 we use the one I wrote in
> cab7npqsxerpzj+bz-mfopzfzp5pabie9jwbucjy6qayertt...@mail.gmail.com?

I'm more thinking of using something like my patch for all branches. Why
would we want to go for the more complicated approach in the more
distant branches?

> Note that in both cases the patches are not complete, we need to fix
> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
> are logged all the time.

Agreed. It's probably better treated as an entirely different - pretty
ugly - bug though. I mean it's not some issue of a race during replay,
it's entirely missing WAL logging for SET TABLESPACE of unlogged
relations.

Andres




Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 8:56 PM, Andres Freund  wrote:
>> So, do we go for something like the patch you attached in
>> 20151208125716.gs4...@alap3.anarazel.de for master and 9.5, and for
>> ~9.4 we use the one I wrote in
>> cab7npqsxerpzj+bz-mfopzfzp5pabie9jwbucjy6qayertt...@mail.gmail.com?
>
> I'm more thinking of using something like my patch for all branches. Why
> would we want to go for the more complicated approach in the more
> distant branches?

That's not what I meant: I don't wish to take the complicated
approach anymore either. I sent two patches in the mail mentioned
above: one for master/9.5 that used the approach of changing the WAL
format, and a second aimed at 9.4 and older versions that is close to
what you sent. It looks like you did not look at the second patch, named
20151209_replay_unlogged_94.patch, which does some stuff with
XLOG_HEAP_NEWPAGE to fix the issue.

>> Note that in both cases the patches are not complete, we need to fix
>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
>> are logged all the time.
>
> Agreed. It's probably better treated as an entirely different - pretty
> ugly - bug though. I mean it's not some issue of a race during replay,
> it's entirely missing WAL logging for SET TABLESPACE of unlogged
> relations.

Okidoki.
-- 
Michael




Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 9:07 PM, Michael Paquier
 wrote:
> On Thu, Dec 10, 2015 at 8:56 PM, Andres Freund  wrote:
>>> So, do we go for something like the patch you attached in
>>> 20151208125716.gs4...@alap3.anarazel.de for master and 9.5, and for
>>> ~9.4 we use the one I wrote in
>>> cab7npqsxerpzj+bz-mfopzfzp5pabie9jwbucjy6qayertt...@mail.gmail.com?
>>
>> I'm more thinking of using something like my patch for all branches. Why
>> would we want to go for the more complicated approach in the more
>> distant branches?
>
> That's not what I think it meant: I don't wish to do the complicated
> approach either anymore. I sent two patches on the mail mentioned
> above: one for master/9.5 that used the approach of changing WAL, and
> a second aimed for 9.4 and old versions that is close to what you
> sent. It looks that you did not look at the second patch, named
> 20151209_replay_unlogged_94.patch that does some stuff with
> XLOG_HEAP_NEWPAGE to fix the issue.
>
>>> Note that in both cases the patches are not complete, we need to fix
>>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
>>> are logged all the time.
>>
>> Agreed. It's probably better treated as an entirely different - pretty
>> ugly - bug though. I mean it's not some issue of a race during replay,
>> it's entirely missing WAL logging for SET TABLESPACE of unlogged
>> relations.
>
> Okidoki.

In short: should I send patches for all those things or are you on it?
It seems that we are on the same page: using the simple approach, with
XLOG_FPI that enforces the flushes for 9.5/master and
XLOG_HEAP_NEWPAGE that does the same for ~9.4.

For the second issue, I would just need to extract the fix from one of
the patches upthread.
-- 
Michael




Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
On 2015-12-10 21:10:57 +0900, Michael Paquier wrote:
> In short: should I send patches for all those things or are you on it?

I'm on it. I don't think the new name you gave the function, and the new
comment, are really an improvement. We already have 'SyncOneBuffer'
(unusable for our purpose unfortunately), so going for Single here
doesn't seem like a good idea.

Andres




Re: [HACKERS] Is postgresql on Windows compatible with flex 2.6.0?

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 5:45 PM, Victor Wagner  wrote:
> Recently (about a month ago) flex 2.6.0 have been released.
>
> But when we tried to compile PostgreSQL 9.5 beta 1 from git with it on
> Windows, we've encountered that src/tools/msvc/pgflex.pl explicitely
> checks that minor version of flex is equal to 5.
>
> unless ($verparts[0] == 2 && $verparts[1] == 5 && $verparts[2] >= 31)
> {
>     print "WARNING! Flex install not found, or unsupported Flex version.\n";
>     print "echo Attempting to build without.\n";
>     exit 0;
> }
>
> Is this check intentional, or it just seats here from the time when
> version 2.6.0 didn't exist?
>
> Postgres seems to compile fine with flex 2.6.0.

It seems that 32f15d05 missed updating its equivalent in
src/tools/msvc. Hence the attached patch, per what you already found out.
-- 
Michael


20151210_msvc_flex.patch
Description: binary/octet-stream



Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 9:13 PM, Andres Freund  wrote:
> On 2015-12-10 21:10:57 +0900, Michael Paquier wrote:
>> In short: should I send patches for all those things or are you on it?
>
> I'm on it. I don't think the new name you gave the function, and the new
> comment, are really an improvement. We already have 'SyncOneBuffer'
> (unusable for our purpose unfortunately), so going for Single here
> doesn't seem like a good idea.

OK, no problem. Thanks.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2015-12-10 Thread Stas Kelvich
Michael, Jeff thanks for reviewing and testing.

> On 10 Dec 2015, at 02:16, Michael Paquier  wrote:
> 
> This has better be InvalidXLogRecPtr if unused.


Yes, that’s better. Changed.

> On 10 Dec 2015, at 02:16, Michael Paquier  wrote:

> +    if (gxact->prepare_lsn)
> +    {
> +        XlogReadTwoPhaseData(gxact->prepare_xlogptr, &buf, NULL);
> +    }
> Perhaps you mean prepare_xlogptr here?


Yes, my bad. But funnily enough I made this error an even number of times: the
code in CheckPointTwoPhase also uses prepare_lsn instead of prepare_xlogptr, so
overall it was working well; that's why it survived my own tests and probably
Jeff's tests. I think that's bad variable naming: for example, the lsn in
pg_xlogdump points to the start of the record, but here the start is used as
xlogptr and the end as lsn. So I changed both variables to prepare_start_lsn
and prepare_end_lsn.

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:
> I've tested this through my testing harness which forces the database
> to go through endless runs of crash recovery and checks for
> consistency, and so far it has survived perfectly.


Cool! I think the patch is most vulnerable to the following type of workload:
prepare a transaction, do a lot of stuff with the database to force checkpoints
(or even recovery cycles), and then commit it.

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:

> Can you give the full command line?  -j, -c, etc.


pgbench -h testhost -i && pgbench -h testhost -f 2pc.pgb -T 300 -P 1 -c 64 -j 16 -r

where 2pc.pgb is as in the previous message.

Also, all this applies to hosts with uniform memory. I tried running the
patched postgres on a NUMA machine with 60 physical cores and the patch didn't
change anything. Perf top shows that the main bottleneck is access to gxact,
but on an ordinary host with one or two CPUs that access is not even among the
top ten heaviest routines.

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:

> Why are you incrementing :scale ?


That’s a funny part, overall 2pc speed depends on how you will name your 
prepared transaction. Concretely I tried to use random numbers for gid’s and it 
was slower than having constantly incrementing gid. Probably that happens due 
to linear search by gid in gxact array on commit. So I used :scale just as a 
counter, bacause it is initialised on pgbench start and line like “\set scale 
:scale+1” works well. (may be there is a different way to do it in pgbench).

> I very rapidly reach a point where most of the updates are against
> tuples that don't exist, and then get integer overflow problems.

Hmm, that’s strange. Probably you set scale to big value, so that 10*:scale 
is bigger that int4? But i thought that pgbench will change aid columns to 
bigint if scale is more than 2.



2pc_xlog.v2.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Is postgresql on Windows compatible with flex 2.6.0?

2015-12-10 Thread Victor Wagner
On Thu, 10 Dec 2015 21:19:46 +0900
Michael Paquier  wrote:

> On Thu, Dec 10, 2015 at 5:45 PM, Victor Wagner 

> > Postgres seems to compile fine with flex 2.6.0.  
> 
> It seems that 32f15d05 missed to update its equivalent in
> src/tools/msvc. Per se the patch attached as you already found out.

Maybe it is better to use the standard Perl module 'version' to compare
version numbers.

something like:

use version;

unless (version->parse($flexver) >= version->parse("2.5.31"))

Postgres requires a relatively recent Perl.  Version objects appeared in
Perl 5.10, and most people would have something newer.

-- 
 Victor Wagner 




Re: [HACKERS] Rework the way multixact truncations work

2015-12-10 Thread Robert Haas
On Wed, Dec 9, 2015 at 8:23 PM, Noah Misch  wrote:
> On Wed, Dec 09, 2015 at 11:08:32AM -0500, Robert Haas wrote:
>> On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch  wrote:
>> > I hope those who have not already read commit 4f627f8 will not waste time
>> > reading it.  They should instead ignore multixact changes from commit 
>> > 4f627f8
>> > through its revert.  The 2015-09-26 commits have not appeared in a 
>> > supported
>> > release, and no other work has built upon them.  They have no tenure.  (I 
>> > am
>> > glad you talked the author out of back-patching; otherwise, 9.4.5 and 
>> > 9.3.10
>> > would have introduced a data loss bug.)  Nobody reported a single defect
>> > before my review overturned half the patch.  A revert will indeed impose on
>> > those who invested time to understand commit 4f627f8, but the silence about
>> > its defects suggests the people understanding it number either zero or one.
>> > Even as its author and reviewers, you would do better to set aside what you
>> > thought you knew about this code.
>>
>> I just don't find this a realistic model of how people use the git
>> log.  Maybe you use it this way; I don't.  I don't *want* git blame to
>> make it seem as if 4f627f8 is not part of the history.  For better or
>> worse, it is.
>
> I would like to understand how you use git, then.  What's one of your models
> of using "git log" and/or "git blame" in which you foresee the revert making
> history less clear, not more clear?

Well, suppose I wanted to know what bugs were fixed between 9.5beta
and 9.5.0, for example.  I mean, I'm going to run git log
src/backend/access/transam/multixact.c ... and the existing commits
are going to be there.

> By the way, it occurs to me that I should also make pg_upgrade blacklist the
> range of catversions that might have data loss.  No sense in putting ourselves
> in the position of asking whether data files of a 9.9.3 cluster spent time in
> a 9.5beta2 cluster.

Maybe.  But I think we could use a little more vigorous discussion of
that issue, since Andres doesn't seem to be very convinced by your
analysis, and I don't currently understand what you've fixed because I
can't, as mentioned several times, follow your patch stack.

>> Ripping it out and replacing it monolithically will not
>> change that; it will only make the detailed history harder to
>> reconstruct, and I *will* want to reconstruct it.
>
> What's something that might happen six months from now and lead you to inspect
> master or 9.5 multixact.c between 4f627f8 and its revert?

I don't have anything to add to what others have said in response
to this point, except this: the whole point of using a source code
management system is to tell you what changed and when.  What you are
proposing to do makes it unusable for that purpose.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Rework the way multixact truncations work

2015-12-10 Thread Andres Freund
On 2015-12-10 08:55:54 -0500, Robert Haas wrote:
> Maybe.  But I think we could use a little more vigorous discussion of
> that issue, since Andres doesn't seem to be very convinced by your
> analysis, and I don't currently understand what you've fixed because I
> can't, as mentioned several times, follow your patch stack.

The issue at hand is that the following block

    oldestOffsetKnown =
        find_multixact_start(oldestMultiXactId, &oldestOffset);

    ...
    else if (prevOldestOffsetKnown)
    {
        /*
         * If we failed to get the oldest offset this time, but we have a
         * value from a previous pass through this function, use the old
         * value rather than automatically forcing it.
         */
        oldestOffset = prevOldestOffset;
        oldestOffsetKnown = true;
    }

in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which is then
set in shared memory:

    /* Install the computed values */
    LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
    MultiXactState->oldestOffset = oldestOffset;
    MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
    MultiXactState->offsetStopLimit = offsetStopLimit;
    LWLockRelease(MultiXactGenLock);

so, if find_multixact_start() failed - a "should never happen" occurrence
- we install a wrong stop limit. It does get 'repaired' upon the next
succeeding find_multixact_start() in SetOffsetVacuumLimit() or a restart,
though.

Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.
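(Not the actual fix, just a fragment-style sketch of the shape it could take;
the prevOffsetStopLimit variable and the surrounding reads are assumptions
about SetOffsetVacuumLimit()'s local state.)

    /* Read the previous stop limit along with the other previous values. */
    LWLockAcquire(MultiXactGenLock, LW_SHARED);
    prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
    prevOldestOffset = MultiXactState->oldestOffset;
    prevOffsetStopLimit = MultiXactState->offsetStopLimit;     /* new */
    LWLockRelease(MultiXactGenLock);

    ...
    else if (prevOldestOffsetKnown)
    {
        /*
         * Reuse the values from the previous pass, including the stop
         * limit, instead of installing a bogus offsetStopLimit.
         */
        oldestOffset = prevOldestOffset;
        oldestOffsetKnown = true;
        offsetStopLimit = prevOffsetStopLimit;                 /* new */
    }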




Re: [HACKERS] Is postgresql on Windows compatible with flex 2.6.0?

2015-12-10 Thread Michael Paquier
On Thu, Dec 10, 2015 at 10:36 PM, Victor Wagner  wrote:
> Postgres requires relatively recent Perl.  Version objects appeared in
> Perl 5.10 and most people would have something newer.

AFAIK, the minimum requirement for perl is 5.8...
-- 
Michael




Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-10 Thread Merlin Moncure
On Thu, Dec 10, 2015 at 4:55 AM, Andreas Seltenreich  wrote:
> Tom Lane writes:
>
>> Merlin Moncure  writes:
>>> Aside from the functional issues, could your changes result in
>>> performance regressions?
> [...]
>> It's a little bit harder to gauge the impact on planner speed.  The
>> transitive closure calculation could be expensive in a query with many
>> lateral references, but that doesn't seem likely to be common; and anyway
>> we'll buy back some of that cost due to simpler tests later.  I'm
>> optimistic that we'll come out ahead in HEAD/9.5 after the removal
>> of LateralJoinInfo setup.  It might be roughly a wash in the back
>> branches.
>
> On the empirical side: I see a speedup of 0.4% in testing speed with the
> patch applied.  It could very well be me venting the room one additional
> time during the second session, resulting in the CPUs spending more time
> in their opportunistic frequency range or something.

Really...wow.  That satisfies me.   You ought to be commended for
sqlsmith -- great stuff.

merlin




Re: [HACKERS] Rework the way multixact truncations work

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 9:04 AM, Andres Freund  wrote:
> On 2015-12-10 08:55:54 -0500, Robert Haas wrote:
>> Maybe.  But I think we could use a little more vigorous discussion of
>> that issue, since Andres doesn't seem to be very convinced by your
>> analysis, and I don't currently understand what you've fixed because I
>> can't, as mentioned several times, follow your patch stack.
>
> The issue at hand is that the following block
>
>     oldestOffsetKnown =
>         find_multixact_start(oldestMultiXactId, &oldestOffset);
>
>     ...
>     else if (prevOldestOffsetKnown)
>     {
>         /*
>          * If we failed to get the oldest offset this time, but we have a
>          * value from a previous pass through this function, use the old
>          * value rather than automatically forcing it.
>          */
>         oldestOffset = prevOldestOffset;
>         oldestOffsetKnown = true;
>     }
>
> in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which is then
> set in shared memory:
>
>     /* Install the computed values */
>     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>     MultiXactState->oldestOffset = oldestOffset;
>     MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
>     MultiXactState->offsetStopLimit = offsetStopLimit;
>     LWLockRelease(MultiXactGenLock);
>
> so, if find_multixact_start() failed - a "should never happen" occurrence
> - we install a wrong stop limit. It does get 'repaired' upon the next
> succeeding find_multixact_start() in SetOffsetVacuumLimit() or a restart,
> though.
>
> Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.

So let's do that, then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Is postgresql on Windows compatible with flex 2.6.0?

2015-12-10 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Dec 10, 2015 at 10:36 PM, Victor Wagner  wrote:
>> Postgres requires relatively recent Perl.  Version objects appeared in
>> Perl 5.10 and most people would have something newer.

> AFAIK, the minimum requirement for perl is 5.8...

Yeah, that is our stated requirement, and we do have buildfarm members
checking it on the Unix side (eg prairiedog runs 5.8.6).  I don't know
if anyone is checking the MSVC-build scripts against an old Perl,
though ... it seems possible that some more-recent constructs have
crept into that pile of perl code without anyone noticing.

regards, tom lane




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 6:46 AM, Michael Paquier
 wrote:
> On Thu, Dec 10, 2015 at 7:23 PM, Amit Langote
>  wrote:
>> On 2015/12/10 15:28, Michael Paquier wrote:
>>> - The progress tracking facility adds a whole level of complexity for
>>> very little gain, and IMO this should *not* be part of PgBackendStatus
>>> because in most cases its data finishes wasted. We don't expect
>>> backends to run frequently such progress reports, do we? My opinion on
>>> the matter if that we should define a different collector data for
>>> vacuum, with something like PgStat_StatVacuumEntry, then have on top
>>> of it a couple of routines dedicated at feeding up data with it when
>>> some work is done on a vacuum job.
>>
>> I assume your comment here means we should use stats collector to the
>> track/publish progress info, is that right?
>
> Yep.

Oh, please, no.  Gosh, this is supposed to be a lightweight facility!
Just have a chunk of shared memory and write the data in there.  If
you try to feed this through the stats collector you're going to
increase the overhead by 100x or more, and there's no benefit.  We've
got to do relation stats that way because there's no a priori bound on
the number of relations, so we can't just preallocate enough shared
memory for all of them.  But there's no similar restriction here: the
number of backends IS fixed at startup time.  As long as we limit the
amount of progress information that a backend can supply to some fixed
length, which IMHO we definitely should, there's no need to add the
expense of funneling this through the stats collector.
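(To make the shape of that concrete, here is a minimal sketch; the struct
name, counter count, and update function are illustrative assumptions, not
the committed design.)

#include <stdint.h>

#define N_PROGRESS_COUNTERS 10          /* fixed length, per backend */

typedef struct BackendProgress
{
    int32_t     command;                /* e.g. a "vacuum" command identifier */
    uint32_t    target_relid;           /* OID of the relation being processed */
    int64_t     counter[N_PROGRESS_COUNTERS];
} BackendProgress;

/* One slot per backend, preallocated in shared memory at startup. */
extern BackendProgress *MyProgressSlot;

/* Cheap incremental update: overwrite a single counter, no message passing. */
static inline void
progress_update_counter(int index, int64_t value)
{
    MyProgressSlot->counter[index] = value;
}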

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Is postgresql on Windows compatible with flex 2.6.0?

2015-12-10 Thread Andres Freund
On 2015-12-10 09:39:30 -0500, Tom Lane wrote:
> Michael Paquier  writes:
> > On Thu, Dec 10, 2015 at 10:36 PM, Victor Wagner  wrote:
> >> Postgres requires relatively recent Perl.  Version objects appeared in
> >> Perl 5.10 and most people would have something newer.
> 
> > AFAIK, the minimum requirement for perl is 5.8...
> 
> Yeah, that is our stated requirement, and we do have buildfarm members
> checking it on the Unix side (eg prairiedog runs 5.8.6).  I don't know
> if anyone is checking the MSVC-build scripts against an old Perl,
> though ... it seems possible that some more-recent constructs have
> crept into that pile of perl code without anyone noticing.

Tom, are you applying the patch, or should I?




Re: [HACKERS] Is postgresql on Windows compatible with flex 2.6.0?

2015-12-10 Thread Tom Lane
Andres Freund  writes:
> Tom, are you applying the patch, or should I?

I can do it ... it was my oversight to begin with :-(

regards, tom lane




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Tom Lane
Robert Haas  writes:
> Oh, please, no.  Gosh, this is supposed to be a lightweight facility!
> Just have a chunk of shared memory and write the data in there.  If
> you try to feed this through the stats collector you're going to
> increase the overhead by 100x or more, and there's no benefit.  We've
> got to do relation stats that way because there's no a priori bound on
> the number of relations, so we can't just preallocate enough shared
> memory for all of them.  But there's no similar restriction here: the
> number of backends IS fixed at startup time.  As long as we limit the
> amount of progress information that a backend can supply to some fixed
> length, which IMHO we definitely should, there's no need to add the
> expense of funneling this through the stats collector.

I agree with this, and I'd further add that if we don't have a
fixed-length progress state, we've overdesigned the facility entirely.
People won't be able to make sense of anything that acts much more
complicated than "0% .. 100% done".  So you need to find a way of
approximating progress of a given command in terms more or less
like that, even if it's a pretty crude approximation.

regards, tom lane




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 9:49 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Oh, please, no.  Gosh, this is supposed to be a lightweight facility!
>> Just have a chunk of shared memory and write the data in there.  If
>> you try to feed this through the stats collector you're going to
>> increase the overhead by 100x or more, and there's no benefit.  We've
>> got to do relation stats that way because there's no a priori bound on
>> the number of relations, so we can't just preallocate enough shared
>> memory for all of them.  But there's no similar restriction here: the
>> number of backends IS fixed at startup time.  As long as we limit the
>> amount of progress information that a backend can supply to some fixed
>> length, which IMHO we definitely should, there's no need to add the
>> expense of funneling this through the stats collector.
>
> I agree with this, and I'd further add that if we don't have a
> fixed-length progress state, we've overdesigned the facility entirely.
> People won't be able to make sense of anything that acts much more
> complicated than "0% .. 100% done".  So you need to find a way of
> approximating progress of a given command in terms more or less
> like that, even if it's a pretty crude approximation.

That I don't agree with.  Even for something like VACUUM, it's pretty
hard to approximate overall progress - because, for example, normally
we'll only have 1 index scan per index, but we might have multiple
index scans or none if maintenance_work_mem is too small or if there
aren't any dead tuples after all.  I don't want our progress reporting
facility to end up with this reputation:

https://xkcd.com/612/

This point has already been discussed rather extensively upthread, but
to reiterate, I think it's much better to report slightly more
detailed information and let the user figure out what to do with it.
For example, for a VACUUM, I think we should report something like
this:

1. The number of heap pages scanned thus far.
2. The number of dead tuples found thus far.
3. The number of dead tuples we can store before we run out of
maintenance_work_mem.
4. The number of index pages processed by the current index vac cycle
(or a sentinel value if none is in progress).
5. The number of heap pages for which the "second heap pass" has been completed.

Now, if the user wants to flatten this out to a progress meter, they
can write an SQL expression which does that easily enough, folding the
sizes of the table and its indices and whatever assumptions they want
to make about what will happen down the road.  If we all agree on how
that should be done, it can even ship as a built-in view.  But I
*don't* think we should build those assumptions into the core progress
reporting facility.  For one thing, that would make updating the
progress meter considerably more expensive - you'd have to recompute a
new completion percentage instead of just saying "heap pages processed
went up by one".  For another thing, there are definitely going to be
some people that want the detailed information - and I can practically
guarantee that if we don't make it available, at least one person will
write a tool that tries to reverse-engineer the detailed progress
information from whatever we do report.

Heck, I might do it myself.  If I find a long-running vacuum on a
customer system that doesn't seem to be making progress, knowing that
it's 69% complete and that the completion percentage isn't rising
doesn't help me much.  What I want to know is are we stuck in a heap
vacuum phase, or an index vacuum phase, or a second heap pass.  Are we
making progress so slowly that 69% takes forever to get to 70%, or are
we making absolutely no progress at all?  I think if we don't report a
healthy amount of detail here people will still frequently have to
resort to what I do now, which is ask the customer to install strace
and attach it to the vacuum process.  For many customers, that's not
so easy; and it never inspires any confidence.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Robert Haas
On Wed, Dec 9, 2015 at 4:54 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> The comment in mdnblocks.c says this:
>
>>  * Because we pass O_CREAT, we will create the next segment (with
>>  * zero length) immediately, if the last segment is of length
>>  * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>>  * the logic simple.
>
>> I don't really see how this "keeps the logic simple".
>
> My (vague) recollection is that this is defending some assumptions
> elsewhere in md.c.  You sure you haven't broken anything with this?
> Relation extension across a segment boundary, perhaps?

It doesn't look like that particular thing gets broken here.  The key
to extending across a segment boundary is that _mdfd_openseg() needs
to get O_CREAT as its fourth argument, but _mdfd_getseg() thankfully
does not rely on mdnblocks() to have already done that: it passes
O_CREAT itself.  I did also test it by creating a 13GB
pgbench_accounts table, and at least in that simple case it works
fine.  It's possible that there is some other thing I've missed, but
in general I think that if there is some dependency on mdnblocks()
creating new segments, that's a bad idea and we should try to get rid
of it.  A call that supposedly reports the number of blocks in a
relation really has no business changing on-disk state.

(More broadly, as Kevin was pointing out to me yesterday, md.c looks
like it could do with a face lift.  Keeping a linked list of 1GB
segments and chasing down the list to find the length of the file may
have been fine when relations over 1GB were rare, but that's now
routine.  Some relations may be over 1TB, and using a linked list of
1000 entries to keep track of all of those segments does not seem like
an especially good choice.  In fact, having no way to get the relation
length other than scanning 1000 files doesn't seem like an especially
good choice even if we used a better data structure.  Putting a header
page in the heap would make getting the length of a relation O(1)
instead of O(segments), and for a bonus, we'd be able to reliably
detect it if a relation file disappeared out from under us.  That's a
difficult project and definitely not my top priority, but this code is
old and crufty all the same.)
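(For readers who haven't looked at md.c: the pattern being criticized is
roughly the following, shown here as a simplified stand-in rather than the
actual code. Computing the relation size means walking the whole per-segment
chain.)

#include <stdint.h>
#include <stddef.h>

#define RELSEG_SIZE 131072      /* blocks per 1GB segment, with 8kB pages */

/* Simplified stand-in for md.c's per-segment bookkeeping. */
typedef struct MdSegment
{
    int64_t           nblocks;  /* blocks in this segment's file */
    struct MdSegment *next;     /* next 1GB segment, or NULL */
} MdSegment;

/*
 * O(number of segments): a 1TB relation means chasing ~1000 list nodes,
 * and the real code also asks the filesystem for each file's size.
 */
static int64_t
relation_nblocks(const MdSegment *seg)
{
    int64_t     total = 0;

    for (; seg != NULL; seg = seg->next)
        total += seg->nblocks;
    return total;
}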

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] proposal: multiple psql option -c

2015-12-10 Thread Robert Haas
On Wed, Dec 9, 2015 at 12:15 AM, Pavel Stehule  wrote:
>> On Wed, Dec 9, 2015 at 2:07 PM, Pavel Stehule 
>> wrote:
>> > should be noted, recorded somewhere, as this introduces a possible
>> > compatibility issue due to the default processing of .psqlrc.
>>
>> That's written in the commit log, so I guess that's fine.
>
> ook

Bruce uses the commit log to prepare the release notes, so I guess
he'll make mention of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Remaining 9.5 open items

2015-12-10 Thread Robert Haas
On Wed, Dec 9, 2015 at 2:52 AM, Etsuro Fujita
 wrote:
> Thank you for committing the patch!
>
> Sorry, I overlooked a typo in docs: s/more that one/more than one/ Please
> find attached a patch.

Committed, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
Hi Michael, Robert,

On 2015-12-10 21:10:57 +0900, Michael Paquier wrote:
> On Thu, Dec 10, 2015 at 9:07 PM, Michael Paquier
>  wrote:
> >>> Note that in both cases the patches are not complete, we need to fix
> >>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages
> >>> are logged all the time.
> >>
> >> Agreed. It's probably better treated as an entirely different - pretty
> >> ugly - bug though. I mean it's not some issue of a race during replay,
> >> it's entirely missing WAL logging for SET TABLESPACE of unlogged
> >> relations.
> 
> For the second issue, I would just need to extract the fix from one of
> the patches upthread.

I've pushed the fix for the promotion-related issue. I'm afraid that
the ALTER TABLE ... SET TABLESPACE issue is a bit bigger
than that, though.

Robert, to catch you up: the issue, discovered by Michael somewhere in
this thread, is that copy_relation_data(), used by SET TABLESPACE, does
not deal with unlogged tables.

Michael, what your earlier patch did was basically

@@ -9892,9 +9892,14 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
-     */
-    use_wal = XLogIsNeeded() && relpersistence == RELPERSISTENCE_PERMANENT;
+     * enabled AND it's a permanent relation. Unlogged relations need to have
+     * their init fork logged as well, to ensure a consistent on-disk state
+     * when reset at the end of recovery.
+     */
+    use_wal = XLogIsNeeded() &&
+        (relpersistence == RELPERSISTENCE_PERMANENT ||
+         (relpersistence == RELPERSISTENCE_UNLOGGED &&
+          forkNum == INIT_FORKNUM));

...

    nblocks = smgrnblocks(src, forkNum);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        ...
        if (use_wal)
            log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page);


But that's not sufficient for a bunch of reasons:

First, and that's the big one, that approach doesn't guarantee that the
file will be created in the new tablespace if the file does not have any
blocks. Like e.g. the heap relation itself. This will lead to errors
like:
ERROR:  58P01: could not open file "pg_tblspc/16384/PG_9.6_201511071/12384/16438": No such file or directory
LOCATION:  mdopen, md.c:602

The real problem there imo isn't that copy_relation_data() doesn't
deal with 0-block tables, but that ATExecSetTableSpace() doesn't have an
unlogged-table-specific codepath like heap_create_with_catalog() has.


A second problem is that the smgrimmedsync() in copy_relation_data()
isn't called for the init fork of unlogged relations, even though it
needs to be.

As ATExecSetTableSpace() is also used for indices, the problem exists
there as well.

Yuck.

Greetings,

Andres Freund




Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Andres Freund
On 2015-12-10 11:10:10 -0500, Robert Haas wrote:
> (More broadly, as Kevin was pointing out to me yesterday, md.c looks
> like it could do with a face lift.  Keeping a linked list of 1GB
> segments and chasing down the list to find the length of the file may
> have been fine when relations over 1GB were rare, but that's now
> routine.  Some relations may be over 1TB, and using a linked list of
> 1000 entries to keep track of all of those segments does not seem like
> an especially good choice.

Yes, that sucks. I've posted a patch for that at
http://archives.postgresql.org/message-id/20141031223210.GN13584%40awork2.anarazel.de
- I don't have access to either the workload or a good testing
machine anymore though, so I've kinda lost interest for a while.

I'll try to push the patch with the renaming suggested downthread by Tom
soonish.

> In fact, having no way to get the relation length other than scanning
> 1000 files doesn't seem like an especially good choice even if we used
> a better data structure.  Putting a header page in the heap would make
> getting the length of a relation O(1) instead of O(segments), and for
> a bonus, we'd be able to reliably detect it if a relation file
> disappeared out from under us.  That's a difficult project and
> definitely not my top priority, but this code is old and crufty all
> the same.)

The md layer doesn't really know whether it's dealing with a heap, or
with an index, or ... So handling this via a metapage doesn't seem
particularly straightforward.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-10 Thread Robert Haas
On Wed, Dec 9, 2015 at 8:30 AM, Aleksander Alekseev
 wrote:
> Hello, Robert
>
> Thanks for your review. I believe I fixed items 1, 2 and 3 (see
> attachment). Also I would like to clarify item 4.
>
>> 4. It mixes together multiple ideas in a single patch, not only
>> introducing a hashing concept but also striping a brand-new layer of
>> abstraction across the resource-owner mechanism.  I am not sure that
>> layer of abstraction is a very good idea, but if it needs to be done,
>> I think it should be a separate patch.
>
> Do I right understand that you suggest following?
>
> Current patch should be split in two parts. In first patch we create
> and use ResourceArray with array-based implementation (abstraction
> layer). Then we apply second patch which change ResourceArray
> implementation to hashing based (optimization).

Well, sorta.  To be honest, I think this patch is really ugly.  If we
were going to do this then, yes, I would want to split the patch into
two parts along those lines.  But actually I don't really want to do
it this way at all.  It's not that I don't want the performance
benefits: I do.  But the current code is really easy to read and
extremely simple, and this changes it into something that is a heck of
a lot harder to read and understand.  I'm not sure exactly what to do
about that, but it seems like a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund  wrote:
>> In fact, having no way to get the relation length other than scanning
>> 1000 files doesn't seem like an especially good choice even if we used
>> a better data structure.  Putting a header page in the heap would make
>> getting the length of a relation O(1) instead of O(segments), and for
>> a bonus, we'd be able to reliably detect it if a relation file
>> disappeared out from under us.  That's a difficult project and
>> definitely not my top priority, but this code is old and crufty all
>> the same.)
>
> The md layer doesn't really know whether it's dealing with an index, or
> with an index, or ... So handling this via a metapage doesn't seem
> particularly straightforward.

It's not straightforward, but I don't think that's the reason.  What
we could do is look at the call sites that use
RelationGetNumberOfBlocks() and change some of them to get the
information some other way instead.  I believe get_relation_info() and
initscan() are the primary culprits, accounting for some enormous
percentage of the system calls we do on a read-only pgbench workload.
Those functions certainly know enough to consult a metapage if we had
such a thing.
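
To make that concrete, here is a rough and purely hypothetical sketch of what
such a lookup could look like at those call sites.  RelationMetaPageData,
rel_nblocks and the function name are invented for illustration; they are not
existing PostgreSQL structures:

/*
 * Hypothetical sketch only: assumes block 0 of the main fork is a metapage
 * whose special space stores the relation's current length in blocks.
 */
typedef struct RelationMetaPageData
{
	BlockNumber	rel_nblocks;	/* current relation length in blocks */
} RelationMetaPageData;

static BlockNumber
RelationGetNumberOfBlocksFromMetapage(Relation rel)
{
	Buffer		metabuf;
	BlockNumber	nblocks;

	metabuf = ReadBuffer(rel, 0);		/* metapage lives in block 0 */
	LockBuffer(metabuf, BUFFER_LOCK_SHARE);
	nblocks = ((RelationMetaPageData *)
			   PageGetSpecialPointer(BufferGetPage(metabuf)))->rel_nblocks;
	UnlockReleaseBuffer(metabuf);

	return nblocks;
}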

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Robert Haas
On Mon, Nov 30, 2015 at 9:53 PM, Michael Paquier
 wrote:
>> I feel quite uncomfortable that it solves the problem from a kind
>> of nature of unlogged object by arbitrary flagging which is not
>> fully corresponds to the nature. If we can deduce the necessity
>> of fsync from some nature, it would be preferable.
>
> INIT_FORKNUM is not something only related to unlogged relations,
> indexes use them as well.

Eh, what?

Indexes use them if they are indexes on unlogged tables, but they'd
better not use them in any other situation.  Otherwise bad things are
going to happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund  wrote:
> I've, pushed the fix for the promotion related issue. I'm afraid that
> the ALTER TABLE  SET TABLESPACE issue is a bit bigger
> than it though.

I think that I would have preferred to fix this by flushing unlogged
buffers in bulk at the end of recovery, rather than by flushing them
as they are generated. This approach seems needlessly inefficient.
Maybe it doesn't matter, but you made the same sort of decision with
the find_multixact_start() fix: flushing sooner instead of working
harder to make the delayed flush safe.  I think that's a bad direction
in general, and that it will bite us.  Nonetheless, I've been
unresponsive on this thread, and it's certainly better to have fixed
the bug in a way that isn't what I would prefer than to have not fixed
it.

> Robert, to catch you up: The isssue, discovered by Michael somewhere in
> this thread, is that copy_relation_data(), used by SET TABLESPACE, does
> not deal with unlogged tables.

Hmm.

[ ... ]

> But that's not sufficient for a bunch of reasons:
>
> First, and that's the big one, that approach doesn't guarantee that the
> file will be created in the new tablespace if the file does not have any
> blocks. Like e.g. the heap relation itself. This will lead to errors
> like:
> ERROR:  58P01: could not open file 
> "pg_tblspc/16384/PG_9.6_201511071/12384/16438": No such file or
> LOCATION:  mdopen, md.c:602
>
> The real problem there imo isn't that the copy_relation_data() doesn't
> deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
> unlogged table specific codepath like heap_create_with_catalog() has.

It looks to me like somewhere we need to do log_smgrcreate(...,
INIT_FORKNUM) in the unlogged table case.  RelationCreateStorage()
skips this for the main forknum of an unlogged table, which seems OK,
but there's nothing that even attempts it for the init fork, which
does not seem OK.  I guess that logic should happen in
ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).
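
A minimal sketch of that placement, assuming it lives in
ATExecSetTableSpace()'s existing loop over forks and reuses its rel, dstrel,
newrnode and forkNum variables:

	if (smgrexists(rel->rd_smgr, forkNum))
	{
		smgrcreate(dstrel, forkNum, false);

		/*
		 * WAL-log creation of the init fork of an unlogged relation, so
		 * that a promoted standby ends up with the file as well.
		 */
		if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
			forkNum == INIT_FORKNUM)
			log_smgrcreate(&newrnode, forkNum);

		copy_relation_data(rel->rd_smgr, dstrel, forkNum,
						   rel->rd_rel->relpersistence);
	}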

> A second problem is that the smgrimmedsync() in copy_relation_data()
> isn't called for the init fork of unlogged relations, even if it needs
> to.

That seems easy enough to fix.  Can't we just sync all forks at that
location for permanent relations, and additionally sync the init fork
only for unlogged relations?
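
For illustration, the check guarding the smgrimmedsync() at the bottom of
copy_relation_data() might then look roughly like this (sketch only):

	/*
	 * Sync what will not be replayed from WAL after a crash: the data of a
	 * permanent relation, and the init fork of an unlogged one.
	 */
	if (relpersistence == RELPERSISTENCE_PERMANENT ||
		(relpersistence == RELPERSISTENCE_UNLOGGED &&
		 forkNum == INIT_FORKNUM))
		smgrimmedsync(dst, forkNum);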

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor comment update in setrefs.c

2015-12-10 Thread Robert Haas
On Tue, Dec 8, 2015 at 6:16 AM, Etsuro Fujita
 wrote:
> Attached is a small patch to adjust a comment in setrefs.c; in
> set_foreignscan_references, fdw_recheck_quals also gets adjusted to
> reference foreign scan tuple, in case of a foreign join, so I added
> "etc.", to a comment there, as the comment in case of a simple foreign
> table scan.

Doesn't apply any more.  I suppose we could sync up the similar
comments in set_customscan_references() too.  But to be honest I'm not
sure this is adding any clarity.  "etc." may not be the least
informative thing you can put in a comment, but it's pretty close.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
> On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund  wrote:
> > I've, pushed the fix for the promotion related issue. I'm afraid that
> > the ALTER TABLE  SET TABLESPACE issue is a bit bigger
> > than it though.
> 
> I think that I would have preferred to fix this by flushing unlogged
> buffers in bulk at the end of recovery, rather than by flushing them
> as they are generated. This approach seems needlessly inefficient.

We touched on that somewhere in the thread - having to scan all of
shared buffers isn't free either, and it'd mean that promotion would
take longer because it'd do all the writes at once. As we don't fsync
them during the flush itself, and as init forks don't ever get
rewritten, I don't think it makes any actual difference? The total
number of writes to the OS is the same, no?


> Maybe it doesn't matter, but you made the same sort of decision with
> the find_multixact_start() fix: flushing sooner instead of working
> harder to make the delayed flush safe. I think that's a bad direction
> in general, and that it will bite us.

Working harder often means added complexity, and complexity for things
executed relatively infrequently has the tendency to bite. I don't think
that's a simple tradeoff.

E.g. in the find_multixact_start() case the flushing isn't just about
"itself", it's also about interactions with SlruScanDirectory. Requiring
that to also check in-memory buffers wouldn't be entirely trivial... And
all that for an operation that's essentially executed infrequently.


> > The real problem there imo isn't that the copy_relation_data() doesn't
> > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
> > unlogged table specific codepath like heap_create_with_catalog() has.
> 
> It looks to me like somewhere we need to do log_smgrcreate(...,
> INIT_FORKNUM) in the unlogged table case.

Yes.

> RelationCreateStorage()
> skips this for the main forknum of an unlogged table, which seems OK,
> but there's nothing that even attempts it for the init fork, which
> does not seem OK.

We unfortunately can't trivially delegate that work to
RelationCreateStorage(). E.g. heap_create() documents that only the main
fork is created :(

> I guess that logic should happen in
> ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).

Looks like it's the easiest place.

It sounds worthwhile to check that other locations rewriting tables,
e.g. cluster/vacuum full/reindex are safe.

> > A second problem is that the smgrimmedsync() in copy_relation_data()
> > isn't called for the init fork of unlogged relations, even if it needs
> > to.
> 
> That seems easy enough to fix.  Can't we just sync all forks at that
> location for permanent relations, and additionally sync the init fork
> only for unlogged relations?

Yes, it's not hard, I just wanted to mention it so it doesn't get
forgotten.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tab-comletion for RLS

2015-12-10 Thread Robert Haas
On Tue, Dec 8, 2015 at 8:32 AM, Masahiko Sawada  wrote:
> I found some lacks of tab-completion for RLS in 9.5.
>
> * ALTER POLICY [TAB]
> I expected to appear the list of policy name, but nothing is appeared.
>
> * ALTER POLICY hoge_policy ON [TAB]
> I expected to appear the list of table name related to specified policy, but
> all table names are appeared.
>
> * ALTER POLICY ... ON ... TO [TAB]
> I expected to appear { role_name | PUBLIC | CURRENT_USER | SESSION_USER },
> but only role_name and PUBLIC are appeared.
> Same problem is exists in
> "
> CREATE POLICY ... ON ... TO [TAB]
> "
> .
>
> #1 and #2 problems are exist in 9.5 or later, but #3 is exist in only 9.5
> because it's unintentionally fixed by
> 2f8880704a697312d8d10ab3a2ad7ffe4b5e3dfd commit.
> I think we should apply the necessary part of this commit for 9.5 as well,
> though?
>
> Attached patches are:
> * 000_fix_tab_completion_rls.patch
>   fixes #1, #2 problem, and is for master branch and REL9_5_STABLE.
> * 001_fix_tab_completion_rls_for_95.patch
>   fixes #3 problem, and is for only REL9_5_STABLE.

I've committed 000 and back-patched it to 9.5.  I'm not quite sure
what to do about 001; maybe it's better to back-port the whole commit
rather than just bits of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2015-12-10 Thread Robert Haas
On Tue, Dec 8, 2015 at 6:40 AM, Etsuro Fujita
 wrote:
> IMO I want to see the EvalPlanQual fix in the first version for 9.6.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
On 2015-12-10 18:36:32 +0100, Andres Freund wrote:
> On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
> > On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund  wrote:
> > > I've, pushed the fix for the promotion related issue. I'm afraid that
> > > the ALTER TABLE  SET TABLESPACE issue is a bit bigger
> > > than it though.
> > 
> > I think that I would have preferred to fix this by flushing unlogged
> > buffers in bulk at the end of recovery, rather than by flushing them
> > as they are generated. This approach seems needlessly inefficient.
> 
> We touched on that somewhere in the thread - having to scan all of
> shared buffers isn't free either, and it'd mean that promotion would
> take longer because it'd do all the writes at once. As we don't fsync
> them during the flush itself, and as init forks don't ever get
> rewritten, I don't think it makes any actual difference? The total
> number of writes to the OS is the same, no?

Oh, and the primary, in most places, immediately does an smgrimmedsync()
after creating the init fork anyway. So in comparison to that a plain
smgrwrite() during replay is nearly free.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Greg Stark
On Thu, Dec 10, 2015 at 4:47 PM, Robert Haas  wrote:
>
> It's not straightforward, but I don't think that's the reason.  What
> we could do is look at the call sites that use
> RelationGetNumberOfBlocks() and change some of them to get the
> information some other way instead.  I believe get_relation_info() and
> initscan() are the primary culprits, accounting for some enormous
> percentage of the system calls we do on a read-only pgbench workload.
> Those functions certainly know enough to consult a metapage if we had
> such a thing.

Would this not run into a chicken and egg problem with recovery?
Unless you're going to fsync the meta page whenever the file is
extended you'll have to xlog any updates to it and treat the values in
memory as authoritative. But when replaying xlog you'll see obsolete
inconsistent versions on disk and won't have the correct values in
memory either.

It seems to me that if you want to fix the linked lists of files
that's orthogonal to whether the file lengths on disk are
authoritative. You can always keep the lengths or at least the number
of files cached and updated in shared memory in a more efficient
storage format.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Andres Freund
On 2015-12-10 17:55:37 +, Greg Stark wrote:
> It seems to me that if you want to fix the linked lists of files
> that's orthogonal to whether the file lengths on disk are
> authoritative. You can always keep the lengths or at least the number
> of files cached and updated in shared memory in a more efficient
> storage format.

FWIW, I've a very preliminary patch doing exactly that, keeping the file
size in shared memory. I'm not sure it'd end up nice if relation
extension would have to grab an exclusive lock on the metapage.
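
Purely for illustration (this is not the layout of that preliminary patch), a
shared-memory entry for such a cache could be as small as:

	/* Illustrative only: one shared entry per relation fork being tracked. */
	typedef struct RelFileSizeEntry
	{
		RelFileNode			rnode;		/* which relation */
		ForkNumber			forknum;	/* which fork */
		pg_atomic_uint32	nblocks;	/* cached length in blocks */
	} RelFileSizeEntry;

With a layout along these lines, relation extension could get away with an
atomic update of nblocks rather than an exclusive lock on a metapage.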


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] REASSIGN OWNED doesn't know how to deal with USER MAPPINGs

2015-12-10 Thread Jaime Casanova
Hi,

We just noticed $SUBJECT. The attached patch fixes it by ignoring USER
MAPPINGs in shdepReassignOwned(), just as happens with default
ACLs.

DROP OWNED manages it well.

-- 
Jaime Casanova  www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
diff --git a/src/backend/catalog/pg_shdepend.c b/src/backend/catalog/pg_shdepend.c
index 43076c9..027bc8b 100644
--- a/src/backend/catalog/pg_shdepend.c
+++ b/src/backend/catalog/pg_shdepend.c
@@ -43,6 +43,7 @@
 #include "catalog/pg_ts_config.h"
 #include "catalog/pg_ts_dict.h"
 #include "catalog/pg_type.h"
+#include "catalog/pg_user_mapping.h"
 #include "commands/alter.h"
 #include "commands/dbcommands.h"
 #include "commands/collationcmds.h"
@@ -1384,6 +1385,13 @@ shdepReassignOwned(List *roleids, Oid newrole)
 	AlterForeignDataWrapperOwner_oid(sdepForm->objid, newrole);
 	break;
 
+case UserMappingRelationId:
+
+	/* Ignore USER MAPPINGs; they should be handled by DROP
+	 * OWNED, not REASSIGN OWNED.
+	 */
+	break;
+
 case EventTriggerRelationId:
 	AlterEventTriggerOwner_oid(sdepForm->objid, newrole);
 	break;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REASSIGN OWNED doesn't know how to deal with USER MAPPINGs

2015-12-10 Thread Jaime Casanova
On 10 December 2015 at 13:04, Jaime Casanova
 wrote:
> Hi,
>
> We just notice $SUBJECT. Attached patch fixes it by ignoring USER
> MAPPINGs in shdepReassignOwned() just like it happens with default
> ACLs.
>

BTW, shouldn't we at least give a warning in those cases instead of
assuming that the user will know that some objects were ignored?

-- 
Jaime Casanova  www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 12:55 PM, Greg Stark  wrote:
> On Thu, Dec 10, 2015 at 4:47 PM, Robert Haas  wrote:
>> It's not straightforward, but I don't think that's the reason.  What
>> we could do is look at the call sites that use
>> RelationGetNumberOfBlocks() and change some of them to get the
>> information some other way instead.  I believe get_relation_info() and
>> initscan() are the primary culprits, accounting for some enormous
>> percentage of the system calls we do on a read-only pgbench workload.
>> Those functions certainly know enough to consult a metapage if we had
>> such a thing.
>
> Would this not run into a chicken and egg problem with recovery?

No.

> Unless you're going to fsync the meta page whenever the file is
> extended you'll have to xlog any updates to it and treat the values in
> memory as authoritative. But when replaying xlog you'll see obsolete
> inconsistent versions on disk and won't have the correct values in
> memory either.

All true, but irrelevant.  The places I just mentioned wouldn't get
called during recovery.  They're only invoked from queries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Simon Riggs
On 10 December 2015 at 16:47, Robert Haas  wrote:

> On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund 
> wrote:
> >> In fact, having no way to get the relation length other than scanning
> >> 1000 files doesn't seem like an especially good choice even if we used
> >> a better data structure.  Putting a header page in the heap would make
> >> getting the length of a relation O(1) instead of O(segments), and for
> >> a bonus, we'd be able to reliably detect it if a relation file
> >> disappeared out from under us.  That's a difficult project and
> >> definitely not my top priority, but this code is old and crufty all
> >> the same.)
> >
> > The md layer doesn't really know whether it's dealing with an index, or
> > with an index, or ... So handling this via a metapage doesn't seem
> > particularly straightforward.
>
> It's not straightforward, but I don't think that's the reason.  What
> we could do is look at the call sites that use
> RelationGetNumberOfBlocks() and change some of them to get the
> information some other way instead.  I believe get_relation_info() and
> initscan() are the primary culprits, accounting for some enormous
> percentage of the system calls we do on a read-only pgbench workload.
> Those functions certainly know enough to consult a metapage if we had
> such a thing.
>

It looks pretty straightforward to me...

The number of relations with >1 file is likely to be fairly small, so we
can just have an in-memory array to record that. 8 bytes per relation >1 GB
isn't going to take much shmem, but we can extend using dynshmem as needed.
We can seq scan the array at relcache build time and invalidate relcache
when we extend. WAL log any extension to a new segment and write the table
to disk at checkpoint.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs  wrote:
> On 10 December 2015 at 16:47, Robert Haas  wrote:
>>
>> On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund 
>> wrote:
>> >> In fact, having no way to get the relation length other than scanning
>> >> 1000 files doesn't seem like an especially good choice even if we used
>> >> a better data structure.  Putting a header page in the heap would make
>> >> getting the length of a relation O(1) instead of O(segments), and for
>> >> a bonus, we'd be able to reliably detect it if a relation file
>> >> disappeared out from under us.  That's a difficult project and
>> >> definitely not my top priority, but this code is old and crufty all
>> >> the same.)
>> >
>> > The md layer doesn't really know whether it's dealing with an index, or
>> > with an index, or ... So handling this via a metapage doesn't seem
>> > particularly straightforward.
>>
>> It's not straightforward, but I don't think that's the reason.  What
>> we could do is look at the call sites that use
>> RelationGetNumberOfBlocks() and change some of them to get the
>> information some other way instead.  I believe get_relation_info() and
>> initscan() are the primary culprits, accounting for some enormous
>> percentage of the system calls we do on a read-only pgbench workload.
>> Those functions certainly know enough to consult a metapage if we had
>> such a thing.
>
> It looks pretty straightforward to me...
>
> The number of relations with >1 file is likely to be fairly small, so we can
> just have an in-memory array to record that. 8 bytes per relation >1 GB
> isn't going to take much shmem, but we can extend using dynshmem as needed.
> We can seq scan the array at relcache build time and invalidate relcache
> when we extend. WAL log any extension to a new segment and write the table
> to disk at checkpoint.

Invalidating the relcache when we extend would be extremely expensive,
but we could probably come up with some variant of this that would
work.  I'm not very excited about this design, though; I think
actually putting a metapage on each relation would be better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-12-10 Thread Pavel Stehule
> postgres=# \crosstabview 4 +month label
>

Maybe using an optional int order column instead of a label is better - then
you can do the sort on the client side.

so the syntax can be "\crosstabview VCol [+/-]HCol [[+-]HOrderCol]"

Regards

Pavel



>
> [crosstabview output: one row per customer, one column per month
> (led, úno, bře, dub, kvě, čen, čec, srp, zář, říj, lis), with the
> monthly totals as cell values]
> (18 rows)
>
>
>
> Regards
>
> Pavel
>
>
>>
>>
>>
>>> Best regards,
>>> --
>>> Daniel Vérité
>>> PostgreSQL-powered mailer: http://www.manitou-mail.org
>>> Twitter: @DanielVerite
>>>
>>
>>
>


Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-10 Thread Tom Lane
Andreas Seltenreich  writes:
> Tom Lane writes:
>> [2. transitive-lateral-fixes-1.patch]

> I was about to write that sqlsmith likes the patch, but after more than
> 10^8 ok queries the attached ones were generated.

Ah-hah --- the new check I added in join_is_legal understood about
chains of LATERAL references, but it forgot that we could also have chains
of outer-join ordering constraints.  When we're looking to see if joining
on the basis of a LATERAL reference would break some later outer join, we
have to look at outer joins to the outer joins' inner relations, too.

Fixed in the attached.  I also stuck all of join_is_legal's
lateral-related checks inside an "if (root->hasLateralRTEs)" block,
which will save some time in typical queries with no LATERAL.  That
makes that section of the patch a bit bigger than before, but it's
mostly just reindentation.

Many thanks for the work you've done with this tool!  Who knows how
long it would've taken us to find these problems otherwise ...

regards, tom lane

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 0f04033..53d8fdd 100644
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*** allow_star_schema_join(PlannerInfo *root
*** 255,308 
  }
  
  /*
-  * There's a pitfall for creating parameterized nestloops: suppose the inner
-  * rel (call it A) has a parameter that is a PlaceHolderVar, and that PHV's
-  * minimum eval_at set includes the outer rel (B) and some third rel (C).
-  * We might think we could create a B/A nestloop join that's parameterized by
-  * C.  But we would end up with a plan in which the PHV's expression has to be
-  * evaluated as a nestloop parameter at the B/A join; and the executor is only
-  * set up to handle simple Vars as NestLoopParams.  Rather than add complexity
-  * and overhead to the executor for such corner cases, it seems better to
-  * forbid the join.  (Note that existence of such a PHV probably means there
-  * is a join order constraint that will cause us to consider joining B and C
-  * directly; so we can still make use of A's parameterized path with B+C.)
-  * So we check whether any PHVs used in the query could pose such a hazard.
-  * We don't have any simple way of checking whether a risky PHV would actually
-  * be used in the inner plan, and the case is so unusual that it doesn't seem
-  * worth working very hard on it.
-  *
-  * This case can occur whether or not the join's remaining parameterization
-  * overlaps param_source_rels, so we have to check for it separately from
-  * allow_star_schema_join, even though it looks much like a star-schema case.
-  */
- static inline bool
- check_hazardous_phv(PlannerInfo *root,
- 	Path *outer_path,
- 	Path *inner_path)
- {
- 	Relids		innerparams = PATH_REQ_OUTER(inner_path);
- 	Relids		outerrelids = outer_path->parent->relids;
- 	ListCell   *lc;
- 
- 	foreach(lc, root->placeholder_list)
- 	{
- 		PlaceHolderInfo *phinfo = (PlaceHolderInfo *) lfirst(lc);
- 
- 		if (!bms_is_subset(phinfo->ph_eval_at, innerparams))
- 			continue;			/* ignore, could not be a nestloop param */
- 		if (!bms_overlap(phinfo->ph_eval_at, outerrelids))
- 			continue;			/* ignore, not relevant to this join */
- 		if (bms_is_subset(phinfo->ph_eval_at, outerrelids))
- 			continue;			/* safe, it can be eval'd within outerrel */
- 		/* Otherwise, it's potentially unsafe, so reject the join */
- 		return false;
- 	}
- 
- 	/* OK to perform the join */
- 	return true;
- }
- 
- /*
   * try_nestloop_path
   *	  Consider a nestloop join path; if it appears useful, push it into
   *	  the joinrel's pathlist via add_path().
--- 255,260 
*** try_nestloop_path(PlannerInfo *root,
*** 322,336 
  	/*
  	 * Check to see if proposed path is still parameterized, and reject if the
  	 * parameterization wouldn't be sensible --- unless allow_star_schema_join
! 	 * says to allow it anyway.  Also, we must reject if check_hazardous_phv
! 	 * doesn't like the look of it.
  	 */
  	required_outer = calc_nestloop_required_outer(outer_path,
    inner_path);
  	if (required_outer &&
  		((!bms_overlap(required_outer, extra->param_source_rels) &&
  		  !allow_star_schema_join(root, outer_path, inner_path)) ||
! 		 !check_hazardous_phv(root, outer_path, inner_path)))
  	{
  		/* Waste no memory when we reject a path here */
  		bms_free(required_outer);
--- 274,291 
  	/*
  	 * Check to see if proposed path is still parameterized, and reject if the
  	 * parameterization wouldn't be sensible --- unless allow_star_schema_join
! 	 * says to allow it anyway.  Also, we must reject if have_dangerous_phv
! 	 * doesn't like the look of it, which could only happen if the nestloop is
! 	 * still parameterized.
  	 */
  	required_outer = calc_nestloop_required_outer(outer_path,
    inner_path);
  	if (required_outer &&
  		((!bms_overlap(required_outer, extra->

Re: [HACKERS] proposal: PL/Pythonu - function ereport

2015-12-10 Thread Pavel Stehule
Hi

2015-12-08 7:06 GMT+01:00 Catalin Iacob :

> On Thu, Dec 3, 2015 at 6:54 PM, Pavel Stehule 
> wrote:
> > Don't understand - if Fatal has same behave as Error, then why it cannot
> be
> > inherited from Error?
> >
> > What can be broken?
>
> Existing code that did "except plpy.Error as exc" will now also be
> called if plpy.Fatal is raised. That wasn't the case so this changes
> meaning of existing code, therefore it introduces an incompatibility.
>

Yes, there should be some common ancestor for plpy.Error and plpy.Fatal.

Do you have any idea what this ancestor should be named?
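
Whichever name is picked, wiring it up should be mechanical.  A hypothetical
sketch of the module initialization, assuming the existing PLy_exc_error and
PLy_exc_fatal globals ("plpy.BaseError" below is just an invented placeholder,
not a naming proposal):

	PyObject   *base_exc;

	/* create the shared ancestor, then derive the existing exceptions from it */
	base_exc = PyErr_NewException("plpy.BaseError", NULL, NULL);
	PLy_exc_error = PyErr_NewException("plpy.Error", base_exc, NULL);
	PLy_exc_fatal = PyErr_NewException("plpy.Fatal", base_exc, NULL);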


> >>
> >> 5. all the reporting functions: plpy.debug...plpy.fatal functions
> >> should also be changed to allow more arguments than the message and
> >> allow them as keyword arguments
> >
> >
> > this is maybe bigger source of broken compatibility
> >
> > lot of people uses plpy.debug(var1, var2, var3)
> >
> > rich constructor use $1 for message, $2 for detail, $3 for hint. So it
> was a
> > reason, why didn't touch these functions
>
> No, if you use PyArg_ParseTupleAndKeywords you'll have Python accept
> all of these so previous ways are still accepted:
>
> plpy.error('a_msg', 'a_detail', 'a_hint')
> plpy.error'a_msg', 'a_detail', hint='a_hint')
> plpy.error('a_msg', detail='a_detail', hint='a_hint')
> plpy.error(msg='a_msg', detail='a_detail', hint='a_hint')
> plpy.error('a_msg')
> plpy.error(msg='a_msg')
> etc.
>
> But I now see that even though the docs say plpy.error and friends
> take a single msg argument, they actually allow any number of
> arguments. If multiple arguments are passed they are converted to a
> tuple and the string representation of that tuple goes into
> ereport/plpy.Error:
>
> CREATE FUNCTION test()
>   RETURNS int
> AS $$
>   try:
> plpy.error('an error message', 'detail for error', 'hint for error')
>   except plpy.Error as exc:
> plpy.info('have exc {}'.format(exc))
> plpy.info('have exc.args |{}| of type {}'.format(exc.args,
> type(exc.args)))
> $$ LANGUAGE plpython3u;
> SELECT test();
> [catalin@archie pg]$ psql -f plpy test
> CREATE FUNCTION
> psql:plpy:13: INFO:  have exc ('an error message', 'detail for error',
> 'hint for error')
> psql:plpy:13: INFO:  have exc.args |("('an error message', 'detail for
> error', 'hint for error')",)| of type 
>
> In the last line note that exc.args (the args tuple passed in the
> constructor of plpy.Error) is a tuple with a single element which is a
> string which is a representation of the tuple passed into plpy.error.
> Don't know why this was thought useful, it was certainly surprising to
> me. Anyway, the separate $2, $3 etc are currently not detail and hint,
> they're just more stuff that gets appended to this tuple. They don't
> get passed to clients as hints. And you can pass lots of them not just
> detail and hint.
>

Using a tuple is less work for you: you don't need to concatenate several
values into one. I don't know how often this technique is used; it probably
only makes sense for NOTICE and lower levels, for ad hoc debug messages.
It is probably not used massively elsewhere.



>
> My proposal would change the meaning of this to actually interpret the
> second argument as detail, third as hint and to only allow a limited
> number (the ones with meaning to ereport). The hint would go to
> ereport so it would be a real hint: it would go to clients as HINT and
> so on. I think that's way more useful that what is done now. But I now
> see my proposal is also backward incompatible.
>

That was my point.


>
> > I am not against to plpy.ereport function - it can live together with
> rich
> > exceptions class, and users can use what they prefer.
>
> I wasn't suggesting introducing ereport, I was saying the existing
> functions and exceptions are very similar to your proposed ereport.
> Enhancing them to take keyword arguments would make them a bit more
> powerful. Adding ereport would be another way of doing the same thing
> so that's more confusing than useful.
>

ok


>
> All in all it's hard to improve this cleanly. I'm still not sure the
> latest patch is a good idea but now I'm also not sure what I proposed
> is better than leaving it as it is.
>

We can use different names; we don't have to implement all the changes at once.

My main goal is the possibility of raising a rich exception instead of a dirty
workaround. Changing the current functions isn't necessary - although if we
are changing the API, it's better to do it once.

Regards

Pavel


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Tom Lane
Robert Haas  writes:
> On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs  wrote:
>> We can seq scan the array at relcache build time and invalidate relcache
>> when we extend. WAL log any extension to a new segment and write the table
>> to disk at checkpoint.

> Invaliding the relcache when we extend would be extremely expensive,

... and I think it would be too late anyway, if backends are relying on
the relcache to tell the truth.  You can't require an exclusive lock on
a rel just to extend it, which means there cannot be a guarantee that
what a backend has in its relcache will be up to date with current reality.

I really don't like Robert's proposal of a metapage though.  We've got too
darn many forks per relation already.

It strikes me that this discussion is perhaps conflating two different
issues.  Robert seems to be concerned about how we'd detect (not recover
from, just detect) filesystem misfeasance in the form of complete loss
of a non-last segment file.  The other issue is a desire to reduce the
cost of mdnblocks() calls.  It may be worth thinking about those two
things together, but we shouldn't lose sight of these being separate
goals, assuming that anybody besides Robert thinks that the segment
file loss issue is worth worrying about.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-10 Thread Robert Haas
On Sun, Nov 15, 2015 at 4:40 AM, Dean Rasheed  wrote:
> On 14 November 2015 at 20:05, Tom Lane  wrote:
>> I committed this with that change and some other
>> mostly-cosmetic revisions.
>
> Thanks.

This patch, or something nearby, seems to have changed the number of
significant figures produced by log() and maybe some of the other
functions this patch touched.  On master:

rhaas=# select log(.5);
 log
-
 -0.3010299956639812
(1 row)

But on REL9_5_STABLE:

rhaas=# select log(.5);
   log
-
 -0.30102999566398119521
(1 row)

It's certainly not obvious from the commit message that this change
was expected.  There is something about setting the rscales for
intermediate results, but there isn't any indication that the number
of digits in the final result should be expected to differ.  Was that
intentional?  Why did we make the change?  I'm not sure it's bad, but
it seems funny to whack a user-visible behavior around like this
without a clearly-explained reason.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 1:48 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs  wrote:
>>> We can seq scan the array at relcache build time and invalidate relcache
>>> when we extend. WAL log any extension to a new segment and write the table
>>> to disk at checkpoint.
>
>> Invaliding the relcache when we extend would be extremely expensive,
>
> ... and I think it would be too late anyway, if backends are relying on
> the relcache to tell the truth.  You can't require an exclusive lock on
> a rel just to extend it, which means there cannot be a guarantee that
> what a backend has in its relcache will be up to date with current reality.

True.

> I really don't like Robert's proposal of a metapage though.  We've got too
> darn many forks per relation already.

Oh, I wasn't thinking of adding a fork, just repurposing block 0 of
the main fork, as we do for some index types.

> It strikes me that this discussion is perhaps conflating two different
> issues.  Robert seems to be concerned about how we'd detect (not recover
> from, just detect) filesystem misfeasance in the form of complete loss
> of a non-last segment file.  The other issue is a desire to reduce the
> cost of mdnblocks() calls.  It may be worth thinking about those two
> things together, but we shouldn't lose sight of these being separate
> goals, assuming that anybody besides Robert thinks that the segment
> file loss issue is worth worrying about.

Don't get me wrong, I'm not willing to expend *any* extra cycles to
notice a problem here.  But all things being equal, code that notices
broken stuff is better than code that doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-10 Thread Tom Lane
Robert Haas  writes:
> Well, sorta.  To be honest, I think this patch is really ugly.  If we
> were going to do this then, yes, I would want to split the patch into
> two parts along those lines.  But actually I don't really want to do
> it this way at all.  It's not that I don't want the performance
> benefits: I do.  But the current code is really easy to read and
> extremely simple, and this changes it into something that is a heck of
> a lot harder to read and understand.  I'm not sure exactly what to do
> about that, but it seems like a problem.

I haven't read the patch in any detail, but keep in mind that part of the
API contract for ResourceOwners is that ResourceOwnerRememberFoo() cannot
fail.  Period.  So any patch that makes those functions less than trivial
is going to be really suspect from a reliability standpoint.  It would be
advisable for example that hash_any not suddenly become covered by the
"must not fail" requirement.

BTW, I do not think you can get away with the requirement that all-zeroes
isn't a valid resource representation.  It might be okay today but it's
hardly future-proof.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-10 Thread Simon Riggs
On 10 December 2015 at 18:59, Robert Haas  wrote:
> Why did we make the change?  I'm not sure it's bad, but
> it seems funny to whack a user-visible behavior around like this
> without a clearly-explained reason.
>

Surely the title of the post explains?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Andres Freund
Hi,

On 2015-12-10 18:36:32 +0100, Andres Freund wrote:
> On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
> > > The real problem there imo isn't that the copy_relation_data() doesn't
> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
> > > unlogged table specific codepath like heap_create_with_catalog() has.
> > 
> > It looks to me like somewhere we need to do log_smgrcreate(...,
> > INIT_FORKNUM) in the unlogged table case.
> 
> Yes.
> 
> > RelationCreateStorage()
> > skips this for the main forknum of an unlogged table, which seems OK,
> > but there's nothing that even attempts it for the init fork, which
> > does not seem OK.
> 
> We unfortunately can't trivially delegate that work to
> RelationCreateStorage(). E.g. heap_create() documents that only the main
> fork is created :(
> 
> > I guess that logic should happen in
> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).
> 
> Looks like it's the easiest place.

> > > A second problem is that the smgrimmedsync() in copy_relation_data()
> > > isn't called for the init fork of unlogged relations, even if it needs
> > > to.

Here's a patch doing that. It's not yet fully polished, but I wanted to
get it out, because I noticed one thing:

In ATExecSetTableSpace(), for !main forks, we currently call
smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT
relations. That seems a bit odd to me. It currently seems to be without
further consequence because, if there's actual data in the fork, we'll
just create the relation in _mdfd_getseg(); or we can cope with the
relation not being there.  But to me that feels wrong.

It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT,
not just INIT_FORKNUM. What do you guys think?

> It sounds worthwhile to check that other locations rewriting tables,
> e.g. cluster/vacuum full/reindex are safe.

Seems to be ok, on a first glance.

Andres
From a823b8aa2a24e61786ce924a3b237d846063f5fb Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 10 Dec 2015 20:14:39 +0100
Subject: [PATCH] WIP: Fix ALTER TABLE ... SET TABLESPACE for unlogged
 relations.

XXX: ATST over streaming rep didn't create the init fork on the other
side. Fix that. Also fix syncing of the init fork of unlogged relations
on the primary. Additionally always WAL log file creation for !main
forks of persistent relations.

Symptom:
ERROR:  58P01: could not open file "pg_tblspc/...": No such file or directory
on a promoted standby.

Reported-By: Michael Paquier
Author: Michael Paquier and Andres Freund
Backpatch: 9.1, where unlogged relations were introduced
---
 src/backend/commands/tablecmds.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0ddde72..6df4939 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "catalog/pg_type_fn.h"
 #include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
 #include "commands/comment.h"
@@ -9659,6 +9660,15 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 		if (smgrexists(rel->rd_smgr, forkNum))
 		{
 			smgrcreate(dstrel, forkNum, false);
+
+			/*
+			 * WAL log creation if the relation is persistent, or this is the
+			 * init fork of an unlogged relation.
+			 */
+			if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+ forkNum == INIT_FORKNUM))
+log_smgrcreate(&newrnode, forkNum);
 			copy_relation_data(rel->rd_smgr, dstrel, forkNum,
 			   rel->rd_rel->relpersistence);
 		}
@@ -9878,6 +9888,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 	char	   *buf;
 	Page		page;
 	bool		use_wal;
+	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
 
@@ -9891,10 +9902,19 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 	page = (Page) buf;
 
 	/*
+	 * The init fork for an unlogged relation in many respects has to be
+	 * treated the same as normal relation, changes need to be WAL logged and
+	 * it needs to be synced to disk.
+	 */
+	copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+		forkNum == INIT_FORKNUM;
+
+	/*
 	 * We need to log the copied data in WAL iff WAL archiving/streaming is
 	 * enabled AND it's a permanent relation.
 	 */
-	use_wal = XLogIsNeeded() && relpersistence == RELPERSISTENCE_PERMANENT;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
 	nblocks = smgrnblocks(src, forkNum);
 
@@ -9949,7 +9969,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 	 * wouldn't replay our earlier WAL entries. If we do not fsync those pages
 	 * here, they might still not be on disk when the crash occurs.
 	 */
-	if (relpersistence == RELPERSISTENCE_PERMANENT)
+	if (relpe

Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 2:25 PM, Simon Riggs  wrote:
> On 10 December 2015 at 18:59, Robert Haas  wrote:
>> Why did we make the change?  I'm not sure it's bad, but
>>
>> it seems funny to whack a user-visible behavior around like this
>> without a clearly-explained reason.
>
> Surely the title of the post explains?

Nope.  I'm not asking why we fixed the inaccurate results.  I'm asking
why we now produce fewer output digits than before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 2:11 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Well, sorta.  To be honest, I think this patch is really ugly.  If we
>> were going to do this then, yes, I would want to split the patch into
>> two parts along those lines.  But actually I don't really want to do
>> it this way at all.  It's not that I don't want the performance
>> benefits: I do.  But the current code is really easy to read and
>> extremely simple, and this changes it into something that is a heck of
>> a lot harder to read and understand.  I'm not sure exactly what to do
>> about that, but it seems like a problem.
>
> I haven't read the patch in any detail, but keep in mind that part of the
> API contract for ResourceOwners is that ResourceOwnerRememberFoo() cannot
> fail.  Period.  So any patch that makes those functions less than trivial
> is going to be really suspect from a reliability standpoint.  It would be
> advisable for example that hash_any not suddenly become covered by the
> "must not fail" requirement.

Hmm.  I hadn't thought about that.

I wonder if there's a simpler approach to this whole problem.  What
the patch aims to do is allow resources to be freed in arbitrary order
without slowdown.  But do we really need that?  Resource owners are
currently optimized for freeing resources in the order opposite to the
order of allocation.  I bet in this case the order in which they are
freed isn't random, but is exactly in order of allocation.  If so, we
might be able to either (a) jigger things so that freeing in order of
allocation is efficient or (b) jigger things so that the resources are
released in reverse order of allocation as the resource owner code
expects.  That might be simpler and less invasive than this fix.

Then again, it's somehow appealing for higher-level code not to have
to care about this...
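
As a reminder of why the order matters, the existing forget functions follow
roughly the pattern below (a simplified sketch with an invented name, not the
actual resowner.c code).  Scanning back-to-front makes release in reverse
allocation order cheap, but release in allocation order degrades to O(n^2)
overall:

static void
forget_resource(Datum *items, int *nitems, Datum value)
{
	int			i;

	for (i = *nitems - 1; i >= 0; i--)	/* cheap if value is near the end */
	{
		if (items[i] == value)
		{
			while (i < *nitems - 1)		/* shift the tail down over the hole */
			{
				items[i] = items[i + 1];
				i++;
			}
			(*nitems)--;
			return;
		}
	}
	elog(ERROR, "resource %lu not found", (unsigned long) value);
}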

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2015-12-10 Thread Simon Riggs
On 9 December 2015 at 18:44, Stas Kelvich  wrote:


> In this patch I’ve changed this procedures to following:
> * on prepare backend writes data only to xlog and store pointer to the
> start of the xlog record
> * if commit occurs before checkpoint then backend reads data from xlog by
> this pointer
> * on checkpoint 2pc data copied to files and fsynced
> * if commit happens after checkpoint then backend reads files
> * in case of crash replay will move data from xlog to files (as it was
> before patch)
>

This looks sound to me.

I think we could do better still, but this looks like the easiest 80% and
actually removes code.

The lack of substantial comments on the patch is a problem though - the
details above should go in the patch. I'll have a go at reworking this for
you, this time.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-10 Thread Tom Lane
Robert Haas  writes:
> This patch, or something nearby, seems to have changed the number of
> significant figures produced by log() and maybe some of the other
> functions this patch touched.

Yeah, not surprising.

> It's certainly not obvious from the commit message that this change
> was expected.

That's on me as author of the commit message, I guess.  The rscale
in most of these functions is exactly the number of fraction digits
that will be emitted, and we changed the rules for computing it.
Not by much, in most cases.  I don't think we should be too worried
about being bug-compatible with the old behavior.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bootstrap DATA is a pita

2015-12-10 Thread Caleb Welton
Hello Hackers,

  Reviving an old thread on simplifying the bootstrap process.

  I'm a developer from the GPDB / HAWQ side of the world where we did some
work a while back to enable catalog definition via SQL files and we have
found it valuable from a dev perspective.  The mechanism currently in those
products is a bit.. convoluted where SQL is processed in perl to create the
existing DATA statements, which are then processed as they are today in
Postgres... I wouldn't suggest this route, but having worked with both the
DATA mechanism and the SQL based one I've certainly found SQL to be a more
convenient way of interacting with the catalog.

  I'd propose:
 - Keep enough of the existing bootstrap mechanism functional to get a
small tidy core: essentially you need enough of pg_type, pg_proc, pg_class,
and pg_attribute to support the 25 types used by catalog tables, and most
everything else can be moved into SQL processing, like how system_views.sql
is handled today.

  The above was largely proposed back in March and rejected based on
concerns that

  1. initdb would be slower.
  2. It would introduce too much special purpose bootstrap cruft into the
code.
  3. Editing SQL commands is not comfortable in bulk

On 1.

I have a prototype that handles about 1000 functions (all the functions in
pg_proc.h that are not used by other catalog tables, e.g. pg_type,
pg_language, pg_range, pg_aggregate, window functions, pg_ts_parser, etc).

All of initdb can be processed in 1.53s. This compares to 1.37s with the
current bootstrap approach.  So yes, this is slower, but not 'noticeably
slower' - I certainly didn't notice the 0.16s until I saw the concern and
then timed it.

On 2.

So far the amount of cruft has been:
  - Enabling adding functions with specific OIDs when creating functions.
1-line changes in pg_aggregate.c, proclang.c, typecmds.c
about a dozen lines of code in functioncmds.c
3 lines changed in pg_proc.c
  - Update the fmgr_internal_validator for builtin functions while the
catalog is mutable
3 lines changed in pg_proc.c
  - Update how the builtin function cache is built
Some significant work in fmgr.c that honestly still needs cleanup
before it would be ready to propose as a patch that would be worthy of
committing.
  - Update how builtin functions are resolved outside of bootstrap
Minor updates to dynloader for lookup of symbols within the current
executable; so far I've only done darwin.c for my prototype, and this would
need to be extended to the other ports.
  - Initialization of the builtin cache
2 line change in postinit.c
  - Addition of a stage in initdb to process the sql directives similar in
scope to the processing of system_views.sql.

No changes needed in the parser, planner, etc.  My assessment is that this
worry is not a major concern in practice with the right implementation.

On 3.

Having worked with both SQL and bki DATA directives I have personally found
the convenience of SQL outweighs the pain.  In many cases changes, such as
adding a new column to pg_proc, have minimal impact on the SQL
representation and what changes are needed are often simple to implement.
E.g. accounting for COST only needs to be done for the functions that need
something other than the default value.  This however is somewhat
subjective.

On the Pros side:

  a. Debugging bootstrap is extremely painful; debugging once initdb has
gotten to 'postgres --single' is way easier.

  b. It is easier to introduce minor issues with DATA directives than it is
when using the SQL processing used for all other user objects.

   Example: currently in Postgres all builtin functions default to COST 1,
and all SQL functions default to COST 100. However, the following SQL
functions included in bootstrap are inexplicably initialized with a COST of
1 (a quick catalog query for spotting such cases is sketched after the
list):
   age(timestamp with time zone)
   age(timestamp without time zone)
   bit_length(bytea)
   bit_length(text)
   bit_length(bit)
   date_part(text, abstime)
   date_part(text, reltime)
   date_part(text, date)
   ... and 26 other examples
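
For what it's worth, a catalog query along these lines (illustrative only;
16384 is the usual first "normal" object OID) lists such cases:

  SELECT p.oid::regprocedure, p.procost
  FROM pg_proc p
  JOIN pg_language l ON l.oid = p.prolang
  WHERE l.lanname = 'sql'
    AND p.oid < 16384
    AND p.procost <> 100;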

  c. SQL files are significantly less of a PITA (subjective opinion, but I
can say this from a perspective of experience working with both DATA
directives and SQL driven catalog definition).

If people have interest I can share my patch so far if that helps address
concerns, but if there is not interest then I'll probably leave my
prototype where it is rather than investing more effort in the proof of
concept.

Thanks,
  Caleb


On Sat, Mar 7, 2015 at 5:20 PM,  wrote:

> Date: Sat, 7 Mar 2015 23:46:54 +0100
> From: Andres Freund 
> To: Jim Nasby 
> Cc: Stephen Frost , Robert Haas  >,
> Tom Lane , Peter Eisentraut ,
> Josh Berkus ,
> "pgsql-hackers@postgresql.org" 
> Subject: Re: Bootstrap DATA is a pita
> Message-ID: <20150307224654.gc12...@awork2.anarazel.de>
>
> On 2015-03-07 16:43:15 -0600, Jim Nasby wrote:
> > Semi-related... if we put some special handling in some places for
> bootstrap
> > mode, c

Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 2:47 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> This patch, or something nearby, seems to have changed the number of
>> significant figures produced by log() and maybe some of the other
>> functions this patch touched.
>
> Yeah, not surprising.
>
>> It's certainly not obvious from the commit message that this change
>> was expected.
>
> That's on me as author of the commit message, I guess.  The rscale
> in most of these functions is exactly the number of fraction digits
> that will be emitted, and we changed the rules for computing it.
> Not by much, in most cases.  I don't think we should be too worried
> about being bug-compatible with the old behavior.

It seems to be a loss of 4 digits in every case I've seen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-10 Thread Tom Lane
Robert Haas  writes:
> On Thu, Dec 10, 2015 at 2:47 PM, Tom Lane  wrote:
>> That's on me as author of the commit message, I guess.  The rscale
>> in most of these functions is exactly the number of fraction digits
>> that will be emitted, and we changed the rules for computing it.
>> Not by much, in most cases.  I don't think we should be too worried
>> about being bug-compatible with the old behavior.

> It seems to be a loss of 4 digits in every case I've seen.

I wouldn't have a problem with, say, throwing in an extra DEC_DIGITS worth
of rscale in each of these functions so that the discrepancies tend to
favor more significant digits out, rather than fewer.  I don't know that
it's worth trying to guarantee that the result is never fewer digits than
before, and I certainly wouldn't want to make the rules a lot more complex
than what's there now.  But perhaps we could cover most cases easily.

Dean, do you want to recheck the patch with an eye to that?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-12-10 Thread Alvaro Herrera
Noah Misch wrote:
> On Mon, Dec 07, 2015 at 02:34:39PM +0900, Michael Paquier wrote:

> > I don't directly see any limitation with the use of kill on Windows..
> > http://perldoc.perl.org/functions/kill.html
> > But indeed using directly pg_ctl kill seems like a better fit for the
> > PG infrastructure.
> 
> From http://perldoc.perl.org/perlwin32.html, "Using signals under this port
> should currently be considered unsupported."  Windows applications cannot
> handle SIGQUIT: https://msdn.microsoft.com/en-us/library/xdkz3x12.aspx.  The
> PostgreSQL backend does not generate or expect Windows signals; see its
> signal.c emulation facility.

Makes sense.  What we're doing now is what you suggested, so we should
be fine.

> > > Postmaster log file names became less informative.  Before the commit:
> > > Should nodes get a name, so we instead see master_57834.log?
> > 
> > I am not sure that this is necessary.
> 
> In general, you'd need to cross-reference the main log file to determine which
> postmaster log corresponds to which action within the test.  I did plenty of
> "grep $PATTERN src/bin/pg_rewind/tmp_check/log/master.log" while debugging
> that test.  I'd like to be able to use /*master*.log, not rely on timestamps
> or on scraping regress_log_002_databases to determine which logs are master
> logs.  Feel free to skip this point if I'm the only one minding, though.

Since we now have the node name in the log file name, perhaps we no
longer need the port number in there.  In fact, I find having the file
name change on every run (based on the port number) is slightly
annoying.  I vote we change it back to using the node name without the
port number.  (Also, some PostgresNode messages refer to the instance by
datadir and port number, which is unnecessary: it would be clearer to
use the name instead.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-10 Thread Andres Freund
Hi,

I recently started a pgbench benchmark (to evaluate a piece of hardware,
not postgres) with master. Unfortunately, by accident, I started
postgres in a shell, not screen like pgbench.

Just logged back in and saw:
client 71 aborted in state 8: ERROR:  database is not accepting commands to 
avoid wraparound data loss in database "postgres"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 97
number of threads: 97
duration: 30 s
number of transactions actually processed: 2566862424
latency average: 3.214 ms
latency stddev: 7.336 ms
tps = 30169.374133 (including connections establishing)
tps = 30169.378406 (excluding connections establishing)

Hm. Bad news. We apparently didn't keep up vacuuming. But worse news is
that even now, days later, autovacuum hasn't progressed:
postgres=# select txid_current();
ERROR:  database is not accepting commands to avoid wraparound data loss in 
database "postgres"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

Looking at datfrozenxid:
postgres=# select datname, datfrozenxid, age(datfrozenxid) FROM pg_database ;
  datname  | datfrozenxid |age
---+--+---
 template1 |   3357685367 | 0
 template0 |   3357685367 | 0
 postgres  |   3159867733 | 197817634
(3 rows)
reveals that the launcher doesn't do squat because it doesn't think it
needs to do anything.

(gdb) p *ShmemVariableCache
$3 = {nextOid = 24576, oidCount = 0, nextXid = 3357685367, oldestXid = 
1211201715, xidVacLimit = 1411201715, xidWarnLimit = 3347685362, 
  xidStopLimit = 3357685362, xidWrapLimit = 3358685362, oldestXidDB = 12380, 
oldestCommitTs = 0, newestCommitTs = 0, 
  latestCompletedXid = 3357685366}

'oldestXid' shows the problem: We're indeed pretty short before a
wraparound.


The question is, how did we get here? My current working theory, not
having any logs available, is that two autovacuum workers ran at the
same time. Both concurrently entered vac_update_datfrozenxid(). As both
haven't committed at that time, they can't see each other's updates to
datfrozenxid. And thus vac_truncate_clog(), called by both, won't see a
changed horizon.

Does that make sense?

If so, what can we do about it? After chatting a bit with Alvaro  I can
see two avenues:
1) Hold a self-conflicting lock on pg_database in vac_truncate_clog(),
   and don't release the lock until the transaction end. As the
   pg_database scan uses a fresh snapshot, that ought to guarantee
   progress.
2) Do something like vac_truncate_clog() in the autovacuum launcher,
   once every idle cycle or so. That'd then unwedge us.

Neither of these sound particularly pretty.


Additionally, something else has to be going on here - why on earth
wasn't an autovacuum started earlier? The above kinda looks like the
vacuums on template* ran at a very similar time, and only pretty
recently.
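
A query along these lines (illustrative, not from the original session)
shows whether any autovacuum workers are currently active and what they
are doing:

  SELECT pid, datname, xact_start, query
  FROM pg_stat_activity
  WHERE query LIKE 'autovacuum:%'
  ORDER BY xact_start;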


I left the cluster hanging in its stuck state for now, so we have a
chance to continue investigating.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2015-12-10 Thread Robert Haas
On Wed, Dec 2, 2015 at 6:45 AM, Ashutosh Bapat
 wrote:
> It's been a long time since last patch on this thread was posted. I have
> started
> to work on supporting join pushdown for postgres_fdw. Attached please find
> three
> patches
> 1. pg_fdw_core.patch: changes in core related to user mapping handling, GUC
>   enable_foreignjoin
> 2. pg_fdw_join.patch: postgres_fdw changes for supporting join pushdown

It seems useful to break things up this way.  However, I'm not sure we
want an enable_foreignjoin GUC; in fact, I think we probably don't.
If we want to have a way to control this, postgres_fdw can provide a
custom GUC or FDW option for that.

And to be honest, I haven't really been able to understand why join
pushdown needs changes to user mapping handling.  Just hypothetically
speaking, if I put my foot down and said we're not committing any of
that stuff, how and why would that impact our ability to have join
pushdown in 9.6?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-10 Thread Robert Haas
On Thu, Dec 10, 2015 at 4:55 PM, Andres Freund  wrote:
> Hi,
>
> I recently started a pgbench benchmark (to evaluate a piece of hardware,
> not postgres) with master. Unfortunately, by accident, I started
> postgres in a shell, not screen like pgbench.
>
> Just logged back in and saw:
> client 71 aborted in state 8: ERROR:  database is not accepting commands to 
> avoid wraparound data loss in database "postgres"
> HINT:  Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions.
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 97
> number of threads: 97
> duration: 30 s
> number of transactions actually processed: 2566862424
> latency average: 3.214 ms
> latency stddev: 7.336 ms
> tps = 30169.374133 (including connections establishing)
> tps = 30169.378406 (excluding connections establishing)
>
> Hm. Bad news. We apparently didn't keep up vacuuming. But worse news is
> that even now, days later, autovacuum hasn't progressed:
> postgres=# select txid_current();
> ERROR:  database is not accepting commands to avoid wraparound data loss in 
> database "postgres"
> HINT:  Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions.
>
> Looking at datfrozenxid:
> postgres=# select datname, datfrozenxid, age(datfrozenxid) FROM pg_database ;
>   datname  | datfrozenxid |age
> ---+--+---
>  template1 |   3357685367 | 0
>  template0 |   3357685367 | 0
>  postgres  |   3159867733 | 197817634
> (3 rows)
> reveals that the launcher doesn't do squat because it doesn't think it
> needs to do anything.
>
> (gdb) p *ShmemVariableCache
> $3 = {nextOid = 24576, oidCount = 0, nextXid = 3357685367, oldestXid = 
> 1211201715, xidVacLimit = 1411201715, xidWarnLimit = 3347685362,
>   xidStopLimit = 3357685362, xidWrapLimit = 3358685362, oldestXidDB = 12380, 
> oldestCommitTs = 0, newestCommitTs = 0,
>   latestCompletedXid = 3357685366}
>
> 'oldestXid' shows the problem: We're indeed pretty short before a
> wraparound.

Is the postmaster in a "stopped" state (T)?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-10 Thread Andres Freund
On 2015-12-10 17:13:12 -0500, Robert Haas wrote:
> Is the postmaster in a "stopped" state (T)?

No, running normally. New connections are possible without problems.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()

2015-12-10 Thread Greg Stark
On Thu, Dec 10, 2015 at 7:09 PM, Robert Haas  wrote:
>> I really don't like Robert's proposal of a metapage though.  We've got too
>> darn many forks per relation already.
>
> Oh, I wasn't thinking of adding a fork, just repurposing block 0 of
> the main fork, as we do for some index types.

Not to mention that it should probably be done for all forks. In
practice none of the other forks are going to go over a gigabyte these
days but having different extension logic depending on the fork would
be a pain and only limit future uses of forks.

It doesn't seem like an easy change to make work with binary upgrades
though. I suppose you could just support the metapage being absent in
perpetuity but then it's hard to test and would make the logic in md
even more complex than now rather than simpler.


Incidentally, for comparison Oracle has a header page on every data
file with various information. Data files are added like we add
segments to tables but they are actually pieces of a tablespace rather
than a table. The most significant piece of metadata that they store
in this header block is actually the commit sequence number --
effectively equivalent to our LSN -- that the file is consistent with.
That lets you see how far back you need to start recovery and you can
narrow down which data files actually need recovery.




-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fwd: [GENERAL] pgxs/config/missing is... missing

2015-12-10 Thread Jim Nasby

Full story below, but in short:


I see that there is a config/missing script in source, and that 
Makefile.global.in references it:


> src/Makefile.global.in:missing= $(SHELL) 
$(top_srcdir)/config/missing


AFAICT the problem is that missing wasn't included in install or 
uninstall in config/Makefile. Attached patch fixes that, and results in 
missing being properly installed in lib/pgxs/config.



 Forwarded Message 
Subject: [GENERAL] pgxs/config/missing is... missing
Date: Wed, 28 Oct 2015 12:54:54 -0500
From: Jim Nasby 
To: pgsql-general 
CC: David E. Wheeler 

I'm trying to install pgTAP on a FreeBSD machine and running into an odd
problem:


sed -e 's,MODULE_PATHNAME,$libdir/pgtap,g' -e 's,__OS__,freebsd,g' -e 
's,__VERSION__,0.95,g' sql/pgtap-core.sql > sql/pgtap-core.tmp
/bin/sh /usr/local/lib/postgresql/pgxs/src/makefiles/../../config/missing perl 
compat/gencore 0 sql/pgtap-core.tmp > sql/pgtap-core.sql
cannot open /usr/local/lib/postgresql/pgxs/src/makefiles/../../config/missing: 
No such file or directory
Makefile:124: recipe for target 'sql/pgtap-core.sql' failed


That particular recipe is


sql/pgtap-core.sql: sql/pgtap.sql.in
	cp $< $@
	sed -e 's,sql/pgtap,sql/pgtap-core,g' compat/install-8.4.patch | patch -p0
	sed -e 's,sql/pgtap,sql/pgtap-core,g' compat/install-8.3.patch | patch -p0
	sed -e 's,MODULE_PATHNAME,$$libdir/pgtap,g' -e 's,__OS__,$(OSNAME),g' -e 's,__VERSION__,$(NUMVERSION),g' sql/pgtap-core.sql > sql/pgtap-core.tmp
	$(PERL) compat/gencore 0 sql/pgtap-core.tmp > sql/pgtap-core.sql
	rm sql/pgtap-core.tmp


and it's the $(PERL) that's failing. If I add this recipe

print-%  : ; @echo $* = $($*)

and do

gmake print-PERL

I indeed get

PERL = /bin/sh
/usr/local/lib/postgresql/pgxs/src/makefiles/../../config/missing perl

even after explicitly exporting PERL=/usr/local/bin/perl

I see that there is a config/missing script in source, and that
Makefile.global.in references it:


src/Makefile.global.in:missing= $(SHELL) 
$(top_srcdir)/config/missing


Any ideas why it's not being installed, or why the PGXS Makefile is
ignoring/over-riding $PERL?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-general mailing list (pgsql-gene...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


diff --git a/config/Makefile b/config/Makefile
index da12838..67e7998 100644
--- a/config/Makefile
+++ b/config/Makefile
@@ -7,9 +7,11 @@ include $(top_builddir)/src/Makefile.global
 
 install: all installdirs
$(INSTALL_SCRIPT) $(srcdir)/install-sh 
'$(DESTDIR)$(pgxsdir)/config/install-sh'
+   $(INSTALL_SCRIPT) $(srcdir)/missing 
'$(DESTDIR)$(pgxsdir)/config/missing'
 
 installdirs:
$(MKDIR_P) '$(DESTDIR)$(pgxsdir)/config'
 
 uninstall:
rm -f '$(DESTDIR)$(pgxsdir)/config/install-sh'
+   rm -f '$(DESTDIR)$(pgxsdir)/config/missing'

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] array_remove(anyarray, anyarray)

2015-12-10 Thread Jim Nasby
Recently I had a need to remove occurrences of a number of values from 
an array. Obviously I could have nested array_remove() calls or wrapped 
the whole thing in a SELECT unnest(), but that seems rather silly and 
inefficient.
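
For illustration, the two workarounds look roughly like this:

  -- nested calls, one per value to strip:
  SELECT array_remove(array_remove(ARRAY[1,2,3,2,4], 2), 4);

  -- or round-tripping through unnest():
  SELECT array_agg(x)
  FROM unnest(ARRAY[1,2,3,2,4]) AS x
  WHERE x NOT IN (2, 4);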


Anyone have objections to changing array_replace_internal() to make the 
search and replace arguments arrays instead of Datums?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Fri, Dec 11, 2015 at 1:57 AM, Robert Haas  wrote:
> On Mon, Nov 30, 2015 at 9:53 PM, Michael Paquier
>  wrote:
>>> I feel quite uncomfortable that it solves the problem from a kind
>>> of nature of unlogged object by arbitrary flagging which is not
>>> fully corresponds to the nature. If we can deduce the necessity
>>> of fsync from some nature, it would be preferable.
>>
>> INIT_FORKNUM is not something only related to unlogged relations,
>> indexes use them as well.
>
> Eh, what?
>
> Indexes use them if they are indexes on unlogged tables, but they'd
> better not use them in any other situation.  Otherwise bad things are
> going to happen.

Yes, this was badly formulated, and caused by my lack of knowledge of
unlogged tables; I think I got it now :) Why don't we actually put
some asserts in those code paths to check that INIT_FORKNUM-specific
code is only used for unlogged relations? Just a thought...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Amit Langote
On 2015/12/10 20:46, Michael Paquier wrote:
> On Thu, Dec 10, 2015 at 7:23 PM, Amit Langote
>  wrote:
>> AIUI, the counts published via stats collector are updated asynchronously
>> w.r.t. operations they count and mostly as aggregate figures. For example,
>> PgStat_StatTabEntry.blocks_fetched. IOW, we never see
>> pg_statio_all_tables.heap_blks_read updating as a scan reads blocks. Maybe
>> that helps keep traffic to pgstat collector to sane levels. But that is
>> not to mean that I think controlling stats collector levels was the only
>> design consideration behind how such counters are published.
>>
>> In case of reporting counters as progress info, it seems we might have to
>> send too many PgStat_Msg's, for example, for every block we finish
>> processing during vacuum. That kind of message traffic may swamp the
>> collector. Then we need to see the updated counters from other counters in
>> near real-time though that may be possible with suitable (build?)
>> configuration.
> 
> As far as I understand it, the basic reason why this patch exists is
> to allow a DBA to have a hint of the progress of a VACUUM that may be
> taking minutes, or say hours, which is something we don't have now. So
> it seems perfectly fine to me to report this information
> asynchronously with a bit of lag. Why would we need so much precision
> in the report?

Sorry, I didn't mean to overstate this requirement. I agree precise
real-time reporting of progress info is not such a stringent requirement
from the patch. The point regarding whether we should storm the collector
with progress info messages still holds, IMHO.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Add IS (NOT) DISTINCT to subquery_Op

2015-12-10 Thread Jim Nasby
Is there any reason we couldn't/shouldn't support IS DISTINCT in 
subquery_Op? (Or really, just add support to ANY()/ALL()/(SELECT ...)?)
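
For illustration (assuming hypothetical tables t and s), the first
statement below is the kind of syntax being asked about and is not
accepted today; the second is roughly how it has to be spelled at the
moment with the same NULL handling:

  -- not valid today:
  --   SELECT * FROM t WHERE t.x IS DISTINCT FROM ALL (SELECT s.y FROM s);

  -- current spelling:
  SELECT *
  FROM t
  WHERE NOT EXISTS (SELECT 1 FROM s WHERE s.y IS NOT DISTINCT FROM t.x);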

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-10 Thread Tom Lane
I wrote:
> Ah-hah --- the new check I added in join_is_legal understood about
> chains of LATERAL references, but it forgot that we could also have chains
> of outer-join ordering constraints.  When we're looking to see if joining
> on the basis of a LATERAL reference would break some later outer join, we
> have to look at outer joins to the outer joins' inner relations, too.

> Fixed in the attached.  I also stuck all of join_is_legal's
> lateral-related checks inside an "if (root->hasLateralRTEs)" block,
> which will save some time in typical queries with no LATERAL.  That
> makes that section of the patch a bit bigger than before, but it's
> mostly just reindentation.

As threatened, here's a patch on top of that that gets rid of
LateralJoinInfo.  I'm pretty happy with this, although I have an
itchy feeling that we could dispense with the lateral_vars lists too.

regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 26264cb..ba04b72 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*** _copySpecialJoinInfo(const SpecialJoinIn
*** 2067,2086 
  }
  
  /*
-  * _copyLateralJoinInfo
-  */
- static LateralJoinInfo *
- _copyLateralJoinInfo(const LateralJoinInfo *from)
- {
- 	LateralJoinInfo *newnode = makeNode(LateralJoinInfo);
- 
- 	COPY_BITMAPSET_FIELD(lateral_lhs);
- 	COPY_BITMAPSET_FIELD(lateral_rhs);
- 
- 	return newnode;
- }
- 
- /*
   * _copyAppendRelInfo
   */
  static AppendRelInfo *
--- 2067,2072 
*** copyObject(const void *from)
*** 4519,4527 
  		case T_SpecialJoinInfo:
  			retval = _copySpecialJoinInfo(from);
  			break;
- 		case T_LateralJoinInfo:
- 			retval = _copyLateralJoinInfo(from);
- 			break;
  		case T_AppendRelInfo:
  			retval = _copyAppendRelInfo(from);
  			break;
--- 4505,4510 
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index aa6e102..356fcaf 100644
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
*** _equalSpecialJoinInfo(const SpecialJoinI
*** 846,860 
  }
  
  static bool
- _equalLateralJoinInfo(const LateralJoinInfo *a, const LateralJoinInfo *b)
- {
- 	COMPARE_BITMAPSET_FIELD(lateral_lhs);
- 	COMPARE_BITMAPSET_FIELD(lateral_rhs);
- 
- 	return true;
- }
- 
- static bool
  _equalAppendRelInfo(const AppendRelInfo *a, const AppendRelInfo *b)
  {
  	COMPARE_SCALAR_FIELD(parent_relid);
--- 846,851 
*** equal(const void *a, const void *b)
*** 2860,2868 
  		case T_SpecialJoinInfo:
  			retval = _equalSpecialJoinInfo(a, b);
  			break;
- 		case T_LateralJoinInfo:
- 			retval = _equalLateralJoinInfo(a, b);
- 			break;
  		case T_AppendRelInfo:
  			retval = _equalAppendRelInfo(a, b);
  			break;
--- 2851,2856 
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f07c793..63fae82 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outPlannerInfo(StringInfo str, const Pl
*** 1847,1853 
  	WRITE_NODE_FIELD(right_join_clauses);
  	WRITE_NODE_FIELD(full_join_clauses);
  	WRITE_NODE_FIELD(join_info_list);
- 	WRITE_NODE_FIELD(lateral_info_list);
  	WRITE_NODE_FIELD(append_rel_list);
  	WRITE_NODE_FIELD(rowMarks);
  	WRITE_NODE_FIELD(placeholder_list);
--- 1847,1852 
*** _outRelOptInfo(StringInfo str, const Rel
*** 1892,1897 
--- 1891,1897 
  	WRITE_NODE_FIELD(cheapest_total_path);
  	WRITE_NODE_FIELD(cheapest_unique_path);
  	WRITE_NODE_FIELD(cheapest_parameterized_paths);
+ 	WRITE_BITMAPSET_FIELD(direct_lateral_relids);
  	WRITE_BITMAPSET_FIELD(lateral_relids);
  	WRITE_UINT_FIELD(relid);
  	WRITE_OID_FIELD(reltablespace);
*** _outSpecialJoinInfo(StringInfo str, cons
*** 2057,2071 
  }
  
  static void
- _outLateralJoinInfo(StringInfo str, const LateralJoinInfo *node)
- {
- 	WRITE_NODE_TYPE("LATERALJOININFO");
- 
- 	WRITE_BITMAPSET_FIELD(lateral_lhs);
- 	WRITE_BITMAPSET_FIELD(lateral_rhs);
- }
- 
- static void
  _outAppendRelInfo(StringInfo str, const AppendRelInfo *node)
  {
  	WRITE_NODE_TYPE("APPENDRELINFO");
--- 2057,2062 
*** _outNode(StringInfo str, const void *obj
*** 3355,3363 
  			case T_SpecialJoinInfo:
  _outSpecialJoinInfo(str, obj);
  break;
- 			case T_LateralJoinInfo:
- _outLateralJoinInfo(str, obj);
- break;
  			case T_AppendRelInfo:
  _outAppendRelInfo(str, obj);
  break;
--- 3346,3351 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index e57b8e2..2ab650c 100644
*** a/src/backend/optimizer/path/joinrels.c
--- b/src/backend/optimizer/path/joinrels.c
*** join_search_one_level(PlannerInfo *root,
*** 231,237 
  		 */
  		if (joinrels[level] == NIL &&
  			root->join_info_list == NIL &&
! 			root->lateral_info_list == NIL)
  			elog(ERROR, "failed to build any %d-way joins", level);
  	}
  }
--- 231,237 
  		 */
  		if (joinr

Re: [HACKERS] Add IS (NOT) DISTINCT to subquery_Op

2015-12-10 Thread Tom Lane
Jim Nasby  writes:
> Is there any reason we couldn't/shouldn't support IS DISTINCT in 
> subquery_Op? (Or really, just add support to ANY()/ALL()/(SELECT ...)?)

It's not an operator (in the sense of something with a pg_operator OID),
which means this would be quite a bit less than trivial as far as internal
representation/implementation goes.  I'm not sure if there would be
grammar issues, either.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: [GENERAL] pgxs/config/missing is... missing

2015-12-10 Thread Tom Lane
Jim Nasby  writes:
> AFAICT the problem is that missing wasn't included in install or 
> uninstall in config/Makefile. Attached patch fixes that, and results in 
> missing being properly installed in lib/pgxs/config.

I thought we'd more or less rejected that approach in the previous thread.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to install config/missing

2015-12-10 Thread Tom Lane
Alvaro Herrera  writes:
> Peter Eisentraut wrote:
>> On 11/11/15 12:34 PM, Tom Lane wrote:
>>> I was thinking more of removing the "missing" script and associated logic
>>> entirely, rather than making PGXS a special case.

>> Well, about a year ago people were arguing for the opposite change in
>> the documentation build.  It used to default all the build tool
>> variables to programs that weren't there, and people got all confused
>> about that, so we stuck "missing" in there across the board.

> Ah, right :-(  It's obviously difficult to arrange a compromise that
> pleases everyone here.  I think it's fair to keep "missing" for the doc
> build and remove it from Perl/bison/flex, regardless of pgxs; extensions
> cannot build doc files anyway.

I took a closer look at the originally proposed patch at
http://www.postgresql.org/message-id/5633ba23.3030...@bluetreble.com
and realized that my worry about it was based on a misconception:
I thought it'd get installed into someplace where it might interfere
with other packages.  But actually, it would get installed into the
same place we put install-sh, namely
$PREFIX/lib/postgresql/pgxs/config/
(omitting postgresql/ if PREFIX already contains something PG-specific).
So the argument that it could break anybody else seems pretty thin.

Given our inability to come to a consensus on rejiggering the uses of
"missing", I think maybe we should just apply the original patch and
call it good.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-10 Thread Tom Lane
I wrote:
> As threatened, here's a patch on top of that that gets rid of
> LateralJoinInfo.  I'm pretty happy with this, although I have an
> itchy feeling that we could dispense with the lateral_vars lists too.

I experimented with that and figured out what was bothering me: there is
no need for add_placeholders_to_base_rels to fool around with adding items
to lateral_vars anymore, because the code in create_lateral_join_info
that adds placeholders' ph_lateral bits to the evaluation locations'
lateral_relids sets now does everything that we needed to have happen.
We don't care exactly what's inside the PHV expression, only which rels
it comes from, and the ph_lateral bitmap is sufficient for that.

We still need the lateral_vars field in RelOptInfo, but it now exists
purely to carry the results of extract_lateral_references forward to
create_lateral_join_info.  (We wouldn't need even that so far as Vars are
concerned, because we could just add their source rel to the referencing
rel's lateral_relids on-the-fly in extract_lateral_references.  But we
can't handle PHVs that way, because at that stage we don't know precisely
where they'll be evaluated, so we don't know what to add to the
referencer's lateral_relids.)

Updated patch attached; compared to the previous one, this just removes a
big chunk of code in add_placeholders_to_base_rels and updates a couple of
related comments.

regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 26264cb..ba04b72 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*** _copySpecialJoinInfo(const SpecialJoinIn
*** 2067,2086 
  }
  
  /*
-  * _copyLateralJoinInfo
-  */
- static LateralJoinInfo *
- _copyLateralJoinInfo(const LateralJoinInfo *from)
- {
- 	LateralJoinInfo *newnode = makeNode(LateralJoinInfo);
- 
- 	COPY_BITMAPSET_FIELD(lateral_lhs);
- 	COPY_BITMAPSET_FIELD(lateral_rhs);
- 
- 	return newnode;
- }
- 
- /*
   * _copyAppendRelInfo
   */
  static AppendRelInfo *
--- 2067,2072 
*** copyObject(const void *from)
*** 4519,4527 
  		case T_SpecialJoinInfo:
  			retval = _copySpecialJoinInfo(from);
  			break;
- 		case T_LateralJoinInfo:
- 			retval = _copyLateralJoinInfo(from);
- 			break;
  		case T_AppendRelInfo:
  			retval = _copyAppendRelInfo(from);
  			break;
--- 4505,4510 
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index aa6e102..356fcaf 100644
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
*** _equalSpecialJoinInfo(const SpecialJoinI
*** 846,860 
  }
  
  static bool
- _equalLateralJoinInfo(const LateralJoinInfo *a, const LateralJoinInfo *b)
- {
- 	COMPARE_BITMAPSET_FIELD(lateral_lhs);
- 	COMPARE_BITMAPSET_FIELD(lateral_rhs);
- 
- 	return true;
- }
- 
- static bool
  _equalAppendRelInfo(const AppendRelInfo *a, const AppendRelInfo *b)
  {
  	COMPARE_SCALAR_FIELD(parent_relid);
--- 846,851 
*** equal(const void *a, const void *b)
*** 2860,2868 
  		case T_SpecialJoinInfo:
  			retval = _equalSpecialJoinInfo(a, b);
  			break;
- 		case T_LateralJoinInfo:
- 			retval = _equalLateralJoinInfo(a, b);
- 			break;
  		case T_AppendRelInfo:
  			retval = _equalAppendRelInfo(a, b);
  			break;
--- 2851,2856 
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f07c793..63fae82 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outPlannerInfo(StringInfo str, const Pl
*** 1847,1853 
  	WRITE_NODE_FIELD(right_join_clauses);
  	WRITE_NODE_FIELD(full_join_clauses);
  	WRITE_NODE_FIELD(join_info_list);
- 	WRITE_NODE_FIELD(lateral_info_list);
  	WRITE_NODE_FIELD(append_rel_list);
  	WRITE_NODE_FIELD(rowMarks);
  	WRITE_NODE_FIELD(placeholder_list);
--- 1847,1852 
*** _outRelOptInfo(StringInfo str, const Rel
*** 1892,1897 
--- 1891,1897 
  	WRITE_NODE_FIELD(cheapest_total_path);
  	WRITE_NODE_FIELD(cheapest_unique_path);
  	WRITE_NODE_FIELD(cheapest_parameterized_paths);
+ 	WRITE_BITMAPSET_FIELD(direct_lateral_relids);
  	WRITE_BITMAPSET_FIELD(lateral_relids);
  	WRITE_UINT_FIELD(relid);
  	WRITE_OID_FIELD(reltablespace);
*** _outSpecialJoinInfo(StringInfo str, cons
*** 2057,2071 
  }
  
  static void
- _outLateralJoinInfo(StringInfo str, const LateralJoinInfo *node)
- {
- 	WRITE_NODE_TYPE("LATERALJOININFO");
- 
- 	WRITE_BITMAPSET_FIELD(lateral_lhs);
- 	WRITE_BITMAPSET_FIELD(lateral_rhs);
- }
- 
- static void
  _outAppendRelInfo(StringInfo str, const AppendRelInfo *node)
  {
  	WRITE_NODE_TYPE("APPENDRELINFO");
--- 2057,2062 
*** _outNode(StringInfo str, const void *obj
*** 3355,3363 
  			case T_SpecialJoinInfo:
  _outSpecialJoinInfo(str, obj);
  break;
- 			case T_LateralJoinInfo:
- _outLateralJoinInfo(str, obj);
- break;
  			case T_AppendRelInfo:
  _outAppendRelInfo(str, obj);
  break;
-

Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2015-12-10 Thread Ashutosh Bapat
On Thu, Dec 10, 2015 at 11:20 PM, Robert Haas  wrote:

> On Tue, Dec 8, 2015 at 6:40 AM, Etsuro Fujita
>  wrote:
> > IMO I want to see the EvalPlanQual fix in the first version for 9.6.
>
> +1.
>
I think there is still a lot of functionality that is offered without the
EvalPlanQual fix. As long as we do not push joins down when there are
RowMarks involved, implementation of that hook is not required. We won't be
able to push down joins for DMLs or when there are FOR SHARE/UPDATE clauses
in the query, but there are a huge number of queries which will benefit
from the pushdown even without that support. There's nothing in this patch
that gets in the way of implementing the EvalPlanQual fix; it can easily be
added after committing the first version. On the other hand, getting
minimal (it's not really minimal, it's much more than that) postgres_fdw
join pushdown support committed opens up the possibility of working on
multiple items (as listed in my mail) in parallel.

I am not saying that we do not need the EvalPlanQual fix in 9.6, but it's
not needed in the first cut. If we get the first cut in the first couple of
months of 2016, there's plenty of room for the fix to go into 9.6. It would
be a really bad situation if we could not get postgres_fdw join pushdown
supported in 9.6 because the EvalPlanQual hook could not be committed while
the rest of the code is ready. The EvalPlanQual fix in core has been
discussed since April 2015; it took 8 months to get that fixed. Hopefully
we won't need that long to implement the hook in postgres_fdw, but that
number says something about the complexity of the feature.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Kyotaro HORIGUCHI
Sorry, I misunderstood the meaning of PgStat_*.

At Fri, 11 Dec 2015 09:41:04 +0900, Amit Langote 
 wrote in <566a1ba0.70...@lab.ntt.co.jp>
> > As far as I understand it, the basic reason why this patch exists is
> > to allow a DBA to have a hint of the progress of a VACUUM that may be
> > taking minutes, or say hours, which is something we don't have now. So
> > it seems perfectly fine to me to report this information
> > asynchronously with a bit of lag. Why would we need so much precision
> > in the report?
> 
> Sorry, I didn't mean to overstate this requirement. I agree precise
> real-time reporting of progress info is not such a stringent requirement
> from the patch. The point regarding whether we should storm the collector
> with progress info messages still holds, IMHO.

An interval of a few seconds between messages would be
sufficient. I personally think that a gettimeofday() per buffer
processed (or per few buffers) is not so heavyweight, but I
suppose there's no such consensus here. However,
IsCheckpointOnSchedule does that for every buffer written.

vacuum_delay_point() seems to be a reasonable place to check the
interval and send stats, since it is presumably called at intervals
that are also appropriate for this purpose.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Amit Langote
On 2015/12/11 14:41, Kyotaro HORIGUCHI wrote:
> Sorry, I misunderstood the meaning of PgStat_*.

I should've just said "messages to the stats collector" instead of
"PgStat_Msg's".

> 
> At Fri, 11 Dec 2015 09:41:04 +0900, Amit Langote 
>  wrote
>>> As far as I understand it, the basic reason why this patch exists is
>>> to allow a DBA to have a hint of the progress of a VACUUM that may be
>>> taking minutes, or say hours, which is something we don't have now. So
>>> it seems perfectly fine to me to report this information
>>> asynchronously with a bit of lag. Why would we need so much precision
>>> in the report?
>>
>> Sorry, I didn't mean to overstate this requirement. I agree precise
>> real-time reporting of progress info is not such a stringent requirement
>> from the patch. The point regarding whether we should storm the collector
>> with progress info messages still holds, IMHO.
> 
> Taking a few seconds interval between each messages would be
> sufficient. I personaly think that gettimeofday() per processing
> every buffer (or few buffers) is not so heavy-weight but I
> suppose there's not such a consensus here. However,
> IsCheckpointOnSchedule does that per writing one buffer.
> 
> vacuum_delay_point() seems to be a reasonable point to check the
> interval and send stats since it would be designed to be called
> with the interval also appropriate for this purpose.

Interesting, vacuum_delay_point() may be worth considering.

It seems though that, overall, PgBackendStatus approach may be more suited
for progress tracking. Let's see what the author thinks.

Thanks,
Amit





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-12-10 Thread Michael Paquier
On Fri, Dec 11, 2015 at 12:59 AM, Robert Haas  wrote:
> On Thu, Dec 10, 2015 at 9:49 AM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Oh, please, no.  Gosh, this is supposed to be a lightweight facility!
>>> Just have a chunk of shared memory and write the data in there.  If
>>> you try to feed this through the stats collector you're going to
>>> increase the overhead by 100x or more, and there's no benefit.  We've
>>> got to do relation stats that way because there's no a priori bound on
>>> the number of relations, so we can't just preallocate enough shared
>>> memory for all of them.  But there's no similar restriction here: the
>>> number of backends IS fixed at startup time.  As long as we limit the
>>> amount of progress information that a backend can supply to some fixed
>>> length, which IMHO we definitely should, there's no need to add the
>>> expense of funneling this through the stats collector.
>>
>> I agree with this, and I'd further add that if we don't have a
>> fixed-length progress state, we've overdesigned the facility entirely.
>> People won't be able to make sense of anything that acts much more
>> complicated than "0% .. 100% done".  So you need to find a way of
>> approximating progress of a given command in terms more or less
>> like that, even if it's a pretty crude approximation.

Check. My opinion is based on the fact that most of the backends are
not going to use the progress facility at all, and we actually do not
need a high level of precision for VACUUM reports: we could simply
send messages with a certain delay between two messages. And it looks
like a waste to allocate that for all the backends. But I am going to
withdraw here, two committers is by far too much pressure.

> That I don't agree with.  Even for something like VACUUM, it's pretty
> hard to approximate overall progress - because, for example, normally
> we'll only have 1 index scan per index, but we might have multiple
> index scans or none if maintenance_work_mem is too small or if there
> aren't any dead tuples after all.  I don't want our progress reporting
> facility to end up with this reputation:
>
> https://xkcd.com/612/

This brings memories. Who has never faced that...

> This point has already been discussed rather extensively upthread, but
> to reiterate, I think it's much better to report slightly more
> detailed information and let the user figure out what to do with it.
> For example, for a VACUUM, I think we should report something like
> this:
> 1. The number of heap pages scanned thus far.
> 2. The number of dead tuples found thus far.
> 3. The number of dead tuples we can store before we run out of
> maintenance_work_mem.
> 4. The number of index pages processed by the current index vac cycle
> (or a sentinel value if none is in progress).
> 5. The number of heap pages for which the "second heap pass" has been 
> completed
> Now, if the user wants to flatten this out to a progress meter, they
> can write an SQL expression which does that easily enough, folding the
> sizes of the table and its indices and whatever assumptions they want
> to make about what will happen down the road.  If we all agree on how
> that should be done, it can even ship as a built-in view.  But I
> *don't* think we should build those assumptions into the core progress
> reporting facility.  For one thing, that would make updating the
> progress meter considerably more expensive - you'd have to recompute a
> new completion percentage instead of just saying "heap pages processed
> went up by one".

I agree with this. Having global counters and letting the user compute any
kind of percentage or progress bar is definitely the way to go.
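
As a rough sketch of that kind of flattening (the view and column names
below are purely hypothetical, assuming the counters listed above are
exposed as columns keyed by relid):

  SELECT p.relid::regclass,
         round(100.0 * p.heap_pages_scanned / nullif(c.relpages, 0), 1)
             AS pct_heap_scanned
  FROM pg_stat_vacuum_progress p    -- hypothetical view name
  JOIN pg_class c ON c.oid = p.relid;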

> For another thing, there are definitely going to be
> some people that want the detailed information - and I can practically
> guarantee that if we don't make it available, at least one person will
> write a tool that tries to reverse-engineer the detailed progress
> information from whatever we do report.

OK, so this justifies having detailed information, but does it justify
having precise and accurate data? ISTM that having detailed information
and having precise information are two different things. The level of
detail is defined by how verbose we want the information to be, and the
list you are giving would fulfill this requirement nicely for VACUUM.
The level of precision/accuracy at which this information is provided,
though, depends on the frequency at which we want to send it. For
long-running VACUUM it does not seem necessary to update the fields of
the progress tracker each time a counter needs to be incremented.
CLUSTER has been mentioned as well as a potential target for the
progress facility, but it seems that it too falls into the category of
things that would need to be reported at a slower pace than
"each-time-a-counter-is-incremented".

My impression is just based on the needs of VACUUM and CLUSTER.
Perhaps I am lacking imagination

Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2015-12-10 Thread Ashutosh Bapat
On Fri, Dec 11, 2015 at 3:41 AM, Robert Haas  wrote:

> On Wed, Dec 2, 2015 at 6:45 AM, Ashutosh Bapat
>  wrote:
> > It's been a long time since last patch on this thread was posted. I have
> > started
> > to work on supporting join pushdown for postgres_fdw. Attached please
> find
> > three
> > patches
> > 1. pg_fdw_core.patch: changes in core related to user mapping handling,
> GUC
> >   enable_foreignjoin
> > 2. pg_fdw_join.patch: postgres_fdw changes for supporting join pushdown
>
> It seems useful to break things up this way.  However, I'm not sure we
> want an enable_foreignjoin GUC; in fact, I think we probably don't.
> If we want to have a way to control this, postgres_fdw can provide a
> custom GUC or FDW option for that.
>

enable_foreignjoin or its FDW counterpart would be useful for debugging
purposes, just like enable_hashjoin/enable_mergejoin etc. Foreign join
pushdown can be viewed as a join strategy just like merge/nested loop/hash
join, and having enable_foreignjoin complements that picture. Users may
find further uses for it: a user running multiple FDWs who needs to disable
join pushdown across all of them for any purpose would find
enable_foreignjoin very useful, since needing to turn multiple GUCs on/off
would be cumbersome.


>
> And to be honest, I haven't really been able to understand why join
> pushdown needs changes to user mapping handling.


The current join pushdown infrastructure in core allows a join to be pushed
down if both sides of the join come from the same server. Those sides may
have different user mappings, and thus different user properties/access
permissions/visibility on the foreign server. If the FDW picks either of
these different user mappings to push the join down, it will get wrong
results. So a join between two foreign relations cannot be pushed down if
the user mappings on the two sides do not match. This has already been
discussed in this thread; I am pasting your response to Hanada-san from
back in May 2015 as a pointer to that discussion:
--
2015-05-21 23:11 GMT+09:00 Robert Haas :
> On Sat, May 16, 2015 at 9:04 AM, Shigeru Hanada
>  wrote:
>> d) All relations must have the same effective user id
>> This check is done with userid stored in PgFdwRelationInfo, which is
>> valid only when underlying relations have the same effective user id.
>> Here "effective user id" means id of the user executing the query, or
>> the owner of the view when the foreign table is accessed via view.
>> Tom suggested that it is necessary to check that user mapping matches
>> for security reason, and now I think it's better than checking
>> effective user as current postgres_fdw does.
>
> So, should this be a separate patch?
--
To add to that, checking the user mapping is better than checking the
effective user id for multiple reasons. Multiple local users can share the
same public user mapping, which implies that they share the same
permissions/visibility for objects on the foreign server. A join involving
two sides with different effective local users but the same user mapping
can be pushed down to the foreign server, since the same objects/data are
going to be visible whether or not we push the join down.
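
A minimal illustration of the scenario (server, user, and remote role
names below are all made up):

  CREATE SERVER remote_srv FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'remote.example.com', dbname 'appdb');
  CREATE USER MAPPING FOR alice SERVER remote_srv OPTIONS (user 'alice_ro');
  CREATE USER MAPPING FOR bob SERVER remote_srv OPTIONS (user 'bob_rw');
  -- Foreign tables on remote_srv are reachable by both alice and bob, but
  -- a join pushed down over a single connection would run with only one
  -- of the remote roles' permissions/visibility, hence the mapping check.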



> Just hypothetically
> speaking, if I put my foot down and said we're not committing any of
> that stuff, how and why would that impact our ability to have join
> pushdown in 9.6?
>
>
In fact, the question would be what user mapping should be used by the
connection on which we fire the join query. Unless we answer that question
we won't have join pushdown in 9.6. If we push joins down without taking
the user mapping into consideration, we will have all sorts of
security/data visibility problems because of the wrong user properties
being used for the connection.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Parallel Aggregate

2015-12-10 Thread Haribabu Kommi
On Thu, Dec 3, 2015 at 6:06 PM, David Rowley
 wrote:
> On 3 December 2015 at 19:24, Haribabu Kommi 
> wrote:
>>
>> On Thu, Dec 3, 2015 at 4:18 PM, David Rowley
>>  wrote:
>> >
>> > Hi,
>> >
>> > I just wanted to cross post here to mark that I've posted an updated
>> > patch
>> > for combining aggregate states:
>> >
>> > http://www.postgresql.org/message-id/CAKJS1f9wfPKSYt8CG=t271xbymzjrzwqbjeixiqrf-olh_u...@mail.gmail.com
>> >
>> > I also wanted to check if you've managed to make any progress on
>> > Parallel
>> > Aggregation? I'm very interested in this myself and would like to
>> > progress
>> > with it, if you're not already doing so.
>>
>> Yes, the parallel aggregate basic patch is almost ready.
>> This patch is based on your earlier combine state patch.
>> I will post it to community with in a week or so.
>
>
> That's great news!
>
> Also note that there's some bug fixes in the patch I just posted on the
> other thread for combining aggregate states:
>
> For example:  values[Anum_pg_aggregate_aggcombinefn - 1] =
> ObjectIdGetDatum(combinefn);
> was missing from AggregateCreate().
>
> It might be worth diffing to the updated patch just to pull in anything else
> that's changed.

Attached is a POC patch for parallel aggregate, based on the combine
aggregate patch (this patch contains the combine aggregate changes as
well). It generates and executes the parallel aggregate plan as
discussed in earlier threads.

Changes:

1. The aggregate reference in the Finalize Aggregate node gets overwritten
with an OUTER_VAR reference. But to do the final aggregation we need the
aggregate there, so currently this is avoided by checking the combine
states.

2. Check whether the aggregate functions present in the targetlist and
qual can be executed in parallel or not. Based on this, the targetlist is
formed to pass to the partial aggregate.

3. Replace the seq scan, as the lefttree, with a partial aggregate plan
and generate the full parallel aggregate plan.

Todo:
1. Needs a code cleanup, it is just a prototype.
2. Explain plan with proper instrumentation data.
3. Performance test to observe the effect of parallel aggregate.
4. Need to separate the combine aggregate patch from the additional
changes done.
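
For reference, the kind of query this plan shape is aimed at (table name
is hypothetical): a plain aggregate over a large table, split into a
partial aggregate run by parallel workers plus a finalize aggregate on
top.

  SELECT count(*), sum(amount) FROM orders;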


Regards,
Hari Babu
Fujitsu Australia


parallelagg_poc.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-12-10 Thread Michael Paquier
On Fri, Dec 11, 2015 at 5:35 AM, Alvaro Herrera
 wrote:
> Noah Misch wrote:
>> On Mon, Dec 07, 2015 at 02:34:39PM +0900, Michael Paquier wrote:
>> > > Postmaster log file names became less informative.  Before the commit:
>> > > Should nodes get a name, so we instead see master_57834.log?
>> >
>> > I am not sure that this is necessary.
>>
>> In general, you'd need to cross-reference the main log file to determine 
>> which
>> postmaster log corresponds to which action within the test.  I did plenty of
>> "grep $PATTERN src/bin/pg_rewind/tmp_check/log/master.log" while debugging
>> that test.  I'd like to be able to use /*master*.log, not rely on timestamps
>> or on scraping regress_log_002_databases to determine which logs are master
>> logs.  Feel free to skip this point if I'm the only one minding, though.
>
> Since we now have the node name in the log file name, perhaps we no
> longer need the port number in there

There is no node name in the log file name as of now; the file names are
built using the port number, and a node's information is dumped into
the central log file when it is created (see dump_info).

> In fact, I find having the file
> name change on every run (based on the port number) is slightly
> annoying.  I vote we change it back to using the node name without the
> port number.  (Also, some PostgresNode messages refer to the instance by
> datadir and port number, which is unnecessary: it would be clearer to
> use the name instead.)

OK, so... What we have now as log file for a specific node is that:
${testname}_node_${port}.log
which is equivalent to that:
${testname}_${applname}.log

I guess that to complete your idea we could allow PostgresNode to take
a custom name for its log file through an optional parameter like
logfile => 'myname' or similar. If nothing is defined, the process
falls back to applname. So this would give the following:
${testname}_${logfile}.log

It seems that we had better keep the test name as a prefix of the log
file name though, to avoid an overlap with any other test in the same
series. Thoughts?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 10 Dec 2015 20:27:01 +0100, Andres Freund  wrote in 
<20151210192701.gc11...@alap3.anarazel.de>
> > > > A second problem is that the smgrimmedsync() in copy_relation_data()
> > > > isn't called for the init fork of unlogged relations, even if it needs
> > > > to.
> 
> Here's a patch doing that. It's not yet fully polished, but I wanted to
> get it out, because I noticed one thing:
> 
> In ATExecSetTableSpace(), for !main forks, we currently call
> smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT
> relations. That seems a bit odd to me. It currently seems to be without
> further consequence because, if there's actual data in the fork, we'll
> just create the relation in _mdfd_getseg(); or we can cope with the
> relation not being there.  But to me that feels wrong.

FWIW I agree that.

> It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT,
> not just INIT_FORKNUM. What do you guys think?

What it is doing seems reasonable to me, but copying_initfork doesn't
seem to be necessary. Driving both log_newpage() and smgrimmedsync()
from use_wal, computed so that it also takes INIT_FORKNUM into
account, would be more descriptive (more readable, in other words).
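
To illustrate only the condition I mean, a stand-alone sketch (not a
patch: the constants are mirrored here so it compiles by itself, and
the boolean argument stands in for XLogIsNeeded()):

#include <stdbool.h>

#define RELPERSISTENCE_PERMANENT 'p'
#define RELPERSISTENCE_UNLOGGED  'u'
#define INIT_FORKNUM             3

/* Value use_wal could carry: WAL-log the copied fork? */
bool
copy_fork_needs_wal(bool xlog_is_needed, char relpersistence, int forknum)
{
	return xlog_is_needed &&
		(relpersistence == RELPERSISTENCE_PERMANENT ||
		 (relpersistence == RELPERSISTENCE_UNLOGGED &&
		  forknum == INIT_FORKNUM));
}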

> > It sounds worthwhile to check that other locations rewriting tables,
> > e.g. cluster/vacuum full/reindex are safe.
> 
> Seems to be ok, on a first glance.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Fri, Dec 11, 2015 at 1:32 AM, Andres Freund  wrote:
> I've pushed the fix for the promotion-related issue.

Thanks! It is great to see this issue addressed.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-10 Thread Dean Rasheed
On 10 December 2015 at 20:02, Tom Lane  wrote:
>> It seems to be a loss of 4 digits in every case I've seen.
>
> I wouldn't have a problem with, say, throwing in an extra DEC_DIGITS worth
> of rscale in each of these functions so that the discrepancies tend to
> favor more significant digits out, rather than fewer.  I don't know that
> it's worth trying to guarantee that the result is never fewer digits than
> before, and I certainly wouldn't want to make the rules a lot more complex
> than what's there now.  But perhaps we could cover most cases easily.
>
> Dean, do you want to recheck the patch with an eye to that?
>

OK, I'll take a look at it.

Regards,
Dean



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-12-10 Thread Michael Paquier
On Fri, Dec 11, 2015 at 4:27 AM, Andres Freund  wrote:
> Hi,
>
> On 2015-12-10 18:36:32 +0100, Andres Freund wrote:
>> On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
>> > > The real problem there imo isn't that the copy_relation_data() doesn't
>> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
>> > > unlogged table specific codepath like heap_create_with_catalog() has.
>> >
>> > It looks to me like somewhere we need to do log_smgrcreate(...,
>> > INIT_FORKNUM) in the unlogged table case.
>>
>> Yes.
>>
>> > RelationCreateStorage()
>> > skips this for the main forknum of an unlogged table, which seems OK,
>> > but there's nothing that even attempts it for the init fork, which
>> > does not seem OK.
>>
>> We unfortunately can't trivially delegate that work to
>> RelationCreateStorage(). E.g. heap_create() documents that only the main
>> fork is created :(
>>
>> > I guess that logic should happen in
>> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).
>>
>> Looks like it's the easiest place.
>
>> > > A second problem is that the smgrimmedsync() in copy_relation_data()
>> > > isn't called for the init fork of unlogged relations, even if it needs
>> > > to.
>
> Here's a patch doing that. It's not yet fully polished, but I wanted to
> get it out, because I noticed one thing:
>
> In ATExecSetTableSpace(), for !main forks, we currently call
> smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT
> relations. That seems a bit odd to me. It currently seems to be without
> further consequence because, if there's actual data in the fork, we'll
> just create the relation in _mdfd_getseg(); or we can cope with the
> relation not being there.  But to me that feels wrong.
>
> It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT,
> not just INIT_FORKNUM. What do you guys think?

This fixes the problem in my environment.

+   if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+       (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+        forkNum == INIT_FORKNUM))
+       log_smgrcreate(&newrnode, forkNum);
There should be an XLogIsNeeded() check as well. Removing the check on
RELPERSISTENCE_UNLOGGED would be fine too... Not mandatory though :)
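
In other words, something along these lines (just spelling out the
suggestion here, untested, not the actual patch):

    if (XLogIsNeeded() &&
        (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
         forkNum == INIT_FORKNUM))
        log_smgrcreate(&newrnode, forkNum);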

+    * The init fork for an unlogged relation in many respects has to be
+    * treated the same as normal relation, changes need to be WAL logged and
+    * it needs to be synced to disk.
+    */
+   copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+       forkNum == INIT_FORKNUM;
Here as well, just a check on INIT_FORKNUM would be fine.

>> It sounds worthwhile to check that other locations rewriting tables,
>> e.g. cluster/vacuum full/reindex are safe.
>
> Seems to be ok, on a first glance.

Yeah. REINDEX relies on index_build() to recreate what it should... The
others look fine as well. I have tested it in any case, and the files
produced are consistent between the standby and its master.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers