Re: Zedstore - compressed in-core columnar storage
On Tue, Aug 27, 2019 at 6:03 AM Ashwin Agrawal wrote:
> Hope that helps to clarify the confusion.

Thanks for the explanation. Yes, it does clarify my doubt to some extent. My point is, once we find the leaf page containing the given tid, we go through each item in the page until we find the data corresponding to the given tid, which means we effectively perform a sequential scan at the page level. I'm referring to the below loop in zsbt_attr_scan_fetch_array().

    for (off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId iid = PageGetItemId(page, off);
        ZSAttributeArrayItem *item =
            (ZSAttributeArrayItem *) PageGetItem(page, iid);

        if (item->t_endtid <= nexttid)
            continue;

        if (item->t_firsttid > nexttid)
            break;

But that's not true for an IndexScan on a heap table, because there the index tuple contains the exact physical location of the tuple in the heap, so there is no need to scan the entire page.

Further, here are some minor comments that I could find while doing a quick code walkthrough.

1) In zsundo_insert_finish(), there is a double call to BufferGetPage(undobuf). Is that required?

2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called twice? I'm referring to the below code.

    if (fetch_proj->num_proj_atts == 0)
    {
        zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &fetch_proj->tid_scan);
        fetch_proj->tid_scan.serializable = true;

        for (int i = 1; i < fetch_proj->num_proj_atts; i++)
        {
            int attno = fetch_proj->proj_atts[i];

            zsbt_attr_begin_scan(rel, reldesc, attno, &fetch_proj->attr_scans[i - 1]);
        }
        MemoryContextSwitchTo(oldcontext);

        zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &fetch_proj->tid_scan);
    }

Also, for all types of update operation (be it a key or non-key update) we create a new tid for the new version of the tuple. Can't we reuse the tid associated with the old tuple in cases where there are no concurrent transactions to which the old tuple is still visible?

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
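For contrast, here is a minimal sketch (not taken from the zedstore patch; just the standard heap idiom, with a hypothetical helper name) of how a heap fetch uses the exact offset carried in the TID, so no per-item search within the page is needed:

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "storage/bufpage.h"
    #include "storage/itemptr.h"
    #include "storage/off.h"

    /* illustration only: fetch a tuple by the exact offset stored in the TID */
    static HeapTupleHeader
    fetch_by_tid_offset(Page page, ItemPointer tid)
    {
        OffsetNumber offnum = ItemPointerGetOffsetNumber(tid);
        ItemId       lp = PageGetItemId(page, offnum);

        if (!ItemIdIsNormal(lp))
            return NULL;        /* dead or unused line pointer */

        /* no loop over the page's items is needed; visibility checks follow */
        return (HeapTupleHeader) PageGetItem(page, lp);
    }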
Re: Fault injection framework
On Thu, Aug 22, 2019 at 07:45:09PM +0530, Asim R P wrote:
> Fault injection was discussed a few months ago at PGCon in Ottawa. At
> least a few folks showed interest and so I would like to present what we
> have been using in Greenplum.
>
> The attached patch set contains the fault injector framework ported to
> PostgreSQL master. It provides ability to define points of interest in
> backend code and then inject faults at those points from SQL. Also
> included is an isolation test to simulate a speculative insert conflict
> scenario that was found to be rather cumbersome to implement using advisory
> locks, see [1]. The alternative isolation spec using fault injectors seems
> much simpler to understand.

You may want to double-check whitespace in your patch set, and 0002 does not apply because of conflicts in isolationtester.h (my fault!).

0002 is an independent feature, so I would keep it out of the fault framework for integration. There has been an argument from Alvaro more convincing than mine about the use of a separate keyword, hence removing a dependency with steps:
https://www.postgresql.org/message-id/20190823153825.GA11405@alvherre.pgsql
It would be good also to have a test case which exercises it, without the need of the fault framework or its dedicated schedule.

Patches 0003, 0004 and 0005 could just be grouped together, as they deal with the same thing.

My first impression of this patch is that it is very intrusive. Could you explain the purpose of am_faultinjector? That's a specific connection string parameter which can be used similarly to replication for WAL senders? Couldn't there be an equivalent with a SUSET GUC?

It may be interesting to see which parts of this framework could be moved into an extension loaded with shared_preload_libraries, one thing being the shared memory initialization part. At the end it would be interesting to not have a dependency on a compile-time flag.

Things like exec_fault_injector_command() need to be more documented. It is hard to guess what it is being used for.
--
Michael
Re: Optimize single tuple fetch from nbtree index
>> It seems that it contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use-case.
>> Maybe this restriction can be relaxed for write only tables, that never
>> have to reread the page because of visibility, or something like that.
>> Also we probably can add to IndexScanDescData info about expected number
>> of tuples, to allow index work more optimal
>> and avoid the overhead for other loads.

> The idea of the patch is exactly to relax this limitation. I forgot to update
> that README file though. The current implementation of the patch should be
> correct like this - that's why I added the look-back code on the page if the
> tuple couldn't be found anymore on the same location on the page. Similarly,
> it'll look on the page to the right if it detected a page split. These two
> measures combined should give a correct implementation of the 'it's possible
> that a scan stops in the middle of a page' relaxation. However, as Peter and
> Tom pointed out earlier, they feel that the performance advantage that this
> approach gives, does not outweigh the extra complexity at this time. I'd be
> open to other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first actually gets deleted (not moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visible tuple (but not DEAD yet in the index at the time of _bt_first). For my understanding, would a situation like the following lead to this (in my patch)?

1) Backend 1 does an index scan and returns the first tuple on _bt_first - this tuple is actually deleted in the heap already, however it's not marked dead yet in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is actually dead.
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next), backend 2 comes in and manages to somehow do 1) a _bt_killitems on the page to mark tuples dead as well as 2) compact items on the page, thereby actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but cannot find it anymore because it got removed completely by backend 2.

If this is indeed possible then it's a bad issue unfortunately, and quite hard to try to reproduce, as a lot of things need to happen concurrently while doing a visibility check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos))

This Assert exists already a couple of lines earlier at the start of this function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:

* We note the buffer's block number so that we can release the pin later.
* This allows us to re-read the buffer if it is needed again for hinting.
*/
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin by the unpin macro. So, if we don't store the currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think that's the reason that currPage is stored.
Other than the two TODOs in the code, I think the comments really help clarify what's going on in the code - I'd be happy if this gets added.

-Floris
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190827.154932.250364935.horikyota@gmail.com>
> 128GB of shared buffers contain 16M buffers. On my
> perhaps-Windows-Vista-era box, such a loop takes 15 ms. (Since it
> has only 6GB, the test ignores the effect of cache that comes
> from the difference of the buffer size.) (attached 1)
...
> For a 16MB file, the cost of write-fsyncing is almost the
> same as the WAL-emitting cost. It was about 200 ms on the
> Vista-era machine with non-performant rotating magnetic disks
> with xfs. (attached 2, 3) Although write-fsyncing of a relation
> file causes no lock conflict with other backends, WAL-emitting
> delays other backends' commits by at most that many milliseconds.

FWIW, the attached are the programs I used to take the numbers.

testloop.c: to take the time to loop over buffers as in FlushRelationBuffers
testfile.c: to take the time to sync a heap file (one file per size)
testfile2.c: to take the time to emit a WAL record (16MB per file)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

testloop.c:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef struct RelFileNode
{
	unsigned int spc;
	unsigned int db;
	unsigned int rel;
} RelFileNode;

typedef struct Buffer
{
	RelFileNode rnode;
} Buffer;

//#define NBUFFERS ((int)((128.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))
#define NBUFFERS ((int)((32.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))

int main(void)
{
	int i;
	RelFileNode t = {1, 2, 3};
	Buffer *bufs = (Buffer *) malloc(sizeof(Buffer) * NBUFFERS);
	struct timeval st, ed;
	int matches = 0, unmatches = 0;
	Buffer *b;

	for (i = 0 ; i < NBUFFERS ; i++)
	{
		bufs[i].rnode.spc = random() * 100;
		bufs[i].rnode.db = random() * 100;
		bufs[i].rnode.rel = random() * 1;
	}

	/* start measuring */
	gettimeofday(&st, NULL);
	b = bufs;
	for (i = 0 ; i < NBUFFERS ; i++)
	{
		if (b->rnode.spc == t.spc && b->rnode.db == t.db && b->rnode.rel == t.rel)
			matches++;
		else
			unmatches++;
		b++;
	}
	gettimeofday(&ed, NULL);

	printf("%lf ms for %d loops, matches %d, unmatches %d\n",
		   (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
					(ed.tv_usec - st.tv_usec) / 1000.0),
		   i, matches, unmatches);

	return 0;
}

testfile.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100
#define FILE_SIZE (64 * 1024)
#define LOOPS 1000
//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (512)
//#define LOOPS 1000
//#define FILE_SIZE (128)
//#define LOOPS 1000

char buf[FILE_SIZE];
char fname[256];

int main(void)
{
	int i, j;
	int fd = -1;
	struct timeval st, ed;
	double accum = 0.0;
	int bufperfile = (int)((16.0 * 1024 * 1024) / FILE_SIZE);

	for (i = 0 ; i < LOOPS ; i++)
	{
		snprintf(fname, 256, "test%03d.file", i);
		unlink(fname); // ignore errors
	}

	for (i = 0 ; i < LOOPS ; i++)
	{
		for (j = 0 ; j < FILE_SIZE ; j++)
			buf[j] = random() * 256;

		if (i % bufperfile == 0)
		{
			if (fd >= 0)
				close(fd);
			snprintf(fname, 256, "test%03d.file", i / bufperfile);
			fd = open(fname, O_CREAT | O_RDWR, 0644);
			if (fd < 0)
			{
				fprintf(stderr, "open error: %m\n");
				exit(1);
			}
			memset(buf, 0, sizeof(buf));
			if (write(fd, buf, sizeof(buf)) < 0)
			{
				fprintf(stderr, "init write error: %m\n");
				exit(1);
			}
			if (fsync(fd) < 0)
			{
				fprintf(stderr, "init fsync error: %m\n");
				exit(1);
			}
			if (lseek(fd, 0, SEEK_SET) < 0)
			{
				fprintf(stderr, "init lseek error: %m\n");
				exit(1);
			}
		}

		gettimeofday(&st, NULL);
		if (write(fd, buf, FILE_SIZE) < 0)
		{
			fprintf(stderr, "write error: %m\n");
			exit(1);
		}
		if (fdatasync(fd) < 0)
		{
			fprintf(stderr, "fdatasync error: %m\n");
			exit(1);
		}
		gettimeofday(&ed, NULL);
		accum += (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
						  (ed.tv_usec - st.tv_usec) / 1000.0);
	}

	printf("%.2lf ms for %d %dkB-records (%d MB), %.2lf ms per %dkB)\n",
		   accum, i, FILE_SIZE / 1024, i * FILE_SIZE, accum / i, FILE_SIZE / 1024);

	return 0;
}

testfile2.c:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100
//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (512)
//#define LOOPS 1000
#define FILE_SIZE (128)
#define LOOPS 1000

char buf[FILE_SIZE];

int main(void)
{
	int i;
	int fd = -1;
	double accum = 0.0;
	struct timeval st, ed;

	for (i =
Re: A problem about partitionwise join
On Tue, Aug 27, 2019 at 8:51 AM Amit Langote wrote: > Hi Richard, > > On Mon, Aug 26, 2019 at 6:33 PM Richard Guo wrote: > > > > Hi All, > > > > To generate partitionwise join, we need to make sure there exists an > > equi-join condition for each pair of partition keys, which is performed > > by have_partkey_equi_join(). This makes sense and works well. > > > > But if, let's say, one certain pair of partition keys (foo.k = bar.k) > > has formed an equivalence class containing consts, no join clause would > > be generated for it, since we have already generated 'foo.k = const' and > > 'bar.k = const' and pushed them into the proper restrictions earlier. > > > > This will make partitionwise join fail to be planned if there are > > multiple partition keys and the pushed-down restrictions 'xxx = const' > > fail to prune away any partitions. > > > > Consider the examples below: > > > > create table p (k1 int, k2 int, val int) partition by range(k1,k2); > > create table p_1 partition of p for values from (1,1) to (10,100); > > create table p_2 partition of p for values from (10,100) to (20,200); > > > > If we are joining on each pair of partition keys, we can generate > > partitionwise join: > > > > # explain (costs off) > > select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = > bar.k2; > > QUERY PLAN > > -- > > Append > >-> Hash Join > > Hash Cond: ((foo.k1 = bar.k1) AND (foo.k2 = bar.k2)) > > -> Seq Scan on p_1 foo > > -> Hash > >-> Seq Scan on p_1 bar > >-> Hash Join > > Hash Cond: ((foo_1.k1 = bar_1.k1) AND (foo_1.k2 = bar_1.k2)) > > -> Seq Scan on p_2 foo_1 > > -> Hash > >-> Seq Scan on p_2 bar_1 > > (11 rows) > > > > But if we add another qual 'foo.k2 = const', we will be unable to > > generate partitionwise join any more, because have_partkey_equi_join() > > thinks not every partition key has an equi-join condition. > > > > # explain (costs off) > > select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = > bar.k2 and foo.k2 = 16; > >QUERY PLAN > > - > > Hash Join > >Hash Cond: (foo.k1 = bar.k1) > >-> Append > > -> Seq Scan on p_1 foo > >Filter: (k2 = 16) > > -> Seq Scan on p_2 foo_1 > >Filter: (k2 = 16) > >-> Hash > > -> Append > >-> Seq Scan on p_1 bar > > Filter: (k2 = 16) > >-> Seq Scan on p_2 bar_1 > > Filter: (k2 = 16) > > (13 rows) > > > > Is this a problem? > > Perhaps. Maybe it has to do with the way have_partkey_equi_join() has > been coded. If it was coded such that it figured out on its own that > the equivalence (foo.k2, bar.k2, ...) does exist, then that would > allow partitionwise join to occur, which I think would be OK to do. > But maybe I'm missing something. > > This should be caused by how we deduce join clauses from equivalence classes. ECs containing consts will not be considered so we cannot generate (foo.k2 = bar.k2) for the query above. In addition, when generating join clauses from equivalence classes, we only select the joinclause with the 'best score', or the first joinclause with a score of 3. This may make us miss some joinclause on partition keys. 
Check the query below as a more illustrative example:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

If we use the quals 'foo.k = bar.k and foo.k = bar.val', we can generate a partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val;
              QUERY PLAN
---------------------------------------
 Append
   ->  Hash Join
         Hash Cond: (foo.k = bar.k)
         ->  Seq Scan on p_1 foo
         ->  Hash
               ->  Seq Scan on p_1 bar
                     Filter: (k = val)
   ->  Hash Join
         Hash Cond: (foo_1.k = bar_1.k)
         ->  Seq Scan on p_2 foo_1
         ->  Hash
               ->  Seq Scan on p_2 bar_1
                     Filter: (k = val)
(13 rows)

But if we exchange the order of the two quals to 'foo.k = bar.val and foo.k = bar.k', then a partitionwise join cannot be generated any more, because we only keep the joinclause 'foo.k = bar.val', as it is the first to reach a score of 3. We have missed the joinclause on the partition key although it does exist.

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;
              QUERY PLAN
---------------------------------------
 Hash Join
   Hash Cond: (foo.k = bar.val)
   ->  Append
         ->  Seq Scan on p_1 foo
         ->  Seq Scan on p_2 foo_1
   ->  Hash
         ->  Append
               ->
Re: Re: Email to hackers for test coverage
On Tue, 27 Aug 2019 14:07:48 +0800 mich...@paquier.xyz wrote:

> There is a section in float4.sql which deals with overflow and
> underflow, so wouldn't it be better to move the tests there? You
> could just trigger the failures with that:
> =# insert into float4_tbl values ('-10e-70'::float8);
> ERROR: 22003: value out of range: underflow
> LOCATION: check_float4_val, float.h:145
> =# insert into float4_tbl values ('-10e70'::float8);
> ERROR: 22003: value out of range: overflow
> LOCATION: check_float4_val, float.h:140
> I would also test all four patterns: 10e70, 10e-70, -10e70, -10e-70.

I think your way is much better, so I have changed the patch; it is now 'regression_20190827.patch'.

> For the numeric part, this improves the case of
> ApplySortAbbrevFullComparator() where both values are not NULL. Could
> things be done so as the other code paths are fully covered? One
> INSERT is fine by the way to add the extra coverage.

There are code lines related to NULL values in ApplySortAbbrevFullComparator(), but I think the code lines for comparing a NULL and a NOT NULL can never be reached, because that case is handled in ApplySortComparator(), which is called before ApplySortAbbrevFullComparator(). So I think there is no point in INSERTing a NULL value.

--
Movead

regression_20190827.patch
Re: pgbench - allow to create partitioned tables
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Thanks. All looks good, making it ready for committer.

Regards,
Asif

The new status of this patch is: Ready for Committer
Re: "ago" times on buildfarm status page
On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote: > I think this is the problem: > > 'scmrepo' => '/home/pgbf/pgmirror.git', > > Probably this isn't updated often enough. It probably has little to do with > the clock settings. > > This is the kind of old-fashioned way of doing things. These days > "git_keep_mirror => 1" along with the community repo as the base would avoid > these problems. We've discussed this before (see below). The configuration is intentionally like that. I specifically configured skate and snapper to build the exact same source, where skate builds with default buildfarm settings, while snapper builds with the settings actually used by Debian source packages. These animals were set up to avoid cases we had in the past where Debian source packages failed to build on sparc, even though build animals running on Debian sparc were building fine: https://www.postgresql.org/message-id/01d2f0c2%24e2d335a0%24a879a0e0%24%40turelinckx.be With snapper building the exact same source as skate (as it is now), we have some point of reference if snapper fails but skate succeeds. I could configure snapper to perform an update of the repo before building, but then we give up this comparability in exchange for a bit more clarity regarding timestamps. Best regards, Tom Turelinckx On Thu, Nov 9, 2017, at 8:54 PM, Andrew Dunstan wrote: > > The first line of the log file is always something like this (see end of > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-11-09%2018%3A59%3A01 > ) > > Last file mtime in snapshot: Thu Nov 9 17:56:07 2017 GMT > This should be the time of the last commit in the snapshot. > > On Mon, Nov 6, 2017 at 9:49 AM, Tom Lane wrote: >> >> Tom Turelinckx writes: >> >> > On Fri, Nov 3, 2017, at 09:42 PM, Tom Lane wrote: >> >> Either that, or it's fallen through a wormhole ;-), but the results >> >> it's posting seem to be mis-timestamped by several hours, which is >> >> confusing. Please set its clock correctly. Maybe spin up ntpd? >> >> > The clock is correct, but the configuration may be unusual. >> >> > In fact, snapper runs on the same machine as skate, and it's using ntp. >> > At 7 AM (Western Europe), a local git repo is updated. In the morning, >> > skate builds from that local repo with the default buildfarm >> > configuration that most animals use. In the afternoon, snapper builds >> > from that local repo with the exact same configuration, per branch, that >> > the Debian source packages from the pgdg repo use on the same platform. >> > The local repo is updated only once per day to ensure that snapper and >> > skate build the same source with different settings, and they share the >> > git mirror and build root, as suggested in the build farm howto. >> >> Hm. So the issue really is that the build timestamp that the buildfarm >> client is reporting tells when it pulled from the local repo, not when >> that repo was last updated from the community server. Not sure if there's >> any simple way to improve that ... Andrew, any thoughts?
Re: Creating partitions automatically at least on HASH?
On Mon, 26 Aug 2019 at 19:46, Fabien COELHO wrote: > > Hello Rafia, > > >>CREATE TABLE Stuff (...) > >> PARTITION BY [HASH | RANGE | LIST] (…) > >>DO NONE -- this is the default > >>DO [IMMEDIATE|DEFERRED] USING (…) > >> > >> Where the USING part would be generic keword value pairs, eg: > >> > >> For HASH: (MODULUS 8) and/or (NPARTS 10) > >> > >> For RANGE: (START '1970-01-01', STOP '2020-01-01', INCREMENT '1 year') > >> and/or (START 1970, STOP 2020, NPARTS 50) > >> > >> And possibly for LIST: (IN (…), IN (…), …), or possibly some other > >> keyword. > >> > >> The "DEFERRED" could be used as an open syntax for dynamic partitioning, > >> if later someone would feel like doing it. > >> > > ISTM that "USING" is better than "WITH" because WITH is already used > >> specifically for HASH and other optional stuff in CREATE TABLE. > >> > >> The text constant would be interpreted depending on the partitioning > >> expression/column type. > >> > >> Any opinion about the overall approach? > > > I happen to start a similar discussion [1] being unaware of this one > > and there Ashutosh Sharma talked about interval partitioning in Oracle. > > Looking > > closely it looks like we can have this automatic partitioning more > > convenient by having something similar. Basically, it is creating > > partitions on demand or lazy partitioning. > > Yep, the "what" of dynamic partitioning is more or less straightforward, > along the line you are describing. > > For me there are really two questions: > > - having a extendable syntax, hence the mail I sent, which would cover > both automatic static & dynamic partitioning and their parameters, > given that we already have manual static, automatic static should > be pretty easy. > > - implementing the stuff, with limited performance impact if possible > for the dynamic case, which is non trivial. > > > To explain a bit more, let's take range partition for example, first > > parent table is created and it's interval and start and end values are > > specified and it creates only the parent table just like it works today. > > > Now, if there comes a insertion that does not belong to the existing (or > > any, in the case of first insertion) partition(s), then the > > corresponding partition is created, > > Yep. Now, you also have to deal with race conditions issues, i.e. two > parallel session inserting tuples that must create the same partition, and > probably you would like to avoid a deadlock. > > Hmmm, that shouldn't be very hard. Postgres handles many such things and I think mostly by a mutex guarded shared memory structure. E.g. we can have a shared memory structure associated with the parent table holding the information of all the available partitions, and keep this structure guarded by mutex. Anytime a new partition has to be created the relevant information is first entered in this structure before actually creating it. > I think it is extensible to other partitioning schemes as well. Also it > > is likely to have a positive impact on the queries, because there will > > be required partitions only and would not require to educate > > planner/executor about many empty partitions. > > Yep, but it creates other problems to solve… > > Isn't it always the case. :) -- Regards, Rafia Sabih
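To make the idea a bit more concrete, here is a rough sketch (purely hypothetical names, not an actual patch) of the kind of lock-guarded shared hash table that could track partitions created on demand, as described above:

    #include "postgres.h"
    #include "storage/shmem.h"
    #include "utils/hsearch.h"

    /* hypothetical entry: one row per partitioned parent */
    typedef struct AutoPartEntry
    {
        Oid    parentrelid;    /* hash key */
        Oid    newest_part;    /* most recently created partition */
    } AutoPartEntry;

    static HTAB *autopart_hash;

    static void
    autopart_shmem_init(void)
    {
        HASHCTL    info;

        memset(&info, 0, sizeof(info));
        info.keysize = sizeof(Oid);
        info.entrysize = sizeof(AutoPartEntry);

        /* sizes are placeholders; a real design would size this properly */
        autopart_hash = ShmemInitHash("auto partition map",
                                      64, 1024,
                                      &info,
                                      HASH_ELEM | HASH_BLOBS);
    }

On insert, a backend would look up the parent under an exclusive lock, and only the backend that finds no suitable partition would create one, so two sessions never race to create the same partition.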
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello.

At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch wrote in <20190826050843.gb3153...@rfd.leadboat.com>
> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch wrote in <20190820060314.ga3086...@rfd.leadboat.com>
> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > > I'm not sure the point of the behavior. I suppose that the "log"
> > > > is a sequence of new_page records. It also needs to be synced and
> > > > it is always larger than the file to be synced. I can't think of
> > > > an appropriate threshold without the point.
> > >
> > > Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
> > > every buffer header containing a buffer of the current database. The belief
> > > has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> > > in a busy system with large shared_buffers.
> >
> > I'm at a loss.. The decision between WAL and sync is made at
> > commit time, when we no longer have a pin on a buffer. When
> > emitting WAL, opposite to the assumption, lock needs to be
> > re-acquired for every page to emit log_new_page. What is worse,
> > we may need to reload evicted buffers. If the file has been
> > CopyFrom'ed, ring buffer strategy makes the situation even
> > worse. That doesn't seem cheap at all..
>
> Consider a one-page relfilenode. Doing all the things you list for a single
> page may be cheaper than locking millions of buffer headers.

If I understand you correctly, I would say that *all* buffers that don't belong to in-transaction-created files are skipped before taking locks. No lock conflict happens with other backends. FlushRelationBuffers uses double-checked locking as follows:

FlushRelationBuffers_common():
  ..
  if (!islocal)
  {
    for (i for all buffers)
    {
      if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
      {
        LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && valid && dirty)
        {
          PinBuffer_Locked(bufHdr);
          LWLockAcquire();
          FlushBuffer();

128GB of shared buffers contain 16M buffers. On my perhaps-Windows-Vista-era box, such a loop takes 15 ms. (Since it has only 6GB, the test ignores the effect of cache that comes from the difference of the buffer size.) (attached 1)

With WAL-emitting we find every buffer of the file using the buffer hash, so we suffer partition locks instead of the 15 ms of local latency. That seems worse.

> > If there were any chance on WAL for smaller files here, it would
> > be on the files smaller than the ring size of bulk-write
> > strategy (16MB).
>
> Like you, I expect the optimal threshold is less than 16MB, though you should
> benchmark to see. Under the ideal threshold, when a transaction creates a new
> relfilenode just smaller than the threshold, that transaction will be somewhat
> slower than it would be if the threshold were zero. Locking every buffer

I looked closer at this. For a 16MB file, the cost of write-fsyncing is almost the same as the WAL-emitting cost. It was about 200 ms on the Vista-era machine with non-performant rotating magnetic disks with xfs. (attached 2, 3) Although write-fsyncing of a relation file causes no lock conflict with other backends, WAL-emitting delays other backends' commits by at most that many milliseconds.

In summary, the characteristics of the two methods on a 16MB file are as follows.

File write:
- 15 ms of buffer scan without locks (@128GB shared buffers)
+ no hash search for a buffer
= takes locks on all buffers of the file, one by one (to write)
+ plus 200 ms of write-fdatasync (of the whole relation file), which doesn't conflict with other backends (except via CPU time slots and IO bandwidth)

WAL write:
+ no buffer scan
- 2048 (16M/8k) partition lock acquisitions to find every buffer of the target file, which can conflict with other backends
= takes locks on all buffers of the file, one by one (to take FPWs)
- plus 200 ms of open(create)-write-fdatasync (of a WAL file of default size), which can delay commits on other backends by at most that duration

> header causes a distributed slow-down for other queries, and protecting the
> latency of non-DDL queries is typically more useful than accelerating
> TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
> beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
> than the buffer scan harms them. That's about where the threshold should be.

If the discussion above is correct, we shouldn't use WAL-write even for files around 16MB. For smaller shared_buffers and file sizes, the delays are:

Scan all buffers takes:
  15 ms for 128GB shared_buffers
  4.5 ms for 32GB shared_buffers

fdatasync takes:
  200 ms for 16MB/sync
  51 ms for 1MB/sync
  46 ms for 512kB/sync
  40 ms for 256kB/sync
  37 ms for 128kB/sync
  35 ms for <64k
Re: Statement timeout in pg_rewind
On Tue, 27 Aug 2019 at 08:36, Michael Paquier wrote:
> I'd rather be on the safe side and as we are looking at this at this
> area.. Who knows if this logic is going to change in the future and
> how it will change.

Agree.

> Oops, I misread this part. What about a simple wrapper
> run_simple_command which checks after PGRES_COMMAND_OK, and frees the
> result then? This could be used for the temporary table creation and
> when setting synchronous_commit.

Done, please see the next version attached.

Regards,
--
Alexander Kukushkin

diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 37eccc3126..e2dbc9fdf6 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -39,6 +39,7 @@ static PGconn *conn = NULL;
 static void receiveFileChunks(const char *sql);
 static void execute_pagemap(datapagemap_t *pagemap, const char *path);
 static char *run_simple_query(const char *sql);
+static void run_simple_command(const char *sql);
 
 void
 libpqConnect(const char *connstr)
@@ -54,6 +55,11 @@ libpqConnect(const char *connstr)
 	if (showprogress)
 		pg_log_info("connected to server");
 
+	/* We don't want our queries canceled */
+	run_simple_command("SET statement_timeout = 0");
+	run_simple_command("SET lock_timeout = 0");
+	run_simple_command("SET idle_in_transaction_session_timeout = 0");
+
 	res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL);
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
 		pg_fatal("could not clear search_path: %s",
@@ -88,11 +94,7 @@ libpqConnect(const char *connstr)
 	 * replication, and replication isn't working for some reason, we don't
 	 * want to get stuck, waiting for it to start working again.
 	 */
-	res = PQexec(conn, "SET synchronous_commit = off");
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pg_fatal("could not set up connection context: %s",
-				 PQresultErrorMessage(res));
-	PQclear(res);
+	run_simple_command("SET synchronous_commit = off");
 }
 
 /*
@@ -122,6 +124,25 @@ run_simple_query(const char *sql)
 	return result;
 }
 
+/*
+ * Runs a command.
+ * In case of failure pg_fatal exits with code=1.
+ */
+static void
+run_simple_command(const char *sql)
+{
+	PGresult   *res;
+	char	   *result;
+
+	res = PQexec(conn, sql);
+
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		pg_fatal("error running command (%s) in source server: %s",
+				 sql, PQresultErrorMessage(res));
+
+	PQclear(res);
+}
+
 /*
  * Calls pg_current_wal_insert_lsn() function
  */
@@ -427,12 +448,7 @@ libpq_executeFileMap(filemap_t *map)
 	 * need to fetch.
 	 */
 	sql = "CREATE TEMPORARY TABLE fetchchunks(path text, begin int8, len int4);";
-	res = PQexec(conn, sql);
-
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pg_fatal("could not create temporary table: %s",
-				 PQresultErrorMessage(res));
-	PQclear(res);
+	run_simple_command(sql);
 
 	sql = "COPY fetchchunks FROM STDIN";
 	res = PQexec(conn, sql);
PostgreSQL and Real Application Testing (RAT)
Hi,

In my business, one of the things blocking the migration from Oracle to PostgreSQL is not having an equivalent of Oracle Real Application Testing. This product captures a workload in production and replays it in a test environment. This makes it possible to assess the impact of a migration to a newer version, of creating an index, and so on.

Is there an equivalent in the PostgreSQL community? If not, do you think it's technically possible to build one? Who would be interested in this project?

Thanks in advance
Best Regards
Didier ROS
EDF
Re: block-level incremental backup
On Fri, Aug 16, 2019 at 8:07 PM Ibrar Ahmed wrote:
>
> What do you mean by bigger file, a file greater than 1GB? In which case you get file > 1GB?
>

Few comments:

Comment:
+ buf = (char *) malloc(statbuf->st_size);
+ if (buf == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
+ {
+ Bitmapset *mod_blocks = NULL;
+ int nmodblocks = 0;
+
+ if (cnt % BLCKSZ != 0)
+ {

We can use the same size as the full page size. After pg_start_backup, full page writes will be enabled. We can use the same file size to maintain data consistency.

Comment:
/* Validate given LSN and convert it into XLogRecPtr. */
+ opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+ if (XLogRecPtrIsInvalid(opt->lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid value for LSN")));

Validate that the input LSN is less than the current system LSN.

Comment:
/* Validate given LSN and convert it into XLogRecPtr. */
+ opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+ if (XLogRecPtrIsInvalid(opt->lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid value for LSN")));

Should we check whether it is on the same timeline as the system's timeline?

Comment:
+ if (fread(blkdata, 1, BLCKSZ, infp) != BLCKSZ)
+ {
+ pg_log_error("could not read from file \"%s\": %m", outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ exit(1);
+ }
+
+ /* Finally write one block to the output file */
+ if (fwrite(blkdata, 1, BLCKSZ, outfp) != BLCKSZ)
+ {
+ pg_log_error("could not write to file \"%s\": %m", outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ exit(1);
+ }

Should we support the compression formats supported by pg_basebackup? This can be an enhancement after the functionality is completed.

Comment:
We should provide some mechanism to validate the backup, to identify whether a backup is corrupt or some file is missing (deleted) in a backup.

Comment:
+ ofp = fopen(tofn, "wb");
+ if (ofp == NULL)
+ {
+ pg_log_error("could not create file \"%s\": %m", tofn);
+ exit(1);
+ }

ifp should be closed in the error flow.

Comment:
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = pg_malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ pg_free(labelfile);
+ exit(1);
+ }

fclose can be moved above.

Comment:
+ if (!modifiedblockfound)
+ {
+ copy_whole_file(fm->filename, outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ return;
+ }
+
+ /* Write all blocks to the output file */
+
+ if (fstat(fileno(fm->fp), &statbuf) != 0)
+ {
+ pg_log_error("could not stat file \"%s\": %m", fm->filename);
+ pg_free(filemaps);
+ exit(1);
+ }

In some error flows, cleanup_filemaps needs to be called to close the file descriptors that were opened.

Comment:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9
+

This could be a user-configurable value. This can be an enhancement after the functionality is completed.

Comment:
We can add a README file with all the details regarding incremental backup and combine backup.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
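As a follow-up to the comment above about validating the given LSN against the current system LSN, here is a hedged sketch of what such a check could look like (reusing the opt->lsn field from the quoted snippet; where exactly it would sit in the patch is an assumption):

    XLogRecPtr  current_ptr = GetXLogInsertRecPtr();

    if (opt->lsn >= current_ptr)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("requested LSN %X/%X is ahead of the current insert LSN %X/%X",
                        (uint32) (opt->lsn >> 32), (uint32) opt->lsn,
                        (uint32) (current_ptr >> 32), (uint32) current_ptr)));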
Re: "ago" times on buildfarm status page
On 8/27/19 4:33 AM, Tom Turelinckx wrote:
> On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote:
>> I think this is the problem:
>>
>> 'scmrepo' => '/home/pgbf/pgmirror.git',
>>
>> Probably this isn't updated often enough. It probably has little to do with
>> the clock settings.
>>
>> This is the kind of old-fashioned way of doing things. These days
>> "git_keep_mirror => 1" along with the community repo as the base would avoid
>> these problems.
> We've discussed this before (see below).
>
> Hm. So the issue really is that the build timestamp that the buildfarm
> client is reporting tells when it pulled from the local repo, not when
> that repo was last updated from the community server. Not sure if there's
> any simple way to improve that ... Andrew, any thoughts?

Maybe we need an option to use the git commit time instead of the snapshot time.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fault injection framework
On Tue, Aug 27, 2019 at 12:35 PM Michael Paquier wrote: > > You may want to double-check whitespaces in your patch set, and 0002 > does not apply because of conflicts in isolationtester.h (my fault!). > I've rebased the patch set against the latest master and tried to resolve whitespace issues. Apologies for the whitespace conflicts, I tried resolving them but there is some trailing whitespace in the answer file of the regress test in v1-0001 that cannot be removed, else the test will fail. > 0002 is an independent feature, so I would keep it out of the fault > framework for integration. There has been a argument from Alvaro > more convincing than mine about the use of a separate keyword, hence > removing a dependency with steps: > https://www.postgresql.org/message-id/20190823153825.GA11405@alvherre.pgsql > That is a valid point, thank you Alvaro for the feedback. I've changed 0002 so that a step within a permutation can be declared as blocking, revised patch set is attached. > It would be good also to have a test case which exercises it, without > the need of the fault framework or its dedicated schedule. > It is for this reason that I have not separated patch 0002 out from faultinjector patch set because the test to demonstrate the blocking feature uses faults. I need to give more thought to find a test having a session that needs to block for reasons other than locking. Any pointers will be very helpful. > > My first impressions about this patch is that it is very intrusive. > Could you explain the purpose of am_faultinjector? That's a specific > connection string parameter which can be used similarly to replication > for WAL senders? Couldn't there be an equivalent with a SUSET GUC? Thank you for the review. Admittedly, the patch set doesn't include a test to demonstrate am_faultinjector. That is used when a fault needs to be injected into a remote server, say a standby. And that standby may be accepting connections or not, depending on if it's operating in hot-standby mode. Therefore, the am_faultinjector and the connection parameter is used to identify fault injection requests and allow those to be handled even when normal user connections are not allowed. Also, such connections do not need to be associated with a database, they simply need to set the fault in the shared memory hash table. In that sense, fault injection connections are treated similar to replication connections. I was looking into tests under src/test/recovery/t/. Let me write a test to demonstrate what I'm trying to explain above. > It may be interesting to see which parts of this framework could be > moved into an extension loaded with shared_preload_libraries, one > thing being the shared memory initialization part. At the end it > would be interesting to not have a dependency with a compile-time > flag. Patch 0001 includes an extension that provides a SQL UDF as a wrapper over the fault injection interface in backend. Moving the backend part of the patch also into an extension seems difficult to me. Getting rid of the compile time dependency is a strong enough advantage to spend more efforts on this. > > Things like exec_fault_injector_command() need to be more documented. > It is hard to guess what it is being used for. Added a comment to explain things a bit. Hope that helps. And as mentioned above, I'm working on a test case to demonstrate this feature. 
Asim

Attachments:
v1-0001-Framework-to-inject-faults-from-SQL-tests.patch
v1-0002-Add-syntax-to-declare-a-step-that-is-expected-to-.patch
v1-0003-Speculative-insert-isolation-test-spec-using-faul.patch
v1-0004-Run-tests-with-faults-if-faultinjector-was-compil.patch
v1-0005-Isolation-schedule-for-tests-that-inject-faults.patch
Re: Cleanup isolation specs from unused steps
On Fri, Aug 23, 2019 at 9:08 PM Alvaro Herrera wrote: > > On 2019-Aug-23, Asim R P wrote: > > > As part of the fault injector patch set [1], I added a new "blocking" > > keyword to isolation grammar so that a step can be declared as blocking. > > See patch 0002-Add-syntax-to-declare-a-step-that-is-expected-to-block. > > One point to that implementation is that in that design a step is > globally declared to be blocking, but in reality that's the wrong way to > see things: a step might block in some permutations and not others. So > I think we should do as Michael suggested: it's the permutation that has > to have a way to mark a given step as blocking, not the step itself. Thank you for the feedback. I've changed patch 0002 accordingly, please take another look: https://www.postgresql.org/message-id/CANXE4TdvSi7Yia_5sV82%2BMHf0WcUSN9u6_X8VEUBv-YStphd%3DQ%40mail.gmail.com Asim
Re: Why overhead of SPI is so large?
On Sat, Aug 24, 2019 at 12:01 PM David Fetter wrote: > No, it's lying to the RDBMS, so it's pilot error. The problem of > determining from the function itself whether it is in fact immutable > is, in general, equivalent to the Halting Problem, so no, we can't > figure it out. We do need to trust our users not to lie to us, and we > do not need to protect them from the consequences when they do. Depends. I don't mind if mislabeling a function leads to "wrong" query results, but I don't think it's OK for it to, say, crash the server. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PostgreSQL and Real Application Testing (RAT)
ROS Didier schrieb am 27.08.2019 um 12:47:
> In my business, one of the things blocking the migration from Oracle
> to PostgreSQL is not having an equivalent of Oracle Real Application
> Testing.
>
> This product captures a workload in production and replays it in a test
> environment. This makes it possible to assess the impact of a migration
> to a newer version, of creating an index, and so on.
>
> Is there an equivalent in the PostgreSQL community?
>
> If not, do you think it's technically possible to build one?
>
> Who would be interested in this project?

Not sure how up-to-date that is, but you might want to have a look here:

https://wiki.postgresql.org/wiki/Statement_Playback
Re: "ago" times on buildfarm status page
On 8/27/19 8:45 AM, Andrew Dunstan wrote:
> On 8/27/19 4:33 AM, Tom Turelinckx wrote:
>> On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote:
>>> I think this is the problem:
>>>
>>> 'scmrepo' => '/home/pgbf/pgmirror.git',
>>>
>>> Probably this isn't updated often enough. It probably has little to do with
>>> the clock settings.
>>>
>>> This is the kind of old-fashioned way of doing things. These days
>>> "git_keep_mirror => 1" along with the community repo as the base would
>>> avoid these problems.
>> We've discussed this before (see below).
>>
>> Hm. So the issue really is that the build timestamp that the buildfarm
>> client is reporting tells when it pulled from the local repo, not when
>> that repo was last updated from the community server. Not sure if there's
>> any simple way to improve that ... Andrew, any thoughts?
>
> Maybe we need an option to use the git commit time instead of the
> snapshot time.

Scratch that - we use this to calculate the duration of the first stage, so mangling it would just create another error.

It's tempting to say we should sort the dashboard by git reference time and then snapshot time - that should be fairly doable. But what if there isn't a git reference, as happens when there's a git failure, for example? In those cases maybe just use the snapshot time?

Storing the git timestamp would involve a table change in our second largest table, so the team would need to discuss and plan it.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: "ago" times on buildfarm status page
"Tom Turelinckx" writes: > On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote: >> I think this is the problem: >> 'scmrepo' => '/home/pgbf/pgmirror.git', >> Probably this isn't updated often enough. It probably has little to do with >> the clock settings. > The configuration is intentionally like that. I specifically configured skate > and snapper to build the exact same source, where skate builds with default > buildfarm settings, while snapper builds with the settings actually used by > Debian source packages. TBH, I don't find that particularly important ... especially not for HEAD builds, where building a many-hours-old snapshot is pretty much in the category of "why bother?". On the whole, I think building from the latest available source is the most useful policy. If there's some platform- or configuration-specific issue, it usually takes more than one build cycle for us to notice it anyway, so that ensuring two animals have exactly comparable builds at any given instant isn't very helpful. regards, tom lane
Re: "ago" times on buildfarm status page
On 8/27/19 10:15 AM, Tom Lane wrote: > "Tom Turelinckx" writes: >> On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote: >>> I think this is the problem: >>> 'scmrepo' => '/home/pgbf/pgmirror.git', >>> Probably this isn't updated often enough. It probably has little to do with >>> the clock settings. >> The configuration is intentionally like that. I specifically configured >> skate and snapper to build the exact same source, where skate builds with >> default buildfarm settings, while snapper builds with the settings actually >> used by Debian source packages. > TBH, I don't find that particularly important ... especially not for HEAD > builds, where building a many-hours-old snapshot is pretty much in the > category of "why bother?". On the whole, I think building from the latest > available source is the most useful policy. If there's some platform- > or configuration-specific issue, it usually takes more than one build > cycle for us to notice it anyway, so that ensuring two animals have exactly > comparable builds at any given instant isn't very helpful. > > Yeah, point. snapper seems the more important box here. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
range bug in resolve_generic_type?
Hello,

I was looking at resolve_generic_type to add anymultirange support, and the range logic doesn't seem correct to me. This function takes 3 type Oids:

- declared_type is the declared type of a function parameter whose actual type it would like to deduce.
- context_{declared,actual}_type are the {declared,actual} types of another parameter, as a hint.

Here is the logic in pseudocode:

    if (declared_type == ANYARRAYOID)
    {
        if (context_declared_type == ANYARRAYOID)
            return the array type
        else if (context_declared_type == ANYELEMENTOID ||
                 context_declared_type == ANYNONARRAYOID ||
                 context_declared_type == ANYENUMOID ||
                 context_declared_type == ANYRANGEOID ||
                 context_declared_type == ANYMULTIRANGEOID)
            return an array of those things
    }
    else if (declared_type == ANYELEMENTOID ||
             declared_type == ANYNONARRAYOID ||
             declared_type == ANYENUMOID ||
             declared_type == ANYRANGEOID)
    {
        if (context_declared_type == ANYARRAYOID)
            return the element type
        else if (context_declared_type == ANYRANGEOID)
            return the element type
        else if (context_declared_type == ANYELEMENTOID ||
                 context_declared_type == ANYNONARRAYOID ||
                 context_declared_type == ANYENUMOID)
            return the actual type
    }
    else
    {
        return declared_type    // since it's not polymorphic
    }

It seems to me that these inputs would give invalid results:

resolve_generic_type(ANYARRAYOID, x, ANYRANGEOID) - this will return an array of the *range type*, but that contradicts the normal relationship between anyelement and anyrange. It should return an array of the range's element type.

resolve_generic_type(ANYRANGEOID, x, ANYRANGEOID) - this will return the known range's *element* type, but it should return the known range.

Fortunately we never call the function in either of those ways. The only call site I could find is utils/fmgr/funcapi.c, and it only calls it like this:

    resolve_generic_type(ANYELEMENTOID, anyarray_type, ANYARRAYOID)
    resolve_generic_type(ANYELEMENTOID, anyrange_type, ANYRANGEOID)
    resolve_generic_type(ANYARRAYOID, anyelement_type, ANYELEMENTOID)

So I'm curious what I should do about all that, if anything. I'm happy to fix it (although I'm not sure how I'd test my changes), but it seems worth asking first. The first case in particular we *could* use to guess an anyarray's type if we wanted to, so I could add that to funcapi.c and then probably invent some scenario to test it. For the second case, I'm guessing if a function has the *same* polymorphic parameter we probably make an inference without using resolve_generic_type, since they should obviously match. But does anyone here have a suggestion?

Thanks!
Paul
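To make the two problematic calls concrete, here is a hypothetical test fragment (not something in the tree, and the expected results are the ones argued for above, not the current behavior), assuming int4range as the known context type:

    #include "postgres.h"
    #include "catalog/pg_type_d.h"
    #include "parser/parse_coerce.h"

    static void
    check_generic_resolution(void)
    {
        /*
         * Known anyrange argument of type int4range; resolving a declared
         * anyarray currently yields int4range[], though by the
         * anyelement/anyrange relationship one could argue for int4[].
         */
        Oid array_result = resolve_generic_type(ANYARRAYOID, INT4RANGEOID, ANYRANGEOID);

        /*
         * Resolving a declared anyrange against a known anyrange arguably
         * should return int4range itself, but the current code returns the
         * range's element type.
         */
        Oid range_result = resolve_generic_type(ANYRANGEOID, INT4RANGEOID, ANYRANGEOID);

        elog(NOTICE, "anyarray resolves to %u, anyrange resolves to %u",
             array_result, range_result);
    }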
Re: old_snapshot_threshold vs indexes
Thomas Munro writes: > On Tue, Aug 27, 2019 at 1:54 PM Tom Lane wrote: >> +1. That fix is also back-patchable, which adding fields to relcache >> entries would not be. > There is a fly in the ointment: REL9_6_STABLE's copy of > RelationHasUnloggedIndex() is hardcoded to return true for hash > indexes (see commit 2cc41acd8). True, in 9.6 hash indexes *were* effectively unlogged, so that the code actually did something in that branch. Given the lack of bug reports traceable to this, I wouldn't feel too bad about leaving it alone in 9.6. > However, I now see that there isn't a buffer content lock deadlock > risk here after all, because we don't reach RelationHasUnloggedIndex() > if IsCatalogRelation(rel). That reminds me of commit 4fd05bb55b4. It > still doesn't seem like a great idea to be doing catalog access while > holding the buffer content lock, though. Yeah, I'm not convinced that that observation means the problem is unreachable. Probably does make it harder to hit a deadlock, but if you mix a few VACUUMs and untimely cache flushes into the equation, I feel like one could still happen. > So I think we need to leave 9.6 as is, and discuss how far back to > back-patch the attached. It could go back to 10, but perhaps we > should be cautious and push it to master only for now, if you agree > with my analysis of the deadlock thing. I'd vote for back-patching to 10. Even if there is in fact no deadlock hazard, you've clearly demonstrated a significant performance hit that we're taking for basically no reason. In the larger picture, the commit this reminds me of is b04aeb0a0. I'm wondering if we could add some assertions to the effect of "don't initiate relcache or syscache lookups while holding a buffer lock". It would be relatively easy to do that if we could make the rule be "... while holding any LWLock", but I suspect that that would break some legitimate cases. regards, tom lane
Re: range bug in resolve_generic_type?
Paul A Jungwirth writes: > I was looking at resolve_generic_type to add anymultirange support, > and the range logic doesn't seem correct to me. Hmmm... > resolve_generic_type(ANYARRAYOID, x, ANYRANGEOID) - this will return > an array of the *range type*, but that contracts the normal > relationship between anyelement and anyrange. It should return an > array of the range's element type. I seem to recall that we discussed this exact point during development of the range feature, and concluded that this was the behavior we wanted, ie, treat anyrange like a scalar for this purpose. Otherwise there isn't any way to use a polymorphic function to build an array of ranges, and that seemed more useful than being able to build an array of the range elements. Jeff might remember more here. > resolve_generic_type(ANYRANGEOID, x, ANYRANGEOID) - this will return > the known range's *element* type, but it should return the known > range. Yeah, that seems like a flat-out bug. > Fortunately we never call the function in either of those ways. Wouldn't it depend on the signature+result type of the user-defined function we're dealing with? (That is, calls with constant argument types are probably not the interesting ones.) regards, tom lane
Re: range bug in resolve_generic_type?
On Tue, Aug 27, 2019 at 8:23 AM Tom Lane wrote:
> > resolve_generic_type(ANYARRAYOID, x, ANYRANGEOID) - this will return
> > an array of the *range type*, but that contracts the normal
> > relationship between anyelement and anyrange. It should return an
> > array of the range's element type.
>
> I seem to recall that we discussed this exact point during development
> of the range feature, and concluded that this was the behavior we
> wanted, ie, treat anyrange like a scalar for this purpose. Otherwise
> there isn't any way to use a polymorphic function to build an array
> of ranges

Well, I don't think that works anyway. At least I couldn't get it to work here:
https://www.postgresql.org/message-id/CA%2BrenyVOjb4xQZGjdCnA54-1nzVSY%2B47-h4nkM-EP5J%3D1z%3Db9w%40mail.gmail.com

But also check_generic_type_consistency works the way I'm saying:

- if anyrange = r then anyelement = elemOf(r)
- if anyarray = a then anyelement = elemOf(a)
- elemOf(r) = elemOf(a)

So resolve_generic_type should agree with that, right?

Also, I'm interested in adding not just anymultirange but also anyrangearray, which *would* let you have polymorphic arrays-of-ranges. (I thought I would need anyrangearray for a multirange constructor, but actually now I think I might not need it---maybe. But still it seems like a helpful thing.)

> > Fortunately we never call the function in either of those ways.
>
> Wouldn't it depend on the signature+result type of the user-defined
> function we're dealing with? (That is, calls with constant argument
> types are probably not the interesting ones.)

I suppose an extension could call it (although it seems unlikely). But I couldn't find anywhere in the Postgres code that doesn't call it with hardcoded arguments. (I certainly could have missed something though.)

Thanks!
Paul
Re: pgbench - implement strict TPC-B benchmark
> On Mon, Aug 5, 2019 at 10:46 PM Fabien COELHO wrote: > > > The index builds are done serially. The vacuum could be replaced by COPY > > FREEZE. > > Well, it could be added? While doing benchmarking using different tools, including pgbench, I found it useful as a temporary hack to add copy freeze and maintenance_work_mem options (the last one not as an environment variable, just as a SET issued beforehand, although I'm not sure that's the best idea). Is it similar to what you were talking about? v1-0001-pgbench-load-options.patch Description: Binary data
basebackup.c's sendFile() ignores read errors
While reviewing a proposed patch to basebackup.c this morning, I found myself a bit underwhelmed by the quality of the code and comments in basebackup.c's sendFile(). I believe it's already been pointed out that the retry logic here is not particularly correct, and the comments demonstrate a pretty clear lack of understanding of how the actual race conditions that are possible here might operate. However, I then noticed another problem which I think is significantly more serious: this code totally ignores the possibility of a failure in fread(). It just assumes that if fread() fails to return a positive value, it's reached the end of the file, and if that happens, it just pads out the rest of the file with zeroes. So it looks to me like if fread() encountered, say, an I/O error while trying to read a data file, and if that error were on the first data block, then the entire contents of that file would be replaced with zero bytes in the backup, and no error or warning of any kind would be given to the user. If it hit the error later, everything from the point of the error onward would just get replaced with zero bytes. To be clear, I think it is fine for basebackup.c to fill out the rest of the expected length with zeroes if in fact the file has been truncated during the backup; recovery should fix it. But recovery is not going to fix data that got eaten because EIO was encountered while copying a file. The logic that rereads a block appears to have the opposite problem. Here, it will detect an error, but it will also fail in the face of a concurrent truncation, which it shouldn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
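For illustration, here is a minimal sketch of the kind of check being argued for: treating a short read as EOF only when ferror() reports no I/O error. The variable names are borrowed from sendFile(), but the code itself is an assumption about what a fix could look like, not the actual patch:

/*
 * Hedged sketch only: distinguish an fread() failure from a genuine EOF
 * (e.g. a concurrent truncation, which recovery will fix later).
 */
size_t		cnt;

cnt = fread(buf, 1, sizeof(buf), fp);
if (cnt == 0)
{
	if (ferror(fp))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not read file \"%s\": %m", readfilename)));

	/* plain EOF: fall through to the zero-padding path */
}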
Re: block-level incremental backup
On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke wrote: > [ patches ] Reviewing 0002 and 0003: - Commit message for 0003 claims magic number and checksum are 0, but that (fortunately) doesn't seem to be the case. - looks_like_rel_name actually checks whether it looks like a *non-temporary* relation name; suggest adjusting the function name. - The names do_full_backup and do_incremental_backup are quite confusing because you're really talking about what to do with one file. I suggest sendCompleteFile() and sendPartialFile(). - Is there any good reason to have 'refptr' as a global variable, or could we just pass the LSN around via function arguments? I know it's just mimicking startptr, but storing startptr in a global variable doesn't seem like a great idea either, so if it's not too annoying, let's pass it down via function arguments instead. Also, refptr is a crappy name (even worse than startptr); whether we end up with a global variable or a bunch of local variables, let's make the name(s) clear and unambiguous, like incremental_reference_lsn. Yeah, I know that's long, but I still think it's better than being unclear. - do_incremental_backup looks like it can never report an error from fread(), which is bad. But I see that this is just copied from the existing code which has the same problem, so I started a separate thread about that. - I think that passing cnt and blkindex to verify_page_checksum() doesn't look very good from an abstraction point of view. Granted, the existing code isn't great either, but I think this makes the problem worse. I suggest passing "int backup_distance" to this function, computed as cnt - BLCKSZ * blkindex. Then, you can fseek(-backup_distance), fread(BLCKSZ), and then fseek(backup_distance - BLCKSZ). - While I generally support the use of while and for loops rather than goto for flow control, a while (1) loop that ends with a break is functionally a goto anyway. I think there are several ways this could be revised. The most obvious one is probably to use goto, but I vote for inverting the sense of the test: if (PageIsNew(page) || PageGetLSN(page) >= startptr) break; This approach also saves a level of indentation for more than half of the function. - I am not sure that it's a good idea for sendwholefile = true to result in dumping the entire file onto the wire in a single CopyData message. I don't know of a concrete problem in typical configurations, but someone who increases RELSEG_SIZE might be able to overflow CopyData's length word. At 2GB the length word would be negative, which might break, and at 4GB it would wrap around, which would certainly break. See CopyData in https://www.postgresql.org/docs/12/protocol-message-formats.html To avoid this issue, and maybe some others, I suggest defining a reasonably large chunk size, say 1MB as a constant in this file someplace, and sending the data as a series of chunks of that size. - I don't think that the way concurrent truncation is handled is correct for partial files. Right now it just falls through to code which appends blocks of zeroes in either the complete-file or partial-file case. I think that logic should be moved into the function that handles the complete-file case. In the partial-file case, the blocks that we actually send need to match the list of block numbers we promised to send. We can't just send the promised blocks and then tack a bunch of zero-filled blocks onto the end that the file header doesn't know about. 
- For reviewer convenience, please use the -v option to git format-patch when posting and reposting a patch series. Using -v2, -v3, etc. on successive versions really helps. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
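To make the verify_page_checksum() suggestion above concrete, here is a minimal sketch of the proposed fseek/fread dance, with backup_distance computed as cnt - BLCKSZ * blkindex; the helper name and the error handling are assumptions, not code from the patch:

/*
 * Hedged sketch only: re-read the block that starts backup_distance bytes
 * behind the current read position, then restore the position so that the
 * caller's outer read loop is undisturbed.
 */
static bool
reread_block(FILE *fp, int backup_distance, char *pagebuf)
{
	/* step back to the start of the block in question */
	if (fseek(fp, -backup_distance, SEEK_CUR) != 0)
		return false;

	/* re-read the whole block */
	if (fread(pagebuf, 1, BLCKSZ, fp) != BLCKSZ)
		return false;

	/* the read advanced BLCKSZ bytes; move back to where we started */
	return fseek(fp, backup_distance - BLCKSZ, SEEK_CUR) == 0;
}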
Re: PostgreSQL and Real Application Testing (RAT)
On Tue, 27 Aug 2019 at 05:47, ROS Didier wrote: > Hi > > > > In my business, one of the things blocking the migration from Oracle to > PostgreSQL is not having the equivalent of Oracle Real Application Testing . > > This product captures a charge in production and replay it in a test > environment. > > this allows to know the impacts of a migration to a newer version, the > creation of an index.. > > is there an equivalent in the PostgreSQL community? > I used https://github.com/laurenz/pgreplay recently to re-execute the queries sent to a pg9.1 in a pg11. It was very useful to find queries that are affected by changes in default values of GUCs. Normally, a query that works in an old version will work in a new one; but this is useful to catch the few that don't, if any. -- Jaime Casanova www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Missing newline in pg_upgrade usage()
In 959f6d6a1821 it seems that I fat-fingered and missed a newline in the usage output, which makes --check become appended to -B rather than on its own line. The attached diff fixes that. cheers ./daniel pg_upgrade_help_check.diff Description: Binary data
Re: Missing newline in pg_upgrade usage()
Daniel Gustafsson writes: > In 959f6d6a1821 it seems that I fat-fingered and missed a newline in the usage > output, which makes --check become appended to -B rather than on its own line. > The attached diff fixes that. Roger, pushed. regards, tom lane
Re: row filtering for logical replication
Hi Euler, On 2019-02-03 13:14, Andres Freund wrote: On 2018-11-23 13:15:08 -0300, Euler Taveira wrote: Besides the problem presented by Hironobu-san, I'm doing some cleanup and improving docs. I also forget to declare pg_publication_rel TOAST table. Thanks for your review. As far as I can tell, the patch has not been refreshed since. So I'm marking this as returned with feedback for now. Please resubmit once ready. Do you have any plans for continuing working on this patch and submitting it again on the closest September commitfest? There are only a few days left. Anyway, I will be glad to review the patch if you do submit it, though I didn't yet dig deeply into the code. I've rebased recently the entire patch set (attached) and it works fine. Your tap test is passed. Also I've added a new test case (see 0009 attached) with real life example of bidirectional replication (BDR) utilising this new WHERE clause. This naive BDR is implemented using is_cloud flag, which is set to TRUE/FALSE on cloud/remote nodes respectively. Although almost all new tests are passed, there is a problem with DELETE replication, so 1 out of 10 tests is failed. It isn't replicated if the record was created with is_cloud=TRUE on cloud, replicated to remote; then updated with is_cloud=FALSE on remote, replicated to cloud; then deleted on remote. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom ae80e1616fb6374968a09e3c44f0abe59ebf3a87 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Fri, 9 Mar 2018 18:39:22 + Subject: [PATCH v2 1/9] Remove unused atttypmod column from initial table synchronization Since commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920, atttypmod was added but not used. The removal is safe because COPY from publisher does not need such information. --- src/backend/replication/logical/tablesync.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index 7881079e96..0a565dd837 100644 --- a/src/backend/replication/logical/tablesync.c +++ b/src/backend/replication/logical/tablesync.c @@ -647,7 +647,7 @@ fetch_remote_table_info(char *nspname, char *relname, StringInfoData cmd; TupleTableSlot *slot; Oid tableRow[2] = {OIDOID, CHAROID}; - Oid attrRow[4] = {TEXTOID, OIDOID, INT4OID, BOOLOID}; + Oid attrRow[3] = {TEXTOID, OIDOID, BOOLOID}; bool isnull; int natt; @@ -691,7 +691,6 @@ fetch_remote_table_info(char *nspname, char *relname, appendStringInfo(&cmd, "SELECT a.attname," " a.atttypid," - " a.atttypmod," " a.attnum = ANY(i.indkey)" " FROM pg_catalog.pg_attribute a" " LEFT JOIN pg_catalog.pg_index i" @@ -703,7 +702,7 @@ fetch_remote_table_info(char *nspname, char *relname, lrel->remoteid, (walrcv_server_version(wrconn) >= 12 ? "AND a.attgenerated = ''" : ""), lrel->remoteid); - res = walrcv_exec(wrconn, cmd.data, 4, attrRow); + res = walrcv_exec(wrconn, cmd.data, 3, attrRow); if (res->status != WALRCV_OK_TUPLES) ereport(ERROR, @@ -724,7 +723,7 @@ fetch_remote_table_info(char *nspname, char *relname, Assert(!isnull); lrel->atttyps[natt] = DatumGetObjectId(slot_getattr(slot, 2, &isnull)); Assert(!isnull); - if (DatumGetBool(slot_getattr(slot, 4, &isnull))) + if (DatumGetBool(slot_getattr(slot, 3, &isnull))) lrel->attkeys = bms_add_member(lrel->attkeys, natt); /* Should never happen. 
*/ base-commit: 9acda731184c1ebdf99172cbb19d0404b7eebc37 -- 2.19.1 From 362f2cc97745690ff4739b530f5ba95aea59be09 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Fri, 9 Mar 2018 17:37:36 + Subject: [PATCH v2 2/9] Store number of tuples in WalRcvExecResult It seems to be a useful information while allocating memory for queries that returns more than one row. It reduces memory allocation for initial table synchronization. While in it, since we have the number of columns, allocate only nfields for cstrs instead of MaxTupleAttributeNumber. --- .../replication/libpqwalreceiver/libpqwalreceiver.c | 9 ++--- src/backend/replication/logical/tablesync.c | 5 ++--- src/include/replication/walreceiver.h| 1 + 3 files changed, 9 insertions(+), 6 deletions(-) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 6eba08a920..846b6f89f1 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -878,6 +878,7 @@ libpqrcv_processTuples(PGresult *pgres, WalRcvExecResult *walres, errdetail("Expected %d fields, got %d fields.", nRetTypes, nfields))); + walres->ntuples = PQntuples(pgres); walres->tuplestore = tuplestore_begin_heap(true, false, work_mem); /* Create tuple descriptor
Re: doc: clarify "pg_signal_backend" default role
Ian Barwick writes: > Currently the documentation for the default role "pg_signal_backend" states, > somewhat ambiguously, "Send signals to other backends (eg: cancel query, > terminate)", > giving the impression other signals (e.g. SIGHUP) can be sent too, which is > currently not the case. > Attached patch clarifies this, adds a descriptive paragraph (similar to what > the other default roles have) and a link to the "Server Signaling Functions" > section. Pushed with minor tweaking. (Note: patches are less likely to fall through the cracks if you add them to the commitfest page.) regards, tom lane
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
On Sun, Aug 25, 2019 at 9:35 PM Thomas Munro wrote: > > On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby wrote: > > On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote: > > > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan wrote: > > > > I was reminded of this issue from last year, which also appeared to > > > > involve BufFileClose() and a double-free: > > > > > > > > https://postgr.es/m/87y3hmee19@news-spur.riddles.org.uk > > > > > > > > That was a BufFile that was under the control of a tuplestore, so it > > > > was similar to but different from your case. I suspect it's related. > > > > > > Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c: > > > it always nukes its pointer after calling BufFileFlush(), so it > > > shouldn't be capable of calling it twice for the same pointer, unless > > > we have two copies of that pointer somehow. > > > > > > Merlin's reported a double-free apparently in ExecHashJoin(), not > > > ExecHashJoinNewBatch() like this report. Unfortunately that tells us > > > very little. > > Here's another one: > > https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org > > Hmm. Also on RHEL/CentOS 6, and also involving sorting, hashing, > BufFileClose() but this time the glibc double free error is in > repalloc(). > > And another one (repeatedly happening): > > https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com > > Also on RHEL/CentOS 6, this time a sort in once case and a hash join > in another case. > > Of course it's entirely possible that we have a bug here and I'm very > keen to find it, but I can't help noticing the common factor here is > that they're all running ancient RHEL 6.x releases, except Merlin who > didn't say. Merlin? Just noticed this. redhat-release: "Red Hat Enterprise Linux Server release 6.9 (Santiago)" merlin
Re: Zedstore - compressed in-core columnar storage
On Tue, Aug 27, 2019 at 12:03 AM Ashutosh Sharma wrote: > My point is, once we find the leaf page containing the given tid, we go > through each item in the page until we find the data corresponding to the > given tid which means we kind of perform a sequential scan at the page > level. I'm referring to the below loop in zsbt_attr_scan_fetch_array(). > > for (off = FirstOffsetNumber; off <= maxoff; off++) > { > ItemId iid = PageGetItemId(page, off); > ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) > PageGetItem(page, iid); > > if (item->t_endtid <= nexttid) > continue; > > if (item->t_firsttid > nexttid) > break; > > But that's not true for IndexScan in case of heap table because there the > index tuple contains the exact physical location of tuple in the heap. So, > there is no need to scan the entire page. > You are correct that we currently go through each item in the leaf page that contains the given tid, specifically, the logic to retrieve all the attribute items inside a ZSAttStream is now moved to decode_attstream() in the latest code, and then in zsbt_attr_fetch() we again loop through each item we previously retrieved from decode_attstream() and look for the given tid. One optimization we can to is to tell decode_attstream() to stop decoding at the tid we are interested in. We can also apply other tricks to speed up the lookups in the page, for fixed length attribute, it is easy to do binary search instead of linear search, and for variable length attribute, we can probably try something that we didn't think of yet. 1) In zsundo_insert_finish(), there is a double call to > BufferGetPage(undobuf); Is that required ? > Fixed, thanks! 2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called > twice? I'm referring to the below code. > > if (fetch_proj->num_proj_atts == 0) > { > > > zsbt_tid_begin_scan(rel, tid, tid + 1, > snapshot, > &fetch_proj->tid_scan); > fetch_proj->tid_scan.serializable = true; > > for (int i = 1; i < fetch_proj->num_proj_atts; i++) > { > int attno = fetch_proj->proj_atts[i]; > > zsbt_attr_begin_scan(rel, reldesc, attno, > &fetch_proj->attr_scans[i - 1]); > } > MemoryContextSwitchTo(oldcontext); > > zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, > &fetch_proj->tid_scan); > } > I removed the second call, thanks! > Also, for all types of update operation (be it key or non-key update) we > create a new tid for the new version of tuple. Can't we use the tid > associated with the old tuple for the cases where there is no concurrent > transactions to whom the old tuple is still visible. > Zedstore currently implement update as delete+insert, hence the old tid is not reused. We don't store the tuple in our UNDO log, and we only store the transaction information in the UNDO log. Reusing the tid of the old tuple means putting the old tuple in the UNDO log, which we have not implemented yet. Thanks for reporting, this is very helpful! Patches are welcome as well!
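As a concrete illustration of the binary-search idea mentioned above for fixed-length attributes, here is a minimal sketch. It assumes the decoded items are available as a sorted array; that representation and the helper itself are hypothetical, not actual zedstore code:

/*
 * Hedged sketch only: binary search over attribute items ordered by tid
 * range, replacing the linear scan quoted earlier.  ZSAttributeArrayItem,
 * t_firsttid and t_endtid come from the quoted code; everything else is
 * an assumption.
 */
static ZSAttributeArrayItem *
zsbt_attr_find_item(ZSAttributeArrayItem **items, int nitems, zstid nexttid)
{
	int			lo = 0;
	int			hi = nitems - 1;

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;
		ZSAttributeArrayItem *item = items[mid];

		if (item->t_endtid <= nexttid)
			lo = mid + 1;		/* item ends before the wanted tid */
		else if (item->t_firsttid > nexttid)
			hi = mid - 1;		/* item starts after the wanted tid */
		else
			return item;		/* t_firsttid <= nexttid < t_endtid */
	}

	return NULL;				/* no item covers nexttid */
}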
Re: PostgreSQL and Real Application Testing (RAT)
On Tue, Aug 27, 2019 at 3:47 AM ROS Didier wrote: > Hi > > > > In my business, one of the things blocking the migration from Oracle to > PostgreSQL is not having the equivalent of Oracle Real Application Testing . > > This product captures a charge in production and replay it in a test > environment. > > this allows to know the impacts of a migration to a newer version, the > creation of an index.. > > is there an equivalent in the PostgreSQL community? > > if not, do you think it's technically possible to do it ? > > who would be interested in this project ? > Replaying workload might or might not apply well to your case. There are several major difficulties if you want to replay workload: 1) How to "record" workload. You need to write all your queries to the Postgres log. Three problems here: 1a) pgreplay expects log_statements = 'all' while you might prefer dealing with log_min_duration_statement instead. This is a minor issue though, quite easy to solve with preprocessing. 1b) under heavy load, log_min_duration_statement = 0 (or log_statements = 'all') will lead to performance degradation or even downtime. Possible solutions are: write to memory, or don't write at all but send over the network. 1c) ideally, recoding just queries is not enough. To replay workload "as is", we need to replay queries with known plans. There is no easy solution to this problem in the Postgres ecosystem yet. A couple of additional points regarding item 1b and 1c. In Postgres 12, there is a cool new capability: sampling for query logging, implemented by Adrien Nayrat https://commitfest.postgresql.org/20/1691/ WIth this, it will be possible to fully log, say, 5% of all transactions and use it for replaying. Moreover, with auto_explain, it will be possible to have plans! Open questions are: (a) how to determine, if N% is enough, and (b) how to replay with specified plans. [If anyone is interested in working in this direction – please reach out to me.] 2) Issues with replaying itself. I can highlight at least two problems here: 2a) pgreplay might be not enough for your workload, it doesn't scale well. If interested, look at its analog written in Go, https://github.com/gocardless/pgreplay-go, but this is quite a young project. 2b) Postgres logs have millisecond precision (if you switched from %t to %m in log_line_prefix), this might be not enough. There is a patch to microsecond precision from David Fetter https://www.postgresql.org/message-id/flat/20181023185050.GE6049%40fetter.org, but that conversation hasn't yet led to commit. Another approach you might be interested in -- workload simulation. This is what we (Postgres.ai) now used in most times when building "lab" environments for our clients. The idea is as follows: - carefully analyze workload using pg_stat_statements (here, our open-source tool called "postgres-checkup" https://gitlab.com/postgres-ai/postgres-checkup might be helpful, see reports in section K), - take the most resource-consuming query groups (Top-N ordered by total_time), - create a set of files with statements with randomly filled parameters (won't work for most cases, I discuss restrictions below), - use pgbench, feed workload files to it, using multiple -f options, with balancing (-f filename@XX, where XX is to be taked from pg_statements_analysis, but this time, "calls" and their ratio in the whole workload will be needed -- again, postgres-checkup can help here). - run, analyze, compare behavior. 
Restrictions of this approach are obvious: - doesn't work well if most of your transactions have multiple statements, - in many cases, randomization is hard (not obvious how to organize; synthetic approach is far from real data distribution in storage and workload; etc), - the approach requires a significant amount of manual efforts. However, the "workload simulation" approach is an extremely helpful approach in many cases, helping with change management. It doesn't require anything that might negatively affect your production workload, it utilizes pgbench (or any other tool) which is reliable, has great features and scales well. You might be interested in looking at our tool that we built to conduct a huge amount of DB experiments, Nancy CLI https://gitlab.com/postgres-ai/nancy. It supports both "workload replay" method (with pgreplay) and "workload simulation" (with pgbench). PM me if you're interested in discussing details. Thanks, Nik
Re: PostgreSQL and Real Application Testing (RAT)
On Tue, 27 Aug 2019 at 19:33, Nikolay Samokhvalov wrote: > On Tue, Aug 27, 2019 at 3:47 AM ROS Didier wrote: > >> Hi >> >> >> >> In my business, one of the things blocking the migration from Oracle to >> PostgreSQL is not having the equivalent of Oracle Real Application Testing . >> >> This product captures a charge in production and replay it in a test >> environment. >> >> this allows to know the impacts of a migration to a newer version, the >> creation of an index.. >> >> is there an equivalent in the PostgreSQL community? >> >> if not, do you think it's technically possible to do it ? >> >> who would be interested in this project ? >> > > Replaying workload might or might not apply well to your case. > > There are several major difficulties if you want to replay workload: > > 1) How to "record" workload. You need to write all your queries to the > Postgres log. Three problems here: > 1a) pgreplay expects log_statements = 'all' while you might prefer > dealing with log_min_duration_statement instead. This is a minor issue > though, quite easy to solve with preprocessing. > 1b) under heavy load, log_min_duration_statement = 0 (or log_statements > = 'all') will lead to performance degradation or even downtime. Possible > solutions are: write to memory, or don't write at all but send over the > network. > 1c) ideally, recoding just queries is not enough. To replay workload "as > is", we need to replay queries with known plans. There is no easy solution > to this problem in the Postgres ecosystem yet. > > why? i prefer queries to take advantage of new plans for example if i'm migrating from 9.5 to 9.6+ i would prefer that, when replaying, the queries use parallel plans so i quickly get if that would somehow be a problem (for example by using more cpu than before) -- Jaime Casanova www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: doc: clarify "pg_signal_backend" default role
On 8/28/19 7:04 AM, Tom Lane wrote: Ian Barwick writes: Currently the documentation for the default role "pg_signal_backend" states, somewhat ambiguously, "Send signals to other backends (eg: cancel query, terminate)", giving the impression other signals (e.g. SIGHUP) can be sent too, which is currently not the case. Attached patch clarifies this, adds a descriptive paragraph (similar to what the other default roles have) and a link to the "Server Signaling Functions" section. Pushed with minor tweaking. Thanks! (Note: patches are less likely to fall through the cracks if you add them to the commitfest page.) Yup, though I was intending to add that one together with a couple of related minor doc patches to the next CF. Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH] Make configuration file "include" directive handling more robust
On 8/25/19 4:39 AM, Tom Lane wrote: Ian Barwick writes: On 7/17/19 5:34 PM, Kyotaro Horiguchi wrote:> Hello. I don't think this is new to 12. No, though I'm not sure how much this would be seen as a bugfix and how far back it would be sensible to patch. I think this is worth considering as a bugfix; although I'm afraid we can't change the signature of ParseConfigFile/ParseConfigFp in released branches, since extensions could possibly be using those. That limits what we can do --- but it's still possible to detect direct recursion, which seems like enough to produce a nice error message in typical cases. Makes sense. I concur with Kyotaro-san that disallow-empty-include-directives.v1.patch seems a bit brute-force, but where I would put the checks is in ParseConfigFile and ParseConfigDirectory. Also, I don't agree with the goals of prevent-disallowed-includes.patch. I'm utterly not on board with breaking use of "include" in extension files, for instance; while that may not be documented, it works fine, and maybe somebody out there is relying on it. I couldn't for the life of me think of any reason for using it. But if there's undocumented functionality we think someone might be using, shouldn't that be documented somewhere, if only as a note in the code to prevent its accidental removal at a later date? Likewise, while "include" in pg.auto.conf is not really considered supported, I don't see the point of going out of our way to break the historical behavior. The amusing thing about that of course is that the include directive will disappear the next time ALTER SYSTEM is run and the values from the included file will appear in pg.auto.conf, which may cause some headscratching. But I guess hasn't been an actual real-world issue so far. That leads me to propose the attached simplified patch. While I haven't actually tried, I'm pretty sure this should back-patch without trouble. Ah, I see it's been applied already, thanks! Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
On Mon, Aug 26, 2019 at 12:01 PM Tom Lane wrote: > > Justin Pryzby writes: > > On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote: > >> However ... there is some pretty interesting info at > >> https://bugzilla.redhat.com/show_bug.cgi?id=1338673 > >> suggesting that compiling with a late-model gcc against older RHEL6 > >> headers could result in bad code. I wonder whether the reporters' > >> servers were built using such a configuration. (Although the linkage, > >> if any, to this report still wouldn't be very clear.) > > > I can tell it was compiled using > > version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) > > 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit > > Ah, that appears to be the default compiler for RHEL6, so that theory > is out the window. It's still interesting that we're only seeing this > reported from RHEL6 ... maybe there's something specific to the code > that this gcc version generates? FWIW, I've got the same compiler (which is natural, we are likely using the same packaging): PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit merlin
Re: Statement timeout in pg_rewind
On Tue, Aug 27, 2019 at 10:45:27AM +0200, Alexander Kukushkin wrote: > Done, please see the next version attached. I have made the new error message consistent with run_simple_query, to avoid extra work for translators and because the SQL query makes it possible to know immediately which code path is involved, and then applied the fix down to 9.5 where pg_rewind has been added. Please note that your patch had a warning, as "result" is not needed in run_simple_command(). idle_in_transaction_session_timeout only applies to 9.6 and newer versions. lock_timeout (imagine a concurrent lock on pg_class for example) and statement_timeout can cause issues, but the full set gets disabled as your patch did and as mentioned upthread. -- Michael signature.asc Description: PGP signature
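For readers following along, here is a minimal sketch of what disabling that full set of timeouts can look like on pg_rewind's connection; run_simple_command() is the helper from the patch under discussion, and its exact signature is assumed here:

/*
 * Hedged sketch only: neutralize server-side timeouts on the connection
 * pg_rewind uses against a live source server.
 */
static void
disable_connection_timeouts(void)
{
	run_simple_command("SET statement_timeout = 0");
	run_simple_command("SET lock_timeout = 0");

	/* this one only exists on 9.6 and newer servers */
	run_simple_command("SET idle_in_transaction_session_timeout = 0");
}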
Re: Cleanup isolation specs from unused steps
On Tue, Aug 27, 2019 at 07:05:50PM +0530, Asim R P wrote: > Thank you for the feedback. I've changed patch 0002 accordingly, please > take another look: > https://www.postgresql.org/message-id/CANXE4TdvSi7Yia_5sV82%2BMHf0WcUSN9u6_X8VEUBv-YStphd%3DQ%40mail.gmail.com Thanks! Let's move the discussion on the other thread then. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote: > Now the algorithm is the following: > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. > > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I think that the way this works within nbtsplitloc.c is too complicated. In v5, the only thing that nbtsplitloc.c knew about deduplication was that it could be sure that suffix truncation would at least make a posting list into a single heap TID in the worst case. This consideration was mostly about suffix truncation, not deduplication, which seemed like a good thing to me. _bt_split() and _bt_findsplitloc() should know as little as possible about posting lists. Obviously it will sometimes be necessary to deal with the case where a posting list is about to become too big (i.e. it's about to go over BTMaxItemSize()), and so must be split. Less often, a page split will be needed because of one of these posting list splits. These are two complicated areas (posting list splits and page splits), and it would be a good idea to find a way to separate them as much as possible. Remember, nbtsplitloc.c works by pretending that the new item that cannot fit on the page is already on its own imaginary version of the page that *can* fit the new item, along with everything else from the original/actual page. That gets *way* too complicated when it has to deal with the fact that the new item is being merged with an existing item. Perhaps nbtsplitloc.c could also "pretend" that the new item is always a plain tuple, without knowing anything about posting lists. Almost like how it worked in v5. We always want posting lists to be as close to the BTMaxItemSize() size as possible, because that helps with space utilization. In v5 of the patch, this was what happened, because, in effect, we didn't try to do anything complicated with the new item. This worked well, apart from the crash safety issue. Maybe we can simulate the v5 approach, giving us the best of all worlds (good space utilization, simplicity, and crash safety). Something like this: * Posting list splits should always result in one posting list that is at or just under BTMaxItemSize() in size, plus one plain tuple to its immediate right on the page. This is similar to the more common case where we cannot add additional tuples to a posting list due to the BTMaxItemSize() restriction, and so end up with a single tuple (or a smaller posting list with the same value) to the right of a BTMaxItemSize()-sized posting list tuple. I don't see a reason to split a posting list in the middle -- we should always split to the right, leaving the posting list as large as possible. * When there is a simple posting list split, with no page split, the logic required is fairly straightforward: We rewrite the posting list in-place so that our new item goes wherever it belongs in the existing posting list on the page (we memmove() the posting list to make space for the new TID, basically). The old last/rightmost TID in the original posting list becomes a new, plain tuple. We may need a new WAL record for this, but it's not that different to a regular leaf page insert. * When this happens to result in a page split, we then have a "fake" new item -- the right half of the posting list that we split, which is always a plain item. 
Obviously we need to be a bit careful with the WAL logging, but the space accounting within _bt_split() and _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can work like it did in v5, when the only thing it knew about posting lists was that _bt_truncate() always removes them, maybe leaving a single TID behind in the new high key. (Note also that it's not okay to remove the conservative assumption about at least having space for one heap TID within _bt_recsplitloc() -- that needs to be restored to its v5 state in the next version of the patch.) Because deduplication is lazy, there is little value in doing deduplication of the new item (which may or may not be the fake new item). The nbtsplitloc.c logic will "trap" duplicates on the same page today, so we can just let deduplication of the new item happen at a later time. _bt_split() can almost pretend that posting lists don't exist, and nbtsplitloc.c needs to know nothing about posting lists (apart from the way that _bt_truncate() behaves with posting lists). We "lie" to _bt_findsplitloc(), and tell it that the new item is our fake new item -- it doesn't do anything that will be broken by that lie, because it doesn't care about the actual content of posting lists. And, we can fix the "fake new item is not actually real new item" issue at one point within _bt_split(), just as we're about to WAL log. What do you think of that approach? -- Peter Geoghegan
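To visualize the in-place posting list split described above, here is a minimal sketch of the TID shuffling involved; the helper name, the bare TID-array representation, and the output convention are all hypothetical, and WAL logging is ignored entirely:

/*
 * Hedged sketch only: insert newtid into a posting list that is already at
 * its maximum size.  Exactly one TID is pushed out to the right and becomes
 * a plain (non-posting) tuple; the posting list keeps ntids entries.
 */
static void
posting_list_split(ItemPointerData *tids, int ntids,
				   ItemPointer newtid, ItemPointer plain_tid)
{
	int			pos;

	/* find the insertion position for the incoming heap TID */
	for (pos = 0; pos < ntids; pos++)
	{
		if (ItemPointerCompare(newtid, &tids[pos]) < 0)
			break;
	}

	if (pos == ntids)
	{
		/* new TID sorts after everything: it simply becomes the plain tuple */
		*plain_tid = *newtid;
		return;
	}

	/* the old rightmost TID is pushed out and becomes the plain tuple */
	*plain_tid = tids[ntids - 1];

	/* shift the tail right by one slot and place the new TID */
	memmove(&tids[pos + 1], &tids[pos],
			(ntids - 1 - pos) * sizeof(ItemPointerData));
	tids[pos] = *newtid;
}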
Re: Re: Email to hackers for test coverage
On Tue, Aug 27, 2019 at 03:57:20PM +0800, movead...@highgo.ca wrote: > I think your way is much better, so I change the patch and it is > 'regression_20190827.patch' now. Thanks for the new patch, I have committed the part for float4. > There are code lines related to NULL values in > ApplySortAbbrevFullComparator(), but I think the code lines for > comparing a NULL and a NOT NULL can be never reached, because it is > handled in ApplySortComparator() which is called before > ApplySortAbbrevFullComparator(). So I think it is no use to INSERT > a NULL value. But I am not sold on that part yet, for three reasons: - numeric is not a test suite designed for sorting, and hijacking it to do so is not a good approach. - it would be good to get coverage for the two extra code paths while we are on it with NULL datums. - There is no need for two INSERT queries, I am getting the same coverage with only one. Please note that I have not looked in detail at where we could put that, but perhaps Robert and Peter G, who worked on 4ea51cd to add support for abbreviated keys, have some ideas, so I am adding them in CC for input. -- Michael signature.asc Description: PGP signature
Re: REINDEX filtering in the backend
On Thu, Jul 11, 2019 at 11:14:20PM +0200, Julien Rouhaud wrote: > I didn't want to spend too much time enjoying bison and adding new > unreserved keywords, so for now I just implemented this syntax to > start a discussion for this feature in the next commitfest: > > REINDEX ( FILTER = COLLATION ) [...] > > The FILTER clause can be used multiple times, each one is OR-ed with > the ReindexStmt's option, so we could easily add a LIBC, ICU and other > filters, also making COLLATION (or more realistically a better new > keyword) an alias for (LIBC | ICU) for instance. I would prefer keeping the interface simple with only COLLATION, so as only collation sensitive indexes should be updated, including icu and libc ones. Or would it be useful to have the filtering for both as libicu can break things similarly to glibc in an update still a breakage on one or the other would not happen at the same time? I don't know enough of libicu regarding that, eager to learn. In which case, wouldn't it be better to support that from the start? > The filtering is done at table level (with and without the > concurrently option), so SCHEMA, DATABASE and SYSTEM automatically > benefit from it. If this clause is used with a REINDEX INDEX, the > statement errors out, as I don't see a valid use case for providing a > single index name and asking to possibly filter it at the same time. Supporting that case would not be that much amount of work, no? > I also added some minimal documentation and regression tests. I'll > add this patch to the next commitfest. > > [1] > https://www.postgresql.org/message-id/7140716c-679e-a0b9-a273-b201329d8891%402ndquadrant.com +if ((stmt->options & REINDEXOPT_ALL_FILTERS) != 0) +elog(ERROR, "FILTER clause is not compatible with REINDEX INDEX"); [...] + discard indexes whose ordering does not depend on a collation. Note that + the FILTER option is not compatible with REINDEX + SCHEMA. Why do you have both limitations? I think that it would be nice to be able to do both, generating an error for REINDEX INDEX if the index specified is not compatible, and a LOG if the index is not filtered out when a list is processed. Please note that elog() cannot be used for user-facing failures, only for internal ones. +REINDEX (VERBOSE, FILTER = COLLATION) TABLE reindex_verbose; +-- One column, not depending on a collation In order to make sure that a reindex has been done for a given entry with the filtering, an idea is to save the relfilenode before the REINDEX and check it afterwards. That would be nice to make sure that only the wanted indexes are processed, but it is not possible to be sure of that based on your tests, and some tests should be done on relations which have collation-sensitive as well as collation-insensitive indexes. + index = index_open(indexOid, AccessShareLock); + numAtts = index->rd_index->indnatts; + index_close(index, AccessShareLock); Wouldn't it be better to close that after doing the scan? Nit: I am pretty sure that this should be indented. Could you add tests with REINDEX CONCURRENTLY? -- Michael signature.asc Description: PGP signature
Re: REINDEX filtering in the backend
On Wed, Aug 28, 2019 at 02:02:08PM +0900, Michael Paquier wrote: > + index = index_open(indexOid, AccessShareLock); > + numAtts = index->rd_index->indnatts; > + index_close(index, AccessShareLock); > Wouldn't it be better to close that after doing the scan? > > Nit: I am pretty sure that this should be indented. > > Could you add tests with REINDEX CONCURRENTLY? Bonus: support for reindexdb should be added. Let's not forget about it. -- Michael signature.asc Description: PGP signature
Re: pgbench - implement strict TPC-B benchmark
Hello Dmitry, Well, it could be added? While doing benchmarking using different tools, including pgbench, I found it useful as a temporary hack to add copy freeze and maintenance_work_mem options (the last one not as an environment variable, just as a SET issued beforehand, although I'm not sure that's the best idea). Is it similar to what you were talking about? About this patch: Concerning the --maintenance... option, ISTM that there could rather be a generic way to provide "set" settings, not a specific option for a specific parameter with a specific unit. Moreover, ISTM that it only needs to be set once on a connection, not per command. I'd suggest something like: --connection-initialization '...' That would be issued when a connection is started, for any query, and the effect would be achieved with: pgbench --conn…-init… "SET maintenance_work_mem TO '12MB'" ... The --help does not say that the option expects a parameter. Also, in your patch it is an initialization option, but the code does not check for that. Concerning the freeze option: It is also an initialization-specific option that should be checked for that. The option does not make sense if not initializing. The alternative queries could be managed simply without intermediate variables. Pgbench documentation is not updated. There are no tests. This patch should be submitted in its own thread to help manage it in the CF app. -- Fabien.
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, Noah. At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190827.154932.250364935.horikyota@gmail.com> > I'm not sure whether the knob shows apparent performance gain and > whether we can offer the criteria to identify the proper > value. But I'll add this feature with a GUC > effective_io_block_size defaults to 64kB as the threshold in the > next version. (The name and default value are arguable, of course.) This is a new version of the patch based on the discussion. The differences from v19 are the follows. - Removed the new stuff in two-phase.c. The action on PREPARE TRANSACTION is now taken in PrepareTransaction(). Instead of storing pending syncs in two-phase files, the function immediately syncs all files that can survive the transaction end. (twophase.c, xact.c) - Separate pendingSyncs from pendingDeletes. pendingSyncs gets handled differently from pendingDeletes so it is separated. - Let smgrDoPendingSyncs() to avoid performing fsync on to-be-deleted files. In previous versions the function syncs all recorded files even if it is being deleted. Since we use WAL-logging as the alternative of fsync now, performance gets more significance g than before. Thus this version avoids uesless fsyncs. - Use log_newpage instead of fsync for small tables. As in the discussion up-thread, I think I understand how WAL-logging works better than fsync. smgrDoPendingSync issues log_newpage for all blocks in the table smaller than the GUC variable "effective_io_block_size". I found log_newpage_range() that does exact what is needed here but it requires Relation that is no available there. I removed an assertion in CreateFakeRelcacheEntry so that it works while non-recovery mode. - Rebased and fixed some bugs. I'm trying to measure performance difference on WAL/fsync. By the way, smgrDoPendingDelete is called from CommitTransaction and AbortTransaction directlry, and from AbortSubTransaction via AtSubAbort_smgr(), which calls only smgrDoPendingDeletes() and is called only from AbortSubTransaction. I think these should be unified either way. Any opinions? CommitTransaction() + msgrDoPendingDelete() AbortTransaction() + msgrDoPendingDelete() AbortSubTransactoin() AtSubAbort_smgr() + msgrDoPendingDelete() # Looking around, the prefixes AtEOact/PreCommit/AtAbort don't # seem to be used keeping a principle. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 83deb772808cdd3afdb44a7630656cc827adfe33 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 11 Oct 2018 10:03:21 +0900 Subject: [PATCH 1/4] TAP test for copy-truncation optimization. --- src/test/recovery/t/018_wal_optimize.pl | 312 1 file changed, 312 insertions(+) create mode 100644 src/test/recovery/t/018_wal_optimize.pl diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl new file mode 100644 index 00..b041121745 --- /dev/null +++ b/src/test/recovery/t/018_wal_optimize.pl @@ -0,0 +1,312 @@ +# Test WAL replay for optimized TRUNCATE and COPY records +# +# WAL truncation is optimized in some cases with TRUNCATE and COPY queries +# which sometimes interact badly with the other optimizations in line with +# several setting values of wal_level, particularly when using "minimal" or +# "replica". The optimization may be enabled or disabled depending on the +# scenarios dealt here, and should never result in any type of failures or +# data loss. 
+use strict; +use warnings; + +use PostgresNode; +use TestLib; +use Test::More tests => 26; + +sub check_orphan_relfilenodes +{ + my($node, $test_name) = @_; + + my $db_oid = $node->safe_psql('postgres', + "SELECT oid FROM pg_database WHERE datname = 'postgres'"); + my $prefix = "base/$db_oid/"; + my $filepaths_referenced = $node->safe_psql('postgres', " + SELECT pg_relation_filepath(oid) FROM pg_class + WHERE reltablespace = 0 and relpersistence <> 't' and + pg_relation_filepath(oid) IS NOT NULL;"); + is_deeply([sort(map { "$prefix$_" } + grep(/^[0-9]+$/, + slurp_dir($node->data_dir . "/$prefix")))], + [sort split /\n/, $filepaths_referenced], + $test_name); + return; +} + +# Wrapper routine tunable for wal_level. +sub run_wal_optimize +{ + my $wal_level = shift; + + # Primary needs to have wal_level = minimal here + my $node = get_new_node("node_$wal_level"); + $node->init; + $node->append_conf('postgresql.conf', qq( +wal_level = $wal_level +max_prepared_transactions = 1 +)); + $node->start; + + # Setup + my $tablespace_dir = $node->basedir . '/tablespace_other'; + mkdir ($tablespace_dir); + $tablespace_dir = TestLib::perl2host($tablespace_dir); + $node->safe_psql('postgres', + "CREATE TABLESPACE other LOCATION '$tablespace_dir';"); + + # Test direct truncation optimization. No tuples + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test1 (id serial PRIMA
Performance improvement of WAL writing?
Dear Hackers. Currently, the XLogWrite function writes in 8 kB (or 16 kB) units regardless of the size of the new record. For example, even if a new record is only 300 bytes, pg_pwrite is called to write data in 8 kB units (if it cannot be written within one page, 16 kB are written). Let's look at the worst case. 1) LogwrtResult.Flush is at position 8100. 2) And the new record size is only 100 bytes. In this case, pg_pwrite is called and writes 16 kB to update only 100 bytes. It is a rare case, but I think there is overhead in pg_pwrite on some systems. # For systems that often update one record in one transaction. So what about modifying the XLogWrite function to write only the size that actually needs to be recorded? Could this idea benefit WAL writing performance? If it is worth improving, I would like to work on the modification. What do you think? Best Regards. Moon.
Re: Re: Email to hackers for test coverage
On Wed, 28 Aug 2019 11:30:23 +0800 mich...@paquier.xyz wrote >- numeric is not a test suite designed for sorting, and hijacking it >to do so is not a good approach. Absolutely agreed. We can wait for replies from Robert and Peter G. >- it would be good to get coverage for the two extra code paths while >we are on it with NULL datums. The remaining untested line in ApplySortComparator() can never be reached, I think, and I have explained that in my last mail. Or do I not fully understand what you mean? >- There is no need for two INSERT queries, I am getting the same >coverage with only one. Yes, it is my mistake; the key point is 'ORDER BY n1 DESC', not the 'INSERT'. In fact, the code line can be reached without either of the two INSERTs. -- Movead
Improve error detections in TAP tests by spreading safe_psql
Hi all, This is a follow-up of the discussion which happened here after Tom has committed fb57f40: https://www.postgresql.org/message-id/20190828012439.ga1...@paquier.xyz I have reviewed the TAP tests, and we have much more spots where it is better to use PostgresNode::safe_psql instead PostgresNode::psql so as the test dies immediately if there is a failure on a query which should never fail. Attached are the spots I have found. Any thoughts or opinions? Thanks, -- Michael diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 6e2f051187..5f40fcfba6 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -47,12 +47,12 @@ is($primary->slot($slot_name)->{'slot_type'}, # Generate some WAL. Use --synchronous at the same time to add more # code coverage. Switch to the next segment first so that subsequent # restarts of pg_receivewal will see this segment as full.. -$primary->psql('postgres', 'CREATE TABLE test_table(x integer);'); -$primary->psql('postgres', 'SELECT pg_switch_wal();'); +$primary->safe_psql('postgres', 'CREATE TABLE test_table(x integer);'); +$primary->safe_psql('postgres', 'SELECT pg_switch_wal();'); my $nextlsn = $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); chomp($nextlsn); -$primary->psql('postgres', +$primary->safe_psql('postgres', 'INSERT INTO test_table VALUES (generate_series(1,100));'); # Stream up to the given position. diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl index 99154bcf39..b225b3b4ee 100644 --- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl +++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl @@ -47,8 +47,8 @@ $node->command_ok( my $slot = $node->slot('test'); isnt($slot->{'restart_lsn'}, '', 'restart lsn is defined for new slot'); -$node->psql('postgres', 'CREATE TABLE test_table(x integer)'); -$node->psql('postgres', +$node->safe_psql('postgres', 'CREATE TABLE test_table(x integer)'); +$node->safe_psql('postgres', 'INSERT INTO test_table(x) SELECT y FROM generate_series(1, 10) a(y);'); my $nextlsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()'); diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index f064ea1063..3e7412dbbf 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -3282,7 +3282,7 @@ if ($collation_check_stderr !~ /ERROR: /) } # Create a second database for certain tests to work against -$node->psql('postgres', 'create database regress_pg_dump_test;'); +$node->safe_psql('postgres', 'create database regress_pg_dump_test;'); # Start with number of command_fails_like()*2 tests below (each # command_fails_like is actually 2 tests) diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl index 3c743d7d7c..42af151ac9 100644 --- a/src/test/recovery/t/001_stream_rep.pl +++ b/src/test/recovery/t/001_stream_rep.pl @@ -125,7 +125,7 @@ test_target_session_attrs($node_standby_1, $node_master, $node_standby_1, # role. 
note "testing SHOW commands for replication connection"; -$node_master->psql( +$node_master->safe_psql( 'postgres', " CREATE ROLE repl_role REPLICATION LOGIN; GRANT pg_read_all_settings TO repl_role;"); diff --git a/src/test/recovery/t/008_fsm_truncation.pl b/src/test/recovery/t/008_fsm_truncation.pl index ddab464a97..a164b81777 100644 --- a/src/test/recovery/t/008_fsm_truncation.pl +++ b/src/test/recovery/t/008_fsm_truncation.pl @@ -30,7 +30,7 @@ $node_standby->init_from_backup($node_master, 'master_backup', has_streaming => 1); $node_standby->start; -$node_master->psql( +$node_master->safe_psql( 'postgres', qq{ create table testtab (a int, b char(100)); insert into testtab select generate_series(1,1000), 'foo'; @@ -39,7 +39,7 @@ delete from testtab where ctid > '(8,0)'; }); # Take a lock on the table to prevent following vacuum from truncating it -$node_master->psql( +$node_master->safe_psql( 'postgres', qq{ begin; lock table testtab in row share mode; @@ -47,14 +47,14 @@ prepare transaction 'p1'; }); # Vacuum, update FSM without truncation -$node_master->psql('postgres', 'vacuum verbose testtab'); +$node_master->safe_psql('postgres', 'vacuum verbose testtab'); # Force a checkpoint -$node_master->psql('postgres', 'checkpoint'); +$node_master->safe_psql('postgres', 'checkpoint'); # Now do some more insert/deletes, another vacuum to ensure full-page writes # are done -$node_master->psql( +$node_master->safe_psql( 'postgres', qq{ insert into testtab select generate_series(1,1000), 'foo'; delete from testtab where ctid > '(8,0)'; @@ -62,16 +62,16 @@ vacuum verbose testtab; }); # Ensure all buffers are now clean on the standby -$node_standby->psql('postgres', 'checkpoint'); +$node_standby->safe_psql('postgres', 'checkpoint'); # Release the lock, vacuum again which shoul
Re: refactoring - share str2*int64 functions
Bonjour Michaël, So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and then possibly generalize to lower sizes, 32, 16, depending on what is actually needed. I am interested in this patch, and the next commit fest is close by. Are you working on an updated version? If not, do you mind if I work on it and post a new version by the beginning of next week based on all the feedback gathered? Here is an updated patch for the u?int64 conversion functions. I have taken the liberty to optimize the existing int64 function by removing spurious tests. I have not created uint64 specific inlined overflow functions. If it looks ok, a separate patch could address the 32 & 16 versions. -- Fabien.diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 221b47298c..28891ba337 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -79,6 +79,7 @@ #include "utils/builtins.h" #include "utils/hashutils.h" #include "utils/memutils.h" +#include "common/string.h" PG_MODULE_MAGIC; @@ -1022,7 +1023,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString, /* parse command tag to retrieve the number of affected rows. */ if (completionTag && strncmp(completionTag, "COPY ", 5) == 0) - rows = pg_strtouint64(completionTag + 5, NULL, 10); + (void) pg_strtouint64(completionTag + 5, &rows); else rows = 0; diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c index 2c0ae395ba..8e75d52b06 100644 --- a/src/backend/executor/spi.c +++ b/src/backend/executor/spi.c @@ -21,6 +21,7 @@ #include "catalog/heap.h" #include "catalog/pg_type.h" #include "commands/trigger.h" +#include "common/string.h" #include "executor/executor.h" #include "executor/spi_priv.h" #include "miscadmin.h" @@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI, CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt; if (strncmp(completionTag, "SELECT ", 7) == 0) - _SPI_current->processed = - pg_strtouint64(completionTag + 7, NULL, 10); + (void) pg_strtouint64(completionTag + 7, &_SPI_current->processed); else { /* @@ -2361,8 +2361,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI, else if (IsA(stmt->utilityStmt, CopyStmt)) { Assert(strncmp(completionTag, "COPY ", 5) == 0); - _SPI_current->processed = pg_strtouint64(completionTag + 5, - NULL, 10); + (void) pg_strtouint64(completionTag + 5, &_SPI_current->processed); } } diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c index 764e3bb90c..17cde42b4d 100644 --- a/src/backend/nodes/readfuncs.c +++ b/src/backend/nodes/readfuncs.c @@ -34,6 +34,7 @@ #include "fmgr.h" #include "miscadmin.h" +#include "common/string.h" #include "nodes/extensible.h" #include "nodes/parsenodes.h" #include "nodes/plannodes.h" @@ -80,7 +81,7 @@ #define READ_UINT64_FIELD(fldname) \ token = pg_strtok(&length); /* skip :fldname */ \ token = pg_strtok(&length); /* get field value */ \ - local_node->fldname = pg_strtouint64(token, NULL, 10) + (void) pg_strtouint64(token, &local_node->fldname) /* Read a long integer field (anything written as ":fldname %ld") */ #define READ_LONG_FIELD(fldname) \ diff --git a/src/backend/parser/parse_node.c b/src/backend/parser/parse_node.c index 1baf7ef31f..4595c35875 100644 --- a/src/backend/parser/parse_node.c +++ b/src/backend/parser/parse_node.c @@ -25,10 +25,10 @@ #include "parser/parse_expr.h" #include "parser/parse_relation.h" #include "utils/builtins.h" -#include 
"utils/int8.h" #include "utils/lsyscache.h" #include "utils/syscache.h" #include "utils/varbit.h" +#include "common/string.h" static void pcb_error_callback(void *arg); @@ -496,7 +496,7 @@ make_const(ParseState *pstate, Value *value, int location) case T_Float: /* could be an oversize integer as well as a float ... */ - if (scanint8(strVal(value), true, &val64)) + if (pg_strtoint64(strVal(value), &val64) == strtoint_ok) { /* * It might actually fit in int32. Probably only INT_MIN can diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index 9c08757fca..b364a41fb5 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -12,6 +12,7 @@ */ #include "postgres.h" +#include "common/string.h" #include "catalog/pg_publication.h" #include "fmgr.h" @@ -22,7 +23,6 @@ #include "replication/pgoutput.h" #include "utils/inval.h" -#include "utils/int8.h" #include "utils/memutils.h" #include "utils/syscache.h" #include "utils/varlena.h" @@ -113,7 +113,7 @@ parse_output_parameters(List *options, uint32 *protocol_version, errmsg("conflicting or redundant options"))); protocol_version_given = true; - if (!