Re: Zedstore - compressed in-core columnar storage
On Tue, Aug 27, 2019 at 6:03 AM Ashwin Agrawal wrote:
> Hope that helps to clarify the confusion.

Thanks for the explanation. Yes, it does clarify my doubt to some extent. My point is, once we find the leaf page containing the given tid, we go through each item in the page until we find the data corresponding to the given tid, which means we effectively perform a sequential scan at the page level. I'm referring to the below loop in zsbt_attr_scan_fetch_array().

    for (off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId iid = PageGetItemId(page, off);
        ZSAttributeArrayItem *item =
            (ZSAttributeArrayItem *) PageGetItem(page, iid);

        if (item->t_endtid <= nexttid)
            continue;

        if (item->t_firsttid > nexttid)
            break;

But that's not true for an IndexScan on a heap table, because there the index tuple contains the exact physical location of the tuple in the heap, so there is no need to scan the entire page.

Further, here are some minor comments that I could find while doing a quick code walkthrough.

1) In zsundo_insert_finish(), there is a double call to BufferGetPage(undobuf). Is that required?

2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called twice? I'm referring to the below code.

    if (fetch_proj->num_proj_atts == 0)
    {
        zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &fetch_proj->tid_scan);
        fetch_proj->tid_scan.serializable = true;

        for (int i = 1; i < fetch_proj->num_proj_atts; i++)
        {
            int attno = fetch_proj->proj_atts[i];

            zsbt_attr_begin_scan(rel, reldesc, attno, &fetch_proj->attr_scans[i - 1]);
        }
        MemoryContextSwitchTo(oldcontext);

        zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, &fetch_proj->tid_scan);
    }

Also, for all types of update operation (be it a key or non-key update) we create a new tid for the new version of the tuple. Can't we reuse the tid associated with the old tuple in cases where there are no concurrent transactions to which the old tuple is still visible?

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
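For contrast, here is a minimal sketch (not taken from the zedstore patch; just the standard heap idiom, with a hypothetical helper name) of how a heap fetch uses the exact offset carried in the TID, so no per-item search within the page is needed:

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "storage/bufpage.h"
    #include "storage/itemptr.h"
    #include "storage/off.h"

    /* illustration only: fetch a tuple by the exact offset stored in the TID */
    static HeapTupleHeader
    fetch_by_tid_offset(Page page, ItemPointer tid)
    {
        OffsetNumber offnum = ItemPointerGetOffsetNumber(tid);
        ItemId       lp = PageGetItemId(page, offnum);

        if (!ItemIdIsNormal(lp))
            return NULL;        /* dead or unused line pointer */

        /* no loop over the page's items is needed; visibility checks follow */
        return (HeapTupleHeader) PageGetItem(page, lp);
    }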
Re: Fault injection framework
On Thu, Aug 22, 2019 at 07:45:09PM +0530, Asim R P wrote:
> Fault injection was discussed a few months ago at PGCon in Ottawa. At
> least a few folks showed interest and so I would like to present what we
> have been using in Greenplum.
>
> The attached patch set contains the fault injector framework ported to
> PostgreSQL master. It provides ability to define points of interest in
> backend code and then inject faults at those points from SQL. Also
> included is an isolation test to simulate a speculative insert conflict
> scenario that was found to be rather cumbersome to implement using advisory
> locks, see [1]. The alternative isolation spec using fault injectors seems
> much simpler to understand.

You may want to double-check whitespace in your patch set, and 0002 does not apply because of conflicts in isolationtester.h (my fault!).

0002 is an independent feature, so I would keep it out of the fault framework for integration. There has been an argument from Alvaro more convincing than mine about the use of a separate keyword, hence removing a dependency with steps:
https://www.postgresql.org/message-id/20190823153825.GA11405@alvherre.pgsql
It would be good also to have a test case which exercises it, without the need of the fault framework or its dedicated schedule.

Patches 0003, 0004 and 0005 could just be grouped together, as they deal with the same thing.

My first impression of this patch is that it is very intrusive. Could you explain the purpose of am_faultinjector? That's a specific connection string parameter which can be used similarly to replication for WAL senders? Couldn't there be an equivalent with a SUSET GUC?

It may be interesting to see which parts of this framework could be moved into an extension loaded with shared_preload_libraries, one thing being the shared memory initialization part. At the end it would be interesting to not have a dependency on a compile-time flag.

Things like exec_fault_injector_command() need to be more documented. It is hard to guess what it is being used for.
--
Michael
Re: Optimize single tuple fetch from nbtree index
>> It seems that it contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use-case.
>> Maybe this restriction can be relaxed for write only tables, that never
>> have to reread the page because of visibility, or something like that.
>> Also we probably can add to IndexScanDescData info about expected number
>> of tuples, to allow index work more optimal
>> and avoid the overhead for other loads.

> The idea of the patch is exactly to relax this limitation. I forgot to update
> that README file though. The current implementation of the patch should be
> correct like this - that's why I added the look-back code on the page if the
> tuple couldn't be found anymore on the same location on the page. Similarly,
> it'll look on the page to the right if it detected a page split. These two
> measures combined should give a correct implementation of the 'it's possible
> that a scan stops in the middle of a page' relaxation. However, as Peter and
> Tom pointed out earlier, they feel that the performance advantage that this
> approach gives, does not outweigh the extra complexity at this time. I'd be
> open to other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first actually gets deleted (not moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visible tuple (but not DEAD yet in the index at the time of _bt_first). For my understanding, would a situation like the following lead to this (in my patch)?

1) Backend 1 does an index scan and returns the first tuple on _bt_first - this tuple is actually deleted in the heap already, however it's not marked dead yet in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is actually dead.
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next), backend 2 comes in and manages to somehow do 1) a _bt_killitems on the page to mark tuples dead as well as 2) compact items on the page, thereby actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but cannot find it anymore because it got removed completely by backend 2.

If this is indeed possible then it's a bad issue unfortunately, and quite hard to try to reproduce, as a lot of things need to happen concurrently while doing a visibility check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos))

This Assert exists already a couple of lines earlier at the start of this function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:

* We note the buffer's block number so that we can release the pin later.
* This allows us to re-read the buffer if it is needed again for hinting.
*/
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin by the unpin macro. So, if we don't store the currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think that's the reason that currPage is stored.
Other than the two TODOs in the code, I think the comments really help clarify what's going on in the code - I'd be happy if this gets added.

-Floris
Re: [HACKERS] WAL logging problem in 9.4.3?
At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190827.154932.250364935.horikyota@gmail.com>
> 128GB of shared buffers contain 16M buffers. On my
> perhaps-Windows-Vista-era box, such a loop takes 15 ms. (Since it
> has only 6GB, the test ignores the effect of cache that comes
> from the difference of the buffer size.) (attached 1)
...
> For a 16MB file, the cost of write-fsyncing is almost the
> same as the WAL-emitting cost. It was about 200 ms on the
> Vista-era machine with non-performant rotating magnetic disks
> with xfs. (attached 2, 3) Although write-fsyncing of a relation
> file causes no lock conflict with other backends, WAL-emitting
> delays other backends' commits by at most that many milliseconds.

FWIW, the attached are the programs I used to take the numbers.

testloop.c: to take the time to loop over buffers as in FlushRelationBuffers
testfile.c: to take the time to sync a heap file (one file per size)
testfile2.c: to take the time to emit a WAL record (16MB per file)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

testloop.c:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef struct RelFileNode
{
	unsigned int spc;
	unsigned int db;
	unsigned int rel;
} RelFileNode;

typedef struct Buffer
{
	RelFileNode rnode;
} Buffer;

//#define NBUFFERS ((int)((128.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))
#define NBUFFERS ((int)((32.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))

int main(void)
{
	int i;
	RelFileNode t = {1, 2, 3};
	Buffer *bufs = (Buffer *) malloc(sizeof(Buffer) * NBUFFERS);
	struct timeval st, ed;
	int matches = 0, unmatches = 0;
	Buffer *b;

	for (i = 0 ; i < NBUFFERS ; i++)
	{
		bufs[i].rnode.spc = random() * 100;
		bufs[i].rnode.db = random() * 100;
		bufs[i].rnode.rel = random() * 1;
	}

	/* start measuring */
	gettimeofday(&st, NULL);
	b = bufs;
	for (i = 0 ; i < NBUFFERS ; i++)
	{
		if (b->rnode.spc == t.spc && b->rnode.db == t.db && b->rnode.rel == t.rel)
			matches++;
		else
			unmatches++;
		b++;
	}
	gettimeofday(&ed, NULL);

	printf("%lf ms for %d loops, matches %d, unmatches %d\n",
		   (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
					(ed.tv_usec - st.tv_usec) / 1000.0),
		   i, matches, unmatches);

	return 0;
}

testfile.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100
#define FILE_SIZE (64 * 1024)
#define LOOPS 1000
//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (512)
//#define LOOPS 1000
//#define FILE_SIZE (128)
//#define LOOPS 1000

char buf[FILE_SIZE];
char fname[256];

int main(void)
{
	int i, j;
	int fd = -1;
	struct timeval st, ed;
	double accum = 0.0;
	int bufperfile = (int)((16.0 * 1024 * 1024) / FILE_SIZE);

	for (i = 0 ; i < LOOPS ; i++)
	{
		snprintf(fname, 256, "test%03d.file", i);
		unlink(fname); // ignore errors
	}

	for (i = 0 ; i < LOOPS ; i++)
	{
		for (j = 0 ; j < FILE_SIZE ; j++)
			buf[j] = random() * 256;

		if (i % bufperfile == 0)
		{
			if (fd >= 0)
				close(fd);
			snprintf(fname, 256, "test%03d.file", i / bufperfile);
			fd = open(fname, O_CREAT | O_RDWR, 0644);
			if (fd < 0)
			{
				fprintf(stderr, "open error: %m\n");
				exit(1);
			}
			memset(buf, 0, sizeof(buf));
			if (write(fd, buf, sizeof(buf)) < 0)
			{
				fprintf(stderr, "init write error: %m\n");
				exit(1);
			}
			if (fsync(fd) < 0)
			{
				fprintf(stderr, "init fsync error: %m\n");
				exit(1);
			}
			if (lseek(fd, 0, SEEK_SET) < 0)
			{
				fprintf(stderr, "init lseek error: %m\n");
				exit(1);
			}
		}

		gettimeofday(&st, NULL);
		if (write(fd, buf, FILE_SIZE) < 0)
		{
			fprintf(stderr, "write error: %m\n");
			exit(1);
		}
		if (fdatasync(fd) < 0)
		{
			fprintf(stderr, "fdatasync error: %m\n");
			exit(1);
		}
		gettimeofday(&ed, NULL);
		accum += (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
						  (ed.tv_usec - st.tv_usec) / 1000.0);
	}

	printf("%.2lf ms for %d %dkB-records (%d MB), %.2lf ms per %dkB)\n",
		   accum, i, FILE_SIZE / 1024, i * FILE_SIZE, accum / i, FILE_SIZE / 1024);

	return 0;
}

testfile2.c:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100
//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000
//#define FILE_SIZE (512)
//#define LOOPS 1000
#define FILE_SIZE (128)
#define LOOPS 1000

char buf[FILE_SIZE];

int main(void)
{
	int i;
	int fd = -1;
	double accum = 0.0;
	struct timeval st, ed;

	for (i =
Re: A problem about partitionwise join
On Tue, Aug 27, 2019 at 8:51 AM Amit Langote wrote: > Hi Richard, > > On Mon, Aug 26, 2019 at 6:33 PM Richard Guo wrote: > > > > Hi All, > > > > To generate partitionwise join, we need to make sure there exists an > > equi-join condition for each pair of partition keys, which is performed > > by have_partkey_equi_join(). This makes sense and works well. > > > > But if, let's say, one certain pair of partition keys (foo.k = bar.k) > > has formed an equivalence class containing consts, no join clause would > > be generated for it, since we have already generated 'foo.k = const' and > > 'bar.k = const' and pushed them into the proper restrictions earlier. > > > > This will make partitionwise join fail to be planned if there are > > multiple partition keys and the pushed-down restrictions 'xxx = const' > > fail to prune away any partitions. > > > > Consider the examples below: > > > > create table p (k1 int, k2 int, val int) partition by range(k1,k2); > > create table p_1 partition of p for values from (1,1) to (10,100); > > create table p_2 partition of p for values from (10,100) to (20,200); > > > > If we are joining on each pair of partition keys, we can generate > > partitionwise join: > > > > # explain (costs off) > > select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = > bar.k2; > > QUERY PLAN > > -- > > Append > >-> Hash Join > > Hash Cond: ((foo.k1 = bar.k1) AND (foo.k2 = bar.k2)) > > -> Seq Scan on p_1 foo > > -> Hash > >-> Seq Scan on p_1 bar > >-> Hash Join > > Hash Cond: ((foo_1.k1 = bar_1.k1) AND (foo_1.k2 = bar_1.k2)) > > -> Seq Scan on p_2 foo_1 > > -> Hash > >-> Seq Scan on p_2 bar_1 > > (11 rows) > > > > But if we add another qual 'foo.k2 = const', we will be unable to > > generate partitionwise join any more, because have_partkey_equi_join() > > thinks not every partition key has an equi-join condition. > > > > # explain (costs off) > > select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = > bar.k2 and foo.k2 = 16; > >QUERY PLAN > > - > > Hash Join > >Hash Cond: (foo.k1 = bar.k1) > >-> Append > > -> Seq Scan on p_1 foo > >Filter: (k2 = 16) > > -> Seq Scan on p_2 foo_1 > >Filter: (k2 = 16) > >-> Hash > > -> Append > >-> Seq Scan on p_1 bar > > Filter: (k2 = 16) > >-> Seq Scan on p_2 bar_1 > > Filter: (k2 = 16) > > (13 rows) > > > > Is this a problem? > > Perhaps. Maybe it has to do with the way have_partkey_equi_join() has > been coded. If it was coded such that it figured out on its own that > the equivalence (foo.k2, bar.k2, ...) does exist, then that would > allow partitionwise join to occur, which I think would be OK to do. > But maybe I'm missing something. > > This should be caused by how we deduce join clauses from equivalence classes. ECs containing consts will not be considered so we cannot generate (foo.k2 = bar.k2) for the query above. In addition, when generating join clauses from equivalence classes, we only select the joinclause with the 'best score', or the first joinclause with a score of 3. This may make us miss some joinclause on partition keys. 
Check the query below as a more illustrative example:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

If we use the quals 'foo.k = bar.k and foo.k = bar.val', we can generate a partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val;
              QUERY PLAN
---------------------------------------
 Append
   ->  Hash Join
         Hash Cond: (foo.k = bar.k)
         ->  Seq Scan on p_1 foo
         ->  Hash
               ->  Seq Scan on p_1 bar
                     Filter: (k = val)
   ->  Hash Join
         Hash Cond: (foo_1.k = bar_1.k)
         ->  Seq Scan on p_2 foo_1
         ->  Hash
               ->  Seq Scan on p_2 bar_1
                     Filter: (k = val)
(13 rows)

But if we exchange the order of the two quals to 'foo.k = bar.val and foo.k = bar.k', then a partitionwise join cannot be generated any more, because we only keep the joinclause 'foo.k = bar.val', as it is the first to reach a score of 3. We have missed the joinclause on the partition key although it does exist.

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;
              QUERY PLAN
---------------------------------------
 Hash Join
   Hash Cond: (foo.k = bar.val)
   ->  Append
         ->  Seq Scan on p_1 foo
         ->  Seq Scan on p_2 foo_1
   ->  Hash
         ->  Append
               ->
Re: Re: Email to hackers for test coverage
On Tue, 27 Aug 2019 14:07:48 +0800 mich...@paquier.xyz wrote:

> There is a section in float4.sql which deals with overflow and
> underflow, so wouldn't it be better to move the tests there? You
> could just trigger the failures with that:
> =# insert into float4_tbl values ('-10e-70'::float8);
> ERROR: 22003: value out of range: underflow
> LOCATION: check_float4_val, float.h:145
> =# insert into float4_tbl values ('-10e70'::float8);
> ERROR: 22003: value out of range: overflow
> LOCATION: check_float4_val, float.h:140
> I would also test all four patterns: 10e70, 10e-70, -10e70, -10e-70.

I think your way is much better, so I have changed the patch; it is now 'regression_20190827.patch'.

> For the numeric part, this improves the case of
> ApplySortAbbrevFullComparator() where both values are not NULL. Could
> things be done so as the other code paths are fully covered? One
> INSERT is fine by the way to add the extra coverage.

There are code lines related to NULL values in ApplySortAbbrevFullComparator(), but I think the code lines for comparing a NULL and a NOT NULL can never be reached, because that case is handled in ApplySortComparator(), which is called before ApplySortAbbrevFullComparator(). So I think there is no point in INSERTing a NULL value.

--
Movead

regression_20190827.patch
Re: pgbench - allow to create partitioned tables
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Thanks. All looks good, making it ready for committer.

Regards,
Asif

The new status of this patch is: Ready for Committer
Re: "ago" times on buildfarm status page
On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote: > I think this is the problem: > > 'scmrepo' => '/home/pgbf/pgmirror.git', > > Probably this isn't updated often enough. It probably has little to do with > the clock settings. > > This is the kind of old-fashioned way of doing things. These days > "git_keep_mirror => 1" along with the community repo as the base would avoid > these problems. We've discussed this before (see below). The configuration is intentionally like that. I specifically configured skate and snapper to build the exact same source, where skate builds with default buildfarm settings, while snapper builds with the settings actually used by Debian source packages. These animals were set up to avoid cases we had in the past where Debian source packages failed to build on sparc, even though build animals running on Debian sparc were building fine: https://www.postgresql.org/message-id/01d2f0c2%24e2d335a0%24a879a0e0%24%40turelinckx.be With snapper building the exact same source as skate (as it is now), we have some point of reference if snapper fails but skate succeeds. I could configure snapper to perform an update of the repo before building, but then we give up this comparability in exchange for a bit more clarity regarding timestamps. Best regards, Tom Turelinckx On Thu, Nov 9, 2017, at 8:54 PM, Andrew Dunstan wrote: > > The first line of the log file is always something like this (see end of > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-11-09%2018%3A59%3A01 > ) > > Last file mtime in snapshot: Thu Nov 9 17:56:07 2017 GMT > This should be the time of the last commit in the snapshot. > > On Mon, Nov 6, 2017 at 9:49 AM, Tom Lane wrote: >> >> Tom Turelinckx writes: >> >> > On Fri, Nov 3, 2017, at 09:42 PM, Tom Lane wrote: >> >> Either that, or it's fallen through a wormhole ;-), but the results >> >> it's posting seem to be mis-timestamped by several hours, which is >> >> confusing. Please set its clock correctly. Maybe spin up ntpd? >> >> > The clock is correct, but the configuration may be unusual. >> >> > In fact, snapper runs on the same machine as skate, and it's using ntp. >> > At 7 AM (Western Europe), a local git repo is updated. In the morning, >> > skate builds from that local repo with the default buildfarm >> > configuration that most animals use. In the afternoon, snapper builds >> > from that local repo with the exact same configuration, per branch, that >> > the Debian source packages from the pgdg repo use on the same platform. >> > The local repo is updated only once per day to ensure that snapper and >> > skate build the same source with different settings, and they share the >> > git mirror and build root, as suggested in the build farm howto. >> >> Hm. So the issue really is that the build timestamp that the buildfarm >> client is reporting tells when it pulled from the local repo, not when >> that repo was last updated from the community server. Not sure if there's >> any simple way to improve that ... Andrew, any thoughts?
Re: Creating partitions automatically at least on HASH?
On Mon, 26 Aug 2019 at 19:46, Fabien COELHO wrote: > > Hello Rafia, > > >>CREATE TABLE Stuff (...) > >> PARTITION BY [HASH | RANGE | LIST] (…) > >>DO NONE -- this is the default > >>DO [IMMEDIATE|DEFERRED] USING (…) > >> > >> Where the USING part would be generic keword value pairs, eg: > >> > >> For HASH: (MODULUS 8) and/or (NPARTS 10) > >> > >> For RANGE: (START '1970-01-01', STOP '2020-01-01', INCREMENT '1 year') > >> and/or (START 1970, STOP 2020, NPARTS 50) > >> > >> And possibly for LIST: (IN (…), IN (…), …), or possibly some other > >> keyword. > >> > >> The "DEFERRED" could be used as an open syntax for dynamic partitioning, > >> if later someone would feel like doing it. > >> > > ISTM that "USING" is better than "WITH" because WITH is already used > >> specifically for HASH and other optional stuff in CREATE TABLE. > >> > >> The text constant would be interpreted depending on the partitioning > >> expression/column type. > >> > >> Any opinion about the overall approach? > > > I happen to start a similar discussion [1] being unaware of this one > > and there Ashutosh Sharma talked about interval partitioning in Oracle. > > Looking > > closely it looks like we can have this automatic partitioning more > > convenient by having something similar. Basically, it is creating > > partitions on demand or lazy partitioning. > > Yep, the "what" of dynamic partitioning is more or less straightforward, > along the line you are describing. > > For me there are really two questions: > > - having a extendable syntax, hence the mail I sent, which would cover > both automatic static & dynamic partitioning and their parameters, > given that we already have manual static, automatic static should > be pretty easy. > > - implementing the stuff, with limited performance impact if possible > for the dynamic case, which is non trivial. > > > To explain a bit more, let's take range partition for example, first > > parent table is created and it's interval and start and end values are > > specified and it creates only the parent table just like it works today. > > > Now, if there comes a insertion that does not belong to the existing (or > > any, in the case of first insertion) partition(s), then the > > corresponding partition is created, > > Yep. Now, you also have to deal with race conditions issues, i.e. two > parallel session inserting tuples that must create the same partition, and > probably you would like to avoid a deadlock. > > Hmmm, that shouldn't be very hard. Postgres handles many such things and I think mostly by a mutex guarded shared memory structure. E.g. we can have a shared memory structure associated with the parent table holding the information of all the available partitions, and keep this structure guarded by mutex. Anytime a new partition has to be created the relevant information is first entered in this structure before actually creating it. > I think it is extensible to other partitioning schemes as well. Also it > > is likely to have a positive impact on the queries, because there will > > be required partitions only and would not require to educate > > planner/executor about many empty partitions. > > Yep, but it creates other problems to solve… > > Isn't it always the case. :) -- Regards, Rafia Sabih
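To make the idea a bit more concrete, here is a rough sketch (purely hypothetical names, not an actual patch) of the kind of lock-guarded shared hash table that could track partitions created on demand, as described above:

    #include "postgres.h"
    #include "storage/shmem.h"
    #include "utils/hsearch.h"

    /* hypothetical entry: one row per partitioned parent */
    typedef struct AutoPartEntry
    {
        Oid    parentrelid;    /* hash key */
        Oid    newest_part;    /* most recently created partition */
    } AutoPartEntry;

    static HTAB *autopart_hash;

    static void
    autopart_shmem_init(void)
    {
        HASHCTL    info;

        memset(&info, 0, sizeof(info));
        info.keysize = sizeof(Oid);
        info.entrysize = sizeof(AutoPartEntry);

        /* sizes are placeholders; a real design would size this properly */
        autopart_hash = ShmemInitHash("auto partition map",
                                      64, 1024,
                                      &info,
                                      HASH_ELEM | HASH_BLOBS);
    }

On insert, a backend would look up the parent under an exclusive lock, and only the backend that finds no suitable partition would create one, so two sessions never race to create the same partition.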
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello.

At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch wrote in <20190826050843.gb3153...@rfd.leadboat.com>
> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch wrote in <20190820060314.ga3086...@rfd.leadboat.com>
> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > > I'm not sure the point of the behavior. I suppose that the "log"
> > > > is a sequence of new_page records. It also needs to be synced and
> > > > it is always larger than the file to be synced. I can't think of
> > > > an appropriate threshold without the point.
> > >
> > > Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
> > > every buffer header containing a buffer of the current database. The belief
> > > has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> > > in a busy system with large shared_buffers.
> >
> > I'm at a loss.. The decision between WAL and sync is made at
> > commit time, when we no longer have a pin on a buffer. When
> > emitting WAL, opposite to the assumption, lock needs to be
> > re-acquired for every page to emit log_new_page. What is worse,
> > we may need to reload evicted buffers. If the file has been
> > CopyFrom'ed, ring buffer strategy makes the situation even
> > worse. That doesn't seem cheap at all..
>
> Consider a one-page relfilenode. Doing all the things you list for a single
> page may be cheaper than locking millions of buffer headers.

If I understand you correctly, I would say that *all* buffers that don't belong to in-transaction-created files are skipped before taking locks. No lock conflict happens with other backends. FlushRelationBuffers uses double-checked locking as follows:

FlushRelationBuffers_common():
  ..
  if (!islocal)
  {
    for (i for all buffers)
    {
      if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
      {
        LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && valid && dirty)
        {
          PinBuffer_Locked(bufHdr);
          LWLockAcquire();
          FlushBuffer();

128GB of shared buffers contain 16M buffers. On my perhaps-Windows-Vista-era box, such a loop takes 15 ms. (Since it has only 6GB, the test ignores the effect of cache that comes from the difference of the buffer size.) (attached 1)

With WAL-emitting we find every buffer of the file using the buffer hash, so we suffer partition locks instead of the 15 ms of local latency. That seems worse.

> > If there were any chance on WAL for smaller files here, it would
> > be on the files smaller than the ring size of bulk-write
> > strategy (16MB).
>
> Like you, I expect the optimal threshold is less than 16MB, though you should
> benchmark to see. Under the ideal threshold, when a transaction creates a new
> relfilenode just smaller than the threshold, that transaction will be somewhat
> slower than it would be if the threshold were zero. Locking every buffer

I looked closer at this. For a 16MB file, the cost of write-fsyncing is almost the same as the WAL-emitting cost. It was about 200 ms on the Vista-era machine with non-performant rotating magnetic disks with xfs. (attached 2, 3) Although write-fsyncing of a relation file causes no lock conflict with other backends, WAL-emitting delays other backends' commits by at most that many milliseconds.

In summary, the characteristics of the two methods on a 16MB file are as follows.

File write:
- 15 ms of buffer scan without locks (@128GB shared buffers)
+ no hash search for a buffer
= takes locks on all buffers of the file, one by one (to write)
+ plus 200 ms of write-fdatasync (of the whole relation file), which doesn't conflict with other backends (except via CPU time slots and IO bandwidth)

WAL write:
+ no buffer scan
- 2048 (16M/8k) partition lock acquisitions to find every buffer of the target file, which can conflict with other backends
= takes locks on all buffers of the file, one by one (to take FPWs)
- plus 200 ms of open(create)-write-fdatasync (of a WAL file of default size), which can delay commits on other backends by at most that duration

> header causes a distributed slow-down for other queries, and protecting the
> latency of non-DDL queries is typically more useful than accelerating
> TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
> beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
> than the buffer scan harms them. That's about where the threshold should be.

If the discussion above is correct, we shouldn't use WAL-write even for files around 16MB. For smaller shared_buffers and file sizes, the delays are:

Scan all buffers takes:
  15 ms for 128GB shared_buffers
  4.5 ms for 32GB shared_buffers

fdatasync takes:
  200 ms for 16MB/sync
  51 ms for 1MB/sync
  46 ms for 512kB/sync
  40 ms for 256kB/sync
  37 ms for 128kB/sync
  35 ms for <64k
Re: Statement timeout in pg_rewind
On Tue, 27 Aug 2019 at 08:36, Michael Paquier wrote:
> I'd rather be on the safe side and as we are looking at this at this
> area.. Who knows if this logic is going to change in the future and
> how it will change.

Agree.

> Oops, I misread this part. What about a simple wrapper
> run_simple_command which checks after PGRES_COMMAND_OK, and frees the
> result then? This could be used for the temporary table creation and
> when setting synchronous_commit.

Done, please see the next version attached.

Regards,
--
Alexander Kukushkin

diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 37eccc3126..e2dbc9fdf6 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -39,6 +39,7 @@ static PGconn *conn = NULL;
 static void receiveFileChunks(const char *sql);
 static void execute_pagemap(datapagemap_t *pagemap, const char *path);
 static char *run_simple_query(const char *sql);
+static void run_simple_command(const char *sql);
 
 void
 libpqConnect(const char *connstr)
@@ -54,6 +55,11 @@ libpqConnect(const char *connstr)
 	if (showprogress)
 		pg_log_info("connected to server");
 
+	/* We don't want our queries canceled */
+	run_simple_command("SET statement_timeout = 0");
+	run_simple_command("SET lock_timeout = 0");
+	run_simple_command("SET idle_in_transaction_session_timeout = 0");
+
 	res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL);
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
 		pg_fatal("could not clear search_path: %s",
@@ -88,11 +94,7 @@ libpqConnect(const char *connstr)
 	 * replication, and replication isn't working for some reason, we don't
 	 * want to get stuck, waiting for it to start working again.
 	 */
-	res = PQexec(conn, "SET synchronous_commit = off");
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pg_fatal("could not set up connection context: %s",
-				 PQresultErrorMessage(res));
-	PQclear(res);
+	run_simple_command("SET synchronous_commit = off");
 }
 
 /*
@@ -122,6 +124,25 @@ run_simple_query(const char *sql)
 	return result;
 }
 
+/*
+ * Runs a command.
+ * In case of failure pg_fatal exits with code=1.
+ */
+static void
+run_simple_command(const char *sql)
+{
+	PGresult   *res;
+	char	   *result;
+
+	res = PQexec(conn, sql);
+
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		pg_fatal("error running command (%s) in source server: %s",
+				 sql, PQresultErrorMessage(res));
+
+	PQclear(res);
+}
+
 /*
  * Calls pg_current_wal_insert_lsn() function
  */
@@ -427,12 +448,7 @@ libpq_executeFileMap(filemap_t *map)
 	 * need to fetch.
 	 */
 	sql = "CREATE TEMPORARY TABLE fetchchunks(path text, begin int8, len int4);";
-	res = PQexec(conn, sql);
-
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pg_fatal("could not create temporary table: %s",
-				 PQresultErrorMessage(res));
-	PQclear(res);
+	run_simple_command(sql);
 
 	sql = "COPY fetchchunks FROM STDIN";
 	res = PQexec(conn, sql);
PostgreSQL and Real Application Testing (RAT)
Hi,

In my business, one of the things blocking the migration from Oracle to PostgreSQL is not having an equivalent of Oracle Real Application Testing. This product captures a workload in production and replays it in a test environment. This makes it possible to assess the impact of a migration to a newer version, of creating an index, and so on.

Is there an equivalent in the PostgreSQL community? If not, do you think it's technically possible to build one? Who would be interested in this project?

Thanks in advance
Best Regards
Didier ROS
EDF
Re: block-level incremental backup
On Fri, Aug 16, 2019 at 8:07 PM Ibrar Ahmed wrote:
>
> What do you mean by bigger file, a file greater than 1GB? In which case you get file > 1GB?
>

Few comments:

Comment:
+ buf = (char *) malloc(statbuf->st_size);
+ if (buf == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
+ {
+ Bitmapset *mod_blocks = NULL;
+ int nmodblocks = 0;
+
+ if (cnt % BLCKSZ != 0)
+ {

We can use the same size as the full page size. After pg_start_backup, full page writes will be enabled. We can use the same file size to maintain data consistency.

Comment:
/* Validate given LSN and convert it into XLogRecPtr. */
+ opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+ if (XLogRecPtrIsInvalid(opt->lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid value for LSN")));

Validate that the input LSN is less than the current system LSN.

Comment:
/* Validate given LSN and convert it into XLogRecPtr. */
+ opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+ if (XLogRecPtrIsInvalid(opt->lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid value for LSN")));

Should we check whether it is on the same timeline as the system's timeline?

Comment:
+ if (fread(blkdata, 1, BLCKSZ, infp) != BLCKSZ)
+ {
+ pg_log_error("could not read from file \"%s\": %m", outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ exit(1);
+ }
+
+ /* Finally write one block to the output file */
+ if (fwrite(blkdata, 1, BLCKSZ, outfp) != BLCKSZ)
+ {
+ pg_log_error("could not write to file \"%s\": %m", outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ exit(1);
+ }

Should we support the compression formats supported by pg_basebackup? This can be an enhancement after the functionality is completed.

Comment:
We should provide some mechanism to validate the backup, to identify whether a backup is corrupt or some file is missing (deleted) in a backup.

Comment:
+ ofp = fopen(tofn, "wb");
+ if (ofp == NULL)
+ {
+ pg_log_error("could not create file \"%s\": %m", tofn);
+ exit(1);
+ }

ifp should be closed in the error flow.

Comment:
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = pg_malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ pg_free(labelfile);
+ exit(1);
+ }

fclose can be moved above.

Comment:
+ if (!modifiedblockfound)
+ {
+ copy_whole_file(fm->filename, outfn);
+ cleanup_filemaps(filemaps, fmindex + 1);
+ return;
+ }
+
+ /* Write all blocks to the output file */
+
+ if (fstat(fileno(fm->fp), &statbuf) != 0)
+ {
+ pg_log_error("could not stat file \"%s\": %m", fm->filename);
+ pg_free(filemaps);
+ exit(1);
+ }

In some error flows, cleanup_filemaps needs to be called to close the file descriptors that were opened.

Comment:
+/*
+ * When to send the whole file, % blocks modified (90%)
+ */
+#define WHOLE_FILE_THRESHOLD 0.9
+

This could be a user-configurable value. This can be an enhancement after the functionality is completed.

Comment:
We can add a README file with all the details regarding incremental backup and combine backup.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
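As a follow-up to the comment above about validating the given LSN against the current system LSN, here is a hedged sketch of what such a check could look like (reusing the opt->lsn field from the quoted snippet; where exactly it would sit in the patch is an assumption):

    XLogRecPtr  current_ptr = GetXLogInsertRecPtr();

    if (opt->lsn >= current_ptr)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("requested LSN %X/%X is ahead of the current insert LSN %X/%X",
                        (uint32) (opt->lsn >> 32), (uint32) opt->lsn,
                        (uint32) (current_ptr >> 32), (uint32) current_ptr)));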
Re: "ago" times on buildfarm status page
On 8/27/19 4:33 AM, Tom Turelinckx wrote:
> On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote:
>> I think this is the problem:
>>
>> 'scmrepo' => '/home/pgbf/pgmirror.git',
>>
>> Probably this isn't updated often enough. It probably has little to do with
>> the clock settings.
>>
>> This is the kind of old-fashioned way of doing things. These days
>> "git_keep_mirror => 1" along with the community repo as the base would avoid
>> these problems.
> We've discussed this before (see below).
>
> Hm. So the issue really is that the build timestamp that the buildfarm
> client is reporting tells when it pulled from the local repo, not when
> that repo was last updated from the community server. Not sure if there's
> any simple way to improve that ... Andrew, any thoughts?

Maybe we need an option to use the git commit time instead of the snapshot time.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fault injection framework
On Tue, Aug 27, 2019 at 12:35 PM Michael Paquier wrote: > > You may want to double-check whitespaces in your patch set, and 0002 > does not apply because of conflicts in isolationtester.h (my fault!). > I've rebased the patch set against the latest master and tried to resolve whitespace issues. Apologies for the whitespace conflicts, I tried resolving them but there is some trailing whitespace in the answer file of the regress test in v1-0001 that cannot be removed, else the test will fail. > 0002 is an independent feature, so I would keep it out of the fault > framework for integration. There has been a argument from Alvaro > more convincing than mine about the use of a separate keyword, hence > removing a dependency with steps: > https://www.postgresql.org/message-id/20190823153825.GA11405@alvherre.pgsql > That is a valid point, thank you Alvaro for the feedback. I've changed 0002 so that a step within a permutation can be declared as blocking, revised patch set is attached. > It would be good also to have a test case which exercises it, without > the need of the fault framework or its dedicated schedule. > It is for this reason that I have not separated patch 0002 out from faultinjector patch set because the test to demonstrate the blocking feature uses faults. I need to give more thought to find a test having a session that needs to block for reasons other than locking. Any pointers will be very helpful. > > My first impressions about this patch is that it is very intrusive. > Could you explain the purpose of am_faultinjector? That's a specific > connection string parameter which can be used similarly to replication > for WAL senders? Couldn't there be an equivalent with a SUSET GUC? Thank you for the review. Admittedly, the patch set doesn't include a test to demonstrate am_faultinjector. That is used when a fault needs to be injected into a remote server, say a standby. And that standby may be accepting connections or not, depending on if it's operating in hot-standby mode. Therefore, the am_faultinjector and the connection parameter is used to identify fault injection requests and allow those to be handled even when normal user connections are not allowed. Also, such connections do not need to be associated with a database, they simply need to set the fault in the shared memory hash table. In that sense, fault injection connections are treated similar to replication connections. I was looking into tests under src/test/recovery/t/. Let me write a test to demonstrate what I'm trying to explain above. > It may be interesting to see which parts of this framework could be > moved into an extension loaded with shared_preload_libraries, one > thing being the shared memory initialization part. At the end it > would be interesting to not have a dependency with a compile-time > flag. Patch 0001 includes an extension that provides a SQL UDF as a wrapper over the fault injection interface in backend. Moving the backend part of the patch also into an extension seems difficult to me. Getting rid of the compile time dependency is a strong enough advantage to spend more efforts on this. > > Things like exec_fault_injector_command() need to be more documented. > It is hard to guess what it is being used for. Added a comment to explain things a bit. Hope that helps. And as mentioned above, I'm working on a test case to demonstrate this feature. 
Asim

Attachments:
v1-0001-Framework-to-inject-faults-from-SQL-tests.patch
v1-0002-Add-syntax-to-declare-a-step-that-is-expected-to-.patch
v1-0003-Speculative-insert-isolation-test-spec-using-faul.patch
v1-0004-Run-tests-with-faults-if-faultinjector-was-compil.patch
v1-0005-Isolation-schedule-for-tests-that-inject-faults.patch
Re: Cleanup isolation specs from unused steps
On Fri, Aug 23, 2019 at 9:08 PM Alvaro Herrera wrote: > > On 2019-Aug-23, Asim R P wrote: > > > As part of the fault injector patch set [1], I added a new "blocking" > > keyword to isolation grammar so that a step can be declared as blocking. > > See patch 0002-Add-syntax-to-declare-a-step-that-is-expected-to-block. > > One point to that implementation is that in that design a step is > globally declared to be blocking, but in reality that's the wrong way to > see things: a step might block in some permutations and not others. So > I think we should do as Michael suggested: it's the permutation that has > to have a way to mark a given step as blocking, not the step itself. Thank you for the feedback. I've changed patch 0002 accordingly, please take another look: https://www.postgresql.org/message-id/CANXE4TdvSi7Yia_5sV82%2BMHf0WcUSN9u6_X8VEUBv-YStphd%3DQ%40mail.gmail.com Asim
Re: Why overhead of SPI is so large?
On Sat, Aug 24, 2019 at 12:01 PM David Fetter wrote: > No, it's lying to the RDBMS, so it's pilot error. The problem of > determining from the function itself whether it is in fact immutable > is, in general, equivalent to the Halting Problem, so no, we can't > figure it out. We do need to trust our users not to lie to us, and we > do not need to protect them from the consequences when they do. Depends. I don't mind if mislabeling a function leads to "wrong" query results, but I don't think it's OK for it to, say, crash the server. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PostgreSQL and Real Application Testing (RAT)
ROS Didier schrieb am 27.08.2019 um 12:47:
> In my business, one of the things blocking the migration from Oracle
> to PostgreSQL is not having an equivalent of Oracle Real Application
> Testing.
>
> This product captures a workload in production and replays it in a test
> environment. This makes it possible to assess the impact of a migration
> to a newer version, of creating an index, and so on.
>
> Is there an equivalent in the PostgreSQL community?
>
> If not, do you think it's technically possible to build one?
>
> Who would be interested in this project?

Not sure how up-to-date that is, but you might want to have a look here:

https://wiki.postgresql.org/wiki/Statement_Playback
Re: "ago" times on buildfarm status page
On 8/27/19 8:45 AM, Andrew Dunstan wrote:
> On 8/27/19 4:33 AM, Tom Turelinckx wrote:
>> On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote:
>>> I think this is the problem:
>>>
>>> 'scmrepo' => '/home/pgbf/pgmirror.git',
>>>
>>> Probably this isn't updated often enough. It probably has little to do with
>>> the clock settings.
>>>
>>> This is the kind of old-fashioned way of doing things. These days
>>> "git_keep_mirror => 1" along with the community repo as the base would
>>> avoid these problems.
>> We've discussed this before (see below).
>>
>> Hm. So the issue really is that the build timestamp that the buildfarm
>> client is reporting tells when it pulled from the local repo, not when
>> that repo was last updated from the community server. Not sure if there's
>> any simple way to improve that ... Andrew, any thoughts?
>
> Maybe we need an option to use the git commit time instead of the
> snapshot time.

Scratch that - we use this to calculate the duration of the first stage, so mangling it would just create another error.

It's tempting to say we should sort the dashboard by git reference time and then snapshot time - that should be fairly doable. But what if there isn't a git reference, as happens when there's a git failure, for example? In those cases maybe just use the snapshot time?

Storing the git timestamp would involve a table change in our second largest table, so the team would need to discuss and plan it.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: "ago" times on buildfarm status page
"Tom Turelinckx" writes: > On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote: >> I think this is the problem: >> 'scmrepo' => '/home/pgbf/pgmirror.git', >> Probably this isn't updated often enough. It probably has little to do with >> the clock settings. > The configuration is intentionally like that. I specifically configured skate > and snapper to build the exact same source, where skate builds with default > buildfarm settings, while snapper builds with the settings actually used by > Debian source packages. TBH, I don't find that particularly important ... especially not for HEAD builds, where building a many-hours-old snapshot is pretty much in the category of "why bother?". On the whole, I think building from the latest available source is the most useful policy. If there's some platform- or configuration-specific issue, it usually takes more than one build cycle for us to notice it anyway, so that ensuring two animals have exactly comparable builds at any given instant isn't very helpful. regards, tom lane
Re: "ago" times on buildfarm status page
On 8/27/19 10:15 AM, Tom Lane wrote: > "Tom Turelinckx" writes: >> On Mon, Aug 26, 2019, at 9:08 PM, Andrew Dunstan wrote: >>> I think this is the problem: >>> 'scmrepo' => '/home/pgbf/pgmirror.git', >>> Probably this isn't updated often enough. It probably has little to do with >>> the clock settings. >> The configuration is intentionally like that. I specifically configured >> skate and snapper to build the exact same source, where skate builds with >> default buildfarm settings, while snapper builds with the settings actually >> used by Debian source packages. > TBH, I don't find that particularly important ... especially not for HEAD > builds, where building a many-hours-old snapshot is pretty much in the > category of "why bother?". On the whole, I think building from the latest > available source is the most useful policy. If there's some platform- > or configuration-specific issue, it usually takes more than one build > cycle for us to notice it anyway, so that ensuring two animals have exactly > comparable builds at any given instant isn't very helpful. > > Yeah, point. snapper seems the more important box here. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
range bug in resolve_generic_type?
Hello,

I was looking at resolve_generic_type to add anymultirange support, and the range logic doesn't seem correct to me. This function takes 3 type Oids:

- declared_type is the declared type of a function parameter whose actual type it would like to deduce.
- context_{declared,actual}_type are the {declared,actual} types of another parameter, as a hint.

Here is the logic in pseudocode:

    if (declared_type == ANYARRAYOID)
    {
        if (context_declared_type == ANYARRAYOID)
            return the array type
        else if (context_declared_type == ANYELEMENTOID ||
                 context_declared_type == ANYNONARRAYOID ||
                 context_declared_type == ANYENUMOID ||
                 context_declared_type == ANYRANGEOID ||
                 context_declared_type == ANYMULTIRANGEOID)
            return an array of those things
    }
    else if (declared_type == ANYELEMENTOID ||
             declared_type == ANYNONARRAYOID ||
             declared_type == ANYENUMOID ||
             declared_type == ANYRANGEOID)
    {
        if (context_declared_type == ANYARRAYOID)
            return the element type
        else if (context_declared_type == ANYRANGEOID)
            return the element type
        else if (context_declared_type == ANYELEMENTOID ||
                 context_declared_type == ANYNONARRAYOID ||
                 context_declared_type == ANYENUMOID)
            return the actual type
    }
    else
    {
        return declared_type    // since it's not polymorphic
    }

It seems to me that these inputs would give invalid results:

resolve_generic_type(ANYARRAYOID, x, ANYRANGEOID) - this will return an array of the *range type*, but that contradicts the normal relationship between anyelement and anyrange. It should return an array of the range's element type.

resolve_generic_type(ANYRANGEOID, x, ANYRANGEOID) - this will return the known range's *element* type, but it should return the known range.

Fortunately we never call the function in either of those ways. The only call site I could find is utils/fmgr/funcapi.c, and it only calls it like this:

    resolve_generic_type(ANYELEMENTOID, anyarray_type, ANYARRAYOID)
    resolve_generic_type(ANYELEMENTOID, anyrange_type, ANYRANGEOID)
    resolve_generic_type(ANYARRAYOID, anyelement_type, ANYELEMENTOID)

So I'm curious what I should do about all that, if anything. I'm happy to fix it (although I'm not sure how I'd test my changes), but it seems worth asking first. The first case in particular we *could* use to guess an anyarray's type if we wanted to, so I could add that to funcapi.c and then probably invent some scenario to test it. For the second case, I'm guessing if a function has the *same* polymorphic parameter we probably make an inference without using resolve_generic_type, since they should obviously match. But does anyone here have a suggestion?

Thanks!
Paul
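To make the two problematic calls concrete, here is a hypothetical test fragment (not something in the tree, and the expected results are the ones argued for above, not the current behavior), assuming int4range as the known context type:

    #include "postgres.h"
    #include "catalog/pg_type_d.h"
    #include "parser/parse_coerce.h"

    static void
    check_generic_resolution(void)
    {
        /*
         * Known anyrange argument of type int4range; resolving a declared
         * anyarray currently yields int4range[], though by the
         * anyelement/anyrange relationship one could argue for int4[].
         */
        Oid array_result = resolve_generic_type(ANYARRAYOID, INT4RANGEOID, ANYRANGEOID);

        /*
         * Resolving a declared anyrange against a known anyrange arguably
         * should return int4range itself, but the current code returns the
         * range's element type.
         */
        Oid range_result = resolve_generic_type(ANYRANGEOID, INT4RANGEOID, ANYRANGEOID);

        elog(NOTICE, "anyarray resolves to %u, anyrange resolves to %u",
             array_result, range_result);
    }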
Re: old_snapshot_threshold vs indexes
Thomas Munro writes: > On Tue, Aug 27, 2019 at 1:54 PM Tom Lane wrote: >> +1. That fix is also back-patchable, which adding fields to relcache >> entries would not be. > There is a fly in the ointment: REL9_6_STABLE's copy of > RelationHasUnloggedIndex() is hardcoded to return true for hash > indexes (see commit 2cc41acd8). True, in 9.6 hash indexes *were* effectively unlogged, so that the code actually did something in that branch. Given the lack of bug reports traceable to this, I wouldn't feel too bad about leaving it alone in 9.6. > However, I now see that there isn't a buffer content lock deadlock > risk here after all, because we don't reach RelationHasUnloggedIndex() > if IsCatalogRelation(rel). That reminds me of commit 4fd05bb55b4. It > still doesn't seem like a great idea to be doing catalog access while > holding the buffer content lock, though. Yeah, I'm not convinced that that observation means the problem is unreachable. Probably does make it harder to hit a deadlock, but if you mix a few VACUUMs and untimely cache flushes into the equation, I feel like one could still happen. > So I think we need to leave 9.6 as is, and discuss how far back to > back-patch the attached. It could go back to 10, but perhaps we > should be cautious and push it to master only for now, if you agree > with my analysis of the deadlock thing. I'd vote for back-patching to 10. Even if there is in fact no deadlock hazard, you've clearly demonstrated a significant performance hit that we're taking for basically no reason. In the larger picture, the commit this reminds me of is b04aeb0a0. I'm wondering if we could add some assertions to the effect of "don't initiate relcache or syscache lookups while holding a buffer lock". It would be relatively easy to do that if we could make the rule be "... while holding any LWLock", but I suspect that that would break some legitimate cases. regards, tom lane
Re: range bug in resolve_generic_type?
Paul A Jungwirth writes: > I was looking at resolve_generic_type to add anymultirange support, > and the range logic doesn't seem correct to me. Hmmm... > resolve_generic_type(ANYARRAYOID, x, ANYRANGEOID) - this will return > an array of the *range type*, but that contracts the normal > relationship between anyelement and anyrange. It should return an > array of the range's element type. I seem to recall that we discussed this exact point during development of the range feature, and concluded that this was the behavior we wanted, ie, treat anyrange like a scalar for this purpose. Otherwise there isn't any way to use a polymorphic function to build an array of ranges, and that seemed more useful than being able to build an array of the range elements. Jeff might remember more here. > resolve_generic_type(ANYRANGEOID, x, ANYRANGEOID) - this will return > the known range's *element* type, but it should return the known > range. Yeah, that seems like a flat-out bug. > Fortunately we never call the function in either of those ways. Wouldn't it depend on the signature+result type of the user-defined function we're dealing with? (That is, calls with constant argument types are probably not the interesting ones.) regards, tom lane
Re: range bug in resolve_generic_type?
On Tue, Aug 27, 2019 at 8:23 AM Tom Lane wrote:
> > resolve_generic_type(ANYARRAYOID, x, ANYRANGEOID) - this will return
> > an array of the *range type*, but that contracts the normal
> > relationship between anyelement and anyrange. It should return an
> > array of the range's element type.
>
> I seem to recall that we discussed this exact point during development
> of the range feature, and concluded that this was the behavior we
> wanted, ie, treat anyrange like a scalar for this purpose. Otherwise
> there isn't any way to use a polymorphic function to build an array
> of ranges

Well, I don't think that works anyway. At least I couldn't get it to work here:
https://www.postgresql.org/message-id/CA%2BrenyVOjb4xQZGjdCnA54-1nzVSY%2B47-h4nkM-EP5J%3D1z%3Db9w%40mail.gmail.com

But also check_generic_type_consistency works the way I'm saying:

- if anyrange = r then anyelement = elemOf(r)
- if anyarray = a then anyelement = elemOf(a)
- elemOf(r) = elemOf(a)

So resolve_generic_type should agree with that, right?

Also, I'm interested in adding not just anymultirange but also anyrangearray, which *would* let you have polymorphic arrays-of-ranges. (I thought I would need anyrangearray for a multirange constructor, but actually now I think I might not need it---maybe. But still it seems like a helpful thing.)

> > Fortunately we never call the function in either of those ways.
>
> Wouldn't it depend on the signature+result type of the user-defined
> function we're dealing with? (That is, calls with constant argument
> types are probably not the interesting ones.)

I suppose an extension could call it (although it seems unlikely). But I couldn't find anywhere in the Postgres code that doesn't call it with hardcoded arguments. (I certainly could have missed something though.)

Thanks!
Paul
Re: pgbench - implement strict TPC-B benchmark
> On Mon, Aug 5, 2019 at 10:46 PM Fabien COELHO wrote: > > > The index builds are done serially. The vacuum could be replaced by COPY > > FREEZE. > > Well, it could be added? While doing benchmarking using different tools, including pgbench, I found it useful as a temporary hack to add copy freeze and maintenance_work_mem options (the last one not as an environment variable, just as a SET issued beforehand, although I'm not sure that's the best idea). Is it similar to what you were talking about? v1-0001-pgbench-load-options.patch Description: Binary data
basebackup.c's sendFile() ignores read errors
While reviewing a proposed patch to basebackup.c this morning, I found myself a bit underwhelmed by the quality of the code and comments in basebackup.c's sendFile(). I believe it's already been pointed out that the retry logic here is not particularly correct, and the comments demonstrate a pretty clear lack of understanding of how the actual race conditions that are possible here might operate. However, I then noticed another problem which I think is significantly more serious: this code totally ignores the possibility of a failure in fread(). It just assumes that if fread() fails to return a positive value, it's reached the end of the file, and if that happens, it just pads out the rest of the file with zeroes. So it looks to me like if fread() encountered, say, an I/O error while trying to read a data file, and if that error were on the first data block, then the entire contents of that file would be replaced with zero bytes in the backup, and no error or warning of any kind would be given to the user. If it hit the error later, everything from the point of the error onward would just get replaced with zero bytes. To be clear, I think it is fine for basebackup.c to fill out the rest of the expected length with zeroes if in fact the file has been truncated during the backup; recovery should fix it. But recovery is not going to fix data that got eaten because EIO was encountered while copying a file. The logic that rereads a block appears to have the opposite problem. Here, it will detect an error, but it will also fail in the face of a concurrent truncation, which it shouldn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
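For illustration, here is a minimal sketch of the kind of check being argued for: treating a short read as EOF only when ferror() reports no I/O error. The variable names are borrowed from sendFile(), but the code itself is an assumption about what a fix could look like, not the actual patch:

/*
 * Hedged sketch only: distinguish an fread() failure from a genuine EOF
 * (e.g. a concurrent truncation, which recovery will fix later).
 */
size_t		cnt;

cnt = fread(buf, 1, sizeof(buf), fp);
if (cnt == 0)
{
	if (ferror(fp))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not read file \"%s\": %m", readfilename)));

	/* plain EOF: fall through to the zero-padding path */
}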
Re: block-level incremental backup
On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke wrote: > [ patches ] Reviewing 0002 and 0003: - Commit message for 0003 claims magic number and checksum are 0, but that (fortunately) doesn't seem to be the case. - looks_like_rel_name actually checks whether it looks like a *non-temporary* relation name; suggest adjusting the function name. - The names do_full_backup and do_incremental_backup are quite confusing because you're really talking about what to do with one file. I suggest sendCompleteFile() and sendPartialFile(). - Is there any good reason to have 'refptr' as a global variable, or could we just pass the LSN around via function arguments? I know it's just mimicking startptr, but storing startptr in a global variable doesn't seem like a great idea either, so if it's not too annoying, let's pass it down via function arguments instead. Also, refptr is a crappy name (even worse than startptr); whether we end up with a global variable or a bunch of local variables, let's make the name(s) clear and unambiguous, like incremental_reference_lsn. Yeah, I know that's long, but I still think it's better than being unclear. - do_incremental_backup looks like it can never report an error from fread(), which is bad. But I see that this is just copied from the existing code which has the same problem, so I started a separate thread about that. - I think that passing cnt and blkindex to verify_page_checksum() doesn't look very good from an abstraction point of view. Granted, the existing code isn't great either, but I think this makes the problem worse. I suggest passing "int backup_distance" to this function, computed as cnt - BLCKSZ * blkindex. Then, you can fseek(-backup_distance), fread(BLCKSZ), and then fseek(backup_distance - BLCKSZ). - While I generally support the use of while and for loops rather than goto for flow control, a while (1) loop that ends with a break is functionally a goto anyway. I think there are several ways this could be revised. The most obvious one is probably to use goto, but I vote for inverting the sense of the test: if (PageIsNew(page) || PageGetLSN(page) >= startptr) break; This approach also saves a level of indentation for more than half of the function. - I am not sure that it's a good idea for sendwholefile = true to result in dumping the entire file onto the wire in a single CopyData message. I don't know of a concrete problem in typical configurations, but someone who increases RELSEG_SIZE might be able to overflow CopyData's length word. At 2GB the length word would be negative, which might break, and at 4GB it would wrap around, which would certainly break. See CopyData in https://www.postgresql.org/docs/12/protocol-message-formats.html To avoid this issue, and maybe some others, I suggest defining a reasonably large chunk size, say 1MB as a constant in this file someplace, and sending the data as a series of chunks of that size. - I don't think that the way concurrent truncation is handled is correct for partial files. Right now it just falls through to code which appends blocks of zeroes in either the complete-file or partial-file case. I think that logic should be moved into the function that handles the complete-file case. In the partial-file case, the blocks that we actually send need to match the list of block numbers we promised to send. We can't just send the promised blocks and then tack a bunch of zero-filled blocks onto the end that the file header doesn't know about. 
- For reviewer convenience, please use the -v option to git format-patch when posting and reposting a patch series. Using -v2, -v3, etc. on successive versions really helps. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
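To make the verify_page_checksum() suggestion above concrete, here is a minimal sketch of the proposed fseek/fread dance, with backup_distance computed as cnt - BLCKSZ * blkindex; the helper name and the error handling are assumptions, not code from the patch:

/*
 * Hedged sketch only: re-read the block that starts backup_distance bytes
 * behind the current read position, then restore the position so that the
 * caller's outer read loop is undisturbed.
 */
static bool
reread_block(FILE *fp, int backup_distance, char *pagebuf)
{
	/* step back to the start of the block in question */
	if (fseek(fp, -backup_distance, SEEK_CUR) != 0)
		return false;

	/* re-read the whole block */
	if (fread(pagebuf, 1, BLCKSZ, fp) != BLCKSZ)
		return false;

	/* the read advanced BLCKSZ bytes; move back to where we started */
	return fseek(fp, backup_distance - BLCKSZ, SEEK_CUR) == 0;
}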
Re: PostgreSQL and Real Application Testing (RAT)
On Tue, 27 Aug 2019 at 05:47, ROS Didier wrote: > Hi > > > > In my business, one of the things blocking the migration from Oracle to > PostgreSQL is not having the equivalent of Oracle Real Application Testing . > > This product captures a charge in production and replay it in a test > environment. > > this allows to know the impacts of a migration to a newer version, the > creation of an index.. > > is there an equivalent in the PostgreSQL community? > I used https://github.com/laurenz/pgreplay recently to re-execute the queries sent to a pg9.1 in a pg11. It was very useful to find queries that are affected by changes in default values of GUCs. Normally, a query that works in an old version will work in a new one; but this is useful to catch the few that don't, if any. -- Jaime Casanova www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Missing newline in pg_upgrade usage()
In 959f6d6a1821 it seems that I fat-fingered and missed a newline in the usage output, which makes --check become appended to -B rather than on its own line. The attached diff fixes that. cheers ./daniel pg_upgrade_help_check.diff Description: Binary data
Re: Missing newline in pg_upgrade usage()
Daniel Gustafsson writes: > In 959f6d6a1821 it seems that I fat-fingered and missed a newline in the usage > output, which makes --check become appended to -B rather than on its own line. > The attached diff fixes that. Roger, pushed. regards, tom lane
Re: row filtering for logical replication
Hi Euler, On 2019-02-03 13:14, Andres Freund wrote: On 2018-11-23 13:15:08 -0300, Euler Taveira wrote: Besides the problem presented by Hironobu-san, I'm doing some cleanup and improving docs. I also forget to declare pg_publication_rel TOAST table. Thanks for your review. As far as I can tell, the patch has not been refreshed since. So I'm marking this as returned with feedback for now. Please resubmit once ready. Do you have any plans for continuing working on this patch and submitting it again on the closest September commitfest? There are only a few days left. Anyway, I will be glad to review the patch if you do submit it, though I didn't yet dig deeply into the code. I've rebased recently the entire patch set (attached) and it works fine. Your tap test is passed. Also I've added a new test case (see 0009 attached) with real life example of bidirectional replication (BDR) utilising this new WHERE clause. This naive BDR is implemented using is_cloud flag, which is set to TRUE/FALSE on cloud/remote nodes respectively. Although almost all new tests are passed, there is a problem with DELETE replication, so 1 out of 10 tests is failed. It isn't replicated if the record was created with is_cloud=TRUE on cloud, replicated to remote; then updated with is_cloud=FALSE on remote, replicated to cloud; then deleted on remote. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom ae80e1616fb6374968a09e3c44f0abe59ebf3a87 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Fri, 9 Mar 2018 18:39:22 + Subject: [PATCH v2 1/9] Remove unused atttypmod column from initial table synchronization Since commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920, atttypmod was added but not used. The removal is safe because COPY from publisher does not need such information. --- src/backend/replication/logical/tablesync.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index 7881079e96..0a565dd837 100644 --- a/src/backend/replication/logical/tablesync.c +++ b/src/backend/replication/logical/tablesync.c @@ -647,7 +647,7 @@ fetch_remote_table_info(char *nspname, char *relname, StringInfoData cmd; TupleTableSlot *slot; Oid tableRow[2] = {OIDOID, CHAROID}; - Oid attrRow[4] = {TEXTOID, OIDOID, INT4OID, BOOLOID}; + Oid attrRow[3] = {TEXTOID, OIDOID, BOOLOID}; bool isnull; int natt; @@ -691,7 +691,6 @@ fetch_remote_table_info(char *nspname, char *relname, appendStringInfo(&cmd, "SELECT a.attname," " a.atttypid," - " a.atttypmod," " a.attnum = ANY(i.indkey)" " FROM pg_catalog.pg_attribute a" " LEFT JOIN pg_catalog.pg_index i" @@ -703,7 +702,7 @@ fetch_remote_table_info(char *nspname, char *relname, lrel->remoteid, (walrcv_server_version(wrconn) >= 12 ? "AND a.attgenerated = ''" : ""), lrel->remoteid); - res = walrcv_exec(wrconn, cmd.data, 4, attrRow); + res = walrcv_exec(wrconn, cmd.data, 3, attrRow); if (res->status != WALRCV_OK_TUPLES) ereport(ERROR, @@ -724,7 +723,7 @@ fetch_remote_table_info(char *nspname, char *relname, Assert(!isnull); lrel->atttyps[natt] = DatumGetObjectId(slot_getattr(slot, 2, &isnull)); Assert(!isnull); - if (DatumGetBool(slot_getattr(slot, 4, &isnull))) + if (DatumGetBool(slot_getattr(slot, 3, &isnull))) lrel->attkeys = bms_add_member(lrel->attkeys, natt); /* Should never happen. 
*/ base-commit: 9acda731184c1ebdf99172cbb19d0404b7eebc37 -- 2.19.1 From 362f2cc97745690ff4739b530f5ba95aea59be09 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Fri, 9 Mar 2018 17:37:36 + Subject: [PATCH v2 2/9] Store number of tuples in WalRcvExecResult It seems to be a useful information while allocating memory for queries that returns more than one row. It reduces memory allocation for initial table synchronization. While in it, since we have the number of columns, allocate only nfields for cstrs instead of MaxTupleAttributeNumber. --- .../replication/libpqwalreceiver/libpqwalreceiver.c | 9 ++--- src/backend/replication/logical/tablesync.c | 5 ++--- src/include/replication/walreceiver.h| 1 + 3 files changed, 9 insertions(+), 6 deletions(-) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 6eba08a920..846b6f89f1 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -878,6 +878,7 @@ libpqrcv_processTuples(PGresult *pgres, WalRcvExecResult *walres, errdetail("Expected %d fields, got %d fields.", nRetTypes, nfields))); + walres->ntuples = PQntuples(pgres); walres->tuplestore = tuplestore_begin_heap(true, false, work_mem); /* Create tuple descriptor
Re: doc: clarify "pg_signal_backend" default role
Ian Barwick writes: > Currently the documentation for the default role "pg_signal_backend" states, > somewhat ambiguously, "Send signals to other backends (eg: cancel query, > terminate)", > giving the impression other signals (e.g. SIGHUP) can be sent too, which is > currently not the case. > Attached patch clarifies this, adds a descriptive paragraph (similar to what > the other default roles have) and a link to the "Server Signaling Functions" > section. Pushed with minor tweaking. (Note: patches are less likely to fall through the cracks if you add them to the commitfest page.) regards, tom lane
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
On Sun, Aug 25, 2019 at 9:35 PM Thomas Munro wrote: > > On Mon, Aug 26, 2019 at 1:44 PM Justin Pryzby wrote: > > On Mon, Aug 26, 2019 at 01:09:19PM +1200, Thomas Munro wrote: > > > On Sun, Aug 25, 2019 at 3:15 PM Peter Geoghegan wrote: > > > > I was reminded of this issue from last year, which also appeared to > > > > involve BufFileClose() and a double-free: > > > > > > > > https://postgr.es/m/87y3hmee19@news-spur.riddles.org.uk > > > > > > > > That was a BufFile that was under the control of a tuplestore, so it > > > > was similar to but different from your case. I suspect it's related. > > > > > > Hmm. tuplestore.c follows the same coding pattern as nodeHashjoin.c: > > > it always nukes its pointer after calling BufFileFlush(), so it > > > shouldn't be capable of calling it twice for the same pointer, unless > > > we have two copies of that pointer somehow. > > > > > > Merlin's reported a double-free apparently in ExecHashJoin(), not > > > ExecHashJoinNewBatch() like this report. Unfortunately that tells us > > > very little. > > Here's another one: > > https://www.postgresql.org/message-id/flat/20170601081104.1500.56202%40wrigleys.postgresql.org > > Hmm. Also on RHEL/CentOS 6, and also involving sorting, hashing, > BufFileClose() but this time the glibc double free error is in > repalloc(). > > And another one (repeatedly happening): > > https://www.postgresql.org/message-id/flat/3976998C-8D3B-4825-9B10-69ECB70A597A%40appnexus.com > > Also on RHEL/CentOS 6, this time a sort in once case and a hash join > in another case. > > Of course it's entirely possible that we have a bug here and I'm very > keen to find it, but I can't help noticing the common factor here is > that they're all running ancient RHEL 6.x releases, except Merlin who > didn't say. Merlin? Just noticed this. redhat-release: "Red Hat Enterprise Linux Server release 6.9 (Santiago)" merlin
Re: Zedstore - compressed in-core columnar storage
On Tue, Aug 27, 2019 at 12:03 AM Ashutosh Sharma wrote: > My point is, once we find the leaf page containing the given tid, we go > through each item in the page until we find the data corresponding to the > given tid which means we kind of perform a sequential scan at the page > level. I'm referring to the below loop in zsbt_attr_scan_fetch_array(). > > for (off = FirstOffsetNumber; off <= maxoff; off++) > { > ItemId iid = PageGetItemId(page, off); > ZSAttributeArrayItem *item = (ZSAttributeArrayItem *) > PageGetItem(page, iid); > > if (item->t_endtid <= nexttid) > continue; > > if (item->t_firsttid > nexttid) > break; > > But that's not true for IndexScan in case of heap table because there the > index tuple contains the exact physical location of tuple in the heap. So, > there is no need to scan the entire page. > You are correct that we currently go through each item in the leaf page that contains the given tid, specifically, the logic to retrieve all the attribute items inside a ZSAttStream is now moved to decode_attstream() in the latest code, and then in zsbt_attr_fetch() we again loop through each item we previously retrieved from decode_attstream() and look for the given tid. One optimization we can to is to tell decode_attstream() to stop decoding at the tid we are interested in. We can also apply other tricks to speed up the lookups in the page, for fixed length attribute, it is easy to do binary search instead of linear search, and for variable length attribute, we can probably try something that we didn't think of yet. 1) In zsundo_insert_finish(), there is a double call to > BufferGetPage(undobuf); Is that required ? > Fixed, thanks! 2) In zedstoream_fetch_row(), why is zsbt_tid_begin_scan() being called > twice? I'm referring to the below code. > > if (fetch_proj->num_proj_atts == 0) > { > > > zsbt_tid_begin_scan(rel, tid, tid + 1, > snapshot, > &fetch_proj->tid_scan); > fetch_proj->tid_scan.serializable = true; > > for (int i = 1; i < fetch_proj->num_proj_atts; i++) > { > int attno = fetch_proj->proj_atts[i]; > > zsbt_attr_begin_scan(rel, reldesc, attno, > &fetch_proj->attr_scans[i - 1]); > } > MemoryContextSwitchTo(oldcontext); > > zsbt_tid_begin_scan(rel, tid, tid + 1, snapshot, > &fetch_proj->tid_scan); > } > I removed the second call, thanks! > Also, for all types of update operation (be it key or non-key update) we > create a new tid for the new version of tuple. Can't we use the tid > associated with the old tuple for the cases where there is no concurrent > transactions to whom the old tuple is still visible. > Zedstore currently implement update as delete+insert, hence the old tid is not reused. We don't store the tuple in our UNDO log, and we only store the transaction information in the UNDO log. Reusing the tid of the old tuple means putting the old tuple in the UNDO log, which we have not implemented yet. Thanks for reporting, this is very helpful! Patches are welcome as well!
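As a concrete illustration of the binary-search idea mentioned above for fixed-length attributes, here is a minimal sketch. It assumes the decoded items are available as a sorted array; that representation and the helper itself are hypothetical, not actual zedstore code:

/*
 * Hedged sketch only: binary search over attribute items ordered by tid
 * range, replacing the linear scan quoted earlier.  ZSAttributeArrayItem,
 * t_firsttid and t_endtid come from the quoted code; everything else is
 * an assumption.
 */
static ZSAttributeArrayItem *
zsbt_attr_find_item(ZSAttributeArrayItem **items, int nitems, zstid nexttid)
{
	int			lo = 0;
	int			hi = nitems - 1;

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;
		ZSAttributeArrayItem *item = items[mid];

		if (item->t_endtid <= nexttid)
			lo = mid + 1;		/* item ends before the wanted tid */
		else if (item->t_firsttid > nexttid)
			hi = mid - 1;		/* item starts after the wanted tid */
		else
			return item;		/* t_firsttid <= nexttid < t_endtid */
	}

	return NULL;				/* no item covers nexttid */
}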
Re: PostgreSQL and Real Application Testing (RAT)
On Tue, Aug 27, 2019 at 3:47 AM ROS Didier wrote: > Hi > > > > In my business, one of the things blocking the migration from Oracle to > PostgreSQL is not having the equivalent of Oracle Real Application Testing . > > This product captures a charge in production and replay it in a test > environment. > > this allows to know the impacts of a migration to a newer version, the > creation of an index.. > > is there an equivalent in the PostgreSQL community? > > if not, do you think it's technically possible to do it ? > > who would be interested in this project ? > Replaying workload might or might not apply well to your case. There are several major difficulties if you want to replay workload: 1) How to "record" workload. You need to write all your queries to the Postgres log. Three problems here: 1a) pgreplay expects log_statements = 'all' while you might prefer dealing with log_min_duration_statement instead. This is a minor issue though, quite easy to solve with preprocessing. 1b) under heavy load, log_min_duration_statement = 0 (or log_statements = 'all') will lead to performance degradation or even downtime. Possible solutions are: write to memory, or don't write at all but send over the network. 1c) ideally, recoding just queries is not enough. To replay workload "as is", we need to replay queries with known plans. There is no easy solution to this problem in the Postgres ecosystem yet. A couple of additional points regarding item 1b and 1c. In Postgres 12, there is a cool new capability: sampling for query logging, implemented by Adrien Nayrat https://commitfest.postgresql.org/20/1691/ WIth this, it will be possible to fully log, say, 5% of all transactions and use it for replaying. Moreover, with auto_explain, it will be possible to have plans! Open questions are: (a) how to determine, if N% is enough, and (b) how to replay with specified plans. [If anyone is interested in working in this direction – please reach out to me.] 2) Issues with replaying itself. I can highlight at least two problems here: 2a) pgreplay might be not enough for your workload, it doesn't scale well. If interested, look at its analog written in Go, https://github.com/gocardless/pgreplay-go, but this is quite a young project. 2b) Postgres logs have millisecond precision (if you switched from %t to %m in log_line_prefix), this might be not enough. There is a patch to microsecond precision from David Fetter https://www.postgresql.org/message-id/flat/20181023185050.GE6049%40fetter.org, but that conversation hasn't yet led to commit. Another approach you might be interested in -- workload simulation. This is what we (Postgres.ai) now used in most times when building "lab" environments for our clients. The idea is as follows: - carefully analyze workload using pg_stat_statements (here, our open-source tool called "postgres-checkup" https://gitlab.com/postgres-ai/postgres-checkup might be helpful, see reports in section K), - take the most resource-consuming query groups (Top-N ordered by total_time), - create a set of files with statements with randomly filled parameters (won't work for most cases, I discuss restrictions below), - use pgbench, feed workload files to it, using multiple -f options, with balancing (-f filename@XX, where XX is to be taked from pg_statements_analysis, but this time, "calls" and their ratio in the whole workload will be needed -- again, postgres-checkup can help here). - run, analyze, compare behavior. 
Restrictions of this approach are obvious: - doesn't work well if most of your transactions have multiple statements, - in many cases, randomization is hard (not obvious how to organize; synthetic approach is far from real data distribution in storage and workload; etc), - the approach requires a significant amount of manual efforts. However, the "workload simulation" approach is an extremely helpful approach in many cases, helping with change management. It doesn't require anything that might negatively affect your production workload, it utilizes pgbench (or any other tool) which is reliable, has great features and scales well. You might be interested in looking at our tool that we built to conduct a huge amount of DB experiments, Nancy CLI https://gitlab.com/postgres-ai/nancy. It supports both "workload replay" method (with pgreplay) and "workload simulation" (with pgbench). PM me if you're interested in discussing details. Thanks, Nik
Re: PostgreSQL and Real Application Testing (RAT)
On Tue, 27 Aug 2019 at 19:33, Nikolay Samokhvalov wrote: > On Tue, Aug 27, 2019 at 3:47 AM ROS Didier wrote: > >> Hi >> >> >> >> In my business, one of the things blocking the migration from Oracle to >> PostgreSQL is not having the equivalent of Oracle Real Application Testing . >> >> This product captures a charge in production and replay it in a test >> environment. >> >> this allows to know the impacts of a migration to a newer version, the >> creation of an index.. >> >> is there an equivalent in the PostgreSQL community? >> >> if not, do you think it's technically possible to do it ? >> >> who would be interested in this project ? >> > > Replaying workload might or might not apply well to your case. > > There are several major difficulties if you want to replay workload: > > 1) How to "record" workload. You need to write all your queries to the > Postgres log. Three problems here: > 1a) pgreplay expects log_statements = 'all' while you might prefer > dealing with log_min_duration_statement instead. This is a minor issue > though, quite easy to solve with preprocessing. > 1b) under heavy load, log_min_duration_statement = 0 (or log_statements > = 'all') will lead to performance degradation or even downtime. Possible > solutions are: write to memory, or don't write at all but send over the > network. > 1c) ideally, recoding just queries is not enough. To replay workload "as > is", we need to replay queries with known plans. There is no easy solution > to this problem in the Postgres ecosystem yet. > > why? i prefer queries to take advantage of new plans for example if i'm migrating from 9.5 to 9.6+ i would prefer that, when replaying, the queries use parallel plans so i quickly get if that would somehow be a problem (for example by using more cpu than before) -- Jaime Casanova www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: doc: clarify "pg_signal_backend" default role
On 8/28/19 7:04 AM, Tom Lane wrote: Ian Barwick writes: Currently the documentation for the default role "pg_signal_backend" states, somewhat ambiguously, "Send signals to other backends (eg: cancel query, terminate)", giving the impression other signals (e.g. SIGHUP) can be sent too, which is currently not the case. Attached patch clarifies this, adds a descriptive paragraph (similar to what the other default roles have) and a link to the "Server Signaling Functions" section. Pushed with minor tweaking. Thanks! (Note: patches are less likely to fall through the cracks if you add them to the commitfest page.) Yup, though I was intending to add that one together with a couple of related minor doc patches to the next CF. Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH] Make configuration file "include" directive handling more robust
On 8/25/19 4:39 AM, Tom Lane wrote: Ian Barwick writes: On 7/17/19 5:34 PM, Kyotaro Horiguchi wrote:> Hello. I don't think this is new to 12. No, though I'm not sure how much this would be seen as a bugfix and how far back it would be sensible to patch. I think this is worth considering as a bugfix; although I'm afraid we can't change the signature of ParseConfigFile/ParseConfigFp in released branches, since extensions could possibly be using those. That limits what we can do --- but it's still possible to detect direct recursion, which seems like enough to produce a nice error message in typical cases. Makes sense. I concur with Kyotaro-san that disallow-empty-include-directives.v1.patch seems a bit brute-force, but where I would put the checks is in ParseConfigFile and ParseConfigDirectory. Also, I don't agree with the goals of prevent-disallowed-includes.patch. I'm utterly not on board with breaking use of "include" in extension files, for instance; while that may not be documented, it works fine, and maybe somebody out there is relying on it. I couldn't for the life of me think of any reason for using it. But if there's undocumented functionality we think someone might be using, shouldn't that be documented somewhere, if only as a note in the code to prevent its accidental removal at a later date? Likewise, while "include" in pg.auto.conf is not really considered supported, I don't see the point of going out of our way to break the historical behavior. The amusing thing about that of course is that the include directive will disappear the next time ALTER SYSTEM is run and the values from the included file will appear in pg.auto.conf, which may cause some headscratching. But I guess hasn't been an actual real-world issue so far. That leads me to propose the attached simplified patch. While I haven't actually tried, I'm pretty sure this should back-patch without trouble. Ah, I see it's been applied already, thanks! Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pg11.5: ExecHashJoinNewBatch: glibc detected...double free or corruption (!prev)
On Mon, Aug 26, 2019 at 12:01 PM Tom Lane wrote: > > Justin Pryzby writes: > > On Mon, Aug 26, 2019 at 12:45:24PM -0400, Tom Lane wrote: > >> However ... there is some pretty interesting info at > >> https://bugzilla.redhat.com/show_bug.cgi?id=1338673 > >> suggesting that compiling with a late-model gcc against older RHEL6 > >> headers could result in bad code. I wonder whether the reporters' > >> servers were built using such a configuration. (Although the linkage, > >> if any, to this report still wouldn't be very clear.) > > > I can tell it was compiled using > > version | PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) > > 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit > > Ah, that appears to be the default compiler for RHEL6, so that theory > is out the window. It's still interesting that we're only seeing this > reported from RHEL6 ... maybe there's something specific to the code > that this gcc version generates? FWIW, I've got the same compiler (which is natural, we are likely using the same packaging): PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit merlin
Re: Statement timeout in pg_rewind
On Tue, Aug 27, 2019 at 10:45:27AM +0200, Alexander Kukushkin wrote: > Done, please see the next version attached. I have made the new error message consistent with run_simple_query, to avoid extra work for translators and because the SQL query makes it possible to know immediately which code path is involved, and then applied the fix down to 9.5 where pg_rewind has been added. Please note that your patch had a warning, as "result" is not needed in run_simple_command(). idle_in_transaction_session_timeout only applies to 9.6 and newer versions. lock_timeout (imagine a concurrent lock on pg_class for example) and statement_timeout can cause issues, but the full set gets disabled as your patch did and as mentioned upthread. -- Michael signature.asc Description: PGP signature
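For readers following along, here is a minimal sketch of what disabling that full set of timeouts can look like on pg_rewind's connection; run_simple_command() is the helper from the patch under discussion, and its exact signature is assumed here:

/*
 * Hedged sketch only: neutralize server-side timeouts on the connection
 * pg_rewind uses against a live source server.
 */
static void
disable_connection_timeouts(void)
{
	run_simple_command("SET statement_timeout = 0");
	run_simple_command("SET lock_timeout = 0");

	/* this one only exists on 9.6 and newer servers */
	run_simple_command("SET idle_in_transaction_session_timeout = 0");
}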
Re: Cleanup isolation specs from unused steps
On Tue, Aug 27, 2019 at 07:05:50PM +0530, Asim R P wrote: > Thank you for the feedback. I've changed patch 0002 accordingly, please > take another look: > https://www.postgresql.org/message-id/CANXE4TdvSi7Yia_5sV82%2BMHf0WcUSN9u6_X8VEUBv-YStphd%3DQ%40mail.gmail.com Thanks! Let's move the discussion on the other thread then. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova wrote: > Now the algorithm is the following: > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. > > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I think that the way this works within nbtsplitloc.c is too complicated. In v5, the only thing that nbtsplitloc.c knew about deduplication was that it could be sure that suffix truncation would at least make a posting list into a single heap TID in the worst case. This consideration was mostly about suffix truncation, not deduplication, which seemed like a good thing to me. _bt_split() and _bt_findsplitloc() should know as little as possible about posting lists. Obviously it will sometimes be necessary to deal with the case where a posting list is about to become too big (i.e. it's about to go over BTMaxItemSize()), and so must be split. Less often, a page split will be needed because of one of these posting list splits. These are two complicated areas (posting list splits and page splits), and it would be a good idea to find a way to separate them as much as possible. Remember, nbtsplitloc.c works by pretending that the new item that cannot fit on the page is already on its own imaginary version of the page that *can* fit the new item, along with everything else from the original/actual page. That gets *way* too complicated when it has to deal with the fact that the new item is being merged with an existing item. Perhaps nbtsplitloc.c could also "pretend" that the new item is always a plain tuple, without knowing anything about posting lists. Almost like how it worked in v5. We always want posting lists to be as close to the BTMaxItemSize() size as possible, because that helps with space utilization. In v5 of the patch, this was what happened, because, in effect, we didn't try to do anything complicated with the new item. This worked well, apart from the crash safety issue. Maybe we can simulate the v5 approach, giving us the best of all worlds (good space utilization, simplicity, and crash safety). Something like this: * Posting list splits should always result in one posting list that is at or just under BTMaxItemSize() in size, plus one plain tuple to its immediate right on the page. This is similar to the more common case where we cannot add additional tuples to a posting list due to the BTMaxItemSize() restriction, and so end up with a single tuple (or a smaller posting list with the same value) to the right of a BTMaxItemSize()-sized posting list tuple. I don't see a reason to split a posting list in the middle -- we should always split to the right, leaving the posting list as large as possible. * When there is a simple posting list split, with no page split, the logic required is fairly straightforward: We rewrite the posting list in-place so that our new item goes wherever it belongs in the existing posting list on the page (we memmove() the posting list to make space for the new TID, basically). The old last/rightmost TID in the original posting list becomes a new, plain tuple. We may need a new WAL record for this, but it's not that different to a regular leaf page insert. * When this happens to result in a page split, we then have a "fake" new item -- the right half of the posting list that we split, which is always a plain item. 
Obviously we need to be a bit careful with the WAL logging, but the space accounting within _bt_split() and _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can work like it did in v5, when the only thing it knew about posting lists was that _bt_truncate() always removes them, maybe leaving a single TID behind in the new high key. (Note also that it's not okay to remove the conservative assumption about at least having space for one heap TID within _bt_recsplitloc() -- that needs to be restored to its v5 state in the next version of the patch.) Because deduplication is lazy, there is little value in doing deduplication of the new item (which may or may not be the fake new item). The nbtsplitloc.c logic will "trap" duplicates on the same page today, so we can just let deduplication of the new item happen at a later time. _bt_split() can almost pretend that posting lists don't exist, and nbtsplitloc.c needs to know nothing about posting lists (apart from the way that _bt_truncate() behaves with posting lists). We "lie" to _bt_findsplitloc(), and tell it that the new item is our fake new item -- it doesn't do anything that will be broken by that lie, because it doesn't care about the actual content of posting lists. And, we can fix the "fake new item is not actually real new item" issue at one point within _bt_split(), just as we're about to WAL log. What do you think of that approach? -- Peter Geoghegan
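To visualize the in-place posting list split described above, here is a minimal sketch of the TID shuffling involved; the helper name, the bare TID-array representation, and the output convention are all hypothetical, and WAL logging is ignored entirely:

/*
 * Hedged sketch only: insert newtid into a posting list that is already at
 * its maximum size.  Exactly one TID is pushed out to the right and becomes
 * a plain (non-posting) tuple; the posting list keeps ntids entries.
 */
static void
posting_list_split(ItemPointerData *tids, int ntids,
				   ItemPointer newtid, ItemPointer plain_tid)
{
	int			pos;

	/* find the insertion position for the incoming heap TID */
	for (pos = 0; pos < ntids; pos++)
	{
		if (ItemPointerCompare(newtid, &tids[pos]) < 0)
			break;
	}

	if (pos == ntids)
	{
		/* new TID sorts after everything: it simply becomes the plain tuple */
		*plain_tid = *newtid;
		return;
	}

	/* the old rightmost TID is pushed out and becomes the plain tuple */
	*plain_tid = tids[ntids - 1];

	/* shift the tail right by one slot and place the new TID */
	memmove(&tids[pos + 1], &tids[pos],
			(ntids - 1 - pos) * sizeof(ItemPointerData));
	tids[pos] = *newtid;
}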
Re: Re: Email to hackers for test coverage
On Tue, Aug 27, 2019 at 03:57:20PM +0800, movead...@highgo.ca wrote: > I think your way is much better, so I change the patch and it is > 'regression_20190827.patch' now. Thanks for the new patch, I have committed the part for float4. > There are code lines related to NULL values in > ApplySortAbbrevFullComparator(), but I think the code lines for > comparing a NULL and a NOT NULL can be never reached, because it is > handled in ApplySortComparator() which is called before > ApplySortAbbrevFullComparator(). So I think it is no use to INSERT > a NULL value. But I am not sold on that part yet, for three reasons: - numeric is not a test suite designed for sorting, and hijacking it to do so is not a good approach. - it would be good to get coverage for the two extra code paths while we are on it with NULL datums. - There is no need for two INSERT queries, I am getting the same coverage with only one. Please note that I have not looked in detail at where we could put that, but perhaps Robert and Peter G, who worked on 4ea51cd to add support for abbreviated keys, have some ideas, so I am adding them in CC for input. -- Michael signature.asc Description: PGP signature
Re: REINDEX filtering in the backend
On Thu, Jul 11, 2019 at 11:14:20PM +0200, Julien Rouhaud wrote: > I didn't want to spend too much time enjoying bison and adding new > unreserved keywords, so for now I just implemented this syntax to > start a discussion for this feature in the next commitfest: > > REINDEX ( FILTER = COLLATION ) [...] > > The FILTER clause can be used multiple times, each one is OR-ed with > the ReindexStmt's option, so we could easily add a LIBC, ICU and other > filters, also making COLLATION (or more realistically a better new > keyword) an alias for (LIBC | ICU) for instance. I would prefer keeping the interface simple with only COLLATION, so as only collation sensitive indexes should be updated, including icu and libc ones. Or would it be useful to have the filtering for both as libicu can break things similarly to glibc in an update still a breakage on one or the other would not happen at the same time? I don't know enough of libicu regarding that, eager to learn. In which case, wouldn't it be better to support that from the start? > The filtering is done at table level (with and without the > concurrently option), so SCHEMA, DATABASE and SYSTEM automatically > benefit from it. If this clause is used with a REINDEX INDEX, the > statement errors out, as I don't see a valid use case for providing a > single index name and asking to possibly filter it at the same time. Supporting that case would not be that much amount of work, no? > I also added some minimal documentation and regression tests. I'll > add this patch to the next commitfest. > > [1] > https://www.postgresql.org/message-id/7140716c-679e-a0b9-a273-b201329d8891%402ndquadrant.com +if ((stmt->options & REINDEXOPT_ALL_FILTERS) != 0) +elog(ERROR, "FILTER clause is not compatible with REINDEX INDEX"); [...] + discard indexes whose ordering does not depend on a collation. Note that + the FILTER option is not compatible with REINDEX + SCHEMA. Why do you have both limitations? I think that it would be nice to be able to do both, generating an error for REINDEX INDEX if the index specified is not compatible, and a LOG if the index is not filtered out when a list is processed. Please note that elog() cannot be used for user-facing failures, only for internal ones. +REINDEX (VERBOSE, FILTER = COLLATION) TABLE reindex_verbose; +-- One column, not depending on a collation In order to make sure that a reindex has been done for a given entry with the filtering, an idea is to save the relfilenode before the REINDEX and check it afterwards. That would be nice to make sure that only the wanted indexes are processed, but it is not possible to be sure of that based on your tests, and some tests should be done on relations which have collation-sensitive as well as collation-insensitive indexes. + index = index_open(indexOid, AccessShareLock); + numAtts = index->rd_index->indnatts; + index_close(index, AccessShareLock); Wouldn't it be better to close that after doing the scan? Nit: I am pretty sure that this should be indented. Could you add tests with REINDEX CONCURRENTLY? -- Michael signature.asc Description: PGP signature
Re: REINDEX filtering in the backend
On Wed, Aug 28, 2019 at 02:02:08PM +0900, Michael Paquier wrote: > + index = index_open(indexOid, AccessShareLock); > + numAtts = index->rd_index->indnatts; > + index_close(index, AccessShareLock); > Wouldn't it be better to close that after doing the scan? > > Nit: I am pretty sure that this should be indented. > > Could you add tests with REINDEX CONCURRENTLY? Bonus: support for reindexdb should be added. Let's not forget about it. -- Michael signature.asc Description: PGP signature
Re: pgbench - implement strict TPC-B benchmark
Hello Dmitry, Well, it could be added? While doing benchmarking using different tools, including pgbench, I found it useful as a temporary hack to add copy freeze and maintenance_work_mem options (the last one not as an environment variable, just as a SET issued beforehand, although I'm not sure that's the best idea). Is it similar to what you were talking about? About this patch: Concerning the --maintenance... option, ISTM that there could rather be a generic way to provide "set" settings, not a specific option for a specific parameter with a specific unit. Moreover, ISTM that it only needs to be set once on a connection, not per command. I'd suggest something like: --connection-initialization '...' That would be issued when a connection is started, for any query, and the effect would be achieved with: pgbench --conn…-init… "SET maintenance_work_mem TO '12MB'" ... The --help does not say that the option expects a parameter. Also, in your patch it is an initialization option, but the code does not check for that. Concerning the freeze option: It is also an initialization-specific option that should be checked for that. The option does not make sense if not initializing. The alternative queries could be managed simply without intermediate variables. Pgbench documentation is not updated. There are no tests. This patch should be submitted in its own thread to help manage it in the CF app. -- Fabien.
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello, Noah. At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi wrote in <20190827.154932.250364935.horikyota@gmail.com> > I'm not sure whether the knob shows apparent performance gain and > whether we can offer the criteria to identify the proper > value. But I'll add this feature with a GUC > effective_io_block_size defaults to 64kB as the threshold in the > next version. (The name and default value are arguable, of course.) This is a new version of the patch based on the discussion. The differences from v19 are the follows. - Removed the new stuff in two-phase.c. The action on PREPARE TRANSACTION is now taken in PrepareTransaction(). Instead of storing pending syncs in two-phase files, the function immediately syncs all files that can survive the transaction end. (twophase.c, xact.c) - Separate pendingSyncs from pendingDeletes. pendingSyncs gets handled differently from pendingDeletes so it is separated. - Let smgrDoPendingSyncs() to avoid performing fsync on to-be-deleted files. In previous versions the function syncs all recorded files even if it is being deleted. Since we use WAL-logging as the alternative of fsync now, performance gets more significance g than before. Thus this version avoids uesless fsyncs. - Use log_newpage instead of fsync for small tables. As in the discussion up-thread, I think I understand how WAL-logging works better than fsync. smgrDoPendingSync issues log_newpage for all blocks in the table smaller than the GUC variable "effective_io_block_size". I found log_newpage_range() that does exact what is needed here but it requires Relation that is no available there. I removed an assertion in CreateFakeRelcacheEntry so that it works while non-recovery mode. - Rebased and fixed some bugs. I'm trying to measure performance difference on WAL/fsync. By the way, smgrDoPendingDelete is called from CommitTransaction and AbortTransaction directlry, and from AbortSubTransaction via AtSubAbort_smgr(), which calls only smgrDoPendingDeletes() and is called only from AbortSubTransaction. I think these should be unified either way. Any opinions? CommitTransaction() + msgrDoPendingDelete() AbortTransaction() + msgrDoPendingDelete() AbortSubTransactoin() AtSubAbort_smgr() + msgrDoPendingDelete() # Looking around, the prefixes AtEOact/PreCommit/AtAbort don't # seem to be used keeping a principle. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 83deb772808cdd3afdb44a7630656cc827adfe33 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 11 Oct 2018 10:03:21 +0900 Subject: [PATCH 1/4] TAP test for copy-truncation optimization. --- src/test/recovery/t/018_wal_optimize.pl | 312 1 file changed, 312 insertions(+) create mode 100644 src/test/recovery/t/018_wal_optimize.pl diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl new file mode 100644 index 00..b041121745 --- /dev/null +++ b/src/test/recovery/t/018_wal_optimize.pl @@ -0,0 +1,312 @@ +# Test WAL replay for optimized TRUNCATE and COPY records +# +# WAL truncation is optimized in some cases with TRUNCATE and COPY queries +# which sometimes interact badly with the other optimizations in line with +# several setting values of wal_level, particularly when using "minimal" or +# "replica". The optimization may be enabled or disabled depending on the +# scenarios dealt here, and should never result in any type of failures or +# data loss. 
+use strict; +use warnings; + +use PostgresNode; +use TestLib; +use Test::More tests => 26; + +sub check_orphan_relfilenodes +{ + my($node, $test_name) = @_; + + my $db_oid = $node->safe_psql('postgres', + "SELECT oid FROM pg_database WHERE datname = 'postgres'"); + my $prefix = "base/$db_oid/"; + my $filepaths_referenced = $node->safe_psql('postgres', " + SELECT pg_relation_filepath(oid) FROM pg_class + WHERE reltablespace = 0 and relpersistence <> 't' and + pg_relation_filepath(oid) IS NOT NULL;"); + is_deeply([sort(map { "$prefix$_" } + grep(/^[0-9]+$/, + slurp_dir($node->data_dir . "/$prefix")))], + [sort split /\n/, $filepaths_referenced], + $test_name); + return; +} + +# Wrapper routine tunable for wal_level. +sub run_wal_optimize +{ + my $wal_level = shift; + + # Primary needs to have wal_level = minimal here + my $node = get_new_node("node_$wal_level"); + $node->init; + $node->append_conf('postgresql.conf', qq( +wal_level = $wal_level +max_prepared_transactions = 1 +)); + $node->start; + + # Setup + my $tablespace_dir = $node->basedir . '/tablespace_other'; + mkdir ($tablespace_dir); + $tablespace_dir = TestLib::perl2host($tablespace_dir); + $node->safe_psql('postgres', + "CREATE TABLESPACE other LOCATION '$tablespace_dir';"); + + # Test direct truncation optimization. No tuples + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test1 (id serial PRIMA
Performance improvement of WAL writing?
Dear Hackers. Currently, the XLogWrite function writes in 8 kB (or 16 kB) units regardless of the size of the new record. For example, even if a new record is only 300 bytes, pg_pwrite is called to write data in 8 kB units (if it cannot be written within one page, 16 kB are written). Let's look at the worst case. 1) LogwrtResult.Flush is at position 8100. 2) And the new record size is only 100 bytes. In this case, pg_pwrite is called and writes 16 kB to update only 100 bytes. It is a rare case, but I think there is overhead in pg_pwrite on some systems. # For systems that often update one record in one transaction. So what about modifying the XLogWrite function to write only the size that actually needs to be recorded? Could this idea benefit WAL writing performance? If it is worth improving, I would like to work on the modification. What do you think? Best Regards. Moon.
Re: Re: Email to hackers for test coverage
On Wed, 28 Aug 2019 11:30:23 +0800 mich...@paquier.xyz wrote >- numeric is not a test suite designed for sorting, and hijacking it >to do so is not a good approach. Absolutely agreed. We can wait for replies from Robert and Peter G. >- it would be good to get coverage for the two extra code paths while >we are on it with NULL datums. The remaining untested line in ApplySortComparator() can never be reached, I think, and I have explained that in my last mail. Or do I not fully understand what you mean? >- There is no need for two INSERT queries, I am getting the same >coverage with only one. Yes, it is my mistake; the key point is 'ORDER BY n1 DESC', not the 'INSERT'. In fact, the code line can be reached without either of the two INSERTs. -- Movead
Improve error detections in TAP tests by spreading safe_psql
Hi all, This is a follow-up of the discussion which happened here after Tom has committed fb57f40: https://www.postgresql.org/message-id/20190828012439.ga1...@paquier.xyz I have reviewed the TAP tests, and we have much more spots where it is better to use PostgresNode::safe_psql instead PostgresNode::psql so as the test dies immediately if there is a failure on a query which should never fail. Attached are the spots I have found. Any thoughts or opinions? Thanks, -- Michael diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl index 6e2f051187..5f40fcfba6 100644 --- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl +++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl @@ -47,12 +47,12 @@ is($primary->slot($slot_name)->{'slot_type'}, # Generate some WAL. Use --synchronous at the same time to add more # code coverage. Switch to the next segment first so that subsequent # restarts of pg_receivewal will see this segment as full.. -$primary->psql('postgres', 'CREATE TABLE test_table(x integer);'); -$primary->psql('postgres', 'SELECT pg_switch_wal();'); +$primary->safe_psql('postgres', 'CREATE TABLE test_table(x integer);'); +$primary->safe_psql('postgres', 'SELECT pg_switch_wal();'); my $nextlsn = $primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();'); chomp($nextlsn); -$primary->psql('postgres', +$primary->safe_psql('postgres', 'INSERT INTO test_table VALUES (generate_series(1,100));'); # Stream up to the given position. diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl index 99154bcf39..b225b3b4ee 100644 --- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl +++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl @@ -47,8 +47,8 @@ $node->command_ok( my $slot = $node->slot('test'); isnt($slot->{'restart_lsn'}, '', 'restart lsn is defined for new slot'); -$node->psql('postgres', 'CREATE TABLE test_table(x integer)'); -$node->psql('postgres', +$node->safe_psql('postgres', 'CREATE TABLE test_table(x integer)'); +$node->safe_psql('postgres', 'INSERT INTO test_table(x) SELECT y FROM generate_series(1, 10) a(y);'); my $nextlsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()'); diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index f064ea1063..3e7412dbbf 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -3282,7 +3282,7 @@ if ($collation_check_stderr !~ /ERROR: /) } # Create a second database for certain tests to work against -$node->psql('postgres', 'create database regress_pg_dump_test;'); +$node->safe_psql('postgres', 'create database regress_pg_dump_test;'); # Start with number of command_fails_like()*2 tests below (each # command_fails_like is actually 2 tests) diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl index 3c743d7d7c..42af151ac9 100644 --- a/src/test/recovery/t/001_stream_rep.pl +++ b/src/test/recovery/t/001_stream_rep.pl @@ -125,7 +125,7 @@ test_target_session_attrs($node_standby_1, $node_master, $node_standby_1, # role. 
note "testing SHOW commands for replication connection"; -$node_master->psql( +$node_master->safe_psql( 'postgres', " CREATE ROLE repl_role REPLICATION LOGIN; GRANT pg_read_all_settings TO repl_role;"); diff --git a/src/test/recovery/t/008_fsm_truncation.pl b/src/test/recovery/t/008_fsm_truncation.pl index ddab464a97..a164b81777 100644 --- a/src/test/recovery/t/008_fsm_truncation.pl +++ b/src/test/recovery/t/008_fsm_truncation.pl @@ -30,7 +30,7 @@ $node_standby->init_from_backup($node_master, 'master_backup', has_streaming => 1); $node_standby->start; -$node_master->psql( +$node_master->safe_psql( 'postgres', qq{ create table testtab (a int, b char(100)); insert into testtab select generate_series(1,1000), 'foo'; @@ -39,7 +39,7 @@ delete from testtab where ctid > '(8,0)'; }); # Take a lock on the table to prevent following vacuum from truncating it -$node_master->psql( +$node_master->safe_psql( 'postgres', qq{ begin; lock table testtab in row share mode; @@ -47,14 +47,14 @@ prepare transaction 'p1'; }); # Vacuum, update FSM without truncation -$node_master->psql('postgres', 'vacuum verbose testtab'); +$node_master->safe_psql('postgres', 'vacuum verbose testtab'); # Force a checkpoint -$node_master->psql('postgres', 'checkpoint'); +$node_master->safe_psql('postgres', 'checkpoint'); # Now do some more insert/deletes, another vacuum to ensure full-page writes # are done -$node_master->psql( +$node_master->safe_psql( 'postgres', qq{ insert into testtab select generate_series(1,1000), 'foo'; delete from testtab where ctid > '(8,0)'; @@ -62,16 +62,16 @@ vacuum verbose testtab; }); # Ensure all buffers are now clean on the standby -$node_standby->psql('postgres', 'checkpoint'); +$node_standby->safe_psql('postgres', 'checkpoint'); # Release the lock, vacuum again which shoul
Re: refactoring - share str2*int64 functions
Bonjour Michaël, So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and then possibly generalize to lower sizes, 32, 16, depending on what is actually needed. I am interested in this patch, and the next commit fest is close by. Are you working on an updated version? If not, do you mind if I work on it and post a new version by the beginning of next week based on all the feedback gathered? Here is an updated patch for the u?int64 conversion functions. I have taken the liberty to optimize the existing int64 function by removing spurious tests. I have not created uint64 specific inlined overflow functions. If it looks ok, a separate patch could address the 32 & 16 versions. -- Fabien.diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 221b47298c..28891ba337 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -79,6 +79,7 @@ #include "utils/builtins.h" #include "utils/hashutils.h" #include "utils/memutils.h" +#include "common/string.h" PG_MODULE_MAGIC; @@ -1022,7 +1023,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString, /* parse command tag to retrieve the number of affected rows. */ if (completionTag && strncmp(completionTag, "COPY ", 5) == 0) - rows = pg_strtouint64(completionTag + 5, NULL, 10); + (void) pg_strtouint64(completionTag + 5, &rows); else rows = 0; diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c index 2c0ae395ba..8e75d52b06 100644 --- a/src/backend/executor/spi.c +++ b/src/backend/executor/spi.c @@ -21,6 +21,7 @@ #include "catalog/heap.h" #include "catalog/pg_type.h" #include "commands/trigger.h" +#include "common/string.h" #include "executor/executor.h" #include "executor/spi_priv.h" #include "miscadmin.h" @@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI, CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt; if (strncmp(completionTag, "SELECT ", 7) == 0) - _SPI_current->processed = - pg_strtouint64(completionTag + 7, NULL, 10); + (void) pg_strtouint64(completionTag + 7, &_SPI_current->processed); else { /* @@ -2361,8 +2361,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI, else if (IsA(stmt->utilityStmt, CopyStmt)) { Assert(strncmp(completionTag, "COPY ", 5) == 0); - _SPI_current->processed = pg_strtouint64(completionTag + 5, - NULL, 10); + (void) pg_strtouint64(completionTag + 5, &_SPI_current->processed); } } diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c index 764e3bb90c..17cde42b4d 100644 --- a/src/backend/nodes/readfuncs.c +++ b/src/backend/nodes/readfuncs.c @@ -34,6 +34,7 @@ #include "fmgr.h" #include "miscadmin.h" +#include "common/string.h" #include "nodes/extensible.h" #include "nodes/parsenodes.h" #include "nodes/plannodes.h" @@ -80,7 +81,7 @@ #define READ_UINT64_FIELD(fldname) \ token = pg_strtok(&length); /* skip :fldname */ \ token = pg_strtok(&length); /* get field value */ \ - local_node->fldname = pg_strtouint64(token, NULL, 10) + (void) pg_strtouint64(token, &local_node->fldname) /* Read a long integer field (anything written as ":fldname %ld") */ #define READ_LONG_FIELD(fldname) \ diff --git a/src/backend/parser/parse_node.c b/src/backend/parser/parse_node.c index 1baf7ef31f..4595c35875 100644 --- a/src/backend/parser/parse_node.c +++ b/src/backend/parser/parse_node.c @@ -25,10 +25,10 @@ #include "parser/parse_expr.h" #include "parser/parse_relation.h" #include "utils/builtins.h" -#include 
"utils/int8.h" #include "utils/lsyscache.h" #include "utils/syscache.h" #include "utils/varbit.h" +#include "common/string.h" static void pcb_error_callback(void *arg); @@ -496,7 +496,7 @@ make_const(ParseState *pstate, Value *value, int location) case T_Float: /* could be an oversize integer as well as a float ... */ - if (scanint8(strVal(value), true, &val64)) + if (pg_strtoint64(strVal(value), &val64) == strtoint_ok) { /* * It might actually fit in int32. Probably only INT_MIN can diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index 9c08757fca..b364a41fb5 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -12,6 +12,7 @@ */ #include "postgres.h" +#include "common/string.h" #include "catalog/pg_publication.h" #include "fmgr.h" @@ -22,7 +23,6 @@ #include "replication/pgoutput.h" #include "utils/inval.h" -#include "utils/int8.h" #include "utils/memutils.h" #include "utils/syscache.h" #include "utils/varlena.h" @@ -113,7 +113,7 @@ parse_output_parameters(List *options, uint32 *protocol_version, errmsg("conflicting or redundant options"))); protocol_version_given = true; - if (!