RE: Speedup of relation deletes during recovery
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> When multiple relations are deleted in the same transaction, the files of
> those relations are deleted by one call to smgrdounlinkall(), which leads
> to scanning the whole shared_buffers only one time. OTOH, during recovery,
> smgrdounlink() (not smgrdounlinkall()) is called for each file to delete,
> which leads to scanning shared_buffers multiple times.
> Obviously this can increase the WAL replay time very much, especially
> when shared_buffers is huge.
>
> To alleviate this situation, I'd like to propose to change the recovery
> so that it also calls smgrdounlinkall() only one time to delete multiple
> relation files. Patch attached. Thoughts?

Nice catch. As Horiguchi-san and Michael already commented, the patch looks good.

As a related improvement, the following proposal is effective for shortening the WAL replay time of DROP TABLE (and possibly TRUNCATE as well). How should we proceed with this?

https://www.postgresql.org/message-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23

Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE scans the shared buffers once for each table, TRUNCATE does that once for each fork, resulting in three scans per table. How about changing the following functions

  smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
  DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum, BlockNumber firstDelBlock)

to

  smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks)
  DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, BlockNumber *firstDelBlock, int nforks)

so that the scan is performed only once per table? (A sketch of how a caller would use this is below.)

If there doesn't seem to be a problem, I think I'll submit the patch next month. I'd also welcome it if anyone else could do that.

Regards
Takayuki Tsunakawa
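P.S. Here is a rough sketch (mine, not the actual patch) of how a caller such as RelationTruncate() could use the proposed interface. MAX_FORKNUM and the fork constants come from relpath.h; the new_*_blocks variables are placeholders for the post-truncation sizes:

--
	ForkNumber	forks[MAX_FORKNUM + 1];
	BlockNumber	blocks[MAX_FORKNUM + 1];
	int			nforks = 0;

	forks[nforks] = MAIN_FORKNUM;
	blocks[nforks++] = new_main_blocks;

	if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
	{
		forks[nforks] = FSM_FORKNUM;
		blocks[nforks++] = new_fsm_blocks;
	}
	if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
	{
		forks[nforks] = VISIBILITYMAP_FORKNUM;
		blocks[nforks++] = new_vm_blocks;
	}

	/* one call, so DropRelFileNodeBuffers scans shared_buffers only once */
	smgrtruncate(rel->rd_smgr, forks, blocks, nforks);
--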
RE: Speedup of relation deletes during recovery
From: Jerry Sievers [mailto:gsiever...@comcast.net]
> Wonder if this is the case for streaming standbys replaying truncates
> also?

Yes. As I wrote in my previous mail, TRUNCATE is worse than DROP TABLE.

Regards
Takayuki Tsunakawa
RE: Built-in connection pooling
From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> In Oracle, for example, you can create dedicated and non-dedicated backends.
> I wonder why we do not want to have something similar in Postgres.

Yes, I want it, too. In addition to dedicated and shared server processes, Oracle provides Database Resident Connection Pooling (DRCP). I guessed you were inspired by this.

https://docs.oracle.com/cd/B28359_01/server.111/b28310/manproc002.htm#ADMIN12348

BTW, you are doing various great work -- autoprepare, multithreaded Postgres, built-in connection pooling, etc. -- aren't you? Are you doing all of these alone?

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> a very long time before accessing to the relation. Which would cause the
> response-time spikes, for example, I observed such spikes several times on
> the server with shared_buffers = 300GB while running the benchmark.

FYI, a long transaction took about 900 ms, while the average transaction response time was 150 ms or so. (I'm working with Fujii-san on this performance benchmark.)

> Therefore, I'm thinking to propose $SUBJECT and enable it to avoid such
> spikes for that relation.

How about an integer variable to replace the following?

  #define REL_TRUNCATE_FRACTION	16

> Also, first of all, if other transactions need to extend the relation
> (i.e., need new pages) as soon as VACUUM truncates the empty pages at the
> end, that truncation would not be so helpful for performance. In this case,
> the truncation and extension of the relation are unnecessarily repeated,
> which would decrease the performance. So, to alleviate this situation,
> $SUBJECT is useful, I think.

I wonder if fillfactor=50 would alleviate this situation.

Regards
Takayuki Tsunakawa
RE: Speedup of relation deletes during recovery
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> Yeah, it's worth working on this problem. To decrease the number of scans
> of shared_buffers, you would need to change the order of truncations of
> files and WAL logging. In RelationTruncate(), currently WAL is logged after
> FSM and VM are truncated. IOW, with the patch, FSM and VM would need to be
> truncated after WAL logging. You would need to check whether this
> reordering is valid.

Sure, thank you for the advice.

Takayuki Tsunakawa
RE: [bug fix] Produce a crash dump before main() on Windows
From: Michael Paquier [mailto:mich...@paquier.xyz]
> The last patch submitted is here:
> https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1F8ECF73@G01JPEXMBYT05
> And based on the code paths it touches I would recommend to not play with
> REL_12_STABLE at this stage.

I'm reading the thread to see what I should do... Anyway, I don't think we should rush to include this fix in PG 12, either. But at the same time, I don't think it would be dangerous to put it in PG 12.

Regards
Takayuki Tsunakawa
RE: [bug fix] Produce a crash dump before main() on Windows
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Alvaro Herrera from 2ndQuadrant writes:
> > Well, IMV this is a backpatchable, localized bug fix.
>
> I dunno. This thread is approaching two years old, and a quick
> review shows few signs that we actually have any consensus on
> making behavioral changes here. If there is any consensus,
> it's that the SetErrorMode calls should depend on checking
> pgwin32_is_service(); but we haven't even seen a patch that
> does that, let alone tested it. I think we're way too close
> to beta4 wrap to be considering pushing such a patch the moment
> it appears.

We don't have to call pgwin32_is_service() to determine whether we call SetErrorMode() in postmaster.c because:

* The dialog box doesn't appear under a Windows service, whether PostgreSQL itself starts as a service or another (server) application runs as a service and does "pg_ctl start."

* It's OK for the dialog box to appear when the user runs "pg_ctl start" on the command prompt. That usage is interactive and should not be used for production (the postgres processes vanish when the user mistakenly presses Ctrl+C or closes the command prompt). Even now, the dialog box pops up if the postmaster crashes before main().

> BTW, I also violently dislike taking Windows-specific stuff out of
> startup_hacks where it belongs and putting it into main() where
> it doesn't. I think the entire Windows bit in front of get_progname
> should migrate into startup_hacks. Surely the odds of failure
> inside get_progname are not worth worrying about --- they seem
> significantly less than the odds of failure inside
> pgwin32_is_service(), for instance.

Agreed. Fixed. I changed the CF status to "needs review."

Regards
Takayuki Tsunakawa

Attachment: crash_dump_before_main_v3.patch
RE: Libpq support to connect to standby server as priority
From: Alvaro Herrera from 2ndQuadrant [mailto:alvhe...@alvh.no-ip.org]
> Testing protocol version 2 is difficult! Almost every single test fails
> because of error messages being reported differently; and streaming
> replication (incl. pg_basebackup) doesn't work at all because it's not
> possible to establish replication connections. Manual inspection shows
> it behaves correctly.

Yeah, the code path for protocol v2 is sometimes annoying. I wish v2 support would be dropped soon. I know there was a discussion on it, but I didn't track the conclusion.

> Remaining patchset attached (per my count it's v13 of your patchset.

I'm afraid those weren't attached.

> think we should merge one half of it with each of the other two patches
> where the changes are introduced (0003 and 0004). I'm not convinced
> that we need 0004-0006 to be separate commits.

It was hard to review those separate patches, so I think it's better to merge them. OTOH, I can understand Haribabu-san's idea that the community may not accept the latter patches, e.g. accept only 0001-0005.

Regards
Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> SIGTERM, which needs to be adjusted. For another, its
> SIGQUIT handler does exit(1) not _exit(2), which seems rather
> dubious ... should we make it more like the rest? I think
> the reasoning there might've been that if some DBA decides to
> SIGQUIT the archiver, we don't need to force a database-wide
> reset; but why exactly should we tolerate that?

The postmaster doesn't distinguish return codes other than 0 for the archiver, and just restarts the archiver unless the postmaster is shutting down. So we can use _exit(2) like the other children (a minimal sketch is below).

Can't we use SIGKILL instead of SIGINT/SIGTERM to stop the grandchildren, just in case they are slow to respond to or ignore SIGINT/SIGTERM? That matches the idea of pg_ctl's immediate shutdown. (Windows cannot stop grandchildren because kill() in src/port/kill.c doesn't support process groups... That's a separate topic.)

Regards
Takayuki Tsunakawa
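P.S. A minimal sketch of what I mean by "use _exit(2) like the other children" (handler name follows pgarch.c; the real code may differ):

--
/* SIGQUIT handler for the archiver process */
static void
pgarch_exit(SIGNAL_ARGS)
{
	/*
	 * Exit without running atexit callbacks, like the other postmaster
	 * children.  Use code 2 only for consistency with them; the postmaster
	 * treats any non-zero archiver exit the same way.
	 */
	_exit(2);
}
--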
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: David Steele [mailto:da...@pgmasters.net]
> > Can't we use SIGKILL instead of SIGINT/SIGTERM to stop the grandchildren,
> > just in case they are slow to respond to or ignore SIGINT/SIGTERM? That
> > matches the idea of pg_ctl's immediate shutdown.
>
> -1, at least not immediately. Archivers can be complex processes and
> they should be given the chance to do a graceful shutdown.

I agree that the user's archiver program should get a chance to stop gracefully during a smart or fast shutdown. But I think in an immediate shutdown, everything should stop immediately. That's what I expect from the word "immediate."

If the grandchildren left running don't disturb the cleanup of PostgreSQL's resources (shared memory, file/directory access, etc.) or the restart of PostgreSQL, we may well be able to just advise the grandchildren to stop immediately with SIGINT/SIGTERM. However, for example, in the failover of shared-disk HA clustering, when the clustering software stops PostgreSQL with "pg_ctl stop -m immediate" and then tries to unmount the file systems for $PGDATA and archived WAL, the unmount may take time or fail due to the access from PostgreSQL's grandchildren.

Regards
Takayuki Tsunakawa
RE: [bug fix] Produce a crash dump before main() on Windows
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Imagine an application which relies on Postgres, still does *not* start
> it as a service but uses "pg_ctl start" automatically. This could be
> triggered as part of another service startup which calls say system(),
> or as another script. Wouldn't that potentially cause a freeze?

Do you mean by "another service startup":

  another service (Windows service) -> an application (not a Windows service) -> pg_ctl start

Then pg_ctl runs under the Windows service environment, and the popup won't appear.

> I am also not sure that I quite understand why a
> popup would never show up if a different service startup triggers pg_ctl
> start by itself that fails.

I experimented with a sample program that I showed in this thread. A program invoked by a Windows service also runs in the service.

Regards
Takayuki Tsunakawa
[bug fix??] Fishy code in tts_virtual_copyslot()
Hello,

In the following code in execTuples.c, shouldn't srcdesc point to the source slot's tuple descriptor? The attached fix passes make check. What kind of failure could this cause?

BTW, I thought that in the PostgreSQL coding convention, local variables should be defined at the top of blocks, but this function writes "for (int natt ...". I didn't modify it because many other source files also write in that way.

--
static void
tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	TupleDesc	srcdesc = dstslot->tts_tupleDescriptor;

	Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);

	tts_virtual_clear(dstslot);

	slot_getallattrs(srcslot);

	for (int natt = 0; natt < srcdesc->natts; natt++)
	{
		dstslot->tts_values[natt] = srcslot->tts_values[natt];
		dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
	}
--

Regards
Takayuki Tsunakawa

Attachment: virtslot_tiny_fix.patch
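P.S. For clarity, the attached fix is presumably just this one-liner -- take the descriptor from the source slot instead of the destination slot:

--
-	TupleDesc	srcdesc = dstslot->tts_tupleDescriptor;
+	TupleDesc	srcdesc = srcslot->tts_tupleDescriptor;
--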
RE: [bug fix??] Fishy code in tts_virtual_copyslot()
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I temporarily changed the Assert to be "==" rather than "<=", and
> it still passed check-world, so evidently we are not testing any
> cases where the descriptors are of different lengths. This explains
> the lack of symptoms. It's still a bug though, so pushed.

Thank you for committing.

> > BTW, I thought that in PostgreSQL coding convention, local variables
> > should be defined at the top of blocks, but this function writes
> > "for (int natt ...".
>
> Yeah, we've agreed to join the 21st century to the extent of allowing
> local for-loop variables.

That's good news. It will make coding a bit more comfortable.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> On 2019-Sep-03, Tsunakawa, Takayuki wrote:
> > I don't think it's rejected. It would be a pity (mottainai) to refuse
> > this, because it provides significant speedup despite its simple
> > modification.
>
> I don't necessarily disagree with your argumentation, but Travis is
> complaining thusly:

I tried to revise David's latest patch (v8) and address Tom's comments in his last mail, but I'm a bit at a loss.

First, to accurately count the maximum number of acquired locks in a transaction, we need to track the maximum number of entries in the hash table and make it available via a new function like hash_get_max_entries(). However, to cover the shared partitioned hash table (which is not necessary for LockMethodLocalHash), we would have to add a spin lock in hashhdr and lock/unlock it when entering and removing entries in the hash table. That spoils the effort to decrease contention with hashhdr->freelists[].mutex. Do we want to track the maximum number of acquired locks in a global variable in lock.c, not in the hash table?

Second, I couldn't understand the comment about the fill factor well. I can understand that it's not correct to compare the number of hash buckets and the number of locks. But what can we do? I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's approach is based on the database textbook and seems intuitive. So I attached the rebased version.

Regards
Takayuki Tsunakawa

Attachment: faster-locallock-scan_v3.patch
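P.S. A rough sketch of the "global variable in lock.c" idea mentioned above (names are mine and purely illustrative). Since LockMethodLocalHash is backend-local, no spinlock is needed:

--
/* in lock.c */
static long	LocalLocksHeld = 0;			/* current number of LOCALLOCK entries */
static long	MaxLocalLocksInXact = 0;	/* high-water mark for this transaction */

/* where a new LOCALLOCK entry is entered into LockMethodLocalHash */
if (++LocalLocksHeld > MaxLocalLocksInXact)
	MaxLocalLocksInXact = LocalLocksHeld;

/* in RemoveLocalLock() */
LocalLocksHeld--;

/*
 * At the end of LockReleaseAll(): decide whether to rebuild the local hash
 * table based on MaxLocalLocksInXact, then reset it for the next transaction.
 */
--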
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz]
> This makes the test page-size sensitive. While we don't ensure that tests
> can be run with different page sizes, we should make a maximum effort to
> keep the tests compatible if that's easy enough. In this case you could
> just use > 0 as the base comparison. I can fix that by myself, so no need
> to send a new version.

Good idea. Done.

> Should we also document that the parameter is effective for autovacuum?
> The name can lead to confusion regarding that.

I'm not sure about the need, because autovacuum is just an automatic execution of vacuum, and existing vacuum_xxx parameters also apply to autovacuum. But being specific is good anyway, so I added a reference to autovacuum in the description.

> Also, shouldn't the relopt check happen in should_attempt_truncation()?
> It seems to me that if we use this routine somewhere else then it should
> be controlled by the option.

That makes sense. Done.

> At the same time, we also have REL_TRUNCATE_FRACTION and
> REL_TRUNCATE_MINIMUM which could be made equally user-tunable.
> That's more difficult to control, still why don't we also consider this
> part?

I thought of it, too. But I didn't have a good idea on how to explain those parameters, so I'd like to handle that separately.

> Another thing that seems worth thinking about is a system-level GUC, and
> an option in the VACUUM command to control if truncation should happen or
> not. We have a lot of infrastructure to control such options between vacuum
> and autovacuum, so it could be a waste to not consider potential synergies.

Being able to specify this parameter in postgresql.conf and SET (especially ALTER DATABASE/USER to target specific databases/applications) might be useful, but I'm not sure... I'm less confident about whether the VACUUM command should be able to specify this, because it is a property of a table under a specific workload, not a changeable property of each VACUUM run. Anyway, I expect it won't be difficult to add such configurability without contradicting the design, so I'm inclined to handle it separately.

From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> Yeah, that would work. Or it's kind of hacky, but rolling back the
> insertion instead of INSERT and DELETE might also work.

That's good, because it allows us to keep running the reloptions test in parallel with other tests. Done.

Regards
Takayuki Tsunakawa

Attachment: disable-vacuum-truncation_v4.patch
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think that a VACUUM option would be out of place, but a GUC
> sounds like an attractive nuisance to me. It will encourage people to
> just flip it blindly instead of considering the particular cases where
> they need that behavior, and I think chances are good that most people
> who do that will end up being sad.

Ouch, I sent my previous mail before reading this. I can understand that it may be cumbersome to identify and specify each table, so I tend to agree with a parameter in postgresql.conf, made USERSET so that ALTER DATABASE/USER SET can tune specific databases and applications. But should the vacuuming of system catalogs also follow this setting?

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com] > Attached are the updated patches. Thanks, all look fixed. > The target_server_type option yet to be implemented. Please let me review once more and proceed to testing when the above is added, to make sure the final code looks good. I'd like to see how complex the if conditions in multiple places would be after adding target_server_type, and consider whether we can simplify them together with you. Even now, the if conditions seem complicated to me... that's probably due to the existence of read_write_host_index. Regards Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> I measured the memory context accounting overhead using Tomas's tool
> palloc_bench, which he made a while ago in the similar discussion.
> https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz
>
> This tool is a little bit outdated so I fixed it, but basically I followed
> him. Things I did:
> - make one MemoryContext
> - run both palloc() and pfree() for a 32kB area 1,000,000 times
> - measure this time
>
> The result shows that master is 30 times faster than the patched one.
> So as Andres mentioned in the upper thread it seems it has overhead.
>
> [master (without v15 patch)]
> 61.52 ms
> 60.96 ms
> 61.40 ms
> 61.42 ms
> 61.14 ms
>
> [with v15 patch]
> 1838.02 ms
> 1754.84 ms
> 1755.83 ms
> 1789.69 ms
> 1789.44 ms

I'm afraid the measurement is not correct. First, the older discussion below shows that the accounting overhead is much, much smaller, even with a more complex accounting.

9.5: Better memory accounting, towards memory-bounded HashAgg
https://www.postgresql.org/message-id/flat/1407012053.15301.53.camel%40jeff-desktop

Second, allocation/free of memory larger than 8 kB calls malloc()/free(). I guess the accounting overhead is likely to be hidden under the overhead of malloc() and free(). What we'd like to know is the overhead when malloc() and free() are not called.

And are you sure you didn't enable assert checking?

Regards
Takayuki Tsunakawa
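P.S. A sketch of the measurement I have in mind: use a chunk size well below the 8 kB limit so that palloc/pfree are served from the context's own blocks and malloc()/free() stay out of the picture, leaving only the accounting code on the hot path.

--
	MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
											  "accounting bench",
											  ALLOCSET_DEFAULT_SIZES);

	for (int i = 0; i < 1000000; i++)
	{
		void	   *p = MemoryContextAlloc(cxt, 1024);	/* stays inside the context */

		pfree(p);
	}

	MemoryContextDelete(cxt);
--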
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz]
> So could you consider adding an option for the VACUUM command as well
> as vacuumdb? The interaction with the current patch is that you need to
> define the behavior at the beginning of vacuum for a given heap, instead
> of reading the parameter at the time the truncation happens, and give

I'm not confident whether this is the same as the above; I imagined this:

* Add a new USERSET GUC vacuum_shrink_table = {on | off}, on by default. This follows the naming style "verb_object" like log_connections and enable_xxx. We may want to use enable_vacuum_shrink or something like that, but enable_xxx seems to be used solely for planner control. Plus, vacuum-related parameters seem to start with vacuum_.

* Give priority to the reloption, because it's targeted at a particular table. If the reloption is not set, the GUC takes effect.

* As a consequence, the user can change the behavior of the VACUUM command by SETting the GUC in the same session in advance, when the reloption is not set. If the reloption is set, the user can ALTER TABLE SET, VACUUM, and ALTER TABLE again to restore the table's setting. But I don't think this use case (changing whether to shrink per VACUUM command execution) is necessary; it would merely be possible.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> Robert used the phrase "attractive nuisance", which maybe sounds like a
> good thing to have to a non native speaker, but it actually isn't -- he
> was saying we should avoid a GUC at all, and I can see the reason for
> that. I think we should have a VACUUM option and a reloption, but no
> GUC.

Uh, thanks. I've just realized I didn't know the meaning of "nuisance," so I looked it up in the dictionary. A nuisance is something like a troublemaker...

Why do you think it's better for the VACUUM command to have the option? I think it's a table property whose value is determined based on the application workload, not per VACUUM execution. Rather, I think a GUC is more useful to determine the behavior of the entire database and/or application. If we want to change a given execution of VACUUM, then we can ALTER TABLE SET, VACUUM, and ALTER TABLE SET back.

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi/出利葉 健
> [Size=800, iter=1,000,000]
> Master |15.763
> Patched|16.262 (+3%)
>
> [Size=32768, iter=1,000,000]
> Master |61.3076
> Patched|62.9566 (+2%)

What's the unit, seconds or milliseconds? And why does the number of digits to the right of the decimal point differ?

Is the measurement correct? I'm wondering because the difference is larger in the latter case. Isn't the accounting processing almost the same in both cases?

* former: 16.262 - 15.763 = 0.499
* latter: 62.9566 - 61.3076 = 1.649

> At least compared to the previous HashAgg version, the overhead is smaller.
> It has some overhead, but the increase is only a little bit, 2 or 3%?

I think the overhead is sufficiently small. It may get even smaller with a trivial tweak.

You added the new member usedspace at the end of MemoryContextData. The original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 64 bytes. So, the new member will be on a separate cache line. Try putting usedspace before the name member.

Regards
Takayuki Tsunakawa
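P.S. Roughly what I mean (the field list is abbreviated and from memory -- check memnodes.h for the exact members of your branch):

--
typedef struct MemoryContextData
{
	NodeTag		type;				/* touched on every palloc/pfree */
	bool		isReset;
	bool		allowInCritSection;
	Size		usedspace;			/* new member: keep it in the first
									 * 64-byte cache line, before "name" */
	const MemoryContextMethods *methods;	/* touched on every palloc/pfree */
	MemoryContext parent;
	MemoryContext firstchild;
	MemoryContext prevchild;
	MemoryContext nextchild;
	const char *name;				/* rarely used (debugging, stats) */
} MemoryContextData;
--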
RE: Timeout parameters
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr]
> > > If the user reconnects, eg "\c db", the setting is lost. The
> > > re-connection handling should probably take care of this parameter,
> > > and maybe others.
> > I think your opinion is reasonable, but it seems not in this thread.
>
> I think that your patch is responsible for the added option at least.
> I agree that the underlying issue that other parameters should probably
> also be reused, which would be a bug fix, does not belong to this thread.

This doesn't seem to be a bug. \connect just establishes a new connection; it does not reuse the previous settings for most connection parameters. As the examples in the following manual page show, the user needs to specify the necessary connection parameters.

https://www.postgresql.org/docs/devel/app-psql.html

  => \c service=foo
  => \c "host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable"

But I'm afraid the description of \connect may not be clear enough about connection parameters, and that may cause users to expect the reuse of all connection parameter settings. Anyway, I think this would be an improvement of psql's documentation or a new feature for psql. What do you think?

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> The first thing I notice about the socket_timeout patch is that the
> documentation is definitely wrong:

Agreed. I suppose the description should be clearer about:

* the purpose and the situations this timeout will help in: it is not for canceling a long-running query or recovering from a network failure.

* the relationship with other timeout parameters: e.g., socket_timeout > connect_timeout, socket_timeout > statement_timeout, socket_timeout > TCP keepalive/user timeout.

> Actually, I think that socket_timeout_v8.patch is a bad idea and
> should be rejected, not because it doesn't send a cancel to the server
> but just because it's not the right way of solving either of the
> problems that it purports to solve. If you want a timeout for
> queries, you need that to be administered by the server, which knows
> when it starts and finishes executing a query; you cannot substitute a
> client-side judgement based on the last time we received a byte of
> data. Maybe a client-side judgement would work if the timer were
> armed when we send a Query or Execute message and disarmed when we
> receive ReadyForQuery. And, if you want a network problem detector,
> then you should use keepalives and perhaps the TCP_USER_TIMEOUT stuff,
> so that you don't kill perfectly good connections that just happen to
> be idle.

I think the purpose of socket_timeout is to avoid an infinite or unduly long wait and return a response to users, where the other existing timeout parameters wouldn't help. For example, the OS's process scheduling or paging/swapping problems might cause long waits before postgres gets control (e.g. due to Transparent HugePage (THP)). Within postgres, the unfair lwlock can cause unexpectedly long waits (I experienced dozens of seconds per wait on ProcArrayLock, which was unbelievable). Someone may say that those should be fixed or improved instead of introducing this parameter, but I think it's good to have this as a stop-gap measure. In other words, we can suggest setting this parameter when the user asks "How can I return control to the end user in any situation?"

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> But that's not what it will do. As long as the server continues to
> dribble out protocol messages from time to time, the timeout will
> never fire no matter how much time passes. I saw a system once where
> every 8kB read took many seconds to complete; such a system could
> dribble out sequential scan results over an arbitrarily long period of
> time without ever tripping the timeout.

I understood that the example is about a SELECT that returns multiple rows. If so, statement_timeout should handle it, shouldn't it?

> If you really want to return
> control to the user in any situation, what you can do is use the libpq
> APIs in asynchronous mode which, barring certain limitations of the
> current implementation, will actually allow you to maintain control
> over the connection at all times.

Maybe. But users are often not in a situation to modify the application to use the libpq asynchronous APIs.

> I think the use case for a timeout that has both false positives (i.e.
> it will fire even when there's no problem, as when the connection is
> legitimately idle) and false negatives (i.e. it will fail to trigger
> when there is a problem, as when there are periodic notices or
> notifies from the server connection) is extremely limited if not
> nonexistent, and I think the potential for users to be confused is
> really high.

My understanding is that the false positive case doesn't occur, because libpq doesn't wait on the socket while the client is idle and not communicating an SQL request/response. As for the false negative case, resetting the timer upon receipt of notices or notifies is good, because they show that the database server is functioning. socket_timeout is not a mechanism to precisely limit the duration of a query request/response. It is kind of a stop-gap, last resort to ensure control returns within a reasonable amount of time, rather than minutes or hours.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> If so, in turn the socket_timeout doesn't work as expected? I
> understand that what is proposed here is to disconnect after that
> time of waiting for *the first tuple* of a query, regardless of
> whether it is a long query or a network failure. One of the problems
> raised here is that the socket_timeout patch doesn't work that way?

No, the proposal is to set the timeout for every socket read/write, like PgJDBC. It is not restricted to SQL command execution; it applies to any communication with the server that could block the client.

> I can't imagine the cases where applications cannot be rebuilt for
> some reason. (E.g. the source code has been lost or the source code
> no longer builds on a modern environment..)
>
> If an application is designed to have such a requirement, mustn't
> it have implemented that by itself by any means? For example, an
> application on JDBC can be designed to kill a querying thread
> that takes longer than a threshold.

Any paid or free applications whose source code is closed. Also, ordinary users don't want to modify the psql and pgAdmin source code by themselves, do they?

> Doesn't TCP_KEEPALIVE work that way?
>
> statement_timeout works on a live connection, TCP_USER_TIMEOUT
> works for an already failed network, but if the network fails
> after a backend already sent the ack for a query request, it
> doesn't work. TCP_KEEPALIVE would work for the case where no
> packet reaches either side for a certain time. Are there any
> other situations to save?

For example, OS issues such as abnormally (buggy) slow process scheduling or paging/swapping that prevent control from being passed to postgres. Or abnormally long waits on lwlocks in postgres; statement_timeout doesn't take effect while waiting on an lwlock. I have experienced both. And any other issues in the server that we haven't experienced and cannot predict.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr]
> I think that the typical use-case of \c is to connect to another database
> on the same host, at least that's what I do pretty often. The natural
> expectation is that the same "other" connection parameters are used,
> otherwise it does not make much sense, and there is already a whole logic
> of reusing the previous settings in psql, at least wrt describing the
> target (host, port, user…).

I agree about the natural expectation.

> > Anyway, I think this would be an improvement for psql's documentation
> > or a new feature for psql. What do you think?
>
> I think that we should fix the behavior rather than document the current
> weirdness. I do not think that we should introduce more weirdness.

Do you think all connection parameter settings should be reused by default, or only when -reuse-previous=on is specified? And do you think this is a bug to be backported to previous versions, or a new feature? I think I'll make a patch separately from this thread, if time permits.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu]
> > For example, OS issues such as abnormally (buggy) slow process scheduling
> > or paging/swapping that prevent control from being passed to postgres. Or,
> > abnormally long waits on lwlocks in postgres. statement_timeout doesn't
> > take effect while waiting on a lwlock. I have experienced both. And, any
> > other issues in the server that we haven't experienced and not predictable.
>
> For me, all the problems mentioned by Takayuki Tsunakawa look like a lack
> of design in the end-user application or the configuration of the DB
> server. It is not a problem of PostgreSQL.

Certainly, my cases could have been avoided by OS and database configuration. But we cannot say that all such problems can be avoided by configuration in advance. The point is to provide a method to be ready for any situation.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com] > One other thing -- I looked a bit into the pgsql-jdbc implementation > of a similarly-named option, and it does seem to match what you are > proposing here. I wonder what user experiences with that option have > been like. One case I faintly recall is that some SQL execution got stuck in NFS access in the server, and statement_timeout didn't work (probably due to inability to cancel NFS read/write operations). The user chose to use PgJDBC's socketTimeout. I'm not sure whether this use case is fundamental, because I wonder if NFS mount options could have been the solution. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> Now you might say - what if the server is stopped not because of
> SIGSTOP but because of some other reason, like it's waiting for a
> lock? Well, in that case, the database server is still functioning,
> and you will not want the connection to be terminated if no notifies
> are sent during the lock wait but not terminated if they are sent. At
> least, I can't see why anyone would want that.

Yes, so I think it would be helpful to describe how to set socket_timeout in relation to other timeout parameters -- socket_timeout > statement_timeout > lock_timeout, for example.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu]
> Do you mind me asking you whether you have thought that solving your
> problem can lead to problems in other user applications?
> Let's imagine a possible problem:
> 1. the end-user sets 'socket_timeout' only for the current session
> 2. 'socket_timeout' is much shorter than 'keep_alive', 'tcp_user_timeout'
>    and 'statement_timeout'
> 3. the end-user invokes a 'heavy' query which cannot be processed by the
>    PostgreSQL server within 'socket_timeout'
> 4. if the query fails due to timeout, terminate the session and repeat it
>
> As the query is not cancelled, the PostgreSQL server will process it till
> completion or 'statement_timeout' expiration. In such a case the PostgreSQL
> server can be overloaded by 'unreliable' users, and other end-users will
> not be able to work with the PostgreSQL server as they expect.

Yes, so I think it would be necessary to describe how to set socket_timeout in relation to other timeout parameters -- socket_timeout > statement_timeout -- emphasizing that socket_timeout is not for canceling long-running queries but for returning control to the client.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > In case of failure PQcancel() terminates in 'socket_timeout'. So, control > to the end-user in such a failure situation will be returned in 2 * > 'socket_timeout' interval. It is much better than hanging forever in some > specific cases. Moreover, such solution will not lead to the overloading > of PostgreSQL server by abnormally ignored 'heavy' queries results by > end-users. Oops, unfortunately, PQcancel() does not follow any timeout parameters... It uses a blocking socket. Also, I still don't think it's a good idea to request cancellation. socket_timeout should be sufficiently longer than the usually expected query execution duration. And long-running queries should be handled by statement_timeout which indicates the maximum tolerable query execution duration. For example, if the usually expected query execution time is 100 ms, statement_timeout can be set to 3 seconds and socket_timeout to 5 seconds. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > Based on your comment it seems to me that 'socket_timeout' should be > connected with statement_timeout. I mean that end-user should wait > statement_timeout + 'socket_timeout' for returning control. It looks much > more safer for me. I'm afraid we cannot enforce that relationship programmatically, so I think the documentation should warn that socket_timeout be longer than statement_timeout. Regards Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> Target_session_attrs    Target_server_type
>
> read-write              prefer-slave, slave
> prefer-read             master, slave
> read-only               master, prefer-slave
>
> I know that some of the cases above are possible, like a master server
> that by default accepts read-only sessions. Instead of putting in a check
> to validate which combinations are right, how about allowing the
> combinations, and in case such a combination is not possible -- meaning
> there isn't any server that supports the requirement -- the connection
> fails.
>
> comments?

I think that's OK. To follow the existing naming, it seems better to use "primary" and "standby" instead of master and slave -- primary_conninfo, synchronous_standby_names, hot_standby, max_standby_streaming_delay and such.

> And also as we need to support the new option to connect to servers < 12,
> this option sends the command "select pg_is_in_recovery()" to the server
> to find out whether the server is in recovery mode or not?

The query looks good. OTOH, I think we can return an error when target_server_type is specified against older servers, because the parameter is new, in case the libpq code would get uglier if we supported older servers.

> And also regarding the implementation point of view, the new
> target_server_type option validation is separately handled, meaning the
> check for the required server is handled in a separate switch case; when
> both options are given, first find out the required server for
> target_session_attrs and validate the same again with target_server_type?

Logically, it seems the order should be the reverse: check the server type first, then the session attributes, considering other session attributes in the future.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think so. I think it's just a weirdly-designed parameter
> without a really compelling use case. Enforcing limits on the value
> of the parameter doesn't fix that. Most of the reviewers who have
> opined so far have been somewhere between cautious and negative about
> the value of that parameter, so I think we should just not add it. At
> least for now.

I don't think socket_timeout is so bad. I think Nagaura-san and I presented the use case, giving an answer to every question and concern. OTOH, it may be better to commit the tcp_user_timeout patch once Nagaura-san has refined the documentation, and then continue with socket_timeout.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> Your changes in LOCALLOCK still refer to PGPROC, from your first version
> of the patch.
>
> I think the reordering of struct members could be done as a separate
> preliminary patch.
>
> Some more documentation in the comment before dlist_head LocalLocks to
> explain this whole mechanism would be nice.

Fixed.

> You posted a link to some performance numbers, but I didn't see the test
> setup explained there. I'd like to get some more information on the
> impact of this. Is there an effect with 100 tables, or do you need 10?

Imai-san, can you tell us the test setup?

Regards
Takayuki Tsunakawa

Attachments: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch, 0002-speed-up-LOCALLOCK-scan.patch
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> Fixed.

Rebased on HEAD.

Regards
Takayuki Tsunakawa

Attachments: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch, 0002-speed-up-LOCALLOCK-scan.patch
RE: [survey] New "Stable" QueryId based on normalized query text
From: legrand legrand [mailto:legrand_legr...@hotmail.com]
> There are many projects that use alternate QueryId
> distinct from the famous pg_stat_statements jumbling algorithm.

I'd like to welcome a standard QueryID that DBAs and extension developers can depend on. Are you surveying the needs so that you can develop a QueryID that meets as many needs as possible?

> needs.1: stable accross different databases,

Does this mean different database clusters, not different databases in a single database cluster?

Some more needs I can think of:

needs.5: minimal overhead to calculate
needs.6: doesn't change across database server restarts
needs.7: same value on both the primary and standby?

> norm.9: comments aware

Is this to distinguish queries that have different comments for optimizer hints? If yes, I agree.

Regards
Takayuki Tsunakawa
RE: [survey] New "Stable" QueryId based on normalized query text
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > > needs.1: stable accross different databases,
> >
> > Does this mean different database clusters, not different databases in
> > a single database cluster?
>
> Does this mean you want a different QueryID for the same-looking
> query for another database in the same cluster?

(I'm afraid this question may be targeted at legrand legrand, not me...)

I think the same query text can have either the same or a different QueryID in different databases in the database cluster. Even if the QueryID value is the same, we can use the DatabaseID to choose the desired information.

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Robert Haas [mailto:robertmh...@gmail.com]
> I really dislike having both target_session_attrs and
> target_server_type. It doesn't solve any actual problem. master,
> slave, prefer-slave, or whatever you like could be put in
> target_session_attrs just as easily, and then we wouldn't end up with
> two keywords doing closely related things. 'master' is no more or
> less a server attribute than 'read-write'.

Hmm, that may be OK. At first, I felt it strange to treat the server type (primary or standby) as a session attribute. But we can see the server type as an attribute, in the sense that it is something a session is established for. I'm inclined to agree with:

  target_session_attrs = {any | read-write | read-only | prefer-read | primary | standby | prefer-standby}

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: David Steele [mailto:da...@pgmasters.net]
> This patch appears to have been stalled for a while.
>
> Takayuki -- the ball appears to be in your court. Perhaps it would be
> helpful to summarize what you think are next steps?

disable_index_cleanup is being handled by Sawada-san in another thread. I understand I've reflected all review comments in the latest patch and replied to the opinions/proposals, so the patch status is kept at "needs review." (I hope no new fire will break out...)

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut wrote:
> > Perhaps "speeding up planning with partitions" needs to be accepted first?
>
> Yeah, I think it likely will require that patch to be able to measure
> the gains from this patch.
>
> If planning a SELECT to a partitioned table with a large number of
> partitions using PREPAREd statements, when we attempt the generic plan
> on the 6th execution, it does cause the local lock table to expand to
> fit all the locks for each partition. This does cause the
> LockReleaseAll() to become slow due to the hash_seq_search having to
> skip over many empty buckets. Since generating a custom plan for a
> partitioned table with many partitions is still slow in master, then I
> very much imagine you'll struggle to see the gains brought by this
> patch.

Thank you, David, for explaining. Although I may not fully understand the effect of the "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:

1. SELECT count(*) FROM table; on a table with many partitions. That bloats the local lock hash table (LockMethodLocalHash).
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction.

Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > My understanding of what David wrote is that the slowness of bloated hash > table is hard to notice, because planning itself is pretty slow. With the > "speeding up planning with partitions" patch, planning becomes quite fast, > so the bloated hash table overhead and so your patch's benefit is easier > to notice. This patch is clearly helpful, but it's just hard to notice > it > when the other big bottleneck is standing in the way. Ah, I see. I failed to recognize the simple fact that without your patch, EXECUTE on a table with many partitions is slow due to the custom planning time proportional to the number of partitions. Thanks for waking up my sleeping head! Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > Here a benchmark doing that using pgbench's script weight feature. Wow, I didn't know that pgbench has evolved to have such a convenient feature. Thanks for telling me how to utilize it in testing. PostgreSQL is cool! Regards Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > On Wed, Mar 27, 2019 at 2:30 AM Robert Haas wrote: > > > > On Tue, Mar 26, 2019 at 11:23 AM Masahiko Sawada > wrote: > > > > I don't see a patch with the naming updated, here or there, and I'm > > > > going to be really unhappy if we end up with inconsistent naming > > > > between two patches that do such fundamentally similar things. -1 > > > > from me to committing either one until that inconsistency is resolved. > > > > > > Agreed. I've just submitted the latest version patch that adds > > > INDEX_CLEANUP option and vacuum_index_cleanup reloption. I already > > > mentioned on that thread but I agreed with adding phrase positively > > > than negatively. So if we got consensus on such naming the new options > > > added by this patch could be something like SHRINK option (with > > > true/false) and vacuum_shrink reloption. > > > > No, that's just perpetuating the problem. Then you have an option > > SHRINK here that you set to TRUE to skip something, and an option > > INDEX_CLEANUP over there that you set to FALSE to skip something. > > > > Well, I imagined that both INDEX_CLEANUP option and SHRINK option (or > perhaps TRUNCATE option) should be true by default. If we want to skip > some operation of vacuum we can set each options to false like "VACUUM > (INDEX_CLEANUP false, SHRINK true, VERBOSE true)". I think that > resolves the problem but am I missing something? I almost have the same view as Sawada-san. The reloption vacuum_shrink_enabled is a positive name and follows the naming style of other reloptions. I hope this matches the style you have in mind. Regards Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> While going through the old patch where GUC_REPORT is implemented, Tom
> commented on the logic of sending a signal to all backends to process the
> hot standby exit with SIGHUP. If we add the logic of updating the GUC
> variable value in SIGHUP, we may need to change all the SIGHUP handler
> code paths. It is also possible that there is no need to update the
> variable for processes other than backends.
>
> If we go with adding the new SIGUSR1 type to check and update the GUC
> variable, it can work for most of the backends and background workers.
>
> opinions

SIGUSR1 looks reasonable. We can consider it as notifying that the server status has changed.

I've fully reviewed 0001-0003 and my comments follow. I'll review 0004-0007.

(1) 0001
before issuing the transaction_readonly to find out whether the server is
read-write or not is restuctured under a new

transaction_readonly -> "SHOW transaction_read_only"
restuctured -> restructured

(2) 0001
+succesful connection or failure.
+successful connection; if it returns on, means server

succesful -> successful
means -> it means

(3) 0003
+But for servers version 12 or greater uses the transaction_read_only
+GUC that is reported by the server upon successful connection.

GUC doesn't seem to be a term to be used in the user manual. Instead:

  uses the value of the transaction_read_only configuration parameter that is...

as in:

https://www.postgresql.org/docs/devel/libpq-connect.html

  client_encoding
    This sets the client_encoding configuration parameter for this connection.
  application_name
    Specifies a value for the application_name configuration parameter.

(4) 0003
 	bool		std_strings;	/* standard_conforming_strings */
+	bool		transaction_read_only;	/* session_read_only */

Looking at the comment for std_strings, it's better to change the comment of transaction_read_only to name the backing configuration parameter.

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
I've looked through 0004-0007. I've only found the following:

(5) 0005
With this read-only option type, application can connect to connecting to
a read-only server in the list of hosts, in case if there is any read-only
servers available, the connection attempt fails.

"connecting to" can be removed.
in case if there is any read-only servers -> if there's no read-only server

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Robert Haas [mailto:robertmh...@gmail.com]
> You're both right and I'm wrong.
>
> However, I think it would be better to stick with the term 'truncate'
> which is widely-used already, rather than introducing a new term.

Yeah, I have the same feeling. OTOH, as I mentioned earlier in this thread, "shrink" is used instead of "truncate" in the PostgreSQL documentation, so I chose shrink. To repeat myself, I'm comfortable with either word; I'd like the committer to choose whichever he thinks is better.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> +	if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
> +				   (char *) &timeout, sizeof(timeout)) < 0 && errno != ENOPROTOOPT)
> +	{
> +		char	sebuf[256];
> +
> +		appendPQExpBuffer(&conn->errorMessage,
> +						  libpq_gettext("setsockopt(TCP_USER_TIMEOUT) failed: %s\n"),
>
> I suppose that the reason ENOPROTOOPT is excluded from the error
> condition is that the system call may fail with that errno on
> older kernels, but I don't think that justifies ignoring the
> failure.

I think that's for the case where the module is built on an OS that supports TCP_USER_TIMEOUT (#ifdef TCP_USER_TIMEOUT is true), and the module is then used on an older OS that doesn't support TCP_USER_TIMEOUT. I remember I was sometimes able to do such a thing on Linux and Solaris. If we don't have to handle such usage, I agree about removing the special handling of ENOPROTOOPT.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > +	if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
> > +				   (char *) &timeout, sizeof(timeout)) < 0 && errno != ENOPROTOOPT)
> > +	{
> > +		char	sebuf[256];
> > +
> > +		appendPQExpBuffer(&conn->errorMessage,
> > +						  libpq_gettext("setsockopt(TCP_USER_TIMEOUT) failed: %s\n"),
> >
> > I suppose that the reason ENOPROTOOPT is excluded from the error
> > condition is that the system call may fail with that errno on
> > older kernels, but I don't think that justifies ignoring the
> > failure.
>
> I think that's for the case where the module is built on an OS that
> supports TCP_USER_TIMEOUT (#ifdef TCP_USER_TIMEOUT is true), and the
> module is then used on an older OS that doesn't support TCP_USER_TIMEOUT.
> I remember I was sometimes able to do such a thing on Linux and Solaris.
> If we don't have to handle such usage, I agree about removing the special
> handling of ENOPROTOOPT.

Oops, on second thought I agree that we should return an error even in the ENOPROTOOPT case, because setsockopt() is called only when the user specifies tcp_user_timeout.

Regards
Takayuki Tsunakawa
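P.S. Concretely, I think the check can simply become the following (the error-reporting part is copied from the quoted code; the surrounding function presumably returns 0 on failure like libpq's keepalive setters do):

--
	if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
				   (char *) &timeout, sizeof(timeout)) < 0)
	{
		char	sebuf[256];

		appendPQExpBuffer(&conn->errorMessage,
						  libpq_gettext("setsockopt(TCP_USER_TIMEOUT) failed: %s\n"),
						  SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
		return 0;
	}
--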
RE: Timeout parameters
Nagaura-san,

The client-side tcp_user_timeout patch looks good. The server-side tcp_user_timeout patch needs the following fixes:

(1)
+	GUC_UNIT_MS | GUC_NOT_IN_SAMPLE
+	12000, 0, INT_MAX,

GUC_NOT_IN_SAMPLE should be removed because the parameter appears in postgresql.conf.sample. The default value should be 0, not 12000.

(2)
Now that you've changed show_tcp_usertimeout() to use pq_gettcpusertimeout(), you need to modify pq_settcpusertimeout(); just imitate pq_setkeepalivesidle(). Otherwise, SHOW would display a wrong value.

Regards
Takayuki Tsunakawa
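P.S. A rough sketch of what I mean in (2), modeled on pq_setkeepalivesidle(); the essential point is remembering the value in the Port so that pq_gettcpusertimeout()/SHOW report what was actually set. Exact field and macro names may differ from the patch:

--
int
pq_settcpusertimeout(int timeout, Port *port)
{
	if (port == NULL || IS_AF_UNIX(port->laddr.addr.ss_family))
		return STATUS_OK;

#ifdef TCP_USER_TIMEOUT
	if (setsockopt(port->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
				   (char *) &timeout, sizeof(timeout)) < 0)
	{
		elog(LOG, "setsockopt(TCP_USER_TIMEOUT) failed: %m");
		return STATUS_ERROR;
	}
#endif

	port->tcp_user_timeout = timeout;	/* remember it for SHOW */

	return STATUS_OK;
}
--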
RE: Timeout parameters
Nagaura-san,

The socket_timeout patch needs the following fixes. Now that others have already tested these patches successfully, they appear committable to me.

(1)
+	else
+		goto iiv_error;
...
+
+iiv_error:
+	conn->status = CONNECTION_BAD;
+	printfPQExpBuffer(&conn->errorMessage,
+					  libpq_gettext("invalid integer value for socket_timeout\n"));
+	return false;

This goto and its corresponding iiv_error label are redundant. You can just set the error message and return at the call site of parse_int_param(), i.e.:

	if (!parse_int_param(...))
	{
		/* error processing */
		return false;
	}
	if (conn->socket_timeout > 0 && conn->socket_timeout < 2)
		conn->socket_timeout = 2;

The reason the oom_error label is present is that it is used at multiple places to avoid repeating the same error-processing code.

(2)
+	conn->sock = -1;

Use PGINVALID_SOCKET instead of -1.

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
Hi Hari-san,

I've reviewed all the files. The patch would be OK when the following have been fixed, except for the complexity of fe-connect.c (which probably cannot be improved).

Unfortunately, I'll be absent next week. The earliest date I can do the test will be April 8 or 9. I hope someone can take care of this patch in the meantime...

(1) 0001
With this read-only option type, application can connect to
to a read-only server in the list of hosts, in case
...
before issuing the SHOW transaction_readonly to find out whether

"to" appears twice in a row.
transaction_readonly -> transaction_read_only

(2) 0001
+succesful connection or failure.

succesful -> successful

(3) 0008
to conenct to a standby server with a faster check instead of

conenct -> connect

(4) 0008
Logically, recovery exit should be notified after the following statement:

	XLogCtl->SharedRecoveryInProgress = false;

(5) 0008
+	/* Update in_recovery status. */
+	if (LocalRecoveryInProgress)
+		SetConfigOption("in_recovery",
+						"on",
+						PGC_INTERNAL, PGC_S_OVERRIDE);
+

This SetConfigOption() is called for every RecoveryInProgress() call on the standby. It may cause undesirable overhead. How about just calling SetConfigOption() once in InitPostgres() to set the initial value for in_recovery? InitPostgres() and its subsidiary functions call SetConfigOption() likewise.

(6) 0008
async.c is for LISTEN/UNLISTEN/NOTIFY. How about adding the new functions in postgres.c, like RecoveryConflictInterrupt()?

(7) 0008
+	if (pid != 0)
+	{
+		(void) SendProcSignal(pid, reason, procvxid.backendId);
+	}

The braces are not necessary because the block only contains a single statement.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > Another counter-argument to this is that there's already an > unexplainable slowdown after you run a query which obtains a large > number of locks in a session or use prepared statements and a > partitioned table with the default plan_cache_mode setting. Are we not > just righting a wrong here? Albeit, possibly 1000 queries later. > > I am, of course, open to other ideas which solve the problem that v5 > has, but failing that, I don't see v6 as all that bad. At least all > the logic is contained in one function. I know Tom wanted to steer > clear of heuristics to reinitialise the table, but most of the stuff > that was in the patch back when he complained was trying to track the > average number of locks over the previous N transactions, and his > concerns were voiced before I showed the 7% performance regression > with unconditionally rebuilding the table. I think I understand what you mean. Sorry, I don't have a better idea. That unexplainable slowdown is probably something we should accept in order to avoid the shocking 7% slowdown. OTOH, how about my original patch that is based on the local lock list? I expect that it won't cause such a significant slowdown in the same test case. If that's not satisfactory, then v6 is the best candidate to commit. Regards Takayuki Tsunakawa
RE: [Patch] PQconnectPoll() is blocked if target_session_attrs is read-write
From: Matsumura, Ryo [mailto:matsumura@jp.fujitsu.com] > Detail: > If target_session_attrs is set to read-write, PQconnectPoll() calls > PQsendQuery("SHOW transaction_read_only") although the previous return value > was PGRES_POLLING_READING, not WRITING. The current code probably assumes that the PQsendQuery() call that sends "SHOW transaction_read_only" shouldn't block, because the message is small enough to fit in the socket send buffer. Likewise, the code in the CONNECTION_AWAITING_RESPONSE case sends authentication data using pg_fe_sendauth() without checking for write-ready status. OTOH, the code in the CONNECTION_MADE case waits for write-ready status in advance before sending the startup packet. That's because the startup packet could get large enough to cause pqPacketSend() to block. So, I don't think the fix is necessary. > In result, PQsendQuery() may be blocked in pqsecure_raw_write(). FWIW, if PQsendQuery() blocked during connection establishment, I think it would block in poll() called from pqWait(), because libpq's socket is set non-blocking. Regards Takayuki Tsunakawa
RE: Why overhead of SPI is so large?
From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru] > PL/pgSQL: 29044.361 ms > C/SPI: 22785.597 ms > > The fact that difference between PL/pgSQL and function implemented in C > using SPI is not so large was expected by me. This PL/pgSQL overhead is not so significant compared with the three-fold difference, but it makes me want some feature like Oracle's ALTER PROCEDURE ... COMPILE, which compiles the PL/SQL logic to native code. I've seen speedups of a few dozen percent from it. Regards Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > After investigation, the mechanism that's causing that is that the > src/test/recovery/t/010_logical_decoding_timelines.pl test shuts > down its replica server with a mode-immediate stop, which causes > that postmaster to shut down all its children with SIGQUIT, and > in particular that signal propagates to a "cp" command that the > archiver process is executing. The "cp" is unsurprisingly running > with default SIGQUIT handling, which per the signal man page > includes dumping core. We experienced this (a core dump left in the data directory by an archive command) years ago. Related to this, the example of using cp in the PostgreSQL manual is misleading, because cp doesn't reliably persist the WAL archive file. > This makes me wonder whether we shouldn't be using some other signal > to shut down archiver subprocesses. It's not real cool if we're > spewing cores all over the place. Admittedly, production servers > are likely running with "ulimit -c 0" on most modern platforms, > so this might not be a huge problem in the field; but accumulation > of core files could be a problem anywhere that's configured to allow > server core dumps. We enable core dumps in production to help investigation, just in case. > Ideally, perhaps, we'd be using SIGINT not SIGQUIT to shut down > non-Postgres child processes. But redesigning the system's signal > handling to make that possible seems like a bit of a mess. > > Thoughts? We're using a shell script and a command that's called from the shell script. That is, archive_command = 'call some_shell_script.sh ...', where some_shell_script.sh runs "ulimit -c 0", traps SIGQUIT so that it just exits on receipt of the signal, and then calls some_command to copy the file. some_command also catches SIGQUIT and just exits; it copies and syncs the file. I proposed something along this line below, but I couldn't respond to Peter's review comments due to other tasks. Does anyone think it's worth resuming this? https://www.postgresql.org/message-id/7E37040CF3804EA5B018D7A022822984@maumau Regards Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Kyotaro Horiguchi [mailto:horikyota@gmail.com] > Since we are allowing OPs to use arbitrary command as > archive_command, providing a replacement with non-standard signal > handling for a specific command doesn't seem a general solution > to me. Couldn't we have pg_system(a tentative name), which > intercepts SIGQUIT then sends SIGINT to children? Might be need > to resend SIGQUIT after some interval, though.. The same idea that you referred to as pg_system occurred to me, too. But I wondered if the archiver process can get the pid of its child (shell? archive_command?), while keeping the capabilities of system() (= the shell). Even if we fork() and then system(), doesn't the OS send SIGQUIT to any descendents of the archiver when postmaster sends SIGQUIT to the child process group? > # Is there any means to view the whole of a thread from archive? > # I'm a kind of reluctant to wander among messages like a rat in > # a maze:p You can see the whole thread by clicking the "Whole Thread" link. Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > Hmm ... is this patch rejected, or is somebody still trying to get it to > committable state? David, you're listed as committer. I don't think it's rejected. It would be a pity (mottainai) to refuse this, because it provides significant speedup despite its simple modification. Again, I think the v2 patch is OK. Tom's comment was as follows: [Tom's comment against v2] FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at I do measure an undeniable speedup, close to 35%. However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices. If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5% which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code. Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets. I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks. On the whole I don't think there's an adequate case for committing this patch. I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-time patches. I'd rather see us do something to let hash_seq_search win across the board. * Extreme test case: Not extreme. Two of our customers, who are considering to use PostgreSQL, are using thousands of partitions now. We hit this issue -- a point query gets nearly 20% slower after automatically creating a generic plan. That's the reason for this proposal. * 0.5% slowdown with pgbench: I think it's below the noise, as Tom said. * sizeof(LOCALLOCK): As Andres replied to Tom in the immediately following mail, LOCALLOCK was bigger in PG 11. * Use case is narrow: No. The bloated LockMethodLocalHash affects the performance of the items below as well as transaction commit/abort: - AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice in PREPARE! - LockReleaseSession: advisory lock - LockReleaseCurrentOwner: ?? - LockReassignCurrentOwner: ?? Regards Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Jing Wang [mailto:jingwang...@gmail.com] > This is a proposal that let libpq support 'prefer-read' option in > target_session_attrs in pg_conn. The 'prefer-read' means the libpq will > try to connect to a 'read-only' server firstly from the multiple server > addresses. If failed to connect to the 'read-only' server then it will try > to connect to the 'read-write' server. There's a pending patch I started. I'd be happy if you could continue this. https://commitfest.postgresql.org/15/1148/ Regards Takayuki Tsunakawa
RE: [HACKERS] Statement-level rollback
From: Simon Riggs [mailto:si...@2ndquadrant.com] > When will the next version be posted? I'm very sorry I haven't submitted anything. I'd like to address this during this CF. Thanks for remembering this. Regards Takayuki Tsunakawa
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Robert Haas [mailto:robertmh...@gmail.com] > Oh, incidentally -- in our internal testing, we found that > wal_sync_method=open_datasync was significantly faster than > wal_sync_method=fdatasync. You might find that open_datasync isn't much > different from pmem_drain, even though they're both faster than fdatasync. That's interesting. How much faster was open_datasync, and in what environment (Linux distro/kernel version, HDD or SSD, etc.)? Is it now time to change the default setting to open_datasync on Linux, at least when O_DIRECT is not used (i.e. when WAL archiving or streaming replication is used)? [Current port/linux.h] /* * Set the default wal_sync_method to fdatasync. With recent Linux versions, * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't * perform better and (b) causes outright failures on ext4 data=journal * filesystems, because those don't support O_DIRECT. */ #define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC pg_test_fsync showed open_datasync is slower on my RHEL6 VM: 5 seconds per test O_DIRECT supported on this platform for open_datasync and open_sync. Compare file sync methods using one 8kB write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 4276.373 ops/sec 234 usecs/op fdatasync 4895.256 ops/sec 204 usecs/op fsync 4797.094 ops/sec 208 usecs/op fsync_writethrough n/a open_sync 4575.661 ops/sec 219 usecs/op Compare file sync methods using two 8kB writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 2243.680 ops/sec 446 usecs/op fdatasync 4347.466 ops/sec 230 usecs/op fsync 4337.312 ops/sec 231 usecs/op fsync_writethrough n/a open_sync 2329.700 ops/sec 429 usecs/op Regards Takayuki Tsunakawa
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Robert Haas [mailto:robertmh...@gmail.com] > I think open_datasync will be worse on systems where fsync() is expensive > -- it forces the data out to disk immediately, even if the data doesn't > need to be flushed immediately. That's bad, because we wait immediately > when we could have deferred the wait until later and maybe gotten the WAL > writer to do the work in the background. But it might be better on systems > where fsync() is basically free, because there you might as well just get > it out of the way immediately and not leave something left to be done later. > > This is just a guess, of course. You didn't mention what the underlying > storage for your test was? Uh, your guess was correct. My file system was ext3, where fsync() writes all dirty buffers in page cache. As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on a LVM volume with ext4 (mounted with options noatime, nobarrier) on a PCIe flash memory. 5 seconds per test O_DIRECT supported on this platform for open_datasync and open_sync. Compare file sync methods using one 8kB write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 50829.597 ops/sec 20 usecs/op fdatasync 42094.381 ops/sec 24 usecs/op fsync 42209.972 ops/sec 24 usecs/op fsync_writethroughn/a open_sync 48669.605 ops/sec 21 usecs/op Compare file sync methods using two 8kB writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 26366.373 ops/sec 38 usecs/op fdatasync 33922.725 ops/sec 29 usecs/op fsync 32990.209 ops/sec 30 usecs/op fsync_writethroughn/a open_sync 24326.249 ops/sec 41 usecs/op What do you think about changing the default value of wal_sync_method on Linux in PG 11? I can understand the concern that users might hit performance degredation if they are using PostgreSQL on older systems. But it's also mottainai that many users don't notice the benefits of wal_sync_method = open_datasync on new systems. Regards Takayuki Tsunakawa
Temporary tables prevent autovacuum, leading to XID wraparound
Hello, I've found a problem where an orphaned temporary table can cause XID wraparound. Our customer encountered this problem with PG 9.5.2, but I think it can also happen with the latest PG. I'm willing to fix this, but I'd like to ask you what approach we should take. PROBLEM The customer has a database for application data, which I call user_db here. They don't store application data in the postgres database. No tables in user_db were autovacuumed for more than a month, leading to user table bloat. Those tables are eligible for autovacuum according to pg_stat_all_tables and the autovacuum settings. age(datfrozenxid) of the user_db and postgres databases is greater than autovacuum_freeze_max_age, so they are eligible for autovacuuming for XID wraparound. There are user tables in user_db whose age(relfrozenxid) is greater than autovacuum_freeze_max_age, so those tables should get autovacuum treatment. CAUSE The postgres database has a table named pg_temp_3.fetchchunks, whose age(relfrozenxid) is greater than autovacuum_freeze_max_age. This temporary table is the culprit. pg_temp_3.fetchchunks is created by pg_rewind. The customer says they ran pg_rewind. The autovacuum launcher always chooses postgres, because do_start_worker() scans pg_database and finds that the postgres database needs vacuuming for XID wraparound. user_db is never chosen for vacuuming, although it also needs vacuuming for XID wraparound. The autovacuum worker doesn't delete pg_temp_3.fetchchunks, because backend ID 3 is used by some application, so the autovacuum worker thinks that the backend is active and the temporary table cannot be dropped. I don't know why pg_temp_3.fetchchunks still exists. Maybe the user ran pg_ctl stop -mi while pg_rewind was running. FIX I have the following questions. Along which line should I proceed to fix the problem? * Why does the autovacuum launcher always choose only one database when that database needs vacuuming for XID wraparound? Shouldn't it also choose other databases? * I think temporary tables should not require vacuuming for XID wraparound. Furthermore, should updates/deletes to temporary tables be done in-place instead of creating garbage, so that any form of vacuum is unnecessary? Other sessions do not need to read temporary tables. * In this incident, the autovacuum worker misjudged that pg_temp_3.fetchchunks can't be deleted, although the creator (pg_rewind) is no longer active. How can we delete orphaned temporary tables safely? Regards Takayuki Tsunakawa
RE: Temporary tables prevent autovacuum, leading to XID wraparound
> From: Michael Paquier [mailto:michael.paqu...@gmail.com] > As a superuser, DROP TABLE should work on the temporary schema of another > session. Have you tried that to solve the situation? Yes, we asked the customer to do that today. I think the customer will do it in the near future. > > * In this incident, autovacuum worker misjudged that > > pg_temp_3.fetchchunks can't be deleted, although the creator > > (pg_rewind) is no longer active. How can we delete orphaned temporary > > tables safely? > > As long as Postgres sees that its temporary schema is in use, it would think > that the table is not orphaned. Another thing possible would be to have > the session now holding this schema space to reuse fetchchunks so as things > are reset. I understand you to be suggesting that a new session which recycles the temp schema should erase the zombie metadata of old temp tables, or recreate the temp schema. That sounds easy. Regards Takayuki Tsunakawa
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Michael Paquier [mailto:michael.paqu...@gmail.com] > On Thu, Jan 25, 2018 at 08:10:00AM +0000, Tsunakawa, Takayuki wrote: > > I understand you to be suggesting that a new session which recycles the > > temp schema should erase the zombie metadata of old temp tables, or > > recreate the temp schema. That sounds easy. > > If the new session makes use of the same temporary schema where the orphan > table is, cleanup is possible. Now you have a problem if this is not available > as this depends on the backend ID uniquely assigned. Ouch, you're right. If the new session "uses the temp schema," it has a chance to clean up the old temp table metadata. However, that won't help if the session doesn't use the temp schema by creating a temp table... > It would be better > to just drop the table manually at the end. Just for resolving this particular incident, that's true. But this is a bug, so we need to fix it somehow. Regards Takayuki Tsunakawa
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Robert Haas [mailto:robertmh...@gmail.com] > On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki > wrote: > > As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on > a LVM volume with ext4 (mounted with options noatime, nobarrier) on a PCIe > flash memory. > > So does that mean it was faster than your PMDK implementation? The PMDK patch is not mine, but is from people in NTT Lab. I'm very curious about the comparison of open_datasync and PMDK, too. > > What do you think about changing the default value of wal_sync_method > on Linux in PG 11? I can understand the concern that users might hit > performance degredation if they are using PostgreSQL on older systems. But > it's also mottainai that many users don't notice the benefits of > wal_sync_method = open_datasync on new systems. > > Well, some day persistent memory may be a common enough storage technology > that such a change makes sense, but these days most people have either SSD > or spinning disks, where the change would probably be a net negative. It > seems more like something we might think about changing in PG 20 or PG 30. No, I'm not saying we should make the persistent memory mode the default. I'm simply asking whether it's time to make open_datasync the default setting. We can write a notice in the release note for users who still use ext3 etc. on old systems. If there's no objection, I'll submit a patch for the next CF. Regards Takayuki Tsunakawa
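The change itself would be a one-liner in src/include/port/linux.h, something like the sketch below (an illustration of the proposal, not a submitted patch):

/*
 * Proposed: let open_datasync become the default on Linux, either by
 * removing the override below so xlogdefs.h's normal preference order
 * applies, or by pointing it at open_datasync explicitly:
 */
#define PLATFORM_DEFAULT_SYNC_METHOD	SYNC_METHOD_OPEN_DSYNC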
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Robert Haas [mailto:robertmh...@gmail.com]> On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki > wrote: > > No, I'm not saying we should make the persistent memory mode the default. > I'm simply asking whether it's time to make open_datasync the default > setting. We can write a notice in the release note for users who still > use ext3 etc. on old systems. If there's no objection, I'll submit a patch > for the next CF. > > Well, like I said, I think that will degrade performance for users of SSDs > or spinning disks. As I showed previously, regular file writes on PCIe flash, *not writes using PMDK on persistent memory*, was 20% faster with open_datasync than with fdatasync. In addition, regular file writes on HDD with ext4 was also 10% faster: -- 5 seconds per test O_DIRECT supported on this platform for open_datasync and open_sync. Compare file sync methods using one 8kB write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 3408.905 ops/sec 293 usecs/op fdatasync 3111.621 ops/sec 321 usecs/op fsync 3609.940 ops/sec 277 usecs/op fsync_writethrough n/a open_sync 3356.362 ops/sec 298 usecs/op Compare file sync methods using two 8kB writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 1892.157 ops/sec 528 usecs/op fdatasync 3284.278 ops/sec 304 usecs/op fsync 3066.655 ops/sec 326 usecs/op fsync_writethrough n/a open_sync 1853.415 ops/sec 540 usecs/op -- And you said open_datasync was significantly faster than fdatasync. Could you show your results? What device and filesystem did you use? Regards Takayuki Tsunakawa
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Michael Paquier [mailto:michael.paqu...@gmail.com] > Or to put it short, the lack of granular syncs in ext3 kills performance > for some workloads. Tomas Vondra's presentation on such matters are a really > cool read by the way: > https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zf > s Yeah, I saw this recently, too. That was cool. Regards Takayuki Tsunakawa
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Robert Haas [mailto:robertmh...@gmail.com] > If I understand correctly, those results are all just pg_test_fsync results. > That's not reflective of what will happen when the database is actually > running. When you use open_sync or open_datasync, you force WAL write and > WAL flush to happen simultaneously, instead of letting the WAL flush be > delayed. Yes, that's pg_test_fsync output. Isn't pg_test_fsync the tool to determine the value for wal_sync_method? Is this manual misleading? https://www.postgresql.org/docs/devel/static/pgtestfsync.html -- pg_test_fsync - determine fastest wal_sync_method for PostgreSQL pg_test_fsync is intended to give you a reasonable idea of what the fastest wal_sync_method is on your specific system, as well as supplying diagnostic information in the event of an identified I/O problem. -- Anyway, I'll use pgbench, and submit a patch if open_datasync is better than fdatasync. I guess the current tweak of making fdatasync the default is a holdover from the era before ext4 and XFS became prevalent. > I don't have the results handy at the moment. We found it to be faster > on a database benchmark where the WAL was stored on an NVRAM device. Oh, NVRAM. Interesting. Then I'll try an open_datasync/fdatasync comparison on HDD and SSD/PCIe flash with pgbench. Regards Takayuki Tsunakawa
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > On Thu, Jan 25, 2018 at 3:14 PM, Tsunakawa, Takayuki > wrote: > > * Why does autovacuum launcher always choose only one database when that > database need vacuuming for XID wraparound? Shouldn't it also choose other > databases? > > Yeah, I'd also like to fix this issue. This can be problem even in other > case; there are two databases that require anti-wraparound vacuum, and one > of them has a very large table that could take a long time to vacuum. In > this case, if autovacuum chooses the database having big table first, > another database would not be selected until an autovacuum worker completes > vacuum on the large table. To deal with it, I think we can make autovacuum > workers tell that it is no longer necessary to launch a new autovacuum worker > on the database to autovacuum launcher before exit. For example, autovacuum > worker reports both the total number of relations and the number of relations > that require an anti-wraparound vacuum to the stats collector. Thanks for commenting. I believe you have deep knowledge and experience with vacuum because you did a great work for freeze map in 9.6, so I appreciate your help! How would you use those two counts? How about just modifying do_start_worker(), so that the launcher chooses a database in the following order? 1. wraparound-risky database not being vacuumed by any worker 2. non-wraparound-risky database not being vacuumed by any worker 3. wraparound-risky database being vacuumed by any worker 4. non-wraparound-risky database being vacuumed by any worker Regards Takayuki Tsunakawa
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Robert Haas [mailto:robertmh...@gmail.com] > I think we should consider having backends try to remove their temporary > schema on startup; then, if a temp table in a backend is old enough that > it's due for vacuum for wraparound, have autovacuum kill the connection. > The former is necessary to prevent sessions from being killed on account > of temp tables they "inherited" from a backend that didn't exit cleanly. That seems to be the only reasonable solution. One might feel it annoying to emit WAL during connection establishment to delete the temp schema, but even now the client authentication can emit WAL for hot pruning while scanning pg_database, pg_authid, etc. Thanks. > The in-place update idea won't work for a couple of reasons. First, a > command shouldn't see the results of modifications made earlier in the same > command. Second, using cursors, it's possible to have more than one > distinct snapshot open against a temporary table at the same time. You're right. And if the transaction rolls back, it needs to see the old tuples, which requires undo log. Regards Takayuki Tsunakawa
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > What I thought is that a worker reports these two values after scanned > pg_class and after freezed a table. The launcher decides to launch a new > worker if the number of tables requiring anti-wraparound vacuum is greater > than the number of workers running on the database. Similarly, the > autovacuum launcher doesn't launch a new worker if two values are equal, > which means all tables requiring an anti-wraparound vacuum is being vacuumed. > There are chances that new relation is added during a worker is running > on the last one table that requires anti-wraparound vacuum and launcher > launches a new worker on the database. I think it's no problem because the > new worker would update that two values and exits soon. I got it. Currently, the launcher assigns all workers to one database even if that database has only one table in danger of wraparound. With your suggestion, the launcher assigns as many workers as the tables to be frozen, and use remaining workers for the other databases. > > How about just modifying do_start_worker(), so that the launcher chooses > a database in the following order? > > > > 1. wraparound-risky database not being vacuumed by any worker 2. > > non-wraparound-risky database not being vacuumed by any worker 3. > > wraparound-risky database being vacuumed by any worker 4. > > non-wraparound-risky database being vacuumed by any worker > > > > IMO the limiting the number of worker on a database to 1 seems risky. > If a database has many tables that require an anti-wraparound vacuum, it > takes a long time to freeze the all of these tables. In current > implementation, as I mentioned as above, launcher can launch multiple worker > on the one database. I can understand your concern. On the other hand, it's unfair that one database could monopolize all workers, because other databases might also be facing wraparound risk. > Since the above idea would be complex a bit, as an > alternated idea it might be better to specify the number of worker to launch > per database by a new GUC parameter or something. If the number of worker > running on the database exceeds that limit, the launcher doesn't select > the database even if the database is about to wraparound. I'm afraid the GUC would be difficult for the user to understand and tune. I want to submit the patch for handling the garbage temporary table metadata as Robert suggested in the next CF. That should be enough to prevent this customer's problem. I would appreciate if anyone could address the other improvement that Sawada-san proposed. Regards Takayuki Tsunakawa
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > On Mon, Jan 29, 2018 at 3:33 PM, Tsunakawa, Takayuki > wrote: > > I can understand your concern. On the other hand, it's unfair that one > database could monopolize all workers, because other databases might also > be facing wraparound risk. > > On third thought, we can change the policy of launching workers so that > the launcher dispatches workers evenly to wraparound-risky databases > instead of choosing only one most wraparound-risky database. +1 Regards Takayuki Tsunakawa
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Robert Haas [mailto:robertmh...@gmail.com] > Unfortunately, I think a full solution to the problem of allocating AV > workers to avoid wraparound is quite complex. Yes, that easily puts my small brain into an infinite loop... > Given all of the foregoing this seems like a very hard problem. I can't > even articulate a clear set of rules for what our priorities should be, > and it seems that such rules would depend on the rate at which we're consuming > XIDs, how close we are in each database to a wraparound shutdown, what tables > exist in each database, how big the not-all-frozen part of each one is, > how big their indexes are, how much they're holding back relfrozenxid, and > which ones already have workers, among other things. I think it's quite > possible that we can come up with something that's better than what we have > now without embarking on a huge project, but it's not going to be anywhere > near perfect because this is really complicated, and there's a real risk > that we'll just making some cases better and others worse rather than > actually coming out ahead overall. So a simple improvement would be to assign workers fairly to databases facing a wraparound risk, as Sawada-san suggested. One ultimate solution should be the undo-based MVCC that makes vacuuming unnecessary, which you proposed about a year ago... Regards Takayuki Tsunakawa
[bug fix] ECPG: freeing memory for pgtypes crashes on Windows
Hello, A user hit a problem with ECPG on Windows. The attached patch is a fix for it. I'd appreciate it if you could back-patch this to all supported versions. The problem is simple: free() in the following example crashes: char *out; out = PGTYPESnumeric_to_asc(...); free(out); The cause is a mismatch of C runtime library versions: the version of Visual Studio used to build the application was different from the one used to build PostgreSQL (libpgtypes.dll). The fix is to add PGTYPES_free() to libpgtypes.dll, just like libpq has PQfreemem(), described here: https://www.postgresql.org/docs/devel/static/libpq-misc.html Regards Takayuki Tsunakawa pgtypes_freemem.patch Description: pgtypes_freemem.patch
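A usage sketch of the proposed API (PGTYPES_free() is the function added by the attached patch; it does not exist in released versions):

#include <stdio.h>
#include <pgtypes_numeric.h>

/*
 * Free pgtypes results with the proposed PGTYPES_free(), mirroring libpq's
 * PQfreemem(), so allocation and deallocation happen in the same C runtime.
 */
void
print_numeric(numeric *num)
{
	char	   *out = PGTYPESnumeric_to_asc(num, 2);	/* two decimal digits */

	if (out != NULL)
	{
		printf("%s\n", out);
		PGTYPES_free(out);		/* plain free() can crash on Windows */
	}
}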
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > Thank you for suggestion. It sounds more smarter. So it would be more better > if we vacuums database for anti-wraparound in ascending order of > relfrozenxid? I thought so, too. The current behavior is inconsistent: the launcher tries to assign all workers to the one database with the biggest wraparound risk in order to eliminate the risk as fast as possible, but the workers are in no hurry to reduce that risk. Regards Takayuki Tsunakawa
RE: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows
From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com] > +#ifndef PGTYPES_FREE > +#define PGTYPES_FREE > + extern void PGTYPES_free(void *ptr); > +#endif > > It seems quite strange to repeat this in pgtypes_date.h, pgtypes_interval.h > and pgtypes_numeric.h. I guess you might not want to introduce a new common > header file so that this can be back-patched more easily? Not sure if there > is a project policy about that, but it seems unfortunate to introduce > maintenance burden by duplicating this. Your guess is correct. I wanted to avoid the duplication, but there was no good place to put this without requiring existing applications to change their #include directives. > + PGTYPES_free()/ instead of > free(). > > The "/" needs to move inside the closing tag. Thanks, fixed. Regards Takayuki Tsunakawa pgtypes_freemem_v2.patch Description: pgtypes_freemem_v2.patch
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Robert Haas [mailto:robertmh...@gmail.com] > Temporary tables contain XIDs, so they need to be vacuumed for XID > wraparound. Otherwise, queries against those tables by the session > that created them could yield wrong answers. However, autovacuum > can't perform that vacuuming; it would have to be done by the session. > I think we should consider having backends try to remove their > temporary schema on startup; then, if a temp table in a backend is old > enough that it's due for vacuum for wraparound, have autovacuum kill > the connection. The former is necessary to prevent sessions from > being killed on account of temp tables they "inherited" from a backend > that didn't exit cleanly. The attached patch does the former. The small change in autovacuum.c is mainly for autovac launcher and background workers which don't connect to a database. I'll add this to the next CF. I'd like this to be back-patched. I didn't do the latter, because killing the connection seemed a bit overkill. If we're going to do it, then we should also kill the connection which is preventing vacuum regardless of whether it has temporary tables in its session. Regards Takayuki Tsunakawa reset_temp_schema_on_connect.patch Description: reset_temp_schema_on_connect.patch
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Michael Paquier [mailto:michael.paqu...@gmail.com] > I am not sure that we would like to give up that easily the property that > we have now to clean up past temporary files only at postmaster startup > and only when not in recovery. If you implement that, there is a risk that > the backend you are starting is eating the connection slot and by consequence > its temporary schema and its set of temporary tables on which one may want > to look into after a crash. postmaster deletes temporary relation files at startup by calling RemovePgTempFiles() regardless of whether it's in recovery. It doesn't call that function during auto restart after a crash when restart_after_crash is on. > > 2. Teach autovacuum to remove any leftover tables in a pg_temp_%d > > schema if the backend is active but in some other database (rather > > than only when the backend is not active at all). > > Yeah. Here we can do something. This does not sound much difficult to > me. I did that in my patch. Regards Takayuki Tsunakawa
RE: Temporary tables prevent autovacuum, leading to XID wraparound
From: Michael Paquier [mailto:michael.paqu...@gmail.com] > > postmaster deletes temporary relation files at startup by calling > > RemovePgTempFiles() regardless of whether it's in recovery. It > > doesn't call that function during auto restart after a crash when > > restart_after_crash is on. > > The comment on top of RemovePgTempFiles() states the following: > * NOTE: we could, but don't, call this during a post-backend-crash restart > * cycle. The argument for not doing it is that someone might want to > examine > * the temp files for debugging purposes. This does however mean that > * OpenTemporaryFile had better allow for collision with an existing temp > * file name. Yes, I saw that comment. postmaster keeps orphaned temp relation files only after a crash when restart_after_crash is on. > Nice to hear that. Please note that I did not check your patch, so I cannot > conclude on its correctness in details. I thought so. Don't mind. Regards Takayuki Tsunakawa
[bug fix] Cascaded standby cannot start after a clean shutdown
Hello, Our customer encountered a rare bug of PostgreSQL which prevents a cascaded standby from starting up. The attached patch is a fix for it. I hope this will be back-patched. I'll add this to the next CF. PROBLEM == The PostgreSQL version is 9.5. The cluster consists of a master, a cascading standby (SB1), and a cascaded standby (SB2). The WAL flows like this: master -> SB1 -> SB2. The user shut down SB2 and tried to restart it, but failed with the following message: FATAL: invalid memory alloc request size 3075129344 The master and SB1 continued to run. CAUSE == total_len in the code below was about 3GB, so palloc() rejected the memory allocation request. [xlogreader.c] record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); total_len = record->xl_tot_len; ... /* * Enlarge readRecordBuf as needed. */ if (total_len > state->readRecordBufSize && !allocate_recordbuf(state, total_len)) { Why was XLogRecord->xl_tot_len so big? That's because the XLOG reader read the garbage portion in a WAL file, which is immediately after the end of valid WAL records. Why was there the garbage? Because the cascading standby sends just the valid WAL records, not the whole WAL blocks, to the cascaded standby. When the cascaded standby receives those WAL records and write them in a recycled WAL file, the last WAL block in the file contains the mix of valid WAL records and the garbage left over. Why did the XLOG reader try to allocate memory for a garbage? Doesn't it stop reading the WAL? As the following code shows, when the WAL record header crosses a WAL block boundary, the XLOG reader first allocates memory for the whole record, and then checks the validity of the record header as soon as it reads the entire header. [xlogreader.c] /* * If the whole record header is on this page, validate it immediately. * Otherwise do just a basic sanity check on xl_tot_len, and validate the * rest of the header after reading it from the next page. The xl_tot_len * check is necessary here to ensure that we enter the "Need to reassemble * record" code path below; otherwise we might fail to apply * ValidXLogRecordHeader at all. */ if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) { if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record, randAccess)) goto err; gotheader = true; } else { /* XXX: more validation should be done here */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X: wanted %u, got %u", (uint32) (RecPtr >> 32), (uint32) RecPtr, (uint32) SizeOfXLogRecord, total_len); goto err; } gotheader = false; } FIX == One idea is to defer the memory allocation until the entire record header is read and ValidXLogRecordHeader() is called. However, ValidXLogRecordHeader() might misjudge the validity if the garbage contains xl_prev seeming correct. So, I modified walreceiver to zero-fill the last partial WAL block. This is the guard that the master does in AdvanceXLInsertBuffer() as below: /* * Be sure to re-zero the buffer so that bytes beyond what we've * written will look like zeroes and not valid XLOG records... */ MemSet((char *) NewPage, 0, XLOG_BLCKSZ); FYI, the following unsolved problem may be solved, too. https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com Regards Takayuki Tsunakawa zerofill_walblock_on_standby.patch Description: zerofill_walblock_on_standby.patch
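To illustrate the direction of the fix (a sketch only, not the attached patch; the helper and its caller are hypothetical): the walreceiver-side guard amounts to writing a fully zero-padded page whenever the received data ends partway through a WAL block:

static void
write_last_partial_page(int fd, off_t page_offset,
						const char *data, Size valid_bytes)
{
	char		page[XLOG_BLCKSZ];

	Assert(valid_bytes <= XLOG_BLCKSZ);

	/* Same guard the master applies in AdvanceXLInsertBuffer(). */
	MemSet(page, 0, XLOG_BLCKSZ);
	memcpy(page, data, valid_bytes);

	/* Leftover bytes in the recycled segment can no longer be read back. */
	if (pwrite(fd, page, XLOG_BLCKSZ, page_offset) != XLOG_BLCKSZ)
		ereport(PANIC,
				(errcode_for_file_access(),
				 errmsg("could not write zero-padded WAL page: %m")));
}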
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Fujii Masao [mailto:masao.fu...@gmail.com] > reloption for TOAST is also required? # I've come back to the office earlier than planned... Hm, there's no reason to not provide toast.vacuum_shrink_enabled. Done with the attached patch. Regards Takayuki Tsunakawa disable-vacuum-truncation_v5.patch Description: disable-vacuum-truncation_v5.patch
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > I did a bit of performance testing, both a plain pgbench and the > suggested test case with 4096 partitions. I can't detect any > performance improvements. In fact, within the noise, it tends to be > just a bit on the slower side. > > So I'd like to kick it back to the patch submitter now and ask for more > justification and performance analysis. > > Perhaps "speeding up planning with partitions" needs to be accepted first? David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check. Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > I can't detect any performance improvement with the patch applied to > current master, using the test case from Yoshikazu Imai (2019-03-19). That's strange... Peter, Imai-san, can you compare your test procedures? Peter, can you check and see the performance improvement with David's method on March 26 instead? Regards Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > "VACUUM" needs or "vacuum" is more appropriate here? Looking at the same file and some other files, "vacuum" looks appropriate because it represents the vacuum action, not the specific VACUUM command. > The format of the documentation of new option is a bit weird. Could it > be possible to adjust it around 80 characters per line like other > description? Ah, fixed. It's easy to overlook the style with the screen reader software... I've been wondering if there are good settings for editing .sgml in Emacs that, for example, put an appropriate number of spaces at the beginning of each new line, automatically break long lines and insert spaces, etc. From: Julien Rouhaud [mailto:rjuju...@gmail.com] > also, the documentation should point out that freeing is not > guaranteed. Something like > > + The default is true. If true, VACUUM will try to free empty > pages at the end of the table. That's nice. Done. > > I'm not sure the consensus we got here but we don't make the vacuum > > command option for this? > > I don't think here's a clear consensus, but my personal vote is to add > it, with SHRINK_TABLE = [ force_on | force_off | default ] (unless a > better proposal has been made already) As I mentioned earlier, IMO the VACUUM command option is not necessary because: (1) this is a table property which is determined based on the expected workload, not something that people want to change per VACUUM operation; (2) if someone wants to change the behavior for a particular VACUUM operation, they can do it using ALTER TABLE SET. Anyway, we can add the VACUUM option separately if we really want it. I don't want it to be the blocker for including this feature in PG 12, because the vacuum truncation has been bothering us, as others have said in this and other threads... Regards Takayuki Tsunakawa disable-vacuum-truncation_v6.patch Description: disable-vacuum-truncation_v6.patch
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Amit-san, Imai-san, From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > I was able to detect it as follows. > plan_cache_mode = auto > >HEAD: 1915 tps > Patched: 2394 tps > > plan_cache_mode = custom (non-problematic: generic plan is never created) > >HEAD: 2402 tps > Patched: 2393 tps Thanks a lot for the very quick confirmation. I'm relieved to still see the good results. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Michael Paquier [mailto:mich...@paquier.xyz] > The first letter should be upper-case. Thank you for taking care of this patch, and sorry for the trouble of having to fix that... > to me that socket_timeout_v14.patch should be rejected as it could cause > a connection to go down with no actual reason and that the server should > be in charge of handling timeouts. Is my impression right? No, the connection goes down for a good reason: the client could not get a response within a tolerable amount of time. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Michael Paquier [mailto:mich...@paquier.xyz] > I have just committed the GUC and libpq portion for TCP_USER_TIMEOUT after > a last lookup, and I have cleaned up a couple of places. Thank you for the further cleanup and for committing. > For the socket_timeout stuff, its way of solving the problem it thinks it > solves does not seem right to me, and this thread has not reached a consensus > anyway, so I have discarded the issue. > > I am marking the CF entry as committed. In the future, it would be better > to not propose multiple concepts on the same thread, and if the > socket_timeout business is resubmitted, I would suggest a completely new > CF entry, and a new thread. Understood. Looking back at the review process, it seems that tcp_user_timeout and socket_timeout should have been handled in separate threads. Regards Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
Hi Andres, Fujii-san, any committer, From: Andres Freund [mailto:and...@anarazel.de] > On 2019-04-08 09:52:27 +0900, Fujii Masao wrote: > > I'm thinking to commit this patch at first. We can change the term > > and add the support of "TRUNCATE" option for VACUUM command later. > > I hope you realize feature freeze is in a few hours... Ouch! Could you take care of committing the patch, please? I wouldn't be able to express the sadness and tiredness just in case this is pushed to 13 because of the parameter name... Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > On the whole I don't think there's an adequate case for committing > this patch. From: Andres Freund [mailto:and...@anarazel.de] > On 2019-04-05 23:03:11 -0400, Tom Lane wrote: > > If I reduce the number of partitions in Amit's example from 8192 > > to something more real-world, like 128, I do still measure a > > performance gain, but it's ~ 1.5% which is below what I'd consider > > a reproducible win. I'm accustomed to seeing changes up to 2% > > in narrow benchmarks like this one, even when "nothing changes" > > except unrelated code. > > I'm not sure it's actually that narrow these days. With all the > partitioning improvements happening, the numbers of locks commonly held > are going to rise. And while 8192 partitions is maybe on the more > extreme side, it's a workload with only a single table, and plenty > workloads touch more than a single partitioned table. I would feel happy if I could say such a many-partitions use case is narrow or impractical and ignore it, but it's not narrow. Two of our customers are actually requesting such usage: one uses 5,500 partitions and is trying to migrate from a commercial database on Linux, and the other requires 200,000 partitions to migrate from a legacy database on a mainframe. At first, I thought such many partitions indicate a bad application design, but it sounded valid (or at least I can't insist that's bad). PostgreSQL is now expected to handle such huge workloads. From: Andres Freund [mailto:and...@anarazel.de] > I'm not sure I'm quite that concerned. For one, a good bit of that space > was up for grabs until the recent reordering of LOCALLOCK and nobody > complained. But more importantly, I think commonly the amount of locks > around is fairly constrained, isn't it? We can't really have that many > concurrently held locks, due to the shared memory space, and the size of > a LOCALLOCK isn't that big compared to say relcache entries. We also > probably fairly easily could win some space back - e.g. make nLocks 32 > bits. +1 From: Tom Lane [mailto:t...@sss.pgh.pa.us] > I'd also point out that this is hardly the only place where we've > seen hash_seq_search on nearly-empty hash tables become a bottleneck. > So I'm not thrilled about attacking that with one-table-at-time patches. > I'd rather see us do something to let hash_seq_search win across > the board. > > I spent some time wondering whether we could adjust the data structure > so that all the live entries in a hashtable are linked into one chain, > but I don't quite see how to do it without adding another list link to > struct HASHELEMENT, which seems pretty expensive. I think the linked list of LOCALLOCK approach is natural, simple, and good. In the Jim Gray's classic book "Transaction processing: concepts and techniques", we can find the following sentence in "8.4.5 Lock Manager Internal Logic." The sample implementation code in the book uses a similar linked list to remember and release a transaction's acquired locks. "All the locks of a transaction are kept in a list so they can be quickly found and released at commit or rollback." And handling this issue with the LOCALLOCK linked list is more natural than with the hash table resize. We just want to quickly find all grabbed locks, so we use a linked list. A hash table is a mechanism to find a particular item quickly. So it was merely wrong to use the hash table to iterate all grabbed locks. 
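To make that bookkeeping concrete, here is a rough sketch of the linked-list idea using the backend's dlist facility; the stand-in struct and all names are assumptions for illustration, not committed code (in a real patch the dlist_node would be added to struct LOCALLOCK itself):

#include "lib/ilist.h"

typedef struct MyLocalLock
{
	/* ... the existing LOCALLOCK fields would live here ... */
	dlist_node	node;			/* links the entry into held_locks */
} MyLocalLock;

static dlist_head held_locks = DLIST_STATIC_INIT(held_locks);

/* Called when a local lock is first acquired. */
static void
remember_local_lock(MyLocalLock *locallock)
{
	dlist_push_tail(&held_locks, &locallock->node);
}

/* Called at transaction end: walk only the locks actually held. */
static void
release_all_local_locks(void)
{
	dlist_mutable_iter iter;

	dlist_foreach_modify(iter, &held_locks)
	{
		MyLocalLock *locallock = dlist_container(MyLocalLock, node, iter.cur);

		/* ... release the lock as LockReleaseAll() does today ... */
		dlist_delete(&locallock->node);
	}
}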
Also, the hash table got big because some operation in the session needed it, and some subsequent operations in the same session may need it again. So we wouldn't be relieved with shrinking the hash table. Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: 'Andres Freund' [mailto:and...@anarazel.de] > On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote: > > I think the linked list of LOCALLOCK approach is natural, simple, and > > good. > > Did you see that people measured slowdowns? Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And that might be within noise as was mentioned. If we want to remove even the noise, we may have to think of removing the LocalLockHash completely. But it doesn't seem feasible... Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > It would be good to get your view on the > shrink_bloated_locallocktable_v3.patch I worked on last night. I was > unable to measure any overhead to solving the problem that way. Thanks, it looks super simple and good. I understand the idea behind your patch to be: * Transactions that touch many partitions and/or tables are a special event and not the norm, and the hash table bloat is an unlucky accident. So it's reasonable to revert the bloated hash table back to its original size. * Repeated transactions that acquire many locks have to enlarge the hash table every time. However, the overhead of hash table expansion should be hidden behind the various other processing (acquiring/releasing locks, reading/writing the relations, accessing the catalogs of those relations). TBH, I think the linked list approach feels more intuitive because the resulting code looks like what it wants to do (efficiently iterate over the acquired locks) and is based on the classic book. But your approach seems to relieve people's concerns, so I'm OK with your patch. I'm registering you as another author and me as a reviewer, and marking this ready for committer. Regards Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > And, as far as I can see from a quick review of the thread, > we don't really have consensus on the names and behaviors. The consensus on the name seems to be to use truncate rather than shrink (a few people kindly said they like shrink, and I'm OK with either name). And there's no dispute on the behavior. Do you see any other point? Regards Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > "vacuum_truncate" gets my vote too. +1 From: 'Andres Freund' [mailto:and...@anarazel.de] > Personally I think the name just needs some committer to make a > call. This largely is going to be used after encountering too many > cancellations in production, and researching the cause. Minor spelling > differences don't seem to particularly matter here. Absolutely. Thank you. From: 'Andres Freund' [mailto:and...@anarazel.de] > I think it needs to be an autovac option. The production issue is that > autovacuums constantly cancel queries on the standbys despite > hot_standby_feedback if you have a workload that does frequent > truncations. If there's no way to configure it in a way that autovacuum > takes it into account, people will just disable autovacuum and rely > entirely on manual scripts. That's what already happens - leading to a > much bigger foot-gun than disabling truncation. FWIW, people already in > production use the workaround to configuring snapshot_too_old as that, > for undocumented reasons, disables trunctations. That's not better > either. Right. We don't want autovacuum to be considered as a criminal. Regards Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Fujii Masao [mailto:masao.fu...@gmail.com] > Thanks for the info, so I marked the patch as committed. Thanks a lot for your hard work! This felt relatively tough despite the simplicity of the patch. I'm starting to feel the difficulty and fatigue in developing in the community... Regards Takayuki Tsunakawa
RE: IDE setup and development features?
From: Mori Bellamy [mailto:m...@invoked.net] > I'd like a few features when developing postgres -- (1) jump to definition > of symbol (2) find references to symbol and (3) semantic autocompletion. For 1), you can generate tags like: [for vi] $ src/tools/make_ctags [for Emacs] $ src/tools/make_etags Cscope works for 2). Regards Takayuki Tsunakawa
RE: Implementation of Flashback Query
From: Yang Jie [mailto:yang...@highgo.com] > Delayed cleanup, resulting in performance degradation, what are the > solutions recommended? > What do you suggest for the flashback feature? > Although postgres has excellent backup and restore capabilities, have you > considered adding flashbacks? Great proposal, I appreciate it. Regarding the bloat, check the recent mail threads titled "undo xxx". A new table storage format is being developed to use undo logs like Oracle and MySQL instead of generating dead tuples. How about the possibility of adding a feature like Flashback Drop to restore dropped and truncated tables? I sometimes saw customers who mistakenly dropped or truncated tables but didn't have any backup. Regards Takayuki Tsunakawa
RE: WAL archive (archive_mode = always) ?
From: Adelino Silva [mailto:adelino.j.si...@googlemail.com] > What is the advantage to use archive_mode = always in a slave server compared > to archive_mode = on (shared WAL archive) ? > > I only see duplication of Wal files, what is the purpose of this feature ? This also saves you the network bandwidth by not sending the WAL archive from the primary to the standby(s). The network bandwidth can be costly between remote regions for disaster recovery. Regards Takayuki Tsunakawa
RE: WAL archive (archive_mode = always) ?
From: Narayanan V [mailto:vnarayanan.em...@gmail.com] > I think what Takayuki is trying to say is that streaming replication works > by sending the contents of the WAL archives to the standbys. If archive_mode > was NOT set to always, and if you wanted to archive WAL logs in the standby > you would need to rely on the process_command and make it ship the WAL logs > (to the standby). This causes the same WAL log information to be shipped > in two places, > > 1. Through Streaming Replication > 2. By the process_command > > This redundant shipping of the same information is expensive and consumes > network bandwidth. This can be avoided with the use of archive_mode=always. > > archive_mode=always makes the standby archive the WAL logs it receives, > thus avoiding the requirement of having to ship it separately. Exactly. (archive_command, not process_command) Thanks, Narayanan. Regards Takayuki Tsunakawa
RE: PostgreSQL Limits and lack of documentation about them.
From: David Rowley [mailto:david.row...@2ndquadrant.com] > I think it's a bit strange that we don't have this information fairly > early on in the official documentation. I only see a mention of the > 1600 column limit in the create table docs. Nothing central and don't > see mention of 32 TB table size limit. > > I don't have a patch, but I propose we include this information in the > docs, perhaps on a new page in the preface part of the documents. > > Does anyone else have any thoughts about this? +1 As a user, I feel I would look for such information in appendix like "A Database limits" in Oracle's Database Reference manual: https://docs.oracle.com/en/database/oracle/oracle-database/18/refrn/database-limits.html#GUID-ED26F826-DB40-433F-9C2C-8C63A46A3BFE As a somewhat related topic, PostgreSQL doesn't mention the maximum values for numeric parameters. I was asked several times the questions like "what's the maximum value for max_connections?" and "how much memory can I use for work_mem?" I don't feel a strong need to specify those values, but I wonder if we should do something. Regards Takayuki Tsunakawa