RE: Speedup of relation deletes during recovery

2018-03-29 Thread Tsunakawa, Takayuki
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> When multiple relations are deleted at the same transaction, the files of
> those relations are deleted by one call to smgrdounlinkall(), which leads
> to scan whole shared_buffers only one time. OTOH, during recovery,
> smgrdounlink() (not smgrdounlinkall()) is called for each file to delete,
> which leads to scan shared_buffers multiple times.
> Obviously this can cause to increase the WAL replay time very much especially
> when shared_buffers is huge.
> 
> To alleviate this situation, I'd like to propose to change the recovery
> so that it also calls smgrdounlinkall() only one time to delete multiple
> relation files. Patch attached. Thought?

Nice catch.  As Horiguchi-san and Michael already commented, the patch looks 
good.

As a related improvement, the following proposal is effective for shortening the 
WAL replay time of DROP TABLE (and possibly TRUNCATE as well).  How should we 
proceed with this?

https://www.postgresql.org/message-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23


Furthermore, TRUNCATE has a similar but worse issue.  While DROP TABLE scans 
the shared buffers once for each table, TRUNCATE does so once for each fork, 
resulting in three scans per table.  How about changing the following functions

smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
   BlockNumber firstDelBlock)

to

smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks,
             int nforks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                       BlockNumber *firstDelBlock, int nforks)

to perform the scan only once per table?  If there doesn't seem to be a 
problem, I think I'll submit the patch next month.  Of course, I'd welcome it if 
anyone else could do it.
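
To illustrate the idea (only a rough sketch, not the actual patch; buffer header 
locking and the real invalidation protocol are omitted), the multi-fork 
DropRelFileNodeBuffers would compare each buffer against all forks being 
truncated in a single pass over shared_buffers:

void
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
					   BlockNumber *firstDelBlock, int nforks)
{
	/* one pass over shared_buffers instead of one pass per fork */
	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);

		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
			continue;

		/* check this buffer against every fork being truncated */
		for (int j = 0; j < nforks; j++)
		{
			if (bufHdr->tag.forkNum == forkNum[j] &&
				bufHdr->tag.blockNum >= firstDelBlock[j])
			{
				InvalidateBuffer(bufHdr);	/* simplified: real code locks the header */
				break;
			}
		}
	}
}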


Regards
Takayuki Tsunakawa





RE: Speedup of relation deletes during recovery

2018-04-02 Thread Tsunakawa, Takayuki
From: Jerry Sievers [mailto:gsiever...@comcast.net]
> Wonder if this is the case for streaming standbys replaying truncates
> also?

Yes.  As I wrote in my previous mail, TRUNCATE is even worse than DROP TABLE.

Regards
Takayuki Tsunakawa





RE: Built-in connection pooling

2018-04-18 Thread Tsunakawa, Takayuki
From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> In Oracle, for example, you can create dedicated and non-dedicated backends.
> I wonder why we do not want to have something similar in Postgres.

Yes, I want it, too.  In addition to dedicated and shared server processes, 
Oracle provides Database Resident Connection Pooling (DRCP).  I guessed you 
were inspired by this.

https://docs.oracle.com/cd/B28359_01/server.111/b28310/manproc002.htm#ADMIN12348

BTW, you are doing various great work -- autoprepare, multithreaded Postgres, 
built-in connection pooling, etc. etc., aren't you?  Are you doing all of these 
alone?

Regards
Takayuki Tsunakawa





RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-04-18 Thread Tsunakawa, Takayuki
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> a very long time before accessing to the relation. Which would cause the
> response-time spikes, for example, I observed such spikes several times
> on
> the server with shared_buffers = 300GB while running the benchmark.

FYI, a long transaction took about 900 ms, while the average transaction 
response time was 150 ms or so.  (I'm working with Fujii-san on this 
performance benchmark.)


> Therefore, I'm thinking to propose $SUBJECT and enable it to avoid such
> spikes
> for that relation.

How about a user-settable integer parameter to replace the following constant?

#define REL_TRUNCATE_FRACTION   16


> Also, first of all, if other transactions need to extend the relation
> (i.e., need new pages) as soon as VACUUM truncates the empty pages at the
> end,
> that truncation would not be so helpful for performance. In this case,
> the truncation and extension of the relation are unnecessarily repeated,
> which would decrease the performance. So, to alleviate this situation,
> $SUBJECT is useful, I think.

I wonder if fillfactor=50 would alleviate this situation.

Regards
Takayuki Tsunakawa



RE: Speedup of relation deletes during recovery

2018-04-18 Thread Tsunakawa, Takayuki
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> Yeah, it's worth working on this problem. To decrease the number of scans
> of
> shared_buffers, you would need to change the order of truncations of files
> and
> WAL logging. In RelationTruncate(), currently WAL is logged after FSM and
> VM
> are truncated. IOW, with the patch, FSM and VM would need to be truncated
> after
> WAL logging. You would need to check whether this reordering is valid.

Sure, thank you for the advice.

Takayuki Tsunakawa



RE: [bug fix] Produce a crash dump before main() on Windows

2019-09-06 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> The last patch submitted is here:
> https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D
> 1F8ECF73@G01JPEXMBYT05
> And based on the code paths it touches I would recommend to not play with
> REL_12_STABLE at this stage.

I'm reading the thread to see what I should do...  Anyway, I don't think we 
need to rush to include this fix in PG 12, either.  But at the same time, I don't 
think it would be dangerous to put it in PG 12.


Regards
Takayuki Tsunakawa





RE: [bug fix] Produce a crash dump before main() on Windows

2019-09-09 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Alvaro Herrera from 2ndQuadrant  writes:
> > Well, IMV this is a backpatchable, localized bug fix.
> 
> I dunno.  This thread is approaching two years old, and a quick
> review shows few signs that we actually have any consensus on
> making behavioral changes here.  If there is any consensus,
> it's that the SetErrorMode calls should depend on checking
> pgwin32_is_service(); but we haven't even seen a patch that
> does that, let alone tested it.  I think we're way too close
> to beta4 wrap to be considering pushing such a patch the moment
> it appears.

We don't have to call pgwin32_is_service() to determine whether to call 
SetErrorMode() in postmaster.c because:

* The dialog box doesn't appear under Windows service, whether PostgreSQL 
itself starts as a service or another (server) application runs as a service 
and does "pg_ctl start."

* It's OK for the dialog box to appear when the user runs "pg_ctl start" at the 
command prompt.  That usage is interactive and should not be used in production 
(the postgres processes vanish when the user mistakenly presses Ctrl+C or 
closes the command prompt).  Even now, the dialog box pops up if postmaster 
crashes before main().


> BTW, I also violently dislike taking Windows-specific stuff out of
> startup_hacks where it belongs and putting it into main() where
> it doesn't.  I think the entire Windows bit in front of get_progname
> should migrate into startup_hacks.  Surely the odds of failure
> inside get_progname are not worth worrying about --- they seem
> significantly less than the odds of failure inside
> pgwin32_is_service(), for instance.

Agreed.  Fixed.  I changed the CF status to "needs review."
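
For reference, the relocated Windows-specific setup looks roughly like this (a 
simplified sketch, not the exact patch):

static void
startup_hacks(const char *progname)
{
#ifdef WIN32
	/*
	 * Suppress the crash-dialog popups as early as possible, before anything
	 * that could fail, so an unattended server never hangs waiting for
	 * someone to click the dialog away.
	 */
	SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);

	/* ... the existing Windows-specific setup (WSAStartup() etc.) continues here ... */
#endif
}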


Regards
Takayuki Tsunakawa



crash_dump_before_main_v3.patch
Description: crash_dump_before_main_v3.patch


RE: Libpq support to connect to standby server as priority

2019-09-10 Thread Tsunakawa, Takayuki
From: Alvaro Herrera from 2ndQuadrant [mailto:alvhe...@alvh.no-ip.org]
> Testing protocol version 2 is difficult!  Almost every single test fails
> because of error messages being reported differently; and streaming
> replication (incl. pg_basebackup) doesn't work at all because it's not
> possible to establish replication connections.  Manual inspection shows
> it behaves correctly.

Yeah, the code path for protocol v2 is sometimes annoying.  I hope v2 support 
will be dropped soon.  I know there was a discussion about it, but I didn't follow 
the conclusion.


> Remaining patchset attached (per my count it's v13 of your patchset.

I'm afraid those weren't attached.


> think we should merge one half of it with each of the other two patches
> where the changes are introduced (0003 and 0004).  I'm not convinced
> that we need 0004-0006 to be separate commits.

It was hard to review those as separate patches, so I think it's better to merge 
them.  OTOH, I can understand Haribabu-san's idea that the community might not 
accept the later patches, e.g. might accept only 0001-0005.


Regards
Takayuki Tsunakawa




RE: SIGQUIT on archiver child processes maybe not such a hot idea?

2019-09-10 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> SIGTERM, which needs to be adjusted.  For another, its
> SIGQUIT handler does exit(1) not _exit(2), which seems rather
> dubious ... should we make it more like the rest?  I think
> the reasoning there might've been that if some DBA decides to
> SIGQUIT the archiver, we don't need to force a database-wide
> reset; but why exactly should we tolerate that?

The postmaster doesn't distinguish among non-zero exit codes from the archiver; it 
just restarts the archiver unless the postmaster is shutting down.  So we can use 
_exit(2) like the other children do.
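
Something like this (assuming the SIGQUIT handler in pgarch.c is pgarch_exit()):

/* Sketch: make the archiver's SIGQUIT handler behave like the other children. */
static void
pgarch_exit(SIGNAL_ARGS)
{
	_exit(2);					/* instead of exit(1) */
}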

Can't we use SIGKILL instead of SIGINT/SIGTERM to stop the grandchildren, just 
in case they are slow to respond to or ignore SIGINT/SIGTERM?  That matches the 
idea of pg_ctl's immediate shutdown.

(Windows cannot stop the grandchildren because kill() in src/port/kill.c doesn't 
support process groups...  That's a separate topic.)


Regards
Takayuki Tsunakawa






RE: SIGQUIT on archiver child processes maybe not such a hot idea?

2019-09-10 Thread Tsunakawa, Takayuki
From: David Steele [mailto:da...@pgmasters.net]
> > Can't we use SIGKILL instead of SIGINT/SIGTERM to stop the grandchildren,
> just in case they are slow to respond to or ignore SIGINT/SIGTERM?  That
> matches the idea of pg_ctl's immediate shutdown.
> 
> -1, at least not immediately.  Archivers can be complex processes and
> they should be given the chance to do a graceful shutdown.

I agree that the user's archiver program should be given the chance to stop 
gracefully during a smart or fast shutdown.  But in an immediate shutdown, I think 
everything should stop immediately.  That's what I expect from the word "immediate."

If the grandchildren left running don't disturb the cleanup of PostgreSQL's 
resources (shared memory, file/directory access, etc.) or the restart of 
PostgreSQL, we may well be able to just advise the grandchildren to stop 
immediately with SIGINT/SIGTERM.  However, for example, in a failover of 
shared-disk HA clustering, when the clustering software stops PostgreSQL with 
"pg_ctl stop -m immediate" and then tries to unmount the file systems for 
$PGDATA and archived WAL, the unmount may take time or fail due to access from 
PostgreSQL's grandchildren.


Regards
Takayuki Tsunakawa






RE: [bug fix] Produce a crash dump before main() on Windows

2019-09-10 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Imagine an application which relies on Postgres, still does *not* start
> it as a service but uses "pg_ctl start"
> automatically.  This could be triggered as part of another service startup
> which calls say system(), or as another script.  Wouldn't that cause
> potentially a freeze?  

Do you mean by "another service startup":

another service (Windows service)
-> an application (not a Windows service)
  -> pg_ctl start

Then pg_ctl runs under the Windows service environment, and the popup won't 
appear.


> I am also not sure that I quite understand why a
> popup would never show up if a different service startup triggers pg_ctl
> start by itself that fails.

I verified this with the sample program that I showed earlier in this thread.  A 
program invoked by a Windows service also runs within the service environment.


Regards
Takayuki Tsunakawa







[bug fix??] Fishy code in tts_virtual_copyslot()

2019-09-12 Thread Tsunakawa, Takayuki
Hello,

In the following code in execTuples.c, shouldn't srcdesc point to the source 
slot's tuple descriptor?  The attached fix passes make check.  What kind of 
failure could this cause?

BTW, I thought that in the PostgreSQL coding convention, local variables should be 
declared at the top of blocks, but this function writes "for (int natt ...".  I 
didn't modify it because many other source files are also written that way.


--
static void
tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	TupleDesc	srcdesc = dstslot->tts_tupleDescriptor;

	Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);

	tts_virtual_clear(dstslot);

	slot_getallattrs(srcslot);

	for (int natt = 0; natt < srcdesc->natts; natt++)
	{
		dstslot->tts_values[natt] = srcslot->tts_values[natt];
		dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
	}
--
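
In short, the attached fix just takes the descriptor from the source slot, along 
these lines:

--
	TupleDesc	srcdesc = srcslot->tts_tupleDescriptor;	/* was: dstslot->tts_tupleDescriptor */
--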


Regards
Takayuki Tsunakawa



virtslot_tiny_fix.patch
Description: virtslot_tiny_fix.patch


RE: [bug fix??] Fishy code in tts_virtual_copyslot()

2019-09-23 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I temporarily changed the Assert to be "==" rather than "<=", and
> it still passed check-world, so evidently we are not testing any
> cases where the descriptors are of different lengths.  This explains
> the lack of symptoms.  It's still a bug though, so pushed.

Thank you for committing.

> > BTW, I thought that in PostgreSQL coding convention, local variables
> should be defined at the top of blocks, but this function writes "for (int
> natts;".
> 
> Yeah, we've agreed to join the 21st century to the extent of allowing
> local for-loop variables.

That's good news.  It'll make coding a bit more comfortable.


Regards
Takayuki Tsunakawa







RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-09-26 Thread Tsunakawa, Takayuki
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> On 2019-Sep-03, Tsunakawa, Takayuki wrote:
> > I don't think it's rejected.  It would be a pity (mottainai) to refuse
> > this, because it provides significant speedup despite its simple
> > modification.
> 
> I don't necessarily disagree with your argumentation, but Travis is
> complaining thusly:

I tried to revise David's latest patch (v8) and address Tom's comments in his 
last mail.  But I'm a bit at a loss.

First, to accurately count the maximum number of locks acquired in a 
transaction, we need to track the maximum number of entries in the hash table and 
make it available via a new function like hash_get_max_entries().  However, to cover 
the shared partitioned hash table (which is not necessary for 
LockMethodLocalHash), we would have to add a spin lock to hashhdr and lock/unlock it 
when entering and removing entries in the hash table.  That would spoil the effort to 
decrease contention on hashhdr->freelists[].mutex.  Do we want to track the 
maximum number of acquired locks in a global variable in lock.c, not in the 
hash table?

Second, I couldn't quite understand the comment about the fill factor.  I can 
understand that it's not correct to compare the number of hash buckets with the 
number of locks.  But what can we do instead?


I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's 
approach is based on the database textbook and seems intuitive.  So I attached 
the rebased version.


Regards
Takayuki Tsunakawa




faster-locallock-scan_v3.patch
Description: faster-locallock-scan_v3.patch


RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> This makes the test page-size sensitive.  While we don't ensure that tests
> can be run with different page sizes, we should make a maximum effort to
> keep the tests compatible if that's easy enough.  In this case you could
> just use > 0 as base comparison.  I can fix that by myself, so no need to
> send a new version.

Good idea.  Done.


> Should we also document that the parameter is effective for autovacuum?
> The name can lead to confusion regarding that.

I'm not sure about the need, because autovacuum is just an automatic execution of 
vacuum, and the existing vacuum_xxx parameters also apply to autovacuum.  But being 
specific is good anyway, so I added a reference to autovacuum in the description.


> Also, shouldn't the relopt check happen in should_attempt_truncation()?
> It seems to me that if we use this routine somewhere else then it should
> be controlled by the option.

That makes sense.  Done.


> At the same time, we also have REL_TRUNCATE_FRACTION and
> REL_TRUNCATE_MINIMUM which could be made equally user-tunnable.
> That's more difficult to control, still why don't we also consider this
> part?

I thought of it, too.  But I didn't have a good idea of how to explain those 
parameters, so I'd like to handle that separately.


> Another thing that seems worth thinking about is a system-level GUC, and
> an option in the VACUUM command to control if truncation should happen or
> not.  We have a lot of infrastructure to control such options between vacuum
> and autovacuum, so it could be a waste to not consider potential synergies.

Being able to specify this parameter in postgresql.conf and SET (especially 
ALTER DATABASE/USER to target specific databases/applications) might be useful, 
but I'm not sure...  I'm less confident about whether the VACUUM command should be 
able to specify this, because this is a property of a table under a specific 
workload, not a changeable property of each VACUUM invocation.  Anyway, I expect it 
won't be difficult to add that configurability without contradicting the design, so 
I'm inclined to handle it separately.


From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> Yeah, that would work. Or it's kind of hackie but the rolling back the
> insertion instead of INSERT and DELETE might also work.

That's good, because it allows us to keep running the reloptions test in parallel 
with other tests.  Done.


Regards
Takayuki Tsunakawa




disable-vacuum-truncation_v4.patch
Description: disable-vacuum-truncation_v4.patch


RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think that a VACUUM option would be out of place, but a GUC
> sounds like an attractive nuisance to me.  It will encourage people to
> just flip it blindly instead of considering the particular cases where
> they need that behavior, and I think chances are good that most people
> who do that will end up being sad.

Ouch, I sent my previous mail before reading this.  I can understand that it may be 
cumbersome to identify and specify each table, so I tend to agree with a parameter in 
postgresql.conf, made USERSET so that ALTER DATABASE/USER SET can tune 
specific databases and applications.  But should the vacuuming of system 
catalogs also follow this setting?


Regards
Takayuki Tsunakawa




RE: Libpq support to connect to standby server as priority

2019-02-27 Thread Tsunakawa, Takayuki
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> Attached are the updated patches.

Thanks, all look fixed.


> The target_server_type option yet to be implemented.

Please let me review once more and proceed to testing when the above is added, 
to make sure the final code looks good.  I'd like to see how complex the if 
conditions in multiple places would be after adding target_server_type, and 
consider whether we can simplify them together with you.  Even now, the if 
conditions seem complicated to me... that's probably due to the existence of 
read_write_host_index.


Regards
Takayuki Tsunakawa







RE: Protect syscache from bloating with negative cache entries

2019-02-27 Thread Tsunakawa, Takayuki
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> I measured the memory context accounting overhead using Tomas's tool
> palloc_bench,
> which he made it a while ago in the similar discussion.
> https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz
> 
> This tool is a little bit outdated so I fixed it but basically I followed
> him.
> Things I did:
> - make one MemoryContext
> - run both palloc() and pfree() for 32kB area 1,000,000 times.
> - And measure this time
> 
> The result shows that master is 30 times faster than patched one.
> So as Andres mentioned in upper thread it seems it has overhead.
> 
> [master (without v15 patch)]
> 61.52 ms
> 60.96 ms
> 61.40 ms
> 61.42 ms
> 61.14 ms
> 
> [with v15 patch]
> 1838.02 ms
> 1754.84 ms
> 1755.83 ms
> 1789.69 ms
> 1789.44 ms
> 

I'm afraid the measurement is not correct.  First, the older discussion below 
shows that the accounting overhead is much, much smaller, even with a more 
complex accounting.

9.5: Better memory accounting, towards memory-bounded HashAgg
https://www.postgresql.org/message-id/flat/1407012053.15301.53.camel%40jeff-desktop

Second, allocation/free of memory > 8 KB calls malloc()/free().  I guess the 
accounting overhead is more likely to be hidden under the overhead of 
malloc() and free().  What we'd like to know is the overhead when malloc() and 
free() are not called.

And are you sure you didn't enable assert checking?
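
For example, something along these lines (a sketch assuming Tomas's palloc_bench 
harness), keeping the chunk size well below 8 KB so that requests are served from 
the context's freelists rather than malloc():

	MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
											  "palloc_bench",
											  ALLOCSET_DEFAULT_SIZES);
	MemoryContext old = MemoryContextSwitchTo(cxt);

	for (int i = 0; i < 1000000; i++)
	{
		void	   *p = palloc(512);	/* well below 8 KB: no malloc() per call */

		pfree(p);
	}

	MemoryContextSwitchTo(old);
	MemoryContextDelete(cxt);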


Regards
Takayuki Tsunakawa




RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> So we could you consider adding an option for the VACUUM command as well
> as vacuumdb?  The interactions with the current patch is that you need to
> define the behavior at the beginning of vacuum for a given heap, instead
> of reading the parameter at the time the truncation happens, and give

I'm not sure whether this is the same as the above; what I imagined is this:

* Add a new USERSET GUC vacuum_shrink_table = {on | off}, on by default.
This follows the naming style "verb_object" like log_connections and 
enable_xxx.  We may want to use enable_vacuum_shrink or something like that, 
but enable_xxx seems to be used solely for planner control.  Plus, 
vacuum-related parameters seem to start with vacuum_.

* Give priority to the reloption, because it's targeted at a particular table.  
If the reloption is not set, the GUC takes effect.

* As a consequence, when the reloption is not set, the user can change the behavior 
of the VACUUM command by SETting the GUC in the same session in advance.  If the 
reloption is set, the user can ALTER TABLE SET, run VACUUM, and then ALTER TABLE 
again to restore the table's setting.  But I don't think this use case (changing 
whether to shrink per VACUUM execution) is necessary; I'm just noting that it is 
possible.  A rough sketch of the intended priority rule follows.
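
Roughly speaking (the GUC name is only my proposal, and the reloption fields are 
hypothetical, nothing here is committed code):

extern bool vacuum_shrink_table;		/* proposed GUC, on by default */

static bool
vacuum_should_shrink(Relation rel)
{
	StdRdOptions *opts = (StdRdOptions *) rel->rd_options;

	if (opts != NULL && opts->vacuum_shrink_set)	/* reloption explicitly set? */
		return opts->vacuum_shrink_enabled;			/* per-table setting wins */

	return vacuum_shrink_table;						/* otherwise fall back to the GUC */
}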


Regards
Takayuki Tsunakawa






RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-28 Thread Tsunakawa, Takayuki
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> Robert used the phrase "attractive nuisance", which maybe sounds like a
> good thing to have to a non native speaker, but it actually isn't -- he
> was saying we should avoid a GUC at all, and I can see the reason for
> that.  I think we should have a VACUUM option and a reloption, but no
> GUC.

Uh, thanks.  I've just realized I didn't know the meaning of "nuisance," so I 
looked it up in the dictionary.  A nuisance is like a troublemaker...

Why do you think it's better for the VACUUM command to have the option?  I 
think it's a table property whose value is determined by the application 
workload, not per VACUUM execution.  Rather, I think a GUC is more useful for 
determining the behavior of the entire database and/or application.

If we want to change a given execution of VACUUM, then we can ALTER TABLE SET, 
VACUUM, and ALTER TABLE SET back.


Regards
Takayuki Tsunakawa




RE: Protect syscache from bloating with negative cache entries

2019-02-28 Thread Tsunakawa, Takayuki
From: Ideriha, Takeshi/出利葉 健
> [Size=800, iter=1,000,000]
> Master |15.763
> Patched|16.262 (+3%)
> 
> [Size=32768, iter=1,000,000]
> Master |61.3076
> Patched|62.9566 (+2%)

What's the unit, second or millisecond?
Why does the number of digits to the right of the decimal point differ?

Is the measurement correct?  I'm wondering because the difference is larger in 
the latter case.  Isn't the accounting processing almost the same in both cases?
* former: 16.262 - 15.763 = 0.499
* latter: 62.956 - 61.307 = 1.649


> At least compared to previous HashAg version, the overhead is smaller.
> It has some overhead but is increase by 2 or 3% a little bit?

I think the overhead is sufficiently small.  It may get even smaller with a 
trivial tweak.

You added the new member usedspace at the end of MemoryContextData.  The 
original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 
64 bytes.  So, the new member will be on a separate cache line.  Try putting 
usedspace before the name member.
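
For example (a sketch based on the current MemoryContextData layout; "usedspace" 
stands for the accounting field your patch adds, whatever its final name):

typedef struct MemoryContextData
{
	NodeTag		type;
	bool		isReset;
	bool		allowInCritSection;
	const MemoryContextMethods *methods;
	MemoryContext parent;
	MemoryContext firstchild;
	MemoryContext prevchild;
	MemoryContext nextchild;
	Size		usedspace;		/* accounting field moved up: it now sits within
								 * the first 64 bytes touched by palloc()/pfree() */
	const char *name;			/* the 72-byte struct grows to 80, but only the
								 * debugging-oriented tail spills onto the
								 * second cache line */
	const char *ident;
	MemoryContextCallback *reset_cbs;
} MemoryContextData;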


Regards
Takayuki Tsunakawa



RE: Timeout parameters

2019-03-13 Thread Tsunakawa, Takayuki
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr]
> >> If the user reconnects, eg "\c db", the setting is lost. The
> >> re-connection handling should probably take care of this parameter, and
> maybe others.
> > I think your opinion is reasonable, but it seems not in this thread.
> 
> HI think that your patch is responsible for the added option at least.
> 
> I agree that the underlying issue that other parameters should probably
> also be reused, which would be a bug fix, does not belong to this thread.

This doesn't seem to be a bug.  \connect just establishes a new connection; it does 
not reuse the previous settings for most connection parameters.  As the examples 
in the following manual page show, the user needs to specify the necessary 
connection parameters.

https://www.postgresql.org/docs/devel/app-psql.html

=> \c service=foo
=> \c "host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable"


But I'm afraid the description of \connect may not be clear enough about 
connection parameters, and that may cause users to expect the reuse of all 
connection parameter settings.  Anyway, I think this would be an improvement to 
psql's documentation or a new feature for psql.  What do you think?


Regards
Takayuki Tsunakawa






RE: Timeout parameters

2019-03-13 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> The first thing I notice about the socket_timeout patch is that the
> documentation is definitely wrong:

Agreed.  I suppose the description should be clearer about:

* the purpose and the situations this timeout helps with: it is not for canceling a 
long-running query or recovering from a network failure.
* the relationship with other timeout parameters: e.g., socket_timeout > 
connect_timeout, socket_timeout > statement_timeout, socket_timeout > TCP 
keepalive/user timeout


> Actually, I think that socket_timeout_v8.patch is a bad idea and
> should be rejected, not because it doesn't send a cancel to the server
> but just because it's not the right way of solving either of the
> problems that it purports to solve.  If you want a timeout for
> queries, you need that to be administered by the server, which knows
> when it starts and finishes executing a query; you cannot substitute a
> client-side judgement based on the last time we received a byte of
> data.  Maybe a client-side judgement would work if the timer were
> armed when we send a Query or Execute message and disarmed when we
> receive ReadyForQuery.  And, if you want a network problem detector,
> then you should use keepalives and perhaps the TCP_USER_TIMEOUT stuff,
> so that you don't kill perfectly good connections that just happen to
> be idle.

I think the purpose of socket_timeout is to avoid infinite or unduly long waits and 
return a response to the user where the other existing timeout parameters wouldn't 
help.  For example, the OS's process scheduling or paging/swapping problems might 
cause long waits before postgres gets control (e.g. due to Transparent HugePages 
(THP)).  Within postgres, the unfair lwlock can cause unexpectedly long waits (I 
experienced dozens of seconds per wait on ProcArrayLock, which was 
unbelievable.)  Someone may say that those should be fixed or improved instead 
of introducing this parameter, but I think it's good to have this as a stop-gap 
measure.  In other words, we can suggest setting this parameter when the user 
asks "How can I return control to the end user in any situation?"


Regards
Takayuki Tsunakawa





RE: Timeout parameters

2019-03-13 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> But that's not what it will do.  As long as the server continues to
> dribble out protocol messages from time to time, the timeout will
> never fire no matter how much time passes.  I saw a system once where
> every 8kB read took many seconds to complete; such a system could
> dribble out sequential scan results over an arbitrarily long period of
> time without ever tripping the timeout.

I understood that the example is about a SELECT that returns multiple rows.  If 
so, statement_timeout should handle it, shouldn't it?

>  If you really want to return
> control to the user in any situation, what you can do is use the libpq
> APIs in asynchronous mode which, barring certain limitations of the
> current implementation, will actually allow you to maintain control
> over the connection at all times.

Maybe.  But users often aren't in a position to modify the application to use 
libpq's asynchronous APIs.


> I think the use case for a timeout that has both false positives (i.e.
> it will fire even when there's no problem, as when the connection is
> legitimately idle) and false negatives (i.e. it will fail to trigger
> when there is a problem, as when there are periodic notices or
> notifies from the server connection) is extremely limited if not
> nonexistent, and I think the potential for users to be confused is
> really high.

My understanding is that the false positive case doesn't occur, because libpq 
doesn't wait on the socket while the client is idle and not communicating an SQL 
request/response.  As for the false negative case, resetting the timer upon receipt 
of notices or notifies is good, because they show that the database server 
is functioning.  socket_timeout is not a mechanism to precisely limit the 
duration of a query request/response.  It is a kind of stop-gap, last resort to 
ensure that control returns within a reasonable amount of time, rather than after 
minutes or hours.


Regards
Takayuki Tsunakawa






RE: Timeout parameters

2019-03-14 Thread Tsunakawa, Takayuki
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> If so, in turn the socket_timeout doesn't work as expected? I
> understand that what is proposed here is to disconnect after that
> time of waiting for *the first tuple* of a query, regardless of
> it is a long query or network failure. On of the problems raised
> here is the socket_timetout patch doesn't work that way?

No, the proposal is to set the timeout for every socket read/write like PgJDBC. 
 It is not restricted to an SQL command execution; it applies to any 
communication with the server that could block the client.


> I can't imagine that in the cases where other than applications
> cannot be rebuild for some reasons. (E.g. the source code has
> been lost or the source code no longer be built on modern
> environment..)
> 
> If an application is designed to have such a requirement, mustn't
> it have implemented that by itself by any means? For example, an
> application on JDBC can be designed to kill a querying thread
> that takes longer than threshold.

Any paid or free applications whose source code is closed.  Also, ordinary 
users don't want to modify psql and pgAdmin source code by themselves, do they?


> Doesn't TCP_KEEPALIVE work that way?
> 
> statement_timeout works on a live connection, TCP_USER_TIMEOUT
> works for an already failed network but if the network fails
> after a backend already sent the ack for a query request, it
> doesn't work. TCP_KEEPALIVE would work for the case where any
> packet doesn't reach each other for a certain time. Is there any
> other situations to save?

For example, OS issues such as abnormally (buggy) slow process scheduling or 
paging/swapping that prevent control from being passed to postgres.  Or 
abnormally long waits on lwlocks in postgres; statement_timeout doesn't take 
effect while waiting on an lwlock.  I have experienced both.  And any other 
issues in the server that we haven't experienced yet and cannot predict.


Regards
Takayuki Tsunakawa






RE: Timeout parameters

2019-03-14 Thread Tsunakawa, Takayuki
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr]
> I think that the typical use-case of \c is to connect to another database
> on the same host, at least that what I do pretty often. The natural
> expectation is that the same "other" connection parameters are used,
> otherwise it does not make much sense, and there is already a whole logic
> of reusing the previous settings in psql, at least wrt describing the
> target (host, port, user…).

I agree about the natural expectation.


> > Anyway, I think this would be an improvement for psql's documentation
> or
> > new feature for psql.  What do you think?
> 
> I think that we should fix the behavior rather than document the current
> weirdness. I do not think that we should introduce more weirdness.

Do you think all connection parameter settings should be reused by default, or 
when -reuse-previous=on is specified?

Do you think this is a bug to be backported to previous versions, or a new 
feature?  I think I'll make a patch separately from this thread, if time 
permits.


Regards
Takayuki Tsunakawa





RE: Timeout parameters

2019-03-14 Thread Tsunakawa, Takayuki
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu]
> > For example, OS issues such as abnormally (buggy) slow process scheduling
> or paging/swapping that prevent control from being passed to postgres.  Or,
> abnormally long waits on lwlocks in postgres.  statement_timeout doesn't
> take effect while waiting on a lwlock.  I have experienced both.  And, any
> other issues in the server that we haven't experienced and not predictable.
> 
> For me all mentioned by Takayuki Tsunakawa problems looks like a lack of
> design of end-user application or configuration of DB server. It is not
> a problem of PostgreSQL.

Certainly, my cases could be avoided by the OS and database configuration.  But 
we cannot say all such problems can be avoided by configuration in advance.  
The point is to provide a method to get ready for any situation.


Regards
Takayuki Tsunakawa







RE: Timeout parameters

2019-03-14 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> One other thing -- I looked a bit into the pgsql-jdbc implementation
> of a similarly-named option, and it does seem to match what you are
> proposing here.  I wonder what user experiences with that option have
> been like.

One case I faintly recall is that some SQL execution got stuck in NFS access in 
the server, and statement_timeout didn't work (probably due to inability to 
cancel NFS read/write operations).  The user chose to use PgJDBC's 
socketTimeout.  I'm not sure whether this use case is fundamental, because I 
wonder if NFS mount options could have been the solution.


Regards
Takayuki Tsunakawa






RE: Timeout parameters

2019-03-14 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> Now you might say - what if the server is stopped not because of
> SIGSTOP but because of some other reason, like it's waiting for a
> lock?  Well, in that case, the database server is still functioning,
> and you will not want the connection to be terminated if no notifies
> are sent during the lock wait but not terminated if they are sent.  At
> least, I can't see why anyone would want that.

Yes, so I think it would be helpful to describe how to set socket_timeout in 
relation to other timeout parameters -- socket_timeout > statement_timeout > 
lock_timeout, for example.


Regards
Takayuki Tsunakawa



RE: Timeout parameters

2019-03-14 Thread Tsunakawa, Takayuki
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu]
> Do you mind me asking you whether you have thought that solving your problem
> can lead to the problem in the other user applications?
> Let's imagine a possible problem:
> 1. end-user sets 'socket_timeout' only for current session
> 2. 'socket_timeout' is much shorter than 'keep_alive', 'tcp_user_timeout'
> and 'statement_timeout'
> 3. end-user invokes a 'heavy' query which is not be possible to process
> by PostgreSQL server within 'socket_timeout'
> 4. if the query fails due to timeout, terminate the session and repeat it
> 
> As the query is not cancelled, PostgreSQL server will process it till
> completion or 'statement_timeout' expiration. In such a case PostgreSQL
> server can be overloaded by an 'unreliable' user/users and other end-users
> will not be able to work with PostgreSQL server as they expected.

Yes, so I think it would be necessary to describe how to set socket_timeout 
in relation to other timeout parameters -- socket_timeout > 
statement_timeout -- emphasizing that socket_timeout is not for canceling 
long-running queries but for returning control to the client.


Regards
Takayuki Tsunakawa





RE: Timeout parameters

2019-03-15 Thread Tsunakawa, Takayuki
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu]
> In case of failure PQcancel() terminates in 'socket_timeout'. So, control
> to the end-user in such a failure situation will be returned in 2 *
> 'socket_timeout' interval. It is much better than hanging forever in some
> specific cases. Moreover, such solution will not lead to the overloading
> of PostgreSQL server by abnormally ignored 'heavy' queries results by
> end-users.

Oops, unfortunately, PQcancel() does not follow any timeout parameters...  It 
uses a blocking socket.

Also, I still don't think it's a good idea to request cancellation.  
socket_timeout should be sufficiently longer than the usually expected query 
execution duration.  Long-running queries should instead be handled by 
statement_timeout, which indicates the maximum tolerable query execution 
duration.
For example, if the usually expected query execution time is 100 ms, 
statement_timeout can be set to 3 seconds and socket_timeout to 5 seconds.
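
To illustrate the intended relationship (socket_timeout here is of course the 
parameter proposed in this thread, not an existing libpq option):

	/* statement_timeout (server side) fires first; socket_timeout is only a
	 * last-resort client-side cutoff set well above it. */
	const char *conninfo =
		"host=db1 dbname=app connect_timeout=10 "
		"options='-c statement_timeout=3000' "	/* 3 seconds */
		"socket_timeout=5";						/* 5 seconds (proposed) */
	PGconn	   *conn = PQconnectdb(conninfo);

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));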


Regards
Takayuki Tsunakawa







RE: Timeout parameters

2019-03-17 Thread Tsunakawa, Takayuki
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu]
> Based on your comment it seems to me that 'socket_timeout' should be
> connected with statement_timeout. I mean that end-user should wait
> statement_timeout + 'socket_timeout' for returning control. It looks much
> more safer for me.

I'm afraid we cannot enforce that relationship programmatically, so I think the 
documentation should warn that socket_timeout should be longer than 
statement_timeout.


Regards
Takayuki Tsunakawa







RE: Libpq support to connect to standby server as priority

2019-03-18 Thread Tsunakawa, Takayuki
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> Target_session_attrs   Target_server_type
> 
> read-write              prefer-slave, slave
> prefer-read             master, slave
> read-only               master, prefer-slave
> 
> I know that some of the cases above is possible, like master server with
> by default accepts
> read-only sessions. Instead of we put a check to validate what is right
> combination, how
> about allowing the combinations and in case if such combination is not
> possible, means
> there shouldn't be any server the supports the requirement, and connection
> fails.
> 
> comments?

I think that's OK.

To follow the existing naming, it seems better to use "primary" and "standby" 
instead of master and slave -- primary_conninfo, synchronous_standby_names, 
hot_standby, max_standby_streaming_delay and such.


> And also as we need to support the new option to connect to servers < 12
> also, this option
> sends the command "select pg_is_in_recovery()" to the server to find out
> whether the server
> is recovery mode or not?

The query looks good.  OTOH, I think we can return an error when 
target_server_type is specified against older servers, because the parameter is 
new -- that is, if supporting older servers would make the libpq code uglier.


> And also regarding the implementation point of view, the new
> target_server_type option
> validation is separately handled, means the check for the required server
> is handled in a separate
> switch case, when both options are given, first find out the required server
> for target_session_attrs
> and validate the same again with target_server_type?

Logically, it seems the order should be the reverse: check the server type first, 
then the session attributes, with an eye to other session attributes that may be 
added in the future.


Regards
Takayuki Tsunakawa



RE: Timeout parameters

2019-03-18 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think so.  I think it's just a weirdly-design parameter
> without a really compelling use case.  Enforcing limits on the value
> of the parameter doesn't fix that.  Most of the reviewers who have
> opined so far have been somewhere between cautious and negative about
> the value of that parameter, so I think we should just not add it.  At
> least for now.

I don't think socket_timeout is so bad.  I think Nagaura-san and I presented 
the use case, giving an answer to every question and concern.  OTOH, it may be 
better to commit the tcp_user_timeout patch once Nagaura-san has refined the 
documentation, and then continue with socket_timeout.


Regards
Takayuki Tsunakawa





RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Tsunakawa, Takayuki
Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> Your changes in LOCALLOCK still refer to PGPROC, from your first version
> of the patch.
> 
> I think the reordering of struct members could be done as a separate
> preliminary patch.
> 
> Some more documentation in the comment before dlist_head LocalLocks to
> explain this whole mechanism would be nice.

Fixed.


> You posted a link to some performance numbers, but I didn't see the test
> setup explained there.  I'd like to get some more information on this
> impact of this.  Is there an effect with 100 tables, or do you need 10?

Imai-san, can you tell us the test setup?


Regards
Takayuki Tsunakawa



0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch
Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch


0002-speed-up-LOCALLOCK-scan.patch
Description: 0002-speed-up-LOCALLOCK-scan.patch


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Tsunakawa, Takayuki
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> Fixed.

Rebased on HEAD.


Regards
Takayuki Tsunakawa



0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch
Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch


0002-speed-up-LOCALLOCK-scan.patch
Description: 0002-speed-up-LOCALLOCK-scan.patch


RE: [survey] New "Stable" QueryId based on normalized query text

2019-03-19 Thread Tsunakawa, Takayuki
From: legrand legrand [mailto:legrand_legr...@hotmail.com]
> There are many projects that use alternate QueryId
> distinct from the famous pg_stat_statements jumbling algorithm.

I'd welcome a standard QueryID that DBAs and extension developers can depend on.  
Are you surveying the requirements so that you can develop a QueryID that meets as 
many needs as possible?


> needs.1: stable accross different databases,

Does this mean different database clusters, not different databases in a single 
database cluster?


A few additional needs I'd like to propose:

needs.5: minimal overhead to calculate
needs.6: doesn't change across database server restarts
needs.7: the same value on both the primary and standby?


> norm.9: comments aware

Is this to distinguish queries that have different comments for optimizer 
hints?  If yes, I agree.


Regards
Takayuki Tsunakawa





RE: [survey] New "Stable" QueryId based on normalized query text

2019-03-19 Thread Tsunakawa, Takayuki
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > > needs.1: stable accross different databases,
> >
> > Does this mean different database clusters, not different databases in
> a single database cluster?
> 
> Does this mean you want different QueryID for the same-looking
> query for another database in the same cluster?

(I'm afraid this question may be targeted at legrand legrand, not me...)
I think the same query text can have either the same or a different QueryID in 
different databases in the database cluster.  Even if the QueryID value is the 
same, we can use the DatabaseID to pick out the desired information.


Regards
Takayuki Tsunakawa






RE: Libpq support to connect to standby server as priority

2019-03-19 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I really dislike having both target_sesion_attrs and
> target_server_type.  It doesn't solve any actual problem.  master,
> slave, prefer-save, or whatever you like could be put in
> target_session_attrs just as easily, and then we wouldn't end up with
> two keywords doing closely related things.  'master' is no more or
> less a server attribute than 'read-write'.

Hmm, that may be OK.  At first, I felt it strange to treat the server type 
(primary or standby) as a session attribute.  But we can see the server type as one 
attribute of the session being established.  I'm inclined to agree with:

target_session_attrs = {any | read-write | read-only | prefer-read | primary | 
standby | prefer-standby}


Regards
Takayuki Tsunakawa






RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-26 Thread Tsunakawa, Takayuki
From: David Steele [mailto:da...@pgmasters.net]
> This patch appears to have been stalled for a while.
> 
> Takayuki -- the ball appears to be in your court.  Perhaps it would be
> helpful to summarize what you think are next steps?

disable_index_cleanup is handled by Sawada-san in another thread.  I understand 
I've reflected all review comments in the latest patch and replied to the 
opinions/proposals, so the patch status remains "needs review."  (I hope no new 
fires will break out...)


Regards
Takayuki Tsunakawa




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
>  wrote:
> > Perhaps "speeding up planning with partitions" needs to be accepted first?
> 
> Yeah, I think it likely will require that patch to be able to measure
> the gains from this patch.
> 
> If planning a SELECT to a partitioned table with a large number of
> partitions using PREPAREd statements, when we attempt the generic plan
> on the 6th execution, it does cause the local lock table to expand to
> fit all the locks for each partition. This does cause the
> LockReleaseAll() to become slow due to the hash_seq_search having to
> skip over many empty buckets.   Since generating a custom plan for a
> partitioned table with many partitions is still slow in master, then I
> very much imagine you'll struggle to see the gains brought by this
> patch.

Thank you, David, for explaining.  Although I may not fully understand the effect 
of the "speeding up planning with partitions" patch, this patch takes effect even 
without it.  That is, perform the following in the same session:

1. SELECT count(*) FROM table; on a table with many partitions.  That bloats 
the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate 
transaction.  Without the patch, each transaction's LockReleaseAll() has to 
scan the bloated large hash table.


Regards
Takayuki Tsunakawa




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread Tsunakawa, Takayuki
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> My understanding of what David wrote is that the slowness of bloated hash
> table is hard to notice, because planning itself is pretty slow.  With the
> "speeding up planning with partitions" patch, planning becomes quite fast,
> so the bloated hash table overhead and so your patch's benefit is easier
> to notice.  This patch is clearly helpful, but it's just hard to notice
> it
> when the other big bottleneck is standing in the way.

Ah, I see.  I failed to recognize the simple fact that without your patch, 
EXECUTE on a table with many partitions is slow due to the custom planning time 
proportional to the number of partitions.  Thanks for waking up my sleeping 
head!


Regards
Takayuki Tsunakawa



RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-26 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Here a benchmark doing that using pgbench's script weight feature.

Wow, I didn't know that pgbench has evolved to have such a convenient feature.  
Thanks for telling me how to utilize it in testing.  PostgreSQL is cool!


Regards
Takayuki Tsunakawa




RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-26 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> On Wed, Mar 27, 2019 at 2:30 AM Robert Haas  wrote:
> >
> > On Tue, Mar 26, 2019 at 11:23 AM Masahiko Sawada 
> wrote:
> > > > I don't see a patch with the naming updated, here or there, and I'm
> > > > going to be really unhappy if we end up with inconsistent naming
> > > > between two patches that do such fundamentally similar things.  -1
> > > > from me to committing either one until that inconsistency is resolved.
> > >
> > > Agreed. I've just submitted the latest version patch that adds
> > > INDEX_CLEANUP option and vacuum_index_cleanup reloption. I already
> > > mentioned on that thread but I agreed with adding phrase positively
> > > than negatively. So if we got consensus on such naming the new options
> > > added by this patch could be something like SHRINK option (with
> > > true/false) and vacuum_shrink reloption.
> >
> > No, that's just perpetuating the problem.  Then you have an option
> > SHRINK here that you set to TRUE to skip something, and an option
> > INDEX_CLEANUP over there that you set to FALSE to skip something.
> >
> 
> Well, I imagined that both INDEX_CLEANUP option and SHRINK option (or
> perhaps TRUNCATE option) should be true by default. If we want to skip
> some operation of vacuum we can set each options to false like "VACUUM
> (INDEX_CLEANUP false, SHRINK true, VERBOSE true)". I think that
> resolves the problem but am I missing something?

I almost have the same view as Sawada-san.  The reloption vacuum_shrink_enabled 
is a positive name and follows the naming style of other reloptions.  I hope 
this matches the style you have in mind.


Regards
Takayuki Tsunakawa




RE: Libpq support to connect to standby server as priority

2019-03-26 Thread Tsunakawa, Takayuki
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> while going through the old patch where the GUC_REPORT is implemented, Tom
> has commented the logic of sending the signal to all backends to process
> the hot standby exit with SIGHUP, if we add the logic of updating the GUC
> variable value in SIGHUP, we may need to change all the SIGHUP handler code
> paths. It is also possible that there is no need to update the variable
> for other processes except backends.
> 
> If we go with adding the new SIGUSR1 type to check and update the GUC varaible
> can work for most of the backends and background workers.
> 
> opinions

SIGUSR1 looks reasonable.  We can consider it as notifying that the server 
status has changed.

I've fully reviewed 0001-0003 and my comments follow.  I'll review 0004-0007.


(1) 0001
before issuing the transaction_readonly to find out whether
the server is read-write or not is restuctured under a new


transaction_readonly -> "SHOW transaction_read_only"
restuctured -> restructured


(2) 0001
+succesful connection or failure.
+successful connection; if it returns on, means 
server

succesful -> successful
means -> it means


(3) 0003
+But for servers version 12 or greater uses the 
transaction_read_only
+GUC that is reported by the server upon successful connection.

GUC doesn't seem to be a term to be used in the user manual.  Instead:

uses the value of transaction_read_only configuration 
parameter that is...

as in:

https://www.postgresql.org/docs/devel/libpq-connect.html

client_encoding
This sets the client_encoding configuration parameter for this connection.

application_name
Specifies a value for the application_name configuration parameter.


(4) 0003
    bool        std_strings;            /* standard_conforming_strings */
+   bool        transaction_read_only;  /* session_read_only */

Looking at the comment for std_strings, it would be better to change the comment 
for transaction_read_only to name the backing configuration parameter.


Regards
Takayuki Tsunakawa





RE: Libpq support to connect to standby server as priority

2019-03-26 Thread Tsunakawa, Takayuki
I've looked through 0004-0007.  I've only found the following:

(5) 0005
With this read-only option type, application can connect to
connecting to a read-only server in the list of hosts, in case
if there is any read-only servers available, the connection
attempt fails.

"connecting to" can be removed.

in case if there is any read-only servers
-> If there's no read-only server


Regards
Takayuki Tsunakawa



RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-27 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> You're both right and I'm wrong.
> 
> However, I think it would be better to stick with the term 'truncate'
> which is widely-used already, rather than introducing a new term.

Yeah, I have the same feeling.  OTOH, as I referred in this thread, shrink is 
used instead of truncate in the PostgreSQL documentation.  So, I chose shrink.  
To repeat myself, I'm comfortable with either word.  I'd like the committer to 
choose what he thinks better.


Regards
Takayuki Tsunakawa






RE: Timeout parameters

2019-03-27 Thread Tsunakawa, Takayuki
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> +if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
> + (char *) &timeout, sizeof(timeout)) < 0 && errno !=
> ENOPROTOOPT)
> +{
> +charsebuf[256];
> +
> +appendPQExpBuffer(&conn->errorMessage,
> +libpq_gettext("setsockopt(TCP_USER_TIMEOUT)
> failed: %s\n"),
> 
> I suppose that the reason ENOPROTOOPT is excluded from error
> condition is that the system call may fail with that errno on
> older kernels, but I don't think that that justifies ignoring the
> failure.

I think that's for the case where the module is built on an OS that supports 
TCP_USER_TIMEOUT (#ifdef TCP_USER_TIMEOUT is true) and the module is then used on 
an older OS that doesn't support TCP_USER_TIMEOUT.  I remember I was sometimes 
able to do such a thing on Linux and Solaris.  If we don't have to handle such 
usage, I agree about removing the special handling of ENOPROTOOPT.


Regards
Takayuki Tsunakawa








RE: Timeout parameters

2019-03-27 Thread Tsunakawa, Takayuki
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > +if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
> > + (char *) &timeout, sizeof(timeout)) < 0 && errno !=
> > ENOPROTOOPT)
> > +{
> > +charsebuf[256];
> > +
> > +appendPQExpBuffer(&conn->errorMessage,
> > +libpq_gettext("setsockopt(TCP_USER_TIMEOUT)
> > failed: %s\n"),
> >
> > I suppose that the reason ENOPROTOOPT is excluded from error
> > condition is that the system call may fail with that errno on
> > older kernels, but I don't think that that justifies ignoring the
> > failure.
> 
> I think that's for the case where the modules is built on an OS that supports
> TCP_USER_TIMEOUT (#ifdef TCP_USER_TIMEOUT is true), and the module is used
> on an older OS that doesn't support TCP_USER_TIMEOUT.  I remember I was
> sometimes able to do such a thing on Linux and Solaris.  If we don't have
> to handle such usage, I agree about removing the special handling of
> ENOTPROTO.

Oops, I agree that we return an error even in the ENOPROTOOPT case, because 
setsockopt() is called only when the user specifies tcp_user_timeout.


Regards
Takayuki Tsunakawa






RE: Timeout parameters

2019-03-28 Thread Tsunakawa, Takayuki
Nagaura-san,

The client-side tcp_user_timeout patch looks good.

The server-side tcp_user_timeout patch needs the following fixes:


(1)
+   GUC_UNIT_MS | GUC_NOT_IN_SAMPLE
+   12000, 0, INT_MAX,

GUC_NOT_IN_SAMPLE should be removed because the parameter appears in 
postgresql.conf.sample.
The default value should be 0, not 12000.


(2)
Now that you've changed show_tcp_usertimeout() to use pq_gettcpusertimeout(), 
you need to modify pq_settcpusertimeout(); just imitate 
pq_setkeepalivesidle().  Otherwise, SHOW would display a wrong value.
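
To illustrate what I mean, here is a minimal sketch modeled on
pq_setkeepalivesidle(); the port->tcp_user_timeout field and the error handling
are my assumptions for illustration, not the actual patch:

int
pq_settcpusertimeout(int timeout, Port *port)
{
    if (port == NULL || IS_AF_UNIX(port->laddr.addr.ss_family))
        return STATUS_OK;

#ifdef TCP_USER_TIMEOUT
    if (timeout == port->tcp_user_timeout)
        return STATUS_OK;

    if (setsockopt(port->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                   (char *) &timeout, sizeof(timeout)) < 0)
    {
        elog(LOG, "setsockopt(TCP_USER_TIMEOUT) failed: %m");
        return STATUS_ERROR;
    }

    /* Remember the value so that SHOW tcp_user_timeout reports it correctly. */
    port->tcp_user_timeout = timeout;
#else
    if (timeout != 0)
    {
        elog(LOG, "setsockopt(TCP_USER_TIMEOUT) not supported on this platform");
        return STATUS_ERROR;
    }
#endif

    return STATUS_OK;
}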


Regards
Takayuki Tsunakawa








RE: Timeout parameters

2019-03-28 Thread Tsunakawa, Takayuki
Nagaura-san,

The socket_timeout patch needs the following fixes.  Now that others have 
already tested these patches successfully, they appear committable to me.


(1)
+   else
+   goto iiv_error;
...
+
+iiv_error:
+   conn->status = CONNECTION_BAD;
+   printfPQExpBuffer(&conn->errorMessage,
+ libpq_gettext("invalid integer value 
for socket_timeout\n"));
+   return false;

This goto and its corresponding iiv_error label are redundant.  You can just 
set the error message and return at the call site of parse_int_param().  i.e.:

if (!parse_int_param(...))
{
    /* error processing */
    return false;
}
if (conn->socket_timeout > 0 && conn->socket_timeout < 2)
    conn->socket_timeout = 2;

The reason the oom_error label is present is that it is used in multiple places 
to avoid repeating the same error processing code.


(2)
+   conn->sock = -1;

Use PGINVALID_SOCKET instead of -1.


Regards
Takayuki Tsunakawa






RE: Libpq support to connect to standby server as priority

2019-03-29 Thread Tsunakawa, Takayuki
Hi Hari-san,

I've reviewed all the files.  The patch would be OK once the following have 
been fixed, except for the complexity of fe-connect.c (which probably cannot be 
improved).

Unfortunately, I'll be absent next week.  The earliest date I can do the test 
will be April 8 or 9.  I hope someone can take care of this patch...


(1) 0001
With this read-only option type, application can connect to
to a read-only server in the list of hosts, in case
...
before issuing the SHOW transaction_readonly to find out whether


"to" appears twice in a row.
transaction_readonly -> transaction_read_only


(2) 0001
+succesful connection or failure.

succesful -> successful


(3) 0008
to conenct to a standby server with a faster check instead of

conenct -> connect


(4) 0008
Logically, recovery exit should be notified after the following statement:

XLogCtl->SharedRecoveryInProgress = false;


(5) 0008
+   /* Update in_recovery status. */
+   if (LocalRecoveryInProgress)
+   SetConfigOption("in_recovery",
+   "on",
+   PGC_INTERNAL, PGC_S_OVERRIDE);
+

This SetConfigOption() is called for every RecoveryInProgress() call on the 
standby.  It may cause undesirable overhead.  How about just calling 
SetConfigOption() once in InitPostgres() to set the initial value for 
in_recovery?  InitPostgres() and its subsidiary functions call 
SetConfigOption() likewise.
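
Something along these lines, called once from InitPostgres(), should be enough
(just a sketch; "in_recovery" is the parameter introduced by this patch set,
and the helper name is illustrative):

/*
 * Sketch: set the initial value once at backend startup instead of on
 * every RecoveryInProgress() call.
 */
static void
InitInRecoveryGUC(void)
{
    SetConfigOption("in_recovery",
                    RecoveryInProgress() ? "on" : "off",
                    PGC_INTERNAL, PGC_S_OVERRIDE);
}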


(6) 0008
async.c is for LISTEN/UNLISTEN/NOTIFY.  How about adding the new functions in 
postgres.c like RecoveryConflictInterrupt()?


(7) 0008
+   if (pid != 0)
+   {
+   (void) SendProcSignal(pid, reason, procvxid.backendId);
+   }

The braces are not necessary because the block only contains a single statement.


Regards
Takayuki Tsunakawa



RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-22 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Another counter-argument to this is that there's already an
> unexplainable slowdown after you run a query which obtains a large
> number of locks in a session or use prepared statements and a
> partitioned table with the default plan_cache_mode setting. Are we not
> just righting a wrong here? Albeit, possibly 1000 queries later.
> 
> I am, of course, open to other ideas which solve the problem that v5
> has, but failing that, I don't see v6 as all that bad.  At least all
> the logic is contained in one function.  I know Tom wanted to steer
> clear of heuristics to reinitialise the table, but most of the stuff
> that was in the patch back when he complained was trying to track the
> average number of locks over the previous N transactions, and his
> concerns were voiced before I showed the 7% performance regression
> with unconditionally rebuilding the table.

I think I understand what you mean.  Sorry, I don't have a better idea.  This 
unexplainability is probably what we should accept to avoid the shocking 7% 
slowdown.

OTOH, how about my original patch that is based on the local lock list?  I 
expect that it wouldn't cause such a significant slowdown in the same test 
case.  If it's not satisfactory, then v6 is the best to commit.


Regards
Takayuki Tsunakawa




RE: [Patch] PQconnectPoll() is blocked if target_session_attrs is read-write

2019-07-24 Thread Tsunakawa, Takayuki
From: Matsumura, Ryo [mailto:matsumura@jp.fujitsu.com]
> Detail:
> If target_session_attrs is set to read-write, PQconnectPoll() calls
> PQsendQuery("SHOW transaction_read_only") althogh previous return value
> was PGRES_POLLING_READING not WRITING.

The current code probably assumes that the PQsendQuery() that sends "SHOW 
transaction_read_only" shouldn't block, because the message is small enough to 
fit in the socket send buffer.  Likewise, the code in the 
CONNECTION_AWAITING_RESPONSE case sends authentication data using 
pg_fe_sendauth() without checking for write-ready status.  OTOH, the code in 
the CONNECTION_MADE case waits for write-ready status in advance before sending 
the startup packet.  That's because the startup packet could get large enough 
to cause pqPacketSend() to block.

So, I don't think the fix is necessary.

> In result, PQsendQuery() may be blocked in pqsecure_raw_write().

FWIW, if PQsendQuery() blocked during connection establishment, I think it 
would block in poll() called from pqWait(), because libpq's socket is set 
non-blocking.


Regards
Takayuki Tsunakawa





RE: Why overhead of SPI is so large?

2019-08-21 Thread Tsunakawa, Takayuki
From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> PL/pgSQL:   29044.361 ms
> C/SPI:  22785.597 ms
> 
> The fact that difference between PL/pgSQL and function implemented in C
> using SPI is not so large  was expected by me.

This PL/pgSQL overhead is not so significant compared with the threefold 
difference, but it makes me want a feature like Oracle's ALTER PROCEDURE ... 
COMPILE, which compiles the PL/SQL logic to native code.  I've seen a few dozen 
percent speedup from it.


Regards
Takayuki Tsunakawa



RE: SIGQUIT on archiver child processes maybe not such a hot idea?

2019-09-01 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> After investigation, the mechanism that's causing that is that the
> src/test/recovery/t/010_logical_decoding_timelines.pl test shuts
> down its replica server with a mode-immediate stop, which causes
> that postmaster to shut down all its children with SIGQUIT, and
> in particular that signal propagates to a "cp" command that the
> archiver process is executing.  The "cp" is unsurprisingly running
> with default SIGQUIT handling, which per the signal man page
> includes dumping core.

We experienced this (a core dump in the data directory from an archive command) 
years ago.  Related to this, the example of using cp in the PostgreSQL manual 
is misleading, because cp doesn't reliably persist the WAL archive file.


> This makes me wonder whether we shouldn't be using some other signal
> to shut down archiver subprocesses.  It's not real cool if we're
> spewing cores all over the place.  Admittedly, production servers
> are likely running with "ulimit -c 0" on most modern platforms,
> so this might not be a huge problem in the field; but accumulation
> of core files could be a problem anywhere that's configured to allow
> server core dumps.

We enable core dumps in production to help with investigation, just in case.


> Ideally, perhaps, we'd be using SIGINT not SIGQUIT to shut down
> non-Postgres child processes.  But redesigning the system's signal
> handling to make that possible seems like a bit of a mess.
> 
> Thoughts?

We're using a shell script and a command that's called from the shell script.  
That is:

archive_command = 'call some_shell_script.sh ...'

[some_shell_script.sh]
ulimit -c 0
trap 'exit 1' QUIT      # just exit on receipt of SIGQUIT
some_command ...        # copies the file

some_command also catches SIGQUIT and just exits.  It copies and syncs the file.

I proposed something along this line, as below, but I couldn't respond to 
Peter's review comments due to other tasks.  Does anyone think it's worth 
resuming this?

https://www.postgresql.org/message-id/7E37040CF3804EA5B018D7A022822984@maumau


Regards
Takayuki Tsunakawa







RE: SIGQUIT on archiver child processes maybe not such a hot idea?

2019-09-02 Thread Tsunakawa, Takayuki
From: Kyotaro Horiguchi [mailto:horikyota@gmail.com]
> Since we are allowing OPs to use arbitrary command as
> archive_command, providing a replacement with non-standard signal
> handling for a specific command doesn't seem a general solution
> to me. Couldn't we have pg_system(a tentative name), which
> intercepts SIGQUIT then sends SIGINT to children? Might be need
> to resend SIGQUIT after some interval, though..

The same idea that you referred to as pg_system occurred to me, too.  But I 
wondered whether the archiver process can get the pid of its child (the shell? 
archive_command?) while keeping the capabilities of system() (= the shell).  
Even if we fork() and then system(), doesn't the OS send SIGQUIT to any 
descendants of the archiver when postmaster sends SIGQUIT to the child process 
group?
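
For what it's worth, below is a rough, purely illustrative sketch of the
pg_system() idea as I understand it: fork a child that runs the command via the
shell, so the wrapper knows the pid and can translate SIGQUIT into SIGINT for
the command.  It does not answer the process-group question above, and none of
the names come from actual PostgreSQL code.

#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile pid_t archive_child_pid = 0;

/* SIGQUIT handler in the wrapper: ask the command to stop with SIGINT */
static void
forward_quit(int signo)
{
    if (archive_child_pid > 0)
        kill(archive_child_pid, SIGINT);
}

/* system()-like call that keeps the child's pid and intercepts SIGQUIT */
static int
pg_system(const char *command)
{
    int     status;
    pid_t   pid;

    signal(SIGQUIT, forward_quit);

    pid = fork();
    if (pid == 0)
    {
        /* child: restore default handling and run the command via the shell */
        signal(SIGQUIT, SIG_DFL);
        execl("/bin/sh", "sh", "-c", command, (char *) NULL);
        _exit(127);
    }
    if (pid < 0)
        return -1;

    archive_child_pid = pid;
    while (waitpid(pid, &status, 0) < 0)
    {
        if (errno != EINTR)
            return -1;
    }
    archive_child_pid = 0;
    return status;
}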


> # Is there any means to view the whole of a thread from archive?
> # I'm a kind of reluctant to wander among messages like a rat in
> # a maze:p

You can see the whole thread by clicking the "Whole Thread" link.


Regards
Takayuki Tsunakawa





RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-09-02 Thread Tsunakawa, Takayuki
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> Hmm ... is this patch rejected, or is somebody still trying to get it to
> committable state?  David, you're listed as committer.

I don't think it's rejected.  It would be a pity (mottainai) to refuse this, 
because it provides a significant speedup with a simple modification.

Again, I think the v2 patch is OK.  Tom's comment was as follows:


[Tom's comment against v2]

FWIW, I tried this patch against current HEAD (959d00e9d).
Using the test case described by Amit at

I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case.
(I had to increase max_locks_per_transaction to get it to run
at all.)  We should not be using that sort of edge case to drive
performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win.  I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD.  Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on
sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
bytes, a 22% increase.  That's a lot if you're considering cases
with many locks.

On the whole I don't think there's an adequate case for committing
this patch.

I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.



* Extreme test case: 
Not extreme.  Two of our customers, who are considering using PostgreSQL, are 
using thousands of partitions now.  We hit this issue -- a point query gets 
nearly 20% slower after a generic plan is automatically created.  That's the 
reason for this proposal.

* 0.5% slowdown with pgbench:
I think it's below the noise, as Tom said.

* sizeof(LOCALLOCK):
As Andres replied to Tom in the immediately following mail, LOCALLOCK was 
bigger in PG 11.

* Use case is narrow:
No.  The bloated LockMethodLocalHash affects the performance of the items below 
as well as transaction commit/abort:
  - AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice 
in PREPARE!
  - LockReleaseSession: advisory lock
  - LockReleaseCurrentOwner: ??
  - LockReassignCurrentOwner: ??
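
All of them share (in simplified form) the pattern below, so their cost is
proportional to the number of buckets in the bloated table rather than to the
number of locks actually held:

/*
 * Simplified form of the loop used by the functions above (e.g.
 * LockReleaseAll): hash_seq_search() visits every bucket of
 * LockMethodLocalHash, even when only a handful of locks are held.
 */
static void
walk_local_locks(void)
{
    HASH_SEQ_STATUS status;
    LOCALLOCK  *locallock;

    hash_seq_init(&status, LockMethodLocalHash);
    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
    {
        /* ... examine or release locallock ... */
    }
}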


Regards
Takayuki Tsunakawa



RE: Libpq support to connect to standby server as priority

2018-01-03 Thread Tsunakawa, Takayuki
From: Jing Wang [mailto:jingwang...@gmail.com]
> This is a proposal that let libpq support 'prefer-read' option in
> target_session_attrs in pg_conn. The 'prefer-read' means the libpq will
> try to connect to a 'read-only' server firstly from the multiple server
> addresses. If failed to connect to the 'read-only' server then it will try
> to connect to the 'read-write' server.

There's a pending patch I started.  I'd be happy if you could continue this.

https://commitfest.postgresql.org/15/1148/

Regards
Takayuki Tsunakawa



RE: [HACKERS] Statement-level rollback

2018-01-09 Thread Tsunakawa, Takayuki
From: Simon Riggs [mailto:si...@2ndquadrant.com]
> When will the next version be posted?

I'm very sorry I haven't submitted anything.  I'd like to address this during 
this CF.  Thanks for remembering this.


Regards
Takayuki Tsunakawa




RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-23 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> Oh, incidentally -- in our internal testing, we found that
> wal_sync_method=open_datasync was significantly faster than
> wal_sync_method=fdatasync.  You might find that open_datasync isn't much
> different from pmem_drain, even though they're both faster than fdatasync.

That's interesting.  How fast was open_datasync, and in what environment (Linux 
distro/kernel version, HDD or SSD, etc.)?

Is it now time to change the default setting to open_datasync on Linux, at 
least when O_DIRECT is not used (i.e. WAL archiving or streaming replication is 
used)?

[Current port/linux.h]
/*
 * Set the default wal_sync_method to fdatasync.  With recent Linux versions,
 * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
 * perform better and (b) causes outright failures on ext4 data=journal
 * filesystems, because those don't support O_DIRECT.
 */
#define PLATFORM_DEFAULT_SYNC_METHODSYNC_METHOD_FDATASYNC


pg_test_fsync showed open_datasync is slower on my RHEL6 VM:

--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  4276.373 ops/sec 234 usecs/op
fdatasync  4895.256 ops/sec 204 usecs/op
fsync  4797.094 ops/sec 208 usecs/op
fsync_writethrough  n/a
open_sync  4575.661 ops/sec 219 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  2243.680 ops/sec 446 usecs/op
fdatasync  4347.466 ops/sec 230 usecs/op
fsync  4337.312 ops/sec 231 usecs/op
fsync_writethrough  n/a
open_sync  2329.700 ops/sec 429 usecs/op
--

Regards
Takayuki Tsunakawa



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-24 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I think open_datasync will be worse on systems where fsync() is expensive
> -- it forces the data out to disk immediately, even if the data doesn't
> need to be flushed immediately.  That's bad, because we wait immediately
> when we could have deferred the wait until later and maybe gotten the WAL
> writer to do the work in the background.  But it might be better on systems
> where fsync() is basically free, because there you might as well just get
> it out of the way immediately and not leave something left to be done later.
> 
> This is just a guess, of course.  You didn't mention what the underlying
> storage for your test was?

Uh, your guess was correct.  My file system was ext3, where fsync() writes all 
dirty buffers in the page cache.

As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on an LVM 
volume with ext4 (mounted with options noatime, nobarrier) on PCIe flash 
memory.

5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 50829.597 ops/sec  20 usecs/op
fdatasync 42094.381 ops/sec  24 usecs/op
fsync 42209.972 ops/sec  24 usecs/op
fsync_writethrough n/a
open_sync 48669.605 ops/sec  21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 26366.373 ops/sec  38 usecs/op
fdatasync 33922.725 ops/sec  29 usecs/op
fsync 32990.209 ops/sec  30 usecs/op
fsync_writethrough n/a
open_sync 24326.249 ops/sec  41 usecs/op

What do you think about changing the default value of wal_sync_method on Linux 
in PG 11?  I can understand the concern that users might hit performance 
degradation if they are using PostgreSQL on older systems.  But it's also 
mottainai that many users don't notice the benefits of wal_sync_method = 
open_datasync on new systems.

Regards
Takayuki Tsunakawa




Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-24 Thread Tsunakawa, Takayuki
Hello,

I've found a problem that an orphaned temporary table could cause XID 
wraparound.  Our customer encountered this problem with PG 9.5.2, but I think 
this will happen with the latest PG.

I'm willing to fix this, but I'd like to ask you what approach we should take.


PROBLEM


The customer has a database for application data, which I call user_db here.  
They don't store application data in the postgres database.

No tables in user_db were autovacuumed for more than a month, leading to user 
table bloat.  Those tables are eligible for autovacuum according to 
pg_stat_all_tables and the autovacuum settings.

age(datfrozenxid) of the user_db and postgres databases is greater than 
autovacuum_freeze_max_age, so they are eligible for autovacuuming for XID 
wraparound.

There are user tables in user_db whose age(relfrozenxid) is greater than 
autovacuum_freeze_max_age, so those tables should get autovacuum treatment.


CAUSE


The postgres database has a table named pg_temp_3.fetchchunks, whose 
age(relfrozenxid) is greater than autovacuum_freeze_max_age.  This temporary 
table is the culprit.  pg_temp_3.fetchchunks is created by pg_rewind, and the 
customer says they ran pg_rewind.

The autovacuum launcher always chooses postgres, because do_start_worker() 
scans pg_database and finds that the postgres database needs vacuuming for XID 
wraparound.  user_db is never chosen for vacuuming, although it also needs 
vacuuming for XID wraparound.

The autovacuum worker doesn't delete pg_temp_3.fetchchunks, because backend ID 
3 is used by some application, so the autovacuum worker thinks that the backend 
is active and the temporary table cannot be dropped.

I don't know why pg_temp_3.fetchchunks still exists.  Maybe the user ran pg_ctl 
stop -mi while pg_rewind was running.


FIX


I have the following questions.  Along which line should I proceed to fix the 
problem?

* Why does the autovacuum launcher always choose only one database when that 
database needs vacuuming for XID wraparound?  Shouldn't it also choose other 
databases?

* I think temporary tables should not require vacuuming for XID wraparound.  
Furthermore, should updates/deletes to temporary tables be done in place 
instead of creating garbage, so that any form of vacuum is unnecessary?  Other 
sessions do not need to read temporary tables.

* In this incident, autovacuum worker misjudged that pg_temp_3.fetchchunks 
can't be deleted, although the creator (pg_rewind) is no longer active.  How 
can we delete orphan temporary tables safely?


Regards
Takayuki Tsunakawa






RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-25 Thread Tsunakawa, Takayuki
> From: Michael Paquier [mailto:michael.paqu...@gmail.com]
> As a superuser, DROP TABLE should work on the temporary schema of another
> session. Have you tried that to solve the situation?

Yes, we asked the customer to do that today.  I think the customer will do in 
the near future.

> > * In this incident, autovacuum worker misjudged that
> > pg_temp_3.fetchchunks can't be deleted, although the creator
> > (pg_rewind) is no longer active.  How can we delete orphan temporary
> > tables safely?
> 
> As long as Postgres sees that its temporary schema is in use, it would think
> that the table is not orphaned. Another thing possible would be to have
> the session now holding this schema space to reuse fetchchunks so as things
> are reset.

I understood you to suggest that a new session which recycles the temp schema 
should erase the zombie metadata of old temp tables, or recreate the temp 
schema.  That sounds easy.

Regards
Takayuki Tsunakawa





RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-25 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:michael.paqu...@gmail.com]
> On Thu, Jan 25, 2018 at 08:10:00AM +0000, Tsunakawa, Takayuki wrote:
> > I understood you suggested a new session which recycle the temp schema
> > should erase the zombie metadata of old temp tables or recreate the
> > temp schema.  That sounds easy.
> 
> If the new session makes use of the same temporary schema where the orphan
> table is, cleanup is possible. Now you have a problem if this is not available
> as this depends on the backend ID uniquely assigned.

Ouch, you're right.  If the new session "uses the temp schema," it has a chance 
to clean the old temp table metadata.  However, it won't help if the session 
doesn't try to use the temp schema by creating a temp table...


> It would be better
> to just drop the table manually at the end.

That would solve this particular incident, yes.  But this is a bug, so we need 
to fix it somehow.

Regards
Takayuki Tsunakawa






RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
>  wrote:
> > As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on
> a LVM volume with ext4 (mounted with options noatime, nobarrier) on a PCIe
> flash memory.
> 
> So does that mean it was faster than your PMDK implementation?

The PMDK patch is not mine, but is from people in NTT Lab.  I'm very curious 
about the comparison of open_datasync and PMDK, too.


> > What do you think about changing the default value of wal_sync_method
> on Linux in PG 11?  I can understand the concern that users might hit
> performance degredation if they are using PostgreSQL on older systems.  But
> it's also mottainai that many users don't notice the benefits of
> wal_sync_method = open_datasync on new systems.
> 
> Well, some day persistent memory may be a common enough storage technology
> that such a change makes sense, but these days most people have either SSD
> or spinning disks, where the change would probably be a net negative.  It
> seems more like something we might think about changing in PG 20 or PG 30.

No, I'm not saying we should make the persistent memory mode the default.  I'm 
simply asking whether it's time to make open_datasync the default setting.  We 
can write a notice in the release note for users who still use ext3 etc. on old 
systems.  If there's no objection, I'll submit a patch for the next CF.

Regards
Takayuki Tsunakawa





RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]> On Thu, Jan 25, 2018 at 7:08 
PM, Tsunakawa, Takayuki
>  wrote:
> > No, I'm not saying we should make the persistent memory mode the default.
> I'm simply asking whether it's time to make open_datasync the default
> setting.  We can write a notice in the release note for users who still
> use ext3 etc. on old systems.  If there's no objection, I'll submit a patch
> for the next CF.
> 
> Well, like I said, I think that will degrade performance for users of SSDs
> or spinning disks.


As I showed previously, regular file writes on PCIe flash, *not writes using 
PMDK on persistent memory*, were 20% faster with open_datasync than with 
fdatasync.

In addition, regular file writes on HDD with ext4 were also 10% faster:

--
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  3408.905 ops/sec 293 usecs/op
fdatasync  3111.621 ops/sec 321 usecs/op
fsync  3609.940 ops/sec 277 usecs/op
fsync_writethrough  n/a
open_sync  3356.362 ops/sec 298 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  1892.157 ops/sec 528 usecs/op
fdatasync  3284.278 ops/sec 304 usecs/op
fsync  3066.655 ops/sec 326 usecs/op
fsync_writethrough  n/a
open_sync  1853.415 ops/sec 540 usecs/op
--


And you said open_datasync was significantly faster than fdatasync.  Could you 
show your results?  What device and filesystem did you use?

Regards
Takayuki Tsunakawa




RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:michael.paqu...@gmail.com]
> Or to put it short, the lack of granular syncs in ext3 kills performance
> for some workloads. Tomas Vondra's presentation on such matters are a really
> cool read by the way:
> https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zf
> s

Yeah, I saw this recently, too.  That was cool.

Regards
Takayuki Tsunakawa






RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

2018-01-25 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> If I understand correctly, those results are all just pg_test_fsync results.
> That's not reflective of what will happen when the database is actually
> running.  When you use open_sync or open_datasync, you force WAL write and
> WAL flush to happen simultaneously, instead of letting the WAL flush be
> delayed.

Yes, that's pg_test_fsync output.  Isn't pg_test_fsync the tool to determine 
the value for wal_sync_method?  Is this manual misleading?

https://www.postgresql.org/docs/devel/static/pgtestfsync.html
--
pg_test_fsync - determine fastest wal_sync_method for PostgreSQL

pg_test_fsync is intended to give you a reasonable idea of what the fastest 
wal_sync_method is on your specific system, as well as supplying diagnostic 
information in the event of an identified I/O problem.
--


Anyway, I'll use pgbench, and submit a patch if open_datasync is better than 
fdatasync.  I guess the current tweak of making fdatasync the default is a 
holdover from the era before ext4 and XFS became prevalent.


> I don't have the results handy at the moment.  We found it to be faster
> on a database benchmark where the WAL was stored on an NVRAM device.

Oh, NVRAM.  Interesting.  Then I'll try an open_datasync/fdatasync comparison 
on HDD and SSD/PCIe flash with pgbench.

Regards
Takayuki Tsunakawa




RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-25 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> On Thu, Jan 25, 2018 at 3:14 PM, Tsunakawa, Takayuki
>  wrote:
> > * Why does autovacuum launcher always choose only one database when that
> database need vacuuming for XID wraparound?  Shouldn't it also choose other
> databases?
> 
> Yeah, I'd also like to fix this issue. This can be problem even in other
> case; there are two databases that require anti-wraparound vacuum, and one
> of them has a very large table that could take a long time to vacuum. In
> this case, if autovacuum chooses the database having big table first,
> another database would not be selected until an autovacuum worker completes
> vacuum on the large table. To deal with it, I think we can make autovacuum
> workers tell that it is no longer necessary to launch a new autovacuum worker
> on the database to autovacuum launcher before exit. For example, autovacuum
> worker reports both the total number of relations and the number of relations
> that require an anti-wraparound vacuum to the stats collector.

Thanks for commenting.  I believe you have deep knowledge and experience with 
vacuum because you did great work on the freeze map in 9.6, so I appreciate 
your help!

How would you use those two counts?

How about just modifying do_start_worker(), so that the launcher chooses a 
database in the following order?

1. wraparound-risky database not being vacuumed by any worker
2. non-wraparound-risky database  not being vacuumed by any worker
3. wraparound-risky database being vacuumed by any worker
4. non-wraparound-risky database being vacuumed by any worker
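
A minimal sketch of this ordering (the helper and its two flags are
illustrative; the launcher would compute them from pg_database and its list of
running workers):

/*
 * Smaller value = higher priority; the launcher picks the database with the
 * lowest value.  Purely illustrative, not actual autovacuum.c code.
 */
static int
autovac_db_priority(bool wraparound_risky, bool being_vacuumed)
{
    if (!being_vacuumed)
        return wraparound_risky ? 1 : 2;    /* no worker on it yet */
    return wraparound_risky ? 3 : 4;        /* already has a worker */
}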

Regards
Takayuki Tsunakawa




RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-25 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I think we should consider having backends try to remove their temporary
> schema on startup; then, if a temp table in a backend is old enough that
> it's due for vacuum for wraparound, have autovacuum kill the connection.
> The former is necessary to prevent sessions from being killed on account
> of temp tables they "inherited" from a backend that didn't exit cleanly.

That seems to be the only reasonable solution.  One might feel it annoying to 
emit WAL during connection establishment to delete the temp schema, but even 
now the client authentication can emit WAL for hot pruning while scanning 
pg_database, pg_authid, etc.  Thanks.

> The in-place update idea won't work for a couple of reasons.  First, a
> command shouldn't see the results of modifications made earlier in the same
> command.  Second, using cursors, it's possible to have more than one
> distinct snapshot open against a temporary table at the same time.

You're right.  And if the transaction rolls back, it needs to see the old 
tuples, which requires undo log.

Regards
Takayuki Tsunakawa



RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-28 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> What I thought is that a worker reports these two values after scanned
> pg_class and after freezed a table. The launcher decides to launch a new
> worker if the number of tables requiring anti-wraparound vacuum is greater
> than the number of workers running on the database. Similarly, the
> autovacuum launcher doesn't launch a new worker if two values are equal,
> which means all tables requiring an anti-wraparound vacuum is being vacuumed.
> There are chances that new relation is added during a worker is running
> on the last one table that requires anti-wraparound vacuum and launcher
> launches a new worker on the database. I think it's no problem because the
> new worker would update that two values and exits soon.

I got it.  Currently, the launcher assigns all workers to one database even if 
that database has only one table in danger of wraparound.  With your 
suggestion, the launcher assigns only as many workers as there are tables to be 
frozen, and uses the remaining workers for the other databases.


> > How about just modifying do_start_worker(), so that the launcher chooses
> a database in the following order?
> >
> > 1. wraparound-risky database not being vacuumed by any worker 2.
> > non-wraparound-risky database  not being vacuumed by any worker 3.
> > wraparound-risky database being vacuumed by any worker 4.
> > non-wraparound-risky database being vacuumed by any worker
> >
> 
> IMO the limiting the number of worker on a database to 1 seems risky.
> If a database has many tables that require an anti-wraparound vacuum, it
> takes a long time to freeze the all of these tables. In current
> implementation, as I mentioned as above, launcher can launch multiple worker
> on the one database. 

I can understand your concern.  On the other hand, it's unfair that one 
database could monopolize all workers, because other databases might also be 
facing wraparound risk.

> Since the above idea would be complex a bit, as an
> alternated idea it might be better to specify the number of worker to launch
> per database by a new GUC parameter or something. If the number of worker
> running on the database exceeds that limit, the launcher doesn't select
> the database even if the database is about to wraparound.

I'm afraid the GUC would be difficult for the user to understand and tune.

I want to submit the patch for handling the garbage temporary table metadata, 
as Robert suggested, in the next CF.  That should be enough to prevent this 
customer's problem.  I would appreciate it if anyone could address the other 
improvement that Sawada-san proposed.

Regards
Takayuki Tsunakawa




RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-29 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> On Mon, Jan 29, 2018 at 3:33 PM, Tsunakawa, Takayuki
>  wrote:
> > I can understand your concern.  On the other hand, it's unfair that one
> database could monopolize all workers, because other databases might also
> be facing wraparound risk.
> 
> On third thought, we can change the policy of launching workers so that
> the launcher dispatches workers evenly to wraparound-risky databases
> instead of choosing only one most wraparound-risky database.

+1

Regards
Takayuki Tsunakawa



RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-01-30 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> Unfortunately, I think a full solution to the problem of allocating AV
> workers to avoid wraparound is quite complex.

Yes, that easily puts my small brain into an infinite loop...

> Given all of the foregoing this seems like a very hard problem.  I can't
> even articulate a clear set of rules for what our priorities should be,
> and it seems that such rules would depend on the rate at which we're consuming
> XIDs, how close we are in each database to a wraparound shutdown, what tables
> exist in each database, how big the not-all-frozen part of each one is,
> how big their indexes are, how much they're holding back relfrozenxid, and
> which ones already have workers, among other things.  I think it's quite
> possible that we can come up with something that's better than what we have
> now without embarking on a huge project, but it's not going to be anywhere
> near perfect because this is really complicated, and there's a real risk
> that we'll just making some cases better and others worse rather than
> actually coming out ahead overall.

So a simple improvement would be to assign workers fairly to databases facing a 
wraparound risk, as Sawada-san suggested.

One ultimate solution should be the undo-based MVCC that makes vacuuming 
unnecessary, which you proposed about a year ago...

Regards
Takayuki Tsunakawa




[bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-02-01 Thread Tsunakawa, Takayuki
Hello,

Some user hit a problem with ECPG on Windows.  The attached patch is a fix for 
it.  I'd appreciate it if you could backport this in all supported versions.

The problem is simple.  free() in the following example crashes:

char *out;

out = PGTYPESnumeric_to_asc(...);
free(out);

The cause is a mismatch of C runtime library versions.  The version of Visual 
Studio used to build the application was different from the one used to build 
PostgreSQL (libpgtypes.dll).

The fix is to add PGTYPES_free() in libpgtypes.dll, just like libpq has 
PQfreemem() described here:

https://www.postgresql.org/docs/devel/static/libpq-misc.html


Regards
Takayuki Tsunakawa



pgtypes_freemem.patch
Description: pgtypes_freemem.patch


RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-02-01 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> Thank you for suggestion. It sounds more smarter. So it would be more better
> if we vacuums database for anti-wraparound in ascending order of
> relfrozenxid?

I thought so, too.  The current behavior is inconsistent: the launcher tries to 
assign all workers to the one database with the biggest wraparound risk in 
order to eliminate the risk as fast as possible, but the workers are in no 
hurry to reduce the risk.

Regards
Takayuki Tsunakawa




RE: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-02-04 Thread Tsunakawa, Takayuki
From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
> +#ifndef PGTYPES_FREE
> +#define PGTYPES_FREE
> + extern void PGTYPES_free(void *ptr);
> +#endif
> 
> It seems quite strange to repeat this in pgtypes_date.h, pgtypes_interval.h
> and pgtypes_numeric.h.  I guess you might not want to introduce a new common
> header file so that his can be back-patched more easily?  Not sure if there
> is a project policy about that, but it seems unfortunate to introduce
> maintenance burden by duplicating this.

Your guess is correct.  I wanted to avoid the duplication, but there was no 
good place to put this without requiring existing applications to change their 
#include directives.  


> +   PGTYPES_free()/ instead of
> free().
> 
> The "/" needs to move inside then closing tag.

Thanks, fixed.

Regards
Takayuki Tsunakawa



pgtypes_freemem_v2.patch
Description: pgtypes_freemem_v2.patch


RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-02-04 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> Temporary tables contain XIDs, so they need to be vacuumed for XID
> wraparound.  Otherwise, queries against those tables by the session
> that created them could yield wrong answers.  However, autovacuum
> can't perform that vacuuming; it would have to be done by the session.
> I think we should consider having backends try to remove their
> temporary schema on startup; then, if a temp table in a backend is old
> enough that it's due for vacuum for wraparound, have autovacuum kill
> the connection.  The former is necessary to prevent sessions from
> being killed on account of temp tables they "inherited" from a backend
> that didn't exit cleanly.

The attached patch does the former.  The small change in autovacuum.c is mainly 
for autovac launcher and background workers which don't connect to a database.  
I'll add this to the next CF.  I'd like this to be back-patched.

I didn't do the latter, because killing the connection seemed a bit overkill.  
If we're going to do it, then we should also kill the connection which is 
preventing vacuum regardless of whether it has temporary tables in its session.

Regards
Takayuki Tsunakawa





reset_temp_schema_on_connect.patch
Description: reset_temp_schema_on_connect.patch


RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-02-06 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:michael.paqu...@gmail.com]
> I am not sure that we would like to give up that easily the property that
> we have now to clean up past temporary files only at postmaster startup
> and only when not in recovery.  If you implement that, there is a risk that
> the backend you are starting is eating the connection slot and by consequence
> its temporary schema and its set of temporary tables on which one may want
> to look into after a crash.

postmaster deletes temporary relation files at startup by calling 
RemovePgTempFiles() regardless of whether it's in recovery.  It doesn't call 
that function during auto restart after a crash when restart_after_crash is on.


> > 2. Teach autovacuum to remove any leftover tables in a pg_temp_%d
> > schema if the backend is active but in some other database (rather
> > than only when the backend is not active at all).
> 
> Yeah.  Here we can do something.  This does not sound much difficult to
> me.

I did that in my patch.

Regards
Takayuki Tsunakawa





RE: Temporary tables prevent autovacuum, leading to XID wraparound

2018-02-06 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:michael.paqu...@gmail.com]
> > postmaster deletes temporary relation files at startup by calling
> > RemovePgTempFiles() regardless of whether it's in recovery.  It
> > doesn't call that function during auto restart after a crash when
> > restart_after_crash is on.
> 
> The comment on top of RemovePgTempFiles() states the following:
>  * NOTE: we could, but don't, call this during a post-backend-crash restart
>  * cycle.  The argument for not doing it is that someone might want to
> examine
>  * the temp files for debugging purposes.  This does however mean that
>  * OpenTemporaryFile had better allow for collision with an existing temp
>  * file name.

Yes, I saw that comment.  postmaster keeps orphaned temp relation files only 
after a crash when restart_after_crash is on.


> Nice to hear that.  Please note that I did not check your patch, so I cannot
> conclude on its correctness in details.

I thought so.  No worries.

Regards
Takayuki Tsunakawa





[bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-13 Thread Tsunakawa, Takayuki
Hello,

Our customer encountered a rare PostgreSQL bug that prevents a cascaded standby 
from starting up.  The attached patch is a fix for it.  I hope this will be 
back-patched.  I'll add this to the next CF.


PROBLEM
==

The PostgreSQL version is 9.5.  The cluster consists of a master, a cascading 
standby (SB1), and a cascaded standby (SB2).  The WAL flows like this: master 
-> SB1 -> SB2.

The user shut down SB2 and tried to restart it, but failed with the following 
message:

FATAL:  invalid memory alloc request size 3075129344

The master and SB1 continued to run.


CAUSE
==

total_len in the code below was about 3GB, so palloc() rejected the memory 
allocation request.

[xlogreader.c]
record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
total_len = record->xl_tot_len;
...
/*
 * Enlarge readRecordBuf as needed.
 */
if (total_len > state->readRecordBufSize &&
!allocate_recordbuf(state, total_len))
{

Why was XLogRecord->xl_tot_len so big?  That's because the XLOG reader read the 
garbage portion in a WAL file, which is immediately after the end of valid WAL 
records.

Why was the garbage there?  Because the cascading standby sends just the valid 
WAL records, not whole WAL blocks, to the cascaded standby.  When the cascaded 
standby receives those WAL records and writes them into a recycled WAL file, 
the last WAL block in the file contains a mix of valid WAL records and leftover 
garbage.

Why did the XLOG reader try to allocate memory for garbage?  Doesn't it stop 
reading the WAL?  As the following code shows, when the WAL record header 
crosses a WAL block boundary, the XLOG reader first allocates memory for the 
whole record, and then checks the validity of the record header as soon as it 
has read the entire header.

[xlogreader.c]
	/*
	 * If the whole record header is on this page, validate it immediately.
	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
	 * rest of the header after reading it from the next page.  The xl_tot_len
	 * check is necessary here to ensure that we enter the "Need to reassemble
	 * record" code path below; otherwise we might fail to apply
	 * ValidXLogRecordHeader at all.
	 */
	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
	{
		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
								   randAccess))
			goto err;
		gotheader = true;
	}
	else
	{
		/* XXX: more validation should be done here */
		if (total_len < SizeOfXLogRecord)
		{
			report_invalid_record(state,
								  "invalid record length at %X/%X: wanted %u, got %u",
								  (uint32) (RecPtr >> 32), (uint32) RecPtr,
								  (uint32) SizeOfXLogRecord, total_len);
			goto err;
		}
		gotheader = false;
	}


FIX
==

One idea is to defer the memory allocation until the entire record header has 
been read and ValidXLogRecordHeader() is called.  However, 
ValidXLogRecordHeader() might misjudge the validity if the garbage contains an 
xl_prev value that happens to look correct.

So, I modified walreceiver to zero-fill the last partial WAL block.  This is 
the same guard that the master applies in AdvanceXLInsertBuffer(), as below:

/*
 * Be sure to re-zero the buffer so that bytes beyond what we've
 * written will look like zeroes and not valid XLOG records...
 */
MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
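
A minimal sketch of the corresponding walreceiver-side change (the function
name, the way the file offset is passed, and the error handling are my
illustrative assumptions, not the attached patch itself):

/*
 * After writing the received WAL up to "endptr", overwrite the rest of the
 * current block with zeroes so that stale bytes from the recycled segment
 * can never be mistaken for a record header.
 */
static void
XLogWalRcvZeroTail(int fd, XLogRecPtr endptr, uint32 segoff)
{
    uint32      blkoff = endptr % XLOG_BLCKSZ;

    if (blkoff != 0)
    {
        char        zeroes[XLOG_BLCKSZ];
        uint32      fill = XLOG_BLCKSZ - blkoff;

        memset(zeroes, 0, fill);
        if (pwrite(fd, zeroes, fill, (off_t) segoff) != fill)
            ereport(PANIC,
                    (errcode_for_file_access(),
                     errmsg("could not zero-fill WAL block: %m")));
    }
}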

FYI, the following unsolved problem may be solved, too.

https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com


Regards
Takayuki Tsunakawa



zerofill_walblock_on_standby.patch
Description: zerofill_walblock_on_standby.patch


RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-04-03 Thread Tsunakawa, Takayuki
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> reloption for TOAST is also required?

# I've come back to the office earlier than planned...

Hm, there's no reason to not provide toast.vacuum_shrink_enabled.  Done with 
the attached patch.


Regards
Takayuki Tsunakawa




disable-vacuum-truncation_v5.patch
Description: disable-vacuum-truncation_v5.patch


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-03 Thread Tsunakawa, Takayuki
Hi Peter,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> I did a bit of performance testing, both a plain pgbench and the
> suggested test case with 4096 partitions.  I can't detect any
> performance improvements.  In fact, within the noise, it tends to be
> just a bit on the slower side.
> 
> So I'd like to kick it back to the patch submitter now and ask for more
> justification and performance analysis.
> 
> Perhaps "speeding up planning with partitions" needs to be accepted first?

David kindly showed how to demonstrate the performance improvement on March 26, 
so I changed the status to needs review.  I'd appreciate it if you could 
continue the final check.


Regards
Takayuki Tsunakawa




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Tsunakawa, Takayuki
Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

That's strange...  Peter, Imai-san, can you compare your test procedures?

Peter, can you check and see the performance improvement with David's method on 
March 26 instead?


Regards
Takayuki Tsunakawa



RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-04-04 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> "VACUUM" needs  or "vacuum" is more appropriate here?

Looking at the same file and some other files, "vacuum" looks appropriate 
because it represents the vacuum action, not the specific VACUUM command.


> The format of the documentation of new option is a bit weird. Could it
> be possible to adjust it around 80 characters per line like other
> description?

Ah, fixed.  It's easy to overlook the style with screen reader software...  
I've been wondering if there are good settings for editing .sgml in Emacs that, 
for example, put the appropriate number of spaces at the beginning of each line 
when Enter is pressed, automatically break long lines and insert spaces, etc.


From: Julien Rouhaud [mailto:rjuju...@gmail.com]
> also, the documentation should point out that freeing is not
> guaranteed.  Something like
> 
>  + The default is true.  If true, VACUUM will try to free empty
> pages at the end of the table.

That's nice.  Done.


> > I'm not sure the consensus we got here but we don't make the vacuum
> > command option for this?
> 
> I don't think here's a clear consensus, but my personal vote is to add
> it, with  SHRINK_TABLE = [ force_on | force_off | default ] (unless a
> better proposal has been made already)

As I mentioned earlier, IMO the VACUUM option is not necessary because:
(1) this is a table property which is determined based on the expected 
workload, not something people want to change per VACUUM operation
(2) if someone wants to change the behavior for a particular VACUUM operation, 
he can do it using ALTER TABLE SET.
Anyway, we can add the VACUUM option separately later if we really want it.  I 
don't want it to be a blocker for getting this feature into PG 12, because the 
vacuum truncation has been bothering us, as others said in this and other 
threads...


Regards
Takayuki Tsunakawa



disable-vacuum-truncation_v6.patch
Description: disable-vacuum-truncation_v6.patch


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Tsunakawa, Takayuki
Hi Amit-san, Imai-snan,

From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> I was able to detect it as follows.
> plan_cache_mode = auto
> 
>HEAD: 1915 tps
> Patched: 2394 tps
> 
> plan_cache_mode = custom (non-problematic: generic plan is never created)
> 
>HEAD: 2402 tps
> Patched: 2393 tps

Thanks a lot for the very quick confirmation.  I'm relieved to still see the 
good results.


Regards
Takayuki Tsunakawa



RE: Timeout parameters

2019-04-05 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> The first letter should be upper-case.

Thank you for taking care of this patch, and sorry for the trouble of having to 
fix that...


> to me that socket_timeout_v14.patch should be rejected as it could cause
> a connection to go down with no actual reason and that the server should
> be in charge of handling timeouts.  Is my impression right?

No, the connection goes down for a good reason: the client could not get the 
response within a tolerable amount of time.


Regards
Takayuki Tsunakawa







RE: Timeout parameters

2019-04-07 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> I have just committed the GUC and libpq portion for TCP_USER_TIMEOUT after
> a last lookup, and I have cleaned up a couple of places. 

Thank you for further cleanup and committing.


> For the socket_timeout stuff, its way of solving the problem it thinks is
> solves does not seem right to me, and this thread has not reached a consensus
> anyway, so I have discarded the issue.
> 
> I am marking the CF entry as committed.  In the future, it would be better
> to not propose multiple concepts on the same thread, and if the
> socket_timeout business is resubmitted, I would suggest a completely new
> CF entry, and a new thread.

Understood.  Looking back at the review process, it seems that tcp_user_timeout 
and socket_timeout should have been handled in separate threads.


Regards
Takayuki Tsunakawa






RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-04-07 Thread Tsunakawa, Takayuki
Hi Andres, Fujii-san, any committer,

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-08 09:52:27 +0900, Fujii Masao wrote:
> > I'm thinking to commit this patch at first.  We can change the term
> > and add the support of "TRUNCATE" option for VACUUM command later.
> 
> I hope you realize feature freeze is in a few hours...

Ouch!  Could you take care of committing the patch, please?  I wouldn't be able 
to express the sadness and tiredness if this gets pushed to 13 just because of 
the parameter name...


Regards
Takayuki Tsunakawa







RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> On the whole I don't think there's an adequate case for committing
> this patch.

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> > If I reduce the number of partitions in Amit's example from 8192
> > to something more real-world, like 128, I do still measure a
> > performance gain, but it's ~ 1.5% which is below what I'd consider
> > a reproducible win.  I'm accustomed to seeing changes up to 2%
> > in narrow benchmarks like this one, even when "nothing changes"
> > except unrelated code.
> 
> I'm not sure it's actually that narrow these days. With all the
> partitioning improvements happening, the numbers of locks commonly held
> are going to rise. And while 8192 partitions is maybe on the more
> extreme side, it's a workload with only a single table, and plenty
> workloads touch more than a single partitioned table.

I would feel happy if I could say such a many-partitions use case is narrow or 
impractical and ignore it, but it's not narrow.  Two of our customers are 
actually requesting such usage: one uses 5,500 partitions and is trying to 
migrate from a commercial database on Linux, and the other requires 200,000 
partitions to migrate from a legacy database on a mainframe.  At first, I 
thought so many partitions indicated a bad application design, but their 
reasoning sounded valid (or at least I can't insist that it's bad).  PostgreSQL 
is now expected to handle such huge workloads.


From: Andres Freund [mailto:and...@anarazel.de]
> I'm not sure I'm quite that concerned. For one, a good bit of that space
> was up for grabs until the recent reordering of LOCALLOCK and nobody
> complained. But more importantly, I think commonly the amount of locks
> around is fairly constrained, isn't it? We can't really have that many
> concurrently held locks, due to the shared memory space, and the size of
> a LOCALLOCK isn't that big compared to say relcache entries.  We also
> probably fairly easily could win some space back - e.g. make nLocks 32
> bits.

+1



From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.
> 
> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

I think the linked list of LOCALLOCK approach is natural, simple, and good.  In 
Jim Gray's classic book "Transaction Processing: Concepts and Techniques", we 
can find the following sentence in "8.4.5 Lock Manager Internal Logic."  The 
sample implementation code in the book uses a similar linked list to remember 
and release a transaction's acquired locks.

"All the locks of a transaction are kept in a list so they can be quickly found 
and released at commit or rollback."

And handling this issue with a linked list of LOCALLOCKs is more natural than 
resizing the hash table.  We just want to quickly find all grabbed locks, so we 
use a linked list.  A hash table is a mechanism for finding a particular item 
quickly, so it was simply wrong to use the hash table to iterate over all 
grabbed locks.  Also, the hash table got big because some operation in the 
session needed it, and subsequent operations in the same session may need it 
again.  So shrinking the hash table wouldn't bring much relief.
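
As a sketch of that idea (the extra member and the helper functions below are
illustrative, not the actual patch I posted):

#include "postgres.h"
#include "lib/ilist.h"

/*
 * Every LOCALLOCK owned by the backend is chained into one dlist, so
 * releasing all locks at transaction end walks only the locks actually
 * held, instead of hash_seq_search() over a possibly bloated
 * LockMethodLocalHash.
 */
typedef struct MyLocalLock
{
    /* ... the existing LOCALLOCK fields would go here ... */
    dlist_node  chain;          /* links all locks held by this backend */
} MyLocalLock;

static dlist_head held_locks = DLIST_STATIC_INIT(held_locks);

/* called where LockAcquire() creates a new local lock entry */
static void
remember_lock(MyLocalLock *lock)
{
    dlist_push_tail(&held_locks, &lock->chain);
}

/* called from LockReleaseAll() instead of scanning the whole hash table */
static void
release_all_locks(void (*release_one) (MyLocalLock *))
{
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &held_locks)
    {
        MyLocalLock *lock = dlist_container(MyLocalLock, chain, iter.cur);

        dlist_delete(&lock->chain);
        release_one(lock);      /* also removes the entry from the local hash */
    }
}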


Regards
Takayuki Tsunakawa









RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tsunakawa, Takayuki
From: 'Andres Freund' [mailto:and...@anarazel.de]
> On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
> > I think the linked list of LOCALLOCK approach is natural, simple, and
> > good.
> 
> Did you see that people measured slowdowns?

Yeah, a 0.5% decrease with pgbench -M prepared -S (select-only), which feels 
like a somewhat extreme test case.  And, as was mentioned, that might be within 
the noise.

If we want to eliminate even that, we may have to consider removing 
LocalLockHash completely, but that doesn't seem feasible...


Regards
Takayuki Tsunakawa









RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-07 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> It would be good to get your view on the
> shrink_bloated_locallocktable_v3.patch I worked on last night. I was
> unable to measure any overhead to solving the problem that way.

Thanks, it looks super simple and good.  I understand the idea behind your 
patch to be the following (see also the toy sketch after the two points):

* Transactions that touch many partitions and/or tables are a special event, 
not the norm, and the resulting hash table bloat is an unlucky accident.  So 
it's reasonable to revert the bloated hash table to its original size.

* Repeated transactions that acquire many locks have to enlarge the hash table 
every time.  However, the overhead of hash table expansion should be hidden 
behind the other processing involved (acquiring/releasing locks, 
reading/writing the relations, accessing the catalogs of those relations).
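
Just to confirm I'm reading the patch right, here is a tiny self-contained 
sketch in plain C (not your patch and not PostgreSQL code; INITIAL_NBUCKETS and 
the 2x threshold are numbers I made up) of the shrink-on-release behavior I 
have in mind:

#include <stdio.h>
#include <stdlib.h>

#define INITIAL_NBUCKETS 16

static void  **buckets;                /* stand-in for the local lock hash table */
static size_t  nbuckets;

static void
create_table(size_t n)
{
    buckets = calloc(n, sizeof(void *));
    nbuckets = n;
}

static void
release_all_and_maybe_shrink(void)
{
    /* ... release every held lock here, leaving the table empty ... */

    /* If one big transaction bloated the table, rebuild it small again. */
    if (nbuckets > 2 * INITIAL_NBUCKETS)
    {
        free(buckets);
        create_table(INITIAL_NBUCKETS);
    }
}

int
main(void)
{
    create_table(INITIAL_NBUCKETS);

    /* pretend a transaction touched 8192 partitions and the table grew */
    free(buckets);
    create_table(8192);

    release_all_and_maybe_shrink();
    printf("next transaction starts with %zu buckets\n", nbuckets);
    return 0;
}

That way a session that bloats the table once pays a small rebuild cost at 
release time, and every later small transaction is unaffected.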


TBH, the linked list approach still feels more intuitive to me, because the 
resulting code looks like what it wants to do (efficiently iterate over the 
acquired locks) and it follows the classic book.  But your approach seems to 
address people's concerns, so I'm OK with your patch.

I'm registering you as another author and me as a reviewer, and marking this 
ready for committer.


Regards
Takayuki Tsunakawa






RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-04-07 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> And, as far as I can see from a quick review of the thread,
> we don't really have consensus on the names and behaviors.

The consensus on the name seems to be to use truncate rather than shrink (a few 
people kindly said they like shrink, and I'm OK with either name).  And there's 
no dispute about the behavior.  Do you see any other open point?


Regards
Takayuki Tsunakawa









RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-04-07 Thread Tsunakawa, Takayuki
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> "vacuum_truncate" gets my vote too.

+1


From: 'Andres Freund' [mailto:and...@anarazel.de]
> Personally I think the name just needs some committer to make a
> call. This largely is going to be used after encountering too many
> cancellations in production, and researching the cause. Minor spelling
> differences don't seem to particularly matter here.

Absolutely.  Thank you.


From: 'Andres Freund' [mailto:and...@anarazel.de]
> I think it needs to be an autovac option. The production issue is that
> autovacuums constantly cancel queries on the standbys despite
> hot_standby_feedback if you have a workload that does frequent
> truncations. If there's no way to configure it in a way that autovacuum
> takes it into account, people will just disable autovacuum and rely
> entirely on manual scripts. That's what already happens - leading to a
> much bigger foot-gun than disabling truncation.  FWIW, people already in
> production use the workaround to configuring snapshot_too_old as that,
> for undocumented reasons, disables trunctations. That's not better
> either.

Right.  We don't want autovacuum to be seen as the culprit.


Regards
Takayuki Tsunakawa




RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-04-08 Thread Tsunakawa, Takayuki
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> Thanks for the info, so I marked the patch as committed.

Thanks a lot for your hard work!  This one felt relatively tough despite the 
simplicity of the patch.  I'm starting to feel the difficulty and fatigue of 
developing in the community...


Regards
Takayuki Tsunakawa




RE: IDE setup and development features?

2018-10-09 Thread Tsunakawa, Takayuki
From: Mori Bellamy [mailto:m...@invoked.net]
> I'd like a few features when developing postgres -- (1) jump to definition
> of symbol (2) find references to symbol and (3) semantic autocompletion.

For (1), you can generate tags like this:

[for vi]
$ src/tools/make_ctags
[for Emacs]
$ src/tools/make_etags

For (2), cscope works well once you build its cross-reference database over the 
source tree.


Regards
Takayuki Tsunakawa



RE: Implementation of Flashback Query

2018-10-17 Thread Tsunakawa, Takayuki
From: Yang Jie [mailto:yang...@highgo.com]
> Delayed cleanup, resulting in performance degradation, what are the
> solutions recommended?
> What do you suggest for the flashback feature?
> Although postgres has excellent backup and restore capabilities, have you
> considered adding flashbacks?

Great proposal, I appreciate it.

Regarding the bloat, check the recent mail threads titled "undo xxx".  A new 
table storage format is being developed that uses undo logs, like Oracle and 
MySQL, instead of generating dead tuples.

How about also adding a feature like Flashback Drop to restore dropped and 
truncated tables?  I have sometimes seen customers who mistakenly dropped or 
truncated tables and didn't have any backup.


Regards
Takayuki Tsunakawa





RE: WAL archive (archive_mode = always) ?

2018-10-21 Thread Tsunakawa, Takayuki
From: Adelino Silva [mailto:adelino.j.si...@googlemail.com]
> What is the advantage to use archive_mode = always in a slave server compared
> to archive_mode = on (shared WAL archive) ?
> 
> I only see duplication of Wal files, what is the purpose of this feature ?

This also saves network bandwidth by not having to send the WAL archive from 
the primary to the standby(s).  Network bandwidth can be costly between remote 
regions used for disaster recovery.


Regards
Takayuki Tsunakawa





RE: WAL archive (archive_mode = always) ?

2018-10-22 Thread Tsunakawa, Takayuki
From: Narayanan V [mailto:vnarayanan.em...@gmail.com]
> I think what Takayuki is trying to say is that streaming replication works
> by sending the contents of the WAL archives to the standbys. If archive_mode
> was NOT set to always, and if you wanted to archive WAL logs in the standby
> you would need to rely on the process_command and make it ship the WAL logs
> (to the standby). This causes the same WAL log information to be shipped
> in two places,
> 
> 1. Through Streaming Replication
> 2. By the process_command
> 
> This redundant shipping of the same information is expensive and consumes
> network bandwidth. This can be avoided with the use of archive_mode=always.
> 
> archive_mode=always makes the standby archive the WAL logs it receives,
> thus avoiding the requirement of having to ship it separately.

Exactly.  (archive_command, not process_command)

Thanks, Narayanan.


Regards
Takayuki Tsunakawa




RE: PostgreSQL Limits and lack of documentation about them.

2018-10-25 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I think it's a bit strange that we don't have this information fairly
> early on in the official documentation.  I only see a mention of the
> 1600 column limit in the create table docs. Nothing central and don't
> see mention of 32 TB table size limit.
> 
> I don't have a patch, but I propose we include this information in the
> docs, perhaps on a new page in the preface part of the documents.
> 
> Does anyone else have any thoughts about this?

+1
As a user, I would look for such information in an appendix like "A Database 
Limits" in Oracle's Database Reference manual:

https://docs.oracle.com/en/database/oracle/oracle-database/18/refrn/database-limits.html#GUID-ED26F826-DB40-433F-9C2C-8C63A46A3BFE

As a somewhat related topic, PostgreSQL doesn't document the maximum values of 
numeric parameters.  I was asked questions like "what's the maximum value for 
max_connections?" and "how much memory can I use for work_mem?" several times.  
I don't feel a strong need to spell out those values, but I wonder if we should 
do something.


Regards
Takayuki Tsunakawa



