RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-02 Thread k.jami...@fujitsu.com
On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote:
> Hello, Kirk. Thank you for the new version.

Apologies for the delay, but attached are the updated versions, which simplify
the patches.
The changes reflect most of your comments and suggestions.

Summary of changes in the latest versions.
1. Updated the function description of DropRelFileNodeBuffers in 0003.
2. Updated the commit logs of 0003 and 0004.
3. FindAndDropRelFileNodeBuffers is now called for each relation fork,
   instead of once for all involved forks.
4. Removed the unnecessary palloc() and subscripts like forks[][],
   firstDelBlock[], nforks, as advised by Horiguchi-san. The memory
   allocation for block[][] was also simplified.
   So 0004 became simpler and more readable.


> At Thu, 26 Nov 2020 03:04:10 +0000, "k.jami...@fujitsu.com"
>  wrote in
> > On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > > From: Andres Freund 
> > > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers
> > > > (), which 3/4 doesn't address and 4/4 doesn't mention.
> > > >
> > > > 4/4 seems to address DropRelationFiles(), but only talks about
> > > TRUNCATE?
> > >
> > > Yes.  DropRelationFiles() is used in the following two paths:
> > >
> > > [Replay of TRUNCATE during recovery]
> > > xact_redo_commit/abort() -> DropRelationFiles()  ->
> > > smgrdounlinkall() ->
> > > DropRelFileNodesAllBuffers()
> > >
> > > [COMMIT/ROLLBACK PREPARED]
> > > FinishPreparedTransaction() -> DropRelationFiles()  ->
> > > smgrdounlinkall()
> > > -> DropRelFileNodesAllBuffers()
> >
> > Yes. The concern is that it was not clear in the function descriptions
> > and commit logs what the optimizations for the
> > DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() are for. So
> > I revised only the function description of DropRelFileNodeBuffers() and the
> commit logs of the 0003-0004 patches. Please check if the brief descriptions
> would suffice.
> 
> I read the commit message of 3/4.  (Though this is not involved literally
> in the final commit.)
> 
> > While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum
> or
> > autovacuum are replayed, the buffers are dropped when the sizes of all
> > involved forks of a relation are already "cached". We can get
> 
> This sentence seems missing "dropped by (or using) what".
> 
> > a reliable size of nblocks for supplied relation's fork at that time,
> > and it's safe because DropRelFileNodeBuffers() relies on the behavior
> > that cached nblocks will not be invalidated by file extension during
> > recovery.  Otherwise, or if not in recovery, proceed to sequential
> > search of the whole buffer pool.
> 
> This sentence seems confusing. It reads as if "we can rely on it
> because we're relying on it".  And "the cached value won't be invalidated"
> doesn't explain the reason precisely. The reason, I think, is that the cached
> value is guaranteed to be the maximum page we have in shared buffers at least
> while in recovery, and that guarantee is held by not asking fseek again once
> we have cached the value.

Fixed the commit log of 0003.

> > > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > > DropRelFileNodesAllBuffers() can be used for many relations at
> > > > once, this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD
> - 1
> > > lookups a lot
> > > > of times, once for each of nnodes relations?
> > >
> > > So, the threshold value should be compared with the total number of
> > > blocks of all target relations, not each relation.  You seem to be right, 
> > > got
> it.
> >
> > Fixed this in the 0004 patch. Now we compare the total number of
> > buffers-to-be-invalidated for ALL relations against
> BUF_DROP_FULL_SCAN_THRESHOLD.
> 
> I didn't see the previous version, but the row of additional palloc/pfree's in
> this version looks uneasy.

Fixed this too.
 
>   int i,
> + j,
> + *nforks,
>   n = 0;
> 
> Perhaps we shouldn't declare variables of different types in a single
> statement. (I'm not sure about declaring multiple variables at once.)

Fixed this too.

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-03 Thread k.jami...@fujitsu.com
On Friday, December 4, 2020 12:42 PM, Tang, Haiying wrote:
> Hello, Kirk
> 
> Thanks for providing the new patches.
> I did the recovery performance test on them, the results look good. I'd like 
> to
> share them with you and everyone else.
> (I also record VACUUM and TRUNCATE execution time on master/primary in
> case you want to have a look.)

Hi, Tang.
Thank you very much for verifying the performance using the latest set of
patches. Although they are not supposed to affect the non-recovery path
(execution on the primary), it's good to see those results too.

> 1. VACUUM and Failover test results (average of 15 times)
>
> [VACUUM] --- execution time on master/primary
> shared_buffers | master(sec) | patched(sec) | %reg=((patched-master)/master)
> ---------------+-------------+--------------+-------------------------------
> 128M           |       9.440 |        9.483 |   0%
> 10G            |      74.689 |       76.219 |   2%
> 20G            |     152.538 |      138.292 |  -9%
>
> [Failover] --- execution time on standby
> shared_buffers | master(sec) | patched(sec) | %reg=((patched-master)/master)
> ---------------+-------------+--------------+-------------------------------
> 128M           |       3.629 |        2.961 | -18%
> 10G            |      82.443 |        2.627 | -97%
> 20G            |     171.388 |        2.607 | -98%
> 
> 2. TRUNCATE and Failover test results (average of 15 times)
>
> [TRUNCATE] --- execution time on master/primary
> shared_buffers | master(sec) | patched(sec) | %reg=((patched-master)/master)
> ---------------+-------------+--------------+-------------------------------
> 128M           |      49.271 |       49.867 |   1%
> 10G            |     172.437 |      175.197 |   2%
> 20G            |     279.658 |      278.752 |   0%
>
> [Failover] --- execution time on standby
> shared_buffers | master(sec) | patched(sec) | %reg=((patched-master)/master)
> ---------------+-------------+--------------+-------------------------------
> 128M           |       4.877 |        3.989 | -18%
> 10G            |      92.680 |        3.975 | -96%
> 20G            |     182.035 |        3.962 | -98%
> 
> [Machine spec]
> CPU : 40 processors  (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
> Memory: 64G
> OS: CentOS 8
> 
> [Failover test data]
> Total table Size: 700M
> Table: 1 tables (1000 rows per table)
> 
> If you have question on my test, please let me know.

Looks great.
The results were helpful for checking whether there were any performance
differences from the previous versions' results, and I am glad they turned
out well too.

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-06 Thread k.jami...@fujitsu.com
On Friday, December 4, 2020 8:27 PM, Amit Kapila  
wrote:
> On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Fri, 27 Nov 2020 02:19:57 +, "k.jami...@fujitsu.com"
> >  wrote in
> > > > From: Kyotaro Horiguchi  Hello, Kirk.
> > > > Thank you for the new version.
> > >
> > > Hi, Horiguchi-san. Thank you for your very helpful feedback.
> > > I'm updating the patches addressing those.
> > >
> > > > +   if (!smgrexists(rels[i], j))
> > > > +   continue;
> > > > +
> > > > +   /* Get the number of blocks for a relation's fork */
> > > > +   blocks[i][numForks] = smgrnblocks(rels[i], j,
> > > > NULL);
> > > >
> > > > If we see a fork which its size is not cached we must give up this
> > > > optimization for all target relations.
> > >
> > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and
> > > use InRecovery when deciding for optimization because of the following
> reasons:
> > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to
> > > relation page contents. So in DropRelFileNodeBuffers(),
> > > XLogReadBufferExtended() is called during VACUUM replay because
> VACUUM changes the page content.
> > > OTOH, TRUNCATE doesn't change the relation content, it just
> > > truncates relation pages without changing the page contents. So
> > > XLogReadBufferExtended() is not called, and the "cached" flag will
> > > always return false. I tested with "cached" flags before, and this
> >
> > A bit different from the point, but if some tuples have been inserted
> > to the truncated table, XLogReadBufferExtended() is called for the
> > table and the length is cached.
> >
> > > always return false, at least in DropRelFileNodesAllBuffers. Due to
> > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers().
> > > However, I think we can still rely on smgrnblocks to get the file
> > > size as long as we're InRecovery. That cached nblocks is still guaranteed
> to be the maximum in the shared buffer.
> > > Thoughts?
> >
> > That means that we always think as if smgrnblocks returns "cached" (or
> > "safe") value during recovery, which is out of our current consensus.
> > If we go on that side, we don't need to consult the "cached" returned
> > from smgrnblocks at all and it's enough to see only InRecovery.
> >
> > I got confused..
> >
> > We are relying on the "fact" that the first lseek() call of a
> > (startup) process tells the truth.  We added an assertion so that we
> > make sure that the cached value won't be cleared during recovery.  A
> > possible remaining danger would be closing of an smgr object of a live
> > relation just after a file extension failure. I think we are thinking
> > that that doesn't happen during recovery.  Although it seems to me
> > true, I'm not confident.
> >
> 
> Yeah, I also think it might not be worth depending upon whether smgr close
> has been done before or not. I feel the current idea of using 'cached'
> parameter is relatively solid and we should rely on that.
> Also, which means that in DropRelFileNodesAllBuffers() we should rely on
> the same, I think doing things differently in this regard will lead to 
> confusion. I
> agree in some cases we might not get benefits but it is more important to be
> correct and keep the code consistent to avoid introducing bugs now or in the
> future.
> 
Hi,
I have reported before that the "cached" flag of smgrnblocks() does not
always return true. When I checked the truncate test case used in my patch,
it did not enter the optimization path despite an INSERT being done before
the truncation of the table.
The reason is that in TRUNCATE, a new RelFileNode is assigned to the
relation when creating a new file. In recovery, XLogReadBufferExtended()
always opens the RelFileNode and calls smgrnblocks() for that RelFileNode
for the first time. So during recovery, different RelFileNodes are used for
the INSERTs to the table and the TRUNCATE of the same table.

Given the above, we cannot use the "cached" flag for both
DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers().
If we want consistency and correctness while still making use of the
optimization, we could completely drop the "cached" flag parameter of
smgrnblocks() and use InRecovery instead.
Tsunakawa-san mentioned in [1] that this is safe because smgrclose() is not
called by the startup process during recovery; shared-inval messages are not
sent to the startup process.

Otherwise, we keep the current patch form as it is: using "cached" in
DropRelFileNodeBuffers() and InRecovery in DropRelFileNodesAllBuffers().
However, that does not seem to be what is wanted in this thread.

Thoughts?

Regards,
Kirk Jamison

[1] 
https://www.postgresql.org/message-id/TYAPR01MB2990B42570A5FAC349EE983AFEF40%40TYAPR01MB2990.jpnprd01.prod.outlook.com


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-07 Thread k.jami...@fujitsu.com
On Tuesday, December 8, 2020 2:35 PM, Amit Kapila wrote:

> On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila
> >  wrote in
> > > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi
> > >  wrote:
> > > > We drop
> > > > buffers for the old relfilenode on truncation anyway.
> > > >
> > > > What I did is:
> > > >
> > > > a: Create a physical replication pair.
> > > > b: On the master, create a table. (without explicitly starting a
> > > > tx)
> > > > c: On the master, insert a tuple into the table.
> > > > d: On the master truncate the table.
> > > >
> > > > On the standby, smgrnblocks is called for the old relfilenode of
> > > > the table at c, then the same function is called for the same
> > > > relfilenode at d and the function takes the cached path.
> > > >
> > >
> > > This is on the lines I have tried for recovery. So, it seems we are
> > > in agreement that we can use the 'cached' flag in
> > > DropRelFileNodesAllBuffers and it will take the optimized path in
> > > many such cases, right?
> >
> >
> > Mmm. There seems to be a misunderstanding.. What I opposed to is
> > referring only to InRecovery and ignoring the value of "cached".
> >
> 
> Okay, I think it was Kirk-San who proposed to use InRecovery and ignoring
> the value of "cached" based on the theory that even if Insert (or other DMLs)
> are done before Truncate, it won't use an optimized path and I don't agree
> with the same. So, I did a small test to check the same and found that it
> should use the optimized path and the same is true for the experiment done
> by you. I am not sure why Kirk-San is seeing something different?
> 
> > The remaining issue is we don't get to the optimized path when a
> > standby makes the first call to smgrnblocks() when truncating a
> > relation. Still we can get to the optimized path as far as any
> > update(+insert) or select is performed earlier on the relation so I
> think it doesn't matter so much.
> >
> 
> +1.

My earlier question/proposal was to either use InRecovery or completely drop
smgrnblocks' "cached" flag.
That came from the results of my investigation below, where I used "cached"
in DropRelFileNodesAllBuffers(): the optimization path was skipped because
one of the rels' "cached" values was "false".

Test Case (shared_buffers = 1GB)
0. Set up physical replication between master and standby.
1. Create 1 table.
2. Insert data (1MB) into the table.
   16385 is the relNode for the insert (both master and standby).

3. Pause WAL replay on the standby.
4. TRUNCATE the table on the primary.
   nrels = 3. relNodes 16389, 16388, 16385.

5. Stop Primary.

6. Promote the standby and resume WAL recovery. nrels = 3.
   1st rel's check for optimization: "cached" is TRUE. relNode = 16389.
   2nd rel's check for optimization: "cached" was returned FALSE by
   smgrnblocks(). relNode = 16388.
   Since one of the rels' "cached" values is FALSE, the optimization check
   for the 3rd relation and the whole optimization itself are skipped.
   Go to the full-scan path in DropRelFileNodesAllBuffers().
   Then smgrclose() for relNodes 16389, 16388, 16385.

Because one of the rels' cached values was false, the full-scan path was
forced for TRUNCATE.
Is there a possible workaround for this?


Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-08 Thread k.jami...@fujitsu.com
On Wednesday, December 9, 2020 10:58 AM, Tsunakawa, Takayuki wrote: 
> From: Kyotaro Horiguchi 
> > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila
> >  wrote in
>  > I also can't think of a way to use an optimized path for such cases
> > > but I don't agree with your comment on if it is common enough that
> > > we leave this optimization entirely for the truncate path.
> >
> > An ugly way to cope with it would be to let other smgr functions
> > manage the cached value, for example, by calling smgrnblocks while
> > InRecovery.  Or letting smgr remember the maximum block number ever
> > accessed.  But we cannot fully rely on that since smgr can be closed
> > midst of a session and smgr doesn't offer such persistence.  In the
> > first place smgr doesn't seem to be the place to store such persistent
> > information.
> 
> Yeah, considering the future evolution of this patch to operations during
> normal running, I don't think that would be a good fit, either.
> 
> Then, as we're currently targeting just recovery, the options we can take
> are below.  Which would you vote for?  My choice would be (3) > (2) > (1).
> 
> 
> (1)
> Use the cached flag in both VACUUM (0003) and TRUNCATE (0004).
> This brings the most safety and code consistency.
> But this would not benefit from optimization for TRUNCATE in unexpectedly
> many cases -- when TOAST storage exists but it's not written, or FSM/VM is
> not updated after checkpoint.
> 
> 
> (2)
> Use the cached flag in VACUUM (0003), but use InRecovery instead of the
> cached flag in TRUNCATE (0004).
> This benefits from the optimization in all cases.
> But this lacks code consistency.
> You may be afraid of safety if the startup process smgrclose()s the relation
> after the shared buffer flushing hits disk full.  However, startup process
> doesn't smgrclose(), so it should be safe.  Just in case the startup process
> smgrclose()s, the worst consequence is PANIC shutdown after repeated
> failure of checkpoints due to lingering orphaned dirty shared buffers.  Accept
> it as Thomas-san's devil's suggestion.
> 
> 
> (3)
> Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
> This benefits from the optimization in all cases.
> The code is consistent and smaller.
> As for the safety, this is the same as (2), but it applies to VACUUM as well.

If we want code consistency, then we'd choose either 1 or 3.
And if we want the benefits of the optimization for both
DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers(), then I'd choose 3.
However, if the reviewers and committer want to keep the "cached" flag, then
we can live with it in place, even though it's not common to get the
optimization in the TRUNCATE path; only VACUUM would take most of the benefit.
My vote is also (3), then (2), then (1).

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread k.jami...@fujitsu.com
On Thursday, December 10, 2020 12:27 PM, Amit Kapila wrote: 
> On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila
> >  wrote in
> > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi
> > > > Mmm. At least btree doesn't need to call smgrnblocks except at
> > > > expansion, so we cannot get to the optimized path in major cases
> > > > of truncation involving btree (and/or maybe other indexes).
> > > >
> > >
> > > AFAICS, btree insert should call smgrnblocks via
> > >
> btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExte
> nded->XLogReadBufferExtended->smgrnblocks.
> > > Similarly delete should also call smgrnblocks. Can you be bit more
> > > specific related to the btree case you have in mind?
> >
> > Oh, sorry. I wrongly looked at the non-recovery path. smgrnblocks is
> > called during buffer loading while in recovery. So smgrnblocks is called
> > for indexes if any update happens on the heap relation.
> >
> 
> Okay, so this means that we can get the benefit of optimization in many cases
> in the Truncate code path as well even if we use 'cached'
> flag? If so, then I would prefer to keep the code consistent for both vacuum
> and truncate recovery code path.

Yes, I have tested that the optimization works for index relations.

I have attached V34, following the condition that we use the "cached" flag
in both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() for
consistency.
I added a comment in 0004 about the limitation of the optimization when
there are TOAST relations that use a non-PLAIN storage strategy, i.e. the
optimization works if the data types used are integer, OID, bytea, etc., but
for TOAST-able data types like text, the optimization is skipped and a full
scan is forced during recovery.

Regards,
Kirk Jamison



v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v34-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v34-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v34-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description:  v34-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-10 Thread k.jami...@fujitsu.com
On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote:
> On Thu, Dec 10, 2020 at 1:40 PM k.jami...@fujitsu.com
>  wrote:
> >
> > Yes, I have tested that the optimization works for index relations.
> >
> > I have attached V34, following the condition that we use the "cached"
> > flag in both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers()
> > for consistency.
> > I added a comment in 0004 about the limitation of the optimization when
> > there are TOAST relations that use a non-PLAIN storage strategy, i.e.
> > the optimization works if the data types used are integer, OID, bytea,
> > etc., but for TOAST-able data types like text, the optimization is
> > skipped and a full scan is forced during recovery.
> >
> 
> AFAIU, it won't take optimization path only when we have TOAST relation but
> there is no insertion corresponding to it. If so, then we don't need to 
> mention
> it specifically because there are other similar cases where the optimization
> won't work like when during recovery we have to just perform TRUNCATE.
> 

Right, I forgot to add that there should be an update, such as an insert, to
the TOAST relation for the truncate optimization to work. However, that is
limited to TOAST relations with the PLAIN strategy. I have tested with the
text data type, with inserts before truncate, and it did not enter the
optimization path. OTOH, it worked for data types like integer. So should I
still not include that information?

Also, I will remove the unnecessary "cached" from the line that Tsunakawa-san
mentioned. I will wait for a few more comments before re-uploading what is
hopefully the final version, including the test for truncate.

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-13 Thread k.jami...@fujitsu.com
On Friday, December 11, 2020 10:27 AM, Amit Kapila wrote:
> On Fri, Dec 11, 2020 at 5:54 AM k.jami...@fujitsu.com
>  wrote:
> > So should I still not include that information?
> >
> 
> I think we can extend your existing comment like: "Otherwise if the size of a
> relation fork is not cached, we proceed to a full scan of the whole buffer 
> pool.
> This can happen if there is no update to a particular fork during recovery."

Attached are the final updated patches.
I followed this advice and updated the source code comment a little.
There are no changes from the previous version other than that and the
removal of the unnecessary "cached" condition which Tsunakawa-san mentioned.

Below is also the updated recovery performance test results for TRUNCATE.
(1000 tables, 1MB per table, results measured in seconds)
| s_b   | Master | Patched | % Reg   | 
|---||-|-| 
| 128MB | 0.406  | 0.406   | 0%  | 
| 512MB | 0.506  | 0.406   | -25%| 
| 1GB   | 0.806  | 0.406   | -99%| 
| 20GB  | 15.224 | 0.406   | -3650%  | 
| 100GB | 81.506 | 0.406   | -19975% |

Because of the relation size, the 128MB shared_buffers setting is expected
to take the full-scan path, and there was no regression there.
Similar to previous test results, the recovery time was constant across all
shared_buffers settings with the patches applied.

Regards,
Kirk Jamison



v35-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v35-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v35-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v35-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v35-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v35-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description:  v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch


RE: libpq debug log

2020-12-21 Thread k.jami...@fujitsu.com
On Tuesday, December 15, 2020 8:12 PM, Iwata-san wrote:
> > There are some things still to do:
> I worked on some of them.

Hi Iwata-san,

Thank you for updating the patch.
I would recommend registering this patch in the upcoming commitfest to help
us keep track of it. I will follow the thread and provide more reviews too.

> > 1. Is the handling of protocol 2 correct?
> Protocol 3.0 is implemented in PostgreSQL v7.4 and later. Therefore, most
> servers and clients today want to connect using 3.0.
> Also, wishful thinking: I think Protocol 2.0 will no longer be supported,
> so I didn't develop it aggressively.
> Another reason is that I'm concerned implementing it would make the code
> less maintainable.

Though I have not tested it with a 2.0 server yet, I would have to download
version 7.3 to do so, wouldn't I? Silly question: do we still want this
feature for 2.0?
I understand that protocol 2.0 is still supported, but it is only used by
PostgreSQL 7.3 and earlier, which no longer receive fixes since we only
back-patch the previous 5 versions. I am not sure if it's a good idea, but
how about supporting this feature only for protocol 3? There are similar
chunks of code in fe-exec.c, PQsendPrepare(), PQsendDescribe(), etc. that
already do something similar, so I thought we could do the same:
if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
{
    /* ... */
}
else
{
    /* Protocol 2.0 isn't supported */
    printfPQExpBuffer(&conn->errorMessage,
                      libpq_gettext("function requires at least protocol version 3.0\n"));
    return 0;
}

But if it's necessary to provide this improved trace output for 2.0 servers,
then ignore my suggestion above; although difficult, I think we should then
test protocol 2.0 against an older server.

Some minor code comments I noticed:
1.
+   LOG_FIRST_BYTE,     /* logging the first byte identifing the
+                        * protocol message type */

"identifing" should be "identifying". And the closing */ should be on the
3rd line.

2.
+   case LOG_CONTENTS:
+   /* Suppresses printing terminating zero byte */

--> Suppress printing of terminating zero byte

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-21 Thread k.jami...@fujitsu.com
On Monday, December 21, 2020 10:25 PM, Amit Kapila wrote:
> I have started doing minor edits to the patch especially planning to write a
> theory why is this optimization safe and here is what I can come up with: 
> "To
> remove all the pages of the specified relation forks from the buffer pool, we
> need to scan the entire buffer pool but we can optimize it by finding the
> buffers from BufMapping table provided we know the exact size of each fork
> of the relation. The exact size is required to ensure that we don't leave any
> buffer for the relation being dropped as otherwise the background writer or
> checkpointer can lead to a PANIC error while flushing buffers corresponding
> to files that don't exist.
> 
> To know the exact size, we rely on the size cached for each fork by us during
> recovery which limits the optimization to recovery and on standbys but we
> can easily extend it once we have shared cache for relation size.
> 
> In recovery, we cache the value returned by the first lseek(SEEK_END) and
> the future writes keeps the cached value up-to-date. See smgrextend. It is
> possible that the value of the first lseek is smaller than the actual number 
> of
> existing blocks in the file due to buggy Linux kernels that might not have
> accounted for the recent write. But that should be fine because there must
> not be any buffers after that file size.
> 
> XXX We would make the extra lseek call for the unoptimized paths but that is
> okay because we do it just for the first fork and we anyway have to scan the
> entire buffer pool the cost of which is so high that the extra lseek call 
> won't
> make any visible difference. However, we can use InRecovery flag to avoid the
> additional cost but that doesn't seem worth it."
> 
> Thoughts?

+1
Thank you very much for expanding the comments to carefully explain why the
optimization is safe. I was also struggling to explain it completely, but
your description also covers the possibility of extending the optimization
in the future once we have a shared cache for relation sizes, so I like this
addition.

(Also, it seems that we have concluded to retain the locking mechanism of
the existing patch based on the recent email exchanges. Both the traditional
path and the optimized path do the rechecking, so there seems to be no
problem; I'm definitely fine with it.)

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread k.jami...@fujitsu.com
On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote: 
> Attached, please find the updated patch with the following modifications, (a)
> updated comments at various places especially to tell why this is a safe
> optimization, (b) merged the patch for extending the smgrnblocks and
> vacuum optimization patch, (c) made minor cosmetic changes and ran
> pgindent, and (d) updated commit message. BTW, this optimization will help
> not only vacuum but also truncate when it is done in the same transaction in
> which the relation is created.  I would like to see certain tests to ensure 
> that
> the value we choose for BUF_DROP_FULL_SCAN_THRESHOLD is correct. I
> see that some testing has been done earlier [1] for this threshold but I am 
> not
> still able to conclude. The criteria to find the right threshold should be 
> what is
> the maximum size of relation to be truncated above which we don't get
> benefit with this optimization.
> 
> One idea could be to remove "nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> always use optimized path for the tests. Then use the relation size as
> NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
> buffers as 128MB, 1GB, 20GB, 100GB.

Alright. I will also repeat the tests with the different threshold settings;
thank you for the tip.

> Apart from tests, do let me know if you are happy with the changes in the
> patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization patch.

Thank you, Amit.
That looks neater, combining the previous 0002-0003 patches, so I am +1 on
the changes because of the clearer explanations of the threshold and the
optimization path in DropRelFileNodeBuffers(). Thanks for cleaning up my
patch sets.
I hope we don't forget the 0001 patch's assertion in smgrextend(), to ensure
that we do it safely too and that we are not InRecovery.

> [1] -
> https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9
> FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-22 Thread k.jami...@fujitsu.com
On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
> On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila 
> wrote:
> > Next, I'll look into DropRelFileNodesAllBuffers()
> > optimization patch.
> >
> 
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> =
> ===
> 1.
> DropRelFileNodesAllBuffers()
> {
> ..
> +buffer_full_scan:
> + pfree(block);
> + nodes = palloc(sizeof(RelFileNode) * n); /* non-local relations */
> + for (i = 0; i < n; i++)
> +     nodes[i] = smgr_reln[i]->smgr_rnode.node;
> +
> ..
> }
> 
> How is it correct to assign nodes array directly from smgr_reln? There is no
> one-to-one correspondence. If you see the code before patch, the passed
> array can have mixed of temp and non-temp relation information.

You are right. I mistakenly removed the nodes array that should have been
allocated for non-local relations. I fixed that by doing:

    SMgrRelation *rels;

    rels = palloc(sizeof(SMgrRelation) * nnodes);   /* non-local relations */

    /* If it's a local relation, it's localbuf.c's problem. */
    for (i = 0; i < nnodes; i++)
    {
        if (RelFileNodeBackendIsTemp(smgr_reln[i]->smgr_rnode))
        {
            if (smgr_reln[i]->smgr_rnode.backend == MyBackendId)
                DropRelFileNodeAllLocalBuffers(smgr_reln[i]->smgr_rnode.node);
        }
        else
            rels[n++] = smgr_reln[i];
    }
    ...
    if (n == 0)
    {
        pfree(rels);
        return;
    }
    ...
    /* traditional path: */
    pfree(block);
    nodes = palloc(sizeof(RelFileNode) * n);    /* non-local relations */
    for (i = 0; i < n; i++)
        nodes[i] = rels[i]->smgr_rnode.node;

> 2.
> + for (i = 0; i < n; i++)
>   {
> - pfree(nodes);
> + for (j = 0; j <= MAX_FORKNUM; j++)
> + {
> + /*
> + * Assign InvalidblockNumber to a block if a relation
> + * fork does not exist, so that we can skip it later
> + * when dropping the relation buffers.
> + */
> + if (!smgrexists(smgr_reln[i], j))
> + {
> + block[i][j] = InvalidBlockNumber;
> + continue;
> + }
> +
> + /* Get the number of blocks for a relation's fork */ block[i][j] =
> + smgrnblocks(smgr_reln[i], j, &cached);
> 
> Similar to above, how can we assume smgr_reln array has all non-local
> relations? Have we tried the case with mix of temp and non-temp relations?

Same as my reply above.

> In this code, I am slightly worried about the additional cost of each time
> checking smgrexists. Consider a case where there are many relations and only
> one or few of them have not cached the information, in such a case we will
> pay the cost of smgrexists for many relations without even going to the
> optimized path. Can we avoid that in some way or at least reduce its usage to
> only when it is required? One idea could be that we first check if the nblocks
> information is cached and if so then we don't need to call smgrnblocks,
> otherwise, check if it exists. For this, we need an API like
> smgrnblocks_cached, something we discussed earlier but preferred the
> current API. Do you have any better ideas?

Right. I understand the point: let's say there are 100 relations, and the
first 99 non-local relations happen to enter the optimization path, but the
last one does not; calling smgrexists() would be too costly and a waste of
time in that case.
The only solution I could think of, as you mentioned, is to reintroduce the
new API which we discussed before: smgrnblocks_cached().
We could then call smgrexists() only if smgrnblocks_cached() returns
InvalidBlockNumber.
So if everyone agrees, we can add that API smgrnblocks_cached(), which will
include the "cached" flag parameter, and remove the additional flag
modifications from smgrnblocks(). In this case, both DropRelFileNodeBuffers()
and DropRelFileNodesAllBuffers() will use the new API.

Thoughts?


Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread k.jami...@fujitsu.com
On Wed, December 23, 2020 5:57 PM (GMT+9), Amit Kapila wrote:
> >
> > At Wed, 23 Dec 2020 04:22:19 +, "tsunakawa.ta...@fujitsu.com"
> >  wrote in
> > > From: Amit Kapila 
> > > > + /* Get the number of blocks for a relation's fork */ block[i][j]
> > > > + = smgrnblocks(smgr_reln[i], j, &cached);
> > > > +
> > > > + if (!cached)
> > > > + goto buffer_full_scan;
> > > >
> > > > Why do we need to use goto here? We can simply break from the loop
> > > > and then check if (cached && nBlocksToInvalidate <
> > > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid
> goto
> > > > if possible without much complexity.
> > >
> > > That's because two for loops are nested -- breaking there only exits
> > > the inner loop.  (I thought the same as you at first... And I
> > understand some people have an allergy to goto, I think modest use of
> > > goto makes the code readable.)
> >
> > I don't strongly oppose to goto's but in this case the outer loop can
> > break on the same condition with the inner loop, since cached is true
> > whenever the inner loop runs to the end. It is needed to initialize
> > the variable cache with true, instead of false, though.
> >
> 
> +1. I think it is better to avoid goto here as it can be done without
> introducing any complexity or making code any less readable.

I also do not mind, so I have removed the goto and followed the advice
of all reviewers. It works fine in the latest attached patch 0003.

Attached herewith are the sets of patches. 0002 & 0003 have the following
changes:

1. I have removed the modifications in smgrnblocks(). The modifications to
other functions that use smgrnblocks() in the previous patch versions were
also reverted.
2. Introduced a new API smgrnblocks_cached() instead, which returns either
the cached size for the specified fork or InvalidBlockNumber.
Since InvalidBlockNumber is used, I think it is logical not to use the
additional boolean parameter "cached" in the function, as it would be
redundant. (In 0003, I only used "cached" as a boolean variable to avoid
the goto.)
This function is called in both DropRelFileNodeBuffers() and
DropRelFileNodesAllBuffers().
3. Modified some minor comments from the patch and commit logs.

It compiles. Passes the regression tests too.
Your feedbacks are definitely welcome.

Regards,
Kirk Jamison


v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description:  v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch


v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description:  v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-24 Thread k.jami...@fujitsu.com
On Thu, December 24, 2020 6:02 PM JST, Tang, Haiying wrote:
> Hi Amit, Kirk
> 
> >One idea could be to remove "nBlocksToInvalidate <
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
> >uses the optimized path for the tests. Then use the relation size as
> >NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
> >buffers as 128MB, 1GB, 20GB, 100GB.
> 
> I followed your idea to remove the check and use different relation sizes for
> different shared buffers as 128M, 1G, 20G, 50G (my environment can't support
> 100G, so I chose 50G).
> According to the results, all three thresholds can get optimized, even
> NBuffers/128 when shared_buffers > 128M.
> IMHO, NBuffers/128 is the maximum relation size for which we can get the
> optimization among the three thresholds. Please let me know if I did
> something wrong.
 

Hello Tang,
Thank you very much again for testing. Perhaps there is a confusing part in the
presented table, where you indicated master(512), master(256), master(128),
because master is not supposed to use BUF_DROP_FULL_SCAN_THRESHOLD and just
executes the existing default full scan of NBuffers.
Or have I misunderstood something?

> Recovery after vacuum test results as below ' Optimized percentage' and '
> Optimization details(unit: second)' shows:
> (512),(256),(128): means relation size is NBuffers/512, NBuffers/256,
> NBuffers/128
> %reg: means (patched(512)- master(512))/ master(512)
> 
> Optimized percentage:
>
> | shared_buffers | %reg(512) | %reg(256) | %reg(128) |
> |----------------|-----------|-----------|-----------|
> | 128M           | 0%        | -1%       | -1%       |
> | 1G             | -65%      | -49%      | -62%      |
> | 20G            | -98%      | -98%      | -98%      |
> | 50G            | -99%      | -99%      | -99%      |
> 
> Optimization details (unit: seconds):
>
> | shared_buffers | master(512) | patched(512) | master(256) | patched(256) | master(128) | patched(128) |
> |----------------|-------------|--------------|-------------|--------------|-------------|--------------|
> | 128M           | 0.108       | 0.108        | 0.109       | 0.108        | 0.109       | 0.108        |
> | 1G             | 0.310       | 0.107        | 0.410       | 0.208        | 0.811       | 0.309        |
> | 20G            | 94.493      | 1.511        | 188.777     | 3.014        | 380.633     | 6.020        |
> | 50G            | 537.978     | 3.815        | 867.453     | 7.524        | 1559.076    | 15.541       |
> 
> Test preparation:
> Below is the table count for each shared_buffers setting. Each table size is
> 8 kB, so I use table count = NBuffers/(512 or 256 or 128):
>
> | shared_buffers | NBuffers | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> |----------------|----------|--------------|--------------|--------------|
> | 128M           | 16384    | 32           | 64           | 128          |
> | 1G             | 131072   | 256          | 512          | 1024         |
> | 20G            | 2621440  | 5120         | 10240        | 20480        |
> | 50G            | 6553600  | 12800        | 25600        | 51200        |
> 
> Besides, I also did single table performance test.
> Still, NBuffers/128 is the max relation size which we can get optimization.
> 
> Optimized percentage:
>
> | shared_buffers | %reg(512) | %reg(256) | %reg(128) |
> |----------------|-----------|-----------|-----------|
> | 128M           | 0%        | 0%        | -1%       |
> | 1G             | 0%        | 1%        | 0%        |
> | 20G            | 0%        | -24%      | -25%      |
> | 50G            | 0%        | -24%      | -20%      |
> 
> Optimization details (unit: seconds):
>
> | shared_buffers | master(512) | patched(512) | master(256) | patched(256) | master(128) | patched(128) |
> |----------------|-------------|--------------|-------------|--------------|-------------|--------------|
> | 128M           | 0.107       | 0.107        | 0.108       | 0.108        | 0.108       | 0.107        |
> | 1G             | 0.108       | 0.108        | 0.107       | 0.108        | 0.108       | 0.108        |
> | 20G            | 0.208       | 0.208        | 0.409       | 0.309        | 0.409       | 0.308        |
> | 50G            | 0.309       | 0.308        | 0.408       | 0.309        | 0.509       | 0.408        |

I will also post results from my machine in the next email.
Adding to what Amit mentioned, we should also test NBuffers/64, etc.,
until we determine which threshold performs worse than master.


Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-02 Thread k.jami...@fujitsu.com
On Wednesday, December 30, 2020 8:58 PM, Amit Kapila wrote:
> On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying
>  wrote:
> >
> > Hi Amit,
> >
> > In last
> > mail(https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local),
> > I've sent you the performance test results(run only 1 time) on single table.
> > Here are my retested results (averaged over 15 runs), which I think are
> > more accurate.
> >
> > In terms of 20G and 100G, the optimization on 100G is linear, but 20G is
> > nonlinear (also including test results on shared buffers of 50G/60G), so
> > it's a little difficult for me to decide the threshold from the two.
> > If we just consider 100G, I think NBuffers/32 is the optimized max relation
> > size. But I don't know how to judge for 20G. If you have any suggestion,
> > kindly let me know.
> >
> 
> Considering these results NBuffers/64 seems a good threshold as beyond
> that there is no big advantage. BTW, it is not clear why the advantage for
> single table is not as big as multiple tables with the Truncate command. Can
> you share your exact test steps for any one of the tests?
> Also, did you change autovacuum = off for these tests? If not, then the
> results might not be reliable, because before you run the test via the
> Vacuum command, autovacuum would have done that work.

Happy new year. The V38 LGTM.
Apologies for a bit of a delay in posting the test results, but since it's the
start of the commitfest, here they are, and the results were interesting.

I executed a VACUUM test using the same approach that Tsunakawa-san did in [1],
but this time measured the total time that DropRelFileNodeBuffers() took.
I used only a single relation and tried various sizes using the threshold
values NBuffers/512..NBuffers/1, as advised by Amit.

Example of relation sizes for NBuffers/512.
100GB shared_buffers:  200 MB 
20GB shared_buffers:   40 MB
1GB shared_buffers:2 MB
128MB shared_buffers:  0.25 MB

The regression, which means the patch performs worse than master, only happens
for relation size NBuffers/2 and below for all shared_buffers. The fastest
performance on a single relation was using the relation size NBuffers/512.

[VACUUM Recovery Performance on Single Relation]
Legend: P_XXX (Patch, NBuffers/XXX relation size),
M_XXX (Master, Nbuffers/XXX relation size)
Unit: seconds

| Rel Size | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b | 
|--||||| 
| P_512    |  0.012594  |  0.001989  |  0.000081  |  0.000012  | 
| M_512    |  0.208757  |  0.046212  |  0.002013  |  0.000295  | 
| P_256    |  0.026311  |  0.004416  |  0.000129  |  0.000021  | 
| M_256    |  0.241017  |  0.047234  |  0.002363  |  0.000298  | 
| P_128    |  0.044684  |  0.009784  |  0.000290  |  0.000042  | 
| M_128    |  0.253588  |  0.047952  |  0.002454  |  0.000319  | 
| P_64     |  0.065806  |  0.017444  |  0.000521  |  0.000075  | 
| M_64 |  0.268311  |  0.050361  |  0.002730  |  0.000339  | 
| P_32 |  0.121441  |  0.033431  |  0.001646  |  0.000112  | 
| M_32 |  0.285254  |  0.061486  |  0.003640  |  0.000364  | 
| P_16 |  0.255503  |  0.065492  |  0.001663  |  0.000144  | 
| M_16 |  0.377013  |  0.081613  |  0.003731  |  0.000372  | 
| P_8  |  0.560616  |  0.109509  |  0.005954  |  0.000465  | 
| M_8  |  0.692596  |  0.112178  |  0.006667  |  0.000553  | 
| P_4  |  1.109437  |  0.162924  |  0.011229  |  0.000861  | 
| M_4  |  1.162125  |  0.178764  |  0.011635  |  0.000935  | 
| P_2  |  2.202231  |  0.317832  |  0.020783  |  0.002646  | 
| M_2  |  1.583959  |  0.306269  |  0.015705  |  0.002021  | 
| P_1  |  3.080032  |  0.632747  |  0.032183  |  0.002660  | 
| M_1  |  2.705485  |  0.543970  |  0.030658  |  0.001941  | 

%reg = (Patched - Master)/Patched

| %reg_relsize | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b | 
|--||||| 
| %reg_512 | -1557.587% | -2223.006% | -2385.185% | -2354.167% | 
| %reg_256 | -816.041%  | -969.691%  | -1731.783% | -1319.048% | 
| %reg_128 | -467.514%  | -390.123%  | -747.008%  | -658.333%  | 
| %reg_64  | -307.727%  | -188.704%  | -423.992%  | -352.000%  | 
| %reg_32  | -134.891%  | -83.920%   | -121.097%  | -225.970%  | 
| %reg_16  | -47.557%   | -24.614%   | -124.279%  | -157.390%  | 
| %reg_8   | -23.542%   | -2.437%| -11.967%   | -19.010%   | 
| %reg_4   | -4.749%| -9.722%| -3.608%| -8.595%| 
| %reg_2   | 28.075%| 3.638% | 24.436%| 23.615%| 
| %reg_1   | 12.160%| 14.030%| 4.739% | 27.010%| 

Since our goal is to get the approximate threshold where the cost of finding
the buffers to be invalidated becomes higher in the optimized path than in
the traditional path:
A. Traditional Path
 1. For each buffer in shared_buffers, compare the relfilenode.
 2. LockBufHdr()
 3. Compare the block number, InvalidateBuffers() if it'

RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-06 Thread k.jami...@fujitsu.com
On Sunday, January 3, 2021 10:35 PM (JST), Amit Kapila wrote:
> On Sat, Jan 2, 2021 at 7:47 PM k.jami...@fujitsu.com
>  wrote:
> >
> > Happy new year. The V38 LGTM.
> > Apologies for a bit of delay on posting the test results, but since
> > it's the start of commitfest, here it goes and the results were interesting.
> >
> > I executed a VACUUM test using the same approach that Tsunakawa-san
> > did in [1], but only this time, the total time that DropRelFileNodeBuffers()
> took.
> >
> 
> Please specify the exact steps: did you delete all the rows from a table, or
> some of them, or none, before performing Vacuum? How did you measure this
> time; did you remove the cached check? It would be better if you share the
> scripts and/or the exact steps so that the same can be used by others to
> reproduce.

Basically, I used the TimestampDifference function in DropRelFileNodeBuffers().
I also executed DELETE before VACUUM, removed the "nBlocksToInvalidate <
BUF_DROP_FULL_SCAN_THRESHOLD" check, and used the relation size as the
threshold.

> Hmm, in the tests done by Tang, the results indicate that in some cases the
> patched version is slower at even NBuffers/32, so not sure if we can go to
> values shown by you unless she is doing something wrong. I think the
> difference in results could be because both of you are using different
> techniques to measure the timings, so it might be better if both of you can
> share scripts or exact steps used to measure the time and other can use the
> same technique and see if we are getting consistent results.

Right. Since we want consistent results, please disregard the approach that
I used.
I will resume testing similar to Tang's, because she also executed the
original failover test which I have been doing before.
To avoid confusion, and to check whether the results from me and Tang are
consistent, I also did the recovery/failover test for VACUUM on a single
relation, which I will send in a separate email after this.

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-06 Thread k.jami...@fujitsu.com
On Wed, January 6, 2021 7:04 PM (JST), I wrote:
> I will resume the test similar to Tang, because she also executed the original
> failover test which I have been doing before.
> To avoid confusion and to check if the results from mine and Tang are
> consistent, I also did the recovery/failover test for VACUUM on a single
> relation, which I will send in a separate email after this.

A. Test to find the right THRESHOLD

Below are the procedures and results of the VACUUM recovery performance
test on a single relation.
I followed the advice below and applied the supplementary patch
Test-for-threshold.patch on top of V39.
This ensures that we always enter the optimized path; we then use the
threshold as the relation size.

> >One idea could be to remove "nBlocksToInvalidate < 
> >BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached && 
> >nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it 
> >always use optimized path for the tests. Then use the relation size 
> >as NBuffers/128, NBuffers/256, NBuffers/512 for different values of 
> >shared buffers as 128MB, 1GB, 20GB, 100GB.

Each relation size is NBuffers/XXX, so I used the attached "rel.sh" script
to test from NBuffers/512 up to NBuffers/8 relation size per shared_buffers
setting. I did not go beyond NBuffers/8 because it took too much time, and I
could already observe significant results by then.

[Vacuum Recovery Performance on Single Relation]
1. Setup synchronous streaming replication. I used the configuration
  written at the bottom of this email.
2. [Primary] Create 1 table. (rel.sh create)
3. [Primary] Insert data of NBuffers/XXX size. Make sure to use the correct
   size for the set shared_buffers by commenting out the right size in "insert"
   of rel.sh script. (rel.sh insert)
4. [Primary] Delete table. (rel.sh delete)
5. [Standby] Optional: To double-check that DELETE is reflected on standby.
   SELECT count(*) FROM tableXXX;
  Make sure it returns 0.
6. [Standby] Pause WAL replay. (rel.sh pause)
  (This script will execute SELECT pg_wal_replay_pause(); .)
7. [Primary] VACUUM the single relation. (rel.sh vacuum)
8. [Primary] After the vacuum finishes, stop the server. (rel.sh stop)
   (The script will execute pg_ctl stop -D $PGDATA -w -mi)
9. [Standby] Resume WAL replay and promote the standby.
  (rel.sh resume)
  It basically prints a timestamp when resuming WAL replay,
  and prints another timestamp when the promotion is done.
  Compute the time difference.

[Results for VACUUM on single relation]
Average of 5 runs.

1. % REGRESSION
% Regression: (patched - master)/master

| rel_size | 128MB  | 1GB| 20GB   | 100GB| 
|--||||--| 
| NB/512   | 0.000% | 0.000% | 0.000% | -32.680% | 
| NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   | 
| NB/128   | 0.000% | 0.000% | 0.000% | -16.502% | 
| NB/64| 0.000% | 0.000% | 0.000% | -9.841%  | 
| NB/32| 0.000% | 0.000% | 0.000% | -6.219%  | 
| NB/16| 0.000% | 0.000% | 0.000% | 3.323%   | 
| NB/8 | 0.000% | 0.000% | 0.000% | 8.178%   |

For 100GB shared_buffers, we can observe regression
beyond NBuffers/32. So with this, we can conclude
that NBuffers/32 is the right threshold.
For NBuffers/16 and beyond, the patched version performs
worse than master. In other words, the cost of finding
the buffers to be invalidated becomes higher in the
optimized path than in the traditional path.

So in attached V39 patches, I have updated the threshold
BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.

2. [PATCHED]
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB | 
|--|---|---|---|---| 
| NB/512   | 0.106 | 0.106 | 0.106 | 0.206 | 
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 | 
| NB/128   | 0.106 | 0.106 | 0.206 | 0.506 | 
| NB/64| 0.106 | 0.106 | 0.306 | 0.907 | 
| NB/32| 0.106 | 0.106 | 0.406 | 1.508 | 
| NB/16| 0.106 | 0.106 | 0.706 | 3.109 | 
| NB/8 | 0.106 | 0.106 | 1.307 | 6.614 |

3. MASTER
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB | 
|--|---|---|---|---| 
| NB/512   | 0.106 | 0.106 | 0.106 | 0.306 | 
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 | 
| NB/128   | 0.106 | 0.106 | 0.206 | 0.606 | 
| NB/64| 0.106 | 0.106 | 0.306 | 1.006 | 
| NB/32| 0.106 | 0.106 | 0.406 | 1.608 | 
| NB/16| 0.106 | 0.106 | 0.706 | 3.009 | 
| NB/8 | 0.106 | 0.106 | 1.307 | 6.114 |

I used the following configurations:
[postgresql.conf]
shared_buffers = 100GB #20GB,1GB,128MB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 1
max_wal_size = 20GB

# For streaming replication from primary. Don't uncomment on Standby.
synchronous_commit = remote_write
synchronous_standby_names = 'walreceiver'

# For Standby. Don't uncomment on Primary.
# hot_standby = on
#primary_conninfo = 'host=... user=... port=... application_name=walreceiver'

--
B. Regression Test using the NBuffers/32 Threshold (V39 Patches)

For this one,

RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-07 Thread k.jami...@fujitsu.com
On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote:
> 
> On Wed, Jan 6, 2021 at 6:43 PM k.jami...@fujitsu.com
>  wrote:
> >
> > [Results for VACUUM on single relation]
> > Average of 5 runs.
> >
> > 1. % REGRESSION
> > % Regression: (patched - master)/master
> >
> > | rel_size | 128MB  | 1GB| 20GB   | 100GB|
> > |--||||--|
> > | NB/512   | 0.000% | 0.000% | 0.000% | -32.680% |
> > | NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   |
> > | NB/128   | 0.000% | 0.000% | 0.000% | -16.502% |
> > | NB/64| 0.000% | 0.000% | 0.000% | -9.841%  |
> > | NB/32| 0.000% | 0.000% | 0.000% | -6.219%  |
> > | NB/16| 0.000% | 0.000% | 0.000% | 3.323%   |
> > | NB/8 | 0.000% | 0.000% | 0.000% | 8.178%   |
> >
> > For 100GB shared_buffers, we can observe regression
> > beyond NBuffers/32. So with this, we can conclude
> > that NBuffers/32 is the right threshold.
> > For NBuffers/16 and beyond, the patched performs
> > worse than master. In other words, the cost of for finding
> > to be invalidated buffers gets higher in the optimized path
> > than the traditional path.
> >
> > So in attached V39 patches, I have updated the threshold
> > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
> >
> 
> Thanks for the detailed tests. NBuffers/32 seems like an appropriate
> value for the threshold based on these results. I would like to
> slightly modify part of the commit message in the first patch as below
> [1], otherwise, I am fine with the changes. Unless you or anyone else
> has any more comments, I am planning to push the 0001 and 0002
> sometime next week.
> 
> [1]
> "The recovery path of DropRelFileNodeBuffers() is optimized so that
> scanning of the whole buffer pool can be avoided when the number of
> blocks to be truncated in a relation is below a certain threshold. For
> such cases, we find the buffers by doing lookups in BufMapping table.
> This improves the performance by more than 100 times in many cases
> when several small tables (tested with 1000 relations) are truncated
> and where the server is configured with a large value of shared
> buffers (greater than 100GB)."

Thank you for taking a look at the results of the tests. They are also
consistent with the results from Tang.
The commit message LGTM.

Regards,
Kirk Jamison


RE: libpq debug log

2021-02-23 Thread k.jami...@fujitsu.com
> From: alvhe...@alvh.no-ip.org 
> I'll give this another look tomorrow, but I wanted to pass along that I prefer
> libpq-trace.{c,h} instead of libpq-logging.  I also renamed variable "pin" and
> pgindented.  I don't have any major reservations about this patch now, so I'll
> mark it ready-for-committer in case somebody else wants to have a look
> before push.
> 
> (Not sure about the use of the word "forcely")

Hi Iwata-san and Alvaro-san,

Thank you for updating the patch.
Although it's already set as "Ready for Committer", I just found minor parts
that need to be fixed.

(1) Doc: PQtraceSetFlags
+  flags contains flag bits describing the operating mode
+  of tracing.  If flags contains PQTRACE_SUPPRESS_TIMESTAMPS,
+  then timestamp is not printed with each message. If set to 0 (default),tracing will be output with timestamp.
+  This function should be called after calling PQtrace.

Missing space. And can be paraphrased to: 
"If set to 0 (default), tracing with timestamp is printed."


(2)
+ * pqTraceMaybeBreakLine:
+ * Check whether the backend message is complete. If so, print a line
+ * break and reset the buffer. If print break line, return 1.

The 2nd & last sentences can be combined into:
 "If so, print a line break, reset the buffer, and return 1."


(3) +PQtraceSetFlags(PGconn *conn, int flags)
+   /* If PQtrace() is failed, do noting. */

"If PQtrace() failed, do nothing."


(4)
> (Not sure about the use of the word "forcely")

I think it's not necessary.


Also, I tested the flag to not print timestamp. For example,
 PQtrace(conn, trace_file);
 PQtraceSetFlags(conn, PQTRACE_SUPPRESS_TIMESTAMPS);

And it did not print the timestamp. So it worked.
It also passed all the regression tests (although PQtrace() is not tested in
the existing libpq tests).


Regards,
Kirk Jamison




RE: PATCH: Batch/pipelining support for libpq

2021-03-02 Thread k.jami...@fujitsu.com
On Wed, Feb 17, 2021 8:14 AM (JST), Alvaro Herrera wrote:

Hi Alvaro,

> Here's a new version, where I've renamed everything to "pipeline".  I think the
> docs could use some additional tweaks now in order to make a coherent
> story on pipeline mode, how it can be used in a batched fashion, etc.

I tried applying this patch to test it on top of Iwata-san's libpq trace log
patch [1]. In my environment, the compiler complains.
It seems that in libpqwalreceiver.c: libpqrcv_exec(), the switch on
PQresultStatus() needs to handle the cases PGRES_PIPELINE_END and
PGRES_PIPELINE_ABORTED too.

 switch (PQresultStatus(pgres))
 {
 ...
 case PGRES_PIPELINE_ABORTED:
 ...
 case PGRES_PIPELINE_END:

Regards,
Kirk Jamison

[1] 
https://www.postgresql.org/message-id/flat/1d650278-552a-4bd2-8411-a907fd54446c%40www.fastmail.com#32df02a4e2a08185508a2126e6e0caf1




RE: libpq debug log

2021-03-05 Thread k.jami...@fujitsu.com
Hi Alvaro-san and Horiguchi-san,
CC) Iwata-san, Tsunakawa-san

Attached is the V23 of the libpq trace patch.

(1)
From: Álvaro Herrera 
> It appears that something is still wrong.  I applied libpq pipeline v27 from [1]
> and ran src/test/modules/test_libpq/pipeline singlerow, after patching it to
> do PQtrace() after PQconn(). Below is the output I get from that. The
> noteworthy point is that "ParseComplete" messages appear multiple times
> for some reason ... but that's quite odd, because if I look at the network
> traffic with Wireshark I certainly do not see the ParseComplete message
> being sent three times.

I applied Alvaro's libpq pipeline patch (v33) [1] on top of this libpq trace
patch, then traced test_singlerowmode in
src/test/modules/libpq_pipeline/libpq_pipeline.c.
The trace output is still the same even after applying Horiguchi-san's
recommended fixes.

>   Parse   38 "" "SELECT generate_series(42, $1)" #0
>   Bind20 "" "" #0 #1 2 '44' #1 #0
>   Describe6 P ""
>   Execute 9 "" 0
>   Parse   38 "" "SELECT generate_series(42, $1)" #0
>   Bind20 "" "" #0 #1 2 '45' #1 #0
>   Describe6 P ""
>   Execute 9 "" 0
>   Parse   38 "" "SELECT generate_series(42, $1)" #0
>   Bind20 "" "" #0 #1 2 '46' #1 #0
>   Describe6 P ""
>   Execute 9 "" 0
>   Sync4
<   ParseComplete   4
<   BindComplete4
<   RowDescription  40 #1 "generate_series" 0 #0 23 #4 -1 #0
<   DataRow 12 #1 2 '42'
<   DataRow 12 #1 2 '43'
<   DataRow 12 #1 2 '44'
<   CommandComplete 13 "SELECT 3"
<   ParseComplete   4
<   ParseComplete   4
<   ParseComplete   4
<   BindComplete4
<   RowDescription  40 #1 "generate_series" 0 #0 23 #4 -1 #0
<   DataRow 12 #1 2 '42'
<   DataRow 12 #1 2 '43'
<   DataRow 12 #1 2 '44'
<   DataRow 12 #1 2 '45'
<   CommandComplete 13 "SELECT 4"
<   ParseComplete   4
<   ParseComplete   4
<   ParseComplete   4
<   BindComplete4
<   RowDescription  40 #1 "generate_series" 0 #0 23 #4 -1 #0
<   DataRow 12 #1 2 '42'
<   DataRow 12 #1 2 '43'
<   DataRow 12 #1 2 '44'
<   DataRow 12 #1 2 '45'
<   DataRow 12 #1 2 '46'
<   CommandComplete 13 "SELECT 5"
<   ReadyForQuery   5 I

ParseComplete still appears 3 times, but I wonder if that is expected only
for pipeline's singlerowmode test.
From my observation, when I traced the whole regression test output by
putting PQtrace() in fe-connect.c: connectDBComplete(), ParseComplete appears
only once or twice, etc. for other commands.
In other words, ISTM it's decided on a case-by-case basis.
Some examples from the whole regression trace output are below:
a.
>   Describe6 P ""
>   Execute 9 "" 0
>   Sync4
<   ParseComplete   4
<   BindComplete4
b.
<   ErrorResponse   217 S "ERROR" V "ERROR" C "42883" M "function no_such_function(integer) does not exist" H "No function matches the given name and argument types. You might need to add explicit type casts." P "8" F "parse_func.c" L "629" R "ParseFuncOrColumn" \x00
<   ReadyForQuery   5 I
<   ParseComplete   4
<   ParseComplete   4
c.
>   Bind31 "" "my_insert" #0 #1 4 '9652' #1 #0
>   Describe6 P ""
>   Execute 9 "" 0
<   ParseComplete   4
<   ParseComplete   4
<   ParseComplete   4
<   BindComplete4
<   NoData  4
d.
<   CommandComplete 15 "INSERT 0 2"
<   ReadyForQuery   5 T
>   Parse   89 "i" "insert into test (name, amount, letter) select name, amount+$1, letter from test" #0
>   Sync4
<   ParseComplete   4
<   ReadyForQuery   5 T

As you can see from the examples above, ParseComplete appears multiple times
or once, depending on the test.
As for libpq_pipeline's test_singlerowmode, it appears 3x for the
CommandComplete SELECT.
I am now wondering whether or not it's a bug. Maybe it's not.
Thoughts?


How I traced the whole regression test: fe-connect.c: connectDBComplete()

@@ -2114,6 +2137,7 @@ connectDBComplete(PGconn *conn)
int timeout = 0;
int last_whichhost = -2;/* certainly different from whichhost */
struct addrinfo *last_addr_cur = NULL;
+   FILE*trace_file;
 
if (conn == NULL || conn->status == CONNECTION_BAD)
return 0;
@@ -2171,6 +2195,9 @@ connectDBComplete(PGconn *conn)
switch (flag)
{
case PGRES_POLLING_OK:
trace_file = fopen("/home/postgres/tracelog/regression_trace

RE: [Patch] Optimize dropping of relation buffers using dlist

2021-03-11 Thread k.jami...@fujitsu.com
> From: Tsunakawa, Takayuki/綱川 貴之 
> From: Kyotaro Horiguchi 
> > About the patch, it would be better to change the type of
> > BUF_DROP_FULL_SCAN_THRESHOLD to uint64, even though the current
> value
> > doesn't harm.
> 
> OK, attached, to be prepared for the distant future when NBuffers becomes
> 64-bit.

Thank you, Tsunakawa-san, for sending the quick fix. (I failed to notice it in
my thread.)

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-28 Thread k.jami...@fujitsu.com
Hi everyone,

Attached is the updated set of patches (V28).
0004 (TRUNCATE optimization) is a new patch, while the rest are similar to V27.
These pass the build, regression, and TAP tests.

Apologies for the delay.
I'll post the benchmark test results on SSD soon, considering the benchmark
suggested by Horiguchi-san:
> Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to NBuffers / 512,
> which is quite arbitrary and comes from a wild guess.
> 
> Perhaps we need to run benchmarks that drops one relation with several
> different ratios between the number of buffers to-be-dropped and Nbuffers,
> and preferably both on spinning rust and SSD.

Regards,
Kirk Jamison


v28-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v28-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v28-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v28-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v28-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v28-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v28-0004-TRUNCATE-optimization.patch
Description: v28-0004-TRUNCATE-optimization.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-03 Thread k.jami...@fujitsu.com
Hi,

I've updated patch 0004 (TRUNCATE optimization), with Tsunakawa-san's previous
comments already addressed. (Thank you very much for the review.)
The change compared to the previous version is that in
DropRelFileNodesAllBuffers() we no longer check the accurate flag when
deciding whether to optimize.
For relations whose blocks do not exceed the threshold for a full scan, we
call DropRelFileNodeBuffers(), where the flag will be checked anyway.
Otherwise, we proceed to the traditional buffer scan. Thoughts?

I've done recovery performance testing for TRUNCATE.
Test case: 1 parent table with 100 child partitions. TRUNCATE each child 
partition (1 transaction per table).
Currently, it takes a while to recover when we have a large shared_buffers
setting, but with the patch applied the recovery time is almost constant
(0.206 s below).

| s_b   | master | patched |
|-------|--------|---------|
| 128MB | 0.105  | 0.105   |
| 1GB   | 0.205  | 0.205   |
| 20GB  | 2.008  | 0.206   |
| 100GB | 9.315  | 0.206   |

Method of Testing (assuming streaming replication is configured):
1. Create 1 parent table and 100 child partitions
2. Insert data to each table. 
3. Pause WAL replay on standby. ( SELECT pg_wal_replay_pause(); )
4. TRUNCATE each child partition on primary (1 transaction per table). Stop
the primary.
5. Resume the WAL replay and promote standby. ( SELECT pg_wal_replay_resume();
pg_ctl promote )
I have confirmed that the relations became empty on standby.

Your thoughts, feedback are very much appreciated.

Regards,
Kirk Jamison


v29-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v29-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v29-0004-TRUNCATE-optimization.patch
Description: v29-0004-TRUNCATE-optimization.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-04 Thread k.jami...@fujitsu.com
On Thursday, November 5, 2020 10:22 AM, Horiguchi-san wrote:
> Hello.
> 
> Many of the quetions are on the code following my past suggestions.

Yeah, I was also about to answer with the feedback you have given.
Thank you for replying and taking a look too.

> At Wed, 4 Nov 2020 15:59:17 +0530, Amit Kapila 
> wrote in
> > On Wed, Nov 4, 2020 at 8:28 AM k.jami...@fujitsu.com
> >  wrote:
> > >
> > > Hi,
> > >
> > > I've updated the patch 0004 (Truncate optimization) with the
> > > previous comments of Tsunakawa-san already addressed in the patch.
> > > (Thank you very much for the review.) The change here compared to
> > > the previous version is that in DropRelFileNodesAllBuffers() we don't
> check for the accurate flag anymore when deciding whether to optimize or
> not.
> > > For relations with blocks that do not exceed the threshold for full
> > > scan, we call DropRelFileNodeBuffers where the flag will be checked
> > > anyway. Otherwise, we proceed to the traditional buffer scan. Thoughts?
> > >
> >
> > Can we do a Truncate optimization once we decide about your other
> > patch as I see a few problems with it? If we can get the first patch
> > (vacuum optimization) committed it might be a bit easier for us to get
> > the truncate optimization. If possible, let's focus on (auto)vacuum
> > optimization first.

Sure. That'd be better.

> > Few comments on patches:
> > ==
> > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > --
> > -
> > 1.
> > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> >  {
> >   BlockNumber result;
> >
> >   /*
> >   * For now, we only use cached values in recovery due to lack of a
> > shared
> > - * invalidation mechanism for changes in file size.
> > + * invalidation mechanism for changes in file size.  The cached
> > + values
> > + * could be smaller than the actual number of existing buffers of the file.
> > + * This is caused by lseek of buggy Linux kernels that might not have
> > + * accounted for the recent write.
> >   */
> >   if (InRecovery && reln->smgr_cached_nblocks[forknum] !=
> > InvalidBlockNumber)
> > + {
> > + if (accurate != NULL)
> > + *accurate = true;
> > +
> >
> > I don't understand this comment. Few emails back, I think we have
> > discussed that cached value can't be less than the number of buffers
> > during recovery. If that happens to be true then we have some problem.
> > If you want to explain 'accurate' variable then you can do the same
> > atop of function. Would it be better to name this variable as
> > 'cached'?
> 
> (I agree that the comment needs to be fixed.)
> 
> FWIW I don't think 'cached' suggests the characteristics of the returned value
> on its interface.  It was introduced to reduce fseek() calls, and after that 
> we
> have found that it can be regarded as the authoritative source of the file 
> size.
> The "accurate" means that it is guaranteed that we don't have a buffer for the
> file blocks further than that number.  I don't come up with a more proper
> word than "accurate" but also I don't think "cached" is proper here.

I also couldn't think of a better parameter name. "Accurate" seems to be a
better fit, as it describes a measurement close to an accepted value.
How about fixing the comment like below, would this suffice?

/*
 *  smgrnblocks() -- Calculate the number of blocks in the
 *                   supplied relation.
 *
 *  The accurate flag acts as an authoritative source of the file size
 *  and ensures that no buffers exist for blocks after the file size is
 *  known to the process.
 */
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
{
BlockNumber result;

	/*
	 * For now, we only use cached values in recovery due to lack of a shared
	 * invalidation mechanism for changes in file size.  In recovery, the
	 * cached value returned by the first lseek could be smaller than the
	 * actual number of existing buffers of the file, which is caused by
	 * buggy Linux kernels that might not have accounted for the recent
	 * write.  However, we can still rely on the cached value even if we get
	 * a bogus value from the first lseek since it is impossible to have
	 * buffer for blocks

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-05 Thread k.jami...@fujitsu.com
On Thursday, October 22, 2020 3:15 PM, Kyotaro Horiguchi 
 wrote:
> I'm not sure about the exact steps of the test, but it can be expected if we
> have many small relations to truncate.
> 
> Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512,
> which is quite arbitrary that comes from a wild guess.
> 
> Perhaps we need to run benchmarks that drops one relation with several
> different ratios between the number of buffers to-be-dropped and Nbuffers,
> and preferably both on spinning rust and SSD.

Sorry to get back to you on this just now.
Since we're prioritizing the vacuum patch, we also need to finalize which 
threshold value to use.
I proceeded with testing using my latest set of patches, because Amit-san's
comments on the code, the ones we have addressed, don't really affect the
performance. I'll post the updated patches for 0002 & 0003 after we come up
with the proper boolean parameter name for smgrnblocks and the buffer full
scan threshold value.

Test the VACUUM performance with the following thresholds:
   NBuffers/512, NBuffers/256, NBuffers/128,
and determine which of the ratios has the best performance in terms of speed.

I tested this on my machine (4 vCPU, 8GB memory, ext4) running on SSD, with a
streaming replication environment configured.
shared_buffers = 100GB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min

[Steps]
1. Create TABLE
2. INSERT data
3. DELETE from TABLE
4. Pause WAL replay on standby
5. VACUUM. Stop the primary.
6. Resume WAL replay and promote standby.

With 1 relation, there were no significant changes that we can observe:
(In seconds)
| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 0.106  | 0.105        | 0.105        | 0.105        |
| 100GB | 0.106  | 0.105        | 0.105        | 0.105        |

So I tested with 100 tables and got more convincing measurements:

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 1.006  | 1.007        | 1.006        | 0.107        |
| 1GB   | 0.706  | 0.606        | 0.606        | 0.605        |
| 20GB  | 1.907  | 0.606        | 0.606        | 0.605        |
| 100GB | 7.013  | 0.706        | 0.606        | 0.607        |

The threshold NBuffers/128 has the best performance for the default
shared_buffers (128MB) at 0.107 s, and performs equally well with large
shared_buffers up to 100GB.

We can use NBuffers/128 for the threshold, although I don't have a measurement 
for HDD yet. 
However, I wonder if the above method would suffice to determine the final 
threshold that we
can use. If anyone has suggestions on how we can come up with the final value, 
like if I need
to modify some steps above, I'd appreciate it.

Regarding the parameter name: instead of "accurate", we can use "cached" as
originally intended in the early versions of the patch, since it is the smgr
that handles smgrnblocks to get the block count from smgr_cached_nblocks.
"accurate" may confuse us because the cached value may not actually be
accurate.

Regards,
Kirk Jamison





RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread k.jami...@fujitsu.com
> From: k.jami...@fujitsu.com 
> On Thursday, October 22, 2020 3:15 PM, Kyotaro Horiguchi
>  wrote:
> > I'm not sure about the exact steps of the test, but it can be expected
> > if we have many small relations to truncate.
> >
> > Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512,
> which
> > is quite arbitrary that comes from a wild guess.
> >
> > Perhaps we need to run benchmarks that drops one relation with several
> > different ratios between the number of buffers to-be-dropped and
> > Nbuffers, and preferably both on spinning rust and SSD.
> 
> Sorry to get back to you on this just now.
> Since we're prioritizing the vacuum patch, we also need to finalize which
> threshold value to use.
> I proceeded testing with my latest set of patches because Amit-san's
> comments on the code, the ones we addressed, don't really affect the
> performance. I'll post the updated patches for 0002 & 0003 after we come up
> with the proper boolean parameter name for smgrnblocks and the buffer full
> scan threshold value.
> 
> Test the VACUUM performance with the following thresholds:
>NBuffers/512, NBuffers/256, NBuffers/128, and determine which of the
> ratio has the best performance in terms of speed.
> 
> I tested this on my machine (CPU 4v, 8GB memory, ext4) running on SSD.
> Configure streaming replication environment.
> shared_buffers = 100GB
> autovacuum = off
> full_page_writes = off
> checkpoint_timeout = 30min
> 
> [Steps]
> 1. Create TABLE
> 2. INSERT data
> 3. DELETE from TABLE
> 4. Pause WAL replay on standby
> 5. VACUUM. Stop the primary.
> 6. Resume WAL replay and promote standby.
> 
> With 1 relation, there were no significant changes that we can observe:
> (In seconds)
> | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> |---||--|--|--|
> | 128MB | 0.106  | 0.105| 0.105| 0.105|
> | 100GB | 0.106  | 0.105| 0.105| 0.105|
> 
> So I tested with 100 tables and got more convincing measurements:
> 
> | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> |---||--|--|--|
> | 128MB | 1.006  | 1.007| 1.006| 0.107|
> | 1GB   | 0.706  | 0.606| 0.606| 0.605|
> | 20GB  | 1.907  | 0.606| 0.606| 0.605|
> | 100GB | 7.013  | 0.706| 0.606| 0.607|
> 
> The threshold NBuffers/128 has the best performance for default
> shared_buffers (128MB) with 0.107 s, and equally performing with large
> shared_buffers up to 100GB.
> 
> We can use NBuffers/128 for the threshold, although I don't have a
> measurement for HDD yet.
> However, I wonder if the above method would suffice to determine the final
> threshold that we can use. If anyone has suggestions on how we can come
> up with the final value, like if I need to modify some steps above, I'd
> appreciate it.
> 
> Regarding the parameter name. Instead of accurate, we can use "cached" as
> originally intended from the early versions of the patch since it is the smgr
> that handles smgrnblocks to get the the block size of smgr_cached_nblocks..
> "accurate" may confuse us because the cached value may not be actually
> accurate..

Hi, 

So I proceeded to update the patches using the "cached" parameter and updated
the corresponding comments to it in 0002.

I've addressed the suggestions and comments of Amit-san on 0003:
1. For readability, I moved the code block to a new static function 
FindAndDropRelFileNodeBuffers()
2. Initialize the bool cached with false.
3. It's also decided that we don't need the extra pre-checking of RelFileNode
when locking the bufhdr in FindAndDropRelFileNodeBuffers

I repeated the recovery performance test for vacuum. (I made a mistake 
previously in NBuffers/128)
The 3 kinds of thresholds are almost equally performant. I chose NBuffers/256 
for this patch.

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 1.006  | 1.007        | 1.007        | 1.007        |
| 1GB   | 0.706  | 0.606        | 0.606        | 0.606        |
| 20GB  | 1.907  | 0.606        | 0.606        | 0.606        |
| 100GB | 7.013  | 0.706        | 0.606        | 0.606        |

Although we said that we'll prioritize the vacuum optimization first, I've
also updated the 0004 patch (truncate optimization), which addresses the
previous concern of slower truncate due to redundant lookups of
already-dropped buffers. In the new patch, we initially drop relation buffers
using the opt

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread k.jami...@fujitsu.com
On Tuesday, November 10, 2020 12:27 PM, Horiguchi-san wrote:
> To: amit.kapil...@gmail.com
> Cc: Jamison, Kirk/ジャミソン カーク ; Tsunakawa,
> Takayuki/綱川 貴之 ; t...@sss.pgh.pa.us;
> and...@anarazel.de; robertmh...@gmail.com;
> tomas.von...@2ndquadrant.com; pgsql-hack...@postgresql.org
> Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
> 
> At Tue, 10 Nov 2020 08:33:26 +0530, Amit Kapila 
> wrote in
> > On Tue, Nov 10, 2020 at 8:19 AM k.jami...@fujitsu.com
> >  wrote:
> > >
> > > I repeated the recovery performance test for vacuum. (I made a
> > > mistake previously in NBuffers/128) The 3 kinds of thresholds are almost
> equally performant. I chose NBuffers/256 for this patch.
> > >
> > > | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> > > |---||--|--|--|
> > > | 128MB | 1.006  | 1.007| 1.007| 1.007|
> > > | 1GB   | 0.706  | 0.606| 0.606| 0.606|
> > > | 20GB  | 1.907  | 0.606| 0.606| 0.606|
> > > | 100GB | 7.013  | 0.706| 0.606| 0.606|
> > >
> >
> > I think this data is not very clear. What is the unit of time? What is
> > the size of the relation used for the test? Did the test use an
> > optimized path for all cases? If at 128MB, there is no performance
> > gain, can we consider the size of shared buffers as 256MB as well for
> > the threshold?
> 
> In the previous testing, it was shown as:
> 
> Recovery Time (in seconds)
> | s_b   | master | patched | %reg   |
> |---||-||
> | 128MB | 3.043  | 2.977   | -2.22% |
> | 1GB   | 3.417  | 3.41| -0.21% |
> | 20GB  | 20.597 | 2.409   | -755%  |
> | 100GB | 66.862 | 2.409   | -2676% |
> 
> 
> So... The numbers seems to be in seconds, but the master gets about 10
> times faster than this result for uncertain reasons. It seems that the result
> contains something different from the difference by this patch as a larger
> part.

The unit is in seconds.
The results that Horiguchi-san mentioned were from the old test case, where I
vacuumed a database with 1000 relations that had been deleted.
I used a new test case in my last results, which is why the numbers are
smaller:
VACUUM 1 parent table (350 MB) and 100 child partition tables (6 MB each)
in separate transactions after deleting the tables. After vacuum, the
parent table became 16kB and each child table 2224kB.

I added the test for 256MB shared_buffers, and the performance is also almost 
the same.
We gain performance benefits for the larger shared_buffers.

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 1.006  | 1.007        | 1.007        | 1.007        |
| 256MB | 1.006  | 1.006        | 0.906        | 0.906        |
| 1GB   | 0.706  | 0.606        | 0.606        | 0.606        |
| 20GB  | 1.907  | 0.606        | 0.606        | 0.606        |
| 100GB | 7.013  | 0.706        | 0.606        | 0.606        |

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-11 Thread k.jami...@fujitsu.com
On Tuesday, November 10, 2020 3:10 PM, Tsunakawa-san wrote:
> From: Jamison, Kirk/ジャミソン カーク 
> > So I proceeded to update the patches using the "cached" parameter and
> > updated the corresponding comments to it in 0002.
> 
> OK, I'm in favor of the name "cached" now, although I first agreed with
> Horiguchi-san in that it's better to use a name that represents the nature
> (accurate) of information rather than the implementation (cached).  Having
> a second thought, since smgr is a component that manages relation files on
> storage (file system), lseek(SEEK_END) is the accurate value for smgr.  The
> cached value holds a possibly stale size up to which the relation has
> extended.
> 
> 
> The patch looks almost good except for the minor ones:

Thank you for the review!

> (1)
> +extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber
> forknum,
> +bool *accurate);
> 
> It's still accurate here.

Already fixed in 0002.

> (2)
> + *   This is only called in recovery when the block count of any
> fork is
> + *   cached and the total number of to-be-invalidated blocks per
> relation
> 
> count of any fork is
> -> counts of all forks are

Fixed in 0003.

> (3)
> In 0004, I thought you would add the invalidated block counts of all relations
> to determine if the optimization is done, as Horiguchi-san suggested.  But I
> find the current patch okay too.

Yeah, I found my approach easier to implement. The new change in 0004 is that
when entering the optimized path we now call FindAndDropRelFileNodeBuffers()
instead of DropRelFileNodeBuffers().

I have attached all the updated patches.
I'd appreciate your feedback.

Regards,
Kirk Jamison


v31-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v31-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v31-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v31-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v31-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v31-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v31-0004-TRUNCATE-optimization.patch
Description: v31-0004-TRUNCATE-optimization.patch


RE: Cache relation sizes?

2020-11-16 Thread k.jami...@fujitsu.com
On Tuesday, November 17, 2020 9:40 AM, Tsunkawa-san wrote:
> From: Thomas Munro 
> > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
> >  wrote:
> > > I will look at your implementation more precisely latter.
> >
> > Thanks!  Warning:  I thought about making a thing like this for a
> > while, but the patch itself is only a one-day prototype, so I am sure
> > you can find many bugs...  Hmm, I guess the smgrnblocks_fast() code
> > really needs a nice big comment to explain the theory about why this
> > value is fresh enough, based on earlier ideas about implicit memory
> > barriers.  (Something like: if you're extending, you must have
> > acquired the extension lock; if you're scanning you must have acquired
> > a snapshot that only needs to see blocks that existed at that time and
> > concurrent truncations are impossible; I'm wondering about special
> > snapshots like VACUUM, and whether this is too relaxed for that, or if
> > it's covered by existing horizon-based interlocking...).
> 
> We'd like to join the discussion and review, too, so that Kirk's optimization 
> for
> VACUUM truncation and TRUNCATRE can extend beyond recovery to
> operations during normal operation.  In that regard, would you mind
> becoming a committer for his patch?  We think it's committable now.  I'm
> fine with using the unmodified smgrnblocks() as your devil's suggestion.

Just saw this thread has been bumped. And yes, regarding the vacuum and truncate
optimization during recovery, it would still work even without the unmodified 
smgrnblocks.
We'd just need to remove the conditions that check the flag parameter. And if
we explore your latest prototype patch, it can possibly use the new
smgrnblocks_ APIs introduced here.
That said, we'd like to explore the next step of extending the feature to 
normal operations.

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-18 Thread k.jami...@fujitsu.com
On Thursday, November 12, 2020 1:14 PM, Tsunakawa-san wrote:
> The patch looks OK.  I think as Thomas-san suggested, we can remove the
> modification to smgrnblocks() and don't care wheter the size is cached or not.
> But I think the current patch is good too, so I'd like to leave it up to a
> committer to decide which to choose.
> I measured performance in a different angle -- the time
> DropRelFileNodeBuffers() and DropRelFileNodeAllBuffers() took.  That
> reveals the direct improvement and degradation.
> 
> I used 1,000 tables, each of which is 1 MB.  I used shared_buffers = 128 MB
> for the case where the traditional full buffer scan is done, and 
> shared_buffers
> = 100 GB for the case where the optimization path takes effect.
> 
> The results are almost good as follows:
> 
> A. UNPATCHED
> 
> 128 MB shared_buffers
> 1. VACUUM = 0.04 seconds
> 2. TRUNCATE = 0.04 seconds
> 
> 100 GB shared_buffers
> 3. VACUUM = 69.4 seconds
> 4. TRUNCATE = 69.1 seconds
> 
> 
> B. PATCHED
> 
> 128 MB shared_buffers (full scan)
> 5. VACUUM = 0.04 seconds
> 6. TRUNCATE = 0.07 seconds
> 
> 100 GB shared_buffers (optimized path)
> 7. VACUUM = 0.02 seconds
> 8. TRUNCATE = 0.08 seconds
> 
> 
> So, I'd like to mark this as ready for committer.
I forgot to reply.
Thank you very much Tsunakawa-san for testing and to everyone
who has provided their reviews and insights as well.

Now thinking about smgrnblocks(), Thomas Munro is currently also working on
implementing a shared SmgrRelation [1] to store sizes. However, since that is
still under development and the discussion is still ongoing, I hope we can
first commit this set of patches, as they are already in committable form. I
think it's alright to accept the early improvements implemented in this
thread into the source code.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2B7Ok26MHiFWVEiAy2UMgHkrCieycQ1eFdA%3Dt2JTfUgwA%40mail.gmail.com

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-25 Thread k.jami...@fujitsu.com
On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> From: Andres Freund 
> > DropRelFileNodeBuffers() in recovery? The most common path is
> > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > which 3/4 doesn't address and 4/4 doesn't mention.
> >
> > 4/4 seems to address DropRelationFiles(), but only talks about
> TRUNCATE?
> 
> Yes.  DropRelationFiles() is used in the following two paths:
> 
> [Replay of TRUNCATE during recovery]
> xact_redo_commit/abort() -> DropRelationFiles()  -> smgrdounlinkall() ->
> DropRelFileNodesAllBuffers()
> 
> [COMMIT/ROLLBACK PREPARED]
> FinishPreparedTransaction() -> DropRelationFiles()  -> smgrdounlinkall()
> -> DropRelFileNodesAllBuffers()

Yes. The concern is that it was not clear in the function descriptions and 
commit logs
what the optimizations for the DropRelFileNodeBuffers() and 
DropRelFileNodesAllBuffers()
are for. So I revised only the function description of DropRelFileNodeBuffers() 
and the
commit logs of the 0003-0004 patches. Please check if the brief descriptions 
would suffice.


> > I also don't get why 4/4 would be a good idea on its own. It uses
> > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> lookups a lot
> > of times, once for each of nnodes relations?
> 
> So, the threshold value should be compared with the total number of blocks
> of all target relations, not each relation.  You seem to be right, got it.

Fixed this in the 0004 patch. Now we compare the total number of
buffers-to-be-invalidated for ALL relations to BUF_DROP_FULL_SCAN_THRESHOLD.

> > Also, how is 4/4 safe - this is outside of recovery too?
> 
> It seems that DropRelFileNodesAllBuffers() should trigger the new
> optimization path only when InRecovery == true, because it intentionally
> doesn't check the "accurate" value returned from smgrnblocks().

Fixed it in the 0004 patch. Now we ensure that we enter the optimization path
only if in recovery.
 

> From: Amit Kapila 
> On Wed, Nov 18, 2020 at 11:43 PM Andres Freund 
> > I'm also worried about the cases where this could cause buffers left
> > in the buffer pool, without a crosscheck like Thomas' patch would
> > allow to add. Obviously other processes can dirty buffers in
> > hot_standby, so any leftover buffer could have bad consequences.
> >
> 
> The problem can only arise if other processes extend the relation. The idea
> was that in recovery it extends relation by one process which helps to
> maintain the cache. Kirk seems to have done testing to cross-verify it by 
> using
> his first patch (Prevent-invalidating-blocks-in-smgrextend-during). Which
> other crosscheck you are referring here?
> 
> I agree that we can do a better job by expanding comments to clearly state
> why it is safe.

Yes, basically what Amit-san also mentioned above. The first patch prevents 
that.
And in the description of DropRelFileNodeBuffers in the 0003 patch, please
check if that would suffice.


> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber
> *forkNum,
> > int nforks,
> > + BlockNumber
> > *nForkBlocks, BlockNumber *firstDelBlock) ...
> > +   /* Check that it is in the buffer pool. If not, do
> nothing.
> > */
> > +   LWLockAcquire(bufPartitionLock, LW_SHARED);
> > +   buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> > +   bufHdr = GetBufferDescriptor(buf_id);
> > +
> > +   buf_state = LockBufHdr(bufHdr);
> > +
> > +   if (RelFileNodeEquals(bufHdr->tag.rnode, rnode)
> &&
> > +   bufHdr->tag.forkNum == forkNum[i] &&
> > +   bufHdr->tag.blockNum >= firstDelBlock[i])
> > +   InvalidateBuffer(bufHdr);   /* releases
> > spinlock */
> > +   else
> > +   UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition
> > lock, and have done a lookup in the mapping table. Why are we then
> > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > holding the buffer header lock, which is essentially a spinlock, so
> > should only ever be held for very short portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc,
> > but there the situation is different: We haven't done a buffer mapping
> > lookup, and we don't hold a partition lock!
> 
> That's because the buffer partition lock is released immediately after the 
> hash
> table has been looked up.  As an aside, InvalidateBuffer() requires the caller
> to hold the buffer header spinlock and doesn't hold the buffer partition lock.

Yes. Holding the buffer header 

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-25 Thread k.jami...@fujitsu.com
> From: k.jami...@fujitsu.com 
> On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > From: Andres Freund 
> > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers()
> > > , which 3/4 doesn't address and 4/4 doesn't mention.
> > >
> > > 4/4 seems to address DropRelationFiles(), but only talks about
> > TRUNCATE?
> >
> > Yes.  DropRelationFiles() is used in the following two paths:
> >
> > [Replay of TRUNCATE during recovery]
> > xact_redo_commit/abort() -> DropRelationFiles()  -> smgrdounlinkall()
> > ->
> > DropRelFileNodesAllBuffers()
> >
> > [COMMIT/ROLLBACK PREPARED]
> > FinishPreparedTransaction() -> DropRelationFiles()  ->
> > smgrdounlinkall()
> > -> DropRelFileNodesAllBuffers()
> 
> Yes. The concern is that it was not clear in the function descriptions and
> commit logs what the optimizations for the DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() are for. So I revised only the function
> description of DropRelFileNodeBuffers() and the commit logs of the
> 0003-0004 patches. Please check if the brief descriptions would suffice.
> 
> 
> > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> > lookups a lot
> > > of times, once for each of nnodes relations?
> >
> > So, the threshold value should be compared with the total number of
> > blocks of all target relations, not each relation.  You seem to be right, 
> > got it.
> 
> Fixed this in 0004 patch. Now we compare the total number of
> buffers-to-be-invalidated For ALL relations to the
> BUF_DROP_FULL_SCAN_THRESHOLD.
> 
> > > Also, how is 4/4 safe - this is outside of recovery too?
> >
> > It seems that DropRelFileNodesAllBuffers() should trigger the new
> > optimization path only when InRecovery == true, because it
> > intentionally doesn't check the "accurate" value returned from
> smgrnblocks().
> 
> Fixed it in 0004 patch. Now we ensure that we only enter the optimization path
> Iff during recovery.
> 
> 
> > From: Amit Kapila  On Wed, Nov 18, 2020 at
> > 11:43 PM Andres Freund 
> > > I'm also worried about the cases where this could cause buffers left
> > > in the buffer pool, without a crosscheck like Thomas' patch would
> > > allow to add. Obviously other processes can dirty buffers in
> > > hot_standby, so any leftover buffer could have bad consequences.
> > >
> >
> > The problem can only arise if other processes extend the relation. The
> > idea was that in recovery it extends relation by one process which
> > helps to maintain the cache. Kirk seems to have done testing to
> > cross-verify it by using his first patch
> > (Prevent-invalidating-blocks-in-smgrextend-during). Which other
> crosscheck you are referring here?
> >
> > I agree that we can do a better job by expanding comments to clearly
> > state why it is safe.
> 
> Yes, basically what Amit-san also mentioned above. The first patch prevents
> that.
> And in the description of DropRelFileNodeBuffers in the 0003 patch, please
> check If that would suffice.
> 
> 
> > > Smaller comment:
> > >
> > > +static void
> > > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber
> > *forkNum,
> > > int nforks,
> > > +   BlockNumber
> > > *nForkBlocks, BlockNumber *firstDelBlock) ...
> > > + /* Check that it is in the buffer pool. If not, do
> > nothing.
> > > */
> > > + LWLockAcquire(bufPartitionLock, LW_SHARED);
> > > + buf_id = BufTableLookup(&bufTag, bufHash);
> > > ...
> > > + bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > + buf_state = LockBufHdr(bufHdr);
> > > +
> > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode)
> > &&
> > > + bufHdr->tag.forkNum == forkNum[i] &&
> > > + bufHdr->tag.blockNum >= firstDelBlock[i])
> > > + InvalidateBuffer(bufHdr);   /

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-26 Thread k.jami...@fujitsu.com
> From: Kyotaro Horiguchi 
> Hello, Kirk. Thank you for the new version.

Hi, Horiguchi-san. Thank you for your very helpful feedback.
I'm updating the patches addressing those.

> + if (!smgrexists(rels[i], j))
> + continue;
> +
> + /* Get the number of blocks for a relation's fork */
> + blocks[i][numForks] = smgrnblocks(rels[i], j,
> NULL);
> 
> If we see a fork which its size is not cached we must give up this 
> optimization
> for all target relations.

I did not use the "cached" flag in DropRelFileNodesAllBuffers() and instead used
InRecovery when deciding whether to optimize, for the following reasons:
XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation page
contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called
during VACUUM replay because VACUUM changes the page contents.
OTOH, TRUNCATE doesn't change the page contents; it just truncates relation pages.
So XLogReadBufferExtended() is not called, and the "cached" flag always returns
false. I tested with the "cached" flag before, and it always returned false, at
least in DropRelFileNodesAllBuffers(). Because of this, we cannot use the cached
flag in DropRelFileNodesAllBuffers(). However, I think we can still rely on
smgrnblocks() to get the file size as long as we're InRecovery: that cached
nblocks is still guaranteed to be the maximum in the shared buffers.
Thoughts?

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-12 Thread k.jami...@fujitsu.com
On Wed, January 13, 2021 2:15 PM (JST), Amit Kapila wrote:
> On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila
> >  wrote in
> > > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi
> > >  wrote:
> > > >
> > > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com"
>  wrote in:
> > > > > > Thanks for the detailed tests. NBuffers/32 seems like an
> > > > > > appropriate value for the threshold based on these results. I
> > > > > > would like to slightly modify part of the commit message in
> > > > > > the first patch as below [1], otherwise, I am fine with the
> > > > > > changes. Unless you or anyone else has any more comments, I am
> > > > > > planning to push the 0001 and 0002 sometime next week.
> > > > > >
> > > > > > [1]
> > > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so
> > > > > > that scanning of the whole buffer pool can be avoided when the
> > > > > > number of blocks to be truncated in a relation is below a
> > > > > > certain threshold. For such cases, we find the buffers by doing
> lookups in BufMapping table.
> > > > > > This improves the performance by more than 100 times in many
> > > > > > cases when several small tables (tested with 1000 relations)
> > > > > > are truncated and where the server is configured with a large
> > > > > > value of shared buffers (greater than 100GB)."
> > > > >
> > > > > Thank you for taking a look at the results of the tests. And
> > > > > it's also consistent with the results from Tang too.
> > > > > The commit message LGTM.
> > > >
> > > > +1.
> > > >
> > >
> > > I have pushed the 0001.
> >
> > Thank you for commiting this.
> >
> 
> Pushed 0002 as well.
> 

Thank you very much for committing those two patches, and thanks to everyone here
who contributed to simplifying the approaches, code reviews, testing, etc.

I compiled with --enable-coverage and checked whether the newly-added code and
updated parts were covered by tests.
Yes, the lines were hit, including the updated lines of DropRelFileNodeBuffers(),
DropRelFileNodesAllBuffers(), smgrdounlinkall(), and smgrnblocks().
The newly added APIs were covered too: FindAndDropRelFileNodeBuffers() and
smgrnblocks_cached().
However, the parts where UnlockBufHdr(bufHdr, buf_state) is called are not hit,
but I noticed the same is true of previously existing functions in bufmgr.c.
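For reference, the coverage check above follows PostgreSQL's standard coverage build procedure (requires gcov/lcov installed; any additional configure options depend on your environment):

```shell
./configure --enable-coverage    # plus your usual configure options
make
make check                       # run the regression tests to generate counts
make coverage-html               # writes an HTML report under ./coverage/
make coverage-clean              # reset the execution counts between runs
```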

Thank you very much again.

Regards,
Kirk Jamison



RE: libpq debug log

2021-01-17 Thread k.jami...@fujitsu.com
Hi Iwata-san,

In addition to Tsunakawa-san's comments,
the compiler also complains:
  fe-misc.c:678:20: error: ‘lenPos’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
    conn->outMsgStart = lenPos;

There's no need for the lenPos variable anymore, since we have decided *not* to
support the pre-3.0 protocol.
Accordingly, we have to update the description of pqPutMsgStart too.
So you can remove the lenPos variable and the condition that checks the
protocol version.

diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 8bc9966..3de48be 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -644,14 +644,12 @@ pqCheckInBufferSpace(size_t bytes_needed, PGconn *conn)
  *
  * The state variable conn->outMsgStart points to the incomplete message's
  * length word: it is either outCount or outCount+1 depending on whether
- * there is a type byte.  If we are sending a message without length word
- * (pre protocol 3.0 only), then outMsgStart is -1.  The state variable
- * conn->outMsgEnd is the end of the data collected so far.
+ * there is a type byte.  The state variable conn->outMsgEnd is the end of
+ * the data collected so far.
  */
 int
 pqPutMsgStart(char msg_type, bool force_len, PGconn *conn)
 {
-   int lenPos;
int endPos;

/* allow room for message type byte */
@@ -661,9 +659,8 @@ pqPutMsgStart(char msg_type, bool force_len, PGconn *conn)
endPos = conn->outCount;

/* do we want a length word? */
-   if (force_len || PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
+   if (force_len)
{
-   lenPos = endPos;
/* allow room for message length */
endPos += 4;
}
@@ -675,7 +672,7 @@ pqPutMsgStart(char msg_type, bool force_len, PGconn *conn)
if (msg_type)
conn->outBuffer[conn->outCount] = msg_type;
/* set up the message pointers */
-   conn->outMsgStart = lenPos;
+   conn->outMsgStart = endPos;
conn->outMsgEnd = endPos;



At the same time, the initializer line below lacks one zero (it only has 8);
there should be 9, as Matsumura-san mentioned:
+   0, 0, 0, 0, 0, 0, 0, 0, /* \x65 ... \0x6d */


The following can be removed in pqStoreFrontendMsg():
+ * In protocol v2, we immediately print each message as we receive it.
+ * (XXX why?)


Maybe the following description can be paraphrased:
+ * The message length is fixed after putting the last field, but message
+ * length should be print before printing any fields.So we must store the
+ * field data in memory.
to:
+ * The message length is fixed after putting the last field. But message
+ * length should be printed before printing any field, so we must store
+ * the field data in memory.


In pqStoreFrontendMsg(), pqLogMessagenchar(), pqLogMessageString(),
pqLogMessageInt(), and pqLogMessageByte1(), it may be unnecessary to use
+   if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
because PQtrace() already returns silently when protocol 2.0 is detected:
+   /* Protocol 2.0 is not supported. */
+   if (PG_PROTOCOL_MAJOR(conn->pversion) < 3)
+   return;


In pqLogMessageInt(),
+   /* We delayed to print message type for special message. */
can be paraphrased to:
  /* We delay printing of the following special message_type */


Regards,
Kirk Jamison





RE: libpq debug log

2021-01-25 Thread k.jami...@fujitsu.com
Hello,

I have not tested nor reviewed the latest patch changes yet, but I am reporting
the compiler errors.
I have been trying to compile the patch since V12 (Alvaro's version), and the
following need to be fixed too, per the compiler's complaints: doc changes and
an undeclared INT_MAX.

libpq.sgml:5893: parser error : xmlParseEntityRef: no name
  of tracing.  If (flags & TRACE_SUPPRESS_TIMESTAMPS) is
   ^
libpq.sgml:8799: parser error : chunk is not well balanced
postgres.sgml:195: parser error : Failure to process entity libpq
  &libpq;
 ^
postgres.sgml:195: parser error : Entity 'libpq' not defined
  &libpq;
 ^
...
fe-misc.c: In function ‘pqStoreFrontendMsg’:
fe-misc.c:903:35: error: ‘INT_MAX’ undeclared (first use in this function)
if (conn->fe_msg->max_fields > INT_MAX / 2)


Regards,
Kirk Jamison


RE: libpq debug log

2021-01-27 Thread k.jami...@fujitsu.com
On Mon, Jan 25, 2021 10:11 PM (JST), Alvaro Herrera wrote:
> On 2021-Jan-25, tsunakawa.ta...@fujitsu.com wrote:
> 
> > Iwata-san,
> 
> [...]
> 
> > Considering these and the compilation error Kirk-san found, I'd like
> > you to do more self-review before I resume this review.
> 
> Kindly note that these errors are all mine.
> 
> Thanks for the review
> 
Hello,
To make the CFBot happy, I fixed the compiler errors.
I have not included here the change proposal to move the tracing functions
to a new file (fe-trace.c or similar), so I retained the current layout.
Maybe Iwata-san can consider that proposal.
Currently, pqTraceInit() and pqTraceUninit() are each called by only one
function.

Summary:
1. I changed the bits32 flags to int flags
2. I assumed the INT_MAX value stands in for the former MAX_FRONTEND_MSGS;
   I defined it to solve the compile error. Please tell me if that value is wrong.
3. Fixed the doc errors
4. Added freeing of be_msg in pqTraceUninit()
5. Addressed the following comments: 

> (25)
> + conn->fe_msg = malloc(sizeof(offsetof(pqFrontendMessage, fields)) +
> +     DEF_FE_MSGFIELDS * sizeof(pqFrontendMessageField));
> 
> It's incorrect that offsetof() is nested in sizeof().  See my original 
> review comment.

I removed the initial sizeof().


> (26)
> +bool
> +pqTraceInit(PGconn *conn, bits32 flags) {
> ...
> + conn->be_msg->state = LOG_FIRST_BYTE;
> + conn->be_msg->length = 0;
> + }
> ...
> + conn->be_msg->state = LOG_FIRST_BYTE;
> + conn->be_msg->length = 0;
> 
> The former is redundant?

Right. I've removed the former.


> (27)
> +/*
> + * Deallocate frontend-message tracking memory.  We only do this 
> +because
> + * it can grow arbitrarily large, and skip it in the initial state, 
> +because
> + * it's likely pointless.
> + */
> +void
> +pqTraceUninit(PGconn *conn)
> +{
> + if (conn->fe_msg &&
> + conn->fe_msg->num_fields != DEF_FE_MSGFIELDS)
> + {
> + pfree(conn->fe_msg);
> + conn->fe_msg = NULL;
> + }
> +}
> 
> I thought this function plays the reverse role of pqTraceInit(), but 
> it only frees memory for frontend messages.  Plus, use free() instead 
> of pfree(), because
> malloc() is used to allocate memory.

Fixed to use free(). Also added the freeing of be_msg.

> (28)
> +void PQtrace(PGconn *conn, FILE *stream, bits32 flags);
> 
> bits32 can't be used because it's a data type defined in c.h, which 
> should not be exposed to applications.  I think you can just int or something.

I used int. It still works to suppress the timestamp when flags is true.

In my environment, when flags is false(0):
2021-01-28 06:26:11.368 >   Query   27 "COPY country TO STDOUT"
2021-01-28 06:26:11.368 >   Query   -1
2021-01-28 06:26:11.369 <   CopyOutResponse 13 \x00 #3 #0 #0 #0
2021-01-28 06:26:11.369 <   CopyDone4

2021-01-28 06:26:11.369 <   CopyDone4
2021-01-28 06:26:11.369 <   CopyDone4
2021-01-28 06:26:11.369 <   CommandComplete 11 "COPY 0"
2021-01-28 06:26:11.369 <   ReadyForQuery   5
2021-01-28 06:26:11.369 >   Terminate   4
2021-01-28 06:26:11.369 >   Terminate   -1

When flags is true(1), running the same query:
>   Query   27 "COPY country TO STDOUT"
>   Query   -1
<   CopyOutResponse 13 \x00 #3 #0 #0 #0
<   CopyDone4

<   CopyDone4
<   CopyDone4
<   CommandComplete 11 "COPY 0"
<   ReadyForQuery   5
>   Terminate   4
>   Terminate   -1


Regards,
Kirk Jamison


v14-0001-libpq-trace.patch
Description: v14-0001-libpq-trace.patch


RE: libpq debug log

2021-01-28 Thread k.jami...@fujitsu.com
On Mon, January 25, 2021 4:13 PM (JST), Tsunakawa-san wrote:.
> Also, please note this as:
> 
> > Also, why don't you try running the regression tests with a temporary
> modification to PQtrace() to output the trace to a file?  The sole purpose is
> to confirm that this patch doesn't make the test crash (core dump).

I realized that the existing libpq regression tests in src/test/examples do not
test PQtrace(). At the same time, no other function calls PQtrace(), so it's
expected that, by default, if there are no compilation errors it will pass all
the tests. And it did.

So, to check that the tracing feature works and, as requested, outputs a trace
file, I temporarily modified testlibpq.c a bit **based on my current environment
settings**, and ran the regression test.

diff --git a/src/test/examples/testlibpq.c b/src/test/examples/testlibpq.c
index 0372781..ae163e4 100644
--- a/src/test/examples/testlibpq.c
+++ b/src/test/examples/testlibpq.c
@@ -26,6 +26,7 @@ main(int argc, char **argv)
int nFields;
int i,
j;
+   FILE*trace_file;

/*
 * If the user supplies a parameter on the command line, use it as the
@@ -40,6 +41,9 @@ main(int argc, char **argv)
/* Make a connection to the database */
conn = PQconnectdb(conninfo);

+   trace_file = fopen("/home/postgres/tracelog/trace.out","w");
+   PQtrace(conn, trace_file, 0);
+
/* Check to see that the backend connection was successfully made */
if (PQstatus(conn) != CONNECTION_OK)
{
@@ -127,5 +131,6 @@ main(int argc, char **argv)
/* close the connection to the database and cleanup */
PQfinish(conn);

+   fclose(trace_file);
return 0;
 }


make -C src/test/examples/ PROGS=testlibpq
./src/test/examples/testlibpq


I've verified that it works (no crash) and outputs a trace file to the path
that I set in testlibpq.c.

The trace file contains the following log for testlibpq:

2021-01-28 09:22:28.155 >   Query   59 "SELECT 
pg_catalog.set_config('search_path', '', false)"
2021-01-28 09:22:28.156 >   Query   -1
2021-01-28 09:22:28.157 <   RowDescription  35 #1 "set_config" 0 #0 25 #65535 
-1 #0
2021-01-28 09:22:28.157 <   DataRow 10 #1 0
2021-01-28 09:22:28.157 <   CommandComplete 13 "SELECT 1"
2021-01-28 09:22:28.157 <   ReadyForQuery   5
2021-01-28 09:22:28.157 <   ReadyForQuery   5 I
2021-01-28 09:22:28.157 >   Query   10 "BEGIN"
2021-01-28 09:22:28.157 >   Query   -1
2021-01-28 09:22:28.157 <   CommandComplete 10 "BEGIN"
2021-01-28 09:22:28.157 <   ReadyForQuery   5
2021-01-28 09:22:28.157 <   ReadyForQuery   5 T
2021-01-28 09:22:28.157 >   Query   58 "DECLARE myportal CURSOR FOR select * 
from pg_database"
2021-01-28 09:22:28.157 >   Query   -1
2021-01-28 09:22:28.159 <   CommandComplete 19 "DECLARE CURSOR"
2021-01-28 09:22:28.159 <   ReadyForQuery   5
2021-01-28 09:22:28.159 <   ReadyForQuery   5 T
2021-01-28 09:22:28.159 >   Query   26 "FETCH ALL in myportal"
2021-01-28 09:22:28.159 >   Query   -1
2021-01-28 09:22:28.159 <   RowDescription  405 #14 "oid" 1262 #1 26 #4 -1 #0 
"datname" 1262 #2 19 #64 -1 #0 "datdba" 1262 #3 26 #4 -1 #0 "encoding" 1262 #4 
23 #4 -1 #0 "datcollate" 1262 #5 19 #64 -1 #0 "datctype" 1262 #6 19 #64 -1 #0 
"datistemplate" 1262 #7 16 #1 -1 #0 "datallowconn" 1262 #8 16 #1 -1 #0 
"datconnlimit" 1262 #9 23 #4 -1 #0 "datlastsysoid" 1262 #10 26 #4 -1 #0 
"datfrozenxid" 1262 #11 28 #4 -1 #0 "datminmxid" 1262 #12 28 #4 -1 #0 
"dattablespace" 1262 #13 26 #4 -1 #0 "datacl" 1262 #14 1034 #65535 -1 #0
2021-01-28 09:22:28.159 <   DataRow 117 #14 5 '13756' 8 'postgres' 2 '10' 1 '6' 
11 'en_US.UTF-8' 11 'en_US.UTF-8' 1 'f' 1 't' 2 '-1' 5 '13755' 3 '502' 1 '1' 4 
'1663' -1
2021-01-28 09:22:28.159 <   DataRow 149 #14 1 '1' 9 'template1' 2 '10' 1 '6' 11 
'en_US.UTF-8' 11 'en_US.UTF-8' 1 't' 1 't' 2 '-1' 5 '13755' 3 '502' 1 '1' 4 
'1663' 35 '{=c/postgres,postgres=CTc/postgres}'
2021-01-28 09:22:28.159 <   DataRow 153 #14 5 '13755' 9 'template0' 2 '10' 1 
'6' 11 'en_US.UTF-8' 11 'en_US.UTF-8' 1 't' 1 'f' 2 '-1' 5 '13755' 3 '502' 1 
'1' 4 '1663' 35 '{=c/postgres,postgres=CTc/postgres}'
2021-01-28 09:22:28.159 <   CommandComplete 12 "FETCH 3"
2021-01-28 09:22:28.159 <   ReadyForQuery   5
2021-01-28 09:22:28.159 <   ReadyForQuery   5 T
2021-01-28 09:22:28.159 >   Query   19 "CLOSE myportal"
2021-01-28 09:22:28.159 >   Query   -1
2021-01-28 09:22:28.159 <   CommandComplete 17 "CLOSE CURSOR"
2021-01-28 09:22:28.159 <   ReadyForQuery   5
2021-01-28 09:22:28.159 <   ReadyForQuery   5 T
2021-01-28 09:22:28.159 >   Query   8 "END"
2021-01-28 09:22:28.159 >   Query   -1
2021-01-28 09:22:28.160 <   CommandComplete 11 "COMMIT"
2021-01-28 09:22:28.160 <   ReadyForQuery   5
2021-01-28 09:22:28.160 <   ReadyForQuery   5 I
2021-01-28 09:22:28.160 >   Terminate   4
2021-01-28 09:22:28.160 >   Terminate   -1


Regards,
Kirk Jamison


RE: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread k.jami...@fujitsu.com
On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com 
> wrote:
> >
> > Thank you for the advice. I repeated the test as per your advice and
> > average of 3 runs per worker/s planned.
> > It still shows the following similar performance results between Master and
> Patch V2.
> > I wonder why there's no difference though.
> >
> > The test on my machine is roughly like this:
> >
> > createdb test
> > psql -d test
> > create table t_heap as select generate_series(1, 1) i; \q
> >
> > pg_ctl restart
> > psql -d test
> > SET track_io_timing = on;
> > SET max_parallel_workers_per_gather = 0; SHOW
> > max_parallel_workers_per_gather; EXPLAIN (ANALYZE, BUFFERS) SELECT
> > count(*) from t_heap; \timing SELECT count(*) from t_heap;
> >
> > drop table t_heap;
> > \q
> > dropdb test
> > pg_ctl restart
> >
> > Below are the results. Again, almost no discernible difference between the
> master and patch.
> > Also, the results when max_parallel_workers_per_gather is more than 4
> > could be inaccurate due to my machine's limitation of only having v4
> > CPUs. Even so, query planner capped it at
> > 6 workers.
> >
> > Query Planner I/O Timings (track_io_timing = on) in ms :
> > | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) |
> > | I/O WRITE (Patch) |
> >
> ||---|--||-
> --|
> > | 0  | "1,130.777"   | "1,250.821"  | "01,698.051"   |
> "01,733.439"  |
> > | 1  | "1,603.016"   | "1,660.767"  | "02,312.248"   |
> "02,291.661"  |
> > | 2  | "2,036.269"   | "2,107.066"  | "02,698.216"   |
> "02,796.893"  |
> > | 3  | "2,298.811"   | "2,307.254"  | "05,695.991"   |
> "05,894.183"  |
> > | 4  | "2,098.642"   | "2,135.960"  | "23,837.088"   |
> "26,537.158"  |
> > | 5  | "1,956.536"   | "1,997.464"  | "45,891.851"   |
> "48,049.338"  |
> > | 6  | "2,201.816"   | "2,219.001"  | "61,937.828"   |
> "67,809.486"  |
> >
> > Query Planner Execution Time (ms):
> > | Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
> > ||---|--|
> > | 0.000  | "40,454.252"  | "40,521.578" |
> > | 1.000  | "21,332.067"  | "21,205.068" |
> > | 2.000  | "14,266.756"  | "14,385.539" |
> > | 3.000  | "11,597.936"  | "11,722.055" |
> > | 4.000  | "12,937.468"  | "13,439.247" |
> > | 5.000  | "14,383.083"  | "14,782.866" |
> > | 6.000  | "14,671.336"  | "15,507.581" |
> >
> > Based from the results above, the I/O latency increases as number of
> > workers also increase. Despite that, the query planner execution time
> > is almost closely same when 2 or more workers are used (14~11s). Same
> results between Master and Patch V2.
> >
> > As for buffers, same results are shown per worker (both Master and Patch).
> > | Worker | Buffers  |
> > ||--|
> > | 0  | shared read=442478 dirtied=442478 written=442446 |
> > | 1  | shared read=442478 dirtied=442478 written=442414 |
> > | 2  | shared read=442478 dirtied=442478 written=442382 |
> > | 3  | shared read=442478 dirtied=442478 written=442350 |
> > | 4  | shared read=442478 dirtied=442478 written=442318 |
> > | 5  | shared read=442478 dirtied=442478 written=442286 |
> > | 6  | shared read=442478 dirtied=442478 written=442254 |
> >
> >
> > SQL Query Execution Time (ms) :
> > | Worker | SQL (Master) | SQL (Patch)  |
> > ||--|--|
> > | 0  | "10,418.606" | "10,377.377" |
> > | 1  | "05,427.460"  | "05,402.727" |
> > | 2  | "03,662.998"  | "03,650.277" |
> > | 3  | "02,718.837"  | "02,692.871" |

RE: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread k.jami...@fujitsu.com
On Friday, July 17, 2020 6:18 PM (GMT+9), Amit Kapila wrote:

> On Fri, Jul 17, 2020 at 11:35 AM k.jami...@fujitsu.com 
> wrote:
> >
> > On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote:
> >
> > >On Wed, 15 Jul 2020 at 14:51, Amit Kapila  wrote:
> > >>
> > >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley 
> wrote:
> > >>> If we've not seen any performance regressions within 1 week, then
> > >>> I propose that we (pending final review) push this to allow wider
> > >>> testing.
> > >>
> > >> I think Soumyadeep has reported a regression case [1] with the
> > >> earlier version of the patch.  I am not sure if we have verified
> > >> that the situation improves with the latest version of the patch.
> > >> I request Soumyadeep to please try once with the latest patch.
> > >...
> > >Yeah, it would be good to see some more data points on that test.
> > >Jumping from 2 up to 6 workers just leaves us to guess where the
> > >performance started to become bad. >It would be good to know if the
> > >regression is repeatable or if it was affected by some other process.
> > >...
> > >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> > >track_io_timing = on; for each value of >max_parallel_workers.
> >
> > Hi,
> >
> > If I'm following the thread correctly, we may have gains on this patch
> > of Thomas and David, but we need to test its effects on different
> > filesystems. It's also been clarified by David through benchmark tests
> > that synchronize_seqscans is not affected as long as the set cap per
> > chunk size of parallel scan is at 8192.
> >
> > I also agree that having a control on this through GUC can be
> > beneficial for users, however, that can be discussed in another thread
> > or development in the future.
> >
> > David Rowley wrote:
> > >I'd like to propose that if anyone wants to do further testing on
> > >other operating systems with SSDs or HDDs then it would be good if
> > >that could be done within a 1 week from this email. There are various
> > >benchmarking ideas on this thread for inspiration.
> >
> > I'd like to join on testing it, this one using HDD, and at the bottom
> > are the results. Due to my machine limitations, I only tested
> > 0~6 workers, that even if I increase max_parallel_workers_per_gather
> > more than that, the query planner would still cap the workers at 6.
> > I also set the track_io_timing to on as per David's recommendation.
> >
> > Tested on:
> > XFS filesystem, HDD virtual machine
> > RHEL4, 64-bit,
> > 4 CPUs, Intel Core Processor (Haswell, IBRS) PostgreSQL 14devel on
> > x86_64-pc-linux-gnu
> >
> >
> > Test Case (Soumyadeep's) [1]
> >
> > shared_buffers = 32MB (to use OS page cache)
> >
> > create table t_heap as select generate_series(1, 1) i;   --about
> 3.4GB size
> >
> > SET track_io_timing = on;
> >
> > \timing
> >
> > set max_parallel_workers_per_gather = 0;  --0 to 6
> >
> > SELECT count(*) from t_heap;
> > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
> >
> > [Summary]
> > I used the same query from the thread. However, the sql query
> > execution time and query planner execution time results between the
> > master and patched do not vary much.
> > OTOH, in terms of I/O stats, I observed similar regression in both
> > master and patched as we increase max_parallel_workers_per_gather.
> >
> > It could also be possible that each benchmark setting for
> > max_parallel_workers_per_gather is affected by previous result . IOW,
> > later benchmark runs benefit from the data cached by previous runs on OS 
> > level.
> >
> 
> Yeah, I think to some extent that is visible in results because, after patch, 
> at 0
> workers, the execution time is reduced significantly whereas there is not much
> difference at other worker counts.  I think for non-parallel case (0 workers),
> there shouldn't be any difference.
> Also, I am not sure if there is any reason why after patch the number of 
> shared hits
> is improved, probably due to caching effects?
> 
> > Any advice?
> 
> I think recreating the database and restarting the server after each run 
> might help
> in getting consistent results.  Also, you might want to take median of three 
> runs.

Thank you for the advice. I repeated the test as per your advice, taking the average of 3 runs per planned worker count.

RE: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread k.jami...@fujitsu.com
On Tuesday, July 21, 2020 7:33 PM, Amit Kapila wrote:
> On Tue, Jul 21, 2020 at 3:08 PM k.jami...@fujitsu.com 
> wrote:
> >
> > On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> > > On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com
> > > 
> > > wrote:
> > > >
> > > > I am definitely missing something. Perhaps I think I could not
> > > > understand why there's no I/O difference between the Master and
> > > > Patched (V2). Or has it been already improved even without this patch?
> > > >
> > >
> > > I don't think it is strange that you are not seeing much difference
> > > because as per the initial email by Thomas this patch is not
> > > supposed to give benefits on all systems.  I think we wanted to
> > > check that the patch should not regress performance in cases where
> > > it doesn't give benefits.  I think it might be okay to run with a
> > > higher number of workers than you have CPUs in the system as we
> > > wanted to check if such cases regress as shown by Soumyadeep above
> > > [1].  Can you once try with
> > > 8 and or 10 workers as well?
> > >
> >
> > You are right. Kindly excuse me on that part, which only means it may
> > or may not have any benefits on the filesystem I am using. But for
> > other fs, as we can see from David's benchmarks significant 
> > results/benefits.
> >
> > Following your advice on regression test case, I increased the number
> > of workers, but the query planner still capped the workers at 6, so
> > the results from 6 workers onwards are almost the same.
> >
> 
> I am slightly confused if the number of workers are capped at 6, then what 
> exactly
> the data at 32 worker count means?  If you want query planner to choose more
> number of workers, then I think either you need to increase the data or use 
> Alter
> Table  Set (parallel_workers = );

Oops, I'm sorry: the "workers" labelled in those tables actually means
max_parallel_workers_per_gather, not parallel_workers. In the query planner, I
thought _per_gather corresponds to, or controls, the planned/launched worker
values, and those are the numbers I used in the tables.

I used the defaults for max_parallel_workers & max_worker_processes, which are
both 8 in postgresql.conf. IOW, I ran all those tests with a maximum of 8
processes. But the query planner capped both Workers Planned and Workers
Launched at 6 for some reason when I increased
max_parallel_workers_per_gather.

However, when I used ALTER TABLE SET (parallel_workers = N) as you suggested,
the query planner applied that value only to "Workers Planned", not to
"Workers Launched".
The query planner's behavior is also different when I additionally set
max_worker_processes and max_parallel_workers to parallel_workers + 1.

For example (ran on Master),
1. Set same value as parallel_workers, but "Workers Launched" and "Workers 
Planned" do not match.
max_worker_processes = 8
max_parallel_workers = 8
ALTER TABLE t_heap SET (parallel_workers = 8);
ALTER TABLE
SET max_parallel_workers_per_gather = 8;
SET
test=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
QUERY PLAN
--
 Finalize Aggregate  (cost=619778.66..619778.67 rows=1 width=8) (actual 
time=16316.295..16316.295 rows=1 loops=1)
   Buffers: shared read=442478 dirtied=442478 written=44
   ->  Gather  (cost=619777.83..619778.64 rows=8 width=8) (actual 
time=16315.528..16316.668 rows=8 loops=1)
 Workers Planned: 8
 Workers Launched: 7
 Buffers: shared read=442478 dirtied=442478 written=44
 ->  Partial Aggregate  (cost=618777.83..618777.84 rows=1 width=8) 
(actual time=16305.092..16305.092 rows=1 loops=8)
   Buffers: shared read=442478 dirtied=442478 written=44
   ->  Parallel Seq Scan on t_heap  (cost=0.00..583517.86 
rows=14103986 width=0) (actual time=0.725..14290.117 rows=1250 loops=8)
 Buffers: shared read=442478 dirtied=442478 written=44
 Planning Time: 5.327 ms
   Buffers: shared hit=17 read=10
 Execution Time: 16316.915 ms
(13 rows)

2. Match the workers launched and workers planned values (parallel_workers + 1)
max_worker_processes = 9
max_parallel_workers = 9

ALTER TABLE t_heap SET (parallel_workers = 8);
ALTER TABLE;
SET max_parallel_workers_per_gather = 8;
SET

test=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;

RE: Parallel Seq Scan vs kernel read ahead

2020-07-22 Thread k.jami...@fujitsu.com
On Wednesday, July 22, 2020 2:21 PM (GMT+9), David Rowley wrote:

> On Wed, 22 Jul 2020 at 16:40, k.jami...@fujitsu.com 
> wrote:
> > I used the default max_parallel_workers & max_worker_proceses which is 8 by
> default in postgresql.conf.
> > IOW, I ran all those tests with maximum of 8 processes set. But my
> > query planner capped both the Workers Planned and Launched at 6 for
> > some reason when increasing the value for max_parallel_workers_per_gather.
> 
> max_parallel_workers_per_gather just imposes a limit on the planner as to the
> maximum number of parallel workers it may choose for a given parallel portion 
> of
> a plan. The actual number of workers the planner will decide is best to use is
> based on the size of the relation. More pages = more workers. It sounds like 
> in
> this case the planner didn't think it was worth using more than 6 workers.
> 
> The parallel_workers reloption, when not set to -1 overwrites the planner's
> decision on how many workers to use. It'll just always try to use
> "parallel_workers".
>
> > However, when I used the ALTER TABLE SET (parallel_workers = N) based
> > from your suggestion, the query planner acquired that set value only for
> "Workers Planned", but not for "Workers Launched".
> > The behavior of query planner is also different when I also set the
> > value of max_worker_processes and max_parallel_workers to parallel_workers
> + 1.
> 
> When it comes to execution, the executor is limited to how many parallel 
> worker
> processes are available to execute the plan. If all workers happen to be busy 
> with
> other tasks then it may find itself having to process the entire query in 
> itself
> without any help from workers.  Or there may be workers available, just not as
> many as the planner picked to execute the query.

Even though I read the documentation [1][2] on parallel query, I might not have
understood it clearly yet, so thank you very much for explaining more simply how
the relation size, GUCs, and the reloption affect the query planner's behavior.
So in this test case I shouldn't force workers planned and workers launched to
the same values, correct? Just let the optimizer make its own decision.

> The number of available workers is configured with the
> "max_parallel_workers". That's set in postgresql.conf.   PostgreSQL
> won't complain if you try to set a relation's parallel_workers reloption to a 
> number
> higher than the "max_parallel_workers" GUC.
> "max_parallel_workers" is further limited by "max_worker_processes".
> Likely you'll want to set both those to at least 32 for this test, then just 
> adjust the
> relation's parallel_workers setting for each test.
> 
Thank you for the advice. For the same test case [3], I will use the following 
configuration:
shared_buffers = 32MB
max_parallel_workers = 32
max_worker_processes = 32

Maybe, as you mentioned, the relation size was small, so the query optimizer
decided to use only 6 workers in my previous test. Let me first see whether the
results vary with the above configuration while testing different values of
parallel_workers.

Kind regards,
Kirk Jamison

[1] https://www.postgresql.org/docs/13/how-parallel-query-works.html
[2] https://www.postgresql.org/docs/current/runtime-config-resource.html
[3] 
https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB=xyj192ezcnwgfcca_wj5ghvm7sv8oe...@mail.gmail.com



RE: [Patch] Optimize dropping of relation buffers using dlist

2020-07-30 Thread k.jami...@fujitsu.com
On Wednesday, July 29, 2020 4:55 PM, Konstantin Knizhnik wrote:
> On 17.06.2020 09:14, k.jami...@fujitsu.com wrote:
> > Hi,
> >
> > Since the last posted version of the patch fails, attached is a rebased
> > version.
> > Written upthread were performance results and some benefits and challenges.
> > I'd appreciate your feedback/comments.
> >
> > Regards,
> > Kirk Jamison

> As far as I understand, this patch can provide a significant improvement in
> performance only in the case of recovery of truncates of a large number of
> tables. You have added a shared hash of relation buffers, and it certainly
> adds some extra overhead. According to your latest results this overhead is
> quite small, but it will be hard to prove that there will be no noticeable
> regression at some workloads.

Thank you for taking a look at this.

Yes, one of the aims is to speed up recovery of truncations, but at the same
time the patch also improves autovacuum, vacuum, and relation truncation
executions. I showed pgbench results above for different types of workloads,
but I am not sure if those are convincing enough...

> I wonder if you have considered the case of a local hash (maintained only
> during recovery)?
> If there is after-crash recovery, then there will be no concurrent access
> to shared buffers and this hash will be up-to-date.
> In case of a hot-standby replica we can use some simple invalidation (just
> one flag or counter which indicates that the buffer cache was updated).
> This hash also can be constructed on demand when DropRelFileNodeBuffers is
> called the first time (so we have to scan all buffers once, but subsequent
> drop operations will be fast).
> 
> I have not thought much about it, but it seems to me that as far as this
> problem only affects recovery, we do not need a shared hash for it.
> 

The idea of the patch is to mark the relation buffers to be dropped while
scanning the whole shared buffers, store them in shared memory maintained in
a dlist, and traverse the dlist on the next scan.
But I understand the point that this is expensive and may cause overhead,
which is why I tried to define a macro to limit the number of pages we can
cache, for cases where the lookup cost can be problematic (i.e. too many
pages in a relation).

#define BUF_ID_ARRAY_SIZE 100
int buf_id_array[BUF_ID_ARRAY_SIZE];
int forknum_indexes[BUF_ID_ARRAY_SIZE];

In DropRelFileNodeBuffers:
do
{
    nbufs = CachedBlockLookup(..., forknum_indexes, buf_id_array,
                              lengthof(buf_id_array));
    for (i = 0; i < nbufs; i++)
    {
        ...
    }
} while (nbufs == lengthof(buf_id_array));


Perhaps the patch adds too much complexity, so we may want to keep it simpler,
or commit it piece by piece?
I will look further into your suggestion of maintaining a local hash only
during recovery.
Thank you for the suggestion.

Regards,
Kirk Jamison


Fix minor source code comment mistake

2020-07-30 Thread k.jami...@fujitsu.com
Hi,

Just found a minor error in source code comment.
src/include/executor/instrument.h

Attached is the fix.

-   long        local_blks_dirtied; /* # of shared blocks dirtied */
+   long        local_blks_dirtied; /* # of local blocks dirtied */


Regards,
Kirk Jamison


0001-Fix-comment-in-source-code.patch
Description: 0001-Fix-comment-in-source-code.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-07-30 Thread k.jami...@fujitsu.com
On Friday, July 31, 2020 2:37 AM, Konstantin Knizhnik wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:        not tested
> 
> I have tested this patch at various workloads and hardware (including Power2
> server with 384 virtual cores) and didn't find performance regression.
> 
> The new status of this patch is: Ready for Committer

Thank you very much, Konstantin, for testing the patch under different workloads.
I wonder if I need to modify any documentation.
I'll leave the final review to the committer/s as well.

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-08-05 Thread k.jami...@fujitsu.com
On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:

Hi,
Thank you for your constructive review and comments.
Sorry for the late reply.

> Hi,
> 
> On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > Indeed. The buffer mapping hashtable already is visible as a major
> > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > > is large enough (so the hashtable is larger than the cache). Not to
> > > speak of things like a cached sequential scan with a cheap qual and wide
> rows.
> >
> > To be fair, the added overhead is in buffer allocation not buffer lookup.
> > So it shouldn't add cost to fully-cached cases.  As Tomas noted
> > upthread, the potential trouble spot is where the working set is
> > bigger than shared buffers but still fits in RAM (so there's no actual
> > I/O needed, but we do still have to shuffle buffers a lot).
> 
> Oh, right, not sure what I was thinking.
> 
> 
> > > Wonder if the temporary fix is just to do explicit hashtable probes
> > > for all pages iff the size of the relation is < s_b / 500 or so.
> > > That'll address the case where small tables are frequently dropped -
> > > and dropping large relations is more expensive from the OS and data
> > > loading perspective, so it's not gonna happen as often.
> >
> > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > relation had been (preferably without adding an lseek call), but maybe
> > that's do-able.
> 
> IIRC we already do smgrnblocks nearby, when doing the truncation (to figure
> out which segments we need to remove). Perhaps we can arrange to combine
> the two? The layering probably makes that somewhat ugly :(
> 
> We could also just use pg_class.relpages. It'll probably mostly be accurate
> enough?
> 
> Or we could just cache the result of the last smgrnblocks call...
> 
> 
> One of the cases where this type of strategy is most interesting to me is
> the partial truncations that autovacuum does... There we even know the
> range of tables ahead of time.

Konstantin tested it on various workloads and saw no regression.
But I understand the sentiment on the added overhead in BufferAlloc.
Regarding the case where the patch would potentially affect workloads that fit
into RAM but not into shared buffers, could one of Andres' suggested ideas
above address that, in addition to this patch's possible shared-invalidation
fix? Could that settle the added overhead in BufferAlloc() as a temporary fix?
Thomas Munro is also working on caching relation sizes [1]; maybe that way we
could get the latest known relation size. Currently, that's possible only
during recovery in smgrnblocks.

Tom Lane wrote:
> But aside from that, I noted a number of things I didn't like a bit:
> 
> * The amount of new shared memory this needs seems several orders of
> magnitude higher than what I'd call acceptable: according to my measurements
> it's over 10KB per shared buffer!  Most of that is going into the
> CachedBufTableLock data structure, which seems fundamentally misdesigned ---
> how could we be needing a lock per map partition *per buffer*?  For
> comparison, the space used by buf_table.c is about 56 bytes per shared
> buffer; I think this needs to stay at least within hailing distance of there.
> 
> * It is fairly suspicious that the new data structure is manipulated while
> holding per-partition locks for the existing buffer hashtable.
> At best that seems bad for concurrency, and at worst it could result in
> deadlocks, because I doubt we can assume that the new hash table has
> partition boundaries identical to the old one.
> 
> * More generally, it seems like really poor design that this has been
> written completely independently of the existing buffer hash table.
> Can't we get any benefit by merging them somehow?

The original aim is to just shorten the recovery process, and the speedups of
both the vacuum and truncate processes are added bonuses.
Given that we don't yet have a shared invalidation mechanism in place, such as
the (complex) radix-tree buffer mapping, I thought a patch like mine could be
an alternative approach to that. So I want to improve the patch further.
I hope you can help me clarify the direction, so that I can avoid drifting
away from what the community wants:
 1. Both normal operations and recovery process
 2. Improve recovery process only

For 1, the current patch aims to touch on that, but further design improvement
is needed. It would be ideal to modify the BufferDesc, but it cannot be
expanded anymore because that would exceed the CPU cache-line size. So I added
new data structures (hash table, dlist, lock) instead of modifying the
existing ones.
The new hash table is kept consistent with the old one by using the same
RelFileNode in the key and by taking a lock when inserting and deleting
buffers from the buffer table, as well as during lookups. As for the partition
locking, I added it to reduce lock contention.

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread k.jami...@fujitsu.com
On Friday, August 7, 2020 12:38 PM, Amit Kapila wrote:
Hi,
> On Thu, Aug 6, 2020 at 6:53 AM k.jami...@fujitsu.com 
> wrote:
> >
> > On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
> >
> > Hi,
> > Thank you for your constructive review and comments.
> > Sorry for the late reply.
> >
> > > Hi,
> > >
> > > On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > > > Andres Freund  writes:
> > > > > Indeed. The buffer mapping hashtable already is visible as a
> > > > > major bottleneck in a number of workloads. Even in readonly
> > > > > pgbench if s_b is large enough (so the hashtable is larger than
> > > > > the cache). Not to speak of things like a cached sequential scan
> > > > > with a cheap qual and wide
> > > rows.
> > > >
> > > > To be fair, the added overhead is in buffer allocation not buffer 
> > > > lookup.
> > > > So it shouldn't add cost to fully-cached cases.  As Tomas noted
> > > > upthread, the potential trouble spot is where the working set is
> > > > bigger than shared buffers but still fits in RAM (so there's no
> > > > actual I/O needed, but we do still have to shuffle buffers a lot).
> > >
> > > Oh, right, not sure what I was thinking.
> > >
> > >
> > > > > Wonder if the temporary fix is just to do explicit hashtable
> > > > > probes for all pages iff the size of the relation is < s_b / 500 or 
> > > > > so.
> > > > > That'll address the case where small tables are frequently
> > > > > dropped - and dropping large relations is more expensive from
> > > > > the OS and data loading perspective, so it's not gonna happen as 
> > > > > often.
> > > >
> > > > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > > > relation had been (preferably without adding an lseek call), but
> > > > maybe that's do-able.
> > >
> > > IIRC we already do smgrnblocks nearby, when doing the truncation (to
> > > figure out which segments we need to remove). Perhaps we can arrange
> > > to combine the two? The layering probably makes that somewhat ugly
> > > :(
> > >
> > > We could also just use pg_class.relpages. It'll probably mostly be
> > > accurate enough?
> > >
> > > Or we could just cache the result of the last smgrnblocks call...
> > >
> > >
> > > One of the cases where this type of strategy is most interesting to
> > > me is the partial truncations that autovacuum does... There we even
> > > know the range of tables ahead of time.
> >
> > Konstantin tested it on various workloads and saw no regression.
> > But I understand the sentiment on the added overhead on BufferAlloc.
> > Regarding the case where the patch would potentially affect workloads
> > that fit into RAM but not into shared buffers, could one of Andres'
> > suggested idea/s above address that, in addition to this patch's
> > possible shared invalidation fix? Could that settle the added overhead in
> BufferAlloc() as temporary fix?
> >
> 
> Yes, I think so. Because as far as I can understand he is suggesting to do 
> changes
> only in the Truncate/Vacuum code path. Basically, I think you need to change
> DropRelFileNodeBuffers or similar functions.
> There shouldn't be any change in the BufferAlloc or the common code path, so
> there is no question of regression in such cases. I am not sure what you have 
> in
> mind for this but feel free to clarify your understanding before 
> implementation.
>
> > Thomas Munro is also working on caching relation sizes [1], maybe that
> > way we could get the latest known relation size. Currently, it's
> > possible only during recovery in smgrnblocks.
> >
> > Tom Lane wrote:
> > > But aside from that, I noted a number of things I didn't like a bit:
> > >
> > > * The amount of new shared memory this needs seems several orders of
> > > magnitude higher than what I'd call acceptable: according to my
> > > measurements it's over 10KB per shared buffer!  Most of that is
> > > going into the CachedBufTableLock data structure, which seems
> > > fundamentally misdesigned --- how could we be needing a lock per map
> > > partition *per buffer*?  For comparison, the space used by
> > > buf_table.c is about 56 bytes per shared buffer; I think this needs to 
> >

RE: please update ps display for recovery checkpoint

2020-08-18 Thread k.jami...@fujitsu.com
On Wednesday, August 19, 2020 7:53 AM (GMT+9), Justin Pryzby wrote: 

Hi, 

All the patches apply, although when applying them the following appears:
   (Stripping trailing CRs from patch; use --binary to disable.)

> During crash recovery, the server writes this to log:
> 
> < 2020-08-16 08:46:08.601 -03  >LOG:  redo done at 2299C/1EC6BA00 <
> 2020-08-16 08:46:08.877 -03  >LOG:  checkpoint starting: end-of-recovery
> immediate
> 
> But runs a checkpoint, which can take a long time, while the "ps" display
> still says "recovering ".
> 
> Please change it to say "recovery checkpoint" or similar, as I mentioned here.
> https://www.postgresql.org/message-id/2020011820.GP26045@telsasoft.com

Yes, I agree that it is helpful to tell users about that.

About the 0003 patch: there are similar phrases in bgwriter_flush_after and
backend_flush_after. Should those be updated too?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3170,7 +3170,7 @@ include_dir 'conf.d'
         limit the amount of dirty data in the kernel's page cache, reducing
         the likelihood of stalls when an fsync is issued at the end of the
         checkpoint, or when the OS writes data back in larger batches in the
-        background.  Often that will result in greatly reduced transaction
+        background.  This feature will often result in greatly reduced transaction
         latency, but there also are some cases, especially with workloads
         that are bigger than , but smaller
         than the OS's page cache, where performance might degrade.  This


Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread k.jami...@fujitsu.com
On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: 
> On Fri, Aug 7, 2020 at 9:33 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund 
> wrote:
> > >> We could also just use pg_class.relpages. It'll probably mostly be
> > >> accurate enough?
> >
> > > Don't we need the accurate 'number of blocks' if we want to
> > > invalidate all the buffers? Basically, I think we need to perform
> > > BufTableLookup for all the blocks in the relation and then Invalidate all
> buffers.
> >
> > Yeah, there is no room for "good enough" here.  If a dirty buffer
> > remains in the system, the checkpointer will eventually try to flush
> > it, and fail (because there's no file to write it to), and then
> > checkpointing will be stuck.  So we cannot afford to risk missing any
> buffers.
> >
> 
> Today, again thinking about this point it occurred to me that during recovery
> we can reliably find the relation size and after Thomas's recent commit
> c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to
> even incur the cost of lseek. Why don't we fix this first for 'recovery' (by
> following something on the lines of what Andres suggested) and then later
> once we have a generic mechanism for "caching the relation size" [1], we can
> do it for non-recovery paths.
> I think that will at least address the reported use case with some minimal
> changes.
> 
> [1] -
> https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
> 

Attached is an updated V9 version with minimal code changes only, which avoids
the previous overhead in BufferAlloc. This time, I only updated the recovery
path as suggested by Amit, and followed Andres' suggestion of referring to the
cached blocks in smgrnblocks.
The layering is kinda tricky so the logic may be wrong. But as of now, it
passes the regression tests. I'll follow up with the performance results.
It seems there's a regression for smaller shared_buffers. I'll update if I
find bugs.
But I'd also appreciate your reviews in case I missed something.

Regards,
Kirk Jamison




v9-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v9-Speedup-dropping-of-relation-buffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread k.jami...@fujitsu.com
On Wednesday, September 2, 2020 10:31 AM, Kyotaro Horiguchi wrote:
> Hello.
> 
> At Tue, 1 Sep 2020 13:02:28 +, "k.jami...@fujitsu.com"
>  wrote in
> > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote:
> > > Today, again thinking about this point it occurred to me that during
> > > recovery we can reliably find the relation size and after Thomas's
> > > recent commit
> > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not
> > > need to even incur the cost of lseek. Why don't we fix this first
> > > for 'recovery' (by following something on the lines of what Andres
> > > suggested) and then later once we have a generic mechanism for
> > > "caching the relation size" [1], we can do it for non-recovery paths.
> > > I think that will at least address the reported use case with some
> > > minimal changes.
> > >
> > > [1] -
> > >
> https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
> 
> Isn't a relation always locked access-exclusively at truncation time?  If
> so, isn't even the result of lseek reliable enough? And if we don't care
> about the cost of lseek, we can do the same optimization also for
> non-recovery paths. Since we perform the actual file truncation just after
> anyway, I think the cost of lseek is negligible here.

The reason for that is that when I read the comment on smgrnblocks in smgr.c,
I thought smgrnblocks could only be used reliably during recovery here to
ensure that we have the correct size.
Please correct me if my understanding is wrong, and I'll fix the patch.

     * For now, we only use cached values in recovery due to lack of a shared
     * invalidation mechanism for changes in file size.
     */
    if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];

> > Attached is an updated V9 version with minimal code changes only and
> > avoids the previous overhead in the BufferAlloc. This time, I only
> > updated the recovery path as suggested by Amit, and followed Andres'
> > suggestion of referring to the cached blocks in smgrnblocks.
> > The layering is kinda tricky so the logic may be wrong. But as of now,
> > it passes the regression tests. I'll follow up with the performance results.
> > It seems there's regression for smaller shared_buffers. I'll update if I 
> > find
> bugs.
> > But I'd also appreciate your reviews in case I missed something.
> 
> BUF_DROP_THRESHOLD seems to be misused. IIUC it defines the maximum number
> of file pages for which we do a relation-targeted search of buffers;
> otherwise we scan through all buffers. On the other hand, the latest patch
> just leaves all buffers alone for relation forks longer than the threshold.

Right, I missed the condition for that part. It's fixed in the latest version.
 
> I think we should determine whether to do a targeted scan or a full scan
> based on the ratio of the (expected maximum) total number of pages for all
> (specified) forks in a relation against the total number of buffers.

> By the way
> 
> > #define BUF_DROP_THRESHOLD  500 /* NBuffers divided by 2 */
> 
> NBuffers is not a constant. Even if we wanted to set the macro as described
> in the comment, we should have used (NBuffers / 2) instead of "500". But I
> suppose you might have wanted to use (NBuffers / 500) as Tom suggested
> upthread. And the name of the macro seems too generic. I think a more
> specific name like BUF_DROP_FULLSCAN_THRESHOLD would be better.

Fixed.

Thank you for the review!
Attached is the v10 of the patch.

Best regards,
Kirk Jamison


v10-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v10-Speedup-dropping-of-relation-buffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-07 Thread k.jami...@fujitsu.com
On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> On Wed, Sep 2, 2020 at 9:17 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > Even if the relation is locked, background processes like
> > > checkpointer can still touch the relation which might cause
> > > problems. Consider a case where we extend the relation but didn't
> > > flush the newly added pages. Now during truncate operation,
> > > checkpointer can still flush those pages which can cause trouble for
> > > truncate. But, I think in the recovery path such cases won't cause a
> problem.
> >
> > I wouldn't count on that staying true ...
> >
> >
> https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> OBV-jC
> > nacbyvtpptk2a9yy...@mail.gmail.com
> >
> 
> I don't think that proposal will matter after commit c5315f4f44 because we are
> caching the size/blocks for recovery while doing extend (smgrextend). In the
> above scenario, we would have cached the blocks which will be used at later
> point of time.
> 

Hi,

I'm guessing we can still pursue this idea of improving the recovery path first.

I'm working on an updated patch version, because the CFBot reports that
postgres fails to build (one of the recovery TAP tests fails).
I'm still working on refactoring my patch, but I have yet to find a proper
solution, so I'm going to continue my investigation.

Attached is an updated WIP patch.
I'd appreciate if you could take a look at the patch as well.

In addition, also attached are the regression logs for the failure and the
other logs accompanying it: wal_optimize_node_minimal and
wal_optimize_node_replica.

The failure stated in my session was:
t/018_wal_optimize.pl  18/34 Bailout called.
Further testing stopped:  pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed

Best regards,
Kirk Jamison


v11-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v11-Speedup-dropping-of-relation-buffers-during-recovery.patch


regress_log_018_wal_optimize
Description: regress_log_018_wal_optimize


018_wal_optimize_node_minimal.log
Description: 018_wal_optimize_node_minimal.log


018_wal_optimize_node_replica.log
Description: 018_wal_optimize_node_replica.log


[Patch] Optimize dropping of relation buffers using dlist

2019-10-28 Thread k.jami...@fujitsu.com
Hi,

Currently, we need to scan the WHOLE shared buffers when VACUUM
truncated off any empty pages at end of transaction or when relation
is TRUNCATEd.
As for our customer case, we periodically truncate thousands of tables,
and it's possible to TRUNCATE single table per transaction. This can be
problematic later on during recovery which could take longer, especially
when a sudden failover happens after those TRUNCATEs and when we
have to scan a large-sized shared buffer. In the performance test below,
it took almost 12.5 minutes for recovery to complete for 100GB shared
buffers. But we want to keep failover very short (within 10 seconds).

Previously, I made an improvement in speeding the truncates of relation
forks from 3 scans to one scan.[1] This time, the aim of this patch is
to further speedup the invalidation of pages, by linking the cached pages
of the target relation in a doubly-linked list and just traversing it
instead of scanning the whole shared buffers. In DropRelFileNodeBuffers,
we just get the number of target buffers to invalidate for the relation.
There is a significant win in this patch, because we were able to
complete failover and recover in 3 seconds more or less.

I performed similar tests to what I did in the speedup truncates of
relations forks.[1][2] However, this time using 100GB shared_buffers.

[Machine spec used in testing]
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 256GB++

[Test]
1. (Master) Create table (ex. 10,000 tables). Insert data to tables.
2. (Master) DELETE FROM TABLE (ex. all rows of 10,000 tables)
(Standby) To test with failover, pause the WAL replay on standby server.
(SELECT pg_wal_replay_pause();)
3. (M) psql -c "\timing on" (measures total execution of SQL queries)
4. (M) VACUUM (whole db)
5. (M) Stop primary server. pg_ctl stop -D $PGDATA -w
6. (S) Resume wal replay and promote standby.[2]

[Results]

A. HEAD (origin/master branch)
A1. Vacuum execution on Primary server
Time: 730932.408 ms (12:10.932) ~12min 11s
A2. Vacuum + Failover (WAL Recovery on Standby)
waiting for server to promote...
 stopped waiting
pg_ctl: server did not promote in time
2019/10/25_12:13:09.692─┐
2019/10/25_12:25:43.576─┘
-->Total: 12min34s

B. PATCH
B1. Vacuum execution on Primary/Master
Time: 6.518333s = 6518.333 ms
B2. Vacuum + Failover (WAL Recovery on Standby)
2019/10/25_14:17:21.822
waiting for server to promote.. done
server promoted
2019/10/25_14:17:24.827
2019/10/25_14:17:24.833
-->Total: 3.011s

[Other Notes]
Maybe one disadvantage is that the number of relations is variable, while we
allocate the same number of relation structures as the size of shared buffers.
I tried to reduce memory use in the hash-table lookup operation by having a
fixed-size array (100) as a threshold on the number of target buffers to
invalidate.
When doing CachedBufLookup() to scan the count of each buffer in the dlist,
I made sure to reduce the number of scans (2x at most).
First, we scan the dlist of cached buffers of relations.
Then store the target buffers in buf_id_array. Non-target buffers
would be removed from dlist but added to temporary dlist.
After reaching end of main dlist, we append the temporary dlist to
tail of main dlist.
I also performed pgbench buffer test, and this patch did not cause
overhead to normal DB access performance.

Another thing I'd need feedback on is the use of new dlist operations for this
cached buffer list. I did not use the existing Postgres dlist architecture
(ilist.h) in this patch because I want to save as much memory space as
possible, especially when NBuffers becomes large. Both dlist_node and
dlist_head are 16 bytes; OTOH, the two int pointers in this patch are 8 bytes.

Hope to hear your feedback and comments.

Thanks in advance,
Kirk Jamison

[1] 
https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C64E2067%40g01jpexmbkw24
[2] 
https://www.postgresql.org/message-id/D09B13F772D2274BB348A310EE3027C6502672%40g01jpexmbkw24


v1-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v1-Optimize-dropping-of-relation-buffers-using-dlist.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2019-11-05 Thread k.jami...@fujitsu.com
Hi,


> Another one that I'd need feedback of is the use of new dlist operations
> for this cached buffer list. I did not use in this patch the existing
> Postgres dlist architecture (ilist.h) because I want to save memory space
> as much as possible especially when NBuffers become large. Both dlist_node
> & dlist_head are 16 bytes. OTOH, two int pointers for this patch is 8 bytes.

In cb_dlist_combine(), the code block below can impact performance
especially for cases when the doubly linked list is long (IOW, many cached 
buffers).
  /* Point to the tail of the main dlist */
  while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
      curr_main = cb_dlist_next(curr_main);

Attached is an improved version of the previous patch, which adds TAIL-pointer
information in order to speed up the abovementioned operation. I stored the
tail field in the prev pointer of the head entry (maybe not a typical
approach). A more typical one would be adding a tail field (int tail) to
CachedBufferEnt, but I didn't do that because, as I mentioned in a previous
email, I want to avoid using more memory as much as possible.
The patch worked as intended and passed the tests.

Any thoughts?


Regards,
Kirk Jamison


v2-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v2-Optimize-dropping-of-relation-buffers-using-dlist.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2019-11-12 Thread k.jami...@fujitsu.com
On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra 
> wrote:
> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> > you've only done tests / benchmark for the case this optimizes. I
> > think we need to see a benchmark for workload that allocates and
> > invalidates lot of buffers. A pgbench with a workload that fits into
> > RAM but not into shared buffers would be interesting.
> 
> Yeah, it seems pretty hard to believe that this won't be bad for some 
> workloads.
> Not only do you have the overhead of the hash table operations, but you also
> have locking overhead around that. A whole new set of LWLocks where you have
> to take and release one of them every time you allocate or invalidate a buffer
> seems likely to cause a pretty substantial contention problem.

I'm sorry for the late reply. Thank you Tomas and Robert for checking this 
patch.
Attached is the v3 of the patch.
- I moved the unnecessary items from buf_internals.h to cached_buf.c, since
  most of those items are only used in that file.
- Fixed the bug in v2. It seems to pass both the regression and TAP tests now.

Thanks for the advice on the benchmark test. Please see below for the test
setup and results.

[Machine spec]
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 240GB

scale: 3125 (about 46GB DB size)
shared_buffers = 8GB

[workload that fits into RAM but not into shared buffers]
pgbench -i -s 3125 cachetest
pgbench -c 16 -j 8 -T 600 cachetest

[Patched]
scaling factor: 3125
query mode: simple
number of clients: 16
number of threads: 8
duration: 600 s
number of transactions actually processed: 8815123
latency average = 1.089 ms
tps = 14691.436343 (including connections establishing)
tps = 14691.482714 (excluding connections establishing)

[Master/Unpatched]
...
number of transactions actually processed: 8852327
latency average = 1.084 ms
tps = 14753.814648 (including connections establishing)
tps = 14753.861589 (excluding connections establishing)


My patch caused a little overhead of about 0.42-0.46%, which I think is small.
Kindly let me know your opinions/comments about the patch or tests, etc.

Thanks,
Kirk Jamison


v3-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v3-Optimize-dropping-of-relation-buffers-using-dlist.patch


RE: Recovery performance of DROP DATABASE with many tablespaces

2019-11-12 Thread k.jami...@fujitsu.com
On Wed, Oct. 2, 2019 5:40 PM, Fujii Masao wrote:
> On Tue, Jul 10, 2018 at 3:04 PM Michael Paquier  wrote:
> >
> > On Thu, Jul 05, 2018 at 01:42:20AM +0900, Fujii Masao wrote:
> > > TBH, I have no numbers measured by the test.
> > > One question about your test is; how did you measure the *recovery
> > > time* of DROP DATABASE? Since it's *recovery* performance, basically
> > > it's not easy to measure that.
> >
> > It would be simple to measure the time it takes to replay this single
> > DROP DATABASE record by putting two gettimeofday() calls or such
> > things and then take the time difference.  There are many methods that
> > you could use here, and I suppose that with a shared buffer setting of
> > a couple of GBs of shared buffers you would see a measurable
> > difference with a dozen of tablespaces or so.  You could also take a
> > base backup after creating all the tablespaces, connect the standby
> > and then drop the database on the primary to see the actual time it
> > takes.  Your patch looks logically correct to me because
> > DropDatabaseBuffers is a
> > *bottleneck* with large shared_buffers, and it would be nice to see
> > numbers.
> 
> Thanks for the comment!
> 
> I measured how long it takes to replay DROP DATABASE with 1000 tablespaces,
> in master and patched version. shared_buffers was set to 16GB.
> 
> [master]
> It took 8 seconds to replay DROP DATABASE with 1000 tablespaces, as follows.
> In this case, 16GB shared_buffers was fully scanned 1000 times.
> 
> 2019-10-02 16:50:14 JST LOG:  redo starts at 0/228
> 2019-10-02 16:50:22 JST LOG:  redo done at 0/300A298
> 
> [patched]
> It took less than 1 second to replay DROP DATABASE with 1000 tablespaces,
> as follows. In this case, 16GB shared_buffers was scanned only one time.
> 
> 2019-10-02 16:47:03 JST LOG:  redo starts at 0/228
> 2019-10-02 16:47:03 JST LOG:  redo done at 0/3001588
> 

Hi Fujii-san,

It's been a while, so I checked the patch once again.
It's fairly straightforward and I saw no problems nor bugs in the code.

> [patched]
> It took less than 1 second to replay DROP DATABASE with 1000 tablespaces,
The results are good.
I want to replicate the performance to confirm the results as well.
Could you share how you measured the recovery replay?
Did you actually execute a failover?

Regards,
Kirk Jamison


RE: Recovery performance of DROP DATABASE with many tablespaces

2019-11-18 Thread k.jami...@fujitsu.com
On Wed, Nov 13, 2019 5:34PM (GMT+9), Fujii Masao wrote:
> On Wed, Nov 13, 2019 at 3:57 PM k.jami...@fujitsu.com 
> wrote:
> >
> > On Wed, Oct. 2, 2019 5:40 PM, Fujii Masao wrote:
> > > On Tue, Jul 10, 2018 at 3:04 PM Michael Paquier 
> wrote:
> > > >
> > > > On Thu, Jul 05, 2018 at 01:42:20AM +0900, Fujii Masao wrote:
> > > > > TBH, I have no numbers measured by the test.
> > > > > One question about your test is; how did you measure the
> > > > > *recovery
> > > > > time* of DROP DATABASE? Since it's *recovery* performance,
> > > > > basically it's not easy to measure that.
> > > >
> > > > It would be simple to measure the time it takes to replay this
> > > > single DROP DATABASE record by putting two gettimeofday() calls or
> > > > such things and then take the time difference.  There are many
> > > > methods that you could use here, and I suppose that with a shared
> > > > buffer setting of a couple of GBs of shared buffers you would see
> > > > a measurable difference with a dozen of tablespaces or so.  You
> > > > could also take a base backup after creating all the tablespaces,
> > > > connect the standby and then drop the database on the primary to
> > > > see the actual time it takes.  Your patch looks logically correct
> > > > to me because DropDatabaseBuffers is a
> > > > *bottleneck* with large shared_buffers, and it would be nice to
> > > > see numbers.
> > >
> > > Thanks for the comment!
> > >
> > > I measured how long it takes to replay DROP DATABASE with 1000
> > > tablespaces, in master and patched version. shared_buffers was set to
> 16GB.
> > >
> > > [master]
> > > It took 8 seconds to replay DROP DATABASE with 1000 tablespaces, as 
> > > follows.
> > > In this case, 16GB shared_buffers was fully scanned 1000 times.
> > >
> > > 2019-10-02 16:50:14 JST LOG:  redo starts at 0/228
> > > 2019-10-02 16:50:22 JST LOG:  redo done at 0/300A298
> > >
> > > [patched]
> > > It took less than 1 second to replay DROP DATABASE with 1000
> > > tablespaces, as follows. In this case, 16GB shared_buffers was scanned
> only one time.
> > >
> > > 2019-10-02 16:47:03 JST LOG:  redo starts at 0/228
> > > 2019-10-02 16:47:03 JST LOG:  redo done at 0/3001588
> > >
> >
> > Hi Fujii-san,
> >
> > It's been a while, so I checked the patch once again.
> > It's fairly straightforward and I saw no problems nor bug in the code.
> 
> Thanks for the review!
> 
> > > [patched]
> > > It took less than 1 second to replay DROP DATABASE with 1000
> > > tablespaces,
> > The results are good.
> > I want to replicate the performance to confirm the results as well.
> > Could you share how you measured the recovery replay?
> 
> I forgot the actual steps that I used for the measurement.
> But I think they are something like
> 
> 1. create database "hoge"
> 2. create 1,000 tablespaces
> 3. create 1,000 tables on the database "hoge".
> each table should be placed in different tablespace.
> 4. take a base backup
> 5. drop database "hoge"
> 6. shutdown the server with immediate mode
> 7. start an archive recovery from the backup taken at #4
> 8. measure how long it takes to apply DROP DATABASE record by
> checking the timestamp at REDO start and REDO end.
> 
> I think that I performed the above steps on the master and the patched 
> version.
> 
> > Did you actually execute a failover?
> 
> No.

I'm sorry for the late reply, and thanks for the guide above.
I replicated the same recovery test above on a standalone server
and have confirmed with the logs that the patch made the recovery faster.

[MASTER/UNPATCHED] ~10 seconds
2019-11-19 15:25:23.891 JST [23042] LOG:  redo starts at 0/180006A0
...
2019-11-19 15:25:34.492 JST [23042] LOG:  redo done at 0/1800A478

[PATCHED]  ~less than 1 sec
2019-11-19 15:31:59.415 JST [17625] LOG:  redo starts at 0/40005B8
...
2019-11-19 15:32:00.159 JST [17625] CONTEXT:  WAL redo at 0/4000668 for 
Database/DROP: dir 1663/16384 16385/16384...//further details omitted//...
...
2019-11-19 15:32:00.159 JST [17625] LOG:  redo done at 0/4001638

I believe there are no problems, so I am marking this patch now
as "Ready for Committer".

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2019-11-27 Thread k.jami...@fujitsu.com
On Wed, Nov 13, 2019 4:20AM (GMT +9), Tomas Vondra wrote:
> On Tue, Nov 12, 2019 at 10:49:49AM +0000, k.jami...@fujitsu.com wrote:
> >On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> >> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
> >> 
> >> wrote:
> >> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> >> > you've only done tests / benchmark for the case this optimizes. I
> >> > think we need to see a benchmark for workload that allocates and
> >> > invalidates lot of buffers. A pgbench with a workload that fits into
> >> > RAM but not into shared buffers would be interesting.
> >>
> >> Yeah, it seems pretty hard to believe that this won't be bad for some
> workloads.
> >> Not only do you have the overhead of the hash table operations, but
> >> you also have locking overhead around that. A whole new set of
> >> LWLocks where you have to take and release one of them every time you
> >> allocate or invalidate a buffer seems likely to cause a pretty substantial
> contention problem.
> >
> >I'm sorry for the late reply. Thank you Tomas and Robert for checking this
> patch.
> >Attached is the v3 of the patch.
> >- I moved the unnecessary items from buf_internals.h to cached_buf.c
> >since most of
> >  of those items are only used in that file.
> >- Fixed the bug of v2. Seems to pass both RT and TAP test now
> >
> >Thanks for the advice on benchmark test. Please refer below for test and
> results.
> >
> >[Machine spec]
> >CPU: 16, Number of cores per socket: 8
> >RHEL6.5, Memory: 240GB
> >
> >scale: 3125 (about 46GB DB size)
> >shared_buffers = 8GB
> >
> >[workload that fits into RAM but not into shared buffers] pgbench -i -s
> >3125 cachetest pgbench -c 16 -j 8 -T 600 cachetest
> >
> >[Patched]
> >scaling factor: 3125
> >query mode: simple
> >number of clients: 16
> >number of threads: 8
> >duration: 600 s
> >number of transactions actually processed: 8815123 latency average =
> >1.089 ms tps = 14691.436343 (including connections establishing) tps =
> >14691.482714 (excluding connections establishing)
> >
> >[Master/Unpatched]
> >...
> >number of transactions actually processed: 8852327 latency average =
> >1.084 ms tps = 14753.814648 (including connections establishing) tps =
> >14753.861589 (excluding connections establishing)
> >
> >
> >My patch caused a little overhead of about 0.42-0.46%, which I think is 
> >small.
> >Kindly let me know your opinions/comments about the patch or tests, etc.
> >
> 
> Now try measuring that with a read-only workload, with prepared statements.
> I've tried that on a machine with 16 cores, doing
> 
># 16 clients
>pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
> 
># 1 client
>pgbench -n -S -c 1 -M prepared -T 60 test
> 
> and average from 30 runs of each looks like this:
> 
> # clients   master   patched      %
> ------------------------------------
>  1           29690     27833   93.7%
>  16         300935    283383   94.1%
> 
> That's quite significant regression, considering it's optimizing an
> operation that is expected to be pretty rare (people are generally not
> dropping objects as often as they query them).

I updated the patch and reduced the lock contention of the new LWLock,
with tunable definitions in the code; instead of using only the rnode as
the hash key, I also include the modulo of the block number.
#define NUM_MAP_PARTITIONS_FOR_REL  128 /* relation-level */
#define NUM_MAP_PARTITIONS_IN_REL   4   /* block-level */
#define NUM_MAP_PARTITIONS \
    (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)
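To illustrate how such a two-level scheme can map a buffer to one of the 512 partitions, here is a minimal self-contained sketch. The struct, hash function, and function names below are simplified stand-ins made up for illustration, not the actual patch code:

```c
#include <stdint.h>

#define NUM_MAP_PARTITIONS_FOR_REL  128 /* relation-level */
#define NUM_MAP_PARTITIONS_IN_REL   4   /* block-level */
#define NUM_MAP_PARTITIONS \
    (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

/* simplified stand-in for RelFileNode */
typedef struct
{
    uint32_t spcNode;
    uint32_t dbNode;
    uint32_t relNode;
} RelKey;

/* toy hash; the real code would use PostgreSQL's buffer-tag hashing */
static uint32_t
rel_hash(const RelKey *k)
{
    return k->spcNode * 2654435761u ^ k->dbNode * 40503u ^ k->relNode;
}

/*
 * Pick a partition: the relation hash selects one of 128 relation-level
 * groups, then blockNum modulo 4 spreads blocks of the same relation
 * across 4 sub-partitions, reducing lock contention on hot relations.
 */
static int
map_partition(const RelKey *k, uint32_t blockNum)
{
    uint32_t relPart = rel_hash(k) % NUM_MAP_PARTITIONS_FOR_REL;
    uint32_t blkPart = blockNum % NUM_MAP_PARTITIONS_IN_REL;

    return (int) (relPart * NUM_MAP_PARTITIONS_IN_REL + blkPart);
}
```

Consecutive blocks of the same relation land in different sub-partitions, so concurrent backends touching the same relation do not all contend on one lock.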

I executed again a benchmark for read-only workload,
but regression currently sits at 3.10% (reduced from v3's 6%).

Average of 10 runs, 16 clients
read-only, prepared query mode

[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24
tps = 199,189.54

[V4 Patch]
num of txn processed: 11,580,256.36 
latency average = 0.083 ms
tps = 193,003.52
tps = 193,010.76


I checked the wait event statistics (non-impactful events omitted)
and got the results shown below.
I reset the stats before running the pgbench script,
then showed the stats right after the run.

[Master]
 wait_event_type |  wait_event   |  calls   | microsec
-+---+--+--
 Client  | ClientRead|   25116  | 49552452
 IO  | Data

RE: [Patch] Optimize dropping of relation buffers using dlist

2019-12-13 Thread k.jami...@fujitsu.com
Hi,

I have updated the patch (v5).
I tried to reduce the lock waiting times by using a spinlock
when inserting/deleting buffers in the new hash table, and an
exclusive lock when doing lookups for buffers to be dropped.
In summary, instead of scanning the whole buffer pool in
shared buffers, we just traverse the doubly-linked list of
buffers for the target relation and block.
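As a rough, self-contained sketch of that traversal idea (the real patch would use PostgreSQL's dlist API from ilist.h and real buffer descriptors; every type and name below is a simplified stand-in for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* minimal intrusive doubly-linked list, in the spirit of ilist.h */
typedef struct node
{
    struct node *prev;
    struct node *next;
} node;

typedef struct
{
    node head;              /* circular: head.next is the first element */
} dlist;

typedef struct
{
    node     chain;         /* links buffers of the same relation */
    uint32_t relNode;
    uint32_t blockNum;
    int      valid;
} BufDesc;

static void
dlist_init(dlist *l)
{
    l->head.next = l->head.prev = &l->head;
}

static void
dlist_push(dlist *l, node *n)
{
    n->next = l->head.next;
    n->prev = &l->head;
    l->head.next->prev = n;
    l->head.next = n;
}

static void
dlist_delete(node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

#define container(ptr) ((BufDesc *) ((char *) (ptr) - offsetof(BufDesc, chain)))

/*
 * Invalidate all buffers of one relation by walking only its own list,
 * instead of scanning every buffer in the pool.
 */
static int
drop_rel_buffers(dlist *rel_list)
{
    int   dropped = 0;
    node *cur = rel_list->head.next;

    while (cur != &rel_list->head)
    {
        node    *next = cur->next;  /* save before unlinking */
        BufDesc *buf = container(cur);

        buf->valid = 0;
        dlist_delete(cur);
        dropped++;
        cur = next;
    }
    return dropped;
}
```

The cost of dropping becomes proportional to the number of buffers the relation actually holds, rather than to the total size of shared_buffers.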

In order to understand how this patch affects performance,
I also measured the cache hit rates in addition to
benchmarking db with various shared buffer size settings.

Using the same machine specs, I used the default script
of pgbench for read-only workload with prepared statement,
and executed about 15 runs for varying shared buffer sizes.
  pgbench -i -s 3200 test  //(about 48GB db size)
  pgbench -S -n -M prepared -c 16 -j 16 -T 60 test

[TPS Regression]
 shbuf | tps(master) | tps(patch) | %reg
-------+-------------+------------+------
  5GB  | 195,737.23  | 191,422.23 |  2.23
 10GB  | 197,067.93  | 194,011.66 |  1.55
 20GB  | 200,241.18  | 200,425.29 | -0.09
 40GB  | 208,772.81  | 209,807.38 | -0.50
 50GB  | 215,684.33  | 218,955.43 | -1.52

[CACHE HIT RATE]
 shbuf |  master  |  patch
-------+----------+----------
 10GB  | 0.141536 | 0.141485
 20GB  | 0.330088 | 0.329894
 30GB  | 0.573383 | 0.573377
 40GB  | 0.819499 | 0.819264
 50GB  | 0.999237 | 0.999577

For this workload, the regression increases for shared_buffers sizes
below 20GB. However, the cache hit rate for both master and patch is
only about 32% at 20GB shared_buffers. Therefore, I think we can
consider this kind of workload with a small shared_buffers size as a
“special case”, because in terms of db performance tuning we want the
db to have as high a cache hit rate as possible (99.9%, or let's say
80% is acceptable). In this workload, the ideal shared_buffers size
to hit that acceptable cache hit rate would be around 40GB, more or
less. Looking at this patch's performance results, within the
acceptable cache hit rate range there is at least no regression, and
the results also show almost the same tps as master.

Your feedback about the patch and tests are welcome.

Regards,
Kirk Jamison


v5-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v5-Optimize-dropping-of-relation-buffers-using-dlist.patch


RE: VACUUM memory management

2020-01-21 Thread k.jami...@fujitsu.com
Hi Ibrar,

Are you still working on this patch?
Currently the patch does not apply, mainly because recent commits
for parallel vacuum have updated the files touched by this patch.
Kindly rebase it and change the status to "Needs Review" afterwards.

Upon a quick scan of the other thread [1] mentioned above,
I believe the people involved reached consensus on the same direction
of allocating memory in chunks, dynamically allocating more when
needed. A point for discussion was the size of each chunk allocation.
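For illustration, the chunked-allocation direction could look roughly like the sketch below. All names here (DeadTupleChunks, the chunk size, the helpers) are hypothetical stand-ins, not taken from the actual patch; it also shows a binary search spanning the two-dimensional chunk layout, in the spirit of the pg_bsearch() mentioned later:

```c
#include <stdlib.h>
#include <stdint.h>

#define CHUNK_TUPLES 1024           /* tuples per chunk; a tunable in practice */

typedef struct
{
    uint64_t **chunks;              /* directory of chunk pointers */
    int        nchunks;             /* chunks allocated so far */
    long       ntuples;             /* total dead tuples stored */
} DeadTupleChunks;

static void
dtc_init(DeadTupleChunks *dt)
{
    dt->chunks = NULL;
    dt->nchunks = 0;
    dt->ntuples = 0;
}

/* Append one dead-tuple identifier, growing by one chunk only when needed. */
static void
dtc_add(DeadTupleChunks *dt, uint64_t tid)
{
    if (dt->ntuples == (long) dt->nchunks * CHUNK_TUPLES)
    {
        /* grow the chunk directory and lazily allocate a fresh chunk */
        dt->chunks = realloc(dt->chunks,
                             (dt->nchunks + 1) * sizeof(uint64_t *));
        dt->chunks[dt->nchunks++] = malloc(CHUNK_TUPLES * sizeof(uint64_t));
    }
    dt->chunks[dt->ntuples / CHUNK_TUPLES][dt->ntuples % CHUNK_TUPLES] = tid;
    dt->ntuples++;
}

/* Binary search across the chunked (two-dimensional) sorted array. */
static int
dtc_contains(const DeadTupleChunks *dt, uint64_t tid)
{
    long lo = 0, hi = dt->ntuples - 1;

    while (lo <= hi)
    {
        long     mid = lo + (hi - lo) / 2;
        uint64_t v = dt->chunks[mid / CHUNK_TUPLES][mid % CHUNK_TUPLES];

        if (v == tid)
            return 1;
        else if (v < tid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return 0;
}
```

Compared with one huge upfront allocation sized by maintenance_work_mem, this only consumes memory proportional to the dead tuples actually found.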

After a brief look of your patch, there's a typo between
declaration and definition of lazy_vacuum_page():
arryindex --> arrindex

static int   lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-                             int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+                             int arryindex, int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);

static int
lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-                int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
+                int arrindex, int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)

Unnecessary change:
-   long  maxtuples;
-   int   vac_work_mem = IsAutoVacuumWorkerProcess() &&
+   long    maxtuples;
+   int     vac_work_mem = IsAutoVacuumWorkerProcess() &&

Other typo:
+ * pg_bsearch() -- bsearch algorithem for two dimention array.
algorithem --> algorithm
dimention --> dimension

I might have missed something more,
but I'll continue reviewing after the rebased patch.

Regards,
Kirk Jamison

[1] 
https://www.postgresql.org/message-id/flat/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com


RE: [Patch]: Documentation of ALTER TABLE re column type changes on binary-coercible fields

2020-01-28 Thread k.jami...@fujitsu.com
On Wednesday, January 29, 2020 3:56 AM (GMT+9),  Mike Lissner wrote:
> Hi all, I didn't get any replies to this. Is this the right way to send in a 
> patch to the
> docs?

Hello,
Yes, although your current patch does not apply, as I tried it on my machine.
But you can still rebase it.
For the reviewers/committers to keep track of this, I think it might be better
to register your patch to the commitfest app: https://commitfest.postgresql.org/27/,
and you may put it under the "Documentation" topic.

There's also a CFbot to check online whether your patch still applies cleanly
and passes the tests, especially after several commits in the source code.
Current CF: http://commitfest.cputube.org/index.html
Next CF: http://commitfest.cputube.org/next.html

Regards,
Kirk Jamison


> On Thu, Jan 23, 2020 at 11:01 PM Mike Lissner wrote:
> 
> 
>   Hi, first patch here and first post to pgsql-hackers. Here goes.
> 
> 
>   Enclosed please find a patch to tweak the documentation of the ALTER 
> TABLE
> page. I believe this patch is ready to be applied to master and backported 
> all the way
> to 9.2.
> 
> 
>   On the ALTER TABLE page, it currently notes that if you change the type 
> of a
> column, even to a binary coercible type:
> 
>   > any indexes on the affected columns must still be rebuilt.
> 
> 
>   It appears this hasn't been true for about eight years, since 367bc426a.
> 
>   Here's the discussion of the topic from earlier today and yesterday:
> 
>   https://www.postgresql.org/message-
> id/flat/CAMp9%3DExXtH0NeF%2BLTsNrew_oXycAJTNVKbRYnqgoEAT01t%3D67A%40
> mail.gmail.com
> 
>   I haven't run tests, but I presume they'll be unaffected by a 
> documentation
> change.
> 
> 
>   I've made an effort to follow the example of other people's patches I 
> looked
> at, but I haven't contributed here before. Happy to take another stab at this 
> if this
> doesn't hit the mark — though I hope it does. I love and appreciate 
> Postgresql and
> hope that I can do my little part to make it better.
> 
>   For the moment, I haven't added this to commitfest. I don't know what 
> it is,
> but I suspect this is small enough somebody will just pick it up.
> 
> 
>   Mike
> 



RE: [Patch] Optimize dropping of relation buffers using dlist

2020-03-24 Thread k.jami...@fujitsu.com
Hi,

I know this might already be late at end of CommitFest, but attached
is the latest version of the patch. The previous version only includes buffer
invalidation improvement for VACUUM. The new patch adds the same
routine for TRUNCATE WAL replay.

In summary, this patch aims to improve the buffer invalidation process
of VACUUM and  TRUNCATE. Although it may not be a common use
case, our customer uses these commands a lot. Recovery and WAL
replay of these commands can take time depending on the size of
database buffers. So this patch optimizes that using the newly-added
auxiliary cache and doubly-linked list in shared memory, so that
we don't need to scan the whole shared buffers anymore.

As for the performance and how it affects read-only workloads:
using pgbench, I've kept the overhead to a minimum, less than 1%.
I'll post follow-up results.

Although the additional hash table utilizes shared memory, there's
a significant performance gain for both TRUNCATE and VACUUM
from execution to recovery.

Regards,
Kirk Jamison


v7-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v7-Optimize-dropping-of-relation-buffers-using-dlist.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-03-30 Thread k.jami...@fujitsu.com
On Wednesday, March 25, 2020 3:25 PM, Kirk Jamison wrote:
> As for the performance and how it affects read-only workloads:
> using pgbench, I've kept the overhead to a minimum, less than 1%.
> I'll post follow-up results.

Here's the follow-up results.
I executed the similar tests from top of the thread.
I hope the performance test results shown below would suffice.
If not, I'd appreciate any feedback with regards to test or the patch itself.

A. VACUUM execution + Failover test
- 100GB shared_buffers

1. 1000 tables (18MB)
1.1. Execution Time
- [MASTER] 77755.218 ms (01:17.755)
- [PATCH] Execution Time:   2147.914 ms (00:02.148)
1.2. Failover Time (Recovery WAL Replay):
- [MASTER] 01:37.084 (1 min 37.884 s)
- [PATCH] 1627 ms (1.627 s)

2. 1 tables (110MB)
2.1. Execution Time
- [MASTER] 844174.572 ms (14:04.175) ~14 min 4.175 s
- [PATCH] 75678.559 ms (01:15.679) ~1 min 15.679 s

2.2. Failover Time:
- [MASTER] est. 14 min++
(I didn't measure it exactly because recovery takes
about as long as the execution time.)
- [PATCH] 01:25.559 (1 min 25.559 s)

Significant performance results for VACUUM.


B. TPS Regression for READ-ONLY workload
(PREPARED QUERY MODE, NO VACUUM)

# [16 Clients]
- pgbench -n -S -j 16 -c 16 -M prepared -T 60 cachetest

|shbuf | Master    | Patch     | % reg |
|------|-----------|-----------|-------|
|128MB | 77,416.76 | 77,162.78 |  0.33%|
|1GB   | 81,941.30 | 81,812.05 |  0.16%|
|2GB   | 84,273.69 | 84,356.38 | -0.10%|
|100GB | 83,807.30 | 83,924.68 | -0.14%|

# [1 Client]
- pgbench -n -S -c 1 -M prepared -T 60 cachetest

|shbuf | Master    | Patch     | % reg |
|------|-----------|-----------|-------|
|128MB | 12,044.54 | 12,037.13 |  0.06%|
|1GB   | 12,736.57 | 12,774.77 | -0.30%|
|2GB   | 12,948.98 | 13,159.90 | -1.63%|
|100GB | 12,982.98 | 13,064.04 | -0.62%|

Each case was run 10 times; the average tps and % regression are
shown above. At worst the patch caused only minimal overhead, and
in the other cases it achieved higher tps than master.

If it does not make it this CF, I hope to receive feedback in the future
on how to proceed. Thanks in advance!

Regards,
Kirk Jamison


RE: Transactions involving multiple postgres foreign servers, take 2

2021-06-24 Thread k.jami...@fujitsu.com
Hi Sawada-san,

I also tried to play a bit with the latest patches similar to Ikeda-san,
and with foreign 2PC parameter enabled/required.

> > >> b. about performance bottleneck (just share my simple benchmark
> > >> results)
> > >>
> > >> The resolver process can be performance bottleneck easily although
> > >> I think some users want this feature even if the performance is not so
> good.
> > >>
> > >> I tested with very simple workload in my laptop.
> > >>
> > >> The test condition is
> > >> * two remote foreign partitions and one transaction inserts an
> > >> entry in each partitions.
> > >> * local connection only. If NW latency became higher, the
> > >> performance became worse.
> > >> * pgbench with 8 clients.
> > >>
> > >> The test results is the following. The performance of 2PC is only
> > >> 10% performance of the one of without 2PC.
> > >>
> > >> * with foreign_twophase_commit = requried
> > >> -> If load with more than 10TPS, the number of unresolved foreign
> > >> -> transactions
> > >> is increasing and stop with the warning "Increase
> > >> max_prepared_foreign_transactions".
> > >
> > > What was the value of max_prepared_foreign_transactions?
> >
> > Now, I tested with 200.
> >
> > If each resolution is finished very soon, I thought it's enough
> > because 8clients x 2partitions = 16, though... But, it's difficult how
> > to know the stable values.
> 
> During resolving one distributed transaction, the resolver needs both one
> round trip and fsync-ing WAL record for each foreign transaction.
> Since the client doesn’t wait for the distributed transaction to be resolved,
> the resolver process can be easily bottle-neck given there are 8 clients.
> 
> If foreign transaction resolution was resolved synchronously, 16 would
> suffice.


I tested the V36 patches on my 16-core machines.
I set up two foreign servers (F1, F2).
F1 has addressbook table.
F2 has pgbench tables (scale factor = 1).
There is also 1 coordinator (coor) server where I created user mapping to 
access the foreign servers.
I executed the benchmark measurement on coordinator.
My custom scripts are set up in a way that queries from the coordinator
have to access the two foreign servers.

Coordinator:
max_prepared_foreign_transactions = 200
max_foreign_transaction_resolvers = 1
foreign_twophase_commit = required

Other external servers 1 & 2 (F1 & F2):
max_prepared_transactions = 100


[select.sql]
\set int random(1, 10)
BEGIN;
SELECT ad.name, ad.age, ac.abalance
FROM addressbook ad, pgbench_accounts ac
WHERE ad.id = :int AND ad.id = ac.aid;
COMMIT;

I then executed:
pgbench -r -c 2 -j 2 -T 60 -f select.sql coor

While there were no problems with 1-2 clients, I started having problems
when running the benchmark with more than 3 clients.

pgbench -r -c 4 -j 4 -T 60 -f select.sql coor

I got the following error on coordinator:

[95396] ERROR:  could not prepare transaction on server F2 with ID 
fx_151455979_1216200_16422
[95396] STATEMENT:  COMMIT;
WARNING:  there is no transaction in progress
pgbench: error: client 1 script 0 aborted in command 3 query 0: ERROR:  could 
not prepare transaction on server F2 with ID fx_151455979_1216200_16422

Here's the log on foreign server 2  matching the above error:
 LOG:  statement: PREPARE TRANSACTION 'fx_151455979_1216200_16422'
 ERROR:  maximum number of prepared transactions reached
 HINT:  Increase max_prepared_transactions (currently 100).
 STATEMENT:  PREPARE TRANSACTION 'fx_151455979_1216200_16422'

So I increased the max_prepared_transactions of F1 and F2 from 100 to 200.
Then I got the error:

[146926] ERROR:  maximum number of foreign transactions reached
[146926] HINT:  Increase max_prepared_foreign_transactions: "200".

So I increased max_prepared_foreign_transactions to "300",
and got the same error telling me to increase max_prepared_transactions
on the foreign servers.

I just can't find the right tuning values for this.
It seems that we always run out of memory in FdwXactState insert_fdwxact
with multiple concurrent connections during PREPARE TRANSACTION.
I encountered this only with the SELECT benchmark; I had no problems
with multiple connections in my custom scripts for the UPDATE and
INSERT benchmarks when I tested up to 30 clients.

Would the following possibly solve this bottleneck problem?

> > > To speed up the foreign transaction resolution, some ideas have been
> > > discussed. As another idea, how about launching resolvers for each
> > > foreign server? That way, we resolve foreign transactions on each
> > > foreign server in parallel. If foreign transactions are concentrated
> > > on the particular server, we can have multiple resolvers for the one
> > > foreign server. It doesn’t change the fact that all foreign
> > > transaction resolutions are processed by resolver processes.
> >
> > Awesome! There seems to be another pros that even if a foreign server
> > is temporarily busy or stopped due to fail over, other foreign
> > server's transactions can 

RE: Transactions involving multiple postgres foreign servers, take 2

2021-07-07 Thread k.jami...@fujitsu.com
On Wed, June 30, 2021 10:06 (GMT+9), Masahiko Sawada wrote:
> I've attached the new version patch that incorporates the comments from
> Fujii-san and Ikeda-san I got so far. We launch a resolver process per foreign
> server, committing prepared foreign transactions on foreign servers in 
> parallel.

Hi Sawada-san,
Thank you for the latest set of patches.
I've noticed from cfbot that the regression test failed, and I also could not 
compile it.

== running regression test queries==
test test_fdwxact ... FAILED   21 ms
== shutting down postmaster   ==
==
 1 of 1 tests failed. 
==

> To get a better performance based on the current architecture, we can have
> multiple resolver processes per foreign server but it seems not easy to tune 
> it
> in practice. Perhaps is it better if we simply have a pool of resolver 
> processes
> and we assign a resolver process to the resolution of one distributed
> transaction one by one? That way, we need to launch resolver processes as
> many as the concurrent backends using 2PC.

Yes, finding the right values to tune for max_prepared_foreign_transactions and
max_prepared_transactions seems difficult. If we set the number of resolver
processes to the number of concurrent backends using 2PC, how do we determine
the value of max_foreign_transaction_resolvers? It might be good to gather some
statistics to judge the value; then we can compare the performance against the
V37 version.

-
Also, this is a bit of a side topic. I know we've been discussing how to
improve/fix the resolver process bottlenecks, and Takahashi-san provided
the details in the above thread where V37 has problems. (I am joining the
testing too.)

I am not sure if this has been brought up before, given the length of the
thread, but I think we need a way to prevent the resolver process from
waiting in an infinite retry loop while resolving a prepared foreign
transaction. Currently, when a crashed foreign server is recovered during
resolution retries, the information is recovered from WAL and files,
and the resolver process resumes the foreign transaction resolution.
However, what if we cannot (or intentionally do not want to) recover the
crashed server for a long time?

An idea is to make the resolver process stop automatically after some
maximum number of retries.
We can call the parameter as foreign_transaction_resolution_max_retry_count.
There may be a better name, but I followed the pattern from your patch.

The server downtime can be estimated considering the proposed parameter
foreign_transaction_resolution_retry_interval (default 10s) from the
patch set.
In addition, according to docs, "a foreign server using the postgres_fdw
foreign data wrapper can have the same options that libpq accepts in
connection strings", so the connect_timeout set during CREATE SERVER can
also affect it.

Example:
CREATE SERVER's connect_timeout setting = 5s
foreign_transaction_resolution_retry_interval = 10s
foreign_transaction_resolution_max_retry_count = 3

Estimated total time before resolver stops: 
= (5s) * (3 + 1) + (10s) * (3) = 50 s

00s: 1st connect start
05s: 1st connect timeout
(retry interval)
15s: 2nd connect start (1st retry)
20s: 2nd connect timeout
(retry interval)
30s: 3rd connect start (2nd retry)
35s: 3rd connect timeout
(retry interval)
45s: 4th connect start (3rd retry)
50s: 4th connect timeout
(resolver process stops)
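The estimate above can be expressed as a tiny helper function (a sketch only; the parameter names mirror the proposed settings, and the function itself is hypothetical):

```c
/*
 * Estimated worst-case time (in seconds) before the resolver gives up,
 * assuming every connection attempt times out:
 *   (max_retry_count + 1) attempts, each waiting connect_timeout,
 *   separated by max_retry_count retry intervals.
 */
static int
estimated_resolver_downtime(int connect_timeout,
                            int retry_interval,
                            int max_retry_count)
{
    return connect_timeout * (max_retry_count + 1)
         + retry_interval * max_retry_count;
}
```

With connect_timeout = 5s, retry interval = 10s, and 3 retries, this reproduces the 50 s figure from the timeline above.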

Then the resolver process will not wait indefinitely and will stop after
some time, depending on the settings of the above parameters.
This would be an automatic counterpart of pg_stop_foreign_xact_resolver.
Once the resolver has stopped and the crashed server is later restored,
the user can then execute pg_resolve_foreign_xact().
Do you think this idea is feasible and we can add it as part of the patch sets?

Regards,
Kirk Jamison


RE: Transactions involving multiple postgres foreign servers, take 2

2021-10-04 Thread k.jami...@fujitsu.com
Hi Sawada-san,

I noticed that this thread and its set of patches have been marked
"Returned with Feedback" by yourself.
I find the feature (atomic commit for foreign transactions) very useful,
and it would pave the road for having distributed transaction management in
Postgres.
Although we have not arrived at a consensus on which approach is best,
there were significant reviews and major patch changes in the past 2 years.
By any chance, do you have any plans to continue this from where you left off?

Regards,
Kirk Jamison



RE: Transactions involving multiple postgres foreign servers, take 2

2021-10-06 Thread k.jami...@fujitsu.com
Hi Fujii-san and Sawada-san,

Thank you very much for your replies.

> >> I noticed that this thread and its set of patches have been marked with
> "Returned with Feedback" by yourself.
> >> I find the feature (atomic commit for foreign transactions) very
> >> useful and it will pave the road for having a distributed transaction
> management in Postgres.
> >> Although we have not arrived at consensus at which approach is best,
> >> there were significant reviews and major patch changes in the past 2 years.
> >> By any chance, do you have any plans to continue this from where you left 
> >> off?
> >
> > As I could not reply to the review comments from Fujii-san for almost
> > three months, I don't have enough time to move this project forward at
> > least for now. That's why I marked this patch as RWF. I’d like to
> > continue working on this project in my spare time but I know this is
> > not a project that can be completed by using only my spare time. If
> > someone wants to work on this project, I’d appreciate it and am happy
> > to help.
> 
> Probably it's time to rethink the approach. The patch introduces foreign
> transaction manager into PostgreSQL core, but as far as I review the patch, 
> its
> changes look overkill and too complicated.
> This seems one of reasons why we could not have yet committed the feature even
> after several years.
> 
> Another concern about the approach of the patch is that it needs to change a
> backend so that it additionally waits for replication during commit phase 
> before
> executing PREPARE TRANSACTION to foreign servers. Which would decrease the
> performance during commit phase furthermore.
> 
> So I wonder if it's worth revisiting the original approach, i.e., add the 
> atomic
> commit into postgres_fdw. One disadvantage of this is that it supports atomic
> commit only between foreign PostgreSQL servers, not other various data
> resources like MySQL.
> But I'm not sure if we really want to do atomic commit between various FDWs.
> Maybe supporting only postgres_fdw is enough for most users. Thought?

The intention of Sawada-san's patch is ambitious and would be very helpful, because
it accommodates possible future support of atomic commit for various types of FDWs.
However, it has been difficult to reach agreement on it, as other reviewers have also
pointed out the commit performance. Another open question is how it should work once
we also implement atomic visibility (which is another topic for distributed
transactions, but worth considering).
That said, if we're going to initially support it only in postgres_fdw, which is
simpler than the latest patches, we need to ensure that failures and errors are
properly handled, and to demonstrate that commit performance can be improved,
e.g. by committing on the foreign servers in parallel rather than serially.
And if possible, although not necessary during the first step, it may reassure
the other reviewers if we can also sketch how atomic visibility could be
implemented on top of postgres_fdw.
Thoughts?

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-06-16 Thread k.jami...@fujitsu.com
Hi,

Since the last posted version of the patch no longer applies, attached is a rebased version.
Written upthread were performance results and some benefits and challenges.
I'd appreciate your feedback/comments.

Regards,
Kirk Jamison


v8-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v8-Optimize-dropping-of-relation-buffers-using-dlist.patch


RE: Add MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP options to VACUUM

2020-07-13 Thread k.jami...@fujitsu.com
On Tuesday, July 14, 2020 3:01 AM (GMT+9), Bossart, Nathan wrote:

Hi Nathan,

>On 7/13/20, 11:02 AM, "Justin Pryzby"  wrote:
>> Should bin/vacuumdb support this?
>
>Yes, it should.  I've added it in v5 of the patch.

Thank you for the updated patch. I've joined as a reviewer.
I've also noticed that you have incorporated Justin's suggested vacuumdb support
in the recent patch, but in my opinion it'd be better to split them for better 
readability.
According to the cfbot, patch applies cleanly and passes all the tests.

[Use Case]
>The main use case I'm targeting is when the level of bloat or
>transaction ages of a relation and its TOAST table have significantly
>diverged.  In these scenarios, it could be beneficial to be able to
>vacuum just one or the other, especially if the tables are large.
>...
>I reused the existing VACOPT_SKIPTOAST option to implement
>SECONDARY_RELATION_CLEANUP.  This option is currently only used for
>autovacuum.

Perhaps this has not gathered much attention yet because few users have run into
the problem, but I don't see any issue with adding these options to manual
VACUUM on top of the existing autovacuum cleanups. I think this is useful
for the special use case mentioned, especially since TOAST tables are not
directly accessible to ordinary roles.

[Docs]
I also agree with "TOAST_TABLE_CLEANUP" and just name the options after the
respective proposed relation types in the future.

+MAIN_RELATION_CLEANUP
+
+ 
+  Specifies that VACUUM should attempt to process the
+  main relation.  This is normally the desired behavior and is the default.
+  Setting this option to false may be useful when it is necessary to only
+  vacuum a relation's corresponding TOAST table.

Perhaps it's just my own opinion, but I think the word "process" is vague for
a beginner reading the postgres documents. OTOH, I know it's also used
in the source code, so I guess it's just the convention, and "process" is
intuitive in the sense of "processing tables". Anyway, just my 2 cents; not a
strong opinion.

Also, there's an extra space between the 1st and 2nd sentences.


+TOAST_TABLE_CLEANUP
+
+ 
+  Specifies that VACUUM should attempt to process the
+  corresponding TOAST table for each relation, if one
+  exists.  This is normally the desired behavior and is the default.
+  Setting this option to false may be useful when it is necessary to only
+  vacuum the main relation.  This option cannot be disabled when the
+  FULL option is used.

Same comments as above, and there are likewise extra spaces between the sentences.

@@ -1841,9 +1865,16 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
/*
 * Remember the relation's TOAST relation for later, if the caller asked
 * us to process it.  In VACUUM FULL, though, the toast table is
-* automatically rebuilt by cluster_rel so we shouldn't recurse to it.
+* automatically rebuilt by cluster_rel, so we shouldn't recurse to it
+* unless MAIN_RELATION_CLEANUP is disabled.

The added last line is a bit confusing (and may be unnecessary/unrelated).
To clarify my understanding of this thread on VACUUM FULL and the revised
vacuum_rel below: we allow the MAIN_RELATION_CLEANUP option to be disabled
(skipping the main relation), and TOAST_TABLE_CLEANUP should be disabled
because cluster_rel() will process the TOAST table anyway.
Is my understanding correct? If yes, then maybe "unless" should be "even if"
instead, or we can just remove the line.

 static bool
-vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
+vacuum_rel(Oid relid,
+  RangeVar *relation,
+  VacuumParams *params,
+  bool processing_toast_table)
{
...
+   boolprocess_toast;
...

-   if (!(params->options & VACOPT_SKIPTOAST) && !(params->options & VACOPT_FULL))
+   process_toast = (params->options & VACOPT_TOAST_CLEANUP) != 0;
+
+   if ((params->options & VACOPT_FULL) != 0 &&
+   (params->options & VACOPT_MAIN_REL_CLEANUP) != 0)
+   process_toast = false;
+
+   if (process_toast)
toast_relid = onerel->rd_rel->reltoastrelid;
else
toast_relid = InvalidOid;
...

 * Do the actual work --- either FULL or "lazy" vacuum
+*
+* We skip this part if we're processing the main relation and
+* MAIN_RELATION_CLEANUP has been disabled.
 */
-   if (params->options & VACOPT_FULL)
+   if ((params->options & VACOPT_MAIN_REL_CLEANUP) != 0 ||
+   processing_toast_table)
...
if (toast_relid != InvalidOid)
-   vacuum_rel(toast_relid, NULL, params);
+   vacuum_rel(toast_relid, NULL, params, true);



>I've attached v3 of the patch, which removes the restriction on
>ANALYZE with MAIN_RELATION_CLEANUP disabled.

I've also confirmed those through regression + tap test in my own env
and they'

RE: Parallel Seq Scan vs kernel read ahead

2020-07-17 Thread k.jami...@fujitsu.com
On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote:

>On Wed, 15 Jul 2020 at 14:51, Amit Kapila  wrote:
>>
>> On Wed, Jul 15, 2020 at 5:55 AM David Rowley  wrote:
>>> If we've not seen any performance regressions within 1 week, then I 
>>> propose that we (pending final review) push this to allow wider 
>>> testing.
>>
>> I think Soumyadeep has reported a regression case [1] with the earlier 
>> version of the patch.  I am not sure if we have verified that the 
>> situation improves with the latest version of the patch.  I request 
>> Soumyadeep to please try once with the latest patch.
>...
>Yeah, it would be good to see some more data points on that test.
>Jumping from 2 up to 6 workers just leaves us to guess where the performance
>started to become bad. >It would be good to know if the regression is
>repeatable or if it was affected by some other process.
>...
>It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET track_io_timing = 
>on;
>for each value of >max_parallel_workers.

Hi,

If I'm following the thread correctly, we may have gains on this patch
of Thomas and David, but we need to test its effects on different
filesystems. It's also been clarified by David through benchmark tests
that synchronize_seqscans is not affected as long as the parallel scan's
chunk size is capped at 8192 blocks.

I also agree that having a control on this through GUC can be
beneficial for users, however, that can be discussed in another
thread or development in the future.

David Rowley wrote:
>I'd like to propose that if anyone wants to do further testing on
>other operating systems with SSDs or HDDs then it would be good if
>that could be done within a 1 week from this email. There are various
>benchmarking ideas on this thread for inspiration.

I'd like to join in testing it, in this case using an HDD; the results are at
the bottom. Due to my machine's limitations, I only tested 0~6 workers: even
if I increase max_parallel_workers_per_gather beyond that, the query planner
still caps the workers at 6.
I also set track_io_timing to on, as per David's recommendation.

Tested on:
XFS filesystem, HDD virtual machine
RHEL4, 64-bit,
4 CPUs, Intel Core Processor (Haswell, IBRS)
PostgreSQL 14devel on x86_64-pc-linux-gnu


Test Case (Soumyadeep's) [1]

shared_buffers = 32MB (to use OS page cache)

create table t_heap as select generate_series(1, 1) i;   --about 3.4GB 
size

SET track_io_timing = on;

\timing

set max_parallel_workers_per_gather = 0;  --0 to 6

SELECT count(*) from t_heap;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;

[Summary]
I used the same query from the thread. However, the SQL query execution time
and query planner execution time do not vary much between master and patched.
OTOH, in terms of I/O stats, I observed a similar regression in both master
and patched as max_parallel_workers_per_gather increases.

It's also possible that each benchmark run for a given
max_parallel_workers_per_gather setting is affected by the previous run;
IOW, later benchmark runs benefit from the data cached at the OS level by
earlier runs.
Any advice? Please refer to the tables below for results.

(MASTER/UNPATCHED)
| Parallel Workers | SQLExecTime  | PlannerExecTime | Buffers                     |
|------------------|--------------|-----------------|-----------------------------|
| 0                | 12942.606 ms | 37031.786 ms    | shared hit=32 read=442446   |
| 1                |  4959.567 ms | 17601.813 ms    | shared hit=128 read=442350  |
| 2                |  3273.610 ms | 11766.441 ms    | shared hit=288 read=442190  |
| 3                |  2449.342 ms |  9057.236 ms    | shared hit=512 read=441966  |
| 4                |  2482.404 ms |  8853.702 ms    | shared hit=800 read=441678  |
| 5                |  2430.944 ms |  8777.630 ms    | shared hit=1152 read=441326 |
| 6                |  2493.416 ms |  8798.200 ms    | shared hit=1568 read=440910 |

(PATCHED V2)
| Parallel Workers | SQLExecTime | PlannerExecTime | Buffers                     |
|------------------|-------------|-----------------|-----------------------------|
| 0                | 9283.193 ms | 34471.050 ms    | shared hit=2624 read=439854 |
| 1                | 4872.728 ms | 17449.725 ms    | shared hit=2528 read=439950 |
| 2                | 3240.301 ms | 11556.243 ms    | shared hit=2368 read=440110 |
| 3                | 2419.512 ms |  8709.572 ms    | shared hit=2144 read=440334 |
| 4                | 2746.820 ms |  8768.812 ms    | shared hit=1856 read=440622 |
| 5                | 2424.687 ms |  8699.762 ms    | shared hit=1504 read=440974 |
| 6                | 2581.999 ms |  8627.627 ms    | shared hit=1440 read=441038 |

(I/O Read Stat)
| Parallel Workers | I/O (Master)  | I/O (Patched) |
|------------------|---------------|---------------|
| 0                | read=1850.233 | read=1071.209 |
| 1                | read=1246.939

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-02-04 Thread k.jami...@fujitsu.com
Hi,

I have rebased the patch to keep the CFbot happy.
Apparently, in the previous patch there was a possibility of infinite loop
when allocating buffers, so I fixed that part and also removed some whitespaces.

Kindly check the attached V6 patch.
Any thoughts on this?

Regards,
Kirk Jamison


v6-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v6-Optimize-dropping-of-relation-buffers-using-dlist.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-14 Thread k.jami...@fujitsu.com
On Tuesday, September 8, 2020 1:02 PM, Amit Kapila wrote:
Hello,
> On Mon, Sep 7, 2020 at 1:33 PM k.jami...@fujitsu.com
>  wrote:
> >
> > On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> > > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane  wrote:
> > > >
> > > > Amit Kapila  writes:
> > > > > Even if the relation is locked, background processes like
> > > > > checkpointer can still touch the relation which might cause
> > > > > problems. Consider a case where we extend the relation but
> > > > > didn't flush the newly added pages. Now during truncate
> > > > > operation, checkpointer can still flush those pages which can
> > > > > cause trouble for truncate. But, I think in the recovery path
> > > > > such cases won't cause a
> > > problem.
> > > >
> > > > I wouldn't count on that staying true ...
> > > >
> > > >
> > >
> https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> > > OBV-jC
> > > > nacbyvtpptk2a9yy...@mail.gmail.com
> > > >
> > >
> > > I don't think that proposal will matter after commit c5315f4f44
> > > because we are caching the size/blocks for recovery while doing
> > > extend (smgrextend). In the above scenario, we would have cached the
> > > blocks which will be used at later point of time.
> > >
> >
> > I'm guessing we can still pursue this idea of improving the recovery path
> first.
> >
> 
> I think so.

Alright, so I've updated the patch which passes the regression and TAP tests.
It compiles and builds as intended.

> > I'm working on an updated patch version, because the CFBot's telling
> > that postgres fails to build (one of the recovery TAP tests fails).
> > I'm still working on refactoring my patch, but have yet to find a proper
> solution at the moment.
> > So I'm going to continue my investigation.
> >
> > Attached is an updated WIP patch.
> > I'd appreciate if you could take a look at the patch as well.
> >
> 
> So, I see the below log as one of the problems:
> 2020-09-07 06:20:33.918 UTC [10914] LOG:  redo starts at 0/15FFEC0
> 2020-09-07 06:20:33.919 UTC [10914] FATAL:  unexpected data beyond EOF
> in block 1 of relation base/13743/24581
> 
> This indicates that we missed invalidating some buffer which should have
> been invalidated. If you are able to reproduce this locally then I suggest to 
> first
> write a simple patch without the check of the threshold, basically in recovery
> always try to use the new way to invalidate the buffer. That will reduce the
> scope of the code that can create a problem. Let us know if the problem still
> exists and share the logs. BTW, I think I see one problem in the code:
> 
> if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> + bufHdr->tag.forkNum == forkNum[j] && tag.blockNum >= 
> + bufHdr->firstDelBlock[j])
> 
> Here, I think you need to use 'i' not 'j' for forkNum and 
> firstDelBlock as those are arrays w.r.t forks. That might fix the 
> problem but I am not sure as I haven't tried to reproduce it.

Thanks for the advice. Right, that seems to be the cause of the error,
and fixing it (using the fork index) solved the case.
I also followed Tsunakawa-san's advice of using more meaningful iterator
names instead of "i" & "j", for readability.

I also added a new function, DropRelFileNodeBuffersOfFork, used when the
relation fork is bigger than the threshold, i.e. when
(nblocks > BUF_DROP_FULLSCAN_THRESHOLD). Perhaps there's a better name for
that function.
However, as expected from the previous discussions, this path is a bit slower
than the standard buffer invalidation process, because the whole shared buffer
pool is scanned nfork times.
Currently, I set the threshold to (NBuffers / 500)

Feedback on the patch/testing are very much welcome.

Best regards,
Kirk Jamison



v12-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v12-Speedup-dropping-of-relation-buffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread k.jami...@fujitsu.com
Hi, 

> BTW, I think I see one problem in the code:
> >
> > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > + bufHdr->tag.forkNum == forkNum[j] && tag.blockNum >=
> > + bufHdr->firstDelBlock[j])
> >
> > Here, I think you need to use 'i' not 'j' for forkNum and
> > firstDelBlock as those are arrays w.r.t forks. That might fix the
> > problem but I am not sure as I haven't tried to reproduce it.
> 
> Thanks for advice. Right, that seems to be the cause of error, and fixing that
> (using fork) solved the case.
> I also followed the advice of Tsunakawa-san of using more meaningful
> iterator Instead of using "i" & "j" for readability.
> 
> I also added a new function when relation fork is bigger than the threshold
> If (nblocks > BUF_DROP_FULLSCAN_THRESHOLD)
> (DropRelFileNodeBuffersOfFork) Perhaps there's a better name for that
> function.
> However, as expected in the previous discussions, this is a bit slower than 
> the
> standard buffer invalidation process, because the whole shared buffers are
> scanned nfork times.
> Currently, I set the threshold to (NBuffers / 500)

I made a mistake in v12: I replaced firstDelBlock[fork_num] with
firstDelBlock[block_num] in the per-block for-loop, because we want to process
the current block of that loop.

OTOH, I used firstDelBlock[fork_num] when the relation fork is bigger than the
threshold, or if the cached blocks of small relations were already invalidated.

The logic could be either correct or wrong, so I'd appreciate feedback and 
comments/advice.

Regards,
Kirk Jamison


v13-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v13-Speedup-dropping-of-relation-buffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-17 Thread k.jami...@fujitsu.com
On Wednesday, September 16, 2020 5:32 PM, Kyotaro Horiguchi wrote:
> At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila 
> wrote in
> > On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila
> > >  wrote in
> > > > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> > > >  wrote:
> > > By the way I'm not sure that actually happens, but if one smgrextend
> > > call exnteded the relation by two or more blocks, the cache is
> > > invalidated and succeeding smgrnblocks returns lseek()'s result.
> > >
> >
> > Can you think of any such case? I think in recovery we use
> > XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the
> page
> > which seems to be extending page-by-page but there could be some case
> > where that is not true. One idea is to run regressions and add an
> > Assert to see if we are extending more than a block during recovery.
> 
> I agree with you. Actually XLogReadBufferExtended is the only point to read a
> page while recovery and seems calling ReadBufferWithoutRelcache page by
> page up to the target page. The only case I found where the cache is
> invalidated is ALTER TABLE SET TABLESPACE while wal_level=minimal and
> not during recovery. smgrextend is called without smgrnblocks called at the
> time.
> 
> Considering that the behavior of lseek can be a problem only just after
> extending a file, an assertion in smgrextend seems to be enough. Although,
> I'm not confident on the diagnosis.
> 
> --- a/src/backend/storage/smgr/smgr.c
> +++ b/src/backend/storage/smgr/smgr.c
> @@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum,
> BlockNumber blocknum,
>   if (reln->smgr_cached_nblocks[forknum] == blocknum)
>   reln->smgr_cached_nblocks[forknum] = blocknum + 1;
>   else
> + {
> + /*
> +  * DropRelFileNodeBuffers relies on the behavior that
> nblocks cache
> +  * won't be invalidated by file extension while recoverying.
> +  */
> + Assert(!InRecovery);
>   reln->smgr_cached_nblocks[forknum] =
> InvalidBlockNumber;
> + }
>  }
> 
> > > Don't
> > > we need to guarantee the cache to be valid while recovery?
> > >
> >
> > One possibility could be that we somehow detect that the value we are
> > using is cached one and if so then only do this optimization.
> 
> I basically like this direction.  But I'm not sure the additional parameter 
> for
> smgrnblocks is acceptable.
> 
> But on the contrary, it might be a better design that DropRelFileNodeBuffers
> gives up the optimization when smgrnblocks(,,must_accurate = true) returns
> InvalidBlockNumber.
> 

Thank you for your thoughtful reviews and discussions Horiguchi-san, 
Tsunakawa-san and Amit-san.
Apologies for my carelessness. I've addressed the bugs in the previous version.
1. Getting the total number of blocks for all the specified forks
2. Hashtable probing conditions

I added the suggested assert in smgrextend for the XLogReadBufferExtended case,
and I thought that would be enough. I think modifying smgrnblocks to take an
additional parameter would complicate the source code, because a number of
functions call it. So I thought that putting BlockNumberIsValid(nblocks) in the
condition would suffice. Otherwise, we do a full scan of the buffer pool.

+   if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+       BlockNumberIsValid(nblocks))

+   else
+   {
        //full scan

Attached is the v14 of the patch. It compiles and passes the tests.
Hoping for your continuous reviews and feedback. Thank you very much.

Regards,
Kirk Jamison



v14-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v14-Speedup-dropping-of-relation-buffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-22 Thread k.jami...@fujitsu.com
On Wednesday, September 23, 2020 11:26 AM, Tsunakawa, Takayuki wrote:

> I looked at v14.
Thank you for checking it!
 
> (1)
> + /* Get the total number of blocks for the supplied relation's
> fork */
> + for (j = 0; j < nforks; j++)
> + {
> + BlockNumber block =
> smgrnblocks(smgr_reln, forkNum[j]);
> + nblocks += block;
> + }
> 
> Why do you sum all forks?

I revised the patch based from my understanding of Horiguchi-san's comment,
but I could be wrong.
Quoting:

" 
+   /* Get the number of blocks for the supplied relation's 
fork */
+   nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
+   Assert(BlockNumberIsValid(nblocks));
+
+   if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)

As mentioned upthread, the criteria whether we do full-scan or
lookup-drop is how large portion of NBUFFERS this relation-drop can be
going to invalidate.  So the nblocks above should be the sum of number
of blocks to be truncated (not just the total number of blocks) of all
designated forks.  Then once we decided to do lookup-drop method, we
do that for all forks."

> (2)
> + if ((nblocks / (uint32)NBuffers) <
> BUF_DROP_FULLSCAN_THRESHOLD &&
> + BlockNumberIsValid(nblocks))
> + {
> 
> The division by NBuffers is not necessary, because both sides of = are
> number of blocks.

Again I based it from my understanding of the comment above,
so nblocks is the sum of all blocks to be truncated for all forks.


> Why is BlockNumberIsValid(nblocks)) call needed?

I thought we needed to ensure that nblocks is not invalid, so I also added that check.

> (3)
>   if (reln->smgr_cached_nblocks[forknum] == blocknum)
>   reln->smgr_cached_nblocks[forknum] = blocknum + 1;
>   else
> + {
> + /*
> +  * DropRelFileNodeBuffers relies on the behavior that
> cached nblocks
> +  * won't be invalidated by file extension while recovering.
> +  */
> + Assert(!InRecovery);
>   reln->smgr_cached_nblocks[forknum] =
> InvalidBlockNumber;
> + }
> 
> I think this change is not directly related to this patch and can be a 
> separate
> patch, but I want to leave the decision up to a committer.
> 
This is noted. Once we clarified the above comments, I'll put it in a separate 
patch if it's necessary,

Thank you very much for the reviews.

Best regards,
Kirk Jamison






RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-23 Thread k.jami...@fujitsu.com
On Wednesday, September 23, 2020 2:37 PM, Tsunakawa, Takayuki wrote:
> > I revised the patch based from my understanding of Horiguchi-san's
> > comment, but I could be wrong.
> > Quoting:
> >
> > "
> > +   /* Get the number of blocks for the supplied
> relation's
> > fork */
> > +   nblocks = smgrnblocks(smgr_reln,
> > forkNum[fork_num]);
> > +   Assert(BlockNumberIsValid(nblocks));
> > +
> > +   if (nblocks <
> BUF_DROP_FULLSCAN_THRESHOLD)
> >
> > As mentioned upthread, the criteria whether we do full-scan or
> > lookup-drop is how large portion of NBUFFERS this relation-drop can be
> > going to invalidate.  So the nblocks above should be the sum of number
> > of blocks to be truncated (not just the total number of blocks) of all
> > designated forks.  Then once we decided to do lookup-drop method, we
> > do that for all forks."
> 
> One takeaway from Horiguchi-san's comment is to use the number of blocks
> to invalidate for comparison, instead of all blocks in the fork.  That is, use
> 
> nblocks = smgrnblocks(fork) - firstDelBlock[fork];
> 
> Does this make sense?

Hmm. OK, I think I overthought it and misunderstood what the comment meant.
I'll debug again using ereport just to check that the values and behavior are
correct. Your comment about the V14 patch made me realize that it reverted to
the previous, slower version where we scan NBuffers for each fork. Thank you
for explaining it.

> What do you think is the reason for summing up all forks?  I didn't
> understand why.  Typically, FSM and VM forks are very small.  If the main
> fork is larger than NBuffers / 500, then v14 scans the entire shared buffers 
> for
> the FSM and VM forks as well as the main fork, resulting in three scans in
> total.
> 
> Also, if you want to judge the criteria based on the total blocks of all 
> forks, the
> following if should be placed outside the for loop, right?  Because this if
> condition doesn't change inside the for loop.
> 
> + if ((nblocks / (uint32)NBuffers) <
> BUF_DROP_FULLSCAN_THRESHOLD &&
> + BlockNumberIsValid(nblocks))
> + {
> 
> 
> 
> > > (2)
> > > + if ((nblocks / (uint32)NBuffers) <
> > > BUF_DROP_FULLSCAN_THRESHOLD &&
> > > + BlockNumberIsValid(nblocks))
> > > + {
> > >
> > > The division by NBuffers is not necessary, because both sides of =
> > > are number of blocks.
> >
> > Again I based it from my understanding of the comment above, so
> > nblocks is the sum of all blocks to be truncated for all forks.
> 
> But the left expression of "<" is a percentage, while the right one is a block
> count.  Two different units are compared.
> 

Right. Makes sense. Fixed.

> > > Why is BlockNumberIsValid(nblocks)) call needed?
> >
> > I thought we need to ensure that nblocks is not invalid, so I also
> > added
> 
> When is it invalid?  smgrnblocks() seems to always return a valid block
> number.  Am I seeing a different source code (I saw HEAD)?

It's based on the discussion upthread about guaranteeing that the cache is
valid during recovery, and that we don't want to proceed with the optimization
in case nblocks is invalid. It may not be needed, so I have already removed it,
because the correct direction is ensuring that smgrnblocks returns the precise
value.
Considering the test case that Horiguchi-san suggested (attached as a separate
patch), maybe there's no need to include it in the loop condition.
For now, I haven't modified the design of smgrnblocks (or created a new
function); I just updated the patches based on the recent comments.

Thank you very much again for the reviews.

Best regards,
Kirk Jamison


v15-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v15-Speedup-dropping-of-relation-buffers-during-recovery.patch


v1-Prevent-smgrextend-from-invalidating-blocks-during-recovery.patch
Description:  v1-Prevent-smgrextend-from-invalidating-blocks-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-24 Thread k.jami...@fujitsu.com
On Thursday, September 24, 2020 1:27 PM, Tsunakawa-san wrote:

> (1)
> + for (cur_blk = firstDelBlock[j]; cur_blk <
> nblocks; cur_blk++)
> 
> The right side of "cur_blk <" should not be nblocks, because nblocks is not
> the number of the relation fork anymore.

Right. Fixed. It should be the total number of (n)blocks of relation.

> (2)
> + BlockNumber nblocks;
> + nblocks = smgrnblocks(smgr_reln, forkNum[j]) -
> firstDelBlock[j];
> 
> You should either:
> 
> * Combine the two lines into one: BlockNumber nblocks = ...;
> 
> or
> 
> * Put an empty line between the two lines to separate declarations and
> execution statements.

Right. I separated them in the updated patch. And to prevent confusion,
instead of nblocks, nTotalBlocks & nBlocksToInvalidate are used.

/* Get the total number of blocks for the supplied relation's fork */
nTotalBlocks = smgrnblocks(smgr_reln, forkNum[j]);

/* Get the total number of blocks to be invalidated for the specified fork */
nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];
 

> After correcting these, I think you can check the recovery performance.

I'll send performance measurement results in the next email. Thanks a lot for 
the reviews!

Regards,
Kirk Jamison


v16-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v16-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-25 Thread k.jami...@fujitsu.com
Hi. 

> I'll send performance measurement results in the next email. Thanks a lot for
> the reviews!

Below are the performance measurement results.
I was only able to use low-spec machine:
CPU 4v, Memory 8GB, RHEL, xfs filesystem.

[Failover/Recovery Test]
1. (Master) Create table (ex. 10,000 tables). Insert data to tables.
2. (M) DELETE FROM TABLE (ex. all rows of 10,000 tables)
3. (Standby) To test with failover, pause the WAL replay on standby server.
(SELECT pg_wal_replay_pause();)
4. (M) psql -c "\timing on" (measures total execution of SQL queries)
5. (M) VACUUM (whole db)
6. (M) After vacuum finishes, stop primary server: pg_ctl stop -w -mi
7. (S) Resume wal replay and promote standby.
Because it's difficult to measure recovery time directly, I used the attached
script (resume.sh), which prints a timestamp before and after promotion.
It basically does the following:
- "SELECT pg_wal_replay_resume();" is executed and the WAL application is 
resumed.
- "pg_ctl promote" to promote standby.
- The time difference of "select pg_is_in_recovery();" from "t" to "f" is 
measured.

[Results]
Recovery/Failover performance (in seconds). 3 trial runs.

| shared_buffers | master | patch  | %reg| 
||||-| 
| 128MB  | 32.406 | 33.785 | 4.08%   | 
| 1GB| 36.188 | 32.747 | -10.51% | 
| 2GB| 41.996 | 32.88  | -27.73% |

There's a small regression with the default shared_buffers (128MB), but the
recovery time with large NBuffers is now almost constant, so performance is
improved. IOW, we enter the optimization most of the time during recovery.

I also ran a benchmark similar to what Tomas did [1]: simple "pgbench -S"
tests (warmup and then 15 x 1-minute runs with 1, 8 and 16 clients), but I'm
not sure my machine is reliable enough to produce trustworthy results for
8 clients and more.

| #  | master  | patch   | %reg   | 
||-|-|| 
| 1 client   | 1676.937825 | 1707.018029 | -1.79% | 
| 8 clients  | 7706.835401 | 7529.089044 | 2.31%  | 
| 16 clients | 9823.65254  | 9991.184206 | -1.71% |


If there's additional/necessary performance measurement, kindly advise me too.
Thank you in advance.

[1] 
https://www.postgresql.org/message-id/flat/20200806213334.3bzadeirly3mdtzl%40development#473168a61e229de40eaf36326232f86c

Best regards,
Kirk Jamison


resume.sh
Description: resume.sh


run.sh
Description: run.sh


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-25 Thread k.jami...@fujitsu.com
On Friday, September 25, 2020 6:02 PM, Tsunakawa-san wrote:

> From: Jamison, Kirk/ジャミソン カーク 
> > [Results]
> > Recovery/Failover performance (in seconds). 3 trial runs.
> >
> > | shared_buffers | master | patch  | %reg|
> > ||||-|
> > | 128MB  | 32.406 | 33.785 | 4.08%   |
> > | 1GB| 36.188 | 32.747 | -10.51% |
> > | 2GB| 41.996 | 32.88  | -27.73% |
> 
> Thanks for sharing good results.  We want to know if we can get as
> significant results as you gained before with hundreds of GBs of shared
> buffers, don't we?

Yes, but I don't have a high-spec machine I can use at the moment.
I'll try to get one by next week. Or if someone would like to reproduce the
test on an available higher-spec machine, it would be much appreciated.
The test case is upthread [1].

> > I also did similar benchmark performance as what Tomas did [1], simple
> > "pgbench -S" tests (warmup and then 15 x 1-minute runs with 1, 8 and
> > 16 clients, but I'm not sure if my machine is reliable enough to
> > produce reliable results for 8 clients and more.
> 
> Let me confirm just in case.  Your patch should not affect pgbench
> performance, but you measured it.  Is there anything you're concerned
> about?
> 

Not really. In the previous emails, the concern was the BufferAlloc
overhead, but we don't have it in the latest patch. Just in case somebody
asks about benchmark performance, I also posted the results.

[1] 
https://www.postgresql.org/message-id/OSBPR01MB2341683DEDE0E7A8D045036FEF360%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-28 Thread k.jami...@fujitsu.com
On Monday, September 28, 2020 11:50 AM, Tsunakawa-san wrote:

> From: Amit Kapila 
> > I agree with the above two points.
> 
> Thank you.  I'm relieved to know I didn't misunderstand.
> 
> 
> > > * Then, add a new function, say, smgrnblocks_cached() that simply
> > > returns
> > the cached block count, and DropRelFileNodeBuffers() uses it instead
> > of smgrnblocks().
> > >
> >
> > I am not sure if it is worth adding a new function for this. Why not
> > simply add a boolean variable in smgrnblocks for this?
> 
> 
> One reason is that adding an argument requires modification of existing call
> sites (10 + a few).  Another is that, although this may be different for each
> person's taste, it's sometimes not easy to understand when a function call
> with true/false appears.  One such example is find_XXX(some_args,
> true/false), where the true/false represents missing_ok.  Another example is
> as follows.  I often wonder "what's the meaning of this false, and that true?"
> 
> if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
> elog(ERROR, "InstallXLogFileSegment should not have failed");
> 
> Fortunately, the new function is very short and doesn't duplicate much code.
> The function is a simple getter and the function name can convey the
> meaning straight (if the name is good.)
> 
> 
> > BTW, AFAICS, the latest patch
> > doesn't have code to address this point.
> 
> Kirk-san, can you address this?  I don't mind much if you add an argument
> or a new function.

I may be missing something, so I'd like to check whether my understanding is
correct, as I'm confused about what exactly we mean by the "cached value of nblocks".

As discussed upthread, smgrnblocks() does not always guarantee that it returns a
"cached" nblocks, even in recovery.
When we enter this path in the recovery path of DropRelFileNodeBuffers(),
according to Tsunakawa-san:
>> * During recovery, DropRelFileNodeBuffers() gets the cached size of the 
>> relation fork.  If it is cached, trust it and optimize the buffer 
>> invalidation.  If it's not cached, we can't trust the return value of 
>> smgrnblocks() because it's the lseek(END) return value, so we avoid the 
>> optimization.

+   nTotalBlocks = smgrnblocks(smgr_reln, forkNum[j]);

But this comment in the smgrnblocks source code:

 * For now, we only use cached values in recovery due to lack of a shared
 * invalidation mechanism for changes in file size.
 */
if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
    return reln->smgr_cached_nblocks[forknum];

So the nblocks returned in DropRelFileNodeBuffers() is still not guaranteed to
be a "cached value"?
And we want to add a new function (I think that's the less complicated way,
compared with modifying smgrnblocks):

/*
 *  smgrnblocksvalid() -- Calculate the number of blocks that are cached in
 *                        the supplied relation.
 *
 * It is equivalent to calling smgrnblocks, but for now it is only used in
 * recovery, when DropRelFileNodeBuffers() is called, to ensure that only a
 * cached value, which is always valid, is used.
 *
 * This returns InvalidBlockNumber when smgr_cached_nblocks is not available
 * or when isCached is false.
 */
BlockNumber
smgrnblocksvalid(SMgrRelation reln, ForkNumber forknum, bool isCached)
{
    BlockNumber result;

    /*
     * For now, we only use cached values in recovery due to lack of a shared
     * invalidation mechanism for changes in file size.
     */
    if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber
        && isCached)
        return reln->smgr_cached_nblocks[forknum];

    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

    reln->smgr_cached_nblocks[forknum] = result;

    if (!InRecovery && !isCached)
        return InvalidBlockNumber;

    return result;
}

Then in DropRelFileNodeBuffers:
+   nTotalBlocks = smgrnblocksvalid(smgr_reln, forkNum[j], true);

Is my understanding above correct?

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-28 Thread k.jami...@fujitsu.com
On Monday, September 28, 2020 5:08 PM, Tsunakawa-san wrote:

>   From: Jamison, Kirk/ジャミソン カーク 
> > Is my understanding above correct?
> 
> No.  I simply meant DropRelFileNodeBuffers() calls the following function,
> and avoids the optimization if it returns InvalidBlockNumber.
> 
> 
> BlockNumber
> smgrcachednblocks(SMgrRelation reln, ForkNumber forknum) {
>   return reln->smgr_cached_nblocks[forknum];
> }

Thank you for clarifying. 

So in the new function, it goes something like:
if (InRecovery)
{
if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
return reln->smgr_cached_nblocks[forknum];
else
return InvalidBlockNumber;
}

I've revised the patch and added the new function accordingly in the attached
file. I also did not remove the duplicate code from smgrnblocks, because
Amit-san mentioned that once caching for non-recovery cases is implemented,
we can use it for those cases as well.

Although I am not sure if the way it's written in DropRelFileNodeBuffers is
okay:

    nTotalBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
    nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];

    if (BlockNumberIsValid(nTotalBlocks) &&
        nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
    {
        /* enter optimization loop */
    }
    else
    {
        /* full scan for each fork */
    }

Regards,
Kirk Jamison


v17-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v17-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-28 Thread k.jami...@fujitsu.com
On Tuesday, September 29, 2020 10:35 AM, Horiguchi-san wrote:

> FWIW, I (and maybe Amit) am thinking that the property we need here is not it
> is cached or not but the accuracy of the returned file length, and that the
> "cached" property should be hidden behind the API.
> 
> Another reason for not adding this function is the cached value is not really
> reliable on non-recovery environment.
> 
> > So in the new function, it goes something like:
> > if (InRecovery)
> > {
> > if (reln->smgr_cached_nblocks[forknum] !=
> InvalidBlockNumber)
> > return reln->smgr_cached_nblocks[forknum];
> > else
> > return InvalidBlockNumber;
> > }
> 
> If we add the new function, it should return InvalidBlockNumber without
> consulting smgr_nblocks().

So here's how I revised it:
smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
{
    if (InRecovery)
    {
        if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
            return reln->smgr_cached_nblocks[forknum];
    }
    return InvalidBlockNumber;
}

> Hmm. The current loop in DropRelFileNodeBuffers looks like this:
> 
> if (InRecovery)
>  for (for each forks)
> if (the fork meets the criteria)
>
>   else
>
> 
> I think this is somewhat different from the current discussion. Whether we
> sum up the number of blocks for all forks or just use that of the main fork,
> we should take the full scan if we failed to know the accurate size for any
> one of the forks. (In other words, it is stupid that we run a full scan for
> more than one fork at a drop.)
> 
> Come to think of that, we can naturally sum up all forks' blocks, since anyway
> we need to call smgrnblocks for all forks to know whether the optimization is
> usable.

I understand. We really don't have to enter the optimization when we know the
file size is inaccurate. That also makes the patch simpler.

> So that block would be something like this:
> 
> for (forks of the rel)
> /* the function returns InvalidBlockNumber if !InRecovery */
> if (smgrnblocks returned InvalidBlockNumber)
>total_blocks = InvalidBlockNumber;
>break;
>   total_blocks += nblocks of this fork
> 
> /*  */
> if (total_blocks != InvalidBlockNumber && total_blocks < threshold)
>  for (forks of the rel)
> for (blocks of the fork)
>  
> else
>

I followed this logic in the attached patch.
Thank you very much for the thoughtful reviews.

Performance measurement for large shared buffers to follow.

Best regards,
Kirk Jamison



v18-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v18-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-30 Thread k.jami...@fujitsu.com
On Thursday, October 1, 2020 11:49 AM, Amit Kapila wrote:
> On Thu, Oct 1, 2020 at 8:11 AM tsunakawa.ta...@fujitsu.com
>  wrote:
> >
> > From: Jamison, Kirk/ジャミソン カーク 
> > > Recovery performance measurement results below.
> > > But it seems there are overhead even with large shared buffers.
> > >
> > > | s_b   | master | patched | %reg  |
> > > |-------|--------|---------|-------|
> > > | 128MB | 36.052 | 39.451  | 8.62% |
> > > | 1GB   | 21.731 | 21.73   | 0.00% |
> > > | 20GB  | 24.534 | 25.137  | 2.40% |
> > > | 100GB | 30.54  | 31.541  | 3.17% |
> >
> > Did you really check that the optimization path is entered and the
> > traditional path is never entered?
> >

Oops. Thanks Tsunakawa-san for catching that. 
Will fix in the next patch, replacing break with continue.

> I have one idea for performance testing. We can even test this for
> non-recovery paths by removing the recovery-related check like only use it
> when there are cached blocks. You can do this if testing via recovery path is
> difficult because at the end performance should be same for recovery and
> non-recovery paths.

For the non-recovery path, did you mean, by any chance, measuring the
cache hit rate for varying shared_buffers?

SELECT 
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit)  as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM 
  pg_statio_user_tables;


Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-01 Thread k.jami...@fujitsu.com
On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote:
 
> From: Kyotaro Horiguchi 
> > I thought that the advantage of this optimization is that we don't
> > need to visit all buffers?  If we need to run a full-scan for any
> > reason, there's no point in looking-up already-visited buffers again.
> > That's just wastefull cycles.  Am I missing somethig?
> >
> > I don't understand. If we chose the optimized dropping, the reason
> > is that the number of buffer lookups is fewer than a certain threshold. Why
> > do you think that the fork kind a buffer belongs to is relevant to the
> > criteria?
> 
> I rethought about this, and you certainly have a point, but...  OK, I think I
> understood.  I should have thought in a complicated way.  In other words,
> you're suggesting "Let's simply treat all forks as one relation to determine
> whether to optimize," right?  That is, the code simple becomes:
> 
> Sums up the number of buffers to invalidate in all forks; if (the cached sizes
> of all forks are valid && # of buffers to invalidate < THRESHOLD) {
>   do the optimized way;
>   return;
> }
> do the traditional way;
> 
> This will be simple, and I'm +1.

This is actually close to the v18 I posted trying Horiguchi-san's approach,
but that patch had a bug. So attached is an updated version (v20) trying
this approach again. I hope it's bug-free this time.

Regards,
Kirk Jamison
 


v20-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v20-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-04 Thread k.jami...@fujitsu.com
On Friday, October 2, 2020 11:45 AM, Horiguchi-san wrote:

> Thanks for the new version.

Thank you for your thoughtful reviews!
I've attached an updated patch addressing the comments below.

1.
> The following description is found in the comment for FlushRelationBuffers.
> 
> > *   XXX currently it sequentially searches the buffer pool, should
> be
> > *   changed to more clever ways of searching.  This routine is
> not
> > *   used in any performance-critical code paths, so it's not worth
> > *   adding additional overhead to normal paths to make it go
> faster;
> > *   but see also DropRelFileNodeBuffers.
> 
> This looks like to me "We won't do that kind of optimization for
> FlushRelationBuffers, but DropRelFileNodeBuffers would need it".  If so,
> don't we need to revise the comment together?

Yes, but instead of combining them, I just removed the part of the comment in
FlushRelationBuffers that refers to DropRelFileNodeBuffers. I think it meant
the same thing: using more clever ways of searching. But that comment is no
longer applicable to DropRelFileNodeBuffers because of the optimization.
- * adding additional overhead to normal paths to make it go faster;
- * but see also DropRelFileNodeBuffers.
+ * adding additional overhead to normal paths to make it go faster.

2. 
> - *   XXX currently it sequentially searches the buffer pool, should 
> be
> - *   changed to more clever ways of searching.  However, this routine
> - *   is used only in code paths that aren't very 
> performance-critical,
> - *   and we shouldn't slow down the hot paths to make it faster ...

I revised and removed most of this code comment in DropRelFileNodeBuffers,
because isn't that exactly the point of the optimization: to make this path
faster for the performance cases we've tackled in this thread?

3.
> This should no longer be a XXX comment. 
Alright. I've fixed it.

4.
> It seems to me somewhat too detailed for this function's level. How about
> something like the following? (except its syntax, or phrasing :p)
> 
> If we are expected to drop few enough buffers, we locate individual buffers
> using BufTableLookup. Otherwise we scan through all buffers. Since we
> mustn't leave a buffer behind, we take the latter way unless the sizes of
> all the involved forks are known to be accurate. See smgrcachednblocks()
> for details.
> 

Sure. I paraphrased it like below.

If the expected maximum number of buffers to be dropped is small
enough, individual buffers are located by BufTableLookup().  Otherwise,
the buffer pool is sequentially scanned.  Since buffers must not be
left behind, the latter way is executed unless the sizes of all the
involved forks are known to be accurate.  See smgrcachednblocks() for
more details.

5. 
> (I'm still mildly opposed to the function name, which seems to expose too
> much detail.)
I can't think of a better name; smgrcachednblocks seems straightforward,
though I understand it may be confused with the relation property
smgr_cached_nblocks. But isn't that what we're getting in the function?

6.
> +  * Get the total number of cached blocks and to-be-invalidated
> blocks
> +  * of the relation.  If a fork's nblocks is not valid, break the loop.
> 
> The number of file blocks is not usually equal to the number of existing
> buffers for the file. We might need to explain that limitation here.

I revised that comment as below:

Get the total number of cached blocks and to-be-invalidated blocks
of the relation.  The cached value returned by smgrcachednblocks
could be smaller than the actual number of existing buffers of the
file.  This is caused by buggy Linux kernels that might not have
accounted the recent write.  If a fork's nblocks is invalid, exit loop.

7. 
> + for (j = 0; j < nforks; j++)
> 
> Though I understand that j is considered to be in a connection with fork
> number, I'm a bit uncomfortable that j is used for the outmost loop..

I agree. We should use i for the outermost loop for consistency.

8.
> + for (curBlock = firstDelBlock[j]; curBlock <
> nTotalBlocks;
> +curBlock++)
> 
> Mmm. We should compare curBlock with the number of blocks of the fork,
> not the total of all forks.

Oops, yes. That should be nForkBlocks, so we have to call
smgrcachednblocks() again in the optimization loop for each fork.

9.
> + uint32      newHash;          /* hash value for newTag */
> + BufferTag   newTag;           /* identity of requested block */
> + LWLock     *newPartitionLock; /* buffer partition lock for it */
> 
> It seems to be copied from somewhere, but the buffer is not new at all.

Thanks for catching that. Yeah. Fixed.

10.
> + if (RelFileNodeEquals(bufHdr->tag.rnode, 
> rnode.node) &&
> +

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-05 Thread k.jami...@fujitsu.com
On Monday, October 5, 2020 3:30 PM, Amit Kapila wrote:

> + for (i = 0; i < nforks; i++)
> + {
> +     /* Get the total nblocks for a relation's fork */
> +     nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[i]);
> +
> +     if (nForkBlocks == InvalidBlockNumber)
> +     {
> +         nTotalBlocks = InvalidBlockNumber;
> +         break;
> +     }
> +     nTotalBlocks += nForkBlocks;
> +     nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
> + }
> +
> + /*
> +  * Do explicit hashtable probe if the total of nblocks of relation's forks
> +  * is not invalid and the nblocks to be invalidated is less than the
> +  * full-scan threshold of buffer pool.  Otherwise, full scan is executed.
> +  */
> + if (nTotalBlocks != InvalidBlockNumber &&
> +     nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> + {
> +     for (j = 0; j < nforks; j++)
> +     {
> +         BlockNumber curBlock;
> +
> +         nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
> +
> +         for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks; curBlock++)
> 
> What if one or more of the forks doesn't have cached value? I think the patch
> will skip such forks and will invalidate/unpin buffers for others. 

Not having a cached value is equivalent to InvalidBlockNumber, right?
Maybe I'm missing something, but in the first loop we are already
pre-checking whether any of the forks doesn't have a cached value.
If one is not cached, nTotalBlocks is set to InvalidBlockNumber, so we
won't enter the optimization loop and will just execute the full-scan
buffer invalidation process.

> You probably
> need a local array of nForkBlocks which will be formed first time and then
> used in the second loop. You also in some way need to handle the case where
> that local array doesn't have cached blocks.

Understood. that would be cleaner.
BlockNumber nForkBlocks[MAX_FORKNUM];

As for handling an empty local array, I think the first loop covers it,
and there's no need to re-check whether the array is empty in the second loop.
for (i = 0; i < nforks; i++)
{
nForkBlocks[i] = smgrcachednblocks(smgr_reln, forkNum[i]);

if (nForkBlocks[i] == InvalidBlockNumber)
{
nTotalBlocks = InvalidBlockNumber;
break;
}
nTotalBlocks += nForkBlocks[i];
nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
}

> 2. Also, the other thing is I have asked for some testing to avoid the small
> regression we have for a smaller number of shared buffers which I don't see
> the results nor any change in the code. I think it is better if you post the
> pending/open items each time you post a new version of the patch.

Ah, apologies for forgetting to include updates about that. Since I keep
updating the patch, I decided not to post results yet, as performance may
vary per patch update due to possible bugs.
But for the performance test without the recovery check, I just removed it
as below. Does that meet the intention?

BlockNumber smgrcachednblocks(SMgrRelation reln, ForkNumber forknum) {
-   if (InRecovery && reln->smgr_cached_nblocks[forknum] != 
InvalidBlockNumber)
+   if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
return reln->smgr_cached_nblocks[forknum];

Regards,
Kirk Jamison


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-07 Thread k.jami...@fujitsu.com
On Monday, October 5, 2020 8:50 PM, Amit Kapila wrote:

> On Mon, Oct 5, 2020 at 3:04 PM k.jami...@fujitsu.com
> > > 2. Also, the other thing is I have asked for some testing to avoid
> > > the small regression we have for a smaller number of shared buffers
> > > which I don't see the results nor any change in the code. I think it
> > > is better if you post the pending/open items each time you post a new
> version of the patch.
> >
> > Ah. Apologies for forgetting to include updates about that, but since
> > I keep on updating the patch I've decided not to post results yet as
> > performance may vary per patch-update due to possible bugs.
> > But for the performance case of not using recovery check, I just removed it
> from below.
> > Does it meet the intention?
> >
> > BlockNumber smgrcachednblocks(SMgrRelation reln, ForkNumber
> forknum) {
> > -   if (InRecovery && reln->smgr_cached_nblocks[forknum] !=
> InvalidBlockNumber)
> > +   if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> > return reln->smgr_cached_nblocks[forknum];
> >
> 
> Yes, we can do that for the purpose of testing.

With the latest patches attached, and with the recovery check removed from
smgrnblocks, I tested the performance of vacuum.
(3 trial runs; 3.5 GB database populated with 1,000 tables.)

Execution Time (seconds)
| s_b   | master | patched | %reg | 
|---||-|--| 
| 128MB | 15.265 | 15.260  | -0.03%   | 
| 1GB   | 14.808 | 15.009  | 1.34%| 
| 20GB  | 24.673 | 11.681  | -111.22% | 
| 100GB | 74.298 | 11.724  | -533.73% |

These are good results and we can see the improvements for large shared buffers,
For small s_b, the performance is almost the same.

I repeated the recovery performance test from the previous mail,
and ran three trials for each shared_buffer setting.
We can also clearly see the improvement here.

Recovery Time (seconds)
| s_b   | master | patched | %reg   | 
|---||-|| 
| 128MB | 3.043  | 3.010   | -1.10% | 
| 1GB   | 3.417  | 3.477   | 1.73%  | 
| 20GB  | 20.597 | 2.409   | -755%  | 
| 100GB | 66.862 | 2.409   | -2676% |

For default and small shared_buffers, the recovery performance is almost the 
same.
But for bigger shared_buffers, we can see the benefit and improvement.
For 20GB, from 20.597 s to 2.409 s. For 100GB s_b, from 66.862 s to 2.409 s.

I have updated the latest patches, with 0002 being the new one.
Instead of introducing a new API, I just added the bool parameter to smgrnblocks
and modified its callers.

Comments and feedback are highly appreciated.

Regards,
Kirk Jamison




0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


0002-v1-Add-bool-param-in-arg-of-smgrextend-for-cached-block.patch
Description:  0002-v1-Add-bool-param-in-arg-of-smgrextend-for-cached-block.patch


0003-v2-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  0003-v2-Optimize-DropRelFileNodeBuffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-08 Thread k.jami...@fujitsu.com
On Thursday, October 8, 2020 3:38 PM, Tsunakawa-san wrote:

> Hi Kirk san,
Thank you for looking into my patches!

> (1)
> + * This returns an InvalidBlockNumber when smgr_cached_nblocks is not
> + * available and when not in recovery path.
> 
> + /*
> +  * We cannot believe the result from smgr_nblocks is always accurate
> +  * because lseek of buggy Linux kernels doesn't account for a recent
> +  * write.
> +  */
> + if (!InRecovery && result == InvalidBlockNumber)
> + return InvalidBlockNumber;
> +
> 
> These are unnecessary, because mdnblocks() never returns
> InvalidBlockNumber and conseuently smgrnblocks() doesn't return
> InvalidBlockNumber.

Yes. Thanks for carefully looking into that. Removed.

> (2)
> +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *isCached)
> 
> I think it's better to make the argument name iscached so that camel case
> aligns with forknum, which is not forkNum.

This is tricky because the surrounding code follows an inconsistent coding
style too. So I just followed the existing style, as below, and retained
the change.

extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber 
blocknum, char *buffer, bool skipFsync);

> (3)
> +  * This is caused by buggy Linux kernels that might not have
> accounted
> +  * the recent write.  If a fork's nblocks is invalid, exit loop.
> 
> "accounted for" is the right English?
> I think The second sentence should be described in terms of its meaning, not
> the program logic.  For example, something like "Give up the optimization if
> the block count of any fork cannot be trusted."

Fixed.

> Likewise, express the following part in semantics.
> 
> +  * Do explicit hashtable lookup if the total of nblocks of relation's
> forks
> +  * is not invalid and the nblocks to be invalidated is less than the

I revised it like below:
"Look up the buffer in the hashtable if the block size is known to 
 be accurate and the total blocks to be invalidated is below the
 full scan threshold.  Otherwise, give up the optimization."

> (4)
> + if (nForkBlocks[i] == InvalidBlockNumber)
> + {
> + nTotalBlocks = InvalidBlockNumber;
> + break;
> + }
> 
> Use isCached in if condition because smgrnblocks() doesn't return
> InvalidBlockNumber.

Fixed. if (!isCached)

> (5)
> + nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
> 
> should be
> 
> + nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);

Fixed.

> (6)
> + bufHdr->tag.blockNum >= firstDelBlock[j])
> +     InvalidateBuffer(bufHdr);   /* releases spinlock */
> 
> The right side of >= should be cur_block.

Fixed.


Attached are the updated patches.
Thank you again for the reviews.

Regards,
Kirk Jamison


0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-08 Thread k.jami...@fujitsu.com
Hi, 
> Attached are the updated patches.

Sorry, there was an error in the 3rd patch, so attached is a rebased one.

Regards,
Kirk Jamison


0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  0001-v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  0002-v2-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  0003-v23-Optimize-DropRelFileNodeBuffers-during-recovery.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-12 Thread k.jami...@fujitsu.com
On Friday, October 9, 2020 11:12 AM, Horiguchi-san wrote:
> I have some comments on the latest patch.

Thank you for the feedback!
I've attached the latest patches.

>  visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)  {
>   BlockNumber newnblocks;
> + boolcached;
> 
> All the added variables added by 0002 is useless because all the caller sites
> are not interested in the value. smgrnblocks should accept NULL as isCached.
> (I'm agree with Tsunakawa-san that the camel-case name is not common
> there.)
> 
> + nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached);
> +
> + if (!isCached)
> 
> "is cached" is not the property that code is interested in. No other callers 
> to
> smgrnblocks are interested in that property. The need for caching is purely
> internal of smgrnblocks().
> On the other hand, we are going to utilize the property of "accuracy"
> that is a biproduct of reducing fseek calls, and, again, not interested in 
> how it
> is achieved.
> So I suggest that the name should be "accurate" or something that does not
> suggest the mechanism used under the hood.

I changed the bool parameter to "accurate" per your suggestion.
I also removed the additional "bool cached" variables from the modified
functions. NULL values are now accepted for the new boolean parameter.

> + if (nTotalBlocks != InvalidBlockNumber &&
> +     nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> 
> I don't think nTotalBlocks is useful. What we need here is only total blocks 
> for
> every forks (nForkBlocks[]) and the total number of buffers to be invalidated
> for all forks (nBlocksToInvalidate).

Alright. I also removed nTotalBlocks in v24-0003 patch.

for (i = 0; i < nforks; i++)
{
if (nForkBlocks[i] != InvalidBlockNumber &&
nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
Optimization loop
}
else
break;
}
if (i >= nforks)
return;
{ usual buffer invalidation process }


> > > > The right side of >= should be cur_block.
> > > Fixed.
> > >= should be =, shouldn't it?
> 
> It's just from a paranoia. What we are going to invalidate is blocks blockNum
> of which >= curBlock. Although actually there's no chance of any other
> processes having replaced the buffer with another page (with lower blockid)
> of the same relation after BufTableLookup(), that condition makes it sure not
> to leave blocks to be invalidated left alone.
> Sorry. What we are going to invalidate is blocks that are blockNum >=
> firstDelBlock[i]. So what I wanted to suggest was that the condition should be
> 
> + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> +     bufHdr->tag.forkNum == forkNum[j] &&
> +     bufHdr->tag.blockNum >= firstDelBlock[j])

I used bufHdr->tag.blockNum >= firstDelBlock[i] in the latest patch.

> > Please measure and let us see just the recovery performance again because
> the critical part of the patch was modified.  If the performance is good as 
> the
> previous one, and there's no review interaction with others in progress, I'll
> mark the patch as ready for committer in a few days.
> 
> The performance is expected to be kept since smgrnblocks() is called in a
> non-hot code path and actually it is called at most four times per a buffer
> drop in this patch. But it's better making it sure.

Hmm. When I repeated the performance measurement for the non-recovery case,
I got almost identical execution results for master and patched.

Execution Time (in seconds)
| s_b   | master | patched | %reg   | 
|---||-|| 
| 128MB | 15.265 | 14.769  | -3.36% | 
| 1GB   | 14.808 | 14.618  | -1.30% | 
| 20GB  | 24.673 | 24.425  | -1.02% | 
| 100GB | 74.298 | 74.813  | 0.69%  |

That is with the recovery-related check removed from the patch, just
executing the commands on a standalone server:
-   if (InRecovery && reln->smgr_cached_nblocks[forknum] != 
InvalidBlockNumber)
+   if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)

OTOH, I also measured recovery performance by setting up a hot standby and
executing failover. The results were good and almost identical to the
previously reported recovery performance.

Recovery Time (in seconds)
| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 2.977   | -2.22% |
| 1GB   | 3.417  | 3.410   | -0.21% |
| 20GB  | 20.597 | 2.409   | -755%  |
| 100GB | 66.862 | 2.409   | -2676% |

For 20GB s_b, recovery time went from 20.597 s (master) to 2.409 s (patched);
for 100GB s_b, from 66.862 s (master) to 2.409 s (patched).
The optimization mainly benefits large shared_buffers settings, without
compromising performance when shared_buffers is at the default or a lower value.

If you could take a look again, I'd appreciate any additional feedback or
comments.
Thank you for your time.

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread k.jami...@fujitsu.com
On Wednesday, October 21, 2020 4:37 PM, Tsunakawa-san wrote:
> RelationTruncate() invalidates the cached fork sizes as follows.  This causes
> smgrnblocks() return accurate=false, resulting in not running optimization.
> Try commenting out for non-recovery case.
> 
> /*
>  * Make sure smgr_targblock etc aren't pointing somewhere past new end
>  */
> rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
> for (int i = 0; i <= MAX_FORKNUM; ++i)
>     rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;

Hello, I have updated the set of patches, incorporating all your feedback from
the previous email. Thank you for also looking into it. Patch 0003 (the
DropRelFileNodeBuffers improvement) is indeed for the vacuum optimization, not
for truncate. I'll post a separate patch for the truncate optimization in the
coming days.

1. Vacuum Optimization
I have confirmed that the above suggestion (commenting out the lines in
RelationTruncate) solves the issue for the non-recovery case.
The attached 0004 patch is only for non-recovery testing and is not included
in the final set of patches to be committed for the vacuum optimization.

The table below shows the vacuum execution time for non-recovery case.
I've also subtracted the execution time when VACUUM (truncate off) is set.

[NON-RECOVERY CASE - VACUUM execution Time in seconds]

| s_b   | master | patched | %reg      |
|-------|--------|---------|-----------|
| 128MB | 0.22   | 0.181   | -21.55%   |
| 1GB   | 0.701  | 0.712   | 1.54%     |
| 20GB  | 15.027 | 1.920   | -682.66%  |
| 100GB | 65.456 | 1.795   | -3546.57% |

[RECOVERY CASE, VACUUM execution + failover]
I made a mistake in the write-up of my previous email [1]:
the DELETE FROM was executed before pausing the WAL replay on the standby.
In short, the procedure and results were correct, but I repeated the
performance measurement just in case. The results are still good and almost
the same as the previous measurement.

| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 3.009   | -1.13% |
| 1GB   | 3.417  | 3.410   | -0.21% |
| 20GB  | 20.597 | 2.410   | -755%  |
| 100GB | 65.734 | 2.409   | -2629% |

Based on the results above, with the patches applied, performance for the
recovery and non-recovery cases was relatively close. For default and small
shared_buffers settings (128MB, 1GB), performance is about the same as master,
but the benefit shows with large shared_buffers settings.

I've tested using the same test case I described in the previous email,
including the following additional settings:
vacuum_cost_delay = 0
vacuum_cost_limit = 1

That's it for the vacuum optimization. Feedback and comments would be highly 
appreciated.

2. Truncate Optimization
I'll post a separate patch in the future for the truncate optimization, which
modifies DropRelFileNodesAllBuffers and related functions along the truncate
path.

Thank you.

Regards,
Kirk Jamison

[1] 
https://www.postgresql.org/message-id/OSBPR01MB2341672E9A95E5EC6D2E79B5EF020%40OSBPR01MB2341.jpnprd01.prod.outlook.com


v26-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v26-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v26-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v26-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v26-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v26-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v26-0004-For-non-recovery-performance-test-case-purposes-.patch
Description:  v26-0004-For-non-recovery-performance-test-case-purposes-.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread k.jami...@fujitsu.com
On Thursday, October 22, 2020 10:34 AM, Tsunakawa-san wrote:
> > I have confirmed that the above comment (commenting out the lines in
> > RelationTruncate) solves the issue for non-recovery case.
> > The attached 0004 patch is just for non-recovery testing and is not
> > included in the final set of patches to be committed for vacuum
> optimization.
> 
> I'm relieved to hear that.
> 
> As for 0004:
> When testing TRUNCATE, remove the change to storage.c because it was
> intended to troubleshoot the VACUUM test.
I've removed it now.

> What's the change in bufmgr.c for?  Is it to be included in 0001 or 0002?

Right. But that should be in 0003. Fixed.

I also fixed the feedback from the previous email:
> (1)
> +   * as the total nblocks for a given fork. The cached value returned by
>
> nblocks -> blocks


> > The table below shows the vacuum execution time for non-recovery case.
> > I've also subtracted the execution time when VACUUM (truncate off) is set.
> >
> > [NON-RECOVERY CASE - VACUUM execution Time in seconds]
> (snip)
> > | 100GB | 65.456 | 1.795   | -3546.57% |
> 
> So, the full shared buffer scan for 10,000 relations took about 63 seconds
> (= 6.3 ms per relation).  It's nice to shorten this long time.
> 
> I'll review the patch soon.

Thank you very much for the reviews. Attached are the latest set of patches.

Regards,
Kirk Jamison


v27-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v27-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v27-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v27-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v27-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v27-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v27-0004-For-non-recovery-performance-test-case-purposes-.patch
Description:  v27-0004-For-non-recovery-performance-test-case-purposes-.patch