Hi Gang,

Thanks. I have tried to reproduce the performance degradation using your configuration, query, and steps. Today I got some results in which Original (PMEM) achieved better performance than Non-volatile WAL buffer on my Ubuntu environment. I am now investigating further.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
NTT Software Innovation Center

> -----Original Message-----
> From: Deng, Gang <gang.d...@intel.com>
> Sent: Friday, October 9, 2020 3:10 PM
> To: Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' <takashi.me...@gmail.com>
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Hi Takashi,
>
> There are some differences between our HW/SW configuration and test steps. I attached the postgresql.conf I used for your reference. I would like to try the postgresql.conf and steps you provided in the coming days to see if I can find the cause.
>
> I also ran pgbench and the postgres server on the same machine but on different NUMA nodes, and ensured that the server process and the PMEM were on the same NUMA node. I used steps similar to yours from step 1 to 9, with some differences in the later steps; the major ones are:
>
> In step 10), I created a database and table for the test by:
> #create database:
> psql -c "create database insert_bench;"
> #create table:
> psql -d insert_bench -c "create table test(crt_time timestamp, info text default '75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d79a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"
>
> In step 15), I did not use pg_prewarm, but just ran pgbench for 180 seconds to warm up.
> In step 16), I ran pgbench using the command: pgbench -M prepared -n -r -P 10 -f ./test.sql -T 600 -c _ -j _ insert_bench (test.sql can be found in the attachment).
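>
> (test.sql itself is only in the attachment. As a minimal sketch, assuming each transaction inserts one row and relies on the wide default for the info column, it might be recreated like this:)
>
> # hypothetical reconstruction of test.sql; the real attachment may differ
> cat > test.sql <<'EOF'
> insert into test(crt_time) values (now());
> EOF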
>
> As for the HW/SW configuration, the major differences are:
> CPU: I used Xeon 8268 (24c @ 2.9GHz, HT enabled)
> OS Distro: CentOS 8.2.2004
> Kernel: 4.18.0-193.6.3.el8_2.x86_64
> GCC: 8.3.1
>
> Best regards
> Gang
>
> -----Original Message-----
> From: Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> Sent: Tuesday, October 6, 2020 4:49 PM
> To: Deng, Gang <gang.d...@intel.com>
> Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' <takashi.me...@gmail.com>
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Hi Gang,
>
> I have tried to reproduce the performance degradation you reported when inserting 328-byte records, but have not yet been able to. So I think our conditions differ in some way, such as the reproduction steps, postgresql.conf, installation setup, and so on.
>
> My results and conditions are as follows. May I have your conditions in more detail? Note that I refer to your "Storage over App Direct" as "Original (PMEM)" and to your "NVWAL patch" as "Non-volatile WAL buffer."
>
> Best regards,
> Takashi
>
>
> # Results
> See the attached figure. In short, Non-volatile WAL buffer got better performance than Original (PMEM).
>
> # Steps
> Note that I ran the postgres server and pgbench on a single machine but on two separate NUMA nodes. The PMEM and the PCIe SSD for the server process are on the server-side NUMA node.
>
> 01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0)
> 02) Make an ext4 filesystem for PMEM, then mount it with the DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
> 03) Make another ext4 filesystem for the PCIe SSD, then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
> 04) Make a /mnt/pmem0/pg_wal directory for WAL
> 05) Make a /mnt/nvme0n1/pgdata directory for PGDATA
> 06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
>     - Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile WAL buffer
> 07) Edit postgresql.conf as in the attached one
>     - Please remove the nvwal_* lines in the case of Original (PMEM)
> 08) Start the postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
> 09) Create a database (createdb --locale=C --encoding=UTF8)
> 10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
> 11) Change the number of characters in the "filler" column of the "pgbench_history" table to 300 (ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
>     - This makes the row size of the table 328 bytes
> 12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
> 13) Remount the PMEM and the PCIe SSD
> 14) Start the postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
> 15) Run pg_prewarm for all four pgbench_* tables
> 16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __ -j __)
>     - It executes the default tpcb-like transactions
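>
> (Steps 12)-14) written out as commands might look like the sketch below; the device and mount paths are the ones from steps 01)-03), and remounting is presumably what clears cached state between runs:)
>
> pg_ctl -l pg.log -m smart stop
> # remount both filesystems
> sudo umount /mnt/pmem0 && sudo mount -o dax /dev/pmem0 /mnt/pmem0
> sudo umount /mnt/nvme0n1 && sudo mount /dev/nvme0n1 /mnt/nvme0n1
> numactl -N 0 -m 0 -- pg_ctl -l pg.log start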
>
> I repeated all the steps three times for each (c,j), then took the median "tps = __ (including connections establishing)" of the three as the throughput and the "latency average = __ ms" of that run as the average latency.
>
> # Environment variables
> export PGHOST=/tmp
> export PGPORT=5432
> export PGDATABASE="$USER"
> export PGUSER="$USER"
> export PGDATA=/mnt/nvme0n1/pgdata
>
> # Setup
> - System: HPE ProLiant DL380 Gen10
> - CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS)
> - DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32GiB per channel x 6 channels per socket)
> - Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256GiB per channel x 6 channels per socket; interleaving enabled)
> - PCIe SSD: DC P4800X Series SSDPED1K750GA
> - Distro: Ubuntu 20.04.1
> - C compiler: gcc 9.3.0
> - libc: glibc 2.31
> - Linux kernel: 5.7 (vanilla)
> - Filesystem: ext4 (DAX enabled when using Optane PMem)
> - PMDK: 1.9
> - PostgreSQL (Original): 14devel (200f610: Jul 26, 2020)
> - PostgreSQL (Non-volatile WAL buffer): 14devel (200f610: Jul 26, 2020) + non-volatile WAL buffer patchset v4
>
> --
> Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> NTT Software Innovation Center
>
> > -----Original Message-----
> > From: Takashi Menjo <takashi.me...@gmail.com>
> > Sent: Thursday, September 24, 2020 2:38 AM
> > To: Deng, Gang <gang.d...@intel.com>
> > Cc: pgsql-hack...@postgresql.org; Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> > Subject: Re: [PoC] Non-volatile WAL buffer
> >
> > Hello Gang,
> >
> > Thank you for your report. I have not looked deeply into record size yet, so your report is very interesting. I will also run a test like yours and then post the results here.
> >
> > Regards,
> > Takashi
> >
> >
> > On Mon, Sep 21, 2020 at 14:14, Deng, Gang <gang.d...@intel.com> wrote:
> >
> > Hi Takashi,
> >
> > Thank you for the patch and for the work on accelerating PG performance with NVM. I applied the patch and ran some performance tests based on patch v4. I stored the database data files on an NVMe SSD and the WAL on Intel PMem (NVM). I used two methods to store the WAL file(s):
> >
> > 1. Leverage your patch to access PMem with libpmem (NVWAL patch).
> >
> > 2. Access PMem through the legacy filesystem interface, that is, use PMem as an ordinary block device; no PG patch is required to access PMem (Storage over App Direct).
> >
> > I tried two insert scenarios:
> >
> > A. Insert small records (24 bytes per record); I think this is similar to your test.
> >
> > B. Insert large records (328 bytes per record).
> >
> > My original purpose was to see a higher performance gain in scenario B, as it is more write-intensive on WAL. But I observed that the NVWAL patch method had a ~5% performance improvement over the Storage over App Direct method in scenario A, while it had a ~20% performance degradation in scenario B.
> >
> > I investigated the test further. I found that the NVWAL patch can improve the performance of the XLogFlush function, but it may hurt the performance of the CopyXLogRecordToWAL function. This may be related to the higher latency of memcpy to Intel PMem compared with DRAM. Here are the key data from my test:
> >
> > Scenario A (length of record to be inserted: 24 bytes per record):
> > ==================================================================
> >
> >                                       NVWAL     SoAD
> > ------------------------------------  -------   -------
> > Throughput (10^3 TPS)                 310.5     296.0
> > CPU Time % of CopyXLogRecordToWAL     0.4       0.2
> > CPU Time % of XLogInsertRecord        1.5       0.8
> > CPU Time % of XLogFlush               2.1       9.6
> >
> > Scenario B (length of record to be inserted: 328 bytes per record):
> > ==================================================================
> >
> >                                       NVWAL     SoAD
> > ------------------------------------  -------   -------
> > Throughput (10^3 TPS)                 13.0      16.9
> > CPU Time % of CopyXLogRecordToWAL     3.0       1.6
> > CPU Time % of XLogInsertRecord        23.0      16.4
> > CPU Time % of XLogFlush               2.3       5.9
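> >
> > (The profiler behind the CPU-time percentages above is not specified; as a sketch, assuming Linux perf is available, comparable numbers could be collected like this:)
> >
> > # sample all CPUs with call graphs for 60 s while pgbench is running
> > sudo perf record -a -g -- sleep 60
> > # then pick out the WAL-related functions from the report
> > sudo perf report --stdio | grep -E 'XLogFlush|XLogInsertRecord|CopyXLogRecordToWAL'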
> >
> > Best Regards,
> > Gang
> >
> > From: Takashi Menjo <takashi.me...@gmail.com>
> > Sent: Thursday, September 10, 2020 4:01 PM
> > To: Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> > Cc: pgsql-hack...@postgresql.org
> > Subject: Re: [PoC] Non-volatile WAL buffer
> >
> > Rebased.
> >
> >
> > On Wed, Jun 24, 2020 at 16:44, Takashi Menjo <takashi.menjou...@hco.ntt.co.jp> wrote:
> >
> > Dear hackers,
> >
> > I have updated my non-volatile WAL buffer patchset to v3. Now it can be used in streaming replication mode.
> >
> > Updates from v2:
> >
> > - walreceiver supports the non-volatile WAL buffer
> >   Now walreceiver stores received records directly into the non-volatile WAL buffer if applicable.
> >
> > - pg_basebackup supports the non-volatile WAL buffer
> >   Now pg_basebackup copies received WAL segments onto the non-volatile WAL buffer if you run it in "nvwal" mode (-Fn). You should specify a new NVWAL path with the --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.
> >
> > Best regards,
> > Takashi
> >
> > --
> > Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> > NTT Software Innovation Center
> >
> > > -----Original Message-----
> > > From: Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> > > Sent: Wednesday, March 18, 2020 5:59 PM
> > > To: 'PostgreSQL-development' <pgsql-hack...@postgresql.org>
> > > Cc: 'Robert Haas' <robertmh...@gmail.com>; 'Heikki Linnakangas' <hlinn...@iki.fi>; 'Amit Langote' <amitlangot...@gmail.com>
> > > Subject: RE: [PoC] Non-volatile WAL buffer
> > >
> > > Dear hackers,
> > >
> > > I rebased my non-volatile WAL buffer patchset onto master. A new v2 patchset is attached to this mail.
> > >
> > > I also measured performance before and after the patchset, varying the -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts. Conditions, steps, and other details are shown below.
> > >
> > >
> > > Results (s=50)
> > > ==============
> > >           Throughput [10^3 TPS]   Average latency [ms]
> > > ( c, j)   before  after           before  after
> > > -------   ---------------------   ---------------------
> > > ( 8, 8)   35.7    37.1 (+3.9%)    0.224   0.216 (-3.6%)
> > > (18,18)   70.9    74.7 (+5.3%)    0.254   0.241 (-5.1%)
> > > (36,18)   76.0    80.8 (+6.3%)    0.473   0.446 (-5.7%)
> > > (54,18)   75.5    81.8 (+8.3%)    0.715   0.660 (-7.7%)
> > >
> > >
> > > Results (s=1000)
> > > ================
> > >           Throughput [10^3 TPS]   Average latency [ms]
> > > ( c, j)   before  after           before  after
> > > -------   ---------------------   ---------------------
> > > ( 8, 8)   37.4    40.1 (+7.3%)    0.214   0.199 (-7.0%)
> > > (18,18)   79.3    86.7 (+9.3%)    0.227   0.208 (-8.4%)
> > > (36,18)   87.2    95.5 (+9.5%)    0.413   0.377 (-8.7%)
> > > (54,18)   86.8    94.8 (+9.3%)    0.622   0.569 (-8.5%)
> > >
> > >
> > > Both throughput and average latency improved for each scaling factor. Throughput seemed to almost reach its upper limit at (c,j)=(36,18).
> > >
> > > The percentages in the s=1000 case look larger than in the s=50 case. I think a larger scaling factor leads to less contention on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.
> > >
> > >
> > > Conditions
> > > ==========
> > > - Use one physical server having 2 NUMA nodes (node 0 and 1)
> > >   - Pin postgres (server processes) to node 0 and pgbench to node 1
> > >   - 18 cores and 192GiB DRAM per node
> > > - Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
> > >   - Both are installed on the server-side node, that is, node 0
> > >   - Both are formatted with ext4
> > >   - NVDIMM-N is mounted with the "-o dax" option to enable Direct Access (DAX)
> > > - Use the attached postgresql.conf
> > >   - Two new items, nvwal_path and nvwal_size, are used only after the patch
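> > >
> > > (As a sketch of how those two new items might appear in postgresql.conf: the GUC names come from the conditions above, while the example path and the MB unit, mirroring "initdb ... -Q 81920" elsewhere in the thread, are assumptions, not taken from the patch:)
> > >
> > > # hypothetical postgresql.conf lines for the non-volatile WAL buffer
> > > cat >> "$PGDATA/postgresql.conf" <<'EOF'
> > > nvwal_path = '/mnt/pmem0/pg_wal/nvwal'  # a DAX-mounted path, as in the steps above
> > > nvwal_size = 81920MB                    # unit assumed; cf. initdb -Q 81920
> > > EOF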
> > >
> > >
> > > Steps
> > > =====
> > > For each (c,j) pair, I did the following steps three times, then took the median of the three as the final result shown in the tables above.
> > >
> > > (1) Run initdb with proper -D and -X options; after the patch, also give the --nvwal-path and --nvwal-size options
> > > (2) Start postgres and create a database for the pgbench tables
> > > (3) Run "pgbench -i -s ___" to create the tables (s = 50 or 1000)
> > > (4) Stop postgres, remount the filesystems, and start postgres again
> > > (5) Execute the pg_prewarm extension for all four pgbench tables
> > > (6) Run pgbench for 30 minutes
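> > >
> > > (Step (5) written out might look like the following sketch; pg_prewarm is a contrib extension, and the four tables are the ones "pgbench -i" creates:)
> > >
> > > psql -c "create extension if not exists pg_prewarm;"
> > > for t in pgbench_accounts pgbench_branches pgbench_tellers pgbench_history; do
> > >     psql -c "select pg_prewarm('$t');"
> > > done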
> > >
> > >
> > > pgbench command line
> > > ====================
> > > $ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
> > >
> > > I gave no -b option, so pgbench used the built-in "TPC-B (sort-of)" query.
> > >
> > >
> > > Software
> > > ========
> > > - Distro: Ubuntu 18.04
> > > - Kernel: Linux 5.4 (vanilla kernel)
> > > - C Compiler: gcc 7.4.0
> > > - PMDK: 1.7
> > > - PostgreSQL: d677550 (master on Mar 3, 2020)
> > >
> > >
> > > Hardware
> > > ========
> > > - System: HPE ProLiant DL380 Gen10
> > > - CPU: Intel Xeon Gold 6154 (Skylake) x 2 sockets
> > > - DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2 sockets
> > > - NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2 sockets
> > > - NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
> > >
> > >
> > > Best regards,
> > > Takashi
> > >
> > > --
> > > Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> > > NTT Software Innovation Center
> > >
> > > > -----Original Message-----
> > > > From: Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> > > > Sent: Thursday, February 20, 2020 6:30 PM
> > > > To: 'Amit Langote' <amitlangot...@gmail.com>
> > > > Cc: 'Robert Haas' <robertmh...@gmail.com>; 'Heikki Linnakangas' <hlinn...@iki.fi>; 'PostgreSQL-development' <pgsql-hack...@postgresql.org>
> > > > Subject: RE: [PoC] Non-volatile WAL buffer
> > > >
> > > > Dear Amit,
> > > >
> > > > Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...
> > > >
> > > > I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.
> > > >
> > > > Best regards,
> > > > Takashi
> > > >
> > > > --
> > > > Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> > > > NTT Software Innovation Center
> > > >
> > > > > -----Original Message-----
> > > > > From: Amit Langote <amitlangot...@gmail.com>
> > > > > Sent: Monday, February 17, 2020 5:21 PM
> > > > > To: Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> > > > > Cc: Robert Haas <robertmh...@gmail.com>; Heikki Linnakangas <hlinn...@iki.fi>; PostgreSQL-development <pgsql-hack...@postgresql.org>
> > > > > Subject: Re: [PoC] Non-volatile WAL buffer
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou...@hco.ntt.co.jp> wrote:
> > > > > > Hello Amit,
> > > > > >
> > > > > > > I apologize for not having any opinion on the patches themselves, but let me point out that it's better to base these patches on HEAD (master branch) than REL_12_0, because all new code is committed to the master branch, whereas stable branches such as REL_12_0 only receive bug fixes. Do you have any specific reason to be working on REL_12_0?
> > > > > >
> > > > > > Yes, because I think it makes performance measurements easier to reproduce and discuss. Of course I know that all newly accepted patches are merged into master's HEAD, not into stable branches or release tags, so I am aware that I should rebase my patchset onto master sooner or later. However, if someone, including me, says that they applied my patchset to "master" and measured its performance, we have to pay attention to which commit "master" really points to. Although we have SHA-1 hashes to specify a commit, we would have to check whether that specific commit on master contains patches affecting performance, because master's HEAD gets new patches day by day. A release tag, on the other hand, clearly points to a commit that all of us probably know. We can also check the features and improvements more easily by using release notes and user manuals.
> > > > >
> > > > > Thanks for clarifying. I see where you're coming from.
> > > > >
> > > > > While I do sometimes see people reporting numbers with the latest stable release's branch, that's normally just one of the baselines. The more important baseline for ongoing development is the master branch's HEAD, which is also what people volunteering to test your patches would use. Anyone who reports would have to give at least two numbers -- performance with a branch's HEAD without the patch applied and that with the patch applied -- which can be enough in most cases to see the difference the patch makes. Sure, the numbers might change with each report, but that's fine, I'd think.
> > > > > If you continue to develop against a stable branch, you might fail to notice the impact of relevant developments in the master branch, even developments that could require rethinking the architecture of your own changes, although that may rarely occur.
> > > > >
> > > > > Thanks,
> > > > > Amit
> > > >
> > > > --
> > > > Takashi Menjo <takashi.me...@gmail.com>
> >
> > --
> > Takashi Menjo <takashi.me...@gmail.com>