Re: [HACKERS] WAL logging problem in 9.4.3?

Kyotaro Horiguchi Tue, 03 Dec 2019 03:52:14 -0800

Hello.

At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch <n...@leadboat.com> wrote in 
> On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > I measured the performance with the latest patch set.
> > 
> > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > >    minute when done via syncs.
> > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > 3. Wait 10s.
> > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> 
> If you have the raw data requested in (5), please share them here so folks
> have the option to reproduce your graphs and calculations.


Sorry, I forgot to attach the scripts. The raw data was vanished into
unstable connection and the steps was quite crude. I prioritized on
showing some numbers at the time. I revised the scripts into more
automated way and will take numbers again.

> > > 2. Start server with wal_level = replica (all other variables are not
> > > changed) then run the attached ./bench.sh
> > 
> > The bench.sh attachment was missing; please attach it.  Please give the 
> > output
> > of this command:
> > 
> >   select name, setting from pg_settings where setting <> boot_val;

(I intentionally show all the results..)
=# select name, setting from pg_settings where setting<> boot_val;
            name            |      setting       
----------------------------+--------------------
 application_name           | psql
 archive_command            | (disabled)
 client_encoding            | UTF8
 data_directory_mode        | 0700
 default_text_search_config | pg_catalog.english
 lc_collate                 | en_US.UTF-8
 lc_ctype                   | en_US.UTF-8
 lc_messages                | en_US.UTF-8
 lc_monetary                | en_US.UTF-8
 lc_numeric                 | en_US.UTF-8
 lc_time                    | en_US.UTF-8
 log_checkpoints            | on
 log_file_mode              | 0600
 log_timezone               | Asia/Tokyo
 max_stack_depth            | 2048
 max_wal_senders            | 0
 max_wal_size               | 10240
 server_encoding            | UTF8
 shared_buffers             | 16384
 TimeZone                   | Asia/Tokyo
 unix_socket_permissions    | 0777
 wal_buffers                | 512
 wal_level                  | minimal
(23 rows)

The result for "replica" setting in the benchmark script are used as
base numbers (or the denominator of the percentages).

> > 3. Restart server with wal_level = replica then run the bench.sh
> > twice.
> 
> I assume this is wal_level=minimal, not wal_level=replica.

Oops! It's wrong I ran once with replica, then twice with minimal.


Anyway, I revised the benchmarking scripts and attached them.  The
parameters written in benchmain.sh were decided as ./bench2.pl 5
<count> <pages> s with wal_level=minimal server takes around 60
seconds.

I'll send the complete data tomorrow (in JST). The attached f.txt is
the result of preliminary test only with pages=100 and 250 (with HDD).

The attached files are:
  benchmain.sh    - main script
  bench2.sh       - run a benchmark with a single set of parameters
  bench1.pl       - behchmark client program
  summarize.pl    - script to summarize benchmain.sh's output
  f.txt.gz        - result only for pages=100, DDL count = 2200 (not 2250)

How to run:

$ /..unpatched_path../initdb -D <unpatched_datadir>
 (wal_level=replica, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$ /..patched_path../initdb -D <patched_datadir>
 (wal_level=minimal, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$./benchmain.sh > <result_file>   # output raw data
$./summarize.pl [-v] < <result_file>   # show summary


With the attached f.txt, summarize.pl gives the following output.
WAL wins with the that pages.

$ cat f.txt | ./summarize.pl
## params: wal_level=replica mode=none pages=100 count=353 scale=20
(% are relative to "before")
before: tps  262.3 (100.0%), lat    39.840 ms (100.0%) (29 samples)
during: tps  120.7 ( 46.0%), lat   112.508 ms (282.4%) (35 samples)
 after: tps  106.3 ( 40.5%), lat   163.492 ms (410.4%) (86 samples)
DDL time:  34883 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=100 count=353 scale=20
(% are relative to "before")
before: tps  226.3 (100.0%), lat    48.091 ms (100.0%) (29 samples)
during: tps   83.0 ( 36.7%), lat   184.942 ms (384.6%) (100 samples)
 after: tps   82.6 ( 36.5%), lat   196.863 ms (409.4%) (21 samples)
DDL time:  99239 ms ( 284.5% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=100 count=353 scale=20
(% are relative to "before")
before: tps  240.3 (100.0%), lat    44.686 ms (100.0%) (29 samples)
during: tps  129.6 ( 53.9%), lat   113.585 ms (254.2%) (31 samples)
 after: tps  124.5 ( 51.8%), lat   141.992 ms (317.8%) (90 samples)
DDL time:  30392 ms (  87.1% relative to mode=none)
## params: wal_level=replica mode=none pages=250 count=258 scale=20
(% are relative to "before")
before: tps  266.3 (100.0%), lat    45.884 ms (100.0%) (29 samples)
during: tps   87.9 ( 33.0%), lat   148.433 ms (323.5%) (54 samples)
 after: tps  105.6 ( 39.6%), lat   153.216 ms (333.9%) (67 samples)
DDL time:  53176 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=250 count=258 scale=20
(% are relative to "before")
before: tps  225.1 (100.0%), lat    47.705 ms (100.0%) (29 samples)
during: tps   93.7 ( 41.6%), lat   143.231 ms (300.2%) (83 samples)
 after: tps   93.8 ( 41.7%), lat   186.097 ms (390.1%) (38 samples)
DDL time:  82104 ms ( 154.4% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=250 count=258 scale=20
(% are relative to "before")
before: tps  230.2 (100.0%), lat    48.472 ms (100.0%) (29 samples)
during: tps   90.3 ( 39.2%), lat   183.365 ms (378.3%) (48 samples)
 after: tps  123.9 ( 53.8%), lat   131.129 ms (270.5%) (73 samples)
DDL time:  47660 ms (  89.6% relative to mode=none)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

#! /usr/bin/bash

PGBENCH_SCALE=20
export PATH=/home/horiguti/bin/pgsql_pendsync/bin:$PATH
function run_one_param()
{
        # bench2.sh <ddl_count> <insert_pages> <method (s:sync, w:WAL, 
n:not-set)> <pgbench scale>
        ./bench2.sh $1 $2 n ${PGBENCH_SCALE} 2>&1
        ./bench2.sh $1 $2 s ${PGBENCH_SCALE} 2>&1
        ./bench2.sh $1 $2 w ${PGBENCH_SCALE} 2>&1
}

killall -9 postgres
sleep 1

#run_one_param <ddl_count> <insert_pages>

# On slow HDD
export PGDATA=/home/horiguti/storage/hdd/data/data_pendsync
run_one_param  353   100
run_one_param  258   250
run_one_param  185   500
run_one_param  118  1000
run_one_param   58  2500
run_one_param   32  5000
run_one_param   18 10000

# On M.2 SSD
export PGDATA=/home/horiguti/storage/ssd/data/data_pendsync
run_one_param 2250   100
run_one_param 1162   250
run_one_param  564   500
run_one_param  297  1000
run_one_param  123  2500
run_one_param   63  5000
run_one_param   32 10000

#! /usr/bin/bash

if [ "$PGDATA" == "" ]; then
        echo "\$PGDATA should be set"
        exit;
fi;

rm -r $PGDATA/*

initdb
if [ "$3" == "n" ]; then
        wal_level=replica
else
        wal_level=minimal
fi      
cat <<EOF >> $PGDATA/postgresql.conf
wal_level=$wal_level
max_wal_senders=0
log_checkpoints=yes
max_wal_size = 10GB
EOF
        
binary=`which postgres`
scale=$4
echo "## params: count=$1 pages=$2 mode=$3 binary=$binary scale=$scale 
wal_level=$wal_level"
pg_ctl stop -m f
pg_ctl start 2>&1
pgbench -i -s ${scale}
psql -c 'checkpoint;'
((sleep 30; echo "START"; ./bench1.pl 5 $1 $2 $3; echo "END") & pgbench -rP1 
--progress-timestamp -T150 -c10 -j10) 2>&1
pg_ctl stop -m i

#! /usr/bin/perl

use strict;
use IPC::Open2;
use Time::HiRes qw (gettimeofday tv_interval);

my $tupperpage = 226;
my $large_size = 100000000;
my @time = ();

sub bench {
        my ($header, $nprocs, $ntups, $threshold, $ddlcount) = @_;
        my @result = ();
        my @rds = ();
        
        for (my $ip = 0 ; $ip < $nprocs ; $ip++)
        {
                pipe(my $rd, my $wr);
                $rds[$ip] = $rd;
                
                my $pid = fork();

                die "fork failed: $!\n" if ($pid < 0);
                if ($pid == 0)
                {
                        close($rd);
                        
                        my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres 
> /dev/null");
                        if ($threshold >= 0) {
                                print $psqlwr "SET wal_skip_threshold to 
$threshold;\n";
                        }
                        print $psqlwr "DROP TABLE IF EXISTS t$ip;";
                        print $psqlwr "CREATE TABLE t$ip (a int);\n";

                        my @st = gettimeofday();
                        for (my $i = 0 ; $i < $ddlcount ; $i++)
                        {
                                print $psqlwr "BEGIN;";
                                print $psqlwr "TRUNCATE t$ip;";
                                print $psqlwr "INSERT INTO t$ip (SELECT a FROM 
generate_series(1, $ntups) a);";
                                print $psqlwr "COMMIT;";
                        }
                        close($psqlwr);
                        waitpid($pid, 0);

                        print $wr $ip, " ", 1000 * tv_interval(\@st, 
[gettimeofday]), "\n";
                        exit;
                }
                close($wr);
        }

        my $rpid;
        while (($rpid = wait()) == 0) {}

        my $sum = 0;
        for (my $ip = 0 ; $ip < $nprocs ; $ip++)
        {
                my $ret = readline($rds[$ip]);
                die "format? $ret\n" if ($ret !~ /^([0-9]+) ([0-9.]+)$/);

                $sum += $2;
        }

        printf "$header: procs $nprocs: time %.0f\n", $sum / $nprocs;
}


sub log10 { return log($_[0]) / log(10); }

# benchmark for wal_level = replica, the third parameter of bench
# doesn't affect
sub bench1
{
        my $ddlcount = 5;
        $ddlcount = $ARGV[1] if ($#ARGV > 0);

        print "benchmark for wal_level = replica\n";
        for (my $s = 0 ; $s <= 4 ; $s += 0.25)
        {
                my $ss = int(10 ** $s);
                bench("size $ss", 1, $ss * $tupperpage, 0, $ddlcount);
        }
}

# benchmark for wal_level = minimal.
sub bench2
{
        my $ddlcount = 5;
        $ddlcount = $ARGV[1] if ($#ARGV > 0);

        print "benchmark for wal_level = minimal\n";
        for (my $s = 0 ; $s <= 4.5 ; $s += 0.25)
        {
                my $ss = int(10 ** $s);
                bench("size $ss: SYNC ", 1, $ss * $tupperpage,           0, 
$ddlcount);
                bench("size $ss: WAL  ", 1, $ss * $tupperpage, $large_size, 
$ddlcount);
        }
}

# find crossing point of WAL and SYNC by bisecting
sub bench3
{
        my $ddlcount = 5;
        $ddlcount = $ARGV[1] if ($#ARGV > 0);

        print "find crossing point of WAL and SYNC by bisecting\n";
        bench("SYNC: size 0", 1, 1, 8);
        bench("WAL : size 0", 1, 1, 16);
        my $s = 1;
        my $st = 10000;
        while (1)
        {
                my $ts = bench("SYNC: size $s",
                                           $tupperpage * $s,           0, 
$ddlcount);
                my $tw = bench("WAL : size $s",
                                           $tupperpage * $s, $large_size, 
$ddlcount);

                if ($st < 1.0){
                        print "DONE\n";
                        exit(0);
                }
                if ($ts > $tw)
                {
                        $s += $st; $st /= 2;
                }
                else
                {
                        $s -= $st; $st /= 2;
                }
        }
}

# benchmark with multiple processes
sub bench4
{
        my $ddlcount = 5;
        my $nprocs = 10;

        $nprocs = $ARGV[1] if ($#ARGV > 0);
        $ddlcount = $ARGV[2] if ($#ARGV > 1);
        
        print "benchmark for wal_level = minimal, $nprocs processes\n";
        print "bench 4: nprocs = $nprocs, DDL count = $ddlcount\n";
        
        for (my $s = 1.0 ; $s <= 3.5 ; $s += 0.25)
        {
                my $ss = int(10 ** $s);
                bench("pages $ss: SYNC ", $nprocs, $ss * $tupperpage,           
0, 5);
                bench("pages $ss: WAL  ", $nprocs, $ss * $tupperpage, 
$large_size, 5);
        }
}

sub bench5
{
        my $ddlcount = 5;
        my $pages = 100;
        my $mode = "s";
        my $threshold = 0;

        $ddlcount = $ARGV[1] if ($#ARGV > 0);
        $pages = $ARGV[2] if ($#ARGV > 1);
        $mode = $ARGV[3] if ($#ARGV > 2);
        if ($mode eq 's') {
                $threshold = 0;
        } elsif ($mode eq 's') {
                $threshold = $large_size;
        } elsif ($mode eq 'n') {
                $threshold = -1;
        } elsif ($mode eq 'w') {
                $threshold = $large_size;
        } else {
                die "mode must be s, w or n\n";
        }
        

        print "bench 5: mode = $mode, DDL count = $ddlcount, pages = $pages\n";
        bench("size $pages", 1, $pages * $tupperpage, $threshold, $ddlcount);
}

bench1() if ($ARGV[0] == 1);
bench2() if ($ARGV[0] == 2);
bench3() if ($ARGV[0] == 3);
bench4() if ($ARGV[0] == 4);
bench5() if ($ARGV[0] == 5);

#! /usr/bin/perl

use strict;
use POSIX qw(floor);

my $state = 0;
my $wal_level = '';
my $pages = 0;
my $binary = '';
my $scale = 0;
my $paramcount = 0;
my $mode = '';
my $sumtps = 0;
my $sumlat = 0;
my $count = 0;
my $trig = 0;
my $title = "(undef)";
my $ddltime = 0;
my @lines = ();
my $trailstr = '';
my $verbose = 1 if ($ARGV[0] eq '-v');
my %tps = ();
my %lat = ();
my %ddltime = ();
my %modestr=("n", "none", "s", "sync", "w", "WAL");

while (<STDIN>) {
        push(@lines, $_);
        chomp;
        next if (/^(END|START)$/);
        next if (/NOTICE:  /);

#       print "$state: $_\n";
        if ($state == 0) {
                if (/^## params: count=([0-9.]+) pages=([0-9.]+) mode=(.) 
binary=([^ ]+) scale=([0-9.]+) wal_level=([a-z]+)/) {
                        $paramcount = $1;
                        $pages = $2;
                        $mode = $3;
                        $binary = $4;
                        $scale = $5;
                        $wal_level = $6;
                        my $modestr = $modestr{$mode};

                        print "## params: wal_level=$wal_level mode=$modestr 
pages=$pages count=$paramcount scale=$scale\n";
                        print "(% are relative to \"before\")\n";
                        $state = 1;
                        next;
                } else {
                        next;
                }
        } elsif ($state == 1) {
                if (/^starting vacuum/) {
                        $state = 2;
                }
                next;
        } elsif ($state == 2) {
                if (/^bench.*/) {
                        $trig = 1;
                        $title = "before";
                        $state = 3;
                }
        } elsif ($state == 3) {
                if (/^size ([0-9]+): procs ([0-9]+): time ([0-9]+)$/) {
                        $ddltime{$mode} = $3;
                        $trig = 1;
                        $title = "during";
                        $trailstr = '';
                        $state = 4;
                }
        } elsif ($state == 4) {
                if (/^transaction type: /) {
                        $trig = 1;
                        $title = "after";
                        $trailstr = "# $_\n";
                        $state = 5;
                }
        } elsif ($state == 5) {
                if (!/^statement latencies /) {
                        $trailstr .= "# $_\n";
                        next;
                }
                printf "DDL time: %6.0f ms (%6.1f%% relative to mode=%s)\n",
                        $ddltime{$mode},
                        floor(1000.0 * $ddltime{$mode} / $ddltime{n} + 0.5) / 
10,
                        $modestr{n};
                $trailstr .= "# $_\n";
                $state = 6;
                next;
        } elsif ($state == 6) {
                if (/^ {8}/) {
                        $trailstr.= "# $_\n";
                        next;
                }
                if ($verbose) {
                        print $trailstr;
                }
                $state = 0;
                next;
        }

        if ($trig) {
                die "count 0?\n" if ($count == 0);
                $tps{$title} = $sumtps / $count;
                $lat{$title} = $sumlat / $count;
                printf "%6s: tps %6.1f (%5.1f%%), lat %9.3f ms (%5.1f%%) (%d 
samples)\n",
                        $title,
                        $tps{$title}, floor(1000.0 * $tps{$title} / 
$tps{before} + 0.5)/10,
                        $lat{$title}, floor(1000.0 * $lat{$title} / 
$lat{before} + 0.5)/10,
                        $count;
                $sumtps = $sumlat = $count = 0;
                $trig = 0;
                next;
        }

        if (!/^progress: ([0-9.]+) s, ([0-9.]+) tps, lat ([0-9.]+) ms stddev 
([0-9.]+|NaN)$/) {
                last;
        }
        $sumtps += $2;
        $sumlat += $3;
        $count++;
}

if ($state != 0) {
        print "Wrong state after EOF: state = $state\n";
        print "=====================================\n";
        foreach (-10 .. -1) {
                printf "%d: %s", ($. + $_), $lines[$. + $_];
        }
        print "=====================================\n";
        exit(1);
}

die "uncounted lines?\n" if ($count > 0);

f.txt.gz
Description: Binary data

Re: [HACKERS] WAL logging problem in 9.4.3?

Reply via email to