On 2020-04-30 2:18 a.m., Sumanta Mukherjee wrote:

Hi,

Would it be possible to include the absolute numbers from the perf
runs, so that it is easier to understand the amount of improvement
with and without the patch, under the different loads and worker
counts?

Here are the parameters used to record the perf data on both the
server and client sides. For example, after applying the v15 patch,
using 4 workers with load:

perf record -o postgres_patch_j4_load -e block:block_rq_insert -e cpu-clock -e cycles:k -e skb:consume_skb -aR -s -- /home/ec2-user/after/bin/postgres -D /mnt/test/data

perf record -o backup_patch_j4_load -e block:block_rq_insert -e cpu-clock -e cycles:k -e skb:consume_skb -aR -s -- /home/ec2-user/after/bin/pg_basebackup -h ${PG_SERVER} -p 5432 -D /mnt/backup/data -v

And this is how the report is generated.
perf report  -i postgres_patch_j4_load --stdio > postgres_patch_j4_load.txt

The original perf data files are still available. Could you please
clarify which parameters you would like added when regenerating the
reports, or whether any other parameters should be used to recreate
the perf.data files before generating the reports?
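
If it is the absolute sample counts you are after, my understanding
is that perf report can print them next to the percentages via its
-n/--show-nr-samples option, so the existing perf.data files can be
re-processed without re-running the tests, for example:

perf report -n -i postgres_patch_j4_load --stdio > postgres_patch_j4_load.txt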


I am also unsure why swapper accounts for such a huge percentage of
the absolute time in the base runs of just the postgres server and
the pg_basebackup client.

With Regards,
Sumanta Mukherjee.
EnterpriseDB: http://www.enterprisedb.com


On Thu, Apr 30, 2020 at 1:18 PM David Zhang <david.zh...@highgo.ca> wrote:

    Hi,

    Thanks a lot for sharing the test results. Here are our test
    results using perf on three AWS t2.xlarge instances with the
    configuration below.

    Machine configuration:
          Instance Type       : t2.xlarge
          Volume type         : io1
          Memory              : 16 GB
          vCPU #              : 4
          Architecture        : x86_64
          IOPS                : 6000
          Database Size (GB)  : 45 (server)

    case 1: postgres server: without patch and without load

    * Disk I/O:

    # Samples: 342K of event 'block:block_rq_insert'
    # Event count (approx.): 342834
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  .....................
    #
        97.65%  postgres         [kernel.kallsyms]  [k] __elv_add_request
         2.27%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request


    * CPU:

    # Samples: 6M of event 'cpu-clock'
    # Event count (approx.): 1559444750000
    #
    # Overhead  Command          Shared Object         Symbol
    # ........  ...............  ....................  ...........................
    #
        64.73%  swapper          [kernel.kallsyms]     [k] native_safe_halt
        10.89%  postgres         [vdso]                [.] __vdso_gettimeofday
         5.64%  postgres         [kernel.kallsyms]     [k] do_syscall_64
         5.43%  postgres         libpthread-2.26.so    [.] __libc_recv
         1.72%  postgres         [kernel.kallsyms]     [k] pvclock_clocksource_read

    * Network:

    # Samples: 2M of event 'skb:consume_skb'
    # Event count (approx.): 2739785
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  ...........................
    #
        91.58%  swapper          [kernel.kallsyms]  [k] consume_skb
         7.09%  postgres         [kernel.kallsyms]  [k] consume_skb
         0.61%  kswapd0          [kernel.kallsyms]  [k] consume_skb
         0.44%  ksoftirqd/3      [kernel.kallsyms]  [k] consume_skb


    case 1: pg_basebackup client: without patch and without load

    * Disk I/O:

    # Samples: 371K of event 'block:block_rq_insert'
    # Event count (approx.): 371362
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  .....................
    #
        96.78%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request
         2.82%  pg_basebackup    [kernel.kallsyms]  [k] __elv_add_request
         0.29%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
         0.09%  xfsaild/xvda1    [kernel.kallsyms]  [k] __elv_add_request


    * CPU:

    # Samples: 3M of event 'cpu-clock'
    # Event count (approx.): 903527000000
    #
    # Overhead  Command          Shared Object       Symbol
    # ........  ...............  ..................  ...........................
    #
        87.99%  swapper          [kernel.kallsyms]   [k] native_safe_halt
         3.14%  swapper          [kernel.kallsyms]   [k] __lock_text_start
         0.48%  swapper          [kernel.kallsyms]   [k] __softirqentry_text_start
         0.37%  pg_basebackup    [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
         0.35%  swapper          [kernel.kallsyms]   [k] do_csum

    * Network:

    # Samples: 12M of event 'skb:consume_skb'
    # Event count (approx.): 12260713
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  ...........................
    #
        95.12%  swapper          [kernel.kallsyms]  [k] consume_skb
         3.23%  pg_basebackup    [kernel.kallsyms]  [k] consume_skb
         0.83%  ksoftirqd/1      [kernel.kallsyms]  [k] consume_skb
         0.45%  kswapd0          [kernel.kallsyms]  [k] consume_skb


    case 2: postgres server: with patch and with load, 4 backup workers
    on client side

    * Disk I/O:

    # Samples: 3M of event 'block:block_rq_insert'
    # Event count (approx.): 3634542
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  .....................
    #
        98.88%  postgres         [kernel.kallsyms]  [k] __elv_add_request
         0.66%  perf             [kernel.kallsyms]  [k] __elv_add_request
         0.42%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
         0.01%  sshd             [kernel.kallsyms]  [k] __elv_add_request

    * CPU:

    # Samples: 9M of event 'cpu-clock'
    # Event count (approx.): 2299129250000
    #
    # Overhead  Command          Shared Object          Symbol
    # ........  ...............  .....................  ...........................
    #
        52.73%  swapper          [kernel.kallsyms]      [k] native_safe_halt
         8.31%  postgres         [vdso]                 [.] __vdso_gettimeofday
         4.46%  postgres         [kernel.kallsyms]      [k] do_syscall_64
         4.16%  postgres         libpthread-2.26.so     [.] __libc_recv
         1.58%  postgres         [kernel.kallsyms]      [k] __lock_text_start
         1.52%  postgres         [kernel.kallsyms]      [k] pvclock_clocksource_read
         0.81%  postgres         [kernel.kallsyms]      [k] copy_user_enhanced_fast_string


    * Network:

    # Samples: 6M of event 'skb:consume_skb'
    # Event count (approx.): 6048795
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  ...........................
    #
        85.81%  postgres         [kernel.kallsyms]  [k] consume_skb
        12.03%  swapper          [kernel.kallsyms]  [k] consume_skb
         0.97%  postgres         [kernel.kallsyms]  [k] __consume_stateless_skb
         0.85%  ksoftirqd/3      [kernel.kallsyms]  [k] consume_skb
         0.24%  perf             [kernel.kallsyms]  [k] consume_skb


    case 2: pg_basebackup 4 workers: with patch and with load

    * Disk I/O:

    # Samples: 372K of event 'block:block_rq_insert'
    # Event count (approx.): 372360
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  .....................
    #
        97.26%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request
         1.45%  pg_basebackup    [kernel.kallsyms]  [k] __elv_add_request
         0.95%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
         0.14%  xfsaild/xvda1    [kernel.kallsyms]  [k] __elv_add_request


    * CPU:

    # Samples: 4M of event 'cpu-clock'
    # Event count (approx.): 1234071000000
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  .................................
    #
        89.25%  swapper          [kernel.kallsyms]  [k] native_safe_halt
         0.93%  pg_basebackup    [kernel.kallsyms]  [k] __lock_text_start
         0.91%  swapper          [kernel.kallsyms]  [k] __lock_text_start
         0.69%  pg_basebackup    [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
         0.45%  swapper          [kernel.kallsyms]  [k] do_csum


    * Network:

    # Samples: 6M of event 'skb:consume_skb'
    # Event count (approx.): 6449013
    #
    # Overhead  Command          Shared Object      Symbol
    # ........  ...............  .................  ...........................
    #
        90.28%  pg_basebackup    [kernel.kallsyms]  [k] consume_skb
         9.09%  swapper          [kernel.kallsyms]  [k] consume_skb
         0.29%  ksoftirqd/1      [kernel.kallsyms]  [k] consume_skb
         0.21%  sshd             [kernel.kallsyms]  [k] consume_skb


    The detailed perf reports are attached for the different
    scenarios, i.e. without the patch (with and without load, for both
    server and client) and with the patch (with and without load, for
    1, 2, 4 and 8 workers, for both server and client). The file names
    should be self-explanatory.

    Let me know if more information is required.

    Best regards,

    David

    On 2020-04-29 5:41 a.m., Suraj Kharage wrote:
    Hi,

    We at EnterpriseDB did some performance testing of this parallel
    backup feature to check how beneficial it is, and below are the
    results. In this testing, we ran the backup:
    1) Without Asif's patch
    2) With Asif's patch, using 1, 2, 4 and 8 workers.

    We ran these tests on two setups:

    1) Client and Server both on the same machine (Local backups)

    2) Client and server on a different machine (remote backups)


    *Machine details:*

    1: Server (on which the local backups were performed; also used as
    the server for remote backups)

    2: Client (used as the client for remote backups)


    *Server:*

    RAM: 500 GB
    CPU details:
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 128
    On-line CPU(s) list: 0-127
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 8
    NUMA node(s): 8
    Filesystem: ext4


    *Client:*
    RAM: 490 GB
    CPU details:
    Architecture: ppc64le
    Byte Order: Little Endian
    CPU(s): 192
    On-line CPU(s) list: 0-191
    Thread(s) per core: 8
    Core(s) per socket: 1
    Socket(s): 24
    Filesystem: ext4

    Below are the results for the local test:

    (Percentages are relative to a normal backup without the patch.)

    10 GB (10 tables, each around 1.05 GB):
      without patch : real 0m27.016s   user 0m3.378s   sys 0m23.059s
      1 worker      : real 0m30.314s   user 0m3.575s   sys 0m22.946s   (12% performance decrease)
      2 workers     : real 0m20.400s   user 0m3.622s   sys 0m29.670s   (27% performance increase)
      4 workers     : real 0m15.331s   user 0m3.706s   sys 0m39.189s   (43% performance increase)
      8 workers     : real 0m15.094s   user 0m3.915s   sys 1m23.350s   (44% performance increase)

    50 GB (50 tables, each around 1.05 GB):
      without patch : real 2m11.049s   user 0m16.464s  sys 2m1.757s
      1 worker      : real 2m26.621s   user 0m18.497s  sys 2m4.792s    (21% performance decrease)
      2 workers     : real 1m9.581s    user 0m18.298s  sys 2m12.030s   (46% performance increase)
      4 workers     : real 0m53.894s   user 0m18.588s  sys 2m47.390s   (58% performance increase)
      8 workers     : real 0m55.373s   user 0m18.423s  sys 5m57.470s   (57% performance increase)

    100 GB (100 tables, each around 1.05 GB):
      without patch : real 4m4.776s    user 0m33.699s  sys 3m27.777s
      1 worker      : real 4m20.862s   user 0m35.753s  sys 3m28.262s   (6% performance decrease)
      2 workers     : real 2m37.411s   user 0m36.440s  sys 4m16.424s   (35% performance increase)
      4 workers     : real 1m49.503s   user 0m37.200s  sys 5m58.077s   (55% performance increase)
      8 workers     : real 1m36.762s   user 0m36.987s  sys 9m36.906s   (60% performance increase)

    200 GB (200 tables, each around 1.05 GB):
      without patch : real 10m34.998s  user 1m8.471s   sys 7m21.520s
      1 worker      : real 11m30.899s  user 1m12.933s  sys 8m14.496s   (8% performance decrease)
      2 workers     : real 6m8.481s    user 1m13.771s  sys 9m31.216s   (41% performance increase)
      4 workers     : real 4m2.403s    user 1m18.331s  sys 12m29.661s  (61% performance increase)
      8 workers     : real 4m3.768s    user 1m24.547s  sys 15m21.421s  (61% performance increase)
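
    The real/user/sys figures above presumably come from wrapping the
    backup command in the shell's time built-in; assuming the patched
    pg_basebackup exposes the worker count through a -j/--jobs option,
    a 4-worker run would look roughly like:

        # -j is assumed here to set the number of parallel backup workers
        time pg_basebackup -h $PG_SERVER -p 5432 -D /backup/dir -j 4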


    Results for the remote test:

    (Percentages are relative to a normal backup without the patch.)

    10 GB (10 tables, each around 1.05 GB):
      without patch : real 1m36.829s   user 0m2.124s   sys 0m14.004s
      1 worker      : real 1m37.598s   user 0m3.272s   sys 0m11.110s   (0.8% performance decrease)
      2 workers     : real 1m36.753s   user 0m2.627s   sys 0m15.312s   (0.08% performance increase)
      4 workers     : real 1m37.212s   user 0m3.835s   sys 0m13.221s   (0.3% performance decrease)
      8 workers     : real 1m36.977s   user 0m4.475s   sys 0m17.937s   (0.1% performance decrease)

    50 GB (50 tables, each around 1.05 GB):
      without patch : real 7m54.211s   user 0m10.826s  sys 1m10.435s
      1 worker      : real 7m55.603s   user 0m16.535s  sys 1m8.147s    (0.2% performance decrease)
      2 workers     : real 7m53.499s   user 0m18.131s  sys 1m8.822s    (0.1% performance increase)
      4 workers     : real 7m54.687s   user 0m15.818s  sys 1m30.991s   (0.1% performance decrease)
      8 workers     : real 7m54.658s   user 0m20.783s  sys 1m34.460s   (0.1% performance decrease)

    100 GB (100 tables, each around 1.05 GB):
      without patch : real 15m45.776s  user 0m21.802s  sys 2m59.006s
      1 worker      : real 15m46.315s  user 0m32.499s  sys 2m47.245s   (0.05% performance decrease)
      2 workers     : real 15m46.065s  user 0m28.877s  sys 2m21.181s   (0.03% performance decrease)
      4 workers     : real 15m47.793s  user 0m30.932s  sys 2m36.708s   (0.2% performance decrease)
      8 workers     : real 15m47.129s  user 0m35.151s  sys 3m23.572s   (0.14% performance decrease)

    200 GB (200 tables, each around 1.05 GB):
      without patch : real 32m55.720s  user 0m50.602s  sys 5m38.875s
      1 worker      : real 31m30.602s  user 0m45.377s  sys 4m57.405s   (4% performance increase)
      2 workers     : real 31m30.214s  user 0m55.023s  sys 5m8.689s    (4% performance increase)
      4 workers     : real 31m31.187s  user 1m13.390s  sys 5m40.861s   (4% performance increase)
      8 workers     : real 31m31.729s  user 1m4.955s   sys 6m35.774s   (4% performance increase)



    With the client and server on the same machine, the results show
    around a 50% improvement in the parallel runs with 4 and 8
    workers. We don't see a huge further performance improvement as
    more workers are added.


    Whereas, when the client and server are on different machines, we
    don't see any major performance benefit. This matches the test
    results posted by David Zhang upthread.



    We ran the test for the 100GB backup with 4 parallel workers to
    observe CPU usage and other information. What we noticed is that
    the server consumes almost 100% of the CPU the whole time, and
    pg_stat_activity shows that the server is busy with ClientWrite
    most of the time.


    Attaching captured output for:

    1) top command output on the server, every 5 seconds

    2) pg_stat_activity output, every 5 seconds (see the sketch of a
    capture loop below)

    3) top command output on the client, every 5 seconds
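
    A minimal sketch of such a capture loop (the exact query and top
    options used may have differed):

        # file names are illustrative; snapshots taken every 5 seconds
        while true; do
            top -b -n 1 >> server_top.log
            psql -X -c "SELECT pid, state, wait_event_type, wait_event
                        FROM pg_stat_activity" >> pg_stat_activity.log
            sleep 5
        done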


    Do let me know if anyone has further questions/inputs for the
    benchmarking.


    Thanks to Rushabh Lathia for helping me with this testing.

    On Tue, Apr 28, 2020 at 8:46 AM Amit Kapila <amit.kapil...@gmail.com> wrote:

        On Mon, Apr 27, 2020 at 10:23 PM David Zhang <david.zh...@highgo.ca> wrote:
        >
        > Hi,
        >
        > Here are the parallel backup performance test results with
        > and without the patch "parallel_backup_v15" in an AWS cloud
        > environment. Two "t2.xlarge" machines were used: one for the
        > Postgres server and the other for pg_basebackup, with the
        > same machine configuration shown below.
        >
        > Machine configuration:
        >      Instance Type        : t2.xlarge
        >      Volume type          : io1
        >      Memory               : 16 GB
        >      vCPU #               : 4
        >      Architecture         : x86_64
        >      IOPS                 : 6000
        >      Database Size (GB)   : 108
        >
        > Performance test results:
        > without patch:
        >      real 18m49.346s
        >      user 1m24.178s
        >      sys 7m2.966s
        >
        > 1 worker with patch:
        >      real 18m43.201s
        >      user 1m55.787s
        >      sys 7m24.724s
        >
        > 2 worker with patch:
        >      real 18m47.373s
        >      user 2m22.970s
        >      sys 11m23.891s
        >
        > 4 worker with patch:
        >      real 18m46.878s
        >      user 2m26.791s
        >      sys 13m14.716s
        >
        > As requested, I didn't have pgbench running in parallel like
        > we did in the previous benchmark.
        >

        So, there doesn't seem to be any significant improvement in this
        scenario.  Now, it is not clear why there was a significant
        improvement in the previous run where pgbench was also running
        simultaneously.  I am not sure but maybe it is because when a
        lot of
        other backends were running (performing read-only workload) the
        backend that was responsible for doing backup was getting
        frequently
        scheduled out and it slowed down the overall backup process. 
        And when
        we start using multiple backends for backup one or other backup
        process is always running making the overall backup faster. 
        One idea
        to find this out is to check how much time backup takes when
        we run it
        with and without pgbench workload on HEAD (aka unpatched
        code).  Even
        if what I am saying is true or there is some other reason due
        to which
        we are seeing speedup in some cases (where there is a concurrent
        workload), it might not make the case for using multiple
        backends for
        backup but still, it is good to find that information as it
        might help
        in designing this feature better.
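
        A rough sketch of that comparison on unpatched HEAD (the
        pgbench options below are only illustrative):

            # backup with no concurrent load
            time pg_basebackup -h $PG_SERVER -D /backup/idle

            # the same backup while a read-only pgbench workload runs
            pgbench -h $PG_SERVER -S -c 16 -T 3600 postgres &
            time pg_basebackup -h $PG_SERVER -D /backup/loaded
            kill %1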

        > The perf report files for both the Postgres server and
        > pg_basebackup sides are attached.
        >

        It is not clear which functions are taking more time, or for
        which functions time is reduced, as function symbols are not
        present in the reports.  I think you can refer to
        https://wiki.postgresql.org/wiki/Profiling_with_perf to see
        how to take profiles, and additionally use
        -fno-omit-frame-pointer during configure (you can pass
        CFLAGS="-fno-omit-frame-pointer" to configure).
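
        Something along these lines (install prefix and record
        duration are only examples):

            ./configure CFLAGS="-fno-omit-frame-pointer -O2" --prefix=$HOME/pg
            make && make install
            # record system-wide with call graphs so stacks resolve
            perf record -g -a -- sleep 60
            perf report -g --stdio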


        --
        With Regards,
        Amit Kapila.
        EnterpriseDB: http://www.enterprisedb.com




    --
    Thanks & Regards,
    Suraj Kharage,
    EnterpriseDB Corporation,
    The Postgres Database Company.
    --
    David

    Software Engineer
    Highgo Software Inc. (Canada)
    www.highgo.ca

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
