Right now Tom is doing a bit of work to try to improve the
performance of regression test runs with CLOBBER_CACHE_ALWAYS.  I'm
on board with making this go faster too.

I did a CLOBBER_CACHE_ALWAYS run today and it took my machine almost 7
hours to complete.  I occasionally checked top -c and was a bit
disappointed that the majority of the time just a single backend was
busy.  The reason for this is that most groups have some test that
takes much longer to run than the others, and I often just caught the
run once it had finished all the faster tests in a group and was stuck
on the slow one.

I did a bit of analysis into the runtimes and found that:

1. Without parallelism, the total run-time of all tests was 12.29 hours.
2. The run took 6.45 hours.  (I took the maximum time from each group
and summed those maximums.)

That means the average number of backends utilized here was about 1.9
(12.29 / 6.45).
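
For what it's worth, that average can be computed directly from the
attached timing data (this uses the cca_result table defined in the
attached method file):

-- average busy backends: total test time divided by the sum of each
-- group's slowest test
select (select sum(milliseconds) from cca_result)::numeric /
       (select sum(ms) from (select max(milliseconds) as ms
                             from cca_result group by groupid) g);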

I wondered if there might be a better way to handle how parallel tests
work in pg_regress.  We have many parallel groups that have reached 20
tests, and we often just create another parallel group because of the
rule that no group may exceed 20 tests.  In many cases, we could get
busy running another test instead of sitting around idle.

Right now we start 1 backend for each test in a parallel group, then
wait for the final backend to complete before running the next group.

Is there a particular reason for it to work that way?

Why can we not just have a much larger parallel group that lumps
together all of the tests which have no special need to avoid running
concurrently (or concurrently with one particular other test), and
just run all of those with up to N workers?  Once a worker completes a
test, give it another test to process until there are none left.  We
could still limit the total concurrency with --max-connections=20.  I
don't think we'd need to make any code changes to make this idea work;
a sketch of what such a schedule file might look like is below.
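
For illustration, a hypothetical schedule file along those lines might
look like this (standard pg_regress schedule syntax; the test names
are only examples, and working out which tests really are
order-independent is the part that needs care):

# one big parallel group containing every test that has no ordering
# dependencies (the names here are illustrative, not a vetted list)
test: boolean char name varchar text int2 int4 int8 float4 float8 numeric
# tests that must run alone remain serial steps of their own
test: tablespace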

I did the maths on that, and if it worked that way, and assuming all
the parallel tests don't mind being run at the same time as any other
parallel test, then the theoretical run-time comes down to 3.75 hours
with 8 workers, or 4.11 hours with 4 workers.  The primary reason it
does not become much faster is the "privileges" test taking 3 hours.
If I calculate assuming 128 workers, the time only drops to 3.46
hours.  With that many workers there are enough to start the slow
privileges test on a worker that's not done anything else yet, so the
3.46 hours is just the time for the privileges test plus the time to
do the serial tests, one by one.
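
That floor can be checked directly against the attached timing data
(again using the cca_result table from the attached method file, where
'parallel' and 'test' are the two recorded test types):

-- lower bound: the slowest parallel test plus all of the serial tests
select (max(milliseconds) filter (where type = 'parallel') +
        sum(milliseconds) filter (where type = 'test'))/1000.0/3600.0
from cca_result;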

For the above, I didn't do anything to change the order of the tests
to start the long-running ones first, but if I do that, I can get the
time down to 3.46 hours with just 4 workers.  That's 1.86x faster than
my run.

I've attached a text file with the method I used to calculate each of
the numbers above and I've also attached the results with timings from
my CLOBBER_CACHE_ALWAYS run for anyone who'd like to check my maths.

If I split the "privileges" test into 2 even parts, then 8 workers
would run the tests in 1.95 hours, which is 3.2x faster than my run.
(A sketch of that calculation is at the end of the attached method
file.)

David
create table cca_result (groupid int not null, type text not null,
                         name text not null, result text not null,
                         milliseconds int not null);
-- the attached file uses tab as the delimiter
copy cca_result from '/path/to/cca_result.csv' delimiter E'\t';

sum the maximum time per group to find out how long the run took
overall (each group only finishes when its slowest test does).

postgres=# select sum(milliseconds)/1000.0/3600.0
postgres-# from (select max(milliseconds) milliseconds from cca_result group by groupid) c;
      ?column?
--------------------
 6.4572605555555556
(1 row)


How long would the tests have taken if we ran them one after the other 
(assuming concurrency does not slow things down)?

postgres=# select sum(milliseconds)/1000.0/3600.0 from cca_result;
      ?column?
---------------------
 12.2959541666666667
(1 row)


calculate the worker utilization during the run of each group.

select
        groupid,
        round(sum(milliseconds)::numeric / (max(milliseconds) * count(*)) * 100, 1) utilization_percent,
        count(*) as number_of_tests_in_group
from cca_result
group by groupid
order by 2;
  groupid | utilization_percent | number_of_tests_in_group
---------+---------------------+--------------------------
      15 |                 7.6 |                       20
       3 |                11.4 |                       20
      17 |                14.4 |                       14
      24 |                15.2 |                       18
      22 |                15.3 |                       17
       4 |                15.7 |                       12
       2 |                18.0 |                       20
      12 |                19.3 |                       16
      14 |                19.5 |                       20
      25 |                23.9 |                       11
      23 |                28.6 |                        6
       9 |                29.5 |                        5
      18 |                33.0 |                        6
      11 |                42.9 |                        5
      21 |                57.4 |                        2
      16 |                63.4 |                        2
      10 |                66.3 |                        3
      26 |                93.1 |                        2
       1 |               100.0 |                        1
       6 |               100.0 |                        1
      27 |               100.0 |                        1
       8 |               100.0 |                        1
      19 |               100.0 |                        1
      28 |               100.0 |                        1
      20 |               100.0 |                        1
      13 |               100.0 |                        1
       7 |               100.0 |                        1
       5 |               100.0 |                        1
(28 rows)

-- Create a function to consume each test result and accumulate the time
-- onto the parallel worker with the least accumulated run-time.  If any
-- workers have the same accumulated run-time, give it to the worker with
-- the smallest worker number.  Consume the tests starting with the test
-- with the lowest ctid. (yeah, I should have added a serial column...)

-- I can't think of a way to do this in pure SQL.
create or replace function compute_times(nworkers int)
returns table (worker_num int, milliseconds bigint) as $$
declare r record;
begin
        drop table if exists workers;
        create temp table workers (worker_num int, milliseconds int);
        insert into workers select x, 0 from generate_series(1, nworkers) x;

        for r in select * from cca_result where type = 'parallel' order by ctid
        loop
                -- apply the time to the worker with the smallest accumulated time
                -- tie break on worker number so that we allocate to earlier workers first
                update workers w set milliseconds = w.milliseconds + r.milliseconds
                where w.worker_num = (select w2.worker_num from workers w2
                                      order by w2.milliseconds, w2.worker_num limit 1);
        end loop;

        return query select 0, sum(cca.milliseconds) from cca_result cca
        where cca.type = 'test'
        union all
        select * from workers;
end;
$$ language plpgsql;


-- calculate how long the tests would take to run with 8 workers. We assume
-- all of the parallel tests ran in parallel, so just take the
-- max(milliseconds). The tests with worker_num = 0 are assumed to have run
-- in serial, so take the sum(milliseconds).

postgres=# select (max(milliseconds) filter (where worker_num >= 1) +
postgres(# sum(milliseconds) filter (where worker_num = 0))/1000.0/3600.0
postgres-# from compute_times(8);
      ?column?
--------------------
 3.7582911111111111
(1 row)
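
-- The 128-worker figure mentioned in the mail body is just the same
-- query run against compute_times(128) (result not repeated here):

postgres=# select (max(milliseconds) filter (where worker_num >= 1) +
postgres(# sum(milliseconds) filter (where worker_num = 0))/1000.0/3600.0
postgres-# from compute_times(128);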


-- Create a function to consume each test result and accumulate the time
-- onto the parallel worker with the least accumulated run-time.  If any
-- workers have the same accumulated run-time, give it to the worker with
-- the smallest worker number.  Consume the tests starting with the test
-- with the longest run-time.

create or replace function compute_times_slowest_first(nworkers int)
returns table (worker_num int, milliseconds bigint) as $$
declare r record;
begin
        drop table if exists workers;
        create temp table workers (worker_num int, milliseconds int);
        insert into workers select x, 0 from generate_series(1, nworkers) x;

        for r in select * from cca_result where type = 'parallel'
                 order by milliseconds desc
        loop
                -- apply the time to the worker with the smallest accumulated time
                -- tie break on worker number so that we allocate to earlier workers first
                update workers w set milliseconds = w.milliseconds + r.milliseconds
                where w.worker_num = (select w2.worker_num from workers w2
                                      order by w2.milliseconds, w2.worker_num limit 1);
        end loop;

        return query select 0, sum(cca.milliseconds) from cca_result cca
        where cca.type = 'test'
        union all
        select * from workers;
end;
$$ language plpgsql;

-- Ordering the parallel tests starting with the slowest first, check how
-- long it would take to churn through the tests with 4 parallel workers.

postgres=# select (max(milliseconds) filter (where worker_num >= 1) +
postgres(# sum(milliseconds) filter (where worker_num = 0))/1000.0/3600.0
postgres-# from compute_times_slowest_first(4);
      ?column?
--------------------
 3.4639533333333333
(1 row)
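
-- A sketch of the "split privileges into 2 even parts" scenario from the
-- mail body (1.95 hours with 8 workers): halve the privileges row, add a
-- second half-sized copy, and re-run the slowest-first calculation.  This
-- assumes the test is recorded under the name 'privileges' in cca_result;
-- wrap it in a transaction so the real timings are kept.

begin;
update cca_result set milliseconds = milliseconds / 2
where name = 'privileges';
insert into cca_result
select groupid, type, name || '_2', result, milliseconds
from cca_result where name = 'privileges';

select (max(milliseconds) filter (where worker_num >= 1) +
        sum(milliseconds) filter (where worker_num = 0))/1000.0/3600.0
from compute_times_slowest_first(8);
rollback;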

Attachment: cca_result.csv