On Fri, Oct 10, 2008 at 02:30:42PM -0700, Steve Kargl wrote:
> Yes, this is a long email.
>
> In working with a colleague to diagnose the poor performance of
> his MPI code, we've discovered that ULE is drastically inferior
> to 4BSD in utilizing a system with 2 physical CPUs (Opteron) and
> a total of 8 cores. We have observed this problem with both the
> Open MPI and the MPICH2 implementations of MPI.
>
> Note, I am using the exact same hardware and FreeBSD-current
> code dated Sep 22, 2008. The only difference in the kernel
> config file is whether ULE or 4BSD is used.
>
> Using the following command,
>
> % time /OpenMPI/mpiexec -machinefile mf -n 8 ./Test_mpi |& tee sgk.log
>
> we have
>
> ULE  --> 546.99 real  0.02 user  0.03 sys
> 4BSD --> 218.96 real  0.03 user  0.02 sys
>
> where the machinefile simply tells Open MPI to launch 8 jobs on the
> local node. Test_mpi uses MPI's scatter, gather, and all_to_all
> functions to transmit various arrays between the 8 jobs. To get
> meaningful numbers, a number of iterations are done in a tight loop.
>
> With ULE, a snapshot of top(1) shows
>
> last pid: 33765;  load averages: 7.98, 7.51, 5.63  up 10+03:20:30  13:13:56
> 43 processes:  9 running, 34 sleeping
> CPU: 68.6% user,  0.0% nice, 18.9% system,  0.0% interrupt, 12.5% idle
> Mem: 296M Active, 20M Inact, 192M Wired, 1112K Cache, 132M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
> 33743 kargl       1 118    0   300M 22788K CPU7   7   4:48 100.00% Test_mpi
> 33747 kargl       1 118    0   300M 22820K CPU3   3   4:43 100.00% Test_mpi
> 33742 kargl       1 118    0   300M 22692K CPU5   5   4:42 100.00% Test_mpi
> 33744 kargl       1 117    0   300M 22752K CPU6   6   4:29 100.00% Test_mpi
> 33748 kargl       1 117    0   300M 22768K CPU2   2   4:31  96.39% Test_mpi
> 33741 kargl       1 112    0   299M 43628K CPU1   1   4:40  80.08% Test_mpi
> 33745 kargl       1 113    0   300M 44272K RUN    0   4:27  76.17% Test_mpi
> 33746 kargl       1 109    0   300M 22740K RUN    0   4:25  57.86% Test_mpi
> 33749 kargl       1  44    0  8196K  2280K CPU4   4   0:00   0.20% top
>
> while with 4BSD, a snapshot of top(1) shows
>
> last pid:  1019;  load averages: 7.24, 3.05, 1.25  up 0+00:04:40  13:27:09
> 43 processes:  9 running, 34 sleeping
> CPU: 45.4% user,  0.0% nice, 54.5% system,  0.1% interrupt,  0.0% idle
> Mem: 329M Active, 33M Inact, 107M Wired, 104K Cache, 14M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME     CPU COMMAND
>  1012 kargl       1 126    0   300M 44744K CPU6   6   2:16  99.07% Test_mpi
>  1016 kargl       1 126    0   314M 59256K RUN    4   2:16  99.02% Test_mpi
>  1011 kargl       1 126    0   300M 44652K CPU5   5   2:16  99.02% Test_mpi
>  1013 kargl       1 126    0   300M 44680K CPU2   2   2:16  99.02% Test_mpi
>  1010 kargl       1 126    0   300M 44740K CPU7   7   2:16  99.02% Test_mpi
>  1009 kargl       1 126    0   299M 43884K CPU0   0   2:16  98.97% Test_mpi
>  1014 kargl       1 126    0   300M 44664K CPU1   1   2:16  98.97% Test_mpi
>  1015 kargl       1 126    0   300M 44620K CPU3   3   2:16  98.93% Test_mpi
>   989 kargl       1  96    0  8196K  2460K CPU4   4   0:00   0.10% top
>
> Notice the interesting, or perhaps even odd, scheduling with ULE, which
> results in a gap of more than 20 seconds between the "fastest" job (4:48)
> and the "slowest" (4:25). With ULE, two Test_mpi jobs are always scheduled
> on the same core while one core remains idle. Also, note the difference in
> the reported load averages.
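To make it easier for others to reproduce, here is a minimal sketch, in C
against the standard MPI API, of the kind of timed collective loop described
above. It is illustrative only -- the array size, element type, and the
single-collective loop are my assumptions, not the actual Test_mpi source:

    /* Minimal timed-collective sketch (illustrative; NOT Test_mpi).
     * Build:  mpicc -O2 sketch.c -o sketch
     * Run:    mpiexec -n 8 ./sketch
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int iters = 100;
        const long n = 800000;              /* total elements (assumed) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int chunk = n / nprocs;                  /* elements per rank      */
        int *sendbuf = malloc(n * sizeof(int));  /* contents don't matter  */
        int *recvbuf = malloc(n * sizeof(int));  /* for a timing run       */

        MPI_Barrier(MPI_COMM_WORLD);             /* start ranks together   */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Scatter(sendbuf, chunk, MPI_INT,
                        recvbuf, chunk, MPI_INT, 0, MPI_COMM_WORLD);
        double dt = (MPI_Wtime() - t0) / iters;  /* seconds per call       */

        if (rank == 0)
            printf("scatter: %.8f s/call  %.5f MB/s\n", dt,
                   (n * sizeof(int) / (1024.0 * 1024.0)) / dt);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

The gather and all_to_all cases would be timed the same way with MPI_Gather
and MPI_Alltoall. Since all 8 ranks run on one node, every byte of this
traffic moves through shared memory, so how the scheduler places the ranks
directly determines how long each rank sits at the collective waiting for
its peers.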
> Various stats are generated by and collected from executing the MPI
> program. With ULE, the numbers are
>
> Procs  Array size     KB  Iters    Function  Bandwidth(MB/s)     Time(s)
>     8      800000   3125    100     scatter         12.58386  0.24251367
>     8      800000   3125    100  all_to_all         17.24503  0.17696444
>     8      800000   3125    100      gather         14.82058  0.20591355
>
>     8     1600000   6250    100     scatter         28.25922  0.21598316
>     8     1600000   6250    100  all_to_all       1985.74915  0.00307366
>     8     1600000   6250    100      gather         30.42038  0.20063902
>
>     8     2400000   9375    100     scatter         44.65615  0.20501709
>     8     2400000   9375    100  all_to_all         16.09386  0.56886748
>     8     2400000   9375    100      gather         44.38801  0.20625555
>
>     8     3200000  12500    100     scatter         60.04160  0.20330956
>     8     3200000  12500    100  all_to_all       2157.10010  0.00565900
>     8     3200000  12500    100      gather         59.72242  0.20439614
>
>     8     4000000  15625    100     scatter         86.65769  0.17608117
>     8     4000000  15625    100  all_to_all       2081.25195  0.00733154
>     8     4000000  15625    100      gather         27.47257  0.55541896
>
>     8     4800000  18750    100     scatter         33.02306  0.55447768
>     8     4800000  18750    100  all_to_all        200.09908  0.09150740
>     8     4800000  18750    100      gather         91.08742  0.20102168
>
>     8     5600000  21875    100     scatter        109.82005  0.19452098
>     8     5600000  21875    100  all_to_all         76.87574  0.27788095
>     8     5600000  21875    100      gather         41.67106  0.51264128
>
>     8     6400000  25000    100     scatter         26.92482  0.90674917
>     8     6400000  25000    100  all_to_all         64.74528  0.37707868
>     8     6400000  25000    100      gather         41.29724  0.59117904
>
> and with 4BSD, the numbers are
>
> Procs  Array size     KB  Iters    Function  Bandwidth(MB/s)     Time(s)
>     8      800000   3125    100     scatter         21.33697  0.14302677
>     8      800000   3125    100  all_to_all       3941.39624  0.00077428
>     8      800000   3125    100      gather         24.75520  0.12327747
>
>     8     1600000   6250    100     scatter         45.20134  0.13502954
>     8     1600000   6250    100  all_to_all       1987.94348  0.00307027
>     8     1600000   6250    100      gather         42.02498  0.14523541
>
>     8     2400000   9375    100     scatter         63.03553  0.14523989
>     8     2400000   9375    100  all_to_all       2015.19580  0.00454312
>     8     2400000   9375    100      gather         66.72807  0.13720272
>
>     8     3200000  12500    100     scatter         91.90541  0.13282169
>     8     3200000  12500    100  all_to_all       2029.62622  0.00601442
>     8     3200000  12500    100      gather         87.99693  0.13872112
>
>     8     4000000  15625    100     scatter        107.48991  0.14195556
>     8     4000000  15625    100  all_to_all       1970.66907  0.00774295
>     8     4000000  15625    100      gather        110.70226  0.13783630
>
>     8     4800000  18750    100     scatter        140.39014  0.13042616
>     8     4800000  18750    100  all_to_all       2401.80054  0.00762367
>     8     4800000  18750    100      gather        134.60948  0.13602717
>
>     8     5600000  21875    100     scatter        152.31958  0.14024661
>     8     5600000  21875    100  all_to_all       2379.12207  0.00897907
>     8     5600000  21875    100      gather        154.60051  0.13817745
>
>     8     6400000  25000    100     scatter        190.03561  0.12847099
>     8     6400000  25000    100  all_to_all       2661.36963  0.00917350
>     8     6400000  25000    100      gather        183.08250  0.13335006
>
> Noting that all communication is over the memory bus, a comparison of
> the Bandwidth columns suggests that ULE is causing the MPI jobs to stall
> waiting for data. This has a potentially serious negative impact on
> clusters used for HPC.
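For what it's worth, the Bandwidth column appears to be nothing more than
(array size in MB) / (time per call): taking the first ULE scatter row,
3125 KB / 1024 = 3.05 MB, and 3.05 MB / 0.24251367 s ~= 12.58 MB/s, which
matches the reported 12.58386. Read that way, scatter and gather throughput
under ULE trails 4BSD at every array size (from roughly 1.7x at the smallest
size to about 7x for scatter at the largest), and the ULE all_to_all numbers
swing wildly between about 16 and 2157 MB/s while 4BSD stays in the
1970-3941 MB/s range -- all consistent with the ranks stalling at the
collectives as you describe.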
What surprises me is that you didn't CC the individual who wrote ULE:
Jeff Roberson. :-)  I've CC'd him here.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |