for
...
rank 13=os221 slot=2
rank 14=os222 slot=2
rank 15=os224 slot=2
rank 16=os228 slot=4
rank 17=os229 slot=4

I tried it, and here are the results. The same thing happened.
2010-08-12 11:09:28,814 59759 DEBUG [0x7fbd3fdce740] - RANK(0) Printing Times...
2010-08-12 11:09:28,814 59759 DEBUG [0x7fbd3fdce740] - os221 RANK(1)    :24 sec
2010-08-12 11:09:28,814 59759 DEBUG [0x7fbd3fdce740] - os222 RANK(2)    :27 sec
2010-08-12 11:09:28,814 59759 DEBUG [0x7fbd3fdce740] - os224 RANK(3)    :27 sec
2010-08-12 11:09:28,814 59759 DEBUG [0x7fbd3fdce740] - os228 RANK(4)    :41 sec
2010-08-12 11:09:28,814 59759 DEBUG [0x7fbd3fdce740] - os229 RANK(5)    :42 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os223 RANK(6)    :27 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os221 RANK(7)    :28 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os222 RANK(8)    :22 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os224 RANK(9)    :22 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os228 RANK(10)    :*40 sec*
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os229 RANK(11)    :24 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os223 RANK(12)    :26 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os221 RANK(13)    :28 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os222 RANK(14)    :27 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os224 RANK(15)    :27 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os228 RANK(16)    :19 sec
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - os229 RANK(17)    :*43 sec*
2010-08-12 11:09:28,815 59759 DEBUG [0x7fbd3fdce740] - TOTAL CORRELATION TIME: 43 sec

for
...
rank 12=os223 slot=2
rank 13=os221 slot=2
rank 14=os222 slot=2
rank 15=os224 slot=2
rank 16=os228 slot=2
rank 17=os229 slot=2

Here are the results:
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os221 RANK(1)    :23 sec
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os222 RANK(2)    :23 sec
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os224 RANK(3)    :24 sec
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os228 RANK(4)    :20 sec
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os229 RANK(5)    :20 sec
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os223 RANK(6)    :24 sec
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os221 RANK(7)    :23 sec
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os222 RANK(8)    :22 sec
2010-08-12 11:19:33,916 54609 DEBUG [0x7f22881b5740] - os224 RANK(9)    :22 sec
2010-08-12 11:19:33,917 54609 DEBUG [0x7f22881b5740] - os228 RANK(10)    :19 sec
2010-08-12 11:19:33,917 54609 DEBUG [0x7f22881b5740] - os229 RANK(11)    :*35 sec*
2010-08-12 11:19:33,917 54609 DEBUG [0x7f22881b5740] - os223 RANK(12)    :23 sec
2010-08-12 11:19:33,917 54609 DEBUG [0x7f22881b5740] - os221 RANK(13)    :23 sec
2010-08-12 11:19:33,917 54609 DEBUG [0x7f22881b5740] - os222 RANK(14)    :23 sec
2010-08-12 11:19:33,917 54609 DEBUG [0x7f22881b5740] - os224 RANK(15)    :23 sec
2010-08-12 11:19:33,917 54609 DEBUG [0x7f22881b5740] - os228 RANK(16)    :19 sec
2010-08-12 11:19:33,917 54609 DEBUG [0x7f22881b5740] - os229 RANK(17)    :*37 sec*

Again, the same thing happened. I also tried giving the slots as 0, 3, 7 and
some other combinations, but it didn't change the result. Sometimes the
timings were pretty normal, and then I got some strange ones again.
*I guess specifying the slot number doesn't affect the BIOS rank
choice.* The last test was as follows:

2010-08-12 11:25:02,599 55467 DEBUG [0x7f15af87a740] - os221 RANK(1)    :24 sec
2010-08-12 11:25:02,599 55467 DEBUG [0x7f15af87a740] - os222 RANK(2)    :23 sec
2010-08-12 11:25:02,599 55467 DEBUG [0x7f15af87a740] - os224 RANK(3)    :23 sec
*2010-08-12 11:25:02,599 55467 DEBUG [0x7f15af87a740] - os228 RANK(4)    :40 sec*
2010-08-12 11:25:02,599 55467 DEBUG [0x7f15af87a740] - os229 RANK(5)    :20 sec
2010-08-12 11:25:02,599 55467 DEBUG [0x7f15af87a740] - os223 RANK(6)    :24 sec
2010-08-12 11:25:02,599 55467 DEBUG [0x7f15af87a740] - os221 RANK(7)    :24 sec
2010-08-12 11:25:02,599 55467 DEBUG [0x7f15af87a740] - os222 RANK(8)    :22 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os224 RANK(9)    :22 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os228 RANK(10)    :20 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os229 RANK(11)    :21 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os223 RANK(12)    :23 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os221 RANK(13)    :24 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os222 RANK(14)    :24 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os224 RANK(15)    :23 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os228 RANK(16)    :38 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - os229 RANK(17)    :21 sec
2010-08-12 11:25:02,599 55468 DEBUG [0x7f15af87a740] - TOTAL CORRELATION TIME: 40 sec
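
Rather than picking out the slow ranks by eye, the timing output above could
be filtered with something like the following. This is only a rough sketch:
the function name slow_ranks and the 1.5x-median cutoff are my own
assumptions, not anything from Open MPI or from the application's logger.

```python
import re
from statistics import median

# Matches the "<host> RANK(<n>)    :<seconds>" part of a log line.
# The optional "*" covers the emphasis markers used in this mail, and
# "\s*" lets the pattern survive lines wrapped by the mail client.
LINE_RE = re.compile(r"(\S+) RANK\((\d+)\)\s*:\s*\*?(\d+)")


def slow_ranks(log_text, factor=1.5):
    """Return (host, rank, seconds) for ranks slower than factor * median.

    The cutoff factor is an arbitrary assumption chosen for illustration.
    """
    rows = [(host, int(rank), int(sec))
            for host, rank, sec in LINE_RE.findall(log_text)]
    if not rows:
        return []
    cutoff = factor * median(sec for _, _, sec in rows)
    return [row for row in rows if row[2] > cutoff]
```

Calling slow_ranks() on the text of one test run returns only the outliers,
which makes it easier to see whether they always land on os228 or os229.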

Now I'm going to try the other suggestions here, such as mpstat or -bynode.
I hope to find a solution, and then I'll post it here.
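
Since the outcome depends on how the hardware threads are numbered on os228
and os229, one thing to check first is which logical processors share a
physical core. Below is a minimal sketch that groups the "processor" entries
of /proc/cpuinfo by their "physical id" and "core id" fields. The function
name ht_siblings is made up for illustration, and whether the rankfile slot
numbers line up with these logical processor numbers is an assumption that
still needs to be verified on the cluster.

```python
from collections import defaultdict


def ht_siblings(cpuinfo_text):
    """Map (physical id, core id) -> list of logical processor numbers.

    Expects text in the /proc/cpuinfo layout: one "key : value" pair per
    line, with a blank line between logical processors.
    """
    cores = defaultdict(list)
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        # Skip blocks that lack any of the three fields we rely on.
        if all(k in fields for k in ("processor", "physical id", "core id")):
            core = (fields["physical id"], fields["core id"])
            cores[core].append(int(fields["processor"]))
    return dict(cores)
```

Running ht_siblings(open("/proc/cpuinfo").read()) on os228 should give 4
entries with 2 logical processors each if hyper-threading is enabled; any two
processors in the same list are the ones that should not both get a rank.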


On Wed, Aug 11, 2010 at 6:23 PM, Eugene Loh <eugene....@oracle.com> wrote:

>  The way MPI processes are being assigned to hardware threads is perhaps
> neither controlled nor optimal.  On the HT nodes, two processes may end up
> sharing the same core, with poorer performance.
>
> Try submitting your job like this
>
> % cat myrankfile1
> rank  0=os223 slot=0
> rank  1=os221 slot=0
> rank  2=os222 slot=0
> rank  3=os224 slot=0
> rank  4=os228 slot=0
> rank  5=os229 slot=0
> rank  6=os223 slot=1
> rank  7=os221 slot=1
> rank  8=os222 slot=1
> rank  9=os224 slot=1
> rank 10=os228 slot=1
> rank 11=os229 slot=1
> rank 12=os223 slot=2
> rank 13=os221 slot=2
> rank 14=os222 slot=2
> rank 15=os224 slot=2
> rank 16=os228 slot=2
> rank 17=os229 slot=2
> % mpirun -host os221,os222,os223,os224,os228,os229 -np 18 --rankfile
> myrankfile1 ./a.out
>
> You can also try
>
> % cat myrankfile2
> rank  0=os223 slot=0
> rank  1=os221 slot=0
> rank  2=os222 slot=0
> rank  3=os224 slot=0
> rank  4=os228 slot=0
> rank  5=os229 slot=0
> rank  6=os223 slot=1
> rank  7=os221 slot=1
> rank  8=os222 slot=1
> rank  9=os224 slot=1
> rank 10=os228 slot=2
> rank 11=os229 slot=2
> rank 12=os223 slot=2
> rank 13=os221 slot=2
> rank 14=os222 slot=2
> rank 15=os224 slot=2
> rank 16=os228 slot=4
> rank 17=os229 slot=4
> % mpirun -host os221,os222,os223,os224,os228,os229 -np 18 --rankfile
> myrankfile2 ./a.out
>
> Which one reproduces your problem and which one avoids it depends on how
> the BIOS numbers your HTs.  Once you can confirm you understand the problem,
> you (with the help of this list) can devise a solution approach for your
> situation.
>
>
>
> Saygin Arkan wrote:
>
> Hello,
>
> I'm running MPI jobs in a non-homogeneous cluster. 4 of my machines (os221,
> os222, os223, os224) have the following properties:
>
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 23
> model name      : Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz
> stepping        : 7
> cache size      : 3072 KB
> physical id     : 0
> siblings        : 4
> core id         : 3
> cpu cores       : 4
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
> constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est
> tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
> bogomips        : 4999.40
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
>
> and the 2 problematic, hyper-threaded machines, os228 and os229, are as
> follows:
>
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 26
> model name      : Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
> stepping        : 5
> cache size      : 8192 KB
> physical id     : 0
> siblings        : 8
> core id         : 3
> cpu cores       : 4
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx
> est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida
> bogomips        : 5396.88
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
>
>
> The problem is that those 2 machines appear to have 8 cores (virtually; the
> actual core count is 4).
> When I submitted an MPI job and measured the comparison times across the
> cluster, I got strange results.
>
> I'm running the job on 6 nodes, 3 cores per node. Sometimes (I'd say in
> 1/3 of the tests) os228 or os229 returns strange results: 2 cores are slow
> (slower than the first 4 nodes) but the 3rd core is extremely fast.
>
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - RANK(0) Printing Times...
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(1)    :38 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(2)    :38 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(3)    :38 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(4)    :37 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(5)    :34 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(6)    :38 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(7)    :39 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os222 RANK(8)    :37 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os224 RANK(9)    :38 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os228 RANK(10)    :*48 sec*
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os229 RANK(11)    :35 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os223 RANK(12)    :38 sec
> 2010-08-05 14:30:58,926 50672 DEBUG [0x7fcadf98c740] - os221 RANK(13)    :37 sec
> 2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os222 RANK(14)    :37 sec
> 2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os224 RANK(15)    :38 sec
> 2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os228 RANK(16)    :*43 sec*
> 2010-08-05 14:30:58,926 50673 DEBUG [0x7fcadf98c740] - os229 RANK(17)    :35 sec
> TOTAL CORRELATION TIME: 48 sec
>
>
> or another test:
>
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - RANK(0) Printing Times...
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(1)    :170 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os222 RANK(2)    :161 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os224 RANK(3)    :158 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os228 RANK(4)    :142 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os229 RANK(5)    :*256 sec*
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os223 RANK(6)    :156 sec
> 2010-08-09 15:28:10,947 272904 DEBUG [0x7f27dec27740] - os221 RANK(7)    :162 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os222 RANK(8)    :159 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os224 RANK(9)    :168 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os228 RANK(10)    :141 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os229 RANK(11)    :136 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os223 RANK(12)    :173 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os221 RANK(13)    :164 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os222 RANK(14)    :171 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os224 RANK(15)    :156 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os228 RANK(16)    :136 sec
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - os229 RANK(17)    :*250 sec*
> 2010-08-09 15:28:10,947 272905 DEBUG [0x7f27dec27740] - TOTAL CORRELATION TIME: 256 sec
>
>
> Do you have any idea why this is happening?
> I assume that it gives 2 jobs to 2 cores in os229, but those 2 are actually
> one core.
> If so, how can I fix it? The longest time determines the total time, and a
> 100 sec delay is too much for a 250 sec comparison time;
> it might have finished in around 160 sec.
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Saygin
