Hi,

As requested, here are some details about the blocked job.
This job consists of 7 processes running on the same 8 cores. When the job is blocked, one process is waiting on a disk write and is in state D:

=========================================================================
0x561df44e in write () from /lib32/libc.so.6
(gdb) where
#0  0x561df44e in write () from /lib32/libc.so.6
#1  0x560e2bd9 in std::__basic_file<char>::sys_open () from /usr/lib32/libstdc++.so.6
#2  0x56089eab in std::basic_filebuf<char, std::char_traits<char> >::_M_convert_to_external () from /usr/lib32/libstdc++.so.6
#3  0x5608b0b3 in std::basic_filebuf<char, std::char_traits<char> >::overflow () from /usr/lib32/libstdc++.so.6
#4  0x560899d7 in std::basic_filebuf<char, std::char_traits<char> >::sync () from /usr/lib32/libstdc++.so.6
#5  0x560b3c92 in std::ostream::flush () from /usr/lib32/libstdc++.so.6
#6  0x560b3d2d in std::flush<char, std::char_traits<char> > () from /usr/lib32/libstdc++.so.6
#7  0x560b56b9 in std::endl<char, std::char_traits<char> > () from /usr/lib32/libstdc++.so.6
...
=========================================================================

The others seem to be waiting in the poll system call:

=========================================================================
0x561e4408 in poll () from /lib32/libc.so.6
(gdb) where
#0  0x561e4408 in poll () from /lib32/libc.so.6
#1  0x57203917 in poll_dispatch () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/libopen-pal.so.0
#2  0x57202a3a in opal_event_base_loop () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/libopen-pal.so.0
#3  0x57202d47 in opal_event_loop () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/libopen-pal.so.0
#4  0x571f6d28 in opal_progress () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/libopen-pal.so.0
#5  0x56e89415 in ompi_request_default_wait_all () from /home/semar/pelican/ExternalPackages/etch/openmpi/lib/libmpi.so.0
#6  0x5743284f in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/openmpi/mca_coll_tuned.so
#7  0x5742fdac in ompi_coll_tuned_allreduce_intra_dec_fixed () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/openmpi/mca_coll_tuned.so
#8  0x56e9d247 in PMPI_Allreduce ()
...
=========================================================================

Another one:

=========================================================================
#0  0x561e4408 in poll () from /lib32/libc.so.6
#1  0x57203917 in poll_dispatch () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/libopen-pal.so.0
#2  0x57202a3a in opal_event_base_loop () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/libopen-pal.so.0
#3  0x57202d47 in opal_event_loop () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/libopen-pal.so.0
#4  0x571f6d28 in opal_progress () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/libopen-pal.so.0
#5  0x5736b735 in mca_pml_ob1_recv () from /TMPCALCUL/lmpc/pelican/etch/openmpi/lib/openmpi/mca_pml_ob1.so
#6  0x56eac1f9 in PMPI_Recv () from /home/semar/pelican/ExternalPackages/etch/openmpi/lib/libmpi.so.0
#7  0x55c6a7ba in EXT_MPIcommunicator::receive ()
...
=========================================================================

The strangest part is that when I stop one particular process (here the second one, and only that one), all the others resume normally. It seems, in fact, that this process's poll request blocks the file system and prevents the first process (the one writing the file) from rejoining the others.

Note that this happens randomly after a long run time, and it repairs itself after some time (several minutes of timeout).

> Can you send all the information listed on the getting help page on the ompi web
> site? Also, information about your application would be helpful.
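For reference, the call pattern in the traces above corresponds to something like the minimal sketch below. This is not taken from the real application; the file name and loop count are invented for illustration. Rank 0 flushes a log file with std::endl (hence the write () frame) while every rank then enters MPI_Allreduce, spinning in opal_progress / poll () until all ranks arrive:

=========================================================================
// Minimal sketch of the pattern, NOT the real application code.
// Assumed/hypothetical: the "job.log" file name and the loop count.
#include <fstream>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::ofstream log;
    if (rank == 0)
        log.open("job.log");                        // hypothetical log file

    for (int step = 0; step < 1000; ++step) {
        if (rank == 0)
            log << "step " << step << std::endl;    // std::endl => flush => write ()

        double local = static_cast<double>(step + rank), global = 0.0;
        // All ranks wait here until everyone, including the writer, arrives;
        // while waiting they spin in opal_progress (), hence the poll () frames.
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
=========================================================================

Compiled with mpic++ and run on one node, this should show the same frames under gdb (write () via std::endl on rank 0, PMPI_Allreduce / opal_progress / poll () on the others), although I have not verified that it reproduces the hang itself.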
>
> I experience some strange behaviour on a multi-core node of our cluster that I presume is linked to Open MPI.
> When running for a long time, and when several pseudo-nodes of a single multi-core node are involved, one process freezes in uninterruptible mode (D status) and the others seem to wait for a long time (S status).
> Concurrent processes on the other pseudo-nodes are also frozen in D mode.
> When forcing the sleeping processes to stop (kill -STOP), normal activity of the other processes is recovered.
> When interrupting the blocked process at wakeup, it seems to be blocked in the poll_dispatch method, and I guess the comment about multithreading must be relevant.
> Do you know more about this behaviour?
>
> Thanks a lot,
>
> Lionel
>
> Nb: I'm using Open MPI v1.3 on a Linux Debian etch distribution.
> The nodes are as follows (/proc/cpuinfo):
>
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 23
> model name      : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
> stepping        : 10
> cpu MHz         : 2833.422
> cache size      : 6144 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 4
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
> bogomips        : 5666.84
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 38 bits physical, 48 bits virtual
> power management:
> ....
>
> ________________________________
>
> Lionel CHAILAN
> ASSYSTEM
> Manager Technique Groupe Calcul Scientifique de PERTUIS
> lchai...@assystem.com // 06.73.08.85.69

--
Lionel CHAILAN
ASSYSTEM
Manager Technique Groupe Calcul Scientifique de PERTUIS
lchai...@assystem.com // 06.73.08.85.69