Dear Eugene
I am sorry that I may not have explained the problem clearly last time. I
tested OMPI with the PWscf program on a single quad-core node. For the first
several hours the program ran quite well, but as the electronic SCF was about
to converge, the program started to hang. For example, it hung at the first
SCF iteration of BFGS step 23. I waited another 10 hours for the program to
continue, but in vain.

The kernel is 2.6.29.4-167.fc11.i686.PAE.

The following are the compilers I used to build OMPI. I configured OMPI with
the options CC=gcc and FC=ifort.

********************************************************************************
intel-icc101018-10.1.018-1.i386
libgcc-4.4.0-4.i586
gcc-4.4.0-4.i586
gcc-gfortran-4.4.0-4.i586
gcc-c++-4.4.0-4.i586
intel-ifort101018-10.1.018-1.i386
********************************************************************************

The architecture is:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
stepping        : 10
cpu MHz         : 2825.937
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
                  pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
                  lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor
                  ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave
                  lahf_lm tpr_shadow vnmi flexpriority
bogomips        : 5651.87
clflush size    : 64
power management:
********************************************************************************

On Tue, Nov 24, 2009 at 7:27 AM, Eugene Loh <eugene....@sun.com> wrote:
> I can't tell if these problems are related to trac ticket 2043 or not.
>
> Compiler: In my experience, trac 2043 depends on GCC 4.4.x. It isn't
> necessarily a GCC bug... perhaps it's just exposing an OMPI problem.
I'm > confused what compiler Jiaye is using, and Vasilis is apparently seeing a > problem when using the PGI compiler. But, maybe other compilers in > addition to GCC 4.4.x are exposing the problem. > > Severity: In my experience, trac 2043 shows up rather dramatically: > within dozens to hundreds of iterations of simple message patterns. So, a > problem that shows up only after hours of execution feels to me to be > something different. But maybe I misunderstand Jiaye's and Vasili's cases: > are the programs running well for several hours before the hang occurs? > > Shared memory: Trac 2043 appears related to shared memory. Jiaye seems to > run on a single node. Vasilis talks of running on a "cluster" -- so I don't > know if that means over an interconnect or still using sm. > > Anyhow, it's hard to know which problems are the same or different when we > don't yet really understand what's going on. > > vasilis gkanis wrote: > > I also experience a similar problem with the MUMPS solver, when I run it >> on a cluster. After several hours of running the code does not produce any >> results, although the command top shows that the program occupies 100% of >> the CPU. >> >> The difference here, however, is that the same program runs fine on my PC. >> The differences between my PC and the cluster are: >> 1) 32bit vs 64-bit(cluster) >> 2) intel compiler vs portland compiler(cluster) >> >> On Friday 20 November 2009 03:50:17 am Jiaye Li wrote: >> >> >>> I installed openmpi-1.3.3 on my single node(cpu) intel 64bit quad-core >>> machine. 
>>> The compiler info is:
>>>
>>> ***************************************************************************
>>> intel-icc101018-10.1.018-1.i386
>>> libgcc-4.4.0-4.i586
>>> gcc-4.4.0-4.i586
>>> gcc-gfortran-4.4.0-4.i586
>>> gcc-c++-4.4.0-4.i586
>>> intel-ifort101018-10.1.018-1.i386
>>> ***************************************************************************
>>>
>>> I compiled the PWscf program with Open MPI and tested it. At the
>>> beginning, PW ran well, but after about 10 h, when the program was about
>>> to finish, it hung, while still occupying 100% of the CPU. Something
>>> seems to be wrong somewhere. Any ideas? Thank you in advance.
>>
--
Sincerely yours,
Jiaye Li