Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)

2010-03-05 Thread Aurélien Bouteiller
Hi, setting the eager limit to such a drastically high value will generate gigantic memory consumption for unexpected messages. Any message you send that does not have a preposted recv will malloc 150 MB of temporary storage, and will be memcopied from that intern…
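A minimal Fortran sketch of the pattern Aurélien is describing (program name, tag, and buffer size are illustrative, not from the original mail): if the receive is posted before the message arrives, the data matches immediately and lands in the user buffer, instead of being buffered as an unexpected message.

    program prepost_recv
      use mpi
      implicit none
      integer, parameter :: n = 1000000   ! ~8 MB payload, illustrative
      double precision :: buf(n)
      integer :: rank, req, ierr
      integer :: status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = dble(rank)
      ! Rank 0 posts its receive before anyone sends.
      if (rank == 0) then
        call MPI_IRECV(buf, n, MPI_DOUBLE_PRECISION, 1, 99, &
                       MPI_COMM_WORLD, req, ierr)
      end if
      ! After this barrier the sender knows the receive is preposted,
      ! so the message is never "unexpected" on the receive side.
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      if (rank == 1) then
        call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 0, 99, &
                      MPI_COMM_WORLD, ierr)
      end if
      if (rank == 0) call MPI_WAIT(req, status, ierr)
      call MPI_FINALIZE(ierr)
    end program prepost_recv

With receives preposted this way, raising btl_tcp_eager_limit to huge values becomes unnecessary, which is exactly the memory-consumption concern raised above.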

[OMPI users] change hosts to restart the checkpoint

2010-03-05 Thread 马少杰
Dear Sir: I want to use Open MPI and BLCR to checkpoint. However, I want to restart the checkpoint on other hosts. For example, I run an MPI program using Open MPI on host1 and host2, and I save the checkpoint file at an NFS-shared path. Then I want to restart the job (ompi-res…

Re: [OMPI users] low efficiency when we use --am ft-enable-cr to checkpoint

2010-03-05 Thread 马少杰
Dear Sir:
- What version of Open MPI are you using? My version is 1.3.4.
- What configure options are you using?
./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread --with-blcr=$dir --with-blcr-libdir=/$dir/lib --prefix=/public/mpi/openmpi134-gnu-cr --enable-mpirun-prefix-by-default…

Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)

2010-03-05 Thread TRINH Minh Hieu
Hi, thank you for that information. For the moment, I haven't encountered those problems, maybe because my program doesn't use much memory (100 MB) and the master machine has huge RAM (8 GB). So, in the meantime, the solution seems to be the parameter "btl_tcp_eager_limit", but a cleaner solution is ve…

Re: [OMPI users] low efficiency when we use --am ft-enable-cr to checkpoint

2010-03-05 Thread Joshua Hursey
On Mar 5, 2010, at 3:15 AM, 马少杰 wrote:
> Dear Sir:
> - What version of Open MPI are you using?
> My version is 1.3.4.
> - What configure options are you using?
> ./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread
> --with-blcr=$dir --with-blcr-libdir=/$dir/lib
> --prefix=/public/m…

Re: [OMPI users] running external program on same processor (Fortran)

2010-03-05 Thread abc def
Hello, Thanks for the comments. Indeed, until yesterday, I didn't realise the difference between MVAPICH, MVAPICH2 and Open MPI. This problem has now moved from MVAPICH2 to Open MPI, however, because I now realise that the production environment uses Open MPI, which means my solution for mvapi…

Re: [OMPI users] running external program on same processor (Fortran)

2010-03-05 Thread Ralph Castain
How are you trying to start this external program? With an MPI_Comm_spawn? Or are you just fork/exec'ing it? How are you waiting for this external program to finish?

On Mar 5, 2010, at 7:52 AM, abc def wrote:
> Hello,
>
> Thanks for the comments. Indeed, until yesterday, I didn't realise the…
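For reference, a minimal sketch of the MPI_Comm_spawn route Ralph mentions, in Fortran. The executable path is taken from the later post in this thread; everything else is illustrative, and a stock external code would not know to cooperate with its parent, so this only shows the launch mechanics:

    program spawn_external
      use mpi
      implicit none
      integer :: ierr, intercomm
      integer :: errcodes(1)
      call MPI_INIT(ierr)
      ! Launch one copy of the external executable as a child MPI job.
      call MPI_COMM_SPAWN('/home01/group/Execute/DLPOLY.X', MPI_ARGV_NULL, &
                          1, MPI_INFO_NULL, 0, MPI_COMM_SELF, intercomm, &
                          errcodes, ierr)
      ! Detach from the child; actually waiting for it to *finish* requires
      ! the child to cooperate (e.g. a message or barrier on the intercomm).
      call MPI_COMM_DISCONNECT(intercomm, ierr)
      call MPI_FINALIZE(ierr)
    end program spawn_external

The spawned process gets a proper MPI runtime from the parent job, which sidesteps the inherited-environment problem that breaks a nested mpirun under CALL SYSTEM (see below).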

Re: [OMPI users] change hosts to restart the checkpoint

2010-03-05 Thread Josh Hursey
This type of failure is usually due to prelinking being left enabled on one or more of the systems. This has come up multiple times on the Open MPI list, but is actually a problem between BLCR and the Linux kernel. BLCR has a FAQ entry on this that you will want to check out: https://upc-…

Re: [OMPI users] running external program on same processor (Fortran)

2010-03-05 Thread abc def
Hello, from within the MPI Fortran program I run the following command:

CALL SYSTEM("cd " // TRIM(dir) // " ; mpirun -machinefile ./machinefile -np 1 /home01/group/Execute/DLPOLY.X > job.out 2> job.err ; cd - > /dev/null")

where "dir" is a process-number-dependent directory, to ensure the proc…

Re: [OMPI users] running external program on same processor (Fortran)

2010-03-05 Thread Ralph Castain
On Mar 5, 2010, at 8:52 AM, abc def wrote:
> Hello,
> From within the MPI Fortran program I run the following command:
>
> CALL SYSTEM("cd " // TRIM(dir) // " ; mpirun -machinefile ./machinefile -np 1 /home01/group/Execute/DLPOLY.X > job.out 2> job.err ; cd - > /dev/null")

That is guaranteed…
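A hedged sketch of the direction Ralph's diagnosis points to, kept in the same CALL SYSTEM style as the original. The use of "env -i" and the PATH/LD_LIBRARY_PATH values are assumptions for illustration, not from the thread:

    ! Illustrative workaround sketch: run the nested mpirun in a scrubbed
    ! environment so the parent's inherited OMPI_* variables cannot confuse
    ! its launch. The paths below are placeholders for your installation.
    CALL SYSTEM("cd " // TRIM(dir) // " && env -i " // &
                "PATH=/usr/bin:/bin:/opt/openmpi/bin " // &
                "LD_LIBRARY_PATH=/opt/openmpi/lib " // &
                "mpirun -machinefile ./machinefile -np 1 " // &
                "/home01/group/Execute/DLPOLY.X > job.out 2> job.err")

Whether this works depends on the Open MPI version and setup; MPI_Comm_spawn (sketched earlier in this digest) is the supported way to launch MPI children from within an MPI job.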

Re: [OMPI users] running external program on same processor (Fortran)

2010-03-05 Thread Jeff Squyres
On Mar 5, 2010, at 2:38 PM, Ralph Castain wrote:
>> CALL SYSTEM("cd " // TRIM(dir) // " ; mpirun -machinefile ./machinefile -np 1 /home01/group/Execute/DLPOLY.X > job.out 2> job.err ; cd - > /dev/null")
>
> That is guaranteed not to work. The problem is that mpirun sets environmental varia…