Gus -- Can you try v1.4.2, which was just released today?

On May 4, 2010, at 4:18 PM, Gus Correa wrote:

> Hi Ralph
> 
> Thank you very much.
> The "-mca btl ^sm" workaround seems to have solved the problem,
> at least for the little hello_c.c test.
> I just ran it fine up to 128 processes.
> 
> I confess I am puzzled by this workaround.
> * Why should we turn off "sm" on a standalone machine,
> where everything is supposed to operate via shared memory?
> * Do I incur a performance penalty by not using "sm"?
> * What other mechanism is actually used by OpenMPI for process
> communication in this case?
> 
> It seems to be using tcp, because when I try -np 256 I get this error:
> 
> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
> of network connections a process can open was reached in file
> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
> --------------------------------------------------------------------------
> Error: system limit exceeded on number of network connections that can
> be open
> This can be resolved by setting the mca parameter
> opal_set_max_sys_limits to 1,
> increasing your limit descriptor setting (using limit or ulimit commands),
> or asking the system administrator to increase the system limit.
> --------------------------------------------------------------------------
> 
> Anyway, no big deal, because we don't intend to oversubscribe the
> processors on real jobs (and the error message itself suggests a
> workaround to increase np, if needed).
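
On the descriptor limit named in the error message above: a minimal C sketch, assuming a POSIX system, that reports the open-descriptor limits a process actually inherits. It is an illustration only, not Open MPI code; get_nofile_limits and print_nofile_limits are hypothetical helper names.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper (not part of Open MPI): fetch the soft and hard
 * limits on open file descriptors for the calling process.
 * Returns 0 on success, -1 if getrlimit() fails. */
int get_nofile_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;   /* limit currently enforced */
    *hard = rl.rlim_max;   /* ceiling the soft limit may be raised to */
    return 0;
}

/* Print one limit, spelling out RLIM_INFINITY instead of a huge number. */
static void print_one(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu\n", label, (unsigned long long)value);
}

/* Hypothetical helper: print both limits. Returns 0 on success. */
int print_nofile_limits(void)
{
    rlim_t soft, hard;

    if (get_nofile_limits(&soft, &hard) != 0) {
        perror("getrlimit");
        return -1;
    }
    print_one("soft limit on open descriptors", soft);
    print_one("hard limit on open descriptors", hard);
    return 0;
}
```

Calling print_nofile_limits() from a small main(), run in the same shell that launches mpirun, shows the soft limit that "ulimit -n" reports there. An unprivileged user can raise the soft limit only as far as the hard limit; going beyond that is the "ask the system administrator" case in the message.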
> 
> Many thanks,
> Gus Correa
> 
> Ralph Castain wrote:
> > I would certainly try -mca btl ^sm and see if that solves the problem.
> >
> > On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
> >
> >> Gus Correa wrote:
> >>
> >>> Dear Open MPI experts
> >>>
> >>> I need your help to get Open MPI right on a standalone
> >>> machine with Nehalem processors.
> >>>
> >>> How to tweak the mca parameters to avoid problems
> >>> with Nehalem (and perhaps AMD processors also),
> >>> where MPI programs hang, was discussed here before.
> >>>
> >>> However, I lost track of the details: how to work around the problem,
> >>> and whether it has already been fully fixed.
> >> Yes, perhaps the problem you're seeing is not what you remember being 
> >> discussed.
> >>
> >> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 
> >> .  It's presumably fixed.
> >>
> >>> I am now facing the problem directly on a single Nehalem box.
> >>>
> >>> I installed OpenMPI 1.4.1 from source,
> >>> and compiled the test hello_c.c with mpicc.
> >>> Then I tried to run it with:
> >>>
> >>> 1) mpirun -np 4 a.out
> >>> It ran OK (but seemed to be slow).
> >>>
> >>> 2) mpirun -np 16 a.out
> >>> It hung, and brought the machine to a halt.
> >>>
> >>> Any words of wisdom are appreciated.
> >>>
> >>> More info:
> >>>
> >>> * OpenMPI 1.4.1 installed from source (tarball from your site).
> >>> * Compilers are gcc/g++/gfortran 4.4.3-4.
> >>> * OS is Fedora Core 12.
> >>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
> >>> processors on a two-way motherboard and 48GB of RAM.
> >>> * /proc/cpuinfo indicates that hyperthreading is turned on.
> >>> (I can see 16 "processors".)
> >>>
> >>> **
> >>>
> >>> What should I do?
> >>>
> >>> Use -mca btl ^sm  ?
> >>> Use -mca btl -mca btl_sm_num_fifos=some_number ? (Which number?)
> >>> Use Both?
> >>> Do something else?
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

