One possibility is that the sm btl might not like that you have hyperthreading enabled.

Another thing to check: do you have any paffinity settings turned on (e.g., mpi_paffinity_alone)? Our paffinity system doesn't handle hyperthreading at this time. I'm just suspicious of the HT since you have a quad-core machine, and the limit where things work seems to be 4...
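(For reference: a quick way to check the hyperthreading and affinity side of this, assuming bash on a Linux box; the process count and binary name below are only illustrative.)

    # Count logical vs. physical cores; with HT on, "processor" entries
    # are twice the number of physical cores per socket.
    $ grep -c '^processor' /proc/cpuinfo
    $ grep -m1 'cpu cores' /proc/cpuinfo

    # See whether any paffinity-related MCA parameters are set.
    $ ompi_info --param mpi all | grep paffinity

    # Run with processor affinity explicitly disabled (np and a.out illustrative).
    $ mpirun --mca mpi_paffinity_alone 0 -np 16 ./a.out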
On May 4, 2010, at 3:44 PM, Gus Correa wrote:

> Hi Jeff
>
> Sure, I will certainly try v1.4.2.
> I am downloading it right now.
> As of this morning, when I first downloaded,
> the web site still had 1.4.1.
> Maybe I should have refreshed the web page in my browser.
>
> I will tell you how it goes.
>
> Gus
>
> Jeff Squyres wrote:
>> Gus -- Can you try v1.4.2, which was just released today?
>>
>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>> Hi Ralph
>>>
>>> Thank you very much.
>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>> at least for the little hello_c.c test.
>>> I just ran it fine up to 128 processes.
>>>
>>> I confess I am puzzled by this workaround.
>>> * Why should we turn off "sm" on a standalone machine,
>>>   where everything is supposed to operate via shared memory?
>>> * Do I incur a performance penalty by not using "sm"?
>>> * What other mechanism is actually used by Open MPI for process
>>>   communication in this case?
>>>
>>> It seems to be using tcp, because when I try -np 256 I get this error:
>>>
>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>>> of network connections a process can open was reached in file
>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>> --------------------------------------------------------------------------
>>> Error: system limit exceeded on number of network connections that can
>>> be open
>>> This can be resolved by setting the mca parameter
>>> opal_set_max_sys_limits to 1,
>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>> or asking the system administrator to increase the system limit.
>>> --------------------------------------------------------------------------
>>>
>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>> processors on real jobs anyway (and the error message itself suggests a
>>> workaround to increase np, if needed).
>>>
>>> Many thanks,
>>> Gus Correa
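(For reference: the two remedies that error message points at could look roughly like this under bash; the descriptor limit and process count are illustrative, not recommendations from this thread.)

    # Check and raise the per-process open-file-descriptor limit (bash).
    $ ulimit -n
    $ ulimit -n 4096

    # Or ask Open MPI to raise system limits itself, as the error suggests.
    $ mpirun --mca opal_set_max_sys_limits 1 -np 256 ./a.out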
>>> Ralph Castain wrote:
>>>> I would certainly try it with -mca btl ^sm and see if that solves the problem.
>>>>
>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>>
>>>>> Gus Correa wrote:
>>>>>
>>>>>> Dear Open MPI experts
>>>>>>
>>>>>> I need your help to get Open MPI right on a standalone
>>>>>> machine with Nehalem processors.
>>>>>>
>>>>>> How to tweak the mca parameters to avoid problems
>>>>>> with Nehalem (and perhaps AMD processors also),
>>>>>> where MPI programs hang, was discussed here before.
>>>>>>
>>>>>> However, I lost track of the details of how to work around the
>>>>>> problem, and of whether it has been fully fixed already.
>>>>> Yes, perhaps the problem you're seeing is not what you remember being
>>>>> discussed.
>>>>>
>>>>> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 .
>>>>> It's presumably fixed.
>>>>>
>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>>
>>>>>> I installed Open MPI 1.4.1 from source,
>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>> Then I tried to run it with:
>>>>>>
>>>>>> 1) mpirun -np 4 a.out
>>>>>> It ran OK (but seemed to be slow).
>>>>>>
>>>>>> 2) mpirun -np 16 a.out
>>>>>> It hung, and brought the machine to a halt.
>>>>>>
>>>>>> Any words of wisdom are appreciated.
>>>>>>
>>>>>> More info:
>>>>>>
>>>>>> * Open MPI 1.4.1 installed from source (tarball from your site).
>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>> * OS is Fedora Core 12.
>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad-core)
>>>>>>   processors on a two-way motherboard and 48GB of RAM.
>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>>   (I can see 16 "processors".)
>>>>>>
>>>>>> What should I do?
>>>>>>
>>>>>> Use -mca btl ^sm ?
>>>>>> Use -mca btl_sm_num_fifos=some_number ? (Which number?)
>>>>>> Use both?
>>>>>> Do something else?
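(For reference: the two variants being weighed above would be invoked roughly as below; -np 16 and the FIFO count are only illustrative, since the thread leaves the "right" number open.)

    # Workaround 1: disable the shared-memory BTL entirely
    # (on-node traffic then falls back to another BTL, e.g. tcp).
    $ mpirun --mca btl ^sm -np 16 ./a.out

    # Workaround 2: keep sm but give it more FIFOs; the value here is
    # only an example to experiment with, not a recommendation from the thread.
    $ mpirun --mca btl_sm_num_fifos 8 -np 16 ./a.out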