One possibility is that the sm btl might not like that you have hyperthreading 
enabled.

Another thing to check: do you have any paffinity settings turned on (e.g., 
mpi_paffinity_alone)? Our paffinity system doesn't handle hyperthreading at 
this time.

I'm just suspicious of the HT since you have a quad-core machine, and the limit 
where things work seems to be 4...

On May 4, 2010, at 3:44 PM, Gus Correa wrote:

> Hi Jeff
> 
> Sure, I will certainly try v1.4.2.
> I am downloading it right now.
> As of this morning, when I first downloaded,
> the web site still had 1.4.1.
> Maybe I should have refreshed the web page on my browser.
> 
> I will tell you how it goes.
> 
> Gus
> 
> Jeff Squyres wrote:
>> Gus -- Can you try v1.4.2 which was just released today?
>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>> Hi Ralph
>>> 
>>> Thank you very much.
>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>> at least for the little hello_c.c test.
>>> I just ran it fine up to 128 processes.
>>> 
>>> I confess I am puzzled by this workaround.
>>> * Why should we turn off "sm" in a standalone machine,
>>> where everything is supposed to operate via shared memory?
>>> * Do I incur a performance penalty by not using "sm"?
>>> * What other mechanism is actually used by OpenMPI for process
>>> communication in this case?
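>>> 
>>> (I suppose "ompi_info | grep btl" would list which BTL components this build
>>> has -- self, sm, tcp, and so on -- but that still doesn't tell me which one
>>> is actually picked when sm is excluded.)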
>>> 
>>> It seems to be using tcp, because when I try -np 256 I get this error:
>>> 
>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>>> of network connections a process can open was reached in file
>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>> --------------------------------------------------------------------------
>>> Error: system limit exceeded on number of network connections that can
>>> be open
>>> This can be resolved by setting the mca parameter
>>> opal_set_max_sys_limits to 1,
>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>> or asking the system administrator to increase the system limit.
>>> --------------------------------------------------------------------------
>>> 
>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>> processors on real jobs (and the error message itself suggests a
>>> workaround to allow a larger np, if needed).
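>>> 
>>> For the record, I take it the workaround the message suggests would look
>>> something like this (the exact descriptor limit presumably depends on np):
>>> 
>>> ulimit -n 4096
>>> mpirun --mca opal_set_max_sys_limits 1 -np 256 a.out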
>>> 
>>> Many thanks,
>>> Gus Correa
>>> 
>>> Ralph Castain wrote:
>>>> I would certainly try it with -mca btl ^sm and see if that solves the problem.
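>>>> 
>>>> For example, with the same hello_c binary you've been testing, something like:
>>>> 
>>>> mpirun -mca btl ^sm -np 16 a.out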
>>>> 
>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>> 
>>>>> Gus Correa wrote:
>>>>> 
>>>>>> Dear Open MPI experts
>>>>>> 
>>>>>> I need your help to get Open MPI right on a standalone
>>>>>> machine with Nehalem processors.
>>>>>> 
>>>>>> How to tweak the MCA parameters to avoid the problems on Nehalem
>>>>>> (and perhaps also on AMD processors) where MPI programs hang
>>>>>> was discussed here before.
>>>>>> 
>>>>>> However, I lost track of the details: how to work around the problem,
>>>>>> and whether it has already been fully fixed.
>>>>> Yes, perhaps the problem you're seeing is not what you remember being 
>>>>> discussed.
>>>>> 
>>>>> Perhaps you're thinking of https://svn.open-mpi.org/trac/ompi/ticket/2043 
>>>>> .  It's presumably fixed.
>>>>> 
>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>> 
>>>>>> I installed OpenMPI 1.4.1 from source,
>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>> Then I tried to run it with:
>>>>>> 
>>>>>> 1) mpirun -np 4 a.out
>>>>>> It ran OK (but seemed to be slow).
>>>>>> 
>>>>>> 2) mpirun -np 16 a.out
>>>>>> It hung, and brought the machine to a halt.
>>>>>> 
>>>>>> Any words of wisdom are appreciated.
>>>>>> 
>>>>>> More info:
>>>>>> 
>>>>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>> * OS is Fedora Core 12.
>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>> (I can see 16 "processors".)
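>>>>>> 
>>>>>> (For reference, something like this should show the logical vs. physical
>>>>>> counts; "cpu cores" reports the physical cores per socket:
>>>>>> 
>>>>>> grep -c ^processor /proc/cpuinfo
>>>>>> grep "cpu cores" /proc/cpuinfo | sort -u
>>>>>> )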
>>>>>> 
>>>>>> **
>>>>>> 
>>>>>> What should I do?
>>>>>> 
>>>>>> Use -mca btl ^sm  ?
>>>>>> Use -mca btl_sm_num_fifos=some_number ? (Which number?)
>>>>>> Use both?
>>>>>> Do something else?
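>>>>>> 
>>>>>> (For the btl_sm_num_fifos option I mean something along these lines,
>>>>>> with the number being just a guess on my part:
>>>>>> 
>>>>>> mpirun -mca btl_sm_num_fifos 8 -np 16 a.out
>>>>>> )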
> 

