On May 4, 2010, at 4:51 PM, Gus Correa wrote:

> Hi Ralph
> 
> Ralph Castain wrote:
>> One possibility is that the sm btl might not like that you have 
>> hyperthreading enabled.
> 
> I remember that hyperthreading was discussed months ago,
> in the previous incarnation of this problem/thread/discussion on "Nehalem vs. 
> Open MPI".
> (It sounds like one of those supreme court cases ... )
> 
> I don't really administer that machine,
> or any machine with hyperthreading,
> so I am not very familiar with the HT nitty-gritty.
> How do I turn off hyperthreading?
> Is it a BIOS or a Linux thing?
> I may try that.

I believe it can be turned off via an admin-level command, but I'm not certain 
about that.
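For what it's worth, on Linux you can at least check whether HT is active without rebooting. A minimal read-only sketch (this assumes the usual Linux /proc/cpuinfo layout; it only inspects, it changes nothing):

```shell
# Read-only check: compare hardware threads vs. physical cores per socket.
# If "siblings" exceeds "cpu cores", hyperthreading is enabled.
siblings=$(grep -m1 '^siblings' /proc/cpuinfo | awk '{print $3}')
cores=$(grep -m1 '^cpu cores' /proc/cpuinfo | awk -F': ' '{print $2}')
if [ -n "$siblings" ] && [ -n "$cores" ] && [ "$siblings" -gt "$cores" ]; then
    echo "hyperthreading appears to be ON ($siblings threads vs $cores cores per socket)"
elif [ -n "$siblings" ]; then
    echo "hyperthreading appears to be OFF"
else
    echo "no SMT topology info in /proc/cpuinfo"
fi
```

Actually disabling HT is cleanest from the BIOS setup screen; at runtime, root can also offline individual sibling threads with `echo 0 > /sys/devices/system/cpu/cpuN/online`, but that is per-boot and per-thread.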

> 
>> Another thing to check: do you have any paffinity settings turned on 
> (e.g., mpi_paffinity_alone)?
> 
> I didn't turn on or off any paffinity setting explicitly,
> either in the command line or in the mca config file.
> All that I did on the tests was to turn off "sm",
> or just use the default settings.
> I wonder if paffinity is on by default, is it?
> Should I turn it off?

It is off by default - I mention it because sometimes people have it set in the 
default MCA param file and don't realize it is on. Sounds okay here, though.

> 
>> Our paffinity system doesn't handle hyperthreading at this time.
> 
> OK, so *if* paffinity is on by default (Is it?),
> and hyperthreading is also on, as it is now,
> I must turn off one of them, maybe both, right?
> I may go combinatorial about this tomorrow.
> Can't do it today.
> Darn locked office door!

I would say don't worry about the paffinity right now - sounds like it is off. 
You can always check, though, by running "ompi_info --param opal all" and 
checking for the setting of the opal_paffinity_alone variable
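A quick way to script that check, guarded so it degrades gracefully on machines where Open MPI is not on the PATH (`opal_paffinity_alone` is the variable named above):

```shell
# List any paffinity-related MCA parameters and their current values.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --param opal all | grep -i paffinity || echo "no paffinity params listed"
else
    echo "ompi_info not on PATH; is Open MPI installed?"
fi
```

If it does turn out to be set in a default MCA param file (e.g. `$HOME/.openmpi/mca-params.conf` or the system-wide `openmpi-mca-params.conf`), you can override it for a single run with `mpirun --mca opal_paffinity_alone 0 ...`.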

> 
>> I'm just suspicious of the HT since you have a quad-core machine, 
> and the limit where things work seems to be 4...
> 
> It may be.
> If you tell me how to turn off HT (I'll google around for it meanwhile),
> I will do it tomorrow, if I get a chance to
> hard reboot that pesky machine now locked behind a door.

Yeah, I'm beginning to believe it is the HT that is causing the problem...
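As an aside, on the "system limit exceeded on number of network connections" error quoted below: that is usually the per-process open-file-descriptor limit, which the shell can raise before launching mpirun. A sketch, assuming bash/sh on Linux (raising the limit above the hard limit needs root):

```shell
# Show the current soft limit on open file descriptors.
ulimit -n
# Try to raise it for this shell and its children (e.g., mpirun);
# this fails if 4096 exceeds the hard limit, hence the fallback message.
ulimit -n 4096 2>/dev/null || echo "cannot raise above the hard limit without root"
ulimit -n
```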

> 
> Thanks again for your help.
> 
> Gus
> 
>> On May 4, 2010, at 3:44 PM, Gus Correa wrote:
>>> Hi Jeff
>>> 
>>> Sure, I will certainly try v1.4.2.
>>> I am downloading it right now.
>>> As of this morning, when I first downloaded,
>>> the web site still had 1.4.1.
>>> Maybe I should have refreshed the web page on my browser.
>>> 
>>> I will tell you how it goes.
>>> 
>>> Gus
>>> 
>>> Jeff Squyres wrote:
>>>> Gus -- Can you try v1.4.2 which was just released today?
>>>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>>>> Hi Ralph
>>>>> 
>>>>> Thank you very much.
>>>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>>>> at least for the little hello_c.c test.
>>>>> I just ran it fine up to 128 processes.
>>>>> 
>>>>> I confess I am puzzled by this workaround.
>>>>> * Why should we turn off "sm" in a standalone machine,
>>>>> where everything is supposed to operate via shared memory?
>>>>> * Do I incur a performance penalty by not using "sm"?
>>>>> * What other mechanism is actually used by OpenMPI for process
>>>>> communication in this case?
>>>>> 
>>>>> It seems to be using tcp, because when I try -np 256 I get this error:
>>>>> 
>>>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>>>>> of network connections a process can open was reached in file
>>>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>>>> --------------------------------------------------------------------------
>>>>> Error: system limit exceeded on number of network connections that can
>>>>> be open
>>>>> This can be resolved by setting the mca parameter
>>>>> opal_set_max_sys_limits to 1,
>>>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>>>> or asking the system administrator to increase the system limit.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>>>> processors on real jobs (and the error message itself suggests a
>>>>> workaround to increase np, if needed).
>>>>> 
>>>>> Many thanks,
>>>>> Gus Correa
>>>>> 
>>>>> Ralph Castain wrote:
>>>>>> I would certainly try it -mca btl ^sm and see if that solves the problem.
>>>>>> 
>>>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>>>> 
>>>>>>> Gus Correa wrote:
>>>>>>> 
>>>>>>>> Dear Open MPI experts
>>>>>>>> 
>>>>>>>> I need your help to get Open MPI right on a standalone
>>>>>>>> machine with Nehalem processors.
>>>>>>>> 
>>>>>>>> How to tweak the MCA parameters to avoid problems
>>>>>>>> on Nehalem (and perhaps AMD) processors,
>>>>>>>> where MPI programs hang, was discussed here before.
>>>>>>>> 
>>>>>>>> However, I lost track of the details: how to work around the problem,
>>>>>>>> and whether it has already been fully fixed.
>>>>>>> Yes, perhaps the problem you're seeing is not what you remember being 
>>>>>>> discussed.
>>>>>>> 
>>>>>>> Perhaps you're thinking of 
>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2043 .  It's presumably fixed.
>>>>>>> 
>>>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>>>> 
>>>>>>>> I installed OpenMPI 1.4.1 from source,
>>>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>>>> Then I tried to run it with:
>>>>>>>> 
>>>>>>>> 1) mpirun -np 4 a.out
>>>>>>>> It ran OK (but seemed to be slow).
>>>>>>>> 
>>>>>>>> 2) mpirun -np 16 a.out
>>>>>>>> It hung, and brought the machine to a halt.
>>>>>>>> 
>>>>>>>> Any words of wisdom are appreciated.
>>>>>>>> 
>>>>>>>> More info:
>>>>>>>> 
>>>>>>>> * OpenMPI 1.4.1 installed from source (tarball from your site).
>>>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>>>> * OS is Fedora Core 12.
>>>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad core)
>>>>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>>>> (I can see 16 "processors".)
>>>>>>>> 
>>>>>>>> **
>>>>>>>> 
>>>>>>>> What should I do?
>>>>>>>> 
>>>>>>>> Use -mca btl ^sm  ?
>>>>>>>> Use -mca btl_sm_num_fifos=some_number ? (Which number?)
>>>>>>>> Use Both?
>>>>>>>> Do something else?
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users