On May 4, 2010, at 4:51 PM, Gus Correa wrote:

> Hi Ralph
>
> Ralph Castain wrote:
>> One possibility is that the sm btl might not like that you have
>> hyperthreading enabled.
>
> I remember that hyperthreading was discussed months ago,
> in the previous incarnation of this problem/thread/discussion
> on "Nehalem vs. Open MPI".
> (It sounds like one of those Supreme Court cases ...)
>
> I don't really administer that machine,
> or any machine with hyperthreading,
> so I am not very familiar with the HT nitty-gritty.
> How do I turn off hyperthreading?
> Is it a BIOS or a Linux thing?
> I may try that.

I believe it can be turned off via an admin-level command, but I'm not
certain about it.
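If it helps, the usual options are the BIOS setup (typically a
"Hyper-Threading" or "Logical Processor" option) or, on Linux, taking the
sibling logical CPUs offline through sysfs, assuming the kernel has CPU
hotplug support. A rough sketch - the 8-15 range is just a guess for a
two-way quad-core box, so check the sibling lists for the real IDs:

    # see which logical CPUs are hardware-thread siblings of cpu0
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

    # as root, offline one sibling per core
    for n in $(seq 8 15); do
        echo 0 > /sys/devices/system/cpu/cpu$n/online
    done

The sysfs change does not survive a reboot, whereas the BIOS setting does.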
>
>> Another thing to check: do you have any paffinity settings turned on
>> (e.g., mpi_paffinity_alone)?
>
> I didn't turn any paffinity setting on or off explicitly,
> either on the command line or in the MCA config file.
> All that I did in the tests was to turn off "sm",
> or just use the default settings.
> I wonder if paffinity is on by default, is it?
> Should I turn it off?

It is off by default - I mention it because sometimes people have it set
in the default MCA param file and don't realize it is on. Sounds okay
here, though.

>
>> Our paffinity system doesn't handle hyperthreading at this time.
>
> OK, so *if* paffinity is on by default (is it?),
> and hyperthreading is also on, as it is now,
> I must turn off one of them, maybe both, right?
> I may go combinatorial about this tomorrow.
> Can't do it today.
> Darn locked office door!

I would say don't worry about paffinity right now - it sounds like it is
off. You can always check, though, by running "ompi_info --param opal all"
and looking at the setting of the opal_paffinity_alone variable.

>
>> I'm just suspicious of the HT since you have a quad-core machine,
>> and the limit where things work seems to be 4...
>
> It may be.
> If you tell me how to turn off HT (I'll google around for it meanwhile),
> I will do it tomorrow, if I get a chance to
> hard reboot that pesky machine now locked behind a door.

Yeah, I'm beginning to believe it is the HT that is causing the problem...

>
> Thanks again for your help.
>
> Gus
>
>> On May 4, 2010, at 3:44 PM, Gus Correa wrote:
>>> Hi Jeff
>>>
>>> Sure, I will certainly try v1.4.2.
>>> I am downloading it right now.
>>> As of this morning, when I first downloaded,
>>> the web site still had 1.4.1.
>>> Maybe I should have refreshed the web page in my browser.
>>>
>>> I will tell you how it goes.
>>>
>>> Gus
>>>
>>> Jeff Squyres wrote:
>>>> Gus -- Can you try v1.4.2, which was just released today?
>>>>
>>>> On May 4, 2010, at 4:18 PM, Gus Correa wrote:
>>>>> Hi Ralph
>>>>>
>>>>> Thank you very much.
>>>>> The "-mca btl ^sm" workaround seems to have solved the problem,
>>>>> at least for the little hello_c.c test.
>>>>> I just ran it fine with up to 128 processes.
>>>>>
>>>>> I confess I am puzzled by this workaround.
>>>>> * Why should we turn off "sm" on a standalone machine,
>>>>> where everything is supposed to operate via shared memory?
>>>>> * Do I incur a performance penalty by not using "sm"?
>>>>> * What other mechanism is actually used by Open MPI for process
>>>>> communication in this case?
>>>>>
>>>>> It seems to be using tcp, because when I try -np 256 I get this error:
>>>>>
>>>>> [spinoza:02715] [[11518,0],0] ORTE_ERROR_LOG: The system limit on number
>>>>> of network connections a process can open was reached in file
>>>>> ../../../../../orte/mca/oob/tcp/oob_tcp.c at line 447
>>>>> --------------------------------------------------------------------------
>>>>> Error: system limit exceeded on number of network connections that can
>>>>> be open
>>>>> This can be resolved by setting the mca parameter
>>>>> opal_set_max_sys_limits to 1,
>>>>> increasing your limit descriptor setting (using limit or ulimit commands),
>>>>> or asking the system administrator to increase the system limit.
>>>>> --------------------------------------------------------------------------
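(As an aside: the limit the message refers to is the per-process open file
descriptor limit. If the larger np is ever really needed, the two
workarounds the message itself suggests would look roughly like this -
untested here, and the value shown is illustrative:

    # raise the per-process descriptor limit for this shell (bash/sh)
    ulimit -n 4096

    # or let Open MPI try to raise the limits itself
    mpirun -mca opal_set_max_sys_limits 1 -np 256 a.out

The value needed scales with the process count.)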
>>>>>
>>>>> Anyway, no big deal, because we don't intend to oversubscribe the
>>>>> processors on real jobs anyway (and the very error message suggests
>>>>> a workaround to increase np, if needed).
>>>>>
>>>>> Many thanks,
>>>>> Gus Correa
>>>>>
>>>>> Ralph Castain wrote:
>>>>>> I would certainly try -mca btl ^sm and see if that solves the problem.
>>>>>>
>>>>>> On May 4, 2010, at 2:38 PM, Eugene Loh wrote:
>>>>>>
>>>>>>> Gus Correa wrote:
>>>>>>>
>>>>>>>> Dear Open MPI experts,
>>>>>>>>
>>>>>>>> I need your help to get Open MPI right on a standalone
>>>>>>>> machine with Nehalem processors.
>>>>>>>>
>>>>>>>> How to tweak the MCA parameters to avoid problems
>>>>>>>> on Nehalem (and perhaps AMD processors as well),
>>>>>>>> where MPI programs hang, was discussed here before.
>>>>>>>>
>>>>>>>> However, I lost track of the details: how to work around the
>>>>>>>> problem, and whether it has perhaps been fully fixed already.
>>>>>>>
>>>>>>> Yes, perhaps the problem you're seeing is not what you remember
>>>>>>> being discussed.
>>>>>>>
>>>>>>> Perhaps you're thinking of
>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2043 . It's presumably
>>>>>>> fixed.
>>>>>>>
>>>>>>>> I am now facing the problem directly on a single Nehalem box.
>>>>>>>>
>>>>>>>> I installed Open MPI 1.4.1 from source,
>>>>>>>> and compiled the test hello_c.c with mpicc.
>>>>>>>> Then I tried to run it with:
>>>>>>>>
>>>>>>>> 1) mpirun -np 4 a.out
>>>>>>>> It ran OK (but seemed to be slow).
>>>>>>>>
>>>>>>>> 2) mpirun -np 16 a.out
>>>>>>>> It hung, and brought the machine to a halt.
>>>>>>>>
>>>>>>>> Any words of wisdom are appreciated.
>>>>>>>>
>>>>>>>> More info:
>>>>>>>>
>>>>>>>> * Open MPI 1.4.1 installed from source (tarball from your site).
>>>>>>>> * Compilers are gcc/g++/gfortran 4.4.3-4.
>>>>>>>> * OS is Fedora Core 12.
>>>>>>>> * The machine is a Dell box with Intel Xeon 5540 (quad-core)
>>>>>>>> processors on a two-way motherboard and 48GB of RAM.
>>>>>>>> * /proc/cpuinfo indicates that hyperthreading is turned on.
>>>>>>>> (I can see 16 "processors".)
>>>>>>>>
>>>>>>>> **
>>>>>>>>
>>>>>>>> What should I do?
>>>>>>>>
>>>>>>>> Use -mca btl ^sm ?
>>>>>>>> Use -mca btl_sm_num_fifos=some_number ? (Which number?)
>>>>>>>> Use both?
>>>>>>>> Do something else?
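(For what it's worth, the two sm-related knobs discussed in this thread
would be used roughly like this - the fifo count shown is illustrative,
not a recommendation:

    # bypass the shared-memory BTL entirely
    mpirun -mca btl ^sm -np 16 a.out

    # or keep sm but give it more FIFOs
    mpirun -mca btl_sm_num_fifos 16 -np 16 a.out
)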