I recompiled everything from scratch with GCC 4.4.5 and 4.7, using the OMPI 1.4.5 tarball.

I did some tests and it does not seem that I can make it work. I tried these: btl_sm_num_fifos 4, btl_sm_free_list_num 1000, btl_sm_free_list_max 1000000, mpool_sm_min_size 1500000000, and mpool_sm_max_size 7500000000, but nothing helped. I started out by varying one parameter at a time from its default up to 1000000 (except the fifo count, which I only varied up to 100, and sm_min and sm_max, which I varied from 67 MB [default was set to 67xxxxxx] to 7.5 GB) to see what reactions I could get. When running with -np 10 everything worked, but as soon as I went to -np 11 it crashed with the same old error.
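For reference, MCA parameters like these can be given directly on the mpirun command line with --mca (or placed in $HOME/.openmpi/mca-params.conf). The invocation below is only an illustration of the kind of run described above - the hostfile and GROMACS binary paths are placeholders, not a verbatim copy of any actual command:

    # sweep of shared-memory (sm) BTL/mpool parameters, values as listed above
    mpirun --mca btl_sm_num_fifos 4 \
           --mca btl_sm_free_list_num 1000 \
           --mca btl_sm_free_list_max 1000000 \
           --mca mpool_sm_min_size 1500000000 \
           --mca mpool_sm_max_size 7500000000 \
           -np 11 --hostfile path/hostfile path/mdrun_mpi -s path/topol.tpr -o path/output.trr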
On Fri, Apr 13, 2012 at 6:41 PM, Ralph Castain <r...@open-mpi.org> wrote:

> On Apr 13, 2012, at 10:36 AM, Seyyed Mohtadin Hashemi wrote:
>
> That fixed the issue but has brought a big question mark on why this happened.
>
> I'm pretty sure it's not a system memory issue; the node with the least RAM has 8 GB, which I would think is more than enough.
>
> Do you think that adjusting btl_sm_eager_limit, mpool_sm_min_size, and mpool_sm_max_size can help fix the problem? (Found this at http://www.open-mpi.org/faq/?category=sm ) Because compared to -np 10, the performance of -np 18 is worse when running with the cmd you suggested. I'll try playing around with the parameters and see what works.
>
> Yes, performance will definitely be worse - I was just trying to isolate the problem. I would play a little with those sizes and see what you can do. Our shared memory person is pretty much unavailable for the next two weeks, but the rest of us will at least try to get you working.
>
> We typically do run with more than 10 ppn, so I know the base sm code works at that scale. However, those nodes usually have 32 GBytes of RAM, and the default sm params are scaled accordingly.
>
> On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Afraid I have no idea how those packages were built, what release they correspond to, etc. I would suggest sticking with the tarballs.
>>
>> Your output indicates a problem with shared memory when you completely fill the machine. Could be a couple of things, like running out of memory - but for now, try adding -mca btl ^sm to your cmd line. Should work.
>>
>> On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:
>>
>> Hi,
>>
>> Sorry that it took so long to answer; I didn't get any return mails and had to check the digest for replies.
>>
>> Anyway, when I compiled from scratch I did use the tarballs from open-mpi.org. GROMACS is not the problem (or at least I don't think so); I just used it as a check to see if I could run parallel jobs - I am now using the OSU benchmarks because I can't be sure that the problem is not with GROMACS.
>>
>> On the new installation I have not installed (nor compiled) OMPI from the official tarballs, but rather installed the "openmpi-bin, openmpi-common, libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev" packages using apt-get.
>>
>> As for the simple examples (i.e. ring_c, hello_c, and connectivity_c, extracted from the 1.4.2 official tarball), I get the exact same behavior as with GROMACS/OSU bench.
>>
>>> I suspect you'll have to ask someone familiar with GROMACS about that specific package. As for testing OMPI, can you run the codes in the examples directory - e.g., "hello" and "ring"? I assume you are downloading and installing OMPI from our tarballs?
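The "hello" and "ring" codes referred to here are the programs shipped in the examples/ directory of the Open MPI source tarball (hello_c.c, ring_c.c, connectivity_c.c). Below is a sketch of how they would typically be built and run, assuming the 1.4.5 tarball is unpacked as openmpi-1.4.5 and using the "-mca btl ^sm" workaround quoted above; the hostfile path and process count are placeholders:

    cd openmpi-1.4.5/examples
    mpicc hello_c.c -o hello_c      # mpicc is the wrapper compiler from the OMPI installation
    mpicc ring_c.c -o ring_c

    # run across the full 18 slots with the shared-memory BTL disabled
    mpirun --mca btl ^sm -np 18 --hostfile path/hostfile ./hello_c
    mpirun --mca btl ^sm -np 18 --hostfile path/hostfile ./ring_c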
>>> On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:
>>>
>>> > Hello,
>>> >
>>> > I have a very peculiar problem: I have a micro cluster with three nodes (18 cores total); the nodes are clones of each other, connected to a frontend via Ethernet, with Debian Squeeze as the OS on all nodes. When I run parallel jobs I can use up to "-np 10"; if I go further, the job crashes. I have primarily done tests with GROMACS (because that is what I will be running) but have also used OSU Micro-Benchmarks 3.5.2.
>>> >
>>> > For a simple parallel job I use: "path/mpirun --hostfile path/hostfile -np XX -d --display-map path/mdrun_mpi -s path/topol.tpr -o path/output.trr"
>>> >
>>> > (path is global.) For -np XX smaller than or equal to 10 it works; however, as soon as I use 11 or larger, the whole thing crashes. The terminal dump is attached to this mail: when_working.txt is for "-np 10", when_crash.txt is for "-np 12", and OpenMPI_info.txt is the output from "path/mpirun --bynode --hostfile path/hostfile --tag-output ompi_info -v ompi full --parsable".
>>> >
>>> > I have tried OpenMPI v1.4.2 all the way up to beta v1.5.5, and all yield the same result.
>>> >
>>> > The output files are from a new install I did today: I formatted all nodes, started from a fresh minimal install of Squeeze, used "apt-get install gromacs gromacs-openmpi", and installed all dependencies. Then I ran two jobs using the parameters described above; I also did one with the OSU bench (data is not included) - it also crashed with "-np" larger than 10.
>>> >
>>> > I hope somebody can help figure out what is wrong and how I can fix it.
>>> >
>>> > Best regards,
>>> > Mohtadin
>>> >
>>> > <Archive.zip>

> --
> Kindest regards/I am, yours most sincerely
> Seyyed Mohtadin Hashemi

--
Kindest regards/I am, yours most sincerely
Seyyed Mohtadin Hashemi