I thought we had code in the 1.5 series that would "bark" if the tmp dir was on a network mount? Is that not true?
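A quick way to confirm whether the session directory really is on a network mount (a rough sketch, not output from this thread; /tmp is simply the default location that orte_tmpdir_base points at, and path/hostfile follows the convention used later in the thread):

    # Ask every node in the hostfile what filesystem /tmp lives on;
    # a type of "nfs" or "nfs4" means the session directory is on a shared mount.
    mpirun --bynode --hostfile path/hostfile --tag-output df -T /tmp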
On Apr 24, 2012, at 3:20 PM, Gutierrez, Samuel K wrote:

> Hi,
>
> I just wanted to record the behind-the-scenes resolution to this particular issue. For more info, take a look at:
> https://svn.open-mpi.org/trac/ompi/ticket/3076
>
> It seems as if the problem stems from /tmp being mounted as an NFS space that is shared between the compute nodes.
>
> This problem can be resolved in a variety of ways. Below are a few avenues that can help get around the "globally mounted /tmp space" issue, but others are welcome to add to the list.
>
> o Change the place where ORTE stores its session information:
>     -mca orte_tmpdir_base /path/to/some/local/store
>   For example:
>     -mca orte_tmpdir_base /dev/shm
>
> **Note: the following options are only available in Open MPI v1.5.5+**
>
> o Change where shmem mmap places its files:
>     -mca shmem_mmap_relocate_backing_file -1 -mca shmem_mmap_backing_file_base_dir /dev/shm
>
> o Change the backing facility used by the sm mpool and sm BTL to posix or sysv:
>     -mca shmem posix
>     -mca shmem sysv
>
> Sam
>
> On Apr 24, 2012, at 12:34 PM, Seyyed Mohtadin Hashemi wrote:
>
>> Hi,
>>
>> I ran those commands and have posted the outputs on:
>> https://svn.open-mpi.org/trac/ompi/ticket/3076
>>
>> -mca shmem posix worked for all -np (even when oversubscribing); however, sysv did not work for any -np.
>>
>> On Tue, Apr 24, 2012 at 5:36 PM, Gutierrez, Samuel K <sam...@lanl.gov> wrote:
>> Hi,
>>
>> Just out of curiosity, what happens when you add
>>
>>     -mca shmem posix
>>
>> to your mpirun command line using 1.5.5?
>>
>> Can you also please try:
>>
>>     -mca shmem sysv
>>
>> I'm shooting in the dark here, but I want to make sure that the failure isn't due to a small backing store.
>>
>> Thanks,
>>
>> Sam
>>
>> On Apr 16, 2012, at 8:57 AM, Gutierrez, Samuel K wrote:
>>
>>> Hi,
>>>
>>> Sorry about the lag. I'll take a closer look at this ASAP.
>>>
>>> Appreciate your patience,
>>>
>>> Sam
>>>
>>> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>> Sent: Monday, April 16, 2012 8:52 AM
>>> To: Seyyed Mohtadin Hashemi
>>> Cc: us...@open-mpi.org
>>> Subject: Re: [OMPI users] OpenMPI fails to run with -np larger than 10
>>>
>>> No earthly idea. As I said, I'm afraid Sam is pretty much unavailable for the next two weeks, so we probably don't have much hope of fixing it.
>>>
>>> I see in your original note that you tried the 1.5.5 beta rc and got the same results, so I assume this must be something in your system config that is causing the issue. I'll file a bug for him (pointing to this thread) so this doesn't get lost, but I would suggest you run ^sm for now unless someone else has other suggestions.
>>>
>>> On Apr 16, 2012, at 2:57 AM, Seyyed Mohtadin Hashemi wrote:
>>>
>>>> I recompiled everything from scratch with GCC 4.4.5 and 4.7 using the OMPI 1.4.5 tarball.
>>>>
>>>> I did some tests and it does not seem that I can make it work. I tried these:
>>>>
>>>>     btl_sm_num_fifos 4
>>>>     btl_sm_free_list_num 1000
>>>>     btl_sm_free_list_max 1000000
>>>>     mpool_sm_min_size 1500000000
>>>>     mpool_sm_max_size 7500000000
>>>>
>>>> but nothing helped. I started out by varying one parameter at a time from the default up to 1000000 (except the fifo count, which I only varied up to 100, and sm_min and sm_max, which I varied from 67 MB [the default was set to 67xxxxxx] to 7.5 GB) to see what reactions I could get. When running with -np 10 everything worked, but as soon as I went to -np 11 it crashed with the same old error.
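For reference, parameters like the ones listed above can be set either on the mpirun command line or in an MCA parameter file. A minimal sketch, reusing the illustrative values and the command from this thread (they are not recommendations):

    # On the command line:
    mpirun -mca btl_sm_num_fifos 4 \
           -mca mpool_sm_min_size 1500000000 \
           -mca mpool_sm_max_size 7500000000 \
           -hostfile path/hostfile -np 11 path/mdrun_mpi -s path/topol.tpr -o path/output.trr

    # Or persistently, one "name = value" per line in $HOME/.openmpi/mca-params.conf:
    btl_sm_num_fifos = 4
    mpool_sm_min_size = 1500000000
    mpool_sm_max_size = 7500000000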
>>>>
>>>> On Fri, Apr 13, 2012 at 6:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> On Apr 13, 2012, at 10:36 AM, Seyyed Mohtadin Hashemi wrote:
>>>>
>>>>> That fixed the issue, but it has raised a big question mark over why this happened.
>>>>>
>>>>> I'm pretty sure it's not a system memory issue; the node with the least RAM has 8 GB, which I would think is more than enough.
>>>>>
>>>>> Do you think that adjusting btl_sm_eager_limit, mpool_sm_min_size, and mpool_sm_max_size can help fix the problem? (Found this at http://www.open-mpi.org/faq/?category=sm ) Because compared to -np 10, the performance of -np 18 is worse when running with the command you suggested. I'll try playing around with the parameters and see what works.
>>>>
>>>> Yes, performance will definitely be worse - I was just trying to isolate the problem. I would play a little with those sizes and see what you can do. Our shared memory person is pretty much unavailable for the next two weeks, but the rest of us will at least try to get you working.
>>>>
>>>> We typically do run with more than 10 ppn, so I know the base sm code works at that scale. However, those nodes usually have 32 GB of RAM, and the default sm params are scaled accordingly.
>>>>
>>>>> On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Afraid I have no idea how those packages were built, what release they correspond to, etc. I would suggest sticking with the tarballs.
>>>>>
>>>>> Your output indicates a problem with shared memory when you completely fill the machine. It could be a couple of things, like running out of memory - but for now, try adding -mca btl ^sm to your command line. That should work.
>>>>>
>>>>> On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry that it took so long to answer; I didn't get any return mails and had to check the digest for the reply.
>>>>>>
>>>>>> Anyway, when I compiled from scratch I did use the tarballs from open-mpi.org. GROMACS is not the problem (or at least I don't think so); I just used it as a check to see if I could run parallel jobs - I am now using the OSU benchmarks because I can't be sure that the problem is not with GROMACS.
>>>>>>
>>>>>> On the new installation I have not installed (nor compiled) OMPI from the official tarballs, but rather installed the "openmpi-bin, openmpi-common, libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev" packages using apt-get.
>>>>>>
>>>>>> As for the simple examples (i.e., ring_c, hello_c, and connectivity_c extracted from the 1.4.2 official tarball), I get the exact same behavior as with GROMACS/OSU bench.
>>>>>>
>>>>>> I suspect you'll have to ask someone familiar with GROMACS about that specific package. As for testing OMPI, can you run the codes in the examples directory - e.g., "hello" and "ring"? I assume you are downloading and installing OMPI from our tarballs?
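The example programs referred to above can be exercised straight from the extracted tarball; a rough sketch, assuming Open MPI is already installed with mpicc and mpirun on the PATH, and using the 1.4.2 tarball and hostfile path mentioned in this thread:

    # Build and run the bundled test programs from the tarball's examples directory
    cd openmpi-1.4.2/examples
    make
    mpirun -hostfile path/hostfile -np 12 ./hello_c
    mpirun -hostfile path/hostfile -np 12 ./ring_c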
>>>>>>
>>>>>> On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:
>>>>>>
>>>>>> > Hello,
>>>>>> >
>>>>>> > I have a very peculiar problem: I have a micro cluster with three nodes (18 cores total); the nodes are clones of each other and are connected to a frontend via Ethernet, with Debian Squeeze as the OS on all nodes. When I run parallel jobs I can use up to "-np 10"; if I go further the job crashes. I have primarily done tests with GROMACS (because that is what I will be running) but have also used OSU Micro-Benchmarks 3.5.2.
>>>>>> >
>>>>>> > For a simple parallel job I use: "path/mpirun -hostfile path/hostfile -np XX -d -display-map path/mdrun_mpi -s path/topol.tpr -o path/output.trr"
>>>>>> >
>>>>>> > (path is global) For -np XX smaller than or equal to 10 it works; however, as soon as I use 11 or larger the whole thing crashes. The terminal dump is attached to this mail: when_working.txt is for "-np 10", when_crash.txt is for "-np 12", and OpenMPI_info.txt is the output from "path/mpirun --bynode --hostfile path/hostfile --tag-output ompi_info -v ompi full --parsable".
>>>>>> >
>>>>>> > I have tried OpenMPI v1.4.2 all the way up to beta v1.5.5, and all yield the same result.
>>>>>> >
>>>>>> > The output files are from a new install I did today: I formatted all nodes, started from a fresh minimal install of Squeeze, used "apt-get install gromacs gromacs-openmpi", and installed all dependencies. Then I ran two jobs using the parameters described above; I also did one with the OSU bench (data is not included) and it also crashed with "-np" larger than 10.
>>>>>> >
>>>>>> > I hope somebody can help figure out what is wrong and how I can fix it.
>>>>>> >
>>>>>> > Best regards,
>>>>>> > Mohtadin
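Tying the resolution at the top of the thread back to this original command, the failing runs could be retried with the session directory moved off the NFS-shared /tmp, or (with Open MPI 1.5.5 and later) with a different shmem backing. A sketch only, reusing the paths from the original report:

    # Keep ORTE's session directory on node-local storage instead of the NFS /tmp:
    mpirun -mca orte_tmpdir_base /dev/shm -hostfile path/hostfile -np 12 \
        path/mdrun_mpi -s path/topol.tpr -o path/output.trr

    # Or, with Open MPI 1.5.5+, switch the shared-memory backing to POSIX:
    mpirun -mca shmem posix -hostfile path/hostfile -np 12 \
        path/mdrun_mpi -s path/topol.tpr -o path/output.trr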