I thought we had code in the 1.5 series that would "bark" if the tmp dir was on a network mount? Is that not true?
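A quick way to confirm whether the session directory really is on a network mount (a rough sketch, not output from this thread; /tmp is simply the default location that orte_tmpdir_base points at, and path/hostfile follows the convention used later in the thread):

    # Ask every node in the hostfile what filesystem /tmp lives on;
    # a type of "nfs" or "nfs4" means the session directory is on a shared mount.
    mpirun --bynode --hostfile path/hostfile --tag-output df -T /tmp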
On Apr 24, 2012, at 3:20 PM, Gutierrez, Samuel K wrote:

> Hi,
>
> I just wanted to record the behind-the-scenes resolution to this particular issue. For more info, take a look at:
> https://svn.open-mpi.org/trac/ompi/ticket/3076
>
> It seems as if the problem stems from /tmp being mounted as an NFS space that is shared between the compute nodes.
>
> This problem can be resolved in a variety of ways. Below are a few avenues that can help get around the "globally mounted /tmp space" issue, but others are welcome to add to the list.
>
> o Change the place where ORTE stores its session information:
>     -mca orte_tmpdir_base /path/to/some/local/store
>   For example:
>     -mca orte_tmpdir_base /dev/shm
>
> **Note: the following options are only available in Open MPI v1.5.5+**
>
> o Change where shmem mmap places its files:
>     -mca shmem_mmap_relocate_backing_file -1 -mca shmem_mmap_backing_file_base_dir /dev/shm
>
> o Change the backing facility used by the sm mpool and sm BTL to posix or sysv:
>     -mca shmem posix
>     -mca shmem sysv
>
> Sam
>
> On Apr 24, 2012, at 12:34 PM, Seyyed Mohtadin Hashemi wrote:
>
>> Hi,
>>
>> I ran those commands and have posted the outputs on:
>> https://svn.open-mpi.org/trac/ompi/ticket/3076
>>
>> -mca shmem posix worked for all -np (even when oversubscribing); however, sysv did not work for any -np.
>>
>> On Tue, Apr 24, 2012 at 5:36 PM, Gutierrez, Samuel K <sam...@lanl.gov> wrote:
>> Hi,
>>
>> Just out of curiosity, what happens when you add
>>
>>     -mca shmem posix
>>
>> to your mpirun command line using 1.5.5?
>>
>> Can you also please try:
>>
>>     -mca shmem sysv
>>
>> I'm shooting in the dark here, but I want to make sure that the failure isn't due to a small backing store.
>>
>> Thanks,
>>
>> Sam
>>
>> On Apr 16, 2012, at 8:57 AM, Gutierrez, Samuel K wrote:
>>
>>> Hi,
>>>
>>> Sorry about the lag. I'll take a closer look at this ASAP.
>>>
>>> Appreciate your patience,
>>>
>>> Sam
>>>
>>> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>> Sent: Monday, April 16, 2012 8:52 AM
>>> To: Seyyed Mohtadin Hashemi
>>> Cc: us...@open-mpi.org
>>> Subject: Re: [OMPI users] OpenMPI fails to run with -np larger than 10
>>>
>>> No earthly idea. As I said, I'm afraid Sam is pretty much unavailable for the next two weeks, so we probably don't have much hope of fixing it.
>>>
>>> I see in your original note that you tried the 1.5.5 beta rc and got the same results, so I assume this must be something in your system config that is causing the issue. I'll file a bug for him (pointing to this thread) so this doesn't get lost, but I would suggest you run ^sm for now unless someone else has other suggestions.
>>>
>>> On Apr 16, 2012, at 2:57 AM, Seyyed Mohtadin Hashemi wrote:
>>>
>>>> I recompiled everything from scratch with GCC 4.4.5 and 4.7 using the OMPI 1.4.5 tarball.
>>>>
>>>> I did some tests and it does not seem that I can make it work. I tried these:
>>>>
>>>>     btl_sm_num_fifos 4
>>>>     btl_sm_free_list_num 1000
>>>>     btl_sm_free_list_max 1000000
>>>>     mpool_sm_min_size 1500000000
>>>>     mpool_sm_max_size 7500000000
>>>>
>>>> but nothing helped. I started out by varying one parameter at a time from the default up to 1000000 (except the fifo count, which I only varied up to 100, and sm_min and sm_max, which I varied from 67 MB [the default was set to 67xxxxxx] to 7.5 GB) to see what reactions I could get. When running with -np 10 everything worked, but as soon as I went to -np 11 it crashed with the same old error.
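For reference, parameters like the ones listed above can be set either on the mpirun command line or in an MCA parameter file. A minimal sketch, reusing the illustrative values and the command from this thread (they are not recommendations):

    # On the command line:
    mpirun -mca btl_sm_num_fifos 4 \
           -mca mpool_sm_min_size 1500000000 \
           -mca mpool_sm_max_size 7500000000 \
           -hostfile path/hostfile -np 11 path/mdrun_mpi -s path/topol.tpr -o path/output.trr

    # Or persistently, one "name = value" per line in $HOME/.openmpi/mca-params.conf:
    btl_sm_num_fifos = 4
    mpool_sm_min_size = 1500000000
    mpool_sm_max_size = 7500000000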
>>>>
>>>> On Fri, Apr 13, 2012 at 6:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> On Apr 13, 2012, at 10:36 AM, Seyyed Mohtadin Hashemi wrote:
>>>>
>>>>> That fixed the issue, but it has raised a big question mark over why this happened.
>>>>>
>>>>> I'm pretty sure it's not a system memory issue; the node with the least RAM has 8 GB, which I would think is more than enough.
>>>>>
>>>>> Do you think that adjusting btl_sm_eager_limit, mpool_sm_min_size, and mpool_sm_max_size can help fix the problem? (Found this at http://www.open-mpi.org/faq/?category=sm ) Because compared to -np 10, the performance of -np 18 is worse when running with the command you suggested. I'll try playing around with the parameters and see what works.
>>>>
>>>> Yes, performance will definitely be worse - I was just trying to isolate the problem. I would play a little with those sizes and see what you can do. Our shared memory person is pretty much unavailable for the next two weeks, but the rest of us will at least try to get you working.
>>>>
>>>> We typically do run with more than 10 ppn, so I know the base sm code works at that scale. However, those nodes usually have 32 GB of RAM, and the default sm params are scaled accordingly.
>>>>
>>>>> On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Afraid I have no idea how those packages were built, what release they correspond to, etc. I would suggest sticking with the tarballs.
>>>>>
>>>>> Your output indicates a problem with shared memory when you completely fill the machine. It could be a couple of things, like running out of memory - but for now, try adding -mca btl ^sm to your command line. That should work.
>>>>>
>>>>> On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry that it took so long to answer; I didn't get any return mails and had to check the digest for the reply.
>>>>>>
>>>>>> Anyway, when I compiled from scratch I did use the tarballs from open-mpi.org. GROMACS is not the problem (or at least I don't think so); I just used it as a check to see if I could run parallel jobs - I am now using the OSU benchmarks because I can't be sure that the problem is not with GROMACS.
>>>>>>
>>>>>> On the new installation I have not installed (nor compiled) OMPI from the official tarballs, but rather installed the "openmpi-bin, openmpi-common, libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev" packages using apt-get.
>>>>>>
>>>>>> As for the simple examples (i.e., ring_c, hello_c, and connectivity_c extracted from the 1.4.2 official tarball), I get the exact same behavior as with GROMACS/OSU bench.
>>>>>>
>>>>>> I suspect you'll have to ask someone familiar with GROMACS about that specific package. As for testing OMPI, can you run the codes in the examples directory - e.g., "hello" and "ring"? I assume you are downloading and installing OMPI from our tarballs?
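The example programs referred to above can be exercised straight from the extracted tarball; a rough sketch, assuming Open MPI is already installed with mpicc and mpirun on the PATH, and using the 1.4.2 tarball and hostfile path mentioned in this thread:

    # Build and run the bundled test programs from the tarball's examples directory
    cd openmpi-1.4.2/examples
    make
    mpirun -hostfile path/hostfile -np 12 ./hello_c
    mpirun -hostfile path/hostfile -np 12 ./ring_c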
>>>>>>
>>>>>> On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:
>>>>>>
>>>>>> > Hello,
>>>>>> >
>>>>>> > I have a very peculiar problem: I have a micro cluster with three nodes (18 cores total); the nodes are clones of each other and are connected to a frontend via Ethernet, with Debian Squeeze as the OS on all nodes. When I run parallel jobs I can use up to "-np 10"; if I go further the job crashes. I have primarily done tests with GROMACS (because that is what I will be running) but have also used OSU Micro-Benchmarks 3.5.2.
>>>>>> >
>>>>>> > For a simple parallel job I use: "path/mpirun -hostfile path/hostfile -np XX -d -display-map path/mdrun_mpi -s path/topol.tpr -o path/output.trr"
>>>>>> >
>>>>>> > (path is global) For -np XX smaller than or equal to 10 it works; however, as soon as I use 11 or larger the whole thing crashes. The terminal dump is attached to this mail: when_working.txt is for "-np 10", when_crash.txt is for "-np 12", and OpenMPI_info.txt is the output from "path/mpirun --bynode --hostfile path/hostfile --tag-output ompi_info -v ompi full --parsable".
>>>>>> >
>>>>>> > I have tried OpenMPI v1.4.2 all the way up to beta v1.5.5, and all yield the same result.
>>>>>> >
>>>>>> > The output files are from a new install I did today: I formatted all nodes, started from a fresh minimal install of Squeeze, used "apt-get install gromacs gromacs-openmpi", and installed all dependencies. Then I ran two jobs using the parameters described above; I also did one with the OSU bench (data is not included) and it also crashed with "-np" larger than 10.
>>>>>> >
>>>>>> > I hope somebody can help figure out what is wrong and how I can fix it.
>>>>>> >
>>>>>> > Best regards,
>>>>>> > Mohtadin
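Tying the resolution at the top of the thread back to this original command, the failing runs could be retried with the session directory moved off the NFS-shared /tmp, or (with Open MPI 1.5.5 and later) with a different shmem backing. A sketch only, reusing the paths from the original report:

    # Keep ORTE's session directory on node-local storage instead of the NFS /tmp:
    mpirun -mca orte_tmpdir_base /dev/shm -hostfile path/hostfile -np 12 \
        path/mdrun_mpi -s path/topol.tpr -o path/output.trr

    # Or, with Open MPI 1.5.5+, switch the shared-memory backing to POSIX:
    mpirun -mca shmem posix -hostfile path/hostfile -np 12 \
        path/mdrun_mpi -s path/topol.tpr -o path/output.trr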