Jeff,

Should we check the file size ulimit in the vader/sm BTLs and disable them
with a warning if the value is too low?
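
As a rough illustration of the kind of check being proposed, here is a minimal sketch in Python. It is not Open MPI code: the function names are hypothetical, and the segment size is simply the value passed to ftruncate() in the strace output quoted below. It assumes bash's `ulimit -f` counts 1024-byte blocks, while getrlimit() reports bytes.

```python
import resource

BLOCK_SIZE = 1024  # bash reports `ulimit -f` in 1024-byte blocks


def fits_file_size_limit(segment_bytes, limit_bytes):
    """Return True if a file of segment_bytes is allowed under limit_bytes."""
    if limit_bytes == resource.RLIM_INFINITY:
        return True
    return segment_bytes <= limit_bytes


def current_file_size_limit():
    """Soft RLIMIT_FSIZE of this process, in bytes (or RLIM_INFINITY)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return soft


if __name__ == "__main__":
    # Size taken from the reported ftruncate() call; a real check would use
    # whatever size the transport is about to request.
    segment = 134217736
    if not fits_file_size_limit(segment, current_file_size_limit()):
        print("warning: file size limit is too low for the shared memory "
              "segment; the shared memory transport would be disabled")
```

With the numbers from this thread, a limit of 131072 blocks is 134217728 bytes, which is 8 bytes short of the requested segment, while 131073 blocks is enough. That matches the reported behavior.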

Cheers,

Gilles

On Friday, November 20, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> For what it's worth, that's Open MPI creating a chunk of shared memory for
> use with on-server communication. It shows up as a "file", but it's really
> shared memory.
>
> You can disable sm and/or vader, but your on-server message passing
> performance will be significantly lower.
>
> Is there a reason you have a file size limit?
>
> Sent from my phone. No type good.
>
> On Nov 19, 2015, at 4:32 PM, Saurabh T <saur...@hotmail.com> wrote:
>
> I apologize, I had the wrong lines from strace for the initial file there
> (of course). The file with fd = 11 which causes the problem is called
> shared_mem_pool.[host] and ftruncate(11, 134217736) is called on it. (This
> is just 8 bytes more than 1024 times the ulimit of 131072, i.e. 134217728,
> which makes sense as the ulimit is in 1K blocks.)
>
>
> ------------------------------
> From: saur...@hotmail.com
> To: us...@open-mpi.org
> Subject: RE: Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
> Date: Thu, 19 Nov 2015 17:08:22 -0500
>
>
> > Could you please provide a little more info regarding the environment you
> > are running under (which resource mgr or not, etc), how many nodes you had
> > in the allocation, etc?
>
> > There is no reason why something should behave that way. So it would help
> > if we could understand the setup.
> > Ralph
>
> To answer Ralph's question above from the other thread, all nodes are on
> the same machine orterun was run on. It's a Red Hat 7 64-bit gcc 4.8 install
> of Open MPI 1.10.1. The only atypical thing is that
> btl_tcp_if_exclude = virbr0
> has been added to openmpi-mca-params.conf based on some failures I was
> seeing before.
> (And now of course I've added btl = ^sm as well to fix this issue, see my
> other response).
>
> Relevant output from strace (without the btl = ^sm) is below. Anything in
> square brackets is one of my minor edits or snips.
>
> open("/tmp/openmpi-sessions-[user]@[host]_0/40072/1/1/vader_segment.[host].1", O_RDWR|O_CREAT, 0600) = 12
> ftruncate(12, 4194312)                  = 0
> mmap(NULL, 4194312, PROT_READ|PROT_WRITE, MAP_SHARED, 12, 0) = 0x7fe506c8a000
> close(12)                               = 0
> write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
> [...]
> poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = -1 EFBIG (File too large)
> --- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=12329, si_uid=1005} ---
> --
>
> ------------------------------
> From: saur...@hotmail.com
> To: us...@open-mpi.org
> Subject: Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
> Date: Thu, 19 Nov 2015 15:24:08 -0500
>
> Hi,
>
> Sorry my previous email was garbled, sending it again.
>
> > cd examples
> > make hello_cxx
>
> > ulimit -f 131073
> > orterun -np 3 hello_cxx
> Hello, world
> (etc)
>
> > ulimit -f 131072
> > orterun -np 3 hello_cxx
> --------------------------------------------------------------------------
> orterun noticed that process rank 0 with PID 4473 on node sim16 exited on
> signal 25 (File size limit exceeded).
> --------------------------------------------------------------------------
>
> Any thoughts?
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/11/28070.php
>
>
