Jeff, should we check the file size ulimit in the vader/sm btls and disable them with a warning if the value is too low?
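A rough sketch of the kind of check I mean. This is not actual btl code: `fsize_limit_allows` is a made-up helper name, and where it would be called from in vader/sm is left open. It assumes plain POSIX `getrlimit`, which reports `RLIMIT_FSIZE` in bytes, so the segment size passed to ftruncate() can be compared directly.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not an existing Open MPI function).
 * Returns  1 if a backing file of `needed` bytes fits under the soft
 *            RLIMIT_FSIZE limit (or the limit is unlimited),
 *          0 if it does not fit,
 *         -1 if the limit could not be queried.
 */
static int fsize_limit_allows(size_t needed)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_FSIZE, &rl) != 0) {
        return -1;              /* cannot tell; the caller decides what to do */
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        return 1;               /* no file size limit in effect */
    }
    /* rlim_cur is in bytes, unlike the 1K blocks shown by "ulimit -f" */
    return ((rlim_t)needed <= rl.rlim_cur) ? 1 : 0;
}
```

With the numbers from the report, `fsize_limit_allows(134217736)` would return 0 under `ulimit -f 131072`, and the btl could then print a warning and disqualify itself instead of being killed by SIGXFSZ.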
Cheers,

Gilles

On Friday, November 20, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> For what it's worth, that's Open MPI creating a chunk of shared memory for
> use with on-server communication. It shows up as a "file", but it's really
> shared memory.
>
> You can disable sm and/or vader, but your on-server message passing
> performance will be significantly lower.
>
> Is there a reason you have a file size limit?
>
> Sent from my phone. No type good.
>
> On Nov 19, 2015, at 4:32 PM, Saurabh T <saur...@hotmail.com> wrote:
>
> I apologize, I have the wrong lines from strace for the initial file there
> (of course). The file with fd = 11 which causes the problem is called
> shared_mem_pool.[host] and ftruncate(11, 134217736) is called on it. (This
> is 8 bytes over 1024 times the ulimit of 131072, which makes sense as the
> ulimit is in 1K blocks).
>
> ------------------------------
> From: saur...@hotmail.com
> To: us...@open-mpi.org
> Subject: RE: Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
> Date: Thu, 19 Nov 2015 17:08:22 -0500
>
> > Could you please provide a little more info regarding the environment you
> > are running under (which resource mgr or not, etc), how many nodes you had
> > in the allocation, etc?
> >
> > There is no reason why something should behave that way. So it would help
> > if we could understand the setup.
> > Ralph
>
> To answer Ralph's above question on the other thread, all nodes are on
> the same machine orterun was run on. It's a redhat 7 64-bit gcc 4.8 install
> of openmpi 1.10.1. The only atypical thing is that
> btl_tcp_if_exclude = virbr0
> has been added to openmpi-mca-params.conf based on some failures I was
> seeing before.
> (And now of course I've added btl = ^sm as well to fix this issue, see my
> other response).
>
> Relevant output from strace (without the btl = ^sm) is below. Stuff in
> square brackets are my minor edits and snips.
>
> open("/tmp/openmpi-sessions-[user]@[host]_0/40072/1/1/vader_segment.[host].1", O_RDWR|O_CREAT, 0600) = 12
> ftruncate(12, 4194312) = 0
> mmap(NULL, 4194312, PROT_READ|PROT_WRITE, MAP_SHARED, 12, 0) = 0x7fe506c8a000
> close(12) = 0
> write(9, "\1\0\0\0\0\0\0\0", 8) = 8
> [...]
> poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = -1 EFBIG (File too large)
> --- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=12329, si_uid=1005} ---
> --
>
> ------------------------------
> From: saur...@hotmail.com
> To: us...@open-mpi.org
> Subject: Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
> Date: Thu, 19 Nov 2015 15:24:08 -0500
>
> Hi,
>
> Sorry my previous email was garbled, sending it again.
>
> > cd examples
> > make hello_cxx
>
> > ulimit -f 131073
> > orterun -np 3 hello_cxx
> Hello, world
> (etc)
>
> > ulimit -f 131072
> > orterun -np 3 hello_cxx
> --------------------------------------------------------------------------
> orterun noticed that process rank 0 with PID 4473 on node sim16 exited on
> signal 25 (File size limit exceeded).
> --------------------------------------------------------------------------
>
> Any thoughts?