Wouldn't be a bad idea to fail a little better, ya. Perhaps a good show-help 
message.

Sent from my phone. No type good.

On Nov 20, 2015, at 5:52 AM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

Jeff,

Should we check the ulimit in the vader/sm BTLs and disable them with a warning 
if the value is too low?
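
A minimal sketch of the kind of check being proposed, written in Python for brevity and not taken from Open MPI: compare the soft file size limit with the size the shared memory backing file needs, and warn before trying to create it. The function names and the warning text are hypothetical; the segment size is the one passed to ftruncate in the strace output quoted further down this thread.

```python
import resource

def segment_fits(required_bytes, soft_limit):
    """Return True if a file of required_bytes is allowed under soft_limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    # A file may grow up to the limit itself; only going past it fails.
    return required_bytes <= soft_limit

def shared_memory_allowed(required_bytes):
    """Check required_bytes against this process's soft RLIMIT_FSIZE."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return segment_fits(required_bytes, soft)

# Size from the ftruncate call reported in this thread.
SEGMENT_BYTES = 134217736

if not shared_memory_allowed(SEGMENT_BYTES):
    # Hypothetical warning text, for illustration only.
    print("warning: file size limit is too low for the shared memory "
          "segment; the shared memory transport would be disabled")
```

Keeping the comparison in a separate function from the getrlimit call makes the threshold easy to check without changing the limits of the running process.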

Cheers,

Gilles

On Friday, November 20, 2015, Jeff Squyres (jsquyres) 
<jsquy...@cisco.com> wrote:
For what it's worth, that's Open MPI creating a chunk of shared memory for use 
with on-server communication. It shows up as a "file", but it's really shared 
memory.

You can disable sm and/or vader, but your on-server message passing performance 
will be significantly lower.

Is there a reason you have a file size limit?

Sent from my phone. No type good.

On Nov 19, 2015, at 4:32 PM, Saurabh T <saur...@hotmail.com> wrote:

I apologize, I had the wrong lines from strace for the initial file there (of 
course). The file with fd = 11 which causes the problem is called 
shared_mem_pool.[host], and ftruncate(11, 134217736) is called on it. (This is 
8 bytes more than 1024 times the ulimit of 131072, i.e. 134217728, which makes 
sense as the ulimit is in 1K blocks.)
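
The arithmetic can be checked with a few lines of Python. This is only a worked calculation, assuming (as stated above) that `ulimit -f` counts 1024-byte blocks; the constant and function names are made up for the example.

```python
BLOCK_BYTES = 1024      # unit of `ulimit -f`, as stated in the post
REQUESTED = 134217736   # size passed to ftruncate on shared_mem_pool.[host]

def limit_bytes(blocks):
    """Convert a `ulimit -f` value in 1K blocks to bytes."""
    return blocks * BLOCK_BYTES

for blocks in (131072, 131073):
    limit = limit_bytes(blocks)
    if REQUESTED <= limit:
        verdict = "fits"
    else:
        verdict = "exceeds the limit by %d bytes" % (REQUESTED - limit)
    print(blocks, limit, verdict)
```

With a limit of 131072 blocks the request is 8 bytes too large, and with 131073 it fits, which matches the threshold seen with orterun.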


________________________________
From: saur...@hotmail.com
To: us...@open-mpi.org
Subject: RE: Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
List-Post: users@lists.open-mpi.org
Date: Thu, 19 Nov 2015 17:08:22 -0500


> Could you please provide a little more info regarding the environment you
> are running under (which resource mgr or not, etc), how many nodes you had
> in the allocation, etc?

> There is no reason why something should behave that way. So it would help
> if we could understand the setup.
> Ralph

To answer Ralph's question above from the other thread: all nodes are on the 
same machine orterun was run on. It's a Red Hat 7 64-bit, gcc 4.8 install of 
Open MPI 1.10.1. The only atypical thing is that
btl_tcp_if_exclude = virbr0
has been added to openmpi-mca-params.conf based on some failures I was seeing 
before.
(And now of course I've added btl = ^sm as well to fix this issue, see my other 
response).

Relevant output from strace (without the btl = ^sm) is below. Text in square 
brackets marks my minor edits and snips.

open("/tmp/openmpi-sessions-[user]@[host]_0/40072/1/1/vader_segment.[host].1", 
O_RDWR|O_CREAT, 0600) = 12
ftruncate(12, 4194312)                  = 0
mmap(NULL, 4194312, PROT_READ|PROT_WRITE, MAP_SHARED, 12, 0) = 0x7fe506c8a000
close(12)                               = 0
write(9, "\1\0\0\0\0\0\0\0", 8)         = 8
[...]
poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0)                = -1 
EFBIG (File too large)
--- SIGXFSZ {si_signo=SIGXFSZ, si_code=SI_USER, si_pid=12329, si_uid=1005} ---
--
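
The failure mode in the trace can be reproduced outside Open MPI. The sketch below is an illustration under stated assumptions, not Open MPI code: it lowers the soft RLIMIT_FSIZE of the current process, calls ftruncate on a temporary file, and reports the resulting errno. It assumes a POSIX system and CPython, which ignores SIGXFSZ at startup so the error surfaces as an OSError rather than killing the process. The function name and the 4096-byte limit are arbitrary choices for the example.

```python
import errno
import os
import resource
import tempfile

def ftruncate_past_limit(limit_bytes, extra_bytes=8):
    """Grow a temp file to limit_bytes + extra_bytes under a soft
    RLIMIT_FSIZE of limit_bytes; return the errno raised, or 0 on success."""
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    with tempfile.TemporaryFile() as f:
        # Lower only the soft limit so it can be restored afterwards.
        resource.setrlimit(resource.RLIMIT_FSIZE, (limit_bytes, hard))
        try:
            os.ftruncate(f.fileno(), limit_bytes + extra_bytes)
            return 0
        except OSError as exc:
            return exc.errno
        finally:
            resource.setrlimit(resource.RLIMIT_FSIZE, (soft, hard))

result = ftruncate_past_limit(4096)
print("EFBIG" if result == errno.EFBIG else result)
```

A program that leaves SIGXFSZ at its default disposition is terminated by the signal instead, which is what orterun reports as signal 25.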

________________________________
From: saur...@hotmail.com
To: us...@open-mpi.org
Subject: Openmpi 1.10.1 fails with SIGXFSZ on file limit <= 131072
List-Post: users@lists.open-mpi.org
Date: Thu, 19 Nov 2015 15:24:08 -0500

Hi,

Sorry my previous email was garbled, sending it again.

> cd examples
> make hello_cxx

> ulimit -f 131073
> orterun -np 3 hello_cxx
Hello, world
(etc)

> ulimit -f 131072
> orterun -np 3 hello_cxx
--------------------------------------------------------------------------
orterun noticed that process rank 0 with PID 4473 on node sim16 exited on 
signal 25 (File size limit exceeded).
--------------------------------------------------------------------------

Any thoughts?


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/11/28070.php
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/11/28080.php
