Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Great - thanks! On Mar 15, 2012, at 2:55 PM, Joshua Baker-LePain wrote: > On Thu, 15 Mar 2012 at 11:49am, Ralph Castain wrote > >> Here's the patch: I've set it up to go into 1.5, but not 1.4 as that series >> is being closed out. Please let me know if this solves the problem for you. > > I co

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 11:49am, Ralph Castain wrote Here's the patch: I've set it up to go into 1.5, but not 1.4 as that series is being closed out. Please let me know if this solves the problem for you. I couldn't get the included inline patch to apply to 1.5.4 (probably my issue), but I do

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
Am 15.03.2012 um 18:14 schrieb Joshua Baker-LePain: > On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote > >> PS: In your example you also had the case 2 slots in the low priority queue, >> what is the actual setup in your cluster? > > Our actual setup is: > > o lab.q, slots=numprocs, load_thresholds=

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Here's the patch: I've set it up to go into 1.5, but not 1.4 as that series is being closed out. Please let me know if this solves the problem for you. Modified: orte/mca/ras/gridengine/ras_gridengine_module.c == --- ort

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 11:38am, Ralph Castain wrote No, I'll fix the parser as we should be able to run anyway. Just can't guarantee which queue the job will end up in, but at least it -will- run. Makes sense to me. Thanks! -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
No, I'll fix the parser as we should be able to run anyway. Just can't guarantee which queue the job will end up in, but at least it -will- run. On Mar 15, 2012, at 11:34 AM, Joshua Baker-LePain wrote: > On Thu, 15 Mar 2012 at 4:41pm, Reuti wrote > >> Am 15.03.2012 um 15:50 schrieb Ralph Castai

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 4:41pm, Reuti wrote Am 15.03.2012 um 15:50 schrieb Ralph Castain: On Mar 15, 2012, at 8:46 AM, Reuti wrote: Am 15.03.2012 um 15:37 schrieb Ralph Castain: FWIW: I see the problem. Our parser was apparently written assuming every line was a unique host, so it doesn't e

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote PS: In your example you also had the case 2 slots in the low priority queue, what is the actual setup in your cluster? Our actual setup is: o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE projects) limited by RQS to a number

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
Am 15.03.2012 um 15:50 schrieb Ralph Castain: > > On Mar 15, 2012, at 8:46 AM, Reuti wrote: > >> Am 15.03.2012 um 15:37 schrieb Ralph Castain: >> >>> Just to be clear: I take it that the first entry is the host name, and the >>> second is the number of slots allocated on that host? >> >> This

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
On Mar 15, 2012, at 8:46 AM, Reuti wrote: > Am 15.03.2012 um 15:37 schrieb Ralph Castain: > >> Just to be clear: I take it that the first entry is the host name, and the >> second is the number of slots allocated on that host? > > This is correct. > > >> FWIW: I see the problem. Our parser w

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
Am 15.03.2012 um 15:37 schrieb Ralph Castain: > Just to be clear: I take it that the first entry is the host name, and the > second is the number of slots allocated on that host? This is correct. > FWIW: I see the problem. Our parser was apparently written assuming every > line was a unique h

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Just to be clear: I take it that the first entry is the host name, and the second is the number of slots allocated on that host? FWIW: I see the problem. Our parser was apparently written assuming every line was a unique host, so it doesn't even check to see if there is duplication. Easy fix -
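
For readers following the parser discussion, here is a minimal sketch of the "collect duplicate hosts" idea. It is not the actual patch to orte/mca/ras/gridengine/ras_gridengine_module.c, only an illustration in Python. It assumes, as confirmed in the thread, that the first field on each hostfile line is the host name and the second is the slot count. The trailing fields, the node names, and the queue name long.q are hypothetical sample data.

```python
def merge_host_slots(lines):
    """Sum slot counts for hosts that appear on more than one line.

    Only the first two whitespace-separated fields are used:
    host name, then number of slots. Anything after that is ignored.
    """
    slots = {}  # dicts keep insertion order, so hosts stay in first-seen order
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        slots[host] = slots.get(host, 0) + count
    return list(slots.items())


# Hypothetical hostfile content: node01 was granted slots from two queues,
# so it shows up on two separate lines.
example = [
    "node01 2 lab.q@node01 UNDEFINED",
    "node01 1 long.q@node01 UNDEFINED",
    "node02 4 lab.q@node02 UNDEFINED",
]

if __name__ == "__main__":
    for host, count in merge_host_slots(example):
        print(host, count)
```

With one merged entry per host, a launcher would start a single daemon on node01 instead of two, which is the failure mode described elsewhere in this thread.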

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
Am 15.03.2012 um 05:22 schrieb Joshua Baker-LePain: > On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote > >> On Mar 14, 2012, at 5:44 PM, Reuti wrote: > >>> (I was just typing when Ralph's message came in: I can confirm this. To >>> avoid it, it would mean for Open MPI to collect all lines fro

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Rayson Ho
Hi Joshua, I don't think the new built-in rsh in later versions of Grid Engine is going to make any difference - the orted is the real starter of the MPI tasks and should have a greater influence on the task environment. However, it would help if you can record the nice values and resource limits
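
One way to record the nice value and resource limits Rayson asks about is a small helper run inside the job on each node. This is only a sketch using the Python standard library on a Unix-like system; the function name and the choice of which limits to capture are assumptions, not something specified in the thread.

```python
import os
import resource

# Limits chosen for illustration only; adjust to whatever is being compared.
LIMIT_NAMES = ("RLIMIT_STACK", "RLIMIT_NOFILE", "RLIMIT_CORE")


def snapshot_limits():
    """Return the current nice value and (soft, hard) resource limits."""
    limits = {
        name: resource.getrlimit(getattr(resource, name))
        for name in LIMIT_NAMES
    }
    # os.nice(0) adds 0 to the niceness and returns the resulting value,
    # so it reads the current nice level without changing it.
    return {"nice": os.nice(0), "limits": limits}


if __name__ == "__main__":
    snap = snapshot_limits()
    print("nice:", snap["nice"])
    for name, (soft, hard) in snap["limits"].items():
        print(name, "soft:", soft, "hard:", hard)
```

Comparing the output from a working run and a failing run would show whether the two environments actually differ.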

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 12:44am, Reuti wrote Which version of SGE are you using? The traditional rsh startup was replaced by the builtin startup some time ago (although it should still work). We're currently running the rather ancient 6.1u4 (due to the "If it ain't broke..." philosophy). The

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote On Mar 14, 2012, at 5:44 PM, Reuti wrote: (I was just typing when Ralph's message came in: I can confirm this. To avoid it, it would mean for Open MPI to collect all lines from the hostfile which are on the same machine. SGE creates entries

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Ralph Castain
On Mar 14, 2012, at 5:44 PM, Reuti wrote: > Am 14.03.2012 um 23:48 schrieb Joshua Baker-LePain: > >> On Wed, 14 Mar 2012 at 6:31pm, Reuti wrote >> >>> I just tested with two different queues on two machines and a small >>> mpihello and it is working as expected. >> >> At this point the narrat

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Reuti
Am 14.03.2012 um 23:48 schrieb Joshua Baker-LePain: > On Wed, 14 Mar 2012 at 6:31pm, Reuti wrote > >> I just tested with two different queues on two machines and a small mpihello >> and it is working as expected. > > At this point the narrative is getting very confused, even for me. So I > tr

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Ralph Castain
Something is very wrong - there can only be one orted on each node. Having two orteds on the same node for the same job guarantees that things will become confused and generally fail. I don't know enough SGE to advise you what's wrong with your job script, but it looks like OMPI thinks there ar

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Reuti
Am 14.03.2012 um 18:30 schrieb Joshua Baker-LePain: > On Wed, 14 Mar 2012 at 9:33am, Reuti wrote > >>> I can run as many threads as I like on a single system with no problems, >>> even if those threads are running at different nice levels. >> >> How do they get different nice levels - you renic

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Joshua Baker-LePain
On Wed, 14 Mar 2012 at 6:31pm, Reuti wrote I just tested with two different queues on two machines and a small mpihello and it is working as expected. At this point the narrative is getting very confused, even for me. So I tried to find a clear cut case where I can change one thing to flip

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Reuti
Am 14.03.2012 um 17:44 schrieb Ralph Castain: > Hi Reuti > > I appreciate your help on this thread - I confess I'm puzzled by it. As you > know, OMPI doesn't use SGE to launch the individual processes, nor does SGE > even know they exist. All SGE is used for is to launch the OMPI daemons > (or

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Joshua Baker-LePain
On Wed, 14 Mar 2012 at 9:33am, Reuti wrote I can run as many threads as I like on a single system with no problems, even if those threads are running at different nice levels. How do they get different nice levels - you renice them? I would assume that all start at the same nice level as the parent. In

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Ralph Castain
Hi Reuti I appreciate your help on this thread - I confess I'm puzzled by it. As you know, OMPI doesn't use SGE to launch the individual processes, nor does SGE even know they exist. All SGE is used for is to launch the OMPI daemons (orteds). This is done as a single qrsh call, so won't all the

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-14 Thread Reuti
Hi, Am 14.03.2012 um 04:02 schrieb Joshua Baker-LePain: > On Tue, 13 Mar 2012 at 5:31pm, Ralph Castain wrote > >> FWIW: I have a Centos6 system myself, and I have no problems running OMPI on >> it (1.4 or 1.5). I can try building it the same way you do and see what >> happens. > > I can run a

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
On Tue, 13 Mar 2012 at 5:31pm, Ralph Castain wrote FWIW: I have a Centos6 system myself, and I have no problems running OMPI on it (1.4 or 1.5). I can try building it the same way you do and see what happens. I can run as many threads as I like on a single system with no problems, even if th

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
On Tue, 13 Mar 2012 at 11:28pm, Gutierrez, Samuel K wrote Can you rebuild without the "--enable-mpi-threads" option and try again. I did and still got segfaults (although w/ slightly different backtraces). See the response I just sent to Ralph. -- Joshua Baker-LePain QB3 Shared Cluster Sysa

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
On Tue, 13 Mar 2012 at 6:05pm, Ralph Castain wrote I started playing with this configure line on my Centos6 machine, and I'd suggest a couple of things: 1. drop the --with-libltdl=external ==> not a good idea 2. drop --with-esmtp ==> useless unless you really want pager messages notifying

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Gustavo Correa
Hi Joshua, Can your int counter "i" get so large? > for(i=0; i<=1; i++) I may be mistaken, but 1,000,000,000,000 = 10**12 > 2**31-1 = 2,147,483,647 = maximum int. Unless they are 64-bit long[s]. Just a thought. Gus Correa On Mar 13, 2012, at 4:54 PM, Joshua Baker-LePain wrote:
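
Gus's arithmetic can be checked quickly. The sketch below is illustrative only: it assumes a 32-bit signed int and takes the 10**12 bound from the figures in his message. In C, signed overflow is undefined behavior; the wrapped value shown here is just what two's-complement truncation produces, obtained via ctypes, which does no overflow checking.

```python
import ctypes

INT32_MAX = 2**31 - 1   # 2,147,483,647, the largest 32-bit signed int
BOUND = 10**12          # loop bound as stated in the message above


def fits_in_int32(value):
    """True if value is representable as a 32-bit signed integer."""
    return -2**31 <= value <= INT32_MAX


# ctypes silently truncates to 32 bits, mimicking two's-complement wraparound.
wrapped = ctypes.c_int32(BOUND).value

if __name__ == "__main__":
    print("bound fits in int32:", fits_in_int32(BOUND))
    print("bound truncated to int32:", wrapped)
```

If the counter really is a 32-bit int, it can never reach the bound, so a loop condition like i <= 10**12 would never become false.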

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Ralph Castain
I started playing with this configure line on my Centos6 machine, and I'd suggest a couple of things: 1. drop the --with-libltdl=external ==> not a good idea 2. drop --with-esmtp ==> useless unless you really want pager messages notifying you of problems 3. drop --enable-mpi-threads for now

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Ralph Castain
Hmmm... you might try removing the --enable-mpi-threads from the configure to be safe. FWIW: I have a Centos6 system myself, and I have no problems running OMPI on it (1.4 or 1.5). I can try building it the same way you do and see what happens. On Mar 13, 2012, at 5:22 PM, Joshua Baker-LePain wro

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Gutierrez, Samuel K
Can you rebuild without the "--enable-mpi-threads" option and try again. Thanks, Sam On Mar 13, 2012, at 5:22 PM, Joshua Baker-LePain wrote: > On Tue, 13 Mar 2012 at 10:57pm, Gutierrez, Samuel K wrote > >> Fooey. What compiler are you using to build Open MPI and how are you >> configuring yo

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
On Tue, 13 Mar 2012 at 10:57pm, Gutierrez, Samuel K wrote Fooey. What compiler are you using to build Open MPI and how are you configuring your build? I'm using gcc as packaged by RH/CentOS 6.2: [jlb@opt200 1.4.5-2]$ gcc --version gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3) I actually tried

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
On Tue, 13 Mar 2012 at 5:06pm, Ralph Castain wrote Out of curiosity: could you send along the mpirun cmd line you are using to launch these jobs? I'm wondering if the SGE integration itself is the problem, and it only shows up in the sm code. It's about as simple as it gets: mpirun -np $NSLO

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Ralph Castain
Out of curiosity: could you send along the mpirun cmd line you are using to launch these jobs? I'm wondering if the SGE integration itself is the problem, and it only shows up in the sm code. On Mar 13, 2012, at 4:57 PM, Gutierrez, Samuel K wrote: > > On Mar 13, 2012, at 4:07 PM, Joshua Baker

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Gutierrez, Samuel K
On Mar 13, 2012, at 4:07 PM, Joshua Baker-LePain wrote: > On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote > Any more information surrounding your failures in 1.5.4 are greatly appreciated. >>> >>> I'm happy to provide, but what exactly are you looking for? The test code >>

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote Any more information surrounding your failures in 1.5.4 is greatly appreciated. I'm happy to provide, but what exactly are you looking for? The test code I'm running is *very* simple: If you experience this type of failure with 1.4.

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Gutierrez, Samuel K
On Mar 13, 2012, at 2:54 PM, Joshua Baker-LePain wrote: > On Tue, 13 Mar 2012 at 7:53pm, Gutierrez, Samuel K wrote > >> The failure signature isn't exactly what we were seeing here at LANL, but >> there were misplaced memory barriers in Open MPI 1.4.3. Ticket 2619 talks >> about this issue (h

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
On Tue, 13 Mar 2012 at 7:53pm, Gutierrez, Samuel K wrote The failure signature isn't exactly what we were seeing here at LANL, but there were misplaced memory barriers in Open MPI 1.4.3. Ticket 2619 talks about this issue (https://svn.open-mpi.org/trac/ompi/ticket/2619). This doesn't explain,

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Gutierrez, Samuel K
The failure signature isn't exactly what we were seeing here at LANL, but there were misplaced memory barriers in Open MPI 1.4.3. Ticket 2619 talks about this issue (https://svn.open-mpi.org/trac/ompi/ticket/2619). This doesn't explain, however, the failures that you are experiencing within Op

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
On Tue, 13 Mar 2012 at 7:20pm, Gutierrez, Samuel K wrote Just to be clear, what specific version of Open MPI produced the provided backtrace? This smells like a missing memory barrier problem. The backtrace in my original post was from 1.5.4 -- I took the 1.5.4 source and put it into the 1.5

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Gutierrez, Samuel K
Hi, Just to be clear, what specific version of Open MPI produced the provided backtrace? This smells like a missing memory barrier problem. -- Samuel K. Gutierrez Los Alamos National Laboratory On Mar 13, 2012, at 1:07 PM, Joshua Baker-LePain wrote: > I run a decent size (600+ nodes, 4000+ co

[OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-13 Thread Joshua Baker-LePain
I run a decent size (600+ nodes, 4000+ cores) heterogeneous (multiple generations of x86_64 hardware) cluster. We use SGE (currently 6.1u4, which, yes, is pretty ancient) and just upgraded from CentOS 5.7 to 6.2. We had been using MPICH2 under CentOS 5, but I'd much rather use OpenMPI as packa