[OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
I run a decent size (600+ nodes, 4000+ cores) heterogeneous (multiple generations of x86_64 hardware) cluster. We use SGE (currently 6.1u4, which, yes, is pretty ancient) and just upgraded from CentOS 5.7 to 6.2. We had been using MPICH2 under CentOS 5, but I'd much rather use Open MPI as packaged by RH/CentOS. Our SGE setup has a high priority queue running un-niced and a low priority queue running at nice 19, each with 1 slot per core on every node.

I'm seeing consistent segfaults with Open MPI when I submit jobs without specifying a queue (meaning some threads run niced, others un-niced). This was initially reported to me by 2 users, each with their own code, but I can reproduce it with my own very simple test program. The segfaults occur whether I'm using the default Open MPI version of 1.5 or compat-openmpi-1.4.3. I'll note that I did upgrade the distro RPM of openmpi-1.5.3 to 1.5.4 to get around the broken SGE integration <https://bugzilla.redhat.com/show_bug.cgi?id=789150>.

I can't say for certain that jobs run entirely in the high priority queue never segfault, but if they do, it's not nearly as reproducible. The segfaults also don't seem to occur if a job runs entirely on one node.

The error logs of failed jobs contain a stanza like this for each thread which segfaulted:

[opt207:03766] *** Process received signal ***
[opt207:03766] Signal: Segmentation fault (11)
[opt207:03766] Signal code: Address not mapped (1)
[opt207:03766] Failing at address: 0x2b4e279e778c
[opt207:03766] [ 0] /lib64/libpthread.so.0() [0x37f940f4a0]
[opt207:03766] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_sm.so(+0x42fc) [0x2b17aa6002fc]
[opt207:03766] [ 2] /usr/lib64/openmpi/lib/libmpi.so.1(opal_progress+0x5a) [0x37fa0d1aba]
[opt207:03766] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_grpcomm_bad.so(+0x24d5) [0x2b17a7d234d5]
[opt207:03766] [ 4] /usr/lib64/openmpi/lib/libmpi.so.1() [0x37fa04bd57]
[opt207:03766] [ 5] /usr/lib64/openmpi/lib/libmpi.so.1(MPI_Init+0x170) [0x37fa063c70]
[opt207:03766] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug() [0x4006e6]
[opt207:03766] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x37f901ecdd]
[opt207:03766] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug() [0x400609]
[opt207:03766] *** End of error message ***

A backtrace of the core file looks like this:

#0  sm_fifo_read () at btl_sm.h:353
#1  mca_btl_sm_component_progress () at btl_sm_component.c:588
#2  0x0037fa0d1aba in opal_progress () at runtime/opal_progress.c:207
#3  0x2b17a7d234d5 in barrier () at grpcomm_bad_module.c:277
#4  0x0037fa04bd57 in ompi_mpi_init (argc=1, argv=0x7fff253658f8, requested=, provided=) at runtime/ompi_mpi_init.c:771
#5  0x0037fa063c70 in PMPI_Init (argc=0x7fff253657fc, argv=0x7fff253657f0) at pinit.c:84
#6  0x004006e6 in main (argc=1, argv=0x7fff253658f8) at mpihello-long.c:11

Those are both from a test with 1.5. The 1.4 errors are essentially identical, with the differences mainly in line numbers. I'm happy to post full logs, but I'm trying (albeit unsuccessfully) to keep this from turning into a novel. I'm happy to do as much debugging as I can -- I'm pretty motivated to get this working. Thanks for any insights.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
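For reference, a backtrace like the one above can be pulled out of a core file with something along these lines (the core file name is only an example; the actual name depends on the system's core_pattern setting):

gdb /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug core.3766
(gdb) bt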
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Tue, 13 Mar 2012 at 7:20pm, Gutierrez, Samuel K wrote:

> Just to be clear, what specific version of Open MPI produced the provided
> backtrace? This smells like a missing memory barrier problem.

The backtrace in my original post was from 1.5.4 -- I took the 1.5.4 source and put it into the 1.5.3 SRPM provided by Red Hat. Below is a backtrace from 1.4.3 as shipped by RH/CentOS:

#0  sm_fifo_read () at btl_sm.h:267
#1  mca_btl_sm_component_progress () at btl_sm_component.c:391
#2  0x003e54a129ca in opal_progress () at runtime/opal_progress.c:207
#3  0x2b00fa6bb8d5 in barrier () at grpcomm_bad_module.c:270
#4  0x003e55e37d04 in ompi_mpi_init (argc=, argv=, requested=, provided=) at runtime/ompi_mpi_init.c:722
#5  0x003e55e5bae0 in PMPI_Init (argc=0x7fff8588b1cc, argv=0x7fff8588b1c0) at pinit.c:80
#6  0x00400826 in main (argc=1, argv=0x7fff8588b2c8) at mpihello-long.c:11

Thanks!

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Tue, 13 Mar 2012 at 7:53pm, Gutierrez, Samuel K wrote:

> The failure signature isn't exactly what we were seeing here at LANL, but
> there were misplaced memory barriers in Open MPI 1.4.3. Ticket 2619 talks
> about this issue (https://svn.open-mpi.org/trac/ompi/ticket/2619). This
> doesn't explain, however, the failures that you are experiencing within
> Open MPI 1.5.4. Can you give 1.4.4 a whirl and see if this fixes the issue?

Would it be best to use 1.4.4 specifically, or simply the most recent 1.4.x (which appears to be 1.4.5 at this point)?

> Any more information surrounding your failures in 1.5.4 are greatly
> appreciated.

I'm happy to provide, but what exactly are you looking for? The test code I'm running is *very* simple:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int node;
    int i, j;
    float f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);

    printf("Hello World from Node %d.\n", node);
    for (i = 0; i <= 1; i++)
        f = i * 2.718281828 * i + i + i * 3.141592654;

    MPI_Finalize();
}

And my environment is a pretty standard CentOS-6.2 install.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote:

>>> Any more information surrounding your failures in 1.5.4 are greatly
>>> appreciated.
>>
>> I'm happy to provide, but what exactly are you looking for? The test
>> code I'm running is *very* simple:
>
> If you experience this type of failure with 1.4.5, can you send another
> backtrace? We'll go from there.

In an odd way I'm relieved to say that 1.4.5 failed in the same way. From the SGE log of the run, here's the error message from one of the threads that segfaulted:

[iq104:05697] *** Process received signal ***
[iq104:05697] Signal: Segmentation fault (11)
[iq104:05697] Signal code: Address not mapped (1)
[iq104:05697] Failing at address: 0x2ad032188e8c
[iq104:05697] [ 0] /lib64/libpthread.so.0() [0x3e5420f4a0]
[iq104:05697] [ 1] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so(+0x3c4c) [0x2b0099ec4c4c]
[iq104:05697] [ 2] /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0(opal_progress+0x6a) [0x2b00967737ca]
[iq104:05697] [ 3] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so(+0x18d5) [0x2b00975ef8d5]
[iq104:05697] [ 4] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(+0x38a24) [0x2b009628da24]
[iq104:05697] [ 5] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(MPI_Init+0x1b0) [0x2b00962b24f0]
[iq104:05697] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug(main+0x22) [0x400826]
[iq104:05697] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3e53e1ecdd]
[iq104:05697] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug() [0x400749]
[iq104:05697] *** End of error message ***

And the backtrace of the resulting core file:

#0  0x2b0099ec4c4c in mca_btl_sm_component_progress () from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
#1  0x2b00967737ca in opal_progress () from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
#2  0x2b00975ef8d5 in barrier () from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
#3  0x2b009628da24 in ompi_mpi_init () from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#4  0x2b00962b24f0 in PMPI_Init () from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#5  0x00400826 in main (argc=1, argv=0x7fff9fe113f8) at mpihello-long.c:11

> Another question. How reproducible is this on your system?

In my testing today, it's been 100% reproducible.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Tue, 13 Mar 2012 at 5:06pm, Ralph Castain wrote:

> Out of curiosity: could you send along the mpirun cmd line you are using
> to launch these jobs? I'm wondering if the SGE integration itself is the
> problem, and it only shows up in the sm code.

It's about as simple as it gets:

mpirun -np $NSLOTS $HOME/mybin/mpihello-long.ompi-1.4-debug

where $NSLOTS is set by SGE based on how many slots in the PE one requests.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
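For context, an SGE submit script along these lines would produce that invocation under a tight-integration parallel environment (the PE name "orte" and the slot count are placeholders; SGE fills in $NSLOTS from whatever -pe request is granted):

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe orte 64
mpirun -np $NSLOTS $HOME/mybin/mpihello-long.ompi-1.4-debug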
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Tue, 13 Mar 2012 at 10:57pm, Gutierrez, Samuel K wrote:

> Fooey. What compiler are you using to build Open MPI and how are you
> configuring your build?

I'm using gcc as packaged by RH/CentOS 6.2:

[jlb@opt200 1.4.5-2]$ gcc --version
gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)

I actually tried 2 custom builds of Open MPI 1.4.5. For the first I tried to stick close to the options in RH's compat-openmpi SRPM:

./configure --prefix=$HOME/ompi-1.4.5 --enable-mpi-threads --enable-openib-ibcm --with-sge --with-libltdl=external --with-valgrind --enable-memchecker --with-psm=no --with-esmtp LDFLAGS='-Wl,-z,noexecstack'

That resulted in the backtrace I sent previously:

#0  0x2b0099ec4c4c in mca_btl_sm_component_progress () from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
#1  0x2b00967737ca in opal_progress () from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
#2  0x2b00975ef8d5 in barrier () from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
#3  0x2b009628da24 in ompi_mpi_init () from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#4  0x2b00962b24f0 in PMPI_Init () from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#5  0x00400826 in main (argc=1, argv=0x7fff9fe113f8) at mpihello-long.c:11

For kicks, I tried a 2nd compile of 1.4.5 with a bare minimum of options:

./configure --prefix=$HOME/ompi-1.4.5 --with-sge

That resulted in a slightly different backtrace that seems to be missing a bit:

#0  0x2b7bbc8681d0 in ?? ()
#1  <signal handler called>
#2  0x2b7bbd2b8f6c in mca_btl_sm_component_progress () from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
#3  0x2b7bb9b2feda in opal_progress () from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
#4  0x2b7bba9a98d5 in barrier () from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
#5  0x2b7bb965d426 in ompi_mpi_init () from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#6  0x2b7bb967cba0 in PMPI_Init () from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#7  0x00400826 in main (argc=1, argv=0x7fff93634788) at mpihello-long.c:11

> Can you also run with a debug build of Open MPI so we can see the line
> numbers?

I'll do that first thing tomorrow.

>>> Another question. How reproducible is this on your system?
>>
>> In my testing today, it's been 100% reproducible.
>
> That's surprising.

Heh. You're telling me. Thanks for taking an interest in this.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
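A debug build that shows line numbers in the backtrace can be produced with something like the following (a sketch; --enable-debug is Open MPI's standard debug configure switch, and the prefix is just an example):

./configure --prefix=$HOME/ompi-1.4.5-debug --with-sge --enable-debug CFLAGS="-g -O0"
make -j4 && make install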
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Tue, 13 Mar 2012 at 6:05pm, Ralph Castain wrote:

> I started playing with this configure line on my Centos6 machine, and I'd
> suggest a couple of things:
>
> 1. drop the --with-libltdl=external ==> not a good idea
>
> 2. drop --with-esmtp ==> useless unless you really want pager messages
>    notifying you of problems
>
> 3. drop --enable-mpi-threads for now
>
> I'm continuing to play with it, but thought I'd pass those along.

After my first custom build of 1.4.5 didn't work, I built it again using an utterly minimal configure line:

./configure --prefix=$HOME/ompi-1.4.5 --with-sge

Runs with this library still failed, but the backtrace did change slightly:

#0  0x2b7bbc8681d0 in ?? ()
#1  <signal handler called>
#2  0x2b7bbd2b8f6c in mca_btl_sm_component_progress () from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
#3  0x2b7bb9b2feda in opal_progress () from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
#4  0x2b7bba9a98d5 in barrier () from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
#5  0x2b7bb965d426 in ompi_mpi_init () from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#6  0x2b7bb967cba0 in PMPI_Init () from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#7  0x00400826 in main (argc=1, argv=0x7fff93634788) at mpihello-long.c:11

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Tue, 13 Mar 2012 at 11:28pm, Gutierrez, Samuel K wrote:

> Can you rebuild without the "--enable-mpi-threads" option and try again.

I did and still got segfaults (although w/ slightly different backtraces). See the response I just sent to Ralph.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Tue, 13 Mar 2012 at 5:31pm, Ralph Castain wrote:

> FWIW: I have a Centos6 system myself, and I have no problems running OMPI
> on it (1.4 or 1.5). I can try building it the same way you do and see
> what happens.

I can run as many threads as I like on a single system with no problems, even if those threads are running at different nice levels. The problem seems to arise when I'm both a) running across multiple machines and b) running threads at differing nice levels (which often happens as a result of our queueing setup). I can't guarantee that the problem *never* happens when I run across multiple machines with all the threads un-niced, but I haven't been able to reproduce that at will like I can for the other case.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Wed, 14 Mar 2012 at 9:33am, Reuti wrote:

>> I can run as many threads as I like on a single system with no problems,
>> even if those threads are running at different nice levels.
>
> How do they get different nice levels - you renice them? I would assume
> that all start at the same nice level as the parent. In your test program
> you posted there are no threads.

Ah, thanks for pointing this out. Yes, when a job runs on a single host (even if SGE has assigned it to multiple queues), there's no qrsh involved. There's just a simple mpirun and all the threads run at the same priority. I did try renicing half the threads, and the job didn't fail.

>> The problem seems to arise when I'm both a) running across multiple
>> machines and b) running threads at differing nice levels (which often
>> happens as a result of our queueing setup).
>
> This sounds like you are getting slots from different queues assigned to
> one and the same job. My experience: don't do it, unless you need it.

You are correct -- the problem is specific to a parallel job getting slots from different queues. Our cluster is used by a combination of folks who've financially supported it and those that haven't. Our high priority queue, lab.q, runs un-niced and is available only to those who have donated money and/or machines to us. Our low priority queue, long.q, runs at nice 19 and is available to all. The goal is to ensure instant access by a lab to its "share" of the cluster while letting both those users and non-supporting users use as many cores as they can in long.q. We explicitly allow overloading to further support our goal of keeping the usage both full and fair. The setup is a bit convoluted, but it has kept the users (and, more importantly, the PIs) happy.

Until the recent upgrade to CentOS 6 and the concomitant switch from MPICH2 to Open MPI, we've had no issues with parallel jobs and this queue setup. And the test jobs I've tried with our old MPICH2 install (and the MPICH tight integration) running under CentOS 6 don't fail either.

> Do you face the same if you stay in one and the same queue across the
> machines?

Jobs don't crash if they either:

a) all run in the same queue, or
b) run in multiple queues all on one machine

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
> jlb      12796  2.0  0.0 153232  3752 ?   S    14:41   0:00  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>
> Joshua: the Centos6 is the same on all nodes and you recompiled the
> application with the actual version of the library? By "threads" you
> refer to "processes"?

All the nodes are installed from the same kickstart file and kept fully up to date. And, yes, the application is compiled against the exact library I'm running it with.

Thanks again to all for looking at this.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote:

> On Mar 14, 2012, at 5:44 PM, Reuti wrote:
>
>> (I was just typing when Ralph's message came in: I can confirm this. To
>> avoid it, it would mean for Open MPI to collect all lines from the
>> hostfile which are on the same machine. SGE creates entries for each
>> queue/host pair in the machine file).
>
> Hmmm… I can take a look at the allocator module and see why we aren't
> doing it. Would the host names be the same for the two queues?

I can't speak authoritatively like Reuti can, but here's what a hostfile looks like on my cluster (note that all our name resolution is done via /etc/hosts -- there's no DNS involved):

iq103 8 lab.q@iq103
iq103 1 test.q@iq103
iq104 8 lab.q@iq104
iq104 1 test.q@iq104
opt221 2 lab.q@opt221
opt221 1 test.q@opt221

>> @Ralph: it could work if SGE would have a facility to request the
>> desired queue in `qrsh -inherit ...`, because then the $TMPDIR would be
>> unique for each orted again (assuming it's using different ports for
>> each).
>
> Gotcha! I suspect getting the allocator to handle this cleanly is the
> better solution, though.

If I can help (testing patches, e.g.), let me know.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
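To illustrate what the allocator-side fix amounts to, here is a standalone sketch (hypothetical code, not Open MPI's actual allocator): it merges duplicate host entries from an SGE-style machine file like the one above, summing their slot counts so each node appears only once.

/* Hypothetical illustration only -- NOT Open MPI's actual allocator code.
 * Merge duplicate host entries from an SGE-style machine file (format:
 * "<hostname> <slots> <queue>@<hostname>"), summing slot counts so that
 * "iq103 8 lab.q@iq103" and "iq103 1 test.q@iq103" collapse into one
 * host with 9 slots. */
#include <stdio.h>
#include <string.h>

#define MAX_HOSTS 1024

struct host_entry {
    char name[256];
    int  slots;
};

int main(void)
{
    struct host_entry hosts[MAX_HOSTS];
    int nhosts = 0;
    char name[256], queue[256];
    int slots, i;

    /* Read the machine file from stdin, one "host slots queue@host" line at a time. */
    while (scanf("%255s %d %255s", name, &slots, queue) == 3) {
        int found = 0;
        for (i = 0; i < nhosts; i++) {
            if (strcmp(hosts[i].name, name) == 0) {
                hosts[i].slots += slots;   /* same node listed again for another queue */
                found = 1;
                break;
            }
        }
        if (!found && nhosts < MAX_HOSTS) {
            strncpy(hosts[nhosts].name, name, sizeof(hosts[nhosts].name) - 1);
            hosts[nhosts].name[sizeof(hosts[nhosts].name) - 1] = '\0';
            hosts[nhosts].slots = slots;
            nhosts++;
        }
    }

    /* Print the deduplicated host list. */
    for (i = 0; i < nhosts; i++)
        printf("%s slots=%d\n", hosts[i].name, hosts[i].slots);

    return 0;
}

Fed the hostfile above, this would print iq103 slots=9, iq104 slots=9, and opt221 slots=3 -- roughly the per-node view the launcher needs in order to start a single daemon per node.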
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Thu, 15 Mar 2012 at 12:44am, Reuti wrote:

> Which version of SGE are you using? The traditional rsh startup was
> replaced by the builtin startup some time ago (although it should still
> work).

We're currently running the rather ancient 6.1u4 (due to the "If it ain't broke..." philosophy). The hardware for our new queue master recently arrived and I'll soon be upgrading to the most recent Open Grid Scheduler release. Are you saying that the upgrade with the new builtin startup method should avoid this problem?

> Maybe this already shows the problem: there are two `qrsh -inherit`, as
> Open MPI thinks these are different machines (I ran only with one slot on
> each host hence didn't get it first but can reproduce it now). But for
> SGE both may end up in the same queue, overriding the openmpi-session in
> $TMPDIR.
>
> Although it's running: do you get all output? If I request 4 slots and
> get one from each queue on both machines the mpihello outputs only 3
> lines: the "Hello World from Node 3" is always missing.

I do seem to get all the output -- there are indeed 64 Hello World lines.

Thanks again for all the help on this. This is one of the most productive exchanges I've had on a mailing list in far too long.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote:

> PS: In your example you also had the case 2 slots in the low priority
> queue, what is the actual setup in your cluster?

Our actual setup is:

o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE projects) limited by RQS to a number of slots equal to their "share" of the cluster, seq_no=0, priority=0

o long.q, slots=numprocs, load_thresholds=np_load_avg=0.9, seq_no=1, priority=19

o short.q, slots=numprocs, load_thresholds=np_load_avg=1.25, users limited by RQS to 200 slots, runtime limited to 30 minutes, seq_no=2, priority=10

Users are instructed to not select a queue when submitting jobs. The theory is that even if non-contributing users have filled the cluster with long.q jobs, contributing users will still have instant access to "their" lab.q slots, overloading nodes with jobs running at a higher priority than the long.q jobs. long.q jobs won't start on nodes full of lab.q jobs. And short.q is for quick, high priority jobs regardless of cluster status (the main use case being processing MRI data into images while a patient is physically in the scanner).

The truth is our cluster is primarily used for, and thus SGE is tuned for, large numbers of serial jobs. We do have *some* folks running parallel code, and it *is* starting to get to the point where I need to reconfigure things to make that part work better.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
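In queue-configuration terms (qconf -sq long.q), the long.q entry above corresponds roughly to attribute settings like these -- a sketch showing only the attributes mentioned, with the slots value standing in for the per-host core count and everything else left at site defaults:

qname             long.q
seq_no            1
priority          19
load_thresholds   np_load_avg=0.9
slots             8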
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Thu, 15 Mar 2012 at 4:41pm, Reuti wrote:

> On 15.03.2012 at 15:50, Ralph Castain wrote:
>
>> On Mar 15, 2012, at 8:46 AM, Reuti wrote:
>>
>>> On 15.03.2012 at 15:37, Ralph Castain wrote:
>>>
>>>> FWIW: I see the problem. Our parser was apparently written assuming
>>>> every line was a unique host, so it doesn't even check to see if there
>>>> is duplication. Easy fix - can shoot it to you today.
>>>
>>> But even with the fix the nice value will be the same for all processes
>>> forked there. Either all have the nice value of his low priority queue
>>> or the high priority queue.
>>
>> Agreed - nothing I can do about that, though. We only do the one qrsh
>> call, so the daemons are going to fall into a single queue, and so will
>> all their children. In this scenario, it isn't clear to me (from this
>> discussion) that I can control which queue gets used
>
> Correct.

Which I understand. Our queue setup is admittedly a bit wonky (which is probably why I'm the first one to have this issue). I'm much more concerned with things not crashing than with them absolutely having the "right" nice levels. :)

>> Should I?
>
> I can't speak for the community. Personally I would say: don't distribute
> parallel jobs among different queues at all, as some applications will
> use some internal communication about the environment variables of the
> master process to distribute them to the slaves (even if SGE's
> `qrsh -inherit ...` is called without -V, and even if Open MPI is not
> told to forward any specific environment variable). If you have a custom
> application it can work of course, but with closed source ones you can
> only test and get the experience whether it's working or not. Not to
> mention the timing issue of differently niced processes. Adjusting the
> SGE setup of the OP would be the smarter way IMO.

And I agree with that as well. I understand if the decision is made to leave the parser the way it is, given that my setup is outside the norm.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Thu, 15 Mar 2012 at 11:38am, Ralph Castain wrote:

> No, I'll fix the parser as we should be able to run anyway. Just can't
> guarantee which queue the job will end up in, but at least it -will- run.

Makes sense to me. Thanks!

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
On Thu, 15 Mar 2012 at 11:49am, Ralph Castain wrote:

> Here's the patch: I've set it up to go into 1.5, but not 1.4 as that
> series is being closed out. Please let me know if this solves the problem
> for you.

I couldn't get the included inline patch to apply to 1.5.4 (probably my issue), but I downloaded it from <https://svn.open-mpi.org/trac/ompi/changeset/26148> and applied that. My test job ran just fine, and looking at the nodes verified a single orted process per node despite SGE assigning slots in multiple queues.

In short, WORKSFORME. Thanks!

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
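For anyone following along, applying the downloaded changeset and rebuilding amounts to something like this (a sketch; the saved filename and the patch -p strip level are assumptions that depend on how the diff was exported):

cd openmpi-1.5.4
patch -p0 < ../r26148.diff
./configure --prefix=$HOME/ompi-1.5.4-patched --with-sge
make && make install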
Re: [OMPI users] mpicc command not found - Fedora
On Thu, 29 Mar 2012 at 7:45pm, Rohan Deshpande wrote:

> I have installed mpi successfully on fedora using *yum install openmpi
> openmpi-devel openmpi-libs*

What version of Fedora are you using, and on what architecture (i.e. i686 or x86_64)? As far as I can see, the last Fedora distro to use openmpi-libs was Fedora 11, which is rather old and unsupported.

> I have also added */usr/lib/openmpi/bin* to *PATH* and *LD_LIBRARY_PATH*
> variables. But when I try to compile my program using *mpicc hello.c* or
> */usr/lib/openmpi/bin/mpicc hello.c* I get error saying *mpicc: command
> not found*
>
> I checked the contents of /usr/lib/openmpi/bin and there is no mpicc...
> here is the screenshot

Current versions of Fedora use the "module" command to load the proper environment for Open MPI. On a 64bit machine, e.g., one would run "module load openmpi-x86_64" to get all the env variables properly set. But I don't know what Fedora version that started with.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
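As a concrete illustration (the module name shown assumes a 64-bit install; `module avail` lists what a given Fedora release actually provides):

module avail
module load openmpi-x86_64    # sets PATH and LD_LIBRARY_PATH for the packaged Open MPI
which mpicc
mpicc hello.c -o hello
mpirun -np 2 ./hello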