Re: [OMPI users] large jobs hang on startup (deadlock?)
Hi Ralph,

Unfortunately, adding "-mca pls_rsh_num_concurrent 50" to mpirun (with just -np and -hostfile) has no effect. The number of established connections for slapd grows to the same number at the same rate as without it. BTW, I upgraded from 1.2b2 to 1.2b3.

Thanks,
Todd

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, February 06, 2007 6:48 PM
To: Open MPI Users
Subject: Re: [OMPI users] large jobs hang on startup (deadlock?)

Hi Todd

Just as a thought - you could try not using --debug-daemons or -d and instead setting "-mca pls_rsh_num_concurrent 50" or some such small number. This will tell the system to launch 50 ssh calls at a time, waiting for each group to complete before launching the next. You can't use it with --debug-daemons, as that option prevents the ssh calls from "closing" so that you can get the output from the daemons.

You can still launch as big a job as you like - we'll just do it 50 ssh calls at a time. If we are truly overwhelming the slapd, then this should alleviate the problem.

Let me know if you get to try it...

Ralph

On 2/6/07 4:05 PM, "Heywood, Todd" wrote:

Hi Ralph,

It looks that way. I created a user local to each node, with local authentication via /etc/passwd and /etc/shadow, and OpenMPI scales up just fine for that. I know this is an OpenMPI list, but does anyone know how common or uncommon LDAP-based clusters are? I would have thought this issue would have arisen elsewhere, but Googling MPI+LDAP (and similar) doesn't turn up much. I'd certainly be willing to test any patch.

Thanks,
Todd

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph H Castain
Sent: Tuesday, February 06, 2007 9:54 AM
To: Open MPI Users
Subject: Re: [OMPI users] large jobs hang on startup (deadlock?)

It sounds to me like we are probably overwhelming your slapd - your test would seem to indicate that slowing down the slapd makes us fail even with smaller jobs, which tends to support that idea. We frankly haven't encountered that before, since our rsh tests have all been done using non-LDAP authentication (basically, we ask that you set up rsh to auto-authenticate on each node). It sounds like we need to add an ability to slow down so that the daemon doesn't "fail" due to authentication timeout and/or slapd rejection due to the queue being full.

This may take a little time to fix due to other priorities, and will almost certainly have to be released in a subsequent 1.2.x version. Meantime, I'll let you know when I get something to test - would you be willing to give it a shot if I provide a patch? I don't have access to an LDAP-based system.

Ralph

On 2/6/07 7:44 AM, "Heywood, Todd" wrote:

Hi Ralph,

Thanks for the reply. This is a tough one. It is OpenLDAP. I had thought that I might be hitting a file descriptor limit for slapd (the LDAP daemon), which ulimit -n does not affect (you have to rebuild LDAP with a different FD_SETSIZE variable). However, I simply turned on more expressive logging to /var/log/slapd, and that resulted in smaller jobs (which successfully ran before) hanging. Go figure. It appears that daemons are up and running (from ps), and everything hangs in MPI_Init. Ctrl-C gives:

[blade1:04524] ERROR: A daemon on node blade26 failed to start as expected.
[blade1:04524] ERROR: There may be more information available from
[blade1:04524] ERROR: the remote shell (see above).
[blade1:04524] ERROR: The daemon exited unexpectedly with status 255.

I'm interested in any suggestions, semi-fixes, etc. which might help get to the bottom of this. Right now: whether the daemons are indeed up and running, or if there are some that are not (causing MPI_Init to hang).

Thanks,
Todd

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph H Castain
Sent: Tuesday, February 06, 2007 8:52 AM
To: Open MPI Users
Subject: Re: [OMPI users] large jobs hang on startup (deadlock?)

Well, I can't say for sure about LDAP. I did a quick search and found two things:

1. there are limits imposed in LDAP that may apply to your situation, and
2. that statement varies tremendously depending upon the specific LDAP implementation you are using.

I would suggest you see which LDAP you are using and contact the respective organization to ask if they do have such a limit, and if so, how to adjust it. It sounds like maybe we are hitting the LDAP server with too many requests too rapidly. Usually, the issue is not starting fast enough, so this is a new one! We don't currently check to see if everything started up okay, so that is why the processes might hang - we hope to fix that soon. I'll have to see if there is something we can do to help
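For reference, a hedged sketch of the kind of invocation being tested above - the process count, hostfile name, and executable are placeholders rather than values from the thread; only the -mca pls_rsh_num_concurrent setting comes from the discussion:

    mpirun -np 256 -hostfile myhostfile -mca pls_rsh_num_concurrent 50 ./myapp

With that parameter set, the launcher is expected to hold to 50 concurrent ssh calls at a time instead of starting all of them at once.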
Re: [OMPI users] large jobs hang on startup (deadlock?)
Hi Todd

I truly appreciate your patience. If the rate was the same with that switch set, then that would indicate to me that we aren't having trouble getting through the slapd - it probably isn't a problem with how hard we are driving it, but rather with the total number of connections being created. Basically, we need to establish one connection per node to launch the orteds (the app procs are just fork/exec'd by the orteds, so they shouldn't see the slapd).

The issue may have to do with limits on the total number of LDAP authentication connections allowed for one user. I believe that is settable, but I will have to look it up and/or ask a few friends who might know. I have not seen an LDAP-based cluster before (though authentication onto the head node of a cluster is frequently handled that way), but that doesn't mean someone hasn't done it.

Again, appreciate the patience.

Ralph
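One way to watch the connection count being discussed here - this assumes slapd is listening on the default LDAP port (389) and that the command is run on the LDAP server; it is an illustration, not a command taken from the thread:

    # count established TCP connections to slapd, refreshed every 2 seconds
    watch -n 2 'netstat -tan | grep ":389 " | grep -c ESTABLISHED'

Comparing how this number grows with and without pls_rsh_num_concurrent set is essentially the test Todd describes.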
Re: [OMPI users] large jobs hang on startup (deadlock?)
Hi Ralph,

Patience is not an issue since I have a workaround (a locally authenticated user), and other users are not running large enough MPI jobs to hit this problem.

I'm a bit confused now though. I thought that setting this switch would set off 50 ssh sessions at a time, or 50 connections to slapd. I.e., a second group of 50 connections wouldn't initiate until the first group "closed" their sessions, which should be reflected by a corresponding decrease in the number of established connections for slapd. So my conclusion was that no sessions are "closing".

There's also the observation that when slapd is slowed down by (extensive) logging, things hang with a smaller number of established connections (open ssh sessions). I don't see how this fits with a limitation on the total number of connections.

Thanks,
Todd
Re: [OMPI users] large jobs hang on startup (deadlock?)
On 2/7/07 12:07 PM, "Heywood, Todd" wrote:

> Hi Ralph, Patience is not an issue since I have a workaround (a locally authenticated user), and other users are not running large enough MPI jobs to hit this problem. I'm a bit confused now though. I thought that setting this switch would set off 50 ssh sessions at a time, or 50 connections to slapd. I.e., a second group of 50 connections wouldn't initiate until the first group "closed" their sessions, which should be reflected by a corresponding decrease in the number of established connections for slapd. So my conclusion was that no sessions are "closing".

The way the rsh launcher works is to fork/exec num_concurrent rsh/ssh sessions and watch as each one "closes" the connection back to the HNP. When that block has cleared, we then begin launching the next one. Note that we are talking here about closure of the stdin/stdout connections - i.e., the orteds "daemonize" themselves after launch, thus severing their stdin/stdout relationship back to the HNP.

It is possible that this mechanism isn't actually limiting the launch rate - e.g., the orteds may daemonize themselves so quickly that the block launch doesn't help. In a soon-to-come future version, we won't use that mechanism for determining when to launch the next block - my offer of a patch was to give you that new version now, modify it to more explicitly limit the launch rate, and see if that helped. I'll try to put that together in the next week or so.

Since you observed that the pls_rsh_num_concurrent option had no impact on the *rate* at which we launched, that would indicate that either the slapd connection isn't bottlenecking - the time to authenticate is showing as independent of the rate at which we are hitting the slapd - or we are not rate limiting as we had hoped. Hence my comment that it may not look like a rate issue.

As I said earlier, we have never tested this with LDAP. From what I understand of LDAP (which is limited, I admit), the ssh'd process (the orted in this case) forms an authentication connection back to the slapd. It may not be possible to sever this connection during the life of that process. There typically are limits on the number of simultaneous LDAP sessions a single user can have open - mainly for security reasons - so that could also be causing the problem. Given that you also observed that the total number of nodes we could launch upon was the same regardless of the rate, it could be that we are hitting the LDAP session limit. Logging may have a broader impact than just slapd response rate - I honestly don't know.

Hope that helps - I'll pass along that patch as soon as I can.

Ralph
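As a rough illustration of the block launch described above - this is a shell sketch of the idea only, not Open MPI's actual launcher code; the hostfile name, block size, and remote daemon command are placeholders, and the plain `wait` stands in for the real "stdin/stdout closed" check:

    #!/bin/sh
    # Launch one ssh call per host, in blocks of BLOCK, waiting for each
    # block's ssh calls to return before starting the next block.
    BLOCK=50
    DAEMON_CMD="/path/to/remote_daemon"   # placeholder for the real daemon command
    i=0
    while read host; do
        ssh "$host" "$DAEMON_CMD" &
        i=$((i+1))
        [ $((i % BLOCK)) -eq 0 ] && wait
    done < myhostfile
    wait

If the remote daemon detaches quickly (as the orteds do), each ssh call returns almost immediately and the batching imposes very little throttling - which is the failure mode suggested above.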
[OMPI users] Does Open MPI "Really" support AIX?
Hello All,

We are in the process of deciding whether we should use Open MPI in an AIX environment. Our in-house testing indicates that OMPI (V 1.1.x and V 1.2.x) stdio is broken under AIX. At this point, I am trying to find out if there is a fix or workaround for this problem. I have put up another posting (see attached). One recommendation was to try a pre-release of V 1.2, which didn't make any difference. I am hoping that an OMPI developer or someone from IBM comes up with a solution.

Open MPI documentation indicates that AIX is supported, with limited testing before each release. What is limited testing? Does it mean configure, install, and running "Hello World" on one node? In short, we did configure and install V 1.1.x as well as V 1.2.x, but attempts to run even a simple test such as "mpirun -np 1 hostname" fail; see attached for more details. I have eight IBM nodes on which I could run any test to help solve this problem.

Thanks for your comments,
Ali

--- From previous posting on the OMPI users' group ---

I have installed Open MPI 1.1.2 on an IBM AIX 5.3 cluster. It looks like terminal output is broken. There are a few entries in the Open MPI archive for this problem, with no suggested solution or real workaround. I am putting up this posting with the hope of getting some advice on a workaround or solution.

# mpirun -np 1 hostname

No output; piping the command to "cat" or "more" generates no output as well. The only way to get output from this command is to add --debug-daemons:

# mpirun -np 1 --debug-daemons hostname

Even this debug option does not work for a real application that generates a lot of output. Looking forward to any comments.

Thanks
[OMPI users] first time user - can run mpi job SMP but not over cluster
Dear Open-MPI list:

I'm trying to run two (soon to be three) dual-Opteron machines as a cluster (network of workstations - they each have a disk and OS). I can ssh between machines with no password. My Open MPI code compiled fine and works great as an SMP program (using both processors on one machine). However, I am not able to run my Open MPI program in parallel between the two computers.

For SMP work I use:

mpirun -np 2 myprogram inputfile >outputfile

For cluster work I have tried:

mpirun --hostfile myhostfile -np 4 myprogram inputfile >outputfile

which does not write to the output file. I have also tried:

mpirun --hostfile myhostfile -np 4 `myprogram inputfile >outputfile`

which just ran serially on the initial machine.

The Open MPI executable and libraries are on the head node, NFS-shared to the slave node. Both computers can run the Open MPI application as an SMP program with no problems. When I am trying to run the Open MPI program with both computers, I am using a directory that is an NFS share to the other computer. I am running OpenSUSE 10.2 on both machines. I compiled with gcc 4.1 / ifort 9.1. I am using a gigabit network. My hostfile specifies slots=2 max-slots=2 for each computer. The computers are identified in the hostfile using the /etc/hosts alias.

The only config.log that I found was in the directory I used to build Open MPI; since everything works as SMP, I am not including that file with this initial message.

What should I be trying to do next to remedy this issue? Any help would be appreciated.

Thanks,
Mark Kosmowski
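For reference, a hostfile of the kind described might look like the following - the hostnames are placeholders, and the slots/max-slots syntax simply mirrors what is reported above:

    node1 slots=2 max-slots=2
    node2 slots=2 max-slots=2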
[OMPI users] install script issue
Building openmpi-1.3a1r13525 on OS X 10.4.8 (PowerPC), using my standard compile line:

./configure F77=g95 FC=g95 LDFLAGS=-lSystemStubs --with-mpi-f90-size=large --with-f90-max-array-dim=3 ; make all

and after installing I found that I couldn't compile, because of the following:

-rw-------  1 root  wheel  640216 Feb  7 14:48 libmpi_f90.a

This has not happened in the past, and I followed the same procedures I've been using for many months. One slight difference is that I installed using the command "make install all" rather than "make install"; also, I had uninstalled the previous version prior to installing this version.

Michael
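If the only problem is the mode of the installed file, a hedged workaround (the install path below is a placeholder - use wherever libmpi_f90.a actually landed) would be:

    sudo chmod a+r /usr/local/lib/libmpi_f90.a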
Re: [OMPI users] first time user - can run mpi job SMP but not over cluster
Hello,

> mpirun -np 2 myprogram inputfile >outputfile

There can be a whole host of issues with the way you run your executable and/or the way you have the environment set up. First of all, when you ssh into the node, does the environment automatically get updated with the correct Open MPI paths? I.e., LD_LIBRARY_PATH should be correctly set to the OMPI lib directory, PATH should contain OMPI's bin dir, etc. If this is not the case, you have two options:

a. create small /etc/profile.d scripts to set up those env. variables (see the sketch at the end of this message)
b. use the --prefix option when you invoke mpirun on the headnode

Generally, it would be much more helpful if you provided the actual output of running the commands you listed here.

> mpirun --hostfile myhostfile -np 4 myprogram inputfile >outputfile

Another issue I can think of is the path specification to 'myprogram'. Do you just cd into the directory where it resides and specify its name only? Try to either specify an absolute path to the executable or a path relative to your home dir: ~/appdir/bin/appexec, assuming this location is the same on all the nodes. If mpirun can't find your executable on one of the nodes, it should report that as an error.

> which does not write to the output file.

Does it write anything to stderr? You could also try invoking mpirun with '--mca pls_rsh_agent ssh'.

> mpirun --hostfile myhostfile -np 4 `myprogram inputfile >outputfile`

Are those backquotes?? I would recommend getting mpirun to invoke something basic on all the participating nodes successfully first; try

mpirun --prefix /path/to/ompi/ --hostfile myhostfile --np 4 hostname

for instance. Nothing else will work until this does.

These are just a few pointers to get you started. Hope this helps.

Alex.
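A minimal sketch of the /etc/profile.d approach mentioned in (a) above - the install prefix /opt/openmpi is an assumption; substitute the actual Open MPI installation directory:

    # /etc/profile.d/openmpi.sh  (prefix is an assumption; adjust to your install)
    export PATH=/opt/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH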