Re: [OMPI users] Open MPI v1.2.5 released
Hi Warner. The simplest way would certainly be to launch your job with the mpirun --nolocal option. If you're sure you want a hostfile-based way to set this, simply removing the head node from the hostfile would also work (a short example of both approaches follows after the quoted message).

--
--Kris
叶ってしまう夢は本当の夢と言えん。 [A dream that comes true can't really be called a dream.]

Warner Yuen wrote:
> Date: Wed, 9 Jan 2008 12:50:09 -0800
> From: Warner Yuen
> Subject: Re: [OMPI users] Open MPI v1.2.5 released
> To: us...@open-mpi.org
> Message-ID:
> Content-Type: text/plain; charset="us-ascii"
>
> Thanks to Brian Barrett, I was able to get through some ugly Intel compiler bugs during the configure script. I now have OMPI v1.2.5 running nicely under Mac OS X v10.5 Leopard!
>
> However, I have a question about hostfiles. I would like to manually launch MPI jobs from my head node, but I don't want the jobs to run on the head node. In LAM/MPI I could add a "hostname schedule=no" to the hostfile; is there an equivalent in Open MPI? I'm sure this has come up before, but I couldn't find an answer in the archives.
>
> Thanks,
>
> -Warner
>
> Warner Yuen
> Scientific Computing Consultant
> Apple Computer
> email: wy...@apple.com
> Tel: 408.718.2859
> Fax: 408.715.0133
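For reference, a minimal illustration of both approaches: keeping the head node in the hostfile but skipping the node mpirun is invoked on with --nolocal, or leaving the head node out of the hostfile entirely. The hostnames, slot counts, and application name here are made up; adjust them to your cluster.

  # Option 1: keep the head node listed, but skip the node mpirun runs on
  $ cat hostfile
  headnode slots=2
  node01 slots=4
  node02 slots=4
  $ mpirun --nolocal --hostfile hostfile -np 8 ./my_mpi_app

  # Option 2: leave the head node out of the hostfile altogether
  $ cat hostfile
  node01 slots=4
  node02 slots=4
  $ mpirun --hostfile hostfile -np 8 ./my_mpi_app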
Re: [OMPI users] mixed myrinet/non-myrinet nodes
We also have a mixed Myrinet/IP cluster, and maybe I'm missing some nuance of your configuration, but Open MPI seems to work fine for me "as is" with no --mca options across mixed nodes. There are a bunch of warnings at the beginning where the non-MX nodes realize they don't have Myrinet cards and the MX nodes realize they can't talk MX to the non-MX nodes, but everything completes fine, so I assumed Open MPI was working out the transport details on its own (and was quite pleased about that). I just did a quick test to confirm that it is in fact still using MX in that situation, and it is. I'm running Open MPI 1.2.4 and MX 1.2.3.

It sounds to me, based on those "PML add procs failed" messages, that Open MPI is dying on startup on the non-MX nodes unless you explicitly disable MX at runtime (perhaps because they're expecting the MX library to be there, but it's not?). The two invocations discussed in this thread are collected in the short example after the quoted message below.

users-request-at-open-mpi.org |openmpi-users/Allow| wrote:
> Date: Tue, 15 Jan 2008 10:25:00 -0500 (EST)
> From: M D Jones
> Subject: Re: [OMPI users] mixed myrinet/non-myrinet nodes
> To: Open MPI Users
> Message-ID:
> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>
> Hmm, that combination seems to hang on me - but '--mca pml ob1 --mca btl ^mx' does indeed do the trick. Many thanks!
>
> Matt
>
> On Tue, 15 Jan 2008, George Bosilca wrote:
>
>> This case actually works. We ran into it a few days ago, when we discovered that one of the compute nodes in a cluster didn't get its Myrinet card installed properly ... The performance was horrible but the application ran to completion.
>>
>> You will have to use the following flags: --mca pml ob1 --mca btl mx,tcp,self

--
--Kris
叶ってしまう夢は本当の夢と言えん。 [A dream that comes true can't really be called a dream.]
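For later readers, the two invocations from this thread side by side. The --mca flags are exactly as quoted above; the process count, hostfile name, and executable name are placeholders.

  # Let the ob1 PML pick among the MX, TCP and self BTLs on a per-peer basis:
  $ mpirun --mca pml ob1 --mca btl mx,tcp,self -np 16 -hostfile hosts ./my_mpi_app

  # Or exclude the MX BTL entirely on a heterogeneous run:
  $ mpirun --mca pml ob1 --mca btl ^mx -np 16 -hostfile hosts ./my_mpi_app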
Re: [OMPI users] mixed myrinet/non-myrinet nodes
> Subject: Re: [OMPI users] mixed myrinet/non-myrinet nodes
> From: M D Jones (jonesm_at_[hidden])
> Date: 2008-01-15 14:07:19
>
> Hmm, that is the way that I expected it to work as well - we see the warnings also, but closely followed by the errors (I've been trying both 1.2.5 and a recent 1.3 snapshot with the same behavior). You don't have the mx driver loaded on the nodes that do not have a myrinet card, do you?

Well, the driver isn't "loaded" (i.e., the kernel module isn't loaded), but the library (libmyriexpress.so) is available. If that library isn't available, Open MPI will probably fail when it tries to call the MX functions (even if only to find that there's no Myrinet card available).

> Our mx is a touch behind yours (1.2.3), but I agree that it appears to be something in the process startup that is at fault, so it doesn't seem likely that the mx version is to blame (perhaps just the fact that it is not installed on those nodes?).
>
> Matt
>
> On Wed, 16 Jan 2008, 8mj6tc902_at_[hidden] wrote:
>
>> We also have a mixed Myrinet/IP cluster, and maybe I'm missing some nuance of your configuration, but Open MPI seems to work fine for me "as is" with no --mca options across mixed nodes (there's a bunch of warnings at the beginning where the non-MX nodes realize they don't have Myrinet cards and the MX nodes realize they can't talk MX to the non-MX nodes, but everything completes fine, so I assumed Open MPI was working out the transport details on its own (and was quite pleased about that)).
>>
>> I just did a quick test to confirm that it is in fact still using MX in that situation, and it is. I'm running Open MPI 1.2.4 and MX 1.2.3.
>>
>> It sounds to me based on those "PML add procs failed" messages that Open MPI is dying on startup on the non-MX nodes unless you explicitly disable MX at runtime (perhaps because they're expecting the MX library to be there, but it's not?)
>>
>> users-request-at-open-mpi.org |openmpi-users/Allow| wrote:
>>> Date: Tue, 15 Jan 2008 10:25:00 -0500 (EST)
>>> From: M D Jones
>>> Subject: Re: [OMPI users] mixed myrinet/non-myrinet nodes
>>> To: Open MPI Users
>>> Message-ID:
>>> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>>>
>>> Hmm, that combination seems to hang on me - but '--mca pml ob1 --mca btl ^mx' does indeed do the trick. Many thanks!
>>>
>>> Matt
>>>
>>> On Tue, 15 Jan 2008, George Bosilca wrote:
>>>
>>>> This case actually works. We ran into it a few days ago, when we discovered that one of the compute nodes in a cluster didn't get its Myrinet card installed properly ... The performance was horrible but the application ran to completion.
>>>>
>>>> You will have to use the following flags: --mca pml ob1 --mca btl mx,tcp,self

--
--Kris
叶ってしまう夢は本当の夢と言えん。 [A dream that comes true can't really be called a dream.]
Re: [OMPI users] openmpi credits for eager messages
That would make sense. I was able to break Open MPI by having Node A wait for messages from Node B. Node B is in fact sleeping while Node C bombards Node A with a few thousand messages. After a while Node B wakes up and sends Node A the message it's been waiting on, but Node A has long since been buried and segfaults. If I decrease the number of messages C is sending, it works properly. This was on Open MPI 1.2.4, using (I think) the SM BTL - it might have been MX or TCP, but certainly not InfiniBand. I could dig up the test and try again if anyone is seriously curious; a bare-bones sketch of the setup follows at the end of this message.

Trying the same test on MPICH/MX went very, very slowly (I don't think they have any clever buffer management), but it didn't crash.

Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com |openmpi-users/Allow| wrote:
> Hi,
>
> I am readying an openmpi 1.2.5 software stack for use with a many-thousand core cluster. I have a question about sending small messages that I hope can be answered on this list.
>
> I was under the impression that if node A wants to send a small MPI message to node B, it must have a credit to do so. The credit assures A that B has enough buffer space to accept the message. Credits are required by the mpi layer regardless of the BTL transport layer used.
>
> I have been told by a Voltaire tech that this is not so, the credits are used by the infiniband transport layer to reliably send a message, and is not an openmpi feature.
>
> Thanks,
> Federico
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
--Kris
叶ってしまう夢は本当の夢と言えん。 [A dream that comes true can't really be called a dream.]
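For anyone who wants to reproduce something like this, here is a bare-bones sketch of the scenario described above. It is not the original test (a simplified version of that is attached later in the thread); the rank assignments, message count, and message size are arbitrary, and whether a message is actually delivered eagerly depends on the BTL's eager limit.

#include <mpi.h>
#include <cstdio>
#include <unistd.h>   // sleep()
#include <vector>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 3) {                 // needs the three roles described above
        if (rank == 0) std::fprintf(stderr, "run with at least 3 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int reps   = 5000;          // how hard "Node C" hammers "Node A"
    const int buflen = 5000;          // eager vs. rendezvous depends on the BTL's eager limit
    std::vector<char> buf(buflen, 0);
    MPI_Status st;

    if (rank == 0) {                  // "Node A": waits on B first, so everything from C is unexpected
        MPI_Recv(&buf[0], buflen, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        for (int i = 0; i < reps; ++i)
            MPI_Recv(&buf[0], buflen, MPI_CHAR, 2, 0, MPI_COMM_WORLD, &st);
    } else if (rank == 1) {           // "Node B": sleeps, then sends the message A is waiting on
        sleep(30);
        MPI_Send(&buf[0], buflen, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {           // "Node C": bombards A with small messages
        for (int i = 0; i < reps; ++i)
            MPI_Send(&buf[0], buflen, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }                                 // ranks above 2 simply idle

    MPI_Finalize();
    return 0;
}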
Re: [OMPI users] openmpi credits for eager messages
Wow, this sparked a much more heated discussion than I was expecting. I was just commenting that the behaviour the original author (Federico Sacerdoti) mentioned would explain something I observed in one of my early trials of Open MPI. But anyway, because it seems that quite a few people were interested, I've attached a simplified version of the test I was describing (with all the timing checks and some of the crazier output removed). Now that I go back and retest this, it turns out that it wasn't actually a segfault that was killing it, but running out of memory, as you and others have predicted.

Brian W. Barrett brbarret-at-open-mpi.org |openmpi-users/Allow| wrote:
> Now that this discussion has gone way off into the MPI standard woods :).
>
> Was your test using Open MPI 1.2.4 or 1.2.5 (the one with the segfault)? There was definitely a bug in 1.2.4 that could cause exactly the behavior you are describing when using the shared memory BTL, due to a silly delayed initialization bug/optimization.

I'm still using Open MPI 1.2.4, and actually the SM BTL seems to be the hardest to break (I guess I'm dodging the bullet on that delayed initialization bug you're referring to).

> If you are using the OB1 PML (the default), you will still have the possibility of running the receiver out of memory if the unexpected queue grows without bounds. I'll withhold my opinion on what the standard says so that we can perhaps actually help you solve your problem and stay out of the weeds :). Note however, that in general unexpected messages are a bad idea and thousands of them from one peer to another should be avoided at all costs -- this is just good MPI programming practice.

Actually, I was expecting to break something with this test; I just wanted to find out where it broke. Lesson learned: I wrote my more serious programs doing exactly that (no unexpected messages; a small sketch of one such throttling pattern follows after the quoted thread below). I was just surprised that the default Open MPI settings allowed me to flood the system so easily, whereas MPICH/MX still finished no matter what I threw at it (albeit with terrible performance in the bad cases).

> Now, if you are using MX, you can replicate MPICH/MX's behavior (including the very slow part) by using the CM PML (--mca pml cm on the mpirun command line), which will use the MX library message matching and unexpected queue and therefore behave exactly like MPICH/MX.

That works exactly as you described, and it does indeed prevent memory usage from going wild due to the unexpected messages. Thanks for your help! (And to the others for the educational discussion!)

> Brian
>
> On Sat, 2 Feb 2008, 8mj6tc...@sneakemail.com wrote:
>
>> That would make sense. I was able to break Open MPI by having Node A wait for messages from Node B. Node B is in fact sleeping while Node C bombards Node A with a few thousand messages. After a while Node B wakes up and sends Node A the message it's been waiting on, but Node A has long since been buried and segfaults. If I decrease the number of messages C is sending, it works properly. This was on Open MPI 1.2.4 (using, I think, the SM BTL - it might have been MX or TCP, but certainly not InfiniBand. I could dig up the test and try again if anyone is seriously curious).
>>
>> Trying the same test on MPICH/MX went very, very slowly (I don't think they have any clever buffer management), but it didn't crash.
>> >> Sacerdoti, Federico Federico.Sacerdoti-at-deshaw.com >> |openmpi-users/Allow| wrote: >>> Hi, >>> >>> I am readying an openmpi 1.2.5 software stack for use with a >>> many-thousand core cluster. I have a question about sending small >>> messages that I hope can be answered on this list. >>> >>> I was under the impression that if node A wants to send a small MPI >>> message to node B, it must have a credit to do so. The credit assures A >>> that B has enough buffer space to accept the message. Credits are >>> required by the mpi layer regardless of the BTL transport layer used. >>> >>> I have been told by a Voltaire tech that this is not so, the credits are >>> used by the infiniband transport layer to reliably send a message, and >>> is not an openmpi feature. >>> >>> Thanks, >>> Federico >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- --Kris 叶ってしまう夢は本当の夢と言えん。 [A dream that comes true can't really be called a dream.] #include #include #include #include #include //for atoi (in case someone doesn't have boost) const int buflen=5000; int main(int argc, char *argv[]) { using namespace std; int reps=1000; if(argc>1){ //optionally specify number of repeats on the command line reps=atoi(argv[1]); } int numprocs, rank, namelen; char processor_name[MPI_MAX_PROCESSOR_NAME]; M
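As a follow-up to Brian's point about unexpected messages, here is one generic way to keep them bounded: the sender pauses for a small acknowledgement every WINDOW messages, so at most one window's worth of sends can sit in the receiver's unexpected queue. This is an illustrative pattern only, not code from the attached test; the window size, message size, message count, and rank assignment are all arbitrary.

#include <mpi.h>
#include <vector>

const int WINDOW = 64;       // assumed batch size; anything modest works
const int BUFLEN = 5000;
const int REPS   = 5000;

// Sender: after every WINDOW messages, block until the receiver acknowledges.
void sender(int dest) {
    std::vector<char> buf(BUFLEN, 'x');
    for (int i = 0; i < REPS; ++i) {
        MPI_Send(&buf[0], BUFLEN, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
        if ((i + 1) % WINDOW == 0) {
            int ack = 0;
            MPI_Recv(&ack, 1, MPI_INT, dest, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
}

// Receiver: consume a window of messages, then release the sender's next window.
void receiver(int src) {
    std::vector<char> buf(BUFLEN);
    for (int i = 0; i < REPS; ++i) {
        MPI_Recv(&buf[0], BUFLEN, MPI_CHAR, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if ((i + 1) % WINDOW == 0) {
            int ack = i;
            MPI_Send(&ack, 1, MPI_INT, src, 1, MPI_COMM_WORLD);
        }
    }
}

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)      receiver(1);   // needs at least 2 ranks; any extras stay idle
    else if (rank == 1) sender(0);
    MPI_Finalize();
    return 0;
}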
[OMPI users] Proper way to throw an error to all nodes?
So I'm working on this program which has many ways it might possibly die at runtime, but one that happens frequently is the user typing a wrong (non-existent) filename at the command prompt. As it is now, the node looking for the file notices that the file doesn't exist and tries to terminate the program. It calls MPI_Finalize(), but the other nodes are all waiting for a message from the node doing the file reading, so MPI_Finalize waits forever until the user realizes the job isn't doing anything and terminates it manually.

So, my question is: what's the "correct", graceful way to handle situations like this? Is there some MPI function which can basically throw an exception to all other nodes telling them to bail out now? Or is the correct behaviour just to have the node that spotted the error die quietly and wait for the others to notice? (A bare-bones sketch of the situation follows below.)

Thanks for any suggestions!

--
--Kris
叶ってしまう夢は本当の夢と言えん。 [A dream that comes true can't really be called a dream.]
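To make the situation concrete, here is a bare-bones sketch of what is described above; the file handling and message shape are made up. The MPI_Abort call marked in the comment is the commonly suggested way to tear the whole job down from a single rank, rather than calling MPI_Finalize and hanging.

#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int len = 0;                        // something derived from the input file
    if (rank == 0) {
        std::FILE* f = (argc > 1) ? std::fopen(argv[1], "r") : NULL;
        if (!f) {
            std::fprintf(stderr, "cannot open input file\n");
            // Calling MPI_Finalize() here and returning leaves every other
            // rank stuck in the MPI_Bcast below, which is the hang described
            // above. MPI_Abort instead asks the runtime to terminate all
            // processes attached to the communicator:
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        // ... read the file, fill in len, etc. ...
        std::fclose(f);
    }
    // Every other rank waits here for the rank doing the file reading.
    MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}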
[OMPI users] Problems with MPI_Issend and MX
Hi. I've now spent many, many hours tracking down a bug that was causing my program to die, as though either its memory were getting corrupted or messages were getting clobbered while going through the network; I couldn't tell which. I really wish the checksum flag on btl_mx_flags were working. But anyway, I think I've managed to recreate the core of the problem in a small-ish test case which I've attached (verifycontent.cc). This usually segfaults at MPI_Issend after sending about 60-90 messages for me while using Open MPI 1.3.2 with Myricom's mx-1.2.9 drivers on Linux, compiled with gcc 4.3.2. Disabling the mx btl (mpirun -mca btl ^mx) makes it work (likewise for my own larger project, Murasaki). The MPI_Ssend-using version (verifycontent-ssend.cc) also works no problem over MX. So I suspect the issue lies in Open MPI 1.3.2's handling of MPI_Issend over MX, but it's also possible I've horribly misunderstood something fundamental about MPI and it's just my fault, so if that's the case, please let me know (both this test case and Murasaki work over MPICH/MX, though, so Open MPI is definitely doing something different).

Here's a brief description of verifycontent.cc to make reading it easier (a stripped-down sketch of the same send/receive pattern also follows after the attached code):
* given -np=N, half the nodes will be sending, half will be receiving some number of messages (reps)
* each message consists of buflen (5000) chars, set to some value based on the sending node's rank and the sequence number of the message
* the receiving node starts an irecv for each sending node, then tests each request until a message arrives
* the receiver then checks the contents of the message to make sure it matches what was supposed to be in there (this is where my real project, Murasaki, actually fails; I can't seem to replicate that, however)
* the senders meanwhile keep sending messages and dequeuing them when their requests test as completed.

Testing out the current subversion trunk version, 1.4a1r21594: that seems to pass my test case, but it also tends to show errors like "mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)" on startup, and Murasaki still fails (messages turn into zeros about 132KB in), so something still isn't right...

If anyone has any ideas about this test case failing, or my larger issue of messages turning into zeros after 132KB (though sadly sometimes it isn't at 132KB, but straight from 0KB, which is very confusing) while on MX, I'd greatly appreciate it. Even a simple confirmation of "Yes, MPI_Issend/Irecv with MX has issues in 1.3.2" would help my sanity.

--
Kris Popendorf

Keio University
http://murasaki...
<- (Probably too cumbersome to expect most people to test, but if you feel daring, try putting in some Human/Mouse chromosomes over MX) #include #include #include #include #include #include #include //for atoi (in case someone doesn't have boost) const int buflen=5*24; int numprocs, rank, namelen; char processor_name[MPI_MAX_PROCESSOR_NAME]; using namespace std; class Message { public: MPI_Request req; MPI_Status status; char buffer[buflen]; int count; void reset(char val){ memset(buffer,val,sizeof(char)*buflen); } Message(): count(0) { reset(rank); } Message(int _count) : count(_count) { reset(count+rank+1); } bool preVerify(){ char content=rank; for(int b=0;b"<< bi << " = "<< (int)buffer[bi]<"<< bi << " = "<< (int)buffer[bi]<1){ //optionally specify number of repeats on the command line reps=atoi(argv[1]); } MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &numprocs); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Get_processor_name(processor_name, &namelen); int senders=numprocs/2; int receivers=numprocs-senders; assert(senders>0); assert(receivers>0); cout << "Process "< > sendQs(receivers); vector counts(receivers,0); for(int i=0;i &sendQ=sendQs[receiver]; int target=receiver+senders; sendQ.push_back(Message(counts[receiver]++)); Message &msg=sendQ.back(); char content=msg.count+rank+1; //confirm that everything we're sending hasn't been corrupted assert(msg.buffer); // cerr << rank<< ">Starting send "<:Started send "<::iterator ite=sendQ.begin();ite!=sendQ.end();){ MPI_Test(&ite->req,&f,&ite->status); if(f){ // cerr << "Send "
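For orientation, a stripped-down sketch of the communication pattern described above: the first half of the ranks repeatedly MPI_Issend buffers whose bytes encode the sender's rank and sequence number, and the second half MPI_Irecv and verify them. This is not verifycontent.cc itself; the constants, the one-to-one pairing of ranks, and the error reporting are all simplified.

#include <mpi.h>
#include <iostream>
#include <list>
#include <vector>

const int BUFLEN = 5000;
const int REPS   = 200;

struct Pending {                        // one in-flight Issend and its buffer
    MPI_Request req;
    std::vector<char> buf;
};

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2 || nprocs % 2 != 0) {      // keep the sender/receiver pairing simple
        if (rank == 0) std::cerr << "run with an even number of ranks\n";
        MPI_Finalize();
        return 1;
    }
    const int senders = nprocs / 2;           // ranks [0, senders) send, the rest receive

    if (rank < senders) {
        const int dest = rank + senders;
        std::list<Pending> queue;
        for (int seq = 0; seq < REPS; ++seq) {
            queue.push_back(Pending());
            Pending& p = queue.back();
            p.buf.assign(BUFLEN, static_cast<char>(rank + seq + 1));
            MPI_Issend(&p.buf[0], BUFLEN, MPI_CHAR, dest, seq, MPI_COMM_WORLD, &p.req);
            // Dequeue whatever has completed so far.
            for (std::list<Pending>::iterator it = queue.begin(); it != queue.end();) {
                int done = 0;
                MPI_Test(&it->req, &done, MPI_STATUS_IGNORE);
                if (done) it = queue.erase(it);
                else      ++it;
            }
        }
        while (!queue.empty()) {               // drain the rest before finalizing
            MPI_Wait(&queue.front().req, MPI_STATUS_IGNORE);
            queue.pop_front();
        }
    } else {
        const int src = rank - senders;        // the matching sender for this receiver
        std::vector<char> buf(BUFLEN);
        for (int seq = 0; seq < REPS; ++seq) {
            MPI_Request req;
            MPI_Irecv(&buf[0], BUFLEN, MPI_CHAR, src, seq, MPI_COMM_WORLD, &req);
            int done = 0;
            while (!done)                      // poll until this message arrives
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
            const char expected = static_cast<char>(src + seq + 1);
            for (int b = 0; b < BUFLEN; ++b)
                if (buf[b] != expected)
                    std::cerr << "rank " << rank << ": byte " << b
                              << " of message " << seq << " is corrupted\n";
        }
    }

    MPI_Finalize();
    return 0;
}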
Re: [OMPI users] Problems with MPI_Issend and MX
Scott,

Thanks for your advice! Good to know about the checksum debug functionality! Strangely enough, running with either "MX_CSUM=1" or "-mca pml cm" allows Murasaki to work normally, and makes the test case I attached in my previous mail work. Very suspicious, but at least this does make a functional solution (the exact invocations are summarized at the end of this message). However, if I understand Open MPI correctly, I shouldn't be able to use the CM PML over a network where some nodes have MX and some don't, correct?

Scott Atchley atchley-at-myri.com |openmpi-users/Allow| wrote:
> Hi Kris,
>
> I have not run your code yet, but I will try to this weekend.
>
> You can have MX checksum its messages if you set MX_CSUM=1 and use the MX debug library (e.g. LD_LIBRARY_PATH to /opt/mx/lib/debug).
>
> Do you have the problem if you use the MX MTL? To test it modify your mpirun as follows:
>
> $ mpirun -mca pml cm ...
>
> and do not specify any BTL info.
>
> Scott
>
> On Jul 2, 2009, at 6:05 PM, 8mj6tc...@sneakemail.com wrote:
>
>> Hi. I've now spent many many hours tracking down a bug that was causing my program to die, as though either its memory were getting corrupted or messages were getting clobbered while going through the network, I couldn't tell which. I really wish the checksum flag on btl_mx_flags were working. But anyway, I think I've managed to recreate the core of the problem in a small-ish test case which I've attached (verifycontent.cc). This usually segfaults at MPI_Issend after sending about 60-90 messages for me while using OpenMPI 1.3.2 with myricom's mx-1.2.9 drivers on linux using gcc 4.3.2. Disabling the mx btl (mpirun -mca btl ^mx) makes it work (likewise, the same for my own larger project (Murasaki)). The MPI_Ssend using version (verifycontent-ssend.cc) also works no problem over mx. So I suspect the issue lies in OpenMPI 1.3.2's handling of MPI_Issend over mx, but it's also possible I've horribly misunderstood something fundamental about MPI and it's just my fault, so if that's the case, please let me know (but both this test case and Murasaki work over mpichmx, so OpenMPI is definitely doing something different).
>>
>> Here's a brief description of verifycontent.cc to make reading it easier:
>> * given -np=N, half the nodes will be sending, half will be receiving some number of messages (reps)
>> * each message consists of buflen (5000) chars, set to some value based on the sending node's rank and the sequence number of the message
>> * the receiving node starts an irecv for each sending node, tests each request until a message arrives
>> * the receiver then checks the contents of the message to make sure it matches what was supposed to be in there (this is where my real project, Murasaki, fails actually. I can't seem to replicate that however).
>> * the senders meanwhile keep sending messages and dequeuing them when their request tests as completed.
>>
>> Testing out the current subversion trunk version, 1.4a1r21594, that seems to pass my test case, but also tends to show errors like "mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)" on startup, and Murasaki still fails (messages turn into zeros about 132KB in), so something still isn't right...
>>
>> If anyone has any ideas about this test case failing, or my larger issue of messages turning into zeros after 132KB (though sadly sometimes it isn't at 132KB, but straight from 0KB, which is very confusing) while on MX, I'd greatly appreciate it. Even a simple confirmation of "Yes, MPI_Issend/Irecv with MX has issues in 1.3.2" would help my sanity.
>> --
>> Kris Popendorf
>>
>> Keio University
>> http://murasaki... <- (Probably too cumbersome to expect most people to test, but if you feel daring, try putting in some Human/Mouse chromosomes over MX)
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
--Kris
叶ってしまう夢は本当の夢と言えん。 [A dream that comes true can't really be called a dream.]
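Putting Scott's two suggestions together, invocations along these lines exercise them. The debug-library path is the one Scott quotes; the host list, process count, and program name are examples only.

  # MX debug library with checksumming enabled (exported to the remote ranks with -x):
  $ export MX_CSUM=1
  $ export LD_LIBRARY_PATH=/opt/mx/lib/debug:$LD_LIBRARY_PATH
  $ mpirun -x MX_CSUM -x LD_LIBRARY_PATH -np 8 -hostfile hosts ./verifycontent

  # MX MTL via the CM PML instead of the MX BTL (no BTL options given):
  $ mpirun -mca pml cm -np 8 -hostfile hosts ./verifycontent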