Re: [OMPI users] Passwordless ssh
Dear Reuti,

What should I do then? I am a novice with ssh and Open MPI. Can you direct me a little bit further? I am quite confused.

Thank you

> From: re...@staff.uni-marburg.de
> Date: Wed, 11 Jan 2012 12:31:07 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Passwordless ssh
>
> Hi,
>
> Am 11.01.2012 um 05:46 schrieb Ralph Castain:
>
> > You might want to ask that on the Beowulf mailing lists - I suspect it has
> > something to do with the mount procedure, but honestly have no real idea
> > how to resolve it.
> >
> > On Jan 10, 2012, at 8:45 PM, Shaandar Nyamtulga wrote:
> >
> >> Hi
> >> I built a Beowulf cluster using Open MPI following this link:
> >> http://techtinkering.com/2009/12/02/setting-up-a-beowulf-cluster-using-open-mpi-on-linux/
> >> I can ssh to my slave nodes without the slave mpiuser's password before
> >> mounting my slaves. But when I mount my slaves and do ssh, the slaves ask
> >> for their passwords again.
> >> The master's and slaves' .ssh directories and authorized_keys files have
> >> permissions 700 and 600 respectively, and they are owned only by mpiuser
> >> (via chown). The RSA key has no passphrase.
>
> It sounds like the ~/.ssh/authorized_keys on the master doesn't contain its
> own public key (on a plain server you don't need it). Hence when you mount it
> on the slaves, it's missing again.
>
> -- Reuti
Re: [OMPI users] Passwordless ssh
Am 12.01.2012 um 12:17 schrieb Shaandar Nyamtulga:

> Dear Reuti,
>
> What should I do then? I am a novice with ssh and Open MPI. Can you direct me
> a little bit further? I am quite confused.
>
> Thank you

On the file server:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

-- Reuti
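To spell out the whole recipe the thread is converging on: the steps below are a minimal sketch for an NFS-shared home directory, not taken verbatim from the thread, and the hostname "slave1" is only a placeholder.

    # Run as mpiuser on the master, whose /home/mpiuser is later mounted on the slaves.
    ssh-keygen -t rsa -N ""                            # key pair with no passphrase
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # authorize the key for this host as well
    chmod 700 ~/.ssh                                   # sshd insists on strict permissions
    chmod 600 ~/.ssh/authorized_keys
    ssh slave1 hostname                                # should now succeed without a password

Because the slaves mount the same home directory, the authorized_keys file, including the master's own public key, is then visible everywhere.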
[OMPI users] checkpointing on other transports
What would be involved in adding checkpointing support to other transports, specifically the PSM MTL? Are there (likely to be) technical obstacles, and if not, would it be a lot of work? I'm asking in case it would be easy, so that we don't have to exclude QLogic from a procurement, given that they won't respond about Open MPI support.
Re: [OMPI users] ompi + bash + GE + modules
Surely this should be on the gridengine list -- and it's in recent archives -- but there's some ob-openmpi below. Can Notre Dame not get the support they've paid Univa for?

Reuti writes:

> SGE 6.2u5 can't handle multi line environment variables or functions,
> it was fixed in 6.2u6 which isn't free.

[It's not listed for 6.2u6.] For what it's worth, my fix for Sun's fix is
https://arc.liv.ac.uk/trac/SGE/changeset/3556/sge.

> Do you use -V while submitting the job? Just ignore the error or look
> into Son of Gridengine which fixed it too.

Of course you can always avoid the issue by not using `export -f', which isn't in the modules version we have. I default -V in sge_request and load the open-mpi module in the job submission session. I don't find whatever problems it causes, and it works for binaries like

  qsub -b y ... mpirun ...

However, the folkloristic examples here typically load the module stuff in the job script.

> If you can avoid -V, then it could be defined in any of the .profile
> or alike if you use -l as suggested. You could even define a
> starter_method in SGE to define it for all users by default and avoid
> using -V:
>
> #!/bin/sh
> module() { ...command...here... }
> export -f module
> exec "${@}"

That won't work, for example, if someone is tasteless enough to submit csh.
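For concreteness, here is what a working version of that starter_method might look like. This is a sketch, not something from the thread: it assumes the environment-modules package provides /usr/bin/modulecmd (the path is site-specific), and it uses bash rather than plain sh because export -f is a bash feature.

    #!/bin/bash
    # Hypothetical SGE starter_method: define the "module" shell function for
    # every batch job so that submitting with -V is not needed.
    module() {
        eval "$(/usr/bin/modulecmd sh "$@")"
    }
    export -f module
    exec "${@}"

As noted above, this still only helps jobs whose shell is bash; a csh job script would need the csh flavour of the modules init instead.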
Re: [OMPI users] ompi + bash + GE + modules
Dave-

I'm working with Univa support as well. I started out debugging this with a pretty poor grasp of where in the software flow the problem might be. Like most sysadmins, I belong to many community lists, and find them to be of tremendous help in running problems down. They certainly have been in this case; I've posted to the modules-interest SourceForge group as well.

I choose to use all the resources open to me, including community user forums and paid support. Using a commercial product's support should not preclude one from using other tools as well.

Mark

Mark Suhovecky
HPC System Administrator
Center for Research Computing
University of Notre Dame
suhove...@nd.edu
Re: [OMPI users] Strange TCP latency results on Amazon EC2
Hi again,

Today I was trying with another TCP benchmark included in the hpcbench suite, and with a ping-pong test I'm also getting 100us of latency. Then I tried with netperf, with the same result.

So, in summary, measuring TCP latency with message sizes between 1-32 bytes:

Netperf over TCP              -> 100us
Netpipe over TCP (NPtcp)      -> 100us
HPCbench over TCP             -> 100us
Netpipe over OpenMPI (NPmpi)  ->  60us
HPCbench over OpenMPI         ->  60us

Any clues?

Thanks a lot!

2012/1/10 Roberto Rey:

> Hi,
>
> I'm running some tests on EC2 cluster instances with 10 Gigabit Ethernet
> hardware and I'm getting strange latency results with Netpipe and OpenMPI.
>
> If I run Netpipe over OpenMPI (NPmpi) I get a network latency around 60
> microseconds for small messages (less than 2 kbytes). However, when I run
> Netpipe over TCP (NPtcp) I always get around 100 microseconds. For bigger
> messages everything seems to be OK.
>
> I'm using the TCP BTL in OpenMPI, so I can't understand why OpenMPI
> outperforms raw TCP performance for small messages (40us of difference). I
> have also run the PingPong test from the Intel MPI Benchmarks and the
> latency results for OpenMPI are very similar (60us) to those obtained with
> NPmpi.
>
> Can OpenMPI outperform Netpipe over TCP? Why? Is OpenMPI doing any
> optimization in the TCP BTL?
>
> The results for OpenMPI aren't so good, but we must take into account the
> network virtualization overhead under Xen.
>
> Thanks for your reply

--
Roberto Rey Expósito
Re: [OMPI users] Strange TCP latency results on Amazon EC2
Hi Roberto.

We've had strange reports of performance from EC2 before; it's actually been on my to-do list to go check this out in detail. I made contact with the EC2 folks at Supercomputing late last year. They've hooked me up with some credits on EC2 to go check out what's happening, but the pent-up email deluge from the Christmas vacation and my travel to the MPI Forum this week prevented me from testing yet.

I hope to be able to get time to test Open MPI on EC2 next week and see what's going on.

It's very strange to me that Open MPI is getting *better* than raw TCP performance. I don't have an immediate explanation for that -- if you're using the TCP BTL, then OMPI should be using TCP sockets, just like netpipe and the others.

You *might* want to check hyperthreading and process binding settings in all your tests.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
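To make that last suggestion concrete, one hedged way to control and inspect binding in the Open MPI 1.5 series is shown below; the hostfile name and core number are placeholders, and the raw-TCP run is pinned with taskset so the two measurements are comparable.

    # MPI benchmark: bind each rank to a core and report where the ranks landed
    mpirun -np 2 --hostfile myhosts --bind-to-core --report-bindings ./NPmpi

    # Raw TCP benchmark: pin the process to a core by hand on each node
    taskset -c 0 ./NPtcp                       # receiver side
    taskset -c 0 ./NPtcp -h <receiver-host>    # transmitter side

If hyperthreading is enabled, it is also worth checking that the pinned logical CPUs are not two threads of the same physical core.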
Re: [OMPI users] Strange TCP latency results on Amazon EC2
Thanks for your reply!

I'm using the TCP BTL because I don't have any other option on Amazon with 10 Gbit Ethernet.

I also tried with MPICH2 1.4 and I got 60 microseconds, so I am very confused about it...

Regarding hyperthreading and process binding settings: I am using only one MPI process on each node (2 nodes for a classical ping-pong latency benchmark). I don't know how that could affect this test, but I can try anything that anyone suggests to me.

--
Roberto Rey Expósito
Re: [OMPI users] Strange TCP latency results on Amazon EC2
Is it possible your EC2 cluster has another "unknown" crappy Ethernet card (e.g., a 1 Gb Ethernet card)? For small messages, they may go through different paths in NPtcp and in MPI over NPmpi.

Teng Ma

--
| Teng Ma              Univ. of Tennessee |
| t...@cs.utk.edu       Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/           |
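As an aside for anyone chasing this: a quick way to rule out traffic sneaking over a second interface is to list the NICs and then pin Open MPI's TCP BTL (and its out-of-band channel) to one of them. The interface and hostfile names below are only examples.

    # See what interfaces the instance really has
    /sbin/ifconfig -a

    # Force both MPI traffic and the out-of-band channel onto eth0
    mpirun -np 2 --hostfile myhosts \
           --mca btl tcp,self \
           --mca btl_tcp_if_include eth0 \
           --mca oob_tcp_if_include eth0 \
           ./NPmpi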
Re: [OMPI users] Strange TCP latency results on Amazon EC2
With ifconfig I can only see one Ethernet card (eth0), as well as the loopback interface.

--
Roberto Rey Expósito
[OMPI users] SIGSEGV on MPI_Test
Hello Community:

I am running into a strange problem. I get a SIGSEGV when I try to execute MPI_Test:

==21076== Process terminating with default action of signal 11 (SIGSEGV)
==21076==  Bad permissions for mapped region at address 0x43AEE1
==21076==    at 0x509B957: ompi_request_default_test (req_test.c:68)
==21076==    by 0x50EDEBB: PMPI_Test (ptest.c:59)
==21076==    by 0x44210D: InterProcessorTransmit::StartTransmission() (InterProcessorTransmit.cpp:111)

Here is the relevant piece of code:

for (this->dbIterator = localdb.begin(); this->dbIterator != localdb.end(); this->dbIterator++)
{
    this->TransmissionDetails = (this->dbIterator)->second;
    SendComplete = 0;
    UniqueIDtoSendto = std::get<0>(this->TransmissionDetails);
    RecepientNode = (this->dbIterator)->first;

    Isend_request = MPI::COMM_WORLD.Issend(this->transmitbuffer,
                                           this->transmissionsize, MPI_BYTE,
                                           (this->dbIterator)->first,
                                           std::get<0>(this->TransmissionDetails));

    /* This is line 111 */
    MPI_Test(&(this->Isend_request), &(this->SendComplete), &(this->ISend_status));

    while (!this->SendComplete)
    {
        /* Test whether the transmission was okay */
        MPI_Test(&(this->Isend_request), &(this->SendComplete), &(this->ISend_status));

        /* See if we need to pause or stop */
        {
            /* The mutex is released after exiting this block */
            std::unique_lock<std::mutex> pr_dblock(this->mutexforPauseResume);
            while (this->pause == 1)
            {
                /* pause till resume signal is received */
                this->WaitingforResume.wait(pr_dblock);
            }
            if (this->stop == 1)
            {
                /* stop this transmission */
                return (0);
            }
            /* mutex is released here */
        }
        /* End of pause/stop check */
    }
}

Am I missing something here? The piece of code shown here runs in a thread.

Thanks a lot for any pointers.

Best

Devendra
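One thing a reader might notice (an observation added here, not part of the original mail): the request comes from the C++ bindings (MPI::COMM_WORLD.Issend returns an MPI::Request), while the polling takes its address and hands it to the C function MPI_Test, which expects an MPI_Request*. A sketch of keeping the whole exchange in the C API, so the request and status types match MPI_Test's signature, is below; the function and variable names are illustrative, not the poster's.

#include <mpi.h>

/* Sketch: issue a synchronous nonblocking send and poll it to completion.
   buffer, count, dest and tag stand in for the member variables used above. */
static void send_and_poll(void *buffer, int count, int dest, int tag)
{
    MPI_Request req;
    MPI_Status  status;
    int done = 0;

    MPI_Issend(buffer, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD, &req);

    while (!done) {
        MPI_Test(&req, &done, &status);   /* C request and C status: types agree */
        /* ... check the pause/stop flags here, as in the original loop ... */
    }
}

The equivalent within the C++ bindings would be to keep the MPI::Request object and call its Test() method rather than the C MPI_Test.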
Re: [OMPI users] SIGSEGV on MPI_Test
Hello All,

Continuing my previous mail, I thought attaching this debugger screenshot may help anyone come up with an explanation. The exact location where the segfault happens is also highlighted.

Thanks a lot for any help.

Best,

Devendra
[OMPI users] IB Memory Requirements, adjusting for reduced memory consumption
Open MPI IB Gurus,

I have some slightly older InfiniBand-equipped nodes which have less RAM than we'd like, and on which we tend to run jobs that can span 16-32 nodes of this type. The jobs themselves tend to run on the heavy side in terms of their own memory requirements.

When we used to run on an older Intel MPI, these jobs managed to run within the available RAM without paging out to disk. Now, using Open MPI 1.5.3, we can end up paging to disk or even running out of memory for the same codes and the exact same jobs and node distributions.

I suspect I can reduce overall memory consumption by tuning the IB-related memory that Open MPI consumes. I've looked at the FAQ:
http://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
but I'm still not certain about where I should start. Again, this is all for 1.5.3 (we are willing to update to 1.5.4 or 1.5.5 when released, if it would help).

1. It looks like there are several independent IB BTL MCA parameters to try adjusting:

   i.   mpool_rdma_rcache_size_limit
   ii.  btl_openib_free_list_max
   iii. btl_openib_max_send_size
   iv.  btl_openib_eager_rdma_num
   v.   btl_openib_max_eager_rdma
   vi.  btl_openib_eager_limit

Have I missed any other parameters that impact InfiniBand-related memory usage? These parameters are listed as affecting registered memory. Are there parameters that affect unregistered IB-related memory consumption on the part of Open MPI itself?

2. Where should I start with this? For example, is it worth trying to adjust any of the eager parameters, or does the bulk of the memory requirement come from mpool_rdma_rcache_size_limit?

3. Are there any gross/overall "master" parameters that will set limits but keep the various buffers in intelligent proportion to one another, or will I need to adjust each set of buffers independently? If the latter, are there any guidelines on the relative proportions between buffers, or overall recommendations?

Thank you very much.

--
http://www.fastmail.fm - A fast, anti-spam email service.
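As a side note not in the original mail: the defaults and help strings for these knobs can be inspected on the installed build with ompi_info, which may help decide where to start.

    # openib BTL parameters (eager limits, free-list sizes, receive queues, ...)
    ompi_info --param btl openib

    # rdma mpool parameters, including mpool_rdma_rcache_size_limit
    ompi_info --param mpool rdma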
Re: [OMPI users] IB Memory Requirements, adjusting for reduced memory consumption
I would start by adjusting btl_openib_receive_queues. The default uses a per-peer QP, which can eat up a lot of memory. I recommend using no per-peer QPs and several shared receive queues. We use:

S,4096,1024:S,12288,512:S,65536,512

-Nathan
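For anyone wanting to try that setting, it can be passed on the mpirun command line or placed in the per-user MCA parameter file; the queue specification below is the one Nathan quotes, and the application name is a placeholder.

    # One-off, on the command line:
    mpirun --mca btl_openib_receive_queues S,4096,1024:S,12288,512:S,65536,512 ./my_app

    # Or persistently, in $HOME/.openmpi/mca-params.conf:
    #   btl_openib_receive_queues = S,4096,1024:S,12288,512:S,65536,512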