Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine
I doubt that we have tested this kind of scenario much (specifically with shared memory). I guess I'm not too surprised that it doesn't work -- to my knowledge, you're the first person to ask for heterogeneous support *on the same server*. As such, I don't know if we'll do much work to support it (there could be some gnarly issues with address ranges inside shared memory).

But your point is noted that we should not hang/crash in such a scenario. I'll file a bug to at least detect this scenario and indicate that we do not support it.

On Jun 3, 2010, at 10:29 AM, Katz, Jacob wrote:

> Hi,
> I have two processes, one 32-bit and the other 64-bit, running on the same 64-bit machine. When running with the TCP BTL everything works fine, but with the SM BTL it does not.
> In one application the processes just got stuck -- one in Send and the other in Recv. In another application I even saw a segfault inside the MPI libraries in one of the processes.
>
> Is such a scenario officially supported by the SM BTL?
>
> Open MPI: 1.3.3
> Heterogeneous support: yes
>
> Thanks.
>
> Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726

--
Jeff Squyres
jsquy...@cisco.com
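To make the address-range concern above concrete, here is a generic illustration of why a 32-bit and a 64-bit process disagree about the layout of the same shared segment. This is a minimal sketch that assumes nothing about Open MPI's actual internal structures; the struct name is made up:

    /* Generic illustration (NOT Open MPI's actual structures) of why 32-
     * and 64-bit processes cannot share pointer-bearing data structures:
     * the same struct has a different size and layout in each build. */
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    struct sm_queue {
        void    *head;   /* 4 bytes in a 32-bit build, 8 in a 64-bit build */
        uint32_t count;
    };

    int main(void)
    {
        /* Compile once with -m32 and once with -m64: the two executables
         * print different sizes/offsets, so any indexing into a shared
         * segment based on this layout diverges between the processes. */
        printf("sizeof(struct sm_queue) = %zu\n", sizeof(struct sm_queue));
        printf("offsetof(count)         = %zu\n", offsetof(struct sm_queue, count));
        return 0;
    }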
Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine
Jeff -

Is indicating we don't support it really the right thing to do? Given that SM should already have the proc data, it seems that setting the reachable bit to zero for the other process of a different "architecture" is all that is required.

Brian

On Jun 4, 2010, at 8:26 AM, Jeff Squyres wrote:

> I doubt that we have tested this kind of scenario much (specifically with shared memory). [...]
>
> But your point is noted that we should not hang/crash in such a scenario. I'll file a bug to at least detect this scenario and indicate that we do not support it.
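A rough sketch of the idea Brian describes, using hypothetical names rather than the real sm BTL internals: during endpoint setup, a peer whose architecture word differs from ours is simply left unreachable, so the PML falls back to another BTL (e.g. tcp) for that pair.

    /* Hypothetical sketch only -- NOT the actual Open MPI sm BTL code.
     * All type and function names here are made up for illustration. */
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t arch; } proc_t;  /* stand-in for the proc data */

    static void mark_reachability(const proc_t *me, const proc_t *peers,
                                  size_t npeers, uint8_t *reachable)
    {
        for (size_t i = 0; i < npeers; ++i) {
            /* Only peers with an identical architecture word get a shared-
             * memory endpoint; everyone else keeps reachable[i] == 0. */
            reachable[i] = (peers[i].arch == me->arch) ? 1 : 0;
        }
    }

    int main(void)
    {
        proc_t  me = { .arch = 64 };
        proc_t  peers[2] = { { .arch = 64 }, { .arch = 32 } };
        uint8_t reachable[2];

        mark_reachability(&me, peers, 2, reachable);
        printf("peer 0 reachable via sm: %u\n", reachable[0]);  /* 1 */
        printf("peer 1 reachable via sm: %u\n", reachable[1]);  /* 0 */
        return 0;
    }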
Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine
On Jun 4, 2010, at 10:43 AM, Barrett, Brian W wrote:

> Is indicating we don't support it really the right thing to do? Given that SM should already have the proc data, it seems that setting the reachable bit to zero for the other process of a different "architecture" is all that is required.

Yes, that's more specifically what I meant (I actually cited that on the ticket I just filed -- https://svn.open-mpi.org/trac/ompi/ticket/2433).

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine
This would be a quite serious limitation from my point of view. I'm a library developer, and my library is used in heterogeneous environments. Since 32-bit executables regularly work on 64-bit machines, users end up intermixing them with 64-bit executables on the same machine. Switching to another BTL would incur a serious performance penalty...

I noticed an SM bug report that looks similar to mine and was reportedly fixed in 1.4.2. I'm going to check that version. If it still fails, what would be the effort to fix this?

Jacob M. Katz | jacob.k...@intel.com | Work: +972-4-865-5726 | iNet: (8)-465-5726

-----Original Message-----
From: Jeff Squyres
Sent: Friday, June 04, 2010 17:26
To: Open MPI Users
Subject: Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine

> [quoted message trimmed; it duplicates the post above verbatim]
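As a stopgap under the constraint described, a mixed 32/64-bit job on one host might be launched in MPMD style while excluding the sm BTL entirely, so only tcp (and self) are used between the two processes. A hypothetical sketch; the binary names are made up:

    # Hypothetical workaround: one 32-bit and one 64-bit binary on the same
    # host, launched MPMD-style with the sm BTL disabled.
    mpiexec --mca btl tcp,self -n 1 ./app32 : -n 1 ./app64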
[OMPI users] Debug info on Darwin
We've had a couple of reports of users trying to debug with Open MPI and TotalView on Darwin and not being able to use the classic

    mpirun -tv -np 4 ./foo

launch. The typical problem shows up as something like

    Can't find typedef for MPIR_PROCDESC

and then TotalView can't attach to the spawned processes. While the Open MPI build may correctly compile the needed files with -g, the problem arises because on Darwin the DWARF info is kept in the .o files. If those files are kept around, we may be able to find that info and be happy debugging. But if they are deleted after the build, or things are moved around, then we are unable to locate the .o files containing the debug info, and no one is pleased.

It was suggested by our CTO that if these files were compiled so as to produce STABS debug info rather than DWARF, the debug info would be copied into the executables and shared libraries, and we would then be able to debug with Open MPI without a problem. I'm not sure if this is the best place to offer that suggestion, but I imagine it's not a bad place to start. ;-)

Regards,
Peter Thompson
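For example, a build along these lines might force STABS output. This is a hypothetical sketch, not an officially supported recipe: it assumes a GCC toolchain that accepts the -gstabs flag (true of the Apple GCC toolchains of this era), and the install prefix is made up:

    # Hypothetical sketch: build Open MPI so its objects carry STABS debug
    # info instead of DWARF, by overriding the compiler debug flags.
    ./configure CFLAGS=-gstabs CXXFLAGS=-gstabs --prefix=/opt/openmpi-stabs
    make all
    make install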
Re: [OMPI users] SM failure with mixed 32/64-bit procs on the same machine
On Jun 4, 2010, at 2:18 PM, Katz, Jacob wrote:

> This would be a quite serious limitation from my point of view. I'm a library developer, and my library is used in heterogeneous environments. Since 32-bit executables regularly work on 64-bit machines, users end up intermixing them with 64-bit executables on the same machine. Switching to another BTL would incur a serious performance penalty...

You're really the first person to ask us for combined 32/64 bit *on the same machine*. Just curious -- why would people still be compiling in 32-bit mode these days?

> I noticed an SM bug report that looks similar to mine and was reportedly fixed in 1.4.2. I'm going to check that version. If it still fails, what would be the effort to fix this?

No, that was for a different issue (32/64 bit *across different machines*) -- it won't fix this sm issue. I doubt that any of us had really even thought about mixing 32/64 bit in the sm BTL before (I know I hadn't).

Indeed, we haven't had much demand for 32-bit support over the past few years (it's non-zero, but not large). We try to guide OMPI's development by customer demand for features and platforms to support. Although not a definitive measure, having only one person ask for a (potentially difficult to implement) feature is a good indicator that it's a feature wanted/needed by only a small number of users. FWIW, the 32/64 scenarios we've generally seen before have been for running an MPI job across multiple different flavors of hardware or OSs -- but we haven't seen much of that, either.

All that being said, I'm *not* any kind of authoritative source of HPC knowledge that knows what every customer is doing -- you obviously have a different perspective and viewpoint than I do. Can you give some kind of quantification of how important this kind of feature is to the general HPC community? How many applications / users do this? Do you know whether other MPI implementations support it?

--
Jeff Squyres
jsquy...@cisco.com
[OMPI users] Unable to connect to a server using MX MTL with TCP
Hi OpenMPI_Users and OpenMPI_Developers,

I'm unable to connect a client application using MPI_Comm_connect() to a server job (the server job calls MPI_Open_port() before calling MPI_Comm_accept()) when the server job uses the MX MTL (although it works without problems when the server uses the MX BTL).

The server job runs on a cluster connected to a Myrinet 10G network (MX 1.2.11) in addition to an ordinary Ethernet network. The client runs on a different machine, not connected to the Myrinet network but accessible via the Ethernet network.

Attached to this message are the simple server and client programs (87 lines total) called simpleserver.c and simpleclient.c. Note we are using Open MPI 1.4.2 on x86_64 Linux (server: Fedora 7, client: Fedora 12).

Compiling these programs with mpicc on the server front node (fn1) and the client workstation (linux15) works well:

    [audet@fn1 bench]$ mpicc simpleserver.c -o simpleserver

    [audet@linux15 mpi]$ mpicc simpleclient.c -o simpleclient

Then we start the server on the cluster (the job is started on cluster node cn18), asking to use the MTL:

    [audet@fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 --mca mtl mx --mca pml cm -n 1 ./simpleserver

It prints the server port (note we use MX_RCACHE=2 to avoid a warning, but it doesn't affect the current issue):

    Server port = '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

Then we start the client on the workstation with this port number:

    [audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

The server process core dumps as follows:

    MPI_Comm_accept() sucessful...
    [cn18:24582] *** Process received signal ***
    [cn18:24582] Signal: Segmentation fault (11)
    [cn18:24582] Signal code: Address not mapped (1)
    [cn18:24582] Failing at address: 0x38
    [cn18:24582] [ 0] /lib64/libpthread.so.0 [0x305de0dd20]
    [cn18:24582] [ 1] /usr/local/openmpi-1.4.2/lib/openmpi/mca_mtl_mx.so [0x2d6a7e6d]
    [cn18:24582] [ 2] /usr/local/openmpi-1.4.2/lib/openmpi/mca_pml_cm.so [0x2d4a319d]
    [cn18:24582] [ 3] /usr/local/openmpi/lib/libmpi.so.0(ompi_dpm_base_disconnect_init+0xbf) [0x2ab1403f]
    [cn18:24582] [ 4] /usr/local/openmpi-1.4.2/lib/openmpi/mca_dpm_orte.so [0x2ed0eb19]
    [cn18:24582] [ 5] /usr/local/openmpi/lib/libmpi.so.0(PMPI_Comm_disconnect+0xa0) [0x2aaf4f20]
    [cn18:24582] [ 6] ./simpleserver(main+0x14c) [0x400d04]
    [cn18:24582] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305ce1daa4]
    [cn18:24582] [ 8] ./simpleserver [0x400b09]
    [cn18:24582] *** End of error message ***
    --------------------------------------------------------------------------
    mpiexec noticed that process rank 0 with PID 24582 on node cn18 exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------
    [audet@fn1 bench]$

And the client stops with the following error message:

    --------------------------------------------------------------------------
    At least one pair of MPI processes are unable to reach each other for
    MPI communications. This means that no Open MPI device has indicated
    that it can be used to communicate between these processes. This is
    an error; Open MPI requires that all MPI processes be able to reach
    each other. This error can sometimes be the result of forgetting to
    specify the "self" BTL.

      Process 1 ([[31386,1],0]) is on host: linux15
      Process 2 ([[54152,1],0]) is on host: cn18
      BTLs attempted: self sm tcp

    Your MPI job is now going to abort; sorry.
    --------------------------------------------------------------------------
    MPI_Comm_connect() sucessful...
    Error in comm_disconnect_waitall
    [audet@linux15 mpi]$

I really don't understand this message, because the client can connect to the server using tcp over Ethernet.
Moreover, if I add MCA options when starting the server to include the TCP BTL, the same problem happens (the argument list then becomes '--mca mtl mx --mca pml cm --mca btl tcp,shared,self'). However, if I remove all MCA options when starting the server (i.e. when the MX BTL is used), no such problem appears. Everything also goes fine if I start the server with an explicit request to use the MX and TCP BTLs (e.g. with the options '--mca btl mx,tcp,sm,self').

For running our server application we really prefer to use the MX MTL over the MX BTL, since our application is much faster with the MTL (although the usual ping-pong test is only slightly faster with the MTL).

Also enclosed is the output of ompi_info --all, run on the cluster node (cn18) and the workstation (linux15).

Please help me. I think my problem is only a question of wrong MCA parameters (which are obscure to me).

Thanks,

Martin Audet, Research Officer
Industrial Material Institute
National Research Council of Canada
75 de Mortagne, Boucherville, QC, J4B 6Y4, Canada
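Since the attachments did not make it into this post (see the follow-up below), here is a minimal sketch of the open-port/accept/connect pattern described above. It is an illustration of the standard MPI-2 dynamic-process API only, not the poster's actual simpleserver.c / simpleclient.c:

    /* Minimal sketch of the pattern described above -- an illustration,
     * NOT the poster's attached programs. Run with no argument to act as
     * the server; pass the printed port string as argv[1] for the client. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        if (argc < 2) {                       /* server */
            char port[MPI_MAX_PORT_NAME];
            MPI_Comm client;

            MPI_Open_port(MPI_INFO_NULL, port);
            printf("Server port = '%s'\n", port);
            fflush(stdout);

            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
            printf("MPI_Comm_accept() successful...\n");

            /* ... exchange messages over 'client' here ... */

            MPI_Comm_disconnect(&client);
            MPI_Close_port(port);
        } else {                              /* client: argv[1] is the port */
            MPI_Comm server;

            MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
            printf("MPI_Comm_connect() successful...\n");

            /* ... exchange messages over 'server' here ... */

            MPI_Comm_disconnect(&server);
        }

        MPI_Finalize();
        return 0;
    }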
[OMPI users] RE: Unable to connect to a server using MX MTL with TCP
Sorry, I forgot the attachments...

Martin

________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Audet, Martin [martin.au...@imi.cnrc-nrc.gc.ca]
Sent: June 4, 2010 19:18
To: us...@open-mpi.org
Subject: [OMPI users] Unable to connect to a server using MX MTL with TCP

> [quoted message trimmed; it duplicates the post above verbatim]