Re: [OMPI users] random IB failures when running medium core counts
Hello Brock,

While it doesn't solve the problem, have you tried increasing the btl timeouts as the message suggests? With 1884 cores in use, perhaps there is some oversubscription in the fabric?

-Joshua Bernstein
Penguin Computing

Brock Palen wrote:
We recently installed a modest IB network on our cluster. When running an 1884-core IB HPL job, we will get an error about IB after a run; it does not always happen in the same place, and some iterations will pass while others fail. The error is below. We are using openmpi/1.4.2 with the Intel 11 compilers. Note that 1000-core jobs and other sizes also work well, but this larger one does not. Thanks!

[[62713,1],1867][btl_openib_component.c:3224:handle_wc] from nyx5011.engin.umich.edu to: nyx5120 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 413569408 opcode 2 vendor error 129 qp_idx 0
--
The InfiniBand retry count between two MPI processes has been exceeded. "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38): The total number of times that the sender wishes the receiver to retry timeout, packet sequence, etc. errors before posting a completion error. This error typically means that there is something awry within the InfiniBand fabric itself. You should note the hosts on which this error has occurred; it has been observed that rebooting or removing a particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 10). The actual timeout value used is calculated as: 4.096 microseconds * (2^btl_openib_ib_timeout)

See the InfiniBand spec 1.2 (section 12.7.34) for more details. Below is some information about the host that raised the error and the peer to which it was connected:

Local host: nyx5011.engin.umich.edu
Local device: mlx4_0
Peer host: nyx5120

You may need to consult with your system administrator to get this problem fixed.
--
--
mpirun has exited due to process rank 1867 with PID 3474 on node nyx5011 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
-- [nyx5049.engin.umich.edu:07901] [[62713,0],32] ORTED_CMD_PROCESSOR: STUCK IN INFINITE LOOP - ABORTING [nyx5049:07901] *** Process received signal *** [nyx5049:07901] Signal: Aborted (6) [nyx5049:07901] Signal code: (-6) [nyx5049:07901] [ 0] /lib64/libpthread.so.0 [0x2b5dcbc70b10] [nyx5049:07901] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b5dcbeae265] [nyx5049:07901] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b5dcbeafd10] [nyx5049:07901] [ 3] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x216) [0x2b5dcacdb7e6] [nyx5049:07901] [ 4] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_event_loop+0x2ca) [0x2b5dcaf3a9aa] [nyx5049:07901] [ 5] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_progress+0x5e) [0x2b5dcaf2d26e] [nyx5049:07901] [ 6] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/openmpi/mca_rml_oob.so [0x2b5dcce37e5c] [nyx5049:07901] [ 7] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x3ae) [0x2b5dcacdb97e] [nyx5049:07901] [ 8] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_event_loop+0x2ca) [0x2b5dcaf3a9aa] [nyx5049:07901] [ 9] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_progress+0x5e) [0x2b5dcaf2d26e] [nyx5049:07901] [10] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/openmpi/mca_rml_oob.so [0x2b5dcce37e5c] [nyx5049:07901] [11] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x3ae) [0x2b5dcacdb97e] [nyx5049:07901] [12] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_event_loop+0x2ca) [0x2b5dcaf3a9aa] [nyx5049:07901] [13] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-pal.so.0(opal_event_dispatch+0x8) [0x2b5dcaf3a6d8] [nyx5049:07901] [14] /home/software/rhel5/openmpi-1.4.2/intel-11.0/lib/libopen-rte.so.0(orte_daemon+0xaaf) [0x2b5dcacdb15f] [nyx5049:07901] [15] orted [0x401ad6] [nyx5049:07901] [16] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5dcbe9b994] [nyx5049:07901] [17] orted [0x401999] [nyx5049:07901] *** End of error message *** Brock Palen www.umich.edu/~brockp Center for Advanced Computing
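The two parameters named in the help text above can be raised directly on the mpirun command line. A minimal sketch, with illustrative values and a placeholder binary name (btl_openib_ib_retry_count already defaults to its maximum of 7, so in practice only the timeout can be raised; a timeout of 20 corresponds to 4.096 microseconds * 2^20, roughly 4.3 seconds):

mpirun --mca btl_openib_ib_timeout 20 -np 1884 ./xhpl

The same setting can also be placed in an MCA parameter file such as $HOME/.openmpi/mca-params.conf as a line reading: btl_openib_ib_timeout = 20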
[OMPI users] Deprecated parameter: plm_rsh_agent
Hello All,

When building the examples included with OpenMPI version 1.5, I see a message printed as follows:
--
A deprecated MCA parameter value was specified in an MCA parameter file. Deprecated MCA parameters should be avoided; they may disappear in future releases. Deprecated parameter: plm_rsh_agent
--
While I know that in pre-1.3.x releases the variable was pls_rsh_agent, plm_rsh_agent worked all the way through at least 1.4.3. What is the new keyword name? I can't seem to find it in the FAQ located here: http://www.open-mpi.org/faq/?category=rsh

-Josh
Re: [OMPI users] Deprecated parameter: plm_rsh_agent
Thanks Samuel, I should have checked ompi_info myself. The FAQ on the website should probably be updated to reflect this parameter change.

-Joshua Bernstein
Software Development Manager
Penguin Computing

Samuel K. Gutierrez wrote:
Hi Josh, I -think- the new name is orte_rsh_agent. At least according to ompi_info.

$ ompi_info -a --parsable | grep orte_rsh_agent
mca:orte:base:param:orte_rsh_agent:value:ssh : rsh
mca:orte:base:param:orte_rsh_agent:data_source:default value
mca:orte:base:param:orte_rsh_agent:status:writable
mca:orte:base:param:orte_rsh_agent:help:The command used to launch executables on remote nodes (typically either "ssh" or "rsh")
mca:orte:base:param:orte_rsh_agent:deprecated:no
mca:orte:base:param:orte_rsh_agent:synonym:name:pls_rsh_agent
mca:orte:base:param:orte_rsh_agent:synonym:name:plm_rsh_agent
mca:plm:base:param:plm_rsh_agent:synonym_of:name:orte_rsh_agent

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 5, 2010, at 12:41 PM, Joshua Bernstein wrote:
Hello All, When building the examples included with OpenMPI version 1.5 I see a message printed as follows: -- A deprecated MCA parameter value was specified in an MCA parameter file. Deprecated MCA parameters should be avoided; they may disappear in future releases. Deprecated parameter: plm_rsh_agent -- While I know that in pre 1.3.x releases the variable was pls_rsh_agent, plm_rsh_agent worked all the way through at least 1.4.3. What is the new keyword name? I can't seem to find it in the FAQ located here: http://www.open-mpi.org/faq/?category=rsh -Josh
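A minimal sketch of switching to the non-deprecated name, either by renaming the entry in whichever MCA parameter file triggered the warning or on the command line (the process count and binary are placeholders):

mpirun --mca orte_rsh_agent ssh -np 4 ./a.out

or, in $HOME/.openmpi/mca-params.conf:
orte_rsh_agent = ssh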
[OMPI users] Displaying Selected MCA Modules
Hi There, I'm attempting to debug some configuration issue with the recent version of OMPI, version 1.2.6. I'm able to build all of the MCA modules, and I've figured out how to display the list of AVAILABLE modules using ompi_info, but is there a way to display the list of modules that was selected at runtime? I've tried the -v option to mpirun, and read through the FAQs, but I can't seem to figure out how to have OMPI display the selected MCAs when a job starts. Any help or guidance would be appreciated. -Josh
Re: [OMPI users] Displaying Selected MCA Modules
Well to answer my own question, If I use the -display-map option, I get printed out a nice bit of information that includes a list of the modules in use during the run as shown below:

---SNIP---
Argv[0]: ./cpi
Env[0]: OMPI_MCA_pls=proxy
Env[1]: OMPI_MCA_rmaps_base_display_map=1
Env[2]: OMPI_MCA_orte_precondition_transports=ad81e32181314110-4aea4dd5040c2593
Env[3]: OMPI_MCA_rds=proxy
Env[4]: OMPI_MCA_ras=proxy
Env[5]: OMPI_MCA_rmaps=proxy
Env[6]: OMPI_MCA_rmgr=proxy
Working dir: /home/ats (user: 0)
---END SNIP--

-Josh

Joshua Bernstein wrote:
Hi There, I'm attempting to debug some configuration issue with the recent version of OMPI, version 1.2.6. I'm able to build all of the MCA modules, and I've figured out how to display the list of AVAILABLE modules using ompi_info, but is there a way to display the list of modules that was selected at runtime? I've tried the -v option to mpirun, and read through the FAQs, but I can't seem to figure out how to have OMPI display the selected MCAs when a job starts. Any help or guidance would be appreciated. -Josh
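Another partial view, at least for the BTLs, is to turn up a framework's verbosity so the selected component announces itself at startup; the verbosity level and process count below are only illustrative:

mpirun --mca btl_base_verbose 30 -np 2 ./cpi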
Re: [OMPI users] Displaying Selected MCA Modules
Thanks for the response Jeff, Jeff Squyres wrote: Greetings Josh. No, we don't have an easy way to show which plugins were loaded and may/will be used during the run. The modules you found below in --display-map are only a few of the plugins (all dealing with the run-time environment, and only used on the back-end nodes, so it may not be what you're looking for -- e.g., it doesn't show the plugins used by mpirun). What do you need to know? Well basically I want to know what MTA's are being used to startup a job. I'm confused as to what the difference is between "used by mpirun" versus user on the back-end nodes. Doesn't --display-map show which MTA modules will used to start the backend processes? The overarching issue is that I'm attempting to just begin testing my build and when I attempt to startup a job, it just hangs: [ats@nt147 ~]$ mpirun --mca pls rsh -np 1 ./cpi [nt147.penguincomputing.com:04640] [0,0,0] ORTE_ERROR_LOG: Not available in file ras_bjs.c at line 247 The same thing happens if I just disable the bjs RAS MTA, since bjs, really isn't used with Scyld anymore: [ats@nt147 ~]$ mpirun --mca ras ^bjs --mca pls rsh -np 1 ./cpi The interesting thing here is that orted starts up, but I'm not sure what is supposed to happen next: [root@nt147 ~]# ps -auxwww | grep orte Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ ats 4647 0.0 0.0 48204 2136 ?Ss 12:45 0:00 orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename nt147.penguincomputing.com --universe a...@nt147.penguincomputing.com:default-universe-4645 --nsreplica "0.0.0;tcp://192.168.5.211:59110;tcp://10.10.10.1:59110;tcp://10.11.10.1:59110" --gprreplica "0.0.0;tcp://192.168.5.211:59110;tcp://10.10.10.1:59110;tcp://10.11.10.1:59110" --set-sid Finally, it should be noted that the upcoming release of Scyld will now include OpenMPI. This notion is how all of this got started. -Joshua Bernstein Software Engineer Penguin Computing
Re: [OMPI users] Displaying Selected MCA Modules
Wow, Seems like I've fallen behind in replying. I'll try to be sure to make sure I answer everbody's questions about what I am trying to accomplish. Jeff Squyres wrote: On Jun 20, 2008, at 3:50 PM, Joshua Bernstein wrote: No, we don't have an easy way to show which plugins were loaded and may/will be used during the run. The modules you found below in --display-map are only a few of the plugins (all dealing with the run-time environment, and only used on the back-end nodes, so it may not be what you're looking for -- e.g., it doesn't show the plugins used by mpirun). What do you need to know? Well basically I want to know what MTA's are being used to startup a job. MTA? Sorry, I should have said MCA I'm confused as to what the difference is between "used by mpirun" versus user on the back-end nodes. Doesn't --display-map show which MTA modules will used to start the backend processes? Yes. But OMPI's run-time design usually has mpirun load one plugin of a given type, and then have the MPI processes load another plugin of the same type. For example, for I/O forwarding - mpirun will load the "svc" plugin, while MPI processes will load the "proxy" plugin. In this case, mpirun is actually providing all the smarts for I/O forwarding, and all the MPI processes simply proxy requests up to mpirun. This is a common model throughout our run-time support, for example. Ah, okay. So then --display-map will show what modules the backend processes are using, not MPIRUN itself. The overarching issue is that I'm attempting to just begin testing my build and when I attempt to startup a job, it just hangs: [ats@nt147 ~]$ mpirun --mca pls rsh -np 1 ./cpi [nt147.penguincomputing.com:04640] [0,0,0] ORTE_ERROR_LOG: Not available in file ras_bjs.c at line 247 The same thing happens if I just disable the bjs RAS MTA, since bjs, really isn't used with Scyld anymore: [ats@nt147 ~]$ mpirun --mca ras ^bjs --mca pls rsh -np 1 ./cpi I know very, very little about the bproc support in OMPI -- I know that it evolved over time and is disappearing in v1.3 due to lack of interest. If you want it to stay, I think you've missed the v1.3 boat (we're in feature freeze for v1.3), but possibilities exist for future versions if you're willing to get involved in Open MPI. Bummer! I would absolutely support, (along with Penguin) further contributions and development of BProc support. Note, though that BProc Scyld, and LANL BProc, have long ago forked. We believe our BProc functionality has been developed beyond what was running at LANL, (for example we have support for threads...). I understand it it probably too late to add BProc in for 1.3, but perhaps for subsequent releases, combined with contributions from Penguin, BProc support could be resurrected in some capacity. The interesting thing here is that orted starts up, but I'm not sure what is supposed to happen next: [root@nt147 ~]# ps -auxwww | grep orte Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ ats 4647 0.0 0.0 48204 2136 ?Ss 12:45 0:00 orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename nt147.penguincomputing.com --universe a...@nt147.penguincomputing.com:default-universe-4645 --nsreplica "0.0.0;tcp://192.168.5.211:59110;tcp://10.10.10.1:59110;tcp://10.11.10.1:59110" --gprreplica "0.0.0;tcp://192.168.5.211:59110;tcp://10.10.10.1:59110;tcp://10.11.10.1:59110" --set-sid I'm not sure that just asking for the rsh pls is the Right thing to do -- I'll have to defer to Ralph on this one... 
Can you successfully run non-MPI apps, like hostname? Yes. Absolutely. Finally, it should be noted that the upcoming release of Scyld will now include OpenMPI. This notion is how all of this got started. Great! It sounds like you need to get involved, though, to preserve bproc support going forward. LANL was the only proponent of bproc-like support; they have been moving away from bproc-like clusters, however, and so support faded. We made the decision to axe bproc support in v1.3 because there was no one to maintain it. :-( This is what I'm in the process of doing right now. I'd like to be able to take the existing BProc functionality and modify it if needed to support our BProc. I have buy-in from the higher-ups around here, and I will proceed with the Membership forms, likely at the "Contributor" level, considering we hope to be contributing code. Signing of the 3rd party contribution agreement shouldn't be an issue. -Joshua Bernstein Software Engineer Penguin Computing
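As a concrete form of the sanity check mentioned above (the launcher choice and process count are just examples for this 1.2-era setup):

mpirun --mca pls rsh -np 2 hostname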
Re: [OMPI users] Displaying Selected MCA Modules
Ralph Castain wrote: Hi Joshua Again, forwarded by the friendly elf - so include me directly in any reply. I gather from Jeff that you are attempting to do something with bproc - true? If so, I will echo what Jeff said: bproc support in OMPI is being dropped with the 1.3 release due to lack of interest/support. Just a "heads up". Understood. If you are operating in a bproc environment, then I'm not sure why you are specifying that the system use the rsh launcher. Bproc requires some very special handling which is only present in the bproc launcher. You can run both MPI and non-MPI apps with it, but bproc is weird and so OMPI some -very- different logic in it to make it all work. Well, I'm trying to determine how broken, if at all, the bproc support is in OpenMPI. So considering out of the gate it wasn't working, I thought I'd try to disable the built in BProc stuff and fall back to RSH. I suspect the problem you are having is that all of the frameworks are detecting bproc and trying to run accordingly. This means that the orted is executing process startup procedures for bproc - which are totally different than for any other environment (e.g., rsh). If mpirun is attempting to execute an rsh launch, and the orted is expecting a bproc launch, then I can guarantee that no processes will be launched and you will hang. Exactly, what I'm seeing now... I'm not sure there is a way in 1.2 to tell the orteds to ignore the fact that they see bproc and do something else. I can look, but would rather wait to hear if that is truly what you are trying to do, and why. I would really appreciate it if you wouldn't mind looking. From reading the documentation I didn't realize that mpirun and the orted were doing two different things. I thought the --mca parameter applied to both. -Joshua Bernstein Software Engineer Penguin Computing
Re: [OMPI users] Displaying Selected MCA Modules
Jeff Squyres wrote: On Jun 23, 2008, at 2:52 PM, Joshua Bernstein wrote: Excellent. I'll let Ralph chime in with the relevant technical details. AFAIK, bproc works just fine in the v1.2 series (they use it at LANL every day). But note that we changed a *LOT* in ORTE between v1.2 and v1.3; the BPROC support will likely need to be re-written. This is likely worth some phone calls to describe what will be needed. Excellent! I'll be sure to initiate this when the time comes. -Joshua Bernstein Software Engineer Penguin Computing
Re: [OMPI users] Displaying Selected MCA Modules
Ralph, I really appreciate all of your help and guidance on this. Ralph H Castain wrote: Of more interest would be understanding why your build isn't working in bproc. Could you send me the error you are getting? I'm betting that the problem lies in determining the node allocation as that is the usual place we hit problems - not much is "standard" about how allocations are communicated in the bproc world, though we did try to support a few of the more common methods. Alright, I've been playing around a bit more, and I think I'm understanding what is going on. Though it seems that for whatever reason the ORTE daemon is failing to launch on a remote node, and I'm left with: [ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi [goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Not available in file ras_bjs.c at line 247 -- A daemon (pid 4208) launched by the bproc PLS component on node 0 died unexpectedly so we are aborting. This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes. -- [goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in file pls_bproc.c at line 717 [goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in file pls_bproc.c at line 1164 [goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 462 [goldstar.penguincomputing.com:04207] mpirun: spawn failed with errno=-1 So, I take the advice suggested in the note, and double check to make sure our library caching is working. It nicely picks up the libraries though once they are staged on the compute nodes, now mpirun just dies: [ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi [goldstar.penguincomputing.com:09335] [0,0,0] ORTE_ERROR_LOG: Not available in file ras_bjs.c at line 247 [ats@goldstar mpi]$ I thought maybe it was actually working, but I/O forwarding wasn't setup properly, though checking the exit code shows that it infact crashed: [ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi [ats@goldstar mpi]$ echo $? 1 Any ideas here? If I use the NODES envar, I can run a job on the head node though: [ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi Process 0 on goldstar.penguincomputing.com pi is approximately 3.1416009869231254, Error is 0.0823 wall clock time = 0.97 What also is interesting, and you suspected correctly, only the NODES envar is being honored, things like BEOWULF_JOB_MAP is not being honored. This probably correct as I imagine this BEOWULF_JOB_MAP envar is Scyld specific and likely not implemented. This isn't a big issue though, its something I'll likely add later on. -Joshua Bernstein Software Engineer Penguin Computing
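One hedged sketch of the library-path workaround suggested by the help message above, assuming the Open MPI installation lives under the same (illustrative) path on every node:

export LD_LIBRARY_PATH=/opt/openmpi-1.2/lib:$LD_LIBRARY_PATH
mpirun --prefix /opt/openmpi-1.2 --mca btl ^openib,udapl -np 2 ./cpi

The --prefix option tells mpirun which installation the remote daemons should use.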
Re: [OMPI users] Displaying Selected MCA Modules
Ralph Castain wrote:
Hmmm...well, the problem is as I suspected. The system doesn't see any allocation of nodes to your job, and so it aborts with a crummy error message that doesn't really tell you the problem. We are working on improving them. How are you allocating nodes to the job? Does this BEOWULF_JOB_MAP contain info on the nodes that are to be used?

BEOWULF_JOB_MAP is an array of integers separated by a colon that contains node mapping information. The easiest way to explain it is just by example:

BEOWULF_JOB_MAP=0:0
This is a two-process job, with each process running on node 0.

BEOWULF_JOB_MAP=0:1:1
A three-process job with the first process on node 0, and the next two on node 1.

All said, this is of little consequence right now, and we/I can worry about adding support for this later.

One of the biggest headaches with bproc is that there is no adhered-to standard for describing the node allocation. What we implemented will support LSF+Bproc (since that is what was being used here) and BJS. It sounds like you are using something different - true?

Understood. We aren't using BJS, and have long deprecated BJS in favor of bundling TORQUE with Scyld instead, though legacy functionality for envars like NP, NO_LOCAL, and BEOWULF_JOB_MAP is present in the MPICH extensions we've put together.

If so, we can work around it by just mapping enviro variables to what the system is seeking. Or, IIRC, we could use the hostfile option (have to check on that one).

Exactly, but for now, if I make sure the NODES envar is set up correctly, make sure OpenMPI is NFS-mounted, and copy out the mca libraries myself (libcache doesn't seem to work), I actually end up with something running!

[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 2 ./cpi
Process 0 on n0
pi is approximately 3.1416009869231241, Error is 0.0809
wall clock time = 0.005377
Process 1 on n0
Hangup

It seems the -H option and using a hostfile with BProc aren't honored, correct? So the only thing that I can use to derive the host mapping with BProc support is the BJS RAS MCA (via the NODES Envar?)

-Josh
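For what it's worth, a rough sketch of translating BEOWULF_JOB_MAP into something Open MPI already understands, setting aside the question above of whether the bproc launcher honors hostfiles, and assuming (purely for illustration) that node IDs map to hostnames of the form n<ID>:

echo "$BEOWULF_JOB_MAP" | tr ':' '\n' | sed 's/^/n/' | sort | uniq -c | awk '{print $2 " slots=" $1}' > bjm_hostfile
mpirun --hostfile bjm_hostfile -np 3 ./cpi

For BEOWULF_JOB_MAP=0:1:1 this produces a hostfile containing "n0 slots=1" and "n1 slots=2".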
Re: [OMPI users] Setting up Open MPI to run on multiple servers
Rayne wrote:
Hi all, I am trying to set up Open MPI to run on multiple servers, but as I have very little experience in networking, I'm getting confused by the info on open-mpi.org, with the .rhosts, rsh, ssh etc. Basically what I have now is a PC with Open MPI installed. I want to connect it to, say, 10 servers, so I can run MPI programs on all 11 nodes. From what I've read, I think I need to install Open MPI on the 10 servers too, and there must be a shared directory where I keep all the MPI programs I've written, so all nodes can access them. Then I need to create a machine file on my local PC (I found a default hostfile "openmpi-default-hostfile" in {prefix}/etc/. Can I use that instead so I need not have "-machinefile machine" with every mpiexec?) with the list of the 10 servers. I'm assuming I need to put down the IP addresses of the 10 servers in this file. I've also read that the 10 servers also need to each have a .rhosts file that tells them the machine (i.e. my local PC) and user from which the programs may be launched from. Is this right? There is also the rsh/ssh configuration, which I find the most confusing. How do I know whether I'm using rsh or ssh? Is following the instructions on http://www.open-mpi.org/faq/?category=rsh under "3: How can I make ssh not ask me for a password?" sufficient? Does this mean that when I'm using the 10 servers to run the MPI program, I'm login to them via ssh? Is this necessary in every case? Is doing all of the above all it takes to run MPI programs on all 11 nodes, or is there something else I missed?

More or less. The first step is to set up password-less SSH between all 11 machines. I'd completely skip the use of RSH, as it's very insecure and shouldn't be used in a non-dedicated cluster, and even then... You should basically set up SSH so a user can SSH from one node to another without specifying a password or entering any other information. The next step is to set up NFS. NFS provides you with a way to share a directory on one computer with many other computers, avoiding the hassle of having to copy all your MPI programs to all of the nodes. This is generally as easy as configuring /etc/exports and then just mounting the directory on the other computers. Be sure you mount the directories in the same place on every node, though. Lastly, give your MPI programs a shot. You don't need to have a hostlist, because you can specify the hostnames (or IPs) on the mpirun command line, but in your case it's likely a good idea. Hope that gets you started...

-Joshua Bernstein
Software Engineer
Penguin Computing
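If it helps, a bare-bones sketch of the pieces described above; the usernames, hostnames, paths and network range are all placeholders:

ssh-keygen -t rsa                  # on your PC, accept the defaults
ssh-copy-id user@server01          # repeat for each of the 10 servers
# on the PC, export the program directory in /etc/exports, e.g.:
#   /home/user/mpi 192.168.1.0/24(rw,sync)
# then on each server, mount it at the same path:
mkdir -p /home/user/mpi && mount pc-hostname:/home/user/mpi /home/user/mpi
mpirun -np 11 --hostfile machines /home/user/mpi/my_program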
Re: [OMPI users] Question on open-mpi not working over wireless between Ubuntu and Mac OS-X
Hello Pallab,

Is there a chance it's something simple like having the Mac's Firewall turned on? On my 10.4 system this is in System Preferences > Sharing, and then the Firewall tab.

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

On Sep 18, 2009, at 3:56 PM, Pallab Datta wrote:
Hello, I am running open-mpi between a Mac OSX (v.10.5) and Ubuntu Server V.9.04 (Linux Box). I have configured OMPI V.1.3.3 on both of them with --enable-heterogeneous --disable-shared --enable-static options. The Linux box is connected via a wireless USB Adapter to the same sub-network in which the Macintosh is sitting. When I tried to run mpirun with the following options between the Linux box with the wireless card and another linux machine on the network, everything works fine. I ran:
/usr/local/bin/mpirun --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -H localhost,10.11.14.205 ./app
and it works. When I tried to run mpirun with the -hetero option from the Macintosh, it invokes the processes on both ends and then hangs at the MPI_Send MPI_Receive functions. I ran:
/usr/local/bin/mpirun --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 ./app
and it hangs. I saw that the linux box is trying to connect() the Mac using port 4/260. So I purposely forced mpi to look for higher numbered ports.. I ran:
/usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 ./app
and it still hangs, giving the following message:
btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
10.11.14.203 == localhost. Can anybody explain what I am missing and how I can make the macintosh and Linux boxes talk to each other over wireless..
regards, pallab
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Hmm,

On another angle, could this be a name resolution issue? Perhaps apex-backpack isn't able to resolve fuji.local and vice versa. Can you ping between the two of them using their hostnames rather than their IPs?

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

Pallab Datta wrote:
Yes it came up when i put the verbose mode in i.e. the debug output.. yes i knew its privileged so thats why i explicitly asked it to connect to a higher port but still it blocks there..:(

On Sep 24, 2009, at 12:54 PM, Pallab Datta wrote:
Yes I had tried that initially it (apex-backpack) was trying to connect the Mac (10.11.14.203) at port number 4 which is too low. So that's why I made the port range higher..

Port 4? OMPI should never connect at port 4; it's privileged. Was that in the debug output?
--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Jeff Squyres wrote:
On Dec 9, 2009, at 4:36 PM, Jeff Squyres wrote: Given that we haven't moved this patch to the v1.4 branch yet (i.e., it's not yet in a nightly v1.4 tarball), probably the easiest thing to do is to apply the attached patch to a v1.4 tarball. I tried it with my PGI 10.0 install and it seems to work. So -- forget everything about autogen.sh and just apply the attached patch.

Is there a reason why it hasn't moved into 1.4 yet or wasn't included with the 1.4 release? Can I toss my two cents in here and request it be made available in a mainline release, or at least in a snapshot, sooner rather than later? I'd like to get it included in our build in time for our next release.

-Joshua Bernstein
Senior Software Engineer
Penguin Computing
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Jeff Squyres wrote:
Sorry -- I neglected to update the list yesterday: I got the RM approval and committed the fix to the v1.4 branch. So the PGI fix should be in last night's 1.4 snapshot. Could someone out in the wild give it a whirl and let me know if it works for you? (it works for *me*)

Jeff,

The Dec 17th snapshot, posted here: http://www.open-mpi.org/nightly/v1.4/openmpi-1.4a1r22335.tar.gz, builds nicely with PGI v10.0 on both Redhat 5u4 and 4u8. Is there a plan to roll this up into a 1.4.1 release? I'd prefer not to have to ship a snapshot version.

-Joshua Bernstein
Senior Software Engineer
Penguin Computing
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Ralph Castain wrote: You definitely shouldn't ship that one - it may build, but it doesn't work. We are looking at a bug in that code branch prior to releasing. I have no plans to ship that one. Any idea when we'll see a 1.4.1 release? -Josh
Re: [OMPI users] Problem building OpenMPI with PGI compilers
Thanks Ralph, I'll keep an eye out... -Josh Ralph Castain wrote: Sometime Jan, would be my best guess...but I am not in charge of it, so don't take that as any kind of commitment. On Dec 28, 2009, at 5:59 PM, Joshua Bernstein wrote: Ralph Castain wrote: You definitely shouldn't ship that one - it may build, but it doesn't work. We are looking at a bug in that code branch prior to releasing. I have no plans to ship that one. Any idea when we'll see a 1.4.1 release? -Josh
Re: [OMPI users] Problem compiling 1.4.0 snap with PGI 10.0-1 and openib flags turned on ...
Hi Richard, I've built our OpenMPI with PGI 10.0 and included OpenIB support. I've verified it works. You'll notice we build with UDAPL and OpenIB, but generally only OpenIB is used. Our complete configure flags, and OpenMPI is included with Scyld ClusterWare is shown here: ./configure --prefix=%{_prefix} \ --bindir="%{_prefix}/${compiler}/bin" \ --datarootdir="%{_prefix}/${compiler}/share" \ --mandir="%{_prefix}/man" \ --sysconfdir="%{_sysconfdir}" \ --libdir="%{_libdir}/${compiler}" \ --includedir="%{_includedir}" \ --with-mx=/opt/open-mx \ --with-udapl \ --without-bproc \ --with-tm \ --with-openib \ --disable-dlopen \ ${EXTRA_CONFIG_OPTIONS} \ --without-xgrid --without-slurm --without-loadleveler --without-gm --without-lsf \ -Joshua Bernstein Senior Software Engineer Penguin Computing Richard Walsh wrote: All, Not overwhelmed with responses here ... ;-) ... No one using PGI 10.0 yet? We need it to make use of the GPU compiler directives they are supporting. Can some perhaps comment on whether this is the correct way to configure for an IB system? Everything works with Intel and/or if I compile without the IB flags. Sent the same report to PGI, but seems like the support team there is on break for the Holidays. Someone else must have seen this as well ... No ... ?? rbw Richard Walsh Parallel Applications and Systems Manager CUNY HPC Center, Staten Island, NY 718-982-3319 612-382-4620 Mighty the Wizard Who found me at sunrise Sleeping, and woke me And learn'd me Magic! From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of Richard Walsh [richard.wa...@csi.cuny.edu] Sent: Saturday, December 19, 2009 12:18 PM To: us...@open-mpi.org Subject: [OMPI users] Problem compiling 1.4.0 snap with PGI 10.0-1 and openib flags turned on ... All, Succeeded in overcoming the 'libtool' failure with PGI using the patched snap (thanks Jeff), but now I am running into a down stream problem compiling for our IB clusters. I am using the latest PGI compiler (10.0-1) and the 12-14-09 snap of OpenMPI of version 1.4.0. My configure line looks like this: $ ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 --enable-openib-ibcm --with-openib \ --prefix=/share/apps/openmpi-pgi/1.4.0 --with-tm=/share/apps/pbs/10.1.0.91350 The error I get during the make at about line 8078 is: libtool: compile: pgcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. -D_REENTRANT -O -DNDEBUG -c connect/btl_openib_connect_xoob.c -fpic -DPIC -o connect/.libs/btl_openib_connect_xoob.o source='connect/btl_openib_connect_ibcm.c' object='connect/btl_openib_connect_ibcm.lo' libtool=yes \ DEPDIR=.deps depmode=none /bin/sh ../../../../config/depcomp \ /bin/sh ../../../../libtool --tag=CC --mode=compile pgcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. -D_REENTRANT -O -DNDEBUG -c -o connect/btl_openib_connect_ibcm.lo connect/btl_openib_connect_ibcm.c libtool: compile: pgcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. 
-D_REENTRANT -O -DNDEBUG -c connect/btl_openib_connect_ibcm.c -fpic -DPIC -o connect/.libs/btl_openib_connect_ibcm.o PGC-S-0040-Illegal use of symbol, __le64 (/usr/include/linux/byteorder/little_endian.h: 43) PGC-W-0156-Type not specified, 'int' assumed (/usr/include/linux/byteorder/little_endian.h: 43) PGC-S-0039-Use of undeclared variable __le64 (/usr/include/linux/byteorder/little_endian.h: 45) PGC-S-0104-Non-numeric operand for multiplicative operator (/usr/include/linux/byteorder/little_endian.h: 45) PGC-S-0040-Illegal use of symbol, __le64 (/usr/include/linux/byteorder/little_endian.h: 47) PGC-S-0040-Illegal use of symbol, __be64 (/usr/include/linux/byteorder/little_endian.h: 67) PGC-W-0156-Type not specified, 'int' assumed (/usr/include/linux/byteorder/little_endian.h: 67) PGC-S-0040-Illegal use of symbol, __be64 (/usr/include/linux/byteorder/little_endian.h: 69) PGC-W-0156-Type not specified, 'int' assumed (/usr/include/linux/byteorder/little_endian.h: 69)PGC-S-0040-Illegal use of symbol, __be64 (/usr/include/linux/byteorder/little_endian.h: 71) PGC-W-0156-Type not specified, 'int' assumed (/usr/include/linux/byteorder/little_endian.h: 71)PGC/x86-64 Linux 10.0-1: compilation completed with severe errors make[2]: *** [connect/btl_openib_connect_ibcm.lo] Error 1
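For comparison, stripped of the RPM macros used in the working build above, a roughly equivalent configure line looks like the sketch below (the prefix is illustrative; drop --with-tm if Torque isn't installed). Note that the working build above does not pass --enable-openib-ibcm, which is where the failing object file in this report comes from:

./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 --prefix=/opt/openmpi-pgi --with-openib --with-tm --disable-dlopen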
Re: [OMPI users] Seg fault with PBS Pro 10.2
Well, We all wish the Altair guys would at least try to maintain backwards compatibility with the community, but they have a big habit of breaking things. This isn't the first time they've broken a more customer facing function like tm_spawn. (The also like breaking pbs_statjob too!). I have access to PBS Pro and I can raise the issue with Altair if it would help. Just let me know how I can be helpful. -Joshua Bernstein Senior Software Engineer Penguin Computing On Feb 15, 2010, at 8:23 AM, Jeff Squyres wrote: Bummer! If it helps, could you put us in touch with the PBS Pro people? We usually only have access to Torque when developing the TM-launching stuff (PBS Pro and Torque supposedly share the same TM interface, but we don't have access to PBS Pro, so we don't know if it has diverged over time). On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote: Ralph, This is my first build of OpenMPI so I haven't had this working before. I'm pretty confident that PATH and LD_LIBRARY_PATH issues are not the cause, otherwise launches outside of PBS would fail too. Also, I tried compiling everything statically with the same result. Some additional info... (1) I did a diff on tm.h for PBS 10.2 and from version 8.0 that we had - they are identical, and (2) I've tried this with both the Intel 11.1 and GCC compilers and gotten the exact same run-time errors. For now, I've got a a work-around setup that launches over ssh and still attaches the processes to PBS. Thanks for your help. Steve From: users-boun...@open-mpi.org [mailto:users-bounces@open- mpi.org] On Behalf Of Ralph Castain Sent: Friday, February 12, 2010 8:29 PM To: Open MPI Users Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2 Afraid compilers don't help when the param is a void*... It looks like this is consistent, but I've never tried it under that particular environment. Did prior versions of OMPI work, or are you trying this for the first time? One thing you might check is that you have the correct PATH and LD_LIBRARY_PATH set to point to this version of OMPI and the corresponding PBS Pro libs you used to build it. Most Linux distros come with OMPI installed, and that can cause surprises. We run under Torque at major installations every day, so it - should- work...unless PBS Pro has done something unusual. On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote: Yes, the failure seems to be in mpirun, it never even gets to my application. The proto for tm_init looks like this: int tm_init(void *info, struct tm_roots *roots); where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x tm_task_id If the API was different, wouldn't the compiler most likely generate an error at compile-time? Thanks! Steve From: users-boun...@open-mpi.org [mailto:users-bounces@open- mpi.org] On Behalf Of Ralph Castain Sent: Friday, February 12, 2010 3:21 PM To: Open MPI Users Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2 I'm a tad confused - this trace would appear to indicate that mpirun is failing, yes? Not your application? The reason it works for local procs is that tm_init isn't called for that case - mpirun just fork/exec's the procs directly. When remote nodes are required, mpirun must connect to Torque. This is done with a call to: ret = tm_init(NULL, &tm_root); My guess is that something changed in PBS Pro 10.2 to that API. Can you check the tm header file and see? I have no access to PBSany more, so I'll have to rely on your eyes to see a diff. 
Thanks Ralph On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote: Hello, I'm having problems running Open MPI jobs under PBS Pro 10.2. I've configured and built OpenMPI 1.4.1 with the Intel 11.1 compiler on Linux and with --with-tm support and the build runs fine. I've also built with static libraries per the FAQ suggestion since libpbs is static. However, my test application keep failing with a segmentation fault, but ONLY when trying to select more than 1 node. Running on a single node withing PBS works fine. Also, running outside of PBS vis ssh runs fine as well, even across multiple nodes. OpenIB support is also enabled, but that doesn't seem to affect the error because I've also tried running with the --mca btl tcp,self flag and it still doesn't work. Here is the error I'm getting: [n34:26892] *** Process received signal *** [n34:26892] Signal: Segmentation fault (11) [n34:26892] Signal code: Address not mapped (1) [n34:26892] Failing at address: 0x3f [n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90] [n34:26892] [ 1] /part0/apps/MPI/intel/openmpi-1.4.1/bin/ pbs_mpirun(discui_+0x84) [0x476a50] [n34:26892] [ 2] /part0/apps/MPI/intel/openmpi-1.4.1/bin/ pbs_mpirun(diswsi+0xc3) [0x474063] [n34:26892] [ 3] /part0
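For context, the tm_init() call quoted in this thread boils down to something like the hedged sketch below; tm.h, TM_SUCCESS, tm_finalize() and the tm_roots fields come from the PBS/Torque TM library, the program must be linked against that library, and it will only succeed when run from inside a PBS job:

#include <stdio.h>
#include <tm.h>   /* from the PBS Pro / Torque installation */

int main(void)
{
    struct tm_roots roots;
    int ret = tm_init(NULL, &roots);   /* the same call Open MPI's TM launcher makes */
    if (ret != TM_SUCCESS) {
        printf("tm_init failed: %d\n", ret);
        return 1;
    }
    printf("tm_init OK, tm_me = %d\n", (int)roots.tm_me);
    tm_finalize();
    return 0;
}

A standalone test like this can help show whether the divergence is in the TM library itself or in how Open MPI uses it.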
Re: [OMPI users] Open MPI performance on Amazon Cloud
Hi Hammad, Before we launched the Penguin Computing On-Demand service we conducted several tests that compared the latencies of EC2 with a traditional HPC type setup (much like we have with our POD service). I have a whole suite of tests that I'd be happy to share with you, but to sum it up the EC2 latencies were absolutely terrible. For starters, the EC2 PingPong latencies for a zero byte message was around ~150ms, compared to an completely untuned, Gigabit Ethernet link of 32ms. For something actually useful, say a packet of 4K, EC2 was roughly ~265ms, where as a standard GigE link was a more reasonable (but still high) 71ms. One "real-world" application that was very sensitive to latency took almost 30 times longer to run on EC2 then a real cluster configuration such as POD. I have benchmarks from several complete IMB runs, as well as other types of benchmarks such as STREAM and some iobench. If you are interested in any particular type, please let me know as I'd be happy to share. If you really need an on-demand type system where latency is an issue, you should look towards our POD offering. We even offer Inifniband! On the compute side nothing is virtualized so your application runs on the hardware without the overhead of a VM. -Joshua Bernstein Senior Software Engineer Penguin Computing On Mar 19, 2010, at 11:19 AM, Jeff Squyres wrote: Yes, it is -- sometimes we get so caught up in other issues that user emails slip through the cracks. Sorry about that! I actually have little experience with EC2 -- other than knowing that it works, I don't know much about the performance that you can extract from it. I have heard issues about non-uniform latency between MPI processes because you really don't know where the individual MPI processes may land (network- / VM-wise). It suggests to me that EC2 might be best suited for compute-bound jobs (vs. latency-bound jobs). Amusingly enough, the first time someone reported an issue with Open MPI on EC2, I tried to submit a help ticket to EC2 support saying, "I'm one of the Open MPI developers ... blah blah blah ... is there anything I can do to help?" The answer I got back was along the lines of, "You need to have a paid EC2 support account before we can help you." I think they missed the point, but oh well. :-) On Mar 12, 2010, at 12:10 AM, Hammad Siddiqi wrote: Dear All, Is this the correct forum for sending these kind of emails. please let me know if there is some other mailing list. Thank Best Regards, Hammad Siddiqi System Administrator, Centre for High Performance Scientific Computing, School of Electrical Engineering and Computer Science, National University of Sciences and Technology, H-12, Islamabad. Office : +92 (51) 90852207 Web: http://hpc.seecs.nust.edu.pk/~hammad/ On Sat, Feb 27, 2010 at 10:07 PM, Hammad Siddiqi > wrote: Dear All, I am facing very wierd results of OpenMPI 1.4.1 on Amazon EC2. I have used Small Instance and and High CPU medium instance for benchmarking latency and bandwidth. The OpenMPI was configured with the default options. when the code is run in the cluster mode the latency and bandwidth of Amazon EC2 Small instance is really less than that of Amazon EC2 High CPU medium instance. To my understanding the difference should not be that much. The following are the links to graphs ad their data: Data: http://hpc.seecs.nust.edu.pk/~hammad/OpenMPI,Latency-BandwidthData.jpg Graphs: http://hpc.seecs.nust.edu.pk/~hammad/OpenMPI,Latency-Bandwidth.jpg Please have a look on them. Is anyone else facing the same problem. 
Any guidance in this regard will highly be appreciated. Thank you. -- Best Regards, Hammad Siddiqi System Administrator, Centre for High Performance Scientific Computing, School of Electrical Engineering and Computer Science, National University of Sciences and Technology, H-12, Islamabad. Office : +92 (51) 90852207 -- Jeff Squyres jsquy...@cisco.com
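For anyone wanting to reproduce this kind of comparison, the zero-byte and 4K latencies discussed above are the sort of numbers a two-process Intel MPI Benchmarks PingPong run reports; the host names are placeholders and the IMB binary is assumed to be built separately:

mpirun -np 2 --host nodeA,nodeB ./IMB-MPI1 PingPong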
Re: [OMPI users] libnuma under ompi 1.3
Terry Frankcombe wrote:
Having just downloaded and installed Open MPI 1.3 with ifort and gcc, I merrily went off to compile my application. In my final link with mpif90 I get the error: /usr/bin/ld: cannot find -lnuma Adding --showme reveals that -I/home/terry/bin/Local/include -pthread -I/home/terry/bin/Local/lib is added to the compile early in the aggregated ifort command, and -L/home/terry/bin/Local/lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal -lpbs -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl is added to the end. I note that when compiling Open MPI -lnuma was visible in the gcc arguments, with no added -L. On this system libnuma.so exists in /usr/lib64. My (somewhat long!) configure command was

You shouldn't have to. The runtime loader should look inside of /usr/lib64 by itself. Unless of course, you've built either your application or OpenMPI using a 32-bit Intel compiler instead (say fc instead of fce). In that case the runtime loader would look inside of /usr/lib to find libnuma, rather than /usr/lib64. Are you sure you are using the 64-bit version of the Intel compiler? If you intend to use the 32-bit version of the compiler, and OpenMPI is 32-bit, you may just need to install the numactl.i386 and numactl.x86_64 RPMs.

-Joshua Bernstein
Senior Software Engineer
Penguin Computing
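A couple of quick checks along the lines suggested above; the package names assume a Red Hat style distro, where the unversioned link-time libnuma.so symlink usually comes from the -devel package:

mpif90 --showme:link                      # shows the -lnuma and -L flags the wrapper adds
ls /usr/lib64/libnuma.so* /usr/lib/libnuma.so*
yum install numactl numactl-devel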
Re: [OMPI users] Installation Problems with Openmpi-1.2.9
Hi Amos,

It looks like you do not have permission to make the directory /usr/local/etc. Either you need to run the make all install as root, so you have permission to that directory, or you need to use the --prefix= option to configure so that the installation gets installed into a path where you have permission.

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

On Mar 12, 2009, at 12:13 PM, Amos Leffler wrote:
Hello Forum, Attached is a file of my installation and trying examples for openmpi-1.2.9 which were not successful. Hopefully the problem is a simple one and obvious to a more experienced user. I am trying to install and test openmpi-1.2.9. I found that I could not use the Intel 11.0/.081 C++ and Fortran compilers although I think the problem is with these compilers not openmpi. The openmpi-1.2.9 did compile successfully with the internal compilers of SuSE 10.2. However, at the end of the "make all install" command output I noted that some of the make commands did not run properly as shown below. I tried to run some of the simple examples and was not successful. For hello_c.c I received the message "mpicc not found". Is there a simple workaround?

make[3]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/libltdl'
make[2]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/libltdl'
Making install in asm
make[2]: Entering directory `/home/amos/Desktop/openmpi-1.2.9/opal/asm'
make[3]: Entering directory `/home/amos/Desktop/openmpi-1.2.9/opal/asm'
make[3]: Nothing to be done for `install-exec-am'.
make[3]: Nothing to be done for `install-data-am'.
make[3]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/asm'
make[2]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/asm'
Making install in etc
make[2]: Entering directory `/home/amos/Desktop/openmpi-1.2.9/opal/etc'
make[3]: Entering directory `/home/amos/Desktop/openmpi-1.2.9/opal/etc'
test -z "/usr/local/etc" || /bin/mkdir -p "/usr/local/etc"
/bin/mkdir: cannot create directory `/usr/local/etc': Permission denied
make[3]: *** [install-sysconfDATA] Error 1
make[3]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/etc'
make[2]: *** [install-am] Error 2
make[2]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal/etc'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/home/amos/Desktop/openmpi-1.2.9/opal'
make: *** [install-recursive] Error 1

Any help would be appreciated. Amos Leffler
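A minimal user-level install along the lines suggested above (the prefix is just an example), which also takes care of the "mpicc not found" message since the wrappers end up on the PATH:

./configure --prefix=$HOME/openmpi-1.2.9
make all install
export PATH=$HOME/openmpi-1.2.9/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-1.2.9/lib:$LD_LIBRARY_PATH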
[OMPI users] OpenMPI 1.3.2 with PathScale 3.2
Greetings All,

I'm trying to build OpenMPI 1.3.2 with the Pathscale compiler, version 3.2. A bit of the way through the build the compiler dies with what it thinks is a bad optimization. Has anybody else seen this, or know of a workaround for it? I'm going to take it up with Pathscale of course, but I thought I'd throw it out here:

---SNIP---
/opt/pathscale/bin/pathCC -DHAVE_CONFIG_H -I. -I../.. -I../../extlib/otf/otflib -I../../extlib/otf/otflib -I../../vtlib/ -I../../vtlib -D_GNU_SOURCE -mp -DVT_OMP -O3 -DNDEBUG -finline-functions -pthread -MT vtfilter-vt_tracefilter.o -MD -MP -MF .deps/vtfilter-vt_tracefilter.Tpo -c -o vtfilter-vt_tracefilter.o `test -f 'vt_tracefilter.cc' || echo './'`vt_tracefilter.cc
Signal: Segmentation fault in Global Optimization -- Dead Store Elimination phase.
Error: Signal Segmentation fault in phase Global Optimization -- Dead Store Elimination -- processing aborted
*** Internal stack backtrace:
pathCC INTERNAL ERROR: /opt/pathscale/lib/3.2/be died due to signal 4
Please report this problem to .
Problem report saved as /root/.ekopath-bugs/pathCC_error_LvXsJk.ii
Please review the above file and, if possible, attach it to your problem report.
bash-3.00# /opt/pathscale/bin/pathCC -version
PathScale(TM) Compiler Suite: Version 3.2
Built on: 2008-06-16 16:45:36 -0700
Thread model: posix
GNU gcc version 3.3.1 (PathScale 3.2 driver)
Copyright 2000, 2001 Silicon Graphics, Inc. All Rights Reserved.
Copyright 2002, 2003, 2004, 2005, 2006 PathScale, Inc. All Rights Reserved.
Copyright 2006, 2007 QLogic Corporation. All Rights Reserved.
Copyright 2007, 2008 PathScale LLC. All Rights Reserved.
See complete copyright, patent and legal notices in the /opt/pathscale/share/doc/pathscale-compilers-3.2/LEGAL.pdf file.
---END SNIP---

-Joshua Bernstein
Software Engineer
Penguin Computing
Re: [OMPI users] OpenMPI 1.3.2 with PathScale 3.2
Well, I spoke Gautam Chakrabarti at Pathscale. It seems the long and short of it is that using OpenMP with C++ with a GNU3.3 (RHEL4) frontend creates some limitations inside of pathCC. On a RHEL4 system, the compilier activates the proper frontend for GCC 3.3, this is what creates the crash. As suggested I forced the compilier to use the newer frontend with the -gnu4 option and the build completes without an issue. Sad though that they aren't trying to be backwards compatible, or even testing on RHEL4 systems. I imagine there is still large group of people using RHEL4. Perhaps this is an OMPI FAQ entry? The full response from Pathscale appears below: ---SNIP--- It appears you are using the compiler on a relatively old linux distribution which has a default GCC compiler based on version 3.3. Our compiler has a front-end that is activated on such systems, and a different newer improved front-end which is activated on the newer GCC4-based systems. Our compiler is tested on GCC-based systems with versions up to 4.2. I see that you are using OpenMP (using -mp). C++ OpenMP has limitations when being used with the GNU3.3 based front-end, and is only fully supported when on a GNU4 based system. You can invoke the newer front-end by the option -gnu4 on a GNU3 based system. While compiling this particular file may work with -gnu4 on a GNU3 based system, it is generally not safe to use this option for C++ on a GNU3 based system due to incompatibility issues. The ideal fix would be to try your compilation on a GNU4 based linux distribution. ---END SNIP--- -Joshua Bernstein Software Engineer Penguin Computing Jeff Squyres wrote: FWIW, I'm able to duplicate the error. Looks definitely like a[nother] pathscale bug to me. Perhaps David's suggestions to disable some of the optimizations may help; otherwise, you can disable that entire chunk of code with the following: --enable-contrib-no-build=vt (as Ralph mentioned, this VampirTrace code is an add-on to Open MPI; it's not part of core OMPI itself) On May 15, 2009, at 9:17 AM, David O. Gunter wrote: Pathscale supports -O3 (at least as of the 3.1 line). Here are some suggestions from the 3.2 Users Manual you may also want to try. -david If there are numerical problems with -O3 -OPT:Ofast, then try either of the following: -O3 -OPT:Ofast:ro=1 -O3 -OPT:Ofast:div_split=OFF Note that ’ro’ is short for roundoff. -Ofast is equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math so similar cautions apply to it as to -O3 -OPT:Ofast. To use interprocedural analysis without the "Ofast-type" optimizations, use either of the following: -O3 -ipa -O2 -ipa Testing different optimizations can be automated by pathopt2. This program compiles and runs your program with a variety of compiler options and creates a sorted list of the execution times for each run. -- David Gunter Los Alamos National Laboratory > Last I checked when we were building here, I'm not sure Pathscale > supports -O3. IIRC, O2 is the max supported value, though it has been > awhile since I played with it. > > Have you checked the man page for it? > > It could also be something in the VampirTrace code since that is where > you are failing. That is a contributed code - not part of OMPI itself > - so we would have to check with those developers. > > > On May 14, 2009, at 2:49 PM, Åke Sandgren wrote: > >> On Thu, 2009-05-14 at 13:35 -0700, Joshua Bernstein wrote: >>> Greetings All, >>> >>> I'm trying to build OpenMPI 1.3.2 with the Pathscale compiler, >>> version 3.2. 
A >>> bit of the way through the build the compiler dies with what it >>> things is a bad >>> optimization. Has anybody else seen this, or know a work around for >>> it? I'm >>> going to take it up with Pathscale of course, but I thought I'd >>> throw it out here: >>> >>> ---SNIP--- >>> /opt/pathscale/bin/pathCC -DHAVE_CONFIG_H -I. -I../.. -I../../ >>> extlib/otf/otflib >>> -I../../extlib/otf/otflib -I../../vtlib/ -I../../vtlib - >>> D_GNU_SOURCE -mp >>> -DVT_OMP -O3 -DNDEBUG -finline-functions -pthread -MT vtfilter- >>> vt_tracefilter.o >>> -MD -MP -MF .deps/vtfilter-vt_tracefilter.Tpo -c -o vtfilter- >>> vt_tracefilter.o >>> `test -f 'vt_tracefilter.cc' || echo './'`vt_tracefilter.cc >>> Signal: Segmentation fault in Global Optimization -- Dead Store >>> Elimination phase. >>> Error: Signal Segmentation fault in phase Global Optimization -- >>> Dead Store >>> Elimination -- processing aborted >>> *** Internal stack backtrace: >>> pathCC INTERNAL ERROR: /opt/pathscale/lib/3.2/be died due to signal 4 >> >>
Re: [OMPI users] OpenMPI 1.3.2 with PathScale 3.2
Jeff Squyres wrote: Hah; this is probably at least tangentially related to http://www.open-mpi.org/faq/?category=building#pathscale-broken-with-mpi-c++-api This looks related, perhaps something about suggesting using the -gnu4 option might be nice to add, if not there then maybe someplace else? I'll be kind and say that Pathscale has been "unwilling to help on these kinds of issues" with me in the past as well. :-) They've been very responsive for me and their suggestions do generally do the trick. There is no doubt the compiler is smoking fast, its just about compatibility. :-) It's not entirely clear from the text, but I guess that sounds like Pathscale is unsupported on GCC 3.x systems...? Is that what you parse his answer to mean? From Pathscale: "PathScale does not support C++ OpenMP support on RHEL4. As I noted earlier, RHEL4 is otherwise supported for all other compiler features, and it's tested as well." -Joshua Bernstein Software Engineer Penguin Computing