Re: [OMPI users] problem with opal_list_remove_item for openmpi-v2.x-201702010255-8b16747 on Linux

2017-02-03 Thread Jeff Squyres (jsquyres)
I've filed this as https://github.com/open-mpi/ompi/issues/2920.

Ralph is just heading out for about a week or so; it may not get fixed until he 
comes back.



> On Feb 3, 2017, at 2:03 AM, Siegmar Gross wrote:
> 
> Hi,
> 
> I have installed openmpi-v2.x-201702010255-8b16747 on my "SUSE Linux
> Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0.
> Unfortunately, I get a warning from "opal_list_remove_item" about a
> missing item when I run one of my programs.
> 
> loki spawn 115 mpiexec -np 1 --host loki,loki,nfs1 spawn_intra_comm
> Parent process 0: I create 2 slave processes
> 
> Parent process 0 running on loki
>MPI_COMM_WORLD ntasks:  1
>COMM_CHILD_PROCESSES ntasks_local:  1
>COMM_CHILD_PROCESSES ntasks_remote: 2
>COMM_ALL_PROCESSES ntasks:  3
>mytid in COMM_ALL_PROCESSES:0
> 
> Child process 0 running on loki
>MPI_COMM_WORLD ntasks:  2
>COMM_ALL_PROCESSES ntasks:  3
>mytid in COMM_ALL_PROCESSES:1
> 
> Child process 1 running on nfs1
>MPI_COMM_WORLD ntasks:  2
>COMM_ALL_PROCESSES ntasks:  3
>mytid in COMM_ALL_PROCESSES:2
> Warning :: opal_list_remove_item - the item 0xc45f80 is not on the list 
> 0x7f5bb1f34978
> loki spawn 116
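
For reference, the spawn-and-merge pattern implied by the output above can be reproduced with a small test program along these lines (a minimal sketch, not Siegmar's actual spawn_intra_comm code; the child count and variable names are assumptions). Built against the same installation and run with the same host list, such a test should exercise the same code path in the runtime:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, children, all;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* parent: spawn two children running this same binary */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
            /* merge parent and children into one intra-communicator */
            MPI_Intercomm_merge(children, 0, &all);
        } else {
            /* child: merge with the parent side */
            MPI_Intercomm_merge(parent, 1, &all);
        }

        MPI_Comm_rank(all, &rank);
        MPI_Comm_size(all, &size);
        printf("task %d of %d in the merged communicator\n", rank, size);

        MPI_Comm_free(&all);
        MPI_Finalize();
        return 0;
    }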
> 
> 
> I used the following commands to build and install the package.
> ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
> Linux machine. The option "--enable-mpi-cxx-bindings" is now
> unrecognized. Are the C++ bindings now supported automatically?
> "configure" also reports a warning that it asks me to report.
> 
> mkdir openmpi-v2.x-201702010255-8b16747-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
> cd openmpi-v2.x-201702010255-8b16747-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
> 
> ../openmpi-v2.x-201702010255-8b16747/configure \
>  --prefix=/usr/local/openmpi-2.1.0_64_cc \
>  --libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
>  --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>  --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>  JAVA_HOME=/usr/local/jdk1.8.0_66 \
>  LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
>  CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
>  CPP="cpp" CXXCPP="cpp" \
>  --enable-mpi-cxx \
>  --enable-mpi-cxx-bindings \
>  --enable-cxx-exceptions \
>  --enable-mpi-java \
>  --enable-mpi-thread-multiple \
>  --with-hwloc=internal \
>  --without-verbs \
>  --with-wrapper-cflags="-m64 -mt" \
>  --with-wrapper-cxxflags="-m64" \
>  --with-wrapper-fcflags="-m64" \
>  --with-wrapper-ldflags="-mt" \
>  --enable-debug \
>  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> 
> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> rm -r /usr/local/openmpi-2.1.0_64_cc.old
> mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> 
> 
> 
> 
> ...
> checking numaif.h usability... no
> checking numaif.h presence... yes
> configure: WARNING: numaif.h: present but cannot be compiled
> configure: WARNING: numaif.h: check for missing prerequisite headers?
> configure: WARNING: numaif.h: see the Autoconf documentation
> configure: WARNING: numaif.h: section "Present But Cannot Be Compiled"
> configure: WARNING: numaif.h: proceeding with the compiler's result
> configure: WARNING: ## ------------------------------------------------------ ##
> configure: WARNING: ## Report this to http://www.open-mpi.org/community/help/ ##
> configure: WARNING: ## ------------------------------------------------------ ##
> checking for numaif.h... no
> ...
> 
> 
> I would be grateful if somebody could fix the problems. Do you need anything
> else? Thank you very much in advance for any help.
> 
> 
> Kind regards
> 
> Siegmar
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users


-- 
Jeff Squyres
jsquy...@cisco.com



[OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread Mark Dixon

Hi,

Just tried upgrading from 2.0.1 to 2.0.2 and I'm getting error messages 
that look like openmpi is using ssh to login to remote nodes instead of 
qrsh (see below). Has anyone else noticed gridengine integration being 
broken, or am I being dumb?


I built with "./configure 
--prefix=/apps/developers/libraries/openmpi/2.0.2/1/intel-17.0.1 
--with-sge --with-io-romio-flags=--with-file-system=lustre+ufs 
--enable-mpi-cxx --with-cma"


Can see the gridengine component via:

$ ompi_info -a | grep gridengine
 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
  MCA ras gridengine: ---
  MCA ras gridengine: parameter "ras_gridengine_priority" (current value: 
"100", data source: default, level: 9 dev/all, type: int)
  Priority of the gridengine ras component
  MCA ras gridengine: parameter "ras_gridengine_verbose" (current value: 
"0", data source: default, level: 9 dev/all, type: int)
  Enable verbose output for the gridengine ras component
  MCA ras gridengine: parameter "ras_gridengine_show_jobid" (current value: 
"false", data source: default, level: 9 dev/all, type: bool)

Cheers,

Mark

ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied, please try again.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--



Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread Reuti
Hi,

> On 03.02.2017 at 17:10, Mark Dixon wrote:
> 
> Hi,
> 
> Just tried upgrading from 2.0.1 to 2.0.2 and I'm getting error messages that 
> look like openmpi is using ssh to login to remote nodes instead of qrsh (see 
> below). Has anyone else noticed gridengine integration being broken, or am I 
> being dumb?
> 
> I built with "./configure 
> --prefix=/apps/developers/libraries/openmpi/2.0.2/1/intel-17.0.1 --with-sge 
> --with-io-romio-flags=--with-file-system=lustre+ufs --enable-mpi-cxx 
> --with-cma"

SGE itself is not configured to use SSH? (I mean the rsh_command and rsh_daemon 
entries in `qconf -sconf`.)

-- Reuti


> Can see the gridengine component via:
> 
> $ ompi_info -a | grep gridengine
> MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
>  MCA ras gridengine: ---
>  MCA ras gridengine: parameter "ras_gridengine_priority" (current value: 
> "100", data source: default, level: 9 dev/all, type: int)
>  Priority of the gridengine ras component
>  MCA ras gridengine: parameter "ras_gridengine_verbose" (current value: 
> "0", data source: default, level: 9 dev/all, type: int)
>  Enable verbose output for the gridengine ras 
> component
>  MCA ras gridengine: parameter "ras_gridengine_show_jobid" (current 
> value: "false", data source: default, level: 9 dev/all, type: bool)
> 
> Cheers,
> 
> Mark
> 
> ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> Permission denied, please try again.
> ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> Permission denied, please try again.
> ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>  settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>  Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>  Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>  (e.g., on Cray). Please check your configure cmd line and consider using
>  one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>  lack of common network interfaces and/or no route found between
>  them. Please check network connectivity (including firewalls
>  and network routing requirements).
> --
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users




Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread Mark Dixon

On Fri, 3 Feb 2017, Reuti wrote:
...
> SGE on its own is not configured to use SSH? (I mean the entries in
> `qconf -sconf` for rsh_command resp. daemon).

...

Nope, everything left as the default:

$ qconf -sconf | grep _command
qlogin_command   builtin
rlogin_command   builtin
rsh_command  builtin

I have 2.0.1 and 2.0.2 installed side by side. 2.0.1 is happy but 2.0.2 
isn't.


I'll start digging, but I'd appreciate hearing from any other SGE user who 
has tried 2.0.2 whether it worked for them, please? :)


Cheers,

Mark


Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread r...@open-mpi.org
I do see a diff between 2.0.1 and 2.0.2 that might have a related impact. The 
way we handled the MCA param that specifies the launch agent (ssh, rsh, or 
whatever) was modified, and I don’t think the change is correct. It basically 
says that we don’t look for qrsh unless the MCA param has been changed from the 
hard-coded default, which means we are not detecting SGE by default.

Try setting "-mca plm_rsh_agent foo" on your cmd line - that will get past the 
test, and then we should auto-detect SGE again.
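
For example, from inside a Grid Engine job script the workaround would look something like this ($NSLOTS is the slot count SGE sets; the application name is a placeholder):

    mpirun -mca plm_rsh_agent foo -np $NSLOTS ./my_mpi_app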


> On Feb 3, 2017, at 8:49 AM, Mark Dixon  wrote:
> 
> On Fri, 3 Feb 2017, Reuti wrote:
> ...
>> SGE on its own is not configured to use SSH? (I mean the entries in `qconf 
>> -sconf` for rsh_command resp. daemon).
> ...
> 
> Nope, everything left as the default:
> 
> $ qconf -sconf | grep _command
> qlogin_command   builtin
> rlogin_command   builtin
> rsh_command  builtin
> 
> I have 2.0.1 and 2.0.2 installed side by side. 2.0.1 is happy but 2.0.2 isn't.
> 
> I'll start digging, but I'd appreciate hearing from any other SGE user who 
> had tried 2.0.2 and tell me if it had worked for them, please? :)
> 
> Cheers,
> 
> Mark
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread Glenn Johnson
Is this the same issue that was previously fixed in PR-1960?

https://github.com/open-mpi/ompi/pull/1960/files


Glenn

On Fri, Feb 3, 2017 at 10:56 AM, r...@open-mpi.org  wrote:

> I do see a diff between 2.0.1 and 2.0.2 that might have a related impact.
> The way we handled the MCA param that specifies the launch agent (ssh, rsh,
> or whatever) was modified, and I don’t think the change is correct. It
> basically says that we don’t look for qrsh unless the MCA param has been
> changed from the coded default, which means we are not detecting SGE by
> default.
>
> Try setting "-mca plm_rsh_agent foo" on your cmd line - that will get past
> the test, and then we should auto-detect SGE again
>
>
> > On Feb 3, 2017, at 8:49 AM, Mark Dixon  wrote:
> >
> > On Fri, 3 Feb 2017, Reuti wrote:
> > ...
> >> SGE on its own is not configured to use SSH? (I mean the entries in
> `qconf -sconf` for rsh_command resp. daemon).
> > ...
> >
> > Nope, everything left as the default:
> >
> > $ qconf -sconf | grep _command
> > qlogin_command   builtin
> > rlogin_command   builtin
> > rsh_command  builtin
> >
> > I have 2.0.1 and 2.0.2 installed side by side. 2.0.1 is happy but 2.0.2
> isn't.
> >
> > I'll start digging, but I'd appreciate hearing from any other SGE user
> who had tried 2.0.2 and tell me if it had worked for them, please? :)
> >
> > Cheers,
> >
> > Mark
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>

Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

2017-02-03 Thread Howard Pritchard
Hello Brendan,

Sorry for the delay in responding.  I've been on travel the past two weeks.

I traced through the debug output you sent.  It provided enough information
to show that, for some reason, when using the breakout cable Open MPI is
unable to complete the initialization it needs in order to use the openib BTL.
It correctly detects that the first port is not available, but it still fails
to initialize port 1.

To debug this further, I'd need to provide you with a custom Open MPI
to try that would have more debug output in the suspect area.

If you'd like to go this route, let me know and I'll build a one-off library
to try to debug this problem.

One thing to do just as a sanity check is to try tcp:

mpirun --mca btl tcp,self,sm 

with the breakout cable.  If that doesn't work, then I think there may
be some network setup problem that needs to be resolved first before
trying custom Open MPI tarballs.
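
Using the hostfile and benchmark from your earlier command (assuming that is
still the test case), the full sanity check would be something like:

    mpirun --mca btl tcp,self,sm -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1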

Thanks,

Howard




2017-02-01 15:08 GMT-07:00 Brendan Myers :

> Hello Howard,
>
> I was wondering if you have been able to look at this issue at all, or if
> anyone has any ideas on what to try next.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Brendan
> Myers
> *Sent:* Tuesday, January 24, 2017 11:11 AM
>
> *To:* 'Open MPI Users' 
> *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and
> switch
>
>
>
> Hello Howard,
>
> Here is the error output after building with debug enabled.  These CX4
> Mellanox cards view each port as a separate device and I am using port 1 on
> the card which is device mlx5_0.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org
> ] *On Behalf Of *Howard Pritchard
> *Sent:* Tuesday, January 24, 2017 8:21 AM
> *To:* Open MPI Users 
> *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and
> switch
>
>
>
> Hello Brendan,
>
>
>
> This helps some, but looks like we need more debug output.
>
>
>
> Could you build a debug version of Open MPI by adding --enable-debug
>
> to the config options and rerun the test with the breakout cable setup
>
> and keeping the --mca btl_base_verbose 100 command line option?
>
>
>
> Thanks
>
>
>
> Howard
>
>
>
>
>
> 2017-01-23 8:23 GMT-07:00 Brendan Myers :
>
> Hello Howard,
>
> Thank you for looking into this. Attached is the output you requested.
> Also, I am using Open MPI 2.0.1.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Friday, January 20, 2017 6:35 PM
> *To:* Open MPI Users 
> *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and
> switch
>
>
>
> Hi Brendan
>
>
>
> I doubt this kind of config has gotten any testing with OMPI.  Could you
> rerun with
>
>
>
> --mca btl_base_verbose 100
>
>
>
> added to the command line and post the output to the list?
>
>
>
> Howard
>
>
>
>
>
> Brendan Myers wrote on Fri, 20 Jan 2017 at 15:04:
>
> Hello,
>
> I am attempting to get Open MPI to run over 2 nodes using a switch and a
> single breakout cable with this design:
>
> (100GbE) QSFP <-> 2x (50GbE) QSFP
>
>
>
> Hardware Layout:
>
> Breakout cable module A connects to switch (100GbE)
>
> Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)
>
> Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)
>
> Switch is Mellanox SN 2700 100GbE RoCE switch
>
>
>
> · I  am able to pass RDMA traffic between the nodes with perftest
> (ib_write_bw) when using the breakout cable as the IC from both nodes to
> the switch.
>
> · When attempting to run a job using the breakout cable as the IC
> Open MPI aborts with failure to initialize open fabrics device errors.
>
> · If I replace the breakout cable with 2 standard QSFP cables the
> Open MPI job will complete correctly.
>
>
>
>
>
> This is the command I use, it works unless I attempt a run with the
> breakout cable used as IC:
>
> *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues
> P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm  -hostfile
> mpi-hosts-ce /usr/local/bin/IMB-MPI1*
>
>
>
> If anyone has any idea as to why using a breakout cable is causing my jobs
> to fail please let me know.
>
>
>
> Thank you,
>
>
>
> Brendan T. W. Myers
>
> brendan.my...@soft-forge.com
>
> Software Forge Inc
>
>
>
> ___
>
> users mailing list
>
> users@lists.open-mpi.org
>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>

Re: [OMPI users] Is gridengine integration broken in openmpi 2.0.2?

2017-02-03 Thread r...@open-mpi.org
I don’t think so - at least, that isn’t the code I was looking at.

> On Feb 3, 2017, at 9:43 AM, Glenn Johnson  wrote:
> 
> Is this the same issue that was previously fixed in PR-1960?
> 
> https://github.com/open-mpi/ompi/pull/1960/files 
> 
> 
> 
> Glenn
> 
> On Fri, Feb 3, 2017 at 10:56 AM, r...@open-mpi.org wrote:
> I do see a diff between 2.0.1 and 2.0.2 that might have a related impact. The 
> way we handled the MCA param that specifies the launch agent (ssh, rsh, or 
> whatever) was modified, and I don’t think the change is correct. It basically 
> says that we don’t look for qrsh unless the MCA param has been changed from 
> the coded default, which means we are not detecting SGE by default.
> 
> Try setting "-mca plm_rsh_agent foo" on your cmd line - that will get past 
> the test, and then we should auto-detect SGE again
> 
> 
> > On Feb 3, 2017, at 8:49 AM, Mark Dixon wrote:
> >
> > On Fri, 3 Feb 2017, Reuti wrote:
> > ...
> >> SGE on its own is not configured to use SSH? (I mean the entries in `qconf 
> >> -sconf` for rsh_command resp. daemon).
> > ...
> >
> > Nope, everything left as the default:
> >
> > $ qconf -sconf | grep _command
> > qlogin_command   builtin
> > rlogin_command   builtin
> > rsh_command  builtin
> >
> > I have 2.0.1 and 2.0.2 installed side by side. 2.0.1 is happy but 2.0.2 
> > isn't.
> >
> > I'll start digging, but I'd appreciate hearing from any other SGE user who 
> > had tried 2.0.2 and tell me if it had worked for them, please? :)
> >
> > Cheers,
> >
> > Mark
> > ___
> > users mailing list
> > users@lists.open-mpi.org 
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users 
> > 
> 
> ___
> users mailing list
> users@lists.open-mpi.org 
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] MPI_Comm_spawn question

2017-02-03 Thread r...@open-mpi.org
We know v2.0.1 has problems with comm_spawn, and so you may be encountering one 
of those. Regardless, there is indeed a timeout mechanism in there. It was added 
because people would execute a comm_spawn that hung, and then eat up their 
entire allocation time for nothing.

In v2.0.2, I see it is still hardwired at 60 seconds. I believe we eventually 
realized we needed to make that a variable, but it didn’t get into the 2.0.2 
release.
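
If you want to check whether that 60-second limit is what you are hitting, one crude approach is to log the wall-clock time around the call from inside your already-initialized MPI program (a sketch; the child binary name is a placeholder, and if the runtime aborts the job at the timeout the second message may never be printed):

    MPI_Comm children;
    double t0 = MPI_Wtime();
    fprintf(stderr, "calling MPI_Comm_spawn\n");
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
    fprintf(stderr, "MPI_Comm_spawn returned after %.1f seconds\n",
            MPI_Wtime() - t0);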


> On Feb 1, 2017, at 1:00 AM, elistrato...@info.sgu.ru wrote:
> 
> I am using Open MPI version 2.0.1.
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
