So far it has not happened again - will report if it does.

On Tue, Jan 24, 2012 at 5:10 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

> Ralph's fix has now been committed to the v1.5 trunk (yesterday).
>
> Did that fix it?
>
>
> On Jan 22, 2012, at 3:40 PM, Mike Dubman wrote:
>
> > it was compiled with the same ompi.
> > We see it occasionally on different clusters with different ompi folders (all v1.5).
> >
> > On Thu, Jan 19, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > I didn't commit anything to the v1.5 branch yesterday - just the trunk.
> >
> > As I told Mike off-list, I think it may have been that the binary was
> > compiled against a different OMPI version by mistake. It looks very much
> > like what I'd expect to have happen in that scenario.
> >
> > On Jan 19, 2012, at 7:52 AM, Jeff Squyres wrote:
> >
> > > Did you "svn up"?  I ask because Ralph committed some stuff yesterday
> > > that may have fixed this.
> > >
> > >
> > > On Jan 18, 2012, at 5:19 PM, Andrew Senin wrote:
> > >
> > >> No, nothing specific. Only basic settings (--mca btl openib,self
> > >> --npernode 1, etc).
> > >>
> > >> Actually I was confused by this error because today it just
> > >> disappeared. I had 2 separate folders where it was reproduced in 100%
> > >> of test runs. Today I recompiled the source and it is gone in both
> > >> folders. But yesterday I tried recompiling multiple times with no
> > >> effect. So I believe this must be somehow related to some unknown
> > >> settings in the lab which have been changed. Trying to reproduce the
> > >> crash now...
> > >>
> > >> Regards,
> > >> Andrew Senin.
> > >>
> > >> On Thu, Jan 19, 2012 at 12:05 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> > >>> Jumping in pretty late in this thread here...
> > >>>
> > >>> I see that it's failing in opal_hwloc_base_close().  That's a little worrisome.
> > >>>
> > >>> I do see an odd path through the hwloc initialization that *could*
> > >>> cause an error during finalization -- but it would involve you setting an
> > >>> invalid value for an MCA parameter.  Are you setting
> > >>> hwloc_base_mem_bind_failure_action or
> > >>> hwloc_base_mem_alloc_policy, perchance?
> > >>>
> > >>>
> > >>> On Jan 16, 2012, at 1:56 PM, Andrew Senin wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> I think I've found a bug in the head revision of the OpenMPI 1.5
> > >>>> branch. If it is configured with --disable-debug it crashes in
> > >>>> finalize on the hello_c.c example. Did I miss something?
> > >>>>
> > >>>> Configure options:
> > >>>> ./configure --with-pmi=/usr/ --with-slurm=/usr/ --without-psm
> > >>>> --disable-debug --enable-mpirun-prefix-by-default
> > >>>> --prefix=/hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install
> > >>>>
> > >>>> Runtime command and output:
> > >>>> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../lib ./mpirun --mca btl openib,self
> > >>>> --npernode 1 --host mir1,mir2 ./hello
> > >>>>
> > >>>> Hello, world, I am 0 of 2
> > >>>> Hello, world, I am 1 of 2
> > >>>> [mir1:05542] *** Process received signal ***
> > >>>> [mir1:05542] Signal: Segmentation fault (11)
> > >>>> [mir1:05542] Signal code: Address not mapped (1)
> > >>>> [mir1:05542] Failing at address: 0xe8
> > >>>> [mir2:10218] *** Process received signal ***
> > >>>> [mir2:10218] Signal: Segmentation fault (11)
> > >>>> [mir2:10218] Signal code: Address not mapped (1)
> > >>>> [mir2:10218] Failing at address: 0xe8
> > >>>> [mir1:05542] [ 0] /lib64/libpthread.so.0() [0x390d20f4c0]
> > >>>> [mir1:05542] [ 1] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8) [0x7f4588cee6a8]
> > >>>> [mir1:05542] [ 2] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32) [0x7f4588cee700]
> > >>>> [mir1:05542] [ 3] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73) [0x7f4588d1beb2]
> > >>>> [mir1:05542] [ 4] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe) [0x7f4588c81eb5]
> > >>>> [mir1:05542] [ 5] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a) [0x7f4588c217c3]
> > >>>> [mir1:05542] [ 6] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59) [0x7f4588c39959]
> > >>>> [mir1:05542] [ 7] ./hello(main+0x69) [0x4008fd]
> > >>>> [mir1:05542] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd) [0x390ca1ec5d]
> > >>>> [mir1:05542] [ 9] ./hello() [0x4007d9]
> > >>>> [mir1:05542] *** End of error message ***
> > >>>> [mir2:10218] [ 0] /lib64/libpthread.so.0() [0x3a6dc0f4c0]
> > >>>> [mir2:10218] [ 1] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8) [0x7f409f31d6a8]
> > >>>> [mir2:10218] [ 2] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32) [0x7f409f31d700]
> > >>>> [mir2:10218] [ 3] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73) [0x7f409f34aeb2]
> > >>>> [mir2:10218] [ 4] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe) [0x7f409f2b0eb5]
> > >>>> [mir2:10218] [ 5] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a) [0x7f409f2507c3]
> > >>>> [mir2:10218] [ 6] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59) [0x7f409f268959]
> > >>>> [mir2:10218] [ 7] ./hello(main+0x69) [0x4008fd]
> > >>>> [mir2:10218] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3a6d41ec5d]
> > >>>> [mir2:10218] [ 9] ./hello() [0x4007d9]
> > >>>> [mir2:10218] *** End of error message ***
> > >>>> --------------------------------------------------------------------------
> > >>>> mpirun noticed that process rank 0 with PID 5542 on node mir1 exited
> > >>>> on signal 11 (Segmentation fault).
> > >>>> ---------------------------------------------------------------------
> > >>>>
> > >>>> Thanks,
> > >>>> Andrew Senin
> > >>>> _______________________________________________
> > >>>> users mailing list
> > >>>> us...@open-mpi.org
> > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >>>
> > >>>
> > >>> --
> > >>> Jeff Squyres
> > >>> jsquy...@cisco.com
> > >>> For corporate legal information go to:
> > >>> http://www.cisco.com/web/about/doing_business/legal/cri/

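Ralph's suggestion in this thread, that the binary may have been built against a different OMPI version, can be partly checked from a backtrace like the one Andrew posted: every libmpi frame should resolve to the install tree you expect. Below is a minimal sketch in Python, not part of Open MPI. The helper name libraries_by_host and the regular expression are assumptions made for illustration, and it assumes each frame sits on a single line (rejoin wrapped lines first).

```python
import re
from collections import defaultdict

# Matches backtrace frames of the form printed in this thread, e.g.
#   [mir1:05542] [ 3] /path/lib/libmpi.so.1(opal_finalize+0x73) [0x7f4588d1beb2]
# The pattern is an assumption based on that output, not an Open MPI spec.
FRAME_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+\[\s*(?P<idx>\d+)\]\s+"
    r"(?P<obj>[^\s(]+)\((?P<sym>[^)]*)\)\s+\[(?P<addr>0x[0-9a-fA-F]+)\]"
)


def libraries_by_host(trace_text):
    """Return {host: set of object paths} for every frame in trace_text."""
    result = defaultdict(set)
    for line in trace_text.splitlines():
        # Drop mail quote prefixes ("> > >>>> ") so quoted traces parse too.
        line = re.sub(r"^[>\s]+", "", line)
        match = FRAME_RE.search(line)
        if match:
            result[match.group("host")].add(match.group("obj"))
    return dict(result)


if __name__ == "__main__":
    import sys

    # Usage: python check_trace.py < trace.txt
    for host, objs in sorted(libraries_by_host(sys.stdin.read()).items()):
        mpi_libs = sorted(p for p in objs if "libmpi" in p)
        print(host, "libmpi candidates:", mpi_libs)
```

If a host lists more than one libmpi path, or a path outside the intended --prefix, that points toward a mixed installation. A single matching path does not rule out a build-time mismatch, since the trace only shows what was loaded at run time.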