Hi Ben
One of the ranks (52) called MPI_Abort.
This may be a bug in the code, or a problem with the setup
(e.g. a missing or incorrect input file).
For instance, the CCTM Wiki says:
"AERO6 expects emissions inputs for 13 new PM species. CCTM will crash
if any emitted PM species is not included in the emissions input file"
I am not familiar with CCTM, so these are just guesses.
It doesn't look like an MPI problem, though.
You may want to check any other logs that the CCTM code
produces, for any clue about where it fails.
Otherwise, you could compile with -g -traceback (and remove any
optimization options in FFLAGS, FCFLAGS, CFLAGS, etc.).
The model may also have a -DDEBUG or similar flag that can be turned on
in CPPFLAGS, which in many models produces a more verbose log.
This *may* tell you where it fails (source file, subroutine and line),
and may help understand why it fails.
If it dumps a core file, you can trace the failure point with
a debugger.
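For instance, something along these lines in the CCTM build configuration
may help (the variable names are generic guesses, not necessarily what the
CCTM build scripts actually use):

    FFLAGS = -g -traceback -O0     # ifort: debug symbols, runtime traceback, no optimization
    CPPFLAGS = -DDEBUG             # only if the model really provides such a flag

and in the job script, before mpiexec:

    ulimit -c unlimited            # allow core dumps
    # after a crash:  gdb ./CCTM_V5g_Linux2_x <corefile>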
I hope this helps,
Gus
On 05/21/2014 03:20 PM, Ben Lash wrote:
I used a different build of netcdf 4.1.3, and the code seems to run now.
I have a totally different, non-MPI-related error in part of it, but
there's no way for the list to help; I mostly just wanted to report, for
the record, that this particular problem seems to be solved. It doesn't
seem to fail quite as gracefully anymore, but I'm still getting enough
of the error messages to know what's going on.
MPI_ABORT was invoked on rank 52 in communicator MPI_COMM_WORLD
with errorcode 0.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],52] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],54] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],55] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],15] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],17] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],56] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],53] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],51] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],57] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
....
[cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],16] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpiexec has exited due to process rank 49 with PID 26187 on
node cn-099 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
CCTM_V5g_Linux2_x 00000000007FEA29 Unknown Unknown Unknown
CCTM_V5g_Linux2_x 00000000007FD3A0 Unknown Unknown Unknown
CCTM_V5g_Linux2_x 00000000007BA9A2 Unknown Unknown Unknown
CCTM_V5g_Linux2_x 0000000000759288 Unknown Unknown Unknown
...
On Wed, May 21, 2014 at 2:08 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
Hi Ben
My guess is that your sys admins may have built NetCDF
with parallel support, pnetcdf, and the latter with OpenMPI,
which could explain the dependency.
Ideally, they should have built it again with the latest default
OpenMPI (1.6.5?)
Check if there is a NetCDF module that either doesn't have any
dependence on MPI, or depends on the current Open MPI that
you are using (1.6.5 I think).
A 'module show netcdf/bla/bla'
on the available netcdf modules will tell.
If the application code is old as you said, it probably doesn't use
any pnetcdf. In addition, it should work even with NetCDF 3.X.Y,
which probably doesn't have any pnetcdf built in.
Newer netcdf (4.Z.W > 4.1.3) should also work, and in this case
pick one that requires the default OpenMPI, if available.
Just out of curiosity, besides netcdf/4.1.3, did you load openmpi/1.6.5?
Somehow the openmpi/1.6.5 should have been marked
to conflict with 1.4.4.
Is it?
Anyway, you may want to do a 'which mpiexec' to see which one is
taking precedence in your environment (1.6.5 or 1.4.4).
Probably 1.6.5.
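For instance (using the module names from your messages):

    module avail netcdf            # list the NetCDF modules installed
    module show netcdf/4.1.3       # look for a 'module load openmpi/...' line
    which mpiexec                  # which MPI comes first in your PATH
    mpiexec --version              # should report Open MPI 1.6.5 if that one wins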
Does the code work now, or does it continue to fail?
I hope this helps,
Gus Correa
On 05/21/2014 02:36 PM, Ben Lash wrote:
Yep, there it is.
[bl10@login2 USlogsminus10]$ module show netcdf/4.1.3
-------------------------------------------------------------------
/opt/apps/modulefiles/netcdf/4.1.3:
module load openmpi/1.4.4-intel
prepend-path PATH /opt/apps/netcdf/4.1.3/bin:/opt/apps/netcdf/4.1.3/deps/hdf5/1.8.7/bin
prepend-path LD_LIBRARY_PATH /opt/apps/netcdf/4.1.3/lib:/opt/apps/netcdf/4.1.3/deps/hdf5/1.8.7/lib:/opt/apps/netcdf/4.1.3/deps/szip/2.1/lib
prepend-path MANPATH /opt/apps/netcdf/4.1.3/share/man
-------------------------------------------------------------------
On Wed, May 21, 2014 at 1:34 PM, Douglas L Reeder <d...@centurylink.net> wrote:
Ben,
The netcdf/4.1.3 module may be loading the openmpi/1.4.4 module.
Can you do a 'module show' on the netcdf module file to see if
there is a 'module load openmpi' command?
Doug Reeder
On May 21, 2014, at 12:23 PM, Ben Lash <b...@rice.edu> wrote:
I just wanted to follow up for anyone else who got a similar
problem - module load netcdf/4.1.3 *also* loaded openmpi/1.4.4.
Don't ask me why. My code doesn't seem to fail as gracefully
but otherwise works now. Thanks.
On Sat, May 17, 2014 at 6:02 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
Ditto -- Lmod looks pretty cool. Thanks for the heads up.
On May 16, 2014, at 6:23 PM, Douglas L Reeder <d...@centurylink.net> wrote:
> Maxime,
>
> I was unaware of Lmod. Thanks for bringing it to my attention.
>
> Doug
> On May 16, 2014, at 4:07 PM, Maxime Boissonneault <maxime.boissonneault@calculquebec.ca> wrote:
>
>> Instead of using the outdated and unmaintained Module environment,
>> why not use Lmod: https://www.tacc.utexas.edu/tacc-projects/lmod
>>
>> It is a drop-in replacement for the Module environment that supports
>> all of its features and much, much more, such as:
>> - module hierarchies
>> - module properties and color highlighting (we use it to highlight
>> bioinformatics modules or tools, for example)
>> - module caching (very useful for a parallel filesystem with tons of modules)
>> - path priorities (useful to make sure personal modules take
>> precedence over system modules)
>> - export of the module tree to JSON
>>
>> It works like a charm, understands both TCL and Lua modules, and is
>> actively developed and debugged. There are literally new features
>> every month or so. If it does not do what you want, odds are that
>> the developer will add it shortly (I've had it happen).
>>
>> Maxime
>>
>> On 2014-05-16 17:58, Douglas L Reeder wrote:
>>> Ben,
>>>
>>> You might want to use module (from SourceForge) to manage paths to
>>> different MPI implementations. It is fairly easy to set up and very
>>> robust for this type of problem. You would remove contentious
>>> application paths from your standard PATH and then use module to
>>> switch them in and out as needed.
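>>> For example (the exact module names are a guess; use whatever your
>>> cluster actually provides):
>>>
>>>     module unload openmpi/1.4.4-intel   # drop the old MPI from PATH/LD_LIBRARY_PATH
>>>     module load openmpi/1.6.5           # switch to the new default
>>>     module list                         # confirm only one MPI module is loaded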
>>>
>>> Doug Reeder
>>> On May 16, 2014, at 3:39 PM, Ben Lash <b...@rice.edu> wrote:
>>>
>>>> My cluster has just upgraded to a new version of MPI, and I'm using
>>>> an old one. It seems that I'm having trouble compiling due to the
>>>> compiler wrapper file moving (full error here: http://pastebin.com/EmwRvCd9)
>>>> "Cannot open configuration file
>>>> /opt/apps/openmpi/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt"
>>>>
>>>> I've found the file on the cluster at
>>>> /opt/apps/openmpi/retired/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt
>>>> How do I tell the old MPI wrapper where this file is?
>>>> I've already corrected one link to mpich ->
>>>> /opt/apps/openmpi/retired/1.4.4-intel/, which is in the lib folder of
>>>> the software I'm trying to recompile (/home/bl10/CMAQv5.0.1/lib/x86_64/ifort).
>>>> Thanks for any ideas. I also tried changing $pkgdatadir based on what I read here:
>>>> http://www.open-mpi.org/faq/?category=mpi-apps#default-wrapper-compiler-flags
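>>>> (What I tried looked roughly like the following; whether 1.4.4
>>>> actually honors these OPAL_* environment overrides is just a guess
>>>> on my part:)
>>>>
>>>>     export OPAL_PREFIX=/opt/apps/openmpi/retired/1.4.4-intel
>>>>     # or, more narrowly, just the wrapper data directory:
>>>>     export OPAL_PKGDATADIR=/opt/apps/openmpi/retired/1.4.4-intel/share/openmpi
>>>>     mpif90 --showme                     # check whether the wrapper now finds its data file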
>>>>
>>>> Thanks.
>>>>
>>>> --Ben L
>>
>>
>> --
>> ---------------------------------
>> Maxime Boissonneault
>> Computing Analyst - Calcul Québec, Université Laval
>> Ph.D. in Physics
>>
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
--
--Ben L
--
--Ben L
--
--Ben L
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users