I will try to prepare a test case.
--
Anton Starikov.
On May 12, 2009, at 6:57 PM, Edgar Gabriel wrote:
Hm, so I am out of ideas. I created multiple variants of test
programs that did basically what you described, and they all passed
without generating problems. I compiled the MUMPS library and ran
the tests in its examples directory, and they all worked.
Additionally, I checked the Open MPI source code. In comm_dup
there is only a single location where we raise the error
MPI_ERR_INTERN (which was reported in your email). I am fairly
positive that this cannot occur, or else we would segfault prior to
that (it is a stupid check, don't ask). Furthermore, the code
segment that has been modified does not raise MPI_ERR_INTERN
anywhere. Of course, it could be a secondary effect: the error
could be created somewhere else (PML_ADD or collective module
selection), with comm_dup just passing the error code up.
One way or the other, I need more hints on what the code does. Any
chance of getting a smaller code fragment which replicates the
problem? It could use the MUMPS library, I am fine with that since I
just compiled and installed it with the current ompi trunk...
Thanks
Edgar
Edgar Gabriel wrote:
I would say the probability is high that it is due to the recent
'fix'. I will try to create a test case similar to what you
suggested. Could you maybe give us some hints on which
functionality of MUMPS you are using, or even share the code or a
code fragment?
Thanks
Edgar
Jeff Squyres wrote:
Hey Edgar --
Could this have anything to do with your recent fixes?
On May 12, 2009, at 8:30 AM, Anton Starikov wrote:
The hostfile comes from the torque PBS_NODEFILE (OMPI is compiled
with torque support).
It happens with or without rankfile.
Started with
mpirun -np 16 ./somecode
mca parameters:
btl = self,sm,openib
mpi_maffinity_alone = 1
rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
doesn't change it)
I tested with both "btl=self,sm" on 16-core nodes and
"btl=self,sm,openib" on 8x dual-core nodes; the result is the same.
It looks like it always occurs at exactly the same point in the
execution, not at the beginning; it is not the first MPI_Comm_dup in
the code.
I can't say much about the particular piece of code where it is
happening, because it is in a 3rd-party library (MUMPS). When the
error occurs, MPI_Comm_dup in every task is dealing with a
single-task communicator (an MPI_Comm_split of the initial
MPI_COMM_WORLD for 16 processes into 16 groups, 1 process per
group). My guess is that before this error, MPI_Comm_dup is called
something like 100 times by the same piece of code on the same
communicators without any problems.
I can say that it used to work correctly with all previous versions
of Open MPI we used (1.2.8-1.3.2 and some earlier versions). It also
works correctly on other platforms/MPI implementations.
All environment variables (PATH, LD_LIBRARY_PATH) are correct.
I recompiled the code and the 3rd-party libraries with this version
of OMPI.
<config.log.gz><ompi-info.txt.gz><ATT9775601.txt><ATT9775603.txt>
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335