>> Is there any other information I could provide that might be useful?
>You might want to audit the code and ensure that you have no pending
>communications that haven't finished -- check all your sends and receives, not
>just in the code, but at run-time (e.g., use an MPI profiling tool to mat
pi.org] On Behalf
Of Reuti
Sent: Friday 20 April 2012 15:20
To: Open MPI Users
Subject: Re: [OMPI users] sge tight integration leads to bad allocation
On 20.04.2012 at 15:04, Eloi Gaudry wrote:
>
> Hi Ralph, Reuti,
>
> I've just observed the same issue without specifying -np.
Hi Ralph, Reuti,
I've just observed the same issue without specifying -np.
Please find attached the ps -elfax output from the computing nodes and some sge
related information.
Regards,
Eloi
-Original Message-
From: Ralph Castain
Sent: Wed 04-11-2012 02:25 pm
Subject: Re: [OMPI
> This might be of interest to Reuti and you : it seems that we cannot
> reproduce the problem anymore if we don't provide the "-np N" option on the
> orterun command line. Of course, we need to launch a few other runs to be
> really sure because the allocation error was not always observable. A
f
Of Ralph Castain
Sent: Tuesday 10 April 2012 16:43
To: Open MPI Users
Subject: Re: [OMPI users] sge tight integration leads to bad allocation
Could well be a bug in OMPI - I can take a look, though it may be awhile before
I get to it. Have you tried one of the 1.5 series releases?
On Apr 10, 201
Thx. This is the allocation which is also confirmed by the Open MPI output.
[eg: ] exactly, but not the one used afterwards by openmpi
- The application was compiled with the same version of Open MPI?
[eg: ] yes, version 1.4.4 for all
- Does the application start something on its own besides the
> - Can you please post while it's running the relevant lines from:
> ps -e f --cols=500
> (f w/o -) from both machines.
> It's allocated between the nodes more like in a round-robin fashion.
> [eg: ] I'll try to do this tomorrow, as soon as some slots become free.
> Thanks for your feedback Reuti
-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf
Of Reuti
Sent: Thursday 5 April 2012 18:41
To: Open MPI Users
Subject: Re: [OMPI users] sge tight integration leads to bad allocation
On 05.04.2012 at 17:55, Eloi Gaudry wrote:
>
> &
>> Here is the allocation info retrieved from `qstat -g t` for the related job:
>
> For me the output of `qstat -g t` shows MASTER and SLAVE entries but no
> variables. Is there any wrapper defined for `qstat` to reformat the output
> (or a ~/.sge_qstat defined)?
>
> [eg: ] sorry, i forgot abo
-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf
Of Reuti
Sent: Tuesday 3 April 2012 17:13
To: Open MPI Users
Subject: Re: [OMPI users] sge tight integration leads to bad allocation
On 03.04.2012 at 16:59, Eloi Gaudry wrote:
> Hi Re
Behalf
Of Reuti
Sent: Tuesday 3 April 2012 16:24
To: Open MPI Users
Subject: Re: [OMPI users] sge tight integration leads to bad allocation
Hi,
On 03.04.2012 at 16:12, Eloi Gaudry wrote:
> Thanks for your feedback.
> No, this is the other way around, the "reserved" slots on all nod
so the two slots on charlie are in error?
Sent from my iPad
On Apr 3, 2012, at 6:23 AM, "Eloi Gaudry" <eloi.gau...@fft.be> wrote:
Hi,
I’ve observed a strange behavior during rank allocation on a distributed run
scheduled and submitted using Sge (Son of Grid Engine 8.0.0d) and OpenMPI-1.4.4.
http://www.open-mpi.org/community/lists/users/2012/02/18399.php
In my case, the workaround was just to launch the app with mpiexec, and the
allocation is handled correctly.
---Tom
On 4/3/12 9:23 AM, "Eloi Gaudry" wrote:
Hi,
I've obs
Hi,
I've observed a strange behavior during rank allocation on a distributed run
scheduled and submitted using Sge (Son of Grid Engine 8.0.0d) and OpenMPI-1.4.4.
Briefly, there is a one-slot difference between allocated rank/slot for Sge and
OpenMPI. The issue here is that one node becomes over
ions, apart from checking the driver and
firmware levels. The consensus was that it would be better if you
could take this up directly with your IB vendor.
Regards
--Nysal
On Mon, Sep 27, 2010 at 8:14 PM, Eloi Gaudry <e...@fft.be> wrote:
Terry,
Please find enclosed the re
hi,
does anyone have a clue here ?
éloi
On 22/04/2011 08:52, Eloi Gaudry wrote:
it varies with the receive_queues specification *and* with the number
of mpi processes: memory_consumed = nb_mpi_process * nb_buffers *
(buffer_size + low_buffer_count_watermark + credit_window_size )
éloi
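For a rough sense of scale, plugging the P,65536,256,192,128 queue values quoted
earlier in this thread and 128 MPI processes into that formula gives the following
(an illustration only; the actual openib accounting may differ):

  # nb_mpi_process = 128, nb_buffers = 256, buffer_size = 65536,
  # low_buffer_count_watermark = 192, credit_window_size = 128
  echo $(( 128 * 256 * (65536 + 192 + 128) ))   # 2157969408 bytes, i.e. roughly 2 GB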
On
receive_queues specification?
On Apr 19, 2011, at 9:03 AM, Eloi Gaudry wrote:
hello,
i would like to get your input on this:
when launching a parallel computation on 128 nodes using openib and the "-mca
btl_openib_receive_queues P,65536,256,192,128" option, i observe a rather large
PI-1.4.2, built with gcc-4.3.4 and
'--enable-cxx-exceptions --with-pic --with-threads=posix' options.
thanks for your help,
éloi
--
Eloi Gaudry
Senior Product Development Engineer
Free Field Technologies
Company Website: http://www.fft.be
Direct Phone Number: +32 10 495 147
7 AM, Eloi Gaudry wrote:
hi,
i'd like to know if someone had a chance to look at the issue I
reported.
thanks and happy new year !
éloi
On 12/21/2010 10:58 AM, Eloi Gaudry wrote:
hi,
when launching a parallel computation on 128 nodes using openib and
the "-mca btl_ope
hi,
i'd like to know if someone had a chance to look at the issue I reported.
thanks and happy new year !
éloi
On 12/21/2010 10:58 AM, Eloi Gaudry wrote:
hi,
when launching a parallel computation on 128 nodes using openib and
the "-mca btl_openib_receive_queues P,65536,256,192,1
't use that amount of memory
- all other processes (i.e. located on any other nodes) don't either
i'm using OpenMPI-1.4.2, built with gcc-4.3.4 and
'--enable-cxx-exceptions --with-pic --with-threads=posix' options. the
cluster is based on eight-core nodes using mellanox hca.
e to disable eager rdma.
Regards,
Pasha
On Sep 29, 2010, at 1:04 PM, Terry Dontje wrote:
Pasha, do you by any chance know who at Mellanox might be responsible for OMPI
working?
--td
Eloi Gaudry wrote:
Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. D
Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. Do you know any Mellanox developer I may
discuss with, preferably someone who has spent some time inside the
openib btl ?
Regards,
Eloi
On 29/09/2010 06:01, Nysal Jan wrote:
Hi Eloi,
We discussed this issue durin
ely from your last email I think it will still all have
> non-zero values.
> If that ends up being the case then there must be something odd with the
> descriptor pointer to the fragment.
>
> --td
>
> Eloi Gaudry wrote:
> > Terry,
> >
> > Please
btl/openib/btl_ope
> nib_endpoint.h#548
>
> --td
>
> Eloi Gaudry wrote:
> > Hi Terry,
> >
> > Do you have any patch that I could apply to be able to do so ? I'm
> > remotely working on a cluster (with a terminal) and I cannot use any
> > parallel debugg
e coalescing is not your issue and that the problem has
> something to do with the queue sizes. It would be helpful if we could
> detect the hdr->tag == 0 issue on the sending side and get at least a
> stack trace. There is something really odd going on here.
>
> --td
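One low-tech way to get such a trace is a conditional breakpoint in gdb (a sketch
only: the file and line number are assumptions based on the 1.4.2 source listing
quoted elsewhere in this thread, and attaching to the right rank is left to the
reader):

  gdb -p <pid_of_suspect_rank>
  (gdb) break btl_openib_component.c:2881 if hdr->tag == 0
  (gdb) continue
  (gdb) backtrace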
>
> El
I've already tried to write something but I
haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
Eloi
On 24/09/2010 18:37, Terry Dontje wrote:
Eloi Gaudry wrote:
Terry,
You were right, the error indeed seems to come from the message coalescing
feature.
If I tu
lar error
(https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all
closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352
that might be related), aren't they ? What would you suggest Terry ?
Eloi
On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
> Eloi Gau
s other than the default and the one you mention.
>
> I wonder if you did a combination of the two receive queues causes a
> failure or not. Something like
>
> P,128,256,192,128:P,65536,256,192,128
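For reference, a combined spec like that would be passed on the command line
roughly as follows (a sketch of the experiment suggested above, not a recommended
setting; host names and the application are placeholders):

  mpirun -np 2 --host node01,node02 \
      --mca btl openib,self \
      --mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128 \
      ./my_app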
>
> I am wondering if it is the first queuing definition causing the issue or
job
> it is? Does it always fail on the same bcast, or same process?
>
> Eloi Gaudry wrote:
> > Hi Nysal,
> >
> > Thanks for your suggestions.
> >
> > I'm now able to get the checksum computed and redirected to stdout,
> > thanks (I forgo
ver called because the hdr->tag is invalid. So
> enabling checksum tracing also might not be of much use. Is it the first
> Bcast that fails or the nth Bcast and what is the message size? I'm not
> sure what could be the problem at this moment. I'm afraid you will have to
> de
an try using it to see if it is able to
> catch anything.
>
> Regards
> --Nysal
>
> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry wrote:
> > Hi Nysal,
> >
> > I'm sorry to interrupt, but I was wondering if you had a chance to look at
> > this error.
Hi,
I was wondering if anybody got a chance to have a look at this issue.
Regards,
Eloi
On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> Hi Jeff,
>
> Please find enclosed the output (valgrind.out.gz) from
> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn
Hi Jeff,
here is the valgrind output when using OpenMPI -1.5rc5, just in case.
Thanks,
Eloi
On Wednesday 18 August 2010 23:01:49 Jeff Squyres wrote:
> On Aug 17, 2010, at 12:32 AM, Eloi Gaudry wrote:
> > would it help if i use the upcoming 1.5 version of openmpi ? i read that
> >
--suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
valgrind.supp --suppressions=./suppressions.python.supp
/opt/actran/bin/actranpy_mp ...
Thanks,
Eloi
On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > On Aug
our application? The
> openib BTL is not yet thread safe in the 1.4 release series. There have
> been improvements to openib BTL thread safety in 1.5, but it is still not
> officially supported.
>
> --Nysal
>
> On Tue, Aug 17, 2010 at 1:06 PM, Eloi Gaudry wrote:
> &g
> So hdr->tag should be a value >= 65
> Since the tag is incorrect you are not getting the proper callback function
> pointer and hence the SEGV.
> I'm not sure at this point as to why you are getting an invalid/corrupt
> message header ?
>
> --Nysal
>
> On Tu
On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > I did run our application through valgrind but it couldn't find any
> > "Invalid write": there is a bunch of "Invalid read" (I'm using
ack trace looks like you're
calling through python, but can you run this application through valgrind, or
some other memory-checking debugger?
On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
Hi,
sorry, i just forgot to add the values of the function parameters:
(gdb) print reg->
3f4110,
btl_register_error = 0x2b341eb90565 ,
btl_ft_event = 0x2b341eb952e7 }
(gdb) print hdr->tag
$3 = 0 '\0'
(gdb) print des
$4 = (mca_btl_base_descriptor_t *) 0xf4a6700
(gdb) print reg->cbfunc
$5 = (mca_btl_base_module_recv_cb_fn_t) 0
Eloi
On Tuesday 10 August 2010 16:04:08 E
;tag, des, reg->cbdata );
2882        if (MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
2883            cqp = (hdr->credits >> 11) & 0x0f;
2884            hdr->credits &= 0x87ff;
2885        } else {
Regards,
Eloi
On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
> Hi Edgar
> On Mon, Aug 9, 2010 at 5:22 PM, Eloi Gaudry wrote:
> > Hi,
> >
> > Could someone have a look on these two different error messages ? I'd
> > like to know the reason(s) why they were displayed and their actual
> > meaning.
> >
> > Thanks,
&
Hi,
Could someone have a look on these two different error messages ? I'd like to
know the reason(s) why they were displayed and their actual meaning.
Thanks,
Eloi
On Monday 19 July 2010 16:38:57 Eloi Gaudry wrote:
> Hi,
>
> I've been working on a random segmentation fault
QP_ACCESS_ERR)
This error may indicate connectivity problems within the fabric; please contact
your system administrator.
--
I'd like to know what these two errors mean and where they come from.
Thanks for your help
ue is not somehow limited
to the tuned collective routines.
Thanks,
Eloi
On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > hi edgar,
> >
> > thanks for the tips, I'm gonna try this option as well. the segmentati
oblem in the openib btl triggered from the tuned
> collective component, in cases where the ofed libraries were installed
> but no HCA was found on a node. It used to work, however, with the basic
> component.
>
> Thanks
> Edgar
>
> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
&
ferent algorithms that can
> > be selected for the various collectives.
> > Therefore, you need this:
> >
> > --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> >
> > Rolf
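A full invocation using those two flags would look roughly like this (a sketch
only; process count, hostfile, and application name are placeholders):

  mpirun -np 16 --hostfile my_hosts \
      --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_bcast_algorithm 1 \
      ./my_app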
> >
> > On 07/13/10 11:28, Eloi Gaudry wrote:
> > > Hi,
> &
rules 1 --mca coll_tuned_bcast_algorithm 1
>
> Rolf
>
> On 07/13/10 11:28, Eloi Gaudry wrote:
> > Hi,
> >
> > I've found that "--mca coll_tuned_bcast_algorithm 1" allowed me to switch to
> > the basic linear algorithm. Anyway whatever the alg
nday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> Hi,
>
> I'm focusing on the MPI_Bcast routine that seems to randomly segfault when
> using the openib btl. I'd like to know if there is any way to make OpenMPI
> switch to a different algorithm than the default one being s
2010 11:06:52 Eloi Gaudry wrote:
> Hi,
>
> I'm observing a random segmentation fault during an internode parallel
> computation involving the openib btl and OpenMPI-1.4.2 (the same issue
> can be observed with OpenMPI-1.3.3).
>mpirun (Open MPI) 1.4.2
>Report bugs
.list --mca
btl self,sm,tcp --display-map --verbose --version --mca
mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
Thanks,
Eloi
--
Eloi Gaudry
Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM
Company Phone: +32 10 487 959
C
:
> valgrind is installed, and worked with Open MPI 1.4.1.
>
> 2010/6/22 Eloi Gaudry :
> > Hi Michele,
> >
> > You may actually need to have gdb/valgrind installed before configuring
> > and building OpenMPI with the --enable-memchecker option.
> >
> > Regards,
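A configure line along those lines might look as follows (a sketch only; the
install prefix and valgrind location are placeholders, and valgrind must already
be installed, as noted above):

  ./configure --prefix=/opt/openmpi-debug-1.4.2 \
      --enable-debug --enable-memchecker \
      --with-valgrind=/usr/local
  make && make install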
ERROR_LOG: Not found
> in file ../../../../orte/tools/orterun/orterun.c at line 543
>
>
> It seems that the memchecker does not work, because after
> reconfiguring without "--enable-memchecker" and rebuilding, I don't
> receive the same error anymore.
>
> May any
Hi Reuti,
I've been unable to reproduce the issue so far.
Sorry for the inconvenience,
Eloi
On Tuesday 25 May 2010 11:32:44 Reuti wrote:
> Hi,
>
> On 25.05.2010 at 09:14, Eloi Gaudry wrote:
> > I do not reset any environment variable during job submission or job
> > h
May 2010 17:35:24 Reuti wrote:
> Hi,
>
> On 21.05.2010 at 17:19, Eloi Gaudry wrote:
> > Hi Reuti,
> >
> > Yes, the openmpi binaries used were build after having used the
> > --with-sge during configure, and we only use those binaries on our
> > cluster.
v2.0, API v2.0, Component v1.3.3)
MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
Regards,
Eloi
On Friday 21 May 2010 16:01:54 Reuti wrote:
> Hi,
>
> On 21.05.2010 at 14:11, Eloi Gaudry wrote:
> > Hi there,
> >
> > I'm observing something s
the different command
line options.
Any help would be appreciated,
Thanks,
Eloi
--
Eloi Gaudry
Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM
Company Phone: +32 10 487 959
Company Fax: +32 10 454 626
Hi,
FYI, this issue is solved with the latest version of the library
(v2-1.11), at least on my side.
Eloi
Gus Correa wrote:
Hi Dorian
Dorian Krause wrote:
Hi,
@Gus I don't use any flags for the installed OpenMPI version. In fact
for this mail I used an OpenMPI version just installed with t
I hope this helps.
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-----
Eloi Gaudry wrote:
Dorian Krause wrote:
Hi Eloi,
Does the seg
des, NY, 10964-8000 - USA
---------
Eloi Gaudry wrote:
Dorian Krause wrote:
Hi Eloi,
Does the segmentation faults you're facing also happen in a
sequential environment (i.e. not linked against openmpi libraries) ?
No, without MPI everything works fine.
Dorian Krause wrote:
Hi Eloi,
Does the segmentation faults you're facing also happen in a
sequential environment (i.e. not linked against openmpi libraries) ?
No, without MPI everything works fine. Also, linking against mvapich
doesn't give any errors. I think there is a problem with GotoBL
Dorian Krause wrote:
Hi,
has anyone successfully combined OpenMPI and GotoBLAS2? I'm facing
segfaults in any program which combines the two libraries (as shared
libs). The segmentation fault seems to occur in MPI_Init(). The gdb
backtrace is
Program received signal SIGSEGV, Segmentation fa
This is what I did (created /opt/sge/tmp/test by hand on an execution
host, logged in as a regular cluster user).
Eloi
On 11/11/2009 00:26, Reuti wrote:
To avoid misunderstandings:
On 11.11.2009 at 00:19, Eloi Gaudry wrote:
On any execution node, creating a subdirectory of /opt/sge/tmp (i.e
sge got nobody/nogroup as owner.
Eloi
On 11/11/2009 00:14, Reuti wrote:
On 11.11.2009 at 00:03, Eloi Gaudry wrote:
The user/group used to generate the temporary directories was
nobody/nogroup, when using a shared $tmpdir.
Now that I'm using a local $tmpdir (one for each node
nMPI could fail
when using such a configuration (i.e. with a shared "tmpdir").
Eloi
On 10/11/2009 19:17, Eloi Gaudry wrote:
Reuti,
The ACLs here were just added when I tried to force the /opt/sge/tmp
subdirectories to be 777 (which I did when I first encountered the
error of sub
stead of a shared one for "tmpdir".
But as this issue seems somehow related to permissions, I don't know if
this would eventually be the right solution.
Thanks for your help,
Eloi
Reuti wrote:
Hi,
On 10.11.2009 at 19:01, Eloi Gaudry wrote:
Reuti,
I'm using "tmpdi
gs/bin/true
stop_proc_args /bin/true
allocation_rule    $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
Thanks for your help,
Eloi
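For completeness, the parallel environment definition listed above is the kind of
output produced by `qconf -sp <pe_name>`; it can be inspected and edited with
commands like these (the PE name is a placeholder):

  qconf -sp my_openmpi_pe          # show the PE definition, as listed above
  qconf -mp my_openmpi_pe          # edit it, e.g. allocation_rule or control_slaves
  qconf -sq all.q | grep pe_list   # check which queue(s) reference the PE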
Reuti wrote:
On 10.11.2009 at 18:20, Eloi Gaudry wrote:
Thanks for your help Reuti,
I'm
inside (as OpenMPI won't use
nobody:nogroup credentials).
As Ralph suggested, I checked the SGE configuration, but I haven't found
anything related to nobody:nogroup configuration so far.
Eloi
Reuti wrote:
Hi,
On 10.11.2009 at 17:55, Eloi Gaudry wrote:
Thanks for your help Ralp
e - check "mpirun -h", or ompi_info
for the required option.
But I would first check your SGE config as that just doesn't sound right.
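One quick way to verify that the Open MPI build in use actually has SGE support
compiled in (a sketch; exact component names vary between versions):

  ompi_info | grep -i gridengine
  # an SGE-enabled build typically lists a gridengine ras component here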
On Nov 10, 2009, at 9:40 AM, Eloi Gaudry wrote:
Hi there,
I'm experiencing some issues using GE6.2U4 and OpenMPI-1.3.3 (with
gridengin
ny solution was found.
Thanks for your help,
Eloi
--
Eloi Gaudry
Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM
Company Phone: +32 10 487 959
Company Fax: +32 10 454 626
ption when
compiling (configuring) OpenMPI to prevent such issues. Is there any
extensive doc. about this specific option ? Should I be using something
else when building OpenMPI ?
Thanks for your help,
Eloi
--
Eloi Gaudry
Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui,
nary called MPI_init (assuming
it was the method redefined in the fake_mpi library), it was actually
calling the MPI_init method from the openmpi library.
Thanks for your quick reply, Jeff,
Eloi
Jeff Squyres wrote:
On Jul 23, 2008, at 8:33 AM, Eloi Gaudry wrote:
I've been encountering
Hi there,
I've been encountering some issues with openmpi on a linux-ia64 platform
(centos-4.6 with gcc-4.3.1) within a call to MPI_Query_thread (in a fake
single process run):
An error occurred in MPI_Query_thread
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
I'd like to