Hi.
We've run into an I/O issue with 1.4.1 and earlier versions. We're able to
reproduce the issue in around 120 lines of code. To help, I'd like to find out
whether there's something we're simply doing incorrectly with the build or
whether it's in fact a known bug. I've included the following, in order:
1. Co
We don't have anything similar in OMPI. There are fault tolerance modes, but
not like the one you describe.
On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote:
> Hi,
>
> I have implemented a simple fault tolerant ping pong C program with MPI,
> here: http://pastebin.com/7mtmQH2q
>
> MPICH2 offers
The two are synonyms for each other: they resolve to the same variable, so
there isn't anything different about them.
Not sure what the issue might be, but I would check for a typo - we don't check
that mca params are spelled correctly, nor do we check for params that don't
exist (e.g., b
Hi,
I have implemented a simple fault tolerant ping pong C program with MPI,
here: http://pastebin.com/7mtmQH2q
MPICH2 offers a parameter with mpiexec:
$ mpiexec -disable-auto-cleanup
.. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421
It is fault tolerant in the respec
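Not the poster's program (that's on the pastebin link), but a rough sketch of the application-side half of this approach: switch MPI_COMM_WORLD to MPI_ERRORS_RETURN and check return codes. Whether a dead peer then actually surfaces as an error, rather than the whole job being torn down, is up to the launcher (which is what -disable-auto-cleanup controls in MPICH2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, rc;
    char buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask MPI to return error codes instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 1) {
        /* Simulate a crashing peer: exit without calling MPI_Finalize. */
        return 1;
    }

    if (rank == 0) {
        rc = MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS)
            printf("rank 0: peer failed, carrying on\n");
    }

    MPI_Finalize();
    return 0;
}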
On Mon, 12 Sep 2011, Blosch, Edwin L wrote:
It was set to 0 previously. We've set it to 4 and restarted some service and
now it works. So both your and Samuel's suggestions worked.
On another system, slightly older, it was defaulted to 3 instead of 0, and
apparently that explains why the j
It was set to 0 previously. We've set it to 4 and restarted some service and
now it works. So both your and Samuel's suggestions worked.
On another system, slightly older, it was defaulted to 3 instead of 0, and
apparently that explains why the job always ran before and on this newer system
I have a hello world program that runs without prompting for a password with
plm_rsh_agent, but not with orte_rsh_agent; I mean, it runs, but only after
prompting for a password:
/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent
/usr/bin/rsh ./test_setup
Hello from process
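For comparison, the invocation that ends up prompting for a password presumably differs only in the parameter name, i.e. something like:

/bin/mpirun --machinefile mpihosts.dat -np 16 -mca orte_rsh_agent /usr/bin/rsh ./test_setup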
Hello all,
I recently successfully compiled Open MPI 1.5.4 with Visual Studio
2008 for the 32-bit platform. Because of some adaptations (yet to be
added in) I cannot use the provided binary release.
For initial testing I also compiled the Hello World example code
(hello_cxx.cc). The program works
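For what it's worth, a minimal C-bindings hello world (roughly what the shipped example does, not a copy of hello_cxx.cc) is enough for this kind of smoke test:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}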
On Sep 12, 2011, at 10:16 AM, Blosch, Edwin L wrote:
> Samuel,
>
> This worked.
Great!
> Did this magic line disable the use of per-peer queue pairs?
Yes, it sure did.
> I have seen a previous post by Jeff that explains what this line does
> generally, but I didn’t study the post in deta
Great! We'll get that in the next OMPI v1.5.x release.
On Sep 12, 2011, at 2:23 PM, Kaizaad Bilimorya wrote:
>
> On Fri, 9 Sep 2011, Brice Goglin wrote:
>
>> This looks like the exact same issue. Did you try the patch(es) I sent
>> earlier?
>> See http://www.open-mpi.org/community/lists/users/
On Fri, 9 Sep 2011, Brice Goglin wrote:
This looks like the exact same issue. Did you try the patch(es) I sent
earlier?
See http://www.open-mpi.org/community/lists/users/2011/09/17159.php
If it's not enough, try adding the other patch from
http://www.open-mpi.org/community/lists/users/2011/09/1
On Sep 12, 2011, at 12:39 PM, Shamis, Pavel wrote:
> OMPI Developers:
>
> Maybe we should consider disabling the use of per-peer queue pairs by
> default. Do they buy us anything? For what it is worth, we have stopped
> using them on all of our large systems here at LANL.
>
> It is cons-and-
On Mon, 12 Sep 2011, Blosch, Edwin L wrote:
Nathan, I found these parameters under /sys/module/mlx4_core/parameters. How
do you incorporate a changed value? What do I need to restart/rebuild?
Forgot to say that you will need to reload the mlx4_core module by either
rebooting or unloading/reloadi
Actually we were already aware of this FAQ and already have the limits set to
hard and soft unlimited in the PAM limits.conf as well as in the pbs_mom
resource manager startup script. We encountered those issues a few years ago
and definitely are aware of having process limits set too low. I d
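For anyone chasing the same thing: the FAQ fix amounts to raising the locked-memory (memlock) limit. Assuming the standard PAM limits.conf syntax, the entries would look something like:

*  soft  memlock  unlimited
*  hard  memlock  unlimited

and a quick sanity check from inside a batch job is:

$ ulimit -l
unlimited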
On Mon, 12 Sep 2011, Blosch, Edwin L wrote:
Nathan, I found these parameters under /sys/module/mlx4_core/parameters. How
do you incorporate a changed value? What do I need to restart/rebuild?
Add the following line to /etc/modprobe (replace X with the appropriate value
for log_mtts_per_seg):
opti
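Presumably the truncated line above is the standard modprobe options syntax; as an assumption, with X being the chosen value, it would read something like:

options mlx4_core log_mtts_per_seg=X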
Nathan, I found these parameters under /sys/module/mlx4_core/parameters. How
do you incorporate a changed value? What do I need to restart/rebuild?
-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf
Of Nathan Hjelm
Sent: Monday, September 12, 2011
An alternative solution to the problem is updating your memory limits.
Please see below:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Apparently your memory limit is low and the driver fails to create QPs
What happens when you add the following to your mpirun command?
-mca btl
Samuel,
This worked. Did this magic line disable the use of per-peer queue pairs? I
have seen a previous post by Jeff that explains what this line does generally,
but I didn't study the post in detail, so if you could provide a little
explanation I would appreciate it.
Ed
From: users-boun
FWIW, the default for the ib_timeout is 20 in both v1.4.x and v1.5.x.
As Ralph said, ompi_info will show the current value -- not the default value.
Of course, the current value will be the default value, unless it has been
overridden. In OMPI v1.5, ompi_info should indicate where the value ca
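As a concrete example of that check (a sketch; exact output varies by version), asking ompi_info for the openib parameters and filtering for the timeout looks roughly like:

$ ompi_info --param btl openib | grep ib_timeout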
I met a similar problem, possibly related to QP memory allocation. I ran a
768-process allgather with a 1 MB message size, with by-node binding (forcing
the edges of the tuned ring algorithm across IB links every time). The IMB
test hung there for more than 3 hours without any output. I don't know
whe
I also recommend checking the log_mtts_per_seg parameter of the mlx4 module.
This parameter controls how much memory can be registered for use by the mlx4
driver, and it should be in the range 1-5 (or 0-7, depending on the version of
the mlx4 driver). I recommend the parameter be set such that y
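As a rough worked example, assuming the commonly cited formula max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size and a 4 KB page: log_num_mtt=20 with log_mtts_per_seg=4 gives 2^20 * 2^4 * 4096 bytes, i.e. about 64 GB of registerable memory. The exact parameter names and ranges vary with the mlx4 driver version, so treat this as an illustration rather than a recipe.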
Hi,
This problem can be caused by a variety of things, but I suspect our default
queue pair (QP) parameters aren't helping the situation :-).
What happens when you add the following to your mpirun command?
-mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12
OMPI Developers:
Mayb
I am getting this error message below and I don't know what it means or how to
fix it. It only happens when I run on a large number of processes, e.g. 960.
Things work fine on 480, and I don't think the application has a bug. Any help
is appreciated...
[c1n01][[30697,1],3][connect/btl_openi
>
> * btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 10). The actual timeout value used is calculated as:
>
Actually I'm surprised that defaul
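For reference, the calculation cut off above is, if memory serves, the InfiniBand local ACK timeout of 4.096 microseconds * 2^btl_openib_ib_timeout: a value of 10 works out to roughly 4.2 ms per retry, and the 1.4/1.5 default of 20 to roughly 4.3 seconds.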
This usually means that you have a memory error of some kind in your
application.
Have you tried running your application through a memory-checking debugger,
such as valgrind?
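A sketch of what that looks like in practice (the application name here is just a placeholder):

$ mpirun -np 4 valgrind --leak-check=full ./your_app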
On Sep 5, 2011, at 3:48 AM, Jai Dayal wrote:
> Hi all,
> I've been beating my head on this for quite a while now.
This means that you have some problem on that node,
and it's probably unrelated to Open MPI.
Bad cable? Bad port? FW/driver in some bad state?
Do other IB performance tests work OK on this node?
Try rebooting the node.
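For the "other IB performance tests" suggestion, the usual quick checks (assuming the standard OFED/perftest tools are installed) are along the lines of:

$ ibstat              # port state, rate, link up/down
$ ibv_devinfo         # HCA and firmware details
$ ib_write_bw <peer>  # point-to-point bandwidth against another node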
-- YK
On 12-Sep-11 7:52 AM, Ahsan Ali wrote:
> Hello all
>
> I am getting fol
I ask because those are set via MCA param. So ompi_info would show the
"default" if the param isn't set in the environment or param file, but the app
could see something different if you set the param on the mpirun cmd line.
Those are the default values, but it looks like the MCA param is being
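As an illustration (using btl_openib_ib_timeout purely as an example parameter), the same value can come from the mpirun command line, the environment, or a param file, and the three can disagree:

$ mpirun -mca btl_openib_ib_timeout 20 ...
$ export OMPI_MCA_btl_openib_ib_timeout=20
$ echo "btl_openib_ib_timeout = 20" >> $HOME/.openmpi/mca-params.conf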
Thank you: this is very enlightening.
I will try this and let you know...
Ghislain.
On Sep 9, 2011, at 6:00 PM, Eugene Loh wrote:
>
>
> On 9/8/2011 11:47 AM, Ghislain Lartigue wrote:
>> I guess you're perfectly right!
>> I will try to test it tomorrow by putting a call system("wait(X)) befor t
Hello all
I am getting the following error during an application run, which causes it to
crash.
*[[36944,1],41][btl_openib_component.c:3227:handle_wc] from
compute-01-19.private.dns.zone to: compute-01-04 error polling LP CQ with
status RETRY EXCEEDED ERROR status number 12 for wr_id 167703304 opcode