Re: [OMPI users] running with the dr pml.

2006-12-07 Thread Scott Atchley
George, Using DR was suggested to see if it could find an error. The original problem was using OB1, and HPL gave failed residuals. The hope was that DR would pinpoint any problems. It did not, and HPL did not progress at all (the GM counters incremented, but no tests were completed succes

Re: [OMPI users] running with the dr pml.

2006-12-07 Thread Brock Palen
On Dec 7, 2006, at 3:14 PM, George Bosilca wrote: On Dec 7, 2006, at 2:45 PM, Brock Palen wrote: $ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm and HPL passes. The problem seems to be in the RDMA fragmenting code on OSX. The bound

Re: [OMPI users] running with the dr pml.

2006-12-07 Thread George Bosilca
On Dec 7, 2006, at 2:45 PM, Brock Palen wrote: $ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm and HPL passes. The problem seems to be in the RDMA fragmenting code on OSX. The boundary values at the edges of the fragments are not

Re: [OMPI users] running with the dr pml.

2006-12-07 Thread Brock Palen
There were two issues here; one led to finding the other. The OB1 PML works just fine on OSX on PPC64. The DR PML does not work: there is no output to STDOUT, and while you can see the application's threads in 'top', no progress is ever made in running the application. The original problem stems

Re: [OMPI users] running with the dr pml.

2006-12-07 Thread George Bosilca
Something is not clear to me in this discussion. Sometimes the subject was the DR PML and sometimes the OB1 PML. In fact I'm completely in the dark ... Which PML fails the HPCC test on Mac? When I look at the command line it looks like it should be OB1, not DR ... george. On Dec 7, 2006

Re: [OMPI users] running with the dr pml.

2006-12-07 Thread Brock Palen
That is wonderful; that fixes the observed problem for running with OB1. Has a bug for this been filed to get RDMA working on Macs? The only working MPI lib is MPICH-GM, as this problem happens with LAM-7.1.3 also. So on track for one bug. Would the person working on the DR PML like m

Re: [OMPI users] running with the dr pml.

2006-12-07 Thread Scott Atchley
On Dec 6, 2006, at 3:09 PM, Scott Atchley wrote: Brock and Galen, We are willing to assist. Our best guess is that OMPI is using the code in a way different than MPICH-GM does. One of our other developers who is more comfortable with the GM API is looking into it. We tried running with HPCC w

Re: [OMPI users] running with the dr pml.

2006-12-06 Thread Scott Atchley
On Dec 6, 2006, at 2:29 PM, Brock Palen wrote: I wonder if we can narrow this down a bit to perhaps a PML protocol issue. Start by disabling RDMA by using: -mca btl_gm_flags 1 On the other hand, with OB1 using btl_gm_flags 1 fixed the error problem with OMPI! Which is a great first step.

Re: [OMPI users] running with the dr pml.

2006-12-06 Thread Brock Palen
I wonder if we can narrow this down a bit to perhaps a PML protocol issue. Start by disabling RDMA by using: -mca btl_gm_flags 1 This helps some; I at least now see the startup of HPL, but I never get a single pass, output ends at: - Computational tests pass if scaled residuals are less

Re: [OMPI users] running with the dr pml.

2006-12-06 Thread Galen Shipman
The problem is that, when running HPL, he sees failed residuals. When running HPL under MPICH-GM, he does not. I have tried running HPCC (HPL plus other benchmarks) using OMPI with GM on 32-bit Xeons and 64-bit Opterons. I do not see any failed residuals. I am trying to get access to a couple of

Re: [OMPI users] running with the dr pml.

2006-12-06 Thread Brock Palen
Are there any gotchas in using the dr pml? Also, if the dr pml is finding errors and is resending data, can I have it tell me when that happens? Like a verbose mode? Unfortunately you will need to update the source and recompile, try: Updating this file: topdir/ompi/mca/pml/dr/pml_dr.h:245:

Re: [OMPI users] running with the dr pml.

2006-12-05 Thread Scott Atchley
On Dec 5, 2006, at 6:15 PM, Galen M. Shipman wrote: Brock Palen wrote: I was asked by Myricom to run a test using the data reliability (dr) pml. I ran it like so: $ mpirun --mca pml dr -np 4 ./xhpl Is this the right format for running the dr pml? This should be fine, yes. I can run H

Re: [OMPI users] running with the dr pml.

2006-12-05 Thread Galen M. Shipman
Brock Palen wrote: I was asked by Myricom to run a test using the data reliability (dr) pml. I ran it like so: $ mpirun --mca pml dr -np 4 ./xhpl Is this the right format for running the dr pml? This should be fine, yes. I can run HPL on our test cluster to see if something is wr

[OMPI users] running with the dr pml.

2006-12-05 Thread Brock Palen
I was asked by Myricom to run a test using the data reliability (dr) pml. I ran it like so: $ mpirun --mca pml dr -np 4 ./xhpl Is this the right format for running the dr pml? Also, it has been running for a long time but has produced no output. The counters on the gm card are incrementing, (no