[OMPI users] OpenMPI killed by signal 9
Dear All: I ran a parallel job on 6 nodes of an Open MPI cluster, but I got this error:

rank 0 in job 82 system.cluster_37948 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

It seems that there is a segmentation fault on node 0. But if the program runs only for a short time, there is no problem. Any help is appreciated. thanks, Jack July 22 2010
[OMPI users] OpenMPI Segmentation fault (11)
Dear All, I run 6 parallel processes on Open MPI. When the run time of the program is short, it works well. But if the run time is long, I get these errors:

[n124:45521] *** Process received signal ***
[n124:45521] Signal: Segmentation fault (11)
[n124:45521] Signal code: Address not mapped (1)
[n124:45521] Failing at address: 0x44
[n124:45521] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
[n124:45521] [ 1] /lib64/libc.so.6(strlen+0x10) [0x3c50278d60]
[n124:45521] [ 2] /lib64/libc.so.6(_IO_vfprintf+0x4479) [0x3c50246b19]
[n124:45521] [ 3] /lib64/libc.so.6(_IO_printf+0x9a) [0x3c5024d3aa]
[n124:45521] [ 4] /home/path/exec [0x40ec9a]
[n124:45521] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974]
[n124:45521] [ 6] /home/path/exec [0x401139]
[n124:45521] *** End of error message ***

It seems there may be some problem with memory management, but I cannot find the reason. My program needs to write results to some files. If I open too many files without closing them, I may get the above errors. But I have removed the file writing from my program, and the problem appears again when the program runs for a longer time. Any help is appreciated. Jack July 25 2010
Re: [OMPI users] OpenMPI Segmentation fault (11)
Thanks. Can it be installed on Linux and work with gcc? If I have many processes, say 30, do I have to open 30 terminal windows? thanks Jack

> Date: Mon, 26 Jul 2010 08:23:57 +0200
> From: jody@gmail.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OpenMPI Segmentation fault (11)
>
> Hi Jack
>
> Have you tried to run your application under valgrind?
> Even though applications generally run slower under valgrind,
> it may detect memory errors before the actual crash happens.
>
> The best would be to start a terminal window for each of your processes
> so you can see valgrind's output for each process separately.
>
> Jody
>
> On Mon, Jul 26, 2010 at 4:08 AM, Jack Bryan wrote:
> > [quoted message "OpenMPI Segmentation fault (11)" elided]
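A note on Jody's valgrind suggestion above: separate terminal windows are not required. Open MPI can launch valgrind in front of every rank, and valgrind can write one log file per process. A minimal sketch; the executable name ./exec, the rank count, and the log-file prefix are placeholders:

```shell
# Launch 6 ranks, each under valgrind. %p in --log-file expands to the
# PID of each process, so every rank gets its own log instead of its
# own terminal window.
mpirun -np 6 valgrind --leak-check=full --log-file=vg.%p.log ./exec
```

Afterward, inspect the vg.*.log files for "Invalid read/write" reports near the crash.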
[OMPI users] Open MPI C++ class datatype
Dear All: I need to transfer some data, which is a C++ class with some vector member data. I want to use MPI_Bcast(buffer, count, datatype, root, comm). Can I use MPI_Datatype to define a custom datatype for a structure that contains a C++ class? Any help is appreciated. Jack Aug 3 2010
[OMPI users] Open MPI dynamic data structure error
Hi, I need to design a data structure to transfer data between nodes on an Open MPI system. Some elements of the structure have dynamic size. For example:

typedef struct {
    double data1;
    vector<double> dataVec;
} myDataType;

The size of dataVec depends on some intermediate computing results. If I only declare it as the above myDataType, I think only a pointer is transferred. When the receiver tries to access the elements of dataVec, it gets a segmentation fault. But I also need to use myDataType to declare other data structures, such as vector<myDataType> newDataVec. I cannot declare myDataType inside a function such as main(); I get errors:

main.cpp:200: error: main(int, char**)::myDataType uses local type main(int, char**)::myDataType

Any help is really appreciated. thanks Jack Oct. 19 2010
[OMPI users] OPEN MPI data transfer error
Hi, I am using Open MPI to transfer data between nodes, but the received data is not what the sender sends out. I have tried both the C and C++ bindings.

data sender:

double* sendArray = new double[sendResultVec.size()];
for (int ii = 0; ii < sendResultVec.size(); ii++)
{
    sendArray[ii] = sendResultVec[ii];
}
MPI::COMM_WORLD.Send(sendArray, sendResultVec.size(), MPI_DOUBLE, 0, myworkerUpStreamTaskTag);

data receiver:

double* recvArray = new double[objSize];
mToMasterT1Req = MPI::COMM_WORLD.Irecv(recvArray, objSize, MPI_DOUBLE, destRank, myUpStreamTaskTag);

sendResultVec.size() = objSize. What is the possible reason? Any help is appreciated. thanks jack Oct. 22 2010
Re: [OMPI users] OPEN MPI data transfer error
Hi, I have used mpi_waitall() to do it. The data can be received, but the contents are wrong. Any help is appreciated. thanks

> From: jsquy...@cisco.com
> Date: Fri, 22 Oct 2010 15:35:11 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OPEN MPI data transfer error
>
> It doesn't look like you have completed the request that came back from
> Irecv. You need to TEST or WAIT on requests before they are actually
> completed (e.g., in the case of a receive, the data won't be guaranteed to be
> in the target buffer until TEST/WAIT indicates that the request has
> completed).
>
> On Oct 22, 2010, at 3:19 PM, Jack Bryan wrote:
> > [quoted message "OPEN MPI data transfer error" elided]
>
> --
> Jeff Squyres
> jsquy...@cisco.com
[OMPI users] Open MPI program cannot complete
Hi, I have a problem with Open MPI. My program has 5 processes. All of them can run MPI_Finalize() and return 0, but the whole program cannot complete: in the MPI cluster job queue it is still in running status. If I use 1 process to run it, there is no problem. Why? My program:

int main (int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &mySize);
    MPI_Comm world;
    world = MPI_COMM_WORLD;

    if (myRank == 0) {
        do some things.
    }
    if (myRank != 0) {
        do some things.
        MPI_Finalize();
        return 0;
    }
    if (myRank == 0) {
        MPI_Finalize();
        return 0;
    }
}

Also, some output files contain garbage that cannot be read. In the 1-process case the program prints correct results to these output files. Any help is appreciated. thanks Jack Oct. 24 2010
Re: [OMPI users] Open MPI program cannot complete
Thanks for the reply. But I use mpi_waitall() to make sure that all MPI communications have finished before a process calls MPI_Finalize() and returns. Any help is appreciated. thanks Jack Oct. 24 2010

> From: g...@ldeo.columbia.edu
> Date: Sun, 24 Oct 2010 17:31:11 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> Hi Jack
>
> It may depend on "do some things".
> Does it involve MPI communication?
>
> Also, why not put MPI_Finalize(); return 0; outside the ifs?
>
> Gus Correa
>
> On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote:
> > [quoted message "Open MPI program cannot complete" elided]
Re: [OMPI users] Open MPI program cannot complete
Thanks, but my code is too long to be posted. What are the common reasons for this kind of problem? Any help is appreciated. Jack Oct. 24 2010

> From: g...@ldeo.columbia.edu
> Date: Sun, 24 Oct 2010 18:09:52 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> Hi Jack
>
> Your code snippet is too terse; it doesn't show the MPI calls.
> It is hard to guess what the problem is this way.
>
> Gus Correa
>
> On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote:
> > [earlier messages in the thread elided]
Re: [OMPI users] Open MPI program cannot complete
thanks. I used:

cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl;
MPI_Finalize();
return 0;

I can get the output " I am rank 0 (1, 2, ...) I am before MPI_Finalize() ". Are there other, better ways to check this? Any help is appreciated. thanks Jack Oct. 25 2010

From: solarbik...@gmail.com
Date: Sun, 24 Oct 2010 19:47:54 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

How do you know all processes call mpi_finalize? Did you have all of them print out something before they call mpi_finalize? I think what Gustavo is getting at is that maybe you had some MPI calls within your snippets that hang your program, so some of your processes never called mpi_finalize.

On Sun, Oct 24, 2010 at 6:59 PM, Jack Bryan wrote:
[earlier messages in the thread elided]

--
David Zhang
University of California, San Diego
Re: [OMPI users] Open MPI program cannot complete
thanks. I found a problem. I used:

cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl;
MPI_Finalize();
cout << " I am rank " << rank << " I am after MPI_Finalize()" << endl;
return 0;

I can get the output " I am rank 0 (1, 2, ...) I am before MPI_Finalize() " and " I am rank 0 I am after MPI_Finalize() ". But the other processes do not print " I am rank ... I am after MPI_Finalize() ". It is weird: the processes have reached the point just before MPI_Finalize(), so why are they hung there? Are there other, better ways to check this? Any help is appreciated. thanks Jack Oct. 25 2010

From: solarbik...@gmail.com
Date: Sun, 24 Oct 2010 19:47:54 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

[quoted text from earlier in the thread elided]
Re: [OMPI users] Open MPI program cannot complete
Thanks, but I have put an mpi_waitall(request) before

cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl;

If the above sentence has been printed out, it means that all requests have been checked and finished, right? What may be the possible reasons for the hang? Any help is appreciated. Jack Oct. 25 2010

Date: Mon, 25 Oct 2010 05:32:44 -0400
From: terry.don...@oracle.com
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

So what you are saying is *all* the ranks have entered MPI_Finalize and only a subset has exited, per placing prints before and after MPI_Finalize. Good. So my guess is that the processes stuck in MPI_Finalize have a prior MPI request outstanding that for whatever reason is unable to complete. So I would first look at all the MPI requests and make sure they completed.

--td

On 10/25/2010 02:38 AM, Jack Bryan wrote:
[quoted text from earlier in the thread elided]
Re: [OMPI users] Open MPI program cannot complete
Thanks, the problem is still there. I used:

cout << "In main(), I am rank " << myRank << " , I am before MPI_Barrier(MPI_COMM_WORLD). \n\n" << endl;
MPI_Barrier(MPI_COMM_WORLD);
cout << "In main(), I am rank " << myRank << " , I am before MPI_Finalize() and after MPI_Barrier(MPI_COMM_WORLD). \n\n" << endl;
MPI_Finalize();
cout << "In main(), I am rank " << myRank << " , I am after MPI_Finalize(), then return 0 . \n\n" << endl;
return 0;

Only process 0 returns. The other processes are still stuck in MPI_Finalize(). Any help is appreciated. JACK Oct. 25 2010

From: solarbik...@gmail.com
Date: Mon, 25 Oct 2010 08:27:19 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

I think I got this problem before. Put an mpi_barrier(mpi_comm_world) before mpi_finalize for all processes. For me, MPI terminates nicely only when all processes call mpi_finalize at the same time, so I do it for all my programs.

On Mon, Oct 25, 2010 at 7:13 AM, Jack Bryan wrote:
[quoted text from earlier in the thread elided]
Re: [OMPI users] Open MPI program cannot complete
thanks. Would you tell me how to use (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) with MPI? I need to use a #PBS parallel job script to submit jobs on the MPI cluster. Where should I put the gdb command in the script, and how do I get the ZOMBIE_PID? thanks. Any help is appreciated. Jack Oct. 25 2010

Date: Mon, 25 Oct 2010 19:01:38 +0200
From: j...@59a2.org
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

On Mon, Oct 25, 2010 at 18:26, Jack Bryan wrote:
> Thanks, the problem is still there.

This really doesn't prove that there are no outstanding asynchronous requests, but perhaps you know that there are not, despite not being able to post a complete test case here. I suggest attaching a debugger and getting a stack trace from the zombies (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID).

Jed
Re: [OMPI users] Open MPI program cannot complete
thanks. I have to use #PBS to submit any job on my cluster; I cannot run a job from the command line. This is my script:

--
#!/bin/bash
#PBS -N jobname
#PBS -l walltime=00:08:00,nodes=1
#PBS -q queuename
COMMAND=/mypath/myprog
NCORES=5

cd $PBS_O_WORKDIR
NODES=`cat $PBS_NODEFILE | wc -l`
NPROC=$(( $NCORES * $NODES ))

mpirun -np $NPROC --mca btl self,sm,openib $COMMAND
---

Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) in the script? And how do I get ZOMBIE_PID from the script? Any help is appreciated. thanks Oct. 25 2010

Date: Mon, 25 Oct 2010 19:24:35 +0200
From: j...@59a2.org
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

On Mon, Oct 25, 2010 at 19:07, Jack Bryan wrote:
> I need to use #PBS parallel job script to submit a job on MPI cluster.

Is it not possible to reproduce locally? Most clusters have a way to submit an interactive job (which would let you start this thing and then inspect individual processes). Ashley's Padb suggestion will certainly be better in a non-interactive environment.

> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) in the script ?

Is control returning to your script after rank 0 has exited? In that case, you can just put this on the next line.

> How to get the ZOMBIE_PID ?

"ps" from the command line, or getpid() from C code.

Jed
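One way to combine Jed's gdb suggestion with a batch-only cluster, sketched as an assumption rather than a tested recipe: background mpirun in the job script, wait long enough for the hang to appear, then attach gdb to every process still running the program. The paths, sleep interval, and rank count are placeholders, and gdb/pgrep must exist on the compute node:

```shell
#!/bin/bash
#PBS -N jobname
#PBS -l walltime=00:08:00,nodes=1
#PBS -q queuename
cd $PBS_O_WORKDIR

# Run the job in the background so the script regains control
# even while some ranks are stuck in MPI_Finalize.
mpirun -np 5 /mypath/myprog &
MPIRUN_PID=$!

# Give the program time to reach the hang, then dump a stack trace
# of every process still executing it.
sleep 300
for pid in $(pgrep -f /mypath/myprog); do
    echo "=== backtrace of $pid ==="
    gdb --batch -ex 'bt full' -ex 'info reg' -p "$pid"
done

wait $MPIRUN_PID
```

The backtraces land in the job's stdout file, which answers both "where to put the gdb line" and "how to get the PID" without an interactive session.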
Re: [OMPI users] Open MPI program cannot complete
thanks I use qsub -I nsga2_job.shqsub: waiting for job 48270.clusterName to start By qstatI found the job name is none and no results show up. No shell prompt appear, the command line is hang there , no response. Any help is appreciated. Thanks Jack Oct. 25 2010 > From: jsquy...@cisco.com > Date: Mon, 25 Oct 2010 13:39:30 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, > "qsub -I ..." ? > > Then you get a shell prompt with your allocated cores and can run stuff > interactively. I don't know if your site allows this, but interactive > debugging here might be *significantly* easier than try to automate some > debugging. > > > On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: > > > thanks > > > > I have to use #PBS to submit any jobs in my cluster. > > I cannot use command line to hang a job on my cluster. > > > > this is my script: > > -- > > #!/bin/bash > > #PBS -N jobname > > #PBS -l walltime=00:08:00,nodes=1 > > #PBS -q queuename > > COMMAND=/mypath/myprog > > NCORES=5 > > > > cd $PBS_O_WORKDIR > > NODES=`cat $PBS_NODEFILE | wc -l` > > NPROC=$(( $NCORES * $NODES )) > > > > mpirun -np $NPROC --mca btl self,sm,openib $COMMAND > > > > --- > > > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > > ZOMBIE_PID) in the script ? > > And how to get ZOMBIE_PID from the script ? > > > > Any help is appreciated. > > > > thanks > > > > Oct. 25 2010 > > > > Date: Mon, 25 Oct 2010 19:24:35 +0200 > > From: j...@59a2.org > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > On Mon, Oct 25, 2010 at 19:07, Jack Bryan wrote: > > I need to use #PBS parallel job script to submit a job on MPI cluster. > > > > Is it not possible to reproduce locally? Most clusters have a way to > > submit an interactive job (which would let you start this thing and then > > inspect individual processes). 
> > Ashley's Padb suggestion will certainly be better in a non-interactive
> > environment.
> >
> > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid
> > ZOMBIE_PID) in the script ?
> >
> > Is control returning to your script after rank 0 has exited? In that case,
> > you can just put this on the next line.
> >
> > How to get the ZOMBIE_PID ?
> >
> > "ps" from the command line, or getpid() from C code.
> >
> > Jed
> >
> > ___ users mailing list
> > us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
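The "how do I get ZOMBIE_PID from the script" step above can be sketched as a self-contained shell fragment. Here `sleep` merely stands in for a hung rank, `myprog` would be the real executable name, and the actual gdb attach is left as a comment (these names and the availability of gdb on the compute node are assumptions, not facts from the thread):

```bash
# Stand-in for a hung rank: in the real job this would be the MPI process
# still running after mpirun fails to return.
sleep 30 &

# Find the PID of the (simulated) hung process among this shell's children.
# In a real job script you would match on the program name, e.g.:
#   ZOMBIE_PID=$(pgrep -u "$USER" -n myprog)
ZOMBIE_PID=$(pgrep -P $$ -n sleep)

# The real next line would then be:
#   gdb --batch -ex 'bt full' -ex 'info reg' -p "$ZOMBIE_PID"
echo "would attach gdb to PID $ZOMBIE_PID"

kill "$ZOMBIE_PID"
```

Note that if control never returns to the job script (the mpirun line itself hangs), this has to run from a second shell on the same node rather than from the next line of the script.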
Re: [OMPI users] Open MPI program cannot complete
thanks But, the code is too long. Jack Oct. 25 2010 > Date: Mon, 25 Oct 2010 14:08:54 -0400 > From: g...@ldeo.columbia.edu > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Your job may be queued, not executing, because there are no > resources available, all nodes are busy. > Try qstat -a. > > Posting a code snippet with all your MPI calls may prove effective. > You might get a trove of advice for a thrift of effort. > > Jeff Squyres wrote: > > Check the man page for qsub for proper use. > > > > > > On Oct 25, 2010, at 1:49 PM, Jack Bryan wrote: > > > >> thanks > >> > >> I use > >> qsub -I nsga2_job.sh > >> qsub: waiting for job 48270.clusterName to start > >> > >> By qstat > >> I found the job name is none and no results show up. > >> > >> No shell prompt appear, the command line is hang there , no response. > >> > >> Any help is appreciated. > >> > >> Thanks > >> > >> Jack > >> > >> Oct. 25 2010 > >> > >>> From: jsquy...@cisco.com > >>> Date: Mon, 25 Oct 2010 13:39:30 -0400 > >>> To: us...@open-mpi.org > >>> Subject: Re: [OMPI users] Open MPI program cannot complete > >>> > >>> Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, > >>> "qsub -I ..." ? > >>> > >>> Then you get a shell prompt with your allocated cores and can run stuff > >>> interactively. I don't know if your site allows this, but interactive > >>> debugging here might be *significantly* easier than try to automate some > >>> debugging. > >>> > >>> > >>> On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: > >>> > >>>> thanks > >>>> > >>>> I have to use #PBS to submit any jobs in my cluster. > >>>> I cannot use command line to hang a job on my cluster. 
> >>>> > >>>> this is my script: > >>>> -- > >>>> #!/bin/bash > >>>> #PBS -N jobname > >>>> #PBS -l walltime=00:08:00,nodes=1 > >>>> #PBS -q queuename > >>>> COMMAND=/mypath/myprog > >>>> NCORES=5 > >>>> > >>>> cd $PBS_O_WORKDIR > >>>> NODES=`cat $PBS_NODEFILE | wc -l` > >>>> NPROC=$(( $NCORES * $NODES )) > >>>> > >>>> mpirun -np $NPROC --mca btl self,sm,openib $COMMAND > >>>> > >>>> --- > >>>> > >>>> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > >>>> ZOMBIE_PID) in the script ? > >>>> And how to get ZOMBIE_PID from the script ? > >>>> > >>>> Any help is appreciated. > >>>> > >>>> thanks > >>>> > >>>> Oct. 25 2010 > >>>> > >>>> Date: Mon, 25 Oct 2010 19:24:35 +0200 > >>>> From: j...@59a2.org > >>>> To: us...@open-mpi.org > >>>> Subject: Re: [OMPI users] Open MPI program cannot complete > >>>> > >>>> On Mon, Oct 25, 2010 at 19:07, Jack Bryan wrote: > >>>> I need to use #PBS parallel job script to submit a job on MPI cluster. > >>>> > >>>> Is it not possible to reproduce locally? Most clusters have a way to > >>>> submit an interactive job (which would let you start this thing and then > >>>> inspect individual processes). Ashley's Padb suggestion will certainly > >>>> be better in a non-interactive environment. > >>>> > >>>> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > >>>> ZOMBIE_PID) in the script ? > >>>> > >>>> Is control returning to your script after rank 0 has exited? In that > >>>> case, you can just put this on the next line. > >>>> > >>>> How to get the ZOMBIE_PID ? > >>>> > >>>> "ps" from the command line, or getpid() from C code. 
> >>>> > >>>> Jed > >>>> > >>>> ___ users mailing list > >>>> us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>> ___ > >>>> users mailing list > >>>> us...@open-mpi.org > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>> > >>> -- > >>> Jeff Squyres > >>> jsquy...@cisco.com > >>> For corporate legal information go to: > >>> http://www.cisco.com/web/about/doing_business/legal/cri/ > >>> > >>> > >>> ___ > >>> users mailing list > >>> us...@open-mpi.org > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >> ___ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI program cannot complete
Thanks

I have downloaded http://padb.googlecode.com/files/padb-3.0.tgz and compiled it. But, there is no user manual; I cannot use it with padb -aQ.

./padb -aQ myjob
padb: Error: --all incompatible with specific ids

Actually, myjob is running in the queue.

Do you have a user manual about how to use it ?

thanks

> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 18:08:32 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> On 25 Oct 2010, at 17:26, Jack Bryan wrote:
>
> > Thanks, the problem is still there.
> >
> > I used:
> >
> > Only process 0 returns. Other processes are still stuck in
> > MPI_Finalize().
> >
> > Any help is appreciated.
>
> You can use the command "padb -aQ" to show you the message queues for your
> application, you'll need to download and install padb then simply run your
> job, allow it to hang and then run padb - it'll show you the message queues
> for each rank that it can find processes for (the ones that haven't exited).
> If this isn't any help run "padb -axt" for the stack traces and send the
> output to this list.
>
> The web-site is in my signature or there is a new beta release out this week
> at http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz
>
> Ashley.
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI program cannot complete
Thanks

I have downloaded http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz and followed the instructions in the INSTALL file and installed it at /mypath/padb32

But, I got:

-bash-3.2$ padb -Ormgr=pbs -Q 48279.cluster
Job 48279.cluster is not active

Actually, the job was running.

I have installed bin at /mypath/padb32/bin and libexec at /lustre/jxding/padb32/libexec

When I installed it, I used

./configure --prefix=/mypath/padb32

I got

-
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking whether gcc and cc understand -c and -o together... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: executing depfiles commands
---
-bash-3.2$ make
Making all in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
gcc -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"padb\" -DVERSION=\"3.2-beta1\" -I. -Wall -g -O2 -MT minfo-minfo.o -MD -MP -MF .deps/minfo-minfo.Tpo -c -o minfo-minfo.o `test -f 'minfo.c' || echo './'`minfo.c
minfo.c: In function 'find_sym':
minfo.c:158: warning: dereferencing type-punned pointer will break strict-aliasing rules
minfo.c: In function 'main':
minfo.c:649: warning: type-punning to incomplete type might break strict-aliasing rules
minfo.c:650: warning: type-punning to incomplete type might break strict-aliasing rules
mv -f .deps/minfo-minfo.Tpo .deps/minfo-minfo.Po
gcc -Wall -g -O2 -ldl -o minfo minfo-minfo.o
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Nothing to be done for `all-am'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
-
-bash-3.2$ make install
Making install in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
make[2]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
test -z "/lustre/jxding/padb32/bin" || /bin/mkdir -p "/mypath/padb32/bin"
 /usr/bin/install -c padb '/lustre/jxding/padb32/bin'
test -z "/lustre/jxding/padb32/libexec" || /bin/mkdir -p "/mypath/padb32/libexec"
 /usr/bin/install -c minfo '/lustre/jxding/padb32/libexec'
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[2]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[2]: Nothing to be done for `install-exec-am'.
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
-bash-3.2$ make installcheck
Making installcheck in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Nothing to be done for `installcheck'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Nothing to be done for `installcheck-am'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
--

Is there something wrong with what I have done ?

Any help is appreciated.

thanks

Jack

Oct. 25 2010

> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 20:40:18 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> On 25 Oct 2010, at 20:18, Jack Bryan wrote:
>
> > Thanks
> > I have downloaded
> > http://padb.googlecode.com/files/padb-3.0.tgz
> >
> > and compiled it.
> >
> > But, no user manual, I cannot use it by padb -aQ.
>
> The -a flag is a shortcut to all jobs, if you are providing a jobid (which is
> normally numeric) then don't set the -a flag.
>
> > Do you have a user manual about how to use it ?
>
> In my previ
Re: [OMPI users] Open MPI program cannot complete
thanks

But, I cannot see the attachment in the email. Would you please send it again ? and also copy it to my other email: tomviewisu@yahoo.com

thanks

Oct. 25 2010

From: dtustud...@hotmail.com
To: ash...@pittman.co.uk
Subject: RE: [OMPI users] Open MPI program cannot complete
List-Post: users@lists.open-mpi.org
Date: Mon, 25 Oct 2010 16:53:32 -0600

thanks But, I cannot see the attachment in the email. Would you please send it again ? and also copy it to my other email: tomview...@yahoo.com thanks Oct. 25 2010

> Subject: Re: [OMPI users] Open MPI program cannot complete
> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 23:41:32 +0100
> To: dtustud...@hotmail.com
>
> Thanks, that tells me a lot.
>
> Try the attached padb, I've added the patch for you and removed the -w option.
> Can you run it and send me back the output please.
>
> Ashley.
>
> On 25 Oct 2010, at 23:29, Jack Bryan wrote:
>
> > Thanks
> >
> > Here is the
> >
> > -bash-3.2$ qstat -fB
> > Server: clusterName
> > server_state = Active
> > scheduling = True
> > total_jobs = 26
> > state_count = Transit:0 Queued:7 Held:0 Waiting:0 Running:18 Exiting:0
> > acl_hosts = clustername
> > default_queue = normal
> > log_events = 511
> > mail_from = adm
> > query_other_jobs = True
> > resources_assigned.nodect = 246
> > scheduler_iteration = 600
> > node_check_rate = 150
> > tcp_timeout = 6
> > mom_job_sync = True
> > pbs_version = 2.4.2
> > keep_completed = 300
> > submit_hosts = clusterName
> > next_job_number = 48293
> > net_counter = 2 9 6
> >
> > -bash-3.2$ qstat -w -n
> > qstat: invalid option -- w
> >
> > Which line should I put the
> > -
> > --- padb (revision 401)
> > +++ padb (working copy)
> > @@ -2824,6 +2824,7 @@
> > foreach my $server (@servers) {
> > pbs_get_lqsub( $user, $server ); # get job list by qsub
> > }
> > + print Dumper \%pbs_tabjobs;
> > return \%pbs_tabjobs;
> > }
> >
> > in the bin file padb
> >
> > Any help is appreciated.
> >
> > thanks
> >
> > Jack
> >
> > Oct.
25 2010 > > > > > > > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > From: ash...@pittman.co.uk > > > Date: Mon, 25 Oct 2010 22:54:21 +0100 > > > To: dtustud...@hotmail.com > > > > > > > > > [off list] > > > > > > The PBS support was added by a third-party so I've not used it in anger > > > myself, it appears you are doing the correct thing as far as I can tell. > > > > > > Can you send me the output of the following two commands and also apply > > > the patch below to padb (you can do this just in the bin dir - it's a > > > perl script) and send me the output when you run that as well? > > > > > > qstat -fB > > > qstat -w -n > > > > > > --- padb (revision 401) > > > +++ padb (working copy) > > > @@ -2824,6 +2824,7 @@ > > > foreach my $server (@servers) { > > > pbs_get_lqsub( $user, $server ); # get job list by qsub > > > } > > > + print Dumper \%pbs_tabjobs; > > > return \%pbs_tabjobs; > > > } > > > > > > On 25 Oct 2010, at 22:30, Jack Bryan wrote: > > > > > > > Thanks > > > > > > > > I have downloaded > > > > http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz > > > > > > > > and followed the instructions of INSTALL file and installed it at > > > > /mypath/padb32 > > > > > > > > But, I got: > > > > > > > > -bash-3.2$ padb -Ormgr=pbs -Q 48279.cluster > > > > Job 48279.cluster is not active > > > > > > > > Actually, the job was running. > > > > > > > > I have installed > > > > bin at > > > > > > > > /mypath/padb32/bin > > > > > > > > > > > > libexec at > > > > /lustre/jxding/padb32/libexec > > > > > > > > When I installed it, I used > > > > > > > > ./configure --prefix=/mypath/padb32 > > > > > > > > I got > > > > - > > > > > > > >
Re: [OMPI users] Open MPI program cannot complete
Hi, I put your attached padb in mypath and also set it up in the env variable. I got this:

-bash-3.2$ padb -Ormgr=pbs -Q 48494.cystorm2
-bash: /mypath/padb_patch_2010_10_26/padb: /usr/bin/perl^M: bad interpreter: No such file or directory

Any help is appreciated.

thanks

Jack

Oct. 26 2010

Subject: Re: [OMPI users] Open MPI program cannot complete
From: ash...@pittman.co.uk
List-Post: users@lists.open-mpi.org
Date: Tue, 26 Oct 2010 08:39:56 +0100
CC: tomview...@yahoo.com
To: dtustud...@hotmail.com

Sorry, I forgot to attach it last night.
Re: [OMPI users] Open MPI program cannot complete
thanks I got :

-bash-3.2$ padb -Ormgr=pbs -Q 48516.cystorm2
$VAR1 = {};
Job 48516.cluster is not active

Actually, the job is running.

Any help is appreciated.

thanks
Jinxu Ding

Oct. 26 2010

> Subject: Re: [OMPI users] Open MPI program cannot complete
> From: ash...@pittman.co.uk
> Date: Tue, 26 Oct 2010 23:18:57 +0100
> To: dtustud...@hotmail.com
>
> The "^M: bad interpreter" tells me that you've downloaded the file in Windows
> and have got dos-based new-lines in the file.
>
> Assuming it's installed on your machine run "dos2unix padb" and it'll remove
> them, if that doesn't work save the file using a unix based email program. I
> hope this helps you when we finally get it working!
>
> Ashley.
>
> On 26 Oct 2010, at 22:14, Jack Bryan wrote:
>
> > Hi,
> >
> > I put your attached padb in mypath and also set it up in the env variable.
> > I got this:
> >
> > -bash-3.2$ padb -Ormgr=pbs -Q 48494.cystorm2
> > -bash: /mypath/padb_patch_2010_10_26/padb: /usr/bin/perl^M: bad
> > interpreter: No such file or directory
> >
> > Any help is appreciated.
> >
> > thanks
> >
> > Jack
> >
> > Oct. 26 2010
> >
> > Subject: Re: [OMPI users] Open MPI program cannot complete
> > From: ash...@pittman.co.uk
> > Date: Tue, 26 Oct 2010 08:39:56 +0100
> > CC: tomview...@yahoo.com
> > To: dtustud...@hotmail.com
> >
> > Sorry, I forgot to attach it last night.
> >
> > --
> > Ashley Pittman, Bath, UK.
> >
> > Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
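Ashley's dos2unix fix can also be done with sed when dos2unix is not installed. The sketch below creates its own stand-in file named padb (so it is self-contained; the real file is the downloaded script):

```bash
# Create a stand-in script with DOS (\r\n) line endings, the way a
# Windows-side download would leave it.
printf '#!/usr/bin/perl\r\nprint 1;\r\n' > padb

# Strip the trailing carriage return from every line, in place.
sed -i 's/\r$//' padb

# The shebang line no longer ends in ^M, so the interpreter is found again.
head -n 1 padb    # -> #!/usr/bin/perl
```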
[OMPI users] open MPI please recommend a debugger for open MPI
Hi, Would you please recommend a debugger which can do debugging for parallel processes on Open MPI systems ? I hope that it can be installed without root rights, because I am not a root user on our MPI cluster. Any help is appreciated. Thanks Jack Oct. 28 2010
Re: [OMPI users] open MPI please recommend a debugger for open MPI
thanks

I have run padb (the new one with your patch) on my system and got :

-bash-3.2$ padb -Ormgr=pbs -Q 48516.cluster
$VAR1 = {};
Job 48516.cluster is not active

Actually, the job is running.

How do I check whether my system has pbs_pro ?

Any help is appreciated.

thanks
Jinxu Ding

Oct. 29 2010

> From: ash...@pittman.co.uk
> Date: Fri, 29 Oct 2010 18:21:46 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] open MPI please recommend a debugger for open MPI
>
> On 29 Oct 2010, at 12:06, Jeremy Roberts wrote:
>
> > I'd suggest looking into TotalView (http://www.totalviewtech.com) and/or
> > DDT (http://www.allinea.com/). I've used TotalView pretty extensively and
> > found it to be pretty easy to use. They are both commercial, however, and
> > not cheap.
> >
> > As far as I know, there isn't a whole lot of open source support for
> > parallel debugging. The Parallel Tools Platform of Eclipse claims to
> > provide a parallel debugger, though I have yet to try it
> > (http://www.eclipse.org/ptp/).
>
> Jeremy has covered the graphical parallel debuggers that I'm aware of; for a
> different approach there is padb, which isn't a "parallel debugger" in the
> traditional model but is able to show you the same type of information. It
> won't allow you to point-and-click through the source or single-step through
> the code, but it is lightweight and will show you the information which you
> need to know.
>
> Padb needs to integrate with the resource manager. I know it works with
> pbs_pro but it seems there are a few issues on your system, which is pbs
> (without the pro). I can help you with this and work through the problems,
> but only if you work with me and provide details of the integration. In
> particular I've sent you a version which has a small patch and some debug
> printfs added; if you could send me the output from this I'd be able to tell
> you if it was likely to work and how to go about making it do so.
>
> Ashley.
>
> --
>
> Ashley Pittman, Bath, UK.
> > Padb - A parallel job inspection tool for cluster computing > http://padb.pittman.org.uk > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] open MPI please recommend a debugger for open MPI
Hi, this is what I got :

-bash-3.2$ qstat -n -u myName

cluster:
                                                           Req'd  Req'd   Elap
Job ID          Username Queue  Jobname  SessID NDS TSK    Memory Time  S Time
--------------- -------- ------ -------- ------ --- --- -- ------ ----- - -----
48933.cluster.e myName   devel  myJob    107835   1  --        -- 00:02 C 00:00
   n20/0

Any help is appreciated.

thanks

> From: ash...@pittman.co.uk
> Date: Fri, 29 Oct 2010 18:38:25 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] open MPI please recommend a debugger for open MPI
>
> Can you try the following and send me the output.
>
> qstat -n -u `whoami` @clusterName
>
> The output sent before implies that your cluster is called "clusterName"
> rather than "cluster" which is a little surprising but let's see what it
> gives us if we query on that basis.
>
> Ashley.
>
> On 29 Oct 2010, at 18:29, Jack Bryan wrote:
>
> > thanks
> >
> > I have run padb (the new one with your patch) on my system and got :
> >
> > -bash-3.2$ padb -Ormgr=pbs -Q 48516.cluster
> > $VAR1 = {};
> > Job 48516.cluster is not active
> >
> > Actually, the job is running.
> >
> > How to check whether my system has pbs_pro ?
> >
> > Any help is appreciated.
> >
> > thanks
> > Jinxu Ding
> >
> > Oct. 29 2010
> >
> > > From: ash...@pittman.co.uk
> > > Date: Fri, 29 Oct 2010 18:21:46 +0100
> > > To: us...@open-mpi.org
> > > Subject: Re: [OMPI users] open MPI please recommend a debugger for open
> > > MPI
> > >
> > > On 29 Oct 2010, at 12:06, Jeremy Roberts wrote:
> > >
> > > > I'd suggest looking into TotalView (http://www.totalviewtech.com)
> > > > and/or DDT (http://www.allinea.com/). I've used TotalView pretty
> > > > extensively and found it to be pretty easy to use. They are both
> > > > commercial, however, and not cheap.
> > > >
> > > > As far as I know, there isn't a whole lot of open source support for
> > > > parallel debugging. The Parallel Tools Platform of Eclipse claims to
> > > > provide a parallel debugger, though I have yet to try it
> > > > (http://www.eclipse.org/ptp/).
> > > > > > Jeremy has covered the graphical parallel debuggers that I'm aware of, > > > for a different approach there is padb which isn't a "parallel debugger" > > > in the traditional model but is able to show you the same type of > > > information, it won't allow you to point-and-click through the source or > > > single step through the code but it is lightweight and will show you the > > > information which you need to know. > > > > > > Padb needs to integrate with the resource manager, I know it works with > > > pbs_pro but it seems there are a few issues on your system which is pbs > > > (without the pro). I can help you with this and work through the problems > > > but only if you work with me and provide details of the integration, in > > > particular I've sent you a version which has a small patch and some debug > > > printfs added, if you could send me the output from this I'd be able to > > > tell you if it was likely to work and how to go about making it do so. > > > > > > Ashley. > > > > > > -- > > > > > > Ashley Pittman, Bath, UK. > > > > > > Padb - A parallel job inspection tool for cluster computing > > > http://padb.pittman.org.uk > > > > > > > > > ___ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- > > Ashley Pittman, Bath, UK. > > Padb - A parallel job inspection tool for cluster computing > http://padb.pittman.org.uk > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] message truncated error
HI, In my MPI program, the master sends many messages to another worker with the same tag. The worker uses

MPI::COMM_WORLD.Recv(&message_para_to_one_worker, 1, message_para_to_workers_type, 0, downStreamTaskTag);

to receive the messages. I got error:

n36:94880] *** An error occurred in MPI_Recv
[n36:94880] *** on communicator MPI_COMM_WORLD
[n36:94880] *** MPI_ERR_TRUNCATE: message truncated
[n36:94880] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n36:94880] *** Process received signal ***
[n36:94880] Signal: Segmentation fault (11)
[n36:94880] Signal code: Address not mapped (1)

Is this (the same tag) the reason for the errors ? Any help is appreciated. thanks Jack Oct. 31 2010
Re: [OMPI users] message truncated error
thanks I use

double* recvArray = new double[buffersize];   // the receive buffer size
MPI::COMM_WORLD.Recv(&(recvDataArray[0]), xVSize, MPI_DOUBLE, 0, mytaskTag);
delete [] recvArray ;

In the first iteration, the receiver works well. But, in the second iteration, I got the MPI_ERR_TRUNCATE: message truncated. The buffersize is the same in the two iterations.

Any help is appreciated.

thanks

Nov. 1 2010

> Date: Mon, 1 Nov 2010 08:08:08 +0100
> From: jody@gmail.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] message truncated error
>
> Hi Jack
>
> Usually MPI_ERR_TRUNCATE means that the buffer you use in MPI_Recv
> (or MPI::COMM_WORLD.Recv) is too small to hold the message coming in.
> Check your code to make sure you assign enough memory to your buffers.
>
> regards
> Jody
>
> On Mon, Nov 1, 2010 at 7:26 AM, Jack Bryan wrote:
> > HI,
> > In my MPI program, the master sends many messages to another worker with
> > the same tag.
> > The worker uses
> > MPI::COMM_WORLD.Recv(&message_para_to_one_worker, 1,
> > message_para_to_workers_type, 0, downStreamTaskTag);
> > to receive the messages
> > I got error:
> >
> > n36:94880] *** An error occurred in MPI_Recv
> > [n36:94880] *** on communicator MPI_COMM_WORLD
> > [n36:94880] *** MPI_ERR_TRUNCATE: message truncated
> > [n36:94880] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > [n36:94880] *** Process received signal ***
> > [n36:94880] Signal: Segmentation fault (11)
> > [n36:94880] Signal code: Address not mapped (1)
> >
> > Is this (the same tag) the reason for the errors ?
> > Any help is appreciated.
> > thanks
> > Jack
> > Oct. 31 2010
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] Open MPI data transfer error
Hi, In my Open MPI program, one master sends data to 3 workers. Two workers can receive their data, but the third worker cannot get its data. Before sending data, the master sends head information to each worker receiver so that each worker knows what the following data package is (such as length and package tag). The third worker can get its head-information message from the master but cannot get its correct data package. Instead, it got the data that should have been received by the first worker, which got its correct data. Why ? Any help is appreciated. thanks Jack Nov. 4 2010
Re: [OMPI users] Open MPI data transfer error
Thanks, I have used "cout" in c++ to print the values of data. The sender sends correct data to correct receiver. But, receiver gets wrong data from correct sender. why ? thanks Nov. 5 2010 > Date: Fri, 5 Nov 2010 08:54:22 -0400 > From: prent...@ias.edu > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI data transfer error > > Jack Bryan wrote: > > > > Hi, > > > > In my Open MPI program, one master sends data to 3 workers. > > > > Two workers can receive their data. > > > > But, the third worker can not get their data. > > > > Before sending data, the master sends a head information to each worker > > receiver > > so that each worker knows what the following data package is. (such as > > length, package tag). > > > > The third worker can get its head information message from master but > > cannot get its correct > > data package. > > > > It got the data that should be received by first worker, which get its > > correct data. > > > > > Jack, > > Providing the relevant sections of code here would be very helpful. > > > I would tell you to add some printf statements to your code to see what > data is stored in your variables on the master before it sends them to > each node, but Jeff Squyres and I agreed to disagree in a civil manner > on that debugging technique earlier this week, and I'd hate to re-open > those old wounds by suggesting that technique here. ;) > > > -- > Prentice > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI data transfer error
Thanks, But, my code is too long to be posted. dozens of files, thousands of lines. Do you have better ideas ? Any help is appreciated. Jack Nov. 5 2010 From: solarbik...@gmail.com List-Post: users@lists.open-mpi.org Date: Fri, 5 Nov 2010 11:20:57 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI data transfer error As Prentice said, we can't help you without seeing your code. openMPI has stood many trials from many programmers, with many bugs ironed out. So typically it is unlikely openMPI is the source of your error. Without seeing your code the only logical conclusion is that something is wrong with your programming. On Fri, Nov 5, 2010 at 10:52 AM, Prentice Bisbal wrote: We can't help you with your coding problem without seeing your code. Jack Bryan wrote: > Thanks, > I have used "cout" in c++ to print the values of data. > > The sender sends correct data to correct receiver. > > But, receiver gets wrong data from correct sender. > > why ? > > thanks > > Nov. 5 2010 > >> Date: Fri, 5 Nov 2010 08:54:22 -0400 >> From: prent...@ias.edu >> To: us...@open-mpi.org >> Subject: Re: [OMPI users] Open MPI data transfer error >> >> Jack Bryan wrote: >> > >> > Hi, >> > >> > In my Open MPI program, one master sends data to 3 workers. >> > >> > Two workers can receive their data. >> > >> > But, the third worker can not get their data. >> > >> > Before sending data, the master sends a head information to each worker >> > receiver >> > so that each worker knows what the following data package is. (such as >> > length, package tag). >> > >> > The third worker can get its head information message from master but >> > cannot get its correct >> > data package. >> > >> > It got the data that should be received by first worker, which get its >> > correct data. >> > >> >> >> Jack, >> >> Providing the relevant sections of code here would be very helpful. 
>> >> >> I would tell you to add some printf statements to your code to see what >> data is stored in your variables on the master before it sends them to >> each node, but Jeff Squyres and I agreed to disagree in a civil manner >> on that debugging technique earlier this week, and I'd hate to re-open >> those old wounds by suggesting that technique here. ;) >> >> >> -- >> Prentice ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- David Zhang University of California, San Diego ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI data transfer error
Thanks. About my MPI program's bugs: I used GDB and got this error:

Program received signal SIGSEGV, Segmentation fault.
0: 0x003a31c62184 in fwrite () from /lib64/libc.so.6

and also:

1: Program received signal SIGABRT, Aborted.
0: I am rank 0, I have sent 4 tasks out of total tasks
1: 0x003a31c30265 in raise () from /lib64/libc.so.6

It may be caused by how a class is used. My program's master-worker MPI framework:

class CNSGA2 {
    // allocates memory for variables;
    // some deallocation statements;
    // some pointers;
    evaluate();   // it is a function
};
CNSGA2::CNSGA2() {}

class newCNSGA2 : public CNSGA2 {
public:
    newCNSGA2()  { cout << " constructor for newCNSGA2 \n\n" << endl; }
    ~newCNSGA2() { cout << " destructor for newCNSGA2 \n\n" << endl; }
};

main() {
    CNSGA2* nsga2a = new CNSGA2(true);    // true/false select different constructors
    CNSGA2* nsga2b = new CNSGA2(false);

    if (myRank == 0)   // scope1
    {
        initialize the objects nsga2a and nsga2b;
    }

    broadcast some parameters, which are obtained in scope1; according to
    those parameters, define a datatype (myData) that all workers use for
    recv and send;

    if (myRank == 0)   // scope2
    {
        send myData out to the workers using the datatype defined above;
    }

    if (myRank != 0) {
        newCNSGA2 myNsga2;
        recv data from master and work on the received data;
        myNsga2.evaluate(recv data);
        send back results;
    }
}

If I declare the objects (nsga2a, nsga2b) inside scope1, they are not visible in scope2. But the two objects are only used by the master, not by the workers; the workers only need to call evaluate() from the class CNSGA2. This is why I used inheritance to define the new class newCNSGA2. The problem is that there is some memory allocation and deallocation inside class CNSGA2, and the new class newCNSGA2 does not need that allocation and deallocation. If I put the declarations of CNSGA2* nsga2a and CNSGA2* nsga2b inside scope1, they are not visible in scope2, and I cannot combine the two scopes because the datatype has to be defined between them so that all workers can see it and use it for send and recv.
Any help is appreciated. Jack Nov. 6 2010

> Date: Fri, 5 Nov 2010 14:55:32 -0800
> From: eugene@oracle.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI data transfer error
>
> Debugging is not a straightforward task. Even posting the code doesn't
> necessarily help (since no one may be motivated to help, or they can't
> reproduce the problem, or...). You'll just have to try different things
> and see what works for you. Another option is to trace the MPI calls.
> If a process sends a message, dump out the MPI_Send() arguments. When a
> receiver receives, correspondingly dump those arguments. Etc. This
> might be a way of seeing what the program is doing in terms of MPI and
> thereby getting to suggestion B below.
>
> How do you trace and sort through the resulting data? That's another
> tough question. Among other things, if you can't find a tool that fits
> your needs, you can use the PMPI layer to write wrappers. Writing
> wrappers is like inserting printf() statements, but doesn't quite have
> the same amount of moral shame associated with it!
>
> Prentice Bisbal wrote:
>
> >Choose one:
> >
> >A) Post only the relevant sections of the code. If you have a syntax
> >error, it should be in the Send and Receive calls, or in one of the
> >lines where the data is copied to or read from the
> >array/buffer/whatever you're sending or receiving.
> >
> >B) Try reproducing your problem in a toy program that has only enough
> >code to reproduce the problem. For example, create an array, populate
> >it with data, send it, and then on the receiving end, receive it and
> >print it out. Something simple like that. I find that when I do this,
> >I usually find the error in my code.
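Eugene's PMPI suggestion above can be sketched concretely: the MPI profiling layer lets you define your own MPI_Send that logs its arguments and then forwards to the real implementation through the PMPI_ entry point. This is a hypothetical fragment, not a complete program; it uses the pre-MPI-3 non-const buffer signature to match the Open MPI 1.3/1.4 era of this thread, and it would be compiled into the application (or a separate library linked before the MPI library) rather than run on its own.

```cpp
#include <mpi.h>
#include <cstdio>

// Intercept MPI_Send: log the arguments, then call the real send through
// the PMPI entry point. Every MPI function has a PMPI_ twin for this purpose.
extern "C" int MPI_Send(void* buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm)
{
    int rank = -1;
    PMPI_Comm_rank(comm, &rank);  // use the PMPI_ form here too, to avoid recursion
    std::fprintf(stderr, "[rank %d] MPI_Send: count=%d dest=%d tag=%d\n",
                 rank, count, dest, tag);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```

A matching wrapper around MPI_Recv (logging the source, tag, and the count extracted from the status) gives both ends of each transfer, which is exactly the kind of trace Eugene describes.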
[OMPI users] Open MPI access the same file in parallel ?
Hi, I have a file, located in a system folder, that can be accessed by all parallel processes. Does Open MPI allow multiple processes to access the same file at the same time? For example, all processes open the file and load data from it at the same time. Any help is really appreciated. thanks Jack Mar 9 2011
Re: [OMPI users] Open MPI access the same file in parallel ?
Thanks. I only need to read the file, and each process reads it only once; but the file is about 200 MB, and my code is C++. Does Open MPI support this? thanks

From: solarbik...@gmail.com
Date: Wed, 9 Mar 2011 20:57:03 -0800
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI access the same file in parallel ?

Under my programming environment, FORTRAN, it is possible to do a parallel read (using the native read function instead of MPI's parallel read function), although you'll run into problems when you try to write to the same file in parallel.

On Wed, Mar 9, 2011 at 8:45 PM, Jack Bryan wrote:
> Hi, I have a file, located in a system folder, that can be accessed by
> all parallel processes. Does Open MPI allow multiple processes to access
> the same file at the same time? For example, all processes open the file
> and load data from it at the same time. Any help is really appreciated.
> thanks Jack Mar 9 2011

--
David Zhang
University of California, San Diego
Re: [OMPI users] Open MPI access the same file in parallel ?
Hi, thanks for your code. I have tested it with a simple example file. It works well, without any conflict from parallel accesses to the same file. Now I am using CPLEX (an optimization model solver) to load a model data file, which can be 200 MBytes:

CPLEX.importModel(modelName, dataFileName);

I do not know how the CPLEX code handles reading the model data file. Any suggestions or ideas are welcome. thanks Jack

From: belaid_...@hotmail.com
To: us...@open-mpi.org
Date: Thu, 10 Mar 2011 05:51:31
Subject: Re: [OMPI users] Open MPI access the same file in parallel ?

Hi, You can do that with C++ also. Just for the fun of it, I produced a little program for that; each process reads the whole file and prints the content to stdout. I hope this helps:

#include <mpi.h>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main (int argc, char* argv[])
{
    int rank, size;
    string line;
    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    ifstream txtFile("example.txt");
    if (txtFile.is_open()) {
        while (getline (txtFile, line)) {
            cout << line << endl;
        }
        txtFile.close();
    } else {
        cout << "Unable to open file";
    }
    MPI_Finalize(); /* end MPI */
    return 0;
}

With best regards,
-Belaid.
Re: [OMPI users] Open MPI access the same file in parallel ?
thanks. I am using the GNU mpic++ compiler. Can it automatically support accessing a file from many parallel processes? thanks

> Date: Wed, 9 Mar 2011 22:54:18 -0800
> From: n...@aol.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI access the same file in parallel ?
>
> On 3/9/2011 8:57 PM, David Zhang wrote:
> > Under my programming environment, FORTRAN, it is possible to parallel
> > read (using native read function instead of MPI's parallel read
> > function). Although you'll run into problem when you try to parallel
> > write to the same file.
>
> If your Fortran compiler/library is reasonably up to date, you will
> need to specify action='read', as opening once with the default
> readwrite will lock out other processes.
> --
> Tim Prince
[OMPI users] OMPI seg fault by a class with weird address.
Hi, I got a run-time error in an Open MPI C++ program. The following output is from gdb:

--
Program received signal SIGSEGV, Segmentation fault.
0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0

At the point:

Breakpoint 9, Index::Index (this=0x7fffcb80) at src/index.cpp:20
20          Name(0) {}

The Index constructor had been called before this point with no problem:
---
Breakpoint 9, Index::Index (this=0x117d800) at src/index.cpp:20
20          Name(0) {}
(gdb) c
Continuing.
Breakpoint 9, Index::Index (this=0x117d860) at src/index.cpp:20
20          Name(0) {}
(gdb) c
Continuing.

It seems that the 0x7fffcb80 address is the problem, but I do not know the reason or how to remove the bug. Any help is really appreciated. thanks

The following is the Index definition:
-
class Index {
public:
    Index();
    Index(const Index& rhs);
    ~Index();
    Index& operator=(const Index& rhs);
    vector<int> GetPosition() const;
    vector<int> GetColumn() const;
    vector<int> GetYear() const;
    vector<string> GetName() const;
    int GetPosition(const int idx) const;
    int GetColumn(const int idx) const;
    int GetYear(const int idx) const;
    string GetName(const int idx) const;
    int GetSize() const;
    void Add(const int idx, const int col, const string& name);
    void Add(const int idx, const int col, const int year, const string& name);
    void Add(const int idx, const Step& col, const string& name);
    void WriteFile(const char* fileinput) const;
private:
    vector<int> Position;
    vector<int> Column;
    vector<int> Year;
    vector<string> Name;
};

// Constructors and destructor for the Index class
Index::Index() : Position(0), Column(0), Year(0), Name(0) {}

Index::Index(const Index& rhs) :
    Position(rhs.GetPosition()),
    Column(rhs.GetColumn()),
    Year(rhs.GetYear()),
    Name(rhs.GetName()) {}

Index::~Index() {}

Index& Index::operator=(const Index& rhs) {
    Position = rhs.GetPosition();
    Column = rhs.GetColumn();
    Year = rhs.GetYear();
    Name = rhs.GetName();
    return *this;
}
--
Re: [OMPI users] OMPI seg fault by a class with weird address.
Hi, because the code is very long, I just show the calling relationship of the functions:

main() {
    scheduler();
}

scheduler() {
    ImportIndices();
}

ImportIndices() {
    Index IdxNode;
    IdxNode = ReadFile("fileName");
}

Index ReadFile(const char* fileinput) {
    Index TempIndex;
    .
}

vector<int> Index::GetPosition() const { return Position; }
vector<int> Index::GetColumn() const { return Column; }
vector<int> Index::GetYear() const { return Year; }
vector<string> Index::GetName() const { return Name; }
int Index::GetPosition(const int idx) const { return Position[idx]; }
int Index::GetColumn(const int idx) const { return Column[idx]; }
int Index::GetYear(const int idx) const { return Year[idx]; }
string Index::GetName(const int idx) const { return Name[idx]; }
int Index::GetSize() const { return Position.size(); }

The sequential code works well, and it has no scheduler().

The parallel code output from gdb:
--
Breakpoint 1, myNeplanTaskScheduler(CNSGA2 *, int, int, int, ._85 *, char,
    int, message_para_to_workers_VecT &, MPI_Datatype, int &, int &,
    std::vector >, std::allocator > > > &,
    std::vector >, std::allocator > > > &,
    std::vector > &, int,
    std::vector >, std::allocator > > > &,
    MPI_Datatype, int, MPI_Datatype, int)
    (nsga2=0x118c490, popSize=, nodeSize=, myRank=, myChildpop=0x1208d80,
    genCandTag=65 'A', generationNum=1,
    myPopParaVec=std::vector of length 4, capacity 4 = {...},
    message_to_master_type=0x7fffd540, myT1Flag=@0x7fffd68c,
    myT2Flag=@0x7fffd688,
    resultTaskPackageT1=std::vector of length 4, capacity 4 = {...},
    resultTaskPackageT2Pr=std::vector of length 4, capacity 4 = {...},
    xdataV=std::vector of length 4, capacity 4 = {...}, objSize=7,
    resultTaskPackageT12=std::vector of length 4, capacity 4 = {...},
    xdata_to_workers_type=0x121c410, myGenerationNum=1,
    Mpara_to_workers_type=0x121b9b0, nconNum=0)
    at src/nsga2/myNetplanScheduler.cpp:109
109         ImportIndices();
(gdb) c
Continuing.

Breakpoint 2, ImportIndices () at src/index.cpp:120
120         IdxNode = ReadFile("prepdata/idx_node.csv");
(gdb) c
Continuing.

Breakpoint 4, ReadFile (fileinput=0xd8663d "prepdata/idx_node.csv")
    at src/index.cpp:86
86          Index TempIndex;
(gdb) c
Continuing.

Breakpoint 5, Index::Index (this=0x7fffcb80) at src/index.cpp:20
20          Name(0) {}
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0

---
The backtrace output from the above parallel Open MPI code:

(gdb) bt
#0  0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0
#1  0x2b3b2bd3 in opal_memory_ptmalloc2_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0
#2  0x003f7c8bd1dd in operator new(unsigned long) ()
   from /usr/lib64/libstdc++.so.6
#3  0x004646a7 in __gnu_cxx::new_allocator::allocate (this=0x7fffcb80, __n=0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/ext/new_allocator.h:88
#4  0x004646cf in std::_Vector_base >::_M_allocate (this=0x7fffcb80, __n=0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_vector.h:127
#5  0x00464701 in std::_Vector_base >::_Vector_base (this=0x7fffcb80, __n=0, __a=...)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_vector.h:113
#6  0x00464d0b in std::vector >::vector (this=0x7fffcb80, __n=0, __value=@0x7fffc968, __a=...)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_vector.h:216
#7  0x004890d7 in Index::Index (this=0x7fffcb80)
    at src/index.cpp:20
#8  0x0048927a in ReadFile (fileinput=0xd8663d "prepdata/idx_node.csv")
    at src/index.cpp:86
#9  0x00489533 in ImportIndices () at src/index.cpp:120
#10 0x00445e0e in myNeplanTaskScheduler(CNSGA2 *, int, int, int, ._85 *, char,
    int, message_para_to_workers_VecT &, MPI_Datatype, int &, int &, ...)
    (nsga2=0x118c490, popSize=, nodeSize=, myRank=, myChildpop=0x1208d80,
    genCandTag=65 'A', generationNum=1,
    myPopParaVec=std::vector of length 4, capacity 4 = {...},
    message_to_master_type=0x7fffd540, myT1Flag=@0x7fffd68c,
    myT2Flag=@0x7fffd688,
    resultTaskPackageT1=std::vector of length 4, capacity 4 = {...},
    resultTaskPackageT2Pr=std::vector of length 4, capacity 4 = {...},
    xdataV=std::vector of length 4, capacity 4 = {...}, objSize=7,
    resultTaskPackageT12=std::vector of length 4, capacity 4 = {..
Re: [OMPI users] OMPI seg fault by a class with weird address.
Thanks. I do not have system administrator authorization, so I am afraid I cannot rebuild Open MPI --without-memory-manager. Are there other ways to get around it? For example, can something else replace "ptmalloc"? Any help is really appreciated. thanks

From: belaid_...@hotmail.com
To: dtustud...@hotmail.com; us...@open-mpi.org
Subject: RE: [OMPI users] OMPI seg fault by a class with weird address.
Date: Tue, 15 Mar 2011 08:00:56

Hi Jack, I may need to see the whole code to decide, but my quick look suggests that ptmalloc is causing a problem with the STL vector allocation. ptmalloc is Open MPI's internal malloc library. Could you try to build Open MPI without memory management (using --without-memory-manager) and let us know the outcome? ptmalloc is not needed if you are not using an RDMA interconnect.

With best regards,
-Belaid.
Re: [OMPI users] OMPI seg fault by a class with weird address.
Thanks. From http://valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap I find that "Currently the wrappers are only buildable with mpiccs which are based on GNU GCC or Intel's C++ Compiler." The cluster I am working on uses the GNU Open MPI mpic++, and I am not sure whether the Valgrind wrapper can work here. I do not have system administrator authorization. Are there other (open source) memory checkers that can do this? thanks Jack

> Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Tue, 15 Mar 2011 06:19:53 -0400
> CC: dtustud...@hotmail.com
> To: us...@open-mpi.org
>
> You may also want to run your program through a memory-checking
> debugger such as valgrind to see if it turns up any other problems.
>
> AFAIK, ptmalloc should be fine for use with STL vector allocation.
>
> On Mar 15, 2011, at 4:00 AM, Belaid MOA wrote:
>
> > Hi Jack,
> > I may need to see the whole code to decide but my quick look suggests
> > that ptmalloc is causing a problem with STL-vector allocation.
Re: [OMPI users] OMPI seg fault by a class with weird address.
I have tried

export OMPI_MCA_memory_ptmalloc2_disable=1

It does not work; I get the same error. thanks

From: sam...@lanl.gov
To: us...@open-mpi.org
Date: Tue, 15 Mar 2011 09:27:35 -0600
Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.

I -think- setting OMPI_MCA_memory_ptmalloc2_disable to 1 will turn off OMPI's memory wrappers without having to rebuild. Someone please correct me if I'm wrong :-). For example (bash-like shell):

export OMPI_MCA_memory_ptmalloc2_disable=1

Hope that helps,
--
Samuel K. Gutierrez
Los Alamos National Laboratory
Re: [OMPI users] OMPI seg fault by a class with weird address.
This should be the configure info about Open MPI which I am using. -bash-3.2$ mpic++ -v Using built-in specs. Target: x86_64-redhat-linux Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --disable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux Thread model: posix gcc version 4.1.2 20080704 (Red Hat 4.1.2-50) thanks From: sam...@lanl.gov To: us...@open-mpi.org List-Post: users@lists.open-mpi.org Date: Tue, 15 Mar 2011 09:27:35 -0600 Subject: Re: [OMPI users] OMPI seg fault by a class with weird address. I -think- setting OMPI_MCA_memory_ptmalloc2_disable to 1 will turn off OMPI's memory wrappers without having to rebuild. Someone please correct me if I'm wrong :-). For example (bash-like shell): export OMPI_MCA_memory_ptmalloc2_disable=1 Hope that helps, --Samuel K. GutierrezLos Alamos National Laboratory On Mar 15, 2011, at 9:19 AM, Jack Bryan wrote:Thanks, I do not have system administrator authorization. I am afraid that I cannot rebuild OpenMPI --without-memory-manager. Are there other ways to get around it ? For example, use other things to replace "ptmalloc" ? Any help is really appreciated. thanks From: belaid_...@hotmail.com To: dtustud...@hotmail.com; us...@open-mpi.org Subject: RE: [OMPI users] OMPI seg fault by a class with weird address. List-Post: users@lists.open-mpi.org Date: Tue, 15 Mar 2011 08:00:56 + Hi Jack, I may need to see the whole code to decide but my quick look suggest that ptmalloc is causing a problem with STL-vector allocation. ptmalloc is the openMPI internal malloc library. 
Could you try to build openMPI without memory management (using --without-memory-manager) and let us know the outcome. ptmalloc is not needed if you are not using an RDMA interconnect. With best regards, -Belaid. From: dtustud...@hotmail.com To: belaid_...@hotmail.com; us...@open-mpi.org Subject: RE: [OMPI users] OMPI seg fault by a class with weird address. List-Post: users@lists.open-mpi.org Date: Tue, 15 Mar 2011 00:30:19 -0600 Hi, Because the code is very long, I just show the calling relationship of functions. main(){scheduler(); }scheduler(){ ImportIndices();} ImportIndices(){Index IdxNode ; IdxNode = ReadFile("fileName");} Index ReadFile(const char* fileinput) { Index TempIndex;. } vector Index::GetPosition() const { return Position; }vector Index::GetColumn() const { return Column; }vector Index::GetYear() const { return Year; }vector Index::GetName() const { return Name; }int Index::GetPosition(const int idx) const { return Position[idx]; }int Index::GetColumn(const int idx) const { return Column[idx]; }int Index::GetYear(const int idx) const { return Year[idx]; }string Index::GetName(const int idx) const { return Name[idx]; }int Index::GetSize() const { return Position.size(); } The sequential code works well, and there is no scheduler(). 
The parallel code output from gdb:
--
Breakpoint 1, myNeplanTaskScheduler(CNSGA2 *, int, int, int, ._85 *, char, int, message_para_to_workers_VecT &, MPI_Datatype, int &, int &, std::vector >, std::allocator > > > &, std::vector >, std::allocator > > > &, std::vector > &, int, std::vector >, std::allocator > > > &, MPI_Datatype, int, MPI_Datatype, int) (nsga2=0x118c490, popSize=, nodeSize=, myRank=, myChildpop=0x1208d80, genCandTag=65 'A', generationNum=1, myPopParaVec=std::vector of length 4, capacity 4 = {...}, message_to_master_type=0x7fffd540, myT1Flag=@0x7fffd68c, myT2Flag=@0x7fffd688, resultTaskPackageT1=std::vector of length 4, capacity 4 = {...}, resultTaskPackageT2Pr=std::vector of length 4, capacity 4 = {...}, xdataV=std::vector of length 4, capacity 4 = {...}, objSize=7, resultTaskPackageT12=std::vector of length 4, capacity 4 = {...}, xdata_to_workers_type=0x121c410, myGenerationNum=1, Mpara_to_workers_type=0x121b9b0, nconNum=0) at src/nsga2/myNetplanScheduler.cpp:109
109         ImportIndices();
(gdb) c
Continuing.

Breakpoint 2, ImportIndices () at src/index.cpp:120
120         IdxNode = ReadFile("prepdata/idx_node.csv");
(gdb) c
Continuing.

Breakpoint 4, ReadFile (fileinput=0xd8663d "prepdata/idx_node.csv") at src/index.cpp:86
86          Index TempIndex;
(gdb) c
Continuing.

Breakpoint 5, Index::Index (this=0x7fffcb80) at src/index.cpp:20
20          Name(0) {}
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
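Samuel's suggestion near the top of this message (setting OMPI_MCA_memory_ptmalloc2_disable=1) needs to reach every rank, not just the shell that launches mpirun. A sketch of one way to do that (the -x option is standard Open MPI mpirun; the process count and application name are placeholders):

```shell
# Disable Open MPI's ptmalloc2 memory wrappers for this run only.
# "mpirun -x VAR" forwards VAR from the launching shell's environment
# to every spawned rank.
export OMPI_MCA_memory_ptmalloc2_disable=1
mpirun -x OMPI_MCA_memory_ptmalloc2_disable -np 6 ./my_application
```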
Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Tue, 15 Mar 2011 12:50:41 -0400
> CC: us...@open-mpi.org
> To: dtustud...@hotmail.com
>
> You can:
>
> mpirun -np 4 valgrind ./my_application
>
> That is, you run 4 copies of valgrind, each with one instance of ./my_application. Then you'll get valgrind reports for your applications. You might want to dig into the valgrind command-line options to have it dump the results to files with unique prefixes (e.g., PID and/or hostname) so that you can get a unique report from each process.
>
> If you disabled ptmalloc and you're still getting the same error, then it sounds like an application error. Check out and see what valgrind tells you.
>
> On Mar 15, 2011, at 11:25 AM, Jack Bryan wrote:
>
> > Thanks,
> >
> > From http://valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap I find that "Currently the wrappers are only buildable with mpiccs which are based on GNU GCC or Intel's C++ Compiler."
> >
> > The cluster which I am working on is using GNU Open MPI mpic++. I am afraid that the Valgrind wrapper cannot work here.
> >
> > I do not have system administrator authorization.
> >
> > Are there other (open source) memory checkers that can do this?
> >
> > thanks
> >
> > Jack
> >
> > > Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> > > From: jsquy...@cisco.com
> > > Date: Tue, 15 Mar 2011 06:19:53 -0400
> > > CC: dtustud...@hotmail.com
> > > To: us...@open-mpi.org
> > >
> > > You may also want to run your program through a memory-checking debugger such as valgrind to see if it turns up any other problems.
> > >
> > > AFAIK, ptmalloc should be fine for use with STL vector allocation.
> > >
> > > On Mar 15, 2011, at 4:00 AM, Belaid MOA wrote:
> > >
> > > > Hi Jack, I may need to see the whole code to decide, but my quick look suggests that ptmalloc is causing a problem with STL vector allocation. ptmalloc is the Open MPI internal malloc library. Could you try to build Open MPI without memory management (using --without-memory-manager) and let us know the outcome. ptmalloc is not needed if you are not using an RDMA interconnect.
> > > >
> > > > With best regards,
> > > > -Belaid.
> > > >
> > > > [...]
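Jeff's suggestion above, dumping each process's valgrind report to a uniquely named file, maps onto valgrind's --log-file option with the %p (PID) placeholder. A sketch (process count and application name are placeholders):

```shell
# One valgrind report per MPI rank: %p expands to each process's PID,
# so rank output is not interleaved on stdout.
mpirun -np 4 valgrind --log-file=valgrind.%p.log ./my_application
```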
Re: [OMPI users] OMPI seg fault by a class with weird address.
[...] (myNetplanScheduler.cpp:109)
==18729==    by 0x44F2DF: main (main-parallel2.cpp:216)

Note: see also the FAQ in the source distribution. It contains workarounds to several common problems. In particular, if Valgrind aborted or crashed after identifying problems in your program, there's a good chance that fixing those problems will prevent Valgrind aborting or crashing, especially if it happened in m_mallocfree.c.

> Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Wed, 16 Mar 2011 06:43:01 -0400
> To: dtustud...@hotmail.com
> CC: us...@open-mpi.org
>
> Did you run with a memory checking debugger like Valgrind?
>
> Sent from my phone. No type good.
>
> On Mar 15, 2011, at 8:30 PM, "Jack Bryan" wrote:
>
> > Hi,
> >
> > I have installed a new Open MPI 1.3.4. But I got more weird errors:
> >
> > *** glibc detected *** /lustre/nsga2b: malloc(): memory corruption (fast): 0x1cafc450 ***
> > === Backtrace: ===
> > /lib64/libc.so.6[0x3c50272aeb]
> > /lib64/libc.so.6(__libc_malloc+0x7a)[0x3c5027402a]
> > /usr/lib64/libstdc++.so.6(_Znwm+0x1d)[0x3c590bd17d]
> > /lustre/jxding/netplan49/nsga2b[0x445bc6]
> > /lustre/jxding/netplan49/nsga2b[0x44f43b]
> > /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c5021d974]
> > /lustre/jxding/netplan49/nsga2b(__gxx_personality_v0+0x499)[0x443909]
> > === Memory map: ===
> > 0040-00f33000 r-xp 6ac:e3210 685016360 /lustre/netplan49/nsga2b
> > 01132000-0117e000 rwxp 00b32000 6ac:e3210 685016360 /lustre/netplan49/nsga2b
> > 0117e000-01188000 rwxp 0117e000 00:00 0
> > 1ca11000-1ca78000 rwxp 1ca11000 00:00 0
> > 1ca78000-1ca79000 rwxp 1ca78000 00:00 0
> > 1ca79000-1ca7a000 rwxp 1ca79000 00:00 0
> > 1ca7a000-1cab8000 rwxp 1ca7a000 00:00 0
> > 1cab8000-1cac7000 rwxp 1cab8000 00:00 0
> > 1cac7000-1cacf000 rwxp 1cac7000 00:00 0
> > 1cacf000-1cad rwxp 1cacf000 00:00 0
> > 1cad-1cad1000 rwxp 1cad 00:00 0
> > 1cad1000-1cad2000 rwxp 1cad1000 00:00 0
> > 1cad2000-1cada000 rwxp 1cad2000 00:00 0
> > 1cada000-1cadc000 rwxp 1cada000 00:00 0
> > 1cadc000-1cae rwxp 1cadc000 00:00 0
> > ...
> > 51260-3512605000 r-xp 00:11 12043 /usr/lib64/librdmacm.so.1
> > 3512605000-3512804000 ---p 5000 00:11 12043 /usr/lib64/librdmacm.so.1
> > 3512804000-3512805000 rwxp 4000 00:11 12043 /usr/lib64/librdmacm.so.1
> > 3512e0-3512e0c000 r-xp 00:11 5545 /usr/lib64/libibverbs.so.1
> > 3512e0c000-351300b000 ---p c000 00:11 5545 /usr/lib64/libibverbs.so.1
> > 351300b000-351300c000 rwxp b000 00:11 5545 /usr/lib64/libibverbs.so.1
> > 3c4f20-3c4f21c000 r-xp 00:11 2853 /lib64/ld-2.5.so
> > 3c4f41b000-3c4f41c000 r-xp 0001b000 00:11 2853 /lib64/ld-2.5.so
> > 3c4f41c000-3c4f41d000 rwxp 0001c000 00:11 2853 /lib64/ld-2.5.so
> > 3c5020-3c5034c000 r-xp 00:11 897 /lib64/libc.so.6
> > 3c5034c000-3c5054c000 ---p 0014c000 00:11 897 /lib64/libc.so.6
> > 3c5054c000-3c5055 r-xp 0014c000 00:11 897 /lib64/libc.so.6
> > 3c5055-3c50551000 rwxp 0015 00:11 897 /lib64/libc.so.6
> > 3c50551000-3c50556000 rwxp 3c50551000 00:00 0
> > 3c5060-3c50682000 r-xp 00:11 2924 /lib64/libm.so.6
> > 3c50682000-3c50881000 ---p 00082000 00:11 2924 /lib64/libm.so.6
> > 3c50881000-3c50882000 r-xp 00081000 00:11 2924 /lib64/libm.so.6
> > 3c50882000-3c50883000 rwxp 00082000 00:11 2924 /lib64/libm.so.6
> > 3c50a0-3c50a02000 r-xp 00:11 923 /lib64/libdl.so.2
> > 3c50a02000-3c50c02000 ---p 2000 00:11 923 /lib64/libdl.so.2
Re: [OMPI users] Potential bug in creating MPI_GROUP_EMPTY handling
> Date: Thu, 17 Mar 2011 23:40:31 +0100
> From: dominik.goedd...@math.tu-dortmund.de
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Potential bug in creating MPI_GROUP_EMPTY handling
>
> glad we could help and the two hours of stripping things down were effectively not wasted. Also good to hear (implicitly) that we were not too stupid to understand the MPI standard...
>
> Since, to the best of my understanding, our workaround is practically overhead-free, we went ahead and coded everything up analogously to the workaround, i.e. we don't rely on / wait for an immediate fix.
>
> Please let us know if further information is needed.
>
> Thanks,
>
> dom
>
> On 03/17/2011 05:10 PM, Jeff Squyres wrote:
> > Sorry for the late reply, but many thanks for the bug report and reliable reproducer.
> >
> > I've confirmed the problem and filed a bug about this:
> >
> > https://svn.open-mpi.org/trac/ompi/ticket/2752
> >
> > On Mar 6, 2011, at 6:12 PM, Dominik Goeddeke wrote:
> >
> >> The attached example code (stripped down from a bigger app) demonstrates a way to trigger a severe crash in all recent ompi releases but not in a bunch of latest MPICH2 releases. The code is minimalistic and boils down to the call
> >>
> >> MPI_Comm_create(MPI_COMM_WORLD, MPI_GROUP_EMPTY, &dummy_comm);
> >>
> >> which isn't supposed to be illegal. Please refer to the (well-documented) code for details on the high-dimensional cross product I tested (on ubuntu 10.04 LTS), a potential workaround (which isn't supposed to be necessary, I think) and an exemplary stack trace.
> >>
> >> Instructions: mpicc test.c -Wall -O0 && mpirun -np 2 ./a.out
> >>
> >> Thanks!
> >>
> >> dom
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Dr. Dominik Göddeke
> Institut für Angewandte Mathematik
> Technische Universität Dortmund
> http://www.mathematik.tu-dortmund.de/~goeddeke
> Tel.
> +49-(0)231-755-7218 Fax +49-(0)231-755-5933
[OMPI users] OMPI free() error
Hi, I am running a C++ program with OMPI. I got this error:

*** glibc detected *** /nsga2b: free(): invalid next size (fast): 0x01817a90 ***

I used GDB:

=== Backtrace: ===
Program received signal SIGABRT, Aborted.
0x0038b8830265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0038b8830265 in raise () from /lib64/libc.so.6
#1  0x0038b8831d10 in abort () from /lib64/libc.so.6
#2  0x0038b886a99b in __libc_message () from /lib64/libc.so.6
#3  0x0038b887245f in _int_free () from /lib64/libc.so.6
#4  0x0038b88728bb in free () from /lib64/libc.so.6
#5  0x0044a4e3 in workerRunTask (message_to_master_type=0x38c06efe18, nodeSize=2, myRank=1, xVSize=84, objSize=7, xdata_to_workers_type=0x1206350, recvXDataVec=std::vector of length 0, capacity 84, myNsga2=..., Mpara_to_workers_type=0x1205390, events=0x7fffb1f0, netplan=...) at src/nsga2/workerRunTask.cpp:447
#6  0x004514d9 in main (argc=1, argv=0x7fffcb48) at src/nsga2/main-parallel2.cpp:425

In valgrind, there are some invalid reads and writes, but no errors about this "free(): invalid next size".

The relevant code:

(populp.ind)->xreal = new double[nreal];
(populp.ind)->obj = new double[nobj];
(populp.ind)->constr = new double[ncon];
(populp.ind)->xbin = new double[nbin];
if ((populp.ind)->xreal == NULL || (populp.ind)->obj == NULL || (populp.ind)->constr == NULL || (populp.ind)->xbin == NULL)
{
#ifdef DEBUG_workerRunTask
    cout << "In workerRunTask(), I am rank " << myRank << ": (populp.ind)->xreal or (populp.ind)->obj or (populp.ind)->constr or (populp.ind)->xbin is NULL.\n\n" << endl;
#endif
}
...
delete [] (populp.ind)->xreal;
delete [] (populp.ind)->xbin;
delete [] (populp.ind)->obj;
delete [] (populp.ind)->constr;
delete [] sendResultArrayPr;

Any help is really appreciated.

thanks
Re: [OMPI users] OMPI seg fault by a class with weird address.
thanks, I forgot to set the size of a vector before using the [] operator on it.

thanks

> Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Wed, 16 Mar 2011 20:20:20 -0400
> CC: us...@open-mpi.org
> To: dtustud...@hotmail.com
>
> Make sure you have the latest version of valgrind.
>
> But it definitely does highlight what could be real problems if you read down far enough in the output.
>
> > ==18729== Invalid write of size 8
> > ==18729==    at 0x443BEF: initPopPara(population*, std::vector > std::allocator >&, initParaType&, int, int, std::vector >&) (main-parallel2.cpp:552)
> > ==18729==    by 0x44F12E: main (main-parallel2.cpp:204)
> > ==18729== Address 0x62c9da0 is 0 bytes after a block of size 0 alloc'd
> > ==18729==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
> > ==18729==    by 0x4573E4: void std::__uninitialized_fill_n_aux > message_para_to_workersT>(message_para_to_workersT*, unsigned long, message_para_to_workersT const&, __false_type) (new_allocator.h:88)
> > ==18729==    by 0x4576CF: void std::__uninitialized_fill_n_a > message_para_to_workersT, message_para_to_workersT>(message_para_to_workersT*, unsigned long, message_para_to_workersT const&, std::allocator) (stl_uninitialized.h:218)
> > ==18729==    by 0x44EE2E: main (stl_vector.h:218)
>
> The above is an invalid write of size 8 -- you're essentially writing outside of an array.
>
> Valgrind is showing you the call stack to how it got there. Looks like you new'ed or malloc'ed a block of size 0 and then tried to write something to it. Writing to memory that you don't own is a no-no; it can cause Very Bad Things to happen.
>
> You should probably investigate this, and the other issues that it is reporting (e.g., the next invalid read of size 8).
> > ==18729==
> > ==18729== Invalid read of size 8
> > ==18729==    at 0x44F13A: main (main-parallel2.cpp:208)
> > ==18729== Address 0x62c9d60 is 0 bytes after a block of size 0 alloc'd
> > ==18729==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
> > ==18729==    by 0x45733D: void std::__uninitialized_fill_n_aux > message_para_to_workersT>(message_para_to_workersT*, unsigned long, message_para_to_workersT const&, __false_type) (new_allocator.h:88)
> > ==18729==    by 0x4576CF: void std::__uninitialized_fill_n_a > message_para_to_workersT, message_para_to_workersT>(message_para_to_workersT*, unsigned long, message_para_to_workersT const&, std::allocator) (stl_uninitialized.h:218)
> > ==18729==    by 0x44EE2E: main (stl_vector.h:218)
> > ==18729==
> >
> > valgrind: m_mallocfree.c:225 (mk_plain_bszB): Assertion 'bszB != 0' failed.
> > valgrind: This is probably caused by your program erroneously writing past the end of a heap block and corrupting heap metadata. If you fix any invalid writes reported by Memcheck, this assertion failure will probably go away. Please try that before reporting this as a bug.
> > ==18729==    at 0x38029D5C: report_and_quit (m_libcassert.c:145)
> > ==18729==    by 0x3802A032: vgPlain_assert_fail (m_libcassert.c:217)
> > ==18729==    by 0x38035645: vgPlain_arena_malloc (m_mallocfree.c:225)
> > ==18729==    by 0x38002BB5: vgMemCheck_new_block (mc_malloc_wrappers.c:199)
> > ==18729==    by 0x38002F6B: vgMemCheck___builtin_new (mc_malloc_wrappers.c:246)
> > ==18729==    by 0x3806070C: do_client_request (scheduler.c:1362)
> > ==18729==    by 0x38061D30: vgPlain_scheduler (scheduler.c:1061)
> > ==18729==    by 0x38085E6E: run_a_thread_NORETURN (syswrap-linux.c:91)
> >
> > sched status:
> >   running_tid=1
> >
> > Thread 1: status = VgTs_Runnable
> > ==18729==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
> > ==18729==    by 0x464506: __gnu_cxx::new_allocator::allocate(unsigned long, void const*) (new_allocator.h:88)
> > ==18729==    by 0x46452E: std::_Vector_base >::_M_allocate(unsigned long) (stl_vector.h:127)
> > ==18729==    by 0x464560: std::_Vector_base >::_Vector_base(unsigned long, std::allocator const&) (stl_vector.h:113)
> > ==18729==    by 0x464B6A: std::vector >::vector(unsigned long, int const&, std::allocator const&) (stl_vector.h:216)
> > ==18729==    by 0x488F62: Index::Index() (index.cpp:20)
> > ==18729==    by 0x489147: ReadFile(char const*) (index.cpp:86)
> > ==18729==    by 0x48941C: ImportIndices() (index.cpp:121)
> > ==18729==    by 0x445D00: myNeplanTaskScheduler(CNSGA2*, int, int, int, population*, char, int, std::vector > std::allocator >&, ompi_datatype_t*, int&, int&, std::vector >, std::allocator > > >&, std::vector >, std::allocator > > >&, std::vector >&, int, std::vector >, std::allocator > > >&, ompi_datatype_t*, int, ompi_datatype_t*, int) (myNetplanScheduler.cpp:109)
[OMPI users] OMPI error terminate w/o reasons
Hi, All:

I am running an Open MPI (1.3.4) program with 200 parallel processes. But the program is terminated with:

--
mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
--

After searching, signal 9 means: the process is currently in an unworkable state and should be terminated with extreme prejudice. If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away. The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).

But the error message does not indicate any possible reasons for the termination.

There is a FOR loop in the main() program; if the loop count is small (< 200), the program works well, but as it becomes larger and larger, the program gets SIGKILL.

The cluster where I am running the MPI program does not allow running debug tools. If I run it on a workstation, it will take a very, very long time (for > 200 loops) to make the error occur again.

What can I do to find the possible bugs? Any help is really appreciated.

thanks

Jack
Re: [OMPI users] OMPI error terminate w/o reasons
Hi,

I have tried this. But the printout from 200 parallel processes makes it very hard to locate the possible bug. They may not stop at the same point when the program gets signal 9. So, even though I can figure out the print statements from all 200 processes, the many different locations where the processes are stopped make it harder to find hints about the bug.

Are there some other programming tricks which can help me narrow down to the suspect points ASAP? Any help is appreciated.

Jack

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 07:53:40 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

Try adding some print statements so you can see where the error occurs.

On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:

[...]
Re: [OMPI users] OMPI error terminate w/o reasons
Hi,

I am working on a cluster where I am not allowed to install software in system folders. My Open MPI is 1.3.4.

I have had a very quick look at padb on http://padb.pittman.org.uk/. Does it require some software to be installed on the cluster in order to use it? I cannot use the command line to run jobs on the cluster, only a script.

thanks

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 12:12:11 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

Have you tried a parallel debugger such as padb?

On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:

[...]
Re: [OMPI users] OMPI error terminate w/o reasons
Is it possible to have padb print the stack trace and other program execution information to a file?

I can run the program in gdb like this:

mpirun -np 200 -e gdb ./myapplication

How can I make gdb print the debug information to a file, so that I can check it when the program is terminated?

thanks

Jack

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 13:56:13 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

You don't need to install anything in a system folder - you can just install it in your home directory, assuming that is accessible on the remote nodes.

As for the script - unless you can somehow modify it to allow you to run under a debugger, I am afraid you are completely out of luck.

On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:

[...]
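On getting gdb output into a file: one common pattern (a sketch; -batch, -ex, and --args are standard gdb options, but the wrapper name and log naming are hypothetical) is to run each rank under batch-mode gdb and redirect its output to a per-process file:

```shell
#!/bin/sh
# wrapper.sh - run one rank under non-interactive gdb, logging per PID.
# -batch: exit when done; -ex run: start the program;
# -ex bt: print a backtrace when it stops (e.g. on SIGSEGV or SIGABRT).
exec gdb -batch -ex run -ex bt --args "$@" > gdb.$$.log 2>&1
```

Invoked as, e.g., `mpirun -np 200 ./wrapper.sh ./myapplication`. Note the caveat: gdb cannot report on a process killed with SIGKILL, since SIGKILL cannot be caught, so this helps mainly with segfault/abort-type failures.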
Re: [OMPI users] OMPI error terminate w/o reasons
The cluster can print all output into one file, but checking it for bugs is very hard. The cluster also prints possible error messages into one file, but sometimes the error file is empty, and sometimes it is signal 9.

If I only run dummy tasks on worker nodes, there are no errors. If I run real tasks, sometimes processes are terminated without any errors before the program exits normally. Sometimes the program gets signal 9 but no other error messages. It is weird.

Any help is really appreciated.

Jack

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 15:18:53 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

I don't know, but Ashley may be able to help - or you can see his web site for instructions.

Alternatively, since you can put print statements into your code, have you considered using mpirun's option to direct output from each rank into its own file? Look at "mpirun -h" for the options:

-output-filename|--output-filename
    Redirect output from application processes into filename.rank

On Mar 26, 2011, at 2:48 PM, Jack Bryan wrote:

[...]
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, I used:

mpirun -np 200 -rf --output-filename /mypath/myapplication

But no files are printed out. Can the "--debug" option help me here? When I tried:

-bash-3.2$ mpirun -debug
--------------------------------------------------------------------------
A suitable debugger could not be found in your PATH. Check the values
specified in the orte_base_user_debugger MCA parameter for the list of
debuggers that was searched.
--------------------------------------------------------------------------

Any help is really appreciated. thanks

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 15:45:39 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

If you use that mpirun option, mpirun will place the output from each rank into a -separate- file for you. Give it:

mpirun --output-filename /myhome/debug/run01

and in /myhome/debug you will find files:

run01.0
run01.1
...

each with the output from the indicated rank.
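The same per-rank file layout can also be produced inside the application itself, as a fallback when the mpirun option misbehaves. A minimal sketch (plain Python; file names are illustrative, not from the thread). It relies on the OMPI_COMM_WORLD_RANK environment variable that Open MPI's mpirun exports to each process, falling back to rank 0 when run standalone:

```python
# Do-it-yourself per-rank output files: each process opens its own log,
# named after its rank, so 200 ranks give run01.0 ... run01.199 instead
# of one interleaved stream. File names here are illustrative.
import os

# mpirun (Open MPI) sets OMPI_COMM_WORLD_RANK for every launched process;
# when run without mpirun this falls back to "0".
rank = os.environ.get("OMPI_COMM_WORLD_RANK", "0")

with open(f"run01.{rank}", "w") as log:
    log.write(f"rank {rank} starting\n")   # per-rank debug output goes here
```

Each print statement in the application would then be written to the rank's own file, which makes it much easier to see how far an individual rank got before a crash.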
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, I have figured out how to run the command:

OMPI_RANKFILE=$HOME/$PBS_JOBID.ranks
mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib -output-filename 700g200i200p14ye ./myapplication

Each process prints to a distinct file. But the program is terminated with the error:

=>> PBS: job killed: node 18 (n314) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
mpirun: Forwarding signal 10 to job
mpirun: killing job...
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
n341 n338 n337 n336 n335 n334 n333 n332 n331 n329 n328 n326 n324 n321 n318 n316 n315 n314 n313 n312 n309 n308 n306 n305

After searching, I find that the error is probably related to the highly frequent I/O activity. I have also run valgrind to do a memory check, to find the possible reason for the original signal 9 (SIGKILL) problem:

mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes --log-file=nsga2b_g700_pop200_p200_valg_cystorm_mpi.log ./myapplication

But I got a similar error to the above. What does the error mean? I cannot change the file system of the cluster. I only want a way to find the bug, which only appears when the problem size is very large. But I am stuck on the SIGKILL and now the SISTER_EOF issue. Any help is really appreciated. thanks, Jack

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 20:47:19 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

That command line cannot possibly work. Both the -rf and --output-filename options require arguments. PLEASE read the documentation: "mpirun -h" or "man mpirun" will tell you how to correctly use these options.
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, my original bug is:

mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).

The main framework of my code is:

main() {
    for master node:
        while (loop <= LOOP_NUMBER) {
            master node distributes tasks to workers;
            master collects results from workers;
            ++loop;
        }
    for worker nodes:
    {
        get the task;
        run the task;      // call CPLEX API lib
        return results to master;
    }
}

When LOOP_NUMBER <= 600 (with 200 parallel processes) it works well, but when LOOP_NUMBER >= 700 (with 200 parallel processes) it gets the error. Could a limit in my Torque setup be the reason for the above error? Torque seems to complain about the high I/O caused by printing something from each process, but if I comment out the print statements in my code the Torque complaints go away while the signal 9 error is still there. Any help is really appreciated. thanks, Jack

From: r...@open-mpi.org
Date: Sun, 27 Mar 2011 13:08:31 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

It means that Torque is unhappy with your job - either you are running longer than it permits, or you exceeded some other system limit. Talk to your sys admin about imposed limits. Usually there are flags you can provide with your job submission that allow you to change limits for your program.
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, I use MPI_Barrier to make all processes terminate at the same time:

int main() {
    for master node:
        while (loop <= LOOP_NUMBER) {
            master node distributes tasks to workers;
            master collects results from workers;
            ++loop;
        }
    for worker nodes:
    {
        get the task;
        run the task;      // call CPLEX API lib
        return results to master;
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

thanks

From: solarbik...@gmail.com
Date: Sun, 27 Mar 2011 15:32:51 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

This might not have anything to do with your problem, but how do you finalize your worker nodes when your master loop terminates?
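The question about finalizing the workers matters: after the master's loop ends, each worker must be told explicitly to stop receiving, or it blocks forever in its receive and never reaches the barrier and MPI_Finalize. A toy model of that shutdown handshake in plain Python (no MPI; queues and processes stand in for ranks, and the "stop" message is a sentinel, sometimes called a poison pill):

```python
# Master/worker shutdown toy: the master distributes tasks, collects the
# results, then sends one sentinel per worker so each worker exits its
# receive loop cleanly (the MPI analogue would be a "no more work" tag).
import multiprocessing as mp

STOP = None  # sentinel standing in for a "no more work" message

def worker(tasks, results):
    while True:
        task = tasks.get()          # like the worker's blocking MPI_Recv
        if task is STOP:
            break                   # safe now to reach MPI_Barrier/MPI_Finalize
        results.put(task * task)    # "run the task; return results to master"

def run_demo(n_workers=3, n_tasks=6):
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for t in range(n_tasks):        # master distributes tasks
        tasks.put(t)
    out = sorted(results.get() for _ in range(n_tasks))  # master collects
    for p in procs:                 # one sentinel per worker
        tasks.put(STOP)
    for p in procs:
        p.join()
    return out

if __name__ == "__main__":
    print(run_demo())               # [0, 1, 4, 9, 16, 25]
```

Without the sentinel step, the workers never return and the job appears to hang until the batch system kills it, which can look exactly like an unexplained signal 9.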
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, The job queue has a time budget, which is set in my job script. For example, my current queue allows 24 hours, but my program got SIGKILL (signal 9) within no more than 2 hours of starting to run. Are there other possible settings that I need to consider? thanks, Jack

> From: jsquy...@cisco.com
> Date: Sun, 27 Mar 2011 20:29:11 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> +1 on what Ralph is saying.
>
> You need to talk to your local administrators and ask them why Torque is
> killing your job. Perhaps you're submitting to a queue that only allows jobs
> to run for a few seconds, or something like that.
[OMPI users] OMPI not calling finalize error
Hi, When I run a parallel program, I get an error:

[n333:129522] *** Process received signal ***
[n333:129522] Signal: Segmentation fault (11)
[n333:129522] Signal code: Address not mapped (1)
[n333:129522] Failing at address: 0x40
[n333:129522] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
[n333:129522] [ 1] /opt/openmpi-1.3.4-gnu/lib/libmpi.so.0 [0x4cd19b1]
[n333:129522] [ 2] /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0(opal_progress+0x75) [0x52e5165]
[n333:129522] [ 3] /opt/openmpi-1.3.4-gnu/lib/libopen-rte.so.0 [0x508565c]
[n333:129522] [ 4] /opt/openmpi-1.3.4-gnu/lib/libmpi.so.0 [0x4c653eb]
[n333:129522] [ 5] /opt/openmpi-1.3.4-gnu/lib/libmpi.so.0(MPI_Init+0x120) [0x4c84b90]
[n333:129522] [ 6] /lustre/jxding/netplan49/nsga2b [0x4497f6]
[n333:129522] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974]
[n333:129522] [ 8] /lustre/jxding/netplan49/nsga2b(__gxx_personality_v0+0x499) [0x4436e9]
[n333:129522] *** End of error message ***

mpirun has exited due to process rank 24 with PID 129522 on node n333 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

But the program ran for no more than a few minutes; it should take hours to finish. How can it reach "finalize" so fast? Any help is appreciated. Jack
[OMPI users] OMPI monitor each process behavior
Hi, All: I need to monitor the memory usage of each parallel process on a Linux Open MPI cluster. But the top and ps commands cannot help here, because they only show information for the head node. I need to follow the behavior of each process on each cluster node, and I cannot use ssh to access the nodes. The program takes 8 hours to finish. Any help is really appreciated. Jack
Re: [OMPI users] OMPI monitor each process behavior
Hi, I am using mpirun (Open MPI) 1.3.4. But I have these: orte-clean, orted, orte-iof, orte-ps, orterun. Can they do the same thing? If I use them, will they use a lot of memory on each worker node and print a lot of things into log files? Any help is really appreciated. Thanks, Jack

From: r...@open-mpi.org
Date: Wed, 13 Apr 2011 08:09:17 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI monitor each process behavior

What version are you using? If you are using 1.5.x, there is an "orte-top" command that will do what you ask. It queries the daemons to get the info.
Re: [OMPI users] OMPI monitor each process behavior
Hi, If I cannot ssh to a worker node, does that mean my program cannot work correctly? I can run it as 32 nodes * 4 cores/node parallel processes, but for a larger process count, 128 nodes * 1 CPU/node, it is killed by signal 9. Could this be the reason? thanks

> Date: Wed, 13 Apr 2011 05:59:10 -0700
> From: n...@aol.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI monitor each process behavior
>
> On 4/12/2011 8:55 PM, Jack Bryan wrote:
> >
> > I need to monitor the memory usage of each parallel process on a linux
> > Open MPI cluster.
> >
> > But, top, ps command cannot help here because they only show the head
> > node information.
> >
> > I need to follow the behavior of each process on each cluster node.
> Did you consider ganglia et al?
> >
> > I cannot use ssh to access each node.
> How can MPI run?
> >
> > The program takes 8 hours to finish.
>
> --
> Tim Prince
Re: [OMPI users] OMPI monitor each process behavior
Hi, I do not have qrsh. I have qrerun, qrls, qrttoppm, qrun. Can they do the same thing? thanks

> From: re...@staff.uni-marburg.de
> Date: Wed, 13 Apr 2011 16:28:14 +0200
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI monitor each process behavior
>
> On 13.04.2011 at 05:55, Jack Bryan wrote:
>
> > I need to monitor the memory usage of each parallel process on a linux
> > Open MPI cluster. [...] I cannot use ssh to access each node.
>
> What about submitting another job with `mpirun ... ps -e f` or alike - in
> case you can request the same nodes?
>
> Can you `qrsh` to a node via the queuing system?
>
> -- Reuti
Re: [OMPI users] OMPI monitor each process behavior
Hi, I have found why the program is killed by the operating system when the problem size is large: it consumes more memory and causes more memory swapping, which also degrades the program's performance. But I cannot determine which function of the worker process causes the problem. I have used try-catch in my code, but no exception popped out. I found this description: "When the processes running on your server attempt to allocate more memory than your system has available, the kernel begins to swap memory pages to and from the disk. This is done in order to free up sufficient physical memory to meet the RAM allocation requirements of the requestor." I am not sure whether it is really caused by CPLEX (an optimization model solver) or by other routines, or maybe by dynamic memory allocation done by the CPLEX API library in the background. Any help is really appreciated. Jack

From: r...@open-mpi.org
Date: Wed, 13 Apr 2011 10:34:38 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI monitor each process behavior

On Apr 13, 2011, at 10:19 AM, Jack Bryan wrote:

> I am using mpirun (Open MPI) 1.3.4. But I have these: orte-clean, orted,
> orte-iof, orte-ps, orterun. Can they do the same thing?

Unfortunately, no.

> If I use them, will they use a lot of memory on each worker node and print
> a lot of things into log files?

No, but they won't help. orte-top would be run only on the head node (i.e., where you are logged in), and would generate output to your screen. But you don't have it with that release, so the point is moot. Afraid there isn't much else you can do - you might talk to your sys admin and see what tools are available on your cluster for this purpose. Perhaps a nice parallel debugger is available?
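Since orte-top is not available in 1.3.4 and the worker nodes cannot be reached with ssh, one workaround is to have each rank sample its own memory usage and append it to a per-rank log file. A minimal, Linux-specific sketch (plain Python; the helper names and file names are illustrative, not from the thread):

```python
# Each rank samples its own resident-set size from /proc (Linux-specific)
# and appends it to a per-rank log file, since top/ps on the head node
# cannot see processes running on the other nodes.
import os
import re

def rss_kb():
    """Resident set size of the current process in kB, from /proc/self/status."""
    with open("/proc/self/status") as f:
        m = re.search(r"^VmRSS:\s+(\d+)\s+kB", f.read(), re.M)
    return int(m.group(1)) if m else 0

def log_memory(rank, prefix="memlog"):
    # Append one sample; call this periodically, e.g. once per loop iteration,
    # to see which phase of the run grows the memory footprint.
    with open(f"{prefix}.{rank}", "a") as f:
        f.write(f"rss_kb={rss_kb()}\n")

# The rank would normally come from MPI_Comm_rank; Open MPI's mpirun also
# exports it as OMPI_COMM_WORLD_RANK in each process's environment.
log_memory(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
```

Comparing the growth of rss_kb across loop iterations in the per-rank logs would show whether the memory growth happens inside the CPLEX calls or elsewhere, without needing interactive access to the nodes.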
[OMPI users] OMPI vs. network socket communication
Hi, All: What is the relationship between MPI communication and socket communication? Is network socket programming better than MPI? I am a newbie to network socket programming, and I do not know which one is better for parallel/distributed computing. I know that a network socket provides Unix file-like communication between a server and a client. If sockets can also be used for parallel computing, how can MPI work better than them? I know MPI is for homogeneous cluster systems and network sockets are based on internet TCP/IP. Any help is really appreciated. Thanks
Re: [OMPI users] OMPI vs. network socket communication
Thanks for your reply. MPI is for academic purposes; how about business applications? What kinds of parallel/distributed computing environments do financial institutions use for their high-frequency trading? Any help is really appreciated. Thanks

Date: Mon, 2 May 2011 08:34:33 -0400
From: terry.don...@oracle.com
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI vs. network socket communication

On 04/30/2011 08:52 PM, Jack Bryan wrote:

> What is the relationship between MPI communication and socket communication?

MPI may use socket communications to do communications between two processes. Aside from that, they are used for different purposes.

> Is network socket programming better than MPI?

Depends on what you are trying to do. If you are writing a parallel program that may run in multiple environments with different types of performing protocols available for its use, then MPI is probably better. If you are looking to do simple client/server type programming, then socket programming might have an advantage.

> I am a newbie of network socket programming. I do not know which one is
> better for parallel/distributed computing?

IMO, MPI.

> If they can also be used for parallel computing, how can MPI work better
> than them?

There is a lot of stuff that MPI does behind the curtain to make a parallel application's life a lot easier. As far as performance, MPI will not perform better than sockets if it is using sockets as the underlying model; however, the performance difference should be negligible, which makes all the other stuff MPI does for you a big win.

> I know MPI is for homogeneous cluster system and network socket is based
> on internet TCP/IP.

What do you mean by homogeneous cluster? There are some MPIs that can work among different platforms and even different OSes (though some initial setup may be necessary).

Hope this helps,

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com
[OMPI users] OpenMPI data transfer error
Hi, I am using Open MPI to transfer data from the master node to worker nodes. But a worker node can get data that is not what it should get. I have checked the destination node rank, the task tag, and the datatype; all of them are correct. I did an experiment: node 0 sends data to nodes 1, 2, and 3. Only node 3 gets the correct data; nodes 1 and 2 get the wrong data, which is what node 3 should have received. What is the possible reason? I have printed out the data sent by the master node, and it is exactly what nodes 1, 2, and 3 should receive. So why do nodes 1 and 2 get node 3's data? Any help is appreciated. Jack
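One classic cause of this exact symptom (a guess; the original code is not shown in the thread) is reusing a single send buffer across nonblocking sends: by the time the data actually leaves, the buffer already holds the values written for the last destination, so several receivers see the last destination's data. A toy model of the bug in plain Python, with no MPI involved:

```python
# Toy model of the buffer-reuse bug: "isend" keeps only a REFERENCE to the
# buffer and delivery happens later, so overwriting the buffer before the
# transfer completes ships the wrong data - nodes 1 and 2 "get node 3's data".
import queue

mailbox = {1: queue.Queue(), 2: queue.Queue(), 3: queue.Queue()}
pending = []   # (dest, buffer) pairs not yet delivered

def isend(dest, buf):
    # like MPI_Isend: returns immediately; only a reference to buf is captured
    pending.append((dest, buf))

def progress():
    # the "transport" delivers later, reading each buffer at delivery time
    for dest, buf in pending:
        mailbox[dest].put(list(buf))
    pending.clear()

buf = [0]                      # BUG: one buffer shared by all three sends
for dest in (1, 2, 3):
    buf[0] = dest * 10         # data intended for this destination
    isend(dest, buf)           # no wait before overwriting buf again
progress()

print(mailbox[1].get())        # [30] - the data intended for node 3
```

The fix in MPI terms is what the later messages in this archive describe: give each MPI_Isend its own buffer and do not touch it until MPI_Wait (or MPI_Test) reports completion.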
[OMPI users] Open MPI process cannot do send-receive message correctly on a distributed memory cluster
Hi, I have an Open MPI program, which works well on a Linux shared-memory multicore (2 x 6 cores) machine. But it does not work well on a distributed cluster with Linux Open MPI. I found that a process sends out some messages to other processes, which cannot receive them. What is the possible reason? I do not change anything in the program. Any help is really appreciated. Thanks
Re: [OMPI users] Open MPI process cannot do send-receive message correctly on a distributed memory cluster
Thanks, I am using non-blocking MPI_Isend to send out messages and blocking MPI_Recv to get them. Each MPI_Isend uses a distinct buffer to hold the message, which is not changed until the message is received. Then the sender process waits for the MPI_Isend to finish. Before this message is sent out, a heading message (about how much data and what data will be sent out in the following MPI_Isend) is sent out in the same way; those can be received well. Why can the following message (which is larger) not be received? Any help is really appreciated. > Date: Fri, 30 Sep 2011 11:33:16 -0400 > From: raysonlo...@gmail.com > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI process cannot do send-receive message > correctly on a distributed memory cluster > > You can use a debugger (just gdb will do, no TotalView needed) to find > out which MPI send & receive calls are hanging the code on the > distributed cluster, and see if the send & receive pair is due to a > problem described at: > > Deadlock avoidance in your MPI programs: > http://www.cs.ucsb.edu/~hnielsen/cs140/mpi-deadlocks.html > > Rayson > > = > Grid Engine / Open Grid Scheduler > http://gridscheduler.sourceforge.net > > Wikipedia Commons > http://commons.wikimedia.org/wiki/User:Raysonho > > > On Fri, Sep 30, 2011 at 11:06 AM, Jack Bryan wrote: > > Hi, > > > > I have an Open MPI program, which works well on a Linux shared-memory > > multicore (2 x 6 cores) machine. > > > > But it does not work well on a distributed cluster with Linux Open MPI. > > > > I found that a process sends out some messages to other processes, > > which cannot receive them. > > > > What is the possible reason? > > > > I do not change anything in the program. > > > > Any help is really appreciated. 
> > Thanks > > == > Open Grid Scheduler - The Official Open Source Grid Engine > http://gridscheduler.sourceforge.net/
[OMPI users] Open MPI error to define MPI_Datatype in header file
Hi, I need to define an (Open MPI) MPI_Datatype in a header file so that all other files that include it can find it. I also tried to use extern to do the declaration in the .h file and then define it in a .cpp file. But I always get the error: undefined reference. Is this not allowed in Open MPI? Why? Any help is really appreciated. Thanks
[OMPI users] How to check processes working in parallel on one node of MPI cluster
Hi, I am running an Open MPI program on a Linux cluster with 4 quad cores per node. I use qstat -n jobID to check how many processes are working in parallel and find: node160/15+node160/14+node160/13+node160/12+node160/11+node160/10+node160/9 +node160/8+node160/7+node160/6+node160/5+node160/4+node160/3+node160/2 +node160/1+node160/0+node166/15+node166/14+node166/13+node166/12+node166/11 +node166/10+node166/9+node166/8+node166/7+node166/6+node166/5+node166/4 +node166/3+node166/2+node166/1+node166/0+node173/15+node173/14+node173/13 +node173/12+node173/11+node173/10+node173/9+node173/8+node173/7+node173/6 +node173/5+node173/4+node173/3+node173/2+node173/1+node173/0+node175/15 +node175/14+node175/13+node175/12+node175/11+node175/10+node175/9+node175/8 +node175/7+node175/6+node175/5+node175/4+node175/3+node175/2+node175/1 +node175/0 But when I ssh onto a node, e.g. ssh node175, and use the top command to check how many processes are working on node 175, I find that there is only one process working, not 8 processes. Would you please tell me how to check the number of processes on one node? Any help will be appreciated. Thanks Jinxu Ding
[OMPI users] Open MPI task scheduler
Hi, all: I need to design a task scheduler (not a PBS job scheduler) on an Open MPI cluster. I need to parallelize an algorithm so that a big problem is decomposed into small tasks, which can be distributed to worker nodes by the scheduler; after being solved, the results of these tasks are returned to the manager node running the scheduler, which will distribute more tasks on the basis of the collected results. I need to use C++ to design the scheduler. I have searched online and cannot find any scheduler available for this purpose. Any help is appreciated. thanks Jack June 19 2010
Re: [OMPI users] Open MPI task scheduler
Hi, Matthieu: Thanks for your help. Most of your ideas describe what I want to do. My scheduler should be able to be called from any C++ program, which can put a list of tasks into the scheduler; the scheduler then distributes the tasks to other client nodes. It may work like this: while(still tasks available) { myScheduler.push(tasks); myScheduler.get(tasks results from client nodes); } My cluster has 400 nodes with Open MPI. The tasks should be transferred by the MPI protocol. I am not familiar with the RPC protocol. If I use Boost.ASIO and some Python/GCCXML scripts to generate the code, can it be called from a C++ program on an Open MPI cluster? I cannot find the skeleton on your blog. Would you please tell me where to find it? I really appreciate your help. Jack June 20 2010 > Date: Sun, 20 Jun 2010 20:13:14 +0200 > From: matthieu.bruc...@gmail.com > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI task scheduler > > Hi Jack, > > What you are seeking is the client/server pattern. Have one node act > as a server. It will create a list of tasks or even a graph of tasks > if you have dependencies, and then create clients that will connect to > the server with an RPC protocol (I've done this with a SOAP+TCP > protocol, the severance of the TCP connection meaning that the client > is dead and that its task should be recycled; it's easy to do with > Boost.ASIO and some Python/GCCXML scripts to automatically generate > your code, I've written a skeleton on my blog). You may even have > clients with different sizes or capabilities and tell the server what > each client can do, and then the server may dispatch appropriate > tickets to the clients. > > Each client and server can be an MPI process; you don't have to create > all clients inside one MPI process (you may use several if the > smallest resource your batch scheduler allocates is bigger than one of > your tasks). 
With a batch scheduler, it's better to keep your > tasks as small as possible so that you can balance the resources you > need. > > Matthieu > > 2010/6/20 Jack Bryan : > > Hi, all: > > I need to design a task scheduler (not a PBS job scheduler) on an Open MPI > > cluster. > > I need to parallelize an algorithm so that a big problem is decomposed into > > small tasks, which can be distributed > > to worker nodes by the scheduler; after being solved, the results > > of these tasks are returned to the manager node running the scheduler, which > > will distribute more tasks on the basis of the collected results. > > I need to use C++ to design the scheduler. > > I have searched online and cannot find any scheduler available > > for this purpose. > > Any help is appreciated. > > thanks > > Jack > > June 19 2010 > > -- > Information System Engineer, Ph.D. > Blog: http://matt.eifelle.com > LinkedIn: http://www.linkedin.com/in/matthieubrucher
Re: [OMPI users] Open MPI task scheduler
Thanks for your reply. My task scheduler is at the application program level, not the OS level. PBS asks the OS to do the job scheduling. My scheduler needs to be callable from any C++ program to put tasks into the scheduler and then distribute the tasks to worker nodes. After the tasks are done, the manager node collects the results. It may work like this: while(still tasks available) { myScheduler.push(tasks); myScheduler.get(tasks results from client nodes); } Any help is appreciated. Jack June 20 2010 > From: bill.ran...@sas.com > To: us...@open-mpi.org > Date: Sun, 20 Jun 2010 20:04:26 + > Subject: Re: [OMPI users] Open MPI task scheduler > > > On Jun 20, 2010, at 1:49 PM, Jack Bryan wrote: > > Hi, all: > > I need to design a task scheduler (not a PBS job scheduler) on an Open MPI cluster. > > Quick question - why *not* PBS? > > Using shell scripts with the Job Array and Dependent Jobs features of PBS Pro > (not sure about Maui/Torque or SGE) you can implement this in a fairly > straightforward manner. It worked for the bioinformaticists using BLAST. > > It just seems that the workflow you are describing is part and parcel of > what any good workload management system is supposed to do, and do well. > > Just a thought. > > Good luck, > > -bill
Re: [OMPI users] Open MPI task scheduler
Hi, thank you very much for your help. What is the meaning of "must find a system so that every task can be serialized in the same form"? What is the meaning of "serialize"? I have no experience of programming with Python and XML. I have studied your blog. Where can I find a simple example of using the techniques you have described? For example, I have 5 tasks (print "hello world!"). I want to use 6 processors to do it in parallel. One processor is the manager node who distributes tasks, and the other 5 processors do the printing jobs; when they are done, they tell this to the manager node. Boost.Asio is a cross-platform C++ library for network and low-level I/O programming. I have no experience of using it. Will it take a long time to learn how to use it? If the messages are transferred by SOAP+TCP, how does the manager node call it and push tasks into it? Do I need to install SOAP+TCP on my cluster so that I can use it? Any help is appreciated. Jack June 20 2010 > Date: Sun, 20 Jun 2010 21:00:06 +0200 > From: matthieu.bruc...@gmail.com > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI task scheduler > > 2010/6/20 Jack Bryan : > > Hi, Matthieu: > > Thanks for your help. > > Most of your ideas describe what I want to do. > > My scheduler should be able to be called from any C++ program, which can > > put > > a list of tasks into the scheduler; the scheduler then distributes the > > tasks to other client nodes. > > It may work like this: > > while(still tasks available) { > > myScheduler.push(tasks); > > myScheduler.get(tasks results from client nodes); > > } > > Exactly. In your case, you want only one server, so you must find a > system so that every task can be serialized in the same form. The > easiest way to do so is to serialize your parameter set as an XML > fragment and add the type of task as another field. > > > My cluster has 400 nodes with Open MPI. The tasks should be transferred by > > the MPI protocol. 
> > No, they should not ;) MPI can be used, but it is not the easiest way > to do so. You still have to serialize your ticket, and you have to use > some functions that are from MPI-2 (so perhaps not as portable as MPI-1 > functions). Besides, it cannot be used from programs that do not know > about MPI protocols. > > > I am not familiar with the RPC protocol. > > RPC is not a protocol per se. SOAP is. RPC stands for Remote Procedure > Call. It is basically your scheduler that has several functions > clients can call: > - add tickets > - retrieve ticket > - ticket is done > > > If I use Boost.ASIO and some Python/GCCXML scripts to generate the code, can it > > be > > called from a C++ program on an Open MPI cluster? > > Yes, SOAP is just an XML way of representing the fact that you call a > function on the server. You can use it with C++, Java, ... I use it > with Python to monitor how many tasks are remaining, for instance. > > > I cannot find the skeleton on your blog. > > Would you please tell me where to find it? > > It's not complete, as some of the work is property of my employer. This > is how I use GCCXML to generate the calling code: > http://matt.eifelle.com/2009/07/21/using-gccxml-to-automate-c-wrappers-creation/ > You have some additional code to write, but this is the main idea. > > > I really appreciate your help. > > No sweat, I hope I can give you correct hints! > > Matthieu > -- > Information System Engineer, Ph.D. > Blog: http://matt.eifelle.com > LinkedIn: http://www.linkedin.com/in/matthieubrucher
[OMPI users] openMPI asynchronous communication
Dear All: How can I do asynchronous communication among nodes with Open MPI or Boost.MPI in a cluster? I need to set up a kind of asynchronous communication protocol such that message senders and receivers can communicate asynchronously without losing any messages between them. I do not want to use blocking MPI routines, because the processors can do other operations while they wait for new messages to come. I cannot find MPI routines that support this kind of asynchronous communication. Any help is appreciated. thanks Jack June 27 2010
Re: [OMPI users] openMPI asynchronous communication
thanks, I know this. But what if the sender can send messages to receivers faster than the receiver can receive them? It means that the sender works faster than the receiver. Any help is appreciated. jack From: jiangzuo...@gmail.com Date: Mon, 28 Jun 2010 11:31:16 +0800 To: us...@open-mpi.org Subject: Re: [OMPI users] openMPI asynchronous communication MPI_Isend - Starts a standard-mode, nonblocking send. BTW, are there any asynchronous collective operations? Changsheng Jiang On Mon, Jun 28, 2010 at 11:22, Jack Bryan wrote: Dear All: How can I do asynchronous communication among nodes with Open MPI or Boost.MPI in a cluster? I need to set up a kind of asynchronous communication protocol such that message senders and receivers can communicate asynchronously without losing any messages between them. I do not want to use blocking MPI routines, because the processors can do other operations while they wait for new messages to come. I cannot find MPI routines that support this kind of asynchronous communication. Any help is appreciated. thanks Jack June 27 2010
Re: [OMPI users] openMPI asynchronous communication
thanks, I know that: MPI_Irecv(); do other work; MPI_Wait(); But my message receiver is much slower than the sender. While the receiver is doing its local work, the senders have already sent out their messages; at that time the receiver is very busy with its local work and cannot post MPI_Irecv to get the messages from the senders. Any help is appreciated. jack From: jiangzuo...@gmail.com Date: Mon, 28 Jun 2010 11:55:32 +0800 To: us...@open-mpi.org Subject: Re: [OMPI users] openMPI asynchronous communication OK, then I think you also know to use MPI_Wait to wait for the asynchronous requests to complete. If the sender works faster than the receiver (or the reverse), then MPI_Wait will do the waiting, not just deallocate. You should keep the buffer contents until MPI_Wait. Changsheng Jiang On Mon, Jun 28, 2010 at 11:41, Jack Bryan wrote: thanks, I know this. But what if the sender can send messages to receivers faster than the receiver can receive them? It means that the sender works faster than the receiver. Any help is appreciated. jack From: jiangzuo...@gmail.com Date: Mon, 28 Jun 2010 11:31:16 +0800 To: us...@open-mpi.org Subject: Re: [OMPI users] openMPI asynchronous communication MPI_Isend - Starts a standard-mode, nonblocking send. BTW, are there any asynchronous collective operations? Changsheng Jiang On Mon, Jun 28, 2010 at 11:22, Jack Bryan wrote: Dear All: How can I do asynchronous communication among nodes with Open MPI or Boost.MPI in a cluster? I need to set up a kind of asynchronous communication protocol such that message senders and receivers can communicate asynchronously without losing any messages between them. I do not want to use blocking MPI routines, because the processors can do other operations while they wait for new messages to come. I cannot find MPI routines that support this kind of asynchronous communication. Any help is appreciated. thanks Jack June 27 2010
[OMPI users] Open MPI ERR_TRUNCATE: message truncated
Dear All, I am using Open MPI: mpirun (Open MPI) 1.3.4. I got the error: terminate called after throwing an instance of 'boost::exception_detail::clone_impl >' what(): MPI_Test: MPI_ERR_TRUNCATE: message truncated I installed the Boost MPI library and compiled and ran the program with Open MPI. It seems that the message has been truncated by the receiver. How can I fix the problem? Is it a bug of Open MPI? Any help is appreciated. Jack June 28 2010
[OMPI users] Open MPI, Segmentation fault
Dear All, I am using Open MPI and got the error: [n337:37664] *** Process received signal *** [n337:37664] Signal: Segmentation fault (11) [n337:37664] Signal code: Address not mapped (1) [n337:37664] Failing at address: 0x7fffcfe9 [n337:37664] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0] [n337:37664] [ 1] /lustre/home/rhascheduler/RhaScheduler-0.4.1.1/mytest/nmn2 [0x414ed7] [n337:37664] [ 2] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974] [n337:37664] [ 3] /lustre/home/rhascheduler/RhaScheduler-0.4.1.1/mytest/nmn2(__gxx_personality_v0+0x1f1) [0x412139] [n337:37664] *** End of error message *** After searching for answers, it seems that some functions fail. My program runs well for 1, 2, or 10 processors, but fails when the number of tasks cannot be divided evenly by the number of processes. Any help is appreciated. thanks Jack June 30 2010
Re: [OMPI users] Open MPI, Segmentation fault
thanks, I am not familiar with Open MPI. Would you please help me with how to ask Open MPI to show where the fault occurs? The GNU debugger? Any help is appreciated. thanks!!! Jack June 30 2010 Date: Wed, 30 Jun 2010 16:13:09 -0400 From: amja...@gmail.com To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI, Segmentation fault Based on my experience, I would FULLY endorse (100% agree with) David Zhang. It is usually a coding or typo mistake. First, ensure that array sizes and dimensions are correct. My experience is that if Open MPI is compiled with GNU compilers (not with Intel), it also points out exactly the subroutine in which the fault occurs. Have a try. best, AA On Wed, Jun 30, 2010 at 12:43 PM, David Zhang wrote: When I get segmentation faults, it has always been my coding mistake. Perhaps your code is not robust against a number of processes not divisible by 2? On Wed, Jun 30, 2010 at 8:47 AM, Jack Bryan wrote: Dear All, I am using Open MPI and got the error: [n337:37664] *** Process received signal *** [n337:37664] Signal: Segmentation fault (11) [n337:37664] Signal code: Address not mapped (1) [n337:37664] Failing at address: 0x7fffcfe9 [n337:37664] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0] [n337:37664] [ 1] /lustre/home/rhascheduler/RhaScheduler-0.4.1.1/mytest/nmn2 [0x414ed7] [n337:37664] [ 2] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974] [n337:37664] [ 3] /lustre/home/rhascheduler/RhaScheduler-0.4.1.1/mytest/nmn2(__gxx_personality_v0+0x1f1) [0x412139] [n337:37664] *** End of error message *** After searching for answers, it seems that some functions fail. My program runs well for 1, 2, or 10 processors, but fails when the number of tasks cannot be divided evenly by the number of processes. Any help is appreciated. thanks Jack June 30 2010 
-- David Zhang University of California, San Diego
Re: [OMPI users] Open MPI, Segmentation fault
Thanks for all your replies. I want to do master-worker asynchronous communication. The master needs to distribute tasks to workers and then collect results from them. master: world.irecv(resultSourceRank, upStreamTaskTag, myResultTaskPackage[iRank][taskCounterT3]); I got the error "MPI_ERR_TRUNCATE" because I declared "TaskPackage myResultTaskPackage". It seems that a 2-dimensional array cannot be used to receive my defined class package from a worker, who sends a TaskPackage to the master. So I changed it to an int 2-d array to get the result, and it works well. But I still want to find out how to store the result in a data structure of type TaskPackage, because int data can only carry integers. Too limited. What I want to do is: the master stores the results from each worker and then combines them together to form the final result after collecting all results from the workers. But if the master has a number of tasks that cannot be divided evenly by the number of workers, each worker may get a different number of tasks. If we have 11 tasks and 3 workers: aveTaskNumPerNode = (11 - 11%3)/3 = 3, leftTaskNum = 11%3 = 2 = Z. The master distributes each of the leftover tasks to worker 1 through worker Z (Z < totalNumWorkers). For example: worker 1: 4 tasks, worker 2: 4 tasks, worker 3: 3 tasks. The master tries to distribute tasks evenly so that the difference between the workloads of the workers is minimized. I am going to use a vector of vectors for the dynamic data storage: a 2-dimensional data structure that can store the results from the workers, where each row can have a different number of columns. It can be indexed by iterator so that I can find a specified worker's task result by searching the data structure. For example: row 1: (worker1.task1) (worker1.task4); row 2: (worker2.task2) (worker2.task5); row 3: (worker3.task3). The data structure should remember the worker ID and the task ID. 
So that the master can know which task comes from which worker. Any help or comments are appreciated. thanks Jack June 30 2010 > Date: Thu, 1 Jul 2010 11:44:19 -0400 > From: g...@ldeo.columbia.edu > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI, Segmentation fault > > Hello Jack, list > > As others mentioned, this may be a problem with dynamic > memory allocation. > It could also be a violation of statically allocated memory, > I guess. > > You say: > > > My program can run well for 1,2,10 processors, but fail when the > > number of tasks cannot > > be divided evenly by number of processes. > > Often, when the division of the number of "tasks" > (or the global problem size) by the number of "processors" is not even, > one processor gets a lighter/heavier workload than the others, > it also allocates less/more memory than the others, > and it accesses smaller/larger arrays than the others. > > In general, integer division and remainder/modulo calculations > are used to control memory allocation, the array sizes, etc., > on different processors. > These formulas tend to use the MPI communicator size > (i.e., effectively the number of processors if you are using > MPI_COMM_WORLD) to split the workload across the processors. > > I would search for the lines of code where those calculations are done, > and where the arrays are allocated and accessed, > to make sure the algorithm works both when > they are of the same size > (even workload across the processors), > and when they are of different sizes > (uneven workload across the processors). > You may be violating memory access by a few bytes only, due to a small > mistake in one of those integer division / remainder/modulo formulas, > perhaps where an array index upper or lower bound is calculated. > It happened to me before, probably to others too. > > This type of code inspection can be done without a debugger, > or before you get to the debugger phase. 
> > I hope this helps, > Gus Correa > - > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > - > > > Jeff Squyres wrote: > > Also see http://www.open-mpi.org/faq/?category=debugging. > > > > On Jul 1, 2010, at 3:17 AM, Asad Ali wrote: > > > >> Hi Jack, > >> > >> Debugging Open MPI with traditional debuggers is a pain. > >> From your error message it sounds like you have some memory allocation > >> problem. Do you use dynamic memory a
[OMPI users] OpenMPI error MPI_ERR_TRUNCATE
Dear All: With Boost MPI, I am trying to ask some worker nodes to send messages to the single master node. I am using Open MPI 1.3.4. I use an array recvArray[row][column] to receive the messages, whose elements are a C++ class containing ints and member functions. But I got an error: terminate called after throwing an instance of 'boost::exception_detail::clone_impl >' what(): MPI_Test: MPI_ERR_TRUNCATE: message truncated [n124:126639] *** Process received signal *** [n124:126639] Signal: Aborted (6) [n124:126639] Signal code: (-6) It seems that the master cannot find enough space for the received message. But I have declared recvArray, which is a vector whose elements are my received class package. The error is very weird. When I open the received package, the elements are not the expected numbers but only some very large or small numbers. Any help is appreciated. Jack July 2 2010
[OMPI users] Open MPI, cannot get the results from workers
Dear All: I designed a master-worker framework, in which the master can schedule multiple tasks (numTaskPerWorkerNode) to each worker and then collect results from the workers. If numTaskPerWorkerNode = 1, it works well. But if numTaskPerWorkerNode > 1, the master cannot get the results from the workers, although the workers can get the tasks from the master. Why? I have used a different taskTag to distinguish the tasks, but it still does not work. Any help is appreciated. Thanks, Jack July 4 2010
Re: [OMPI users] Open MPI, cannot get the results from workers
When the master sends out a task, it assigns a distinct task number ID to the task. When the worker receives the task, it uses the task's assigned ID as the task tag to send it back to the master. Any help is appreciated. July 5 2010 From: solarbik...@gmail.com Date: Mon, 5 Jul 2010 13:17:27 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI, cannot get the results from workers How does the master receive results from the workers? If a worker is sending multiple task results, how does the master know what the message tags are ahead of time? On Sun, Jul 4, 2010 at 10:26 AM, Jack Bryan wrote: Dear All: I designed a master-worker framework, in which the master can schedule multiple tasks (numTaskPerWorkerNode) to each worker and then collect results from the workers. If numTaskPerWorkerNode = 1, it works well. But if numTaskPerWorkerNode > 1, the master cannot get the results from the workers, although the workers can get the tasks from the master. Why? I have used a different taskTag to distinguish the tasks, but it still does not work. Any help is appreciated. Thanks, Jack July 4 2010 -- David Zhang University of California, San Diego
[OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated
Dear All: I need to transfer some messages from worker nodes to the master node on an MPI cluster with Open MPI. The number of messages is fixed. When I increase the number of worker nodes, I get the error: -- terminate called after throwing an instance of 'boost::exception_detail::clone_impl >' what(): MPI_Unpack: MPI_ERR_TRUNCATE: message truncated [n231:45873] *** Process received signal *** [n231:45873] Signal: Aborted (6) [n231:45873] Signal code: (-6) [n231:45873] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0] [n231:45873] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3c50230215] [n231:45873] [ 2] /lib64/libc.so.6(abort+0x110) [0x3c50231cc0] -- For 40 workers it works well, but for 50 workers I get this error. The largest message size is no more than 72 bytes. Any help is appreciated. thanks Jack July 7 2010
Re: [OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated
Thanks. What if the master has to send and receive a large data package? Does it have to be split into multiple parts? This may increase communication overhead. I can use an MPI datatype to wrap it up as a specific datatype which carries the data. But what if the data is very large: 1 KB, 10 KB, 100 KB? The master needs to collect the same datatype from all workers, so it has to set up a data pool to hold all the data. The buffer provided by MPI may not be large enough to do this. Are there other ways to do it? Any help is appreciated. thanks Jack July 7 2010

From: solarbik...@gmail.com Date: Wed, 7 Jul 2010 17:32:27 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated

This error typically occurs when the received message is bigger than the specified buffer size. You need to narrow your code down to the offending receive command to see if this is indeed the case.

On Wed, Jul 7, 2010 at 8:42 AM, Jack Bryan wrote: [...]
-- David Zhang University of California, San Diego
[OMPI users] OpenMPI how large its buffer size ?
Dear All: How can I find the buffer size of Open MPI? I need to transfer large data between nodes on a cluster running Open MPI 1.3.4. Many nodes need to send data to the same node. The workers use mpi_isend and the receiver node uses mpi_irecv. Because these are non-blocking, the messages are stored in the senders' buffers, and the receiver then collects the messages from its buffer. If the receiver's buffer is too small, there will be a truncation error. Any help is appreciated. Jack July 9 2010
Re: [OMPI users] OpenMPI how large its buffer size ?
Hi, thanks for the program from Jody. David indicated the question that I want to ask. Jody's approach is fine when the MPI built-in buffer is large enough to hold a message such as 100 kB. In asynchronous communication, when the sender posts an mpi_isend, the message is put in a buffer provided by MPI. At that point, the receiver may not yet have posted its corresponding mpi_irecv, so the buffer size is important here. Without knowing the buffer size, I may get a "truncate error" on Open MPI. How can I find out the size of the buffer that Open MPI automatically creates in the background? Any help is appreciated. Jack, July 10 2010

From: solarbik...@gmail.com Date: Sat, 10 Jul 2010 16:46:12 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] OpenMPI how large its buffer size ?

I believe his question is about non-blocking send/recv: how does MPI know how much memory to allocate for the message, since the size is only known after the irecv is posted? So if the sender posts an isend but the receiver hasn't posted an irecv, what does MPI do with the message? I believe MPI automatically creates a buffer in the background to store it.

On Sat, Jul 10, 2010 at 1:55 PM, jody wrote: Perhaps I misunderstand your question... Generally, it is the user's job to provide the buffers, both to send and to receive. If you call MPI_Recv, you must pass a buffer that is large enough to hold the data sent by the corresponding MPI_Send. I.e., if you know your sender will send messages of 100 kB, then you must provide a buffer of size 100 kB to the receiver.
If the message size is unknown at compile time, you may have to send two messages: first an integer which tells the receiver how large a buffer it has to allocate, and then the actual message (which then nicely fits into the freshly allocated buffer):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "mpi.h"

#define SENDER 1
#define RECEIVER 0
#define TAG_LEN 77
#define TAG_DATA 78
#define MAX_MESSAGE 16

int main(int argc, char *argv[]) {
    int num_procs;
    int rank;
    int *send_buf;
    int *recv_buf;
    int send_message_size;
    int recv_message_size;
    MPI_Status st;
    int i;

    /* initialize random numbers */
    srand(time(NULL));

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == RECEIVER) {
        /* the receiver */
        /* wait for message length */
        MPI_Recv(&recv_message_size, 1, MPI_INT, SENDER, TAG_LEN, MPI_COMM_WORLD, &st);
        /* create a buffer of the required size */
        recv_buf = (int*) malloc(recv_message_size*sizeof(int));
        /* get data */
        MPI_Recv(recv_buf, recv_message_size, MPI_INT, SENDER, TAG_DATA, MPI_COMM_WORLD, &st);
        printf("Receiver got %d integers:", recv_message_size);
        for (i = 0; i < recv_message_size; i++) {
            printf(" %d", recv_buf[i]);
        }
        printf("\n");
        /* clean up */
        free(recv_buf);
    } else if (rank == SENDER) {
        /* the sender */
        /* random message size */
        send_message_size = (int)((1.0*MAX_MESSAGE*rand())/(1.0*RAND_MAX));
        /* create a buffer of the required size */
        send_buf = (int*) malloc(send_message_size*sizeof(int));
        /* create random message */
        for (i = 0; i < send_message_size; i++) {
            send_buf[i] = rand();
        }
        printf("Sender has %d integers:", send_message_size);
        for (i = 0; i < send_message_size; i++) {
            printf(" %d", send_buf[i]);
        }
        printf("\n");
        /* send message size to receiver */
        MPI_Send(&send_message_size, 1, MPI_INT, RECEIVER, TAG_LEN, MPI_COMM_WORLD);
        /* now send message */
        MPI_Send(send_buf, send_message_size, MPI_INT, RECEIVER, TAG_DATA, MPI_COMM_WORLD);
        /* clean up */
        free(send_buf);
    }

    MPI_Finalize();
}

I hope this helps
Jody

On Sat, Jul 10, 2010 at 7:12 AM, Jack Bryan wrote:
> Dear All:
> How to find the buffer size of OpenMPI ?
> I need to transfer large data between nodes on a cluster with OpenMPI 1.3.4.
> Many nodes need to send data to the same node .
> Workers use mpi_isend, the receiver node use mpi_irecv.
> because they are non-blocking, the messages are stored in buffers of senders.
> And then, the receiver collect messages from its buffer.
> If the receiver's buffer is too small, there will be truncate error.
> Any help is appreciated.
> Jack
> July 9 2010
Re: [OMPI users] OpenMPI how large its buffer size ?
Hi, thanks for all your replies. The master node can receive the message (the same size) from 50 worker nodes, but it cannot receive the message from 51 nodes; that caused a "truncate error". I used the same buffer to get the message in the 50-node case. About the "rendezvous" protocol: what is the meaning of "the sender sends a short portion"? What is the "short portion"? Is it a small part of the sender's message? Can this "rendezvous" protocol work automatically in the background without the programmer indicating it in his program? And can the "acknowledgement" be generated by the receiver only when the corresponding mpi_irecv is posted by the receiver? Any help is appreciated. Jack July 10 2010

Date: Sat, 10 Jul 2010 20:41:26 -0700 From: eugene@oracle.com To: us...@open-mpi.org Subject: Re: [OMPI users] OpenMPI how large its buffer size ?

I hope I understand the question properly. The "truncate error" means that the receive buffer provided by the user was too small to receive the designated message. That's an error in the user code. You're asking about some buffering sizes within the MPI implementation. We can talk about that, but it probably first makes sense to clarify what MPI is doing. If a sender posts a large send and the receiver has not posted a matching receive, the MPI implementation is not required to move any data. In particular, most MPI implementations will use a "rendezvous" protocol in which the sender sends a short portion and then waits for an acknowledgement from the receiver that it is ready to receive the message (and knows into which user buffer to place the received data). This protocol is used so that the MPI implementation does not have to buffer arbitrarily large messages internally. So, if you post a large send but no receive, the MPI implementation is probably buffering very little data. The message won't advance until the receive has been posted.
This means that a blocking MPI_Send will wait, and a nonblocking MPI_Isend will return without having done much.

Jack Bryan wrote: [...]
Re: [OMPI users] OpenMPI how large its buffer size ?
Thanks for your reply. The message size is 72 bytes. The master sends the message package to each of the 51 nodes. Then, after doing their local work, each worker node sends a same-size message back to the master. The master uses vector.push_back(new messageType) to receive each message from the workers, and uses mpi_irecv(workerNodeID, messageTag, bufferVector[row][column]) to receive a worker message; the row is the rank ID of each worker and the column is the index of the message from that worker. Each worker may send multiple messages to the master. When the number of worker nodes is large, I got the MPI_ERR_TRUNCATE error. Any help is appreciated. JACK July 10 2010

Date: Sat, 10 Jul 2010 23:12:49 -0700 From: eugene@oracle.com To: us...@open-mpi.org Subject: Re: [OMPI users] OpenMPI how large its buffer size ?

Jack Bryan wrote: The master node can receive message (the same size) from 50 worker nodes. But, it cannot receive message from 51 nodes. It caused "truncate error".

How big was the buffer that the program specified in the receive call? How big was the message that was sent? MPI_ERR_TRUNCATE means that you posted a receive with an application buffer that turned out to be too small to hold the message that was received. It's a user application error that has nothing to do with MPI's internal buffers. MPI's internal buffers don't need to be big enough to hold that message; MPI could require the sender and receiver to coordinate so that only part of the message is moved at a time.

I used the same buffer to get the message in the 50-node case. About the "rendezvous" protocol, what is the meaning of "the sender sends a short portion"? What is the "short portion", is it a small part of the message of the sender?

It's at least the message header (communicator, tag, etc.) so that the receiver can figure out whether this is the expected message or not. In practice, there is probably also some data in there as well.
The amount of that portion depends on the MPI implementation and, in practice, on the interconnect the message traveled over, on MPI-implementation-dependent environment variables set by the user, etc. E.g., with OMPI over shared memory it's about 4 Kbytes by default (if I remember correctly).

Can this "rendezvous" protocol work automatically in the background without the programmer indicating it in his program?

Right. MPI actually allows you to force such synchronization with MPI_Ssend, but typically MPI implementations use it automatically for "plain" long sends as well, even if the user did not use MPI_Ssend.

Can the "acknowledgement" be generated by the receiver only when the corresponding mpi_irecv is posted by the receiver?

Right.
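Eugene's figure of about 4 Kbytes over shared memory corresponds to an Open MPI MCA parameter. As a hedged pointer only (parameter names vary between Open MPI versions and interconnects, so check your own build with ompi_info), the shared-memory eager limit can be inspected and adjusted roughly like this:

```
# Inspect the eager/rendezvous thresholds of the sm BTL
# (Open MPI 1.x-era parameter name; verify on your installation):
ompi_info --param btl sm | grep eager_limit

# Raise the shared-memory eager limit for one run (value in bytes);
# ./exec stands for your own application binary:
mpirun --mca btl_sm_eager_limit 8192 -np 6 ./exec
```

Raising the eager limit only changes when the rendezvous protocol kicks in; it does not fix an undersized application receive buffer, which is what MPI_ERR_TRUNCATE reports.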
[OMPI users] OpenMPI load data to multiple nodes
Dear All, I am working on a multi-computer Open MPI cluster system. If I put some data files in /home/mypath/folder, is it possible for all non-head nodes to access the files in the folder? I need to load some data to some nodes; if all nodes can access the data, I do not need to load it to each node one by one. If multiple nodes access the same file to get data, is there a conflict? For example, fopen(myFile) by node 1 and, at the same time, fopen(myFile) by node 2. Is it allowed to do that on an MPI cluster without conflict? Any help is appreciated. Jinxu Ding July 12 2010
Re: [OMPI users] OpenMPI load data to multiple nodes
Thanks very much! May I use a global variable to do that? It means that all nodes have the same global variable, such as globalVector. In the initialization, only node 0 loads data from files and assigns values to the globalVector. After that, all other nodes can get the same data by accessing the globalVector. Does it make sense? Any help is appreciated. Jack July 12 2010

> Date: Mon, 12 Jul 2010 21:44:34 -0400
> From: g...@ldeo.columbia.edu
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OpenMPI load data to multiple nodes
>
> Hi Jack/Jinxu
>
> Jack Bryan wrote:
> > Dear All,
> > I am working on a multi-computer Open MPI cluster system.
> > If I put some data files in /home/mypath/folder, is it possible that all
> > non-head nodes can access the files in the folder ?
>
> Yes, possible, for instance, if the /home/mypath/folder directory is
> NFS mounted on all nodes/computers.
> Otherwise, if all disks and directories are local to each computer,
> you need to copy the input files to the local disks before you
> start, and copy the output files back to your login computer after the
> program ends.
>
> > I need to load some data to some nodes, if all nodes can access the
> > data, I do not need to load them to each node one by one.
> > If multiple nodes access the same file to get data, is there conflict ?
>
> To some extent.
> The OS (on the computer where the file is located)
> will do the arbitration on which process gets hold of the file at
> each time.
> If you have 1000 processes, this means a lot of arbitration,
> and most likely contention.
> Even for two processes only, if the processes are writing data to a
> single file, this won't ensure that they write
> the output data in the order that you want.
>
> > For example,
> > fopen(myFile) by node 1, at the same time fopen(myFile) by node 2.
> > Is it allowed to do that on MPI cluster without conflict ?
>
> I think MPI won't have any control over this.
> It is up to the operating system, and depends on
> which process gets its "fopen" request to the OS first,
> which is not a deterministic sequence of events.
> That is not a clean technique.
>
> You could instead:
>
> 1) Assign a single process, say, rank 0,
> to read and write data from/to the file(s).
> Then use, say, MPI_Scatter[v] and MPI_Gather[v],
> to distribute and collect the data back and forth
> between that process (rank 0) and all other processes.
>
> That is an old fashioned but very robust technique.
> It avoids any I/O conflict or contention among processes.
> All the data flows across the processes via MPI.
> The OS receives I/O requests from a single process (rank 0).
>
> Besides MPI_Gather/MPI_Scatter, look also at MPI_Bcast,
> if you need to send the same data to all processes,
> assuming the data is being read by a single process.
>
> 2) Alternatively, you could use the MPI I/O functions,
> if your files are binary.
>
> I hope it helps,
> Gus Correa
[OMPI users] openMPI, transfer data from multiple sources to one destination
Hi, I need to transfer data from multiple sources to one destination. The requirements are: (1) The source and destination nodes may work asynchronously. (2) Each source node generates data packages at its own pace, and there may be many packages to send. Whenever a data package is generated, it should be sent to the destination node at once; the source node then continues working on generating the next package. (3) There is only one destination node, which must receive all data packages generated by the source nodes. Because the source and destination nodes may work asynchronously, the destination node should not wait for a specific source node until that node sends out its data. The destination node should be able to receive a data package from any source node whenever a package is available. My question is: what MPI functions should be used to implement the protocol above? If I use MPI_Send/MPI_Recv, they are blocking functions; the destination node has to wait for one node until its data is available, and the communication overhead is too high. If I use MPI_Bsend, the destination node still has to use MPI_Recv, a blocking receive for a message. This can make the destination node wait for only one source node while other source nodes may actually have data available. Any help or comment is appreciated! thanks Dec. 28 2008
[OMPI users] Open mpi 123 install error for BLACS
Hi, I am installing BLACS in order to install PCSDP, a parallel interior point solver for linear programming. I need to install it on an Open MPI 1.2.3 platform. I have installed BLAS and LAPACK successfully. Now I need to install BLACS. I can run "make mpi" successfully, but when I run "make tester":

[BLACS]$ make tester
( cd TESTING ; make )
make[1]: Entering directory `/home/PCSDP/BLACS/TESTING'
mpif77 -o /home/PCSDP/BLACS/TESTING/EXE/xFbtest_MPI-LINUX-0 blacstest.o btprim_MPI.o tools.o /home/PCSDP/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /home/PCSDP/BLACS/LIB/blacs_MPI-LINUX-0.a /home/PCSDP/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /home/openmpi_123/lib/libmpi_cxx.la
/home/openmpi_123/lib/libmpi_cxx.la: file not recognized: File format not recognized
collect2: ld returned 1 exit status
make[1]: *** [/home/PCSDP/BLACS/TESTING/EXE/xFbtest_MPI-LINUX-0] Error 1
make[1]: Leaving directory `/home/PCSDP/BLACS/TESTING'
make: *** [tester] Error 2

In the "Makefile" of TESTING/, I have changed:

tools.o : tools.f
	#$(F77) $(F77NO_OPTFLAGS) -c $*.f
	$(F77) $(F77NO_OPTFLAGS) -fno-globals -fno-f90 -fugly-complex -w -c $*.f
blacstest.o : blacstest.f
	#$(F77) $(F77NO_OPTFLAGS) -c $*.f
	$(F77) $(F77NO_OPTFLAGS) -fno-globals -fno-f90 -fugly-complex -w -c $*.f

In "Bconfig.h", I have changed: include "/home/openmpi_123/include/mpi.h"

In Open MPI 1.2.3, the lib directory does not include "*.a" libraries, only "*.la" libraries. Any help is appreciated. Jack Jan. 30 2009

My "Bmake.inc" is:

# SECTION 1: PATHS AND LIBRARIES
SHELL = /bin/sh
BTOPdir = /home/PCSDP/BLACS
COMMLIB = MPI
PLAT = LINUX
BLACSdir = $(BTOPdir)/LIB
BLACSDBGLVL = 0
BLACSFINIT = $(BLACSdir)/blacsF77init_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT = $(BLACSdir)/blacsCinit_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB = $(BLACSdir)/blacs_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
MPIdir = /home/openmpi_123
MPILIBdir = $(MPIdir)/lib
MPIINCdir = $(MPIdir)/include
MPILIB = $(MPILIBdir)/libmpi_cxx.la
BTLIBS = $(BLACSFINIT) $(BLACSLIB) $(BLACSFINIT) $(MPILIB)
INSTdir = $(BTOPdir)/INSTALL/EXE
TESTdir = $(BTOPdir)/TESTING/EXE
FTESTexe = $(TESTdir)/xFbtest_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL)
CTESTexe = $(TESTdir)/xCbtest_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL)
SYSINC = -I$(MPIINCdir)
INTFACE = -Df77IsF2C
SENDIS =
BUFF =
TRANSCOMM = -DCSameF77
WHATMPI =
SYSERRORS =
DEBUGLVL = -DBlacsDebugLvl=$(BLACSDBGLVL)
DEFS1 = -DSYSINC $(SYSINC) $(INTFACE) $(DEFBSTOP) $(DEFCOMBTOP) $(DEBUGLVL)
BLACSDEFS = $(DEFS1) $(SENDIS) $(BUFF) $(TRANSCOMM) $(WHATMPI) $(SYSERRORS)

# SECTION 3: COMPILERS
F77 = mpif77
F77NO_OPTFLAGS =
F77FLAGS = $(F77NO_OPTFLAGS) -O
F77LOADER = $(F77)
F77LOADFLAGS =
CC = mpicc
CCFLAGS = -O4
CCLOADER = $(CC)
CCLOADFLAGS =
ARCH = ar
ARCHFLAGS = r
RANLIB = ranlib
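A note on the link failure above (an assumption on my part, not an answer from the thread): a ".la" file is libtool metadata, not an object archive, which is why ld reports "file not recognized". One way around it is to point MPILIB at the real shared library, or to leave MPILIB empty, since the mpif77/mpicc wrapper compilers already add the MPI libraries themselves:

```
# In Bmake.inc -- either link the actual shared library...
MPILIB = $(MPILIBdir)/libmpi.so
# ...or rely on the mpif77/mpicc wrappers, which link MPI on their own:
# MPILIB =
```

Whether libmpi.so alone suffices depends on the build; the wrapper-compiler route avoids guessing library names entirely.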