[OMPI users] OpenMPI killed by signal 9
Dear All: I ran a parallel job on 6 nodes of an Open MPI cluster, but I got this error:

rank 0 in job 82 system.cluster_37948 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

It seems that there is a segmentation fault on node 0. But if the program runs only for a short time, there is no problem. Any help is appreciated. thanks, Jack July 22 2010
[OMPI users] OpenMPI Segmentation fault (11)
Dear All, I run 6 parallel processes on Open MPI. When the run time of the program is short, it works well. But if the run time is long, I get these errors:

[n124:45521] *** Process received signal ***
[n124:45521] Signal: Segmentation fault (11)
[n124:45521] Signal code: Address not mapped (1)
[n124:45521] Failing at address: 0x44
[n124:45521] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
[n124:45521] [ 1] /lib64/libc.so.6(strlen+0x10) [0x3c50278d60]
[n124:45521] [ 2] /lib64/libc.so.6(_IO_vfprintf+0x4479) [0x3c50246b19]
[n124:45521] [ 3] /lib64/libc.so.6(_IO_printf+0x9a) [0x3c5024d3aa]
[n124:45521] [ 4] /home/path/exec [0x40ec9a]
[n124:45521] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974]
[n124:45521] [ 6] /home/path/exec [0x401139]
[n124:45521] *** End of error message ***

It seems there may be some problem with memory management, but I cannot find the reason. My program needs to write results to some files. If I open too many files without closing them, I may get the above errors. But I have removed the file writing from my program, and the problem appears again when the program runs for a longer time. Any help is appreciated. Jack July 25 2010
Re: [OMPI users] OpenMPI Segmentation fault (11)
Thanks. Can it be installed on Linux and work with gcc? If I have many processes, say 30, do I have to open 30 terminal windows? thanks Jack

> Date: Mon, 26 Jul 2010 08:23:57 +0200
> From: jody@gmail.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OpenMPI Segmentation fault (11)
>
> Hi Jack
>
> Have you tried to run your application under valgrind?
> Even though applications generally run slower under valgrind,
> it may detect memory errors before the actual crash happens.
>
> The best would be to start a terminal window for each of your processes
> so you can see valgrind's output for each process separately.
>
> Jody
>
> On Mon, Jul 26, 2010 at 4:08 AM, Jack Bryan wrote:
> > [quoted message "OpenMPI Segmentation fault (11)" elided]
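A note on Jody's valgrind suggestion above: separate terminal windows are not required. Open MPI can launch valgrind in front of every rank, and valgrind can write one log file per process. A minimal sketch; the executable name ./exec, the rank count, and the log-file prefix are placeholders:

```shell
# Launch 6 ranks, each under valgrind. %p in --log-file expands to the
# PID of each process, so every rank gets its own log instead of its
# own terminal window.
mpirun -np 6 valgrind --leak-check=full --log-file=vg.%p.log ./exec
```

Afterward, inspect the vg.*.log files for "Invalid read/write" reports near the crash.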
[OMPI users] Open MPI C++ class datatype
Dear All: I need to transfer some data, which is a C++ class with some vector member data. I want to use MPI_Bcast(buffer, count, datatype, root, comm). Can I use MPI_Datatype to define a custom datatype for a structure that contains a C++ class? Any help is appreciated. Jack Aug 3 2010
[OMPI users] Open MPI dynamic data structure error
Hi, I need to design a data structure to transfer data between nodes on an Open MPI system. Some elements of the structure have dynamic size. For example:

typedef struct {
    double data1;
    vector<double> dataVec;
} myDataType;

The size of dataVec depends on some intermediate computing results. If I only declare it as the above myDataType, I think only a pointer is transferred. When the receiver tries to access the elements of dataVec, it gets a segmentation fault. But I also need to use myDataType to declare other data structures, such as vector<myDataType> newDataVec. I cannot declare myDataType inside a function such as main(); I get errors:

main.cpp:200: error: main(int, char**)::myDataType uses local type main(int, char**)::myDataType

Any help is really appreciated. thanks Jack Oct. 19 2010
[OMPI users] OPEN MPI data transfer error
Hi, I am using Open MPI to transfer data between nodes, but the received data is not what the sender sends out. I have tried both the C and C++ bindings.

data sender:

double* sendArray = new double[sendResultVec.size()];
for (int ii = 0; ii < sendResultVec.size(); ii++)
{
    sendArray[ii] = sendResultVec[ii];
}
MPI::COMM_WORLD.Send(sendArray, sendResultVec.size(), MPI_DOUBLE, 0, myworkerUpStreamTaskTag);

data receiver:

double* recvArray = new double[objSize];
mToMasterT1Req = MPI::COMM_WORLD.Irecv(recvArray, objSize, MPI_DOUBLE, destRank, myUpStreamTaskTag);

sendResultVec.size() = objSize. What is the possible reason? Any help is appreciated. thanks jack Oct. 22 2010
Re: [OMPI users] OPEN MPI data transfer error
Hi, I have used mpi_waitall() to do it. The data can be received, but the contents are wrong. Any help is appreciated. thanks

> From: jsquy...@cisco.com
> Date: Fri, 22 Oct 2010 15:35:11 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OPEN MPI data transfer error
>
> It doesn't look like you have completed the request that came back from
> Irecv. You need to TEST or WAIT on requests before they are actually
> completed (e.g., in the case of a receive, the data won't be guaranteed to be
> in the target buffer until TEST/WAIT indicates that the request has
> completed).
>
> On Oct 22, 2010, at 3:19 PM, Jack Bryan wrote:
> > [quoted message "OPEN MPI data transfer error" elided]
>
> --
> Jeff Squyres
> jsquy...@cisco.com
[OMPI users] Open MPI program cannot complete
Hi, I have a problem with Open MPI. My program has 5 processes. All of them can run MPI_Finalize() and return 0, but the whole program cannot complete: in the MPI cluster job queue it is still in running status. If I use 1 process to run it, there is no problem. Why? My program:

int main (int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &mySize);
    MPI_Comm world;
    world = MPI_COMM_WORLD;

    if (myRank == 0) {
        do some things.
    }
    if (myRank != 0) {
        do some things.
        MPI_Finalize();
        return 0;
    }
    if (myRank == 0) {
        MPI_Finalize();
        return 0;
    }
}

Also, some output files contain garbage that cannot be read. In the 1-process case the program prints correct results to these output files. Any help is appreciated. thanks Jack Oct. 24 2010
Re: [OMPI users] Open MPI program cannot complete
Thanks for the reply. But I use mpi_waitall() to make sure that all MPI communications have finished before a process calls MPI_Finalize() and returns. Any help is appreciated. thanks Jack Oct. 24 2010

> From: g...@ldeo.columbia.edu
> Date: Sun, 24 Oct 2010 17:31:11 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> Hi Jack
>
> It may depend on "do some things".
> Does it involve MPI communication?
>
> Also, why not put MPI_Finalize(); return 0; outside the ifs?
>
> Gus Correa
>
> On Oct 24, 2010, at 2:23 PM, Jack Bryan wrote:
> > [quoted message "Open MPI program cannot complete" elided]
Re: [OMPI users] Open MPI program cannot complete
Thanks, but my code is too long to be posted. What are the common reasons for this kind of problem? Any help is appreciated. Jack Oct. 24 2010

> From: g...@ldeo.columbia.edu
> Date: Sun, 24 Oct 2010 18:09:52 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> Hi Jack
>
> Your code snippet is too terse; it doesn't show the MPI calls.
> It is hard to guess what the problem is this way.
>
> Gus Correa
>
> On Oct 24, 2010, at 5:43 PM, Jack Bryan wrote:
> > [earlier messages in the thread elided]
Re: [OMPI users] Open MPI program cannot complete
thanks. I used:

cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl;
MPI_Finalize();
return 0;

I can get the output " I am rank 0 (1, 2, ...) I am before MPI_Finalize() ". Are there other, better ways to check this? Any help is appreciated. thanks Jack Oct. 25 2010

From: solarbik...@gmail.com
Date: Sun, 24 Oct 2010 19:47:54 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

How do you know all processes call mpi_finalize? Did you have all of them print out something before they call mpi_finalize? I think what Gustavo is getting at is that maybe you had some MPI calls within your snippets that hang your program, so some of your processes never called mpi_finalize.

On Sun, Oct 24, 2010 at 6:59 PM, Jack Bryan wrote:
[earlier messages in the thread elided]

--
David Zhang
University of California, San Diego
Re: [OMPI users] Open MPI program cannot complete
thanks. I found a problem. I used:

cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl;
MPI_Finalize();
cout << " I am rank " << rank << " I am after MPI_Finalize()" << endl;
return 0;

I can get the output " I am rank 0 (1, 2, ...) I am before MPI_Finalize() " and " I am rank 0 I am after MPI_Finalize() ". But the other processes do not print " I am rank ... I am after MPI_Finalize() ". It is weird: the processes have reached the point just before MPI_Finalize(), so why are they hung there? Are there other, better ways to check this? Any help is appreciated. thanks Jack Oct. 25 2010

From: solarbik...@gmail.com
Date: Sun, 24 Oct 2010 19:47:54 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

[quoted text from earlier in the thread elided]
Re: [OMPI users] Open MPI program cannot complete
Thanks, but I have put an mpi_waitall(request) before

cout << " I am rank " << rank << " I am before MPI_Finalize()" << endl;

If the above sentence has been printed out, it means that all requests have been checked and finished, right? What may be the possible reasons for the hang? Any help is appreciated. Jack Oct. 25 2010

Date: Mon, 25 Oct 2010 05:32:44 -0400
From: terry.don...@oracle.com
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

So what you are saying is *all* the ranks have entered MPI_Finalize and only a subset has exited, per placing prints before and after MPI_Finalize. Good. So my guess is that the processes stuck in MPI_Finalize have a prior MPI request outstanding that for whatever reason is unable to complete. So I would first look at all the MPI requests and make sure they completed.

--td

On 10/25/2010 02:38 AM, Jack Bryan wrote:
[quoted text from earlier in the thread elided]
Re: [OMPI users] Open MPI program cannot complete
Thanks, the problem is still there. I used:

cout << "In main(), I am rank " << myRank << " , I am before MPI_Barrier(MPI_COMM_WORLD). \n\n" << endl;
MPI_Barrier(MPI_COMM_WORLD);
cout << "In main(), I am rank " << myRank << " , I am before MPI_Finalize() and after MPI_Barrier(MPI_COMM_WORLD). \n\n" << endl;
MPI_Finalize();
cout << "In main(), I am rank " << myRank << " , I am after MPI_Finalize(), then return 0 . \n\n" << endl;
return 0;

Only process 0 returns. The other processes are still stuck in MPI_Finalize(). Any help is appreciated. JACK Oct. 25 2010

From: solarbik...@gmail.com
Date: Mon, 25 Oct 2010 08:27:19 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

I think I got this problem before. Put an mpi_barrier(mpi_comm_world) before mpi_finalize for all processes. For me, MPI terminates nicely only when all processes call mpi_finalize at the same time, so I do it for all my programs.

On Mon, Oct 25, 2010 at 7:13 AM, Jack Bryan wrote:
[quoted text from earlier in the thread elided]
Re: [OMPI users] Open MPI program cannot complete
thanks. Would you tell me how to use (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) with MPI? I need to use a #PBS parallel job script to submit jobs on the MPI cluster. Where should I put the gdb command in the script, and how do I get the ZOMBIE_PID? thanks. Any help is appreciated. Jack Oct. 25 2010

Date: Mon, 25 Oct 2010 19:01:38 +0200
From: j...@59a2.org
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

On Mon, Oct 25, 2010 at 18:26, Jack Bryan wrote:
> Thanks, the problem is still there.

This really doesn't prove that there are no outstanding asynchronous requests, but perhaps you know that there are not, despite not being able to post a complete test case here. I suggest attaching a debugger and getting a stack trace from the zombies (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID).

Jed
Re: [OMPI users] Open MPI program cannot complete
thanks. I have to use #PBS to submit any job on my cluster; I cannot run a job from the command line. This is my script:

--
#!/bin/bash
#PBS -N jobname
#PBS -l walltime=00:08:00,nodes=1
#PBS -q queuename
COMMAND=/mypath/myprog
NCORES=5

cd $PBS_O_WORKDIR
NODES=`cat $PBS_NODEFILE | wc -l`
NPROC=$(( $NCORES * $NODES ))

mpirun -np $NPROC --mca btl self,sm,openib $COMMAND
---

Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) in the script? And how do I get ZOMBIE_PID from the script? Any help is appreciated. thanks Oct. 25 2010

Date: Mon, 25 Oct 2010 19:24:35 +0200
From: j...@59a2.org
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI program cannot complete

On Mon, Oct 25, 2010 at 19:07, Jack Bryan wrote:
> I need to use #PBS parallel job script to submit a job on MPI cluster.

Is it not possible to reproduce locally? Most clusters have a way to submit an interactive job (which would let you start this thing and then inspect individual processes). Ashley's Padb suggestion will certainly be better in a non-interactive environment.

> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID) in the script ?

Is control returning to your script after rank 0 has exited? In that case, you can just put this on the next line.

> How to get the ZOMBIE_PID ?

"ps" from the command line, or getpid() from C code.

Jed
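One way to combine Jed's gdb suggestion with a batch-only cluster, sketched as an assumption rather than a tested recipe: background mpirun in the job script, wait long enough for the hang to appear, then attach gdb to every process still running the program. The paths, sleep interval, and rank count are placeholders, and gdb/pgrep must exist on the compute node:

```shell
#!/bin/bash
#PBS -N jobname
#PBS -l walltime=00:08:00,nodes=1
#PBS -q queuename
cd $PBS_O_WORKDIR

# Run the job in the background so the script regains control
# even while some ranks are stuck in MPI_Finalize.
mpirun -np 5 /mypath/myprog &
MPIRUN_PID=$!

# Give the program time to reach the hang, then dump a stack trace
# of every process still executing it.
sleep 300
for pid in $(pgrep -f /mypath/myprog); do
    echo "=== backtrace of $pid ==="
    gdb --batch -ex 'bt full' -ex 'info reg' -p "$pid"
done

wait $MPIRUN_PID
```

The backtraces land in the job's stdout file, which answers both "where to put the gdb line" and "how to get the PID" without an interactive session.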
Re: [OMPI users] Open MPI program cannot complete
thanks I use qsub -I nsga2_job.shqsub: waiting for job 48270.clusterName to start By qstatI found the job name is none and no results show up. No shell prompt appear, the command line is hang there , no response. Any help is appreciated. Thanks Jack Oct. 25 2010 > From: jsquy...@cisco.com > Date: Mon, 25 Oct 2010 13:39:30 -0400 > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, > "qsub -I ..." ? > > Then you get a shell prompt with your allocated cores and can run stuff > interactively. I don't know if your site allows this, but interactive > debugging here might be *significantly* easier than try to automate some > debugging. > > > On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: > > > thanks > > > > I have to use #PBS to submit any jobs in my cluster. > > I cannot use command line to hang a job on my cluster. > > > > this is my script: > > -- > > #!/bin/bash > > #PBS -N jobname > > #PBS -l walltime=00:08:00,nodes=1 > > #PBS -q queuename > > COMMAND=/mypath/myprog > > NCORES=5 > > > > cd $PBS_O_WORKDIR > > NODES=`cat $PBS_NODEFILE | wc -l` > > NPROC=$(( $NCORES * $NODES )) > > > > mpirun -np $NPROC --mca btl self,sm,openib $COMMAND > > > > --- > > > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > > ZOMBIE_PID) in the script ? > > And how to get ZOMBIE_PID from the script ? > > > > Any help is appreciated. > > > > thanks > > > > Oct. 25 2010 > > > > Date: Mon, 25 Oct 2010 19:24:35 +0200 > > From: j...@59a2.org > > To: us...@open-mpi.org > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > > On Mon, Oct 25, 2010 at 19:07, Jack Bryan wrote: > > I need to use #PBS parallel job script to submit a job on MPI cluster. > > > > Is it not possible to reproduce locally? Most clusters have a way to > > submit an interactive job (which would let you start this thing and then > > inspect individual processes). 
> > Ashley's Padb suggestion will certainly be better in a non-interactive
> > environment.
> >
> > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid
> > ZOMBIE_PID) in the script ?
> >
> > Is control returning to your script after rank 0 has exited? In that case,
> > you can just put this on the next line.
> >
> > How to get the ZOMBIE_PID ?
> >
> > "ps" from the command line, or getpid() from C code.
> >
> > Jed
> >
> > ___ users mailing list
> > us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
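The "how do I get ZOMBIE_PID from the script" step above can be sketched as a self-contained shell fragment. Here `sleep` merely stands in for a hung rank, `myprog` would be the real executable name, and the actual gdb attach is left as a comment (these names and the availability of gdb on the compute node are assumptions, not facts from the thread):

```bash
# Stand-in for a hung rank: in the real job this would be the MPI process
# still running after mpirun fails to return.
sleep 30 &

# Find the PID of the (simulated) hung process among this shell's children.
# In a real job script you would match on the program name, e.g.:
#   ZOMBIE_PID=$(pgrep -u "$USER" -n myprog)
ZOMBIE_PID=$(pgrep -P $$ -n sleep)

# The real next line would then be:
#   gdb --batch -ex 'bt full' -ex 'info reg' -p "$ZOMBIE_PID"
echo "would attach gdb to PID $ZOMBIE_PID"

kill "$ZOMBIE_PID"
```

Note that if control never returns to the job script (the mpirun line itself hangs), this has to run from a second shell on the same node rather than from the next line of the script.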
Re: [OMPI users] Open MPI program cannot complete
thanks But, the code is too long. Jack Oct. 25 2010 > Date: Mon, 25 Oct 2010 14:08:54 -0400 > From: g...@ldeo.columbia.edu > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI program cannot complete > > Your job may be queued, not executing, because there are no > resources available, all nodes are busy. > Try qstat -a. > > Posting a code snippet with all your MPI calls may prove effective. > You might get a trove of advice for a thrift of effort. > > Jeff Squyres wrote: > > Check the man page for qsub for proper use. > > > > > > On Oct 25, 2010, at 1:49 PM, Jack Bryan wrote: > > > >> thanks > >> > >> I use > >> qsub -I nsga2_job.sh > >> qsub: waiting for job 48270.clusterName to start > >> > >> By qstat > >> I found the job name is none and no results show up. > >> > >> No shell prompt appear, the command line is hang there , no response. > >> > >> Any help is appreciated. > >> > >> Thanks > >> > >> Jack > >> > >> Oct. 25 2010 > >> > >>> From: jsquy...@cisco.com > >>> Date: Mon, 25 Oct 2010 13:39:30 -0400 > >>> To: us...@open-mpi.org > >>> Subject: Re: [OMPI users] Open MPI program cannot complete > >>> > >>> Can you use the interactive mode of PBS to get 5 cores on 1 node? IIRC, > >>> "qsub -I ..." ? > >>> > >>> Then you get a shell prompt with your allocated cores and can run stuff > >>> interactively. I don't know if your site allows this, but interactive > >>> debugging here might be *significantly* easier than try to automate some > >>> debugging. > >>> > >>> > >>> On Oct 25, 2010, at 1:35 PM, Jack Bryan wrote: > >>> > >>>> thanks > >>>> > >>>> I have to use #PBS to submit any jobs in my cluster. > >>>> I cannot use command line to hang a job on my cluster. 
> >>>> > >>>> this is my script: > >>>> -- > >>>> #!/bin/bash > >>>> #PBS -N jobname > >>>> #PBS -l walltime=00:08:00,nodes=1 > >>>> #PBS -q queuename > >>>> COMMAND=/mypath/myprog > >>>> NCORES=5 > >>>> > >>>> cd $PBS_O_WORKDIR > >>>> NODES=`cat $PBS_NODEFILE | wc -l` > >>>> NPROC=$(( $NCORES * $NODES )) > >>>> > >>>> mpirun -np $NPROC --mca btl self,sm,openib $COMMAND > >>>> > >>>> --- > >>>> > >>>> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > >>>> ZOMBIE_PID) in the script ? > >>>> And how to get ZOMBIE_PID from the script ? > >>>> > >>>> Any help is appreciated. > >>>> > >>>> thanks > >>>> > >>>> Oct. 25 2010 > >>>> > >>>> Date: Mon, 25 Oct 2010 19:24:35 +0200 > >>>> From: j...@59a2.org > >>>> To: us...@open-mpi.org > >>>> Subject: Re: [OMPI users] Open MPI program cannot complete > >>>> > >>>> On Mon, Oct 25, 2010 at 19:07, Jack Bryan wrote: > >>>> I need to use #PBS parallel job script to submit a job on MPI cluster. > >>>> > >>>> Is it not possible to reproduce locally? Most clusters have a way to > >>>> submit an interactive job (which would let you start this thing and then > >>>> inspect individual processes). Ashley's Padb suggestion will certainly > >>>> be better in a non-interactive environment. > >>>> > >>>> Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > >>>> ZOMBIE_PID) in the script ? > >>>> > >>>> Is control returning to your script after rank 0 has exited? In that > >>>> case, you can just put this on the next line. > >>>> > >>>> How to get the ZOMBIE_PID ? > >>>> > >>>> "ps" from the command line, or getpid() from C code. 
> >>>> > >>>> Jed > >>>> > >>>> ___ users mailing list > >>>> us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>> ___ > >>>> users mailing list > >>>> us...@open-mpi.org > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>> > >>> -- > >>> Jeff Squyres > >>> jsquy...@cisco.com > >>> For corporate legal information go to: > >>> http://www.cisco.com/web/about/doing_business/legal/cri/ > >>> > >>> > >>> ___ > >>> users mailing list > >>> us...@open-mpi.org > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >> ___ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI program cannot complete
Thanks

I have downloaded http://padb.googlecode.com/files/padb-3.0.tgz and compiled it. But, there is no user manual; I cannot use it with padb -aQ.

./padb -aQ myjob
padb: Error: --all incompatible with specific ids

Actually, myjob is running in the queue.

Do you have a user manual about how to use it ?

thanks

> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 18:08:32 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> On 25 Oct 2010, at 17:26, Jack Bryan wrote:
>
> > Thanks, the problem is still there.
> >
> > I used:
> >
> > Only process 0 returns. Other processes are still stuck in
> > MPI_Finalize().
> >
> > Any help is appreciated.
>
> You can use the command "padb -aQ" to show you the message queues for your
> application, you'll need to download and install padb then simply run your
> job, allow it to hang and then run padb - it'll show you the message queues
> for each rank that it can find processes for (the ones that haven't exited).
> If this isn't any help run "padb -axt" for the stack traces and send the
> output to this list.
>
> The web-site is in my signature or there is a new beta release out this week
> at http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz
>
> Ashley.
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI program cannot complete
Thanks

I have downloaded http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz and followed the instructions in the INSTALL file and installed it at /mypath/padb32

But, I got:

-bash-3.2$ padb -Ormgr=pbs -Q 48279.cluster
Job 48279.cluster is not active

Actually, the job was running.

I have installed bin at /mypath/padb32/bin and libexec at /lustre/jxding/padb32/libexec

When I installed it, I used

./configure --prefix=/mypath/padb32

I got

-
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking whether gcc and cc understand -c and -o together... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: executing depfiles commands
---
-bash-3.2$ make
Making all in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
gcc -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"padb\" -DVERSION=\"3.2-beta1\" -I. -Wall -g -O2 -MT minfo-minfo.o -MD -MP -MF .deps/minfo-minfo.Tpo -c -o minfo-minfo.o `test -f 'minfo.c' || echo './'`minfo.c
minfo.c: In function 'find_sym':
minfo.c:158: warning: dereferencing type-punned pointer will break strict-aliasing rules
minfo.c: In function 'main':
minfo.c:649: warning: type-punning to incomplete type might break strict-aliasing rules
minfo.c:650: warning: type-punning to incomplete type might break strict-aliasing rules
mv -f .deps/minfo-minfo.Tpo .deps/minfo-minfo.Po
gcc -Wall -g -O2 -ldl -o minfo minfo-minfo.o
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Nothing to be done for `all-am'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
-
-bash-3.2$ make install
Making install in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
make[2]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
test -z "/lustre/jxding/padb32/bin" || /bin/mkdir -p "/mypath/padb32/bin"
 /usr/bin/install -c padb '/lustre/jxding/padb32/bin'
test -z "/lustre/jxding/padb32/libexec" || /bin/mkdir -p "/mypath/padb32/libexec"
 /usr/bin/install -c minfo '/lustre/jxding/padb32/libexec'
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[2]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[2]: Nothing to be done for `install-exec-am'.
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
-bash-3.2$ make installcheck
Making installcheck in src
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Nothing to be done for `installcheck'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1/src'
make[1]: Entering directory `/mypath/padb32/padb-3.2-beta1'
make[1]: Nothing to be done for `installcheck-am'.
make[1]: Leaving directory `/mypath/padb32/padb-3.2-beta1'
--

Is there something wrong with what I have done ?

Any help is appreciated.

thanks

Jack

Oct. 25 2010

> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 20:40:18 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI program cannot complete
>
> On 25 Oct 2010, at 20:18, Jack Bryan wrote:
>
> > Thanks
> > I have downloaded
> > http://padb.googlecode.com/files/padb-3.0.tgz
> >
> > and compiled it.
> >
> > But, no user manual, I cannot use it by padb -aQ.
>
> The -a flag is a shortcut to all jobs, if you are providing a jobid (which is
> normally numeric) then don't set the -a flag.
>
> > Do you have a user manual about how to use it ?
>
> In my previ
Re: [OMPI users] Open MPI program cannot complete
thanks

But, I cannot see the attachment in the email. Would you please send it again ? and also copy it to my other email: tomviewisu@yahoo.com

thanks

Oct. 25 2010

From: dtustud...@hotmail.com
To: ash...@pittman.co.uk
Subject: RE: [OMPI users] Open MPI program cannot complete
List-Post: users@lists.open-mpi.org
Date: Mon, 25 Oct 2010 16:53:32 -0600

thanks But, I cannot see the attachment in the email. Would you please send it again ? and also copy it to my other email: tomview...@yahoo.com thanks Oct. 25 2010

> Subject: Re: [OMPI users] Open MPI program cannot complete
> From: ash...@pittman.co.uk
> Date: Mon, 25 Oct 2010 23:41:32 +0100
> To: dtustud...@hotmail.com
>
> Thanks, that tells me a lot.
>
> Try the attached padb, I've added the patch for you and removed the -w option.
> Can you run it and send me back the output please.
>
> Ashley.
>
> On 25 Oct 2010, at 23:29, Jack Bryan wrote:
>
> > Thanks
> >
> > Here is the
> >
> > -bash-3.2$ qstat -fB
> > Server: clusterName
> > server_state = Active
> > scheduling = True
> > total_jobs = 26
> > state_count = Transit:0 Queued:7 Held:0 Waiting:0 Running:18 Exiting:0
> > acl_hosts = clustername
> > default_queue = normal
> > log_events = 511
> > mail_from = adm
> > query_other_jobs = True
> > resources_assigned.nodect = 246
> > scheduler_iteration = 600
> > node_check_rate = 150
> > tcp_timeout = 6
> > mom_job_sync = True
> > pbs_version = 2.4.2
> > keep_completed = 300
> > submit_hosts = clusterName
> > next_job_number = 48293
> > net_counter = 2 9 6
> >
> > -bash-3.2$ qstat -w -n
> > qstat: invalid option -- w
> >
> > Which line should I put the
> > -
> > --- padb (revision 401)
> > +++ padb (working copy)
> > @@ -2824,6 +2824,7 @@
> > foreach my $server (@servers) {
> > pbs_get_lqsub( $user, $server ); # get job list by qsub
> > }
> > + print Dumper \%pbs_tabjobs;
> > return \%pbs_tabjobs;
> > }
> >
> > in the bin file padb
> >
> > Any help is appreciated.
> >
> > thanks
> >
> > Jack
> >
> > Oct.
25 2010 > > > > > > > > > Subject: Re: [OMPI users] Open MPI program cannot complete > > > From: ash...@pittman.co.uk > > > Date: Mon, 25 Oct 2010 22:54:21 +0100 > > > To: dtustud...@hotmail.com > > > > > > > > > [off list] > > > > > > The PBS support was added by a third-party so I've not used it in anger > > > myself, it appears you are doing the correct thing as far as I can tell. > > > > > > Can you send me the output of the following two commands and also apply > > > the patch below to padb (you can do this just in the bin dir - it's a > > > perl script) and send me the output when you run that as well? > > > > > > qstat -fB > > > qstat -w -n > > > > > > --- padb (revision 401) > > > +++ padb (working copy) > > > @@ -2824,6 +2824,7 @@ > > > foreach my $server (@servers) { > > > pbs_get_lqsub( $user, $server ); # get job list by qsub > > > } > > > + print Dumper \%pbs_tabjobs; > > > return \%pbs_tabjobs; > > > } > > > > > > On 25 Oct 2010, at 22:30, Jack Bryan wrote: > > > > > > > Thanks > > > > > > > > I have downloaded > > > > http://padb.googlecode.com/files/padb-3.2-beta1.tar.gz > > > > > > > > and followed the instructions of INSTALL file and installed it at > > > > /mypath/padb32 > > > > > > > > But, I got: > > > > > > > > -bash-3.2$ padb -Ormgr=pbs -Q 48279.cluster > > > > Job 48279.cluster is not active > > > > > > > > Actually, the job was running. > > > > > > > > I have installed > > > > bin at > > > > > > > > /mypath/padb32/bin > > > > > > > > > > > > libexec at > > > > /lustre/jxding/padb32/libexec > > > > > > > > When I installed it, I used > > > > > > > > ./configure --prefix=/mypath/padb32 > > > > > > > > I got > > > > - > > > > > > > >
Re: [OMPI users] Open MPI program cannot complete
Hi, I put your attached padb in mypath and also set it up in the env variable. I got this:

-bash-3.2$ padb -Ormgr=pbs -Q 48494.cystorm2
-bash: /mypath/padb_patch_2010_10_26/padb: /usr/bin/perl^M: bad interpreter: No such file or directory

Any help is appreciated.

thanks

Jack

Oct. 26 2010

Subject: Re: [OMPI users] Open MPI program cannot complete
From: ash...@pittman.co.uk
List-Post: users@lists.open-mpi.org
Date: Tue, 26 Oct 2010 08:39:56 +0100
CC: tomview...@yahoo.com
To: dtustud...@hotmail.com

Sorry, I forgot to attach it last night.
Re: [OMPI users] Open MPI program cannot complete
thanks I got :

-bash-3.2$ padb -Ormgr=pbs -Q 48516.cystorm2
$VAR1 = {};
Job 48516.cluster is not active

Actually, the job is running.

Any help is appreciated.

thanks
Jinxu Ding

Oct. 26 2010

> Subject: Re: [OMPI users] Open MPI program cannot complete
> From: ash...@pittman.co.uk
> Date: Tue, 26 Oct 2010 23:18:57 +0100
> To: dtustud...@hotmail.com
>
> The "^M: bad interpreter" tells me that you've downloaded the file in Windows
> and have got dos-based new-lines in the file.
>
> Assuming it's installed on your machine run "dos2unix padb" and it'll remove
> them, if that doesn't work save the file using a unix based email program. I
> hope this helps you when we finally get it working!
>
> Ashley.
>
> On 26 Oct 2010, at 22:14, Jack Bryan wrote:
>
> > Hi,
> >
> > I put your attached padb in mypath and also set it up in the env variable.
> > I got this:
> >
> > -bash-3.2$ padb -Ormgr=pbs -Q 48494.cystorm2
> > -bash: /mypath/padb_patch_2010_10_26/padb: /usr/bin/perl^M: bad
> > interpreter: No such file or directory
> >
> > Any help is appreciated.
> >
> > thanks
> >
> > Jack
> >
> > Oct. 26 2010
> >
> > Subject: Re: [OMPI users] Open MPI program cannot complete
> > From: ash...@pittman.co.uk
> > Date: Tue, 26 Oct 2010 08:39:56 +0100
> > CC: tomview...@yahoo.com
> > To: dtustud...@hotmail.com
> >
> > Sorry, I forgot to attach it last night.
> >
> > --
> > Ashley Pittman, Bath, UK.
> >
> > Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
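Ashley's dos2unix fix can also be done with sed when dos2unix is not installed. The sketch below creates its own stand-in file named padb (so it is self-contained; the real file is the downloaded script):

```bash
# Create a stand-in script with DOS (\r\n) line endings, the way a
# Windows-side download would leave it.
printf '#!/usr/bin/perl\r\nprint 1;\r\n' > padb

# Strip the trailing carriage return from every line, in place.
sed -i 's/\r$//' padb

# The shebang line no longer ends in ^M, so the interpreter is found again.
head -n 1 padb    # -> #!/usr/bin/perl
```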
[OMPI users] open MPI please recommend a debugger for open MPI
Hi, Would you please recommend a debugger which can do debugging for parallel processes on Open MPI systems ? I hope that it can be installed without root rights, because I am not a root user on our MPI cluster. Any help is appreciated. Thanks Jack Oct. 28 2010
Re: [OMPI users] open MPI please recommend a debugger for open MPI
thanks

I have run padb (the new one with your patch) on my system and got :

-bash-3.2$ padb -Ormgr=pbs -Q 48516.cluster
$VAR1 = {};
Job 48516.cluster is not active

Actually, the job is running.

How do I check whether my system has pbs_pro ?

Any help is appreciated.

thanks
Jinxu Ding

Oct. 29 2010

> From: ash...@pittman.co.uk
> Date: Fri, 29 Oct 2010 18:21:46 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] open MPI please recommend a debugger for open MPI
>
> On 29 Oct 2010, at 12:06, Jeremy Roberts wrote:
>
> > I'd suggest looking into TotalView (http://www.totalviewtech.com) and/or
> > DDT (http://www.allinea.com/). I've used TotalView pretty extensively and
> > found it to be pretty easy to use. They are both commercial, however, and
> > not cheap.
> >
> > As far as I know, there isn't a whole lot of open source support for
> > parallel debugging. The Parallel Tools Platform of Eclipse claims to
> > provide a parallel debugger, though I have yet to try it
> > (http://www.eclipse.org/ptp/).
>
> Jeremy has covered the graphical parallel debuggers that I'm aware of; for a
> different approach there is padb, which isn't a "parallel debugger" in the
> traditional model but is able to show you the same type of information. It
> won't allow you to point-and-click through the source or single-step through
> the code, but it is lightweight and will show you the information which you
> need to know.
>
> Padb needs to integrate with the resource manager. I know it works with
> pbs_pro but it seems there are a few issues on your system, which is pbs
> (without the pro). I can help you with this and work through the problems,
> but only if you work with me and provide details of the integration. In
> particular I've sent you a version which has a small patch and some debug
> printfs added; if you could send me the output from this I'd be able to tell
> you if it was likely to work and how to go about making it do so.
>
> Ashley.
>
> --
>
> Ashley Pittman, Bath, UK.
> > Padb - A parallel job inspection tool for cluster computing > http://padb.pittman.org.uk > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] open MPI please recommend a debugger for open MPI
Hi, this is what I got :

-bash-3.2$ qstat -n -u myName

cluster:
                                                           Req'd  Req'd   Elap
Job ID          Username Queue  Jobname  SessID NDS TSK    Memory Time  S Time
--------------- -------- ------ -------- ------ --- --- -- ------ ----- - -----
48933.cluster.e myName   devel  myJob    107835   1  --        -- 00:02 C 00:00
   n20/0

Any help is appreciated.

thanks

> From: ash...@pittman.co.uk
> Date: Fri, 29 Oct 2010 18:38:25 +0100
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] open MPI please recommend a debugger for open MPI
>
> Can you try the following and send me the output.
>
> qstat -n -u `whoami` @clusterName
>
> The output sent before implies that your cluster is called "clusterName"
> rather than "cluster" which is a little surprising but let's see what it
> gives us if we query on that basis.
>
> Ashley.
>
> On 29 Oct 2010, at 18:29, Jack Bryan wrote:
>
> > thanks
> >
> > I have run padb (the new one with your patch) on my system and got :
> >
> > -bash-3.2$ padb -Ormgr=pbs -Q 48516.cluster
> > $VAR1 = {};
> > Job 48516.cluster is not active
> >
> > Actually, the job is running.
> >
> > How to check whether my system has pbs_pro ?
> >
> > Any help is appreciated.
> >
> > thanks
> > Jinxu Ding
> >
> > Oct. 29 2010
> >
> > > From: ash...@pittman.co.uk
> > > Date: Fri, 29 Oct 2010 18:21:46 +0100
> > > To: us...@open-mpi.org
> > > Subject: Re: [OMPI users] open MPI please recommend a debugger for open
> > > MPI
> > >
> > > On 29 Oct 2010, at 12:06, Jeremy Roberts wrote:
> > >
> > > > I'd suggest looking into TotalView (http://www.totalviewtech.com)
> > > > and/or DDT (http://www.allinea.com/). I've used TotalView pretty
> > > > extensively and found it to be pretty easy to use. They are both
> > > > commercial, however, and not cheap.
> > > >
> > > > As far as I know, there isn't a whole lot of open source support for
> > > > parallel debugging. The Parallel Tools Platform of Eclipse claims to
> > > > provide a parallel debugger, though I have yet to try it
> > > > (http://www.eclipse.org/ptp/).
> > > > > > Jeremy has covered the graphical parallel debuggers that I'm aware of, > > > for a different approach there is padb which isn't a "parallel debugger" > > > in the traditional model but is able to show you the same type of > > > information, it won't allow you to point-and-click through the source or > > > single step through the code but it is lightweight and will show you the > > > information which you need to know. > > > > > > Padb needs to integrate with the resource manager, I know it works with > > > pbs_pro but it seems there are a few issues on your system which is pbs > > > (without the pro). I can help you with this and work through the problems > > > but only if you work with me and provide details of the integration, in > > > particular I've sent you a version which has a small patch and some debug > > > printfs added, if you could send me the output from this I'd be able to > > > tell you if it was likely to work and how to go about making it do so. > > > > > > Ashley. > > > > > > -- > > > > > > Ashley Pittman, Bath, UK. > > > > > > Padb - A parallel job inspection tool for cluster computing > > > http://padb.pittman.org.uk > > > > > > > > > ___ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- > > Ashley Pittman, Bath, UK. > > Padb - A parallel job inspection tool for cluster computing > http://padb.pittman.org.uk > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] message truncated error
HI, In my MPI program, the master sends many messages to another worker with the same tag. The worker uses

MPI::COMM_WORLD.Recv(&message_para_to_one_worker, 1, message_para_to_workers_type, 0, downStreamTaskTag);

to receive the messages. I got error:

n36:94880] *** An error occurred in MPI_Recv
[n36:94880] *** on communicator MPI_COMM_WORLD
[n36:94880] *** MPI_ERR_TRUNCATE: message truncated
[n36:94880] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n36:94880] *** Process received signal ***
[n36:94880] Signal: Segmentation fault (11)
[n36:94880] Signal code: Address not mapped (1)

Is this (the same tag) the reason for the errors ? Any help is appreciated. thanks Jack Oct. 31 2010
Re: [OMPI users] message truncated error
thanks I use

double* recvArray = new double[buffersize];   // the receive buffer size
MPI::COMM_WORLD.Recv(&(recvDataArray[0]), xVSize, MPI_DOUBLE, 0, mytaskTag);
delete [] recvArray ;

In the first iteration, the receiver works well. But, in the second iteration, I got the MPI_ERR_TRUNCATE: message truncated. The buffersize is the same in the two iterations.

Any help is appreciated.

thanks

Nov. 1 2010

> Date: Mon, 1 Nov 2010 08:08:08 +0100
> From: jody@gmail.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] message truncated error
>
> Hi Jack
>
> Usually MPI_ERR_TRUNCATE means that the buffer you use in MPI_Recv
> (or MPI::COMM_WORLD.Recv) is too small to hold the message coming in.
> Check your code to make sure you assign enough memory to your buffers.
>
> regards
> Jody
>
> On Mon, Nov 1, 2010 at 7:26 AM, Jack Bryan wrote:
> > HI,
> > In my MPI program, the master sends many messages to another worker with
> > the same tag.
> > The worker uses
> > MPI::COMM_WORLD.Recv(&message_para_to_one_worker, 1,
> > message_para_to_workers_type, 0, downStreamTaskTag);
> > to receive the messages
> > I got error:
> >
> > n36:94880] *** An error occurred in MPI_Recv
> > [n36:94880] *** on communicator MPI_COMM_WORLD
> > [n36:94880] *** MPI_ERR_TRUNCATE: message truncated
> > [n36:94880] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > [n36:94880] *** Process received signal ***
> > [n36:94880] Signal: Segmentation fault (11)
> > [n36:94880] Signal code: Address not mapped (1)
> >
> > Is this (the same tag) the reason for the errors ?
> > Any help is appreciated.
> > thanks
> > Jack
> > Oct. 31 2010
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] Open MPI data transfer error
Hi, In my Open MPI program, one master sends data to 3 workers. Two workers can receive their data, but the third worker cannot get its data. Before sending data, the master sends head information to each worker receiver so that each worker knows what the following data package is (such as length and package tag). The third worker can get its head-information message from the master but cannot get its correct data package. Instead, it got the data that should have been received by the first worker, which got its correct data. Why ? Any help is appreciated. thanks Jack Nov. 4 2010
Re: [OMPI users] Open MPI data transfer error
Thanks, I have used "cout" in c++ to print the values of data. The sender sends correct data to correct receiver. But, receiver gets wrong data from correct sender. why ? thanks Nov. 5 2010 > Date: Fri, 5 Nov 2010 08:54:22 -0400 > From: prent...@ias.edu > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI data transfer error > > Jack Bryan wrote: > > > > Hi, > > > > In my Open MPI program, one master sends data to 3 workers. > > > > Two workers can receive their data. > > > > But, the third worker can not get their data. > > > > Before sending data, the master sends a head information to each worker > > receiver > > so that each worker knows what the following data package is. (such as > > length, package tag). > > > > The third worker can get its head information message from master but > > cannot get its correct > > data package. > > > > It got the data that should be received by first worker, which get its > > correct data. > > > > > Jack, > > Providing the relevant sections of code here would be very helpful. > > > I would tell you to add some printf statements to your code to see what > data is stored in your variables on the master before it sends them to > each node, but Jeff Squyres and I agreed to disagree in a civil manner > on that debugging technique earlier this week, and I'd hate to re-open > those old wounds by suggesting that technique here. ;) > > > -- > Prentice > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI data transfer error
Thanks, But, my code is too long to be posted. dozens of files, thousands of lines. Do you have better ideas ? Any help is appreciated. Jack Nov. 5 2010 From: solarbik...@gmail.com List-Post: users@lists.open-mpi.org Date: Fri, 5 Nov 2010 11:20:57 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI data transfer error As Prentice said, we can't help you without seeing your code. openMPI has stood many trials from many programmers, with many bugs ironed out. So typically it is unlikely openMPI is the source of your error. Without seeing your code the only logical conclusion is that something is wrong with your programming. On Fri, Nov 5, 2010 at 10:52 AM, Prentice Bisbal wrote: We can't help you with your coding problem without seeing your code. Jack Bryan wrote: > Thanks, > I have used "cout" in c++ to print the values of data. > > The sender sends correct data to correct receiver. > > But, receiver gets wrong data from correct sender. > > why ? > > thanks > > Nov. 5 2010 > >> Date: Fri, 5 Nov 2010 08:54:22 -0400 >> From: prent...@ias.edu >> To: us...@open-mpi.org >> Subject: Re: [OMPI users] Open MPI data transfer error >> >> Jack Bryan wrote: >> > >> > Hi, >> > >> > In my Open MPI program, one master sends data to 3 workers. >> > >> > Two workers can receive their data. >> > >> > But, the third worker can not get their data. >> > >> > Before sending data, the master sends a head information to each worker >> > receiver >> > so that each worker knows what the following data package is. (such as >> > length, package tag). >> > >> > The third worker can get its head information message from master but >> > cannot get its correct >> > data package. >> > >> > It got the data that should be received by first worker, which get its >> > correct data. >> > >> >> >> Jack, >> >> Providing the relevant sections of code here would be very helpful. 
>> >> >> I would tell you to add some printf statements to your code to see what >> data is stored in your variables on the master before it sends them to >> each node, but Jeff Squyres and I agreed to disagree in a civil manner >> on that debugging technique earlier this week, and I'd hate to re-open >> those old wounds by suggesting that technique here. ;) >> >> >> -- >> Prentice ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- David Zhang University of California, San Diego ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Open MPI data transfer error
Thanks. About my MPI program's bugs: I used GDB and got this error:

Program received signal SIGSEGV, Segmentation fault.
0: 0x003a31c62184 in fwrite () from /lib64/libc.so.6

and also:

1: Program received signal SIGABRT, Aborted.
0: I am rank 0, I have sent 4 tasks out of total tasks
1: 0x003a31c30265 in raise () from /lib64/libc.so.6

It may be caused by how a class is used. My program's master-worker MPI framework:

class CNSGA2 {
    // allocates memory for variables;
    // some deallocation statements;
    // some pointers;
    evaluate();   // it is a function
};
CNSGA2::CNSGA2() {}

class newCNSGA2 : public CNSGA2 {
public:
    newCNSGA2()  { cout << " constructor for newCNSGA2 \n\n" << endl; }
    ~newCNSGA2() { cout << " destructor for newCNSGA2 \n\n" << endl; }
};

main() {
    CNSGA2* nsga2a = new CNSGA2(true);    // true/false select different constructors
    CNSGA2* nsga2b = new CNSGA2(false);

    if (myRank == 0)   // scope1
    {
        initialize the objects nsga2a and nsga2b;
    }

    broadcast some parameters, which are obtained in scope1; according to
    those parameters, define a datatype (myData) that all workers use for
    recv and send;

    if (myRank == 0)   // scope2
    {
        send myData out to the workers using the datatype defined above;
    }

    if (myRank != 0) {
        newCNSGA2 myNsga2;
        recv data from master and work on the received data;
        myNsga2.evaluate(recv data);
        send back results;
    }
}

If I declare the objects (nsga2a, nsga2b) inside scope1, they are not visible in scope2. But the two objects are only used by the master, not by the workers; the workers only need to call evaluate() from the class CNSGA2. This is why I used inheritance to define the new class newCNSGA2. The problem is that there is some memory allocation and deallocation inside class CNSGA2, and the new class newCNSGA2 does not need that allocation and deallocation. If I put the declarations of CNSGA2* nsga2a and CNSGA2* nsga2b inside scope1, they are not visible in scope2, and I cannot combine the two scopes because the datatype has to be defined between them so that all workers can see it and use it for send and recv.
Any help is appreciated. Jack Nov. 6 2010

> Date: Fri, 5 Nov 2010 14:55:32 -0800
> From: eugene@oracle.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI data transfer error
>
> Debugging is not a straightforward task. Even posting the code doesn't
> necessarily help (since no one may be motivated to help, or they can't
> reproduce the problem, or...). You'll just have to try different things
> and see what works for you. Another option is to trace the MPI calls.
> If a process sends a message, dump out the MPI_Send() arguments. When a
> receiver receives, correspondingly dump those arguments. Etc. This
> might be a way of seeing what the program is doing in terms of MPI and
> thereby getting to suggestion B below.
>
> How do you trace and sort through the resulting data? That's another
> tough question. Among other things, if you can't find a tool that fits
> your needs, you can use the PMPI layer to write wrappers. Writing
> wrappers is like inserting printf() statements, but doesn't quite have
> the same amount of moral shame associated with it!
>
> Prentice Bisbal wrote:
>
> >Choose one:
> >
> >A) Post only the relevant sections of the code. If you have a syntax
> >error, it should be in the Send and Receive calls, or in one of the
> >lines where the data is copied to or read from the
> >array/buffer/whatever you're sending or receiving.
> >
> >B) Try reproducing your problem in a toy program that has only enough
> >code to reproduce the problem. For example, create an array, populate
> >it with data, send it, and then on the receiving end, receive it and
> >print it out. Something simple like that. I find that when I do this,
> >I usually find the error in my code.
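Eugene's PMPI suggestion above can be sketched concretely: the MPI profiling layer lets you define your own MPI_Send that logs its arguments and then forwards to the real implementation through the PMPI_ entry point. This is a hypothetical fragment, not a complete program; it uses the pre-MPI-3 non-const buffer signature to match the Open MPI 1.3/1.4 era of this thread, and it would be compiled into the application (or a separate library linked before the MPI library) rather than run on its own.

```cpp
#include <mpi.h>
#include <cstdio>

// Intercept MPI_Send: log the arguments, then call the real send through
// the PMPI entry point. Every MPI function has a PMPI_ twin for this purpose.
extern "C" int MPI_Send(void* buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm)
{
    int rank = -1;
    PMPI_Comm_rank(comm, &rank);  // use the PMPI_ form here too, to avoid recursion
    std::fprintf(stderr, "[rank %d] MPI_Send: count=%d dest=%d tag=%d\n",
                 rank, count, dest, tag);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```

A matching wrapper around MPI_Recv (logging the source, tag, and the count extracted from the status) gives both ends of each transfer, which is exactly the kind of trace Eugene describes.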
[OMPI users] Open MPI access the same file in parallel ?
Hi, I have a file, located in a system folder, that can be accessed by all parallel processes. Does Open MPI allow multiple processes to access the same file at the same time? For example, all processes open the file and load data from it at the same time. Any help is really appreciated. thanks Jack Mar 9 2011
Re: [OMPI users] Open MPI access the same file in parallel ?
Thanks. I only need to read the file, and each process reads it only once; but the file is about 200 MB, and my code is C++. Does Open MPI support this? thanks

From: solarbik...@gmail.com
Date: Wed, 9 Mar 2011 20:57:03 -0800
To: us...@open-mpi.org
Subject: Re: [OMPI users] Open MPI access the same file in parallel ?

Under my programming environment, FORTRAN, it is possible to do a parallel read (using the native read function instead of MPI's parallel read function), although you'll run into problems when you try to write to the same file in parallel.

On Wed, Mar 9, 2011 at 8:45 PM, Jack Bryan wrote:
> Hi, I have a file, located in a system folder, that can be accessed by
> all parallel processes. Does Open MPI allow multiple processes to access
> the same file at the same time? For example, all processes open the file
> and load data from it at the same time. Any help is really appreciated.
> thanks Jack Mar 9 2011

--
David Zhang
University of California, San Diego
Re: [OMPI users] Open MPI access the same file in parallel ?
Hi, thanks for your code. I have tested it with a simple example file. It works well, without any conflict from parallel accesses to the same file. Now I am using CPLEX (an optimization model solver) to load a model data file, which can be 200 MBytes:

CPLEX.importModel(modelName, dataFileName);

I do not know how the CPLEX code handles reading the model data file. Any suggestions or ideas are welcome. thanks Jack

From: belaid_...@hotmail.com
To: us...@open-mpi.org
Date: Thu, 10 Mar 2011 05:51:31
Subject: Re: [OMPI users] Open MPI access the same file in parallel ?

Hi, You can do that with C++ also. Just for the fun of it, I produced a little program for that; each process reads the whole file and prints the content to stdout. I hope this helps:

#include <mpi.h>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main (int argc, char* argv[])
{
    int rank, size;
    string line;
    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    ifstream txtFile("example.txt");
    if (txtFile.is_open()) {
        while (getline (txtFile, line)) {
            cout << line << endl;
        }
        txtFile.close();
    } else {
        cout << "Unable to open file";
    }
    MPI_Finalize(); /* end MPI */
    return 0;
}

With best regards,
-Belaid.
Re: [OMPI users] Open MPI access the same file in parallel ?
thanks. I am using the GNU mpic++ compiler. Can it automatically support accessing a file from many parallel processes? thanks

> Date: Wed, 9 Mar 2011 22:54:18 -0800
> From: n...@aol.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI access the same file in parallel ?
>
> On 3/9/2011 8:57 PM, David Zhang wrote:
> > Under my programming environment, FORTRAN, it is possible to parallel
> > read (using native read function instead of MPI's parallel read
> > function). Although you'll run into problem when you try to parallel
> > write to the same file.
>
> If your Fortran compiler/library is reasonably up to date, you will
> need to specify action='read', as opening once with the default
> readwrite will lock out other processes.
> --
> Tim Prince
[OMPI users] OMPI seg fault by a class with weird address.
Hi, I got a run-time error in an Open MPI C++ program. The following output is from gdb:

--
Program received signal SIGSEGV, Segmentation fault.
0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0

At the point:

Breakpoint 9, Index::Index (this=0x7fffcb80) at src/index.cpp:20
20          Name(0) {}

The Index constructor had been called before this point with no problem:
---
Breakpoint 9, Index::Index (this=0x117d800) at src/index.cpp:20
20          Name(0) {}
(gdb) c
Continuing.
Breakpoint 9, Index::Index (this=0x117d860) at src/index.cpp:20
20          Name(0) {}
(gdb) c
Continuing.

It seems that the 0x7fffcb80 address is the problem, but I do not know the reason or how to remove the bug. Any help is really appreciated. thanks

The following is the Index definition:
-
class Index {
public:
    Index();
    Index(const Index& rhs);
    ~Index();
    Index& operator=(const Index& rhs);
    vector<int> GetPosition() const;
    vector<int> GetColumn() const;
    vector<int> GetYear() const;
    vector<string> GetName() const;
    int GetPosition(const int idx) const;
    int GetColumn(const int idx) const;
    int GetYear(const int idx) const;
    string GetName(const int idx) const;
    int GetSize() const;
    void Add(const int idx, const int col, const string& name);
    void Add(const int idx, const int col, const int year, const string& name);
    void Add(const int idx, const Step& col, const string& name);
    void WriteFile(const char* fileinput) const;
private:
    vector<int> Position;
    vector<int> Column;
    vector<int> Year;
    vector<string> Name;
};

// Constructors and destructor for the Index class
Index::Index() : Position(0), Column(0), Year(0), Name(0) {}

Index::Index(const Index& rhs) :
    Position(rhs.GetPosition()),
    Column(rhs.GetColumn()),
    Year(rhs.GetYear()),
    Name(rhs.GetName()) {}

Index::~Index() {}

Index& Index::operator=(const Index& rhs) {
    Position = rhs.GetPosition();
    Column = rhs.GetColumn();
    Year = rhs.GetYear();
    Name = rhs.GetName();
    return *this;
}
--
Re: [OMPI users] OMPI seg fault by a class with weird address.
Hi, because the code is very long, I just show the calling relationship of the functions:

main() {
    scheduler();
}

scheduler() {
    ImportIndices();
}

ImportIndices() {
    Index IdxNode;
    IdxNode = ReadFile("fileName");
}

Index ReadFile(const char* fileinput) {
    Index TempIndex;
    .
}

vector<int> Index::GetPosition() const { return Position; }
vector<int> Index::GetColumn() const { return Column; }
vector<int> Index::GetYear() const { return Year; }
vector<string> Index::GetName() const { return Name; }
int Index::GetPosition(const int idx) const { return Position[idx]; }
int Index::GetColumn(const int idx) const { return Column[idx]; }
int Index::GetYear(const int idx) const { return Year[idx]; }
string Index::GetName(const int idx) const { return Name[idx]; }
int Index::GetSize() const { return Position.size(); }

The sequential code works well, and it has no scheduler().

The parallel code output from gdb:
--
Breakpoint 1, myNeplanTaskScheduler(CNSGA2 *, int, int, int, ._85 *, char,
    int, message_para_to_workers_VecT &, MPI_Datatype, int &, int &,
    std::vector >, std::allocator > > > &,
    std::vector >, std::allocator > > > &,
    std::vector > &, int,
    std::vector >, std::allocator > > > &,
    MPI_Datatype, int, MPI_Datatype, int)
    (nsga2=0x118c490, popSize=, nodeSize=, myRank=, myChildpop=0x1208d80,
    genCandTag=65 'A', generationNum=1,
    myPopParaVec=std::vector of length 4, capacity 4 = {...},
    message_to_master_type=0x7fffd540, myT1Flag=@0x7fffd68c,
    myT2Flag=@0x7fffd688,
    resultTaskPackageT1=std::vector of length 4, capacity 4 = {...},
    resultTaskPackageT2Pr=std::vector of length 4, capacity 4 = {...},
    xdataV=std::vector of length 4, capacity 4 = {...}, objSize=7,
    resultTaskPackageT12=std::vector of length 4, capacity 4 = {...},
    xdata_to_workers_type=0x121c410, myGenerationNum=1,
    Mpara_to_workers_type=0x121b9b0, nconNum=0)
    at src/nsga2/myNetplanScheduler.cpp:109
109         ImportIndices();
(gdb) c
Continuing.

Breakpoint 2, ImportIndices () at src/index.cpp:120
120         IdxNode = ReadFile("prepdata/idx_node.csv");
(gdb) c
Continuing.

Breakpoint 4, ReadFile (fileinput=0xd8663d "prepdata/idx_node.csv")
    at src/index.cpp:86
86          Index TempIndex;
(gdb) c
Continuing.

Breakpoint 5, Index::Index (this=0x7fffcb80) at src/index.cpp:20
20          Name(0) {}
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0

---
The backtrace output from the above parallel Open MPI code:

(gdb) bt
#0  0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0
#1  0x2b3b2bd3 in opal_memory_ptmalloc2_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0
#2  0x003f7c8bd1dd in operator new(unsigned long) ()
   from /usr/lib64/libstdc++.so.6
#3  0x004646a7 in __gnu_cxx::new_allocator::allocate (this=0x7fffcb80, __n=0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/ext/new_allocator.h:88
#4  0x004646cf in std::_Vector_base >::_M_allocate (this=0x7fffcb80, __n=0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_vector.h:127
#5  0x00464701 in std::_Vector_base >::_Vector_base (this=0x7fffcb80, __n=0, __a=...)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_vector.h:113
#6  0x00464d0b in std::vector >::vector (this=0x7fffcb80, __n=0, __value=@0x7fffc968, __a=...)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_vector.h:216
#7  0x004890d7 in Index::Index (this=0x7fffcb80)
    at src/index.cpp:20
#8  0x0048927a in ReadFile (fileinput=0xd8663d "prepdata/idx_node.csv")
    at src/index.cpp:86
#9  0x00489533 in ImportIndices () at src/index.cpp:120
#10 0x00445e0e in myNeplanTaskScheduler(CNSGA2 *, int, int, int, ._85 *, char,
    int, message_para_to_workers_VecT &, MPI_Datatype, int &, int &, ...)
    (nsga2=0x118c490, popSize=, nodeSize=, myRank=, myChildpop=0x1208d80,
    genCandTag=65 'A', generationNum=1,
    myPopParaVec=std::vector of length 4, capacity 4 = {...},
    message_to_master_type=0x7fffd540, myT1Flag=@0x7fffd68c,
    myT2Flag=@0x7fffd688,
    resultTaskPackageT1=std::vector of length 4, capacity 4 = {...},
    resultTaskPackageT2Pr=std::vector of length 4, capacity 4 = {...},
    xdataV=std::vector of length 4, capacity 4 = {...}, objSize=7,
    resultTaskPackageT12=std::vector of length 4, capacity 4 = {..
Re: [OMPI users] OMPI seg fault by a class with weird address.
Thanks. I do not have system administrator authorization, so I am afraid I cannot rebuild Open MPI --without-memory-manager. Are there other ways to get around it? For example, can something else replace "ptmalloc"? Any help is really appreciated. thanks

From: belaid_...@hotmail.com
To: dtustud...@hotmail.com; us...@open-mpi.org
Subject: RE: [OMPI users] OMPI seg fault by a class with weird address.
Date: Tue, 15 Mar 2011 08:00:56

Hi Jack, I may need to see the whole code to decide, but my quick look suggests that ptmalloc is causing a problem with the STL vector allocation. ptmalloc is Open MPI's internal malloc library. Could you try to build Open MPI without memory management (using --without-memory-manager) and let us know the outcome? ptmalloc is not needed if you are not using an RDMA interconnect.

With best regards,
-Belaid.
Re: [OMPI users] OMPI seg fault by a class with weird address.
Thanks. From http://valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap I find that "Currently the wrappers are only buildable with mpiccs which are based on GNU GCC or Intel's C++ Compiler." The cluster I am working on uses the GNU Open MPI mpic++, and I am not sure whether the Valgrind wrapper can work here. I do not have system administrator authorization. Are there other (open source) memory checkers that can do this? thanks Jack

> Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Tue, 15 Mar 2011 06:19:53 -0400
> CC: dtustud...@hotmail.com
> To: us...@open-mpi.org
>
> You may also want to run your program through a memory-checking
> debugger such as valgrind to see if it turns up any other problems.
>
> AFAIK, ptmalloc should be fine for use with STL vector allocation.
>
> On Mar 15, 2011, at 4:00 AM, Belaid MOA wrote:
>
> > Hi Jack,
> > I may need to see the whole code to decide but my quick look suggests
> > that ptmalloc is causing a problem with STL-vector allocation.
Re: [OMPI users] OMPI seg fault by a class with weird address.
I have tried

export OMPI_MCA_memory_ptmalloc2_disable=1

It does not work; I get the same error. thanks

From: sam...@lanl.gov
To: us...@open-mpi.org
Date: Tue, 15 Mar 2011 09:27:35 -0600
Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.

I -think- setting OMPI_MCA_memory_ptmalloc2_disable to 1 will turn off OMPI's memory wrappers without having to rebuild. Someone please correct me if I'm wrong :-). For example (bash-like shell):

export OMPI_MCA_memory_ptmalloc2_disable=1

Hope that helps,
--
Samuel K. Gutierrez
Los Alamos National Laboratory
Re: [OMPI users] OMPI seg fault by a class with weird address.
This should be the configure info about Open MPI which I am using. -bash-3.2$ mpic++ -v Using built-in specs. Target: x86_64-redhat-linux Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --disable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux Thread model: posix gcc version 4.1.2 20080704 (Red Hat 4.1.2-50) thanks From: sam...@lanl.gov To: us...@open-mpi.org List-Post: users@lists.open-mpi.org Date: Tue, 15 Mar 2011 09:27:35 -0600 Subject: Re: [OMPI users] OMPI seg fault by a class with weird address. I -think- setting OMPI_MCA_memory_ptmalloc2_disable to 1 will turn off OMPI's memory wrappers without having to rebuild. Someone please correct me if I'm wrong :-). For example (bash-like shell): export OMPI_MCA_memory_ptmalloc2_disable=1 Hope that helps, --Samuel K. GutierrezLos Alamos National Laboratory On Mar 15, 2011, at 9:19 AM, Jack Bryan wrote:Thanks, I do not have system administrator authorization. I am afraid that I cannot rebuild OpenMPI --without-memory-manager. Are there other ways to get around it ? For example, use other things to replace "ptmalloc" ? Any help is really appreciated. thanks From: belaid_...@hotmail.com To: dtustud...@hotmail.com; us...@open-mpi.org Subject: RE: [OMPI users] OMPI seg fault by a class with weird address. List-Post: users@lists.open-mpi.org Date: Tue, 15 Mar 2011 08:00:56 + Hi Jack, I may need to see the whole code to decide but my quick look suggest that ptmalloc is causing a problem with STL-vector allocation. ptmalloc is the openMPI internal malloc library. 
Could you try to build openMPI without memory management (using --without-memory-manager) and let us know the outcome. ptmalloc is not needed if you are not using an RDMA interconnect. With best regards, -Belaid. From: dtustud...@hotmail.com To: belaid_...@hotmail.com; us...@open-mpi.org Subject: RE: [OMPI users] OMPI seg fault by a class with weird address. List-Post: users@lists.open-mpi.org Date: Tue, 15 Mar 2011 00:30:19 -0600 Hi, Because the code is very long, I just show the calling relationship of functions. main(){scheduler(); }scheduler(){ ImportIndices();} ImportIndices(){Index IdxNode ; IdxNode = ReadFile("fileName");} Index ReadFile(const char* fileinput) { Index TempIndex;. } vector Index::GetPosition() const { return Position; }vector Index::GetColumn() const { return Column; }vector Index::GetYear() const { return Year; }vector Index::GetName() const { return Name; }int Index::GetPosition(const int idx) const { return Position[idx]; }int Index::GetColumn(const int idx) const { return Column[idx]; }int Index::GetYear(const int idx) const { return Year[idx]; }string Index::GetName(const int idx) const { return Name[idx]; }int Index::GetSize() const { return Position.size(); } The sequential code works well, and there is no scheduler(). 
The parallel code output from gdb:
--
Breakpoint 1, myNeplanTaskScheduler(CNSGA2 *, int, int, int, ._85 *, char, int, message_para_to_workers_VecT &, MPI_Datatype, int &, int &, std::vector >, std::allocator > > > &, std::vector >, std::allocator > > > &, std::vector > &, int, std::vector >, std::allocator > > > &, MPI_Datatype, int, MPI_Datatype, int) (nsga2=0x118c490, popSize=, nodeSize=, myRank=, myChildpop=0x1208d80, genCandTag=65 'A', generationNum=1, myPopParaVec=std::vector of length 4, capacity 4 = {...}, message_to_master_type=0x7fffd540, myT1Flag=@0x7fffd68c, myT2Flag=@0x7fffd688, resultTaskPackageT1=std::vector of length 4, capacity 4 = {...}, resultTaskPackageT2Pr=std::vector of length 4, capacity 4 = {...}, xdataV=std::vector of length 4, capacity 4 = {...}, objSize=7, resultTaskPackageT12=std::vector of length 4, capacity 4 = {...}, xdata_to_workers_type=0x121c410, myGenerationNum=1, Mpara_to_workers_type=0x121b9b0, nconNum=0) at src/nsga2/myNetplanScheduler.cpp:109
109         ImportIndices();
(gdb) c
Continuing.

Breakpoint 2, ImportIndices () at src/index.cpp:120
120         IdxNode = ReadFile("prepdata/idx_node.csv");
(gdb) c
Continuing.

Breakpoint 4, ReadFile (fileinput=0xd8663d "prepdata/idx_node.csv") at src/index.cpp:86
86          Index TempIndex;
(gdb) c
Continuing.

Breakpoint 5, Index::Index (this=0x7fffcb80) at src/index.cpp:20
20          Name(0) {}
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
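Samuel's suggestion near the top of this message (setting OMPI_MCA_memory_ptmalloc2_disable=1) needs to reach every rank, not just the shell that launches mpirun. A sketch of one way to do that (the -x option is standard Open MPI mpirun; the process count and application name are placeholders):

```shell
# Disable Open MPI's ptmalloc2 memory wrappers for this run only.
# "mpirun -x VAR" forwards VAR from the launching shell's environment
# to every spawned rank.
export OMPI_MCA_memory_ptmalloc2_disable=1
mpirun -x OMPI_MCA_memory_ptmalloc2_disable -np 6 ./my_application
```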
Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Tue, 15 Mar 2011 12:50:41 -0400
> CC: us...@open-mpi.org
> To: dtustud...@hotmail.com
>
> You can:
>
> mpirun -np 4 valgrind ./my_application
>
> That is, you run 4 copies of valgrind, each with one instance of ./my_application. Then you'll get valgrind reports for your applications. You might want to dig into the valgrind command-line options to have it dump the results to files with unique prefixes (e.g., PID and/or hostname) so that you can get a unique report from each process.
>
> If you disabled ptmalloc and you're still getting the same error, then it sounds like an application error. Check out and see what valgrind tells you.
>
> On Mar 15, 2011, at 11:25 AM, Jack Bryan wrote:
>
> > Thanks,
> >
> > From http://valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap I find that "Currently the wrappers are only buildable with mpiccs which are based on GNU GCC or Intel's C++ Compiler."
> >
> > The cluster which I am working on is using GNU Open MPI mpic++. I am afraid that the Valgrind wrapper cannot work here.
> >
> > I do not have system administrator authorization.
> >
> > Are there other (open source) memory checkers that can do this?
> >
> > thanks
> >
> > Jack
> >
> > > Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> > > From: jsquy...@cisco.com
> > > Date: Tue, 15 Mar 2011 06:19:53 -0400
> > > CC: dtustud...@hotmail.com
> > > To: us...@open-mpi.org
> > >
> > > You may also want to run your program through a memory-checking debugger such as valgrind to see if it turns up any other problems.
> > >
> > > AFAIK, ptmalloc should be fine for use with STL vector allocation.
> > >
> > > On Mar 15, 2011, at 4:00 AM, Belaid MOA wrote:
> > >
> > > > Hi Jack, I may need to see the whole code to decide, but my quick look suggests that ptmalloc is causing a problem with STL vector allocation. ptmalloc is the Open MPI internal malloc library. Could you try to build Open MPI without memory management (using --without-memory-manager) and let us know the outcome. ptmalloc is not needed if you are not using an RDMA interconnect.
> > > >
> > > > With best regards,
> > > > -Belaid.
> > > >
> > > > [...]
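Jeff's suggestion above, dumping each process's valgrind report to a uniquely named file, maps onto valgrind's --log-file option with the %p (PID) placeholder. A sketch (process count and application name are placeholders):

```shell
# One valgrind report per MPI rank: %p expands to each process's PID,
# so rank output is not interleaved on stdout.
mpirun -np 4 valgrind --log-file=valgrind.%p.log ./my_application
```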
Re: [OMPI users] OMPI seg fault by a class with weird address.
[...] (myNetplanScheduler.cpp:109)
==18729==    by 0x44F2DF: main (main-parallel2.cpp:216)

Note: see also the FAQ in the source distribution. It contains workarounds to several common problems. In particular, if Valgrind aborted or crashed after identifying problems in your program, there's a good chance that fixing those problems will prevent Valgrind aborting or crashing, especially if it happened in m_mallocfree.c.

> Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Wed, 16 Mar 2011 06:43:01 -0400
> To: dtustud...@hotmail.com
> CC: us...@open-mpi.org
>
> Did you run with a memory checking debugger like Valgrind?
>
> Sent from my phone. No type good.
>
> On Mar 15, 2011, at 8:30 PM, "Jack Bryan" wrote:
>
> > Hi,
> >
> > I have installed a new Open MPI 1.3.4. But I got more weird errors:
> >
> > *** glibc detected *** /lustre/nsga2b: malloc(): memory corruption (fast): 0x1cafc450 ***
> > === Backtrace: ===
> > /lib64/libc.so.6[0x3c50272aeb]
> > /lib64/libc.so.6(__libc_malloc+0x7a)[0x3c5027402a]
> > /usr/lib64/libstdc++.so.6(_Znwm+0x1d)[0x3c590bd17d]
> > /lustre/jxding/netplan49/nsga2b[0x445bc6]
> > /lustre/jxding/netplan49/nsga2b[0x44f43b]
> > /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c5021d974]
> > /lustre/jxding/netplan49/nsga2b(__gxx_personality_v0+0x499)[0x443909]
> > === Memory map: ===
> > 0040-00f33000 r-xp 6ac:e3210 685016360 /lustre/netplan49/nsga2b
> > 01132000-0117e000 rwxp 00b32000 6ac:e3210 685016360 /lustre/netplan49/nsga2b
> > 0117e000-01188000 rwxp 0117e000 00:00 0
> > 1ca11000-1ca78000 rwxp 1ca11000 00:00 0
> > 1ca78000-1ca79000 rwxp 1ca78000 00:00 0
> > 1ca79000-1ca7a000 rwxp 1ca79000 00:00 0
> > 1ca7a000-1cab8000 rwxp 1ca7a000 00:00 0
> > 1cab8000-1cac7000 rwxp 1cab8000 00:00 0
> > 1cac7000-1cacf000 rwxp 1cac7000 00:00 0
> > 1cacf000-1cad rwxp 1cacf000 00:00 0
> > 1cad-1cad1000 rwxp 1cad 00:00 0
> > 1cad1000-1cad2000 rwxp 1cad1000 00:00 0
> > 1cad2000-1cada000 rwxp 1cad2000 00:00 0
> > 1cada000-1cadc000 rwxp 1cada000 00:00 0
> > 1cadc000-1cae rwxp 1cadc000 00:00 0
> > ...
> > 51260-3512605000 r-xp 00:11 12043 /usr/lib64/librdmacm.so.1
> > 3512605000-3512804000 ---p 5000 00:11 12043 /usr/lib64/librdmacm.so.1
> > 3512804000-3512805000 rwxp 4000 00:11 12043 /usr/lib64/librdmacm.so.1
> > 3512e0-3512e0c000 r-xp 00:11 5545 /usr/lib64/libibverbs.so.1
> > 3512e0c000-351300b000 ---p c000 00:11 5545 /usr/lib64/libibverbs.so.1
> > 351300b000-351300c000 rwxp b000 00:11 5545 /usr/lib64/libibverbs.so.1
> > 3c4f20-3c4f21c000 r-xp 00:11 2853 /lib64/ld-2.5.so
> > 3c4f41b000-3c4f41c000 r-xp 0001b000 00:11 2853 /lib64/ld-2.5.so
> > 3c4f41c000-3c4f41d000 rwxp 0001c000 00:11 2853 /lib64/ld-2.5.so
> > 3c5020-3c5034c000 r-xp 00:11 897 /lib64/libc.so.6
> > 3c5034c000-3c5054c000 ---p 0014c000 00:11 897 /lib64/libc.so.6
> > 3c5054c000-3c5055 r-xp 0014c000 00:11 897 /lib64/libc.so.6
> > 3c5055-3c50551000 rwxp 0015 00:11 897 /lib64/libc.so.6
> > 3c50551000-3c50556000 rwxp 3c50551000 00:00 0
> > 3c5060-3c50682000 r-xp 00:11 2924 /lib64/libm.so.6
> > 3c50682000-3c50881000 ---p 00082000 00:11 2924 /lib64/libm.so.6
> > 3c50881000-3c50882000 r-xp 00081000 00:11 2924 /lib64/libm.so.6
> > 3c50882000-3c50883000 rwxp 00082000 00:11 2924 /lib64/libm.so.6
> > 3c50a0-3c50a02000 r-xp 00:11 923 /lib64/libdl.so.2
> > 3c50a02000-3c50c02000 ---p 2000 00:11 923 /lib64/libdl.so.2
Re: [OMPI users] Potential bug in creating MPI_GROUP_EMPTY handling
> Date: Thu, 17 Mar 2011 23:40:31 +0100
> From: dominik.goedd...@math.tu-dortmund.de
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Potential bug in creating MPI_GROUP_EMPTY handling
>
> glad we could help and the two hours of stripping things down were effectively not wasted. Also good to hear (implicitly) that we were not too stupid to understand the MPI standard...
>
> Since, to the best of my understanding, our workaround is practically overhead-free, we went ahead and coded everything up analogously to the workaround, i.e. we don't rely on / wait for an immediate fix.
>
> Please let us know if further information is needed.
>
> Thanks,
>
> dom
>
> On 03/17/2011 05:10 PM, Jeff Squyres wrote:
> > Sorry for the late reply, but many thanks for the bug report and reliable reproducer.
> >
> > I've confirmed the problem and filed a bug about this:
> >
> > https://svn.open-mpi.org/trac/ompi/ticket/2752
> >
> > On Mar 6, 2011, at 6:12 PM, Dominik Goeddeke wrote:
> >
> >> The attached example code (stripped down from a bigger app) demonstrates a way to trigger a severe crash in all recent ompi releases but not in a bunch of latest MPICH2 releases. The code is minimalistic and boils down to the call
> >>
> >> MPI_Comm_create(MPI_COMM_WORLD, MPI_GROUP_EMPTY, &dummy_comm);
> >>
> >> which isn't supposed to be illegal. Please refer to the (well-documented) code for details on the high-dimensional cross product I tested (on ubuntu 10.04 LTS), a potential workaround (which isn't supposed to be necessary, I think) and an exemplary stack trace.
> >>
> >> Instructions: mpicc test.c -Wall -O0 && mpirun -np 2 ./a.out
> >>
> >> Thanks!
> >>
> >> dom
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Dr. Dominik Göddeke
> Institut für Angewandte Mathematik
> Technische Universität Dortmund
> http://www.mathematik.tu-dortmund.de/~goeddeke
> Tel.
> +49-(0)231-755-7218 Fax +49-(0)231-755-5933
[OMPI users] OMPI free() error
Hi, I am running a C++ program with OMPI. I got this error:

*** glibc detected *** /nsga2b: free(): invalid next size (fast): 0x01817a90 ***

I used GDB:

=== Backtrace: ===
Program received signal SIGABRT, Aborted.
0x0038b8830265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0038b8830265 in raise () from /lib64/libc.so.6
#1  0x0038b8831d10 in abort () from /lib64/libc.so.6
#2  0x0038b886a99b in __libc_message () from /lib64/libc.so.6
#3  0x0038b887245f in _int_free () from /lib64/libc.so.6
#4  0x0038b88728bb in free () from /lib64/libc.so.6
#5  0x0044a4e3 in workerRunTask (message_to_master_type=0x38c06efe18, nodeSize=2, myRank=1, xVSize=84, objSize=7, xdata_to_workers_type=0x1206350, recvXDataVec=std::vector of length 0, capacity 84, myNsga2=..., Mpara_to_workers_type=0x1205390, events=0x7fffb1f0, netplan=...) at src/nsga2/workerRunTask.cpp:447
#6  0x004514d9 in main (argc=1, argv=0x7fffcb48) at src/nsga2/main-parallel2.cpp:425

In valgrind, there are some invalid reads and writes, but no errors about this "free(): invalid next size".

The relevant code:

(populp.ind)->xreal = new double[nreal];
(populp.ind)->obj = new double[nobj];
(populp.ind)->constr = new double[ncon];
(populp.ind)->xbin = new double[nbin];
if ((populp.ind)->xreal == NULL || (populp.ind)->obj == NULL || (populp.ind)->constr == NULL || (populp.ind)->xbin == NULL)
{
#ifdef DEBUG_workerRunTask
    cout << "In workerRunTask(), I am rank " << myRank << ": (populp.ind)->xreal or (populp.ind)->obj or (populp.ind)->constr or (populp.ind)->xbin is NULL.\n\n" << endl;
#endif
}
...
delete [] (populp.ind)->xreal;
delete [] (populp.ind)->xbin;
delete [] (populp.ind)->obj;
delete [] (populp.ind)->constr;
delete [] sendResultArrayPr;

Any help is really appreciated.

thanks
Re: [OMPI users] OMPI seg fault by a class with weird address.
thanks, I forgot to set the size of a vector before using the [] operator on it.

thanks

> Subject: Re: [OMPI users] OMPI seg fault by a class with weird address.
> From: jsquy...@cisco.com
> Date: Wed, 16 Mar 2011 20:20:20 -0400
> CC: us...@open-mpi.org
> To: dtustud...@hotmail.com
>
> Make sure you have the latest version of valgrind.
>
> But it definitely does highlight what could be real problems if you read down far enough in the output.
>
> > ==18729== Invalid write of size 8
> > ==18729==    at 0x443BEF: initPopPara(population*, std::vector > std::allocator >&, initParaType&, int, int, std::vector >&) (main-parallel2.cpp:552)
> > ==18729==    by 0x44F12E: main (main-parallel2.cpp:204)
> > ==18729== Address 0x62c9da0 is 0 bytes after a block of size 0 alloc'd
> > ==18729==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
> > ==18729==    by 0x4573E4: void std::__uninitialized_fill_n_aux > message_para_to_workersT>(message_para_to_workersT*, unsigned long, message_para_to_workersT const&, __false_type) (new_allocator.h:88)
> > ==18729==    by 0x4576CF: void std::__uninitialized_fill_n_a > message_para_to_workersT, message_para_to_workersT>(message_para_to_workersT*, unsigned long, message_para_to_workersT const&, std::allocator) (stl_uninitialized.h:218)
> > ==18729==    by 0x44EE2E: main (stl_vector.h:218)
>
> The above is an invalid write of size 8 -- you're essentially writing outside of an array.
>
> Valgrind is showing you the call stack to how it got there. Looks like you new'ed or malloc'ed a block of size 0 and then tried to write something to it. Writing to memory that you don't own is a no-no; it can cause Very Bad Things to happen.
>
> You should probably investigate this, and the other issues that it is reporting (e.g., the next invalid read of size 8).
> > ==18729==
> > ==18729== Invalid read of size 8
> > ==18729==    at 0x44F13A: main (main-parallel2.cpp:208)
> > ==18729== Address 0x62c9d60 is 0 bytes after a block of size 0 alloc'd
> > ==18729==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
> > ==18729==    by 0x45733D: void std::__uninitialized_fill_n_aux > message_para_to_workersT>(message_para_to_workersT*, unsigned long, message_para_to_workersT const&, __false_type) (new_allocator.h:88)
> > ==18729==    by 0x4576CF: void std::__uninitialized_fill_n_a > message_para_to_workersT, message_para_to_workersT>(message_para_to_workersT*, unsigned long, message_para_to_workersT const&, std::allocator) (stl_uninitialized.h:218)
> > ==18729==    by 0x44EE2E: main (stl_vector.h:218)
> > ==18729==
> >
> > valgrind: m_mallocfree.c:225 (mk_plain_bszB): Assertion 'bszB != 0' failed.
> > valgrind: This is probably caused by your program erroneously writing past the end of a heap block and corrupting heap metadata. If you fix any invalid writes reported by Memcheck, this assertion failure will probably go away. Please try that before reporting this as a bug.
> > ==18729==    at 0x38029D5C: report_and_quit (m_libcassert.c:145)
> > ==18729==    by 0x3802A032: vgPlain_assert_fail (m_libcassert.c:217)
> > ==18729==    by 0x38035645: vgPlain_arena_malloc (m_mallocfree.c:225)
> > ==18729==    by 0x38002BB5: vgMemCheck_new_block (mc_malloc_wrappers.c:199)
> > ==18729==    by 0x38002F6B: vgMemCheck___builtin_new (mc_malloc_wrappers.c:246)
> > ==18729==    by 0x3806070C: do_client_request (scheduler.c:1362)
> > ==18729==    by 0x38061D30: vgPlain_scheduler (scheduler.c:1061)
> > ==18729==    by 0x38085E6E: run_a_thread_NORETURN (syswrap-linux.c:91)
> >
> > sched status:
> >   running_tid=1
> >
> > Thread 1: status = VgTs_Runnable
> > ==18729==    at 0x4A0666E: operator new(unsigned long) (vg_replace_malloc.c:220)
> > ==18729==    by 0x464506: __gnu_cxx::new_allocator::allocate(unsigned long, void const*) (new_allocator.h:88)
> > ==18729==    by 0x46452E: std::_Vector_base >::_M_allocate(unsigned long) (stl_vector.h:127)
> > ==18729==    by 0x464560: std::_Vector_base >::_Vector_base(unsigned long, std::allocator const&) (stl_vector.h:113)
> > ==18729==    by 0x464B6A: std::vector >::vector(unsigned long, int const&, std::allocator const&) (stl_vector.h:216)
> > ==18729==    by 0x488F62: Index::Index() (index.cpp:20)
> > ==18729==    by 0x489147: ReadFile(char const*) (index.cpp:86)
> > ==18729==    by 0x48941C: ImportIndices() (index.cpp:121)
> > ==18729==    by 0x445D00: myNeplanTaskScheduler(CNSGA2*, int, int, int, population*, char, int, std::vector > std::allocator >&, ompi_datatype_t*, int&, int&, std::vector >, std::allocator > > >&, std::vector >, std::allocator > > >&, std::vector >&, int, std::vector >, std::allocator > > >&, ompi_datatype_t*, int, ompi_datatype_t*, int) (myNetplanScheduler.cpp:109)
[OMPI users] OMPI error terminate w/o reasons
Hi, All:

I am running an Open MPI (1.3.4) program with 200 parallel processes. But the program is terminated with:

--
mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
--

After searching, signal 9 means: the process is currently in an unworkable state and should be terminated with extreme prejudice. If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away. The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).

But the error message does not indicate any possible reasons for the termination.

There is a FOR loop in the main() program; if the loop count is small (< 200), the program works well, but as it becomes larger and larger, the program gets SIGKILL.

The cluster where I am running the MPI program does not allow running debug tools. If I run it on a workstation, it will take a very, very long time (for > 200 loops) to make the error occur again.

What can I do to find the possible bugs? Any help is really appreciated.

thanks

Jack
Re: [OMPI users] OMPI error terminate w/o reasons
Hi,

I have tried this. But the printout from 200 parallel processes makes it very hard to locate the possible bug. They may not stop at the same point when the program gets signal 9. So, even though I can figure out the print statements from all 200 processes, the many different locations where the processes are stopped make it harder to find hints about the bug.

Are there some other programming tricks which can help me narrow down to the suspect points ASAP? Any help is appreciated.

Jack

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 07:53:40 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

Try adding some print statements so you can see where the error occurs.

On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:

[...]
Re: [OMPI users] OMPI error terminate w/o reasons
Hi,

I am working on a cluster where I am not allowed to install software in system folders. My Open MPI is 1.3.4.

I have had a very quick look at padb on http://padb.pittman.org.uk/. Does it require some software to be installed on the cluster in order to use it? I cannot use the command line to run jobs on the cluster, only a script.

thanks

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 12:12:11 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

Have you tried a parallel debugger such as padb?

On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:

[...]
Re: [OMPI users] OMPI error terminate w/o reasons
Is it possible to have padb print the stack trace and other program execution information to a file?

I can run the program in gdb like this:

mpirun -np 200 -e gdb ./myapplication

How can I make gdb print the debug information to a file, so that I can check it when the program is terminated?

thanks

Jack

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 13:56:13 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

You don't need to install anything in a system folder - you can just install it in your home directory, assuming that is accessible on the remote nodes.

As for the script - unless you can somehow modify it to allow you to run under a debugger, I am afraid you are completely out of luck.

On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:

[...]
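On getting gdb output into a file: one common pattern (a sketch; -batch, -ex, and --args are standard gdb options, but the wrapper name and log naming are hypothetical) is to run each rank under batch-mode gdb and redirect its output to a per-process file:

```shell
#!/bin/sh
# wrapper.sh - run one rank under non-interactive gdb, logging per PID.
# -batch: exit when done; -ex run: start the program;
# -ex bt: print a backtrace when it stops (e.g. on SIGSEGV or SIGABRT).
exec gdb -batch -ex run -ex bt --args "$@" > gdb.$$.log 2>&1
```

Invoked as, e.g., `mpirun -np 200 ./wrapper.sh ./myapplication`. Note the caveat: gdb cannot report on a process killed with SIGKILL, since SIGKILL cannot be caught, so this helps mainly with segfault/abort-type failures.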
Re: [OMPI users] OMPI error terminate w/o reasons
The cluster can print all output into one file, but checking it for bugs is very hard. The cluster also prints possible error messages into one file, but sometimes the error file is empty, and sometimes it is signal 9.

If I only run dummy tasks on worker nodes, there are no errors. If I run real tasks, sometimes processes are terminated without any errors before the program exits normally. Sometimes the program gets signal 9 but no other error messages. It is weird.

Any help is really appreciated.

Jack

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 15:18:53 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

I don't know, but Ashley may be able to help - or you can see his web site for instructions.

Alternatively, since you can put print statements into your code, have you considered using mpirun's option to direct output from each rank into its own file? Look at "mpirun -h" for the options:

-output-filename|--output-filename
    Redirect output from application processes into filename.rank

On Mar 26, 2011, at 2:48 PM, Jack Bryan wrote:

[...]
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, I used:

mpirun -np 200 -rf --output-filename /mypath/myapplication

But no files are printed out. Can the "--debug" option help me here? When I tried:

-bash-3.2$ mpirun -debug
--------------------------------------------------------------------------
A suitable debugger could not be found in your PATH. Check the values
specified in the orte_base_user_debugger MCA parameter for the list of
debuggers that was searched.
--------------------------------------------------------------------------

Any help is really appreciated. thanks

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 15:45:39 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

If you use that mpirun option, mpirun will place the output from each rank into a -separate- file for you. Give it:

mpirun --output-filename /myhome/debug/run01

and in /myhome/debug you will find files:

run01.0
run01.1
...

each with the output from the indicated rank.
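The same per-rank file layout can also be produced inside the application itself, as a fallback when the mpirun option misbehaves. A minimal sketch (plain Python; file names are illustrative, not from the thread). It relies on the OMPI_COMM_WORLD_RANK environment variable that Open MPI's mpirun exports to each process, falling back to rank 0 when run standalone:

```python
# Do-it-yourself per-rank output files: each process opens its own log,
# named after its rank, so 200 ranks give run01.0 ... run01.199 instead
# of one interleaved stream. File names here are illustrative.
import os

# mpirun (Open MPI) sets OMPI_COMM_WORLD_RANK for every launched process;
# when run without mpirun this falls back to "0".
rank = os.environ.get("OMPI_COMM_WORLD_RANK", "0")

with open(f"run01.{rank}", "w") as log:
    log.write(f"rank {rank} starting\n")   # per-rank debug output goes here
```

Each print statement in the application would then be written to the rank's own file, which makes it much easier to see how far an individual rank got before a crash.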
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, I have figured out how to run the command:

OMPI_RANKFILE=$HOME/$PBS_JOBID.ranks
mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib -output-filename 700g200i200p14ye ./myapplication

Each process prints to a distinct file. But the program is terminated with the error:

=>> PBS: job killed: node 18 (n314) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
mpirun: Forwarding signal 10 to job
mpirun: killing job...
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
n341 n338 n337 n336 n335 n334 n333 n332 n331 n329 n328 n326 n324 n321 n318 n316 n315 n314 n313 n312 n309 n308 n306 n305

After searching, I find that the error is probably related to the highly frequent I/O activity. I have also run valgrind to do a memory check, to find the possible reason for the original signal 9 (SIGKILL) problem:

mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes --log-file=nsga2b_g700_pop200_p200_valg_cystorm_mpi.log ./myapplication

But I got a similar error to the above. What does the error mean? I cannot change the file system of the cluster. I only want a way to find the bug, which only appears when the problem size is very large. But I am stuck on the SIGKILL and now the SISTER_EOF issue. Any help is really appreciated. thanks, Jack

From: r...@open-mpi.org
Date: Sat, 26 Mar 2011 20:47:19 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

That command line cannot possibly work. Both the -rf and --output-filename options require arguments. PLEASE read the documentation: "mpirun -h" or "man mpirun" will tell you how to correctly use these options.
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, my original bug is:

mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).

The main framework of my code is:

main() {
    for master node:
        while (loop <= LOOP_NUMBER) {
            master node distributes tasks to workers;
            master collects results from workers;
            ++loop;
        }
    for worker nodes:
    {
        get the task;
        run the task;      // call CPLEX API lib
        return results to master;
    }
}

When LOOP_NUMBER <= 600 (with 200 parallel processes) it works well, but when LOOP_NUMBER >= 700 (with 200 parallel processes) it gets the error. Could a limit in my Torque setup be the reason for the above error? Torque seems to complain about the high I/O caused by printing something from each process, but if I comment out the print statements in my code the Torque complaints go away while the signal 9 error is still there. Any help is really appreciated. thanks, Jack

From: r...@open-mpi.org
Date: Sun, 27 Mar 2011 13:08:31 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

It means that Torque is unhappy with your job - either you are running longer than it permits, or you exceeded some other system limit. Talk to your sys admin about imposed limits. Usually there are flags you can provide with your job submission that allow you to change limits for your program.
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, I use MPI_Barrier to make all processes terminate at the same time:

int main() {
    for master node:
        while (loop <= LOOP_NUMBER) {
            master node distributes tasks to workers;
            master collects results from workers;
            ++loop;
        }
    for worker nodes:
    {
        get the task;
        run the task;      // call CPLEX API lib
        return results to master;
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

thanks

From: solarbik...@gmail.com
Date: Sun, 27 Mar 2011 15:32:51 -0700
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

This might not have anything to do with your problem, but how do you finalize your worker nodes when your master loop terminates?
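The question about finalizing the workers matters: after the master's loop ends, each worker must be told explicitly to stop receiving, or it blocks forever in its receive and never reaches the barrier and MPI_Finalize. A toy model of that shutdown handshake in plain Python (no MPI; queues and processes stand in for ranks, and the "stop" message is a sentinel, sometimes called a poison pill):

```python
# Master/worker shutdown toy: the master distributes tasks, collects the
# results, then sends one sentinel per worker so each worker exits its
# receive loop cleanly (the MPI analogue would be a "no more work" tag).
import multiprocessing as mp

STOP = None  # sentinel standing in for a "no more work" message

def worker(tasks, results):
    while True:
        task = tasks.get()          # like the worker's blocking MPI_Recv
        if task is STOP:
            break                   # safe now to reach MPI_Barrier/MPI_Finalize
        results.put(task * task)    # "run the task; return results to master"

def run_demo(n_workers=3, n_tasks=6):
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for t in range(n_tasks):        # master distributes tasks
        tasks.put(t)
    out = sorted(results.get() for _ in range(n_tasks))  # master collects
    for p in procs:                 # one sentinel per worker
        tasks.put(STOP)
    for p in procs:
        p.join()
    return out

if __name__ == "__main__":
    print(run_demo())               # [0, 1, 4, 9, 16, 25]
```

Without the sentinel step, the workers never return and the job appears to hang until the batch system kills it, which can look exactly like an unexplained signal 9.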
Re: [OMPI users] OMPI error terminate w/o reasons
Hi, The job queue has a time budget, which is set in my job script. For example, my current queue allows 24 hours, but my program got SIGKILL (signal 9) within no more than 2 hours of starting to run. Are there other possible settings that I need to consider? thanks, Jack

> From: jsquy...@cisco.com
> Date: Sun, 27 Mar 2011 20:29:11 -0400
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> +1 on what Ralph is saying.
>
> You need to talk to your local administrators and ask them why Torque is
> killing your job. Perhaps you're submitting to a queue that only allows jobs
> to run for a few seconds, or something like that.
[OMPI users] OMPI not calling finalize error
Hi, When I run a parallel program, I get an error:

[n333:129522] *** Process received signal ***
[n333:129522] Signal: Segmentation fault (11)
[n333:129522] Signal code: Address not mapped (1)
[n333:129522] Failing at address: 0x40
[n333:129522] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
[n333:129522] [ 1] /opt/openmpi-1.3.4-gnu/lib/libmpi.so.0 [0x4cd19b1]
[n333:129522] [ 2] /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0(opal_progress+0x75) [0x52e5165]
[n333:129522] [ 3] /opt/openmpi-1.3.4-gnu/lib/libopen-rte.so.0 [0x508565c]
[n333:129522] [ 4] /opt/openmpi-1.3.4-gnu/lib/libmpi.so.0 [0x4c653eb]
[n333:129522] [ 5] /opt/openmpi-1.3.4-gnu/lib/libmpi.so.0(MPI_Init+0x120) [0x4c84b90]
[n333:129522] [ 6] /lustre/jxding/netplan49/nsga2b [0x4497f6]
[n333:129522] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974]
[n333:129522] [ 8] /lustre/jxding/netplan49/nsga2b(__gxx_personality_v0+0x499) [0x4436e9]
[n333:129522] *** End of error message ***

mpirun has exited due to process rank 24 with PID 129522 on node n333 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

But the program ran for no more than a few minutes; it should take hours to finish. How can it reach "finalize" so fast? Any help is appreciated. Jack
[OMPI users] OMPI monitor each process behavior
Hi, All: I need to monitor the memory usage of each parallel process on a Linux Open MPI cluster. But the top and ps commands cannot help here, because they only show information for the head node. I need to follow the behavior of each process on each cluster node, and I cannot use ssh to access the nodes. The program takes 8 hours to finish. Any help is really appreciated. Jack
Re: [OMPI users] OMPI monitor each process behavior
Hi, I am using mpirun (Open MPI) 1.3.4. But I have these: orte-clean, orted, orte-iof, orte-ps, orterun. Can they do the same thing? If I use them, will they use a lot of memory on each worker node and print a lot of things into log files? Any help is really appreciated. Thanks, Jack

From: r...@open-mpi.org
Date: Wed, 13 Apr 2011 08:09:17 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI monitor each process behavior

What version are you using? If you are using 1.5.x, there is an "orte-top" command that will do what you ask. It queries the daemons to get the info.
Re: [OMPI users] OMPI monitor each process behavior
Hi, If I cannot ssh to a worker node, does that mean my program cannot work correctly? I can run it as 32 nodes * 4 cores/node parallel processes, but for a larger process count, 128 nodes * 1 CPU/node, it is killed by signal 9. Could this be the reason? thanks

> Date: Wed, 13 Apr 2011 05:59:10 -0700
> From: n...@aol.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI monitor each process behavior
>
> On 4/12/2011 8:55 PM, Jack Bryan wrote:
> >
> > I need to monitor the memory usage of each parallel process on a linux
> > Open MPI cluster.
> >
> > But, top, ps command cannot help here because they only show the head
> > node information.
> >
> > I need to follow the behavior of each process on each cluster node.
> Did you consider ganglia et al?
> >
> > I cannot use ssh to access each node.
> How can MPI run?
> >
> > The program takes 8 hours to finish.
>
> --
> Tim Prince
Re: [OMPI users] OMPI monitor each process behavior
Hi, I do not have qrsh. I have qrerun, qrls, qrttoppm, qrun. Can they do the same thing? thanks

> From: re...@staff.uni-marburg.de
> Date: Wed, 13 Apr 2011 16:28:14 +0200
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OMPI monitor each process behavior
>
> On 13.04.2011 at 05:55, Jack Bryan wrote:
>
> > I need to monitor the memory usage of each parallel process on a linux
> > Open MPI cluster. [...] I cannot use ssh to access each node.
>
> What about submitting another job with `mpirun ... ps -e f` or alike - in
> case you can request the same nodes?
>
> Can you `qrsh` to a node via the queuing system?
>
> -- Reuti
Re: [OMPI users] OMPI monitor each process behavior
Hi, I have found why the program is killed by the operating system when the problem size is large: it consumes more memory and causes more memory swapping, which also degrades the program's performance. But I cannot determine which function of the worker process causes the problem. I have used try-catch in my code, but no exception popped out. I found this description: "When the processes running on your server attempt to allocate more memory than your system has available, the kernel begins to swap memory pages to and from the disk. This is done in order to free up sufficient physical memory to meet the RAM allocation requirements of the requestor." I am not sure whether it is really caused by CPLEX (an optimization model solver) or by other routines, or maybe by dynamic memory allocation done by the CPLEX API library in the background. Any help is really appreciated. Jack

From: r...@open-mpi.org
Date: Wed, 13 Apr 2011 10:34:38 -0600
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI monitor each process behavior

On Apr 13, 2011, at 10:19 AM, Jack Bryan wrote:

> I am using mpirun (Open MPI) 1.3.4. But I have these: orte-clean, orted,
> orte-iof, orte-ps, orterun. Can they do the same thing?

Unfortunately, no.

> If I use them, will they use a lot of memory on each worker node and print
> a lot of things into log files?

No, but they won't help. orte-top would be run only on the head node (i.e., where you are logged in), and would generate output to your screen. But you don't have it with that release, so the point is moot. Afraid there isn't much else you can do - you might talk to your sys admin and see what tools are available on your cluster for this purpose. Perhaps a nice parallel debugger is available?
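Since orte-top is not available in 1.3.4 and the worker nodes cannot be reached with ssh, one workaround is to have each rank sample its own memory usage and append it to a per-rank log file. A minimal, Linux-specific sketch (plain Python; the helper names and file names are illustrative, not from the thread):

```python
# Each rank samples its own resident-set size from /proc (Linux-specific)
# and appends it to a per-rank log file, since top/ps on the head node
# cannot see processes running on the other nodes.
import os
import re

def rss_kb():
    """Resident set size of the current process in kB, from /proc/self/status."""
    with open("/proc/self/status") as f:
        m = re.search(r"^VmRSS:\s+(\d+)\s+kB", f.read(), re.M)
    return int(m.group(1)) if m else 0

def log_memory(rank, prefix="memlog"):
    # Append one sample; call this periodically, e.g. once per loop iteration,
    # to see which phase of the run grows the memory footprint.
    with open(f"{prefix}.{rank}", "a") as f:
        f.write(f"rss_kb={rss_kb()}\n")

# The rank would normally come from MPI_Comm_rank; Open MPI's mpirun also
# exports it as OMPI_COMM_WORLD_RANK in each process's environment.
log_memory(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
```

Comparing the growth of rss_kb across loop iterations in the per-rank logs would show whether the memory growth happens inside the CPLEX calls or elsewhere, without needing interactive access to the nodes.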
[OMPI users] OMPI vs. network socket communication
Hi, All: What is the relationship between MPI communication and socket communication? Is network socket programming better than MPI? I am a newbie to network socket programming, and I do not know which one is better for parallel/distributed computing. I know that a network socket provides Unix file-like communication between a server and a client. If sockets can also be used for parallel computing, how can MPI work better than them? I know MPI is for homogeneous cluster systems and network sockets are based on internet TCP/IP. Any help is really appreciated. Thanks
Re: [OMPI users] OMPI vs. network socket communication
Thanks for your reply. MPI is for academic purposes; how about business applications? What kinds of parallel/distributed computing environments do financial institutions use for their high-frequency trading? Any help is really appreciated. Thanks

Date: Mon, 2 May 2011 08:34:33 -0400
From: terry.don...@oracle.com
To: us...@open-mpi.org
Subject: Re: [OMPI users] OMPI vs. network socket communication

On 04/30/2011 08:52 PM, Jack Bryan wrote:

> What is the relationship between MPI communication and socket communication?

MPI may use socket communications to do communications between two processes. Aside from that, they are used for different purposes.

> Is network socket programming better than MPI?

Depends on what you are trying to do. If you are writing a parallel program that may run in multiple environments with different types of performing protocols available for its use, then MPI is probably better. If you are looking to do simple client/server type programming, then socket programming might have an advantage.

> I am a newbie of network socket programming. I do not know which one is
> better for parallel/distributed computing?

IMO, MPI.

> If they can also be used for parallel computing, how can MPI work better
> than them?

There is a lot of stuff that MPI does behind the curtain to make a parallel application's life a lot easier. As far as performance, MPI will not perform better than sockets if it is using sockets as the underlying model; however, the performance difference should be negligible, which makes all the other stuff MPI does for you a big win.

> I know MPI is for homogeneous cluster system and network socket is based
> on internet TCP/IP.

What do you mean by homogeneous cluster? There are some MPIs that can work among different platforms and even different OSes (though some initial setup may be necessary).

Hope this helps,

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com
[OMPI users] OpenMPI data transfer error
Hi, I am using Open MPI to transfer data from the master node to worker nodes. But a worker node can get data that is not what it should get. I have checked the destination node rank, the task tag, and the datatype; all of them are correct. I did an experiment: node 0 sends data to nodes 1, 2, and 3. Only node 3 gets the correct data; nodes 1 and 2 get the wrong data, which is what node 3 should have received. What is the possible reason? I have printed out the data sent by the master node, and it is exactly what nodes 1, 2, and 3 should receive. So why do nodes 1 and 2 get node 3's data? Any help is appreciated. Jack
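One classic cause of this exact symptom (a guess; the original code is not shown in the thread) is reusing a single send buffer across nonblocking sends: by the time the data actually leaves, the buffer already holds the values written for the last destination, so several receivers see the last destination's data. A toy model of the bug in plain Python, with no MPI involved:

```python
# Toy model of the buffer-reuse bug: "isend" keeps only a REFERENCE to the
# buffer and delivery happens later, so overwriting the buffer before the
# transfer completes ships the wrong data - nodes 1 and 2 "get node 3's data".
import queue

mailbox = {1: queue.Queue(), 2: queue.Queue(), 3: queue.Queue()}
pending = []   # (dest, buffer) pairs not yet delivered

def isend(dest, buf):
    # like MPI_Isend: returns immediately; only a reference to buf is captured
    pending.append((dest, buf))

def progress():
    # the "transport" delivers later, reading each buffer at delivery time
    for dest, buf in pending:
        mailbox[dest].put(list(buf))
    pending.clear()

buf = [0]                      # BUG: one buffer shared by all three sends
for dest in (1, 2, 3):
    buf[0] = dest * 10         # data intended for this destination
    isend(dest, buf)           # no wait before overwriting buf again
progress()

print(mailbox[1].get())        # [30] - the data intended for node 3
```

The fix in MPI terms is what the later messages in this archive describe: give each MPI_Isend its own buffer and do not touch it until MPI_Wait (or MPI_Test) reports completion.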
[OMPI users] Open MPI process cannot do send-receive message correctly on a distributed memory cluster
Hi, I have an Open MPI program, which works well on a Linux shared-memory multicore (2 x 6 cores) machine. But it does not work well on a distributed cluster with Linux Open MPI. I found that a process sends out some messages to other processes, which cannot receive them. What is the possible reason? I do not change anything in the program. Any help is really appreciated. Thanks
Re: [OMPI users] Open MPI process cannot do send-receive message correctly on a distributed memory cluster
Thanks, I am using non-blocking MPI_Isend to send out messages and blocking MPI_Recv to get them. Each MPI_Isend uses a distinct buffer to hold the message, which is not changed until the message is received. Then the sender process waits for the MPI_Isend to finish. Before this message is sent out, a heading message (about how much data and what data will be sent out in the following MPI_Isend) is sent out in the same way; those can be received well. Why can the following message (which is larger) not be received? Any help is really appreciated. > Date: Fri, 30 Sep 2011 11:33:16 -0400 > From: raysonlo...@gmail.com > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI process cannot do send-receive message > correctly on a distributed memory cluster > > You can use a debugger (just gdb will do, no TotalView needed) to find > out which MPI send & receive calls are hanging the code on the > distributed cluster, and see if the send & receive pair is due to a > problem described at: > > Deadlock avoidance in your MPI programs: > http://www.cs.ucsb.edu/~hnielsen/cs140/mpi-deadlocks.html > > Rayson > > = > Grid Engine / Open Grid Scheduler > http://gridscheduler.sourceforge.net > > Wikipedia Commons > http://commons.wikimedia.org/wiki/User:Raysonho > > > On Fri, Sep 30, 2011 at 11:06 AM, Jack Bryan wrote: > > Hi, > > > > I have an Open MPI program, which works well on a Linux shared-memory > > multicore (2 x 6 cores) machine. > > > > But it does not work well on a distributed cluster with Linux Open MPI. > > > > I found that a process sends out some messages to other processes, > > which cannot receive them. > > > > What is the possible reason? > > > > I do not change anything in the program. > > > > Any help is really appreciated. 
> > Thanks > > == > Open Grid Scheduler - The Official Open Source Grid Engine > http://gridscheduler.sourceforge.net/
[OMPI users] Open MPI error to define MPI_Datatype in header file
Hi, I need to define an (Open MPI) MPI_Datatype in a header file so that all other files that include it can find it. I also tried to use extern to do the declaration in the .h file and then define it in a .cpp file. But I always get the error: undefined reference. Is this not allowed in Open MPI? Why? Any help is really appreciated. Thanks
[OMPI users] How to check processes working in parallel on one node of MPI cluster
Hi, I am running an Open MPI program on a Linux cluster with 4 quad cores per node. I use qstat -n jobID to check how many processes are working in parallel and find: node160/15+node160/14+node160/13+node160/12+node160/11+node160/10+node160/9 +node160/8+node160/7+node160/6+node160/5+node160/4+node160/3+node160/2 +node160/1+node160/0+node166/15+node166/14+node166/13+node166/12+node166/11 +node166/10+node166/9+node166/8+node166/7+node166/6+node166/5+node166/4 +node166/3+node166/2+node166/1+node166/0+node173/15+node173/14+node173/13 +node173/12+node173/11+node173/10+node173/9+node173/8+node173/7+node173/6 +node173/5+node173/4+node173/3+node173/2+node173/1+node173/0+node175/15 +node175/14+node175/13+node175/12+node175/11+node175/10+node175/9+node175/8 +node175/7+node175/6+node175/5+node175/4+node175/3+node175/2+node175/1 +node175/0 But when I ssh onto a node, e.g. ssh node175, and use the top command to check how many processes are working on node 175, I find that there is only one process working, not 8 processes. Would you please tell me how to check the number of processes on one node? Any help will be appreciated. Thanks Jinxu Ding
[OMPI users] Open MPI task scheduler
Hi, all: I need to design a task scheduler (not a PBS job scheduler) on an Open MPI cluster. I need to parallelize an algorithm so that a big problem is decomposed into small tasks, which can be distributed to worker nodes by the scheduler; after being solved, the results of these tasks are returned to the manager node running the scheduler, which will distribute more tasks on the basis of the collected results. I need to use C++ to design the scheduler. I have searched online and cannot find any scheduler available for this purpose. Any help is appreciated. thanks Jack June 19 2010
Re: [OMPI users] Open MPI task scheduler
Hi, Matthieu: Thanks for your help. Most of your ideas describe what I want to do. My scheduler should be able to be called from any C++ program, which can put a list of tasks into the scheduler; the scheduler then distributes the tasks to other client nodes. It may work like this: while(still tasks available) { myScheduler.push(tasks); myScheduler.get(tasks results from client nodes); } My cluster has 400 nodes with Open MPI. The tasks should be transferred by the MPI protocol. I am not familiar with the RPC protocol. If I use Boost.ASIO and some Python/GCCXML scripts to generate the code, can it be called from a C++ program on an Open MPI cluster? I cannot find the skeleton on your blog. Would you please tell me where to find it? I really appreciate your help. Jack June 20 2010 > Date: Sun, 20 Jun 2010 20:13:14 +0200 > From: matthieu.bruc...@gmail.com > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI task scheduler > > Hi Jack, > > What you are seeking is the client/server pattern. Have one node act > as a server. It will create a list of tasks or even a graph of tasks > if you have dependencies, and then create clients that will connect to > the server with an RPC protocol (I've done this with a SOAP+TCP > protocol, the severance of the TCP connection meaning that the client > is dead and that its task should be recycled; it's easy to do with > Boost.ASIO and some Python/GCCXML scripts to automatically generate > your code, I've written a skeleton on my blog). You may even have > clients with different sizes or capabilities and tell the server what > each client can do, and then the server may dispatch appropriate > tickets to the clients. > > Each client and server can be an MPI process; you don't have to create > all clients inside one MPI process (you may use several if the > smallest resource your batch scheduler allocates is bigger than one of > your tasks). 
With a batch scheduler, it's better to keep your > tasks as small as possible so that you can balance the resources you > need. > > Matthieu > > 2010/6/20 Jack Bryan : > > Hi, all: > > I need to design a task scheduler (not a PBS job scheduler) on an Open MPI > > cluster. > > I need to parallelize an algorithm so that a big problem is decomposed into > > small tasks, which can be distributed > > to worker nodes by the scheduler; after being solved, the results > > of these tasks are returned to the manager node running the scheduler, which > > will distribute more tasks on the basis of the collected results. > > I need to use C++ to design the scheduler. > > I have searched online and cannot find any scheduler available > > for this purpose. > > Any help is appreciated. > > thanks > > Jack > > June 19 2010 > > -- > Information System Engineer, Ph.D. > Blog: http://matt.eifelle.com > LinkedIn: http://www.linkedin.com/in/matthieubrucher
Re: [OMPI users] Open MPI task scheduler
Thanks for your reply. My task scheduler is at the application program level, not the OS level. PBS asks the OS to do the job scheduling. My scheduler needs to be callable from any C++ program to put tasks into the scheduler and then distribute the tasks to worker nodes. After the tasks are done, the manager node collects the results. It may work like this: while(still tasks available) { myScheduler.push(tasks); myScheduler.get(tasks results from client nodes); } Any help is appreciated. Jack June 20 2010 > From: bill.ran...@sas.com > To: us...@open-mpi.org > Date: Sun, 20 Jun 2010 20:04:26 + > Subject: Re: [OMPI users] Open MPI task scheduler > > > On Jun 20, 2010, at 1:49 PM, Jack Bryan wrote: > > Hi, all: > > I need to design a task scheduler (not a PBS job scheduler) on an Open MPI cluster. > > Quick question - why *not* PBS? > > Using shell scripts with the Job Array and Dependent Jobs features of PBS Pro > (not sure about Maui/Torque or SGE) you can implement this in a fairly > straightforward manner. It worked for the bioinformaticists using BLAST. > > It just seems that the workflow you are describing is part and parcel of > what any good workload management system is supposed to do, and do well. > > Just a thought. > > Good luck, > > -bill
Re: [OMPI users] Open MPI task scheduler
Hi, thank you very much for your help. What is the meaning of "must find a system so that every task can be serialized in the same form"? What is the meaning of "serialize"? I have no experience of programming with Python and XML. I have studied your blog. Where can I find a simple example of using the techniques you have described? For example, I have 5 tasks (print "hello world!"). I want to use 6 processors to do it in parallel. One processor is the manager node who distributes tasks, and the other 5 processors do the printing jobs; when they are done, they tell this to the manager node. Boost.Asio is a cross-platform C++ library for network and low-level I/O programming. I have no experience of using it. Will it take a long time to learn how to use it? If the messages are transferred by SOAP+TCP, how does the manager node call it and push tasks into it? Do I need to install SOAP+TCP on my cluster so that I can use it? Any help is appreciated. Jack June 20 2010 > Date: Sun, 20 Jun 2010 21:00:06 +0200 > From: matthieu.bruc...@gmail.com > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI task scheduler > > 2010/6/20 Jack Bryan : > > Hi, Matthieu: > > Thanks for your help. > > Most of your ideas describe what I want to do. > > My scheduler should be able to be called from any C++ program, which can > > put > > a list of tasks into the scheduler; the scheduler then distributes the > > tasks to other client nodes. > > It may work like this: > > while(still tasks available) { > > myScheduler.push(tasks); > > myScheduler.get(tasks results from client nodes); > > } > > Exactly. In your case, you want only one server, so you must find a > system so that every task can be serialized in the same form. The > easiest way to do so is to serialize your parameter set as an XML > fragment and add the type of task as another field. > > > My cluster has 400 nodes with Open MPI. The tasks should be transferred by > > the MPI protocol. 
> > No, they should not ;) MPI can be used, but it is not the easiest way > to do so. You still have to serialize your ticket, and you have to use > some functions that are from MPI-2 (so perhaps not as portable as MPI-1 > functions). Besides, it cannot be used from programs that do not know > about MPI protocols. > > > I am not familiar with the RPC protocol. > > RPC is not a protocol per se. SOAP is. RPC stands for Remote Procedure > Call. It is basically your scheduler that has several functions > clients can call: > - add tickets > - retrieve ticket > - ticket is done > > > If I use Boost.ASIO and some Python/GCCXML scripts to generate the code, can it > > be > > called from a C++ program on an Open MPI cluster? > > Yes, SOAP is just an XML way of representing the fact that you call a > function on the server. You can use it with C++, Java, ... I use it > with Python to monitor how many tasks are remaining, for instance. > > > I cannot find the skeleton on your blog. > > Would you please tell me where to find it? > > It's not complete, as some of the work is property of my employer. This > is how I use GCCXML to generate the calling code: > http://matt.eifelle.com/2009/07/21/using-gccxml-to-automate-c-wrappers-creation/ > You have some additional code to write, but this is the main idea. > > > I really appreciate your help. > > No sweat, I hope I can give you correct hints! > > Matthieu > -- > Information System Engineer, Ph.D. > Blog: http://matt.eifelle.com > LinkedIn: http://www.linkedin.com/in/matthieubrucher
[OMPI users] openMPI asynchronous communication
Dear All: How can I do asynchronous communication among nodes with Open MPI or Boost.MPI in a cluster? I need to set up a kind of asynchronous communication protocol such that message senders and receivers can communicate asynchronously without losing any messages between them. I do not want to use blocking MPI routines, because the processors can do other operations while they wait for new messages to come. I cannot find MPI routines that support this kind of asynchronous communication. Any help is appreciated. thanks Jack June 27 2010
Re: [OMPI users] openMPI asynchronous communication
thanks, I know this. But what if the sender can send messages to receivers faster than the receiver can receive them? It means that the sender works faster than the receiver. Any help is appreciated. jack From: jiangzuo...@gmail.com Date: Mon, 28 Jun 2010 11:31:16 +0800 To: us...@open-mpi.org Subject: Re: [OMPI users] openMPI asynchronous communication MPI_Isend - Starts a standard-mode, nonblocking send. BTW, are there any asynchronous collective operations? Changsheng Jiang On Mon, Jun 28, 2010 at 11:22, Jack Bryan wrote: Dear All: How can I do asynchronous communication among nodes with Open MPI or Boost.MPI in a cluster? I need to set up a kind of asynchronous communication protocol such that message senders and receivers can communicate asynchronously without losing any messages between them. I do not want to use blocking MPI routines, because the processors can do other operations while they wait for new messages to come. I cannot find MPI routines that support this kind of asynchronous communication. Any help is appreciated. thanks Jack June 27 2010
Re: [OMPI users] openMPI asynchronous communication
thanks, I know that: MPI_Irecv(); do other work; MPI_Wait(); But my message receiver is much slower than the sender. While the receiver is doing its local work, the senders have already sent out their messages; at that time the receiver is very busy with its local work and cannot post MPI_Irecv to get the messages from the senders. Any help is appreciated. jack From: jiangzuo...@gmail.com Date: Mon, 28 Jun 2010 11:55:32 +0800 To: us...@open-mpi.org Subject: Re: [OMPI users] openMPI asynchronous communication OK, then I think you also know to use MPI_Wait to wait for the asynchronous requests to complete. If the sender works faster than the receiver (or the reverse), then MPI_Wait will do the waiting, not just deallocate. You should keep the buffer contents until MPI_Wait. Changsheng Jiang On Mon, Jun 28, 2010 at 11:41, Jack Bryan wrote: thanks, I know this. But what if the sender can send messages to receivers faster than the receiver can receive them? It means that the sender works faster than the receiver. Any help is appreciated. jack From: jiangzuo...@gmail.com Date: Mon, 28 Jun 2010 11:31:16 +0800 To: us...@open-mpi.org Subject: Re: [OMPI users] openMPI asynchronous communication MPI_Isend - Starts a standard-mode, nonblocking send. BTW, are there any asynchronous collective operations? Changsheng Jiang On Mon, Jun 28, 2010 at 11:22, Jack Bryan wrote: Dear All: How can I do asynchronous communication among nodes with Open MPI or Boost.MPI in a cluster? I need to set up a kind of asynchronous communication protocol such that message senders and receivers can communicate asynchronously without losing any messages between them. I do not want to use blocking MPI routines, because the processors can do other operations while they wait for new messages to come. I cannot find MPI routines that support this kind of asynchronous communication. Any help is appreciated. thanks Jack June 27 2010
[OMPI users] Open MPI ERR_TRUNCATE: message truncated
Dear All, I am using Open MPI: mpirun (Open MPI) 1.3.4. I got the error: terminate called after throwing an instance of 'boost::exception_detail::clone_impl >' what(): MPI_Test: MPI_ERR_TRUNCATE: message truncated I installed the Boost MPI library and compiled and ran the program with Open MPI. It seems that the message has been truncated by the receiver. How can I fix the problem? Is it a bug of Open MPI? Any help is appreciated. Jack June 28 2010
[OMPI users] Open MPI, Segmentation fault
Dear All, I am using Open MPI and got the error: [n337:37664] *** Process received signal *** [n337:37664] Signal: Segmentation fault (11) [n337:37664] Signal code: Address not mapped (1) [n337:37664] Failing at address: 0x7fffcfe9 [n337:37664] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0] [n337:37664] [ 1] /lustre/home/rhascheduler/RhaScheduler-0.4.1.1/mytest/nmn2 [0x414ed7] [n337:37664] [ 2] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974] [n337:37664] [ 3] /lustre/home/rhascheduler/RhaScheduler-0.4.1.1/mytest/nmn2(__gxx_personality_v0+0x1f1) [0x412139] [n337:37664] *** End of error message *** After searching for answers, it seems that some functions fail. My program runs well for 1, 2, or 10 processors, but fails when the number of tasks cannot be divided evenly by the number of processes. Any help is appreciated. thanks Jack June 30 2010
Re: [OMPI users] Open MPI, Segmentation fault
thanks, I am not familiar with Open MPI. Would you please help me with how to ask Open MPI to show where the fault occurs? The GNU debugger? Any help is appreciated. thanks!!! Jack June 30 2010 Date: Wed, 30 Jun 2010 16:13:09 -0400 From: amja...@gmail.com To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI, Segmentation fault Based on my experience, I would FULLY endorse (100% agree with) David Zhang. It is usually a coding or typo mistake. First, ensure that array sizes and dimensions are correct. My experience is that if Open MPI is compiled with GNU compilers (not with Intel), it also points out exactly the subroutine in which the fault occurs. Have a try. best, AA On Wed, Jun 30, 2010 at 12:43 PM, David Zhang wrote: When I get segmentation faults, it has always been my coding mistake. Perhaps your code is not robust against a number of processes not divisible by 2? On Wed, Jun 30, 2010 at 8:47 AM, Jack Bryan wrote: Dear All, I am using Open MPI and got the error: [n337:37664] *** Process received signal *** [n337:37664] Signal: Segmentation fault (11) [n337:37664] Signal code: Address not mapped (1) [n337:37664] Failing at address: 0x7fffcfe9 [n337:37664] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0] [n337:37664] [ 1] /lustre/home/rhascheduler/RhaScheduler-0.4.1.1/mytest/nmn2 [0x414ed7] [n337:37664] [ 2] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974] [n337:37664] [ 3] /lustre/home/rhascheduler/RhaScheduler-0.4.1.1/mytest/nmn2(__gxx_personality_v0+0x1f1) [0x412139] [n337:37664] *** End of error message *** After searching for answers, it seems that some functions fail. My program runs well for 1, 2, or 10 processors, but fails when the number of tasks cannot be divided evenly by the number of processes. Any help is appreciated. thanks Jack June 30 2010 
-- David Zhang University of California, San Diego
Re: [OMPI users] Open MPI, Segmentation fault
Thanks for all your replies. I want to do master-worker asynchronous communication. The master needs to distribute tasks to workers and then collect results from them. master: world.irecv(resultSourceRank, upStreamTaskTag, myResultTaskPackage[iRank][taskCounterT3]); I got the error "MPI_ERR_TRUNCATE" because I declared "TaskPackage myResultTaskPackage". It seems that a 2-dimensional array cannot be used to receive my defined class package from a worker, who sends a TaskPackage to the master. So I changed it to an int 2-d array to get the result, and it works well. But I still want to find out how to store the result in a data structure of type TaskPackage, because int data can only carry integers. Too limited. What I want to do is: the master stores the results from each worker and then combines them together to form the final result after collecting all results from the workers. But if the master has a number of tasks that cannot be divided evenly by the number of workers, each worker may get a different number of tasks. If we have 11 tasks and 3 workers: aveTaskNumPerNode = (11 - 11%3)/3 = 3, leftTaskNum = 11%3 = 2 = Z. The master distributes each of the leftover tasks to worker 1 through worker Z (Z < totalNumWorkers). For example: worker 1: 4 tasks, worker 2: 4 tasks, worker 3: 3 tasks. The master tries to distribute tasks evenly so that the difference between the workloads of the workers is minimized. I am going to use a vector of vectors for the dynamic data storage: a 2-dimensional data structure that can store the results from the workers, where each row can have a different number of columns. It can be indexed by iterator so that I can find a specified worker's task result by searching the data structure. For example: row 1: (worker1.task1) (worker1.task4); row 2: (worker2.task2) (worker2.task5); row 3: (worker3.task3). The data structure should remember the worker ID and the task ID. 
So that the master can know which task comes from which worker. Any help or comments are appreciated. thanks Jack June 30 2010 > Date: Thu, 1 Jul 2010 11:44:19 -0400 > From: g...@ldeo.columbia.edu > To: us...@open-mpi.org > Subject: Re: [OMPI users] Open MPI, Segmentation fault > > Hello Jack, list > > As others mentioned, this may be a problem with dynamic > memory allocation. > It could also be a violation of statically allocated memory, > I guess. > > You say: > > > My program can run well for 1,2,10 processors, but fail when the > > number of tasks cannot > > be divided evenly by number of processes. > > Often, when the division of the number of "tasks" > (or the global problem size) by the number of "processors" is not even, > one processor gets a lighter/heavier workload than the others, > it also allocates less/more memory than the others, > and it accesses smaller/larger arrays than the others. > > In general, integer division and remainder/modulo calculations > are used to control memory allocation, the array sizes, etc., > on different processors. > These formulas tend to use the MPI communicator size > (i.e., effectively the number of processors if you are using > MPI_COMM_WORLD) to split the workload across the processors. > > I would search for the lines of code where those calculations are done, > and where the arrays are allocated and accessed, > to make sure the algorithm works both when > they are of the same size > (even workload across the processors), > and when they are of different sizes > (uneven workload across the processors). > You may be violating memory access by a few bytes only, due to a small > mistake in one of those integer division / remainder/modulo formulas, > perhaps where an array index upper or lower bound is calculated. > It happened to me before, probably to others too. > > This type of code inspection can be done without a debugger, > or before you get to the debugger phase. 
> > I hope this helps, > Gus Correa > - > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > - > > > Jeff Squyres wrote: > > Also see http://www.open-mpi.org/faq/?category=debugging. > > > > On Jul 1, 2010, at 3:17 AM, Asad Ali wrote: > > > >> Hi Jack, > >> > >> Debugging Open MPI with traditional debuggers is a pain. > >> From your error message it sounds like you have some memory allocation > >> problem. Do you use dynamic memory a
[OMPI users] OpenMPI error MPI_ERR_TRUNCATE
Dear All: With Boost MPI, I am trying to ask some worker nodes to send messages to the single master node. I am using Open MPI 1.3.4. I use an array recvArray[row][column] to receive the messages, whose elements are a C++ class containing ints and member functions. But I got an error: terminate called after throwing an instance of 'boost::exception_detail::clone_impl >' what(): MPI_Test: MPI_ERR_TRUNCATE: message truncated [n124:126639] *** Process received signal *** [n124:126639] Signal: Aborted (6) [n124:126639] Signal code: (-6) It seems that the master cannot find enough space for the received message. But I have declared recvArray, which is a vector whose elements are my received class package. The error is very weird. When I open the received package, the elements are not the expected numbers but only some very large or small numbers. Any help is appreciated. Jack July 2 2010
[OMPI users] Open MPI, cannot get the results from workers
Dear All: I designed a master-worker framework, in which the master can schedule multiple tasks (numTaskPerWorkerNode) to each worker and then collect results from the workers. If numTaskPerWorkerNode = 1, it works well. But if numTaskPerWorkerNode > 1, the master cannot get the results from the workers, although the workers can get the tasks from the master. Why? I have used a different taskTag to distinguish the tasks, but it still does not work. Any help is appreciated. Thanks, Jack July 4 2010
Re: [OMPI users] Open MPI, cannot get the results from workers
When the master sends out a task, it assigns a distinct task number ID to the task. When the worker receives the task, it uses the task's assigned ID as the task tag to send it back to the master. Any help is appreciated. July 5 2010 From: solarbik...@gmail.com Date: Mon, 5 Jul 2010 13:17:27 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI, cannot get the results from workers How does the master receive results from the workers? If a worker is sending multiple task results, how does the master know what the message tags are ahead of time? On Sun, Jul 4, 2010 at 10:26 AM, Jack Bryan wrote: Dear All: I designed a master-worker framework, in which the master can schedule multiple tasks (numTaskPerWorkerNode) to each worker and then collect results from the workers. If numTaskPerWorkerNode = 1, it works well. But if numTaskPerWorkerNode > 1, the master cannot get the results from the workers, although the workers can get the tasks from the master. Why? I have used a different taskTag to distinguish the tasks, but it still does not work. Any help is appreciated. Thanks, Jack July 4 2010 -- David Zhang University of California, San Diego
[OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated
Dear All: I need to transfer some messages from worker nodes to the master node on an MPI cluster with Open MPI. The number of messages is fixed. When I increase the number of worker nodes, I get the error: -- terminate called after throwing an instance of 'boost::exception_detail::clone_impl >' what(): MPI_Unpack: MPI_ERR_TRUNCATE: message truncated [n231:45873] *** Process received signal *** [n231:45873] Signal: Aborted (6) [n231:45873] Signal code: (-6) [n231:45873] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0] [n231:45873] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3c50230215] [n231:45873] [ 2] /lib64/libc.so.6(abort+0x110) [0x3c50231cc0] -- For 40 workers it works well, but for 50 workers I get this error. The largest message size is no more than 72 bytes. Any help is appreciated. thanks Jack July 7 2010
Re: [OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated
Thanks. What if the master has to send and receive a large data package? Does it have to be split into multiple parts? This may increase communication overhead. I can use an MPI datatype to wrap it up as a specific datatype which carries the data. But what if the data is very large: 1 KB, 10 KB, 100 KB? The master needs to collect the same datatype from all workers, so it has to set up a data pool to hold all the data. The buffer provided by MPI may not be large enough to do this. Are there other ways to do it? Any help is appreciated. thanks Jack July 7 2010

From: solarbik...@gmail.com Date: Wed, 7 Jul 2010 17:32:27 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated

This error typically occurs when the received message is bigger than the specified buffer size. You need to narrow your code down to the offending receive command to see if this is indeed the case.

On Wed, Jul 7, 2010 at 8:42 AM, Jack Bryan wrote: [...]
-- David Zhang University of California, San Diego
[OMPI users] OpenMPI how large its buffer size ?
Dear All: How can I find the buffer size of Open MPI? I need to transfer large data between nodes on a cluster running Open MPI 1.3.4. Many nodes need to send data to the same node. The workers use mpi_isend and the receiver node uses mpi_irecv. Because these are non-blocking, the messages are stored in the senders' buffers, and the receiver then collects the messages from its buffer. If the receiver's buffer is too small, there will be a truncation error. Any help is appreciated. Jack July 9 2010
Re: [OMPI users] OpenMPI how large its buffer size ?
Hi, thanks for the program from Jody. David indicated the question that I want to ask. Jody's approach is fine when the MPI built-in buffer is large enough to hold a message such as 100 kB. In asynchronous communication, when the sender posts an mpi_isend, the message is put in a buffer provided by MPI. At that point, the receiver may not yet have posted its corresponding mpi_irecv, so the buffer size is important here. Without knowing the buffer size, I may get a "truncate error" on Open MPI. How can I find out the size of the buffer that Open MPI automatically creates in the background? Any help is appreciated. Jack, July 10 2010

From: solarbik...@gmail.com Date: Sat, 10 Jul 2010 16:46:12 -0700 To: us...@open-mpi.org Subject: Re: [OMPI users] OpenMPI how large its buffer size ?

I believe his question is about non-blocking send/recv: how does MPI know how much memory to allocate for the message, since the size is only known after the irecv is posted? So if the sender posts an isend but the receiver hasn't posted an irecv, what does MPI do with the message? I believe MPI automatically creates a buffer in the background to store it.

On Sat, Jul 10, 2010 at 1:55 PM, jody wrote: Perhaps I misunderstand your question... Generally, it is the user's job to provide the buffers, both to send and to receive. If you call MPI_Recv, you must pass a buffer that is large enough to hold the data sent by the corresponding MPI_Send. I.e., if you know your sender will send messages of 100 kB, then you must provide a buffer of size 100 kB to the receiver.
If the message size is unknown at compile time, you may have to send two messages: first an integer which tells the receiver how large a buffer it has to allocate, and then the actual message (which then nicely fits into the freshly allocated buffer):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "mpi.h"

#define SENDER 1
#define RECEIVER 0
#define TAG_LEN 77
#define TAG_DATA 78
#define MAX_MESSAGE 16

int main(int argc, char *argv[]) {
    int num_procs;
    int rank;
    int *send_buf;
    int *recv_buf;
    int send_message_size;
    int recv_message_size;
    MPI_Status st;
    int i;

    /* initialize random numbers */
    srand(time(NULL));

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == RECEIVER) {
        /* the receiver */
        /* wait for message length */
        MPI_Recv(&recv_message_size, 1, MPI_INT, SENDER, TAG_LEN, MPI_COMM_WORLD, &st);
        /* create a buffer of the required size */
        recv_buf = (int*) malloc(recv_message_size*sizeof(int));
        /* get data */
        MPI_Recv(recv_buf, recv_message_size, MPI_INT, SENDER, TAG_DATA, MPI_COMM_WORLD, &st);
        printf("Receiver got %d integers:", recv_message_size);
        for (i = 0; i < recv_message_size; i++) {
            printf(" %d", recv_buf[i]);
        }
        printf("\n");
        /* clean up */
        free(recv_buf);
    } else if (rank == SENDER) {
        /* the sender */
        /* random message size */
        send_message_size = (int)((1.0*MAX_MESSAGE*rand())/(1.0*RAND_MAX));
        /* create a buffer of the required size */
        send_buf = (int*) malloc(send_message_size*sizeof(int));
        /* create random message */
        for (i = 0; i < send_message_size; i++) {
            send_buf[i] = rand();
        }
        printf("Sender has %d integers:", send_message_size);
        for (i = 0; i < send_message_size; i++) {
            printf(" %d", send_buf[i]);
        }
        printf("\n");
        /* send message size to receiver */
        MPI_Send(&send_message_size, 1, MPI_INT, RECEIVER, TAG_LEN, MPI_COMM_WORLD);
        /* now send message */
        MPI_Send(send_buf, send_message_size, MPI_INT, RECEIVER, TAG_DATA, MPI_COMM_WORLD);
        /* clean up */
        free(send_buf);
    }

    MPI_Finalize();
}

I hope this helps
Jody

On Sat, Jul 10, 2010 at 7:12 AM, Jack Bryan wrote:
> Dear All:
> How to find the buffer size of OpenMPI ?
> I need to transfer large data between nodes on a cluster with OpenMPI 1.3.4.
> Many nodes need to send data to the same node .
> Workers use mpi_isend, the receiver node use mpi_irecv.
> because they are non-blocking, the messages are stored in buffers of senders.
> And then, the receiver collect messages from its buffer.
> If the receiver's buffer is too small, there will be truncate error.
> Any help is appreciated.
> Jack
> July 9 2010
Re: [OMPI users] OpenMPI how large its buffer size ?
Hi, thanks for all your replies. The master node can receive the message (the same size) from 50 worker nodes, but it cannot receive the message from 51 nodes; that caused a "truncate error". I used the same buffer to get the message in the 50-node case. About the "rendezvous" protocol: what is the meaning of "the sender sends a short portion"? What is the "short portion"? Is it a small part of the sender's message? Can this "rendezvous" protocol work automatically in the background without the programmer indicating it in his program? And can the "acknowledgement" be generated by the receiver only when the corresponding mpi_irecv is posted by the receiver? Any help is appreciated. Jack July 10 2010

Date: Sat, 10 Jul 2010 20:41:26 -0700 From: eugene@oracle.com To: us...@open-mpi.org Subject: Re: [OMPI users] OpenMPI how large its buffer size ?

I hope I understand the question properly. The "truncate error" means that the receive buffer provided by the user was too small to receive the designated message. That's an error in the user code. You're asking about some buffering sizes within the MPI implementation. We can talk about that, but it probably first makes sense to clarify what MPI is doing. If a sender posts a large send and the receiver has not posted a matching receive, the MPI implementation is not required to move any data. In particular, most MPI implementations will use a "rendezvous" protocol in which the sender sends a short portion and then waits for an acknowledgement from the receiver that it is ready to receive the message (and knows into which user buffer to place the received data). This protocol is used so that the MPI implementation does not have to buffer arbitrarily large messages internally. So, if you post a large send but no receive, the MPI implementation is probably buffering very little data. The message won't advance until the receive has been posted.
This means that a blocking MPI_Send will wait, and a nonblocking MPI_Isend will return without having done much.

Jack Bryan wrote: [...]
Re: [OMPI users] OpenMPI how large its buffer size ?
Thanks for your reply. The message size is 72 bytes. The master sends the message package to each of the 51 nodes. Then, after doing their local work, each worker node sends a same-size message back to the master. The master uses vector.push_back(new messageType) to receive each message from the workers, and uses mpi_irecv(workerNodeID, messageTag, bufferVector[row][column]) to receive a worker message; the row is the rank ID of each worker and the column is the index of the message from that worker. Each worker may send multiple messages to the master. When the number of worker nodes is large, I got the MPI_ERR_TRUNCATE error. Any help is appreciated. JACK July 10 2010

Date: Sat, 10 Jul 2010 23:12:49 -0700 From: eugene@oracle.com To: us...@open-mpi.org Subject: Re: [OMPI users] OpenMPI how large its buffer size ?

Jack Bryan wrote: The master node can receive message (the same size) from 50 worker nodes. But, it cannot receive message from 51 nodes. It caused "truncate error".

How big was the buffer that the program specified in the receive call? How big was the message that was sent? MPI_ERR_TRUNCATE means that you posted a receive with an application buffer that turned out to be too small to hold the message that was received. It's a user application error that has nothing to do with MPI's internal buffers. MPI's internal buffers don't need to be big enough to hold that message; MPI could require the sender and receiver to coordinate so that only part of the message is moved at a time.

I used the same buffer to get the message in the 50-node case. About the "rendezvous" protocol, what is the meaning of "the sender sends a short portion"? What is the "short portion", is it a small part of the message of the sender?

It's at least the message header (communicator, tag, etc.) so that the receiver can figure out whether this is the expected message or not. In practice, there is probably also some data in there as well.
The amount of that portion depends on the MPI implementation and, in practice, on the interconnect the message traveled over, on MPI-implementation-dependent environment variables set by the user, etc. E.g., with OMPI over shared memory it's about 4 Kbytes by default (if I remember correctly).

Can this "rendezvous" protocol work automatically in the background without the programmer indicating it in his program?

Right. MPI actually allows you to force such synchronization with MPI_Ssend, but typically MPI implementations use it automatically for "plain" long sends as well, even if the user did not use MPI_Ssend.

Can the "acknowledgement" be generated by the receiver only when the corresponding mpi_irecv is posted by the receiver?

Right.
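Eugene's figure of about 4 Kbytes over shared memory corresponds to an Open MPI MCA parameter. As a hedged pointer only (parameter names vary between Open MPI versions and interconnects, so check your own build with ompi_info), the shared-memory eager limit can be inspected and adjusted roughly like this:

```
# Inspect the eager/rendezvous thresholds of the sm BTL
# (Open MPI 1.x-era parameter name; verify on your installation):
ompi_info --param btl sm | grep eager_limit

# Raise the shared-memory eager limit for one run (value in bytes);
# ./exec stands for your own application binary:
mpirun --mca btl_sm_eager_limit 8192 -np 6 ./exec
```

Raising the eager limit only changes when the rendezvous protocol kicks in; it does not fix an undersized application receive buffer, which is what MPI_ERR_TRUNCATE reports.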
[OMPI users] OpenMPI load data to multiple nodes
Dear All, I am working on a multi-computer Open MPI cluster system. If I put some data files in /home/mypath/folder, is it possible for all non-head nodes to access the files in the folder? I need to load some data to some nodes; if all nodes can access the data, I do not need to load it to each node one by one. If multiple nodes access the same file to get data, is there a conflict? For example, fopen(myFile) by node 1 and, at the same time, fopen(myFile) by node 2. Is it allowed to do that on an MPI cluster without conflict? Any help is appreciated. Jinxu Ding July 12 2010
Re: [OMPI users] OpenMPI load data to multiple nodes
Thanks very much! May I use a global variable to do that? It means that all nodes have the same global variable, such as globalVector. In the initialization, only node 0 loads data from files and assigns values to the globalVector. After that, all other nodes can get the same data by accessing the globalVector. Does it make sense? Any help is appreciated. Jack July 12 2010

> Date: Mon, 12 Jul 2010 21:44:34 -0400
> From: g...@ldeo.columbia.edu
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OpenMPI load data to multiple nodes
>
> Hi Jack/Jinxu
>
> Jack Bryan wrote:
> > Dear All,
> > I am working on a multi-computer Open MPI cluster system.
> > If I put some data files in /home/mypath/folder, is it possible that all
> > non-head nodes can access the files in the folder ?
>
> Yes, possible, for instance, if the /home/mypath/folder directory is
> NFS mounted on all nodes/computers.
> Otherwise, if all disks and directories are local to each computer,
> you need to copy the input files to the local disks before you
> start, and copy the output files back to your login computer after the
> program ends.
>
> > I need to load some data to some nodes, if all nodes can access the
> > data, I do not need to load them to each node one by one.
> > If multiple nodes access the same file to get data, is there conflict ?
>
> To some extent.
> The OS (on the computer where the file is located)
> will do the arbitration on which process gets hold of the file at
> each time.
> If you have 1000 processes, this means a lot of arbitration,
> and most likely contention.
> Even for two processes only, if the processes are writing data to a
> single file, this won't ensure that they write
> the output data in the order that you want.
>
> > For example,
> > fopen(myFile) by node 1, at the same time fopen(myFile) by node 2.
> > Is it allowed to do that on MPI cluster without conflict ?
>
> I think MPI won't have any control over this.
> It is up to the operating system, and depends on
> which process gets its "fopen" request to the OS first,
> which is not a deterministic sequence of events.
> That is not a clean technique.
>
> You could instead:
>
> 1) Assign a single process, say, rank 0,
> to read and write data from/to the file(s).
> Then use, say, MPI_Scatter[v] and MPI_Gather[v],
> to distribute and collect the data back and forth
> between that process (rank 0) and all other processes.
>
> That is an old fashioned but very robust technique.
> It avoids any I/O conflict or contention among processes.
> All the data flows across the processes via MPI.
> The OS receives I/O requests from a single process (rank 0).
>
> Besides MPI_Gather/MPI_Scatter, look also at MPI_Bcast,
> if you need to send the same data to all processes,
> assuming the data is being read by a single process.
>
> 2) Alternatively, you could use the MPI I/O functions,
> if your files are binary.
>
> I hope it helps,
> Gus Correa
[OMPI users] openMPI, transfer data from multiple sources to one destination
Hi, I need to transfer data from multiple sources to one destination. The requirements are: (1) The source and destination nodes may work asynchronously. (2) Each source node generates data packages at its own pace, and there may be many packages to send. Whenever a data package is generated, it should be sent to the destination node at once; the source node then continues working on generating the next package. (3) There is only one destination node, which must receive all data packages generated by the source nodes. Because the source and destination nodes may work asynchronously, the destination node should not wait for a specific source node until that node sends out its data. The destination node should be able to receive a data package from any source node whenever a package is available. My question is: what MPI functions should be used to implement the protocol above? If I use MPI_Send/MPI_Recv, they are blocking functions; the destination node has to wait for one node until its data is available, and the communication overhead is too high. If I use MPI_Bsend, the destination node still has to use MPI_Recv, a blocking receive for a message. This can make the destination node wait for only one source node while other source nodes may actually have data available. Any help or comment is appreciated! thanks Dec. 28 2008
[OMPI users] Open mpi 123 install error for BLACS
Hi, I am installing BLACS in order to install PCSDP, a parallel interior point solver for linear programming. I need to install it on an Open MPI 1.2.3 platform. I have installed BLAS and LAPACK successfully. Now I need to install BLACS. I can run "make mpi" successfully, but when I run "make tester":

[BLACS]$ make tester
( cd TESTING ; make )
make[1]: Entering directory `/home/PCSDP/BLACS/TESTING'
mpif77 -o /home/PCSDP/BLACS/TESTING/EXE/xFbtest_MPI-LINUX-0 blacstest.o btprim_MPI.o tools.o /home/PCSDP/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /home/PCSDP/BLACS/LIB/blacs_MPI-LINUX-0.a /home/PCSDP/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /home/openmpi_123/lib/libmpi_cxx.la
/home/openmpi_123/lib/libmpi_cxx.la: file not recognized: File format not recognized
collect2: ld returned 1 exit status
make[1]: *** [/home/PCSDP/BLACS/TESTING/EXE/xFbtest_MPI-LINUX-0] Error 1
make[1]: Leaving directory `/home/PCSDP/BLACS/TESTING'
make: *** [tester] Error 2

In the "Makefile" of TESTING/, I have changed:

tools.o : tools.f
	#$(F77) $(F77NO_OPTFLAGS) -c $*.f
	$(F77) $(F77NO_OPTFLAGS) -fno-globals -fno-f90 -fugly-complex -w -c $*.f
blacstest.o : blacstest.f
	#$(F77) $(F77NO_OPTFLAGS) -c $*.f
	$(F77) $(F77NO_OPTFLAGS) -fno-globals -fno-f90 -fugly-complex -w -c $*.f

In "Bconfig.h", I have changed: include "/home/openmpi_123/include/mpi.h"

In Open MPI 1.2.3, the lib directory does not include "*.a" libraries, only "*.la" libraries. Any help is appreciated. Jack Jan. 30 2009

My "Bmake.inc" is:

# SECTION 1: PATHS AND LIBRARIES
SHELL = /bin/sh
BTOPdir = /home/PCSDP/BLACS
COMMLIB = MPI
PLAT = LINUX
BLACSdir = $(BTOPdir)/LIB
BLACSDBGLVL = 0
BLACSFINIT = $(BLACSdir)/blacsF77init_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT = $(BLACSdir)/blacsCinit_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB = $(BLACSdir)/blacs_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
MPIdir = /home/openmpi_123
MPILIBdir = $(MPIdir)/lib
MPIINCdir = $(MPIdir)/include
MPILIB = $(MPILIBdir)/libmpi_cxx.la
BTLIBS = $(BLACSFINIT) $(BLACSLIB) $(BLACSFINIT) $(MPILIB)
INSTdir = $(BTOPdir)/INSTALL/EXE
TESTdir = $(BTOPdir)/TESTING/EXE
FTESTexe = $(TESTdir)/xFbtest_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL)
CTESTexe = $(TESTdir)/xCbtest_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL)
SYSINC = -I$(MPIINCdir)
INTFACE = -Df77IsF2C
SENDIS =
BUFF =
TRANSCOMM = -DCSameF77
WHATMPI =
SYSERRORS =
DEBUGLVL = -DBlacsDebugLvl=$(BLACSDBGLVL)
DEFS1 = -DSYSINC $(SYSINC) $(INTFACE) $(DEFBSTOP) $(DEFCOMBTOP) $(DEBUGLVL)
BLACSDEFS = $(DEFS1) $(SENDIS) $(BUFF) $(TRANSCOMM) $(WHATMPI) $(SYSERRORS)

# SECTION 3: COMPILERS
F77 = mpif77
F77NO_OPTFLAGS =
F77FLAGS = $(F77NO_OPTFLAGS) -O
F77LOADER = $(F77)
F77LOADFLAGS =
CC = mpicc
CCFLAGS = -O4
CCLOADER = $(CC)
CCLOADFLAGS =
ARCH = ar
ARCHFLAGS = r
RANLIB = ranlib
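A note on the link failure above (an assumption on my part, not an answer from the thread): a ".la" file is libtool metadata, not an object archive, which is why ld reports "file not recognized". One way around it is to point MPILIB at the real shared library, or to leave MPILIB empty, since the mpif77/mpicc wrapper compilers already add the MPI libraries themselves:

```
# In Bmake.inc -- either link the actual shared library...
MPILIB = $(MPILIBdir)/libmpi.so
# ...or rely on the mpif77/mpicc wrappers, which link MPI on their own:
# MPILIB =
```

Whether libmpi.so alone suffices depends on the build; the wrapper-compiler route avoids guessing library names entirely.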