Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-01 Thread Leonardo Fialho
___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Leonardo Fialho Computer Architecture and Operating Systems Department - CAOS Universidad Autonoma de Barcelona - UAB ETSE, Edificio Q, QC/3088 http://www.caos.uab.es Phone: +34-93-581-2888 Fax: +34-93-581-2478

Re: [OMPI users] Execution in multicore machines

2008-09-29 Thread Leonardo Fialho
Jed Brown wrote: On Mon 2008-09-29 20:30, Leonardo Fialho wrote: 1) If I use one node (8 cores) the "user" % is around 100% per core. The execution time is around 430 seconds. 2) If I use 2 nodes (4 cores in each node) the "user" % is around 95% per core and th

[OMPI users] Execution in multicore machines

2008-09-29 Thread Leonardo Fialho
ons are: A) The execution time in case "1" should be smaller (only sm communication) than in cases "2" and "3", no? Cache problems? B) Why is there "sys" time when using inter-node communication? NIC driver? Why does this time increase when I balance the load a

[OMPI users] tg3 module

2008-06-04 Thread Leonardo Fialho
1.3) [lfialho@aoclsp ~]$ Thanks, -- Leonardo Fialho

Re: [OMPI users] Process size

2008-05-30 Thread Leonardo Fialho
simple-ping 20 1 -- Josh On May 29, 2008, at 7:54 AM, Leonardo Fialho wrote: Hi All, I made some tests with a dummy "ping" application. Some memory problems occurred. On these tests I obtained the following results: 1) OpenMPI (without FT): - delaying 1 second to send token to

[OMPI users] Process size

2008-05-29 Thread Leonardo Fialho
ze growing all the time. I think that it is something in the CRCP module/component... Thanks, -- Leonardo Fialho

Re: [OMPI users] blcr_checkpoint_peer: execvp returned -1

2008-04-29 Thread Leonardo Fialho
returns the "bad file descriptor" (EBADF) error, and the blcr module doesn't catch this error; it only returns (-1) "child failed". Thanks, Leonardo Fialho Josh Hursey wrote: I don't think I have ever seen this one before. :( So you are trying to checkpoint the MPI pro

Re: [OMPI users] blcr_checkpoint_peer: execvp returned -1

2008-04-28 Thread Leonardo Fialho
file descriptor Thanks, Leonardo Fialho Leonardo Fialho wrote: Hi All, Has anybody experienced this error? [aogrdini:09070] Global) Receive a command message from [[13242,0],0]. ... [aogrd02:07642] Local) Receive a command message. ... [aogrd01:07938] Local) Receive a command me

[OMPI users] blcr_checkpoint_peer: execvp returned -1

2008-04-28 Thread Leonardo Fialho
, like a child of the original process. When I run an application with this version and take a checkpoint manually, I have no problem... Leonardo Fialho

Re: [OMPI users] pml_v question

2008-03-05 Thread Leonardo Fialho
Sure! :) Thank you so much! Leonardo Aurélien Bouteiller wrote: Hi, to enable the vprotocol pessimist, you have to specify -mca vprotocol pessimist. This parameter takes precedence over the priority. Let me know if you succeed :] Aurelien On 5 March 08 at 13:55, Leonardo Fialho

[OMPI users] pml_v question

2008-03-05 Thread Leonardo Fialho
mca_base_component_distill_checkpoint_ready=0 ft_cr_enabled=1 crs= rml_wrapper=ftrm snapc=single (similar to full, but checkpoints only one process) filem=rsh pml_wrapper=crcpw crcp=uncoord (similar to coord, but needs to checkpoint only one process) btl=tcp,self Thanks, Leonardo Fialho

Re: [OMPI users] Question about fault tolerance checkpointing

2008-01-29 Thread Leonardo Fialho
Josh, At the moment I'm working on uncoordinated checkpointing, and I'll probably have some tools to collect data from the process and environment, and probably from the application. As for the application, I'm considering the possibility of doing something like this (MPI_Checkpoint??). Leonardo