On Fri, 2007-01-19 at 20:15 -0500, Jeff Squyres wrote:
> On Jan 19, 2007, at 6:19 PM, Arif Ali wrote:
>
> > > [0,1,59][btl_openib_component.c:1153:btl_openib_component_progress] from
> > > node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> > > status number 10 for wr_id 268919352 opcode 256614836
> > > mpirun noticed that job rank 0 with PID 0 on node node02 exited on
> > > signal 15 (Terminated).
> > > 55 additional processes aborted (not shown)
> > does this happen with btl_openib_flags=1? Does this also happen without
> > this setting. This doesn't happen with OpenMPI-1.2b3 right?
> >
> >
> > That's Correct, I tried all the flags that was suggested, and a few
> > more, which I listed in previous mails
>
> I can parse your text either way, so forgive me for belaboring the
> point:
Sorry for not being clear

> - Does this happen with btl_openib_flags=1 on the nightly snapshot of
> OMPI v1.2?
Yes

> - Does this happen without setting btl_openib_flags on the nightly
> snapshot of OMPI v1.2?
Yes

> - What is the exact version of the nightly snapshot for OMPI v1.2
> that you are using?
1.2b4r13137

> > Yes, correct, this doesn't happen with 1.2b3
>
> Good to know.
>
> Were you able to experiment with the various MCA parameters that I
> described in the other mail to see if such problems went away?
> (i.e., ensure that you're not running out of DMA-able memory)
Not yet, I'll be doing these today, and will get back to you as soon as
I can

regards,
Arif