[OMPI users] Error

2013-10-18 Thread sudhirs@
Dear open-mpi user, I am running a CPMD calculation in parallel. I got the following error and the job was killed. The error message is given below. What is this error, and how can I fix it? [[12065,1],23][btl_openib_component.c:2948:handle_wc] from compute-0-0.local to: compute-0-7 error polling LP
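
When many ranks report lines like the one quoted above, it can help to pull out the reporting host and the peer host so the affected host pair stands out. The following is a minimal sketch that parses only the visible prefix of the message. The function name `parse_openib_error` and the field names are illustrative, and reading `[[12065,1],23]` as an Open MPI process name (job family, local job id, rank) is an assumption, not something stated in the post.

```python
import re

# Matches the visible prefix of the quoted line:
#   [[A,B],C][source_file:line:function] from HOST to: PEER ...
# The meaning attached to A, B and C below is an assumption.
LINE_RE = re.compile(
    r"\[\[(?P<job_family>\d+),(?P<local_job>\d+)\],(?P<rank>\d+)\]"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]"
    r"\s+from\s+(?P<src>\S+)\s+to:\s+(?P<dst>\S+)"
)


def parse_openib_error(text):
    """Return a dict of fields from one error line, or None if it does not match."""
    match = LINE_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the purely numeric fields to integers.
    for key in ("job_family", "local_job", "rank", "line"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = (
        "[[12065,1],23][btl_openib_component.c:2948:handle_wc] "
        "from compute-0-0.local to: compute-0-7 error polling LP"
    )
    print(parse_openib_error(sample))
```

Counting how often each `(src, dst)` pair appears across a job's output is one way to narrow down which hosts to inspect first.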

Re: [OMPI users] Error

2013-10-18 Thread Jeff Squyres (jsquyres)
This typically indicates an error in the physical layer of your IB network. You should run layer 0 diagnostics and look for bad cables, bad HCAs, etc.

On Oct 18, 2013, at 1:49 AM, "sudhirs@" wrote:
> Dear open-mpi user,
> I am running a CPMD calculation in parallel. I got the following error

[OMPI users] debugging performance regressions between versions

2013-10-18 Thread Dave Love
I've been testing an application that turns out to be ~30% slower with OMPI 1.6.5 than (the Red Hat packaged version of) 1.5.4, with the same mca-params and the same binary, just flipping the runtime. It's running over openib, and the profile it prints says that alltoall is a factor of four slower

Re: [OMPI users] Need help running jobs across different IB vendors

2013-10-18 Thread Dave Love
"Jeff Squyres (jsquyres)" writes:
> Short version:
> --
>
> What you really want is:
>
> mpirun --mca pml ob1 ...
>
> The "--mca mtl ^psm" way will get the same result, but forcing pml=ob1 is
> really a slightly better solution (from a semantic perspective)

I'm afraid ^psm is r
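
For anyone launching jobs from a script, the two variants quoted above can be assembled programmatically. This is only a sketch: the helper names `mpirun_command` and `mca_env` and the application path `./my_app` are hypothetical. It relies on Open MPI also reading MCA parameters from environment variables named `OMPI_MCA_<param>`, and it only builds the argument list and environment without running anything.

```python
import os


def mpirun_command(app_args, mca=None):
    """Build an mpirun argument list with one '--mca NAME VALUE' pair per entry."""
    argv = ["mpirun"]
    for name, value in (mca or {}).items():
        argv += ["--mca", name, value]
    argv += list(app_args)
    return argv


def mca_env(mca, base_env=None):
    """Return a copy of the environment with OMPI_MCA_<name> set for each entry."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in mca.items():
        env["OMPI_MCA_" + name] = value
    return env


if __name__ == "__main__":
    # './my_app' is a placeholder application name.
    print(" ".join(mpirun_command(["./my_app"], {"pml": "ob1"})))
    print(" ".join(mpirun_command(["./my_app"], {"mtl": "^psm"})))
```

The resulting list can be passed to `subprocess.run`, which avoids shell quoting problems with the `^` character.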