I have been looking, but I haven't found a good answer about system-level threading. We are about to get a new cluster of dual-processor, quad-core nodes (8 cores per node). Traditionally, on dual-processor, single-core nodes, I would just tell MPI to launch two processes per node, but with eight cores on a node, launching 8 separate processes seems inefficient.
My question is this: when I request 8 processes on a node, does Open MPI detect that the node has multiple cores and automatically use something like pthreads instead of creating separate processes? Or should I run a single process per node and use OpenMP or pthreads explicitly to get better per-node performance? -- Sam Adams