Hi, I am evaluating the performance of a clustering program written in Java with MPI+threads and would like some insight into a peculiar case. I've attached a performance graph to explain this.
In essence, the tests were carried out as TxPxN, where T is the number of threads per process, P is the number of processes per node, and N is the number of nodes. I noticed an inefficiency with the Tx*1*xN cases in general (the tall bars in the graph).

To elaborate a bit further:

1. Each node has 2 sockets with 4 cores each (8 cores in total).
2. I used OpenMPI 1.7.5rc5 (I later tested with 1.8 and observed the same).
3. I ran with the following options:
   A.) --map-by node:PE=4 and --bind-to core
   B.) --map-by node:PE=8 and --bind-to-core
   C.) --map-by socket and --bind-to none

Timing of A, B, C came out as A < B < C, so I used the results from option A for the Tx*1*xN cases in the graph.

Could you please suggest anything that may help speed up these Tx*1*xN cases? Also, I expected B to perform better than A, as the threads could utilize all 8 cores, but that wasn't the case.

Thank you,
Saliya

[image: Inline image 1]

--
Saliya Ekanayake
esal...@gmail.com
Cell 812-391-4914
Home 812-961-6383
http://saliya.org