Hi,

We've been putting a new Mellanox QDR Intel Sandy Bridge cluster, based on CentOS 6.3, through its paces and we're getting repeated kernel messages we never used to get on CentOS 5. An example on one node:

Sep 28 09:58:20 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:27 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:27 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:29 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:29 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:31 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:31 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:32 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:45 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 09:58:45 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
Sep 28 10:08:23 g8s1n2 kernel: mlx4_core 0000:01:00.0: mlx4_eq_int: MLX4_EVENT_TYPE_SRQ_LIMIT
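
To compare how often these events fire across nodes, something like the following could tally them per host and per minute. This is only a rough sketch: the function name count_srq_limit_events is made up for illustration, and the regular expression assumes the syslog layout shown above (it tolerates the event name being wrapped onto the next line).

```python
import re
from collections import Counter

# Matches a syslog entry of the form shown above. The \s+ before the event
# name tolerates the line having been wrapped after "mlx4_eq_int:".
PATTERN = re.compile(
    r"(\w{3}\s+\d+ \d\d:\d\d):\d\d (\S+) kernel: mlx4_core \S+ "
    r"mlx4_eq_int:\s+MLX4_EVENT_TYPE_SRQ_LIMIT"
)


def count_srq_limit_events(log_text):
    """Count SRQ_LIMIT events, keyed by (host, 'Mon DD HH:MM')."""
    counts = Counter()
    for minute, host in PATTERN.findall(log_text):
        counts[(host, minute)] += 1
    return counts
```

Feeding it the contents of the system log (wherever kernel messages are written on the node) gives a Counter that can be sorted to see which nodes and which minutes of the job were noisiest.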

These messages appeared when running IMB compiled with openmpi 1.6.1 across 256 cores (16 nodes, 16 cores per node). The job ran from 09:56:54 to 10:08:46 and failed with no obvious error messages.

Now, I'm used to IMB running into trouble at larger core counts, but I'm wondering whether anyone has seen these messages before and knows whether they indicate a problem.

We're running with an increased log_num_mtt mlx4_core option, as recommended by the openmpi FAQ, and have also increased log_num_srq to its maximum value in a failed attempt to get rid of the messages:

$ cat /etc/modprobe.d/libmlx4_local.conf
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3 log_num_srq=20
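
For reference, the openmpi FAQ gives the maximum registerable memory as (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE. A quick sketch of that arithmetic for the settings above, assuming the usual 4096-byte page size on x86_64 (the function name is just for illustration):

```python
def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Maximum registerable memory per the openmpi FAQ formula.

    page_size defaults to 4096 bytes, which is an assumption; check
    `getconf PAGESIZE` on the node.
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    total = max_registerable_bytes(24, 3)
    print(total // 2**30, "GiB")  # prints "512 GiB"
```

So under that assumption the limit works out to 512 GiB, which should be well above twice the physical memory of a node.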

Any thoughts?

Thanks,

Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.di...@leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
