Hi Dave,

This series adds mlx5 firmware devlink health reporters and SW reset
support.
We plan to follow up this series with a patch that adds mlx5
documentation under Documentation/networking/mlx5.rst, first thing in
the 5.3 kernel release. It will cover all new mlx5 devlink options and
more.

For more information please see tag log below.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit a734d1f4c2fc962ef4daa179e216df84a8ec5f84:

  net: openvswitch: return an error instead of doing BUG_ON() (2019-05-04 01:36:36 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2019-05-04

for you to fetch changes up to 30d8b932dcebbcb8c5d1991cab5325c2e3faad6d:

  net/mlx5: Report devlink health on FW fatal issues (2019-05-04 17:22:45 -0700)

----------------------------------------------------------------
mlx5-updates-2019-05-04

Mlx5 devlink health fw reporters and sw reset support

This series provides mlx5 firmware reset support and firmware devlink
health reporters.

1) Add CR-Space access and FW Crdump snapshot support via devlink
   region_snapshot

2) Issue software reset upon FW asserts

3) Add fw and fw_fatal devlink health reporters to follow fw error
   indications with dump and recover procedures, and let the user
   trigger this functionality.

3.1) fw reporter:
The fw reporter implements diagnose and dump callbacks.
It follows symptoms of a fw error, such as a fw syndrome, by triggering
a fw core dump and storing it, along with any other fw traces, in the
dump buffer.
The fw reporter diagnose command can be triggered at any time by the
user to check the current fw status.

3.2) fw_fatal reporter:
The fw_fatal reporter implements dump and recover callbacks.
It follows fatal error indications with a CR-space dump and a recover
flow.
The CR-space dump uses the vsc interface, which remains valid even if
the FW command interface is not functional, as is the case in most FW
fatal errors. The CR-space dump is stored as a memory region snapshot
to ease reading by address.
The recover function runs the recover flow, which reloads the driver
and triggers a fw reset if needed.

Command examples and output:

diagnose data:
 assert_var[0] 0xfc3fc043
 assert_var[1] 0x0001b41c
 assert_var[2] 0x00000000
 assert_var[3] 0x00000000
 assert_var[4] 0x00000000
 assert_exit_ptr 0x008033b4
 assert_callra 0x0080365c
 fw_ver 16.24.1000
 hw_id 0x0000020d
 irisc_index 0
 synd 0x8: unrecoverable hardware error
 ext_synd 0x003d
 raw fw_ver 0x101803e8

dump traces:
 trace: 0000:82:00.1 [0x69cd6c5283e] 0 [0xb8] dump general info GVMI=0x0001
 trace: 0000:82:00.1 [0x69cd6c53bec] 0 [0xb8] GVMI management info, gvmi_management context:
 trace: 0000:82:00.1 [0x69cd6c55eff] 0 [0xb8] [000]: 00000000 00000000 00000000 00000000
 trace: 0000:82:00.1 [0x69cd6c5657f] 0 [0xb8] [010]: 00000000 00000000 00000000 00000000
 trace: 0000:82:00.1 [0x69cd6c56608] 0 [0xb8] [020]: 00000000 00000000 00000000 00000000
 trace: 0000:82:00.1 [0x69cd6c566ff] 0 [0xb8] [030]: 00000000 00000000 00000000 00000000
 trace: 0000:82:00.1 [0x69cd6c5677f] 0 [0xb8] [040]: 00000000 00000000 00000000 00000000
 trace: 0000:82:00.1 [0x69cd6c5687f] 0 [0xb8] [050]: 00000000 00000000 00000000 00000000
 trace: 0000:82:00.1 [0x69cd6c568ff] 0 [0xb8] [060]: 00000000 00000000 00000000 00000000
 trace: 0000:82:00.1 [0x69cd6c569a5] 0 [0xb8] [070]: 00000000 00000000 00000000 00000000
 trace: 0000:82:00.1 [0x69cd6c57021] 0 [0xb8] CMDIF dbase from IRON: active_dbase_slots = 0x00000000
 trace: 0000:82:00.1 [0x69cd6c58dae] 0 [0xb8] GVMI=0x0001 hw_toc context:
 trace: 0000:82:00.1 [0x69cd6c58e7f] 0 [0xb8] [000]: 00400100 00000000 00000000 fffff000
 trace: 0000:82:00.1 [0x69cd6c58f7f] 0 [0xb8] [010]: 00000000 00000000 00000000 00000000
 ...
 ...

devlink_region_name: cr-space snapshot_id: 1
 00000000000f0018 e1 03 00 00 fb ae a9 3f
 0000000000000000 00 20 00 01 00 00 00 00 03 00 00 00 00 00 00 00
 0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0000000000000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0000000000000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80
 0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0000000000000060 00 00 00 00 00 00 00 00 00 00 00 00 de 0a 00 00
 0000000000000070 0c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0000000000000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa 00
 0000000000000090 b6 0b 00 00 00 00 00 00 80 c7 fe ff 50 0a 00 00
 ...
 ...

----------------------------------------------------------------
Alex Vesker (3):
      net/mlx5: Add Vendor Specific Capability access gateway
      net/mlx5: Add Crdump FW snapshot support
      net/mlx5: Add support for devlink region_snapshot parameter

Eran Ben Elisha (1):
      net/mlx5: Move all devlink related functions calls to devlink.c

Feras Daoud (3):
      net/mlx5: Handle SW reset of FW in error flow
      net/mlx5: Control CR-space access by different PFs
      net/mlx5: Issue SW reset on FW assert

Moshe Shemesh (8):
      net/mlx5: Refactor print health info
      net/mlx5: Create FW devlink health reporter
      net/mlx5: Add core dump register access functions
      net/mlx5: Add support for FW reporter dump
      net/mlx5: Report devlink health on FW issues
      net/mlx5: Add fw fatal devlink health reporter
      net/mlx5: Add support for FW fatal reporter dump
      net/mlx5: Report devlink health on FW fatal issues

 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c  |  72 +++
 drivers/net/ethernet/mellanox/mlx5/core/devlink.h  |  12 +
 .../net/ethernet/mellanox/mlx5/core/diag/crdump.c  | 210 ++++++++
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c   | 143 +++++
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.h   |  14 +
 .../net/ethernet/mellanox/mlx5/core/en_selftest.c  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   | 575 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h |   6 +
 .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c  | 313 +++++++++++
 .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.h  |  33 ++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  19 +-
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |   8 +-
 include/linux/mlx5/device.h                        |  10 +-
 include/linux/mlx5/driver.h                        |  20 +-
 include/linux/mlx5/mlx5_ifc.h                      |  17 +-
 16 files changed, 1357 insertions(+), 100 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/crdump.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.h
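
A note for readers of the diagnose output in the tag log: the sample
shows both "fw_ver 16.24.1000" and "raw fw_ver 0x101803e8". The two
values line up if the raw word is read as an 8-bit major, an 8-bit
minor and a 16-bit sub-minor field. The sketch below illustrates that
reading only. The field layout is an assumption inferred from this one
sample, and the struct and function names are hypothetical; none of
this is taken from the driver code in this series.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type for illustration; not part of mlx5. */
struct fw_version {
	unsigned int major;
	unsigned int minor;
	unsigned int subminor;
};

/*
 * Assumed layout, inferred from the sample diagnose output only:
 *   bits 31..24  major
 *   bits 23..16  minor
 *   bits 15..0   sub-minor
 */
static struct fw_version fw_version_from_raw(uint32_t raw)
{
	struct fw_version v;

	v.major = (raw >> 24) & 0xff;
	v.minor = (raw >> 16) & 0xff;
	v.subminor = raw & 0xffff;
	return v;
}

/*
 * Format the decoded version as "major.minor.subminor".
 * Returns the snprintf() result (length that would be written).
 */
static int fw_version_format(uint32_t raw, char *buf, size_t len)
{
	struct fw_version v = fw_version_from_raw(raw);

	return snprintf(buf, len, "%u.%u.%u", v.major, v.minor, v.subminor);
}
```

Under that assumption, passing 0x101803e8 to fw_version_format() yields
the string "16.24.1000", which matches the fw_ver field of the sample.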