On Tue, May 12, 2020 at 07:23:28PM +0300, Igor Russkikh wrote: > > >> So I think its not a good place to insert this call. > >> Its hard to find exact good place to insert it in qed. > > > > Is there a way to check if what happened was indeed a fw crash? > > Our driver has two firmwares (slowpath and fastpath). > For slowpath firmware the way to understand it crashed is to observe command > response timeout. This is in qed_mcp.c, around "The MFW failed to respond to > command" traceout.
Ok thanks. > For fastpath this is tricky, think you may leave the above place as the only > place to invoke module_firmware_crashed() So do you mean like the changes below? diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c index f4eebaabb6d0..95cb7da2542e 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_debug.c +++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c @@ -7906,6 +7906,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer) rc = qed_dbg_grc(cdev, (u8 *)buffer + offset + REGDUMP_HEADER_SIZE, &feature_size); if (!rc) { + module_firmware_crashed(); *(u32 *)((u8 *)buffer + offset) = qed_calc_regdump_header(cdev, GRC_DUMP, cur_engine, diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c index 280527cc0578..a818cf09dccf 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c +++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c @@ -566,6 +566,7 @@ _qed_mcp_cmd_and_union(struct qed_hwfn *p_hwfn, DP_NOTICE(p_hwfn, "The MFW failed to respond to command 0x%08x [param 0x%08x].\n", p_mb_params->cmd, p_mb_params->param); + module_firmware_crashed(); qed_mcp_print_cpu_info(p_hwfn, p_ptt); spin_lock_bh(&p_hwfn->mcp_info->cmd_lock); > >> One more thing is that AFAIU taint flag gets permanent on kernel, but > > for > >> example our device can recover itself from some FW crashes, thus it'd be > >> transparent for user. > > > > Similar things are *supposed* to recoverable with other device, however > > this can also sometimes lead to a situation where devices are not usable > > anymore, and require a full driver unload / load. > > > >> Whats the logical purpose of module_firmware_crashed? Does it mean fatal > >> unrecoverable error on device? > > > > Its just to annotate on the module and kernel that this has happened. > > > > I take it you may agree that, firmware crashing *often* is not good > > design, > > and these issues should be reported to / fixed by vendors. In cases > > where driver bugs are reported it is good to see if a firmware crash has > > happened before, so that during analysis this is ruled out. > > Probably, but still I see some misalignment here, in sense that taint is about > the kernel state, not about a hardware state indication. The kernel carries the driver though, and the driver / subsystem can often times act strange when this happens. > devlink health could really be a much better candidate for such things. That sounds fantastic, please Cc me on patches! However I still believe we should register this event in the kernel for support purposes. Luis