From: Paul Menzel <pmen...@molgen.mpg.de> 
Sent: Monday, February 10, 2025 1:07 PM
>Dear Jedrzej,
>
>
>Thank you for the quick reply.
>
>
>Am 10.02.25 um 12:59 schrieb Jagielski, Jedrzej:
>> From: Paul Menzel <pmen...@molgen.mpg.de>
>> Sent: Monday, February 10, 2025 12:40 PM
>
>>> Am 10.02.25 um 11:40 schrieb Jedrzej Jagielski:
>>>> E610 NICs unlike the previous devices utilising ixgbe driver
>>>> are notified in the case of overheatning by the FW ACI event.
>>>
>>> overheating (without n)
>>>
>>>> In event of overheat when threshold is exceeded, FW suspends all
>>>> traffic and sends overtemp event to the driver.
>>>
>>> I guess this was already the behavior before your patch, and now it’s
>>> only logged, and the device stopped properly?
>
>> Without this patch driver cannot receive the overtemp info. FW handles
>> the event - device stops but user has no clue what has happened.
>> FW does the major work but this patch adds this minimal piece of
>> overtemp handling done by SW - informing user at the very end.
>
>Thank you for the confirmation. Maybe add a small note, that it’s not 
>logged to make it crystal clear for people like myself.

OK, i will extend the commit msg.

>
>>>> Then driver
>>>> logs appropriate message and closes the adapter instance.
>>>> The card remains in that state until the platform is rebooted.
>>>
>>> As a user I’d be interested what the threshold is, and what the measured
>>> temperature is. Currently, the log seems to be just generic?
>> 
>> These details are FW internals.
>> Driver just gets info that such event has happened.
>> There's no additional information.
>> 
>> In that case driver's job is just to inform user that such scenario
>> has happened and tell what should be the next steps.
>
> From a user perspective that is a suboptimal behavior, and shows 
>another time that the Linux kernel should have all the control, and 
>stuff like this should be moved *out* of the firmware and not into the 
>firmware.
>
>It’d be great if you could talk to the card designers/engineers to take 
>that into account, and maybe revert this decision for future versions or 
>future adapters.

I will have it on my mind.


Thanks,
Jedrek

>
>
>Kind regards,
>
>Paul
>
>
>>>      drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:static const char 
>>> ixgbe_overheat_msg[] = "Network adapter has been stopped because it has 
>>> over heated. Restart the computer. If the problem persists, power off the 
>>> system and replace the adapter";
>>>
>>>> Reviewed-by: Przemek Kitszel <przemyslaw.kits...@intel.com>
>>>> Reviewed-by: Mateusz Polchlopek <mateusz.polchlo...@intel.com>
>>>> Signed-off-by: Jedrzej Jagielski <jedrzej.jagiel...@intel.com>
>>>> ---
>>>> v2,3 : commit msg tweaks
>>>> ---
>>>>    drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      | 5 +++++
>>>>    drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h | 3 +++
>>>>    2 files changed, 8 insertions(+)
>>>
>>>
>>> Kind regards,
>>>
>>> Paul
>>>
>>>
>>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
>>>> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>>> index 7236f20c9a30..5c804948dd1f 100644
>>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>>> @@ -3165,6 +3165,7 @@ static void ixgbe_aci_event_cleanup(struct 
>>>> ixgbe_aci_event *event)
>>>>    static void ixgbe_handle_fw_event(struct ixgbe_adapter *adapter)
>>>>    {
>>>>            struct ixgbe_aci_event event __cleanup(ixgbe_aci_event_cleanup);
>>>> +  struct net_device *netdev = adapter->netdev;
>>>>            struct ixgbe_hw *hw = &adapter->hw;
>>>>            bool pending = false;
>>>>            int err;
>>>> @@ -3185,6 +3186,10 @@ static void ixgbe_handle_fw_event(struct 
>>>> ixgbe_adapter *adapter)
>>>>                    case ixgbe_aci_opc_get_link_status:
>>>>                            ixgbe_handle_link_status_event(adapter, &event);
>>>>                            break;
>>>> +          case ixgbe_aci_opc_temp_tca_event:
>>>> +                  e_crit(drv, "%s\n", ixgbe_overheat_msg);
>>>> +                  ixgbe_close(netdev);
>>>> +                  break;
>>>>                    default:
>>>>                            e_warn(hw, "unknown FW async event captured\n");
>>>>                            break;
>>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h 
>>>> b/drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h
>>>> index 8d06ade3c7cd..617e07878e4f 100644
>>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h
>>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type_e610.h
>>>> @@ -171,6 +171,9 @@ enum ixgbe_aci_opc {
>>>>            ixgbe_aci_opc_done_alt_write                    = 0x0904,
>>>>            ixgbe_aci_opc_clear_port_alt_write              = 0x0906,
>>>>    
>>>> +  /* TCA Events */
>>>> +  ixgbe_aci_opc_temp_tca_event                    = 0x0C94,
>>>> +
>>>>            /* debug commands */
>>>>            ixgbe_aci_opc_debug_dump_internals              = 0xFF08,

Reply via email to