On 5/22/2024 11:33 AM, Jacob Keller wrote:
> 
> 
> On 5/20/2024 1:33 AM, Przemek Kitszel wrote:
>> On 5/18/24 01:22, Jacob Keller wrote:
>>> The ice_get_pfa_module_tlv() function iterates over the TLVs in the
>>> Preserved Fields Area (PFA) of the NVM. This is used to access data such as
>>> the Part Board Assembly identifier.
>>>
>>> Some NVMs in the wild have been found with incorrect TLV lengths including
>>> at least one which reports a TLV length of 0xFFFF. When trying to read the
>>> PBA from such an NVM, the driver will compute a new offset for the next_tlv
>>> which is lower, due to overflowing the 16-bit next_tlv variable.
>>>
>>> In the best case, the driver will incorrectly interpret values until it
>>> finds one which has an offset greater than the PFA area without
>>> overflowing. In the worst case, the values in the NVM result in an infinite
>>> loop as the misinterpreted lengths result in checking offsets which are
>>> valid within the PFA, and which ultimately point in an infinite loop.
>>>
>>> Fix this by using check_add_overflow when calculating the NVM offsets, and
>>> bailing if we ever overflow. Additionally, use check_add_overflow when
>>> calculating the initial maximum PFA size.
>>>
>>> This ensures that we bail immediately on encountering any TLV who's length
>>> would have caused the naive addition to overflow, rather than entering an
>>> infinite loop or otherwise misinterpreting NVM values.
>>>
>>> Fixes: e961b679fb0b ("ice: add board identifier info to devlink .info_get")
>>> Co-developed-by: Paul Greenwalt <paul.greenw...@intel.com>
>>> Signed-off-by: Paul Greenwalt <paul.greenw...@intel.com>
>>> Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com>
>>> ---
>>> Changes in v3:
>>> - Fix missing {
>>> - Fix missing pfa_ptr variable to dev_warn()
>>> - Add Fixes tag
>>> - Link to v2: 
>>> https://lore.kernel.org/r/20240517-iwl-net-2024-05-16-fix-nvm-tlv-issue-v2-1-fdee184ec...@intel.com
>>>
>>> Changes in v2:
>>> - Use check_add_overflow instead of increasing the variables to u32
>>> - Upgrade log messages to dev_warn()
>>> - Link to v1: 
>>> https://lore.kernel.org/r/20240516-iwl-net-2024-05-16-fix-nvm-tlv-issue-v1-1-ecbb6a759...@intel.com
>>> ---
>>>   drivers/net/ethernet/intel/ice/ice_nvm.c | 19 +++++++++++++++----
>>>   1 file changed, 15 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ice/ice_nvm.c 
>>> b/drivers/net/ethernet/intel/ice/ice_nvm.c
>>> index 84eab92dc03c..be731b83d667 100644
>>> --- a/drivers/net/ethernet/intel/ice/ice_nvm.c
>>> +++ b/drivers/net/ethernet/intel/ice/ice_nvm.c
>>> @@ -440,8 +440,7 @@ int
>>>   ice_get_pfa_module_tlv(struct ice_hw *hw, u16 *module_tlv, u16 
>>> *module_tlv_len,
>>>                    u16 module_type)
>>>   {
>>> -   u16 pfa_len, pfa_ptr;
>>> -   u16 next_tlv;
>>> +   u16 pfa_len, pfa_ptr, next_tlv, max_pfa;
>>>     int status;
>>>   
>>>     status = ice_read_sr_word(hw, ICE_SR_PFA_PTR, &pfa_ptr);
>>> @@ -454,11 +453,18 @@ ice_get_pfa_module_tlv(struct ice_hw *hw, u16 
>>> *module_tlv, u16 *module_tlv_len,
>>>             ice_debug(hw, ICE_DBG_INIT, "Failed to read PFA length.\n");
>>>             return status;
>>>     }
>>> +
>>> +   if (check_add_overflow(pfa_ptr, pfa_len, &max_pfa)) {
>>> +           dev_warn(ice_hw_to_dev(hw), "PFA starts at offset %u. PFA 
>>> length of %u causes 16-bit arithmetic overflow.\n",
>>> +                    pfa_ptr, pfa_len);
>>> +           return -EINVAL;
>>> +   }
>>> +
>>>     /* Starting with first TLV after PFA length, iterate through the list
>>>      * of TLVs to find the requested one.
>>>      */
>>>     next_tlv = pfa_ptr + 1;
>>> -   while (next_tlv < pfa_ptr + pfa_len) {
>>> +   while (next_tlv < max_pfa) {
>>>             u16 tlv_sub_module_type;
>>>             u16 tlv_len;
>>>   
>>> @@ -485,7 +491,12 @@ ice_get_pfa_module_tlv(struct ice_hw *hw, u16 
>>> *module_tlv, u16 *module_tlv_len,
>>>             /* Check next TLV, i.e. current TLV pointer + length + 2 words
>>>              * (for current TLV's type and length)
>>>              */
>>> -           next_tlv = next_tlv + tlv_len + 2;
>>> +           if (check_add_overflow(next_tlv, 2, &next_tlv) ||
>>> +               check_add_overflow(next_tlv, tlv_len, &next_tlv)) {
>>> +                   dev_warn(ice_hw_to_dev(hw), "Failed to locate PFA TLV 
>>> module of type %u due to arithmetic overflow. The PFA length is %u. The 
>>> last TLV has length of %u\n",
>>> +                            module_type, pfa_len, tlv_len);
>>> +                   return -EINVAL;
>>> +           }
>>>     }
>>>     /* Module does not exist */
>>>     return -ENOENT;
>>>
>>> ---
>>> base-commit: 83e93942796db58652288f0391ac00072401816f
>>> change-id: 20240516-iwl-net-2024-05-16-fix-nvm-tlv-issue-99ebb2c55c52
>>>
>>> Best regards,
>>
>> Reviewed-by: Przemek Kitszel <przemyslaw.kits...@intel.com>
> 
> It turns out after digging, that this appears to be caused by
> erroneously reading from the NVM past the PFA TLVs:
> 
> I added additional logging and I see:
> 
>> Failed to locate PFA TLV module of type 22 due to arithmetic overflow. The 
>> PFA length is 0x00002f00. The last TLV was 0x00002fff. The PFA area ends at 
>> 0x00003000. The last TLV has length of 0x0000e7c1
> 
> We're trying to read a TLV starting at offset 0x2FFF, but the PFA is
> supposed to end at 0x3000. I think this is actually just a case if an
> off-by-one error when calculating the maximum offset of the PFA, that
> the next_tlv < max_pfa value check doesn't trigger.
> 
> This results in the final loop trying to interpret whatever comes after
> the PFA in the NVM as a TLV, which gets the incorrect length value and
> causes the overflow.
> 
> I need to do a bit more investigation, but I believe the error comes
> from the fact that the PFA length value includes the initial length word
> as well.
> 
> I'm unsure if we would still need the check_add_overflow in that case.
> It might be safe in the case of an actual bad NVM.

Paul and I finished investigating. It turns out that the PFA is
documented to have one extra "last word" with a value of 0xFFFF. This
was not taken into account in the looping, which is what resulted in the
problems. This is not in fact caused by bad or corrupt NVMs, but merely
software which fails to properly account for this extra word.

I'm working on a new fix which will supersede this, and will have a full
explanation and new commit title.

Reply via email to