On Wed, Dec 12, 2018 at 01:50:10PM +0000, Igor Russkikh wrote:
> From: Dmitry Bezrukov <dmitry.bezru...@aquantia.com>
> 
> Lab testing shows that chip may get overheated when
> network link is connected on high speed (2.5G/5G).
> 
> Default hardware design uses only passive heatsink without a cooler,
> and that makes things worse.
> 
> To prevent possible chip damage here we add throttling mechanism.
> 
> There is a worker that monitors the PHY's temperature via reading
> PHY registers. When PHY's temperature reaches the critical threshold
> (it is 106 degrees of Celsius) it changes the link speed to the lowest
> regardless user/default link speed settings. It should reduce the PHY's
> temperature to normal numbers.
> 
> When the PHY's temparature is back to normal (low threshold, it is
> 85 degrees) it restores user/default link speed settings.

Hi Igor

Please could you also export the temperature using HWMON. The Marvell
PHY drivers are good examples.

I also have a patch for driver/net/phy/aquantia.c which adds HWMON
support to that PHY.

> 
> Signed-off-by: Dmitry Bezrukov <dmitry.bezru...@aquantia.com>
> Signed-off-by: Igor Russkikh <igor.russk...@aquantia.com>
> ---
>  drivers/net/usb/aqc111.c | 78 ++++++++++++++++++++++++++++++++++++++++
>  drivers/net/usb/aqc111.h |  8 +++++
>  2 files changed, 86 insertions(+)
> 
> diff --git a/drivers/net/usb/aqc111.c b/drivers/net/usb/aqc111.c
> index 8867f6a3eaa7..fa6b9dfc2a6f 100644
> --- a/drivers/net/usb/aqc111.c
> +++ b/drivers/net/usb/aqc111.c
> @@ -17,6 +17,7 @@
>  #include <linux/usb/cdc.h>
>  #include <linux/usb/usbnet.h>
>  #include <linux/linkmode.h>
> +#include <linux/workqueue.h>
>  
>  #include "aqc111.h"
>  
> @@ -334,6 +335,9 @@ static void aqc111_set_phy_speed(struct usbnet *dev, u8 
> autoneg, u16 speed)
>       aqc111_data->phy_cfg |= (3 << AQ_DSH_RETRIES_SHIFT) &
>                               AQ_DSH_RETRIES_MASK;
>  
> +     if (aqc111_data->thermal_throttling)
> +             speed = SPEED_100;
> +

That is a big jump. Do you need to cool is down quickly, or would
1Gbps be sufficient? I think as a user, i would prefer it ping-pongs
between 5G and 1G, not 5G and 100M.

>       if (autoneg == AUTONEG_ENABLE) {
>               switch (speed) {
>               case SPEED_5000:

It looks like this only works when auto-neg is enabled. If i've fixed
configured it i don't think this works?

> @@ -714,6 +718,8 @@ static int aqc111_bind(struct usbnet *dev, struct 
> usb_interface *intf)
>       /* store aqc111_data pointer in device data field */
>       dev->driver_priv = aqc111_data;
>  
> +     aqc111_data->dev = dev;
> +
>       /* Init the MAC address */
>       ret = aqc111_read_perm_mac(dev);
>       if (ret)
> @@ -991,6 +997,71 @@ static int aqc111_link_reset(struct usbnet *dev)
>       return 0;
>  }
>  
> +int aqc111_get_temperature(struct usbnet *dev, u32 *temperature)
> +{
> +     u16 reg16 = 0;
> +
> +     if (aqc111_mdio_read(dev, AQ_GLB_THERMAL_STATUS2, AQ_PHY_GLOBAL_ADDR,
> +                          &reg16) < 0)
> +             goto err;
> +
> +     if (!(reg16 & AQ_THERM_ST_READY))
> +             goto err;
> +
> +     if (aqc111_mdio_read(dev, AQ_GLB_THERMAL_STATUS1, AQ_PHY_GLOBAL_ADDR,
> +                          &reg16) < 0)
> +             goto err;
> +
> +     /*convert from 1/256 to 1/100 degrees of Celsius*/
> +     *temperature = (u32)reg16 * 100 / 256;
> +     return 0;
> +
> +err:
> +     *temperature = 0;
> +     return -1;
> +}
> +
> +void aqc111_thermal_work_cb(struct work_struct *w)
> +{
> +     struct delayed_work *dw = to_delayed_work(w);
> +     struct aqc111_data *aqc111_data = container_of(dw, struct aqc111_data,
> +                                                    therm_work);
> +     unsigned long timeout = msecs_to_jiffies(AQ_THERMAL_TIMER_MS);
> +     struct usbnet *dev = aqc111_data->dev;
> +     u32 temperature = 0;
> +     u8 reset_speed = 0;
> +
> +     if (!aqc111_data->link)
> +             /* poll not so frequently */
> +             timeout *= 2;
> +
> +     if (aqc111_get_temperature(dev, &temperature) != 0)
> +             goto end;
> +
> +     if (aqc111_data->thermal_throttling &&
> +         temperature <= AQ_NORMAL_TEMP_THRESHOLD) {
> +             netdev_info(dev->net, "The temperature is back to normal(%u)",
> +                         AQ_NORMAL_TEMP_THRESHOLD / 100);

How often do you see these messages? I'm wondering if they need to be
rate limited?

> +             aqc111_data->thermal_throttling = 0;
> +             reset_speed = 1;
> +     }
> +
> +     if (!aqc111_data->thermal_throttling &&
> +         temperature >= AQ_CRITICAL_TEMP_THRESHOLD) {

Should there be some hysteresis in here? In fact, if temperature is
AQ_CRITICAL_TEMP_THRESHOLD it is both back to normal, and throttled at
the same time!

> +             netdev_warn(dev->net, "Critical temperature(%u) is reached",
> +                         AQ_CRITICAL_TEMP_THRESHOLD / 100);
> +             aqc111_data->thermal_throttling = 1;
> +             reset_speed = 1;

update_speed might be a better name, since you are not always
resetting it.

          Andrew

Reply via email to