Stewart Smith <stew...@linux.vnet.ibm.com> writes:
> Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
>> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>>> Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
>>>
>>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>>
>>>> done to read any core FIR which is sleeping, offline.
>>> OK.
>>>
>>> Do we know why Linux is causing that to happen?
>>
>> This issue is originally seen upon running STAF (Software Test
>> Automation Framework) stress tests and off-lining some cores
>> with stress tests running.
>>
>> It can also be re-created after off-lining few cores and following
>> one of below methods.
>> 1. Executing Linux "sensors" command
>> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
>> where X is offline CPU.
>>
>> Its "opal_get_sensor_data" Linux API that that triggers
>> OPAL call "opal_sensor_read", performing XSCOM ops here.
>> If core is found sleeping/offline Linux throws up
>> "opal_error_code: Unexpected OPAL error" error onto console.
>>
>> Currently Linux isn't aware about OPAL_WRONG_STATE return code
>> from OPAL. Thus it prints "Unexpected OPAL error" message, same
>> as it would log for any unknown OPAL return codes.
>>
>> Seeing this error over console has been a concern for Test and
>> would puzzle real user as well. This patch makes Linux aware about
>> OPAL_WRONG_STATE return code from OPAL and stops printing
>> "Unexpected OPAL error" message onto console for OPAL fails
>> with OPAL_WRONG_STATE
>
> Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
> return the xscom_read return code in event of error.
>
> I would argue that converting to EIO in that instance is probably
> correct... or EAGAIN? EAGAIN may be more correct in the situation where
> the core is just sleeping.
>
> What kind of offlining are you doing?
>
> Arguably, the correct behaviour would be to remove said sensors when the
> core is offline.

Right, that would be ideal. There appear to be at least two other hwmon
drivers that are CPU hotplug aware (coretemp and via-cputemp).

But perhaps it's not possible to work out which sensors are attached to
which CPU etc., I haven't looked in detail.

In that case changing just opal_get_sensor_data() to handle
OPAL_WRONG_STATE would be OK, with a comment explaining that we might be
asked to read a sensor on an offline CPU and we aren't able to detect
that.

cheers
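
For illustration only, the narrower fix could look roughly like the user-space sketch below. It is not the actual kernel code: the function names sketch_opal_error_code() and sketch_sensor_rc_to_errno() are hypothetical, the numeric OPAL codes are assumed to match the kernel's opal-api.h, and picking EIO over EAGAIN is just one of the two options raised earlier in the thread.

```c
#include <errno.h>
#include <stdio.h>

/* OPAL return codes. Values are assumed to match opal-api.h. */
#define OPAL_SUCCESS       0
#define OPAL_PARAMETER    -1
#define OPAL_HARDWARE     -6
#define OPAL_WRONG_STATE  -14

/*
 * Hypothetical stand-in for the generic OPAL-to-errno translation.
 * Codes it does not know about are logged, which is the console noise
 * described in the report.
 */
int sketch_opal_error_code(int rc)
{
	switch (rc) {
	case OPAL_SUCCESS:
		return 0;
	case OPAL_PARAMETER:
		return -EINVAL;
	case OPAL_HARDWARE:
		return -EIO;
	default:
		fprintf(stderr, "sketch: Unexpected OPAL error %d\n", rc);
		return -EIO;
	}
}

/*
 * Hypothetical translation used only on the sensor-read path.
 */
int sketch_sensor_rc_to_errno(int rc)
{
	/*
	 * We might be asked to read a sensor on a sleeping or offline
	 * CPU, and we aren't able to detect that beforehand. OPAL then
	 * returns OPAL_WRONG_STATE, so report a plain I/O error without
	 * logging an "unexpected" message. (EIO is an assumption here;
	 * EAGAIN may fit better when the core is only sleeping.)
	 */
	if (rc == OPAL_WRONG_STATE)
		return -EIO;

	return sketch_opal_error_code(rc);
}
```

With this shape, only the sensor path stays quiet on OPAL_WRONG_STATE; any other caller of the generic translation would still log it as unexpected.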