Michael Ellerman <m...@ellerman.id.au> writes:
> Stewart Smith <stew...@linux.vnet.ibm.com> writes:
>
>> Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
>>> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>>>> Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
>>>>
>>>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>>>
>>>>> done to read any core FIR which is sleeping, offline.
>>>> OK.
>>>>
>>>> Do we know why Linux is causing that to happen?
>>>
>>> This issue is originally seen upon running STAF (Software Test
>>> Automation Framework) stress tests and off-lining some cores
>>> with stress tests running.
>>>
>>> It can also be re-created after off-lining few cores and following
>>> one of below methods.
>>> 1. Executing Linux "sensors" command
>>> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
>>> where X is offline CPU.
>>>
>>> Its "opal_get_sensor_data" Linux API that that triggers
>>> OPAL call "opal_sensor_read", performing XSCOM ops here.
>>> If core is found sleeping/offline Linux throws up
>>> "opal_error_code: Unexpected OPAL error" error onto console.
>>>
>>> Currently Linux isn't aware about OPAL_WRONG_STATE return code
>>> from OPAL. Thus it prints "Unexpected OPAL error" message, same
>>> as it would log for any unknown OPAL return codes.
>>>
>>> Seeing this error over console has been a concern for Test and
>>> would puzzle real user as well. This patch makes Linux aware about
>>> OPAL_WRONG_STATE return code from OPAL and stops printing
>>> "Unexpected OPAL error" message onto console for OPAL fails
>>> with OPAL_WRONG_STATE
>>
>> Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
>> return the xscom_read return code in event of error.
>>
>> I would argue that converting to EIO in that instance is probably
>> correct... or EAGAIN? EAGAIN may be more correct in the situation where
>> the core is just sleeping.
>>
>> What kind of offlining are you doing?
>>
>> Arguably, the correct behaviour would be to remove said sensors when the
>> core is offline.
>
> Right, that would be ideal. There appear to be at least two other hwmon
> drivers that are CPU hotplug aware (coretemp and via-cputemp).
>
> But perhaps it's not possible to work out which sensors are attached to
> which CPU etc., I haven't looked in detail.

Each core-temp@ sensor has an ibm,pir property, so linking back to what
core shouldn't be too hard. For mem-temp@ sensors, we have the chip-id.

> In that case changing just opal_get_sensor_data() to handle
> OPAL_WRONG_STATE would be OK, with a comment explaining that we might be
> asked to read a sensor on an offline CPU and we aren't able to detect
> that.

Agree.

--
Stewart Smith
OPAL Architect, IBM.
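
For illustration only, here is a rough userspace sketch of the approach agreed above: treat OPAL_WRONG_STATE from a sensor read as an expected condition and return an errno quietly, while still logging anything else as unexpected. Everything in it is an assumption made for the sketch: the SKETCH_* names, their numeric values, and the helper sketch_sensor_rc_to_errno() are hypothetical stand-ins, not the kernel's definitions or the actual patch. It picks EIO; EAGAIN was the other candidate raised in the thread.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Stand-in return codes for this sketch only. The real values must come
 * from the kernel's OPAL API header; these are assumptions.
 */
enum sketch_opal_rc {
	SKETCH_OPAL_SUCCESS     = 0,
	SKETCH_OPAL_WRONG_STATE = -14,
};

/* Hypothetical helper: map a sensor-read return code to an errno. */
static int sketch_sensor_rc_to_errno(int rc)
{
	switch (rc) {
	case SKETCH_OPAL_SUCCESS:
		return 0;
	case SKETCH_OPAL_WRONG_STATE:
		/*
		 * We might be asked to read a sensor on an offline or
		 * sleeping core and cannot detect that beforehand, so
		 * fail quietly instead of logging an unexpected error.
		 */
		return -EIO;
	default:
		/* Anything else is still reported as unexpected. */
		fprintf(stderr, "Unexpected OPAL error %d\n", rc);
		return -EIO;
	}
}
```

With this shape, a read of tempX_input for an offline core would fail with an I/O error but leave the console clean, and only genuinely unknown return codes would produce the "Unexpected OPAL error" message.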