+Rodrigo ________________________________ From: Intel-gfx <intel-gfx-boun...@lists.freedesktop.org> on behalf of Saarinen, Jani <jani.saari...@intel.com> Sent: Tuesday, December 3, 2024 8:22 AM To: Thomas Weißschuh <li...@weissschuh.net>; Borah, Chaitanya Kumar <chaitanya.kumar.bo...@intel.com>; Krogerus, Heikki <heikki.kroge...@intel.com> Cc: Rafael J. Wysocki <raf...@kernel.org>; Kurmi, Suresh Kumar <suresh.kumar.ku...@intel.com>; Coelho, Luciano <luciano.coe...@intel.com>; Nikula, Jani <jani.nik...@intel.com>; De Marchi, Lucas <lucas.demar...@intel.com>; intel-gfx@lists.freedesktop.org <intel-gfx@lists.freedesktop.org>; intel...@lists.freedesktop.org <intel...@lists.freedesktop.org>; linux...@vger.kernel.org <linux...@vger.kernel.org>; Sebastian Reichel <sebastian.reic...@collabora.com> Subject: RE: Regression on linux-next (next-20241120) and drm-tip
+@Krogerus, Heikki > -----Original Message----- > From: Thomas Weißschuh <li...@weissschuh.net> > Sent: Tuesday, 3 December 2024 18.08 > To: Borah, Chaitanya Kumar <chaitanya.kumar.bo...@intel.com> > Cc: Rafael J. Wysocki <raf...@kernel.org>; Kurmi, Suresh Kumar > <suresh.kumar.ku...@intel.com>; Coelho, Luciano <luciano.coe...@intel.com>; > Saarinen, Jani <jani.saari...@intel.com>; Nikula, Jani > <jani.nik...@intel.com>; > De Marchi, Lucas <lucas.demar...@intel.com>; intel-gfx@lists.freedesktop.org; > intel...@lists.freedesktop.org; linux...@vger.kernel.org; Sebastian Reichel > <sebastian.reic...@collabora.com> > Subject: Re: Regression on linux-next (next-20241120) and drm-tip > > On 2024-12-03 16:57:21+0100, Thomas Weißschuh wrote: > > On 2024-12-03 15:42:23+0000, Borah, Chaitanya Kumar wrote: > > > > > > > > > > -----Original Message----- > > > > From: Thomas Weißschuh <li...@weissschuh.net> > > > > Sent: Tuesday, December 3, 2024 8:20 PM > > > > To: Rafael J. Wysocki <raf...@kernel.org> > > > > Cc: Borah, Chaitanya Kumar <chaitanya.kumar.bo...@intel.com>; > > > > Kurmi, Suresh Kumar <suresh.kumar.ku...@intel.com>; Coelho, > > > > Luciano <luciano.coe...@intel.com>; Saarinen, Jani > > > > <jani.saari...@intel.com>; Nikula, Jani <jani.nik...@intel.com>; > > > > De Marchi, Lucas <lucas.demar...@intel.com>; > > > > intel-gfx@lists.freedesktop.org; intel- x...@lists.freedesktop.org; > > > > linux...@vger.kernel.org; Sebastian Reichel > > > > <sebastian.reic...@collabora.com> > > > > Subject: Re: Regression on linux-next (next-20241120) and drm-tip > > > > > > > > On 2024-12-03 15:33:21+0100, Rafael J. Wysocki wrote: > > > > > On Tue, Dec 3, 2024 at 1:04 PM Thomas Weißschuh > > > > <li...@weissschuh.net> wrote: > > > > > > > > > > > > On 2024-12-03 12:54:54+0100, Rafael J. Wysocki wrote: > > > > > > > On Tue, Dec 3, 2024 at 7:51 AM Thomas Weißschuh > > > > <li...@weissschuh.net> wrote: > > > > > > > > > > > > > > > > (+Cc Sebastian) > > > > > > > > > > > > > > > > Hi Chaitanya, > > > > > > > > > > > > > > > > On 2024-12-03 05:07:47+0000, Borah, Chaitanya Kumar wrote: > > > > > > > > > Hope you are doing well. I am Chaitanya from the linux > > > > > > > > > graphics team > > > > in Intel. > > > > > > > > > > > > > > > > > > This mail is regarding a regression we are seeing in our > > > > > > > > > CI runs[1] on > > > > linux-next repository. > > > > > > > > > > > > > > > > Thanks for the report. > > > > > > > > > > > > > > > > > Since the version next-20241120 [2], we are seeing the > > > > > > > > > following regression > > > > > > > > > > > > > > > > > > ````````````````````````````````````````````````````````````````````````````````` > > > > > > > > > <4>[ 19.990743] Oops: general protection fault, probably > > > > > > > > > for non- > > > > canonical address 0xb11675ef8d1ccbce: 0000 [#1] PREEMPT SMP NOPTI > > > > > > > > > <4>[ 19.990760] CPU: 21 UID: 110 PID: 867 Comm: prometheus- > > > > node Not tainted 6.12.0-next-20241120-next-20241120-gac24e26aa08f+ > > > > #1 > > > > > > > > > <4>[ 19.990771] Hardware name: Intel Corporation Arrow Lake > > > > Client Platform/MTL-S UDIMM 2DPC EVCRB, BIOS > > > > MTLSFWI1.R00.4400.D85.2410100007 10/10/2024 > > > > > > > > > <4>[ 19.990782] RIP: > 0010:power_supply_get_property+0x3e/0xe0 > > > > > > > > > ```````````````````````````````````````````````````````` > > > > > > > > > `````` ``````````````````` Details log can be found in > > > > > > > > > [3]. > > > > > > > > > > > > > > > > > > After bisecting the tree, the following patch [4] seems > > > > > > > > > to be the first > > > > "bad" > > > > > > > > > commit > > > > > > > > > > > > > > > > > > ```````````````````````````````````````````````````````` > > > > > > > > > `````` ``````````````````````````````````````````` > > > > > > > > > Commit 49000fee9e639f62ba1f965ed2ae4c5ad18d19e2 > > > > > > > > > Author: Thomas Weißschuh <mailto:li...@weissschuh.net> > > > > > > > > > AuthorDate: Sat Oct 5 12:05:03 2024 +0200 > > > > > > > > > Commit: Sebastian Reichel > > > > <mailto:sebastian.reic...@collabora.com> > > > > > > > > > CommitDate: Tue Oct 15 22:22:20 2024 +0200 > > > > > > > > > power: supply: core: add wakeup source inhibit by > > > > > > > > > power_supply_config > > > > > > > > > ```````````````````````````````````````````````````````` > > > > > > > > > `````` ``````````````````````````````````````````` > > > > > > > > > > > > > > > > > > This is now seen in our drm-tip runs as well. [5] > > > > > > > > > > > > > > > > > > Could you please check why the patch causes this > > > > > > > > > regression and > > > > provide a fix if necessary? > > > > > > > > > > > > > > > > I don't see how this patch can lead to this error. > > > > > > > > > > > > > > It looks like the cfg->no_wakeup_source access reaches > > > > > > > beyond the struct boundary for some reason. > > > > > > > > > > > > But the access to this field is only done in > > > > > > __power_supply_register(). > > > > > > The error reports however don't show this function at all, > > > > > > they come from power_supply_uevent() and > > > > > > power_supply_get_property() by which time the call to > __power_supply_register() is long over. > > > > > > > > > > > > FWIW there is an uninitialized 'struct power_supply_config' in > > > > > > drivers/hid/hid-corsair-void.c. But I highly doubt the test > > > > > > machines are using that. (I'll send a patch later for it) > > > > > > > > > > So the only way I can think about in which the commit in > > > > > question may lead to the reported issues is that changing the > > > > > size of struct power_supply_config or its alignment makes an > > > > > unexpected functional difference somewhere. > > > > > > > > Indeed. I'd really like to see this issue reproduced with KASAN. > > > > > > > > > AFAICS, this commit cannot be reverted by itself, so which > > > > > commits on top of it need to be reverted in order to revert it > > > > > cleanly? > > > > > > > > All the other patches from this series: > > > > https://lore.kernel.org/lkml/20241005-power-supply-no-wakeup-sourc > > > > e-v1- > > > > 0-1d62bf9bc...@weissschuh.net/ > > > > > > > > Could you point me to the full boot log in the drm-tip CI? > > > > > > Here is the log for drm-tip CI > > > > > > https://intel-gfx-ci.01.org/tree/drm-tip/IGT_8136/bat-arls-5/boot0.t > > > xt > > > > Thanks! > > > > > I carried out another bisect and it points to the following commit > > > > > > commit 226ff2e681d006eada59a9693aa1976d4c15a7d4 > > > Author: Heikki Krogerus <heikki.kroge...@linux.intel.com> > > > Date: Wed Nov 6 17:06:05 2024 +0200 > > > > > > usb: typec: ucsi: Convert connector specific commands to bitmaps > > > > > > That allows the fields in those command data structures to > > > be easily validated. If an unsupported field is accessed, a > > > warning is generated. > > > > Suspicous: The bitmaps introduced in this commit are right before the > > psy and psy_desc members of 'struct ucsi_connector'. > > So any out-of-bounds writes into these members would corrupt those > > fields. > > A corrupted power_supply_desc would fit both reported stacktraces. > > struct ucsi_connector { > ... > > struct typec_capability typec_cap; > > /* Cached command responses. */ > DECLARE_BITMAP(cap, UCSI_GET_CONNECTOR_CAPABILITY_SIZE); > DECLARE_BITMAP(status, UCSI_GET_CONNECTOR_STATUS_SIZE); > > DECLARE_BITMAP() takes the size in number of *bits* > > struct power_supply *psy; > struct power_supply_desc psy_desc; > u32 rdo; > > ... > } > > static int ucsi_get_connector_status(struct ucsi_connector *con, bool > conn_ack) { > u64 command = UCSI_GET_CONNECTOR_STATUS | > UCSI_CONNECTOR_NUMBER(con->num); > size_t size = min(UCSI_GET_CONNECTOR_STATUS_SIZE, > UCSI_MAX_DATA_LENGTH(con->ucsi)); > int ret; > > ret = ucsi_send_command_common(con->ucsi, command, &con->status, > size, conn_ack); > > ucsi_send_command_common() takes the size in number of *bytes*. > This call corrupts psy and psy_desc in con. > > return ret < 0 ? ret : 0; > } > > > > > > Reverting it seems to help locally. However, to confirm I have sent out a > > > patch > to our "try-bot" > > > > > > https://patchwork.freedesktop.org/series/142049/ > > > > > > We can wait for its results. > > > > > > Regards > > > > > > Chaitanya > > >