On Wed, 23 Dec 2020 14:10:32 -0600 Lijun Pan wrote:
> On Wed, Dec 23, 2020 at 10:50 AM Jakub Kicinski <k...@kernel.org> wrote:
> >
> > On Wed, 23 Dec 2020 02:21:09 -0600 Lijun Pan wrote:
> > > On Tue, Dec 22, 2020 at 8:48 PM Jakub Kicinski <k...@kernel.org> wrote:
> > > > On Sat, 19 Dec 2020 15:40:34 -0600 Lijun Pan wrote:
> > > > > Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive
> > > > > init")
> > > > > says "If the passive
> > > > > CRQ initialization occurs before the FATAL reset task is processed,
> > > > > the FATAL error reset task would try to access a CRQ message queue
> > > > > that was freed, causing an oops. The problem may be most likely to
> > > > > occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
> > > > > process will automatically issue a change MTU request.
> > > > > Fix this by not processing fatal error reset if CRQ is passively
> > > > > initialized after client-driven CRQ initialization fails."
> > > > >
> > > > > Even with this commit, we still see similar kernel crashes. In order
> > > > > to completely solve this problem, we'd better continue the fatal error
> > > > > reset, capture the kernel crash, and try to fix it from that end.
> > > >
> > > > This basically reverts the quoted fix. Does the quoted fix make things
> > > > worse? Otherwise we should leave the code be until proper fix is found.
> > > >
> > >
> > > Yes, I think the quoted commit makes things worse. It skips the specific
> > > reset condition, but that does not fix the problem it claims to fix.
> >
> > Okay, let's make sure the commit message explains how it makes things
> > worse.
>
> I will reword the commit message.
>
> > > The effective fix is upstream SHA 0e435befaea4 and a0faaa27c716. So I
> > > think reverting it to the original "else" condition is the right thing to
> > > do.
> >
> > Hm. So the problem is fixed? But the commit message says "we still see
> > similar kernel crashes", that's present tense suggesting that crashes
> > are seen on current net/master. Are you saying that's not the case and
> > after 0e435befaea4 and a0faaa27c716 there are no more crashes?
>
> This patch was formed before I submitted 0e435befaea4 and a0faaa27c716, so
> I used the wording "we still see similar kernel crashes". I will modify
> the commit message before I submit v2 of this patch.
> After 0e435befaea4 and a0faaa27c716, I don't see any crashes as described
> in this quoted commit even without this quoted commit.
> That's why I am sure this quoted commit does not fix the described problem
> and I want to revert it.

I see, that explains it!