> On Nov 4, 2020, at 11:40 PM, Gulam Mohamed <gulam.moha...@oracle.com> wrote: > > Description > ========= > 1. This Kernel panic could be due to a timing issue when there is a race > between the sync thread and the initiator was processing of a login response > from the target. The session re-open can be invoked from two places > a. Sessions sync thread when the iscsid restart > b. From iscsid through iscsi error handler > 2. The session reopen sequence is as follows in user-space > (iscsi-initiator-utils) > a. Disconnect the connection > b. Then send the stop connection request to the kernel which > releases the connection (releases the socket) > c. Queues the reopen for 2 seconds delay > d. Once the delay expires, create the TCP connection again by > calling the connect() call > e. Poll for the connection > f. When poll is successful i.e when the TCP connection is > established, it performs > i. Creation of session and connection data structures > ii. Bind the connection to the session. This is the place where we > assign the sock to tcp_sw_conn->sock > iii. Sets the parameters like target name, persistent address etc . > iv. Creates the login pdu > v. Sends the login pdu to kernel > vi. Returns to the main loop to process further events. The kernel then > sends the login request over to the target node > g. Once login response with success is received, it enters into full > feature phase and sets the negotiable parameters like max_recv_data_length, > max_transmit_length, data_digest etc . > 3. While setting the negotiable parameters by calling > "iscsi_session_set_neg_params()", kernel panicked as sock was NULL > > What happened here is > -------------------------------- > 1. Before initiator received the login response mentioned in above point > 2.f.v, another reopen request was sent from the error handler/sync session > for the same session, as the initiator utils was in main loop to process > further events (as > mentioned in point 2.f.vi above). > 2. While processing this reopen, it stopped the connection which released > the socket and queued this connection and at this point of time the login > response was received for the earlier one To make sure I got this you are saying before iscsi_sw_tcp_conn_stop calls iscsi_sw_tcp_release_conn that the recv path has called iscsi_recv_pdu right? > 3. The kernel passed over this response to user-space which then sent the > set_neg_params request to kernel > 4. As the connection was stopped, the sock was NULL and hence while the > kernel was processing the set param request from user-space, it panicked > > Fix > ---- > 1. While setting the set_param in kernel, we need to check if sock is NULL > 2. If the sock is NULL, then return EPERM (operation not permitted) > 3. Due to this error handler will be invoked in user-space again to > recover the session > > Signed-off-by: Gulam Mohamed <gulam.moha...@oracle.com> > Reviewed-by: Junxiao Bi <junxiao...@oracle.com> > --- > diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c > index df47557a02a3..fd668a194053 100644 > --- a/drivers/scsi/iscsi_tcp.c > +++ b/drivers/scsi/iscsi_tcp.c > @@ -711,6 +711,12 @@ static int iscsi_sw_tcp_conn_set_param(struct > iscsi_cls_conn *cls_conn, > struct iscsi_tcp_conn *tcp_conn = conn->dd_data; > struct iscsi_sw_tcp_conn *tcp_sw_conn = tcp_conn->dd_data; > > + if (!tcp_sw_conn->sock) { > + iscsi_conn_printk(KERN_ERR, conn, > + "Cannot set param as sock is NULL\n"); > + return -ENOTCONN; > + } > + I think this might have 2 issues: 1. It only fixes iscsi_tcp. What about other drivers that free/null resources/fields in ep_disconnect that we layer need to access in the set_param callback (the cxgbi drivers)? 2. What about drivers that do not free/null fields (be2iscsi, bnx2i, ql4xxx and qedi) so the set_param calls end up just working. What state will userspace be in where we have logged in, but iscsid also thinks we are in the middle of trying to login? I think we could do: 1. On ep_disconnect and stop_conn set some iscsi_cls_conn bit that indicates set_param calls for connection level settings should fail. scsi_transport_iscsi could set and check this bit for all drivers. 2. On bind_conn clear the bit.