On 7/1/19 10:11 PM, Gerd Rausch wrote:
Hi Santosh,

On 01/07/2019 19.28, santosh.shilim...@oracle.com wrote:

Below. All command timeouts are 60 seconds.

enum {
         MLX4_CMD_TIME_CLASS_A   = 60000,
         MLX4_CMD_TIME_CLASS_B   = 60000,
         MLX4_CMD_TIME_CLASS_C   = 60000,
};


Thank you for the pointer.

But having said that, I re-looked at the code you are patching,
and that is actually only FRWR code, which is purely work-request
based, so this command timeout shouldn't matter.


Which brings us full circle, back to the question of
what the timeout ought to be.

Please keep in mind that prior to this fix,
the RDS code didn't wait at all:

It simply posted those registration (IB_WR_REG_MR)
and invalidation (IB_WR_LOCAL_INV)
work-requests, with no regard to when the firmware
would execute them.

Arguably, waiting any amount of time greater than zero
for the operation to complete is better than not waiting at all.

We can change the timeout to a high value, or even make it infinite
by using "wait_event" instead of "wait_event_timeout".
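
To illustrate the difference (a sketch only; "fr_reg_done",
"fr_state" and "FRMR_IS_INUSE" are stand-ins for whatever the
real frmr structure provides):

        /* Bounded wait after posting IB_WR_REG_MR:
         * "fr_reg_done" is a wait-queue the CQ handler wakes,
         * "fr_state" the state it updates on completion. */
        if (!wait_event_timeout(frmr->fr_reg_done,
                                frmr->fr_state != FRMR_IS_INUSE,
                                msecs_to_jiffies(reg_timeout_ms)))
                ret = -ETIMEDOUT; /* firmware never got to it in time */

        /* versus the unbounded variant, which never gives up: */
        wait_event(frmr->fr_reg_done,
                   frmr->fr_state != FRMR_IS_INUSE);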

For the registration work-requests, there is a benefit to waiting
only a short amount of time (the trade-off described in patch #1
of this series).

Actually, we should just switch this code to what Avinash
finally did in the downstream code. That keeps the RDS_GET_MR
semantics and makes sure the MR is really valid before handing
the key over to userland. There is no need for any timeout
in the registration case.
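
Roughly this shape (a sketch from memory only; the names here are
approximate, not the actual downstream code):

        /* Block until the HCA has completed the IB_WR_REG_MR,
         * and only then hand the r_key back to userland. */
        ret = rds_ib_post_reg_frmr(ibmr);
        if (ret)
                return ret;
        wait_event(frmr->fr_reg_done,
                   frmr->fr_state != FRMR_IS_INUSE);
        if (frmr->fr_state == FRMR_IS_STALE)
                return -EAGAIN;      /* registration got flushed */
        *key_ret = ibmr->rkey;       /* MR is really valid now */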

For de-registration work-requests, it is beneficial to wait
until they are truly done.
But: function "rds_ib_unreg_frmr", both prior to and after this
change, simply moves on after a failed de-registration attempt
and releases the pages owned by the memory region.

This patch does _not_ change that behavior.

If the work-request fails, then it will lead to flush errors and
the MRs will be marked as STALE. So this wait may not be necessary.


This wait is necessary to avoid the 2 scenarios described
in the commit-log message:

#1) Memory regions bouncing between "drop_list" and "clean_list"
     as items on the "clean_list" aren't really clean until
     their state transitions to "FRMR_IS_FREE".


#2) Prevent an access error as "rds_ib_post_inv" is called
     just prior to de-referencing pages via "__rds_ib_teardown_mr".
     And you certainly don't want those pages populated in the
     HCA's memory-translation-table with full access, while
     the Linux kernel 'thinks' you gave them back already
     and starts re-purposing them.
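
In other words, the wait enforces this ordering (a sketch;
"fr_inv_done" stands in for whatever signals the LOCAL_INV
completion, the function names are from the code being patched):

        rds_ib_post_inv(ibmr);          /* posts IB_WR_LOCAL_INV */
        wait_event(frmr->fr_inv_done,   /* block until the CQ handler */
                   frmr->fr_state == FRMR_IS_FREE ||
                   frmr->fr_state == FRMR_IS_STALE);
        __rds_ib_teardown_mr(ibmr);     /* only now release the pages */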

The RDS_GET_MR case is what actually shows the issue you saw,
and Avinash has the fix for that in the production kernel.

Actually, no:
Socket option RDS_GET_MR wasn't even in the code-path of the
tests I performed:

It was the RDS_CMSG_RDMA_MAP / RDS_CMSG_RDMA_DEST control
messages that ended up calling "__rds_rdma_map".

What option did you use? The default option with rds-stress is
RDS_GET_MR, hence the question.


I believe that with that change, the registration issue already
becomes a non-issue.


Please explain how that is related to this suggested fix.

I submitted this patch #3 and the others in this series in order
to fix bugs in the RDS code that currently ships with Linux.

It may very well be the case that there are other changes
that Avinash put into production kernels that would be better
suited to fix this and other problems.

But that should not eliminate the need to fix what is currently broken.

Fixing what's broken does not preclude replacing the fixed code
with newer or better versions of the same.

And as far as invalidation is concerned, with the proxy QP it no
longer races with the data-path QP.


I don't understand, please elaborate.

Maybe you can try those changes, if you haven't already, to see
if they address the couple of cases where you ended up adding
timeouts.


I don't understand, please elaborate:
a) Are you saying this issue should not be fixed?
b) Or are you suggesting to replace this fix with a different fix?
    If it's the latter, please point out what you have in mind.
c) ???

All I am saying is that the code got changed for a good reason,
and that changed code possibly makes some of these race conditions
no longer applicable. So instead of these timeout fixes, I am
suggesting to use that code as the fix. At least test with those
changes and see what the behavior is.

Regards,
Santosh
