On 7/1/19 1:50 PM, Gerd Rausch wrote:
Hi David,
On 01/07/2019 11.27, David Miller wrote:
From: Gerd Rausch <gerd.rau...@oracle.com>
Date: Mon, 1 Jul 2019 09:39:44 -0700
+ /* Memory regions make it onto the "clean_list" via
+ * "rds_ib_flush_mr_pool", after the memory region has
+ * been posted for invalidation via "rds_ib_post_inv".
+ *
+ * At that point in time, "fr_state" may still be
+ * in state "FRMR_IS_INUSE", since the only place where
+ * "fr_state" transitions to "FRMR_IS_FREE" is in
+ * is in "rds_ib_mr_cqe_handler", which is
+ * triggered by a tasklet.
+ *
+ * So in case we notice that
+ * "fr_state != FRMR_IS_FREE" (see below), * we wait for
+ * "fr_inv_done" to trigger with a maximum of 10msec.
+ * Then we check again, and only put the memory region
+ * onto the drop_list (via "rds_ib_free_frmr")
+ * in case the situation remains unchanged.
+ *
+ * This avoids the problem of memory-regions bouncing
+ * between "clean_list" and "drop_list" before they
+ * even have a chance to be properly invalidated.
+ */
+ frmr = &ibmr->u.frmr;
+ wait_event_timeout(frmr->fr_inv_done,
+ frmr->fr_state == FRMR_IS_FREE,
+ msecs_to_jiffies(10));
+ if (frmr->fr_state == FRMR_IS_FREE)
+ break;
If we see FRMR_IS_FREE after the timeout, what cleans this up?
In that case, the memory-region is subjected to the
"rds_ib_free_frmr(ibmr, true)" call that follows:
In essence making it onto the "drop_list".
It's the same as if it wouldn't transition to FRMR_IS_FREE at all.
In both cases, the memory region should get dropped, and the application
would have been penalized by an extra 10msec wait-time (for having tried to
invalidate it).
Also, why 10msec?
It's empirical.
I had added some debugging code (not part of this submission) that traced
the return value of "wait_event_timeout" in order to see the out-lier in terms
of processing the "IB_WR_LOCAL_INV" request.
On my test-systems the majority of requests were done in less than 1msec.
I saw an outlier at almost 2msec once.
So I gave it an extra order-of-magnitude multiplier for extra buffer / paranoia.
Why that specific value and not some other value?
I looked around to find what Mellanox or any other reference material had
so say about the expected turn--around time of an "IB_WR_LOCAL_INV" ought to be.
I wasn't able to find any.
LOCAL_INV/REG etc are all end being HCA commands and the command
timeouts are large. 60 seconds is what CX3 HCA has for example.
Thats the worst case timeout from HCA before marking the command
to be timeout and hence the operation to be failed.
Regards,
Santosh