Hi Param, Those fields are no longer existing in the newer version of the ena_com (and newer DPDK), so please upgrade if you are encountering any issues. But I don't think that TSO is supported on any of the ENA devices on the AWS.
Thanks, Michal niedz., 8 gru 2019 o 20:03 kumaraparameshwaran rathinavel < kumaraparames...@gmail.com> napisał(a): > Thanks a lot Michal. Will follow approach that you have suggested. > > Also I see that in case if TSO is enabled we set, > 336 /* this param needed only for TSO */ > 337 ena_meta->l3_outer_hdr_len = 0; > 338 ena_meta->l3_outer_hdr_offset = 0; > > > So even if TSO is enabled should these values be zero. > > Thanks, > Param. > > On Wed, Dec 4, 2019 at 7:24 PM Michał Krawczyk <m...@semihalf.com> wrote: > >> Hi Param, >> >> Adding atomic operations to setting/clearing comp ctxt won't help, as >> there is no race there. The admin queue is designed this way, that >> only single completion context can be held, so you should serialize >> access to the rte_eth_stats_get(). >> If you won't do that, the 2nd thread will try to hold already occupied >> context and this will result in disabling admin queue by the ena >> communication layer - you won't be able to send further admin >> commands. >> That's intended behavior and it is caused because you are trying to >> get the context with the occupied flag being set to true. Adding >> atomic operations there won't change anything, as there will still be >> a race between the thread that is waiting for the completion (occupied >> flag already send to true) and another thread, that is trying to send >> the same command using the same context (can't set occupied to true, >> as it's already true) - that should never happen. >> >> Without totally reworking ena_com admin queue design, we could add >> lock in ena_stats_get() - but that'll cause unnecessary locking in all >> of the applications that are using it from the main lcore context and >> as your design seems to be unique by doing it from multiple threads, >> maybe you could add a lock to your calls to the rte_eth_stats_get()? >> >> Another solution might be using xstats API, which should let you to >> get statistics from multiple threads as it's not using admin queue for >> that - all stats are being counter internally in the PMD. >> >> Thanks, >> Michal >> >> >> pt., 29 lis 2019 o 13:01 kumaraparameshwaran rathinavel >> <kumaraparames...@gmail.com> napisał(a): >> > >> > Hi Michał, >> > >> > Thanks for getting back on this. >> > >> > In our design we are using multiple cores requesting for >> rte_eth_stats_get, it is not from one process and hence not serialized. >> Since in our design this is not serialized, and hence in get_comp_ctxt() >> checking for occupied flag and comp_ctxt_release() are not done atomically >> which is causing this issue. Please let me know if my understanding is >> correct, so that I will fix the application in such a way that it is done >> from one process and not multiple. >> > >> > Thanks, >> > Param. >> > >> > On Thu, Nov 28, 2019 at 6:44 PM Michał Krawczyk <m...@semihalf.com> >> wrote: >> >> >> >> Hi Param, >> >> >> >> first of all - you are using very old ena_com. This code comes from >> >> the DPDK version before v18.08. If you have any doubts, please check >> >> the newer version of the driver and DPDK as the potential bug could be >> >> already fixed there. >> >> >> >> Anyway, if you will look at the function get_comp_ctxt() which is >> >> called by __ena_com_submit_admin_cmd() to get the completion context, >> >> there is a check for the context if it's not occupied - in case it is >> >> (which will be true until comp_ctxt_release() will clear it), the new >> >> command using the same context cannot be used. So there shouldn't be >> >> two consumers using the same completion contexts. >> >> >> >> In addition, drivers that are using ena_com are sending admin commands >> >> one at a time during the init, so there shouldn't be even 2 commands >> >> at a time. The only exception is ena_com_get_dev_basic_stats(), which >> >> is called from rte_eth_stats_get() context - but if you consider DPDK >> >> application, it should use it on the management lcore after init, so >> >> it'll also be serialized. >> >> >> >> Thanks, >> >> Michal >> >> >> >> >> >> >> >> pt., 8 lis 2019 o 07:02 kumaraparameshwaran rathinavel >> >> <kumaraparames...@gmail.com> napisał(a): >> >> > >> >> > Hi Michał, >> >> > >> >> > Please look at the below function, >> >> > >> >> > static int >> >> > ena_com_wait_and_process_admin_cq_polling( >> >> > struct ena_comp_ctx *comp_ctx, >> >> > struct ena_com_admin_queue *admin_queue) >> >> > { >> >> > unsigned long flags = 0; >> >> > u64 start_time; >> >> > int ret; >> >> > >> >> > start_time = ENA_GET_SYSTEM_USECS(); >> >> > >> >> > while (comp_ctx->status == ENA_CMD_SUBMITTED) { >> >> > if ((ENA_GET_SYSTEM_USECS() - start_time) > >> >> > ADMIN_CMD_TIMEOUT_US) { >> >> > ena_trc_err("Wait for completion (polling) timeout\n"); >> >> > /* ENA didn't have any completion */ >> >> > ENA_SPINLOCK_LOCK(admin_queue->q_lock, flags); >> >> > admin_queue->stats.no_completion++; >> >> > admin_queue->running_state = false; >> >> > ENA_SPINLOCK_UNLOCK(admin_queue->q_lock, flags); >> >> > >> >> > ret = ENA_COM_TIMER_EXPIRED; >> >> > goto err; >> >> > } >> >> > >> >> > ENA_SPINLOCK_LOCK(admin_queue->q_lock, flags); >> >> > ena_com_handle_admin_completion(admin_queue); >> >> > ENA_SPINLOCK_UNLOCK(admin_queue->q_lock, flags); >> >> > } >> >> > >> >> > if (unlikely(comp_ctx->status == ENA_CMD_ABORTED)) { >> >> > ena_trc_err("Command was aborted\n"); >> >> > ENA_SPINLOCK_LOCK(admin_queue->q_lock, flags); >> >> > admin_queue->stats.aborted_cmd++; >> >> > ENA_SPINLOCK_UNLOCK(admin_queue->q_lock, flags); >> >> > ret = ENA_COM_NO_DEVICE; >> >> > goto err; >> >> > } >> >> > >> >> > ENA_ASSERT(comp_ctx->status == ENA_CMD_COMPLETED, >> >> > "Invalid comp status %d\n", comp_ctx->status); >> >> > >> >> > ret = ena_com_comp_status_to_errno(comp_ctx->comp_status); >> >> > err: >> >> > comp_ctxt_release(admin_queue, comp_ctx); >> >> > return ret; >> >> > } >> >> > >> >> > This is a case where there are two threads executing admin commands. >> >> > >> >> > The occupied flag is set to false in the function >> comp_ctxt_release. Let us say there are two consumers of completion >> context and C1 has a completion context and the same completion context can >> be used by another consumer C2 even before the C1 is resetting the occupied >> flag. >> >> > >> >> > This is because the ena_com_handle_admin_completion is done under >> spin lock and comp_ctxt_release is not under this spin lock. >> >> > >> >> > Thanks, >> >> > Param >> >> > >> >> > On Thu, Oct 24, 2019 at 2:09 PM Michał Krawczyk <m...@semihalf.com> >> wrote: >> >> >> >> >> >> sob., 19 paź 2019 o 20:26 kumaraparameshwaran rathinavel >> >> >> <kumaraparames...@gmail.com> napisał(a): >> >> >> > >> >> >> > Hi All, >> >> >> > >> >> >> > In the ENA poll mode driver I see that every request in the admin >> queue is >> >> >> > associated with a completion context and this is preallocated >> during the >> >> >> > device initialisation. When the completion context is used we >> check for >> >> >> > occupied to be true in the 16.X version if the occupied flag is >> set to true >> >> >> > we assert and in the latest version I see that this is an error >> log. But >> >> >> > there is a time window where if the completion context would be >> available >> >> >> > to the other consumer but still the old consumer did not set the >> occupied >> >> >> > to false. The new consumer holds the admin queue lock to get the >> completion >> >> >> > context but the update by the old consumer to set the the >> occupied flag is >> >> >> > not done under lock. So should we make sure that the new consumer >> should >> >> >> > get the completion context only when the occupied flag is set to >> false. Any >> >> >> > thoughts on this? >> >> >> >> >> >> Hi Param, >> >> >> >> >> >> Both the producer and the consumer are holding the spinlock while >> >> >> getting the completion context. If you see any situation where it >> >> >> isn't (besides the release function), please let me know. >> >> >> As it is protected by the lock, returning error while completion >> >> >> context is occupied (and it shouldn't) it fine, as it will stop the >> >> >> admin queue and allow the DPDK user application to execute the reset >> >> >> of the device. >> >> >> >> >> >> Thanks, >> >> >> Michal >> >> >> >> >> >> > If required I can try to make a patch where the completion >> context would be >> >> >> > available only after setting the occupied flag to false. >> >> >> > >> >> >> > Thanks, >> >> >> > Param. >> >