Mattias,

Thanks for the prompt response. I appreciate that you are not able to share the proprietary code. More answers inline, marked [VV].

--Venky

On 11/14/18, 11:41 AM, "Mattias Rönnblom" <hof...@lysator.liu.se> wrote:

    On 2018-11-14 20:16, Venky Venkatesh wrote:
    > Hi,
    >
    > https://urldefense.proofpoint.com/v2/url?u=https-3A__mails.dpdk.org_archives_dev_2018-2DSeptember_111344.html&d=DwIDaQ&c=V9IgWpI5PvzTw83UyHGVSoW3Uc1MFWe5J8PTfkrzVSo&r=w2W5SR0mU5u5mz008DZNCsexDN1Lr9bpL7ZGKuD0Zd4&m=H4I6cuKi4kKoypKWz8mjDoXLGgkSNurKbKXrq4qJs5A&s=AD0KG106hPreSKeTQMRzDPwnEfBR9oD6dtjpL2Plt4c&e=
    > mentions that there is a sample application where “worker cores can
    > sustain 300-400 million event/s. With a pipeline
    > with 1000 clock cycles of work per stage, the average event device
    > overhead is somewhere 50-150 clock cycles/event”. Is this sample
    > application code available?
    >

    It's proprietary code, although it's also been tested by some of our
    partners. The primary reason it has not been contributed to DPDK is
    that it's a fair amount of work to do so.

    I would refer to it as an eventdev pipeline simulator, rather than a
    sample app.

    > We have written a similar simple sample application where 1 core
    > keeps enqueuing (as NEW/ATOMIC) and n cores dequeue (and RELEASE)
    > and do no other work. But we are not seeing anything close in terms
    > of performance. We are also seeing some counterintuitive behaviors,
    > such as a burst of 32 being worse than a burst of 1. We surely have
    > something wrong and would thus like to compare against a good
    > application that you have written. Could you please share it?
    >

    Is this enqueue or dequeue burst? How large is n? Is this explicit
    release?

[VV]: Yes, both are bursts of 32. I tried n=4-7. It is explicit RELEASE.

    What do you set nb_events_limit to? Good DSW performance depends
    heavily on the average burst size on the event rings, which in turn
    depends on the number of in-flight events. On really high core-count
    systems you might also want to increase DSW_MAX_PORT_OPS_PER_BG_TASK,
    since it effectively puts a limit on the maximum number of events
    buffered on the output buffers.

[VV]:

    struct rte_event_dev_config config = {
            .nb_event_queues = 2,
            .nb_event_ports = 5,
            .nb_events_limit = 4096,
            .nb_event_queue_flows = 1024,
            .nb_event_port_dequeue_depth = 128,
            .nb_event_port_enqueue_depth = 128,
    };

    struct rte_event_port_conf p_conf = {
            .dequeue_depth = 64,
            .enqueue_depth = 64,
            .new_event_threshold = 1024,
            .disable_implicit_release = 0,
    };

    struct rte_event_queue_conf q_conf = {
            .schedule_type = RTE_SCHED_TYPE_ATOMIC,
            .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
            .nb_atomic_flows = 1024,
            .nb_atomic_order_sequences = 1024,
    };

    In the pipeline simulator all cores produce events initially, and
    then recycle events when the number of in-flight events reaches a
    certain threshold (50% of nb_events_limit). A single lcore won't be
    able to fill the pipeline if you have zero-work stages.

[VV]: I have a single NEW event enqueue thread (0) and a bunch of "dequeue and RELEASE" threads (1-4). This is the simple case. I have a stats print thread (5) as well. If the 1 enqueue thread is unable to fill the pipeline, what counter would indicate that? I see the contrary effect: I am tracking the number of times enqueue fails, and that number is large.

    Even though I can't send you the simulator code at this point, I'm
    happy to assist you in any DSW-related endeavors.

[VV]: My program is simple enough (nothing proprietary) that I can share it. Can I unicast it to you for a quick recommendation?
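
On the question of which counter might reveal an underfilled pipeline, here is a minimal sketch in plain C of application-side bookkeeping. It is not the simulator's code and not part of DSW or the eventdev API: struct port_stats, stats_enqueue(), stats_dequeue(), stats_avg_dequeue_burst() and should_produce_new() are all hypothetical names. The assumption is that a low average dequeue burst on the worker ports hints at too few in-flight events, and that the produce-then-recycle policy described above can be approximated by comparing an application-tracked in-flight count against half of nb_events_limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-port counters kept by the application itself.
 * These are NOT provided by the DPDK eventdev API or by DSW. */
struct port_stats {
	uint64_t enq_attempted; /* events offered to the enqueue call */
	uint64_t enq_accepted;  /* events the device actually accepted */
	uint64_t deq_calls;     /* dequeue calls, including empty ones */
	uint64_t deq_events;    /* events returned by those calls */
};

/* Record one enqueue burst: `want` events offered, `got` accepted.
 * Feed it the burst size and the return value of the enqueue call. */
static inline void
stats_enqueue(struct port_stats *s, uint16_t want, uint16_t got)
{
	assert(got <= want);
	s->enq_attempted += want;
	s->enq_accepted += got;
}

/* Record one dequeue call that returned `got` events (may be 0). */
static inline void
stats_dequeue(struct port_stats *s, uint16_t got)
{
	s->deq_calls++;
	s->deq_events += got;
}

/* Average number of events per dequeue call. A value far below the
 * requested burst size suggests the rings are running close to empty. */
static inline double
stats_avg_dequeue_burst(const struct port_stats *s)
{
	if (s->deq_calls == 0)
		return 0.0;
	return (double)s->deq_events / (double)s->deq_calls;
}

/* Produce NEW events only while the application-tracked in-flight
 * count is below 50% of nb_events_limit; otherwise recycle. */
static inline bool
should_produce_new(uint64_t in_flight, uint32_t nb_events_limit)
{
	assert(nb_events_limit > 0);
	return in_flight < nb_events_limit / 2;
}
```

With nb_events_limit = 4096 as in the configuration above, should_produce_new() stops asking for NEW events once 2048 are in flight. Comparing enq_accepted against enq_attempted on the producer port, alongside the average dequeue burst on the workers, may help tell "enqueue is being back-pressured" apart from "the pipeline is thin".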