On 2018-11-27 23:33, Venky Venkatesh wrote:
> As you can see, the DSW overhead dominates the scene and very little
> real work is getting done. Is there some configuration or tuning to be
> done to get the sort of performance you are seeing with multiple cores?
I can't explain the behavior you are seeing based on the information you
have supplied.
Attached is a small DSW throughput test program that I thought might
help you find the issue. It works much like the pipeline simulator I
used when developing the scheduler, but it's a lot simpler. Remember to
supply "--vdev=event_dsw0".
I ran it on my 12-core Skylake desktop (@2.9 GHz, turbo disabled). With
zero work and one stage, I get ~640 Mevent/s. For the first few stages
you add, you'll see a drop in performance. For example, with 3 stages,
you are at ~310 Mevent/s.
If you increase DSW_MAX_PORT_OUT_BUFFER and DSW_MAX_PORT_OPS_PER_BG_TASK,
you will see efficiency improvements on high-core-count machines. On my
system, the figures above rise to 675 Mevent/s for a 1-stage pipeline
and 460 Mevent/s for a 3-stage pipeline if I apply the following changes
to dsw_evdev.h:
-#define DSW_MAX_PORT_OUT_BUFFER (32)
+#define DSW_MAX_PORT_OUT_BUFFER (64)
-#define DSW_MAX_PORT_OPS_PER_BG_TASK (128)
+#define DSW_MAX_PORT_OPS_PER_BG_TASK (512)
With 500 clock cycles of dummy work, the per-event overhead is ~16 TSC
clock cycles per stage and event (i.e. per scheduled event: one enqueue
plus one dequeue), if my quick-and-dirty benchmark program does the math
correctly. This also includes the overhead from the benchmark program
itself.
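
In rough terms, the calculation is along these lines (a sketch; the
names are illustrative and not taken from the actual program):

    #include <stdint.h>

    /* Hypothetical reconstruction of the overhead arithmetic: total
     * TSC cycles spent, divided by the number of scheduled events
     * (events * stages), minus the dummy work done at each stage. */
    static uint64_t
    overhead_tsc_per_sched_event(uint64_t total_tsc, uint64_t num_events,
                                 unsigned int num_stages,
                                 uint64_t work_tsc_per_stage)
    {
            uint64_t num_sched_events = num_events * num_stages;

            return total_tsc / num_sched_events - work_tsc_per_stage;
    }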
Overhead with a real application will be higher.