Hi Jerin:

On 27/6/2017 10:35 AM, Jerin Jacob wrote:
-----Original Message-----
Date: Mon, 26 Jun 2017 15:46:47 +0100
From: "Hunt, David" <david.h...@intel.com>
To: Jerin Jacob <jerin.ja...@caviumnetworks.com>, Harry van Haaren
  <harry.van.haa...@intel.com>
CC: dev@dpdk.org, Gage Eads <gage.e...@intel.com>, Bruce Richardson
  <bruce.richard...@intel.com>
Subject: Re: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added
  sample app

Hi Jerin,
Hi David,

Looks like you have sent the old version. The comments mentioned below
are not addressed in v2.

Oops. Glitch in the Matrix. I've just pushed a V3 with the changes.

I'm assisting Harry on the sample app, and have just pushed up a V2 patch
based on your feedback. I've addressed most of your suggestions, comments
below. There may still be a couple of outstanding questions that need further
discussion.
A few general comments:
1) Nikhil/Gage's proposal on the ethdev rx to eventdev adapter will change the
major portion (eventdev setup and producer()) of this application.
2) Producing only one lcore's worth of packets really can't showcase an example
eventdev application, as it will be pretty bad on a low-end machine.
At the least, the application infrastructure should not be the limit.

Considering the above points, should we wait for the rx adapter to complete
first? I would like to show this as a real-world application using eventdev.

Thoughts?

On the same note:
Can we finalize the rx adapter proposal? I can work on a v1 of the patch and
the common code if Nikhil or Gage don't have the bandwidth. Let me know?

last followup:
http://dpdk.org/ml/archives/dev/2017-June/068776.html

I had a quick chat with Harry, and wonder if we'd be as well to merge the app as it is now; then, as the new frameworks become available, the app can be updated to make use of them. I feel it would be better to have something out there for people to play with than to wait for 17.11.

Also, if you have bandwidth to patch the app for your desired use cases, that would be a good contribution. I'd only be guessing for some of it :)


Regards,
Dave

On 10/5/2017 3:12 PM, Jerin Jacob wrote:
-----Original Message-----
Date: Fri, 21 Apr 2017 10:51:37 +0100
From: Harry van Haaren <harry.van.haa...@intel.com>
To: dev@dpdk.org
CC: jerin.ja...@caviumnetworks.com, Harry van Haaren
<harry.van.haa...@intel.com>, Gage Eads <gage.e...@intel.com>, Bruce
  Richardson <bruce.richard...@intel.com>
Subject: [PATCH 1/3] examples/eventdev_pipeline: added sample app

This commit adds a sample app for the eventdev library.
The app has been tested with DPDK 17.05-rc2, hence this
release (or later) is recommended.

The sample app showcases a pipeline processing use-case,
with event scheduling and processing defined per stage.
The application receives traffic as normal, with each
packet traversing the pipeline. Once the packet has
been processed by each of the pipeline stages, it is
transmitted again.

The app provides a framework to utilize cores for a single
role or multiple roles. Examples of roles are the RX core,
TX core, Scheduling core (in the case of the event/sw PMD),
and worker cores.

Various flags are available to configure the number of stages,
the cycles of work at each stage, the type of scheduling, the number
of worker cores, queue depths, etc. For a full explanation,
please refer to the documentation.

Signed-off-by: Gage Eads <gage.e...@intel.com>
Signed-off-by: Bruce Richardson <bruce.richard...@intel.com>
Signed-off-by: Harry van Haaren <harry.van.haa...@intel.com>
Thanks for the example application showing the SW view.
I could make it run on HW after some tweaking (not optimized, though).

[...]
+#define MAX_NUM_STAGES 8
+#define BATCH_SIZE 16
+#define MAX_NUM_CORE 64
How about RTE_MAX_LCORE?
Core usage in the sample app is held in a uint64_t. Adding arrays would be
possible, but I feel the extra effort would not give that much benefit.
I've left it as is for the moment, unless you see any strong requirement to go
beyond 64 cores?
I think it is OK. Again, with the service core infrastructure this will change.
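
For reference, the 64-core limit follows from the role masks being held in
plain 64-bit integers, roughly as in this sketch (core_in_mask is an
illustrative helper, not in the patch):

static uint64_t rx_lcore_mask;     /* cores doing NIC rx */
static uint64_t tx_lcore_mask;     /* cores doing NIC tx */
static uint64_t sched_lcore_mask;  /* cores running the SW scheduler */
static uint64_t worker_lcore_mask; /* cores running worker() */

static inline int
core_in_mask(uint64_t mask, unsigned int lcore_id)
{
    return lcore_id < 64 && (mask & (1ULL << lcore_id)) != 0;
}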

+
+static unsigned int active_cores;
+static unsigned int num_workers;
+static unsigned long num_packets = (1L << 25); /* do ~32M packets */
+static unsigned int num_fids = 512;
+static unsigned int num_priorities = 1;
Looks like it's not used.
Yes, Removed.

+static unsigned int num_stages = 1;
+static unsigned int worker_cq_depth = 16;
+static int queue_type = RTE_EVENT_QUEUE_CFG_ATOMIC_ONLY;
+static int16_t next_qid[MAX_NUM_STAGES+1] = {-1};
+static int16_t qid[MAX_NUM_STAGES] = {-1};
Moving all fastpath-related variables into a cache-aligned structure
will help.
I tried a few different combinations of this, and saw no gains, some losses.
So I will leave it as is for the moment, if that's OK.
I think the ones used in the fastpath are better allocated from hugepages
using rte_malloc().
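
As an untested sketch of that suggestion (hypothetical struct and names,
in the app's context):

struct fastpath_vars {
    unsigned int num_fids;
    unsigned int num_stages;
    unsigned int worker_cq_depth;
    int queue_type;
    int16_t next_qid[MAX_NUM_STAGES + 1];
    int16_t qid[MAX_NUM_STAGES];
} __rte_cache_aligned;

static struct fastpath_vars *fp;

/* at init time: allocate cache aligned from hugepage memory */
fp = rte_zmalloc("fastpath_vars", sizeof(*fp), RTE_CACHE_LINE_SIZE);
if (fp == NULL)
    rte_panic("failed to allocate fastpath_vars\n");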

+static int worker_cycles;
+static int enable_queue_priorities;
+
+struct prod_data {
+    uint8_t dev_id;
+    uint8_t port_id;
+    int32_t qid;
+    unsigned num_nic_ports;
+};
Yes, I saw a percent or two of gain when this plus the following two data
structs were cache aligned.
Looks like it's not fixed in v2. Looks like you have sent the old
version.

+
+    return 0;
+}
+
+static int
+producer(void)
+{
+    static uint8_t eth_port;
+    struct rte_mbuf *mbufs[BATCH_SIZE];
+    struct rte_event ev[BATCH_SIZE];
+    uint32_t i, num_ports = prod_data.num_nic_ports;
+    int32_t qid = prod_data.qid;
+    uint8_t dev_id = prod_data.dev_id;
+    uint8_t port_id = prod_data.port_id;
+    uint32_t prio_idx = 0;
+
+    const uint16_t nb_rx = rte_eth_rx_burst(eth_port, 0, mbufs, BATCH_SIZE);
+    if (++eth_port == num_ports)
+        eth_port = 0;
+    if (nb_rx == 0) {
+        rte_pause();
+        return 0;
+    }
+
+    for (i = 0; i < nb_rx; i++) {
+        ev[i].flow_id = mbufs[i]->hash.rss;
Prefetching mbufs[i+1] may help here?
I tried, didn't make much difference.
OK.
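
(A sketch of what such a prefetch might look like; not the exact code tried:

for (i = 0; i < nb_rx; i++) {
    /* prefetch the next mbuf header while handling this one */
    if (i + 1 < nb_rx)
        rte_prefetch0(mbufs[i + 1]);
    ev[i].flow_id = mbufs[i]->hash.rss;
    /* ... */
}
)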

+        ev[i].op = RTE_EVENT_OP_NEW;
+        ev[i].sched_type = queue_type;
The value of RTE_EVENT_QUEUE_CFG_ORDERED_ONLY != RTE_SCHED_TYPE_ORDERED,
so we cannot assign queue_type directly to .sched_type.

I think one option to avoid the translation in the application is to:
- Remove RTE_EVENT_QUEUE_CFG_ALL_TYPES and the RTE_EVENT_QUEUE_CFG_*_ONLY flags
- Introduce a new RTE_EVENT_DEV_CAP_ flag to denote the
  RTE_EVENT_QUEUE_CFG_ALL_TYPES capability
- Add sched_type to struct rte_event_queue_conf. If the capability flag is
  not set, then the implementation takes the sched_type value for the queue.

Any thoughts?

Not sure here. Would it be OK for the moment, and we can work on a patch in
the future?
OK
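
For clarity, the proposal above would look roughly like this (a sketch only,
not a merged API):

struct rte_event_queue_conf {
    uint32_t nb_atomic_flows;
    uint32_t nb_atomic_order_sequences;
    uint32_t event_queue_cfg; /* without the *_ONLY flags */
    uint8_t  schedule_type;   /* RTE_SCHED_TYPE_*, honored when the
                               * ALL_TYPES capability is not set */
    uint8_t  priority;
};

The app could then set events[i].sched_type straight from the queue's
configured schedule_type, with no translation.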

+
+    if (tx_core[lcore_id] && (tx_single ||
+        rte_atomic32_cmpset(&tx_lock, 0, 1))) {
+        consumer();
Does consumer() need to be called in this pattern? I am thinking that
if the events are from the last stage, then consumer() could be called
from worker().

I think the above scheme works better when the _same_ worker code needs
to run in both of these cases:
1) the ethdev HW is capable of enqueuing packets to the same txq from
   multiple threads
2) the ethdev is not capable of doing so.

Both cases can then be addressed at configuration time, when we link
the queues to ports:
case 1) link all workers to the last queue
case 2) link only one worker to the last queue
while keeping the worker code common.

The HW implementation has functional and performance issues if "two" ports are
assigned to one lcore for dequeue. The above scheme fixes that problem
too.
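
In configuration code the two cases would be roughly (a sketch; tx_port_id
is a placeholder name):

uint8_t last_qid = qid[num_stages - 1];
unsigned int i;

/* case 1: MT-safe txq -- link every worker port to the last queue */
for (i = 0; i < num_workers; i++)
    rte_event_port_link(dev_id, worker_data[i].port_id,
                        &last_qid, NULL, 1);

/* case 2: txq not MT-safe -- link only one dedicated port */
rte_event_port_link(dev_id, tx_port_id, &last_qid, NULL, 1);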


Can we have a bit more discussion on this item? Is this needed for this
sample app, or can we perhaps work on a patch for this later? Harry?
As explained above, is there any issue with keeping consumer() for the last
stage?

I would probably see this as a future enhancement as per my initial comments above. Any hardware or new framework additions are welcome as future patches to the app.


+        rte_atomic32_clear((rte_atomic32_t *)&tx_lock);
+    }
+}
+
+static int
+worker(void *arg)
+{
+    struct rte_event events[BATCH_SIZE];
+
+    struct worker_data *data = (struct worker_data *)arg;
+    uint8_t dev_id = data->dev_id;
+    uint8_t port_id = data->port_id;
+    size_t sent = 0, received = 0;
+    unsigned lcore_id = rte_lcore_id();
+
+    while (!done) {
+        uint16_t i;
+
+        schedule_devices(dev_id, lcore_id);
+
+        if (!worker_core[lcore_id]) {
+            rte_pause();
+            continue;
+        }
+
+        uint16_t nb_rx = rte_event_dequeue_burst(dev_id, port_id,
+                events, RTE_DIM(events), 0);
+
+        if (nb_rx == 0) {
+            rte_pause();
+            continue;
+        }
+        received += nb_rx;
+
+        for (i = 0; i < nb_rx; i++) {
+            struct ether_hdr *eth;
+            struct ether_addr addr;
+            struct rte_mbuf *m = events[i].mbuf;
+
+            /* The first worker stage does classification */
+            if (events[i].queue_id == qid[0])
+                events[i].flow_id = m->hash.rss % num_fids;
Not sure why we need to do this (shrinking the flows) in worker() in a
queue-based pipeline. If a PMD has any specific requirement on num_fids,
I think we can move this to the configuration stage, or the PMD can choose
an optimal fid internally to avoid the modulus operation tax in the
fastpath in all PMDs.

Does struct rte_event_queue_conf.nb_atomic_flows help here?
In my tests the modulus makes very little difference in the throughput. And
I think it's good to have a way of varying the number of flows for testing
different scenarios, even if it's not the most performant.
Not sure.
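
For reference, using nb_atomic_flows at configuration time would look
something like this sketch:

struct rte_event_queue_conf qconf = {
    .nb_atomic_flows = num_fids, /* hint the flow count to the PMD */
    .nb_atomic_order_sequences = 0,
    .event_queue_cfg = RTE_EVENT_QUEUE_CFG_ATOMIC_ONLY,
    .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
};
if (rte_event_queue_setup(dev_id, qid[0], &qconf) < 0)
    rte_panic("queue setup failed\n");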

+
+            events[i].queue_id = next_qid[events[i].queue_id];
+            events[i].op = RTE_EVENT_OP_FORWARD;
events[i].sched_type is missing; the HW PMD does not work without it.
I think we can use a scheme similar to next_qid for a next_sched_type.
Done. Added events[i].sched_type = queue_type.
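
The next_sched_type scheme would be along these lines (sketch):

/* per-queue sched type, filled at setup time like next_qid[] */
static uint8_t next_sched_type[MAX_NUM_STAGES + 1];

/* in the forward path: */
events[i].queue_id = next_qid[events[i].queue_id];
events[i].sched_type = next_sched_type[events[i].queue_id];
events[i].op = RTE_EVENT_OP_FORWARD;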

+
+            /* change mac addresses on packet (to use mbuf data) */
+            eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
+            ether_addr_copy(&eth->d_addr, &addr);
+            ether_addr_copy(&eth->s_addr, &eth->d_addr);
+            ether_addr_copy(&addr, &eth->s_addr);
IMO, we can make the packet processing code a static inline function so
that different worker types can reuse it.
Done. moved out to a work() function.
I think the mac swap should be done in the last stage, not on each forward.
i.e. with the existing code, forwarding through 2 stages puts the addresses
back in their original order.

+
+            /* do a number of cycles of work per packet */
+            volatile uint64_t start_tsc = rte_rdtsc();
+            while (rte_rdtsc() < start_tsc + worker_cycles)
+                rte_pause();
Ditto.
Done. moved out to a work() function.

I think all worker-specific variables like "worker_cycles" can be moved into
one structure and used from there.
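
Putting the pieces together, the work() split might look roughly like this
(a sketch in the app's context; the mac swap placement per the comment
above is still open):

static inline void
work(struct rte_mbuf *m)
{
    struct ether_hdr *eth;
    struct ether_addr addr;

    /* change mac addresses on the packet (touches the mbuf data) */
    eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
    ether_addr_copy(&eth->d_addr, &addr);
    ether_addr_copy(&eth->s_addr, &eth->d_addr);
    ether_addr_copy(&addr, &eth->s_addr);

    /* burn the configured number of cycles per packet */
    if (worker_cycles > 0) {
        uint64_t start_tsc = rte_rdtsc();
        while (rte_rdtsc() < start_tsc + worker_cycles)
            rte_pause();
    }
}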

+        }
+        uint16_t nb_tx = rte_event_enqueue_burst(dev_id, port_id,
+                events, nb_rx);
+        while (nb_tx < nb_rx && !done)
+            nb_tx += rte_event_enqueue_burst(dev_id, port_id,
+                            events + nb_tx,
+                            nb_rx - nb_tx);
+        sent += nb_tx;
+    }
+
+    if (!quiet)
+        printf("  worker %u thread done. RX=%zu TX=%zu\n",
+                rte_lcore_id(), received, sent);
+
+    return 0;
+}
+
+/*
+ * Parse the coremask given as argument (hexadecimal string) and fill
+ * the global configuration (core role and core count) with the parsed
+ * value.
+ */
+static int xdigit2val(unsigned char c)
There are multiple instances of "xdigit2val" in the DPDK repo. Maybe we can
push this as common code.
Sure, that's something we can look at in a separate patch, now that it's
being used more and more.
Makes sense.

+{
+    int val;
+
+    if (isdigit(c))
+        val = c - '0';
+    else if (isupper(c))
+        val = c - 'A' + 10;
+    else
+        val = c - 'a' + 10;
+    return val;
+}
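
For context, xdigit2val() feeds the app's coremask parser, which is roughly
of this shape (simplified sketch, not the exact patch code):

static uint64_t
parse_coremask(const char *coremask)
{
    uint64_t mask = 0;
    int i, len;

    if (coremask[0] == '0' &&
        (coremask[1] == 'x' || coremask[1] == 'X'))
        coremask += 2; /* skip the 0x prefix */
    len = strlen(coremask);
    /* each hex digit covers four cores, most significant first */
    for (i = 0; i < len; i++) {
        if (!isxdigit((unsigned char)coremask[i]))
            return 0;
        mask = (mask << 4) | xdigit2val(coremask[i]);
    }
    return mask;
}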
+
+
+static void
+usage(void)
+{
+    const char *usage_str =
+        "  Usage: eventdev_demo [options]\n"
+        "  Options:\n"
+        "  -n, --packets=N              Send N packets (default ~32M), 0
implies no limit\n"
+        "  -f, --atomic-flows=N         Use N random flows from 1 to N
(default 16)\n"
I think this parameter now affects the application fastpath code. I think
it should be an eventdev configuration parameter.
See above comment on num_fids

+        "  -s, --num_stages=N           Use N atomic stages (default
1)\n"
+        "  -r, --rx-mask=core mask      Run NIC rx on CPUs in core
mask\n"
+        "  -w, --worker-mask=core mask  Run worker on CPUs in core
mask\n"
+        "  -t, --tx-mask=core mask      Run NIC tx on CPUs in core
mask\n"
+        "  -e  --sched-mask=core mask   Run scheduler on CPUs in core
mask\n"
+        "  -c  --cq-depth=N             Worker CQ depth (default 16)\n"
+        "  -W  --work-cycles=N          Worker cycles (default 0)\n"
+        "  -P  --queue-priority         Enable scheduler queue
prioritization\n"
+        "  -o, --ordered                Use ordered scheduling\n"
+        "  -p, --parallel               Use parallel scheduling\n"
IMO, all stages being "parallel" or "ordered" or "atomic" is one mode of
operation. It is valid to have any combination. We need to be able to
express that on the command line, for example:
3 stages with
O->A->P
How about we add an option that specifies the mode of operation for each
stage in a string? Maybe have a '-m' option (modes) e.g. '-m appo' for 4
stages with atomic, parallel, parallel, ordered. Or maybe reuse your
test-eventdev parameter style?
Any scheme is fine.
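
Parsing such a mode string could be as simple as this (hypothetical sketch
for the '-m' idea; parse_stage_modes is a placeholder name):

/* e.g. "-m oap" = ordered -> atomic -> parallel */
static int
parse_stage_modes(const char *modes, uint8_t *sched_types)
{
    unsigned int i, n = strlen(modes);

    if (n == 0 || n > MAX_NUM_STAGES)
        return -1;
    for (i = 0; i < n; i++) {
        switch (modes[i]) {
        case 'a': sched_types[i] = RTE_SCHED_TYPE_ATOMIC;   break;
        case 'o': sched_types[i] = RTE_SCHED_TYPE_ORDERED;  break;
        case 'p': sched_types[i] = RTE_SCHED_TYPE_PARALLEL; break;
        default:  return -1;
        }
    }
    return n; /* number of stages */
}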

+        "  -q, --quiet                  Minimize printed output\n"
+        "  -D, --dump                   Print detailed statistics before
exit"
+        "\n";
+    fprintf(stderr, "%s", usage_str);
+    exit(1);
+}
+
[...]

+            rx_single = (popcnt == 1);
+            break;
+        case 't':
+            tx_lcore_mask = parse_coremask(optarg);
+            popcnt = __builtin_popcountll(tx_lcore_mask);
+            tx_single = (popcnt == 1);
+            break;
+        case 'e':
+            sched_lcore_mask = parse_coremask(optarg);
+            popcnt = __builtin_popcountll(sched_lcore_mask);
+            sched_single = (popcnt == 1);
+            break;
+        default:
+            usage();
+        }
+    }
+
+    if (worker_lcore_mask == 0 || rx_lcore_mask == 0 ||
+        sched_lcore_mask == 0 || tx_lcore_mask == 0) {
+
+    /* Q creation - one load balanced queue per pipeline stage */
+
+    /* set up one port per worker, linking to all stage queues */
+    for (i = 0; i < num_workers; i++) {
+        struct worker_data *w = &worker_data[i];
+        w->dev_id = dev_id;
+        if (rte_event_port_setup(dev_id, i, &wkr_p_conf) < 0) {
+            printf("Error setting up port %d\n", i);
+            return -1;
+        }
+
+        uint32_t s;
+        for (s = 0; s < num_stages; s++) {
+            if (rte_event_port_link(dev_id, i,
+                        &worker_queues[s].queue_id,
+                        &worker_queues[s].priority,
+                        1) != 1) {
+                printf("%d: error creating link for port %d\n",
+                        __LINE__, i);
+                return -1;
+            }
+        }
+        w->port_id = i;
+    }
+    /* port for consumer, linked to TX queue */
+    if (rte_event_port_setup(dev_id, i, &tx_p_conf) < 0) {
If the ethdev supports MT txq support, then this port can be linked to the
workers too. Something to consider for the future.

Sure. No change for now.
OK

Just to add a note covering any remaining comments above: we would hope that none of them are blockers for the merge of the current version, as they can be patched in the future as the infrastructure changes.

Rgds,
Dave.

