I am dealing with a performance issue in a simple
loopback test on a T2000 SPARC platform.

A user-space application (app1) sends data to the STREAMS
driver, which sends the data to our board via DMA.
The data is then returned from the board to a second user-space application (app2).

++++++++        ++++++++
+ app1 +        + app2 +
++++++++        ++++++++
  |                 ^
  v                 |
++++++++++++++++++++++++
+                      +
+         driver    B  +
++++++++++++++++++++++++
   |             ^
   V             |
++++++++++++++++++++++++
+   E   board          +
++++++++++++++++++++++++

In this simple configuration,
our experiments show that
from app1 to app2, we can send/receive 28,000 messages per second.
( Each message is about 800 bytes. )

This is not good enough for our purposes. We need to achieve 36,000.

Further testing and experimentation reveal that
if we send the messages from app1 to point 'E' in the above
diagram, we can achieve the 36,000 rate.

Additional testing shows that going from app1, through the board,
to point 'B' in the diagram ( inside the driver upon
DMA completion from the board ), we can make the 36,000 mark.

So, the bottleneck is from point 'B' to app2.

But this is STREAMS, part of Solaris.

Using lockstat, we can see many locks being taken
by the STREAMS framework.
The D_MTOUTEPERIM | D_MTOCEXCL flags have been
removed from the driver.
We want maximum concurrency. As far as I know, this implies
that we want no outer perimeter and no
inner perimeter in the driver.
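For context, this is roughly the shape of the declaration now -- a minimal
sketch of a STREAMS driver that sets only D_MP (no outer and no inner
perimeter). All the xxb_* names and module_info numbers are invented for
illustration, not our actual driver code:

#include <sys/stream.h>
#include <sys/conf.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static int xxb_open(queue_t *, dev_t *, int, int, cred_t *);
static int xxb_close(queue_t *, int, cred_t *);
static int xxb_rput(queue_t *, mblk_t *);
static int xxb_rsrv(queue_t *);
static int xxb_wput(queue_t *, mblk_t *);
static int xxb_wsrv(queue_t *);

static struct module_info xxb_minfo = {
        0x4242, "xxb", 0, INFPSZ, 65536, 1024   /* id, name, min/max psz, hi/lo water */
};

static struct qinit xxb_rinit = {
        xxb_rput, xxb_rsrv, xxb_open, xxb_close, NULL, &xxb_minfo, NULL
};

static struct qinit xxb_winit = {
        xxb_wput, xxb_wsrv, NULL, NULL, NULL, &xxb_minfo, NULL
};

static struct streamtab xxb_strtab = { &xxb_rinit, &xxb_winit, NULL, NULL };

static struct cb_ops xxb_cb_ops = {
        nulldev,                /* cb_open */
        nulldev,                /* cb_close */
        nodev,                  /* cb_strategy */
        nodev,                  /* cb_print */
        nodev,                  /* cb_dump */
        nodev,                  /* cb_read */
        nodev,                  /* cb_write */
        nodev,                  /* cb_ioctl */
        nodev,                  /* cb_devmap */
        nodev,                  /* cb_mmap */
        nodev,                  /* cb_segmap */
        nochpoll,               /* cb_chpoll */
        ddi_prop_op,            /* cb_prop_op */
        &xxb_strtab,            /* cb_str */
        D_MP,                   /* cb_flag: no outer or inner perimeter */
        CB_REV,                 /* cb_rev */
        nodev,                  /* cb_aread */
        nodev                   /* cb_awrite */
};

With D_MP alone, every put/srv/open/close entry point may run concurrently,
so the driver has to do all of its own locking.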

Interestingly, on a T2000, the same rates are obtained
whether 2 CPUs or 16 CPUs are used.

Does anybody have any tips/tricks/techniques
that can be used to remove the bottleneck?

+++++++
more details...
From the board to the host driver, an interrupt is
generated when a DMA of a batch of messages completes.
A batch is usually 250 messages in this test.
The interrupt routine schedules the read service routine of the
STREAMS driver ( via qenable() ).
The service routine pulls the messages from the DMA chain
onto the proper queue for sending them up the stream ( via putq() ).
In this simple case, the mapping is one-to-one. In the general
case, the data could go on one of 124 possible queues.
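A minimal sketch of that interrupt-to-service-routine hand-off follows.
The xxb_* names, the softc layout, and the hardware helpers are all
hypothetical; only qenable() and putq() are the actual STREAMS/DDI calls
in question:

#include <sys/types.h>
#include <sys/stream.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

typedef struct xxb_softc {
        queue_t *xxb_rq;        /* read queue enabled on DMA completion */
        /* ... DMA completion-chain state would live here ... */
} xxb_softc_t;

/* Hypothetical hardware helpers, not real DDI interfaces. */
extern int      xxb_batch_dma_done(xxb_softc_t *);
extern mblk_t   *xxb_dma_next_msg(xxb_softc_t *);
extern queue_t  *xxb_select_rq(xxb_softc_t *, mblk_t *);

/*
 * Interrupt handler: do as little as possible at interrupt level and
 * defer the real work to the read service routine via qenable().
 */
static uint_t
xxb_intr(caddr_t arg)
{
        xxb_softc_t *sc = (xxb_softc_t *)arg;

        if (!xxb_batch_dma_done(sc))    /* check/ack "batch complete" */
                return (DDI_INTR_UNCLAIMED);

        qenable(sc->xxb_rq);            /* schedule the read service routine */
        return (DDI_INTR_CLAIMED);
}

/*
 * Called from the read service routine: walk the completed DMA chain and
 * put each message on the read queue of the stream it belongs to (one of
 * up to 124 in the general case).  putq() schedules that queue's service
 * procedure, which pushes the data upstream.
 */
static void
xxb_drain_dma_chain(xxb_softc_t *sc)
{
        mblk_t *mp;

        while ((mp = xxb_dma_next_msg(sc)) != NULL)
                (void) putq(xxb_select_rq(sc, mp), mp);
}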

Those messages go up the stream via the
getq()/putnext() sequence, observing flow control when appropriate.
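That is, the read-side service routine follows the standard
getq()/canputnext()/putnext() loop, with putbq() when flow-controlled --
generic STREAMS idiom rather than our actual driver code:

#include <sys/stream.h>

static int
xxb_rsrv(queue_t *q)
{
        mblk_t *mp;

        while ((mp = getq(q)) != NULL) {
                /*
                 * Honor flow control: if the stream above is full, put
                 * the message back and stop.  The queue is back-enabled
                 * (and this routine re-run) when the upstream queue drains.
                 */
                if (!canputnext(q)) {
                        (void) putbq(q, mp);
                        break;
                }
                putnext(q, mp);
        }
        return (0);
}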

When we hacked the driver to ignore flow control,
we still could not do better than 28,000.

wr

-----Original Message-----
From: Peter Memishian [mailto:[EMAIL PROTECTED]
Sent: Friday, April 25, 2008 5:26 PM
To: William Reich
Cc: [EMAIL PROTECTED]
Subject: re: [osol-code] Streams flags - D_MTPUTSHARED


 > Our driver uses the following flags in its cb_ops structure:
 > ( D_NEW | D_MP | D_MTOUTEPERIM | D_MTOCEXCL ).
 > I understand that this set of flags will make the open & close
 > routines synchronous.

The open and close routines are *always* synchronous.  D_MTOCEXCL makes
sure there's only one thread in open or close at a time across all the
instances of the driver in the system (basically, it forces the outer
perimeter to be entered exclusively for those entrypoints).  BTW, the
D_NEW flag above does nothing (it expands to 0x0) and should be removed.

 > My question is - do I need to add the D_MTPUTSHARED, D_MTPERQ, and
 > _D_MTSVCSHARED flags to make sure read and write queue put & service
 > routines can run concurrently?

No -- D_MP does that already.  For an inner perimeter, you have four
basic choices:

        * D_MP: effectively no inner perimeter.
        * D_MTPERQ: inner perimeter around each queue.
        * D_MTQPAIR: inner perimeter around each queuepair.
        * D_MTPERMOD: inner perimeter around all queuepairs.

Since the inner perimeter is always exclusive by default, D_MP is the
highest level of concurrency, and flags like D_MTPUTSHARED make no sense
with it.  You can approximate D_MP to some degree by combining coarser
perimeters with e.g. D_MTPUTSHARED and the like, but there's no reason
to do that unless you have a specific reason not to be D_MP.  As an
aside: _D_MTSVCSHARED is not a public interface (hence the leading
underscore).  Do not use it.

Hope this helps,
--
meem