I am dealing with a performance issue in a simple loopback test on a T2000 SPARC platform.
A user-space application sends data to the STREAMS driver, which passes the data to our board via DMA. The data is then returned from the board, back through the driver, to a second user-space application:

++++++++            ++++++++
+ app1 +            + app2 +
++++++++            ++++++++
   |                   ^
   v                   |
++++++++++++++++++++++++++++
+                          +
+  driver              B   +
++++++++++++++++++++++++++++
   |                   ^
   V                   |
++++++++++++++++++++++++++++
+  E       board           +
++++++++++++++++++++++++++++

In this simple configuration, our experiments show that from app1 to app2 we can send/receive 28,000 messages per second (each message is about 800 bytes). This is not good enough for our purposes; we need to achieve 36,000.

Further testing and experimentation reveal that if we send the messages from app1 only as far as point 'E' in the diagram above, we can achieve the 36,000 rate. Additional testing shows that going from app1, through the board, to point 'B' in the diagram (inside the driver, upon DMA completion from the board), we can also make the 36,000 mark. So the bottleneck is from point 'B' to app2.

But that path is STREAMS, which is part of Solaris. Using lockstat, we can see many locks being taken by STREAMS. The D_MTOUTPERIM | D_MTOCEXCL flags have been removed from the driver. We want maximum concurrency, which as far as I know means we want no outer perimeter and no inner perimeter in the driver.

Interestingly, on the T2000 the same rates are obtained whether 2 CPUs or 16 CPUs are used.

Does anybody have any tips/tricks/techniques that can be used to remove the bottleneck?

+++++++ more details...

From the board to the host driver, an interrupt is raised when a DMA of a batch of messages completes; a batch is usually 250 messages in this test. The interrupt routine schedules the read service routine of the STREAMS driver via qenable(). The service routine pulls the messages from the DMA chain and putq()s them onto the proper queue for sending up the stream. In this simple case the mapping is one-to-one; in the general case the data could go onto one of 124 possible queues. Those messages go up the stream via a getq()/putnext() sequence, observing flow control when appropriate. When we hacked the driver to ignore flow control, we still could not do better than 28,000.
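To make the read-side path concrete, here is a stripped-down sketch of how it is structured, simplified to the one-to-one case where the messages have already been putq()'d onto our read queue. The xx_ names, the state structure, and the DMA bookkeeping are placeholders for illustration, not our actual driver code:

/*
 * Illustrative sketch only; xx_ names are placeholders.
 */
#include <sys/types.h>
#include <sys/stream.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

typedef struct xx_state {
        queue_t         *xx_rq;         /* read-side queue for this stream */
        /* ... DMA descriptor chain, statistics, etc. ... */
} xx_state_t;

/*
 * DMA-completion interrupt: no message processing here, just
 * schedule the read service routine of the queue.
 */
static uint_t
xx_intr(caddr_t arg)
{
        xx_state_t *sp = (xx_state_t *)arg;

        /* ... acknowledge the interrupt, note which batch completed ... */
        qenable(sp->xx_rq);             /* schedule xx_rsrv() */
        return (DDI_INTR_CLAIMED);
}

/*
 * Read service routine: drain the messages that were putq()'d onto
 * this queue, sending them upstream and observing flow control.
 */
static int
xx_rsrv(queue_t *q)
{
        mblk_t *mp;

        while ((mp = getq(q)) != NULL) {
                if (!canputnext(q)) {
                        /*
                         * The stream above us is flow-controlled: put the
                         * message back.  The back-enable mechanism re-runs
                         * this service routine when the stream drains.
                         */
                        (void) putbq(q, mp);
                        break;
                }
                putnext(q, mp);
        }
        return (0);
}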
wr

-----Original Message-----
From: Peter Memishian [mailto:[EMAIL PROTECTED]
Sent: Friday, April 25, 2008 5:26 PM
To: William Reich
Cc: [EMAIL PROTECTED]
Subject: re: [osol-code] Streams flags - D_MTPUTSHARED

 > Our driver uses the following flags in its cb_ops structure:
 > ( D_NEW | D_MP | D_MTOUTPERIM | D_MTOCEXCL ).
 > I understand that this set of flags will make the open & close
 > routines synchronous.

The open and close routines are *always* synchronous.  D_MTOCEXCL makes
sure there's only one thread in open or close at a time across all the
instances of the driver in the system (basically, it forces the outer
perimeter to be entered exclusively for those entrypoints).  BTW, the
D_NEW flag above does nothing (it expands to 0x0) and should be removed.

 > My question is - do I need to add the D_MTPUTSHARED, D_MTPERQ, and
 > _D_MTSVCSHARED flags to make sure the read and write queue put &
 > service routines can run concurrently?

No -- D_MP does that already.  For an inner perimeter, you have four
basic choices:

 * D_MP: effectively no inner perimeter.
 * D_MTPERQ: inner perimeter around each queue.
 * D_MTQPAIR: inner perimeter around each queuepair.
 * D_MTPERMOD: inner perimeter around all queuepairs.

Since the inner perimeter is always exclusive by default, D_MP is the
highest level of concurrency, and flags like D_MTPUTSHARED make no sense
with it.  You can approximate D_MP to some degree by combining coarser
perimeters with e.g. D_MTPUTSHARED and the like, but there's no reason
to do that unless you have a specific reason not to be D_MP.

As an aside: _D_MTSVCSHARED is not a public interface (hence the leading
underscore).  Do not use it.

Hope this helps,
--
meem
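For reference, a maximally concurrent declaration along the lines meem describes would look roughly like the sketch below. The xx_ module name, watermarks, and entry points (xx_open(), xx_close(), xx_wput(), and the xx_rsrv() sketched earlier in this thread) are placeholders assumed to be defined elsewhere, not any particular driver:

/*
 * Illustrative declaration only; xx_ names are placeholders.
 */
#include <sys/types.h>
#include <sys/stream.h>
#include <sys/conf.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

/* Entry points, defined elsewhere in the driver. */
static int xx_open(queue_t *, dev_t *, int, int, cred_t *);
static int xx_close(queue_t *, int, cred_t *);
static int xx_rsrv(queue_t *);
static int xx_wput(queue_t *, mblk_t *);

static struct module_info xx_minfo = {
        0x4242, "xx", 0, INFPSZ, 65536, 32768   /* id, name, psz, hi/lo water */
};

static struct qinit xx_rinit = {
        NULL, xx_rsrv, xx_open, xx_close, NULL, &xx_minfo, NULL
};

static struct qinit xx_winit = {
        xx_wput, NULL, NULL, NULL, NULL, &xx_minfo, NULL
};

static struct streamtab xx_strtab = {
        &xx_rinit, &xx_winit, NULL, NULL
};

/*
 * cb_flag is just D_MP: no outer perimeter and no inner perimeter,
 * so put and service routines may run concurrently on any queue.
 * D_NEW (0x0) is omitted, and D_MTOUTPERIM | D_MTOCEXCL are gone.
 */
static struct cb_ops xx_cb_ops = {
        nulldev,                /* cb_open (qi_qopen is used instead) */
        nulldev,                /* cb_close */
        nodev,                  /* cb_strategy */
        nodev,                  /* cb_print */
        nodev,                  /* cb_dump */
        nodev,                  /* cb_read */
        nodev,                  /* cb_write */
        nodev,                  /* cb_ioctl */
        nodev,                  /* cb_devmap */
        nodev,                  /* cb_mmap */
        nodev,                  /* cb_segmap */
        nochpoll,               /* cb_chpoll */
        ddi_prop_op,            /* cb_prop_op */
        &xx_strtab,             /* cb_str */
        D_MP,                   /* cb_flag */
        CB_REV,                 /* cb_rev */
        nodev,                  /* cb_aread */
        nodev                   /* cb_awrite */
};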