Hi,

Thanks for your interest. I am in the middle of rewriting parts due to
my findings about dma_unmap. If everything goes well I should have a
new prototype ready on Tuesday.
My code base is 2.6.37-rc4. Will that work for you?

After Tuesday I will go on vacation until the Linaro sprint in Dallas on
Jan 10. I will not make any updates to my code during my vacation, but
I will try to keep up with my emails.
I don't want to send it out for a full review yet because the code is
far from ready. It would only cause too much noise, I'm afraid, and
since I am going on vacation the timing is not the best.

Patches:
Is it OK for you to wait until Tuesday (or a few days later if I run
into trouble)? Then you can test my latest version, which supports
double buffering for unmap. I can send the patches directly to
you.

BR
Per

On 18 December 2010 03:50, Kyungmin Park <kmp...@infradead.org> wrote:
> Hi,
>
> It's interesting.
>
> Can you send your working code so we can test it in our environment (Samsung
> SoC)?
>
> Thank you,
> Kyungmin Park
>
> On Sat, Dec 18, 2010 at 12:38 AM, Per Forlin <per.for...@linaro.org> wrote:
>> Hi again,
>>
>> I made a mistake in my double buffering implementation.
>> I assumed dma_unmap did not do any cache operations. Well, it does.
>> Due to L2 read prefetch, the L2 cache needs to be invalidated at dma_unmap.
>>
>> I made a quick test to see how much throughput would improve if
>> dma_unmap could be run in parallel.
>> In this run dma_unmap is simply removed.
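>>
>> (To show where the cost sits, here is a minimal sketch, with made-up
>> helper names and assuming the usual DMA-mapping API, of the map/unmap
>> pair around one transfer; the quick test simply leaves out the
>> dma_unmap_sg() call.)
>>
>>   /* Simplified sketch (hypothetical helpers, not the actual mmci code)
>>    * of where the cache maintenance sits around one MMC transfer.
>>    */
>>   #include <linux/dma-mapping.h>
>>   #include <linux/mmc/core.h>
>>
>>   static int sketch_start_dma(struct device *dev, struct mmc_data *data)
>>   {
>>           enum dma_data_direction dir = (data->flags & MMC_DATA_READ) ?
>>                                         DMA_FROM_DEVICE : DMA_TO_DEVICE;
>>           int nents;
>>
>>           /* Cache clean (and invalidate for reads) happens in here. */
>>           nents = dma_map_sg(dev, data->sg, data->sg_len, dir);
>>           if (!nents)
>>                   return -EINVAL;
>>
>>           /* ... submit the job to the DMAC and start the transfer ... */
>>           return 0;
>>   }
>>
>>   static void sketch_dma_done(struct device *dev, struct mmc_data *data)
>>   {
>>           enum dma_data_direction dir = (data->flags & MMC_DATA_READ) ?
>>                                         DMA_FROM_DEVICE : DMA_TO_DEVICE;
>>
>>           /* Because of L2 read prefetch this must invalidate L2 again
>>            * for reads; this is the call left out in the quick test.
>>            */
>>           dma_unmap_sg(dev, data->sg, data->sg_len, dir);
>>   }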
>>
>> The figures for read then become:
>> * 7-16 % gain with double buffering in the ideal case, closing in on
>> the same performance as PIO.
>>
>> Relative diff: MMC-VANILLA-DMA-LOG -> MMC-MMCI-2-BUF-DMA-LOG-NO-UNMAP
>> The cpu rows are absolute diff.
>>                                                        random  random
>>        KB      reclen  write   rewrite read    reread  read    write
>>        51200   4       +0%     +0%     +7%     +8%     +2%     +0%
>>        cpu:            +0.0    +0.0    +0.7    +0.7    -0.0    +0.0
>>
>>        51200   8       +0%     +0%     +10%    +10%    +6%     +0%
>>        cpu:            -0.1    +0.1    +0.6    +0.9    +0.3    +0.0
>>
>>        51200   16      +0%     +0%     +11%    +11%    +8%     +0%
>>        cpu:            -0.0    -0.1    +0.9    +1.0    +0.3    +0.0
>>
>>        51200   32      +0%     +0%     +13%    +13%    +10%    +0%
>>        cpu:            -0.1    +0.0    +1.0    +0.5    +0.8    +0.0
>>
>>        51200   64      +0%     +0%     +13%    +13%    +12%    +1%
>>        cpu:            +0.0    +0.0    +0.4    +1.0    +0.9    +0.1
>>
>>        51200   128     +0%     +5%     +14%    +14%    +14%    +1%
>>        cpu:            +0.0    +0.2    +1.0    +0.9    +1.0    +0.0
>>
>>        51200   256     +0%     +2%     +13%    +13%    +13%    +1%
>>        cpu:            +0.0    +0.1    +0.9    +0.3    +1.6    -0.1
>>
>>        51200   512     +0%     +1%     +14%    +14%    +14%    +8%
>>        cpu:            -0.0    +0.3    +2.5    +1.8    +2.4    +0.3
>>
>>        51200   1024    +0%     +2%     +14%    +15%    +15%    +0%
>>        cpu:            +0.0    +0.3    +1.3    +1.4    +1.3    +0.1
>>
>>        51200   2048    +2%     +2%     +15%    +15%    +15%    +4%
>>        cpu:            +0.3    +0.1    +1.6    +2.1    +0.9    +0.3
>>
>>        51200   4096    +5%     +3%     +15%    +16%    +16%    +5%
>>        cpu:            +0.0    +0.4    +1.1    +1.7    +1.7    +0.5
>>
>>        51200   8192    +5%     +3%     +16%    +16%    +16%    +2%
>>        cpu:            +0.0    +0.4    +2.0    +1.3    +1.8    +0.1
>>
>>        51200   16384   +1%     +1%     +16%    +16%    +16%    +4%
>>        cpu:            +0.1    -0.2    +2.3    +1.7    +2.6    +0.2
>>
>> I will work on adding unmap to double buffering next week.
>>
>> /Per
>>
>> On 16 December 2010 15:15, Per Forlin <per.for...@linaro.org> wrote:
>>> Hi,
>>>
>>> I am working on the blueprint
>>> https://blueprints.launchpad.net/linux-linaro/+spec/other-storage-performance-emmc.
>>> Currently I am investigating performance for DMA vs PIO on eMMC.
>>>
>>> Pros and cons for DMA on MMC
>>> + Offloads the CPU
>>> + Fewer interrupts: a single interrupt per transfer compared to
>>> hundreds or even thousands
>>> + Power savings: DMA consumes less power than the CPU
>>> - Less bandwidth / throughput compared to CPU-driven PIO
>>>
>>> The reason for introducing double buffering in the MMC framework is to
>>> address this throughput issue for DMA on MMC.
>>> The assumption is that both the CPU and the DMA have higher throughput
>>> than the MMC / SD card.
>>> My hypothesis is that the difference in performance between PIO mode
>>> and DMA mode for MMC is due to the latency of preparing a DMA job.
>>> If the next DMA job could be prepared while the current one is ongoing,
>>> this latency would be hidden. The biggest part of preparing a DMA job
>>> is cache maintenance.
>>> In my case I run on U5500 (mach-ux500), which has both L1 and L2
>>> caches. The mmc host driver in use is the mmci driver (PL180).
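>>>
>>> (To illustrate the idea, here is a conceptual sketch with hypothetical
>>> helper names, not the actual patch: the DMA job for request N+1 is
>>> prepared while request N is still in flight, so the cache maintenance
>>> cost is hidden behind the ongoing transfer.)
>>>
>>>   /* Conceptual sketch only; helper names are made up. The expensive
>>>    * part of host_dma_prepare()/host_dma_unprepare() is dma_map_sg()/
>>>    * dma_unmap_sg(), i.e. the cache maintenance.
>>>    */
>>>   #include <linux/mmc/core.h>
>>>   #include <linux/mmc/host.h>
>>>
>>>   /* hypothetical helpers, not real kernel API */
>>>   void host_dma_prepare(struct mmc_host *host, struct mmc_request *mrq);
>>>   void host_dma_start(struct mmc_host *host, struct mmc_request *mrq);
>>>   void host_dma_wait(struct mmc_host *host, struct mmc_request *mrq);
>>>   void host_dma_unprepare(struct mmc_host *host, struct mmc_request *mrq);
>>>
>>>   static void sketch_issue(struct mmc_host *host,
>>>                            struct mmc_request *cur,
>>>                            struct mmc_request *next)
>>>   {
>>>           /* cur was already prepared in the previous round */
>>>           host_dma_start(host, cur);            /* request N in flight  */
>>>
>>>           if (next)
>>>                   host_dma_prepare(host, next); /* overlaps with N      */
>>>
>>>           host_dma_wait(host, cur);             /* transfer-done irq    */
>>>           host_dma_unprepare(host, cur);        /* unmap / invalidate   */
>>>   }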
>>>
>>> I have hacked both the MMC framework and mmci to build a proof of
>>> concept, and I have run IOZone to get measurements that back up my
>>> case.
>>> The next step, if the results are promising, will be to clean up my
>>> work and send out patches for review.
>>>
>>> The DMAC in ux500 supports two modes, LOG and PHY.
>>> LOG - many logical channels are multiplexed on top of one physical channel
>>> PHY - only one channel per physical channel
>>>
>>> DMA modes LOG and PHY have different latency, both HW- and SW-wise. One
>>> could almost treat them as two different DMACs. To get a wider test
>>> scope I have tested using both modes.
>>>
>>> Summary of the results:
>>> * It is optional for the mmc host driver to utilize the 2-buf
>>> support; 2-buf in the framework requires no change in the host drivers.
>>> * IOZone shows no performance hit on existing drivers* when 2-buf is
>>> added to the framework but not to the host driver.
>>>  (* So far I have only tested one driver.)
>>> * The performance gain for DMA using 2-buf is probably proportional to
>>> the cache maintenance time.
>>>  The faster the card, the more significant the cache maintenance
>>> part becomes, and vice versa.
>>> * For U5500 with 2-buf, the DMA performance is:
>>> Throughput: DMA vanilla vs DMA 2-buf
>>>  * read: +5-10 %
>>>  * write: +0-3 %
>>> CPU load: PIO (CPU) vs DMA 2-buf
>>>  * read, large data: 10-20 percentage points lower
>>>  * read, small data: same as PIO
>>>  * write: same load as PIO (why?)
>>>
>>> Here follow two of the measurements from IOZone, comparing MMC with
>>> and without double buffering. The rest can be found in the attached
>>> text files.
>>>
>>> === Performance CPU compared with DMA vanilla kernel ===
>>> Absolute diff: MMC-VANILLA-CPU -> MMC-VANILLA-DMA-LOG
>>>                                                        random  random
>>>        KB      reclen  write   rewrite read    reread  read    write
>>>        51200   4       -14     -8      -1005   -988    -679    -1
>>>        cpu:            -0.0    -0.1    -0.8    -0.9    -0.7    +0.0
>>>
>>>        51200   8       -35     -34     -1763   -1791   -1327   +0
>>>        cpu:            +0.0    -0.1    -0.9    -1.2    -0.7    +0.0
>>>
>>>        51200   16      +6      -38     -2712   -2728   -2225   +0
>>>        cpu:            -0.1    -0.0    -1.6    -1.2    -0.7    -0.0
>>>
>>>        51200   32      -10     -79     -3640   -3710   -3298   -1
>>>        cpu:            -0.1    -0.2    -1.2    -1.2    -0.7    -0.0
>>>
>>>        51200   64      +31     -16     -4401   -4533   -4212   -1
>>>        cpu:            -0.2    -0.2    -0.6    -1.2    -1.2    -0.0
>>>
>>>        51200   128     +58     -58     -4749   -4776   -4532   -4
>>>        cpu:            -0.2    -0.0    -1.2    -1.1    -1.2    +0.1
>>>
>>>        51200   256     +192    +283    -5343   -5347   -5184   +13
>>>        cpu:            +0.0    +0.1    -1.2    -0.6    -1.2    +0.0
>>>
>>>        51200   512     +232    +470    -4663   -4690   -4588   +171
>>>        cpu:            +0.1    +0.1    -4.5    -3.9    -3.8    -0.1
>>>
>>>        51200   1024    +250    +68     -3151   -3318   -3303   +122
>>>        cpu:            -0.1    -0.5    -14.0   -13.5   -14.0   -0.1
>>>
>>>        51200   2048    +224    +401    -2708   -2601   -2612   +161
>>>        cpu:            -1.7    -1.3    -18.4   -19.5   -17.8   -0.5
>>>
>>>        51200   4096    +194    +417    -2380   -2361   -2520   +242
>>>        cpu:            -1.3    -1.6    -19.4   -19.9   -19.4   -0.6
>>>
>>>        51200   8192    +228    +315    -2279   -2327   -2291   +270
>>>        cpu:            -1.0    -0.9    -20.8   -20.3   -21.0   -0.6
>>>
>>>        51200   16384   +254    +289    -2260   -2232   -2269   +308
>>>        cpu:            -0.8    -0.8    -20.5   -19.9   -21.5   -0.4
>>>
>>> === Performance CPU compared with DMA with MMC double buffering ===
>>> Absolute diff: MMC-VANILLA-CPU -> MMC-MMCI-2-BUF-DMA-LOG
>>>                                                        random  random
>>>        KB      reclen  write   rewrite read    reread  read    write
>>>        51200   4       -7      -11     -533    -513    -365    +0
>>>        cpu:            -0.0    -0.1    -0.5    -0.7    -0.4    +0.0
>>>
>>>        51200   8       -19     -28     -916    -932    -671    +0
>>>        cpu:            -0.0    -0.0    -0.3    -0.6    -0.2    +0.0
>>>
>>>        51200   16      +14     -13     -1467   -1479   -1203   +1
>>>        cpu:            +0.0    -0.1    -0.7    -0.7    -0.2    -0.0
>>>
>>>        51200   32      +61     +24     -2008   -2088   -1853   +4
>>>        cpu:            -0.3    -0.2    -0.7    -0.7    -0.2    -0.0
>>>
>>>        51200   64      +130    +84     -2571   -2692   -2483   +5
>>>        cpu:            +0.0    -0.4    -0.1    -0.7    -0.7    +0.0
>>>
>>>        51200   128     +275    +279    -2760   -2747   -2607   +19
>>>        cpu:            -0.1    +0.1    -0.7    -0.6    -0.7    +0.1
>>>
>>>        51200   256     +558    +503    -3455   -3429   -3216   +55
>>>        cpu:            -0.1    +0.1    -0.8    -0.1    -0.8    +0.0
>>>
>>>        51200   512     +608    +820    -2476   -2497   -2504   +154
>>>        cpu:            +0.2    +0.5    -3.3    -2.1    -2.7    +0.0
>>>
>>>        51200   1024    +652    +493    -818    -977    -1023   +291
>>>        cpu:            +0.0    -0.1    -13.2   -12.8   -13.3   +0.1
>>>
>>>        51200   2048    +654    +809    -241    -218    -242    +501
>>>        cpu:            -1.5    -1.2    -16.9   -18.2   -17.0   -0.2
>>>
>>>        51200   4096    +482    +908    -80     +82     -154    +633
>>>        cpu:            -1.4    -1.2    -19.1   -18.4   -18.6   -0.2
>>>
>>>        51200   8192    +643    +810    +199    +186    +182    +675
>>>        cpu:            -0.8    -0.7    -19.8   -19.2   -19.5   -0.7
>>>
>>>        51200   16384   +684    +724    +275    +323    +269    +724
>>>        cpu:            -0.6    -0.7    -19.2   -18.6   -19.8   -0.2
>>>
>>
>
