> We also want to test it, and will wait until you release it to the mmc
> mailing list.
I think I will be able to send out the code to the mailing list in mid-to-late January.

> I saw the mmc performance blueprint; we currently suffer from poor mmc
> performance at low cpu frequencies. Even though the input clock is
> constant at 50MHz, the performance depends on the cpu frequency.
Do you run in PIO or DMA mode?

> Need to investigate it.
Please let me know when you find out.

/Per

On 18 December 2010 16:29, Kyungmin Park <kmp...@infradead.org> wrote:
> Thanks
>
> No problem.
>
> We also want to test it, and will wait until you release it to the mmc
> mailing list.
> I saw the mmc performance blueprint; we currently suffer from poor mmc
> performance at low cpu frequencies. Even though the input clock is
> constant at 50MHz, the performance depends on the cpu frequency.
>
> Need to investigate it.
>
> Thank you,
> Kyungmin Park
>
> On Sat, Dec 18, 2010 at 11:19 PM, Per Forlin <per.for...@linaro.org> wrote:
>> Hi,
>>
>> Thanks for your interest. I am in the middle of rewriting parts due to
>> my findings about dma_unmap. If everything goes well I should have a
>> new prototype ready on Tuesday.
>> My code base is 2.6.37 rc4. Will that work for you?
>>
>> After Tuesday I will go on vacation until the Linaro sprint in Dallas on
>> Jan 10. I will not make any updates to my code during my vacation, but I
>> will try to keep up with my emails.
>> I don't want to send it out for a full review yet because the code is
>> far from ready. It would only cause too much noise, I'm afraid, and
>> since I am going on vacation the timing is not the best.
>>
>> Patches:
>> Is it ok for you to wait until Tuesday (or a few days later if I run
>> into trouble)? Then you can test my latest version, which supports
>> double buffering for unmap. I can send the patches directly to you.
>>
>> BR
>> Per
>>
>> On 18 December 2010 03:50, Kyungmin Park <kmp...@infradead.org> wrote:
>>> Hi,
>>>
>>> It's interesting.
>>>
>>> Can you send us your working code so we can test it in our environment
>>> (Samsung SoC)?
>>>
>>> Thank you,
>>> Kyungmin Park
>>>
>>> On Sat, Dec 18, 2010 at 12:38 AM, Per Forlin <per.for...@linaro.org> wrote:
>>>> Hi again,
>>>>
>>>> I made a mistake in my double buffering implementation.
>>>> I assumed dma_unmap did not do any cache operations. Well, it does.
>>>> Due to L2 read prefetch the L2 needs to be invalidated at dma_unmap.
>>>>
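>>>> To make the cost concrete, here is a minimal sketch (not the actual
>>>> mmci code; the helper name is made up) of where dma_unmap enters the
>>>> picture for a DMA read:
>>>>
>>>> #include <linux/dma-mapping.h>
>>>> #include <linux/scatterlist.h>
>>>>
>>>> /* Hypothetical helper: start the DMA job and block until it is done. */
>>>> extern void example_start_dma_and_wait(void);
>>>>
>>>> static void example_dma_read(struct device *dev, struct scatterlist *sg,
>>>>                              int sg_len)
>>>> {
>>>>         /* Clean/invalidate caches so the DMAC sees consistent memory. */
>>>>         int count = dma_map_sg(dev, sg, sg_len, DMA_FROM_DEVICE);
>>>>
>>>>         if (count == 0)
>>>>                 return;
>>>>
>>>>         example_start_dma_and_wait();
>>>>
>>>>         /*
>>>>          * This is the step measured here: for DMA_FROM_DEVICE the unmap
>>>>          * must invalidate any lines the L2 prefetched during the
>>>>          * transfer, before the CPU is allowed to read the buffer.
>>>>          * Removing it (as in the quick test below) is what gives the
>>>>          * read gain.
>>>>          */
>>>>         dma_unmap_sg(dev, sg, sg_len, DMA_FROM_DEVICE);
>>>> }
>>>>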
>>>> I made a quick test to see how much throughput would improve if
>>>> dma_unmap could be run in parallel.
>>>> In this run dma_unmap is simply removed.
>>>>
>>>> The figures for read then become:
>>>> * 7-16 % gain with double buffering in the ideal case, closing in on
>>>> the same performance as PIO.
>>>>
>>>> Relative diff: MMC-VANILLA-DMA-LOG -> MMC-MMCI-2-BUF-DMA-LOG-NO-UNMAP
>>>> CPU rows are absolute diffs
>>>>                                                        random  random
>>>>        KB      reclen  write   rewrite read    reread  read    write
>>>>        51200   4       +0%     +0%     +7%     +8%     +2%     +0%
>>>>        cpu:            +0.0    +0.0    +0.7    +0.7    -0.0    +0.0
>>>>
>>>>        51200   8       +0%     +0%     +10%    +10%    +6%     +0%
>>>>        cpu:            -0.1    +0.1    +0.6    +0.9    +0.3    +0.0
>>>>
>>>>        51200   16      +0%     +0%     +11%    +11%    +8%     +0%
>>>>        cpu:            -0.0    -0.1    +0.9    +1.0    +0.3    +0.0
>>>>
>>>>        51200   32      +0%     +0%     +13%    +13%    +10%    +0%
>>>>        cpu:            -0.1    +0.0    +1.0    +0.5    +0.8    +0.0
>>>>
>>>>        51200   64      +0%     +0%     +13%    +13%    +12%    +1%
>>>>        cpu:            +0.0    +0.0    +0.4    +1.0    +0.9    +0.1
>>>>
>>>>        51200   128     +0%     +5%     +14%    +14%    +14%    +1%
>>>>        cpu:            +0.0    +0.2    +1.0    +0.9    +1.0    +0.0
>>>>
>>>>        51200   256     +0%     +2%     +13%    +13%    +13%    +1%
>>>>        cpu:            +0.0    +0.1    +0.9    +0.3    +1.6    -0.1
>>>>
>>>>        51200   512     +0%     +1%     +14%    +14%    +14%    +8%
>>>>        cpu:            -0.0    +0.3    +2.5    +1.8    +2.4    +0.3
>>>>
>>>>        51200   1024    +0%     +2%     +14%    +15%    +15%    +0%
>>>>        cpu:            +0.0    +0.3    +1.3    +1.4    +1.3    +0.1
>>>>
>>>>        51200   2048    +2%     +2%     +15%    +15%    +15%    +4%
>>>>        cpu:            +0.3    +0.1    +1.6    +2.1    +0.9    +0.3
>>>>
>>>>        51200   4096    +5%     +3%     +15%    +16%    +16%    +5%
>>>>        cpu:            +0.0    +0.4    +1.1    +1.7    +1.7    +0.5
>>>>
>>>>        51200   8192    +5%     +3%     +16%    +16%    +16%    +2%
>>>>        cpu:            +0.0    +0.4    +2.0    +1.3    +1.8    +0.1
>>>>
>>>>        51200   16384   +1%     +1%     +16%    +16%    +16%    +4%
>>>>        cpu:            +0.1    -0.2    +2.3    +1.7    +2.6    +0.2
>>>>
>>>> I will work on adding unmap to double buffering next week.
>>>>
>>>> /Per
>>>>
>>>> On 16 December 2010 15:15, Per Forlin <per.for...@linaro.org> wrote:
>>>>> Hi,
>>>>>
>>>>> I am working on the blueprint
>>>>> https://blueprints.launchpad.net/linux-linaro/+spec/other-storage-performance-emmc.
>>>>> Currently I am investigating performance for DMA vs PIO on eMMC.
>>>>>
>>>>> Pros and cons for DMA on MMC:
>>>>> + Offloads the CPU
>>>>> + Fewer interrupts: a single interrupt per transfer compared to
>>>>> 100s or even 1000s
>>>>> + Power savings: DMA consumes less power than the CPU
>>>>> - Less bandwidth / throughput compared to PIO on the CPU
>>>>>
>>>>> The reason for introducing double buffering in the MMC framework is to
>>>>> address the throughput issue for DMA on MMC.
>>>>> The assumption is that the CPU and DMA have higher throughput than the
>>>>> MMC / SD-card.
>>>>> My hypothesis is that the difference in performance between PIO mode
>>>>> and DMA mode for MMC is due to the latency of preparing a DMA job.
>>>>> If the next DMA job could be prepared while the current job is ongoing,
>>>>> this latency would be reduced. The biggest part of preparing a DMA job
>>>>> is cache maintenance.
>>>>> In my case I run on U5500 (mach-ux500) which has both L1 and L2
>>>>> caches. The host mmc driver in use is the mmci driver (PL180).
>>>>>
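>>>>> As an illustration of that idea, here is a rough sketch (made-up
>>>>> request struct and helper names, not the actual framework code): while
>>>>> request N is on the bus, the cache maintenance for request N+1 is
>>>>> already done.
>>>>>
>>>>> #include <linux/dma-mapping.h>
>>>>> #include <linux/scatterlist.h>
>>>>>
>>>>> /* Hypothetical request descriptor and helpers, for illustration only. */
>>>>> struct example_req {
>>>>>         struct scatterlist *sg;
>>>>>         int sg_len;
>>>>>         enum dma_data_direction dir;
>>>>> };
>>>>>
>>>>> extern void example_submit_dma(struct example_req *req);
>>>>> extern void example_wait_dma_done(struct example_req *req);
>>>>> extern struct example_req *example_next_req(void);
>>>>>
>>>>> static void example_issue_pipelined(struct device *dev,
>>>>>                                     struct example_req *cur)
>>>>> {
>>>>>         struct example_req *next;
>>>>>
>>>>>         dma_map_sg(dev, cur->sg, cur->sg_len, cur->dir);
>>>>>         example_submit_dma(cur);
>>>>>
>>>>>         /*
>>>>>          * Hide the preparation latency: do the cache maintenance for
>>>>>          * the next request while the current one is on the bus.
>>>>>          */
>>>>>         next = example_next_req();
>>>>>         if (next)
>>>>>                 dma_map_sg(dev, next->sg, next->sg_len, next->dir);
>>>>>
>>>>>         example_wait_dma_done(cur);
>>>>>         dma_unmap_sg(dev, cur->sg, cur->sg_len, cur->dir);
>>>>> }
>>>>>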
>>>>> I have done a hack in both the MMC framework and mmci as a proof of
>>>>> concept, and I have run IOZone to get measurements that back up my
>>>>> case.
>>>>> The next step, if the results are promising, will be to clean up my
>>>>> work and send out patches for review.
>>>>>
>>>>> The DMAC in ux500 supports two modes, LOG and PHY.
>>>>> LOG - Many logical channels are multiplexed on top of one physical channel
>>>>> PHY - Only one channel per physical channel
>>>>>
>>>>> The LOG and PHY DMA modes have different latencies, both HW- and
>>>>> SW-wise; one could almost treat them as "two different DMACs". To get a
>>>>> wider test scope I have tested using both modes.
>>>>>
>>>>> Summary of the results:
>>>>> * It is optional for the mmc host driver to utilize the 2-buf
>>>>> support; 2-buf in the framework requires no change in the host drivers
>>>>> (a sketch of how this could look follows after this list).
>>>>> * IOZone shows no performance hit on existing drivers* when 2-buf is
>>>>> added to the framework but not to the host driver.
>>>>>  (* So far I have only tested one driver)
>>>>> * The performance gain for DMA using 2-buf is probably proportional to
>>>>> the cache maintenance time.
>>>>>  The faster the card, the more significant the cache maintenance part
>>>>> becomes, and vice versa.
>>>>> * For U5500 with 2-buf, the performance for DMA is:
>>>>> Throughput: DMA vanilla vs DMA 2-buf
>>>>>  * read: +5-10 %
>>>>>  * write: +0-3 %
>>>>> CPU load: CPU (PIO) vs DMA 2-buf
>>>>>  * read, large data: minus 10-20 percentage points
>>>>>  * read, small data: same as PIO
>>>>>  * write: same load as PIO (why?)
>>>>>
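>>>>> To show what "optional for the host driver" could look like, here is a
>>>>> rough sketch (hook and struct names are illustrative, not a final API):
>>>>> the framework only calls prepare/cleanup hooks the host has filled in,
>>>>> so unmodified drivers behave exactly as before.
>>>>>
>>>>> #include <linux/mmc/core.h>
>>>>> #include <linux/mmc/host.h>
>>>>>
>>>>> /* Hypothetical optional hooks a host driver may provide to get 2-buf. */
>>>>> struct example_2buf_ops {
>>>>>         void (*pre_req)(struct mmc_host *host, struct mmc_request *mrq);
>>>>>         void (*post_req)(struct mmc_host *host, struct mmc_request *mrq);
>>>>> };
>>>>>
>>>>> static void example_framework_issue(struct mmc_host *host,
>>>>>                                     struct mmc_request *mrq,
>>>>>                                     const struct example_2buf_ops *ops)
>>>>> {
>>>>>         /*
>>>>>          * Hosts that did not opt in see exactly the existing path.
>>>>>          * In the real thing, pre_req for the *next* request would run
>>>>>          * while this request is still on the bus.
>>>>>          */
>>>>>         if (ops->pre_req)
>>>>>                 ops->pre_req(host, mrq);
>>>>>
>>>>>         mmc_wait_for_req(host, mrq);    /* existing synchronous path */
>>>>>
>>>>>         if (ops->post_req)
>>>>>                 ops->post_req(host, mrq);
>>>>> }
>>>>>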
>>>>> Here follow two of the IOZone measurements comparing MMC with and
>>>>> without double buffering. The rest can be found in the attached text
>>>>> files.
>>>>>
>>>>> === Performance: CPU (PIO) compared with DMA, vanilla kernel ===
>>>>> Absolute diff: MMC-VANILLA-CPU -> MMC-VANILLA-DMA-LOG
>>>>>                                                        random  random
>>>>>        KB      reclen  write   rewrite read    reread  read    write
>>>>>        51200   4       -14     -8      -1005   -988    -679    -1
>>>>>        cpu:            -0.0    -0.1    -0.8    -0.9    -0.7    +0.0
>>>>>
>>>>>        51200   8       -35     -34     -1763   -1791   -1327   +0
>>>>>        cpu:            +0.0    -0.1    -0.9    -1.2    -0.7    +0.0
>>>>>
>>>>>        51200   16      +6      -38     -2712   -2728   -2225   +0
>>>>>        cpu:            -0.1    -0.0    -1.6    -1.2    -0.7    -0.0
>>>>>
>>>>>        51200   32      -10     -79     -3640   -3710   -3298   -1
>>>>>        cpu:            -0.1    -0.2    -1.2    -1.2    -0.7    -0.0
>>>>>
>>>>>        51200   64      +31     -16     -4401   -4533   -4212   -1
>>>>>        cpu:            -0.2    -0.2    -0.6    -1.2    -1.2    -0.0
>>>>>
>>>>>        51200   128     +58     -58     -4749   -4776   -4532   -4
>>>>>        cpu:            -0.2    -0.0    -1.2    -1.1    -1.2    +0.1
>>>>>
>>>>>        51200   256     +192    +283    -5343   -5347   -5184   +13
>>>>>        cpu:            +0.0    +0.1    -1.2    -0.6    -1.2    +0.0
>>>>>
>>>>>        51200   512     +232    +470    -4663   -4690   -4588   +171
>>>>>        cpu:            +0.1    +0.1    -4.5    -3.9    -3.8    -0.1
>>>>>
>>>>>        51200   1024    +250    +68     -3151   -3318   -3303   +122
>>>>>        cpu:            -0.1    -0.5    -14.0   -13.5   -14.0   -0.1
>>>>>
>>>>>        51200   2048    +224    +401    -2708   -2601   -2612   +161
>>>>>        cpu:            -1.7    -1.3    -18.4   -19.5   -17.8   -0.5
>>>>>
>>>>>        51200   4096    +194    +417    -2380   -2361   -2520   +242
>>>>>        cpu:            -1.3    -1.6    -19.4   -19.9   -19.4   -0.6
>>>>>
>>>>>        51200   8192    +228    +315    -2279   -2327   -2291   +270
>>>>>        cpu:            -1.0    -0.9    -20.8   -20.3   -21.0   -0.6
>>>>>
>>>>>        51200   16384   +254    +289    -2260   -2232   -2269   +308
>>>>>        cpu:            -0.8    -0.8    -20.5   -19.9   -21.5   -0.4
>>>>>
>>>>> === Performance: CPU (PIO) compared with DMA with MMC double buffering ===
>>>>> Absolute diff: MMC-VANILLA-CPU -> MMC-MMCI-2-BUF-DMA-LOG
>>>>>                                                        random  random
>>>>>        KB      reclen  write   rewrite read    reread  read    write
>>>>>        51200   4       -7      -11     -533    -513    -365    +0
>>>>>        cpu:            -0.0    -0.1    -0.5    -0.7    -0.4    +0.0
>>>>>
>>>>>        51200   8       -19     -28     -916    -932    -671    +0
>>>>>        cpu:            -0.0    -0.0    -0.3    -0.6    -0.2    +0.0
>>>>>
>>>>>        51200   16      +14     -13     -1467   -1479   -1203   +1
>>>>>        cpu:            +0.0    -0.1    -0.7    -0.7    -0.2    -0.0
>>>>>
>>>>>        51200   32      +61     +24     -2008   -2088   -1853   +4
>>>>>        cpu:            -0.3    -0.2    -0.7    -0.7    -0.2    -0.0
>>>>>
>>>>>        51200   64      +130    +84     -2571   -2692   -2483   +5
>>>>>        cpu:            +0.0    -0.4    -0.1    -0.7    -0.7    +0.0
>>>>>
>>>>>        51200   128     +275    +279    -2760   -2747   -2607   +19
>>>>>        cpu:            -0.1    +0.1    -0.7    -0.6    -0.7    +0.1
>>>>>
>>>>>        51200   256     +558    +503    -3455   -3429   -3216   +55
>>>>>        cpu:            -0.1    +0.1    -0.8    -0.1    -0.8    +0.0
>>>>>
>>>>>        51200   512     +608    +820    -2476   -2497   -2504   +154
>>>>>        cpu:            +0.2    +0.5    -3.3    -2.1    -2.7    +0.0
>>>>>
>>>>>        51200   1024    +652    +493    -818    -977    -1023   +291
>>>>>        cpu:            +0.0    -0.1    -13.2   -12.8   -13.3   +0.1
>>>>>
>>>>>        51200   2048    +654    +809    -241    -218    -242    +501
>>>>>        cpu:            -1.5    -1.2    -16.9   -18.2   -17.0   -0.2
>>>>>
>>>>>        51200   4096    +482    +908    -80     +82     -154    +633
>>>>>        cpu:            -1.4    -1.2    -19.1   -18.4   -18.6   -0.2
>>>>>
>>>>>        51200   8192    +643    +810    +199    +186    +182    +675
>>>>>        cpu:            -0.8    -0.7    -19.8   -19.2   -19.5   -0.7
>>>>>
>>>>>        51200   16384   +684    +724    +275    +323    +269    +724
>>>>>        cpu:            -0.6    -0.7    -19.2   -18.6   -19.8   -0.2
>>>>>
>>>>
>>>>
>>>
>>
>

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
