The following set of patches will improve the performance
of blit-copy functions for Radeon GPUs based on 
R600, R700, Evergreen and NI ASICs.

The improvement rests on two things: using tiled mode access
(which for copying buffer objects can be used regardless of
whether the content is tiled or not), and segmenting the memory
block being copied into rectangles whose edge ratio is between
1:1 and 1:2. This maximizes the number of PCIe transactions that
use the maximum payload size (typically 128 bytes) and also
creates a memory access pattern that is more favorable for
both VRAM and host DRAM than what's currently in the kernel.
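
To illustrate the segmentation idea only, here is a minimal
sketch in C. It is not the code from the patches: the names
pick_rect() and split_copy(), the abstract "units" being copied,
and the max_dim limit are all assumptions made for this example.
It picks the height as the largest power of two whose square
still fits, then clamps the width so that the edge ratio stays
between 1:1 and 2:1, and repeats until nothing is left.

```c
#include <assert.h>

/* Hypothetical rectangle type; dimensions are in abstract copy units. */
struct rect {
	unsigned w;
	unsigned h;
};

/*
 * Pick one rectangle that covers at most `units` units, with
 * h <= w <= 2 * h and both edges no larger than `max_dim`.
 * Returns a 0x0 rectangle if there is nothing to copy.
 */
static struct rect pick_rect(unsigned long long units, unsigned max_dim)
{
	struct rect r = { 0, 0 };
	unsigned long long h = 1, w;

	if (units == 0 || max_dim == 0)
		return r;

	/* Largest power of two h with h * h <= units and h <= max_dim. */
	while (h * 2 <= max_dim && (h * 2) * (h * 2) <= units)
		h *= 2;

	/* Width: as wide as the data allows, but at most 2 * h and max_dim. */
	w = units / h;
	if (w > 2 * h)
		w = 2 * h;
	if (w > max_dim)
		w = max_dim;

	r.w = (unsigned)w;
	r.h = (unsigned)h;
	return r;
}

/*
 * Split a copy of `units` units into rectangles and return how many
 * rectangles were needed. A real implementation would emit one blit
 * per rectangle where this sketch only counts them.
 */
static unsigned split_copy(unsigned long long units, unsigned max_dim)
{
	unsigned count = 0;

	while (units > 0) {
		struct rect r = pick_rect(units, max_dim);

		if (r.w == 0)
			break;	/* max_dim == 0: nothing can be copied */

		/* Edge ratio stays between 1:1 and 2:1. */
		assert(r.h <= r.w && r.w <= 2 * r.h);

		units -= (unsigned long long)r.w * r.h;
		count++;
	}
	return count;
}
```

With these assumptions, a request for 10000 units and a max_dim
of 8192 is split into three rectangles: 128x64, 56x32 and 4x4.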

To come up with the new blit-copy code, I did a lot of
PCIe traffic analysis with a bus analyzer and also
had many discussions with Alex to work out what was
going on (thanks to Alex for his time).

Below (at the end of this note) are the results of some benchmarks
that I ran with various GPUs (all in the same host: Intel i7 CPU,
X58 chipset, three DRAM channels). To run the tests on your machine,
load the radeon module with the 'benchmark=1 pcie_gen2=1' parameters.
The most significant improvement is in the upstream (VRAM to GART)
direction, because that is where the PCIe transactions were
fragmented and where the memory access pattern created a lot of
backpressure from the host.

It is also interesting that high-end devices (e.g. Cayman) exhibit
the least improvement and were the worst to begin with. This is
because high-end devices copy more tiles in parallel which 
in turn can create bank conflicts on host memory and cause the
host to do lots of bank-close/precharge/bank-open cycles. 

As an added "bonus", I also did some code cleanup and consolidated
the repeated code into a common function, so r600 and evergreen/NI
parts now share the blit-copy code. I also expanded the
benchmark coverage, so the module now takes a benchmark parameter
value between 1 and 8, and each value runs a different
benchmark.

For details, see the commit log messages and the code.
I have been running with these patches for a few months
(and I kept rebasing them to drm-core-next as the public
git progressed) and I used them in a system setup that does
*many* copies of this kind (and does them frequently); I
have not seen any instabilities introduced by these patches. I also
verified the correctness of the copy using the test=1 parameter
for each GPU that I had, and the test passed.

I would welcome some feedback, and if you run the benchmarks
with the new blit code, I would very much like to hear
what kind of improvement you are seeing.


BENCHMARK RESULTS:
==================

1) VRAM to GTT 
==============

Card (ASIC)     VRAM            Before  After
---------------------------------------------
5570 (Redwood)  DDR3 1600MHz     454    3912
6450 (Caicos)   DDR5 3200MHz    3718    5090
6570 (Turks)    DDR3 1800MHz     484    4144
5450 (Cedar)    DDR3 1600MHz    3679    5090
5450 (Cedar)    DDR2  800MHz    2695    4639
E4690 (RV730)   DDR3 1400MHz     485    4969
E6760 (Turks)   DDR5 3200MHz     474    4177
V5700 (RV730)   DDR3 ????MHz     488    4297
2260 (RV620)    DDR2 ????MHz     494    3093
6870 (Barts)    DDR5 4200MHz     475    1113
6970 (Cayman)   DDR5 4200MHz     473     710

2) GTT to VRAM
==============

Card (ASIC)     VRAM            Before  After
---------------------------------------------
5570 (Redwood)  DDR3 1600MHz    3158    3360
6450 (Caicos)   DDR5 3200MHz    2995    3393
6570 (Turks)    DDR3 1800MHz    3039    3339
5450 (Cedar)    DDR3 1600MHz    3246    3404
5450 (Cedar)    DDR2  800MHz    2614    3371
E4690 (RV730)   DDR3 1400MHz    3084    3426
E6760 (Turks)   DDR5 3200MHz    2443    2570
V5700 (RV730)   DDR3 ????MHz    3187    3506    
2260 (RV620)    DDR2 ????MHz     584    3246
6870 (Barts)    DDR5 4200MHz    2472    2601
6970 (Cayman)   DDR5 4200MHz    2460    2737
