Hi,

in the past couple of weeks I tried to optimize the shaders used for the iDCT and MC code. Besides optimizing the TGSI code for the shaders, I optimized the TGSI->R600 code generation in r600g quite a bit:

* Removed the temporary register use from most instructions
* Optimize away CF_INST_POP
* Use special constants for 0, 1, -1, 1.0f, 0.5f etc.
* Implement output modifiers and use them to further optimize LRP
* Fixed TEX and VTX joining
* Optimize away CF ALU instructions even if type doesn't match
* Fix ALU slot assignment
* Reworked and fixed bank swizzle code
* Implement replacing GPR with PV and PS
* Merging of ALU slots into larger groups
* Reworked literal handling
* Implement register remapping
* Optimized away unneeded ALU moves
* Rearranging and merging of export instructions
* Fully implemented barrier handling
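
To illustrate the "special constants" item in the list above, here is a minimal sketch of the idea, not the actual r600g code: an immediate operand whose bit pattern matches one of the hardware's built-in constants can be encoded as that constant instead of taking a literal slot. The enum and function names are hypothetical and only serve the example; the bit patterns are the standard IEEE 754 and two's complement encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operand kinds; these names are illustrative, not r600g's. */
enum src_kind {
    SRC_ZERO,       /* 0 (same bits as 0.0f) */
    SRC_ONE_INT,    /* integer 1 */
    SRC_M_ONE_INT,  /* integer -1 */
    SRC_ONE_FLOAT,  /* 1.0f */
    SRC_HALF_FLOAT, /* 0.5f */
    SRC_LITERAL     /* anything else still needs a literal slot */
};

/* Map the raw 32-bit value of an immediate to an operand kind. */
enum src_kind classify_constant(uint32_t bits)
{
    switch (bits) {
    case 0x00000000u: return SRC_ZERO;
    case 0x00000001u: return SRC_ONE_INT;
    case 0xFFFFFFFFu: return SRC_M_ONE_INT;
    case 0x3F800000u: return SRC_ONE_FLOAT;
    case 0x3F000000u: return SRC_HALF_FLOAT;
    default:          return SRC_LITERAL;
    }
}

/* Count how many immediates still need a literal slot after the mapping. */
size_t count_literals(const uint32_t *values, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (classify_constant(values[i]) == SRC_LITERAL)
            count++;
    }
    return count;
}
```

For example, of the immediates 1.0f, 0.5f and 2.0f (0x3F800000, 0x3F000000, 0x40000000), only 2.0f would still occupy a literal slot under this scheme.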

The end result still looks valid and gives a nice 25% speed increase for 720x480p video (probably a bit more, because the bottleneck is definitely the CPU now), but for 1280x1080i and 1920x1080i the increase is only around 7% and 5%, with the CPU still quite idle.

I assume that the bottleneck at the higher resolutions is memory bandwidth, caused by the access patterns the iDCT and MC code uses. I tried to enable tiling, but haven't been successful so far. All I get when setting R600_FORCE_TILING is:

    Failed to allocate :
       size      : 0 bytes
       alignment : 0 bytes

I updated the kernel and merged my branch with master on a regular basis, but I'm still getting the same error. So what am I missing? Do I need to update some other component, like libdrm for example?

Is there any way to debug the memory bandwidth usage of the GPU?

I'm currently a bit frustrated, because it looks like I'm stuck and can't improve the speed further. Any help would be very welcome.

Regards,
Christian.