On 12/28/2011 08:51 AM, Richard Sandiford wrote:
Vladimir Makarov <vmaka...@redhat.com> writes:
In the end I tried an ad-hoc approach in an attempt to do something
about (2), (3) and (4b).  The idea was to construct a preliminary
"model" schedule in which the only objective is to keep register
pressure to a minimum.  This schedule ignores pipeline characteristics,
latencies, and the number of available registers.  The maximum pressure
seen in this initial model schedule (MP) is then the benchmark for ECC(X).

I always had the impression that the code before the scheduler is close
to minimal register pressure because of the way expressions are generated.
Maybe I was wrong and some optimizations (global ones like PRE) change
this a lot.
One of the examples I was looking at was:

-----------------------------------------------------------------------------
#include <stdint.h>

#define COUNT 8

void
loop (uint8_t *__restrict dst, uint8_t *__restrict src,
      uint8_t *__restrict ff_cropTbl, int dstStride, int srcStride)
{
   const int w = COUNT;
   uint8_t *cm = ff_cropTbl + 1024;
   for(int i=0; i<w; i++)
     {
       const int srcB = src[-2*srcStride];
       const int srcA = src[-1*srcStride];
       const int src0 = src[0 *srcStride];
       const int src1 = src[1 *srcStride];
       const int src2 = src[2 *srcStride];
       const int src3 = src[3 *srcStride];
       const int src4 = src[4 *srcStride];
       const int src5 = src[5 *srcStride];
       const int src6 = src[6 *srcStride];
       const int src7 = src[7 *srcStride];
       const int src8 = src[8 *srcStride];
       const int src9 = src[9 *srcStride];
       const int src10 = src[10*srcStride];

       dst[0*dstStride] = cm[(((src0+src1)*20 - (srcA+src2)*5 + (srcB+src3)) + 16)>>5];
       dst[1*dstStride] = cm[(((src1+src2)*20 - (src0+src3)*5 + (srcA+src4)) + 16)>>5];
       dst[2*dstStride] = cm[(((src2+src3)*20 - (src1+src4)*5 + (src0+src5)) + 16)>>5];
       dst[3*dstStride] = cm[(((src3+src4)*20 - (src2+src5)*5 + (src1+src6)) + 16)>>5];
       dst[4*dstStride] = cm[(((src4+src5)*20 - (src3+src6)*5 + (src2+src7)) + 16)>>5];
       dst[5*dstStride] = cm[(((src5+src6)*20 - (src4+src7)*5 + (src3+src8)) + 16)>>5];
       dst[6*dstStride] = cm[(((src6+src7)*20 - (src5+src8)*5 + (src4+src9)) + 16)>>5];
       dst[7*dstStride] = cm[(((src7+src8)*20 - (src6+src9)*5 + (src5+src10)) + 16)>>5];
       dst++;
       src++;
     }
}
-----------------------------------------------------------------------------

(based on the libav h264 code).  In this example the loads from src and
stores to dst are still in their original order by the time we reach sched1,
so src, dst, srcA, srcB, and src0..10 are all live at once.  There's no
aliasing reason why they can't be reordered, and we do that during
scheduling.

Thanks for the example.


As for the patch itself, I will look at it more closely at the
beginning of next year.  As I understand it, there is no rush
because we are not yet in stage 1.

Thanks, appreciate it.  And yeah, there's definitely no rush: there's
no way this can go in 4.7.

Ok.  Thanks, Richard.
