On 09/13/2017 11:43 PM, aravindan.muthuku...@intel.com wrote: > From: Aravindan Muthukumar <aravindan.muthuku...@intel.com> > > Avoiding the loop which was running with O(n) complexity. > Now the complexity has been reduced to O(1) > > Algorithm calculates the index using matrix method. > Matrix arrangement is as below: > Assuming PAGE_SIZE is 4096. > > 1*4096 2*4096 3*4096 4*4096 > 5*4096 6*4096 7*4096 8*4096
I think adding one more row to this chart would make it more clear. The two rows shown also follow a simpler pattern, and that made some of the complexity below seem confusing. 10*4096 12*4096 14*4096 16*4096 > ... ... ... ... > ... ... ... ... > ... ... ... max_cache_size > > From this matrix its cleary seen that every row clearly > follows the below way: > ... ... ... n > n+(1/4)n n+(1/2)n n+(3/4)n 2n > > Row is calulated as log2(size/PAGE_SIZE) > Column is calculated as converting the difference > between the elements to fit into power size of two > and indexing it. > > Final Index is (row*4)+(col-1) > > Tested with Intel Mesa CI. > > Improves performance of 3d Mark on Broxton. > Analyzed using Compare Perf Analyser: > Average : 201.2 +/- 65.4836 (n=20) Is 201 the improvement or the absolute score? Do not quote absolute scores. > Percentage : 0.705966% +/- 0.229767% (n=20) Eero: Can you reproduce this result on BXT or other platforms? Just curious... > v2: Review comments regarding cosmetics and asserts implemented > > Signed-off-by: Aravindan Muthukumar <aravindan.muthuku...@intel.com> > Signed-off-by: Kedar Karanje <kedar.j.kara...@intel.com> > Reviewed-by: Yogesh Marathe <yogesh.mara...@intel.com> > --- > src/mesa/drivers/dri/i965/brw_bufmgr.c | 46 > ++++++++++++++++++++++++++++------ > 1 file changed, 39 insertions(+), 7 deletions(-) > > diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.c > b/src/mesa/drivers/dri/i965/brw_bufmgr.c > index 8017219..8013ccb 100644 > --- a/src/mesa/drivers/dri/i965/brw_bufmgr.c > +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.c > @@ -87,6 +87,8 @@ > > #define memclear(s) memset(&s, 0, sizeof(s)) > > +#define PAGE_SIZE 4096 > + > #define FILE_DEBUG_FLAG DEBUG_BUFMGR > > static inline int > @@ -181,19 +183,45 @@ bo_tile_pitch(struct brw_bufmgr *bufmgr, uint32_t > pitch, uint32_t tiling) > return ALIGN(pitch, tile_width); > } > > +static inline int > +ilog2_round_up(int value) > +{ > + assert(value != 0); > + return 32 - __builtin_clz(value - 1); > +} > + > +/* > + * This function finds the correct bucket fit for the input size. > + * The function works with O(1) complexity when the requested size > + * was queried instead of iterating the size through all the buckets. > + */ > static struct bo_cache_bucket * > bucket_for_size(struct brw_bufmgr *bufmgr, uint64_t size) > { > - int i; > + int index = -1; I don't see how execution could get to that test without index being set again, so it should be safe to remove the initialization. > + int row, col = 0; > + int pages, pages_log2; Move the declarations of row, col, pages, and pages_log2 into the else case, initialize them with the only assignment to them, and make them const. Since all of these values, including index, get values derived from size, I believe the should all be unsigned. In that case, remove the index >= 0 test below. > > - for (i = 0; i < bufmgr->num_buckets; i++) { > - struct bo_cache_bucket *bucket = &bufmgr->cache_bucket[i]; > - if (bucket->size >= size) { > - return bucket; > - } > + /* condition for size less than 4*4096 (16KB) page size */ ^ ^ Remove one space here. | Capitalize and add a period at the end of the sentence. > + if(size <= 4 * PAGE_SIZE) { ^ Add a space here. > + index = DIV_ROUND_UP(size, PAGE_SIZE) - 1;; Remove extra trailing semicolon. ^ > + } else { > + /* Number of pages of page size */ > + pages = DIV_ROUND_UP(size, PAGE_SIZE); > + pages_log2 = ilog2_round_up(pages) - 1; Isn't this going to be the same result as _mesa_logbase2(pages)? > + > + /* Finding the row and column of the matrix */ > + row = pages_log2 - 1; > + col = DIV_ROUND_UP((pages - (1 << pages_log2)), > + (1 << (pages_log2 - 2))); ^ ^ This should | be aligned | here. -------+ Removing some of the extra parenthesis would also make it easier to read. So... I feel like this function doesn't need to special case size < 16k like this. I haven't benchmarked it, but the following is about 30 bytes shorter and avoids the conditional branch. I also think it's easier to understand. const unsigned pages = DIV_ROUND_UP(size, PAGE_SIZE); /* Row Bucket sizes clz((x-1) | 3) Row Column * in pages stride size * 0: 1 2 3 4 -> 30 30 30 30 4 1 * 1: 5 6 7 8 -> 29 29 29 29 4 1 * 2: 10 12 14 16 -> 28 28 28 28 8 2 * 3: 20 24 28 32 -> 27 27 27 27 16 4 */ const unsigned row = 30 - __builtin_clz((pages - 1) | 3); const unsigned row_max_pages = 4 << row; /* The '& ~2' is the special case for row 1. In row 1, max pages / * 2 is 2, but the previous row maximum is zero (because there is * no previous row). All row maximum sizes are power of 2, so that * is the only case where that bit will be set. */ const unsigned prev_row_max_pages = (row_max_pages / 2) & ~2; const unsigned col_size = (row_max_pages - prev_row_max_pages) / 4; const unsigned col = DIV_ROUND_UP(pages - prev_row_max_pages, col_size); /* Using the calculated row and column to index into the matrix */ const unsigned index = (row * 4) + (col - 1); The second DIV_ROUND_UP annoys me. col_size is a power of two, so we should be able to just shift. My attempts to take advantage of that yielded piles of extra instructions and a conditional branch. The problem was the row == 0 case. The best I got was: int col_size_log2 = row - 1; col_size_log2 += (col_size_log2 < 0); const unsigned col = (pages - prev_row_max_pages + ((1 << col_size_log2) - 1)) >> col_size_log2; I also found open-coding the first DIV_ROUND_UP as (size + PAGE_SIZE - 1) / PAGE_SIZE generated very slightly better code. That saves about 8 bytes of code... no idea if it's faster or not, but it did avoid a dependency on the flags. You should try my different versions of this routine in your benchmark. > + > + /* Using the calculated row and column to index into the matrix */ > + index = (row << 2) + (col - 1); > } > > - return NULL; > + /* Checking the error condition */ Assuming the index >= 0 check is removed, per my suggestion above, I don't feel like this is a very useful comment. If index is >= num_buckets, then the size is too large? The cache bucket hasn't been created yet? > + return (index >= 0 && index < bufmgr->num_buckets) ? > + (&bufmgr->cache_bucket[index]) : NULL; ^ ^ Remove these extraneous parenthesis. > } > > int > @@ -1239,6 +1267,10 @@ add_bucket(struct brw_bufmgr *bufmgr, int size) > list_inithead(&bufmgr->cache_bucket[i].head); > bufmgr->cache_bucket[i].size = size; > bufmgr->num_buckets++; > + > + assert(bucket_for_size(bufmgr, size) == &bufmgr->cache_bucket[i]); > + assert(bucket_for_size(bufmgr, size - 2048) == &bufmgr->cache_bucket[i]); > + assert(bucket_for_size(bufmgr, size + 1) != &bufmgr->cache_bucket[i]); > } > > static void > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev