Great work Roland! And thanks Ajax to finding this hot spot. We use memcmp a lot -- all CSO caching, so we should use this everywhere.
We should also code a sse2 version with intrinsics for x86-64, which is guaranteed to always have SSE2. Jose ----- Original Message ----- > Actually I ran some numbers here and tried out a optimized struct > compare: > original ipers: 12.1 fps > ajax patch: 15.5 fps > optimized struct compare: 16.8 fps > > > This is the function I used for that (just enabled in that lp_setup > function): > > static INLINE int util_cmp_struct(const void *src1, const void *src2, > unsigned count) > { > /* hmm pointer casting is evil */ > const uint32_t *src1_ptr = (uint32_t *)src1; > const uint32_t *src2_ptr = (uint32_t *)src2; > unsigned i; > assert(count % 4 == 0); > for (i = 0; i < count/4; i++) { > if (*src1_ptr != *src2_ptr) { > return 1; > } > src1_ptr++; > src2_ptr++; > } > return 0; > } > > (And no this doesn't use repz cmpsd here.) > > So, unless I made some mistake, memcmp is just dead slow (*), most of > the slowness probably coming from the bytewise comparison (and > apparently I was wrong in assuming the comparison there might never > be > the same for ipers). > Of course, the optimized struct compare relies on structs really > being > dword aligned (I think this is always the case), and additionally it > relies on the struct size being a whole multiple of dwords - likely > struct needs padding to ensure that (at least I don't think this is > always guaranteed for all structs). > But since memcmp is used extensively (cso for one) maybe some > optimization along these lines might be worth it (though of course > for > small structs the win isn't going to be as big - and can't beat the > repz > cmps in code size...). > > Roland > > (*) I actually found some references some implementations might be > better they don't just use repz cmpsb but they split this up in parts > which do dword (or qword even - well for really large structs could > use > sse2) comparisons for the parts where it's possible and only byte > comparisons for the remaining bytes (and if the compiler does that it > probably would know the size at compile time here hence could leave > out > much of the code). Of course memcmp requires that the return value > isn't > just a true or false value, hence there's more code needed once an > unequal dword is found, though the compiler could optimize that away > too > in case it's not needed. Much the same as memcpy is optimized usually > really, so blame gcc :-). > > > > Am 29.06.2011 20:33, schrieb Roland Scheidegger: > > Ohh that's interesting, you'd think the comparison shouldn't be > > that > > expensive (though I guess in ipers case the comparison is never > > true). > > memcmp is quite extensively used everywhere. Maybe we could replace > > that > > with something faster (since we only ever care if the blocks are > > the > > same but not care about the lexographic ordering and always compare > > whole structs, should compare dwords instead of bytes for a 4 time > > speedup)? Or isn't that the reason cmpsb instead of cmpsd is used? > > Also I guess it would help if the values which are more likely to > > be > > unequal are first in the struct (if we can tell that). > > Of course though if it's unlikely to be the same as the compared > > value > > anyway not comparing at all still might be a win (here). > > > > Roland > > > > Am 29.06.2011 19:19, schrieb Adam Jackson: > >> Perversely, do this by eliminating the comparison between stored > >> and > >> current fs state. On ipers, a perf trace showed > >> try_update_scene_state > >> using 31% of a CPU, and 98% of that was in 'repz cmpsb', ie, the > >> memcmp. > >> Taking that out takes try_update_scene_state down to 6.5% of the > >> profile; more importantly, ipers goes from 10 to 14fps and gears > >> goes > >> from 790 to 830fps. > >> > >> Signed-off-by: Adam Jackson <a...@redhat.com> > >> --- > >> src/gallium/drivers/llvmpipe/lp_setup.c | 61 > >> ++++++++++++++----------------- > >> 1 files changed, 27 insertions(+), 34 deletions(-) > >> > >> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c > >> b/src/gallium/drivers/llvmpipe/lp_setup.c > >> index cbe06e5..9118db5 100644 > >> --- a/src/gallium/drivers/llvmpipe/lp_setup.c > >> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c > >> @@ -839,42 +839,35 @@ try_update_scene_state( struct > >> lp_setup_context *setup ) > >> setup->dirty |= LP_SETUP_NEW_FS; > >> } > >> > >> - > >> if (setup->dirty & LP_SETUP_NEW_FS) { > >> - if (!setup->fs.stored || > >> - memcmp(setup->fs.stored, > >> - &setup->fs.current, > >> - sizeof setup->fs.current) != 0) > >> - { > >> - struct lp_rast_state *stored; > >> - uint i; > >> - > >> - /* The fs state that's been stored in the scene is > >> different from > >> - * the new, current state. So allocate a new > >> lp_rast_state object > >> - * and append it to the bin's setup data buffer. > >> - */ > >> - stored = (struct lp_rast_state *) lp_scene_alloc(scene, > >> sizeof *stored); > >> - if (!stored) { > >> - assert(!new_scene); > >> - return FALSE; > >> - } > >> + struct lp_rast_state *stored; > >> + uint i; > >> + > >> + /* The fs state that's been stored in the scene is > >> different from > >> + * the new, current state. So allocate a new lp_rast_state > >> object > >> + * and append it to the bin's setup data buffer. > >> + */ > >> + stored = (struct lp_rast_state *) lp_scene_alloc(scene, > >> sizeof *stored); > >> + if (!stored) { > >> + assert(!new_scene); > >> + return FALSE; > >> + } > >> > >> - memcpy(stored, > >> - &setup->fs.current, > >> - sizeof setup->fs.current); > >> - setup->fs.stored = stored; > >> - > >> - /* The scene now references the textures in the > >> rasterization > >> - * state record. Note that now. > >> - */ > >> - for (i = 0; i < Elements(setup->fs.current_tex); i++) { > >> - if (setup->fs.current_tex[i]) { > >> - if (!lp_scene_add_resource_reference(scene, > >> - > >> > >> setup->fs.current_tex[i], > >> - new_scene)) { > >> - assert(!new_scene); > >> - return FALSE; > >> - } > >> + memcpy(stored, > >> + &setup->fs.current, > >> + sizeof setup->fs.current); > >> + setup->fs.stored = stored; > >> + > >> + /* The scene now references the textures in the > >> rasterization > >> + * state record. Note that now. > >> + */ > >> + for (i = 0; i < Elements(setup->fs.current_tex); i++) { > >> + if (setup->fs.current_tex[i]) { > >> + if (!lp_scene_add_resource_reference(scene, > >> + > >> setup->fs.current_tex[i], > >> + new_scene)) { > >> + assert(!new_scene); > >> + return FALSE; > >> } > >> } > >> } > > > > _______________________________________________ > > mesa-dev mailing list > > mesa-dev@lists.freedesktop.org > > http://lists.freedesktop.org/mailman/listinfo/mesa-dev > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/mesa-dev > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev