I didn't even look at that; I was just curious why the memcmp (which is used a lot in other places) is slow. However, none of the other memcmp calls seem to show up prominently (the cso functions are quite low in profiles, and _mesa_search_program_cache uses memcmp too but it's not that high either). So I guess those functions either aren't called that often, or the sizes they compare are small. So we should maybe just file a gcc bug for memcmp and look at that particular llvmpipe issue again :-).

Roland
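For anyone who wants to sanity-check the memcmp slowness outside Mesa, here is a minimal, hypothetical stand-alone micro-benchmark (not from this thread); the 320-byte size is an arbitrary stand-in for a struct like lp_rast_state, and the iteration count is made up:

/* Hypothetical micro-benchmark sketch: times memcmp() against a plain
 * dword loop on equal buffers, so the compare always scans the full
 * length (the worst case seen in the ipers profile).
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SIZE  320          /* assumed struct size, adjust to taste */
#define ITERS 10000000L

static int cmp_dwords(const void *a, const void *b, unsigned count)
{
   const uint32_t *pa = (const uint32_t *)a;
   const uint32_t *pb = (const uint32_t *)b;
   unsigned i;
   for (i = 0; i < count / 4; i++) {
      if (pa[i] != pb[i])
         return 1;
   }
   return 0;
}

int main(void)
{
   static uint32_t buf1[SIZE / 4], buf2[SIZE / 4];
   volatile int sink = 0;   /* keep the compiler from dropping the loops */
   clock_t t;
   long i;

   /* runtime-dependent contents so the compares can't be constant-folded */
   memset(buf1, (int)time(NULL) | 1, SIZE);
   memcpy(buf2, buf1, SIZE);

   t = clock();
   for (i = 0; i < ITERS; i++)
      sink |= memcmp(buf1, buf2, SIZE);
   printf("memcmp:     %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

   t = clock();
   for (i = 0; i < ITERS; i++)
      sink |= cmp_dwords(buf1, buf2, SIZE);
   printf("dword loop: %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

   return sink & 0;
}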
On 30.06.2011 01:16, Corbin Simpson wrote:
> Okay, so maybe I'm failing to recognize the exact situation here, but
> wouldn't it be possible to mark the FS state with a serial number and
> just compare those? Or are these FS states not CSO-cached?
>
> ~ C.
>
> On Wed, Jun 29, 2011 at 3:44 PM, Roland Scheidegger <srol...@vmware.com> wrote:
>> Actually I ran some numbers here and tried out an optimized struct compare:
>> original ipers: 12.1 fps
>> ajax patch: 15.5 fps
>> optimized struct compare: 16.8 fps
>>
>> This is the function I used for that (just enabled in that lp_setup
>> function):
>>
>> static INLINE int util_cmp_struct(const void *src1, const void *src2,
>>                                   unsigned count)
>> {
>>    /* hmm pointer casting is evil */
>>    const uint32_t *src1_ptr = (const uint32_t *)src1;
>>    const uint32_t *src2_ptr = (const uint32_t *)src2;
>>    unsigned i;
>>    assert(count % 4 == 0);
>>    for (i = 0; i < count/4; i++) {
>>       if (*src1_ptr != *src2_ptr) {
>>          return 1;
>>       }
>>       src1_ptr++;
>>       src2_ptr++;
>>    }
>>    return 0;
>> }
>>
>> (And no, this doesn't use repz cmpsd here.)
>>
>> So, unless I made some mistake, memcmp is just dead slow (*), with most
>> of the slowness probably coming from the bytewise comparison (and
>> apparently I was wrong in assuming the comparison there might never be
>> the same for ipers).
>> Of course, the optimized struct compare relies on the structs really
>> being dword aligned (I think this is always the case), and additionally
>> it relies on the struct size being a whole multiple of dwords - likely
>> the struct needs padding to ensure that (at least I don't think this is
>> always guaranteed for all structs).
>> But since memcmp is used extensively (cso for one), maybe some
>> optimization along these lines might be worth it (though of course for
>> small structs the win isn't going to be as big - and it can't beat repz
>> cmps in code size...).
>>
>> Roland
>>
>> (*) I actually found some references suggesting some implementations do
>> better: they don't just use repz cmpsb, but split the comparison into
>> parts, doing dword (or even qword - for really large structs they could
>> use sse2) comparisons where possible and byte comparisons only for the
>> remaining bytes (and if the compiler did that inline, it would probably
>> know the size at compile time here, hence could leave out much of the
>> code). Of course memcmp requires that the return value isn't just a
>> true or false value, hence more code is needed once an unequal dword is
>> found, though the compiler could optimize that away too where it's not
>> needed. Much the same way memcpy is usually optimized, really, so blame
>> gcc :-).
>>
>> On 29.06.2011 20:33, Roland Scheidegger wrote:
>>> Ohh, that's interesting; you'd think the comparison shouldn't be that
>>> expensive (though I guess in the ipers case the comparison is never
>>> true). memcmp is used quite extensively everywhere. Maybe we could
>>> replace it with something faster: since we only ever care whether the
>>> blocks are the same, never about their lexicographic ordering, and we
>>> always compare whole structs, we could compare dwords instead of bytes
>>> for a 4x speedup. Or isn't that the reason cmpsb instead of cmpsd is
>>> used?
>>> Also, I guess it would help if the values which are more likely to be
>>> unequal came first in the struct (if we can tell that).
>>> Of course, if it's unlikely to be the same as the compared value
>>> anyway, not comparing at all still might be a win (here).
>>>
>>> Roland
>>>
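As an illustration of the split compare described in the footnote (*) above, here is a minimal sketch: dword chunks first, byte compares only for the trailing count % 4 bytes, returning only equal/unequal rather than an ordering. The name util_cmp_struct_any is made up, and the dword part still assumes both pointers are at least 4-byte aligned:

/* Sketch only: like util_cmp_struct above, but without requiring the
 * size to be a whole multiple of dwords.  Leftover bytes (at most 3)
 * are compared individually.
 */
#include <stdint.h>

static inline int
util_cmp_struct_any(const void *src1, const void *src2, unsigned count)
{
   const uint32_t *p1 = (const uint32_t *)src1;
   const uint32_t *p2 = (const uint32_t *)src2;
   const uint8_t *b1, *b2;
   unsigned i;

   /* bulk of the comparison, one dword at a time */
   for (i = 0; i < count / 4; i++) {
      if (p1[i] != p2[i])
         return 1;
   }

   /* byte compares for the (count % 4) trailing bytes */
   b1 = (const uint8_t *)src1 + (count & ~3u);
   b2 = (const uint8_t *)src2 + (count & ~3u);
   for (i = 0; i < (count & 3u); i++) {
      if (b1[i] != b2[i])
         return 1;
   }
   return 0;
}

Since it only answers "same or different", it can stop at the first mismatching dword; a real memcmp has to narrow down which byte differs to produce an ordering.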
>>> On 29.06.2011 19:19, Adam Jackson wrote:
>>>> Perversely, do this by eliminating the comparison between stored and
>>>> current fs state. On ipers, a perf trace showed try_update_scene_state
>>>> using 31% of a CPU, and 98% of that was in 'repz cmpsb', i.e. the
>>>> memcmp. Taking that out takes try_update_scene_state down to 6.5% of
>>>> the profile; more importantly, ipers goes from 10 to 14 fps and gears
>>>> goes from 790 to 830 fps.
>>>>
>>>> Signed-off-by: Adam Jackson <a...@redhat.com>
>>>> ---
>>>>  src/gallium/drivers/llvmpipe/lp_setup.c |   61 ++++++++++++++-----------------
>>>>  1 files changed, 27 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/src/gallium/drivers/llvmpipe/lp_setup.c b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>> index cbe06e5..9118db5 100644
>>>> --- a/src/gallium/drivers/llvmpipe/lp_setup.c
>>>> +++ b/src/gallium/drivers/llvmpipe/lp_setup.c
>>>> @@ -839,42 +839,35 @@ try_update_scene_state( struct lp_setup_context *setup )
>>>>        setup->dirty |= LP_SETUP_NEW_FS;
>>>>     }
>>>>
>>>> -
>>>>     if (setup->dirty & LP_SETUP_NEW_FS) {
>>>> -      if (!setup->fs.stored ||
>>>> -          memcmp(setup->fs.stored,
>>>> -                 &setup->fs.current,
>>>> -                 sizeof setup->fs.current) != 0)
>>>> -      {
>>>> -         struct lp_rast_state *stored;
>>>> -         uint i;
>>>> -
>>>> -         /* The fs state that's been stored in the scene is different from
>>>> -          * the new, current state.  So allocate a new lp_rast_state object
>>>> -          * and append it to the bin's setup data buffer.
>>>> -          */
>>>> -         stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>> -         if (!stored) {
>>>> -            assert(!new_scene);
>>>> -            return FALSE;
>>>> -         }
>>>> +      struct lp_rast_state *stored;
>>>> +      uint i;
>>>> +
>>>> +      /* The fs state that's been stored in the scene is different from
>>>> +       * the new, current state.  So allocate a new lp_rast_state object
>>>> +       * and append it to the bin's setup data buffer.
>>>> +       */
>>>> +      stored = (struct lp_rast_state *) lp_scene_alloc(scene, sizeof *stored);
>>>> +      if (!stored) {
>>>> +         assert(!new_scene);
>>>> +         return FALSE;
>>>> +      }
>>>>
>>>> -         memcpy(stored,
>>>> -                &setup->fs.current,
>>>> -                sizeof setup->fs.current);
>>>> -         setup->fs.stored = stored;
>>>> -
>>>> -         /* The scene now references the textures in the rasterization
>>>> -          * state record.  Note that now.
>>>> -          */
>>>> -         for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>> -            if (setup->fs.current_tex[i]) {
>>>> -               if (!lp_scene_add_resource_reference(scene,
>>>> -                                                    setup->fs.current_tex[i],
>>>> -                                                    new_scene)) {
>>>> -                  assert(!new_scene);
>>>> -                  return FALSE;
>>>> -               }
>>>> +      memcpy(stored,
>>>> +             &setup->fs.current,
>>>> +             sizeof setup->fs.current);
>>>> +      setup->fs.stored = stored;
>>>> +
>>>> +      /* The scene now references the textures in the rasterization
>>>> +       * state record.  Note that now.
>>>> +       */
>>>> +      for (i = 0; i < Elements(setup->fs.current_tex); i++) {
>>>> +         if (setup->fs.current_tex[i]) {
>>>> +            if (!lp_scene_add_resource_reference(scene,
>>>> +                                                 setup->fs.current_tex[i],
>>>> +                                                 new_scene)) {
>>>> +               assert(!new_scene);
>>>> +               return FALSE;
>>>> +            }
>>>>          }
>>>>       }
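On Corbin's serial-number question upthread, a minimal sketch of what that could look like, assuming a hypothetical generation counter that gets bumped wherever the fs state is rebuilt. None of these names (fs_state_id, bump_fs_serial, fs_state_changed) exist in lp_setup.c today:

/* Hypothetical sketch only: stamp each fs state with a serial number
 * when it is (re)built, then compare the stamps instead of memcmp'ing
 * the whole struct.
 */
#include <stdint.h>

struct fs_state_id {
   uint64_t serial;   /* bumped every time the state it tags is rebuilt */
};

static uint64_t fs_serial_counter;

static void
bump_fs_serial(struct fs_state_id *id)
{
   /* 64 bits, so wraparound is not a practical concern */
   id->serial = ++fs_serial_counter;
}

static int
fs_state_changed(const struct fs_state_id *stored,
                 const struct fs_state_id *current)
{
   /* O(1) instead of a bytewise compare over the whole struct */
   return stored->serial != current->serial;
}

The catch with any scheme like this is that the serial has to be bumped at every point the state can actually change; miss one and you render with stale state, which is presumably why the question of whether these FS states are CSO-cached matters.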