Tapani Pälli <tapani.pa...@intel.com> writes: > Hello; > > On 07/25/2017 05:17 PM, Marathe, Yogesh wrote: >> Hi Matt, Sorry for late reply, please see below. >> >>> -----Original Message----- >>> From: mesa-dev [mailto:mesa-dev-boun...@lists.freedesktop.org] On Behalf >>> Of Matt Turner >>> Sent: Saturday, July 22, 2017 12:12 AM >>> To: Muthukumar, Aravindan <aravindan.muthuku...@intel.com> >>> Cc: mesa-dev@lists.freedesktop.org; Marathe, Yogesh >>> <yogesh.mara...@intel.com> >>> Subject: Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks >>> >>> On 07/21, aravindan.muthuku...@intel.com wrote: >>>> From: Aravindan Muthukumar <aravindan.muthuku...@intel.com> >>>> >>>> This patch improves CPI Rate(Cycles per Instruction) and branch miss >>>> predict for i965. The function check_state() was showing CPI retired rate. >>>> >>>> Performance stats with android: >>>> - CPI retired lowered by 28% (lower is better) >>>> - Branch missprediction lowered by 13% (lower is better) >>>> - 3DMark improved by 2% >>>> >>>> The dissassembly doesn't show difference, although above results were >>>> observed with patch. >>> >> >> Yes this is true for V3 where we removed the function based on a review >> comment. >> >>> If there's no difference in the assembly then whatever measurements you have >>> taken must be noise. >>> >> >> No that's not guaranteed either. Lot of things still depend on how >> instructions are >> aligned, sometimes even changing linking order gives different results where >> disassemblies of individual functions remain same. >> >>> I applied the patch and inspected brw_state_upload.o. There are assembly >>> differences. I can produce the same assembly as this patch by just pulling >>> the if >>> statement out of check_and_emit_atom() and into the caller. The replacement >>> of check_state() with a macro is completely unnecessary. >>> >>> >>> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c >>> b/src/mesa/drivers/dri/i965/brw_state_upload.c >>> index acaa97ee7d..b163e1490d 100644 >>> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c >>> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c >>> @@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw, } >>> >>> static inline void >>> -check_and_emit_atom(struct brw_context *brw, >>> - struct brw_state_flags *state, >>> - const struct brw_tracked_state *atom) >>> +emit_atom(struct brw_context *brw, >>> + struct brw_state_flags *state, >>> + const struct brw_tracked_state *atom) >>> { >>> - if (check_state(state, &atom->dirty)) { >>> - atom->emit(brw); >>> - merge_ctx_state(brw, state); >>> - } >>> + atom->emit(brw); >>> + merge_ctx_state(brw, state); >>> } >>> >>> static inline void >>> @@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw, >>> const struct brw_tracked_state *atom = &atoms[i]; >>> struct brw_state_flags generated; >>> >>> - check_and_emit_atom(brw, &state, atom); >>> + if (check_state(&state, &atom->dirty)) { >>> + emit_atom(brw, &state, atom); >>> + } >>> >>> accumulate_state(&examined, &atom->dirty); >>> >>> @@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw, >>> for (i = 0; i < num_atoms; i++) { >>> const struct brw_tracked_state *atom = &atoms[i]; >>> >>> - check_and_emit_atom(brw, &state, atom); >>> + if (check_state(&state, &atom->dirty)) { >>> + emit_atom(brw, &state, atom); >>> + } >>> } >>> } >>> >>> >>> With that said, the assembly differences are not understandable to me. >>> Why should extracting an if statement from a static inline function into >>> the caller >>> of that function cause any difference whatsoever? >> >> Agreed, it shouldn't in case of static inline. >> >>> >>> I'm viewing the assembly differences with: >>> >>> wdiff -n \ >>> -w $'\033[30;41m' -x $'\033[0m' \ >>> -y $'\033[30;42m' -z $'\033[0m' \ >>> <(objdump -d brw_state_upload.o.before | sed -e 's/^.*\t//') \ >>> <(objdump -d brw_state_upload.o.wtf | sed -e 's/^.*\t//') | less -R >>> >>> and the only real difference is the movement of some call and jmp >>> instructions. >>> >>> I cannot take the patch without some reasonable explanation for the change. >> >> Ok I think this has been discussed already and we agree that there is no big >> visible difference >> in disassembly which can be pointed out for improvement. Although, as you >> said this movement >> of instructions can cause this. If with this patch instructions get cache >> aligned that too can show >> improvement. This is a busy function with bad CPI. Hence chosen for >> optimization. Branch miss >> predict is another metric. Do we want to consider all these or just >> disassembly? > > I don't have a reasonable explanation to give but I made some benchmark > runs today (3DMark "Ice Storm Unlimited" test ) and I'm getting > consistently 1-2% better results with the patch. I also tried the > mentioned modification "pulling the if statement out of > check_and_emit_atom() and into the caller" and it has same performance > benefits. What I can tell from assembly dump is that > brw_upload_render_state function becomes slightly shorter, there's > movement with call and jmp and dump with the patch has less mov and nopl > calls in total.
FWIW, I see no difference in the assembly emitted with clang-4.0.1 for either of the patches suggested in this thread, although I see my old gcc-4.8 ignore the inline request for check_and_emit_atom. That's really all that makes sense to me here, that the inlining is getting ignored. Someone should check that the same performance increase comes from s/inline/ALWAYS_INLINE/ for check_and_emit_atom. >> Let me make one more attempt, we clearly see icache misses for >> brw_upload_pipeline_state also >> reduced with patch against without patch. Do you think that too is noise >> with *.o.wtf? >> I would be happy to learn if that's the case to understand all this better. >> >> I believe there are two reasons why this should go in >> 1. It consistently benefits android (we are ok to rename it), helps reduce >> overhead >> 2. It doesn't harm other platforms >> >> We are ok to drop it if above two claims can be negated with someone else >> other than us testing >> this. Otherwise can you please reconsider? >> _______________________________________________ >> mesa-dev mailing list >> mesa-dev@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >> > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev