On Tue, Oct 20, 2015 at 7:43 AM, Jeff Law <l...@redhat.com> wrote:
> On 10/14/2015 01:15 PM, Bernd Schmidt wrote:
>>
>> On 10/14/2015 07:43 PM, Jeff Law wrote:
>>>
>>> Obviously some pessimization relative to current code is necessary to
>>> fix some of the problems WRT thread safety and avoiding things like
>>> introducing faults in code which did not previously fault.
>>
>>
>> Huh? This patch is purely an (attempt at) optimization, not something
>> that fixes any problems.
>
> Then I must be mentally merging two things Abe has been working on then.
> He's certainly had an if-converter patch that was designed to avoid
> introducing races in code that didn't previously have races.
>
> Looking back through the archives that appears to be the case. His patches
> to avoid racing are for the tree level if converter, not the RTL if
> converter.
Even for the tree level this wasn't the case; he just ran into a bug in
the existing converter that I've since fixed.

> Sigh, sorry for the confusion. It's totally my fault. Assuming Abe doesn't
> have a correctness case at all here, then I don't see any way for the code
> to go forward as-is since it's likely making things significantly worse.
>
>>
>> I can't test valgrind right now, it fails to run on my machine, but I
>> guess it could adapt to allow stores slightly below the stack (maybe
>> warning once)? It seems like a bit of an edge case to worry about, but
>> if supporting it is critical and it can't be changed to adapt to new
>> optimizations, then I think we're probably better off entirely without
>> this scratchpad transformation.
>>
>> Alternatively I can think of a few other possible approaches which
>> wouldn't require this kind of bloat:
>>  * add support for allocating space in the stack redzone. That could be
>>    interesting for the register allocator as well. Would help only
>>    x86_64, but that's a large fraction of gcc's userbase.
>>  * add support for opportunistically finding unused alignment padding
>>    in the existing stack frame. Less likely to work but would produce
>>    better results when it does.
>>  * on embedded targets we probably don't have to worry about valgrind,
>>    so do the optimal (sp - x) thing there
>>  * allocate a single global as the dummy target. Might be more
>>    expensive to load the address on some targets though.
>>  * at least find a way to express costs for this transformation.
>>    Difficult since you don't yet necessarily know if the function is
>>    going to have a stack frame. Hence, IMO this approach is flawed.
>>    (You'll still want cost estimates even when not allocating stuff in
>>    the normal stack frame, because generated code will still execute
>>    between two and four extra instructions).
>
> One could argue these should all be on the table. However, I tend to really
> dislike using area beyond the current stack. I realize it's throw-away
> data, but it just seems like a bad idea to me -- even on embedded targets
> that don't support valgrind.
>
>
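For anyone not following the earlier threads, here is a minimal C sketch of
the scratchpad idea being debated, written as if we used the "single global
as the dummy target" variant from the list above. The function names and
the global 'scratch' are purely illustrative, not taken from Abe's patch.

    /* Hypothetical illustration only, not code from the patch under review.  */

    static int scratch;   /* a "single global as the dummy target" */

    void
    before (int *p, int c, int v)
    {
      if (c)
        *p = v;           /* the store happens only when c is nonzero */
    }

    void
    after (int *p, int c, int v)
    {
      /* If-converted form: the store always executes, but when c is zero
         it lands in the scratchpad rather than in *p, so no store to *p
         is introduced that the original code did not already perform
         (avoiding new data races and faults).  The patch under discussion
         would place the scratchpad just below the stack pointer instead
         of in a global.  */
      int *dest = c ? p : &scratch;
      *dest = v;
    }

Whichever scratchpad location is chosen, the converted form always executes
the extra address-selection and store work, which is why the cost-estimate
point above applies even when nothing is allocated in the normal stack frame.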