https://llvm.org/bugs/show_bug.cgi?id=25219
Bug ID: 25219
Summary: [ppc] LLVM built 470.lbm is 9.5% slower than gcc on power8
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Backend: PowerPC
Assignee: unassignedb...@nondot.org
Reporter: car...@google.com
CC: llvm-bugs@lists.llvm.org
Classification: Unclassified

The following compiler options are used:

-fno-strict-aliasing -O2 -m64 -mvsx -mcpu=power8 -ffp-contract=fast

More than 98% of the execution time is spent in the function LBM_performStreamCollide, which contains a single loop. The relevant code is:

void LBM_performStreamCollide( LBM_Grid srcGrid, LBM_Grid dstGrid ) {
  for (...) {
    ...
    ux = + SRC_E ( srcGrid ) - SRC_W ( srcGrid )
         + SRC_NE( srcGrid ) - SRC_NW( srcGrid )
         + SRC_SE( srcGrid ) - SRC_SW( srcGrid )
         + SRC_ET( srcGrid ) + SRC_EB( srcGrid )
         - SRC_WT( srcGrid ) - SRC_WB( srcGrid );
    uy = + SRC_N ( srcGrid ) - SRC_S ( srcGrid )
         + SRC_NE( srcGrid ) + SRC_NW( srcGrid )
         - SRC_SE( srcGrid ) - SRC_SW( srcGrid )
         + SRC_NT( srcGrid ) + SRC_NB( srcGrid )
         - SRC_ST( srcGrid ) - SRC_SB( srcGrid );
    uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
         + SRC_NT( srcGrid ) - SRC_NB( srcGrid )
         + SRC_ST( srcGrid ) - SRC_SB( srcGrid )
         + SRC_ET( srcGrid ) - SRC_EB( srcGrid )
         + SRC_WT( srcGrid ) - SRC_WB( srcGrid );

    ux /= rho;
    uy /= rho;
    uz /= rho;

    if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
      ux = 0.005;
      uy = 0.002;
      uz = 0.000;
    }
    ...
  }
}

LLVM transforms the code into:

if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
  ux = + SRC_E ( srcGrid ) - SRC_W ( srcGrid )
       + SRC_NE( srcGrid ) - SRC_NW( srcGrid )
       + SRC_SE( srcGrid ) - SRC_SW( srcGrid )
       + SRC_ET( srcGrid ) + SRC_EB( srcGrid )
       - SRC_WT( srcGrid ) - SRC_WB( srcGrid );
  ux /= rho;
} else
  ux = 0.005;

if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
  uy = + SRC_N ( srcGrid ) - SRC_S ( srcGrid )
       + SRC_NE( srcGrid ) + SRC_NW( srcGrid )
       - SRC_SE( srcGrid ) - SRC_SW( srcGrid )
       + SRC_NT( srcGrid ) + SRC_NB( srcGrid )
       - SRC_ST( srcGrid ) - SRC_SB( srcGrid );
  uy /= rho;
} else
  uy = 0.002;

if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
  uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
       + SRC_NT( srcGrid ) - SRC_NB( srcGrid )
       + SRC_ST( srcGrid ) - SRC_SB( srcGrid )
       + SRC_ET( srcGrid ) - SRC_EB( srcGrid )
       + SRC_WT( srcGrid ) - SRC_WB( srcGrid );
  uz /= rho;
} else
  uz = 0.000;

Note that each of the following floating-point expressions forms a dependence chain of 10 floating-point instructions (9 additions/subtractions and 1 division):

uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
     + SRC_NT( srcGrid ) - SRC_NB( srcGrid )
     + SRC_ST( srcGrid ) - SRC_SB( srcGrid )
     + SRC_ET( srcGrid ) - SRC_EB( srcGrid )
     + SRC_WT( srcGrid ) - SRC_WB( srcGrid );
uz /= rho;

On power8 each FP instruction has a latency of 6 or more cycles, so each of the three dependence chains needs at least 60 cycles to execute. GCC doesn't do the control flow transform, so it can interleave the three dependence chains, and the resulting code is much faster.
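The latency argument above can be made concrete with a small back-of-the-envelope model. This is only a sketch, not a POWER8 simulator: it assumes a uniform latency per FP instruction (6 cycles, the lower bound quoted in the report), that at most one instruction issues per cycle, and that interleaved chains issue in order, round-robin. The function names serialized_cycles and interleaved_cycles are hypothetical and exist only for this illustration.

```c
/* Idealized latency model (assumptions, not measured hardware behavior):
 *  - every FP instruction has the same latency, in cycles;
 *  - an instruction cannot start before its predecessor in the same
 *    chain has finished;
 *  - at most one instruction issues per cycle. */

/* Cycles needed when the chains run strictly one after another,
 * with no overlap between them. */
unsigned serialized_cycles(unsigned chains, unsigned ops_per_chain,
                           unsigned latency)
{
    return chains * ops_per_chain * latency;
}

/* Cycles needed when the chains are interleaved round-robin:
 * chain k issues its first instruction at cycle k, and consecutive
 * instructions of one chain are spaced by max(latency, chains). */
unsigned interleaved_cycles(unsigned chains, unsigned ops_per_chain,
                            unsigned latency)
{
    unsigned stride;

    if (chains == 0 || ops_per_chain == 0)
        return 0;

    stride = latency > chains ? latency : chains;

    /* The last instruction of the last chain issues at
     * (ops_per_chain - 1) * stride + (chains - 1) and then takes
     * `latency` cycles to complete. */
    return (ops_per_chain - 1) * stride + (chains - 1) + latency;
}
```

Under these assumptions, serialized_cycles(3, 10, 6) gives 180 cycles and interleaved_cycles(3, 10, 6) gives 62 cycles. These are outputs of the idealized model, not measurements: real out-of-order hardware may overlap part of the work across the three branches, so the model shows the direction of the effect rather than its size.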