This patch is : reviewed-by: Vincent Lejeune<vljn at ovi.com>
----- Mail original ----- > De : Tom Stellard <t...@stellard.net> > À : Vincent Lejeune <v...@ovi.com> > Cc : "mesa-dev@lists.freedesktop.org" <mesa-dev@lists.freedesktop.org>; > "llvm-comm...@cs.uiuc.edu" <llvm-comm...@cs.uiuc.edu>; Tom Stellard > <thomas.stell...@amd.com> > Envoyé le : Jeudi 14 novembre 2013 1h53 > Objet : Re: [PATCH] R600: Make sure OQAP defs and uses happen in the same > clause > > Hi Vincent, > > I discovered a bug in the previous patch. Here is an updated versions. > > -Tom > > On Tue, Nov 12, 2013 at 03:01:42PM -0800, Tom Stellard wrote: >> Hi Vincent, >> >> Here is an updated patch where I added a call to >> SubstituteKCacheBank() in canClauseLocalKillFitInClause() This should >> prevent OQAP uses and defs from being split because of constant bank >> limitations. >> >> Maybe we can leave the ScheduleDAGMutation optimization as a future >> TODO. >> >> -Tom >> >> On Sun, Nov 03, 2013 at 10:19:16AM -0800, Vincent Lejeune wrote: >> > I have put some comments below but otherwise the patch is >> > reviewed-by: Vincent Lejeune <vljn at ovi.com> >> > >> > >> > >-------------- next part -------------- >> > >>From 2eb4673e3184af0e077cbe30a594602441e8d98e Mon Sep 17 > 00:00:00 2001 >From: Tom Stellard <thomas.stellard at amd.com> >> > >Date: Thu, 5 Sep 2013 08:59:32 -0700 >> > >Subject: [PATCH] R600: Fix scheduling of instructions that use the > LDS output >> > > queue >> > > >> > >The LDS output queue is accessed via the OQAP register. The OQAP >> > >register cannot be live across clauses, so if value is written to > the >> > >output queue, it must be retrieved before the end of the clause. >> > >With the machine scheduler, we cannot statisfy this constraint, > because >> > >it lacks proper alias analysis and it will mark some LDS accesses > as >> > >having a chain dependency on vertex fetches. Since vertex fetches >> > >> > We can customize the dependency graph before machine scheduling takes > place, >> > using ScheduleDAGMutation. >> > I already wrote some code to break artificial dependencies between > vector >> > subregister read/write here : >> > > http://cgit.freedesktop.org/~vlj/llvm/commit/?h=vliw5&id=e91b16a22845d0a80ed348f158ae7ab293e003a8 >> > While I'm expecting from Matthias Braun's Subregister patches > to be upstreamed >> > to obsolete most of this patch except tests, it can be reworked so > that >> > it'll parse all MEM dependency, and remove the ones between > instructions >> > touching different memory pool (like VTX_FETCH and LDS_READ). >> > >> > >require a new clauses, the dependency may end up spiltting OQAP > uses and >> > >defs so the end up in different clauses. See the > lds-output-queue.ll >> > >test for a more detailed explanation. >> > > >> > >To work around this issue, we now combine the LDS read and the > OQAP >> > >copy into one instruction and expand it after register allocation. >> > > >> > >This patch also adds some checks to the EmitClauseMarker pass, so > that >> > >it doesn't end a clause with a value still in the output queue > and >> > >removes AR.X and OQAP handling from the scheduler (AR.X uses and > defs >> > >were already being expanded post-RA, so the scheduler will never > see >> > >them). >> > >--- >> > > lib/Target/R600/R600EmitClauseMarkers.cpp | 52 ++++++++++++++ >> > > lib/Target/R600/R600ExpandSpecialInstrs.cpp | 17 +++++ >> > > lib/Target/R600/R600ISelLowering.cpp | 20 +++--- >> > > lib/Target/R600/R600InstrInfo.cpp | 8 +++ >> > > lib/Target/R600/R600InstrInfo.h | 2 + >> > > lib/Target/R600/R600MachineScheduler.cpp | 32 --------- >> > > lib/Target/R600/R600MachineScheduler.h | 2 - >> > > lib/Target/R600/R600RegisterInfo.cpp | 13 ++++ >> > > lib/Target/R600/R600RegisterInfo.h | 2 + >> > > test/CodeGen/R600/lds-output-queue.ll | 99 > +++++++++++++++++++++++++++ >> > > test/CodeGen/R600/local-memory-two-objects.ll | 8 ++- >> > > 11 files changed, 206 insertions(+), 49 deletions(-) >> > > create mode 100644 test/CodeGen/R600/lds-output-queue.ll >> > > >> > >diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp > b/lib/Target/R600/R600EmitClauseMarkers.cpp >> > >> > >> > >+ bool canClauseLocalKillFitInClause( >> > >+ unsigned AluInstCount, >> > >+ MachineBasicBlock::iterator Def, >> > >+ MachineBasicBlock::iterator BBEnd) > { >> > >+ const R600RegisterInfo &TRI = TII->getRegisterInfo(); >> > >+ for (MachineInstr::const_mop_iterator >> > >+ MOI = Def->operands_begin(), >> > >+ MOE = Def->operands_end(); MOI != MOE; ++MOI) > { >> > >+ if (!MOI->isReg() || !MOI->isDef() || >> > >+ TRI.isPhysRegLiveAcrossClauses(MOI->getReg())) >> > >+ continue; >> > >+ >> > >+ // Def defines a clause local register, so check that its > use will fit >> > >+ // in the clause. >> > >+ unsigned LastUseCount = 0; >> > >+ for (MachineBasicBlock::iterator UseI = Def; UseI != BBEnd; > ++UseI) { >> > >+ AluInstCount += OccupiedDwords(UseI); >> > >+ // We have reached the maximum instruction limit before > finding the >> > >+ // use that kills this register, so we cannot use this > def in the >> > >+ // current clause. >> > >+ if (AluInstCount >= TII->getMaxAlusPerClause()) >> > >+ return false; >> > >+ >> > >+ // Register kill flags have been cleared by the time we > get to this >> > >+ // pass, but it is safe to assume that all uses of this > register >> > >+ // occur in the same basic block as its definition, > because >> > >+ // it is illegal for the scheduler to schedule them in >> > >+ // different blocks. >> > >+ if (UseI->findRegisterUseOperandIdx(MOI->getReg())) >> > >+ LastUseCount = AluInstCount; >> > >+ >> > >+ if (UseI != Def && > UseI->findRegisterDefOperandIdx(MOI->getReg()) != -1) >> > >+ break; >> > >+ } >> > >+ if (LastUseCount) >> > >+ return LastUseCount <= TII->getMaxAlusPerClause(); >> > >+ llvm_unreachable("Clause local register live at end of > clause."); >> > >+ } >> > >+ return true; >> > >+ } >> > >> > This function does not check if current clause can hold all constant > bank. >> > I think it's likely to be rare for a clause to be split because of > constant bank limitations, >> > but it would be better to have an assertion failure in such case to > make debugging easier. >> > For instance if the SubstituteKCacheBank return false, you can check > that there is no lds >> > use still pending. >> > >> > >> > ----- Mail original ----- >> > > De : Tom Stellard <t...@stellard.net> >> > > À : Vincent Lejeune <v...@ovi.com> >> > > Cc : "llvm-comm...@cs.uiuc.edu" > <llvm-comm...@cs.uiuc.edu>; "mesa-dev@lists.freedesktop.org" > <mesa-dev@lists.freedesktop.org>; Tom Stellard > <thomas.stell...@amd.com> >> > > Envoyé le : Mercredi 30 octobre 2013 17h46 >> > > Objet : Re: [PATCH] R600: Make sure OQAP defs and uses happen in > the same clause >> > > >> > > Hi Vincent, >> > > >> > > It turns out that it's not possible to correctly schedule > uses and defs >> > > of the OQAP register without proper alias analysis in the > MachineScheduler. See >> > > the explanation in the lds-output-queue.ll test case. >> > > >> > > Here is an updated patch that fixes all the outstanding LDS > scheduling >> > > bugs that I am aware of. >> > > >> > > -Tom >> > > >> > > On Fri, Oct 25, 2013 at 05:42:13AM -0700, Vincent Lejeune wrote: >> > >> This patch should work when checking than no OQAP is used > before beeing >> > > queued, assuming that a value in OQAP is consumed >> > >> and cannot be read twice. However I'm not sure I cover > all LDS >> > > instructions that queues a value, I only use LDS_RET_READ in > switch case. >> > >> >> > >> Vincent >> > >> >> > >> >> > >> >> > >> ----- Mail original ----- >> > >> > De : Tom Stellard <t...@stellard.net> >> > >> > À : Vincent Lejeune <v...@ovi.com> >> > >> > Cc : "llvm-comm...@cs.uiuc.edu" >> > > <llvm-comm...@cs.uiuc.edu>; > "mesa-dev@lists.freedesktop.org" >> > > <mesa-dev@lists.freedesktop.org>; Tom Stellard >> > > <thomas.stell...@amd.com> >> > >> > Envoyé le : Mardi 22 octobre 2013 23h20 >> > >> > Objet : Re: [PATCH] R600: Make sure OQAP defs and uses > happen in the >> > > same clause >> > >> > >> > >> > Hi Vincent, >> > >> > >> > >> > Here is an updated patch. I wasn't sure where to > put the >> > > assertion to >> > >> > check that UnscheduledNoLiveOut{Defs,Uses} is > empty when >> > > switching to a >> > >> > new clause. I tried adding it to > R600SchedStartegy::schedNode() >> > > behind >> > >> > the if (NextInstKind != CurInstKind) condition, but it > always failed. >> > >> > Any suggestions on where I should but it? >> > >> > >> > >> > -Tom >> > >> > >> > >> > >> > >> > On Mon, Oct 21, 2013 at 12:40:28PM -0700, Vincent > Lejeune wrote: >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> ----- Mail original ----- >> > >> >> > De : Tom Stellard <t...@stellard.net> >> > >> >> > À : llvm-comm...@cs.uiuc.edu >> > >> >> > Cc : mesa-dev@lists.freedesktop.org; Tom > Stellard >> > >> > <thomas.stell...@amd.com> >> > >> >> > Envoyé le : Vendredi 11 octobre 2013 20h10 >> > >> >> > Objet : [PATCH] R600: Make sure OQAP defs and > uses happen in >> > > the same >> > >> > clause >> > >> >> > >> > >> >> > From: Tom Stellard > <thomas.stell...@amd.com> >> > >> >> > >> > >> >> > Reading the special OQAP register pops the > top value off the >> > > LDS >> > >> >> > input queue and returns it to the > instruction. This queue >> > > is >> > >> >> > invalidated at the end of an ALU clause and > leaving values >> > > in the >> > >> > queue >> > >> >> > can lead to GPU hangs. This means that if we > load a value >> > > into the >> > >> > queue, >> > >> >> > we must use it before the end of the clause. >> > >> >> > >> > >> >> > This fixes some hangs in the OpenCV test > suite. >> > >> >> > --- >> > >> >> > lib/Target/R600/R600MachineScheduler.cpp | 25 > >> > >> > +++++++++++++------------ >> > >> >> > lib/Target/R600/R600MachineScheduler.h | 4 > ++-- >> > >> >> > test/CodeGen/R600/lds-input-queue.ll | 26 > >> > >> > ++++++++++++++++++++++++++ >> > >> >> > 3 files changed, 41 insertions(+), 14 > deletions(-) >> > >> >> > create mode 100644 > test/CodeGen/R600/lds-input-queue.ll >> > >> >> > >> > >> >> > diff --git > a/lib/Target/R600/R600MachineScheduler.cpp >> > >> >> > b/lib/Target/R600/R600MachineScheduler.cpp >> > >> >> > index 6c26d9e..611b7f4 100644 >> > >> >> > --- > a/lib/Target/R600/R600MachineScheduler.cpp >> > >> >> > +++ > b/lib/Target/R600/R600MachineScheduler.cpp >> > >> >> > @@ -93,11 +93,12 @@ SUnit* > R600SchedStrategy::pickNode(bool >> > >> > &IsTopNode) >> > >> >> > { >> > >> >> > } >> > >> >> > >> > >> >> > >> > >> >> > - // We want to scheduled AR defs as soon as > possible to >> > > make sure >> > >> > they >> > >> >> > aren't >> > >> >> > - // put in a different ALU clause from > their uses. >> > >> >> > - if (!SU && > !UnscheduledARDefs.empty()) { >> > >> >> > - SU = UnscheduledARDefs[0]; >> > >> >> > - > UnscheduledARDefs.erase(UnscheduledARDefs.begin()); >> > >> >> > + // We want to scheduled defs that cannot > be live outside >> > > of this >> > >> > clause >> > >> >> > + // as soon as possible to make sure they > aren't put >> > > in a >> > >> > different >> > >> >> > + // ALU clause from their uses. >> > >> >> > + if (!SU && > !UnscheduledNoLiveOutDefs.empty()) >> > > { >> > >> >> > + SU = UnscheduledNoLiveOutDefs[0]; >> > >> >> > + >> > >> > > UnscheduledNoLiveOutDefs.erase(UnscheduledNoLiveOutDefs.begin()); >> > >> >> > NextInstKind = IDAlu; >> > >> >> > } >> > >> >> > >> > >> >> > @@ -132,9 +133,9 @@ SUnit* > R600SchedStrategy::pickNode(bool >> > >> > &IsTopNode) >> > >> >> > { >> > >> >> > >> > >> >> > // We want to schedule the AR uses as late > as possible to >> > > make sure >> > >> > that >> > >> >> > // the AR defs have been released. >> > >> >> > - if (!SU && > !UnscheduledARUses.empty()) { >> > >> >> > - SU = UnscheduledARUses[0]; >> > >> >> > - > UnscheduledARUses.erase(UnscheduledARUses.begin()); >> > >> >> > + if (!SU && > !UnscheduledNoLiveOutUses.empty()) >> > > { >> > >> >> > + SU = UnscheduledNoLiveOutUses[0]; >> > >> >> > + >> > >> > > UnscheduledNoLiveOutUses.erase(UnscheduledNoLiveOutUses.begin()); >> > >> >> >> > >> >> Can we use std::queue<SUnit*> instead of a > std::vector for >> > >> > UnscheduledNoLiveOutUses ? >> > >> >> I had to use a vector because I needed to be able > to pop non >> > > topmost SUnit >> > >> > in some case >> > >> >> (to fit Instruction Group const read limitation) > but I would >> > > rather avoid >> > >> > erase(iterator) call >> > >> >> when possible. >> > >> >> >> > >> >> >> > >> >> > NextInstKind = IDAlu; >> > >> >> > } >> > >> >> > >> > >> >> > @@ -217,15 +218,15 @@ void >> > > R600SchedStrategy::releaseBottomNode(SUnit >> > >> > *SU) >> > >> >> > { >> > >> >> > >> > >> >> > int IK = getInstKind(SU); >> > >> >> > >> > >> >> > - // Check for AR register defines >> > >> >> > + // Check for registers that do not live > across ALU >> > > clauses. >> > >> >> > for (MachineInstr::const_mop_iterator I = >> > >> >> > SU->getInstr()->operands_begin(), >> > >> >> > E = >> > >> >> > SU->getInstr()->operands_end(); >> > >> >> > I != > E; ++I) { >> > >> >> > - if (I->isReg() && > I->getReg() == >> > > AMDGPU::AR_X) >> > >> > { >> > >> >> > + if (I->isReg() && > (I->getReg() == >> > > AMDGPU::AR_X || >> > >> >> > I->getReg() == AMDGPU::OQAP)) { >> > >> >> > if (I->isDef()) { >> > >> >> > - UnscheduledARDefs.push_back(SU); >> > >> >> > + > UnscheduledNoLiveOutDefs.push_back(SU); >> > >> >> > } else { >> > >> >> > - UnscheduledARUses.push_back(SU); >> > >> >> > + > UnscheduledNoLiveOutUses.push_back(SU); >> > >> >> > } >> > >> >> > return; >> > >> >> > } >> > >> >> > diff --git > a/lib/Target/R600/R600MachineScheduler.h >> > >> >> > b/lib/Target/R600/R600MachineScheduler.h >> > >> >> > index 0a6f120..db2e188 100644 >> > >> >> > --- a/lib/Target/R600/R600MachineScheduler.h >> > >> >> > +++ b/lib/Target/R600/R600MachineScheduler.h >> > >> >> > @@ -53,8 +53,8 @@ class R600SchedStrategy : > public >> > >> > MachineSchedStrategy { >> > >> >> > >> > >> >> > std::vector<SUnit *> > Available[IDLast], >> > > Pending[IDLast]; >> > >> >> > std::vector<SUnit *> > AvailableAlus[AluLast]; >> > >> >> > - std::vector<SUnit *> > UnscheduledARDefs; >> > >> >> > - std::vector<SUnit *> > UnscheduledARUses; >> > >> >> > + std::vector<SUnit *> > UnscheduledNoLiveOutDefs; >> > >> >> > + std::vector<SUnit *> > UnscheduledNoLiveOutUses; >> > >> >> > std::vector<SUnit *> > PhysicalRegCopy; >> > >> >> > >> > >> >> > InstKind CurInstKind; >> > >> >> > diff --git > a/test/CodeGen/R600/lds-input-queue.ll >> > >> >> > b/test/CodeGen/R600/lds-input-queue.ll >> > >> >> > new file mode 100644 >> > >> >> > index 0000000..548b41c >> > >> >> > --- /dev/null >> > >> >> > +++ b/test/CodeGen/R600/lds-input-queue.ll >> > >> >> > @@ -0,0 +1,26 @@ >> > >> >> > +; RUN: llc < %s -march=r600 -mcpu=redwood > | FileCheck %s >> > >> >> >> > >> >> Does the test work with -verify-machineinstrs flag > set ? >> > >> >> >> > >> >> > +; >> > >> >> > +; This test checks that the lds input queue > will is empty >> > > at the end >> > >> > of >> > >> >> > +; the ALU clause. >> > >> >> > + >> > >> >> > +; CHECK-LABEL: @lds_input_queue >> > >> >> > +; CHECK: LDS_READ_RET * OQAP >> > >> >> > +; CHECK-NOT: ALU clause >> > >> >> > +; CHECK: MOV > T{{[0-9]\.[XYZW]}}, OQAP >> > >> >> > + >> > >> >> > +@local_mem = internal addrspace(3) > unnamed_addr global [2 x >> > > i32] [i32 >> > >> > 1, i32 >> > >> >> > 2], align 4 >> > >> >> > + >> > >> >> > +define void @lds_input_queue(i32 > addrspace(1)* %out, i32 >> > >> > addrspace(1)* %in, i32 >> > >> >> > %index) { >> > >> >> > +entry: >> > >> >> > + %0 = getelementptr inbounds [2 x i32] > addrspace(3)* >> > > @local_mem, i32 >> > >> > 0, i32 >> > >> >> > %index >> > >> >> > + %1 = load i32 addrspace(3)* %0 >> > >> >> > + call void @llvm.AMDGPU.barrier.local() >> > >> >> > + >> > >> >> > + ; This will start a new clause for the > vertex fetch >> > >> >> > + %2 = load i32 addrspace(1)* %in >> > >> >> > + %3 = add i32 %1, %2 >> > >> >> > + store i32 %3, i32 addrspace(1)* %out >> > >> >> > + ret void >> > >> >> > +} >> > >> >> > + >> > >> >> > +declare void @llvm.AMDGPU.barrier.local() >> > >> >> > -- >> > >> >> > 1.7.11.4 >> > >> >> > >> > >> >> > > _______________________________________________ >> > >> >> > llvm-commits mailing list >> > >> >> > llvm-comm...@cs.uiuc.edu >> > >> >> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits >> > >> >> > >> > >> > >> > > >> > >> From 956c9986fcda84c4a4b0a555432319bbdfd98cfc Mon Sep 17 > 00:00:00 2001 >> > >> From: Vincent Lejeune <v...@ovi.com> >> > >> Date: Fri, 25 Oct 2013 14:32:35 +0200 >> > >> Subject: [PATCH] R600: Uses Assert for OQAP >> > >> >> > >> --- >> > >> lib/Target/R600/R600EmitClauseMarkers.cpp | 31 >> > > +++++++++++++++++++++++++++++++ >> > >> lib/Target/R600/R600ISelLowering.cpp | 2 ++ >> > >> 2 files changed, 33 insertions(+) >> > >> >> > >> diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp >> > > b/lib/Target/R600/R600EmitClauseMarkers.cpp >> > >> index 928c0e3..9951b15 100644 >> > >> --- a/lib/Target/R600/R600EmitClauseMarkers.cpp >> > >> +++ b/lib/Target/R600/R600EmitClauseMarkers.cpp >> > >> @@ -33,6 +33,8 @@ private: >> > >> static char ID; >> > >> const R600InstrInfo *TII; >> > >> int Address; >> > >> + int OQAPLevel; >> > >> + bool ArDefined; >> > >> >> > >> unsigned OccupiedDwords(MachineInstr *MI) const { >> > >> switch (MI->getOpcode()) { >> > >> @@ -160,8 +162,36 @@ private: >> > >> return true; >> > >> } >> > >> >> > >> + void AssertARAndOQAPCorrectness(MachineInstr *MI) { >> > >> + for (MachineInstr::mop_iterator MOp = > MI->operands_begin(), >> > >> + E = MI->operands_end(); MOp != E; MOp++) { >> > >> + MachineOperand &MO = *MOp; >> > >> + if (!MO.isReg() || MO.isDef()) >> > >> + continue; >> > >> + switch (MO.getReg()) { >> > >> + default: break; >> > >> + case AMDGPU::OQAP: >> > >> + OQAPLevel--; >> > >> + assert (OQAPLevel >= 0 && "OQAP not > defined in >> > > this clause !"); >> > >> + break; >> > >> + } >> > >> + } >> > >> + switch (MI->getOpcode()) { >> > >> + default: break; >> > >> + case AMDGPU::MOVA_INT_eg: >> > >> + ArDefined = true; >> > >> + break; >> > >> + case AMDGPU::LDS_READ_RET: >> > >> + OQAPLevel++; >> > >> + break; >> > >> + } >> > >> + } >> > >> + >> > >> MachineBasicBlock::iterator >> > >> MakeALUClause(MachineBasicBlock &MBB, > MachineBasicBlock::iterator I) >> > > { >> > >> + // LDS queue/ AR registers get resetted in new clause. >> > >> + OQAPLevel = 0; >> > >> + ArDefined = false; >> > >> MachineBasicBlock::iterator ClauseHead = I; >> > >> std::vector<std::pair<unsigned, unsigned> > > KCacheBanks; >> > >> bool PushBeforeModifier = false; >> > >> @@ -205,6 +235,7 @@ private: >> > >> !SubstituteKCacheBank(I, KCacheBanks)) >> > >> break; >> > >> AluInstCount += OccupiedDwords(I); >> > >> + AssertARAndOQAPCorrectness(I); >> > >> } >> > >> unsigned Opcode = PushBeforeModifier ? >> > >> AMDGPU::CF_ALU_PUSH_BEFORE : AMDGPU::CF_ALU; >> > >> diff --git a/lib/Target/R600/R600ISelLowering.cpp >> > > b/lib/Target/R600/R600ISelLowering.cpp >> > >> index 3c2e388..d8cbcae 100644 >> > >> --- a/lib/Target/R600/R600ISelLowering.cpp >> > >> +++ b/lib/Target/R600/R600ISelLowering.cpp >> > >> @@ -215,6 +215,7 @@ MachineBasicBlock * >> > > R600TargetLowering::EmitInstrWithCustomInserter( >> > >> } >> > >> >> > >> case AMDGPU::TXD: { >> > >> + abort(); >> > >> unsigned T0 = >> > > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass); >> > >> unsigned T1 = >> > > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass); >> > >> MachineOperand &RID = MI->getOperand(4); >> > >> @@ -316,6 +317,7 @@ MachineBasicBlock * >> > > R600TargetLowering::EmitInstrWithCustomInserter( >> > >> } >> > >> >> > >> case AMDGPU::TXD_SHADOW: { >> > >> + abort(); >> > >> unsigned T0 = >> > > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass); >> > >> unsigned T1 = >> > > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass); >> > >> MachineOperand &RID = MI->getOperand(4); >> > >> -- >> > >> 1.8.3.1 >> > > >> > >> >> > > > >> From fa5abaa99d923fd429029ed149805063848f3184 Mon Sep 17 00:00:00 2001 >> From: Tom Stellard <thomas.stell...@amd.com> >> Date: Thu, 5 Sep 2013 08:59:32 -0700 >> Subject: [PATCH] R600: Fix scheduling of instructions that use the LDS > output >> queue >> >> The LDS output queue is accessed via the OQAP register. The OQAP >> register cannot be live across clauses, so if value is written to the >> output queue, it must be retrieved before the end of the clause. >> With the machine scheduler, we cannot statisfy this constraint, because >> it lacks proper alias analysis and it will mark some LDS accesses as >> having a chain dependency on vertex fetches. Since vertex fetches >> require a new clauses, the dependency may end up spiltting OQAP uses and >> defs so the end up in different clauses. See the lds-output-queue.ll >> test for a more detailed explanation. >> >> To work around this issue, we now combine the LDS read and the OQAP >> copy into one instruction and expand it after register allocation. >> >> This patch also adds some checks to the EmitClauseMarker pass, so that >> it doesn't end a clause with a value still in the output queue and >> removes AR.X and OQAP handling from the scheduler (AR.X uses and defs >> were already being expanded post-RA, so the scheduler will never see >> them). >> --- >> lib/Target/R600/R600EmitClauseMarkers.cpp | 65 ++++++++++++++++-- >> lib/Target/R600/R600ExpandSpecialInstrs.cpp | 17 +++++ >> lib/Target/R600/R600ISelLowering.cpp | 20 +++--- >> lib/Target/R600/R600InstrInfo.cpp | 8 +++ >> lib/Target/R600/R600InstrInfo.h | 2 + >> lib/Target/R600/R600MachineScheduler.cpp | 32 --------- >> lib/Target/R600/R600MachineScheduler.h | 2 - >> lib/Target/R600/R600RegisterInfo.cpp | 13 ++++ >> lib/Target/R600/R600RegisterInfo.h | 2 + >> test/CodeGen/R600/lds-output-queue.ll | 99 > +++++++++++++++++++++++++++ >> test/CodeGen/R600/local-memory-two-objects.ll | 8 ++- >> 11 files changed, 215 insertions(+), 53 deletions(-) >> create mode 100644 test/CodeGen/R600/lds-output-queue.ll >> >> diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp > b/lib/Target/R600/R600EmitClauseMarkers.cpp >> index 928c0e3..881b4aa 100644 >> --- a/lib/Target/R600/R600EmitClauseMarkers.cpp >> +++ b/lib/Target/R600/R600EmitClauseMarkers.cpp >> @@ -47,6 +47,11 @@ private: >> break; >> } >> >> + // These will be expanded to two ALU instructions in the >> + // ExpandSpecialInstructions pass. >> + if (TII->isLDSRetInstr(MI->getOpcode())) >> + return 2; >> + >> if(TII->isVector(*MI) || >> TII->isCubeOp(MI->getOpcode()) || >> TII->isReductionOp(MI->getOpcode())) >> @@ -108,6 +113,10 @@ private: >> bool SubstituteKCacheBank(MachineInstr *MI, >> std::vector<std::pair<unsigned, unsigned> > > &CachedConsts) const { >> std::vector<std::pair<unsigned, unsigned> > UsedKCache; >> + >> + if (!TII->isALUInstr(MI->getOpcode()) && > MI->getOpcode() != AMDGPU::DOT_4) >> + return true; >> + >> const SmallVectorImpl<std::pair<MachineOperand *, int64_t> >> &Consts = >> TII->getSrcs(MI); >> assert((TII->isALUInstr(MI->getOpcode()) || >> @@ -160,6 +169,52 @@ private: >> return true; >> } >> >> + bool canClauseLocalKillFitInClause( >> + unsigned AluInstCount, >> + std::vector<std::pair<unsigned, unsigned> >> KCacheBanks, >> + MachineBasicBlock::iterator Def, >> + MachineBasicBlock::iterator BBEnd) { >> + const R600RegisterInfo &TRI = TII->getRegisterInfo(); >> + for (MachineInstr::const_mop_iterator >> + MOI = Def->operands_begin(), >> + MOE = Def->operands_end(); MOI != MOE; ++MOI) { >> + if (!MOI->isReg() || !MOI->isDef() || >> + TRI.isPhysRegLiveAcrossClauses(MOI->getReg())) >> + continue; >> + >> + // Def defines a clause local register, so check that its use will > fit >> + // in the clause. >> + unsigned LastUseCount = 0; >> + for (MachineBasicBlock::iterator UseI = Def; UseI != BBEnd; ++UseI) > { >> + AluInstCount += OccupiedDwords(UseI); >> + // Make sure we won't need to end the clause due to KCache > limitations. >> + if (!SubstituteKCacheBank(UseI, KCacheBanks)) >> + return false; >> + >> + // We have reached the maximum instruction limit before finding > the >> + // use that kills this register, so we cannot use this def in the >> + // current clause. >> + if (AluInstCount >= TII->getMaxAlusPerClause()) >> + return false; >> + >> + // Register kill flags have been cleared by the time we get to > this >> + // pass, but it is safe to assume that all uses of this register >> + // occur in the same basic block as its definition, because >> + // it is illegal for the scheduler to schedule them in >> + // different blocks. >> + if (UseI->findRegisterUseOperandIdx(MOI->getReg())) >> + LastUseCount = AluInstCount; >> + >> + if (UseI != Def && > UseI->findRegisterDefOperandIdx(MOI->getReg()) != -1) >> + break; >> + } >> + if (LastUseCount) >> + return LastUseCount <= TII->getMaxAlusPerClause(); >> + llvm_unreachable("Clause local register live at end of > clause."); >> + } >> + return true; >> + } >> + >> MachineBasicBlock::iterator >> MakeALUClause(MachineBasicBlock &MBB, MachineBasicBlock::iterator I) > { >> MachineBasicBlock::iterator ClauseHead = I; >> @@ -198,11 +253,13 @@ private: >> I++; >> break; >> } >> - if (TII->isALUInstr(I->getOpcode()) && >> - !SubstituteKCacheBank(I, KCacheBanks)) >> + >> + // If this instruction defines a clause local register, make sure >> + // its use can fit in this clause. >> + if (!canClauseLocalKillFitInClause(AluInstCount, KCacheBanks, I, E)) >> break; >> - if (I->getOpcode() == AMDGPU::DOT_4 && >> - !SubstituteKCacheBank(I, KCacheBanks)) >> + >> + if (!SubstituteKCacheBank(I, KCacheBanks)) >> break; >> AluInstCount += OccupiedDwords(I); >> } >> diff --git a/lib/Target/R600/R600ExpandSpecialInstrs.cpp > b/lib/Target/R600/R600ExpandSpecialInstrs.cpp >> index 67b42d7..aeee4aa 100644 >> --- a/lib/Target/R600/R600ExpandSpecialInstrs.cpp >> +++ b/lib/Target/R600/R600ExpandSpecialInstrs.cpp >> @@ -68,6 +68,23 @@ bool > R600ExpandSpecialInstrsPass::runOnMachineFunction(MachineFunction &MF) > { >> MachineInstr &MI = *I; >> I = llvm::next(I); >> >> + // Expand LDS_*_RET instructions >> + if (TII->isLDSRetInstr(MI.getOpcode())) { >> + int DstIdx = TII->getOperandIdx(MI.getOpcode(), > AMDGPU::OpName::dst); >> + assert(DstIdx != -1); >> + MachineOperand &DstOp = MI.getOperand(DstIdx); >> + MachineInstr *Mov = TII->buildMovInstr(&MBB, I, >> + DstOp.getReg(), > AMDGPU::OQAP); >> + DstOp.setReg(AMDGPU::OQAP); >> + int LDSPredSelIdx = TII->getOperandIdx(MI.getOpcode(), >> + AMDGPU::OpName::pred_sel); >> + int MovPredSelIdx = TII->getOperandIdx(Mov->getOpcode(), >> + AMDGPU::OpName::pred_sel); >> + // Copy the pred_sel bit >> + Mov->getOperand(MovPredSelIdx).setReg( >> + MI.getOperand(LDSPredSelIdx).getReg()); >> + } >> + >> switch (MI.getOpcode()) { >> default: break; >> // Expand PRED_X to one of the PRED_SET instructions. >> diff --git a/lib/Target/R600/R600ISelLowering.cpp > b/lib/Target/R600/R600ISelLowering.cpp >> index 5bb8129..9845d12 100644 >> --- a/lib/Target/R600/R600ISelLowering.cpp >> +++ b/lib/Target/R600/R600ISelLowering.cpp >> @@ -128,21 +128,17 @@ MachineBasicBlock * > R600TargetLowering::EmitInstrWithCustomInserter( >> >> switch (MI->getOpcode()) { >> default: >> - if (TII->isLDSInstr(MI->getOpcode()) && >> - TII->getOperandIdx(MI->getOpcode(), AMDGPU::OpName::dst) != > -1) { >> + // Replace LDS_*_RET instruction that don't have any uses with the >> + // equivalent LDS_*_NORET instruction. >> + if (TII->isLDSRetInstr(MI->getOpcode())) { >> int DstIdx = TII->getOperandIdx(MI->getOpcode(), > AMDGPU::OpName::dst); >> assert(DstIdx != -1); >> MachineInstrBuilder NewMI; >> - if (!MRI.use_empty(MI->getOperand(DstIdx).getReg())) { >> - NewMI = BuildMI(*BB, I, BB->findDebugLoc(I), > TII->get(MI->getOpcode()), >> - AMDGPU::OQAP); >> - TII->buildDefaultInstruction(*BB, I, AMDGPU::MOV, >> - MI->getOperand(0).getReg(), >> - AMDGPU::OQAP); >> - } else { >> - NewMI = BuildMI(*BB, I, BB->findDebugLoc(I), >> - > TII->get(AMDGPU::getLDSNoRetOp(MI->getOpcode()))); >> - } >> + if (!MRI.use_empty(MI->getOperand(DstIdx).getReg())) >> + return BB; >> + >> + NewMI = BuildMI(*BB, I, BB->findDebugLoc(I), >> + > TII->get(AMDGPU::getLDSNoRetOp(MI->getOpcode()))); >> for (unsigned i = 1, e = MI->getNumOperands(); i < e; ++i) > { >> NewMI.addOperand(MI->getOperand(i)); >> } >> diff --git a/lib/Target/R600/R600InstrInfo.cpp > b/lib/Target/R600/R600InstrInfo.cpp >> index a11d54a..0ae4d0b 100644 >> --- a/lib/Target/R600/R600InstrInfo.cpp >> +++ b/lib/Target/R600/R600InstrInfo.cpp >> @@ -141,6 +141,14 @@ bool R600InstrInfo::isLDSInstr(unsigned Opcode) const > { >> (TargetFlags & R600_InstFlag::LDS_1A2D)); >> } >> >> +bool R600InstrInfo::isLDSNoRetInstr(unsigned Opcode) const { >> + return isLDSInstr(Opcode) && getOperandIdx(Opcode, > AMDGPU::OpName::dst) == -1; >> +} >> + >> +bool R600InstrInfo::isLDSRetInstr(unsigned Opcode) const { >> + return isLDSInstr(Opcode) && getOperandIdx(Opcode, > AMDGPU::OpName::dst) != -1; >> +} >> + >> bool R600InstrInfo::canBeConsideredALU(const MachineInstr *MI) const > { >> if (isALUInstr(MI->getOpcode())) >> return true; >> diff --git a/lib/Target/R600/R600InstrInfo.h > b/lib/Target/R600/R600InstrInfo.h >> index d7438ef..d6d1fe8 100644 >> --- a/lib/Target/R600/R600InstrInfo.h >> +++ b/lib/Target/R600/R600InstrInfo.h >> @@ -65,6 +65,8 @@ namespace llvm { >> bool isALUInstr(unsigned Opcode) const; >> bool hasInstrModifiers(unsigned Opcode) const; >> bool isLDSInstr(unsigned Opcode) const; >> + bool isLDSNoRetInstr(unsigned Opcode) const; >> + bool isLDSRetInstr(unsigned Opcode) const; >> >> /// \returns true if this \p Opcode represents an ALU > instruction or an >> /// instruction that will be lowered in ExpandSpecialInstrs Pass. >> diff --git a/lib/Target/R600/R600MachineScheduler.cpp > b/lib/Target/R600/R600MachineScheduler.cpp >> index 6c26d9e..da2a4d8 100644 >> --- a/lib/Target/R600/R600MachineScheduler.cpp >> +++ b/lib/Target/R600/R600MachineScheduler.cpp >> @@ -92,15 +92,6 @@ SUnit* R600SchedStrategy::pickNode(bool &IsTopNode) > { >> AllowSwitchFromAlu = true; >> } >> >> - >> - // We want to scheduled AR defs as soon as possible to make sure they > aren't >> - // put in a different ALU clause from their uses. >> - if (!SU && !UnscheduledARDefs.empty()) { >> - SU = UnscheduledARDefs[0]; >> - UnscheduledARDefs.erase(UnscheduledARDefs.begin()); >> - NextInstKind = IDAlu; >> - } >> - >> if (!SU && ((AllowSwitchToAlu && CurInstKind != IDAlu) > || >> (!AllowSwitchFromAlu && CurInstKind == IDAlu))) { >> // try to pick ALU >> @@ -130,15 +121,6 @@ SUnit* R600SchedStrategy::pickNode(bool > &IsTopNode) { >> NextInstKind = IDOther; >> } >> >> - // We want to schedule the AR uses as late as possible to make sure that >> - // the AR defs have been released. >> - if (!SU && !UnscheduledARUses.empty()) { >> - SU = UnscheduledARUses[0]; >> - UnscheduledARUses.erase(UnscheduledARUses.begin()); >> - NextInstKind = IDAlu; >> - } >> - >> - >> DEBUG( >> if (SU) { >> dbgs() << " ** Pick node **\n"; >> @@ -217,20 +199,6 @@ void R600SchedStrategy::releaseBottomNode(SUnit *SU) > { >> >> int IK = getInstKind(SU); >> >> - // Check for AR register defines >> - for (MachineInstr::const_mop_iterator I = > SU->getInstr()->operands_begin(), >> - E = > SU->getInstr()->operands_end(); >> - I != E; ++I) { >> - if (I->isReg() && I->getReg() == AMDGPU::AR_X) { >> - if (I->isDef()) { >> - UnscheduledARDefs.push_back(SU); >> - } else { >> - UnscheduledARUses.push_back(SU); >> - } >> - return; >> - } >> - } >> - >> // There is no export clause, we can schedule one as soon as its ready >> if (IK == IDOther) >> Available[IDOther].push_back(SU); >> diff --git a/lib/Target/R600/R600MachineScheduler.h > b/lib/Target/R600/R600MachineScheduler.h >> index 0a6f120..97c8cde 100644 >> --- a/lib/Target/R600/R600MachineScheduler.h >> +++ b/lib/Target/R600/R600MachineScheduler.h >> @@ -53,8 +53,6 @@ class R600SchedStrategy : public MachineSchedStrategy > { >> >> std::vector<SUnit *> Available[IDLast], Pending[IDLast]; >> std::vector<SUnit *> AvailableAlus[AluLast]; >> - std::vector<SUnit *> UnscheduledARDefs; >> - std::vector<SUnit *> UnscheduledARUses; >> std::vector<SUnit *> PhysicalRegCopy; >> >> InstKind CurInstKind; >> diff --git a/lib/Target/R600/R600RegisterInfo.cpp > b/lib/Target/R600/R600RegisterInfo.cpp >> index dd8f3ef..83fc90b 100644 >> --- a/lib/Target/R600/R600RegisterInfo.cpp >> +++ b/lib/Target/R600/R600RegisterInfo.cpp >> @@ -85,3 +85,16 @@ const RegClassWeight > &R600RegisterInfo::getRegClassWeight( >> const TargetRegisterClass *RC) const { >> return RCW; >> } >> + >> +bool R600RegisterInfo::isPhysRegLiveAcrossClauses(unsigned Reg) const > { >> + assert(!TargetRegisterInfo::isVirtualRegister(Reg)); >> + >> + switch (Reg) { >> + case AMDGPU::OQAP: >> + case AMDGPU::OQBP: >> + case AMDGPU::AR_X: >> + return false; >> + default: >> + return true; >> + } >> +} >> diff --git a/lib/Target/R600/R600RegisterInfo.h > b/lib/Target/R600/R600RegisterInfo.h >> index d458e55..18f3b3e 100644 >> --- a/lib/Target/R600/R600RegisterInfo.h >> +++ b/lib/Target/R600/R600RegisterInfo.h >> @@ -45,6 +45,8 @@ struct R600RegisterInfo : public AMDGPURegisterInfo > { >> >> virtual const RegClassWeight &getRegClassWeight(const > TargetRegisterClass *RC) const; >> >> + // \returns true if \p Reg can be defined in one ALU caluse and > used in another. >> + virtual bool isPhysRegLiveAcrossClauses(unsigned Reg) const; >> }; >> >> } // End namespace llvm >> diff --git a/test/CodeGen/R600/lds-output-queue.ll > b/test/CodeGen/R600/lds-output-queue.ll >> new file mode 100644 >> index 0000000..63a4332 >> --- /dev/null >> +++ b/test/CodeGen/R600/lds-output-queue.ll >> @@ -0,0 +1,99 @@ >> +; RUN: llc < %s -march=r600 -mcpu=redwood -verify-machineinstrs | > FileCheck %s >> +; >> +; This test checks that the lds input queue will is empty at the end of >> +; the ALU clause. >> + >> +; CHECK-LABEL: @lds_input_queue >> +; CHECK: LDS_READ_RET * OQAP >> +; CHECK-NOT: ALU clause >> +; CHECK: MOV * T{{[0-9]\.[XYZW]}}, OQAP >> + >> +@local_mem = internal addrspace(3) unnamed_addr global [2 x i32] [i32 1, > i32 2], align 4 >> + >> +define void @lds_input_queue(i32 addrspace(1)* %out, i32 addrspace(1)* > %in, i32 %index) { >> +entry: >> + %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, > i32 %index >> + %1 = load i32 addrspace(3)* %0 >> + call void @llvm.AMDGPU.barrier.local() >> + >> + ; This will start a new clause for the vertex fetch >> + %2 = load i32 addrspace(1)* %in >> + %3 = add i32 %1, %2 >> + store i32 %3, i32 addrspace(1)* %out >> + ret void >> +} >> + >> +declare void @llvm.AMDGPU.barrier.local() >> + >> +; The machine scheduler does not do proper alias analysis and assumes that >> +; loads from global values (Note that a global value is different that a >> +; value from global memory. A global value is a value that is declared >> +; outside of a function, it can reside in any address space) alias with >> +; all other loads. >> +; >> +; This is a problem for scheduling the reads from the local data share > (lds). >> +; These reads are implemented using two instructions. The first copies > the >> +; data from lds into the lds output queue, and the second moves the data > from >> +; the input queue into main memory. These two instructions don't have > to be >> +; scheduled one after the other, but they do need to be scheduled in the > same >> +; clause. The aliasing problem mentioned above causes problems when there > is a >> +; load from global memory which immediately follows a load from a global > value that >> +; has been declared in the local memory space: >> +; >> +; %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, > i32 %index >> +; %1 = load i32 addrspace(3)* %0 >> +; %2 = load i32 addrspace(1)* %in >> +; >> +; The instruction selection phase will generate ISA that looks like this: >> +; %OQAP = LDS_READ_RET >> +; %vreg0 = MOV %OQAP >> +; %vreg1 = VTX_READ_32 >> +; %vreg2 = ADD_INT %vreg1, %vreg0 >> +; >> +; The bottom scheduler will schedule the two ALU instructions first: >> +; >> +; UNSCHEDULED: >> +; %OQAP = LDS_READ_RET >> +; %vreg1 = VTX_READ_32 >> +; >> +; SCHEDULED: >> +; >> +; vreg0 = MOV %OQAP >> +; vreg2 = ADD_INT %vreg1, %vreg2 >> +; >> +; The lack of proper aliasing results in the local memory read > (LDS_READ_RET) >> +; to consider the global memory read (VTX_READ_32) has a chain dependency, > so >> +; the global memory read will always be scheduled first. This will give > us a >> +; final program which looks like this: >> +; >> +; Alu clause: >> +; %OQAP = LDS_READ_RET >> +; VTX clause: >> +; %vreg1 = VTX_READ_32 >> +; Alu clause: >> +; vreg0 = MOV %OQAP >> +; vreg2 = ADD_INT %vreg1, %vreg2 >> +; >> +; This is an illegal program because the OQAP def and use know occur in >> +; different ALU clauses. >> +; >> +; This test checks this scenario and makes sure it doesn't result in > an >> +; illegal program. For now, we have fixed this issue by merging the >> +; LDS_READ_RET and MOV together during instruction selection and then >> +; expanding them after scheduling. Once the scheduler has better alias >> +; analysis, we should be able to keep these instructions sparate before >> +; scheduling. >> +; >> +; CHECK-LABEL: @local_global_alias >> +; CHECK: LDS_READ_RET >> +; CHECK-NOT: ALU clause >> +; CHECK MOV * T{{[0-9]\.[XYZW]}}, OQAP >> +define void @local_global_alias(i32 addrspace(1)* %out, i32 addrspace(1)* > %in) { >> +entry: >> + %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, > i32 0 >> + %1 = load i32 addrspace(3)* %0 >> + %2 = load i32 addrspace(1)* %in >> + %3 = add i32 %2, %1 >> + store i32 %3, i32 addrspace(1)* %out >> + ret void >> +} >> diff --git a/test/CodeGen/R600/local-memory-two-objects.ll > b/test/CodeGen/R600/local-memory-two-objects.ll >> index b413fe3..e2d8406 100644 >> --- a/test/CodeGen/R600/local-memory-two-objects.ll >> +++ b/test/CodeGen/R600/local-memory-two-objects.ll >> @@ -12,9 +12,11 @@ >> ; SI-CHECK: .long 47180 >> ; SI-CHECK-NEXT: .long 32768 >> >> -; Make sure the lds writes are using different addresses. >> -; EG-CHECK: LDS_WRITE {{[*]*}} > {{PV|T}}[[ADDRW:[0-9]*\.[XYZW]]] >> -; EG-CHECK-NOT: LDS_WRITE {{[*]*}} T[[ADDRW]] >> +; We would like to check the the lds writes are using different >> +; addresses, but due to variations in the scheduler, we can't do >> +; this consistently on evergreen GPUs. >> +; EG-CHECK: LDS_WRITE >> +; EG-CHECK: LDS_WRITE >> ; SI-CHECK: DS_WRITE_B32 0, {{v[0-9]*}}, v[[ADDRW:[0-9]*]] >> ; SI-CHECK-NOT: DS_WRITE_B32 0, {{v[0-9]*}}, v[[ADDRW]] >> >> -- >> 1.8.1.4 >> > >> _______________________________________________ >> llvm-commits mailing list >> llvm-comm...@cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev