Re: [Mesa-dev] [PATCH] R600: Make sure OQAP defs and uses happen in the same clause

Vincent Lejeune Thu, 14 Nov 2013 15:46:25 -0800

This patch is : reviewed-by: Vincent Lejeune<vljn at ovi.com>



----- Mail original -----
> De : Tom Stellard <t...@stellard.net>
> À : Vincent Lejeune <v...@ovi.com>
> Cc : "mesa-dev@lists.freedesktop.org" <mesa-dev@lists.freedesktop.org>; 
> "llvm-comm...@cs.uiuc.edu" <llvm-comm...@cs.uiuc.edu>; Tom Stellard 
> <thomas.stell...@amd.com>
> Envoyé le : Jeudi 14 novembre 2013 1h53
> Objet : Re: [PATCH] R600: Make sure OQAP defs and uses happen in the same 
> clause
> 
> Hi Vincent,
> 
> I discovered a bug in the previous patch.  Here is an updated versions.
> 
> -Tom
> 
> On Tue, Nov 12, 2013 at 03:01:42PM -0800, Tom Stellard wrote:
>>  Hi Vincent,
>> 
>>  Here is an updated patch where I added a call to
>>  SubstituteKCacheBank() in canClauseLocalKillFitInClause()  This should
>>  prevent OQAP uses and defs from being split because of constant bank
>>  limitations.
>> 
>>  Maybe we can leave the ScheduleDAGMutation optimization as a future
>>  TODO.
>> 
>>  -Tom
>> 
>>  On Sun, Nov 03, 2013 at 10:19:16AM -0800, Vincent Lejeune wrote:
>>  > I have put some comments below but otherwise the patch is
>>  > reviewed-by: Vincent Lejeune <vljn at ovi.com>
>>  > 
>>  > 
>>  > >-------------- next part --------------
>>  > >>From 2eb4673e3184af0e077cbe30a594602441e8d98e Mon Sep 17 
> 00:00:00 2001 >From: Tom Stellard <thomas.stellard at amd.com>
>>  > >Date: Thu, 5 Sep 2013 08:59:32 -0700
>>  > >Subject: [PATCH] R600: Fix scheduling of instructions that use the 
> LDS output
>>  > > queue
>>  > >
>>  > >The LDS output queue is accessed via the OQAP register.  The OQAP
>>  > >register cannot be live across clauses, so if value is written to 
> the
>>  > >output queue, it must be retrieved before the end of the clause.
>>  > >With the machine scheduler, we cannot statisfy this constraint, 
> because
>>  > >it lacks proper alias analysis and it will mark some LDS accesses 
> as
>>  > >having a chain dependency on vertex fetches.  Since vertex fetches
>>  > 
>>  > We can customize the dependency graph before machine scheduling takes 
> place,
>>  > using ScheduleDAGMutation.
>>  > I already wrote some code to break artificial dependencies between 
> vector
>>  > subregister read/write here :
>>  > 
> http://cgit.freedesktop.org/~vlj/llvm/commit/?h=vliw5&id=e91b16a22845d0a80ed348f158ae7ab293e003a8
>>  > While I'm expecting from Matthias Braun's Subregister patches 
> to be upstreamed
>>  > to obsolete most of this patch except tests, it can be reworked so 
> that
>>  > it'll parse all MEM dependency, and remove the ones between 
> instructions
>>  > touching different memory pool (like VTX_FETCH and LDS_READ).
>>  > 
>>  > >require a new clauses, the dependency may end up spiltting OQAP 
> uses and
>>  > >defs so the end up in different clauses.  See the 
> lds-output-queue.ll
>>  > >test for a more detailed explanation.
>>  > >
>>  > >To work around this issue, we now combine the LDS read and the 
> OQAP
>>  > >copy into one instruction and expand it after register allocation.
>>  > >
>>  > >This patch also adds some checks to the EmitClauseMarker pass, so 
> that
>>  > >it doesn't end a clause with a value still in the output queue 
> and
>>  > >removes AR.X and OQAP handling from the scheduler (AR.X uses and 
> defs
>>  > >were already being expanded post-RA, so the scheduler will never 
> see
>>  > >them).
>>  > >---
>>  > > lib/Target/R600/R600EmitClauseMarkers.cpp     | 52 ++++++++++++++
>>  > > lib/Target/R600/R600ExpandSpecialInstrs.cpp   | 17 +++++
>>  > > lib/Target/R600/R600ISelLowering.cpp          | 20 +++---
>>  > > lib/Target/R600/R600InstrInfo.cpp             |  8 +++
>>  > > lib/Target/R600/R600InstrInfo.h               |  2 +
>>  > > lib/Target/R600/R600MachineScheduler.cpp      | 32 ---------
>>  > > lib/Target/R600/R600MachineScheduler.h        |  2 -
>>  > > lib/Target/R600/R600RegisterInfo.cpp          | 13 ++++
>>  > > lib/Target/R600/R600RegisterInfo.h            |  2 +
>>  > > test/CodeGen/R600/lds-output-queue.ll         | 99 
> +++++++++++++++++++++++++++
>>  > > test/CodeGen/R600/local-memory-two-objects.ll |  8 ++-
>>  > > 11 files changed, 206 insertions(+), 49 deletions(-)
>>  > > create mode 100644 test/CodeGen/R600/lds-output-queue.ll
>>  > >
>>  > >diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp 
> b/lib/Target/R600/R600EmitClauseMarkers.cpp
>>  > 
>>  > 
>>  > >+  bool canClauseLocalKillFitInClause(
>>  > >+                             unsigned AluInstCount,
>>  > >+                             MachineBasicBlock::iterator Def,
>>  > >+                             MachineBasicBlock::iterator BBEnd) 
> {
>>  > >+    const R600RegisterInfo &TRI = TII->getRegisterInfo();
>>  > >+    for (MachineInstr::const_mop_iterator
>>  > >+           MOI = Def->operands_begin(),
>>  > >+           MOE = Def->operands_end(); MOI != MOE; ++MOI) 
> {
>>  > >+      if (!MOI->isReg() || !MOI->isDef() ||
>>  > >+          TRI.isPhysRegLiveAcrossClauses(MOI->getReg()))
>>  > >+        continue;
>>  > >+
>>  > >+      // Def defines a clause local register, so check that its 
> use will fit
>>  > >+      // in the clause.
>>  > >+      unsigned LastUseCount = 0;
>>  > >+      for (MachineBasicBlock::iterator UseI = Def; UseI != BBEnd; 
> ++UseI) {
>>  > >+        AluInstCount += OccupiedDwords(UseI);
>>  > >+        // We have reached the maximum instruction limit before 
> finding the
>>  > >+        // use that kills this register, so we cannot use this 
> def in the
>>  > >+        // current clause.
>>  > >+        if (AluInstCount >= TII->getMaxAlusPerClause())
>>  > >+          return false;
>>  > >+
>>  > >+        // Register kill flags have been cleared by the time we 
> get to this
>>  > >+        // pass, but it is safe to assume that all uses of this 
> register
>>  > >+        // occur in the same basic block as its definition, 
> because
>>  > >+        // it is illegal for the scheduler to schedule them in
>>  > >+        // different blocks.
>>  > >+        if (UseI->findRegisterUseOperandIdx(MOI->getReg()))
>>  > >+          LastUseCount = AluInstCount;
>>  > >+
>>  > >+        if (UseI != Def && 
> UseI->findRegisterDefOperandIdx(MOI->getReg()) != -1)
>>  > >+          break;
>>  > >+      }
>>  > >+      if (LastUseCount)
>>  > >+        return LastUseCount <= TII->getMaxAlusPerClause();
>>  > >+      llvm_unreachable("Clause local register live at end of 
> clause.");
>>  > >+    }
>>  > >+    return true;
>>  > >+  }
>>  > 
>>  > This function does not check if current clause can hold all constant 
> bank.
>>  > I think it's likely to be rare for a clause to be split because of 
> constant bank limitations,
>>  > but it would be better to have an assertion failure in such case to 
> make debugging easier.
>>  > For instance if the SubstituteKCacheBank return false, you can check 
> that there is no lds
>>  > use still pending.
>>  > 
>>  > 
>>  > ----- Mail original -----
>>  > > De : Tom Stellard <t...@stellard.net>
>>  > > À : Vincent Lejeune <v...@ovi.com>
>>  > > Cc : "llvm-comm...@cs.uiuc.edu" 
> <llvm-comm...@cs.uiuc.edu>; "mesa-dev@lists.freedesktop.org" 
> <mesa-dev@lists.freedesktop.org>; Tom Stellard 
> <thomas.stell...@amd.com>
>>  > > Envoyé le : Mercredi 30 octobre 2013 17h46
>>  > > Objet : Re: [PATCH] R600: Make sure OQAP defs and uses happen in 
> the same clause
>>  > > 
>>  > > Hi Vincent,
>>  > > 
>>  > > It turns out that it's not possible to correctly schedule 
> uses and defs
>>  > > of the OQAP register without proper alias analysis in the 
> MachineScheduler.  See
>>  > > the explanation in the lds-output-queue.ll test case.
>>  > > 
>>  > > Here is an updated patch that fixes all the outstanding LDS 
> scheduling
>>  > > bugs that I am aware of.
>>  > > 
>>  > > -Tom
>>  > > 
>>  > > On Fri, Oct 25, 2013 at 05:42:13AM -0700, Vincent Lejeune wrote:
>>  > >>  This patch should work when checking than no OQAP is used 
> before beeing 
>>  > > queued, assuming that a value in OQAP is consumed
>>  > >>  and cannot be read twice. However I'm not sure I cover 
> all LDS 
>>  > > instructions that queues a value, I only use LDS_RET_READ in 
> switch case.
>>  > >> 
>>  > >>  Vincent
>>  > >> 
>>  > >> 
>>  > >> 
>>  > >>  ----- Mail original -----
>>  > >>  > De : Tom Stellard <t...@stellard.net>
>>  > >>  > À : Vincent Lejeune <v...@ovi.com>
>>  > >>  > Cc : "llvm-comm...@cs.uiuc.edu" 
>>  > > <llvm-comm...@cs.uiuc.edu>; 
> "mesa-dev@lists.freedesktop.org" 
>>  > > <mesa-dev@lists.freedesktop.org>; Tom Stellard 
>>  > > <thomas.stell...@amd.com>
>>  > >>  > Envoyé le : Mardi 22 octobre 2013 23h20
>>  > >>  > Objet : Re: [PATCH] R600: Make sure OQAP defs and uses 
> happen in the 
>>  > > same clause
>>  > >>  > 
>>  > >>  > Hi Vincent,
>>  > >>  > 
>>  > >>  > Here is an updated patch.  I wasn't sure where to 
> put the 
>>  > > assertion to
>>  > >>  > check that UnscheduledNoLiveOut{Defs,Uses} is 
> empty when 
>>  > > switching to a
>>  > >>  > new clause.  I tried adding it to 
> R600SchedStartegy::schedNode() 
>>  > > behind
>>  > >>  > the if (NextInstKind != CurInstKind) condition, but it 
> always failed.
>>  > >>  > Any suggestions on where I should but it?
>>  > >>  > 
>>  > >>  > -Tom
>>  > >>  > 
>>  > >>  > 
>>  > >>  > On Mon, Oct 21, 2013 at 12:40:28PM -0700, Vincent 
> Lejeune wrote:
>>  > >>  >> 
>>  > >>  >> 
>>  > >>  >> 
>>  > >>  >> 
>>  > >>  >>  ----- Mail original -----
>>  > >>  >>  > De : Tom Stellard <t...@stellard.net>
>>  > >>  >>  > À : llvm-comm...@cs.uiuc.edu
>>  > >>  >>  > Cc : mesa-dev@lists.freedesktop.org; Tom 
> Stellard 
>>  > >>  > <thomas.stell...@amd.com>
>>  > >>  >>  > Envoyé le : Vendredi 11 octobre 2013 20h10
>>  > >>  >>  > Objet : [PATCH] R600: Make sure OQAP defs and 
> uses happen in 
>>  > > the same 
>>  > >>  > clause
>>  > >>  >>  > 
>>  > >>  >>  > From: Tom Stellard 
> <thomas.stell...@amd.com>
>>  > >>  >>  > 
>>  > >>  >>  > Reading the special OQAP register pops the 
> top value off the 
>>  > > LDS
>>  > >>  >>  > input queue and returns it to the 
> instruction.  This queue 
>>  > > is
>>  > >>  >>  > invalidated at the end of an ALU clause and 
> leaving values 
>>  > > in the 
>>  > >>  > queue
>>  > >>  >>  > can lead to GPU hangs.  This means that if we 
> load a value 
>>  > > into the 
>>  > >>  > queue,
>>  > >>  >>  > we must use it before the end of the clause.
>>  > >>  >>  > 
>>  > >>  >>  > This fixes some hangs in the OpenCV test 
> suite.
>>  > >>  >>  > ---
>>  > >>  >>  > lib/Target/R600/R600MachineScheduler.cpp | 25 
> 
>>  > >>  > +++++++++++++------------
>>  > >>  >>  > lib/Target/R600/R600MachineScheduler.h   |  4 
> ++--
>>  > >>  >>  > test/CodeGen/R600/lds-input-queue.ll     | 26 
> 
>>  > >>  > ++++++++++++++++++++++++++
>>  > >>  >>  > 3 files changed, 41 insertions(+), 14 
> deletions(-)
>>  > >>  >>  > create mode 100644 
> test/CodeGen/R600/lds-input-queue.ll
>>  > >>  >>  > 
>>  > >>  >>  > diff --git 
> a/lib/Target/R600/R600MachineScheduler.cpp 
>>  > >>  >>  > b/lib/Target/R600/R600MachineScheduler.cpp
>>  > >>  >>  > index 6c26d9e..611b7f4 100644
>>  > >>  >>  > --- 
> a/lib/Target/R600/R600MachineScheduler.cpp
>>  > >>  >>  > +++ 
> b/lib/Target/R600/R600MachineScheduler.cpp
>>  > >>  >>  > @@ -93,11 +93,12 @@ SUnit* 
> R600SchedStrategy::pickNode(bool 
>>  > >>  > &IsTopNode) 
>>  > >>  >>  > {
>>  > >>  >>  >    }
>>  > >>  >>  > 
>>  > >>  >>  > 
>>  > >>  >>  > -  // We want to scheduled AR defs as soon as 
> possible to 
>>  > > make sure 
>>  > >>  > they 
>>  > >>  >>  > aren't
>>  > >>  >>  > -  // put in a different ALU clause from 
> their uses.
>>  > >>  >>  > -  if (!SU && 
> !UnscheduledARDefs.empty()) {
>>  > >>  >>  > -      SU = UnscheduledARDefs[0];
>>  > >>  >>  > -      
> UnscheduledARDefs.erase(UnscheduledARDefs.begin());
>>  > >>  >>  > +  // We want to scheduled defs that cannot 
> be live outside 
>>  > > of this 
>>  > >>  > clause 
>>  > >>  >>  > +  // as soon as possible to make sure they 
> aren't put 
>>  > > in a 
>>  > >>  > different
>>  > >>  >>  > +  // ALU clause from their uses.
>>  > >>  >>  > +  if (!SU && 
> !UnscheduledNoLiveOutDefs.empty()) 
>>  > > {
>>  > >>  >>  > +      SU = UnscheduledNoLiveOutDefs[0];
>>  > >>  >>  > +      
>>  > >>  > 
> UnscheduledNoLiveOutDefs.erase(UnscheduledNoLiveOutDefs.begin());
>>  > >>  >>  >        NextInstKind = IDAlu;
>>  > >>  >>  >    }
>>  > >>  >>  > 
>>  > >>  >>  > @@ -132,9 +133,9 @@ SUnit* 
> R600SchedStrategy::pickNode(bool 
>>  > >>  > &IsTopNode) 
>>  > >>  >>  > {
>>  > >>  >>  > 
>>  > >>  >>  >    // We want to schedule the AR uses as late 
> as possible to 
>>  > > make sure 
>>  > >>  > that
>>  > >>  >>  >    // the AR defs have been released.
>>  > >>  >>  > -  if (!SU && 
> !UnscheduledARUses.empty()) {
>>  > >>  >>  > -      SU = UnscheduledARUses[0];
>>  > >>  >>  > -      
> UnscheduledARUses.erase(UnscheduledARUses.begin());
>>  > >>  >>  > +  if (!SU && 
> !UnscheduledNoLiveOutUses.empty()) 
>>  > > {
>>  > >>  >>  > +      SU = UnscheduledNoLiveOutUses[0];
>>  > >>  >>  > +      
>>  > >>  > 
> UnscheduledNoLiveOutUses.erase(UnscheduledNoLiveOutUses.begin());
>>  > >>  >> 
>>  > >>  >>  Can we use std::queue<SUnit*> instead of a 
> std::vector for 
>>  > >>  > UnscheduledNoLiveOutUses ?
>>  > >>  >>  I had to use a vector because I needed to be able 
> to pop non 
>>  > > topmost SUnit 
>>  > >>  > in some case
>>  > >>  >>  (to fit Instruction Group const read limitation) 
> but I would 
>>  > > rather avoid 
>>  > >>  > erase(iterator) call
>>  > >>  >>  when possible.
>>  > >>  >> 
>>  > >>  >> 
>>  > >>  >>  >        NextInstKind = IDAlu;
>>  > >>  >>  >    }
>>  > >>  >>  > 
>>  > >>  >>  > @@ -217,15 +218,15 @@ void 
>>  > > R600SchedStrategy::releaseBottomNode(SUnit 
>>  > >>  > *SU) 
>>  > >>  >>  > {
>>  > >>  >>  > 
>>  > >>  >>  >    int IK = getInstKind(SU);
>>  > >>  >>  > 
>>  > >>  >>  > -  // Check for AR register defines
>>  > >>  >>  > +  // Check for registers that do not live 
> across ALU 
>>  > > clauses.
>>  > >>  >>  >    for (MachineInstr::const_mop_iterator I = 
>>  > >>  >>  > SU->getInstr()->operands_begin(),
>>  > >>  >>  >                                          E = 
>>  > >>  >>  > SU->getInstr()->operands_end();
>>  > >>  >>  >                                          I != 
> E; ++I) {
>>  > >>  >>  > -    if (I->isReg() && 
> I->getReg() == 
>>  > > AMDGPU::AR_X) 
>>  > >>  > {
>>  > >>  >>  > +    if (I->isReg() && 
> (I->getReg() == 
>>  > > AMDGPU::AR_X || 
>>  > >>  >>  > I->getReg() == AMDGPU::OQAP)) {
>>  > >>  >>  >        if (I->isDef()) {
>>  > >>  >>  > -        UnscheduledARDefs.push_back(SU);
>>  > >>  >>  > +        
> UnscheduledNoLiveOutDefs.push_back(SU);
>>  > >>  >>  >        } else {
>>  > >>  >>  > -        UnscheduledARUses.push_back(SU);
>>  > >>  >>  > +        
> UnscheduledNoLiveOutUses.push_back(SU);
>>  > >>  >>  >        }
>>  > >>  >>  >        return;
>>  > >>  >>  >      }
>>  > >>  >>  > diff --git 
> a/lib/Target/R600/R600MachineScheduler.h 
>>  > >>  >>  > b/lib/Target/R600/R600MachineScheduler.h
>>  > >>  >>  > index 0a6f120..db2e188 100644
>>  > >>  >>  > --- a/lib/Target/R600/R600MachineScheduler.h
>>  > >>  >>  > +++ b/lib/Target/R600/R600MachineScheduler.h
>>  > >>  >>  > @@ -53,8 +53,8 @@ class R600SchedStrategy : 
> public 
>>  > >>  > MachineSchedStrategy {
>>  > >>  >>  > 
>>  > >>  >>  >    std::vector<SUnit *> 
> Available[IDLast], 
>>  > > Pending[IDLast];
>>  > >>  >>  >    std::vector<SUnit *> 
> AvailableAlus[AluLast];
>>  > >>  >>  > -  std::vector<SUnit *> 
> UnscheduledARDefs;
>>  > >>  >>  > -  std::vector<SUnit *> 
> UnscheduledARUses;
>>  > >>  >>  > +  std::vector<SUnit *> 
> UnscheduledNoLiveOutDefs;
>>  > >>  >>  > +  std::vector<SUnit *> 
> UnscheduledNoLiveOutUses;
>>  > >>  >>  >    std::vector<SUnit *> 
> PhysicalRegCopy;
>>  > >>  >>  > 
>>  > >>  >>  >    InstKind CurInstKind;
>>  > >>  >>  > diff --git 
> a/test/CodeGen/R600/lds-input-queue.ll 
>>  > >>  >>  > b/test/CodeGen/R600/lds-input-queue.ll
>>  > >>  >>  > new file mode 100644
>>  > >>  >>  > index 0000000..548b41c
>>  > >>  >>  > --- /dev/null
>>  > >>  >>  > +++ b/test/CodeGen/R600/lds-input-queue.ll
>>  > >>  >>  > @@ -0,0 +1,26 @@
>>  > >>  >>  > +; RUN: llc < %s -march=r600 -mcpu=redwood 
> | FileCheck %s
>>  > >>  >> 
>>  > >>  >>  Does the test work with -verify-machineinstrs flag 
> set ?
>>  > >>  >> 
>>  > >>  >>  > +;
>>  > >>  >>  > +; This test checks that the lds input queue 
> will is empty 
>>  > > at the end 
>>  > >>  > of
>>  > >>  >>  > +; the ALU clause.
>>  > >>  >>  > +
>>  > >>  >>  > +; CHECK-LABEL: @lds_input_queue
>>  > >>  >>  > +; CHECK: LDS_READ_RET * OQAP
>>  > >>  >>  > +; CHECK-NOT: ALU clause
>>  > >>  >>  > +; CHECK: MOV 
> T{{[0-9]\.[XYZW]}}, OQAP
>>  > >>  >>  > +
>>  > >>  >>  > +@local_mem = internal addrspace(3) 
> unnamed_addr global [2 x 
>>  > > i32] [i32 
>>  > >>  > 1, i32 
>>  > >>  >>  > 2], align 4
>>  > >>  >>  > +
>>  > >>  >>  > +define void @lds_input_queue(i32 
> addrspace(1)* %out, i32 
>>  > >>  > addrspace(1)* %in, i32 
>>  > >>  >>  > %index) {
>>  > >>  >>  > +entry:
>>  > >>  >>  > +  %0 = getelementptr inbounds [2 x i32] 
> addrspace(3)* 
>>  > > @local_mem, i32 
>>  > >>  > 0, i32 
>>  > >>  >>  > %index
>>  > >>  >>  > +  %1 = load i32 addrspace(3)* %0
>>  > >>  >>  > +  call void @llvm.AMDGPU.barrier.local()
>>  > >>  >>  > +
>>  > >>  >>  > +  ; This will start a new clause for the 
> vertex fetch
>>  > >>  >>  > +  %2 = load i32 addrspace(1)* %in
>>  > >>  >>  > +  %3 = add i32 %1, %2
>>  > >>  >>  > +  store i32 %3, i32 addrspace(1)* %out
>>  > >>  >>  > +  ret void
>>  > >>  >>  > +}
>>  > >>  >>  > +
>>  > >>  >>  > +declare void @llvm.AMDGPU.barrier.local()
>>  > >>  >>  > -- 
>>  > >>  >>  > 1.7.11.4
>>  > >>  >>  > 
>>  > >>  >>  > 
> _______________________________________________
>>  > >>  >>  > llvm-commits mailing list
>>  > >>  >>  > llvm-comm...@cs.uiuc.edu
>>  > >>  >>  > 
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>  > >>  >>  >
>>  > >>  > 
>>  > > 
>>  > >>  From 956c9986fcda84c4a4b0a555432319bbdfd98cfc Mon Sep 17 
> 00:00:00 2001
>>  > >>  From: Vincent Lejeune <v...@ovi.com>
>>  > >>  Date: Fri, 25 Oct 2013 14:32:35 +0200
>>  > >>  Subject: [PATCH] R600: Uses Assert for OQAP
>>  > >> 
>>  > >>  ---
>>  > >>   lib/Target/R600/R600EmitClauseMarkers.cpp | 31 
>>  > > +++++++++++++++++++++++++++++++
>>  > >>   lib/Target/R600/R600ISelLowering.cpp      |  2 ++
>>  > >>   2 files changed, 33 insertions(+)
>>  > >> 
>>  > >>  diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp 
>>  > > b/lib/Target/R600/R600EmitClauseMarkers.cpp
>>  > >>  index 928c0e3..9951b15 100644
>>  > >>  --- a/lib/Target/R600/R600EmitClauseMarkers.cpp
>>  > >>  +++ b/lib/Target/R600/R600EmitClauseMarkers.cpp
>>  > >>  @@ -33,6 +33,8 @@ private:
>>  > >>     static char ID;
>>  > >>     const R600InstrInfo *TII;
>>  > >>     int Address;
>>  > >>  +  int OQAPLevel;
>>  > >>  +  bool ArDefined;
>>  > >>   
>>  > >>     unsigned OccupiedDwords(MachineInstr *MI) const {
>>  > >>       switch (MI->getOpcode()) {
>>  > >>  @@ -160,8 +162,36 @@ private:
>>  > >>       return true;
>>  > >>     }
>>  > >>   
>>  > >>  +  void AssertARAndOQAPCorrectness(MachineInstr *MI) {
>>  > >>  +    for (MachineInstr::mop_iterator MOp = 
> MI->operands_begin(),
>>  > >>  +        E = MI->operands_end(); MOp != E; MOp++) {
>>  > >>  +      MachineOperand &MO = *MOp;
>>  > >>  +      if (!MO.isReg() || MO.isDef())
>>  > >>  +        continue;
>>  > >>  +      switch (MO.getReg()) {
>>  > >>  +      default: break;
>>  > >>  +      case AMDGPU::OQAP:
>>  > >>  +        OQAPLevel--;
>>  > >>  +        assert (OQAPLevel >= 0 && "OQAP not 
> defined in 
>>  > > this clause !");
>>  > >>  +        break;
>>  > >>  +      }
>>  > >>  +    }
>>  > >>  +    switch (MI->getOpcode()) {
>>  > >>  +    default: break;
>>  > >>  +    case AMDGPU::MOVA_INT_eg:
>>  > >>  +      ArDefined = true;
>>  > >>  +      break;
>>  > >>  +    case AMDGPU::LDS_READ_RET:
>>  > >>  +      OQAPLevel++;
>>  > >>  +      break;
>>  > >>  +    }
>>  > >>  +  }
>>  > >>  +
>>  > >>     MachineBasicBlock::iterator
>>  > >>     MakeALUClause(MachineBasicBlock &MBB, 
> MachineBasicBlock::iterator I) 
>>  > > {
>>  > >>  +    // LDS queue/ AR registers get resetted in new clause.
>>  > >>  +    OQAPLevel = 0;
>>  > >>  +    ArDefined = false;
>>  > >>       MachineBasicBlock::iterator ClauseHead = I;
>>  > >>       std::vector<std::pair<unsigned, unsigned> > 
> KCacheBanks;
>>  > >>       bool PushBeforeModifier = false;
>>  > >>  @@ -205,6 +235,7 @@ private:
>>  > >>             !SubstituteKCacheBank(I, KCacheBanks))
>>  > >>           break;
>>  > >>         AluInstCount += OccupiedDwords(I);
>>  > >>  +      AssertARAndOQAPCorrectness(I);
>>  > >>       }
>>  > >>       unsigned Opcode = PushBeforeModifier ?
>>  > >>           AMDGPU::CF_ALU_PUSH_BEFORE : AMDGPU::CF_ALU;
>>  > >>  diff --git a/lib/Target/R600/R600ISelLowering.cpp 
>>  > > b/lib/Target/R600/R600ISelLowering.cpp
>>  > >>  index 3c2e388..d8cbcae 100644
>>  > >>  --- a/lib/Target/R600/R600ISelLowering.cpp
>>  > >>  +++ b/lib/Target/R600/R600ISelLowering.cpp
>>  > >>  @@ -215,6 +215,7 @@ MachineBasicBlock * 
>>  > > R600TargetLowering::EmitInstrWithCustomInserter(
>>  > >>     }
>>  > >>   
>>  > >>     case AMDGPU::TXD: {
>>  > >>  +    abort();
>>  > >>       unsigned T0 = 
>>  > > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass);
>>  > >>       unsigned T1 = 
>>  > > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass);
>>  > >>       MachineOperand &RID = MI->getOperand(4);
>>  > >>  @@ -316,6 +317,7 @@ MachineBasicBlock * 
>>  > > R600TargetLowering::EmitInstrWithCustomInserter(
>>  > >>     }
>>  > >>   
>>  > >>     case AMDGPU::TXD_SHADOW: {
>>  > >>  +    abort();
>>  > >>       unsigned T0 = 
>>  > > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass);
>>  > >>       unsigned T1 = 
>>  > > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass);
>>  > >>       MachineOperand &RID = MI->getOperand(4);
>>  > >>  -- 
>>  > >>  1.8.3.1
>>  > > 
>>  > >> 
>>  > >
> 
>>  From fa5abaa99d923fd429029ed149805063848f3184 Mon Sep 17 00:00:00 2001
>>  From: Tom Stellard <thomas.stell...@amd.com>
>>  Date: Thu, 5 Sep 2013 08:59:32 -0700
>>  Subject: [PATCH] R600: Fix scheduling of instructions that use the LDS 
> output
>>   queue
>> 
>>  The LDS output queue is accessed via the OQAP register.  The OQAP
>>  register cannot be live across clauses, so if value is written to the
>>  output queue, it must be retrieved before the end of the clause.
>>  With the machine scheduler, we cannot statisfy this constraint, because
>>  it lacks proper alias analysis and it will mark some LDS accesses as
>>  having a chain dependency on vertex fetches.  Since vertex fetches
>>  require a new clauses, the dependency may end up spiltting OQAP uses and
>>  defs so the end up in different clauses.  See the lds-output-queue.ll
>>  test for a more detailed explanation.
>> 
>>  To work around this issue, we now combine the LDS read and the OQAP
>>  copy into one instruction and expand it after register allocation.
>> 
>>  This patch also adds some checks to the EmitClauseMarker pass, so that
>>  it doesn't end a clause with a value still in the output queue and
>>  removes AR.X and OQAP handling from the scheduler (AR.X uses and defs
>>  were already being expanded post-RA, so the scheduler will never see
>>  them).
>>  ---
>>   lib/Target/R600/R600EmitClauseMarkers.cpp     | 65 ++++++++++++++++--
>>   lib/Target/R600/R600ExpandSpecialInstrs.cpp   | 17 +++++
>>   lib/Target/R600/R600ISelLowering.cpp          | 20 +++---
>>   lib/Target/R600/R600InstrInfo.cpp             |  8 +++
>>   lib/Target/R600/R600InstrInfo.h               |  2 +
>>   lib/Target/R600/R600MachineScheduler.cpp      | 32 ---------
>>   lib/Target/R600/R600MachineScheduler.h        |  2 -
>>   lib/Target/R600/R600RegisterInfo.cpp          | 13 ++++
>>   lib/Target/R600/R600RegisterInfo.h            |  2 +
>>   test/CodeGen/R600/lds-output-queue.ll         | 99 
> +++++++++++++++++++++++++++
>>   test/CodeGen/R600/local-memory-two-objects.ll |  8 ++-
>>   11 files changed, 215 insertions(+), 53 deletions(-)
>>   create mode 100644 test/CodeGen/R600/lds-output-queue.ll
>> 
>>  diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp 
> b/lib/Target/R600/R600EmitClauseMarkers.cpp
>>  index 928c0e3..881b4aa 100644
>>  --- a/lib/Target/R600/R600EmitClauseMarkers.cpp
>>  +++ b/lib/Target/R600/R600EmitClauseMarkers.cpp
>>  @@ -47,6 +47,11 @@ private:
>>         break;
>>       }
>>   
>>  +    // These will be expanded to two ALU instructions in the
>>  +    // ExpandSpecialInstructions pass.
>>  +    if (TII->isLDSRetInstr(MI->getOpcode()))
>>  +      return 2;
>>  +
>>       if(TII->isVector(*MI) ||
>>           TII->isCubeOp(MI->getOpcode()) ||
>>           TII->isReductionOp(MI->getOpcode()))
>>  @@ -108,6 +113,10 @@ private:
>>     bool SubstituteKCacheBank(MachineInstr *MI,
>>         std::vector<std::pair<unsigned, unsigned> > 
> &CachedConsts) const {
>>       std::vector<std::pair<unsigned, unsigned> > UsedKCache;
>>  +
>>  +    if (!TII->isALUInstr(MI->getOpcode()) && 
> MI->getOpcode() != AMDGPU::DOT_4)
>>  +      return true;
>>  +
>>       const SmallVectorImpl<std::pair<MachineOperand *, int64_t> 
>>  &Consts =
>>           TII->getSrcs(MI);
>>       assert((TII->isALUInstr(MI->getOpcode()) ||
>>  @@ -160,6 +169,52 @@ private:
>>       return true;
>>     }
>>   
>>  +  bool canClauseLocalKillFitInClause(
>>  +                        unsigned AluInstCount,
>>  +                        std::vector<std::pair<unsigned, unsigned> 
>>  KCacheBanks,
>>  +                        MachineBasicBlock::iterator Def,
>>  +                        MachineBasicBlock::iterator BBEnd) {
>>  +    const R600RegisterInfo &TRI = TII->getRegisterInfo();
>>  +    for (MachineInstr::const_mop_iterator
>>  +           MOI = Def->operands_begin(),
>>  +           MOE = Def->operands_end(); MOI != MOE; ++MOI) {
>>  +      if (!MOI->isReg() || !MOI->isDef() ||
>>  +          TRI.isPhysRegLiveAcrossClauses(MOI->getReg()))
>>  +        continue;
>>  +
>>  +      // Def defines a clause local register, so check that its use will 
> fit
>>  +      // in the clause.
>>  +      unsigned LastUseCount = 0;
>>  +      for (MachineBasicBlock::iterator UseI = Def; UseI != BBEnd; ++UseI) 
> {
>>  +        AluInstCount += OccupiedDwords(UseI);
>>  +        // Make sure we won't need to end the clause due to KCache 
> limitations.
>>  +        if (!SubstituteKCacheBank(UseI, KCacheBanks))
>>  +          return false;
>>  +
>>  +        // We have reached the maximum instruction limit before finding 
> the
>>  +        // use that kills this register, so we cannot use this def in the
>>  +        // current clause.
>>  +        if (AluInstCount >= TII->getMaxAlusPerClause())
>>  +          return false;
>>  +
>>  +        // Register kill flags have been cleared by the time we get to 
> this
>>  +        // pass, but it is safe to assume that all uses of this register
>>  +        // occur in the same basic block as its definition, because
>>  +        // it is illegal for the scheduler to schedule them in
>>  +        // different blocks.
>>  +        if (UseI->findRegisterUseOperandIdx(MOI->getReg()))
>>  +          LastUseCount = AluInstCount;
>>  +
>>  +        if (UseI != Def && 
> UseI->findRegisterDefOperandIdx(MOI->getReg()) != -1)
>>  +          break;
>>  +      }
>>  +      if (LastUseCount)
>>  +        return LastUseCount <= TII->getMaxAlusPerClause();
>>  +      llvm_unreachable("Clause local register live at end of 
> clause.");
>>  +    }
>>  +    return true;
>>  +  }
>>  +
>>     MachineBasicBlock::iterator
>>     MakeALUClause(MachineBasicBlock &MBB, MachineBasicBlock::iterator I) 
> {
>>       MachineBasicBlock::iterator ClauseHead = I;
>>  @@ -198,11 +253,13 @@ private:
>>           I++;
>>           break;
>>         }
>>  -      if (TII->isALUInstr(I->getOpcode()) &&
>>  -          !SubstituteKCacheBank(I, KCacheBanks))
>>  +
>>  +      // If this instruction defines a clause local register, make sure
>>  +      // its use can fit in this clause.
>>  +      if (!canClauseLocalKillFitInClause(AluInstCount, KCacheBanks, I, E))
>>           break;
>>  -      if (I->getOpcode() == AMDGPU::DOT_4 &&
>>  -          !SubstituteKCacheBank(I, KCacheBanks))
>>  +
>>  +      if (!SubstituteKCacheBank(I, KCacheBanks))
>>           break;
>>         AluInstCount += OccupiedDwords(I);
>>       }
>>  diff --git a/lib/Target/R600/R600ExpandSpecialInstrs.cpp 
> b/lib/Target/R600/R600ExpandSpecialInstrs.cpp
>>  index 67b42d7..aeee4aa 100644
>>  --- a/lib/Target/R600/R600ExpandSpecialInstrs.cpp
>>  +++ b/lib/Target/R600/R600ExpandSpecialInstrs.cpp
>>  @@ -68,6 +68,23 @@ bool 
> R600ExpandSpecialInstrsPass::runOnMachineFunction(MachineFunction &MF) 
> {
>>         MachineInstr &MI = *I;
>>         I = llvm::next(I);
>>   
>>  +      // Expand LDS_*_RET instructions
>>  +      if (TII->isLDSRetInstr(MI.getOpcode())) {
>>  +        int DstIdx = TII->getOperandIdx(MI.getOpcode(), 
> AMDGPU::OpName::dst);
>>  +        assert(DstIdx != -1);
>>  +        MachineOperand &DstOp = MI.getOperand(DstIdx);
>>  +        MachineInstr *Mov = TII->buildMovInstr(&MBB, I,
>>  +                                               DstOp.getReg(), 
> AMDGPU::OQAP);
>>  +        DstOp.setReg(AMDGPU::OQAP);
>>  +        int LDSPredSelIdx = TII->getOperandIdx(MI.getOpcode(),
>>  +                                           AMDGPU::OpName::pred_sel);
>>  +        int MovPredSelIdx = TII->getOperandIdx(Mov->getOpcode(),
>>  +                                           AMDGPU::OpName::pred_sel);
>>  +        // Copy the pred_sel bit
>>  +        Mov->getOperand(MovPredSelIdx).setReg(
>>  +            MI.getOperand(LDSPredSelIdx).getReg());
>>  +      }
>>  +
>>         switch (MI.getOpcode()) {
>>         default: break;
>>         // Expand PRED_X to one of the PRED_SET instructions.
>>  diff --git a/lib/Target/R600/R600ISelLowering.cpp 
> b/lib/Target/R600/R600ISelLowering.cpp
>>  index 5bb8129..9845d12 100644
>>  --- a/lib/Target/R600/R600ISelLowering.cpp
>>  +++ b/lib/Target/R600/R600ISelLowering.cpp
>>  @@ -128,21 +128,17 @@ MachineBasicBlock * 
> R600TargetLowering::EmitInstrWithCustomInserter(
>>   
>>     switch (MI->getOpcode()) {
>>     default:
>>  -    if (TII->isLDSInstr(MI->getOpcode()) &&
>>  -        TII->getOperandIdx(MI->getOpcode(), AMDGPU::OpName::dst) != 
> -1) {
>>  +    // Replace LDS_*_RET instruction that don't have any uses with the
>>  +    // equivalent LDS_*_NORET instruction.
>>  +    if (TII->isLDSRetInstr(MI->getOpcode())) {
>>         int DstIdx = TII->getOperandIdx(MI->getOpcode(), 
> AMDGPU::OpName::dst);
>>         assert(DstIdx != -1);
>>         MachineInstrBuilder NewMI;
>>  -      if (!MRI.use_empty(MI->getOperand(DstIdx).getReg())) {
>>  -        NewMI = BuildMI(*BB, I, BB->findDebugLoc(I), 
> TII->get(MI->getOpcode()),
>>  -                        AMDGPU::OQAP);
>>  -        TII->buildDefaultInstruction(*BB, I, AMDGPU::MOV,
>>  -                                     MI->getOperand(0).getReg(),
>>  -                                     AMDGPU::OQAP);
>>  -      } else {
>>  -        NewMI = BuildMI(*BB, I, BB->findDebugLoc(I),
>>  -                        
> TII->get(AMDGPU::getLDSNoRetOp(MI->getOpcode())));
>>  -      }
>>  +      if (!MRI.use_empty(MI->getOperand(DstIdx).getReg()))
>>  +        return BB;
>>  +
>>  +      NewMI = BuildMI(*BB, I, BB->findDebugLoc(I),
>>  +                      
> TII->get(AMDGPU::getLDSNoRetOp(MI->getOpcode())));
>>         for (unsigned i = 1, e = MI->getNumOperands(); i < e; ++i) 
> {
>>           NewMI.addOperand(MI->getOperand(i));
>>         }
>>  diff --git a/lib/Target/R600/R600InstrInfo.cpp 
> b/lib/Target/R600/R600InstrInfo.cpp
>>  index a11d54a..0ae4d0b 100644
>>  --- a/lib/Target/R600/R600InstrInfo.cpp
>>  +++ b/lib/Target/R600/R600InstrInfo.cpp
>>  @@ -141,6 +141,14 @@ bool R600InstrInfo::isLDSInstr(unsigned Opcode) const 
> {
>>             (TargetFlags & R600_InstFlag::LDS_1A2D));
>>   }
>>   
>>  +bool R600InstrInfo::isLDSNoRetInstr(unsigned Opcode) const {
>>  +  return isLDSInstr(Opcode) && getOperandIdx(Opcode, 
> AMDGPU::OpName::dst) == -1;
>>  +}
>>  +
>>  +bool R600InstrInfo::isLDSRetInstr(unsigned Opcode) const {
>>  +  return isLDSInstr(Opcode) && getOperandIdx(Opcode, 
> AMDGPU::OpName::dst) != -1;
>>  +}
>>  +
>>   bool R600InstrInfo::canBeConsideredALU(const MachineInstr *MI) const 
> {
>>     if (isALUInstr(MI->getOpcode()))
>>       return true;
>>  diff --git a/lib/Target/R600/R600InstrInfo.h 
> b/lib/Target/R600/R600InstrInfo.h
>>  index d7438ef..d6d1fe8 100644
>>  --- a/lib/Target/R600/R600InstrInfo.h
>>  +++ b/lib/Target/R600/R600InstrInfo.h
>>  @@ -65,6 +65,8 @@ namespace llvm {
>>     bool isALUInstr(unsigned Opcode) const;
>>     bool hasInstrModifiers(unsigned Opcode) const;
>>     bool isLDSInstr(unsigned Opcode) const;
>>  +  bool isLDSNoRetInstr(unsigned Opcode) const;
>>  +  bool isLDSRetInstr(unsigned Opcode) const;
>>   
>>     /// \returns true if this \p Opcode represents an ALU 
> instruction or an
>>     /// instruction that will be lowered in ExpandSpecialInstrs Pass.
>>  diff --git a/lib/Target/R600/R600MachineScheduler.cpp 
> b/lib/Target/R600/R600MachineScheduler.cpp
>>  index 6c26d9e..da2a4d8 100644
>>  --- a/lib/Target/R600/R600MachineScheduler.cpp
>>  +++ b/lib/Target/R600/R600MachineScheduler.cpp
>>  @@ -92,15 +92,6 @@ SUnit* R600SchedStrategy::pickNode(bool &IsTopNode) 
> {
>>         AllowSwitchFromAlu = true;
>>     }
>>   
>>  -
>>  -  // We want to scheduled AR defs as soon as possible to make sure they 
> aren't
>>  -  // put in a different ALU clause from their uses.
>>  -  if (!SU && !UnscheduledARDefs.empty()) {
>>  -      SU = UnscheduledARDefs[0];
>>  -      UnscheduledARDefs.erase(UnscheduledARDefs.begin());
>>  -      NextInstKind = IDAlu;
>>  -  }
>>  -
>>     if (!SU && ((AllowSwitchToAlu && CurInstKind != IDAlu) 
> ||
>>         (!AllowSwitchFromAlu && CurInstKind == IDAlu))) {
>>       // try to pick ALU
>>  @@ -130,15 +121,6 @@ SUnit* R600SchedStrategy::pickNode(bool 
> &IsTopNode) {
>>         NextInstKind = IDOther;
>>     }
>>   
>>  -  // We want to schedule the AR uses as late as possible to make sure that
>>  -  // the AR defs have been released.
>>  -  if (!SU && !UnscheduledARUses.empty()) {
>>  -      SU = UnscheduledARUses[0];
>>  -      UnscheduledARUses.erase(UnscheduledARUses.begin());
>>  -      NextInstKind = IDAlu;
>>  -  }
>>  -
>>  -
>>     DEBUG(
>>         if (SU) {
>>           dbgs() << " ** Pick node **\n";
>>  @@ -217,20 +199,6 @@ void R600SchedStrategy::releaseBottomNode(SUnit *SU) 
> {
>>   
>>     int IK = getInstKind(SU);
>>   
>>  -  // Check for AR register defines
>>  -  for (MachineInstr::const_mop_iterator I = 
> SU->getInstr()->operands_begin(),
>>  -                                        E = 
> SU->getInstr()->operands_end();
>>  -                                        I != E; ++I) {
>>  -    if (I->isReg() && I->getReg() == AMDGPU::AR_X) {
>>  -      if (I->isDef()) {
>>  -        UnscheduledARDefs.push_back(SU);
>>  -      } else {
>>  -        UnscheduledARUses.push_back(SU);
>>  -      }
>>  -      return;
>>  -    }
>>  -  }
>>  -
>>     // There is no export clause, we can schedule one as soon as its ready
>>     if (IK == IDOther)
>>       Available[IDOther].push_back(SU);
>>  diff --git a/lib/Target/R600/R600MachineScheduler.h 
> b/lib/Target/R600/R600MachineScheduler.h
>>  index 0a6f120..97c8cde 100644
>>  --- a/lib/Target/R600/R600MachineScheduler.h
>>  +++ b/lib/Target/R600/R600MachineScheduler.h
>>  @@ -53,8 +53,6 @@ class R600SchedStrategy : public MachineSchedStrategy 
> {
>>   
>>     std::vector<SUnit *> Available[IDLast], Pending[IDLast];
>>     std::vector<SUnit *> AvailableAlus[AluLast];
>>  -  std::vector<SUnit *> UnscheduledARDefs;
>>  -  std::vector<SUnit *> UnscheduledARUses;
>>     std::vector<SUnit *> PhysicalRegCopy;
>>   
>>     InstKind CurInstKind;
>>  diff --git a/lib/Target/R600/R600RegisterInfo.cpp 
> b/lib/Target/R600/R600RegisterInfo.cpp
>>  index dd8f3ef..83fc90b 100644
>>  --- a/lib/Target/R600/R600RegisterInfo.cpp
>>  +++ b/lib/Target/R600/R600RegisterInfo.cpp
>>  @@ -85,3 +85,16 @@ const RegClassWeight 
> &R600RegisterInfo::getRegClassWeight(
>>     const TargetRegisterClass *RC) const {
>>     return RCW;
>>   }
>>  +
>>  +bool R600RegisterInfo::isPhysRegLiveAcrossClauses(unsigned Reg) const 
> {
>>  +  assert(!TargetRegisterInfo::isVirtualRegister(Reg));
>>  +
>>  +  switch (Reg) {
>>  +  case AMDGPU::OQAP:
>>  +  case AMDGPU::OQBP:
>>  +  case AMDGPU::AR_X:
>>  +    return false;
>>  +  default:
>>  +    return true;
>>  +  }
>>  +}
>>  diff --git a/lib/Target/R600/R600RegisterInfo.h 
> b/lib/Target/R600/R600RegisterInfo.h
>>  index d458e55..18f3b3e 100644
>>  --- a/lib/Target/R600/R600RegisterInfo.h
>>  +++ b/lib/Target/R600/R600RegisterInfo.h
>>  @@ -45,6 +45,8 @@ struct R600RegisterInfo : public AMDGPURegisterInfo 
> {
>>   
>>     virtual const RegClassWeight &getRegClassWeight(const 
> TargetRegisterClass *RC) const;
>>   
>>  +  // \returns true if \p Reg can be defined in one ALU caluse and 
> used in another.
>>  +  virtual bool isPhysRegLiveAcrossClauses(unsigned Reg) const;
>>   };
>>   
>>   } // End namespace llvm
>>  diff --git a/test/CodeGen/R600/lds-output-queue.ll 
> b/test/CodeGen/R600/lds-output-queue.ll
>>  new file mode 100644
>>  index 0000000..63a4332
>>  --- /dev/null
>>  +++ b/test/CodeGen/R600/lds-output-queue.ll
>>  @@ -0,0 +1,99 @@
>>  +; RUN: llc < %s -march=r600 -mcpu=redwood -verify-machineinstrs | 
> FileCheck %s
>>  +;
>>  +; This test checks that the lds input queue will is empty at the end of
>>  +; the ALU clause.
>>  +
>>  +; CHECK-LABEL: @lds_input_queue
>>  +; CHECK: LDS_READ_RET * OQAP
>>  +; CHECK-NOT: ALU clause
>>  +; CHECK: MOV * T{{[0-9]\.[XYZW]}}, OQAP
>>  +
>>  +@local_mem = internal addrspace(3) unnamed_addr global [2 x i32] [i32 1, 
> i32 2], align 4
>>  +
>>  +define void @lds_input_queue(i32 addrspace(1)* %out, i32 addrspace(1)* 
> %in, i32 %index) {
>>  +entry:
>>  +  %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, 
> i32 %index
>>  +  %1 = load i32 addrspace(3)* %0
>>  +  call void @llvm.AMDGPU.barrier.local()
>>  +
>>  +  ; This will start a new clause for the vertex fetch
>>  +  %2 = load i32 addrspace(1)* %in
>>  +  %3 = add i32 %1, %2
>>  +  store i32 %3, i32 addrspace(1)* %out
>>  +  ret void
>>  +}
>>  +
>>  +declare void @llvm.AMDGPU.barrier.local()
>>  +
>>  +; The machine scheduler does not do proper alias analysis and assumes that
>>  +; loads from global values (Note that a global value is different that a
>>  +; value from global memory.  A global value is a value that is declared
>>  +; outside of a function, it can reside in any address space) alias with
>>  +; all other loads.
>>  +;
>>  +; This is a problem for scheduling the reads from the local data share 
> (lds).
>>  +; These reads are implemented using two instructions.  The first copies 
> the
>>  +; data from lds into the lds output queue, and the second moves the data 
> from
>>  +; the input queue into main memory.  These two instructions don't have 
> to be
>>  +; scheduled one after the other, but they do need to be scheduled in the 
> same
>>  +; clause.  The aliasing problem mentioned above causes problems when there 
> is a
>>  +; load from global memory which immediately follows a load from a global 
> value that
>>  +; has been declared in the local memory space:
>>  +;
>>  +;  %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, 
> i32 %index
>>  +;  %1 = load i32 addrspace(3)* %0
>>  +;  %2 = load i32 addrspace(1)* %in
>>  +;
>>  +; The instruction selection phase will generate ISA that looks like this:
>>  +; %OQAP = LDS_READ_RET
>>  +; %vreg0 = MOV %OQAP
>>  +; %vreg1 = VTX_READ_32
>>  +; %vreg2 = ADD_INT %vreg1, %vreg0
>>  +;
>>  +; The bottom scheduler will schedule the two ALU instructions first:
>>  +;
>>  +; UNSCHEDULED:
>>  +; %OQAP = LDS_READ_RET
>>  +; %vreg1 = VTX_READ_32
>>  +;
>>  +; SCHEDULED:
>>  +;
>>  +; vreg0 = MOV %OQAP
>>  +; vreg2 = ADD_INT %vreg1, %vreg2
>>  +;
>>  +; The lack of proper aliasing results in the local memory read 
> (LDS_READ_RET)
>>  +; to consider the global memory read (VTX_READ_32) has a chain dependency, 
> so
>>  +; the global memory read will always be scheduled first.  This will give 
> us a
>>  +; final program which looks like this:
>>  +;
>>  +; Alu clause:
>>  +; %OQAP = LDS_READ_RET
>>  +; VTX clause:
>>  +; %vreg1 = VTX_READ_32
>>  +; Alu clause:
>>  +; vreg0 = MOV %OQAP
>>  +; vreg2 = ADD_INT %vreg1, %vreg2
>>  +;
>>  +; This is an illegal program because the OQAP def and use know occur in
>>  +; different ALU clauses.
>>  +;
>>  +; This test checks this scenario and makes sure it doesn't result in 
> an
>>  +; illegal program.  For now, we have fixed this issue by merging the
>>  +; LDS_READ_RET and MOV together during instruction selection and then
>>  +; expanding them after scheduling.  Once the scheduler has better alias
>>  +; analysis, we should be able to keep these instructions sparate before
>>  +; scheduling.
>>  +;
>>  +; CHECK-LABEL: @local_global_alias
>>  +; CHECK: LDS_READ_RET
>>  +; CHECK-NOT: ALU clause
>>  +; CHECK MOV * T{{[0-9]\.[XYZW]}}, OQAP
>>  +define void @local_global_alias(i32 addrspace(1)* %out, i32 addrspace(1)* 
> %in) {
>>  +entry:
>>  +  %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, 
> i32 0
>>  +  %1 = load i32 addrspace(3)* %0
>>  +  %2 = load i32 addrspace(1)* %in
>>  +  %3 = add i32 %2, %1
>>  +  store i32 %3, i32 addrspace(1)* %out
>>  +  ret void
>>  +}
>>  diff --git a/test/CodeGen/R600/local-memory-two-objects.ll 
> b/test/CodeGen/R600/local-memory-two-objects.ll
>>  index b413fe3..e2d8406 100644
>>  --- a/test/CodeGen/R600/local-memory-two-objects.ll
>>  +++ b/test/CodeGen/R600/local-memory-two-objects.ll
>>  @@ -12,9 +12,11 @@
>>   ; SI-CHECK: .long 47180
>>   ; SI-CHECK-NEXT: .long 32768
>>   
>>  -; Make sure the lds writes are using different addresses.
>>  -; EG-CHECK: LDS_WRITE {{[*]*}} 
> {{PV|T}}[[ADDRW:[0-9]*\.[XYZW]]]
>>  -; EG-CHECK-NOT: LDS_WRITE {{[*]*}} T[[ADDRW]]
>>  +; We would like to check the the lds writes are using different
>>  +; addresses, but due to variations in the scheduler, we can't do
>>  +; this consistently on evergreen GPUs.
>>  +; EG-CHECK: LDS_WRITE
>>  +; EG-CHECK: LDS_WRITE
>>   ; SI-CHECK: DS_WRITE_B32 0, {{v[0-9]*}}, v[[ADDRW:[0-9]*]]
>>   ; SI-CHECK-NOT: DS_WRITE_B32 0, {{v[0-9]*}}, v[[ADDRW]]
>>   
>>  -- 
>>  1.8.1.4
>> 
> 
>>  _______________________________________________
>>  llvm-commits mailing list
>>  llvm-comm...@cs.uiuc.edu
>>  http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH] R600: Make sure OQAP defs and uses happen in the same clause

Reply via email to