llvmbot wrote:

<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-tablegen

Author: None (macurtis-amd)

<details>
<summary>Changes</summary>

Add Clang builtins and associated LLVM intrinsics for scoped 128-bit loads
from and stores to global memory.

New builtins:
1. `__builtin_amdgcn_global_load_b128`
   ([documentation](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/clang/docs/LanguageExtensions.rst#__builtin_amdgcn_global_load_b128-and-__builtin_amdgcn_global_store_b128),
   [test/examples](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl))
2. `__builtin_amdgcn_global_store_b128`
   ([documentation](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/clang/docs/LanguageExtensions.rst#__builtin_amdgcn_global_load_b128-and-__builtin_amdgcn_global_store_b128),
   [test/examples](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl))

And corresponding intrinsics:
1. `llvm.amdgcn.global.load.b128`
   ([documentation](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/llvm/docs/AMDGPUUsage.rst),
   search for the intrinsic name;
   [test/examples](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.b128.ll))
2. `llvm.amdgcn.global.store.b128`
   ([documentation](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/llvm/docs/AMDGPUUsage.rst),
   search for the intrinsic name;
   [test/examples](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.store.b128.ll))

These will initially be used by [RCCL](https://github.com/ROCm/rccl) to address 
some low-level performance issues.
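
A minimal usage sketch of the two builtins, assuming a Clang that contains this patch and a gfx9 or newer AMDGPU target. The wrapper names (`load_b128_agent`, `store_b128_agent`, `copy_b128_agent`) and the plain-memory fallback branch are hypothetical and are not part of the patch; the fallback exists only so the snippet also compiles where the builtins are unavailable.

```c
#ifndef __has_builtin
#define __has_builtin(x) 0 /* compilers without __has_builtin: assume absent */
#endif

/* Same vector type as in the patch's documentation: four 32-bit unsigned lanes. */
typedef unsigned int v4u
    __attribute__((__vector_size__(4 * sizeof(unsigned int))));

#if defined(__AMDGCN__) && __has_builtin(__builtin_amdgcn_global_load_b128)
/* Device path: pointer into the global address space (address space 1). */
typedef v4u __attribute__((address_space(1))) *global_v4u_ptr;

static inline v4u load_b128_agent(global_v4u_ptr src) {
  /* The scope argument must be a string literal. */
  return __builtin_amdgcn_global_load_b128(src, "agent");
}

static inline void store_b128_agent(global_v4u_ptr dst, v4u data) {
  __builtin_amdgcn_global_store_b128(dst, data, "agent");
}
#else
/* Hypothetical fallback (not in the patch): ordinary load/store, no scope. */
typedef v4u *global_v4u_ptr;

static inline v4u load_b128_agent(global_v4u_ptr src) { return *src; }

static inline void store_b128_agent(global_v4u_ptr dst, v4u data) {
  *dst = data;
}
#endif

/* Copy one 128-bit value from src to dst with agent scope on both accesses. */
static inline void copy_b128_agent(global_v4u_ptr dst, global_v4u_ptr src) {
  store_b128_agent(dst, load_b128_agent(src));
}
```

Passing anything other than a string literal as the scope is rejected by the Sema check added in this patch, which is why the sketch hard-codes `"agent"` in the wrappers rather than forwarding a `const char *` parameter.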

---

Patch is 1.74 MiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/172090.diff


23 Files Affected:

- (modified) clang/docs/LanguageExtensions.rst (+37) 
- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+3) 
- (modified) clang/include/clang/Sema/SemaAMDGPU.h (+2) 
- (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+20) 
- (modified) clang/lib/Sema/SemaAMDGPU.cpp (+16) 
- (added) clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl (+113) 
- (added) clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-error.cl (+22)
- (added) clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-target-error.cl (+26)
- (modified) llvm/docs/AMDGPUUsage.rst (+106) 
- (modified) llvm/include/llvm/CodeGen/GlobalISel/GIMatchTableExecutor.h (+6) 
- (modified) llvm/include/llvm/CodeGen/GlobalISel/GIMatchTableExecutorImpl.h (+9)
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+25) 
- (modified) llvm/lib/IR/Verifier.cpp (+30-3) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3) 
- (modified) llvm/lib/Target/AMDGPU/FLATInstructions.td (+15) 
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+22) 
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.b128.ll (+30869) 
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.store.b128.ll (+3888) 
- (added) llvm/test/CodeGen/AMDGPU/unsupported-global-load.ll (+36) 
- (added) llvm/test/CodeGen/AMDGPU/unsupported-global-store.ll (+36) 
- (added) llvm/test/Verifier/amdgpu-intrinsics.ll (+66) 
- (modified) llvm/utils/TableGen/Common/GlobalISel/GlobalISelMatchTable.cpp (+17)
- (modified) llvm/utils/TableGen/Common/GlobalISel/GlobalISelMatchTable.h (+18) 


``````````diff
diff --git a/clang/docs/LanguageExtensions.rst 
b/clang/docs/LanguageExtensions.rst
index c4b86b203d383..4d4d6ca3fe0bd 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -5243,6 +5243,43 @@ returns the bit at the position of the current lane. It 
is almost equivalent to
 ``(mask & (1 << lane_id)) != 0``, except that its behavior is only defined if
 the given mask has the same value for all active lanes of the current wave.
 
+
+__builtin_amdgcn_global_load_b128 and __builtin_amdgcn_global_store_b128
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Signature:
+
+.. code-block:: c
+
+    typedef __attribute__((__vector_size__(4 * sizeof(unsigned int)))) 
unsigned int v4u;
+    typedef v4u __attribute__((address_space(1))) *global_ptr_to_v4u;
+
+    v4u __builtin_amdgcn_global_load_b128(
+       v4u __attribute__((address_space(1))) *src,
+       const char                            *scope);
+
+    void __builtin_amdgcn_global_store_b128(
+       v4u __attribute__((address_space(1))) *dst,
+       v4u                                    data,
+       const char                            *scope);
+
+Load or store a vector of 4 unsigned integers from or to global memory with
+cache behavior specified by `scope` which must be a string literal.
+
+Valid values for `scope` are:
+
+* ``"wavefront"``       
+* ``"workgroup"``       
+* ``"agent"``           
+* ``""`` (empty string) 
+
+These builtins are supported on gfx9, gfx10, gfx11, and gfx12 targets.
+
+They map to the llvm intrinsics ``llvm.amdgcn.global.load.b128`` and
+``llvm.amdgcn.global.store.b128`` documented in `User Guide for AMDGPU Backend
+<https://llvm.org/docs/AMDGPUUsage.html>`_.
+
+
 ARM/AArch64 Language Extensions
 -------------------------------
 
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index a867144d83928..4bc5b1c16f2ad 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -190,6 +190,9 @@ 
TARGET_BUILTIN(__builtin_amdgcn_raw_ptr_buffer_atomic_fmax_f64, "ddQbiiIi", "",
 TARGET_BUILTIN(__builtin_amdgcn_raw_ptr_buffer_load_lds, "vQbv*3IUiiiIiIi", 
"", "vmem-to-lds-load-insts")
 TARGET_BUILTIN(__builtin_amdgcn_struct_ptr_buffer_load_lds, 
"vQbv*3IUiiiiIiIi", "", "vmem-to-lds-load-insts")
 
+TARGET_BUILTIN(__builtin_amdgcn_global_load_b128, "V4UiV4Ui*1cC*", "n", 
"gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_global_store_b128, "vV4Ui*1V4UicC*", "n", 
"gfx9-insts")
+
 
//===----------------------------------------------------------------------===//
 // Ballot builtins.
 
//===----------------------------------------------------------------------===//
diff --git a/clang/include/clang/Sema/SemaAMDGPU.h 
b/clang/include/clang/Sema/SemaAMDGPU.h
index bac812a9d4fcf..556bfb705de67 100644
--- a/clang/include/clang/Sema/SemaAMDGPU.h
+++ b/clang/include/clang/Sema/SemaAMDGPU.h
@@ -28,6 +28,8 @@ class SemaAMDGPU : public SemaBase {
 
   bool checkCoopAtomicFunctionCall(CallExpr *TheCall, bool IsStore);
 
+  bool checkScopedMemAccessFunctionCall(CallExpr *TheCall);
+
   bool checkMovDPPFunctionCall(CallExpr *TheCall, unsigned NumArgs,
                                unsigned NumDataArgs);
 
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp 
b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index eabdc370da6b4..384f76e092252 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -885,6 +885,26 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
     llvm::Function *F = CGM.getIntrinsic(IID, {Args[0]->getType()});
     return Builder.CreateCall(F, {Args});
   }
+  case AMDGPU::BI__builtin_amdgcn_global_load_b128:
+  case AMDGPU::BI__builtin_amdgcn_global_store_b128: {
+    const bool IsStore =
+        BuiltinID == AMDGPU::BI__builtin_amdgcn_global_store_b128;
+    LLVMContext &Ctx = CGM.getLLVMContext();
+    SmallVector<Value *, 5> Args = {EmitScalarExpr(E->getArg(0))}; // addr
+    if (IsStore)
+      Args.push_back(EmitScalarExpr(E->getArg(1))); // data
+    const unsigned ScopeIdx = E->getNumArgs() - 1;
+    StringRef ScopeLit =
+        cast<StringLiteral>(E->getArg(ScopeIdx)->IgnoreParenCasts())
+            ->getString();
+    llvm::MDNode *MD =
+        llvm::MDNode::get(Ctx, {llvm::MDString::get(Ctx, ScopeLit)});
+    Args.push_back(llvm::MetadataAsValue::get(Ctx, MD)); // scope
+    llvm::Function *F =
+        CGM.getIntrinsic(IsStore ? Intrinsic::amdgcn_global_store_b128
+                                 : Intrinsic::amdgcn_global_load_b128);
+    return Builder.CreateCall(F, Args);
+  }
   case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
     Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
                                    {llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index cece22092bb14..72c7bf03f93ad 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -255,6 +255,9 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned 
BuiltinID,
            (SemaRef.BuiltinConstantArg(TheCall, ArgCount, Result)) ||
            (SemaRef.BuiltinConstantArg(TheCall, (ArgCount - 1), Result));
   }
+  case AMDGPU::BI__builtin_amdgcn_global_load_b128:
+  case AMDGPU::BI__builtin_amdgcn_global_store_b128:
+    return checkScopedMemAccessFunctionCall(TheCall);
   default:
     return false;
   }
@@ -344,6 +347,19 @@ bool SemaAMDGPU::checkCoopAtomicFunctionCall(CallExpr 
*TheCall, bool IsStore) {
   return Fail;
 }
 
+bool SemaAMDGPU::checkScopedMemAccessFunctionCall(CallExpr *TheCall) {
+  bool Fail = false;
+  // Last argument is a string literal
+  Expr *Arg = TheCall->getArg(TheCall->getNumArgs() - 1);
+  auto Scope = dyn_cast<StringLiteral>(Arg->IgnoreParenCasts());
+  if (!Scope) {
+    Fail = true;
+    Diag(TheCall->getBeginLoc(), diag::err_expr_not_string_literal)
+        << Arg->getSourceRange();
+  }
+  return Fail;
+}
+
 bool SemaAMDGPU::checkMovDPPFunctionCall(CallExpr *TheCall, unsigned NumArgs,
                                          unsigned NumDataArgs) {
   assert(NumDataArgs <= 2);
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl
new file mode 100644
index 0000000000000..7ffceead747e8
--- /dev/null
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl
@@ -0,0 +1,113 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py 
UTC_ARGS: --check-globals smart
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx950         
-emit-llvm -o - %s | FileCheck %s -check-prefixes=GFX,GFX950
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx9-4-generic 
-emit-llvm -o - %s | FileCheck %s -check-prefixes=GFX,GFX9_4_GENERIC
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx1250        
-emit-llvm -o - %s | FileCheck %s -check-prefixes=GFX,GFX1250
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx12-generic  
-emit-llvm -o - %s | FileCheck %s -check-prefixes=GFX,GFX12_GENERIC
+
+typedef __attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned 
int v4u32;
+typedef v4u32 __global *global_ptr_to_v4u32;
+
+//------------------------------------------------------------------------------
+// Store
+//------------------------------------------------------------------------------
+// GFX-LABEL: @test_amdgcn_global_store_b128_00(
+// GFX-NEXT:  entry:
+// GFX-NEXT:    tail call void @llvm.amdgcn.global.store.b128(ptr addrspace(1) 
[[PTR:%.*]], <4 x i32> [[DATA:%.*]], metadata [[META4:![0-9]+]])
+// GFX-NEXT:    ret void
+//
+void test_amdgcn_global_store_b128_00(global_ptr_to_v4u32 ptr, v4u32 data) {
+  __builtin_amdgcn_global_store_b128(ptr, data, "wavefront");
+}
+
+// GFX-LABEL: @test_amdgcn_global_store_b128_01(
+// GFX-NEXT:  entry:
+// GFX-NEXT:    tail call void @llvm.amdgcn.global.store.b128(ptr addrspace(1) 
[[PTR:%.*]], <4 x i32> [[DATA:%.*]], metadata [[META5:![0-9]+]])
+// GFX-NEXT:    ret void
+//
+void test_amdgcn_global_store_b128_01(global_ptr_to_v4u32 ptr, v4u32 data) {
+  __builtin_amdgcn_global_store_b128(ptr, data, "workgroup");
+}
+
+// GFX-LABEL: @test_amdgcn_global_store_b128_10(
+// GFX-NEXT:  entry:
+// GFX-NEXT:    tail call void @llvm.amdgcn.global.store.b128(ptr addrspace(1) 
[[PTR:%.*]], <4 x i32> [[DATA:%.*]], metadata [[META6:![0-9]+]])
+// GFX-NEXT:    ret void
+//
+void test_amdgcn_global_store_b128_10(global_ptr_to_v4u32 ptr, v4u32 data) {
+  __builtin_amdgcn_global_store_b128(ptr, data, "agent");
+}
+
+// GFX-LABEL: @test_amdgcn_global_store_b128_11(
+// GFX-NEXT:  entry:
+// GFX-NEXT:    tail call void @llvm.amdgcn.global.store.b128(ptr addrspace(1) 
[[PTR:%.*]], <4 x i32> [[DATA:%.*]], metadata [[META7:![0-9]+]])
+// GFX-NEXT:    ret void
+//
+void test_amdgcn_global_store_b128_11(global_ptr_to_v4u32 ptr, v4u32 data) {
+  __builtin_amdgcn_global_store_b128(ptr, data, "");
+}
+
+//------------------------------------------------------------------------------
+// Load
+//------------------------------------------------------------------------------
+// GFX-LABEL: @test_amdgcn_global_load_b128_00(
+// GFX-NEXT:  entry:
+// GFX-NEXT:    [[TMP0:%.*]] = tail call <4 x i32> 
@llvm.amdgcn.global.load.b128(ptr addrspace(1) [[PTR:%.*]], metadata [[META4]])
+// GFX-NEXT:    ret <4 x i32> [[TMP0]]
+//
+v4u32 test_amdgcn_global_load_b128_00(global_ptr_to_v4u32 ptr) {
+  return __builtin_amdgcn_global_load_b128(ptr, "wavefront");
+}
+
+// GFX-LABEL: @test_amdgcn_global_load_b128_01(
+// GFX-NEXT:  entry:
+// GFX-NEXT:    [[TMP0:%.*]] = tail call <4 x i32> 
@llvm.amdgcn.global.load.b128(ptr addrspace(1) [[PTR:%.*]], metadata [[META5]])
+// GFX-NEXT:    ret <4 x i32> [[TMP0]]
+//
+v4u32 test_amdgcn_global_load_b128_01(global_ptr_to_v4u32 ptr) {
+  return __builtin_amdgcn_global_load_b128(ptr, "workgroup");
+}
+
+// GFX-LABEL: @test_amdgcn_global_load_b128_10(
+// GFX-NEXT:  entry:
+// GFX-NEXT:    [[TMP0:%.*]] = tail call <4 x i32> 
@llvm.amdgcn.global.load.b128(ptr addrspace(1) [[PTR:%.*]], metadata [[META6]])
+// GFX-NEXT:    ret <4 x i32> [[TMP0]]
+//
+v4u32 test_amdgcn_global_load_b128_10(global_ptr_to_v4u32 ptr) {
+  return __builtin_amdgcn_global_load_b128(ptr, "agent");
+}
+
+// GFX-LABEL: @test_amdgcn_global_load_b128_11(
+// GFX-NEXT:  entry:
+// GFX-NEXT:    [[TMP0:%.*]] = tail call <4 x i32> 
@llvm.amdgcn.global.load.b128(ptr addrspace(1) [[PTR:%.*]], metadata [[META7]])
+// GFX-NEXT:    ret <4 x i32> [[TMP0]]
+//
+v4u32 test_amdgcn_global_load_b128_11(global_ptr_to_v4u32 ptr) {
+  return __builtin_amdgcn_global_load_b128(ptr, "");
+}
+//.
+// GFX950: [[META4]] = !{!"wavefront"}
+// GFX950: [[META5]] = !{!"workgroup"}
+// GFX950: [[META6]] = !{!"agent"}
+// GFX950: [[META7]] = !{!""}
+//.
+// GFX9_4_GENERIC: [[META4]] = !{!"wavefront"}
+// GFX9_4_GENERIC: [[META5]] = !{!"workgroup"}
+// GFX9_4_GENERIC: [[META6]] = !{!"agent"}
+// GFX9_4_GENERIC: [[META7]] = !{!""}
+//.
+// GFX1250: [[META4]] = !{!"wavefront"}
+// GFX1250: [[META5]] = !{!"workgroup"}
+// GFX1250: [[META6]] = !{!"agent"}
+// GFX1250: [[META7]] = !{!""}
+//.
+// GFX12_GENERIC: [[META4]] = !{!"wavefront"}
+// GFX12_GENERIC: [[META5]] = !{!"workgroup"}
+// GFX12_GENERIC: [[META6]] = !{!"agent"}
+// GFX12_GENERIC: [[META7]] = !{!""}
+//.
+//// NOTE: These prefixes are unused and the list is autogenerated. Do not add 
tests below this line:
+// GFX1250: {{.*}}
+// GFX12_GENERIC: {{.*}}
+// GFX950: {{.*}}
+// GFX9_4_GENERIC: {{.*}}
diff --git a/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-error.cl 
b/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-error.cl
new file mode 100644
index 0000000000000..b21b604baa944
--- /dev/null
+++ b/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-error.cl
@@ -0,0 +1,22 @@
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx950         
-S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx9-4-generic 
-S -verify -o - %s
+// REQUIRES: amdgpu-registered-target
+
+typedef __attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned 
int v4u32;
+typedef v4u32 __global *global_ptr_to_v4u32;
+
+void test_amdgcn_global_store_b128_00(v4u32 *ptr, v4u32 data, const char* 
scope) {
+  __builtin_amdgcn_global_store_b128(ptr, data, "");  
//expected-error{{passing '__private v4u32 *__private' to parameter of type 
'__attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned int 
__global *' changes address space of pointer}}
+}
+
+void test_amdgcn_global_store_b128_01(global_ptr_to_v4u32 ptr, v4u32 data, 
const char* scope) {
+  __builtin_amdgcn_global_store_b128(ptr, data, scope);  
//expected-error{{expression is not a string literal}}
+}
+
+v4u32 test_amdgcn_global_load_b128_00(v4u32 *ptr, const char* scope) {
+  return __builtin_amdgcn_global_load_b128(ptr, "");  
//expected-error{{passing '__private v4u32 *__private' to parameter of type 
'__attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned int 
__global *' changes address space of pointer}}
+}
+
+v4u32 test_amdgcn_global_load_b128_01(global_ptr_to_v4u32 ptr, const char* 
scope) {
+  return __builtin_amdgcn_global_load_b128(ptr, scope);  
//expected-error{{expression is not a string literal}}
+}
diff --git 
a/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-target-error.cl 
b/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-target-error.cl
new file mode 100644
index 0000000000000..ec357c58ef903
--- /dev/null
+++ b/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-target-error.cl
@@ -0,0 +1,26 @@
+// We test loads and stores separately because clang only seems to exit after
+// the first 'target feature' error.
+
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx602 
-DTEST_LOAD  -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx705 
-DTEST_LOAD  -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx810 
-DTEST_LOAD  -S -verify -o - %s
+
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx602 
-DTEST_STORE -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx705 
-DTEST_STORE -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx810 
-DTEST_STORE -S -verify -o - %s
+// REQUIRES: amdgpu-registered-target
+
+typedef __attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned 
int v4u32;
+typedef v4u32 __global *global_ptr_to_v4u32;
+
+#ifdef TEST_LOAD
+v4u32 test_amdgcn_global_load_b128_01(global_ptr_to_v4u32 ptr, const char* 
scope) {
+  return __builtin_amdgcn_global_load_b128(ptr, ""); // 
expected-error{{'__builtin_amdgcn_global_load_b128' needs target feature 
gfx9-insts}}
+}
+#endif
+
+#ifdef TEST_STORE
+void test_amdgcn_global_store_b128_01(global_ptr_to_v4u32 ptr, v4u32 data, 
const char* scope) {
+  __builtin_amdgcn_global_store_b128(ptr, data, ""); // 
expected-error{{'__builtin_amdgcn_global_store_b128' needs target feature 
gfx9-insts}}
+}
+#endif
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 7ecf1c1124894..39afd29737156 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1596,6 +1596,112 @@ The AMDGPU backend implements the following LLVM IR 
intrinsics.
                                                    * 1 - Data cache.
 
                                                    Instruction cache 
prefetches are unsafe on invalid address.
+
+  llvm.amdgcn.global.load.b128                     This intrinsic is supported 
on gfx9, gfx10, gfx11, and gfx12 targets.
+  
+                                                   Signature:
+                                                   
+                                                   .. code-block:: llvm
+                                                      
+                                                      <4 x i32> 
@llvm.amdgcn.global.load.b128(
+                                                          ptr addrspace(1), ; 
source
+                                                          metadata)         ; 
scope    - e.g. '!0' where '!0 = !{!"wavegroup"}'
+
+                                                   Reads the value from the 
source address with cache behavior specified by the scope.
+
+                                                   The following table shows 
the mapping between valid scope values and target
+                                                   instruction flags or field 
values.
+
+                                                   ============== 
========================== ========================== 
========================== ========================== ==========================
+                                                   targets        instruction  
                         ``"wavefront"``            ``"workgroup"``             
   ``"agent"``      ``""`` (empty string)
+                                                   ============== 
========================== ========================== 
========================== ========================== ==========================
+                                                   gfx90*         
``global_load_dwordx4``                                                         
                    ``glc``                    ``glc``
+                                                   
+                                                   gfx942, gfx950 
``global_load_dwordx4``                        (wave)            ``sc0`` 
(group)           ``sc1`` (device)       ``sc0 sc1`` (system)
+                                                   
+                                                   gfx10*         
``global_load_dwordx4``                                                  
``glc``                ``glc dlc``                ``glc dlc``
+                                                   
+                                                   gfx11*         
``global_load_dwordx4``                                                  
``glc``                    ``glc``                    ``glc``
+                                                   
+                                                   gfx120*        
``global_load_b128``                             (CU)    ``scope:SCOPE_SE`` 
(SE)  ``scope:SCOPE_DEV`` (DEV)  ``scope:SCOPE_SYS`` (SYS)
+                                                   
+                                                   gfx125*        
``global_load_b128``                             (CU)                           
  ``scope:SCOPE_DEV`` (DEV)  ``scope:SCOPE_SYS`` (SYS)
+                                                   ============== 
========================== ========================== 
========================== ========================== ==========================
+                                                   
+                                                   For gfx90*, see "GLC Bit 
Explained" in the appropriate instruction set reference
+                                                   (e.g. Chapter 9.1.10 in 
"AMD Instinct MI100" Instruction Set Architecture Reference
+                                                   Guide).
+                                                   
+                                                   For gfx942 and gfx950 
targets, see "Memory Scope and Temporal Controls" in the
+                                                   appropriate instruction set 
reference (e.g. Chapter 9.1.10.2 in the "AMD Instinct
+                                                   MI300" Instruction Set 
Architecture Reference Guide).
+
+                                                   For gfx10* targets, see 
"GLC, DLC and SLC Bit Explained" in the appropriate
+                                                   instruction set reference 
(e.g. Chapter 8.1.10 in "RDNA 2" Instruction Set Architecture
+                                                   Reference Guide)
+                                                   
+                                                   For gfx11* targets, see 
"Cache Controls: SLC, GLC and DLC" in the appropriate
+                                                   instruction set reference 
(e.g. Chapter 4.1.1 in "RDNA3" Instruction Set Architecture
+                                   ...
[truncated]

``````````
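
The scope table in the `AMDGPUUsage.rst` hunk above is hard to read once line-wrapped. As one reading of its gfx942/gfx950 row for loads, the mapping from scope string to `global_load_dwordx4` cache-control flags can be sketched as below. The function name `gfx942_load_flags_for_scope` is hypothetical and illustrates the table only; it is not code from the patch, and the authoritative mapping is the rendered documentation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the flags the documentation table appears to
 * list for gfx942/gfx950 loads, or NULL for a scope string that is not one of
 * the four documented values. */
static const char *gfx942_load_flags_for_scope(const char *scope) {
  if (strcmp(scope, "wavefront") == 0)
    return ""; /* wave scope: no flags set */
  if (strcmp(scope, "workgroup") == 0)
    return "sc0"; /* group scope */
  if (strcmp(scope, "agent") == 0)
    return "sc1"; /* device scope */
  if (strcmp(scope, "") == 0)
    return "sc0 sc1"; /* empty string: system scope */
  return NULL; /* not a documented scope value */
}
```

Other target rows (gfx90*, gfx10*, gfx11*, gfx12*) use different flags, so a real lookup would need to be keyed on the target as well as the scope.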

</details>


https://github.com/llvm/llvm-project/pull/172090
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
