I changed it to use 128-byte alignment to match the GPU cache-lines.
Committed to OG12.
Andrew
On 11/01/2023 18:05, Andrew Stubbs wrote:
This patch fixes a runtime issue I encountered with the AMD GCN Unified
Shared Memory implementation.
We were using regular malloc'd memory configured into USM mode, but
there were random intermittent crashes. I can't be completely sure, but
my best guess is that the HSA driver is using malloc internally from the
same heap, and therefore using memory on the same page as the offload
kernel. What I do know is that I could make the crashes go away by
simply padding the USM allocations before and after.
With this patch USM allocations are now completely separated from the
system heap. The custom allocator is probably less optimal in some
use-cases, but it has the advantage that all the metadata is stored
in a side-table that won't ever cause any pages to migrate back to
main-memory unnecessarily. It's still possible for the user program to
use USM memory in a way that causes it to thrash, and this might have
been the ultimate cause of the crashes, but there's not much we can do
about that here.
I've broken the allocator out into a new file because I anticipate it
being needed in more than one place, but I didn't put full
data-isolation on it yet.
I'll rebase, merge, and repost all of the OpenMP memory patches sometime
soonish.
Andrew
libgomp, amdgcn: Switch USM to 128-byte alignment
This should somewhat optimize cache-line usage on the AMD GPUs.
libgomp/ChangeLog:
* usm-allocator.c (ALIGN): Use 128-byte alignment.
diff --git a/libgomp/usm-allocator.c b/libgomp/usm-allocator.c
index c45109169ca..68c1ebafec2 100644
--- a/libgomp/usm-allocator.c
+++ b/libgomp/usm-allocator.c
@@ -57,7 +57,8 @@ static int usm_lock = 0;
static struct usm_splay_tree_s usm_allocations = { NULL };
static struct usm_splay_tree_s usm_free_space = { NULL };
-#define ALIGN(VAR) (((VAR) + 7) & ~7) /* 8-byte granularity. */
+/* 128-byte granularity means GPU cache-line aligned. */
+#define ALIGN(VAR) (((VAR) + 127) & ~127)
/* Coalesce contiguous free space into one entry. This considers the entries
either side of the root node only, so it should be called each time a new