On Mon, Dec 30, 2024 at 10:15:13AM -0800, Paul Lalonde wrote:
> The hard part is memory consistency.
>
> x86 has a strong coherence model, which means that any write is immediately
> visible to any other core that loads the same address. This wreaks havoc
> on your cache architecture. You need to either have a global
> synchronization point (effectively a global shared cache) with its
> associated performance bottleneck, or you need some method to track which
> cores are using which cache lines that might become invalidated by a
> write. Contended cache lines become very expensive in terms of
> communication. Intel's chips deal with this by having "tag directories"
> which track globally which caches contain which cache lines. AMD has a
> similar structure on the IO die of their chiplet-based architectures. In
> both cases, you get much better performance if you avoid *ever* using the
> same cache line on separate processors. In the AMD case you can fairly
> easily partition your workload across physical memory channels to do this.
> On Intel it's much more "magical", and it's not at all clear how well the
> magic scales into very large numbers of cores. You might note that Intel
> is not doing as well as AMD these days on the architecture side.
>
> Relaxing the memory model somewhat like ARM and RISC-V can help there by
> reducing the number of (false) synchronization points between caches, but
> at the cost of more subtle bugs if you miss including the required
> synchronization primitives.
Perhaps the target market is not really HPC in the sense of trying to speed up _one_ task by running it in parallel, concurrently, on a lot of cores, but rather the "cloud": there is limited need for global synchronization, since, all in all, the separate VMs running at the same time are almost orthogonal; even the storage is in fact segregated. So it is more a tightly coupled cluster of separate systems, allowing speedy migration of tasks, than a single system devoted to a single task (computing weather evolution or simulating a complex physical system). Just a guess.

> On Mon, Dec 30, 2024 at 10:00 AM Ron Minnich <rminn...@p9f.org> wrote:
>
> > On Mon, Dec 30, 2024 at 9:39 AM Bakul Shah via 9fans <9fans@9fans.net>
> > wrote:
> > >
> > > I wonder how these many-core systems share memory effectively.
> >
> > Typically there is an on-chip network, and at least on some systems,
> > memory blocks scattered among the cores.
> >
> > See the Esperanto SOC-1 for one example.

-- 
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/           http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T7692a612f26c8ec5-Ma32369581e699aa37841140f