The hard part is memory consistency. x86 has a strong memory model (and strict cache coherence), which means that a write by one core must become visible to every other core that loads the same address, in a single agreed-upon order. This wreaks havoc on your cache architecture. You need either a global synchronization point (effectively a global shared cache), with its associated performance bottleneck, or some method of tracking which cores are using which cache lines that might be invalidated by a write. Contended cache lines become very expensive in terms of communication.

Intel's chips deal with this with "tag directories", which track globally which caches contain which cache lines. AMD has a similar structure on the IO die of their chiplet-based architectures. In both cases you get much better performance if you avoid *ever* using the same cache line on separate processors. In the AMD case you can fairly easily partition your workload across physical memory channels to do this. On Intel it's much more "magical", and it's not at all clear how well the magic scales to very large numbers of cores. You might note that Intel is not doing as well as AMD these days on the architecture side.
Relaxing the memory model somewhat, as ARM and RISC-V do, can help there by reducing the number of (false) synchronization points between caches, but at the cost of more subtle bugs if you omit a required synchronization primitive.

Paul

On Mon, Dec 30, 2024 at 10:00 AM Ron Minnich <rminn...@p9f.org> wrote:
> On Mon, Dec 30, 2024 at 9:39 AM Bakul Shah via 9fans <9fans@9fans.net> wrote:
> >
> > I wonder how these many-core systems share memory effectively.
>
> Typically there is an on-chip network, and at least on some systems,
> memory blocks scattered among the cores.
>
> See the Esperanto SOC-1 for one example.