Comments on: The Power Of Power10’s Memory Inception Clustering

By: Markos Malliarakis

Markos Malliarakis — Tue, 24 Aug 2021 07:38:01 +0000

The concept sends existing neural network frameworks straight to dust … everything in memory, with hybrid connected hardware pushing in that direction. Intel DPC++ is not far off, and C++20 is closer still, with programs running solely in the compiler.
Your comment is exactly right: ML and AI automated in the compiler for unsupervised ML … getting closer to Star Trek.

By: Axel Koester

Axel Koester — Wed, 18 Aug 2021 13:01:43 +0000

In reply to Paul Berry.

One major difference is the advent of in-memory computation which is also pioneered at IBM labs: https://www.zurich.ibm.com/sto/memory/
Considering the memory attached to the Power10 cores could be not just DRAM, but also compute-capable PCRAM with neural processing embedded into the PCRAM matrix, it’d make sense for several cores to be able to access that portion of memory in a shared way. Maybe another good reason to call it “memory inception”?

By: Erik Scott

Erik Scott — Tue, 17 Aug 2021 03:39:17 +0000

In reply to William Kelley. It's been so long that I don't trust my memory, but didn't the KSR approach require operating system support? I seem to remember it being "page faulting over the network instead of to disk". That's a gross oversimplification and there must have been some elegant locking. I saw a KSR-1, once. Running. :-)

By: Mark Funk

Mark Funk — Mon, 16 Aug 2021 21:02:36 +0000

In reply to v.ang. Interesting observation. I’ve been picturing quite a few higher-level comm architectures – along with variations on inter-process shared-memory architectures – built on top of something like this. But I also wonder at what level and how different. For example, with RDMA, even for simplex communications, there is still a source buffer and a target buffer, each securely addressed in its own system. It strikes me that RDMA assumes exactly that (but I would be pleased to learn otherwise). In RDMA, each system provides linkage into its own memory windows, along with all of the associated lower-level enablement, and seems to expose that to the RDMA user. With this clustered memory, though, simplex communication seems to be built upon a single shared memory buffer. Both systems, and the processes on each, know of that one shared buffer. The lower-level enablement sets things up so that the two systems and these processes see the same thing using their own higher-level addresses, again allowing cores on both systems to access the shared buffer. I’m just not sure that this underlying difference can be hidden from the RDMA user. Move it up a level, though, as in “I want that other system to see my data,” then sure.
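
The contrast drawn above, two separately addressed buffers versus one buffer that both sides map, can be sketched on a single machine. This is only a toy analogy in Python: `rdma_style_send` and `shared_buffer_demo` are hypothetical names, the "RDMA" half is a plain copy rather than a real verbs call, and the standard library's `multiprocessing.shared_memory` stands in for the clustered memory. Nothing here reflects how Power10 actually implements it.

```python
from multiprocessing import shared_memory


def rdma_style_send(source: bytearray, target: bytearray) -> None:
    """Toy stand-in for an RDMA write: bytes are copied from a source
    buffer into a separate, separately owned target buffer."""
    if len(target) < len(source):
        raise ValueError("target buffer too small")
    target[:len(source)] = source


def shared_buffer_demo(payload: bytes) -> bytes:
    """Two handles attach to ONE buffer. A write through the first handle
    is visible through the second with no copy step in between."""
    writer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # A second attachment by name, as another process (or, in the
        # analogy, another system) would do.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            writer.buf[:len(payload)] = payload
            return bytes(reader.buf[:len(payload)])
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()


if __name__ == "__main__":
    src, dst = bytearray(b"hello"), bytearray(5)
    rdma_style_send(src, dst)      # two buffers, explicit transfer
    print(dst)
    print(shared_buffer_demo(b"hello"))  # one buffer, two views
```

In the first function the sender and receiver each own a buffer and something must move the data; in the second there is only one buffer and each side simply has its own handle to it, which is the difference the comment suspects cannot be fully hidden from an RDMA user.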

By: Mark Funk

Mark Funk — Mon, 16 Aug 2021 20:11:55 +0000

In reply to William Kelley. Right. NUMA (Non-Uniform Memory Access) is a characteristic of pretty much any multi-socket SMP-based system. Such a cache-coherent SMP system is also called ccNUMA. It’s non-uniform in the sense that a core’s access to the memory hung off of its own socket tends to be faster than an identical access to memory hung off of another socket. NUMA, then, is a performance issue, not a functional issue. As long as a core can directly access memory in a cache-coherent manner, it is an SMP; again, a functional issue. I had considered folding NUMA into this article, but it was getting quite long as it was and for now I wanted to stick with the functional aspects of it. But as a hopefully straightforward answer, notice that IBM’s PowerAXON is being used for cross-socket accesses within their (NUMA) SMPs, even for up to their 16-socket system. Now also notice that PowerAXON is being used for clustered memory as well.
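
The "performance issue, not a functional issue" point can be illustrated with a very small model. This is a sketch under loud assumptions: `NumaSystem`, `LOCAL_NS`, and `REMOTE_NS` are hypothetical names, and the latency figures are made up for illustration, not measurements of Power10 or any other machine.

```python
from dataclasses import dataclass, field

# Assumed, purely illustrative latencies in nanoseconds.
LOCAL_NS = 100
REMOTE_NS = 200


@dataclass
class NumaSystem:
    # Maps address -> (home socket of the memory, stored value).
    memory: dict = field(default_factory=dict)

    def load(self, core_socket: int, address: int):
        """Return (value, modeled latency). Any core can read any address
        (the functional, SMP part); only the cost depends on which socket
        the memory hangs off (the NUMA part)."""
        home_socket, value = self.memory[address]
        latency = LOCAL_NS if home_socket == core_socket else REMOTE_NS
        return value, latency


if __name__ == "__main__":
    system = NumaSystem({0x10: (0, 42)})
    print(system.load(core_socket=0, address=0x10))  # local access
    print(system.load(core_socket=1, address=0x10))  # remote access
```

Both loads return the same value; only the modeled latency differs, which is all that "non-uniform" means in this sense.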

By: William Kelley

William Kelley — Mon, 16 Aug 2021 17:14:59 +0000

I kept expecting to see an explanation for how this architecture differs from NUMA (Non-Uniform Memory Access) or the Kendall Square Research (KSR-1) machine’s “All Cache Memory” architecture.

By: Paul Berry

Paul Berry — Mon, 16 Aug 2021 15:08:42 +0000

This is all really impressive. Is it substantively different from the Cray X1 or SGI Origin of 20 years ago? Obviously there are a lot more bytes and more bytes per second.

By: v.ang

v.ang — Sun, 15 Aug 2021 09:23:07 +0000

It reads like putting RDMA under the hood (or adding an ‘RDMA accelerator’), combined with smoothing out some network programming and process synchronization tasks that are complex for new engineers to grasp. Not a small feat by any means.

By: Eric Olson

Eric Olson — Sat, 14 Aug 2021 15:34:25 +0000

It would be nice, as future analysis, to see how the IBM memory inception architecture fits with modern Fortran’s notion of a coarray. Will there be compilers that support this in a natural way at launch?