Comments on: Other Than Nvidia, Who Will Use Arm’s Neoverse V2 Core?
https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/

By: HuMo (Fri, 15 Sep 2023 23:07:06 +0000) https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/#comment-213664
In reply to Hubert.

… or, seeing how these V2 cores are so tiny at 2.5 mm^2, AMD could bake 96 of them in a pair of tasty chiplets, plunk them onto an MI300N (N for Neoverse) with 4 Instinct GPU dies, and then make some funny faces at NVIDIA! d^%^p

By: Hubert (Thu, 14 Sep 2023 20:54:55 +0000) https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/#comment-213626
In reply to Slim Albert.

Inasmuch as speculative execution is a hardware implementation of McCarthy’s ambiguous operator (amb), with branch-predictor heuristics to prioritize moves (as in chess), the more execution ports are available to run this in parallel, the better — so I’d vote to keep V2’s new ALUs (I think).
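For the amb analogy, here is a toy C sketch (entirely my own illustration, nothing from the article): candidate outcomes are tried in heuristic, most-likely-first order, and a failed guess is simply abandoned, much like a predictor picking the hot path and flushing on a mispredict.

```c
#include <stdio.h>

/* Toy "amb": try candidates in heuristic (most-likely-first) order and
 * keep the first one that pans out, abandoning failed guesses, a bit
 * like a branch predictor picking the hot path and flushing on a miss. */
static int amb_search(const int *candidates, int n,
                      int (*ok)(int), int *out)
{
    for (int i = 0; i < n; i++) {        /* speculate on candidate i      */
        if (ok(candidates[i])) {         /* commit if the guess works out */
            *out = candidates[i];
            return 1;
        }
        /* otherwise discard the guess and try the next one ("flush")     */
    }
    return 0;                            /* no candidate satisfied ok()   */
}

static int is_even(int x) { return x % 2 == 0; }

int main(void)
{
    int picks[] = { 3, 7, 8, 9 };        /* ordered by predicted likelihood */
    int chosen;
    if (amb_search(picks, 4, is_even, &chosen))
        printf("committed on %d\n", chosen);   /* prints: committed on 8 */
    return 0;
}
```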

I might agree with you on cutting per-core L2 back to 1 MB and increasing FP64 oomph instead, say with 4x or 8x 256-bit vector units, to provide one or two single-cycle 4×4 FP64 matmuls, and to fit the nimbler 32B line size. It should be tested for performance and efficiency, though.
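As a back-of-envelope check (my own arithmetic, not figures from the article), a 4×4 FP64 matmul is 4×4×4 = 64 FMAs, so the cycle count for plain vector units depends entirely on how many FP64 FMA lanes they expose per cycle:

```c
#include <stdio.h>

int main(void)
{
    const int fmas = 4 * 4 * 4;                        /* a 4x4 FP64 matmul = 64 FMAs */
    const int cfgs[][2] = { { 4, 256 }, { 8, 256 } };  /* units x width (bits)        */

    for (int i = 0; i < 2; i++) {
        int lanes  = cfgs[i][0] * cfgs[i][1] / 64;     /* FP64 FMA lanes per cycle    */
        int cycles = (fmas + lanes - 1) / lanes;       /* round up                    */
        printf("%d x %d-bit units: %2d FP64 FMAs/cycle -> %d cycles per 4x4 matmul\n",
               cfgs[i][0], cfgs[i][1], lanes, cycles);
    }
    return 0;
}
```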

By: Slim Albert (Thu, 14 Sep 2023 16:38:49 +0000) https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/#comment-213615

I’m not sure why V2 adds those 2 ALUs when there were already 6 in V1 (maybe for speculative execution in crazy-branched code, or 3 levels of binary branches in the execution tree/graph, a bit like chess?). But looking at FP64 efficiency, it works out to about 16 GF/W (8 x 2.8 GHz / 1.4 W), which is Fugaku-like (15.4 GF/W). So, as with other CPUs, pairing with accelerators seems needed to get to the 100+ GF/W of MI300A and H100.
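Spelling that estimate out (using only the figures quoted above: 8 FP64 FLOP/cycle, 2.8 GHz, and roughly 1.4 W per core):

```c
#include <stdio.h>

int main(void)
{
    const double flop_per_cycle = 8.0;   /* FP64 FLOP/cycle, as quoted above */
    const double ghz            = 2.8;   /* clock, GHz                       */
    const double watts_per_core = 1.4;   /* rough per-core power, as quoted  */

    double gflops = flop_per_cycle * ghz;        /* 22.4 GFLOP/s per core    */
    double gf_w   = gflops / watts_per_core;     /* ~16 GFLOP/s per watt     */
    printf("%.1f GFLOP/s per core -> %.1f GF/W\n", gflops, gf_w);
    return 0;
}
```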

Maybe, for HPC/AI, they could remove the 2 extra ALUs, cut L2 back to 1 MB, and use the freed space for a couple of 4×4 or 8×8 matrix units?

By: Hubert (Wed, 13 Sep 2023 23:31:58 +0000) https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/#comment-213543

Wow! That’s the deepest of deep drilldowns I’ve seen in quite a while (a Whopper!). My answer to the title survey question is (of course): everyone will use the Neoverse V2! HPC and AI will want to add HBM though, as found in A64FX (Neo. V0), SiPearl Rhea (Neo. V1), and on the GPU part of the GH200 (Neo. V2 for the CPU part).

There’s some indication that the V0-to-V2 progression is meant to make the CPU more nimble and agile (more responsive, less crampy and sclerotic) as it jumps around through code and data. For example, the width of vector units went down from 2x512b in “V0”, to 2x256b in V1, and to 4x128b in V2. This should also help with chip layout (cost) and power consumption (cost). Also, the A64FX/V0 has 256-byte cache lines, vs the more normal 64B for V1 & V2 (more nimble). The L2 cache per core increased from 0.67MB in V0, to 1MB in V1, and 2MB in V2, which is generally great as well (thanks to 5nm vs 7nm). And, in V2, they do “aggressive store-to-load forwarding [for] minimal bubbles and stalls […] maintaining the short pipeline […] for quick mispredict recovery” (more nimble).
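A quick sketch of what those width changes mean for raw per-cycle FP64 lanes (assuming, as a simplification, that every unit can issue an FP64 op each cycle):

```c
#include <stdio.h>

int main(void)
{
    struct { const char *core; int units, bits; } cfg[] = {
        { "A64FX (\"V0\")", 2, 512 },
        { "Neoverse V1",    2, 256 },
        { "Neoverse V2",    4, 128 },
    };
    for (int i = 0; i < 3; i++) {
        int lanes = cfg[i].units * cfg[i].bits / 64;   /* 64-bit lanes/cycle */
        printf("%-14s %d x %3d-bit = %2d FP64 lanes/cycle\n",
               cfg[i].core, cfg[i].units, cfg[i].bits, lanes);
    }
    return 0;
}
```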

As the article suggests, the impact of these improvements should barely show up in SPEC CPU (a benchmark of ALU/FPU perf with blocky memory transfers, if any), but it should be much more visible in workloads with scattered memory accesses, when running virtual machines, and in dynamic languages like Python and JavaScript. In other words, V2 is the right direction for datacenter servers, HPC, and AI. Next up, they’ll want to reduce the size of cache lines to 32B (for graphs!), in V3.
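To illustrate why narrower lines suit scattered accesses, here is a deliberately simplified model (my own, ignoring any spatial locality): a purely random 8-byte access uses only 8 of the bytes each miss drags in.

```c
#include <stdio.h>

int main(void)
{
    const int line_sizes[] = { 256, 64, 32 };   /* A64FX, V1/V2, hoped-for V3 */
    for (int i = 0; i < 3; i++) {
        double useful = 8.0 / line_sizes[i];    /* fraction of fetched bytes used */
        printf("%3dB lines: %4.1f%% of fetched bytes useful (%2.0fx traffic amplification)\n",
               line_sizes[i], 100.0 * useful, line_sizes[i] / 8.0);
    }
    return 0;
}
```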

It’s fun to see that, even in ARM’s RISC (since Cortex-A78?), instructions can be split internally into μOps, and/or fused into MOps (e.g. CMP + CSEL/CSET in the Issue/Execute slide), for increased performance!
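As an illustration (the C below is mine; the fusion pair is the one named on that slide), a plain comparison like this is typically lowered by AArch64 compilers to a CMP followed by a CSET, exactly the kind of adjacent pair that can be fused into one MOp:

```c
#include <stdio.h>

/* AArch64 compilers typically lower this to "cmp x0, x1 ; cset w0, le",
 * the adjacent CMP + CSET pair that the core can fuse into a single MOp. */
static int le(long a, long b)
{
    return a <= b;
}

int main(void)
{
    printf("%d %d\n", le(3, 7), le(7, 3));   /* prints: 1 0 */
    return 0;
}
```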
