Comments on: Other Than Nvidia, Who Will Use Arm’s Neoverse V2 Core?
https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/

By: HuMo (Fri, 15 Sep 2023 23:07:06 +0000) https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/#comment-213664
In reply to Hubert.

… or, seeing how these V2 cores are so tiny at 2.5 mm^2, AMD could bake 96 of them in a pair of tasty chiplets, plunk them onto an MI300N (N for Neoverse) with 4 Instinct GPU dies, and then make some funny faces at NVIDIA! d^%^p

By: Hubert (Thu, 14 Sep 2023 20:54:55 +0000) https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/#comment-213626
In reply to Slim Albert.

Inasmuch as speculative execution is a hardware implementation of McCarthy’s ambiguous operator (amb), with branch-predictor heuristics to prioritize moves (as in chess), the more execution ports are available to run this in parallel, the better — so I’d vote to keep V2’s new ALUs (I think).
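For the amb analogy, here is a toy C sketch (entirely my own illustration, nothing from the article): candidate outcomes are tried in heuristic, most-likely-first order, and a failed guess is simply abandoned, much like a predictor picking the hot path and flushing on a mispredict.

```c
#include <stdio.h>

/* Toy "amb": try candidates in heuristic (most-likely-first) order and
 * keep the first one that pans out, abandoning failed guesses, a bit
 * like a branch predictor picking the hot path and flushing on a miss. */
static int amb_search(const int *candidates, int n,
                      int (*ok)(int), int *out)
{
    for (int i = 0; i < n; i++) {        /* speculate on candidate i      */
        if (ok(candidates[i])) {         /* commit if the guess works out */
            *out = candidates[i];
            return 1;
        }
        /* otherwise discard the guess and try the next one ("flush")     */
    }
    return 0;                            /* no candidate satisfied ok()   */
}

static int is_even(int x) { return x % 2 == 0; }

int main(void)
{
    int picks[] = { 3, 7, 8, 9 };        /* ordered by predicted likelihood */
    int chosen;
    if (amb_search(picks, 4, is_even, &chosen))
        printf("committed on %d\n", chosen);   /* prints: committed on 8 */
    return 0;
}
```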

I might agree with you on cutting per-core L2 back to 1 MB and increasing FP64 oomph instead, say with 4x or 8x 256-bit vector units, to provide one or two single-cycle 4×4 FP64 matmuls, and to fit the nimbler 32B line size. It should be tested for performance and efficiency, though.
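As a back-of-envelope check (my own arithmetic, not figures from the article), a 4×4 FP64 matmul is 4×4×4 = 64 FMAs, so the cycle count for plain vector units depends entirely on how many FP64 FMA lanes they expose per cycle:

```c
#include <stdio.h>

int main(void)
{
    const int fmas = 4 * 4 * 4;                        /* a 4x4 FP64 matmul = 64 FMAs */
    const int cfgs[][2] = { { 4, 256 }, { 8, 256 } };  /* units x width (bits)        */

    for (int i = 0; i < 2; i++) {
        int lanes  = cfgs[i][0] * cfgs[i][1] / 64;     /* FP64 FMA lanes per cycle    */
        int cycles = (fmas + lanes - 1) / lanes;       /* round up                    */
        printf("%d x %d-bit units: %2d FP64 FMAs/cycle -> %d cycles per 4x4 matmul\n",
               cfgs[i][0], cfgs[i][1], lanes, cycles);
    }
    return 0;
}
```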

By: Slim Albert (Thu, 14 Sep 2023 16:38:49 +0000) https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/#comment-213615

I’m not sure why V2 adds those 2 ALUs when there were already 6 in V1 (maybe for speculative execution in crazy-branched code, or 3 levels of binary branches in the execution tree/graph, a bit like chess?). But looking at FP64 efficiency, it works out to about 16 GF/W (8 x 2.8 GHz / 1.4 W), which is Fugaku-like (15.4 GF/W). So, as with other CPUs, pairing with accelerators seems needed to get to the 100+ GF/W of MI300A and H100.
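Spelling that estimate out (using only the figures quoted above: 8 FP64 FLOP/cycle, 2.8 GHz, and roughly 1.4 W per core):

```c
#include <stdio.h>

int main(void)
{
    const double flop_per_cycle = 8.0;   /* FP64 FLOP/cycle, as quoted above */
    const double ghz            = 2.8;   /* clock, GHz                       */
    const double watts_per_core = 1.4;   /* rough per-core power, as quoted  */

    double gflops = flop_per_cycle * ghz;        /* 22.4 GFLOP/s per core    */
    double gf_w   = gflops / watts_per_core;     /* ~16 GFLOP/s per watt     */
    printf("%.1f GFLOP/s per core -> %.1f GF/W\n", gflops, gf_w);
    return 0;
}
```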

Maybe, for HPC/AI, they could remove the 2 extra ALUs, cut L2 back to 1 MB, and use the freed space for a couple of 4×4 or 8×8 matrix units?

By: Hubert (Wed, 13 Sep 2023 23:31:58 +0000) https://www.nextplatform.com/2023/09/13/other-than-nvidia-who-will-use-arms-neoverse-v2-core/#comment-213543

Wow! That’s the deepest of deep drilldowns I’ve seen in quite a while (a Whopper!). My answer to the title survey question is (of course): everyone will use the Neoverse V2! HPC and AI will want to add HBM though, as found in A64FX (Neo. V0), SiPearl Rhea (Neo. V1), and on the GPU part of the GH200 (Neo. V2 for the CPU part).

There’s some indication that the V0-to-V2 progression is meant to make the CPU more nimble and agile (more responsive, less crampy and sclerotic) as it jumps around through code and data. For example, the width of vector units went down from 2x512b in “V0”, to 2x256b in V1, and to 4x128b in V2. This should also help with chip layout (cost) and power consumption (cost). Also, the A64FX/V0 has 256-byte cache lines, vs the more normal 64B for V1 & V2 (more nimble). The L2 cache per core increased from 0.67MB in V0, to 1MB in V1, and 2MB in V2, which is generally great as well (thanks to 5nm vs 7nm). And, in V2, they do “aggressive store-to-load forwarding [for] minimal bubbles and stalls […] maintaining the short pipeline […] for quick mispredict recovery” (more nimble).
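A quick sketch of what those width changes mean for raw per-cycle FP64 lanes (assuming, as a simplification, that every unit can issue an FP64 op each cycle):

```c
#include <stdio.h>

int main(void)
{
    struct { const char *core; int units, bits; } cfg[] = {
        { "A64FX (\"V0\")", 2, 512 },
        { "Neoverse V1",    2, 256 },
        { "Neoverse V2",    4, 128 },
    };
    for (int i = 0; i < 3; i++) {
        int lanes = cfg[i].units * cfg[i].bits / 64;   /* 64-bit lanes/cycle */
        printf("%-14s %d x %3d-bit = %2d FP64 lanes/cycle\n",
               cfg[i].core, cfg[i].units, cfg[i].bits, lanes);
    }
    return 0;
}
```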

As the article suggests, the impact of these improvements should barely show up in SPEC CPU (a benchmark of ALU/FPU perf with blocky memory transfers, if any), but it should be much more visible in workloads with scattered memory accesses, when running virtual machines, and in dynamic languages like Python and JavaScript. In other words, V2 is the right direction for datacenter servers, HPC, and AI. Next up, they’ll want to reduce the size of cache lines to 32B (for graphs!), in V3.
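To illustrate why narrower lines suit scattered accesses, here is a deliberately simplified model (my own, ignoring any spatial locality): a purely random 8-byte access uses only 8 of the bytes each miss drags in.

```c
#include <stdio.h>

int main(void)
{
    const int line_sizes[] = { 256, 64, 32 };   /* A64FX, V1/V2, hoped-for V3 */
    for (int i = 0; i < 3; i++) {
        double useful = 8.0 / line_sizes[i];    /* fraction of fetched bytes used */
        printf("%3dB lines: %4.1f%% of fetched bytes useful (%2.0fx traffic amplification)\n",
               line_sizes[i], 100.0 * useful, line_sizes[i] / 8.0);
    }
    return 0;
}
```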

It’s fun to see that, even in ARM’s RISC (since Cortex-A78?), instructions can be split internally into μOps, and/or fused into MOps (e.g. CMP + CSEL/CSET in the Issue/Execute slide), for increased performance!
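As an illustration (the C below is mine; the fusion pair is the one named on that slide), a plain comparison like this is typically lowered by AArch64 compilers to a CMP followed by a CSET, exactly the kind of adjacent pair that can be fused into one MOp:

```c
#include <stdio.h>

/* AArch64 compilers typically lower this to "cmp x0, x1 ; cset w0, le",
 * the adjacent CMP + CSET pair that the core can fuse into a single MOp. */
static int le(long a, long b)
{
    return a <= b;
}

int main(void)
{
    printf("%d %d\n", le(3, 7), le(7, 3));   /* prints: 1 0 */
    return 0;
}
```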
