Comments on: Nvidia’s “Grace” Arm CPU Holds Its Own Against X86 For HPC
https://www.nextplatform.com/2024/02/06/nvidias-grace-arm-cpu-holds-its-own-against-x86-for-hpc/

By: Petr Krysl (Sat, 11 May 2024 00:04:48 +0000)
Interesting. At this point I have collected some data on the scaling of multithreaded computations on the Graces. In brief, there is no scaling beyond ~36 threads. The same code scales well on the A64FX up to 48 cores. So I might be leaving some performance on the table, but the tuning guide is not much help.
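For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the sort of scaling probe I mean (a memory-bound triad kernel timed at increasing thread counts; the kernel, array size, and OpenMP setup are illustrative assumptions, not my actual application code):

// Minimal OpenMP strong-scaling probe: time a memory-bound kernel
// at increasing thread counts. Illustrative only; the array size and
// kernel are assumptions, not the actual application code.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 67108864L  // ~0.5 GiB per array of doubles, three arrays

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    // Parallel first-touch so pages spread across both sockets/NUMA nodes.
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
        omp_set_num_threads(t);
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];  // STREAM-triad-like, memory bound
        double dt = omp_get_wtime() - t0;
        printf("%3d threads: %6.3f s, %7.1f GB/s\n",
               t, dt, 3.0 * N * sizeof(double) / dt / 1e9);
    }
    free(a); free(b); free(c);
    return 0;
}

If the bandwidth curve flattens near 36 threads while the same binary keeps climbing to 48 cores on the A64FX, that points at memory placement or bandwidth saturation rather than the compute side; on Grace-Grace, parallel first-touch initialization matters so that pages land on both sockets.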

By: Timothy Prickett Morgan (Fri, 09 Feb 2024 12:16:54 +0000)
In reply to Hubert.

I was beginning to think you took a sabbatical…

By: Hubert (Fri, 09 Feb 2024 04:42:32 +0000)
Outstanding information! It nicely ties together a lot of the news and analyses of HPC systems that TNP has covered over the past months. The comparison of A64FX and Graviton 3 shows the impact of vector units in matrix multiplication and LINPACK (A64FX has 1.5x more total vector capacity than Graviton 3) and suggests that Graviton 4 (same total vector capacity as A64FX) will do better there.
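To spell out the arithmetic behind that capacity claim (core counts and SVE widths are the publicly reported figures; the per-core pipe counts are my reading of the microarchitectures, so treat them as assumptions):

// Back-of-the-envelope total SVE vector capacity per chip.
// Core counts and widths are publicly reported; per-core pipe
// counts are assumptions; check vendor docs before relying on them.
#include <stdio.h>

int main(void) {
    struct { const char *name; long cores, pipes, bits; } chips[] = {
        { "A64FX",      48, 2, 512 },  // 2 x 512-bit SVE per core
        { "Graviton 3", 64, 2, 256 },  // 2 x 256-bit SVE per core
        { "Graviton 4", 96, 4, 128 },  // 4 x 128-bit SVE2 per core (assumed)
    };
    long base = 48 * 2 * 512;  // A64FX total, used for the ratio
    for (int i = 0; i < 3; i++) {
        long total = chips[i].cores * chips[i].pipes * chips[i].bits;
        printf("%-11s %6ld total vector bits (%.2fx A64FX)\n",
               chips[i].name, total, (double)total / base);
    }
    return 0;
}

That works out to A64FX at 1.5x Graviton 3 in total vector width, with Graviton 4 landing at the same total as A64FX, which is why I expect it to close the gap on DGEMM and LINPACK.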

Elsewhere, Grace-Grace is giving SR Max (the Aurora CPUs) a run for its money all around, except at HPCG, where HBM is key (to bringing memory-access efficiency up from, say, 3% to 6%; the challenge is still there, not really solved, but less bad). And since SR Max is not exactly power-sipping, there could be hope for “Fugaku-Next”-type machines that are Grace-Grace based, IMHO (to be compared also with Rhea1 and Monaka, but the former has too few vector units and the latter looks too N2).

Granite Rapids remains a wild card (hopefully the good kind, to keep the competition going!).

On the CPU-GPU side of things, because AMD’s FP64 performance has commonly trounced NVIDIA’s, I still fully expect MI300A machines (El Capitan) to just plain crush everything else out there (including GH100 motors like Venado). I can hardly wait for these machines to turn up (along with Aurora’s full config)!

By: Timothy Prickett Morgan (Fri, 09 Feb 2024 00:11:58 +0000)
In reply to Jeff.

Well, I guess I expected a sweep!

By: Jeff (Thu, 08 Feb 2024 21:00:55 +0000)
You write: “Here is how Grace-Grace stacked up on OpenFOAM, using the MotoBikeQ simulation with 11 million cells across all machines: We would have expected for the Grace-Grace unit to do better here. Hmmm.” What were you expecting? It already got the best results in two of the cases, and on Solving it just barely lost to the Xeon with HBM.

By: Nikolay Simakov (Thu, 08 Feb 2024 17:08:20 +0000)
In reply to Anonymous.

Hello. HPL was run within the HPCC benchmark, and the same matrix size was used across all test systems. So some would say it is not the correct way to run LINPACK. On the other hand, using the same matrix size makes for a more apples-to-apples comparison. Certainly, sweeping a range of matrix sizes would be best, but at this point we only run one size.
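For context, the usual practice is to size the HPL matrix so the N x N double-precision array fills most of a node’s memory. Here is a rough sketch of that sizing rule (the 80% fill factor is a common rule of thumb, and the memory sizes below are hypothetical, not our exact test systems):

// Rule-of-thumb HPL problem sizing: pick N so the N x N matrix of
// doubles fills ~80% of node memory. Memory sizes are hypothetical.
#include <stdio.h>
#include <math.h>

int main(void) {
    double mem_gib[] = { 240.0, 480.0, 1024.0 };  // hypothetical node RAM
    for (int i = 0; i < 3; i++) {
        double bytes = 0.80 * mem_gib[i] * 1073741824.0;  // GiB -> bytes
        long n = (long)sqrt(bytes / 8.0);                 // 8 bytes per double
        printf("%6.0f GiB node -> N of roughly %ld\n", mem_gib[i], n);
    }
    return 0;
}

Pinning one N across systems with very different memory capacities therefore undersizes the larger-memory nodes, which typically costs them some HPL efficiency; it is a deliberate trade-off in favor of comparability.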

By: Timothy Prickett Morgan (Wed, 07 Feb 2024 23:29:48 +0000)
In reply to Anonymous.

I am actually not sure. Precise node configurations were not shown, but core counts were. It looks like a two-socket AMD versus a Grace-Grace to me.

By: Timothy Prickett Morgan (Wed, 07 Feb 2024 23:28:30 +0000)
In reply to John Linford.

Yeah, I was doing a single Grace, then decided to double it, and then didn’t see that I hadn’t doubled it. You can always tell where the phone rang and made me lose my train of thought….

All fixed now.

By: Anonymous (Wed, 07 Feb 2024 17:32:19 +0000)
Looking at those LINPACK numbers, something seems way off. They have an Epyc 7763 system, supposedly dual-socket since it is listed with 128 cores, running at 2.1 teraflops. But looking at the Top500 results for dual-socket nodes with those CPUs, they seem to run at about 4.1 teraflops, and Dell has it at ~4 teraflops (https://infohub.delltechnologies.com/p/amd-milan-bios-characterization-for-hpc/). So was it really run correctly, or is a single socket being compared against a whole Grace-Grace board? And how many other errors like that are there?
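To put numbers on that, here is a back-of-the-envelope peak for the Epyc 7763 (Zen 3 does 16 FP64 FLOPs per cycle per core via two 256-bit FMA pipes; the base clock and the ~80% HPL efficiency I plug in are assumptions):

// Back-of-the-envelope Rpeak for the EPYC 7763 (Zen 3):
// 64 cores x 2.45 GHz base x 16 FP64 FLOPs/cycle (2 x 256-bit FMA pipes).
// Clock and HPL efficiency are assumptions, not measured values.
#include <stdio.h>

int main(void) {
    double ghz = 2.45, flops_per_cycle = 16.0, eff = 0.80;
    double rpeak = 64.0 * ghz * flops_per_cycle / 1000.0;  // TFLOPS per socket
    for (int sockets = 1; sockets <= 2; sockets++)
        printf("%d socket(s): Rpeak %.2f TF, ~%.2f TF at %.0f%% HPL efficiency\n",
               sockets, sockets * rpeak, sockets * rpeak * eff, eff * 100.0);
    return 0;
}

So a dual-socket 7763 node should land around 4 teraflops on HPL, as in the Dell numbers, while ~2.1 teraflops looks like one socket’s worth of silicon (or a fixed, undersized problem size, per the comment above).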

By: John Linford (Wed, 07 Feb 2024 17:26:05 +0000)
Hi Tim! Heads up, the bandwidth and capacity numbers here are incorrect: “…1 TB of physical memory with 546 GB/sec of peak theoretical bandwidth…. only 480 GB of that memory capacity and only 512 GB/sec of that memory bandwidth is actually available…”

In fact, the NVIDIA Grace CPU Superchip has 960 GB of memory capacity with up to 1 TB/s of memory bandwidth. Each Grace CPU provides up to 480 GB of LPDDR5X memory and 512 GB/s of memory bandwidth, so the two-CPU superchip works out to 2 x 480 GB = 960 GB and 2 x 512 GB/s = 1,024 GB/s, or roughly 1 TB/s. Please see page 5 of the Grace CPU Superchip white paper: https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-cpu-superchip.

For more documentation on the NVIDIA Grace CPU Superchip, see https://docs.nvidia.com/grace/ and https://developer.nvidia.com/grace.

Cheers!
