Comments on: The NVSwitch Fabric That Is The Hub Of The DGX H100 SuperPOD
https://www.nextplatform.com/2022/03/23/nvidia-will-be-a-prime-contractor-for-big-ai-supercomputers/
In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds.
Wed, 26 Oct 2022 15:36:19 +0000

By: EC https://www.nextplatform.com/2022/03/23/nvidia-will-be-a-prime-contractor-for-big-ai-supercomputers/#comment-184423 Wed, 30 Mar 2022 16:00:37 +0000 https://www.nextplatform.com/?p=140293#comment-184423

And when the prime contractor stamps these as “Nvidia certified” there would be some rather hefty ongoing support contracts generated.

By: Timothy Prickett Morgan https://www.nextplatform.com/2022/03/23/nvidia-will-be-a-prime-contractor-for-big-ai-supercomputers/#comment-184169 Thu, 24 Mar 2022 17:21:20 +0000 https://www.nextplatform.com/?p=140293#comment-184169 In reply to Eric Olson.

And not just the vertical integration and the supply chain control, which are important. (The shortages of Mellanox ConnectX adapters right now are causing all kinds of server delays.) But the fact that Nvidia is a big user of supercomputing, understands all of the headaches, and builds the first system for itself and runs it before putting it out for sale means it has built the skills to do this. Is HPE or Dell or Lenovo or Atos a big user of supercomputing in the same way? I am not contending that Nvidia will want to be a prime contractor. I am thinking it will be forced into it.

By: Eric Olson https://www.nextplatform.com/2022/03/23/nvidia-will-be-a-prime-contractor-for-big-ai-supercomputers/#comment-184168 Thu, 24 Mar 2022 16:53:00 +0000 https://www.nextplatform.com/?p=140293#comment-184168 In reply to Timothy Prickett Morgan.

Back when Omni-Path was the latest and greatest, the switch here broke (during a vendor upgrade), leaving thousands of cores connected to each other only by means of the administrative Ethernet. Because the replacement switch was astonishingly out of stock, this situation persisted for more than 6 months.

My guess is most of the small HPC clusters similarly rely on vendor service and warranties for big-ticket replacement parts rather than redundantly keeping their own on site.

By: Eric Olson https://www.nextplatform.com/2022/03/23/nvidia-will-be-a-prime-contractor-for-big-ai-supercomputers/#comment-184164 Thu, 24 Mar 2022 16:31:08 +0000 https://www.nextplatform.com/?p=140293#comment-184164

The vertical integration that Nvidia has put together is not only useful from a performance point of view, but in these days of unreliable supply chains this type of vertical integration may allow Nvidia to more reliably take on the responsibility of delivering large systems according to contract.

Do Lenovo, Dell, HPE or even IBM have the needed level of control over their supply chains to provide the same practical delivery results as Nvidia? What about smaller companies such as Penguin?

By: Timothy Prickett Morgan https://www.nextplatform.com/2022/03/23/nvidia-will-be-a-prime-contractor-for-big-ai-supercomputers/#comment-184158 Thu, 24 Mar 2022 15:29:37 +0000 https://www.nextplatform.com/?p=140293#comment-184158 In reply to Paul Berry.

I presume there are spares sitting in most HPC and AI centers for this contingency, but it is a good point to ask what happens when you “chaos monkey” the network. I don’t know the answer, but it sounds like a good story to chase.

By: Paul Berry https://www.nextplatform.com/2022/03/23/nvidia-will-be-a-prime-contractor-for-big-ai-supercomputers/#comment-184156 Thu, 24 Mar 2022 15:09:02 +0000 https://www.nextplatform.com/?p=140293#comment-184156

Very impressive raw numbers. I would be interested to see real-world benchmarks.
With networks of this scale, it’s often most interesting to know how the network performs when 2 of the switches are disabled, 4 of the network cables are dead, and 2 of the network cables work but are transmitting occasional malformed packets. That’s the way it’ll have to run, much of the time. Not to say that Nvidia hasn’t thought of this, but it shows up in the real-world numbers.
