Comments on: Broadcom Takes On InfiniBand With Jericho3-AI Switch Chips
https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/
In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds.

By: Chris Whyte https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-208179 Sun, 07 May 2023 19:47:51 +0000 In reply to Timothy Prickett Morgan.

Your options for building out the most cost-effective AI fabric are not limited to IB vs. DDC. In fact, you can successfully leverage the huge investment in the massive, multi-vendor front-end DC fabric that already exists today by focusing the solution on emulating a lossless, low-latency network at the endpoints. Of course, it requires the architects of said DC fabric to have the ingenuity and foresight to do so. More importantly, it’s already being done at one very notable hyperscaler.
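Purely as a hedged illustration of what “emulating a lossless network on the endpoints” can mean (this is a toy model, not any hyperscaler’s or vendor’s actual scheme, and all names and numbers are made up), receiver-driven, credit-based flow control can be sketched in a few lines: the sender only ever transmits against credits the receiver has granted, so the receive buffer cannot overflow and nothing is dropped.

```python
def simulate_credit_fabric(num_pkts, buffer_slots, drain_per_round):
    """Toy round-based model of receiver-driven credit-based flow control:
    the sender transmits only against granted credits, so the receive
    buffer can never overflow -- lossless by construction."""
    credits = 0           # credits currently held by the sender
    free = buffer_slots   # free slots at the receiver
    in_buf = 0            # packets queued at the receiver
    pending = num_pkts    # packets the sender still wants to send
    sent = received = max_occupancy = 0
    while received < num_pkts:
        # receiver advertises its free slots as credits
        credits += free
        free = 0
        # sender transmits only what it holds credits for
        tx = min(pending, credits)
        credits -= tx
        pending -= tx
        sent += tx
        in_buf += tx
        max_occupancy = max(max_occupancy, in_buf)
        # the application drains the buffer, freeing slots for new credits
        drained = min(drain_per_round, in_buf)
        in_buf -= drained
        received += drained
        free += drained
    return sent, received, max_occupancy

sent, received, peak = simulate_credit_fabric(100, 8, 4)
print(sent, received, peak)  # every packet delivered, peak occupancy <= 8
```

The design point the sketch makes: losslessness comes from the endpoint protocol, not from the depth of the switch buffers in the middle.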

By: Timothy Prickett Morgan https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-208169 Sun, 07 May 2023 14:47:31 +0000 In reply to Chris Whyte.

The architecture, and how it got there, only matters inasmuch as it beats InfiniBand on real-world AI applications, and if it is also cheaper, then there ya go. Sauce for the goose, Mr. Saavik.

By: Chris Whyte https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-208144 Sun, 07 May 2023 04:11:04 +0000 So many questions, but I’ll start with this one:

If a fully scheduled fabric offers (near) perfect load balancing and congestion-free operation, then why exactly do I need deep buffers?

Note: Those deep buffers come with an additional cost due to the off-chip memory requirement.

It seems to me they’re just trying to repurpose an existing chip with a 15+ year old architecture (i.e., you’re paying for the additional packet memory whether you need it or not), throw “AI” on the end of the name (because we all know the mere mention of “AI” makes your stock jump 20% overnight), and inevitably force you into a vendor lock-in strategy.
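To put a rough number on the commenter’s point, a back-of-the-envelope sketch (the link speed and RTT below are illustrative assumptions, not Jericho3-AI specifications): if the fabric really is scheduled and congestion-free, per-port buffering on the order of one bandwidth-delay product should suffice, which is far below what off-chip memory provides.

```python
def bdp_bytes(link_gbps, rtt_us):
    """Bandwidth-delay product: the classic estimate of the buffering a
    port needs to cover one round trip of in-flight data without loss."""
    return link_gbps * 1e9 * rtt_us * 1e-6 / 8  # bits -> bytes

# Illustrative numbers only: one 800 Gb/s port, 8 microsecond fabric RTT
per_port = bdp_bytes(800, 8)
print(f"~{per_port / 1e3:.0f} KB per port")  # hundreds of KB, not gigabytes
```

If congestion never forms, the gigabytes of off-chip packet memory are capacity you pay for but, by the vendor’s own scheduling argument, should rarely need.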

By: Timothy Prickett Morgan https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-207932 Tue, 02 May 2023 16:37:09 +0000 In reply to anoop.

Correct.

By: anoop https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-207928 Tue, 02 May 2023 14:51:39 +0000 “there is nothing in the Microsoft architecture that _does not prohibit_ it from moving to a fabric based on the Broadcom Dune StrataDNX family of which Jericho3-AI and Ramon 3 are a part”

“does not prohibit” -> prohibits

By: HuMo https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-207829 Sat, 29 Apr 2023 03:04:24 +0000 In reply to Hubert.

Patience, young gracehopper, post-exascale kung-fu gastronomy cannot be rushed; to wit, AI-formulated plant-based cheeses will not hit shelves until 2024 (Bel Group and Climax Foods). According to nVidia dev blog meditations (Yamaguchi & Busato, 2021):

“sparse linear algebra […] does not provide competitive performance [yet …] when sparsity is below 95% […] due to […] scattered memory accesses”

300 PhDs a Shaolin Temple did not hence make(?) … and is a bird in the hand really worth two stoned (or vice-versa — as in the next TNP article)?

By: Elad https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-207807 Fri, 28 Apr 2023 12:36:31 +0000 Juniper Express5 is more promising, doubling Jericho3 capacity.

We all heard the theoretical scale claims for Jericho2 before – in reality it wasn’t even close!
Looks like another chip with great marketing that isn’t really optimised.

By: Hubert https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-207798 Fri, 28 Apr 2023 05:19:27 +0000 In reply to Eric Olson.

I think that it is because during training, one deals mostly with dense (full) weight matrices for the NN layers. As with HPC’s dense-matrix HPL benchmark, latency is less critical for this than bandwidth, as you can pretty much batch-burst the needed data into caches or buffers for the computational units, karate-style. Once the ANN is trained on its humongous dataset, one may want to prune it (esp. for weight storage efficiency), resulting in sparse weight matrices to be used for inference. The inference situation may then be similar to that of HPC’s sparse-matrix HPCG benchmark, where latency is more critical and some form of memory-access kung-fu becomes valuable. NVIDIA has developed some special sparsity-support hardware for this purpose, if I’m not mis-mixmetaphoring (along with adaptive 4-bit quantization, and more, as discussed in recent TNP pieces).
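A minimal sketch of the dense-training vs. sparse-inference distinction above, in plain Python (the matrix size and pruning threshold are arbitrary assumptions chosen for illustration): pruning zeroes out small weights, and the resulting sparse matrix-vector product touches only the survivors, which is exactly where scattered memory accesses, and thus latency, start to dominate over raw bandwidth.

```python
import random

def prune(weights, threshold):
    """Zero out small weights; return the pruned dense matrix and a
    sparse COO representation [(row, col, value), ...] of the survivors."""
    dense = [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]
    coo = [(i, j, w) for i, row in enumerate(dense)
           for j, w in enumerate(row) if w != 0.0]
    return dense, coo

def spmv(coo, x, nrows):
    """Sparse matrix-vector product: only surviving weights are touched,
    so the access pattern is scattered rather than a contiguous burst."""
    y = [0.0] * nrows
    for i, j, w in coo:
        y[i] += w * x[j]
    return y

random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
dense, coo = prune(W, threshold=1.0)
x = [1.0] * 8
print(f"{len(coo)} of {4 * 8} weights survive pruning")
```

The sparse product computes the same result as the pruned dense one while skipping the zeros; the trade, as the comment notes, is that the wins only materialize at high sparsity, where the bookkeeping and scattered reads are outweighed by the skipped work.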

By: Timothy Prickett Morgan https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-207765 Thu, 27 Apr 2023 11:50:14 +0000 In reply to Eric Olson.

It must be the case or this would be a Tomahawk variant.

By: Eric Olson https://www.nextplatform.com/2023/04/26/broadcom-takes-on-infiniband-with-jericho3-ai-switch-chips/#comment-207754 Thu, 27 Apr 2023 06:56:37 +0000 I thought large buffers were the enemy of low latency. Maybe I don’t understand AI training. Are there aspects, such as feed-forward, for which larger buffers are better?
