Why AI needs a new type of supercomputer network — the OpenAI Podcast Ep. 18


Author: OpenAI – Duration: 00:37:39
Training cutting-edge models isn't as simple as adding more GPUs: one small glitch and the whole coordinated dance falls apart. OpenAI's Mark Handley and Greg Steinbrecher explain how a new supercomputer network design, used to train some of the company's latest models, keeps the entire system moving, even with a record number of GPUs. They break down Multipath Reliable Connection (MRC), a new OpenAI protocol developed with AMD, Broadcom, Intel, Microsoft and Nvidia, and why they're making it available to the entire industry.

Chapters
00:00 Introduction
00:39 Greg and Mark's Paths to OpenAI
04:34 Why AI Training Stresses Networks Differently
10:05 Bottlenecks, Failures, and the Cost of Waiting
15:19 How a Reliable Multipath Connection Works
18:59 A Protocol to Work Around Failures
25:05 Why OpenAI Makes MRC an Open Standard
35:09 Could AI migrate to space?