Serving AI models at scale with vLLM


Author: Google Cloud Tech – Duration: 00:03:08
Unlock the full potential of your AI models by serving them at scale with vLLM. This video addresses common problems such as memory inefficiency, high latency under load, and large model sizes, and shows how vLLM maximizes the throughput of your existing hardware. Learn about vLLM features such as PagedAttention, Prefix Caching, Multi-Host Serving, and Disaggregated Serving, and see how vLLM integrates with Google Cloud GPUs and TPUs for flexible, high-performance AI inference.

Chapters:
0:00 – Introduction: The Challenge of Scaling AI
0:25 – 3 Common Problems
1:01 – Solution: vLLM for High-Performance Serving
1:13 – vLLM Feature: PagedAttention
1:30 – vLLM Feature: Prefix Caching
1:46 – vLLM Feature: Multi-Host and Disaggregated Serving
2:07 – Support for vLLM on Google Cloud (GPU and TPU)
2:29 – vLLM Tunable Settings
2:46 – Wrap-up

Resources:
Welcome to vLLM → https://goo.gle/49zlRZN
GitHub TPU Inference → https://goo.gle/3JUkBpn
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #vLLM #AIInfrastructure

Speakers: Don McCasland

Products mentioned: AI infrastructure, tensor processing units, cloud GPUs
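
The features and tunable settings listed in the chapters correspond to options in vLLM's Python API. The sketch below is a minimal illustration, not code from the video: the model name, memory fraction, context length, and prompts are all assumptions you would replace with your own values.

# Minimal vLLM sketch: offline batched generation with prefix caching enabled.
# Model name, gpu_memory_utilization, max_model_len, and prompts are
# illustrative assumptions, not values taken from the video.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model; swap in your own
    gpu_memory_utilization=0.90,               # tunable: fraction of accelerator memory vLLM may claim
    enable_prefix_caching=True,                # reuse KV-cache blocks for shared prompt prefixes
    max_model_len=4096,                        # tunable: cap context length to bound KV-cache size
)

# Prompts that share a long prefix benefit from prefix caching.
system = "You are a concise assistant for cloud infrastructure questions. "
prompts = [
    system + "What is PagedAttention?",
    system + "Why serve large models on TPUs?",
]

params = SamplingParams(temperature=0.7, max_tokens=128)
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())

For online, high-throughput serving, the same options are exposed as flags on vLLM's OpenAI-compatible server (started with the vllm serve command), which is the form you would typically run on Cloud GPUs or TPUs.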
