1. GPU Accelerated Infrastructure
Accelerating AI with GPUs
GB200 Hardware Overview
GB200 Software Overview
GB300 NVL72 Upgrade
NCP Storage
Spectrum-X Networking Platform Overview
BlueField DPU Networking Platform
BlueField DPU System Overview
NVIDIA Data Center GPU Manager
NVIDIA GPU Containers
Virtualizing GPU Resources
1. Cumulus Linux Administration Self-Paced Course
Introduction to Cumulus Linux
- Cumulus Linux Overview and Architecture
- Getting Started with Cumulus Linux
- NVIDIA User Experience (NVUE) Overview
- Interface Configuration and Management
Layer 2 Features
- Ethernet Bridging (VLAN and Trunks, SVIs, STP)
- Bonds - Link Aggregation (LAG/LACP)
- Multi-Chassis LAG (MLAG)
Layer 3 Features
- Virtual Router Redundancy (VRR)
- Virtual Routing and Forwarding (VRF)
- FRR Routing Protocols Suite
Border Gateway Protocol (MP-BGP)
- BGP Overview
- BGP in the Data Center
- BGP Unnumbered
Network Virtualization with VXLAN and EVPN
- Network Virtualization with VXLAN
- Ethernet Virtual Private Network (EVPN)
- VXLAN Routing
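The Layer 2, Layer 3, and BGP topics above map to a handful of NVUE commands. A minimal sketch, assuming a Cumulus Linux switch; the interface names, VLAN ID, ASN, and router ID are illustrative placeholders:

```shell
# Bridge a front-panel port into VLAN 10 on the default VLAN-aware bridge
nv set interface swp1 bridge domain br_default
nv set bridge domain br_default vlan 10

# BGP unnumbered: peer over the interface itself, no IPv4 neighbor address
nv set router bgp autonomous-system 65101
nv set router bgp router-id 10.10.10.1
nv set vrf default router bgp neighbor swp51 remote-as external

# Review the pending change, then apply it
nv config diff
nv config apply
```

With BGP unnumbered, the neighbor is named by interface (`swp51`) rather than by address; peering runs over the link-local IPv6 addresses, which is what makes large leaf-spine fabrics manageable without per-link IPv4 addressing.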
1. InfiniBand Networking
Introduction to InfiniBand
InfiniBand Architecture and Management
Fabric Initialization
Fabric Monitoring
InfiniBand Fabric Diagnosis with ibdiagnet Tool
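The diagnosis tools covered here ship with the standard OFED/MLNX_OFED toolset. A minimal check sequence run from a host on the fabric, using only default invocations:

```shell
# Show the local HCA state (link up/down, rate, LID)
ibstat

# Discover the fabric topology as seen from this node
ibnetdiscover

# Run the full fabric diagnostic; reports land in /var/tmp/ibdiagnet2 by default
ibdiagnet
```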
2. Unified Fabric Monitoring
UFM Overview
Key Features
Architecture
Operational Dashboard
Fabric Health and Logging
3. NVLink Switch & GFM
NVLink Switch Tray Overview
NVOS Overview
NVLink Partitions Overview
NVOS ZTP & Software Upgrade
Switch Tray BMC
Reductions Over NVL
4. NMX
NMX-C Software Architecture Overview
NMX-T Software Architecture Overview
5. Monitoring, Telemetry and Troubleshooting
GB200 Telemetry
NVLink Fault Handling & Attribution
Flow Control and Network Forwarding in NVL
1. Base Command Manager
Base Command Manager Overview
Basic Concepts
Networking and Preparing for Installation
Head Node Installation
Bringing Up the Cluster
Orchestration, MLOps and Job Scheduling
2. NVIDIA AI Enterprise
NVIDIA AI Enterprise Solution Overview
Hardware Options
Deployment Methods Overview
NGC CLI and NGC API
NVIDIA Licensing System
Docker & NVIDIA Container Toolkit
Accessing and Installing AI Software
AI Workflows
NVIDIA Inference Microservice (NIM) LLM Overview
AI Inference
1. NCP NVL72 Reference Architecture Overview
Scalable Supercomputing Infrastructure
NVL72 East–West NVLink Compute Fabric
NVL72 InfiniBand Compute Fabric
Storage
- NCP storage hierarchy
- Storage fabrics
NCP N-S Management Networks
- In-band network
- Out-of-band management network
NCP Security
- Security best practices (compute, network, cluster management)
- Multi-Tenancy
NCP Control Plane
- Run:ai
- Slurm
- High Availability
2. Customer Environment Review
Systems and Components Overview
Cluster Architecture
Storage
Security
Workload Management
Q&A
3. System Administration Workshop
GPU Monitoring and Diagnostics
GPU Virtualization with MIG
Running NGC Containers
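The three workshop topics chain together naturally on one node. A hedged sketch combining DCGM monitoring, MIG partitioning, and an NGC container; the GPU index, MIG profile ID, and container tag are illustrative, not prescribed by the course:

```shell
# GPU monitoring: list the GPUs visible to the DCGM host engine
dcgmi discovery -l

# Enable MIG mode on GPU 0, then create a GPU instance and its compute instance
# (profile 9 is illustrative; run `nvidia-smi mig -lgip` for valid profiles)
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -cgi 9 -C

# MIG devices enumerate with their own UUIDs
nvidia-smi -L

# Run an NGC container pinned to the first MIG device on GPU 0
docker run --rm --gpus '"device=0:0"' \
    nvcr.io/nvidia/pytorch:24.05-py3 nvidia-smi
```

The `device=0:0` form addresses GPU 0, MIG instance 0 through the NVIDIA Container Toolkit, so the container sees only that slice of the GPU.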
1. Main Topics
Switch Management
Layer 2 Features
Layer 3 Features
Network Virtualization with VXLAN and EVPN
Troubleshooting
Automation with Ansible
1. InfiniBand Workshop
Managing InfiniBand Networks - UFM and ibdiagnet
2. NVLink Workshop
Coming Soon: NMX Presentation
NMX-C Troubleshooting and Telemetry
1. Base Command Manager Workshop
Cluster Management Tools
Node Provisioning
Software Images
Node Categories
User Management
Workload Management
Health Monitoring
Scheduling Jobs with Slurm
Job Orchestration with Run:ai and Kubernetes
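The two scheduling topics can be contrasted with a minimal Slurm batch script; the partition name, resource counts, and time limit below are placeholders, not values the course prescribes (`defq` is a common Base Command Manager default queue):

```shell
#!/bin/bash
# Minimal GPU batch job for Slurm; submit with `sbatch job.sh`
#SBATCH --job-name=train-demo
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00

# Runs on the allocated node; report which GPU(s) the job was granted
nvidia-smi -L
```

Run:ai jobs, by contrast, are submitted to Kubernetes (for example via the `runai` CLI) and scheduled against project and department quotas rather than Slurm partitions.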
1. NMC Software Platform Overview
Autonomous Hardware Recovery
Autonomous Job Recovery
Leak Detection
Building a Management System
IMEX (Internode Memory Exchange Service)
Rack Management
Firmware Management
Power Reservation Steering
1. Run:ai Overview
Run:ai Overview
Feature Deep Dive
Run:ai Persona Workflows
2. Run:ai Building Blocks
Managing Resources with Run:ai
Project Overview
Department Overview
Understanding Quotas and Over Quota
Introduction to Node Pools
3. Run:ai Manageability and Visibility
Run:ai Dashboards Overview
Dashboard Insights
Dashboard Navigation