Foundational Knowledge
Building AI Factories for the Next Industrial Revolution
Learn how NVIDIA’s full-stack AI Factory platform addresses modern design and deployment challenges through optimized compute, networking, and software infrastructure tailored for scalable enterprise AI.
Base Command Manager Administration
Install and manage clusters using Base Command Manager by mastering its architecture, management tools, workload handling, and cloud integration capabilities.
AI Infrastructure and Operations Fundamentals
Gain insights into the evolving AI landscape, covering the essential compute power, tools, and algorithms driving modern enterprise technology.
Recommended training for the AIIO certification exam.
AI Infrastructure and Operations Fundamentals Associate Certification Exam
This entry-level credential validates foundational skills in deploying, managing, and optimizing AI infrastructure, covering GPU systems, networking, and core operational workflows.
Data Center Management Made Easy with UFM
Provision, monitor, and troubleshoot InfiniBand data center fabrics using UFM Enterprise, covering telemetry, traffic optimization, and job scheduler integration.
Physical Deployment
GB200 MOP
Cluster Bring Up
Compute Cluster Bring Up Workshop
Topics for this instructor-led training include cloning default images, creating and cloning HGX and login categories, customizing the software image, and installing Base Command Manager, the BCM license, Kubernetes, Slurm/Enroot/Pyxis, and Run:ai.
PREREQUISITE: Base Command Manager Administration (self-paced course)
Configure Head Node Network
Learn how to create a bonded provisioning interface on a Base Command Manager head node and set up cluster networks.
Configure Node Disk Layouts and Interfaces
Learn how to configure disk layouts for nodes, create nodes, and set up interfaces.
Configure High Availability for Base Command Manager Example
Learn how to configure high availability by creating a secondary head node and connecting it to shared storage.
Configure Rack Layout in Base Command Manager
Learn how to model, configure, assign, and visualize racks and chassis in Base Command Manager to represent your physical cluster layout.
Configure Power Distribution Units in BCM
Learn to configure and manage Power Distribution Units in Base Command Manager for centralized cluster power control.
Post-Install BCM Configuration
Content will include integration of storage solutions into BCM.
Validation
ClusterKit: Introduction and Setup
Install NVIDIA ClusterKit from the HPC-X toolkit and configure the necessary prerequisites to prepare for network validation.
ClusterKit: Running and Interpreting Tests
Execute bandwidth and latency tests and interpret the text-based output to verify cluster health against performance baselines.
ClusterKit: Advanced Features and Visualization
Apply advanced mapper scripts for multi-HCA nodes and visualize test results using the UFM Fabric Validation Plugin.
NCCL (different traffic patterns)
NCCL Burn-in
HPL Burn-in
NeMo Burn-in
Node Inventory
Storage Testing
Single-node Burn-in
Foundational Knowledge
InfiniBand Network Administration
Master InfiniBand technology fundamentals to effectively install, configure, manage, and troubleshoot fabric architectures.
AI Infrastructure and Operations Fundamentals
Gain insights into the evolving AI landscape, covering the essential compute power, tools, and algorithms driving modern enterprise technology.
Recommended training for the AIIO certification exam.
AI Infrastructure and Operations Fundamentals Associate Certification Exam
This entry-level credential validates foundational skills in deploying, managing, and optimizing AI infrastructure, covering GPU systems, networking, and core operational workflows.
BlueField DPU Administration
Learn the fundamental concepts of BlueField DPUs to deploy platforms for accelerated data center computing.
Data Center Management Made Easy with UFM
Provision, monitor, and troubleshoot InfiniBand data center fabrics using the enhanced management and telemetry capabilities of UFM Enterprise.
Ansible Essentials for Network Engineers
Automate fabric workflows by exploring Ansible modules and writing playbooks specifically adapted for modern data centers.
Cumulus Linux Essentials
Explore NVIDIA Cumulus Linux through a combination of presentations and recorded hands-on demonstrations.
Building AI Factories for the Next Industrial Revolution
Learn how NVIDIA’s full-stack AI Factory platform addresses modern design and deployment challenges through optimized compute, networking, and software infrastructure tailored for scalable enterprise AI.
Spectrum-X Platform Overview
Discover how Spectrum-X enhances AI data center network architecture by addressing congestion and latency issues in traditional Ethernet.
Ethernet Switch Systems
Explore NVIDIA’s range of high-throughput Ethernet switch systems tailored for data centers, including the SN2000 through SN5000 series.
LinkX Interconnect Solutions
Enhance data center scalability and performance using high-speed interconnect options like direct attach copper and active optical cables.
Run:ai Platform Deployment
Learn to deploy, configure, and verify the Run:ai platform from start to finish.
Partial content available now. Full content is anticipated to be available by February.
RTX Pro Reference Architecture
HGX H*00/B*00 Enterprise Reference Architecture
NCP Reference Architecture
Physical Deployment
Cabling Guide
Learn to implement NVIDIA’s comprehensive cabling methodology for AI data centers, mastering every phase from physical design and documentation to staging, installation, and long-term maintenance.
Server Scale NDR
Rack Scale NDR
Rack Scale XDR
Cluster Bring Up
Networking Cluster Bring Up Workshop
NetQ Configuration
Learn to use NVIDIA NetQ to monitor network health, validate configurations, and troubleshoot issues in real time for a stable, scalable AI data center.
Switch Configuration
Learn how to configure east-west (E/W) and north-south (N/S) switches.
Validation
Cable Validation Tool (CVT)
Expected availability is February 2026.