Foundational Knowledge
Building AI Factories for the Next Industrial Revolution
Learn how NVIDIA’s full-stack AI Factory platform addresses modern design and deployment challenges through optimized compute, networking, and software infrastructure tailored for scalable enterprise AI.
Base Command Manager Administration Course
Install and manage clusters using Base Command Manager by mastering its architecture, management tools, workload handling, and cloud integration capabilities. Learn more about this course.
AI Infrastructure and Operations Fundamentals Course
Gain insights into the evolving AI landscape, covering the essential compute power, tools, and algorithms driving modern enterprise technology. Learn more about this course.
Recommended training for the AIIO certification exam.
AI Infrastructure and Operations Fundamentals Associate Certification Exam
This entry-level credential validates foundational skills in deploying, managing, and optimizing AI infrastructure, covering GPU systems, networking, and core operational workflows. Learn more about this exam.
Data Center Management Made Easy with UFM Course
Provision, monitor, and troubleshoot InfiniBand data center fabrics using UFM Enterprise, covering telemetry, traffic optimization, and job scheduler integration. Learn more about this course.
Physical Deployment
GB200 MOP
Deploy a GB200 NVL72 rack safely and successfully by following step-by-step Method of Procedure (MOP) guidelines for receiving, uncrating, filling, moving, anchoring, and connecting cooling and power using the proper tools and PPE.
Cluster Bring Up
Compute Cluster Bring Up Workshop
Topics for this instructor-led training include cloning default images, creating and cloning HGX and login categories, customizing software images, installing Base Command Manager and the BCM license, and installing Kubernetes, Slurm/Enroot/Pyxis, and Run:ai.
PREREQUISITE: Base Command Manager Administration (self-paced course)
See Outline
Configure Head Node Network
Learn how to create a bonded provisioning interface on a Base Command Manager head node and set up cluster networks.
Configure Node Disk Layouts and Interfaces
Learn how to configure disk layouts for nodes, create nodes, and set up interfaces.
Configure High Availability for Base Command Manager Example
Learn how to configure high availability by creating a secondary head node and connecting it to shared storage.
Configure Rack Layout in Base Command Manager
Learn how to model, configure, assign, and visualize racks and chassis in Base Command Manager to represent your physical cluster layout.
Configure Power Distribution Units in BCM
Learn to configure and manage Power Distribution Units in Base Command Manager for centralized cluster power control.
Post-Install BCM Configuration
Content will include integration of storage solutions into BCM.
Validation
ClusterKit: Introduction and Setup
Install NVIDIA ClusterKit from the HPC-X toolkit and configure the necessary prerequisites to prepare for network validation.
ClusterKit: Running and Interpreting Tests
Execute bandwidth and latency tests and interpret the text-based output to verify cluster health against performance baselines.
ClusterKit: Advanced Features and Visualization
Apply advanced mapper scripts for multi-HCA nodes and visualize test results using the UFM Fabric Validation Plugin.
Network Performance Benchmarking with NCCL Tests
Validate your GPU fabric for demanding AI workloads by running NCCL all_reduce_perf tests, interpreting bus bandwidth results, and using them to detect, isolate, and troubleshoot network performance bottlenecks.
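As an illustration of the bus bandwidth interpretation this module mentions, the sketch below derives algorithm and bus bandwidth from one all-reduce measurement. It uses the correction factor 2(n-1)/n that the nccl-tests project documents for all-reduce. The function names, the baseline value, and the 90% tolerance are hypothetical choices made for this example, not part of the course material.

```python
def allreduce_bandwidth(size_bytes: int, time_s: float, n_ranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for a single all-reduce measurement."""
    if n_ranks < 2:
        raise ValueError("all-reduce bus bandwidth needs at least 2 ranks")
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    # Algorithm bandwidth: message size divided by elapsed time.
    algbw = size_bytes / time_s / 1e9
    # Bus bandwidth: algbw scaled by 2(n-1)/n so the figure is
    # comparable across different rank counts.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


def meets_baseline(busbw: float, baseline: float, tolerance: float = 0.9) -> bool:
    """Hypothetical check: pass if busbw reaches tolerance * baseline."""
    return busbw >= tolerance * baseline


if __name__ == "__main__":
    # Illustrative numbers only: 8 GB reduced in 0.5 s across 8 ranks.
    algbw, busbw = allreduce_bandwidth(8_000_000_000, 0.5, 8)
    print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
    print("baseline ok:", meets_baseline(busbw, baseline=30.0))
```

A result that falls well below the expected baseline on some node pairs but not others is the kind of signal that would prompt isolating individual links or nodes.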
System Validation with High-Performance Linpack (HPL and Single-node Burn-in)
Gain the skills to validate large-scale AI clusters with HPL and HPL-MxP by configuring HPL.dat, running containerized HPL jobs on Slurm, and interpreting results to confirm performance, efficiency, and stability.
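For orientation, here is a rough sketch of two calculations that commonly accompany HPL runs: choosing the problem size N for HPL.dat from available memory, and computing efficiency as achieved over theoretical peak performance. It assumes the classic double-precision layout (8 bytes per matrix element) and rounds N down to a multiple of the block size NB. The 0.8 memory fraction, the default NB of 1024, and the function names are assumptions for this example; GPU-accelerated HPL builds may call for different sizing.

```python
import math


def hpl_problem_size(total_mem_bytes: int, mem_fraction: float = 0.8,
                     block_size: int = 1024) -> int:
    """Estimate N so that an N x N float64 matrix fills mem_fraction of memory."""
    if not 0 < mem_fraction <= 1:
        raise ValueError("mem_fraction must be in (0, 1]")
    # The matrix holds N*N elements of 8 bytes each.
    elements = mem_fraction * total_mem_bytes / 8
    n = int(math.sqrt(elements))
    # Round down to a multiple of NB so the matrix tiles evenly.
    return n - n % block_size


def hpl_efficiency(rmax: float, rpeak: float) -> float:
    """Efficiency = achieved performance / theoretical peak (same units)."""
    if rpeak <= 0:
        raise ValueError("rpeak must be positive")
    return rmax / rpeak


if __name__ == "__main__":
    # Illustrative only: 8 GiB of memory, fully used.
    print(hpl_problem_size(8 * 2**30, mem_fraction=1.0))
    print(f"{hpl_efficiency(75.0, 100.0):.0%}")
```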
NeMo Burn-in
Train a massive Nemotron LLM as a practical system benchmark by using DGXC Performance Recipes, launching multi-node jobs with llmb-run, reading training logs, and applying best practices for large-scale, distributed training.
Node Inventory
More info soon.
Storage Testing
More info soon.
Foundational Knowledge
InfiniBand Network Administration Course
Master InfiniBand technology fundamentals to effectively install, configure, manage, and troubleshoot fabric architectures. Learn more about this course.
AI Infrastructure and Operations Fundamentals Course
Gain insights into the evolving AI landscape, covering the essential compute power, tools, and algorithms driving modern enterprise technology. Learn more about this course.
Recommended training for the AIIO certification exam.
AI Infrastructure and Operations Fundamentals Associate Certification Exam
This entry-level credential validates foundational skills in deploying, managing, and optimizing AI infrastructure, covering GPU systems, networking, and core operational workflows. Learn more about this exam.
BlueField DPU Administration Course
Learn the fundamental concepts of BlueField DPUs to deploy platforms for accelerated data center computing. Learn more about this course.
Data Center Management Made Easy with UFM Course
Provision, monitor, and troubleshoot InfiniBand data center fabrics using UFM Enterprise, covering telemetry, traffic optimization, and job scheduler integration. Learn more about this course.
Ansible Essentials for Network Engineers Course
Automate fabric workflows by exploring Ansible modules and writing playbooks specifically adapted for modern data centers. Learn more about this course.
Cumulus Linux Essentials Course
Explore NVIDIA Cumulus Linux through a combination of presentations and recorded hands-on demonstrations. Learn more about this course.
Building AI Factories for the Next Industrial Revolution
Learn how NVIDIA’s full-stack AI Factory platform addresses modern design and deployment challenges through optimized compute, networking, and software infrastructure tailored for scalable enterprise AI.
Spectrum-X Platform Overview
Discover how Spectrum-X enhances AI data center network architecture by addressing congestion and latency issues in traditional Ethernet.
Ethernet Switch Systems
Explore NVIDIA’s range of high-throughput Ethernet switch systems tailored for data centers, including the SN2000 through SN5000 series.
LinkX Interconnect Solutions
Enhance data center scalability and performance using high-speed interconnect options like direct attach copper and active optical cables.
Run:ai Platform Deployment Course
Learn to deploy, configure, and verify the Run:ai platform from start to finish. Learn more about this course.
RTX Pro Reference Architecture
Design and scale high-performance RTX PRO AI Factory clusters by applying the 2-8-5-200 reference architecture with RTX PRO 6000 GPUs, Spectrum-X networking, and modular scalable units for diverse AI workloads.
HGX H*00/B*00 Enterprise Reference Architecture
Design and recommend the right NVIDIA Enterprise Reference Architecture by decoding C-G-N-B node configs, matching PCIe, HGX, and Grace patterns to workloads, and integrating Spectrum-X networking with certified storage and software stacks.
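To make the C-G-N-B shorthand concrete, the sketch below parses a label such as the 2-8-5-200 configuration named in the RTX PRO entry. It assumes the four fields stand for CPUs, GPUs, NICs, and average per-GPU network bandwidth in Gb/s, which is how the convention is generally described; confirm the exact definitions in the reference architecture itself. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    cpus: int            # C: CPUs per node (assumed meaning)
    gpus: int            # G: GPUs per node (assumed meaning)
    nics: int            # N: NICs per node (assumed meaning)
    bandwidth_gbps: int  # B: average bandwidth per GPU in Gb/s (assumed meaning)


def parse_node_config(label: str) -> NodeConfig:
    """Parse a 'C-G-N-B' label such as '2-8-5-200' into its four fields."""
    parts = label.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected four dash-separated fields, got {label!r}")
    # int() raises ValueError on any non-numeric field.
    cpus, gpus, nics, bandwidth = (int(p) for p in parts)
    return NodeConfig(cpus, gpus, nics, bandwidth)


if __name__ == "__main__":
    print(parse_node_config("2-8-5-200"))
```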
NCP Reference Architecture
More info soon.
Physical Deployment
Cabling Guide
Learn to implement NVIDIA’s comprehensive cabling methodology for AI data centers, mastering every phase from physical design and documentation to staging, installation, and long-term maintenance.
Server Scale NDR
More info soon.
Rack Scale NDR
More info soon.
Rack Scale XDR
More info soon.
Cluster Bring Up
Networking Cluster Bring Up Workshop
This workshop covers the networking required for an AI Factory. Learners are introduced to Cumulus Linux, Spectrum-X, and InfiniBand, and learn how to configure networks and perform basic debugging and troubleshooting. They also become familiar with command-line and GUI tools for working with AI Factory networks, using the NVIDIA Air platform and other tools.
PREREQUISITES: InfiniBand Network Administration, Cumulus Linux Essentials, Spectrum-X Platform Overview (self-paced training)
See Outline
NetQ Configuration
Learn to use NVIDIA NetQ to monitor network health, validate configurations, and troubleshoot issues in real time for a stable, scalable AI data center.
Switch Configuration
Learn how to configure east-west (E/W) and north-south (N/S) switches.
Validation
Cable Validation Tool (CVT)
Build end-to-end skills to plan, deploy, and operate the NVIDIA Cable Validation Tool by preparing prerequisites, installing the Collector (standalone or as a UFM plug‑in), creating a correct Unified P2P file, and using CVT to validate and troubleshoot your fabric. Expected availability is March 2026.
Networking FW Validation
More info soon.