AI Factory Deployment Training

Anticipated Outline and Content Availability by Track

1. Foundational Knowledge

Building AI Factories for the Next Industrial Revolution

Learn how NVIDIA’s full-stack AI Factory platform addresses modern design and deployment challenges through optimized compute, networking, and software infrastructure tailored for scalable enterprise AI.

Self-Paced 20 mins

Base Command Manager Administration Course

Install and manage clusters using Base Command Manager by mastering its architecture, management tools, workload handling, and cloud integration capabilities.

Self-Paced 4 hours

AI Infrastructure and Operations Fundamentals Course

Gain insights into the evolving AI landscape, covering the essential compute power, tools, and algorithms driving modern enterprise technology.

Recommended training for the AIIO certification exam.

Self-Paced 6 hours

AI Infrastructure and Operations Fundamentals Associate Certification Exam

This entry-level credential validates foundational skills in deploying, managing, and optimizing AI infrastructure, covering GPU systems, networking, and core operational workflows.

Exam 1 hour

Data Center Management Made Easy with UFM Course

Provision, monitor, and troubleshoot InfiniBand data center fabrics using UFM Enterprise, covering telemetry, traffic optimization, and job scheduler integration.

Self-Paced 3 hours

2. Physical Deployment

GB200 MOP

Deploy a GB200 NVL72 rack safely and successfully by following step-by-step MOP guidelines for receiving, uncrating, filling, moving, anchoring, and connecting cooling and power using the proper tools and PPE.

Self-Paced Coming Soon

3. Cluster Bring Up

Compute Cluster Bring Up Workshop

Topics for this instructor-led training include cloning default images, creating and cloning HGX and login categories, customizing software images, installing Base Command Manager and its license, and installing Kubernetes, Slurm/Enroot/Pyxis, and Run:ai.

PREREQUISITE: Base Command Manager Administration (self-paced course)

Workshop 3 days

Configure Head Node Network

Learn how to create a bonded provisioning interface on a Base Command Manager head node and set up cluster networks.

Self-Paced 5 mins

Configure Node Disk Layouts and Interfaces

Learn how to configure disk layouts for nodes, create nodes, and set up interfaces.

Self-Paced 6 mins

Configure High Availability for Base Command Manager Example

Learn how to configure high availability by creating a secondary head node and connecting it to shared storage.

Self-Paced 7 mins

Configure Rack Layout in Base Command Manager

Learn how to model, configure, assign, and visualize racks and chassis in Base Command Manager to represent your physical cluster layout.

Self-Paced 6 mins

Configure Power Distribution Units in BCM

Learn to configure and manage Power Distribution Units in Base Command Manager for centralized cluster power control.

Self-Paced 5 mins

Post-Install BCM Configuration

Content will include integration of storage solutions into BCM.

Self-Paced Coming Soon

4. Validation

ClusterKit: Introduction and Setup

Install NVIDIA ClusterKit from the HPC-X toolkit and configure the necessary prerequisites to prepare for network validation.

Self-Paced 5 mins

ClusterKit: Running and Interpreting Tests

Execute bandwidth and latency tests and interpret the text-based output to verify cluster health against performance baselines.

Self-Paced 7 mins

ClusterKit: Advanced Features and Visualization

Apply advanced mapper scripts for multi-HCA nodes and visualize test results using the UFM Fabric Validation Plugin.

Self-Paced 6 mins

Network Performance Benchmarking with NCCL Tests

Validate your GPU fabric for demanding AI workloads by running NCCL all_reduce_perf tests, interpreting bus bandwidth results, and using them to detect, isolate, and troubleshoot network performance bottlenecks.

Self-Paced Coming Soon

System Validation with High-Performance Linpack (HPL and Single-node Burn-in)

Gain the skills to validate large-scale AI clusters with HPL and HPL-MxP by configuring HPL.dat, running containerized HPL jobs on Slurm, and interpreting results to confirm performance, efficiency, and stability.

Self-Paced Coming Soon

NeMo Burn-in

Train a massive Nemotron LLM as a practical system benchmark by using DGXC Performance Recipes, launching multi-node jobs with llmb-run, reading training logs, and applying best practices for large-scale, distributed training.

Self-Paced Coming Soon

Node Inventory

More info soon.

Self-Paced Coming Soon

Storage Testing

More info soon.

Self-Paced Coming Soon

1. Foundational Knowledge

InfiniBand Network Administration Course

Master InfiniBand technology fundamentals to effectively install, configure, manage, and troubleshoot fabric architectures.

Self-Paced 5 hours

AI Infrastructure and Operations Fundamentals Course

Gain insights into the evolving AI landscape, covering the essential compute power, tools, and algorithms driving modern enterprise technology.

Recommended training for the AIIO certification exam.

Self-Paced 6 hours

AI Infrastructure and Operations Fundamentals Associate Certification Exam

This entry-level credential validates foundational skills in deploying, managing, and optimizing AI infrastructure, covering GPU systems, networking, and core operational workflows.

Exam 1 hour

BlueField DPU Administration Course

Learn the fundamental concepts of BlueField DPUs to deploy platforms for accelerated data center computing.

Self-Paced 5 hours

Data Center Management Made Easy with UFM Course

Provision, monitor, and troubleshoot InfiniBand data center fabrics using UFM Enterprise, covering telemetry, traffic optimization, and job scheduler integration.

Self-Paced 3 hours

Ansible Essentials for Network Engineers Course

Automate fabric workflows by exploring Ansible modules and writing playbooks specifically adapted for modern data centers.

Self-Paced 1 hour

Cumulus Linux Essentials Course

Explore NVIDIA Cumulus Linux through a combination of presentations and recorded hands-on demonstrations.

Self-Paced 1 hour

Building AI Factories for the Next Industrial Revolution

Learn how NVIDIA’s full-stack AI Factory platform addresses modern design and deployment challenges through optimized compute, networking, and software infrastructure tailored for scalable enterprise AI.

Self-Paced 20 mins

Spectrum-X Platform Overview

Discover how Spectrum-X enhances AI data center network architecture by addressing congestion and latency issues in traditional Ethernet.

Self-Paced 22 mins

Ethernet Switch Systems

Explore NVIDIA’s range of high-throughput Ethernet switch systems tailored for data centers, including the SN2000 through SN5000 series.

Self-Paced 8 mins

LinkX Interconnect Solutions

Enhance data center scalability and performance using high-speed interconnect options like direct attach copper and active optical cables.

Self-Paced 7 mins

Run:ai Platform Deployment Course

Learn to deploy, configure, and verify the Run:ai platform from start to finish.

Self-Paced 3 hours

RTX PRO Reference Architecture

Design and scale high-performance RTX PRO AI Factory clusters by applying the 2-8-5-200 reference architecture with RTX PRO 6000 GPUs, Spectrum-X networking, and modular scalable units for diverse AI workloads.

Self-Paced Coming Soon

HGX H*00/B*00 Enterprise Reference Architecture

Design and recommend the right NVIDIA Enterprise Reference Architecture by decoding C-G-N-B node configs, matching PCIe, HGX, and Grace patterns to workloads, and integrating Spectrum-X networking with certified storage and software stacks.

Self-Paced Coming Soon

NCP Reference Architecture

More info soon.

Self-Paced Coming Soon

2. Physical Deployment

Cabling Guide

Learn to implement NVIDIA’s comprehensive cabling methodology for AI data centers, mastering every phase from physical design and documentation to staging, installation, and long-term maintenance.

Self-Paced 6 mins

Server Scale NDR

More info soon.

Self-Paced Coming Soon

Rack Scale NDR

More info soon.

Self-Paced Coming Soon

Rack Scale XDR

More info soon.

Self-Paced Coming Soon

3. Cluster Bring Up

Networking Cluster Bring Up Workshop

This workshop covers the networking required for an AI Factory. Learners are introduced to Cumulus Linux, Spectrum-X, and InfiniBand, and learn how to configure networks and perform basic debugging and troubleshooting. They also become familiar with command-line and GUI tools for working with AI Factory networks, using the NVIDIA Air platform and other tools.

PREREQUISITES: InfiniBand Network Administration, Cumulus Linux Essentials, Spectrum-X Platform Overview (self-paced training)

Workshop Coming Soon

NetQ Configuration

Learn to use NVIDIA NetQ to monitor network health, validate configurations, and troubleshoot issues in real time for a stable, scalable AI data center.

Self-Paced 1 hour

Switch Configuration

Learn how to configure East/West (E/W) and North/South (N/S) switches.

Self-Paced Coming Soon

4. Validation

Cable Validation Tool (CVT)

Build end-to-end skills to plan, deploy, and operate the NVIDIA Cable Validation Tool by preparing prerequisites, installing the Collector (standalone or as a UFM plug‑in), creating a correct Unified P2P file, and using CVT to validate and troubleshoot your fabric. Expected availability is March 2026.

Self-Paced Coming Soon

Networking FW Validation

More info soon.

Self-Paced Coming Soon