AI Factory Deployment Training

Anticipated Outline and Content Availability by Track

1

Foundational Knowledge

Building AI Factories for the Next Industrial Revolution

Learn how NVIDIA’s full-stack AI Factory platform addresses modern design and deployment challenges through optimized compute, networking, and software infrastructure tailored for scalable enterprise AI.

Self-Paced 20 mins

Base Command Manager Administration

Install and manage clusters using Base Command Manager by mastering its architecture, management tools, workload handling, and cloud integration capabilities.

Self-Paced 4 hours

AI Infrastructure and Operations Fundamentals

Gain insights into the evolving AI landscape, covering the essential compute power, tools, and algorithms driving modern enterprise technology.

Recommended training for the AIIO certification exam.

Self-Paced 6 hours

AI Infrastructure and Operations Fundamentals Associate Certification Exam

This entry-level credential validates foundational skills in deploying, managing, and optimizing AI infrastructure, covering GPU systems, networking, and core operational workflows.

Exam 1 hour

Data Center Management Made Easy with UFM

Provision, monitor, and troubleshoot InfiniBand data center fabrics using UFM Enterprise, covering telemetry, traffic optimization, and job scheduler integration.

Self-Paced 3 hours

2

Physical Deployment

GB200 MOP

Self-Paced Coming Soon

3

Cluster Bring Up

Compute Cluster Bring Up Workshop

Topics for this instructor-led training include cloning default images, creating and cloning HGX and login categories, customizing software images, and installing Base Command Manager, the BCM license, Kubernetes, Slurm/Enroot/Pyxis, and Run:ai.

PREREQUISITE: Base Command Manager Administration (self-paced course)

Workshop 3 days

Configure Head Node Network

Learn how to create a bonded provisioning interface on a Base Command Manager head node and set up cluster networks.

Self-Paced 5 mins

Configure Node Disk Layouts and Interfaces

Learn how to configure disk layouts for nodes, create nodes, and set up interfaces.

Self-Paced 6 mins

Configure High Availability for Base Command Manager Example

Learn how to configure high availability by creating a secondary head node and connecting it to shared storage.

Self-Paced 7 mins

Configure Rack Layout in Base Command Manager

Learn how to model, configure, assign, and visualize racks and chassis in Base Command Manager to represent your physical cluster layout.

Self-Paced 6 mins

Configure Power Distribution Units in BCM

Learn to configure and manage Power Distribution Units in Base Command Manager for centralized cluster power control.

Self-Paced 5 mins

Post-Install BCM Configuration

Content will include integration of storage solutions into BCM.

Self-Paced Coming Soon

4

Validation

ClusterKit: Introduction and Setup

Install NVIDIA ClusterKit from the HPC-X toolkit and configure the necessary prerequisites to prepare for network validation.

Self-Paced 5 mins

ClusterKit: Running and Interpreting Tests

Execute bandwidth and latency tests and interpret the text-based output to verify cluster health against performance baselines.

Self-Paced 7 mins

ClusterKit: Advanced Features and Visualization

Apply advanced mapper scripts for multi-HCA nodes and visualize test results using the UFM Fabric Validation Plugin.

Self-Paced 6 mins

NCCL (different traffic patterns)

Self-Paced Coming Soon

NCCL Burn-in

Self-Paced Coming Soon

HPL Burn-in

Self-Paced Coming Soon

NeMo Burn-in

Self-Paced Coming Soon

Node Inventory

Self-Paced Coming Soon

Storage Testing

Self-Paced Coming Soon

Single-node Burn-in

Self-Paced Coming Soon

1

Foundational Knowledge

InfiniBand Network Administration

Master InfiniBand technology fundamentals to effectively install, configure, manage, and troubleshoot fabric architectures.

Self-Paced 5 hours

AI Infrastructure and Operations Fundamentals

Gain insights into the evolving AI landscape, covering the essential compute power, tools, and algorithms driving modern enterprise technology.

Recommended training for the AIIO certification exam.

Self-Paced 6 hours

AI Infrastructure and Operations Fundamentals Associate Certification Exam

This entry-level credential validates foundational skills in deploying, managing, and optimizing AI infrastructure, covering GPU systems, networking, and core operational workflows.

Exam 1 hour

BlueField DPU Administration

Learn the fundamental concepts of BlueField DPUs to deploy platforms for accelerated data center computing.

Self-Paced 5 hours

Data Center Management Made Easy with UFM

Provision, monitor, and troubleshoot InfiniBand data center fabrics using the enhanced management and telemetry capabilities of UFM Enterprise.

Self-Paced 3 hours

Ansible Essentials for Network Engineers

Automate fabric workflows by exploring Ansible modules and writing playbooks specifically adapted for modern data centers.

Self-Paced 1 hour

Cumulus Linux Essentials

Explore NVIDIA Cumulus Linux through a combination of presentations and recorded hands-on demonstrations.

Self-Paced 1 hour

Building AI Factories for the Next Industrial Revolution

Learn how NVIDIA’s full-stack AI Factory platform addresses modern design and deployment challenges through optimized compute, networking, and software infrastructure tailored for scalable enterprise AI.

Self-Paced 20 mins

Spectrum-X Platform Overview

Discover how Spectrum-X enhances AI data center network architecture by addressing congestion and latency issues in traditional Ethernet.

Self-Paced 22 mins

Ethernet Switch Systems

Explore NVIDIA’s range of high-throughput Ethernet switch systems tailored for data centers, including the SN2000 through SN5000 series.

Self-Paced 8 mins

LinkX Interconnect Solutions

Enhance data center scalability and performance using high-speed interconnect options like direct attach copper and active optical cables.

Self-Paced 7 mins

Run:ai Platform Deployment

Learn to deploy, configure, and verify the Run:ai platform from start to finish.

Partial content available now. Full content is anticipated to be available by February.

Self-Paced 3 hours

RTX Pro Reference Architecture

Self-Paced Coming Soon

HGX H*00/B*00 Enterprise Reference Architecture

Self-Paced Coming Soon

NCP Reference Architecture

Self-Paced Coming Soon

2

Physical Deployment

Cabling Guide

Learn to implement NVIDIA’s comprehensive cabling methodology for AI data centers, mastering every phase from physical design and documentation to staging, installation, and long-term maintenance.

Self-Paced 6 mins

Server Scale NDR

Self-Paced Coming Soon

Rack Scale NDR

Self-Paced Coming Soon

Rack Scale XDR

Self-Paced Coming Soon

3

Cluster Bring Up

Networking Cluster Bring Up Workshop

Workshop Coming Soon

NetQ Configuration

Learn to use NVIDIA NetQ to monitor network health, validate configurations, and troubleshoot issues in real time for a stable, scalable AI data center.

Self-Paced 1 hour

Switch Configuration

Learn how to configure east-west (E/W) and north-south (N/S) switches.

Self-Paced Coming Soon

4

Validation

Cable Validation Tool (CVT)

Expected availability is February 2026.

Self-Paced Coming Soon

Networking FW Validation

Self-Paced Coming Soon