NCP GB200/GB300 Administration

Course Outline

1

GPU Accelerated Infrastructure

Accelerating AI with GPUs

Self-Paced 22 min

GB200 Hardware Overview

Self-Paced 27 min

GB200 Software Overview

Self-Paced 17 min

GB300 NVL72 Upgrade

Self-Paced 24 min

NCP Storage

Self-Paced 14 min

Spectrum-X Networking Platform Overview

Self-Paced 22 min

BlueField DPU Networking Platform

Self-Paced 21 min

BlueField DPU System Overview

Self-Paced 13 min

NVIDIA Data Center GPU Manager

Self-Paced 7 min

NVIDIA GPU Containers

Self-Paced 8 min

Virtualizing GPU Resources

Self-Paced 32 min
1

Cumulus Linux Administration Self-Paced Course

Introduction to Cumulus Linux

  • Cumulus Linux Overview and Architecture
  • Getting Started with Cumulus Linux
  • NVIDIA User Experience (NVUE) Overview
  • Interface Configuration and Management
Self-Paced

Layer 2 Features

  • Ethernet Bridging (VLANs and Trunks, SVIs, STP)
  • Bonds - Link Aggregation (LAG/LACP)
  • Multi-Chassis LAG (MLAG)
Self-Paced

Layer 3 Features

  • Virtual Router Redundancy (VRR)
  • Virtual Routing and Forwarding (VRF)
  • FRR Routing Protocols Suite
Self-Paced

Border Gateway Protocol (MP-BGP)

  • BGP Overview
  • BGP in the Data Center
  • BGP Unnumbered
Self-Paced

Network Virtualization with VXLAN and EVPN

  • Network Virtualization with VXLAN
  • Ethernet Virtual Private Network (EVPN)
  • VXLAN Routing
Self-Paced
1

InfiniBand Networking

Introduction to InfiniBand

Self-Paced 10 min

InfiniBand Architecture and Management

Self-Paced 20 min

Fabric Initialization

Self-Paced 20 min

Fabric Monitoring

Self-Paced 12 min

InfiniBand Fabric Diagnosis with ibdiagnet Tool

Self-Paced 14 min
2

Unified Fabric Manager

UFM Overview

Self-Paced

Key Features

Self-Paced

Architecture

Self-Paced

Operational Dashboard

Self-Paced

Fabric Health and Logging

Self-Paced
3

NVLink Switch & GFM

NVLink Switch Tray Overview

Self-Paced 26 min

NVOS Overview

Self-Paced 36 min

NVLink Partitions Overview

Self-Paced 28 min

NVOS ZTP & Software Upgrade

Self-Paced 32 min

Switch Tray BMC

Self-Paced 25 min

Reductions Over NVL

Self-Paced 6 min
4

NMX

NMX-C Software Architecture Overview

Self-Paced 27 min

NMX-T Software Architecture Overview

Self-Paced 34 min
5

Monitoring, Telemetry and Troubleshooting

GB200 Telemetry

Self-Paced 28 min

NVLink Fault Handling & Attribution

Self-Paced 39 min

Flow Control and Network Forwarding in NVL

Self-Paced 34 min
1

Base Command Manager

Base Command Manager Overview

Self-Paced 11 min

Basic Concepts

Self-Paced 13 min

Networking and Preparing for Installation

Self-Paced 10 min

Head Node Installation

Self-Paced 10 min

Bringing Up the Cluster

Self-Paced 9 min

Orchestration, MLOps and Job Scheduling

Self-Paced 9 min
2

NVIDIA AI Enterprise

NVIDIA AI Enterprise Solution Overview

Self-Paced 20 min

Hardware Options

Self-Paced 35 min

Deployment Methods Overview

Self-Paced 25 min

NGC CLI and NGC API

Self-Paced 3 min

NVIDIA Licensing System

Self-Paced 30 min

Docker & NVIDIA Container Toolkit

Self-Paced 6 min

Accessing and Installing AI Software

Self-Paced 30 min

AI Workflows

Self-Paced 35 min

NVIDIA Inference Microservice (NIM) LLM Overview

Self-Paced 17 min

AI Inference

Self-Paced 55 min
1

NCP NVL72 Reference Architecture Overview

Scalable Supercomputing Infrastructure

ILT

NVL72 East–West NVLink Compute Fabric

ILT

NVL72 InfiniBand Compute Fabric

ILT

Storage

  • NCP storage hierarchy
  • Storage fabrics
ILT

NCP N-S Management Networks

  • In-band network
  • Out-of-band management network
ILT

NCP Security

  • Security best practices (compute, network, cluster management)
  • Multi-Tenancy
ILT

NCP Control Plane

  • Run:ai
  • Slurm
  • High Availability
ILT
2

Customer Environment Review

Systems and Components Overview

ILT

Cluster Architecture

ILT

Storage

ILT

Security

ILT

Workload Management

ILT

Q&A

ILT
3

System Administration Workshop

GPU Monitoring and Diagnostics

ILT

GPU Virtualization with MIG

ILT

Running NGC Containers

ILT
1

Main Topics

Switch Management

ILT

Layer 2 Features

ILT

Layer 3 Features

ILT

Network Virtualization with VXLAN and EVPN

ILT

Troubleshooting

ILT

Automation with Ansible

ILT
1

InfiniBand Workshop

Managing InfiniBand Networks - UFM and ibdiagnet

ILT
2

NVLink Workshop

Coming Soon: NMX Presentation

ILT

NMX-C Troubleshooting and Telemetry

ILT
1

Base Command Manager Workshop

Cluster Management Tools

ILT

Node Provisioning

ILT

Software Images

ILT

Node Categories

ILT

User Management

ILT

Workload Management

ILT

Health Monitoring

ILT

Scheduling Jobs with Slurm

ILT

Job Orchestration with Run:ai and Kubernetes

ILT
1

NMC Software Platform Overview

Autonomous Hardware Recovery

ILT

Autonomous Job Recovery

ILT

Leak Detection

ILT

Building a Management System

ILT

IMEX (Internode Memory Exchange Service)

ILT

Rack Management

ILT

Firmware Management

ILT

Power Reservation Steering

ILT
1

Run:ai Overview

Run:ai Overview

ILT

Feature Deep Dive

ILT

Run:ai Persona Workflows

ILT
2

Run:ai Building Blocks

Managing Resources with Run:ai

ILT

Project Overview

ILT

Department Overview

ILT

Understanding Quotas and Over Quota

ILT

Introduction to Node Pools

ILT
3

Run:ai Manageability and Visibility

Run:ai Dashboards Overview

ILT

Dashboard Insights

ILT

Dashboard Navigation

ILT
4

Run:ai Workloads

Workload Types

ILT

Assets

ILT

Workload Submission

ILT

Workload View

ILT