This page covers the requirements to become a Runpod Secure Cloud partner. These requirements are the baseline—meeting them does not guarantee selection. Runpod also evaluates business health, prior performance, and corporate alignment before selecting partners.
Minimum deployment size: 100 kW of GPU server capacity.

Effective dates

| Requirement | New partners | Existing partners |
|---|---|---|
| Hardware specifications | November 1, 2024 | December 15, 2024 (new servers only) |
| Compliance specifications | November 1, 2024 | April 1, 2025 |
A new revision is released annually in October. Minor mid-year revisions may be made for market, roadmap, or customer needs.

Hardware requirements

GPU compute servers

GPU requirements

NVIDIA GPUs no older than Ampere generation.

CPU requirements

| Requirement | Specification |
|---|---|
| Cores | Minimum 4 physical CPU cores per GPU, plus 2 for system operations. |
| Clock speed | Minimum 3.5 GHz base clock, with a boost clock of at least 4.0 GHz. |
| Recommended CPUs | AMD EPYC 9654 (96 cores, up to 3.7 GHz), Intel Xeon Platinum 8490H (60 cores, up to 4.8 GHz), AMD EPYC 9474F (48 cores, up to 4.1 GHz). |
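
The core-count rule scales linearly with GPU count, so it is easy to check during capacity planning. A minimal sketch of the arithmetic (the function name and example GPU counts are illustrative, not part of the specification):

```python
def min_physical_cores(gpu_count: int) -> int:
    """Minimum physical CPU cores for a GPU server:
    4 cores per GPU plus 2 reserved for system operations."""
    return 4 * gpu_count + 2

# Example: an 8-GPU server needs at least 34 physical cores, so the
# recommended 48-core EPYC 9474F or 96-core EPYC 9654 both qualify.
for gpus in (4, 8):
    print(f"{gpus} GPUs -> at least {min_physical_cores(gpus)} physical cores")
```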

Bus bandwidth

| GPU VRAM | Minimum bandwidth |
|---|---|
| 8/10/12/16 GB | PCIe 3.0 x16 |
| 20/24/32/40/48 GB | PCIe 4.0 x16 |
| 80 GB | PCIe 5.0 x16 |

Exception: the A100 80 GB PCIe card requires PCIe 4.0 x16.

Memory

Main system memory must have ECC.
| GPU configuration | Recommended RAM |
|---|---|
| 8x 80 GB VRAM | >= 2048 GB DDR5 |
| 8x 40/48 GB VRAM | >= 1024 GB DDR5 |
| 8x 24 GB VRAM | >= 512 GB DDR4/5 |
| 8x 16 GB VRAM | >= 256 GB DDR4/5 |

Storage

Servers require two separate storage arrays: a boot array for the host operating system and a working array for customer workloads.

Boot array:

| Requirement | Specification |
|---|---|
| Redundancy | >= 2n redundancy (RAID 1) |
| Size | >= 500 GB (post-RAID) |
| Sequential read | 2,000 MB/s |
| Sequential write | 2,000 MB/s |
| Random read (4K QD32) | 100,000 IOPS |
| Random write (4K QD32) | 10,000 IOPS |
Working array:

| Requirement | Specification |
|---|---|
| Redundancy | >= 2n redundancy (RAID 1 or RAID 10) |
| Size | 2 TB+ NVMe per GPU for 24/48 GB GPUs; 4 TB+ NVMe per GPU for 80 GB GPUs (post-RAID) |
| Sequential read | 6,000 MB/s |
| Sequential write | 5,000 MB/s |
| Random read (4K QD32) | 400,000 IOPS |
| Random write (4K QD32) | 40,000 IOPS |
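
Because the working array is sized per GPU and per VRAM class, the minimum post-RAID capacity follows directly from the table above. A rough sketch (the function name and example configurations are illustrative):

```python
# Minimum post-RAID working-array capacity, per the table above:
# 2 TB of NVMe per GPU for 24/48 GB GPUs, 4 TB per GPU for 80 GB GPUs.
TB_PER_GPU = {24: 2, 48: 2, 80: 4}

def min_working_array_tb(gpu_count: int, vram_gb: int) -> int:
    """Return the minimum post-RAID working-array size in TB."""
    return gpu_count * TB_PER_GPU[vram_gb]

print(min_working_array_tb(8, 80))  # 8x 80 GB GPUs -> 32 TB
print(min_working_array_tb(8, 48))  # 8x 48 GB GPUs -> 16 TB
```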

Storage cluster

Each data center must have a storage cluster providing shared storage between all GPU servers. The hardware is provided by the partner; storage cluster licensing is provided by Runpod. Baseline specifications:
| Component | Requirement |
|---|---|
| Minimum servers | 4 |
| Minimum storage size | 200 TB raw (100 TB usable) |
| Connectivity | 200 Gbps between servers/data plane |
| Network | Private subnet |
Server specifications:
| Component | Requirement |
|---|---|
| CPU | AMD Genoa: EPYC 9354P (32-core, 3.25-3.8 GHz), EPYC 9534 (64-core, 2.45-3.7 GHz), or EPYC 9554 (64-core, 3.1-3.75 GHz) |
| RAM | 256 GB or higher, DDR5/ECC |
Storage cluster servers follow the same boot array specifications as GPU compute servers. The working array should use JBOD (Runpod assembles it into an array), with 7-14 TB disk sizes recommended. Servers should have spare disk slots for future expansion.

Dedicated metadata server (large-scale clusters): required when the storage cluster exceeds 90% single-core CPU utilization on the leader node during peak hours.
| Component | Requirement |
|---|---|
| CPU | AMD Ryzen Threadripper 7960X (24-core, 4.2-5.3 GHz) |
| RAM | 128 GB or higher, DDR5/ECC |
| Boot disk | >= 500 GB, RAID 1 |

CPU servers

Each data center should have CPU servers for CPU-only Pods and Serverless workers. Runpod also uses these servers for features that do not require GPUs (e.g., the S3-compatible API). Baseline specifications:
| Component | Requirement |
|---|---|
| Minimum servers | 2 |
| Minimum storage size | 8 TB usable |
| Connectivity | 200 Gbps between servers/data plane |
| Network | Private subnet; public IP and >990 ports open |
Server specifications:
| Component | Requirement |
|---|---|
| CPU | AMD EPYC 9004 "Genoa" (Zen 4) or better, with a minimum of 32 cores and a 3+ GHz clock speed |
| RAM | 1 TB or higher, DDR5/ECC |
| Storage | 8 TB+, RAID 1 or RAID 10 |

Software requirements

Operating system

  • Ubuntu Server 22.04 LTS
  • Linux kernel 6.5.0-15 or later (Ubuntu HWE Kernel)
  • SSH remote connection capability
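
A quick way to confirm a host meets the kernel requirement is to compare the running kernel release against the minimum version. A minimal sketch, assuming Python is available on the host; it checks only the base kernel version and ignores the `-15` package revision:

```python
import platform

REQUIRED_KERNEL = (6, 5, 0)

def kernel_ok(release: str) -> bool:
    """Check that the kernel is 6.5.0 or later.
    Example release string: '6.5.0-15-generic' (Ubuntu HWE kernel)."""
    version = tuple(int(p) for p in release.split("-")[0].split("."))
    return version >= REQUIRED_KERNEL

release = platform.release()
print(f"Kernel {release}: {'OK' if kernel_ok(release) else 'too old'}")
```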

BIOS configuration

  • IOMMU disabled for non-VM systems
  • Server BIOS/firmware updated to latest stable version

Drivers and software

| Component | Requirement |
|---|---|
| NVIDIA drivers | Version 550.54.15 or later |
| CUDA | Version 12.4 or later |
| NVIDIA persistence mode | Enabled for GPUs with 48 GB of VRAM or more |
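
Driver version and persistence mode can both be read from `nvidia-smi`. A minimal verification sketch using its CSV query output; the 45,000 MiB threshold is an approximation of the 48 GB rule, since reported memory totals vary slightly by GPU model:

```python
import subprocess

def check_gpus() -> None:
    """Report driver version and persistence mode for each GPU.
    Persistence mode must be enabled on GPUs with roughly 48 GB of VRAM or more."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,driver_version,memory.total,persistence_mode",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        index, driver, mem_mib, persistence = [f.strip() for f in line.split(",")]
        needs_persistence = int(mem_mib) >= 45_000  # ~48 GB of VRAM
        if needs_persistence and persistence != "Enabled":
            print(f"GPU {index}: enable persistence mode (driver {driver})")
        else:
            print(f"GPU {index}: OK (driver {driver}, persistence {persistence})")

check_gpus()
```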

HGX SXM systems

Additional requirements for HGX SXM systems:
  • NVIDIA Fabric Manager installed, activated, running, and tested.
  • Fabric Manager version must match NVIDIA drivers and kernel driver headers.
  • CUDA Toolkit, NVIDIA NSCQ, and NVIDIA DCGM installed.
  • Verify NVLink switch topology using nvidia-smi and dcgmi.
  • Ensure SXM performance using dcgmi diagnostic tool.
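
The last two checks can be scripted. A minimal sketch that prints the NVLink topology matrix and runs a DCGM diagnostic; it assumes the DCGM host engine is running, and the diagnostic level shown is only an example:

```python
import subprocess

def run(cmd: list[str]) -> None:
    """Run a verification command and echo its output."""
    print(f"$ {' '.join(cmd)}")
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)

# NVLink switch topology: GPU pairs should report NV# links rather than PIX/PHB/SYS.
run(["nvidia-smi", "topo", "-m"])

# DCGM diagnostics: level 3 exercises memory, PCIe/NVLink, and stress tests.
run(["dcgmi", "diag", "-r", "3"])
```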

Power requirements

| Requirement | Specification |
|---|---|
| Utility feeds | Two independent utility feeds from separate substations. Each feed capable of supporting 100% of the data center's power load. Automatic transfer switches (ATS) with UL 1008 certification. |
| UPS | N+1 redundancy. Minimum 15 minutes of runtime at full load. |
| Generators | N+1 redundancy. 100% load support. 48 hours of on-site fuel storage at full load. Automatic transfer within 10 seconds of utility failure. |
| Power distribution | 2N redundant paths from utility to rack level. Redundant PDUs in each rack. Remote power monitoring at rack level. |
| Fire suppression | Compliant with NFPA 75 and 76 (or regional equivalent). |
| Capacity planning | Maintain minimum 20% spare power capacity. Annual capacity audits and forecasting. |

Testing and maintenance

  • Monthly generator tests under load (minimum 30 minutes).
  • Quarterly full-load tests of entire backup power system.
  • Annual full-facility power outage test (coordinated with Runpod).
  • Regular thermographic scanning of electrical systems.
  • Detailed maintenance logs for all power equipment.
  • 24/7 on-site facilities team.

Network requirements

| Requirement | Specification |
|---|---|
| Internet connectivity | Two diverse, redundant circuits from separate providers. BGP routing for automatic failover. 100 Gbps minimum total bandwidth. |
| Speed per server | Preferred: >= 10 Gbps sustained. Minimum: >= 5 Gbps sustained. |
| Core infrastructure | Redundant core switches in a high-availability configuration. |
| Distribution layer | Redundant switches with MLAG or equivalent. 100 Gbps uplinks to core switches. |
| Access layer | Redundant top-of-rack switches. 100 Gbps server connections for high-performance compute nodes. |
| DDoS protection | On-premises or on-demand cloud-based mitigation solution. |

Quality of service

  • Network utilization below 80% on any link during peak hours.
  • Packet loss not exceeding 0.1% on any segment.
  • P95 round-trip time within the data center not exceeding 4 ms.
  • P95 jitter within the data center not exceeding 3 ms.
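
P95 round-trip time and jitter are computed from repeated latency samples between hosts in the same data center. A rough sketch of the arithmetic, using `ping` as the sample source; the peer address and sample count are placeholders, and jitter is approximated as the difference between consecutive samples:

```python
import re
import subprocess

def rtt_samples(host: str, count: int = 100) -> list[float]:
    """Collect round-trip times (in ms) to another in-DC host via ping."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return [float(m) for m in re.findall(r"time=([\d.]+)", out)]

def p95(values: list[float]) -> float:
    """95th-percentile value of a list of samples."""
    return sorted(values)[int(0.95 * (len(values) - 1))]

samples = rtt_samples("10.0.0.2")  # placeholder: another host in the data center
jitter = [abs(a - b) for a, b in zip(samples, samples[1:])]
print(f"P95 RTT: {p95(samples):.2f} ms (target <= 4 ms)")
print(f"P95 jitter: {p95(jitter):.2f} ms (target <= 3 ms)")
```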

Testing and maintenance

  • Semi-annual failover testing of all redundant components.
  • Annual full-scale disaster recovery test.
  • Maintenance windows scheduled at least 1 week in advance.
  • Maintain minimum 40% spare network capacity.

Compliance requirements

Partners must adhere to at least one of the following compliance standards:
  • SOC 2 Type I (System and Organization Controls)
  • ISO/IEC 27001:2013 (Information Security Management Systems)
  • PCI DSS (Payment Card Industry Data Security Standard)

Operational standards

| Requirement | Description |
|---|---|
| Data center tier | Tier III+ data center standards. |
| Security | 24/7 on-site security and technical staff. |
| Physical security | Runpod servers in an isolated, secure rack or cage. Physical access tracked and logged. |
| Maintenance | Disruptions scheduled at least 1 week in advance. Large disruptions coordinated with Runpod at least 1 month in advance. |

Evidence review

Runpod will review:
  • Physical access logs.
  • Redundancy checks.
  • Refueling agreements.
  • Power system test results and maintenance logs.
  • Power monitoring and capacity planning reports.
  • Network infrastructure diagrams and configurations.
  • Network performance and capacity reports.
  • Security audit results and incident response plans.

Release log

  • 2024-11-01: Initial release.