This page covers the requirements to become a Runpod Secure Cloud partner. These requirements are the baseline—meeting them does not guarantee selection. Runpod also evaluates business health, prior performance, and corporate alignment before selecting partners.
Minimum deployment size: 100 kW of GPU server capacity.
Effective dates
| Requirement | New partners | Existing partners |
|---|---|---|
| Hardware specifications | November 1, 2024 | December 15, 2024 (new servers only) |
| Compliance specifications | November 1, 2024 | April 1, 2025 |
A new revision is released annually in October. Minor mid-year revisions may be made to address market, roadmap, or customer needs.
Hardware requirements
GPU compute servers
GPU requirements
NVIDIA GPUs no older than Ampere generation.
CPU requirements
| Requirement | Specification |
|---|---|
| Cores | Minimum of 4 physical CPU cores per GPU, plus 2 cores for system operations. |
| Clock speed | Minimum 3.5 GHz base clock, with boost clock of at least 4.0 GHz. |
| Recommended CPUs | AMD EPYC 9654 (96 cores, up to 3.7 GHz), Intel Xeon Platinum 8490H (60 cores, up to 4.8 GHz), AMD EPYC 9474F (48 cores, up to 4.1 GHz). |
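For example, an 8-GPU server needs at least 8 × 4 + 2 = 34 physical cores. A quick sanity check along these lines, assuming `nvidia-smi` and `lscpu` are available on the host (a sketch, not an official Runpod validation tool):

```bash
#!/usr/bin/env bash
# Sketch: check physical core count against the 4-cores-per-GPU + 2 rule.
gpus=$(nvidia-smi --list-gpus | wc -l)
# Count unique (core, socket) pairs to get physical cores, ignoring SMT threads.
cores=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
required=$(( gpus * 4 + 2 ))
echo "GPUs: $gpus  physical cores: $cores  required: $required"
[ "$cores" -ge "$required" ] || echo "FAIL: not enough physical CPU cores"
```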
Bus bandwidth
| GPU VRAM | Minimum bandwidth |
|---|---|
| 8/10/12/16 GB | PCIe 3.0 x16 |
| 20/24/32/40/48 GB | PCIe 4.0 x16 |
| 80 GB | PCIe 5.0 x16 |
Exception: The A100 80 GB PCIe variant requires only PCIe 4.0 x16.
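The negotiated link can be checked per GPU with `nvidia-smi`'s query interface. Note that GPUs may downtrain to a lower generation at idle, so compare the maximum capability rather than the current value alone:

```bash
# Report current and maximum PCIe generation, plus maximum width, for each GPU.
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.max \
  --format=csv
```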
Memory
Main system memory must have ECC.
| GPU configuration | Recommended RAM |
|---|---|
| 8x 80 GB VRAM | >= 2048 GB DDR5 |
| 8x 40/48 GB VRAM | >= 1024 GB DDR5 |
| 8x 24 GB VRAM | >= 512 GB DDR4/5 |
| 8x 16 GB VRAM | >= 256 GB DDR4/5 |
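One way to confirm ECC support and installed capacity on a host, assuming `dmidecode` is available (it requires root):

```bash
# Sketch: confirm ECC memory and total installed RAM.
sudo dmidecode -t memory | grep -m1 -i 'Error Correction Type'   # expect: Multi-bit ECC
free -h | awk '/^Mem:/ {print "Installed RAM: " $2}'
```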
Storage
Servers require two separate storage arrays: a boot array for the host operating system and a working array for customer workloads.
Boot array:
| Requirement | Specification |
|---|---|
| Redundancy | >= 2n redundancy (RAID 1) |
| Size | >= 500 GB (post-RAID) |
| Sequential read | >= 2,000 MB/s |
| Sequential write | >= 2,000 MB/s |
| Random read (4K QD32) | >= 100,000 IOPS |
| Random write (4K QD32) | >= 10,000 IOPS |
Working array:
| Requirement | Specification |
|---|---|
| Redundancy | >= 2n redundancy (RAID 1 or RAID 10) |
| Size | 2 TB+ NVMe per GPU for 24/48 GB GPUs; 4 TB+ NVMe per GPU for 80 GB GPUs (post-RAID) |
| Sequential read | >= 6,000 MB/s |
| Sequential write | >= 5,000 MB/s |
| Random read (4K QD32) | >= 400,000 IOPS |
| Random write (4K QD32) | >= 40,000 IOPS |
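These targets map directly onto `fio` benchmark parameters. A sketch of the random-read test, assuming `fio` is installed and `/mnt/work` is a placeholder mount point for the working array:

```bash
# Sketch: measure 4K random-read IOPS at queue depth 32 (compare against 400,000).
fio --name=randread --filename=/mnt/work/fio.test --size=10G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --group_reporting
```

Swap in `--rw=randwrite` for the random-write test, or `--rw=read --bs=1M` for sequential throughput.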
Storage cluster
Each data center must have a storage cluster providing shared storage between all GPU servers. The hardware is provided by the partner; storage cluster licensing is provided by Runpod.
Baseline specifications:
| Component | Requirement |
|---|---|
| Minimum servers | 4 |
| Minimum storage size | 200 TB raw (100 TB usable) |
| Connectivity | 200 Gbps between servers/data-plane |
| Network | Private subnet |
Server specifications:
| Component | Requirement |
|---|---|
| CPU | AMD Genoa: EPYC 9354P (32-core, 3.25-3.8 GHz), EPYC 9534 (64-core, 2.45-3.7 GHz), or EPYC 9554 (64-core, 3.1-3.75 GHz) |
| RAM | 256 GB or higher, DDR5/ECC |
Storage cluster servers follow the same boot array specifications as GPU compute servers. The working array should be presented as JBOD (Runpod assembles it into an array), with individual disk sizes of 7-14 TB recommended. Servers should have spare disk slots for future expansion.
Dedicated metadata server (large-scale clusters):
Required when the storage cluster's leader node exceeds 90% utilization on a single CPU core during peak hours.
| Component | Requirement |
|---|---|
| CPU | AMD Ryzen Threadripper 7960X (24-core, 4.2-5.3 GHz) |
| RAM | 128 GB or higher, DDR5/ECC |
| Boot disk | >= 500 GB, RAID 1 |
CPU servers
Each data center should have CPU servers for CPU-only Pods and Serverless workers. Runpod also uses these servers for features that do not require GPUs (e.g., the S3-compatible API).
Baseline specifications:
| Component | Requirement |
|---|---|
| Minimum servers | 2 |
| Minimum storage size | 8 TB usable |
| Connectivity | 200 Gbps between servers/data-plane |
| Network | Private subnet; public IP and >990 ports open |
Server specifications:
| Component | Requirement |
|---|---|
| CPU | AMD EPYC 9004 ‘Genoa’ Zen 4 or better with minimum 32 cores, 3+ GHz clock speed |
| RAM | 1 TB or higher, DDR5/ECC |
| Storage | 8 TB+, RAID 1 or RAID 10 |
Software requirements
Operating system
- Ubuntu Server 22.04 LTS
- Linux kernel 6.5.0-15 or later (Ubuntu HWE Kernel)
- SSH remote connection capability
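A quick way to confirm the OS release and kernel on a provisioned server (`lsb_release` ships with Ubuntu Server):

```bash
# Sketch: verify the OS release and kernel version.
lsb_release -ds   # expect: Ubuntu 22.04.x LTS
uname -r          # expect: 6.5.0-15 or later (HWE kernel)
```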
BIOS configuration
- IOMMU disabled for non-VM systems
- Server BIOS/firmware updated to latest stable version
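To confirm IOMMU is disabled on a non-VM host, check the kernel log and boot parameters. This is a rough sketch; exact log lines vary by platform and kernel version:

```bash
# Sketch: no output from either command suggests IOMMU is not active on this boot.
sudo dmesg | grep -i -e DMAR -e 'AMD-Vi'
grep -o -e 'intel_iommu=[^ ]*' -e 'amd_iommu=[^ ]*' /proc/cmdline
```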
Drivers and software
| Component | Requirement |
|---|---|
| NVIDIA drivers | Version 550.54.15 or later |
| CUDA | Version 12.4 or later |
| NVIDIA persistence mode | Enabled for GPUs with 48 GB of VRAM or more |
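A sketch of the corresponding checks. Note that `nvidia-smi -pm 1` enables persistence mode only until reboot; the `nvidia-persistenced` daemon is the more durable option:

```bash
# Sketch: verify driver and CUDA versions, then enable persistence mode.
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1   # expect >= 550.54.15
nvcc --version | grep release        # expect release 12.4+, if the CUDA toolkit is installed
sudo nvidia-smi -pm 1                # enables persistence mode on all GPUs (until reboot)
```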
HGX SXM systems
Additional requirements for HGX SXM systems:
- NVIDIA Fabric Manager installed, activated, running, and tested.
- Fabric Manager version must match NVIDIA drivers and kernel driver headers.
- CUDA Toolkit, NVIDIA NSCQ, and NVIDIA DCGM installed.
- Verify NVLink switch topology using `nvidia-smi` and `dcgmi`.
- Validate SXM performance using the `dcgmi` diagnostic tool (see the sketch below).
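In practice, those checks might look like the following (a sketch; `dcgmi diag -r 3` can take tens of minutes per node):

```bash
# Sketch: verify Fabric Manager, NVLink/NVSwitch topology, and run the DCGM diagnostic.
systemctl status nvidia-fabricmanager   # must be active (running)
nvidia-smi topo -m                      # GPU-to-GPU connection matrix
nvidia-smi nvlink --status              # per-link state and speed
dcgmi diag -r 3                         # extended hardware diagnostic
```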
Power requirements
| Requirement | Specification |
|---|---|
| Utility feeds | Two independent utility feeds from separate substations. Each feed capable of supporting 100% of the data center’s power load. Automatic transfer switches (ATS) with UL 1008 certification. |
| UPS | N+1 redundancy. Minimum 15 minutes runtime at full load. |
| Generators | N+1 redundancy. 100% load support. 48 hours of on-site fuel storage at full load. Automatic transfer within 10 seconds of utility failure. |
| Power distribution | 2N redundant paths from utility to rack level. Redundant PDUs in each rack. Remote power monitoring at rack level. |
| Fire suppression | Compliant with NFPA 75 and 76 (or regional equivalent). |
| Capacity planning | Maintain minimum 20% spare power capacity. Annual capacity audits and forecasting. |
Testing and maintenance
- Monthly generator tests under load (minimum 30 minutes).
- Quarterly full-load tests of entire backup power system.
- Annual full-facility power outage test (coordinated with Runpod).
- Regular thermographic scanning of electrical systems.
- Detailed maintenance logs for all power equipment.
- 24/7 on-site facilities team.
Network requirements
| Requirement | Specification |
|---|---|
| Internet connectivity | Two diverse, redundant circuits from separate providers. BGP routing for automatic failover. 100 Gbps minimum total bandwidth. |
| Speed per server | Preferred: >= 10 Gbps sustained. Minimum: >= 5 Gbps sustained. |
| Core infrastructure | Redundant core switches in high-availability configuration. |
| Distribution layer | Redundant switches with MLAG or equivalent. 100 Gbps uplinks to core switches. |
| Access layer | Redundant top-of-rack switches. 100 Gbps server connections for high-performance compute nodes. |
| DDoS protection | On-premises or on-demand cloud-based mitigation solution. |
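Sustained per-server throughput can be validated with `iperf3` between two hosts in the same data center; `SERVER_IP` below is a placeholder:

```bash
# On the receiving host:
iperf3 -s
# On the server under test:
iperf3 -c SERVER_IP -P 8 -t 60   # 8 parallel streams, 60-second sustained test
```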
Quality of service
- Network utilization below 80% on any link during peak hours.
- Packet loss not exceeding 0.1% on any segment.
- P95 round-trip time within data center not exceeding 4 ms.
- P95 jitter within data center not exceeding 3 ms (see the measurement sketch below).
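A rough intra-data-center latency and loss check with `ping`; `TARGET_IP` is a placeholder for another host in the same facility, and true P95 statistics need a fuller tool such as `mtr` or `smokeping`:

```bash
# Sketch: 1,000 pings at 10 ms intervals (sub-200 ms intervals require root).
sudo ping -c 1000 -i 0.01 TARGET_IP | tail -2
# The summary lines report packet loss and min/avg/max/mdev RTT;
# mdev is a rough jitter proxy, not a true P95 figure.
```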
Testing and maintenance
- Semi-annual failover testing of all redundant components.
- Annual full-scale disaster recovery test.
- Maintenance windows scheduled at least 1 week in advance.
- Maintain minimum 40% spare network capacity.
Compliance requirements
Partners must adhere to at least one of the following compliance standards:
- SOC 2 Type I (System and Organization Controls)
- ISO/IEC 27001:2013 (Information Security Management Systems)
- PCI DSS (Payment Card Industry Data Security Standard)
Operational standards
| Requirement | Description |
|---|---|
| Data center tier | Tier III+ data center standards. |
| Security | 24/7 on-site security and technical staff. |
| Physical security | Runpod servers in isolated, secure rack or cage. Physical access tracked and logged. |
| Maintenance | Disruptions scheduled at least 1 week in advance. Large disruptions coordinated with Runpod at least 1 month in advance. |
Evidence review
Runpod will review:
- Physical access logs.
- Redundancy checks.
- Refueling agreements.
- Power system test results and maintenance logs.
- Power monitoring and capacity planning reports.
- Network infrastructure diagrams and configurations.
- Network performance and capacity reports.
- Security audit results and incident response plans.
Release log
- 2024-11-01: Initial release.