This page covers the requirements to become a Runpod Secure Cloud partner. These requirements are the baseline—meeting them does not guarantee selection. Runpod also evaluates business health, prior performance, and corporate alignment before selecting partners.
Minimum deployment size: 100 kW of GPU server capacity.

Effective dates

| Requirement | New partners | Existing partners |
|---|---|---|
| Hardware specifications | November 1, 2024 | December 15, 2024 (new servers only) |
| Compliance specifications | November 1, 2024 | April 1, 2025 |
A new revision is released annually in October. Minor mid-year revisions may be made for market, roadmap, or customer needs.

Hardware requirements

GPU compute servers

GPU requirements

NVIDIA GPUs no older than Ampere generation.

CPU requirements

| Requirement | Specification |
|---|---|
| Cores | Minimum 4 physical CPU cores per GPU, plus 2 for system operations. |
| Clock speed | Minimum 3.5 GHz base clock, with a boost clock of at least 4.0 GHz. |
| Recommended CPUs | AMD EPYC 9654 (96 cores, up to 3.7 GHz), Intel Xeon Platinum 8490H (60 cores, up to 4.8 GHz), AMD EPYC 9474F (48 cores, up to 4.1 GHz). |
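
The core-count rule scales linearly with GPU count, so it is easy to check during capacity planning. A minimal sketch of the arithmetic (the function name and example GPU counts are illustrative, not part of the specification):

```python
def min_physical_cores(gpu_count: int) -> int:
    """Minimum physical CPU cores for a GPU server:
    4 cores per GPU plus 2 reserved for system operations."""
    return 4 * gpu_count + 2

# Example: an 8-GPU server needs at least 34 physical cores, so the
# recommended 48-core EPYC 9474F or 96-core EPYC 9654 both qualify.
for gpus in (4, 8):
    print(f"{gpus} GPUs -> at least {min_physical_cores(gpus)} physical cores")
```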

Bus bandwidth

| GPU VRAM | Minimum bandwidth |
|---|---|
| 8/10/12/16 GB | PCIe 3.0 x16 |
| 20/24/32/40/48 GB | PCIe 4.0 x16 |
| 80 GB | PCIe 5.0 x16 |

Exception: the A100 80 GB PCIe card requires PCIe 4.0 x16.

Memory

Main system memory must have ECC.
| GPU configuration | Recommended RAM |
|---|---|
| 8x 80 GB VRAM | >= 2048 GB DDR5 |
| 8x 40/48 GB VRAM | >= 1024 GB DDR5 |
| 8x 24 GB VRAM | >= 512 GB DDR4/5 |
| 8x 16 GB VRAM | >= 256 GB DDR4/5 |

Storage

Servers require two separate storage arrays: a boot array for the host operating system and a working array for customer workloads.

Boot array:

| Requirement | Specification |
|---|---|
| Redundancy | >= 2n redundancy (RAID 1) |
| Size | >= 500 GB (post-RAID) |
| Sequential read | 2,000 MB/s |
| Sequential write | 2,000 MB/s |
| Random read (4K QD32) | 100,000 IOPS |
| Random write (4K QD32) | 10,000 IOPS |
Working array:

| Requirement | Specification |
|---|---|
| Redundancy | >= 2n redundancy (RAID 1 or RAID 10) |
| Size | 2 TB+ NVMe per GPU for 24/48 GB GPUs; 4 TB+ NVMe per GPU for 80 GB GPUs (post-RAID) |
| Sequential read | 6,000 MB/s |
| Sequential write | 5,000 MB/s |
| Random read (4K QD32) | 400,000 IOPS |
| Random write (4K QD32) | 40,000 IOPS |
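
Because the working array is sized per GPU and per VRAM class, the minimum post-RAID capacity follows directly from the table above. A rough sketch (the function name and example configurations are illustrative):

```python
# Minimum post-RAID working-array capacity, per the table above:
# 2 TB of NVMe per GPU for 24/48 GB GPUs, 4 TB per GPU for 80 GB GPUs.
TB_PER_GPU = {24: 2, 48: 2, 80: 4}

def min_working_array_tb(gpu_count: int, vram_gb: int) -> int:
    """Return the minimum post-RAID working-array size in TB."""
    return gpu_count * TB_PER_GPU[vram_gb]

print(min_working_array_tb(8, 80))  # 8x 80 GB GPUs -> 32 TB
print(min_working_array_tb(8, 48))  # 8x 48 GB GPUs -> 16 TB
```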

Storage cluster

Each data center must have a storage cluster providing shared storage between all GPU servers. The hardware is provided by the partner; storage cluster licensing is provided by Runpod. Baseline specifications:
| Component | Requirement |
|---|---|
| Minimum servers | 4 |
| Minimum storage size | 200 TB raw (100 TB usable) |
| Connectivity | 200 Gbps between servers/data plane |
| Network | Private subnet |
Server specifications:
| Component | Requirement |
|---|---|
| CPU | AMD Genoa: EPYC 9354P (32-core, 3.25-3.8 GHz), EPYC 9534 (64-core, 2.45-3.7 GHz), or EPYC 9554 (64-core, 3.1-3.75 GHz) |
| RAM | 256 GB or higher, DDR5/ECC |
Storage cluster servers follow the same boot array specifications as GPU compute servers. The working array should use JBOD (Runpod assembles it into an array), with 7-14 TB disk sizes recommended. Servers should have spare disk slots for future expansion.

Dedicated metadata server (large-scale clusters): required when the storage cluster exceeds 90% single-core CPU utilization on the leader node during peak hours.
| Component | Requirement |
|---|---|
| CPU | AMD Ryzen Threadripper 7960X (24-core, 4.2-5.3 GHz) |
| RAM | 128 GB or higher, DDR5/ECC |
| Boot disk | >= 500 GB, RAID 1 |

CPU servers

Each data center should have CPU servers for CPU-only Pods and Serverless workers. Runpod also uses these servers for features that do not require GPUs (e.g., the S3-compatible API). Baseline specifications:
| Component | Requirement |
|---|---|
| Minimum servers | 2 |
| Minimum storage size | 8 TB usable |
| Connectivity | 200 Gbps between servers/data plane |
| Network | Private subnet; public IP and >990 ports open |
Server specifications:
| Component | Requirement |
|---|---|
| CPU | AMD EPYC 9004 "Genoa" (Zen 4) or better, with a minimum of 32 cores and a 3+ GHz clock speed |
| RAM | 1 TB or higher, DDR5/ECC |
| Storage | 8 TB+, RAID 1 or RAID 10 |

Software requirements

Operating system

  • Ubuntu Server 22.04 LTS
  • Linux kernel 6.5.0-15 or later (Ubuntu HWE Kernel)
  • SSH remote connection capability
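
A quick way to confirm a host meets the kernel requirement is to compare the running kernel release against the minimum version. A minimal sketch, assuming Python is available on the host; it checks only the base kernel version and ignores the `-15` package revision:

```python
import platform

REQUIRED_KERNEL = (6, 5, 0)

def kernel_ok(release: str) -> bool:
    """Check that the kernel is 6.5.0 or later.
    Example release string: '6.5.0-15-generic' (Ubuntu HWE kernel)."""
    version = tuple(int(p) for p in release.split("-")[0].split("."))
    return version >= REQUIRED_KERNEL

release = platform.release()
print(f"Kernel {release}: {'OK' if kernel_ok(release) else 'too old'}")
```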

BIOS configuration

  • IOMMU disabled for non-VM systems
  • Server BIOS/firmware updated to latest stable version

Drivers and software

| Component | Requirement |
|---|---|
| NVIDIA drivers | Version 550.54.15 or later |
| CUDA | Version 12.4 or later |
| NVIDIA persistence mode | Enabled for GPUs with 48 GB of VRAM or more |
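
Driver version and persistence mode can both be read from `nvidia-smi`. A minimal verification sketch using its CSV query output; the 45,000 MiB threshold is an approximation of the 48 GB rule, since reported memory totals vary slightly by GPU model:

```python
import subprocess

def check_gpus() -> None:
    """Report driver version and persistence mode for each GPU.
    Persistence mode must be enabled on GPUs with roughly 48 GB of VRAM or more."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,driver_version,memory.total,persistence_mode",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        index, driver, mem_mib, persistence = [f.strip() for f in line.split(",")]
        needs_persistence = int(mem_mib) >= 45_000  # ~48 GB of VRAM
        if needs_persistence and persistence != "Enabled":
            print(f"GPU {index}: enable persistence mode (driver {driver})")
        else:
            print(f"GPU {index}: OK (driver {driver}, persistence {persistence})")

check_gpus()
```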

HGX SXM systems

Additional requirements for HGX SXM systems:
  • NVIDIA Fabric Manager installed, activated, running, and tested.
  • Fabric Manager version must match NVIDIA drivers and kernel driver headers.
  • CUDA Toolkit, NVIDIA NSCQ, and NVIDIA DCGM installed.
  • Verify NVLink switch topology using nvidia-smi and dcgmi.
  • Ensure SXM performance using dcgmi diagnostic tool.
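
The last two checks can be scripted. A minimal sketch that prints the NVLink topology matrix and runs a DCGM diagnostic; it assumes the DCGM host engine is running, and the diagnostic level shown is only an example:

```python
import subprocess

def run(cmd: list[str]) -> None:
    """Run a verification command and echo its output."""
    print(f"$ {' '.join(cmd)}")
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)

# NVLink switch topology: GPU pairs should report NV# links rather than PIX/PHB/SYS.
run(["nvidia-smi", "topo", "-m"])

# DCGM diagnostics: level 3 exercises memory, PCIe/NVLink, and stress tests.
run(["dcgmi", "diag", "-r", "3"])
```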

Power requirements

| Requirement | Specification |
|---|---|
| Utility feeds | Two independent utility feeds from separate substations. Each feed capable of supporting 100% of the data center's power load. Automatic transfer switches (ATS) with UL 1008 certification. |
| UPS | N+1 redundancy. Minimum 15 minutes of runtime at full load. |
| Generators | N+1 redundancy. 100% load support. 48 hours of on-site fuel storage at full load. Automatic transfer within 10 seconds of utility failure. |
| Power distribution | 2N redundant paths from utility to rack level. Redundant PDUs in each rack. Remote power monitoring at rack level. |
| Fire suppression | Compliant with NFPA 75 and 76 (or regional equivalent). |
| Capacity planning | Maintain minimum 20% spare power capacity. Annual capacity audits and forecasting. |

Testing and maintenance

  • Monthly generator tests under load (minimum 30 minutes).
  • Quarterly full-load tests of entire backup power system.
  • Annual full-facility power outage test (coordinated with Runpod).
  • Regular thermographic scanning of electrical systems.
  • Detailed maintenance logs for all power equipment.
  • 24/7 on-site facilities team.

Network requirements

| Requirement | Specification |
|---|---|
| Internet connectivity | Two diverse, redundant circuits from separate providers. BGP routing for automatic failover. 100 Gbps minimum total bandwidth. |
| Speed per server | Preferred: >= 10 Gbps sustained. Minimum: >= 5 Gbps sustained. |
| Core infrastructure | Redundant core switches in a high-availability configuration. |
| Distribution layer | Redundant switches with MLAG or equivalent. 100 Gbps uplinks to core switches. |
| Access layer | Redundant top-of-rack switches. 100 Gbps server connections for high-performance compute nodes. |
| DDoS protection | On-premises or on-demand cloud-based mitigation solution. |

Quality of service

  • Network utilization below 80% on any link during peak hours.
  • Packet loss not exceeding 0.1% on any segment.
  • P95 round-trip time within the data center not exceeding 4 ms.
  • P95 jitter within the data center not exceeding 3 ms.
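
P95 round-trip time and jitter are computed from repeated latency samples between hosts in the same data center. A rough sketch of the arithmetic, using `ping` as the sample source; the peer address and sample count are placeholders, and jitter is approximated as the difference between consecutive samples:

```python
import re
import subprocess

def rtt_samples(host: str, count: int = 100) -> list[float]:
    """Collect round-trip times (in ms) to another in-DC host via ping."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return [float(m) for m in re.findall(r"time=([\d.]+)", out)]

def p95(values: list[float]) -> float:
    """95th-percentile value of a list of samples."""
    return sorted(values)[int(0.95 * (len(values) - 1))]

samples = rtt_samples("10.0.0.2")  # placeholder: another host in the data center
jitter = [abs(a - b) for a, b in zip(samples, samples[1:])]
print(f"P95 RTT: {p95(samples):.2f} ms (target <= 4 ms)")
print(f"P95 jitter: {p95(jitter):.2f} ms (target <= 3 ms)")
```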

Testing and maintenance

  • Semi-annual failover testing of all redundant components.
  • Annual full-scale disaster recovery test.
  • Maintenance windows scheduled at least 1 week in advance.
  • Maintain minimum 40% spare network capacity.

Compliance requirements

Partners must adhere to at least one of the following compliance standards:
  • SOC 2 Type I (System and Organization Controls)
  • ISO/IEC 27001:2013 (Information Security Management Systems)
  • PCI DSS (Payment Card Industry Data Security Standard)

Operational standards

| Requirement | Description |
|---|---|
| Data center tier | Tier III+ data center standards. |
| Security | 24/7 on-site security and technical staff. |
| Physical security | Runpod servers in an isolated, secure rack or cage. Physical access tracked and logged. |
| Maintenance | Disruptions scheduled at least 1 week in advance. Large disruptions coordinated with Runpod at least 1 month in advance. |

Evidence review

Runpod will review:
  • Physical access logs.
  • Redundancy checks.
  • Refueling agreements.
  • Power system test results and maintenance logs.
  • Power monitoring and capacity planning reports.
  • Network infrastructure diagrams and configurations.
  • Network performance and capacity reports.
  • Security audit results and incident response plans.

Release log

  • 2024-11-01: Initial release.