Migrate HPC Workflows To Cloud With Slurm & Spawn
Overview: Seamlessly Transition HPC Workflows
Enable spawn to interpret and execute Slurm batch scripts, providing a smooth migration path for High-Performance Computing (HPC) users looking to move to the cloud or run hybrid workloads. This initiative directly addresses the needs of researchers and scientists eager to leverage the flexibility and scalability of cloud resources while preserving their existing Slurm-based workflows. The goal is to make cloud adoption as frictionless as possible, allowing users to focus on their research rather than wrestling with infrastructure.
The Opportunity: Addressing HPC Challenges
Many researchers rely heavily on Slurm for their computational tasks. They're seeking ways to adapt to the cloud environment, but the transition can be daunting. Here's why this is a prime opportunity:
- Migrate to the Cloud Without Rewrites: Preserve existing Slurm scripts and move to cloud or hybrid compute without rewriting anything.
- Run Hybrid Workloads: Blend on-premise clusters with cloud resources for optimal performance and cost-effectiveness. This allows users to keep long-running tasks on their clusters and burst into the cloud when needed.
- Eliminate Queue Times: Bypass potentially lengthy queue times, which can delay research progress. Launch jobs immediately, significantly speeding up the research cycle.
- Cost Comparison: Easily compare the costs of institutional resources versus cloud-based solutions to make informed decisions and optimize spending.
Key Insight: Slurm and Spawn: A Natural Partnership
Slurm array jobs and spawn parameter sweeps align almost perfectly; this is the core of our solution. Array jobs run multiple instances of the same script with different parameters, which is exactly the shape of a parameter sweep.
For example, consider the following Slurm array job:
#SBATCH --array=1-100 # 100 independent tasks
python analyze.py --task $SLURM_ARRAY_TASK_ID
This translates almost directly to spawn parameters:
params:
- index: 1
- index: 2
# ... up to 100
This alignment enables spawn to effectively manage embarrassingly parallel workloads, a common use case in research computing. By leveraging this natural fit, we simplify the migration process and enhance the efficiency of cloud-based workflows.
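As a sketch, the expansion itself is only a few lines of Python (the `array_to_params` helper is hypothetical, not part of spawn):

```python
def array_to_params(array_spec):
    """Expand a Slurm --array=N-M range into spawn-style parameter sets."""
    start, end = (int(x) for x in array_spec.split("-"))
    return [{"index": i} for i in range(start, end + 1)]

params = array_to_params("1-100")
print(len(params), params[0])  # 100 {'index': 1}
```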
What Works Well ✅: Core Slurm Features Supported
We've identified key Slurm features that seamlessly integrate with spawn, enabling a robust and functional cloud migration experience. Here's a breakdown:
1. Array Jobs (Perfect Match)
- `--array=1-100` translates directly to 100 parameter sets within spawn.
- `$SLURM_ARRAY_TASK_ID` becomes `$SPAWN_PARAM_INDEX`, ensuring that each task receives the correct index.
- This covers the most common pattern in research computing, making migration straightforward.
2. Multi-Node MPI Jobs
Spawn already supports multi-node MPI (see MPI_GUIDE.md), so jobs that need multiple nodes for parallel processing work out of the box.
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=16
mpirun ./simulation
Translates to:
spawn launch --count 8 --instance-type c7i.4xlarge --mpi
3. Single-Node GPU Jobs
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
python train.py
Direct translation to spawn with GPU instance type. This allows users to easily utilize GPU resources for tasks that require them, such as machine learning and deep learning applications.
4. Time Limits
#SBATCH --time=04:00:00 → ttl: 4h
Specifying time limits is straightforward, ensuring that jobs terminate gracefully if they exceed their allocated time.
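A minimal sketch of that conversion, normalizing to `h`/`m` ttl strings (`parse_time_limit` is a hypothetical helper, not spawn's actual parser):

```python
def parse_time_limit(spec):
    """Convert a Slurm --time=HH:MM:SS limit into a spawn-style ttl string."""
    hours, minutes, seconds = (int(x) for x in spec.split(":"))
    total_minutes = hours * 60 + minutes + (1 if seconds else 0)  # round up
    if total_minutes % 60 == 0:
        return f"{total_minutes // 60}h"
    return f"{total_minutes}m"

print(parse_time_limit("04:00:00"))  # 4h
print(parse_time_limit("01:30:00"))  # 90m
```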
5. Resource Mapping: Mapping Slurm Directives to Spawn
Here's a clear mapping of Slurm directives to spawn functionality:

| Slurm | Spawn |
|---|---|
| `--time=2h` | `ttl: 2h` ✅ |
| `--array=1-100` | Parameter sweep ✅ |
| `--mem=16GB` | Select instance with ≥16GB RAM |
| `--cpus-per-task=8` | Select instance with ≥8 vCPUs |
| `--gres=gpu:1` | `g5.xlarge`, `p3.2xlarge`, etc. |
This table illustrates the direct correlation between Slurm directives and spawn's capabilities, facilitating easy migration.
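The table rows could be implemented as a small translator. The field names below (`ttl`, `param_count`, `min_mem`, `min_vcpus`, `gpus`) are assumptions for illustration, not spawn's actual schema:

```python
def translate(directives):
    """Map parsed Slurm directives onto hypothetical spawn launch settings."""
    out = {}
    if "--time" in directives:
        out["ttl"] = directives["--time"]                 # e.g. "2h"
    if "--array" in directives:
        lo, hi = directives["--array"].split("-")
        out["param_count"] = int(hi) - int(lo) + 1        # sweep size
    if "--mem" in directives:
        out["min_mem"] = directives["--mem"]              # instance floor
    if "--cpus-per-task" in directives:
        out["min_vcpus"] = int(directives["--cpus-per-task"])
    if "--gres" in directives:                            # e.g. "gpu:1"
        out["gpus"] = int(directives["--gres"].split(":")[1])
    return out

print(translate({"--time": "2h", "--array": "1-100", "--gres": "gpu:1"}))
```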
What Doesn't Work ❌: Challenges and Solutions
Some Slurm features have no direct equivalent in spawn and require alternative solutions.
1. Job Dependencies
#SBATCH --dependency=afterok:12345
- Spawn has no built-in DAG workflow system (yet).
- Proposed solution: Track dependencies in DynamoDB and launch each job only after its parent completes. This means monitoring job status and triggering children as their parents finish.
- Question: Is this feature essential? We need user feedback to prioritize development efforts.
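To make the proposal concrete, here is an in-memory sketch of the dependency logic (a stand-in for the proposed DynamoDB table): jobs launch in topological order, each only after its `afterok` parents complete.

```python
from collections import defaultdict, deque

def launch_order(jobs):
    """Order jobs so each launches only after its afterok parents complete.

    `jobs` maps job name -> list of parent job names. In the real proposal
    this state would live in DynamoDB; here it is in memory for clarity.
    """
    indegree = {job: len(parents) for job, parents in jobs.items()}
    children = defaultdict(list)
    for job, parents in jobs.items():
        for parent in parents:
            children[parent].append(job)
    ready = deque(job for job, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children[job]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order  # any job missing from `order` is stuck in a cycle

print(launch_order({"preprocess": [], "train": ["preprocess"], "report": ["train"]}))
# ['preprocess', 'train', 'report']
```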
2. Shared Filesystem
- Slurm jobs depend on a shared filesystem, while EC2 instances are isolated by default.
- Solutions: EFS mount or S3 staging (see issue #47). EFS provides a shared filesystem, while S3 can be used for staging data to/from the cloud environment. Users can leverage these options to work around the limitations imposed by isolated EC2 instances.
3. Module System
- Each cluster has different modules, making it difficult to create a universal system.
module load python/3.9
module load cuda/11.7
- Solution: User-provided mapping config. Users will need to provide a configuration that maps the modules they use on their clusters to the appropriate software available in the cloud environment.
- Alternative: Container approach (Docker/Singularity). Docker or Singularity containers allow for bundling the software and dependencies needed by the jobs, providing a consistent environment across different infrastructures.
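A sketch of how the user-provided mapping could be consumed; `MODULE_MAP` mirrors the `modules:` section of the mapping config and is hypothetical:

```python
# Hypothetical module mapping, mirroring the mapping.yaml modules: section.
MODULE_MAP = {
    "python/3.9": {"conda": "python:3.9"},
    "cuda/11.7": {"ami": "ami-gpu-cuda117"},
}

def resolve_modules(script_lines):
    """Collect cloud-side requirements for each `module load` in a script."""
    required = {}
    for line in script_lines:
        line = line.strip()
        if line.startswith("module load "):
            name = line[len("module load "):].strip()
            required[name] = MODULE_MAP.get(name)  # None = unmapped module
    return required

print(resolve_modules(["module load python/3.9", "module load cuda/11.7"]))
```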
Proposed Implementation: Phased Approach
To ensure a smooth transition and iterative development, we propose a phased approach:
Phase 1: Parse & Convert (Proof of Concept)
spawn slurm convert job.sbatch --config mapping.yaml --output params.yaml
Supported directives: This phase focuses on parsing and converting essential Slurm directives.
- `--array=N-M` → Parameter sweep ✅
- `--time=HH:MM:SS` → TTL (Time-to-Live) ✅
- `--mem=XGB` → Instance selection
- `--cpus-per-task=N` → Instance selection
- `--gres=gpu:N` → GPU instance type
- `--nodes=N` → MPI job array ✅
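Extracting the directives themselves is straightforward; this is a minimal sketch, not the proposed converter:

```python
def parse_sbatch(text):
    """Extract #SBATCH directives from a batch script into a dict."""
    directives = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#SBATCH "):
            arg = line[len("#SBATCH "):].strip()
            key, _, value = arg.partition("=")
            directives[key] = value
    return directives

script = """#!/bin/bash
#SBATCH --array=1-100
#SBATCH --time=02:00:00
python train.py
"""
print(parse_sbatch(script))  # {'--array': '1-100', '--time': '02:00:00'}
```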
Mapping config example: This example illustrates how the mapping config works, allowing for flexible configuration.
partitions:
gpu:
instance_types: [g5.xlarge, g5.2xlarge, p3.2xlarge]
ami: ami-gpu-cuda117
cpu:
instance_types: [c5.4xlarge, c5.2xlarge]
ami: ami-cpu-optimized
modules:
python/3.9:
conda: python:3.9
cuda/11.7:
ami: ami-gpu-cuda117
Output: Standard spawn parameter file. The tool generates a standard spawn parameter file ready for execution.
defaults:
instance_type: g5.xlarge
ttl: 2h
spot: true
ami: ami-gpu-cuda117
params:
- index: 1
script: |
export SLURM_ARRAY_TASK_ID=1
export SLURM_JOB_ID=${SPAWN_SWEEP_ID}
python train.py --input data/1.txt
- index: 2
script: |
export SLURM_ARRAY_TASK_ID=2
python train.py --input data/2.txt
# ... (generated)
Pros: Transparency (user reviews generated file), debuggability, and the ability to edit before launching.
Cons: Two-step process, and large arrays may result in large YAML files.
Phase 2: Direct Submit
spawn slurm submit job.sbatch --config mapping.yaml --detach
Single command, familiar to Slurm users. This phase aims to simplify the submission process, making it as seamless as possible for existing Slurm users.
Phase 3: Dry Run & Cost Estimation
spawn slurm estimate job.sbatch --config mapping.yaml
Output: Show users a cost estimate before the job runs.
Slurm Job Analysis:
Job name: training
Array size: 100 tasks
Time limit: 2h per task
Resources: 1 GPU, 8 CPUs, 16GB RAM
Spawn Translation:
Instance type: g5.xlarge (spot)
Total instances: 100
Max concurrent: 20
Wall time: 10h (with concurrency)
Estimated cost: $45.60
Cluster Comparison:
Queue time: 2-24 hours (typical)
Run time: 100 × 2h = 200 task-hours
Cost: $0 (institutional)
Time savings: 14 hours faster (immediate launch)
Cloud cost: $45.60
Would you like to proceed? [y/N]
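The estimate above reduces to simple arithmetic; in this sketch the $0.228/hr spot price is back-derived from the sample numbers, not a quoted rate:

```python
def estimate_cost(tasks, hours_per_task, spot_price_per_hr, max_concurrent):
    """Rough cost/wall-time estimate: billed time is total task-hours,
    wall time is bounded by concurrency (assumes even packing)."""
    task_hours = tasks * hours_per_task
    cost = task_hours * spot_price_per_hr
    wall_time = (tasks / max_concurrent) * hours_per_task
    return cost, wall_time

# Numbers from the sample output above; $0.228/hr is illustrative.
cost, wall = estimate_cost(100, 2, 0.228, 20)
print(f"${cost:.2f} over {wall:.0f}h")  # $45.60 over 10h
```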
Phase 4: Spawn-Aware Scripts
Allow hybrid scripts to work seamlessly on both Slurm and spawn:
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --array=1-100
#SBATCH --time=02:00:00
# Spawn-specific overrides (ignored by Slurm)
#SPAWN --instance-type c5.4xlarge
#SPAWN --spot true
#SPAWN --region us-east-1
# Regular script continues...
module load python/3.9
python train.py --task $SLURM_ARRAY_TASK_ID
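Extracting the overrides is simple because Slurm treats any unrecognized `#` line as a plain comment; a hypothetical sketch:

```python
def parse_spawn_overrides(text):
    """Read #SPAWN override lines; Slurm ignores them as comments."""
    overrides = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#SPAWN "):
            parts = line[len("#SPAWN "):].split(None, 1)
            if len(parts) == 2:
                overrides[parts[0]] = parts[1]
    return overrides

script = "#SPAWN --instance-type c5.4xlarge\n#SPAWN --spot true\n"
print(parse_spawn_overrides(script))
# {'--instance-type': 'c5.4xlarge', '--spot': 'true'}
```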
Resource Selection: Streamlining Instance Choices
Automatic Instance Type Selection
Given Slurm directives:
#SBATCH --mem=32GB
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
Spawn could auto-select:
- Memory: β₯32GB
- vCPUs: β₯8
- GPU: 1×
- Result: `g5.2xlarge` (8 vCPU, 32GB RAM, 1× GPU, $1.21/hr spot)
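A sketch of the selection logic over a hypothetical mini-catalog (specs and spot prices are illustrative only; a real implementation would query the provider):

```python
# (name, vcpus, mem_gb, gpus, spot $/hr) -- illustrative figures.
CATALOG = [
    ("g5.xlarge",  4, 16, 1, 0.61),
    ("g5.2xlarge", 8, 32, 1, 1.21),
    ("p3.2xlarge", 8, 61, 1, 3.06),
]

def auto_select(min_vcpus, min_mem_gb, gpus):
    """Cheapest instance meeting the Slurm resource floor, or None."""
    matches = [c for c in CATALOG
               if c[1] >= min_vcpus and c[2] >= min_mem_gb and c[3] >= gpus]
    return min(matches, key=lambda c: c[4])[0] if matches else None

print(auto_select(8, 32, 1))  # g5.2xlarge
```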
With fallback pattern (issue #40):
--instance-type