
Vast.ai Provisioning and Monitoring

Hyperscape provides automated tools for provisioning and monitoring GPU instances on Vast.ai for WebGPU streaming deployments.

Overview

Vast.ai is a GPU marketplace that provides affordable NVIDIA GPUs for cloud computing. Hyperscape’s streaming pipeline requires specific GPU configurations to support WebGPU rendering.

Key Requirement: Instances MUST have gpu_display_active=true to support WebGPU. This ensures the GPU has display driver support, not just compute access.

Automated Provisioning

Vast.ai Provisioner Script

The provisioner script (./scripts/vast-provision.sh) automatically searches for and rents WebGPU-capable instances. Usage:
VAST_API_KEY=xxx bun run vast:provision
What it does:
  1. Searches for instances with gpu_display_active=true (REQUIRED for WebGPU)
  2. Filters by reliability (≥95%), GPU RAM (≥20GB), price (≤$2/hr)
  3. Rents the best available instance
  4. Waits for instance to be ready
  5. Outputs SSH connection details and GitHub secret commands
  6. Saves configuration to /tmp/vast-instance-config.env
Requirements:
  • Vast.ai CLI: pip install vastai
  • API key configured: vastai set api-key YOUR_API_KEY

Search Criteria

The provisioner uses the following filters:
Filter               Value     Rationale
gpu_display_active   true      REQUIRED for WebGPU display driver support
reliability          ≥95%      Minimize downtime and connection issues
gpu_ram              ≥20GB     Sufficient VRAM for WebGPU rendering
disk_space           ≥120GB    Room for builds, assets, and logs
dph_total            ≤$2/hr    Cost control
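A rough sketch of how these filters combine into the query string handed to the Vast.ai CLI. The variable names are borrowed from the "Custom Search Criteria" section below; the exact assembly inside scripts/vast-provision.sh may differ.

```shell
# Build the search query from the tunable thresholds.
MIN_RELIABILITY=95     # percent; becomes 0.95 in the query
MIN_GPU_RAM=20         # GB
MIN_DISK_SPACE=120     # GB
MAX_PRICE=2.00         # $/hr

query="gpu_display_active=true reliability>=0.${MIN_RELIABILITY} gpu_ram>=${MIN_GPU_RAM} disk_space>=${MIN_DISK_SPACE} dph_total<=${MAX_PRICE}"
echo "$query"
```

The resulting string matches the query shown under "Manual Instance Selection" below.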

Output

After successful provisioning, the script outputs:
 Instance 12345678 is ready!

SSH Connection:
  ssh root@ssh4.vast.ai -p 12345 -L 5555:localhost:5555

GitHub Secrets (add to repository settings):
  VAST_SSH_HOST=ssh4.vast.ai
  VAST_SSH_PORT=12345
  VAST_SSH_USER=root
  VAST_INSTANCE_ID=12345678

Configuration saved to: /tmp/vast-instance-config.env
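The saved file can be sourced to reuse the connection details later. This sketch assumes the file exports the same keys as the GitHub-secrets output above; the file contents below are example values, since the provisioner writes the real ones.

```shell
# Example contents standing in for what the provisioner saves.
cat > /tmp/vast-instance-config.env <<'EOF'
VAST_SSH_HOST=ssh4.vast.ai
VAST_SSH_PORT=12345
VAST_SSH_USER=root
VAST_INSTANCE_ID=12345678
EOF

# Load the configuration and assemble the SSH command with the port forward.
source /tmp/vast-instance-config.env
ssh_cmd="ssh ${VAST_SSH_USER}@${VAST_SSH_HOST} -p ${VAST_SSH_PORT} -L 5555:localhost:5555"
echo "$ssh_cmd"
```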

Vast.ai Commands

Search for Instances

Search for WebGPU-capable instances without renting:
VAST_API_KEY=xxx bun run vast:search
This displays available instances matching the search criteria.

Check Instance Status

Check the status of your current instance:
VAST_API_KEY=xxx bun run vast:status
Output includes:
  • Instance ID
  • Status (running, stopped, etc.)
  • GPU model and VRAM
  • Disk space
  • Price per hour
  • Uptime

Destroy Instance

Destroy your current instance to stop billing:
VAST_API_KEY=xxx bun run vast:destroy
Warning: This permanently deletes the instance and all data. Make sure to back up any important data first.

Vast-Keeper Monitoring Service

Run the vast-keeper monitoring service to automatically manage instances:
VAST_API_KEY=xxx bun run vast:keeper
The keeper service:
  • Monitors instance health
  • Automatically restarts failed instances
  • Sends alerts on critical failures
  • Manages instance lifecycle
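A minimal sketch of the restart decision such a keeper might make. keeper_action is illustrative; the real logic lives in the vast-keeper source.

```shell
# Decide what to do for a given instance status string, e.g. the
# actual_status field reported by `vastai show instance <id>`.
keeper_action() {
  case "$1" in
    running)        echo "noop" ;;      # healthy: nothing to do
    stopped|exited) echo "restart" ;;   # failed: restart the instance
    *)              echo "alert" ;;     # unknown state: alert a human
  esac
}

keeper_action running
keeper_action stopped
```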

Streaming Health Monitoring

Quick Status Check

Check streaming health on a running Vast.ai instance:
bun run duel:status
This checks:
  • Server health endpoint
  • Streaming API status
  • Duel context (fighting phase)
  • RTMP bridge status and bytes streamed
  • PM2 process status
  • Recent logs
Example Output:
✅ Server Health: OK (200)
✅ Streaming API: OK
✅ Duel Context: FIGHTING (Agent1 vs Agent2)
✅ RTMP Bridge: Active (1.2 GB streamed)
✅ PM2 Processes: 2 running
📋 Recent Logs:
  [2026-03-07 10:30:15] Combat tick processed
  [2026-03-07 10:30:16] Frame captured (1920x1080)
  [2026-03-07 10:30:17] RTMP packet sent (15.2 KB)
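The pass/fail lines above could be produced by a small formatter like this sketch; the actual logic lives in scripts/check-streaming-status.sh.

```shell
# Print a ✅/❌ status line for a named check, given its exit status.
report() {
  local name=$1 rc=$2 detail=$3
  if [ "$rc" -eq 0 ]; then
    echo "✅ ${name}: ${detail:-OK}"
  else
    echo "❌ ${name}: ${detail:-FAILED}"
  fi
}

# Hypothetical usage against a local health endpoint:
#   curl -fsS http://localhost:5555/health >/dev/null; report "Server Health" $? "OK (200)"
report "Server Health" 0 "OK (200)"
report "Streaming API" 0
```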

Detailed Diagnostics

For more detailed diagnostics, SSH into the instance and run the checks below.

GPU Status:
nvidia-smi
Should show display mode (not just compute):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 3090    Off  | 00000000:01:00.0  On |                  N/A |
| 30%   45C    P2    75W / 350W |   8192MiB / 24576MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Note that the Disp.A column shows On, indicating the display driver is active.

WebGPU Initialization:
# Check deployment logs for WebGPU test results
cat /var/log/deploy-vast.log | grep "WebGPU"
Should show successful WebGPU initialization:
[INFO] WebGPU preflight test: PASSED
[INFO] navigator.gpu: available
[INFO] Adapter: NVIDIA RTX 3090
[INFO] Backend: Vulkan
PM2 Processes:
pm2 status
Should show all processes running:
┌─────┬──────────────────┬─────────┬─────────┬──────────┬────────┬──────┐
│ id  │ name             │ mode    │ ↺       │ status   │ cpu    │ mem  │
├─────┼──────────────────┼─────────┼─────────┼──────────┼────────┼──────┤
│ 0   │ hyperscape-duel  │ fork    │ 0       │ online   │ 45%    │ 2.1G │
│ 1   │ rtmp-bridge      │ fork    │ 0       │ online   │ 12%    │ 512M │
└─────┴──────────────────┴─────────┴─────────┴──────────┴────────┴──────┘

Deployment Validation

The deployment script (scripts/deploy-vast.sh) performs extensive validation:

GPU Display Driver Check

Early validation:
# Check nvidia_drm kernel module
lsmod | grep nvidia_drm

# Check DRM device nodes
ls -la /dev/dri/

# Query GPU display mode
nvidia-smi --query-gpu=display_mode --format=csv,noheader
If any check fails, deployment aborts with guidance to rent instances with gpu_display_active=true.
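The abort gate can be sketched as a small helper around the display-mode query. gpu_display_ok is illustrative; the real checks are in scripts/deploy-vast.sh.

```shell
# Gate deployment on the reported display mode.
gpu_display_ok() {
  # $1 is the output of:
  #   nvidia-smi --query-gpu=display_mode --format=csv,noheader
  case "$1" in
    Enabled*) return 0 ;;   # "Enabled" or "Enabled, Validated"
    *)        return 1 ;;   # compute-only GPU: abort with guidance
  esac
}

if gpu_display_ok "Enabled, Validated"; then
  echo "GPU display driver: OK"
else
  echo "Rent an instance with gpu_display_active=true" >&2
fi
```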

WebGPU Pre-Check Tests

The deployment runs 6 WebGPU tests with different Chrome configurations:
  1. Headless Vulkan: --headless=new --use-vulkan --use-angle=vulkan
  2. Headless EGL: --headless=new --use-gl=egl
  3. Xvfb Vulkan: Non-headless Chrome with Xvfb display
  4. Ozone Headless: --ozone-platform=headless
  5. SwiftShader: Software Vulkan fallback
  6. Playwright Xvfb: Playwright-managed browser with Xvfb
The first successful configuration is used for streaming.
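The first-success selection can be sketched as a loop over flag sets. try_config is a stub standing in for the real per-configuration browser test, and only three of the six flag sets are shown.

```shell
# Try Chrome flag sets in order; keep the first one whose test passes.
configs=(
  "--headless=new --use-vulkan --use-angle=vulkan"
  "--headless=new --use-gl=egl"
  "--ozone-platform=headless"
)

pick_first() {
  local cfg
  for cfg in "$@"; do
    if try_config "$cfg"; then
      echo "$cfg"
      return 0
    fi
  done
  return 1   # no configuration worked
}

# Stub: pretend only the EGL configuration passes on this machine.
try_config() { [ "$1" = "--headless=new --use-gl=egl" ]; }

pick_first "${configs[@]}"
```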

Vulkan ICD Verification

Check Vulkan ICD availability:
ls -la /usr/share/vulkan/icd.d/nvidia_icd.json
cat /usr/share/vulkan/icd.d/nvidia_icd.json
Expected output:
{
    "file_format_version": "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version": "1.3.0"
    }
}
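A quick sanity check could extract the driver library path from the manifest. This is a sed sketch, with an example manifest written inline so the snippet is self-contained.

```shell
# Pull the "library_path" value out of a Vulkan ICD manifest.
icd_library() {
  sed -n 's/.*"library_path": *"\([^"]*\)".*/\1/p' "$1"
}

# Example manifest matching the expected output above.
cat > /tmp/nvidia_icd.json <<'EOF'
{
    "file_format_version": "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version": "1.3.0"
    }
}
EOF

icd_library /tmp/nvidia_icd.json
```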

Display Server Verification

Check X server:
# Socket check (more reliable than xdpyinfo)
ls -la /tmp/.X11-unix/X99

# DISPLAY environment variable
echo $DISPLAY
Expected:
  • Socket exists: /tmp/.X11-unix/X99
  • DISPLAY set: :99 or :99.0
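The socket check can be wrapped in a tiny helper (illustrative):

```shell
# True if the given X11 socket path exists and is a Unix socket.
display_ready() {
  [ -S "$1" ]
}

if display_ready /tmp/.X11-unix/X99; then
  echo "X server on :99 is up"
else
  echo "X server on :99 not running; check Xvfb" >&2
fi
```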

Troubleshooting

WebGPU Not Initializing

Symptom: Deployment fails with “WebGPU initialization failed”
Causes:
  1. Instance doesn’t have gpu_display_active=true
  2. NVIDIA display driver not installed
  3. Vulkan ICD not configured
  4. X server not running
Solutions:
  1. Use the provisioner (ensures correct instance type):
    VAST_API_KEY=xxx bun run vast:provision
    
  2. Verify GPU display driver:
    nvidia-smi --query-gpu=display_mode --format=csv,noheader
    
    Should output Enabled or Enabled, Validated
  3. Check Vulkan ICD:
    VK_LOADER_DEBUG=all vulkaninfo 2>&1 | head -50
    
    Should show NVIDIA ICD loading successfully
  4. Verify X server:
    ls -la /tmp/.X11-unix/X99
    echo $DISPLAY
    

Browser Timeout During Page Load

Symptom: Browser times out (180s limit) when loading the game page
Cause: Vite dev server JIT compilation is too slow for WebGPU shader compilation
Solution: Use the production client build:
NODE_ENV=production
DUEL_USE_PRODUCTION_CLIENT=true
This serves the pre-built client via vite preview instead of the dev server, resulting in significantly faster page loads.

Stream Disconnects After 30 Minutes

Symptom: Twitch/YouTube disconnects the stream after 30 minutes of idle content
Cause: Streaming platforms disconnect streams that appear “idle”
Solution: Enable placeholder frame mode:
STREAM_PLACEHOLDER_ENABLED=true
This sends minimal JPEG frames during idle periods to keep the stream alive.

Database Connection Errors

Symptom: “too many clients already” errors during crash loops
Cause: PostgreSQL connection pool exhaustion
Solution: Reduce the connection pool size:
POSTGRES_POOL_MAX=3              # Down from default 6
POSTGRES_POOL_MIN=0              # Don't hold idle connections
Also increase PM2 restart delay in ecosystem.config.cjs:
restart_delay: 10000,            // 10s instead of 5s
exp_backoff_restart_delay: 2000, // 2s for gradual backoff

Environment Variables

Required for Vast.ai Deployment

# GPU Display Driver (CRITICAL)
# Instances must have gpu_display_active=true

# Database Configuration (for crash loop resilience)
POSTGRES_POOL_MAX=3              # Prevent connection exhaustion
POSTGRES_POOL_MIN=0              # Don't hold idle connections

# Model Agent Spawning
SPAWN_MODEL_AGENTS=true          # Auto-create agents when database is empty

# Stream Keep-Alive
STREAM_PLACEHOLDER_ENABLED=true  # Prevent 30-minute disconnects

# Production Client Build
NODE_ENV=production              # Use production client build
DUEL_USE_PRODUCTION_CLIENT=true  # Force production client for streaming

Optional Configuration

# Stream Capture
STREAM_CAPTURE_EXECUTABLE=/usr/bin/google-chrome-unstable
STREAM_LOW_LATENCY=true          # Use zerolatency tune
STREAM_GOP_SIZE=60               # GOP size in frames

# Audio Capture
STREAM_AUDIO_ENABLED=true
PULSE_AUDIO_DEVICE=chrome_audio.monitor

Deployment Workflow

1. Provision Instance

VAST_API_KEY=xxx bun run vast:provision

2. Configure GitHub Secrets

Add the output secrets to your GitHub repository:
  • VAST_SSH_HOST
  • VAST_SSH_PORT
  • VAST_SSH_USER
  • VAST_INSTANCE_ID

3. Deploy via GitHub Actions

The .github/workflows/deploy-vast.yml workflow automatically:
  1. Connects to the instance via SSH
  2. Pulls latest code from main branch
  3. Runs deployment script (scripts/deploy-vast.sh)
  4. Validates WebGPU initialization
  5. Starts PM2 processes
  6. Verifies streaming health

4. Monitor Deployment

Check streaming health:
bun run duel:status
Or SSH into the instance:
ssh root@ssh4.vast.ai -p 12345
pm2 logs hyperscape-duel

5. Graceful Restart (Zero-Downtime Updates)

Request a restart after the current duel ends:
curl -X POST http://your-server/admin/graceful-restart \
  -H "x-admin-code: YOUR_ADMIN_CODE"
The server waits for the duel RESOLUTION phase before restarting, ensuring no interruption to active duels or streams.

Cost Optimization

Instance Selection

The provisioner automatically selects the cheapest instance that meets the requirements.
Typical costs (as of March 2026):
  • RTX 3090 (24GB): $0.30-0.50/hr
  • RTX 4090 (24GB): $0.50-0.80/hr
  • A6000 (48GB): $0.80-1.20/hr
Cost control:
  • Maximum price: $2/hr (configurable in script)
  • Automatic selection of cheapest qualifying instance
  • Destroy instances when not in use
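For budgeting, a back-of-envelope monthly estimate from the hourly rate (assumes 24/7 uptime; rates from the table above):

```shell
# Monthly cost at a given hourly rate, assuming the instance runs 24/7.
rate=0.40            # $/hr, e.g. a cheap RTX 3090
hours=$((24 * 30))   # one 30-day month

awk -v r="$rate" -v h="$hours" 'BEGIN { printf "$%.2f/month\n", r * h }'
```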

Billing

Vast.ai bills by the hour. To minimize costs:
  1. Destroy instances when not streaming:
    VAST_API_KEY=xxx bun run vast:destroy
    
  2. Use spot instances (cheaper but can be interrupted)
  3. Monitor usage with bun run vast:status

Advanced Configuration

Custom Search Criteria

Edit scripts/vast-provision.sh to customize search criteria:
# Increase GPU RAM requirement
MIN_GPU_RAM=40  # Default: 20

# Increase reliability requirement
MIN_RELIABILITY=98  # Default: 95

# Adjust price limit
MAX_PRICE=1.50  # Default: 2.00

# Increase disk space
MIN_DISK_SPACE=200  # Default: 120

Manual Instance Selection

If you prefer to manually select an instance:
  1. Search for instances:
    vastai search offers 'gpu_display_active=true reliability>=0.95 gpu_ram>=20 disk_space>=120 dph_total<=2'
    
  2. Rent specific instance:
    vastai create instance <instance_id> --image nvidia/cuda:12.0.0-devel-ubuntu22.04 --disk 120
    
  3. Get SSH details:
    vastai show instance <instance_id>
    
  • AGENTS.md: Vast.ai Deployment Architecture section
  • docs/duel-stack.md: Duel stack deployment guide
  • scripts/deploy-vast.sh: Deployment script source code
  • scripts/check-streaming-status.sh: Health check script

Commit History

The Vast.ai provisioner was introduced in commit 8591248d (March 1, 2026):

fix(vast): require gpu_display_active=true for WebGPU streaming

WebGPU requires GPU display driver support, not just compute. This was causing deployment failures because Xorg/Xvfb couldn’t start without proper display driver access.

Changes:
  • vast-keeper: Add gpu_display_active=true to search query (CRITICAL)
  • vast-keeper: Add CLI commands (provision, status, search, destroy)
  • deploy-vast.yml: Add preflight check for GPU display support
  • deploy-vast.yml: Add force_deploy input to override GPU check
  • vast-provision.sh: Update disk size to 120GB
  • package.json: Add vast:* convenience scripts