Vast.ai Provisioning and Monitoring
Hyperscape provides automated tools for provisioning and monitoring GPU instances on Vast.ai for WebGPU streaming deployments.
Overview
Vast.ai is a GPU marketplace that provides affordable NVIDIA GPUs for cloud computing. Hyperscape’s streaming pipeline requires specific GPU configurations to support WebGPU rendering.
Key Requirement: Instances MUST have gpu_display_active=true to support WebGPU. This ensures the GPU has display driver support, not just compute access.
Automated Provisioning
Vast.ai Provisioner Script
The provisioner script (./scripts/vast-provision.sh) automatically searches for and rents WebGPU-capable instances.
Usage:
What it does:
- Searches for instances with gpu_display_active=true (REQUIRED for WebGPU)
- Filters by reliability (≥95%), GPU RAM (≥20GB), price (≤$2/hr)
- Rents the best available instance
- Waits for instance to be ready
- Outputs SSH connection details and GitHub secret commands
- Saves configuration to /tmp/vast-instance-config.env
Prerequisites:
- Vast.ai CLI: pip install vastai
- API key configured: vastai set api-key YOUR_API_KEY
Search Criteria
The provisioner uses the following filters:

| Filter | Value | Rationale |
|---|---|---|
| gpu_display_active | true | REQUIRED for WebGPU display driver support |
| reliability | ≥95% | Minimize downtime and connection issues |
| gpu_ram | ≥20GB | Sufficient VRAM for WebGPU rendering |
| disk_space | ≥120GB | Room for builds, assets, and logs |
| dph_total | ≤$2/hr | Cost control |
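
The filters in the table can be read as a predicate over offer records, followed by a cheapest-first pick. The following is a minimal sketch of that logic, not the provisioner's actual implementation. The Offer record is hypothetical, and its units (reliability as a 0-1 fraction, GPU RAM and disk space in GB) are assumptions made for illustration; real Vast.ai responses may shape these fields differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    """Hypothetical offer record; field names mirror the filter table."""
    offer_id: int
    gpu_display_active: bool
    reliability: float  # assumed to be a 0-1 fraction (0.95 == 95%)
    gpu_ram: float      # assumed to be in GB
    disk_space: float   # assumed to be in GB
    dph_total: float    # US dollars per hour


def meets_requirements(offer: Offer) -> bool:
    """Apply every filter from the table; all of them must hold."""
    return (
        offer.gpu_display_active
        and offer.reliability >= 0.95
        and offer.gpu_ram >= 20
        and offer.disk_space >= 120
        and offer.dph_total <= 2.0
    )


def pick_cheapest(offers: Iterable[Offer]) -> Optional[Offer]:
    """Return the cheapest qualifying offer, or None if nothing qualifies."""
    qualifying = [o for o in offers if meets_requirements(o)]
    return min(qualifying, key=lambda o: o.dph_total, default=None)
```

Note that an offer with gpu_display_active set to false is rejected regardless of price, which matches the key requirement stated in the overview.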
Output
After successful provisioning, the script outputs:
Vast.ai Commands
Search for Instances
Search for WebGPU-capable instances without renting:
Check Instance Status
Check the status of your current instance:
- Instance ID
- Status (running, stopped, etc.)
- GPU model and VRAM
- Disk space
- Price per hour
- Uptime
Destroy Instance
Destroy your current instance to stop billing:
Vast-Keeper Monitoring Service
Run the vast-keeper monitoring service to automatically manage instances:
- Monitors instance health
- Automatically restarts failed instances
- Sends alerts on critical failures
- Manages instance lifecycle
Streaming Health Monitoring
Quick Status Check
Check streaming health on a running Vast.ai instance:
- Server health endpoint
- Streaming API status
- Duel context (fighting phase)
- RTMP bridge status and bytes streamed
- PM2 process status
- Recent logs
Detailed Diagnostics
For more detailed diagnostics, SSH into the instance and check:
GPU Status:
The Disp.A column should show On, which indicates that the display driver is active.
WebGPU Initialization:
Deployment Validation
The deployment script (scripts/deploy-vast.sh) performs extensive validation:
GPU Display Driver Check
Early validation: the deployment checks that the instance has gpu_display_active=true.
WebGPU Pre-Check Tests
The deployment runs 6 WebGPU tests with different Chrome configurations:
- Headless Vulkan: --headless=new --use-vulkan --use-angle=vulkan
- Headless EGL: --headless=new --use-gl=egl
- Xvfb Vulkan: Non-headless Chrome with Xvfb display
- Ozone Headless: --ozone-platform=headless
- SwiftShader: Software Vulkan fallback
- Playwright Xvfb: Playwright-managed browser with Xvfb
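
One way to picture the pre-check is as a small test matrix keyed by configuration name. The sketch below is illustrative only: the flags are copied from the list above, configurations whose flags are not spelled out are left as None, and the probe callable is a hypothetical stand-in for whatever launches Chrome and checks WebGPU. How the real deployment script aggregates the six results is not stated here.

```python
from typing import Callable, Dict, List, Optional

# Flags copied from the list above; None marks configurations whose
# exact flags the list does not spell out.
WEBGPU_TEST_CONFIGS: Dict[str, Optional[List[str]]] = {
    "Headless Vulkan": ["--headless=new", "--use-vulkan", "--use-angle=vulkan"],
    "Headless EGL": ["--headless=new", "--use-gl=egl"],
    "Xvfb Vulkan": None,
    "Ozone Headless": ["--ozone-platform=headless"],
    "SwiftShader": None,
    "Playwright Xvfb": None,
}

# Hypothetical probe signature: (config name, flags) -> True if WebGPU initialized.
Probe = Callable[[str, Optional[List[str]]], bool]


def run_precheck(probe: Probe) -> Dict[str, bool]:
    """Run the probe once per configuration and record pass/fail by name."""
    return {name: probe(name, flags) for name, flags in WEBGPU_TEST_CONFIGS.items()}


def passing_configs(results: Dict[str, bool]) -> List[str]:
    """List the configurations that passed, in the order they were run."""
    return [name for name, ok in results.items() if ok]
```

Recording every result, rather than stopping at the first success, keeps the failing configurations visible when diagnosing an instance.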
Vulkan ICD Verification
Check Vulkan ICD availability:
Display Server Verification
Check X server:
- Socket exists: /tmp/.X11-unix/X99
- DISPLAY set: :99 or :99.0
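
The two display server conditions above can be checked together. This is a minimal sketch rather than the deployment script's own check: the function name and its parameters are hypothetical, and it only tests that the socket path exists, not that an X server is actually answering on it.

```python
import os
from pathlib import Path
from typing import Mapping


def x_display_ready(
    env: Mapping[str, str] = os.environ,
    socket_dir: str = "/tmp/.X11-unix",
) -> bool:
    """Return True if DISPLAY is :99 or :99.0 and the X99 socket path exists."""
    display = env.get("DISPLAY", "")
    if display not in (":99", ":99.0"):
        return False
    # Existence check only; a stale socket file would still pass.
    return Path(socket_dir, "X99").exists()


if __name__ == "__main__":
    print("X display ready" if x_display_ready() else "X display NOT ready")
```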
Troubleshooting
WebGPU Not Initializing
Symptom: Deployment fails with “WebGPU initialization failed”
Causes:
- Instance doesn’t have gpu_display_active=true
- NVIDIA display driver not installed
- Vulkan ICD not configured
- X server not running
Solutions:
- Use the provisioner (ensures correct instance type):
- Verify GPU display driver:
  Should output Enabled or Enabled, Validated
- Check Vulkan ICD:
  Should show NVIDIA ICD loading successfully
- Verify X server:
Browser Timeout During Page Load
Symptom: Browser times out (180s limit) when loading the game page
Cause: Vite dev server JIT compilation is too slow for WebGPU shader compilation
Solution: Use the production client build, which is served with vite preview instead of the dev server and gives significantly faster page loads.
Stream Disconnects After 30 Minutes
Symptom: Twitch/YouTube disconnects stream after 30 minutes of idle content
Cause: Streaming platforms disconnect streams that appear “idle”
Solution: Enable placeholder frame mode:
Database Connection Errors
Symptom: “too many clients already” errors during crash loops
Cause: PostgreSQL connection pool exhaustion
Solution: Reduce connection pool size:
ecosystem.config.cjs:
Environment Variables
Required for Vast.ai Deployment
Optional Configuration
Deployment Workflow
1. Provision Instance
2. Configure GitHub Secrets
Add the output secrets to your GitHub repository:
- VAST_SSH_HOST
- VAST_SSH_PORT
- VAST_SSH_USER
- VAST_INSTANCE_ID
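
If the provisioner's printed commands are no longer on screen, the four secrets could be regenerated from the saved configuration file. The sketch below assumes that /tmp/vast-instance-config.env holds plain KEY=VALUE lines using exactly these four names, which is not confirmed by this document. It only builds gh secret set command strings for review and does not run them.

```python
import shlex
from typing import Dict, List

SECRET_NAMES = ["VAST_SSH_HOST", "VAST_SSH_PORT", "VAST_SSH_USER", "VAST_INSTANCE_ID"]


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def gh_secret_commands(env: Dict[str, str]) -> List[str]:
    """Build one shell-quoted `gh secret set` command per required secret."""
    missing = [name for name in SECRET_NAMES if name not in env]
    if missing:
        raise KeyError(f"missing values for: {', '.join(missing)}")
    return [
        f"gh secret set {name} --body {shlex.quote(env[name])}"
        for name in SECRET_NAMES
    ]


if __name__ == "__main__":
    # Assumed file location, taken from the provisioner description.
    with open("/tmp/vast-instance-config.env", encoding="utf-8") as fh:
        for command in gh_secret_commands(parse_env(fh.read())):
            print(command)
```

Printing the commands instead of executing them leaves a chance to confirm the values before they are written to the repository.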
3. Deploy via GitHub Actions
The .github/workflows/deploy-vast.yml workflow automatically:
- Connects to the instance via SSH
- Pulls latest code from main branch
- Runs deployment script (scripts/deploy-vast.sh)
- Validates WebGPU initialization
- Starts PM2 processes
- Verifies streaming health
4. Monitor Deployment
Check streaming health:
5. Graceful Restart (Zero-Downtime Updates)
Request a restart after the current duel ends:
Cost Optimization
Instance Selection
The provisioner automatically selects the cheapest instance that meets requirements:
Typical costs (as of March 2026):
- RTX 3090 (24GB): $0.30-0.50/hr
- RTX 4090 (24GB): $0.50-0.80/hr
- A6000 (48GB): $0.80-1.20/hr
Cost controls:
- Maximum price: $2/hr (configurable in script)
- Automatic selection of cheapest qualifying instance
- Destroy instances when not in use
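
To turn the hourly ranges above into a budget figure, multiply by the expected running time. The sketch below is simple arithmetic over the typical prices listed in this section; the function and table names are hypothetical, and actual marketplace prices vary.

```python
from typing import Dict, Tuple

# Typical hourly price ranges in USD, copied from the list above.
TYPICAL_HOURLY_USD: Dict[str, Tuple[float, float]] = {
    "RTX 3090": (0.30, 0.50),
    "RTX 4090": (0.50, 0.80),
    "A6000": (0.80, 1.20),
}


def cost_range(gpu: str, hours: float) -> Tuple[float, float]:
    """Return the (low, high) cost in USD for running `gpu` for `hours`."""
    low, high = TYPICAL_HOURLY_USD[gpu]
    return round(low * hours, 2), round(high * hours, 2)


if __name__ == "__main__":
    for gpu in TYPICAL_HOURLY_USD:
        low, high = cost_range(gpu, 24 * 30)
        print(f"{gpu}: ${low:.2f} to ${high:.2f} for 30 days of continuous use")
```

For example, an RTX 3090 left running continuously for 30 days (720 hours) would cost roughly $216 to $360 at these rates, which is why destroying idle instances matters.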
Billing
Vast.ai bills by the hour. To minimize costs:
- Destroy instances when not streaming:
- Use spot instances (cheaper but can be interrupted)
- Monitor usage with bun run vast:status
Advanced Configuration
Custom Search Criteria
Edit scripts/vast-provision.sh to customize search criteria:
Manual Instance Selection
If you prefer to manually select an instance:
- Search for instances:
- Rent specific instance:
- Get SSH details:
Related Documentation
- AGENTS.md: Vast.ai Deployment Architecture section
- docs/duel-stack.md: Duel stack deployment guide
- scripts/deploy-vast.sh: Deployment script source code
- scripts/check-streaming-status.sh: Health check script
Commit History
The Vast.ai provisioner was introduced in commit 8591248d (March 1, 2026):
fix(vast): require gpu_display_active=true for WebGPU streaming
WebGPU requires GPU display driver support, not just compute. This was causing deployment failures because Xorg/Xvfb couldn’t start without proper display driver access.
Changes:
- vast-keeper: Add gpu_display_active=true to search query (CRITICAL)
- vast-keeper: Add CLI commands (provision, status, search, destroy)
- deploy-vast.yml: Add preflight check for GPU display support
- deploy-vast.yml: Add force_deploy input to override GPU check
- vast-provision.sh: Update disk size to 120GB
- package.json: Add vast:* convenience scripts