Deployment Best Practices
This guide covers best practices for deploying Hyperscape to production, including maintenance mode coordination, health checks, and zero-downtime deployments.
Table of Contents
- Maintenance Mode
- Deployment Workflow
- Health Checks
- Rollback Procedures
- Security Checklist
- Monitoring
Maintenance Mode
Maintenance mode provides graceful deployment coordination for the streaming duel system, preventing data loss and market inconsistency during deployments.
When to Use Maintenance Mode
Required for:
- Code deployments that restart the server
- Database schema migrations
- Configuration changes affecting the duel system
- Infrastructure maintenance (server moves, scaling)
Not required for:
- Static asset updates (CDN only)
- Client-only deployments (Cloudflare Pages)
- Documentation updates
Maintenance Mode API
Authentication: All endpoints require the ADMIN_CODE header.
Enter Maintenance Mode
Parameters:
- reason (string): Reason for maintenance (logged for audit)
- timeoutMs (number): Maximum wait time for markets to resolve (default: 300000 = 5 minutes)
Behavior:
- Pauses new duel cycles (current cycle completes)
- Locks betting markets (no new bets accepted)
- Waits for current market to resolve
- Returns when safe to deploy or timeout reached
Check Status
The system is safe to deploy when the status response reports:
- safeToDeploy: true
- currentPhase: "IDLE" (no active duel)
- marketStatus: "resolved" (all markets settled)
- pendingMarkets: 0
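The safe-state conditions above can be checked mechanically once the status response has been parsed. The following is a minimal sketch, assuming the response is a flat JSON object with exactly those field names; `is_safe_to_deploy` is a hypothetical helper, not part of the Hyperscape codebase.

```python
from typing import Any, Mapping


def is_safe_to_deploy(status: Mapping[str, Any]) -> bool:
    """Return True only when every documented safe-state condition holds."""
    return (
        status.get("safeToDeploy") is True
        and status.get("currentPhase") == "IDLE"       # no active duel
        and status.get("marketStatus") == "resolved"   # all markets settled
        and status.get("pendingMarkets") == 0
    )
```

Checking all four fields rather than only safeToDeploy is a defensive choice: it catches a response where the summary flag and the detailed fields disagree.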
Exit Maintenance Mode
- Resumes duel cycle scheduling
- Unlocks betting markets
- Normal operations resume
Helper Scripts
Pre-Deployment:
CI/CD Integration
The Vast.ai deployment workflow (.github/workflows/deploy-vast.yml) automatically coordinates maintenance mode:
- Enter maintenance mode (pauses new duels)
- Wait for active markets to resolve (up to 5 minutes)
- Deploy latest code via SSH
- Verify deployment health
- Exit maintenance mode (resumes operations)
Timeout behavior:
- If markets don’t resolve within the timeout, deployment proceeds anyway
- Manual intervention may be required to resolve stuck markets
- Check /admin/maintenance/status after deployment
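The ordering of the workflow steps, including the "proceed anyway on timeout" rule, can be sketched as a small orchestration function. This is an illustration only, not the contents of deploy-vast.yml: `deploy_with_maintenance` and all of its callable parameters are hypothetical, and the real HTTP and SSH calls are left to the caller.

```python
import time
from typing import Callable


def deploy_with_maintenance(
    enter: Callable[[], None],
    is_safe: Callable[[], bool],
    deploy: Callable[[], None],
    verify: Callable[[], bool],
    exit_maintenance: Callable[[], None],
    timeout_s: float = 300.0,   # 5 minutes, matching the documented default
    poll_s: float = 5.0,        # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Enter maintenance, wait for a safe state, deploy, verify, then exit."""
    enter()
    try:
        deadline = clock() + timeout_s
        safe = is_safe()
        while not safe and clock() < deadline:
            sleep(poll_s)
            safe = is_safe()
        # On timeout the deployment proceeds anyway, as the workflow describes.
        deploy()
        healthy = verify()
    finally:
        # Runs even if deploy() or verify() raises, so markets are not left locked.
        exit_maintenance()
    return {"timed_out": not safe, "healthy": healthy}
```

Exiting maintenance mode in a `finally` block is a design choice in this sketch. A team that prefers to stay in maintenance mode after a failed deployment would move that call out of `finally`. A `timed_out` value of True signals that stuck markets may need manual attention.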
Deployment Workflow
Railway (Production Server)
Automatic Deployment:
- Push to main → deploys to prod environment
- Push to develop → deploys to dev environment
Manual Deployment:
- Go to GitHub Actions → Deploy to Railway
- Select environment: prod or dev
- Click “Run workflow”
Environment Variables:
- JWT_SECRET - Required (throws error if not set)
- ADMIN_CODE - Required for security
- DATABASE_URL - PostgreSQL connection string
- PRIVY_APP_ID - Privy app ID
- PRIVY_APP_SECRET - Privy app secret
- PUBLIC_CDN_URL - Asset CDN URL (e.g., https://assets.hyperscape.club)
Post-Deployment Verification:
- Check /health endpoint: https://hyperscape.gg/health
- Verify WebSocket: wss://hyperscape.gg/ws
- Monitor logs for errors
- Test character creation and login
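A pre-flight check for the server variables named above might look like the following sketch. Treating every listed variable as mandatory is an assumption (the text marks only JWT_SECRET and ADMIN_CODE as required), and `missing_server_vars` is a hypothetical helper.

```python
from typing import Mapping

# Variable names taken from the Railway server list above.
REQUIRED_SERVER_VARS = (
    "JWT_SECRET",
    "ADMIN_CODE",
    "DATABASE_URL",
    "PRIVY_APP_ID",
    "PRIVY_APP_SECRET",
    "PUBLIC_CDN_URL",
)


def missing_server_vars(env: Mapping[str, str]) -> list[str]:
    """Return the listed variables that are unset or blank, in list order."""
    return [name for name in REQUIRED_SERVER_VARS if not env.get(name, "").strip()]
```

Calling `missing_server_vars(os.environ)` in a deploy script and aborting on a non-empty result surfaces a missing value before the server attempts to start.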
Cloudflare Pages (Frontend)
Automatic Deployment:
- Push to main → deploys to production
- Pull requests → preview deployments
Environment Variables:
- PUBLIC_PRIVY_APP_ID - Must match server’s PRIVY_APP_ID
- PUBLIC_API_URL - Backend API URL (e.g., https://hyperscape.gg)
- PUBLIC_WS_URL - WebSocket URL (e.g., wss://hyperscape.gg/ws)
- PUBLIC_CDN_URL - Asset CDN URL (e.g., https://assets.hyperscape.club)
Post-Deployment Verification:
- Test asset loading (check Network tab for 404s)
- Verify WebSocket connection
- Test authentication flow
- Check for CORS errors in Console
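Mismatched client and server settings are a common source of the authentication, WebSocket, and asset failures that the checks above look for. The sketch below compares the two sets of variables. The rule that the API and WebSocket URLs share a host is an assumption drawn from the example values, and `client_env_problems` is a hypothetical helper.

```python
from typing import Mapping
from urllib.parse import urlparse


def client_env_problems(client: Mapping[str, str], server: Mapping[str, str]) -> list[str]:
    """Return human-readable mismatches between client and server settings."""
    problems = []
    if client.get("PUBLIC_PRIVY_APP_ID") != server.get("PRIVY_APP_ID"):
        problems.append("PUBLIC_PRIVY_APP_ID does not match server PRIVY_APP_ID")
    api = urlparse(client.get("PUBLIC_API_URL", ""))
    ws = urlparse(client.get("PUBLIC_WS_URL", ""))
    if api.scheme not in ("http", "https"):
        problems.append("PUBLIC_API_URL must use http or https")
    if ws.scheme not in ("ws", "wss"):
        problems.append("PUBLIC_WS_URL must use ws or wss")
    # Assumption: API and WebSocket are served from the same host.
    if api.hostname != ws.hostname:
        problems.append("PUBLIC_WS_URL host differs from PUBLIC_API_URL host")
    if client.get("PUBLIC_CDN_URL") != server.get("PUBLIC_CDN_URL"):
        problems.append("PUBLIC_CDN_URL differs between client and server")
    return problems
```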
Vast.ai (Streaming Duels)
Automatic Deployment:
- Triggered on successful main branch builds
- Includes maintenance mode coordination
- Automatic health checks and recovery
Post-Deployment Verification:
- Check PM2 status: pm2 status
- Verify stream health: curl http://localhost:5555/health
- Check RTMP output: ffplay rtmp://your-rtmp-server/live/stream
- Monitor logs: pm2 logs
Health Checks
Server Health Endpoint
- status: "healthy" - All systems operational
- database: "connected" - PostgreSQL connection active
- websocket: "active" - WebSocket server running
- maintenance: false - Not in maintenance mode
- streaming.active: true - Streaming duel system running
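A script that consumes the health response can report exactly which fields deviate from the healthy values listed above. This sketch assumes the response is JSON and that streaming.active is a nested `active` key inside a `streaming` object; `unhealthy_fields` is a hypothetical helper.

```python
from typing import Any, Mapping


def unhealthy_fields(health: Mapping[str, Any]) -> list[str]:
    """Return the names of health fields that differ from the healthy values."""
    expected = {
        "status": "healthy",
        "database": "connected",
        "websocket": "active",
        "maintenance": False,
    }
    failing = [key for key, want in expected.items() if health.get(key) != want]
    # Assumed shape: {"streaming": {"active": true}}
    streaming = health.get("streaming") or {}
    if streaming.get("active") is not True:
        failing.append("streaming.active")
    return failing
```

An empty list means every documented field has its healthy value. Note that maintenance: true is reported as a deviation, which is expected while a deployment is in progress.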
Vast.ai Health Checks
The Vast.ai keeper (packages/vast-keeper) automatically monitors instance health:
Health Check Criteria:
- HTTP /health endpoint responds with 200
- Response time < 5 seconds
- Database connection active
- WebSocket server running
Automatic Recovery:
- Detect unhealthy instance (3 consecutive failures)
- Destroy failed instance
- Provision new instance
- Deploy latest code
- Resume operations
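The "3 consecutive failures" trigger can be illustrated with a small counter that resets on any healthy probe. This is a sketch of the idea only, not the keeper's actual implementation; `InstanceHealthTracker` is a hypothetical name, and it covers only the HTTP status and response-time criteria.

```python
from dataclasses import dataclass


@dataclass
class InstanceHealthTracker:
    failure_threshold: int = 3     # consecutive failures before replacement
    max_response_s: float = 5.0    # slower responses count as failures
    consecutive_failures: int = 0

    def record(self, status_code: int, response_s: float) -> bool:
        """Record one probe; return True when the instance should be replaced."""
        healthy = status_code == 200 and response_s < self.max_response_s
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failure_threshold
```

Requiring consecutive failures, rather than counting failures in total, keeps a single slow response from destroying an otherwise healthy instance.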
Configuration (.env):
Manual Health Checks
Server:
Rollback Procedures
Railway Rollback
Via Dashboard:
- Go to Railway dashboard → Deployments
- Find last known good deployment
- Click “Redeploy”
Cloudflare Pages Rollback
Via Dashboard:
- Go to Cloudflare dashboard → Pages → hyperscape
- Click “Deployments” tab
- Find last known good deployment
- Click “Rollback to this deployment”
Vast.ai Rollback
Via SSH:
Security Checklist
Pre-Deployment
- JWT_SECRET set and secure (32+ characters)
- ADMIN_CODE set and not committed to git
- PRIVY_APP_SECRET set and not exposed to client
- Database credentials rotated (if compromised)
- API tokens reviewed and scoped appropriately
- Environment variables match between client and server
- CORS configuration includes only known domains
- Rate limiting enabled (DISABLE_RATE_LIMIT=false)
Post-Deployment
- /health endpoint returns 200
- Authentication flow works (Privy login)
- Admin commands require ADMIN_CODE
- CSRF protection active for same-origin requests
- CORS errors not present in browser console
- WebSocket connections establish successfully
- Database migrations applied successfully
- No sensitive data in client-side logs
Production Environment Variables
Required:
Monitoring
Key Metrics
Server:
- Response time (p50, p95, p99)
- Error rate (4xx, 5xx)
- WebSocket connections (active, total)
- Database query time
- Memory usage
- CPU usage
Streaming:
- RTMP uptime
- FFmpeg restarts
- CDP stall events
- Frame rate (target: 30 FPS)
- Bitrate (target: 2500 kbps)
Duel System:
- Duel cycle duration
- Market resolution time
- Bet placement rate
- Payout success rate
Logging
Server Logs:
Alerts
Critical Alerts (configure ALERT_WEBHOOK_URL):
- Server crash or restart
- Database connection lost
- WebSocket server down
- Streaming pipeline failure
- Maintenance mode timeout
Warning Alerts:
- High error rate (> 5%)
- Slow response time (> 1s p95)
- Memory usage > 80%
- FFmpeg restart (> 3 in 10 minutes)
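The numeric thresholds in the list above translate directly into a small evaluation function. The sketch below is illustrative: the function name, the parameter names, and the choice to express rates as fractions and times in seconds are all assumptions, and delivering the alert (for example to ALERT_WEBHOOK_URL) is left out.

```python
from typing import Sequence


def threshold_alerts(
    error_rate: float,                        # fraction of requests, e.g. 0.02 for 2%
    p95_response_s: float,                    # 95th percentile response time in seconds
    memory_fraction: float,                   # fraction of available memory in use
    ffmpeg_restart_times_s: Sequence[float],  # timestamps of recent FFmpeg restarts
    now_s: float,                             # current time on the same clock
) -> list[str]:
    """Return the threshold-based alerts that currently apply."""
    alerts = []
    if error_rate > 0.05:
        alerts.append("High error rate")
    if p95_response_s > 1.0:
        alerts.append("Slow response time")
    if memory_fraction > 0.80:
        alerts.append("High memory usage")
    # Count restarts inside the trailing 10-minute (600 s) window.
    recent = [t for t in ffmpeg_restart_times_s if now_s - t <= 600]
    if len(recent) > 3:
        alerts.append("Frequent FFmpeg restarts")
    return alerts
```

All comparisons are strict, matching the ">" in the list, so a value sitting exactly on a threshold does not alert.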
Common Deployment Issues
JWT_SECRET Not Set
Symptom: Server throws error on startup in production/staging.
Cause: JWT_SECRET is required as of February 2026 (security hardening).
Solution:
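One way to produce a value that satisfies the 32+ character guidance from the security checklist is Python's standard `secrets` module. This is a sketch, not necessarily the project's recommended command; `generate_jwt_secret` is a hypothetical helper.

```python
import secrets


def generate_jwt_secret(num_bytes: int = 32) -> str:
    """Return a URL-safe random string suitable for use as a signing secret."""
    if num_bytes < 24:
        # 24 random bytes encode to exactly 32 URL-safe characters.
        raise ValueError("use at least 24 random bytes so the secret reaches 32 characters")
    return secrets.token_urlsafe(num_bytes)
```

Set the generated value as JWT_SECRET in the hosting environment rather than committing it to the repository, then redeploy.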
CORS Errors After Deployment
Symptom: Assets fail to load with CORS errors in browser console.
Cause: R2 bucket CORS not configured or domains missing.
Solution:
- Run scripts/configure-r2-cors.sh (see docs/r2-cors-configuration.md)
- Verify all domains in AllowedOrigins list
- Wait 1-2 minutes for propagation
- Hard reload browser (Cmd+Shift+R / Ctrl+Shift+R)
Maintenance Mode Timeout
Symptom: Deployment proceeds but markets still active.
Cause: Markets didn’t resolve within timeout (default 5 minutes).
Solution:
- Check market status: GET /admin/maintenance/status
- Manually resolve stuck markets (if safe)
- Exit maintenance mode: POST /admin/maintenance/exit
- Monitor for data inconsistencies
Database Migration Failures
Symptom: Server fails to start after deployment with schema errors.
Cause: Migration failed or partially applied.
Solution:
WebSocket Connection Failures
Symptom: Clients can’t connect to WebSocket after deployment.
Cause: WebSocket URL misconfigured or server not listening.
Solution:
- Verify PUBLIC_WS_URL in client .env matches server domain
- Check server logs for WebSocket initialization errors
- Test WebSocket manually: wscat -c wss://hyperscape.gg/ws
- Verify Railway/Cloudflare WebSocket support enabled
Zero-Downtime Deployment
Strategy
- Blue-Green Deployment (Railway):
  - Deploy to new instance
  - Health check new instance
  - Switch traffic to new instance
  - Keep old instance for rollback
- Maintenance Mode Coordination (Vast.ai):
  - Enter maintenance mode
  - Wait for safe state
  - Deploy new code
  - Exit maintenance mode
Implementation
Railway (automatic):
- Railway handles blue-green deployment automatically
- Old instance kept for 30 seconds after new instance healthy
- Traffic switches when new instance passes health checks
Database Migrations
Safe Migration Workflow
- Backup Database
- Test Migration Locally
- Apply to Production
- Rollback if Needed
Migration Best Practices
- Always backup before migrations
- Test locally with production data copy
- Use transactions for multi-step migrations
- Avoid breaking changes (add columns as nullable, deprecate instead of drop)
- Monitor performance after migrations (check query plans)
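The transaction and nullable-column practices above can be demonstrated end to end with Python's built-in `sqlite3` module. Note the substitution: production uses PostgreSQL, and SQLite is used here only so the sketch runs without a server. The `players` table, its columns, and `run_migration` are hypothetical.

```python
import sqlite3


def run_migration(conn: sqlite3.Connection, statements: list[str]) -> bool:
    """Apply all statements in one transaction; roll back everything on any error."""
    conn.execute("BEGIN")
    try:
        for sql in statements:
            conn.execute(sql)
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        return False
    conn.execute("COMMIT")
    return True


# isolation_level=None hands transaction control to the explicit BEGIN/COMMIT above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Non-breaking change: the new column is nullable, so existing rows and old code keep working.
applied = run_migration(conn, ["ALTER TABLE players ADD COLUMN last_login TEXT"])
```

If any statement in a multi-step migration fails, the rollback leaves the schema exactly as it was, which avoids the partially applied state described under Database Migration Failures.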
Streaming Deployment
RTMP Configuration
Environment Variables (set in Vast.ai/Railway):
Streaming Stability
Tuning Parameters (.env):
Stability Improvements:
- Soft CDP Recovery: Restarts screencast without browser/FFmpeg teardown (no stream gap)
- Increased Thresholds: CDP stall (2→4 intervals), FFmpeg restarts (5→8), recovery failures (2→4)
- Best-Effort WebGPU: Tries maxTextureArrayLayers: 2048, retries with defaults if GPU rejects
Troubleshooting
Deployment Hangs
Symptom: Deployment stuck in “Building” or “Starting” state.
Cause: Build timeout, dependency installation failure, or startup error.
Solution:
- Check Railway/GitHub Actions logs for errors
- Verify bun install --frozen-lockfile succeeds locally
- Check for npm 403 errors (retry logic should handle these)
- Increase build timeout in Railway settings (if needed)
Database Connection Errors
Symptom: Server starts but can’t connect to database.
Cause: DATABASE_URL misconfigured or database not accessible.
Solution:
- Verify DATABASE_URL format: postgresql://user:password@host:port/database
- Check database is running and accessible
- Test connection manually: psql $DATABASE_URL -c "SELECT 1"
- Verify firewall rules allow connections from server IP
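The format check in the first step can be automated before attempting a connection. The sketch below validates a URL against the user:password@host:port/database shape shown above. Requiring every component, including the port, follows that documented format and is stricter than PostgreSQL itself; `database_url_problems` is a hypothetical helper.

```python
from urllib.parse import urlparse


def database_url_problems(url: str) -> list[str]:
    """Return a list of ways the URL deviates from the documented format."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("postgresql", "postgres"):
        problems.append("scheme must be postgresql:// (or postgres://)")
    if not parsed.username or parsed.password is None:
        problems.append("missing user or password")
    if not parsed.hostname:
        problems.append("missing host")
    try:
        port = parsed.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        port = None
    if port is None:
        problems.append("missing or invalid port")
    if not parsed.path.strip("/"):
        problems.append("missing database name")
    return problems
```

A clean result only confirms the shape of the URL. Reachability and credentials still need the manual psql test and the firewall check.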
Asset 404 Errors
Symptom: Models, textures, or audio fail to load with 404 errors.
Cause: CDN URL misconfigured or assets not uploaded.
Solution:
- Verify PUBLIC_CDN_URL in both client and server .env
- Check R2 bucket contains assets: aws s3 ls s3://hyperscape-assets/
- Upload missing assets: bun run assets:sync (see README)
- Verify CORS configuration (see docs/r2-cors-configuration.md)
Related Documentation
- R2 CORS: docs/r2-cors-configuration.md
- Railway Setup: docs/railway-dev-prod.md
- Native Releases: docs/native-release.md
- Duel Stack: docs/duel-stack.md
- Environment Variables: packages/server/.env.example, packages/client/.env.example
Deployment Checklist
Pre-Deployment
- Code reviewed and tested locally
- All tests passing (bun test)
- Database backup created
- Migration tested on copy of production data
- Environment variables verified
- Security checklist completed
- Rollback plan documented
During Deployment
- Maintenance mode entered (if applicable)
- Safe state confirmed (safeToDeploy: true)
- Deployment triggered
- Health checks passing
- Logs monitored for errors
Post-Deployment
- /health endpoint returns 200
- WebSocket connections working
- Authentication flow tested
- Asset loading verified
- Database queries performing well
- Streaming active (if applicable)
- Maintenance mode exited
- Monitoring alerts configured
- Team notified of deployment
Emergency Contacts
Critical Issues:
- Check #hyperscape-alerts Slack channel
- Page on-call engineer via PagerDuty
- Rollback immediately if user-facing
Non-Critical Issues:
- Create GitHub issue with logs
- Post in #hyperscape-dev Slack channel
- Schedule fix for next deployment
Commit References
- Maintenance Mode: 30b52bd (February 26, 2026)
- CORS Configuration: 143914d (February 26, 2026)
- Streaming Stability: 14a1e1b (February 25, 2026)
- JWT Security: 3bc59db (February 26, 2026)