ADR-002: Railway for Long-Running Python Compute
Status: Accepted
Date: 2024-11-15
Deciders: Platform architect
Context
The Groundtruth engine executes AI consulting crews that take 10-60 minutes per run. The Next.js web app is deployed on Vercel, which has a 60-second timeout for serverless functions (300 seconds on Enterprise plan). The engine is a Python 3.12 application using CrewAI, which requires persistent process state during execution.
The platform needed a hosting solution for the Python engine that:
- Supports long-running processes (30+ minutes)
- Runs Python 3.12 with native dependencies (cffi, pdfplumber, openpyxl)
- Provides a persistent HTTP server for SSE streaming
- Is cost-effective for a pre-revenue startup
Decision
Deploy the Python engine as a persistent FastAPI service on Railway. The Next.js app communicates with it via HTTP API. Railway runs the engine in a Docker container with a 30-minute watchdog timeout.
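The 30-minute watchdog can be sketched with `asyncio.wait_for`; a minimal illustration, assuming the engine exposes each run as a coroutine (the function names and return payloads below are hypothetical, not the actual engine code):

```python
import asyncio

WATCHDOG_SECONDS = 30 * 60  # the 30-minute ceiling from this ADR

async def run_with_watchdog(run_coro, timeout: float = WATCHDOG_SECONDS):
    """Await a crew run, cancelling it if it exceeds the watchdog timeout."""
    try:
        return await asyncio.wait_for(run_coro, timeout=timeout)
    except asyncio.TimeoutError:
        # In practice the engine would mark the run failed and notify the client.
        return {"status": "timeout", "limit_seconds": timeout}

async def fake_crew_run(duration: float):
    """Stand-in for a CrewAI engagement run (hypothetical)."""
    await asyncio.sleep(duration)
    return {"status": "completed"}
```

Because `asyncio.wait_for` cancels the inner task on timeout, a run that hangs cannot pin the instance indefinitely, which matters when a single Railway instance serves all runs.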
Alternatives Considered
1. Vercel Serverless Functions (Python runtime)
Rejected due to the 60-second timeout constraint. Even the 300-second Enterprise timeout is insufficient for full engagement runs. Vercel's Python runtime also has limited native dependency support.
2. AWS Lambda with Step Functions
Would solve the timeout issue via state machine orchestration. Rejected due to operational complexity — Step Functions require decomposing the engine into discrete Lambda functions, which conflicts with CrewAI's in-process execution model. Also significantly more expensive and complex to debug.
3. AWS ECS / Google Cloud Run
Both would work technically. Rejected in favor of Railway for simplicity — Railway provides Git-based deploys, automatic HTTPS, and simple configuration. For a single-developer project, the operational overhead of ECS task definitions and Cloud Run configurations is not justified.
4. Fly.io
Strong contender. Similar to Railway in simplicity. Railway was chosen for slightly better Python/Docker support at the time and a more predictable pricing model.
Consequences
Positive:
- Engine runs without timeout constraints (watchdog set to 30 minutes)
- SSE streaming works naturally with a persistent FastAPI server
- Docker deployment handles all native Python dependencies
- Simple Git-push deploys from the monorepo
- Cost-effective: $5-20/month for the current scale
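The SSE framing that the persistent server makes possible can be sketched in plain Python; the event names and payloads here are illustrative. A FastAPI `StreamingResponse` with `media_type="text/event-stream"` would wrap a generator like this:

```python
import json

def sse_event(data, event=None):
    """Format one Server-Sent Events frame (text/event-stream wire format)."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"  # blank line terminates the frame

def stream_run_progress(steps):
    """Yield SSE frames for each progress step, then a completion frame."""
    for i, step in enumerate(steps):
        yield sse_event({"step": i, "message": step}, event="progress")
    yield sse_event({"status": "done"}, event="complete")
```

This only works because the FastAPI process outlives the request: a serverless function that is frozen or killed at its timeout cannot keep a stream like this open for a 30-minute run.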
Negative:
- Two deployment targets (Vercel for web, Railway for engine) add operational complexity
- CORS configuration required between Vercel and Railway domains
- No auto-scaling — single Railway instance handles all concurrent runs
- Cold starts if the instance sleeps (Railway sleeps after inactivity on the free tier)
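The cross-origin requirement is typically handled with FastAPI's `CORSMiddleware`; a sketch of the shape of that configuration, where the origin and headers are placeholders rather than the project's actual values:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow the Vercel-hosted web app to call the Railway-hosted engine.
# The origin below is an illustrative placeholder; the real value would
# come from environment configuration so preview deploys can differ.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.example.com"],
    allow_methods=["GET", "POST"],
    allow_headers=["Authorization", "Content-Type"],
)
```

Keeping the allow-list to the known Vercel domain (rather than `"*"`) is the usual choice when the API carries authenticated, tenant-scoped requests.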
Risks:
- Single point of failure: one Railway instance serves all tenants. Mitigated by the stateless design (in-memory run state can be reconstructed) and Railway's uptime SLA.
- Dockerfile maintenance: Railway builds from a custom Dockerfile because Nixpacks has a cffi/Python 3.12 incompatibility. The Dockerfile must be updated whenever Python dependencies change.