ADR-002: Railway for Long-Running Python Compute
Status: Accepted
Date: 2024-11-15
Deciders: Platform architect
Context
The Groundtruth engine executes AI consulting crews that take 10-60 minutes per run. The Next.js web app is deployed on Vercel, which has a 60-second timeout for serverless functions (300 seconds on Enterprise plan). The engine is a Python 3.12 application using CrewAI, which requires persistent process state during execution.
The platform needed a hosting solution for the Python engine that:
- Supports long-running processes (30+ minutes)
- Runs Python 3.12 with native dependencies (cffi, pdfplumber, openpyxl)
- Provides a persistent HTTP server for SSE streaming
- Is cost-effective for a pre-revenue startup
Decision
Deploy the Python engine as a persistent FastAPI service on Railway. The Next.js app communicates with it via HTTP API. Railway runs the engine in a Docker container with a 30-minute watchdog timeout.
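The 30-minute watchdog can be sketched with `asyncio.wait_for`; a minimal illustration, assuming the engine exposes each run as a coroutine (the function names and return payloads below are hypothetical, not the actual engine code):

```python
import asyncio

WATCHDOG_SECONDS = 30 * 60  # the 30-minute ceiling from this ADR

async def run_with_watchdog(run_coro, timeout: float = WATCHDOG_SECONDS):
    """Await a crew run, cancelling it if it exceeds the watchdog timeout."""
    try:
        return await asyncio.wait_for(run_coro, timeout=timeout)
    except asyncio.TimeoutError:
        # In practice the engine would mark the run failed and notify the client.
        return {"status": "timeout", "limit_seconds": timeout}

async def fake_crew_run(duration: float):
    """Stand-in for a CrewAI engagement run (hypothetical)."""
    await asyncio.sleep(duration)
    return {"status": "completed"}
```

Because `asyncio.wait_for` cancels the inner task on timeout, a run that hangs cannot pin the instance indefinitely, which matters when a single Railway instance serves all runs.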
Alternatives Considered
1. Vercel Serverless Functions (Python runtime)
Rejected due to the 60-second timeout constraint. Even the 300-second Enterprise timeout is insufficient for full engagement runs. Vercel's Python runtime also has limited native dependency support.
2. AWS Lambda with Step Functions
Would solve the timeout issue via state machine orchestration. Rejected due to operational complexity — Step Functions require decomposing the engine into discrete Lambda functions, which conflicts with CrewAI's in-process execution model. Also significantly more expensive and complex to debug.
3. AWS ECS / Google Cloud Run
Both would work technically. Rejected in favor of Railway for simplicity — Railway provides Git-based deploys, automatic HTTPS, and simple configuration. For a single-developer project, the operational overhead of ECS task definitions and Cloud Run configurations is not justified.
4. Fly.io
Strong contender. Similar to Railway in simplicity. Railway was chosen for slightly better Python/Docker support at the time and a more predictable pricing model.
Consequences
Positive:
- Engine runs without timeout constraints (watchdog set to 30 minutes)
- SSE streaming works naturally with a persistent FastAPI server
- Docker deployment handles all native Python dependencies
- Simple Git-push deploys from the monorepo
- Cost-effective: $5-20/month for the current scale
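The SSE framing that the persistent server makes possible can be sketched in plain Python; the event names and payloads here are illustrative. A FastAPI `StreamingResponse` with `media_type="text/event-stream"` would wrap a generator like this:

```python
import json

def sse_event(data, event=None):
    """Format one Server-Sent Events frame (text/event-stream wire format)."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"  # blank line terminates the frame

def stream_run_progress(steps):
    """Yield SSE frames for each progress step, then a completion frame."""
    for i, step in enumerate(steps):
        yield sse_event({"step": i, "message": step}, event="progress")
    yield sse_event({"status": "done"}, event="complete")
```

This only works because the FastAPI process outlives the request: a serverless function that is frozen or killed at its timeout cannot keep a stream like this open for a 30-minute run.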
Negative:
- Two deployment targets (Vercel for web, Railway for engine) add operational complexity
- CORS configuration required between Vercel and Railway domains
- No auto-scaling — single Railway instance handles all concurrent runs
- Cold starts if the instance sleeps (Railway sleeps after inactivity on the free tier)
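The cross-origin requirement is typically handled with FastAPI's `CORSMiddleware`; a sketch of the shape of that configuration, where the origin and headers are placeholders rather than the project's actual values:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow the Vercel-hosted web app to call the Railway-hosted engine.
# The origin below is an illustrative placeholder; the real value would
# come from environment configuration so preview deploys can differ.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.example.com"],
    allow_methods=["GET", "POST"],
    allow_headers=["Authorization", "Content-Type"],
)
```

Keeping the allow-list to the known Vercel domain (rather than `"*"`) is the usual choice when the API carries authenticated, tenant-scoped requests.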
Risks:
- Single point of failure: one Railway instance serves all tenants. Mitigated by the stateless design (in-memory run state can be reconstructed) and Railway's uptime SLA.
- Dockerfile maintenance: Railway builds from a custom Dockerfile because Nixpacks has a cffi/Python 3.12 incompatibility. The Dockerfile must be updated whenever Python dependencies change.