Monitoring & Observability

This document covers error tracking, logging, health checks, and the admin health dashboard for the Groundtruth Platform.


Sentry Error Tracking

The platform uses Sentry for error tracking across both the Next.js web application and the Python engine.

Next.js Integration

Sentry is integrated via @sentry/nextjs, which instruments three runtime environments:

  • Client — Browser-side errors, unhandled promise rejections, React error boundaries
  • Server — API route errors, server component errors, middleware failures
  • Edge — Edge runtime errors (middleware, edge API routes)

Configuration is applied in next.config.ts using the withSentryConfig wrapper. This enables automatic instrumentation of:

  • API route handlers
  • Server-side data fetching
  • Client-side navigation and rendering

Source maps are uploaded to Sentry during CI builds for readable stack traces in production. Source map uploads are disabled in local development to avoid slowing down builds.
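The CI-only source map behavior can be sketched as follows. This is a minimal illustration, not the platform's actual configuration: buildSentryOptions is a hypothetical helper, and detecting CI through a CI=true environment variable is an assumption. The org, project, silent, and sourcemaps.disable fields are build options accepted by withSentryConfig in recent @sentry/nextjs releases; check them against the version you have installed.

```typescript
// Hypothetical helper, not taken from the platform's source.
type Env = Record<string, string | undefined>;

interface SentryBuildOptions {
  org?: string;
  project?: string;
  silent: boolean;
  sourcemaps: { disable: boolean };
}

function buildSentryOptions(env: Env): SentryBuildOptions {
  // Assumption: the CI provider sets CI=true on build machines.
  const isCi = env.CI === 'true';
  return {
    org: env.SENTRY_ORG,
    project: env.SENTRY_PROJECT,
    silent: !isCi,                  // keep local build output quiet
    sourcemaps: { disable: !isCi }, // upload source maps only in CI
  };
}

// In next.config.ts (shown as comments because it needs the real packages):
//   import { withSentryConfig } from '@sentry/nextjs';
//   export default withSentryConfig(nextConfig, buildSentryOptions(process.env));
```

Keeping the decision in a small pure function makes it easy to check both the CI and local cases without running a build.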

Python Engine Integration

The engine uses sentry-sdk[fastapi] to capture errors from:

  • FastAPI request handlers
  • Background crew execution tasks
  • Unhandled exceptions in engine modules

Required Environment Variables

Variable        Description
SENTRY_DSN      Sentry Data Source Name (the ingest URL for your project)
SENTRY_ORG      Sentry organization slug (used for source map uploads)
SENTRY_PROJECT  Sentry project slug (used for source map uploads)

Sentry is optional. If SENTRY_DSN is not set, the platform operates normally without error tracking. See environment-variables.md for the full list of configuration variables.
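One way to express the "optional" behavior is to guard initialization on the presence of the DSN. The sketch below assumes a hypothetical resolveSentrySetup helper; the platform's real initialization code may be structured differently.

```typescript
// Hypothetical helper, not taken from the platform's source.
type Env = Record<string, string | undefined>;

type SentrySetup =
  | { enabled: false }
  | { enabled: true; dsn: string };

function resolveSentrySetup(env: Env): SentrySetup {
  // Treat a missing or blank DSN as "error tracking off".
  const dsn = env.SENTRY_DSN?.trim();
  if (!dsn) return { enabled: false };
  return { enabled: true, dsn };
}

// Possible use in a Sentry config file (comments only, needs @sentry/nextjs):
//   const setup = resolveSentrySetup(process.env);
//   if (setup.enabled) Sentry.init({ dsn: setup.dsn });
```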


Structured JSON Logging

Logger Module

A custom logger is implemented at src/lib/logger.ts. It provides structured, contextual logging with environment-aware formatting:

  • Production — JSON format, machine-parseable. Each log entry is a single JSON object per line, suitable for ingestion by log aggregation services (Datadog, Logflare, etc.).
  • Development — Human-readable format with color coding for quick scanning in the terminal.

Log Entry Fields

Every log entry includes structured metadata:

Field         Description
level         Log level: info, warn, error, debug
message       Human-readable log message
timestamp     ISO 8601 timestamp
tenantId      Tenant context (when available)
engagementId  Engagement context (when available)
action        The action being performed (e.g., engagement.start, deliverable.write)
metadata      Additional key-value pairs specific to the log event
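The fields above and the two output modes can be sketched as a pair of small functions. This is an illustration only: buildEntry and formatEntry are hypothetical names, the real src/lib/logger.ts may differ, and the color coding used in development is left out for brevity.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

interface LogContext {
  tenantId?: string;
  engagementId?: string;
  action?: string;
  metadata?: Record<string, unknown>;
}

interface LogEntry extends LogContext {
  level: LogLevel;
  message: string;
  timestamp: string; // ISO 8601
}

function buildEntry(
  level: LogLevel,
  message: string,
  context: LogContext = {},
  now: Date = new Date(),
): LogEntry {
  return { level, message, timestamp: now.toISOString(), ...context };
}

function formatEntry(entry: LogEntry, production: boolean): string {
  // Production: one JSON object per line for log aggregators.
  if (production) return JSON.stringify(entry);
  // Development: a compact human-readable line (colors omitted here).
  const { level, message, timestamp, ...context } = entry;
  const suffix = Object.keys(context).length > 0 ? ` ${JSON.stringify(context)}` : '';
  return `${timestamp} ${level.toUpperCase()} ${message}${suffix}`;
}
```

Because JSON.stringify skips undefined values, context fields that are not available simply do not appear in the output.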

Migration from console.error

All console.error calls throughout the codebase have been migrated to logger.error to ensure consistent structured output. Direct console.log and console.error usage should be avoided in favor of the logger module.

Usage Example

import { logger } from '@/lib/logger';

logger.info('Engagement started', {
  tenantId: tenant.id,
  engagementId: engagement.id,
  action: 'engagement.start',
});

logger.error('Engine request failed', {
  tenantId: tenant.id,
  action: 'engine.request',
  metadata: { statusCode: 500, endpoint: '/run' },
});

Admin Health Dashboard

Location

/dashboard/admin/health

Access Control

Access is restricted to users with the admin or owner role. Users with any other role (member, viewer) receive a 403 Forbidden response. Role enforcement is handled by the RBAC middleware; see security.md for details on the role hierarchy.
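The access rule reduces to a simple role check. The sketch below is a hypothetical illustration of that rule only; it does not reproduce the platform's RBAC middleware, and the function name is an assumption.

```typescript
type Role = 'owner' | 'admin' | 'member' | 'viewer';

// Roles allowed to open /dashboard/admin/health, per the rule above.
const HEALTH_DASHBOARD_ROLES: readonly Role[] = ['admin', 'owner'];

// Returns the HTTP status the dashboard route would answer with.
function healthDashboardStatus(role: Role): 200 | 403 {
  return HEALTH_DASHBOARD_ROLES.indexOf(role) !== -1 ? 200 : 403;
}
```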

System Status Overview

The dashboard displays an overall system status:

Status    Meaning
Healthy   All services responding normally
Degraded  One or more services slow or partially failing
Down      Critical service unreachable

Service Checks

Three services are monitored with individual status and latency:

  1. Database (PostgreSQL) — Executes a lightweight query against the Supabase PostgreSQL instance. Reports connection success/failure and round-trip latency in milliseconds.

  2. Redis (Upstash) — Pings the Upstash Redis instance. Reports connection success/failure and latency. If Redis is not configured, the check reports "not configured" rather than "down" (Redis is optional).

  3. Engine (Railway) — Calls the Railway-hosted Python engine's /health endpoint. Reports availability, latency, and engine version.
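The three checks can be combined into the overall status roughly as sketched below. Several details are assumptions rather than documented behavior: the database and engine are treated as critical, Redis as non-critical, and a successful check slower than 1000 ms counts as "slow". All names are hypothetical.

```typescript
type ServiceState = 'up' | 'slow' | 'down' | 'not_configured';
type OverallStatus = 'healthy' | 'degraded' | 'down';

interface ServiceCheck {
  name: 'database' | 'redis' | 'engine';
  state: ServiceState;
  latencyMs?: number;
  critical: boolean; // assumption: database and engine are critical, Redis is not
}

// Assumption: a successful check slower than this counts as "slow".
const SLOW_AFTER_MS = 1000;

function classify(ok: boolean, latencyMs: number): ServiceState {
  if (!ok) return 'down';
  return latencyMs > SLOW_AFTER_MS ? 'slow' : 'up';
}

function overallStatus(checks: ServiceCheck[]): OverallStatus {
  // A critical service that is unreachable takes the whole system down.
  if (checks.some((c) => c.critical && c.state === 'down')) return 'down';
  // Any slow or failing service degrades the system.
  if (checks.some((c) => c.state === 'down' || c.state === 'slow')) return 'degraded';
  // 'not_configured' (for example, Redis left unset) never lowers the status.
  return 'healthy';
}
```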

Dashboard Statistics

In addition to service health, the dashboard displays:

  • Active runs — Number of currently running engagements
  • Total engagements — Aggregate count across the platform
  • Recent errors (24h) — Count of errors captured in the last 24 hours

Auto-Refresh

The dashboard refreshes health data automatically at regular intervals, providing near-real-time status without manual reloading.
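A periodic refresh of this kind is often built on a plain timer. The following sketch shows the general pattern only; startAutoRefresh is a hypothetical helper, and the 30-second interval in the usage comment is an assumed value, since the actual refresh interval is not stated here.

```typescript
// Calls `refresh` once immediately, then every `intervalMs` milliseconds.
// Returns a function that stops the polling.
function startAutoRefresh(refresh: () => void, intervalMs: number): () => void {
  refresh();
  const timer = setInterval(refresh, intervalMs);
  return () => clearInterval(timer);
}

// Possible use in a client component (comments only):
//   const stop = startAutoRefresh(loadHealthData, 30_000); // assumed interval
//   ...later, when the dashboard unmounts: stop();
```

In a React component the returned stop function would typically serve as the cleanup of a useEffect hook, so the timer does not outlive the page.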


Health Endpoints

Web Application Health

GET /api/health

No authentication required. Returns:

{
  "status": "ok",
  "service": "groundtruth-web",
  "timestamp": "2026-02-18T12:00:00.000Z"
}

This endpoint is suitable for uptime monitoring services (Vercel checks, UptimeRobot, Pingdom, etc.).
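A handler producing this response could look like the sketch below. It assumes the Next.js App Router convention of a route file such as app/api/health/route.ts, which is not confirmed by this document; buildHealthPayload is a hypothetical helper name.

```typescript
interface HealthPayload {
  status: 'ok';
  service: 'groundtruth-web';
  timestamp: string; // ISO 8601
}

function buildHealthPayload(now: Date = new Date()): HealthPayload {
  return { status: 'ok', service: 'groundtruth-web', timestamp: now.toISOString() };
}

// In a real route file this function would be exported as the GET handler.
function GET(): Response {
  return new Response(JSON.stringify(buildHealthPayload()), {
    status: 200,
    headers: { 'Content-Type': 'application/json' },
  });
}
```

The handler performs no authentication and touches no downstream services, which keeps it cheap enough to be polled frequently by uptime monitors.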

Engine Health

GET /health

Hosted on the Railway Python engine. Returns the engine's status and version. Used by the admin health dashboard to verify engine availability.


Related Documentation

  • Security — RBAC, RLS, rate limiting, audit logging
  • Environment Variables — Full list of configuration variables including Sentry, Redis, and engine URL
