Relvy – AI On-Call Runbooks That Slash MTTR

Relvy is a YC F24-backed on-call automation platform that executes runbooks during incidents — reducing MTTR by turning static SOPs into AI-driven response workflows. Free tier available.

TL;DR

Relvy turns static on-call runbooks into AI-executable workflows and aims to cut MTTR by automating SOP execution during incidents. It is backed by YC F24.

What Is Relvy?

Relvy is an on-call runbook automation platform built for SRE and incident response teams. Instead of relying on engineers to manually follow runbook steps during a high-pressure outage, Relvy lets AI execute those steps — automatically opening dashboards, running diagnostics, and escalating when human judgment is needed.

The core loop: your runbook is a document (Markdown, YAML, or plain natural language), and Relvy’s agent interprets it at incident time and acts on it without waiting for an engineer to read through it step by step.

Setup Workflow

Step 1: Sign Up and Connect Your Stack

Register at https://relvy.ai. Relvy integrates with:

  • PagerDuty, OpsGenie, and other on-call schedulers
  • Datadog, Grafana, and Prometheus for metrics
  • Slack and Microsoft Teams for alerting
  • GitHub for runbook versioning

Step 2: Import Your Runbooks

Import existing runbooks from Notion, Confluence, GitHub, or plain markdown. Relvy parses the structure and converts each step into a machine-executable action with a confidence threshold.

# Example: import from GitHub repo
relvy import --source github --repo your-org/runbooks --branch main
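
To make the parsing idea concrete, here is a minimal sketch of how numbered Markdown steps could be turned into structured records that each carry a confidence threshold. This is not Relvy's implementation: the `RunbookStep` class, the `parse_runbook` function, the 0.8 default threshold, and the sample runbook are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class RunbookStep:
    number: int
    text: str
    min_confidence: float  # hypothetical: below this, a human would be asked


# Matches lines such as "1. Do something" or "2) Do something".
STEP_PATTERN = re.compile(r"^\s*(\d+)[.)]\s+(.*\S)\s*$")


def parse_runbook(markdown: str, default_threshold: float = 0.8) -> list[RunbookStep]:
    """Collect numbered steps from a Markdown runbook, ignoring other lines."""
    steps = []
    for line in markdown.splitlines():
        match = STEP_PATTERN.match(line)
        if match:
            steps.append(
                RunbookStep(int(match.group(1)), match.group(2), default_threshold)
            )
    return steps


# Made-up runbook used only to exercise the parser.
EXAMPLE_RUNBOOK = """\
# High API latency

1. Open the API latency dashboard
2. Fetch logs for the api pods
3. Restart the api service if error rate stays high
"""

steps = parse_runbook(EXAMPLE_RUNBOOK)
for step in steps:
    print(step.number, step.text, step.min_confidence)
```

A runbook with clear numbered steps parses cleanly under this approach, which is one reason structured documentation matters before importing.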

Step 3: Define Action Permissions

Set which actions Relvy is allowed to take autonomously vs. which require human approval. For example:

  • Auto: Open Grafana dashboard, fetch pod logs, restart a known-failing service
  • Approval required: Scale down a deployment, change DNS records, modify load balancer config
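
The split above can be thought of as a small policy table. The sketch below is an assumption about how such a table might be modeled, not Relvy's configuration format; the action names and the `approval_for` helper are hypothetical and simply mirror the examples in the list.

```python
from enum import Enum


class Approval(Enum):
    AUTO = "auto"
    HUMAN = "approval_required"


# Hypothetical action names mirroring the examples above.
POLICY = {
    "open_dashboard": Approval.AUTO,
    "fetch_pod_logs": Approval.AUTO,
    "restart_service": Approval.AUTO,
    "scale_down_deployment": Approval.HUMAN,
    "change_dns_record": Approval.HUMAN,
    "modify_load_balancer": Approval.HUMAN,
}


def approval_for(action: str) -> Approval:
    """Return the approval mode for an action; unlisted actions need a human."""
    return POLICY.get(action, Approval.HUMAN)


print(approval_for("fetch_pod_logs"))     # Approval.AUTO
print(approval_for("change_dns_record"))  # Approval.HUMAN
print(approval_for("drop_database"))      # Approval.HUMAN (not listed)
```

Defaulting unlisted actions to human approval is a deliberate choice in this sketch: anything nobody thought to classify stays gated.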

Step 4: Configure Alert Triggers

Connect your monitoring tool to trigger Relvy runbooks automatically when a threshold is breached. A typical flow:

  1. Alert fires in PagerDuty → on-call engineer is paged
  2. Relvy detects the alert → starts executing the relevant runbook
  3. AI runs through diagnostic steps → posts updates to Slack in real time
  4. If automation succeeds, incident is resolved without engineer involvement
  5. If confidence is low, Relvy pauses and pages the engineer with context
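
The flow above can be sketched as a loop that runs steps in order and stops as soon as confidence drops below a threshold. Everything here is illustrative: `Step`, `IncidentLog`, `execute_runbook`, the 0.8 threshold, and the confidence values are assumptions, and the `updates` list stands in for the Slack messages a real integration would post.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[], float]  # performs the step, returns a confidence in [0, 1]


@dataclass
class IncidentLog:
    updates: list[str] = field(default_factory=list)  # stand-in for chat updates
    status: str = "running"


def execute_runbook(steps: list[Step], threshold: float = 0.8) -> IncidentLog:
    """Run steps in order; pause and escalate on the first low-confidence step."""
    log = IncidentLog()
    for step in steps:
        confidence = step.run()
        log.updates.append(f"{step.name}: confidence {confidence:.2f}")
        if confidence < threshold:
            log.status = "escalated"
            log.updates.append(f"paused after '{step.name}', paging engineer with context")
            return log
    log.status = "resolved"
    return log


# Made-up steps with fixed confidence values, for illustration only.
resolved = execute_runbook([
    Step("check dashboard", lambda: 0.95),
    Step("restart service", lambda: 0.90),
])
escalated = execute_runbook([
    Step("check dashboard", lambda: 0.95),
    Step("inspect logs", lambda: 0.40),
    Step("restart service", lambda: 0.90),
])
print(resolved.status)   # resolved
print(escalated.status)  # escalated
```

In the escalated run the restart step never executes, which is the behavior you want from any pause-and-page design: later actions wait for a human.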

Deeper Analysis

Where it excels:

  • High-pressure incidents where engineers waste time reading through long runbooks
  • Repetitive on-call patterns where the same 10-step diagnostic always runs
  • Reducing MTTR by cutting the time between alert and first meaningful action
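
To judge whether automation is actually moving MTTR, it helps to compute a baseline from your own incident history first. The sketch below takes MTTR as the mean of resolve time minus open time; the `mean_time_to_resolve` helper is hypothetical and the timestamps are invented purely to show the arithmetic.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - opened) across incidents."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Invented (opened, resolved) pairs, for illustration only.
incidents = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 9, 40)),    # 40 min
    (datetime(2025, 1, 9, 14, 0), datetime(2025, 1, 9, 14, 20)),  # 20 min
    (datetime(2025, 1, 12, 2, 0), datetime(2025, 1, 12, 3, 0)),   # 60 min
]

mttr = mean_time_to_resolve(incidents)
print(mttr)  # 0:40:00
```

Running the same calculation before and after a trial period gives a like-for-like comparison, provided the incident mix is similar.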

Where it struggles:

  • Highly environment-specific runbooks that need judgment calls on every step
  • Security-sensitive actions that can’t be delegated to AI without extensive guardrails
  • Teams without mature enough runbook documentation to import

Pricing: Free tier available. See relvy.ai for paid tiers.

Practical Evaluation Checklist

  • Does your team have documented runbooks for top incident types?
  • Are your runbooks structured enough to be parsed (markdown headers, numbered steps)?
  • Do you have clear human/AI permission boundaries for production actions?
  • Is your monitoring stack supported (Datadog, Prometheus, PagerDuty)?
  • Would auto-execution actually save time vs. manual steps?

Security Notes

  • Action permissions are fully configurable — you control what AI can do without approval
  • Runbook access can be scoped to specific teams
  • Audit log of all AI-executed actions is available
  • Self-hosted deployment option for air-gapped environments (check with Relvy team)

FAQ

Q: Does Relvy replace the on-call engineer?
A: No. Relvy automates routine diagnostic and remediation steps, but pauses and escalates when confidence is low. The engineer is still in the loop for judgment calls.

Q: What happens if the AI takes a wrong action?
A: Relvy maintains a human-in-the-loop model. Each action class has a confidence threshold; below it, the AI pauses and requests approval before proceeding.

Q: How does it connect to existing monitoring?
A: Relvy has native integrations with Datadog, Grafana, Prometheus, PagerDuty, and OpsGenie. You can also use webhooks for custom setups.

Q: Is there a self-hosted option?
A: Contact the Relvy team for enterprise/air-gapped deployments. Standard offering is SaaS.

Conclusion

Relvy targets a real pain point: on-call engineers spending critical incident minutes manually following runbooks that a machine could execute. If your team has well-documented runbooks and wants to cut MTTR without hiring more engineers, it’s worth evaluating. The YC F24 backing signals the team is serious about building in the incident response space.

Site: https://relvy.ai