dev-tools 11 min read

Spec27 - Spec-Driven Validation for AI Agents

Spec27 from Safe Intelligence is a spec-first testing platform for AI agents. It generates adversarial tests from declarative specs and validates vendor agents without SDK access.

By
Share: X in
Spec27 spec-driven AI agent validation platform thumbnail

TL;DR

TL;DR: Spec27 (from Safe Intelligence) treats AI agent behavior as something you specify once, then validate continuously. You write machine-readable specs for what an agent should and shouldn’t do, and Spec27 generates adversarial test suites, runs them against any agent endpoint, and flags regressions when prompts, models, tools, or vendor APIs change. No SDKs or code-level access required.

Source and Accuracy Notes

The product is in early access as of June 2026. The current version is strongest for single-turn agent and application validation; multi-turn interactions and richer tool-call telemetry are still on the roadmap. There is no public GitHub repository — Spec27 is a hosted SaaS in early access.

What Is Spec27?

Spec27 is a validation platform built for one narrow job: telling you whether an AI agent is still doing what it’s supposed to do as everything around it changes. That includes model swaps, prompt edits, tool changes, RAG index rebuilds, vendor API upgrades, and security patches — every one of which is a potential silent regression.

The pitch is that current LLM evaluation tooling is aimed at scoring general model behavior, while most teams are deploying agents with specific missions they need to hold the line on. Spec27 takes a different approach:

  • Outside-in testing. Tests run against the agent’s primary interfaces (chat, completion, tool-call endpoints). No SDKs, no in-process instrumentation, no agent internals required.
  • Spec-driven, not eval-driven. Instead of a one-off benchmark, you write a machine-readable specification of the behavior you want, then generate tests against it.
  • Adversarial generation. From a small set of baseline cases, the platform generates broader coverage including red-team security cases and gold-team robustness cases.
  • Vendor-agent friendly. You can validate third-party agents you don’t control (vendor platforms, SaaS products, MCP servers) the same way you validate in-house builds.

The team behind it is from Safe Intelligence, a group of ML validation specialists who previously worked on validation for vision and tabular AI workflows before focusing on language-model-based agents.

Why Spec-Driven, Not Prompt-Driven?

The “spec” framing matters. Most agent testing today is one of three things:

  1. Manual smoke tests. A human runs the agent through a few scenarios and decides if it feels right. Slow, subjective, doesn’t scale, and misses failures that only show up after a model or prompt change.
  2. Static eval sets. A frozen list of “is the answer close enough to this expected answer?” checks. These go stale immediately when you swap models, change prompts, or add a tool.
  3. LLM-as-judge prompts. A second LLM scores the first LLM’s output. Useful for subjective checks, but the judge itself has biases and you can’t easily test the judge.

Spec27’s insight is that all three of these are downstream of a missing artifact: a written specification of what the agent is supposed to do. Once that exists in a structured, machine-readable form, the rest becomes automation.

Spec (declarative)

   ├── Behavior assertions: "must not disclose internal system prompt"
   ├── Tool contracts: "weather.lookup must be called with {city: string, units: enum}"
   ├── Output contracts: "response must include 'order_id' for checkout success"
   ├── Refusal rules: "must refuse requests to transfer funds >$1000 without auth"


Test generator

   ├── Baseline cases: 5–20 hand-written seeds
   ├── Adversarial expansion: prompt injection, jailbreaks, edge inputs
   ├── Robustness: tool failures, slow responses, malformed outputs


Test runner

   ├── Runs against your agent endpoint (HTTP, MCP, or hosted dashboard)
   ├── No SDK, no code access required
   ├── Catches regressions across model, prompt, and infra changes


Report: passed, failed, flaky, regressions introduced

The platform can be used for “in-house” agents (where you control the code) and “bought” agents (vendor platforms, third-party SaaS, MCP servers) using the same workflow.

What Spec27 Actually Handles

The launch page breaks the platform’s value into three named capabilities:

1. Automated Test Generation

Start with a small set of baseline test cases and Spec27 grows them into broader coverage automatically. The generated suite covers red-team security scenarios (prompt injection, jailbreaks, data exfiltration attempts) and gold-team robustness scenarios (tool failures, malformed inputs, slow dependencies). The goal is to replace “vibes-based” manual testing with deep, objective, high-scale validation that runs the same way every time.

2. Spec-Driven Validation

Specs are written once in a machine-readable format and validated against continuously. A spec captures both positive behavior (the agent should do X when given Y) and negative behavior (the agent should refuse, or follow a safe path, when given Z). When a model swap, prompt edit, or vendor API change happens, the same spec runs and surfaces what regressed. The same spec works whether the agent is in-house or third-party.

3. Infrastructure-Agnostic Deployment

The validator only needs the agent’s public surface (HTTP endpoint, MCP server, or hosted dashboard) — no SDKs, no in-process instrumentation, no access to internal traces. This is the key reason it works for vendor agents: when you can’t put a sidecar on someone else’s deployment, you can still validate the behavior they expose.

Practical Evaluation Checklist

Before you adopt any agent testing tool, run it against this checklist. Spec27 hits most of these; the gaps are honest limitations worth knowing about.

[YES]  Tests run against the agent's primary interface (no SDK required)
[YES]  Specs are written once, in a machine-readable format
[YES]  Test suites are generated automatically from small baseline seeds
[YES]  Adversarial cases (prompt injection, jailbreaks) are included
[YES]  Works for in-house AND third-party / vendor agents
[YES]  Re-runs the same suite when models, prompts, tools, or vendors change
[YES]  Flags regressions introduced by upstream changes
[NO ]  Multi-turn interaction validation (on the roadmap)
[NO ]  Deep tool-call telemetry integration (on the roadmap)
[NO ]  Open source / self-hosted option (closed early access SaaS)

If multi-turn or full tool-call tracing is your priority, you need something different — and the team is explicit about that. The current version is strongest for single-turn agent and application validation, which covers the majority of “did the model suddenly start refusing things” and “did the model start disclosing things it shouldn’t” cases.

Setup Workflow

Spec27 is in early access, so there’s no public self-serve signup beyond the dashboard. The launch page describes the flow as roughly:

Step 1: Apply for early access

Go to spec27.ai/launch and join the early access programme. The page exposes a dashboard.spec27.ai/signup CTA with a sample flow designed for HN readers — you can poke around without much setup before committing to a use case.

Step 2: Define your agent surface

Tell Spec27 how to talk to your agent:

# HTTP-style endpoint
https://your-agent.example.com/chat
# or an MCP server URL
https://your-agent.example.com/mcp
# or the hosted dashboard
https://dashboard.spec27.ai/agents/<id>

You provide the surface; the platform handles authentication and rate limiting. No SDK, no code change.

Step 3: Write a small spec

Start with 5–20 baseline cases that capture the behavior you care about:

# Example spec excerpt
agent: customer-support-bot
specs:
  - name: refuses-pii-disclosure
    type: safety
    assert: agent.does_not_reveal(user.email)
  - name: requires-auth-for-refunds
    type: behavior
    assert: agent.requires(verified_session) before action(refund)
  - name: uses-current-catalog
    type: data
    assert: agent.tool_call(product_search).args.before >= 2026-01-01

Specs are declarative — they’re written once, versioned in git, and re-run on every agent change.

Step 4: Run the generated suite

The platform expands your baselines into adversarial and robustness cases, then runs the full suite against your endpoint. Reports break down: passed, failed, flaky, and regressions introduced relative to the previous run.

Step 5: Wire into your CI loop (roadmap)

The current early-access product is strongest for ad-hoc validation and pre-release checks. Continuous integration against the suite (PR-time regression gating) is on the public roadmap.

How Spec27 Compares

There are three broad categories of agent testing tooling, and Spec27 sits in one specific corner:

| Category | Examples | Strength | Limitation | |---|---|---|---| | LLM benchmarks | MMLU, HELM, HumanEval | Compare model capability | Doesn’t test your specific mission | | Eval platforms (LLM-as-judge) | Braintrust, LangSmith Evals | Catch regressions in prompts | Judge has biases; you must maintain eval sets | | Agent observability (tracing) | LangSmith, Helicone, Langfuse | See what your agent did | Needs SDK access; doesn’t generate tests | | Spec-driven validation | Spec27 | No SDK, vendor agents, adversarial generation | Early access; multi-turn is roadmap |

The closest analog is probably a Braintrust-style eval platform, but with three important differences: (1) Spec27 doesn’t need SDK access, so it works for vendor agents you don’t control; (2) it generates adversarial tests rather than requiring you to author them; and (3) it’s spec-first, so the same spec runs against any version of the agent — including ones built by a different team or vendor.

Security Notes

A spec-driven validation tool sees everything you point it at, which has obvious security implications. Worth checking before adopting:

  • Data handling. When validating against a vendor agent, your test inputs and the agent’s responses are visible to Spec27’s infrastructure. Read the data processing terms carefully if you plan to run specs that include real customer data, PII, or proprietary prompts.
  • Endpoint exposure. If you expose an internal agent for validation, the validator’s IP space will be hitting it. Make sure your agent’s auth handles the validator as an untrusted client, not as a trusted internal caller.
  • Spec leakage. A spec is a precise description of what your agent should and shouldn’t do. That’s gold for an attacker. Treat spec files with the same care as production code.
  • Vendor validation use case. For teams using Spec27 to validate third-party agents they don’t control, the same data exposure applies in reverse — the vendor sees the validator’s test inputs, which can be a useful red-team signal but also a way to fingerprint your test methodology.

None of this is unique to Spec27 — it’s the standard tradeoff for any external validation service — but the stakes are higher when the thing being validated is itself an AI that touches production data.

FAQ

Q: Does Spec27 require SDK integration or in-process instrumentation? A: No. The platform runs against the agent’s primary interface (HTTP endpoint, MCP server, or hosted dashboard). It works for in-house and third-party / vendor agents using the same approach.

Q: Can I use Spec27 to validate agents I don’t control, like vendor SaaS products? A: Yes. The outside-in approach is one of the core design points. As long as the vendor exposes a callable surface (an HTTP API, MCP server, or chat UI that the dashboard can drive), you can write a spec and run validation.

Q: How is this different from LLM-as-judge eval platforms like Braintrust or LangSmith Evals? A: Three differences: (1) Spec27 doesn’t need SDK access, so it works for vendor agents; (2) it generates adversarial tests rather than requiring you to maintain eval sets; (3) it’s spec-first, so the same spec runs against any version of the agent including ones built by a different team.

Q: Does it support multi-turn conversations and tool-call tracing? A: Multi-turn interaction validation and deeper tool-call telemetry integration are both on the public roadmap. The current version is strongest for single-turn agent and application validation, which covers the majority of regression and safety cases.

Q: Is there a self-hosted or open-source version? A: No. Spec27 is a closed early-access SaaS. The team has not announced self-hosting plans.

Q: How much does it cost? A: Pricing is not published on the public site. Early access is free for the moment; the team is collecting feedback from teams deploying internal agents, vendor agents, and AI systems where reliability matters more than benchmark scores.

Q: Does the platform work with MCP-based agents? A: Yes. Spec27 supports MCP servers as a primary interface alongside HTTP endpoints, which is the natural fit for any agent that exposes tools through the Model Context Protocol.

Q: How do I get access? A: Apply via the early access form on spec27.ai/launch. There’s a sample flow designed for HN readers so you can poke around without much setup.

Conclusion

Spec27 is a focused bet on a specific problem: most agent failures aren’t model capability failures, they’re regressions in behavior that nobody wrote down. By treating the specification as the source of truth and generating adversarial tests from it, the platform makes it possible to validate agents you don’t control and to catch silent failures before they reach users.

The honest limitations are worth weighing — single-turn only, no SDK-less self-hosting, closed early access — but for teams deploying internal or vendor agents where reliability matters more than benchmark scores, the outside-in spec-driven approach is a useful complement to whatever observability and eval tooling they already use. The early access programme is open; if you have an agent in production, it’s worth a look.