Captain – Automated RAG for Files and Cloud Storage

TL;DR

Captain is a YC W26 startup that automates RAG pipeline construction for unstructured data in cloud storage (S3, GCS, Google Drive). Connect a bucket and query it in plain English within minutes; VLM, OCR, and computer vision pipelines process complex documents automatically.

Source and Accuracy Notes

Project page: runcaptain.com
Documentation: docs.runcaptain.com
Source repository: github.com/runcaptain/captain-mrag-bench (benchmark evaluation repo)
HN launch thread: news.ycombinator.com/item?id=47366011
Source last checked: 2026-06-15

What Is Captain?

Captain is a fully-managed RAG (Retrieval-Augmented Generation) API for unstructured data. Instead of building and maintaining your own vector database, embedding pipeline, and retrieval logic, you connect your cloud storage and Captain handles the rest.

From the official docs:

“Captain is the highest-accuracy AI search API for unstructured data. Just connect your cloud storage and ask away.”

The platform targets teams drowning in files — PDFs, spreadsheets, images, slides — scattered across S3 buckets, Google Cloud Storage, or Google Drive. Rather than manually chunking documents and tuning embeddings, Captain’s NLP ingestion engine processes files automatically and exposes a natural language search API.

Key Features

Cloud Storage Integration

Connect AWS S3, GCS, or Azure Blob storage. Captain indexes files with a single API call and keeps them synced. Multi-tenancy support lets you scope collections by team, folder, or project.

Multimodal Search

Captain doesn’t just handle text. The platform includes automatic VLM (Vision-Language Model), OCR, and Computer Vision processing pipelines. This means you can search across:

Text-heavy PDFs with complex layouts
Images with embedded text or visual content
Multi-faceted spreadsheets with mixed data types

According to the benchmark repo, Captain achieved 81.3% ContentHit@5 on MRAG-Bench (ICLR 2025), a vision-centric benchmark with 16,130 images and 1,251 human-annotated questions. In the High Precision scenario category, it reached 100%.

Hybrid Search Augmentations

The docs mention “novel hybrid search augmentations” that deliver high accuracy out-of-the-box. This likely combines dense vector embeddings with sparse retrieval (BM25-style) and recency boosting, though the exact architecture isn’t publicly detailed.
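Since the architecture isn't published, the following is only an illustration of what "hybrid" retrieval commonly means, not Captain's method. It is a minimal sketch of reciprocal rank fusion (RRF), a widely used way to merge a dense ranking and a sparse ranking into one list. The function name, the sample file names, and the constant k=60 are assumptions chosen for the example, and recency boosting is left out.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used constant for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for stable output
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for one query from two retrievers
dense_hits = ["pricing.pdf", "faq.md", "roadmap.pptx"]
sparse_hits = ["faq.md", "returns.pdf", "pricing.pdf"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists (here faq.md and pricing.pdf) rise to the top, which is the practical appeal of fusing rankings instead of raw scores: the two retrievers never need comparable score scales.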

Complex Document Handling

Large documents, tables in PDFs, and visual content are processed automatically. You don’t need to pre-chunk or manually extract text — Captain’s ingestion pipeline handles it.

Setup Workflow

Step 1: Create a Collection

Collections scope your data. Think of them as namespaces for different projects or teams.

curl -X POST https://api.runcaptain.com/v2/collections \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "X-Org-ID: $CAPTAIN_ORG_ID" \
  -H "Content-Type: application/json" \
  -d '{"name": "product-docs", "description": "Q2 product documentation"}'

Step 2: Connect Cloud Storage

Point Captain at your S3 bucket or GCS bucket. The platform indexes files asynchronously.

curl -X POST https://api.runcaptain.com/v2/collections/product-docs/index/s3 \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "X-Org-ID: $CAPTAIN_ORG_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "my-company-docs",
    "prefix": "product/q2/",
    "region": "us-east-1"
  }'

For Google Drive or other SaaS sources, the docs describe OAuth-based connectors (see Connect Cloud Storage).
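Because indexing runs asynchronously, a client usually has to wait before the first query returns useful results. This article does not cover a status endpoint, so the sketch below does not assume one: it polls a caller-supplied check_done function (hypothetical, to be wired to whatever status call the docs provide) with exponential backoff and a timeout.

```python
import time

def wait_until_done(check_done, timeout_s=600.0, initial_delay_s=1.0,
                    max_delay_s=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll check_done() with exponential backoff until it returns True.

    check_done is a caller-supplied function (hypothetical here) that asks
    the service whether indexing has finished. Returns True on completion
    and False if timeout_s elapses first. sleep and clock are injectable
    so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        if check_done():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

Capping the delay at max_delay_s keeps long indexing jobs from being polled either too aggressively or too rarely.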

Step 3: Query in Natural Language

Once indexed, ask questions in plain English:

curl -X POST https://api.runcaptain.com/v2/collections/product-docs/query \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "X-Org-ID: $CAPTAIN_ORG_ID" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the return policy for enterprise customers?"}'

The API returns relevant passages with source file citations.
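For application code, the same query can be built in Python. This is a sketch using only the standard library; it mirrors the URL, headers, and body of the curl example above and makes no assumptions about the response schema, which isn't shown here. The helper name build_query_request and the fallback test credentials are illustrative.

```python
import json
import os
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.runcaptain.com/v2"  # base URL taken from the curl examples

def build_query_request(collection, query, api_key, org_id):
    """Build (but do not send) the POST request for a collection query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/collections/{quote(collection, safe='')}/query",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Org-ID": org_id,
            "Content-Type": "application/json",
        },
    )

req = build_query_request(
    "product-docs",
    "What is the return policy for enterprise customers?",
    api_key=os.environ.get("CAPTAIN_API_KEY", "test-key"),
    org_id=os.environ.get("CAPTAIN_ORG_ID", "test-org"),
)
# To send it: urllib.request.urlopen(req, timeout=30), then decode the JSON body.
```

Separating request construction from sending makes the payload easy to inspect or unit test without network access.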

Step 4: Filter and Refine

Metadata filtering lets you narrow results by file type, date range, or custom tags:

curl -X POST https://api.runcaptain.com/v2/collections/product-docs/query \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "X-Org-ID: $CAPTAIN_ORG_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "pricing changes",
    "filters": {
      "file_type": "pdf",
      "date_range": {"start": "2026-01-01", "end": "2026-06-15"}
    }
  }'
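When filters are assembled from user input, it can help to validate them before sending. The sketch below builds the same JSON body as the curl example above. The field names come from that example; the validation rules (both dates required together, start not after end) are assumptions of this sketch, not documented API behavior.

```python
import json
from datetime import date

def build_filtered_query(query, file_type=None, start=None, end=None):
    """Return the JSON body for a filtered query, validating the date range."""
    filters = {}
    if file_type is not None:
        filters["file_type"] = file_type
    if (start is None) != (end is None):
        raise ValueError("start and end must be given together")
    if start is not None:
        if start > end:
            raise ValueError("start must not be after end")
        filters["date_range"] = {"start": start.isoformat(), "end": end.isoformat()}
    payload = {"query": query}
    if filters:
        # Omit the key entirely when no filter was requested
        payload["filters"] = filters
    return json.dumps(payload)

body = build_filtered_query("pricing changes", "pdf", date(2026, 1, 1), date(2026, 6, 15))
print(body)
```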

Deeper Analysis

Accuracy Claims vs. Benchmarks

Captain claims “highest-accuracy” search. The self-reported MRAG-Bench results are strong for multimodal retrieval, although the figures below include no competitor baselines:

Overall ContentHit@5: 81.3% (at least one retrieved image’s VLM description contains the correct answer)
MRR@5: 0.807 (mean reciprocal rank of first correct hit)
Precision@5: 74.2% (fraction of retrieved images with matching descriptions)

For text-heavy scenarios (High Precision, Scope, Incomplete), accuracy exceeds 94%. For visual transformation scenarios (Deformation, Temporal), accuracy drops to 36-57%, which is expected — these are edge cases where visual context matters more than text extraction.
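To run the same kind of measurement on your own documents, the three metrics can be computed from per-query match flags. This is a minimal sketch using the standard definitions that match the descriptions above; the benchmark repo's exact scoring code may differ, and the sample data is made up.

```python
def content_hit_at_k(hits, k=5):
    """1.0 if any of the first k results is a match, else 0.0."""
    return 1.0 if any(hits[:k]) else 0.0

def reciprocal_rank_at_k(hits, k=5):
    """1 / rank of the first match within the top k, or 0.0 if none."""
    for rank, is_hit in enumerate(hits[:k], start=1):
        if is_hit:
            return 1.0 / rank
    return 0.0

def precision_at_k(hits, k=5):
    """Fraction of the top-k results that are matches."""
    top = hits[:k]
    return sum(top) / len(top) if top else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Hypothetical match flags: one list per query, one flag per retrieved result
runs = [
    [True, False, True, False, False],
    [False, True, False, False, False],
    [False, False, False, False, False],
    [True, True, True, True, False],
]

content_hit = mean(content_hit_at_k(r) for r in runs)
mrr = mean(reciprocal_rank_at_k(r) for r in runs)
precision = mean(precision_at_k(r) for r in runs)
print(f"ContentHit@5={content_hit:.3f} MRR@5={mrr:.3f} Precision@5={precision:.3f}")
```

ContentHit only asks whether any match appears, MRR rewards putting the first match high, and Precision penalizes irrelevant results in the top k, so the three numbers can diverge considerably on the same run.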

Pricing Model

Captain uses a credit-based system:

Free Tier ($0/mo):

500 indexing credits/month
1,000 queries/month
$0.02 per overage credit
$0.005 per query after included quota

Startup ($1,600/mo):

83,000 credits/month
Unlimited queries
$0.015 per overage credit
Production-ready infrastructure
Slack support

Enterprise: Custom pricing for compliance, dedicated support, and custom SLAs.

Text files (.txt, .md, code) cost 1 credit. Complex files (PDFs with tables/images, powered by VLM and CV models) cost more — the exact multiplier isn’t public, but the pricing page shows SVG diagrams suggesting 10-50x multipliers for heavy processing.
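The tier arithmetic is easy to model before committing to a plan. The sketch below encodes the two published tiers as listed above. The Plan class and function names are illustrative, and the per-file credit cost passed to files_within_quota is whatever you measure or assume, since the real multipliers aren't public.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Plan:
    base_usd: float
    included_credits: int
    overage_per_credit_usd: float
    included_queries: Optional[int]  # None means unlimited queries
    overage_per_query_usd: float = 0.0

# Figures as listed in this article
FREE = Plan(0.0, 500, 0.02, 1000, 0.005)
STARTUP = Plan(1600.0, 83_000, 0.015, None)

def monthly_cost(plan, credits_used, queries):
    """Base price plus overage for credits and, where metered, queries."""
    extra_credits = max(0, credits_used - plan.included_credits)
    cost = plan.base_usd + extra_credits * plan.overage_per_credit_usd
    if plan.included_queries is not None:
        extra_queries = max(0, queries - plan.included_queries)
        cost += extra_queries * plan.overage_per_query_usd
    return round(cost, 2)

def files_within_quota(plan, credits_per_file):
    """How many files fit in the included credits at a given per-file cost."""
    return plan.included_credits // credits_per_file

print(monthly_cost(FREE, credits_used=2000, queries=3000))
print(monthly_cost(STARTUP, credits_used=100_000, queries=1_000_000))
print(files_within_quota(STARTUP, credits_per_file=50))
```

At an assumed 50 credits per file, the Startup quota covers 1,660 files a month, so the per-file multiplier matters far more to the bill than the headline credit count.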

Competitive Position

Captain competes with:

LlamaIndex / LangChain — open-source frameworks, but you build and maintain the pipeline
Pinecone / Weaviate — vector databases, but you handle ingestion and chunking
Unstructured.io — document parsing, but not a complete RAG API

Captain’s differentiator is the fully-managed aspect: connect storage, get an API. No infrastructure to operate.

Practical Evaluation Checklist

Before adopting Captain, verify:

[ ] Data sensitivity — Does your compliance framework allow third-party indexing of files? Captain states that it is SOC 2 compliant, but review their data handling policies.
[ ] File types — Test with your actual documents. PDFs with complex tables, scanned images, and multi-sheet Excel files may have variable accuracy.
[ ] Query latency — The docs don’t publish P50/P95 latency. Test with your expected query volume.
[ ] Sync frequency — How often does Captain re-index your bucket? Real-time sync isn’t mentioned — assume hourly or daily batch updates.
[ ] Cost at scale — 83,000 credits sounds like a lot, but if your PDFs cost 50 credits each, that’s 1,660 files/month. Model your actual usage.
[ ] Multi-tenancy isolation — If you’re building a customer-facing product, verify collection isolation meets your security requirements.
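For the query latency item, a small harness is enough to get P50 and P95 figures from your own traffic. The sketch below times a caller-supplied run_query function (hypothetical; wire it to your actual API call) and reports nearest-rank percentiles, which need no interpolation and always return a value that was actually observed.

```python
import math
import time

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct percent
    of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def time_calls(run_query, queries):
    """Time run_query(q) for each query; returns latencies in milliseconds."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

# Made-up latency samples in milliseconds, including one slow outlier
samples_ms = [120, 95, 310, 150, 101, 98, 2050, 140, 133, 110]
print("P50:", percentile(samples_ms, 50), "ms")
print("P95:", percentile(samples_ms, 95), "ms")
```

With only ten samples a single outlier becomes the P95, which is a reminder to collect a few hundred queries before drawing conclusions about tail latency.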

Security Notes

SOC 2 compliant — Captain claims SOC 2 compliance. Request the report by email or through the sales page.
Data encryption — The docs mention “secure data handling” but don’t specify encryption-at-rest or in-transit details. Assume TLS in transit; verify encryption-at-rest with support.
Access control — API keys are org-scoped. Use separate collections for different security domains rather than relying on collection-level ACLs (not documented).
Data retention — The FAQ mentions data retention policies, but the exact duration isn’t public. Clarify with support if you have GDPR/CCPA requirements.
Regional hosting — No mention of regional data residency. If you need EU-only storage, contact sales.

FAQ

Q: How long does indexing take? A: The HN post mentions “RAG part took Captain about 3 minutes to set up” for Paul Graham’s essays (text-only). For large S3 buckets with complex PDFs, expect hours to days depending on file count and complexity. The platform processes files asynchronously.

Q: Can I self-host Captain? A: No. Captain is a fully-managed SaaS API. There’s no on-premise or VPC deployment option mentioned in the docs.

Q: What embedding models does Captain use? A: The docs mention “Multiple Embedding Models” as a feature but don’t specify which ones. The benchmark repo doesn’t disclose the model names. Contact support for details.

Q: Does Captain support real-time streaming updates? A: Not documented. The platform appears to use batch indexing with periodic syncs. For real-time use cases, you’d need to trigger re-indexing via the API after file updates.

Q: How does Captain handle duplicate files? A: Not explicitly documented. Assume deduplication by file path + content hash within a collection. Test with duplicate files to verify behavior.
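To test duplicate handling, it helps to know up front which files in your test set are byte-identical. The sketch below groups paths by SHA-256 content hash. It only prepares the test data; whether Captain deduplicates this way is the assumption stated above, not documented behavior, and the sample paths are made up.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """Group file paths by the SHA-256 of their content.

    files maps path -> bytes. Returns the groups that contain more than
    one path, each group sorted, and the groups sorted among themselves.
    """
    by_hash = defaultdict(list)
    for path, content in files.items():
        by_hash[hashlib.sha256(content).hexdigest()].append(path)
    return sorted(sorted(paths) for paths in by_hash.values() if len(paths) > 1)

# Hypothetical test set: two identical files under different paths
test_files = {
    "a/report.pdf": b"same bytes",
    "b/report-copy.pdf": b"same bytes",
    "c/other.pdf": b"different bytes",
}
print(find_duplicates(test_files))
```

After indexing such a set, a query that matches the duplicated content shows whether both paths come back as separate results or as one.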

Q: Can I export my indexed data? A: The docs don’t mention a data export feature. You retain ownership of source files in your cloud storage, but the embeddings and metadata live in Captain’s platform. Clarify with support if you need portability.

Conclusion

Captain fills a real gap: teams with terabytes of unstructured data in cloud storage need search, but building a RAG pipeline is a multi-week project involving vector databases, embedding models, chunking strategies, and retrieval tuning. Captain abstracts all of that into a connect-and-query API.

The multimodal capabilities (VLM + OCR + CV) are the standout feature. Most RAG tools handle text well but struggle with images, tables, and complex PDFs. Captain’s 81.3% ContentHit@5 on MRAG-Bench suggests it’s solving the hard cases, not just the easy text-only ones.

The pricing is steep for the Startup tier ($1,600/mo), but the free tier (500 credits, 1,000 queries) is generous enough for evaluation. If you’re drowning in files and need search yesterday, Captain is worth a 30-minute proof-of-concept.

Next steps:

Sign up at runcaptain.com and grab an API key
Connect a test S3 bucket with 10-20 representative files
Run 5-10 queries that cover your actual use cases
Compare accuracy and latency against your current solution (or manual search)
If it works, estimate monthly credit usage and pick a plan
