Paper Pilot - Local-First AI Research Assistant

TL;DR

TL;DR: Paper Pilot is an open-source desktop research workflow app for scientists who want local control. It pulls papers from multiple academic sources, downloads open-access PDFs, converts them into AI-ready text, indexes them with SQLite full-text and vector search, and lets a local or OpenAI-compatible model answer questions with citations grounded in that corpus.

Source and Accuracy Notes

Official repo: Xueyang-Song/paper-pilot
This article is based on the public README, project structure, requirements, source inventory, and documented verification commands in the official repository as of June 4, 2026.

What Is Paper Pilot?

Paper Pilot is built for one very specific research pain: literature review gets messy fast. A scientist starts with one question, then opens Crossref, arXiv, PubMed, Semantic Scholar, PDF tabs, notes, highlights, and a separate AI chat. Very quickly the workflow becomes fragmented, and the AI summary layer is detached from the actual paper corpus.

Paper Pilot’s answer is local-first consolidation.

The repo describes a pipeline like this:

Ask research question
  -> crawl 8+ paper sources
  -> download and convert papers
  -> index corpus with FTS5 + vector search
  -> synthesize with AI grounded in local papers

That is an attractive shape for academic work because it treats retrieval and corpus management as first-class parts of the workflow, not an afterthought bolted onto a chat window.

The documented source list includes OpenAlex, Crossref, Semantic Scholar, PubMed/PMC, arXiv, Europe PMC, CORE, Unpaywall, and experimental Google Scholar support. On top of that, Paper Pilot can run with local Ollama or OpenAI-compatible providers.

Repo-Specific Setup Workflow

Step 1: Satisfy local prerequisites

The README lists three inputs up front:

Node.js >= 22.18.0
Python 3.11+
optional Ollama for fully local inference

The Python requirement matters because the project uses scripting tools in its PDF-to-text pipeline, including MarkItDown and isolated Python tooling.

Step 2: Clone and run desktop app

The official dev bootstrap is short and clear:

git clone https://github.com/Xueyang-Song/paper-pilot.git
cd paper-pilot
npm install
npm run dev

For production-style local builds, repo also documents:

npm run build
npm run package

The README includes one operational note that is easy to miss: development must use http://127.0.0.1:5173. If that port is busy, free it before starting the app.

Step 3: Configure model path

Paper Pilot is flexible on model backend. The repo explicitly supports:

local Ollama for offline operation
OpenAI-compatible APIs
Vercel AI Gateway support

That matters because researchers split into two camps:

privacy-sensitive users who want local models and local notes only
users who want stronger hosted models but still want local corpus control

Paper Pilot accommodates both without changing storage architecture.

Step 4: Search across academic sources

The app’s documented paper sources are a large part of its value. Instead of pushing user toward one proprietary index, it federates across multiple public or semi-public scholarly sources.

Practical source expectations from README:

OpenAlex / Crossref           -> multidisciplinary metadata
Semantic Scholar             -> strong CS + science coverage
PubMed / PMC / Europe PMC    -> biomedical and life sciences
arXiv                        -> CS, physics, math preprints
CORE / Unpaywall             -> open-access discovery and PDFs
Google Scholar               -> experimental

The repo also labels key requirements honestly. CORE requires an API key. OpenAlex and Crossref optionally benefit from email identification. Google Scholar support is marked experimental.

Step 5: Let local knowledge pipeline build your corpus

One of the strongest details in the README is that Paper Pilot is not only a search front end. It stores and indexes papers locally using:

SQLite via node:sqlite
FTS5 full-text search
sqlite-vec vector similarity search

For open-access PDFs, the documented flow is auto-fetch via Unpaywall, conversion via MarkItDown, then indexing into the project’s local store.

That architecture is more durable than one-shot prompting. You are building a reusable research corpus, not running disposable chats.

Step 6: Use built-in verification scripts

The repo includes unusually good self-check commands for an early project:

npm run verify
npm run test
npm run typecheck
npm run test:crawlers:api
npm run test:crawlers:browser

That gives technical users a reasonable way to validate crawler health, browser-based crawling behavior, build integrity, and strict TypeScript checks before trusting the app on live research work.

Deeper Analysis

Local-first is real here, not branding

Plenty of tools say local-first when they mean “desktop shell around remote API.” Paper Pilot’s repo evidence is stronger than that. The architecture explicitly centers local storage, local indexing, local project separation, local conversation history, and optional local inference.

That matters for research in at least three ways:

unpublished notes stay on machine
proprietary PDF collections stay local
project boundaries are easier to maintain

For labs, healthcare research, industry R&D, or any pre-publication environment, this is a meaningful product position.

Source breadth is good, but quality control matters

Querying many academic sources is attractive, but multi-source crawling can also produce duplicates, inconsistent metadata, and mixed relevance. Paper Pilot’s design partly offsets that by combining metadata retrieval with local indexing and a project-scoped corpus.

In other words, the product is not trying to replace scholarly judgment. It is trying to lower the friction of building a working source base that your AI can reason over.

SQLite choice is pragmatic

Using node:sqlite, FTS5, and sqlite-vec is a sensible stack for desktop research software. It avoids shipping a heavy server dependency while still giving:

structured metadata storage
keyword search over papers
semantic retrieval over passages or documents

For a solo researcher or small lab machine, that is often right trade-off. It is inspectable, portable, and good enough without introducing infrastructure overhead.

Early status is explicit

README labels project as early v1. Stable areas include core crawlers, local database layer, and AI agent path. Experimental or less-proven areas include Google Scholar support plus macOS and Linux packaging.

That is useful honesty. If you need enterprise-grade cross-platform packaging now, this is early. If you care more about architecture direction and local research ergonomics, it is already interesting.

Practical Evaluation Checklist

Research fit:
  [ ] You regularly review papers across multiple academic indexes
  [ ] You want one local corpus per project or topic

Privacy fit:
  [ ] Papers, notes, or drafts should stay on-device
  [ ] You prefer local Ollama or controlled OpenAI-compatible endpoints

Technical fit:
  [ ] You are comfortable with Node + Python desktop tooling
  [ ] You can manage optional API keys for selected paper sources

Quality fit:
  [ ] You will still manually judge source quality and duplicates
  [ ] Experimental Google Scholar support is acceptable

Security Notes

The README explicitly positions Paper Pilot as local-first, with papers, notes, and AI conversations stored on your machine.
Secure credential storage is delegated to Electron safeStorage, which is a better default than plain-text app config.
Open-access PDF fetching and browser-based crawling still mean you are ingesting external content, so normal PDF and parser caution applies.
If you connect hosted OpenAI-compatible APIs instead of local Ollama, your prompts and extracted passages may leave the machine even though primary storage remains local.

FAQ

Q: Is Paper Pilot only for fully offline use?
A: No. It can run fully local with Ollama, but repo also supports OpenAI-compatible providers and Vercel AI Gateway.

Q: What makes it different from a generic RAG notebook?
A: The repo is specialized around academic sources, PDF acquisition, paper conversion, citation-grounded synthesis, and per-project corpora rather than a generic document bucket.

Q: Do I need API keys for every source?
A: No. Some sources require none, some list optional identifiers, and CORE is marked as required for its API path.

Q: Is Google Scholar support production-ready?
A: Not yet. The README marks it experimental.

Q: Can I trust the AI answers by default?
A: You should trust the workflow more than the summary layer. Paper Pilot is strongest when used as a grounded retrieval and corpus tool, with the researcher still validating actual cited papers.

Conclusion

Paper Pilot is a good example of AI tooling getting more useful by becoming narrower, not broader. It is not trying to be a general-purpose research chatbot. It is trying to make one demanding workflow work better: discover papers, build a corpus, index it locally, and ask grounded questions against it.

For researchers who care about local control, reproducible paper trails, and source-grounded synthesis, that is a strong direction.

Related reading: VibeSearchBench, AnySearch MCP Server, and Extract.

dev-tools

Automotive Skills Suite for AI Engineering

Evaluate Automotive Skills Suite for APQP, ASPICE, HARA, safety-plan, and DIA workflows with setup notes, governance risks, and SME review guidance.

5/28/2026

dev-tools

awesome-agentic-ai-zh Roadmap Guide

Explore awesome-agentic-ai-zh as a Chinese agentic AI learning roadmap, with setup notes, track selection, study workflow, and evaluation guidance.

5/28/2026

dev-tools

Baguette iOS Simulator Automation Guide

Set up Baguette for iOS Simulator automation, web dashboards, device farms, gesture input, streaming, and camera testing with Xcode caveats.

5/28/2026

TL;DR

Source and Accuracy Notes

What Is Paper Pilot?

Repo-Specific Setup Workflow

Step 1: Satisfy local prerequisites

Step 2: Clone and run desktop app

Step 3: Configure model path

Step 4: Search across academic sources

Step 5: Let local knowledge pipeline build your corpus

Step 6: Use built-in verification scripts

Deeper Analysis

Local-first is real here, not branding

Source breadth is good, but quality control matters

SQLite choice is pragmatic

Early status is explicit

Practical Evaluation Checklist

Security Notes

FAQ

Conclusion

Related Posts