TokenSpeed Preview and Benchmark Guide


TL;DR

TokenSpeed is an early, GPU-heavy LLM inference engine for agentic workloads, with a documented Docker runner setup, editable Python packages, OpenAI-compatible serving, and an explicit warning against production use of the current preview.

Source and Accuracy Notes

This article uses public material from lightseekorg/tokenspeed and the TokenSpeed docs: the docs index, Getting Started, Launching a Server, Server Parameters, Compatible Parameters, Model Recipes, and Parallelism. Commands below are preserved from the official project docs.

TokenSpeed labels the current release as a preview for reproducing the Kimi K2.5 on B200 and MLA on B200 results. The project notes major work still in progress: Qwen 3.6, DeepSeek V4, MiniMax M2.7, PD, EPLB, KV store, Mamba cache, VLM, metrics, Hopper optimization, MI350 optimization, and other runtime work. It explicitly says not to use the preview release for production deployments.

What Is TokenSpeed?

TokenSpeed is a Python-facing LLM inference engine built by LightSeek for high-throughput agentic workloads. It positions itself between TensorRT-LLM-style performance and vLLM-style usability. That framing matters: agentic workloads are not only single long completions. They often involve many short and medium requests, tool-call formats, reasoning traces, cache pressure, and latency-sensitive control loops.

The project is organized around four core areas. Modeling uses a local-SPMD design, with static compiler support for generating collective communication from placement annotations at module boundaries. The scheduler architecture splits a C++ control plane from a Python execution plane, while request lifecycle and KV cache ownership are expressed as finite-state logic. Kernels are pluggable through a layered registry system, which includes Multi-head Latent Attention work for Blackwell. The entrypoint exposes an SMG-integrated AsyncLLM and an OpenAI-compatible HTTP server through tokenspeed serve.

In practical terms, this is not a laptop inference toy. It expects NVIDIA GPU infrastructure, Docker with GPU support, large shared memory, and access to model checkpoints. Current showcase points include Kimi K2.5 on B200 and reported TokenSpeed performance results for Qwen3.5-397B-A17B agentic workloads. If you do not operate high-end accelerators, TokenSpeed is mainly worth studying for its architectural direction.

Repo-Specific Setup Workflow

Step 1: Confirm hardware and container prerequisites

The documented prerequisites are direct: an NVIDIA GPU host, Docker with GPU support, enough shared memory for model serving, and access to model checkpoints. If any one of these is missing, the evaluation shifts from an install test to an architecture read-through.

Step 2: Start runner container

Project docs start with prebuilt runner image:

docker pull lightseekorg/tokenspeed-runner:latest

docker run -itd \
  --shm-size 32g \
  --gpus all \
  -v /raid/cache:/home/runner/.cache \
  --ipc=host \
  --network=host \
  --pid=host \
  --privileged \
  --name tokenspeed \
  lightseekorg/tokenspeed-runner:latest \
  /bin/bash

This command grants deliberately broad host access: GPU, IPC, network, PID namespace, and privileged mode. Use isolated benchmark hosts, not shared developer machines.

Step 3: Clone source inside container

Inside container:

git clone https://github.com/lightseekorg/tokenspeed.git
cd tokenspeed

Step 4: Install editable packages

Install Python runtime:

export PIP_BREAK_SYSTEM_PACKAGES=1
pip install -e "./python" --no-build-isolation

Install kernel package:

pip install -e tokenspeed-kernel/python/ --no-build-isolation

Install scheduler package:

pip install -e tokenspeed-scheduler/

Editable installs signal an active development workflow. Expect source-level iteration, not a polished package-manager experience.

Step 5: Verify CLI surface

Run documented checks:

tokenspeed env
tokenspeed serve --help

Save tokenspeed env output with benchmark notes. In GPU inference work, environment details often explain more than code changes: driver, CUDA stack, image version, GPU model, and package state.
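One way to keep that output next to benchmark notes is to capture it programmatically. The sketch below is a minimal helper, not part of TokenSpeed: the function name capture_command, the JSON layout, and the output directory are all hypothetical choices. It only assumes that the command is on PATH and prints to stdout or stderr.

```python
import json
import subprocess
import time
from pathlib import Path


def capture_command(argv, out_dir, label):
    """Run argv, save its output plus metadata as JSON, and return the file path.

    Hypothetical helper for benchmark notes; not a TokenSpeed API.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=False: a failing command is still recorded instead of raising.
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    record = {
        "argv": list(argv),
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
    path = out_dir / f"{label}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Inside the runner container, a call such as capture_command(["tokenspeed", "env"], "benchmark-notes", "tokenspeed-env") would write benchmark-notes/tokenspeed-env.json. The same helper can record nvidia-smi output or the image digest, so every benchmark run carries its own environment snapshot.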

Step 6: Launch minimal OpenAI-compatible server

Minimal launch:

tokenspeed serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1

The model path goes directly after tokenspeed serve. The API then listens on the configured host and port.
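Large models can take a while to load, so scripted benchmarks usually need to wait before sending requests. The sketch below polls the TCP port until it accepts a connection. It deliberately avoids assuming any health endpoint, since none is named in the material used here. The function name and default timeouts are illustrative, and an open port only shows the server process is listening, not that the model is fully ready.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=300.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval_s)
    return False
```

For the minimal launch above, wait_for_port("127.0.0.1", 8000) would be the natural call. A first real chat completion request remains the better readiness signal.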

Step 7: Use production skeleton as reproducibility template

Docs provide explicit Kimi K2.5-style launch skeleton:

tokenspeed serve nvidia/Kimi-K2.5-NVFP4 \
  --served-model-name kimi-k2.5 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --quantization nvfp4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --chunked-prefill-size 8192 \
  --max-num-seqs 256 \
  --attention-backend trtllm_mla \
  --moe-backend flashinfer_trtllm \
  --reasoning-parser kimi_k25 \
  --tool-call-parser kimik2

Treat this as a benchmark skeleton, not a universal config. Hardware topology, model family, parser choice, and quantization format all matter.
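Because each of those choices changes the launch line, it can help to keep the configuration as data and generate the command from it, so every benchmark run records exactly which flags it used. The sketch below is an illustrative helper, not something TokenSpeed ships: build_serve_argv is a hypothetical name, and the only flags it emits are the ones you pass in. The example reuses the flags from the documented minimal launch.

```python
def build_serve_argv(model, options):
    """Turn a model name and an options dict into a `tokenspeed serve` argv list.

    True values become bare flags, False/None values are skipped,
    and everything else becomes `--key value`.
    """
    argv = ["tokenspeed", "serve", model]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is False or value is None:
            continue
        else:
            argv.extend([flag, str(value)])
    return argv


minimal = build_serve_argv(
    "openai/gpt-oss-20b",
    {"host": "0.0.0.0", "port": 8000, "tensor_parallel_size": 1},
)
print(" ".join(minimal))
# prints: tokenspeed serve openai/gpt-oss-20b --host 0.0.0.0 --port 8000 --tensor-parallel-size 1
```

Storing the options dict alongside results makes it easy to see later which tensor-parallel size, parser, or quantization setting produced a given number.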

Step 8: Test OpenAI client compatibility

Documented Python client path:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Write a concise deployment checklist."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

That compatibility matters because agent frameworks usually already speak OpenAI chat completions.

Deeper Analysis

TokenSpeed is interesting because it targets the scheduler and kernel bottlenecks that agent workloads expose. Agent loops often produce smaller turns, tool-call parsing, reasoning-token structures, and concurrent sessions. A serving engine must manage KV cache reuse safely while keeping GPUs fed. TokenSpeed emphasizes a finite-state request lifecycle and compile-time safety around KV resource reuse, which points directly at production-class scheduling concerns.

The modeling layer is also worth attention. Local-SPMD plus placement annotations suggests users should express model parallel intent without hand-writing every collective operation. If successful, this could improve maintainability for large MoE and reasoning models where parallelism strategy becomes part of serving correctness.

However, preview status dominates any evaluation. Features that many teams need for production observability and operations are still listed as in progress. The install flow uses privileged Docker and editable installs. That is acceptable for benchmark reproduction and deep technical evaluation, but it is not a turnkey production setup.

Practical Evaluation Checklist

Confirm the exact GPU type; Blackwell B200 examples should not be generalized to older cards.
Record the Docker image digest, driver version, CUDA stack, and tokenspeed env output.
Start with the minimal launch before the Kimi-style large config.
Verify the model license and checkpoint access before downloading huge weights.
Use the same prompt set for the vLLM, TensorRT-LLM, and TokenSpeed comparison.
Track throughput, time-to-first-token, inter-token latency, error rate, and GPU memory use.
Test OpenAI-compatible client behavior with your real agent framework.
Avoid production rollout while the preview warning remains.
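The latency metrics in this checklist can be derived from timestamps recorded while consuming a streamed response. The sketch below assumes you note one timestamp just before sending the request and one as each streamed chunk arrives. The function name and result keys are hypothetical, and chunks are only a proxy for tokens, since a server may pack more than one token into a chunk.

```python
from statistics import mean


def summarize_stream_timings(request_start, chunk_times):
    """Compute time-to-first-token, mean inter-chunk latency, and chunk rate.

    request_start: timestamp taken just before the request was sent.
    chunk_times: timestamps taken as each streamed chunk arrived, in order.
    """
    if not chunk_times:
        raise ValueError("no chunks received")
    ttft = chunk_times[0] - request_start
    # Gaps between consecutive chunks approximate inter-token latency.
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    total = chunk_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "mean_inter_chunk_s": mean(gaps) if gaps else 0.0,
        "chunks_per_s": len(chunk_times) / total if total > 0 else float("inf"),
    }
```

With the OpenAI Python client, passing stream=True to client.chat.completions.create yields chunks as they arrive, so calling time.perf_counter() once before the request and once per chunk gives the inputs. Running the same prompt set and the same summary function against each engine keeps the comparison consistent.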

Security Notes

The runner command uses --privileged, --network=host, --pid=host, and --ipc=host. Together these give the container broad visibility into the host. Run it only on isolated GPU hosts or disposable benchmark machines.

--trust-remote-code appears in the documented production skeleton. That flag allows code from the model repository to execute. Use it only with model sources you trust, pinned revisions, and restricted network and storage permissions.

Serving on 0.0.0.0:8000 exposes the API on all interfaces. Add firewalling, reverse proxy authentication, or private network placement before allowing access beyond localhost. Model endpoints can leak prompts, completions, and business logic.
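A launch script can guard against accidental exposure by checking the bind address before starting the server. The sketch below uses Python's standard ipaddress module. The function name is hypothetical, and treating the literal name localhost as loopback is an assumption that holds on typical hosts but depends on local name resolution.

```python
import ipaddress


def is_loopback_only(host):
    """Return True if binding to `host` keeps the server reachable from this machine only."""
    if host == "localhost":
        # Assumption: localhost resolves to a loopback address on this host.
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Other hostnames are treated as potentially exposed.
        return False
```

Here is_loopback_only("0.0.0.0") is False and is_loopback_only("127.0.0.1") is True, so a wrapper could refuse to launch on a non-loopback address unless a firewall or authenticated proxy has been confirmed.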

FAQ

Q: Is TokenSpeed production-ready? A: No, not in the current preview. Project material says not to use the preview release for production deployments.

Q: Can I run it on a CPU or an ordinary laptop GPU? A: The documented setup requires an NVIDIA GPU host and Docker with GPU support. It is aimed at serious accelerator environments.

Q: Why does setup use multiple editable installs? A: The runtime, kernel package, and scheduler package are separate components under active development. Editable installs fit source evaluation and local patching.

Q: Is the API compatible with OpenAI clients? A: The server exposes an OpenAI-compatible HTTP API, and the docs include a Python OpenAI client example using base_url="http://localhost:8000/v1".

Q: What should I benchmark first? A: Start with the documented minimal launch and a small request set. Only then move to the large Kimi K2.5 skeleton with tensor parallelism, expert parallelism, NVFP4, and parser flags.

Conclusion

TokenSpeed is best read as high-performance inference research turning into usable infrastructure. Its repo-specific value is not one magic command; it is the combination of a modeling compiler direction, explicit scheduler design, pluggable kernels, Blackwell-focused MLA work, and OpenAI-compatible serving.

For runany.dev readers, the practical path is narrow: evaluate on isolated NVIDIA GPU servers, preserve the documented commands, capture environment facts, and compare against your existing serving stack. If you need production stability today, wait. If you study future agentic inference engines or operate benchmark hardware, TokenSpeed deserves close inspection.