on-device-llm-bench

Three backends run the same prompts through the same harness:

- gemma-tjs — Gemma 4, Transformers.js + WebGPU
- phi4-edge — Phi-4-mini, Edge Prompt API (native)
- phi4-tjs — Phi-4-mini, Transformers.js + WebGPU

Comparing phi4-edge vs phi4-tjs isolates the runtime (same model, different runtimes); comparing gemma-tjs vs phi4-tjs isolates the model (same runtime, different models). Source on GitHub.

Results


How to read this
Contribute a run from your device

Run all three backends as described in the README, then add a folder under results/<YYYYMMDD>-<os-arch>/ containing the three per-backend result JSONs and a short report.md. Append an entry for your run to docs/manifest.json and open a PR.
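The steps above can be sketched as a few shell commands. The date, platform tag, and result file names below are illustrative assumptions; use your own run date, your machine's OS/arch, and the file names your harness actually emits.

```shell
#!/bin/sh
set -eu

# Hypothetical run folder: <YYYYMMDD>-<os-arch> per the convention above.
RUN_DIR=results/20250101-linux-x64
mkdir -p "$RUN_DIR"

# Placeholders for the three per-backend result JSONs and the write-up.
# In a real contribution these come from the harness, not from ': >'.
for f in gemma-tjs.json phi4-edge.json phi4-tjs.json report.md; do
  : > "$RUN_DIR/$f"
done
```

After the folder is in place, add a matching entry to docs/manifest.json (its schema is defined by this repo, so copy an existing entry as a template) and open the PR.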