Local LLM (GGUF) Inference Viewer

This demo downloads a GGUF model from a fixed URL, stores it in IndexedDB for reuse, and runs real inference in the browser. It can also expose Ollama-compatible /api/tags, /api/generate, and /api/chat endpoints.

Initializing inference engine...

1) Fetch the fixed model (Qwen2.5-1.5B-Instruct-GGUF)

* The first run, and each periodic refresh of the cached model, downloads about 1 GB of data.

Model not loaded yet (press the button to download it or reuse the cache).
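The download-or-reuse behavior can be sketched as a cache-first load. This is a minimal illustration, not the demo's actual code: `ModelStore`, `MemoryStore`, `loadModel`, and the key name are hypothetical, and the in-memory store stands in for an IndexedDB object store so the sketch runs outside a browser.

```typescript
// Hypothetical storage interface. In the browser, an implementation would
// wrap an IndexedDB object store; that wrapper is not shown here.
interface ModelStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, bytes: Uint8Array): Promise<void>;
}

// In-memory stand-in used only to make the sketch runnable anywhere.
class MemoryStore implements ModelStore {
  private readonly items = new Map<string, Uint8Array>();

  async get(key: string): Promise<Uint8Array | undefined> {
    return this.items.get(key);
  }

  async put(key: string, bytes: Uint8Array): Promise<void> {
    this.items.set(key, bytes);
  }
}

// Return the cached bytes when present; otherwise download once and store them.
async function loadModel(
  store: ModelStore,
  key: string,
  download: () => Promise<Uint8Array>
): Promise<{ bytes: Uint8Array; fromCache: boolean }> {
  const cached = await store.get(key);
  if (cached !== undefined) {
    return { bytes: cached, fromCache: true };
  }
  const bytes = await download();
  await store.put(key, bytes);
  return { bytes, fromCache: false };
}
```

With this shape, the second press of the button finds the stored bytes and skips the roughly 1 GB transfer. A periodic refresh would amount to ignoring the cached entry and calling `download` again.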

1.5) Enable API (Service Worker)

API status: unchecked
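Once enabled, the Service Worker has to decide which same-origin requests belong to the API. The sketch below shows one way such routing could look. `routeApiRequest` and its types are hypothetical names; the 405 answer for a wrong method and the choice to return 503 on every endpoint (including /api/tags) when no model is loaded are assumptions, not confirmed behavior of this demo.

```typescript
type ApiRoute = "tags" | "generate" | "chat";

interface RouteDecision {
  route: ApiRoute;
  status: number; // HTTP status the worker would answer with
}

// The three paths the demo says it exposes, with their HTTP methods.
const ROUTES = new Map<string, { method: string; route: ApiRoute }>([
  ["/api/tags", { method: "GET", route: "tags" }],
  ["/api/generate", { method: "POST", route: "generate" }],
  ["/api/chat", { method: "POST", route: "chat" }],
]);

// Returns null when the request is not an API call, so the worker can let
// it fall through to the network unchanged.
function routeApiRequest(
  method: string,
  pathname: string,
  modelLoaded: boolean
): RouteDecision | null {
  const entry = ROUTES.get(pathname);
  if (entry === undefined) {
    return null;
  }
  if (method.toUpperCase() !== entry.method) {
    return { route: entry.route, status: 405 }; // assumption: wrong method is rejected
  }
  if (!modelLoaded) {
    return { route: entry.route, status: 503 }; // no model loaded yet
  }
  return { route: entry.route, status: 200 };
}
```

Inside a worker's `fetch` event handler, a non-null decision would be turned into a response via `event.respondWith(...)`, while a null decision leaves the request untouched.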

2) Enter a prompt

The response will appear here.

3) Ollama-compatible API (same-origin fetch)

GET /api/tags
POST /api/generate {"model":"default","prompt":"hello","stream":false}
POST /api/chat {"model":"default","messages":[{"role":"user","content":"hello"}],"stream":false}

* If "stream" is omitted, it defaults to true and the response is an NDJSON stream. The API returns 503 when no model is loaded.
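When streaming is left on, a client has to split the NDJSON body into lines and parse each one, even when a network chunk ends in the middle of a JSON object. The following is a minimal sketch of that step. `NdjsonParser` and `collectGenerateText` are hypothetical helpers, and the "response" and "done" field names follow Ollama's documented /api/generate stream format, on the assumption that this demo matches it.

```typescript
// Incremental NDJSON parser: feed text chunks, get back the JSON values
// whose lines are complete so far.
class NdjsonParser {
  private buffer = "";

  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an unfinished line ("" if the chunk ended on "\n").
    this.buffer = lines.pop() ?? "";
    return lines
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .map((line) => JSON.parse(line) as unknown);
  }

  // Parse whatever remains when the stream ends without a trailing newline.
  flush(): unknown[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [JSON.parse(rest) as unknown] : [];
  }
}

// Concatenate the "response" fields of streamed /api/generate objects
// (assumed field name, see the note above).
function collectGenerateText(objects: unknown[]): string {
  let text = "";
  for (const obj of objects) {
    if (typeof obj === "object" && obj !== null && "response" in obj) {
      const value = (obj as { response: unknown }).response;
      if (typeof value === "string") {
        text += value;
      }
    }
  }
  return text;
}
```

In the browser, the chunks would come from reading `response.body` of a same-origin `fetch("/api/generate", ...)` call and decoding each piece with a `TextDecoder`. A caller should check for a 503 status before reading the body, since that is what the API returns when no model is loaded.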