Android / llama.cpp / GGUF

LLM AI Server with llama.cpp

LLM AI Server with llama.cpp is an Android local-LLM testing app that lets you load GGUF models, tune inference and prompt-template settings, manage shared MCP / Function Definitions settings, inspect logs, and expose an Ollama/OpenAI-compatible API plus the bundled WebUI from one app.

Docs: User Manual | API Reference | Technical Specification | llama.cpp / JNI / CMake Deep Dive | Privacy Policy

        Supports both model downloads from a URL and importing local .gguf files from the device.
Lets you combine generation settings, Think behavior, custom chat templates, shared MCP settings, and Function Definitions JSON.
Can start an on-device Ollama/OpenAI-compatible API and WebUI on the same port, including endpoints such as /api/chat and /v1/chat/completions.

      

Screenshots

LLM AI Server with llama.cpp main screen — Main screen with status header (speed, temperature, backend), API/WebUI controls, Direct Run section, and processing logs.

LLM AI Server with llama.cpp settings screen — Settings screen for model URLs, local GGUF import, GPU backend selection, mmproj settings, and inference parameters.

LLM AI Server with llama.cpp template settings — Template, API, MCP, language, and log settings, including System Prompt, custom chat template, and Function Definitions JSON fields.

Key Features

On-device local inference: Runs GGUF models directly on Android with llama.cpp.
Flexible model loading: Supports downloadable model URLs and local .gguf imports.
Deep inference controls: Adjust n_ctx, n_threads, compute backend (CPU / GPU), offload layer count, Top-p, Top-k, penalties, Mirostat, DRY, Think behavior, and custom chat templates.
MTP speculative decoding (experimental): For models that embed an MTP head (Qwen3.5-MTP, Gemma 4, etc.), enable MTP (draft-mtp) per model profile. Off by default, adjustable n_draft, no extra file needed (own head). Generation can be faster on supported setups.
Shared MCP / function-calling settings: Save MCP Config JSON and Function Definitions JSON separately from model profiles, then optionally enable them for the main prompt input, /api/chat, /api/generate, and /v1/chat/completions.
Built-in Ollama/OpenAI-compatible API and WebUI: Provides /api/chat, /api/generate, /api/tags, /v1/chat/completions, /v1/models, /props, and /slots on the same port; only one generation runs at a time, with a queue of up to 10 requests for up to 60 seconds.
Multimodal API inputs: /api/chat and /v1/chat/completions can accept image_url and input_audio when the loaded model supports vision/audio.

Operational Notes

Model downloads can be several gigabytes. Wi‑Fi is strongly recommended.
The local API server is intended for the same device or a local network. Android 13+ may require notification permission.
If you configure MCP servers for the main prompt input, API integrations, or the WebUI, parts of conversation content or tool inputs may be sent to those MCP servers.