User Manual

This public manual documents how LLM AI Server with llama.cpp works with the current Android implementation.

Related docs: Technical Specification | Privacy Policy

Important: Download Data Usage

Downloading models may require gigabytes of data. Using mobile/cellular data may incur significant charges; downloading over Wi-Fi is strongly recommended.

1. Overview

App name: LLM AI Server with llama.cpp
Built with: Llama (llama.cpp)
This app runs an LLM on your device and generates responses to prompts.
An Ollama-compatible API server and the standard llama.cpp WebUI can be started together on the same port.

2. Recommended Setup

If the API/WebUI enablement popup appears at launch, enable it when needed or check "Don't show next time" to skip it on future launches.
On first launch, if you check "Don't show next time" in Quick Start, it will not be shown on subsequent launches.
Open "Settings" from the main screen.
* During inference (Busy), the Settings button is disabled and is re-enabled automatically when processing completes.
Enter the model URL or import a .gguf file from the local device, then tap "Load Model".
* The local import picker opens in Downloads by default, and you can navigate elsewhere on the device as needed. Reachable HTTP/HTTPS URLs can be used. HTTPS uses normal SSL/TLS certificate verification. Imported local files are saved as filenames only in Settings.
Edit parameters if needed and tap "Save Config".
Tap "SAVE & CLOSE" to save settings and apply them to the model immediately.

3. Main Screen Features

Status Header (top of screen)

API/WebUI status: Shows whether the API server is running and on which port.
Speed, temperature, backend: Shows the last generation speed (tok/s), device temperature (°C), active compute backend (CPU / GPU), and the name of the loaded model.

Main Action Buttons

Start API/WebUI / Stop API/WebUI: Toggle the API server and WebUI together.
Reset: Available even while work is running. Stops the active generation and immediately reinitializes the current profile. If it fails, check the log or load the model again from Settings.
Settings: Automatically disabled while inference is busy, and re-enabled when processing completes.

Web UI URL

Web UI URL (tap to open / long-press to copy): Displays the WebUI URL once the API server is running. Tap to open in the browser, long-press to copy the URL to the clipboard.

Direct Run Section (tap to expand; collapsed by default)

This section is collapsed by default. Tap the section header to expand it.

Profile: Select the profile (configuration) to use from the spinner. Applied on the next prompt send.
Enter Prompt: Type your prompt.
Send: Start generation. If the selected profile's model is not loaded, it will be loaded automatically first.
Model Output: Shows model responses. Long-press to select and copy text.

Processing Status/Logs Section (tap to collapse; expanded by default)

View Log / Show Status: Show the latest 100 lines from the log file in the status area. Tap again to return to the live processing status view.
Update: Reload the latest 100 log lines while View Log is active.
Clear: In normal view, clears the processing status area. While View Log is active, clears the log file.
Download: Save the current response in normal view, or save the full log while View Log is active. If the response area is empty, the full log is used instead.
Processing Status/Logs: Shows timestamped processing logs. Long-press to select and copy text.

4. Settings Screen

Settings screen: Controls are grouped into collapsible sections. Tap a section title to expand or collapse it. The MCP Settings section is collapsed by default.
Configuration Management: Save/delete/load configurations. The Display Language switch is also in this section.
Model Selection: Load models from a URL or import .gguf files from the local device. The picker opens in Downloads by default, and you can navigate elsewhere on the device as needed. Imported files are copied into the app model storage directory and only the filename is shown in Settings. Reachable HTTP/HTTPS URLs can be used, and HTTPS uses normal SSL/TLS certificate verification.
Multimodal Projector (mmproj): Select or clear an mmproj (multimodal projector) file when using a multimodal model. Used based on the loaded model's vision/audio support.
Model Maintenance (MAINTAIN MODEL): Lists model files stored in the app, with options to switch to a different stored model or delete a selected file.
Search GGUF on Hugging Face: Opens the Hugging Face GGUF model search page in the browser.
Compute Backend: Toggle GPU (OpenCL/Adreno) on or off. When it is off the model runs on CPU.
Offload Layers (GPU): Set the number of layers to offload to the GPU using a slider. The maximum position means ALL available layers. The slider is disabled when GPU is off (CPU mode).
Model Parameters: Set generation parameters such as n_ctx, n_threads, n_batch, and temperature.
MTP (speculative decoding, experimental): Enable via the "MTP speculative decoding" toggle (off by default). When on, set n_draft (draft tokens per step, default 2) and the draft source ("Use this model's own MTP head (recommended)" or a separate GGUF). MTP settings are stored per-model profile and take effect after saving the config and reloading the model. For models that embed an MTP head (Qwen3.5-MTP, Gemma 4) just pick "own head" — no extra file and no double memory use, and generation can be faster (the gain depends on device, model and n_draft; behaviour is identical to before when off; controls are disabled while inference is running).
Output Settings: Toggle streaming output on/off. Enable Show Performance Metrics to append token counts, processing time, and speed to the output area after generation completes.
Prompt Template: Set System Prompt, Think on/off (chat-template-kwargs.enable_thinking), and custom chat template. When no custom template is set, the app first uses GGUF chat_template metadata and otherwise auto-selects by model family. A Bonsai fallback template is included.
Llama API Server: Set the server port. The Local URL is shown as http://localhost:<port>, and while connected to Wi-Fi the LAN URL is also shown and can be tapped to copy. Enabling it from the startup popup or the main screen makes both the API and WebUI available on that port.
MCP Settings: Save MCP config JSON and Function Definitions JSON as app-wide shared settings separate from model profiles. When the switches are off, they are available only in the WebUI and are treated as absent everywhere else. When enabled, they are also used as shared MCP and function-calling settings for the main prompt input, /api/chat, /api/generate, and /v1/chat/completions.
Log Settings: Select log level (default on first launch: INFO).
Show License: Display license text.
Documents: View the user manual and the privacy policy.
SAVE & CLOSE: Save current settings and apply them to the model immediately.
CLOSE: Return to the main screen without saving any changes.

5. Model Parameter Details

Basic Parameters

Context Size (n_ctx): Number of tokens the model can process at once. Larger values handle longer contexts but use more memory.
Threads (n_threads): Number of CPU threads for inference. Adjust based on your device's core count.
Batch Size (n_batch): Number of tokens processed at once. Larger is faster but uses more memory.
Compute Backend: Toggle GPU (OpenCL/Adreno) on or off. When it is off the model runs on CPU.
Offload Layers (GPU): Set the number of layers to offload via a slider. The maximum position (ALL) targets all available layers. Disabled when GPU is off.
Temperature (temp): Controls output randomness. Lower is more deterministic, higher is more diverse.
Top-p: Select from tokens until cumulative probability reaches this value (nucleus sampling).
Top-k: Select from top k probability tokens.

Penalty Parameters

Penalty Last N: Number of recent tokens to apply penalties to.
Penalty Repeat: Multiplier for repeat token penalty. 1.0 disables, higher suppresses repetition.
Penalty Frequency: Penalty based on token frequency.
Penalty Presence: Penalty for tokens that appeared before.

Mirostat Parameters

Mirostat: 0=disabled, 1=Mirostat v1, 2=Mirostat v2. Auto-adjusts output consistency.
Mirostat Tau: Target surprise value (perplexity). Lower for more consistent output.
Mirostat Eta: Learning rate for Mirostat feedback.

Additional Sampling Parameters

Min-p: Minimum probability threshold. Excludes tokens below this probability.
Typical P: Parameter for typical sampling.
Dynamic Temperature Range: Range for dynamic temperature adjustment. 0 disables.
Dynamic Temperature Exponent: Exponent for dynamic temperature.
XTC Probability: Probability for XTC sampling.
XTC Threshold: Threshold for XTC sampling.
Top-N-Sigma: Sigma-based sampling. -1 disables.

DRY Parameters

DRY Multiplier: Don't Repeat Yourself penalty strength. 0 disables.
DRY Base: Base value for DRY penalty.
DRY Allowed Length: Minimum length for allowed repetitions.
DRY Penalty Last N: Number of tokens for DRY penalty. -1 applies to all.
DRY Sequence Breakers: Characters that break DRY sequences.

Output Settings

Enable Streaming: When enabled, output updates as tokens are generated. When disabled, output shows all at once after generation completes.

Think Settings

Enable Think: Toggles chat-template-kwargs enable_thinking. When disabled, prompts are formatted to suppress visible thinking output.

6. Prompt Template Auto-Selection

When no custom template is set, the app first estimates the family from GGUF chat_template metadata and otherwise auto-selects from the filename.

Supported families: Gemma, Qwen, Mistral, LLaMA, Phi, Bonsai, Zephyr, Hermes
Fallback when unrecognized: ChatML
Gemma family: The app keeps the system / user / model order
Logging: Selection results are logged to Processing Status/Logs and INFO-level logs
API history: Conversation history from /api/chat is formatted using model-family-specific multi-turn templates

7. Stop Sequences

Generation automatically stops when common chat template delimiters are detected in the output.

8. API/WebUI Server (Optional)

On app launch, a popup asks whether to enable the local API/WebUI server, and you can check "Don't show next time" to skip it on future launches.
When enabled, the server provides:
- /api/chat, /api/generate, /api/tags
- /v1/chat/completions, /v1/models
- /props, /slots
- WebUI static files
The WebUI is available at http://<device-ip>:<port>/ on the same port
MCP config JSON saved in the app settings is exposed to the WebUI through /props as a shared setting and is used together with the WebUI's local MCP settings.
When MCP outside WebUI is enabled, shared MCP settings are also used by the main prompt input, /api/chat, /api/generate, and /v1/chat/completions for internal tool execution. When disabled, they remain WebUI-only.
When Function Calling outside WebUI is enabled, Function Definitions JSON is automatically added as shared function-calling definitions for the main prompt input, /api/chat, /api/generate, and /v1/chat/completions. When disabled, it remains WebUI-only.
Only one generation runs at a time. When busy, requests are queued (up to 10) and wait up to 60 seconds; queue overflow or timeout returns 503.
Android 13+ may require notification permission.

9. 🧭 Finding GGUF Files

9-1. Locating GGUF-compatible models

Use the GGUF tag on Hugging Face model search
https://huggingface.co/models?library=gguf
GGUF models often have -GGUF in the repository name
Example: TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
The model page's Files tab lists available *.gguf files

9-2. Choosing a quantization variant (overview)

Q2_K: Lightweight, low memory footprint
Q4_K_M: Balanced (recommended to start with)
Q8_0: Larger, higher quality

10. 📥 Downloading from a Browser

10-1. Manual download

Open the model page
Example: https://huggingface.co/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF
Click the Files tab
Click the desired *.gguf file
Press the Download button in the top right

10-2. Getting a direct URL to a GGUF file

In the Files tab, click the *.gguf file to open its page
Right-click the Download button and select "Copy link"
You now have a direct URL to the GGUF file that you can paste into the app

Tips

Loading a very large model may stop because address-space reservation fails or because the process was interrupted by user action. In that case the app clears temporary load files on the next launch and shows a notice. If needed, try a smaller model or load the model again from Settings.