User Manual
This public manual documents how LLM tester with llama.cpp works with the current Android implementation.
Related docs: Technical Specification | Privacy Policy
Important: Download Data Usage
Downloading models may require gigabytes of data. Using mobile/cellular data may incur significant charges; downloading over Wi-Fi is strongly recommended.
1. Overview
- App name: LLM tester with llama.cpp
- Built with: Llama (llama.cpp)
- This app runs an LLM on your device and generates responses to prompts.
- An Ollama-compatible API server and the standard llama.cpp WebUI can be started together on the same port.
2. Recommended Setup
- If the API/WebUI enablement popup appears at launch, enable it when needed or check "Don't show next time" to skip it on future launches.
- On first launch, if you check "Don't show next time" in Quick Start, it will not be shown on subsequent launches.
- Open "Settings" from the main screen.
* During inference (Busy), the Settings button is disabled and is re-enabled automatically when processing completes. - Enter the model URL or import a .gguf file from the local device, then tap "Load Model".
* The local import picker opens in Downloads by default, and you can navigate elsewhere on the device as needed. Reachable HTTP/HTTPS URLs can be used. HTTPS uses normal SSL/TLS certificate verification. Imported local files are saved as filenames only in Settings. - Edit parameters if needed and tap "Save Config".
- Tap "SAVE & CLOSE" to save settings and apply them to the model immediately.
3. Main Screen Features
- Enter Prompt: Type your prompt.
- Send: Start generation. If the model is not loaded, it will be loaded automatically.
- Settings: Automatically disabled while inference is busy, and re-enabled when busy is cleared.
- Re-init Model: Available even while work is running. It stops the active generation and immediately reinitializes the current profile. If it fails, check the log or load the model again from Settings.
- View Log / Hide Log: Show the latest 100 lines from the log file in the Model Output area, and tap again to return to the normal response view.
- Update: Reload the latest 100 log lines while View Log is active.
- Clear Log: Clear the log file.
- Start/Stop API/WebUI: Toggle the API server and WebUI together.
- Model Output: Shows model responses. When View Log is active, this area shows the log body instead.
- Copy (Model Output): Copy the content shown in the Model Output area to the clipboard.
- Download: Save the current response in normal view, or save the full log while View Log is active. If the response area is empty, the full log is used instead.
- Processing Status/Logs: Shows timestamped processing logs.
- Copy (Processing Status/Logs): Copy the Processing Status/Logs area to the clipboard.
4. Settings Screen
- Settings screen: Controls are grouped into collapsible sections. Tap a section title to expand or collapse it. The MCP Settings section is collapsed by default.
- Configuration Management: Save/delete/load configurations.
- Model Selection: Load models from a URL or import .gguf files from the local device. The picker opens in Downloads by default, and you can navigate elsewhere on the device as needed. Imported files are copied into the app model storage directory and only the filename is shown in Settings. Reachable HTTP/HTTPS URLs can be used, and HTTPS uses normal SSL/TLS certificate verification.
- Model Parameters: Set generation parameters (including GPU Offload Layers).
- Output Settings: Toggle streaming output on/off.
- Prompt Template: Set System Prompt, Think on/off (chat-template-kwargs.enable_thinking), and custom chat template. When no custom template is set, the app first uses GGUF chat_template metadata and otherwise auto-selects by model family. A Bonsai fallback template is included.
- Llama API Server: Set the server port. The Local URL is shown as
http://localhost:<port>, and while connected to Wi-Fi the LAN URL is also shown and can be tapped to copy. Enabling it from the startup popup or the main screen makes both the API and WebUI available on that port. - MCP Settings: Save MCP config JSON and Function Definitions JSON as app-wide shared settings separate from model profiles. When the switches are off, they are available only in the WebUI and are treated as absent everywhere else. When enabled, they are also used as shared MCP and function-calling settings for the main prompt input,
/api/chat,/api/generate, and/v1/chat/completions. - Display Language: Switch UI language between Japanese and English. On first launch it follows your device locale, and your choice is saved for later launches.
- Log Settings: Select log level (default on first launch: INFO).
- Show License: Display license text.
- Documents: View the user manual and the privacy policy.
- SAVE & CLOSE: Save current settings and apply them to the model immediately.
- CLOSE: Return to the main screen without saving any changes.
5. Model Parameter Details
Basic Parameters
- Context Size (n_ctx): Number of tokens the model can process at once. Larger values handle longer contexts but use more memory.
- Threads (n_threads): Number of CPU threads for inference. Adjust based on your device's core count.
- Batch Size (n_batch): Number of tokens processed at once. Larger is faster but uses more memory.
- GPU Offload Layers: Number of layers to offload to GPU. 0 disables offload, 1-39 offloads that many layers, and -1 targets all available layers.
- Temperature (temp): Controls output randomness. Lower is more deterministic, higher is more diverse.
- Top-p: Select from tokens until cumulative probability reaches this value (nucleus sampling).
- Top-k: Select from top k probability tokens.
Penalty Parameters
- Penalty Last N: Number of recent tokens to apply penalties to.
- Penalty Repeat: Multiplier for repeat token penalty. 1.0 disables, higher suppresses repetition.
- Penalty Frequency: Penalty based on token frequency.
- Penalty Presence: Penalty for tokens that appeared before.
Mirostat Parameters
- Mirostat: 0=disabled, 1=Mirostat v1, 2=Mirostat v2. Auto-adjusts output consistency.
- Mirostat Tau: Target surprise value (perplexity). Lower for more consistent output.
- Mirostat Eta: Learning rate for Mirostat feedback.
Additional Sampling Parameters
- Min-p: Minimum probability threshold. Excludes tokens below this probability.
- Typical P: Parameter for typical sampling.
- Dynamic Temperature Range: Range for dynamic temperature adjustment. 0 disables.
- Dynamic Temperature Exponent: Exponent for dynamic temperature.
- XTC Probability: Probability for XTC sampling.
- XTC Threshold: Threshold for XTC sampling.
- Top-N-Sigma: Sigma-based sampling. -1 disables.
DRY Parameters
- DRY Multiplier: Don't Repeat Yourself penalty strength. 0 disables.
- DRY Base: Base value for DRY penalty.
- DRY Allowed Length: Minimum length for allowed repetitions.
- DRY Penalty Last N: Number of tokens for DRY penalty. -1 applies to all.
- DRY Sequence Breakers: Characters that break DRY sequences.
Output Settings
- Enable Streaming: When enabled, output updates as tokens are generated. When disabled, output shows all at once after generation completes.
Think Settings
- Enable Think: Toggles chat-template-kwargs enable_thinking. When disabled, prompts are formatted to suppress visible thinking output.
6. Prompt Template Auto-Selection
When no custom template is set, the app first estimates the family from GGUF chat_template metadata and otherwise auto-selects from the filename.
- Supported families: Gemma, Qwen, Mistral, LLaMA, Phi, Bonsai, Zephyr, Hermes
- Fallback when unrecognized: ChatML
- Gemma family: The app keeps the system / user / model order
- Logging: Selection results are logged to Processing Status/Logs and INFO-level logs
- API history: Conversation history from /api/chat is formatted using model-family-specific multi-turn templates
7. Stop Sequences
Generation automatically stops when common chat template delimiters are detected in the output.
8. API/WebUI Server (Optional)
- On app launch, a popup asks whether to enable the local API/WebUI server, and you can check "Don't show next time" to skip it on future launches.
- When enabled, the server provides:
/api/chat,/api/generate,/api/tags/v1/chat/completions,/v1/models/props,/slots- WebUI static files
- The WebUI is available at
http://<device-ip>:<port>/on the same port - MCP config JSON saved in the app settings is exposed to the WebUI through
/propsas a shared setting and is used together with the WebUI's local MCP settings. - When MCP outside WebUI is enabled, shared MCP settings are also used by the main prompt input,
/api/chat,/api/generate, and/v1/chat/completionsfor internal tool execution. When disabled, they remain WebUI-only. - When Function Calling outside WebUI is enabled, Function Definitions JSON is automatically added as shared function-calling definitions for the main prompt input,
/api/chat,/api/generate, and/v1/chat/completions. When disabled, it remains WebUI-only. - Only one generation runs at a time. When busy, requests are queued (up to 10) and wait up to 60 seconds; queue overflow or timeout returns 503.
- Android 13+ may require notification permission.
9. 🧭 Finding GGUF Files
9-1. Locating GGUF-compatible models
- Use the GGUF tag on Hugging Face model search
https://huggingface.co/models?library=gguf - GGUF models often have
-GGUFin the repository name
Example:TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF - The model page's Files tab lists available
*.gguffiles
9-2. Choosing a quantization variant (overview)
Q2_K: Lightweight, low memory footprintQ4_K_M: Balanced (recommended to start with)Q8_0: Larger, higher quality
10. 📥 Downloading from a Browser
10-1. Manual download
- Open the model page
Example:https://huggingface.co/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF - Click the Files tab
- Click the desired
*.gguffile - Press the Download button in the top right
10-2. Getting a direct URL to a GGUF file
- In the Files tab, click the
*.gguffile to open its page - Right-click the Download button and select "Copy link"
- You now have a direct URL to the GGUF file that you can paste into the app
Tips
Loading a very large model may stop because address-space reservation fails or because the process was interrupted by user action. In that case the app clears temporary load files on the next launch and shows a notice. If needed, try a smaller model or load the model again from Settings.