Technical Specification

This page documents the current Android implementation of LLM AI Server with llama.cpp, focusing on the full path from the front app to configuration storage, JNI, CMake/NDK build, llama.cpp/ggml, the on-device API server, and bundled WebUI delivery.

Android Activity UI | Configuration JSON | ModelManager | JNI / C++ | llama.cpp + ggml + mtmd | Ollama / OpenAI compatible API | Bundled WebUI

1. System overview

The app has two main execution paths: a direct UI-to-native inference path, and an on-device HTTP path used by the Ollama-compatible / OpenAI-compatible server. In both cases, model loading, busy-state control, parameter application, and reinitialization are centralized in ModelManager, and inference itself runs inside the llama_jni shared library, which embeds llama.cpp, ggml, and mtmd.

MainActivity / SettingsActivity
        ↓
ConfigurationManager / ModelFileHelper
        ↓
ModelManager
        ↓
LlamaNative (Java native wrapper)
        ↓ JNI
llama_jni (jni_llama.cpp)
        ↓
llama.cpp + ggml + mtmd + libcurl + mbedTLS

Separate path:
OllamaForegroundService
        ↓
OllamaApiServer (HTTP, same-device / LAN)
        ↓
ModelManager → LlamaNative → JNI → llama.cpp

Layer	Main classes	Role
UI	MainActivity / SettingsActivity / DocumentsActivity	Input, output display, log viewing, settings editing, API/WebUI control
Configuration	ConfigurationManager	Profile JSON persistence and defaults
Model assets	ModelFileHelper	GGUF/mmproj path resolution, storage directory, modality inference
Execution control	ModelManager	Loading, reloading, locking, native invocation coordination
Java ↔ Native	LlamaNative	Exports native methods and registers token callbacks
Native runtime	jni_llama.cpp	Global state, init, sampling, stop sequences, logging, crash markers
HTTP	OllamaApiServer / OllamaForegroundService	Ollama API, OpenAI API, WebUI delivery, queue control

2. Front app layer

2-1. MainActivity

The main screen combines prompt input, model output, processing logs, log-file inspection, and API/WebUI controls.
When Send is pressed, the activity first checks whether the selected profile's model is already loaded. If not, it calls ModelManager.loadConfiguration() and only then enters processGeneration().
Direct prompt input uses PromptTemplateManager.buildPromptForDirectInputWithSelection() so GGUF metadata, custom templates, Settings system prompt, and Think behavior are resolved into one final prompt.
When streaming is enabled, the activity registers a LlamaNative.TokenListener and incrementally appends filtered tokens to the UI. Otherwise it waits for the full generate() result.
The reinitialize action can interrupt current work, disconnect API clients if needed, and force a clean reload through forceReinitializeConfiguration().

2-2. SettingsActivity

This screen edits the profile name, model URL or local GGUF import, n_ctx, n_threads, top_p, DRY, Mirostat, Think, GPU offload layers, API port, display language, and log level.
Local GGUF import uses the Android document picker and copies the selected file into the app's model storage directory.
Download progress is delivered through LlamaNative.DownloadProgressListener and reflected in the progress bar and model file status text.
GPU offload is set with a seek bar; the top of its range is converted to -1, meaning "all available layers".
The Documents button opens the in-app document viewer for the manual and privacy policy.
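
The seek-bar conversion for GPU offload can be sketched as below. This is only an illustration: the class name, the `maxLayers` bound, and the choice that any position above `maxLayers` counts as the "top range" are assumptions, since the actual thresholds are not stated here.

```java
// Hypothetical helper: maps a seek-bar position to a GPU offload layer count.
final class GpuOffloadMapper {
    /** Value that means "offload all available layers". */
    static final int ALL_LAYERS = -1;

    /**
     * @param progress  current seek-bar position, expected in 0..maxLayers + 1
     * @param maxLayers highest explicit layer count the bar can select (assumed)
     * @return the layer count to store in the profile, or -1 for "all layers"
     */
    static int toGpuLayers(int progress, int maxLayers) {
        if (progress < 0) {
            throw new IllegalArgumentException("progress must be >= 0");
        }
        // Positions past the explicit range form the top range and become -1.
        return progress > maxLayers ? ALL_LAYERS : progress;
    }
}
```

With a bar that exposes 0 to 64 explicit layers plus one extra step, position 65 would be stored as -1 while positions 0 to 64 are stored unchanged.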

2-3. Startup and background integration

LlamaApplication configures native logging early and writes uncaught Java exceptions to last_crash.txt.
OllamaForegroundService keeps the API/WebUI stack alive in the background using a persistent notification.
MainActivity listens for service broadcasts to update server status and append processing logs in real time.
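
A minimal sketch of the uncaught-exception logging follows. The class and method names are hypothetical, and the report layout is an assumption; only the file name last_crash.txt comes from the description above. It uses java.io rather than java.nio.file so that it stays within what minSdk 24 offers.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical crash recorder; the report format is an assumption.
final class CrashRecorder {
    static final String FILE_NAME = "last_crash.txt";

    /** Formats the thread name and the full stack trace as text. */
    static String describe(Thread thread, Throwable error) {
        StringWriter buffer = new StringWriter();
        buffer.append("thread=").append(thread.getName()).append('\n');
        PrintWriter printer = new PrintWriter(buffer);
        error.printStackTrace(printer);
        printer.flush();
        return buffer.toString();
    }

    /** Writes the report into dir/last_crash.txt; returns false on I/O failure. */
    static boolean write(File dir, Thread thread, Throwable error) {
        File target = new File(dir, FILE_NAME);
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(target), StandardCharsets.UTF_8)) {
            out.write(describe(thread, error));
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Installs the handler and chains to the previous one so the process still dies. */
    static void install(final File dir) {
        final Thread.UncaughtExceptionHandler previous =
                Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            write(dir, thread, error);
            if (previous != null) {
                previous.uncaughtException(thread, error);
            }
        });
    }
}
```

In an Application subclass, `install(...)` would typically be called from `onCreate()` with the app's log directory. Chaining to the previous handler keeps the platform's default crash behavior intact.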

3. Configuration and model files

3-1. ConfigurationManager

Each profile is stored as configs/*.json in the app's external files area. A default profile is generated automatically and includes model reference, generation parameters, Think state, GPU offload, custom template, system prompt, and optional mmproj reference.

Category	Main fields	Purpose
Model	`modelUrl`, `multimodalProjectorUrl`	GGUF and optional mmproj references
Load-time	`nCtx`, `nThreads`, `nBatch`, `gpuOffloadLayers`	Context, CPU, and GPU-loading behavior
Sampling	`temp`, `topP`, `topK`, penalties, Mirostat, DRY, XTC	Mapped directly into the native sampler chain
Prompt construction	`systemPrompt`, `customChatTemplate`, `enableThinking`	Controls the final prompt format
UI/runtime	`streaming`	Default streaming behavior for UI and API

Shared MCP settings and shared Function Definitions are stored separately in SharedPreferences, outside the per-model profile JSON.
The persisted keys are shared_mcp_servers_json, shared_function_definitions_json, and their enable/disable switches for use outside the WebUI.
/props exposes these values inside webui_settings as sharedMcpServers and sharedFunctionDefinitions, allowing the bundled WebUI to combine them with its own local settings.

3-2. ModelFileHelper

The model storage directory is getExternalFilesDir(null) when available; otherwise the app's internal files directory is used.
It extracts filenames from URLs or local references and resolves them into app-local storage paths.
Only .gguf files are treated as model candidates.
mmproj/projector filenames are auto-detected from naming patterns such as mmproj, projector, gemma4v, and gemma4a, then scored against the base model filename.
The same helper infers vision/audio support so API metadata can expose capabilities even before a model is loaded.
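
The projector auto-detection can be pictured with the following sketch. The naming patterns come from the list above, but the scoring rule (counting filename tokens shared with the base model) is an assumption chosen for illustration, as are the class and method names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical matcher; the token-overlap score is an assumed heuristic.
final class MmprojMatcher {
    private static final String[] NAME_PATTERNS =
            {"mmproj", "projector", "gemma4v", "gemma4a"};

    /** True for .gguf files whose name contains one of the projector patterns. */
    static boolean isProjectorName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (!lower.endsWith(".gguf")) {
            return false;
        }
        for (String pattern : NAME_PATTERNS) {
            if (lower.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    /** Splits a filename into lowercase alphanumeric tokens. */
    private static Set<String> tokens(String fileName) {
        Set<String> result = new HashSet<>();
        for (String token : fileName.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Number of tokens the two filenames have in common. */
    static int score(String modelName, String projectorName) {
        Set<String> shared = tokens(modelName);
        shared.retainAll(tokens(projectorName));
        return shared.size();
    }

    /** Returns the best-scoring projector candidate, or null if none qualifies. */
    static String bestProjector(String modelName, List<String> fileNames) {
        String best = null;
        int bestScore = -1;
        for (String candidate : fileNames) {
            if (!isProjectorName(candidate)) {
                continue;
            }
            int candidateScore = score(modelName, candidate);
            if (candidateScore > bestScore) {
                best = candidate;
                bestScore = candidateScore;
            }
        }
        return best;
    }
}
```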

3-3. Split GGUF and HTTPS downloads

Split GGUF files using names like name-00001-of-00005.gguf are detected, and missing shards are checked in both Java and native code.
Before native HTTPS downloads, the Android CA store is exported into a PEM bundle and passed to libcurl.
This keeps native downloads aligned with Android's certificate trust behavior instead of relying on a hardcoded custom certificate set.
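
The split-GGUF check on the Java side might look roughly like the sketch below. The class name and the exact regular expression are assumptions; the only detail taken from the text is the name-00001-of-00005.gguf naming scheme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the name-00001-of-00005.gguf shard convention.
final class SplitGguf {
    private static final Pattern SHARD =
            Pattern.compile("^(.*)-(\\d{5})-of-(\\d{5})\\.gguf$");

    /** Lists every shard name implied by one shard; a plain file yields itself. */
    static List<String> expectedShards(String fileName) {
        List<String> names = new ArrayList<>();
        Matcher match = SHARD.matcher(fileName);
        if (!match.matches()) {
            names.add(fileName);
            return names;
        }
        String prefix = match.group(1);
        int total = Integer.parseInt(match.group(3));
        for (int index = 1; index <= total; index++) {
            names.add(String.format(Locale.ROOT,
                    "%s-%05d-of-%05d.gguf", prefix, index, total));
        }
        return names;
    }

    /** Returns the expected shard names that are not in the set of present files. */
    static List<String> missingShards(String fileName, Set<String> presentFiles) {
        List<String> missing = new ArrayList<>();
        for (String shard : expectedShards(fileName)) {
            if (!presentFiles.contains(shard)) {
                missing.add(shard);
            }
        }
        return missing;
    }
}
```

A loader could refuse to initialize while `missingShards(...)` is non-empty, or use the list to decide which sibling files still need to be downloaded.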

4. ModelManager responsibilities

ModelManager is the shared execution coordinator for both the UI and the HTTP server. It owns locking, model initialization, reinitialization, parameter application, native logging setup, and HTTPS trust-store preparation.

4-1. Locking and reinitialization

It tracks busy, resetPending, and reinitializing so direct UI generation and HTTP requests cannot mutate the same native model state concurrently.
Only callers that succeed in tryAcquire() are allowed to load or generate, and all flows must call release() when done.
Forced reinitialization can cancel active generation, wait for the active lock to drain, and then fully reload the requested configuration.
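
The acquire/release discipline can be illustrated with a small sketch. Only the method names tryAcquire() and release() and the busy / resetPending flags come from the text; the class itself and the use of AtomicBoolean are assumptions, and the real ModelManager also tracks a reinitializing state that is omitted here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical single-owner lock in the spirit of tryAcquire()/release().
final class GenerationLock {
    private final AtomicBoolean busy = new AtomicBoolean(false);
    private final AtomicBoolean resetPending = new AtomicBoolean(false);

    /** Succeeds only when no reset is pending and nobody else holds the lock. */
    boolean tryAcquire() {
        if (resetPending.get()) {
            return false;
        }
        // compareAndSet makes check-and-claim a single atomic step.
        return busy.compareAndSet(false, true);
    }

    /** Must be called by the holder once loading or generation has finished. */
    void release() {
        busy.set(false);
    }

    void setResetPending(boolean pending) {
        resetPending.set(pending);
    }

    boolean isBusy() {
        return busy.get();
    }
}
```

Callers would normally wrap the guarded work in try/finally so that release() also runs when generation throws. This sketch leaves a small window between the resetPending check and the claim, which a production version would need to close.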

4-2. Load sequence

Load the configuration profile.
Resolve the model filename and destination path.
Resolve or auto-detect an mmproj file if multimodal support is needed.
Download the model if it is not present locally.
Unload existing model/context/mmproj if a different model is required.
Preload: call initWithMmproj() once with n_ctx=64.
Full init: restore the requested n_ctx and call initWithMmproj() again.
Apply sampling parameters with applyConfiguration().

That two-stage init helps surface file/load failures early before committing to the full requested context size.
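
The two-stage initialization can be sketched against a stand-in interface. `NativeBridge` and `setContextSize` are hypothetical names introduced for the example (the real code goes through LlamaNative and setLoadParameters(...)), and the boolean return value of `initWithMmproj` is an assumption.

```java
// Sketch of the preload-then-full-init sequence under the assumptions above.
final class TwoStageLoader {
    /** Hypothetical stand-in for the native calls named in the load sequence. */
    interface NativeBridge {
        void setContextSize(int nCtx);
        boolean initWithMmproj(String modelPath, String mmprojPath);
        void applyConfiguration();
    }

    /** Small context used only to surface file and load failures cheaply. */
    static final int PRELOAD_N_CTX = 64;

    static boolean load(NativeBridge bridge, String modelPath,
                        String mmprojPath, int requestedNCtx) {
        // Stage 1: preload with a tiny context so bad files fail early.
        bridge.setContextSize(PRELOAD_N_CTX);
        if (!bridge.initWithMmproj(modelPath, mmprojPath)) {
            return false;
        }
        // Stage 2: restore the requested context size and initialize again.
        bridge.setContextSize(requestedNCtx);
        if (!bridge.initWithMmproj(modelPath, mmprojPath)) {
            return false;
        }
        // Sampling parameters are applied only after the full init succeeded.
        bridge.applyConfiguration();
        return true;
    }
}
```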

4-3. Generation behavior

ModelManager does not build prompts itself; it receives already-built prompts from PromptTemplateManager and forwards them to generate() or generateWithMedia().
Vision/audio support is surfaced through supportsVision() and supportsAudio(), which reflect the currently loaded native state.
Log path and log level are applied once during singleton initialization and shared by both UI and server-driven inference.

5. JNI / C++ layer

5-1. Java-side native interface

LlamaNative loads llama_jni and exposes the following operations.

Method	Purpose
`download(url, path)`	Downloads models/mmproj files with libcurl
`initWithMmproj(modelPath, mmprojPath)`	Initializes the model and optional multimodal projector
`setLoadParameters(...)`	Sets n_ctx / n_threads / n_batch / GPU load parameters before init
`setParameters(...)`	Sets penalties, DRY, Mirostat, XTC, and related sampler settings
`generate(prompt)`, `generateWithMedia(prompt, media)`	Runs text-only or multimodal generation
`setTokenListener(listener)`	Registers token, completion, and error callbacks for streaming
`cancelGeneration()`	Requests cancellation from native generation loops
`getChatTemplate()`	Reads GGUF chat-template metadata from the loaded model
`supportsVision()`, `supportsAudio()`	Reports current loaded-model modality support

5-2. Native global state

g_model, g_ctx, and g_mtmd hold the loaded model, inference context, and multimodal context.
g_current_model_path and g_current_mmproj_path are used to avoid redundant initialization.
g_supports_vision and g_supports_audio expose modality capabilities back to Java.
g_token_listener plus JavaVM let native worker threads invoke Java callbacks safely.
g_cancel_generation is the shared cancellation flag used by both UI and HTTP flows.

5-3. initWithMmproj flow

Install fatal signal handlers so native crashes leave markers in native_crash.txt.
Register llama.cpp and mtmd log callbacks.
Verify model file, mmproj file, and split GGUF completeness.
Free existing model/context/mtmd state when required.
Call llama_backend_init() and verify registered ggml backends.
Load the model with llama_model_load_from_file() after applying n_gpu_layers.
Create the inference context via llama_init_from_model() with the selected context/thread/batch settings.
If needed, initialize multimodal support and record detected vision/audio capabilities.

5-4. generate flow

Clear existing model memory for a fresh prompt prefill.
Use prefill_text_prompt_locked() for text-only prompts or prefill_multimodal_prompt_locked() for prompts with image/audio inputs.
Construct a sampler chain from the current Java-provided settings.
Generate up to 1024 tokens in a loop of llama_sampler_sample(), llama_sampler_accept(), and llama_decode().
Stop on EOG, stop-sequence detection, context safety limit, or cancellation.
Emit delta tokens through notify_token_delta() and completion through notify_token_complete().

5-5. Sampler chain composition

The native layer maps Java config fields directly into llama.cpp sampler components in roughly this order.

penalties
→ DRY
→ top_n_sigma
→ top_k
→ typical
→ top_p
→ min_p
→ XTC
→ temperature / dynamic temperature
→ mirostat v1 or v2 / fallback dist

DRY sequence breakers default to \n,:,",* (newline, colon, double quote, and asterisk), matching the Java configuration default. The native layer unescapes them before calling llama_sampler_init_dry().
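
To make the unescaping step concrete, here is a sketch of how the default breaker string could be split and unescaped. The app does this in native code; the example is written in Java purely for illustration, and the comma-separated encoding and the set of handled escapes are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for a comma-separated, backslash-escaped breaker list.
final class DryBreakers {
    /** Turns backslash escapes such as \n or \t into the characters they name. */
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char current = raw.charAt(i);
            if (current == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case 'r': out.append('\r'); break;
                    default:  out.append(next); // covers \\ and \"
                }
            } else {
                out.append(current);
            }
        }
        return out.toString();
    }

    /** Splits on commas and unescapes each non-empty entry. */
    static List<String> parse(String commaSeparated) {
        List<String> breakers = new ArrayList<>();
        for (String part : commaSeparated.split(",")) {
            if (!part.isEmpty()) {
                breakers.add(unescape(part));
            }
        }
        return breakers;
    }
}
```

Under these assumptions the default value yields four breakers: a real newline character, a colon, a double quote, and an asterisk.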

5-6. Stop conditions and output cleanup

The native runtime watches common chat-template delimiters such as <|end|>, </s>, <|im_end|>, and <end_of_turn>.
UTF-8 safety checks trim incomplete byte sequences before returning output to Java.
Java then performs an additional response-marker cleanup step before displaying or serving text.
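
Both cleanup ideas can be sketched briefly. The native runtime implements them in C++; this Java version is illustrative only, the marker list is limited to the four delimiters named above, and the class and method names are hypothetical.

```java
// Illustrative stop-marker search and UTF-8 tail trimming.
final class OutputCleanup {
    private static final String[] STOP_MARKERS =
            {"<|end|>", "</s>", "<|im_end|>", "<end_of_turn>"};

    /** Index of the earliest stop marker in text, or -1 if none is present. */
    static int findStop(String text) {
        int earliest = -1;
        for (String marker : STOP_MARKERS) {
            int index = text.indexOf(marker);
            if (index >= 0 && (earliest < 0 || index < earliest)) {
                earliest = index;
            }
        }
        return earliest;
    }

    /**
     * Length of the prefix that ends on a complete UTF-8 sequence.
     * Only a truncated final character is trimmed; other malformed
     * input is returned at full length.
     */
    static int completeUtf8Length(byte[] bytes) {
        int length = bytes.length;
        int i = length - 1;
        int steps = 0;
        // Walk back over at most three continuation bytes (10xxxxxx).
        while (i >= 0 && steps < 3 && (bytes[i] & 0xC0) == 0x80) {
            i--;
            steps++;
        }
        if (i < 0) {
            return length;
        }
        int lead = bytes[i] & 0xFF;
        int needed;
        if (lead < 0x80) {
            needed = 1;
        } else if ((lead & 0xE0) == 0xC0) {
            needed = 2;
        } else if ((lead & 0xF0) == 0xE0) {
            needed = 3;
        } else if ((lead & 0xF8) == 0xF0) {
            needed = 4;
        } else {
            return length; // not a lead byte; leave the data untouched
        }
        int available = length - i;
        return available < needed ? i : length;
    }
}
```

Trimming matters because a token boundary can fall in the middle of a multi-byte character, and passing a half-written sequence to Java string conversion would produce replacement characters in the output.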

5-7. Downloading, logging, and crash traces

Native downloads use libcurl and optionally a CA bundle exported from Android's trust store.
Split GGUF handling can derive and fetch sibling shard URLs automatically.
Logs are written to ollama.log; Java crashes go to last_crash.txt; fatal native signals leave markers in native_crash.txt.

6. CMake / NDK build

For a deeper explanation of the JNI boundary and CMake structure from a general llama.cpp-integration perspective, see llama.cpp / JNI / CMake Deep Dive.

6-1. Gradle / NDK characteristics

The build uses compileSdk 36, targetSdk 36 (Android 16), and minSdk 24. To match the edge-to-edge enforcement of targetSdk 36, release builds enable R8 (code shrinking / obfuscation).
The app builds only for arm64-v8a.
Native build uses src/main/cpp/CMakeLists.txt with -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON.
NDK version is fixed to 27.2.12479018 for current Android compatibility, including 16 KB page-size support.

6-2. CMake structure

The project builds a single shared library named llama_jni.
llama.cpp sources are vendored directly under app/src/main/cpp/llama and compiled into the shared library instead of being consumed as a prebuilt binary.
ggml CPU sources and ARM-specific implementations are compiled into the same target.
mtmd sources are also included, enabling multimodal projector-backed models (Qwen3-VL, Gemma 4, etc.) in the same runtime.
common / speculative sources are also linked, so models that embed an MTP head (Qwen3.5-MTP, Gemma 4, etc.) can use MTP (speculative decoding, draft-mtp). This is enabled per model in Settings and is off by default.
Static libcurl and three static mbedTLS libraries are imported and linked explicitly.

6-3. Android-specific linker behavior

Linker options include -Wl,-z,max-page-size=16384 and -Wl,-z,common-page-size=16384.
-align-segments is intentionally avoided because Android's NDK linker does not support it.
GGML_USE_CPU and GGML_USE_K_QUANTS are defined at compile time.

6-4. Why this matters

This means the app is not merely "calling an external engine". It ships a custom Android-native runtime that directly embeds llama.cpp, ggml, mtmd, curl, and TLS support. UI actions therefore control a locally compiled inference stack inside the APK itself.

7. API server design

OllamaApiServer is a lightweight custom HTTP server running inside a foreground service. API routes and the WebUI share the same port; the default port is 11434.

7-1. Exposed endpoints

Route	Main purpose	Notes
`POST /api/generate`	Single-prompt generation	Ollama-style NDJSON streaming or non-streaming
`POST /api/chat`	Conversation generation from messages	Builds model-family-specific multi-turn prompts
`GET/POST /api/tags`	Model list	Treats saved configurations as model names
`POST /v1/chat/completions`	OpenAI-compatible chat endpoint	SSE streaming supported
`GET /v1/models`, `/models`	Model inventory and status	Reports loaded state, modalities, file path, etc.
`GET /props`	llama.cpp WebUI model props	Includes generation defaults, chat template, and WebUI settings
`GET /slots`	Slot state	Total slot count is fixed at 1
`GET /health`, `/v1/health`	Health check	Returns `role` and `webui=true`
`/`, `/index.html`, static assets	Bundled WebUI	Served from app assets with in-memory cache

7-2. Concurrency and queue behavior

Only one generation slot exists, so the server runs one generation at a time.
If the model is busy, up to 10 waiting requests are queued.
Queued requests can wait up to 60 seconds before receiving a 503.
If a model reset is pending, new requests are rejected and queued requests are aborted as unavailable.
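
The queueing rules above can be approximated with a fair semaphore and a waiter counter. This is a sketch under assumptions: the class, the `Result` enum, and the constructor parameters are hypothetical, and unlike the behavior described above, waiters here notice a pending reset only after they wake up.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical single-slot queue with a bounded number of waiters.
final class GenerationSlotQueue {
    enum Result { ACQUIRED, QUEUE_FULL, TIMED_OUT, RESET_PENDING }

    private final Semaphore slot = new Semaphore(1, true); // one slot, FIFO waiters
    private final AtomicInteger waiting = new AtomicInteger();
    private final int maxWaiting;
    private final long timeoutMillis;
    private volatile boolean resetPending;

    GenerationSlotQueue(int maxWaiting, long timeoutMillis) {
        this.maxWaiting = maxWaiting;
        this.timeoutMillis = timeoutMillis;
    }

    Result acquire() {
        if (resetPending) {
            return Result.RESET_PENDING;
        }
        if (slot.tryAcquire()) {
            return Result.ACQUIRED; // slot was free, no queueing needed
        }
        if (waiting.incrementAndGet() > maxWaiting) {
            waiting.decrementAndGet();
            return Result.QUEUE_FULL;
        }
        try {
            if (!slot.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return Result.TIMED_OUT; // the caller would answer 503
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Result.TIMED_OUT;
        } finally {
            waiting.decrementAndGet();
        }
        if (resetPending) {
            slot.release();
            return Result.RESET_PENDING;
        }
        return Result.ACQUIRED;
    }

    void release() {
        slot.release();
    }

    void setResetPending(boolean pending) {
        resetPending = pending;
    }
}
```

With the limits quoted above, the server side would construct this as `new GenerationSlotQueue(10, 60_000)`.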

7-3. Internal processing for /api/generate and /api/chat

Parse the request JSON.
Acquire the generation slot through acquireGenerationSlot().
Load the requested configuration and model if needed.
Use GGUF chat-template metadata, custom template, system prompt, and Think settings to build the final prompt.
If the request includes tools or the app has shared tool configuration, execution switches into SharedToolManager.generateWithTools(), which can continue auto tool execution for up to 10 turns.
When streaming, start a queue and writer thread so the native generation thread is not blocked by network writes.
If the client disconnects or an end marker is detected, call cancelGeneration().
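
The queue-plus-writer-thread arrangement used for streaming can be sketched as follows. All names are hypothetical; the `onClientGone` callback stands in for the call to cancelGeneration() that the last step describes.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical writer that decouples token production from network writes.
final class StreamingWriter {
    private static final byte[] END = new byte[0]; // sentinel, compared by identity

    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final AtomicBoolean clientGone = new AtomicBoolean(false);
    private final Thread writer;

    StreamingWriter(OutputStream out, Runnable onClientGone) {
        writer = new Thread(() -> {
            try {
                while (true) {
                    byte[] chunk = queue.take();
                    if (chunk == END) {
                        break;
                    }
                    out.write(chunk);
                    out.flush();
                }
            } catch (IOException e) {
                // A failed write is treated as a client disconnect.
                clientGone.set(true);
                onClientGone.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "stream-writer");
        writer.start();
    }

    /** Called from the generation thread; never blocks on the socket. */
    void onToken(String text) {
        if (!clientGone.get()) {
            queue.offer(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Signals the end of the stream and waits for pending writes to drain. */
    void finish() {
        queue.offer(END);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The unbounded queue keeps the token callback cheap, which is the point of the design: a slow or stalled client delays only the writer thread, not the native decode loop.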

7-4. Multimodal request handling

/api/chat and /v1/chat/completions can accept content[] parts such as text, input_text, image_url, and input_audio.
image_url supports HTTP/HTTPS images or base64 data URLs. Remote downloads are capped at 10 MB.
input_audio accepts only base64-encoded wav or mp3.
Media parts are replaced by an internal <__media__> marker in text, while raw bytes are forwarded to JNI as byte[][].
If the currently loaded model does not support the requested modality, the server returns 400.
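
The marker substitution can be illustrated with a small sketch. The `Part` and `Flattened` types are hypothetical stand-ins for parsed request JSON, only base64 inputs are handled (remote image download is left out), and java.util.Base64 is used so the example runs on a desktop JVM; it is available on Android only from API 26, so the minSdk 24 app would need android.util.Base64 instead.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical flattening of content[] parts into prompt text plus raw media.
final class MediaParts {
    static final String MARKER = "<__media__>";

    static final class Part {
        final String type;
        final String value;
        Part(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    static final class Flattened {
        final String prompt;
        final byte[][] media;
        Flattened(String prompt, byte[][] media) {
            this.prompt = prompt;
            this.media = media;
        }
    }

    /** Decodes a data:<mime>;base64,<payload> URL into raw bytes. */
    static byte[] decodeDataUrl(String url) {
        int comma = url.indexOf(',');
        if (!url.startsWith("data:") || comma < 0
                || !url.substring(0, comma).endsWith(";base64")) {
            throw new IllegalArgumentException("expected a base64 data URL");
        }
        return Base64.getDecoder().decode(url.substring(comma + 1));
    }

    /** Replaces each media part with the marker and collects its bytes in order. */
    static Flattened flatten(List<Part> parts) {
        StringBuilder prompt = new StringBuilder();
        List<byte[]> media = new ArrayList<>();
        for (Part part : parts) {
            switch (part.type) {
                case "text":
                case "input_text":
                    prompt.append(part.value);
                    break;
                case "image_url":
                    media.add(decodeDataUrl(part.value));
                    prompt.append(MARKER);
                    break;
                case "input_audio":
                    media.add(Base64.getDecoder().decode(part.value));
                    prompt.append(MARKER);
                    break;
                default:
                    throw new IllegalArgumentException(
                            "unsupported part type: " + part.type);
            }
        }
        return new Flattened(prompt.toString(), media.toArray(new byte[0][]));
    }
}
```

Keeping the media array in the same order as the markers is what lets the native side pair each marker with its bytes.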

7-5. OpenAI compatibility layer

/v1/chat/completions returns SSE when streaming is enabled.
Internally it still uses the same prompt builder, ModelManager, and JNI runtime as the Ollama-compatible routes; only the response envelope differs.
If n_predict=0 is present, the server follows a pre-encode-only branch and returns early.
Some request-level generation overrides are applied onto the configuration before the native sampler settings are re-applied.
If tool_calls is missing, the runtime also tries to extract tool-call markers from reasoning_content or the returned content body.
The server forwards tool_choice and parallel_tool_calls, combining request tools with shared MCP / Function Definitions settings for the internal tool-execution loop.
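
Because only the response envelope differs between the two API families, the contrast can be shown with two small framing helpers. This is a sketch, not the server's actual serializer: the field sets are reduced to the minimum commonly seen in Ollama NDJSON and OpenAI SSE streams, and the class and method names are hypothetical.

```java
// Hypothetical framing helpers for the two streaming envelopes.
final class ResponseEnvelopes {
    /** Minimal JSON string encoder for the characters that must be escaped. */
    static String jsonString(String text) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    /** Ollama-style NDJSON: one JSON object per line. */
    static String ollamaChunk(String model, String token, boolean done) {
        return "{\"model\":" + jsonString(model)
                + ",\"response\":" + jsonString(token)
                + ",\"done\":" + done + "}\n";
    }

    /** OpenAI-style SSE: a "data:" line followed by a blank line. */
    static String openAiChunk(String model, String token) {
        return "data: {\"object\":\"chat.completion.chunk\",\"model\":"
                + jsonString(model)
                + ",\"choices\":[{\"index\":0,\"delta\":{\"content\":"
                + jsonString(token) + "}}]}\n\n";
    }

    /** Terminator conventionally sent at the end of an OpenAI-style stream. */
    static String openAiDone() {
        return "data: [DONE]\n\n";
    }
}
```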

8. WebUI delivery

8-1. What the bundled WebUI is

The APK bundles webui/index.html, bundle.js, bundle.css, and loading.html under app assets.
handleWebUi() serves these files for / and related asset paths.
Assets are cached in memory through webUiAssetCache after first load.

8-2. How WebUI and API fit together

The WebUI is not a separate server; it is routed by the same OllamaApiServer.
/props and /slots provide llama.cpp-style metadata used by the WebUI, including generation defaults plus webui_settings for the system message, Think display, and shared MCP / Function Definitions settings.
Unknown client-side routes are normalized back to index.html, making the bundled WebUI behave like a small SPA served from the same port.
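
The asset caching and SPA fallback can be sketched together. The four asset names come from the bundle list above; the class, the `loader` function (standing in for reading from app assets), and the path-normalization details are assumptions, and API routes are presumed to have been matched before this code runs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical router for the bundled WebUI with an in-memory asset cache.
final class WebUiRouter {
    private static final Set<String> ASSETS = new HashSet<>(
            Arrays.asList("index.html", "bundle.js", "bundle.css", "loading.html"));

    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> loader;

    WebUiRouter(Function<String, byte[]> loader) {
        this.loader = loader;
    }

    /** Maps a request path to a bundled asset; unknown routes become index.html. */
    static String resolveAsset(String requestPath) {
        String path = requestPath;
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        while (path.startsWith("/")) {
            path = path.substring(1);
        }
        return ASSETS.contains(path) ? path : "index.html";
    }

    /** Loads each asset at most once, then serves it from memory. */
    byte[] serve(String requestPath) {
        return cache.computeIfAbsent("webui/" + resolveAsset(requestPath), loader);
    }
}
```

Falling back to index.html lets client-side routes survive a page reload, since the browser's request for such a route still receives the application shell.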

8-3. Operational meaning

The Android app effectively bundles three things together on one device: a native local inference runtime, an Ollama/OpenAI-compatible HTTP surface, and a browser-facing WebUI that consumes that local API. No external server is required for this stack to function.

9. Relationship to the public website

The Mick Lab website itself is a Firebase Hosting static site served from public/.
On Hosting, /api/** is rewritten to api-fallback.json, while other routes fall back to /index.html.
That means these public pages are static documentation about the Android app's local API/WebUI implementation; they are not the live API server running inside the app.
All API and WebUI behavior described on this page refers to the in-app OllamaApiServer and OllamaForegroundService, not Firebase Hosting.