Technical Specification
This page documents the current Android implementation of LLM tester with llama.cpp, focusing on the full path from the front app to configuration storage, JNI, CMake/NDK build, llama.cpp/ggml, the on-device API server, and bundled WebUI delivery.
Related docs: User Manual | llama.cpp / JNI / CMake Deep Dive | Privacy Policy
1. System overview
The app has two main execution paths: a direct UI-to-native inference path, and an on-device HTTP path used by the Ollama-compatible / OpenAI-compatible server. In both cases, model loading, busy-state control, parameter application, and reinitialization are centralized in ModelManager, and actual inference ends inside the llama_jni shared library that embeds llama.cpp, ggml, and mtmd.
MainActivity / SettingsActivity
↓
ConfigurationManager / ModelFileHelper
↓
ModelManager
↓
LlamaNative (Java native wrapper)
↓ JNI
llama_jni (jni_llama.cpp)
↓
llama.cpp + ggml + mtmd + libcurl + mbedTLS
Separate path:
OllamaForegroundService
↓
OllamaApiServer (HTTP, same-device / LAN)
↓
ModelManager → LlamaNative → JNI → llama.cpp
| Layer | Main classes | Role |
|---|---|---|
| UI | MainActivity / SettingsActivity / DocumentsActivity | Input, output display, log viewing, settings editing, API/WebUI control |
| Configuration | ConfigurationManager | Profile JSON persistence and defaults |
| Model assets | ModelFileHelper | GGUF/mmproj path resolution, storage directory, modality inference |
| Execution control | ModelManager | Loading, reloading, locking, native invocation coordination |
| Java ↔ Native | LlamaNative | Exports native methods and registers token callbacks |
| Native runtime | jni_llama.cpp | Global state, init, sampling, stop sequences, logging, crash markers |
| HTTP | OllamaApiServer / OllamaForegroundService | Ollama API, OpenAI API, WebUI delivery, queue control |
2. Front app layer
2-1. MainActivity
- The main screen combines prompt input, model output, processing logs, log-file inspection, and API/WebUI controls.
- When `Send` is pressed, the activity first checks whether the selected profile's model is already loaded. If not, it calls `ModelManager.loadConfiguration()` and only then enters `processGeneration()`.
- Direct prompt input uses `PromptTemplateManager.buildPromptForDirectInputWithSelection()` so GGUF metadata, custom templates, Settings system prompt, and Think behavior are resolved into one final prompt.
- When streaming is enabled, the activity registers a `LlamaNative.TokenListener` and incrementally appends filtered tokens to the UI. Otherwise it waits for the full `generate()` result.
- The reinitialize action can interrupt current work, disconnect API clients if needed, and force a clean reload through `forceReinitializeConfiguration()`.
2-2. SettingsActivity
- Edits profile name, model URL / local GGUF import, `n_ctx`, `n_threads`, `top_p`, DRY, Mirostat, Think, GPU offload layers, API port, display language, and log level.
- Local GGUF import uses the Android document picker and copies the selected file into the app's model storage directory.
- Download progress is delivered through `LlamaNative.DownloadProgressListener` and reflected in the progress bar and model file status text.
- GPU offload uses a seek bar and converts the top range into `-1`, meaning "all available layers".
- The Documents button opens the in-app document viewer for the manual and privacy policy.
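The seek-bar mapping described above could look roughly like the sketch below. The class name, the slider maximum, and the position at which the "top range" begins are assumptions for illustration; this page only states that the top of the range is converted to `-1`.

```java
/** Hypothetical sketch: map a seek-bar position to a GPU offload layer count. */
final class GpuOffloadMapper {
    /** Assumed slider maximum; the real value is not given on this page. */
    static final int SEEK_BAR_MAX = 100;
    /** Assumed first position of the "top range" that means "all layers". */
    static final int ALL_LAYERS_THRESHOLD = 100;
    /** Value described on this page as "all available layers". */
    static final int ALL_LAYERS = -1;

    /** Clamps the slider position and converts the top range into -1. */
    static int toGpuLayers(int progress) {
        int clamped = Math.max(0, Math.min(SEEK_BAR_MAX, progress));
        return clamped >= ALL_LAYERS_THRESHOLD ? ALL_LAYERS : clamped;
    }
}
```

With these assumed bounds, a position of 32 requests 32 offloaded layers, while the maximum position requests all of them.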
2-3. Startup and background integration
- `LlamaApplication` configures native logging early and writes uncaught Java exceptions to `last_crash.txt`.
- `OllamaForegroundService` keeps the API/WebUI stack alive in the background using a persistent notification.
- MainActivity listens for service broadcasts to update server status and append processing logs in real time.
3. Configuration and model files
3-1. ConfigurationManager
Each profile is stored as configs/*.json in the app's external files area. A default profile is generated automatically and includes model reference, generation parameters, Think state, GPU offload, custom template, system prompt, and optional mmproj reference.
| Category | Main fields | Purpose |
|---|---|---|
| Model | modelUrl, multimodalProjectorUrl | GGUF and optional mmproj references |
| Load-time | nCtx, nThreads, nBatch, gpuOffloadLayers | Context, CPU, and GPU-loading behavior |
| Sampling | temp, topP, topK, penalties, Mirostat, DRY, XTC | Mapped directly into the native sampler chain |
| Prompt construction | systemPrompt, customChatTemplate, enableThinking | Controls the final prompt format |
| UI/runtime | streaming | Default streaming behavior for UI and API |
- Shared MCP settings and shared Function Definitions are stored separately in `SharedPreferences`, outside the per-model profile JSON.
- The persisted keys are `shared_mcp_servers_json`, `shared_function_definitions_json`, and their enable/disable switches for use outside the WebUI.
- `/props` exposes these values inside `webui_settings` as `sharedMcpServers` and `sharedFunctionDefinitions`, allowing the bundled WebUI to combine them with its own local settings.
3-2. ModelFileHelper
- The model storage directory is `getExternalFilesDir(null)` when available, otherwise internal files.
- It extracts filenames from URLs or local references and resolves them into app-local storage paths.
- Only `.gguf` files are treated as model candidates.
- mmproj/projector filenames are auto-detected from naming patterns such as `mmproj`, `projector`, `gemma4v`, and `gemma4a`, then scored against the base model filename.
- The same helper infers vision/audio support so API metadata can expose capabilities even before a model is loaded.
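As a rough illustration of the filename handling listed above, the sketch below extracts a filename from a URL or local reference, applies the `.gguf` filter, and checks the projector naming patterns. The class and method names are hypothetical, and the scoring against the base model filename is not reproduced because its rules are not described here.

```java
import java.util.Locale;

/** Hypothetical sketch of filename extraction and projector detection. */
final class ModelFileNames {
    // Naming patterns listed on this page for mmproj/projector files.
    private static final String[] PROJECTOR_PATTERNS = {"mmproj", "projector", "gemma4v", "gemma4a"};

    /** Returns the last path segment, ignoring any query string or fragment. */
    static String fileNameFromReference(String reference) {
        String s = reference;
        int query = s.indexOf('?');
        if (query >= 0) s = s.substring(0, query);
        int fragment = s.indexOf('#');
        if (fragment >= 0) s = s.substring(0, fragment);
        int slash = s.lastIndexOf('/');
        return slash >= 0 ? s.substring(slash + 1) : s;
    }

    /** Only .gguf files count as model candidates. */
    static boolean isModelCandidate(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".gguf");
    }

    /** True when a .gguf filename contains one of the projector patterns. */
    static boolean looksLikeProjector(String fileName) {
        if (!isModelCandidate(fileName)) return false;
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String pattern : PROJECTOR_PATTERNS) {
            if (lower.contains(pattern)) return true;
        }
        return false;
    }
}
```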
3-3. Split GGUF and HTTPS downloads
- Split GGUF files using names like `name-00001-of-00005.gguf` are detected, and missing shards are checked in both Java and native code.
- Before native HTTPS downloads, the Android CA store is exported into a PEM bundle and passed to libcurl.
- This keeps native downloads aligned with Android's certificate trust behavior instead of relying on a hardcoded custom certificate set.
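The shard check can be pictured with the sketch below, which parses the `name-00001-of-00005.gguf` pattern, derives the full set of sibling shard names, and reports which ones are absent. The class name, method names, and the use of a `Set` of present filenames are assumptions; the app's actual Java and native checks may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of split-GGUF shard detection. */
final class SplitGguf {
    // Matches names such as "name-00001-of-00005.gguf".
    private static final Pattern SHARD =
            Pattern.compile("^(.*)-(\\d{5})-of-(\\d{5})\\.gguf$");

    /** Returns every shard name in the set, or just the file itself if it is not split. */
    static List<String> expectedShards(String fileName) {
        List<String> shards = new ArrayList<>();
        Matcher m = SHARD.matcher(fileName);
        if (!m.matches()) {
            shards.add(fileName);
            return shards;
        }
        String base = m.group(1);
        int total = Integer.parseInt(m.group(3));
        for (int i = 1; i <= total; i++) {
            shards.add(String.format(Locale.ROOT, "%s-%05d-of-%05d.gguf", base, i, total));
        }
        return shards;
    }

    /** Returns the expected shards that are not in the set of present filenames. */
    static List<String> missingShards(String fileName, Set<String> presentFiles) {
        List<String> missing = new ArrayList<>();
        for (String shard : expectedShards(fileName)) {
            if (!presentFiles.contains(shard)) missing.add(shard);
        }
        return missing;
    }
}
```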
4. ModelManager responsibilities
ModelManager is the shared execution coordinator for both the UI and the HTTP server. It owns locking, model initialization, reinitialization, parameter application, native logging setup, and HTTPS trust-store preparation.
4-1. Locking and reinitialization
- It tracks `busy`, `resetPending`, and `reinitializing` so direct UI generation and HTTP requests cannot mutate the same native model state concurrently.
- Only callers that succeed in `tryAcquire()` are allowed to load or generate, and all flows must call `release()` when done.
- Forced reinitialization can cancel active generation, wait for the active lock to drain, and then fully reload the requested configuration.
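A minimal sketch of this acquire/release discipline follows, assuming an `AtomicBoolean` for `busy` and a volatile flag for `resetPending`. The class name and the `runExclusive` helper are hypothetical, and the real ModelManager also tracks `reinitializing`, which is omitted here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of a single-owner generation lock. */
final class GenerationLock {
    private final AtomicBoolean busy = new AtomicBoolean(false);
    private volatile boolean resetPending = false;

    /** Succeeds for at most one caller at a time, and never while a reset is pending. */
    boolean tryAcquire() {
        if (resetPending) return false;
        return busy.compareAndSet(false, true);
    }

    /** Must be called by the caller that acquired the lock. */
    void release() {
        busy.set(false);
    }

    void setResetPending(boolean pending) {
        resetPending = pending;
    }

    /** Runs the work only if the lock was acquired, and always releases afterwards. */
    boolean runExclusive(Runnable work) {
        if (!tryAcquire()) return false;
        try {
            work.run();
            return true;
        } finally {
            release();
        }
    }
}
```

Wrapping the work in `try`/`finally` is one way to honor the rule that every flow releases the lock, even when loading or generation throws.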
4-2. Load sequence
- Load the configuration profile.
- Resolve the model filename and destination path.
- Resolve or auto-detect an mmproj file if multimodal support is needed.
- Download the model if it is not present locally.
- Unload existing model/context/mmproj if a different model is required.
- Preload: call `initWithMmproj()` once with `n_ctx=64`.
- Full init: restore the requested `n_ctx` and call `initWithMmproj()` again.
- Apply sampling parameters with `applyConfiguration()`.
That two-stage init helps surface file/load failures early before committing to the full requested context size.
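The two-stage init can be sketched as below. The `NativeRuntime` interface is a hypothetical stand-in for the native wrapper: `setContextSize` abbreviates `setLoadParameters(...)`, and the boolean return value of `initWithMmproj` is an assumption, since this page does not give the real signatures.

```java
/** Hypothetical sketch of the preload-then-full-init sequence. */
final class TwoStageLoader {
    /** Stand-in for the native wrapper; signatures are assumed for illustration. */
    interface NativeRuntime {
        void setContextSize(int nCtx);
        boolean initWithMmproj(String modelPath, String mmprojPath);
    }

    /** Small context used for the preload pass, as described on this page. */
    static final int PRELOAD_N_CTX = 64;

    /** Returns false as soon as either stage fails. */
    static boolean load(NativeRuntime runtime, String modelPath, String mmprojPath, int requestedNCtx) {
        // Stage 1: cheap preload that surfaces file and load failures early.
        runtime.setContextSize(PRELOAD_N_CTX);
        if (!runtime.initWithMmproj(modelPath, mmprojPath)) {
            return false;
        }
        // Stage 2: restore the requested context size and initialize again.
        runtime.setContextSize(requestedNCtx);
        return runtime.initWithMmproj(modelPath, mmprojPath);
    }
}
```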
4-3. Generation behavior
- ModelManager does not build prompts itself; it receives already-built prompts from `PromptTemplateManager` and forwards them to `generate()` or `generateWithMedia()`.
- Vision/audio support is surfaced through `supportsVision()` and `supportsAudio()`, which reflect the currently loaded native state.
- Log path and log level are applied once during singleton initialization and shared by both UI and server-driven inference.
5. JNI / C++ layer
5-1. Java-side native interface
LlamaNative loads llama_jni and exposes the following operations.
| Method | Purpose |
|---|---|
| `download(url, path)` | Downloads models/mmproj files with libcurl |
| `initWithMmproj(modelPath, mmprojPath)` | Initializes the model and optional multimodal projector |
| `setLoadParameters(...)` | Sets n_ctx / n_threads / n_batch / GPU load parameters before init |
| `setParameters(...)` | Sets penalties, DRY, Mirostat, XTC, and related sampler settings |
| `generate(prompt)`, `generateWithMedia(prompt, media)` | Runs text-only or multimodal generation |
| `setTokenListener(listener)` | Registers token, completion, and error callbacks for streaming |
| `cancelGeneration()` | Requests cancellation from native generation loops |
| `getChatTemplate()` | Reads GGUF chat-template metadata from the loaded model |
| `supportsVision()`, `supportsAudio()` | Reports current loaded-model modality support |
5-2. Native global state
- `g_model`, `g_ctx`, and `g_mtmd` hold the loaded model, inference context, and multimodal context.
- `g_current_model_path` and `g_current_mmproj_path` are used to avoid redundant initialization.
- `g_supports_vision` and `g_supports_audio` expose modality capabilities back to Java.
- `g_token_listener` plus `JavaVM` let native worker threads invoke Java callbacks safely.
- `g_cancel_generation` is the shared cancellation flag used by both UI and HTTP flows.
5-3. initWithMmproj flow
- Install fatal signal handlers so native crashes leave markers in `native_crash.txt`.
- Register llama.cpp and mtmd log callbacks.
- Verify model file, mmproj file, and split GGUF completeness.
- Free existing model/context/mtmd state when required.
- Call `llama_backend_init()` and verify registered ggml backends.
- Load the model with `llama_model_load_from_file()` after applying `n_gpu_layers`.
- Create the inference context via `llama_init_from_model()` with the selected context/thread/batch settings.
- If needed, initialize multimodal support and record detected vision/audio capabilities.
5-4. generate flow
- Clear existing model memory for a fresh prompt prefill.
- Use `prefill_text_prompt_locked()` for text-only prompts or `prefill_multimodal_prompt_locked()` for prompts with image/audio inputs.
- Construct a sampler chain from the current Java-provided settings.
- Run up to `1024` generated tokens using `llama_sampler_sample()`, `llama_sampler_accept()`, and `llama_decode()`.
- Stop on EOG, stop-sequence detection, context safety limit, or cancellation.
- Emit delta tokens through `notify_token_delta()` and completion through `notify_token_complete()`.
5-5. Sampler chain composition
The native layer maps Java config fields directly into llama.cpp sampler components in roughly this order.
penalties → DRY → top_n_sigma → top_k → typical → top_p → min_p → XTC → temperature / dynamic temperature → mirostat v1 or v2 / fallback dist
DRY sequence breakers default to `\n`, `:`, `"`, and `*`, matching the Java configuration default. The native layer unescapes them before calling `llama_sampler_init_dry()`.
5-6. Stop conditions and output cleanup
- The native runtime watches common chat-template delimiters such as `<|end|>`, `</s>`, `<|im_end|>`, and `<end_of_turn>`.
- UTF-8 safety checks trim incomplete byte sequences before returning output to Java.
- Java then performs an additional response-marker cleanup step before displaying or serving text.
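To make the two cleanup ideas concrete, here is a Java sketch of cutting text at the first stop marker and of trimming a trailing incomplete UTF-8 sequence. The real checks run in native C++; the class and method names below are hypothetical, and the marker list only contains the examples named above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of stop-marker cutting and UTF-8 tail trimming. */
final class OutputCleanup {
    static final String[] STOP_MARKERS = {"<|end|>", "</s>", "<|im_end|>", "<end_of_turn>"};

    /** Returns the text before the earliest stop marker, or the text unchanged. */
    static String cutAtStopMarker(String text) {
        int cut = text.length();
        for (String marker : STOP_MARKERS) {
            int index = text.indexOf(marker);
            if (index >= 0 && index < cut) cut = index;
        }
        return text.substring(0, cut);
    }

    /** Length of the longest prefix that does not end in a truncated multi-byte sequence. */
    static int completeUtf8Length(byte[] bytes) {
        int n = bytes.length;
        int i = n - 1;
        int back = 0;
        // Walk back over at most three continuation bytes (10xxxxxx).
        while (i >= 0 && back < 3 && (bytes[i] & 0xC0) == 0x80) {
            i--;
            back++;
        }
        if (i < 0) return n; // no lead byte found; left untouched in this sketch
        int lead = bytes[i] & 0xFF;
        int needed;
        if (lead < 0x80) needed = 1;
        else if ((lead & 0xE0) == 0xC0) needed = 2;
        else if ((lead & 0xF0) == 0xE0) needed = 3;
        else if ((lead & 0xF8) == 0xF0) needed = 4;
        else return n; // invalid lead byte; left untouched in this sketch
        int available = n - i;
        return available < needed ? i : n;
    }

    /** Decodes only the complete prefix of the byte buffer. */
    static String decodeComplete(byte[] bytes) {
        return new String(bytes, 0, completeUtf8Length(bytes), StandardCharsets.UTF_8);
    }
}
```

Trimming matters for streaming because a multi-byte character can be split across two tokens; emitting the first half alone would produce a replacement character on the Java side.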
5-7. Downloading, logging, and crash traces
- Native downloads use libcurl and optionally a CA bundle exported from Android's trust store.
- Split GGUF handling can derive and fetch sibling shard URLs automatically.
- Logs are written to `ollama.log`; Java crashes go to `last_crash.txt`; fatal native signals leave markers in `native_crash.txt`.
6. CMake / NDK build
For a deeper explanation of the JNI boundary and CMake structure from a general llama.cpp-integration perspective, see llama.cpp / JNI / CMake Deep Dive.
6-1. Gradle / NDK characteristics
- `compileSdk 35`, `targetSdk 35`, and `minSdk 24`.
- The app builds only for `arm64-v8a`.
- Native build uses `src/main/cpp/CMakeLists.txt` with `-DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON`.
- NDK version is fixed to `27.2.12479018` for current Android compatibility, including 16 KB page-size support.
6-2. CMake structure
- The project builds a single shared library named `llama_jni`.
- llama.cpp sources are vendored directly under `app/src/main/cpp/llama` and compiled into the shared library instead of being consumed as a prebuilt binary.
- ggml CPU sources and ARM-specific implementations are compiled into the same target.
- mtmd sources are also included, enabling multimodal projector-backed models in the same runtime.
- Static `libcurl` and three static mbedTLS libraries are imported and linked explicitly.
6-3. Android-specific linker behavior
- Linker options include `-Wl,-z,max-page-size=16384` and `-Wl,-z,common-page-size=16384`.
- `-align-segments` is intentionally avoided because Android's NDK linker does not support it.
- `GGML_USE_CPU` and `GGML_USE_K_QUANTS` are defined at compile time.
6-4. Why this matters
This means the app is not merely "calling an external engine". It ships a custom Android-native runtime that directly embeds llama.cpp, ggml, mtmd, curl, and TLS support. UI actions therefore control a locally compiled inference stack inside the APK itself.
7. API server design
OllamaApiServer is a lightweight custom HTTP server running inside a foreground service. API routes and the WebUI share the same port; the default port is 11434.
7-1. Exposed endpoints
| Route | Main purpose | Notes |
|---|---|---|
| `POST /api/generate` | Single-prompt generation | Ollama-style NDJSON streaming or non-streaming |
| `POST /api/chat` | Conversation generation from messages | Builds model-family-specific multi-turn prompts |
| `GET/POST /api/tags` | Model list | Treats saved configurations as model names |
| `POST /v1/chat/completions` | OpenAI-compatible chat endpoint | SSE streaming supported |
| `GET /v1/models`, `/models` | Model inventory and status | Reports loaded state, modalities, file path, etc. |
| `GET /props` | llama.cpp WebUI model props | Includes generation defaults, chat template, and WebUI settings |
| `GET /slots` | Slot state | Total slot count is fixed at 1 |
| `GET /health`, `/v1/health` | Health check | Returns role and `webui=true` |
| `/`, `/index.html`, static assets | Bundled WebUI | Served from app assets with in-memory cache |
7-2. Concurrency and queue behavior
- Only one generation slot exists. The server is effectively single-generation at a time.
- If the model is busy, up to `10` waiting requests are queued.
- Queued requests can wait up to `60` seconds before receiving a `503`.
- If a model reset is pending, new requests are rejected and queued requests are aborted as unavailable.
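One way to picture the single slot with a bounded, time-limited wait queue is the sketch below, built on a fair `Semaphore`. The class name and status constants are hypothetical, and returning `503` immediately when the queue is already full is an assumption; this page only states the `503` for waits that exceed the time limit. Reset handling is omitted.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of one generation slot with a bounded wait queue. */
final class GenerationSlotQueue {
    static final int STATUS_ACQUIRED = 0;
    static final int STATUS_UNAVAILABLE = 503;

    private final Semaphore slot = new Semaphore(1, true); // exactly one generation at a time
    private final AtomicInteger waiting = new AtomicInteger();
    private final int maxWaiting;
    private final long maxWaitMillis;

    GenerationSlotQueue(int maxWaiting, long maxWaitMillis) {
        this.maxWaiting = maxWaiting;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Limits described on this page: 10 queued requests, 60 seconds each. */
    static GenerationSlotQueue withDocumentedLimits() {
        return new GenerationSlotQueue(10, 60_000L);
    }

    /** Returns STATUS_ACQUIRED, or STATUS_UNAVAILABLE on a full queue or timeout. */
    int acquire() {
        if (slot.tryAcquire()) return STATUS_ACQUIRED;
        if (waiting.incrementAndGet() > maxWaiting) {
            waiting.decrementAndGet();
            return STATUS_UNAVAILABLE; // assumed behavior when the queue is full
        }
        try {
            return slot.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS)
                    ? STATUS_ACQUIRED
                    : STATUS_UNAVAILABLE;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return STATUS_UNAVAILABLE;
        } finally {
            waiting.decrementAndGet();
        }
    }

    void release() {
        slot.release();
    }
}
```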
7-3. Internal processing for /api/generate and /api/chat
- Parse the request JSON.
- Acquire the generation slot through `acquireGenerationSlot()`.
- Load the requested configuration and model if needed.
- Use GGUF chat-template metadata, custom template, system prompt, and Think settings to build the final prompt.
- If the request includes `tools` or the app has shared tool configuration, execution switches into `SharedToolManager.generateWithTools()`, which can continue auto tool execution for up to 10 turns.
- When streaming, start a queue and writer thread so the native generation thread is not blocked by network writes.
- If the client disconnects or an end marker is detected, call `cancelGeneration()`.
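The queue-and-writer-thread step can be sketched as follows: the generation side only enqueues tokens, while a separate thread drains the queue into the (possibly slow) network sink. The class name, the sentinel approach, and the `Consumer<String>` sink are assumptions for illustration, and disconnect handling is left out.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch: decouple token production from network writes. */
final class StreamingRelay {
    // Unique sentinel, compared by identity, that tells the writer to stop.
    private static final String END_OF_STREAM = new String("END");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread writer;

    StreamingRelay(Consumer<String> sink) {
        writer = new Thread(() -> drainTo(sink), "stream-writer");
        writer.start();
    }

    /** Writer loop: blocks on the queue, never on the generation thread. */
    private void drainTo(Consumer<String> sink) {
        try {
            while (true) {
                String chunk = queue.take();
                if (chunk == END_OF_STREAM) return;
                sink.accept(chunk);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Called from the generation side; returns immediately. */
    void onToken(String token) {
        queue.add(token);
    }

    /** Signals completion and waits until all queued chunks were written. */
    void finish() {
        queue.add(END_OF_STREAM);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because `onToken` only appends to an unbounded queue, a stalled client slows the writer thread rather than the native decode loop.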
7-4. Multimodal request handling
- `/api/chat` and `/v1/chat/completions` can accept `content[]` parts such as `text`, `input_text`, `image_url`, and `input_audio`.
- `image_url` supports HTTP/HTTPS images or base64 data URLs. Remote downloads are capped at 10 MB.
- `input_audio` accepts only base64-encoded `wav` or `mp3`.
- Media parts are replaced by an internal `<__media__>` marker in text, while raw bytes are forwarded to JNI as `byte[][]`.
- If the currently loaded model does not support the requested modality, the server returns `400`.
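The marker substitution can be illustrated with the sketch below, which flattens a list of already-parsed content parts into one text string plus a `byte[][]` of media payloads. The `Part` and `Flattened` types and all method names are hypothetical, JSON parsing is skipped, and remote HTTP/HTTPS image downloads are not handled; only base64 payloads and data URLs are decoded.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

/** Hypothetical sketch of replacing media parts with an inline marker. */
final class MediaPartFlattener {
    static final String MEDIA_MARKER = "<__media__>";

    /** One already-parsed content part: a type plus its text or base64 payload. */
    static final class Part {
        final String type;
        final String value;
        Part(String type, String value) { this.type = type; this.value = value; }
    }

    /** Prompt text with markers, plus the raw media bytes in marker order. */
    static final class Flattened {
        final String text;
        final byte[][] media;
        Flattened(String text, byte[][] media) { this.text = text; this.media = media; }
    }

    static Flattened flatten(List<Part> parts) {
        StringBuilder text = new StringBuilder();
        List<byte[]> media = new ArrayList<>();
        for (Part part : parts) {
            switch (part.type) {
                case "text":
                case "input_text":
                    text.append(part.value);
                    break;
                case "image_url":
                case "input_audio":
                    media.add(decodeBase64Payload(part.value));
                    text.append(MEDIA_MARKER);
                    break;
                default:
                    throw new IllegalArgumentException("unsupported part type: " + part.type);
            }
        }
        return new Flattened(text.toString(), media.toArray(new byte[0][]));
    }

    /** Accepts a bare base64 string or a "data:...;base64,..." URL. */
    static byte[] decodeBase64Payload(String value) {
        int comma = value.startsWith("data:") ? value.indexOf(',') : -1;
        String base64 = comma >= 0 ? value.substring(comma + 1) : value;
        return Base64.getDecoder().decode(base64);
    }
}
```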
7-5. OpenAI compatibility layer
- `/v1/chat/completions` returns SSE when streaming is enabled.
- Internally it still uses the same prompt builder, ModelManager, and JNI runtime as the Ollama-compatible routes; only the response envelope differs.
- If `n_predict=0` is present, the server follows a pre-encode-only branch and returns early.
- Some request-level generation overrides are applied onto the configuration before the native sampler settings are re-applied.
- If `tool_calls` is missing, the runtime also tries to extract tool-call markers from `reasoning_content` or the returned content body.
- The server forwards `tool_choice` and `parallel_tool_calls`, combining request tools with shared MCP / Function Definitions settings for the internal tool-execution loop.
8. WebUI delivery
8-1. What the bundled WebUI is
- The APK bundles `webui/index.html`, `bundle.js`, `bundle.css`, and `loading.html` under app assets.
- `handleWebUi()` serves these files for `/` and related asset paths.
- Assets are cached in memory through `webUiAssetCache` after first load.
8-2. How WebUI and API fit together
- The WebUI is not a separate server; it is routed by the same `OllamaApiServer`.
- `/props` and `/slots` provide llama.cpp-style metadata used by the WebUI, including generation defaults plus `webui_settings` for the system message, Think display, and shared MCP / Function Definitions settings.
- Unknown client-side routes are normalized back to `index.html`, making the bundled WebUI behave like a small SPA served from the same port.
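The SPA fallback and the in-memory asset cache can be sketched together as below. The class name, the `Function<String, byte[]>` loader standing in for the Android asset reader, and the assumption that API routes are dispatched before this code runs are all illustrative; only the asset filenames come from this page.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of SPA route normalization with a cached asset loader. */
final class WebUiRouter {
    // Asset names listed on this page as bundled under webui/.
    private static final Set<String> ASSETS =
            Set.of("index.html", "bundle.js", "bundle.css", "loading.html");

    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> loader;

    /** The loader stands in for reading a file from app assets. */
    WebUiRouter(Function<String, byte[]> loader) {
        this.loader = loader;
    }

    /** Maps a request path to a bundled asset, falling back to index.html. */
    static String resolveAsset(String requestPath) {
        String path = requestPath;
        int query = path.indexOf('?');
        if (query >= 0) path = path.substring(0, query);
        while (path.startsWith("/")) path = path.substring(1);
        return ASSETS.contains(path) ? path : "index.html";
    }

    /** Loads each asset at most once; later requests are served from memory. */
    byte[] load(String requestPath) {
        return cache.computeIfAbsent(resolveAsset(requestPath), loader);
    }
}
```

Under these assumptions, `/`, `/index.html`, and an unknown client-side route such as `/chat/123` all resolve to the same cached `index.html` bytes.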
8-3. Operational meaning
The Android app effectively bundles three things together on one device: a native local inference runtime, an Ollama/OpenAI-compatible HTTP surface, and a browser-facing WebUI that consumes that local API. No external server is required for this stack to function.
9. Relationship to the public website
- The Mick Lab website itself is a Firebase Hosting static site served from `public/`.
- On Hosting, `/api/**` is rewritten to `api-fallback.json`, while other routes fall back to `/index.html`.
- That means these public pages are static documentation about the Android app's local API/WebUI implementation; they are not the live API server running inside the app.
- All API and WebUI behavior described on this page refers to the in-app `OllamaApiServer` and `OllamaForegroundService`, not Firebase Hosting.