Technical Specification

This page documents the current Android implementation of LLM tester with llama.cpp, focusing on the full path from the front app to configuration storage, JNI, CMake/NDK build, llama.cpp/ggml, the on-device API server, and bundled WebUI delivery.

Related docs: User Manual | llama.cpp / JNI / CMake Deep Dive | Privacy Policy

Android Activity UI | Configuration JSON | ModelManager | JNI / C++ | llama.cpp + ggml + mtmd | Ollama / OpenAI compatible API | Bundled WebUI

1. System overview

The app has two main execution paths: a direct UI-to-native inference path, and an on-device HTTP path used by the Ollama-compatible / OpenAI-compatible server. In both cases, model loading, busy-state control, parameter application, and reinitialization are centralized in ModelManager, and inference itself runs inside the llama_jni shared library, which embeds llama.cpp, ggml, and mtmd.

MainActivity / SettingsActivity
        ↓
ConfigurationManager / ModelFileHelper
        ↓
ModelManager
        ↓
LlamaNative (Java native wrapper)
        ↓ JNI
llama_jni (jni_llama.cpp)
        ↓
llama.cpp + ggml + mtmd + libcurl + mbedTLS

Separate path:
OllamaForegroundService
        ↓
OllamaApiServer (HTTP, same-device / LAN)
        ↓
ModelManager → LlamaNative → JNI → llama.cpp

Layer | Main classes | Role
UI | MainActivity / SettingsActivity / DocumentsActivity | Input, output display, log viewing, settings editing, API/WebUI control
Configuration | ConfigurationManager | Profile JSON persistence and defaults
Model assets | ModelFileHelper | GGUF/mmproj path resolution, storage directory, modality inference
Execution control | ModelManager | Loading, reloading, locking, native invocation coordination
Java ↔ Native | LlamaNative | Exports native methods and registers token callbacks
Native runtime | jni_llama.cpp | Global state, init, sampling, stop sequences, logging, crash markers
HTTP | OllamaApiServer / OllamaForegroundService | Ollama API, OpenAI API, WebUI delivery, queue control

2. Front app layer

2-1. MainActivity

2-2. SettingsActivity

2-3. Startup and background integration

3. Configuration and model files

3-1. ConfigurationManager

Each profile is stored as configs/*.json in the app's external files area. A default profile is generated automatically and includes the model reference, generation parameters, Think state, GPU offload setting, custom template, system prompt, and an optional mmproj reference.

Category | Main fields | Purpose
Model | modelUrl, multimodalProjectorUrl | GGUF and optional mmproj references
Load-time | nCtx, nThreads, nBatch, gpuOffloadLayers | Context, CPU, and GPU-loading behavior
Sampling | temp, topP, topK, penalties, Mirostat, DRY, XTC | Mapped directly into the native sampler chain
Prompt construction | systemPrompt, customChatTemplate, enableThinking | Controls the final prompt format
UI/runtime | streaming | Default streaming behavior for UI and API
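
To make the field categories above concrete, here is a minimal sketch of a default profile serialized as flat JSON. The field names come from the table; the ProfileSketch class, every default value, the placeholder URL, and the flat layout are assumptions for illustration and may not match the app's actual schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: field names follow the table above; all values are placeholders.
class ProfileSketch {
    static Map<String, Object> defaults() {
        Map<String, Object> p = new LinkedHashMap<>();
        p.put("modelUrl", "https://example.com/model.gguf"); // placeholder URL
        p.put("multimodalProjectorUrl", "");                  // optional mmproj reference
        p.put("nCtx", 4096);
        p.put("nThreads", 4);
        p.put("nBatch", 512);
        p.put("gpuOffloadLayers", 0);
        p.put("temp", 0.8);
        p.put("topP", 0.95);
        p.put("topK", 40);
        p.put("systemPrompt", "");
        p.put("customChatTemplate", "");
        p.put("enableThinking", false);
        p.put("streaming", true);
        return p;
    }

    // Minimal JSON writer for flat maps of strings, numbers, and booleans.
    static String toJson(Map<String, Object> map) {
        StringJoiner out = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> e : map.entrySet()) {
            Object v = e.getValue();
            String value = (v instanceof String) ? quote((String) v) : String.valueOf(v);
            out.add(quote(e.getKey()) + ":" + value);
        }
        return out.toString();
    }

    // Escapes only backslashes and double quotes, which is enough for this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Calling ProfileSketch.toJson(ProfileSketch.defaults()) yields one JSON object whose keys appear in the same order as the table rows, which makes a generated default profile easy to compare against the documented categories.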

3-2. ModelFileHelper

3-3. Split GGUF and HTTPS downloads

4. ModelManager responsibilities

ModelManager is the shared execution coordinator for both the UI and the HTTP server. It owns locking, model initialization, reinitialization, parameter application, native logging setup, and HTTPS trust-store preparation.

4-1. Locking and reinitialization

4-2. Load sequence

  1. Load the configuration profile.
  2. Resolve the model filename and destination path.
  3. Resolve or auto-detect an mmproj file if multimodal support is needed.
  4. Download the model if it is not present locally.
  5. Unload existing model/context/mmproj if a different model is required.
  6. Preload: call initWithMmproj() once with n_ctx=64.
  7. Full init: restore the requested n_ctx and call initWithMmproj() again.
  8. Apply sampling parameters with applyConfiguration().

That two-stage init helps surface file/load failures early before committing to the full requested context size.
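
The preload and full-init steps can be sketched as follows. This is an illustration under stated assumptions: EngineSketch is a hypothetical stand-in for the native wrapper, the parameter list of setLoadParameters and the boolean return of initWithMmproj are guesses (the page does not document either), and RecordingEngine exists only to show the call order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the native wrapper; the real signatures are not documented here.
interface EngineSketch {
    void setLoadParameters(int nCtx, int nThreads, int nBatch, int gpuLayers);
    boolean initWithMmproj(String modelPath, String mmprojPath);
}

class TwoStageLoader {
    static final int PRELOAD_N_CTX = 64; // small context used only for the preload pass

    // Returns true only if both the preload and the full init succeed.
    static boolean load(EngineSketch engine, String model, String mmproj,
                        int nCtx, int nThreads, int nBatch, int gpuLayers) {
        // Stage 1: cheap init that surfaces missing or unreadable files early.
        engine.setLoadParameters(PRELOAD_N_CTX, nThreads, nBatch, gpuLayers);
        if (!engine.initWithMmproj(model, mmproj)) {
            return false;
        }
        // Stage 2: restore the requested context size and initialize again.
        engine.setLoadParameters(nCtx, nThreads, nBatch, gpuLayers);
        return engine.initWithMmproj(model, mmproj);
    }
}

// Fake engine that records the context size seen by each init call.
class RecordingEngine implements EngineSketch {
    final List<Integer> initContexts = new ArrayList<>();
    private int pendingCtx;
    private final boolean succeed;

    RecordingEngine(boolean succeed) { this.succeed = succeed; }

    public void setLoadParameters(int nCtx, int nThreads, int nBatch, int gpuLayers) {
        pendingCtx = nCtx;
    }

    public boolean initWithMmproj(String modelPath, String mmprojPath) {
        initContexts.add(pendingCtx);
        return succeed;
    }
}
```

With a succeeding engine the recorded contexts are 64 followed by the requested size; with a failing engine only the 64-token attempt is made, so the expensive full-context allocation is never attempted for a model that cannot load.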

4-3. Generation behavior

5. JNI / C++ layer

5-1. Java-side native interface

LlamaNative loads llama_jni and exposes the following operations.

Method | Purpose
download(url, path) | Downloads models/mmproj files with libcurl
initWithMmproj(modelPath, mmprojPath) | Initializes the model and optional multimodal projector
setLoadParameters(...) | Sets n_ctx / n_threads / n_batch / GPU load parameters before init
setParameters(...) | Sets penalties, DRY, Mirostat, XTC, and related sampler settings
generate(prompt), generateWithMedia(prompt, media) | Runs text-only or multimodal generation
setTokenListener(listener) | Registers token, completion, and error callbacks for streaming
cancelGeneration() | Requests cancellation from native generation loops
getChatTemplate() | Reads GGUF chat-template metadata from the loaded model
supportsVision(), supportsAudio() | Reports current loaded-model modality support
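
As a rough illustration of the streaming callbacks registered through setTokenListener(listener), the sketch below shows one possible listener shape. The page only states that token, completion, and error callbacks exist; the interface name TokenListenerSketch, the method names onToken / onComplete / onError, and the FakeGenerator stand-in for the native side are all assumptions.

```java
// Hypothetical listener shape; callback names and signatures are assumed, not documented.
interface TokenListenerSketch {
    void onToken(String delta);
    void onComplete(String fullText);
    void onError(String message);
}

// Collects streamed deltas so a caller can render partial output as it arrives.
class CollectingListener implements TokenListenerSketch {
    final StringBuilder partial = new StringBuilder();
    String finalText;
    String error;

    public void onToken(String delta) { partial.append(delta); }
    public void onComplete(String fullText) { finalText = fullText; }
    public void onError(String message) { error = message; }
}

// Stand-in for the native side: emits each piece, then a single completion event.
class FakeGenerator {
    static void run(String[] pieces, TokenListenerSketch listener) {
        StringBuilder all = new StringBuilder();
        for (String piece : pieces) {
            all.append(piece);
            listener.onToken(piece);
        }
        listener.onComplete(all.toString());
    }
}
```

Separating delta delivery from the completion event lets the same listener serve both the UI, which appends text incrementally, and the HTTP layer, which needs to know when to close a stream.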

5-2. Native global state

5-3. initWithMmproj flow

  1. Install fatal signal handlers so native crashes leave markers in native_crash.txt.
  2. Register llama.cpp and mtmd log callbacks.
  3. Verify model file, mmproj file, and split GGUF completeness.
  4. Free existing model/context/mtmd state when required.
  5. Call llama_backend_init() and verify registered ggml backends.
  6. Load the model with llama_model_load_from_file() after applying n_gpu_layers.
  7. Create the inference context via llama_init_from_model() with the selected context/thread/batch settings.
  8. If needed, initialize multimodal support and record detected vision/audio capabilities.
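
The split GGUF completeness check in step 3 happens in native code; the following Java sketch only illustrates the idea. It assumes the shard naming used by llama.cpp's split tooling (prefix-00001-of-00003.gguf); the SplitGgufCheck class and the use of a set of file names instead of real file-system access are simplifications for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitGgufCheck {
    // Matches names such as "model-00001-of-00003.gguf".
    private static final Pattern SPLIT =
            Pattern.compile("^(.*)-(\\d{5})-of-(\\d{5})\\.gguf$");

    // Returns the file names that are missing; an empty list means everything needed is present.
    static List<String> missingShards(String modelName, Set<String> presentFiles) {
        List<String> missing = new ArrayList<>();
        Matcher m = SPLIT.matcher(modelName);
        if (!m.matches()) {
            // Single-file model: only the file itself is required.
            if (!presentFiles.contains(modelName)) missing.add(modelName);
            return missing;
        }
        String prefix = m.group(1);
        int total = Integer.parseInt(m.group(3));
        // Every shard from 1..total must exist before loading can succeed.
        for (int i = 1; i <= total; i++) {
            String name = String.format(Locale.ROOT, "%s-%05d-of-%05d.gguf", prefix, i, total);
            if (!presentFiles.contains(name)) missing.add(name);
        }
        return missing;
    }
}
```

Reporting the missing shard names, instead of a plain boolean, makes it possible to show the user exactly which download is incomplete.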

5-4. generate flow

  1. Clear existing model memory for a fresh prompt prefill.
  2. Use prefill_text_prompt_locked() for text-only prompts or prefill_multimodal_prompt_locked() for prompts with image/audio inputs.
  3. Construct a sampler chain from the current Java-provided settings.
  4. Generate up to 1024 tokens using llama_sampler_sample(), llama_sampler_accept(), and llama_decode().
  5. Stop on EOG, stop-sequence detection, context safety limit, or cancellation.
  6. Emit delta tokens through notify_token_delta() and completion through notify_token_complete().

5-5. Sampler chain composition

The native layer maps Java config fields directly into llama.cpp sampler components in roughly this order.

penalties
→ DRY
→ top_n_sigma
→ top_k
→ typical
→ top_p
→ min_p
→ XTC
→ temperature / dynamic temperature
→ mirostat v1 or v2 / fallback dist
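
To illustrate why the order matters, the following reduced sketch chains only three of the stages listed above (top_k, top_p, temperature) over a small candidate list. It is not the llama.cpp sampler API: SamplerChainSketch, Candidate, and the use of UnaryOperator stages are assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.UnaryOperator;

class SamplerChainSketch {
    // One candidate token with its raw logit.
    static final class Candidate {
        final int id;
        final double logit;
        Candidate(int id, double logit) { this.id = id; this.logit = logit; }
    }

    // Keeps the k candidates with the highest logits, sorted descending.
    static UnaryOperator<List<Candidate>> topK(int k) {
        return in -> {
            List<Candidate> sorted = new ArrayList<>(in);
            sorted.sort(Comparator.comparingDouble((Candidate c) -> c.logit).reversed());
            return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
        };
    }

    // Keeps the shortest high-probability prefix whose softmax mass reaches p.
    static UnaryOperator<List<Candidate>> topP(double p) {
        return in -> {
            List<Candidate> sorted = topK(in.size()).apply(in);
            double max = sorted.isEmpty() ? 0 : sorted.get(0).logit;
            double sum = 0;
            for (Candidate c : sorted) sum += Math.exp(c.logit - max);
            List<Candidate> kept = new ArrayList<>();
            double cumulative = 0;
            for (Candidate c : sorted) {
                kept.add(c);
                cumulative += Math.exp(c.logit - max) / sum;
                if (cumulative >= p) break;
            }
            return kept;
        };
    }

    // Divides logits by the temperature (temp > 0 assumed).
    static UnaryOperator<List<Candidate>> temperature(double temp) {
        return in -> {
            List<Candidate> out = new ArrayList<>();
            for (Candidate c : in) out.add(new Candidate(c.id, c.logit / temp));
            return out;
        };
    }

    // Applies the stages in order; each stage narrows or rescales the candidates.
    static List<Candidate> apply(List<UnaryOperator<List<Candidate>>> chain, List<Candidate> start) {
        List<Candidate> current = start;
        for (UnaryOperator<List<Candidate>> stage : chain) current = stage.apply(current);
        return current;
    }
}
```

Because truncation stages run before temperature in this ordering, the set of surviving candidates is decided on the unscaled logits, and temperature only reshapes the distribution among the survivors.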

DRY sequence breakers default to the comma-separated list \n,:,",* (newline, colon, double quote, asterisk), matching the Java configuration default. The native layer unescapes them before calling llama_sampler_init_dry().
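
The unescaping of the sequence-breaker setting is done natively, but the idea can be shown in a few lines of Java. This sketch assumes the setting is a comma-separated string and that only the escapes \n, \t, \\, and \" need handling; DryBreakersSketch and both method names are hypothetical. A breaker that itself contains a comma cannot be expressed in this simplified format.

```java
import java.util.ArrayList;
import java.util.List;

class DryBreakersSketch {
    // Converts backslash escapes (\n, \t, \\, \") into the characters they denote.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case '\\': out.append('\\'); break;
                    case '"': out.append('"'); break;
                    default: out.append('\\').append(next); // unknown escape: keep as written
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Splits the comma-separated setting and unescapes each non-empty entry.
    static List<String> parse(String setting) {
        List<String> breakers = new ArrayList<>();
        for (String part : setting.split(",")) {
            if (!part.isEmpty()) breakers.add(unescape(part));
        }
        return breakers;
    }
}
```

Parsing the default setting gives four breakers: a real newline character, a colon, a double quote, and an asterisk.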

5-6. Stop conditions and output cleanup

5-7. Downloading, logging, and crash traces

6. CMake / NDK build

For a deeper explanation of the JNI boundary and CMake structure from a general llama.cpp-integration perspective, see llama.cpp / JNI / CMake Deep Dive.

6-1. Gradle / NDK characteristics

6-2. CMake structure

6-3. Android-specific linker behavior

6-4. Why this matters

This means the app is not merely "calling an external engine". It ships a custom Android-native runtime that directly embeds llama.cpp, ggml, mtmd, curl, and TLS support. UI actions therefore control a locally compiled inference stack inside the APK itself.

7. API server design

OllamaApiServer is a lightweight custom HTTP server running inside a foreground service. API routes and the WebUI share the same port; the default port is 11434.

7-1. Exposed endpoints

Route | Main purpose | Notes
POST /api/generate | Single-prompt generation | Ollama-style NDJSON streaming or non-streaming
POST /api/chat | Conversation generation from messages | Builds model-family-specific multi-turn prompts
GET/POST /api/tags | Model list | Treats saved configurations as model names
POST /v1/chat/completions | OpenAI-compatible chat endpoint | SSE streaming supported
GET /v1/models, /models | Model inventory and status | Reports loaded state, modalities, file path, etc.
GET /props | llama.cpp WebUI model props | Includes generation defaults, chat template, and WebUI settings
GET /slots | Slot state | Total slot count is fixed at 1
GET /health, /v1/health | Health check | Returns role and webui=true
/, /index.html, static assets | Bundled WebUI | Served from app assets with in-memory cache
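
As a usage illustration for the routes above, this sketch builds a non-streaming POST /api/generate request with the JDK HTTP client (Java 11 or later). It assumes the server accepts the usual Ollama request fields model, prompt, and stream; the class name GenerateClientSketch and the profile name passed as the model are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class GenerateClientSketch {
    // Builds a non-streaming POST /api/generate request in the Ollama request style.
    static HttpRequest buildRequest(String host, int port, String model, String prompt) {
        String body = "{\"model\":\"" + escape(model) + "\","
                + "\"prompt\":\"" + escape(prompt) + "\","
                + "\"stream\":false}";
        return HttpRequest.newBuilder(URI.create("http://" + host + ":" + port + "/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Escapes the characters that would break a JSON string literal in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Sends the request and returns the raw response body (requires the server to be running).
    static String send(HttpRequest request) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

For example, buildRequest("127.0.0.1", 11434, "default", "Hello") targets the default port on the same device; because saved configurations are treated as model names, the model field would carry a profile name.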

7-2. Concurrency and queue behavior

7-3. Internal processing for /api/generate and /api/chat

  1. Parse the request JSON.
  2. Acquire the generation slot through acquireGenerationSlot().
  3. Load the requested configuration and model if needed.
  4. Use GGUF chat-template metadata, custom template, system prompt, and Think settings to build the final prompt.
  5. If the request includes tools or the app has a shared tool configuration, execution switches to SharedToolManager.generateWithTools(), which can continue automatic tool execution for up to 10 turns.
  6. When streaming, start a queue and writer thread so the native generation thread is not blocked by network writes.
  7. If the client disconnects or an end marker is detected, call cancelGeneration().
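
The queue-and-writer arrangement in step 6 can be sketched as follows. The names StreamWriterSketch and END are hypothetical, a StringBuilder stands in for the network socket, and client-disconnect handling (which would trigger cancelGeneration() in the real server) is not modeled.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class StreamWriterSketch {
    static final String END = "\u0000END"; // hypothetical sentinel marking end of stream

    // Starts a writer thread that drains the queue into the sink until END arrives.
    static Thread startWriter(BlockingQueue<String> queue, StringBuilder sink) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String chunk = queue.take();   // blocks the writer, never the generator
                    if (chunk.equals(END)) return;
                    synchronized (sink) { sink.append(chunk); } // stands in for a socket write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Simulates the generation thread: enqueue each token and move on immediately.
    static String stream(String[] tokens) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        StringBuilder sink = new StringBuilder();
        Thread writer = startWriter(queue, sink);
        for (String token : tokens) queue.add(token); // unbounded queue: add never blocks
        queue.add(END);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writer", e);
        }
        synchronized (sink) { return sink.toString(); }
    }
}
```

The design point is that the producer only pays the cost of an in-memory enqueue per token, so a slow or stalled client slows the writer thread and not the native generation loop.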

7-4. Multimodal request handling

7-5. OpenAI compatibility layer

8. WebUI delivery

8-1. What the bundled WebUI is

8-2. How WebUI and API fit together

8-3. Operational meaning

The Android app effectively bundles three things together on one device: a native local inference runtime, an Ollama/OpenAI-compatible HTTP surface, and a browser-facing WebUI that consumes that local API. No external server is required for this stack to function.

9. Relationship to the public website