llama.cpp / JNI / CMake Deep Dive

This page uses the current Android implementation as the concrete example, but reframes it around the broader question of how to integrate llama.cpp into an Android app: JNI boundaries, native state ownership, CMake layout, and the way llama.cpp / ggml are embedded into one runtime.

Android NDK · JNI boundary design · llama.cpp integration pattern · ggml CPU backend · mtmd extension · Single-target CMake layout

1. What this page is for
2. Why wrap llama.cpp with JNI + CMake
3. End-to-end flow
4. JNI design essentials
5. Reading the llama.cpp runtime
6. How to read the CMake design
7. General integration checklist
8. Common pitfalls

1. What this page is for

The existing Technical Specification page gives the full application-level picture. This page goes narrower and deeper: it explains how llama.cpp is embedded into Android, with special focus on LlamaNative, jni_llama.cpp, and CMakeLists.txt.

In this implementation, both the UI path and the local API server path eventually call one shared library, llama_jni. That split, with Java/Kotlin handling app control and C/C++ handling the inference runtime, is a common architectural shape for Android apps that use llama.cpp.

Perspective	Mainly covered here	More app-specific
Reusable design	JNI surface, shared-library packaging, model init, generation loop, CMake target design	None
Notable choices in this implementation	`arm64-v8a` only, 16 KB page-size support, split between `setLoadParameters()` and `setParameters()`, two-stage init	Android UI, foreground service, custom HTTP server
Optional extensions	mtmd for image/audio projectors, libcurl + mbedTLS downloads	Bundled WebUI, Ollama/OpenAI compatibility layer

2. Why wrap llama.cpp with JNI + CMake

On Android, there are three broad ways to use llama.cpp: 1) ship a prebuilt native library, 2) consume upstream CMake almost as-is, or 3) assemble one app-specific shared library from the sources you actually need. This implementation leans strongly toward option 3.

Approach	Strength	Caution
Ship prebuilt binaries	Fastest initial setup	Weak control over ABI, flags, dependencies, and update timing
Use upstream via `add_subdirectory()`	Easier to stay close to upstream	Can drag in CLI/tooling structure an Android app does not want
Compile required sources into one app target	Clear control over what the APK ships and how JNI is shaped	Source lists and build defines must be reviewed on upstream updates

In practice, Android integrations stay saner when you treat Java/Kotlin as app control and C/C++ as the inference runtime. llama.cpp is deeply native in how it handles contexts, samplers, backends, and quantized model formats, so packaging it as one controlled native target usually leads to a cleaner app boundary.

3. End-to-end flow

Gradle (externalNativeBuild)
  ↓
CMakeLists.txt
  ↓
libllama_jni.so
  ↓
System.loadLibrary("llama_jni")
  ↓
LlamaNative native methods
  ↓ JNI
jni_llama.cpp
  ↓
llama.cpp / ggml / optional mtmd / curl / TLS

The key idea is that Java does not “use llama.cpp directly”. Instead, Java talks to a very small JNI facade, and that native facade owns the long-lived inference state.

Layer	Responsibility	Concrete example here
Gradle	Declares ABI, NDK, and CMake arguments	`arm64-v8a`, `ndkVersion 27.2.12479018`, `-DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON`
CMake	Collects sources, include paths, link options, and static deps	Builds `llama_jni` from llama.cpp, ggml, mtmd, curl, and mbedTLS
JNI wrapper	Maps Java calls into native state transitions	`LlamaNative` and `jni_llama.cpp`
llama.cpp runtime	Loads models, tokenizes, decodes, samples, and stops	`llama_model_load_from_file()`, `llama_init_from_model()`, `llama_sampler_*`

4. JNI design essentials

4-1. Keep the Java API thin

LlamaNative exposes three practical groups: load operations, generate operations, and helper operations. For most Android integrations, this is a good default cut.

load:
  setLoadParameters(...)
  initWithMmproj(modelPath, mmprojPath)
  free()

generate:
  setParameters(...)
  generate(prompt)
  generateWithMedia(prompt, media)
  cancelGeneration()

helpers:
  setTokenListener(listener)
  getChatTemplate()
  supportsVision()
  supportsAudio()

In this implementation, temp, top_p, and top_k are stored through setLoadParameters(), while more advanced sampler features sit behind setParameters(). The general design lesson is the split itself: avoid exposing every llama.cpp detail directly to Java.

4-2. Centralize state ownership

jni_llama.cpp holds g_model, g_ctx, g_mtmd, g_cancel_generation, and g_jvm as global state guarded by g_mutex. This is a practical Android pattern: the rule that matters most is that multiple Java entry points must never race on the same native model state.

On the Java side, ModelManager adds busy-state and reinitialization control. So the implementation uses a double layer of safety: app-level exclusivity in Java plus native-level locking in C++.
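The native half of that double layer can be sketched as follows. This is a minimal illustration, not the code in jni_llama.cpp: FakeModel and NativeState are hypothetical names, and a real wrapper would hold the llama.cpp model and context pointers instead. One assumption is worth flagging: the cancel flag is a std::atomic that is deliberately not taken under the mutex, so a cancel call can get through while a generation holds the lock.

```cpp
#include <atomic>
#include <functional>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for the llama.cpp model/context handles.
struct FakeModel { std::string path; };

class NativeState {
public:
    // Replaces any previously loaded model under the lock.
    bool load(const std::string& path) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (path.empty()) return false;
        model_.reset(new FakeModel{path});
        return true;
    }

    // Holds the lock for the whole run so load/release cannot interleave.
    // Returns the number of completed steps, or -1 if nothing is loaded.
    int generate(int max_steps, const std::function<void(int)>& on_step) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!model_) return -1;
        cancel_.store(false);  // a cancel sent before this point is dropped
        int steps = 0;
        while (steps < max_steps && !cancel_.load()) {
            if (on_step) on_step(steps);  // stands in for decode + callback
            ++steps;
        }
        return steps;
    }

    // Lock-free on purpose: must work while generate() owns the mutex.
    void cancel() { cancel_.store(true); }

    void release() {
        std::lock_guard<std::mutex> lock(mutex_);
        model_.reset();
    }

    bool loaded() {
        std::lock_guard<std::mutex> lock(mutex_);
        return model_ != nullptr;
    }

private:
    std::mutex mutex_;
    std::unique_ptr<FakeModel> model_;
    std::atomic<bool> cancel_{false};
};
```

Each JNI entry point would then forward to one method of a single NativeState instance. The step callback must not call back into the locked methods, since std::mutex is not recursive.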

4-3. Callbacks must return through JavaVM

Streaming holds the listener as a JNI global reference and caches the JavaVM, then attaches worker threads with AttachCurrentThread() when needed. This matters in any JNI integration because a JNIEnv* is valid only on the thread it was obtained on, and the thread running the native generation loop is not guaranteed to be the thread that entered from Java.

4-4. Two-stage init is a production-oriented pattern

ModelManager first preloads with n_ctx=64, then frees and performs full initialization with the requested context size. That is a useful general pattern for llama.cpp apps: it surfaces file/load problems early before committing to a much larger memory footprint.
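The control flow of that two-stage pattern can be sketched as below. The source performs it in the Java-side ModelManager; it is shown here in C++ only to keep one example language. InitHooks, InitResult, and two_stage_init are hypothetical names, and the hooks stand in for whatever actually wraps the native init and free calls.

```cpp
#include <cassert>
#include <functional>
#include <string>

enum class InitResult { Ok, ProbeFailed, FullInitFailed };

// Hypothetical hooks. `init` would wrap the native load + context creation,
// `release` would wrap free(). Neither name comes from the source.
struct InitHooks {
    std::function<bool(const std::string& path, int n_ctx)> init;
    std::function<void()> release;
};

InitResult two_stage_init(const InitHooks& hooks, const std::string& path,
                          int requested_n_ctx, int probe_n_ctx = 64) {
    assert(hooks.init && hooks.release);

    // Stage 1: tiny context, so file and format problems surface cheaply.
    if (!hooks.init(path, probe_n_ctx)) return InitResult::ProbeFailed;
    hooks.release();

    // Stage 2: the real context size, attempted only after the probe passed.
    if (!hooks.init(path, requested_n_ctx)) return InitResult::FullInitFailed;
    return InitResult::Ok;
}
```

Keeping the two failure cases distinct lets the app tell "this file cannot be loaded at all" apart from "this file loads, but not with the requested context size".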

5. Reading the llama.cpp runtime

5-1. Initialization phase

Validate model and optional mmproj files.
Check split GGUF completeness.
Call llama_backend_init() and inspect available backends.
Load the model with llama_model_load_from_file().
Create the inference context with llama_init_from_model().
Optionally initialize mtmd and record vision/audio capability.

That core sequence is broadly reusable. What is more application-specific here is the operational support around it: crash markers, log files, split-GGUF handling, and TLS trust-store export for native downloads.
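As an illustration of the split-GGUF completeness check, here is a small sketch that derives the expected shard names from one shard. It assumes the `<prefix>-NNNNN-of-NNNNN.gguf` naming used by llama.cpp's split tooling; if your models use another scheme, the parsing has to change. expected_shards and missing_shards are hypothetical helpers, and the `exists` callback stands in for a real file-system check.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

static bool all_digits(const std::string& s) {
    if (s.empty()) return false;
    for (char c : s) if (c < '0' || c > '9') return false;
    return true;
}

// Returns every expected shard name for a split model, or an empty vector
// when `name` does not match the assumed pattern (treated as "not split").
std::vector<std::string> expected_shards(const std::string& name) {
    const std::size_t suffix_len = 20;  // "-00001-of-00003.gguf"
    if (name.size() <= suffix_len) return {};
    const std::string prefix = name.substr(0, name.size() - suffix_len);
    const std::string suffix = name.substr(name.size() - suffix_len);
    const std::string idx = suffix.substr(1, 5);
    const std::string cnt = suffix.substr(10, 5);
    if (suffix[0] != '-' || suffix.substr(6, 4) != "-of-" ||
        suffix.substr(15) != ".gguf" || !all_digits(idx) || !all_digits(cnt)) {
        return {};
    }
    const int index = std::stoi(idx);
    const int count = std::stoi(cnt);
    if (count < 1 || index < 1 || index > count) return {};

    std::vector<std::string> out;
    for (int i = 1; i <= count; ++i) {
        char buf[32];
        std::snprintf(buf, sizeof(buf), "-%05d-of-%05d.gguf", i, count);
        out.push_back(prefix + buf);
    }
    return out;
}

// Lists the shards for which `exists` returns false.
std::vector<std::string> missing_shards(
        const std::string& name,
        const std::function<bool(const std::string&)>& exists) {
    std::vector<std::string> missing;
    for (const std::string& shard : expected_shards(name)) {
        if (!exists(shard)) missing.push_back(shard);
    }
    return missing;
}
```

Running a check like this before calling the loader turns a confusing native load failure into a clear "shard N is missing" error.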

5-2. Generation phase

Clear the context's runtime memory (the KV cache) and prefill the prompt.
For text-only prompts, tokenize and decode in batches.
For multimodal prompts, let mtmd chunk and prefill into the same g_ctx.
Build a sampler chain and iterate llama_sampler_sample() → llama_sampler_accept() → llama_decode().
Stop on EOG, stop sequences, context safety limits, or cancellation.
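The stop conditions in the last step can be kept in one small, testable function. The sketch below is an assumption about how such a check might be organized, not the code in jni_llama.cpp: StepState, StopReason, find_stop, check_stop, and the safety_margin default are all hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class StopReason { None, Cancelled, EndOfGeneration, StopSequence, ContextLimit };

// Snapshot of the loop after one token has been sampled and appended.
struct StepState {
    bool cancelled;      // set by a cancelGeneration()-style entry point
    bool token_is_eog;   // the sampled token is an end-of-generation token
    std::size_t n_past;  // tokens already in the context
    std::size_t n_ctx;   // context size chosen at init time
};

// Position of the earliest stop sequence in `text`, or std::string::npos.
std::size_t find_stop(const std::string& text,
                      const std::vector<std::string>& stops) {
    std::size_t best = std::string::npos;
    for (const std::string& s : stops) {
        if (s.empty()) continue;
        const std::size_t pos = text.find(s);
        if (pos < best) best = pos;
    }
    return best;
}

// Checks the conditions in priority order: cancellation first, so a user
// request always wins, then EOG, stop sequences, and the context limit.
StopReason check_stop(const StepState& st, const std::string& text,
                      const std::vector<std::string>& stops,
                      std::size_t safety_margin = 4) {
    if (st.cancelled) return StopReason::Cancelled;
    if (st.token_is_eog) return StopReason::EndOfGeneration;
    if (find_stop(text, stops) != std::string::npos) return StopReason::StopSequence;
    if (st.n_past + safety_margin >= st.n_ctx) return StopReason::ContextLimit;
    return StopReason::None;
}
```

When a stop sequence matches, the position returned by find_stop can be used to trim the text before it is handed to the listener.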

The main JNI job is therefore not merely “call one C API”. It is to reshape llama.cpp state and control flow into an Android-friendly runtime API.

5-3. The sampler chain is the translation layer from UI to runtime

This implementation maps penalties, DRY, top-n-sigma, top-k, typical, top-p, min-p, XTC, temperature, and Mirostat into a native sampler chain. Any serious Android integration needs some equivalent translation layer between app-facing settings and llama.cpp’s native sampling pieces.

Whether that chain is rebuilt on every generation or cached on config changes is a design choice. For maintainability, rebuilding it per request is often the clearer first implementation.
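One way to keep that translation layer reviewable is to compute the chain as an ordered plan first and only then create the native samplers. The sketch below covers a subset of the settings and is an assumption, not the implementation described here: SamplerSettings and plan_sampler_chain are hypothetical, the stage names are labels for this example rather than llama.cpp identifiers, and the rules (non-positive temperature means greedy, Mirostat replaces the truncation stages) are one plausible policy.

```cpp
#include <string>
#include <vector>

// App-facing settings. Field names and defaults are illustrative only.
struct SamplerSettings {
    int   top_k    = 40;     // <= 0 disables the stage
    float top_p    = 0.95f;  // >= 1 disables the stage
    float min_p    = 0.05f;  // <= 0 disables the stage
    float temp     = 0.8f;   // <= 0 selects greedy decoding
    int   mirostat = 0;      // 0 = off, 1 or 2 = Mirostat variant
};

// Returns the ordered stage labels a wrapper would turn into native
// samplers, for example by mapping each label to an llama_sampler_* call.
std::vector<std::string> plan_sampler_chain(const SamplerSettings& s) {
    std::vector<std::string> plan;
    if (s.temp <= 0.0f) {
        plan.push_back("greedy");
        return plan;
    }
    if (s.mirostat == 1 || s.mirostat == 2) {
        plan.push_back("temp");
        plan.push_back(s.mirostat == 1 ? "mirostat" : "mirostat_v2");
        return plan;
    }
    if (s.top_k > 0)    plan.push_back("top_k");
    if (s.top_p < 1.0f) plan.push_back("top_p");
    if (s.min_p > 0.0f) plan.push_back("min_p");
    plan.push_back("temp");
    plan.push_back("dist");  // final random draw from what is left
    return plan;
}
```

Because the plan is plain data, it can be logged per request and unit-tested without loading a model, which fits the rebuild-per-request approach.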

5-4. Keep text-only and multimodal separate in your head

The minimum viable integration is text-only. This implementation includes mtmd in the same target because it wants one llama_jni runtime that can also serve image/audio projector-backed models. For many projects, it is wiser to ship a stable text-only path first and add multimodal support later.

6. How to read the CMake design

6-1. What this CMakeLists.txt is really doing

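# One shared library holds the JNI wrapper and every native source group.
add_library(llama_jni SHARED
  jni/jni_llama.cpp
  ${LLAMA_SOURCES}
  ${GGML_SOURCES}
  ${MTMD_SOURCES}
)

# Macros that select which backend and quantization paths are compiled in.
target_compile_definitions(llama_jni PRIVATE
  GGML_USE_CPU
  GGML_USE_K_QUANTS
)

# Static download/TLS dependencies first, then Android system libraries.
target_link_libraries(llama_jni PRIVATE
  curl mbedtls mbedcrypto mbedx509
  log android jnigraphics m dl atomic
)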

The best mental model is: the final Android-native product is one shared object, libllama_jni.so, which is the file that System.loadLibrary("llama_jni") resolves. llama.cpp, ggml, mtmd, downloading, and TLS support all collapse into that one target.

6-2. Why the source grouping matters

LLAMA_SOURCES: the main llama.cpp sources, common helpers, model sources, and generated build-info.cpp.
GGML_SOURCES: ggml core, backend registration, CPU implementation, and ARM-specific files.
MTMD_SOURCES: projector-related image/audio helpers.

One subtle but important choice here is that some source groups are gathered broadly, while backend-sensitive ggml files are listed explicitly. That is a useful pattern for Android integrations: let fast-moving upstream areas stay flexible, but pin the backend-critical pieces you do not want silently changing.

6-3. build-info and compile definitions

configure_file() generates build-info.cpp so llama.cpp gets the build number, commit, compiler, and target metadata it expects. Likewise, compile definitions such as GGML_USE_CPU and GGML_USE_K_QUANTS lock in which backend and quantization paths are compiled.

Because llama.cpp evolves quickly, Android consumers benefit from making this relationship explicit: which macros are enabled, and which sources must therefore exist in the target?

6-4. Android-specific knobs

Item	This implementation	Meaning
ABI	`arm64-v8a` only	Reduces native size and the testing matrix
Flexible page sizes	`-DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON`	Targets newer Android devices, including 16 KB page-size environments
Link options	`-Wl,-z,max-page-size=16384` and `-Wl,-z,common-page-size=16384`	Improves compatibility with 16 KB page-size systems
NDK	`27.2.12479018`	Locks the native toolchain for current Android compatibility

6-5. libcurl and TLS integration

The CMake design imports static libcurl and mbedTLS archives into the same target. That choice reflects a runtime goal: model downloads should happen inside the same native runtime that owns split-GGUF logic and native logging. For some apps, Java-side HTTP is enough; for others, keeping downloads native simplifies the overall flow.

7. General integration checklist

Choose the ABI first: starting with arm64-v8a only is usually the simplest path.
Start with text-only: make generate(prompt) stable before adding multimodal complexity.
Keep the JNI surface small: load, generate, cancel, free, and callbacks are often enough.
Centralize state management: combine a Java-side manager with a native mutex.
Separate load-time and sampler settings: do not let n_ctx, threads, GPU layers, penalties, and Mirostat blur together conceptually.
Add cancellation early: both UI flows and HTTP flows will need it.
Leave logs and crash traces: native model-load failures are otherwise hard to reproduce.
Define an upstream-update checklist: source lists, build-info, macros, and optional mtmd additions should be reviewed together.

8. Common pitfalls

Pitfall	Why it happens	Direction for mitigation
Upstream updates suddenly break the build	llama.cpp / ggml source layout and macro assumptions change	Review source lists and compile definitions as part of every upstream bump
Java callbacks crash or do nothing	A native thread uses a stale or unrelated `JNIEnv*`	Store `JavaVM` and attach/detach threads when invoking callbacks
Long prompts fail immediately	Prefill token count exceeds `n_ctx`	Check token counts after tokenization and return clear errors
Split GGUF models fail in surprising ways	The primary shard exists but sibling shards are missing	Validate shard completeness before and after download
HTTPS downloads behave differently from Java networking	Native code does not use Android’s system trust store by default	Export the Android CA store to a PEM file and pass it to libcurl
`generate()` grows into a giant function	Prefill, sampling, callbacks, stop detection, and UTF-8 cleanup all accumulate there	Factor out helpers for prefill, callbacks, and stop handling
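The UTF-8 cleanup mentioned in the last row is a good first candidate for factoring out. Token pieces can end in the middle of a multi-byte character, and handing such bytes to Java as a string is unsafe. The helper below is a sketch of one hold-back approach, assuming the accumulated bytes are otherwise valid UTF-8; complete_utf8_prefix is a hypothetical name and the function does not validate the whole buffer.

```cpp
#include <cstddef>
#include <string>

// Returns how many leading bytes of `buf` form complete UTF-8 sequences.
// Trailing bytes that start a multi-byte character without finishing it are
// excluded, so the caller can keep them until the next token piece arrives.
std::size_t complete_utf8_prefix(const std::string& buf) {
    const std::size_t n = buf.size();
    // An unfinished character has at most 3 bytes, so look back that far.
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(buf[n - back]);
        if ((c & 0xC0) == 0x80) continue;  // continuation byte, keep looking

        // Found a lead byte (or ASCII): work out the sequence length.
        std::size_t need = 1;
        if ((c & 0xE0) == 0xC0) need = 2;
        else if ((c & 0xF0) == 0xE0) need = 3;
        else if ((c & 0xF8) == 0xF0) need = 4;

        // If the sequence is longer than what we have, hold it back.
        return need > back ? n - back : n;
    }
    return n;
}
```

In a streaming loop, the wrapper would append each token piece to a pending buffer, emit the first complete_utf8_prefix(pending) bytes to the listener, and keep the remainder for the next iteration.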

The main lesson is that successful llama.cpp integration on Android is less about “making one model run once” and more about wrapping a native inference runtime so it fits Android’s lifecycle, threading, ABI, and maintenance reality. JNI and CMake are the tools that make that wrapping explicit.