llama.cpp / JNI / CMake Deep Dive
This page uses the current Android implementation as the concrete example, but reframes it around the broader question of how to integrate llama.cpp into an Android app: JNI boundaries, native state ownership, CMake layout, and the way llama.cpp / ggml are embedded into one runtime.
Related docs: Technical Specification | User Manual | Privacy Policy
1. What this page is for
The existing Technical Specification page gives the full application-level picture. This page goes narrower and deeper: it explains how llama.cpp is embedded into Android, with special focus on LlamaNative, jni_llama.cpp, and CMakeLists.txt.
In this implementation, both the UI path and the local API server path eventually call one shared library, llama_jni. That split, with Java/Kotlin handling app control and C/C++ running the inference runtime, is the most common architectural shape for Android apps using llama.cpp.
| Perspective | Mainly covered here | More app-specific |
|---|---|---|
| Reusable design | JNI surface, shared-library packaging, model init, generation loop, CMake target design | None |
| Notable choices in this implementation | arm64-v8a only, 16 KB page-size support, split between setLoadParameters() and setParameters(), two-stage init | Android UI, foreground service, custom HTTP server |
| Optional extensions | mtmd for image/audio projectors, libcurl + mbedTLS downloads | Bundled WebUI, Ollama/OpenAI compatibility layer |
2. Why wrap llama.cpp with JNI + CMake
On Android, there are three broad ways to use llama.cpp: 1) ship a prebuilt native library, 2) consume upstream CMake almost as-is, or 3) assemble one app-specific shared library from the sources you actually need. This implementation leans strongly toward option 3.
| Approach | Strength | Caution |
|---|---|---|
| Ship prebuilt binaries | Fastest initial setup | Weak control over ABI, flags, dependencies, and update timing |
| Use upstream via add_subdirectory() | Easier to stay close to upstream | Can drag in CLI/tooling structure an Android app does not want |
| Compile required sources into one app target | Clear control over what the APK ships and how JNI is shaped | Source lists and build defines must be reviewed on upstream updates |
In practice, Android integrations stay saner when you treat Java/Kotlin as app control and C/C++ as the inference runtime. llama.cpp is deeply native in how it handles contexts, samplers, backends, and quantized model formats, so packaging it as one controlled native target usually leads to a cleaner app boundary.
3. End-to-end flow
Gradle (externalNativeBuild)
↓
CMakeLists.txt
↓
llama_jni.so
↓
System.loadLibrary("llama_jni")
↓
LlamaNative native methods
↓ JNI
jni_llama.cpp
↓
llama.cpp / ggml / optional mtmd / curl / TLS
The key idea is that Java does not “use llama.cpp directly”. Instead, Java talks to a very small JNI facade, and that native facade owns the long-lived inference state.
| Layer | Responsibility | Concrete example here |
|---|---|---|
| Gradle | Declares ABI, NDK, and CMake arguments | arm64-v8a, ndkVersion 27.2.12479018, -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON |
| CMake | Collects sources, include paths, link options, and static deps | Builds llama_jni from llama.cpp, ggml, mtmd, curl, and mbedTLS |
| JNI wrapper | Maps Java calls into native state transitions | LlamaNative and jni_llama.cpp |
| llama.cpp runtime | Loads models, tokenizes, decodes, samples, and stops | llama_model_load_from_file(), llama_init_from_model(), llama_sampler_* |
4. JNI design essentials
4-1. Keep the Java API thin
LlamaNative exposes three practical groups: load operations, generate operations, and helper operations. For most Android integrations, this is a good default cut.
load:
    setLoadParameters(...)
    initWithMmproj(modelPath, mmprojPath)
    free()
generate:
    setParameters(...)
    generate(prompt)
    generateWithMedia(prompt, media)
    cancelGeneration()
helpers:
    setTokenListener(listener)
    getChatTemplate()
    supportsVision()
    supportsAudio()
In this implementation, temp, top_p, and top_k are stored through setLoadParameters(), while more advanced sampler features sit behind setParameters(). The general design lesson is the split itself: avoid exposing every llama.cpp detail directly to Java.
4-2. Centralize state ownership
jni_llama.cpp holds g_model, g_ctx, g_mtmd, g_cancel_generation, and g_jvm as global state guarded by g_mutex. This is a practical Android pattern: the most important rule is that multiple Java entry points must never race on the same native model state.
On the Java side, ModelManager adds busy-state and reinitialization control. So the implementation uses a double layer of safety: app-level exclusivity in Java plus native-level locking in C++.
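The native half of that double layer can be sketched as follows. This is a minimal illustration, not the code in jni_llama.cpp: FakeRuntime and the runtime_* functions are hypothetical stand-ins for the real handles and JNI entry points, chosen so the locking pattern compiles without llama.cpp.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for the native handles (g_model, g_ctx, ...).
struct FakeRuntime {
    std::string model_path;
    int n_ctx = 0;
};

namespace {
std::mutex g_mutex;                      // guards g_runtime
std::unique_ptr<FakeRuntime> g_runtime;  // the single owned native state
}  // namespace

// Every entry point takes the same lock before touching shared state.
bool runtime_init(const std::string& path, int n_ctx) {
    assert(n_ctx > 0);
    std::lock_guard<std::mutex> lock(g_mutex);
    if (g_runtime) return false;  // already loaded: caller must free first
    g_runtime = std::make_unique<FakeRuntime>();
    g_runtime->model_path = path;
    g_runtime->n_ctx = n_ctx;
    return true;
}

void runtime_free() {
    std::lock_guard<std::mutex> lock(g_mutex);
    g_runtime.reset();  // safe to call when nothing is loaded
}

// Returns -1 when no model is loaded.
int runtime_context_size() {
    std::lock_guard<std::mutex> lock(g_mutex);
    return g_runtime ? g_runtime->n_ctx : -1;
}
```

Refusing a second init rather than silently replacing the state keeps the "free, then init" sequence explicit for the Java-side manager.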
4-3. Callbacks must return through JavaVM
Streaming uses a global-ref listener and stores JavaVM, then attaches worker threads with AttachCurrentThread() when needed. This matters in any JNI integration because the thread running the native generation loop is not guaranteed to be the same thread that entered from Java.
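The attach-if-needed logic can be sketched as a small RAII helper. To keep the sketch compilable without jni.h, it is templated on a VM type and exercised with a hypothetical FakeVM; ScopedEnv, FakeVM, and FakeEnv are illustrative names, and the real JavaVM calls take extra arguments (GetEnv takes a JNI version, AttachCurrentThread takes thread arguments).

```cpp
#include <cassert>

// Values mirror JNI_OK and JNI_EDETACHED from <jni.h>.
constexpr int kJniOk = 0;
constexpr int kJniDetached = -2;

// VM is expected to offer GetEnv / AttachCurrentThread / DetachCurrentThread
// in the style of JavaVM; Env stands in for JNIEnv.
template <class VM, class Env>
class ScopedEnv {
public:
    explicit ScopedEnv(VM* vm) : vm_(vm) {
        const int rc = vm_->GetEnv(&env_);
        if (rc == kJniDetached) {
            // This thread is not known to the VM yet: attach it.
            if (vm_->AttachCurrentThread(&env_) == kJniOk) {
                attached_here_ = true;
            } else {
                env_ = nullptr;
            }
        } else if (rc != kJniOk) {
            env_ = nullptr;
        }
    }
    ~ScopedEnv() {
        // Detach only if this object did the attaching.
        if (attached_here_) vm_->DetachCurrentThread();
    }
    ScopedEnv(const ScopedEnv&) = delete;
    ScopedEnv& operator=(const ScopedEnv&) = delete;

    Env* get() const { return env_; }  // nullptr means "skip the callback"

private:
    VM* vm_;
    Env* env_ = nullptr;
    bool attached_here_ = false;
};

// Fake VM used only to exercise the logic without a running JVM.
struct FakeEnv {};
struct FakeVM {
    bool attached = false;
    int attach_calls = 0;
    int detach_calls = 0;
    FakeEnv env;
    int GetEnv(FakeEnv** out) {
        if (!attached) { *out = nullptr; return kJniDetached; }
        *out = &env;
        return kJniOk;
    }
    int AttachCurrentThread(FakeEnv** out) {
        attached = true; ++attach_calls; *out = &env;
        return kJniOk;
    }
    int DetachCurrentThread() {
        attached = false; ++detach_calls;
        return kJniOk;
    }
};
```

Detaching only when the helper itself attached matters: a thread that entered from Java is already attached and must not be detached by native code.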
4-4. Two-stage init is a production-oriented pattern
ModelManager first preloads with n_ctx=64, then frees and performs full initialization with the requested context size. That is a useful general pattern for llama.cpp apps: it surfaces file/load problems early before committing to a much larger memory footprint.
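The control flow of that probe-then-commit sequence can be sketched independently of the real loader. In this implementation the logic lives in the Java-side ModelManager; the C++ version below is only an illustration, and Loader, InitResult, and two_stage_init are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical callbacks standing in for the real init and free entry points.
struct Loader {
    std::function<bool(const std::string& path, int n_ctx)> init;
    std::function<void()> release;
};

enum class InitResult { Ok, ProbeFailed, FullInitFailed };

InitResult two_stage_init(const Loader& loader, const std::string& path,
                          int requested_n_ctx, int probe_n_ctx = 64) {
    assert(requested_n_ctx > 0 && probe_n_ctx > 0);
    // Stage 1: cheap probe with a tiny context to surface file/load errors.
    if (!loader.init(path, probe_n_ctx)) return InitResult::ProbeFailed;
    loader.release();
    // Stage 2: commit to the full memory footprint.
    if (!loader.init(path, requested_n_ctx)) return InitResult::FullInitFailed;
    return InitResult::Ok;
}
```

Distinguishing ProbeFailed from FullInitFailed lets the app report "the file is bad" separately from "the requested context size did not fit".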
5. Reading the llama.cpp runtime
5-1. Initialization phase
- Validate model and optional mmproj files.
- Check split GGUF completeness.
- Call llama_backend_init() and inspect available backends.
- Load the model with llama_model_load_from_file().
- Create the inference context with llama_init_from_model().
- Optionally initialize mtmd and record vision/audio capability.
That core sequence is broadly reusable. What is more application-specific here is the operational support around it: crash markers, log files, split-GGUF handling, and TLS trust-store export for native downloads.
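As an illustration of the split-GGUF completeness check, the sketch below derives the sibling shard paths implied by a primary shard name. It assumes the "-00001-of-00003.gguf" suffix convention produced by llama.cpp's split tooling; expected_shards is a hypothetical helper, not a function from this implementation, and the caller would still need to check that each returned path exists.

```cpp
#include <cassert>
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// Returns every shard path implied by a name such as
// "model-00001-of-00003.gguf". A path that does not follow the split
// pattern is treated as a single-file model and returned unchanged.
std::vector<std::string> expected_shards(const std::string& path) {
    static const std::regex re(R"((.*)-(\d{5})-of-(\d{5})\.gguf)");
    std::smatch m;
    if (!std::regex_match(path, m, re)) return {path};
    const int total = std::stoi(m[3].str());
    if (total <= 0) return {path};
    std::vector<std::string> out;
    for (int i = 1; i <= total; ++i) {
        char suffix[32];
        std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", i, total);
        out.push_back(m[1].str() + suffix);
    }
    return out;
}
```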
5-2. Generation phase
- Clear runtime memory and prefill the prompt.
- For text-only prompts, tokenize and decode in batches.
- For multimodal prompts, let mtmd chunk and prefill into the same g_ctx.
- Build a sampler chain and iterate llama_sampler_sample() → llama_sampler_accept() → llama_decode().
- Stop on EOG, stop sequences, context safety limits, or cancellation.
The main JNI job is therefore not merely “call one C API”. It is to reshape llama.cpp state and control flow into an Android-friendly runtime API.
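The stop conditions listed above can be sketched as a loop skeleton. The llama.cpp calls are deliberately abstracted away: next_piece is a hypothetical callback standing in for sample, accept, decode, and detokenize, and StopReason, GenResult, and run_generation are illustrative names rather than identifiers from jni_llama.cpp.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class StopReason { EndOfGeneration, StopSequence, TokenLimit, Cancelled };

struct GenResult {
    std::string text;
    StopReason reason;
};

// next_piece writes the next decoded text piece and returns false once the
// model emits an end-of-generation token.
GenResult run_generation(const std::function<bool(std::string&)>& next_piece,
                         const std::vector<std::string>& stop_sequences,
                         int max_new_tokens,
                         const std::atomic<bool>& cancel) {
    assert(max_new_tokens >= 0);
    GenResult r{"", StopReason::TokenLimit};
    for (int i = 0; i < max_new_tokens; ++i) {
        // Check the flag every token so cancellation stays responsive.
        if (cancel.load()) { r.reason = StopReason::Cancelled; return r; }
        std::string piece;
        if (!next_piece(piece)) { r.reason = StopReason::EndOfGeneration; return r; }
        r.text += piece;
        // A stop sequence may span several pieces, so search the accumulated
        // text (a whole-string search is simple, not optimal).
        for (const auto& stop : stop_sequences) {
            if (stop.empty()) continue;
            const auto pos = r.text.find(stop);
            if (pos != std::string::npos) {
                r.text.erase(pos);  // drop the stop sequence and what follows
                r.reason = StopReason::StopSequence;
                return r;
            }
        }
    }
    return r;  // the token budget ran out
}
```

Returning the stop reason alongside the text lets both the UI path and the HTTP path report why generation ended without parsing logs.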
5-3. The sampler chain is the translation layer from UI to runtime
This implementation maps penalties, DRY, top-n-sigma, top-k, typical, top-p, min-p, XTC, temperature, and Mirostat into a native sampler chain. Any serious Android integration needs some equivalent translation layer between app-facing settings and llama.cpp’s native sampling pieces.
Whether that chain is rebuilt on every generation or cached on config changes is a design choice. For maintainability, rebuilding it per request is often the clearer first implementation.
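A per-request rebuild can be sketched as a pure function from settings to an ordered list of stages. The sketch covers only a subset of the samplers named above and returns stage names instead of calling llama.cpp; SamplerSettings, its defaults, and plan_sampler_chain are hypothetical. The rule that greedy or Mirostat replaces the truncation stages is an assumption about a typical chain layout, not a statement about this implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical app-facing settings; a disabling value skips the stage.
struct SamplerSettings {
    float repeat_penalty = 1.0f;  // 1.0 = off
    int   top_k = 40;             // <= 0 = off
    float top_p = 0.95f;          // >= 1.0 = off
    float min_p = 0.05f;          // <= 0 = off
    float temperature = 0.8f;     // <= 0 = greedy
    int   mirostat = 0;           // 0 = off
};

// Returns stage names in application order. A real implementation would add
// the matching llama.cpp sampler to a chain at each step instead.
std::vector<std::string> plan_sampler_chain(const SamplerSettings& s) {
    std::vector<std::string> chain;
    if (s.repeat_penalty != 1.0f) chain.push_back("penalties");
    if (s.temperature <= 0.0f) {
        chain.push_back("greedy");  // deterministic pick, nothing else needed
        return chain;
    }
    if (s.mirostat != 0) {
        chain.push_back("temperature");
        chain.push_back("mirostat");  // assumed to replace the truncation stages
        return chain;
    }
    if (s.top_k > 0) chain.push_back("top_k");
    if (s.top_p < 1.0f) chain.push_back("top_p");
    if (s.min_p > 0.0f) chain.push_back("min_p");
    chain.push_back("temperature");
    chain.push_back("dist");  // final random draw from the filtered set
    return chain;
}
```

Keeping the plan as data makes the translation layer easy to unit-test on a desktop build before it ever touches a device.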
5-4. Keep text-only and multimodal separate in your head
The minimum viable integration is text-only. This implementation includes mtmd in the same target because it wants one llama_jni runtime that can also serve image/audio projector-backed models. For many projects, it is wiser to ship a stable text-only path first and add multimodal support later.
6. How to read the CMake design
6-1. What this CMakeLists.txt is really doing
add_library(llama_jni SHARED
jni/jni_llama.cpp
${LLAMA_SOURCES}
${GGML_SOURCES}
)
target_compile_definitions(llama_jni PRIVATE
GGML_USE_CPU
GGML_USE_K_QUANTS
)
target_link_libraries(llama_jni PRIVATE
curl mbedtls mbedcrypto mbedx509
log android jnigraphics m dl atomic
)
The best mental model is: the final Android-native product is one shared object, llama_jni.so. llama.cpp, ggml, mtmd, downloading, and TLS support all collapse into that one target.
6-2. Why the source grouping matters
- LLAMA_SOURCES: the main llama.cpp sources, common helpers, model sources, and generated build-info.cpp.
- GGML_SOURCES: ggml core, backend registration, CPU implementation, and ARM-specific files.
- MTMD_SOURCES: projector-related image/audio helpers.
One subtle but important choice here is that some source groups are gathered broadly, while backend-sensitive ggml files are listed explicitly. That is a useful pattern for Android integrations: let fast-moving upstream areas stay flexible, but pin the backend-critical pieces you do not want silently changing.
6-3. build-info and compile definitions
configure_file() generates build-info.cpp so llama.cpp gets the build number, commit, compiler, and target metadata it expects. Likewise, compile definitions such as GGML_USE_CPU and GGML_USE_K_QUANTS lock in which backend and quantization paths are compiled.
Because llama.cpp evolves quickly, Android consumers benefit from making this relationship explicit: which macros are enabled, and which sources must therefore exist in the target?
6-4. Android-specific knobs
| Item | This implementation | Meaning |
|---|---|---|
| ABI | arm64-v8a only | Reduces native size and the testing matrix |
| Flexible page sizes | -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON | Targets newer Android devices, including 16 KB page-size environments |
| Link options | -Wl,-z,max-page-size=16384 and -Wl,-z,common-page-size=16384 | Improves compatibility with 16 KB page-size systems |
| NDK | 27.2.12479018 | Locks the native toolchain for current Android compatibility |
6-5. libcurl and TLS integration
The CMake design imports static libcurl and mbedTLS archives into the same target. That choice reflects a runtime goal: model downloads should happen inside the same native runtime that owns split-GGUF logic and native logging. For some apps, Java-side HTTP is enough; for others, keeping downloads native simplifies the overall flow.
7. General integration checklist
- Choose the ABI first: starting with arm64-v8a only is usually the simplest path.
- Start with text-only: make generate(prompt) stable before adding multimodal complexity.
- Keep the JNI surface small: load, generate, cancel, free, and callbacks are often enough.
- Centralize state management: combine a Java-side manager with a native mutex.
- Separate load-time and sampler settings: do not let n_ctx, threads, GPU layers, penalties, and Mirostat blur together conceptually.
- Add cancellation early: both UI flows and HTTP flows will need it.
- Leave logs and crash traces: native model-load failures are otherwise hard to reproduce.
- Define an upstream-update checklist: source lists, build-info, macros, and optional mtmd additions should be reviewed together.
8. Common pitfalls
| Pitfall | Why it happens | Direction for mitigation |
|---|---|---|
| Upstream updates suddenly break the build | llama.cpp / ggml source layout and macro assumptions change | Review source lists and compile definitions as part of every upstream bump |
| Java callbacks crash or do nothing | A native thread uses a stale or unrelated JNIEnv* | Store JavaVM and attach/detach threads when invoking callbacks |
| Long prompts fail immediately | Prefill token count exceeds n_ctx | Check token counts after tokenization and return clear errors |
| Split GGUF models fail in surprising ways | The primary shard exists but sibling shards are missing | Validate shard completeness before and after download |
| HTTPS downloads behave differently from Java networking | Native code is not using Android’s trust-store assumptions | Export Android CA Store to PEM and pass it into libcurl |
| generate() grows into a giant function | Prefill, sampling, callbacks, stop detection, and UTF-8 cleanup all accumulate there | Factor out helpers for prefill, callbacks, and stop handling |
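As one example of a helper worth factoring out, the UTF-8 cleanup in the last row can be isolated into a small function. A token piece can end in the middle of a multi-byte character, so a streaming callback should forward only the complete prefix and keep the remaining bytes for the next piece. complete_utf8_prefix is a hypothetical helper; it inspects only the tail of the buffer and does not validate the whole string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the length of the longest prefix of `s` that does not end in the
// middle of a multi-byte UTF-8 sequence.
std::size_t complete_utf8_prefix(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = n;
    std::size_t back = 0;  // bytes inspected from the end
    while (i > 0 && back < 4) {
        const unsigned char c = static_cast<unsigned char>(s[i - 1]);
        --i;
        ++back;
        if ((c & 0xC0) == 0x80) continue;  // continuation byte, keep walking back
        // c starts the final sequence; work out how long it claims to be.
        std::size_t need = 1;
        if ((c & 0xE0) == 0xC0) need = 2;
        else if ((c & 0xF0) == 0xE0) need = 3;
        else if ((c & 0xF8) == 0xF0) need = 4;
        // Cut before the lead byte if the sequence is still incomplete.
        return (back < need) ? i : n;
    }
    return n;  // empty input or malformed tail: pass through unchanged
}
```

A caller would emit s.substr(0, k) to the listener and carry s.substr(k) over into the next piece, where k is the returned length.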
The main lesson is that successful llama.cpp integration on Android is less about “making one model run once” and more about wrapping a native inference runtime so it fits Android’s lifecycle, threading, ABI, and maintenance reality. JNI and CMake are the tools that make that wrapping explicit.