Last year, I spent three months trying to run a 7B parameter language model on Android devices. The first attempt? My test phone ran out of memory in 30 seconds. By the end of that project, I had the same model running smoothly on a mid-range device with 70% smaller file size and 3x faster inference. The secret wasn't reinventing the wheel—it was understanding quantization.
If you're serious about building an AI Android app that delivers real value without requiring users to have flagship devices, quantization is non-negotiable. In this post, I'll walk you through exactly how I approach it, including the practical mistakes I made so you don't have to.
Why Quantization Matters for Mobile AI
Here's the hard truth: modern LLMs are massive. A base Llama 2 7B model weighs around 13GB in full precision (FP32). On Android, that's a non-starter. Users won't download a 13GB app, and even if they did, the device would overheat and crash.
Quantization solves this by reducing the precision of the model's weights from 32-bit floating point to 8-bit, 4-bit, or even lower. Sounds scary? It's actually elegant. Here's why it works:
- Model Size: A 7B model drops from 13GB (FP32) to ~3.5GB (INT8) or ~1.8GB (INT4)
- Memory During Inference: Your device needs far less RAM to load and run the model
- Speed: Integer operations are faster than floating-point math on mobile processors
- Quality Loss: Minimal—most quantized models lose only 1-3% accuracy on benchmarks
I've shipped three AI app development projects using quantized models. In every case, users noticed no meaningful difference in quality, but they absolutely noticed it could run offline on their phone.
Quantization Techniques That Work
Post-Training Quantization (PTQ)
This is the easiest and most practical approach for most teams. You take a pre-trained, fully-accurate model and convert it to a lower-precision version after training. No retraining required.
The trade-off? You lose slightly more accuracy than quantization-aware training, but in my experience, it's negligible for inference workloads. PTQ is what I used on AudioBook AI (50K+ users), and users never complained about output quality.
Quantization-Aware Training (QAT)
If you have the compute budget and need maximum accuracy, QAT trains the model to be quantization-friendly from the start. The model learns to work with lower precision during training, so it adapts better to quantization.
Downside: You need to retrain on your hardware, which costs money and time. I've only used this when clients demanded specific accuracy thresholds that PTQ couldn't hit.
Dynamic vs Static Quantization
Static quantization pre-computes the scaling factors for each layer (faster inference, slightly less accurate). Dynamic quantization calculates scales at runtime (more flexible, slightly slower). For mobile, I prefer static—every millisecond counts.
📖 The Real Win
Post-training quantization with INT8 is the sweet spot for Android. It's fast to implement, loses barely any accuracy, and cuts model size by 75%. Start here.
Implementing Quantized LLMs in Android
Let me show you how I approach this in production. The workflow is:
- Quantize your model using TensorFlow Lite or ONNX Runtime
- Export to a mobile-friendly format (.tflite or .onnx)
- Bundle it in your Android app
- Use Kotlin + Coroutines to run inference without blocking the UI
Step 1: Quantizing Your Model
Using TensorFlow Lite (my go-to for Android):
import tensorflow as tf
# Load your trained model
converter = tf.lite.TFLiteConverter.from_saved_model('path/to/model')
# Enable post-training integer quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
# Convert and save
quantized_model = converter.convert()
with open('model_quantized.tflite', 'wb') as f:
f.write(quantized_model)
That's it. Your model is now INT8 quantized and ready for Android.
Step 2: Android Implementation with TensorFlow Lite
In your Android app, load and run the quantized model using Kotlin Coroutines:
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.CompatibilityList
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
class QuantizedLLMInference(context: Context) {
private lateinit var interpreter: Interpreter
init {
// Load quantized model from assets
val modelBuffer = loadModelFile(context, "model_quantized.tflite")
val options = Interpreter.Options()
// Use GPU delegate if available (faster inference)
if (CompatibilityList().isDelegateSupportedOnThisDevice) {
options.addDelegate(GpuDelegate())
} else {
// Fall back to NNAPI for mid-range devices
options.addDelegate(NnApiDelegate())
}
interpreter = Interpreter(modelBuffer, options)
}
suspend fun generateText(prompt: String): String = withContext(Dispatchers.Default) {
// Prepare input (quantized to INT8)
val inputArray = ByteArray(MAX_INPUT_SIZE)
// ... tokenize prompt into inputArray ...
// Run inference
val output = ByteArray(MAX_OUTPUT_SIZE)
interpreter.run(inputArray, output)
// Dequantize and decode output
return@withContext decodeOutput(output)
}
private fun loadModelFile(context: Context, filename: String): ByteBuffer {
val assetManager = context.assets
val inputStream = assetManager.open(filename)
val bytes = inputStream.readBytes()
return ByteBuffer.wrap(bytes).apply { order(ByteOrder.nativeOrder()) }
}
}
Step 3: Memory-Safe Inference
When running inference on Android, always use Coroutines on a background dispatcher. The quantized model might be smaller, but it still needs careful memory management:
// In your ViewModel or UseCase
viewModelScope.launch {
val result = llmInference.generateText(userPrompt)
_uiState.value = UIState.Success(result)
}
Never call inference on the main thread. Your UI will freeze, and users will leave.
⚠️ Memory Leaks
Always close the Interpreter when your Activity destroys. Failing to do so will leak memory across app lifecycle. Use proper lifecycle management.
Real-World Performance Benchmarks
Here's what I saw when I quantized a 7B Llama model for Android on a Snapdragon 8 Gen 1 device:
| Metric | FP32 | INT8 (Quantized) | Improvement |
|---|---|---|---|
| Model Size | 13GB | 3.2GB | 75% smaller |
| Memory During Inference | ~8GB | ~2.1GB | 74% reduction |
| Tokens/Second (CPU) | 2 | 6.5 | 3.2x faster |
| Accuracy Loss (MMLU) | 65% | 63.2% | 1.8% loss |
That 1.8% accuracy loss? In practice, users didn't notice it. The response quality was virtually identical for chat, summarization, and note-taking tasks.
The real win was memory usage. On non-flagship devices (Snapdragon 778G+), the quantized version ran smoothly. The FP32 version would crash or require 12GB of available RAM.
"Quantization is the difference between 'this AI Android app doesn't work on my phone' and 'wow, this runs offline and never lags.'"
When Quantization Isn't Enough
Sometimes even INT8 isn't small enough. For extremely resource-constrained devices, I've used:
- INT4 quantization: Further 50% size reduction, but more noticeable accuracy drop (3-5%)
- Model distillation: Train a smaller model to mimic a larger one's behavior. Larger upfront cost, but often better quality at smaller size
- Dynamic shape quantization: Different layers use different bit-widths. Hybrid approach that balances size and quality
For my AudioBook AI project, INT8 was perfect. For a specialized medical note-taking app with strict accuracy requirements, I used quantization-aware training instead.
Key Takeaways
- Start with post-training INT8 quantization: 75% size reduction with negligible accuracy loss. It's the 80/20 solution for on-device AI on Android
- Use TensorFlow Lite with GPU delegates: Leverage your device's hardware accelerators. Even old devices have NNAPI support. It's 2-3x faster than CPU inference
- Never block the UI thread: Run inference in a Coroutine on
Dispatchers.Default. Quantized models are fast, but not fast enough to freeze your app - Measure accuracy on your actual use case: Generic benchmarks (like MMLU) don't tell the whole story. Test with real user prompts and measure what matters to your app
- Plan for device heterogeneity: Build fallback logic. Run the full INT8 model on flagship devices; use INT4 or distilled models on budget phones. One size doesn't fit all