Blog
I Benchmarked Popular MLX Models That Fit on iPhone and iPad — Here's How Fast On-Device LLMs Actually Are
Ricky Takkar
Published February 8, 2026
TL;DR: I ran 6 quantized LLMs in Russet, which uses Apple's MLX framework, on an iPhone 17 Pro and an iPad Pro M5, both with 12 GB of RAM. LFM2.5 1.2B at 4-bit hits \(\text{124 tokens/sec}\) on iPad and \(\text{70 tokens/sec}\) on iPhone. The iPad Pro is 1.2x–2.2x faster depending on model and prompt length, and the gap widens with context length.
Why This Exists
There are plenty of MLX benchmarks for Macs running 20B+ models. There are almost none for iPhones and iPads running the small models that actually fit comfortably in mobile memory. I wanted real numbers for the models I ship in Russet, a private, on-device AI app designed for Apple silicon with no accounts and no ads. So I built a benchmark harness directly into a modified build of the publicly available 2026.02.15 version of the app and ran it systematically on my iPhone and iPad. Also, I needed a break from the cold, and there’s only so much sitting around waiting for M5 Pro/Max chips I can tolerate.
The question I wanted to answer: which sub-1GB quantized model gives the best user experience on the go?
Devices
| Device | Chip | RAM | OS |
| --- | --- | --- | --- |
| iPhone 17 Pro | A19 Pro | 12 GB | iOS 26.3 RC |
| iPad Pro | M5 | 12 GB | iPadOS 26.3 RC |
Models Tested
| Model | Size | Quantization |
| --- | --- | --- |
| Qwen3 0.6B | 351 MB | 4-bit |
| LFM2.5 1.2B Instruct | 663 MB | 4-bit |
| Llama 3.2 1B | 713 MB | 4-bit |
| Gemma 3 1B (QAT) | 772 MB | 4-bit |
| LFM2.5 1.2B Instruct | 951 MB | 6-bit |
| Qwen3 1.7B | 984 MB | 4-bit |
These represent the set of MLX models from Hugging Face offered on Russet that can run on both devices without hitting jetsam memory limits (which kill apps consuming roughly 50–67% of total RAM depending on device state).
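As a rough illustration of what that constraint implies, here is a minimal sketch of a memory gate. This is not Russet's actual logic: the 0.5 jetsam fraction is the conservative end of the 50–67% range above, and the 1.4x allowance for KV cache and runtime overhead is an assumption for illustration only.

```swift
import Foundation

/// Rough check of whether a quantized model should be offered on this device.
/// Sketch only: the jetsam fraction and overhead factor are illustrative assumptions.
func fitsComfortably(modelBytes: UInt64,
                     jetsamFraction: Double = 0.5,
                     overheadFactor: Double = 1.4) -> Bool {
    let totalRAM = Double(ProcessInfo.processInfo.physicalMemory)
    return Double(modelBytes) * overheadFactor < totalRAM * jetsamFraction
}

// Example: Qwen3 1.7B at 4-bit (~984 MB) on a 12 GB device clears a ~6 GB budget.
print(fitsComfortably(modelBytes: 984 * 1024 * 1024))
```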
Methodology
Benchmark Design
Each model was tested with 3 prompts of increasing length to isolate prefill vs. decode performance:
| Prompt | Input Tokens | Output Tokens | What It Tests |
| --- | --- | --- | --- |
| Short | ~47 | 128 | Decode-dominated (minimal prefill) |
| Medium | ~279 | 256 | Balanced prefill + decode |
| Long | ~690 | 512 | Prefill-heavy, sustained decode |
Critical Methodology Choices
- Fixed output length (EOS ignored). Most benchmarks let the model stop when it emits an EOS token. This makes TPS numbers incomparable across models — a model that generates 43 tokens looks faster per-token than one generating 512, because it never hits KV cache pressure or thermal throttling. I bypass MLX's internal EOS handling entirely by driving TokenIterator directly, forcing every model to generate exactly the target token count. This is a consequential design decision; a sketch of the measurement loop follows this list.
- Deterministic decoding. Temperature = 0, top-p = 1.0. Greedy decoding eliminates sampling variance, so TPS numbers reflect pure inference speed, not stochastic branching.
- Actual tokenizer counts. Input tokens are counted via the MLX tokenizer (input.text.tokens.size), not estimated by dividing character count by 4.3 (which is how Russet calculates token consumption in release builds - yes, I've been lazy about this :p). Output tokens are counted per-iteration in the generation loop.
- Fresh KV cache per run. No cache reuse between runs. Each run creates a new RotatingKVCache with a 2048-token limit.
- Warmup + measured runs. 2 warmup runs (discarded) + 10 measured runs per model per prompt. I report p50 and p90 to capture both typical and tail-case performance.
- Thermal monitoring. I ran these benchmarks in my upstate New York apartment with the window open in a −27.4°F (−33°C) wind chill, holding each device near the window between model runs until it felt cool to the touch. Serious stuff, I know… Anyway, not a single run triggered a .serious thermal state (ProcessInfo.thermalState was checked between runs).
- TTFT vs. decode separation. Time to First Token (TTFT) measures prefill latency. Decode TPS is calculated as \(\frac{\text{output\_tokens} - 1}{\text{decode\_seconds}}\), where decode time excludes the TTFT phase.
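To make those choices concrete, here is a trimmed-down sketch of the harness shape. The `nextToken` closure is a stand-in for stepping MLX's TokenIterator with greedy sampling over a fresh RotatingKVCache; those calls depend on your MLX Swift version, so they aren't spelled out here. The fixed token count, warmup runs, thermal check, and TTFT/decode split mirror what's described above.

```swift
import Foundation

/// One measured run: prefill latency plus decode timing for the remaining tokens.
struct RunResult {
    let ttftSeconds: Double    // time to first token (prefill + first decode step)
    let decodeSeconds: Double  // time spent on tokens 2...N
    let outputTokens: Int

    /// Matches the formula above: (output_tokens - 1) / decode_seconds.
    var decodeTPS: Double { Double(outputTokens - 1) / decodeSeconds }
}

/// Runs one prompt to exactly `targetTokens`, ignoring EOS.
/// `nextToken` stands in for stepping the MLX TokenIterator.
func benchmarkRun(targetTokens: Int, nextToken: () -> Int32?) -> RunResult? {
    let start = CFAbsoluteTimeGetCurrent()
    guard nextToken() != nil else { return nil }   // first token: pays the prefill cost
    let firstTokenAt = CFAbsoluteTimeGetCurrent()

    var produced = 1
    while produced < targetTokens {                // EOS ignored on purpose:
        guard nextToken() != nil else { break }    // every model emits the full target count
        produced += 1
    }
    let end = CFAbsoluteTimeGetCurrent()

    return RunResult(ttftSeconds: firstTokenAt - start,
                     decodeSeconds: end - firstTokenAt,
                     outputTokens: produced)
}

/// 2 warmup runs (discarded) + 10 measured runs, checking the thermal state between runs.
func measure(runs: Int = 10, warmup: Int = 2, makeRun: () -> RunResult?) -> [RunResult] {
    for _ in 0..<warmup { _ = makeRun() }
    var results: [RunResult] = []
    for _ in 0..<runs {
        let thermal = ProcessInfo.processInfo.thermalState
        guard thermal == .nominal || thermal == .fair else { break }  // stop before .serious
        if let result = makeRun() { results.append(result) }
    }
    return results
}
```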
What This Measures
This benchmark measures raw SoC inference throughput via the MLX TokenIterator — not the end-to-end generation speed a user experiences. Real-world generation includes per-token text decoding, stop-sequence matching, and UI update overhead, which reduces effective TPS by roughly 10–15%. The numbers here represent the ceiling.
Results
Headline Numbers (Medium Prompt — p50)
| Model | iPhone TPS | iPad TPS | Speedup | iPhone TTFT | iPad TTFT |
| --- | --- | --- | --- | --- | --- |
| Qwen3 0.6B | 62.2 | 86.1 | 1.38x | 163ms | 100ms |
| LFM2.5 1.2B (4-bit) | 59.7 | 124.1 | 2.08x | 244ms | 125ms |
| Llama 3.2 1B | 58.1 | 117.9 | 2.03x | 253ms | 132ms |
| Gemma 3 1B (QAT) | 37.1 | 53.0 | 1.43x | 693ms | 534ms |
| LFM2.5 1.2B (6-bit) | 45.4 | 88.4 | 1.95x | 280ms | 142ms |
| Qwen3 1.7B | 39.5 | 61.5 | 1.56x | 360ms | 179ms |
LFM2.5 1.2B at 4-bit is the speed champion. \(\text{124 tokens/sec}\) on iPad Pro and \(\text{60 tokens/sec}\) on iPhone — matching or beating models half its size. LFM's architecture appears extremely well-optimized for the MLX runtime.
iPhone 17 Pro: Decode Speed by Model
On iPhone, three models cluster at 58–70 TPS for short prompts (Qwen3 0.6B, LFM2.5 4-bit, Llama 3.2 1B), with significant degradation as prompt length increases. Qwen3 0.6B drops from 69.7 to 43.4 TPS (−38%) going from short to long prompts. This is the KV cache pressure effect — as the cache fills, each decode step does more work.
iPad Pro M5: Decode Speed by Model
iPad tells a different story. LFM2.5 4-bit shows virtually no degradation: 124.1 / 124.1 / 122.5 TPS across short/medium/long. The M5's memory bandwidth is sufficient that KV cache pressure barely registers for this model. In contrast, Llama 3.2 1B drops from 119 to 97 TPS (−19%), and Qwen3 1.7B drops from 82 to 58 TPS (−29%).
Cross-Device: TPS Comparison
Cross-Device: Speedup Ratio
The speedup ratio tells the real story. For short prompts (minimal prefill), iPad is 1.2x–1.9x faster. For long prompts (heavy prefill + sustained decode), the gap widens to 1.5x–2.2x. The advantage compounds with context length because prefill is more memory-bandwidth-sensitive than decode, and the iPad Pro's M5 has substantially higher bandwidth than the iPhone 17 Pro's A19 Pro.
LFM2.5 4-bit shows the largest speedup (2.22x on long prompts), while Gemma 3 1B shows the smallest (1.48x). This suggests LFM's architecture is better at exploiting available bandwidth, while Gemma's bottleneck is elsewhere (likely compute-bound due to its larger attention mechanism).
Cross-Device: Scatter Correlation
Every point sits below the diagonal (iPad faster). The spread increases from left to right — faster models on iPad see a proportionally larger gap over iPhone. Gemma 3 1B clusters in the lower-left (slow on both devices), while LFM2.5 4-bit occupies the extreme right (fast on iPad, moderate on iPhone). (Spent some extra time making this plot so kept it in.)
Deep Dives
1. Prefill Scaling: Why Gemma 3 1B Is Slow
TTFT should scale roughly linearly with input token count (prefill is a single forward pass over all input tokens). Most models follow this pattern. Gemma 3 1B is the exception: its TTFT is already 407ms at 47 input tokens and climbs to 1,330ms at 690 — only a 3.3x increase for a 14.7x increase in input tokens, which points to a large fixed per-call cost on top of the per-token prefill work.
Why? Gemma 3 uses a different attention pattern (alternating local and global attention layers) and has a larger embedding dimension relative to its parameter count. The QAT (Quantization-Aware Training) variant I tested also uses a non-standard quantization scheme that may not be as well optimized in the MLX runtime as the standard quantization used by the other models.
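To make that fixed-cost claim concrete, here is a back-of-the-envelope least-squares fit of TTFT against input tokens using the three Gemma data points quoted above (assuming, as the headline table suggests, that these are the iPhone numbers). This is my own illustration, not something the harness computes.

```swift
import Foundation

// Fit TTFT ≈ a + b·tokens for Gemma 3 1B. A large intercept (a) means fixed per-call
// overhead; the slope (b) is the per-token prefill cost.
let tokens: [Double] = [47, 279, 690]     // short / medium / long input lengths
let ttftMs: [Double] = [407, 693, 1330]   // Gemma 3 1B TTFT in ms (iPhone, per the text above)

let n = Double(tokens.count)
let meanX = tokens.reduce(0, +) / n
let meanY = ttftMs.reduce(0, +) / n
let slope = zip(tokens, ttftMs).map { (x, y) in (x - meanX) * (y - meanY) }.reduce(0, +)
          / tokens.map { ($0 - meanX) * ($0 - meanX) }.reduce(0, +)
let intercept = meanY - slope * meanX
print(String(format: "fixed overhead ≈ %.0f ms, prefill ≈ %.2f ms/token", intercept, slope))
// Prints approximately: fixed overhead ≈ 319 ms, prefill ≈ 1.45 ms/token
```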
2. LFM2.5: The Quantization Tradeoff
Same architecture, same parameters, different bit widths:
| Metric | 4-bit (663 MB) | 6-bit (951 MB) | Delta (6-bit vs. 4-bit) |
| --- | --- | --- | --- |
| Short TPS | 70.2 | 50.9 | −27% |
| Medium TPS | 59.7 | 45.4 | −24% |
| Long TPS | 55.2 | 43.5 | −21% |
| Short TTFT | 90ms | 97ms | +8% |
| Long TTFT | 606ms | 612ms | +1% |
The 4-bit variant is 21–27% faster at decode while using 30% less disk and memory. TTFT is nearly identical because prefill is compute-bound (matrix multiplications scale with model dimension, not bit width, at these sizes). On performance grounds alone, 6-bit is hard to justify unless your use case demands the extra numerical precision. But it doesn't cost me anything to provide it in Russet, so run amok!
3. Thermal Drift on iPhone
10 successive runs of the medium prompt on iPhone reveal interesting thermal behavior:
- Qwen3 0.6B starts at 63 TPS and drifts down to 57 TPS by run 10 — a steady 10% decline suggesting gradual thermal throttling.
- LFM2.5 4-bit does the opposite: it starts at 55 TPS, ramps up to 63 TPS by runs 4–5, then slowly declines. This "warm-up" pattern suggests the MLX runtime or GPU scheduler optimizes kernel execution over the first few iterations.
- Llama 3.2 1B shows the most dramatic ramp: from 50 to 60+ TPS in the first 3 runs.
- Slower models (Gemma 3 1B, Qwen3 1.7B) show flat lines — they don't push the hardware hard enough to trigger either warm-up gains or throttling.
The warmup effect is why I discard the first 2 runs and why p50 is more meaningful than mean for these benchmarks.
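For completeness, the reported p50/p90 figures come from something like a nearest-rank percentile over the measured decode TPS values. A minimal sketch, with illustrative numbers rather than the real run data (the exact interpolation Russet's harness uses may differ):

```swift
/// Nearest-rank percentile (with rounding); the exact interpolation may differ slightly.
func percentile(_ values: [Double], _ p: Double) -> Double {
    precondition(!values.isEmpty)
    let sorted = values.sorted()
    let rank = Int((p / 100.0 * Double(sorted.count - 1)).rounded())
    return sorted[rank]
}

// Illustrative decode-TPS values from 10 runs showing a warm-up ramp (not real data).
let tps: [Double] = [55.1, 58.3, 61.0, 62.7, 63.0, 62.8, 62.4, 61.9, 61.3, 60.8]
print("p50:", percentile(tps, 50), "p90:", percentile(tps, 90))
```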
4. Variance and Reliability
LFM2.5 4-bit has the widest interquartile range (consistent with the ramp-up pattern in the runs chart). Gemma 3 1B and Qwen3 1.7B are the most consistent — tight boxes with minimal spread, likely because they run well below the thermal ceiling.
For latency-sensitive applications, Llama 3.2 1B offers the best consistency-to-speed ratio: tight IQR centered at 58 TPS with no outliers.
Practical Recommendations
- For iPhone:
  - Best overall: LFM2.5 1.2B 4-bit — 60 TPS, 663 MB, good quality for a 1.2B model
  - Smallest footprint: Qwen3 0.6B — 70 TPS, only 351 MB, best for memory-constrained scenarios
  - Best consistency: Llama 3.2 1B — 58 TPS with tight variance, 713 MB
- For iPad:
  - Speed king: LFM2.5 1.2B 4-bit — 124 TPS, virtually no degradation with context length
  - Best value: Llama 3.2 1B — 118 TPS, proven architecture, wide ecosystem support
- IMHO, skip the 6-bit unless you need it. The 21–27% decode speed penalty for LFM2.5 6-bit vs. 4-bit is substantial, and the TTFT is nearly identical. Unless you've measured a quality difference that matters for your use case, 4-bit is the better default.
- Context length matters more on iPhone. iPad Pro maintains near-constant TPS as context grows; iPhone degrades significantly — lucky for you, Russet's UX meticulously plans around this.
Limitations and Caveats
- Pure inference throughput. Real-world generation includes tokenizer overhead, stop-sequence matching, and UI rendering. Expect 10–15% lower effective TPS in production.
- EOS ignored. I force generation to a fixed token count. In practice, models stop earlier — a response that would naturally be 50 tokens is measured at 128/256/512 tokens. This is intentional (it makes cross-model comparison fair), but it means real-world generation finishes sooner in wall-clock terms.
- Single-turn only. I don't test multi-turn conversation with an accumulated KV cache. Cache reuse across turns would improve TTFT, but that's a separate benchmark.
- Quality not measured. TPS tells you nothing about output quality. A faster model that produces worse outputs isn't a better choice. We're benchmarking the engine, not the fuel. (Yes, I admit that analogy came from AI…)
- iOS memory constraints. Models are limited to what fits within iOS jetsam limits (roughly 50% of the 12 GB total). Larger models (Qwen3 4B, Llama 3.2 3B) couldn't fit and were excluded. Check out Russet for Mac to access the full set of models (warning: model availability also depends on device RAM).
- Two devices. Both have 12 GB RAM, so I can't isolate memory bandwidth from compute throughput differences between the A19 Pro and the M5. The speedup numbers reflect the combined effect.
Key Takeaways
- LFM2.5 1.2B 4-bit is the best MLX model for mobile. It is the fastest on both devices and especially dominant on iPad, where it hits 124 TPS with no context-length degradation.
- iPad Pro M5 is 1.2x–2.2x faster than iPhone 17 Pro for on-device LLM inference, despite having the same 12 GB of RAM. The gap is memory-bandwidth-driven and widens with context length.
- Model architecture matters more than parameter count (for mobile). Gemma 3 1B is slower than Qwen3 0.6B despite being larger, and LFM2.5 1.2B matches or beats everything else despite being one of the larger models.
- 4-bit quantization is the sweet spot for mobile. The 6-bit variant of LFM2.5 is 21–27% slower with a negligible TTFT difference. Unless you've measured a quality gap, go 4-bit.
- iPhone TPS degrades significantly with context length; iPad's doesn't. On-device AI UX must be designed accordingly — aggressive context management matters more on iPhone.