Running LLMs on Old Hardware
Running Local LLMs on Intel Mac with AMD GPUs: A Journey Through llama.cpp and MoltenVK
Introduction
I have an old Mac Pro 2019 that I had been considering selling for a long time (funny that 2019 hardware is already considered deprecated), but because they sell for next to nothing I decided to try running it as a local server for LLMs. Yes, I do realise it's far simpler to run inference on Apple Silicon than on Intel, but the ultimate goal was to run a server that handled or routed all LLM requests, and for that I needed a desktop.
Why llama.cpp?
The first goal was: run large language models locally for Cline in VS Code. The requirements were:
- Large context windows (8K-32K tokens) for working with entire codebases
- 30B+ parameter models for quality code generation
- Cost-effective infrastructure using hardware I already owned
- Reliable enough for production agent workflows, which meant finding a quantization that did not crash the GPUs
I really only considered llama.cpp, as it has Vulkan support for AMD GPUs on macOS. I have not yet trialed vLLM or other inference engines.
Building llama.cpp with Vulkan Support
Initial Build Attempt and Compiler Crash
The initial build process revealed an immediate issue:
cd ~/Work
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# First attempt with Vulkan backend
cmake -B build \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
The build failed at 47% with a compiler crash:
[ 47%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/arch/x86/repack.cpp.o
fatal error: error in backend: Cannot select: 0x7f7dc302c400: v8f16,ch = load<...
In function: ggml_gemv_q4_K_8x8_q8_K
clang: error: clang frontend command failed with exit code 70
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
The Root Cause: Outdated Compiler and Advanced Instruction Sets
The issue stemmed from using Apple Clang 14.0.3 (shipped with an older Xcode), which had a bug with the advanced CPU instruction-set optimizations that llama.cpp tries to enable by default via -march=native.
The -march=native flag enables CPU SIMD instructions (Single Instruction, Multiple Data) like AVX2 and AVX-512, which can significantly speed up CPU inference by processing multiple values in parallel. However, when the compiler is outdated, it can miscompile these optimizations.
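As a quick sanity check (a sketch; the exact sysctl key names vary between macOS versions), you can compare what the CPU advertises against which SIMD macros the compiler actually defines under -march=native:

# What the Intel CPU advertises (on Intel Macs, AVX2 shows up under the leaf7 features)
sysctl machdep.cpu.features machdep.cpu.leaf7_features

# Which SIMD-related macros clang enables when targeting the native CPU
clang -march=native -dM -E -x c /dev/null | grep -E '__AVX2__|__AVX512|__FMA__'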
The Fix: Disabling Native CPU Optimizations
Since I was building for GPU inference, the CPU optimizations were mostly irrelevant anyway:
# Clean the failed build
rm -rf build
# Rebuild WITHOUT native CPU optimizations
cmake -B build \
-DGGML_VULKAN=ON \
-DGGML_NATIVE=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DVulkan_LIBRARY=/usr/local/lib/libvulkan.dylib \
-DOpenMP_ROOT=$(brew --prefix)/opt/libomp
cmake --build build --config Release
The -DGGML_NATIVE=OFF flag disables the problematic -march=native compiler flag, preventing the crash.
That said, the proper fix would have been to update Clang to 15.0.0, but that required updating Xcode.
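For reference, here is how to check which Clang the build is picking up, and how to switch toolchains once a newer Xcode or Command Line Tools is installed (paths are the usual defaults; adjust for your install):

# Which compiler and toolchain cmake will find by default
clang --version
xcode-select -p

# After installing a newer Xcode, point the command-line tools at it
sudo xcode-select --switch /Applications/Xcode.app

# Or check Apple's updater for newer Command Line Tools
softwareupdate --list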
Verifying GPU Detection
The first test—did llama.cpp detect both GPUs?
./build/bin/llama-cli --version
# Output showed:
# Vulkan backend enabled
# Found 2 Vulkan devices:
# - AMD Radeon Pro W5700X (16384 MB)
# - AMD Radeon Pro W5700X (16384 MB)
Both GPUs were visible to the Vulkan backend.
Initial Testing
The First Models
I started with Qwen2.5-Coder-7B-Instruct in Q4_K_M quantization:
./build/bin/llama-cli \
-hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF \
-m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
-p "def fibonacci(n):" \
-n 50 \
-ngl 99 # Offload all layers to GPU
The results weren’t good:
Prompt: 73.0 t/s | Generation: 71.8 t/s
Output: 动了@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@...
71.8 tokens per second of complete gibberish. Chinese characters, random symbols, repetition—but blazingly fast.
CPU Baseline Test
To verify the model itself wasn’t corrupted:
./build/bin/llama-cli \
-m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
-p "def fibonacci(n):" \
-n 50 \
-ngl 0 # Force CPU only
Result: Valid Python code at 9.6 tokens/second.
Conclusion: The model was fine. GPU acceleration was working (speed proved that), but the output was corrupted.
The Hunt for the Bug
What I Observed
- Speed proved GPU was working: 71.8 tok/s vs 9.6 tok/s CPU-only (7.5x speedup)
- Both GPUs showed activity in Activity Monitor during inference
- The corruption was deterministic: Same gibberish patterns every time
- Different models showed the same issue: Tried DeepSeek-Coder-33B, Qwen variants—all corrupted
- Quantization didn’t matter: Q4, Q6, Q8 all showed corruption (though Q8 behaved differently)
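One way to confirm those last observations (a sketch; the model path and seed are illustrative) is to generate with a fixed seed and zero temperature on CPU and on GPU, then diff the two outputs:

# Same prompt, greedy sampling, CPU-only vs full GPU offload
./build/bin/llama-cli -m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
  -p "def fibonacci(n):" -n 50 --temp 0 --seed 42 -ngl 0 > out-cpu.txt 2>/dev/null
./build/bin/llama-cli -m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
  -p "def fibonacci(n):" -n 50 --temp 0 --seed 42 -ngl 99 > out-gpu.txt 2>/dev/null

# Any divergence points at the GPU code path rather than the model file
diff out-cpu.txt out-gpu.txt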
The Investigation
The debug process involved:
# Checking Vulkan installation
vulkaninfo --summary
# Output:
# Vulkan Instance Version: 1.3.296
# Found 2 devices...
# Checking which MoltenVK version was in use
vulkaninfo | grep -i "driverVersion\|driverName"
# Output:
# driverName: MoltenVK
# driverVersion: 0.2.2019 (10211)
MoltenVK version 0.2.2019—from 2019! I was running a six-year-old translation layer, probably left over from something I had fiddled with back in the day.
llama.cpp (Vulkan calls)
↓
MoltenVK (translation layer) ← The problem was here
↓
Metal API
↓
AMD GPU hardware
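To see exactly which MoltenVK build the Vulkan loader is handing to llama.cpp, you can inspect the ICD manifest (paths assume a Homebrew install of molten-vk; VK_ICD_FILENAMES is the standard loader override):

# The ICD manifest tells the loader which MoltenVK dylib to load
cat "$(brew --prefix)/share/vulkan/icd.d/MoltenVK_icd.json"

# Force the loader to use a specific ICD if multiple copies are lying around
export VK_ICD_FILENAMES="$(brew --prefix)/share/vulkan/icd.d/MoltenVK_icd.json"
vulkaninfo --summary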
The Upgrade Process
brew info molten-vk
# Output: stable 1.4.0 (bottled)
brew upgrade molten-vk
# Fix permissions (needed on my system)
sudo chown -R $(whoami):admin /usr/local/include/MoltenVK
sudo chown -R $(whoami):admin /usr/local/lib/libMoltenVK.dylib
# Force link the new version
brew link --overwrite molten-vk
# Verify the upgrade
vulkaninfo | grep -i "driverVersion"
# Output: driverVersion = 0.2.2208 (10400)
Testing Again
./build/bin/llama-cli \
-hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF \
-m Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
-p "def fibonacci(n):" \
-n 50 \
-ngl 99
Output:
def fibonacci(n):
if n <= 0:
return "Input should be a positive integer"
elif n == 1:
return 0
elif n == 2:
return 1
...
[ Prompt: 51.4 t/s | Generation: 43.5 t/s ]
Valid Python code at 43.5 tokens per second. The MoltenVK upgrade had fixed it.
Scaling Up: Testing Larger Models
Qwen2.5-Coder-32B-Instruct
With the corruption issue resolved, I could finally test the models I actually wanted to use:
# -c 32768 gives a 32K context window; -ngl 999 offloads all layers to the GPUs
./build/bin/llama-server \
-m qwen2.5-coder-32b-instruct-q5_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 32768 \
-ngl 999 \
--threads 8 \
--flash-attn
Performance:
- Prompt processing: ~15-20 t/s (varies with prompt length)
- Token generation: 13.5 t/s sustained
- Context window: 32K tokens working reliably
- VRAM usage: ~28GB across both GPUs
This was about the largest model my GPUs could hold in VRAM while still producing tokens at a usable speed, or so I thought.
Practical Performance with Cline
Real-world usage patterns:
- Simple queries (< 1000 tokens): ~20-50 t/s, interactive
- Medium tasks (1000-5000 tokens): ~15 t/s, usable
- Large context (5000+ tokens): Prompt processing becomes the bottleneck, dropping to around 5 t/s
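These numbers come from day-to-day Cline use rather than a controlled benchmark. For repeatable measurements, llama-bench (built alongside llama-cli) reports prompt processing and token generation throughput separately and accepts comma-separated sizes:

# Compare a short prompt against a long one with all layers offloaded to the GPUs
./build/bin/llama-bench -m qwen2.5-coder-32b-instruct-q5_k_m.gguf -p 512,4096 -n 128 -ngl 99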
The Prompt Processing Bottleneck (Why Is Prompt Processing Limited to One GPU?)
The limitation is primarily in llama.cpp’s Vulkan backend, specifically in how it implements the batched matrix multiplications for attention. I do not want to claim any special insight into exactly what happens at the prompt processing stage, but according to an analysis by Claude, enabling multi-GPU prefill would require:
- Tensor parallelism for attention operations
- Pipeline parallelism with careful layer assignment
- Sophisticated scheduling to overlap communication and computation
None of these are currently implemented in llama.cpp’s Vulkan backend, though they exist in more specialized frameworks like vLLM or TensorRT-LLM which target CUDA.
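For completeness, llama.cpp does expose flags that control how the model is split across the two cards at runtime; these only change how layers or tensors are distributed for generation, not the single-GPU prefill behaviour described above (flag names are from recent builds, so check llama-server --help for yours, and row split may not be supported by the Vulkan backend):

# Default with two GPUs: whole layers split across the cards, half and half
./build/bin/llama-server -m models/qwen2.5-coder-32b-instruct-q5_k_m.gguf \
  -ngl 999 --split-mode layer --tensor-split 1,1

# Row split spreads individual weight matrices across the GPUs instead
./build/bin/llama-server -m models/qwen2.5-coder-32b-instruct-q5_k_m.gguf \
  -ngl 999 --split-mode row --main-gpu 0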
Qwen3 Sidetrack
I tried to upgrade to Qwen3 in various Q4 and Q5 quantizations, and all of them produced garbled output on the GPUs.
I hypothesized that the issue was related to asynchronous execution in the server, based on a GitHub issue reporting broken output on the Intel DG1 in async mode.
The theory was that Vulkan fence synchronization (vkWaitForFences) wasn’t properly blocking in the server’s async context, causing the sampler to read from GPU memory before the computation completed. This investigation was aided by an LLM.
Investigating the Code
Tracing through llama.cpp’s Vulkan backend code:
// File: ggml-vulkan.cpp
static void ggml_backend_vk_synchronize(ggml_backend_t backend) {
    // This fence wait might be failing in async context
    VK_CHECK(vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX));
    // Returns before GPU actually completes?
}
I added custom logging to track fence states and looked into the scheduler differences:
// Server async path (suspected problem)
ggml_backend_sched_graph_compute_async(...);
// CLI sync path (worked fine)
ggml_backend_sched_graph_compute(...);
ggml_backend_synchronize(backend);
The Attempted Fix
I tried disabling async mode entirely by modifying the llama.cpp source, thinking it was the same bug that affects the Intel DG1, since the W5700X is also a fairly obscure GPU (for inference, anyway).
device->support_async =
    (device->vendor_id != VK_VENDOR_ID_INTEL ||
     std::string(device->properties.deviceName).find("(DG1)") == std::string::npos) &&
    std::string(device->properties.deviceName).find("W5700X") == std::string::npos &&
    getenv("GGML_VK_DISABLE_ASYNC") == nullptr;
This would force synchronous execution even in server mode. Since I was only running the server for myself I didn’t require the async processing.
Unfortunately this hunch was not correct as Qwen3 remained corrupted even without async mode.
So I stuck with Qwen2.5; these were the final results.
Model Testing Results
I stuck mainly with Qwen because it's the only one that currently works reliably with Cline.
| Model | Quantization | VRAM | Prompt t/s | Generate t/s | Notes |
|---|---|---|---|---|---|
| Qwen2.5-Coder-7B | Q4_K_M | ~5GB | 51.4 | 43.5 | |
| Qwen2.5-Coder-14B | Q4_K_M | ~9GB | ~35 | ~25 | |
| Qwen2.5-Coder-32B | Q5_K_M | ~24GB | 15-20 | 13.5 | I believe only 32B plus is useful for any production work |
| Qwen3-Coder-32B | N/A | N/A | N/A | N/A | Didn’t work at any quantization |
Key finding: Q8 quantization with full GPU offload caused memory boundary issues and crashes, but Q4/Q5 worked flawlessly.
Lessons Learned
1. Check Your Vulkan Dependencies (Especially Translation Layers)
MoltenVK being 6 years out of date was a hidden issue. On macOS with AMD GPUs, verify your MoltenVK version before assuming hardware or software bugs.
vulkaninfo | grep -i "driverVersion\|driverName"
2. Fall Back to CPU Baseline Testing
The 1-minute CPU test immediately told me:
- The model file was fine
- llama.cpp was working
- The problem was specifically in the GPU code path
3. Quantization
Q4 and Q5 quantizations worked perfectly. Q8 hit memory boundary issues and caused corruption with full offload (-ngl 99), but worked fine with hybrid offload (-ngl 50). The lesson: don't push past VRAM limits or into unstable memory regions.
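For reference, a sketch of the hybrid-offload invocation that kept Q8 stable for me (the Q8 filename is illustrative, and the right number of layers to keep on the GPUs depends on the model and quantization):

# Keep roughly half the layers on the GPUs and leave the rest on the CPU
./build/bin/llama-cli \
  -m Qwen2.5-Coder-7B-Instruct-Q8_0.gguf \
  -p "def fibonacci(n):" \
  -n 50 \
  -ngl 50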
Final Configuration
#!/bin/bash
# ~/llama-server-start.sh
cd ~/Work/llama.cpp
./build/bin/llama-server \
-m models/qwen2.5-coder-32b-instruct-q5_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 32768 \
-ngl 999 \
--threads 8 \
--flash-attn \
--log-disable
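Before pointing Cline at the server, a quick way to confirm it is up and answering OpenAI-style requests (host and port match the script above; llama-server exposes /health and /v1/chat/completions):

# Liveness check
curl http://localhost:8080/health

# Minimal OpenAI-compatible chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a hello world in Python"}], "max_tokens": 64}'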
Conclusion
The final result: 13.5 tokens/second on a 32B parameter model with a 32K context window. On paper that looks like sufficient output speed, and it passes dummy tests like "write a hello world in Python", but in reality it falls over for any kind of production work requiring very large context windows. As an example, while trying to analyse a single function in ggml-vulkan.cpp from the llama.cpp codebase, I was close to the 32K context ceiling and getting about 4 t/s, which was completely unusable.
What started as a straightforward “build llama.cpp with Vulkan” project turned into a dive through the GPU computing stack on macOS.
For anyone attempting a similar setup on an Intel Mac with AMD GPUs: check your MoltenVK version first. Or, depending on your GPU, it might be worth going with Linux and ROCm rather than the Vulkan path.
Technical Specifications
Hardware:
- Mac Pro 7,1 (2019)
- 2x AMD Radeon Pro W5700X (16GB VRAM each, RDNA1)
- 96GB system RAM
Software:
- macOS Ventura 13.x
- MoltenVK 1.4.0 (upgraded from 0.2.2019)
- llama.cpp (latest, built from source with Vulkan backend)
- Vulkan SDK 1.3.296
Model:
- Qwen2.5-Coder-32B-Instruct
- Q5_K_M quantization
- 32K native context window
Integration:
- Cline (VS Code extension)
- OpenAI-compatible API endpoint
Side Notes
From the llama.cpp manifesto: “efficient transformer model inference on-device (i.e. at the edge).” Adding to this, I would say there is an additional question of what “edge” means: it encompasses any hardware, or any combination of hardware, that is capable of running inference on models. This is adjacent to Andrej Karpathy’s point that we need more education about what computing is and what computers do, so we are not lost in the future.
Vulkan might be coming to Apple Silicon without the MoltenVK translation layer; there is a developer hacking on a native Vulkan driver for macOS.
Inference reference: https://kipp.ly/h100-inferencing/
*The CLI commands and comparison tables were provided by Claude, but the entire exploration was my own idea.