We Rebuilt Mesh-LLM on ARM64 So the Mesh Could Actually Use the GPU

Most distributed inference demos hand-wave the hard part. We rebuilt Mesh-LLM on ARM64, proved CUDA was real, joined a live mesh, and traced the exact failure when a frontier model hit a broken shard cache.

By Nikhil
Tags: Mesh-LLM, GPU, ARM64, CUDA, Distributed Inference

A lot of distributed inference writing skips the only part that really matters: is the node actually real, or are you just telling yourself a nice story about GPUs?

We wanted one specific ARM64 box with an NVIDIA GPU to become a serious Mesh-LLM participant, not a fake one. So we rebuilt the stack for CUDA, verified the binaries, joined a live mesh, exposed the OpenAI-compatible API locally, and then pushed far enough to catch a real frontier-model failure in the wild.

That last part matters.

A clean demo is easy to fake. A system that survives contact with reality (host election, CUDA startup, RPC startup, live API traffic, and a precise failure trace when the model cache is broken) is much more interesting.

What Mesh-LLM actually is

Mesh-LLM is built around a simple promise: pool spare GPU capacity across machines and expose the result as one OpenAI-compatible API.

That matters because it changes the default architecture. Instead of treating every machine like an isolated model host, you can treat the mesh as one shared inference surface:

  • if a model fits on one machine, it runs there
  • if it does not, the mesh can split work across nodes
  • every node still exposes the same local OpenAI-compatible API shape

That is the right abstraction. But the abstraction falls apart if a node is built or configured wrong.

The practical problem

We did not need another hand-wavy “GPU accelerated” claim. We needed a local node that was verifiably built for CUDA on this machine.

The live repo on this box is:

/home/ncubelabsai/mesh-llm-src

The build script in that repo supports explicit Linux backends:

  • cpu
  • cuda
  • rocm
  • vulkan

That distinction matters more than people admit. On mixed or unusual hardware, especially ARM64 systems, the difference between “the code supports CUDA” and “this machine is actually built and running the CUDA path” is the difference between a fast mesh participant and a disappointing one.

What we rebuilt

The source build path for Linux is straightforward:

cd /home/ncubelabsai/mesh-llm-src
just build backend=cuda cuda_arch='121'

The explicit script form is:

scripts/build-linux.sh --backend cuda --cuda-arch 121

That build path configures the bundled llama.cpp backend with CUDA turned on, sets the CUDA architecture, keeps RPC enabled, then rebuilds the Rust mesh-llm binary around that stack.
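
If you are not sure which architecture value to pass, one option is to derive it from the GPU's compute capability. The sketch below assumes a recent NVIDIA driver whose nvidia-smi supports the compute_cap query field, and that the build wants the capability with the dot removed (12.1 becomes 121), which is how CMAKE_CUDA_ARCHITECTURES values are normally written. cap_to_arch is a hypothetical helper name, not something from the Mesh-LLM repo.

```shell
# Hypothetical helper: turn a compute capability such as "12.1"
# into the dotless form used for CUDA architecture values ("121").
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# Usage on the target machine (assumes nvidia-smi supports compute_cap):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   just build backend=cuda cuda_arch="$(cap_to_arch "$cap")"
```

Treat the result as a starting point and confirm it against the value that ends up in the build cache.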

In the build script, the CUDA branch enables:

  • GGML_CUDA=ON
  • GGML_CUDA_FA_ALL_QUANTS=ON
  • CMAKE_CUDA_ARCHITECTURES=<arch>

And the shared build also enables:

  • GGML_RPC=ON

That is the real work. Not a marketing layer. Not a benchmark screenshot. An actual local rebuild of the inference substrate.

What we verified on this machine

We verified the local llama.cpp build cache and the resulting artifacts.

From the current build cache:

CMAKE_CUDA_ARCHITECTURES=121
GGML_CUDA=ON
GGML_CUDA_FA=ON
GGML_CUDA_FA_ALL_QUANTS=ON
GGML_RPC=ON
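
A check like this is easy to script. The sketch below assumes the build cache is a standard CMakeCache.txt, where entries are stored as KEY:TYPE=VALUE. The helper names cache_value and expect_cache are ours for illustration, and the cache path in the usage note is a placeholder, since the exact build directory depends on how the repo lays out its llama.cpp build.

```shell
# Print the value of a key from a CMakeCache.txt.
# Entries look like KEY:TYPE=VALUE, for example GGML_CUDA:BOOL=ON.
cache_value() {
  # $1 = path to CMakeCache.txt, $2 = key to look up
  sed -n "s/^$2:[A-Z]*=//p" "$1" | head -n 1
}

# Report whether a key has the expected value; return non-zero if not.
expect_cache() {
  # $1 = path to CMakeCache.txt, $2 = key, $3 = expected value
  actual=$(cache_value "$1" "$2")
  if [ "$actual" = "$3" ]; then
    echo "ok    $2=$actual"
  else
    echo "FAIL  $2=${actual:-<unset>} (expected $3)"
    return 1
  fi
}
```

Usage, with the path as a placeholder: expect_cache path/to/CMakeCache.txt GGML_CUDA ON, then the same for GGML_RPC and CMAKE_CUDA_ARCHITECTURES. Matching on the full key followed by a colon keeps GGML_CUDA from accidentally matching GGML_CUDA_FA.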

We also verified the timestamps of the rebuilt binaries:

2026-04-27 04:12:53 PDT  rpc-server
2026-04-27 04:14:04 PDT  llama-server
2026-04-27 04:16:43 PDT  target/release/mesh-llm
2026-04-27 04:17:35 PDT  ~/.local/bin/mesh-llm
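
A listing like the one above can be produced with a few lines of shell. This is a minimal sketch assuming GNU date (its -r option reads a file's modification time); print_mtimes is a hypothetical helper name.

```shell
# Print "YYYY-MM-DD HH:MM:SS TZ  path" for each file given,
# or flag the file as MISSING so a skipped build step is obvious.
print_mtimes() {
  for f in "$@"; do
    if [ -e "$f" ]; then
      printf '%s  %s\n' "$(date -r "$f" '+%Y-%m-%d %H:%M:%S %Z')" "$f"
    else
      printf 'MISSING  %s\n' "$f"
    fi
  done
}
```

For example: print_mtimes target/release/mesh-llm ~/.local/bin/mesh-llm. The installed copy should be no older than the release binary it was copied from.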

And we verified that the installed binary matched the freshly built release binary.
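
One simple way to make that comparison is a byte-for-byte check with cmp. The sketch below is an illustration under that assumption, not necessarily the exact command we ran; same_binary is a hypothetical helper name.

```shell
# Compare a freshly built binary with the installed copy.
# cmp -s exits 0 only when both files exist and are byte-for-byte identical.
same_binary() {
  # $1 = fresh build, $2 = installed copy
  if cmp -s "$1" "$2"; then
    echo "match: $2 is identical to $1"
  else
    echo "MISMATCH: $2 differs from $1 (or a file is missing)"
    return 1
  fi
}
```

For example: same_binary target/release/mesh-llm ~/.local/bin/mesh-llm. A mismatch here means the shell is still launching an older build.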

That matters because a lot of teams “rebuild” something, then accidentally keep running the old installed binary.

Live proof from the running mesh

These screenshots are from the live session on this machine while bringing Mesh-LLM up against a real mesh.

1) GPU detection

[Screenshot: Mesh-LLM GPU detection proof]

This is the minimum bar. If the runtime cannot clearly identify the local GPU and backend device, the rest of the stack is wishful thinking.

2) Mesh join + bootstrap API

[Screenshot: Mesh-LLM mesh join and bootstrap API proof]

This is where the claims become checkable: real peer discovery, a successful mesh join, the local bootstrap API on localhost:9337, and a host election for a mesh model load.

3) Live model catalog

[Screenshot: Mesh-LLM live model catalog proof]

Even before the local host model finished downloading, the API surface was already useful. The node exposed a live mesh-backed catalog through the same OpenAI-compatible endpoint shape we care about operationally.
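
To query the catalog yourself, something like the following should work. It assumes the node serves the standard OpenAI-style model listing route, /v1/models, on the bootstrap port mentioned above; we state the port in this post, but the exact route is an assumption based on the "OpenAI-compatible" claim. models_url and list_models are hypothetical helper names.

```shell
# Build the model-listing URL from a base URL.
# Strips one trailing slash so "host:port" and "host:port/" both work.
models_url() {
  printf '%s/v1/models\n' "${1%/}"
}

# Fetch the catalog as raw JSON.
# -f fails on HTTP errors; -sS stays quiet but still prints real errors.
list_models() {
  curl -fsS "$(models_url "${1:-http://localhost:9337}")"
}
```

Running list_models with no argument targets localhost:9337. Pass a different base URL to point it at another node.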

4) Frontier-model load attempt: real failure, real diagnosis

[Screenshot: Frontier model load attempt proof]

We also pushed the node into a more ambitious path: let the mesh elect this machine as host for Qwen3-Coder-Next-Q4_K_M.

That was valuable precisely because it did not end in a fake success. The node was elected, CUDA services started, the full API proxy came up, and llama-server began loading the model. It then failed because an interrupted multi-part Hugging Face download had left shard 00003-of-00004 missing from the cache.

That is a better story than a hand-waved demo. It proves two things at once:

  • the rebuilt GPU stack was real enough to get through host election, CUDA startup, RPC startup, and local llama-server launch
  • the remaining blocker was operational state in the model cache, not “CUDA still isn’t working”

In infra work, that distinction matters.
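
A missing shard is cheap to detect before launch. The sketch below assumes the split files follow the llama.cpp naming pattern NAME-00001-of-00004.gguf and that you know the directory holding them. missing_shards is a hypothetical helper, and the directory and file prefix are placeholders you would fill in for your own cache.

```shell
# Print every expected shard that is absent from a directory.
# [ -e ] follows symlinks, so a dangling link (common after an
# interrupted download into a link-based cache) also counts as missing.
missing_shards() {
  # $1 = directory, $2 = file name prefix, $3 = total shard count
  i=1
  while [ "$i" -le "$3" ]; do
    name=$(printf '%s-%05d-of-%05d.gguf' "$2" "$i" "$3")
    [ -e "$1/$name" ] || echo "$name"
    i=$((i + 1))
  done
}
```

Empty output means all shards are present. Any printed name is a file to re-download before asking the mesh to elect this node as host again.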

Why this improves the mesh

A distributed inference system does not magically fix a weak local node.

If the local machine is serving from the wrong backend, everything downstream is worse:

  • lower local throughput
  • worse latency before the mesh even starts distributing work
  • weaker single-node fallback behavior
  • less useful GPU contribution to the rest of the pool

By rebuilding the local node correctly, we made that machine a better mesh participant.

That means:

  • local requests start from a faster base
  • the node contributes real GPU capacity to the mesh
  • RPC-based distributed paths have a stronger local foundation
  • the OpenAI-compatible endpoint is backed by a stack we actually trust

In other words: the mesh gets better when each node stops lying.

The part people skip: proof

We also checked the runtime GPU identity directly.

On this machine, mesh-llm gpus reports an NVIDIA GB10 device on CUDA0.

That is exactly the kind of check you should do before declaring victory. If you do not verify the built backend, the produced binaries, and the runtime GPU identity, you are not operating a GPU-backed inference node. You are hoping.
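
That runtime check can be turned into a gate for scripts. The sketch below makes only one assumption about the output of mesh-llm gpus: that a CUDA-backed device appears with a label like CUDA0, as it does on this machine. The rest of the output format is not relied on, and has_cuda_device is a hypothetical helper name.

```shell
# Read GPU listing text on stdin and succeed only if it names
# a CUDA device such as CUDA0 or CUDA1.
has_cuda_device() {
  grep -Eq 'CUDA[0-9]+'
}

# Usage on the node:
#   mesh-llm gpus | has_cuda_device && echo "CUDA device visible"
```

A non-zero exit is a signal to stop and recheck the build before joining the mesh.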

The bigger lesson

The interesting thing about Mesh-LLM is not just that it can spread inference across machines.

The interesting thing is that it gives you a cleaner way to think about inference infrastructure:

  • make each node honest
  • verify the hardware path locally
  • expose one consistent API surface
  • let the mesh decide how to use the pool

That is a much better operational model than hand-wiring one-off model servers and pretending distributed behavior will sort itself out later.

Copy/paste rebuild checklist

If you are doing this on a similar machine, the shortest useful checklist is:

  1. Verify the repo and toolchain are local and current.
  2. Rebuild with the CUDA backend explicitly.
  3. Set the correct CUDA architecture for the machine.
  4. Confirm the build cache shows GGML_CUDA=ON and GGML_RPC=ON.
  5. Verify rpc-server, llama-server, and mesh-llm were actually rebuilt.
  6. Verify the installed binary matches the fresh build.
  7. Run mesh-llm gpus and confirm the runtime sees the real GPU.

If you skip any of those, you are leaving room for fake confidence.

Closing

The headline is not “we turned on a flag.”

The headline is that we made our local inference node real.

Mesh-LLM is most compelling when it turns uneven hardware into a shared serving surface. But that only works when each participant is built and verified like production infrastructure.

That is what we did here: rebuild the stack for CUDA on ARM64, verify the artifacts, and turn the box into a better node for the mesh.