llama.cpp/tools/rpc/README.md

## Overview

> [!IMPORTANT]
> This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and
> insecure. **Never run the RPC server on an open network or in a sensitive environment!**

The `rpc-server` allows  running `ggml` backend on a remote host.
The RPC backend communicates with one or several instances of `rpc-server` and offloads computations to them.
This can be used for distributed LLM inference with `llama.cpp` in the following way:

```mermaid
flowchart TD
    rpcb<-->|TCP|srva
    rpcb<-->|TCP|srvb
    rpcb<-.->|TCP|srvn
    subgraph hostn[Host N]
    srvn[rpc-server]<-.->backend3["Backend (CUDA,Metal,etc.)"]
    end
    subgraph hostb[Host B]
    srvb[rpc-server]<-->backend2["Backend (CUDA,Metal,etc.)"]
    end
    subgraph hosta[Host A]
    srva[rpc-server]<-->backend["Backend (CUDA,Metal,etc.)"]
    end
    subgraph host[Main Host]
    local["Backend (CUDA,Metal,etc.)"]<-->ggml[llama-cli]
    ggml[llama-cli]<-->rpcb[RPC backend]
    end
    style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5
```

Each host can run a different backend, e.g. one with CUDA and another with Metal.
You can also run multiple `rpc-server` instances on the same host, each with a different backend.

## Usage

On each host, build the corresponding backend with `cmake` and add `-DGGML_RPC=ON` to the build options.
For example, to build the CUDA backend with RPC support:

```bash
mkdir build-rpc-cuda
cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release
```

Then, start the `rpc-server` with the backend:

```bash
$ bin/rpc-server -p 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes
Starting RPC server on 0.0.0.0:50052
```

When using the CUDA backend, you can specify the device with the `CUDA_VISIBLE_DEVICES` environment variable, e.g.:
```bash
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
```
This way you can run multiple `rpc-server` instances on the same host, each with a different CUDA device.


On the main host build `llama.cpp` for the local backend and add `-DGGML_RPC=ON` to the build options.
Finally, when running `llama-cli`, use the `--rpc` option to specify the host and port of each `rpc-server`:

```bash
$ bin/llama-cli -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99
```

This way you can offload model layers to both local and remote devices.

### Local cache

The RPC server can use a local cache to store large tensors and avoid transferring them over the network.
This can speed up model loading significantly, especially when using large models.
To enable the cache, use the `-c` option:

```bash
$ bin/rpc-server -c
```

By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory and can be controlled via the `LLAMA_CACHE` environment variable.
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`## Overview`

Merge commit from fork 2024-08-09 23:03:21 +03:00			`> [!IMPORTANT]`
			`> This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and`
			`> insecure. Never run the RPC server on an open network or in a sensitive environment!`

ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			The `rpc-server` allows running `ggml` backend on a remote host.
			The RPC backend communicates with one or several instances of `rpc-server` and offloads computations to them.
			This can be used for distributed LLM inference with `llama.cpp` in the following way:

			```mermaid
			`flowchart TD`
rpc : update README [no ci] (#9320) Update README with instructions how to offload model layers to both local and remote devices 2024-09-09 11:04:39 +03:00			`rpcb<-->\|TCP\|srva`
			`rpcb<-->\|TCP\|srvb`
			`rpcb<-.->\|TCP\|srvn`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`subgraph hostn[Host N]`
rpc : update README [no ci] (#9320) Update README with instructions how to offload model layers to both local and remote devices 2024-09-09 11:04:39 +03:00			`srvn[rpc-server]<-.->backend3["Backend (CUDA,Metal,etc.)"]`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`end`
			`subgraph hostb[Host B]`
rpc : update README [no ci] (#9320) Update README with instructions how to offload model layers to both local and remote devices 2024-09-09 11:04:39 +03:00			`srvb[rpc-server]<-->backend2["Backend (CUDA,Metal,etc.)"]`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`end`
			`subgraph hosta[Host A]`
rpc : update README [no ci] (#9320) Update README with instructions how to offload model layers to both local and remote devices 2024-09-09 11:04:39 +03:00			`srva[rpc-server]<-->backend["Backend (CUDA,Metal,etc.)"]`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`end`
			`subgraph host[Main Host]`
rpc : update README [no ci] (#9320) Update README with instructions how to offload model layers to both local and remote devices 2024-09-09 11:04:39 +03:00			`local["Backend (CUDA,Metal,etc.)"]<-->ggml[llama-cli]`
			`ggml[llama-cli]<-->rpcb[RPC backend]`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`end`
			`style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5`
			```

			`Each host can run a different backend, e.g. one with CUDA and another with Metal.`
			You can also run multiple `rpc-server` instances on the same host, each with a different backend.

			`## Usage`

llama : reorganize source code + improve CMake (#8006) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com> 2024-06-26 18:33:02 +03:00			On each host, build the corresponding backend with `cmake` and add `-DGGML_RPC=ON` to the build options.
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`For example, to build the CUDA backend with RPC support:`

			```bash
			`mkdir build-rpc-cuda`
			`cd build-rpc-cuda`
llama : reorganize source code + improve CMake (#8006) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com> 2024-06-26 18:33:02 +03:00			`cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`cmake --build . --config Release`
			```

			Then, start the `rpc-server` with the backend:

			```bash
rpc : add command line arg for specifying backend memory ref: #7293 2024-05-15 15:29:07 +03:00			`$ bin/rpc-server -p 50052`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			`create_backend: using CUDA backend`
			`ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no`
			`ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes`
			`ggml_cuda_init: found 1 CUDA devices:`
			`Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes`
			`Starting RPC server on 0.0.0.0:50052`
			```

			When using the CUDA backend, you can specify the device with the `CUDA_VISIBLE_DEVICES` environment variable, e.g.:
			```bash
rpc : add command line arg for specifying backend memory ref: #7293 2024-05-15 15:29:07 +03:00			`$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			```
			This way you can run multiple `rpc-server` instances on the same host, each with a different CUDA device.


rpc : update README [no ci] (#9320) Update README with instructions how to offload model layers to both local and remote devices 2024-09-09 11:04:39 +03:00			On the main host build `llama.cpp` for the local backend and add `-DGGML_RPC=ON` to the build options.
			Finally, when running `llama-cli`, use the `--rpc` option to specify the host and port of each `rpc-server`:
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00
			```bash
`build`: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809) * `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew * server: update refs -> llama-server gitignore llama-server * server: simplify nix package * main: update refs -> llama fix examples/main ref * main/server: fix targets * update more names * Update build.yml * rm accidentally checked in bins * update straggling refs * Update .gitignore * Update server-llm.sh * main: target name -> llama-cli * Prefix all example bins w/ llama- * fix main refs * rename {main->llama}-cmake-pkg binary * prefix more cmake targets w/ llama- * add/fix gbnf-validator subfolder to cmake * sort cmake example subdirs * rm bin files * fix llama-lookup-* Makefile rules * gitignore /llama-* * rename Dockerfiles * rename llama\|main -> llama-cli; consistent RPM bin prefixes * fix some missing -cli suffixes * rename dockerfile w/ llama-cli * rename(make): llama-baby-llama * update dockerfile refs * more llama-cli(.exe) * fix test-eval-callback * rename: llama-cli-cmake-pkg(.exe) * address gbnf-validator unused fread warning (switched to C++ / ifstream) * add two missing llama- prefixes * Updating docs for eval-callback binary to use new `llama-` prefix. * Updating a few lingering doc references for rename of main to llama-cli * Updating `run-with-preset.py` to use new binary names. Updating docs around `perplexity` binary rename. * Updating documentation references for lookup-merge and export-lora * Updating two small `main` references missed earlier in the finetune docs. * Update apps.nix * update grammar/README.md w/ new llama-* names * update llama-rpc-server bin name + doc * Revert "update llama-rpc-server bin name + doc" This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930. * add hot topic notice to README.md * Update README.md * Update README.md * rename gguf-split & quantize bins refs in **/tests.sh --------- Co-authored-by: HanClinto <hanclinto@gmail.com> 2024-06-13 00:41:52 +01:00			`$ bin/llama-cli -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99`
ggml : add RPC backend (#6829) * ggml : add RPC backend The RPC backend proxies all operations to a remote server which runs a regular backend (CPU, CUDA, Metal, etc). * set TCP_NODELAY * add CI workflows * Address review comments * fix warning * implement llama_max_devices() for RPC * Address review comments * Address review comments * wrap sockfd into a struct * implement get_alignment and get_max_size * add get_device_memory * fix warning * win32 support * add README * readme : trim trailing whitespace * Address review comments * win32 fix * Address review comments * fix compile warnings on macos 2024-05-14 14:27:19 +03:00			```
rpc : update README [no ci] (#9320) Update README with instructions how to offload model layers to both local and remote devices 2024-09-09 11:04:39 +03:00
readme : add LLMUnity to UI projects (#9381) * add LLMUnity to UI projects * add newline to examples/rpc/README.md to fix editorconfig-checker unit test 2024-09-09 14:21:38 +03:00			`This way you can offload model layers to both local and remote devices.`

rpc : update README for cache usage (#12620) 2025-03-28 09:44:13 +02:00			`### Local cache`

			`The RPC server can use a local cache to store large tensors and avoid transferring them over the network.`
			`This can speed up model loading significantly, especially when using large models.`
			To enable the cache, use the `-c` option:

			```bash
			`$ bin/rpc-server -c`
			```

			By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory and can be controlled via the `LLAMA_CACHE` environment variable.