Run your own LLM backends and connect them to the Cellule.ai network. You keep full control over your models, hardware, and configuration.
Proxy mode is for contributors who already run their own inference servers (llama-server, vLLM, Ollama, or similar) and want to keep managing them directly.

Install the client:

```shell
pip install iamine-ai -i https://cellule.ai/pypi --extra-index-url https://pypi.org/simple
```
Start one or more llama-server instances on different ports:
```shell
# Backend 1: Reasoning model on port 8080
llama-server -m models/Qwen3-30B-A3B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99 -c 4096

# Backend 2: Coding model on port 8081
llama-server -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8081 -ngl 99 -c 4096

# Backend 3: Fast chat model on port 8082
llama-server -m models/Qwen3.5-9B-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8082 -ngl 99 -c 4096
```
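Before connecting the proxy, it can help to confirm that each backend port is actually accepting connections. A minimal sketch of such a check (the helper and port list are illustrative, not part of the iamine client):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True  # backend is accepting connections
        except OSError:
            time.sleep(0.5)  # not listening yet; retry
    return False

if __name__ == "__main__":
    # Ports from the three llama-server commands above
    for port in (8080, 8081, 8082):
        status = "up" if wait_for_port("127.0.0.1", port, timeout=1.0) else "DOWN"
        print(f"127.0.0.1:{port} is {status}")
```

A backend that reports DOWN here will also be unreachable for the proxy, so this catches misconfigured ports early.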
Create a `proxy.json` that describes each backend:

```json
{
  "pool_url": "wss://cellule.ai/ws",
  "backends": [
    {
      "name": "Reasoning",
      "url": "http://127.0.0.1:8080",
      "model": "Qwen3-30B-A3B",
      "model_path": "models/Qwen3-30B-A3B-Instruct-Q4_K_M.gguf",
      "worker_id": "MyWorker-reasoning",
      "bench_tps": 60.0
    },
    {
      "name": "Coder",
      "url": "http://127.0.0.1:8081",
      "model": "Qwen3-Coder-30B-A3B",
      "model_path": "models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf",
      "worker_id": "MyWorker-coder",
      "bench_tps": 55.0
    },
    {
      "name": "Chat",
      "url": "http://127.0.0.1:8082",
      "model": "Qwen3.5-9B",
      "model_path": "models/Qwen3.5-9B-Q4_K_M.gguf",
      "worker_id": "MyWorker-chat",
      "bench_tps": 80.0
    }
  ]
}
```
Start the proxy:

```shell
python -m iamine proxy -c proxy.json
```
Each backend registers as a separate worker on the pool, and the pool routes traffic to whichever worker serves the requested model.
| Field | Required | Description |
|---|---|---|
| `pool_url` | Yes | WebSocket URL of the pool to join. Default: `wss://cellule.ai/ws` |
| `backends[].name` | Yes | Display name for this backend (shown in logs) |
| `backends[].url` | Yes | HTTP URL of your llama-server instance |
| `backends[].model` | Yes | Model name (used for routing) |
| `backends[].model_path` | Yes | Path to the GGUF file (for pool registry matching) |
| `backends[].worker_id` | No | Custom worker ID. Auto-generated if omitted |
| `backends[].bench_tps` | No | Declared throughput in tokens/sec. Pool uses this for routing priority |
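The required fields above can be checked before starting the proxy. A small validation sketch (field names come from the table; the function itself is illustrative, not part of iamine):

```python
import json

# Required per-backend fields from the configuration table
REQUIRED_BACKEND_FIELDS = ("name", "url", "model", "model_path")

def validate_config(raw: str) -> list[str]:
    """Return a list of problems found in a proxy.json document (empty if OK)."""
    errors = []
    cfg = json.loads(raw)
    if "pool_url" not in cfg:
        errors.append("missing pool_url")
    for i, backend in enumerate(cfg.get("backends", [])):
        for field in REQUIRED_BACKEND_FIELDS:
            if field not in backend:
                errors.append(f"backends[{i}]: missing {field}")
    return errors

sample = '{"pool_url": "wss://cellule.ai/ws", "backends": [{"name": "Chat", "url": "http://127.0.0.1:8082", "model": "Qwen3.5-9B"}]}'
print(validate_config(sample))  # → ['backends[0]: missing model_path']
```

Catching a missing `model_path` locally is cheaper than finding out at registration time that the pool cannot match the worker against its model registry.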
Proxy mode compared with auto mode:

| | Auto (`--auto`) | Proxy (`-c proxy.json`) |
|---|---|---|
| Model selection | Pool decides | You decide |
| Model download | Automatic | You manage |
| Inference engine | Built-in (llama-cpp-python) | Your backend (llama-server, vLLM, etc.) |
| GPU config | Auto-detected | You configure |
| Multi-model | 1 model per worker | N models per machine |
| Auto-update | Yes (pool pushes) | Package only (config preserved) |
| $IAMINE earnings | Same formula | Same formula |
llama-server (recommended): from llama.cpp. Best performance for GGUF models.

```shell
llama-server -m model.gguf --host 127.0.0.1 --port 8080 -ngl 99
```
Ollama: set the backend URL to Ollama's API port.

```shell
# Start Ollama
ollama serve
```

Then, in proxy.json, point the backend at Ollama:

```json
"url": "http://127.0.0.1:11434"
```
vLLM: for high-throughput serving with continuous batching.

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-30B-A3B --port 8080
```
The Cellule.ai test network runs a real proxy with 4 backends on an AMD Ryzen AI MAX+ PRO 395 (94 GB RAM, 88 GB ROCm VRAM):
```json
{
  "pool_url": "wss://cellule.ai/ws",
  "backends": [
    {"name": "RED", "url": "http://127.0.0.1:8080", "model": "Qwen3-30B-A3B", "bench_tps": 105.0, "worker_id": "RED-z2"},
    {"name": "Coder", "url": "http://127.0.0.1:8081", "model": "Qwen3-Coder-30B-A3B", "bench_tps": 60.0, "worker_id": "Coder-z2"},
    {"name": "Tank", "url": "http://127.0.0.1:8082", "model": "Qwen3.5-35B-A3B", "bench_tps": 55.0, "worker_id": "Tank-z2"},
    {"name": "Scout", "url": "http://127.0.0.1:8083", "model": "Qwen3.5-9B", "bench_tps": 50.0, "worker_id": "Scout-z2"}
  ]
}
```
Combined throughput: 270 tokens/sec across 4 models. This single machine serves reasoning, coding, large context, and fast chat — all at once.
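The combined figure is simply the sum of the declared `bench_tps` values in the example config:

```python
# Declared throughput per backend, from the test-network proxy.json above
bench_tps = {"RED": 105.0, "Coder": 60.0, "Tank": 55.0, "Scout": 50.0}
print(sum(bench_tps.values()))  # → 270.0
```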
Cellule.ai — Decentralized AI, powered by the community