Run your own LLM backends and connect them to the Cellule.ai network. You keep full control over your models, hardware, and configuration.
Proxy mode is for contributors who already run their own inference servers (llama-server, vLLM, Ollama, or similar) and want to keep managing them directly.

Install the client:

```shell
pip install iamine-ai -i https://cellule.ai/pypi --extra-index-url https://pypi.org/simple
```
Start one or more llama-server instances on different ports:
```shell
# Backend 1: Reasoning model on port 8080
llama-server -m models/Qwen3-30B-A3B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 -ngl 99 -c 4096

# Backend 2: Coding model on port 8081
llama-server -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8081 -ngl 99 -c 4096

# Backend 3: Fast chat model on port 8082
llama-server -m models/Qwen3.5-9B-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8082 -ngl 99 -c 4096
```
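Before connecting the proxy, it can help to confirm that each backend port is actually accepting connections. A minimal sketch of such a check (the helper and port list are illustrative, not part of the iamine client):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True  # backend is accepting connections
        except OSError:
            time.sleep(0.5)  # not listening yet; retry
    return False

if __name__ == "__main__":
    # Ports from the three llama-server commands above
    for port in (8080, 8081, 8082):
        status = "up" if wait_for_port("127.0.0.1", port, timeout=1.0) else "DOWN"
        print(f"127.0.0.1:{port} is {status}")
```

A backend that reports DOWN here will also be unreachable for the proxy, so this catches misconfigured ports early.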
Create a `proxy.json` that describes each backend:

```json
{
  "pool_url": "wss://cellule.ai/ws",
  "backends": [
    {
      "name": "Reasoning",
      "url": "http://127.0.0.1:8080",
      "model": "Qwen3-30B-A3B",
      "model_path": "models/Qwen3-30B-A3B-Instruct-Q4_K_M.gguf",
      "worker_id": "MyWorker-reasoning",
      "bench_tps": 60.0
    },
    {
      "name": "Coder",
      "url": "http://127.0.0.1:8081",
      "model": "Qwen3-Coder-30B-A3B",
      "model_path": "models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf",
      "worker_id": "MyWorker-coder",
      "bench_tps": 55.0
    },
    {
      "name": "Chat",
      "url": "http://127.0.0.1:8082",
      "model": "Qwen3.5-9B",
      "model_path": "models/Qwen3.5-9B-Q4_K_M.gguf",
      "worker_id": "MyWorker-chat",
      "bench_tps": 80.0
    }
  ]
}
```
Start the proxy:

```shell
python -m iamine proxy -c proxy.json
```
Each backend registers as a separate worker on the pool, and the pool routes traffic to whichever worker serves the requested model.
| Field | Required | Description |
|---|---|---|
| `pool_url` | Yes | WebSocket URL of the pool to join. Default: `wss://cellule.ai/ws` |
| `backends[].name` | Yes | Display name for this backend (shown in logs) |
| `backends[].url` | Yes | HTTP URL of your llama-server instance |
| `backends[].model` | Yes | Model name (used for routing) |
| `backends[].model_path` | Yes | Path to the GGUF file (for pool registry matching) |
| `backends[].worker_id` | No | Custom worker ID. Auto-generated if omitted |
| `backends[].bench_tps` | No | Declared throughput in tokens/sec. Pool uses this for routing priority |
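The required fields above can be checked before starting the proxy. A small validation sketch (field names come from the table; the function itself is illustrative, not part of iamine):

```python
import json

# Required per-backend fields from the configuration table
REQUIRED_BACKEND_FIELDS = ("name", "url", "model", "model_path")

def validate_config(raw: str) -> list[str]:
    """Return a list of problems found in a proxy.json document (empty if OK)."""
    errors = []
    cfg = json.loads(raw)
    if "pool_url" not in cfg:
        errors.append("missing pool_url")
    for i, backend in enumerate(cfg.get("backends", [])):
        for field in REQUIRED_BACKEND_FIELDS:
            if field not in backend:
                errors.append(f"backends[{i}]: missing {field}")
    return errors

sample = '{"pool_url": "wss://cellule.ai/ws", "backends": [{"name": "Chat", "url": "http://127.0.0.1:8082", "model": "Qwen3.5-9B"}]}'
print(validate_config(sample))  # → ['backends[0]: missing model_path']
```

Catching a missing `model_path` locally is cheaper than finding out at registration time that the pool cannot match the worker against its model registry.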
Proxy mode compared with auto mode:

| | Auto (`--auto`) | Proxy (`-c proxy.json`) |
|---|---|---|
| Model selection | Pool decides | You decide |
| Model download | Automatic | You manage |
| Inference engine | Built-in (llama-cpp-python) | Your backend (llama-server, vLLM, etc.) |
| GPU config | Auto-detected | You configure |
| Multi-model | 1 model per worker | N models per machine |
| Auto-update | Yes (pool pushes) | Package only (config preserved) |
| $IAMINE earnings | Same formula | Same formula |
llama-server (recommended): from llama.cpp. Best performance for GGUF models.

```shell
llama-server -m model.gguf --host 127.0.0.1 --port 8080 -ngl 99
```
Ollama: set the backend URL to Ollama's API port.

```shell
# Start Ollama
ollama serve
```

Then, in proxy.json, point the backend at Ollama:

```json
"url": "http://127.0.0.1:11434"
```
vLLM: for high-throughput serving with continuous batching.

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-30B-A3B --port 8080
```
The Cellule.ai test network runs a real proxy with 4 backends on an AMD Ryzen AI MAX+ PRO 395 (94 GB RAM, 88 GB ROCm VRAM):
```json
{
  "pool_url": "wss://cellule.ai/ws",
  "backends": [
    {"name": "RED", "url": "http://127.0.0.1:8080", "model": "Qwen3-30B-A3B", "bench_tps": 105.0, "worker_id": "RED-z2"},
    {"name": "Coder", "url": "http://127.0.0.1:8081", "model": "Qwen3-Coder-30B-A3B", "bench_tps": 60.0, "worker_id": "Coder-z2"},
    {"name": "Tank", "url": "http://127.0.0.1:8082", "model": "Qwen3.5-35B-A3B", "bench_tps": 55.0, "worker_id": "Tank-z2"},
    {"name": "Scout", "url": "http://127.0.0.1:8083", "model": "Qwen3.5-9B", "bench_tps": 50.0, "worker_id": "Scout-z2"}
  ]
}
```
Combined throughput: 270 tokens/sec across 4 models. This single machine serves reasoning, coding, large context, and fast chat — all at once.
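The combined figure is simply the sum of the declared `bench_tps` values in the example config:

```python
# Declared throughput per backend, from the test-network proxy.json above
bench_tps = {"RED": 105.0, "Coder": 60.0, "Tank": 55.0, "Scout": 50.0}
print(sum(bench_tps.values()))  # → 270.0
```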
Cellule.ai — Decentralized AI, powered by the community