Why Run Models Locally?
Cloud-hosted LLMs are convenient, but they come with trade-offs that matter when you are writing code or building private tools. Every prompt you send to a hosted API leaves your machine — proprietary code, internal architecture details, database schemas, and business logic all travel to a third-party server. For personal projects this might be acceptable, but for client work, internal tooling, or anything under a confidentiality agreement, it raises real concerns.
Running models locally eliminates these issues entirely:
- Privacy. Your code never leaves your network. No prompts are logged, stored, or used for training by a third party.
- Zero inference cost. After the initial hardware investment, every token is free. For heavy daily usage — code completion, refactoring, chat-based debugging — the savings compound quickly.
- No rate limits. You are not competing for capacity with other users. Your model is available 24/7 at whatever throughput your hardware allows.
- Latency. There is no internet round trip. For short completions, where network overhead dominates total latency, a model on your LAN often responds faster than a distant data center.
- Offline availability. You can work on a plane, in a restricted network, or during a provider outage without losing your AI assistant.
The trade-off is capability — local models are smaller than frontier models and will not match their reasoning depth on complex tasks. But for code completion, boilerplate generation, refactoring suggestions, and conversational debugging, modern open-weight models are remarkably competent.
The Stack — vLLM and LiteLLM
Once you start thinking about running models locally, something happens to the way you look at hardware. Every device with a GPU becomes a potential inference node. Your workstation with an RTX 4090? Obviously. That old gaming laptop collecting dust? It has a 3070 — that is 8 GB of VRAM doing nothing. Your partner’s MacBook with its unified memory? It could run a 7B model through llama.cpp while they are asleep. The mini PC you bought for a home media server? It has an iGPU — technically it counts. You start eyeing the smart fridge and wondering if it has a tensor core somewhere.
This is the home lab mentality, and it is a real phenomenon once you realize that LiteLLM can aggregate multiple backends into a single endpoint. Every GPU in your household becomes part of your personal inference cluster.
The most practical setup I have found combines two tools: vLLM for GPU-accelerated model serving and LiteLLM as an OpenAI-compatible API gateway. Together, they give you a production-grade inference endpoint that any tool — IDE plugins, CLI agents, custom scripts — can connect to using standard APIs.
```mermaid
graph LR
    subgraph Clients
        IDE["JetBrains IDE"]
        CC["Claude Code"]
        OC["opencode"]
        Custom["Custom Tools"]
    end
    subgraph Gateway
        LiteLLM["LiteLLM Proxy\n:4000"]
    end
    subgraph Workstation["Workstation — RTX 4090"]
        V1["vLLM\nQwen3-4B\n:8000"]
        V2["vLLM\nQwen3-30B\n:9000"]
    end
    subgraph Laptop["Old Laptop — RTX 3070"]
        V3["vLLM\nQwen3-4B\n:8000"]
    end
    subgraph Mac["MacBook — Unified Memory"]
        V4["llama.cpp\nQwen3-14B\n:8080"]
    end
    IDE --> LiteLLM
    CC --> LiteLLM
    OC --> LiteLLM
    Custom --> LiteLLM
    LiteLLM --> V1
    LiteLLM --> V2
    LiteLLM --> V3
    LiteLLM --> V4
```
The beauty of this architecture is that LiteLLM does not care where the backend lives or what serves it. A vLLM instance on an NVIDIA workstation, llama.cpp on a MacBook, Ollama on a gaming laptop — they all expose OpenAI-compatible endpoints, and LiteLLM routes to all of them with automatic fallbacks. When your workstation’s Qwen3-30B is busy, the request silently falls back to Qwen3-4B on the laptop.
vLLM is a high-throughput inference engine that serves models with continuous batching, PagedAttention for efficient memory management, and OpenAI-compatible API endpoints out of the box. It runs as a Docker container with NVIDIA GPU support.
LiteLLM is a lightweight proxy that exposes a unified API (OpenAI, Anthropic, and other formats) and routes requests to one or more backend model servers. It adds fallback routing, load balancing, and observability — all configured through a single YAML file.
Deploying vLLM
vLLM runs as a Docker container with GPU passthrough. Here is a Docker Compose configuration that serves two models on a single GPU, splitting VRAM between them:
```yaml
services:
  qwen3-4b:
    image: nvcr.io/nvidia/vllm:26.01-py3
    container_name: qwen3-4b
    restart: unless-stopped
    ipc: host
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model Qwen/Qwen3-4B
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.25
      --max-num-batched-tokens 8192
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enable-chunked-prefill
      --max-num-seqs 64

  qwen3-30b:
    image: nvcr.io/nvidia/vllm:26.01-py3
    container_name: qwen3-30b
    restart: unless-stopped
    ipc: host
    runtime: nvidia
    ports:
      - "9000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model Qwen/Qwen3-30B
      --port 8000
      --max-model-len 65536
      --gpu-memory-utilization 0.60
      --max-num-batched-tokens 8192
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enable-chunked-prefill
      --max-num-seqs 64
```
Key parameters to understand:
| Parameter | Purpose |
|---|---|
| `--gpu-memory-utilization` | Fraction of GPU VRAM to allocate. Setting 0.25 and 0.60 lets two models share a single GPU. |
| `--max-model-len` | Maximum context window. Larger values consume more VRAM. |
| `--max-num-batched-tokens` | Controls continuous batching throughput. |
| `--enable-auto-tool-choice` | Enables function/tool calling support, essential for agentic workflows. |
| `--tool-call-parser hermes` | Parser format for tool call extraction from model output. |
| `--enable-chunked-prefill` | Reduces time-to-first-token by processing prefill in chunks. |
| `--max-num-seqs` | Maximum concurrent sequences — controls parallel request capacity. |
| `ipc: host` | Required for shared memory between GPU workers. |
The HuggingFace cache volume mount (~/.cache/huggingface) ensures models are downloaded once and reused across container restarts.
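Loading the larger model can take several minutes, so it helps to let Docker report when vLLM is actually ready. A sketch of a healthcheck for each vLLM service, relying on vLLM's `/health` endpoint (and assuming `curl` is available inside the container):

```yaml
# Add under each vLLM service in the compose file.
# vLLM returns HTTP 200 on /health once the model is loaded.
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
  interval: 30s
  timeout: 5s
  retries: 10
  start_period: 300s   # generous allowance for model download and load
```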
Choosing Models
For coding tasks, instruction-tuned models with tool-calling support work best. The Qwen3 family is a strong choice — Qwen3-4B handles code completion and simple refactoring well, while Qwen3-30B is capable of more complex reasoning, multi-file changes, and architectural discussions. Other good options include CodeLlama, DeepSeek Coder, and StarCoder variants.
The key constraint is VRAM. A rough rule: each billion parameters requires approximately 2 GB of VRAM in FP16/BF16, 1 GB in FP8/INT8, and 0.5 GB in 4-bit quantization — plus headroom for the KV cache. A 24 GB GPU can serve Qwen3-4B alongside a quantized Qwen3-30B with the memory splits shown above; at full FP16 the 30B model would not fit.
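Weights-only VRAM can be estimated directly from parameter count and precision (2 bytes per parameter in FP16/BF16, 1 in FP8/INT8, 0.5 in 4-bit). A minimal estimator — note that KV cache and runtime overhead come on top, which is what the `--gpu-memory-utilization` headroom absorbs:

```python
def estimate_weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed for model weights alone.

    bytes_per_param: 2.0 for FP16/BF16, 1.0 for FP8/INT8, 0.5 for 4-bit.
    KV cache and CUDA overhead are NOT included.
    """
    return params_billion * bytes_per_param

# A 4B model needs ~8 GB for FP16 weights, a 30B model ~30 GB in FP8.
print(estimate_weight_vram_gb(4))        # 8.0
print(estimate_weight_vram_gb(30, 1.0))  # 30.0
```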
Deploying LiteLLM
LiteLLM sits in front of your vLLM instances and provides a single endpoint for all clients:
```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./config.yml:/etc/litellm/config.yml
    command:
      - "--config"
      - "/etc/litellm/config.yml"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "4000"
    deploy:
      resources:
        limits:
          memory: 512M
```
The configuration file (config.yml) defines available models and routing:
```yaml
model_list:
  - model_name: qwen3-4b
    litellm_params:
      model: openai/Qwen/Qwen3-4B
      api_base: http://gpu-server:8000/v1
      api_key: "not-needed"
  - model_name: qwen3-30b
    litellm_params:
      model: openai/Qwen/Qwen3-30B
      api_base: http://gpu-server:9000/v1
      api_key: "not-needed"

router_settings:
  fallbacks:
    - qwen3-30b: ["qwen3-4b"]

litellm_settings:
  drop_params: true
  callbacks:
    - prometheus
```
The openai/ prefix tells LiteLLM to use the OpenAI-compatible API format when communicating with vLLM. The api_key is set to a dummy value since vLLM does not require authentication on a local network.
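Any OpenAI-compatible client can now talk to the proxy. Stripped of library sugar, every request is an HTTP POST with a small JSON body — a sketch that builds one (hostname and model names follow the config above; the helper itself is illustrative):

```python
import json

def chat_completion_request(model: str, prompt: str) -> tuple[str, dict, str]:
    """Build the URL, headers, and JSON body for a chat completion
    against the LiteLLM proxy. Sending it is left to any HTTP client."""
    url = "http://proxy-server:4000/v1/chat/completions"
    headers = {
        "Authorization": "Bearer not-needed",  # any non-empty key, unless a master key is set
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,  # a model_name from config.yml, not the backend path
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body

url, headers, body = chat_completion_request("qwen3-4b", "Write a haiku about YAML.")
print(json.loads(body)["model"])  # qwen3-4b
```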
Mixing Cloud and Local Models
LiteLLM is not limited to local backends. It supports over 100 providers out of the box — Claude, OpenAI, Mistral, Google Gemini, Groq, and others — all configurable in the same model_list. This means you can build a hybrid setup where frontier cloud models handle the hardest tasks and local models handle everything else:
```yaml
model_list:
  # Cloud models — for complex reasoning
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: sk-ant-...
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Local models — for routine tasks
  - model_name: qwen3-30b
    litellm_params:
      model: openai/Qwen/Qwen3-30B
      api_base: http://gpu-server:9000/v1
      api_key: "not-needed"
  - model_name: qwen3-4b
    litellm_params:
      model: openai/Qwen/Qwen3-4B
      api_base: http://gpu-server:8000/v1
      api_key: "not-needed"

router_settings:
  fallbacks:
    - claude-sonnet: ["qwen3-30b", "qwen3-4b"]
    - gpt-4o: ["qwen3-30b", "qwen3-4b"]
    - qwen3-30b: ["qwen3-4b"]
```
The fallback chain is where this gets powerful. If your cloud API key runs out of credits, hits a rate limit, or the provider has an outage, LiteLLM automatically falls back to your local Qwen3-30B. If the local GPU is saturated, it falls further to Qwen3-4B. Your IDE and CLI tools see a single endpoint — they never need to know which model actually served the response.
This hybrid approach gives you the best of both worlds: frontier model quality when you need it, local privacy and zero cost for routine work, and resilience against any single point of failure.
Fallback Routing
The fallbacks configuration is not limited to cloud-to-local chains. Even in a purely local setup, it is useful. If Qwen3-30B is overloaded or temporarily unavailable, LiteLLM automatically retries the request against Qwen3-4B. This gives you resilience without any client-side logic.
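Conceptually, the router's fallback behavior is just an ordered retry over backends. A sketch of that logic (not LiteLLM's actual implementation — just the idea it encapsulates for you):

```python
def route_with_fallback(primary: str, fallbacks: list[str], send):
    """Try the primary model, then each fallback in order.

    `send` is any callable that raises on failure and returns a response.
    Returns (model_that_served, response)."""
    for model in [primary] + fallbacks:
        try:
            return model, send(model)
        except Exception:
            continue  # backend busy or down — try the next one
    raise RuntimeError("all backends failed")

# Simulate the 30B backend being saturated:
def fake_send(model):
    if model == "qwen3-30b":
        raise TimeoutError("GPU saturated")
    return f"response from {model}"

served_by, reply = route_with_fallback("qwen3-30b", ["qwen3-4b"], fake_send)
print(served_by)  # qwen3-4b
```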
Observability
The Prometheus callback exports request metrics — latency, token counts, error rates — that you can scrape and visualize in Grafana. This is invaluable for understanding your usage patterns and identifying when you need to scale up.
Automating Deployment with Ansible
For a repeatable setup across multiple machines, Ansible playbooks work well. The deployment flow:
- Install Docker on target hosts.
- Template Docker Compose files with host-specific variables (ports, GPU memory splits, model names).
- Deploy vLLM services on the GPU host and wait for model loading.
- Deploy LiteLLM on a gateway host (can be a low-power machine — it only proxies requests).
A simplified inventory structure:
```yaml
all:
  children:
    inference:
      hosts:
        gpu-server:
          ansible_host: 192.168.1.x
    gateway:
      hosts:
        proxy-server:
          ansible_host: 192.168.1.y
```
With group variables controlling per-host configuration:
```yaml
# group_vars/inference.yml
vllm_image: "nvcr.io/nvidia/vllm:26.01-py3"
small_model_port: 8000
small_model_name: "Qwen/Qwen3-4B"
small_model_gpu_util: 0.25
large_model_port: 9000
large_model_name: "Qwen/Qwen3-30B"
large_model_gpu_util: 0.60
```
This approach lets you version-control your entire inference infrastructure, swap models by changing a variable, and redeploy in minutes.
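A sketch of the corresponding deploy tasks, using the `community.docker` collection (the role layout and template filename are illustrative; the variables are the group vars shown above):

```yaml
# roles/vllm/tasks/main.yml (illustrative)
- name: Render the vLLM compose file from group vars
  ansible.builtin.template:
    src: docker-compose.yml.j2
    dest: /opt/vllm/docker-compose.yml

- name: Start (or update) the vLLM services
  community.docker.docker_compose_v2:
    project_src: /opt/vllm
    state: present
```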
Connecting JetBrains IDEs
JetBrains AI Assistant supports connecting to OpenAI-compatible endpoints, which means it can talk directly to your LiteLLM proxy.
Configuration Steps
- Open Settings | Tools | AI Assistant | Providers & API keys.
- Add a new OpenAI-compatible provider.
- Set the API endpoint to your LiteLLM instance: `http://proxy-server:4000/v1`.
- Set the API key to any non-empty value (LiteLLM accepts any key by default, or use your configured master key).
- Enable Tool calling if your model supports it (Qwen3, Llama 3.x, and others do).
Model Assignment
In Settings | Tools | AI Assistant | Models Assignment, assign your local models to specific features:
| Feature | Recommended Model |
|---|---|
| AI Chat | Qwen3-30B — better reasoning for conversations |
| Code Completion | Qwen3-4B — faster response, sufficient for completions |
| Inline suggestions | Qwen3-4B — low latency is critical here |
This split lets you use Qwen3-4B for real-time completions where latency matters, and Qwen3-30B for chat-based discussions where quality matters more.
Context Window
Set the context window size in the model settings to match your vLLM configuration. If Qwen3-4B is configured with --max-model-len 8192, set the context window to 8192 in JetBrains. Mismatched values will cause either truncated context or failed requests.
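A quick client-side sanity check catches oversized prompts before vLLM rejects them. The ~4 characters-per-token heuristic below is rough (real tokenizers vary by language and content), but good enough for a guard:

```python
def fits_context(prompt: str, max_model_len: int, reserved_for_output: int = 1024) -> bool:
    """Rough check that a prompt fits the model's context window.

    Uses the ~4 chars/token heuristic; for exact counts use the model's
    tokenizer. Reserves room for the generated output tokens."""
    est_tokens = len(prompt) / 4
    return est_tokens + reserved_for_output <= max_model_len

print(fits_context("x" * 8_000, 8192))   # True  — ~2000 tokens fits an 8K window
print(fits_context("x" * 40_000, 8192))  # False — ~10000 tokens does not
```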
Connecting Claude Code
Claude Code can connect to any endpoint that speaks the Anthropic Messages API format. LiteLLM supports this natively through its /v1/messages pass-through endpoint.
Configuration
Set the following environment variables before launching Claude Code:
```shell
export ANTHROPIC_BASE_URL=http://proxy-server:4000
export ANTHROPIC_AUTH_TOKEN=not-needed
export ANTHROPIC_MODEL=qwen3-30b
```
Or add them to your shell profile (~/.zshrc, ~/.bashrc) for persistence:
```shell
# ~/.zshrc
export ANTHROPIC_BASE_URL=http://proxy-server:4000
export ANTHROPIC_AUTH_TOKEN=not-needed
export ANTHROPIC_MODEL=qwen3-30b
```
After reloading your shell, Claude Code will route all requests through your local LiteLLM proxy to your local models.
Considerations
Claude Code was designed for frontier-class models with large context windows and strong reasoning capabilities. Local models will handle many tasks well — file edits, code generation, explaining code, running commands — but may struggle with complex multi-step reasoning or large codebase analysis that requires processing hundreds of files. Use it for focused tasks and fall back to the hosted API for heavy-lifting sessions.
Connecting opencode
opencode is an open-source terminal-based AI coding assistant similar to Claude Code. It supports any OpenAI-compatible provider, making it straightforward to connect to your local stack.
Configuration
Create or edit ~/.config/opencode/opencode.json:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local LLM",
      "options": {
        "baseURL": "http://proxy-server:4000/v1"
      },
      "models": {
        "qwen3-30b": {
          "name": "Qwen3-30B",
          "tools": true
        },
        "qwen3-4b": {
          "name": "Qwen3-4B",
          "tools": true
        }
      }
    }
  }
}
```
The tools: true flag enables function calling, which opencode uses for file operations, shell commands, and other agentic capabilities.
Per-Project Configuration
You can also place an opencode.json in your project root to override the global config. This is useful if different projects need different models — a documentation project might use Qwen3-4B, while a complex backend project benefits from Qwen3-30B.
Model Requirements
opencode works best with models that have at least a 64K token context window for codebase-wide operations. With Qwen3-4B configured at 8K context, stick to focused single-file tasks. Qwen3-30B with its 65K context can handle broader operations like multi-file refactoring.
Practical Recommendations
Hardware Sizing
| GPU VRAM | What You Can Run |
|---|---|
| 8 GB | Qwen3-4B — basic code completion |
| 16 GB | Qwen3-14B, or Qwen3-4B + Qwen3-8B |
| 24 GB | Qwen3-30B + Qwen3-4B simultaneously |
| 48 GB | Two Qwen3-30B instances, or a quantized 70B-class model with room for a small model |
Consumer GPUs like the RTX 4090 (24 GB) or RTX 5090 (32 GB) are more than capable. You do not need enterprise hardware for a personal inference setup.
Model Selection for Coding
Based on my experience, these model families work well for coding tasks:
- Qwen3 (4B–30B). Strong instruction following, good tool-calling support, efficient at code generation and refactoring.
- DeepSeek-Coder-V2. Purpose-built for code, excellent at completions and multi-language support.
- CodeLlama (7B–34B). Meta’s code-specialized models, good for completion and infilling tasks.
Start with a small model for code completion and a larger one for chat. Measure whether the larger model’s quality improvement justifies its VRAM cost and slower throughput for your specific workflow.
Network Topology
If your GPU is in a separate machine (a workstation, a home server, or a mini PC with an eGPU), put LiteLLM on the same network and access it from your development machine over LAN. The latency overhead of a local network hop is negligible compared to inference time.
```mermaid
graph LR
    Dev["Development Machine\nIDE + CLI tools"] -- "LAN" --> GW["Gateway\nLiteLLM :4000"]
    GW -- "LAN" --> GPU["GPU Server\nQwen3-4B :8000\nQwen3-30B :9000"]
```
Security
On a private home or office network, the setup described here is sufficient. If you expose these services beyond your LAN:
- Enable LiteLLM’s master key authentication.
- Put the services behind a reverse proxy with TLS.
- Restrict access by IP or use a VPN.
Do not expose unprotected inference endpoints to the public internet.
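Enabling the master key is a small addition to the LiteLLM config. A sketch using LiteLLM's `os.environ/` syntax so the secret stays out of version control (`LITELLM_MASTER_KEY` is a variable name of your choosing):

```yaml
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY  # resolved from the environment at startup
```

Clients must then send this key as their API key; the dummy `not-needed` values in the examples above stop working.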
Cost Analysis
Ongoing Costs — Cloud vs Local
A rough comparison for a developer using AI coding tools ~8 hours per day:
| Approach | Monthly Cost | Privacy |
|---|---|---|
| Cloud API (heavy usage) | $50–200+ depending on model and token volume | Prompts sent to third party |
| Cloud subscription (Copilot, Claude Pro, etc.) | $10–20 | Prompts sent to third party |
| Local inference (after hardware) | $5–15 electricity | Everything stays local |
The monthly savings look compelling, but they only tell half the story. The real question is what it costs to get started.
Capital Expenditure — The Hardware Question
You might already have everything you need. If there is a desktop or laptop with a discrete GPU sitting around, you can start serving models today at zero additional cost. But if you are buying hardware specifically for inference, the options range from modest to extravagant:
| Hardware | Approximate Cost | Usable VRAM / Memory | What It Runs |
|---|---|---|---|
| Used RTX 3090 (24 GB) | $700–900 | 24 GB VRAM | Qwen3-30B + Qwen3-4B simultaneously |
| RTX 4090 (24 GB) | $1,600–2,000 | 24 GB VRAM | Same models, faster inference |
| RTX 5090 (32 GB) | $2,000–2,500 | 32 GB VRAM | Larger models or more headroom |
| Mac Mini M4 Pro (64 GB) | $1,800–2,000 | 64 GB unified | 70B models via llama.cpp, slower but capable |
| Mac Studio M4 Ultra (192 GB) | $5,000–7,000 | 192 GB unified | Multiple large models, silent, low power draw |
| Mac Pro / Mac Studio (512 GB) | $12,000–15,000 | 512 GB unified | 400B+ parameter models entirely in memory |
| NVIDIA DGX Spark (128 GB) | ~$3,000 | 128 GB unified NVIDIA GPU | Purpose-built for local AI, desktop form factor |
The spectrum is wide. A used RTX 3090 in an existing desktop is the most practical entry point — it runs everything discussed in this article and costs less than a year of heavy API usage. On the other end, a Mac with 512 GB of unified memory can load models that would otherwise require a multi-GPU server, and the DGX Spark offers NVIDIA’s full CUDA stack in a compact desktop form factor with 128 GB of memory.
Return on Investment
The break-even calculation depends on what you are replacing. Here is a simple framework:
Scenario 1 — Replacing API usage ($100/month)
| Hardware | Cost | Break-even |
|---|---|---|
| Used RTX 3090 | $800 | ~8 months |
| RTX 4090 | $1,800 | ~18 months |
| Mac Mini M4 Pro 64 GB | $1,900 | ~19 months |
Scenario 2 — Replacing subscription + API overflow ($150/month)
| Hardware | Cost | Break-even |
|---|---|---|
| Used RTX 3090 | $800 | ~5 months |
| RTX 4090 | $1,800 | ~12 months |
| Mac Studio M4 Ultra 192 GB | $6,000 | ~40 months |
Scenario 3 — Team of 3 developers, each using $100/month in API costs ($300/month total)
| Hardware | Cost | Break-even |
|---|---|---|
| RTX 4090 workstation | $2,500 | ~8 months |
| DGX Spark | $3,000 | ~10 months |
| Mac Studio 192 GB | $6,000 | ~20 months |
These numbers assume $5–15/month in electricity and do not account for the residual value of the hardware, which retains significant resale value — especially GPUs.
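The break-even arithmetic in these tables reduces to one line. A helper for running your own numbers, with the electricity estimate netted out of the savings:

```python
def breakeven_months(hardware_cost: float, monthly_savings: float,
                     monthly_electricity: float = 10.0) -> float:
    """Months until the hardware pays for itself.

    monthly_savings: the cloud spend the local setup replaces.
    Returns infinity if electricity eats the savings entirely."""
    net = monthly_savings - monthly_electricity
    if net <= 0:
        return float("inf")
    return hardware_cost / net

# Scenario 1: used RTX 3090 replacing $100/month of API usage.
print(round(breakeven_months(800, 100)))  # 9
```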
The honest answer: if you are a solo developer spending $20/month on a Copilot subscription and happy with it, buying a $2,000 GPU purely for cost savings makes no sense — the break-even is over 8 years. The ROI case is strongest when you have heavy API usage, multiple users sharing the infrastructure, or when privacy requirements make cloud APIs a non-starter regardless of cost.
For most individual developers, the practical path is to start with hardware you already own, measure whether local models are useful enough for your workflow, and only then decide if a dedicated investment is worth it.
Conclusion
Running local models for coding is no longer a fringe experiment — modern open-weight models are genuinely useful for daily development work, and the tooling to serve and consume them has matured significantly. The combination of vLLM for high-throughput inference and LiteLLM as a universal API gateway gives you a setup that any standard tool can connect to — JetBrains IDEs, Claude Code, opencode, or your own custom scripts.
The sweet spot is a hybrid approach: use local models for routine tasks where privacy matters and latency is important (code completion, quick refactoring, private codebases), and fall back to frontier cloud models for complex reasoning tasks that push beyond what local hardware can handle. With the setup described here, switching between the two is a configuration change, not an architecture rewrite.