Why Run Models Locally?

Cloud-hosted LLMs are convenient, but they come with trade-offs that matter when you are writing code or building private tools. Every prompt you send to a hosted API leaves your machine: your code, internal architecture details, database schemas, and business logic all travel to a third-party server. For personal projects this might be acceptable, but for anything involving proprietary code, client data, or internal tooling, it raises real concerns.

Running models locally eliminates these concerns and adds several benefits of its own:

  • Privacy. Your code never leaves your network. No prompts are logged, stored, or used for training by a third party.
  • Zero inference cost. After the initial hardware investment, every token is free. For heavy daily usage — code completion, refactoring, chat-based debugging — the savings compound quickly.
  • No rate limits. You are not competing for capacity with other users. Your model is available 24/7 at whatever throughput your hardware allows.
  • Latency. Local inference over a LAN can be faster than a round trip to a data center, especially for short completions where network overhead dominates.
  • Offline availability. You can work on a plane, in a restricted network, or during a provider outage without losing your AI assistant.

The trade-off is capability — local models are smaller than frontier models and will not match their reasoning depth on complex tasks. But for code completion, boilerplate generation, refactoring suggestions, and conversational debugging, modern open-weight models are remarkably competent.


The Stack — vLLM and LiteLLM

Once you start thinking about running models locally, something happens to the way you look at hardware. Every device with a GPU becomes a potential inference node. Your workstation with an RTX 4090? Obviously. That old gaming laptop collecting dust? It has a 3070 — that is 8 GB of VRAM doing nothing. Your partner’s MacBook with its unified memory? It could run a 7B model through llama.cpp while they are asleep. The mini PC you bought for a home media server? It has an iGPU — technically it counts. You start eyeing the smart fridge and wondering if it has a tensor core somewhere.

This is the home lab mentality, and it sets in quickly once you realize that LiteLLM can aggregate multiple backends into a single endpoint. Every GPU in your household becomes part of your personal inference cluster.

The most practical setup I have found combines two tools: vLLM for GPU-accelerated model serving and LiteLLM as an OpenAI-compatible API gateway. Together, they give you a production-grade inference endpoint that any tool — IDE plugins, CLI agents, custom scripts — can connect to using standard APIs.

  graph LR
    subgraph Clients
        IDE["JetBrains IDE"]
        CC["Claude Code"]
        OC["opencode"]
        Custom["Custom Tools"]
    end

    subgraph Gateway
        LiteLLM["LiteLLM Proxy\n:4000"]
    end

    subgraph Workstation["Workstation — RTX 4090"]
        V1["vLLM\nQwen3-4B\n:8000"]
        V2["vLLM\nQwen3-30B\n:9000"]
    end

    subgraph Laptop["Old Laptop — RTX 3070"]
        V3["vLLM\nQwen3-4B\n:8000"]
    end

    subgraph Mac["MacBook — Unified Memory"]
        V4["llama.cpp\nQwen3-14B\n:8080"]
    end

    IDE --> LiteLLM
    CC --> LiteLLM
    OC --> LiteLLM
    Custom --> LiteLLM
    LiteLLM --> V1
    LiteLLM --> V2
    LiteLLM --> V3
    LiteLLM --> V4

The beauty of this architecture is that LiteLLM does not care where the backend lives or what serves it. A vLLM instance on an NVIDIA workstation, llama.cpp on a MacBook, Ollama on a gaming laptop — they all expose OpenAI-compatible endpoints, and LiteLLM routes to all of them with automatic fallbacks. When your workstation’s Qwen3-30B is busy, the request silently falls back to Qwen3-4B on the laptop.

vLLM is a high-throughput inference engine that serves models with continuous batching, PagedAttention for efficient memory management, and OpenAI-compatible API endpoints out of the box. It runs as a Docker container with NVIDIA GPU support.

LiteLLM is a lightweight proxy that exposes a unified API (OpenAI, Anthropic, and other formats) and routes requests to one or more backend model servers. It adds fallback routing, load balancing, and observability — all configured through a single YAML file.


Deploying vLLM

vLLM runs as a Docker container with GPU passthrough. Here is a Docker Compose configuration that serves two models on a single GPU, splitting VRAM between them:

services:
  qwen3-4b:
    image: nvcr.io/nvidia/vllm:26.01-py3
    container_name: qwen3-4b
    restart: unless-stopped
    ipc: host
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model Qwen/Qwen3-4B
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.25
      --max-num-batched-tokens 8192
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enable-chunked-prefill
      --max-num-seqs 64

  qwen3-30b:
    image: nvcr.io/nvidia/vllm:26.01-py3
    container_name: qwen3-30b
    restart: unless-stopped
    ipc: host
    runtime: nvidia
    ports:
      - "9000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model Qwen/Qwen3-30B
      --port 8000
      --max-model-len 65536
      --gpu-memory-utilization 0.60
      --max-num-batched-tokens 8192
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enable-chunked-prefill
      --max-num-seqs 64

Key parameters to understand:

| Parameter | Purpose |
| --- | --- |
| --gpu-memory-utilization | Fraction of GPU VRAM to allocate. Setting 0.25 and 0.60 lets two models share a single GPU. |
| --max-model-len | Maximum context window. Larger values consume more VRAM. |
| --max-num-batched-tokens | Controls continuous batching throughput. |
| --enable-auto-tool-choice | Enables function/tool calling support, essential for agentic workflows. |
| --tool-call-parser hermes | Parser format for tool call extraction from model output. |
| --enable-chunked-prefill | Reduces time-to-first-token by processing prefill in chunks. |
| --max-num-seqs | Maximum concurrent sequences — controls parallel request capacity. |
| ipc: host | Required for shared memory between GPU workers. |

The HuggingFace cache volume mount (~/.cache/huggingface) ensures models are downloaded once and reused across container restarts.
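When two vLLM instances share one card, their --gpu-memory-utilization values are fractions of total VRAM, so together they must stay well below 1.0 to leave room for the CUDA context and anything else using the GPU. A quick sketch of the arithmetic for the splits above (the 24 GB total assumes an RTX 4090; the function name and headroom value are illustrative):

```python
# Sanity-check VRAM splits for vLLM instances co-hosted on one GPU.
# Assumes a 24 GB card; adjust total_vram_gb for your hardware.

def check_splits(total_vram_gb: float, utilizations: list[float],
                 headroom: float = 0.1) -> dict[str, float]:
    """Return allocated/free VRAM in GB, enforcing a minimum free fraction."""
    total_fraction = sum(utilizations)
    if total_fraction > 1.0 - headroom:
        raise ValueError(
            f"Splits sum to {total_fraction:.2f}; leave at least {headroom:.0%} free"
        )
    return {
        "allocated_gb": round(total_vram_gb * total_fraction, 1),
        "free_gb": round(total_vram_gb * (1 - total_fraction), 1),
    }

# The compose file above uses 0.25 (Qwen3-4B) and 0.60 (Qwen3-30B).
print(check_splits(24.0, [0.25, 0.60]))  # 20.4 GB allocated, 3.6 GB free
```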

Choosing Models

For coding tasks, instruction-tuned models with tool-calling support work best. The Qwen3 family is a strong choice — Qwen3-4B handles code completion and simple refactoring well, while Qwen3-30B is capable of more complex reasoning, multi-file changes, and architectural discussions. Other good options include CodeLlama, DeepSeek Coder, and StarCoder variants.

The key constraint is VRAM. A rough rule: each billion parameters requires approximately 2 GB of VRAM in FP16, 1 GB in FP8/INT8, or 0.5 GB with 4-bit quantization, plus headroom for the KV cache, which grows with context length and concurrency. A 24 GB GPU can serve Qwen3-4B alongside a quantized Qwen3-30B using the memory splits shown above.
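A sketch of that arithmetic, using 2 bytes per parameter for FP16 weights, 1 for FP8/INT8, and 0.5 for 4-bit (the function and dtype names are illustrative; the KV cache needs additional room on top of these figures):

```python
# Rough VRAM estimate for model weights only (KV cache and activations are extra).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, dtype: str = "fp16") -> float:
    """Approximate VRAM needed to hold the weights, in GB."""
    return round(params_billions * BYTES_PER_PARAM[dtype], 1)

print(weight_vram_gb(4))           # 8.0  -> Qwen3-4B in FP16
print(weight_vram_gb(30, "fp8"))   # 30.0 -> still too large for a 24 GB card
print(weight_vram_gb(30, "int4"))  # 15.0 -> 4-bit fits within a 24 GB split
```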


Deploying LiteLLM

LiteLLM sits in front of your vLLM instances and provides a single endpoint for all clients:

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./config.yml:/etc/litellm/config.yml
    command:
      - "--config"
      - "/etc/litellm/config.yml"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "4000"
    deploy:
      resources:
        limits:
          memory: 512M

The configuration file (config.yml) defines available models and routing:

model_list:
  - model_name: qwen3-4b
    litellm_params:
      model: openai/Qwen/Qwen3-4B
      api_base: http://gpu-server:8000/v1
      api_key: "not-needed"

  - model_name: qwen3-30b
    litellm_params:
      model: openai/Qwen/Qwen3-30B
      api_base: http://gpu-server:9000/v1
      api_key: "not-needed"

router_settings:
  fallbacks:
    - qwen3-30b: ["qwen3-4b"]

litellm_settings:
  drop_params: true
  callbacks:
    - prometheus

The openai/ prefix tells LiteLLM to use the OpenAI-compatible API format when communicating with vLLM. The api_key is set to a dummy value since vLLM does not require authentication on a local network.

Mixing Cloud and Local Models

LiteLLM is not limited to local backends. It supports over 100 providers out of the box — Claude, OpenAI, Mistral, Google Gemini, Groq, and others — all configurable in the same model_list. This means you can build a hybrid setup where frontier cloud models handle the hardest tasks and local models handle everything else:

model_list:
  # Cloud models — for complex reasoning
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: sk-ant-...

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Local models — for routine tasks
  - model_name: qwen3-30b
    litellm_params:
      model: openai/Qwen/Qwen3-30B
      api_base: http://gpu-server:9000/v1
      api_key: "not-needed"

  - model_name: qwen3-4b
    litellm_params:
      model: openai/Qwen/Qwen3-4B
      api_base: http://gpu-server:8000/v1
      api_key: "not-needed"

router_settings:
  fallbacks:
    - claude-sonnet: ["qwen3-30b", "qwen3-4b"]
    - gpt-4o: ["qwen3-30b", "qwen3-4b"]
    - qwen3-30b: ["qwen3-4b"]

The fallback chain is where this gets powerful. If your cloud API key runs out of credits, hits a rate limit, or the provider has an outage, LiteLLM automatically falls back to your local Qwen3-30B. If the local GPU is saturated, it falls further to Qwen3-4B. Your IDE and CLI tools see a single endpoint — they never need to know which model actually served the response.

This hybrid approach gives you the best of both worlds: frontier model quality when you need it, local privacy and zero cost for routine work, and resilience against any single point of failure.

Fallback Routing

The fallbacks configuration is not limited to cloud-to-local chains. Even in a purely local setup, it is useful. If Qwen3-30B is overloaded or temporarily unavailable, LiteLLM automatically retries the request against Qwen3-4B. This gives you resilience without any client-side logic.

Observability

The Prometheus callback exports request metrics — latency, token counts, error rates — that you can scrape and visualize in Grafana. This is invaluable for understanding your usage patterns and identifying when you need to scale up.
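A minimal Prometheus scrape job for the proxy might look like this — a sketch assuming LiteLLM's default /metrics path on the proxy port, with the hostname from the earlier examples:

```yaml
# prometheus.yml — scrape LiteLLM's metrics endpoint
scrape_configs:
  - job_name: litellm
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["proxy-server:4000"]
```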


Automating Deployment with Ansible

For a repeatable setup across multiple machines, Ansible playbooks work well. The deployment flow:

  1. Install Docker on target hosts.
  2. Template Docker Compose files with host-specific variables (ports, GPU memory splits, model names).
  3. Deploy vLLM services on the GPU host and wait for model loading.
  4. Deploy LiteLLM on a gateway host (can be a low-power machine — it only proxies requests).

A simplified inventory structure:

all:
  children:
    inference:
      hosts:
        gpu-server:
          ansible_host: 192.168.1.x
    gateway:
      hosts:
        proxy-server:
          ansible_host: 192.168.1.y

With group variables controlling per-host configuration:

# group_vars/inference.yml
vllm_image: "nvcr.io/nvidia/vllm:26.01-py3"
small_model_port: 8000
small_model_name: "Qwen/Qwen3-4B"
small_model_gpu_util: 0.25
large_model_port: 9000
large_model_name: "Qwen/Qwen3-30B"
large_model_gpu_util: 0.60

This approach lets you version-control your entire inference infrastructure, swap models by changing a variable, and redeploy in minutes.
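For illustration, the compose file on the inference host can consume these variables directly through a Jinja2 template. This is a hypothetical fragment (the file path and service name are placeholders; the variable names match group_vars/inference.yml above):

```yaml
# templates/compose.yml.j2 — hypothetical fragment, rendered per host by Ansible
services:
  small-model:
    image: "{{ vllm_image }}"
    ports:
      - "{{ small_model_port }}:8000"
    command: >
      --model {{ small_model_name }}
      --port 8000
      --gpu-memory-utilization {{ small_model_gpu_util }}
```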


Connecting JetBrains IDEs

JetBrains AI Assistant supports connecting to OpenAI-compatible endpoints, which means it can talk directly to your LiteLLM proxy.

Configuration Steps

  1. Open Settings | Tools | AI Assistant | Providers & API keys.
  2. Add a new OpenAI-compatible provider.
  3. Set the API endpoint to your LiteLLM instance:
    http://proxy-server:4000/v1
    
  4. Set the API key to any non-empty value (LiteLLM accepts any key by default, or use your configured master key).
  5. Enable Tool calling if your model supports it (Qwen3, Llama 3.x, and others do).

Model Assignment

In Settings | Tools | AI Assistant | Models Assignment, assign your local models to specific features:

| Feature | Recommended Model |
| --- | --- |
| AI Chat | Qwen3-30B — better reasoning for conversations |
| Code Completion | Qwen3-4B — faster response, sufficient for completions |
| Inline suggestions | Qwen3-4B — low latency is critical here |

This split lets you use Qwen3-4B for real-time completions where latency matters, and Qwen3-30B for chat-based discussions where quality matters more.

Context Window

Set the context window size in the model settings to match your vLLM configuration. If Qwen3-4B is configured with --max-model-len 8192, set the context window to 8192 in JetBrains. Mismatched values will cause either truncated context or failed requests.
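The constraint is simple to state in code: the prompt plus the requested completion must fit within the server's --max-model-len. A small sketch (the function name and token counts are illustrative; real counts come from the model's tokenizer):

```python
# Check a request against the server-side context limit (--max-model-len).

def fits_context(prompt_tokens: int, max_new_tokens: int,
                 max_model_len: int = 8192) -> bool:
    """True if prompt plus completion fit within the context window."""
    return prompt_tokens + max_new_tokens <= max_model_len

print(fits_context(7000, 1000))  # True: 8,000 tokens fit in 8,192
print(fits_context(7800, 1000))  # False: 8,800 would be rejected or truncated
```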


Connecting Claude Code

Claude Code can connect to any endpoint that speaks the Anthropic Messages API format. LiteLLM supports this natively through its /v1/messages pass-through endpoint.

Configuration

Set the following environment variables before launching Claude Code:

export ANTHROPIC_BASE_URL=http://proxy-server:4000
export ANTHROPIC_AUTH_TOKEN=not-needed
export ANTHROPIC_MODEL=qwen3-30b

Or add them to your shell profile (~/.zshrc, ~/.bashrc) for persistence:

# ~/.zshrc
export ANTHROPIC_BASE_URL=http://proxy-server:4000
export ANTHROPIC_AUTH_TOKEN=not-needed
export ANTHROPIC_MODEL=qwen3-30b

After reloading your shell, Claude Code will route all requests through your local LiteLLM proxy to your local models.

Considerations

Claude Code was designed for frontier-class models with large context windows and strong reasoning capabilities. Local models will handle many tasks well — file edits, code generation, explaining code, running commands — but may struggle with complex multi-step reasoning or large codebase analysis that requires processing hundreds of files. Use it for focused tasks and fall back to the hosted API for heavy-lifting sessions.


Connecting opencode

opencode is an open-source terminal-based AI coding assistant similar to Claude Code. It supports any OpenAI-compatible provider, making it straightforward to connect to your local stack.

Configuration

Create or edit ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local LLM",
      "options": {
        "baseURL": "http://proxy-server:4000/v1"
      },
      "models": {
        "qwen3-30b": {
          "name": "Qwen3-30B",
          "tools": true
        },
        "qwen3-4b": {
          "name": "Qwen3-4B",
          "tools": true
        }
      }
    }
  }
}

The tools: true flag enables function calling, which opencode uses for file operations, shell commands, and other agentic capabilities.

Per-Project Configuration

You can also place an opencode.json in your project root to override the global config. This is useful if different projects need different models — a documentation project might use Qwen3-4B, while a complex backend project benefits from Qwen3-30B.
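A per-project override can be as small as pinning the default model — a sketch, assuming the model key matches one defined in the global provider block above:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "local/qwen3-4b"
}
```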

Model Requirements

opencode works best with models that have at least a 64K token context window for codebase-wide operations. With Qwen3-4B configured at 8K context, stick to focused single-file tasks. Qwen3-30B with its 65K context can handle broader operations like multi-file refactoring.


Practical Recommendations

Hardware Sizing

| GPU VRAM | What You Can Run |
| --- | --- |
| 8 GB | Qwen3-4B — basic code completion |
| 16 GB | Qwen3-14B, or Qwen3-4B + Qwen3-8B |
| 24 GB | Qwen3-30B + Qwen3-4B simultaneously |
| 48 GB | Two Qwen3-30B instances, or Qwen3-32B with room for a small model |

Consumer GPUs like the RTX 4090 (24 GB) or RTX 5090 (32 GB) are more than capable. You do not need enterprise hardware for a personal inference setup.

Model Selection for Coding

Based on my experience, these model families work well for coding tasks:

  • Qwen3 (4B–30B). Strong instruction following, good tool-calling support, efficient at code generation and refactoring.
  • DeepSeek Coder V2. Purpose-built for code, excellent at completions and multi-language support.
  • CodeLlama (7B–34B). Meta’s code-specialized models, good for completion and infilling tasks.

Start with a small model for code completion and a larger one for chat. Measure whether the larger model’s quality improvement justifies its VRAM cost and slower throughput for your specific workflow.

Network Topology

If your GPU is in a separate machine (a workstation, a home server, or a mini PC with an eGPU), put LiteLLM on the same network and access it from your development machine over LAN. The latency overhead of a local network hop is negligible compared to inference time.

  graph LR
    Dev["Development Machine\nIDE + CLI tools"] -- "LAN" --> GW["Gateway\nLiteLLM :4000"]
    GW -- "LAN" --> GPU["GPU Server\nQwen3-4B :8000\nQwen3-30B :9000"]

Security

On a private home or office network, the setup described here is sufficient. If you expose these services beyond your LAN:

  • Enable LiteLLM’s master key authentication.
  • Put the services behind a reverse proxy with TLS.
  • Restrict access by IP or use a VPN.

Do not expose unprotected inference endpoints to the public internet.
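The first item is a one-line change in the LiteLLM config. Clients must then present this key as their API key; the os.environ/ reference keeps the secret out of version control:

```yaml
# config.yml — require authentication on the proxy
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
```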


Cost Analysis

Ongoing Costs — Cloud vs Local

A rough comparison for a developer using AI coding tools ~8 hours per day:

| Approach | Monthly Cost | Privacy |
| --- | --- | --- |
| Cloud API (heavy usage) | $50–200+ depending on model and token volume | Prompts sent to third party |
| Cloud subscription (Copilot, Claude Pro, etc.) | $10–20 | Prompts sent to third party |
| Local inference (after hardware) | $5–15 electricity | Everything stays local |
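The electricity figure is easy to sanity-check from power draw. The wattage and price below are assumptions; substitute your own GPU's draw and local rate:

```python
# Monthly electricity cost for a GPU serving models ~8 hours/day.
# Assumptions: 350 W average draw under load, $0.15 per kWh.

def monthly_cost_usd(watts: float, hours_per_day: float,
                     usd_per_kwh: float, days: int = 30) -> float:
    kwh = watts * hours_per_day * days / 1000
    return round(kwh * usd_per_kwh, 2)

print(monthly_cost_usd(350, 8, 0.15))  # 12.6 -> within the $5-15 range above
```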

The monthly savings look compelling, but they only tell half the story. The real question is what it costs to get started.

Capital Expenditure — The Hardware Question

You might already have everything you need. If there is a desktop or laptop with a discrete GPU sitting around, you can start serving models today at zero additional cost. But if you are buying hardware specifically for inference, the options range from modest to extravagant:

| Hardware | Approximate Cost | Usable VRAM / Memory | What It Runs |
| --- | --- | --- | --- |
| Used RTX 3090 (24 GB) | $700–900 | 24 GB VRAM | Qwen3-30B + Qwen3-4B simultaneously |
| RTX 4090 (24 GB) | $1,600–2,000 | 24 GB VRAM | Same models, faster inference |
| RTX 5090 (32 GB) | $2,000–2,500 | 32 GB VRAM | Larger models or more headroom |
| Mac Mini M4 Pro (64 GB) | $1,800–2,000 | 64 GB unified | 70B models via llama.cpp, slower but capable |
| Mac Studio M3 Ultra (192 GB) | $5,000–7,000 | 192 GB unified | Multiple large models, silent, low power draw |
| Mac Studio M3 Ultra (512 GB) | $12,000–15,000 | 512 GB unified | 400B+ parameter models entirely in memory |
| NVIDIA DGX Spark (128 GB) | ~$3,000 | 128 GB unified | Purpose-built for local AI, desktop form factor |

The spectrum is wide. A used RTX 3090 in an existing desktop is the most practical entry point — it runs everything discussed in this article and costs less than a year of heavy API usage. On the other end, a Mac with 512 GB of unified memory can load models that would otherwise require a multi-GPU server, and the DGX Spark offers NVIDIA’s full CUDA stack in a compact desktop form factor with 128 GB of memory.

Return on Investment

The break-even calculation depends on what you are replacing. Here is a simple framework:

Scenario 1 — Replacing API usage ($100/month)

| Hardware | Cost | Break-even |
| --- | --- | --- |
| Used RTX 3090 | $800 | ~8 months |
| RTX 4090 | $1,800 | ~18 months |
| Mac Mini M4 Pro 64 GB | $1,900 | ~19 months |

Scenario 2 — Replacing subscription + API overflow ($150/month)

| Hardware | Cost | Break-even |
| --- | --- | --- |
| Used RTX 3090 | $800 | ~5 months |
| RTX 4090 | $1,800 | ~12 months |
| Mac Studio M3 Ultra 192 GB | $6,000 | ~40 months |

Scenario 3 — Team of 3 developers, each using $100/month in API costs ($300/month total)

| Hardware | Cost | Break-even |
| --- | --- | --- |
| RTX 4090 workstation | $2,500 | ~8 months |
| DGX Spark | $3,000 | ~10 months |
| Mac Studio 192 GB | $6,000 | ~20 months |

These numbers assume $5–15/month in electricity and do not account for resale value; GPUs in particular hold their value well.
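The arithmetic behind the tables is hardware cost divided by monthly savings. The sketch below nets out an assumed electricity cost, so its figures run slightly longer than the rounded table values:

```python
# Months to break even on hardware, given the monthly spend it replaces.
# electricity_usd is an assumed monthly power cost (see the $5-15 estimate above).

def break_even_months(hardware_usd: float, replaced_monthly_usd: float,
                      electricity_usd: float = 10.0) -> float:
    net_savings = replaced_monthly_usd - electricity_usd
    if net_savings <= 0:
        raise ValueError("No net monthly savings; break-even never arrives")
    return round(hardware_usd / net_savings, 1)

print(break_even_months(800, 100))   # 8.9 -> used RTX 3090 replacing $100/month
print(break_even_months(3000, 300))  # 10.3 -> DGX Spark shared by a small team
```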

The honest answer: if you are a solo developer spending $20/month on a Copilot subscription and happy with it, buying a $2,000 GPU purely for cost savings makes no sense — the break-even is over 8 years. The ROI case is strongest when you have heavy API usage, multiple users sharing the infrastructure, or when privacy requirements make cloud APIs a non-starter regardless of cost.

For most individual developers, the practical path is to start with hardware you already own, measure whether local models are useful enough for your workflow, and only then decide if a dedicated investment is worth it.


Conclusion

Running local models for coding is no longer a fringe experiment — modern open-weight models are genuinely useful for daily development work, and the tooling to serve and consume them has matured significantly. The combination of vLLM for high-throughput inference and LiteLLM as a universal API gateway gives you a setup that any standard tool can connect to — JetBrains IDEs, Claude Code, opencode, or your own custom scripts.

The sweet spot is a hybrid approach: use local models for routine tasks where privacy matters and latency is important (code completion, quick refactoring, private codebases), and fall back to frontier cloud models for complex reasoning tasks that push beyond what local hardware can handle. With the setup described here, switching between the two is a configuration change, not an architecture rewrite.


References

  1. vLLM Documentation
  2. LiteLLM Documentation
  3. JetBrains AI Assistant — Custom Models
  4. Claude Code — LLM Gateway Configuration
  5. opencode — Providers Configuration
  6. LiteLLM — Claude Code Quickstart
  7. Qwen3 Model Family