Running GLM-4.7-Flash Locally with OpenCode

Recently, Zhipu AI released GLM-4.7-Flash, a lightweight yet powerful large language model optimized for coding tasks. It is based on the critically acclaimed GLM-4.7 architecture. In this guide, we'll walk through how to set up GLM-4.7-Flash locally using llama.cpp and connect it to OpenCode for agentic coding capabilities - for both NVIDIA and AMD GPUs.

Why GLM-4.7-Flash

Zhipu AI's GLM-4.7-Flash is a 30B parameter model with a Mixture-of-Experts (MoE) architecture. Only ~3B parameters are active per forward pass, which means you get the reasoning capacity of a much larger model while actually being able to run it on hardware you might already own.

What makes it interesting for coding work is the benchmark performance. On SWE-bench Verified - the standard test for whether a model can actually fix real GitHub issues - GLM-4.7-Flash hits 59.2%. That's nearly triple what Qwen3-30B-A3B-Thinking scores (22.0%) and significantly ahead of GPT-OSS-20B (34.0%). The τ²-Bench results tell a similar story: 79.5% versus 49.0% and 47.7% for the other two.

GLM-4.7-Flash benchmark comparison

These aren't cherry-picked numbers. Across six different benchmarks - SWE-bench, τ²-Bench, BrowseComp, AIME 25, GPQA, and HLE - GLM-4.7-Flash either leads or stays competitive. The model was clearly tuned for agentic workloads: tool calling, multi-step reasoning, the kind of tasks that matter when you're using it as a coding assistant.

The practical upside: with Q4 quantization, you can run this on a 16GB GPU. Open weights, MIT license. No API costs, no rate limits, full control over your inference stack.

Prerequisites

  • NVIDIA GPU with CUDA 12+ OR AMD GPU with ROCm 6+
  • 16GB+ VRAM (for Q4 quantized version)
  • ~20GB disk space for the model
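
To confirm your machine meets these requirements before building anything, a quick sanity check is enough. The exact commands depend on which vendor tooling you have installed (nvidia-smi and nvcc ship with the NVIDIA driver and CUDA toolkit, rocm-smi and rocminfo with ROCm):

# NVIDIA: driver version and free VRAM, plus the CUDA toolkit version
nvidia-smi
nvcc --version

# AMD: ROCm utilities report the GPU model, VRAM and gfx target
rocm-smi
rocminfo | grep -i gfx

# Roughly 20GB of free disk space is needed for the model files
df -h .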

Building llama.cpp

Pre-built binaries exist, but building from source ensures you get the latest optimizations and proper GPU support. The process differs between NVIDIA and AMD.

NVIDIA (CUDA)

Install dependencies and build with CUDA enabled:

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y

git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON

cmake --build llama.cpp/build --config Release -j --clean-first \
  --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

cp llama.cpp/build/bin/llama-* ./
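
To make sure the binaries actually built with CUDA support, a quick check helps before moving on. The --version flag prints llama.cpp's build info; the ldd line assumes the CUDA runtime is linked dynamically, which is the default with the flags above:

# Should print the llama.cpp build number and compiler info
./llama-server --version

# If CUDA was picked up, the binary links against the CUDA/cuBLAS runtime
ldd ./llama-server | grep -iE 'cuda|cublas'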

AMD (ROCm)

AMD requires ROCm to be installed first. On Ubuntu (Noble):

wget https://repo.radeon.com/amdgpu-install/7.2/ubuntu/noble/amdgpu-install_7.2.70200-1_all.deb
sudo apt install ./amdgpu-install_7.2.70200-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel rocm
sudo usermod -a -G render,video $LOGNAME

On Arch Linux:

sudo pacman -S rocm-hip-sdk rocwmma

Then build llama.cpp with HIP support. You need to specify your GPU architecture in AMDGPU_TARGETS. Find it by running rocminfo and looking for the Name field:

*******
Agent 2
*******
  Name:                    gfx1151      <-- this is your target
  Marketing Name:          Radeon 8060S Graphics
  Vendor Name:             AMD
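
If you just want the target string without scrolling through the full output, a one-liner like this works (assuming a single GPU; with multiple GPUs, pick the right entry manually):

# Print the first gfx target reported by rocminfo
rocminfo | grep -oE 'gfx[0-9a-f]+' | head -n 1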

Build with your target:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151;gfx1100" \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_HIP_ROCWMMA_FATTN=ON

cmake --build build --config Release -- -j 10
cp build/bin/llama-* ./
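
A quick way to confirm the HIP backend made it into the build - this assumes the ROCm libraries are linked dynamically, which is the case with the flags above:

# Should list libamdhip64 / hipblas / rocblas if GGML_HIP was enabled
ldd ./llama-server | grep -iE 'hip|rocblas'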

Run the Server

llama.cpp can download models directly from Hugging Face. First, set up a Hugging Face token - some model repos require authentication. Create one at huggingface.co/settings/tokens and export it:

export HF_TOKEN=your_token_here

Then start the server:

./llama-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
  --alias glm-4.7-flash \
  --jinja --ctx-size 32768 \
  --temp 1.0 --top-p 0.95 --min-p 0.01 --fit on \
  --sleep-idle-seconds 300 \
  --host 0.0.0.0 --port 8080

The -hf flag pulls the model automatically on first run. UD-Q4_K_XL is a good balance between quality and VRAM usage - expect around 12-14GB with this quant. If you have 24GB+ VRAM, swap it for a Q8 variant.
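
For a 24GB card, the same command works with a higher-precision quant - just swap the tag after the colon. The exact tag depends on which files the repo publishes (Q8_0 is an assumption here; check the repo's file list):

./llama-server -hf unsloth/GLM-4.7-Flash-GGUF:Q8_0 \
  --alias glm-4.7-flash \
  --jinja --ctx-size 32768 \
  --temp 1.0 --top-p 0.95 --min-p 0.01 --fit on \
  --sleep-idle-seconds 300 \
  --host 0.0.0.0 --port 8080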

Key flags:

  • --alias glm-4.7-flash - the model name OpenCode will use
  • --ctx-size 32768 - context window (adjust based on your VRAM)
  • --jinja - enables chat template support
  • --sleep-idle-seconds 300 - unloads model after 5 min idle to free VRAM

Verify it's running:

curl http://localhost:8080/health

You should get {"status":"ok"}.
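
On the very first run the model download can take a while, and the health endpoint returns an error until loading finishes. A small wait loop (plain curl and grep, nothing specific to llama.cpp) keeps scripts from racing ahead:

# Poll until llama-server reports healthy; -f makes curl fail on non-200 responses
until curl -sf http://localhost:8080/health | grep -q '"ok"'; do
  echo "waiting for llama-server..."
  sleep 5
done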

Test It

Send a request to verify the model is responding:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [
      {"role": "user", "content": "Hello! Write a hello world in Python."}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": true
  }'

You should see streamed JSON chunks with a Python hello world snippet. If that works, the server is ready for OpenCode.
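
If you'd rather see just the generated text instead of raw streaming chunks, a non-streaming request piped through jq works too (jq is assumed to be installed; any JSON tool will do):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "Write a hello world in Python."}],
    "max_tokens": 200,
    "stream": false
  }' | jq -r '.choices[0].message.content'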

Configure OpenCode

Add the provider to your opencode.json (or ~/.config/opencode/opencode.json for global config):

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "glm-4.7-flash": {
          "name": "GLM-4.7-Flash (local)",
          "limit": {
            "context": 32768,
            "output": 16384
          }
        }
      }
    }
  }
}

The model ID (glm-4.7-flash) must match the --alias you passed to llama-server. Run /models in OpenCode and select GLM-4.7-Flash (local).
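
If OpenCode can't find the model, check what the server is actually advertising - llama-server exposes the standard OpenAI-compatible model listing endpoint, and the alias should show up there:

curl -s http://localhost:8080/v1/models | jq -r '.data[].id'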

OpenCode running with GLM-4.7-Flash

Conclusion

That's it. You now have a local coding assistant that rivals closed-source models on agentic benchmarks - running entirely on your own hardware. No API keys to manage, no usage limits, no data leaving your machine.

The MoE architecture makes GLM-4.7-Flash practical where other 30B+ models aren't. If you've got a 16GB GPU sitting around, this is one of the better uses for it.

Interested in building high-quality AI agent systems?

We prepared a comprehensive guide based on cutting-edge research for how to build robust, reliable AI agent systems that actually work in production. This guide covers:

  • Understanding the 14 systematic failure modes in multi-agent systems
  • Evidence-based best practices for agent design
  • Structured communication protocols and verification mechanisms
