Tsinghua University’s Open Source Project Breaks Through Large Model Computational Bottleneck: RTX 4090 Single Card Runs DeepSeek-R1 at Full Capacity
Currently, the main ways users can access DeepSeek-R1 are cloud services or local deployment. However, the official servers frequently suffer downtime, and personal deployments typically run a distilled version with 90% fewer parameters. As a result, ordinary users find it very difficult to run the full version of DeepSeek-R1 on regular hardware, and the cost of renting servers is a heavy burden even for developers.
This week, the KVCache.AI team at Tsinghua University, in collaboration with Qujing Technology, announced a major update to the open-source project KTransformers (pronounced “Quick Transformers”). The update tackles the difficulty of deploying hundred-billion-parameter models locally, a significant step toward democratizing large-model inference and moving it from “cloud monopoly” to “universal access.”
On February 10, the KTransformers team demonstrated running the full 671-billion-parameter DeepSeek-R1/V3 on a PC with 24 GB of VRAM and 382 GB of RAM, achieving a 3x to 28x speedup.
Today, KTransformers announced support for longer context lengths (4K–8K on a single 24GB card) and a further 15% speed improvement, reaching up to 16 tokens per second.
According to the official introduction, KTransformers is a flexible, Python-centric framework designed with extensibility at its core. With a single line of code, users can implement and inject optimization modules and gain access to a Transformers-compatible interface, RESTful APIs compliant with the OpenAI and Ollama standards, and even a simplified ChatGPT-style web UI.
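Because the interface follows the OpenAI API convention, a locally running KTransformers server can, in principle, be queried like any other OpenAI-compatible endpoint. The sketch below only illustrates that pattern; the address, port, and model name are placeholders, not values documented here.

# Minimal sketch of calling an OpenAI-compatible chat endpoint such as the one
# KTransformers exposes. The URL, port, and model name below are assumptions.
import json
import urllib.request

payload = {
    "model": "deepseek-r1",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed local address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])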
The technology now supports running the full 671-billion-parameter DeepSeek-R1/V3 on a single 24GB consumer-grade GPU (such as the RTX 4090D), with prefill speeds of up to 286 tokens per second and generation speeds of up to 14 tokens per second, breaking large AI models’ dependence on expensive cloud servers.
DeepSeek-R1 is built on a mixture-of-experts (MoE) architecture, in which tasks are routed to different expert modules and only a fraction of the parameters are activated for each inference step. The team’s key innovation is to offload the non-shared sparse expert matrices to CPU memory and combine this with high-speed operator optimizations, cutting the VRAM requirement from the traditional 320GB across eight A100s to a single 24GB card.
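To make the offloading idea concrete, here is a toy sketch in plain PyTorch (not KTransformers’ actual code): the small router stays on the GPU, the large expert weights stay in CPU RAM, and only the experts the router selects do any work for a given batch of tokens.

# Conceptual MoE offloading sketch: router on GPU, expert weights in CPU RAM.
# This illustrates the idea only; it is not the KTransformers implementation.
import torch

hidden, n_experts, top_k = 1024, 8, 2
device = "cuda" if torch.cuda.is_available() else "cpu"

router = torch.nn.Linear(hidden, n_experts).to(device)                  # small, GPU-resident
experts = [torch.nn.Linear(hidden, hidden) for _ in range(n_experts)]   # large, CPU-resident

@torch.no_grad()
def moe_forward(x):
    # Route on the GPU: pick the top-k experts for each token.
    scores = router(x).softmax(dim=-1)
    weights, idx = scores.topk(top_k, dim=-1)
    weights, idx = weights.cpu(), idx.cpu()
    # Compute only the selected experts on the CPU; their weights never touch VRAM.
    x_cpu = x.cpu()
    out = torch.zeros_like(x_cpu)
    for k in range(top_k):
        for e in idx[:, k].unique().tolist():
            rows = idx[:, k] == e
            out[rows] += weights[rows, k].unsqueeze(-1) * experts[e](x_cpu[rows])
    return out.to(device)

tokens = torch.randn(4, hidden, device=device)
print(moe_forward(tokens).shape)  # torch.Size([4, 1024])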
To match the characteristics of the MoE architecture, the KTransformers team performs matrix quantization with Marlin GPU operators, a 3.87x efficiency improvement over conventional approaches. On the CPU side, llamafile-based multithreading combined with Intel AMX instruction-set optimizations makes CPU prefill 28 times faster than llama.cpp, cutting response times for long-sequence tasks from minutes to seconds.
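For readers unfamiliar with weight quantization, the framework-agnostic sketch below shows the basic idea of storing weights as low-bit integers plus a per-row scale and dequantizing them for the matmul; real kernels such as Marlin pack 4-bit weights and fuse dequantization into the GPU matmul rather than materializing the full-precision matrix as done here.

# Conceptual low-bit weight quantization (values kept unpacked in int8 for clarity).
import torch

def quantize_rowwise(w, bits=4):
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for symmetric 4-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per output row
    q = torch.clamp((w / scale).round(), -qmax, qmax).to(torch.int8)
    return q, scale

def quantized_matmul(x, q, scale):
    # Dequantize just-in-time and multiply; fused kernels avoid rebuilding w in full precision.
    return x @ (q.float() * scale).t()

w = torch.randn(256, 512)   # weights [out_features, in_features]
x = torch.randn(8, 512)     # activations [batch, in_features]
q, scale = quantize_rowwise(w)
err = (x @ w.t() - quantized_matmul(x, q, scale)).abs().mean()
print(f"mean abs error: {err:.4f}")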
In addition, they reduced CPU/GPU communication breaks so that each decoding step can be issued as a single, complete CUDA Graph call, bringing generation speed to 14 tokens per second at a power draw of only about 80W. The whole machine costs roughly 20,000 yuan, about 2% of a traditional eight-A100 setup.
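For context, capturing a decoding step into a CUDA Graph follows the standard PyTorch capture-and-replay pattern sketched below (a generic illustration with a stand-in model, not KTransformers’ decoding loop): the step’s kernels are recorded once and then replayed with fresh data copied into static buffers, avoiding per-step kernel-launch overhead.

# Generic CUDA Graph capture/replay pattern in PyTorch (requires a CUDA GPU).
import torch

assert torch.cuda.is_available(), "CUDA Graphs need a CUDA device"
model = torch.nn.Linear(1024, 1024).cuda().eval()
static_in = torch.zeros(1, 1024, device="cuda")

# Warm up on a side stream so capture starts from a clean state.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Record one forward pass into the graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

# Replay: copy new data into the static input buffer, then launch the whole graph at once.
static_in.copy_(torch.randn(1, 1024, device="cuda"))
graph.replay()
print(static_out.sum().item())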
Developer tests showed that with an RTX 3090 GPU, 200GB of RAM, and Unsloth optimizations, the Q2_K_XL quantized model reached an inference speed of 9.1 tokens per second, making hundred-billion-parameter models genuinely “home-friendly.”
It is important to note that KTransformers is not merely a single-purpose inference framework, nor is it limited to DeepSeek models: it is compatible with a variety of MoE models and operators, allowing a wide range of operator combinations to be integrated and tested. It also supports both Windows and Linux, and interested users can try it out.
However, there are hardware requirements for using KTransformers; an RTX 4090 alone does not guarantee smooth operation. The prerequisites are listed below (a quick environment-check snippet follows the list):
- CPU: Intel Xeon Gold 6454S with 1TB DRAM (2 NUMA nodes)
- GPU: RTX 4090D (24GB VRAM)
- Memory: Standard DDR5-4800 server DRAM (1TB)
- CUDA version 12.1 or higher
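Before installing, it can help to sanity-check the machine against these requirements. The short script below is an illustrative check (not part of the official instructions); it reports the CUDA version PyTorch was built with, the GPU’s VRAM, and total system RAM. It assumes psutil is installed (pip install psutil).

# Illustrative environment check: CUDA build, VRAM, and system RAM.
import torch
import psutil

print("torch CUDA build:", torch.version.cuda)  # want 12.1 or higher
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 2**30:.1f} GiB")  # want >= 24 GiB
print(f"System RAM: {psutil.virtual_memory().total / 2**30:.0f} GiB")  # reference setup uses ~1 TB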
RTX 4090D + Dual Xeon Gold Test Data:
Task Type | KTrans V0.3 (6 Experts) | KTrans V0.2 (8 Experts) | llama.cpp (FP16)
--- | --- | --- | ---
8K Context Prefill | 207.20 tokens/s | 195.62 tokens/s | 7.43 tokens/s
Short Text Decoding | 13.69 tokens/s | 8.73 tokens/s | 4.51 tokens/s
Long Sequence Throughput | 19.8 GB/s | 15.2 GB/s | 4.8 GB/s
For Linux-x86_64 systems, install gcc, g++, cmake, and ninja-build using the following commands:
sudo apt-get update
sudo apt-get install gcc g++ cmake ninja-build
It is highly recommended to use Conda to create a virtual environment with Python 3.11. Use the following commands to create and activate the environment:
conda create --name ktransformers python=3.11
conda activate ktransformers # You may need to run 'conda init' first
Install PyTorch, packaging, ninja, cpufeature, and numpy:
pip install torch packaging ninja cpufeature numpy
Install KTransformers:
pip install ktransformers --no-build-isolation
Quick Start:
python -m ktransformers.local_chat --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --max_new_tokens 1000
(When the "Chat:" prompt appears, press Enter to load the text from prompt_file.)
Parameter Explanation:
- model_path: Path to the model.
- gguf_path: Path to the GGUF file.
- prompt_file: Path to the file containing the prompt text.
- cpu_infer 65: Number of CPU cores to use for inference (65 for dual CPUs).
- max_new_tokens 1000: Maximum number of tokens to generate.
References:
- GitHub: https://github.com/kvcache-ai/ktransformers
- Local 671B DeepSeek-Coder-V3 / R1 Tutorial: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md