Tsinghua University’s Open Source Project Breaks Through Large Model Computational Bottleneck: RTX 4090 Single Card Runs DeepSeek-R1 at Full Capacity
Currently, users access DeepSeek-R1 mainly through cloud services or local deployment. However, the official servers frequently go down, and personal deployments typically rely on a distilled version with roughly 90% fewer parameters. As a result, it is very difficult for ordinary users to run the full version of DeepSeek-R1 on ordinary hardware, and the cost of renting servers is a heavy burden even for developers.

This week, the KVCache.AI team at Tsinghua University, in collaboration with Quijing Technology, announced a major update to their open-source project KTransformers (pronounced "Quick Transformers"). The update tackles local deployment of models with hundreds of billions of parameters, a significant step toward democratizing large-model inference and moving it from "cloud monopoly" to "universal access."

On February 10, the KTransformers team ran the full 671-billion-parameter DeepSeek-R1/V3 on a PC with 24 GB of VRAM and 382 GB of system memory, achieving speedups of 3 to 28 times. The latest release adds support for longer contexts (4K–8K tokens on a single 24 GB card) and a further 15% speed improvement, reaching up to 16 tokens per second.

According to the official introduction, KTransformers is a flexible, Python-centric framework built around extensibility. With a single line of code, users can inject optimization modules and gain access to Transformers-compatible interfaces, RESTful APIs compliant with the OpenAI and Ollama standards, and even a simplified ChatGPT-style web UI. The framework now supports running the full 671-billion-parameter DeepSeek-R1/V3 on a single 24 GB consumer-grade GPU (such as the RTX 4090D), with prefill (prompt preprocessing) speeds of up to 286 tokens per second and generation speeds of up to 14 tokens per second, ending the reliance on expensive cloud servers for running large AI models.

DeepSeek-R1 is built on a mixture-of-experts (MoE) architecture: tasks are routed to different expert modules, and only a fraction of the parameters are activated at each inference step. The team's key innovation is to offload the non-shared, sparsely activated expert matrices to CPU memory and pair this with highly optimized operators, cutting the VRAM requirement from the traditional 320 GB across eight A100 GPUs to a single 24 GB card. In line with the characteristics of the MoE architecture, the KTransformers team also performs quantized matrix computation on the GPU using Marlin kernels, which substantially improves the efficiency of the GPU-side portion of the workload. A simplified sketch of the offload idea follows.
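The following is a minimal, illustrative PyTorch sketch of the offload strategy, not KTransformers' actual implementation; the class, parameter names, and shapes are assumptions chosen for clarity. The point it demonstrates: the router (and, in the real system, attention and shared weights) stays on the GPU, while the large routed-expert weights remain in CPU RAM and are touched only for the tokens that actually activate them.

```python
# Illustrative sketch only -- not KTransformers' real code. Shared, always-active
# parts (the router here) sit on the GPU; the large, sparsely activated expert
# weights stay in CPU RAM, and each expert runs on the CPU only for the tokens
# routed to it.
import torch
import torch.nn as nn

class OffloadedMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int,
                 top_k: int = 2, gpu: str = "cuda"):
        super().__init__()
        self.top_k = top_k
        # Small and needed for every token -> keep on the GPU.
        self.router = nn.Linear(hidden_dim, num_experts, device=gpu)
        # Large and sparsely activated -> deliberately left on the CPU.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.SiLU(),
                          nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_dim], resident on the GPU.
        scores = self.router(x).softmax(dim=-1)              # routing runs on the GPU
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights, chosen = weights.cpu(), chosen.cpu()
        x_cpu = x.cpu()                                      # move activations, not weights
        out = torch.zeros_like(x_cpu)
        for k in range(self.top_k):
            for e in chosen[:, k].unique().tolist():
                mask = chosen[:, k] == e
                # The expert's matmul happens on the CPU, where its weights already live.
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x_cpu[mask])
        return out.to(x.device)                              # only the result goes back to VRAM
```

The trade-off that makes this viable is bandwidth: per token, only a few kilobytes of activations cross the PCIe bus, while the expert weights, which account for most of DeepSeek-R1's hundreds of gigabytes, never leave system memory. Only the comparatively small dense and shared portion of the model has to fit into the 24 GB of VRAM.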
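As noted above, KTransformers exposes an OpenAI-compatible RESTful API once its local server is running. Below is a minimal client sketch using the standard `openai` Python package; the endpoint address and the model name are assumptions that depend on how the local server was started.

```python
# Minimal client sketch: talk to a locally running KTransformers server through
# its OpenAI-compatible endpoint. The base_url and model name below are
# placeholders -- substitute whatever your local server actually uses.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # assumed local endpoint
    api_key="not-needed-for-local",        # a local server typically ignores the key
)

response = client.chat.completions.create(
    model="DeepSeek-R1",                   # hypothetical local model name
    messages=[{"role": "user",
               "content": "Summarize the mixture-of-experts idea in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```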