Tsinghua University’s Open Source Project Breaks Through Large Model Computational Bottleneck: RTX 4090 Single Card Runs DeepSeek-R1 at Full Capacity

Currently, the main ways users can access DeepSeek-R1 are through cloud services or “local deployment.” However, the official server often experiences downtime, and personal deployments typically use a distilled version with 90% fewer parameters. As a result, it is very difficult for ordinary users to run the full version of DeepSeek-R1 on regular hardware, and the cost of renting servers is a significant burden even for developers.

This week, the KVCache.AI team at Tsinghua University, in collaboration with Qujing Technology, announced a major update to the open-source KTransformers (pronounced “Quick Transformers”) project. It tackles the problem of deploying hundred-billion-parameter-scale models locally, a significant step toward democratizing large-model inference and moving it from a “cloud monopoly” toward “universal access.”

On February 10, the KTransformers team successfully ran the full 671-billion-parameter versions of DeepSeek-R1 and V3 on a PC with 24 GB of VRAM and 382 GB of DRAM, achieving speedups of 3x to 28x.

Today, KTransformers announced support for longer context lengths (4K–8K on a single 24GB card) and a further 15% speed improvement (up to 16 tokens per second).

According to the official introduction, KTransformers is a flexible, Python-centric framework designed with extensibility at its core. With a single line of code, users can implement and inject optimized modules, and gain access to a Transformers-compatible interface, RESTful APIs compliant with the OpenAI and Ollama standards, and even a simplified ChatGPT-like web UI.
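
As a hypothetical illustration of what the OpenAI-compatible REST interface looks like in practice, the sketch below posts a chat-completion request to a locally running KTransformers server; the port and model name are assumptions for illustration, so check the project documentation for the actual values.

import requests

# Hypothetical call against a locally running KTransformers server that exposes
# an OpenAI-compatible chat completions endpoint. Host, port, and model name
# are assumptions for illustration only.
resp = requests.post(
    "http://localhost:10002/v1/chat/completions",   # assumed address of the local server
    json={
        "model": "DeepSeek-R1",                      # assumed model identifier
        "messages": [{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
        "max_tokens": 256,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])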

The technology now supports running the full 671-billion-parameter versions of DeepSeek-R1 and V3 on a single 24GB consumer-grade GPU (such as the RTX 4090D), with prefill speeds of up to 286 tokens per second and generation speeds of up to 14 tokens per second, upending the assumption that large AI models can only be served from expensive cloud servers.

DeepSeek-R1 is based on a mixture-of-experts (MoE) architecture, which distributes work across expert modules so that only a fraction of the parameters is activated for each inference step. The team’s key innovation is to offload the non-shared, sparse expert matrices to CPU memory and pair them with high-speed operators, cutting the VRAM requirement from the roughly 320GB of a traditional 8x A100 setup to a single 24GB card.
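
To make that division of labor concrete, here is a minimal PyTorch sketch of the offloading idea, not KTransformers’ actual implementation: the small routing layer stays on the GPU, the bulky expert weights remain in CPU memory, and only the experts chosen by the router run for each token.

import torch
import torch.nn as nn

# Minimal sketch of the offloading idea only (not KTransformers' actual code):
# the small router stays in VRAM, while the large, non-shared expert weights
# live in CPU RAM and are applied there, so only activations cross the bus.
class OffloadedMoELayer(nn.Module):
    def __init__(self, hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts).cuda()   # small, kept on the GPU
        self.experts = nn.ModuleList(
            nn.Linear(hidden, hidden) for _ in range(num_experts)
        )                                                      # large, kept in CPU memory
        self.top_k = top_k

    @torch.no_grad()
    def forward(self, x_gpu: torch.Tensor) -> torch.Tensor:
        scores = self.router(x_gpu)                            # routing runs on the GPU
        weights, idx = torch.topk(scores.softmax(dim=-1), self.top_k, dim=-1)
        weights, idx, x_cpu = weights.cpu(), idx.cpu(), x_gpu.cpu()
        out = torch.zeros_like(x_cpu)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():              # only the chosen experts run
                mask = idx[:, k] == e
                out[mask] += weights[mask, k : k + 1] * self.experts[e](x_cpu[mask])
        return out.to(x_gpu.device)                            # send results back to VRAM

layer = OffloadedMoELayer(hidden=512, num_experts=8)
tokens = torch.randn(4, 512, device="cuda")
print(layer(tokens).shape)                                     # torch.Size([4, 512])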

To exploit the characteristics of the MoE architecture, the KTransformers team quantized the expert matrices using Marlin GPU operators, improving efficiency by 3.87x over the traditional approach. On the CPU side, llamafile-based multithreading combined with Intel AMX instruction-set optimizations makes CPU prefill up to 28 times faster than llama.cpp, cutting response times for long-sequence tasks from minutes to seconds.
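
For intuition about what such quantized operators compute, the following plain PyTorch sketch shows group-wise 4-bit weight-only quantization, the kind of numerics that fused kernels such as Marlin are built to accelerate; it illustrates the arithmetic only and is not the Marlin kernel itself.

import torch

# Conceptual sketch of group-wise 4-bit weight-only quantization: one scale per
# group of input channels, int4 codes in [-8, 7]. Illustration only, not Marlin.
def quantize_int4(w: torch.Tensor, group_size: int = 128):
    groups = w.reshape(w.shape[0], -1, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scale).reshape(shape)

w = torch.randn(256, 1024)                      # a weight matrix to compress
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale, w.shape)
print("max abs error:", (w - w_hat).abs().max().item())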

They also reduced CPU/GPU communication breaks so that each decoding step is issued as a single, complete CUDA Graph call, bringing generation speed to 14 tokens per second at a power draw of only about 80W. The whole machine costs roughly 20,000 yuan, about 2% of the cost of a traditional 8x A100 setup.
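
The sketch below shows the general CUDA Graph pattern in PyTorch, capturing one decode-like step once and then replaying it so the CPU issues a single graph launch per step; the tiny linear layer is a stand-in for illustration, not KTransformers’ real decoding path.

import torch

# General CUDA Graph pattern: capture one decode-like step once, then replay it
# so each step costs one graph launch instead of many individual kernel launches.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")

# Warm-up on a side stream is required before capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Each "decode step": refresh the static buffer in place, then replay the graph.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)                      # torch.Size([1, 4096])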

Developer tests showed that on an RTX 3090 with 200GB of RAM, combined with Unsloth optimizations, the Q2_K_XL quantized model reached an inference speed of 9.1 tokens per second, making hundred-billion-parameter models genuinely “home-friendly.”

It is worth noting that KTransformers is not just an inference framework, nor is it limited to DeepSeek models: it is compatible with a variety of MoE models and operators and can integrate a wide range of operators for combined experiments. It supports both Windows and Linux, and interested users can try it out themselves.

However, there are some hardware requirements to use KTransformers; simply having an RTX 4090 does not guarantee smooth operation. Prerequisites include:

  • CPU: Intel Xeon Gold 6454S (2 NUMA nodes)
  • GPU: RTX 4090D (24GB VRAM)
  • Memory: 1TB of standard DDR5-4800 server DRAM
  • CUDA: version 12.1 or higher

RTX 4090D + Dual Xeon Gold Test Data:

Task Type                | KTrans V0.3 (6 Experts) | KTrans V0.2 (8 Experts) | llama.cpp (FP16)
8K Context Prefill       | 207.20 tokens/s         | 195.62 tokens/s         | 7.43 tokens/s
Short Text Decoding      | 13.69 tokens/s          | 8.73 tokens/s           | 4.51 tokens/s
Long Sequence Throughput | 19.8 GB/s               | 15.2 GB/s               | 4.8 GB/s

For Linux-x86_64 systems, install gcc, g++, cmake, and ninja using the following commands:

sudo apt-get update
sudo apt-get install gcc g++ cmake ninja-build

It is highly recommended to use Conda to create a virtual environment with Python 3.11. Use the following commands to create and activate the environment:

conda create --name ktransformers python=3.11
conda activate ktransformers  # you may need to run 'conda init' first

Install PyTorch, packaging, ninja, cpufeature, and numpy:

pip install torch packaging ninja cpufeature numpy

Install KTransformers:

pip install ktransformers --no-build-isolation

Quick Start:

python -m ktransformers.local_chat --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --max_new_tokens 1000
(When the chat prompt appears, press Enter to load the text from the prompt file.)

Parameter Explanation:

  • --model_path: Path to the model.
  • --gguf_path: Path to the GGUF file.
  • --prompt_file: Path to the text file containing the prompt.
  • --cpu_infer 65: Number of CPU cores used for inference (65 is for a dual-CPU machine).
  • --max_new_tokens 1000: Maximum number of tokens to generate.
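
For example, a filled-in invocation might look like the line below, where all paths are hypothetical placeholders for your own local model directory, GGUF file, and prompt file:

python -m ktransformers.local_chat --model_path ./DeepSeek-R1 --gguf_path ./DeepSeek-R1-GGUF --prompt_file ./prompt.txt --cpu_infer 65 --max_new_tokens 1000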
