DeepSeek-R1 for Fine-tuning and Local Execution
Introduction
DeepSeek has achieved a significant breakthrough with its new R1 model series, establishing new standards in reasoning capabilities that rival OpenAI's o1 model. This advancement builds on DeepSeek's recent success with DeepSeek-V3, which currently stands as the most sophisticated open-source AI model available.
Model Architecture and Variants
The R1 reasoning capabilities have also been transferred into other leading model families through distillation and fine-tuning, specifically Llama 3 and Qwen 2.5. These distilled variants can be fine-tuned immediately using the Unsloth framework. The complete R1 model series, including GGUF versions and 4-bit quantized variants, is available on the Hugging Face platform.
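For example, a single distilled GGUF file can be pulled from Hugging Face with the huggingface-cli tool; this is only a minimal sketch, and the repository and file names below simply match the 8B Llama distill used in the commands later in this guide, so swap them for whichever variant you need:
pip install huggingface_hub
# Fetch the 4-bit GGUF of the Llama-8B distill into a matching local folder
huggingface-cli download unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF \
DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
--local-dir unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF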
Technical Specifications and Implementation
The uncompressed 8-bit version of DeepSeek-R1 requires substantial computational resources, with a storage footprint of roughly 700GB. Because the model was originally trained in FP8 precision, direct conversion to GGUF posed technical challenges, since GGUF expects FP16 weights. To address this limitation, the development team created FP16 conversions, from which the publicly available GGUF versions were produced.
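For context, the usual route to a GGUF file runs through llama.cpp's conversion script once FP16 weights are available. A rough sketch follows; the script name and flags come from the llama.cpp repository, the checkpoint path is illustrative, and the FP8-to-FP16 upcasting step itself is not shown:
# Convert an FP16/BF16 Hugging Face checkpoint into a single F16 GGUF file
python llama.cpp/convert_hf_to_gguf.py ./DeepSeek-R1-Distill-Llama-8B \
--outtype f16 \
--outfile DeepSeek-R1-Distill-Llama-8B-F16.gguf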
Running DeepSeek Locally
To run DeepSeek-R1 / R1-Zero, you'll need to install the open-source package llama.cpp, the original framework for running GGUF files. Hardware requirements: you do not need a GPU; a CPU with sufficient RAM will do, but make sure you also have enough disk space.
🦙 with llama.cpp:
These instructions work for both the distilled and non-distilled R1 models; keep in mind, however, that their hardware requirements differ. See further below for the requirements.
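If you do not yet have llama.cpp built, a typical source build looks roughly like the following; the CMake steps follow the project's documented build flow (add the appropriate flags if you want GPU support), and the final copy simply places the binaries where the commands below expect them:
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DCMAKE_BUILD_TYPE=Release
cmake --build llama.cpp/build --config Release -j
# Place llama-cli next to the repo root so ./llama.cpp/llama-cli works
cp llama.cpp/build/bin/llama-* llama.cpp/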
- When running DeepSeek R1 models, proper token handling is essential for good results: pay specific attention to the <|User|> and <|Assistant|> tokens that delimit turns. To streamline this, using a chat template formatter is recommended so the tokens are inserted consistently.
- You need the latest version of llama.cpp, available from the official repository at https://github.com/ggerganov/llama.cpp. Use the most recent build to ensure compatibility and access to the latest optimizations.
- If you use the Q8_0 K quantized cache (--cache-type-k q8_0), include the -no-cnv flag to disable automatic conversation mode. This gives you direct control over the prompt that is sent to the model and over resource usage.
For example:
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
--cache-type-k q8_0 \
--threads 16 \
--prompt '<|User|>What is 1+1?<|Assistant|>' \
-no-cnv
Example output:
<think>
Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
Is there a scenario where 1 plus 1 would not be 2? I can not think of any...
For users with capable GPU hardware, DeepSeek R1 offers a significant speed-up through GPU layer offloading. With a high-performance card such as an RTX 4090 (24GB of VRAM), a portion of the model's layers can be placed on the GPU to accelerate inference, and multi-GPU configurations allow even more layers to be offloaded for further gains.
This flexibility lets you scale processing capability to the hardware you have available: the more VRAM across your GPUs, the more layers you can offload, which is particularly useful for resource-intensive applications that need fast response times.
When tuning, set the number of offloaded layers (--n-gpu-layers) to match your specific hardware and the quantization you are using, so that available resources are fully utilized without exhausting VRAM. The command below repeats the earlier example with 20 layers offloaded to the GPU:
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
--cache-type-k q8_0 \
--threads 16 \
--prompt '<|User|>What is 1+1?<|Assistant|>' \
--n-gpu-layers 20 \
-no-cnv
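If more than one GPU is available, llama.cpp can also distribute the offloaded layers across cards. The sketch below is a hedged example: the --split-mode and --tensor-split values are illustrative for two similarly sized GPUs, so adjust the ratios and the layer count to your actual hardware.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
--cache-type-k q8_0 \
--threads 16 \
--prompt '<|User|>What is 1+1?<|Assistant|>' \
--n-gpu-layers 32 \
--split-mode layer \
--tensor-split 1,1 \
-no-cnv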
Hardware Requirements and Model Specifications for DeepSeek-R1
System Requirements
DeepSeek-R1 can operate on CPU-only systems, though with notable performance considerations. The minimum system configuration requires 48GB of RAM and 250GB of available storage space. However, users should note that operating at these minimum specifications will result in significantly reduced performance, with processing speeds potentially below 1.5 tokens per second.
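Before downloading anything, it can be worth confirming that the machine actually meets these minimums; on a typical Linux system (standard free and df utilities assumed), a quick check looks like:
# Total installed RAM, ideally 48GB or more for CPU-only inference
free -h
# Free disk space on the target filesystem, ideally 250GB or more
df -h .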
Performance Optimization
While GPU acceleration is not mandatory, it substantially improves processing speed and overall system performance. For users interested in experimental implementation, the base configuration remains accessible, though GPU integration is recommended for production environments requiring efficient processing speeds.
Model Quantization Specifications
DeepSeek-R1 offers several quantization options, each with distinct storage requirements and configuration details:
Quants — Q2_K_XS
Disk Size — 207 GB
- Comprehensive Q2 quantization
- Q4 embedding layer
- Q6 language model head
Quants — Q2_K_L
Disk Size — 228 GB
- Q3 down projection
- Q2 quantization for remaining components
- Q4 embedding layer
- Q6 language model head
Quants — Standard Quantization Variants
- Q3_K_M → Disk Size 298 GB
- Q4_K_M → Disk Size 377 GB
- Q5_K_M → Disk Size 443 GB
- Q6_K → Disk Size 513 GB
- Q8_0 → Disk Size 712 GB
Each quantization option presents different trade-offs between model size, performance, and resource utilization. Organizations should select the appropriate variant based on their specific requirements for accuracy, speed, and available computational resources.
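To download only one of these variants rather than the whole repository, the files can be filtered by name pattern. The sketch below uses huggingface-cli; the unsloth/DeepSeek-R1-GGUF repository name and the Q2_K_XS pattern are assumptions here, so substitute whichever repository and quant you actually want:
# Fetch only the files belonging to a single quantization variant
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
--include "*Q2_K_XS*" \
--local-dir DeepSeek-R1-GGUF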