15 min read
02/02/2025
DeepSeek was not trained for $5.576M, nor did it copy OpenAI
On January 27, 2025, the technology world experienced what many are calling AI's "Sputnik moment." I was celebrating my birthday when DeepSeek's latest AI model became the most downloaded free app on Apple's U.S. App Store, surpassing ChatGPT. The ripple effects were immediate and dramatic: Nvidia's stock plunged 17% that Monday, wiping out $600 billion in market value in a single day. But beyond these market movements lies a deeper revolution in how artificial intelligence is developed and trained. Little attention was paid to another Chinese model, Alibaba's Qwen 2.5 Plus, which can also stand up to any Western rival in most tasks (follow the link and watch it book a flight using Booking.com's app on a phone).
Just as the Soviet Union's launch of Sputnik in 1957 shattered American assumptions about technological superiority, DeepSeek's breakthrough has challenged fundamental beliefs about AI development. The United States, through companies like OpenAI and infrastructure giants like Nvidia, believed it held an insurmountable lead in AI technology. The U.S. even attempted to maintain this advantage through export controls on advanced AI chips to China. DeepSeek's success proves these measures have failed to prevent Chinese innovation.
The seismic shift that DeepSeek has caused in the artificial intelligence landscape is about to reshape the global technology order. DeepSeek could upend the funding math for AI apps and all things AI. Its cost per token is dramatically lower than that of current LLMs. Moreover, by providing open access to the weights and the model itself, DeepSeek is giving machine learning departments and even small and medium-sized companies the power to host and build solutions that would otherwise have taken them millions of dollars and months to develop (if ever). Its rise marks not just another milestone in AI development but potentially a fundamental transformation in how we approach machine learning itself. Before we dive into the potential and the consequences, I would like to clear up two false claims that have been repeated like a mantra by "serious" media and social media "amplifiers" without checking the source (that is, DeepSeek's paper itself).
Real Training Costs and Infrastructure
Most media outlets were busy reporting that DeepSeek-V3 cost only $5.576M to train; the Chinese "could match with $5M what the Americans did for $100M". But this figure comes from a theoretical calculation assuming $2/hour H800 GPU rental costs. The real infrastructure investment was substantially higher, including the reported stockpile of 10,000-50,000 Nvidia A100 chips acquired before export restrictions.
The paper itself explains the use of the H800 and how the GPU hours were calculated, and that calculation is sound: "Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data."
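To see how literal that headline figure is, here is the arithmetic behind it as a minimal sketch; all the numbers come from the passage just quoted, and the $2/hour rental price is the paper's own assumption:

```python
# Reproducing the paper's back-calculation: GPU hours times an assumed
# rental price, nothing more. Figures are taken from the quoted passage above.
pretraining_hours   = 2_664_000   # 2664K H800 GPU hours for pre-training
context_ext_hours   = 119_000     # 119K GPU hours for context length extension
post_training_hours = 5_000       # 5K GPU hours for post-training
rental_price = 2.0                # assumed H800 rental price in USD per GPU hour

total_gpu_hours = pretraining_hours + context_ext_hours + post_training_hours
cost = total_gpu_hours * rental_price

print(f"Total GPU hours: {total_gpu_hours / 1e6:.3f}M")   # 2.788M
print(f"Implied training cost: ${cost / 1e6:.3f}M")       # $5.576M
```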
It is clear from the above that the machine learning team never actually paid $5.576M: the infrastructure was already in place, and the figure is a back-calculation based solely on potential GPU usage. DeepSeek-V3 would indeed cost roughly that amount in training compute. The preceding paragraphs of the report state that training was smooth and never interrupted (quite surprising, if you ask me). However, the cost of training R1 was never published, and a large part of the efficiency gains comes from the increased MoE sparsity ratio they decided to use, which ends up sacrificing more VRAM but gets the benefit of a lower training cost.

Recent analysis has also challenged previous assumptions about AI model training costs in general, with estimates suggesting more modest expenditures than widely believed. Major models like GPT-4 and Claude are estimated to have training costs of around $10M, while o1 and Claude 3.5 Sonnet fall in the $20-30M range; the latter figure was confirmed by Anthropic's CEO Dario Amodei himself in a blog post saying that Claude 3.5 Sonnet took "a few tens of millions". The disconnect with earlier estimates of hundreds of millions in training costs can be attributed to technical limitations that historically capped training runs at around 24K GPUs. The landscape is evolving, however, with companies like Microsoft/OpenAI and xAI recently building larger clusters of approximately 100K H100s, enabling training runs costing around $500M. This shift in reported costs has sparked discussion about corporate incentives: some suggest U.S. companies may have inflated figures to attract investment, while others, like DeepSeek, emphasize cost efficiency to move the conversation to engineering skill.

In any case, DeepSeek-R1 is just the latest model in a series of DeepSeek AI models and builds on previous work, particularly on math. They used H800s for this specific training run (the H800 is a modified version of Nvidia's H100 sold in the Chinese market because of export regulations, with chip-to-chip bandwidth reduced to around 300 GB/s compared to the H100's 600 GB/s). Without a doubt, the holding company had access to H100s before the export ban, something that has completely backfired. Here's a breakdown of the likely training costs and infrastructure considerations:
- Nvidia A100 GPUs: With a reported stockpile of 10,000 to 50,000 A100 GPUs, the hardware investment alone is substantial. Each A100 GPU costs approximately $10,000 to $15,000, depending on the configuration and purchase volume, which translates to a hardware investment ranging from $100M to $750M. Add to that the necessary supporting infrastructure: beyond GPUs, training large AI models requires high-performance CPUs, memory, storage, and networking equipment, and data centers must also be equipped with cooling systems, power supplies, and redundancy systems, all adding to the overall cost.
- Operational costs such as power consumption. For example, training a model on 10,000 A100 GPUs for several weeks could consume millions of kilowatt-hours, resulting in power costs in the range of $1M to $5M or more depending on local electricity rates, plus maintenance and personnel (a large, skilled team of engineers, technicians, and researchers with their salaries, benefits, and operational overhead).
- Software and development costs, such as developing and optimizing custom software frameworks for distributed training, data preprocessing, and model evaluation, plus the essential data acquisition and preparation (high-quality training data is critical for model performance, and acquiring, cleaning, and preprocessing it is a costly and time-intensive process).
The theoretical $5.576M training cost is therefore a significant underestimate. The real investment in DeepSeek-V3 includes not just GPU time but substantial infrastructure, operational, and strategic costs, reflecting the true scale of modern AI development (and this broader accounting is, of course, how OpenAI arrived at its claimed $100M training figure).
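For contrast, here is a back-of-envelope sketch of the surrounding investment using the estimated ranges cited in the list above; these inputs are illustrative, not confirmed figures:

```python
# Rough, illustrative estimate of the infrastructure behind the training run.
# GPU counts, unit prices and power costs are the estimated ranges cited above.
def infrastructure_estimate(num_gpus, unit_price_usd, power_cost_usd):
    """Hardware plus power only; excludes personnel, networking, data, etc."""
    return num_gpus * unit_price_usd + power_cost_usd

low  = infrastructure_estimate(num_gpus=10_000, unit_price_usd=10_000, power_cost_usd=1_000_000)
high = infrastructure_estimate(num_gpus=50_000, unit_price_usd=15_000, power_cost_usd=5_000_000)

print(f"Low estimate:  ${low / 1e6:,.0f}M")    # ~$101M
print(f"High estimate: ${high / 1e6:,.0f}M")   # ~$755M
```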
Second fallacy: "They only used Reinforcement Learning"
At the heart of DeepSeek's breakthrough is their Group Relative Policy Optimization (GRPO) framework (we explain how this works below). Unlike traditional reinforcement learning approaches, where a "critic" model trained on labeled data is required, GRPO allows the model to learn by comparing its performance against a group average. This eliminates the need for labeled training data while maintaining high performance standards. Initial media coverage portrayed DeepSeek's breakthrough as purely based on Reinforcement Learning through GRPO, but reading their DeepSeek-V3 Technical Report 2412.19437v1 reveals a more sophisticated multi-stage training process that combines several approaches. The report explicitly discusses "distilling the reasoning capability from the DeepSeek-R1 series of models" during post-training, showing that knowledge transfer from supervised models played a crucial role. I'll summarize it:
More on the subject: What Is Reinforcement Learning from Human Feedback (RLHF) and How Does It Work?
According to the paper, DeepSeek-V3's training involved multiple stages:
1. Pre-training
- 14.8T high-quality and diverse tokens (we assume from many languages, since generation performance is impressive, particularly in the under-represented Asian and European languages we have tested)
- Traditional language model pre-training approaches
- Base model development with MoE architecture (the same one used by OpenAI and by Mistral's Mixtral, the basis of our fine-tuned version, ECO LLM, 2023)
More on the subject: Demystifying Mixture of Experts (MoE): The future for Deep GenAI systems
2. Supervised Fine-Tuning (SFT)
- 1.5M carefully curated instruction instances
- Multiple domain coverage
- Two distinct types of SFT samples per instance (see the sketch at the end of this section):
- <problem, original response> format
- <system prompt, problem, R1 response> format
3. Post-Training Reinforcement Learning
- Implementation of GRPO
- Use of both rule-based and model-based reward models
- Integration of verification and reflection patterns
4. The Role of Human Feedback
The paper describes several points where human feedback and supervision were crucial in ensuring the model's quality and alignment. Human annotators played a critical role in data verification, meticulously reviewing non-reasoning data to maintain high standards for the training datasets. This involved rigorous quality control measures to filter out noise, inconsistencies, or irrelevant information, strengthening the foundational data used for model training. Additionally, human evaluators systematically validated model outputs to assess accuracy, coherence, and relevance, ensuring the system’s responses met practical and ethical benchmarks. For reward model training, human input was central: annotators generated human preference data to train the reward model, which guided the model toward desirable behaviors. This included chain-of-thought annotations that provided granular feedback on reasoning steps, enabling the model to align its outputs with human-like logical processes. Furthermore, domain experts validated outputs in specialized fields, injecting technical accuracy and domain-specific nuance into the system. These iterative human-in-the-loop processes—spanning data curation, output evaluation, and reward signal refinement—highlight how DeepSeek’s development relied on continuous human oversight to balance automation with precision, ultimately ensuring the model’s reliability and alignment with real-world needs.
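To make the two SFT sample formats mentioned above concrete, here is an illustrative sketch of how such instances might be structured. Only the two formats themselves come from the paper; the field names and the example content are hypothetical:

```python
# Illustrative structures for the two SFT sample types described in the report.
# Field names and contents are hypothetical; only the two formats are from the paper.

# Format 1: <problem, original response>
sft_sample_original = {
    "problem": "Simplify the fraction 12/18.",
    "response": "12/18 simplifies to 2/3.",
}

# Format 2: <system prompt, problem, R1 response>
sft_sample_r1 = {
    "system_prompt": "Verify each step of your reasoning before giving the final answer.",
    "problem": "Simplify the fraction 12/18.",
    "response": "GCD(12, 18) = 6, so 12/18 = (12/6)/(18/6) = 2/3.",
}

print(list(sft_sample_original), list(sft_sample_r1))
```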
It’s not New Engineering - It’s Optimization. This is not “copying”
During the initial months of the ChatGPT 3.5 shock, Stanford Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html) was trained in about three hours on roughly $600 of compute using data generated from another model's output rather than a whole mess of datasets. Nobody complained about that distillation. For a while, Alpaca's performance matched that of Llama and OpenAI's models, and it was the basis of our own first experiments. When training responses are consistent, models learn more efficiently. To grasp the magnitude of DeepSeek's achievement, we must understand the traditional approach to AI development and how DeepSeek has revolutionized it. They haven't invented anything new, but they have optimized many components.
Traditionally, large language models like GPT-4 are developed through a process called Supervised Fine-Tuning (SFT). This approach requires massive amounts of labeled training data —essentially examples of correct inputs and outputs that help the model learn. Think of it as teaching a student by showing them thousands of solved problems. This process is expensive, time-consuming, and creates a high barrier to entry for new players in the field.
DeepSeek has taken a radically different approach, focusing on Reinforcement Learning (RL). RL is inspired by how humans naturally learn, through trial and error, and Google's DeepMind pointed to reward "being enough" in their 2021 paper. Most of us working in development or building datasets for AI know that synthetic (machine-generated) data is often more reliable than human-generated data when training for specific scenarios. With RL, instead of being shown the correct answers, the model learns by receiving rewards or penalties based on its actions. What makes DeepSeek's achievement remarkable is its successful implementation of "pure-RL" training without any pre-labeled data.
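The practical difference between the two signals can be seen in a toy sketch: with SFT the update moves the model toward a given label, while with RL the model samples an action and the update is scaled by a reward. This is a generic illustration of the contrast, not either lab's actual training code:

```python
import numpy as np

# Toy contrast between the two training signals discussed above.
# Supervised fine-tuning: the "correct" output is given, and we move toward it.
# Reinforcement learning: no label is given; we sample an action and scale the
# update by a scalar reward from the environment.
rng = np.random.default_rng(0)
theta = np.zeros(2)                      # tiny "policy": preference for actions 0/1

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

# --- Supervised signal: a labeled example says the right action is 1 ---
label = 1
grad_sft = np.eye(2)[label] - policy(theta)        # push probability toward the label
theta_sft = theta + 0.5 * grad_sft

# --- RL signal: sample an action, receive only a reward ---
action = rng.choice(2, p=policy(theta))
reward = 1.0 if action == 1 else -1.0              # feedback, not a correct answer
grad_rl = reward * (np.eye(2)[action] - policy(theta))
theta_rl = theta + 0.5 * grad_rl

print("after SFT step:", policy(theta_sft))
print("after RL step: ", policy(theta_rl))
```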
DeepSeek's implementation, while potentially using some supervised signals, demonstrates how engineering innovations can make reward-based learning more efficient, and this is the real breakthrough. DeepSeek has taken the theory and applied it even in unfavorable conditions, and this truly shines in the engineering achievements they describe in the paper:
- Multi-head Latent Attention (MLA): Reduces memory and computational costs by projecting KQV matrices into a lower-dimensional space.
- Mixture-of-Experts (MoE) architecture with 671B total parameters: Even French company Mistral released two MoE models (also sending a signal to OpenAI: “We know how you’ve done it”). A Mixture of Experts uses only selected parameters per token, reducing computation while maintaining model quality. DeepSeek has implemented a special load-balancing loss to ensure balanced expert utilization across distributed hardware.
- Multi-Token Prediction (MTP): Allows parallel token generation, improving throughput by 2-3x.
- Lastly, FP8 Quantization: Provides up to 75% memory reduction compared to FP32 while maintaining stability through adaptive bit-width scaling and loss-aware quantization techniques.
These architectural innovations (MoE, MLA, MTP, and FP8 quantization) focus on optimizing large-scale training, deployment, and serving efficiency, not single-user or local runtime performance. For example, an MoE model requires roughly the same memory footprint as a dense model of the same total size (like Meta's Llama family) even though it activates fewer parameters per inference, and MTP's parallel token generation mainly benefits high-throughput scenarios.
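As a rough illustration of why MoE reduces per-token compute but not the memory footprint, here is a toy sketch of top-k expert routing; the dimensions and the simple softmax gating are generic placeholders, not DeepSeek-V3's actual routing or load-balancing scheme:

```python
import numpy as np

# Toy top-k Mixture-of-Experts routing: each token is processed by only
# k of the N experts, so compute per token scales with k, not N,
# while all N expert matrices still have to be kept in memory.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(token):
    logits = token @ router                      # router scores for each expert
    chosen = np.argsort(logits)[-top_k:]         # pick the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                     # softmax over the chosen experts
    # Only k expert matmuls run per token instead of n_experts.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (16,) -- same output size, ~k/N of the compute
```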
The real innovation comes, then, from the training methodology. OpenAI, Gemini, Claude, Mixtral, and others may soon adopt the same efficiency techniques. As a model, DeepSeek-R1 is too verbose even when its temperature is set to 0. The team arrived at some of the core ideas behind OpenAI's o1 independently (confirmed by Mark Chen, Chief Research Officer at OpenAI). DeepSeek used Group Relative Policy Optimization (GRPO), a more efficient alternative to PPO/DPO for reinforcement learning, within a multi-stage training approach combining SFT and RL. The reasoning capabilities emerge through reinforcement learning.
Understanding GRPO (and its link to efficient teaching at business schools)
GRPO is indeed innovative, but the underlying concept has long been used in educational establishments such as IESE Business School. At IESE, students are given tasks and projects to complete in groups. Students in a class do not receive individual assessments; instead, the group gets a score for the work, and each student's achievement is measured relative to the group's average, separating stronger from weaker performers. The objective is to raise the group's overall precision and to encourage students with weaker marks to do better (to work with higher precision and accuracy). In practice, this would work like this:
Question: What is the capital of France?

| Student | Answer | Reward |
|---|---|---|
| Student A | Paris | 0.9 |
| Student B | The capital of France is Paris. | 1 |
| Student C | Rome | 0 |
| Student D | Paris is the capital of France | 0.95 |
| Average | | 0.7125 |
Here, certain penalties might apply. For example, if we are looking for longer answers, Student A may be slightly penalized for providing a concise answer. However, any answer scoring over 0.7125 is a viable answer. At scale, the model can learn the common-sense answer and how it may be expressed. Imagine applying this method to math or code, highly popular disciplines in Asia. This method seems to have been the core of the "savings", since far fewer humans were employed than in the first U.S. versions (ChatGPT 3.5, Google's Gemini, or Anthropic's series of Claude models).
DeepSeek's Group Relative Policy Optimization (GRPO) represents an elegant engineering solution to a classic reinforcement learning problem. To score responses to a question such as "What is the capital of France?", traditional approaches would have required a separate "critic" model. As we have seen, GRPO instead estimates reward baselines by comparing responses within groups of model outputs.
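A minimal sketch of that group-relative idea, reusing the classroom rewards from the table above: score a group of sampled responses and use each response's deviation from the group statistics as its advantage, so no separate critic model is needed. This shows the basic normalization step only, leaving out GRPO's clipping and KL-penalty terms:

```python
import numpy as np

# Group-relative advantages for one prompt ("What is the capital of France?").
# Rewards reuse the classroom example above; in practice they would come from
# rule-based or model-based reward models.
rewards = np.array([0.9, 1.0, 0.0, 0.95])    # Students A-D

baseline = rewards.mean()                     # group average (0.7125)
advantages = (rewards - baseline) / (rewards.std() + 1e-8)

for name, r, a in zip("ABCD", rewards, advantages):
    print(f"Student {name}: reward={r:.2f}, advantage={a:+.2f}")
# Responses above the group average get positive advantages and are reinforced;
# those below it are discouraged -- no separate critic model is required.
```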
However, this innovation has sparked debate. Some experts in the AI community point out potential contradictions: while DeepSeek claims "pure RL without supervised data," generating meaningful rewards for language model outputs likely still requires some form of supervision or judgment. As one researcher noted on Reddit /singularity: "Without any labels, how does one calculate this 'reward' for each response? The only logical solution is to use another advanced LLM." There is a point to this: some reference is needed to reach certain levels of accuracy. Let's face it, AI is already pervasive in many areas of our lives, and not leveraging AI systems to speed up one's work would be hard to justify. But fact-checking against an established player or using some of its existing knowledge for support processes can hardly be considered "distilling".
The "Reward is Enough" Connection: This is the real breakthrough
We are now coming around to the idea that DeepSeek's training costs and process were much more standard than initially reported in the media, and that the frenzy and shock have more to do with sensationalism and with misreading the paper. The fact that a model produced in China has surpassed OpenAI's or Gemini's in several tasks should be newsworthy on its own, but retelling the story as "they did it with $5M, and they did it only with Reinforcement Learning" (in effect, a synthetic process with no humans involved) sells more and captures more headlines.
As described above, DeepSeek focused on Reinforcement Learning (RL), in which the model learns by receiving rewards or penalties for its actions instead of being shown the correct answers. This approach aligns with principles outlined in the influential "Reward is Enough" paper by DeepMind researchers, published in 2021. (DeepMind was acquired by Google in 2014 and merged with Google AI's Google Brain division to become Google DeepMind in April 2023; the team now works at Google, with research centres in Canada, France, Germany, and the United States. Get ready for some news from Google in 2025!)
DeepMind made headlines in 2016 after its AlphaGo program beat professional Go player Lee Sedol, a world champion, in a five-game match that was the subject of a documentary film. Another program, AlphaZero, beat the most powerful programs playing Go, chess, and shogi (Japanese chess) after a few days of play against itself using... Reinforcement Learning.
This framework suggests that intelligence and its associated abilities can emerge solely through the maximization of a reward: "The maximization of different rewards in various environments leads to distinct forms of intelligence by shaping the nature of an agent's abilities based on its specific experiences. Each environment presents unique challenges and rewards, which, when maximized, result in the emergence of powerful and specialized forms of intelligence. This process allows agents to develop a diverse array of abilities, as demonstrated by the success of AlphaZero in mastering games like Go, chess, and shogi through a singular focus on reward maximization".
The Evolution of DeepSeek
DeepSeek's journey to this breakthrough was methodical. Their first significant release, DeepSeek Coder, arrived in November 2023 as an open-source project. This was followed by DeepSeek LLM, scaled to 67B parameters, which challenged GPT-4's capabilities but faced efficiency issues.
The real breakthrough came with DeepSeek-V2 in May 2024, demonstrating unprecedented efficiency in computational resources and energy consumption. This version sparked what Chinese media dubbed the "AI price war," forcing even tech giants like ByteDance and Alibaba to reduce their prices to compete.
DeepSeek-V3, released in December 2024, represented another leap forward, matching the performance of top models like GPT-4 and Claude 3.5 Sonnet while using significantly fewer resources. The January 2025 release of DeepSeek-R1 and R1-Zero represents the culmination of their innovation. These models demonstrate that sophisticated AI can be developed through pure reinforcement learning, challenging the industry's reliance on massive labeled datasets and sending ripples through several industries: the large-scale data acquisition that AI companies depend on, and the companies that had built their apps on OpenAI's API, to name two. Areas such as accuracy in LLM translation remain to be tested in depth (it is no secret that we have begun our own tests). DeepSeek recommends a higher temperature value than usual. Our initial tests point to a tendency to summarize and adapt too much, especially in Western European languages, perhaps due to the default value of 1.3. Although translations are fluent, there is also a tendency to miss some segments or to merge concepts. Fluency, adaptation, and accuracy in high-demand languages (Chinese, Japanese, Korean) look very good.
Note that DeepSeek's default temperature parameter is tuned not for succinctness but for slight verbosity, which affects both conversations and translation.
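For anyone who wants to test this behavior themselves, here is a minimal sketch of an API call with the temperature lowered for translation work. It assumes DeepSeek's OpenAI-compatible endpoint and the `deepseek-chat` model name, and the 0.3 value is just an illustrative choice, not an official recommendation:

```python
from openai import OpenAI

# Minimal sketch: calling DeepSeek through its OpenAI-compatible API and
# lowering the temperature for translation work. The endpoint URL and model
# name are assumptions based on DeepSeek's public documentation.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    temperature=0.3,   # lower than the verbose default discussed above
    messages=[
        {"role": "system", "content": "Translate the user's text into Spanish. Do not summarize or merge segments."},
        {"role": "user", "content": "The quarterly report is due on Friday."},
    ],
)
print(response.choices[0].message.content)
```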
Industry Implications
The implications of DeepSeek's breakthrough extend far beyond the AI research community:
First, there is the democratization of AI development. By open-sourcing their technology, demonstrating that high-performance AI can be developed without massive labeled datasets, and releasing the model with open weights, DeepSeek has lowered the barriers to entry for AI development. The right fine-tuning will lead to an explosion of specialized AI models developed by smaller companies and organizations. We will soon begin customization and fine-tuning at Pangeanic for our Deep Adaptive AI Translation, summarization, and data classification, among other technologies. Secondly, there is energy efficiency: DeepSeek's models require significantly less computational power and energy than their competitors, addressing one of the major criticisms of large language models, their environmental impact due to high energy consumption. Even for those who do not want to host the model, there is a tangible cost reduction in usage, since DeepSeek's API costs are approximately 27 times lower than OpenAI's for both input and output tokens.
Political and Economic Implications
The geopolitical ramifications of DeepSeek's success are significant. It demonstrates that U.S. export controls on advanced AI chips have not prevented Chinese innovation. Instead, these restrictions may have pushed Chinese companies to develop more efficient alternatives, which are now available to developers worldwide. The situation has created what some analysts call a "Digital Cold War," with Europe caught between American and Chinese AI ecosystems. True, this raises important questions about technological sovereignty and the future of global AI development, which, again, the availability and openness of the DeepSeek model may help solve. To paraphrase Meta's Yann LeCun, the conflict is really about closed, proprietary models versus open models. For once, after Italy's and Belgium's inquiries to DeepSeek about personal data protection (with the model facing a potential ban, just as ChatGPT once did), U.S. authorities are probably happy that legislation like GDPR, which requires the masking of personal data, is in place. DeepSeek also faces scrutiny over its relationship with Chinese authorities. The model includes certain restrictions and censorship mechanisms, particularly around politically sensitive topics. This raises questions about the balance between open-source development and government control and influence, something we already observe in the United States.
Looking Forward
The emergence of DeepSeek’s R1 model and its rapid adoption signal a pivotal shift in the AI landscape, with profound implications for technology development, global competition, and societal impact. DeepSeek's breakthrough suggests we're entering a new phase in AI development, where innovation might come from unexpected places and take unexpected forms. Some African countries and communities in India or Latin America may take a smaller DeepSeek model and fine-tune it for their purposes, simply adding culturally relevant instructions or more data for localized needs (e.g., low-resource language translation and agricultural optimization). Innovation by startups in those regions could flourish outside traditional tech hubs. This aligns with the trend of "glocal" AI—global models adapted to local contexts. Language preservation projects are likely to benefit. Focusing on Reinforcement Learning over supervised training could lead to more efficient, adaptable AI systems that learn more like humans do.
For industries relying on AI technology, from language translation to software development, this means preparing for a world where advanced AI capabilities are more accessible and affordable. Companies must focus on specialized applications and value-added services rather than simply access to AI models. This leads to a commoditization of foundational models: As foundational models become cheaper and more accessible, their value as standalone products diminishes. It's like having a computer operating system: you expect it and take it for granted. It is what you build on top that matters. Competition will shift to specialized applications, data curation, and vertical-specific solutions (e.g., healthcare diagnostics, legal automation). Startups may prioritize domain expertise over model-building, reshaping venture capital strategies.
We are about to see a lot of pressure on incumbents. Anthropic disdained DeepSeek in its blog. CEO Dario Amodei said that he doesn't view China's DeepSeek "themselves as adversaries" but believes that export controls are more critical than ever when it comes to artificial intelligence. That strategy has proven itself short-sighted already. OpenAI’s response (e.g., accelerating releases) reflects the threat of low-cost alternatives. If DeepSeek’s performance holds, incumbents may face margin compression, forcing them to innovate faster or diversify into enterprise services, hardware, or ecosystem tools (e.g., evaluation frameworks).
The business models that will emerge will be novel and very interesting. The affordability of inference (API calls) and open-source flexibility could spur decentralized AI for community-driven model fine-tuning and federated learning and, above all, “AI-as-a-Utility” (subscription-based services for niche industries) and hybrid architectures, with startups combining multiple cost-efficient models to optimize performance (e.g., DeepSeek for reasoning, Mistral for creativity).
The future of AI development appears more open and democratic than ever before, but also more complex and politically charged. DeepSeek’s success highlights China’s growing role in AI innovation, challenging Western dominance. This will likely intensify U.S.-China tech decoupling as both nations vie for technological supremacy. However, it also presents an opportunity for collaboration in critical areas such as AI safety, where shared goals could foster cooperative efforts despite geopolitical tensions. DeepSeek's achievement is not merely a technical milestone; it signifies a potential realignment of the global technological order, where power dynamics may shift, and new leaders in AI innovation could emerge.
We will also face new bias and security risks as concerns about embedded cultural/political biases in models like R1 (trained on Chinese data) may limit adoption in certain markets. Conversely, developers in authoritarian regimes might exploit open-source models for surveillance or censorship tools. Finally, the same regulatory dilemmas will hold as governments will grapple with balancing innovation incentives (via affordable AI) with risks like misinformation, labor displacement, and ethical misuse. Regions like the EU may tighten oversight of open-source models, while others prioritize rapid deployment.
As we move forward, the key question isn't just who can build the most powerful AI models but who can use them most effectively to solve real-world problems while addressing crucial concerns about ethics, privacy, and social impact. The AI revolution has entered a new phase, just 2 years after the first Chat+LLM models shocked the world with their fluency.