05 February 2025

DeepSeek has changed the AI industry in the past week, with US stocks falling and questions being raised about the viability of closed source AI and even of the US AI market. Questions are also being raised about the legal and security issues surrounding DeepSeek, with one US senator filing a bill to make using the app punishable with up to 20 years’ imprisonment.
Today’s post will look at who DeepSeek is, how its R1 model actually works, and how it is different to its major US competitors. My next post will look at some of the questions it raises and its broader repercussions. For today I'll look at:
1. Who are DeepSeek?
2. What is DeepSeek-R1?
3. How does DeepSeek-R1 work?
4. How is DeepSeek-R1 different to other LLMs?
1. Who are DeepSeek?
DeepSeek is a Hangzhou-based Chinese AI start-up founded in November 2023 by Liang Wenfeng, a graduate in electronic information engineering and computer science from Zhejiang University.

Liang Wenfeng appearing at a national symposium in China in July 2024
(Credit: The China Academy)
Mr Liang also has a background in finance and is CEO of High-Flyer, a quantitative trading hedge fund (one that analyses financial data with AI to make investment decisions). In 2019 High-Flyer became the first quant fund in China to raise over 100 billion yuan (around $13 billion). Liang is DeepSeek’s controlling shareholder.
Little is known about Liang, though he was quoted by the BBC as saying that China “cannot remain a forever follower” to the US in AI development. High-Flyer announced on its WeChat account in March 2023 that it was moving beyond trading to focus on creating a “new and independent research group, to explore the essence of AGI" (Artificial General Intelligence). DeepSeek was created shortly afterwards, based in the same office building as High-Flyer.
"[Silicon Valley’s] surprise stems from seeing a Chinese company join their game as an innovator, not just a follower - which is what most Chinese firms are accustomed to."
Liang Wenfeng, January 2025
The extent of High-Flyer’s investment in DeepSeek is not clear, though Reuters reports that, according to Chinese corporate records, the fund owns patents related to chip clusters used to train AI models. Reuters further reports that High-Flyer announced on WeChat in July 2022 that it owned and operated a cluster of 10,000 Nvidia A100 chips.
2. What is DeepSeek-R1?
DeepSeek-R1 is a reasoning large language model (LLM) released on 20 January 2025, following the publication of a research paper by DeepSeek: 'DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning'.

The paper is freely available, and sets out the technical aspects of the model and its predecessor DeepSeek-V3, as well as the research process behind R1. It’s a little dense to unpack, but it may prove to be one of the seminal AI research papers.
R1 is a “first generation reasoning model” trained via “large-scale reinforcement learning (RL) without supervised fine-tuning (SFT)”, according to the paper. Essentially it is an LLM that can be run online through DeepSeek’s servers (which are located in China: a matter of some concern for privacy and security experts, but we’ll look at this in the next post), or operated offline on a desktop computer. (Note that security experts generally do not recommend this, but offline operation is a key feature of the model so needs to be mentioned here.)
DeepSeek is more efficient than its US competitors, as it uses a “mixture of experts” architecture to channel prompts to specific parts of the model whilst leaving others dormant, combined with mathematical efficiency and enhanced chain of thought reasoning to decrease the computational power required to operate it.
Let’s unpack what that means…
3. How does DeepSeek-R1 work?
DeepSeek’s previous model V3 was its flagship large language model (LLM): a transformer-based series of neural networks trained on huge amounts of data.
However, V3 made three major developments from previous LLMs that make it more efficient to run:
a. Mixture of Experts
Rather than running every prompt and every token of the response through the entire model, V3 and R1 assign specialisations to different parts of the model. This means that requests are handled only by those sections required to give the response.
A prompt is first assessed for which parameters of the model are most relevant, and only those parts are used to produce the response. This is much less energy-intensive than using the entire model for each query, as parameters that are not needed can lie dormant while a prompt is processed.
This diagram from the ‘Exploring Large Language Models’ Substack by Data Scientist Maarten Grootendorst, in his article ‘A Visual Guide to Mixture of Experts (MoE),’ is the best representation I’ve seen:

Visual representation of Mixture of Experts, ‘A Visual Guide to Mixture of Experts (MoE),’ by Maarten Grootendorst
In the first image (the dense model), the entire model is used to run the query. But in the second image the query is routed to a specific part of the model. For example, a prompt asking “What is 1+1?” would be routed to the “numbers” section, leaving the rest of the model inactive.
R1 has 671 billion parameters (the weighted connections within its neural networks), but perhaps only 80 billion of these are relevant to solving a maths problem. When R1 is prompted with the question "What is 1+1?", it identifies that the prompt is a "numbers" question and routes it to that part of the model only. So rather than all 671 billion parameters being used to produce the response, in this case only 80 billion might be needed. This may sound like a lot, but an 80 billion parameter model can be run on some home computers, rather than requiring a huge data centre.
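To make the routing step concrete, here is a minimal sketch in Python. The gating network, expert count and dimensions are toy values for illustration, not DeepSeek's actual configuration.

```python
import numpy as np

# Toy mixture-of-experts layer: a small gating network scores each
# expert for the incoming token vector, and only the top-scoring
# experts do any work. All sizes here are illustrative.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2

gate_weights = rng.normal(size=(d_model, n_experts))   # the router
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(token_vec):
    scores = token_vec @ gate_weights                   # score every expert
    top = np.argsort(scores)[-top_k:]                   # keep only the best k
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax
    # Only the selected experts run; the others stay dormant.
    return sum(w * (token_vec @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.normal(size=d_model))
print(out.shape)  # (16,): computed by only 2 of the 4 experts
```

The saving scales with the ratio of dormant to active experts: for any single token, only a fraction of R1's 671 billion parameters need to do any work at all.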
b. Distillation of larger models
Distillation is the process of using a larger ‘teacher’ model to train a smaller ‘student’ model.
According to IBM: “Whereas the objective in conventional deep learning is to train an artificial neural network to bring its predictions closer to the output examples provided in a training data set, the primary objective in distilling knowledge is to train the student network to match the predictions made by the teacher network.”
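As a minimal sketch of that objective (toy logits and a toy temperature; real distillation runs over entire vocabularies and training sets):

```python
import numpy as np

# Toy knowledge distillation step: the student is trained to match the
# teacher's output distribution ('soft targets'), not just hard labels.
# All numbers below are illustrative.

def softmax(logits, temperature=1.0):
    z = np.exp((logits - logits.max()) / temperature)
    return z / z.sum()

teacher_logits = np.array([4.0, 1.0, 0.2])  # large 'teacher' model's raw scores
student_logits = np.array([2.5, 1.5, 0.5])  # small 'student' model's raw scores

T = 2.0  # a higher temperature softens both distributions
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# Distillation loss: KL divergence between teacher and student outputs.
# Training updates the student's weights to push this towards zero.
kl = np.sum(p_teacher * np.log(p_teacher / p_student))
print(f"distillation loss (KL): {kl:.4f}")
```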
DeepSeek distilled several open source AI models (including Llama and Qwen: para 2.4 of the paper) to test whether smaller models could be fine-tuned to become more efficient.
On releasing R1, DeepSeek also offered distilled versions that can be operated using much less computing power, including on home computers, making it far more accessible (albeit with some loss of function compared to the full R1 model).
This mixture of experts approach isn’t quite the same as agentic AI, which is capable of performing tasks autonomously, as the expert groups of parameters are still part of the same LLM. But the concept of routing work to a specialist is similar.
c. Mathematical efficiency
Using the parameters more efficiently through complex vector mathematics (which is beyond the scope of this blog, and of my understanding!), the V3 and R1 models reduce the amount of computing power required to operate them.
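As one simplified illustration of this kind of saving, consider doing the underlying matrix arithmetic at lower numerical precision. This is a general technique, not necessarily the specific optimisation DeepSeek used (those are set out in the V3 and R1 papers):

```python
import numpy as np

# Halving the bits per number halves the memory each matrix needs, and
# modern GPUs run low-precision matrix multiplies much faster.

rng = np.random.default_rng(1)
a32 = rng.normal(size=(1024, 1024)).astype(np.float32)
a16 = a32.astype(np.float16)  # the same matrix at half precision

print(f"float32 matrix: {a32.nbytes / 1e6:.1f} MB")  # 4.2 MB
print(f"float16 matrix: {a16.nbytes / 1e6:.1f} MB")  # 2.1 MB

# The products stay close despite the cheaper representation:
err = np.abs(a32 @ a32 - (a16 @ a16).astype(np.float32)).max()
print(f"max absolute difference: {err:.3f}")
```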
R1 built upon these elements with two additional developments:
d. Chain of thought
Chain of thought is the process of solving problems in stages, an approach popularised by OpenAI with its o1 model. Training a model to approach problems in this systematic way, however, increases the computational power required to run it.
R1-Zero (R1’s predecessor) used standard chain of thought to produce outputs, but DeepSeek’s researchers found that training the model in this way alone reduced the readability of its responses.
R1 enhanced the chain of thought training by fine-tuning on “cold start data” (worked examples of chains of thought for problems) before training the model further through reinforcement learning.
Unlike other LLMs, however, DeepSeek greatly compressed the reinforcement learning for chain of thought, rewarding final outputs rather than assessing every chain of thought produced. The result is a highly accurate model that requires less energy to produce chain of thought reasoning.
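The paper’s name for this technique is Group Relative Policy Optimisation (GRPO): sample a group of answers to the same prompt, score only the final outcomes (for example, right or wrong), and rank each answer against the group average. A heavily simplified sketch of that scoring step, with toy reward values:

```python
import numpy as np

# Outcome-based scoring in the spirit of GRPO: no separate model grades
# every intermediate reasoning step; each sampled answer is scored on
# its final outcome, then normalised against the rest of its group.

def group_relative_advantages(rewards):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # rank within the group

# Four sampled answers to one maths prompt: 1 = correct, 0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # correct answers score positive

# The policy update then reinforces the tokens of high-advantage answers
# and discourages those of low-advantage ones.
```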
The final feature of R1 that differs from other models is that the chain of thought information is available to users in human-friendly language, including the “Aha moment”, as the researchers call it, where the model demonstrates that it has changed approach while producing its result.
e. Multi-stage training process
The chain of thought reinforcement learning is only part of the training process for R1. Most modern LLMs go through reinforcement learning, but with R1 the training process followed a number of key steps. These are set out in much more detail in the article ‘Exploring DeepSeek’s R1 Training Process’ from Towards Data Science, but here is a summary:
1. Cold start (having chain of thought data available to the model as examples, generated by the V3 model)
2. Reasoning-oriented reinforcement learning, including rewarding the model for the linguistic consistency of its outputs (for example, rewarding it for not mixing languages)
3. Rejection sampling and supervised fine-tuning, using the V3 data as benchmarks
4. Final reinforcement learning (“reinforcement learning for all scenarios”, as the paper calls it)
At every stage, the amount of data used to train the model was reduced, making the training process more efficient.
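Put together, the pipeline looks roughly like this. Every function below is a stub standing in for a whole training stage; this is a schematic of the summary above, not DeepSeek’s actual code.

```python
# Schematic of R1's multi-stage training pipeline. Each stub stands in
# for an entire training stage described in the paper.

def supervised_finetune(model, data):
    return model + [f"SFT on {data}"]

def reinforcement_learning(model, reward):
    return model + [f"RL with {reward}"]

def rejection_sample(model, benchmark):
    return f"best sampled answers, benchmarked against {benchmark}"

def train_r1(base_model):
    # 1. Cold start: fine-tune on curated chain of thought examples,
    #    many of them generated by the earlier V3 model.
    model = supervised_finetune(base_model, "cold-start chain of thought data")
    # 2. Reasoning-oriented RL, rewarding correct outcomes and language
    #    consistency (e.g. not mixing languages mid-answer).
    model = reinforcement_learning(model, "outcome and language-consistency reward")
    # 3. Rejection sampling, then another supervised fine-tuning pass
    #    on the best answers.
    model = supervised_finetune(model, rejection_sample(model, "V3 data"))
    # 4. Final RL 'for all scenarios' to align general behaviour.
    return reinforcement_learning(model, "all-scenario reward")

print(*train_r1(["V3 base model"]), sep="\n")
```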
To summarise, R1 uses a combination of existing and innovative data science techniques to produce a model that is highly accurate whilst requiring far less computing power and energy. The mixture of experts architecture and distillation are particularly important, as they deliver the largest savings in energy and compute.
For a really great explanation of the technological advances of R1, I highly recommend the YouTube video on how R1 works from Mike Pound at the University of Nottingham, on the channel Computerphile.
4. How is DeepSeek-R1 different to other LLMs?
DeepSeek has performed extremely well against other models, matching or nearly out-performing OpenAI’s o1 model across tasks, at around 5% of the training and operating costs (a saving of roughly 95%).

In combination with the efficiencies in computational power and energy, this makes the model both accurate and highly-accessible.
"Built on a Mixture of Experts (MoE) architecture, DeepSeek-R1 leverages 671 billion parameters, with only 37 billion activated per forward pass, making DeepSeek R1 both computationally efficient and highly scalable."
UN University, ‘DeepSeek R1: Pioneering Open-Source ‘Thinking Model’ and Its Impact on the LLM Landscape’
To focus on the technical aspects only (the economic, social and geopolitical aspects will be in the next post), R1 differs from its US competitors in a number of significant ways:
Mixture of experts architecture makes it far more efficient and scalable
Distillation of larger models makes it possible to run smaller versions on very little computing power
Chain of thought reasoning is clear, fully visible to users and openly published
Multi-stage training process is more efficient than larger LLMs
R1 requires far less energy and computing power to operate
Efficiency makes it far cheaper to train (around 5% of the cost of a US competitor)
R1 is open source, making it available for use and adaptation by users and researchers all over the world
Conclusion
Whether DeepSeek’s training cost estimate of $6 million is accurate has been disputed, as has the question of whether R1 was truly developed without the advanced microchips currently subject to US export controls against China.
What is undeniable, however, is that DeepSeek’s R1 model has changed the AI industry fundamentally, by producing a more efficient, less energy-intensive LLM that outperforms almost all competitors. And it is all open source.
Liang Wenfeng told Chinese media outlet An Yong that being open source is central to his philosophy:
"For technologists, being followed is rewarding. Open-source is cultural, not just commercial. Giving back is an honor, and it attracts talent."
Liang Wenfeng, July 2024
The days of closed-source AI models – and possibly US AI dominance – may be over.
The next post will look at the huge economic, social and geopolitical implications of R1, as US stock markets begin to steady after such a seismic shock.