DeepSeek and Its Implications for the US Tech Market

So we’ve all heard about DeepSeek by now, right? (I’m definitely late to the party with this article, lol.) And we’ve heard plenty of people declare that it’s the end of American tech! But is it really, and should you start selling all your tech stocks like the media is telling you to?

DeepSeek’s Technical Prowess

Before I get into the investing side of things, I want to talk about what DeepSeek ACTUALLY is - what its technical papers discuss, what was ACTUALLY reported, and whether what the media has been telling you is true.

Firstly, yes, it’s true that DeepSeek did outperform OpenAI’s models on SOME benchmarks. And yes, this does (to a certain extent) prove DeepSeek’s technical capabilities.

So let’s dive further into the technical report.

[Figure: benchmark comparison from the DeepSeek-V3 technical report, comparing V3 against GPT-4o and other leading models]

I took the above diagram directly from the DeepSeek-V3 paper. Looking at the benchmarks, we see that DeepSeek beats the GPT-4o model on most of the benchmarks listed. While it is also possible (and common) for teams to tune their models toward benchmark results, I don’t want to discredit the fact that V3 beating 4o on these benchmarks is still an engineering feat.

Also, they claim to have done this at a cost of only about $5.6M, while companies like OpenAI are reportedly spending $100M+ on training runs to achieve (similar) results. But is this really true, and if so, are all our investments cooked?

What exactly is DeepSeek?

Before I dive further into DeepSeek’s models, I first want to address two concepts that will be useful later on - inference cost (incurred when the model is running and answering your query) vs training cost (incurred when engineers train the model during development). What DeepSeek claimed was that its TRAINING was cheap; far less was said about its inference cost.

Well, I don’t doubt that DeepSeek’s training COST was cheaper, but we must remember that there are two primary factors to take into account when we talk about the cost of training an AI.

1. The monetary and compute cost of training → how much it actually costs to train the model, and how much energy and compute the training consumes

2. The time cost for training → this refers to how long it takes to train the model

Monetary and compute cost

Now, I’m not sure how accurate the claim is that DeepSeek cost only about $5.6M to train, but one thing I don’t doubt is that its training was cheaper than 4o’s. However, we must remember that China’s energy costs are lower, and that DeepSeek is also likely backed by the Chinese government. If we took the same compute and priced it at American energy and hardware costs instead of Chinese ones, who knows what the training bill would look like?
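As a rough sanity check on that headline number: the V3 report reportedly arrives at its figure by multiplying total H800 GPU-hours (around 2.79M) by an assumed $2/GPU-hour rental rate. The snippet below reproduces that arithmetic and shows how sensitive it is to the assumed rate - the higher "US-style" rate here is purely my own illustrative assumption, not a published figure.

```python
# Back-of-envelope reproduction of DeepSeek-V3's reported training cost.
# The GPU-hour total and $2/hour rate are as reportedly stated in the V3 paper;
# the $3.50/hour figure is a purely hypothetical "higher-cost region" assumption.

H800_GPU_HOURS = 2.788e6          # total H800 GPU-hours reported for V3 training
RATE_REPORTED = 2.00              # USD per GPU-hour assumed in the paper
RATE_HYPOTHETICAL = 3.50          # hypothetical higher rental rate (assumption)

cost_reported = H800_GPU_HOURS * RATE_REPORTED
cost_hypothetical = H800_GPU_HOURS * RATE_HYPOTHETICAL

print(f"Reported estimate:       ${cost_reported / 1e6:.2f}M")      # ~ $5.58M
print(f"At a higher rental rate: ${cost_hypothetical / 1e6:.2f}M")
```

The point isn’t the exact dollar figure - it’s that the headline cost is a function of whatever hourly rate you assume, so the same compute priced differently gives a very different number.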

Now, the interesting part: compute cost, the amount of computing power required to train and run DeepSeek. If you look at DeepSeek’s architecture, it uses a design called Mixture of Experts (MoE). What this entails is that instead of trying to create an omnipotent AI that knows everything everywhere all at once, we create an AI that works more like a human brain - different parts are used for different things.

If you look at the way a human brain functions, you realise that it is split into different parts:

- Frontal Lobe – Controls reasoning, decision-making, problem-solving, voluntary movements, and speech.

- Parietal Lobe – Processes sensory information (touch, temperature, pain) and spatial awareness.

- Temporal Lobe – Handles memory, auditory processing, and language comprehension.

- Occipital Lobe – Responsible for vision and visual processing.

I won’t have enough space in this blog to give you a neuroscience lecture, but you get the point - each part has its own specialty, and you won’t be required to activate all parts of your brain for all activities.

Now, if you look at the MoE architecture, out of the 671B total parameters in V3, only about 37B are activated for each token of inference. This means that when a user query comes in, a routing network picks the handful of “experts” (roughly 37B parameters’ worth) best suited to process each token, rather than consulting all 671B parameters. If you’re wondering why this is important, it’s because this is the reason their inference compute cost is cheaper.
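To make the routing idea concrete, here is a heavily simplified top-k MoE layer in PyTorch. The expert count, top-k value, and dimensions are made-up toy numbers, and V3’s real routing (shared experts, fine-grained experts, load balancing) is far more sophisticated - treat this as a sketch of the general mechanism, not DeepSeek’s implementation.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only, not DeepSeek's code)."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)        # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, chosen = scores.softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token -- this sparsity is where
        # the "37B active out of 671B total" style saving comes from.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(10, 64)        # 10 tokens, hidden size 64
print(layer(tokens).shape)          # torch.Size([10, 64])
```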

We don’t know exactly how many parameters GPT-4o has, but suppose it were a dense model of a similar size to V3, say around 671B parameters. A dense model activates all 671B parameters on every forward pass, regardless of how useful or useless they are for the query at hand. In contrast, DeepSeek’s V3 only activates about 37B parameters per token, and the proposition is that those 37B are enough to achieve results similar to the full dense model. Hence, the per-token compute required by V3 is roughly 18x smaller than that of a comparable dense model.
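As a quick back-of-envelope check on that ratio (remember, the dense parameter count is purely hypothetical, since OpenAI hasn’t disclosed GPT-4o’s size):

```python
# Rough per-token compute comparison: dense model vs. MoE with sparse activation.
# The dense figure is a hypothetical stand-in; OpenAI has not disclosed GPT-4o's size.
dense_params_active = 671e9   # hypothetical dense model: every parameter active per token
moe_params_active = 37e9      # DeepSeek-V3: ~37B of 671B parameters active per token

ratio = dense_params_active / moe_params_active
print(f"~{ratio:.0f}x fewer active parameters per token")   # ~18x
```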

However, does lower compute mean that the overall cost is lower? Should we also factor in how long it takes to fine-tune and train the model?

Time cost

One thing I rarely see people talk about with DeepSeek is training time. Based on the paper alone, there is little indication of how long it actually took to train and fine-tune the model - including all the experimentation along the way - before it could produce such results.

Firstly, I want to say that DeepSeek’s total training time may actually be comparable to, if not longer than, that of its peers. Why, you might ask? Let’s take another dive into how AI training works.

Technical jargon

If you hate math or anything that sounds remotely technical, feel free to skip this part…

But if not, let’s start diving into the technical jargon.

- Forward Pass – This is the first step in AI model training. During this phase, input data passes through the network, and each layer processes it by applying weights and activations. The output is then compared with the expected result to compute the error.

- Loss Calculation – After the forward pass, a loss function evaluates how far the model's predictions are from the actual values. The smaller the loss, the better the model's performance. DeepSeek, like most large language models, is trained with a cross-entropy loss over next-token predictions (essentially a classification over the vocabulary).

- Backward Pass (Backpropagation) – This is where the model learns. The gradients of the loss function with respect to each weight are computed using automatic differentiation. This allows the model to adjust its parameters accordingly.

- Gradient Descent & Optimization – Optimization algorithms like AdamW or SGD update the model weights based on the computed gradients. These updates are applied iteratively over hundreds of thousands (or millions) of steps to fine-tune the model’s performance.

- Batch Size & Compute Cost – The batch size is the number of training examples processed together in one forward/backward pass. Batching lets the model work through the dataset in chunks rather than one example at a time, which cuts down training time. Larger batches speed things up but increase GPU memory requirements; smaller batches use less memory per step but need more steps and give noisier gradient updates. (The sketch right after this list shows how these pieces fit together.)
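Here is a bare-bones training loop in PyTorch showing the steps above - forward pass, cross-entropy loss, backward pass, and an AdamW update over mini-batches. This is a generic sketch with a toy model and random data, not DeepSeek’s actual training code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: a tiny classifier and random data (not a language model).
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=64, shuffle=True)   # batch size trades memory for speed

for epoch in range(3):
    for inputs, targets in loader:
        logits = model(inputs)             # 1. forward pass
        loss = loss_fn(logits, targets)    # 2. loss calculation (cross-entropy)
        optimizer.zero_grad()
        loss.backward()                    # 3. backward pass (backpropagation)
        optimizer.step()                   # 4. gradient descent / AdamW update
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```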

Now that we understand these steps, let’s get back to why DeepSeek’s training time might be longer than expected.

Compute Considerations

DeepSeek uses a Transformer-based architecture, the same family as OpenAI’s GPT models, whose self-attention cost scales quadratically with sequence length. Given the model’s size and parameter count, the computational burden is immense. Furthermore, MoE models are harder to train because they risk falling into something called expert collapse - where the router simply defaults to the same few experts all the time. In V3’s case, that would effectively turn it into a ~37B parameter model rather than a 671B one.
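To make expert collapse more concrete: one common guard against it in Switch-Transformer-style MoE models is an auxiliary load-balancing loss that penalizes the router for sending every token to the same few experts. The sketch below shows that generic technique - note that the V3 paper describes an auxiliary-loss-free balancing strategy, so this illustrates the problem and a textbook mitigation, not DeepSeek’s actual method.

```python
import torch

def load_balancing_loss(router_logits, top_k=2):
    """Switch-Transformer-style auxiliary loss that discourages routing all
    tokens to the same few experts (a common guard against expert collapse).
    Generic illustration; not DeepSeek-V3's method, which is auxiliary-loss-free."""
    probs = router_logits.softmax(dim=-1)            # (tokens, n_experts)
    n_experts = probs.shape[-1]
    # Fraction of routing slots actually assigned to each expert via top-k...
    chosen = probs.topk(top_k, dim=-1).indices
    token_fraction = torch.zeros(n_experts)
    for e in range(n_experts):
        token_fraction[e] = (chosen == e).float().mean()
    # ...and the average router probability mass per expert.
    prob_fraction = probs.mean(dim=0)
    # The product is minimized when both are uniform across experts.
    return n_experts * (token_fraction * prob_fraction).sum()

logits = torch.randn(128, 8)                 # 128 tokens, 8 experts
print(load_balancing_loss(logits).item())    # ~1.0 when routing is roughly balanced
```

If the router collapses onto one expert, both fractions concentrate on that expert and the loss grows, pushing the model back toward using its full capacity.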

At the end of the day, DeepSeek is a wake-up call, not a killer, for US tech. Instead of panicking and dumping all your tech stocks, a better approach might be to watch how companies adapt, innovate, and leverage AI advancements. After all, competition fuels progress—and the AI revolution is just getting started.

Disclaimer: Investing carries risk. This is not financial advice. The above content should not be regarded as an offer, recommendation, or solicitation to acquire or dispose of any financial products, and any associated discussions, comments, or posts by the author or other users should not be considered as such either. It is for general information purposes only and does not take into account your investment objectives, financial situation, or needs. TTM assumes no responsibility or warranty for the accuracy and completeness of the information; investors should do their own research and may seek professional advice before investing.
