Why America Is Losing the Open-Source AI Race?

I’ve been investing in open source companies for nearly my entire venture career. I love open source businesses, think they’re generally great for the ecosystem, and can also create a lot of commercial value (but this can be tricky!). We’ve seen all kinds of open source businesses become successful.

Databases ( $MongoDB Inc.(MDB)$ , Clickhouse, etc), Data Infrastructure (Databricks, $Confluent, Inc.(CFLT)$ , etc) Developer Tools ( $HashiCorp, Inc.(HCP)$ , $GitLab, Inc.(GTLB)$ , etc). And many other categories.

The nuance lies in how you define “open source.” A lot of this comes down to what open source license the open source project uses. At a high level there are three main buckets.

First is fully open and permissive (MIT, Apache 2.0, etc) - anyone can use, modify, and distribute the code however they want, including building commercial products on top of it.

Second is copyleft (GPL, AGPL, etc) - the code is open, but if you distribute or build on it you have to open source your own code too.

Third is source available - the code is publicly readable but the license restricts commercial use in specific ways. This is where you see licenses like BSL (Business Source License) and SSPL, which HashiCorp and MongoDB have both moved to at various points.

The main point with source available licenses is that you can basically do everything you can in the permissive license, EXCEPT monetize the project (ie you can’t take the project, create a hosted version, and charge for it). The most “pure” opensource proponents would say only the fully open and permissive licenses are truly open source. Everything else isn’t.

The playbook for many of these open source companies was / is to first become the default in their field. This is the most important part. Spark became the default in large scale data processing, Kafka became the default for real time data streaming and event ingestion. Mongo for NoSQL databases. Etc. Each project became widespread and ubiquitous in their respective fields - they were / are the dominant platform for their use case. Then the commercial companies behind each project figured out a way to monetize.

Sometimes this was through hosted offerings - where the company runs and manages the infrastructure for you, so you don't have to worry about deploying, scaling, or maintaining the project yourself. In other instances, proprietary features were gated behind a commercial offering (this could be better performance, enterprise features, etc). Regardless - there were playbooks to monetize, and successful examples of large commercial businesses being built on top of successful open source projects.

But this wasn’t always the case. For a long time, the conventional wisdom was that open source couldn't be a real business. If the code is free, who pays? And the skepticism was genuine - why would a company pay for something their engineers could just download and run themselves?

Early open source companies had to fight hard to prove out the commercial model, and a lot of investors passed on these deals early on because the business logic wasn't obvious. "You're just going to give it away?" was a common reaction. And honestly, it wasn't an unreasonable question. The answer turned out to be: yes, you give the software away, but you charge for everything around it - the hosted version, the enterprise features, the support, the integrations, the proprietary engines, etc. The open source project served as distribution, and the commercial product was something different.

And after a wave of successful open source commercial companies emerged, it became more obvious what was possible (commercially). And this was very important - investors (like myself) deeply believed successful open source projects could translate into successful commercial companies. This created an important incentive set - an incentive to fund open source companies (by investors) and an incentive to start commercial open source companies (by founders)

Let me tie this back to what’s happening in open source AI. Today, one of the biggest disappointments is a lack of a strong open source model ecosystem in the US. All of the frontier models are closed source from OpenAI / Anthropic. The leading opensource models are largely all from Chinese labs. DeepSeek being the most prominent example, but also Qwen from Alibaba, Kimi from Moonshot AI, and others. Jensen Huang has talked about the importance of having American leaders at every layer of the AI stack - chips, models, infrastructure, applications.

The US is clearly winning at chips ( $NVIDIA(NVDA)$ ) and has strong incumbents at the application layer, but the open source model layer is a real gap. And that gap matters more than people realize. Open source models are how developers learn, experiment, and build. They're the distribution layer for the next generation of AI applications. If the dominant open source models all come from Chinese labs, that has long term implications for where developer communities form, where tooling gets built around, and ultimately where commercial ecosystems emerge.

So why hasn’t there been more of a blossoming US open source model ecosystem? I think it all comes down to incentives. In the same way investors had questions about the business model of open source companies many years ago, they have similar questions about open source labs today. How do you monetize an open source model? You could “host” it - which in the case of models probably means just means hosting an API endpoint people can hit for inference.

BUT if you’re doing this you’re also going to directly compete with anyone else who can pick up your model and host an inference API endpoint. Baseten, Fireworks, all the hyperscalers, etc. And the question then becomes do you as the model provider have an advantage that lets you serve the model either more performantly or more cost efficiently (or both). The answer to this question is probably “maybe” but not by enough to win with high margins over the long term.

The next natural place is to move into fine tuning / eval feedback loops. Most folks want to take an open source model, fine tune it for their use case / data, observe / monitor / improve that fine tuned model (and rinse repeat). So the open source labs could move into this line of business and charge for it. But then you’re still going to compete with the Fireworks / Baseten / Hyperscalers who will also offer this.

And then there’s the question of what does it mean to be an open source model? There's actually a spectrum here too - and it maps pretty cleanly to the licensing buckets we talked about earlier. The MIT / Apache equivalent in the model world would be truly open weights with no commercial restrictions - download it, fine tune it, host it, build a business on top of it, do whatever you want. The source available equivalent would be something like Llama - the weights are publicly available, you can experiment and build, but there are commercial restrictions that kick in at scale.

And then true open source, the equivalent of releasing the full source code, would mean releasing not just the weights but the training data, the training pipeline, the evals, everything needed to fully replicate the model from scratch. Almost nobody does this. So when we talk about the "open source model ecosystem," we're really mostly talking about open weights models with varying degrees of commercial restrictions. Which matters, because just like source available licenses changed the commercial dynamics for traditional open source companies, the restrictions baked into open weights models will shape where developer communities form, what gets built on top, and ultimately who captures the commercial value.

I think the main issue facing the open source model ecosystem is incentives. People (ie investors) just question the business feasibility of open source models. Today, most of the world is using the frontier models for everything. However, what’s really come into focus very recently is costs. It just doesn’t make sense to drive the Ferrari down the street to the grocery store…The Honda will do just fine. You don’t need the “expensive frontier” tokens for everything. I don’t know where the line is, but let’s just use the 80/20 rule. Maybe 80% of the use cases are generic enough where a smaller, less powerful, but cost efficient model will do fine. And for the 20% of use cases you’ll really need the frontier expensive tokens (i’m completely making up that ratio, just using an illustrative example).

So then - why shouldn’t open source dominate in this next part of the cycle?? When people shift usage from most expensive tokens to cheaper tokens, wouldn’t open source be the natural alternative? Yes - but where are the models?? The other baked in assumption is that the open source models are “good enough.” Said another way, the gap between the frontier closed source models and open source isn’t “that wide” (and more importantly is the trajectory widening or closing).

Right now all the best open source models Chinese models… And these models have largely gotten to where they are by distilling frontier models. But I think that was much easier when the models were more “basic” language models. Question / answer type models where you could create a data set by asking millions of questions to the frontier model. I think distillation is getting harder. The most frontier models are now being released with increasingly complex harnesses. Or the frontier models aren’t released via an API but are wrapped into a product themselves. Or look at Claude - playing around with Claude Code, I have workstreams that kick off 50+ agents in parallel. How do you distill a model with 50+ agents running in parallel?? It’s much harder than just getting millions of question / answer pairs… All of this makes it much harder to “distill.” And the cost to pre-train (without the head start of distilling) is massive. All of this to say - I find myself today probably on the side of the gap widening (not shrinking) between the frontier and open source.

So who will be able to raise massive amounts of money, to pre-train a model, that will maybe work, but might end up being too far behind the frontier to capture real market share, and even if you get all of the previous points right there are still questions about how you monetize over the long haul (the first half of this post). Everyone WANTS this to happen. But who’s going to fund it?

I go back to the early days of Neoclouds where Nvidia really propped up the ecosystem. They funded (massively) a class of companies who might not have gotten the funding elsewhere. Do we need King Jensen again?? Or do we need governments to fund it out of a sovereign interest?? I’m not sure. But I really hope the US open source ecosystem finds a way to thrive.

What I think it will take - some sort of research breakthrough that open source gets first. Let’s hope that happens! I also think (no inside info) at some point the closed source frontier labs will get back into open sourcing models. Maybe not the frontier. Maybe a slimmed down version of the frontier, or maybe open sourcing one model release prior to the frontier. This is the most hopeful scenario!


😍 Been eyeing Tiger merch but short on Tiger Coins? Now's your chance.

🎁 We’ve selected 4 high-demand items across practial, lifestyle, and learning, now with a lower redemption threshold!

Hot Merch Returns · Up to 43% Off

Disclaimer: Investing carries risk. This is not financial advice. The above content should not be regarded as an offer, recommendation, or solicitation on acquiring or disposing of any financial products, any associated discussions, comments, or posts by author or other users should not be considered as such either. It is solely for general information purpose only, which does not consider your own investment objectives, financial situations or needs. TTM assumes no responsibility or warranty for the accuracy and completeness of the information, investors should do their own research and may seek professional advice before investing.

Report

Comment

  • Top
  • Latest
empty
No comments yet