[News Brief] The Resurgence of US Open LLMs
Granite, OLMo, Trinity, and Nemotron Enter the Ring
Four American labs have released major open-weight language models within weeks of each other, mounting a late-year counteroffensive in territory that Chinese developers have dominated since DeepSeek’s R1 shook global markets in January.
A Year Defined by DeepSeek
When DeepSeek released its R1 reasoning model in January 2025, the shockwaves rippled through global markets. NVIDIA’s stock dropped 17% in a single day as investors grappled with the implications: a Chinese startup had matched OpenAI’s frontier o1 model at a fraction of the training cost, then open-sourced the weights under an MIT license. The DeepSeek app briefly dethroned ChatGPT atop the App Store charts, and the R1 distilled models racked up over a million downloads on Hugging Face within days.
The message was clear. Chinese labs, including DeepSeek, Qwen, and others, had seized the initiative in open-weight AI. Through much of 2025, American contributions to the open model ecosystem largely consisted of fine-tuning checkpoints trained overseas. US frontier labs like OpenAI and Anthropic continued shipping proprietary models while Meta’s Llama releases remained the primary domestic counterweight.
That calculus began shifting in the final months of the year.
Four Models, One Architecture Trend
IBM, Arcee AI, Allen AI, and NVIDIA each released open models between October and December 2025. What unites most of them technically is a shared bet on hybrid architectures that combine traditional transformer attention with more efficient alternatives such as Mamba-2 state-space layers, enabling longer context windows and faster inference than pure transformer designs.
IBM Granite 4.0 (October 2025) introduced a hybrid Mamba/transformer design that interleaves Mamba-2 layers with sparse attention blocks in a 9:1 ratio. The practical result: over 70% reduction in memory requirements for long-context workloads compared to conventional transformers. Where a standard LLM might require high-end datacenter GPUs to process a large codebase or extensive documentation, Granite 4.0 can handle equivalent workloads on significantly cheaper hardware.
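As a rough illustration of the 9:1 interleaving described above, the following Python sketch builds a layer schedule. The function name, the 40-layer depth, and the choice to place the attention block at the end of each group are assumptions made for illustration, not details of IBM's implementation.

```python
def hybrid_layer_schedule(num_layers: int, mamba_per_attention: int = 9) -> list[str]:
    """Return a list of layer types with `mamba_per_attention` Mamba-2
    layers for every attention layer (hypothetical placement)."""
    group = mamba_per_attention + 1
    return [
        "attention" if (i + 1) % group == 0 else "mamba2"
        for i in range(num_layers)
    ]


# 40 layers is an assumed depth, chosen only so the groups divide evenly.
schedule = hybrid_layer_schedule(40)
num_mamba = schedule.count("mamba2")
num_attention = schedule.count("attention")
print(num_mamba, num_attention)  # 36 Mamba-2 layers, 4 attention layers
```

Only the attention layers keep a key-value cache that grows with context length, while a Mamba-2 layer carries a fixed-size state. With one layer in ten using attention, far less memory scales with the prompt, which is the intuition behind the long-context savings.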
The flagship Granite 4.0-H-Small (32B total, 9B active parameters) targets enterprise agentic workflows like multi-tool agents and customer support automation. On Stanford HELM’s IFEval benchmark for instruction following, it exceeds all open-weight models except Llama 4 Maverick, a model over 12 times its size. It also achieves competitive scores on the Berkeley Function Calling Leaderboard v3, critical for applications where models must reliably translate instructions into tool calls. Granite became the first open model family to achieve ISO 42001 certification for AI governance, and IBM ships cryptographically signed checkpoints so users can verify model provenance.
Arcee Trinity (December 1, 2025) represents a different kind of bet. Rather than building on existing open checkpoints, Arcee trained its models end-to-end on US infrastructure with a domestically controlled data pipeline. As Arcee’s team put it: “For a while, our strategy looked like everyone else’s. Take a strong open base, post-train it hard, wire it into tools and RAG, and ship. That approach carried us very far... At the same time, a few pressures kept building.”
Trinity Mini (26B parameters, 3B active) and the experimental Trinity Nano (6B parameters, 1B active with 128 experts) use a Mamba-Transformer MoE architecture incorporating DeepSeekMoE-style fine-grained experts and sigmoid routing. Trinity Mini is available via API at $0.045/$0.15 per million tokens for input/output, with a free tier on OpenRouter. The company is currently training Trinity Large, a 420B parameter model with 13B active parameters, on 2,048 Blackwell B300 GPUs for release in January 2026.
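To make "sigmoid routing" concrete, here is a minimal Python sketch of top-k expert selection with sigmoid gate scores. It is an assumption-laden toy, not Arcee's code: the function name, the eight-expert example, and the choice to renormalize the selected scores so they sum to 1 are all illustrative.

```python
import math


def sigmoid_topk_route(logits: list[float], k: int) -> dict[int, float]:
    """Score each expert independently with a sigmoid, keep the top k,
    and normalize the kept scores into mixing weights (assumed scheme)."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}


# Hypothetical router logits for one token over 8 experts.
router_logits = [2.0, -1.0, 0.5, 3.0, -0.5, 0.0, 1.0, -2.0]
weights = sigmoid_topk_route(router_logits, k=2)
print(weights)
```

Unlike a softmax over all experts, each sigmoid score depends only on that expert's own logit, so adding more fine-grained experts does not dilute the scores of the others.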
Allen AI’s OLMo 3 (December 15, 2025) takes the most radically open approach. The research nonprofit released every artifact from the training process: all data, intermediate checkpoints, training recipes, and dependencies. Their goal is explicit reproducibility for the research community.
OLMo 3 ships at 7B and 32B parameter scales, targeting long-context reasoning, function calling, coding, and instruction following. The flagship OLMo 3.1 Think 32B is positioned as the strongest fully-open thinking model available, with the OLMo 3.1 collection adding extended RL training and massive post-training datasets. The Dolci-DPO-Model-Response-Pool contains 71 million completions across multiple models for preference tuning research, a resource that enables other researchers to study and improve alignment techniques without regenerating expensive training data.
NVIDIA Nemotron 3 (December 15, 2025) arrives with the most aggressive efficiency claims. Its hybrid Mamba-Transformer MoE architecture achieves 3.3x higher inference throughput than comparably-sized pure transformers like Qwen3-30B-A3B, with further speedups at longer sequences. The architecture supports context lengths up to 1 million tokens while maintaining competitive accuracy on long-context benchmarks.
Nemotron 3: Redefining Local Inference
NVIDIA’s Nemotron 3 Nano deserves particular attention for what it enables on consumer hardware.
The model packs 31.6B total parameters with only 3.2B active per forward pass. Using 4-bit quantization, it runs on 24GB of VRAM, putting it within reach of a single RTX 4090 or equivalent workstation GPU. Yet its benchmark performance rivals models that require datacenter infrastructure.
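The 24GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes exactly 4 bits per weight and counts 1 GB as 10^9 bytes; real quantization formats store extra scaling data, so actual files are somewhat larger.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 31.6e9  # total parameters, from the article
VRAM_GB = 24           # VRAM budget, from the article

weights_gb = weight_memory_gb(TOTAL_PARAMS, bits_per_weight=4)
headroom_gb = VRAM_GB - weights_gb
print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

All 31.6B parameters must stay resident even though only 3.2B are active per forward pass, because the router may pick any expert for the next token. Whatever VRAM is left after the weights has to hold activations, attention caches, and recurrent state.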
Nemotron 3 Nano is a hybrid Mamba-Transformer model that interleaves Mamba-2 blocks with Mixture-of-Experts layers and uses self-attention in only a small subset of layers.

On reasoning benchmarks, Nemotron 3 Nano scores 89.1% on AIME 2025 (competition mathematics) and 99.2% on IFBench (instruction following). For coding tasks, it achieves 49.0% on SWE-Bench and 38.8% on LiveCodeBench v6. On tool use, measured by τ2-Bench, it scores 71.5%. Perhaps most notably, it maintains 68.2% accuracy on RULER at 1 million token context length, demonstrating that the hybrid architecture preserves long-range retrieval capabilities even at extreme sequence lengths.
These numbers translate to concrete capabilities. A local coding assistant that can reason through complex multi-file changes. A private RAG system that can process large codebases or document collections. Offline document processing for sensitive financial, legal, or medical materials that cannot leave organizational boundaries. Private financial analysis where proprietary data never touches external APIs.
The inference efficiency gains compound in production scenarios. For workloads involving 8K input tokens and 16K output tokens, Nemotron 3 Nano provides throughput 3.3x higher than Qwen3-30B-A3B and 2.2x higher than GPT-OSS-20B on equivalent hardware. For applications serving multiple concurrent users or processing high volumes of requests, this translates directly to lower infrastructure costs or higher capacity on existing hardware.
NVIDIA also introduces LatentMoE, a technique that projects tokens into a compressed latent space before expert routing. This reduces communication overhead while allowing more experts to activate per token, improving accuracy without sacrificing throughput. The larger Super and Ultra variants (releasing in coming months) extend this with multi-token prediction for speculative decoding and NVFP4 precision training on Blackwell GPUs.
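The trade-off behind routing in a compressed space can be shown with simple arithmetic. In the Python sketch below, the hidden size, latent size, top-k values, and 2-byte activations are all hypothetical numbers chosen for illustration. It ignores the cost of the projections themselves and is not a description of NVIDIA's implementation.

```python
def dispatch_bytes(dim: int, top_k: int, bytes_per_value: int = 2) -> int:
    """Bytes sent per token when a vector of size `dim` is dispatched
    to `top_k` experts (toy model of expert-routing traffic)."""
    return dim * top_k * bytes_per_value


HIDDEN_DIM = 4096  # hypothetical model width
LATENT_DIM = 1024  # hypothetical compressed width

baseline = dispatch_bytes(HIDDEN_DIM, top_k=4)
latent_same_k = dispatch_bytes(LATENT_DIM, top_k=4)
latent_more_experts = dispatch_bytes(LATENT_DIM, top_k=16)

print(baseline, latent_same_k, latent_more_experts)
```

Under these assumed sizes, a 4x smaller vector either cuts dispatch traffic by 4x at the same top-k or lets four times as many experts activate per token for the same traffic.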
For developers wanting to run Nemotron 3 locally, Unsloth provides GGUF quantizations compatible with llama.cpp, along with guides for inference and fine-tuning. The model uses a chat template with <think> tokens for reasoning traces, enabling granular control over inference-time compute budgets.
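When working with reasoning traces, it is often useful to separate the trace from the final answer. The sketch below is a generic Python helper, not part of any NVIDIA or Unsloth tooling. It assumes the trace is wrapped in a `<think>` ... `</think>` pair, and the closing tag and sample string are assumptions.

```python
import re

# Assumed format: reasoning enclosed in <think>...</think>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(sample)
print(repr(reasoning), repr(answer))
```

Separating the two makes it easy to log or measure how much of a response was spent on reasoning, which is useful when tuning an inference-time budget.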
Beyond Open Weights
What distinguishes this wave is how much the labs are releasing beyond the model weights themselves.
NVIDIA is shipping over 10 trillion tokens of training data: Nemotron-CC-v2.1 (2.5 trillion English tokens from Common Crawl with synthetic rephrasing), Nemotron-CC-Code-v1 (428 billion code tokens), and specialized synthetic datasets for STEM reasoning. The complete training recipes, SFT data, and RL environments are published through NeMo-Gym and NeMo-RL, enabling researchers to reproduce and extend the post-training pipeline.
Arcee partnered with DatologyAI to curate a 20-trillion-token dataset for Trinity Large, with 10 trillion synthetic tokens generated on clusters that peaked at 2,048 H100 GPUs. The company publishes its full architectural specifications, including the integration of gated attention, Muon optimization, and their approach to context extension.
IBM’s Data Prep Kit, the open-source framework used to prepare Granite training data, ships alongside the models. The Granite Cookbook provides reproducible Python notebooks demonstrating model capabilities, while extensive documentation covers the training methodology.
Allen AI goes furthest on reproducibility as a core mission. Every stage of OLMo 3’s development is documented and released, from raw data through final post-training. The datasets enabling their RL work, including hundreds of thousands of completions used for filtering and preference learning, are published on Hugging Face.
This level of openness matters for the research community. When frontier labs keep training recipes proprietary, progress requires expensive rediscovery. When those artifacts are published, the entire field can iterate more rapidly, identify what works, and build on proven foundations.
Why This Matters
The resurgence of American labs in open-weight AI carries strategic, geopolitical, technical, and economic implications, along with consequences for the open ecosystem itself.
Strategically, the open model landscape has been a weak point in US AI competitiveness. While American companies lead in proprietary frontier models, the tools available to the broader developer ecosystem have increasingly originated from Chinese labs. DeepSeek’s January release demonstrated that open-weight models could match proprietary performance, creating pressure for organizations to adopt Chinese-developed foundations. Having competitive American alternatives changes the calculus for enterprises, governments, and developers evaluating their options.
Geopolitically, the provenance of AI foundations matters for sensitive applications. Enterprise buyers increasingly ask where base models come from, what data trained them, and which licenses govern their use. Arcee explicitly frames Trinity as addressing “jurisdictional safety” concerns, noting that “we fine-tuned a model with unknown data provenance” does not satisfy compliance officers in regulated industries. Models trained end-to-end on US infrastructure with documented data pipelines offer legal certainty that foreign black-box models cannot.
Technically, the convergence on hybrid Mamba-Transformer architectures signals a potential inflection point. Pure transformer attention scales quadratically with sequence length, creating fundamental constraints on context windows and inference costs. The hybrid approach, now validated across multiple independent efforts, offers linear scaling for most computation while preserving attention’s strengths for tasks requiring all-to-all information routing. If this architecture proves broadly superior, the current releases establish American labs at the frontier of the transition.
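The difference between quadratic and linear scaling is easiest to see with numbers. The Python sketch below uses unitless toy cost functions. The state size of 128 is a hypothetical constant, and real costs also depend on hidden width, head count, and kernel efficiency.

```python
def attention_cost(seq_len: int) -> int:
    """Pairwise token interactions in full self-attention: grows as n^2."""
    return seq_len * seq_len


def linear_cost(seq_len: int, state_size: int = 128) -> int:
    """Work for a fixed-size recurrent state updated once per token:
    grows as n (state_size is an assumed constant)."""
    return seq_len * state_size


for n in (8_000, 64_000, 1_000_000):
    ratio = attention_cost(n) / linear_cost(n)
    print(f"{n:>9} tokens: attention/linear cost ratio = {ratio:,.1f}")
```

Doubling the sequence length quadruples the attention term but only doubles the linear term, so the gap widens as context grows. That is why hybrids keep attention in only a few layers.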
Economically, local inference capability reshapes the cost structure of AI deployment. API-based models charge per token, creating costs that scale with usage and making certain high-volume or latency-sensitive applications uneconomical. Models like Nemotron 3 Nano that run on consumer hardware enable fixed-cost deployment where marginal inference is essentially free. For startups, researchers, and organizations with limited budgets, this dramatically expands what becomes feasible to build.
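A quick break-even calculation shows how per-token pricing compares with a fixed hardware purchase. The Python sketch below uses the Trinity Mini API prices and the 8K-input, 16K-output request shape quoted earlier. The $2,000 hardware cost is a hypothetical figure, and electricity, maintenance, and differences in model quality are ignored.

```python
import math


def api_cost_usd(
    input_tokens: int,
    output_tokens: int,
    in_price_per_m: float = 0.045,  # USD per million input tokens
    out_price_per_m: float = 0.15,  # USD per million output tokens
) -> float:
    """Cost of one API request at per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m


HARDWARE_COST_USD = 2_000  # hypothetical one-time GPU purchase

per_request = api_cost_usd(8_000, 16_000)
break_even_requests = math.ceil(HARDWARE_COST_USD / per_request)
print(f"${per_request:.5f} per request, break-even after {break_even_requests:,} requests")
```

Where the break-even point falls depends heavily on request volume and hardware price, so the fixed-cost argument is strongest for sustained, high-volume workloads or for data that cannot be sent to an external API at all.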
For the open ecosystem, the commitment to releasing training data, recipes, and intermediate checkpoints creates compounding returns. Each public artifact becomes a foundation others can build upon. NVIDIA’s 10+ trillion tokens of curated training data, Allen AI’s 71 million preference completions, and Arcee’s architectural innovations all become shared infrastructure for the field. This collaborative dynamic, where improvements are shared rather than hoarded, accelerates progress for everyone.
Looking Forward
Chinese labs continue to iterate rapidly. DeepSeek upgraded R1 in May 2025, and Qwen remains a strong baseline for many applications. The competitive dynamics remain fluid.
What has changed is the presence of multiple American entrants training models end-to-end with genuine architectural innovation, rather than primarily fine-tuning Chinese checkpoints. The hybrid Mamba-Transformer approach, pioneered by NVIDIA’s earlier Nemotron-H work and now adopted across Granite and Trinity, addresses fundamental efficiency limitations that constrained previous generations.
Whether this represents a sustainable shift depends partly on continued investment. Pretraining frontier models from scratch requires substantial resources. Arcee is spending heavily on its B300 cluster for Trinity Large. NVIDIA can amortize model development against its hardware business. IBM integrates Granite into its enterprise AI platform. Allen AI operates as a nonprofit with philanthropic backing.
For developers and enterprises evaluating their options today, the practical reality is this: the most capable models that can run locally on consumer hardware are now coming from American labs, with full documentation, permissive licenses, and training artifacts that enable customization and extension. That represents a meaningful change from where the field stood twelve months ago.



