NVIDIA Nemotron: Open Models as an Infrastructure Attach Strategy
Bottom Line
Nemotron should not be framed as a clean attempt to beat OpenAI or Anthropic head-on in hosted frontier consumer mindshare. It is better viewed as an open-agent reference stack designed to expand the addressable market for NVIDIA hardware and software while giving enterprises a credible alternative to closed-model dependence. The likely strategic bet is that as agentic AI moves from demos to production, 3 pain points become binding: reasoning cost, long-context memory efficiency, and governance over data and deployment. Nemotron is engineered precisely around those pain points. If NVIDIA can make the open model plus open data plus optimized runtime combination the default enterprise pathway on its GPUs, Nemotron can matter strategically even without undisputed raw-model leadership. The March 2026 Nemotron Coalition and upcoming Nemotron 4 roadmap indicate that NVIDIA intends to deepen this moat by combining its infrastructure position with external model-building talent and shared data flywheels. The most accurate summary is that Nemotron is a compute-aware open model family optimized for enterprise agents on NVIDIA infrastructure, not a pure benchmark-maximization exercise.
1. Executive Overview
NVIDIA Nemotron is best understood not as a single model but as a vertically integrated open-model program spanning reasoning LLMs, vision-language models, retrieval models, safety models, speech models, open datasets, training recipes, deployment microservices, and GPU-optimized runtimes. The strategic objective is 2-fold: satisfy enterprise demand for open, auditable, self-hostable agentic AI, and pull that demand through NVIDIA’s wider stack, including NeMo, TensorRT-LLM, NIM, Blueprints, RTX PRO, DGX Spark, and data-center GPUs.
Relative to Anthropic and OpenAI, Nemotron competes less as a pure managed frontier-model subscription and more as an infrastructure-native alternative for organizations that want model control, data residency, and direct optimization on NVIDIA hardware. Relative to DeepSeek, Qwen, and Kimi, Nemotron’s main differentiator is not universal benchmark leadership but unusually tight coupling between model architecture, open training assets, and NVIDIA-serving infrastructure.
The program’s vertical integration is what distinguishes it from conventional model releases. NVIDIA does not merely publish weights; it publishes pretraining data, post-training data, RL environments, evaluation harnesses, reproducibility material, NIM microservice packaging, TensorRT-LLM serving guides, NeMo Gym environments, and Blueprints for downstream agent construction. That full-stack approach is designed to create switching costs that accrete at the infrastructure layer rather than at the model layer alone. Each enterprise deployment that standardizes on NIM, TensorRT-LLM, and NVIDIA GPU configurations becomes incrementally more difficult to move to a competing hardware or software stack.
The commercial logic is straightforward: open models lower the model-acquisition barrier for enterprises that might otherwise evaluate closed APIs, and then the reference architecture — from GPU procurement through NIM microservices to Blueprints — steers that demand toward NVIDIA’s hardware and enterprise software revenue lines. The Nemotron Coalition announced in March 2026, through which external partners contribute to base model development in exchange for shared data flywheel access, is an extension of this logic: NVIDIA is socializing model-development costs while retaining infrastructure lock-in.
2. What Nemotron Is and Why NVIDIA Built It
NVIDIA’s official positioning is that Nemotron is a family of open models, datasets, and technologies for reasoning, coding, visual understanding, agentic tasks, safety, speech, and information retrieval, deployable from edge to cloud. The family is broader than a conventional LLM line and covers the following capability domains:
- Reasoning LLMs: Multi-tier reasoning models spanning Nano, Super, and Ultra scales, with hybrid think and non-think modes, long-context capability, and runtime thinking-budget control.
- Vision-Language Models (VLMs): Document intelligence and multi-image or video understanding, exemplified by the Nemotron Nano 12B v2 VL series.
- Retrieval components: RAG-oriented embedding, extraction, and reranking models designed for enterprise information retrieval pipelines.
- Safety models: Specialized components for jailbreak detection, content moderation, PII identification, and enterprise policy enforcement.
- Speech models: End-to-end speech pipelines covering ASR, TTS, speech-to-speech translation, and full-duplex voice interaction, represented by the Nemotron 3 VoiceChat family.
- Open datasets and training recipes: More than 10T language tokens and 18M supervised fine-tuning samples across pretraining, post-training, personas, safety, RL, and RAG datasets, along with NeMo Gym RL environments and end-to-end reproducibility material.
This breadth matters because NVIDIA is attempting to supply most of the model components required to build enterprise agents, not only the central planner model. An enterprise agent stack requires a reasoning core, a retrieval layer, safety guardrails, vision for document understanding, and optionally a voice interface. Nemotron provides off-the-shelf components for all of these, tightly integrated with NVIDIA’s inference infrastructure.
The Commercial Strategy: Open Models as Infrastructure Pull-Through
Commercially, the strategy implies a larger goal than model distribution alone. By publishing open models and then steering deployment toward NIM microservices, TensorRT-LLM, NeMo, Blueprints, and GPU platforms ranging from RTX PRO and DGX Spark to H100, H200, and B200, NVIDIA creates a reference architecture that stimulates demand for both infrastructure hardware and enterprise software. The commercial logic flows as follows:
- Open model weights attract developer attention and lower evaluation barriers for enterprises considering alternatives to closed APIs.
- NVIDIA’s serving guides, NIM packaging, and architecture-specific optimizations steer deployment toward NVIDIA GPU configurations.
- TensorRT-LLM and NeMo create a software layer that is easiest to use on NVIDIA hardware, reinforcing the hardware attach.
- Blueprints for multimodal RAG, deep research agents, and specialized automation give enterprises pre-built agent templates that presuppose NVIDIA infrastructure.
- GPU demand for training, fine-tuning, and inference scales with enterprise agent deployment, driving incremental H100, H200, and B200 purchases.
Nemotron therefore functions as a full-stack demand-generation vehicle as much as a research artifact. NVIDIA’s official rationale emphasizes open technologies as essential for trusted, enterprise-ready AI, with repeated emphasis on transparent training data, open weights, and broad platform support. That narrative is consistent with the commercial logic: enterprises that trust the stack’s auditability are more likely to commit to infrastructure investments.
Evolution of the Nemotron Brand
The Nemotron brand has evolved through 3 distinct technical lineages, creating material naming complexity relevant to diligence. The 2024 Nemotron-4 340B family was a 340B-scale base, instruct, and reward suite intended heavily for synthetic data generation and reward modeling, with the reward model reaching 92.2 on RewardBench and the deployment target fitting 8 H100 GPUs in FP8. The 2025 Llama Nemotron line then shifted to post-trained derivatives of Meta Llama, compressed with Neural Architecture Search to reduce memory footprint and GPU count. The 2025–2026 Nano v2 and Nemotron 3 lines moved further toward NVIDIA-native hybrid Mamba and MoE architectures. In March 2026, NVIDIA announced the Nemotron Coalition, whose first coalition-built base model will underpin an upcoming Nemotron 4 family. For diligence purposes, the brand should be treated as a program with multiple generations and architectures, not a single coherent architecture lineage.
The 2024 Nemotron-4 340B release established the program’s DNA. NVIDIA positioned it primarily as a synthetic-data factory, disclosed that more than 98% of alignment data was synthetically generated, and open-sourced the synthetic-data generation pipeline. That historical starting point explains a consistent theme: NVIDIA has focused on the data flywheel, reward modeling, and the downstream model-building pipeline, not only on end-user chat performance.
3. Current Model Stack
The current Nemotron model family spans several tiers, architectures, and deployment targets. The following table provides a consolidated reference across all major models. Active-parameter counts are shown separately from total parameters because the sparse MoE architecture means per-token compute scales with active rather than total parameters. This is the single most important economic fact in the Nemotron stack: Nemotron 3 Super activates only 12B of its 120B total parameters per token, meaning inference cost, memory bandwidth consumption, and GPU utilization scale with a ~12B footprint — not 120B. Similarly, Nemotron 3 Nano activates 3.5B of 30B. In practical terms, an operator running Nemotron 3 Super pays the compute cost of a ~12B dense model while accessing the representational capacity of a 120B-scale system. That is the core “thinking tax” reduction NVIDIA is selling: more reasoning per unit of GPU budget, not just better answers on a benchmark.
| Model | Total Params | Active Params | Architecture | Context | Key Benchmark | Hardware Req. | Role |
|---|---|---|---|---|---|---|---|
| Nemotron 3 Super 120B A12B | 120B | 12B | LatentMoE, Mamba-2, MoE, selective attention, MTP, NVFP4 | Up to 1M tokens (default 262K) | 90.21% AIME25; 81.19% LiveCodeBench; 91.75% RULER-100 1M | 8×H100-80GB min (1M context); 2×B200 or 8×B200 (TRT-LLM profiles) | Flagship enterprise reasoning, long-context agents, collaborative multi-agent, IT automation |
| Nemotron 3 Nano 30B A3B | 30B | 3.5B | Hybrid Mamba-Transformer MoE; 23 Mamba-2 + MoE layers, 6 attention layers; 128 experts + 1 shared; 6 active experts/token | 1M tokens | 89.1% AIME25; 68.3% LiveCodeBench; 86.3% RULER-100 1M; 50.0% MiniF2F pass@1 | NVIDIA GPUs (Hopper+); serving details follow vLLM/TRT-LLM configs | Open-agent primary reasoning; code; long-context efficiency; edge-to-datacenter coverage |
| Nemotron Nano 9B v2 | 9B | ~9B (dense-equivalent; mostly Mamba-2 + MLP, 4 attention layers) | Hybrid Mamba-2 + MLP; 4 attention layers; unified reasoning + non-reasoning with runtime budget control | 128K tokens | 72.1% AIME25; 97.8% MATH500; 64.0% GPQA; 71.1% LCB; 78.9% RULER 128K | A10G, A100, H100-80GB, Jetson AGX Thor | Practical agent sweet spot; customer support automation; edge devices; fast/deep reasoning toggle |
| Nemotron 3 Nano 4B | 4B | ~4B (compressed from Nano 9B v2 via Nemotron Elastic) | Mostly Mamba-2 + MLP; 4 attention layers | 262K tokens | Not separately published at note date; inherits Nano 9B v2 architecture strengths at smaller footprint | A10G, A100, H100-80GB, GeForce RTX, Jetson Thor, DGX Spark | Edge and local deployment; RTX PRO and DGX Spark target; lightweight agent component |
| Nemotron Nano 12B v2 VL | 12B | ~12B | Multimodal; built on Nano v2 backbone with vision encoder | Not separately specified at note date | Document intelligence; multi-image and video understanding | NVIDIA GPU (Hopper class recommended) | Visual document understanding; multi-image reasoning; enterprise document intelligence |
| Nemotron 3 VoiceChat (12B) | 12B | ~12B | End-to-end full-duplex speech-to-speech; built on Nano v2 backbone; no conventional ASR-LLM-TTS pipeline | Context set by backbone | Full-duplex real-time voice interaction with retrieval and safety integration | NVIDIA GPU (H100 recommended for production); available via build.nvidia.com API | Voice agent frontend; customer-facing voice AI; replaces multi-model ASR + TTS stacks |
| Llama Nemotron Super 49B | 49B | ~49B (dense, post-NAS compressed) | Llama-3.3-70B-Instruct derivative; NAS-compressed and heavily post-trained; reasoning by default with /no_think disable | 128K tokens | Published as competitive with 70B-class peers on reasoning, RAG, and tool calling; vLLM TP8, GPU util 0.95 | 1×H200 at high workloads | Earlier-generation enterprise reasoning; RAG; tool calling; demonstrated single-H200 footprint |
| Llama Nemotron Ultra 253B | 253B | ~253B (dense, post-NAS compressed) | Llama-3.1-405B-Instruct derivative; NAS-compressed; 128K context; tuned for reasoning, RAG, tool calling | 128K tokens | Positioned as Ultra-class reasoning; fits on 1×8×H100 node | 1×8×H100 node | Earlier-generation Ultra-scale enterprise reasoning; available via build.nvidia.com and Hugging Face |
Model-by-Model Detail
Nemotron 3 Super 120B A12B is the flagship of the current NVIDIA-native family. The LatentMoE architecture combines Mamba-2, MoE, selective attention, and Multi-Token Prediction, with NVFP4 pretraining to maximize efficiency on Blackwell-class hardware. The official model card targets collaborative agents, long-context reasoning, tool use, RAG, and high-volume workloads such as IT ticket automation. Supported languages include English, French, German, Italian, Japanese, Spanish, and Chinese. The 1M-token context is real but requires explicit enablement: NVIDIA’s own vLLM example defaults to 262,144 tokens. The 8×H100-80GB minimum for 1M context is an important deployment nuance that long-context marketing often obscures.
Nemotron 3 Nano 30B A3B is the clearest expression of NVIDIA’s current open-agent thesis. With 3.5B active parameters out of 30B total, the model computes at a cost closer to a 3–4B dense model while leveraging the capacity of a 30B parameter bank for routing and specialization. The 50.0% MiniF2F pass@1 score versus 5.7% for Qwen3-30B-A3B-Thinking-2507 and 12.1% for GPT-OSS-20B is notable as a signal of mathematical reasoning depth. NVIDIA publishes pretraining, post-training, and RL datasets for this model, and the developer repository includes the end-to-end recipe and NeMo Gym RL environments.
Nemotron Nano 9B v2 is a different and strategically important signal because it is trained from scratch by NVIDIA rather than distilled from a larger Llama base. The runtime thinking-budget control — allowing developers to dial between fast response and extended reasoning at serving time — is explicitly framed for customer support, autonomous agent steps, and edge devices where latency varies by use case. Hardware breadth extending to Jetson AGX Thor positions this as the footprint for embedded and edge agentic inference.
Nemotron 3 Nano 4B extends the Nano family to smaller footprints via the Nemotron Elastic compression framework applied to Nano 9B v2. The 262K context window and GeForce RTX support make it the primary target for RTX PRO laptops and DGX Spark local deployments. The vLLM 0.15.1 minimum version requirement and Mamba SSM cache float32 setting are deployment requirements that distinguish it from generic small-model serving.
Nemotron Nano 12B v2 VL targets document intelligence and multi-image or video understanding. Positioned within the broader multimodal RAG Blueprints, this model handles the visual intake component of enterprise document workflows, feeding extracted information to the reasoning and retrieval pipeline.
Nemotron 3 VoiceChat is architecturally unusual in that it is an end-to-end full-duplex speech-to-speech system built on the Nano v2 backbone rather than a conventional three-stage ASR-to-LLM-to-TTS pipeline. This reduces latency and integration complexity for voice agent deployments and is available as a free endpoint on build.nvidia.com, signaling NVIDIA’s intent to demonstrate the full stack rather than only the text reasoning tier.
Llama Nemotron Super 49B and Ultra 253B represent the prior generation’s enterprise approach: taking Meta Llama as a base and using Neural Architecture Search plus heavy post-training to compress the model, reduce memory footprint, and improve inference efficiency. Super 49B fits on a single H200 at high workloads; Ultra 253B fits on a single 8×H100 node. These models retain 128K context, strong reasoning and tool-calling behavior, and remain available via build.nvidia.com and Hugging Face, but the architectural roadmap has clearly moved toward NVIDIA-native hybrid Mamba-MoE designs.
4. Architectural Differentiation
Nemotron is not a single architecture. The family uses at least 3 distinct design approaches across its generations, and understanding the differences matters for evaluating deployment economics and competitive positioning.
The Hybrid Mamba-Transformer MoE Design
The common design logic across the newer NVIDIA-native models (Nano v2, Nano 30B, Super 120B) is consistent: use Mamba or state-space sequence model (SSM) layers to make very long context more practical at lower memory bandwidth cost, use sparse MoE activation to lower per-token compute relative to total parameter count, retain some attention layers for exact retrieval and reasoning fidelity, and add Multi-Token Prediction to accelerate generation through native speculative decoding.
- Mamba-2 (SSM layers): State-space layers enable sub-quadratic scaling with sequence length, making 1M-token contexts more tractable without the quadratic memory cost of full attention. Mamba-2 specifically improves on the original Mamba design with better parallelism and training stability.
- Sparse Mixture-of-Experts (MoE): Only a subset of experts (e.g., 6 out of 128 in the Nano 30B design) activates per token. This allows the total parameter count to remain large for routing and specialization while keeping per-token FLOPs and memory bandwidth at the active-parameter level. Nemotron 3 Nano uses 128 standard experts plus 1 shared expert, 6 active experts per token, and 23 Mamba-2 + MoE layers alongside 6 attention layers.
- Selective attention layers: Retaining a subset of full-attention layers (4–6 out of the total layer count) preserves the exact retrieval capability and positional fidelity that pure SSM architectures can sacrifice. This hybrid design is a deliberate tradeoff: SSM for sequence efficiency, attention for retrieval accuracy.
- Multi-Token Prediction (MTP): Baked into the Nemotron 3 Super checkpoint, MTP enables native speculative decoding without a separate draft model. This means generation throughput improvements from speculative decoding can be accessed without the engineering overhead of a two-model serving stack, a notable operational simplification versus adding an external draft model.
- LatentMoE (Nemotron 3 Super): Routes expert computation in a compressed latent space, cutting all-to-all communication traffic by approximately 4× versus standard MoE. This makes expert parallelism preferable to pure tensor parallelism for large-scale multi-GPU serving, directly affecting how practitioners configure distributed inference for the flagship model.
Active-Parameter Economics
The active-parameter structure is economically important and frequently misunderstood in marketing comparisons. Nemotron 3 Super is 120B total but 12B active. Nemotron 3 Nano is 30B total but 3.5B active. In deployment terms, these models should be compared less to equivalently sized dense models and more to other sparse MoE systems with similar active-parameter footprints.
The practical implication is that a 12B-active-parameter MoE serving at high throughput has per-token compute economics closer to a 12B dense model than to a 120B dense model, while retaining the routing and specialization benefits of a much larger parameter bank. NVIDIA repeatedly frames the family around “thinking tax” reduction: the commercial appeal is not simply better answers, but more reasoning per unit of GPU budget. That framing aligns with the active-parameter economics.
Open Datasets and Recipes as a Competitive Differentiator
One of Nemotron’s underappreciated differences versus most open-weight competitors is the breadth of accompanying assets beyond weights alone. NVIDIA states that Nemotron includes open weights, training data, and recipes, and the developer portal cites more than 10T language tokens and 18M supervised fine-tuning samples across pretraining, post-training, personas, safety, RL, and RAG datasets. For Nemotron 3 Super and Nano 30B, NVIDIA also publishes evaluation harnesses, NeMo Gym RL environments, a developer repository for the end-to-end recipe, and reproducibility material.
This matters because openness at the weight level is no longer a meaningful differentiator — many models from DeepSeek, Qwen, Meta, and others publish full weights. Openness at the data, RL-environment, and end-to-end recipe level is still much less common. For enterprises that want to fine-tune, evaluate, or extend a model on proprietary data with full reproducibility, access to the pretraining and RL recipe provides a level of control that weight-only releases cannot match. It also positions NVIDIA as a credible partner for regulated industries where model provenance and data lineage audit trails are mandatory.
Architectural Contrast: NAS-Compressed Llama vs. NVIDIA-Native Hybrid
The Llama Nemotron models represent a different approach: start from a state-of-the-art Meta Llama base, apply Neural Architecture Search to identify which layers and components can be pruned or reconfigured without proportionate quality loss, then apply heavy post-training and RL to recover and exceed the starting capability. This yields smaller memory footprints and GPU counts without requiring a ground-up training run on proprietary architecture. The NVIDIA-native Nano and Nemotron 3 line, by contrast, are trained from scratch on NVIDIA-designed hybrid architectures, allowing the company to optimize the architecture-runtime co-design from the ground up. The generation shift from NAS-compressed Llama to NVIDIA-native hybrid reflects a deliberate decision to prioritize architectural differentiation over derivative efficiency.
5. Competitive Landscape
Nemotron competes across two distinct tiers: the closed frontier (GPT-5.4, Claude Opus 4.6) and the open-weight ecosystem (DeepSeek, Qwen, Kimi). The competitive dynamics are different in each tier, and the strategic logic of Nemotron’s positioning shifts accordingly.
| Competitor | Type | Key Strength vs. Nemotron | Key Weakness vs. Nemotron |
|---|---|---|---|
| GPT-5.4 (OpenAI) | Closed managed API | Turnkey frontier platform; mature vendor-managed tooling; native functions, web search, file search, computer-use; out-of-the-box agent loops; 1M-token context with 128K max output; platform deployment maturity | No on-prem or sovereign deployment; no weight access; no data residency control; no training asset auditability; no direct GPU-layer optimization; vendor dependency; higher total cost for GPU-heavy enterprise at scale |
| Claude Opus 4.6 / Sonnet 4.6 (Anthropic) | Closed managed API | Frontier coding, agents, and computer use; 1M context in beta; extended step-by-step thinking controls; safety reputation; strong enterprise-ready tooling and evaluation frameworks | Same sovereign/on-prem limitations as GPT-5.4; no weight or data access; GPU-agnostic (no hardware attach benefit); vendor lock-in at API layer |
| DeepSeek-V3 / R1 (DeepSeek) | Open-weight (MIT license) | Open-model mindshare; MIT license (permissive and simple); 671B total / 37B active MoE; trained on 14.8T tokens; hybrid think and non-think via API endpoints; large-scale reasoning ambition; V3.2 integrates thinking inside tool use | Less comprehensive enterprise packaging; fewer accompanying datasets and recipes; less explicit edge-to-cloud deployment ladder; less architecture-runtime co-design documentation for NVIDIA GPUs; no integrated safety / speech / retrieval catalog |
| Qwen3 / Qwen3.5-122B (Alibaba) | Open-weight (Apache 2.0) | Broad multilingual coverage (119 languages and dialects); clean Apache 2.0 license; hybrid thinking and non-thinking modes; 122B total / 10B active MoE; 262K native context extendable to ~1.01M; large community momentum; stronger on MMLU-Pro, GPQA w/o tools, TauBench, multilingual benchmarks | Less integrated with NVIDIA serving infrastructure; fewer open training datasets and RL recipes; no integrated safety, speech, and retrieval catalog at Nemotron’s breadth; broader community means less enterprise-specific packaging |
| Kimi K2 / K2.5 (Moonshot AI) | Open-weight (recent releases) | Aggressive long-horizon agentic design; 1T total / 32B active MoE; stable behavior across 200–300 sequential tool calls; native INT4 quantization; 256K context; K2.5 adds native multimodality and agent-swarm execution; most agentic-forward frontier open model | Narrower enterprise packaging; less comprehensive hardware optimization documentation for NVIDIA GPUs; less breadth across safety, speech, and retrieval catalogs; newer community relative to Qwen or DeepSeek |
Versus OpenAI and Anthropic
As of March 2026, GPT-5.4 is OpenAI’s flagship for complex reasoning and coding, with a 1M-token context window, 128K max output, and native functions, web search, file search, and computer-use tools. Anthropic positions Opus 4.6 and Sonnet 4.6 as frontier models for coding, agents, computer use, search, and high-stakes enterprise work, with 1M context available in beta and explicit controls for extended step-by-step thinking.
The central difference is not merely model quality. OpenAI and Anthropic sell turnkey frontier platforms with mature vendor-managed tooling. Nemotron sells control. On absolute capability, no supported claim of broad Nemotron superiority over GPT-5.4 or Claude Opus 4.6 can be made from the cited public materials. NVIDIA’s detailed public comparisons are mostly against open peers such as Qwen and GPT-OSS, not against the current closed frontier leaders. Context length is no longer a decisive differentiator because Nemotron 3 Super, GPT-5.4, and Claude 4.6 all operate in the 1M-token class.
The practical comparison is a tradeoff. Closed frontier models appear stronger for out-of-the-box agent loops, hosted tools, platform maturity, and rapid deployment with minimal ML systems engineering. Nemotron becomes attractive when data sovereignty, on-prem or VPC deployment, auditability of weights and training assets, direct GPU control, and self-optimized inference economics matter more than access to the most polished managed platform.
Versus DeepSeek
DeepSeek remains the emblematic open reasoning competitor. DeepSeek-V3 uses a 671B-total, 37B-active MoE with Multi-head Latent Attention (MLA) and MTP, is MIT-licensed, and was trained on 14.8T tokens. DeepSeek-R1 is also MIT-licensed and explicitly permits modification and derivative distillation. DeepSeek-V3.2 adds thinking directly inside tool use and supports both thinking and non-thinking modes through API endpoints.
Relative to DeepSeek, Nemotron’s main advantages are broader enterprise packaging, deeper publication of datasets and recipes, a more explicit edge-to-cloud deployment ladder, and tighter optimization for NVIDIA runtimes. DeepSeek’s main advantages are open-model mindshare, very large-scale reasoning ambition, cleaner permissive licensing, and continued strength as a baseline for open reasoning. The relationship is not purely competitive: NVIDIA’s Llama Nemotron blog states that curated synthetic data from DeepSeek-R1 was used in post-training, indicating that NVIDIA treats DeepSeek not only as a rival but also as an upstream quality source in the open ecosystem.
Versus Qwen
Qwen is probably Nemotron’s closest open-model peer in terms of architecture and positioning. Qwen3 introduced hybrid thinking and non-thinking modes, supports 119 languages and dialects, and open-weights its MoE models under Apache 2.0. Qwen3.5-122B-A10B extends the thesis with multimodality, 122B total and 10B active parameters, 262K native context extendable to approximately 1.01M, broad framework support, and a default thinking mode.
Against this backdrop, Nemotron’s differentiation is narrower but real: more open publication of training assets, stronger direct integration with NVIDIA GPUs and software, and a broader vendor-managed family spanning reasoning, RAG, safety, and speech. Qwen’s advantages are broader multilingual coverage, cleaner Apache licensing, larger community momentum, and — in NVIDIA’s own Nemotron 3 Super published benchmark table — superior results on MMLU-Pro, GPQA without tools, TauBench average, multilingual averages, and several shorter-window long-context or agentic metrics.
Versus Kimi
Kimi K2 and K2.5 are differentiated less by traditional benchmark marketing and more by explicit agentic design. Kimi K2 Instruct is a 1T-total, 32B-active MoE trained on 15.5T tokens and optimized for tool use, reasoning, and autonomous problem solving. Kimi K2 Thinking pushes farther into long-horizon agency, claiming stable behavior across 200 to 300 sequential tool calls, native INT4 quantization, and 256K context. Kimi K2.5 adds native multimodality and an agent-swarm execution scheme.
Relative to Kimi, Nemotron looks less like the most aggressive autonomous agent brain and more like the most infrastructure-native open enterprise package. Kimi’s edge is ambition around long multi-step autonomy. Nemotron’s edge is integration with NVIDIA hardware, open datasets and recipes, and a wider supporting catalog of safety, retrieval, and speech components.
6. Benchmark Reality Check
NVIDIA’s marketing language around benchmark leadership is directionally aggressive relative to its own published tables. The evidence from the official model card for Nemotron 3 Super 120B supports segmented leadership in selected domains — primarily long-context efficiency, tool-augmented coding, and certain mathematical reasoning tasks — not across-the-board dominance. The following table reproduces the key published numbers for the three main comparison models as cited in the source materials.
| Benchmark | Nemotron 3 Super 120B | Qwen3.5-122B-A10B | GPT-OSS-120B | Winner (per published data) |
|---|---|---|---|---|
| MMLU-Pro | 83.73 | Higher (Qwen leads) | Not separated in cited table | Qwen3.5-122B |
| AIME25 (no tools) | 90.21 | Not cited ahead | Higher (GPT-OSS leads) | GPT-OSS-120B |
| GPQA (no tools) | 79.23 | Higher (Qwen leads) | Not separated in cited table | Qwen3.5-122B |
| LiveCodeBench | 81.19 | Not cited ahead | Higher (GPT-OSS leads) | GPT-OSS-120B |
| HLE (no tools) | 18.26 | Not separately cited at note date | Not separately cited at note date | Competitive; no clear winner at note date |
| RULER-100 at 1M tokens | 91.75 | ~77.5 (Nemotron leads by ~14 pts) | Not cited at 1M scale | Nemotron 3 Super 120B |
Nemotron 3 Nano 30B: Selected Benchmark Data
For the Nano 30B vs. Qwen3-30B-A3B-Thinking-2507 vs. GPT-OSS-20B comparison (from the official model card):
- AIME25 (no tools): Nemotron 89.1% vs. Qwen 85.0% vs. GPT-OSS 91.7% — Nemotron competitive but not leading.
- LiveCodeBench: Nemotron 68.3% vs. Qwen 66.0% vs. GPT-OSS 61.0% — Nemotron leads.
- MiniF2F pass@1: Nemotron 50.0% vs. Qwen 5.7% vs. GPT-OSS 12.1% — Nemotron leads by a very large margin; notable for formal mathematical reasoning.
- RULER-100 at 1M: Nemotron 86.3% vs. Qwen 77.5% — Nemotron leads on long-context efficiency.
- MMLU-Pro and multilingual MMLU-ProX: Nemotron trails Qwen in NVIDIA’s own published comparison.
Nemotron Nano 9B v2: Benchmark Profile
The Nano 9B v2 model card reports the following on NVIDIA’s published suite (versus Qwen3-8B as the primary comparison):
- AIME25: 72.1% (outperforms Qwen3-8B per NVIDIA’s published claim)
- MATH500: 97.8%
- GPQA: 64.0%
- LiveCodeBench (LCB): 71.1%
- BFCL v3: 66.9%
- RULER at 128K: 78.9%
The Honest Assessment
NVIDIA’s benchmark marketing consistently emphasizes domains where Nemotron leads (long-context RULER, MiniF2F formal math, tool-augmented coding in select tasks) while presenting comparisons in ways that may not surface areas where Qwen or GPT-OSS are ahead (MMLU-Pro, GPQA without tools, multilingual tasks, broader agentic metrics). This is a standard industry practice, not a unique NVIDIA behavior, but it requires investment-grade diligence to disaggregate.
The honest characterization from the published data is: Nemotron 3 Super is a highly capable open model with genuine leadership in 1M-token long-context efficiency (RULER), formal mathematical reasoning (MiniF2F at the Nano 30B tier), and NVIDIA-optimized serving throughput. It is not the universal benchmark leader across all evaluated tasks in NVIDIA’s own comparison tables. Enterprises should evaluate against their specific workload distribution — long-context agentic tasks, code, and retrieval favor Nemotron; broad multilingual coverage and raw GPQA/MMLU-Pro scores favor Qwen.
7. GPU Deployment and Configuration
Nemotron’s hardware differentiation is real but comes from architecture-aware serving choices, not from a single configuration switch. NVIDIA’s Advanced Deployment Guide for Nemotron 3 Super highlights 3 properties that directly affect inference configuration:
- LatentMoE expert parallelism: Routes expert computation in a compressed latent space, cutting all-to-all communication traffic by approximately 4× versus standard MoE. This makes expert parallelism preferable to pure tensor parallelism for multi-GPU configurations, a non-obvious choice that generic open-source deployment guides would not surface.
- Multi-Token Prediction speculative decoding: MTP is baked into the checkpoint and can be exposed through speculative decoding without a separate draft model. NVIDIA’s vLLM guide uses 5 speculative tokens. This provides generation throughput improvements without the engineering overhead of maintaining a two-model serving stack.
- Mamba SSM cache in float32: The Mamba-2 layers introduce a distinct SSM state cache that NVIDIA recommends keeping in float32 regardless of the overall checkpoint precision. Failing to set this correctly can degrade output quality in ways that are not immediately obvious from loss metrics.
| Setting | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Pinned Version | 0.17.1 | v0.5.9 | Latest supported |
| Min GPUs (Super 120B) | 8×H100-80GB | 8×H100-80GB | 2×B200 (latency) / 8×B200 (throughput) |
| Parallelism | Expert parallelism preferred over tensor parallelism | Tensor + Expert parallelism combined | tp_size and ep_size set together |
| MoE Kernels | FlashInfer MoE kernels enabled | Default | NVFP4 MoE kernels on Blackwell |
| Mamba SSM Cache | float32 (required) | float32 (required) | float32 or quantized with stochastic rounding |
| Speculative Decoding (MTP) | 5 speculative tokens | Requires nightly build; disable radix cache | Native MTP support |
| GPU Memory Utilization | ~0.9 | Default | Profile-dependent |
| Tool-Call Parser | Model-specific reasoning parser | qwen3_coder | N/A |
| NVFP4 Support | Blackwell via NVFP4 MoE kernels | Requires nightly + disable radix cache | Native on Blackwell |
| Block/Prefix Caching | Supported | Radix cache disabled for NVFP4+MTP | Disabled (Mamba state not prefix-cacheable) |
| Key Optimization | LatentMoE cuts all-to-all traffic ~4× | Combined TP+EP for throughput | Quantized Mamba cache + stochastic rounding for throughput |
The table above summarizes configuration differences that matter for production deployment. The critical takeaway is that default open-source serving settings leave meaningful performance on the table: the Mamba SSM cache must be float32, expert parallelism outperforms tensor parallelism for LatentMoE, and MTP-based speculative decoding requires explicit enablement. Teams already standardized on NVIDIA GPUs can extract a materially larger advantage from Nemotron than from a more generic open-weight model.
vLLM Configuration (Nemotron 3 Super)
NVIDIA pins Nemotron 3 Super to vLLM 0.17.1 and provides the following key configuration recommendations:
- Enable FlashInfer MoE kernels for efficient sparse expert computation.
- Use expert parallelism (preferred over pure tensor parallelism due to LatentMoE all-to-all reduction).
- Set GPU memory utilization to approximately 0.9.
- Expose a model-specific reasoning parser for chain-of-thought extraction.
- Configure speculative decoding with 5 speculative tokens using the MTP checkpoint.
- Enable NVFP4 MoE kernels on Blackwell-class GPUs (B200 / GB200).
- Optionally enable the TensorRT-LLM allreduce backend for improved multi-GPU communication.
- Set Mamba SSM cache dtype to float32 explicitly.
SGLang Configuration (Nemotron 3 Super)
NVIDIA pins SGLang to version 0.5.9 and provides the following guidance:
- Set both tensor parallelism (tp) and expert parallelism (ep) simultaneously.
- Use qwen3_coder as the tool-call parser (indicating interoperability with prevailing open-source tool-call conventions rather than a fully isolated NVIDIA-specific orchestration layer).
- Note that NVFP4 or FP8 plus MTP requires a newer nightly build and disabling radix cache.
TensorRT-LLM Configuration (Nemotron 3 Super)
The TensorRT-LLM guide specifies:
- Disable block reuse because Mamba recurrent state is not prefix-cacheable — a critical difference from Transformer-only models where prefix caching is a standard throughput optimization.
- Set tp_size and ep_size together for expert parallelism.
- Two deployment profiles: 2×B200 for latency-optimized (minimum serving cost) and 8×B200 for throughput-optimized (maximum serving capacity).
- Optional quantized Mamba cache with stochastic rounding to improve throughput at modest quality tradeoff.
Smaller Model Configurations
Nemotron Nano 9B v2: Requires mamba_ssm_cache_dtype float32 in vLLM. Supports runtime thinking-budget control. NVIDIA frames this explicitly as useful for customer support, autonomous agent steps, and edge devices where latency varies by task.
Nemotron 3 Nano 4B: Requires vLLM 0.15.1 or higher. Uses a custom reasoning parser. Enables auto tool choice. Sets Mamba cache to float32. For Jetson Thor and DGX Spark deployments, NVIDIA directs users to a specific container image rather than a generic vLLM install.
Llama Nemotron Super 49B v1.5: Uses reasoning by default. Allows /no_think to disable extended thinking. vLLM configuration sets tensor parallel size 8 and GPU memory utilization 0.95. This model predates the Mamba-2 architecture and uses conventional Transformer serving assumptions.
Practical Implication
The practical implication of the deployment complexity is straightforward: default open-source deployment settings can leave meaningful performance on the table or reduce output quality, while a team already standardized on NVIDIA GPUs can extract a larger advantage from Nemotron than from a more generic open-weight model. The deployment optimization documentation NVIDIA provides is unusually detailed by industry standards, and it functions as a knowledge asset that creates switching costs for teams that invest in learning and implementing it. An organization that has tuned vLLM 0.17.1 with FlashInfer MoE kernels, expert parallelism, and float32 Mamba cache for Nemotron 3 Super does not have equivalent guidance available for a Qwen or DeepSeek model running on the same hardware.
8. Enterprise Use Cases and Model Selection
Nemotron is especially rational when 1 of 4 conditions holds. Understanding these conditions is the correct lens for evaluating whether a given enterprise customer represents an incremental GPU demand signal for NVIDIA or merely a developer evaluation that will not convert to hardware.
Four Conditions Where Nemotron Is the Rational Choice
- 1. Data Sovereignty and Governance: Weights can be downloaded from Hugging Face and run entirely on-premises or in a private VPC. NIM offers a supported enterprise packaging path with SLA-backed deployment options. For regulated industries — financial services, healthcare, defense — where data residency requirements preclude sending data to external APIs, Nemotron provides a credible open alternative with auditability of both weights and, uniquely, training data and RL recipes. This is the use case where Nemotron is decisively stronger than GPT-5.4 or Claude Opus 4.6.
- 2. GPU-Stack Alignment: Organizations already standardized on NVIDIA infrastructure — H100 clusters, DGX systems, RTX PRO workstations, or Jetson edge devices — can exploit TensorRT-LLM, NIM, NeMo, and architecture-specific serving guidance that is unusually detailed and directly tied to NVIDIA hardware capabilities. The LatentMoE all-to-all reduction, MTP speculative decoding, and Blackwell NVFP4 kernel support are optimization paths unavailable on competing hardware. The advantage compounds as the enterprise builds out its NVIDIA footprint.
- 3. Agent Composition: NVIDIA offers not only reasoning models but also integrated RAG, safety, speech, and vision models plus Blueprints for multimodal RAG and deep research. An enterprise building a production customer support agent needs a reasoning core, a retrieval layer, safety guardrails, and potentially a voice frontend. Nemotron provides off-the-shelf components for all of these in a single vendor ecosystem. The alternative — assembling a stack from Qwen (reasoning), a separate embedding model, a separate safety classifier, and a separate speech system — requires significantly more integration work and introduces multiple vendor relationships.
- 4. Workload Shape: Long-context multi-agent flows, code agents, customer support automation, IT ticket automation, and voice agents with retrieval and safety guardrails are explicitly targeted in NVIDIA’s materials. The 1M-token RULER benchmark leadership and MiniF2F formal reasoning score at the Nano 30B tier are signals that the architecture is tuned for multi-step reasoning loops with long memory requirements, not for short conversational exchanges. Enterprises whose workloads are dominated by long documents, extended tool-use chains, or high-volume structured reasoning tasks have the best fit.
Where Nemotron Is Less Compelling
Nemotron is less compelling in 4 situations:
- Turnkey managed frontier performance: GPT-5.4 and Claude Opus 4.6 offer superior out-of-the-box agent loops, hosted tools, platform maturity, and rapid deployment with minimal ML systems engineering. Enterprises that need to move fast without GPU infrastructure investment are better served by managed APIs.
- Broadest multilingual reach: Qwen3.5 supports 119 languages and dialects under Apache 2.0. Nemotron 3 Super supports 7 languages. For global enterprise deployments requiring deep multilingual coverage, Qwen is the stronger choice.
- Cleanest permissive licensing: Qwen (Apache 2.0) and DeepSeek (MIT) offer cleaner licensing than the NVIDIA Open Model License, which, while permissive in practice, is less legally simple. Legal teams at enterprises with strict IP policies may prefer Apache or MIT for modified or redistributed derivatives.
- Most aggressive autonomous tool-use frontier: Kimi K2 and K2.5, with claimed stability across 200–300 sequential tool calls and native agent-swarm execution, represent a more ambitious long-horizon autonomy thesis than current Nemotron models. Enterprises building the most autonomous multi-step AI agents may find Kimi’s published agentic behavior claims more relevant to their specific use case.
9. License, Availability, and Ecosystem
Licensing Framework
NVIDIA markets Nemotron as truly open source and states that models, datasets, and techniques are openly published. In operational terms, that is directionally correct relative to closed APIs. In licensing terms, the family is not uniform and requires enterprise legal review.
- NVIDIA Open Model License: Most NVIDIA-native Nemotron models (Nano v2, Nano 30B, Super 120B) use the NVIDIA Open Model License, which NVIDIA describes as permitting use, modification, distribution, and commercial deployment without attribution. Enterprises should independently verify the current license terms for each model version before committing to production use or redistribution.
- Llama License inheritance: Llama Nemotron derivatives (Super 49B, Ultra 253B) inherit the parent Meta Llama license in addition to any NVIDIA-specific terms. The Llama license permits commercial use above certain user thresholds and restricts certain competitive uses.
- Comparative positioning: For enterprises, Nemotron is substantially more open than Anthropic or OpenAI (which offer no weight access or data disclosure), but less licensing-simple than an Apache 2.0 family such as Qwen or a MIT-licensed release such as DeepSeek. Legal teams evaluating modified or redistributed derivative use cases should review the specific NVIDIA Open Model License terms rather than assuming Apache or MIT equivalence.
Availability and Deployment Channels
As of March 19, 2026, Nemotron is available through the following channels:
- build.nvidia.com: Free prototyping via API endpoints for select models (currently including Llama-3.3-Nemotron-Super-49B-v1.5, Llama-3.1-Nemotron-Ultra-253B-v1, and Nemotron 3 VoiceChat). This channel is designed for developer evaluation and proof-of-concept work without GPU investment.
- Hugging Face: Downloadable weights for the full family, including Nemotron 3 Super 120B and Nano 9B v2. Models are available for direct download and deployment on customer-owned hardware.
- NIM Microservices (NVIDIA AI Enterprise): Supported enterprise packaging with SLA-backed deployment for production workloads. NIM provides containerized microservice packaging with NVIDIA-optimized serving configurations pre-installed.
- Third-party inference providers: The developer portal lists hosted inference access through Baseten, DeepInfra, Fireworks AI, FriendliAI, Inference.net, Lightning, Modal, Nebius, and Together AI, as well as discovery through Hugging Face and OpenRouter.
- Self-hosted frameworks: The full model family is documented for deployment with Hugging Face Transformers, vLLM, SGLang, Ollama, llama.cpp, and TensorRT-LLM on NVIDIA GPUs.
Enterprise Adopters and Go-to-Market
NVIDIA’s official site showcases enterprise adopters including Accenture, Amdocs, Cadence, CrowdStrike, Deloitte, SAP, ServiceNow, and World Wide Technology. This list indicates an overt partner-led go-to-market motion — NVIDIA is distributing through established enterprise technology integrators and platform vendors rather than pursuing a pure direct API strategy. The presence of CrowdStrike (cybersecurity), SAP (ERP), and ServiceNow (IT service management) alongside professional services firms (Accenture, Deloitte, World Wide Technology) suggests that Nemotron’s initial production penetration is concentrated in IT operations, customer service automation, and enterprise software augmentation — exactly the workload shapes that NVIDIA’s model card and Blueprints documentation target.
Nemotron Coalition and Roadmap
In March 2026, NVIDIA announced the Nemotron Coalition, an external partnership structure through which other organizations contribute to base model development in exchange for shared data flywheel access. The first coalition-built base model, co-developed by NVIDIA and Mistral AI, will underpin an upcoming Nemotron 4 family. Coalition members include Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam, and Thinking Machines Lab — a roster spanning frontier model builders, agent-framework developers, and regional AI labs. This structure represents a socialization of model development costs while NVIDIA retains the infrastructure and software layer where its economic interest is concentrated. The Coalition also creates a narrative that positions NVIDIA as the center of gravity for open enterprise AI development, analogous to how CUDA became the center of gravity for GPU computing by attracting external developers and researchers.
10. Investment Implications
Nemotron is not a direct P&L driver for NVIDIA in the near term. Its investment significance is as a demand-generation vehicle for the hardware and software segments that drive NVIDIA’s revenue: data-center GPUs (H100, H200, B200, and successors), NVIDIA AI Enterprise (NIM, NeMo, TensorRT-LLM), and emerging edge AI (DGX Spark, RTX PRO, Jetson). The following table frames the scenario analysis for how Nemotron’s trajectory maps to hardware and software demand.
| Scenario | Signal to Watch | Investment Implication |
|---|---|---|
| Upside: Nemotron becomes the default enterprise open-agent stack on NVIDIA GPUs | Broad enterprise adoption of NIM-packaged Nemotron across SAP, ServiceNow, and CrowdStrike deployments; Nemotron 4 Coalition launch with broad partner uptake; NeMo and TensorRT-LLM attach rates rising in NVIDIA AI Enterprise revenue; DGX Spark and RTX PRO volumes driven by Nano 4B and 9B v2 edge deployments | Incremental GPU demand pull beyond existing hyperscaler capex cycle; NIM and NeMo software layer begins to generate recurring software revenue above hardware attach; RTX PRO and DGX Spark carve out a credible enterprise AI PC and workstation market; NVIDIA AI Enterprise revenue accelerates toward $5B+ run rate within 2 years |
| Base: Developer adoption with selective enterprise pilots; Nemotron complements closed APIs | Consistent developer download growth on Hugging Face; selective production deployments through NIM at partner accounts (Accenture, Deloitte, WWT); Nemotron used alongside GPT-5.4 / Claude Opus 4.6 for sovereignty-sensitive workloads rather than replacing them; Coalition attracts 3–5 meaningful external partners | Nemotron meaningfully expands the addressable enterprise use case for NVIDIA GPU deployments at the margin; supports data-center GPU demand growth without changing the hyperscaler-led demand thesis; NIM and NVIDIA AI Enterprise grow steadily but below the upside software-revenue acceleration scenario; strategic moat evidence accumulates for next cycle |
| Downside: Qwen and DeepSeek maintain open-model mindshare; Nemotron stays niche | Qwen3.5 or subsequent Alibaba releases sustain benchmark leadership across multilingual and agentic tasks; DeepSeek maintains MIT licensing advantage and community momentum; Nemotron adoption concentrated in NVIDIA’s direct account base without broader ecosystem traction; Coalition fails to attract meaningful external partners; NIM attach rate disappoints | Nemotron becomes a proof-of-concept showcase rather than a demand-generation engine; GPU demand growth reverts to pure hyperscaler capex cycle dynamics; NVIDIA AI Enterprise software revenue growth remains modest; the open-model strategy does not meaningfully accelerate the hardware attach trajectory beyond what the hyperscaler cycle alone delivers |
Strategic Value as a Demand-Generation Vehicle
The most precise framing for Nemotron in an investment context is as a demand-creation vehicle for NVIDIA’s core hardware franchise. NVIDIA does not need Nemotron to generate direct licensing revenue. What NVIDIA needs is for enterprises that might otherwise evaluate DeepSeek on AMD hardware or Qwen on Google TPU instances to instead standardize on Nemotron on H100 or B200 GPUs, with TensorRT-LLM and NIM as the serving layer and NeMo Gym as the fine-tuning layer.
Each enterprise that makes that choice generates: (1) initial GPU procurement revenue, (2) ongoing NIM and NVIDIA AI Enterprise software subscription revenue, (3) a reference architecture that is difficult to migrate away from once deeply integrated, and (4) a demand signal that validates continued GPU capex investment in future hardware generations. The open-model strategy lowers the barrier to entry for enterprise evaluation by removing the model-licensing cost, while the infrastructure lock-in is generated at the serving and tooling layer rather than the model layer.
Coalition and Nemotron 4 as Moat Deepening
The March 2026 Nemotron Coalition announcement signals that NVIDIA intends to deepen this strategy through the next model generation. By socializing model-development costs with external partners while retaining the infrastructure and serving layer, NVIDIA can potentially maintain competitive open-model releases without carrying the full cost of frontier model training internally. The Nemotron 4 family, when released, will be the first test of whether the Coalition structure produces a base model competitive with Qwen, DeepSeek, and Kimi at scale without requiring NVIDIA to run an internal model research organization at the scale of Anthropic or OpenAI.
If the Coalition structure succeeds, NVIDIA’s position in the open enterprise AI stack could resemble its position in GPU computing more broadly: the company does not need to win every model benchmark to win the infrastructure layer, just as it did not need to produce the best software applications to win GPU computing through CUDA. The infrastructure position — TensorRT-LLM, NIM, NeMo, Blueprints, and the serving guides that make Nemotron work best on NVIDIA hardware — is the durable economic asset.
Near-Term Monitoring Indicators
- NIM and NVIDIA AI Enterprise revenue disclosure: NVIDIA does not separately break out NVIDIA AI Enterprise software revenue in current reporting, but commentary on enterprise software attach rates in datacenter revenue is a key proxy.
- DGX Spark and RTX PRO volume signals: Edge AI hardware attach from Nano 4B and 9B v2 deployments would be the first sign that Nemotron is driving demand beyond data-center GPUs.
- Coalition partner announcements: The breadth and technical seriousness of Nemotron 4 Coalition partners will indicate whether NVIDIA’s socialized model-development thesis is attracting credible external contributors.
- Nemotron 4 benchmark release: The first Nemotron 4 model card will be the clearest test of whether the Coalition structure produces a competitive open base model at frontier scale.
- Enterprise deployer disclosures: SAP, ServiceNow, CrowdStrike, and other announced Nemotron adopters are public companies. Any mention of Nemotron NIM deployments or NVIDIA AI Enterprise in their earnings calls or partner announcements would be a real-demand validation signal.
Data sources may include: Bloomberg, FactSet, S&P Capital IQ, company filings, earnings call transcripts, expert network interviews, SEC EDGAR.
Sources cited: NVIDIA Nemotron official documentation, model cards, and developer guides; NVIDIA GTC 2026 materials; Hugging Face model repositories; OpenAI, Anthropic, DeepSeek, Qwen, and Kimi official documentation; company filings.