Technical Intelligence Report: 2026-01-21
January 21, 2026 · Generated 09:22 PM PT
Executive Summary
- ROCm Ecosystem Maturity: ROCm 7.2 has been released, introducing support for RDNA4 hardware (RX 9060 XT LP) and the new ROCm Optiq visualization tool. Simultaneously, ROCm has achieved “First-Class Platform” status in vLLM, with CI pass rates hitting 93% and official Docker/Wheel support.
- Framework Updates: PyTorch 2.10 is live, featuring improved RDNA 3.5 (GFX1150) support and Grouped GEMM via CK for AMD GPUs.
- Next-Gen Hardware Prep: Linux patches reveal Zen 6 “Venice” EPYC features, specifically focusing on advanced bandwidth enforcement (GLBE, GLSBE) and privilege management (PLZA).
- Market Competition: Upscale AI raised $200M to build “SkyHammer,” a UALink-compatible switch ASIC intended to rival NVIDIA’s NVSwitch, bolstering the open ecosystem AMD utilizes.
- NVIDIA Activity: Benchmarks compare NVIDIA’s GB10 (Grace-Blackwell) Arm cores against AMD’s “Strix Halo” Ryzen AI Max+. Jensen Huang is visiting China to negotiate H200 shipments and advocating for “AI as Infrastructure” at Davos.
🤖 ROCm Updates & Software
[2026-01-21] ROCm Becomes a First-Class Platform in vLLM (#2016)
Source: ROCm Tech Blog
Key takeaway relevant to AMD:
- Critical Stability Milestone: ROCm is now a “first-class citizen” in vLLM, meaning AMD hardware support is no longer experimental. This significantly reduces friction for enterprise deployment of LLMs on MI300/MI350 series.
- Ease of Deployment: Official Docker images and pip wheels remove the need for developers to build from source, previously a major pain point.
Summary:
- ROCm support in vLLM (v0.12.0 - v0.14.0) has been massively upgraded.
- CI (Continuous Integration) stability for AMD hardware improved from 37% passing (Nov 2025) to 93% passing (Jan 2026).
- Native support added for vLLM-omni (multimodal inference).
Details:
- Official Docker Images: Now available via `vllm/vllm-openai-rocm:v0.14.0`.
- Installation: Simplified to `uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/`. Supports ROCm 7.0 and Python 3.12.
- Performance Optimizations:
- Quantization: Native AITER FP8 kernels, fused LayerNorm/SiLU FP8, MXFP4 w4a4 MoE inference.
- Architecture: Optimized KV cache, assembly Paged Attention, and removal of DeepSeek MLA D2D copies.
- Hardware: Validated on MI300X, MI325X, MI350X, MI355X (gfx942, gfx950 architectures).
- vLLM-omni: Supports audio/image/video input and text/audio output. Optimized configs provided for Qwen2.5-Omni and Qwen3-Omni-MoE.
[2026-01-21] AMD ROCm 7.2 Now Released With More Radeon Graphics Cards Supported, ROCm Optiq Introduced
Source: Phoronix
Key takeaway relevant to AMD:
- RDNA4 Support: Early support for next-gen consumer hardware (RX 9060 XT LP) indicates RDNA4 compute support is arriving faster than previous generations.
- Tooling Expansion: The introduction of ROCm Optiq addresses a long-standing gap in visualization and profiling tools compared to NVIDIA Nsight.
Summary:
- ROCm 7.2.0 released with expanded hardware support and new tooling.
- Official support added for RDNA4 and specific RDNA3 consumer cards.
- Introduction of “ROCm Optiq” (Beta) and “ROCm Simulation”.
Details:
- New Hardware Support:
- RDNA4: AMD Radeon RX 9060 XT LP (32 CUs, 64 AI accelerators, 16GB GDDR6).
- Workstation: AMD Radeon AI PRO R9600D (3072 SPs, 32GB GDDR6).
- RDNA3: Official support for Radeon RX 7700 series.
- Software Features:
- ROCm Optiq: A new visualization platform (Windows/Linux) for viewing GPU traces from profiling tools.
- ROCm Simulation: Toolkit for physics-based/numerical simulations.
- HIP Updates: SPIR-V support for hipCUB and rocThrust; node power management for multi-GPU nodes.
- OS Support: MI350X/MI355X support added to SUSE Linux Enterprise Server 15 SP7.
[2026-01-21] PyTorch 2.10 Released With More Improvements For AMD ROCm & Intel GPUs
Source: Phoronix
Key takeaway relevant to AMD:
- RDNA 3.5 Integration: Support for GFX1150/GFX1151 (likely Ryzen AI APUs) in `hipblaslt` enables better on-device AI performance for laptops/embedded.
- GEMM Optimization: Grouped GEMM support via Composable Kernel (CK) improves efficiency for transformer workloads.
Summary:
- PyTorch 2.10 released with broad updates for ROCm, Intel XPU, and CUDA.
- Focus on kernel optimizations and Windows support for ROCm.
Details:
- AMD ROCm Improvements:
- Enabled grouped GEMM via regular GEMM fallback and via CK (Composable Kernel).
- Added GFX1150/GFX1151 (RDNA 3.5) to `hipblaslt`-supported GEMM lists.
- Support for `scaled_mm` v2 and AOTriton `scaled_dot_product_attention`.
- Code generation support for `fast_tanhf`.
- Improved heuristics for pointwise kernels.
- Enhanced ROCm support on Windows.
- General/Competitor Updates:
- Intel: New Torch XPU APIs, SYCL support for custom operators on Windows.
- NVIDIA: CUDA 13 compatibility improvements, CUTLASS MATMULs on Thor.
- Python: Experimental support for Python 3.14 free-threaded build.
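The grouped GEMM mentioned above batches many independent matmuls of differing shapes (e.g. per-expert projections in MoE) into one launch. A minimal NumPy sketch of the semantics only; a fused kernel such as the CK path runs all groups in a single launch, but its results must match this reference loop:

```python
import numpy as np

def grouped_gemm(a_list, b_list):
    """Reference semantics of grouped GEMM: one independent matmul
    per group (a fused kernel must produce identical results)."""
    return [a @ b for a, b in zip(a_list, b_list)]

rng = np.random.default_rng(0)
# Groups share K/N but differ in M, as in MoE expert layers where
# each expert is routed a different number of tokens.
shapes = [(4, 8, 16), (2, 8, 16), (6, 8, 16)]
a_list = [rng.standard_normal((m, k)) for m, k, _ in shapes]
b_list = [rng.standard_normal((k, n)) for _, k, n in shapes]
outs = grouped_gemm(a_list, b_list)
```

Fusing the groups amortizes launch overhead and lets the kernel pack small, uneven problems onto the GPU together instead of serializing many tiny matmuls.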
[2026-01-21] Add new author: Mohit Deopujari (#2013)
Source: ROCm Tech Blog
Key takeaway relevant to AMD:
- Minimal technical impact; indicates expanding documentation/blogging team for ROCm.
Summary:
- Administrative commit adding Mohit Deopujari to the ROCm blog author list.
Details:
- Updated `.authorlist.txt` and added author metadata/image.
🔲 AMD Hardware & Products
[2026-01-21] AMD Sends Out Linux Patches For Next-Gen EPYC Features: GLBE, GLSBE & PLZA
Source: Phoronix
Key takeaway relevant to AMD:
- Zen 6 “Venice” Features: Confirms advanced QoS (Quality of Service) features for next-gen EPYC servers, critical for multi-tenant cloud and mixed-workload AI clusters.
- Resource Control: AMD is significantly enhancing the `resctrl` (Resource Control) capabilities in Linux to manage bandwidth contention.
Summary:
- 19 Linux kernel patches released for EPYC “Venice” (Zen 6) processors.
- Introduces three key features: GLBE, GLSBE, and PLZA.
Details:
- GLBE (Global Bandwidth Enforcement): Allows software to set bandwidth limits for thread groups spanning multiple QoS Domains. Sets a ceiling for “L3 External Bandwidth.”
- GLSBE (Global Slow Bandwidth Enforcement): Similar to GLBE but specifically for “Slow Memory” (CXL or tiered memory), managing bandwidth limits across multiple QoS domains.
- PLZA (Privilege Level Zero Association): Hardware mechanism to automatically associate Ring 0 (kernel/privileged) execution with a specific Class of Service (COS) or Resource Monitoring Identifier (RMID), overriding per-thread associations.
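GLBE's distinguishing feature is that the ceiling is global: one budget shared by all QoS domains in the thread group, rather than a separate per-domain limit. A toy Python model of that semantics (illustrative only; the real mechanism is enforced in hardware and would be configured through the `resctrl` filesystem, not an API like this):

```python
class GlobalBandwidthLimiter:
    """Toy model of a GLBE-style global ceiling: a single bandwidth
    budget shared across all QoS domains of a thread group.
    Names and units are illustrative, not the hardware interface."""

    def __init__(self, ceiling_gbps: float):
        self.ceiling = ceiling_gbps
        self.used = 0.0

    def request(self, domain: str, gbps: float) -> float:
        # Grant whatever still fits under the *global* ceiling,
        # regardless of which QoS domain is asking.
        grant = min(gbps, self.ceiling - self.used)
        self.used += grant
        return grant

limiter = GlobalBandwidthLimiter(ceiling_gbps=100.0)
g1 = limiter.request("domain0", 70.0)  # fits fully under the ceiling
g2 = limiter.request("domain1", 50.0)  # only 30 GB/s of budget remains
```

With a per-domain scheme, domain1 could have received its full 50; under a global ceiling the two domains contend for one budget, which is exactly the multi-tenant isolation case the patches target.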
🤼‍♂️ Market & Competitors
[2026-01-21] The CPU Performance Of The NVIDIA GB10 With The Dell Pro Max vs. AMD Ryzen AI Max+ “Strix Halo”
Source: Phoronix
Key takeaway relevant to AMD:
- Direct Comparison: Phoronix is benchmarking the “Grace” portion of the Grace-Blackwell superchip against AMD’s top-tier Strix Halo APU. This highlights the blurring line between high-end client APUs and server-grade Arm CPUs.
Summary:
- Benchmarking NVIDIA GB10 (Grace-Blackwell) CPU cores vs. AMD Ryzen AI Max+ 395 (Framework Desktop).
- Focus is on traditional Linux CPU workloads, not just AI.
Details:
- NVIDIA GB10 Specs: 20 Arm cores (10x Cortex-X925 performance cores + 10x Cortex-A725 efficiency cores). 128GB LPDDR5x memory.
- AMD System: Framework Desktop with Ryzen AI Max+ 395 “Strix Halo”.
- Testing Constraint: NVIDIA GB10 does not expose CPU power metrics via Linux PowerCap/RAPL; testing relied on total AC system power.
- Software Environment: Both systems running Ubuntu 24.04.3 LTS, Linux 6.14 kernel, GCC 13.3.
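On the AMD side, package power is typically read from the Linux powercap/RAPL sysfs energy counters, which is what the GB10 does not expose. A hedged sketch of that measurement (zone path is parameterized since the sysfs layout varies by system; note the driver directory is named `intel-rapl` even on AMD CPUs):

```python
from pathlib import Path

def read_energy_uj(zone: Path) -> int:
    """Read a powercap RAPL zone's cumulative energy counter in
    microjoules, e.g. from /sys/class/powercap/intel-rapl:0."""
    return int((zone / "energy_uj").read_text().strip())

def average_watts(e0_uj: int, e1_uj: int, seconds: float) -> float:
    """Average power between two counter samples taken `seconds` apart
    (microjoules -> joules, then divide by elapsed time)."""
    return (e1_uj - e0_uj) / 1e6 / seconds
```

Usage is two samples bracketing the workload: read once, run the benchmark, read again, and divide by the elapsed time. Without this interface, a wall-power meter (as used for the GB10) measures the whole system, PSU losses included.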
[2026-01-21] Upscale AI Nabs Cash To Forge “SkyHammer” Scale Up Fabric Switch
Source: The Next Platform
Key takeaway relevant to AMD:
- UALink Momentum: Upscale AI is building a switch for UALink, the AMD-backed open standard competitor to NVLink. A high-performance merchant silicon switch for UALink is required for AMD (and others) to effectively compete with NVIDIA’s rack-scale NVSwitch architecture.
Summary:
- Upscale AI raised $200M (Series A) to develop “SkyHammer,” a high-radix scale-up fabric switch.
- Valuation over $1 billion.
- Targeting samples in late 2026, volume in 2027.
Details:
- Technology: “SkyHammer” ASIC supports UALink (Ultra Accelerator Link) and Meta’s ESUN standard.
- Goal: Create a heterogeneous, high-bandwidth memory coherent fabric switch to compete with NVIDIA NVSwitch.
- Scale: The UALink 1.0 spec allows up to 1,024 compute engines in a single-level fabric; SkyHammer aims to support this scale.
- Founders: Ex-Auradine, Cavium, and Innovium executives (Rajiv Khemani, Barun Kar).
- Market Context: Provides an alternative to proprietary NVLink, essential for the “anti-NVIDIA” coalition (AMD, Intel, Broadcom, etc.).
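The 1,024-endpoint, single-level figure is what makes a high-radix switch necessary: in a one-hop fabric built from parallel switch planes, every switch needs a port per endpoint, so radix bounds the endpoint count, while extra links per endpoint add bandwidth (more planes) rather than reach. A toy calculation (port and link counts are illustrative; SkyHammer's actual radix is not public):

```python
def single_level_fabric(endpoints: int, switch_radix: int,
                        links_per_endpoint: int) -> dict:
    """Sketch of one-hop fabric sizing: each switch plane dedicates
    one port to every endpoint, so the radix must cover the endpoint
    count; each extra link per endpoint adds a parallel plane."""
    if switch_radix < endpoints:
        raise ValueError("radix too small for a single-level fabric")
    return {"planes": links_per_endpoint,
            "ports_used_per_switch": endpoints}

# UALink 1.0's 1,024-accelerator ceiling with a hypothetical
# 1,024-port switch and 4 links per accelerator:
cfg = single_level_fabric(endpoints=1024, switch_radix=1024,
                          links_per_endpoint=4)
```

A smaller-radix switch forces a multi-level (Clos) topology, adding hops and latency, which is why a merchant high-radix UALink switch matters for rack-scale competition with NVSwitch.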
[2026-01-21] Nvidia CEO Jensen Huang to visit China as company prepares to start H200 shipments…
Source: Tom’s Hardware
Key takeaway relevant to AMD:
- Export Control Dynamics: NVIDIA is actively pushing H200 into China despite restrictions. If successful, this maintains NVIDIA’s CUDA dominance in China, making it harder for AMD’s MI300 series to gain a foothold in Alibaba/Baidu clouds despite AMD’s unrestricted offerings.
Summary:
- Jensen Huang is visiting China (Beijing) during Lunar New Year.
- Purpose: Negotiate H200 GPU shipments and attend internal events.
Details:
- H200 Situation: US government allows some export; Chinese government has curbs on imports to foster self-sufficiency.
- Target Clients: Alibaba, Baidu (commercial use cases relying on CUDA).
- Restrictions: China plans to prohibit H200 for military/state-owned enterprises.
[2026-01-21] ‘Largest Infrastructure Buildout in Human History’: Jensen Huang on AI’s ‘Five-Layer Cake’ at Davos
Source: NVIDIA Blog
Key takeaway relevant to AMD:
- Infrastructure Narrative: NVIDIA is positioning AI not as hardware but as “critical national infrastructure” (like roads/electricity). This narrative drives sovereign AI investments, a sector AMD is also targeting.
Summary:
- Jensen Huang spoke at World Economic Forum (Davos) with BlackRock CEO Larry Fink.
- Described AI as the “largest infrastructure buildout in human history.”
Details:
- Five-Layer Cake: Energy, Chips, Cloud Infrastructure, AI Models, Applications.
- Economic Impact: 2025 saw >$100B in VC funding, mostly for AI-native companies.
- Workforce: Emphasized AI creating demand for skilled labor (electricians, datacenters) and augmenting roles (radiologists, nurses) rather than replacing them.
💬 Reddit & Community
[2026-01-21] Unlucky customer buys RTX 5080, receives relabelled RTX 5060 Ti in the box instead…
Source: Tom’s Hardware
Key takeaway relevant to AMD:
- High-End GPU Scarcity/Value: Highlights the chaos in the high-end consumer GPU market. Though this case involves NVIDIA, such “return scams” also hit high-value AMD cards (RX 7900/9000 series), making it a warning about retail channel integrity.
Summary:
- Amazon customer bought an RTX 5080 but received an RTX 5060 Ti inside the box.
- The scammer applied fake “RTX 5080” stickers to the lower-end card.
Details:
- Detection: The card had an 8-pin PCIe connector (Blackwell RTX 5080 uses 16-pin), exposing the swap.
- Method: Likely a “return switcheroo”: a previous buyer kept the 5080 and returned a relabelled 5060 Ti, which Amazon’s restocking process failed to catch.
📈 GitHub Stats
| Category | Repository | Total Stars | 1-Day |
|---|---|---|---|
| AMD Ecosystem | AMD-AGI/GEAK-agent | 56 | 0 |
| AMD Ecosystem | AMD-AGI/Primus | 66 | 0 |
| AMD Ecosystem | AMD-AGI/TraceLens | 54 | 0 |
| AMD Ecosystem | ROCm/MAD | 31 | 0 |
| AMD Ecosystem | ROCm/ROCm | 6,094 | +6 |
| Compilers | openxla/xla | 3,914 | 0 |
| Compilers | tile-ai/tilelang | 4,781 | +12 |
| Compilers | triton-lang/triton | 18,205 | +11 |
| Google / JAX | AI-Hypercomputer/JetStream | 403 | +1 |
| Google / JAX | AI-Hypercomputer/maxtext | 2,102 | +2 |
| Google / JAX | jax-ml/jax | 34,655 | +14 |
| HuggingFace | huggingface/transformers | 155,512 | +49 |
| Inference Serving | alibaba/rtp-llm | 1,027 | +3 |
| Inference Serving | efeslab/Atom | 333 | 0 |
| Inference Serving | llm-d/llm-d | 2,380 | 0 |
| Inference Serving | sgl-project/sglang | 22,650 | +32 |
| Inference Serving | vllm-project/vllm | 68,077 | +99 |
| Inference Serving | xdit-project/xDiT | 2,510 | +3 |
| NVIDIA | NVIDIA/Megatron-LM | 14,987 | +13 |
| NVIDIA | NVIDIA/TransformerEngine | 3,103 | +5 |
| NVIDIA | NVIDIA/apex | 8,899 | +2 |
| Optimization | deepseek-ai/DeepEP | 8,908 | +5 |
| Optimization | deepspeedai/DeepSpeed | 41,331 | +14 |
| Optimization | facebookresearch/xformers | 10,284 | +1 |
| PyTorch & Meta | meta-pytorch/monarch | 953 | +2 |
| PyTorch & Meta | meta-pytorch/torchcomms | 321 | 0 |
| PyTorch & Meta | meta-pytorch/torchforge | 600 | +2 |
| PyTorch & Meta | pytorch/FBGEMM | 1,519 | +1 |
| PyTorch & Meta | pytorch/ao | 2,641 | +2 |
| PyTorch & Meta | pytorch/audio | 2,814 | +1 |
| PyTorch & Meta | pytorch/pytorch | 96,801 | +36 |
| PyTorch & Meta | pytorch/torchtitan | 4,987 | +3 |
| PyTorch & Meta | pytorch/vision | 17,460 | +2 |
| RL & Post-Training | THUDM/slime | 3,464 | +17 |
| RL & Post-Training | radixark/miles | 749 | +5 |
| RL & Post-Training | volcengine/verl | 18,596 | +34 |