Technical Intelligence Report: 2026-01-21

Executive Summary

  • ROCm Ecosystem Maturity: ROCm 7.2 has been released, introducing support for RDNA4 hardware (RX 9060 XT LP) and the new ROCm Optiq visualization tool. Simultaneously, ROCm has achieved “First-Class Platform” status in vLLM, with CI pass rates hitting 93% and official Docker/Wheel support.
  • Framework Updates: PyTorch 2.10 is live, featuring improved RDNA 3.5 (GFX1150) support and Grouped GEMM via CK for AMD GPUs.
  • Next-Gen Hardware Prep: Linux patches reveal Zen 6 “Venice” EPYC features, specifically focusing on advanced bandwidth enforcement (GLBE, GLSBE) and privilege management (PLZA).
  • Market Competition: Upscale AI raised $200M to build “SkyHammer,” a UALink-compatible switch ASIC intended to rival NVIDIA’s NVSwitch, bolstering the open ecosystem AMD utilizes.
  • NVIDIA Activity: Benchmarks compare NVIDIA’s GB10 (Grace-Blackwell) Arm cores against AMD’s “Strix Halo” Ryzen AI Max+. Jensen Huang is visiting China to negotiate H200 shipments and, separately, advocated for “AI as Infrastructure” at Davos.

🤖 ROCm Updates & Software

[2026-01-21] ROCm Becomes a First-Class Platform in vLLM (#2016)

Source: ROCm Tech Blog

Key takeaway relevant to AMD:

  • Critical Stability Milestone: ROCm is now a “first-class citizen” in vLLM, meaning AMD hardware support is no longer experimental. This significantly reduces friction for enterprise deployment of LLMs on MI300/MI350 series.
  • Ease of Deployment: Official Docker images and pip wheels remove the need for developers to build from source, a major previous pain point.

Summary:

  • ROCm support in vLLM (v0.12.0 - v0.14.0) has been massively upgraded.
  • CI (Continuous Integration) stability for AMD hardware improved from 37% passing (Nov 2025) to 93% passing (Jan 2026).
  • Native support added for vLLM-omni (multimodal inference).

Details:

  • Official Docker Images: Now available via vllm/vllm-openai-rocm:v0.14.0.
  • Installation: Simplified to uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/. Supports ROCm 7.0 and Python 3.12.
  • Performance Optimizations:
    • Quantization: Native AITER FP8 kernels, fused LayerNorm/SiLU FP8, MXFP4 w4a4 MoE inference.
    • Architecture: Optimized KV cache, assembly Paged Attention, and removal of DeepSeek MLA D2D copies.
    • Hardware: Validated on MI300X, MI325X, MI350X, MI355X (gfx942, gfx950 architectures).
  • vLLM-omni: Supports audio/image/video input and text/audio output. Optimized configs provided for Qwen2.5-Omni and Qwen3-Omni-MoE.
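Once the container is up, the Docker image serves vLLM’s standard OpenAI-compatible API. A minimal sketch of building a chat-completions request against it using only the Python standard library (the localhost:8000 address and the model name are illustrative assumptions, not taken from the blog post):

```python
import json
from urllib import request

def chat_request(prompt: str, model: str = "some-org/some-model") -> request.Request:
    """Build (but do not send) a request to vLLM's OpenAI-compatible
    /v1/chat/completions endpoint. Host, port, and model are placeholders."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("What is ROCm?")
print(req.full_url)
```

Sending it with `urllib.request.urlopen(req)` works against any running vLLM server, since the endpoint mirrors the OpenAI chat-completions schema.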

[2026-01-21] AMD ROCm 7.2 Now Released With More Radeon Graphics Cards Supported, ROCm Optiq Introduced

Source: Phoronix

Key takeaway relevant to AMD:

  • RDNA4 Support: Early support for next-gen consumer hardware (RX 9060 XT LP) indicates RDNA4 compute support is arriving faster than it did for previous generations.
  • Tooling Expansion: The introduction of ROCm Optiq addresses a long-standing gap in visualization and profiling tools compared to NVIDIA Nsight.

Summary:

  • ROCm 7.2.0 released with expanded hardware support and new tooling.
  • Official support added for RDNA4 and specific RDNA3 consumer cards.
  • Introduction of “ROCm Optiq” (Beta) and “ROCm Simulation”.

Details:

  • New Hardware Support:
    • RDNA4: AMD Radeon RX 9060 XT LP (32 CUs, 64 AI accelerators, 16GB GDDR6).
    • Workstation: AMD Radeon AI PRO R9600D (3072 SPs, 32GB GDDR6).
    • RDNA3: Official support for Radeon RX 7700 series.
  • Software Features:
    • ROCm Optiq: A new visualization platform (Windows/Linux) for viewing GPU traces from profiling tools.
    • ROCm Simulation: Toolkit for physics-based/numerical simulations.
    • HIP Updates: SPIR-V support for hipCUB and rocThrust; node power management for multi-GPU nodes.
    • OS Support: MI350X/MI355X support added to SUSE Linux Enterprise Server 15 SP7.

[2026-01-21] PyTorch 2.10 Released With More Improvements For AMD ROCm & Intel GPUs

Source: Phoronix

Key takeaway relevant to AMD:

  • RDNA 3.5 Integration: Support for GFX1150/GFX1151 (likely Ryzen AI APUs) in hipblaslt enables better on-device AI performance for laptops/embedded.
  • GEMM Optimization: Grouped GEMM support via Composable Kernel (CK) improves efficiency for transformer workloads.

Summary:

  • PyTorch 2.10 released with broad updates for ROCm, Intel XPU, and CUDA.
  • Focus on kernel optimizations and Windows support for ROCm.

Details:

  • AMD ROCm Improvements:
    • Enabled grouped GEMM via regular GEMM fallback and via CK (Composable Kernel).
    • Added GFX1150/GFX1151 (RDNA 3.5) to hipblaslt-supported GEMM lists.
    • Support for scaled_mm v2 and AOTriton scaled_dot_product_attention.
    • Code generation support for fast_tanhf.
    • Improved heuristics for pointwise kernels.
    • Enhanced ROCm support on Windows.
  • General/Competitor Updates:
    • Intel: New Torch XPU APIs, SYCL support for custom operators on Windows.
    • NVIDIA: CUDA 13 compatibility improvements, CUTLASS MATMULs on Thor.
    • Python: Experimental support for Python 3.14 free-threaded build.
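Grouped GEMM batches several independent matrix multiplies of differing shapes into a single kernel launch, which is what makes it useful for MoE and other transformer workloads. The reference semantics, sketched here in plain Python rather than through PyTorch’s actual API, are simply a loop of ordinary GEMMs; a fused kernel (such as the CK path) produces the same results in one launch:

```python
def matmul(a, b):
    """Plain O(n^3) matrix multiply on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def grouped_gemm(groups):
    """Reference semantics of a grouped GEMM: one output per (A, B) pair.

    A real grouped-GEMM kernel fuses these independent problems, which may
    have different shapes, into a single launch; the math is identical.
    """
    return [matmul(a, b) for a, b in groups]

# Two problems of different shapes handled as one "launch".
out = grouped_gemm([
    ([[1, 2], [3, 4]], [[5], [6]]),   # 2x2 @ 2x1
    ([[1, 0, 2]], [[1], [1], [1]]),   # 1x3 @ 3x1
])
print(out)  # [[[17], [39]], [[3]]]
```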

[2026-01-21] Add new author: Mohit Deopujari (#2013)

Source: ROCm Tech Blog

Key takeaway relevant to AMD:

  • Minimal technical impact; indicates expanding documentation/blogging team for ROCm.

Summary:

  • Administrative commit adding Mohit Deopujari to the ROCm blog author list.

Details:

  • Updated .authorlist.txt and added author metadata/image.

🔲 AMD Hardware & Products

[2026-01-21] AMD Sends Out Linux Patches For Next-Gen EPYC Features: GLBE, GLSBE & PLZA

Source: Phoronix

Key takeaway relevant to AMD:

  • Zen 6 “Venice” Features: Confirms advanced QoS (Quality of Service) features for next-gen EPYC servers, critical for multi-tenant cloud and mixed-workload AI clusters.
  • Resource Control: AMD is significantly enhancing the resctrl (Resource Control) capabilities in Linux to manage bandwidth contention.

Summary:

  • 19 Linux kernel patches released for EPYC “Venice” (Zen 6) processors.
  • Introduces three key features: GLBE, GLSBE, and PLZA.

Details:

  • GLBE (Global Bandwidth Enforcement): Allows software to set bandwidth limits for thread groups spanning multiple QoS Domains. Sets a ceiling for “L3 External Bandwidth.”
  • GLSBE (Global Slow Bandwidth Enforcement): Similar to GLBE but specifically for “Slow Memory” (CXL or tiered memory), managing bandwidth limits across multiple QoS domains.
  • PLZA (Privilege Level Zero Association): Hardware mechanism to automatically associate Ring 0 (kernel/privileged) execution with a specific Class of Service (COS) or Resource Monitoring Identifier (RMID), overriding per-thread associations.
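These features extend Linux’s existing resctrl interface, where per-group bandwidth limits are expressed as schemata lines written under /sys/fs/resctrl. A sketch of composing a memory-bandwidth schemata string in today’s upstream format (the “MB:” line is the current syntax for per-domain throttling; the exact GLBE/GLSBE syntax will be whatever the new patches define, so this only illustrates the mechanism being extended):

```python
def mb_schemata(limits: dict) -> str:
    """Compose a resctrl memory-bandwidth schemata line.

    `limits` maps a QoS-domain id to a bandwidth percentage, matching the
    upstream "MB:<id>=<pct>;..." format that a privileged process would
    write to /sys/fs/resctrl/<group>/schemata.
    """
    body = ";".join(f"{dom}={pct}" for dom, pct in sorted(limits.items()))
    return f"MB:{body}"

# Cap threads in this group to 40% bandwidth on domain 0, 60% on domain 1.
print(mb_schemata({0: 40, 1: 60}))  # MB:0=40;1=60
```

GLBE/GLSBE’s novelty is that the enforced ceiling spans multiple QoS domains at once, rather than being set per-domain as above.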

🤼‍♂️ Market & Competitors

[2026-01-21] The CPU Performance Of The NVIDIA GB10 With The Dell Pro Max vs. AMD Ryzen AI Max+ “Strix Halo”

Source: Phoronix

Key takeaway relevant to AMD:

  • Direct Comparison: Phoronix is benchmarking the “Grace” portion of the Grace-Blackwell superchip against AMD’s top-tier Strix Halo APU. This highlights the blurring line between high-end client APUs and server-grade Arm CPUs.

Summary:

  • Benchmarking NVIDIA GB10 (Grace-Blackwell) CPU cores vs. AMD Ryzen AI Max+ 395 (Framework Desktop).
  • Focus is on traditional Linux CPU workloads, not just AI.

Details:

  • NVIDIA GB10 Specs: 20 Arm cores (10x Cortex-X925 performance cores + 10x Cortex-A725 efficiency cores). 128GB LPDDR5x memory.
  • AMD System: Framework Desktop with Ryzen AI Max+ 395 “Strix Halo”.
  • Testing Constraint: NVIDIA GB10 does not expose CPU power metrics via Linux PowerCap/RAPL; testing relied on total AC system power.
  • Software Environment: Both systems running Ubuntu 24.04.3 LTS, Linux 6.14 kernel, GCC 13.3.
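When RAPL-style counters are unavailable, wall-power testing of this kind derives energy from periodically sampled AC draw. A sketch of that arithmetic (all sample values invented for illustration):

```python
def energy_joules(samples_watts, interval_s):
    """Approximate energy as the sum of power samples times the sampling
    interval (rectangle rule over an AC wall-power trace)."""
    return sum(samples_watts) * interval_s

# One-second samples from a hypothetical benchmark run.
trace = [118.0, 121.5, 119.5, 120.0, 121.0]
print(round(energy_joules(trace, 1.0), 1))  # 600.0 joules over 5 s
```

The trade-off is that total AC power folds in PSU losses and the rest of the platform, so per-CPU efficiency comparisons carry more noise than RAPL-based ones.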

[2026-01-21] Upscale AI Nabs Cash To Forge “SkyHammer” Scale Up Fabric Switch

Source: The Next Platform

Key takeaway relevant to AMD:

  • UALink Momentum: Upscale AI is building a switch for UALink, the AMD-backed open standard competitor to NVLink. A high-performance merchant silicon switch for UALink is required for AMD (and others) to effectively compete with NVIDIA’s rack-scale NVSwitch architecture.

Summary:

  • Upscale AI raised $200M (Series A) to develop “SkyHammer,” a high-radix scale-up fabric switch.
  • Valuation over $1 billion.
  • Targeting samples in late 2026, volume in 2027.

Details:

  • Technology: “SkyHammer” ASIC supports UALink (Ultra Accelerator Link) and Meta’s ESUN standard.
  • Goal: Create a heterogeneous, high-bandwidth memory coherent fabric switch to compete with NVIDIA NVSwitch.
  • Specs: The UALink 1.0 specification allows up to 1,024 compute engines in a single-level fabric; SkyHammer aims to support this scale.
  • Founders: Ex-Auradine, Cavium, and Innovium executives (Rajiv Khemani, Barun Kar).
  • Market Context: Provides an alternative to proprietary NVLink, essential for the “anti-NVIDIA” coalition (AMD, Intel, Broadcom, etc.).
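The 1,024-endpoint single-level fabric implies a simple relationship between switch radix and switch count: every attached accelerator consumes at least one switch port. A back-of-the-envelope lower bound (radix values illustrative, not SkyHammer’s actual port count, and ignoring any inter-switch links):

```python
import math

def min_switches(endpoints: int, radix: int) -> int:
    """Lower bound on switch count for a fabric: each endpoint needs
    one switch port, so at least ceil(endpoints / radix) switches."""
    return math.ceil(endpoints / radix)

# UALink 1.0 allows up to 1,024 accelerators in a single-level fabric.
for radix in (64, 128, 256):  # illustrative switch port counts
    print(f"radix {radix}: >= {min_switches(1024, radix)} switches")
```

This is why “high-radix” matters in the article’s framing: the larger the radix, the fewer switch chips (and hops) a 1,024-accelerator pod needs.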

[2026-01-21] Nvidia CEO Jensen Huang to visit China as company prepares to start H200 shipments…

Source: Tom’s Hardware

Key takeaway relevant to AMD:

  • Export Control Dynamics: NVIDIA is actively pushing H200 into China despite restrictions. If successful, this maintains NVIDIA’s CUDA dominance in China, making it harder for AMD’s MI300 series to gain footholds in Alibaba/Baidu clouds despite AMD’s unrestricted offerings.

Summary:

  • Jensen Huang is visiting China (Beijing) during Lunar New Year.
  • Purpose: Negotiate H200 GPU shipments and attend internal events.

Details:

  • H200 Situation: The US government now allows some exports, while the Chinese government has curbed imports to foster self-sufficiency.
  • Target Clients: Alibaba, Baidu (commercial use cases relying on CUDA).
  • Restrictions: China plans to prohibit H200 for military/state-owned enterprises.

[2026-01-21] ‘Largest Infrastructure Buildout in Human History’: Jensen Huang on AI’s ‘Five-Layer Cake’ at Davos

Source: NVIDIA Blog

Key takeaway relevant to AMD:

  • Infrastructure Narrative: NVIDIA is positioning AI not as hardware but as “critical national infrastructure” (like roads/electricity). This narrative drives sovereign AI investments, a sector AMD is also targeting.

Summary:

  • Jensen Huang spoke at World Economic Forum (Davos) with BlackRock CEO Larry Fink.
  • Described AI as the “largest infrastructure buildout in human history.”

Details:

  • Five-Layer Cake: Energy, Chips, Cloud Infrastructure, AI Models, Applications.
  • Economic Impact: 2025 saw >$100B in VC funding, mostly for AI-native companies.
  • Workforce: Emphasized AI creating demand for skilled labor (electricians, datacenters) and augmenting roles (radiologists, nurses) rather than replacing them.

💬 Reddit & Community

[2026-01-21] Unlucky customer buys RTX 5080, receives relabelled RTX 5060 Ti in the box instead…

Source: Tom’s Hardware

Key takeaway relevant to AMD:

  • High-End GPU Scarcity/Value: Highlights the chaos in the high-end consumer GPU market. While this case involves NVIDIA hardware, such “return scams” also affect high-value AMD cards (RX 7900/9000 series), serving as a warning about retail channel integrity.

Summary:

  • Amazon customer bought an RTX 5080 but received an RTX 5060 Ti inside the box.
  • The scammer applied fake “RTX 5080” stickers to the lower-end card.

Details:

  • Detection: The card had an 8-pin PCIe connector (Blackwell RTX 5080 uses 16-pin), exposing the swap.
  • Method: Likely a “return switcheroo”: a previous buyer kept the 5080 and returned a relabelled 5060 Ti, which Amazon’s restocking process failed to catch.

📈 GitHub Stats

| Category | Repository | Total Stars | 1-Day | 7-Day | 30-Day |
|---|---|---|---|---|---|
| AMD Ecosystem | AMD-AGI/GEAK-agent | 56 | 0 | | |
| AMD Ecosystem | AMD-AGI/Primus | 66 | 0 | | |
| AMD Ecosystem | AMD-AGI/TraceLens | 54 | 0 | | |
| AMD Ecosystem | ROCm/MAD | 31 | 0 | | |
| AMD Ecosystem | ROCm/ROCm | 6,094 | +6 | | |
| Compilers | openxla/xla | 3,914 | 0 | | |
| Compilers | tile-ai/tilelang | 4,781 | +12 | | |
| Compilers | triton-lang/triton | 18,205 | +11 | | |
| Google / JAX | AI-Hypercomputer/JetStream | 403 | +1 | | |
| Google / JAX | AI-Hypercomputer/maxtext | 2,102 | +2 | | |
| Google / JAX | jax-ml/jax | 34,655 | +14 | | |
| HuggingFace | huggingface/transformers | 155,512 | +49 | | |
| Inference Serving | alibaba/rtp-llm | 1,027 | +3 | | |
| Inference Serving | efeslab/Atom | 333 | 0 | | |
| Inference Serving | llm-d/llm-d | 2,380 | 0 | | |
| Inference Serving | sgl-project/sglang | 22,650 | +32 | | |
| Inference Serving | vllm-project/vllm | 68,077 | +99 | | |
| Inference Serving | xdit-project/xDiT | 2,510 | +3 | | |
| NVIDIA | NVIDIA/Megatron-LM | 14,987 | +13 | | |
| NVIDIA | NVIDIA/TransformerEngine | 3,103 | +5 | | |
| NVIDIA | NVIDIA/apex | 8,899 | +2 | | |
| Optimization | deepseek-ai/DeepEP | 8,908 | +5 | | |
| Optimization | deepspeedai/DeepSpeed | 41,331 | +14 | | |
| Optimization | facebookresearch/xformers | 10,284 | +1 | | |
| PyTorch & Meta | meta-pytorch/monarch | 953 | +2 | | |
| PyTorch & Meta | meta-pytorch/torchcomms | 321 | 0 | | |
| PyTorch & Meta | meta-pytorch/torchforge | 600 | +2 | | |
| PyTorch & Meta | pytorch/FBGEMM | 1,519 | +1 | | |
| PyTorch & Meta | pytorch/ao | 2,641 | +2 | | |
| PyTorch & Meta | pytorch/audio | 2,814 | +1 | | |
| PyTorch & Meta | pytorch/pytorch | 96,801 | +36 | | |
| PyTorch & Meta | pytorch/torchtitan | 4,987 | +3 | | |
| PyTorch & Meta | pytorch/vision | 17,460 | +2 | | |
| RL & Post-Training | THUDM/slime | 3,464 | +17 | | |
| RL & Post-Training | radixark/miles | 749 | +5 | | |
| RL & Post-Training | volcengine/verl | 18,596 | +34 | | |