Technical Intelligence Report: 2026-02-17

Executive Summary

ROCm Optimization for MI350/MI355: AMD released a new quantization technique combining fine-tuned rotations with SmoothQuant for MXFP4 (4-bit) formats, recovering significant accuracy on Qwen3 models running on CDNA4 hardware.
Inference Performance Boost on MI300X: A new “Adaptive Top-K” selection strategy in the AITER library eliminates performance cliffs for small-K values in RAG workloads, leveraging Bitonic Sort and DPP instructions to bypass LDS bottlenecks.
Next-Gen EPYC “Venice” Leaks: Linux kernel patches revealed a new RMPOPT instruction designed to reduce SEV-SNP virtualization overhead, likely for upcoming Zen 6 EPYC processors (supporting up to 256 cores/socket).
Unofficial HDMI 2.1 Linux Support: An independent developer published out-of-tree patches enabling HDMI 2.1 Fixed Rate Link (FRL) on the Radeon RX 9070 XT, circumventing HDMI Forum legal blockers for open-source drivers.

🤖 ROCm Updates & Software

Source: ROCm Tech Blog

Key takeaway relevant to AMD:

Enables near-lossless 4-bit (MXFP4) inference on AMD Instinct MI350X and MI355X accelerators.
Demonstrates accuracy recovery for small-to-mid-sized LLMs (8B-32B parameters), which typically lose accuracy under aggressive quantization, making the CDNA4 architecture more competitive for efficient inference serving.

Summary:

The post details a method to recover accuracy lost during MXFP4 quantization by using online block-diagonal rotations and SmoothQuant.
The technique redistributes outliers in activations and weights to fit better within the OCP MXFP4 format constraints.

Details:

Hardware Context: Targets CDNA4 ISA on MI350/MI355, specifically utilizing MXFP4 GEMM kernels (e.g., V_MFMA_SCALE_F32_16X16X128_F8F6F4).
The Problem: Standard MXFP4 (group size 32) struggles with outliers. One large value forces the shared scale factor up, crushing precision for the other 31 values.
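To make the outlier problem concrete, here is a minimal NumPy sketch of MXFP4-style block quantization. It assumes the E2M1 element grid and a power-of-two shared scale derived from the block maximum, as in the OCP Microscaling format. The function name `mxfp4_quantize_block`, the random test data, and the outlier value of 100 are illustrative assumptions, not taken from the AMD post.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (the sign is stored separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_EMAX = 2  # largest element exponent: 6 = 1.5 * 2**2


def mxfp4_quantize_block(x):
    """Quantize then dequantize one block that shares a power-of-two scale."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x)
    # Shared scale chosen from the block maximum.
    shared_exp = np.floor(np.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    # Clamp to the largest FP4 magnitude, then snap to the nearest grid point.
    # Ties resolve toward the smaller magnitude here, which is a simplification.
    scaled = np.minimum(np.abs(x) / scale, FP4_GRID[-1])
    idx = np.argmin(np.abs(scaled[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale


rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=32)   # one block of 32 well-behaved values
with_outlier = clean.copy()
with_outlier[0] = 100.0                 # a single large outlier (illustrative)

q_clean = mxfp4_quantize_block(clean)
q_outlier = mxfp4_quantize_block(with_outlier)

# Mean absolute error on the 31 values that are identical in both blocks.
err_clean = np.mean(np.abs(clean[1:] - q_clean[1:]))
err_outlier = np.mean(np.abs(with_outlier[1:] - q_outlier[1:]))
print("error without outlier:", err_clean)
print("error with outlier:   ", err_outlier)
```

With the outlier at 100, the shared scale becomes 2^4 = 16, so the smallest nonzero representable magnitude is 0.5 × 16 = 8 and any value below 4 in magnitude rounds to zero. This is the precision loss that the rotations and scaling are meant to reduce.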
The Solution:
- Online Rotations: Applies orthogonal transforms ($R$) to activations at inference. Uses block-diagonal rotations (e.g., size 32x32) to limit compute overhead compared to full dense rotations.
- SmoothQuant Integration: Combines rotation with SmoothQuant scaling ($O = DR$).
- Optimization: Rotations are fine-tuned on the Stiefel manifold using the Cayley SGD optimizer.
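The three ingredients above can be sketched in NumPy under stated assumptions. The Cayley transform maps a skew-symmetric matrix to an orthogonal one, which is how an optimizer can keep $R$ on the manifold. The block-diagonal layout and the combined transform $O = DR$ follow the description above. The multiplication convention (activations multiplied by $O$, weights by $O^{-1}$), the scale range for $D$, and all function names are assumptions for illustration; the post's exact formulation may differ.

```python
import numpy as np


def cayley_rotation(raw):
    """Map an arbitrary square matrix to an orthogonal one via the Cayley transform."""
    a = raw - raw.T                              # skew-symmetric part: a.T == -a
    eye = np.eye(a.shape[0])
    return np.linalg.solve(eye - a, eye + a)     # (I - A)^{-1} (I + A)


def block_diagonal(blocks):
    """Place square blocks along the diagonal of an otherwise zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    start = 0
    for b in blocks:
        size = b.shape[0]
        out[start:start + size, start:start + size] = b
        start += size
    return out


rng = np.random.default_rng(0)
hidden, block = 128, 32                          # illustrative sizes

# One independent 32x32 rotation per block of channels.
R = block_diagonal(
    [cayley_rotation(0.1 * rng.normal(size=(block, block)))
     for _ in range(hidden // block)]
)

# Hypothetical SmoothQuant-style per-channel scales.
d = rng.uniform(0.5, 2.0, size=hidden)
O = np.diag(d) @ R                               # combined transform O = D R
O_inv = R.T @ np.diag(1.0 / d)                   # inverse, since R^{-1} = R^T

x = rng.normal(size=(4, hidden))                 # activations (tokens x channels)
W = rng.normal(size=(hidden, 64))                # weights

y_ref = x @ W
y_transformed = (x @ O) @ (O_inv @ W)            # same product, transformed operands
print("max deviation:", np.max(np.abs(y_ref - y_transformed)))
```

Because $O$ is invertible, the layer output is unchanged in exact arithmetic; only the operands that get quantized change. The block-diagonal structure also keeps the online cost low: a dense rotation over 128 channels needs 128 × 128 multiplies per token, while four 32×32 blocks need 128 × 32.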
Results:
- Tested on Qwen3-8B, Qwen3-14B, and Qwen3-32B.
- Recovers 45-55% of the accuracy drop on n-shot tasks.
- Quantized models retain >98% of the original BF16 accuracy.
- Overhead is minimal as operations are fused or lightweight compared to attention/MLP layers.

Source: ROCm Tech Blog

Key takeaway relevant to AMD:

Optimizes the AITER library for MI300X, specifically addressing inefficiencies in LLM/RAG decoding where standard Radix Sort performs poorly at small $K$ values.
Demonstrates significant performance gains (up to 55%) in long-context scenarios using specific AMD ISA features (buffer_load, med3, DPP).

Summary:

AMD engineers identified that standard Radix Sort has fixed overheads (LDS atomics) that cause performance “cliffs” when selecting a small number of top tokens (Top-K).
They implemented an adaptive strategy that switches between Radix Sort and register-based Bitonic Sort depending on the workload size.

Details:

Performance Analysis: Profiling showed that SQ_ACTIVE_INST_LDS remained high even for small $K$ (e.g., $K=64$) with Radix Sort, indicating that histogram construction was the bottleneck.
New Strategies:
- BlockTopkSort: For small input per warp. Uses register-resident data and DPP (Data Parallel Primitives) for intra-warp communication (low latency).
- BlockTopkFilter: For larger inputs. Uses a ballot-based filtering pass to prune candidates before sorting.
Hardware Optimizations (via Opus Library):
- DPP: Used for ultra-low-latency permutations (opus::mov_dpp), faster than standard shuffles.
- med3: AMD’s median-of-3 instruction, used for branch-free comparisons, yielding a 32% performance improvement at batch 1024, length 3072, $K=128$.
- Buffer Instructions: Replaced pointer-based loads with buffer_load_dwordx4 (128-bit descriptors). This provided a 55% improvement on long sequences (Len 131,072) by maximizing bandwidth and reducing instruction count.
- Double Buffering: Implemented software pipelining to hide global memory latency behind compute.
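The two selection strategies can be illustrated with a CPU-side Python sketch. This is not the AITER implementation: plain lists stand in for registers, a list comprehension stands in for the ballot-based filter, and the switch point `SMALL_INPUT_LIMIT`, the threshold rule, and all function names are hypothetical. What it does show is a correct bitonic compare-exchange network and a filter-then-sort path that returns the same result as a full sort.

```python
import math
import random

SMALL_INPUT_LIMIT = 64  # hypothetical switch point between the two paths


def bitonic_sort_desc(values):
    """Sort descending with a bitonic network, padding to a power of two."""
    n = 1
    while n < len(values):
        n *= 2
    a = list(values) + [-math.inf] * (n - len(values))
    k = 2
    while k <= n:                      # size of the sequences being merged
        j = k // 2
        while j >= 1:                  # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    descending = (i & k) == 0
                    if (a[i] < a[partner]) == descending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a[:len(values)]             # padding sorts to the tail


def topk_sort(values, k):
    """Small-input path: sort everything, keep the first k."""
    return bitonic_sort_desc(values)[:k]


def topk_filter(values, k):
    """Large-input path: prune with a threshold, then sort the survivors."""
    # The k-th largest value of any subset is a lower bound on the k-th
    # largest value overall, so no true top-k element is filtered out.
    chunk = values[:max(k, SMALL_INPUT_LIMIT)]
    threshold = bitonic_sort_desc(chunk)[k - 1]
    survivors = [v for v in values if v >= threshold]
    return bitonic_sort_desc(survivors)[:k]


def adaptive_topk(values, k):
    """Pick a strategy from the input size; requires 1 <= k <= len(values)."""
    if len(values) <= SMALL_INPUT_LIMIT:
        return topk_sort(values, k)
    return topk_filter(values, k)


random.seed(0)
scores = [random.random() for _ in range(1000)]
top8 = adaptive_topk(scores, 8)
print(top8)
```

The bitonic network has a fixed, data-independent sequence of compare-exchange steps, which is what makes it a good fit for register-resident data exchanged between lanes.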

Source: Phoronix

Key takeaway relevant to AMD:

Indicates architectural improvements in upcoming Zen 6 “Venice” EPYC processors focused on reducing the performance penalty of encrypted virtualization (SEV-SNP).
Suggests massive scaling for next-gen servers: the code references CPU IDs up to 1023, i.e., 1,024 logical CPUs, which is consistent with two sockets of 256 cores/512 threads each.

Summary:

AMD submitted Linux kernel patches enabling a new instruction: RMPOPT.
This instruction optimizes Reverse Map Table (RMP) checks used in SEV-SNP security.

Details:

Functionality: RMPOPT allows the hypervisor and non-SNP guests to skip RMP checks when writing to memory, provided that specific 1GB regions are known not to contain any SNP guest memory.
Target Hardware: While the patches do not name a product, their timing and the CPU ID range (0-1023) point to EPYC Zen 6 “Venice”.
Implementation: The patch allows global enablement of RMPOPT for system RAM but includes safeguards (RMPUPDATE) to disable optimizations when SNP guests are launched to maintain security integrity.
Interfaces: Adds a configfs interface to re-enable the optimizations at runtime and a debugfs interface to report status.
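The idea behind the optimization can be pictured with a toy bookkeeping model in Python. This is a conceptual sketch only: it does not reproduce the instruction's actual semantics, the kernel patches, or any real interface, and the class and method names are hypothetical. It models one flag per 1 GB region that lets host writes skip the RMP lookup, plus the safeguard that clears the flag once a page in that region is assigned to an SNP guest.

```python
GIB = 1 << 30
PAGE = 4096


class RmpOptModel:
    """Toy model of per-1-GiB 'no SNP memory here' flags (not a kernel API)."""

    def __init__(self, ram_bytes):
        self.num_regions = (ram_bytes + GIB - 1) // GIB
        self.optimized = [False] * self.num_regions  # one flag per 1 GiB region
        self.snp_pages = set()                       # page frames owned by SNP guests
        self.rmp_lookups = 0                         # full RMP checks performed

    def enable_optimization(self):
        """Flag every region that currently holds no SNP guest page."""
        in_use = {(pfn * PAGE) // GIB for pfn in self.snp_pages}
        for region in range(self.num_regions):
            self.optimized[region] = region not in in_use

    def assign_to_snp_guest(self, addr):
        """Hand a page to an SNP guest; its region loses the fast path."""
        self.snp_pages.add(addr // PAGE)
        self.optimized[addr // GIB] = False

    def host_write_allowed(self, addr):
        """Return True if a hypervisor write to addr would be permitted."""
        if self.optimized[addr // GIB]:
            return True                  # fast path: region known to be SNP-free
        self.rmp_lookups += 1            # slow path: consult the modelled RMP
        return (addr // PAGE) not in self.snp_pages


model = RmpOptModel(4 * GIB)
model.enable_optimization()

fast_path_ok = model.host_write_allowed(0x1000)     # region 0 is flagged
lookups_after_fast = model.rmp_lookups

model.assign_to_snp_guest(GIB + PAGE)               # region 1 now holds guest memory
neighbour_ok = model.host_write_allowed(GIB)        # same region, different page
guest_page_ok = model.host_write_allowed(GIB + PAGE)
print(model.optimized, model.rmp_lookups)
```

In this model, the cost of protection is paid only in regions that actually contain guest memory, which matches the stated goal of reducing SEV-SNP overhead for the hypervisor and non-SNP guests.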

Source: Phoronix

Key takeaway relevant to AMD:

Highlights the ongoing friction between the open-source community and the HDMI Forum’s legal restrictions on public documentation/drivers.
Demonstrates “working” HDMI 2.1 Fixed Rate Link (FRL) on a Radeon RX 9070 XT, proving the hardware capability exists despite the software lock.

Summary:

An independent developer released experimental, out-of-tree kernel patches enabling HDMI 2.1 bandwidth on Linux for AMD GPUs.
This feature is currently blocked in the official upstream amdgpu driver due to HDMI Forum legal restrictions.

Details:

Hardware Tested: Radeon RX 9070 XT.
Methodology: Developed by analyzing Windows driver register states and cross-referencing with AMD-Xilinx FRL training code.
Status:
- ✅ FRL training works.
- ✅ HDR works.
- ❌ Display Stream Compression (DSC) not implemented.
- ❌ YCbCr 4:2:0 not implemented.
Implications: These patches cannot be upstreamed to the Linux kernel without HDMI Forum approval (which has been previously denied), forcing users to compile their own kernels to access full HDMI 2.1 bandwidth on Linux.