Technical Intelligence Report: 2026-01-26

Executive Summary

  • AMD Software (ROCm/Quark): Technical documentation revealed performance data for IBM Granite 4.0 on AMD Instinct MI300/MI355 GPUs. Using the AMD Quark quantization library (FP8), token throughput nearly doubled (~1.96x) compared to baseline.
  • AMD LLVM Compiler: AMD engineers proposed a new Device-Side Profile-Guided Optimization (PGO) scheme for the AMDGPU LLVM backend. The proposal introduces “uniformity-aware” profiling to prevent performance regressions caused by thread divergence, showing 12-14% speedups on uniform branches.
  • Competitor Analysis (NVIDIA): NVIDIA launched “Earth-2,” a suite of fully open-source AI weather models (Atlas, StormScope, HealDA). This move aggressively targets the scientific computing market, challenging traditional physics-based forecasting and setting a new bar for AI-driven meteorology performance.

🤖 ROCm Updates & Software

[2026-01-26] Accelerating IBM Granite 4.0 with FP8 using AMD Quark (MI300/MI355)

Source: ROCm Tech Blog (via GitHub Commit)

Key takeaway relevant to AMD:

  • New Hardware Mention: The documentation explicitly references the AMD Instinct MI355 alongside the MI300, confirming active software stack readiness for this hardware iteration.
  • Throughput Doubling: FP8 quantization via AMD Quark delivers a ~96% (~1.96x) throughput increase for Granite 4.0 models on AMD hardware, critical for competitive inference serving.
  • Tooling Maturity: Demonstrates a mature workflow using Quark, vLLM, and Safetensors on ROCm, lowering the barrier to adopting IBM’s enterprise-focused LLMs.

Summary:

  • A blog post draft (observed via Git history) details the quantization of IBM Granite 4.0 (8B) to FP8 using the AMD Quark library.
  • The workflow integrates with ROCm-optimized vLLM to serve the quantized model on MI300/MI355 GPUs.
  • Benchmarks show near-lossless accuracy recovery with significant performance gains.
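The core of the workflow described above is per-tensor scaling into the FP8 E4M3 range. The toy sketch below is plain Python, not the Quark API; the function names are illustrative, and a real pipeline would operate on packed FP8 payloads rather than Python floats. It shows the essential math: a max-abs scale into the E4M3 dynamic range, then round-to-nearest onto the 3-bit-mantissa grid.

```python
import math

E4M3_MAX = 448.0  # largest normal value in OCP FP8 E4M3 (fn variant)

def to_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value (toy model)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)               # saturate at the format max
    e = max(math.floor(math.log2(a)), -6)   # exponent, clamped to subnormal range
    mant = round(a / 2.0 ** e * 8) / 8      # 3 mantissa bits -> steps of 1/8
    return sign * min(mant * 2.0 ** e, E4M3_MAX)

def quantize_dequantize(tensor):
    """Per-tensor max-abs scaling into E4M3, then back to float."""
    scale = max(abs(v) for v in tensor) / E4M3_MAX
    return [to_e4m3(v / scale) * scale for v in tensor]
```

A production flow (Quark, vLLM) stores the FP8 payload and the scale separately and uses the GPU matrix cores for the FP8 math; this sketch only illustrates the rounding and scaling behavior that determines accuracy recovery.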

Details:

  • Hardware Tested: AMD Instinct MI300 and MI355.
  • Software Stack:
    • Docker: rocm/vllm-dev
    • Transformers: 4.56.0
    • Quark: 0.11 (AMD’s quantization library)
    • vLLM: 0.11.0
  • Quantization Process: Used AMD Quark to convert the ibm-granite/granite-4.0-h-small model to FP8 (E4M3/E5M2). This leverages native matrix-core support on MI300+.
  • Performance Benchmarks (MI300):
    • Throughput (Original/BF16): 13,018.16 tokens/second.
    • Throughput (FP8): 25,541.64 tokens/second.
    • Uplift: ~1.96x speedup.
  • Accuracy Recovery:
    • GSM8K: 98.75% recovery (FP8 84.53 vs BF16 85.60).
    • IFEVAL (Instruct, Strict): 100% recovery.
  • Implications: The documentation provides a reproducible script using Quark.torch and LLMTemplate to export safetensors, indicating that AMD’s quantization pipeline is stabilizing for third-party models.
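The uplift and recovery figures quoted above are internally consistent; the quick check below recomputes both from the numbers in the post.

```python
bf16_tps = 13018.16   # original BF16 throughput, tokens/s (from the post)
fp8_tps = 25541.64    # FP8 throughput, tokens/s

speedup = fp8_tps / bf16_tps   # ~1.96x, i.e. ~96% more tokens/s
recovery = 84.53 / 85.60       # GSM8K: FP8 score over BF16 score

print(f"speedup: {speedup:.2f}x")          # speedup: 1.96x
print(f"GSM8K recovery: {recovery:.2%}")   # GSM8K recovery: 98.75%
```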

[2026-01-26] AMD LLVM Device-Side PGO with Uniformity Detection

Source: Phoronix

Key takeaway relevant to AMD:

  • Compiler Optimization: AMD is upstreaming sophisticated compiler features to LLVM that address specific GPU architecture bottlenecks (SIMT divergence).
  • Performance Gains: Initial tests show double-digit percentage gains (12-14%) for uniform code paths, improving HIP/ROCm workload efficiency without hardware changes.
  • Safety Mechanism: The “uniformity-aware” feature prevents a common pitfall where CPU-style PGO hurts GPU performance by moving spills onto divergent paths that break memory coalescing.

Summary:

  • AMD engineer Sam Liu opened an LLVM pull request for Device-Side Profile Guided Optimization (PGO) for the AMDGPU backend.
  • The system uses standard HIP APIs (no CLR patches needed) for instrumentation and profile collection.
  • It specifically addresses the risks of register spilling in divergent branches on GPUs.

Details:

  • Problem: Standard CPU PGO often moves register spills to “cold” paths. On a GPU, if a “cold” path is taken by only some threads in a wave (divergence), it causes partial-wave memory accesses, leading to poor coalescing and up to 3.7x slowdowns.
  • Solution: Uniformity-aware PGO. The compiler instruments code to detect if branches are uniform (all threads take the same path) or divergent at runtime.
  • Optimization Strategy:
    • If a branch is uniform, the compiler applies aggressive optimizations (like spill placement), resulting in 12-14% speedup.
    • If a branch is divergent, the compiler gates these optimizations to prevent regression.
  • Technical Implementation:
    • Uses wave-aggregated counter increments to reduce atomic contention.
    • Uses per-TU (Translation Unit) contiguous counter allocation to avoid linker reordering issues.
  • Relevance: This is critical for maximizing performance in complex HIP applications where static analysis cannot predict runtime divergence.
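The two runtime mechanisms above can be illustrated with a small host-side simulation. This is plain Python, not the actual LLVM instrumentation, and `WAVE_SIZE`, `is_uniform`, and `wave_aggregated_increment` are illustrative names: a branch is uniform within a wave when every lane takes the same side, and a wave-aggregated counter performs one add of popcount(mask) instead of one atomic add per lane.

```python
WAVE_SIZE = 64  # lanes per wavefront on CDNA-class GPUs

def is_uniform(predicates):
    """A branch is uniform in a wave iff all lanes evaluate it the same way."""
    return all(predicates) or not any(predicates)

def wave_aggregated_increment(counter, taken_mask):
    """One aggregated add of popcount(mask) instead of per-lane atomic adds."""
    return counter + sum(taken_mask)

# Simulate profiling one branch across two waves.
uniform_wave = [True] * WAVE_SIZE                         # every lane takes it
divergent_wave = [i % 2 == 0 for i in range(WAVE_SIZE)]   # half the lanes do

taken = 0
uniform_seen = True
for wave in (uniform_wave, divergent_wave):
    uniform_seen &= is_uniform(wave)
    taken = wave_aggregated_increment(taken, wave)

# The PGO pass would gate aggressive spill placement on `uniform_seen`.
print(uniform_seen, taken)   # False 96
```

The design point this captures: a single divergent observation is enough to disqualify the branch from uniform-only optimizations, while the aggregated counter keeps profiling overhead low regardless of how many lanes are active.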

🤼‍♂️ Market & Competitors

[2026-01-26] NVIDIA Launches Earth-2 Family of Open Models

Source: NVIDIA Blog & The Next Platform

Key takeaway relevant to AMD:

  • Competitive Moat: NVIDIA is entrenching itself in the scientific computing/HPC vertical by releasing open “foundation models” for weather. To compete, AMD must ensure these open models (available on Hugging Face) run efficiently on MI300 to avoid losing the government/meteorological sector.
  • Architecture Shift: The move confirms the industry shift from physics-based numerical weather prediction (NWP) to AI-based prediction, which runs significantly faster on GPUs.

Summary:

  • NVIDIA unveiled the “Earth-2” family of open models, libraries, and frameworks for AI weather forecasting.
  • The release includes three distinct architectures targeting different forecast horizons (Nowcasting to Medium Range).
  • Major entities (Israel Meteorological Service, The Weather Company, NOAA/NWS) are already evaluating or deploying these models.

Details:

  • New Models Released:
    1. Earth-2 Medium Range: Based on the “Atlas” architecture. Predicts up to 15 days out. Outperforms Google’s GenCast.
    2. Earth-2 Nowcasting: Based on the “StormScope” architecture. Focuses on 0-6 hour prediction at kilometer-scale resolution. Uses generative AI/transformers.
    3. Earth-2 Global Data Assimilation: Based on “HealDA” architecture. Used to create initial atmospheric snapshots.
  • Performance:
    • Israel Meteorological Service reports a 90% reduction in compute time at 2.5km resolution compared to running classic numerical models on a CPU cluster.
    • The CorrDiff model is cited as superior in precipitation verification.
  • Availability: Models are available via NVIDIA Earth2Studio, Hugging Face, and GitHub.
  • Ecosystem Integration: Supports integration with NVIDIA Modulus (PhysicsNeMo) for fine-tuning.
  • Strategy: By making these open source, NVIDIA encourages standardization on their CUDA-optimized architectures (“Atlas”, “StormScope”), potentially sidelining competitors if ROCm support for these specific architectures lags.