Technical Intelligence Report: 2026-01-26

Executive Summary

  • AMD Software (ROCm/Quark): Technical documentation revealed performance data for IBM Granite 4.0 on AMD Instinct MI300/MI355 GPUs. Using the AMD Quark quantization library (FP8), token throughput nearly doubled (~1.96x) compared to baseline.
  • AMD LLVM Compiler: AMD engineers proposed a new Device-Side Profile-Guided Optimization (PGO) scheme for the AMDGPU LLVM backend. The proposal introduces “uniformity-aware” profiling to prevent performance regressions caused by thread divergence, showing 12-14% speedups on uniform branches.
  • Competitor Analysis (NVIDIA): NVIDIA launched “Earth-2,” a suite of fully open-source AI weather models (Atlas, StormScope, HealDA). This move aggressively targets the scientific computing market, challenging traditional physics-based forecasting and setting a new bar for AI-driven meteorology performance.

🤖 ROCm Updates & Software

[2026-01-26] Accelerating IBM Granite 4.0 with FP8 using AMD Quark (MI300/MI355)

Source: ROCm Tech Blog (via GitHub Commit)

Key takeaway relevant to AMD:

  • New Hardware Mention: The documentation explicitly references the AMD Instinct MI355 alongside the MI300, confirming active software stack readiness for this hardware iteration.
  • Throughput Doubling: FP8 quantization via AMD Quark delivers a ~96% (~1.96x) throughput increase for Granite 4.0 models on AMD hardware, critical for competitive inference serving.
  • Tooling Maturity: Demonstrates a mature workflow using Quark, vLLM, and Safetensors on ROCm, lowering the barrier to adopting IBM’s enterprise-focused LLMs.

Summary:

  • A blog post draft (observed via Git history) details the quantization of IBM Granite 4.0 (8B) to FP8 using the AMD Quark library.
  • The workflow integrates with ROCm-optimized vLLM to serve the quantized model on MI300/MI355 GPUs.
  • Benchmarks show near-lossless accuracy recovery with significant performance gains.
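The core of the workflow described above is per-tensor scaling into the FP8 E4M3 range. The toy sketch below is plain Python, not the Quark API; the function names are illustrative, and a real pipeline would operate on packed FP8 payloads rather than Python floats. It shows the essential math: a max-abs scale into the E4M3 dynamic range, then round-to-nearest onto the 3-bit-mantissa grid.

```python
import math

E4M3_MAX = 448.0  # largest normal value in OCP FP8 E4M3 (fn variant)

def to_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value (toy model)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)               # saturate at the format max
    e = max(math.floor(math.log2(a)), -6)   # exponent, clamped to subnormal range
    mant = round(a / 2.0 ** e * 8) / 8      # 3 mantissa bits -> steps of 1/8
    return sign * min(mant * 2.0 ** e, E4M3_MAX)

def quantize_dequantize(tensor):
    """Per-tensor max-abs scaling into E4M3, then back to float."""
    scale = max(abs(v) for v in tensor) / E4M3_MAX
    return [to_e4m3(v / scale) * scale for v in tensor]
```

A production flow (Quark, vLLM) stores the FP8 payload and the scale separately and uses the GPU matrix cores for the FP8 math; this sketch only illustrates the rounding and scaling behavior that determines accuracy recovery.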

Details:

  • Hardware Tested: AMD Instinct MI300 and MI355.
  • Software Stack:
    • Docker: rocm/vllm-dev
    • Transformers: 4.56.0
    • Quark: 0.11 (AMD’s quantization library)
    • vLLM: 0.11.0
  • Quantization Process: Used AMD Quark to convert the ibm-granite/granite-4.0-h-small model to FP8 (E4M3/E5M2). This leverages native matrix-core support on MI300+.
  • Performance Benchmarks (MI300):
    • Throughput (Original/BF16): 13,018.16 tokens/second.
    • Throughput (FP8): 25,541.64 tokens/second.
    • Uplift: ~1.96x speedup.
  • Accuracy Recovery:
    • GSM8K: 98.75% recovery (FP8 84.53 vs BF16 85.60).
    • IFEVAL (Instruct, Strict): 100% recovery.
  • Implications: The documentation provides a reproducible script using Quark.torch and LLMTemplate to export safetensors, indicating that AMD’s quantization pipeline is stabilizing for third-party models.
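The uplift and recovery figures quoted above are internally consistent; the quick check below recomputes both from the numbers in the post.

```python
bf16_tps = 13018.16   # original BF16 throughput, tokens/s (from the post)
fp8_tps = 25541.64    # FP8 throughput, tokens/s

speedup = fp8_tps / bf16_tps   # ~1.96x, i.e. ~96% more tokens/s
recovery = 84.53 / 85.60       # GSM8K: FP8 score over BF16 score

print(f"speedup: {speedup:.2f}x")          # speedup: 1.96x
print(f"GSM8K recovery: {recovery:.2%}")   # GSM8K recovery: 98.75%
```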

[2026-01-26] AMD LLVM Device-Side PGO with Uniformity Detection

Source: Phoronix

Key takeaway relevant to AMD:

  • Compiler Optimization: AMD is upstreaming sophisticated compiler features to LLVM that address specific GPU architecture bottlenecks (SIMT divergence).
  • Performance Gains: Initial tests show double-digit percentage gains (12-14%) for uniform code paths, improving HIP/ROCm workload efficiency without hardware changes.
  • Safety Mechanism: The “uniformity-aware” feature prevents a common pitfall where CPU-style PGO hurts GPU performance by moving spills onto divergent paths that break memory coalescing.

Summary:

  • AMD engineer Sam Liu opened an LLVM pull request for Device-Side Profile Guided Optimization (PGO) for the AMDGPU backend.
  • The system uses standard HIP APIs (no CLR patches needed) for instrumentation and profile collection.
  • It specifically addresses the risks of register spilling in divergent branches on GPUs.

Details:

  • Problem: Standard CPU PGO often moves register spills to “cold” paths. On a GPU, if a “cold” path is taken by only some threads in a wave (divergence), it causes partial-wave memory accesses, leading to poor coalescing and up to 3.7x slowdowns.
  • Solution: Uniformity-aware PGO. The compiler instruments code to detect if branches are uniform (all threads take the same path) or divergent at runtime.
  • Optimization Strategy:
    • If a branch is uniform, the compiler applies aggressive optimizations (like spill placement), resulting in 12-14% speedup.
    • If a branch is divergent, the compiler gates these optimizations to prevent regression.
  • Technical Implementation:
    • Uses wave-aggregated counter increments to reduce atomic contention.
    • Uses per-TU (Translation Unit) contiguous counter allocation to avoid linker reordering issues.
  • Relevance: This is critical for maximizing performance in complex HIP applications where static analysis cannot predict runtime divergence.
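The two runtime mechanisms above can be illustrated with a small host-side simulation. This is plain Python, not the actual LLVM instrumentation, and `WAVE_SIZE`, `is_uniform`, and `wave_aggregated_increment` are illustrative names: a branch is uniform within a wave when every lane takes the same side, and a wave-aggregated counter performs one add of popcount(mask) instead of one atomic add per lane.

```python
WAVE_SIZE = 64  # lanes per wavefront on CDNA-class GPUs

def is_uniform(predicates):
    """A branch is uniform in a wave iff all lanes evaluate it the same way."""
    return all(predicates) or not any(predicates)

def wave_aggregated_increment(counter, taken_mask):
    """One aggregated add of popcount(mask) instead of per-lane atomic adds."""
    return counter + sum(taken_mask)

# Simulate profiling one branch across two waves.
uniform_wave = [True] * WAVE_SIZE                         # every lane takes it
divergent_wave = [i % 2 == 0 for i in range(WAVE_SIZE)]   # half the lanes do

taken = 0
uniform_seen = True
for wave in (uniform_wave, divergent_wave):
    uniform_seen &= is_uniform(wave)
    taken = wave_aggregated_increment(taken, wave)

# The PGO pass would gate aggressive spill placement on `uniform_seen`.
print(uniform_seen, taken)   # False 96
```

The design point this captures: a single divergent observation is enough to disqualify the branch from uniform-only optimizations, while the aggregated counter keeps profiling overhead low regardless of how many lanes are active.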

🤼‍♂️ Market & Competitors

[2026-01-26] NVIDIA Launches Earth-2 Family of Open Models

Source: NVIDIA Blog & The Next Platform

Key takeaway relevant to AMD:

  • Competitive Moat: NVIDIA is entrenching itself in the scientific computing/HPC vertical by releasing open “foundation models” for weather. To compete, AMD must ensure these open models (available on Hugging Face) run efficiently on MI300 to avoid losing the government/meteorological sector.
  • Architecture Shift: The move confirms the industry shift from physics-based numerical weather prediction (NWP) to AI-based prediction, which runs significantly faster on GPUs.

Summary:

  • NVIDIA unveiled the “Earth-2” family of open models, libraries, and frameworks for AI weather forecasting.
  • The release includes three distinct architectures targeting different forecast horizons (Nowcasting to Medium Range).
  • Major entities (Israel Meteorological Service, The Weather Company, NOAA/NWS) are already evaluating or deploying these models.

Details:

  • New Models Released:
    1. Earth-2 Medium Range: Based on the “Atlas” architecture. Predicts up to 15 days out. Outperforms Google’s GenCast.
    2. Earth-2 Nowcasting: Based on the “StormScope” architecture. Focuses on 0-6 hour prediction at kilometer-scale resolution. Uses generative AI/transformers.
    3. Earth-2 Global Data Assimilation: Based on “HealDA” architecture. Used to create initial atmospheric snapshots.
  • Performance:
    • Israel Meteorological Service reports a 90% reduction in compute time at 2.5km resolution compared to running classic numerical models on a CPU cluster.
    • The CorrDiff model is cited as superior in precipitation verification.
  • Availability: Models are available via NVIDIA Earth2Studio, Hugging Face, and GitHub.
  • Ecosystem Integration: Supports integration with NVIDIA Modulus (PhysicsNeMo) for fine-tuning.
  • Strategy: By making these open source, NVIDIA encourages standardization on their CUDA-optimized architectures (“Atlas”, “StormScope”), potentially sidelining competitors if ROCm support for these specific architectures lags.