📅 Engineering Report (2025-11-01 - 2025-11-30)

🚀 Executive Summary

November 2025 was a major month for next-generation hardware adoption across the AI ecosystem. Foundational software releases landed concurrently for the newest architectures from both AMD (MI350X/MI355X) and NVIDIA (Blackwell SM100 and GB300). The month also showed a strong shift toward advanced Reinforcement Learning (RL) and multi-token prediction frameworks, with major architecture overhauls in verl (a fully asynchronous policy trainer for PPO) and slime (an FSDP-based training backend). Maintenance health across the tracked repositories is strong: large codebases such as XLA, Triton, and MaxText are closing PRs at a rate close to or above new PR creation (closed/new: 1103/1149, 214/216, and 188/177 respectively), indicating sustainable engineering velocity.
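
The PR throughput observation can be checked against the Metrics lines later in this report. The snippet below is a minimal sketch: it uses only the PR counts printed in this report, and the `close_ratio` helper is a name introduced here for illustration.

```python
# PR counts copied from the Metrics lines of this report: (new, closed).
pr_counts = {
    "openxla/xla": (1149, 1103),
    "triton-lang/triton": (216, 214),
    "AI-Hypercomputer/maxtext": (177, 188),
}


def close_ratio(new: int, closed: int) -> float:
    """Closed PRs divided by new PRs; 1.0 means the open-PR backlog is flat."""
    return closed / new


ratios = {repo: close_ratio(*counts) for repo, counts in pr_counts.items()}

for repo, ratio in ratios.items():
    print(f"{repo}: {ratio:.2f}")
```

A ratio just under 1.0 (XLA, Triton) means the backlog grew slightly over the month; a ratio above 1.0 (MaxText) means it shrank.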

  • ROCm 7.1 Series Dominates: AMD released both ROCm 7.1.0 and 7.1.1 this month, officially extending support and telemetry to the MI325X, MI350X, and MI355X hardware lines.
  • Virtualization & Enterprise Ready: Substantial updates were made to KVM SR-IOV support, adding Ubuntu 24.04 and RHEL 10.1 as Guest OSes for MI300X, greatly expanding enterprise deployment flexibility.
  • Performance Upgrades: The introduction of “Origami” for GEMM kernel selection reduces tuning time and improves out-of-the-box matrix multiplication performance. Additionally, CK/AITER fused-attn now natively supports padding, eliminating previous Transformer Engine workarounds.
  • Ecosystem Recognition: torchtitan officially acknowledged AMD forks, and TraceLens merged JAX/HLO kernel busy-time calculations, signaling deepening integrations of AMD hardware into upstream open-source profiling and training workflows.

Competitive Analysis

  • NVIDIA Blackwell is Here: Competing ecosystems shipped day-zero software enablement for Blackwell (SM100 / GB300). TransformerEngine v2.9 added DeepSeek v3-style FP8 block scaling, xformers shipped a dedicated cutlass FMHA op for Blackwell, and triton released a hotfix (v3.5.1) specifically to restore the sm103 (GB300) support that was broken in 3.5.0.
  • NVFP4 Expansion: NVIDIA is pushing lower-precision training aggressively. TransformerEngine introduced JAX support for the NVFP4 training recipe, continuing the trend of reducing memory footprint and bandwidth demand through narrower number formats.
  • DeepSeek Compute/Comms Overlap: xformers integrated forward + backward pass overlap modeled on DeepSeek-style communication/compute overlap, showing how quickly libraries are adapting to recent Chinese MoE architectural trends.

📂 Category Updates

⚡ AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2025-11-17] Added CLI auto-discovery for subcommands and reorganized backend documentation.
    • [2025-11-14] Major documentation restructuring.
  • Details:
    • [2025-11-17] Merged the moe package v1.2 PR.
    • [2025-11-17] Added support for custom model config and model args via CLI for the maxtext backend.
  • Metrics: 52 New PRs (45 Closed) 1 New Issue (1 Closed)

AMD-AGI/TraceLens

  • Key Activity:
    • [2025-11-14] JAX integration enhancements for GPU profiling.
  • Details:
    • [2025-11-14] Addressed kernel busy time calculation based on HLO ops for JAX.
    • [2025-11-14] Classified memset events under the memcpy category in the GPUEventAnalyzer.
  • Metrics: 17 New PRs (19 Closed) 14 New Issues (9 Closed)
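
As a rough illustration of what a kernel busy-time calculation involves, the sketch below merges overlapping kernel intervals so that concurrent kernels are not counted twice, then groups the result by HLO op. It is not TraceLens code: the event tuples, op names, and function names are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[float, float]        # (start_us, end_us) of one kernel
Event = Tuple[str, float, float]      # (hlo_op_name, start_us, end_us)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Return the union of the intervals as sorted, non-overlapping spans."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def busy_time(intervals: Iterable[Interval]) -> float:
    """Total time covered by at least one kernel."""
    return sum(end - start for start, end in merge_intervals(intervals))


def busy_time_per_op(events: Iterable[Event]) -> Dict[str, float]:
    """Busy time grouped by the HLO op that launched each kernel."""
    by_op: Dict[str, List[Interval]] = defaultdict(list)
    for op_name, start, end in events:
        by_op[op_name].append((start, end))
    return {op: busy_time(spans) for op, spans in by_op.items()}


# Hypothetical trace events; names and timestamps are made up.
events = [
    ("fusion.1", 0.0, 10.0),
    ("fusion.1", 5.0, 12.0),
    ("all-reduce.2", 20.0, 25.0),
    ("all-reduce.2", 24.0, 30.0),
]

per_op = busy_time_per_op(events)
total = busy_time((start, end) for _, start, end in events)
print(per_op, total)
```

Summing raw kernel durations here would give 28 µs, while the merged busy time is 22 µs, which is why the interval union matters when kernels overlap.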

ROCm/ROCm

  • Key Activity:
    • [2025-11-26] 🚨 RELEASE: rocm-7.1.1
    • [2025-11-03] 🚨 RELEASE: rocm-7.1.0
  • Details:
    • [2025-11-26] Added RHEL 10.1 support, extended Debian 13 support to MI355X/MI350X, and added Ubuntu 24.04 as a Guest OS in KVM SR-IOV. Fixed an MI325X SR-IOV Mode 1 reset issue.
    • [2025-11-03] Enabled nested tile partitioning in HIP cooperative groups (matching CUDA). Delivered hipBLASLt optimizations for FP8/TF32 on MI350X/MI355X.
    • [2025-11-30] A community member reported poor performance on the Radeon RX 7900 XTX after upgrading to 7.1.1.
  • Metrics: 63 New PRs (70 Closed) 47 New Issues (27 Closed)

ROCm/MAD

  • Key Activity:
    • [2025-11-14] Minor script and documentation fixes for vLLM environments.
  • Details:
    • [2025-11-14] Fixed unified attention typo in vllm script.
  • Metrics: 5 New PRs (5 Closed) 0 New Issues

🔥 PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • [2025-11-12] 🚨 RELEASE: v2.9.1
  • Details:
    • [2025-11-12] Fixed a massive memory regression in F.conv3d with bfloat16 inputs.
    • [2025-11-12] torch.compile improvements: fixed bugs compiling Gemma, cached get_free_symbol_uses, and fixed registration design for inductor graph partition for vLLM.
  • Metrics: 1481 New PRs (1328 Closed) 1007 New Issues (411 Closed)

pytorch/torchtitan

  • Key Activity:
    • [2025-11-06] Added official blurb for the AMD fork.
  • Details:
    • [2025-11-14] Added ability to precompile torchtitan models and skipped varlen integration testing on ROCm.
  • Metrics: 88 New PRs (76 Closed) 28 New Issues (10 Closed)

pytorch/vision & pytorch/audio

  • Key Activity:
    • [2025-11-12] 🚨 RELEASE: v0.24.1 (Vision) & v2.9.1 (Audio)
  • Details:
    • [2025-11-12] Patch releases issued exclusively to ensure compatibility with PyTorch 2.9.1.

🟩 NVIDIA & General Deep Learning Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-11-11] 🚨 RELEASE: v2.9
  • Details:
    • [2025-11-11] Added PyTorch support for DeepSeek v3-style FP8 block scaling on NVIDIA Blackwell (SM100).
    • [2025-11-11] Added JAX support for the NVFP4 training recipe.
    • [2025-11-11] Added CPU offload support for all attention layouts.
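
For readers unfamiliar with block scaling, the sketch below shows the general idea in plain Python: each contiguous block of values gets its own scale factor so that the block's largest magnitude maps to the FP8 E4M3 maximum (448). This is not TransformerEngine's API. The function names and the tiny block size are assumptions for illustration (DeepSeek-V3 describes 1x128 tiles for activations and 128x128 blocks for weights), and the actual rounding to FP8 is omitted.

```python
from typing import List, Tuple

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def block_scales(values: List[float], block_size: int) -> List[float]:
    """Compute one scale per contiguous block of `block_size` values."""
    scales = []
    for i in range(0, len(values), block_size):
        amax = max(abs(v) for v in values[i:i + block_size])
        # An all-zero block keeps scale 1.0 to avoid dividing by zero.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_blocks(values: List[float], block_size: int) -> Tuple[List[float], List[float]]:
    """Divide each value by its block's scale; a real kernel would then cast to FP8."""
    scales = block_scales(values, block_size)
    scaled = [v / scales[i // block_size] for i, v in enumerate(values)]
    return scaled, scales


# Two blocks with very different dynamic ranges (made-up numbers).
values = [0.5, -2.0, 1.0, 0.25, 100.0, -50.0, 896.0, 10.0]
scaled, scales = scale_blocks(values, block_size=4)
print(scales)
print(scaled)
```

With a single per-tensor scale, the outlier 896.0 would push the small values in the first block toward the bottom of the FP8 range; per-block scales let each block use the full range independently.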

jax-ml/jax

  • Key Activity:
    • [2025-11-18] 🚨 RELEASE: jax-v0.8.1
  • Details:
    • [2025-11-18] jax.lax.linalg.eigh now accepts an implementation argument to select between the QR, Jacobi, and QDWH algorithms.
    • [2025-11-18] Fixed a bug causing eigh to fail for large matrices on GPUs.

AI-Hypercomputer/maxtext

  • Key Activity:
    • [2025-11-06] 🚨 RELEASE: maxtext-v0.1.0
    • [2025-11-20] 🚨 RELEASE: maxtext-tutorial-v1.3.0
  • Details:
    • [2025-11-06] MaxText published its first official PyPI package, making the JAX LLM training library installable via pip.
  • Metrics: 177 New PRs (188 Closed) 10 New Issues (3 Closed)

triton-lang/triton

  • Key Activity:
    • [2025-11-12] 🚨 RELEASE: v3.5.1
  • Details:
    • [2025-11-12] Emergency hotfix to repair sm103 (GB300 / Blackwell) support that was broken in the 3.5.0 release.
    • [2025-11-22] Added backend support for out-of-tree TTIR/TTGIR passes.
  • Metrics: 216 New PRs (214 Closed) 26 New Issues (20 Closed)

facebookresearch/xformers

  • Key Activity:
    • [2025-11-12] 🚨 RELEASE: v0.0.33
  • Details:
    • [2025-11-12] Added cutlass fmha Op specifically for Blackwell GPUs.
    • [2025-11-12] Implemented Forward + Backward pass overlap to support DeepSeek-like communication/compute overlap workflows.

openxla/xla

  • Key Activity:
    • [2025-11-14] Significant codebase refactoring around IR emitters.
  • Details:
    • [2025-11-14] Merged ir_emitter and ir_emitter_nested for XLA:GPU.
  • Metrics: 1149 New PRs (1103 Closed) 9 New Issues (17 Closed)

🧠 LLM & RL Training Frameworks

volcengine/verl

  • Key Activity:
    • [2025-11-14] 🚨 RELEASE: v0.6.1
  • Details:
    • [2025-11-14] Introduced the “Fully Async Policy Trainer” to decouple the Trainer and Rollouter, allowing asynchronous PPO sample generation.
    • [2025-11-14] Added the TransferQueue Data System for asynchronous streaming data management during post-training.
    • [2025-11-14] Megatron updates: supported 1f1b_overlap, Qwen3VL MoE/Dense, and Qwen2.5 with context parallelism.
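
The decoupling of sample generation from training described above can be pictured as a producer/consumer pair connected by a bounded queue. The sketch below uses only Python's asyncio and is not verl's implementation or API: the `rollouter` and `trainer` coroutines, the sample dictionaries, and the queue size are all assumptions chosen for illustration.

```python
import asyncio
from typing import List, Optional


async def rollouter(queue: asyncio.Queue, num_samples: int) -> None:
    """Producer: generates samples and streams them into the queue."""
    for step in range(num_samples):
        await asyncio.sleep(0)  # stand-in for model generation
        await queue.put({"sample_id": step, "reward": float(step)})
    await queue.put(None)  # sentinel: no more samples


async def trainer(queue: asyncio.Queue, batch_size: int) -> List[List[int]]:
    """Consumer: pulls samples as they arrive and trains on full batches."""
    batches: List[List[int]] = []
    batch: List[int] = []
    while True:
        sample: Optional[dict] = await queue.get()
        if sample is None:
            break
        batch.append(sample["sample_id"])
        if len(batch) == batch_size:
            batches.append(batch)  # stand-in for one policy update step
            batch = []
    if batch:
        batches.append(batch)  # flush the final partial batch
    return batches


async def main() -> List[List[int]]:
    # The bounded queue limits how far generation can run ahead of training.
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    _, batches = await asyncio.gather(rollouter(queue, 5), trainer(queue, 2))
    return batches


batches = asyncio.run(main())
print(batches)  # [[0, 1], [2, 3], [4]]
```

In this toy version the queue bound is the only control on how stale a sample can be when the trainer consumes it; a production system would also need to track which policy version produced each sample.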

THUDM/slime

  • Key Activity:
    • [2025-11-28] 🚨 RELEASE: v0.2.0
  • Details:
    • [2025-11-28] Massive update introducing a Fully Sharded Data Parallel (FSDP) based training backend.
    • [2025-11-28] Added native support for PPO, FP8 full stack (train + infer), and Multi-Token Prediction (MTP) during RL.
    • [2025-11-28] Alleviated train-inference mismatch using Rollout Routing Replay (R3).

llm-d/llm-d

  • Key Activity:
    • [2025-11-26] 🚨 RELEASE: v0.4.0
    • [2025-11-06] 🚨 RELEASE: v0.3.1
  • Details:
    • [2025-11-26] Shipped component updates including llm-d-inference-scheduler, CPU variants (llm-d-cpu), and the Gateway API Inference Extension.
    • [2025-11-06] Refactored Dockerfiles to bash scripts, unified GKE image into core CUDA image, and added experimental ARM support.

xdit-project/xDiT

  • Key Activity:
    • [2025-11-12] 🚨 RELEASE: 0.4.5
  • Details:
    • [2025-11-12] Added Apple Silicon (MPS) support and Wan 2.X I2V models.
    • [2025-11-28] Upgraded to support FLUX.2 natively via diffusers format.
  • Metrics: 8 New PRs (8 Closed) 5 New Issues (2 Closed)

deepspeedai/DeepSpeed

  • Key Activity:
    • [2025-11-05] 🚨 RELEASE: v0.18.2
  • Details:
    • [2025-11-05] Added fp32 weight deduplication under torch autocast and ZeRO3.
    • [2025-11-05] Enhanced Ulysses Sequence Parallelism (UlyssesSP) API for variable sequence lengths.