📅 Engineering Report (2025-09-01 - 2025-09-30)

🚀 Executive Summary

September 2025 was a monumental month, defined by the leap into sub-8-bit precision formats (FP4, FP6, MXFP8) and by major hardware-enablement releases across the industry. The defining event was the release of ROCm 7.0, which fundamentally restructured AMD's software stack, integrated the OCP MX data types, and brought day-zero support for the MI350X/MI355X accelerators.

Across the wider ecosystem, we are witnessing rapid infrastructure stabilization for next-generation models (Llama 4, DeepSeek V3/R1). Frameworks like PyTorch's torchao, NVIDIA's TransformerEngine, and emerging compilers like TileLang are aggressively optimizing mixed-precision Quantization-Aware Training (QAT), Mixture-of-Experts (MoE) routing, and Tensor Parallelism.

  • 🚨 The ROCm 7.0 Era Begins: ROCm 7.0.0 (and the immediate 7.0.1 quality patch) introduces full hardware support for the MI350X/MI355X accelerators. Crucially, it brings functional support for OCP MX-compliant data types (FP4, FP6, FP8) across HIP, Composable Kernel, hipBLASLt, and MIGraphX.
  • Software Stack Maturation & Consolidation: AMD decoupled the amdgpu driver from the core ROCm software stack for independent versioning. Furthermore, 15 disparate ROCm libraries (hipBLAS, rocPRIM, etc.) have been migrated into a unified rocm-libraries repository to streamline CI/CD and developer contributions.
  • Model Readiness in Primus: Primus v0.2.0 heavily targeted incoming frontier models with the integration of LightMegatronPretrainTrainer, initial configurations for Llama-4 (including 17B-16E and 17B-128E MoE variants), and memory/garbage-collection optimizations.
  • Expanding 3rd-Party AMD CI/CD: Upstream ecosystem health is strengthening. PyTorch torchao merged AMD MI300X float8 benchmarks, JAX introduced experimental WSL2 support on ROCm, and TileLang formally added AMD GPU CI alongside Flash Attention implementations for the MI300 series.
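
The OCP MX formats called out above share one power-of-two scale per 32-element block (the MX spec's block size), with the elements themselves stored in a low-precision type such as FP8 E4M3. A minimal NumPy sketch of that block-scaling step, with element rounding to E4M3 omitted for brevity (all names here are illustrative, not a ROCm API):

```python
import numpy as np

np.random.seed(0)
BLOCK = 32          # MX block size per the OCP MX spec
E4M3_MAX = 448.0    # largest finite FP8 E4M3 value

def mx_scale(block):
    """Shared power-of-two (E8M0-style) scale for one block."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return 1.0
    # map the block's max magnitude into the FP8 E4M3 range
    return 2.0 ** (np.floor(np.log2(amax)) - np.floor(np.log2(E4M3_MAX)))

def mxfp8_fake_quant(x):
    """Simulate MXFP8 block scaling (element rounding to E4M3 omitted)."""
    out = np.empty_like(x)
    for i in range(0, len(x), BLOCK):
        blk = x[i:i + BLOCK]
        s = mx_scale(blk)
        out[i:i + BLOCK] = np.clip(blk / s, -E4M3_MAX, E4M3_MAX) * s
    return out

x = np.random.randn(128) * 10.0
y = mxfp8_fake_quant(x)
```

Because the scale is a power of two, scaling is exact in binary floating point; in this simplified model only elements pushed past the E4M3 range saturate.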

Competitive Analysis

  • NVIDIA Optimizes for the B200: PyTorch torchao achieved a 1.2x MXFP8 dense pretraining speedup on the B200. Concurrently, NVIDIA's TransformerEngine v2.6 focused on MoE permute fusions and PyTorch FSDP gradient accumulation, maintaining a tight grip on high-end enterprise training optimization.
  • DeepSeek's Ecosystem Remains Agile: DeepSeek's DeepEP (v1.2.1) added permute extensions to its hybrid-ep (Expert Parallelism) path and introduced CUDA Graph support for internode dispatch, demonstrating a relentless focus on high-efficiency MoE scaling.
  • Ascend & Alternative Compilers Gaining Ground: The TileLang compiler released v0.1.6, notably adding support for Huawei Ascend chips alongside Blackwell (SM100) and ROCm. The race for unified, high-performance cross-hardware JIT compilation is heating up, threatening vendor lock-in strategies.
  • Triton Growing Pains on Next-Gen: Triton is experiencing optimization friction on next-gen architectures, evidenced by a reported 18% performance regression in flex_attention_fwd on the B200, highlighting that scaling to Blackwell is not without low-level software hurdles.

📂 Category Updates

🟢 AMD Ecosystem

ROCm/ROCm

  • Key Activity:
    • [2025-09-16] 🚨 RELEASE: rocm-7.0.0
    • [2025-09-17] 🚨 RELEASE: rocm-7.0.1
  • Details:
    • [2025-09-16] Released ROCm 7.0.0 featuring day-zero support for AMD Instinct MI355X and MI350X GPUs.
    • [2025-09-16] Enabled Open Compute Project (OCP) MX floating-point FP4, FP6, and FP8 data types.
    • [2025-09-16] Consolidated 15 mathematical and primitive libraries into a single rocm-libraries mono-repo. Separated the AMD GPU Driver (amdgpu) from ROCm packaging.
    • [2025-09-17] Pushed a rapid 7.0.1 hotfix to resolve out-of-bound CPER declarations for bad memory pages on MI300/MI350 series GPUs.
  • Metrics: Extensive internal milestones tracking hardware/firmware dependencies.

AMD-AGI/Primus

  • Key Activity:
    • [2025-09-11] 🚨 RELEASE: v0.2.0
    • [2025-09-25] DOC UPDATE: Primus product matrix updated.
  • Details:
    • [2025-09-11] Shipped LightMegatronPretrainTrainer for clean config-based integration.
    • [2025-09-11] Added initial Llama-4 configs (Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E).
    • [2025-09-11] Implemented SyncFree MoE logic and addressed performance regressions in 8B models.
  • Metrics: 40 New PRs | 40 Closed PRs | 1 New Issue | 0 Closed Issues. (Excellent maintenance health: 100% PR closure rate).

AMD-AGI/TraceLens

  • Key Activity:
    • [2025-09-XX] Ongoing development for kernel launchers and testing.
  • Details:
    • [2025-09-XX] Added initial test_graph_mode.py and support for trtllm::cublas_scaled_mm.
    • [2025-09-XX] Clarified UID vs Event logic in the API and renamed kernel launchers to leaf ops.
  • Metrics: 17 New PRs | 17 Closed PRs | 31 New Issues | 9 Closed Issues. (Good PR velocity, but rising issue backlog).

ROCm/MAD

  • Key Activity:
    • [2025-09-XX] General maintenance and CLI fixes.
  • Details:
    • [2025-09-XX] Fixed Hugging Face CLI deprecated warnings and pushed AAC fixes.
  • Metrics: 16 New PRs | 13 Closed PRs | 1 New Issue | 1 Closed Issue.

🔥 PyTorch Ecosystem

pytorch/ao

  • Key Activity:
    • [2025-09-02] 🚨 RELEASE: v0.13.0-rc8
  • Details:
    • [2025-09-02] Added simpler Multi-step QAT (Quantization-Aware Training) APIs.
    • [2025-09-02] Introduced prototype NVFP4 and FP8 QAT support.
    • [2025-09-02] Achieved 1.2x MXFP8 dense pretraining speedups on NVIDIA B200. Updated README includes AMD MI300X benchmark results.
  • Metrics: N/A (Release focus).
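
QAT APIs like the ones above revolve around inserting fake quantization into training so the model learns to tolerate rounding, while gradients pass through the non-differentiable round via a straight-through estimator. A generic PyTorch sketch of the idea (this is not the torchao API; FakeQuantSTE and the int8 grid are illustrative assumptions):

```python
import torch

torch.manual_seed(0)

class FakeQuantSTE(torch.autograd.Function):
    """int8 fake quantization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, x, scale):
        # quantize to the int8 grid, then dequantize back to float
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # STE: pretend round() was the identity and pass gradients through
        return grad_out, None

x = torch.randn(4, 8, requires_grad=True)
scale = x.detach().abs().max() / 127
y = FakeQuantSTE.apply(x, scale)
y.sum().backward()   # gradients flow despite the non-differentiable round()
```

The forward pass sees quantized values, yet the backward pass is unblocked, which is the core trick multi-step QAT schedules build on.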

pytorch/pytorch

  • Key Activity:
    • [2025-09-26] DOC UPDATE: Windows libuv updates.
  • Details:
    • [2025-09-XX] Addressed issues with autograd when mixing CUDA Graph and non-CUDA Graph code.
    • [2025-09-XX] Fixed dynamic shape export issues for Hugging Face models using TransformersKwargs.
  • Metrics: 1799 New PRs | 1821 Closed PRs | 649 New Issues | 747 Closed Issues. (Exceptional health: more PRs and issues closed than opened).

pytorch/torchtitan

  • Key Activity:
    • [2025-09-XX] FSDP2 and Attention refactoring.
  • Details:
    • [2025-09-XX] Added support for sync_module_states in FSDP2.
    • [2025-09-XX] Ported true bf16 training into the forge experiment.
  • Metrics: 78 New PRs | 63 Closed PRs | 30 New Issues | 11 Closed Issues.

meta-pytorch/monarch

  • Key Activity:
    • [2025-09-03] 🚨 RELEASE: v0.0.0
  • Details:
    • [2025-09-03] First official Monarch release (torchmonarch) published to PyPI. Followed by minor doc updates.
  • Metrics: N/A

🟢 NVIDIA & JAX Ecosystems

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-09-15] 🚨 RELEASE: v2.6
  • Details:
    • [2025-09-15] Added support for gradient accumulation fusion when using FSDP from megatron-core.
    • [2025-09-15] Optimized permute fusion kernels for MoE and improved KV caching kernel performance.
    • [2025-09-15] Added save_original_input option to decouple row-wise and column-wise quantization.
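
The MoE permute fusions mentioned above target a standard pattern: tokens are gathered into contiguous per-expert groups before the expert GEMMs and scattered back afterward. A NumPy sketch of the unfused permute/unpermute round trip (shapes and names are illustrative):

```python
import numpy as np

np.random.seed(0)
num_tokens, hidden, num_experts = 8, 4, 3
tokens = np.random.randn(num_tokens, hidden)
expert_ids = np.random.randint(0, num_experts, size=num_tokens)

# permute: stable-sort tokens so each expert owns a contiguous slice
order = np.argsort(expert_ids, kind="stable")
grouped = tokens[order]          # per-expert GEMMs would run on slices of this

# unpermute: scatter expert outputs back to the original token order
restored = np.empty_like(grouped)
restored[order] = grouped
```

Fused kernels fold these gathers/scatters into adjacent ops to avoid materializing the permuted copies.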

jax-ml/jax

  • Key Activity:
    • [2025-09-16] 🚨 RELEASE: jax-v0.7.2
    • [2025-09-23] DOC UPDATE: Experimental WSL2 on ROCm.
  • Details:
    • [2025-09-16] Dropped support for NumPy < 2.0. Internal JAX representations now use LiteralArray.
    • [2025-09-16] Added CUDA 13 documentation and ROCm WSL2 support notes.

AI-Hypercomputer/maxtext

  • Key Activity:
    • [2025-09-XX] Llama 4 migrations and Multihost improvements.
  • Details:
    • [2025-09-XX] Migrated Llama4DecoderLayer and Llama4ScannableBlock to NNX.
    • [2025-09-XX] Addressed memory leaks when initializing Qwen model training with the multihost runner.
  • Metrics: 159 New PRs | 168 Closed PRs | 7 New Issues | 2 Closed Issues. (Very healthy maintenance).

openxla/xla

  • Key Activity:
    • [2025-09-XX] General maintenance and HLO ops fixes.
  • Details:
    • [2025-09-XX] Investigating issues where mhlo.acosh / mhlo.acos cannot be translated to XLA HLO.
  • Metrics: 1282 New PRs | 1375 Closed PRs | 6 New Issues | 26 Closed Issues.

๐Ÿ› ๏ธ Triton & Compilation Ecosystem

triton-lang/triton

  • Key Activity:
    • [2025-09-19] DOC UPDATE: Added hook for configurable compiler pass pipelines.
  • Details:
    • [2025-09-XX] Community investigating an 18% performance regression in B200 flex_attention_fwd.
    • [2025-09-XX] Fixed short-circuiting of zero-dimensional splats.
  • Metrics: 270 New PRs | 246 Closed PRs | 40 New Issues | 22 Closed Issues.

tile-ai/tilelang

  • Key Activity:
    • [2025-09-19] 🚨 RELEASE: 0.1.6
    • [2025-09-21] 🚨 RELEASE: v0.1.6.post1
  • Details:
    • [2025-09-19] Massive update: Added auto-vectorize for atomic add, MXFP4 GEMM kernel with bias, and Flash Attention examples for AMD MI300 series.
    • [2025-09-19] Added hardware support/detection for Huawei Ascend chips, Blackwell (SM100), and integrated AMD GPU CI workflows.
    • [2025-09-21] Released post1 to fix static-linking issues with libgcc and libstdc++ that impacted PyTorch builds.
  • Metrics: 99 New PRs | 95 Closed PRs | 46 New Issues | 25 Closed Issues. (High momentum project).
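
FP4 E2M1, the element type in the MXFP4 GEMM above, can represent only eight magnitudes per sign. A NumPy sketch of round-to-nearest onto that grid, with block scaling omitted (fp4_round is an illustrative name, not TileLang's API):

```python
import numpy as np

# the eight magnitudes representable in FP4 E2M1 (per the OCP MX spec)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_round(x):
    """Round each value to the nearest signed E2M1 code point (saturating)."""
    idx = np.argmin(np.abs(np.abs(x)[..., None] - FP4_GRID), axis=-1)
    return np.sign(x) * FP4_GRID[idx]

vals = np.array([0.2, 0.8, -2.4, 7.0])
quantized = fp4_round(vals)
```

In a full MXFP4 scheme, a shared power-of-two scale per 32-element block shifts values into this grid before rounding.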

🧠 LLM Frameworks & Tooling

deepseek-ai/DeepEP

  • Key Activity:
    • [2025-09-16] 🚨 RELEASE: v1.2.1
  • Details:
    • [2025-09-16] Added permute extensions to hybrid-ep and CUDA Graph support for the normal internode dispatch kernels.
  • Metrics: 26 New PRs | 20 Closed PRs | 28 New Issues | 19 Closed Issues.
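
MoE dispatch of the kind DeepEP accelerates starts from top-k gating: each token selects k experts and receives softmax combine weights over just those k logits. A NumPy sketch of the routing math (shapes and names are illustrative; DeepEP implements this with fused GPU communication kernels):

```python
import numpy as np

np.random.seed(0)
num_tokens, num_experts, top_k = 6, 4, 2
logits = np.random.randn(num_tokens, num_experts)

# each token picks its top-k experts by gate logit
topk_ids = np.argsort(-logits, axis=-1)[:, :top_k]

# softmax over only the selected logits gives the combine weights
sel = np.take_along_axis(logits, topk_ids, axis=-1)
e = np.exp(sel - sel.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

# dispatch load: tokens routed to each expert (what EP kernels must balance)
counts = np.bincount(topk_ids.ravel(), minlength=num_experts)
```

The dispatch step sends each token to its selected experts (often across nodes), and the combine step sums expert outputs weighted by `weights`.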

xdit-project/xDiT

  • Key Activity:
    • [2025-09-XX] Cross-ecosystem integrations.
  • Details:
    • [2025-09-XX] Implemented ROCm fixes and discussed integration with AIBrix for Multimodal Generation.
  • Metrics: 9 New PRs | 6 Closed PRs | 7 New Issues | 2 Closed Issues.

huggingface/transformers

  • Key Activity:
    • [2025-09-18] DOC UPDATE: 🚨 Fully removed TensorFlow and JAX support library-wide.
  • Details:
    • [2025-09-XX] Fixed a RuntimeError dtype mismatch in _group_beam_search with bfloat16/fp16 models. Added num_hidden_layers to the t5gemma config.
  • Metrics: 514 New PRs | 477 Closed PRs | 145 New Issues | 162 Closed Issues. (Strong issue triage).

(Note: Minor documentation and README updates were also merged across llm-d/llm-d, alibaba/rtp-llm, vllm-project/vllm, volcengine/verl, THUDM/slime, radixark/miles, and deepspeedai/DeepSpeed without notable code alterations).