📅 Engineering Report (2025-09-01 - 2025-09-30)

🚀 Executive Summary

September 2025 was a monumental month, defined by the leap into sub-8-bit precision formats (FP4, FP6, MXFP8) and by major hardware-enablement releases across the industry. The defining event was the release of ROCm 7.0, which fundamentally restructured AMD's software stack, integrated the OCP MX data types, and brought day-zero support for the MI350X/MI355X accelerators.

Across the wider ecosystem, we are witnessing rapid infrastructure stabilization for next-generation models (Llama 4, DeepSeek V3/R1). Frameworks like PyTorch's torchao, NVIDIA's TransformerEngine, and emerging compilers like TileLang are aggressively optimizing mixed-precision Quantization-Aware Training (QAT), Mixture-of-Experts (MoE) routing, and Tensor Parallelism.

  • 🚨 The ROCm 7.0 Era Begins: ROCm 7.0.0 (and the immediate 7.0.1 quality patch) introduces full hardware support for the MI350X/MI355X accelerators. Crucially, it brings functional support for OCP MX-compliant data types (FP4, FP6, FP8) across HIP, Composable Kernel, hipBLASLt, and MIGraphX.
  • Software Stack Maturation & Consolidation: AMD decoupled the amdgpu driver from the core ROCm software stack for independent versioning. Furthermore, 15 disparate ROCm libraries (hipBLAS, rocPRIM, etc.) have been migrated into a unified rocm-libraries repository to streamline CI/CD and developer contributions.
  • Model Readiness in Primus: Primus v0.2.0 heavily targeted incoming frontier models with the integration of LightMegatronPretrainTrainer, initial configurations for Llama-4 (including 17B-16E and 17B-128E MoE variants), and memory/garbage-collection optimizations.
  • Expanding 3rd-Party AMD CI/CD: Upstream ecosystem health is strengthening. PyTorch torchao merged AMD MI300X float8 benchmarks, JAX introduced experimental WSL2 support on ROCm, and TileLang formally added AMD GPU CI alongside Flash Attention implementations for the MI300 series.
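
The OCP MX formats called out above share one power-of-two scale per 32-element block (the MX spec's block size), with the elements themselves stored in a low-precision type such as FP8 E4M3. A minimal NumPy sketch of that block-scaling step, with element rounding to E4M3 omitted for brevity (all names here are illustrative, not a ROCm API):

```python
import numpy as np

np.random.seed(0)
BLOCK = 32          # MX block size per the OCP MX spec
E4M3_MAX = 448.0    # largest finite FP8 E4M3 value

def mx_scale(block):
    """Shared power-of-two (E8M0-style) scale for one block."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return 1.0
    # map the block's max magnitude into the FP8 E4M3 range
    return 2.0 ** (np.floor(np.log2(amax)) - np.floor(np.log2(E4M3_MAX)))

def mxfp8_fake_quant(x):
    """Simulate MXFP8 block scaling (element rounding to E4M3 omitted)."""
    out = np.empty_like(x)
    for i in range(0, len(x), BLOCK):
        blk = x[i:i + BLOCK]
        s = mx_scale(blk)
        out[i:i + BLOCK] = np.clip(blk / s, -E4M3_MAX, E4M3_MAX) * s
    return out

x = np.random.randn(128) * 10.0
y = mxfp8_fake_quant(x)
```

Because the scale is a power of two, scaling is exact in binary floating point; in this simplified model only elements pushed past the E4M3 range saturate.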

Competitive Analysis

  • NVIDIA Optimizes for the B200: PyTorch torchao achieved a 1.2x MXFP8 dense pretraining speedup on the B200. Concurrently, NVIDIA's TransformerEngine v2.6 focused on MoE permute fusions and PyTorch FSDP gradient accumulation, maintaining a tight grip on high-end enterprise training optimization.
  • DeepSeek's Ecosystem Remains Agile: DeepSeek's DeepEP (v1.2.1) added permute extensions to its hybrid-ep (Expert Parallelism) path and introduced CUDA Graph support for internode dispatch, demonstrating a relentless focus on high-efficiency MoE scaling.
  • Ascend & Alternative Compilers Gaining Ground: The TileLang compiler released v0.1.6, notably adding support for Huawei Ascend chips alongside Blackwell (SM100) and ROCm. The race for unified, high-performance cross-hardware JIT compilation is heating up, threatening vendor lock-in strategies.
  • Triton Growing Pains on Next-Gen: Triton is experiencing optimization friction on next-gen architectures, evidenced by a reported 18% performance regression in flex_attention_fwd on the B200, highlighting that scaling to Blackwell is not without low-level software hurdles.

📂 Category Updates

🟢 AMD Ecosystem

ROCm/ROCm

  • Key Activity:
    • [2025-09-16] 🚨 RELEASE: rocm-7.0.0
    • [2025-09-17] 🚨 RELEASE: rocm-7.0.1
  • Details:
    • [2025-09-16] Released ROCm 7.0.0 featuring day-zero support for AMD Instinct MI355X and MI350X GPUs.
    • [2025-09-16] Enabled Open Compute Project (OCP) MX floating-point FP4, FP6, and FP8 data types.
    • [2025-09-16] Consolidated 15 mathematical and primitive libraries into a single rocm-libraries mono-repo. Separated the AMD GPU Driver (amdgpu) from ROCm packaging.
    • [2025-09-17] Pushed a rapid 7.0.1 hotfix to resolve out-of-bound CPER declarations for bad memory pages on MI300/MI350 series GPUs.
  • Metrics: Extensive internal milestones tracking hardware/firmware dependencies.

AMD-AGI/Primus

  • Key Activity:
    • [2025-09-11] 🚨 RELEASE: v0.2.0
    • [2025-09-25] DOC UPDATE: Primus product matrix updated.
  • Details:
    • [2025-09-11] Shipped LightMegatronPretrainTrainer for clean config-based integration.
    • [2025-09-11] Added initial Llama-4 configs (Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E).
    • [2025-09-11] Implemented SyncFree MoE logic and addressed performance regressions in 8B models.
  • Metrics: 40 New PRs | 40 Closed PRs | 1 New Issue | 0 Closed Issues. (Excellent maintenance health: 100% PR closure rate).

AMD-AGI/TraceLens

  • Key Activity:
    • [2025-09-XX] Ongoing development for kernel launchers and testing.
  • Details:
    • [2025-09-XX] Added initial test_graph_mode.py and support for trtllm::cublas_scaled_mm.
    • [2025-09-XX] Clarified UID vs Event logic in the API and renamed kernel launchers to leaf ops.
  • Metrics: 17 New PRs | 17 Closed PRs | 31 New Issues | 9 Closed Issues. (Good PR velocity, but rising issue backlog).

ROCm/MAD

  • Key Activity:
    • [2025-09-XX] General maintenance and CLI fixes.
  • Details:
    • [2025-09-XX] Fixed Hugging Face CLI deprecated warnings and pushed AAC fixes.
  • Metrics: 16 New PRs | 13 Closed PRs | 1 New Issue | 1 Closed Issue.

🔥 PyTorch Ecosystem

pytorch/ao

  • Key Activity:
    • [2025-09-02] 🚨 RELEASE: v0.13.0-rc8
  • Details:
    • [2025-09-02] Added simpler Multi-step QAT (Quantization-Aware Training) APIs.
    • [2025-09-02] Introduced prototype NVFP4 and FP8 QAT support.
    • [2025-09-02] Achieved 1.2x MXFP8 dense pretraining speedups on NVIDIA B200. Updated README includes AMD MI300X benchmark results.
  • Metrics: N/A (Release focus).
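
QAT APIs like the ones above revolve around inserting fake quantization into training so the model learns to tolerate rounding, while gradients pass through the non-differentiable round via a straight-through estimator. A generic PyTorch sketch of the idea (this is not the torchao API; FakeQuantSTE and the int8 grid are illustrative assumptions):

```python
import torch

torch.manual_seed(0)

class FakeQuantSTE(torch.autograd.Function):
    """int8 fake quantization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, x, scale):
        # quantize to the int8 grid, then dequantize back to float
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # STE: pretend round() was the identity and pass gradients through
        return grad_out, None

x = torch.randn(4, 8, requires_grad=True)
scale = x.detach().abs().max() / 127
y = FakeQuantSTE.apply(x, scale)
y.sum().backward()   # gradients flow despite the non-differentiable round()
```

The forward pass sees quantized values, yet the backward pass is unblocked, which is the core trick multi-step QAT schedules build on.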

pytorch/pytorch

  • Key Activity:
    • [2025-09-26] DOC UPDATE: Windows libuv updates.
  • Details:
    • [2025-09-XX] Addressed issues with autograd when mixing CUDA Graph and non-CUDA Graph code.
    • [2025-09-XX] Fixed dynamic shape export issues for Hugging Face models using TransformersKwargs.
  • Metrics: 1799 New PRs | 1821 Closed PRs | 649 New Issues | 747 Closed Issues. (Exceptional health: more PRs and issues closed than opened).

pytorch/torchtitan

  • Key Activity:
    • [2025-09-XX] FSDP2 and Attention refactoring.
  • Details:
    • [2025-09-XX] Added support for sync_module_states in FSDP2.
    • [2025-09-XX] Ported true bf16 training into the forge experiment.
  • Metrics: 78 New PRs | 63 Closed PRs | 30 New Issues | 11 Closed Issues.

meta-pytorch/monarch

  • Key Activity:
    • [2025-09-03] 🚨 RELEASE: v0.0.0
  • Details:
    • [2025-09-03] First official Monarch release (torchmonarch) published to PyPI. Followed by minor doc updates.
  • Metrics: N/A

🟢 NVIDIA & JAX Ecosystems

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-09-15] 🚨 RELEASE: v2.6
  • Details:
    • [2025-09-15] Added support for gradient accumulation fusion when using FSDP from megatron-core.
    • [2025-09-15] Optimized permute fusion kernels for MoE and improved KV caching kernel performance.
    • [2025-09-15] Added save_original_input option to decouple row-wise and column-wise quantization.
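
The MoE permute fusions mentioned above target a standard pattern: tokens are gathered into contiguous per-expert groups before the expert GEMMs and scattered back afterward. A NumPy sketch of the unfused permute/unpermute round trip (shapes and names are illustrative):

```python
import numpy as np

np.random.seed(0)
num_tokens, hidden, num_experts = 8, 4, 3
tokens = np.random.randn(num_tokens, hidden)
expert_ids = np.random.randint(0, num_experts, size=num_tokens)

# permute: stable-sort tokens so each expert owns a contiguous slice
order = np.argsort(expert_ids, kind="stable")
grouped = tokens[order]          # per-expert GEMMs would run on slices of this

# unpermute: scatter expert outputs back to the original token order
restored = np.empty_like(grouped)
restored[order] = grouped
```

Fused kernels fold these gathers/scatters into adjacent ops to avoid materializing the permuted copies.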

jax-ml/jax

  • Key Activity:
    • [2025-09-16] 🚨 RELEASE: jax-v0.7.2
    • [2025-09-23] DOC UPDATE: Experimental WSL2 on ROCm.
  • Details:
    • [2025-09-16] Dropped support for NumPy < 2.0. Internal JAX representations now use LiteralArray.
    • [2025-09-16] Added CUDA 13 documentation and ROCm WSL2 support notes.

AI-Hypercomputer/maxtext

  • Key Activity:
    • [2025-09-XX] Llama 4 migrations and Multihost improvements.
  • Details:
    • [2025-09-XX] Migrated Llama4DecoderLayer and Llama4ScannableBlock to NNX.
    • [2025-09-XX] Addressed memory leaks when initializing Qwen model training with the multihost runner.
  • Metrics: 159 New PRs | 168 Closed PRs | 7 New Issues | 2 Closed Issues. (Very healthy maintenance).

openxla/xla

  • Key Activity:
    • [2025-09-XX] General maintenance and HLO ops fixes.
  • Details:
    • [2025-09-XX] Investigating issues where mhlo.acosh / mhlo.acos cannot be translated to XLA HLO.
  • Metrics: 1282 New PRs | 1375 Closed PRs | 6 New Issues | 26 Closed Issues.

๐Ÿ› ๏ธ Triton & Compilation Ecosystem

triton-lang/triton

  • Key Activity:
    • [2025-09-19] DOC UPDATE: Added hook for configurable compiler pass pipelines.
  • Details:
    • [2025-09-XX] Community investigating an 18% performance regression in B200 flex_attention_fwd.
    • [2025-09-XX] Fixed short-circuiting of zero-dimensional splats.
  • Metrics: 270 New PRs | 246 Closed PRs | 40 New Issues | 22 Closed Issues.

tile-ai/tilelang

  • Key Activity:
    • [2025-09-19] 🚨 RELEASE: 0.1.6
    • [2025-09-21] 🚨 RELEASE: v0.1.6.post1
  • Details:
    • [2025-09-19] Massive update: Added auto-vectorize for atomic add, MXFP4 GEMM kernel with bias, and Flash Attention examples for AMD MI300 series.
    • [2025-09-19] Added hardware support/detection for Huawei Ascend chips, Blackwell (SM100), and integrated AMD GPU CI workflows.
    • [2025-09-21] Released post1 to fix static-linking issues with libgcc and libstdc++ that impacted PyTorch builds.
  • Metrics: 99 New PRs | 95 Closed PRs | 46 New Issues | 25 Closed Issues. (High momentum project).
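
FP4 E2M1, the element type in the MXFP4 GEMM above, can represent only eight magnitudes per sign. A NumPy sketch of round-to-nearest onto that grid, with block scaling omitted (fp4_round is an illustrative name, not TileLang's API):

```python
import numpy as np

# the eight magnitudes representable in FP4 E2M1 (per the OCP MX spec)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_round(x):
    """Round each value to the nearest signed E2M1 code point (saturating)."""
    idx = np.argmin(np.abs(np.abs(x)[..., None] - FP4_GRID), axis=-1)
    return np.sign(x) * FP4_GRID[idx]

vals = np.array([0.2, 0.8, -2.4, 7.0])
quantized = fp4_round(vals)
```

In a full MXFP4 scheme, a shared power-of-two scale per 32-element block shifts values into this grid before rounding.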

🧠 LLM Frameworks & Tooling

deepseek-ai/DeepEP

  • Key Activity:
    • [2025-09-16] 🚨 RELEASE: v1.2.1
  • Details:
    • [2025-09-16] Added permute extensions to hybrid-ep and CUDA Graph support for the normal internode dispatch kernels.
  • Metrics: 26 New PRs | 20 Closed PRs | 28 New Issues | 19 Closed Issues.
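
MoE dispatch of the kind DeepEP accelerates starts from top-k gating: each token selects k experts and receives softmax combine weights over just those k logits. A NumPy sketch of the routing math (shapes and names are illustrative; DeepEP implements this with fused GPU communication kernels):

```python
import numpy as np

np.random.seed(0)
num_tokens, num_experts, top_k = 6, 4, 2
logits = np.random.randn(num_tokens, num_experts)

# each token picks its top-k experts by gate logit
topk_ids = np.argsort(-logits, axis=-1)[:, :top_k]

# softmax over only the selected logits gives the combine weights
sel = np.take_along_axis(logits, topk_ids, axis=-1)
e = np.exp(sel - sel.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

# dispatch load: tokens routed to each expert (what EP kernels must balance)
counts = np.bincount(topk_ids.ravel(), minlength=num_experts)
```

The dispatch step sends each token to its selected experts (often across nodes), and the combine step sums expert outputs weighted by `weights`.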

xdit-project/xDiT

  • Key Activity:
    • [2025-09-XX] Cross-ecosystem integrations.
  • Details:
    • [2025-09-XX] Implemented ROCm fixes and discussed integration with AIBrix for Multimodal Generation.
  • Metrics: 9 New PRs | 6 Closed PRs | 7 New Issues | 2 Closed Issues.

huggingface/transformers

  • Key Activity:
    • [2025-09-18] DOC UPDATE: 🚨 Fully removed TensorFlow and JAX support library-wide.
  • Details:
    • [2025-09-XX] Fixed a RuntimeError dtype mismatch in _group_beam_search with bfloat16/fp16 models. Added num_hidden_layers to the t5gemma config.
  • Metrics: 514 New PRs | 477 Closed PRs | 145 New Issues | 162 Closed Issues. (Strong issue triage).

(Note: Minor documentation and README updates were also merged across llm-d/llm-d, alibaba/rtp-llm, vllm-project/vllm, volcengine/verl, THUDM/slime, radixark/miles, and deepspeedai/DeepSpeed without notable code alterations).