GitHub Monthly Report: 2025-09-01 to 2025-09-30
Engineering Report (2025-09-01 - 2025-09-30)
Executive Summary
September 2025 was a monumental month defined by the leap into sub-8-bit precision architectures (FP4, FP6, MXFP8) and massive hardware enablement releases across the industry. The defining event of the month was the blockbuster release of ROCm 7.0, which fundamentally restructured AMD’s software stack, integrated OCP MX data types, and brought day-zero support for the MI350/MI355X accelerators.
Across the wider ecosystem, we are witnessing rapid infrastructure stabilization for next-generation models (Llama 4, DeepSeek V3/R1). Frameworks like PyTorch’s torchao, NVIDIA’s TransformerEngine, and emerging compilers like TileLang are aggressively optimizing mixed-precision Quantization-Aware Training (QAT), MoE (Mixture of Experts) routing, and Tensor Parallelism.
AMD Related Updates
- 🚨 The ROCm 7.0 Era Begins: ROCm 7.0.0 (and the immediate 7.0.1 quality patch) introduces full hardware support for the MI350X/MI355X accelerators. Crucially, it brings functional support for OCP MX-compliant data types (`FP4`, `FP6`, `FP8`) across HIP, Composable Kernel, hipBLASLt, and MIGraphX.
- Software Stack Maturation & Consolidation: AMD decoupled the `amdgpu` driver from the core ROCm software stack for independent versioning. Furthermore, 15 disparate ROCm libraries (hipBLAS, rocPRIM, etc.) have been migrated into a unified `rocm-libraries` repository to streamline CI/CD and developer contributions.
- Model Readiness in Primus: Primus v0.2.0 heavily targeted incoming frontier models with the integration of `LightMegatronPretrainTrainer`, initial configurations for Llama-4 (including 17B-16E and 17B-128E MoE variants), and memory/garbage-collection optimizations.
- Expanding 3rd-Party AMD CI/CD: Upstream ecosystem health is strengthening. PyTorch `torchao` merged AMD MI300X float8 benchmarks, JAX introduced experimental WSL2 support on ROCm, and `TileLang` formally added AMD GPU CI alongside Flash Attention implementations for the MI300 series.
Competitive Analysis
- NVIDIA Optimizes for the B200: PyTorch `torchao` achieved a 1.2x MXFP8 dense pretraining speedup on the B200. Concurrently, NVIDIA’s `TransformerEngine` v2.6 focused on MoE permute fusions and gradient accumulation fusion with PyTorch FSDP, maintaining a tight grip on high-end enterprise training optimization.
- DeepSeek’s Ecosystem Remains Agile: DeepSeek’s `DeepEP` (v1.2.1) added permute extensions to its hybrid-ep (Expert Parallelism) path and introduced CUDA Graph support for internode dispatch, demonstrating a relentless focus on high-efficiency MoE scaling.
- Ascend & Alternative Compilers Gaining Ground: The `TileLang` compiler released v0.1.6, notably adding support for Huawei Ascend chips alongside Blackwell (SM100) and ROCm. The race for unified, high-performance cross-hardware JIT compilation is heating up, threatening vendor lock-in strategies.
- Triton Growing Pains on Next-Gen: `Triton` is experiencing optimization friction with next-gen architectures, evidenced by an 18% performance regression in B200 `flex_attention_fwd`, highlighting that scaling to Blackwell is not without low-level software hurdles.
Category Updates
🟢 AMD Ecosystem
ROCm/ROCm
- Key Activity:
- [2025-09-16] 🚨 RELEASE: rocm-7.0.0
- [2025-09-17] 🚨 RELEASE: rocm-7.0.1
- Details:
- [2025-09-16] Released ROCm 7.0.0 featuring day-zero support for AMD Instinct MI355X and MI350X GPUs.
- [2025-09-16] Enabled Open Compute Project (OCP) MX floating-point `FP4`, `FP6`, and `FP8` data types.
- [2025-09-16] Consolidated 15 mathematical and primitive libraries into a single `rocm-libraries` mono-repo. Separated the AMD GPU driver (`amdgpu`) from ROCm packaging.
- [2025-09-17] Pushed a rapid 7.0.1 hotfix to resolve out-of-bounds CPER declarations for bad memory pages on MI300/MI350 series GPUs.
- Metrics: Extensive internal milestones tracking hardware/firmware dependencies.
AMD-AGI/Primus
- Key Activity:
- [2025-09-11] 🚨 RELEASE: v0.2.0
- [2025-09-25] DOC UPDATE: Primus product matrix updated.
- Details:
- [2025-09-11] Shipped `LightMegatronPretrainTrainer` for clean config-based integration.
- [2025-09-11] Added initial Llama-4 configs (Llama-4-Scout-17B-16E, Llama-4-Maverick-17B-128E).
- [2025-09-11] Implemented SyncFree MoE logic and addressed performance regressions in 8B models.
- Metrics: 40 New PRs, 40 Closed PRs, 1 New Issue, 0 Closed Issues. (Excellent maintenance health: 100% PR closure rate.)
AMD-AGI/TraceLens
- Key Activity:
- [2025-09-XX] Ongoing development for kernel launchers and testing.
- Details:
- [2025-09-XX] Added initial `test_graph_mode.py` and support for `trtllm::cublas_scaled_mm`.
- [2025-09-XX] Clarified UID vs. Event logic in the API and renamed kernel launchers to leaf ops.
- Metrics: 17 New PRs, 17 Closed PRs, 31 New Issues, 9 Closed Issues. (Good PR velocity, but a rising issue backlog.)
ROCm/MAD
- Key Activity:
- [2025-09-XX] General maintenance and CLI fixes.
- Details:
- [2025-09-XX] Fixed Hugging Face CLI deprecated warnings and pushed AAC fixes.
- Metrics: 16 New PRs, 13 Closed PRs, 1 New Issue, 1 Closed Issue.
🔥 PyTorch Ecosystem
pytorch/ao
- Key Activity:
- [2025-09-02] 🚨 RELEASE: v0.13.0-rc8
- Details:
- [2025-09-02] Added simpler Multi-step QAT (Quantization-Aware Training) APIs.
- [2025-09-02] Introduced prototype NVFP4 and FP8 QAT support.
- [2025-09-02] Achieved 1.2x MXFP8 dense pretraining speedups on NVIDIA B200. Updated README includes AMD MI300X benchmark results.
- Metrics: N/A (Release focus).
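QAT of the kind torchao is productizing boils down to fake-quantization with a straight-through estimator (STE): the forward pass sees rounded weights while gradients flow as if no rounding happened. The sketch below is a generic illustration of that mechanism in plain PyTorch, not torchao's actual API:

```python
import torch

def fake_quant_int8(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Simulate int8 rounding in the forward pass while letting gradients
    pass through unchanged (straight-through estimator)."""
    q = torch.clamp(torch.round(w / scale), -128, 127) * scale
    # Forward evaluates to q; backward sees d(out)/d(w) = 1 because the
    # non-differentiable (q - w) term is detached from the graph.
    return w + (q - w).detach()

w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 127
fake_quant_int8(w, scale).sum().backward()
# STE at work: despite round() having zero gradient almost everywhere,
# w.grad comes back as all ones.
assert torch.equal(w.grad, torch.ones_like(w))
```

The NVFP4/FP8 QAT prototypes mentioned above apply the same pattern with a different quantization grid in place of the int8 one.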
pytorch/pytorch
- Key Activity:
- [2025-09-26] DOC UPDATE: Windows libuv updates.
- Details:
- [2025-09-XX] Addressed issues with autograd poorly mixing CUDA Graph and non-CUDA-Graph code.
- [2025-09-XX] Fixed dynamic shape export issues for Hugging Face models using `TransformersKwargs`.
- Metrics: 1799 New PRs, 1821 Closed PRs, 649 New Issues, 747 Closed Issues. (Exceptional health: more PRs and issues closed than opened.)
pytorch/torchtitan
- Key Activity:
- [2025-09-XX] FSDP2 and Attention refactoring.
- Details:
- [2025-09-XX] Added support for `sync_module_states` in FSDP2.
- [2025-09-XX] Ported true `bf16` training into the forge experiment.
- Metrics: 78 New PRs, 63 Closed PRs, 30 New Issues, 11 Closed Issues.
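"True bf16" training, as ported into the forge experiment, contrasts with the more common mixed-precision recipe in which fp32 master weights are kept and only compute runs in bf16. A minimal mixed-precision step in stock PyTorch (illustrative only, unrelated to torchtitan internals):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 1)             # parameters stay fp32 ("master weights")
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 8), torch.randn(4, 1)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)                        # matmul runs in bf16 under autocast
loss = F.mse_loss(out.float(), y)         # loss and optimizer math stay in fp32
loss.backward()
opt.step()
assert model.weight.dtype == torch.float32
```

True bf16 instead stores the parameters themselves in bf16, halving optimizer-state memory at the cost of accumulated rounding in the weight updates.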
meta-pytorch/monarch
- Key Activity:
- [2025-09-03] 🚨 RELEASE: v0.0.0
- Details:
- [2025-09-03] First official Monarch release (`torchmonarch`) published to PyPI, followed by minor doc updates.
- Metrics: N/A
🟢 NVIDIA & JAX Ecosystems
NVIDIA/TransformerEngine
- Key Activity:
- [2025-09-15] 🚨 RELEASE: v2.6
- Details:
- [2025-09-15] Added support for gradient accumulation fusion when using FSDP from megatron-core.
- [2025-09-15] Optimized permute fusion kernels for MoE and improved KV caching kernel performance.
- [2025-09-15] Added a `save_original_input` option to decouple row-wise and column-wise quantization.
jax-ml/jax
- Key Activity:
- [2025-09-16] 🚨 RELEASE: jax-v0.7.2
- [2025-09-23] DOC UPDATE: Experimental WSL2 on ROCm.
- Details:
- [2025-09-16] Dropped support for NumPy < 2.0. Internal JAX representations now use `LiteralArray`.
- [2025-09-16] Added CUDA 13 documentation and ROCm WSL2 support notes.
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-09-XX] Llama 4 migrations and Multihost improvements.
- Details:
- [2025-09-XX] Migrated `Llama4DecoderLayer` and `Llama4ScannableBlock` to NNX.
- [2025-09-XX] Addressed memory leaks when initializing Qwen model training with the multihost runner.
- Metrics: 159 New PRs, 168 Closed PRs, 7 New Issues, 2 Closed Issues. (Very healthy maintenance.)
openxla/xla
- Key Activity:
- [2025-09-XX] General maintenance and HLO ops fixes.
- Details:
- [2025-09-XX] Investigating issues where `mhlo.acosh`/`mhlo.acos` cannot be translated to XLA HLO.
- Metrics: 1282 New PRs, 1375 Closed PRs, 6 New Issues, 26 Closed Issues.
🛠️ Triton & Compilation Ecosystem
triton-lang/triton
- Key Activity:
- [2025-09-19] DOC UPDATE: Added hook for configurable compiler pass pipelines.
- Details:
- [2025-09-XX] Community investigating an 18% performance regression in B200 `flex_attention_fwd`.
- [2025-09-XX] Fixed short-circuit handling of zero-dimensional splats.
- Metrics: 270 New PRs, 246 Closed PRs, 40 New Issues, 22 Closed Issues.
tile-ai/tilelang
- Key Activity:
- [2025-09-19] 🚨 RELEASE: 0.1.6
- [2025-09-21] 🚨 RELEASE: v0.1.6.post1
- Details:
- [2025-09-19] Massive update: Added auto-vectorize for atomic add, MXFP4 GEMM kernel with bias, and Flash Attention examples for AMD MI300 series.
- [2025-09-19] Added hardware support/detection for Huawei Ascend chips, Blackwell (SM100), and integrated AMD GPU CI workflows.
- [2025-09-21] Released `post1` to fix static linking issues with libgcc and libstdc++ impacting PyTorch builds.
- Metrics: 99 New PRs, 95 Closed PRs, 46 New Issues, 25 Closed Issues. (High-momentum project.)
🧠 LLM Frameworks & Tooling
deepseek-ai/DeepEP
- Key Activity:
- [2025-09-16] 🚨 RELEASE: v1.2.1
- Details:
- [2025-09-16] Added permute extensions to `hybrid-ep`. Supported CUDA Graph for internode dispatch normal kernels.
- Metrics: 26 New PRs, 20 Closed PRs, 28 New Issues, 19 Closed Issues.
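The "permute" and "dispatch" steps DeepEP keeps optimizing refer to grouping tokens by their routed experts before the expert GEMMs run. A dependency-free sketch of top-k routing plus that grouping step (hypothetical helper names, not DeepEP's API):

```python
def route_topk(scores, k=2):
    """Indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

def dispatch(router_scores, k=2):
    """Group token indices by expert -- the 'permute' an MoE kernel fuses so
    each expert processes its assigned tokens as one contiguous batch."""
    buckets = {}
    for token, scores in enumerate(router_scores):
        for expert in route_topk(scores, k):
            buckets.setdefault(expert, []).append(token)
    return buckets

# Two tokens, three experts: token 0 routes to experts {0, 2}, token 1 to {1, 2}.
groups = dispatch([[0.9, 0.1, 0.5], [0.2, 0.8, 0.7]])
print(groups)  # {0: [0], 2: [0, 1], 1: [1]}
```

In a real system this happens across nodes, which is why internode dispatch (and making it CUDA-Graph-capturable) dominates DeepEP's optimization work.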
xdit-project/xDiT
- Key Activity:
- [2025-09-XX] Cross-ecosystem integrations.
- Details:
- [2025-09-XX] Implemented ROCm fixes and discussed integration with `AIBrix` for multimodal generation.
- Metrics: 9 New PRs, 6 Closed PRs, 7 New Issues, 2 Closed Issues.
huggingface/transformers
- Key Activity:
- [2025-09-18] 🚨 DOC UPDATE: Fully removed TensorFlow and JAX support library-wide.
- Details:
- [2025-09-XX] Addressed a RuntimeError (dtype mismatch in `_group_beam_search`) with bfloat16/fp16 models. Added `num_hidden_layers` to the t5gemma config.
- Metrics: 514 New PRs, 477 Closed PRs, 145 New Issues, 162 Closed Issues. (Strong issue triage.)
(Note: Minor documentation and README updates were also merged across llm-d/llm-d, alibaba/rtp-llm, vllm-project/vllm, volcengine/verl, THUDM/slime, radixark/miles, and deepspeedai/DeepSpeed without notable code alterations).