GitHub Monthly Report: 2025-05-01 to 2025-05-31
📅 Engineering Report (2025-05-01 - 2025-05-31)
🚀 Executive Summary
May 2025 was marked by advances in quantization, hardware-specific model optimization, and large backlog cleanups across the ecosystem. The standout event of the month was the 🚨 release of torchao v0.11.0, which added Mixture-of-Experts (MoE) quantization and a new microbenchmarking framework.
Across the open-source ML ecosystem, project health is strong. Large repository cleanup efforts were observed in top-level projects, most notably openxla/xla, which closed 4,585 PRs (far outpacing its 1,344 new PRs), and pytorch/pytorch, which closed slightly more PRs than it opened (1,480 new vs. 1,494 closed). Development trends indicate a heavy industry focus on optimizing DeepSeek-style architectures (MoE routing, scaling) and low-precision processing (FP8) across virtually all hardware platforms.
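The backlog figures quoted above reduce to simple arithmetic. The following is a minimal Python sketch that uses only the PR counts reported in this document; the `net_pr_delta` helper is a hypothetical name introduced for illustration.

```python
# PR counts for May 2025 as quoted in this report: (new, closed).
pr_counts = {
    "openxla/xla": (1344, 4585),
    "pytorch/pytorch": (1480, 1494),
}


def net_pr_delta(new: int, closed: int) -> int:
    """Return new minus closed; a negative value means the open backlog shrank."""
    return new - closed


deltas = {repo: net_pr_delta(new, closed) for repo, (new, closed) in pr_counts.items()}

for repo, delta in deltas.items():
    print(f"{repo}: net PR delta {delta:+d}")
```

A negative delta only says that more PRs were closed than opened during the month; it does not distinguish merged PRs from PRs closed without merging.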
AMD Related Updates
- Strengthened ROCm & vLLM Integration: ROCm’s MAD repository saw active updates paving the way for the vLLM 05/27 release, including specific path updates for vLLM profiling scripts. Additionally, the broader ROCm ecosystem is addressing RCCL and Horovod installation friction.
- Quantization & PyTorch Performance: The torchao v0.11.0 release directly benefits AMD through the inclusion of preshuffled weight matrix multiplication enhancements explicitly for ROCm, ensuring competitive performance in the rapidly evolving PyTorch quantization landscape. CI fixes for AMD hardware were also prioritized in third-party engines like sglang.
- Advanced Profiling via TraceLens: The TraceLens toolset is maturing quickly. Feature requests for plotting roofline models and enabling batch GEMMs through gemmologist show a strong push toward providing AMD developers with deeper, more intuitive performance tuning insights.
- Primus Expansion: AMD’s AGI Primus framework received vital updates to support Mixtral pretrain configurations and saw its core Megatron-LM dependencies updated to late May 2025 branches.
Competitive Analysis
- NVIDIA Sunsetting Legacy Tools: NVIDIA’s apex repository completely removed its legacy amp, ddp, and rnn features. This confirms NVIDIA’s continued push to rely entirely on native PyTorch implementations rather than custom Apex extensions for basic mixed precision and distributed data parallel tasks.
- Google’s TPU Scaling: The JAX/XLA ecosystem is aggressively pushing hardware utilization boundaries. maxtext is advancing v7x AOT (Ahead-of-Time) compilation support, and openxla/xla dropped restrictions on growing XnnGraph fusions.
- DeepSeek Workloads Drive Network Innovations: DeepSeek’s DeepEP heavily optimized its peer-to-peer (P2P) latency and refined IBGDA utilization, indicating competitors are aggressively tuning networking layers for MoE expert parallel workloads.
- FP8 Adoption: Projects like xDiT are rapidly adopting FP8 forward passes for FlashAttention-3, signaling that ecosystem partners are standardizing FP8 as the default for next-gen inference and generation tasks.
📂 Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-05-23] Updated Megatron-LM submodule to the 20250522 baseline.
- [2025-05-20] Unified configuration usage and reorganized the README.
- Details:
- [May 2025] New PR: Added Mixtral pretrain configs.
- [May 2025] New PR: Fixed Megatron interleaved virtual pipeline training errors with corresponding Unit Tests.
- Metrics: 17 New PRs, 17 Closed PRs, 0 New Issues (Healthy 1:1 PR closure rate)
AMD-AGI/TraceLens
- Key Activity:
- [May 2025] Heavy development focus on performance modeling and GEMM capabilities.
- Details:
- [May 2025] Feature Request: Expose TraceLens package functionality for plotting rooflines.
- [May 2025] New PR: Enabled batch GEMM through gemmologist.
- [May 2025] Bug Fix: Made input strides optional to prevent perf model crashes and fixed IS tensor types.
- Metrics: 44 New PRs, 43 Closed PRs, 11 New Issues, 12 Closed Issues
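For readers unfamiliar with the roofline model mentioned in the TraceLens feature request, here is a minimal Python sketch of the underlying formula: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. This is not TraceLens code or its API; the function names and the hardware peaks are hypothetical placeholders chosen only to make the arithmetic visible.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], counting each matrix once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Attainable FLOP/s under the roofline model."""
    return min(peak_flops, intensity * peak_bandwidth)


# Hypothetical device peaks (assumptions, not the specification of any real GPU).
PEAK_FLOPS = 100e12      # FLOP/s
PEAK_BANDWIDTH = 1e12    # bytes/s
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where the two limits meet

small_gemm = gemm_arithmetic_intensity(64, 64, 64)
large_gemm = gemm_arithmetic_intensity(4096, 4096, 4096)

for name, intensity in [("64^3", small_gemm), ("4096^3", large_gemm)]:
    bound = "memory-bound" if intensity < ridge_point else "compute-bound"
    attainable = roofline_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"{name}: {intensity:.1f} FLOP/byte, {attainable / 1e12:.1f} TFLOP/s ({bound})")
```

Under these placeholder peaks the small GEMM sits left of the ridge point and is limited by bandwidth, while the large GEMM reaches the compute ceiling.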
ROCm/ROCm
- Key Activity:
- [2025-05-21] Merged tool-update-641.
- Details:
- [May 2025] Issue Tracking: Investigating ROCm Horovod installation failures related to missing rccl.h and CMake errors.
- [May 2025] New PRs: Fixes integrated for rocrsamples, rocr_debug_agent, and broken documentation links.
- Metrics: 116 New PRs, 117 Closed PRs, 36 New Issues, 41 Closed Issues (Strong issue/PR resolution rate)
ROCm/MAD
- Key Activity:
- [May 2025] Focused on preparing profiling pipelines for vLLM ecosystem updates.
- Details:
- [May 2025] New PR: Tracked the vLLM 05/27 release.
- [May 2025] New PR: Updated paths to utilize profiling scripts natively in vLLM.
- Metrics: 14 New PRs, 11 Closed PRs, 0 New Issues
PyTorch Ecosystem
pytorch/ao
- Key Activity:
- [2025-05-09] 🚨 RELEASE: v0.11.0
- Details:
- [2025-05-09] Released MoE Quantization capabilities (Int8WeightOnly, Int4, FP8) using both base and fake base tensor subclasses (demonstrated via LLaMA-4-Scout and Mixtral).
- [2025-05-09] Added PyTorch 2 Export Quantization (PT2E) migration paths.
- [2025-05-09] Shipped a new microbenchmarking framework for inference APIs to evaluate matrix sizes and memory profiling.
- [2025-05-09] Landed optimizations for ROCm preshuffled weight matrix multiplication.
- [May 2025] Issue tracking shifted to QAT range learning trackers and PT2E FSDP support.
- Metrics: 0 New PRs, 0 Closed PRs, 22 New Issues (Post-release issue gathering)
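As background for the Int8 weight-only entry above, the sketch below shows the general idea of symmetric per-row int8 weight quantization in plain Python. It is not torchao's implementation and does not use its API; the helper names are hypothetical, and the per-row symmetric scheme is an assumption chosen for simplicity.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one weight row: returns (int values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Map int8 values back to floats; activations stay in floating point."""
    return [v * scale for v in q]


# Toy weight matrix (illustrative values only).
weights = [
    [0.6, -1.0, 0.25],
    [0.03, 0.01, -0.04],
]

quantized = [quantize_row_int8(row) for row in weights]
restored = [dequantize_row(q, scale) for q, scale in quantized]

for row, (q, scale), back in zip(weights, quantized, restored):
    worst = max(abs(w - r) for w, r in zip(row, back))
    print(f"q={q} scale={scale:.6f} max_abs_error={worst:.6f}")
```

Each row keeps its own scale, so a row with small weights is not crushed by a large value elsewhere in the matrix; the rounding error per element is bounded by half the row's scale.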
pytorch/pytorch
- Key Activity:
- [2025-05-14] Ongoing documentation and core URL fixes.
- Details:
- [May 2025] Core Issues: Tracking FP8 scaled mm lowering ignoring scale_result arguments, and inconsistent torch.amin() results between CPU and GPU.
- [May 2025] New PRs: Cleaned up NumPy compatibility for 2D small list indices and removed AttributeError constructors.
- Metrics: 1480 New PRs, 1494 Closed PRs, 754 New Issues, 543 Closed Issues
pytorch/torchtitan
- Key Activity:
- [May 2025] Enhancing framework support for DeepSeek models and Distributed Data Parallel paradigms.
- Details:
- [May 2025] New Issue: Debugging router collapse during DeepSeek training loops.
- [May 2025] New PR: Working on SimpleFSDP adding support for HSDP/DDP + Tensor Parallelism.
- [May 2025] Bug Fix: Remedied HF AutoTokenizer root logger duplication for DeepSeek.
- Metrics: 65 New PRs, 58 Closed PRs, 25 New Issues, 18 Closed Issues
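Router collapse, as referenced in the torchtitan issue above, generally means an MoE router sending most tokens to a few experts. The report does not say how the issue is being debugged; the sketch below is one common way to quantify load imbalance, using the normalized entropy of the per-expert token share. The function names and the toy assignments are hypothetical.

```python
import math
from collections import Counter


def expert_load_fractions(assignments, num_experts):
    """Fraction of tokens routed to each expert, given one expert id per token."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def normalized_load_entropy(fractions):
    """Entropy of the load distribution divided by its maximum, log(num_experts).

    1.0 means perfectly balanced routing; values near 0 suggest collapse.
    Assumes at least two experts.
    """
    entropy = -sum(p * math.log(p) for p in fractions if p > 0)
    return entropy / math.log(len(fractions))


# Toy routing decisions for 16 tokens over 4 experts (illustrative only).
balanced = [0, 1, 2, 3] * 4
collapsed = [0] * 15 + [1]

balanced_score = normalized_load_entropy(expert_load_fractions(balanced, 4))
collapsed_score = normalized_load_entropy(expert_load_fractions(collapsed, 4))

print(f"balanced:  {balanced_score:.3f}")
print(f"collapsed: {collapsed_score:.3f}")
```

Logging such a score per MoE layer over training steps is one way to see when routing starts to concentrate; any alert threshold would have to be chosen empirically.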
meta-pytorch/monarch & facebookresearch/xformers
- Key Activity:
- [2025-05-26 to 2025-05-30] Assorted build and documentation updates across both libraries, including grpo_actor examples in Monarch and --no-build-isolation instructions for xFormers.
- Metrics: Stable/Maintenance Mode.
Google / JAX Ecosystem
openxla/xla
- Key Activity:
- [May 2025] Massive repository maintenance and architectural refactoring.
- Details:
- [May 2025] New PR: Dropped the restriction that XnnGraph fusions can only be grown from root.
- [May 2025] Deprecation: Removed API_VERSION_STATUS_RETURNING custom calls in CPU XLA.
- [May 2025] Bug tracking: Identified a TPU X64 rewriting issue on c64 instances.
- Metrics: 1344 New PRs, 4585 Closed PRs (Exceptional backlog resolution)
AI-Hypercomputer/maxtext & JetStream
- Key Activity:
- [2025-05-27] Updated the JAX stable stack naming to “jax ai image”.
- Details:
- [May 2025] Highlighted PR: Drafted support for v7x AOT compilation.
- [May 2025] Code Cleanup: Replaced internal usages of jax/_src/pjit.py.
- [May 2025] Feature Request: Llama3 conversion to HF bug fixes and Automatic Grain Data Iterator changes.
- [May 2025] JetStream: Aligned gcloud setups to mirror Maxtext.
- Metrics: Maxtext: 126 New PRs, 115 Closed PRs. JetStream: 10 New PRs, 9 Closed PRs.
jax-ml/jax
- Key Activity:
- [2025-05-30] Deprecated Mac x86 installation instructions from the standard documentation.
NVIDIA Ecosystem
NVIDIA/apex
- Key Activity:
- [2025-05-07] Completely removed deprecated amp, ddp, and rnn features to rely on upstream PyTorch implementations.
- Metrics: 0 New PRs, 0 Closed PRs (Cleanup release)
NVIDIA/Megatron-LM
- Key Activity:
- [2025-05-08] Refreshed README.md setup instructions.
Language Models, Compilers, & Inference Ecosystem
huggingface/transformers
- Key Activity:
- [2025-05-12] Updated chat generate parameterization to be powered by GenerationConfig for better UX.
- [2025-05-14] Added uv installation instructions for source builds.
- Details:
- [May 2025] Community Requests: Tracking requests for SparseVLM (Visual Token Sparsification) and SLaM (Sparse Latent Mixer).
- [May 2025] PRs: Fixed Blip2 tests and Markdown rendering for the BERT [CLS] token.
- Metrics: 423 New PRs, 389 Closed PRs, 175 New Issues
triton-lang/triton
- Key Activity:
- [2025-05-12] Removed Blackwell PyTorch build instructions.
- [2025-05-15/16] Added and immediately reverted options to remove debug info from modules.
- Details:
- [May 2025] New PR: Implemented Gluon attention kernels for d64 and d128.
- [May 2025] Bug fix: Added checks for shared memory limitations during ConvertLayoutOp conversions.
- Metrics: 308 New PRs, 302 Closed PRs, 33 New Issues
tile-ai/tilelang
- Key Activity:
- [2025-05-18] Refactored tilelang.jit to support a faster, more flexible kernel cache mechanism.
- Details:
- [May 2025] New PR: Supported T.annotate_l2_hit_ratio via cudaStreamSetAttribute.
- [May 2025] Bug Fix: Resolved warp combination simplifications for T.gemm.
- Metrics: 78 New PRs, 73 Closed PRs, 17 New Issues
deepseek-ai/DeepEP
- Key Activity:
- [May 2025] Focus on extreme latency optimization and kernel architecture.
- Details:
- [May 2025] New PRs: Pushed low-latency P2P code cleanups and transitioned to using IBGDA exclusively with lighter barriers.
- [May 2025] Issue tracking: Architecture discussions around “SM-free normal kernels” and BF16-to-FP32 conversions for reduce operations.
- Metrics: 8 New PRs, 9 Closed PRs, 25 New Issues
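The DeepEP discussion on BF16-to-FP32 conversions for reduce operations touches a general numerical point: a low-precision accumulator can stop absorbing small addends. The sketch below emulates BF16 and FP32 rounding in pure Python to show the effect. It is an illustration only, not DeepEP code; the helper names are hypothetical, and the rounding emulation assumes ordinary finite inputs (no NaN or overflow handling).

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def to_bf16(x: float) -> float:
    """Round to BF16 (top 16 bits of FP32) with round-to-nearest-even."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def reduce_sum(values, round_fn):
    """Sequential sum where the accumulator is rounded after every add."""
    acc = round_fn(0.0)
    for v in values:
        acc = round_fn(acc + v)
    return acc


# Inputs are BF16 values in both cases; only the accumulator precision differs.
values = [to_bf16(1.0)] + [to_bf16(0.001)] * 4096

bf16_total = reduce_sum(values, to_bf16)
fp32_total = reduce_sum(values, to_fp32)

print(f"BF16 accumulator: {bf16_total}")
print(f"FP32 accumulator: {fp32_total}")
```

With a BF16 accumulator, each small addend is below half the spacing between representable values near 1.0, so it is rounded away every time; converting to FP32 before reducing preserves the contributions.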
Model Inference Tooling (xDiT, vLLM, sglang, verl, rtp-llm)
- xDiT: [May 2025] Added FP8 forward paths for FlashAttention-3 and Lumina-Next CFG parallel scripts. Active work on supporting dynamic LoRA loading/unloading in diffusers pipelines. (5 PRs opened).
- verl: [2025-05-28] Added support for PF-PPO and introduced a breaking change setting actor.entropy_coeff default to 0.
- sglang: [2025-05-16] Fixed AMD CI workflows and heavily updated documentation.
- vLLM & rtp-llm & llm-d: Maintenance phase focusing purely on documentation re-organization and link fixes.
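To make the verl entropy_coeff change above concrete, the sketch below shows how an entropy coefficient typically enters a PPO-style actor loss: the policy entropy is scaled by the coefficient and subtracted from the policy-gradient loss. This is the common textbook formulation written in plain Python, not verl's code; the function names and numbers are hypothetical.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def actor_loss(policy_gradient_loss, probs, entropy_coeff=0.0):
    """Policy-gradient loss minus an entropy bonus scaled by entropy_coeff."""
    return policy_gradient_loss - entropy_coeff * categorical_entropy(probs)


# Toy values (illustrative only).
probs = [0.7, 0.2, 0.1]
pg_loss = 0.5

loss_default = actor_loss(pg_loss, probs)                       # coefficient 0
loss_with_bonus = actor_loss(pg_loss, probs, entropy_coeff=0.01)

print(f"entropy_coeff=0:    {loss_default:.6f}")
print(f"entropy_coeff=0.01: {loss_with_bonus:.6f}")
```

With a coefficient of 0 the entropy term drops out of the loss entirely, so training recipes that depended on an entropy bonus would need to set the value explicitly after this change.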