📅 Engineering Report (2025-05-01 - 2025-05-31)

🚀 Executive Summary

May 2025 saw progress in quantization, hardware-specific model optimization, and large-scale backlog cleanup across the ecosystem. The standout event of the month was the 🚨 release of torchao v0.11.0, which added Mixture-of-Experts (MoE) quantization and a new microbenchmarking framework for inference APIs.

Across the open-source ML ecosystem, project health looks strong. The largest cleanup was in openxla/xla, which closed 4,585 PRs against 1,344 new ones; pytorch/pytorch closed slightly more PRs than it opened (1,494 closed vs. 1,480 new). Development activity points to a heavy focus on DeepSeek-style architectures (MoE routing, scaling) and low-precision processing (FP8) across most hardware platforms.
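The backlog figures above reduce to simple arithmetic: the net change in open PRs is new PRs minus closed PRs. A small sketch using only the counts quoted in this report (the helper name `net_pr_delta` is illustrative, and reopened PRs are ignored):

```python
# (new, closed) PR counts for May 2025, as quoted in this report.
pr_counts = {
    "openxla/xla": (1344, 4585),
    "pytorch/pytorch": (1480, 1494),
}


def net_pr_delta(new: int, closed: int) -> int:
    """Net change in open PRs; a negative value means the backlog shrank."""
    return new - closed


deltas = {repo: net_pr_delta(new, closed) for repo, (new, closed) in pr_counts.items()}

for repo, delta in deltas.items():
    print(f"{repo}: {delta:+d}")
```

A negative delta only says that more PRs were closed than opened during the month; it does not distinguish merged PRs from ones closed without merging.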

AMD Related Updates

Strengthened ROCm & vLLM Integration: ROCm’s MAD repository tracked the vLLM 05/27 release and updated paths so that profiling scripts are used natively in vLLM. Separately, the ROCm issue tracker is investigating Horovod installation failures related to a missing rccl.h and CMake errors.
Quantization & PyTorch Performance: The torchao v0.11.0 release includes optimizations for preshuffled weight matrix multiplication on ROCm. AMD CI workflows were also fixed in third-party engines such as sglang.
Advanced Profiling via TraceLens: The TraceLens toolset is maturing quickly. A feature request to expose roofline plotting and a PR enabling batch GEMM through gemmologist point toward deeper performance-tuning insight for AMD developers.
Primus Expansion: AMD-AGI’s Primus framework added Mixtral pretrain configs and updated its Megatron-LM submodule to the 20250522 baseline.

Competitive Analysis

NVIDIA Sunsetting Legacy Tools: NVIDIA’s apex repository removed its deprecated amp, ddp, and rnn features. This is consistent with NVIDIA relying on native PyTorch implementations, rather than custom Apex extensions, for basic mixed precision and distributed data parallel tasks.
Google’s TPU Scaling: The JAX/XLA ecosystem continues to push on hardware utilization. maxtext drafted support for v7x AOT (Ahead-of-Time) compilation, and openxla/xla dropped the restriction that XnnGraph fusions can only be grown from root.
DeepSeek Workloads Drive Network Innovations: DeepSeek’s DeepEP cleaned up its low-latency peer-to-peer (P2P) code and moved to using IBGDA exclusively with lighter barriers, which suggests competitors are tuning networking layers for MoE expert-parallel workloads.
FP8 Adoption: xDiT added FP8 forward paths for FlashAttention-3, one sign that ecosystem partners are moving toward FP8 for next-generation inference and generation tasks.

📂 Category Updates

AMD Ecosystem

AMD-AGI/Primus

Key Activity:
- [2025-05-23] Updated Megatron-LM submodule to the 20250522 baseline.
- [2025-05-20] Unified configuration usage and reorganized the README.
Details:
- [May 2025] New PR: Added Mixtral pretrain configs.
- [May 2025] New PR: Fixed Megatron interleaved virtual pipeline training errors with corresponding Unit Tests.
Metrics: 17 New PRs, 17 Closed PRs, 0 New Issues (healthy 1:1 PR closure rate)

AMD-AGI/TraceLens

Key Activity:
- [May 2025] Heavy development focus on performance modeling and GEMM capabilities.
Details:
- [May 2025] Feature Request: Expose TraceLens package functionality for plotting rooflines.
- [May 2025] New PR: Enabled batch GEMM through gemmologist.
- [May 2025] Bug Fix: Made input strides optional to prevent perf model crashes and fixed IS tensor types.
Metrics: 44 New PRs, 43 Closed PRs, 11 New Issues, 12 Closed Issues
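For context on the roofline request: a roofline model bounds attainable throughput by the lower of peak compute and memory bandwidth times arithmetic intensity. The sketch below is plain Python and does not use the TraceLens API; the function names and the hardware numbers are hypothetical placeholders, and the GEMM traffic estimate assumes each matrix moves through memory exactly once.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n], each matrix moved once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_tflops(intensity: float, peak_tflops: float, bandwidth_tbps: float) -> float:
    """Attainable TFLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_tflops, bandwidth_tbps * intensity)


PEAK_TFLOPS = 1000.0   # hypothetical peak compute, TFLOP/s
BANDWIDTH_TBPS = 5.0   # hypothetical memory bandwidth, TB/s

ai_square = gemm_arithmetic_intensity(8192, 8192, 8192)  # large square GEMM
ai_skinny = gemm_arithmetic_intensity(1, 8192, 8192)     # matrix-vector-like GEMM

for name, ai in [("square", ai_square), ("skinny", ai_skinny)]:
    bound = roofline_tflops(ai, PEAK_TFLOPS, BANDWIDTH_TBPS)
    regime = "compute-bound" if bound == PEAK_TFLOPS else "memory-bound"
    print(f"{name}: intensity={ai:.2f} FLOP/byte, roof={bound:.2f} TFLOP/s ({regime})")
```

Under these assumed numbers the square GEMM hits the compute roof while the skinny one is limited by bandwidth, which is the kind of distinction a roofline plot makes visible.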

ROCm/ROCm

Key Activity:
- [2025-05-21] Merged tool-update-641.
Details:
- [May 2025] Issue Tracking: Investigating ROCm Horovod installation failures related to missing rccl.h and CMake errors.
- [May 2025] New PRs: Fixes integrated for rocrsamples, rocr_debug_agent, and broken documentation links.
Metrics: 116 New PRs, 117 Closed PRs, 36 New Issues, 41 Closed Issues (strong issue/PR resolution rate)

ROCm/MAD

Key Activity:
- [May 2025] Focused on preparing profiling pipelines for vLLM ecosystem updates.
Details:
- [May 2025] New PR: Tracked the vLLM 05/27 release.
- [May 2025] New PR: Updated paths to utilize profiling scripts natively in vLLM.
Metrics: 14 New PRs, 11 Closed PRs, 0 New Issues

PyTorch Ecosystem

pytorch/ao

Key Activity:
- [2025-05-09] 🚨 RELEASE: v0.11.0
Details:
- [2025-05-09] Released MoE Quantization capabilities (Int8WeightOnly, Int4, FP8) using both base and fake base tensor subclasses (demonstrated via LLaMA-4-Scout and Mixtral).
- [2025-05-09] Added PyTorch 2 Export Quantization (PT2E) migration paths.
- [2025-05-09] Shipped a new microbenchmarking framework for inference APIs to evaluate matrix sizes and memory profiling.
- [2025-05-09] Landed optimizations for ROCm preshuffled weight matrix multiplication.
- [May 2025] Issue tracking shifted to QAT range learning trackers and PT2E FSDP support.
Metrics: 0 New PRs, 0 Closed PRs, 22 New Issues (post-release issue gathering)
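As a library-free illustration of what int8 weight-only quantization does in general, the sketch below stores each weight row as int8 values plus one floating-point scale and dequantizes on use. It is not torchao's implementation or API; the helper names are hypothetical, and the symmetric per-row scheme is an assumption chosen for simplicity.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one weight row: returns (int values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate floating-point weights from int8 values and a scale."""
    return [v * scale for v in q]


weights = [
    [0.3, -1.0, 0.25],     # row with a large dynamic range
    [0.03, 0.01, -0.04],   # row of small values gets its own, finer scale
]

quantized = [quantize_row_int8(row) for row in weights]
restored = [dequantize_row(q, scale) for q, scale in quantized]

max_err = max(
    abs(a - b)
    for orig_row, new_row in zip(weights, restored)
    for a, b in zip(orig_row, new_row)
)
print("int8 rows:", [q for q, _ in quantized])
print("max abs reconstruction error:", max_err)
```

Keeping one scale per row means a row of small weights is not crushed by a large value elsewhere in the matrix; the rounding error of each element stays within half of its own row's scale.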

pytorch/pytorch

Key Activity:
- [2025-05-14] Ongoing documentation and core URL fixes.
Details:
- [May 2025] Core Issues: Tracking FP8 scaled mm lowering ignoring scale_result arguments, and inconsistent torch.amin() results between CPU and GPU.
- [May 2025] New PRs: Cleaned up NumPy compatibility for 2D small list indices and removed AttributeError constructors.
Metrics: 1480 New PRs, 1494 Closed PRs, 754 New Issues, 543 Closed Issues

pytorch/torchtitan

Key Activity:
- [May 2025] Enhancing framework support for DeepSeek models and Distributed Data Parallel paradigms.
Details:
- [May 2025] New Issue: Debugging router collapse during DeepSeek training loops.
- [May 2025] New PR: Extending SimpleFSDP with support for HSDP/DDP combined with Tensor Parallelism.
- [May 2025] Bug Fix: Fixed root logger duplication caused by the HF AutoTokenizer for DeepSeek.
Metrics: 65 New PRs, 58 Closed PRs, 25 New Issues, 18 Closed Issues
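Router collapse in an MoE model generally means that most tokens end up routed to a small subset of experts. The report does not say how torchtitan diagnoses it; the sketch below is one generic, assumed diagnostic (normalized entropy of expert usage) with hypothetical names and synthetic assignments.

```python
import math
from collections import Counter


def expert_load_entropy(assignments, num_experts):
    """Normalized entropy of expert usage.

    Close to 1.0 when tokens are spread evenly across experts,
    close to 0.0 when nearly all tokens go to a single expert.
    """
    counts = Counter(assignments)
    total = len(assignments)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_experts)


NUM_EXPERTS = 4

# Synthetic token-to-expert assignments (top-1 routing), for illustration only.
balanced = [i % NUM_EXPERTS for i in range(1000)]
collapsed = [0] * 990 + [1, 2, 3] * 3 + [1]

balanced_score = expert_load_entropy(balanced, NUM_EXPERTS)
collapsed_score = expert_load_entropy(collapsed, NUM_EXPERTS)

print(f"balanced routing:  {balanced_score:.3f}")
print(f"collapsed routing: {collapsed_score:.3f}")
```

Tracking a score like this per layer over training steps is one way to see a collapse developing before it shows up in the loss.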

meta-pytorch/monarch & facebookresearch/xformers

Key Activity:
- [2025-05-26 to 2025-05-30] Assorted build and documentation updates across both libraries, including grpo_actor examples in Monarch and --no-build-isolation instructions for xFormers.
Metrics: Stable/Maintenance Mode.

Google / JAX Ecosystem

openxla/xla

Key Activity:
- [May 2025] Massive repository maintenance and architectural refactoring.
Details:
- [May 2025] New PR: Dropped the restriction that XnnGraph fusions can only be grown from root.
- [May 2025] Deprecation: Removed API_VERSION_STATUS_RETURNING custom calls in CPU XLA.
- [May 2025] Bug tracking: Identified a TPU X64 rewriting issue on c64 instances.
Metrics: 1344 New PRs, 4585 Closed PRs (exceptional backlog resolution)

AI-Hypercomputer/maxtext & JetStream

Key Activity:
- [2025-05-27] Updated the JAX stable stack naming to “jax ai image”.
Details:
- [May 2025] Highlighted PR: Drafted support for v7x AOT compilation.
- [May 2025] Code Cleanup: Replaced internal usages of jax/_src/pjit.py.
- [May 2025] Requests: Bug fixes for Llama3 conversion to HF and changes to the automatic Grain data iterator.
- [May 2025] JetStream: Aligned gcloud setups to mirror MaxText.
Metrics: MaxText: 126 New PRs, 115 Closed PRs. JetStream: 10 New PRs, 9 Closed PRs.

jax-ml/jax

Key Activity:
- [2025-05-30] Deprecated Mac x86 installation instructions from the standard documentation.

NVIDIA Ecosystem

NVIDIA/apex

Key Activity:
- [2025-05-07] Completely removed deprecated amp, ddp, and rnn features to rely on upstream PyTorch implementations.
Metrics: 0 New PRs 0 Closed PRs (Cleanup release)

NVIDIA/Megatron-LM

Key Activity:
- [2025-05-08] Refreshed README.md setup instructions.

Language Models, Compilers, & Inference Ecosystem

huggingface/transformers

Key Activity:
- [2025-05-12] Updated chat generate parameterization to be powered by GenerationConfig for better UX.
- [2025-05-14] Added uv installation instructions for source builds.
Details:
- [May 2025] Community Requests: Tracking requests for SparseVLM (Visual Token Sparsification) and SLaM (Sparse Latent Mixer).
- [May 2025] PRs: Fixed Blip2 tests and Markdown rendering for the BERT [CLS] token.
Metrics: 423 New PRs, 389 Closed PRs, 175 New Issues

triton-lang/triton

Key Activity:
- [2025-05-12] Removed Blackwell PyTorch build instructions.
- [2025-05-15/16] Added and immediately reverted options to remove debug info from modules.
Details:
- [May 2025] New PR: Implemented Gluon attention kernels for d64 and d128.
- [May 2025] Bug fix: Added checks for shared memory limitations during ConvertLayoutOp conversions.
Metrics: 308 New PRs, 302 Closed PRs, 33 New Issues

tile-ai/tilelang

Key Activity:
- [2025-05-18] Refactored tilelang.jit to support a faster, more flexible kernel cache mechanism.
Details:
- [May 2025] New PR: Supported T.annotate_l2_hit_ratio via cudaStreamSetAttribute.
- [May 2025] Bug Fix: Resolved warp combination simplifications for T.gemm.
Metrics: 78 New PRs, 73 Closed PRs, 17 New Issues

deepseek-ai/DeepEP

Key Activity:
- [May 2025] Focus on extreme latency optimization and kernel architecture.
Details:
- [May 2025] New PRs: Pushed low-latency P2P code cleanups and transitioned to using IBGDA exclusively with lighter barriers.
- [May 2025] Issue tracking: Architecture discussions around “SM-free normal kernels” and BF16-to-FP32 conversions for reduce operations.
Metrics: 8 New PRs, 9 Closed PRs, 25 New Issues

Model Inference Tooling (xDiT, vLLM, sglang, verl, rtp-llm)

xDiT: [May 2025] Added FP8 forward paths for FlashAttention-3 and Lumina-Next CFG parallel scripts. Active work on supporting dynamic LoRA loading/unloading in diffusers pipelines. (5 PRs opened).
verl: [2025-05-28] Added support for PF-PPO and introduced a breaking change setting actor.entropy_coeff default to 0.
sglang: [2025-05-16] Fixed AMD CI workflows and heavily updated documentation.
vLLM, rtp-llm, and llm-d: Maintenance phase focused on documentation reorganization and link fixes.
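The verl item above changes the default of actor.entropy_coeff to 0. As a rough illustration of what an entropy coefficient typically controls in a PPO-style actor loss, the sketch below uses a common generic formulation (policy loss minus coefficient times entropy). It is not verl's code; `actor_loss` and its arguments are hypothetical.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def actor_loss(policy_loss, probs, entropy_coeff=0.0):
    """Generic actor loss with an entropy bonus.

    With entropy_coeff = 0 the entropy term vanishes and only the
    policy-gradient loss remains.
    """
    return policy_loss - entropy_coeff * categorical_entropy(probs)


probs = [0.7, 0.2, 0.1]  # illustrative action probabilities

loss_default = actor_loss(0.5, probs)                          # coefficient of 0
loss_with_bonus = actor_loss(0.5, probs, entropy_coeff=0.001)  # explicit bonus

print("loss with entropy_coeff=0:    ", loss_default)
print("loss with entropy_coeff=0.001:", loss_with_bonus)
```

Under this formulation, a configuration that relied on a nonzero default would need to set the coefficient explicitly to keep its previous behavior.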
