📅 Engineering Report (2025-03-01 - 2025-03-31)

🚀 Executive Summary

March 2025 saw sustained engineering activity across the AI and ML ecosystems, particularly in framework support for newer models (DeepSeek-V3, Gemma 3, Qwen2.5-VL) and in optimizations for parallelism and quantization. Core libraries kept pace with large incoming PR queues to differing degrees: pytorch/pytorch closed 1368 PRs against 1432 opened (about 96%), while openxla/xla closed 861 against 1132 (about 76%). The most significant changes this month came in distributed training and inference engines, with volcengine/verl and xdit-project/xDiT both shipping major feature releases that broaden cross-hardware support and add distributed caching mechanisms.
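
The PR closure rates referred to in the summary can be checked against the Metrics lines later in this report. The sketch below is a minimal illustration, assuming "closure rate" means closed PRs divided by new PRs in the same window; the helper name close_ratio is hypothetical. Closed PRs may include PRs opened before March, so the ratio is a rough proxy rather than a true merge rate.

```python
# (new PRs, closed PRs) copied from the Metrics lines in this report.
pr_metrics = {
    "pytorch/pytorch": (1432, 1368),
    "openxla/xla": (1132, 861),
    "jax-ml/jax": (617, 566),
}


def close_ratio(new_prs: int, closed_prs: int) -> float:
    """Closed PRs as a fraction of PRs opened in the same reporting window."""
    if new_prs <= 0:
        raise ValueError("close ratio is undefined without new PRs")
    return closed_prs / new_prs


for repo, (new_prs, closed_prs) in pr_metrics.items():
    print(f"{repo}: {close_ratio(new_prs, closed_prs):.1%}")
```

Repositories that report 0 new PRs are left out on purpose, since the ratio is undefined for them.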

  • Expanded Upstream Hardware Support: Two third-party frameworks added AMD GPU support this month. The volcengine/verl 🚨 v0.3.0.post0 release brought AMD support to its vLLM and FSDP backends, and xdit-project/xDiT 🚨 v0.4.3 added official AMD GPU support.
  • ROCm & MAD Container Updates: ROCm/MAD continuously synced its container recipes with the broader ecosystem, including updates to the unified vLLM docker (v0.7.3), maxtext-v25.4 for JAX training, and megatron-lm v25.4.
  • Tooling Stability: AMD’s internal tools saw steady improvements. TraceLens closed 25 PRs, including scalability work such as an alternative subtract_intervals function, while Primus refined data preprocessing and shell scripts. ROCm runtime engineers are tracking edge cases, including a runtime crash after system suspension and Stable Diffusion compatibility on the RX 6750 GRE.

Competitive Analysis

  • NVIDIA’s FP8 & Optimization Push: NVIDIA/TransformerEngine is aggressively optimizing MXFP8 cast kernels and GroupedGEMM operations in JAX. However, users are reporting backward-compatibility friction, specifically regarding FP8 extra states from version 1.x failing to load in 2.x.
  • DeepSeek Hardware Specifics: deepseek-ai/DeepEP removed its NVLink low-latency plan from the documentation and added BF16 support for its low-latency kernels. Maintainers are triaging Hopper-specific bugs, including test_low_latency.py blocking on H20 nodes.
  • Ecosystem Hardware Friction: Alternative hardware and newer toolchains are surfacing build issues in established libraries. xformers is seeing requests for Huawei Ascend/CANN build support, while DeepSpeed removed legacy definitions that caused compilation errors under CUDA 12.6.

📂 Category Updates

AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2025-03-31] Refactored shell scripts.
    • [2025-03-25] Added MTP support and updated the README.
  • Details:
    • [2025-03-31] HIGHLIGHT: New PR for data preprocessing enhancements.
    • [2025-03-31] HIGHLIGHT: New PR to refactor shell scripts.
  • Metrics: 12 New PRs, 11 Closed PRs; 0 New Issues, 0 Closed Issues

AMD-AGI/TraceLens

  • Key Activity:
    • [2025-03-25] Updated README for the v0.3 release.
    • [2025-03-06] Documentation updates for v0.2.
  • Details:
    • [2025-03-25] HIGHLIGHT: Merged an alternative subtract_intervals function optimized for large-scale profiling scalability.
  • Metrics: 25 New PRs, 25 Closed PRs; 0 New Issues, 2 Closed Issues
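
The report does not show how TraceLens implements subtract_intervals, so the following is only a generic sketch of the operation the name suggests: removing one set of time intervals from another with a single sorted sweep, which scales to large profiles better than pairwise comparison. The half-open [start, end) convention and the helper merge_intervals are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # assumed half-open: [start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse any that overlap or touch; drop empty ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(base: List[Interval], remove: List[Interval]) -> List[Interval]:
    """Return the parts of `base` that are not covered by `remove`."""
    base = merge_intervals(base)
    remove = merge_intervals(remove)
    result: List[Interval] = []
    j = 0  # first removal interval that can still affect the current base interval
    for start, end in base:
        cursor = start
        while j < len(remove) and remove[j][1] <= cursor:
            j += 1
        k = j
        while k < len(remove) and remove[k][0] < end:
            r_start, r_end = remove[k]
            if r_start > cursor:
                result.append((cursor, r_start))  # uncovered gap before this removal
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
            k += 1
        if cursor < end:
            result.append((cursor, end))  # uncovered tail of the base interval
    return result


print(subtract_intervals([(0, 10)], [(2, 3), (5, 7)]))
```

In a profiling context this kind of subtraction is typically used to find, for example, the spans of one activity that are not overlapped by another.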

ROCm/ROCm

  • Key Activity:
    • [2025-03-31] Ongoing triage of core runtime and specific GPU availability issues.
  • Details:
    • [2025-03-31] HIGHLIGHT: External CI (Ex CI) added a dependency on rocprof-sdk for rocprof-compute.
    • [2025-03-31] HIGHLIGHT: External CI (Ex CI) added the Ninja build generator for 12 components.
    • [2025-03-31] ISSUE: Tracking a ROCm runtime crash after system suspension.
    • [2025-03-31] ISSUE: Investigating Stable Diffusion compatibility on the RX 6750 GRE.
  • Metrics: 61 New PRs, 64 Closed PRs; 46 New Issues, 38 Closed Issues

ROCm/MAD

  • Key Activity:
    • [2025-03-12] Unified the vLLM docker definition with upstream v0.7.3.
  • Details:
    • [2025-03-31] HIGHLIGHT: Added README for jax-training:maxtext-v25.4.
    • [2025-03-31] HIGHLIGHT: Added megatron-lm training docker v25.4.
  • Metrics: 6 New PRs, 5 Closed PRs; 0 New Issues, 0 Closed Issues

PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • [2025-03-24] Removed pre-CXX11 ABI logic from build scripts.
  • Details:
    • [2025-03-31] HIGHLIGHT: Built macOS CI with MKLDNN.
    • [2025-03-31] HIGHLIGHT: Added gen_schema support for invoke_subgraph in hop schema.
    • [2025-03-31] ISSUE: Adding copy kwarg in torch.reshape() to follow the Python array API standard.
    • [2025-03-31] ISSUE: Addressed ONNX decomp not preserving custom CompositeImplicitAutograd ops.
  • Metrics: 1432 New PRs, 1368 Closed PRs; 708 New Issues, 512 Closed Issues

pytorch/torchtitan

  • Key Activity:
    • [2025-03-24] Updated modules to use -m option for script execution.
    • [2025-03-06] Legal modifications applied to additional datasets.
  • Details:
    • [2025-03-31] HIGHLIGHT: Refactored train_spec.loss_fn to build_loss_fn and implemented chunked loss.
    • [2025-03-31] HIGHLIGHT: Work-in-progress kernel for Contiguous Group GEMM.
    • [2025-03-31] ISSUE: Investigating Context parallel support on legacy Turing GPUs.
  • Metrics: 100 New PRs, 91 Closed PRs; 27 New Issues, 15 Closed Issues
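
The report does not describe how torchtitan's chunked loss is built. As a rough illustration of the general idea only, the sketch below computes a mean cross-entropy over a sequence in chunks, so that logits exist for one chunk at a time rather than for the whole sequence. It uses plain Python instead of PyTorch so that it runs without dependencies; chunked_mean_loss, cross_entropy_sum, and toy_project are hypothetical names, not torchtitan APIs.

```python
import math
from typing import Callable, List, Sequence


def cross_entropy_sum(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Summed (not averaged) cross-entropy, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total


def chunked_mean_loss(
    hidden: Sequence[float],
    targets: Sequence[int],
    project: Callable[[float], List[float]],
    chunk_size: int,
) -> float:
    """Mean cross-entropy, projecting to logits one chunk at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(hidden) != len(targets) or not hidden:
        raise ValueError("hidden and targets must be non-empty and equal in length")
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        # Only this chunk's logits are materialized at any moment.
        logits = [project(h) for h in hidden[start:stop]]
        total += cross_entropy_sum(logits, targets[start:stop])
    return total / len(hidden)  # normalize once, over all tokens


def toy_project(h: float) -> List[float]:
    """Hypothetical 3-class output layer applied to a scalar hidden state."""
    return [0.5 * h, -0.25 * h, 0.1 * h + 1.0]


hidden = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [2, 0, 1, 0, 2]
full = chunked_mean_loss(hidden, targets, toy_project, chunk_size=len(hidden))
chunked = chunked_mean_loss(hidden, targets, toy_project, chunk_size=2)
print(math.isclose(full, chunked))
```

Summing per-chunk losses and dividing once by the total token count keeps the result equal to the unchunked mean even when the last chunk is shorter.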

pytorch/ao

  • Key Activity:
    • [2025-03-10] Promoted Low Bit Optim out of prototype status.
    • [2025-03-05] Reverted the move of torchao/_models to benchmarks/_models.
  • Details:
    • [2025-03-31] ISSUE: Diagnosing failed dispatch of Custom CUDA OPs in TorchAO.
    • [2025-03-31] ISSUE: Addressed failures when saving static quantized models.
  • Metrics: 0 New PRs, 0 Closed PRs; 25 New Issues, 38 Closed Issues

pytorch/FBGEMM

  • Key Activity:
    • [2025-03-29] General documentation updates.
  • Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues

JAX & OpenXLA Ecosystem

jax-ml/jax

  • Key Activity:
    • [2025-03-31] Heavy codebase maintenance with >500 PRs merged.
  • Details:
    • [2025-03-31] HIGHLIGHT: Added __jax_array__ support in jnp.reshape, transpose, and matrix_transpose.
    • [2025-03-31] HIGHLIGHT: Included the ProfilerData class in Jaxlib’s profiler submodule.
    • [2025-03-31] ISSUE: Proposed adding the “largest” argument to top_k.
  • Metrics: 617 New PRs, 566 Closed PRs; 101 New Issues, 60 Closed Issues

openxla/xla

  • Key Activity:
    • [2025-03-10] Homepage link updates and backend documentation.
  • Details:
    • [2025-03-31] HIGHLIGHT: Autotuning updates to ensure entry versions only need updating in one place.
    • [2025-03-31] HIGHLIGHT: Fixes submitted for TensorFlow GPU builds.
    • [2025-03-31] ISSUE: Multidevice sharding raising UnspecifiedValue with custom PJRT backends.
    • [2025-03-31] ISSUE: Requests for replacing tanhf on aarch64 with vectorized SVE implementations.
  • Metrics: 1132 New PRs, 861 Closed PRs; 14 New Issues, 17 Closed Issues

AI-Hypercomputer/maxtext

  • Key Activity:
    • [2025-03-22] Added Gemma 3 announcements and DeepSeek instructions to the README.
  • Details:
    • [2025-03-31] HIGHLIGHT: Refactored Prefill Packing into a dedicated python module.
    • [2025-03-31] HIGHLIGHT: Unsharded QKV on the head dimension.
    • [2025-03-31] ISSUE: Identified bug where moe_lb_loss was not divided by gradient_accumulation_steps for reporting.
    • [2025-03-31] ISSUE: Addressed errors when using Megablox with expert_parallelism.
  • Metrics: 171 New PRs, 114 Closed PRs; 6 New Issues, 6 Closed Issues
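
To see why the moe_lb_loss reporting bug matters, consider a metric that is accumulated across micro-batches. The sketch below assumes the value is summed once per micro-batch and should be reported as a per-step average; the numbers and the helper reported_metric are made up for illustration and are not taken from maxtext.

```python
import math
from typing import Sequence


def reported_metric(per_microbatch_values: Sequence[float], gradient_accumulation_steps: int) -> float:
    """Average an accumulated metric over the micro-batches of one optimizer step."""
    if gradient_accumulation_steps <= 0:
        raise ValueError("gradient_accumulation_steps must be positive")
    if len(per_microbatch_values) != gradient_accumulation_steps:
        raise ValueError("expected one value per accumulation step")
    return sum(per_microbatch_values) / gradient_accumulation_steps


# Hypothetical load-balancing loss values from four micro-batches.
values = [0.02, 0.03, 0.01, 0.04]
undivided = sum(values)                  # what gets logged if the division is skipped
averaged = reported_metric(values, 4)    # what a per-step average would show

print(f"undivided={undivided:.3f} averaged={averaged:.3f}")
print(math.isclose(undivided / averaged, 4.0))
```

Under these assumptions the logged value is inflated by a factor equal to gradient_accumulation_steps, which makes runs with different accumulation settings hard to compare.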

AI-Hypercomputer/JetStream

  • Key Activity:
    • [2025-03-31] Focused optimizations on context generation.
  • Details:
    • [2025-03-31] HIGHLIGHT: Fixed chunked prefill regression.
    • [2025-03-31] HIGHLIGHT: Added support for long context dataset accuracy measurement.
  • Metrics: 16 New PRs, 15 Closed PRs; 0 New Issues, 0 Closed Issues

NVIDIA & Low-Level Operations

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-03-31] Deep focus on FP8 performance and JAX integrations.
  • Details:
    • [2025-03-31] HIGHLIGHT: Improved performance of mxfp8 cast kernels.
    • [2025-03-31] HIGHLIGHT: JAX Refactor incorporating MXFP8 and GroupedGEMM.
    • [2025-03-31] ISSUE: Backward compatibility bug where PyTorch FP8 extra state from version 1.x cannot load in 2.x.
  • Metrics: 71 New PRs, 76 Closed PRs; 28 New Issues, 33 Closed Issues

triton-lang/triton

  • Key Activity:
    • [2025-03-28] Pinned cmake to < 4 and added MAX_JOBS install instructions.
  • Details:
    • [2025-03-31] HIGHLIGHT: Updated backend to target llvm/llvm-project@1d4801f22ab.
    • [2025-03-31] ISSUE: Loading from TMA descriptor hangs investigated.
    • [2025-03-31] ISSUE: User requests for tl.dot auto-broadcast support.
  • Metrics: 215 New PRs, 212 Closed PRs; 55 New Issues, 37 Closed Issues

facebookresearch/xformers

  • Key Activity:
    • [2025-03-31] Maintenance on core attention kernels.
  • Details:
    • [2025-03-31] HIGHLIGHT: Corrected the condition for using merge_nhead_groups_seqlen_q.
    • [2025-03-31] ISSUE: Ecosystem requests for NPU builds targeting Huawei Ascend / CANN.
  • Metrics: 5 New PRs, 7 Closed PRs; 7 New Issues, 4 Closed Issues

deepspeedai/DeepSpeed

  • Key Activity:
    • [2025-03-25] Linked AutoTP blogs and updated documentation handlers.
  • Details:
    • [2025-03-31] HIGHLIGHT: Updated to new PyTorch grad hook APIs for BF16Optimizer and Stage2.
    • [2025-03-31] HIGHLIGHT: Removed legacy definitions causing compilation errors in CUDA 12.6.
    • [2025-03-31] ISSUE: Handled grad_norm NaN issues when using overlap_comm:True with contiguous_gradients:True.
  • Metrics: 47 New PRs, 44 Closed PRs; 47 New Issues, 29 Closed Issues

deepseek-ai/DeepEP

  • Key Activity:
    • [2025-03-27] Removed NVLink low-latency plan from documentation.
    • [2025-03-10] Added BF16 support for low-latency kernels.
  • Details:
    • [2025-03-31] HIGHLIGHT: Adjusted kNumThreads of notify_dispatch.
    • [2025-03-31] ISSUE: Triaging processes that block on H20 nodes while running test_low_latency.py.
  • Metrics: 7 New PRs, 8 Closed PRs; 65 New Issues, 40 Closed Issues

HuggingFace & LLM Frameworks

volcengine/verl

  • Key Activity:
    • [2025-03-30] 🚨 RELEASE: v0.3.0.post0 - A large feature release including AMD support for the vLLM and FSDP backends, Qwen2.5-VL support, the PRIME, RLOO, and ReMax algorithms, and FIRE sampling. A preview of SGLang integration is also available.
  • Details:
    • [2025-03-30] HIGHLIGHT: Megatron upgraded to v0.11; vLLM upgraded to v0.8.2.
    • [2025-03-30] HIGHLIGHT: Added Ulysses sequence parallel support (transformers >= 4.48) and parameter offloading during rollout.
  • Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues

xdit-project/xDiT

  • Key Activity:
    • [2025-03-26] 🚨 RELEASE: 0.4.3.post1 - Hotfix patch.
    • [2025-03-20] 🚨 RELEASE: 0.4.3 - Brought official AMD GPU support, SDXL CFG parallel support, and Sage attention in long_ctx_attn.
    • [2025-03-03] 🚨 RELEASE: 0.4.2 - Added Ray-based disaggregation of VAE and DiT, USP implementations, TeaCache, FBCache, and Tensor Parallelism for the Step-Video-T2V model.
  • Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues

huggingface/transformers

  • Key Activity:
    • [2025-03-21] README and installation documentation updates.
  • Details:
    • [2025-03-31] HIGHLIGHT: Added fast image processor for ZoeDepth.
    • [2025-03-31] HIGHLIGHT: Updated model card for distilbert.
    • [2025-03-31] ISSUE: Diagnosing chunk dimension errors with default data parallelism in the Trainer.
    • [2025-03-31] ISSUE: Resolved DeepSeek-V3 loading warnings.
  • Metrics: 461 New PRs, 369 Closed PRs; 205 New Issues, 165 Closed Issues

tile-ai/tilelang

  • Key Activity:
    • [2025-03-26] Deprecated T.Buffer arguments, transitioning the syntax over to T.Tensor.
  • Details:
    • [2025-03-31] ISSUE: Users migrating from older versions are hitting module 'tilelang.language' has no attribute 'Tensor'.
    • [2025-03-31] ISSUE: Compiler caching bug in example_mha_bwd.
  • Metrics: 0 New PRs, 0 Closed PRs; 48 New Issues, 34 Closed Issues
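
The "no attribute 'Tensor'" error above is what appears when code written for the new T.Tensor syntax runs against an install that only provides T.Buffer. Upgrading the install is the direct fix; for code that has to run against both, one defensive pattern is to look the annotation up by name. The helper below is hypothetical and not part of tilelang, and SimpleNamespace stand-ins replace the real module so the sketch runs without tilelang installed. Whether T.Buffer and T.Tensor accept identical arguments is not stated in this report, so treat that as an assumption to verify.

```python
from types import SimpleNamespace
from typing import Any


def resolve_tensor_annotation(language_module: Any) -> Any:
    """Prefer `Tensor`; fall back to the older `Buffer` name if it is all that exists."""
    for name in ("Tensor", "Buffer"):
        annotation = getattr(language_module, name, None)
        if annotation is not None:
            return annotation
    raise AttributeError("module provides neither 'Tensor' nor 'Buffer'")


# Stand-ins for an older and a newer `tilelang.language` module (illustrative only).
old_api = SimpleNamespace(Buffer="buffer-annotation")
new_api = SimpleNamespace(Buffer="buffer-annotation", Tensor="tensor-annotation")

print(resolve_tensor_annotation(old_api))
print(resolve_tensor_annotation(new_api))

# With the real library this would be used roughly as:
#   import tilelang.language as T
#   TensorArg = resolve_tensor_annotation(T)
```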

vllm-project/vllm

  • Key Activity:
    • [2025-03-20 to 2025-03-23] Trimmed and refreshed README front-page news and added a user forum.
  • Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues

sgl-project/sglang

  • Key Activity:
    • [2025-03-04 to 2025-03-22] Routine documentation and README.md alignment.
  • Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues

alibaba/rtp-llm

  • Key Activity:
    • [2025-03-04] Documented multi-process capabilities for the frontend server.
  • Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues