Jinming Ma

Young Researcher at Shanghai Artificial Intelligence Laboratory

Jinming Ma holds bachelor's and master's degrees from Harbin Institute of Technology, where he studied under Professors Xiaohua Chen and Kui Yu. He is currently a Young Researcher at the Shanghai Artificial Intelligence Laboratory, specializing in high-performance inference services for large language models (LLMs) and heterogeneous communication library technologies. He led the development of the lab's heterogeneous communication library, which has been successfully applied to large-scale, cross-domain intelligent computing projects over links spanning more than a thousand kilometers, addressing key communication challenges in distributed training. His research has been published at top international conferences such as DAC, ICPP, and IISWC, and he holds four invention patents.

Topic

Design and Implementation of a Heterogeneous Communication Library Balancing Flexibility and Efficiency

In AI heterogeneous computing scenarios, inefficient interconnection across architectures, the difficulty of adapting communication paradigms to diverse use cases, and the performance bottlenecks of traditional communication libraries under high concurrency and high bandwidth have long been core pain points limiting training and inference efficiency. The Shanghai Artificial Intelligence Laboratory's open-source communication library **DeepLink DLSlime** addresses these issues with the goal of "breaking heterogeneous interconnection barriers and unlocking peak transmission performance."

By unifying the underlying engine and integrating multiple communication links and paradigms, DLSlime provides an easy-to-use, high-performance solution optimized for domestic heterogeneous chips. It supports a variety of high-speed intra- and inter-node links, including RDMA, NVLink, and NVSHMEM, enabling heterogeneous interconnection among CPUs, AMD/NVIDIA GPUs, and domestic accelerators, with up to **97% bandwidth utilization**. In performance tests, DLSlime's small-message latency approaches the theoretical lower bound of RDMA, while its throughput in large-batch, high-concurrency scenarios is **3× higher than NCCL** and nearly **2× higher than NIXL**. It has already been deployed in key projects such as the LMDeploy inference framework and cross-domain training over thousand-kilometer networks.

This talk examines DLSlime from four perspectives: technical architecture, core advantages, performance optimization, and real-world deployment. It illustrates how the library tackles communication challenges in heterogeneous training and inference, and offers the global C++ and system software communities practical insights into designing and optimizing high-performance communication libraries.
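As a rough illustration of the kind of design the talk discusses, the sketch below shows one common way a unified communication engine can hide heterogeneous links behind a single transfer interface in C++. All type and function names here are hypothetical and chosen for illustration only; they do not reflect DLSlime's actual API.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical sketch: a unified engine that hides heterogeneous links
// (RDMA, NVLink, ...) behind one transfer interface. Illustrative only,
// not DLSlime's real API.
struct TransferRequest {
    void*       dst;    // destination buffer (possibly on another device or node)
    const void* src;    // source buffer
    std::size_t bytes;  // payload size
};

class TransportBackend {
public:
    virtual ~TransportBackend() = default;
    virtual void transfer(const TransferRequest& req) = 0;  // one-sided data movement
};

class RdmaBackend : public TransportBackend {
public:
    void transfer(const TransferRequest& req) override {
        // An inter-node backend would post an RDMA WRITE here (e.g. via ibverbs).
        (void)req;
    }
};

class NvlinkBackend : public TransportBackend {
public:
    void transfer(const TransferRequest& req) override {
        // An intra-node backend would issue a peer-to-peer GPU copy over NVLink here.
        (void)req;
    }
};

// The "unified engine": callers see the same transfer() call regardless of
// which link was selected for a given pair of endpoints.
class CommEngine {
public:
    explicit CommEngine(std::unique_ptr<TransportBackend> backend)
        : backend_(std::move(backend)) {}

    void transfer(const TransferRequest& req) { backend_->transfer(req); }

private:
    std::unique_ptr<TransportBackend> backend_;
};
```

Exposing one transfer call regardless of the underlying link is what lets higher layers, such as inference engines and training frameworks, remain agnostic to the hardware beneath them.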
