Jinming Ma

Young Researcher at Shanghai Artificial Intelligence Laboratory

Jinming Ma holds bachelor's and master's degrees from Harbin Institute of Technology, where he studied under Professors Xiaohua Chen and Kui Yu. He is currently a Young Researcher at the Shanghai Artificial Intelligence Laboratory, specializing in high-performance inference services for large language models (LLMs) and heterogeneous communication library technologies. He led the development of the lab's heterogeneous communication library, which has been successfully applied to large-scale, cross-domain intelligent computing projects over links spanning more than a thousand kilometers, addressing key communication challenges in distributed training. His research has been published at top international conferences such as DAC, ICPP, and IISWC, and he holds four invention patents.

Topic

Design and Implementation of a Heterogeneous Communication Library Balancing Flexibility and Efficiency

In AI heterogeneous computing scenarios, inefficient interconnection across architectures, the difficulty of adapting communication paradigms to diverse use cases, and the performance bottlenecks of traditional communication libraries under high concurrency and high bandwidth have long been core pain points limiting training and inference efficiency. The Shanghai Artificial Intelligence Laboratory's open-source communication library **DeepLink DLSlime** addresses these issues with the goal of "breaking heterogeneous interconnection barriers and unlocking peak transmission performance."

By unifying the underlying engine and integrating multiple communication links and paradigms, DLSlime provides an easy-to-use, high-performance solution optimized for domestic heterogeneous chips. It supports a variety of high-speed intra- and inter-node links, including RDMA, NVLink, and NVSHMEM, enabling heterogeneous interconnection among CPUs, AMD/NVIDIA GPUs, and domestic accelerators, with up to **97% bandwidth utilization**. In performance tests, DLSlime's small-message latency approaches the theoretical lower bound of RDMA, while its throughput in large-batch, high-concurrency scenarios is **3× higher than NCCL** and nearly **2× higher than NIXL**. It has already been deployed in key projects such as the LMDeploy inference framework and cross-domain training over thousand-kilometer networks.

This talk examines DLSlime from four perspectives: technical architecture, core advantages, performance optimization, and real-world deployment. It illustrates how the library tackles communication challenges in heterogeneous training and inference, and offers the global C++ and system software communities practical insights into designing and optimizing high-performance communication libraries.
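As a rough illustration of the kind of design the talk discusses, the sketch below shows one common way a unified communication engine can hide heterogeneous links behind a single transfer interface in C++. All type and function names here are hypothetical and chosen for illustration only; they do not reflect DLSlime's actual API.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical sketch: a unified engine that hides heterogeneous links
// (RDMA, NVLink, ...) behind one transfer interface. Illustrative only,
// not DLSlime's real API.
struct TransferRequest {
    void*       dst;    // destination buffer (possibly on another device or node)
    const void* src;    // source buffer
    std::size_t bytes;  // payload size
};

class TransportBackend {
public:
    virtual ~TransportBackend() = default;
    virtual void transfer(const TransferRequest& req) = 0;  // one-sided data movement
};

class RdmaBackend : public TransportBackend {
public:
    void transfer(const TransferRequest& req) override {
        // An inter-node backend would post an RDMA WRITE here (e.g. via ibverbs).
        (void)req;
    }
};

class NvlinkBackend : public TransportBackend {
public:
    void transfer(const TransferRequest& req) override {
        // An intra-node backend would issue a peer-to-peer GPU copy over NVLink here.
        (void)req;
    }
};

// The "unified engine": callers see the same transfer() call regardless of
// which link was selected for a given pair of endpoints.
class CommEngine {
public:
    explicit CommEngine(std::unique_ptr<TransportBackend> backend)
        : backend_(std::move(backend)) {}

    void transfer(const TransferRequest& req) { backend_->transfer(req); }

private:
    std::unique_ptr<TransportBackend> backend_;
};
```

Exposing one transfer call regardless of the underlying link is what lets higher layers, such as inference engines and training frameworks, remain agnostic to the hardware beneath them.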
