Infrastructures
The Seed-Infrastructures team works on distributed training of large models, reinforcement learning frameworks, high-performance inference, and compilers for heterogeneous hardware.
Research Areas
Ultra-large-scale distributed training
Research on ultra-large-scale training clusters: improving training stability and MFU; cross-cluster, low-precision, fault-tolerant, and elastic training
Large-scale
Stability
Reinforcement learning systems
Research on end-to-end reinforcement learning systems for large models, designing next-generation systems for dynamic workloads, complex agent/environment interaction, heterogeneous resources, and multimodal scenarios
Reinforcement learning
Agent
Optimization
Parallel inference
Research on resolving the compute and memory-access bottlenecks of inference; multi-node inference; parallel inference schemes and scheduling optimization on heterogeneous hardware
Inference
Parallel
Joint optimization of next-generation models and hardware
Combining next-generation hardware architectures with next-generation generation/understanding model architectures to research more advanced model structures, training paradigms, and inference paradigms
Systems-algorithm co-design
Model architecture
Compiler optimization for heterogeneous hardware
Research on compiler optimization of high-performance kernels and joint compute-communication optimization on new hardware architectures
Heterogeneous systems
Compiler

Selected Papers

2025.05.09
Understanding Stragglers in Large Model Training Using What-if Analysis
Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance we find that stragglers are not always caused by hardware failures, but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates a scenario without any stragglers and contrasts it with the actual run. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?
Jinkun Lin, Ziheng Jiang, Zuquan Song, Sida Zhao, Menghan Yu, Zhanghan Wang, Chenyuan Wang, Zuocheng Shi, Xiang Shi, Wei Jia, Zherui Liu, Shuguang Wang, Haibin Lin, Xin Liu, Aurojit Panda, Jinyang Li
Cluster Computing
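The what-if methodology in the abstract can be illustrated with a toy sketch (my own illustration, not the paper's tooling): under synchronous data parallelism each step lasts as long as its slowest worker, so replacing every worker's step time with the per-step median simulates a straggler-free run, and the ratio of the two totals quantifies the slowdown stragglers cause.

```python
# Toy what-if analysis for stragglers in synchronous training.
# step_times[s][w] = duration of step s on worker w (seconds).
import statistics

def straggler_slowdown(step_times):
    # Actual run: the synchronization barrier waits for the slowest worker.
    actual = sum(max(step) for step in step_times)
    # What-if run: no stragglers, every worker takes the per-step median time.
    ideal = sum(statistics.median(step) for step in step_times)
    return actual / ideal

# Made-up trace: 4 workers, 3 steps; worker 3 straggles in step 2.
trace = [
    [1.0, 1.1, 1.0, 1.0],
    [1.0, 1.0, 1.0, 3.0],   # one slow worker stalls the whole step
    [1.1, 1.0, 1.0, 1.0],
]
print(round(straggler_slowdown(trace), 2))  # → 1.73
```

Even a single straggling worker in one step inflates the whole job, which is why the contrast between the actual and simulated runs is informative.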
2025.02.27
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. Developing large MoE models in distributed settings, however, incurs substantial communication overhead: the inter-device communication of an MoE layer can occupy 47% of the entire model execution time with popular models and frameworks. Existing methods therefore propose pipelining the communication in an MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes notably impair computational efficiency, and the latency hiding is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by 1.96× and delivers a 1.71× end-to-end speedup on average. COMET has been adopted in the production environment of clusters with tens of thousands of GPUs, saving millions of GPU hours.
Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu
System Research
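A back-of-the-envelope model helps show why the chunked overlapping described above pays off (an illustrative sketch with made-up numbers, not COMET's actual scheduler): splitting an MoE layer's all-to-all and expert computation into n chunks turns a serial "communicate, then compute" into a two-stage pipeline in which chunk i's compute overlaps chunk i+1's communication.

```python
# Analytic model of chunked computation-communication overlap.
def serial_time(comm, comp):
    # Baseline: all communication finishes before any computation starts.
    return comm + comp

def pipelined_time(comm, comp, n_chunks):
    # Two-stage pipeline over n chunks: fill with the first chunk's
    # communication, drain with the last chunk's compute, and in between
    # each pipeline "tick" is bounded by the slower of the two stages.
    c, g = comm / n_chunks, comp / n_chunks
    return c + g + (n_chunks - 1) * max(c, g)

comm, comp = 4.7, 5.3   # ms; communication ~47% of layer time, per the abstract
for n in (1, 4, 16):
    print(n, round(serial_time(comm, comp) / pipelined_time(comm, comp, n), 2))
```

With these assumed numbers the modeled layer speedup grows from 1.0× (no chunking) toward the ~1.8× range as chunks get finer, which is the regime where coarse-grained schemes leave latency hiding on the table.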
2024.10.02
HybridFlow: A Flexible and Efficient RLHF Framework
Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents the computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-HybridEngine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate 1.53×~20.57× throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code is available at https://github.com/volcengine/verl.
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu
Reinforcement Learning
System Research
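The hybrid paradigm in the abstract can be sketched in miniature (illustrative only; the class and method names here are hypothetical, not verl's actual API): a single controller issues one coarse-grained call per dataflow node (actor, critic, reward model), while each worker group fans that call out to all of its ranks SPMD-style, so control dispatch overhead scales with the number of dataflow nodes rather than the number of ranks.

```python
# Toy single-controller-over-multi-controller RLHF dataflow.
class WorkerGroup:
    """One distributed model (actor, critic, reward, ...) spanning several ranks."""
    def __init__(self, name, world_size):
        self.name, self.world_size = name, world_size

    def run(self, method, batch):
        # In a real system each rank would execute `method` on its own shard
        # (multi-controller / SPMD); here every "rank" just records the call.
        return [f"{self.name}.{method}(rank={r}, {batch})" for r in range(self.world_size)]

def rlhf_step(prompts, actor, critic, reward):
    # Single controller: one dispatch per dataflow node, not per rank.
    responses = actor.run("generate", prompts)
    values = critic.run("compute_values", "responses")
    scores = reward.run("compute_rewards", "responses")
    actor.run("update_policy", "advantages")
    return responses, values, scores

actor = WorkerGroup("actor", world_size=2)
critic = WorkerGroup("critic", world_size=2)
reward = WorkerGroup("reward", world_size=1)
out = rlhf_step("prompts", actor, critic, reward)
print(out[0][0])  # → actor.generate(rank=0, prompts)
```

The controller sees only whole-model calls, which keeps the dataflow easy to express; resharding the actor between generation and training (the 3D-HybridEngine's job) would happen inside the worker group, invisible to this outer loop.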

Open Positions

Machine Learning Training Framework R&D Engineer/Expert - Seed
Beijing/Shanghai/Shenzhen/Hangzhou
Experienced hire
Machine Learning Systems Inference Engine Senior Engineer/Expert - Seed
Beijing/Shanghai/Hangzhou
Experienced hire
Machine Learning Systems Scheduling Engineer/Expert - Seed
Beijing/Shanghai/Hangzhou
Experienced hire
Large Model Inference Storage Systems Engineer/Expert - Seed
Beijing/Shanghai/Shenzhen/Hangzhou
Experienced hire
AI Heterogeneous Computing Optimization Engineer/Expert - Seed
Beijing/Shanghai/Shenzhen/Hangzhou
Experienced hire
Machine Learning Systems R&D Intern - Seed
Beijing/Shanghai/Shenzhen/Hangzhou
Internship