Infrastructures

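The Seed-Infrastructures team is responsible for distributed training of large models, reinforcement learning frameworks, high-performance inference, and compilers for heterogeneous hardware.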

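Research Topics

Ultra-large-scale distributed training

Studies ultra-large-scale training clusters: how to improve training stability and MFU (Model FLOPs Utilization), along with cross-cluster, low-precision, fault-tolerant, and elastic training.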

Large-scale

Stability

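Reinforcement learning systems

Studies end-to-end reinforcement learning systems for large models, designing next-generation systems for dynamic workloads, complex agent/environment interaction, heterogeneous resources, and multimodal scenarios.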

Reinforcement learning

Agent

Optimization

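Parallel inference schemes

Studies how to overcome the compute and memory-access bottlenecks of inference, covering multi-node inference, parallel inference schemes for heterogeneous hardware, and scheduling optimization.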

Inference

Parallel

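Co-optimization of next-generation models and hardware architectures

Combines next-generation hardware architectures with next-generation model architectures for generation and understanding, in order to study more advanced model structures, training modes, and inference modes.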

Systems-algorithm co-design

Model architecture

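Compiler optimization for heterogeneous hardware

Studies compiler optimization of high-performance operators and joint computation-communication optimization on new hardware architectures.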

Heterogeneous systems

Compiler

Selected Papers

Understanding Stragglers in Large Model Training Using What-if Analysis

Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where the training can be stalled by few slow workers. At ByteDance we find stragglers are not trivially always caused by hardware failures, but can arise from multiple complex factors. This work aims to present a comprehensive study on the straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis that simulates the scenario without any stragglers and contrasts with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes for stragglers?

Jinkun Lin, Ziheng Jiang, Zuquan Song, Sida Zhao, Menghan Yu, Zhanghan Wang, Chenyuan Wang, Zuocheng Shi, Xiang Shi, Wei Jia, Zherui Liu, Shuguang Wang, Haibin Lin, Xin Liu, Aurojit Panda, Jinyang Li

Cluster Computing
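
The what-if idea in the abstract above can be illustrated with a toy calculation. The sketch below is not the paper's simulator: it assumes a fully synchronous job in which each step ends when the slowest worker finishes, and it assumes the straggler-free step time can be approximated by the median worker time. The function name `what_if_slowdown` and the trace layout are hypothetical.

```python
from statistics import median


def what_if_slowdown(step_times):
    """Estimate the slowdown attributable to stragglers.

    step_times: list of steps, each a list of per-worker durations in seconds.
    Returns actual job time divided by the hypothetical straggler-free time.
    """
    actual = 0.0
    ideal = 0.0
    for workers in step_times:
        # Synchronous training: a step ends when the slowest worker finishes.
        actual += max(workers)
        # What-if scenario (assumption): every worker runs at the median speed.
        ideal += median(workers)
    return actual / ideal


# Hypothetical trace: 3 steps, 4 workers, one slow worker in the second step.
trace = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 3.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
print(f"slowdown: {what_if_slowdown(trace):.2f}x")
```

A ratio of 1.0 means stragglers cost nothing under these assumptions. Larger values indicate how much faster the job would have run without them.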

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of a MoE layer can occupy 47% time of the entire model execution with popular models and frameworks. Therefore, existing methods suggest the communication in a MoE layer to be pipelined with the computation for overlapping. However, these coarse grained overlapping schemes introduce a notable impairment of computational efficiency and the latency concealing is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by 1.96× and for end-to-end execution, COMET delivers a 1.71× speedup on average. COMET has been adopted in the production environment of clusters with ten-thousand-scale of GPUs, achieving savings of millions of GPU hours.

Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

System Research
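
To see why the granularity of overlapping matters, a simple timing model may help. The sketch below is not COMET's mechanism: it is a generic two-stage pipeline model that assumes the communication and computation of a layer can be split into equal chunks at no extra cost. The function names and the example timings are hypothetical.

```python
def serial_time(comm, comp):
    """Layer time when communication and computation do not overlap."""
    return comm + comp


def pipelined_time(comm, comp, chunks):
    """Layer time when the work is split into equal chunks and pipelined.

    The first chunk's communication and the last chunk's computation cannot
    be hidden; in between, the slower of the two stages sets the pace.
    """
    c = comm / chunks  # communication time per chunk
    p = comp / chunks  # computation time per chunk
    return c + (chunks - 1) * max(c, p) + p


# Hypothetical layer: 4 ms of communication, 6 ms of computation.
for n in (1, 2, 4, 16):
    print(n, pipelined_time(4.0, 6.0, n))
```

As the chunk count grows, the modeled time approaches `max(comm, comp)`. Real kernels lose efficiency when work is split too finely, which this model ignores; that cost is the impairment the abstract attributes to coarse-grained pipelining schemes.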

HybridFlow: A Flexible and Efficient RLHF Framework

Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-HybridEngine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate 1.53×~20.57× throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code will be available at https://github.com/volcengine/verl.

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu

Reinforcement Learning

System Research

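Open Positions

Machine Learning Training Framework R&D Engineer/Expert - Seed

Beijing / Shanghai / Shenzhen / Hangzhou

Experienced hire

Machine Learning Systems Inference Engine Senior Engineer/Expert - Seed

Beijing / Shanghai / Hangzhou

Experienced hire

Machine Learning Systems Scheduling Engineer/Expert - Seed

Beijing / Shanghai / Hangzhou

Experienced hire

Large Model Inference Storage Systems Engineer/Expert - Seed

Beijing / Shanghai / Shenzhen / Hangzhou

Experienced hire

AI Heterogeneous Computing Optimization Engineer/Expert - Seed

Beijing / Shanghai / Shenzhen / Hangzhou

Experienced hire

Machine Learning Systems R&D Intern - Seed

Beijing / Shanghai / Shenzhen / Hangzhou

Internship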

Models

Seed1.6, Seed1.5-VL, Seedance 1.0, Seedream 4.0, SeedEdit 3.0, Seed LiveInterpret 2.0, Seed Realtime Voice, Seed Music

Research Teams

LLM, Infrastructures, Vision, Speech, Multimodal Interaction & World Model, AI for Science, Robotics, Responsible AI

Join ByteDance Seed

Copyright © 2025 Bytedance Seed
