
Multimodal Interaction & World Model
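The Seed Multimodal Interaction & World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities, and to advancing the exploration and development of multimodal assistant products.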
Selected Papers
2026.04.22
Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
We present Seed3D 2.0, an advanced 3D content generation system built on Seed3D 1.0 [16], with
substantial improvements across generation fidelity, simulation-ready capabilities, and application
coverage. For geometry, a coarse-to-fine two-stage pipeline decouples global structure learning from
high-frequency detail recovery, while a locality-aware VAE achieves higher spatial compression
and more efficient decoding. For texture and material generation, we replace the cascaded
pipeline of Seed3D 1.0 with a unified PBR model that directly generates multi-view albedo
and metallic-roughness maps, enhanced by Mixture-of-Experts scaling and VLM-based semantic
conditioning for improved material precision and visual fidelity. Beyond single-object generation,
Seed3D 2.0 introduces a simulation-ready model suite comprising scene layout planning, part-aware
decomposition, and training-free articulation generation, enabling coherent scene construction and
part-level physical interaction across physics and graphics engines. A large-scale human preference
study against five recent commercial models shows that Seed3D 2.0 achieves consistent win rates
of 69.0% to 89.9% in textured 3D asset generation.
Computer Vision
2025.05.20
Emerging Properties in Unified Multimodal Pretraining
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community.
Computer Vision
2025.05.13
Seed1.5-VL Technical Report
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we provide a comprehensive review of our experience in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that it can inspire further research. Seed1.5-VL is now accessible via Volcano Engine (Model ID: doubao-1-5-thinking-vision-pro-250428).
LLM