Vision
The Doubao Vision team develops foundational models for visual generation and multimodal generative models, carrying out leading research and application development to solve fundamental computer vision challenges in GenAI.

Main areas of focus
Research focus
The team focuses on visual generation models, multimodal architectures, and technology research in AI vision-related areas.
Areas for exploration
This includes AIGC, diffusion models, autoregressive models, multimodal models, 3D/4D generation, visual self-supervised learning, and model acceleration and optimization.
Research topics

Foundational models for visual generation
Researching and developing foundational models for visual generation (images and videos), ensuring high interactivity and controllability in visual generation, understanding patterns in videos, and exploring various vision-oriented tasks based on generative foundational models.
Multimodal
Diffusion Model
Auto Regression Model
Foundation

Multimodal generative models
Integrating various modalities into a unified generative model, jointly modeling generation and understanding, supporting interleaved and simultaneous generation across modalities (such as digital avatars), and enhancing the contextual capabilities and consistency of generative models.
Multimodal
Diffusion Model
Auto Regression Model
Foundation

3D/4D generative models
Developing 3D/4D foundational generative models, learning visual world knowledge from video and 3D data, understanding the 3D space and physical laws of the physical world, building spatial intelligence and world models, and exploring physics and rendering engines based on generative models.
3D
4D
World Model

Multimodal model design and optimization
Designing and optimizing multimodal model network architectures, optimizing diffusion models, carrying out efficient large-scale distributed training and inference, and driving model acceleration and optimization.
Multimodal
Optimization
Distillation
Quantization
Selected Papers

Nov 11, 2024
SeedEdit: Align Image Re-Generation to Image Editing
We introduce SeedEdit, a diffusion model that is able to revise a given image with any text prompt. In our perspective, the key to such a task is to obtain an optimal balance between maintaining the original image, i.e. image reconstruction, and generating a new image, i.e. image re-generation. To this end, we start from a weak generator (text-to-image model) that creates diverse pairs between such two directions and gradually align it into a strong image editor that well balances between the two tasks. SeedEdit can achieve more diverse and stable editing capability over prior image editing methods, enabling sequential revision over images generated by diffusion models.
Yichun Shi, Peng Wang, Weilin Huang
Computer Vision
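
The central trade-off named in the abstract above, balancing image reconstruction against image re-generation, can be illustrated with a toy scoring function. The linear blend and the weight `alpha` are illustrative assumptions, not SeedEdit's actual objective.

```python
def edit_score(reconstruction_sim: float, prompt_sim: float, alpha: float = 0.5) -> float:
    """Score an edit candidate as a balance between staying close to the
    input image (reconstruction) and following the text prompt
    (re-generation). The linear blend and alpha=0.5 are illustrative
    assumptions, not SeedEdit's released objective."""
    return alpha * reconstruction_sim + (1.0 - alpha) * prompt_sim

# A candidate that preserves the input well but only partly follows the prompt:
print(edit_score(reconstruction_sim=0.9, prompt_sim=0.6))  # 0.75
```

Raising `alpha` favors edits that stay faithful to the original image; lowering it favors edits that follow the prompt more aggressively.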

Nov 04, 2024
How Far is Video Generation from World Model: A Physical Law Perspective
OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io/
Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng
Computer Vision
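
The deterministic 2D testbed described in the abstract above can be approximated with a short simulation. The sketch below is a hypothetical stand-in, not the authors' released testbed: it rolls out uniform motion with elastic wall bounces, where the first few frames would condition a video model and the remainder serve as ground truth for checking adherence to the underlying law.

```python
import numpy as np

def simulate_uniform_motion(pos, vel, num_frames, dt=0.1, box=10.0):
    """Deterministic 2D rollout with elastic wall bounces (toy stand-in
    for the paper's testbed; all parameters here are assumptions)."""
    pos, vel = np.asarray(pos, dtype=float), np.asarray(vel, dtype=float)
    trajectory = []
    for _ in range(num_frames):
        pos = pos + vel * dt
        for axis in range(2):  # reflect off the box walls
            if pos[axis] < 0.0 or pos[axis] > box:
                vel[axis] = -vel[axis]
                pos[axis] = np.clip(pos[axis], 0.0, box)
        trajectory.append(pos.copy())
    return np.stack(trajectory)  # shape: (num_frames, 2)

# Early frames condition the model; later frames are ground truth for
# evaluating whether generated videos obey the governing physical law.
states = simulate_uniform_motion(pos=[1.0, 2.0], vel=[3.0, 1.5], num_frames=32)
```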

Apr 21, 2024
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Current distillation techniques often dichotomize into two distinct aspects: i) ODE Trajectory Preservation; and ii) ODE Trajectory Reformulation. However, these approaches suffer from severe performance degradation or domain shifts. To address these limitations, we propose Hyper-SD, a novel framework that synergistically amalgamates the advantages of ODE Trajectory Preservation and Reformulation, while maintaining near-lossless performance during step compression. Firstly, we introduce Trajectory Segmented Consistency Distillation to progressively perform consistent distillation within pre-defined time-step segments, which facilitates the preservation of the original ODE trajectory from a higher-order perspective. Secondly, we incorporate human feedback learning to boost the performance of the model in a low-step regime and mitigate the performance loss incurred by the distillation process. Thirdly, we integrate score distillation to further improve the low-step generation capability of the model and offer the first attempt to leverage a unified LoRA to support the inference process at all steps. Extensive experiments and user studies demonstrate that Hyper-SD achieves SOTA performance from 1 to 8 inference steps for both SDXL and SD1.5. For example, Hyper-SDXL surpasses SDXL-Lightning by +0.68 in CLIP Score and +0.51 in Aes Score in the 1-step inference.
Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, Xuefeng Xiao
Computer Vision
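
The segmentation step at the heart of Trajectory Segmented Consistency Distillation can be sketched as follows. This is a schematic reading of the abstract above: the diffusion timestep range is split into segments, and each sampled timestep is mapped to its segment boundary, which acts as the local consistency target. The function names and segment counts are assumptions, not the released Hyper-SD code.

```python
import torch

def segment_boundaries(num_segments: int, num_train_timesteps: int = 1000) -> torch.Tensor:
    """Split the diffusion timestep range into equal segments (assumed sizes)."""
    return torch.linspace(0, num_train_timesteps, num_segments + 1).long()

def segment_target_timestep(t: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    """Map each sampled timestep to the lower edge of its segment, which
    serves as the consistency target within that segment."""
    idx = torch.bucketize(t, edges, right=True) - 1
    return edges[idx.clamp(min=0)]

edges = segment_boundaries(num_segments=4)   # tensor([   0,  250,  500,  750, 1000])
t = torch.tensor([120, 480, 999])
print(segment_target_timestep(t, edges))     # tensor([  0, 250, 750])
```

Distilling toward nearby segment boundaries, rather than all the way to t=0, is what the abstract describes as preserving the original ODE trajectory from a higher-order perspective; the full method additionally incorporates human feedback learning and score distillation.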
Technical applications

Doubao Text-to-Image
The Doubao Text-to-Image Model has been successfully integrated into products like Douyin, CapCut/Lark, Doubao, and StarSketch. Users can input prompts into the Doubao app to generate high-quality images that beautifully capture light and shadow, create rich color atmospheres, and depict character aesthetics. The model supports input in both Chinese and English, ensuring precise understanding of complex prompts.
Text-to-Image
Model

Jimeng
Jimeng/Dreamina is an AI-powered creative product developed by ByteDance. It enables users to generate high-quality images and videos from natural-language and image inputs. The platform provides an intelligent canvas, a story creation mode, and various AI editing tools, significantly boosting users' creative productivity.
AI-powered
Creative
Featured Jobs
Research Scientist, Multimodal Foundation Model
Research Scientist - Foundation Model, Video Generation
Research Engineer - Foundation Model AI Platform - San Jose
Research Scientist Graduate (Foundation Model, Video Generation) - 2025 Start (PhD)
Student Researcher (Doubao (Seed) - Foundation Model, Video Generation) - 2025 Start (PhD)
Student Researcher (Doubao (Seed) - Foundation Model AI Platform) - 2025 Start (PhD)