Multimodal Interaction & World Model

The Seed Multimodal Interaction and World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities. The team also works to advance the exploration and development of multimodal assistant products.

Latest advancements

Seed1.5-VL

A vision-language multimodal large model that performs strongly across visual reasoning, image question answering, chart understanding and question answering, visual grounding and counting, video understanding, and GUI agent tasks, among others.

BAGEL

An open-source unified multimodal model with capabilities such as image generation, image editing, style transformation, and image expansion. It is capable of delivering precise, photorealistic outputs.

UI-TARS

An open-source multimodal agent built upon a powerful vision-language model. It can effectively perform diverse tasks within virtual worlds.