Responsible AI

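The Seed Responsible AI team is committed to advancing the safe and sustainable development of AI technology. By building a clear understanding of the fundamental mechanisms of AI, the team provides insights and practical examples for the safe development of AI technology.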

Featured Papers

Memory Retrieval and Consolidation in Large Language Models through Function Tokens

The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.

Shaohua Zhang, Yuan Lin, Hang Li

Computation and Language

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.

Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li

Computer Vision and Pattern Recognition

PaSa: An LLM Agent for Comprehensive Academic Paper Search

We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa's performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision.

Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E
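
The recall@k and precision figures quoted in the PaSa abstract can be read with the help of a minimal sketch of how such metrics are commonly computed. This is an illustration under stated assumptions, not PaSa's evaluation code: the function names `recall_at_k` and `precision` and the paper IDs are hypothetical, and the actual benchmark may define relevance or deduplicate results differently.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the relevant papers that appear in the top-k ranked results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # convention chosen for this sketch when nothing is relevant
    hits = relevant_set.intersection(ranked[:k])
    return len(hits) / len(relevant_set)


def precision(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of the retrieved papers that are relevant."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0  # convention chosen for this sketch when nothing is retrieved
    return len(retrieved_set & set(relevant)) / len(retrieved_set)


# Hypothetical paper IDs, for illustration only.
ranked = ["p3", "p7", "p1", "p9", "p4"]
relevant = {"p1", "p4", "p8"}

r2 = recall_at_k(ranked, relevant, 2)  # no relevant paper in the top 2
r5 = recall_at_k(ranked, relevant, 5)  # p1 and p4 found, p8 missed
p = precision(ranked, relevant)        # 2 of the 5 retrieved papers are relevant
```

In this toy case, raising k from 2 to 5 lifts recall from 0 to 2/3, which is why the abstract reports recall at more than one cutoff.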

Technical Showcase

PaSa (Paper Search Agent)

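PaSa is a paper search agent built on large language models. It autonomously makes a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to deliver comprehensive and accurate results for complex academic queries. PaSa is optimized end to end with reinforcement learning and outperforms several mainstream search tools in evaluation.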

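Open Positions

Multimodal Large Model Algorithm Research Intern - Seed

Shanghai

Internship

Reinforcement Learning Research Intern - Seed

Shanghai

Internship

AI Agent Research Intern - Seed

Shanghai

Internship

Large Model Algorithm Research Intern - Seed

Shanghai

Internship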

Models

Seed1.6, Seed1.5-VL, Seedance 1.0, Seedream 4.0, SeedEdit 3.0, Seed LiveInterpret 2.0, Seed Realtime Voice, Seed Music

Research Teams

LLM, Infrastructures, Vision, Speech, Multimodal Interaction & World Model, AI for Science, Robotics, Responsible AI

Copyright © 2025 Bytedance Seed
