Building Applications with AI Agents

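Translator's notes:

To keep things simple, and out of skepticism toward some of the common Chinese renderings, the following terms are left in the original English after their first occurrence in the translation. They are listed here together:

Agent: 智能体

RAG: 检索增强生成 (retrieval-augmented generation)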

Preface

When I first started connecting language models, tools, orchestration, and memory together into what we now call an agent, I was surprised by how capable this design pattern was, and just how much confusion there was about this topic. During my time building agents and sharing my findings on incident investigation, threat hunting, vulnerability detection, and more, I found that this latest design pattern enabled us to solve whole new classes of problems, but also came with many practical hurdles to making them reliable for real-world applications. Engineers, scientists, product managers, and leadership all wanted to know more. “How do I get my agent to work?” “I can get my agent to work some of the time, but how do I get it to work most or all of the time?” “How do I choose a model for my use case?” “How do I design good tools for my agent?” “What kind of memory do I need?” “Should I use RAG?” “Should I build a single-agent or multiagent system?” “What architecture should I use?” “Do I need to fine-tune?” “How do I enable agents to learn from experience and improve over time?”

当我最初将语言模型（language models）、工具（tools）、编排（orchestration）与记忆（memory）系统联结成如今我们所称的智能体（agent）时，我惊讶于这种设计模式的强大潜力，同时也惊异于这一主题所引发的普遍困惑。在构建智能体并分享我在事件调查、威胁追踪（threat hunting）、漏洞检测（vulnerability detection）等领域的研究成果的过程中，我发现这种最新的设计模式让我们能够解决全新类型的问题，但要使它们在现实应用中稳定可靠，仍面临诸多实际障碍。工程师、科学家、产品经理和管理层（leadership）都希望了解更多：“如何让我的 agent 顺利运行？”“我的 agent 有时能正常工作，但要如何让它保持高稳定性甚至始终如一？”“如何为我的应用场景选择合适的模型？”“如何为智能体设计有效的工具？”“需要怎样的记忆系统？”“是否该采用检索增强生成技术（RAG）？”“应该构建 single-agent 还是 multiagent 系统？”“该采用何种架构？”“是否需要微调（fine-tune）？”“如何让 agent 通过经验学习并持续进化？”

While there are many blog posts and research papers that focus on specific aspects of designing agent systems, I realized there was a lack of accessible, holistic, trustworthy guides to the subject as a whole. I couldn’t find the book that I wanted to share with my colleagues, so I set out to write it.

尽管已有许多博客文章和研究论文专注于设计 agent system 的特定方面，但我意识到这一领域仍缺乏通俗易懂、全面且可靠的综合性指南。我未能找到一本适合与同事们分享的理想参考书，因此决定自己着手撰写这样一本书。

Through in-depth discussions, I’ve helped teams navigate the complexities of AI agents, considering their unique goals, constraints, and environments. AI agent systems are intricate, blending autonomy, decision making, and interaction in ways that traditional software doesn’t. They’re data-driven, adaptive, and involve multiple components like perception, reasoning, action, and learning, all while interfacing with users, tools, and other agents. Complicating matters, the foundation models that power these agents are probabilistic and stochastic by nature, making evaluation and testing more challenging.

通过深入的探讨，我帮助许多团队应对 AI agents 的复杂性，同时充分考虑他们独特的目标、约束条件与应用环境。人工智能智能体系统具有高度的复杂性，它们以传统软件所未有的方式融合了自主性（autonomy）、决策能力与交互能力。这类系统以数据为驱动，具备自适应特性，同时整合了感知、推理、行动与学习等多个组件，并且需要与用户、工具以及其他 agent 进行交互。更复杂的是，支撑这些 agent 的基础模型（foundation model）本质上是概率性与随机性（stochastic）的，这使得对其评估和测试更具挑战性。

This book takes a comprehensive approach to building applications with AI agents. It covers the entire lifecycle, from conceptualization to deployment and maintenance, illustrated with real-world case studies, supported by references, and reviewed by practitioners in the field. Sections on advanced topics—like agent architectures, tool integration, memory systems, orchestration, multiagent coordination, measurement, monitoring, security, and ethical considerations—are further refined by expert input.

本书采用全面的方法来构建基于 AI agents 的应用程序。它涵盖从概念设计到部署维护的完整生命周期，通过真实案例进行阐释，附有参考文献支持，并由该领域的实践者进行审阅。关于高级主题的章节——例如 agent 架构、工具集成、记忆系统、流程编排、多 agent 协作、性能度量（measurement）、系统监控、安全考量与伦理问题——均经过专家意见的进一步提炼与完善。

Writing this book has been a journey of discovery for me as well. The initial drafts sparked conversations that challenged my views and introduced new ideas. I hope this process continues as you read it, bringing your own insights. Feel free to share any feedback you might have for this book via Twitter (X), LinkedIn, my personal website, or any other channels that you can find.

撰写这本书的过程对我而言同样是一场探索之旅。最初的草稿引发了诸多讨论，这些对话不断挑战着我的观点，也为我带来了新的思考。我希望这种探索能在您阅读时延续，并激发出您自己的见解。欢迎您通过Twitter (X)、LinkedIn、我的个人网站或任何您能找到的渠道，分享您对本书的任何反馈。

What This Book Is About

This book provides a practical framework for building robust applications using AI agents. It addresses key challenges and offers solutions to questions such as:

What defines an AI agent, and when should I use one? How do agents differ from traditional machine learning (ML) systems?
How do I design agent architectures for specific use cases, including scenario selection, and core components like tools, memory, planning, and orchestration?
What are effective strategies for agent planning, reasoning, execution, tool selection, and topologies like chains, trees, and graphs?
How can I enable agents to learn from experience through nonparametric methods, fine-tuning, and transfer learning?
How do I scale from single-agent to multiagent systems, including coordination patterns like democratic, hierarchical, or actor-critic approaches?
How do I evaluate and improve agent performance with metrics, testing, and production monitoring?
What tools and frameworks are best for development, deployment, and securing agents against risks?
How do I ensure agents are safe, ethical, and scalable, with considerations for user experience (UX), trust, bias, fairness, and regulatory compliance?

本书提供了一个实用的框架，用于构建基于 AI Agents 的稳健应用程序。它针对关键挑战提出解决方案，并回答以下问题：

AI Agents 的定义是什么？在什么情况下应该使用它？Agent 与传统机器学习（ML）系统有何不同？
如何针对特定用例设计 agent 架构，包括场景选择以及工具、记忆、规划（planning）与编排等核心组件？
在 agent 规划、推理（reasoning）、执行、工具选择以及链式、树状、图状等拓扑结构（topology）方面，有哪些有效策略？
如何通过非参数方法、微调与迁移学习使智能体能够从经验中学习？
如何从 single agent 系统扩展到 multiagent 系统，包括民主制（democratic）、层级制（hierarchical）或执行者-评判者（actor-critic）等协调模式？
如何通过指标评估、测试与生产环境监控来提升 agent 性能？
哪些工具和框架最适合用于智能体的开发、部署以及风险防护？
如何确保智能体的安全性、伦理合规性与可扩展性，同时兼顾用户体验（UX）、信任、偏见（bias）、公平性及法规遵从性？

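Translator's note:

Topology is a term from computing and systems design. Put simply, it refers to how components are connected to one another and the shape of their layout.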

The content draws from established engineering principles and emerging practices in AI agents, with case studies (such as customer support, personal assistants, legal, advertising, and code review agents) and discussions on trade-offs to help you tailor solutions to your needs.

本书内容融合了成熟的工程原则与 AI agent 领域的新兴实践，通过实际案例（例如客户支持、个人助理、法律咨询、广告投放与 code review）以及对不同方案取舍（trade-offs）的探讨，帮助您根据自身需求定制解决方案。

What This Book Is Not

This book isn’t an introduction to AI or ML basics. It assumes familiarity with concepts like neural networks, natural language processing, and basic programming in languages like Python. If you’re new to these, pointers to resources are provided, but the focus is on applied agent building.

本书并非关于人工智能或机器学习基础知识的入门指南。它假定读者已熟悉神经网络、自然语言处理以及 Python 等语言的基本编程概念。如果您是这些领域的新手，书中会提供相关学习资源的指引，但全书的核心重点在于 agent 构建的实际应用。

It’s also not a step-by-step tutorial for specific tools, as technologies evolve rapidly. Instead, it offers guidance on evaluating and selecting tools, with pseudocode and examples to illustrate concepts. For hands-on implementation, online tutorials and documentation are recommended, including frameworks like LangChain and AutoGen.

本书也并非针对特定工具的按步骤操作教程，因为相关技术发展迅速。相反，它提供关于如何评估和选择工具的指导，并通过伪代码和示例来阐明概念。如需动手实践，推荐参考在线教程和官方文档，其中包括 LangChain 和 AutoGen 等框架。

Who This Book Is For

This book is for engineers, developers, and technical leaders aiming to build AI agent-based applications. It’s geared toward roles like AI engineers, software developers, ML engineers, data scientists, and product managers with a technical bent. You might relate to scenarios like the following:

You’re tasked with building an autonomous system for decision support, or interactive services.
You have a working agent prototype and you want to harden it and get it ready for production.
Your team struggles with agent reliability—handling failures, adapting to dynamic environments, or orchestrating complex tasks—and you want systematic approaches including orchestration, memory, and learning from experience.
You’re integrating agents into existing workflows and seek best practices for scalability, multiagent coordination, UX design, measurement, validation, monitoring, and security.

本书面向旨在构建基于 AI agent 应用程序的工程师、开发人员和技术负责人。其内容主要服务于人工智能工程师、软件开发者、机器学习工程师、数据科学家以及具备技术背景（with a tech bent）的产品经理等角色。您可能会对以下场景产生共鸣：

您的任务是构建一个用于决策支持或交互服务的自主系统。
您已拥有一个可运行的 agent 原型，并希望将其强化完善（harden），为投入生产环境做好准备。
您的团队在 agent 可靠性方面面临挑战——例如处理故障、适应动态环境或协调复杂任务——而您需要系统性的解决方案，包括流程编排、记忆系统以及从经验中学习等。
您正在将 agent 集成到现有工作流程中，并寻求关于可扩展性、multiagent 协调、用户体验设计、性能度量、验证、监控及安全性的最佳实践。

You can also benefit if you’re a tool builder identifying gaps in the agent ecosystem, a researcher exploring applications, or a job seeker preparing for AI agent roles.

如果您是正在寻找 agent 生态系统中的空缺机会的工具构建者、探索应用的研究人员，或是为 AI agent 相关职位做准备的求职者，您同样能从中获益。

Overview of This Book

The chapters follow the lifecycle of building an AI agent application, organized into three main sections.

本书章节遵循构建 AI agent 应用程序的生命周期，分为三个主要部分。

The first three chapters cover core concepts, design principles, and essential components:

Chapter 1 introduces agents, their promise, use cases, how they compare to traditional ML, and recent advancements.
Chapter 2 provides an overview of designing agent systems, including scenario selection, core components (model selection, tools, memory, planning), design trade-offs, architecture patterns (single-agent, multiagent, modular), and best practices.
Chapter 3 focuses on UX design, covering interaction modalities (text, graphical, speech, video), synchronous versus asynchronous experiences, context retention, communicating capabilities, trust, and key UX principles.

前三章涵盖核心概念、设计原则与基本组件：

第一章介绍 agent 的定义、其潜力、应用场景、与传统机器学习的比较以及最新进展。
第二章概述 agent 系统的设计，包括场景选择、核心组件（模型选择、工具、记忆、规划）、设计权衡、架构模式（单 agent、多 agent、模块化）以及最佳实践。
第三章专注于用户体验设计，涵盖交互模式（文本、图形、语音、视频）、同步与异步体验、上下文保留、能力传达、信任建立以及关键的 UX 原则。

The next five chapters focus on creating, orchestrating, and scaling agents:

Chapter 4 dives into tools, including design (local, API-based, plug-in, hierarchies) and automated tool development (code generation, imitation learning, tool learning from rewards).
Chapter 5 covers orchestration, with fundamentals (parameterization, tool selection, execution), tool selection methods (generative, semantic, hierarchical, machine-learned), tool topologies (decomposition, single/parallel/sequential execution, chains, trees, graphs), and planning strategies (incremental execution, zero-shot, few-shot, ReAct).
Chapter 6 explores memory, including foundational approaches (context windows, keyword-based), semantic memory and vector stores (semantic search, RAG, experience memory), GraphRAG (knowledge graphs), and working memory (whiteboards, note-taking).
Chapter 7 addresses learning from experience, with nonparametric learning (experiences as examples, exploration/exploitation, reflection), parametric learning (fine-tuning large/small models), and transfer learning.
Chapter 8 discusses scaling from one agent to many, including when to use multiagents, coordination (democratic, manager, hierarchical, actor-critic, automated design), and frameworks such as LangChain.

接下来的五章重点关注 agent 的创建、编排与扩展：

第四章深入探讨工具，包括其设计（本地工具、基于 API 的工具、插件、层次结构）以及自动化工具开发（代码生成、模仿学习、基于奖励的工具学习）。
第五章涵盖编排，包括基础概念（参数化、工具选择、执行）、工具选择方法（生成式、语义式、层级式、机器学习式）、工具拓扑结构（任务分解、单次/并行/顺序执行、链式、树状、图状）以及规划策略（增量执行、零样本（zero-shot）、少样本（few-shot）、ReAct）。
第六章探索记忆系统，包括基础方法（上下文窗口、基于关键词的记忆）、语义记忆与向量存储（语义搜索、RAG、经验记忆）、GraphRAG（知识图谱）以及工作记忆（白板、笔记记录）。
第七章讨论经验学习，涵盖非参数化学习（以经验为例、探索/利用、反思）、参数化学习（大/小模型微调）以及迁移学习（transfer learning）。
第八章探讨从单 agent 扩展到多 agent，包括何时使用多 agent、协调机制（民主式、管理者式、层级式、演员-评论家式、自动化设计）以及诸如 LangChain 等框架。

The final five chapters address validation, monitoring, security, improvement, and human-agent integration:

Chapter 9 covers measurement and validation, with key objectives (accuracy, robustness, efficiency, etc.), evaluation sets, unit tests (tools, planning, memory, learning), integration tests (end-to-end, consistency, hallucinations), limitations, and deployment preparation.
Chapter 10 focuses on production monitoring, including causes of failures, agent metrics (system health, automated/human evaluation, feedback), distribution shifts, and monitoring at scale (analytics, alerting, logging).
Chapter 11 explores improvement loops, with feedback pipelines (issue detection, human review, refinement, prioritization), experimentation (shadow deployments, A/B testing, adaptive, gating), and continuous learning (in-context, offline retraining, online reinforcement).
Chapter 12 addresses protecting agent systems, covering unique risks, securing LLMs (model selection, defenses, red teaming, fine-tuning), data protection (privacy, provenance), securing agents (safeguards, external/internal protections), and governance/compliance.
Chapter 13 discusses humans and agents, with ethical principles (oversight, transparency, fairness, explainability, privacy), building trust/oversight, addressing bias, and accountability/regulatory considerations.

最后五章主要讨论验证、监控、安全、优化以及人机协作：

第九章涵盖度量与验证，包括关键目标（准确性、鲁棒性、效率等）、评估集（evaluation sets）、单元测试（工具、规划、记忆、学习）、集成测试（端到端、一致性、幻觉检测）、局限性以及部署准备。
第十章专注于生产环境监控，包括故障原因、agent 指标（系统健康度、自动化/人工评估、反馈）、数据分布偏移（distribution shifts）以及大规模监控（分析、告警、日志记录）。
第十一章探讨改进闭环，包括反馈流水线（问题检测、人工审核、优化、优先级排序）、实验方法（影子部署（shadow deployment）、A/B测试、自适应测试、门控发布（gating））以及持续学习（上下文学习、离线重训练、在线强化学习）。
第十二章讨论保护 agent 系统，涵盖独特风险、保护大语言模型（模型选择、防御措施、红队测试、微调）、数据保护（隐私、溯源（provenance））以及保障 agent 安全（安全护栏、外部/内部防护）与治理/合规性。
第十三章探讨人与 agent 的协作，包括伦理原则（监督、透明度、公平性、可解释性、隐私）、建立信任/监督机制、应对偏见以及责任与监管考量。

Feel free to skip sections you’re familiar with—the book is modular by design.

您可以自由跳过已熟悉的部分——本书在设计上采用模块化结构。

Note: I often use “we” to refer to you (the reader) and me, fostering a collaborative learning vibe.

注：书中常使用“我们”来指代您（读者）与我，旨在营造一种协作学习的氛围。

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/building-applications-with-ai-agents-supp.

补充材料（代码示例、练习等）可通过 https://oreil.ly/building-applications-with-ai-agents-supp 下载获取。

Chapter 1: Introduction to Agents

We are witnessing a profound technological transformation driven by autonomous agents—intelligent software systems capable of independent reasoning, decision making, and interacting effectively within dynamic environments. Unlike traditional software, autonomous agents interpret contexts, adapt to changing scenarios, and perform sophisticated actions with minimal human oversight.

我们正在目睹一场由自主 agent 驱动的深刻技术变革。这些智能软件系统能够在动态环境中独立推理、决策并有效交互。与传统软件不同，自主 agent 能够解读上下文，适应不断变化的场景，并在最少人为监督的情况下执行复杂操作。

Defining AI Agents

Autonomous agents are intelligent systems designed to independently analyze data, interpret their environment, and make context-driven decisions. As the popularity of the term “agent” grows, its meaning has become diluted, often applied to systems lacking genuine autonomy. In practice, agency exists on a spectrum. True autonomous agents demonstrate meaningful decision making, context-driven reasoning, and adaptive behaviors. Conversely, many systems labeled as “agents” may simply execute deterministic scripts or tightly controlled workflows. Designing genuinely autonomous, adaptive agents is challenging, prompting many teams to adopt simpler approaches to achieve quicker outcomes. Therefore, the key test of a true agent is whether it demonstrates real decision making rather than following static scripts.

自主 agent 是一种智能系统，旨在独立分析数据、解读环境并做出基于上下文的决策。随着 “agent” 这一术语的普及，其含义已逐渐泛化，常被用于指代那些缺乏真正自主性的系统。实际上，自主性（agency）是一个程度问题。真正的自主 agent 应展现出有意义的决策能力、基于上下文的推理能力以及自适应行为。相反，许多被标记为 “agent” 的系统可能仅执行确定性脚本或受到严格控制的工作流程。设计真正自主、自适应的 agent 具有挑战性，因此许多团队会采用更简单的方法以快速取得成果。因此，检验一个 agent 是否真正自主的关键在于它是否展现出真正的决策能力，而非仅仅遵循静态脚本。

The rapid evolution of autonomous agents is primarily driven by breakthroughs in foundation models and reinforcement learning. While traditional use cases with foundation models have focused on generating human-readable outputs, the latest advances enable these models to generate structured function signatures and parameter selections. Orchestration frameworks can then execute these functions—enabling agents to look up data, manipulate external systems, and perform concrete actions. Throughout this book, we will use the term “agentic system” to describe the full supporting functionality that enables an agent to run effectively, including the tools, memory, foundation model, orchestration, and supporting infrastructure.

自主 agent 的快速发展主要得益于基础模型和强化学习（reinforcement learning）领域的突破。虽然基础模型在传统用例中专注于生成人类可读的输出，但最新的进展使这些模型能够生成结构化的函数签名和参数选择。编排框架随后可以执行这些函数——使 agent 能够查找数据、操控外部系统并执行具体操作。在本书中，我们将使用术语 agentic system 来描述支持 agent 有效运行的完整功能，包括工具、记忆、基础模型、编排以及支持性基础设施。
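The hand-off described above, where a model emits a structured function call and an orchestration layer executes it, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of any particular framework: the tool names (`get_order_status`, `issue_refund`), the JSON shape with `name` and `arguments` keys, and the hard-coded `model_output` string are all hypothetical stand-ins for what a real model and tool registry would provide.

```python
import json

# Hypothetical tools; a real system would call an order database or payments API.
def get_order_status(order_id: str) -> str:
    return f"Order {order_id}: shipped"

def issue_refund(order_id: str, amount: float) -> str:
    return f"Refunded {amount:.2f} for order {order_id}"

# Registry mapping tool names (as the model would emit them) to Python callables.
TOOLS = {"get_order_status": get_order_status, "issue_refund": issue_refund}

def execute_tool_call(model_output: str) -> str:
    """Parse a structured tool call and run the matching function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        # Refuse anything outside the registry rather than guessing.
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Assumed example of what a model might emit instead of free-form text.
model_output = '{"name": "issue_refund", "arguments": {"order_id": "A123", "amount": 19.5}}'
result = execute_tool_call(model_output)
print(result)
```

The model only chooses the function and its parameters; the orchestration code decides whether and how the call actually runs, which is where validation and safety checks would naturally live.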

With a growing range of protocols such as Model Context Protocol (discussed in Chapter 4) and Agent-to-Agent Protocol (discussed in Chapter 8), these agents will be able to use remote tools and collaborate with other agents to solve problems. This unlocks enormous opportunities for sophisticated automation—but it also brings a profound responsibility to design, measure, and manage these systems thoughtfully, ensuring their actions align with human values and operate safely in complex, dynamic environments.

随着 Model Context Protocol（将在第 4 章讨论）和 Agent-to-Agent Protocol（将在第 8 章讨论）等协议的日益丰富，这些 agent 将能够使用远程工具并与其他 agent 协作来解决问题。这为复杂的自动化开启了巨大的机遇——但也带来了深刻的责任，需要我们审慎地设计、衡量和管理这些系统，确保其行为符合人类价值观，并在复杂动态的环境中安全运行。

The Pretraining Revolution

While traditional ML is an incredibly powerful technique, it is usually limited by the quantity and quality of the dataset. ML practitioners will typically tell you that they spend the majority of their time not training models, but collecting and cleaning the datasets used for training. The incredible success of generative models trained on large volumes of data has shown that a single model can now adapt to a wide range of tasks without any additional training. This upends years of practice. Previously, building an application that used ML required hiring an ML engineer or data scientist, having them collect data and train a model, and then deploying that model. With the latest developments in large, pretrained generative models, high-quality models that work reasonably well for many use cases are now available through a single call to a hosted model, with no training or hosting required. This dramatically lowers the cost and complexity of building applications enabled with ML and AI.

尽管传统机器学习是一种极其强大的技术，但它通常受限于数据集的数量和质量。机器学习从业者通常会告诉您，他们大部分时间并非花费在训练模型上，而是用于收集和清理可用于训练的数据集。经过海量数据训练的生成模型（generative models）所取得的惊人成功表明，单个模型现在能够适应广泛的任务，而无需任何额外训练。这颠覆了多年的实践。以往构建一个使用机器学习的应用需要聘请机器学习工程师或数据科学家，由他们收集数据，然后部署该模型。而随着大型预训练（pretrained）生成模型的最新发展，如今只需调用一次托管模型即可获得适用于多种用例的高质量模型，且无需任何训练或自行托管。这极大降低了构建机器学习与人工智能应用的成本和复杂性。

Recent advancements in large language models (LLMs) such as GPT-5, Anthropic’s Claude, Meta’s Llama, Google’s Gemini Ultra, and DeepSeek’s V3 have increased the performance on a range of difficult tasks even further, widening the scope of problems solvable with pretrained models. These foundation models offer robust natural language understanding and content generation capabilities, enhancing agent functionality through:

Natural language understanding: Interpreting and responding intuitively to user inputs
Context-aware interaction: Maintaining context for relevant and accurate responses over extended interactions
Structured content generation: Producing text, code, and structured outputs essential for analytical and creative tasks

GPT-5、Anthropic 的 Claude、Meta 的 Llama、Google 的 Gemini Ultra 以及 DeepSeek 的 V3 等大语言模型（LLMs）的最新进展，进一步提升了在一系列复杂任务上的性能，从而拓宽了预训练模型可解决问题的范围。这些基础模型提供了强大的自然语言理解和内容生成能力，并通过以下方式增强了 agent 的功能：

自然语言理解：直观地解读并响应用户输入
上下文感知交互：在持续交互中保持上下文，提供相关且准确的响应
结构化内容的生成：生成文本、代码及结构化输出，这对于分析和创造性任务至关重要

While these models are very capable on their own, they can also be used to make decisions within well-scoped areas, adapt to new information, and invoke tools to accomplish real work. Integration with sophisticated orchestration frameworks enables these models to interact directly with external systems and execute practical tasks. These models are capable of:

Contextual interpretation and decision making: Navigating ambiguous situations without exhaustive preprogramming
Tool use: Calling other software to retrieve information or take actions
Adaptive planning: Planning and executing complex, multistep actions autonomously
Information summarization: Rapidly processing extensive documents, extracting key insights, thereby aiding legal analysis, research synthesis, and content curation
Management of unstructured data: Interpreting and responding intelligently to unstructured texts such as emails, documents, logs, and reports
Code generation: Writing and executing code and writing unit tests
Routine task automation: Efficiently handling repetitive activities in customer service and administrative workflows, freeing human workers to focus on more nuanced tasks
Multimodal information synthesis: Performing intricate analyses of image, audio, or video data at scale

尽管这些模型本身已具备强大的能力，它们还可以在明确定义的范围内进行决策、适应新信息并调用工具来完成实际工作。通过与复杂的编排框架集成，这些模型能够直接与外部系统交互并执行实际任务。这些模型能够实现：

上下文理解与决策：无需详尽的预编程，即可应对模糊情境
工具调用：调用其他软件以获取信息或执行操作
自适应规划：自主规划并执行复杂的多步骤行动
信息摘要：快速处理大量文档，提取关键见解，从而辅助法律分析、研究整合（research synthesis）与内容策展
非结构化数据管理：智能解读并响应电子邮件、文档、日志及报告等非结构化文本
代码生成：编写与执行代码，并编写单元测试
常规任务自动化：高效处理客户服务和行政流程中的重复性活动，使人力得以专注于更复杂的任务
多模态（multimodal）信息整合：对图像、音频或视频数据进行大规模的精细分析

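Translator's note:

Preprogramming means writing out in advance, before the system runs, all the instructions or logic it might need. It is like handing a robot an operating manual prepared ahead of time: the system then acts according to the rules defined in it.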

This enhanced flexibility enables autonomous agents to effectively handle complex and dynamic scenarios that static ML models typically cannot address.

这种增强的灵活性使得自主 agent 能够有效处理复杂且动态的场景，而静态的机器学习模型通常无法应对这类情况。

Types of Agents

As the term “agent” has gained popularity, its meaning has broadened to encompass a wide range of AI-enabled systems, often creating confusion about what truly constitutes an AI agent. The Information categorizes agents into seven practical types, reflecting how these technologies are being applied today:

Business-task agents: These agents automate predefined business workflows, such as UiPath’s robotic process automation, Microsoft Power Automate’s low-code flows, or Zapier’s app integrations. They execute sequences of deterministic actions, typically triggered by events, with minimal contextual reasoning.
Conversational agents: This category includes chatbots and customer service agents that engage users through natural language interfaces. They are optimized for dialogue management, intent recognition, and conversational turn-taking, such as virtual assistants embedded in customer support platforms.
Research agents: Research agents conduct information gathering, synthesis, and summarization tasks. They scan documents, knowledge bases, or the web to provide structured outputs that assist human analysts. Examples include Perplexity AI and Elicit.
Analytics agents: Analytics agents, such as Power BI Copilot or Glean, focus on interpreting structured datasets and generating insights, dashboards, and reports. They often integrate tightly with enterprise data warehouses, enabling users to query complex data in natural language.
Developer agents: Tools like Cursor, Windsurf, and GitHub Copilot represent coding agents, which assist developers by generating, refactoring, and explaining code. They integrate deeply into IDE workflows to augment software development productivity.
Domain-specific agents: These agents are tuned for specialized professional domains, such as legal (Harvey), medical (Hippocratic AI), or finance agents. They combine domain-specific knowledge with structured workflows to deliver targeted, expert-level assistance.
Browser-using agents: These agents navigate, interact with, extract information from, and take actions on websites without human interaction. As opposed to traditional robotic process automation, which follows prescripted steps, modern browser-using agents combine language understanding, visual perception, and dynamic planning to adapt on the fly.

随着 “agent” 这一术语的普及，其含义已扩展到涵盖广泛的 AI 赋能系统，这也常常导致人们对 AI agent 的真正构成产生困惑。The Information 将 agent 分为七种实用类型，这反映了当前这些技术的实际应用方式：

业务任务型 agent 这类 agent 用于自动化预定义的业务工作流，例如 UiPath 的机器人流程自动化、Microsoft Power Automate 的低代码流程或 Zapier 的应用集成。它们执行一系列确定性的操作，通常由事件触发，仅需极少的情境推理。
对话式 agent 此类包括通过自然语言界面与用户互动的聊天机器人和客服 agent。它们针对对话管理、意图识别（intent recognition）和话轮转换（conversational turn-take）进行了优化，例如嵌入在客户支持平台中的虚拟助手。
研究型 agent 研究型 agent 执行信息收集、整合与摘要任务。它们扫描文档、知识库或网络，以提供结构化的输出，协助人类分析师。例如 Perplexity AI 和 Elicit。
分析型 agent 分析型 agent，例如 Power BI Copilot 或 Glean，专注于解读结构化数据集，并生成洞察、仪表板和报告。它们通常与企业数据仓库紧密集成，使用户能够用自然语言查询复杂数据。
开发者型 agent 诸如 Cursor、Windsurf 和 GitHub Copilot 等工具代表了编码型 agent，它们通过生成、重构和解释代码来协助开发人员。它们深度集成到 IDE 工作流程中，以提高软件开发效率。
领域专用型 agent 这类 agent 针对特定的专业领域进行优化，例如法律领域的 Harvey、医疗领域的 Hippocratic AI 或金融领域的 agent。它们将领域专业知识与结构化工作流相结合，以提供有针对性的、专家级别的协助。
浏览器操作型 agent 这类 agent 能够在无需人工干预的情况下浏览网站、与网站交互、提取信息并执行操作。与遵循预设步骤的传统机器人流程自动化不同，现代浏览器操作型 agent 结合了语言理解、视觉感知和动态规划能力，能够实时适应变化。

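Translator's notes:

The Information is a well-known digital media company focused on technology and business, noted for its in-depth investigative reporting and high-quality content.

Conversational turn-taking refers to how a conversational system manages who speaks first, who speaks next, and when to switch speakers, as well as how it understands the user's intent and generates an appropriate reply in each turn.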

In addition to these seven types of agents, voice and video agents are important and also expected to increase in adoption in the coming years:

Voice agents: Powered by end-to-end speech understanding and generation, these agents are enabling conversational automation in areas like customer service, appointment scheduling, and even real-time order processing.
Video agents: These agents present users with avatar-based video responses, combining lip-synced speech, facial expression, and gesture. They’re emerging rapidly in sales, training, customer onboarding, marketing, and virtual presence tools, enabling scalable, personalized video interactions without manual production.

除了这七类 agent 之外，语音与视频 agent 也至关重要，并且预计在未来几年将得到更广泛的应用：

语音 agent 凭借端到端的语音理解与生成能力，这类 agent 正在客户服务、预约安排乃至实时订单处理等领域实现对话自动化。
视频 agent 这类 agent 向用户呈现基于虚拟形象的视频响应，结合了唇形同步的语音、面部表情和手势。它们正迅速应用于销售、培训、客户引导、营销及虚拟存在工具中，能够实现可规模化、个性化的视频互动，而无需人工制作。

Importantly, the number and variety of agent types are growing rapidly, and we will likely see new kinds of agents emerge across many domains as the field and its underlying technologies evolve. In this book, our emphasis is on the core category of agents built around language models, particularly those using text and code. While we touch on business-task automation, voice, and video, subsequent chapters primarily explore the architectures, reasoning, and UX of these language-model agents.

重要的是，agent 的类型和数量正在快速增长，随着该领域及其底层技术的发展，我们很可能会看到许多领域涌现出新型的 agent。在本书中，我们重点讨论围绕语言模型构建的核心类别 agent，特别是那些使用文本和代码的 agent。虽然我们会提及业务任务自动化、语音和视频 agent，但后续章节将主要探讨围绕语言模型构建的 agent——其架构、推理能力和用户体验。

Now that we’ve discussed the evolving types of agents, the next critical question becomes: which model should you choose to power your agent? Model selection is a complex and rapidly changing domain. As discussed in the next section, you’ll need to balance factors like task complexity, modality support, latency and cost constraints, and integration requirements to make the right choice for your agent.

既然我们已经讨论了不断发展的 agent 类型，下一个关键问题就变成了：您应该选择哪种模型来驱动您的 agent？模型选择是一个复杂且快速变化的领域。正如下一节将讨论的，您需要权衡任务复杂性、模态支持、延迟与成本约束以及集成需求等因素，才能为您的 agent 做出正确的选择。

Model Selection

Today, we are fortunate to have a proliferation of powerful models available from both commercial providers and the open source community. OpenAI, Anthropic, Google, Meta, and DeepSeek each offer state-of-the-art foundation models with impressive general-purpose capabilities. At the same time, open-weight models like Llama, Mistral, and Gemma are pushing the boundaries of what can be achieved with local or fine-tuned deployments. Even more striking is the rapid advancement of small- and medium-sized models. New techniques for distillation, quantization, and synthetic data generation are enabling compact models to inherit surprising levels of capability from their larger counterparts.

如今，我们很幸运地看到，无论是商业提供商还是开源社区都涌现了大量强大的模型。OpenAI、Anthropic、Google、Meta 和 DeepSeek 各自提供了具备卓越通用能力的最先进的基础模型。与此同时，像 Llama、Mistral 和 Gemma 这样的开放权重模型，正在不断拓宽本地化部署或微调部署所能实现的边界。更引人注目的是中小型模型的快速发展。蒸馏、量化和合成数据生成等新技术，正使得紧凑型模型能够从更大规模的对应模型中继承令人惊叹的能力水平。

This explosion of choice is good news: competition is driving faster innovation, better performance, and lower costs. But it also creates a dilemma—how do you choose the right model for your agentic system? The truth is, there isn’t a one-size-fits-all answer. In fact, one of the most reasonable starting points is simply to use the latest general-purpose model from a leading provider like OpenAI or Anthropic. As you can see in Table 1-1, these models offer strong performance out of the box, require little customization, and will take you surprisingly far for many applications. GPT-5 mini (Aug 2025) leads overall with the highest mean score (0.819), closely followed by o4-mini (0.812) and o3 (0.811). Proprietary and open-access models like Qwen3, Grok 4, Claude 4, and Kimi K2 also show competitive results.

选择的爆炸式增长是个好消息：竞争推动了更快的创新、更好的性能和更低的成本。但这同时也带来了一个难题——你该如何为你的 agentic system 选择正确的模型？事实是，并没有一个放之四海而皆准（one-size-fits-all）的答案。实际上，一个最合理的起点就是直接使用领先提供商（如 OpenAI 或 Anthropic）的最新通用模型。正如你在表 1-1 中看到的，这些模型提供了开箱即用的强大性能，几乎无需定制，并且对于许多应用来说，它们能带你走得很远。GPT-5 mini（2025 年 8 月）以最高的平均得分（0.819）总体领先，紧随其后的是 o4-mini（0.812）和 o3（0.811）。专有模型和开放访问模型，如 Qwen3、Grok 4、Claude 4 和 Kimi K2 也显示出有竞争力的结果。

Model	Mean score	MMLU-Pro—COT correct	GPQA—COT correct	IFEval—IFEval Strict Acc	WildBench—WB Score	Omni-MATH—Acc
GPT-5 mini (2025-08-07)	0.819	0.835	0.756	0.927	0.855	0.722
o4-mini (2025-04-16)	0.812	0.82	0.735	0.929	0.854	0.72
o3 (2025-04-16)	0.811	0.859	0.753	0.869	0.861	0.714
GPT-5 (2025-08-07)	0.807	0.863	0.791	0.875	0.857	0.647
Qwen3 235B A22B Instruct 2507 FP8	0.798	0.844	0.726	0.835	0.866	0.718
Grok 4 (0709)	0.785	0.851	0.726	0.949	0.797	0.603
Claude 4 Opus (20250514, extended thinking)	0.78	0.875	0.709	0.849	0.852	0.616
gpt-oss-120b	0.77	0.795	0.684	0.836	0.845	0.688
Kimi K2 Instruct	0.768	0.819	0.652	0.85	0.862	0.654
Claude 4 Sonnet (20250514, extended thinking)	0.766	0.843	0.706	0.84	0.838	0.602

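Table 1-1. HELM Core Scenario leaderboard (August 2025). Benchmark performance comparison of the top 10 models across reasoning and evaluation tasks (MMLU-Pro, GPQA, IFEval, WildBench, and Omni-MATH).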

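Translator's note: HELM is a model evaluation leaderboard from Stanford. The metrics above assess, respectively, multidisciplinary general-knowledge exams (MMLU-Pro), graduate-level science question answering (GPQA), instruction following (IFEval), robustness and overall performance on complex real-world prompts (WildBench), and mathematical ability (Omni-MATH). For an introduction to LLM evaluation leaderboards, see this blog post.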
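As a quick sanity check on Table 1-1, the Mean score column appears to be the unweighted average of the five benchmark columns. That is an inference from the numbers rather than something the text states, so the sketch below simply recomputes it for three rows copied from the table and compares the result with the reported value, allowing for rounding.

```python
# Rows copied from Table 1-1: (model, reported mean, five benchmark scores in
# column order: MMLU-Pro, GPQA, IFEval, WildBench, Omni-MATH).
ROWS = [
    ("GPT-5 mini", 0.819, [0.835, 0.756, 0.927, 0.855, 0.722]),
    ("o4-mini", 0.812, [0.82, 0.735, 0.929, 0.854, 0.72]),
    ("GPT-5", 0.807, [0.863, 0.791, 0.875, 0.857, 0.647]),
]

def mean(scores):
    """Unweighted arithmetic mean (assumed to be how the column is computed)."""
    return sum(scores) / len(scores)

# Differences between the recomputed and reported means; small values are
# expected because the table's inputs are themselves rounded to three decimals.
differences = {name: abs(mean(scores) - reported) for name, reported, scores in ROWS}

for name, reported, scores in ROWS:
    print(f"{name}: reported {reported}, recomputed {mean(scores):.4f}")
```

Read this way, the table also shows why a single mean can hide trade-offs: GPT-5 scores higher than GPT-5 mini on MMLU-Pro and GPQA, yet its lower Omni-MATH score pulls its average below the smaller model's.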

That said, they aren’t always the most efficient choice. For many tasks—especially those that are well-defined, low-latency, or cost-sensitive—much smaller models can provide near-equivalent performance at a fraction of the cost. This has led to a growing trend: automated model selection. Some platforms now route simpler queries to fast, inexpensive small models, reserving the large, expensive models for more complex reasoning. This dynamic test-time optimization is proving effective, and it hints at a future where multimodel systems become the norm.

话虽如此，它们也并非总是最高效的选择。对于许多任务——尤其是那些定义明确、对延迟敏感或对成本敏感的任务——更小型的模型能以极低的成本提供近乎同等的性能。这催生了一个日益增长的趋势：自动化模型选择。一些平台现在将简单的查询路由到快速、廉价的小型模型，而将大型、昂贵的模型留给更复杂的推理任务。这种动态的测试时优化已被证明是有效的，它预示着一个多模型系统成为常态的未来。
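The routing idea above can be illustrated with a deliberately simple heuristic. This is a sketch under loud assumptions: the model names are placeholders, the keyword list and word-count threshold are invented for illustration, and real platforms are more likely to use a learned classifier or the provider's own routing than a rule like this.

```python
from dataclasses import dataclass

@dataclass
class ModelChoice:
    name: str
    reason: str

# Placeholder model tiers; substitute whatever models you actually have access to.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-reasoning-model"

# Illustrative hints that a query may need multistep reasoning.
COMPLEX_HINTS = ("prove", "analyze", "plan", "debug", "compare")

def route(query: str, max_simple_words: int = 30) -> ModelChoice:
    """Send short, simple queries to the small model and the rest to the large one."""
    words = query.lower().split()
    if len(words) > max_simple_words:
        return ModelChoice(LARGE_MODEL, "long query")
    if any(hint in words for hint in COMPLEX_HINTS):
        return ModelChoice(LARGE_MODEL, "reasoning keyword")
    return ModelChoice(SMALL_MODEL, "short, simple query")

print(route("What time is it?"))
print(route("Please debug this failing test"))
```

Keeping the routing decision behind one function also supports the flexibility argument: the heuristic can later be replaced by a better router without touching the rest of the agent.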

The key takeaway is that you can spend enormous effort optimizing model selection for marginal gains—but unless your scale or constraints demand it, starting simple is fine. Over time, it’s often worth experimenting with smaller models, fine-tuning, or adding retrieval to improve performance and reduce costs. Just remember: the future is almost certainly multimodel, and designing for flexibility now will pay off later.

关键在于，你可以投入巨大的精力去优化模型选择以获得边际收益——但除非你的规模或约束条件有此需求，否则从简单的模型开始完全可行。随着时间的推移，通常值得尝试更小的模型、进行微调或添加检索功能，以提升性能并降低成本。只需记住：未来几乎必然是多模型的，现在为灵活性进行设计将在未来带来回报。

From Synchronous to Asynchronous Operations

Traditional software systems typically execute tasks synchronously, moving step-by-step and waiting for each action to finish before starting the next. While this approach is straightforward, it can lead to significant inefficiencies—especially when waiting on external inputs or processing large volumes of data.

传统软件系统通常以同步方式执行任务，按步骤逐步推进，并在启动下一步之前等待当前操作完成。虽然这种方法简单直接，但可能导致严重的效率低下——尤其是在需要等待外部输入或处理大量数据时。

In contrast, autonomous agents are designed for asynchronous operation. They can manage multiple tasks in parallel, swiftly adapt to new information, and prioritize actions dynamically based on changing conditions. This asynchronous processing dramatically enhances efficiency, reducing idle time and optimizing the use of computational resources.

相比之下，自主 agent 是为异步操作而设计的。它们可以并行管理多项任务，快速适应新信息，并根据不断变化的条件动态地确定行动的优先级。这种异步处理方式极大地提高了效率，减少了空闲时间，并优化了计算资源的利用。
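The difference between the two execution styles is easy to see with Python's standard `asyncio` module. The sketch below is illustrative only: `fetch` is a made-up stand-in for any I/O-bound call, and the job names and delays are arbitrary.

```python
import asyncio
import time


async def fetch(source: str, delay: float) -> str:
    """Stand-in for an I/O-bound call (API request, database query)."""
    await asyncio.sleep(delay)  # simulated wait on an external system
    return f"{source}: done"


async def run_sequentially(jobs: dict[str, float]) -> list[str]:
    # Synchronous style: each call finishes before the next one starts.
    return [await fetch(name, delay) for name, delay in jobs.items()]


async def run_concurrently(jobs: dict[str, float]) -> list[str]:
    # Asynchronous style: all calls are in flight at the same time.
    return await asyncio.gather(*(fetch(n, d) for n, d in jobs.items()))


async def main() -> None:
    jobs = {"inbox": 0.2, "invoices": 0.2, "tickets": 0.2}
    for runner in (run_sequentially, run_concurrently):
        start = time.perf_counter()
        results = await runner(jobs)
        elapsed = time.perf_counter() - start
        print(f"{runner.__name__}: {len(results)} results in {elapsed:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

The sequential runner should take roughly the sum of the delays, while the concurrent runner should take roughly as long as the slowest single job.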

The practical implications of this shift are substantial. For example:

Emails can arrive with reply drafts already prepared.
Invoices can come with pre-populated payment details.
Software engineers might receive tickets accompanied by code to solve them and unit tests to assess them.
Customer support agents can be provided with suggested responses and recommended actions.
Security analysts can receive alerts that have already been automatically investigated and enriched with relevant threat intelligence.

这种转变的实际意义是重大的。例如：

收到的电子邮件可能已附带起草好的回复草稿。
收到的发票可能已预填好付款信息。
软件工程师收到的工单可能会附带解决问题的代码以及用于评估的单元测试。
客户支持人员可能会获得建议的回复和推荐的操作。
安全分析师收到的警报可能已自动完成初步调查并附带了相关的威胁情报。

In each case, agents are not just speeding up routine workflows—they are changing the nature of work itself. This evolution transforms human roles from task executors to task managers. Rather than spending time on repetitive or mechanical steps, individuals can focus on strategic oversight, review, and high-value decision making—amplifying human creativity and judgment while letting agents handle the operational details. These agents make it much easier for human roles to be proactive rather than reactive.

在上述每种情况下，agent 不仅加速了常规工作流程——它们更是在改变工作本身的性质。这一演变将人类的角色从任务执行者转变为任务管理者。个人无需再将时间花费在重复性或机械的步骤上，而是可以专注于战略监督、审查和高价值的决策——从而放大人类的创造力与判断力，同时让 agent 处理操作细节。这些 agent 使得人类角色能够更容易地主动行动，而非被动应对。

实际的应用和用例

The versatility of autonomous agents opens up a myriad of applications across different industries. To keep this book grounded in clear, specific use cases, I have seven real-world example agents with evaluation systems available in the public GitHub repo supporting this book. We will frequently turn back to these examples as we explore the key aspects of agent systems:

Customer support agent

Customer support is one of the most prevalent applications for autonomous agents. These agents handle common inquiries, process refunds, update orders, and escalate complex issues to human representatives, providing 24/7 support while enhancing customer satisfaction and reducing operational costs.
Financial services agent

In banking and financial services, agents assist with account management, loan processing, fraud investigation, and investment portfolio rebalancing. They streamline customer service, accelerate transaction processing, and improve security by detecting suspicious activities in real time.
Healthcare patient intake and triage agent

These agents support frontline healthcare operations by registering new patients, verifying insurance, assessing symptoms to prioritize care, scheduling appointments, managing medical histories, and coordinating referrals, thereby improving workflow efficiency and patient outcomes.
IT help desk agent

IT help desk agents manage user access, troubleshoot network and system issues, deploy software updates, respond to security incidents, and escalate unresolved issues to specialists. They enhance productivity by resolving common technical problems swiftly.
Legal document review agent

Legal agents assist attorneys and paralegals by reviewing contracts, conducting legal research, performing client intake and conflict checks, managing discovery, assessing compliance, calculating damages, and tracking deadlines. This helps to streamline workflows and improve accuracy in legal operations.
Security Operations Center (SOC) analyst agent

SOC analyst agents investigate security alerts, gather threat intelligence, query logs, triage incidents, isolate compromised hosts, and provide updates to security teams. They accelerate incident response and strengthen organizational security posture.
Supply chain and logistics agent

In supply chain management, agents optimize inventory, track shipments, evaluate suppliers, coordinate warehouse operations, forecast demand, manage disruptions, and handle compliance requirements. These capabilities help maintain resilience and efficiency across global networks.

自主 agent 的多功能性为不同行业开启了无数的应用场景。为使本书内容立足于清晰、具体的用例，我在支持本书的公开 GitHub 仓库中提供了七个真实世界的示例 agent 及其评估系统。在探讨 agent 系统的关键方面时，我们将经常回顾这些示例：

客户支持 agent 客户支持是自主 agent 最普遍的应用场景之一。这类 agent 处理常见咨询、办理退款、更新订单，并将复杂问题转交人工处理，提供全天候支持，在提升客户满意度的同时降低运营成本。
金融服务 agent 在银行与金融服务领域，agent 协助进行账户管理、贷款处理、欺诈调查和投资组合再平衡。它们通过实时检测可疑活动，简化客户服务，加速交易处理，并提升安全性。
医疗患者接待与分诊 agent 这类 agent 通过登记新患者、核实保险信息、评估症状以确定护理优先级、安排预约、管理病史以及协调转诊，支持一线医疗运营，从而提升工作流程效率与患者治疗效果。
IT 服务台 agent IT 服务台 agent 管理用户访问权限、排查网络与系统问题、部署软件更新、响应安全事件，并将未解决的问题升级给专家处理。它们通过快速解决常见技术问题来提高工作效率。
法律文档审阅 agent 法律 agent 通过审阅合同、进行法律研究、执行客户接待与利益冲突审查、管理证据开示、评估合规性、计算损害赔偿以及跟踪截止日期，协助律师和律师助理。这有助于简化工作流程并提高法律运营的准确性。
安全运营中心（SOC）分析师 agent SOC 分析师 agent 负责调查安全警报、收集威胁情报、查询日志、对事件进行分类、隔离受感染的主机，并向安全团队提供更新。它们加速事件响应并增强组织的安全态势。
供应链与物流 agent 在供应链管理中，agent 用于优化库存、跟踪货运、评估供应商、协调仓储运营、预测需求、处理中断事件以及管理合规要求。这些能力有助于在全球网络中保持韧性与效率。

Autonomous agents offer significant potential across various use cases, from customer support and personal assistance to legal services and advertising. By integrating these agents into their operations, organizations can achieve greater efficiency, improve service quality, and unlock new opportunities for innovation and growth. As we continue to explore the capabilities and applications of autonomous agents in this book, it becomes evident that their impact will be profound and far-reaching across multiple industries.

自主 agent 在从客户支持、个人助理到法律服务和广告等多种用例中都展现出巨大潜力。通过将这些 agent 整合到运营中，组织能够实现更高的效率、提升服务质量，并为创新与增长开辟新的机遇。随着我们在本书中继续探讨自主 agent 的能力与应用，其影响将在多个行业中变得深刻而深远，这一点已显而易见。

Now that we’ve looked at some example agents, in the next section, we’ll discuss some of the key considerations when designing our agentic systems.

既然我们已经看了一些示例 agent，在下一节中，我们将讨论设计我们的 agentic 系统时的一些关键考量因素。

工作流和 Agents

In many real‐world projects, choosing between a simple script, a deterministic workflow, a traditional chatbot, a retrieval‐augmented generation (RAG) system, or a full‐blown autonomous agent can be the difference between an elegant solution and an overengineered, hard‐to‐maintain mess. To make this choice clearer, consider four key factors: the variability of your inputs, the complexity of the reasoning required, any performance or compliance constraints, and the ongoing maintenance burden.

在许多现实项目中，选择使用简单的脚本、确定性的工作流、传统的聊天机器人、检索增强生成系统（RAG），还是完备的自主 agent，其结果可能大相径庭——既可能成就一个优雅的解决方案，也可能形成一个过度设计、难以维护的烂摊子。为了使这个选择更加清晰，请考虑四个关键因素：输入信息的可变性、所需推理的复杂性、任何性能或合规性约束，以及持续的维护负担。

First, when might you choose not to use a foundation model—or any ML component at all? If your inputs are fully predictable and every possible output can be described in advance, a handful of lines of procedural code are often faster, cheaper, and far easier to test than an ML–based pipeline. For example, parsing a log file that always follows the format “YYYY‐MM‐DD HH:MM:SS—message” can be handled reliably with a small regular‐expression‐based parser in Python or Go. Likewise, if your application demands millisecond‐level latency—such as an embedded system that must react to sensor data in real time—there simply isn’t time for a language model API call. In such cases, traditional code is the right choice. Finally, regulated domains (medical devices, aeronautics, certain financial systems) often require fully deterministic, auditable decision logic—black‐box neural models won’t satisfy certification requirements. If any of these conditions hold—deterministic inputs, strict performance or explainability needs, or a static problem domain—plain code is almost always preferable to a foundation model.

首先，何时你可能选择不使用基础模型——或者根本不用任何机器学习组件呢？如果你的输入是完全可预测的，并且所有可能的输出都能预先描述，那么几行过程式代码（procedural）通常比基于机器学习的流程更快、更便宜，并且测试起来也简单得多。例如，解析一个始终遵循 “YYYY-MM-DD HH:MM:SS—message” 格式的日志文件，完全可以用 Python 或 Go 中一个小型基于正则表达式的解析器可靠地处理。同样，如果你的应用需要毫秒级延迟——例如必须对传感器数据做出实时反应的嵌入式系统——那么根本没有时间进行语言模型的 API 调用。在这种情况下，传统代码才是正确的选择。最后，受监管的领域（医疗设备、航空航天、某些金融系统）通常要求完全确定性、可审计的决策逻辑——黑盒神经网络模型无法满足认证要求。如果存在这些情况中的任何一种——确定性输入、严格的性能或可解释性要求，或是静态问题域——那么普通代码几乎总是比基础模型更可取。
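As an illustration of how little code the log-parsing case needs, here is a minimal sketch using only the standard library. It assumes the separator between the timestamp and the message is a hyphen or dash, optionally padded with spaces; the `LogEntry` type and `parse_line` function are names chosen for this example.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumption: timestamp, then a hyphen or dash separator (optionally padded
# with spaces), then the free-text message.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*[-\u2013\u2014]\s*(?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: datetime
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not match the format."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    try:
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
    except ValueError:  # right shape but invalid date, e.g. month 13
        return None
    return LogEntry(ts, match["msg"])


if __name__ == "__main__":
    print(parse_line("2024-05-01 12:30:00 - disk full"))
```

Every behavior of this parser can be covered by a handful of unit tests, which is exactly the property that makes plain code attractive for fully predictable inputs.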

Next, consider deterministic or semiautomated workflows. Here, the logic can be expressed as a finite set of steps or branches, and you know ahead of time where you might need human intervention or extra error handling. Suppose you ingest invoices from a small set of vendors and each invoice arrives in one of three known formats: CSV, JSON, or PDF. You can build a workflow that routes each format to its corresponding parser, checks for mismatches, and halts for a human review if any fields fail a simple reconciliation—no deep semantic understanding is required. Likewise, if your system must retry failed steps with exponential backoff or pause for a manager’s approval, a workflow engine (such as Airflow, AWS Step Functions, or a well‐structured set of scripts) offers clearer control over error paths than an LLM could. Deterministic workflows make sense whenever you can enumerate all decision branches in advance and you need tight, auditable control over each branch. In such scenarios, workflows scale more naturally than large, ad hoc scripts but still avoid the complexity and cost of running an agentic pipeline.

接下来，考虑确定性的或半自动化的工作流。在这种情况下，逻辑可以表示为有限的步骤或分支集合，并且你可以提前知道哪些环节可能需要人工干预或额外的错误处理。假设你从一小部分供应商那里接收发票，每张发票都是三种已知格式中的一种：CSV、JSON 或 PDF。你可以构建一个工作流，将每种格式路由到对应的解析器，检查是否存在不匹配项，并在任何字段未能通过简单的对账校验时暂停以等待人工审查——这并不需要深层的语义理解。同样，如果你的系统必须使用指数退避（exponential backoff）重试失败的步骤，或者需要暂停以等待经理批准，那么工作流引擎（例如 Airflow、AWS Step Functions 或一组结构良好的脚本）相比大型语言模型能提供更清晰的对错误路径的控制。无论何时，只要你能预先枚举所有决策分支，并且需要对每个分支进行严格、可审计的控制，确定性工作流就是合理的选择。在这类场景中，工作流比大型的、临时拼凑的脚本更具可扩展性，同时仍避免了运行 agentic 流程的复杂性和成本。

译注：这里的工作流指的是工作流引擎驱动的流程，详情可以见《Practical Process Automation》这本书
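The invoice example can be sketched as a small deterministic workflow. This is a simplified illustration, not a production pipeline: the invoice shape, the CSV column names, and the `needs_human_review` status are assumptions made for the example, and a PDF parser would be registered the same way as the two shown.

```python
import csv
import io
import json
from typing import Callable

# Hypothetical invoice shape: {"vendor": str, "lines": [float, ...], "total": float}


def parse_json(raw: str) -> dict:
    return json.loads(raw)


def parse_csv(raw: str) -> dict:
    # Assumed CSV layout: header "vendor,amount,total", one row per line item.
    rows = list(csv.DictReader(io.StringIO(raw)))
    return {
        "vendor": rows[0]["vendor"],
        "lines": [float(r["amount"]) for r in rows],
        "total": float(rows[0]["total"]),
    }


# One parser per known format: the branches are enumerated up front.
PARSERS: dict[str, Callable[[str], dict]] = {"json": parse_json, "csv": parse_csv}


def reconcile(invoice: dict, tolerance: float = 0.01) -> bool:
    """Simple check: do the line items add up to the stated total?"""
    return abs(sum(invoice["lines"]) - invoice["total"]) <= tolerance


def process_invoice(fmt: str, raw: str) -> dict:
    """Route by format, parse, reconcile, and stop for review on any mismatch."""
    parser = PARSERS.get(fmt)
    if parser is None:
        return {"status": "needs_human_review", "reason": f"unknown format: {fmt}"}
    invoice = parser(raw)
    if not reconcile(invoice):
        return {"status": "needs_human_review", "reason": "total mismatch",
                "invoice": invoice}
    return {"status": "approved", "invoice": invoice}
```

Every path through `process_invoice` is visible in the code, which is what makes this style easy to audit compared with a model-driven pipeline.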

Traditional chatbots or RAG systems occupy the next tier of complexity: they add natural language understanding and document retrieval but stop short of autonomous, multistep planning. If your primary need is to let users ask questions about a knowledge base—say, searching a product manual, a legal archive, or corporate wikis—a RAG system can embed documents into a vector store, retrieve relevant passages in response to a query, and generate coherent, context‐aware answers. For instance, an internal IT help desk might use RAG to answer “How do I reset my VPN credentials?” by fetching the latest troubleshooting guide and summarizing the relevant steps. Unlike autonomous agents, RAG systems do not independently decide on follow‐up actions (like filing a ticket or scheduling a callback); they simply surface information. A traditional chatbot or RAG approach makes sense when the task is primarily question‐answering over structured or unstructured content, with limited need for external API calls or decision orchestration. Maintenance costs are lower than for agents—your main overhead lies in keeping document embeddings up to date and refining prompts—but you sacrifice the agent’s ability to plan multistep workflows or learn from feedback loops.

传统聊天机器人或 RAG 系统处于下一个复杂度层级：它们增加了自然语言理解和文档检索能力，但尚未具备自主的多步骤规划能力。如果你的主要需求是让用户查询知识库——例如搜索产品手册、法律档案或企业维基——RAG 系统可以将文档嵌入向量存储，根据查询检索相关段落，并生成连贯、上下文感知的回答。例如，一个内部 IT 服务台可能使用 RAG 来回答“如何重置我的 VPN 凭据？”，其方式是获取最新的故障排除指南并总结相关步骤。与自主 agent 不同，RAG 系统不会独立决定后续行动（例如创建工单或安排回访）；它们只是提供信息。当任务主要是基于结构化或非结构化内容进行问答，且对外部 API 调用或决策编排的需求有限时，传统的聊天机器人或 RAG 方法是合理的。其维护成本低于 agent——你的主要开销在于保持文档嵌入的更新和改进提示词——但你牺牲了 agent 规划多步骤工作流或从反馈闭环中学习的能力。
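The retrieve-then-generate flow can be shown with a toy example. To keep it self-contained, the sketch below substitutes a bag-of-words vector for a real embedding model and a dictionary for a vector store. The documents and function names are hypothetical, and the final model call is left out: `build_prompt` only assembles what would be sent.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; in practice these would be chunks of real documents.
DOCS = {
    "vpn-reset": "To reset VPN credentials open the IT portal and choose reset VPN password",
    "printer": "To add a printer open settings and select the office printer",
    "expenses": "Submit expense reports through the finance portal each month",
}


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# "Vector store": document id -> precomputed vector
INDEX = {doc_id: embed(text) for doc_id, text in DOCS.items()}


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda doc_id: cosine(q, INDEX[doc_id]), reverse=True)
    return ranked[:k]


def build_prompt(query: str) -> str:
    """Assemble the context-augmented prompt that would be sent to the model."""
    context = "\n".join(DOCS[doc_id] for doc_id in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Note that nothing here decides to take an action: the system surfaces the most relevant passage and stops, which is the boundary between RAG and an agent.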

Finally, we reach autonomous agents—situations where neither simple code, nor rigid workflows, nor RAG suffice because inputs are unstructured, novel, or highly variable, and because you require dynamic, multistep planning or continuous learning from feedback. Consider a customer support center that receives free‐form emails with issues ranging from “my laptop battery is swelling and might erupt” to “I keep getting billed for services I didn’t order.” A rule‐based workflow or a RAG‐powered FAQ lookup would shatter under such open‐ended variety, but an agent powered by a foundation model can parse intent, extract relevant entities, consult a knowledge base, draft an appropriate response, and even escalate to a human if necessary—all without being told every possible branch in advance. Similarly, in supply chain management, an agent that ingests real‐time inventory data, supplier lead times, and sales forecasts can replan shipment schedules dynamically; a deterministic workflow would require constant manual updates to handle new exceptions.

最后，我们来到了自主 agent 的应用场景——当简单的代码、固定的工作流或 RAG 系统都无法满足需求时，这些情况往往源于输入信息是非结构化、新颖或高度多变的，并且你需要动态的多步骤规划或从反馈中持续学习的能力。以一个客户支持中心为例，它接收到的电子邮件形式自由，问题五花八门，从“我的笔记本电脑电池鼓包了，可能快要爆了”到“我一直在为没订购的服务被扣款”。面对如此开放多样的提问，基于规则的工作流或依赖 RAG 的常见问题查询系统将难以应对，而由基础模型驱动的 agent 能够解析意图、提取相关实体、查询知识库、起草合适的回复，甚至在必要时将问题升级给人工处理——所有这些都无需预先告知所有可能的分支。同样，在供应链管理中，一个能够处理实时库存数据、供应商交期和销售预测的 agent 可以动态地重新规划货运计划；而一个确定性的工作流则需要不断的人工更新来处理新的异常情况。

Agents also excel when many subtasks must run in parallel—such as a security operations agent that simultaneously queries threat intelligence APIs, scans network telemetry, and performs sandbox analysis on suspicious binaries. Because agents operate asynchronously and reprioritize based on real‐time data, they avoid the brittle “one‐step‐at‐a‐time” nature of workflows or RAG systems. To justify the higher compute and maintenance costs of running a foundation model, you need this level of contextual reasoning, parallel task orchestration, or ongoing self‐improvement—scenarios where rigid code, workflows, or chatbots would be too brittle or expensive to maintain.

当需要并行运行多个子任务时，agent 的优势尤为突出——例如，一个安全运营 agent 可以同时查询威胁情报 API、扫描网络遥测数据并对可疑二进制文件进行沙箱分析。由于 agent 以异步方式运行，并能根据实时数据重新确定优先级，它们避免了工作流或 RAG 系统那种脆弱的“一次一步”的特性。要证明运行基础模型所带来的更高计算和维护成本的合理性，你需要这种级别的上下文推理、并行任务编排或持续的自我改进能力——在这些场景下，僵化的代码、工作流或聊天机器人会过于脆弱或维护成本过高。

特性	传统代码	工作流	自主 agent
输入结构	完全可预测的模式	大部分可预测，具有有限分支	高度非结构化或新颖的输入
可解释性	完全透明；易于审计	明确的分支审计跟踪	需要额外工具的“黑盒”组件
延迟	超低延迟	中等延迟	较高延迟
适应性与学习能力	无	有限	高（可从反馈中学习）

表 1-2 区分传统代码、工作流和 agents

Every path carries trade‐offs. Pure code is cheap and fast but inflexible; workflows offer control but break down when inputs grow wildly variable; traditional chatbots or RAG are great for question‐answering over documents but cannot orchestrate multistep actions; and agents are powerful but demanding—both in terms of cloud compute and engineering effort to monitor, tune, and govern. Before choosing, ask: are my inputs unstructured or unpredictable? Do I need multistep planning that adapts to intermediate results? Can a document retrieval system suffice for my users’ information needs, or must the system decide and act autonomously? Will I want this system to improve itself over time with minimal human intervention? And can I tolerate the latency and maintenance burden of a foundation model?

每条路径都需要权衡。纯代码成本低、速度快，但缺乏灵活性；工作流提供了控制力，但当输入变得高度多变时便会失效；传统聊天机器人或 RAG 系统擅长基于文档的问答，但无法编排多步骤操作；而 agent 虽然强大，但要求也高——无论是云计算的成本，还是在监控、调优和治理方面投入的工程精力都是如此。在做选择之前，请问问自己：我的输入是非结构化或不可预测的吗？我需要能够适应中间结果的多步骤规划吗？一个文档检索系统能否满足用户的信息需求，还是系统必须自主决策并采取行动？我是否希望这个系统能以最少的人工干预随时间自我改进？我能否接受基础模型带来的延迟和维护负担？

In short, if your task is a fixed, deterministic transformation, write some simple code. If there are a handful of known branches and you require explicit error‐handling checkpoints, use a deterministic workflow. If you primarily need natural language question‐answering over a corpus, choose a traditional chatbot or RAG architecture. But if you face high variability, open‐ended reasoning, dynamic planning needs, or continual learning requirements, invest in an autonomous agent. Making this choice thoughtfully ensures that you get the right balance of simplicity, performance, and adaptability—so your solution remains both effective and maintainable as requirements evolve.

简而言之，如果你的任务是固定、确定性的转换，那么就编写一些简单的代码。如果存在若干已知的分支，并且你需要明确的错误处理检查点，就使用确定性的工作流。如果你的主要需求是对一个语料库进行自然语言问答，就选择传统的聊天机器人或 RAG 架构。但是，如果你面对的是高可变性、开放性推理、动态规划需求或持续学习要求，那么就投入构建一个自主 agent。审慎地做出这个选择，能确保你在简洁性、性能和适应性之间取得恰当的平衡——从而使你的解决方案在需求演变的过程中，既能保持有效性，也易于维护。
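The rules of thumb in this section can be written down as a small decision function. This is only one possible encoding, offered as a sketch: the `TaskProfile` fields and the order of the checks are assumptions made for illustration, and real projects will weigh cost and maintenance in ways a few booleans cannot capture.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    inputs_predictable: bool          # can every input shape be described in advance?
    branches_enumerable: bool         # can all decision branches be listed up front?
    mostly_question_answering: bool   # is the job mainly Q&A over a corpus?
    needs_multistep_planning: bool    # must the system plan and act on intermediate results?
    hard_realtime_or_certified: bool = False  # millisecond latency or certification needs


def recommend(profile: TaskProfile) -> str:
    """Check the options from simplest to most complex and return the first fit."""
    if profile.hard_realtime_or_certified or profile.inputs_predictable:
        return "plain code"
    if profile.branches_enumerable and not profile.needs_multistep_planning:
        return "deterministic workflow"
    if profile.mostly_question_answering and not profile.needs_multistep_planning:
        return "chatbot or RAG"
    return "autonomous agent"


if __name__ == "__main__":
    support_inbox = TaskProfile(False, False, False, True)
    print(recommend(support_inbox))
```

Checking the simplest option first mirrors the advice above: reach for an agent only after the cheaper architectures have been ruled out.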

构建高效 Agentic 系统的原则

Creating successful autonomous agents requires an approach that prioritizes scalability, modularity, continuous learning, resilience, and future-proofing:

Scalability

Ensure that agents can handle growing workloads and diverse tasks by utilizing distributed architectures, cloud-based infrastructure, and efficient algorithms that support parallel processing and resource optimization. Example: a customer support agent that processes 10 tickets per minute may crash or hang when traffic spikes to 1,000 if not backed by autoscaling infrastructure.
Modularity

Design agents with independent, interchangeable components connected through clear interfaces. This modular approach simplifies maintenance, promotes flexibility, and facilitates rapid adaptation to new requirements or technologies. Example: a poorly modular agent that hardcodes all its tools in its agent service would require a full redeployment anytime a small addition or modification is needed to a tool.
Continuous learning

Equip agents with mechanisms to learn from experience, such as in-context learning. Integrate user feedback to refine agent behaviors and maintain performance relevance as tasks evolve. Example: agents that ignore feedback loops may keep making the same mistakes—like misclassifying contract clauses or failing to escalate critical support issues.
Resilience

Develop robust resilience architectures capable of gracefully handling errors, security threats, timeouts, and unexpected conditions. Incorporate comprehensive error handling, stringent security measures, and redundancy to ensure reliable and continuous agent operations. Example: agents without retry or fallback logic may crash entirely when a single API call fails, leaving the user waiting and confused.
Future-proofing

Build agent systems around open standards and scalable infrastructure, fostering a culture of innovation to adapt quickly to emerging technologies and evolving user expectations. Example: tightly coupling your agent to one proprietary vendor’s prompt format can make switching models painful and limit experimentation.

创建成功的自主 agent，需要一种优先考虑可扩展性、模块化、持续学习、弹性（resilience）和前瞻性的方法：

可扩展性（Scalability）确保 agent 能够通过分布式架构、基于云的基础设施以及支持并行处理和资源优化的高效算法，应对不断增长的工作负载和多样化任务。示例：一个每分钟处理 10 个工单的客户支持 agent，如果没有自动扩缩容基础设施的支持，当流量激增至每分钟 1000 个工单时，可能会崩溃或挂起。
模块化（Modularity）设计 agent 时，应采用独立的、可互换的组件，并通过清晰的接口进行连接。这种模块化方法简化了维护，提升了灵活性，并有助于快速适应新的需求或技术。示例：如果一个 agent 的模块化程度很低，将所有工具都硬编码在其服务中，那么每当需要对某个工具进行微小的添加或修改时，都需要进行完整的重新部署。
持续学习（Continuous learning）为 agent 配备从经验中学习的机制，例如上下文学习（in-context learning）。整合用户反馈以优化 agent 行为，并随着任务的演变保持其性能的关联性。示例：忽略反馈循环的 agent 可能会重复犯同样的错误——例如错误分类合同条款，或未能升级处理关键的支持问题。
弹性（resilience）构建稳健的弹性架构，能够优雅地处理错误、安全威胁、超时和意外状况。纳入全面的错误处理、严格的安全措施和冗余机制，以确保 agent 可靠、持续地运行。示例：缺乏重试或降级逻辑的 agent，可能在单次 API 调用失败时完全崩溃，导致用户等待并陷入困惑。
前瞻性（Future-proofing）围绕开放标准和可扩展的基础设施构建 agent 系统，培育创新文化，以快速适应新兴技术和不断变化的用户期望。示例：将您的 agent 与某个专有供应商的提示格式紧密耦合，会使切换模型变得痛苦并限制实验探索。

Adhering to these principles enables organizations to develop autonomous agents that remain effective and relevant, adapting seamlessly to technological advancements and changing operational environments.

遵循这些原则，能使组织开发出保持高效性和相关性的自主 agent，使其能够无缝适应技术进步和不断变化的运营环境。

构建 Agentic 系统的成功组织策略（Organizing for Success）

The widespread availability of foundation models via simple API calls has spurred extensive experimentation with agent systems across many organizations. Teams frequently embark on independent proofs of concept, leading to valuable discoveries and innovative ideas. However, this ease of experimentation often results in fragmentation—overlapping projects, duplicated efforts, and unfinished experiments become scattered throughout the organization. Conversely, premature standardization could stifle creativity and trap organizations into rigid frameworks or vendor-specific solutions. Achieving success requires balancing flexibility for experimentation with sufficient alignment for scalability and coherence.

基础模型通过简单的 API 调用即可广泛使用，这推动了许多组织对 agent 系统进行大量实验。团队经常独立开展概念验证，从而带来有价值的发现和创新想法。然而，这种便捷的实验方式也常常导致碎片化——重叠的项目、重复的工作和未完成的实验在组织内四处散落。反过来，过早地推行标准化则可能扼杀创造力，并使组织陷入僵化的框架或特定供应商的解决方案之中。要取得成功，就需要在保持实验灵活性和确保足够的协调性以支持可扩展性（scalability）与一致性（coherence）之间取得平衡。

In the early phases of agent development, organizations should actively encourage exploratory efforts, permitting teams to test various architectures, workflows, and models freely. Over time, as successful patterns and best practices become apparent, strategic alignment becomes critical. Implementing a “one standard per large group” strategy can effectively balance this need. Within specific departments or functional areas, teams can standardize around common tools and methodologies, streamlining collaboration without restricting broader organizational innovation.

在 agent 开发的早期阶段，组织应积极鼓励探索性尝试，允许团队自由测试不同的架构、工作流和模型。随着时间的推移，当成功的模式和最佳实践逐渐清晰时，战略协调就变得至关重要。实施 “每个大团队采用一个标准” 的策略，可以有效平衡这一需求。在特定部门或职能领域内，团队可以围绕共同的工具和方法进行标准化，从而在不限制更广泛组织创新的前提下，提高协作效率。

Another essential aspect of success is avoiding vendor lock-in by adopting open standards, such as OpenAPI, and embracing modular system designs. These practices help ensure flexibility and reduce dependency on any single technology or provider, facilitating future adaptability.

成功的另一个关键方面是，通过采用 OpenAPI 等开放标准并拥抱模块化系统设计，来避免供应商锁定。这些做法有助于确保灵活性，减少对任何单一技术或供应商的依赖，从而为未来的适应性铺平道路。
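The modular-design half of this advice can be illustrated with a thin, provider-agnostic interface. This sketch does not touch OpenAPI itself. It only shows how agent logic can depend on an interface rather than on a vendor SDK. `ChatModel`, `VendorAModel`, and `LocalModel` are hypothetical names, and the adapters return canned strings where real ones would call an SDK.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only interface the rest of the agent is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class VendorAModel:
    """Hypothetical adapter; a real one would call vendor A's SDK here."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"


class LocalModel:
    """Hypothetical adapter for a self-hosted open-weights model."""

    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"


def answer(model: ChatModel, question: str) -> str:
    # Agent logic is written once against the interface, not against a vendor.
    return model.complete(f"Answer briefly: {question}")


if __name__ == "__main__":
    for model in (VendorAModel(), LocalModel()):
        print(answer(model, "What is our refund window?"))
```

Swapping providers then means writing one new adapter class, with no changes to `answer` or anything built on top of it.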

Effective knowledge sharing is also crucial. Lessons learned from both successful and unsuccessful experiments should be communicated widely via internal forums, shared repositories, and comprehensive documentation. This collaborative approach accelerates organizational learning, minimizes redundant efforts, and promotes collective improvement.

有效的知识共享同样至关重要。无论是成功还是失败的实验，从中汲取的经验教训都应通过内部论坛、共享知识库和全面的文档进行广泛传播。这种协作方式能够加速组织的学习进程，减少冗余工作，并促进集体进步。

Lastly, governance frameworks should remain lightweight and flexible, emphasizing guiding principles over rigid mandates. A streamlined governance structure enables teams to innovate confidently while remaining aligned with overarching organizational objectives.

最后，治理框架应保持轻量化和灵活性，强调指导原则而非僵化的指令。一个精简的治理结构能使团队在保持与组织总体目标一致的前提下，充满信心地进行创新。

Organizing successfully around agentic systems is fundamentally iterative. Organizations must continually reassess their strategies to maintain a dynamic balance between exploration and standardization. By cultivating an environment that values experimentation, collaborative learning, and open standards, organizations can effectively transition agentic systems from isolated experiments into scalable, transformative solutions that are deeply integrated into their operational processes.

围绕 agentic 系统成功地进行组织本质上是一个迭代的过程。组织必须持续重新评估其策略，以在探索和标准化之间保持动态平衡。通过培育一个重视实验、协作学习和开放标准的环境，组织能够有效地将 agentic 系统从孤立的实验转变为可扩展的、变革性的解决方案，并将其深度整合到运营流程中。

Agentic 框架

Numerous frameworks currently exist for developing autonomous agents, each addressing critical functionalities such as skills integration, memory management, planning, orchestration, experiential learning, and multiagent coordination. This list is certainly not exhaustive, but leading frameworks include the following.

目前存在众多用于开发自主 agent 的框架，每个框架都致力于解决关键功能，例如技能集成、记忆管理、规划、编排、经验学习以及多 agent 协调。以下列表虽非详尽无遗，但涵盖了主要框架：

LangGraph

Strengths: Modular orchestration framework based on directed graphs whose nodes contain discrete units of logic (often foundation model calls) and whose edges manage the flow of data through complex, potentially cyclic workflows; strong developer ergonomics; native support for asynchronous workflows and retries
Trade-offs: Requires custom logic for advanced planning and memory; less built-in support for multiagent collaboration
Best for: Teams building robust, single-agent or light multiagent systems with explicit, inspectable flow control

LangGraph

优势：基于有向图的模块化编排框架，其节点包含离散的逻辑单元（通常是基础模型调用），边则管理数据在复杂且可能包含循环的工作流中的流转；出色的开发者工效学设计；原生支持异步工作流和重试机制。
权衡：需要自定义逻辑来实现高级规划和记忆功能；内置的多 agent 协作支持较少。
最适合：构建稳健的单 agent 或轻量级多 agent 系统，且需要明确、可检查的流程控制的团队。

AutoGen

Strengths: Powerful multiagent orchestration; dynamic role assignment; flexible messaging-based interaction between agents
Trade-offs: Can be heavyweight or complex for simple use cases; more opinionated around agent interaction patterns
Best for: Research and production systems involving dialogue between multiple agents (e.g., manager-worker, self-reflection loops)

AutoGen

优势：强大的多 agent 编排能力；动态角色分配；agent 之间基于消息的灵活交互。
权衡：对于简单的用例可能显得笨重或复杂；在 agent 交互模式上预设性（opinionated）较强。
最适合：涉及多 agent 间对话（例如，管理者-工作者、自反思循环）的研究和生产系统。

CrewAI

Strengths: Easy to learn and use; quick setup for prototyping; useful abstractions like “crew” and “tasks”
Trade-offs: Limited customization and control over orchestration internals; less mature than LangGraph or AutoGen for complex workflows
Best for: Developers who want to get started quickly on practical, human-centric agents like assistants or support agents

CrewAI

优势：易于学习和使用；能快速搭建原型；提供了“团队”和“任务”等实用抽象概念。
权衡：对编排内部机制的定制和控制有限；在处理复杂工作流时，不如 LangGraph 或 AutoGen 成熟。
最适合：希望快速启动开发实用型、以人为本的 agent（如助手或支持 agent）的开发者。

OpenAI Agents Software Development Kit (SDK)

Strengths: Deep integration with OpenAI’s tool ecosystem; secure and easy-to-use function calling, memory primitives, and tool routing
Trade-offs: Tightly coupled to OpenAI’s infrastructure; may be less flexible or portable for custom agent stacks or open source toolchains
Best for: Teams already using the OpenAI API and looking for a fast way to build secure, tool-using agents with minimal scaffolding

OpenAI Agents Software Development Kit (SDK)

优势：与 OpenAI 的工具生态系统深度集成；提供安全易用的函数调用、记忆原语（primitive）和工具路由功能。
权衡：与 OpenAI 的基础设施紧密耦合；对于自定义 agent 技术栈或开源工具链，可能灵活性和可移植性较差。
最适合：已在使用 OpenAI API 并希望快速构建安全、使用工具且无需复杂脚手架的 agent 的团队。

While each framework offers unique advantages and limitations, continuous innovation and competition in this space are expected to drive further evolution. For early prototypes, CrewAI or the OpenAI Agents SDK can get you running quickly. For scalable, production-grade systems, LangGraph and AutoGen provide more control and sophistication. None of these frameworks is strictly necessary, however, and many teams choose to build directly against the model provider APIs. This book primarily focuses on LangGraph, chosen for its straightforward yet powerful approach to agent system development. Through detailed explanations, practical examples, and real-world scenarios, we demonstrate how LangGraph effectively addresses the complexity and dynamics required by modern intelligent agents.

尽管每个框架都有其独特的优势和局限，但该领域的持续创新与竞争有望推动其进一步发展。对于早期原型，CrewAI 或 OpenAI Agents SDK 能让您快速启动。对于可扩展的生产级系统，LangGraph 和 AutoGen 则提供了更强的控制力和更精细的功能。这些框架也并非必需，许多团队选择直接基于模型提供商的 API 进行构建。本书主要聚焦于 LangGraph，因其在 agent 系统开发上采用了直接而强大的方法。通过详细的解释、实际案例和真实场景，我们将展示 LangGraph 如何有效应对现代智能 agent 所需的复杂性和动态需求。
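To make the graph-based orchestration idea concrete without depending on any framework, here is a plain-Python sketch of nodes, edges, and a cycle. It is deliberately not LangGraph's API: `run_graph`, the node functions, and the router lambdas are all invented for illustration, and the nodes are stand-ins for model calls.

```python
from typing import Callable, Optional

State = dict
Node = Callable[[State], State]
Router = Callable[[State], Optional[str]]  # returns the next node name, or None to stop


def run_graph(nodes: dict[str, Node], edges: dict[str, Router],
              entry: str, state: State, max_steps: int = 20) -> State:
    """Execute nodes by following edges until a router returns None."""
    current = entry
    for _ in range(max_steps):  # guard against cycles that never exit
        state = nodes[current](state)
        next_node = edges[current](state)
        if next_node is None:
            return state
        current = next_node
    raise RuntimeError("graph did not terminate within max_steps")


# Two illustrative nodes: one drafts, one reviews.
def draft(state: State) -> State:
    attempts = state["attempts"] + 1
    return {**state, "draft": f"draft v{attempts}", "attempts": attempts}


def review(state: State) -> State:
    return {**state, "approved": state["attempts"] >= 2}  # approve the second draft


nodes = {"draft": draft, "review": review}
edges = {
    "draft": lambda s: "review",
    "review": lambda s: None if s["approved"] else "draft",  # cycle back on rejection
}

final = run_graph(nodes, edges, entry="draft", state={"attempts": 0})
print(final)
```

A framework adds persistence, streaming, retries, and tooling on top of this basic loop, but the mental model of state flowing through nodes along conditional edges is the same.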

结语

Autonomous agents represent a transformative development in AI, capable of performing complex, dynamic tasks with a high degree of autonomy. This chapter has outlined the foundational concepts of agents, highlighted their advancements over traditional ML systems, and discussed their practical applications and limitations. As we delve deeper into the design and implementation of these systems, it becomes clear that the thoughtful integration of agents into various domains holds the potential to drive significant innovation and efficiency.

自主 agent 代表了人工智能领域的一项变革性发展，能够以高度自主性执行复杂、动态的任务。本章概述了 agent 的基本概念，强调了其相较于传统机器学习系统的进步，并讨论了其实际应用与局限性。随着我们更深入地探讨这些系统的设计与实现，我们可以清楚地看到，将 agent 审慎地整合到各个领域中，蕴含着推动重大创新和效率提升的潜力。

While the various approaches to designing autonomous agents discussed in this chapter have demonstrated significant capabilities and potential, they also highlight the complexity and challenges involved in creating effective and adaptable systems. Each method, from rule-based systems to advanced cognitive architectures, offers unique strengths but also comes with inherent limitations. In this book, I aim to bridge these gaps.

虽然本章讨论的各种设计自主 agent 的方法已展现出显著的能力和潜力，但它们也凸显了创建有效且适应性强的系统所涉及的复杂性和挑战。从基于规则的系统到高级认知架构，每种方法都有其独特的优势，但也伴随着固有的局限。在本书中，我旨在弥合这些差距。

第二章：设计 Agent 系统

Most practitioners don’t begin with a grand design document when building agent systems. They start with a messy problem, a foundation model API key, and a rough idea of what might help. This chapter is your quick start to get you up and running. We’ll cover each of the following topics in more depth through the rest of the book, and many will get their own chapter, but this chapter will give you an overview of how to design an agentic system, all grounded in a specific example of managing customer support for an ecommerce platform.

大多数从业者在构建 agent 系统时，并不会从一份宏大的设计文档开始。他们往往从一个棘手的问题、一个基础模型的 API 密钥和一个粗略的想法起步。本章将为您提供一个快速上手指南。本书后续章节会对以下每个主题进行更深入的探讨，其中许多主题会独立成章，但本章将概述如何设计一个 agentic 系统，并以一个具体示例——电商平台的客户支持管理——作为贯穿始终的实践基础。

我们的第一个 Agent 系统

Let’s start with the problem we’re solving. Every day, your customer-support team fields dozens or hundreds of emails asking to refund a broken mug, cancel an unshipped order, or change a delivery address. For each message, a human agent has to read free-form text, look up the order in your backend, call the appropriate API, and then type a confirmation email. This repetitive two-minute process is ripe for automation—but only if we carve off the right slice. When we realize that humans type keys and click buttons, often following rules and guidelines, we see that many of these same patterns can be performed by well-designed systems that rely on foundation models. We want our agent to take a raw customer message plus the order record, decide which tool to call (issue_refund, cancel_order, or update_address_for_order), invoke that tool with the correct parameters, and then send a brief confirmation message. That two-step workflow is narrow enough to build quickly, valuable enough to free up human time, and rich enough to showcase intelligent behavior. We can build a working agent for this use case in just a few lines of code:

让我们从要解决的问题说起。每天，您的客户支持团队都会收到数十甚至数百封电子邮件，要求为一个摔碎的杯子退款、取消一个尚未发货的订单，或是更改收货地址。对于每一条消息，客服人员都需要阅读自由格式的文本，在您的后台系统中查找订单，调用相应的 API，然后输入一封确认邮件。这个重复性的两分钟流程非常适合自动化——前提是我们能准确界定自动化的范围。当我们意识到，人类的操作（敲击键盘、点击按钮）通常遵循既定的规则和指南时，我们发现许多类似的模式都可以通过精心设计的、基于基础模型的系统来执行。我们希望我们的 agent 能够接收原始的客户消息和订单记录，决定调用哪个工具（issue_refund、cancel_order 或 update_address_for_order），使用正确的参数调用该工具，然后发送一条简短的确认消息。这个两步的工作流程范围足够聚焦，可以快速构建；价值足够显著，能够释放人力；同时又足够丰富，足以展现智能行为。我们只需几行代码就能为这个用例构建一个可运行的 agent：

from typing import Annotated, TypedDict

from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

# -- 1) Define our single business tool
@tool
def cancel_order(order_id: str) -> str:
    """Cancel an order that hasn't shipped."""
    # (Here you'd call your real backend API)
    return f"Order {order_id} has been cancelled."

# Graph state: the order record plus the running list of messages.
# add_messages appends each node's returned messages to the history.
class AgentState(TypedDict):
    order: dict
    messages: Annotated[list, add_messages]

# -- 2) The agent "brain": invoke LLM, run tool, then invoke LLM again
def call_model(state: AgentState):
    msgs = state["messages"]
    order = state.get("order") or {"order_id": "UNKNOWN"}

    # System prompt tells the model exactly what to do
    prompt = (
        f'''You are an ecommerce support agent.
        ORDER ID: {order['order_id']}
        If the customer asks to cancel, call cancel_order(order_id)
        and then send a simple confirmation.
        Otherwise, just respond normally.'''
    )
    full = [SystemMessage(content=prompt)] + msgs

    # Bind the tool so the model is able to request it
    llm = ChatOpenAI(model="gpt-5", temperature=0).bind_tools([cancel_order])

    # 1st LLM pass: decides whether to call our tool
    first = llm.invoke(full)
    out = [first]

    if getattr(first, "tool_calls", None):
        # run the cancel_order tool for each requested call
        for tc in first.tool_calls:
            result = cancel_order.invoke(tc["args"])
            out.append(ToolMessage(content=result, tool_call_id=tc["id"]))

        # 2nd LLM pass: generate the final confirmation text
        second = llm.invoke(full + out)
        out.append(second)

    return {"messages": out}

# -- 3) Wire it all up in a StateGraph
def construct_graph():
    g = StateGraph(AgentState)
    g.add_node("assistant", call_model)
    g.add_edge(START, "assistant")
    g.add_edge("assistant", END)
    return g.compile()

graph = construct_graph()

if __name__ == "__main__":
    example_order = {"order_id": "A12345"}
    convo = [HumanMessage(content="Please cancel my order A12345.")]
    result = graph.invoke({"order": example_order, "messages": convo})
    for msg in result["messages"]:
        print(f"{msg.type}: {msg.content}")

Great—you now have a working “cancel order” agent. Before we expand our agent, let’s reflect on why we started with such a simple slice. Scoping is always a balancing act. If you narrow your task too much—say, only cancellations—you miss out on other high-volume requests like refunds or address changes, limiting real-world impact. But if you broaden it too far—“automate every support inquiry”—you’ll drown in edge cases like billing disputes, product recommendations, and technical troubleshooting. And if you keep it vague—“improve customer satisfaction”—you’ll never know when you’ve succeeded.

很好——现在你已经有了一个可以运行的“取消订单”agent。在我们扩展这个agent的功能之前，让我们先回顾一下为什么我们从如此简单的一个切面开始。确定范围始终是一种平衡艺术。如果把任务范围定得太窄——比如只处理取消订单——你就会错过其他高频请求，如退款或地址变更，从而限制了实际影响。但如果把范围定得太宽——“自动化处理所有支持咨询”——你就会淹没在各种边缘案例中，如账单纠纷、产品推荐和技术故障排除。而如果范围定义得过于模糊——“提升客户满意度”——你将永远无法知道何时才算成功。

Instead, by focusing on a clear, bounded workflow—canceling orders—we ensure concrete inputs (customer message + order record), structured outputs (tool calls + confirmations), and a tight feedback loop. For example, imagine an email that says, “Please cancel my order #B73973 because I found a cheaper option elsewhere.” A human agent would look up the order, verify it hasn’t shipped, click “Cancel,” and reply with a confirmation. Translating this into code means invoking cancel_order(order_id="B73973") and sending a simple confirmation message back to the customer.

相反，通过专注于一个清晰、有边界的工作流——例如取消订单——我们确保了明确的输入（客户消息 + 订单记录）、结构化的输出（工具调用 + 确认信息）以及一个紧密的反馈闭环。举例来说，假设有一封邮件写道：“请取消我的订单 #B73973，因为我在别处找到了更便宜的选择。”客服人员会查找该订单，确认其尚未发货，点击“取消”按钮，然后回复确认信息。将这一过程转化为代码，就意味着调用 cancel_order(order_id="B73973") 并向客户发送一条简单的确认消息。
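
To make that translation step concrete, here is a rough sketch that maps the example email to a tool call without any model in the loop. It is only an illustration of the inputs and outputs involved: the order ID pattern, the `status` field on the order record, and the helper names `extract_order_id` and `plan_action` are all assumptions introduced here, and in the agent above the language model performs this extraction itself.

```python
import re
from typing import Optional

# Assumed ID format: one capital letter followed by five digits (e.g. B73973).
ORDER_ID_PATTERN = re.compile(r"#?([A-Z]\d{5})\b")

def extract_order_id(message: str) -> Optional[str]:
    """Return the first order ID found in the message, or None."""
    match = ORDER_ID_PATTERN.search(message)
    return match.group(1) if match else None

def plan_action(message: str, order: dict) -> dict:
    """Decide which tool call, if any, the message should map to."""
    order_id = extract_order_id(message)
    if order_id is None or order_id != order.get("order_id"):
        return {"tool": None, "reason": "order ID missing or mismatched"}
    if order.get("status") == "shipped":  # hypothetical field on the order record
        return {"tool": None, "reason": "order already shipped"}
    if "cancel" in message.lower():
        return {"tool": "cancel_order", "args": {"order_id": order_id}}
    return {"tool": None, "reason": "no cancellation requested"}

email = ("Please cancel my order #B73973 because "
         "I found a cheaper option elsewhere.")
plan = plan_action(email, {"order_id": "B73973", "status": "processing"})
print(plan)
```

The structured result (a tool name plus arguments, or a reason for doing nothing) is the same shape of output we later check when evaluating the agent.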

Now that we have a working “cancel order” agent, the next question is: does it actually work? In production, we don’t just want our agent to run—we want to know how well it performs, what it gets right, and where it fails. For our cancel order agent, we care about questions like:

Did it call the correct tool (cancel_order)?
Did it pass the right parameters (the correct order ID)?
Did it send a clear, correct confirmation message to the customer?

现在我们有了一个可以运行的“取消订单”agent，接下来的问题是：它真的能正常工作吗？在生产环境中，我们不仅希望agent能够运行——更希望了解它的表现如何、做对了什么以及在哪里出错。对于我们的取消订单agent，我们关心以下问题：

它是否调用了正确的工具（cancel_order）？
它是否传递了正确的参数（正确的订单 ID）？
它是否向客户发送了清晰、正确的确认信息？

In our open source repository, you’ll find a full evaluation script to automate this process:

在我们的开源仓库中，您可以找到一个完整的评估脚本来自动化此过程：

Here's a minimal version of that logic, showing how you might test your agent directly:

以下是该逻辑的一个极简版本，展示了如何直接测试您的 agent：

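# Minimal evaluation check
example_order = {"order_id": "B73973"}
convo = [HumanMessage(content="Please cancel order #B73973. "
                              "I found a cheaper option elsewhere.")]
result = graph.invoke({"order": example_order, "messages": convo})

# Tool calls are recorded on the AI message's tool_calls, not in its text
tool_calls = [tc for m in result["messages"]
              for tc in (getattr(m, "tool_calls", None) or [])]
assert any(tc["name"] == "cancel_order" for tc in tool_calls), \
    "Cancel order tool not called"

# Only count AI messages; match "cancel" to accept either spelling
assert any(m.type == "ai" and "cancel" in str(m.content).lower()
           for m in result["messages"]), "Confirmation message missing"

print("✅ Agent passed minimal evaluation.")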

This snippet ensures that the tool was called and the confirmation was sent. Of course, real evaluation goes deeper: you can measure tool precision, parameter accuracy, and overall task success rates across hundreds of examples to catch edge cases before deploying. We’ll dive into evaluation strategies and frameworks in depth in Chapter 9, but for now, remember: an untested agent is an untrusted agent.

这段代码确保了工具被调用且确认信息已发送。当然，实际的评估会更加深入：您可以测量工具的精确度、参数准确性以及跨数百个示例的总体任务成功率，从而在部署前发现边缘情况。我们将在第9章深入探讨评估策略与框架，但现在请记住：未经测试的 agent 是不可信的 agent。
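
As a hedged sketch of what measuring those metrics across many examples could look like, the snippet below scores any agent-like callable against a small labeled set. Everything in it is an assumption made for illustration: the `Example` and `Prediction` containers, the `evaluate` function, the metric definitions, and the keyword-based `stub_agent` that stands in for the real graph so the code runs without an API key. It is not the evaluation script from the repository.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Example:
    message: str
    expected_tool: Optional[str]  # None means no tool should be called
    expected_args: dict

@dataclass
class Prediction:
    tool: Optional[str]
    args: dict

def evaluate(agent: Callable[[str], Prediction], examples: list) -> dict:
    """Compute tool precision, tool recall, and parameter accuracy."""
    called = correct_tool = should_call = correct_args = 0
    for ex in examples:
        pred = agent(ex.message)
        if ex.expected_tool is not None:
            should_call += 1
        if pred.tool is not None:
            called += 1
            if pred.tool == ex.expected_tool:
                correct_tool += 1
                if pred.args == ex.expected_args:
                    correct_args += 1
    return {
        # Of the tool calls made, how many were the right tool?
        "tool_precision": correct_tool / called if called else 0.0,
        # Of the calls that should have been made, how many were?
        "tool_recall": correct_tool / should_call if should_call else 0.0,
        # Of the correct tool calls, how many had exactly the right arguments?
        "param_accuracy": correct_args / correct_tool if correct_tool else 0.0,
    }

def stub_agent(message: str) -> Prediction:
    """Hypothetical stand-in for the real agent: keyword match plus regex."""
    if "cancel" not in message.lower():
        return Prediction(tool=None, args={})
    match = re.search(r"[A-Z]\d{5}", message)
    order_id = match.group(0) if match else ""
    return Prediction(tool="cancel_order", args={"order_id": order_id})

examples = [
    Example("Please cancel order #B73973.", "cancel_order", {"order_id": "B73973"}),
    Example("Cancel A12345, I changed my mind.", "cancel_order", {"order_id": "A12345"}),
    Example("Where is my order C55555?", None, {}),
    Example("I never asked to cancel anything!", None, {}),
]
metrics = evaluate(stub_agent, examples)
print(metrics)
```

On this toy set the stub fires on the fourth message even though no cancellation was requested, so precision falls below 1 while recall stays at 1. That kind of gap is what a larger labeled set is meant to surface. To score the real agent, you would replace `stub_agent` with a wrapper that invokes the graph and reads the tool name and arguments from the returned messages.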

Because both steps are automated using @tool decorators, writing tests against real tickets becomes trivial—and you instantly gain measurable metrics like tool recall, parameter accuracy, and confirmation quality. Now that we’ve built and evaluated a minimal agent, let’s explore the core design decisions that will shape its capabilities and impact.

由于这两个步骤都通过 @tool 装饰器实现了自动化，针对真实工单编写测试变得轻而易举——您能立即获得可量化的指标，例如工具召回率、参数准确性和确认信息质量。现在我们已经构建并评估了一个最小化的 agent，接下来让我们探讨将塑造其能力和影响的核心设计决策。