
Chapter 10: Error Handling and Observability


[Figure: observability overview]

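10.1 The Nature of Production Issues: First Know Which Kind of Failure You Have

AI call failures fall into roughly four categories:

Authentication failures (401/403): wrong key, empty key, or insufficient permissions
Rate-limit failures (429): concurrency too high or quota exhausted
Server-side failures (5xx): provider-side instability or network problems
Application-side failures: JSON parsing errors, invalid tool arguments, timeouts, cancellations

The first step in troubleshooting is not to "tweak the prompt" but to make the system clearly distinguish these four categories and emit a metric for each.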
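The four categories above can be sketched as a small classifier keyed on the HTTP status. This is a minimal illustration, not part of any library: the names `AiFailureClassifier`, `Category`, and `classify` are hypothetical, and treating "no status or any other status" as an application-side failure is an assumption made for the sketch.

```java
/** Hypothetical classifier for the four failure categories; all names are illustrative. */
final class AiFailureClassifier {

    enum Category { AUTH, RATE_LIMIT, SERVER, APPLICATION }

    private AiFailureClassifier() {}

    /**
     * @param httpStatus status returned by the provider, or null when the call
     *                   failed without a response (parse error, timeout, cancellation)
     */
    static Category classify(Integer httpStatus) {
        if (httpStatus != null) {
            int status = httpStatus;
            if (status == 401 || status == 403) {
                return Category.AUTH;
            }
            if (status == 429) {
                return Category.RATE_LIMIT;
            }
            if (status >= 500 && status <= 599) {
                return Category.SERVER;
            }
        }
        // Assumption: anything else is attributed to the application side.
        return Category.APPLICATION;
    }
}
```

With a category in hand, the same value can serve as a metric tag and as a log field, so dashboards and logs split failures the same way.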

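10.2 Configure Retry and Backoff First

In production, retries must use backoff; otherwise immediate retries of 429/5xx responses amplify the load and can turn a brief disruption into a cascading failure. You can enable a unified retry policy in configuration: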

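spring:
  ai:
    retry:
      # Total attempts, including the first call
      max-attempts: 3
      backoff:
        # Wait before the first retry; later waits are multiplied, capped at max-interval
        initial-interval: 500ms
        multiplier: 2
        max-interval: 5s
      # Do not retry 4xx client errors by default
      on-client-errors: false
      # Never retry these codes: a retry cannot fix a bad request or a bad key
      exclude-on-http-codes: 400,401,403,404
      # Retry rate limiting and transient server errors
      on-http-codes: 429,500,502,503,504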
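To see what these numbers mean in practice, the wait schedule can be worked out by hand. The sketch below models plain exponential backoff without jitter; it is an illustration of the arithmetic, not the retry library's implementation, and `BackoffSchedule` and `delays` are hypothetical names.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of exponential backoff; not the retry library's own code. */
final class BackoffSchedule {

    private BackoffSchedule() {}

    /** Returns the waits between attempts: N attempts means at most N - 1 waits. */
    static List<Duration> delays(int maxAttempts, Duration initial, double multiplier, Duration max) {
        List<Duration> result = new ArrayList<>();
        double nextMillis = initial.toMillis();
        for (int retry = 1; retry < maxAttempts; retry++) {
            // Cap every wait at the configured maximum interval.
            long capped = Math.min((long) nextMillis, max.toMillis());
            result.add(Duration.ofMillis(capped));
            nextMillis *= multiplier;
        }
        return result;
    }
}
```

Under this model, 3 attempts with a 500 ms initial interval and a multiplier of 2 give waits of 500 ms and 1 s, so one failing call adds at most 1.5 s of waiting. Raising the attempts to 6 gives 500 ms, 1 s, 2 s, 4 s, 5 s, where the last wait is capped by the 5 s maximum.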

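10.3 Minimal Rollout: Logs + Metrics

Get "able to pinpoint the problem" right first, and only then consider distributed tracing.

Here is a minimal implementation: every AI call goes through a single Facade, which uniformly records:

requestId
provider/model (optional)
latency
success/failure
failure reason (redacted)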

10.3.1 Dependency (Actuator)

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

10.3.2 Unified Facade: Timing + Metrics + Redacted Logging

package com.example.saa.observability;

import com.example.saa.security.SensitiveDataRedactor;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.Objects;
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

/** Single entry point for AI calls: times each call, counts outcomes, logs redacted failures. */
@Service
public class AiCallFacade {

    private static final Logger log = LoggerFactory.getLogger(AiCallFacade.class);

    private final ChatClient chatClient;
    private final SensitiveDataRedactor redactor;
    private final Timer timer;
    private final Counter success;
    private final Counter failure;

    // Assumes a ChatClient bean is defined elsewhere (for example, built from ChatClient.Builder).
    public AiCallFacade(ChatClient chatClient, SensitiveDataRedactor redactor, MeterRegistry registry) {
        this.chatClient = chatClient;
        this.redactor = redactor;
        this.timer = registry.timer("ai.call.latency");
        this.success = registry.counter("ai.call.success");
        this.failure = registry.counter("ai.call.failure");
    }

    public String call(String system, String user) {
        // One ID per call, so the log line and the error seen by the caller can be correlated.
        String requestId = UUID.randomUUID().toString();
        try {
            // Timer.record also records the elapsed time when the lambda throws.
            String result = timer.record(() -> chatClient.prompt()
                    .system(Objects.toString(system, ""))
                    .user(Objects.toString(user, ""))
                    .call()
                    .content());
            success.increment();
            return result;
        } catch (Exception ex) {
            failure.increment();
            // getMessage() may be null; redact before the text reaches logs or callers.
            String safe = redactor.redact(Objects.toString(ex.getMessage(), ""));
            log.warn("AI call failed, requestId={}, reason={}", requestId, safe);
            throw new AiCallException("AI call failed, requestId=" + requestId + ", reason=" + safe, ex);
        }
    }

    public static class AiCallException extends RuntimeException {
        public AiCallException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}

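10.3.3 Unit Testing Suggestions

A Facade that calls an external model is better verified with a Spring Boot integration test (injecting a real ChatClient or a test double), with the focus on:

Error layering: whether 401/403, 429, 5xx, parsing failures, and so on map correctly to externally visible errors
Key metrics: whether the success/failure counters and the latency Timer actually receive data
Redaction: whether logs and error messages are free of phone numbers, keys, and other sensitive content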
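The redaction check is the easiest of the three to pin down in isolation. The sketch below is a stand-in, not the chapter's SensitiveDataRedactor, whose implementation is not shown here. The class name `RedactionSketch` is hypothetical, and both patterns are assumptions: an 11-digit mainland China mobile number and a key of the form `sk-...`. Adjust them to the data your system actually handles.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a redactor; the patterns are illustrative assumptions. */
final class RedactionSketch {

    // Assumed phone format: 11 digits starting with 1, not embedded in a longer number.
    private static final Pattern PHONE = Pattern.compile("(?<!\\d)1[3-9]\\d{9}(?!\\d)");

    // Assumed key format: "sk-" followed by at least 8 key characters.
    private static final Pattern API_KEY = Pattern.compile("sk-[A-Za-z0-9_-]{8,}");

    private RedactionSketch() {}

    static String redact(String text) {
        if (text == null) {
            return "";
        }
        String withoutPhones = PHONE.matcher(text).replaceAll("[PHONE]");
        return API_KEY.matcher(withoutPhones).replaceAll("[KEY]");
    }
}
```

A test can then feed a message such as `401 for key sk-abcdef123456, user 13812345678` through the redactor and assert that neither the raw key nor the raw number appears in the result.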

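10.4 Error Layering and External Responses

When returning errors to callers, aim to:

Not expose the provider's raw error stack trace
Give actionable advice (for example: retry later / reduce concurrency / check the key)
Include the requestId so you can quickly locate the call in the logs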
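One way to apply these three points is to map each failure category to a fixed, caller-safe payload. The following is a minimal sketch under stated assumptions: `PublicErrorMapper`, `PublicError`, the error codes, and the advice strings are all hypothetical, and it requires Java 17 or later for records and switch expressions.

```java
/** Hypothetical mapping from an internal failure category to a caller-safe error. */
final class PublicErrorMapper {

    enum Category { AUTH, RATE_LIMIT, SERVER, APPLICATION }

    /** What the caller sees: a stable code, an actionable hint, and the requestId for log lookup. */
    record PublicError(String code, String advice, String requestId) {}

    private PublicErrorMapper() {}

    static PublicError toPublicError(Category category, String requestId) {
        // No provider message or stack trace is copied into the response.
        return switch (category) {
            case AUTH -> new PublicError("AI_AUTH_FAILED",
                    "Check that the API key is configured and has the required permissions.", requestId);
            case RATE_LIMIT -> new PublicError("AI_RATE_LIMITED",
                    "Reduce concurrency or retry later.", requestId);
            case SERVER -> new PublicError("AI_PROVIDER_UNAVAILABLE",
                    "The model provider is temporarily unavailable; retry later.", requestId);
            case APPLICATION -> new PublicError("AI_REQUEST_FAILED",
                    "The request could not be processed; contact support with the requestId.", requestId);
        };
    }
}
```

Because the advice text is fixed per category, nothing from the provider's raw error can leak to the caller, and the requestId remains the only link back to the detailed, redacted log entry.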

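10.5 Chapter Summary

You have turned AI calls from an "uncontrollable black box" into an "observable, governable" system component:

Retry with backoff prevents cascading failures
A unified Facade keeps logs and metrics consistent
The requestId lets you quickly tie together the pieces of a production issue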

In the next chapter we move from "stable" to "cost-controlled": caching, concurrency control, prompt compression, model routing, and rate-limiting strategies.