Quick Start: 5-Minute Deployment Guide

A complete guide to quickly deploying and using AI Observability Agent.

Environment Preparation

System Requirements

Platform	Minimum	Recommended
Linux	1 CPU, 512 MB RAM	2 CPU, 1 GB RAM
macOS	1 CPU, 512 MB RAM	2 CPU, 1 GB RAM
Windows	1 CPU, 1 GB RAM	2 CPU, 2 GB RAM

Software Requirements

Rust 1.70+ (only needed when building from source)
Prometheus 2.33+ (receives the Remote Write data)
Grafana 9.0+ (optional, for visualization)

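Network Requirements

Port	Purpose	Notes
9090	HTTP service	The Agent's own HTTP service (also the Prometheus default port, so change one of them if both run on the same host)
4317	OTLP gRPC	OpenTelemetry gRPC receiver
4318	OTLP HTTP	OpenTelemetry HTTP receiver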
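
Before starting the Agent it can help to confirm that none of the three ports in the table already has a listener. The sketch below assumes a Linux host where ss (from iproute2) is available; busy_ports is a hypothetical helper name and not part of the Agent.

```shell
# Reads `ss -ltn` output on stdin and prints each required port
# (9090, 4317, 4318) that already has a listening socket.
busy_ports() {
  awk -v want="9090 4317 4318" '
    BEGIN { n = split(want, w, " "); for (i = 1; i <= n; i++) need[w[i]] = 1 }
    $1 == "LISTEN" {
      port = $4
      sub(/^.*:/, "", port)   # keep only the text after the last colon
      if ((port in need) && !(port in seen)) { seen[port] = 1; print port }
    }
  '
}
```

Usage: run ss -ltn | busy_ports; empty output means all three ports are free.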

Quick Deployment

1. Binary Deployment

Step 1: Download the binary

Download the binary for your platform from the GitHub Releases page:

# Linux (x86_64)
wget https://github.com/username/prom-agent/releases/download/v0.1.0/prom-agent-linux-amd64

# macOS (x86_64)
wget https://github.com/username/prom-agent/releases/download/v0.1.0/prom-agent-darwin-amd64

# macOS (ARM64)
wget https://github.com/username/prom-agent/releases/download/v0.1.0/prom-agent-darwin-arm64

# Windows (PowerShell)
Invoke-WebRequest -Uri https://github.com/username/prom-agent/releases/download/v0.1.0/prom-agent-windows-amd64.exe -OutFile prom-agent-windows-amd64.exe
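
To avoid picking the wrong file by hand, the platform can be mapped to an asset name in a script. This is a minimal sketch that covers only the Linux and macOS assets listed above; asset_name is a hypothetical helper, and the inputs are assumed to be the values printed by uname -s and uname -m.

```shell
# Map an OS name and CPU architecture to the release asset name.
# Unknown combinations print an error and return non-zero.
asset_name() {
  os="$1"     # e.g. the output of `uname -s`
  arch="$2"   # e.g. the output of `uname -m`
  case "$os/$arch" in
    Linux/x86_64)  echo "prom-agent-linux-amd64" ;;
    Darwin/x86_64) echo "prom-agent-darwin-amd64" ;;
    Darwin/arm64)  echo "prom-agent-darwin-arm64" ;;
    *) echo "unsupported platform: $os/$arch" >&2; return 1 ;;
  esac
}
```

Usage: asset_name "$(uname -s)" "$(uname -m)" prints the file name to append to the release URL.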

Step 2: Make the binary executable

# Linux/macOS
chmod +x prom-agent-*

# Windows
# Nothing to do; .exe files are executable by default

Step 3: Create the configuration file

mkdir -p config
cat > config/agent_config.yaml << 'EOF'
agent:
  log_level: info
  listen_address: 0.0.0.0:9090

system_collector:
  enabled: true

otlp:
  enabled: true
  grpc_endpoint: 0.0.0.0:4317
  http_endpoint: 0.0.0.0:4318

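remote_write:
  # The Agent itself listens on 9090 (see listen_address), so a Prometheus on the same host must use another port
  endpoint: http://localhost:9090/api/v1/write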
EOF

Step 4: Start the service

# Linux (on macOS, run the prom-agent-darwin-* binary you downloaded instead)
./prom-agent-linux-amd64 config/agent_config.yaml

# Windows
prom-agent-windows-amd64.exe config/agent_config.yaml

2. Docker Deployment

Step 1: Build the image

git clone https://gitee.com/hongbin1/prom-agent.git
cd prom-agent
docker build -t prom-agent:latest .

Step 2: Run the container

docker run -d \
  --name prom-agent \
  --restart=always \
  -p 9090:9090 \
  -p 4317:4317 \
  -p 4318:4318 \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v "$(pwd)/config:/etc/prom-agent" \
  prom-agent:latest \
  /etc/prom-agent/agent_config.yaml

Step 3: View the logs

docker logs -f prom-agent

3. Kubernetes Deployment

Step 1: Create the configuration file

# prom-agent-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prom-agent-config
  namespace: monitoring
data:
  agent_config.yaml: |
    agent:
      log_level: info
      listen_address: 0.0.0.0:9090
    
    system_collector:
      enabled: true
      container_mode: auto
    
    otlp:
      enabled: true
      grpc_endpoint: 0.0.0.0:4317
      http_endpoint: 0.0.0.0:4318
    
    remote_write:
      endpoints:
        - name: primary
          endpoint: http://prometheus.monitoring:9090/api/v1/write
          priority: 1
        - name: backup
          endpoint: http://prometheus-backup.monitoring:9090/api/v1/write
          priority: 2
      failover:
        enabled: true
      queue_config:
        capacity: 10000
        max_samples_per_send: 1000
        batch_send_deadline_secs: 5
        max_retries: 3

Step 2: Create the DaemonSet

# prom-agent-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prom-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prom-agent
  template:
    metadata:
      labels:
        app: prom-agent
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: prom-agent
          image: prom-agent:latest
          args:
            - /etc/prom-agent/agent_config.yaml
          ports:
            - containerPort: 9090
              name: http
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: config
              mountPath: /etc/prom-agent
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: config
          configMap:
            name: prom-agent-config

Step 3: Deploy to Kubernetes

# Create the namespace
kubectl create namespace monitoring

# Apply the configuration
kubectl apply -f prom-agent-config.yaml

# Deploy the DaemonSet
kubectl apply -f prom-agent-daemonset.yaml

# Check the status
kubectl get pods -n monitoring

Basic Configuration

Minimal Configuration

# config/agent_config.yaml
agent:
  log_level: info
  listen_address: 0.0.0.0:9090

system_collector:
  enabled: true

remote_write:
  endpoint: http://your-prometheus:9090/api/v1/write

Complete Configuration Example

# config/agent_config.yaml
agent:
  log_level: info
  listen_address: 0.0.0.0:9090
  metrics_path: /metrics

system_collector:
  enabled: true
  collectors:
    cpu: true
    memory: true
    disk: true
    filesystem: true
    network: true
    load: true
  container_mode: auto

service_scrapers:
  - id: webapp
    url: http://localhost:8080/metrics
    interval_secs: 30
    timeout_secs: 5
    labels:
      service: webapp
      environment: production

otlp:
  enabled: true
  grpc_endpoint: 0.0.0.0:4317
  http_endpoint: 0.0.0.0:4318
  prefix: ai
  labels:
    source: otlp_receiver

ai_collectors:
  enabled: true
  
  openai:
    - name: openai_usage
      enabled: true
      api_key: ${OPENAI_API_KEY}
      scrape_interval_secs: 3600
      labels:
        source: openai
  
  litellm:
    - name: litellm_proxy
      enabled: true
      endpoint: http://localhost:4000
      scrape_interval_secs: 60
      labels:
        source: litellm

cost_tracking:
  enabled: true
  
  budget:
    enabled: true
    daily_limit_usd: 100
    monthly_limit_usd: 2000
    alert_threshold_percent: 80

quality_monitoring:
  enabled: true
  
  rules:
    - name: high_latency
      type: response_time
      threshold: 5000
      severity: warning
      weight: 1.0
    
    - name: token_inefficiency
      type: token_efficiency
      max_value: 10
      severity: info
      weight: 0.5

plugins:
  - name: http_metrics
    type: http
    enabled: true
    interval_secs: 30
    config:
      url: http://localhost:8080/metrics
    labels:
      source: http_plugin

remote_write:
  endpoints:
    - name: primary
      endpoint: http://prometheus-1:9090/api/v1/write
      priority: 1
    - name: backup
      endpoint: http://prometheus-2:9090/api/v1/write
      priority: 2
  failover:
    enabled: true
  queue_config:
    capacity: 10000
    max_shards: 5
    max_samples_per_send: 1000
    batch_send_deadline_secs: 5
    max_retries: 3
    min_backoff_secs: 3
    max_backoff_secs: 10
  persistence:
    enabled: true
    data_dir: ./data/persistence
    max_file_size_mb: 100
    retention_hours: 24
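
The complete example reads the OpenAI key from ${OPENAI_API_KEY}. Assuming the Agent substitutes environment variables when it loads the file, a start script could check that the variable is set before launching. The helper name require_env is hypothetical.

```shell
# Return non-zero and name the first variable that is unset or empty.
# Variable names are passed as arguments and must be trusted input,
# because they are expanded with eval.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}
```

Usage: require_env OPENAI_API_KEY && ./prom-agent-linux-amd64 config/agent_config.yaml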

Verifying the Installation

1. Health Check

# Check that the Agent is running
curl http://localhost:9090/health

# Response
{
  "status": "ok",
  "message": "Prometheus Agent is healthy"
}
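
Right after startup the endpoint may not answer yet, so a script can poll it a few times. The following is a minimal sketch rather than an official tool: wait_healthy is a hypothetical helper, and the attempt count and delay defaults are arbitrary.

```shell
# Poll a health URL until curl succeeds (HTTP 2xx) or the attempts run out.
# Arguments: URL, number of attempts (default 10), delay in seconds (default 1).
wait_healthy() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-1}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: silent, -S: still print errors
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    # Wait between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts: $url" >&2
  return 1
}
```

Usage: wait_healthy http://localhost:9090/health 30 2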

2. View the Configuration

# Fetch the current configuration
curl http://localhost:9090/api/v1/config | jq

# Show endpoint status
curl http://localhost:9090/api/v1/endpoints | jq

3. View Metrics

# Fetch the Agent's own metrics
curl http://localhost:9090/metrics

# Sample output
# HELP agent_uptime_seconds Agent uptime in seconds
# TYPE agent_uptime_seconds gauge
agent_uptime_seconds 3600

# HELP agent_samples_collected_total Total samples collected
# TYPE agent_samples_collected_total counter
agent_samples_collected_total 125000

# HELP agent_samples_sent_total Total samples sent via remote write
# TYPE agent_samples_sent_total counter
agent_samples_sent_total 125000
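
The two counters in the sample output can be compared to get a rough idea of how many samples have been collected but not yet sent. This sketch assumes both metrics are exposed without labels, as in the sample; unsent_samples is a hypothetical helper name.

```shell
# Reads Prometheus text-format metrics on stdin and prints
# agent_samples_collected_total minus agent_samples_sent_total.
unsent_samples() {
  awk '
    $1 == "agent_samples_collected_total" { collected = $2 }
    $1 == "agent_samples_sent_total"      { sent = $2 }
    END { printf "%d\n", collected - sent }
  '
}
```

Usage: curl -s http://localhost:9090/metrics | unsent_samples. A value that keeps growing suggests that Remote Write is falling behind.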

4. Verify Data Delivery

Step 1: Start Prometheus

# Make sure the Remote Write receiver is enabled in Prometheus
prometheus --config.file=prometheus.yml --web.enable-remote-write-receiver

Step 2: Query Prometheus

Open the Prometheus UI (http://localhost:9090 by default) and run the following queries:

# System metrics
node_cpu_seconds_total
node_memory_MemTotal_bytes

# Agent metrics
agent_samples_collected_total
agent_samples_sent_total

# AI metrics (if AI collectors are configured)
ai_requests_total
ai_tokens_input_total
ai_cost_usd_total

5. Test OTLP Ingestion

Step 1: Configure Claude Code

export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Start Claude Code
claude

Step 2: Check the Agent logs

# Look for OTLP receive entries
# You should see a line similar to this
INFO otlp metrics received from=127.0.0.1 count=15

Step 3: Query Claude Code metrics

# Claude Code session count
sum(ai_claude_code_session_count_total)

# Claude Code token usage
sum(ai_claude_code_token_usage_tokens_total)

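Common Issues

1. Agent Fails to Start

Symptom: the Agent reports an error at startup

Troubleshooting steps:

Check that the configuration file is valid YAML
Read the error messages in the log
Verify that ports 9090, 4317, and 4318 are not already in use
Check file and directory permissions

Solutions:

Fix the configuration file
Free the occupied port
Run with the correct permissions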

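2. Data Delivery Fails

Symptom: the Agent runs normally, but no data appears in Prometheus

Troubleshooting steps:

Check the Remote Write configuration
Verify that Prometheus was started with the Remote Write receiver enabled
Read the error messages in the Agent log
Check network connectivity

Solutions:

Fix the Remote Write configuration
Enable the Prometheus Remote Write receiver
Resolve the network connectivity problem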
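
One quick way to tell these causes apart is to send an empty POST to the Remote Write URL and look at the HTTP status. The mapping below is a heuristic sketch, not documented Agent behavior: it assumes Prometheus answers 404 on /api/v1/write when the receiver flag is missing and rejects the empty test body with another 4xx when the receiver is active. Both function names are hypothetical.

```shell
# Translate an HTTP status code from the probe into a short diagnosis.
classify_write_status() {
  case "$1" in
    000)     echo "unreachable" ;;                      # curl prints 000 when no HTTP response arrived
    404)     echo "receiver-disabled-or-wrong-path" ;;
    2??|4??) echo "receiver-reachable" ;;               # endpoint answered; the empty body may be rejected
    *)       echo "unexpected" ;;
  esac
}

# Send an empty POST to the given URL and classify the result.
probe_write_endpoint() {
  code="$(curl -s -o /dev/null -w '%{http_code}' -X POST "$1")" || true
  classify_write_status "$code"
}
```

Usage: probe_write_endpoint http://localhost:9090/api/v1/write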

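3. OTLP Ingestion Fails

Symptom: Claude Code cannot connect to the Agent

Troubleshooting steps:

Check the OTLP configuration
Verify that ports 4317 and 4318 are open
Read the Agent log
Check network connectivity

Solutions:

Fix the OTLP configuration
Open the required ports
Resolve the network connectivity problem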

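4. High Memory Usage

Symptom: the Agent's memory usage keeps growing

Troubleshooting steps:

Check the buffer configuration (for example queue_config.capacity)
Review the collector configuration
Analyze where the memory is being used

Solutions:

Reduce the buffer size
Disable collectors you do not need
Tune the remaining configuration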

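Next Steps

1. Configure the Grafana Dashboard

Step 1: Import the dashboard

Open the Grafana UI
Go to Dashboards → Import
Click "Upload JSON file"
Select dashboards/ai-observability.json
Choose the Prometheus data source
Click Import

Step 2: View the dashboards

AI Observability Dashboard: cost, token usage, request latency, and more
Claude Code Dashboard: sessions, code generation, PR statistics, and more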

2. Configure Alerts

Step 1: Set up budget alerts

cost_tracking:
  alerts:
    enabled: true
    webhook_url: https://example.com/webhook

Step 2: Configure Prometheus alerts

# prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

rule_files:
  - "alerts.yml"

# alerts.yml
groups:
- name: ai_observability
  rules:
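  - alert: AICostHigh
    # Assuming ai_cost_usd_total is a counter, alert on its 24-hour increase rather than the lifetime total
    expr: sum(increase(ai_cost_usd_total[1d])) > 100
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "AI cost is high"
      description: "AI cost over the last 24 hours exceeds 100 USD"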

  - alert: AILatencyHigh
    expr: histogram_quantile(0.95, sum by (le) (rate(ai_request_latency_seconds_bucket[5m]))) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "AI response latency is high"
      description: "P95 AI response latency exceeds 5 seconds"

3. Integrate AI Tools

Claude Code:

export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

OpenAI:

ai_collectors:
  openai:
    - name: openai_usage
      api_key: ${OPENAI_API_KEY}

LiteLLM:

ai_collectors:
  litellm:
    - name: litellm_proxy
      endpoint: http://localhost:4000

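4. Tune the Configuration

Performance:

Raise max_shards to send more batches concurrently
Adjust batch_send_deadline_secs to balance latency against throughput
Set capacity to a value that avoids excessive memory use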
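
As a rough way to reason about capacity, the sketch below estimates how long the queue can absorb samples while nothing can be sent. It assumes capacity counts samples for the whole queue (it may instead be per shard) and that the collection rate is constant; seconds_until_full and the 500 samples per second figure are illustrative only.

```shell
# Estimate the seconds until a send queue fills when no samples can be sent.
seconds_until_full() {
  capacity="$1"   # queue_config.capacity, assumed to be a total sample count
  rate="$2"       # samples collected per second (hypothetical figure)
  if [ "$rate" -le 0 ]; then
    echo "rate must be positive" >&2
    return 1
  fi
  echo $((capacity / rate))
}
```

Usage: seconds_until_full 10000 500 prints 20, meaning a capacity of 10000 covers about 20 seconds of outage at that rate.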

Reliability:

Configure multiple endpoints with failover
Enable local persistence
Choose a sensible retry policy
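
To judge a retry policy, it helps to know how long one batch can wait in total before it is given up. The sketch below assumes the wait doubles after each retry, starting at min_backoff_secs and capped at max_backoff_secs; the Agent's actual backoff algorithm is not documented here, so treat the doubling as an assumption. total_backoff_secs is a hypothetical helper.

```shell
# Sum the waits across all retries, assuming the wait doubles each time
# and never exceeds the maximum.
total_backoff_secs() {
  delay="$1"     # min_backoff_secs
  max="$2"       # max_backoff_secs
  retries="$3"   # max_retries
  total=0
  n=0
  while [ "$n" -lt "$retries" ]; do
    total=$((total + delay))
    delay=$((delay * 2))
    if [ "$delay" -gt "$max" ]; then
      delay="$max"
    fi
    n=$((n + 1))
  done
  echo "$total"
}
```

Usage: total_backoff_secs 3 10 3 prints 19 (3 + 6 + 10) for the values in the complete configuration example.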

Cost:

Set budget limits
Configure alert thresholds
Use the cost reports to optimize usage

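Summary

AI Observability Agent is an AI monitoring solution designed for quick deployment:

Quick deployment: binary, Docker, and Kubernetes options
Simple configuration: the minimal configuration is only a few lines of YAML
Broad feature set: system monitoring, AI monitoring, cost tracking, and quality evaluation
Easy integration: works with Claude Code, OpenAI, LiteLLM, and others
Visualization: ready-made Grafana dashboards

With this guide you can deploy and configure the Agent in about 5 minutes and start monitoring your AI services.

Tip: If you run into problems, see the Common Issues section above or open an issue.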
