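Plugin System: Flexible Extension of Collection Capabilities

A closer look at the AI Observability Agent plugin system and how plugins extend what the Agent can collect.

Plugin Architecture Design

The AI Observability Agent uses a plugin architecture that lets users add new collection sources without changing the Agent itself:

Core Design

Plugin interface: every plugin implements the same interface
Dynamic loading: plugins are loaded at runtime
Concurrent execution: plugins run concurrently and do not block one another
Error isolation: a failure in one plugin does not affect the others
Lifecycle management: plugins can be started, stopped, and restarted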
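
The concurrent-execution and error-isolation points can be illustrated with a minimal Python sketch. It is not the Agent's scheduler (which is written in Rust); `run_plugins`, `ok_plugin`, and `failing_plugin` are hypothetical names used only to show the idea that one failing collector should not prevent the others from returning samples.

```python
import asyncio

async def run_plugins(plugins):
    """Run every plugin coroutine concurrently and isolate failures per plugin.

    `plugins` maps a plugin name to a zero-argument coroutine function
    (a stand-in for the real collect() method).
    """
    names = list(plugins)
    # return_exceptions=True stops one failing coroutine from aborting the rest
    results = await asyncio.gather(
        *(plugins[name]() for name in names), return_exceptions=True
    )
    samples, errors = {}, {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            errors[name] = repr(result)
        else:
            samples[name] = result
    return samples, errors

async def ok_plugin():
    await asyncio.sleep(0)  # pretend to do some I/O
    return [("up", 1.0)]

async def failing_plugin():
    raise RuntimeError("endpoint unreachable")

if __name__ == "__main__":
    samples, errors = asyncio.run(
        run_plugins({"ok": ok_plugin, "bad": failing_plugin})
    )
    print("samples:", samples)
    print("errors:", errors)
```

The failing plugin shows up only in `errors`; the samples from the healthy plugin are still returned.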

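Plugin Types

Plugin type	Data source	Typical use
HTTP plugin	HTTP endpoint	Fetch metrics from an HTTP service
Exec plugin	Command output	Run a command and collect its metrics
Script plugin	Script output	Run a script and collect its metrics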

Plugin Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│                      Plugin System                      │
├─────────────────────────────────────────────────────────┤
│ ┌───────────────┐  ┌───────────────┐  ┌───────────────┐ │
│ │  HTTP plugin  │  │  Exec plugin  │  │ Script plugin │ │
│ └───────┬───────┘  └───────┬───────┘  └───────┬───────┘ │
│         └──────────────────┼──────────────────┘         │
│                            │                            │
│ ┌──────────────────────────┴──────────────────────────┐ │
│ │                   Plugin Manager                    │ │
│ │  - Plugin lifecycle management                      │ │
│ │  - Plugin configuration management                  │ │
│ │  - Plugin execution scheduling                      │ │
│ └──────────────────────────┬──────────────────────────┘ │
│                            │                            │
│ ┌──────────────────────────┴──────────────────────────┐ │
│ │                  Metrics Pipeline                   │ │
│ │  - Metric parsing                                   │ │
│ │  - Label processing                                 │ │
│ │  - Data push                                        │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

HTTP Plugin

Function

The HTTP plugin fetches Prometheus-format metrics from an HTTP endpoint.

Configuration Example

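plugins:
  - name: http_metrics          # Plugin name
    type: http                  # Plugin type
    enabled: true               # Whether the plugin is enabled
    interval_secs: 30           # Collection interval, in seconds
    timeout_secs: 10            # Request timeout, in seconds
    config:
      url: http://localhost:8080/metrics  # HTTP endpoint URL
      method: GET               # HTTP method
      headers:                  # Custom request headers
        Authorization: Bearer token123  # Placeholder; do not hardcode real credentials
    labels:                     # Extra labels
      source: http_plugin
      service: webapp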
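
To make the `config` keys concrete, the following Python sketch shows the HTTP request they describe. It is only an illustration built on the standard library, not the plugin's implementation; `build_request` and `fetch_metrics` are hypothetical helper names, and defaulting `method` to GET when it is omitted is an assumption.

```python
import urllib.request

def build_request(config):
    """Turn a plugin `config` mapping into a urllib Request (not yet sent)."""
    return urllib.request.Request(
        config["url"],
        method=config.get("method", "GET"),   # assumption: GET when omitted
        headers=config.get("headers", {}),
    )

def fetch_metrics(config, timeout_secs=10):
    """Send the request and return the response body as text."""
    request = build_request(config)
    with urllib.request.urlopen(request, timeout=timeout_secs) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    config = {
        "url": "http://localhost:8080/metrics",
        "method": "GET",
        "headers": {"Authorization": "Bearer token123"},
    }
    request = build_request(config)
    print(request.get_method(), request.full_url)
```

`fetch_metrics` needs a service listening at the configured URL; `build_request` can be inspected without any network access.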

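Use Cases

Third-party services: fetch metrics from a third-party service's /metrics endpoint
Microservices: monitor the runtime state of internal microservices
Cloud services: fetch metrics from a cloud service API
Custom applications: fetch data from a custom application's metrics endpoint

Data Format

The HTTP plugin accepts the standard Prometheus text exposition format: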

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027 1395066363000
http_requests_total{method="POST",status="200"} 4 1395066363000
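
As a rough illustration of how a sample line in this format breaks down into name, labels, value, and optional timestamp, here is a small Python sketch. It is a simplified parser written for this page, not the Agent's `parse_prometheus_text`; it does not unescape label values and ignores `# HELP` / `# TYPE` metadata.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
    r'(?:\s+(?P<ts>-?\d+))?\s*$'             # optional timestamp in milliseconds
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def parse_line(line):
    """Parse one exposition line.

    Returns (name, labels, value, timestamp), or None for blank lines and
    comment lines. Raises ValueError for anything else that does not parse.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a valid sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    timestamp = int(match.group("ts")) if match.group("ts") else None
    return match.group("name"), labels, float(match.group("value")), timestamp

if __name__ == "__main__":
    print(parse_line('http_requests_total{method="GET",status="200"} 1027 1395066363000'))
```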

Exec Plugin

Function

The Exec plugin runs a command and parses its output into metrics.

Configuration Example

plugins:
  - name: system_check
    type: exec
    enabled: true
    interval_secs: 60
    timeout_secs: 5
    config:
      command: /usr/local/bin/check_system  # Command to run
      args:                               # Command arguments
        - --verbose
        - --format=prometheus
    labels:
      source: exec_plugin
      check: system

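Use Cases

System state: run system commands to collect status information
Hardware: run hardware check commands
Custom metrics: run custom check commands
External systems: run commands that interact with external systems

Output Formats

The Exec plugin supports the following output formats:

Prometheus format: the standard Prometheus text exposition format
JSON format: structured JSON
Custom format: converted to Prometheus format by a script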
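
For the custom-format case, a conversion step might look like the Python sketch below. The JSON shape the Agent itself accepts is not specified on this page, so the flat object of numeric fields used here, the `exec` prefix, and the function name `json_to_prometheus` are all assumptions made for illustration.

```python
import json

def json_to_prometheus(text, prefix="exec"):
    """Convert a flat JSON object of numeric fields into exposition lines.

    Non-numeric fields (strings, booleans, nested objects) are skipped.
    """
    data = json.loads(text)
    lines = []
    for key in sorted(data):
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"{prefix}_{key} {value}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(json_to_prometheus('{"cpu_temp": 61.5, "fans": 3, "host": "node-a"}'))
```

A wrapper script could pipe the command's JSON through a function like this and print the result, so the plugin only ever sees Prometheus text.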

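Script Plugin

Function

The Script plugin runs a script with the configured interpreter and parses its output into metrics.

Configuration Example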

plugins:
  - name: custom_metrics
    type: script
    enabled: true
    interval_secs: 60
    timeout_secs: 10
    config:
      interpreter: python3      # Interpreter
      script: ./scripts/metrics.py  # Script path
      args:                    # Script arguments
        - --env=production
    labels:
      source: script_plugin
      app: custom

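Use Cases

Complex data processing: metrics that need non-trivial logic
Multi-source aggregation: combining data from several sources
Custom calculations: metrics derived from custom computations
Format conversion: data in a special format that must be converted

Script Examples

Python script example: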

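#!/usr/bin/env python3
"""Custom metrics script"""
import time
import random

# Emit metrics in Prometheus text format
print("# HELP custom_metric_example Example custom metric")
print("# TYPE custom_metric_example gauge")
print(f'custom_metric_example{{label="value"}} {random.random() * 100}')
# Counters must never decrease; seconds since the epoch is a stand-in value
print("# HELP custom_metric_counter Example custom counter")
print("# TYPE custom_metric_counter counter")
print(f'custom_metric_counter{{status="success"}} {int(time.time())}')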

Bash script example:

#!/bin/bash
# Custom metrics script: reports disk usage per mount point

# Emit metrics in Prometheus text format
echo "# HELP disk_usage_percent Disk usage percentage"
echo "# TYPE disk_usage_percent gauge"
# -P prints one line per filesystem; NR > 1 skips the header row
df -P | awk 'NR > 1 && $1 != "tmpfs" && $1 != "devtmpfs" {
  usage = $5
  sub(/%/, "", usage)   # strip the trailing percent sign
  printf "disk_usage_percent{device=\"%s\",mountpoint=\"%s\"} %s\n", $1, $6, usage
}'

Plugin API

1. List All Plugins

Endpoint: GET /api/v1/plugins

Example response:

{
  "success": true,
  "data": {
    "plugins": [
      {
        "name": "http_metrics",
        "type": "http",
        "enabled": true,
        "status": "running",
        "last_run": "2024-04-11T10:00:00Z",
        "next_run": "2024-04-11T10:00:30Z"
      },
      {
        "name": "system_check",
        "type": "exec",
        "enabled": true,
        "status": "running",
        "last_run": "2024-04-11T10:00:00Z",
        "next_run": "2024-04-11T10:01:00Z"
      }
    ]
  }
}

2. Enable/Disable a Plugin

Endpoints:

POST /api/v1/plugins/{name}/enable
POST /api/v1/plugins/{name}/disable

Example response:

{
  "success": true,
  "data": {
    "message": "Plugin http_metrics enabled"
  }
}

3. Restart a Plugin

Endpoint: POST /api/v1/plugins/{name}/restart

Example response:

{
  "success": true,
  "data": {
    "message": "Plugin http_metrics restarted"
  }
}

4. Get Plugin Details

Endpoint: GET /api/v1/plugins/{name}

Example response:

{
  "success": true,
  "data": {
    "name": "http_metrics",
    "type": "http",
    "enabled": true,
    "config": {
      "url": "http://localhost:8080/metrics",
      "method": "GET"
    },
    "labels": {
      "source": "http_plugin",
      "service": "webapp"
    },
    "last_run": "2024-04-11T10:00:00Z",
    "next_run": "2024-04-11T10:00:30Z",
    "metrics_collected": 15
  }
}
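
The four endpoints above can be driven from a short Python client such as the sketch below. The base URL `http://localhost:9000` is an assumption (this page does not state the Agent's listen address), and `plugin_request` and `call` are hypothetical helper names. The `success` / `data` envelope follows the example responses shown above.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9000"  # assumption: replace with the Agent's address
ACTIONS = ("enable", "disable", "restart")

def plugin_request(action=None, name=None, base_url=BASE_URL):
    """Build the request for one of the plugin API endpoints.

    plugin_request()                  -> GET  /api/v1/plugins
    plugin_request(name="x")          -> GET  /api/v1/plugins/x
    plugin_request("restart", "x")    -> POST /api/v1/plugins/x/restart
    """
    path = "/api/v1/plugins"
    if action is not None:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        if name is None:
            raise ValueError("an action requires a plugin name")
    if name is not None:
        path += "/" + urllib.parse.quote(name, safe="")
    if action is not None:
        path += "/" + action
    method = "POST" if action is not None else "GET"
    return urllib.request.Request(base_url + path, method=method)

def call(request, timeout=10):
    """Send the request and return the `data` field of a successful response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    if not body.get("success"):
        raise RuntimeError(f"API call failed: {body}")
    return body["data"]

if __name__ == "__main__":
    request = plugin_request("restart", "http_metrics")
    print(request.get_method(), request.full_url)
```

`call` requires a running Agent; `plugin_request` only builds the request, which makes it easy to check the URL and method before sending anything.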

Plugin Development Guide

1. Plugin Interface

The core interface of the plugin system:

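use async_trait::async_trait;

// `Sample` and `PluginError` are Agent types whose definitions are not shown here.
#[async_trait]
pub trait Plugin: Send + Sync {
    /// Unique plugin name, as set in the configuration.
    fn name(&self) -> &str;
    /// Whether the plugin is currently enabled.
    fn enabled(&self) -> bool;
    fn set_enabled(&mut self, enabled: bool);
    /// Run one collection cycle and return the collected samples.
    async fn collect(&self) -> Result<Vec<Sample>, PluginError>;
}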

2. Developing an HTTP Plugin

Steps:

Implement the Plugin trait
Send the HTTP request
Parse the response body
Convert the result to Sample values

Example:

use std::collections::HashMap;
use async_trait::async_trait;

pub struct HttpPlugin {
    name: String,
    enabled: bool,
    url: String,
    method: String,
    headers: HashMap<String, String>,
    client: reqwest::Client,
}

#[async_trait]
impl Plugin for HttpPlugin {
    fn name(&self) -> &str { &self.name }
    fn enabled(&self) -> bool { self.enabled }
    fn set_enabled(&mut self, enabled: bool) { self.enabled = enabled; }
    
    async fn collect(&self) -> Result<Vec<Sample>, PluginError> {
        // Build the HTTP request
        let mut request = self.client.request(
            self.method.parse()?,
            &self.url
        );

        // Attach custom headers
        for (key, value) in &self.headers {
            request = request.header(key, value);
        }

        // Send the request and read the response body
        let response = request.send().await?;
        let body = response.text().await?;

        // Parse the Prometheus text format
        let samples = parse_prometheus_text(&body)?;
        Ok(samples)
    }
}

3. Developing an Exec Plugin

Steps:

Implement the Plugin trait
Run the command
Parse the command output
Convert the result to Sample values

Example:

pub struct ExecPlugin {
    name: String,
    enabled: bool,
    command: String,
    args: Vec<String>,
}

#[async_trait]
impl Plugin for ExecPlugin {
    fn name(&self) -> &str { &self.name }
    fn enabled(&self) -> bool { self.enabled }
    fn set_enabled(&mut self, enabled: bool) { self.enabled = enabled; }
    
    async fn collect(&self) -> Result<Vec<Sample>, PluginError> {
        // Run the command (this excerpt applies no timeout of its own)
        let output = tokio::process::Command::new(&self.command)
            .args(&self.args)
            .output()
            .await?;
        
        if !output.status.success() {
            return Err(PluginError::CommandFailed(output.stderr));
        }
        
        // Parse stdout as Prometheus text
        let stdout = String::from_utf8_lossy(&output.stdout);
        let samples = parse_prometheus_text(&stdout)?;
        Ok(samples)
    }
}

4. Developing a Script Plugin

Steps:

Implement the Plugin trait
Run the script
Parse the script output
Convert the result to Sample values

Example:

pub struct ScriptPlugin {
    name: String,
    enabled: bool,
    interpreter: String,
    script: String,
    args: Vec<String>,
}

#[async_trait]
impl Plugin for ScriptPlugin {
    fn name(&self) -> &str { &self.name }
    fn enabled(&self) -> bool { self.enabled }
    fn set_enabled(&mut self, enabled: bool) { self.enabled = enabled; }
    
    async fn collect(&self) -> Result<Vec<Sample>, PluginError> {
        // Run the script through the configured interpreter
        let output = tokio::process::Command::new(&self.interpreter)
            .arg(&self.script)
            .args(&self.args)
            .output()
            .await?;
        
        if !output.status.success() {
            return Err(PluginError::ScriptFailed(output.stderr));
        }
        
        // Parse stdout as Prometheus text
        let stdout = String::from_utf8_lossy(&output.stdout);
        let samples = parse_prometheus_text(&stdout)?;
        Ok(samples)
    }
}

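Best Practices

1. Configuration Best Practices

Naming:

Give each plugin a unique, meaningful name
Use lowercase letters and underscores
Avoid special characters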
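
These naming rules are easy to check mechanically before a configuration is deployed. The Python sketch below is one possible check, not part of the Agent; `check_plugin_names` is a hypothetical helper, and allowing digits after the first character is an assumption that goes slightly beyond the guideline as written.

```python
import re

# Assumption: digits are permitted after the first character. The guideline
# itself only mentions lowercase letters and underscores.
NAME_RE = re.compile(r"[a-z][a-z0-9_]*")

def check_plugin_names(names):
    """Return (invalid, duplicates) for a list of configured plugin names."""
    invalid = [name for name in names if not NAME_RE.fullmatch(name)]
    seen, duplicates = set(), []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return invalid, duplicates

if __name__ == "__main__":
    print(check_plugin_names(["http_metrics", "System-Check", "http_metrics"]))
```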

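Collection interval:

High frequency: 10-30 seconds
Medium frequency: 1-5 minutes
Low frequency: 5-15 minutes

Timeouts:

HTTP plugin: 5-10 seconds
Exec plugin: 3-5 seconds
Script plugin: 5-10 seconds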
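
To show what a timeout means for a command-based plugin, here is a minimal Python sketch that runs a command and gives up after `timeout_secs`. It illustrates the idea with the standard library only and does not describe how the Agent enforces `timeout_secs` internally; `run_with_timeout` is a hypothetical helper.

```python
import subprocess
import sys

def run_with_timeout(command, args, timeout_secs):
    """Run a command and return (stdout, error).

    Exactly one of the two values is None. The child process is killed
    if it runs longer than `timeout_secs`.
    """
    try:
        proc = subprocess.run(
            [command, *args],
            capture_output=True,
            text=True,
            timeout=timeout_secs,
        )
    except subprocess.TimeoutExpired:
        return None, f"timed out after {timeout_secs}s"
    if proc.returncode != 0:
        return None, proc.stderr.strip() or f"exit code {proc.returncode}"
    return proc.stdout, None

if __name__ == "__main__":
    # Use the current Python interpreter as a portable stand-in for a check command
    print(run_with_timeout(sys.executable, ["-c", "print('up 1')"], 5))
    print(run_with_timeout(sys.executable, ["-c", "import time; time.sleep(5)"], 0.5))
```

Keeping the timeout shorter than the collection interval avoids overlapping runs of the same plugin.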

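2. Plugin Development Best Practices

Error handling:

Handle network errors gracefully
Handle command execution errors gracefully
Handle parsing errors gracefully

Performance:

Avoid long-running commands
Avoid excessive resource usage
Keep script execution time short

Security:

Avoid running dangerous commands
Avoid hardcoded credentials
Restrict script permissions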

3. Monitoring Best Practices

Plugin monitoring:

Monitor plugin execution status
Monitor plugin execution time
Monitor plugin error rate
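
One way to track these three signals is to wrap each collection call and record its outcome and duration. The Python sketch below is a simplified illustration, not the Agent's own instrumentation; the class name `PluginStats` and its methods are hypothetical.

```python
import time
from collections import defaultdict

class PluginStats:
    """Track run count, error count, and last duration per plugin."""

    def __init__(self):
        self.runs = defaultdict(int)
        self.errors = defaultdict(int)
        self.last_duration = {}

    def observe(self, name, collect):
        """Call `collect()` and record status and duration; return [] on failure."""
        start = time.monotonic()
        self.runs[name] += 1
        try:
            return collect()
        except Exception:
            self.errors[name] += 1
            return []
        finally:
            self.last_duration[name] = time.monotonic() - start

    def error_rate(self, name):
        """Fraction of runs that failed, or 0.0 if the plugin has not run yet."""
        runs = self.runs[name]
        return self.errors[name] / runs if runs else 0.0

def broken_collect():
    raise RuntimeError("command not found")

if __name__ == "__main__":
    stats = PluginStats()
    stats.observe("system_check", lambda: [("up", 1.0)])
    stats.observe("system_check", broken_collect)
    print(stats.error_rate("system_check"), stats.last_duration["system_check"])
```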

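Metric naming:

Use meaningful metric names
Follow Prometheus naming conventions
Include the plugin name as a label

Label usage:

Add a label identifying the source plugin
Add an environment label
Avoid high-cardinality labels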
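
A label is high-cardinality when it takes many distinct values (user IDs, request IDs, timestamps), because each distinct value creates a separate time series. The Python sketch below counts distinct values per label so that such labels can be spotted in a plugin's output; the function names and the default limit of 100 are arbitrary choices for illustration, not Agent settings.

```python
from collections import defaultdict

def label_cardinality(samples):
    """Count distinct values per label key.

    `samples` is an iterable of (metric_name, labels_dict) pairs.
    """
    values = defaultdict(set)
    for _name, labels in samples:
        for key, value in labels.items():
            values[key].add(value)
    return {key: len(distinct) for key, distinct in values.items()}

def high_cardinality(samples, limit=100):
    """Return the label keys whose distinct-value count exceeds `limit`."""
    counts = label_cardinality(samples)
    return sorted(key for key, count in counts.items() if count > limit)

if __name__ == "__main__":
    samples = [
        ("http_requests_total", {"method": "GET", "user_id": str(i)})
        for i in range(500)
    ]
    print(label_cardinality(samples))
    print(high_cardinality(samples))
```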

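Troubleshooting

1. Plugin Execution Fails

Symptom: the plugin fails to run and produces no metrics

Diagnosis:

Check the plugin configuration
Run the command or script by hand to verify it works
Check the Agent logs
Check network connectivity

Resolution:

Correct the plugin configuration
Fix the command or script error
Resolve the network connectivity problem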

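2. Plugin Returns No Data

Symptom: the plugin runs successfully but produces no metrics

Diagnosis:

Inspect the command or script output
Verify the output format
Check the logs for parsing errors

Resolution:

Correct the output format
Make sure the output conforms to the Prometheus format
Review the parsing logic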
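
When verifying the output format by hand, a quick line-by-line check can point at the offending lines. The Python sketch below is a rough approximation of the Prometheus text format written for this page; it is not the Agent's parser, so a line it accepts may still be rejected by the Agent. `find_bad_lines` is a hypothetical helper.

```python
import re

LINE_RE = re.compile(
    r'^[a-zA-Z_:][a-zA-Z0-9_:]*'   # metric name
    r'(?:\{.*\})?'                 # optional label set
    r'\s+(?P<value>\S+)'           # sample value
    r'(?:\s+-?\d+)?$'              # optional timestamp
)

def find_bad_lines(output):
    """Return the 1-based numbers of lines that do not look like valid samples."""
    bad = []
    for number, line in enumerate(output.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and HELP/TYPE comments are fine
        match = LINE_RE.match(stripped)
        if match is None:
            bad.append(number)
            continue
        try:
            float(match.group("value"))
        except ValueError:
            bad.append(number)
    return bad

if __name__ == "__main__":
    output = (
        "# TYPE up gauge\n"
        "up 1\n"
        'disk_usage_percent{device="Filesystem"} Use%\n'
        "not a metric line at all\n"
    )
    print(find_bad_lines(output))
```

Piping a script's output through a check like this catches common mistakes such as a stray table header row or a non-numeric value.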

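3. Plugin Runs Slowly

Symptom: the plugin takes too long to run and degrades overall performance

Diagnosis:

Measure how long the command or script takes
Identify the performance bottleneck
Check system resource usage

Resolution:

Optimize the command or script
Increase the timeout
Adjust the collection interval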

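Summary

The AI Observability Agent plugin system gives users a flexible way to extend the Agent:

Multiple plugin types: HTTP, Exec, and Script
Flexible configuration: interval, timeout, labels, and type-specific options for each plugin
Dynamic management: enable, disable, and restart plugins at runtime
Easy development: a small plugin interface
Error isolation: a failure in one plugin does not affect the others

With the plugin system, users can extend the Agent's collection capabilities to cover more kinds of data sources.

Next Steps

Local persistence - data protection during network failures
Remote Write - efficient data push
Grafana visualization - ready-to-use monitoring dashboards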
