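Chapter 8: Operations Monitoring and Troubleshooting

8.1 Introduction

Running GeoServer Cloud in production depends on effective monitoring and fast troubleshooting. This chapter covers the GeoServer Cloud operations toolchain: log management, metrics, alerting, and the diagnosis and resolution of common failures.

A healthy GeoServer Cloud deployment needs continuous monitoring and timely maintenance. By the end of the chapter you should be able to build a complete monitoring setup and to locate and resolve problems quickly when they occur.

8.2 Log Management

8.2.1 Log Configuration

GeoServer Cloud uses Spring Boot's logging support, so log output is configured through the standard logging.* properties: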

# application.yml: logging configuration
logging:
  level:
    root: INFO
    org.geoserver: INFO
    org.geotools: WARN
    org.springframework.cloud: INFO
    org.springframework.security: INFO
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n"
    file: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n"
  file:
    name: /var/log/geoserver/application.log
  # Rotation settings (Spring Boot 2.4+; they replace logging.file.max-size and logging.file.max-history)
  logback:
    rollingpolicy:
      max-file-size: 100MB
      max-history: 30

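8.2.2 Log Levels

Level	Description	When to use
ERROR	Errors	Problems that need immediate attention
WARN	Warnings	Potential problems worth watching
INFO	General information	Default level in production
DEBUG	Debugging detail	Development and troubleshooting
TRACE	Fine-grained tracing	In-depth debugging

8.2.3 Changing Log Levels at Runtime

Log levels can be changed at runtime without a restart, provided the loggers Actuator endpoint is exposed (add loggers to management.endpoints.web.exposure.include):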

# Show the current log level
curl http://localhost:8080/actuator/loggers/org.geoserver

# Set the log level to DEBUG
curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"configuredLevel": "DEBUG"}' \
    http://localhost:8080/actuator/loggers/org.geoserver

# Reset to the default level
curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"configuredLevel": null}' \
    http://localhost:8080/actuator/loggers/org.geoserver
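
When log levels are changed often during an incident, the three calls above can be wrapped in a small helper. The following is only a sketch: logger_payload and set_log_level are hypothetical names, and the base URL passed in is an assumption about where the service's Actuator port is reachable.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the /actuator/loggers endpoint comes from the text above.

# logger_payload [LEVEL]
# Build the JSON body for the loggers endpoint.
# An empty argument or "null" yields a JSON null, which clears the override.
logger_payload() {
    local level
    level=$(printf '%s' "${1:-}" | tr '[:lower:]' '[:upper:]')
    case "$level" in
        TRACE|DEBUG|INFO|WARN|ERROR|OFF)
            printf '{"configuredLevel": "%s"}\n' "$level" ;;
        ""|NULL)
            printf '{"configuredLevel": null}\n' ;;
        *)
            echo "unknown log level: $1" >&2
            return 1 ;;
    esac
}

# set_log_level BASE_URL LOGGER [LEVEL]
# POST the payload to one service instance.
set_log_level() {
    local base_url=$1 logger=$2 body
    body=$(logger_payload "${3:-}") || return 1
    curl -sf -X POST -H "Content-Type: application/json" \
        -d "$body" "${base_url}/actuator/loggers/${logger}"
}
```

For example, set_log_level http://localhost:8080 org.geoserver debug raises the level, and the same call without a level resets it. Validating the level first avoids sending a typo to the endpoint.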

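8.2.4 Structured Logging

Emitting logs as JSON makes them easier to collect centrally. Note that a plain pattern does not escape quotes or newlines inside %msg and %ex, so such lines produce invalid JSON; a dedicated JSON encoder avoids this. A pattern-based configuration looks like this: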

logging:
  pattern:
    console: >
      {"timestamp":"%d{ISO8601}","level":"%level","thread":"%thread",
       "logger":"%logger{36}","message":"%msg","exception":"%ex"}%n

8.2.5 Centralized Log Collection

Using the ELK Stack

# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata: ~
      - decode_json_fields:
          fields: ["message"]
          target: ""
          overwrite_keys: true

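output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "geoserver-cloud-%{+yyyy.MM.dd}"

# A custom index name also requires matching template settings
setup.template.name: "geoserver-cloud"
setup.template.pattern: "geoserver-cloud-*"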

Using Loki + Grafana

# Promtail configuration
scrape_configs:
  - job_name: geoserver-cloud
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
        target_label: service
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
      - labels:
          level:

8.3 Metrics

8.3.1 Exposing Prometheus Metrics

Make sure the required Actuator endpoints are exposed:

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: when_authorized
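  metrics:
    tags:
      application: ${spring.application.name}
    # Publish histogram buckets so that histogram_quantile() queries return data
    distribution:
      percentiles-histogram:
        http.server.requests: true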

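8.3.2 Key Metrics

JVM Metrics

Metric	Description	Suggested alert threshold
jvm_memory_used_bytes	Memory in use	> 80% of maximum heap
jvm_gc_pause_seconds_max	Longest GC pause	> 1s
jvm_threads_live_threads	Live threads	> 500

HTTP Metrics

Metric	Description	Suggested alert threshold
http_server_requests_seconds_count	Request count	Relative to baseline
http_server_requests_seconds_sum	Total response time	-
http_server_requests_seconds_max	Maximum response time	> 10s

Database Connection Pool

Metric	Description	Suggested alert threshold
hikaricp_connections_active	Active connections	> 80% of the pool maximum
hikaricp_connections_pending	Threads waiting for a connection	> 0 (sustained)
hikaricp_connections_timeout_total	Connection acquisition timeouts	> 0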

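8.3.3 Prometheus Query Examples

# Request rate (requests per second)
rate(http_server_requests_seconds_count{application="wms-service"}[5m])

# P95 response time (requires histogram buckets to be published)
histogram_quantile(0.95,
  sum by (le) (rate(http_server_requests_seconds_bucket{application="wms-service"}[5m]))
)

# Error rate (percentage of requests answered with 5xx)
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) /
sum(rate(http_server_requests_seconds_count[5m])) * 100

# JVM heap usage (percent), summed per instance because individual pools may report a maximum of -1
sum by (application, instance) (jvm_memory_used_bytes{area="heap"})
  / sum by (application, instance) (jvm_memory_max_bytes{area="heap"}) * 100

# Database connection pool usage (percent)
hikaricp_connections_active / hikaricp_connections_max * 100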
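
The usage queries and the threshold suggestions in the metric tables reduce to simple ratio checks. As a minimal sketch of that arithmetic, the helpers below work on sample numbers rather than live metrics; usage_pct and over_threshold are hypothetical names, not part of Prometheus or GeoServer Cloud.

```shell
#!/usr/bin/env bash

# usage_pct USED MAX
# Print USED/MAX as a percentage with one decimal place.
# Fails when MAX is not positive (the JVM reports -1 for an undefined maximum).
usage_pct() {
    awk -v used="$1" -v max="$2" 'BEGIN {
        if (max <= 0) exit 1
        printf "%.1f\n", used / max * 100
    }'
}

# over_threshold VALUE LIMIT
# Succeed when VALUE is strictly greater than LIMIT.
over_threshold() {
    awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v > limit) }'
}

# Example: 850 MB used out of a 1000 MB heap, checked against the 80% suggestion
pct=$(usage_pct 850 1000)
if over_threshold "$pct" 80; then
    echo "heap usage ${pct}% is above the 80% threshold"
fi
```

The guard on a non-positive maximum mirrors the reason for summing heap pools per instance in the query: dividing by a pool whose maximum is undefined gives a meaningless negative ratio.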

8.3.4 Grafana Dashboard

A GeoServer Cloud monitoring dashboard can be defined as follows:

{
  "dashboard": {
    "title": "GeoServer Cloud Monitoring",
    "panels": [
      {
        "title": "Service Availability",
        "type": "stat",
        "targets": [{
          "expr": "up{job=\"geoserver-cloud\"}"
        }]
      },
      {
        "title": "Request Rate by Service",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(http_server_requests_seconds_count[5m])) by (application)",
          "legendFormat": "{{application}}"
        }]
      },
      {
        "title": "Response Time P95",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, application))",
          "legendFormat": "{{application}}"
        }]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(http_server_requests_seconds_count{status=~\"5..\"}[5m])) by (application)",
          "legendFormat": "{{application}}"
        }]
      },
      {
        "title": "JVM Memory Usage",
        "type": "graph",
        "targets": [{
          "expr": "jvm_memory_used_bytes{area=\"heap\"}",
          "legendFormat": "{{application}} - {{id}}"
        }]
      },
      {
        "title": "Database Connections",
        "type": "graph",
        "targets": [{
          "expr": "hikaricp_connections_active",
          "legendFormat": "{{application}} - active"
        }, {
          "expr": "hikaricp_connections_max",
          "legendFormat": "{{application}} - max"
        }]
      }
    ]
  }
}

8.4 Health Checks

8.4.1 Health Endpoint Configuration

management:
  endpoint:
    health:
      show-details: when_authorized
      probes:
        enabled: true
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true
    db:
      enabled: true
    rabbit:
      enabled: true

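8.4.2 Health Check Endpoints

Endpoint	Purpose	Notes
/actuator/health	Overall health	Aggregates all components
/actuator/health/liveness	Liveness probe	Whether the container needs a restart
/actuator/health/readiness	Readiness probe	Whether the service can accept traffic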
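
Probes and scripts usually look only at the HTTP status code of these endpoints: by default Spring Boot answers 200 when the status is UP and 503 when it is DOWN or OUT_OF_SERVICE. The sketch below makes that mapping explicit; health_code_to_state and check_health are hypothetical helper names, and the URL passed in is an assumption about where the service is reachable.

```shell
#!/usr/bin/env bash

# health_code_to_state HTTP_CODE
# Translate the HTTP status code of a health endpoint into a coarse state.
health_code_to_state() {
    case "$1" in
        200) echo "UP" ;;
        503) echo "DOWN" ;;          # DOWN or OUT_OF_SERVICE
        000) echo "UNREACHABLE" ;;   # curl reports 000 when it got no HTTP response
        *)   echo "UNKNOWN" ;;       # e.g. 404 when the endpoint is not exposed
    esac
}

# check_health URL
# Query the endpoint and print the derived state.
check_health() {
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
    health_code_to_state "$code"
}
```

A call such as check_health http://localhost:8080/actuator/health/readiness can then be used in scripts without parsing the JSON body.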

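8.4.3 Custom Health Indicators

import org.geoserver.catalog.Catalog;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Reports UP when the GeoServer catalog can be queried, DOWN otherwise.
// Spring Boot exposes it as the "geoServer" component of /actuator/health.
@Component
public class GeoServerHealthIndicator implements HealthIndicator {

    @Autowired
    private Catalog catalog;

    @Override
    public Health health() {
        try {
            // Check that the catalog is accessible
            int workspaces = catalog.getWorkspaces().size();
            return Health.up()
                .withDetail("workspaces", workspaces)
                .build();
        } catch (Exception e) {
            // Any failure to read the catalog marks the service as DOWN
            return Health.down()
                .withException(e)
                .build();
        }
    }
}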

8.4.4 Kubernetes Probe Configuration

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30
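
A rough way to reason about these settings is to compute how long Kubernetes waits before it acts on a failing probe: approximately initialDelaySeconds + periodSeconds × failureThreshold. The helper below is a sketch with a hypothetical name; it ignores timeoutSeconds and scheduling jitter.

```shell
#!/usr/bin/env bash

# probe_budget INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Approximate number of seconds before the probe is considered failed for good.
probe_budget() {
    echo $(( $1 + $2 * $3 ))
}

probe_budget 30 10 30   # startupProbe above: prints 330
probe_budget 0 10 3     # livenessProbe on a running container: prints 30
```

With the values above, a service has roughly 330 seconds to finish starting, and a running container is restarted after about 30 seconds of consecutive liveness failures. If services need longer to start, raising the startup probe's failureThreshold is usually safer than lengthening the liveness settings.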

8.5 Alerting

8.5.1 Prometheus Alert Rules

# prometheus-rules.yaml
groups:
  - name: geoserver-cloud
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up{job="geoserver-cloud"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."
      
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
          / sum(rate(http_server_requests_seconds_count[5m])) by (application) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.application }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      
      # Slow responses
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95, 
            sum(rate(http_server_requests_seconds_bucket[5m])) by (le, application)
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.application }}"
          description: "P95 response time is {{ $value }}s"
      
      # High memory usage (summed per instance because individual heap pools may report a maximum of -1)
      - alert: HighMemoryUsage
        expr: |
          sum by (application, instance) (jvm_memory_used_bytes{area="heap"})
          / sum by (application, instance) (jvm_memory_max_bytes{area="heap"}) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.application }}"
          description: "Heap memory usage is {{ $value | humanizePercentage }}"
      
      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: hikaricp_connections_pending > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool exhausted"
          description: "{{ $labels.application }} has {{ $value }} pending connections"

8.5.2 Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  receiver: 'team-email'
  group_by: ['alertname', 'application']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'team-pagerduty'
    - match:
        severity: warning
      receiver: 'team-slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        
  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#geoserver-alerts'
        
  - name: 'team-pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'

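8.6 Troubleshooting

8.6.1 Service Fails to Start

Problem: a service fails to start and restarts repeatedly.

Diagnostic steps:

# 1. View the container logs
docker logs gscloud-wms-1 --tail 200

# 2. Search the logs for errors
docker logs gscloud-wms-1 2>&1 | grep -iE "error|exception|failed"

# 3. Check the services it depends on (nc must be available in the image)
docker compose ps
docker compose exec wms nc -zv database 5432
docker compose exec wms nc -zv rabbitmq 5672

# 4. Check the effective environment
docker compose exec wms env | sort

Common causes and fixes:

Error message	Likely cause	Fix
Connection refused: database	Database not ready	Wait for the database to start, or check the network
OutOfMemoryError	Insufficient memory	Increase the JVM heap size
Config server not available	Config service not ready	Check the Config service status
Eureka client failed	Service discovery failure	Check the Discovery service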

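8.6.2 Performance Problems

Problem: response times are too long.

Diagnostic steps:

# 1. Check system resources
docker stats

# 2. Capture a thread dump (plain-text format)
curl -H "Accept: text/plain" http://localhost:8080/actuator/threaddump > threaddump.txt

# 3. Capture a heap dump (use with care: it pauses the JVM and produces a large file)
curl -o heapdump.hprof http://localhost:8080/actuator/heapdump

# 4. Look for slow database queries
docker compose exec database psql -U geoserver -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"

Tuning suggestions:

Symptom	Likely cause	Action
High CPU usage	Complex rendering or heavy request load	Add instances or simplify styles
High memory usage	Large amounts of data loaded	Add memory or optimize queries
Waiting for database connections	Connection pool too small	Increase the pool size
Frequent GC	Memory allocation pattern	Tune the GC parameters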
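
A quick first pass over a thread dump is to count threads per state: many BLOCKED threads point at lock contention, while many WAITING threads in the request pool often mean they are waiting for database connections. The sketch below assumes the dump was captured in plain-text form (Accept: text/plain) and contains jstack-style "java.lang.Thread.State:" lines; the function names are hypothetical.

```shell
#!/usr/bin/env bash

# count_thread_state FILE STATE
# Count the threads in STATE (e.g. BLOCKED) in a text thread dump.
count_thread_state() {
    grep -c "java.lang.Thread.State: $2" "$1" || true
}

# thread_state_summary FILE
# Print the number of threads per state, most frequent first.
thread_state_summary() {
    grep -o 'java.lang.Thread.State: [A-Z_]*' "$1" \
        | awk '{print $2}' | sort | uniq -c | sort -rn
}
```

For example, thread_state_summary threaddump.txt gives the overall distribution, and count_thread_state threaddump.txt BLOCKED can be compared across several dumps taken a few seconds apart.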

8.6.3 Database Connection Problems

Problem: a service cannot connect to the database.

# 1. Test network connectivity
docker compose exec wms nc -zv database 5432

# 2. Test database login
docker compose exec database psql -U geoserver -d geoserver -c "SELECT 1"

# 3. Check the number of open connections
docker compose exec database psql -U geoserver -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'geoserver';"

# 4. Check the connection pool state
curl http://localhost:8080/actuator/metrics/hikaricp.connections.active
curl http://localhost:8080/actuator/metrics/hikaricp.connections.pending

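8.6.4 Event Bus Problems

Problem: configuration changes are not propagated between services.

# 1. Check the RabbitMQ connection
docker compose exec wms nc -zv rabbitmq 5672

# 2. List the RabbitMQ queues
docker compose exec rabbitmq rabbitmqctl list_queues

# 3. List the consumers
docker compose exec rabbitmq rabbitmqctl list_consumers

# 4. Trigger a configuration refresh manually
#    (the endpoint is busrefresh since Spring Cloud 2020.0; older releases use bus-refresh)
curl -X POST http://localhost:8080/actuator/busrefresh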

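8.6.5 Common Diagnostic Commands

The endpoints below must be included in management.endpoints.web.exposure.include before they respond.

# Service health
curl -s http://localhost:8080/actuator/health | jq .

# Configuration properties
curl -s http://localhost:8080/actuator/configprops | jq .

# Environment properties
curl -s http://localhost:8080/actuator/env | jq .

# Bean list
curl -s http://localhost:8080/actuator/beans | jq .

# Recent HTTP exchanges (httptrace in Spring Boot 2.x; renamed httpexchanges in Spring Boot 3)
curl -s http://localhost:8080/actuator/httptrace | jq .

# Metrics
curl -s http://localhost:8080/actuator/metrics | jq .
curl -s http://localhost:8080/actuator/metrics/http.server.requests | jq .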

8.7 Backup and Restore

8.7.1 Database Backup

#!/bin/bash
# backup-pgconfig.sh
# Abort on errors, including a failing pg_dump on the left side of the pipe
set -euo pipefail

BACKUP_DIR="/backup/geoserver"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/pgconfig_${DATE}.sql.gz"

mkdir -p "${BACKUP_DIR}"

# Create the backup
docker compose exec -T database pg_dump -U geoserver geoserver | gzip > "${BACKUP_FILE}"

# Keep only the last 30 days of backups
find "${BACKUP_DIR}" -name "pgconfig_*.sql.gz" -mtime +30 -delete

echo "Backup completed: ${BACKUP_FILE}"
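
A backup is only useful if it can be read back. As a minimal sketch, the check below confirms that an archive exists, is non-empty, and is a valid gzip stream; verify_backup is a hypothetical helper name, and passing this check does not prove that the SQL inside restores cleanly.

```shell
#!/usr/bin/env bash

# verify_backup FILE
# Succeed only if FILE is non-empty and a readable gzip stream.
verify_backup() {
    local file=$1
    [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
    gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
}
```

Calling verify_backup on the new file right after the dump, and before old backups are deleted, keeps a failed run from silently replacing good backups with a broken one. A periodic test restore into a scratch database remains the stronger check.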

8.7.2 Database Restore

#!/bin/bash
# restore-pgconfig.sh
set -euo pipefail

BACKUP_FILE=${1:-}

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup_file>"
    exit 1
fi

# Stop all GeoServer services
docker compose stop wms wfs wcs rest webui gwc

# Restore the database (the target should be empty, or the dump created with pg_dump --clean)
gunzip -c "${BACKUP_FILE}" | docker compose exec -T database psql -U geoserver geoserver

# Restart the services
docker compose start wms wfs wcs rest webui gwc

echo "Restore completed from: ${BACKUP_FILE}"

8.7.3 Automated Backups (Kubernetes)

# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pgconfig-backup
  namespace: geoserver
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15
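              command:
                - /bin/bash
                - -c
                - |
                  # pipefail makes the job fail when pg_dump fails; PGPASSWORD is expected from the secret below
                  set -o pipefail
                  pg_dump -h postgresql -U geoserver geoserver | gzip > /backup/pgconfig_$(date +%Y%m%d).sql.gz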
              envFrom:
                - secretRef:
                    name: postgresql-secret
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure

8.8 Version Upgrades

8.8.1 Pre-upgrade Checks

# 1. Check the currently deployed version (image tags)
docker compose images

# 2. Read the release notes
# https://github.com/geoserver/geoserver-cloud/releases

# 3. Back up the data
./backup-pgconfig.sh

# 4. Validate in a test environment
docker compose -f compose-test.yml up -d

8.8.2 Rolling Upgrade

# 1. Update the image version
sed -i 's/TAG=2\.28\.1\.0/TAG=2.28.2.0/' .env

# 2. Pull the new images
docker compose pull

# 3. Upgrade one service at a time
for service in wms wfs wcs rest webui gwc gateway; do
    docker compose up -d --no-deps "$service"
    sleep 30
    # Wait until the service reports healthy (curl must be available in the image)
    until docker compose exec -T "$service" curl -sf http://localhost:8080/actuator/health > /dev/null; do
        sleep 5
    done
    echo "$service upgraded successfully"
done
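
The until loop above never gives up, so a service that fails to come back blocks the upgrade indefinitely. A bounded retry is a common alternative; the helper below is a sketch with a hypothetical name, and the attempt count and delay in the usage note are illustrative values only.

```shell
#!/usr/bin/env bash

# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds; give up after MAX_ATTEMPTS tries.
wait_until() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "gave up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

Inside the upgrade loop, the health wait could then read wait_until 24 5 docker compose exec -T "$service" curl -sf http://localhost:8080/actuator/health || exit 1, which stops the rollout after about two minutes so the remaining services stay on the old version.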

8.8.3 Rollback

# 1. Restore the previous version number
sed -i 's/TAG=2\.28\.2\.0/TAG=2.28.1.0/' .env

# 2. Redeploy
docker compose up -d

# 3. Restore the database if needed
./restore-pgconfig.sh /backup/pgconfig_before_upgrade.sql.gz

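8.9 Chapter Summary

This chapter covered operations monitoring and troubleshooting for GeoServer Cloud:

Log management: configuring log levels, structured logging, and centralized collection.
Metrics: exposing Prometheus metrics, key metrics to watch, and Grafana dashboards.
Health checks: configuring health endpoints and Kubernetes probes.
Alerting: defining Prometheus alert rules and notifications.
Troubleshooting: diagnosing and resolving common problems.
Backup and restore: database backup, restore, and automation.
Version upgrades: the upgrade procedure and rollback strategy.

The next chapter covers extending and customizing GeoServer Cloud.

8.10 Review Questions

In a distributed system, how can log entries be correlated across services (distributed tracing)?
Which business metrics should be monitored to assess the service quality of GeoServer Cloud?
How can an alerting strategy be designed so that it is effective without causing alert fatigue?
When upgrading the database, how can data consistency and integrity be ensured?
How would you implement a disaster recovery plan for GeoServer Cloud?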