第八章:运维监控与故障排除
8.1 引言
在生产环境中运行 GeoServer Cloud,有效的监控和快速的故障排除能力至关重要。本章将全面介绍 GeoServer Cloud 的运维监控体系,包括日志管理、指标监控、告警配置以及常见故障的诊断和解决方法。
一个健康的 GeoServer Cloud 系统需要持续的监控和及时的维护。通过本章的学习,您将掌握建立完善监控体系的方法,以及在问题发生时快速定位和解决问题的技能。
8.2 日志管理
8.2.1 日志配置
GeoServer Cloud 使用 Spring Boot 的日志框架,支持多种日志输出配置:
# application.yml 日志配置
logging:
level:
root: INFO
org.geoserver: INFO
org.geotools: WARN
org.springframework.cloud: INFO
org.springframework.security: INFO
pattern:
console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n"
file: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n"
file:
name: /var/log/geoserver/application.log
max-size: 100MB
max-history: 30
8.2.2 日志级别说明
| 级别 | 说明 | 使用场景 |
|---|---|---|
| ERROR | 错误信息 | 需要立即关注的问题 |
| WARN | 警告信息 | 潜在问题,需要注意 |
| INFO | 一般信息 | 生产环境默认级别 |
| DEBUG | 调试信息 | 开发和问题排查 |
| TRACE | 详细追踪 | 深度调试 |
8.2.3 动态调整日志级别
运行时调整日志级别无需重启:
# 查看当前日志级别
curl http://localhost:8080/actuator/loggers/org.geoserver
# 设置日志级别为 DEBUG
curl -X POST \
-H "Content-Type: application/json" \
-d '{"configuredLevel": "DEBUG"}' \
http://localhost:8080/actuator/loggers/org.geoserver
# 恢复默认级别
curl -X POST \
-H "Content-Type: application/json" \
-d '{"configuredLevel": null}' \
http://localhost:8080/actuator/loggers/org.geoserver
8.2.4 结构化日志
配置 JSON 格式日志便于集中收集:
logging:
pattern:
console: >
{"timestamp":"%d{ISO8601}","level":"%level","thread":"%thread",
"logger":"%logger{36}","message":"%msg","exception":"%ex"}%n
8.2.5 集中日志收集
使用 ELK Stack
# filebeat.yml 配置
filebeat.inputs:
- type: container
paths:
- /var/lib/docker/containers/*/*.log
processors:
- add_docker_metadata: ~
- decode_json_fields:
fields: ["message"]
target: ""
overwrite_keys: true
output.elasticsearch:
hosts: ["elasticsearch:9200"]
index: "geoserver-cloud-%{+yyyy.MM.dd}"
使用 Loki + Grafana
# promtail 配置
scrape_configs:
- job_name: geoserver-cloud
docker_sd_configs:
- host: unix:///var/run/docker.sock
relabel_configs:
- source_labels: [__meta_docker_container_label_com_docker_compose_service]
target_label: service
pipeline_stages:
- json:
expressions:
level: level
message: message
- labels:
level:
8.3 指标监控
8.3.1 暴露 Prometheus 指标
确保 Actuator 端点已配置:
management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus
endpoint:
health:
show-details: when_authorized
metrics:
tags:
application: ${spring.application.name}
8.3.2 关键监控指标
JVM 指标
| 指标 | 说明 | 告警阈值建议 |
|---|---|---|
| jvm_memory_used_bytes | 内存使用 | > 80% 最大堆内存 |
| jvm_gc_pause_seconds | GC 暂停时间 | > 1s |
| jvm_threads_live | 活跃线程数 | > 500 |
HTTP 指标
| 指标 | 说明 | 告警阈值建议 |
|---|---|---|
| http_server_requests_seconds_count | 请求计数 | 根据基线 |
| http_server_requests_seconds_sum | 响应时间总和 | - |
| http_server_requests_seconds_max | 最大响应时间 | > 10s |
数据库连接池
| 指标 | 说明 | 告警阈值建议 |
|---|---|---|
| hikaricp_connections_active | 活跃连接数 | > 80% 最大连接 |
| hikaricp_connections_pending | 等待连接数 | > 0(持续) |
| hikaricp_connections_timeout_total | 超时计数 | > 0 |
8.3.3 Prometheus 查询示例
# 请求速率(每秒请求数)
rate(http_server_requests_seconds_count{application="wms-service"}[5m])
# 响应时间 P95
histogram_quantile(0.95,
rate(http_server_requests_seconds_bucket{application="wms-service"}[5m])
)
# 错误率
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) /
sum(rate(http_server_requests_seconds_count[5m])) * 100
# JVM 堆内存使用率
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} * 100
# 数据库连接池使用率
hikaricp_connections_active / hikaricp_connections_max * 100
8.3.4 Grafana Dashboard
创建 GeoServer Cloud 监控面板:
{
"dashboard": {
"title": "GeoServer Cloud Monitoring",
"panels": [
{
"title": "Service Availability",
"type": "stat",
"targets": [{
"expr": "up{job=\"geoserver-cloud\"}"
}]
},
{
"title": "Request Rate by Service",
"type": "graph",
"targets": [{
"expr": "sum(rate(http_server_requests_seconds_count[5m])) by (application)",
"legendFormat": ""
}]
},
{
"title": "Response Time P95",
"type": "graph",
"targets": [{
"expr": "histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, application))",
"legendFormat": ""
}]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [{
"expr": "sum(rate(http_server_requests_seconds_count{status=~\"5..\"}[5m])) by (application)",
"legendFormat": ""
}]
},
{
"title": "JVM Memory Usage",
"type": "graph",
"targets": [{
"expr": "jvm_memory_used_bytes{area=\"heap\"}",
"legendFormat": " - "
}]
},
{
"title": "Database Connections",
"type": "graph",
"targets": [{
"expr": "hikaricp_connections_active",
"legendFormat": " - active"
}, {
"expr": "hikaricp_connections_max",
"legendFormat": " - max"
}]
}
]
}
}
8.4 健康检查
8.4.1 健康端点配置
management:
endpoint:
health:
show-details: when_authorized
probes:
enabled: true
health:
livenessstate:
enabled: true
readinessstate:
enabled: true
db:
enabled: true
rabbit:
enabled: true
8.4.2 健康检查端点
| 端点 | 用途 | 说明 |
|---|---|---|
| /actuator/health | 整体健康状态 | 包含所有组件 |
| /actuator/health/liveness | 存活探针 | 容器是否需要重启 |
| /actuator/health/readiness | 就绪探针 | 是否可以接收流量 |
8.4.3 自定义健康检查
@Component
public class GeoServerHealthIndicator implements HealthIndicator {
@Autowired
private Catalog catalog;
@Override
public Health health() {
try {
// 检查目录是否可访问
int workspaces = catalog.getWorkspaces().size();
return Health.up()
.withDetail("workspaces", workspaces)
.build();
} catch (Exception e) {
return Health.down()
.withException(e)
.build();
}
}
}
8.4.4 Kubernetes 探针配置
livenessProbe:
httpGet:
path: /actuator/health/liveness
port: 8080
initialDelaySeconds: 120
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /actuator/health/readiness
port: 8080
initialDelaySeconds: 60
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
httpGet:
path: /actuator/health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 30
8.5 告警配置
8.5.1 Prometheus 告警规则
# prometheus-rules.yaml
groups:
- name: geoserver-cloud
rules:
# 服务不可用
- alert: ServiceDown
expr: up{job="geoserver-cloud"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service is down"
description: " has been down for more than 1 minute."
# 高错误率
- alert: HighErrorRate
expr: |
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
/ sum(rate(http_server_requests_seconds_count[5m])) by (application) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on "
description: "Error rate is %"
# 响应时间过长
- alert: HighResponseTime
expr: |
histogram_quantile(0.95,
sum(rate(http_server_requests_seconds_bucket[5m])) by (le, application)
) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "High response time on "
description: "P95 response time is s"
# 内存使用过高
- alert: HighMemoryUsage
expr: |
jvm_memory_used_bytes{area="heap"}
/ jvm_memory_max_bytes{area="heap"} > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on "
description: "Heap memory usage is %"
# 数据库连接池耗尽
- alert: DatabaseConnectionPoolExhausted
expr: hikaricp_connections_pending > 0
for: 2m
labels:
severity: critical
annotations:
summary: "Database connection pool exhausted"
description: " has pending connections"
8.5.2 Alertmanager 配置
# alertmanager.yml
global:
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
route:
receiver: 'team-email'
group_by: ['alertname', 'application']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'team-pagerduty'
- match:
severity: warning
receiver: 'team-slack'
receivers:
- name: 'team-email'
email_configs:
- to: 'team@example.com'
- name: 'team-slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/...'
channel: '#geoserver-alerts'
- name: 'team-pagerduty'
pagerduty_configs:
- service_key: 'xxx'
8.6 故障排除
8.6.1 服务启动失败
问题:服务无法启动,反复重启
诊断步骤:
# 1. 查看容器日志
docker logs gscloud-wms-1 --tail 200
# 2. 检查错误信息
docker logs gscloud-wms-1 2>&1 | grep -i "error\|exception\|failed"
# 3. 检查依赖服务
docker compose ps
docker compose exec wms nc -zv database 5432
docker compose exec wms nc -zv rabbitmq 5672
# 4. 检查配置
docker compose exec wms env | sort
常见原因和解决方案:
| 错误信息 | 可能原因 | 解决方案 |
|---|---|---|
| Connection refused: database | 数据库未就绪 | 等待数据库启动或检查网络 |
| OutOfMemoryError | 内存不足 | 增加 JVM 堆内存 |
| Config server not available | 配置服务未就绪 | 检查 Config 服务状态 |
| Eureka client failed | 服务发现失败 | 检查 Discovery 服务 |
8.6.2 性能问题
问题:响应时间过长
诊断步骤:
# 1. 检查系统资源
docker stats
# 2. 查看线程转储
curl http://localhost:8080/actuator/threaddump > threaddump.txt
# 3. 查看堆转储(谨慎使用)
curl -X POST http://localhost:8080/actuator/heapdump > heapdump.hprof
# 4. 检查数据库慢查询
docker compose exec database psql -U geoserver -c "
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"
优化建议:
| 症状 | 可能原因 | 优化措施 |
|---|---|---|
| CPU 使用率高 | 复杂渲染/大量请求 | 增加实例数/优化样式 |
| 内存使用率高 | 大量数据加载 | 增加内存/优化查询 |
| 数据库连接等待 | 连接池不足 | 增加连接池大小 |
| GC 频繁 | 内存分配模式 | 调整 GC 参数 |
8.6.3 数据库连接问题
问题:无法连接数据库
# 1. 测试网络连通性
docker compose exec wms nc -zv database 5432
# 2. 测试数据库登录
docker compose exec database psql -U geoserver -d geoserver -c "SELECT 1"
# 3. 检查连接数
docker compose exec database psql -U geoserver -c "
SELECT count(*) FROM pg_stat_activity WHERE datname = 'geoserver';"
# 4. 检查连接池状态
curl http://localhost:8080/actuator/metrics/hikaricp.connections.active
curl http://localhost:8080/actuator/metrics/hikaricp.connections.pending
8.6.4 事件总线问题
问题:配置变更不同步
# 1. 检查 RabbitMQ 连接
docker compose exec wms nc -zv rabbitmq 5672
# 2. 查看 RabbitMQ 队列
docker compose exec rabbitmq rabbitmqctl list_queues
# 3. 查看消费者
docker compose exec rabbitmq rabbitmqctl list_consumers
# 4. 手动触发配置刷新
curl -X POST http://localhost:8080/actuator/bus-refresh
8.6.5 常用诊断命令
# 服务状态检查
curl http://localhost:8080/actuator/health | jq
# 查看配置属性
curl http://localhost:8080/actuator/configprops | jq
# 查看环境变量
curl http://localhost:8080/actuator/env | jq
# 查看 Bean 列表
curl http://localhost:8080/actuator/beans | jq
# 查看 HTTP 追踪
curl http://localhost:8080/actuator/httptrace | jq
# 查看指标
curl http://localhost:8080/actuator/metrics | jq
curl http://localhost:8080/actuator/metrics/http.server.requests | jq
8.7 备份与恢复
8.7.1 数据库备份
#!/bin/bash
# backup-pgconfig.sh
BACKUP_DIR="/backup/geoserver"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/pgconfig_${DATE}.sql.gz"
# 创建备份
docker compose exec -T database pg_dump -U geoserver geoserver | gzip > ${BACKUP_FILE}
# 保留最近 30 天的备份
find ${BACKUP_DIR} -name "pgconfig_*.sql.gz" -mtime +30 -delete
echo "Backup completed: ${BACKUP_FILE}"
8.7.2 数据库恢复
#!/bin/bash
# restore-pgconfig.sh
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup_file>"
exit 1
fi
# 停止所有 GeoServer 服务
docker compose stop wms wfs wcs rest webui gwc
# 恢复数据库
gunzip -c ${BACKUP_FILE} | docker compose exec -T database psql -U geoserver geoserver
# 重启服务
docker compose start wms wfs wcs rest webui gwc
echo "Restore completed from: ${BACKUP_FILE}"
8.7.3 自动化备份(Kubernetes)
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: pgconfig-backup
namespace: geoserver
spec:
schedule: "0 2 * * *" # 每天凌晨2点
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: postgres:15
command:
- /bin/sh
- -c
- |
pg_dump -h postgresql -U geoserver geoserver | gzip > /backup/pgconfig_$(date +%Y%m%d).sql.gz
envFrom:
- secretRef:
name: postgresql-secret
volumeMounts:
- name: backup
mountPath: /backup
volumes:
- name: backup
persistentVolumeClaim:
claimName: backup-pvc
restartPolicy: OnFailure
8.8 版本升级
8.8.1 升级前检查
# 1. 检查当前版本
docker compose exec wms java -jar app.jar --version
# 2. 查看发布说明
# https://github.com/geoserver/geoserver-cloud/releases
# 3. 备份数据
./backup-pgconfig.sh
# 4. 测试环境验证
docker compose -f compose-test.yml up -d
8.8.2 滚动升级
# 1. 更新镜像版本
sed -i 's/TAG=2.28.1.0/TAG=2.28.2.0/' .env
# 2. 拉取新镜像
docker compose pull
# 3. 滚动更新(逐个服务)
for service in wms wfs wcs rest webui gwc gateway; do
docker compose up -d --no-deps $service
sleep 30
# 等待服务健康
until docker compose exec $service curl -sf http://localhost:8080/actuator/health; do
sleep 5
done
echo "$service upgraded successfully"
done
8.8.3 版本回滚
# 1. 恢复版本号
sed -i 's/TAG=2.28.2.0/TAG=2.28.1.0/' .env
# 2. 重新部署
docker compose up -d
# 3. 如需恢复数据库
./restore-pgconfig.sh /backup/pgconfig_before_upgrade.sql.gz
8.9 本章小结
本章全面介绍了 GeoServer Cloud 的运维监控和故障排除:
-
日志管理:配置日志级别、结构化日志和集中收集。
-
指标监控:配置 Prometheus 指标、关键监控项和 Grafana 面板。
-
健康检查:配置健康端点和 Kubernetes 探针。
-
告警配置:设置 Prometheus 告警规则和通知。
-
故障排除:常见问题的诊断和解决方法。
-
备份恢复:数据库备份、恢复和自动化。
-
版本升级:升级流程和回滚策略。
在下一章中,我们将学习 GeoServer Cloud 的开发扩展和定制。
8.10 思考题
-
在分布式系统中,如何实现日志的关联追踪(Distributed Tracing)?
-
应该监控哪些业务指标来评估 GeoServer Cloud 的服务质量?
-
如何设计一个有效的告警策略,避免告警疲劳?
-
在进行数据库升级时,如何确保数据的一致性和完整性?
-
如何实现 GeoServer Cloud 的灾难恢复方案?