A subtle pitfall that makes Thanos Ruler send duplicate alerts when deployed on Kubernetes
1 Overview
1.1 Environment
Thanos Ruler and Alertmanager are both deployed in a Kubernetes cluster. The versions are:
a. Kubernetes cluster: v1.18.5
b. Thanos Ruler: v0.11.0
c. Alertmanager: v0.20.0
The Thanos Ruler YAML manifest is as follows:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-rule
  name: thanos-rule
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-rule
  serviceName: thanos-rules
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-rule
    spec:
      containers:
      - image: registry.cn-shenzhen.aliyuncs.com/gzlj/thanos-reloader:v0.1
        imagePullPolicy: Always
        name: reloader
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - args:
        - rule
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --rule-file=/etc/thanos/rules/*rules.yaml
        - --data-dir=/var/thanos/rule
        - --label=rule_replica="$(NAME)"
        # Note the --alert.label-drop line below: its value is wrapped in double quotes ("")
        - --alert.label-drop="rule_replica"
        - --query=dnssrv+_http._tcp.thanos-query.monitoring.svc.cluster.local
        - --alertmanagers.url=http://alertmanager-main.monitoring.svc.cluster.local:9093
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: quay.mirrors.ustc.edu.cn/thanos/thanos:v0.11.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 24
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-rule
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 18
          httpGet:
            path: /-/ready
            port: 10902
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        volumeMounts:
        - mountPath: /var/thanos/rule
          name: data
        - mountPath: /etc/thanos/rules
          name: thanos-rules
      restartPolicy: Always
      serviceAccount: thanos-rules
      serviceAccountName: thanos-rules
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: thanos-rules
        name: thanos-rules
      - emptyDir: {}
        name: data
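The key part is shown in the screenshot below:
1.2 Symptom
Alertmanager receives duplicate alerts. The only difference between the two copies is the value of the custom label rule_replica, as shown in the screenshot: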
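As a rough illustration of the symptom, the two copies might look like the label sets below. The alert name and severity are hypothetical placeholders, not taken from the original setup. The rule_replica values assume the pod names that a two-replica StatefulSet named thanos-rule produces. Alertmanager treats alerts with different label sets as different alerts, so a single differing label is enough to defeat deduplication.

```yaml
# Illustrative only: alertname and severity are hypothetical placeholders.
- labels:
    alertname: ExampleAlert
    severity: warning
    rule_replica: thanos-rule-0   # sent by the first Ruler replica
- labels:
    alertname: ExampleAlert
    severity: warning
    rule_replica: thanos-rule-1   # sent by the second Ruler replica
```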
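2 Solution
I tried switching the Thanos Ruler image to another version (v0.15.0), but the symptom persisted.
Just as I was about to give up, I changed the Thanos Ruler startup argument --alert.label-drop="rule_replica" to --alert.label-drop=rule_replica, that is, I only removed the double quotes. After that, Alertmanager stopped receiving duplicate alerts.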
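A likely explanation, which the original troubleshooting did not confirm: Kubernetes passes each args entry to the container process as-is, without a shell, so the double quotes are never stripped. Thanos is then asked to drop a label whose name literally includes the quote characters, which never matches rule_replica. The fragment below is a minimal sketch of the corrected arguments, showing only the two relevant flags; the rest of the manifest is assumed unchanged.

```yaml
# Sketch of the corrected args for the thanos-rule container (other flags omitted).
args:
- rule
# --label expects name="value", so the quotes stay here.
- --label=rule_replica="$(NAME)"
# No quotes around the value: args are not processed by a shell, so quotes
# would become part of the label name and nothing would be dropped.
- --alert.label-drop=rule_replica
```

Quoting the whole list item, for example "--alert.label-drop=rule_replica", should also work, because in that case YAML itself removes the outer quotes before the value reaches the container.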
3 Behavior after the fix
Thanos Ruler now drops the rule_replica label from each alert before sending it to Alertmanager, so Alertmanager holds a single copy of the alert instead of the previous two.