1 Overview

1.1 Environment

Both thanos ruler and alertmanager are deployed in a kubernetes cluster. Versions:
a. kubernetes cluster: v1.18.5
b. thanos ruler: v0.11.0
c. alertmanager: v0.20.0

The thanos ruler YAML manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-rule
  name: thanos-rule
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-rule
  serviceName: thanos-rules
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-rule
    spec:
      containers:
      - image: registry.cn-shenzhen.aliyuncs.com/gzlj/thanos-reloader:v0.1
        imagePullPolicy: Always
        name: reloader
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - args:
        - rule
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --rule-file=/etc/thanos/rules/*rules.yaml
        - --data-dir=/var/thanos/rule
        - --label=rule_replica="$(NAME)"
        # Note the --alert.label-drop line below: its value is wrapped in double quotes
        - --alert.label-drop="rule_replica"
        - --query=dnssrv+_http._tcp.thanos-query.monitoring.svc.cluster.local
        - --alertmanagers.url=http://alertmanager-main.monitoring.svc.cluster.local:9093
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: quay.mirrors.ustc.edu.cn/thanos/thanos:v0.11.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 24
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-rule
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 18
          httpGet:
            path: /-/ready
            port: 10902
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        volumeMounts:
        - mountPath: /var/thanos/rule
          name: data
        - mountPath: /etc/thanos/rules
          name: thanos-rules
      restartPolicy: Always
      serviceAccount: thanos-rules
      serviceAccountName: thanos-rules
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: thanos-rules
        name: thanos-rules
      - emptyDir: {}
        name: data

The key part of the manifest is the --alert.label-drop argument, whose value is wrapped in double quotes.


1.2 Symptom

alertmanager receives duplicate alerts. The only difference between the two copies is the value of the custom label rule_replica.
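
To make the symptom concrete, here is a rough sketch of what the label sets of the two duplicate alerts would look like. The alert name and severity are hypothetical placeholders. The rule_replica values assume the usual StatefulSet pod names (thanos-rule-0 and thanos-rule-1), which is what $(NAME) resolves to in the manifest above.

```yaml
# Illustrative only: alertname and severity are made-up placeholders.
# rule_replica assumes the default StatefulSet pod names for 2 replicas.
- labels:
    alertname: ExampleAlert
    severity: warning
    rule_replica: thanos-rule-0
- labels:
    alertname: ExampleAlert
    severity: warning
    rule_replica: thanos-rule-1
```

alertmanager identifies an alert by its full label set, so as long as rule_replica is still attached, it treats these as two separate alerts and cannot deduplicate them.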


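2 Fix

I first tried switching the thanos ruler image to a different version (v0.15.0), but the symptom persisted.
Just as I was about to give up, I changed the thanos ruler startup argument --alert.label-drop="rule_replica" to --alert.label-drop=rule_replica, that is, I only removed the double quotes. After that, alertmanager stopped receiving duplicate alerts.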
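
A likely explanation: kubernetes passes every entry of args to the container verbatim, with no shell in between to strip quotes. With the quoted form, thanos is asked to drop a label literally named "rule_replica" (quotes included), which matches nothing, so the label stays on the alert. Below is a minimal sketch of the corrected fragment, assuming the rest of the container spec stays as shown earlier.

```yaml
args:
- rule
# --label keeps its quotes: the thanos docs write this flag as <name>="<value>"
- --label=rule_replica="$(NAME)"
# --alert.label-drop takes a bare label name, so no quotes here
- --alert.label-drop=rule_replica
```

The two flags look similar but are parsed differently, which is probably why the quotes were carried over from --label in the first place.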


3 Result after the fix

thanos ruler now drops the rule_replica label from each alert before sending it to alertmanager, so alertmanager holds a single copy of the alert instead of the previous two.
