Commit 00cafdfa by unknown

add README

# __Prometheus Alerting Rule Configuration__
------
## Installing Prometheus on a Kubernetes Cluster
> ### Install prometheus-operator
```shell
git clone https://github.com/camilb/prometheus-kubernetes.git
cd prometheus-kubernetes
./deploy
```
> ### Check that the deployment has finished
```shell
kubectl -n monitoring get pods
```
> ### Change the Service type
```shell
kubectl -n monitoring edit svc grafana
kubectl -n monitoring edit svc alertmanager-main
kubectl -n monitoring edit svc prometheus-k8s
```
```yaml
spec:
  clusterIP: None
  ports:
  - name: web
    port: xxxx
    protocol: TCP
    targetPort: web
  selector:
    app: xxx
  sessionAffinity: None
  type: ClusterIP  # change to "NodePort"
```
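As a non-interactive alternative to `kubectl edit`, the same change can be sketched with `kubectl patch`. The helper name `expose_as_nodeport` is hypothetical, and the sketch assumes the three Services live in the `monitoring` namespace as shown above.

```shell
# Hypothetical helper: switch one Service in the monitoring namespace to NodePort.
# $1 = Service name
expose_as_nodeport() {
  # Patch only spec.type; all other fields of the Service are left untouched.
  kubectl -n monitoring patch svc "$1" -p '{"spec":{"type":"NodePort"}}'
}

# Usage (uncomment to run against a real cluster):
# for svc in grafana alertmanager-main prometheus-k8s; do
#   expose_as_nodeport "$svc"
# done
```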
> ### Try accessing the dashboards
```shell
kubectl get node -o wide
kubectl -n monitoring get svc
```
```yaml
EXTERNAL-IP
212.64.111.xx   # external IP of a cluster node
212.64.43.xxx
212.64.44.xxx
```
```yaml
NAME                PORT(S)
alertmanager-main   9093:30xxx/TCP   # NodePort of the Service
grafana             3000:32xxx/TCP
prometheus-k8s      9090:31xxx/TCP
```
```shell
curl http://[any external cluster IP]:[Service NodePort]
```
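The lookup can also be scripted instead of read off the tables. This is a minimal sketch that assumes each Service exposes a single port; the helper names `node_port` and `dashboard_url` are hypothetical, and `NODE_IP` in the usage comment is a placeholder.

```shell
# Hypothetical helper: print the NodePort of the first port of a Service.
# $1 = Service name in the monitoring namespace
node_port() {
  kubectl -n monitoring get svc "$1" -o jsonpath='{.spec.ports[0].nodePort}'
}

# Hypothetical helper: build the URL from a node IP and a NodePort.
# $1 = external node IP, $2 = NodePort
dashboard_url() {
  printf 'http://%s:%s\n' "$1" "$2"
}

# Usage (replace NODE_IP with any external IP from `kubectl get node -o wide`):
# curl "$(dashboard_url NODE_IP "$(node_port grafana)")"
```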
## Setting Up Alerting Rules
> ### View the existing rules in the Prometheus UI
Open the following URL in a browser:
```yaml
http://[cluster IP]:[prometheus-k8s port]/rules
```
> ### Modify an existing rule
```shell
kubectl -n monitoring edit prometheusrules.monitoring.coreos.com prometheus-k8s-rules
```
```yaml
spec:
  groups:
  - name: kubernetes-system  # group name
    rules:
    - alert: KubeNodeNotReady  # alert name
      annotations:
        message: '{{ $labels.node }} has been unready for more than an hour'  # alert message
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready
      expr: |  # PromQL expression
        kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
      for: 1h  # how long the alert stays Pending before it fires
      labels:
        severity: warning
```
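Before changing an `expr`, it can help to evaluate the PromQL against the running server. The following is a minimal sketch that uses the Prometheus HTTP API endpoint `/api/v1/query`; the helper name `run_promql` and the placeholders in the usage comment are assumptions, not part of the original setup.

```shell
# Hypothetical helper: evaluate a PromQL expression through the Prometheus HTTP API.
# $1 = base URL of Prometheus, $2 = PromQL expression
run_promql() {
  # -s silences progress output; -G sends the URL-encoded data as a GET query string.
  curl -sG "$1/api/v1/query" --data-urlencode "query=$2"
}

# Usage (NODE_IP and NODE_PORT are placeholders for the prometheus-k8s Service):
# run_promql "http://NODE_IP:NODE_PORT" \
#   'kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0'
```

An empty `result` array in the JSON response suggests that no series currently matches the expression.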
> ### Add a new rule file
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules-extra
  namespace: monitoring
spec:
  groups:
  - name: NAME
    rules:
    - alert: ALERT
      annotations:
        message: MESSAGE
        runbook_url: RUNBOOK_URL
      expr: |
        absent(up{job="alertmanager-main"} == 1)
      for: 15m
      labels:
        severity: critical
```
> ### Apply the new rule
```shell
kubectl apply -f [new rule file]
```
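After applying, it is worth confirming that the object actually exists in the cluster. This is a minimal sketch; the helper name `apply_rules` and the file name `extra-rules.yaml` are hypothetical.

```shell
# Hypothetical helper: apply a rule file, then list the PrometheusRule objects.
# $1 = path to the rule file
apply_rules() {
  # Only list the objects if the apply step succeeded.
  kubectl apply -f "$1" &&
    kubectl -n monitoring get prometheusrules.monitoring.coreos.com
}

# Usage (extra-rules.yaml is a placeholder file name):
# apply_rules extra-rules.yaml
```

Once the configuration has been reloaded, the new group should also appear on the `/rules` page; how long that takes depends on the setup.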
## Existing Rules
> ### Group: "kubernetes-absent"
> ##### Alert name: "KubeAPIDown"
> *Message*: `KubeAPI has disappeared from Prometheus target discovery.`
Note: API server connection lost
> ##### Alert name: "KubeControllerManagerDown"
> *Message*: `KubeControllerManager has disappeared from Prometheus target discovery.`
Note: controller manager connection lost
> ##### Alert name: "KubeSchedulerDown"
> *Message*: `KubeScheduler has disappeared from Prometheus target discovery.`
Note: scheduler connection lost
> ##### Alert name: "KubeletDown"
> *Message*: `Kubelet has disappeared from Prometheus target discovery.`
Note: Kubelet connection lost
> ### Group: "kubernetes-apps"
> ##### Alert name: "KubePodCrashLooping"
> *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} / second`
Note: Pod keeps restarting
> ##### Alert name: "KubePodNotReady"
> *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} is not ready.`
Note: Pod has not become ready
> ##### Alert name: "KubeDeploymentGenerationMismatch"
> *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} generation mismatch`
Note: Deployment generation is incorrect
> ##### Alert name: "KubeDeploymentReplicasMismatch"
> *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch`
Note: Deployment replica count is incorrect
> ##### Alert name: "KubeStatefulSetReplicasMismatch"
> *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} replica mismatch`
Note: StatefulSet replica count is incorrect
> ##### Alert name: "KubeStatefulSetGenerationMismatch"
> *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} generation mismatch`
Note: StatefulSet generation is incorrect
> ##### Alert name: "KubeDaemonSetRolloutStuck"
> *Message*: `Only {{$value}}% of desired pods scheduled and ready for daemon set {{$labels.namespace}}/{{$labels.daemonset}}`
Note: DaemonSet rolling update is stuck
> ##### Alert name: "KubeDaemonSetNotScheduled"
> *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.`
Note: DaemonSet pods are not scheduled
>
> ##### Alert name: "KubeDaemonSetMisScheduled"
> *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.`
Note: DaemonSet pods are scheduled on the wrong nodes
>
> ##### Alert name: "KubeCronJobRunning"
> *Message*: `CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.`
Note: CronJob is taking too long to run
>
> ##### Alert name: "KubeJobCompletion"
> *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 1h to complete.`
Note: Job is taking too long to run
>
> ##### Alert name: "KubeJobFailed"
> *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.`
Note: Job failed
>
> ### Group: "kubernetes-resources"
> ##### Alert name: "KubeCPUOvercommit"
> *Message*: `Overcommited CPU resource requests on Pods, cannot tolerate node failure.`
Note: CPU requests exceed node capacity
> ##### Alert name: "KubeMemOvercommit"
> *Message*: `Overcommited Memory resource requests on Pods, cannot tolerate node failure.`
Note: memory requests exceed node capacity
> ##### Alert name: "KubeCPUOvercommit"
> *Message*: `Overcommited CPU resource request quota on Namespaces.`
Note: CPU requests exceed the namespace quota
> ##### Alert name: "KubeMemOvercommit"
> *Message*: `Overcommited Memory resource request quota on Namespaces.`
Note: memory requests exceed the namespace quota
> ##### Alert name: "KubeQuotaExceeded"
> *Message*: `{{ printf "%0.0f" $value }}% usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`
Note: remaining namespace quota is too low
> ### Group: "kubernetes-storage"
> ##### Alert name: "KubePersistentVolumeUsageCritical"
> *Message*: `The persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ printf "%0.0f" $value }}% free.`
Note: free space on the PVC is too low
> ##### Alert name: "KubePersistentVolumeFullInFourDays"
> *Message*: `Based on recent sampling, the persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days.`
Note: the PVC is expected to fill up within 4 days
> ### Group: "kubernetes-system"
> ##### Alert name: "KubeNodeNotReady"
> *Message*: `{{ $labels.node }} has been unready for more than an hour`
Note: node is unavailable
> ##### Alert name: "KubeVersionMismatch"
> *Message*: `There are {{ $value }} different versions of Kubernetes components running.`
Note: system component versions do not match
> ##### Alert name: "KubeClientErrors"
> *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf "%0.0f" $value }}% errors.`
Note: API client error rate is too high
> ##### Alert name: "KubeClientErrors"
> *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf "%0.0f" $value }} errors / sec.`
Note: API client errors per second are too high
> ##### Alert name: "KubeletTooManyPods"
> *Message*: `Kubelet {{$labels.instance}} is running {{$value}} pods, close to the limit of 110.`
Note: Pod count on a single node is approaching the limit of 110
> ##### Alert name: "KubeAPILatencyHigh"
> *Message*: `The API server has a 99th percentile latency of {{ $value }} seconds for {{$labels.verb}} {{$labels.resource}}.`
Note: API server response latency is high
> ##### Alert name: "KubeAPILatencyHigh"
> *Message*: `The API server has a 99th percentile latency of {{ $value }} seconds for {{$labels.verb}} {{$labels.resource}}.`
Note: API server response latency is extremely high
> ##### Alert name: "KubeAPIErrorsHigh"
> *Message*: `API server is erroring for {{ $value }}% of requests.`
Note: API server error rate is high
> ##### Alert name: "KubeAPIErrorsHigh"
> *Message*: `API server is erroring for {{ $value }}% of requests.`
Note: API server error rate is extremely high
> ##### Alert name: "KubeClientCertificateExpiration"
> *Message*: `Kubernetes API certificate is expiring in less than 7 days.`
Note: the cluster certificate expires in less than 7 days
> ##### Alert name: "KubeClientCertificateExpiration"
> *Message*: `Kubernetes API certificate is expiring in less than 1 day.`
Note: the cluster certificate expires in less than 1 day