add README

Commit 00cafdfa authored Nov 22, 2018 by unknown

Showing with 218 additions and 0 deletions

README.md
+218 -0

--- a/README.md
+++ b/README.md
+# __Prometheus Alerting Rule Configuration__
+------
+## Installing Prometheus on a Kubernetes Cluster
+> ### Install prometheus-operator
+> ```shell
+git clone https://github.com/camilb/prometheus-kubernetes.git
+cd prometheus-kubernetes
+./deploy
+```
+> ### Check that the deployment has finished
+> ```shell
+kubectl -n monitoring get pods
+```
+> ### Change the Service type
+> ```shell
+kubectl -n monitoring edit svc grafana
+kubectl -n monitoring edit svc alertmanager-main
+kubectl -n monitoring edit svc prometheus-k8s
+```
+> ```yaml
+spec:
+  clusterIP: None
+  ports:
+  - name: web
+    port: xxxx
+    protocol: TCP
+    targetPort: web
+  selector:
+    app: xxx
+  sessionAffinity: None
+  type: ClusterIP  ## change to NodePort
+```
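The three interactive edits above can also be applied non-interactively. The following is a minimal sketch, assuming the Services are ordinary ClusterIP Services (a headless Service with `clusterIP: None` cannot be switched to NodePort) and that `kubectl patch` is an acceptable substitute for `kubectl edit`. The function name `expose_as_nodeport` is hypothetical and exists only for illustration.

```shell
# Hypothetical helper: switch the three monitoring Services to type NodePort.
# Each patch is a strategic merge that only touches spec.type.
expose_as_nodeport() {
  for svc in grafana alertmanager-main prometheus-k8s; do
    kubectl -n monitoring patch svc "$svc" \
      -p '{"spec":{"type":"NodePort"}}'
  done
}
```

Calling `expose_as_nodeport` once should leave all three Services with an automatically assigned NodePort, which can then be read with `kubectl -n monitoring get svc`.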
+> ### Try to access the Dashboard
+> ```shell
+kubectl get node -o wide
+kubectl -n monitoring get svc
+```
+> ```yaml
+EXTERNAL-IP
+212.64.111.xx  ## external IP of a cluster node
+212.64.43.xxx
+212.64.44.xxx
+```
+> ```yaml
+NAME                PORT(S)
+alertmanager-main   9093:30xxx/TCP  ## the Service's NodePort
+grafana             3000:32xxx/TCP
+prometheus-k8s      9090:31xxx/TCP
+```
+> ```shell
+curl http://[any external cluster IP]:[Service NodePort]
+```
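The manual lookup above (read a node's external IP, read the Service's NodePort, combine them) can be scripted. This is a sketch only: the helper name `nodeport_url` is hypothetical, and it assumes the Service's first port entry is the web port, as in the Service spec shown earlier.

```shell
# Hypothetical helper: print the URL of a NodePort Service in the monitoring namespace.
# $1 = external IP of any cluster node, $2 = Service name.
nodeport_url() {
  # Read the NodePort assigned to the first port of the Service.
  port=$(kubectl -n monitoring get svc "$2" \
    -o jsonpath='{.spec.ports[0].nodePort}')
  printf 'http://%s:%s\n' "$1" "$port"
}
```

For example, `curl "$(nodeport_url [any external cluster IP] grafana)"` would request the Grafana dashboard, provided the Service has already been switched to NodePort.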
+## Configuring Alerting Rules
+> ### View the existing rules in the Prometheus web UI
+> ```yaml
+Open in a browser:
+http://[cluster IP]:[prometheus-k8s port]/rules
+```
+> ### Modify an existing rule
+```shell
+kubectl -n monitoring edit prometheusrules.monitoring.coreos.com prometheus-k8s-rules
+```
+```yaml
+spec:
+  groups:
+  - name: kubernetes-system  ## group name
+    rules:
+    - alert: KubeNodeNotReady  ## alert name
+      annotations:
+        message: '{{ $labels.node }} has been unready for more than an hour'  ## alert message
+        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready
+      expr: |  ## PromQL expression
+        kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
+      for: 1h  ## how long the alert stays Pending before it fires
+      labels:
+        severity: warning
+```
+> ### Add a new rule file
+> ```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    prometheus: k8s
+    role: alert-rules
+  name: prometheus-k8s-rules-extra
+  namespace: monitoring
+spec:
+  groups:
+  - name: NAME
+    rules:
+    - alert: ALERT
+      annotations:
+        message: MESSAGE
+        runbook_url: RUNBOOK_URL
+      expr: |
+        absent(up{job="alertmanager-main"} == 1)
+      for: 15m
+      labels:
+        severity: critical
+```
+> ### Apply the new rules
+```shell
+kubectl apply -f [new rule file]
+```
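After applying a rule file, it can be useful to confirm that Prometheus has actually loaded the new alert. The sketch below assumes the running Prometheus version exposes the `/api/v1/rules` HTTP endpoint and returns compact JSON, and it uses a plain `grep` instead of a JSON parser, so treat it as a rough check. The helper name `rule_loaded` is hypothetical.

```shell
# Hypothetical helper: succeed if a rule named $2 is loaded by the
# Prometheus instance reachable at base URL $1.
rule_loaded() {
  # Rough textual match on the "name" field of the JSON response.
  curl -s "$1/api/v1/rules" | grep -q "\"name\":\"$2\""
}
```

For example, `rule_loaded http://[cluster IP]:[prometheus-k8s port] ALERT && echo loaded`. The operator may need a short while to reload the configuration, so a negative result immediately after `kubectl apply` is not conclusive.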
+## List of Existing Rules
+> ### Group name: "kubernetes-absent"
+> ##### Alert name: "KubeAPIDown"
+> *Message*: `KubeAPI has disappeared from Prometheus target discovery.`  
+Note: Lost connection to the API server
+> ##### Alert name: "KubeControllerManagerDown"
+> *Message*: `KubeControllerManager has disappeared from Prometheus target discovery.`  
+Note: Lost connection to the controller manager
+> ##### Alert name: KubeSchedulerDown
+> *Message*: `KubeScheduler has disappeared from Prometheus target discovery`  
+Note: Lost connection to the scheduler
+> ##### Alert name: KubeletDown
+> *Message*: `Kubelet has disappeared from Prometheus target discovery.`  
+Note: Lost connection to the kubelet
+> ### Group name: kubernetes-apps
+> ##### Alert name: KubePodCrashLooping
+> *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} / second`  
+Note: Pod is restarting repeatedly
+> ##### Alert name: "KubePodNotReady"
+> *Message*: `{{ $labels.namespace }}/{{ $labels.pod }} is not ready.`  
+Note: Pod has not become ready
+> ##### Alert name: "KubeDeploymentGenerationMismatch"
+> *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} generation mismatch`  
+Note: Deployment generation does not match
+> ##### Alert name: "KubeDeploymentReplicasMismatch"
+> *Message*: `Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch`  
+Note: Deployment replica count is incorrect
+> ##### Alert name: "KubeStatefulSetReplicasMismatch"
+> *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} replica mismatch`  
+Note: StatefulSet replica count is incorrect
+> ##### Alert name: "KubeStatefulSetGenerationMismatch"
+> *Message*: `StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} generation mismatch`  
+Note: StatefulSet generation does not match
+> ##### Alert name: "KubeDaemonSetRolloutStuck"
+> *Message*: `Only {{$value}}% of desired pods scheduled and ready for daemon set {{$labels.namespace}}/{{$labels.daemonset}}`  
+Note: DaemonSet rollout is stuck
+> ##### Alert name: "KubeDaemonSetNotScheduled"
+> *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are not scheduled.`  
+Note: DaemonSet Pods are not scheduled
+>
+> ##### Alert name: "KubeDaemonSetMisScheduled"
+> *Message*: `A number of pods of daemonset {{$labels.namespace}}/{{$labels.daemonset}} are running where they are not supposed to run.`  
+Note: DaemonSet Pods are scheduled on the wrong nodes
+>
+> ##### Alert name: "KubeCronJobRunning"
+> *Message*: `CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.`  
+Note: CronJob is taking too long to run
+>
+> ##### Alert name: "KubeJobCompletion"
+> *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 1h to complete.`  
+Note: Job is taking too long to run
+>
+> ##### Alert name: "KubeJobFailed"
+> *Message*: `Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.`  
+Note: Job failed
+>
+> ### Group name: "kubernetes-resources"
+> ##### Alert name: "KubeCPUOvercommit"
+> *Message*: `Overcommited CPU resource requests on Pods, cannot tolerate node failure.`  
+Note: CPU requests exceed node capacity
+> ##### Alert name: "KubeMemOvercommit"
+> *Message*: `Overcommited Memory resource requests on Pods, cannot tolerate node failure.`  
+Note: Memory requests exceed node capacity
+> ##### Alert name: "KubeCPUOvercommit"
+> *Message*: `Overcommited CPU resource request quota on Namespaces.`  
+Note: CPU requests exceed the namespace quota
+> ##### Alert name: "KubeMemOvercommit"
+> *Message*: `Overcommited Memory resource request quota on Namespaces.`  
+Note: Memory requests exceed the namespace quota
+> ##### Alert name: "KubeQuotaExceeded"
+> *Message*: `{{ printf "%0.0f" $value }}% usage of {{ $labels.resource }} in namespace {{ $labels.namespace }}.`  
+Note: Remaining namespace quota is too low
+> ### Group name: "kubernetes-storage"
+> ##### Alert name: "KubePersistentVolumeUsageCritical"
+> *Message*: `The persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has {{ printf "%0.0f" $value }}% free.`  
+Note: Free space on the PVC's volume is too low
+> ##### Alert name: "KubePersistentVolumeFullInFourDays"
+> *Message*: `Based on recent sampling, the persistent volume claimed by {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is expected to fill up within four days.`  
+Note: The PVC's volume is expected to fill up within 4 days
+> ### Group name: "kubernetes-system"
+> ##### Alert name: "KubeNodeNotReady"
+> *Message*: `{{ $labels.node }} has been unready for more than an hour`  
+Note: Node is not ready
+> ##### Alert name: "KubeVersionMismatch"
+> *Message*: `There are {{ $value }} different versions of Kubernetes components running.`  
+Note: System component versions do not match
+> ##### Alert name: "KubeClientErrors"
+> *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf "%0.0f" $value }}% errors.'`  
+Note: API client error rate is too high
+> ##### Alert name: "KubeClientErrors"
+> *Message*: `Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf "%0.0f" $value }} errors / sec.'`  
+Note: API client errors per second are too high
+> ##### Alert name: "KubeletTooManyPods"
+> *Message*: `Kubelet {{$labels.instance}} is running {{$value}} pods, close to the limit of 110.`  
+Note: Pod count on a single node is close to the limit of 110
+> ##### Alert name: "KubeAPILatencyHigh"
+> *Message*: `The API server has a 99th percentile latency of {{ $value }} seconds for {{$labels.verb}} {{$labels.resource}}.`  
+Note: API server response latency is too high
+> ##### Alert name: "KubeAPILatencyHigh"
+> *Message*: `The API server has a 99th percentile latency of {{ $value }} seconds for {{$labels.verb}} {{$labels.resource}}.`  
+Note: API server response latency is extremely high
+> ##### Alert name: "KubeAPIErrorsHigh"
+> *Message*: `API server is erroring for {{ $value }}% of requests.`  
+Note: API server error rate is too high
+> ##### Alert name: "KubeAPIErrorsHigh"
+> *Message*: `API server is erroring for {{ $value }}% of requests.`  
+Note: API server error rate is extremely high
+> ##### Alert name: "KubeClientCertificateExpiration"
+> *Message*: `Kubernetes API certificate is expiring in less than 7 days.`  
+Note: Certificate will expire in less than 7 days
+> ##### Alert name: "KubeClientCertificateExpiration"
+> *Message*: `Kubernetes API certificate is expiring in less than 1 day.`  
+Note: Certificate will expire in less than 1 day