Table of Contents
- I. Introduction to Prometheus
- II. Prometheus components and monitoring
- III. Prometheus basics: how monitoring works
- IV. Kubernetes monitoring metrics
- V. Prometheus basics: deployment
- 1. Deploying prometheus + grafana with docker
- 2. Reviewing the prometheus configuration file
- 3. Monitoring a Linux server
- 3.1 Download the package for your system
- 3.2 Edit the prometheus configuration file to collect the node metrics
- 3.3 Visualize the collected data with grafana
- VI. Deploying the Prometheus components on Kubernetes
- 1. Deploying the components
- 1.1 prometheus-configmap
- 1.2 prometheus-rules
- 1.3 prometheus-deployment
- 1.4 node_exporter
- 1.5 kube-state-metrics: k8s resources (e.g. deployment, daemonset)
- 1.6 grafana
- 2. Adding a grafana data source
- 3. Importing a grafana dashboard
- VII. Prometheus basics: querying data
I. Introduction to Prometheus
Prometheus is a monitoring system originally built at SoundCloud. It became a community open-source project in 2012 and has a very active developer and user community. To emphasize its open-source nature and independent maintenance, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016, becoming its second hosted project after Kubernetes.
Project links: https://prometheus.io/
https://github.com/prometheus
II. Prometheus components and monitoring
- Prometheus Server: collects metrics, stores time-series data, and provides a query interface;
- Client Library: client libraries for instrumenting application code;
- Push Gateway: short-term metric storage, mainly for ephemeral/batch jobs;
- Exporters: collect monitoring metrics from existing third-party services and expose them as metrics;
- Alertmanager: alerting;
- Web UI: a simple web console;
III. Prometheus basics: how monitoring works
To monitor something, you first need to be able to obtain the target's metric data, and that data must follow the Prometheus data model so that Prometheus can recognize and scrape it. The metrics are usually exposed by an exporter, as shown in the figure below.
The exporter collects metrics from each application and exposes them for the prometheus server to scrape; the exporter is what adapts the application's data to the Prometheus data model.
Exporter list:
https://prometheus.io/docs/instrumenting/exporters/
This page lists exporter components for many applications: nginx, databases, message queues, and so on. A quick look at the exposition format is sketched below.
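The following sketch assumes a node_exporter is already running on localhost:9100 (section V shows how to start one); it just shows what exporter output looks like:
curl -s http://localhost:9100/metrics | head -n 5
# plain text in the Prometheus exposition format, e.g.:
# # HELP node_load1 1m load average.
# # TYPE node_load1 gauge
# node_load1 0.21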
IV. Kubernetes monitoring metrics
Monitoring Kubernetes itself:
- Node resource utilization
- Number of Nodes
- Number of Pods running on each Node
- Status of resource objects (deployment, svc, pod, etc.)
Pod monitoring:
- Total number of Pods and the expected number for each controller
- Pod status (whether it is Running)
- Container resource utilization: CPU, memory, network
Kubernetes monitoring approach:
- Pod: the kubelet on each node exposes, via cAdvisor, performance metrics for all Pods and containers running on that node;
  metrics endpoint: https://NodeIP:10250/metrics/cadvisor
  curl -k https://NodeIP:10250/metrics/cadvisor
  (the kubelet serves a self-signed certificate, so an unauthenticated request returns Unauthorized; see the token sketch after this list)
- Node: node_exporter collects node resource utilization (deployed as a DaemonSet);
  project: https://github.com/prometheus/node_exporter
- K8s resource objects: kube-state-metrics collects the state of the various resource objects in the cluster;
  project: https://github.com/kubernetes/kube-state-metrics
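Since the kubelet requires authentication, here is a hedged sketch of an authorized request; it assumes the "prometheus" ServiceAccount and ClusterRole from section VI already exist, and kubectl >= 1.24 (for kubectl create token):
TOKEN=$(kubectl create token prometheus -n ops)
curl -sk -H "Authorization: Bearer $TOKEN" https://NodeIP:10250/metrics/cadvisor | head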
V. Prometheus basics: deployment
prometheus installation docs: https://prometheus.io/docs/prometheus/latest/installation/
grafana installation docs: https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/
grafana dashboard catalog: https://grafana.com/grafana/dashboards/?pg=graf&plcmt=dashboard-below-text
1. Deploying prometheus + grafana with docker
prometheus
The data directory of the dockerized prometheus is /prometheus.
mkdir /home/prometheus-data
chown -R 65534:65534 /home/prometheus-data
# Mount your prepared prometheus config file into the container (the container reads it at startup),
# and persist the data directory to the host:
docker run -itd \
-p 9090:9090 \
-v /home/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /home/prometheus-data:/prometheus \
prom/prometheus
grafana
The data directory of the dockerized grafana is /var/lib/grafana.
mkdir /home/grafana-data
chmod 777 /home/grafana-data
docker run -d -p 3000:3000 --name=grafana \
--volume /home/grafana-data:/var/lib/grafana \
grafana/grafana:8.4.0
# Username/password: admin/admin (you are asked to reset the password on first login)
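Once both containers are up, you can sanity-check them from the shell (a quick sketch; adjust hosts/ports to your environment):
curl -s http://localhost:9090/-/healthy    # prometheus liveness endpoint
curl -s http://localhost:3000/api/health   # grafana health endpoint, returns a small JSON status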
2. Reviewing the prometheus configuration file
# Global configuration
global:
  # How often to scrape targets (default: 1m)
  scrape_interval: 15s
  # How often to evaluate alerting rules (default: 1m)
  evaluation_interval: 15s
  # Scrape timeout (default: 10s)
  scrape_timeout: 5s
# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Files that hold the alerting rules
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# Scrape configuration: each monitored endpoint becomes a target; targets are grouped and
# managed per job_name, and can be configured statically or via service discovery
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
- Target (targets): a monitored endpoint
- Instance (instances): each monitored endpoint is called an instance
- Job (jobs): a group of instances serving the same purpose is called a job
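Before pointing prometheus at a new configuration, it is worth validating the file with promtool, which ships inside the official image (a sketch using the same file path as above):
docker run --rm -v /home/prometheus.yml:/etc/prometheus/prometheus.yml \
  --entrypoint promtool prom/prometheus check config /etc/prometheus/prometheus.yml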
3. Monitoring a Linux server
node_exporter: a metrics collector for monitoring Linux systems
Commonly collected metrics:
- CPU
- memory
- disk
- network traffic
- file descriptors
- system load
- system services
Metrics endpoint: http://IP:9100/metrics
Documentation: https://prometheus.io/docs/guides/node-exporter/
GitHub: https://github.com/prometheus/node_exporter
3.1 Download the package for your system
https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
Download it on the Linux machine, extract it, and run the binary directly:
[root@k8s-node1 fands]# tar -zxf node_exporter-1.8.2.linux-amd64.tar.gz
[root@k8s-node1 fands]# cd node_exporter-1.8.2.linux-amd64
[root@k8s-node1 node_exporter-1.8.2.linux-amd64]# ./node_exporter
Once started, open IP:9100/metrics in a browser; node_exporter publishes all of the node metrics it collects there.
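A quick check from the node itself (the output line is an example; your values will differ):
curl -s http://localhost:9100/metrics | grep ^node_memory_MemTotal
# node_memory_MemTotal_bytes 3.973541888e+09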
3.2 Edit the prometheus configuration file to collect the node metrics
[root@k8s-master prometheus]# docker ps -a | grep prometheus
6b25be4431be prom/prometheus "/bin/prometheus --c…" 4 weeks ago Up 6 minutes 0.0.0.0:9090->9090/tcp prometheus
[root@k8s-master prometheus]# docker exec -it 6b25be4431be /bin/sh
/prometheus $ vi /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  # Add the new scrape job
  - job_name: "linux_server"
    static_configs:
      - targets: ["192.168.1.2:9100"]
Then restart the prometheus container so it picks up the change (an alternative that avoids the restart is sketched a few lines below):
[root@k8s-master prometheus]# docker restart 6b25be4431be
6b25be4431be
Prometheus now shows the data scraped from our Linux server.
Type node in the query box to see everything the node_exporter collected, e.g. node memory metrics.
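If prometheus was started with --web.enable-lifecycle (the in-cluster deployment in section VI enables it), the configuration can be hot-reloaded instead of restarting; a SIGHUP also works:
curl -X POST http://localhost:9090/-/reload
docker kill --signal=SIGHUP 6b25be4431be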
3.3 Visualize the collected data with grafana
Add the prometheus data source in grafana and import dashboard template 9276 to visualize the data.
VI. Deploying the Prometheus components on Kubernetes
1. Deploying the components
1.1 prometheus-configmap
# prometheus-config.yaml: the main configuration file; it defines the Kubernetes service-discovery scrape jobs and the alerting wiring
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: ops
data:
prometheus.yml: |
rule_files:
- /etc/config/rules/*.rules
scrape_configs:
- job_name: prometheus
static_configs:
- targets:
- localhost:9090
    # Scrape kube-apiserver metrics
- job_name: kubernetes-apiservers
      # Discover targets via the Kubernetes API (endpoints role)
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
        # Keep only the endpoint named "kubernetes", port "https", in the default namespace (kubectl get ep)
regex: default;kubernetes;https
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
scheme: https
      # Authorization: ca.crt is the cluster root CA (for kubeadm installs it defaults to /etc/kubernetes/pki/ca.crt);
      # the token is auto-generated for the ServiceAccount and is used to authenticate against kube-apiserver
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    # Scrape metrics from every node in the cluster
- job_name: 'kubernetes-nodes-kubelet'
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
        # The actual metrics endpoint is https://NodeIP:10250/metrics
replacement: /metrics
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    # Scrape metrics for the Pods running on each node (via cAdvisor)
- job_name: 'kubernetes-nodes-cadvisor'
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
        # The actual metrics endpoint is https://NodeIP:10250/metrics/cadvisor; this replaces the default metrics URL path
replacement: /metrics/cadvisor
metric_relabel_configs:
      # Copy the value of the instance label into a new "node" label
- source_labels: [instance]
separator: ;
regex: (.+)
target_label: node
replacement: $1
action: replace
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    # Scrape the Pods behind the Endpoints of each Service
- job_name: kubernetes-service-endpoints
kubernetes_sd_configs:
      - role: endpoints # discover Pods through the Endpoints listed for each Service
relabel_configs:
      # Skip Services that do not carry the prometheus.io/scrape annotation
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
      # Rewrite the scrape scheme from the prometheus.io/scheme annotation
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
      # Rewrite the metrics URL path from the prometheus.io/path annotation
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
      # Rewrite the target address using the prometheus.io/port annotation
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
      # Map the K8s Service labels (.*) to new label names, keeping their values
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
      # Add a namespace label
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
      # Add a Service-name label
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
    # Scrape metrics about k8s resource objects (deployment/svc/ing/secrets, etc.)
- job_name: kube-state-metrics
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- ops
relabel_configs:
- source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
regex: kube-state-metrics
replacement: $1
action: keep
    # Scrape Pod metrics (Pods whose prometheus.io/scrape annotation is "true")
- job_name: kubernetes-pods
kubernetes_sd_configs:
      - role: pod # discover individual Pods as targets
      relabel_configs:
      # Skip Pods that do not carry the prometheus.io/scrape annotation
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
      # Rewrite the metrics URL path from the prometheus.io/path annotation
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
      # Rewrite the target address using the prometheus.io/port annotation
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
      # Map the K8s Pod labels (.*) to new label names, keeping their values
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
      # Add a namespace label
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
      # Add a Pod-name label
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:80"]
1.2 prometheus-rules
# prometheus-rules.yaml: the alerting-rule configuration; it defines the threshold for each metric
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: ops
data:
general.rules: |
groups:
- name: general.rules
rules:
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: error
alertinstance: '{{ $labels.job }}/{{ $labels.instance }}'
annotations:
summary: "Instance {{ $labels.instance }} 停止工作"
description: "{{ $labels.instance }} job {{ $labels.job }} 已经停止5分钟以上."
node.rules: |
groups:
- name: node.rules
rules:
- alert: NodeFilesystemUsage
expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs|ext3"} / node_filesystem_size_bytes{fstype=~"ext4|xfs|ext3"} * 100) > 80
for: 1m
labels:
severity: warning
alertinstance: '{{ $labels.instance }}:{{ $labels.device }}'
annotations:
summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} 分区使用率过高"
description: "{{ $labels.instance }}: {{ $labels.mountpoint }} 分区使用大于80% (当前值: {{ $value }})"
- alert: NodeMemoryUsage
expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
for: 1m
labels:
severity: warning
alertinstance: '{{ $labels.instance }}'
annotations:
summary: "Instance {{ $labels.instance }} 节点内存使用率过高"
description: "{{ $labels.instance }}节点内存使用大于80% (当前值: {{ $value }})"
- alert: NodeCPUUsage
expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
for: 1m
labels:
severity: warning
alertinstance: '{{ $labels.instance }}'
annotations:
summary: "Instance {{ $labels.instance }} 节点CPU使用率过高"
description: "{{ $labels.instance }}节点CPU使用大于60% (当前值: {{ $value }})"
- alert: KubeNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 1m
labels:
severity: error
alertinstance: '{{ $labels.node }}/{{ $labels.instance }}'
annotations:
description: "{{ $labels.node }} 节点离线 已经有10多分钟没有准备好了"
pod.rules: |
groups:
- name: pod.rules
rules:
- alert: PodCPUUsage
expr: (sum(rate(container_cpu_usage_seconds_total{image!=""}[3m])) by (pod,namespace)) / (sum(container_spec_cpu_quota{image!=""}) by (pod,namespace) /100000) *100 > 80
for: 5m
labels:
severity: warning
alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} CPU使用大于80% (当前值: {{ $value }})"
description: "{{ $labels.namespace }}/{{ $labels.pod }} CPU使用大于80% (当前值: {{ $value }})"
- alert: PodMemoryUsage
expr: sum(container_memory_rss{image!=""}) by(pod, namespace) / sum(container_spec_memory_limit_bytes{image!=""}) by(pod, namespace) * 100 != +inf > 80
for: 5m
labels:
severity: warning
alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} 内存使用大于80% (当前值: {{ $value }})"
description: "{{ $labels.namespace }}/{{ $labels.pod }} 内存使用大于80% (当前值: {{ $value }})"
- alert: PodNetworkReceive
expr: sum(rate(container_network_receive_bytes_total{image!="",name=~"^k8s_.*"}[5m]) /1000) by (pod,namespace) > 30000
for: 5m
labels:
severity: warning
alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} 入口流量大于30MB/s (当前值: {{ $value }}K/s)"
description: "{{ $labels.namespace }}/{{ $labels.pod }}:{{ $labels.interface }} 入口流量大于30MB/s (当前值: {{ $value }}K/s)"
- alert: PodNetworkTransmit
expr: sum(rate(container_network_transmit_bytes_total{image!="",name=~"^k8s_.*"}[5m]) /1000) by (pod,namespace) > 30000
for: 5m
labels:
severity: warning
alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }}出口流量大于30MB/s (当前值: {{ $value }}/K/s)"
description: "{{ $labels.namespace }}/{{ $labels.pod }}:{{ $labels.interface }} 出口流量大于30MB/s (当前值: {{ $value }}/K/s)"
- alert: PodRestart
expr: sum(changes(kube_pod_container_status_restarts_total[1m])) by (pod,namespace) > 0
for: 1m
labels:
severity: warning
alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod重启 (当前值: {{ $value }})"
description: "{{ $labels.namespace }}/{{ $labels.pod }} Pod重启 (当前值: {{ $value }})"
- alert: PodNotHealthy
expr: sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
for: 5m
labels:
severity: error
alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}:{{ $labels.phase }}'
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态不健康 (当前值: {{ $value }})"
description: "{{ $labels.namespace }}/{{ $labels.pod }} Pod状态不健康 (当前值: {{ $labels.phase }})"
1.3 prometheus-deployment
# prometheus-deployment.yaml: deploys prometheus itself, plus its PVC, Service, and RBAC objects
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: ops
labels:
k8s-app: prometheus
spec:
replicas: 1
selector:
matchLabels:
k8s-app: prometheus
template:
metadata:
labels:
k8s-app: prometheus
spec:
serviceAccountName: prometheus
initContainers:
- name: "init-chown-data"
image: "busybox:latest"
imagePullPolicy: "IfNotPresent"
command: ["chown", "-R", "65534:65534", "/data"]
volumeMounts:
- name: prometheus-data
mountPath: /data
subPath: ""
containers:
      # Sidecar container that hot-reloads the ConfigMap-mounted configuration when it changes
- name: prometheus-server-configmap-reload
image: "jimmidyson/configmap-reload:v0.1"
imagePullPolicy: "IfNotPresent"
args:
- --volume-dir=/etc/config
- --webhook-url=http://localhost:9090/-/reload
volumeMounts:
- name: config-volume
mountPath: /etc/config
readOnly: true
resources:
limits:
cpu: 10m
memory: 10Mi
requests:
cpu: 10m
memory: 10Mi
- name: prometheus-server
image: "prom/prometheus:v2.45.4"
imagePullPolicy: "IfNotPresent"
args:
- --config.file=/etc/config/prometheus.yml
- --storage.tsdb.path=/data
        # Data retention period (3 days)
        - --storage.tsdb.retention.time=3d
- --web.console.libraries=/etc/prometheus/console_libraries
- --web.console.templates=/etc/prometheus/consoles
- --web.enable-lifecycle
ports:
- containerPort: 9090
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
resources:
limits:
cpu: 500m
memory: 1500Mi
requests:
cpu: 200m
memory: 1000Mi
volumeMounts:
- name: config-volume
mountPath: /etc/config
- name: prometheus-data
mountPath: /data
subPath: ""
- name: prometheus-rules
mountPath: /etc/config/rules
- name: prometheus-etcd
mountPath: /var/run/secrets/kubernetes.io/etcd-certs
- name: timezone
mountPath: /etc/localtime
volumes:
- name: config-volume
configMap:
name: prometheus-config
- name: prometheus-rules
configMap:
name: prometheus-rules
- name: prometheus-data
persistentVolumeClaim:
claimName: prometheus
- name: prometheus-etcd
secret:
secretName: etcd-certs
- name: timezone
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus
namespace: ops
spec:
storageClassName: "managed-nfs-storage"
accessModes:
- ReadWriteMany
resources:
requests:
storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: ops
spec:
type: NodePort
ports:
- name: http
port: 9090
protocol: TCP
targetPort: 9090
nodePort: 30090
selector:
k8s-app: prometheus
---
# Credentials authorizing prometheus to access Kubernetes API resources
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups:
- ""
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- nonResourceURLs:
- "/metrics"
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: ops
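A quick way to confirm the RBAC wiring before troubleshooting scrape errors (a sketch using kubectl's impersonation check):
kubectl auth can-i list pods --as=system:serviceaccount:ops:prometheus
kubectl auth can-i get nodes --subresource=metrics --as=system:serviceaccount:ops:prometheus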
1.4 node_exporter
# node-exporter.yaml: collects node metrics; deployed as a DaemonSet and annotated so that prometheus discovers and scrapes it
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: ops
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
annotations:
        # This annotation marks the target so that prometheus scrapes it
prometheus.io/scrape: "true"
#prometheus.io/scheme: "http"
#prometheus.io/path: "/metrics"
prometheus.io/port: "9100"
spec:
tolerations:
- effect: NoSchedule
operator: Exists
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: "prom/node-exporter:latest"
args:
- --path.rootfs=/host
- --web.listen-address=:9100
ports:
- name: metrics
containerPort: 9100
volumeMounts:
- name: rootfs
mountPath: /host
readOnly: true
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
volumes:
- name: rootfs
hostPath:
path: /
---
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: ops
annotations:
prometheus.io/scrape: "true"
spec:
clusterIP: None
ports:
- name: metrics
port: 9100
protocol: TCP
targetPort: 9100
selector:
    # must match the Pod labels set by the DaemonSet template
    app: node-exporter
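After applying, verify that the headless Service actually resolves to the node-exporter Pods; an empty ENDPOINTS column means the selector does not match the Pod labels:
kubectl -n ops get endpoints node-exporter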
1.5 kube-state-metrics: k8s resources (e.g. deployment, daemonset)
# kube-state-metrics.yaml: collects the state of k8s resource objects such as deployments/services, annotated so prometheus scrapes it
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: ops
spec:
selector:
matchLabels:
app: kube-state-metrics
replicas: 1
template:
metadata:
labels:
app: kube-state-metrics
annotations:
prometheus.io/scrape: "true"
##prometheus.io/scheme: "http"
##prometheus.io/path: "/metrics"
prometheus.io/port: "8080"
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: registry.cn-shenzhen.aliyuncs.com/starsl/kube-state-metrics:v2.3.0
ports:
- name: http-metrics
containerPort: 8080
- name: telemetry
containerPort: 8081
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
securityContext:
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 65534
volumeMounts:
- name: timezone
mountPath: /etc/localtime
volumes:
- name: timezone
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics
namespace: ops
spec:
ports:
- name: http-metrics
port: 8080
targetPort: http-metrics
protocol: TCP
- name: telemetry
port: 8081
targetPort: telemetry
protocol: TCP
selector:
app: kube-state-metrics
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups:
- ""
resources:
- configmaps
- secrets
- nodes
- pods
- services
- serviceaccounts
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs:
- list
- watch
- apiGroups:
- apps
resources:
- statefulsets
- daemonsets
- deployments
- replicasets
verbs:
- list
- watch
- apiGroups:
- batch
resources:
- cronjobs
- jobs
verbs:
- list
- watch
- apiGroups:
- autoscaling
resources:
- horizontalpodautoscalers
verbs:
- list
- watch
- apiGroups:
- authentication.k8s.io
resources:
- tokenreviews
verbs:
- create
- apiGroups:
- authorization.k8s.io
resources:
- subjectaccessreviews
verbs:
- create
- apiGroups:
- policy
resources:
- poddisruptionbudgets
verbs:
- list
- watch
- apiGroups:
- certificates.k8s.io
resources:
- certificatesigningrequests
verbs:
- list
- watch
- apiGroups:
- discovery.k8s.io
resources:
- endpointslices
verbs:
- list
- watch
- apiGroups:
- storage.k8s.io
resources:
- storageclasses
- volumeattachments
verbs:
- list
- watch
- apiGroups:
- admissionregistration.k8s.io
resources:
- mutatingwebhookconfigurations
- validatingwebhookconfigurations
verbs:
- list
- watch
- apiGroups:
- networking.k8s.io
resources:
- networkpolicies
- ingressclasses
- ingresses
verbs:
- list
- watch
- apiGroups:
- coordination.k8s.io
resources:
- leases
verbs:
- list
- watch
- apiGroups:
- rbac.authorization.k8s.io
resources:
- clusterrolebindings
- clusterroles
- rolebindings
- roles
verbs:
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: ops
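To confirm kube-state-metrics is serving object-state metrics, port-forward its Service and look for the well-known kube_* series:
kubectl -n ops port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep ^kube_deployment_status_replicas | head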
1.6 grafana
# grafana.yaml: visualizes the collected data
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: ops
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:8.4.0
ports:
- containerPort: 3000
protocol: TCP
resources:
limits:
cpu: 100m
memory: 256Mi
requests:
cpu: 100m
memory: 256Mi
volumeMounts:
- name: grafana-data
mountPath: /var/lib/grafana
subPath: grafana
securityContext:
fsGroup: 472
runAsUser: 472
volumes:
- name: grafana-data
persistentVolumeClaim:
claimName: grafana
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: grafana
namespace: ops
spec:
storageClassName: "managed-nfs-storage"
accessModes:
- ReadWriteMany
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: ops
spec:
type: NodePort
ports:
  - port: 80
targetPort: 3000
nodePort: 30030
selector:
app: grafana
# Apply the YAML manifests above in order
[root@k8s-master prometheus]# kubectl create ns ops
[root@k8s-master prometheus]# kubectl apply -f prometheus-configmap.yaml
[root@k8s-master prometheus]# kubectl apply -f prometheus-rules.yaml
[root@k8s-master prometheus]# kubectl apply -f prometheus-deployment.yaml
[root@k8s-master prometheus]# kubectl apply -f grafana.yaml
[root@k8s-master prometheus]# kubectl get pod,svc -n ops
NAME READY STATUS RESTARTS AGE
pod/grafana-79c5bfb955-lxq6w 1/1 Running 0 58s
pod/prometheus-5ccf96b898-zd8b2 2/2 Running 0 5m6s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/grafana NodePort 10.111.187.83 <none> 80:30030/TCP 57s
service/prometheus NodePort 10.111.181.145 <none> 9090:30090/TCP 5m12s
Once the Pods are running as shown above, access prometheus at NodeIP:30090 and grafana at NodeIP:30030.
Check the prometheus targets (defined in prometheus-configmap.yaml):
And the prometheus alerting rules (defined in prometheus-rules.yaml):
Accessing grafana looks like this:
# Username/password: admin/admin (you are asked to reset the password on first login)
2. Adding a grafana data source
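The data source can also be created without the UI, through the grafana HTTP API (a sketch assuming the default admin credentials and the in-cluster Service name prometheus from the manifests above; replace NodeIP accordingly):
curl -X POST -H "Content-Type: application/json" \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy"}' \
  http://admin:admin@NodeIP:30030/api/datasources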
3. Importing a grafana dashboard
Dashboards --> Manage --> Import --> Upload (ID 13105)
Alternatively, use my modified version, which adds JVM monitoring: the developers expose the metrics on the application side, and each service is then scraped individually.
The application's deployment configuration must enable metric collection.
K8s microservices monitoring dashboard template
VII. Prometheus basics: querying data
PromQL (Prometheus Query Language) is Prometheus's own query DSL. It is highly expressive, supports filters and operators, and ships with a large set of built-in functions for querying monitoring data along any dimension.
Data model:
- prometheus stores all data as time series (in its built-in TSDB);
- samples with the same metric name and the same set of labels belong to the same metric;
- every time series is uniquely identified by its metric name and a set of key-value pairs called labels; labels are used to select specific series.
Metric format:
<metric name>{<label name>=<label value>, ...}
Examples:
Query the latest sample of a metric (an "instant vector", i.e. the most recent data):
node_cpu_seconds_total
Narrow the query down by attaching a set of labels:
node_cpu_seconds_total{job="linux_server"}
# the key="value" pairs inside { } are the filter conditions
Query the samples from the last 5 minutes (a "range vector", i.e. historical data; time units: s, m, h, d, w, y):
node_cpu_seconds_total{job="linux_server"}[5m]
node_cpu_seconds_total{job="linux_server"}[1h]
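Counters like node_cpu_seconds_total are usually wrapped in a rate function and aggregated; for example, the per-node CPU utilization expression used by the NodeCPUUsage alert earlier, and the matching memory-utilization expression:
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100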