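Symptoms
The k8s cluster is unreachable; specifically, kubectl commands do not work.
Approach
Check the apiserver service status, then the etcd cluster status. The ERRORS column of the etcd status output shows an alarm:NOSPACE alarm.
Resolution & analysis
Details
Restoring service comes first, so start by disarming the alarm and see whether the cluster recovers.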
etcdctl --endpoints=https://192.168.1.15:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key alarm disarm
Verify whether any alarms remain in the cluster
etcdctl --endpoints=https://192.168.1.15:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key alarm list
The cluster is accessible again.
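Every etcdctl call in this write-up repeats the same endpoint and TLS flags. A minimal sketch of one way to shorten them, assuming etcdctl v3 picks up the ETCDCTL_-prefixed environment variables and that `alarm list` prints lines of the form `memberID:<id> alarm:NOSPACE`. The helper name has_nospace is hypothetical and only used here for illustration.

```shell
# Export the connection settings once; later etcdctl calls can omit the flags.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://192.168.1.15:2379,https://192.168.1.17:2379,https://192.168.1.38:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# has_nospace (hypothetical helper): reads `etcdctl alarm list` output on stdin
# and succeeds if any member still reports a NOSPACE alarm.
has_nospace() {
  grep -q 'alarm:NOSPACE'
}
```

Possible usage: `etcdctl alarm list | has_nospace && echo "NOSPACE still raised"`.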
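Continuing to monitor: the DB SIZE on node 192.168.1.15 has reached 2.1 GB, while the other two nodes are normal.
Following the method in the references, I compacted node 1's data to free space in boltdb and then defragmented the database file. The result was that the node could neither compact nor sync new keys, and defragmenting did not help either. Node 1's revision lagged far behind the other two nodes, so running compaction failed with: Error from server: etcdserver: mvcc: required revision is a future revision. This error is rarely discussed online. At this point most guides simply move on to cleaning up the DB size on the other two machines, which does not fix the problem on etcd1. Instead, I removed node 1 from the cluster and rejoined it using a snapshot taken from a healthy node, so that all three nodes hold consistent history.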
Compact and free space
In the endpoint status -w table output, the RAFT INDEX and RAFT APPLIED INDEX fields are identical across the nodes; only -w json shows the latest revision each node has synced to.
etcdctl --endpoints=https://192.168.1.15:2379,https://192.168.1.17:2379,https://192.168.1.38:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out="json"
[{"Endpoint":"https://192.168.1.15:2379","Status":{"header":{"cluster_id":3483896258597590173,"member_id":14117095307229358704,"revision":542770334,"raft_term":420},"version":"3.5.0","dbSize":2146328576,"leader":17230009819260537482,"raftIndex":664811268,"raftTerm":420,"raftAppliedIndex":664811268,"dbSizeInUse":2146320384}},{"Endpoint":"https://192.168.1.17:2379","Status":{"header":{"cluster_id":3483896258597590173,"member_id":17230009819260537482,"revision":582270095,"raft_term":420},"version":"3.5.0","dbSize":718491648,"leader":17230009819260537482,"raftIndex":664811268,"raftTerm":420,"raftAppliedIndex":664811268,"dbSizeInUse":69287936}},{"Endpoint":"https://192.168.1.38:2379","Status":{"header":{"cluster_id":3483896258597590173,"member_id":9235900205731567691,"revision":582270095,"raft_term":420},"version":"3.5.0","dbSize":718491648,"leader":17230009819260537482,"raftIndex":664811268,"raftTerm":420,"raftAppliedIndex":664811268,"dbSizeInUse":69283840}}]
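The "required revision is a future revision" error is what etcd returns when the requested compaction revision is greater than the member's current revision, so the revision to compact at has to be one that every targeted member has already reached. A rough sketch for pulling the per-member revisions out of the JSON above with plain grep/sed (which also avoids rounding the 64-bit IDs), assuming each member object lists "Endpoint" before its header "revision" as in this output. The function names are hypothetical.

```shell
# list_revisions: reads `endpoint status --write-out=json` on stdin and prints
# one "<endpoint> <revision>" pair per member.
list_revisions() {
  grep -oE '"Endpoint":"[^"]+"|"revision":[0-9]+' \
    | sed -E 's/^"Endpoint":"([^"]+)"$/\1/; s/^"revision":([0-9]+)$/\1/' \
    | paste -d' ' - -
}

# min_revision: the lowest revision among the members, i.e. the highest value
# that is not a "future revision" for any of them.
min_revision() {
  list_revisions | awk '{ print $2 }' | sort -n | head -n 1
}
```

Fed the status output above, list_revisions shows 192.168.1.15 at 542770334 and the other two members at 582270095, and min_revision yields 542770334, the value used in the compact command that follows.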
etcdctl --endpoints=https://192.168.1.15:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key compact 542770334
compacted revision 542770334
Defragment (wait a moment after compaction before running it; run too early, it failed with the error below)
{"level":"warn","ts":"2024-12-09T10:08:09.160931+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://192.168.1.15:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://192.168.1.15:2379] (context deadline exceeded)
etcdctl --endpoints=https://192.168.1.15:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key defrag
Finished defragmenting etcd member[https://192.168.1.15:2379]
etcd1 log output
{"level":"info","ts":"2024-12-09T02:15:18.829Z","caller":"v3rpc/maintenance.go:89","msg":"starting defragment"}
{"level":"info","ts":"2024-12-09T02:15:18.832Z","caller":"backend/backend.go:479","msg":"defragmenting","path":"/var/lib/etcd/member/snap/db","current-db-size-bytes":1039929344,"current-db-size":"1.0 GB","current-db-size-in-use-bytes":124841984,"current-db-size-in-use":"125 MB"}
{"level":"info","ts":"2024-12-09T02:15:19.759Z","caller":"backend/backend.go:537","msg":"finished defragmenting directory","path":"/var/lib/etcd/member/snap/db","current-db-size-bytes-diff":-915320832,"current-db-size-bytes":124608512,"current-db-size":"125 MB","current-db-size-in-use-bytes-diff":-253952,"current-db-size-in-use-bytes":124588032,"current-db-size-in-use":"125 MB","took":"929.794141ms"}
{"level":"info","ts":"2024-12-09T02:15:19.759Z","caller":"v3rpc/maintenance.go:95","msg":"finished defragment"}
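The "context deadline exceeded" failure above went away after waiting and running defrag again. A small sketch of automating that wait-and-retry, in plain POSIX sh. The retry helper is hypothetical, and raising etcdctl's --command-timeout (whose default is 5s) is an assumption about what may help on a large database, not something the original run used.

```shell
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    "$@" && return 0
    # Sleep only between attempts, not after the final failure.
    [ "$retry_i" -lt "$retry_attempts" ] && sleep "$retry_delay"
    retry_i=$((retry_i + 1))
  done
  return 1
}
```

Possible usage: `retry 3 10 etcdctl --command-timeout=60s --endpoints=https://192.168.1.15:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key defrag`. Defragmenting one member at a time keeps the rest of the cluster responsive, since a member blocks reads and writes while it rebuilds its database file.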
Node status after cleanup
root@dev-k8s-master03:~# etcdctl --endpoints=https://192.168.1.15:2379,https://192.168.1.17:2379,https://192.168.1.38:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out="table"
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.15:2379 | c3e9ff62e7bd6a70 | 3.5.0 | 126 MB | false | false | 420 | 664830103 | 664830103 | |
| https://192.168.1.17:2379 | ef1d4aa461844a8a | 3.5.0 | 718 MB | true | false | 420 | 664830103 | 664830103 | |
| https://192.168.1.38:2379 | 802c824d9f96584b | 3.5.0 | 718 MB | false | false | 420 | 664830103 | 664830103 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Rejoin the cluster
root@dev-k8s-master03:~# etcdctl --endpoints=https://192.168.1.38:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /tmp/etcd-snapshot-`date +%F-%T`.db
{"level":"info","ts":1733714785.0831125,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"/tmp/etcd-snapshot-2024-12-09-11:26:25.db.part"}
{"level":"info","ts":"2024-12-09T11:26:25.088262+0800","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1733714785.0883157,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://192.168.1.38:2379"}
{"level":"info","ts":"2024-12-09T11:26:29.600605+0800","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1733714789.8358912,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://192.168.1.38:2379","size":"718 MB","took":4.752708895}
{"level":"info","ts":1733714789.836093,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/tmp/etcd-snapshot-2024-12-09-11:26:25.db"}
Snapshot saved at /tmp/etcd-snapshot-2024-12-09-11:26:25.db
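Node 1 was removed from the cluster before being re-added (its member ID changes between the two status tables), but the removal command is not shown. A sketch of that step, assuming the default comma-separated `etcdctl member list` output (ID, status, name, peer URLs, client URLs, learner flag). The helper name member_id_by_name is hypothetical.

```shell
# member_id_by_name <name>: reads `etcdctl member list` output on stdin and
# prints the hex member ID whose name column matches exactly.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Intended use, run against a healthy member (not executed here):
#   id=$(etcdctl member list | member_id_by_name dev-k8s-master01)
#   etcdctl member remove "$id"
```

`member remove` takes the hexadecimal ID shown in the table output, not the decimal member_id that appears in the JSON status.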
Add the member to the cluster
(base) root@dev-k8s-master03:/etc/kubernetes/manifests# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member add dev-k8s-master01 --peer-urls=https://192.168.1.15:2380
Member e84022fd6cf784bd added to cluster 305949305005949d
ETCD_NAME="dev-k8s-master01"
ETCD_INITIAL_CLUSTER="dev-k8s-master03=https://192.168.1.38:2380,dev-k8s-master01=https://192.168.1.15:2380,dev-k8s-master02=https://192.168.1.17:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.1.15:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
On node 1, run the following
mv /var/lib/etcd /tmp/
root@dev-k8s-master01:/etc/kubernetes/manifests# etcdctl --endpoints=192.168.1.15:2379 snapshot restore /tmp/etcd-snapshot-2024-12-09-11:26:25.db --data-dir=/var/lib/etcd/ --name=dev-k8s-master01 --initial-cluster dev-k8s-master05=https://192.168.0.90:2380,dev-k8s-master01=https://192.168.1.15:2380,dev-k8s-master03=https://192.168.1.38:2380,dev-k8s-master02=https://192.168.1.17:2380 --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls https://192.168.1.15:2380
{"level":"info","ts":1733716067.1317353,"caller":"snapshot/v3_snapshot.go:306","msg":"restoring snapshot","path":"/tmp/etcd-snapshot-2024-12-09-11:26:25.db","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd/","snap-dir":"/var/lib/etcd/member/snap"}
{"level":"info","ts":1733716070.3973181,"caller":"mvcc/kvstore.go:388","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":582286281}
{"level":"info","ts":1733716070.4496565,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"a9e5595b49b6f283","local-member-id":"0","added-peer-id":"5c03789b52368037","added-peer-peer-urls":["https://192.168.1.15:2380"]}
{"level":"info","ts":1733716070.4497328,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"a9e5595b49b6f283","local-member-id":"0","added-peer-id":"784dac5337680b08","added-peer-peer-urls":["https://192.168.1.17:2380"]}
{"level":"info","ts":1733716070.4497697,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"a9e5595b49b6f283","local-member-id":"0","added-peer-id":"aed753eb6f15876c","added-peer-peer-urls":["https://192.168.1.38:2380"]}
{"level":"info","ts":1733716070.449801,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"a9e5595b49b6f283","local-member-id":"0","added-peer-id":"d71e7a79c5fb9cce","added-peer-peer-urls":["https://192.168.0.90:2380"]}
{"level":"info","ts":1733716070.4569333,"caller":"snapshot/v3_snapshot.go:326","msg":"restored snapshot","path":"/tmp/etcd-snapshot-2024-12-09-11:26:25.db","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd/","snap-dir":"/var/lib/etcd/member/snap"}
The etcd log records the join process in detail; it is not reproduced here.
Verify the status
(base) root@dev-k8s-master03:/etc/kubernetes/manifests# etcdctl --endpoints=https://192.168.1.15:2379,https://192.168.1.17:2379,https://192.168.1.38:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out="table"
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.1.15:2379 | e84022fd6cf784bd | 3.5.0 | 718 MB | false | false | 420 | 664841654 | 664841654 | |
| https://192.168.1.17:2379 | ef1d4aa461844a8a | 3.5.0 | 718 MB | true | false | 420 | 664841654 | 664841654 | |
| https://192.168.1.38:2379 | 802c824d9f96584b | 3.5.0 | 718 MB | false | false | 420 | 664841654 | 664841654 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
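Follow-up
The etcd startup log shows that although auto-compaction-mode and auto-compaction-retention are both present, retention is 0, which means auto-compaction is not in effect. It must be enabled on all three nodes.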
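When --auto-compaction-mode=periodic
--auto-compaction-retention is a duration (e.g. 1h, 30m, 10s): the window of history to keep, which also determines how often etcd compacts.
For example, --auto-compaction-retention=1h runs a compaction once per hour and discards history older than one hour.
When --auto-compaction-mode=revision
--auto-compaction-retention is the number of revisions of history to keep.
For example, --auto-compaction-retention=1000 keeps only the most recent 1000 revisions; anything older is compacted away.
Setting these parameters makes etcd discard superseded revisions automatically. Note: compaction alone does not shrink the boltdb file on disk; the freed space still has to be released manually with defrag.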
etcd log output before enabling
kubectl logs -f etcd-dev-k8s-master02 |grep -i compaction
{"level":"info","ts":"2024-12-09T01:35:20.425Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.0","git-sha":"946a5a6f2","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"dev-k8s-master02","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.1.17:2380"],"listen-peer-urls":["https://192.168.1.17:2380"],"advertise-client-urls":["https://192.168.1.17:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.1.17:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
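The relevant field is buried in a very long startup line. A sketch for extracting just the effective retention from that log, assuming the "starting an etcd server" line carries an "auto-compaction-retention" key as it does above. Both function names are hypothetical.

```shell
# compaction_retention: reads etcd logs on stdin and prints the effective
# auto-compaction-retention from the most recent startup line.
compaction_retention() {
  grep -oE '"auto-compaction-retention":"[^"]*"' \
    | sed -E 's/.*:"([^"]*)"$/\1/' \
    | tail -n 1
}

# auto_compaction_disabled: succeeds only when the logged retention is "0s",
# the value that leaves auto-compaction switched off.
auto_compaction_disabled() {
  [ "$(compaction_retention)" = "0s" ]
}
```

Possible usage, repeated for each of the three etcd pods: `kubectl logs etcd-dev-k8s-master02 | auto_compaction_disabled && echo "auto-compaction is off"`.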
Edit the etcd static pod manifest
root@dev-k8s-master03:/etc/kubernetes/manifests# grep compaction /etc/kubernetes/manifests/etcd.yaml
- --auto-compaction-retention=10h
This causes the pod to be recreated automatically; verify
(base) root@dev-k8s-master03:/etc/kubernetes/manifests# kubectl logs -f etcd-dev-k8s-master03 |grep -i compaction
{"level":"info","ts":"2024-12-09T07:04:26.400Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.1.38:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--initial-advertise-peer-urls=https://192.168.1.38:2380","--initial-cluster=dev-k8s-master05=https://192.168.0.90:2380,dev-k8s-master01=https://192.168.1.15:2380,dev-k8s-master03=https://192.168.1.38:2380,dev-k8s-master02=https://192.168.1.17:2380","--initial-cluster-state=existing","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.1.38:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.1.38:2380","--name=dev-k8s-master03","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--auto-compaction-retention=10h","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
{"level":"info","ts":"2024-12-09T07:04:26.400Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.0","git-sha":"946a5a6f2","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":true,"name":"dev-k8s-master03","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.1.38:2380"],"listen-peer-urls":["https://192.168.1.38:2380"],"advertise-client-urls":["https://192.168.1.38:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.1.38:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"10h0m0s","auto-compaction-interval":"10h0m0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
# A compaction has already been triggered
{"level":"info","ts":"2024-12-09T07:08:13.712Z","caller":"mvcc/kvstore_compaction.go:57","msg":"finished scheduled compaction","compact-revision":582337465,"took":"128.526142ms"}
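You can also raise etcd's backend storage quota (the quota-size-bytes value in the startup log, set with --quota-backend-bytes).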
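The startup logs above report quota-size-bytes 2147483648, i.e. 2 GiB, and node 1's dbSize of 2146328576 was sitting right at that limit when the NOSPACE alarm fired. A small arithmetic sketch for sizing a new quota and checking headroom. The helper names are hypothetical, and the 8 GiB figure is only an example value for --quota-backend-bytes.

```shell
# gib_to_bytes <GiB>: byte value to pass to --quota-backend-bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# db_usage_percent <dbSize-bytes> <quota-bytes>: integer percent of quota used.
db_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

gib_to_bytes 8                                # 8589934592
db_usage_percent 2146328576 2147483648        # 99 (node 1 before cleanup)
db_usage_percent 718491648 2147483648         # 33 (the healthy nodes)
```

With these numbers, adding --quota-backend-bytes=8589934592 to the etcd manifest would give an 8 GiB quota. Like the compaction flags, it would need to be set on every member.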
reference
https://etcd.io/docs/v3.4/op-guide/maintenance/