I ran into several problems while running kubeadm init. This post collects the fixes for reference.
Problem 1: kubeadm config images pull fails with "dial unix /var/run/containerd/containerd.sock: connect: permission denied"
The image repository was already set to the Aliyun mirror, but pulling the images failed with permission denied:
[shirley@master k8s_install]$ kubeadm config images pull --config kubeadm.yam
failed to pull image "registry.aliyuncs.com/google_containers/kube-apiserver:from image service failed" err="rpc error: code = Unavailable desc = connecticontainerd.sock: connect: permission denied\"" image="registry.aliyuncs.com/g
time="2023-10-10T14:56:54+08:00" level=fatal msg="pulling image: rpc error: cng dial unix /var/run/containerd/containerd.sock: connect: permission denied\
, error: exit status 1
To see the stack trace of this error execute with --v=5 or higher
I followed the approach described at https://techglimpse.com/failed-pull-image-registry-kube-apiserver/ to troubleshoot.
2. Kubernetes uses the crictl command to manage the CRI. Check its configuration file, /etc/crictl.yaml. This file does not exist initially. I recommend creating it, otherwise kubeadm init reports other errors later.
cat > /etc/crictl.yaml <<EOF
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 0
debug: false
pull-image-on-create: false
EOF
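3. Check the configuration file /etc/containerd/config.toml and comment out the line disabled_plugins = ["cri"]:
# disabled_plugins = ["cri"]
Alternatively, regenerate the default configuration (this overwrites the existing file):
containerd config default > /etc/containerd/config.toml
Then restart containerd:
sudo systemctl restart containerd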
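As a quick check for step 3, the following is a minimal sketch of a helper that reports whether a containerd configuration file still disables the cri plugin. The function name cri_plugin_disabled is hypothetical, and the check is a plain text match rather than a real TOML parse, so treat it as an approximation.

```shell
#!/bin/sh
# cri_plugin_disabled FILE  (hypothetical helper, not part of containerd)
# Succeeds (exit 0) when FILE contains an uncommented disabled_plugins
# line that lists "cri". Lines starting with '#' are ignored.
cri_plugin_disabled() {
    grep -Eq '^[[:space:]]*disabled_plugins[[:space:]]*=.*"cri"' "$1"
}

# Example usage, with the path taken from the post:
#   if cri_plugin_disabled /etc/containerd/config.toml; then
#       echo "cri plugin is disabled: comment the line out and restart containerd"
#   fi
```

The helper only reads the file, so it is safe to run before and after editing the configuration.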
Problem 2: kubelet fails with "failed to run Kubelet: running with swap on is not supported"
Running sudo kubeadm init --config kubeadm.yaml failed because the kubelet was unhealthy. The kubelet log shows: failed to run Kubelet: running with swap on is not supported, please disable swap!
[root@master ~]# journalctl -f -ukubelet
... ...
Oct 10 16:18:35 master.k8s kubelet[2079]: I1010 16:18:35.432021 2079 server.go:725] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
Oct 10 16:18:35 master.k8s kubelet[2079]: E1010 16:18:35.432363 2079 run.go:74] "command failed" err="failed to run Kubelet: running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: [Filename\t\t\t\tType\t\tSize\tUsed\tPriority /dev/dm-1 partition\t2097148\t0\t-2]"
Oct 10 16:18:35 master.k8s systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
Oct 10 16:18:35 master.k8s systemd[1]: Unit kubelet.service entered failed state.
Oct 10 16:18:35 master.k8s systemd[1]: kubelet.service failed.
... ...
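The fix is to disable swap:
# Temporarily (until the next reboot)
sudo swapoff -a
# Permanently: comment out the swap entry in /etc/fstab so swap is not mounted at boot
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
With swap off, the kubelet starts normally: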
[root@master ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
Active: active (running) since Tue 2023-10-10 16:26:16 CST; 1min 4s ago
Docs: https://kubernetes.io/docs/
Main PID: 2532 (kubelet)
Tasks: 11
Memory: 32.1M
CGroup: /system.slice/kubelet.service
└─2532 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/l...
Oct 10 16:27:14 master.k8s kubelet[2532]: E1010 16:27:14.927100 2532 kuberuntime_manager.go:1166] "CreatePodSandbox for pod failed" err="rpc e....k8s.io/p
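The swap fix above can be checked and made repeatable with a small sketch like the one below. Both function names are hypothetical. Unlike the sed -i command in the post, comment_swap_entries writes to standard output instead of editing in place, and it skips lines that are already commented, so running it twice does not produce "##" prefixes.

```shell
#!/bin/sh
# swap_active [FILE]  (hypothetical helper)
# Succeeds when FILE (default /proc/swaps) lists at least one swap
# device after the header line.
swap_active() {
    awk 'NR > 1 { found = 1 } END { exit !found }' "${1:-/proc/swaps}"
}

# comment_swap_entries FILE  (hypothetical helper)
# Prints FILE with every uncommented line containing " swap " prefixed
# by '#'. The input file itself is not modified.
comment_swap_entries() {
    sed -E '/^[[:space:]]*#/!s/^(.*[[:space:]]swap[[:space:]].*)$/#\1/' "$1"
}

# Example usage:
#   swap_active && echo "swap is still on"
#   comment_swap_entries /etc/fstab > /tmp/fstab.new   # review before copying back
```

Reviewing the generated file before replacing /etc/fstab is a deliberate choice here, since a mistake in that file can prevent the machine from booting.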
Problem 3: kubeadm init reports that some configuration files already exist
[shirley@master k8s_install]$ sudo kubeadm init --config kubeadm.yaml
[sudo] password for shirley:
[init] Using Kubernetes version: v1.28.0
[preflight] Running pre-flight checks
[WARNING Hostname]: hostname "node" could not be reached
[WARNING Hostname]: hostname "node": lookup node on server misbehaving
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
[ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1
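Solution: kubeadm reset
The log shows that some configuration files already exist. The earlier kubeadm init run failed partway through and exited after these files had already been generated. Running kubeadm reset reverts those changes: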
[shirley@master k8s_install]$ sudo kubeadm reset
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W1010 16:34:16.187161 2705 reset.go:120] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get config map: Get "": dial tcp connect: connection refused
W1010 16:34:16.187828 2705 preflight.go:56] [reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W1010 16:34:41.266029 2705 removeetcdmember.go:106] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
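The reset output lists a few things that kubeadm reset leaves behind. As a rough sketch, a helper like the one below can report which of those paths still exist on the host. The function name report_leftovers is hypothetical, and the example paths are the ones named in the reset output above.

```shell
#!/bin/sh
# report_leftovers PATH...  (hypothetical helper)
# Prints each PATH that still exists, one per line. Nothing is deleted.
report_leftovers() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            printf '%s\n' "$p"
        fi
    done
    return 0
}

# Example usage, with paths mentioned by kubeadm reset:
#   report_leftovers /etc/cni/net.d "$HOME/.kube/config"
```

The helper only lists paths. Whether to remove them is left to the operator, as the reset output itself suggests.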
Problem 4: kubeadm init reports IPv4-related errors
[shirley@master k8s_install]$ sudo kubeadm init --config kubeadm.yaml
[init] Using Kubernetes version: v1.28.0
[preflight] Running pre-flight checks
[WARNING Hostname]: hostname "node" could not be reached
[WARNING Hostname]: hostname "node": lookup node on server misbehaving
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
[ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
# Load the IPVS modules (br_netfilter provides /proc/sys/net/bridge/bridge-nf-call-iptables)
modprobe br_netfilter
modprobe -- ip_vs
modprobe -- ip_vs_sh
modprobe -- ip_vs_rr
modprobe -- ip_vs_wrr
modprobe -- nf_conntrack_ipv4
# Verify that the ip_vs modules are loaded
lsmod |grep ip_vs
ip_vs_wrr 12697 0
ip_vs_rr 12600 0
ip_vs_sh 12688 0
ip_vs 145458 6 ip_vs_rr,ip_vs_sh,ip_vs_wrr
nf_conntrack 139264 2 ip_vs,nf_conntrack_ipv4
libcrc32c 12644 3 xfs,ip_vs,nf_conntrack
# Kernel parameters
cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
# Apply the settings and verify
sysctl -p /etc/sysctl.d/k8s.conf
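Before rerunning kubeadm init, the two preflight checks from the error output can be reproduced by reading the same files under /proc/sys. The following is a minimal sketch. The function name check_kernel_params and the optional ROOT argument are assumptions added so the check can also be pointed at a test directory.

```shell
#!/bin/sh
# check_kernel_params [ROOT]  (hypothetical helper)
# Reads the kernel parameters that kubeadm preflight complained about
# from ROOT (default /proc/sys) and prints one status line per key.
# Returns non-zero if any key is missing or not set to 1.
check_kernel_params() {
    root=${1:-/proc/sys}
    rc=0
    for key in net/bridge/bridge-nf-call-iptables \
               net/bridge/bridge-nf-call-ip6tables \
               net/ipv4/ip_forward; do
        if [ ! -r "$root/$key" ]; then
            echo "MISSING $key"
            rc=1
        elif [ "$(cat "$root/$key")" != "1" ]; then
            echo "NOT_SET $key"
            rc=1
        else
            echo "OK $key"
        fi
    done
    return $rc
}

# Example usage:
#   check_kernel_params || echo "fix the settings above before kubeadm init"
```

A MISSING result for the bridge keys usually means br_netfilter is not loaded, while NOT_SET means the file exists but the value is not 1.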
Problem 5: kubeadm init fails because the crictl runtime endpoint is not configured correctly
kubeadm init reports the following error:
... ...
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
[root@master k8s_install]# crictl images list
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E1010 17:19:18.816289 3832 remote_image.go:119] "ListImages with filter from image service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory\"" filter="&ImageFilter{Image:&ImageSpec{Image:list,Annotations:map[string]string{},},}"
FATA[0000] listing images: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"
This error occurs because crictl falls back to its default endpoints [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock] and, as the log shows, tries /var/run/dockershim.sock, which does not exist. Set the endpoints explicitly in /etc/crictl.yaml:
cat > /etc/crictl.yaml <<EOF
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 0
debug: false
pull-image-on-create: false
EOF
After that, crictl images list no longer reports an error:
[root@master ~]# crictl images list
registry.aliyuncs.com/google_containers/coredns v1.10.1 ead0a4a
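To confirm which endpoint crictl will use, the configured value can be read back from the file and turned into a socket path. This is a minimal sketch with hypothetical function names. It assumes the simple "key: value" layout written by the heredoc above and does not handle quoting or other YAML features.

```shell
#!/bin/sh
# crictl_endpoint FILE KEY  (hypothetical helper)
# Prints the value of KEY (for example runtime-endpoint) from FILE.
crictl_endpoint() {
    sed -n "s/^$2:[[:space:]]*//p" "$1"
}

# endpoint_path ENDPOINT  (hypothetical helper)
# Strips the unix:// scheme and prints the filesystem path of the socket.
endpoint_path() {
    printf '%s\n' "${1#unix://}"
}

# Example usage:
#   ep=$(crictl_endpoint /etc/crictl.yaml runtime-endpoint)
#   [ -S "$(endpoint_path "$ep")" ] || echo "socket not found: $ep"
```

If the socket test fails, containerd is probably not running or is listening on a different path than the one configured.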
Problem 6: kubeadm init fails with "The kubelet is not running"
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
Following the hint in the log, run crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a to look for failing containers, then check the containerd log:
[root@master ~]# journalctl -fu containerd
Oct 11 08:35:16 master.k8s containerd[1903]: time="2023-10-11T08:35:16.760026536+08:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-apiserver-node,Uid:a5a7c15a42701ab6c9dca630e6523936,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"registry.k8s.io/pause:3.6\": failed to pull image \"registry.k8s.io/pause:3.6\": failed to pull and unpack image \"registry.k8s.io/pause:3.6\": failed to resolve reference \"registry.k8s.io/pause:3.6\": failed to do request: Head \"https://asia-east1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp connect: connection refused"
Oct 11 08:35:18 master.k8s containerd[1903]: time="2023-10-11T08:35:18.606581001+08:00" level=info msg="trying next host" error="failed to do request: Head \"https://asia-east1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.6\": dial tcp connect: connection refused" host=registry.k8s.io
The log shows that containerd failed to pull an image: failed to get sandbox image "registry.k8s.io/pause:3.6".
Inspect the effective configuration with containerd config dump:
[root@master k8s_install]# containerd config dump
... ...
sandbox_image = "registry.k8s.io/pause:3.6"
selinux_category_range = 1024
... ...
The default containerd configuration uses registry.k8s.io/pause:3.6 as the pause image, while the pause image available locally comes from the Aliyun mirror:
[root@master k8s_install]# crictl images list | grep pause
registry.aliyuncs.com/google_containers/pause 3.9 e6f1816883972 322kB
Export the configuration to /etc/containerd/config.toml, change sandbox_image, and restart containerd:
## Export the current configuration to the configuration file
containerd config dump > /etc/containerd/config.toml
## Edit /etc/containerd/config.toml and change the sandbox_image setting
sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
sudo systemctl restart containerd
## Confirm the new value
[root@master k8s_install]# containerd config dump | grep pause
pause_threshold = 0.02
sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
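The manual edit of sandbox_image can also be scripted. The sketch below is one possible way to do it with sed. The function name set_sandbox_image is hypothetical, the result is written to standard output rather than edited in place, and the image name must not contain the characters | or &, which are special in the sed expression used here.

```shell
#!/bin/sh
# set_sandbox_image FILE IMAGE  (hypothetical helper)
# Prints FILE with the value of every sandbox_image line replaced by
# IMAGE, keeping the original indentation. FILE itself is not modified.
set_sandbox_image() {
    sed -E "s|^([[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*).*|\\1\"$2\"|" "$1"
}

# Example usage, with the image from the post:
#   set_sandbox_image /etc/containerd/config.toml \
#       registry.aliyuncs.com/google_containers/pause:3.9 > /tmp/config.toml.new
#   # review /tmp/config.toml.new, copy it into place, then restart containerd
```

Writing to a separate file first makes it easy to diff the change before containerd is restarted.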
Problem 7: kubelet reports that /var/lib/kubelet/config.yaml does not exist
Oct 11 11:38:00 slave.k8s kubelet[77030]: E1011 11:38:00.869724 77030 run.go:74] "command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"
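Before kubeadm init or kubeadm join has been run, the kubelet log shows this error because it cannot read the configuration file /var/lib/kubelet/config.yaml. This is nothing to worry about: the file is generated automatically once kubeadm init or kubeadm join completes.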