1. Server environment and configuration
| Item | Value |
| --- | --- |
| Physical machine / VM / cloud / container | Physical machine |
| Internet / private network / no network | Private network |
| Processor | Hygon C86 7265 24-core Processor; 2 of 2 CPU sockets populated; 24 cores / 48 threads per CPU; 48 cores / 96 threads in total |
| Memory | 768 GiB |
| Machine model | ZTE R5930 G2 |
| BIOS version | Insyde 03.15.100, 2022/09/27 |
| Operating system | Kylin Linux Advanced Server V10 (Tercel) |
| Kernel version | 4.19.90-24.4.v2101.ky10.x86_64 |
2. Problem description
Three servers went down and each generated a vmcore; the vendor was asked to analyze the cause of the crashes.
3. Problem analysis
Analysis of the sosreport shows that the system is Kylin Linux Advanced Server V10 SP1, but it is running a V10 SP2 kernel while the userspace components are still those of V10 SP1.
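A quick way to cross-check this mismatch (a minimal sketch; the release-file location is an assumption about the Kylin layout):

```bash
# Cross-check the userspace release against the running kernel.
# The kylin-release path is an assumption; fall back to os-release if absent.
cat /etc/kylin-release 2>/dev/null || cat /etc/os-release
uname -r              # running kernel: 4.19.90-24.4.v2101.ky10.x86_64 (SP2 series)
rpm -q kernel         # all installed kernel package versions
```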
Checking the kdump configuration shows that kernel.hung_task_panic=1 is set, so whenever a hung task occurs the kernel panics and the machine goes through an abnormal reboot.
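A minimal way to confirm these settings on a running system:

```bash
# Confirm the hung-task behavior described above: with hung_task_panic=1,
# a task blocked in D state longer than hung_task_timeout_secs (120 s by
# default) triggers a kernel panic, which kdump turns into a vmcore.
sysctl kernel.hung_task_panic kernel.hung_task_timeout_secs
```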
3.1 Analysis of sosreport-NFDW5-X86TSTACK-TX-COM30-20231208104840
Checking the disks and storage shows that the multipath disks do not have any filesystem on them.
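A hedged sketch of confirming this on the system (mpatha is a placeholder map name; substitute the real names from multipath -ll):

```bash
# Verify that the multipath devices carry no filesystem signature.
multipath -ll
lsblk -f /dev/mapper/mpatha                            # empty FSTYPE column => no filesystem
blkid /dev/mapper/mpatha || echo "no filesystem signature"
```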
3.1.1 Analysis of the vmcore-dmesg log
The system hit a hung task; the log shows repeated "blk_cloned_rq_check_limits: over max size limit" messages together with multipath path errors:
```
[8383722.927691] sd 34:0:0:2: alua: port group 01 state A preferred supports tolUsNa
[8383722.928275] sd 34:0:0:2: alua: port group 01 state A preferred supports tolUsNA
[8383723.023106] blk_cloned_rq_check_limits: over max size limit.
[8383723.030122] blk_cloned_rq_check_limits: over max size limit.
[8383723.030133] device-mapper: multipath: Failing path 8:144.
[8383723.919298] device-mapper: multipath: Reinstating path 8:160.
[8383723.921083] device-mapper: multipath: Reinstating path 8:192.
[8383723.922643] device-mapper: multipath: Reinstating path 8:176.
[8383723.925439] device-mapper: multipath: Reinstating path 8:208.
[8383723.931608] sd 34:0:1:2: alua: port group 01 state A preferred supports tolUsNA
[8383723.932190] sd 34:0:1:2: alua: port group 01 state A preferred supports tolUsNA
[8383724.023032] blk_cloned_rq_check_limits: over max size limit.
[8383724.030045] blk_cloned_rq_check_limits: over max size limit.
[8383724.030056] device-mapper: multipath: Failing path 8:208.
[8383724.037055] blk_cloned_rq_check_limits: over max size limit.
[8383724.037167] device-mapper: multipath: Failing path 8:176.
[8383724.044067] blk_cloned_rq_check_limits: over max size limit.
[8383724.044148] device-mapper: multipath: Failing path 8:192.
[8383724.051079] blk_cloned_rq_check_limits: over max size limit.
[8383724.051162] device-mapper: multipath: Failing path 8:160.
[8383724.929323] device-mapper: multipath: Reinstating path 8:240.
[8383724.939527] sd 35:0:2:2: alua: port group 01 state A preferred supports tolUsNA
[8383724.940120] sd 35:0:2:2: alua: port group 01 state A preferred supports tolUsNA
[8383725.034956] blk_cloned_rq_check_limits: over max size limit.
[8383725.041964] blk_cloned_rq_check_limits: over max size limit.
[8383725.041975] device-mapper: multipath: Failing path 8:240.
[8383725.899647] INFO: task worker:2189823 blocked for more than 120 seconds.
[8383725.907831] Tainted: G OE 4.19.90-24.4.v2101.ky10.x86_64 #1
[8383725.916697] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8383725.926127] worker D 0 2189823 1 0x000003a0
[8383725.933137] Call Trace:
[8383725.936437] device-mapper: multipath: Reinstating path 8:224.
[8383725.936568]  ? __schedule+0x296/0x910
[8383725.945235] device-mapper: multipath: Reinstating path 65:0.
[8383725.948431]  schedule+0x28/0x80
[8383725.955527] sd 35:0:1:2: alua: port group 01 state A preferred supports tolUsNA
[8383725.959626]  schedule_timeout+0x1ee/0x3a0
[8383725.969017] sd 35:0:1:2: alua: port group 01 state A preferred supports tolUsNa
[8383725.973626]  ? blk_flush_plug_list+0x218/0x270
[8383725.984815] sd 35:0:1:2: alua: port group 01 state A preferred supports tolUsNA
[8383725.988127]  io_schedule_timeout+0x19/0x40
[8383726.002222]  wait_for_completion_io+0x126/0x190
[8383726.007971]  ? wake_up_q+0x70/0x70
[8383726.012459]  submit_bio_wait+0x5b/0x80
[8383726.017335]  blkdev_issue_zeroout+0xdc/0x210
[8383726.022794]  ? blkdev_ioctl+0x893/0x930
[8383726.027766]  blkdev_ioctl+0x893/0x930
[8383726.032547]  block_ioctl+0x39/0x40
[8383726.037034]  do_vfs_ioctl+0xa4/0x640
[8383726.041721]  ? do_numa_page+0x250/0x267
[8383726.046685]  ksys_ioctl+0x70/0x80
[8383726.046888] blk_cloned_rq_check_limits: over max size limit.
[8383726.051076]  __x64_sys_ioctl+0x16/0x20
[8383726.058080] blk_cloned_rq_check_limits: over max size limit.
[8383726.058088] device-mapper: multipath: Failing path 65:0.
[8383726.062961]  0xffffffffc0ee3868
[8383726.062971]  ? __audit_syscall_entry+0x103/0x130
[8383726.069969] blk_cloned_rq_check_limits: over max size limit.
[8383726.069976] device-mapper: multipath: Failing path 8:224.
```
Collating the hung-task stack shows a call to blkdev_issue_zeroout, which suggests a format operation was in progress:
```
ksys_ioctl+0x70/0x80
do_vfs_ioctl+0xa4/0x640
? do_numa_page+0x250/0x267
block_ioctl+0x39/0x40               // ioctl handler for block devices; dispatches block-device ioctl calls
blkdev_ioctl+0x893/0x930            // executes block-device ioctl commands, e.g. partition resizing or querying device information
? blkdev_ioctl+0x893/0x930
blkdev_issue_zeroout+0xdc/0x210     // zeroes a region of the block device
submit_bio_wait+0x5b/0x80           // submits a bio (block I/O) to the underlying device and waits for completion
? wake_up_q+0x70/0x70               // wakes up processes on a wait queue
wait_for_completion_io+0x126/0x190  // waits for the I/O operation to complete
io_schedule_timeout+0x19/0x40       // puts the task to sleep while the I/O is in flight
? blk_flush_plug_list+0x218/0x270   // flushes the plugged request list, batching I/O for efficient submission
schedule_timeout+0x1ee/0x3a0
```
From follow-up communication we learned that an mkfs.xfs format was being executed at the time, which then led to the hung task. Combined with the blkdev_issue_zeroout frames in the stack, we conclude this is a known kernel issue, and the stack matches that known issue.
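For context, the ioctl-to-blkdev_issue_zeroout path in the stack corresponds to the BLKZEROOUT ioctl; as an illustration only, the same kernel path can be exercised with blkdiscard (destructive; the device name is a placeholder):

```bash
# blkdiscard -z issues the BLKZEROOUT ioctl, which the kernel services via
# blkdev_issue_zeroout() -- the same path seen in the hung-task stack above.
# WARNING: destructive, zeroes the first 1 MiB. mpatha is a placeholder name.
blkdiscard -z --offset 0 --length 1M /dev/mapper/mpatha
```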
The root cause: while handling a Write Zero class I/O request, the kernel found that the multipath device and its underlying devices do not support the Write Same/Write Zero commands. In the multipath module the request therefore completed with IOERR status, which was misjudged as a path failure. The Write Zero request then cycled through path selection in multipath, but since no path supports the command the I/O could never complete; it kept failing paths and could not be processed.
The fix adjusts the handling of Write Zero class I/O requests so that they complete with status NOTSUPP (not supported), notifying the issuer that the request is unsupported. The issuer can then adapt instead of waiting forever for an I/O that can never be completed. The fix has been merged into kernel 4.19.90-23.30.v2101 and later, so upgrading the kernel is the recommended way to resolve this issue.
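As a hedged illustration, the missing Write Same/Write Zeroes support is visible in sysfs: a value of 0 means the device does not advertise the command (the device names below are placeholders):

```bash
# 0 in these files means the command is not supported -- the condition that
# triggers the "over max size limit" errors. dm-2 and sdj are placeholders.
for dev in dm-2 sdj; do
    echo "== $dev =="
    cat /sys/block/$dev/queue/write_same_max_bytes
    cat /sys/block/$dev/queue/write_zeroes_max_bytes
done
```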
3.1.2 Analysis of the sa logs
sar -rh -f sa06 shows the memory usage when the problem occurred: a large amount of memory was still free.
sar -u -f sa06 shows that CPU utilization at the time was very low.
sar -q -f sa06 shows that the system load at the time was also very low.
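For reference, sa06/sa07 are binary sysstat archives collected inside the sosreport; a minimal sketch of reading them:

```bash
# The sa files live under var/log/sa/ in the sosreport; -f reads a saved file.
cd sosreport-NFDW5-X86TSTACK-TX-COM30-20231208104840/var/log/sa
sar -rh -f sa06   # memory usage
sar -u -f sa06    # CPU utilization
sar -q -f sa06    # run-queue length and load average
```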
3.2 Analysis of sosreport-NFDW5-X86TSTACK-TX-COM80-20231208104517
3.2.1 Analysis of the vmcore-dmesg log
The vmcore-dmesg log matches the previous machine's vmcore-dmesg log; it is the same issue.
```
[8444815.609179] device-mapper: multipath: Failing path 8:240.
[8444815.616184] blk_cloned_rq_check_limits: over max size limit.
[8444815.616223] device-mapper: multipath: Failing path 8:224.
[8444815.623199] blk_cloned_rq_check_limits: over max size limit.
[8444815.623206] device-mapper: multipath: Failing path 8:208.
[8444815.630216] blk_cloned_rq_check_limits: over max size limit.
[8444815.630275] device-mapper: multipath: Failing path 8:192.
[8444815.637230] blk_cloned_rq_check_limits: over max size limit.
[8444815.637243] device-mapper: multipath: Failing path 8:176.
[8444815.644248] blk_cloned_rq_check_limits: over max size limit.
[8444815.644280] device-mapper: multipath: Failing path 8:160.
[8444815.651263] blk_cloned_rq_check_limits: over max size limit.
[8444815.651275] device-mapper: multipath: Failing path 8:144.
[8444821.198270] INFO: task worker:2981470 blocked for more than 120 seconds.
[8444821.206446] Tainted: G OE 4.19.90-24.4.v2101.ky10.x86_64 #1
[8444821.215294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8444821.224714] worker D 0 2981470 1 0x000003a0
[8444821.231720] Call Trace:
[8444821.235145]  ? __schedule+0x296/0x910
[8444821.239927]  schedule+0x28/0x80
[8444821.244116]  schedule_timeout+0x1ee/0x3a0
[8444821.249287]  ? blk_flush_plug_list+0x218/0x270
[8444821.254936]  io_schedule_timeout+0x19/0x40
[8444821.260192]  wait_for_completion_io+0x126/0x190
[8444821.265938]  ? wake_up_q+0x70/0x70
[8444821.270423]  submit_bio_wait+0x5b/0x80
[8444821.275299]  blkdev_issue_zeroout+0xdc/0x210
[8444821.280752]  ? blkdev_ioctl+0x893/0x930
[8444821.285720]  blkdev_ioctl+0x893/0x930
[8444821.290498]  block_ioctl+0x39/0x40
[8444821.294982]  do_vfs_ioctl+0xa4/0x640
[8444821.299651]  ? do_numa_page+0x250/0x267
[8444821.304617]  ksys_ioctl+0x70/0x80
[8444821.309000]  __x64_sys_ioctl+0x16/0x20
[8444821.313888]  0xffffffffc0ddd868
[8444821.318085]  ? __audit_syscall_entry+0x103/0x130
[8444821.323924]  0xffffffffc0dde2a6
[8444821.328124]  do_syscall_64+0x5b/0x1d0
[8444821.332898]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
3.2.2 Analysis of the sa logs
sar -rh -f sa07 shows that memory utilization at the time was very low.
sar -u -f sa07 shows that CPU utilization when the problem occurred was very low.
sar -d -f sa07 shows that disk read/write activity was very low.
sar -q -f sa07 shows that the system load at the time was also very low.
3.3 Analysis of sosreport-NFDW5-X86TSTACK-TX-COM81-20231208104057
The kdump configuration here likewise has kernel.hung_task_panic=1 set, so when a hung task occurs the kernel panics and the machine reboots abnormally.
3.3.1 Analysis of the vmcore-dmesg log
The vmcore-dmesg log shows the same hung-task signature as the previous two machines:
```
[8397590.155959] blk_cloned_rq_check_limits: over max size limit.
[8397590.155974] device-mapper: multipath: Failing path 8:144.
[8397597.335970] INFO: task worker:1982596 blocked for more than 120 seconds.
[8397597.344146] Tainted: G OE 4.19.90-24.4.v2101.ky10.x86_64 #1
[8397597.352996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8397597.362427] worker D 0 1982596 1 0x000003a0
[8397597.369438] Call Trace:
[8397597.372870]  ? __schedule+0x296/0x910
[8397597.377647]  schedule+0x28/0x80
[8397597.381844]  schedule_timeout+0x1ee/0x3a0
[8397597.387015]  ? blk_flush_plug_list+0x218/0x270
[8397597.392665]  io_schedule_timeout+0x19/0x40
[8397597.397929]  wait_for_completion_io+0x126/0x190
[8397597.403680]  ? wake_up_q+0x70/0x70
[8397597.408171]  submit_bio_wait+0x5b/0x80
[8397597.413049]  blkdev_issue_zeroout+0xdc/0x210
[8397597.418511]  ? blkdev_ioctl+0x893/0x930
[8397597.423481]  blkdev_ioctl+0x893/0x930
[8397597.428265]  block_ioctl+0x39/0x40
[8397597.432756]  do_vfs_ioctl+0xa4/0x640
[8397597.437441]  ? do_numa_page+0x250/0x267
[8397597.442413]  ksys_ioctl+0x70/0x80
[8397597.446801]  __x64_sys_ioctl+0x16/0x20
[8397597.451690]  0xffffffffc0e90868
[8397597.455889]  ? __audit_syscall_entry+0x103/0x130
[8397597.461736]  0xffffffffc0e912a6
[8397597.465940]  do_syscall_64+0x5b/0x1d0
[8397597.470719]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[8397597.477051] RIP: 0033:0x7f91fa507ec7
[8397597.481737] Code: Bad RIP value.
```
4. Analysis conclusions
1. The sosreport logs of all three machines show that the multipath disks carried no filesystem and that mkfs.xfs was formatting them when the hung task occurred.
2. The sa logs of all three machines show that resource consumption was very low and normal before the problem occurred.
3. The vmcore-dmesg logs of all three machines contain the "blk_cloned_rq_check_limits: over max size limit" messages and related multipath errors; the stacks call blkdev_issue_zeroout to zero a region of the block device, the operation times out, and a hung task follows. The errors and stacks match the known issue of XFS formatting failing on multipath devices. Because kernel.hung_task_panic=1 is set, the hung task made the kernel panic and triggered the abnormal reboots.
The root cause: while handling a Write Zero class I/O request, the kernel found that the multipath device and its underlying devices do not support the Write Same/Write Zero commands. In the multipath module the request therefore completed with IOERR status, which was misjudged as a path failure. The Write Zero request then cycled through path selection in multipath, but since no path supports the command the I/O could never complete; it kept failing paths and could not be processed.
The fix adjusts the handling of Write Zero class I/O requests so that they complete with status NOTSUPP (not supported), notifying the issuer that the request is unsupported. The issuer can then adapt instead of waiting forever for an I/O that can never be completed. The fix has been merged into kernel 4.19.90-23.30.v2101 and later, so upgrading the kernel is the recommended way to resolve this issue.
5. Recommendations
Upgrade to the latest SP1 kernel, 4.19.90-23.42.v2101. Relevant changes in the intermediate versions:
- 4.19.90-23.30.v2101: contains the fix for the Write Zero I/O handling issue analyzed above.
- 4.19.90-23.13.v2101: fixes incorrect information when reading mlx 25G optical module data via ethtool.
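A hedged sketch of the upgrade procedure (assuming the package is available from the configured Kylin repository; the exact package name may differ):

```bash
# Install the fixed kernel, make it the default boot entry, and verify.
# Package/version strings are assumptions; adjust to the vendor repository.
yum install -y kernel-4.19.90-23.42.v2101.ky10.x86_64
grubby --set-default /boot/vmlinuz-4.19.90-23.42.v2101.ky10.x86_64
reboot
# after reboot:
uname -r    # expect 4.19.90-23.42.v2101.ky10.x86_64
```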