Notes on Linux load balancing and system load analysis

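1 Load balancing

1.1 Computing load

1.1.1 Introduction to the PELT algorithm

Since Linux 3.8, the load of a task is computed not only from its weight but also from the tracked load history of each scheduling entity. This algorithm is called PELT (Per-Entity Load Tracking).

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 505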
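
As a rough illustration of what "tracking the load history" means, the sketch below models the geometric series that PELT accumulates. It assumes the commonly documented PELT parameters: time is divided into 1024 us periods, and older periods are decayed by a factor y per period, with y chosen so that y^32 = 0.5. The function names are hypothetical, and floating-point arithmetic only approximates the kernel's fixed-point implementation.

```python
# Simplified floating-point model of PELT's decayed sum (not kernel code).
# Assumptions: each period is 1024 us and y satisfies y**32 == 0.5.
PERIOD_US = 1024
Y = 0.5 ** (1 / 32)


def decayed_sum(runnable_us_per_period):
    """Fold per-period runnable times (oldest first) into a decayed sum."""
    total = 0.0
    for runnable_us in runnable_us_per_period:
        # Everything accumulated so far ages by one period, then the
        # newest period's contribution is added undecayed.
        total = total * Y + runnable_us
    return total


def max_sum():
    """Limit of the series for an always-runnable entity: 1024 / (1 - y)."""
    return PERIOD_US / (1 - Y)


def runnable_ratio(runnable_us_per_period):
    """Approximate runnable% as the decayed sum divided by its maximum."""
    return decayed_sum(runnable_us_per_period) / max_sum()
```

In this model a period that lies 32 periods in the past contributes half of what it originally did, and an entity that is always runnable converges to a ratio of 1. max_sum() comes out slightly above the 47742 quoted in the kernel comment in 1.1.2.1; the difference is presumably due to the kernel's integer arithmetic.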

1.1.2 The load-tracking data structure: struct sched_avg

1.1.2.1 Definition

/*
 * The load_avg/util_avg accumulates an infinite geometric series
 * (see __update_load_avg() in kernel/sched/fair.c).
 *
 * [load_avg definition]
 *
 *   load_avg = runnable% * scale_load_down(load)
 *
 * where runnable% is the time ratio that a sched_entity is runnable.
 * For cfs_rq, it is the aggregated load_avg of all runnable and
 * blocked sched_entities.
 *
 * [util_avg definition]
 *
 *   util_avg = running% * SCHED_CAPACITY_SCALE
 *
 * where running% is the time ratio that a sched_entity is running on
 * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
 * and blocked sched_entities.
 *
 * load_avg and util_avg don't directly factor frequency scaling and CPU
 * capacity scaling. The scaling is done through the rq_clock_pelt that
 * is used for computing those signals (see update_rq_clock_pelt())
 *
 * N.B., the above ratios (runnable% and running%) themselves are in the
 * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
 * to as large a range as necessary. This is for example reflected by
 * util_avg's SCHED_CAPACITY_SCALE.
 *
 * [Overflow issue]
 *
 * The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities
 * with the highest load (=88761), always runnable on a single cfs_rq,
 * and should not overflow as the number already hits PID_MAX_LIMIT.
 *
 * For all other cases (including 32-bit kernels), struct load_weight's
 * weight will overflow first before we do, because:
 *
 *    Max(load_avg) <= Max(load.weight)
 *
 * Then it is the load_weight's responsibility to consider overflow
 * issues.
 */
struct sched_avg {
    u64             last_update_time;
    u64             load_sum;
    u64             runnable_load_sum;
    u32             util_sum;
    u32             period_contrib;
    unsigned long           load_avg;
    unsigned long           runnable_load_avg;
    unsigned long           util_avg;
    struct util_est         util_est;
} ____cacheline_aligned;

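1.1.2.2 Meaning of the struct sched_avg members

1.1.3 How the data structures are organized

Task run queue

Task scheduling entity

1.1.4 ___update_load_avg() and ___update_load_sum()

___update_load_avg(): computes the quantized load (load_avg) and the actual compute utilization (util_avg).
___update_load_sum(): computes the workload (the *_sum accumulators).

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 515

《Linux内核深度解析》, p. 104

1.1.5 Viewing the load of a single task

For example, to view the load of the process with PID 7202: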

# cat /proc/7202/sched | grep se.avg
se.avg.load_sum                              :                   15
se.avg.runnable_sum                          :                15521
se.avg.util_sum                              :                15521
se.avg.load_avg                              :                    0
se.avg.runnable_avg                          :                    0
se.avg.util_avg                              :                    0
se.avg.last_update_time                      :       42221565865984
se.avg.util_est.ewma                         :                    8
se.avg.util_est.enqueued                     :                    8
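
The same fields can be collected programmatically. The following is a minimal sketch that assumes the "name : value" layout shown above; parse_se_avg and read_se_avg are hypothetical helper names, not part of any existing tool.

```python
def parse_se_avg(sched_text):
    """Return {field: int} for the se.avg.* lines of /proc/<pid>/sched text."""
    fields = {}
    for line in sched_text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        # Only the se.avg.* lines are of interest; other lines are skipped.
        if sep and name.startswith("se.avg."):
            fields[name[len("se.avg."):]] = int(value.strip())
    return fields


def read_se_avg(pid):
    """Read and parse the se.avg.* fields of one process."""
    with open(f"/proc/{pid}/sched") as f:
        return parse_se_avg(f.read())
```

With the output above, read_se_avg(7202)["util_sum"] would be 15521.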

1.1.6 Viewing the load of the CFS run queues (cfs_rq)

# cat /sys/kernel/debug/sched/debug
......
cfs_rq[0]:/
  ......
  .load_avg                      : 0  
  .runnable_avg                  : 1  
  .util_avg
  ......
cfs_rq[1]:/
  ......
  .load_avg                      : 2
  .runnable_avg                  : 6
  .util_avg
  ......
cfs_rq[2]:/
  ......
  .load_avg                      : 0
  .runnable_avg                  : 0
  .util_avg                      : 0
  ......
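
To compare the per-CPU values side by side, the relevant lines can be extracted with a small parser. This is only a sketch based on the layout shown above: the helper name parse_cfs_rq_avgs is hypothetical, and a field whose value is missing from the text (as in the excerpt) is reported as None rather than guessed.

```python
import re

# Matches headers such as "cfs_rq[0]:/" (CPU number, then cgroup path).
CFS_HEADER = re.compile(r"^cfs_rq\[(\d+)\]:(\S+)")
WANTED = ("load_avg", "runnable_avg", "util_avg")


def parse_cfs_rq_avgs(debug_text):
    """Map (cpu, cgroup_path) -> {field: int or None} from sched debug text."""
    result = {}
    current = None
    for line in debug_text.splitlines():
        stripped = line.strip()
        header = CFS_HEADER.match(stripped)
        if header:
            key = (int(header.group(1)), header.group(2))
            current = result.setdefault(key, {})
        elif stripped and not stripped.startswith("."):
            # Any other header-like line (for example rt_rq[...]) ends the block.
            current = None
        elif current is not None and stripped:
            name, _, value = stripped[1:].partition(":")
            name = name.strip()
            if name in WANTED:
                value = value.strip()
                current[name] = int(value) if value else None
    return result
```

Reading /sys/kernel/debug/sched/debug normally requires root privileges and a mounted debugfs.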

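1.1.7 Load consumed by interrupt handlers

This requires the kernel configuration option CONFIG_HAVE_SCHED_AVG_IRQ to be enabled.

1.2 Load balancing in the completely fair scheduling class

1.2.1 Scheduling domains and scheduling groups

1.2.1.1 Introduction

A scheduling domain is in fact a set of CPUs whose workloads the kernel should keep balanced. 《深入理解LINUX内核》 (Understanding the Linux Kernel), p. 285

The kernel builds a hierarchy of scheduling domains that follows the processor topology; each scheduling domain contains multiple scheduling groups. 《Linux内核深度解析》, p. 100

A scheduling group is the smallest unit of load balancing. In the lowest-level scheduling domain, one scheduling group usually describes one CPU.

For the relationship between scheduling domains and scheduling groups, see 《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 521.

A task is migrated from one CPU to another only when the total workload of one group in a scheduling domain is far lower than that of another group in the same domain.

《深入理解LINUX内核》 (Understanding the Linux Kernel), p. 285

1.2.1.2 Scheduling domain data structure: struct sched_domain

1.2.1.3 Scheduling domain configuration: /sys/kernel/debug/sched/domains/

The files under /sys/kernel/debug/sched/domains/cpuX/domainX/ correspond to member variables of struct sched_domain.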

# tree /sys/kernel/debug/sched/domains/

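1.2.1.4 Viewing scheduling domain statistics: /proc/schedstat

Scheduling operations happen very frequently in the Linux kernel, so recording statistics about them adds some overhead, and by default the kernel does not record them. To make the kernel record scheduling statistics, run:

# echo 1 > /proc/sys/kernel/sched_schedstats

After that, the following output is available:

# cat /proc/schedstat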
version 15
timestamp 4300757388
cpu0 17 0 301792 149974 155077 44875 517287834502 12966866224 2808579
domain0 11 41211 41126 49 85 36 0 0 41126 605 603 0 2 2 0 0 603 7133 6658 50 517 466 0 0 6657 0 0 0 0 0 0 0 0 0 16518 3737 0
domain1 ff 17882 17020 836 880 29 0 0 13754 61 60 1 1 0 0 0 15 6496 4341 1655 2337 607 6 1 4340 1 0 1 0 0 0 0 0 0 93684 20030 0
cpu1 0 0 387454 180994 225274 95619 472295220781 13216661039 2921559
domain0 22 57246 57149 43 89 47 0 0 57166 462 462 0 0 0 0 0 462 14118 13659 47 499 450 0 0 13657 0 0 0 0 0 0 0 0 0 17150 4400 0
domain1 ff 21251 20522 703 747 35 1 0 15691 52 51 0 2 2 0 0 8 13514 11220 1764 2475 620 4 2 11218 1 0 1 0 0 0 0 0 0 112505 22988 0
cpu2 1 0 293600 143123 138283 43027 541851523627 11551033606 2640733
domain0 44 39355 39274 33 73 40 0 0 39308 697 696 1 31 0 0 0 696 6899 6449 56 489 432 0 0 6449 0 0 0 0 0 0 0 0 0 13490 3002 0
domain1 ff 16536 15525 973 1031 36 1 1 12770 125 123 2 2 0 0 0 27 6301 4189 1555 2353 679 3 0 4189 2 0 2 0 0 0 0 0 0 81766 17801 0
cpu3 20 0 290491 141093 135912 43103 501320856499 11449769836 2678896
domain0 88 39419 39343 46 81 35 0 0 39353 643 642 1 55 0 0 0 642 6916 6509 42 427 385 0 0 6509 0 0 0 0 0 0 0 0 0 12498 2412 0
domain1 ff 16423 15442 943 1012 40 1 1 12678 112 109 2 3 1 0 0 19 6312 4269 1531 2334 648 5 0 4269 1 0 1 0 0 0 0 0 0 80311 17763 0
cpu4 1 0 301517 145953 147085 45129 425431559500 16937152347 3871699
domain0 11 39179 38999 76 132 49 0 0 38998 638 637 1 23 0 0 0 637 6276 5827 50 517 427 0 0 5790 0 0 0 0 0 0 0 0 0 14980 3230 0
domain1 ff 16427 16374 71 77 5 0 0 1974 118 118 0 0 0 0 0 0 5659 3715 1441 2116 615 1 1 3714 0 0 0 0 0 0 0 0 0 86976 20810 0
cpu5 1 0 273200 132423 134324 39356 537132479700 11555079637 2917078
domain0 22 34464 34105 215 383 135 0 0 34074 791 791 0 0 0 0 0 791 5255 4561 205 779 543 0 0 4532 0 0 0 0 0 0 0 0 0 14842 2741 0
domain1 ff 15127 15083 35 47 11 0 0 2688 126 126 0 0 0 0 0 0 4553 2928 1153 1798 570 0 0 2928 0 0 0 0 0 0 0 0 0 80126 16853 0
cpu6 0 0 285169 138295 138452 44061 517374796871 12016525074 3001982
domain0 44 38288 38170 70 111 38 0 0 38157 641 641 0 0 0 0 0 641 5832 5424 42 444 402 0 0 5424 0 0 0 0 0 0 0 0 0 13452 3378 0
domain1 ff 15901 15834 60 75 8 1 0 2023 117 117 0 0 0 0 0 0 5231 3271 1432 2254 656 0 0 3271 0 0 0 0 0 0 0 0 0 80939 18100 0
cpu7 6 0 264580 128190 144356 41145 497275125272 12070082308 2985981
domain0 88 36128 36024 49 90 40 0 0 36028 767 766 1 33 0 0 0 766 5635 5272 37 388 349 0 0 5270 0 0 0 0 0 0 0 0 0 13360 2345 0
domain1 ff 15161 15097 64 73 5 0 0 1968 144 144 0 0 0 0 0 0 5125 3303 1334 2094 593 0 0 3303 0 0 0 0 0 0 0 0 0 89851 14521 0

Domain statistics
-----------------
One of these is produced per domain for each cpu described. (Note that if
CONFIG_SMP is not defined, *no* domains are utilized and these lines
will not appear in the output.)

domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

The first field is a bit mask indicating what cpus this domain operates over.

The next 24 are a variety of load_balance() statistics grouped into types
of idleness (idle, busy, and newly idle):

1) # of times in this domain load_balance() was called when the
cpu was idle
2) # of times in this domain load_balance() checked but found
the load did not require balancing when the cpu was idle
3) # of times in this domain load_balance() tried to move one or
more tasks and failed, when the cpu was idle
4) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was idle
5) # of times in this domain pull_task() was called when the cpu
was idle
6) # of times in this domain pull_task() was called even though
the target task was cache-hot when idle
7) # of times in this domain load_balance() was called but did
not find a busier queue while the cpu was idle
8) # of times in this domain a busier queue was found while the
cpu was idle but no busier group was found
9) # of times in this domain load_balance() was called when the
cpu was busy
10) # of times in this domain load_balance() checked but found the
load did not require balancing when busy
11) # of times in this domain load_balance() tried to move one or
more tasks and failed, when the cpu was busy
12) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was busy
13) # of times in this domain pull_task() was called when busy
14) # of times in this domain pull_task() was called even though the
target task was cache-hot when busy
15) # of times in this domain load_balance() was called but did not
find a busier queue while the cpu was busy
16) # of times in this domain a busier queue was found while the cpu
was busy but no busier group was found
17) # of times in this domain load_balance() was called when the
cpu was just becoming idle
18) # of times in this domain load_balance() checked but found the
load did not require balancing when the cpu was just becoming idle
19) # of times in this domain load_balance() tried to move one or more
tasks and failed, when the cpu was just becoming idle
20) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was just becoming idle
21) # of times in this domain pull_task() was called when newly idle
22) # of times in this domain pull_task() was called even though the
target task was cache-hot when just becoming idle
23) # of times in this domain load_balance() was called but did not
find a busier queue while the cpu was just becoming idle
24) # of times in this domain a busier queue was found while the cpu
was just becoming idle but no busier group was found

Next three are active_load_balance() statistics:

25) # of times active_load_balance() was called
26) # of times active_load_balance() tried to move a task and failed
27) # of times active_load_balance() successfully moved a task

Next three are sched_balance_exec() statistics:

28) sbe_cnt is not used
29) sbe_balanced is not used
30) sbe_pushed is not used

Next three are sched_balance_fork() statistics:

31) sbf_cnt is not used
32) sbf_balanced is not used
33) sbf_pushed is not used

Next three are try_to_wake_up() statistics:

34) # of times in this domain try_to_wake_up() awoke a task that
last ran on a different cpu in this domain
35) # of times in this domain try_to_wake_up() moved a task to the
waking cpu because it was cache-cold on its own cpu anyway
36) # of times in this domain try_to_wake_up() started passive balancing

《Documentation/scheduler/sched-stats.rst》
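
The field layout described above can be applied mechanically to one domain line. The sketch below assumes the version 15 layout (a cpumask followed by 36 counters); the dictionary keys are illustrative names chosen for this example, not identifiers taken from the kernel documentation.

```python
# Illustrative names for the 8 counters repeated per idleness type (fields 1-24).
LB_NAMES = ("calls", "balanced", "failed", "imbalance",
            "pulled", "pulled_cache_hot", "no_busier_queue", "no_busier_group")
IDLE_TYPES = ("idle", "busy", "newly_idle")


def parse_domain_line(line):
    """Split one 'domain<N> <cpumask> <36 counters>' line into named groups."""
    parts = line.split()
    if len(parts) != 38 or not parts[0].startswith("domain"):
        raise ValueError("expected a version 15 domain line with 36 counters")
    counters = [int(p) for p in parts[2:]]
    stats = {"domain": parts[0], "cpumask": parts[1]}
    for i, idle_type in enumerate(IDLE_TYPES):
        stats[idle_type] = dict(zip(LB_NAMES, counters[i * 8:(i + 1) * 8]))
    # Fields 25-27: active_load_balance() statistics.
    stats["active_balance"] = dict(
        zip(("calls", "failed", "moved"), counters[24:27]))
    # Fields 28-33 are documented as unused and are skipped here.
    # Fields 34-36: try_to_wake_up() statistics.
    stats["wakeup"] = dict(
        zip(("remote", "moved_cache_cold", "passive_balance"), counters[33:36]))
    return stats
```

For the first domain0 line in the output above, stats["idle"]["calls"] is 41211 and stats["newly_idle"]["failed"] is 50. The cpumask is kept as a string because it is a hexadecimal bit mask.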

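1.2.2 The load-balancing flow

1.2.2.1 Flowchart

《Linux内核深度解析》, p. 107

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 530

1.2.2.2 Finding the busiest scheduling group: find_busiest_group()

Related functions: update_sd_lb_stats(), calculate_imbalance(), and update_sg_lb_stats().

《Linux内核深度解析》, p. 107

《深入理解LINUX内核》 (Understanding the Linux Kernel), p. 288

《深入Linux内核架构》 (Professional Linux Kernel Architecture), p. 99

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 531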

1.2.2.3 detach_tasks() / attach_tasks()

detach_tasks()

Iterates over all tasks on the busiest run queue, picks the ones that are suitable for migration, and removes them from that run queue.

attach_tasks()

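Adds the tasks that were just detached from the busiest run queue to the run queue of the current CPU.

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 530

1.2.2.4 Migration thread: migration/<cpu_id>

If load balancing fails, that is, not a single task was migrated, the kernel sets the active load balancing flag on the busiest processor, records the current processor as the migration target, adds a work item whose function is active_load_balance_cpu_stop to the stopper work queue of the busiest processor, and wakes up that processor's migration thread. The migration thread then takes the work item off the stopper work queue and performs active load balancing.

《Linux内核深度解析》, p. 107

《深入Linux内核架构》 (Professional Linux Kernel Architecture), p. 100

1.2.3 The cost of task migration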

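1.3 Load balancing in the deadline scheduling class

When the scheduler picks the next deadline task, if the currently running task is a deadline task, it tries to pull deadline tasks over from processors that are overloaded with deadline tasks.

A processor is overloaded with deadline tasks when:

Its deadline run queue has at least two deadline tasks.
At least one of those deadline tasks is bound to multiple processors.

《Linux内核深度解析》, p. 96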
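
The overload definition above can be written as a small predicate. This is a sketch only: it assumes that both conditions must hold at the same time, and the type and function names (DeadlineRunQueue, is_dl_overloaded, pull_candidates) are hypothetical rather than kernel identifiers.

```python
from dataclasses import dataclass


@dataclass
class DeadlineRunQueue:
    nr_running: int    # deadline tasks queued on this processor
    nr_migratory: int  # of those, tasks allowed to run on more than one processor


def is_dl_overloaded(rq):
    """Assumed reading of the definition: both conditions must be true."""
    return rq.nr_running >= 2 and rq.nr_migratory >= 1


def pull_candidates(run_queues):
    """Return the CPUs from which a deadline task could be pulled."""
    return [cpu for cpu, rq in run_queues.items() if is_dl_overloaded(rq)]
```

A processor with two deadline tasks that are both pinned to it is therefore not a pull candidate, because none of its tasks can migrate.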

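1.4 Load balancing in the real-time scheduling class

When the scheduler picks the next real-time task, if the highest priority among the tasks on the current processor's real-time run queue is lower than the priority of the currently running task, it tries to pull pushable real-time tasks over from processors that are overloaded with real-time tasks.

A processor is overloaded with real-time tasks when:

Its real-time run queue has at least two real-time tasks.
It has at least one pushable real-time task. A pushable real-time task is a real-time task that is bound to multiple processors and can therefore migrate between them.

《Linux内核深度解析》, p. 98

1.5 Debugging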

/sys/kernel/debug/tracing/events/sched/sched_migrate_task/

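2 Load (utilization) of a single processor core

Per-core utilization can be viewed with the command "sar -P ALL 1". Utilization charts can also be generated; see the CSDN blog post "Linux下性能分析的可视化图表工具_linux 热力图".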
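
A per-core utilization figure can also be derived directly from two snapshots of /proc/stat. The sketch below assumes the standard column order of the cpuN lines (user, nice, system, idle, iowait, irq, softirq, steal, ...) and counts iowait as idle time, which is a simplification; sar reports %iowait separately. All function names are hypothetical.

```python
import time


def parse_cpu_times(stat_text):
    """Map 'cpuN' -> list of tick counters from /proc/stat text."""
    times = {}
    for line in stat_text.splitlines():
        parts = line.split()
        # Keep per-core lines (cpu0, cpu1, ...) and skip the aggregate 'cpu' line.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            times[parts[0]] = [int(v) for v in parts[1:]]
    return times


def utilization(before, after):
    """Percentage of non-idle time per core between two snapshots."""
    result = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        deltas = [n - o for n, o in zip(new, old)]
        total = sum(deltas[:8])       # user..steal; guest time is part of user
        idle = deltas[3] + deltas[4]  # idle + iowait (assumption: both count as idle)
        result[cpu] = 100.0 * (total - idle) / total if total > 0 else 0.0
    return result


def sample(interval=1.0):
    """Take two snapshots `interval` seconds apart and return the utilization."""
    with open("/proc/stat") as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_cpu_times(f.read())
    return utilization(before, after)
```

Calling sample(1.0) gives one reading per core, comparable in spirit to one interval of "sar -P ALL 1".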

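3 System load

3.1 Load averages over 1, 5, and 15 minutes

3.1.1 Introduction

The load averages show the demand on the system: the number of tasks that are in the runnable state or in the uninterruptible wait state.
《BPF之巅.洞悉Linux系统和应用性能》 (BPF Performance Tools), p. 198

For the meaning of the 1-, 5-, and 15-minute load averages, see the CSDN blog post "一篇读懂｜Linux系统平均负载_系统负载怎么算".

3.1.2 How to view them

Run any of the following commands: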

uptime
top / htop
w
cat /proc/loadavg
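
For scripting, /proc/loadavg is the easiest of these sources to parse. The sketch below assumes its usual five-field layout: the three load averages, a "runnable/total" count of scheduling entities, and the most recently created PID. LoadAvg and parse_loadavg are hypothetical names.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one: float       # 1-minute load average
    five: float      # 5-minute load average
    fifteen: float   # 15-minute load average
    runnable: int    # currently runnable scheduling entities
    total: int       # scheduling entities that currently exist
    last_pid: int    # PID of the most recently created process


def parse_loadavg(text):
    """Parse the single line found in /proc/loadavg."""
    one, five, fifteen, entities, last_pid = text.split()
    runnable, total = entities.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total), int(last_pid))


def read_loadavg():
    with open("/proc/loadavg") as f:
        return parse_loadavg(f.read())
```

Because the load average counts tasks rather than a percentage, it is usually read relative to the number of CPUs, for example read_loadavg().one / os.cpu_count().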

3.2 Pressure Stall Information (PSI)

3.2.1 Introduction

An interface has now been added in Linux 4.20 that provides such a breakdown: pressure stall information (PSI), which gives averages for CPU, memory, and I/O.
Systems Performance: Enterprise and the Cloud (2020, Pearson), p. 257

3.2.2 /proc/pressure/cpu

# cat /proc/pressure/cpu 
some avg10=0.00 avg60=0.00 avg300=0.00 total=6305749
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.
The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously.
Documentation/accounting/psi.rst
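
The key=value layout of the pressure files makes them straightforward to parse. The following is a minimal sketch based on the output shown above; parse_pressure and read_pressure are hypothetical names. The avg* fields are treated as percentages over the last 10, 60, and 300 seconds, and total as the accumulated stall time in microseconds.

```python
def parse_pressure(text):
    """Map 'some'/'full' -> {'avg10': float, ..., 'total': int}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, pairs = parts[0], parts[1:]
        values = dict(pair.split("=", 1) for pair in pairs)
        result[kind] = {
            # total is an integer counter; the averages are decimal percentages.
            key: int(val) if key == "total" else float(val)
            for key, val in values.items()
        }
    return result


def read_pressure(resource="cpu"):
    """Read /proc/pressure/<resource>; resource is 'cpu', 'memory' or 'io'."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_pressure(f.read())
```

A monitoring script could, for instance, compare read_pressure("cpu")["some"]["avg10"] against a threshold of its own choosing.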