Linux Scheduling Management (1): Scheduler Initialization

1. init_task, the first process at kernel boot

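I cover the ancestors of all Linux processes (processes 0, 1 and 2) in detail in another article: linux 之0号进程、1号进程、2号进程_linux下0号进程 swap-CSDN博客.

I won't expand on them here; see that article for the details.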

When Linux boots, assembly code first initializes the hardware and the CPU and then jumps into C code. The first C entry point reached is:

/* Source file: linux/init/main.c */
asmlinkage __visible void __init start_kernel(void)

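One addition: before the kernel runs any C code, it executes a stretch of assembly. Code path: kernel/arch/arm64/kernel/head.S

The assembly in head.S roughly does the following:

preserve_boot_args: saves the arguments passed by the bootloader, such as the DTB address that is typical on ARM
el2_setup: drops to EL1 when the CPU was still running at EL2 on entry
__create_page_tables: creates the page tables. Linux manages physical memory in pages, so the page tables must be set up before virtual addresses can be used and the MMU is turned on. At this point the code is still running on physical addresses
__primary_switch: its main job is to turn on the MMU
__primary_switched: sets up the kernel stack of process 0 and then calls start_kernel. From here on process 0 is running, and the function it executes is start_kernel.

start_kernel performs almost all of the important initialization of the boot process, including memory, page tables, essential data structures, signals, the scheduler and hardware devices.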

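Who is responsible for all this initialization? The init_task process. init_task is a statically defined process: it already exists the moment the kernel image is placed in memory. It has no user address space and runs in kernel space only, from start to finish. Once it has carried out all the initialization in start_kernel, it starts two kernel threads, kernel_init and kthreadd. At the end of its work, kernel_init turns itself into the familiar init process through an execve call, while kthreadd is the daemon thread the kernel uses to create and manage other kernel threads. Finally, init_task becomes the idle process, which runs whenever the CPU has no other process to run; from then on it simply idles.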

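init_task is process 0, which is also the idle process and the swapper process. After start_kernel finishes, the idle pointer of the run queue (rq) points at the statically defined init_task; init_task itself is never placed on the run queue.

kernel_init is process 1, the system's init process; some people call it the systemd process.

kthreadd is process 2 and is responsible for creating kernel threads. User space, by contrast, creates processes through the fork system call.

The flowchart below shows how these three processes change as the kernel boots:

The "idle process" label in the chart is not strictly accurate: at that point the task has not yet become the idle process and is still running as a normal process.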

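2. sched_init: scheduler initialization

The function in start_kernel that initializes the scheduler is sched_init. Its main tasks are:

Allocate memory for the related data structures
Initialize root_task_group
Initialize each CPU's rq (including its CFS queue and its real-time queue)
Turn init_task into the idle process (this only points the rq's idle pointer at init_task)

Note that although init_task is turned into the idle process here, it keeps on doing initialization work. At this stage it merely carries the idle label and is still the init_task process. It becomes the idle process in the real sense only at the very end, after it has started kernel_init and kthreadd.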
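The "idle in name only" point can be sketched in userspace C. Everything below (toy_task, toy_rq, toy_init_idle, toy_pick_next) is a hypothetical simplification, not kernel code. It only mirrors the idea that the rq's idle pointer is aimed at the boot task while the count of queued tasks stays at zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for task_struct and rq. */
struct toy_task {
    int pid;
    const char *comm;
};

struct toy_rq {
    unsigned int nr_running;   /* tasks queued for scheduling */
    struct toy_task *curr;     /* task currently on the CPU */
    struct toy_task *idle;     /* task to run when nothing else is runnable */
};

/* The statically defined boot task, analogous to init_task (process 0). */
struct toy_task toy_init_task = { 0, "swapper" };

/* Mirrors the idea of init_idle(): record the task as this CPU's idle
 * task and as the running task, without counting it as queued. */
void toy_init_idle(struct toy_rq *rq, struct toy_task *idle)
{
    assert(rq != NULL && idle != NULL);
    rq->curr = idle;
    rq->idle = idle;
    /* nr_running is deliberately left untouched. */
}

/* What a scheduler would pick: the idle task only when nothing is queued. */
struct toy_task *toy_pick_next(const struct toy_rq *rq, struct toy_task *queued)
{
    return (rq->nr_running > 0 && queued != NULL) ? queued : rq->idle;
}
```

With an empty queue, toy_pick_next() falls back to the boot task; as soon as another task is queued, that task wins.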

Kernel source path: kernel/kernel/sched/core.c

void __init sched_init(void)
{
        unsigned long ptr = 0;
        int i;

        /* Make sure the linker didn't screw up */
        BUG_ON(&idle_sched_class + 1 != &fair_sched_class ||
               &fair_sched_class + 1 != &rt_sched_class ||
               &rt_sched_class + 1   != &dl_sched_class);
#ifdef CONFIG_SMP
        BUG_ON(&dl_sched_class + 1 != &stop_sched_class);
#endif

        wait_bit_init();
        /* Compute how much space the data structures to be allocated need */
#ifdef CONFIG_FAIR_GROUP_SCHED
        ptr += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
        ptr += 2 * nr_cpu_ids * sizeof(void **);
#endif
        if (ptr) { /* allocate the memory */
                ptr = (unsigned long)kzalloc(ptr, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
                /* Per-CPU array of CFS scheduling-entity pointers (se) of root_task_group */
                root_task_group.se = (struct sched_entity **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);

                /* Per-CPU array of CFS run-queue pointers (cfs_rq) of root_task_group */
                root_task_group.cfs_rq = (struct cfs_rq **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);

                root_task_group.shares = ROOT_TASK_GROUP_LOAD;
                init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
                /* Per-CPU array of real-time scheduling-entity pointers (rt_se) of root_task_group */
                root_task_group.rt_se = (struct sched_rt_entity **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);

                root_task_group.rt_rq = (struct rt_rq **)ptr;
                ptr += nr_cpu_ids * sizeof(void **);

#endif /* CONFIG_RT_GROUP_SCHED */
        }
#ifdef CONFIG_CPUMASK_OFFSTACK
        for_each_possible_cpu(i) {
                per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
                        cpumask_size(), GFP_KERNEL, cpu_to_node(i));
                per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
                        cpumask_size(), GFP_KERNEL, cpu_to_node(i));
        }
#endif /* CONFIG_CPUMASK_OFFSTACK */
        /* Initialize the default bandwidth limits, which cap the share of CPU time real-time tasks may take */
        init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime());
        init_dl_bandwidth(&def_dl_bandwidth, global_rt_period(), global_rt_runtime());

#ifdef CONFIG_SMP
        /* Initialize the default root domain. A domain covers one or more CPUs; load balancing runs inside a domain, isolated from the others */
        init_defrootdomain();
#endif

#ifdef CONFIG_RT_GROUP_SCHED
        /* Initialize the real-time bandwidth limit of root_task_group */
        init_rt_bandwidth(&root_task_group.rt_bandwidth,
                        global_rt_period(), global_rt_runtime());
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
        task_group_cache = KMEM_CACHE(task_group, 0);
        /* Add root_task_group, whose arrays are now allocated, to the task_groups list */
        list_add(&root_task_group.list, &task_groups);
        INIT_LIST_HEAD(&root_task_group.children);
        INIT_LIST_HEAD(&root_task_group.siblings);
        /* Initialize autogroup */
        autogroup_init(&init_task);

#endif /* CONFIG_CGROUP_SCHED */
        for_each_possible_cpu(i) { /* set up every possible CPU in turn */
                struct rq *rq;

                rq = cpu_rq(i); /* get CPU i's run queue */
                raw_spin_lock_init(&rq->lock); /* initialize the rq spinlock */
                rq->nr_running = 0; /* nothing is queued on this CPU's run queue yet */
                rq->calc_load_active = 0; /* CPU load */
                rq->calc_load_update = jiffies + LOAD_FREQ; /* time of the next load update */
                init_cfs_rq(&rq->cfs); /* initialize the CFS run queue */
                init_rt_rq(&rq->rt); /* initialize the real-time run queue */
                init_dl_rq(&rq->dl);
#ifdef CONFIG_FAIR_GROUP_SCHED
                INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
                rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
                /*
                 * How much CPU bandwidth does root_task_group get?
                 *
                 * In case of task-groups formed thr' the cgroup filesystem, it
                 * gets 100% of the CPU resources in the system. This overall
                 * system CPU resource is divided among the tasks of
                 * root_task_group and its child task-groups in a fair manner,
                 * based on each entity's (task or task-group's) weight
                 * (se->load.weight).
                 *
                 * In other words, if root_task_group has 10 tasks of weight
                 * 1024) and two child groups A0 and A1 (of weight 1024 each),
                 * then A0's share of the CPU resource is:
                 *
                 *      A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
                 *
                 * We achieve this by letting root_task_group's tasks sit
                 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
                 */
                init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */

                rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
#ifdef CONFIG_RT_GROUP_SCHED
                init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
#ifdef CONFIG_SMP /* the fields below are all used by load balancing */
                rq->sd = NULL;
                rq->rd = NULL;
                rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
                rq->balance_callback = NULL;
                rq->active_balance = 0;
                rq->next_balance = jiffies;
                rq->push_cpu = 0;
                rq->cpu = i;
                rq->online = 0;
                rq->idle_stamp = 0;
                rq->avg_idle = 2*sysctl_sched_migration_cost;
                rq->max_idle_balance_cost = sysctl_sched_migration_cost;

                INIT_LIST_HEAD(&rq->cfs_tasks);
                /* Attach this CPU's run queue to the default root domain */
                rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ_COMMON
                /* Time of this rq's last load update: now */
                rq->last_blocked_load_update_tick = jiffies;
                atomic_set(&rq->nohz_flags, 0); /* flags for the dynamic tick (NO_HZ); it is not in use initially */

                rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func);
#endif
#ifdef CONFIG_HOTPLUG_CPU
                rcuwait_init(&rq->hotplug_wait);
#endif
#endif /* CONFIG_SMP */
                /* Initialize the rq's hrtick timer. It is a high-resolution timer, but it is only initialized here and not yet in use */
                hrtick_rq_init(rq);
                atomic_set(&rq->nr_iowait, 0);
        }
        /* Set the load weight of init_task */
        set_load_weight(&init_task, false);

        /*
         * The boot idle thread does lazy MMU switching as well:
         */
        mmgrab(&init_mm);
        enter_lazy_tlb(&init_mm, current);

        /*
         * Make us the idle thread. Technically, schedule() should not be
         * called from this thread, however somewhere below it might be,
         * but because we are the idle thread, we just pick up running again
         * when this runqueue becomes "idle".
         */
        /* Make the current task the idle task. The idle task runs, spinning, when the CPU has nothing else to run */
        init_idle(current, smp_processor_id());
        /* Time of the next load update (LOAD_FREQ jiffies from now) */
        calc_load_update = jiffies + LOAD_FREQ;

#ifdef CONFIG_SMP
        idle_thread_set_boot_cpu();
#endif
        init_sched_fair_class();

        init_schedstats();

        psi_init();

        init_uclamp();
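        /* This only marks the scheduler as running. The system still has a single task, init_task (idle), and the timers have not started, so no other task gets scheduled and there is none to schedule */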
        scheduler_running = 1;
}

At this point the kernel has only one process, init_task, and current is init_task. The init process is started later, in rest_init, at the very end of initialization.
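The way sched_init() carves one allocation into several per-CPU pointer arrays by advancing ptr can be tried out in userspace. This is a minimal sketch under assumptions: calloc stands in for kzalloc, and toy_se, toy_cfs_rq, toy_task_group and toy_carve_group_arrays are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the kernel types are far larger. */
struct toy_se { int dummy; };
struct toy_cfs_rq { int dummy; };

struct toy_task_group {
    struct toy_se **se;          /* one pointer per CPU */
    struct toy_cfs_rq **cfs_rq;  /* one pointer per CPU */
};

/* Allocate one zeroed block big enough for both per-CPU pointer arrays
 * and slice it by advancing a cursor, as sched_init() does with `ptr`.
 * Returns the base of the block (to be freed by the caller) or NULL. */
void *toy_carve_group_arrays(struct toy_task_group *tg, size_t nr_cpus)
{
    size_t bytes = 2 * nr_cpus * sizeof(void *);
    char *base = calloc(1, bytes);

    if (base == NULL)
        return NULL;

    char *ptr = base;

    tg->se = (struct toy_se **)ptr;
    ptr += nr_cpus * sizeof(void *);

    tg->cfs_rq = (struct toy_cfs_rq **)ptr;
    ptr += nr_cpus * sizeof(void *);

    /* Both slices together must consume exactly the block we sized. */
    assert((size_t)(ptr - base) == bytes);
    return base;
}
```

The single allocation keeps the arrays adjacent in memory, and every slot starts out as a null pointer because the block is zeroed.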

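After sched_init completes, each CPU has its own rq, as the figure below shows:

Scheduler initialization is fairly simple. The core of the scheduler lies elsewhere, in its run-time handling, which later articles will analyze in detail.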
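The bandwidth comment inside sched_init() works through an example: ten tasks of weight 1024 plus two child groups A0 and A1 of weight 1024 each, all competing directly on rq->cfs. The arithmetic can be reproduced with a small helper. toy_cpu_share is a hypothetical name, and the model assumes that shares depend on weight alone.

```c
#include <assert.h>
#include <stddef.h>

/* Share of CPU time (as a fraction of 1.0) that entity `idx` receives when
 * all entities sit on the same queue and compete purely by weight. */
double toy_cpu_share(const unsigned long *weights, size_t n, size_t idx)
{
    unsigned long total = 0;

    assert(idx < n);
    for (size_t i = 0; i < n; i++)
        total += weights[i];

    return total ? (double)weights[idx] / (double)total : 0.0;
}
```

For twelve entities of weight 1024, the entry that represents A0 gets 1024 / 12288 of the CPU, which is the 8.33% quoted in the kernel comment.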

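3. When a process joins the run queue

Only a process in the TASK_RUNNING state can be added to the scheduler; no other state qualifies. In other words, a process that is sleeping or suspended is not present in the scheduler. A process joins the scheduler at the following moments:

When the process has just been created. Even if it calls sleep() as soon as it runs, it is first added to the scheduler, because after being added it still has some initialization and other work of its own to do before it reaches that "immediate" sleep().
When the process is woken up. Taking sleep() as the example again: the sleep() function we use in everyday programs works by making a system call that sets the process state to TASK_INTERRUPTIBLE, removes the process from the run queue and starts a timer. When the timer expires, the process is woken up and put back on the run queue.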
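The two entry points above (creation and wakeup) and the rule that only a runnable task may be queued can be modelled as a tiny state machine. This is a hypothetical userspace sketch; the toy_ names and the two-state enum are assumptions and leave out everything the kernel does around locking and timers.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { TOY_TASK_RUNNING, TOY_TASK_INTERRUPTIBLE };

struct toy_task {
    enum toy_state state;
    bool on_rq;
};

struct toy_rq {
    unsigned int nr_running;
};

/* Only a runnable task that is not already queued may be enqueued. */
bool toy_enqueue(struct toy_rq *rq, struct toy_task *p)
{
    if (p->state != TOY_TASK_RUNNING || p->on_rq)
        return false;
    p->on_rq = true;
    rq->nr_running++;
    return true;
}

/* Going to sleep: change the state, then leave the run queue. */
void toy_sleep(struct toy_rq *rq, struct toy_task *p)
{
    p->state = TOY_TASK_INTERRUPTIBLE;
    if (p->on_rq) {
        assert(rq->nr_running > 0);
        p->on_rq = false;
        rq->nr_running--;
    }
}

/* Timer expiry or another wakeup source: make runnable and requeue. */
bool toy_wake(struct toy_rq *rq, struct toy_task *p)
{
    p->state = TOY_TASK_RUNNING;
    return toy_enqueue(rq, p);
}
```

A sleeping task is rejected by toy_enqueue() until toy_wake() has flipped its state back, which is the ordering the sleep() description relies on.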

(1) The function called at process creation: sched_fork

// kernel/kernel/sched/core.c

/*
 * fork()/clone()-time setup:
 */
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
        /* initialize scheduling-related fields, e.g. the scheduling entity and runtime accounting */
        __sched_fork(clone_flags, p);
        /*
         * We mark the process as NEW here. This guarantees that
         * nobody will actually run it, and a signal or other external
         * event cannot wake it up and insert it on the runqueue either.
         */
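        /* Mark the task as TASK_NEW: it is not running on any CPU, and signals or other external events cannot wake it up and insert it into the run queue; it is put on the run queue later */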

        p->state = TASK_NEW;

        /*
         * Make sure we do not leak PI boosting priority to the child.
         */
        p->prio = current->normal_prio; /* set the child's priority from the parent's normal priority */

        uclamp_fork(p);

        /*
         * Revert to default priority/policy on fork if requested.
         */
        /* if the priority has to be reset on fork */
        if (unlikely(p->sched_reset_on_fork)) {
                /* deadline or real-time policy */
                if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
                        p->policy = SCHED_NORMAL; /* SCHED_NORMAL, i.e. scheduled by CFS */
                        p->static_prio = NICE_TO_PRIO(0); /* static priority from the default nice value */
                        p->rt_priority = 0; /* real-time priority 0 */
                } else if (PRIO_TO_NICE(p->static_prio) < 0)
                        p->static_prio = NICE_TO_PRIO(0); /* static priority from the default nice value */

                p->prio = p->normal_prio = p->static_prio;
                set_load_weight(p, false); /* set the task's load weight */

                /*
                 * We don't need the reset flag anymore after the fork. It has
                 * fulfilled its duty:
                 */
                /* sched_reset_on_fork is not needed from here on; set it to 0 */

                p->sched_reset_on_fork = 0;
        }

        if (dl_prio(p->prio))
                return -EAGAIN;
        else if (rt_prio(p->prio)) /* real-time priority: use rt_sched_class */
                p->sched_class = &rt_sched_class;
        else
                p->sched_class = &fair_sched_class; /* otherwise use fair_sched_class */

        init_entity_runnable_average(&p->se);

#ifdef CONFIG_SCHED_INFO
        if (likely(sched_info_on()))
                memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
        p->on_cpu = 0;
#endif
        init_task_preempt_count(p); /* start the task with kernel preemption disabled */
#ifdef CONFIG_HAVE_PREEMPT_LAZY
        task_thread_info(p)->preempt_lazy_count = 0;
#endif
#ifdef CONFIG_SMP
        plist_node_init(&p->pushable_tasks, MAX_PRIO);
        RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif
        return 0;
}

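The main work of sched_fork() is:

Initialize some fields of process p (those common to real-time and normal processes)
Set p's scheduling class according to its priority (real-time priorities: 0~99; normal priorities: 100~139)
Initialize p with kernel preemption disabled (when the CPU first runs p, p still has some initialization to do)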
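The priority ranges and the class selection at the end of sched_fork() can be condensed into a few lines. A minimal sketch: the toy_ names are hypothetical, while the numeric values follow the ranges listed above, with nice 0 mapping to priority 120 and priorities below 0 treated as deadline priorities.

```c
#include <assert.h>

#define TOY_MAX_DL_PRIO   0     /* prio < 0: deadline task */
#define TOY_MAX_RT_PRIO   100   /* 0..99: real-time task */
#define TOY_DEFAULT_PRIO  120   /* nice 0 */
#define TOY_MAX_PRIO      140   /* 100..139: normal task */

enum toy_class { TOY_CLASS_DL, TOY_CLASS_RT, TOY_CLASS_FAIR };

/* Same idea as NICE_TO_PRIO(): nice -20..19 maps to prio 100..139. */
int toy_nice_to_prio(int nice)
{
    return nice + TOY_DEFAULT_PRIO;
}

/* Mirrors the if/else chain at the end of sched_fork(). In the kernel the
 * deadline case makes sched_fork() return -EAGAIN instead of picking a class. */
enum toy_class toy_class_for_prio(int prio)
{
    assert(prio < TOY_MAX_PRIO);
    if (prio < TOY_MAX_DL_PRIO)
        return TOY_CLASS_DL;
    if (prio < TOY_MAX_RT_PRIO)
        return TOY_CLASS_RT;
    return TOY_CLASS_FAIR;
}
```

Priority 99 still lands in the real-time class, and 100 is the first priority handled by the fair class.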

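As you can see, the initialization done by sched_fork() is also fairly simple. Note that different kinds of processes use different scheduling classes, and that the class's own initialization function is called as well (that call is not part of the excerpt above). The real-time scheduling class has no task_fork() function of its own, whereas a normal process under the CFS policy ends up in task_fork_fair(). Let's look at its implementation: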

Source path: kernel/kernel/sched/fair.c

static void task_fork_fair(struct task_struct *p)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se, *curr;  /* scheduling entity (se) of task p */
        struct rq *rq = this_rq(); /* rq of the current CPU */
        struct rq_flags rf;

        rq_lock(rq, &rf); /* lock the run queue */
        update_rq_clock(rq); /* update the rq clock */

        cfs_rq = task_cfs_rq(current); /* the cfs_rq the parent (current) is on */
        curr = cfs_rq->curr; /* entity currently running on that cfs_rq */
        if (curr) {
                /* update the running entity's runtime */
                update_curr(cfs_rq);
                /* the child starts from the parent's virtual runtime */
                se->vruntime = curr->vruntime;
        }
        place_entity(cfs_rq, se, 1); /* adjust se's virtual runtime */

        if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
                /*
                 * Upon rescheduling, sched_class::put_prev_task() will place
                 * 'current' within the tree based on its new key value.
                 */
                swap(curr->vruntime, se->vruntime);
                resched_curr_lazy(rq);
        }
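        /*
         * Store vruntime relative to this cfs_rq's min_vruntime. enqueue_entity()
         * later adds the min_vruntime of the cfs_rq the task is actually queued on.
         */
        se->vruntime -= cfs_rq->min_vruntime;
        rq_unlock(rq, &rf); /* unlock the run queue */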
}

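task_fork_fair() mainly sets process p's virtual runtime and the cfs queue it belongs to. The line worth noting is cfs_rq = task_cfs_rq(current);. As the comment in the code indicates, task_cfs_rq(current) returns current's se.cfs_rq. Note that se.cfs_rq does not hold the root cfs queue but the cfs_rq the entity actually sits on: if the parent is on the cfs_rq of a task group, the newly created process is on that group's cfs_rq as well.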
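The last step of task_fork_fair(), se->vruntime -= cfs_rq->min_vruntime, pairs with the se->vruntime += cfs_rq->min_vruntime in enqueue_entity(). The sketch below shows the round trip with plain integers. The toy_ types and functions are hypothetical, and the numbers in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct toy_cfs_rq { uint64_t min_vruntime; };
struct toy_se { uint64_t vruntime; };

/* End of task_fork_fair(): keep only the offset from the parent's queue. */
void toy_fork_normalize(struct toy_se *se, const struct toy_cfs_rq *parent_rq)
{
    se->vruntime -= parent_rq->min_vruntime;
}

/* enqueue_entity() with renorm set: make the value absolute again,
 * this time against the queue the task is really placed on. */
void toy_enqueue_renormalize(struct toy_se *se, const struct toy_cfs_rq *target_rq)
{
    se->vruntime += target_rq->min_vruntime;
}

/* Convenience wrapper: the child's lead over min_vruntime survives the
 * move from the parent's queue to the target queue. */
uint64_t toy_place_child(uint64_t child_vruntime,
                         const struct toy_cfs_rq *parent_rq,
                         const struct toy_cfs_rq *target_rq)
{
    struct toy_se se = { child_vruntime };
    uint64_t lead = child_vruntime - parent_rq->min_vruntime;

    toy_fork_normalize(&se, parent_rq);
    toy_enqueue_renormalize(&se, target_rq);
    assert(se.vruntime - target_rq->min_vruntime == lead);
    return se.vruntime;
}
```

For instance, a child at vruntime 1030 on a queue whose min_vruntime is 1000 keeps its lead of 30 and becomes 500030 on a queue whose min_vruntime is 500000. This is why the subtraction matters when wake_up_new_task() picks a different CPU for the child.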

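(2) At this point the scheduling-related initialization of the new process is complete, but the scheduler has not yet put it on a queue. That happens in wake_up_new_task(p), called from do_fork(). Let's look at the implementation of wake_up_new_task():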

void wake_up_new_task(struct task_struct *p)
{
        struct rq_flags rf;
        struct rq *rq;

        raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
        p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
        /*
         * Fork balancing, do it here and not earlier because:
         *  - cpus_ptr can change in the fork path
         *  - any previously selected CPU might disappear through hotplug
         *
         * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
         * as we're not fully set-up yet.
         */
        p->recent_used_cpu = task_cpu(p);
        rseq_migrate(p);
        /* pick a suitable CPU for the task */
        __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
        rq = __task_rq_lock(p, &rf);
        update_rq_clock(rq);
        post_init_entity_util_avg(p); /* related to multi-core load balancing */

        activate_task(rq, p, ENQUEUE_NOCLOCK); /* put the task on the CPU's run queue */
        trace_sched_wakeup_new(p); /* tracepoint, for debugging */
        check_preempt_curr(rq, p, WF_FORK);
#ifdef CONFIG_SMP
        if (p->sched_class->task_woken) {
                /*
                 * Nothing relies on rq->lock after this, so its fine to
                 * drop it.
                 */
                rq_unpin_lock(rq, &rf);
                p->sched_class->task_woken(rq, p);
                rq_repin_lock(rq, &rf);
        }
#endif
        task_rq_unlock(rq, p, &rf);
}
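activate_task() in the function above does not know which kind of task it is queuing; it goes through the enqueue_task function pointer of the task's scheduling class. The dispatch pattern itself is easy to reproduce. The sketch below is a hypothetical miniature (toy_sched_class with a single hook, two toy classes), not the kernel's struct sched_class.

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq;
struct toy_task;

/* Hypothetical miniature of a scheduling class: just one hook. */
struct toy_sched_class {
    const char *name;
    void (*enqueue_task)(struct toy_rq *rq, struct toy_task *p, int flags);
};

struct toy_rq {
    unsigned int cfs_nr_running;
    unsigned int rt_nr_running;
};

struct toy_task {
    const struct toy_sched_class *sched_class;
    int on_rq;
};

static void toy_enqueue_fair(struct toy_rq *rq, struct toy_task *p, int flags)
{
    (void)p;
    (void)flags;
    rq->cfs_nr_running++;
}

static void toy_enqueue_rt(struct toy_rq *rq, struct toy_task *p, int flags)
{
    (void)p;
    (void)flags;
    rq->rt_nr_running++;
}

const struct toy_sched_class toy_fair_class = { "fair", toy_enqueue_fair };
const struct toy_sched_class toy_rt_class = { "rt", toy_enqueue_rt };

/* Class-independent entry point: dispatch through the task's class,
 * then mark the task as queued. */
void toy_activate_task(struct toy_rq *rq, struct toy_task *p, int flags)
{
    assert(p->sched_class != NULL && p->sched_class->enqueue_task != NULL);
    p->sched_class->enqueue_task(rq, p, flags);
    p->on_rq = 1;
}
```

The caller stays identical for both tasks; only the class pointer stored in the task decides which queue is touched.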

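In wake_up_new_task(), the function that puts the process on the run queue is activate_task(), which eventually calls the function that the enqueue_task pointer of the new process's scheduling class points to. Here we look at the function the CFS class's enqueue_task pointer refers to, enqueue_task_fair():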

//kernel/kernel/sched/fair.c
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int idle_h_nr_running = task_has_idle_policy(p);
        int task_new = !(flags & ENQUEUE_WAKEUP);

        /*
         * The code below (indirectly) updates schedutil which looks at
         * the cfs_rq utilization to select a frequency.
         * Let's add the task's estimated utilization to the cfs_rq's
         * estimated utilization, before we update schedutil.
         */
        util_est_enqueue(&rq->cfs, p);

        /*
         * If in_iowait is set, the code below may not trigger any cpufreq
         * utilization updates, so do it here explicitly with the IOWAIT flag
         * passed.
         */
        if (p->in_iowait)
                cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
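        /* This is an iteration: a task may belong to a task group, so when it is added to that group's queue, the enqueue has to walk upward through the parent entities as well */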

        for_each_sched_entity(se) {
                if (se->on_rq)
                        break;
                /*
                 * Without CONFIG_FAIR_GROUP_SCHED: the cfs_rq of the rq of the CPU it is on.
                 * With CONFIG_FAIR_GROUP_SCHED: the cfs_rq the entity belongs to.
                 */
                cfs_rq = cfs_rq_of(se);
                enqueue_entity(cfs_rq, se, flags); /* add the entity to the queue */

                cfs_rq->h_nr_running++;
                cfs_rq->idle_h_nr_running += idle_h_nr_running;

                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
                        goto enqueue_throttle;

                flags = ENQUEUE_WAKEUP;
        }

        /* Runs only for the ancestors left over when the loop above stopped at an entity that was already on a queue: update their load and counts */
        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);

                update_load_avg(cfs_rq, se, UPDATE_TG);
                se_update_runnable(se);
                update_cfs_group(se);

                cfs_rq->h_nr_running++;
                cfs_rq->idle_h_nr_running += idle_h_nr_running;

                /* end evaluation on encountering a throttled cfs_rq */
                if (cfs_rq_throttled(cfs_rq))
                        goto enqueue_throttle;

               /*
                * One parent has been throttled and cfs_rq removed from the
                * list. Add it back to not break the leaf list.
                */
               if (throttled_hierarchy(cfs_rq))
                       list_add_leaf_cfs_rq(cfs_rq);
        }

        /* At this point se is NULL and we are at root level*/
        add_nr_running(rq, 1); /* number of runnable tasks on this CPU's run queue + 1 */

        /*
         * Since new tasks are assigned an initial util_avg equal to
         * half of the spare capacity of their CPU, tiny tasks have the
         * ability to cross the overutilized threshold, which will
         * result in the load balancer ruining all the task placement
         * done by EAS. As a way to mitigate that effect, do not account
         * for the first enqueue operation of new tasks during the
         * overutilized flag detection.
         *
         * A better way of solving this problem would be to wait for
         * the PELT signals of tasks to converge before taking them
         * into account, but that is not straightforward to implement,
         * and the following generally works well enough in practice.
         */
        if (!task_new)
                update_overutilized_status(rq);

enqueue_throttle:
        if (cfs_bandwidth_used()) {
                /*
                 * When bandwidth control is enabled; the cfs_rq_throttled()
                 * breaks in the above iteration can result in incomplete
                 * leaf list maintenance, resulting in triggering the assertion
                 * below.
                 */
                for_each_sched_entity(se) {
                        cfs_rq = cfs_rq_of(se);

                        if (list_add_leaf_cfs_rq(cfs_rq))
                                break;
                }
        }

        assert_list_leaf_cfs_rq(rq);

        hrtick_update(rq); /* update when the next scheduling interrupt fires */
}
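The two for_each_sched_entity() loops in enqueue_task_fair() walk from the task's entity up through its group entities. A reduced model makes the split visible: the first loop queues entities until it meets one that is already queued, and the second loop only updates the counts of the remaining ancestors. The sketch is hypothetical (toy_se, toy_cfs_rq, toy_enqueue_hierarchy) and leaves out throttling and load tracking entirely.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cfs_rq { unsigned int h_nr_running; };

struct toy_se {
    struct toy_se *parent;       /* NULL at the root level */
    struct toy_cfs_rq *cfs_rq;   /* queue this entity is (to be) queued on */
    int on_rq;
};

/* Returns how many entities were newly queued by this call. */
unsigned int toy_enqueue_hierarchy(struct toy_se *se)
{
    unsigned int newly_queued = 0;

    /* First phase: enqueue until we meet an entity that is already queued. */
    for (; se != NULL; se = se->parent) {
        if (se->on_rq)
            break;
        se->on_rq = 1;
        se->cfs_rq->h_nr_running++;
        newly_queued++;
    }

    /* Second phase: the remaining ancestors are already queued, so only
     * their hierarchical task count changes. */
    for (; se != NULL; se = se->parent) {
        assert(se->on_rq);
        se->cfs_rq->h_nr_running++;
    }

    return newly_queued;
}
```

Queuing the first task of a group also queues the group's own entity on the parent queue; a second task of the same group is queued alone, and the group entity only has its count raised.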

enqueue_task_fair() in turn uses enqueue_entity() to do the work:

// kernel/kernel/sched/fair.c
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
        bool curr = cfs_rq->curr == se;

        /*
         * If we're the current task, we must renormalise before calling
         * update_curr().
         */
        if (renorm && curr)
                se->vruntime += cfs_rq->min_vruntime;
        /* update the running entity's runtime and virtual runtime */
        update_curr(cfs_rq);

        /*
         * Otherwise, renormalise after, such that we're placed at the current
         * moment in time, instead of some random moment in the past. Being
         * placed in the past could significantly boost this task to the
         * fairness detriment of existing tasks.
         */
        if (renorm && !curr)
                se->vruntime += cfs_rq->min_vruntime;

        /*
         * When enqueuing a sched_entity, we must:
         *   - Update loads to have both entity and cfs_rq synced with now.
         *   - Add its load to cfs_rq->runnable_avg
         *   - For group_entity, update its weight to reflect the new share of
         *     its group cfs_rq
         *   - Add its new weight to cfs_rq->load.weight
         */
        /* update the cfs_rq's load; account_entity_enqueue() below adds se's weight to the queue's total weight */
        update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
        se_update_runnable(se);
        update_cfs_group(se);
        account_entity_enqueue(cfs_rq, se);

        /* a newly created task is enqueued without ENQUEUE_WAKEUP, so this branch is skipped */
        if (flags & ENQUEUE_WAKEUP)
                place_entity(cfs_rq, se, 0);

        check_schedstat_required();
        update_stats_enqueue(cfs_rq, se, flags);
        check_spread(cfs_rq, se);
        /* insert se into the red-black tree of the cfs_rq */
        if (!curr)
                __enqueue_entity(cfs_rq, se);
        se->on_rq = 1; /* mark se as queued */

        /*
         * When bandwidth control is enabled, cfs might have been removed
         * because of a parent been throttled but cfs->nr_running > 1. Try to
         * add it unconditionnally.
         */
        /* extra handling when the cfs_rq holds a single entity (or bandwidth control is in use) */
        if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
                list_add_leaf_cfs_rq(cfs_rq);

        if (cfs_rq->nr_running == 1)
                check_enqueue_throttle(cfs_rq);
}

The key point: when a process is added to a run queue, the system chooses the CPU queue according to the load on the CPUs.
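As a rough illustration of "choose the queue by load", the sketch below picks the allowed CPU with the fewest runnable tasks. This is a deliberately naive stand-in, and toy_select_cpu is a hypothetical helper: the kernel's select_task_rq() path takes far more into account than a bare task count.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the allowed CPU with the fewest runnable tasks,
 * or -1 if no CPU is allowed. Ties go to the lowest index. */
int toy_select_cpu(const unsigned int *nr_running, const int *allowed,
                   size_t nr_cpus)
{
    int best = -1;

    assert(nr_running != NULL && allowed != NULL);
    for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
        if (!allowed[cpu])
            continue;
        if (best < 0 || nr_running[cpu] < nr_running[(size_t)best])
            best = (int)cpu;
    }
    return best;
}
```

The allowed array plays the role of the task's CPU affinity: a lightly loaded CPU is ignored if the task may not run there.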