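Overview
KVM (Kernel-based Virtual Machine) is a kernel-based virtual machine monitor developed by the Israeli startup Qumranet after CPU vendors introduced hardware virtualization support.
KVM is a general virtualization framework: besides x86, other architectures such as ARM have their own implementations. The core KVM code therefore lives in the virt/kvm directory of the kernel tree and holds the code common to all CPU architectures; this is the source of the kvm.ko module.
Architecture-specific code lives under arch/; the x86 code, for example, is in arch/x86/kvm. A single architecture can also have several implementations. x86 has both Intel and AMD CPUs, so the x86 directory contains vmx.c (Intel VT-x) and svm.c (AMD-V), which are the sources of kvm-intel.ko and kvm-amd.ko, alongside shared files such as ioapic.c and lapic.c, which implement the interrupt controllers. This kind of source layout is common in other Linux kernel subsystems as well.
Each x86 virtualization implementation (Intel or AMD) registers a kvm_x86_ops structure with the KVM module. As a result, many KVM functions are thin shells: a generic function first calls a kvm_arch_xxx function for the architecture-specific work, and when kvm_arch_xxx needs vendor-specific behavior, it calls the corresponding callback in kvm_x86_ops.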
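The layering described above (generic shell, architecture hook, vendor callback) can be sketched in plain C. This is only an illustrative model, not kernel code: every `demo_` name is hypothetical, and the real kvm_x86_ops has far more members.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct kvm_x86_ops. */
struct demo_x86_ops {
    const char *name;
    int (*hardware_enable)(void);
};

/* Vendor implementations (stand-ins for vmx.c and svm.c). */
static int demo_vmx_hardware_enable(void) { return 1; }
static int demo_svm_hardware_enable(void) { return 2; }

const struct demo_x86_ops demo_vmx_ops = { "vmx", demo_vmx_hardware_enable };
const struct demo_x86_ops demo_svm_ops = { "svm", demo_svm_hardware_enable };

/* The single table that the common x86 code dispatches through. */
static const struct demo_x86_ops *demo_ops;

/* Called once by the vendor module when it loads. */
void demo_register_ops(const struct demo_x86_ops *ops)
{
    demo_ops = ops;
}

/* Architecture layer: forwards to the vendor callback, if one is registered. */
int demo_arch_hardware_enable(void)
{
    if (demo_ops == NULL || demo_ops->hardware_enable == NULL)
        return -1;
    return demo_ops->hardware_enable();
}

/* Generic layer: a thin shell around the architecture hook. */
int demo_hardware_enable(void)
{
    return demo_arch_hardware_enable();
}
```

The generic caller never names a vendor; swapping the registered table is enough to change which implementation runs.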
Relationship between kvm_intel.ko and kvm.ko:
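VM creation
VM creation on the QEMU side
The entry points for QEMU's KVM support are mostly in kvm-all.c, where the initialization function is kvm_init().
When QEMU is started with --enable-kvm on the command line, qemu_init() handles the option as follows: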
case QEMU_OPTION_enable_kvm:
olist = qemu_find_opts("machine");
qemu_opts_parse_noisily(olist, "accel=kvm", false);
break;
This adds accel=kvm to the machine option list. Later, main calls configure_accelerator(current_machine), which reads the accel value from the machine options, looks up the matching accelerator type, and calls accel_init_machine.
int accel_init_machine(AccelState *accel, MachineState *ms)
{
AccelClass *acc = ACCEL_GET_CLASS(accel); /* get the accel class of the requested type (kvm here) */
int ret;
ms->accelerator = accel;
*(acc->allowed) = true;
ret = acc->init_machine(ms); /* run the class's init_machine hook */
if (ret < 0) {
ms->accelerator = NULL;
*(acc->allowed) = false;
object_unref(OBJECT(accel));
} else {
object_set_accelerator_compat_props(acc->compat_props);
}
return ret;
}
So which function is init_machine for accel=kvm?
#define TYPE_KVM_ACCEL ACCEL_CLASS_NAME("kvm") /* TYPE_KVM_ACCEL expands to "kvm-accel" */
In kvm-all.c, the class initializer referenced by the kvm_accel_type structure sets the init_machine hook:
static void kvm_accel_class_init(ObjectClass *oc, void *data)
{
AccelClass *ac = ACCEL_CLASS(oc);
ac->name = "KVM";
ac->init_machine = kvm_init; /* the init_machine hook of the kvm accel is kvm_init() */
ac->has_memory = kvm_accel_has_memory;
ac->allowed = &kvm_allowed;
...
}
/* initialize the kvm_accel_type structure */
static const TypeInfo kvm_accel_type = {
.name = TYPE_KVM_ACCEL,
.parent = TYPE_ACCEL,
.instance_init = kvm_accel_instance_init,
.class_init = kvm_accel_class_init,
.instance_size = sizeof(KVMState),
};
static void kvm_type_init(void)
{
type_register_static(&kvm_accel_type); /* register the kvm_accel_type structure */
}
type_init(kvm_type_init);
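The registration and lookup flow above can be modeled without QEMU's object system. The following is a minimal sketch under stated assumptions: all `demo_` names are hypothetical, the registry is a fixed-size array, and only the name composition ("kvm" plus "-accel") mirrors what the text describes.

```c
#include <stddef.h>
#include <string.h>

/* Mirrors the idea that the type name for accel "kvm" is "kvm-accel". */
#define DEMO_ACCEL_CLASS_NAME(a) (a "-accel")
#define DEMO_MAX_TYPES 8

/* Hypothetical, reduced stand-in for an accelerator class. */
struct demo_accel_class {
    const char *type_name;
    int (*init_machine)(void *machine);
};

static const struct demo_accel_class *demo_types[DEMO_MAX_TYPES];
static size_t demo_type_count;

/* Register a class in the global table; returns -1 when the table is full. */
int demo_type_register(const struct demo_accel_class *cls)
{
    if (demo_type_count == DEMO_MAX_TYPES)
        return -1;
    demo_types[demo_type_count++] = cls;
    return 0;
}

/* Find the class whose type name is "<accel>-accel". */
const struct demo_accel_class *demo_accel_find(const char *accel)
{
    char type_name[64];
    size_t i;

    if (strlen(accel) + sizeof("-accel") > sizeof(type_name))
        return NULL;
    strcpy(type_name, accel);
    strcat(type_name, "-accel");
    for (i = 0; i < demo_type_count; i++) {
        if (strcmp(demo_types[i]->type_name, type_name) == 0)
            return demo_types[i];
    }
    return NULL;
}

/* Stand-in for kvm_init(); the real function opens /dev/kvm and creates a VM. */
static int demo_kvm_init(void *machine)
{
    (void)machine;
    return 0;
}

const struct demo_accel_class demo_kvm_accel = {
    DEMO_ACCEL_CLASS_NAME("kvm"), demo_kvm_init
};

/* Look up the accelerator by its short name and run its init hook. */
int demo_accel_init_machine(const char *accel, void *machine)
{
    const struct demo_accel_class *cls = demo_accel_find(accel);

    if (cls == NULL)
        return -1;
    return cls->init_machine(machine);
}
```

Until `demo_kvm_accel` is registered, `demo_accel_init_machine("kvm", NULL)` fails, which loosely corresponds to requesting an accelerator whose type was never registered.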
The kvm_init() function in kvm-all.c:
static int kvm_init(MachineState *ms)
{
/* code omitted... */
s = KVM_STATE(ms->accelerator);
/* code omitted... */
s->fd = qemu_open("/dev/kvm", O_RDWR); /* open /dev/kvm to get the device fd */
/* code omitted... */
do {
ret = kvm_ioctl(s, KVM_CREATE_VM, type); /* the KVM_CREATE_VM ioctl on the /dev/kvm fd asks kvm.ko to create a VM */
} while (ret == -EINTR);
/* code omitted... */
ret = kvm_arch_init(ms, s); /* architecture-specific initialization */
/* code omitted... */
return ret;
}
The main job of kvm_init() is to call the ioctl interfaces exposed by /dev/kvm to create a virtual machine in the kernel KVM module. One QEMU process corresponds to one VM.
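VM creation on the KVM side
The main entry points of the kernel KVM module are in kvm_main.c. The analysis below uses the KVM plus Intel combination as the example, so all architecture-specific parts refer to Intel:
Data structures
In the kernel KVM module, a struct kvm represents one virtual machine.
Initializing /dev/kvm
The kernel's kvm_init() function sets up the /dev/kvm device for QEMU to access and installs the corresponding ops (operation callbacks).
On x86, KVM's ops object is kvm_x86_ops.
arch/x86/kvm/x86.c defines the global variable kvm_x86_ops: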
struct kvm_x86_ops kvm_x86_ops __read_mostly;
EXPORT_SYMBOL_GPL(kvm_x86_ops);
kvm_x86_ops consists mostly of function pointers; on Intel, their concrete values come from vmx_x86_ops.
struct kvm_x86_ops {
int (*hardware_enable)(void);
void (*hardware_disable)(void);
void (*hardware_unsetup)(void);
bool (*cpu_has_accelerated_tpr)(void);
bool (*has_emulated_msr)(u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
unsigned int vm_size;
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
/* many more function pointers omitted */
};
vmx_init() in the x86 vmx.c passes vmx_init_ops to kvm_init():
r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
__alignof__(struct vcpu_vmx), THIS_MODULE);
vmx_init_ops is defined in arch/x86/kvm/vmx/vmx.c; the part that matters most is its runtime_ops member, which points to vmx_x86_ops:
static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
.runtime_ops = &vmx_x86_ops,
};
vmx_x86_ops is itself a static global object. Its contents:
static struct kvm_x86_ops vmx_x86_ops __initdata = {
.hardware_unsetup = hardware_unsetup,
.hardware_enable = hardware_enable,
.hardware_disable = hardware_disable,
.cpu_has_accelerated_tpr = report_flexpriority,
.has_emulated_msr = vmx_has_emulated_msr,
.vm_size = sizeof(struct kvm_vmx),
.vm_init = vmx_vm_init,
/* omitted... */
};
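kvm_main.c in the kernel defines global objects for the KVM device, the character-device ioctls, the VM ioctls, and the vCPU ioctls, so that it can respond to requests from user space.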
static struct file_operations kvm_vcpu_fops = {
.release = kvm_vcpu_release,
.unlocked_ioctl = kvm_vcpu_ioctl,
.mmap = kvm_vcpu_mmap,
.llseek = noop_llseek,
KVM_COMPAT(kvm_vcpu_compat_ioctl),
};
static struct file_operations kvm_vm_fops = {
.release = kvm_vm_release,
.unlocked_ioctl = kvm_vm_ioctl,
.llseek = noop_llseek,
KVM_COMPAT(kvm_vm_compat_ioctl),
};
static struct file_operations kvm_chardev_ops = {
.unlocked_ioctl = kvm_dev_ioctl,
.llseek = noop_llseek,
KVM_COMPAT(kvm_dev_ioctl),
};
static struct miscdevice kvm_dev = {
KVM_MINOR,
"kvm",
&kvm_chardev_ops,
};
kvm_preempt_ops.sched_in = kvm_sched_in;
kvm_preempt_ops.sched_out = kvm_sched_out;
kvm_dev_ioctl:
ioctl | Handler |
---|---|
KVM_GET_API_VERSION | |
KVM_CREATE_VM | Creates a VM: kvm_dev_ioctl_create_vm() --> kvm_create_vm() |
KVM_CHECK_EXTENSION | Checks for an extension: kvm_vm_ioctl_check_extension_generic() |
KVM_GET_VCPU_MMAP_SIZE | Returns the size of the memory region shared between QEMU and KVM |
… |
kvm_vm_ioctl:
ioctl | Handler |
---|---|
KVM_CREATE_VCPU | Creates a vCPU: kvm_vm_ioctl_create_vcpu |
KVM_ENABLE_CAP | kvm_vm_ioctl_enable_cap_generic |
KVM_SET_USER_MEMORY_REGION | kvm_vm_ioctl_set_memory_region |
KVM_GET_DIRTY_LOG | kvm_vm_ioctl_get_dirty_log |
KVM_REGISTER_COALESCED_MMIO | |
KVM_IRQFD | kvm_irqfd |
KVM_IOEVENTFD | kvm_ioeventfd |
KVM_CREATE_DEVICE | kvm_ioctl_create_device |
KVM_CHECK_EXTENSION | kvm_vm_ioctl_check_extension_generic |
… |
kvm_vcpu_ioctl:
ioctl | Handler |
---|---|
KVM_RUN | Runs the vCPU: kvm_arch_vcpu_ioctl_run() |
KVM_GET_REGS | |
KVM_SET_REGS | |
… |
Relationship between kvm_dev_ioctl, kvm_vm_ioctl, and kvm_vcpu_ioctl:
CPU creation in QEMU
CPU model inheritance in QEMU:
The x86 CPUs that QEMU supports are defined in target/i386/cpu.c, in the builtin_x86_defs array of type X86CPUDefinition:
/* Base definition for a CPU model */
typedef struct X86CPUDefinition {
const char *name;
uint32_t level;
uint32_t xlevel;
/* vendor is zero-terminated, 12 character ASCII string */
char vendor[CPUID_VENDOR_SZ + 1];
int family;
int model;
int stepping;
FeatureWordArray features;
const char *model_id;
CPUCaches *cache_info;
/* Use AMD EPYC encoding for apic id */
bool use_epyc_apic_id_encoding;
/*
* Definitions for alternative versions of CPU model.
* List is terminated by item with version == 0.
* If NULL, version 1 will be registered automatically.
*/
const X86CPUVersionDefinition *versions;
} X86CPUDefinition;
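Where:
X86CPUDefinition member | Purpose |
---|---|
name | Name of the CPU model |
level | Highest standard function number supported by the CPUID instruction |
xlevel | Highest extended function number supported by the CPUID instruction |
vendor, family, model, stepping | Basic CPU identification |
features | Array recording the CPU feature bits |
model_id | Full name of the CPU |
The builtin_x86_defs array: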
static X86CPUDefinition builtin_x86_defs[] = {
{
.name = "qemu64",
.level = 0xd,
.vendor = CPUID_VENDOR_AMD,
.family = 6,
.model = 6,
.stepping = 3,
.features[FEAT_1_EDX] =
PPRO_FEATURES |
CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA |
CPUID_PSE36,
.features[FEAT_1_ECX] =
CPUID_EXT_SSE3 | CPUID_EXT_CX16,
.features[FEAT_8000_0001_EDX] =
CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
.features[FEAT_8000_0001_ECX] =
CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM,
.xlevel = 0x8000000A,
.model_id = "QEMU Virtual CPU version " QEMU_HW_VERSION,
},
... /* more than 2000 lines of definitions */
};
QEMU instantiates a virtual x86 CPU through the struct X86CPU structure:
Call path for vCPU creation in QEMU:
The code of kvm_init_vcpu() in QEMU:
int kvm_init_vcpu(CPUState *cpu)
{
/* Most code is omitted below; only the relevant parts are kept. */
ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu)); /* creates the vCPU via KVM_CREATE_VCPU */
mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0); /* query the size of the shared memory region */
cpu->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
cpu->kvm_fd, 0); /* mmap the vCPU fd; the KVM-side handler is kvm_vcpu_mmap() */
ret = kvm_arch_init_vcpu(cpu);
return ret;
}
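CPU creation in KVM
Data shared between QEMU and KVM
QEMU and KVM frequently need to share data. For example, KVM places VM exit information in shared memory, and QEMU reads it from there. This shared region is set up when QEMU creates a vCPU.
In kvm_init_vcpu(), QEMU calls kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0), which returns the size of the memory shared between QEMU and KVM.
The KVM function that handles this ioctl is: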
static long kvm_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
/* some code omitted */
case KVM_GET_VCPU_MMAP_SIZE:
if (arg)
goto out;
r = PAGE_SIZE; /* struct kvm_run */
#ifdef CONFIG_X86
r += PAGE_SIZE; /* pio data page */
#endif
#ifdef CONFIG_KVM_MMIO
r += PAGE_SIZE; /* coalesced mmio ring page */
#endif
break;
return r;
}
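ioctl(KVM_GET_VCPU_MMAP_SIZE) can return 1, 2, or 3 pages. The first page holds kvm_run, the structure used for basic data exchange between QEMU and KVM; the second page stores the data for guest accesses to I/O ports; the last page is used for coalesced MMIO.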
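The size calculation can be restated as a small pure function. This is a sketch for illustration only: `demo_vcpu_mmap_size` and its boolean parameters are hypothetical stand-ins for the CONFIG_X86 and CONFIG_KVM_MMIO compile-time switches in the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Size of the region a vCPU fd exposes through mmap, following the
 * page-by-page accounting of the KVM_GET_VCPU_MMAP_SIZE handler.
 */
size_t demo_vcpu_mmap_size(size_t page_size, bool has_pio_page,
                           bool has_coalesced_mmio)
{
    size_t size = page_size;        /* struct kvm_run */

    if (has_pio_page)
        size += page_size;          /* pio data page */
    if (has_coalesced_mmio)
        size += page_size;          /* coalesced mmio ring page */
    return size;
}
```

With 4 KB pages and both optional pages present, the function yields 12288 bytes, that is, three pages.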
QEMU then calls mmap on the shared region. The kernel side of that mapping looks like this:
static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
{
struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
struct page *page;
if (vmf->pgoff == 0)
page = virt_to_page(vcpu->run);
#ifdef CONFIG_X86
else if (vmf->pgoff == KVM_PIO_PAGE_OFFSET)
page = virt_to_page(vcpu->arch.pio_data);
#endif
#ifdef CONFIG_KVM_MMIO
else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
#endif
else
return kvm_arch_vcpu_fault(vcpu, vmf);
get_page(page);
vmf->page = page;
return 0;
}
static const struct vm_operations_struct kvm_vcpu_vm_ops = {
.fault = kvm_vcpu_fault,
};
static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
{
vma->vm_ops = &kvm_vcpu_vm_ops;
return 0;
}
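When QEMU mmaps the vCPU fd (an anonymous file), the kernel only reserves virtual address space and sets its operations to kvm_vcpu_vm_ops, which has a single callback, the fault handler kvm_vcpu_fault. kvm_vcpu_fault is invoked when QEMU touches the shared memory and triggers a page fault; as the code shows, the kernel then connects the corresponding data page to QEMU's virtual address space.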
Shared memory page accessed | Actual object |
---|---|
page1 | kvm_vcpu->run |
page2 | kvm_vcpu->arch.pio_data |
page3 | kvm->coalesced_mmio_ring |
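The page selection in kvm_vcpu_fault can be modeled as a mapping from page offset to backing object. In this sketch the `demo_` names are hypothetical, and the offset values 1 and 2 are assumed to match the x86 definitions of KVM_PIO_PAGE_OFFSET and KVM_COALESCED_MMIO_PAGE_OFFSET.

```c
/* Which kernel object backs a given page of the vCPU mapping. */
enum demo_region {
    DEMO_REGION_KVM_RUN,          /* vcpu->run */
    DEMO_REGION_PIO_DATA,         /* vcpu->arch.pio_data */
    DEMO_REGION_COALESCED_MMIO,   /* vcpu->kvm->coalesced_mmio_ring */
    DEMO_REGION_ARCH_FAULT        /* left to kvm_arch_vcpu_fault() */
};

/* Assumed x86 values; other architectures may differ. */
#define DEMO_PIO_PAGE_OFFSET            1UL
#define DEMO_COALESCED_MMIO_PAGE_OFFSET 2UL

/* Map the faulting page offset to the object that the fault handler picks. */
enum demo_region demo_vcpu_fault_region(unsigned long pgoff)
{
    if (pgoff == 0)
        return DEMO_REGION_KVM_RUN;
    if (pgoff == DEMO_PIO_PAGE_OFFSET)
        return DEMO_REGION_PIO_DATA;
    if (pgoff == DEMO_COALESCED_MMIO_PAGE_OFFSET)
        return DEMO_REGION_COALESCED_MMIO;
    return DEMO_REGION_ARCH_FAULT;
}
```

From user space, the matching address of each page would be the mmap base plus the page offset multiplied by the page size.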
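Running a VCPU
Running a VCPU in QEMU
Each VCPU has a corresponding VMCS (Virtual Machine Control Structure), the key data structure that Intel x86 processors use to record vCPU state for CPU virtualization. The physical address of the VMCS is passed as an operand to VMX instructions. A VMCS can be in one of the following four states:
- Inactive: the VMCS has only been allocated and initialized, or VMCLEAR has been executed on it.
- Working: the CPU has executed VMPTRLD on the VMCS, or a VM exit has occurred; the CPU is still in VMX root operation.
- Active: VMPTRLD was executed on this VMCS, and the same CPU has since executed VMPTRLD for another VCPU; this is the state of the earlier VMCS.
- Controlling: the CPU has executed VMLAUNCH on the VMCS and is in VMX non-root operation.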
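The steps that Intel SDM section 31.6 describes for getting a virtual machine running:
- Allocate a 4 KB-aligned VMCS region in non-pageable memory; its size is obtained from the IA32_VMX_BASIC MSR. In KVM, this is done mainly by vmx_create_vcpu calling alloc_vmcs.
- Initialize the version identifier of the VMCS region (bits 30:0 of its first 4 bytes) with the VMCS revision identifier, which is also read from the IA32_VMX_BASIC MSR, and clear bit 31 of those first 4 bytes. In KVM, this is done in alloc_vmcs_cpu.
- Execute VMCLEAR with the physical address of the VMCS as the operand. This sets the current CPU's working-VMCS pointer to FFFFFFFF_FFFFFFFFH. After the instruction completes, check that RFLAGS.CF = 0 and RFLAGS.ZF = 0. In KVM, this is done mainly by loaded_vmcs_clear, which eventually calls vmcs_clear.
- Execute VMPTRLD with the physical address of the VMCS. The CPU's working-VMCS pointer now points to the physical address of the VMCS region. In KVM, this is done by vmx_vcpu_load calling vmcs_load.
- Execute VMWRITE to initialize the host-state area of the VMCS. After a VM exit, this area is used to build the host's CPU state and context. It covers the control registers (CR0, CR3, and CR4), the segment registers (CS, SS, DS, ES, FS, GS, TR), RSP, RIP, and some MSRs. In KVM, this is done mainly in vmx_vcpu_setup.
- Execute VMWRITE to initialize the VM-exit control, VM-entry control, and VM-execution control areas of the VMCS. Some of these fields must be set according to what the VMX capability MSRs report; for example, an MSR may report that certain bits must be 0 on the current CPU. In KVM, this is done mainly in vmx_vcpu_setup.
- Execute VMWRITE to initialize the guest-state area. When the CPU enters VMX non-root operation, it builds its context from this data. In KVM, this is done mainly in vmx_vcpu_reset.
- The guest-state settings must satisfy the following conditions.
- ① If the VM is to emulate a complete OS booting from the BIOS, the guest state must be set to the state of a physical CPU at power-on.
- ② Guest state that the VMM cannot intercept must be set correctly, such as the general-purpose registers, CR2, the debug registers, and the floating-point registers.
- Execute VMLAUNCH to put the CPU into VMX non-root operation. If this fails, RFLAGS.CF or RFLAGS.ZF is set. In KVM, this is done in vmx_vcpu_run.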
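The first two steps both read fields of the IA32_VMX_BASIC MSR. The sketch below decodes such a value, assuming the SDM layout in which bits 30:0 hold the VMCS revision identifier and bits 44:32 hold the VMCS region size in bytes. The `demo_` helpers are hypothetical and operate on a plain integer instead of reading the MSR.

```c
#include <stdint.h>

/* Bits 30:0 of IA32_VMX_BASIC: VMCS revision identifier. */
uint32_t demo_vmcs_revision_id(uint64_t vmx_basic)
{
    return (uint32_t)(vmx_basic & 0x7fffffffu);
}

/* Bits 44:32 of IA32_VMX_BASIC: size of the VMCS region in bytes. */
uint32_t demo_vmcs_region_size(uint64_t vmx_basic)
{
    return (uint32_t)((vmx_basic >> 32) & 0x1fffu);
}

/*
 * Write the revision identifier into the first 4 bytes of a VMCS region.
 * Masking to bits 30:0 leaves bit 31 clear, as the steps above require.
 */
void demo_vmcs_init_header(uint32_t *vmcs_region, uint64_t vmx_basic)
{
    vmcs_region[0] = demo_vmcs_revision_id(vmx_basic);
}
```

A caller would allocate `demo_vmcs_region_size()` bytes of suitably aligned memory and then call `demo_vmcs_init_header()` before the region is ever passed to VMCLEAR or VMPTRLD.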
The thread routine of a vCPU thread in QEMU is:
static void *qemu_kvm_cpu_thread_fn(void *arg)
{
/* omitted */
r = kvm_init_vcpu(cpu);
kvm_init_cpu_signals(cpu);
/* signal CPU creation */
cpu->created = true;
qemu_cond_signal(&qemu_cpu_cond);
qemu_guest_random_seed_thread_part2(cpu->random_seed);
do {
if (cpu_can_run(cpu)) {
r = kvm_cpu_exec(cpu); /* core of vCPU execution */
if (r == EXCP_DEBUG) {
cpu_handle_guest_debug(cpu);
}
}
qemu_wait_io_event(cpu); /* when the vCPU cannot run, wait on the cpu->halt_cond condition */
} while (!cpu->unplug || cpu_can_run(cpu));
/* omitted */
return NULL;
}
kvm_cpu_exec() is the core of vCPU execution in QEMU; it is also built around a do {} while () loop.
int kvm_cpu_exec(CPUState *cpu)
{
/* omitted */
do {
/* omitted */
kvm_arch_pre_run(cpu, run);
run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
attrs = kvm_arch_post_run(cpu, run);
switch (run->exit_reason) {
case KVM_EXIT_IO:
DPRINTF("handle_io\n");
/* Called outside BQL */
kvm_handle_io(run->io.port, attrs,
(uint8_t *)run + run->io.data_offset,
run->io.direction,
run->io.size,
run->io.count);
ret = 0;
break;
case KVM_EXIT_MMIO:
DPRINTF("handle_mmio\n");
/* Called outside BQL */
address_space_rw(&address_space_memory,
run->mmio.phys_addr, attrs,
run->mmio.data,
run->mmio.len,
run->mmio.is_write);
ret = 0;
break;
/* omitted */
case KVM_EXIT_SYSTEM_EVENT:
default:
DPRINTF("kvm_arch_handle_exit\n");
ret = kvm_arch_handle_exit(cpu, run);
break;
}
} while (ret == 0);
/* omitted */
return ret;
}
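kvm_arch_pre_run first does some preparation before entry, such as injecting NMIs and SMIs. ioctl(KVM_RUN) on the vCPU fd then starts the vCPU. While handling this ioctl, the KVM module executes the corresponding VMX instructions, switching the physical CPU that runs the vCPU from VMX root mode to VMX non-root mode, and the guest code starts executing. When an event inside the guest triggers a VM exit, control returns to KVM; if KVM cannot handle the exit, it is passed on to QEMU. That is, when ioctl(KVM_RUN) returns, QEMU calls kvm_arch_post_run for some initial processing and then inspects the kvm_run structure in the memory shared with KVM to determine the exit reason and act on it. For an I/O exit, kvm_handle_io dispatches the access, which eventually reaches the callback of the device that registered the I/O port. Much of the data used here comes from kvm_run. If the exit was caused by an MMIO access, address_space_rw is called; it finds the device that registered the MMIO region and invokes its callback.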
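The shape of this loop (run, inspect the exit reason in the shared structure, handle it, repeat while the handler returns 0) can be modeled without /dev/kvm. The sketch below is a toy: `demo_kvm_run` replays a scripted list of exit reasons instead of issuing ioctl(KVM_RUN), and all `demo_` names are hypothetical. The numeric values follow the KVM_EXIT_IO, KVM_EXIT_MMIO, and KVM_EXIT_SHUTDOWN constants listed later in this article.

```c
#include <stddef.h>

enum demo_exit {
    DEMO_EXIT_IO = 2,
    DEMO_EXIT_MMIO = 6,
    DEMO_EXIT_SHUTDOWN = 8
};

/* Stand-in for the shared kvm_run page. */
struct demo_run {
    int exit_reason;
};

struct demo_exec_vcpu {
    const int *script;      /* exit reasons the fake KVM_RUN will report */
    size_t script_len;
    size_t pos;
    struct demo_run run;
    int io_handled;
    int mmio_handled;
};

/* Fake KVM_RUN: publish the next scripted exit reason in the shared page. */
static int demo_kvm_run(struct demo_exec_vcpu *v)
{
    if (v->pos >= v->script_len)
        return -1;
    v->run.exit_reason = v->script[v->pos++];
    return 0;
}

/* Keep re-entering the guest while each exit is handled with ret == 0. */
int demo_cpu_exec(struct demo_exec_vcpu *v)
{
    int ret;

    do {
        if (demo_kvm_run(v) < 0) {
            ret = -1;
            break;
        }
        switch (v->run.exit_reason) {
        case DEMO_EXIT_IO:
            v->io_handled++;    /* the real code calls kvm_handle_io() */
            ret = 0;
            break;
        case DEMO_EXIT_MMIO:
            v->mmio_handled++;  /* the real code calls address_space_rw() */
            ret = 0;
            break;
        default:
            ret = 1;            /* anything else ends the loop in this toy */
            break;
        }
    } while (ret == 0);
    return ret;
}
```

Replaying the exits IO, MMIO, IO, SHUTDOWN handles three exits inside the loop and leaves it on the fourth.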
Relationship between QEMU, KVM, and the VM:
Running a VCPU in KVM
kvm_vcpu_ioctl
The KVM_RUN ioctl is handled in kvm_vcpu_ioctl; the bulk of the work ends up in vcpu_run() in arch/x86/kvm/x86.c:
static struct file_operations kvm_vcpu_fops = {
.release = kvm_vcpu_release,
.unlocked_ioctl = kvm_vcpu_ioctl,
.mmap = kvm_vcpu_mmap,
.llseek = noop_llseek,
KVM_COMPAT(kvm_vcpu_compat_ioctl),
};
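How does kvm_vcpu_ioctl() make sure the caller is entitled to operate on this vCPU? It first checks that the caller's address space is the one that created the VM; the KVM_RUN branch then tracks which thread is currently running the vCPU: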
if (vcpu->kvm->mm != current->mm)
return -EIO;
switch (ioctl) {
case KVM_RUN: {
struct pid *oldpid;
r = -EINVAL;
if (arg)
goto out;
oldpid = rcu_access_pointer(vcpu->pid); //the thread running this vCPU may have changed
if (unlikely(oldpid != task_pid(current))) {
/* The thread running this VCPU changed. */
struct pid *newpid;
r = kvm_arch_vcpu_run_pid_change(vcpu);
if (r)
break;
newpid = get_task_pid(current, PIDTYPE_PID);
rcu_assign_pointer(vcpu->pid, newpid); //if the thread changed, update vcpu->pid to the pid of current
if (oldpid)
synchronize_rcu();
put_pid(oldpid);
}
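/* Per-vCPU statistics could be collected here and the thread running the vCPU could be tagged; but if vCPU statistics are collected, is tagging the thread still needed? */
r = kvm_arch_vcpu_ioctl_run(vcpu); //enter the architecture-specific vCPU run code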
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
}
kvm_arch_vcpu_ioctl_run
Next comes kvm_arch_vcpu_ioctl_run(); the x86 version is analyzed here:
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
struct kvm_run *kvm_run = vcpu->run;
int r;
vcpu_load(vcpu);
//code omitted
if (kvm_run->immediate_exit)
r = -EINTR;
else
r = vcpu_run(vcpu); //vcpu_run does the main work
out:
kvm_put_guest_fpu(vcpu);
if (kvm_run->kvm_valid_regs)
store_regs(vcpu);
post_kvm_run_save(vcpu);
kvm_sigset_deactivate(vcpu);
vcpu_put(vcpu);
return r;
}
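vcpu_load and vcpu_put
vcpu_load loads a vCPU onto the current physical CPU; vcpu_put does the opposite.
KVM defines a per-CPU variable, kvm_running_vcpu, that records which vCPU, if any, is currently loaded on each physical CPU.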
static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
The main job of vcpu_load() is to set kvm_running_vcpu:
/*
* Switches to specified vcpu, until a matching vcpu_put()
*/
void vcpu_load(struct kvm_vcpu *vcpu)
{
int cpu = get_cpu(); //disable preemption and return the CPU id
__this_cpu_write(kvm_running_vcpu, vcpu); //set the per-CPU variable kvm_running_vcpu to this vCPU
preempt_notifier_register(&vcpu->preempt_notifier);
kvm_arch_vcpu_load(vcpu, cpu);
put_cpu(); //re-enable preemption
}
EXPORT_SYMBOL_GPL(vcpu_load);
vcpu_put() is the counterpart of vcpu_load().
void vcpu_put(struct kvm_vcpu *vcpu)
{
preempt_disable();
kvm_arch_vcpu_put(vcpu);
preempt_notifier_unregister(&vcpu->preempt_notifier);
__this_cpu_write(kvm_running_vcpu, NULL);
preempt_enable();
}
EXPORT_SYMBOL_GPL(vcpu_put);
vcpu_run
static int vcpu_run(struct kvm_vcpu *vcpu)
{
/* omitted */
for (;;) {
if (kvm_vcpu_running(vcpu)) {
r = vcpu_enter_guest(vcpu); /* the vCPU can run, so enter the guest via vcpu_enter_guest() */
} else {
r = vcpu_block(kvm, vcpu); /* the vCPU cannot run right now; ignoring polling, this ends up calling schedule() to yield the CPU */
}
if (r <= 0)
break;
/* omitted */
}
/* omitted */
return r;
}
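/* Two conditions are checked:
* 1. whether mp_state in vcpu->arch is KVM_MP_STATE_RUNNABLE
* 2. whether apf.halted in vcpu->arch is set, meaning the guest needs a page that the host has swapped out; a vCPU halted for an async page fault (apf) cannot run either
*/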
static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
{
if (is_guest_mode(vcpu))
kvm_x86_ops.nested_ops->check_events(vcpu);
return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
!vcpu->arch.apf.halted);
}
If vcpu_run decides that the vCPU cannot run, it calls vcpu_block, which calls kvm_vcpu_block. Leaving the polling mechanism aside, kvm_vcpu_block calls schedule() to give up the CPU.
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
/* omitted */
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (kvm_vcpu_check_block(vcpu) < 0)
break;
waited = true;
schedule();
}
/* omitted */
}
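The control flow of vcpu_run() and kvm_vcpu_running() shown above can be condensed into a toy model. Everything here is hypothetical (`demo_` names, the counters, and the way blocking "wakes up" immediately); only the loop shape, the runnable test, and the `r <= 0` exit condition follow the kernel code.

```c
#include <stdbool.h>

/* Assumed to mirror KVM_MP_STATE_RUNNABLE and KVM_MP_STATE_HALTED. */
enum { DEMO_MP_STATE_RUNNABLE = 0, DEMO_MP_STATE_HALTED = 3 };

struct demo_kvcpu {
    int mp_state;
    bool apf_halted;
    int in_kernel_exits;  /* exits KVM can handle before userspace is needed */
    int guest_entries;
    int blocks;
};

/* Same test as kvm_vcpu_running(), minus the nested-virtualization check. */
bool demo_vcpu_running(const struct demo_kvcpu *v)
{
    return v->mp_state == DEMO_MP_STATE_RUNNABLE && !v->apf_halted;
}

/* Returns 1 to stay in the loop, 0 when userspace has to take over. */
static int demo_enter_guest(struct demo_kvcpu *v)
{
    v->guest_entries++;
    if (v->in_kernel_exits > 0) {
        v->in_kernel_exits--;
        return 1;
    }
    return 0;
}

/* Pretend the vCPU slept and was then woken by an event. */
static int demo_block(struct demo_kvcpu *v)
{
    v->blocks++;
    v->mp_state = DEMO_MP_STATE_RUNNABLE;
    v->apf_halted = false;
    return 1;
}

int demo_vcpu_run(struct demo_kvcpu *v)
{
    int r;

    for (;;) {
        if (demo_vcpu_running(v))
            r = demo_enter_guest(v);
        else
            r = demo_block(v);
        if (r <= 0)
            break;
    }
    return r;
}
```

A vCPU that starts halted blocks once, then enters the guest repeatedly until an exit requires userspace.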
vcpu_enter_guest
If vcpu_enter_guest() returns 1, vcpu_run() stays in its for loop; otherwise the value is returned to userspace.
/*
* Returns 1 to let vcpu_run() continue the guest execution loop without
* exiting to the userspace. Otherwise, the value will be returned to the
* userspace.
*/
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
/* omitted... */
r = kvm_mmu_reload(vcpu);
if (unlikely(r)) {
goto cancel_injection;
}
preempt_disable(); //disable preemption
kvm_x86_ops.prepare_guest_switch(vcpu); //save the host state so that the host can keep running normally after the guest exits
/*
* Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
* IPI are then delayed after guest entry, which ensures that they
* result in virtual interrupt delivery.
* External interrupt requests are disabled on this CPU here. */
local_irq_disable();
vcpu->mode = IN_GUEST_MODE; //enter guest mode
//omitted
trace_kvm_entry(vcpu->vcpu_id); //traces the KVM entry; the matching KVM exit is traced in vmx_vcpu_run()
//omitted
exit_fastpath = kvm_x86_ops.run(vcpu); //calls into vmx_vcpu_run()
//omitted
vcpu->arch.last_vmentry_cpu = vcpu->cpu;
vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
vcpu->mode = OUTSIDE_GUEST_MODE; //leave guest mode
smp_wmb();
kvm_x86_ops.handle_exit_irqoff(vcpu); //after leaving the guest, handle external interrupts
/*
* Consume any pending interrupts, including the possible source of
* VM-Exit on SVM and any ticks that occur between VM-Exit and now.
* An instruction is required after local_irq_enable() to fully unblock
* interrupts on processors that implement an interrupt shadow, the
* stat.exits increment will do nicely.
*/
kvm_before_interrupt(vcpu);
local_irq_enable();
++vcpu->stat.exits; //count the exit in the statistics
local_irq_disable();
kvm_after_interrupt(vcpu);
if (lapic_in_kernel(vcpu)) {
s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
if (delta != S64_MIN) {
trace_kvm_wait_lapic_expire(vcpu->vcpu_id, delta);
vcpu->arch.apic->lapic_timer.advance_expire_delta = S64_MIN;
}
}
local_irq_enable();
preempt_enable();
//omitted
r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath); //dispatch on the exit reason; for external interrupts nothing is left to handle by now, only exit statistics are updated
return r;
cancel_injection:
if (req_immediate_exit)
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_x86_ops.cancel_injection(vcpu);
if (unlikely(vcpu->arch.apic_attention))
kvm_lapic_sync_from_vapic(vcpu);
out:
return r;
}
This function enters vmx_vcpu_run for the given kvm_vcpu. By the time vmx_vcpu_run returns, one full round of VM entry and VM exit has completed.
vcpu->mode can take the following values:
enum {
OUTSIDE_GUEST_MODE,
IN_GUEST_MODE,
EXITING_GUEST_MODE,
READING_SHADOW_PAGE_TABLES,
};
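The host enters guest mode with interrupts disabled, so a CPU running guest code does not take external interrupts in the host directly; instead, an external interrupt forces the CPU out of guest mode and back into VMX root mode. External interrupts are processed before handle_exit runs, so when handle_exit later sees an external-interrupt exit there is nothing substantial left to do, and it only updates statistics.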
vmx_vcpu_run
static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
fastpath_t exit_fastpath;
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long cr3, cr4;
reenter_guest:
/* Record the guest's net vcpu time for enforced NMI injections. */
if (unlikely(!enable_vnmi &&
vmx->loaded_vmcs->soft_vnmi_blocked))
vmx->loaded_vmcs->entry_time = ktime_get();
/* Don't enter VMX if guest state is invalid, let the exit handler
start emulation until we arrive back to a valid state */
if (vmx->emulation_required)
return EXIT_FASTPATH_NONE;
if (vmx->ple_window_dirty) {
vmx->ple_window_dirty = false;
vmcs_write32(PLE_WINDOW, vmx->ple_window);
}
/*
* We did this in prepare_switch_to_guest, because it needs to
* be within srcu_read_lock.
*/
WARN_ON_ONCE(vmx->nested.need_vmcs12_to_shadow_sync);
if (kvm_register_is_dirty(vcpu, VCPU_REGS_RSP))
vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
if (kvm_register_is_dirty(vcpu, VCPU_REGS_RIP))
vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
cr3 = __get_current_cr3_fast();
if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
vmcs_writel(HOST_CR3, cr3);
vmx->loaded_vmcs->host_state.cr3 = cr3;
}
cr4 = cr4_read_shadow();
if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
vmcs_writel(HOST_CR4, cr4);
vmx->loaded_vmcs->host_state.cr4 = cr4;
}
/* When single-stepping over STI and MOV SS, we must clear the
* corresponding interruptibility bits in the guest state. Otherwise
* vmentry fails as it then expects bit 14 (BS) in pending debug
* exceptions being set, but that's not correct for the guest debugging
* case. */
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
vmx_set_interrupt_shadow(vcpu, 0);
kvm_load_guest_xsave_state(vcpu);
pt_guest_enter(vmx);
atomic_switch_perf_msrs(vmx);
if (enable_preemption_timer)
vmx_update_hv_timer(vcpu);
if (lapic_in_kernel(vcpu) &&
vcpu->arch.apic->lapic_timer.timer_advance_ns)
kvm_wait_lapic_expire(vcpu);
/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
* it's non-zero. Since vmentry is serialising on affected CPUs, there
* is no need to worry about the conditional branch over the wrmsr
* being speculatively taken.
*/
x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);
/* The actual VMENTER/EXIT is in the .noinstr.text section. */
vmx_vcpu_enter_exit(vcpu, vmx);
/*
* We do not use IBRS in the kernel. If this vCPU has used the
* SPEC_CTRL MSR it may have left it on; save the value and
* turn it off. This is much more efficient than blindly adding
* it to the atomic save/restore list. Especially as the former
* (Saving guest MSRs on vmexit) doesn't even exist in KVM.
*
* For non-nested case:
* If the L01 MSR bitmap does not intercept the MSR, then we need to
* save it.
*
* For nested case:
* If the L02 MSR bitmap does not intercept the MSR, then we need to
* save it.
*/
if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)))
vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
x86_spec_ctrl_restore_host(vmx->spec_ctrl, 0);
/* All fields are clean at this point */
if (static_branch_unlikely(&enable_evmcs))
current_evmcs->hv_clean_fields |=
HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
if (static_branch_unlikely(&enable_evmcs))
current_evmcs->hv_vp_id = vcpu->arch.hyperv.vp_index;
/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
if (vmx->host_debugctlmsr)
update_debugctlmsr(vmx->host_debugctlmsr);
#ifndef CONFIG_X86_64
/*
* The sysexit path does not restore ds/es, so we must set them to
* a reasonable value ourselves.
*
* We can't defer this to vmx_prepare_switch_to_host() since that
* function may be executed in interrupt context, which saves and
* restore segments around it, nullifying its effect.
*/
loadsegment(ds, __USER_DS);
loadsegment(es, __USER_DS);
#endif
vmx_register_cache_reset(vcpu);
pt_guest_exit(vmx);
kvm_load_host_xsave_state(vcpu);
vmx->nested.nested_run_pending = 0;
vmx->idt_vectoring_info = 0;
if (unlikely(vmx->fail)) {
vmx->exit_reason = 0xdead;
return EXIT_FASTPATH_NONE;
}
vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
kvm_machine_check();
trace_kvm_exit(vmx->exit_reason, vcpu, KVM_ISA_VMX);
if (unlikely(vmx->exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
return EXIT_FASTPATH_NONE;
vmx->loaded_vmcs->launched = 1;
vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
vmx_recover_nmi_blocking(vmx);
vmx_complete_interrupts(vmx);
if (is_guest_mode(vcpu))
return EXIT_FASTPATH_NONE;
exit_fastpath = vmx_exit_handlers_fastpath(vcpu);
if (exit_fastpath == EXIT_FASTPATH_REENTER_GUEST) {
if (!kvm_vcpu_exit_request(vcpu)) {
/*
* FIXME: this goto should be a loop in vcpu_enter_guest,
* but it would incur the cost of a retpoline for now.
* Revisit once static calls are available.
*/
if (vcpu->arch.apicv_active)
vmx_sync_pir_to_irr(vcpu);
goto reenter_guest;
}
exit_fastpath = EXIT_FASTPATH_EXIT_HANDLED;
}
return exit_fastpath;
}
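The function first writes a number of VMCS fields according to the VCPU state and then enters the guest, at which point the CPU starts executing guest code. In the version shown here, the entry and exit happen inside vmx_vcpu_enter_exit(). In older kernels the same step was an inline assembly block that issued ASM_VMX_VMLAUNCH (or VMRESUME once the VMCS had been launched), and execution resumed at the vmx_return label after a VM exit.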
VCPU exit
x86
VCPU exit events are handled by kvm_x86_ops.handle_exit(), called from arch/x86/kvm/x86.c:
static int vcpu_enter_guest(struct kvm_vcpu *vcpu){
//omitted
r = kvm_x86_ops.handle_exit(vcpu, exit_fastpath);
}
Exit reasons
#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000
#define EXIT_REASON_EXCEPTION_NMI 0
#define EXIT_REASON_EXTERNAL_INTERRUPT 1
#define EXIT_REASON_TRIPLE_FAULT 2
#define EXIT_REASON_INIT_SIGNAL 3
#define EXIT_REASON_INTERRUPT_WINDOW 7
#define EXIT_REASON_NMI_WINDOW 8
#define EXIT_REASON_TASK_SWITCH 9
#define EXIT_REASON_CPUID 10
#define EXIT_REASON_HLT 12
#define EXIT_REASON_INVD 13
#define EXIT_REASON_INVLPG 14
#define EXIT_REASON_RDPMC 15
#define EXIT_REASON_RDTSC 16
#define EXIT_REASON_VMCALL 18
#define EXIT_REASON_VMCLEAR 19
#define EXIT_REASON_VMLAUNCH 20
#define EXIT_REASON_VMPTRLD 21
#define EXIT_REASON_VMPTRST 22
#define EXIT_REASON_VMREAD 23
#define EXIT_REASON_VMRESUME 24
#define EXIT_REASON_VMWRITE 25
#define EXIT_REASON_VMOFF 26
#define EXIT_REASON_VMON 27
#define EXIT_REASON_CR_ACCESS 28
#define EXIT_REASON_DR_ACCESS 29
#define EXIT_REASON_IO_INSTRUCTION 30
#define EXIT_REASON_MSR_READ 31
#define EXIT_REASON_MSR_WRITE 32
#define EXIT_REASON_INVALID_STATE 33
#define EXIT_REASON_MSR_LOAD_FAIL 34
#define EXIT_REASON_MWAIT_INSTRUCTION 36
#define EXIT_REASON_MONITOR_TRAP_FLAG 37
#define EXIT_REASON_MONITOR_INSTRUCTION 39
#define EXIT_REASON_PAUSE_INSTRUCTION 40
#define EXIT_REASON_MCE_DURING_VMENTRY 41
#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
#define EXIT_REASON_APIC_ACCESS 44
#define EXIT_REASON_EOI_INDUCED 45
#define EXIT_REASON_GDTR_IDTR 46
#define EXIT_REASON_LDTR_TR 47
#define EXIT_REASON_EPT_VIOLATION 48
#define EXIT_REASON_EPT_MISCONFIG 49
#define EXIT_REASON_INVEPT 50
#define EXIT_REASON_RDTSCP 51
#define EXIT_REASON_PREEMPTION_TIMER 52
#define EXIT_REASON_INVVPID 53
#define EXIT_REASON_WBINVD 54
#define EXIT_REASON_XSETBV 55
#define EXIT_REASON_APIC_WRITE 56
#define EXIT_REASON_RDRAND 57
#define EXIT_REASON_INVPCID 58
#define EXIT_REASON_VMFUNC 59
#define EXIT_REASON_ENCLS 60
#define EXIT_REASON_RDSEED 61
#define EXIT_REASON_PML_FULL 62
#define EXIT_REASON_XSAVES 63
#define EXIT_REASON_XRSTORS 64
#define EXIT_REASON_UMWAIT 67
#define EXIT_REASON_TPAUSE 68
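vmx_vcpu_run() above treats the 32-bit exit_reason as two parts: the low 16 bits (the `(u16)` cast) select the basic reason from this list, and bit 31 (VMX_EXIT_REASONS_FAILED_VMENTRY) flags a failed VM entry. A minimal sketch of that decoding follows; the `demo_` helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_EXIT_REASONS_FAILED_VMENTRY 0x80000000u

/* Low 16 bits: the basic exit reason (for example 33 = INVALID_STATE). */
uint16_t demo_basic_exit_reason(uint32_t exit_reason)
{
    return (uint16_t)exit_reason;
}

/* Bit 31: set when the exit reports a VM-entry failure. */
bool demo_failed_vmentry(uint32_t exit_reason)
{
    return (exit_reason & DEMO_EXIT_REASONS_FAILED_VMENTRY) != 0;
}
```

For the raw value 0x80000021, the basic reason is 33 (EXIT_REASON_INVALID_STATE) and the failed-entry flag is set.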
vmx_handle_exit()
Exits eventually reach vmx_handle_exit(), which dispatches each event to its handler:
/*
* The exit handlers return 1 if the exit was handled fully and guest execution
* may resume. Otherwise they set the kvm_run parameter to indicate what needs
* to be done to userspace and return 0.
*/
static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_EXCEPTION_NMI] = handle_exception_nmi, /* exceptions and non-maskable interrupts (NMI) */
[EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt, /* always returns 1; does no real work, can be ignored */
[EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault, /* always returns 0; reports a KVM shutdown exit */
[EXIT_REASON_NMI_WINDOW] = handle_nmi_window, /* always returns 1; can be ignored */
[EXIT_REASON_IO_INSTRUCTION] = handle_io, /* port I/O instructions, as the name suggests */
[EXIT_REASON_CR_ACCESS] = handle_cr, /* control register access */
[EXIT_REASON_DR_ACCESS] = handle_dr, /* debug register access */
[EXIT_REASON_CPUID] = kvm_emulate_cpuid, /* emulates CPUID; operates on EAX and related registers */
[EXIT_REASON_MSR_READ] = kvm_emulate_rdmsr, /* emulates RDMSR; the result is returned in EDX:EAX */
[EXIT_REASON_MSR_WRITE] = kvm_emulate_wrmsr, /* emulates WRMSR; writes the MSR */
[EXIT_REASON_INTERRUPT_WINDOW] = handle_interrupt_window, /* always returns 1; can be ignored */
[EXIT_REASON_HLT] = kvm_emulate_halt, /* HLT instruction; halts the vCPU */
[EXIT_REASON_INVD] = handle_invd, /* calls kvm_emulate_instruction */
[EXIT_REASON_INVLPG] = handle_invlpg, /* calls kvm_skip_emulated_instruction */
[EXIT_REASON_RDPMC] = handle_rdpmc, /* x86 RDPMC instruction; reads a PMU counter */
[EXIT_REASON_VMCALL] = handle_vmcall, /* VMCALL instruction; calls kvm_emulate_hypercall */
[EXIT_REASON_VMCLEAR] = handle_vmx_instruction,
[EXIT_REASON_VMLAUNCH] = handle_vmx_instruction,
[EXIT_REASON_VMPTRLD] = handle_vmx_instruction,
[EXIT_REASON_VMPTRST] = handle_vmx_instruction,
[EXIT_REASON_VMREAD] = handle_vmx_instruction,
[EXIT_REASON_VMRESUME] = handle_vmx_instruction,
[EXIT_REASON_VMWRITE] = handle_vmx_instruction,
[EXIT_REASON_VMOFF] = handle_vmx_instruction,
[EXIT_REASON_VMON] = handle_vmx_instruction, /* handle_vmx_instruction always returns 1 */
[EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold, /* register handling; returns 1 */
[EXIT_REASON_APIC_ACCESS] = handle_apic_access, /* APIC access */
[EXIT_REASON_APIC_WRITE] = handle_apic_write, /* returns 1 */
[EXIT_REASON_EOI_INDUCED] = handle_apic_eoi_induced, /* always returns 1 */
[EXIT_REASON_WBINVD] = handle_wbinvd, //register handling
[EXIT_REASON_XSETBV] = handle_xsetbv, //register handling
[EXIT_REASON_TASK_SWITCH] = handle_task_switch, //emulates a hardware task switch
[EXIT_REASON_MCE_DURING_VMENTRY] = handle_machine_check, //always returns 1; can be ignored
[EXIT_REASON_GDTR_IDTR] = handle_desc,
[EXIT_REASON_LDTR_TR] = handle_desc,
[EXIT_REASON_EPT_VIOLATION] = handle_ept_violation, //EPT violation: fault on a guest-physical memory access
[EXIT_REASON_EPT_MISCONFIG] = handle_ept_misconfig, //handles an EPT misconfiguration
[EXIT_REASON_PAUSE_INSTRUCTION] = handle_pause, //PAUSE
[EXIT_REASON_MWAIT_INSTRUCTION] = handle_mwait, //emulates MWAIT as a NOP
[EXIT_REASON_MONITOR_TRAP_FLAG] = handle_monitor_trap, //returns 1; can be ignored
[EXIT_REASON_MONITOR_INSTRUCTION] = handle_monitor, //emulates MONITOR as a NOP
[EXIT_REASON_INVEPT] = handle_vmx_instruction,
[EXIT_REASON_INVVPID] = handle_vmx_instruction,
[EXIT_REASON_RDRAND] = handle_invalid_op, //returns 1; can be ignored
[EXIT_REASON_RDSEED] = handle_invalid_op,
[EXIT_REASON_PML_FULL] = handle_pml_full, //returns 1; can be ignored
[EXIT_REASON_INVPCID] = handle_invpcid, //memory-management related (PCIDs)
[EXIT_REASON_VMFUNC] = handle_vmx_instruction, //returns 1; can be ignored
[EXIT_REASON_PREEMPTION_TIMER] = handle_preemption_timer, //returns 1; can be ignored
[EXIT_REASON_ENCLS] = handle_encls, //returns 1; can be ignored
};
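The table above is a sparse array of function pointers indexed by exit reason. The sketch below shows the same dispatch technique with designated initializers, a bounds check, and a NULL check for unhandled reasons. It is a hypothetical reduction: the `demo_` names and the -1 return for unknown reasons are choices made for this example, not the behavior of vmx_handle_exit().

```c
#include <stddef.h>

#define DEMO_EXIT_REASON_TRIPLE_FAULT 2
#define DEMO_EXIT_REASON_CPUID        10

struct demo_hvcpu {
    int cpuid_count;
    int shutdown_requested;
};

typedef int (*demo_exit_handler)(struct demo_hvcpu *vcpu);

/* Fully handled in the "kernel": return 1 so the guest can resume. */
static int demo_handle_cpuid(struct demo_hvcpu *vcpu)
{
    vcpu->cpuid_count++;
    return 1;
}

/* Needs userspace: record what must be done and return 0. */
static int demo_handle_triple_fault(struct demo_hvcpu *vcpu)
{
    vcpu->shutdown_requested = 1;
    return 0;
}

/* Sparse table; entries that are not listed are NULL. */
static const demo_exit_handler demo_exit_handlers[] = {
    [DEMO_EXIT_REASON_TRIPLE_FAULT] = demo_handle_triple_fault,
    [DEMO_EXIT_REASON_CPUID]        = demo_handle_cpuid,
};

#define DEMO_NR_HANDLERS \
    (sizeof(demo_exit_handlers) / sizeof(demo_exit_handlers[0]))

/* Dispatch on the basic exit reason; -1 marks an unhandled reason here. */
int demo_handle_exit(struct demo_hvcpu *vcpu, unsigned int reason)
{
    if (reason < DEMO_NR_HANDLERS && demo_exit_handlers[reason] != NULL)
        return demo_exit_handlers[reason](vcpu);
    return -1;
}
```

The return convention matches the comment above the real table: 1 means the exit was handled and the guest may resume, 0 means userspace must act.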
Reasons for VM exit
Many events and instructions can cause a VM exit. Some are always enabled; others can be switched on or off through the VMCS control fields.
Unconditional reasons for VM exit include:
- CPUID
- RDMSR and WRMSR unless MSR bitmap is used
- most of VMX instructions
- INIT signal
- SIPI signal - does not result in exit if the processor is not in wait-for-SIPI state
- triple fault
- task switches (hardware, including
- VM entry failure
There are too many controllable exit reasons to describe each one separately, but most of them can be classified as one of:
- interrupts or interrupt windows
- I/O ports access
- memory access - controlled by EPT
- HLT/PAUSE and pre-emption timer - useful for multiple VMs running on one physical CPU
- changes to descriptor tables and control registers
- APIC access
kvm_userspace_exit
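In virt/kvm/kvm_main.c, kvm_vcpu_ioctl() handles KVM_RUN by calling kvm_arch_vcpu_ioctl_run(), the architecture-specific run function. When that function returns, the kernel KVM layer can no longer handle the vCPU's exit on its own, and control must go back from kernel mode to user mode so that QEMU can handle it.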
r = kvm_arch_vcpu_ioctl_run(vcpu);
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
When kvm_arch_vcpu_ioctl_run() returns, the kernel calls this a userspace exit. The possible exit reasons are:
#define KVM_EXIT_UNKNOWN 0
#define KVM_EXIT_EXCEPTION 1
#define KVM_EXIT_IO 2
#define KVM_EXIT_HYPERCALL 3
#define KVM_EXIT_DEBUG 4
#define KVM_EXIT_HLT 5
#define KVM_EXIT_MMIO 6
#define KVM_EXIT_IRQ_WINDOW_OPEN 7
#define KVM_EXIT_SHUTDOWN 8
#define KVM_EXIT_FAIL_ENTRY 9
#define KVM_EXIT_INTR 10
#define KVM_EXIT_SET_TPR 11
#define KVM_EXIT_TPR_ACCESS 12
#define KVM_EXIT_S390_SIEIC 13
#define KVM_EXIT_S390_RESET 14
#define KVM_EXIT_DCR 15 /* deprecated */
#define KVM_EXIT_NMI 16
#define KVM_EXIT_INTERNAL_ERROR 17
#define KVM_EXIT_OSI 18
#define KVM_EXIT_PAPR_HCALL 19
#define KVM_EXIT_S390_UCONTROL 20
#define KVM_EXIT_WATCHDOG 21
#define KVM_EXIT_S390_TSCH 22
#define KVM_EXIT_EPR 23
#define KVM_EXIT_SYSTEM_EVENT 24
#define KVM_EXIT_S390_STSI 25
#define KVM_EXIT_IOAPIC_EOI 26
#define KVM_EXIT_HYPERV 27
#define KVM_EXIT_ARM_NISV 28
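VCPU scheduling
Modern systems are usually symmetric multiprocessing (SMP) systems, and the operating system is generally free to schedule a VCPU on any physical CPU. Moving a VCPU between physical CPUs affects VM performance: on the same physical CPU, re-entering the guest only requires VMRESUME, but switching to a different physical CPU requires VMCLEAR, VMPTRLD, and VMLAUNCH.
Simplified steps for scheduling a VCPU onto a different physical CPU (the real KVM handling is more involved):
- Execute VMCLEAR on the source physical CPU, which ensures that cached VMCS data associated with that CPU is flushed to memory.
- On the destination physical CPU, execute VMPTRLD with the physical address of the VCPU's VMCS as the operand.
- On the destination physical CPU, execute VMLAUNCH.
Each physical CPU has a per-CPU pointer to a VMCS structure, current_vmcs, defined in vmx.c: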
DEFINE_PER_CPU(struct vmcs *, current_vmcs);
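Each VCPU is also assigned a VMCS, created in vmx_create_vcpu and stored in the vmcs member of loaded_vmcs in vcpu_vmx. VCPU scheduling essentially shares each physical CPU's per-CPU current_vmcs among the VCPUs: at any given moment it points to the VMCS of one of them.
- The kernel calls vcpu_load to associate VCPU1 with PCPU1. On the first ioctl(KVM_RUN), vcpu_load is called on the ioctl path (at the start of kvm_vcpu_ioctl in older kernels; in the code shown earlier, from kvm_arch_vcpu_ioctl_run). When the thread is scheduled back in, the association is made in kvm_sched_in, which goes through kvm_arch_vcpu_load to the vendor implementation (such as vmx_vcpu_load).
- While PCPU1 executes guest code, the current thread cannot be preempted or interrupted, but an interrupt can trigger a VM exit, which returns control from the guest to the host. After the exit and some necessary processing, interrupts and preemption are re-enabled, so PCPU1 may then schedule another thread or VCPU.
- When VCPU1's thread is preempted, kvm_sched_out is called. If the scheduler later places VCPU1 on physical CPU 2, VCPU1's state must be associated with PCPU2.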
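The cost difference between staying on one physical CPU and migrating can be seen in a small state model. This is a sketch under stated assumptions: the `demo_` names are hypothetical, instructions are represented by counters and an enum, and the only rule borrowed from the text is that a VMCS must be cleared on the old CPU, loaded on the new one, and launched again before it can be resumed.

```c
#include <stdbool.h>

enum demo_entry { DEMO_VMLAUNCH, DEMO_VMRESUME };

struct demo_vmcs {
    int cpu;             /* physical CPU the VMCS is loaded on, -1 if none */
    bool launched;       /* true once VMLAUNCH has succeeded on this CPU */
    int vmclear_count;
    int vmptrld_count;
};

/* Associate the VMCS with a physical CPU, migrating it if necessary. */
void demo_vcpu_load(struct demo_vmcs *vmcs, int cpu)
{
    if (vmcs->cpu == cpu)
        return;                  /* same CPU: nothing to do */
    if (vmcs->cpu != -1) {
        vmcs->vmclear_count++;   /* VMCLEAR on the source CPU */
        vmcs->launched = false;  /* a cleared VMCS must be launched again */
    }
    vmcs->cpu = cpu;
    vmcs->vmptrld_count++;       /* VMPTRLD on the destination CPU */
}

/* Pick the entry instruction for the next guest entry. */
enum demo_entry demo_vcpu_enter(struct demo_vmcs *vmcs)
{
    enum demo_entry insn = vmcs->launched ? DEMO_VMRESUME : DEMO_VMLAUNCH;

    vmcs->launched = true;
    return insn;
}
```

Running twice on CPU 0 gives VMLAUNCH followed by VMRESUME; moving to CPU 1 costs one VMCLEAR, one more VMPTRLD, and a fresh VMLAUNCH.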