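Linux KFENCE: Usage and Implementation

0 Background

To better detect memory errors in the Linux kernel such as out-of-bounds accesses, memory corruption, use-after-free, and invalid-free, we investigated KFENCE (introduced in Linux kernel 5.12), which helps developers analyze and locate this class of memory errors.

1. Introduction to KFENCE

1.1 What is KFENCE

KFENCE is a Linux kernel tool for detecting memory errors such as out-of-bounds accesses, memory corruption, use-after-free, and invalid-free. It helps find these errors in a project as early as possible and lets developers locate and analyze them quickly.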

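1.2 Differences between KFENCE and KASAN

KFENCE

Detection scope: slab allocations no larger than one page (4 KB).

Detection mechanism:

1) Fence pages and a canary pattern detect out-of-bounds accesses.

2) Freed data pages are marked as freed and made inaccessible, which detects use-after-free.

Memory impact:

KFENCE trades a large memory overhead for low performance interference, so its memory footprint is relatively high, although num_objects can be set to an arbitrarily small value to save memory.

In the other cases (full mode and dynamic enabling) it consumes memory on the order of gigabytes.

Performance impact:

In sampling mode the performance impact is small.

In full mode the performance impact is large.

Applicable scenario: because the overhead of sampling mode is small, KFENCE can be used in mass production.

KASAN

Detection scope: memory across the whole kernel, including all slab, page, stack, and global memory.

Detection mechanism: shadow memory.

Performance impact: large overhead.

Applicable scenario: because of the high overhead, KASAN is generally used during development.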

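2. How to use KFENCE

KFENCE was only introduced in Linux kernel 5.12. To use it on an older kernel, the feature must first be backported (see Section 4).

2.1 Enabling KFENCE

CONFIG_KFENCE=y    // enable KFENCE
CONFIG_KFENCE_SAMPLE_INTERVAL=500    // sample interval in milliseconds: roughly one guarded allocation every 500 ms
CONFIG_KFENCE_NUM_OBJECTS=63    // number of guarded objects, which determines the size of the KFENCE pool

These options can be tuned to your own needs.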
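
To get a feel for what a given CONFIG_KFENCE_NUM_OBJECTS costs in memory, the pool size can be estimated from the KFENCE_POOL_SIZE formula quoted in section 3.3, (NUM_OBJECTS + 1) * 2 * PAGE_SIZE. The following is a user-space sketch, not kernel code; the function name kfence_pool_bytes is hypothetical and the page size is passed in as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: bytes reserved for the KFENCE pool.
 * Mirrors KFENCE_POOL_SIZE = (NUM_OBJECTS + 1) * 2 * PAGE_SIZE.
 */
size_t kfence_pool_bytes(size_t num_objects, size_t page_size)
{
        assert(page_size > 0);
        return (num_objects + 1) * 2 * page_size;
}
```

With 4 KB pages, kfence_pool_bytes(63, 4096) gives 524288 bytes (512 KB), and the default of 255 objects gives 2 MB.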

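2.2 Debugging

Build-time configuration is not flexible enough and is inconvenient for debugging, so the kernel exposes several nodes to user space for adjusting the configuration at runtime:

/sys/module/kfence/parameters/check_on_panic
Y: check the canary bytes of all objects on panic, which gives more debug information
N: skip this check, which reduces the extra overhead when the system crashes in production

/sys/module/kfence/parameters/deferrable
Y: KFENCE uses a deferrable sample timer, which avoids waking idle CPUs and reduces the performance impact, at the cost of a less predictable sample interval
N: KFENCE uses a regular timer, so sampling is not deferred

/sys/kernel/debug/kfence/stats  // KFENCE counters and status information

/sys/kernel/debug/kfence/objects  // information about the memory objects managed by KFENCE

echo -1 > /sys/module/kfence/parameters/sample_interval    // adjusts the sample interval at runtime; 0 disables KFENCE, -1 puts every allocation that passes the slab-type filter under KFENCE monitoring
echo 100 > /sys/module/kfence/parameters/skip_covered_thresh    // pool-usage threshold in percent: once pool usage reaches it, KFENCE skips allocations whose allocation stack trace is already covered in the pool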

2.3 Looking up the related logs

When KFENCE catches a memory error, the total bugs counter in /sys/kernel/debug/kfence/stats increases; check it with cat /sys/kernel/debug/kfence/stats.
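
For automated monitoring, the total bugs counter can be pulled out of the stats text programmatically. The sketch below is a user-space illustration only: the function name kfence_total_bugs is hypothetical, and it assumes each line of the stats file has the form "<name>: <value>".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the "total bugs" counter from the text of
 * /sys/kernel/debug/kfence/stats. Assumes lines look like "<name>: <value>".
 * Returns -1 if the counter is not found or cannot be parsed.
 */
long kfence_total_bugs(const char *stats_text)
{
        const char *key = "total bugs:";
        const char *p;
        long value;

        assert(stats_text != NULL);
        p = strstr(stats_text, key);
        if (!p)
                return -1;
        /* %ld skips the whitespace between the colon and the number. */
        if (sscanf(p + strlen(key), "%ld", &value) != 1)
                return -1;
        return value;
}
```

A caller would read the stats file into a buffer, pass it to this function, and compare the result with the previous reading to notice new reports.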

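The report is also printed to the kernel log; use dmesg | grep -i kfence to find KFENCE-related error messages.

2.4 Collecting the error reports separately

Add a hook near the point where KFENCE, after catching a memory error, writes its report to dmesg, and capture the report there. See section 3.2.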

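3. How KFENCE works

3.1 Detection principles

3.1.1 slub/slab hooks

KFENCE hooks must be added to the alloc and free paths of slub/slab so that allocations and frees can be routed through KFENCE's own alloc and free paths, which is what makes it possible to monitor memory errors.

1) kfence alloc flow

During initialization, KFENCE creates its own dedicated pool, kfence_pool; see 3.3.

kmem_cache_alloc--->__kmem_cache_alloc_lru---> slab_alloc--->slab_alloc_node--->kfence_alloc. The kfence alloc implementation is described in section 3.4.

2) kfence free flow

__kmem_cache_free--->__do_kmem_cache_free--->__cache_free--->__kfence_free. The kfence free implementation is described in section 3.5.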

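3.1.2 use-after-free

After an obj is freed, its data page is made inaccessible. Any later access to it immediately triggers a fault.

3.1.3 out-of-bounds and mem-corruption

Out-of-bounds accesses fall into two kinds: accesses beyond the data page (out-of-bounds) and overflows that stay within the data page (mem-corruption).

Access beyond the data page:

An obj allocated from kfence_pool always occupies a whole data page, regardless of its actual size. Each data page has a fence page on either side, and the MMU is used to make the fence pages inaccessible. If an access to the data page crosses the page boundary and lands in a fence page, it faults immediately. This is an out-of-bounds access beyond the data page.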
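
The fence-page idea can be tried out in user space with mmap() and mprotect(). The sketch below is only an analogy under stated assumptions: it is not how the kernel implements kfence_protect(), and the function names guarded_page_alloc and guarded_page_free are hypothetical. It maps three pages, leaves the outer two inaccessible as fences, and revokes access to the data page on free so that a later use-after-free would fault.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * User-space analogy of one KFENCE slot: [fence][data][fence].
 * Returns the data page, or NULL on failure.
 */
void *guarded_page_alloc(void)
{
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        /* Map three inaccessible pages, then open up only the middle one. */
        char *base = mmap(NULL, 3 * page, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (base == MAP_FAILED)
                return NULL;
        if (mprotect(base + page, page, PROT_READ | PROT_WRITE) != 0) {
                munmap(base, 3 * page);
                return NULL;
        }
        return base + page;
}

/*
 * "Free" the slot: keep the mapping but revoke all access, so a later
 * use-after-free faults instead of silently succeeding.
 */
int guarded_page_free(void *data)
{
        size_t page = (size_t)sysconf(_SC_PAGESIZE);

        assert(data != NULL);
        return mprotect(data, page, PROT_NONE);
}
```

Reading or writing one byte before or after the returned page, or touching the page after guarded_page_free(), would raise SIGSEGV in the process, which is the user-space counterpart of the page fault that KFENCE intercepts.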

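Overflow within the data page:

In most cases an obj is smaller than a page, so the rest of the data page is filled with a canary pattern. This makes it possible to detect an overflow that goes past the obj but stays inside the data page. This is an out-of-bounds access within the data page.

Such an overflow is not reported at the moment it happens. It is only detected when the obj is freed, by finding that the canary pattern has been overwritten. This kind of error is also called mem-corruption.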
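
The fill-then-check cycle can be modeled in plain C. The sketch below reuses the pattern of the KFENCE_CANARY_PATTERN macro quoted in section 3.2 (0xaa XOR the low three address bits); the names PAGE_SZ, canary_fill, and canary_check and the 4 KB page size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u  /* assumed page size for this sketch */

/* Same pattern as KFENCE_CANARY_PATTERN: 0xaa XOR the low 3 address bits. */
#define CANARY_PATTERN(addr) ((uint8_t)0xaa ^ (uint8_t)((uintptr_t)(addr) & 0x7))

/* Fill every byte of the page outside [obj_off, obj_off + obj_size). */
void canary_fill(uint8_t *page, size_t obj_off, size_t obj_size)
{
        assert(obj_off + obj_size <= PAGE_SZ);
        for (size_t i = 0; i < PAGE_SZ; i++)
                if (i < obj_off || i >= obj_off + obj_size)
                        page[i] = CANARY_PATTERN(&page[i]);
}

/* Return the offset of the first corrupted canary byte, or -1 if intact. */
long canary_check(const uint8_t *page, size_t obj_off, size_t obj_size)
{
        for (size_t i = 0; i < PAGE_SZ; i++) {
                if (i >= obj_off && i < obj_off + obj_size)
                        continue;       /* inside the object: not a canary */
                if (page[i] != CANARY_PATTERN(&page[i]))
                        return (long)i;
        }
        return -1;
}
```

Because the check runs only at free time, a write one byte past the object goes unnoticed until canary_check() is called, which matches the delayed detection described above. A stray read of the canary region leaves the pattern intact and is not detected.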

3.1.4 invalid-free

When an obj is freed, the recorded allocation information is checked to decide whether the free is invalid, for example a double free.
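
The check amounts to a small state machine over the per-object metadata. The sketch below is a simplified user-space model of the condition used in kfence_guarded_free() (state must be ALLOCATED and the pointer must equal the allocated address); struct obj_meta and guarded_free_check are hypothetical stand-ins, not the kernel's types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum obj_state { OBJ_UNUSED, OBJ_ALLOCATED, OBJ_FREED };

struct obj_meta {               /* simplified stand-in for kfence_metadata */
        enum obj_state state;
        uintptr_t addr;         /* address handed out by the last allocation */
};

/* Returns true if the free is valid; false models an invalid-free report. */
bool guarded_free_check(struct obj_meta *meta, uintptr_t addr)
{
        assert(meta != NULL);
        if (meta->state != OBJ_ALLOCATED || meta->addr != addr)
                return false;   /* double free, or pointer differs from the allocation */
        meta->state = OBJ_FREED;
        return true;
}
```

Freeing an interior pointer fails the address comparison, and a second free of the same pointer fails the state comparison, so both are reported as invalid-free.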

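3.2 How errors are triggered and logged

1) use-after-free: a KFENCE_ERROR_UAF memory error

When code in some module triggers a use-after-free, the access goes through the kernel's normal page-fault path, which calls KFENCE's kfence_handle_page_fault() to collect and print the error report.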

// kernel/arch/arm/mm/fault.c 

/*
 * Oops.  The kernel tried to access some page that wasn't present.
 */
static void
__do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
                  struct pt_regs *regs)
{
        const char *msg;
        /*
         * Are we prepared to handle this kernel fault?
         */
        if (fixup_exception(regs))
                return;

        /*
         * No handler, we'll have to terminate things with extreme prejudice.
         */
        if (addr < PAGE_SIZE) {
                msg = "NULL pointer dereference";
        } else {
                if (is_translation_fault(fsr) &&
                    kfence_handle_page_fault(addr, is_write_fault(fsr), regs))
                        return;

                msg = "paging request";
        }

        die_kernel_fault(msg, mm, addr, fsr, regs);
}

kfence_handle_page_fault() determines whether the error is of type KFENCE_ERROR_OOB or KFENCE_ERROR_UAF and calls kfence_report_error() to print the report to dmesg.

bool kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs *regs)
{
        const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE;
        struct kfence_metadata *to_report = NULL;
        enum kfence_error_type error_type;
        unsigned long flags;

        if (!is_kfence_address((void *)addr))
                return false;

        if (!READ_ONCE(kfence_enabled)) /* If disabled at runtime ... */
                return kfence_unprotect(addr); /* ... unprotect and proceed. */

        atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
        // Decide whether this is KFENCE_ERROR_OOB (access beyond the data page) or KFENCE_ERROR_UAF (use-after-free).
        // Odd page_index: a fence page was accessed, so this is KFENCE_ERROR_OOB.
        // Even page_index: a freed data page was accessed, so this is KFENCE_ERROR_UAF.
        if (page_index % 2) {
                /* This is a redzone, report a buffer overflow. */
                struct kfence_metadata *meta;
                int distance = 0;

                meta = addr_to_metadata(addr - PAGE_SIZE);
                if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
                        to_report = meta;
                        /* Data race ok; distance calculation approximate. */
                        distance = addr - data_race(meta->addr + meta->size);
                }

                meta = addr_to_metadata(addr + PAGE_SIZE);
                if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) {
                        /* Data race ok; distance calculation approximate. */
                        if (!to_report || distance > data_race(meta->addr) - addr)
                                to_report = meta;
                }

                if (!to_report)
                        goto out;

                raw_spin_lock_irqsave(&to_report->lock, flags);
                to_report->unprotected_page = addr;
                error_type = KFENCE_ERROR_OOB;

                /*
                 * If the object was freed before we took the look we can still
                 * report this as an OOB -- the report will simply show the
                 * stacktrace of the free as well.
                 */
        } else {
                to_report = addr_to_metadata(addr);
                if (!to_report)
                        goto out;

                raw_spin_lock_irqsave(&to_report->lock, flags);
                error_type = KFENCE_ERROR_UAF;
                /*
                 * We may race with __kfence_alloc(), and it is possible that a
                 * freed object may be reallocated. We simply report this as a
                 * use-after-free, with the stack trace showing the place where
                 * the object was re-allocated.
                 */
        }

out:
        if (to_report) {
                kfence_report_error(addr, is_write, regs, to_report, error_type);
                raw_spin_unlock_irqrestore(&to_report->lock, flags);
        } else {
                /* This may be a UAF or OOB access, but we can't be sure. */
                // The type of memory error cannot be determined
                kfence_report_error(addr, is_write, regs, NULL, KFENCE_ERROR_INVALID);
        }

        return kfence_unprotect(addr); /* Unprotect and let access proceed. */
}
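
The parity test at the top of kfence_handle_page_fault() can be isolated into a few lines. This is a user-space sketch for illustration: classify_fault, enum fault_kind, and the 4 KB PAGE_SZ are assumptions, and the sketch leaves out the metadata lookup and locking that the real function performs.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096u  /* assumed page size */

enum fault_kind { FAULT_OOB, FAULT_UAF };

/*
 * Odd page index  => fence page was hit  => out-of-bounds.
 * Even page index => data page was hit   => use-after-free.
 */
enum fault_kind classify_fault(uintptr_t pool_base, uintptr_t addr)
{
        uintptr_t page_index;

        assert(addr >= pool_base);
        page_index = (addr - pool_base) / PAGE_SZ;
        return (page_index % 2) ? FAULT_OOB : FAULT_UAF;
}
```

For a pool starting at 0x100000, a fault at offset 3 * 4096 + 16 lands in page 3 (a fence page) and is classified as out-of-bounds, while a fault at offset 2 * 4096 + 16 lands in page 2 (a data page) and is classified as use-after-free.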

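2) out-of-bounds (beyond the data page): a KFENCE_ERROR_OOB memory error

Same as above.

3) out-of-bounds (within the data page): a KFENCE_ERROR_CORRUPTION memory error

The canary region is initialized in the kfence alloc path (see 3.4). The kfence free path checks whether the canary region has been overwritten; if it has, kfence_report_error() is called with KFENCE_ERROR_CORRUPTION to print the error report.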

static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool zombie)
{
        ......
        
        /* Check canary bytes for memory corruption. */
        for_each_canary(meta, check_canary_byte);
        
        ......
}

/* __always_inline this to ensure we won't do an indirect call to fn. */
static __always_inline void for_each_canary(const struct kfence_metadata *meta, bool (*fn)(u8 *))
{
        // pageaddr is the start address of this data page
        const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE);
        unsigned long addr;

        /*
         * We'll iterate over each canary byte per-side until fn() returns
         * false. However, we'll still iterate over the canary bytes to the
         * right of the object even if there was an error in the canary bytes to
         * the left of the object. Specifically, if check_canary_byte()
         * generates an error, showing both sides might give more clues as to
         * what the error is about when displaying which bytes were corrupted.
         */

        /* Apply to left of object. */
        // Check the canary region to the left of the object
        for (addr = pageaddr; addr < meta->addr; addr++) {
                if (!fn((u8 *)addr))
                        break;
        }

        /* Apply to right of object. */
        // Check the canary region to the right of the object
        for (addr = meta->addr + meta->size; addr < pageaddr + PAGE_SIZE; addr++) {
                if (!fn((u8 *)addr))
                        break;
        }
}

/* Check canary byte at @addr. */
static inline bool check_canary_byte(u8 *addr)
{
        struct kfence_metadata *meta;
        unsigned long flags;
        // If this canary byte is intact, return immediately; otherwise call kfence_report_error() to print the error report
        if (likely(*addr == KFENCE_CANARY_PATTERN(addr)))
                return true;

        atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
        // Look up the metadata object from the memory address
        meta = addr_to_metadata((unsigned long)addr);
        raw_spin_lock_irqsave(&meta->lock, flags);
        // Call kfence_report_error() with KFENCE_ERROR_CORRUPTION to print the error report
        kfence_report_error((unsigned long)addr, false, NULL, meta, KFENCE_ERROR_CORRUPTION);
        raw_spin_unlock_irqrestore(&meta->lock, flags);

        return false;
}

/*
 * Get the canary byte pattern for @addr. Use a pattern that varies based on the
 * lower 3 bits of the address, to detect memory corruptions with higher
 * probability, where similar constants are used.
 */
#define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))

4) invalid-free: a KFENCE_ERROR_INVALID_FREE memory error

The kfence free path checks whether the current free is an invalid-free; if it is, kfence_report_error() is called with KFENCE_ERROR_INVALID_FREE to print the error report.

static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool zombie)
{
        ......
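        // Freeing a block that is not currently allocated (this includes double-free), or freeing at an address different from the allocated one, is treated as an invalid-free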
        if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
                /* Invalid or double-free, bail out. */
                atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
                // Call kfence_report_error() with KFENCE_ERROR_INVALID_FREE to print the error report
                kfence_report_error((unsigned long)addr, false, NULL, meta,
                                    KFENCE_ERROR_INVALID_FREE);
                raw_spin_unlock_irqrestore(&meta->lock, flags);
                return;
        }

        ......
}

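Next, how the error report is printed: kfence_report_error() writes the report to the kernel log, where it is visible via dmesg.

// Simplified: pr_err() is a printk() wrapper that logs at KERN_ERR level
#define pr_err printk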


void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs,
                         const struct kfence_metadata *meta, enum kfence_error_type type)
{
       ......
       
        /* Print report header. */
        switch (type) {
        // Print the report for an out-of-bounds access beyond the data page to dmesg
        case KFENCE_ERROR_OOB: {
                const bool left_of_object = address < meta->addr;

                pr_err("BUG: KFENCE: out-of-bounds %s in %pS\n\n", get_access_type(is_write),
                       (void *)stack_entries[skipnr]);
                pr_err("Out-of-bounds %s at 0x%p (%luB %s of kfence-#%td):\n",
                       get_access_type(is_write), (void *)address,
                       left_of_object ? meta->addr - address : address - meta->addr,
                       left_of_object ? "left" : "right", object_index);
                break;
        }
        // Print the report for a use-after-free to dmesg
        case KFENCE_ERROR_UAF:
                pr_err("BUG: KFENCE: use-after-free %s in %pS\n\n", get_access_type(is_write),
                       (void *)stack_entries[skipnr]);
                pr_err("Use-after-free %s at 0x%p (in kfence-#%td):\n",
                       get_access_type(is_write), (void *)address, object_index);
                break;
        // Print the report for an overflow within the data page (corrupted canary region) to dmesg
        case KFENCE_ERROR_CORRUPTION:
                pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]);
                pr_err("Corrupted memory at 0x%p ", (void *)address);
                print_diff_canary(address, 16, meta);
                pr_cont(" (in kfence-#%td):\n", object_index);
                break;
        case KFENCE_ERROR_INVALID:
                pr_err("BUG: KFENCE: invalid %s in %pS\n\n", get_access_type(is_write),
                       (void *)stack_entries[skipnr]);
                pr_err("Invalid %s at 0x%p:\n", get_access_type(is_write),
                       (void *)address);
                break;
        // Print the report for an invalid-free to dmesg
        case KFENCE_ERROR_INVALID_FREE:
                pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]);
                pr_err("Invalid free of 0x%p (in kfence-#%td):\n", (void *)address,
                       object_index);
                break;
        }

      ......
}
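
Since every report header printed above starts with "BUG: KFENCE: " followed by the error type, a log collector can recognize KFENCE reports with a simple string match. The sketch below is one possible user-space filter; the function name kfence_report_type is hypothetical, and how the log lines are obtained (for example by reading dmesg output) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the text that follows "BUG: KFENCE: " in a
 * kernel log line (for example "use-after-free read in ..."), or NULL if
 * the line is not a KFENCE report header. The result points into `line`.
 */
const char *kfence_report_type(const char *line)
{
        static const char marker[] = "BUG: KFENCE: ";
        const char *p;

        assert(line != NULL);
        p = strstr(line, marker);
        return p ? p + sizeof(marker) - 1 : NULL;
}
```

The returned text begins with one of "out-of-bounds", "use-after-free", "memory corruption", "invalid" or "invalid free", matching the pr_err() format strings in kfence_report_error().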

3.3 kfence init

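KFENCE initialization does the following:

1) Checks whether the sample interval kfence_sample_interval is 0; if it is, KFENCE is disabled.

2) Sets up the KFENCE pool. The default number of objects is 255, so the pool has (255 + 1) * 2 = 512 pages: 255 data pages, 256 fence pages, and 1 unusable page at the very beginning (page 0).

3) Initializes the metadata array, which records the state of each data page.

4) Initializes the freelist, which records which data pages are available for allocation.

5) Makes all fence pages and page 0 inaccessible.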
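
The layout described in the steps above implies a simple mapping between an object index and its data page: pages 0 and 1 are guards, and from page 2 onward data and fence pages alternate. The sketch below expresses that mapping in user-space C. It is consistent with the kfence_metadata[i / 2 - 1] indexing in kfence_init_pool(), but the function names and the 4 KB PAGE_SZ are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u  /* assumed page size */

/* Data page of object i: skip the two leading guard pages, then step by two pages. */
uintptr_t index_to_data_page(uintptr_t pool_base, size_t i)
{
        return pool_base + (i + 1) * 2 * PAGE_SZ;
}

/*
 * Inverse mapping: index of the object whose slot (data page plus the fence
 * page to its right) contains addr. Returns -1 for the two leading pages.
 */
long addr_to_index(uintptr_t pool_base, uintptr_t addr)
{
        assert(addr >= pool_base);
        return (long)((addr - pool_base) / (2 * PAGE_SZ)) - 1;
}
```

With the default 255 objects, object 0 lives in page 2 and object 254 in page 510, which leaves page 511 as the final fence page.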

// mm/kfence/core.c

void __init kfence_init(void)
{
        stack_hash_seed = get_random_u32();

        /* Setting kfence_sample_interval to 0 on boot disables KFENCE. */
        // 1. A sample interval of 0 disables KFENCE
        if (!kfence_sample_interval)
                return;
        // 2. Initialize the KFENCE pool
        if (!kfence_init_pool_early()) {
                pr_err("%s failed\n", __func__);
                return;
        }
        kfence_init_enable();
}

static bool __init kfence_init_pool_early(void)
{
        unsigned long addr;

        if (!__kfence_pool)
                return false;

        addr = kfence_init_pool();

        ......
}

#define KFENCE_POOL_SIZE ((CONFIG_KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE)    // 256 * 2 pages by default
static struct list_head kfence_freelist = LIST_HEAD_INIT(kfence_freelist);    // freelist of available objects
struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];    // metadata array: per-object state of each data page
/*
 * Initialization of the KFENCE pool after its allocation.
 * Returns 0 on success; otherwise returns the address up to
 * which partial initialization succeeded.
 */
static unsigned long kfence_init_pool(void)
{
        unsigned long addr;
        struct page *pages;
        int i;

        if (!arch_kfence_init_pool())
                return (unsigned long)__kfence_pool;

        addr = (unsigned long)__kfence_pool;
        // Get the struct page that backs the start of the pool (virtual address to struct page, not a physical address)
        pages = virt_to_page(__kfence_pool);

        /*
         * Set up object pages: they must have PG_slab set, to avoid freeing
         * these as real pages.
         *
         * We also want to avoid inserting kfence_free() in the kfree()
         * fast-path in SLUB, and therefore need to ensure kfree() correctly
         * enters __slab_free() slow-path.
         */
        // Walk every page of the pool (512 pages by default)
        for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
                struct slab *slab = page_slab(nth_page(pages, i));

                if (!i || (i % 2))
                        continue;

                __folio_set_slab(slab_folio(slab));
#ifdef CONFIG_MEMCG
                slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg |
                                   MEMCG_DATA_OBJCGS;
#endif
        }

        /*
         * Protect the first 2 pages. The first page is mostly unnecessary, and
         * merely serves as an extended guard page. However, adding one
         * additional page in the beginning gives us an even number of pages,
         * which simplifies the mapping of address to metadata index.
         */
        for (i = 0; i < 2; i++) {
                if (unlikely(!kfence_protect(addr)))
                        return addr;

                addr += PAGE_SIZE;
        }

        for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) {
                struct kfence_metadata *meta = &kfence_metadata_init[i];

                /* Initialize metadata. */
                INIT_LIST_HEAD(&meta->list);
                raw_spin_lock_init(&meta->lock);
                // Mark the object as unused
                meta->state = KFENCE_OBJECT_UNUSED;
                // Record the address of its data page
                meta->addr = addr; /* Initialize for validation in metadata_to_pageaddr(). */
                // Add it to the freelist
                list_add_tail(&meta->list, &kfence_freelist);

                /* Protect the right redzone. */
                // Make the fence page inaccessible
                if (unlikely(!kfence_protect(addr + PAGE_SIZE)))
                        goto reset_slab;
                // Advance to the start of the next data page
                addr += 2 * PAGE_SIZE;    // data pages are 8 KB apart because a fence page sits between them
        }

        /*
         * Make kfence_metadata visible only when initialization is successful.
         * Otherwise, if the initialization fails and kfence_metadata is freed,
         * it may cause UAF in kfence_shutdown_cache().
         */
        smp_store_release(&kfence_metadata, kfence_metadata_init);
        return 0;

reset_slab:
        for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
                struct slab *slab = page_slab(nth_page(pages, i));

                if (!i || (i % 2))
                        continue;
#ifdef CONFIG_MEMCG
                slab->memcg_data = 0;
#endif
                __folio_clear_slab(slab_folio(slab));
        }

        return addr;
}

3.4 kfence alloc

kfence alloc does the following:

1) Finds a free object (data page) in the KFENCE pool.

2) Writes a fixed pattern into the canary region of the data page so that it can be checked in the free path.

void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
{
        unsigned long stack_entries[KFENCE_STACK_DEPTH];
        size_t num_stack_entries;
        u32 alloc_stack_hash;

        /*
         * Perform size check before switching kfence_allocation_gate, so that
         * we don't disable KFENCE without making an allocation.
         */
        // Requests larger than one page (4 KB) are not handled by KFENCE: return NULL
        if (size > PAGE_SIZE) {
                atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_INCOMPAT]);
                return NULL;
        }

        /*
         * Skip allocations from non-default zones, including DMA. We cannot
         * guarantee that pages in the KFENCE pool will have the requested
         * properties (e.g. reside in DMAable memory).
         */
        if ((flags & GFP_ZONEMASK) ||
            (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32))) {
                atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_INCOMPAT]);
                return NULL;
        }

        /*
         * Skip allocations for this slab, if KFENCE has been disabled for
         * this slab.
         */
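        // If SLAB_SKIP_KFENCE is set, KFENCE has been disabled for this slab cache: return NULL.
        /*
         * For reference, some other slab cache flags:
         * SLAB_RECLAIM_ACCOUNT: objects in the cache are reclaimable.
         * SLAB_PANIC: panic if kmem_cache_create() fails.
         * SLAB_CONSISTENCY_CHECKS: perform extra consistency checks on alloc/free.
         * SLAB_RED_ZONE: add red zones around objects to detect out-of-bounds writes.
         * SLAB_STORE_USER: record the last alloc/free caller of each object for debugging.
         * SLAB_DEBUG_OBJECTS: used together with the debugobjects facility (CONFIG_DEBUG_OBJECTS).
         */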
        if (s->flags & SLAB_SKIP_KFENCE)
            return NULL;
        // kfence_allocation_gate > 1 means a guarded allocation was already taken in this sample interval, so skip until the next one
        if (atomic_inc_return(&kfence_allocation_gate) > 1)
                return NULL;
#ifdef CONFIG_KFENCE_STATIC_KEYS
        /*
         * waitqueue_active() is fully ordered after the update of
         * kfence_allocation_gate per atomic_inc_return().
         */
        if (waitqueue_active(&allocation_wait)) {
                /*
                 * Calling wake_up() here may deadlock when allocations happen
                 * from within timer code. Use an irq_work to defer it.
                 */
                irq_work_queue(&wake_up_kfence_timer_work);
        }
#endif

        if (!READ_ONCE(kfence_enabled))
                return NULL;

        num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 0);

        /*
         * Do expensive check for coverage of allocation in slow-path after
         * allocation_gate has already become non-zero, even though it might
         * mean not making any allocation within a given sample interval.
         *
         * This ensures reasonable allocation coverage when the pool is almost
         * full, including avoiding long-lived allocations of the same source
         * filling up the pool (e.g. pagecache allocations).
         */
        alloc_stack_hash = get_alloc_stack_hash(stack_entries, num_stack_entries);
        if (should_skip_covered() && alloc_covered_contains(alloc_stack_hash)) {
                atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_COVERED]);
                return NULL;
        }

        return kfence_guarded_alloc(s, size, flags, stack_entries, num_stack_entries,
                                    alloc_stack_hash);
}
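
The sampling gate used in __kfence_alloc() can be modeled with C11 atomics. The sketch below is a user-space illustration under one assumption that the excerpt above does not show: some periodic timer resets the gate to zero once per sample interval. The names allocation_gate, gate_try_enter, and gate_reset are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* 0 means the gate is open; stand-in for kfence_allocation_gate. */
static atomic_int allocation_gate;

/* Called on each allocation: only the first caller after a reset passes. */
bool gate_try_enter(void)
{
        /* fetch_add returns the old value, so old + 1 is the new value. */
        return atomic_fetch_add(&allocation_gate, 1) + 1 == 1;
}

/* Called by the (assumed) sample timer once per interval to reopen the gate. */
void gate_reset(void)
{
        atomic_store(&allocation_gate, 0);
}
```

Because the increment and the comparison are a single atomic operation, at most one of several concurrent allocations per interval is redirected to KFENCE, which is what keeps the overhead of sampling mode low.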

static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp,
                                  unsigned long *stack_entries, size_t num_stack_entries,
                                  u32 alloc_stack_hash)
{
        // Per-object metadata is kept in struct kfence_metadata
        struct kfence_metadata *meta = NULL;
        unsigned long flags;
        struct slab *slab;
        void *addr;
        const bool random_right_allocate = prandom_u32_max(2);
        const bool random_fault = CONFIG_KFENCE_STRESS_TEST_FAULTS &&
                                  !prandom_u32_max(CONFIG_KFENCE_STRESS_TEST_FAULTS);

        /* Try to obtain a free object. */
        // Take a free object from the KFENCE freelist
        raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
        if (!list_empty(&kfence_freelist)) {
                meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
                list_del_init(&meta->list);
        }
        
        ......

        meta->addr = metadata_to_pageaddr(meta);
        /* Unprotect if we're reusing this page. */
        // If this data page was freed earlier it is still protected (inaccessible), so make it accessible again
        if (meta->state == KFENCE_OBJECT_FREED)
                kfence_unprotect(meta->addr);

        /*
         * Note: for allocations made before RNG initialization, will always
         * return zero. We still benefit from enabling KFENCE as early as
         * possible, even when the RNG is not yet available, as this will allow
         * KFENCE to detect bugs due to earlier allocations. The only downside
         * is that the out-of-bounds accesses detected are deterministic for
         * such allocations.
         */
        if (random_right_allocate) {
                /* Allocate on the "right" side, re-calculate address. */
                meta->addr += PAGE_SIZE - size;
                meta->addr = ALIGN_DOWN(meta->addr, cache->align);
        }

        addr = (void *)meta->addr;

        /* Update remaining metadata. */
        metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED, stack_entries, num_stack_entries);
        /* Pairs with READ_ONCE() in kfence_shutdown_cache(). */
        WRITE_ONCE(meta->cache, cache);
        meta->size = size;
        meta->alloc_stack_hash = alloc_stack_hash;
        raw_spin_unlock_irqrestore(&meta->lock, flags);

        alloc_covered_add(alloc_stack_hash, 1);

        /* Set required slab fields. */
        slab = virt_to_slab((void *)meta->addr);
        slab->slab_cache = cache;
#if defined(CONFIG_SLUB)
        slab->objects = 1;
#elif defined(CONFIG_SLAB)
        slab->s_mem = addr;
#endif

        /* Memory initialization. */
        // Initialize the canary region
        for_each_canary(meta, set_canary_byte);

        /*
         * We check slab_want_init_on_alloc() ourselves, rather than letting
         * SL*B do the initialization, as otherwise we might overwrite KFENCE's
         * redzone.
         */
        if (unlikely(slab_want_init_on_alloc(gfp, cache)))
                memzero_explicit(addr, size);
        if (cache->ctor)
                cache->ctor(addr);

        if (random_fault)
                kfence_protect(meta->addr); /* Random "faults" by protecting the object. */

        atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCATED]);
        atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]);

        return addr;
}
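
The address arithmetic of the random_right_allocate branch decides which fence page an overflow will hit first. The sketch below reproduces it in user-space C; right_aligned_addr and the 4 KB PAGE_SZ are assumptions for illustration, and align is assumed to be a power of two as slab alignments are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ 4096u  /* assumed page size */

/*
 * Place an object of `size` bytes at the right end of its data page, then
 * round the address down to `align`, as in the random_right_allocate branch.
 */
uintptr_t right_aligned_addr(uintptr_t page_addr, size_t size, size_t align)
{
        uintptr_t addr;

        assert(size <= PAGE_SZ);
        assert(align != 0 && (align & (align - 1)) == 0);
        addr = page_addr + PAGE_SZ - size;
        return addr & ~((uintptr_t)align - 1);
}
```

For a 100-byte object with 8-byte alignment in the page at 0x10000, the object starts at 0x10F98 and ends 4 bytes before the right fence page. Overflows of up to 4 bytes therefore land in the canary region and are only caught at free time, while larger overflows hit the fence page and fault immediately.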

The canary bytes are written to the data page as follows:

/* Write canary byte to @addr. */
static inline bool set_canary_byte(u8 *addr)
{
        *addr = KFENCE_CANARY_PATTERN(addr);
        return true;
}

3.5 kfence free

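kfence free does the following:

1) After the object is freed, makes its data page inaccessible.

2) Checks whether the canary region of the data page has been corrupted.

3) Returns the freed object to the freelist of the KFENCE pool.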

void __kfence_free(void *addr)
{
        // Convert the address into a struct kfence_metadata pointer, meta.
        // struct kfence_metadata is the per-allocation metadata used to track allocation and free information.
        struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);

#ifdef CONFIG_MEMCG
        KFENCE_WARN_ON(meta->objcg);
#endif
        /*
         * If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
         * the object, as the object page may be recycled for other-typed
         * objects once it has been freed. meta->cache may be NULL if the cache
         * was destroyed.
         */
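        // Check whether the cache behind meta still exists and has SLAB_TYPESAFE_BY_RCU set.
        // If so, defer the free with call_rcu(). Objects of such caches may be reused
        // immediately after being freed, so the RCU mechanism is needed to free them safely.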
        if (unlikely(meta->cache && (meta->cache->flags & SLAB_TYPESAFE_BY_RCU)))
                call_rcu(&meta->rcu_head, rcu_guarded_free);
        else
                // Otherwise, free the object immediately
                kfence_guarded_free(addr, meta, false);
}

static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool zombie)
{
        struct kcsan_scoped_access assert_page_exclusive;
        unsigned long flags;
        bool init;

        raw_spin_lock_irqsave(&meta->lock, flags);

        if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) {
                /* Invalid or double-free, bail out. */
                atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
                kfence_report_error((unsigned long)addr, false, NULL, meta,
                                    KFENCE_ERROR_INVALID_FREE);
                raw_spin_unlock_irqrestore(&meta->lock, flags);
                return;
        }

        /* Detect racy use-after-free, or incorrect reallocation of this page by KFENCE. */
        kcsan_begin_scoped_access((void *)ALIGN_DOWN((unsigned long)addr, PAGE_SIZE), PAGE_SIZE,
                                  KCSAN_ACCESS_SCOPED | KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT,
                                  &assert_page_exclusive);

        if (CONFIG_KFENCE_STRESS_TEST_FAULTS)
                kfence_unprotect((unsigned long)addr); /* To check canary bytes. */

        /* Restore page protection if there was an OOB access. */
        // If an earlier out-of-bounds access left a fence page unprotected, zero it and protect it again
        if (meta->unprotected_page) {
                memzero_explicit((void *)ALIGN_DOWN(meta->unprotected_page, PAGE_SIZE), PAGE_SIZE);
                kfence_protect(meta->unprotected_page);
                meta->unprotected_page = 0;
        }

        /* Mark the object as freed. */
        // Mark the object as freed; the data page itself is made inaccessible by kfence_protect() below, so any later access triggers a use-after-free report
        metadata_update_state(meta, KFENCE_OBJECT_FREED, NULL, 0);
        init = slab_want_init_on_free(meta->cache);
        raw_spin_unlock_irqrestore(&meta->lock, flags);

        alloc_covered_add(meta->alloc_stack_hash, -1);

        /* Check canary bytes for memory corruption. */
        // Check whether the canary region of the data page has been corrupted, i.e. overwritten
        for_each_canary(meta, check_canary_byte);

        /*
         * Clear memory if init-on-free is set. While we protect the page, the
         * data is still there, and after a use-after-free is detected, we
         * unprotect the page, so the data is still accessible.
         */
        if (!zombie && unlikely(init))
                memzero_explicit(addr, meta->size);

        /* Protect to detect use-after-frees. */
        kfence_protect((unsigned long)addr);

        kcsan_end_scoped_access(&assert_page_exclusive);
        
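        // zombie marks an object that was still allocated when its cache was destroyed (see kfence_shutdown_cache()); only non-zombie objects are returned to the freelist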
        if (!zombie) {
                /* Add it to the tail of the freelist for reuse. */
                raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
                KFENCE_WARN_ON(!list_empty(&meta->list));
                list_add_tail(&meta->list, &kfence_freelist);
                raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);

                atomic_long_dec(&counters[KFENCE_COUNTER_ALLOCATED]);
                atomic_long_inc(&counters[KFENCE_COUNTER_FREES]);
        } else {
                /* See kfence_shutdown_cache(). */
                atomic_long_inc(&counters[KFENCE_COUNTER_ZOMBIES]);
        }
}

3.6 metadata

The metadata records the state of each guarded object (data page).

3.7 核心数据结构

/* Alloc/free tracking information. */
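// Tracks information about one allocation or one free
struct kfence_track {
        pid_t pid;    // PID of the task that performed the alloc/free
        int cpu;    // CPU on which the operation ran
        u64 ts_nsec;    // timestamp of the alloc/free, in nanoseconds
        int num_stack_entries;    // number of saved stack trace entries
        unsigned long stack_entries[KFENCE_STACK_DEPTH];    // saved stack trace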
};

/* KFENCE error types for report generation. */
// Error type definitions
enum kfence_error_type {
        KFENCE_ERROR_OOB,                /* Detected a out-of-bounds access. */
        KFENCE_ERROR_UAF,                /* Detected a use-after-free access. */
        KFENCE_ERROR_CORRUPTION,        /* Detected a memory corruption on free. */
        KFENCE_ERROR_INVALID,                /* Invalid access of unknown type. */
        KFENCE_ERROR_INVALID_FREE,        /* Invalid free. */
};

/* KFENCE object states. */
// States of a metadata object
enum kfence_object_state {
        KFENCE_OBJECT_UNUSED,                /* Object is unused. */
        KFENCE_OBJECT_ALLOCATED,        /* Object is currently allocated. */
        KFENCE_OBJECT_FREED,                /* Object was allocated, and then freed. */
};

/* KFENCE metadata per guarded allocation. */
// Records information about one data page
struct kfence_metadata {
        struct list_head list;              /* Freelist node; access under kfence_freelist_lock. */
        struct rcu_head rcu_head;        /* For delayed freeing. */

        /*
         * Lock protecting below data; to ensure consistency of the below data,
         * since the following may execute concurrently: __kfence_alloc(),
         * __kfence_free(), kfence_handle_page_fault(). However, note that we
         * cannot grab the same metadata off the freelist twice, and multiple
         * __kfence_alloc() cannot run concurrently on the same metadata.
         */
        raw_spinlock_t lock;

        /* The current state of the object; see above. */
        enum kfence_object_state state;    // state of the object

        /*
         * Allocated object address; cannot be calculated from size, because of
         * alignment requirements.
         *
         * Invariant: ALIGN_DOWN(addr, PAGE_SIZE) is constant.
         */
        unsigned long addr;    // address of the allocated object within its data page

        /*
         * The size of the original allocation.
         */
        size_t size;    // size of the original allocation

        /*
         * The kmem_cache cache of the last allocation; NULL if never allocated
         * or the cache has already been destroyed.
         */
        struct kmem_cache *cache;    // slab cache the object was last allocated from

        /*
         * In case of an invalid access, the page that was unprotected; we
         * optimistically only store one address.
         */
        unsigned long unprotected_page;

        /* Allocation and free stack information. */
        struct kfence_track alloc_track;    // allocation record
        struct kfence_track free_track;    // free record
        /* For updating alloc_covered on frees. */
        u32 alloc_stack_hash;    // hash of the allocation stack trace; used on free to decrement the matching alloc_covered entry
#ifdef CONFIG_MEMCG
        struct obj_cgroup *objcg;
#endif
};