Linux Time Subsystem 5: timekeeper, timecounter, and cyclecounter

1. Introduction

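The previous articles covered clock_gettime, the interface user space uses to read the time; posix_clocks, the kinds of clocks; and clocksource, the clock source. That raises a question. Every time value reported by clock_gettime or defined by a posix_clock is measured from some starting point, for example as a number of seconds since the Linux Epoch. But the read interface of the clocksource described in the previous article returns, by its definition, the cycle count of the clock source. So how do cycles map to time, and how does the kernel use a clocksource to implement the various posix clocks discussed earlier? That is the subject of this article: the timekeeper. We then generalize to all hardware counters and introduce the concepts of timecounter and cyclecounter. As before, this article builds on the article "Linux时间子系统之（四）：timekeeping" (Linux Time Subsystem (4): timekeeping) and adds some personal notes.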

2 Timekeeping

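The timekeeping module is a foundational module that provides time services. The Linux kernel offers several timelines: the real time clock, the monotonic clock, the monotonic raw clock, and so on. The timekeeping module tracks and maintains these timelines and serves other modules (timer-related modules, user-space time services, etc.). It maintains the timelines on top of the clocksource module and the tick module: tick events from the tick module update the timelines periodically, and the clocksource module supplies more precise time information between ticks.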

2.1 The timekeeper data structure

struct timekeeper is defined as follows:

struct timekeeper {
	struct tk_read_base	tkr_mono;
	struct tk_read_base	tkr_raw;
	u64			xtime_sec;
	unsigned long		ktime_sec;
	struct timespec64	wall_to_monotonic;
	ktime_t			offs_real;
	ktime_t			offs_boot;
	ktime_t			offs_tai;
	s32			tai_offset;
	unsigned int		clock_was_set_seq;
	u8			cs_was_changed_seq;
	ktime_t			next_leap_ktime;
	u64			raw_sec;
	struct timespec64	monotonic_to_boot;

	/* The following members are for timekeeping internal use */
	u64			cycle_interval;
	u64			xtime_interval;
	s64			xtime_remainder;
	u64			raw_interval;
	/* The ntp_tick_length() value currently being used.
	 * This cached copy ensures we consistently apply the tick
	 * length for an entire tick, as ntp_tick_length may change
	 * mid-tick, and we don't want to apply that new value to
	 * the tick in progress.
	 */
	u64			ntp_tick;
	/* Difference between accumulated time and NTP time in ntp
	 * shifted nano seconds. */
	s64			ntp_error;
	u32			ntp_error_shift;
	u32			ntp_err_mult;
	/* Flag used to avoid updating NTP twice with same second */
	u32			skip_second_overflow;
#ifdef CONFIG_DEBUG_TIMEKEEPING
	long			last_warning;
	/*
	 * These simple flag variables are managed
	 * without locks, which is racy, but they are
	 * ok since we don't really care about being
	 * super precise about how many events were
	 * seen, just that a problem was observed.
	 */
	int			underflow_seen;
	int			overflow_seen;
#endif
};

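tkr_mono: the structure that tracks monotonic time.
tkr_raw: the structure that tracks raw monotonic time.
xtime_sec: the current seconds value of real time.
ktime_sec: the current seconds value of monotonic time.
wall_to_monotonic: the offset between real time and monotonic time.
offs_real: the offset between monotonic time and real time, offs_real = -wall_to_monotonic.
offs_boot: the offset between monotonic time and boot time.
offs_tai: the offset between monotonic time and TAI time, offs_tai = offs_real + tai_offset.
tai_offset: the offset between real time and TAI time.
clock_was_set_seq: a sequence number that counts how many times the clock has been set.
cs_was_changed_seq: a sequence number that counts how many times the clock source has been changed.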

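next_leap_ktime: the time at which the next leap second is due. A leap second means a minute that has 61 seconds; leap seconds are scheduled at the last instant of June 30 or December 31. The Earth's rotation is not perfectly uniform; more precisely, it is gradually slowing down. Whenever the time error accumulated from changes in the Earth's rotation approaches 1 second relative to atomic clocks, clocks are manually adjusted forward or back by 1 second so that the two agree again. If a second is added, it is a positive leap second (the clock is set back by 1 second); if a second is removed, it is a negative leap second (the clock is set forward by 1 second), although a negative leap second has never occurred so far. The minute that contains a positive leap second therefore has 61 seconds.
raw_sec: the current seconds value of raw monotonic time.
monotonic_to_boot: the offset between monotonic time and boot time.
cycle_interval: the number of clock cycles in one NTP interval.
xtime_interval: the number of nanoseconds in one NTP interval. The value is stored shifted, that is, the actual nanosecond count shifted left by shift bits, and it is adjusted according to the state of the NTP layer.
xtime_remainder: the precision lost when converting a cycle count to nanoseconds.
raw_interval: also the number of nanoseconds in one NTP interval, also shifted, but this value is not adjusted according to the NTP state; once set, it does not change. Initially, xtime_interval and raw_interval are identical.
ntp_tick: the length of an NTP interval in nanoseconds. This value is also shifted, but the shift amount is a fixed value rather than one determined by the clock source device.
ntp_error: the difference between NTP time and the current real time. If ntp_error is greater than 0, the system's real time is behind NTP time; if it is less than 0, the system's real time is ahead.
ntp_error_shift: the difference between the NTP shift and the clock source device's shift. The NTP layer also shifts its nanosecond values, by the amount defined in the macro NTP_SCALE_SHIFT, currently 32 bits. The clock source device's shift, however, is computed from its own parameters, so both layers shift but by different amounts. Converting between them requires recording the difference.
ntp_err_mult: 1 if ntp_error is greater than 0, otherwise 0.
skip_second_overflow: whether this second should be skipped when handling a leap second.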
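
The relationship among cycle_interval, xtime_interval, xtime_remainder and raw_interval is easier to see with numbers. The following is a minimal user-space sketch, assuming the interval length is one second divided by a tick frequency hz; the names interval_model and model_setup_intervals are hypothetical and the kernel's own setup code may differ in detail.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical model of the interval-related timekeeper fields. */
struct interval_model {
	uint64_t cycle_interval;   /* clock cycles per NTP interval */
	uint64_t xtime_interval;   /* shifted ns per NTP interval */
	int64_t  xtime_remainder;  /* shifted ns lost to rounding */
	uint64_t raw_interval;     /* same as xtime_interval initially */
};

/* Assumption: one NTP interval lasts NSEC_PER_SEC / hz nanoseconds. */
static struct interval_model model_setup_intervals(uint32_t mult,
						   uint32_t shift,
						   uint32_t hz)
{
	struct interval_model m;
	uint64_t ntp_interval = (NSEC_PER_SEC / hz) << shift; /* shifted ns */
	uint64_t cycles = (ntp_interval + mult / 2) / mult;   /* rounded */

	if (cycles == 0)
		cycles = 1;

	m.cycle_interval = cycles;
	m.xtime_interval = cycles * mult;
	m.xtime_remainder = (int64_t)(ntp_interval - m.xtime_interval);
	m.raw_interval = cycles * mult;
	return m;
}
```

With a 24 MHz counter (mult = 699050666, shift = 24) and hz = 100, an interval is 240000 cycles and the rounding of mult leaves a small non-zero remainder in shifted nanoseconds.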

The tk_read_base structure used above is defined as follows:

struct tk_read_base {
	struct clocksource	*clock;
	u64			mask;
	u64			cycle_last;
	u32			mult;
	u32			shift;
	u64			xtime_nsec;
	ktime_t			base;
	u64			base_real;
};

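clock: pointer to the structure of the underlying clock source device.
cycle_last: the most recently recorded cycle count of the clock source.
mask, mult and shift: the mask, mult and shift values of the underlying clock source device, used to convert between clock cycles and nanoseconds.
xtime_nsec: the current nanoseconds value of real time. This value is also shifted, that is, the actual nanosecond count shifted left by shift bits. It accumulates and carries over into the seconds field.
base: the base time of monotonic time.
base_real: the base time of real time, base_real = base + offs_real.

The descriptions above draw on the article "Linux时间子系统之时间维护层" (Linux Time Subsystem: The Timekeeping Layer).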
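
To make the role of mult and shift concrete, here is a small user-space sketch of the cycle-to-nanosecond conversion. The helper names cycles_to_ns and calc_mult are hypothetical and are not kernel functions; the sketch assumes the product cycles * mult fits in 64 bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a cycle count to nanoseconds.
 * ns = (cycles * mult) >> shift, so mult / 2^shift is the number
 * of nanoseconds per cycle expressed as a fixed-point fraction.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}

/*
 * Hypothetical helper: pick mult for a given counter frequency and shift.
 * One cycle lasts 1e9 / freq_hz ns, so mult = (1e9 << shift) / freq_hz.
 */
static uint32_t calc_mult(uint32_t freq_hz, uint32_t shift)
{
	uint64_t tmp = (uint64_t)1000000000 << shift;

	return (uint32_t)(tmp / freq_hz);
}
```

For a 250 MHz counter and shift = 8, calc_mult gives 1024, so each cycle counts as exactly 4 ns. A larger shift gives finer resolution for frequencies that do not divide evenly, at the cost of overflowing sooner for large cycle counts.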

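2.1 The global timekeeper variable

The timekeeper maintains all of the system's clocks. (This is not quite accurate. As the POSIX clocks article explained, the timekeeper maintains all clocks that are related to system time, which is exactly why timecounter exists; we come back to this later.) It is defined as follows: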

static struct {
	seqcount_raw_spinlock_t	seq;
	struct timekeeper	timekeeper;
} tk_core ____cacheline_aligned = {
	.seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_core.seq, &timekeeper_lock),
};

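tk_core is the variable that holds the kernel's time information. The ____cacheline_aligned macro tells the compiler to place the structure or variable at an address aligned to the start of an L1 cache line (see other references for details; this article does not go into it).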

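2.2 Initialization

The timekeeping initialization code is in the timekeeping_init function, which is called during system initialization (start_kernel).

The system clocks in the timekeeping module keep their data in RAM, so the data is lost on power-off. After power-on, the kernel therefore reads the current time from the persistent clock (for example an RTC, which is battery-powered and keeps its data while the system is off) and initializes the system clocks accordingly, as follows: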

void __init timekeeping_init(void)
{
	struct timespec64 wall_time, boot_offset, wall_to_mono;
	struct timekeeper *tk = &tk_core.timekeeper;
	struct clocksource *clock;
	unsigned long flags;

	read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
	if (timespec64_valid_settod(&wall_time) &&
	    timespec64_to_ns(&wall_time) > 0) {
		persistent_clock_exists = true;
	} else if (timespec64_to_ns(&wall_time) != 0) {
		pr_warn("Persistent clock returned invalid value");
		wall_time = (struct timespec64){0};
	}

	if (timespec64_compare(&wall_time, &boot_offset) < 0)
		boot_offset = (struct timespec64){0};

	/*
	 * We want set wall_to_mono, so the following is true:
	 * wall time + wall_to_mono = boot time
	 */
	wall_to_mono = timespec64_sub(boot_offset, wall_time);

	raw_spin_lock_irqsave(&timekeeper_lock, flags);
	write_seqcount_begin(&tk_core.seq);
	ntp_init();

	clock = clocksource_default_clock();
	if (clock->enable)
		clock->enable(clock);
	tk_setup_internals(tk, clock);

	tk_set_xtime(tk, &wall_time);
	tk->raw_sec = 0;

	tk_set_wall_to_mono(tk, wall_to_mono);

	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

	write_seqcount_end(&tk_core.seq);
	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
}

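read_persistent_wall_and_boot_offset calls read_persistent_clock64, which is an architecture-specific function. The code that follows checks whether the time obtained is valid. Only the tegra and omap platforms implement the read_persistent_clock function. Other ARM platforms enable the CONFIG_RTC_HCTOSYS kernel option instead; with it enabled, drivers/rtc/hctosys.c is compiled into the kernel, and rtc_hctosys initializes the xtime variable through do_settimeofday during system initialization.

clocksource_default_clock and tk_setup_internals set the default clocksource for the timekeeping module. When timekeeping is initialized it is hard to pick the best clock source, because the best one may well not be initialized yet. The strategy is therefore to use a clock source that is guaranteed to be ready at that point, namely the jiffies-based clocksource. clocksource_default_clock is defined in kernel/time/jiffies.c as a weak symbol, so you can redefine it if you wish, provided your replacement is ready when timekeeping is initialized.

The remaining code initializes the real time clock, the monotonic clock and the monotonic raw clock.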
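
The offsets that this initialization establishes follow the relations listed for the timekeeper fields (offs_real = -wall_to_monotonic, offs_tai = offs_real + tai_offset). The following is a simplified user-space model of those relations, assuming every value is a signed nanosecond count; offs_model and the model_ functions are hypothetical names, not kernel APIs.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical model: all times are signed nanoseconds. */
struct offs_model {
	int64_t wall_to_monotonic; /* wall time + this = monotonic time */
	int64_t offs_real;         /* monotonic + this = real time */
	int64_t offs_tai;          /* monotonic + this = TAI time */
	int32_t tai_offset;        /* seconds between real time and TAI */
};

/* Derive the dependent offsets whenever wall_to_monotonic changes. */
static void model_set_wall_to_mono(struct offs_model *o, int64_t wtm_ns)
{
	o->wall_to_monotonic = wtm_ns;
	o->offs_real = -wtm_ns;
	o->offs_tai = o->offs_real + (int64_t)o->tai_offset * NSEC_PER_SEC;
}

static int64_t model_mono_to_real(const struct offs_model *o, int64_t mono)
{
	return mono + o->offs_real;
}

static int64_t model_mono_to_tai(const struct offs_model *o, int64_t mono)
{
	return mono + o->offs_tai;
}
```

If the persistent clock reports 1700000000 s at boot and the boot offset is zero, wall_to_monotonic is the negative of that value, so a monotonic reading of 5 s maps back to a real time of 1700000005 s.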

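2.3 Getting and setting the system time

To read the monotonic clock: ktime_get and ktime_get_ts64.

To read the real time clock: ktime_get_real and ktime_get_real_ts64.

To read the boot clock: ktime_get_boottime and ktime_get_boottime_ts64.

In general, the timekeeping module updates the values of the system clocks when a tick arrives. A call to ktime_get is very likely to happen between two ticks, and at that point the stored clock value alone is not precise enough, since it is only updated once per tick. To achieve high precision, the nanosecond part is therefore obtained through timekeeping_get_ns. This function returns the current nanosecond value by taking the value recorded at the last tick (xtime_nsec) and adding the delta between the last tick and the current moment.

ktime_get_ts64 works the same way as ktime_get; only the format of the returned time value differs.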
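
The delta computation described for timekeeping_get_ns can be sketched in user space as follows. This is an illustration only: tkr_model is a hypothetical stand-in for the relevant fields of struct tk_read_base, and the current counter value is passed in as a parameter instead of being read from a clock source.

```c
#include <stdint.h>

/* Hypothetical stand-in for the tk_read_base fields used here. */
struct tkr_model {
	uint64_t mask;        /* valid bits of the hardware counter */
	uint64_t cycle_last;  /* counter value recorded at the last tick */
	uint32_t mult;
	uint32_t shift;
	uint64_t xtime_nsec;  /* ns at the last tick, shifted left by shift */
};

/* Nanoseconds within the current second, interpolated between ticks. */
static uint64_t model_get_ns(const struct tkr_model *tkr, uint64_t cycle_now)
{
	/* Masking makes the subtraction correct across a counter wrap. */
	uint64_t delta = (cycle_now - tkr->cycle_last) & tkr->mask;
	uint64_t nsec = delta * tkr->mult + tkr->xtime_nsec;

	return nsec >> tkr->shift;
}
```

Adding the shifted xtime_nsec before the final right shift keeps the fractional nanoseconds accumulated at the last tick instead of discarding them.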

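2.4 Updating the clocks

The timekeeping_update function updates the data of the timekeeping layer. Its second parameter is the action, for which three values are currently defined: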

#define TK_CLEAR_NTP		(1 << 0)
#define TK_MIRROR		(1 << 1)
#define TK_CLOCK_WAS_SET	(1 << 2)

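TK_CLEAR_NTP: whether to clear the state of the NTP layer.
TK_MIRROR: whether to copy the data into the shadow timekeeper structure.
TK_CLOCK_WAS_SET: whether to increment clock_was_set_seq, which must be incremented every time the clock is set.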
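
Because the three actions are independent bits, callers can combine them with a bitwise OR, as timekeeping_init does with TK_MIRROR | TK_CLOCK_WAS_SET. The following is a rough user-space model of how such a flag word might be dispatched; tk_model, shadow_model and model_update are hypothetical, and the real timekeeping_update does considerably more work.

```c
#include <stdint.h>
#include <string.h>

#define TK_CLEAR_NTP		(1 << 0)
#define TK_MIRROR		(1 << 1)
#define TK_CLOCK_WAS_SET	(1 << 2)

/* Hypothetical, heavily reduced model of the timekeeper state. */
struct tk_model {
	int64_t ntp_error;
	unsigned int clock_was_set_seq;
};

/* Hypothetical shadow copy, refreshed only when TK_MIRROR is set. */
static struct tk_model shadow_model;

static void model_update(struct tk_model *tk, unsigned int action)
{
	if (action & TK_CLEAR_NTP)
		tk->ntp_error = 0;		/* drop accumulated NTP state */

	if (action & TK_CLOCK_WAS_SET)
		tk->clock_was_set_seq++;	/* the clock was set once more */

	if (action & TK_MIRROR)
		memcpy(&shadow_model, tk, sizeof(*tk));
}
```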

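3 cyclecounter and timecounter

3.1 Why timecounter and cyclecounter exist

A kernel driver may need to measure the time between event A and event B, or to timestamp each event in an event stream. In these cases the driver does not care about absolute points in time; it cares about the durations between events. To meet this need, the clock source module provides timecounter and cyclecounter.

Strictly speaking, the statement above is not accurate. Think of it this way: clocksource and timekeeper correspond to the clock source of the system clock and the software that manages system time. But how do we read the hardware counter and the time of other clock sources that are not the system clock? With timecounter and cyclecounter, the kernel can serve all hardware clocks other than the system clock in a uniform way.

The kernel uses struct cyclecounter to abstract a free-running counter that starts at 0 and keeps incrementing. Because the counter has a limited number of bits, it eventually wraps around and continues from 0. The structure is defined as follows: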

struct cyclecounter {
	u64 (*read)(const struct cyclecounter *cc);
	u64 mask;
	u32 mult;
	u32 shift;
};

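The value of every cycle counter counts ticks of some clock, so the counter value returned by read is expressed in cycles, and a cycle in turn depends on the input frequency. For other drivers, however, raw cycle counts are meaningless; a common unit such as nanoseconds is preferable. That is why struct cyclecounter has the mult and shift members, which is essentially the same idea as in clocksource.

In fact, the kernel originally had only the clock source module, sitting between the timekeeping module and the hardware. Other kernel modules also needed access to free-running counters, so the kernel developers introduced the cyclecounter and timecounter concepts. The code is somewhat duplicated, but this way the clock source code did not have to be touched.

timecounter is built on top of cyclecounter and uses nanoseconds as its unit rather than a cycle count. This makes the interface friendlier, since people prefer intuitive nanosecond values. timecounter is defined as follows: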

struct timecounter {
	const struct cyclecounter *cc;
	u64 cycle_last;
	u64 nsec;
	u64 mask;
	u64 frac;
};

3.2 How to use a timecounter

First, initialize the timecounter:

void timecounter_init(struct timecounter *tc,
		      const struct cyclecounter *cc,
		      u64 start_tstamp)
{
	tc->cc = cc;
	tc->cycle_last = cc->read(cc);
	tc->nsec = start_tstamp;
	tc->mask = (1ULL << cc->shift) - 1;
	tc->frac = 0;
}

Then read the timecounter:


u64 timecounter_read(struct timecounter *tc)
{
	u64 nsec;

	/* increment time by nanoseconds since last call */
	nsec = timecounter_read_delta(tc);
	nsec += tc->nsec;
	tc->nsec = nsec;

	return nsec;
}
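
timecounter_read relies on timecounter_read_delta, which is not shown above. The following user-space model sketches how the delta might be computed, including counter wraparound and the carried fraction frac. The model_ types, the fake_hw_counter variable and fake_read are hypothetical stand-ins for a real driver's hardware access, and the sketch follows the structure fields shown in this section rather than quoting the kernel source.

```c
#include <stdint.h>

/* User-space stand-ins for the kernel structures (hypothetical names). */
struct model_cyclecounter {
	uint64_t (*read)(const struct model_cyclecounter *cc);
	uint64_t mask;
	uint32_t mult;
	uint32_t shift;
};

struct model_timecounter {
	const struct model_cyclecounter *cc;
	uint64_t cycle_last;
	uint64_t nsec;
	uint64_t mask;	/* selects the fractional bits below shift */
	uint64_t frac;	/* fractional (shifted) ns carried between reads */
};

/* Fake hardware register; a real driver would read a device here. */
static uint64_t fake_hw_counter;

static uint64_t fake_read(const struct model_cyclecounter *cc)
{
	return fake_hw_counter & cc->mask;
}

static void model_timecounter_init(struct model_timecounter *tc,
				   const struct model_cyclecounter *cc,
				   uint64_t start_tstamp)
{
	tc->cc = cc;
	tc->cycle_last = cc->read(cc);
	tc->nsec = start_tstamp;
	tc->mask = (1ULL << cc->shift) - 1;
	tc->frac = 0;
}

/* Nanoseconds elapsed since the previous read. */
static uint64_t model_read_delta(struct model_timecounter *tc)
{
	uint64_t now = tc->cc->read(tc->cc);
	/* Masking keeps the difference correct across a wraparound. */
	uint64_t delta = (now - tc->cycle_last) & tc->cc->mask;
	uint64_t ns = delta * tc->cc->mult + tc->frac;

	tc->frac = ns & tc->mask;	/* keep what the shift would drop */
	tc->cycle_last = now;
	return ns >> tc->cc->shift;
}

static uint64_t model_timecounter_read(struct model_timecounter *tc)
{
	tc->nsec += model_read_delta(tc);
	return tc->nsec;
}
```

Note that the wraparound handling only works if the timecounter is read at least once per counter period; otherwise whole wraps are silently lost.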

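Many readers may wonder whether there is any time we need to read other than the system time, and what the point of such an interface is. This is analyzed in the later chapter on PTP clock synchronization. On a related note, hardware vendors' clocks do not always provide a cycle counter. If the interface a vendor provides returns a time directly rather than a cycle count, things get awkward. I have in fact run into this; more on that later.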