再谈 TCP 连接的源端口选择

TCP 源端口的选择有两个场景：

主机场景
SNAT 场景

先看主机场景，其中又区分了两类互斥的场景：

bind 时选端口：如果 bind 的端口被某条 established 连接使用，则无法 bind；
connect 时选端口：如果端口已经被 bind，则无法在 connect 时被选中。

因此必须在 bind 和 connect 之间以及 bind，connect 彼此之间做互斥，遍历某端口的使用链表(bind 只涉及 2 元组，因此只涉及 port hash list)时需要 lock，由于一台主机可能有多个进程同时进行 bind 和 connect，为提高并发能力，Linux 整了个花活儿：

bind 先遍历奇数端口，再遍历偶数端口；
connect 先遍历偶数端口，再遍历奇数端口。

这样就巧妙地将 bind 和 connect 的遍历路径错开，沿着这个思路继续，还可以更花，比如在奇数或偶数的遍历空间内，根据进程 pid 再次区分奇偶跨度，比如奇数跨度按 1，5，9，13，… 顺序遍历，而偶数跨度按 3，7，11，… 顺序遍历，诸如此类。

本着 “及时停止优化” 的哲学，审视一下花活的代价是高尚的。

在奇数(或偶数)端口号都已经用尽情况下，花活儿不但发挥不了作用，还平添遍历一边奇数空间的开销，因为在遍历一遍之前，进程并不知道奇数空间已经用尽，虽然可以记录一个 anchor 来记录 “奇数空间已经用尽” 的事实，但本质上这个问题的原因是 16bit 的端口空间在 TCP 看来太少，但对实现时需要遍历时又太大，而之所以搞这些精致的花活儿，还是 16bit 空间太小。

所以不要觉得内核妹忒内儿提的任何派驰都是优化，相当大一部分是在恶心人，它们只是针对了场景，而在这些假设之外，性能反而劣化。这类似抛硬币猜正反，但凡做任何假设，总的成功率都会下降。

再看 SNAT 场景。先看一个没有花活儿的实现(约等于但不等于 Linux 内核 nf_conntrack SNAT 实现)：

uint16 tcp_get_port(uint32 sip, unit32 dip, uint16 sport, uint16 dport)
{
        struct entry *ent;
        int anchor = random16();
        int begin = PORT_BASE;

retry:
        // 4 元组 hash 而不是 3 元组，防止访问局部性导致 bucket 链表过长
        hash = jhash(sip, sport, dip, dport, 2);
        uniq = true;
        list_for_each_from(ent, bucket[hash]) {
                if (ent->sip == sip && ent->dip == dip && ent->sport == sport && ent->dport == dport) {
                        uniq = false;;
                        break;
                }
        }
        if (uniq) {
                ent = malloc(sizeof(*ent));
                ent->sip = sip;
                ent->dip = dip;
                ent->sport = sport;
                ent->dport = dport;
                hash_insert(ent, hash);
                return sport;
        }
        if (sport - begin < K && retried ++ > MAX_RETRIES) {
                // 列维长跳
                if (retried % (MAX_RETRIES/N) == 0)
                        anchor = random16(;
                sport = anchor;
                anchor ++;
                goto retry;
        }
        return -1;
}

要整花活儿炫点技巧，一步一步来。

首先，递增 anchor 和列维长跳已够合理，人类文明的轨迹就是列维长跳的结果，典型的大杂居，小聚居对资源利用的合理分布非常有效，正因为此，中原文明才能在几千年时间同化掉整个东亚大陆。然而这太直观了，以至于不够 “厉害”。

如果不使用列维长跳来搜索可用端口，可以用时间戳(精确到 us)，进程 pid 一起组成候选 sport。使用时间戳可以自然跨过同时竞争同一端口的进程，从而减少互斥开销，但还有更好的。

如果系统中已经没有可用端口，在遍历一遍前，进程并不知晓这个事实，因此要引入 “快速失败” 机制。这并不难，只需要提供一些积累信息。于是有以下的算法：

uint16 get_port_new(uint32 sip, unit32 dip, uint16 sport, uint16 dport)
{
        struct entry3 *ent3;
        uint hash3;

        hash3 = jhash(sip, dip, dport, 2);
        list_for_each_from(ent3, bucket3[hash3]) {
                if (ent3->sip == sip && ent3->dip == dip && ent3->dport == dport) {
                        if (ent3->free = ent3->end) // 已经满了，无空闲端口可用
                                return -1;
                        sport = ent3->free + 1;
                        ent3->free ++;
                        break;
                }
        }
        if (ent3 == NULL) {
                ent3 = malloc(sizeof(*ent3));
                free_port_init(ent3);
                sport = ent3->free + 1;
                hash3_insert(ent3, hash3);
        }
        return sport;
}

魔鬼在细节，太复杂必然是错的，太简单也可能不对。

随机释放端口意味着端口分配并非连续可得，因此还是要一个一个尝试，于是代码又恢复成了冗杂，3 元组 hash 只是提高了运算复杂度降低的概率，所有的重试依然还要进行：

uint16 tcp_get_port_new(uint32 sip, unit32 dip, uint16 sport, uint16 dport)
{
        struct entry3 *ent3;
        struct entry4 *ent4;
        int anchor = -1;
        int begin = PORT_BASE;
        int find = false;

        hash3 = jhash(sip, dip, dport, 2);
        list_for_each_from(ent3, bucket3[hash3]) {
                if (ent3->sip == sip && ent3->dip == dip && ent3->dport == dport) {
                        find = true;
                        uniq = false;;
                        if (ent3->full)
                                return -1;
                        anchor = ent3->free;
                        break;
                }
        }
        // 有了以上 3 元组 hash 辅助，下面的过程将省去遍历过程，几乎直接找到端口
        if (anchor == -1)
                anchor = 1;
retry:
        // 4 元组 hash 而不是 3 元组，防止访问局部性导致 bucket 链表过长
        hash4 = jhash(sip, sport, dip, dport, 2);
        uniq = true;
        list_for_each_from(ent4, bucket4[hash4]) {
                if (ent4->sip == sip && ent4->dip == dip && ent4->sport == sport && ent4->dport == dport) {
                        uniq = false;;
                        break;
                }
        }
        if (uniq) {
                ent4 = malloc(sizeof(*ent4));
                ent4->sip = sip;
                ent4->dip = dip;
                ent4->sport = sport;
                ent4->dport = dport;
                if (find == false) {
                        ent3 = malloc(sizeof(*ent3));
                        ent3->full = false;
                        ent3->end = 65535;
                        hash3_insert(ent3, hash3);
                }
                ent3->free = sport + 1;
                hash4_insert(ent4, hash4);
                return sport;
        }
        if (sport - begin < K && retried ++ > MAX_RETRIES) {
                // 列维长跳
                if (retried % (MAX_RETRIES/N) == 0)
                        anchor = random16(;
                sport = anchor;
                anchor ++;
                goto retry;
        }
        return -1;
}

合适的数据结构上场比算法更能解决大部分问题，就像设置一个 anchor 的威力一样，如果用 stack 或 queue 组织空闲端口，事情就会高尚。

使用 stack 会倾向于使用最近被释放的 port，而使用 queue 则相反，使用最早被释放的 port，以 stack 为例，意思如下：

void free_port(uint16 port, uint32 sip, unit32 dip, uint16 dport)
{
    hash3 = jhash(sip, dip, dport, 2);
    list_for_each_from(ent3, bucket3[hash3]) {
        if (ent3->sip == sip && ent3->dip == dip && ent3->dport == dport) {
            sport = ent3->free.push(ent3.port);
            break;
        }
    }
}
uint16 get_port(uint32 sip, unit32 dip, uint16 sport, uint16 dport)
{
    struct entry3 *ent3;
    struct port_entry *port;
    uint hash3;

    hash3 = jhash(sip, dip, dport, 2);
    list_for_each_from(ent3, bucket3[hash3]) {
        if (ent3->sip == sip && ent3->dip == dip && ent3->dport == dport) {
            if (ent3->stack.num == 0) // 已经满了，无空闲端口可用
                return -1;
            port = ent3->free.pop();
            break;
        }
    }
    if (port == NULL) {
        ent3 = malloc(sizeof(*ent3));
        free_port_init(ent3);
        hash3_insert(ent3, hash3);
        port = ent3->free.pop();
    } 
    return port.num;
}

只要 hash 足够好，源端口几乎可以在 O(1) 时间获得。值得注意的是，stack，queue 与 bitmap 相比具有优势，因为 bitmap 强依赖于实现，可以利用硬件高速实现，也可以使用 for-for 嵌套实现。

这些基本技巧已经谈不上是什么花活，malloc 库函数以及文件系统空闲块管理都会用类似机制管理空闲资源，没什么大不了，但 Linux 内核为什么没有使用这种技巧而遍历端口一个一个尝试呢？

首先是简单性，可维护性，这个不必多说，其次是资源占用，无论 bitmap，stack 还是 queue，动用的数据结构都对内存空间有不同需求，而通用操作系统不能假设空间需求的满足，但却可以用时间换空间。换句话说，内存不足就跑不起来了，而时间复杂度就算再高顶多就是慢，通用系统的可用性指标即程序可以完成。

想要程序运行快，除了算法分析外，局部性一定要利用：