文章目录
- 哈希表及其碰撞解决策略
- 1. 引言
- 2. 哈希表简介
- 3. 哈希函数
- 4. 碰撞解决策略
- 4.1 分离链接法(拉链法)
- 4.2 开放寻址法
- 4.2.1 线性探测
- 4.2.2 二次探测
- 4.2.3 双重哈希
- 5. 总结
哈希表及其碰撞解决策略
1. 引言
哈希表是一种高效的数据结构,用于将键映射到值(也称为表或映射抽象数据类型/ADT)。哈希表利用哈希函数将大的或非整数键映射到一个小的整数索引范围(通常是 [0..hash_table_size-1]
)。由于哈希函数可能会将不同的键映射到相同的索引,这就引发了碰撞问题。本文将介绍几种常见的碰撞解决策略,包括开放寻址法(线性探测、二次探测和双重哈希)和闭散列法(分离链接)。
2. 哈希表简介
哈希表通过哈希函数将键映射到数组中的索引,从而实现快速的数据存储和检索。这个过程包括两个主要步骤:
- 哈希函数:计算键的哈希值。
- 索引计算:将哈希值映射到数组的索引范围内。
公式表示为:
index = hash_function(key) % hash_table_size
哈希表在计算机软件中广泛应用,如关联数组、数据库索引、缓存和集合等。它们的主要优点是能够在平均情况下以常数时间复杂度(O(1))进行插入、删除和查找操作。
3. 哈希函数
哈希函数需要满足以下条件:
- 快速计算:时间复杂度为O(1)。
- 最小冲突:尽可能减少碰撞。
- 均匀分布:键均匀地分散到哈希表中。
一个常用的哈希函数是取模运算:
h(v) = v % M
其中,M
是哈希表的大小,通常设为一个大素数。这样可以减少冲突并确保均匀分布。哈希函数的选择非常重要,因为一个好的哈希函数能够有效减少冲突,提高哈希表的性能。
如果负载因子α = N/M
(N是键的数量,M是哈希表大小)保持在一个小常数范围内,所有操作的时间复杂度均为O(1)。负载因子越小,碰撞的概率就越低,链表的平均长度也就越短,从而提高了操作效率。
4. 碰撞解决策略
当两个不同的键通过哈希函数映射到相同的索引时,就会发生碰撞。为了解决这个问题,常见的碰撞解决策略包括分离链接法和开放寻址法。
4.1 分离链接法(拉链法)
分离链接法(Separate Chaining,简称SC)是最简单的碰撞解决方法之一。每个哈希表位置都包含一个链表,所有碰撞到同一位置的键值对都存储在这个链表中。这种方法将冲突键值对存储在链表中,避免了冲突带来的问题。
- 搜索(v):遍历链表,检查是否存在v。
- 插入(v):将v插入链表尾部。
- 删除(v):遍历链表,删除v。
class Node:
def __init__(self, value: int):
self.value = value
self.next: Node = None
class SeparateChaining:
def __init__(self, size: int, same: bool=True):
self.size = size
self.same = same
self.nums = 0
self.table: List[Node] = [Node(None) for _ in range(self.size)] # 初始化哨兵节点
def hash_func(self, value: int) -> int:
return value % self.size
def add(self, value: int) -> None:
node = self._search(value, self.same)
new_node = Node(value)
node.next, new_node.next = new_node, node.next
def remove(self, value: int) -> None:
node = self._search(value, False)
if node.next and node.next.value == value:
node.next = node.next.next
def search(self, value: int) -> bool:
if self._search(value, False).next:
return True
return False
def _search(self, value: int, same: bool=True) -> Node:
index = self.hash_func(value)
node = self.table[index]
while node.next:
next_node = node.next
if not same and next_node.value == value:
break
node = next_node
return node
def __repr__(self) -> str:
out = []
for node in self.table:
s = []
while node:
s.append(node.value)
node = node.next
out.append(f"{s !r}")
return f"{out !r}"
if __name__ == "__main__":
sc = SeparateChaining(25, True)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
sc.add(i)
print(sc)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
sc.remove(i)
print(sc)
# print输出
['[None]', '[None, 76]', '[None, 2]', '[None, 53]', '[None]', '[None]', '[None]', '[None, 7, 57]', '[None, 33]', '[None]', '[None]', '[None]', '[None]', '[None, 88, 38]', '[None, 89]', '[None, 40]', '[None, 41, 16]', '[None, 42, 42, 42]', '[None]', '[None, 19]', '[None, 45, 95]', '[None, 71]', '[None, 72]', '[None]', '[None]']
['[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]', '[None]']
4.2 开放寻址法
开放寻址法(Open Addressing)通过探测序列来解决碰撞。常见的探测方法有线性探测、二次探测和双重哈希。
它们的差别是使用的hash
函数不相同,下面是它们之间的公共的基类,其中包含了搜索、插入、和删除的公共方法:
from abc import abstractmethod
from typing import List
class Probing:
def __init__(self, size: int, max_collision: int = 10, same=True):
self.size = size # 最大的存储容量
self.table: List[int] = [None] * size # hash表
self.max_collision = max_collision # 最大的hash碰撞次数
self.nums = 0 # 有效元素的个数
self.same = same # 在插入的时候是否容忍具有相同的元素存在
@abstractmethod
def hash_func(self, value: int, step: int = 0) -> int:...
def _search(self, value: int, same=True) -> int:
# 搜寻该元素value可用的槽位,而不是搜寻该元素在不在
step: int = 0
first_none = -1
while step <= self.max_collision:
_hash = self.hash_func(value, step)
if self.table[_hash] is None:
# 可能这个位置的值已经被删除了,并不一定后续没有,所以需要继续查找,而不是返回None
# 记录第一次为空的位置,如果要插入一个元素,就可以插入到这里
if first_none == -1:
first_none = _hash
else:
# 不许相同,遇到相同值
if not same and self.table[_hash] == value:
return _hash
#否则继续寻找,直到找到None或者超出hash碰撞最大值
step += 1
# first_none为-1, 说明超出hash碰撞最大值
return first_none
def search(self, value: int) -> int:
# 搜寻元素,所以遇到该元素就返回
return self._search(value, same=False)
@property
def alpha(self) -> float:
return self.nums / self.size
def add(self, value: int) -> None:
# 搜寻槽位
index = self._search(value, self.same)
if index == -1:
raise ValueError("超出hash碰撞的最大次数")
if self.table[index] == value:
return
self.table[index] = value
self.nums += 1
def remove(self, value: int) -> None:
# 搜寻该元素
index = self._search(value, False)
if index == -1 or self.table[index] is None:
return
self.table[index] = None
self.nums -= 1
def __repr__(self) -> str:
return f"Probing({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
4.2.1 线性探测
线性探测(Linear Probing)采用固定步长进行探测:
i = (base + step * 1) % M
其中base
是初始哈希值,step
为当前探测步骤。探测序列为:
h(v)
:基地址(h(v) + 1 * 1) % M
(h(v) + 2 * 1) % M
- …
下面是线性探测的具体实现代码:
class LinearProbing(Probing):
def __init__(self, size: int, max_collision: int = 10):
super().__init__(size, max_collision)
def hash_func(self, value: int, step: int = 0) -> int:
return (value + step * 1) % self.size
def __repr__(self) -> str:
return f"LinearProbing({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
if __name__ == "__main__":
lp = LinearProbing(25, 20)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
lp.add(i)
print(lp)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
lp.remove(i)
print(lp)
# print 输出
LinearProbing([16, 89, 2, 53, 76, None, None, 7, 57, 33, None, None, None, 88, 38, 40, 41, 42, 42, 19, 45, 71, 72, 42, 95]), nums:20, alpha:0.8
LinearProbing([None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]), nums:0, alpha:0.0
线性探测的优点是实现简单,但容易出现主聚类(Primary Clustering)问题,即连续的空闲槽位被逐渐填满,导致探测序列变长,性能下降。
4.2.2 二次探测
二次探测(Quadratic Probing)采用平方步长进行探测,step
为碰撞次数:
i = (base + step * step) % M
探测序列为:
h(v)
:基地址(h(v) + 1 * 1) % M
(h(v) + 2 * 2) % M
(h(v) + 3 * 3) % M
- …
class QuadraticProbing(Probing):
def __init__(self, size: int, max_collision: int = 10):
super().__init__(size, max_collision)
def hash_func(self, value: int, step: int = 0) -> int:
return (value + step * step) % self.size
def __repr__(self) -> str:
return f"QuadraticProbing({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
if __name__ == "__main__":
qp = QuadraticProbing(25, 20)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
qp.add(i)
print(qp)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
qp.remove(i)
print(qp)
#print输出
QuadraticProbing([16, 42, 2, 53, None, 76, None, 7, 57, 33, None, None, None, 88, 38, 40, 41, 42, 42, 19, 45, 71, 72, 89, 95]), nums:20, alpha:0.8
QuadraticProbing([None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]), nums:0, alpha:0.0
也可以是用左右平方步长探测:
i = (base + (-1)**step * ((step+1)//2)**2) % M
探测序列为:
h(v)
:基地址(h(v) + 1 * 1) % M
(h(v) - 1 * 1) % M
(h(v) + 2 * 2) % M
(h(v) - 2 * 2) % M
- …
class QuadraticProbing2(Probing):
def __init__(self, size: int, max_collision: int = 10):
super().__init__(size, max_collision)
def hash_func(self, value: int, step: int = 0) -> int:
return (value + ((-1) ** step) * (((step+1) // 2) ** 2)) % self.size
def __repr__(self) -> str:
return f"QuadraticProbing2({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
if __name__ == "__main__":
qp2 = QuadraticProbing2(25, 20)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
qp2.add(i)
print(qp2)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
qp2.remove(i)
print(qp2)
# print输出
QuadraticProbing2([16, 76, 2, 53, None, None, 57, 7, 42, 33, None, None, 38, 88, 89, 40, 41, 42, 42, 19, 45, 71, 72, None, 95]), nums:20, alpha:0.8
QuadraticProbing2([None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]), nums:0, alpha:0.0
如果负载因子α < 0.5
且M
是一个质数,前M/2
个二次探测索引都是唯一的。二次探测减少了主聚类问题,但仍可能出现次聚类(Secondary Clustering)问题,即哈希值相同但初始位置不同的键会形成新的聚类。
4.2.3 双重哈希
双重哈希(Double Hashing)采用两个哈希函数进行探测:
i = (base + step * secondary) % M
其中secondary = smaller_prime - key % smaller_prime
,smaller_prime
是小于M
的质数。探测序列为:
h(v)
:基地址(h(v) + 1 * h2(v)) % M
(h(v) + 2 * h2(v)) % M
(h(v) + 3 * h2(v)) % M
- …
双重哈希通过引入第二个哈希函数来决定探测步长,从而有效减少了主聚类和次聚类问题,使得探测序列更加分散。常见的第二哈希函数为:
h2(v) = smaller_prime - (v % smaller_prime)
其中smaller_prime
通常设为小于哈希表大小的质数,这样可以确保h2(v)
的值在[1..smaller_prime]
范围内。
class DoubleProbing(Probing):
def __init__(self, size: int, max_collision: int = 10):
super().__init__(size, max_collision)
def hash_func(self, value: int, step: int = 0) -> int:
h1 = value % self.size
smaller_prime = 23
h2 = smaller_prime - (value % smaller_prime)
return (h1 + step * h2) % self.size
def __repr__(self) -> str:
return f"DoubleProbing({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
if __name__ == "__main__":
dp = DoubleProbing(25, 20)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
dp.add(i)
print(dp)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
dp.remove(i)
print(dp)
#print输出
DoubleProbing([42, 76, 2, 53, 42, None, 57, 7, 33, None, 95, None, 38, 88, 89, 40, 41, 42, None, 19, 45, 71, 72, 16, None]), nums:20, alpha:0.8
DoubleProbing([None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]), nums:0, alpha:0.0
5. 总结
哈希表是一种高效的数据结构,通过哈希函数将键映射到值。碰撞是哈希表中的一个重要问题,常用的解决策略包括分离链接法和开放寻址法。选择合适的哈希函数和碰撞解决策略可以大大提高哈希表的性能和效率。本文详细介绍了几种常见的碰撞解决方法及其优缺点,希望能帮助读者更好地理解和应用哈希表。
在实际应用中,根据具体需求和数据特点选择合适的哈希表实现方式,可以有效提升程序的性能和稳定性。无论是分离链接法还是开放寻址法,都有其独特的优点和适用场景,合理运用这些技术可以使哈希表在各种情况下发挥最佳性能。
完整代码如下:
from abc import abstractmethod
from typing import List
class Probing:
def __init__(self, size: int, max_collision: int = 10, same: bool=True):
self.size = size
self.table: List[int] = [None] * size
self.max_collision = max_collision
self.nums = 0
self.same = same
@abstractmethod
def hash_func(self, value: int, step: int = 0) -> int:...
def _search(self, value: int, same: bool=True) -> int:
step: int = 0
first_none = -1
while step <= self.max_collision:
_hash = self.hash_func(value, step)
if self.table[_hash] is None:
# 可能这个位置的值已经被删除了,并不一定后续没有需要查找的值
# 记录第一次为空的位置
if first_none == -1:
first_none = _hash
else:
# 不许相同,遇到相同值
if not same and self.table[_hash] == value:
return _hash
#否则继续寻找,直到找到None或者超出hash碰撞最大值
step += 1
# -1超出hash碰撞最大值
return first_none
def search(self, value: int) -> int:
return self._search(value, same=False)
@property
def alpha(self) -> float:
return self.nums / self.size
def add(self, value: int) -> None:
index = self._search(value, self.same)
if index == -1:
raise ValueError("超出hash碰撞的最大次数")
if self.table[index] == value:
return
self.table[index] = value
self.nums += 1
def remove(self, value: int) -> None:
index = self._search(value, False)
if index == -1 or self.table[index] is None:
return
self.table[index] = None
self.nums -= 1
def __repr__(self) -> str:
return f"Probing({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
class LinearProbing(Probing):
def __init__(self, size: int, max_collision: int = 10):
super().__init__(size, max_collision)
def hash_func(self, value: int, step: int = 0) -> int:
return (value + step * 1) % self.size
def __repr__(self) -> str:
return f"LinearProbing({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
class QuadraticProbing(Probing):
def __init__(self, size: int, max_collision: int = 10):
super().__init__(size, max_collision)
def hash_func(self, value: int, step: int = 0) -> int:
return (value + step * step) % self.size
def __repr__(self) -> str:
return f"QuadraticProbing({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
class QuadraticProbing2(Probing):
def __init__(self, size: int, max_collision: int = 10):
super().__init__(size, max_collision)
def hash_func(self, value: int, step: int = 0) -> int:
return (value + ((-1) ** step) * (((step+1) // 2) ** 2)) % self.size
def __repr__(self) -> str:
return f"QuadraticProbing2({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
class DoubleProbing(Probing):
def __init__(self, size: int, max_collision: int = 10):
super().__init__(size, max_collision)
def hash_func(self, value: int, step: int = 0) -> int:
h1 = value % self.size
smaller_prime = 23
h2 = smaller_prime - (value % smaller_prime)
return (h1 + step * h2) % self.size
def __repr__(self) -> str:
return f"DoubleProbing({self.table !r}), nums:{self.nums}, alpha:{self.alpha}"
class Node:
def __init__(self, value: int):
self.value = value
self.next: Node = None
class SeparateChaining:
def __init__(self, size: int, same: bool=True):
self.size = size
self.same = same
self.nums = 0
self.table: List[Node] = [Node(None) for _ in range(self.size)] # 初始化哨兵节点
def hash_func(self, value: int) -> int:
return value % self.size
def add(self, value: int) -> None:
node = self._search(value, self.same)
new_node = Node(value)
node.next, new_node.next = new_node, node.next
def remove(self, value: int) -> None:
node = self._search(value, False)
if node.next and node.next.value == value:
node.next = node.next.next
def search(self, value: int) -> bool:
if self._search(value, False).next:
return True
return False
def _search(self, value: int, same: bool=True) -> Node:
index = self.hash_func(value)
node = self.table[index]
while node.next:
next_node = node.next
if not same and next_node.value == value:
break
node = next_node
return node
def __repr__(self) -> str:
out = []
for node in self.table:
s = []
while node:
s.append(node.value)
node = node.next
out.append(f"{s !r}")
return f"{out !r}"
if __name__ == "__main__":
lp = LinearProbing(25, 20)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
lp.add(i)
print(lp)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
lp.remove(i)
print(lp)
qp = QuadraticProbing(25, 20)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
qp.add(i)
print(qp)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
qp.remove(i)
print(qp)
qp2 = QuadraticProbing2(25, 20)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
qp2.add(i)
print(qp2)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
qp2.remove(i)
print(qp2)
dp = DoubleProbing(25, 20)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
dp.add(i)
print(dp)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
dp.remove(i)
print(dp)
sc = SeparateChaining(25, True)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
sc.add(i)
print(sc)
for i in [72, 7, 53, 71, 40, 2, 45, 41, 42, 19, 42, 88, 42, 95, 38, 57, 16, 33, 89, 76]:
sc.remove(i)
print(sc)