>>> t = ['a', 'a', 'b']
>>> hist = histogram(t)
>>> hist
{'a': 2, 'b': 1}
Your function should return 'a' with probability 2/3 and 'b' with probability 1/3.
import random

def choose_from_hist(hist_dict):
    frequency_list = list()
    # Iterate over the key/value pairs
    for key, val in hist_dict.items():
        # Append the key to the list val times
        for i in range(val):
            frequency_list.append(key)
    random_res = random.choice(frequency_list)
    return random_res

for v in range(3):
    histogram_dict = {'a': 2, 'b': 1}
    res = choose_from_hist(histogram_dict)
    print(res)
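A possible alternative sketch: random.choices (available since Python 3.6) accepts per-key weights directly, so the intermediate frequency list can be skipped entirely. The function name choose_from_hist_weighted is mine, not from the book:

import random

def choose_from_hist_weighted(hist_dict):
    keys = list(hist_dict.keys())
    weights = list(hist_dict.values())
    # random.choices returns a list of k samples; take the single element
    return random.choices(keys, weights=weights, k=1)[0]

print(choose_from_hist_weighted({'a': 2, 'b': 1}))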
import string

def process_file(filename):
    hist = dict()
    fp = open(filename)
    for line in fp:
        process_line(line, hist)
    return hist

def process_line(line, hist):
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

hist = process_file('emma.txt')
print('Total number of words:', total_words(hist))
print('Number of different words', different_words(hist))
And the results:
Total number of words: 164120
Number of different words 8904
import string

def process_file(filename):
    hist = dict()
    fp = open(filename, encoding='utf8')
    for line in fp:
        process_line(line, hist)
    return hist

def process_line(line, hist):
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

hist = process_file('emma.txt')

def total_words(hist):
    return sum(hist.values())

def different_words(hist):
    return len(hist)

# Total number of words
print('Total number of words:', total_words(hist))
# Number of different words
print('Number of different words', different_words(hist))
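As a side note, the hist.get(word, 0) + 1 idiom can be replaced by collections.defaultdict(int), which supplies a zero default for missing keys automatically. A minimal sketch under the same assumptions (emma.txt in the working directory); process_file_dd is a hypothetical name:

from collections import defaultdict
import string

def process_file_dd(filename):
    hist = defaultdict(int)  # missing keys start at 0
    with open(filename, encoding='utf8') as fp:
        for line in fp:
            for word in line.replace('-', ' ').split():
                word = word.strip(string.punctuation + string.whitespace).lower()
                hist[word] += 1
    return hist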
# Full code
import string

# Characters to strip; the text contains the curly quotation marks “ and ”
remove_symbol = string.punctuation + string.whitespace + '“”'

# Read the file
def read_file(file_name):
    """
    :param file_name: path to the file
    :return: histogram
    """
    hist = dict()
    file_data = open(file_name, encoding='utf8')
    # Skip the header
    skip_header(file_data)
    # file_data now starts after the header.
    for line in file_data:
        # Stop at the footer
        if line.startswith('*** END OF THIS PROJECT'):
            break
        # Many words are joined with '-', so replace '-' with a space
        words_line = line.replace('-', ' ')
        # Split the string on whitespace
        words_list = words_line.split()
        for word in words_list:
            # Strip symbols from both ends of the word; leave the middle alone
            word = word.strip(remove_symbol)
            # Convert to lowercase
            lower_word = word.lower()
            # Count frequencies
            hist[lower_word] = hist.get(lower_word, 0) + 1
    return hist

def skip_header(file_obj):
    """
    :param file_obj: file object
    :return: None
    Read the file object until the line starting with
    *** START OF THIS PROJECT GUTENBERG EBOOK EMMA *** is reached.
    """
    for line in file_obj:
        if line.startswith('*** START OF THIS PROJECT'):
            break

def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))
    t.sort(reverse=True)
    return t

hist = read_file('emma.txt')
t = most_common(hist)
print('The most common words are:')
for freq, word in t[:10]:
    print(word, freq, sep='\t')
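For comparison, the standard library already provides this feature: collections.Counter has a most_common method. A minimal sketch, assuming the hist dictionary returned by read_file above:

from collections import Counter

counter = Counter(hist)  # wrap the existing word -> frequency dict
for word, freq in counter.most_common(10):
    print(word, freq, sep='\t')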
# Simplified version. Note: the header and footer still need to be skipped.
import string

# Characters to strip; the text contains the curly quotation marks “ and ”
remove_symbol = string.punctuation + string.whitespace + '“”'

# Read the file
def read_file(file_name):
    hist = dict()
    file_data = open(file_name, encoding='utf8')
    # Skip the header
    skip_header(file_data)
    for line in file_data:
        # Stop at the footer
        if line.startswith('*** END OF THIS PROJECT'):
            break
        # Many words are joined with '-', so replace '-' with a space
        words_line = line.replace('-', ' ')
        # Split the string on whitespace
        words_list = words_line.split()
        for word in words_list:
            # Strip symbols from both ends of the word; leave the middle alone
            word = word.strip(remove_symbol)
            # Convert to lowercase
            lower_word = word.lower()
            # Count frequencies
            hist[lower_word] = hist.get(lower_word, 0) + 1
    return hist

def skip_header(file_obj):
    for line in file_obj:
        if line.startswith('*** START OF THIS PROJECT'):
            break

hist = read_file('emma.txt')
# Simplified: sort the (word, frequency) pairs by frequency
hist_list = list(hist.items())
hist_list.sort(key=lambda x: x[1], reverse=True)
print('The most common words are:')
for word, freq in hist_list[:10]:
    print(word, freq, sep='\t')
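When only the top few entries are needed, a full sort can be avoided: heapq.nlargest selects the k largest items directly. A sketch under the same assumption (hist from read_file above):

import heapq

top_ten = heapq.nlargest(10, hist.items(), key=lambda pair: pair[1])
for word, freq in top_ten:
    print(word, freq, sep='\t')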
# Full code
import string

# Characters to strip; the text contains the curly quotation marks “ and ”
remove_symbol = string.punctuation + string.whitespace + '“”'

# Read the file
def read_file(file_name):
    hist = dict()
    file_data = open(file_name, encoding='utf8')
    # Skip the header
    skip_header(file_data)
    for line in file_data:
        # Stop at the footer
        if line.startswith('*** END OF THIS PROJECT'):
            break
        # Many words are joined with '-', so replace '-' with a space
        words_line = line.replace('-', ' ')
        # Split the string on whitespace
        words_list = words_line.split()
        for word in words_list:
            # Strip symbols from both ends of the word; leave the middle alone
            word = word.strip(remove_symbol)
            # Convert to lowercase
            lower_word = word.lower()
            # Count frequencies
            hist[lower_word] = hist.get(lower_word, 0) + 1
    return hist

def skip_header(file_obj):
    for line in file_obj:
        if line.startswith('*** START OF THIS PROJECT'):
            break

def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))
    t.sort(reverse=True)
    return t

def print_most_common(hist, num=10):
    t = most_common(hist)
    print('The most common words are:')
    for freq, word in t[:num]:
        print(word, freq, sep='\t')

hist = read_file('emma.txt')
print_most_common(hist)
# print_most_common(hist, 20)
words = process_file('words')
diff = subtract(hist, words)
print("Words in the book that aren't in the word list:")
for word in diff:
    print(word, end=' ')
# My result:
Words in the book that aren't in the word list: emma austen i woodhouse a sister's remembrance taylor mr woodhouse's taylor's ...
# Full code. One difference: when calling the function that builds the
# histogram, there is now a flag for whether to skip the header.
import string

# Characters to strip; the text contains the curly quotation marks “ and ”
remove_symbol = string.punctuation + string.whitespace + '“”'

# Read the file
def process_file(file_name, is_skip=True):
    hist = dict()
    file_data = open(file_name, encoding='utf8')
    if is_skip:
        # Skip the header; file_data now starts after it.
        skip_header(file_data)
    for line in file_data:
        # Stop at the footer
        if line.startswith('*** END OF THIS PROJECT'):
            break
        # Many words are joined with '-', so replace '-' with a space
        words_line = line.replace('-', ' ')
        # Split the string on whitespace
        words_list = words_line.split()
        for word in words_list:
            # Strip symbols from both ends of the word; leave the middle alone
            word = word.strip(remove_symbol)
            # Convert to lowercase
            lower_word = word.lower()
            # Count frequencies
            hist[lower_word] = hist.get(lower_word, 0) + 1
    return hist

def skip_header(file_obj):
    for line in file_obj:
        if line.startswith('*** START OF THIS PROJECT'):
            break

# Dictionary subtraction
def subtract(d1, d2):
    res = dict()
    for key in d1:
        if key not in d2:
            res[key] = None
    return res

# Skip the header and footer
hist = process_file('emma.txt')
# Do not skip the header
words = process_file('words.txt', False)
diff = subtract(hist, words)
print("Words in the book that aren't in the word list:")
for word in diff:
    print(word, end=' ')
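As a side note, dictionary key views behave like sets, so the same subtraction can be written without a helper function at all. A one-line sketch using the hist and words dictionaries above:

# "keys of hist that are not in words" as a single set expression
diff_keys = hist.keys() - words.keys()
print(len(diff_keys), "words in the book are not in the word list")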
# Full code. One difference: when calling the function that builds the
# histogram, there is now a flag for whether to skip the header.
import string

# Characters to strip; the text contains the curly quotation marks “ and ”
remove_symbol = string.punctuation + string.whitespace + '“”'

# Read the file
def process_file(file_name, is_skip=True):
    hist = dict()
    file_data = open(file_name, encoding='utf8')
    if is_skip:
        # Skip the header; file_data now starts after it.
        skip_header(file_data)
    for line in file_data:
        # Stop at the footer
        if line.startswith('*** END OF THIS PROJECT'):
            break
        # Many words are joined with '-', so replace '-' with a space
        words_line = line.replace('-', ' ')
        words_list = words_line.split()
        for word in words_list:
            # Strip symbols from both ends of the word; leave the middle alone
            word = word.strip(remove_symbol)
            lower_word = word.lower()
            # Count frequencies
            hist[lower_word] = hist.get(lower_word, 0) + 1
    return hist

def skip_header(file_obj):
    for line in file_obj:
        if line.startswith('*** START OF THIS PROJECT'):
            break

# Set subtraction; sets are unordered, so the output order varies between runs
def subtract(d1, d2):
    # small set - large set
    return set(d1) - set(d2)

# Skip the header and footer
hist = process_file('emma.txt')
# Do not skip the header
words = process_file('words.txt', False)
diff = subtract(hist, words)
print("Words in the book that aren't in the word list:")
for word in diff:
    print(word, end=' ')
Markov analysis:

1. Write a program that reads text from a file and performs Markov analysis. The result should be a dictionary that maps prefixes to a collection of possible suffixes. The collection can be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test the program with prefix length 2, but it should be written so that it is easy to switch to other prefix lengths.

2. Add a function to the previous program that generates random text based on the Markov analysis. Here is an example from Emma with prefix length 2:

He was very clever, be it sweetness or be angry, ashamed or only amused, at such a stroke. She had never thought of Hannah till you were never meant for me?" "I cannot make speeches, Emma:" he soon cut it all himself.

For this example, I left the punctuation attached to each word. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite. What happens when you increase the prefix length? Does the random text make more sense?

3. Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases of the sources in interesting ways.

Credit: this case study is based on an example from Kernighan and Pike, The Practice of Programming (Addison-Wesley, 1999).

You should attempt this exercise before you continue reading; then you can download my solution from:
https://raw.githubusercontent.com/AllenDowney/ThinkPython2/master/code/markov.py
You will also need:
https://raw.githubusercontent.com/AllenDowney/ThinkPython2/master/code/emma.txt
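Before looking at the downloadable solution, here is a minimal sketch of the idea, assuming prefix length 2 and the emma.txt file used above. The names markov_map and random_text are mine, and this is not Downey's markov.py:

import random

def markov_map(filename, prefix_len=2):
    """Map each prefix (a tuple of words) to a list of possible suffixes."""
    suffix_map = {}
    prefix = ()
    with open(filename, encoding='utf8') as fp:
        for line in fp:
            for word in line.split():
                if len(prefix) == prefix_len:
                    # Record word as a possible suffix, then shift the window
                    suffix_map.setdefault(prefix, []).append(word)
                    prefix = prefix[1:] + (word,)
                else:
                    # Still filling the very first prefix
                    prefix = prefix + (word,)
    return suffix_map

def random_text(suffix_map, n=50):
    """Generate n words by repeatedly choosing a random suffix."""
    prefix = random.choice(list(suffix_map))
    words = list(prefix)
    for _ in range(n):
        suffixes = suffix_map.get(prefix)
        if suffixes is None:
            # Dead end: restart from a random prefix
            prefix = random.choice(list(suffix_map))
            continue
        word = random.choice(suffixes)
        words.append(word)
        prefix = prefix[1:] + (word,)
    return ' '.join(words)

suffix_map = markov_map('emma.txt')
print(random_text(suffix_map))

Because prefix_len is a parameter, switching to longer prefixes only requires passing a different value, which is the flexibility the exercise asks for.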