本文主要参考《A Neural Probabilistic Language Model》这是一篇很重要的语言模型论文,发表于2003年。主要贡献如下:
- 提出了一种基于神经网络的语言模型,是较早将神经网络应用于语言模型领域的工作之一,具有里程碑意义。
- 采用神经网络模型预测下一个单词的概率分布,已经成为神经网络语言模型训练的标准方法之一。
- 在论文中,作者训练了一个前馈神经网络,同时学习词的特征表示和词序列的概率函数,并取得了比 trigram 语言模型更好的单词预测效果,这证明了神经网络语言模型的有效性。
- 论文提出的思想与模型为后续的许多神经网络语言模型研究奠定了基础。
模型结构
模型的目标是学到一个概率函数,根据上下文预测下一个词的概率分布:
import os
import time
import pandas as pd
from dataclasses import dataclass
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.tensorboard import SummaryWriter
加载数据
数据集来10+中文外卖评价数据集:
data = pd.read_csv('./dataset/waimai_10k.csv')
data.dropna(subset='review',inplace=True)
data['review_length'] = data.review.apply(lambda x:len(x))
data.sample(5)
label | review | review_length | |
---|---|---|---|
4545 | 0 | 从说马上送餐,到收到餐,时间特别久,给的送餐电话打不通! | 28 |
9855 | 0 | 韩餐做得像川菜,牛肉汤油得不能喝,量也比实体少很多,送餐时间久得太久了,1个半小时,唉。 | 44 |
5664 | 0 | 太糟了。等了两个小时,牛肉我吃的快吐了,再也不可能第二次 | 28 |
2323 | 1 | 很好吃,就是粥撒了点,等了一个多小时 | 18 |
8117 | 0 | 送餐员给我打电话比较粗鲁 | 12 |
统计信息:
data = data[data.review_length <= 50]
words = data.review.tolist()
chars = sorted(list(set(''.join(words))))
max_word_length = max(len(w) for w in words)
print(f"number of examples: {len(words)}")
print(f"max word length: {max_word_length}")
print(f"size of vocabulary: {len(chars)}")
number of examples: 10796
max word length: 50
size of vocabulary: 2272
划分训练/测试集
test_set_size = min(1000, int(len(words) * 0.1))
rp = torch.randperm(len(words)).tolist()
train_words = [words[i] for i in rp[:-test_set_size]]
test_words = [words[i] for i in rp[-test_set_size:]]
print(f"split up the dataset into {len(train_words)} training examples and {len(test_words)} test examples")
split up the dataset into 9796 training examples and 1000 test examples
构造字符数据集[tensor]
- < BLANK> : 0
- token seqs : [1, 2, 3, 4, 5, 6]
- block_size : 3,上下文长度
- x : [[0, 0, 0],[0, 0, 1],[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]
- y : [1, 2, 3, 4, 5, 6, 0]
`class CharDataset(Dataset):
def __init__(self, words, chars, max_word_length, block_size=1):
self.words = words
self.chars = chars
self.max_word_length = max_word_length
self.block_size = block_size
self.char2i = {ch:i+1 for i,ch in enumerate(chars)}
self.i2char = {i:s for s,i in self.char2i.items()}
def __len__(self):
return len(self.words)
def contains(self, word):
return word in self.words
def get_vocab_size(self):
return len(self.chars) + 1
def get_output_length(self):
return self.max_word_length + 1
def encode(self, word):
ix = torch.tensor([self.char2i[w] for w in word], dtype=torch.long)
return ix
def decode(self, ix):
word = ''.join(self.i2char[i] for i in ix)
return word
def __getitem__(self, idx):
word = self.words[idx]
ix = self.encode(word)
x = torch.zeros(self.max_word_length + self.block_size, dtype=torch.long)
y = torch.zeros(self.max_word_length, dtype=torch.long)
x[self.block_size:len(ix)+self.block_size] = ix
y[:len(ix)] = ix
y[len(ix)+1:] = -1
if self.block_size > 1:
xs = []
for i in range(x.shape[0]-self.block_size):
xs.append(x[i:i+self.block_size].unsqueeze(0))
return torch.cat(xs), y
else:
return x, y`
数据加载器[DataLoader]
class InfiniteDataLoader:
def __init__(self, dataset, **kwargs):
train_sampler = torch.utils.data.RandomSampler(dataset, replacement=True, num_samples=int(1e10))
self.train_loader = DataLoader(dataset, sampler=train_sampler, **kwargs)
self.data_iter = iter(self.train_loader)
def next(self):
try:
batch = next(self.data_iter)
except StopIteration:
self.data_iter = iter(self.train_loader)
batch = next(self.data_iter)
return batch
构建模型
context_tokens → \to → embedding → \to → concate feature vector → \to → hidden layer → \to → output layer
@dataclass
class ModelConfig:
block_size: int = None
vocab_size: int = None
n_embed : int = None
n_hidden: int = None
`class MLP(nn.Module):
"""
takes the previous block_size tokens, encodes them with a lookup table,
concatenates the vectors and predicts the next token with an MLP.
Reference:
Bengio et al. 2003 https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
"""
def __init__(self, config):
super().__init__()
self.block_size = config.block_size
self.vocab_size = config.vocab_size
self.wte = nn.Embedding(config.vocab_size + 1, config.n_embed)
self.mlp = nn.Sequential(
nn.Linear(self.block_size * config.n_embed, config.n_hidden),
nn.Tanh(),
nn.Linear(config.n_hidden, self.vocab_size)
)
def get_block_size(self):
return self.block_size
def forward(self, idx, targets=None):
embs = []
for k in range(self.block_size):
tok_emb = self.wte(idx[:,:,k])
embs.append(tok_emb)
x = torch.cat(embs, -1)
logits = self.mlp(x)
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
return logits, loss`
![](https://csdnimg.cn/release/blogv2/dist/pc/img/newCodeMoreWhite.png)
* 1
* 2
* 3
* 4
* 5
* 6
* 7
* 8
* 9
* 10
* 11
* 12
* 13
* 14
* 15
* 16
* 17
* 18
* 19
* 20
* 21
* 22
* 23
* 24
* 25
* 26
* 27
* 28
* 29
* 30
* 31
* 32
* 33
* 34
* 35
* 36
* 37
* 38
* 39
* 40
* 41
* 42
* 43
* 44
@torch.inference_mode()
def evaluate(model, dataset, batch_size=10, max_batches=None):
model.eval()
loader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=0)
losses = []
for i, batch in enumerate(loader):
batch = [t.to('cuda') for t in batch]
X, Y = batch
logits, loss = model(X, Y)
losses.append(loss.item())
if max_batches is not None and i >= max_batches:
break
mean_loss = torch.tensor(losses).mean().item()
model.train()
return mean_loss
训练模型
环境初始化
torch.manual_seed(seed=12345)
torch.cuda.manual_seed_all(seed=12345)
work_dir = "./Mlp_log"
os.makedirs(work_dir, exist_ok=True)
writer = SummaryWriter(log_dir=work_dir)
config = ModelConfig(vocab_size=train_dataset.get_vocab_size(),
block_size=7,
n_embed=64,
n_hidden=128)
格式化数据
train_dataset = CharDataset(train_words, chars, max_word_length, block_size=config.block_size)
test_dataset = CharDataset(test_words, chars, max_word_length, block_size=config.block_size)
train_dataset[0][0].shape, train_dataset[0][1].shape
(torch.Size([50, 7]), torch.Size([50]))
初始化模型
model = MLP(config)
model.to('cuda')
MLP(
(wte): Embedding(2274, 64)
(mlp): Sequential(
(0): Linear(in_features=448, out_features=128, bias=True)
(1): Tanh()
(2): Linear(in_features=128, out_features=2273, bias=True)
)
)
`optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01, betas=(0.9, 0.99), eps=1e-8)
batch_loader = InfiniteDataLoader(train_dataset, batch_size=64, pin_memory=True, num_workers=4)
best_loss = None
step = 0
train_losses, test_losses = [],[]
while True:
t0 = time.time()
batch = batch_loader.next()
batch = [t.to('cuda') for t in batch]
X, Y = batch
logits, loss = model(X, Y)
model.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
torch.cuda.synchronize()
t1 = time.time()
if step % 1000 == 0:
print(f"step {step} | loss {loss.item():.4f} | step time {(t1-t0)*1000:.2f}ms")
if step > 0 and step % 100 == 0:
train_loss = evaluate(model, train_dataset, batch_size=100, max_batches=10)
test_loss = evaluate(model, test_dataset, batch_size=100, max_batches=10)
train_losses.append(train_loss)
test_losses.append(test_loss)
if best_loss is None or test_loss < best_loss:
out_path = os.path.join(work_dir, "model.pt")
print(f"test loss {test_loss} is the best so far, saving model to {out_path}")
torch.save(model.state_dict(), out_path)
best_loss = test_loss
step += 1
if step > 15100:
break`
![](https://csdnimg.cn/release/blogv2/dist/pc/img/newCodeMoreWhite.png)
* 1
* 2
* 3
* 4
* 5
* 6
* 7
* 8
* 9
* 10
* 11
* 12
* 13
* 14
* 15
* 16
* 17
* 18
* 19
* 20
* 21
* 22
* 23
* 24
* 25
* 26
* 27
* 28
* 29
* 30
* 31
* 32
* 33
* 34
* 35
* 36
* 37
* 38
* 39
* 40
* 41
* 42
* 43
* 44
* 45
* 46
* 47
* 48
* 49
`step 0 | loss 7.7551 | step time 13.09ms
test loss 5.533482551574707 is the best so far, saving model to ./Mlp_log/model.pt
test loss 5.163593292236328 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.864410877227783 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.6439409255981445 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.482759475708008 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.350367069244385 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.250306129455566 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.16674280166626 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.0940842628479 is the best so far, saving model to ./Mlp_log/model.pt
.......................
step 6000 | loss 2.8038 | step time 6.44ms
step 7000 | loss 2.7815 | step time 11.88ms
step 8000 | loss 2.6511 | step time 5.93ms
step 9000 | loss 2.5898 | step time 5.00ms
step 10000 | loss 2.6600 | step time 6.12ms
step 11000 | loss 2.4634 | step time 5.94ms
step 12000 | loss 2.5373 | step time 7.75ms
step 13000 | loss 2.4050 | step time 6.29ms
step 14000 | loss 2.5434 | step time 7.77ms
step 15000 | loss 2.4084 | step time 7.10ms`
![](https://csdnimg.cn/release/blogv2/dist/pc/img/newCodeMoreWhite.png)
* 1
* 2
* 3
* 4
* 5
* 6
* 7
* 8
* 9
* 10
* 11
* 12
* 13
* 14
* 15
* 16
* 17
* 18
* 19
* 20
* 21
测试:评论生成器
`@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
block_size = model.get_block_size()
for _ in range(max_new_tokens):
idx_cond = idx if idx.size(2) <= block_size else idx[:, :,-block_size:]
logits, _ = model(idx_cond)
logits = logits[:,-1,:] / temperature
if top_k is not None:
v, _ = torch.topk(logits, top_k)
logits[logits < v[:, [-1]]] = -float('Inf')
probs = F.softmax(logits, dim=-1)
if do_sample:
idx_next = torch.multinomial(probs, num_samples=1)
else:
_, idx_next = torch.topk(probs, k=1, dim=-1)
idx = torch.cat((idx, idx_next.unsqueeze(1)), dim=-1)
return idx`
![](https://csdnimg.cn/release/blogv2/dist/pc/img/newCodeMoreWhite.png)
* 1
* 2
* 3
* 4
* 5
* 6
* 7
* 8
* 9
* 10
* 11
* 12
* 13
* 14
* 15
* 16
* 17
* 18
* 19
* 20
* 21
* 22
* 23
* 24
* 25
def print_samples(num=13, block_size=3, top_k = None):
X_init = torch.zeros((num, 1, block_size), dtype=torch.long).to('cuda')
steps = train_dataset.get_output_length() - 1
X_samp = generate(model, X_init, steps, top_k=top_k, do_sample=True).to('cuda')
new_samples = []
for i in range(X_samp.size(0)):
row = X_samp[i, :, block_size:].tolist()[0]
crop_index = row.index(0) if 0 in row else len(row)
row = row[:crop_index]
word_samp = train_dataset.decode(row)
new_samples.append(word_samp)
return new_samples
不同上下文长度的生成效果
block_size=3
'送餐大叔叔风怎么第一次点的1迷就没有需减改进',
'送餐很快!菜品一般,送到都等到了都很在店里吃不出肥肉,第我地佩也不好意思了。第一次最爱付了凉面味道不',
'很不好进吧。。。。。这点一次都是卫生骑题!调菜油腻,真不太满意!',
'11点送到指定地形,不知道他由、奶茶类应盒子,幸好咸。。。',
'味道一般小份速度太难吃了。',
'快递小哥很贴心也吃不习惯。',
'非常慢。',
'为什么,4个盒子,反正订的有点干,送餐速度把面洒了不超值!很快!!!!!少菜分量不够吃了!味道很少餐',
'骑士剁疼倒还没给糖的',
'怎么吃,正好吃,便宜'
block_size=5
['味道不错,送餐大哥工,餐大哥应不错。',
'配送很不满意',
'土豆炒几次,一小时才没吃,幸太多',
'粥不好吃,没有病311小菜送到,吃完太差了',
'太咸了,很感谢到,对这次送餐员辛苦,服务很不好',
'真的很香菇沙,卷哪丝口气,无语了!',
'菜不怎么夹生若梦粥,小伙n丁也没有收到餐。。。',
'一点不脆1个多小时才送到。等了那个小时。',
'就是送的太慢。。。。一京酱肉丝卷太不点了了,大份小太爱,真心不难吃,最后我的平时面没有听说什么呢,就',
'慢能再提前的好,牛肉好吃而且感觉适合更能事,味道倒卷,送的也很快!']
block_size = 7
['味道还不错,但是酱也没给,一点餐不差',
'都是肥肉,有差劲儿大的,也太给了,那么好给这么多~后超难吃~',
'少了一个半小时才吃到了',
'商务还菜很好的',
'慢慢了~以后!点他家极支付30元分钟,送过用了呢。',
'就是没送到就给送王一袋儿食吃起来掉了,有点辣,这油还这抄套!',
'很好吃,就是送餐师傅不错',
'包装好的牛肉卷糊弄错酱,重面太少了,肉不新鲜就吃了',
'味道不错,送得太慢...',
'非常好非常快递小哥,态度极差,一点也好,菜和粥洒了一袋软,以先订过哈哈哈哈']