Table of Contents
- 0. Competition Roundup for This Column
- 1. Purpose of This Article
- 2. The Transformer Data Input Layer
- 3. Building the Training Data
- 3.1 Code Implementation
- 3.2 Building the Training Data, in Plain Language
- 4. Text Construction & Tokenizer Processing
- 4.1 Code Implementation
- 4.2 Text Construction & Tokenizer Processing, in Plain Language
0. Competition Roundup for This Column
Kaggle Competition Roundup
1. Purpose of This Article
- In plain language: the previous article covered filling in the missing misconception explanations with AI. The main task from here on is to use the Transformer architecture to train a text-matching model that pairs wrong answers with misconception explanations. This article defines a Transformer class for contrastive learning of text similarity; since the overall architecture is too large for one post, this article starts by walking through the "data input layer" of the architecture diagram.
- Skills you can take away from this article: generating Transformer training data, and standardizing data for a Transformer.
- Previous article: Eedi Competition Transformer Framework Solution 02 - Generating the Missing Training Data with GPT_4o
2. The Transformer Data Input Layer
In the overall architecture, the data input layer covers two steps, which the rest of this article walks through in turn: building the training data (section 3) and text construction & tokenizer processing (section 4).
3. Building the Training Data
3.1 Code Implementation
```python
import copy

import numpy as np
import pandas as pd

task_description = 'Given a math question and a misconcepte incorrect answer, please retrieve the most accurate reason for the misconception.'

def get_detailed_instruct(task_description: str, query: str) -> str:
    """Generate the detailed instruction text."""
    return f'Instruct: {task_description}\nQuery: {query}'

def create_train_df(train_df, misconception_mapping, is_train=True):
    """Build the training dataset.

    Args:
        train_df: raw competition DataFrame
        misconception_mapping: mapping from misconception id to misconception name
        is_train: whether this is the training set
    """
    train_data = []
    for _, row in train_df.iterrows():
        for c in ['A', 'B', 'C', 'D']:
            # The correct option carries no misconception, so skip it
            if c == row['CorrectAnswer']:
                continue
            if f'Answer{c}Text' not in row:
                continue
            if is_train:
                misconception_id = row[f'Misconception{c}Id']
                if np.isnan(misconception_id):
                    misconception_id = -1  # missing label
                misconception_id = int(misconception_id)
            # Work on a copy so fields set for one option never leak into the next
            new_row = copy.deepcopy(row)
            real_answer_id = row['CorrectAnswer']
            real_text = row[f'Answer{real_answer_id}Text']
            query_text = (
                f"###question###:{row['SubjectName']}-{row['ConstructName']}-{row['QuestionText']}\n"
                f"###Correct Answer###:{real_text}\n"
                f"###Misconcepte Incorrect answer###:{row[f'Answer{c}Text']}"
            )
            new_row['query'] = get_detailed_instruct(task_description, query_text)
            new_row['answer_name'] = c
            if is_train and misconception_id != -1:
                # Look up the misconception text: this becomes the positive "document"
                new_row['doc'] = misconception_mapping.iloc[misconception_id]['MisconceptionName']
                new_row['answer_id'] = misconception_id
            train_data.append(new_row)
    new_train_df = pd.DataFrame(train_data)
    return new_train_df
```
3.2 Building the Training Data, in Plain Language
This function essentially does the following:
- Collects the questions students got wrong
- Analyzes each wrong answer
- Links three pieces of information together: "question - wrong answer - misconception"
- So that the model can learn: "when I see this kind of mistake, I know which misconception the student holds"
It is like helping a teacher build a "mistake notebook" that not only records where each mistake is, but also captures why it was made. A toy run of the function is sketched below.
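Here is a minimal sketch of calling create_train_df on a single hand-made question. The column names follow the Eedi competition schema, but every value (question, answers, and misconception names) is invented purely for illustration; it reuses the imports from section 3.1.

```python
# Toy input: one question whose correct answer is A (all values are made up)
toy_train = pd.DataFrame([{
    'SubjectName': 'Fractions',
    'ConstructName': 'Add fractions with unlike denominators',
    'QuestionText': 'What is 1/2 + 1/3?',
    'CorrectAnswer': 'A',
    'AnswerAText': '5/6',
    'AnswerBText': '2/5',
    'AnswerCText': '1/6',
    'AnswerDText': '2/6',
    'MisconceptionAId': np.nan,  # correct option: no misconception
    'MisconceptionBId': 0.0,
    'MisconceptionCId': 1.0,
    'MisconceptionDId': np.nan,  # missing label -> mapped to -1, no 'doc'
}])

# Row position doubles as the misconception id, matching the .iloc lookup
toy_mapping = pd.DataFrame({'MisconceptionName': [
    'Adds the numerators and the denominators separately',
    'Subtracts instead of adds',
]})

toy_df = create_train_df(toy_train, toy_mapping, is_train=True)
print(toy_df['answer_name'].tolist())            # ['B', 'C', 'D'] -- A is skipped
print(toy_df.iloc[0]['doc'])                     # positive document for wrong answer B
print(toy_df.iloc[0]['query'].splitlines()[0])   # 'Instruct: <task_description>'
```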
4. Text Construction & Tokenizer Processing
4.1 Code Implementation
```python
import random

import torch
from transformers import AutoTokenizer, DataCollatorWithPadding

class EmbedCollator(DataCollatorWithPadding):
    """Batch collator: tokenizes, pads, and truncates query/doc pairs."""

    def __init__(self, tokenizer: AutoTokenizer, query_max_len=None, passage_max_len=None):
        super().__init__(tokenizer=tokenizer)
        self.tokenizer = tokenizer
        self.query_max_len = query_max_len or 256
        self.passage_max_len = passage_max_len or 50

    def padding_score(self, teacher_score):
        """Pad teacher scores (reserved for distillation; not used in __call__ below)."""
        group_size = None
        for scores in teacher_score:
            if scores is not None:
                group_size = len(scores)
                break
        if group_size is None:
            return None
        # A missing group defaults to "the first candidate is the positive"
        padding_scores = [100.0] + [0.0] * (group_size - 1)
        new_teacher_score = []
        for scores in teacher_score:
            if scores is None:
                new_teacher_score.append(padding_scores)
            else:
                new_teacher_score.append(scores)
        return new_teacher_score

    def mask_pad_token(self, q):
        """For ~10% of batches, randomly replace ~10% of the input token ids
        with id 2 (assumed to be the mask/unk token of the tokenizer in use)."""
        if random.random() > 0.9:
            tensor = q['input_ids'].float()
            mask = torch.rand(tensor.shape)
            mask = (mask > 0.9).float()
            tensor = tensor * (1 - mask) + 2 * mask
            q['input_ids'] = tensor.long()
        return q

    def __call__(self, features):
        """Collate one batch of features into query and doc tensors."""
        query = [f["query"] for f in features]
        passage = [f["doc"] for f in features]
        # Flatten if each feature carries a list of texts
        if isinstance(query[0], list):
            query = sum(query, [])
        if isinstance(passage[0], list):
            passage = sum(passage, [])
        q_collated = self.tokenizer(
            query,
            padding=True,
            truncation=True,
            max_length=self.query_max_len,
            return_tensors="pt",
        )
        q_collated = self.mask_pad_token(q_collated)
        d_collated = self.tokenizer(
            passage,
            padding=True,
            truncation=True,
            max_length=self.passage_max_len,
            return_tensors="pt",
        )
        d_collated = self.mask_pad_token(d_collated)
        return {"query": q_collated, "doc": d_collated}
```
4.2 Text Construction & Tokenizer Processing, in Plain Language
This standardizes the text data. The flow, in short:
- Receive parcels of every shape and size (the texts)
- Process them all into a uniform format (tokenize, pad, truncate)
- Stamp on special marks (the mask processing), in order to:
  - Strengthen robustness: the model learns to handle incomplete input
  - Prevent overfitting: random masking increases data diversity
  - Improve generalization: the model has to understand text from its context
- Finally, pack everything into standard boxes (tensors)
- Ready for shipping onward (model processing)
No matter what kind of parcel arrives, it leaves the station in a uniform, standardized package that is easy to handle in batches; a usage sketch follows below.
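As a minimal sketch of how the collator plugs into a training loop, assuming new_train_df comes from create_train_df in section 3 (the checkpoint name here is only a placeholder; any Hugging Face tokenizer would do):

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # placeholder checkpoint
collator = EmbedCollator(tokenizer, query_max_len=256, passage_max_len=50)

# Each feature only needs the 'query' and 'doc' fields built in section 3;
# rows without a matched misconception (no 'doc') are dropped here.
records = new_train_df[['query', 'doc']].dropna().to_dict('records')
loader = DataLoader(records, batch_size=16, shuffle=True, collate_fn=collator)

batch = next(iter(loader))
print(batch['query']['input_ids'].shape)  # (batch_size, padded query length <= 256)
print(batch['doc']['input_ids'].shape)    # (batch_size, padded doc length <= 50)
```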
(To be continued)