Coggle数据科学 | iFLYTEK AI Competition: Job-Resume Matching Challenge, Season 3

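This article is reposted from the WeChat official account "Coggle数据科学" for academic sharing only and will be removed on request in case of infringement. It is packed with practical content.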

Original article: iFLYTEK AI Competition: Job-Resume Matching Challenge, Season 3

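Competition: Job-Resume Matching Challenge, Season 3
Type: natural language processing, text matching
Task: build a model from the provided samples to predict whether a resume matches a position.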

Registration link: https://challenge.xfyun.cn/topic/info?type=match-challenge-s3&ch=dw24_AtTCK9

Background

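讯飞智聘 is an intelligent solution that covers the entire enterprise recruitment process. Built on iFLYTEK's speech, natural language understanding, and computer vision technologies together with its big-data capabilities, it offers resume parsing, job-resume matching, AI interviews, and AI outbound calling, helping companies hire more efficiently and at lower cost.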

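Job-resume matching is a major challenge in corporate recruiting. In campus recruiting and other high-volume hiring, the problem is how to sort through a flood of resumes quickly and pick out those best suited to an opening. In referral and headhunting scenarios, it is how to find the right position for a given resume. Fitting the person to the job and the job to the person, and doing so more efficiently and accurately, is a difficulty that every HR professional and interviewer faces.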

Task

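Intelligent job-resume matching depends on large amounts of data. The competition provides a large set of encrypted, desensitized job descriptions (JDs) and candidate resumes as training samples. Participants build a model on these samples to predict whether a resume matches a position.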

Data description

The competition provides a training set and a test set, structured as follows:

The job JD data has four fields: positionName, positionDescription, positionRequirements, and positionID, which are the position name, position description, position requirements, and position ID, respectively.

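The candidate resume data includes fields such as profileEduExps, profileSocialExps, profileLanguage, profileProjectExps, profileSkills, profileAwards, profileWorkExps, and profileDesire, which the baseline below reads from resumeData.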

For data security, all data has been desensitized.

Evaluation metric

Submissions are scored with the macro-F1 score, computed from the submitted result file.
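The article does not spell out the formula. Assuming the usual definition (the one scikit-learn uses for `average='macro'`), macro-F1 is the unweighted mean of the per-class F1 scores, where C is the set of classes (here, "match" and "no match") and TP, FP, FN are counted per class:

```latex
\mathrm{F1}_c = \frac{2\,P_c R_c}{P_c + R_c} = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c},
\qquad
\text{macro-F1} = \frac{1}{|C|} \sum_{c \in C} \mathrm{F1}_c
```

Because every class carries the same weight, a model that ignores the rare class is penalized heavily even if its overall accuracy is high.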

Baseline

Loading the competition data

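import pandas as pd
import numpy as np
import json
import re
from tqdm import tqdm


def load_json(path):
    # Read a UTF-8 JSON file and close the file handle afterwards.
    with open(path, encoding='utf-8') as f:
        return json.load(f)


train = load_json('dataset/train.json')
test = load_json('dataset/test.json')
job_list = load_json('dataset/job_list.json')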

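Hand-crafted feature engineering
- For each key section of the resume (education, social experience, and so on), extract words with the regular expression \b\w+\b and store them in the cv_sample_word dictionary.
- Iterate over the job list (job_list) and, for each job, extract the words of the position name, position description, and position requirements with the same pattern.
- For each resume section, compute the following features against each of the three job fields:
  - The size of the intersection between the section's words and the job field's words.
  - That intersection size divided by the number of distinct words in the job field, giving a match ratio.
  - That intersection size divided by the number of distinct words in the resume section plus one, giving a second match ratio.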
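The three overlap features can be tried on a single pair of texts before running the full loop. The following is a minimal sketch: `overlap_features` is a hypothetical helper that is not part of the baseline, and the token strings are made up, since the actual format of the desensitized text is not shown in this article.

```python
import re

pattern = re.compile(r'\b\w+\b')


def overlap_features(cv_text, job_text):
    """Return (overlap size, overlap / job vocab, overlap / (cv vocab + 1))."""
    cv_words = set(pattern.findall(cv_text))
    job_words = set(pattern.findall(job_text))
    overlap = len(cv_words & job_words)
    return (
        overlap,
        overlap / max(len(job_words), 1),  # guard against an empty job field
        overlap / (len(cv_words) + 1),
    )


# Made-up token strings standing in for the desensitized text.
cv_text = "101 205 17 88 42"
job_text = "205, 42; 300 7"

print(overlap_features(cv_text, job_text))  # (2, 0.5, 0.3333333333333333)
```

The resume has 5 distinct tokens and the job field has 4, with 2 in common, so the ratios are 2/4 and 2/(5+1). The "+1" keeps the second ratio defined when a resume section is empty.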

# Regex that extracts word tokens: runs of word characters delimited by
# whitespace, punctuation, or the start/end of the string.
pattern = re.compile(r'\b\w+\b')

CV_KEYS = ['profileEduExps', 'profileSocialExps', 'profileLanguage', 'profileProjectExps',
           'profileSkills', 'profileAwards', 'profileWorkExps', 'profileDesire']
JOB_KEYS = ['positionName', 'positionDescription', 'positionRequirements']

# Tokenize every job once, up front, instead of once per resume.
job_words = [
    {key: set(pattern.findall(job_sample[key])) for key in JOB_KEYS}
    for job_sample in job_list
]

train_feat = []
for train_sample in tqdm(train):
    # Distinct tokens of each resume section.
    cv_sample_word = {
        key: set(pattern.findall(str(train_sample['resumeData'][key])))
        for key in CV_KEYS
    }

    for job_sample, words in zip(job_list, job_words):
        feat = [len(train_sample['resumeData']['profileEduExps'])]

        for key in CV_KEYS:
            cv_words = cv_sample_word[key]
            overlaps = [len(cv_words & words[job_key]) for job_key in JOB_KEYS]
            # Raw overlap sizes.
            feat.extend(overlaps)
            # Overlap divided by the job field's vocabulary size
            # (max(..., 1) avoids division by zero on an empty field).
            feat.extend(n / max(len(words[job_key]), 1) for n, job_key in zip(overlaps, JOB_KEYS))
            # Overlap divided by the resume section's vocabulary size plus one.
            feat.extend(n / (len(cv_words) + 1) for n in overlaps)

        # Label (last column): 1 if this job is the one paired with the resume, else 0.
        feat.append(int(train_sample['positionID'] == job_sample['positionID']))
        train_feat.append(feat)

Model training

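from lightgbm import LGBMClassifier

# Convert the feature list to a 2-D array so rows and columns can be sliced.
train_feat = np.array(train_feat)

# Hold out the last 20,000 rows for validation; the last column is the label.
model = LGBMClassifier()
model.fit(train_feat[:-20000, :-1], train_feat[:-20000, -1])

# Mean accuracy on the held-out rows.
model.score(train_feat[-20000:, :-1], train_feat[-20000:, -1])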
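Note that `model.score` reports accuracy, while the competition is scored with macro-F1. Since the baseline pairs every resume with every job, most rows are negatives, and accuracy can look high even when the positive class is missed entirely. The sketch below illustrates the gap on toy labels; `macro_f1` and `accuracy` are hypothetical helpers written for this illustration, assuming the usual definition of macro-F1 as the unweighted mean of per-class F1.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy hold-out: 1 positive among 10 rows, and a model that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(accuracy(y_true, y_pred))  # 0.9
print(macro_f1(y_true, y_pred))  # about 0.474 (class 0: 18/19, class 1: 0)
```

The same function could be applied to the hold-out labels and the output of `model.predict` on the hold-out features to get a validation score closer to the leaderboard metric.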

The full code is available at:

https://github.com/datawhalechina/competition-baseline/tree/master/competition/%E7%A7%91%E5%A4%A7%E8%AE%AF%E9%A3%9EAI%E5%BC%80%E5%8F%91%E8%80%85%E5%A4%A7%E8%B5%9B2024