Note: this article is adapted from the professional AI community Venus AI.
For more AI content, see the original site ([www.aideeplearning.cn]).
Project Background
Text analysis has long been a prominent topic in data science and machine learning. The core challenge of this project is to use machine learning to predict the genre of a book or article from its summary. This is both a technical challenge, involving non-trivial text processing and pattern recognition, and an applied one: putting these techniques to work in a real business setting.
Dataset Features in Detail
- Data richness: the dataset contains summaries of thousands of books, providing a diverse text corpus for machine-learning experiments.
- Multi-dimensional features: beyond the basics such as title, author, and summary, the dataset may also include publication year, ISBN, publisher, and even a book's physical dimensions. These extra dimensions can support more accurate predictive models.
- Genre diversity: the books span many genres, giving a broad classification space that is valuable for building classification algorithms.
Application Areas
- Publishing: the project can help publishers categorize books more precisely, sharpening their marketing strategies and reader targeting.
- E-commerce: online bookstores and e-book platforms can use these algorithms to improve the accuracy of their recommendation systems, and with it user experience and sales efficiency.
- Academic research: for literary scholars, the project offers a new, data-driven way to analyze literary works.
Project Goals
- Model building: the main goal is an effective machine-learning model that accurately predicts a book's genre. The project trains and tests logistic regression, naive Bayes, support-vector-machine, and random-forest models.
- Data processing and feature engineering: preprocessing is a key part of the project, covering text cleaning, tokenization, and vectorization (a minimal cleaning sketch follows this list); feature engineering then identifies which features matter most for genre prediction.
- Model evaluation: assessing the model's accuracy, generalization, and efficiency is essential. This involves metrics such as accuracy, recall, and F1 score, along with techniques such as cross-validation.
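To make the preprocessing goal above concrete, here is a minimal text-cleaning sketch (lowercasing, stripping punctuation, collapsing whitespace). The pipeline below only applies lowercasing, so the extra steps are illustrative assumptions rather than part of the project code.
import re

def clean_text(text):
    # Lowercase, replace punctuation with spaces, and collapse repeated whitespace
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("A Wizard's Tale: Book #1!"))  # -> "a wizard s tale book 1"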
Scientific Computing Library Dependencies
- matplotlib==3.7.1
- pandas==2.0.2
- scikit_learn==1.2.2
- seaborn==0.13.0
- sentence_transformers==2.2.2
Detailed Project Code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
Read the training dataset with pandas
data = pd.read_csv('data.csv')
print(data.columns)
data.shape
Index(['index', 'title', 'genre', 'summary'], dtype='object')
(4657, 4)
data.head()
data['genre'].value_counts().plot(kind='barh')
Data Preprocessing
data['genre_id'] = data['genre'].factorize()[0]
data['genre_id'].value_counts()
5    1023
0     876
1     647
3     600
4     600
2     500
7     111
6     100
8     100
9     100
Name: genre_id, dtype: int64
data['summary'] = data['summary'].apply(lambda x: x.lower())
data.head()
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')
def getEmbedding(sentence):
    # Encode a single sentence into a dense vector
    return model.encode(sentence)
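As a quick sanity check (assuming the model files download successfully on first use), all-mpnet-base-v2 maps each text to a 768-dimensional dense vector:
example = getEmbedding("a short test sentence")
print(example.shape)  # (768,)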
embeddings = data['summary'].apply(lambda x: getEmbedding(x))
input_data = []
for item in embeddings:
    input_data.append(item.tolist())
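Encoding row by row with .apply works but is slow. SentenceTransformer.encode also accepts a list of sentences and batches them internally, so an equivalent and typically much faster sketch is:
# Batched alternative to the row-by-row .apply above
input_data = model.encode(data['summary'].tolist(), show_progress_bar=True).tolist()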
Comparing the TF-IDF vectorizer with deep-learning sentence embeddings
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2),
                        stop_words='english')
# Transform each summary into a TF-IDF vector
features = tfidf.fit_transform(data.summary).toarray()
labels = data.genre_id
print("Each of the %d synopsis is represented by %d features (TF-IDF score of unigrams and bigrams)"
Each of the 4657 synopsis is represented by 20152 features (TF-IDF score of unigrams and bigrams)
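As an optional sanity check on the TF-IDF features (an added inspection step, not part of the original pipeline), a chi-squared test can surface the terms most correlated with each genre:
from sklearn.feature_selection import chi2
import numpy as np

for genre, genre_id in data[['genre', 'genre_id']].drop_duplicates().values:
    # Score every unigram/bigram against a one-vs-rest split for this genre
    chi2_scores = chi2(features, labels == genre_id)[0]
    top = np.argsort(chi2_scores)[-5:]
    print(genre, ':', ', '.join(tfidf.get_feature_names_out()[top]))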
Accuracy comparison between TF-IDF and the sentence-embedding model
Text features from the TF-IDF vectorizer:
models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
# 5-fold cross-validation
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df_tfidf = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
cv_df_tfidf
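Averaging the five folds per model makes the comparison easier to read:
print(cv_df_tfidf.groupby('model_name').accuracy.mean())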
Text features from the sentence-embedding model
# MultinomialNB is omitted here: it requires non-negative feature values,
# while sentence embeddings contain negative components.
models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
    LinearSVC(),
    LogisticRegression(random_state=0, max_iter=1000)
]
# 5-fold cross-validation
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, input_data, data['genre'].factorize()[0], scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
cv_df
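The same per-model average for the sentence-embedding features, plus a quick seaborn boxplot of the fold accuracies (seaborn is already a project dependency):
import seaborn as sns
import matplotlib.pyplot as plt

print(cv_df.groupby('model_name').accuracy.mean())
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
plt.title('5-fold CV accuracy with sentence embeddings')
plt.show()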
Note: the comparison above shows that every model reaches higher accuracy with the sentence-embedding features than with TF-IDF, so we use these embeddings for the remaining model training.
Moreover, LinearSVC achieves the highest accuracy of the models tested, so we use LinearSVC to train our genre prediction model.
Model Training
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
X_train, X_test, y_train, y_test = train_test_split(input_data,
                                                    labels,
                                                    test_size=0.25,
                                                    random_state=1)
model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Classification report
print('\t\t\t\t\tCLASSIFICATION METRICS\n')
print(classification_report(y_test, y_pred,
                            target_names=data['genre'].unique()))
                                        CLASSIFICATION METRICS

              precision    recall  f1-score   support

     fantasy       0.77      0.77      0.77       233
     science       0.74      0.72      0.73       167
       crime       0.66      0.59      0.62       126
     history       0.76      0.77      0.76       145
      horror       0.66      0.59      0.62       132
    thriller       0.67      0.79      0.72       247
  psychology       0.79      0.76      0.78        25
     romance       0.63      0.34      0.44        35
      sports       0.86      0.83      0.85        30
      travel       0.89      1.00      0.94        25

    accuracy                           0.72      1165
   macro avg       0.74      0.72      0.72      1165
weighted avg       0.72      0.72      0.72      1165
# Map genre ids back to names, ordered by id to match confusion_matrix's label order
genre_id_df = data[['genre', 'genre_id']].drop_duplicates().sort_values('genre_id')
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(conf_mat, annot=True, cmap="Blues", fmt='d',
            xticklabels=genre_id_df.genre.values,
            yticklabels=genre_id_df.genre.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title("CONFUSION MATRIX - LinearSVC\n", size=16);
Saving the Embedding Vectors to a File
import pickle
# Persist the computed embeddings so they can be reused without re-encoding
with open('embeddings.pkl', 'wb') as file:
    pickle.dump(embeddings, file)

# Reload them when needed
with open('embeddings.pkl', 'rb') as file:
    embeddings = pickle.load(file)
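Pickle stores the pandas Series as-is; if a plain numeric matrix is preferred, a NumPy .npy file is an equally simple alternative (numpy is pulled in by pandas):
import numpy as np

np.save('embeddings.npy', np.vstack(embeddings.values))  # shape (4657, 768)
loaded = np.load('embeddings.npy')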
Making Predictions
from sentence_transformers import SentenceTransformer
# Initialize the SentenceTransformer model
model = SentenceTransformer('all-mpnet-base-v2')
def getEmbedding(sentence):
    # Generate and return the embedding for the provided sentence
    return model.encode(sentence)
# Example usage
test_summary = "Teenager Max McGrath (Ben Winchell) discovers that his body can generate the most powerful energy in the universe. Steel (Josh Brener) is a funny, slightly rebellious, techno-organic extraterrestrial who wants to utilize Max's skills. When the two meet, they combine together to become Max Steel, a superhero with unmatched strength on Earth. They soon learn to rely on each other when Max Steel must square off against an unstoppable enemy from another galaxy."
test_summary_embed = getEmbedding(test_summary)
X_train, X_test, y_train, y_test = train_test_split(input_data, data['genre'],
                                                    test_size=0.25,
                                                    random_state=0)
trained_model = LinearSVC().fit(X_train, y_train)
print(trained_model.predict([test_summary_embed]))
['science']
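Wrapping the steps above into a small helper makes ad-hoc predictions convenient; predict_genre is a hypothetical convenience name, not from the original code:
def predict_genre(summary):
    # Lowercase to match the training preprocessing, embed, then classify
    embedding = getEmbedding(summary.lower())
    return trained_model.predict([embedding])[0]

print(predict_genre(test_summary))  # e.g. 'science'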
Project Resources
For details, see the genre classification project at VenusAI (aideeplearning.cn).