A Spam SMS Detection System Based on Natural Language Processing

🌟 Hi, I'm LucianaiB!

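1. Project Title
2. Project Objectives
3. Task Description
4. Design Requirements
5. Input and Output Requirements
- 5.1 Input Requirements
- 5.2 Output Requirements
6. Acceptance Criteria
7. Schedule
8. System Analysis
9. Overall Design
10. Detailed Design
- 10.1 Data Preprocessing Module
- 10.2 Feature Extraction Module
- 10.3 Model Building Module
- 10.4 Performance Evaluation Module
11. Data Structure Design
12. Function List and Descriptions
13. Implementation
- 13.1 Data Preprocessing
- 13.2 Feature Extraction
- 13.3 Model Training
- 13.4 Performance Evaluation
- 13.5 Word Cloud Generation
14. Test Data and Results
15. Summary and Reflections
16. References
17. Appendix: Code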

1. Project Title

A Spam SMS Detection System Based on Natural Language Processing

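2. Project Objectives

This project aims to build an efficient spam SMS detection system using natural language processing (NLP). It combines word segmentation, stop-word removal, sentiment analysis, and machine learning models to classify spam messages automatically, improving the accuracy and efficiency of SMS filtering.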

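3. Task Description

Segment the SMS text with a Chinese word segmenter, remove stop words, and refine the segmentation with a custom dictionary.
Preprocess the data with text-mining techniques, including data cleaning, missing-value handling, and outlier detection.
Build a TF-IDF matrix to extract text features.
Classify spam messages with machine learning models such as naive Bayes and SVM.
Evaluate model performance and plot the learning curve, confusion matrix, and ROC curve.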

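4. Design Requirements

Data preprocessing: word segmentation, stop-word removal, data cleaning.
Feature extraction: TF-IDF matrix.
Model building: naive Bayes, SVM.
Performance evaluation: accuracy, recall, F1 score, ROC curve.
Visualization: word cloud, learning curve, confusion matrix, ROC curve.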

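5. Input and Output Requirements

5.1 Input Requirements

SMS text dataset (CSV format).
Stop-word list (TXT format).

5.2 Output Requirements

Word segmentation results and part-of-speech tagging results.
TF-IDF matrix.
Word cloud.
Model performance report (accuracy, recall, F1 score).
Confusion matrix and ROC curve.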

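6. Acceptance Criteria

The system reads the SMS data correctly and completes word segmentation and stop-word removal.
The TF-IDF matrix is generated correctly.
The word cloud clearly shows the high-frequency words.
The naive Bayes and SVM models reach the target performance (accuracy ≥ 85%).
Complete test data and run results are provided.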

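7. Schedule

Phase	Time	Tasks
Requirements analysis	Week 1	Define the project requirements and design the project framework
Data preprocessing	Week 2	Complete word segmentation, stop-word removal, and data cleaning
Feature extraction	Week 3	Build the TF-IDF matrix and generate the word cloud
Model building	Week 4	Implement the naive Bayes and SVM models
Performance evaluation	Week 5	Evaluate the models; plot the learning curve, confusion matrix, and ROC curve
Documentation	Week 6	Write the project report; organize the code and documents
Project wrap-up	Week 7	Summarize lessons learned and prepare the demo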

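8. System Analysis

Functional requirements:
- Data preprocessing: word segmentation, stop-word removal, data cleaning.
- Feature extraction: TF-IDF matrix.
- Model building: naive Bayes, SVM.
- Performance evaluation: accuracy, recall, F1 score, ROC curve.
- Visualization: word cloud, learning curve, confusion matrix, ROC curve.
Technology choices:
- Programming language: Python.
- Tokenization tools: jieba (Chinese), NLTK (English).
- Machine learning framework: scikit-learn.
- Visualization tools: Matplotlib, pyecharts.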

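9. Overall Design

The system consists of five modules: data preprocessing, feature extraction, model building, performance evaluation, and visualization.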

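10. Detailed Design

10.1 Data Preprocessing Module

Word segmentation: segment Chinese text with jieba.
Stop-word removal: load the stop-word list and filter out stop words.
Data cleaning: remove punctuation, digits, and special characters.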
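
The cleaning step can be sketched with the standard library alone. This is a minimal illustration rather than the project's own implementation: the helper name `clean_text` is hypothetical, and treating the range `\u4e00-\u9fa5` as "Chinese characters" is an assumption that ignores rarer ideographs.

```python
import re

# Assumption: everything outside the common CJK range counts as noise
# (punctuation, digits, Latin letters, URLs).
_NON_CJK = re.compile(r"[^\u4e00-\u9fa5]+")


def clean_text(text: str) -> str:
    """Replace each run of non-Chinese characters with a single space."""
    return _NON_CJK.sub(" ", text).strip()


sample = "恭喜您中奖100万！！点击http://x.cn领取"
cleaned = clean_text(sample)
print(cleaned)
```

Replacing noise with a space instead of deleting it keeps the words on either side apart, so the segmenter does not glue them together by accident.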

10.2 Feature Extraction Module

TF-IDF matrix: built with scikit-learn's TfidfVectorizer.
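
To make the TF-IDF weighting concrete, here is a small standard-library sketch. It follows the smoothed formula that TfidfVectorizer uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The function name `tfidf` and the toy documents are illustrative assumptions, not part of the project.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        raw = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})  # L2-normalized row
    return rows


docs = [["免费", "领取", "免费"], ["明天", "领取"]]
weights = tfidf(docs)
for row in weights:
    print(row)
```

A term that appears in every document ("领取" here) gets the smallest IDF, so within each row it weighs less than the terms unique to that message.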

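10.3 Model Building Module

Naive Bayes model: GaussianNB. It does not accept sparse input, so the TF-IDF matrix is converted to a dense array first.
SVM model: SVC.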
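
For intuition about how a naive Bayes classifier reaches a decision, the sketch below implements the multinomial variant on word counts with Laplace smoothing. Note that this is a different variant from the GaussianNB used in this project; it is shown only because it is easy to follow in plain Python. The function names, the "spam"/"ham" labels, and the toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, tokens):
    """Return the class with the highest log posterior."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train_docs = [
    ["免费", "中奖", "领取"],
    ["点击", "领取", "大奖"],
    ["明天", "开会"],
    ["晚上", "一起", "吃饭"],
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["免费", "大奖"]))
print(predict_nb(model, ["明天", "吃饭"]))
```

The smoothing term `alpha` keeps a word that never occurred in one class from driving that class's probability to zero.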

10.4 Performance Evaluation Module

Metrics: accuracy, recall, F1 score.
Visualization: learning curve, confusion matrix, ROC curve.
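
The metrics above all derive from the four cells of the binary confusion matrix. The following sketch computes them by hand; the function name `binary_metrics`, the 0/1 label encoding (1 = spam), and the toy labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks "of the messages flagged as spam, how many really were?", while recall asks "of the real spam, how much was caught?". The two are not interchangeable, which is why the argument order of the scikit-learn scoring functions (true labels first) matters.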

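11. Data Structure Design

Input data structure: a CSV file containing the SMS text and its label.
Output data structure: TF-IDF matrix, model performance report, and visualization charts.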
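
As a rough illustration of the input format, the sketch below parses a small in-memory CSV and skips rows with a missing field. The column names `text` and `label` match the ones used in the implementation section; the 0/1 label encoding, the helper name `load_messages`, and the sample rows are assumptions.

```python
import csv
import io


def load_messages(csv_file):
    """Read (text, label) pairs, skipping rows with a missing field."""
    rows = []
    for row in csv.DictReader(csv_file):
        text = (row.get("text") or "").strip()
        label = (row.get("label") or "").strip()
        if text and label:
            rows.append((text, int(label)))
    return rows


# Hypothetical sample: the third row has no text and is dropped
sample_csv = io.StringIO(
    "text,label\n"
    "恭喜您中奖，点击链接领取,1\n"
    "明天下午三点开会,0\n"
    ",1\n"
)
messages = load_messages(sample_csv)
print(messages)
```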

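12. Function List and Descriptions

preprocess_text(text): segments the text and removes stop words.
generate_tfidf_matrix(corpus): builds the TF-IDF matrix.
train_naive_bayes(x_train, y_train): trains the naive Bayes model.
train_svm(x_train, y_train): trains the SVM model.
evaluate_model(model, x_test, y_test): evaluates model performance.
plot_confusion_matrix(model, x_test, y_test): plots the confusion matrix. scikit-learn 1.2 removed its function of this name; ConfusionMatrixDisplay.from_estimator replaces it.
plot_roc_curve(model, x_test, y_test): plots the ROC curve. scikit-learn 1.2 removed its function of this name; RocCurveDisplay.from_estimator replaces it.
generate_wordcloud(text): generates the word cloud.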

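13. Implementation

13.1 Data Preprocessing

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the data and drop rows with a missing text or label
data = pd.read_csv("spam_data.csv").dropna(subset=["text", "label"])
texts = data["text"].astype(str).tolist()

# Load the stop-word list once instead of re-reading the file for every message
with open("stopwords.txt", encoding="utf-8") as f:
    stop_words = set(f.read().split())

# Segment with jieba, then drop stop words and whitespace-only tokens
def preprocess_text(text):
    words = jieba.cut(text)
    return " ".join(word for word in words if word.strip() and word not in stop_words)

processed_texts = [preprocess_text(text) for text in texts]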

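13.2 Feature Extraction

# Note: TfidfVectorizer's default token_pattern keeps only tokens of two or more
# characters, so single-character words are dropped at this step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_texts)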

13.3 Model Training

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hold out 25% of the samples as the test set
x_train, x_test, y_train, y_test = train_test_split(tfidf_matrix, data['label'], test_size=0.25)

# Naive Bayes model (GaussianNB needs a dense array, hence toarray())
nb_model = GaussianNB()
nb_model.fit(x_train.toarray(), y_train)

# SVM model
svm_model = SVC(kernel="rbf")
svm_model.fit(x_train.toarray(), y_train)

13.4 Performance Evaluation

import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes used below replace them.
def evaluate_model(model, x_test, y_test):
    x_dense = x_test.toarray()
    y_pred = model.predict(x_dense)
    # The default average="binary" assumes 0/1 labels with 1 as the spam class
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    print(f"Accuracy: {acc}, F1: {f1}, Recall: {recall}, Precision: {precision}")
    ConfusionMatrixDisplay.from_estimator(model, x_dense, y_test)
    RocCurveDisplay.from_estimator(model, x_dense, y_test)
    plt.show()

evaluate_model(nb_model, x_test, y_test)
evaluate_model(svm_model, x_test, y_test)

13.5 Word Cloud Generation

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(text):
    # font_path must point to a font that contains Chinese glyphs
    wordcloud = WordCloud(font_path="msyh.ttc", background_color="white").generate(text)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

generate_wordcloud(" ".join(processed_texts))
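
The word cloud sizes each word by how often it occurs. Before drawing it, a quick frequency count is a handy sanity check on the preprocessing output. The sketch below assumes the space-joined format that `preprocess_text` produces; the helper name `top_words`, the `min_len` filter, and the sample strings are illustrative assumptions.

```python
from collections import Counter


def top_words(processed_texts, n=10, min_len=2):
    """Count space-separated words and return the n most frequent."""
    counts = Counter(
        word
        for text in processed_texts
        for word in text.split()
        if len(word) >= min_len  # skip single-character words
    )
    return counts.most_common(n)


sample = ["免费 领取 大奖", "点击 领取 免费 礼品", "明天 开会 的"]
top = top_words(sample, n=2)
print(top)
```

If stop words or stray punctuation show up near the top of this list, the stop-word file or the cleaning step probably needs another look before the cloud is rendered.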

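14. Test Data and Results

Test data

A public spam SMS dataset with 1,000 messages: 500 spam and 500 legitimate.

Results

Word cloud: shows the high-frequency words.
Model performance:
- Naive Bayes: accuracy 88%, recall 85%, F1 score 86%.
- SVM: accuracy 92%, recall 90%, F1 score 91%.
Confusion matrix and ROC curve: see the screenshots of the run results.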

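15. Summary and Reflections

In this project we built a spam SMS detection system based on natural language processing. Along the way we practiced word segmentation, TF-IDF feature extraction, and building and evaluating naive Bayes and SVM models. As a next step, more advanced models such as deep learning models could be tried to improve performance further.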

16. References

NLTK official documentation
scikit-learn official documentation
jieba Chinese word segmentation
Python Data Science Handbook

17. Appendix: Code

1.1 Tokenization, stop-word removal, word-frequency counting, sentiment analysis, and text classification with NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# One-time download of the resources used below
# (newer NLTK releases may also need 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')

# Tokenization
text = "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language."
tokens = word_tokenize(text)
print(tokens)

# Stop-word removal
stop_words = set(stopwords.words('english'))
tokens_filtered = [word for word in tokens if word.lower() not in stop_words]
print(tokens_filtered)

# Word-frequency counting
freq_dist = nltk.FreqDist(tokens_filtered)
print(freq_dist.most_common(5))

# Sentiment analysis
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(text)
print(sentiment_score)

# Text classification
pos_tweets = [('I love this car', 'positive'), ('This view is amazing', 'positive'), ('I feel great this morning', 'positive'), ('I am so happy today', 'positive'), ('He is my best friend', 'positive')]
neg_tweets = [('I do not like this car', 'negative'), ('This view is horrible', 'negative'), ('I feel tired this morning', 'negative'), ('I am so sad today', 'negative'), ('He is my worst enemy', 'negative')]

# Feature extraction: a bag-of-words presence dictionary
def word_feats(words):
    return {word: True for word in words}

# Build the dataset
pos_features = [(word_feats(word_tokenize(tweet)), sentiment) for (tweet, sentiment) in pos_tweets]
neg_features = [(word_feats(word_tokenize(tweet)), sentiment) for (tweet, sentiment) in neg_tweets]
train_set = pos_features + neg_features

# Train the classifier
classifier = NaiveBayesClassifier.train(train_set)

# Classify a new sentence
test_tweet = 'I love this view'
test_feature = word_feats(word_tokenize(test_tweet))
print(classifier.classify(test_feature))

# Classifier accuracy. The test set here is a subset of the training data,
# so this number is optimistic and only checks that the pipeline runs.
test_set = pos_features[:2] + neg_features[:2]
print('Accuracy:', accuracy(classifier, test_set))

1.2 Segmentation results, part-of-speech tagging results, and TF-IDF matrix

# Import the required libraries
import re

import jieba
import jieba.posseg as pseg
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

with open("C:\\Users\\lx\\Desktop\\南词.txt", "r", encoding="utf-8") as file:
    text = file.read()

# 1. Segment the text in jieba's precise mode
seg_list = jieba.cut(text, cut_all=False)

# 2. Remove stop words
stop_words = ["的", "了", "和", "是", "在", "有", "也", "与", "对", "中", "等"]
filtered_words = [word for word in seg_list if word not in stop_words]

# 3. Normalization
# Optional: strip punctuation, digits, and special symbols by keeping Chinese characters only
# filtered_words = [re.sub(r'[^\u4e00-\u9fa5]', '', word) for word in filtered_words]
# Drop empty and whitespace-only tokens
filtered_words = [word for word in filtered_words if word.strip()]

# 4. Part-of-speech tagging with jieba.posseg
words = pseg.cut("".join(filtered_words))

# 5. Build the term-document matrix (TF-IDF)
# With a single document every term has the same IDF, so the weights reflect term frequency only
corpus = [" ".join(filtered_words)]  # wrap the processed text in a one-document list
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Print the results
print("Segmentation result:", "/".join(filtered_words))
print("POS tagging result:", [(word, flag) for word, flag in words])
print("TF-IDF matrix:", X.toarray())

# Convert the TF-IDF matrix to a DataFrame
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Reshape to long format: one row per word with its weight
df_melted = df.melt(var_name='word', value_name='weight')

# Write the DataFrame to an Excel file (requires openpyxl)
df_melted.to_excel("C:\\Users\\lx\\Desktop\\2024.xlsx", index=False)

1.3 Interactive word cloud (pyecharts) from a given document with a custom exclusion list

import jieba
from pyecharts import options as opts
from pyecharts.charts import WordCloud

# Read the source text
text_road = 'C:\\Users\\lx\\Desktop\\南方词.txt'
with open(text_road, 'r', encoding='utf-8') as f:
    text = f.read()

# Words to exclude from the word cloud
excludes = {"我们", "什么", '一个', '那里', '一天', '一列', '一定', '上千', '一年', '她们', '数千', '低于', '这些'}

# Segment the text in precise mode
words = jieba.lcut(text)

# Count occurrences, skipping single-character words
counts = {}
for word in words:
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1

# Remove excluded words; pop() avoids a KeyError when a word never occurred
for word in excludes:
    counts.pop(word, None)

items = list(counts.items())  # convert the dict to a list of (word, count) pairs
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
# print(items)

# Draw the interactive word cloud
(
    WordCloud()
    # word_size_range sets the range of font sizes
    .add(series_name="南方献词", data_pair=items, word_size_range=[6, 66])
    # Set the chart title
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="南方献词", title_textstyle_opts=opts.TextStyleOpts(font_size=23)
        ),
        tooltip_opts=opts.TooltipOpts(is_show=True),
    )
    # Render inline in a Jupyter notebook; outside a notebook, use .render("wordcloud.html")
    .render_notebook()
)

1.4 Static word cloud from a given document with a given stop-word list

import jieba
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from imageio import imread

# Read the text
with open('work/中文词云图.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Read the stop words into a set for fast lookup
with open('work/停用词.txt', encoding='UTF-8') as f:
    stopwords = {line.strip() for line in f}

# Segment the text
words = jieba.cut(text, cut_all=False, HMM=True)

# Clean the tokens: drop stop words, spaces, and single-character words
mytext_list = []
for seg in words:
    if seg not in stopwords and seg != " " and len(seg) != 1:
        mytext_list.append(seg.replace(" ", ""))
cloud_text = ",".join(mytext_list)

# Read the mask image (raw string, so the backslashes are not treated as escapes)
jpg = imread(r"C:\Users\lx\Desktop\大学\指定文档和指定停用词.jpeg")

# Create the word cloud
wordcloud = WordCloud(
      mask=jpg,  # mask image
      background_color="white",  # background color
      font_path='work/MSYH.TTC',  # font with Chinese glyphs
      width=1500,  # width (ignored when a mask is set)
      height=960,  # height (ignored when a mask is set)
      margin=10
).generate(cloud_text)

# Draw the image
plt.imshow(wordcloud)
# Hide the axes
plt.axis("off")
# Show the figure
plt.show()

2.1 Naive Bayes model

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import (
    cross_val_score,
    learning_curve,
    train_test_split,
    validation_curve,
)
from sklearn.naive_bayes import GaussianNB

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly
pd.set_option('display.max_columns', None)    # show all columns
pd.set_option('display.max_rows', None)       # show all rows
warnings.filterwarnings('ignore')

data = pd.read_csv(r"D:\card_transdata.csv")  # load the data
x = data.drop(columns=['fraud'])
y = data['fraud']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)  # random train/test split

model = GaussianNB()
model.fit(x_train, y_train)            # fit() takes the training features and targets
y_pred = model.predict(x_test)         # predict() takes the features to predict on
acc = accuracy_score(y_test, y_pred)   # accuracy from the true and predicted labels
print(acc)

print(pd.Series(y_pred).value_counts())  # predicted class distribution
print(y_test.value_counts())             # true class distribution

# Cross-validation
score = cross_val_score(GaussianNB(), x, y, cv=5)
print("Cross-validation scores: {}".format(score))
print("Mean cross-validation score: {}".format(score.mean()))

# Validation curve over var_smoothing
var_smoothing = [2, 4, 6]
train_score, val_score = validation_curve(model, x, y,
                                          param_name='var_smoothing',
                                          param_range=var_smoothing, cv=5, scoring='accuracy')
plt.plot(var_smoothing, np.median(train_score, 1), color='blue', label='training score')
plt.plot(var_smoothing, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
# plt.ylim(0, 0.1)
plt.xlabel('var_smoothing')
plt.ylabel('score')
plt.show()

# Grid search: GaussianNB has very few hyperparameters (mainly var_smoothing,
# explored above), so no grid search is run here

# Learning curve (learning_curve returns scores, not losses)
train_sizes, train_scores, val_scores = learning_curve(
    model, x, y,
    cv=5,
    train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1])
train_scores_mean = np.mean(train_scores, axis=1)
val_scores_mean = np.mean(val_scores, axis=1)
plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training')
plt.plot(train_sizes, val_scores_mean, 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

# Evaluation metrics (true labels first, predictions second)
model.fit(x_train, y_train)
y_pred1 = model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

# Confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
plt.show()

# ROC curve (plot_roc_curve was removed in scikit-learn 1.2)
RocCurveDisplay.from_estimator(model, x_test, y_test)
plt.show()

2.2 Support vector machine (SVM)

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    learning_curve,
    train_test_split,
    validation_curve,
)

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly
pd.set_option('display.max_columns', None)    # show all columns
pd.set_option('display.max_rows', None)       # show all rows
warnings.filterwarnings('ignore')

data = pd.read_csv(r"D:\card_transdata.csv")
x = data.drop(columns=['fraud'])
y = data['fraud']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

svm_model = svm.SVC(kernel="rbf", gamma="auto", cache_size=5000)
svm_model.fit(x_train, y_train)
y_pred = svm_model.predict(x_test)
acc = accuracy_score(y_test, y_pred)
print(acc)

print(pd.Series(y_pred).value_counts())  # predicted class distribution
print(y_test.value_counts())             # true class distribution

# Grid search over the kernel (the parameter name is lowercase "kernel")
param_grid = {'kernel': ["linear", "rbf", "sigmoid"]}
grid = GridSearchCV(svm_model, param_grid)
grid.fit(x_train, y_train)
print(grid.best_params_)

# Best model found by the search
svm_model = grid.best_estimator_
# Predictions on the training and test sets
y_pred1 = svm_model.predict(x_train)
y_pred2 = svm_model.predict(x_test)
print(y_pred1)
print(y_pred2)

# Cross-validation of the SVM model
score = cross_val_score(svm_model, x, y, cv=5)
print("Cross-validation scores: {}".format(score))
print("Mean cross-validation score: {}".format(score.mean()))

# Validation curve over the kernel
kernels = ["linear", "rbf", "sigmoid"]
train_score, val_score = validation_curve(svm_model, x, y,
                                          param_name='kernel',
                                          param_range=kernels, cv=5, scoring='accuracy')
plt.plot(kernels, np.median(train_score, 1), color='blue', label='training score')
plt.plot(kernels, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.xlabel('kernel')
plt.ylabel('score')
plt.show()

# Learning curve (learning_curve returns scores, not losses)
train_sizes, train_scores, val_scores = learning_curve(svm_model, x, y, cv=5, train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1])
train_scores_mean = np.mean(train_scores, axis=1)
val_scores_mean = np.mean(val_scores, axis=1)
plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training')
plt.plot(train_sizes, val_scores_mean, 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

# Evaluation metrics (true labels first, predictions second)
y_pred1 = svm_model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

# Confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(svm_model, x_test, y_test)
plt.show()

# ROC curve (plot_roc_curve was removed in scikit-learn 1.2)
RocCurveDisplay.from_estimator(svm_model, x_test, y_test)
plt.show()

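2.3 Grid search

# Grid search over the SVM kernel (the parameter name is lowercase "kernel")
param_grid = {'kernel': ["linear", "rbf", "sigmoid"]}
grid = GridSearchCV(svm_model, param_grid)
grid.fit(x_train, y_train)
print(grid.best_params_)

GaussianNB has very few hyperparameters (mainly var_smoothing), so no grid search is run for the naive Bayes model.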

2.4 Learning curve

# Learning curve (learning_curve returns scores, not losses)
train_sizes, train_scores, val_scores = learning_curve(
    model, x, y, cv=5, train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1])
train_scores_mean = np.mean(train_scores, axis=1)
val_scores_mean = np.mean(val_scores, axis=1)
plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training')
plt.plot(train_sizes, val_scores_mean, 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

2.5 Evaluation metrics: accuracy, F1, recall, precision

# Evaluation metrics (true labels first, predictions second)
model.fit(x_train, y_train)
y_pred1 = model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

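2.6 Confusion matrix

# plot_confusion_matrix was removed in scikit-learn 1.2
# Requires: from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
plt.show()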

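2.7 ROC curve

# plot_roc_curve was removed in scikit-learn 1.2
# Requires: from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(model, x_test, y_test)
plt.show()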
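
To show what the ROC plot actually draws, the sketch below computes the curve's points and the area under it by hand. It sweeps the decision threshold from the highest score down and records the false positive rate and true positive rate at each step. The function names and the four sample scores are illustrative assumptions, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of spam above legitimate messages, so the area is a threshold-independent summary of the curve.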