NLTK Toolkit

Installing the NLTK Toolkit

NLTK is a very practical text-processing toolkit with a long history, used mainly for English-language data.

Installation command:
pip install nltk

import nltk
# nltk.download()
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\26388\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.

True
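
If the default download location is inconvenient, nltk.download also accepts a download_dir argument, and nltk.data.path lists the directories NLTK searches when loading resources. A minimal sketch (the D:/nltk_data path below is just a placeholder):

# Download a single resource into a custom directory (path is hypothetical)
nltk.download('punkt', download_dir='D:/nltk_data')
# Make sure NLTK also searches that directory when loading data
nltk.data.path.append('D:/nltk_data')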

Tokenization

from nltk.tokenize import word_tokenize
from nltk.text import Text
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, we have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]
tokens[:5]
['today', "'s", 'weather', 'is', 'good']
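
Besides word-level tokenization, NLTK also provides sent_tokenize for splitting text into sentences (it relies on the same punkt resource). A minimal sketch:

from nltk.tokenize import sent_tokenize
text = "Today's weather is good. We have no classes in the afternoon."
sent_tokenize(text)
# Expected result (roughly): ["Today's weather is good.", 'We have no classes in the afternoon.']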

The Text Object

# Help documentation for the nltk.text module
help(nltk.text)

Create a Text object to make subsequent operations easier.

t = Text(tokens)
# Count the occurrences of the word 'good'
t.count('good')
1
# Find the position (index) of the word 'good'
t.index('good')
4
# Plot the 8 most frequent words
t.plot(8)


[Frequency distribution plot of the top 8 words]

<AxesSubplot:xlabel='Samples', ylabel='Counts'>
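
Beyond count, index, and plot, the Text object offers other exploration helpers such as concordance (show a word with its surrounding context) and vocab (a frequency distribution of all tokens). A minimal sketch continuing with the t object above:

# Print every occurrence of 'we' in context
t.concordance('we')
# Frequency distribution of all tokens; most_common returns the top entries
t.vocab().most_common(5)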

Stop Word Filtering

from nltk.corpus import stopwords
stopwords.readme().replace('\n', '')
'Stopwords CorpusThis corpus contains lists of stop words for several languages.  Theseare high-frequency grammatical words which are usually ignored in textretrieval applications.They were obtained from:http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/The stop words for the Romanian language were obtained from:http://arlc.ro/resources/The English list has been augmentedhttps://github.com/nltk/nltk_data/issues/22The German list has been correctedhttps://github.com/nltk/nltk_data/pull/49A Kazakh list has been addedhttps://github.com/nltk/nltk_data/pull/52A Nepali list has been addedhttps://github.com/nltk/nltk_data/pull/83An Azerbaijani list has been addedhttps://github.com/nltk/nltk_data/pull/100A Greek list has been addedhttps://github.com/nltk/nltk_data/pull/103An Indonesian list has been addedhttps://github.com/nltk/nltk_data/pull/112'
# List the languages for which NLTK provides stop word lists
stopwords.fileids()
['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']
# View the English stop word list
stopwords.raw('english').replace('\n', ' ')
"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "
test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)
# Which stop words appear in the given text?
test_words_set.intersection(set(stopwords.words('english')))
{'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'}

Filtering Out Stop Words

# Keep only the words that are not in the English stop word list
filtered = [w for w in test_words_set if w not in stopwords.words('english')]
filtered
['afternoon',
 "'s",
 'classes',
 'basketball',
 'today',
 ',',
 'play',
 'weather',
 'windy',
 'sunny',
 'tomorrow',
 '.',
 'good']
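
Since stopwords.words('english') returns a plain list, calling it once and converting it to a set makes the membership test faster when filtering larger texts. A minimal sketch of the same filtering step:

# Build the stop word set once, then filter against it
stop_set = set(stopwords.words('english'))
filtered = [w for w in test_words_set if w not in stop_set]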

Part-of-Speech Tagging

from nltk import pos_tag
tags = pos_tag(tokens)
tags
[('today', 'NN'),
 ("'s", 'POS'),
 ('weather', 'NN'),
 ('is', 'VBZ'),
 ('good', 'JJ'),
 (',', ','),
 ('very', 'RB'),
 ('windy', 'JJ'),
 ('and', 'CC'),
 ('sunny', 'JJ'),
 (',', ','),
 ('we', 'PRP'),
 ('have', 'VBP'),
 ('no', 'DT'),
 ('classes', 'NNS'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('afternoon', 'NN'),
 (',', ','),
 ('we', 'PRP'),
 ('have', 'VBP'),
 ('to', 'TO'),
 ('play', 'VB'),
 ('basketball', 'NN'),
 ('tomorrow', 'NN'),
 ('.', '.')]
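
The tags follow the Penn Treebank tagset. If a tag is unfamiliar, nltk.help.upenn_tagset prints its definition and examples (this assumes the 'tagsets' resource has been downloaded). A minimal sketch:

# nltk.download('tagsets')  # needed once before using the help function
nltk.help.upenn_tagset('NN')   # noun, singular or mass
nltk.help.upenn_tagset('VBZ')  # verb, 3rd person singular present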

Named Entity Recognition

from nltk import ne_chunk
sentence = "Edison went to Tsinghua University today."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
(S
  (PERSON Edison/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Tsinghua/NNP University/NNP)
  today/NN
  ./.)
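
ne_chunk returns an nltk.Tree, so the recognized entities can be collected by walking its labelled subtrees. A minimal sketch using the same sentence:

tree = ne_chunk(pos_tag(word_tokenize(sentence)))
for subtree in tree.subtrees():
    # Skip the root 'S' node; keep labelled chunks such as PERSON or ORGANIZATION
    if subtree.label() != 'S':
        print(subtree.label(), '->', ' '.join(word for word, tag in subtree.leaves()))
# Expected result (roughly):
# PERSON -> Edison
# ORGANIZATION -> Tsinghua University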

Data Cleaning Example

import re
from nltk.corpus import stopwords
# Input data
s = ' RT @Amila #Test\nDurant\'s newly listed Co &amp;Mary\'s unlisted  Group to supply tech for nlTK. \nh $TSLA $AAPL https:// t.co/x34afsfQsh'

# Cache the English stop word list
cache_english_stopwords = stopwords.words('english')

def text_clean(text):
    print('Original data:', text, '\n')
    
    # Remove HTML entities, hashtags, and @mentions
    text_no_special_entities = re.sub(r'\&\w*;|#\w*|@\w*', '', text)
    print('After removing special tags:', text_no_special_entities, '\n')
    
    # Remove ticker symbols such as $TSLA
    text_no_tickers = re.sub(r'\$\w*', '', text_no_special_entities)
    print('After removing ticker symbols:', text_no_tickers, '\n')
    
    # Remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', text_no_tickers)
    print('After removing hyperlinks:', text_no_hyperlinks, '\n')
    
    # Remove very short words (1-2 characters), e.g. abbreviations
    text_no_small_words = re.sub(r'\b\w{1,2}\b', '', text_no_hyperlinks)
    print('After removing short words:', text_no_small_words, '\n')
    
    # Collapse repeated whitespace
    text_no_whitespace = re.sub(r'\s\s+', ' ', text_no_small_words)
    print('After removing extra whitespace:', text_no_whitespace, '\n')
    
    # Tokenize
    tokens = word_tokenize(text_no_whitespace)
    print('Tokenization result:', tokens, '\n')
    
    # Remove stop words
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print('After removing stop words:', list_no_stopwords, '\n')
    # Join the filtered tokens back into a string
    text_filtered = ' '.join(list_no_stopwords)
    print('Filtered text:', text_filtered)
    
text_clean(s)
Original data:  RT @Amila #Test
Durant's newly listed Co &amp;Mary's unlisted  Group to supply tech for nlTK. 
h $TSLA $AAPL https:// t.co/x34afsfQsh 

After removing special tags:  RT  
Durant's newly listed Co Mary's unlisted  Group to supply tech for nlTK. 
h $TSLA $AAPL https:// t.co/x34afsfQsh 

After removing ticker symbols:  RT  
Durant's newly listed Co Mary's unlisted  Group to supply tech for nlTK. 
h   https:// t.co/x34afsfQsh 

After removing hyperlinks:  RT  
Durant's newly listed Co Mary's unlisted  Group to supply tech for nlTK. 
h    

After removing short words:    
Durant' newly listed  Mary' unlisted  Group  supply tech for nlTK. 

After removing extra whitespace:  Durant' newly listed Mary' unlisted Group supply tech for nlTK.  

Tokenization result: ['Durant', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'for', 'nlTK', '.'] 

After removing stop words: ['Durant', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'nlTK', '.'] 

Filtered text: Durant ' newly listed Mary ' unlisted Group supply tech nlTK .
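
text_clean above only prints each intermediate step and returns None. If the cleaned text is needed downstream, a small variant can return the filtered tokens instead; a minimal sketch reusing the same regular expressions and the cached stop word list:

def clean_tokens(text):
    # Same steps as text_clean, but returning the result instead of printing it
    text = re.sub(r'\&\w*;|#\w*|@\w*', '', text)    # HTML entities, hashtags, mentions
    text = re.sub(r'\$\w*', '', text)               # ticker symbols
    text = re.sub(r'https?:\/\/.*\/\w*', '', text)  # hyperlinks
    text = re.sub(r'\b\w{1,2}\b', '', text)         # very short words
    text = re.sub(r'\s\s+', ' ', text)              # extra whitespace
    tokens = word_tokenize(text)
    return [w for w in tokens if w not in cache_english_stopwords]

clean_tokens(s)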


一万台服务器用saltstack还是ansible? 选择使用 SaltStack 还是 Ansible 来管理一万台服务器&#xff0c;取决于几个关键因素&#xff0c;如性能、扩展性、易用性、配置管理需求和团队的熟悉度。以下是两者的对比分析&#xff0c;帮助你做出决策&#xff1a; SaltStack&…