项目实战--MySQL实现分词模糊匹配

一、需求描述

推广人员添加公司到系统时,直接填写公司简称,而公司全称可能之前已经被添加过,为防止添加重复的公司,所以管理员在针对公司信息审批之前,需要查看以往添加的公司信息里是否有相同公司。

二、方案

技术层面需要考虑实现的功能点:
• 分词
• 与库里已有数据进行匹配
• 按照匹配度对结果进行排序

分词功能有现成的分词器,所以整个需求的核心重点是如何与数据库中的数据匹配并按照匹配度排序。
方案可以选择:
• 方案一:引入ES
• 方案二:利用MySQL实现

若系统规模较小,单纯为实现这个功能引入ES成本较大,还要涉及到数据同步等问题,系统复杂性会提高,所以尽量使用MySQL已有的功能进行实现。
MySQL提供了以下三种模糊搜索的方式:

• like匹配: 要求模式串与整个目标字段完全匹配;
• RegExp正则匹配: 要求目标字段包含模式串即可;
• Fulltext全文索引: 在字段类型为CHARVARCHARTEXT的列上创建全文索引,执行SQL进行查询。

针对于上述业务场景,对相关技术进行优劣分析:

• like匹配: 无法满足需求,所以pass;
• 全文索引: 可定制性差,不支持任意匹配查询,pass;
• 正则匹配: 可实现任意模式匹配,缺点在于执行效率不如全文索引。

针对于这个场景,记录数目相对来说没有那么多,所以对于效率稍低的结果可以接受,因此技术选型方面采用RegExp正则匹配来实现模糊匹配的需求。

三、代码实现

1.提取公司关键信息
对输入的公司名称去除无用信息,保留关键信息。这里的无用信息指的是地名,圆括号,以及集团,股份,有限等。

/**
 * 匹配前去除公司名称的无意义信息
 * @param targetCompanyName
 * @return
 */
private String formatCompanyName(String targetCompanyName){

Stringregex="(?<province>[^省]+自治区|.*?省|.*?行政区|.*?市)"+
"?(?<city>[^市]+自治州|.*?地区|.*?行政单位|.+盟|市辖区|.*?市|.*?县)"+
"?(?<county>[^(区|市|县|旗|岛)]+区|.*?市|.*?县|.*?旗|.*?岛)"+
"?(?<village>.*)";
Matchermatcher=Pattern.compile(regex).matcher(targetCompanyName);
while(matcher.find()){
Stringprovince= matcher.group("province");
        log.info("province:{}",province);
if(StringUtils.isNotBlank(province)&& targetCompanyName.contains(province)){
            targetCompanyName = targetCompanyName.replace(province,"");
}
        log.info("处理完省份的公司名称:{}",targetCompanyName);
Stringcity= matcher.group("city");
        log.info("city:{}",city);
if(StringUtils.isNotBlank(city)&& targetCompanyName.contains(city)){
            targetCompanyName = targetCompanyName.replace(city,"");
}
        log.info("处理完城市的公司名称:{}",targetCompanyName);
Stringcounty= matcher.group("county");
        log.info("county:{}",county);
if(StringUtils.isNotBlank(county)&& targetCompanyName.contains(county)){
            targetCompanyName = targetCompanyName.replace(county,"");
}
        log.info("处理完区县级的公司名称:{}",targetCompanyName);
}
String[][] address =AddressUtil.ADDRESS;
for(String[] city: address){
for(String b : city ){
if(targetCompanyName.contains(b)){
                targetCompanyName = targetCompanyName.replace(b,"");
}
}
}
    log.info("处理后的公司名称:{}",targetCompanyName);
return targetCompanyName;
}

地名工具类

public class AddressUtil{
public static final String[][] ADDRESS ={
{"北京"},
{"天津"},
{"安徽","安庆","蚌埠","亳州","巢湖","池州","滁州","阜阳","合肥","淮北","淮南","黄山","六安","马鞍山","宿州","铜陵","芜湖","宣城"},
{"澳门"},
{"香港"},
{"福建","福州","龙岩","南平","宁德","莆田","泉州","厦门","漳州"},
{"甘肃","白银","定西","甘南藏族自治州","嘉峪关","金昌","酒泉","兰州","临夏回族自治州","陇南","平凉","庆阳","天水","武威","张掖"},
{"广东","潮州","东莞","佛山","广州","河源","惠州","江门","揭阳","茂名","梅州","清远","汕头","汕尾","韶关","深圳","阳江","云浮","湛江","肇庆","中山","珠海"},
{"广西","百色","北海","崇左","防城港","贵港","桂林","河池","贺州","来宾","柳州","南宁","钦州","梧州","玉林"},
{"贵州","安顺","毕节地区","贵阳","六盘水","黔东南苗族侗族自治州","黔南布依族苗族自治州","黔西南布依族苗族自治州","铜仁地区","遵义"},
{"海南","海口","三亚","直辖县级行政区划"},
{"河北","保定","沧州","承德","邯郸","衡水","廊坊","秦皇岛","石家庄","唐山","邢台","张家口"},
{"河南","安阳","鹤壁","焦作","开封","洛阳","漯河","南阳","平顶山","濮阳","三门峡","商丘","新乡","信阳","许昌","郑州","周口","驻马店"},
{"黑龙江","大庆","大兴安岭地区","哈尔滨","鹤岗","黑河","鸡西","佳木斯","牡丹江","七台河","齐齐哈尔","双鸭山","绥化","伊春"},
{"湖北","鄂州","恩施土家族苗族自治州","黄冈","黄石","荆门","荆州","十堰","随州","武汉","咸宁","襄樊","孝感","宜昌"},
{"湖南","长沙","常德","郴州","衡阳","怀化","娄底","邵阳","湘潭","湘西土家族苗族自治州","益阳","永州","岳阳","张家界","株洲"},
{"吉林","白城","白山","长春","吉林","辽源","四平","松原","通化","延边朝鲜族自治州"},
{"江苏","常州","淮安","连云港","南京","南通","苏州","宿迁","泰州","无锡","徐州","盐城","扬州","镇江"},
{"江西","抚州","赣州","吉安","景德镇","九江","南昌","萍乡","上饶","新余","宜春","鹰潭"},
{"辽宁","鞍山","本溪","朝阳","大连","丹东","抚顺","阜新","葫芦岛","锦州","辽阳","盘锦","沈阳","铁岭","营口"},
{"内蒙古","阿拉善盟","巴彦淖尔","包头","赤峰","鄂尔多斯","呼和浩特","呼伦贝尔","通辽","乌海","乌兰察布","锡林郭勒盟","兴安盟"},
{"宁夏回族","固原","石嘴山","吴忠","银川","中卫"},
{"青海","果洛藏族自治州","海北藏族自治州","海东地区","海南藏族自治州","海西蒙古族藏族自治州","黄南藏族自治州","西宁","玉树藏族自治州"},
{"山东","滨州","德州","东营","菏泽","济南","济宁","莱芜","聊城","临沂","青岛","日照","泰安","威海","潍坊","烟台","枣庄","淄博"},
{"山西","长治","大同","晋城","晋中","临汾","吕梁","朔州","太原","忻州","阳泉","运城"},
{"陕西","安康","宝鸡","汉中","商洛","铜川","渭南","西安","咸阳","延安","榆林"},
{"上海"},
{"四川","阿坝藏族羌族自治州","巴中","成都","达州","德阳","甘孜藏族自治州","广安","广元","乐山","凉山彝族自治州","泸州","眉山","绵阳","内江","南充","攀枝花","遂宁","雅安","宜宾","资阳","自贡"},
{"西藏","阿里地区","昌都地区","拉萨","林芝地区","那曲地区","日喀则地区","山南地区"},
{"新疆维吾尔","阿克苏地区","阿勒泰地区","巴音郭楞蒙古自治州","博尔塔拉蒙古自治州","昌吉回族自治州","哈密地区","和田地区","喀什地区","克拉玛依","克孜勒苏柯尔克孜自治州","塔城地区","吐鲁番地区","乌鲁木齐","伊犁哈萨克自治州","直辖县级行政区划"},
{"云南","保山","楚雄彝族自治州","大理白族自治州","德宏傣族景颇族自治州","迪庆藏族自治州","红河哈尼族彝族自治州","昆明","丽江","临沧","怒江僳僳族自治州","普洱","曲靖","文山壮族苗族自治州","西双版纳傣族自治州","玉溪","昭通"},
{"浙江","杭州","湖州","嘉兴","金华","丽水","宁波","衢州","绍兴","台州","温州","舟山"},
{"重庆"},
{"台湾","台北","高雄","基隆","台中","台南","新竹","嘉义"},
};
}

2 分词相关代码
pom文件:引入IK分词器相关依赖

 <!-- ikAnalyzer 中文分词器  -->
<dependency>
<groupId>com.janeluo</groupId>
<artifactId>ikanalyzer</artifactId>
<version>2012_u6</version>
<exclusions>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
</exclusion>
</exclusions>
</dependency>

<!--  lucene-queryParser 查询分析器模块 -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>7.3.0</version>
</dependency>

IKAnalyzerSupport类:用于配置分词器

@Slf4j
public class IKAnalyzerSupport{
/**
     * IK分词
     * @param target
     * @return
     */
public static List<String> iKSegmenterToList(String target) throwsException{
if(StringUtils.isEmpty(target)){
returnnewArrayList();
}
List<String> result =newArrayList<>();
StringReadersr=newStringReader(target);
// false:关闭智能分词 (对分词的精度影响较大)
IKSegmenterik=newIKSegmenter(sr,true);
Lexeme lex;
while((lex=ik.next())!=null){
StringlexemeText= lex.getLexemeText();
            result.add(lexemeText);
}
return result;
}
}

ServiceImpl类:进行分词处理

 /**
 * 对目标公司名称进行分词
 * @param targetCompanyName
 * @return
 */
private String splitWord(String targetCompanyName){
    log.info("对处理后端公司名称进行分词");

List<String> splitWord =newArrayList<>();
Stringresult= targetCompanyName;
try{
        splitWord = iKSegmenterToList(targetCompanyName);
        result =  splitWord.stream().map(String::valueOf).distinct().collect(Collectors.joining("|"));
        log.info("分词结果:{}",result);
}catch(Exception e){
        log.error("分词报错:{}",e.getMessage());
}
return result;
}

3 匹配
ServiceImpl类:匹配核心代码

public JsonResult matchCompanyName(CompanyDTO companyDTO, String accessToken, String localIp){
// 对公司名称进行处理
StringsourceCompanyName= companyDTO.getCompanyName();
StringtargetCompanyName= sourceCompanyName;
    log.info("处理前公司名称:{}",targetCompanyName);
// 处理圆括号
    targetCompanyName = targetCompanyName.replaceAll("[(]|[)]|[(]|[)]","");
// 处理公司相关关键词
    targetCompanyName = targetCompanyName.replaceAll("[(集团|股份|有限|责任|分公司)]","");

if(!targetCompanyName.contains("银行")){
// 去除行政区域
        targetCompanyName = formatCompanyName(targetCompanyName);
}
// 分词
StringsplitCompanyName= splitWord(targetCompanyName);
//  匹配
List<Company> matchedCompany = companyRepository.queryMatchCompanyName(splitCompanyName,targetCompanyName);

List<String> result =newArrayList();
for(Company companyInfo : matchedCompany){
        result.add(companyInfo.getCompanyName());
if(companyDTO.getCompanyId().equals(companyInfo.getCompanyId())){
            result.remove(companyInfo.getCompanyName());
}
}
returnJsonResult.successResult(result);
}

Repository类:编写SQL语句

/**  
* 模糊匹配公司名称  
* @param companyNameRegex 分词后的公司名称
* @param companyName 分词前的公司名称  
* @return  
*/
@Query(value = 
"SELECT * FROM company WHERE isDeleted = '0' and companyName REGEXP ?1 
ORDER BY length(REPLACE(companyName,?2,''))/length(companyName) ",
nativeQuery = true)  
List<Company> queryMatchCompanyName(String companyNameRegex,String companyName);

按照匹配度排序这个功能点,LENGTH(companyName)返回companyName的长度,LENGTH(REPLACE(companyName, ?2, ‘’))计算出companyName中关键词出现的次数。通过这种方式,可以根据匹配程度进行排序,匹配次数越多的公司名称排序越靠前。

四、效果展示

在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mfbz.cn/a/768103.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

项目2:API Hunter 细节回顾 -1

一. 接口调用 对于开发者来说&#xff0c;接口的调用应当是方便快捷的&#xff0c;而且出于安全考虑&#xff0c;通常会选择在后端调用第三方 API&#xff0c;避免在前端暴露诸如密码的敏感信息。 若采用 HTTP 调用方式&#xff1a; HttpClientRestTemplate第三方库&#xf…

【JavaWeb】登录校验-会话技术(一)Cookie与Session

登录校验 实现登陆后才能访问后端系统页面&#xff0c;不登陆则跳转登陆页面进行登陆。 首先我们在宏观上先有一个认知&#xff1a; HTTP协议是无状态协议。即每一次请求都是独立的&#xff0c;下一次请求并不会携带上一次请求的数据。 因此当我们通过浏览器访问登录后&#…

py黑帽子学习笔记_burp

配置burp kali虚机默认装好了社区版burp和java&#xff0c;其他os需要手动装 burp是用java&#xff0c;还得下载一个jython包&#xff0c;供burp用 配apt国内源&#xff0c;然后apt install jython --download-only&#xff0c;会只下载包而不安装&#xff0c;下载的目录搜一…

【JVM-05】Java内存区域(运行时数据区)、对象创建过程、内存布局

【JVM-05】Java内存区域即运行时数据区、对象创建过程、内存布局 1. 介绍下Java内存区域(运行时数据区)1.1 程序计数器(线程私有)1.2 虚拟机栈(线程私有)1.3 本地方法栈(线程私有)1.4 Java堆(线程共享)1.5 方法区(线程共享)1.5.1 方法区和永久代的关系1.5.2 常用参数1.5.3 为什…

用随机森林算法进行的一次故障预测

本案例将带大家使用一份开源的S.M.A.R.T.数据集和机器学习中的随机森林算法&#xff0c;来训练一个硬盘故障预测模型&#xff0c;并测试效果。 实验目标 掌握使用机器学习方法训练模型的基本流程&#xff1b;掌握使用pandas做数据分析的基本方法&#xff1b;掌握使用scikit-l…

AWS的收费方式与计费项目

AWS&#xff08;亚马逊云服务&#xff09;是全球领先的云计算服务提供商&#xff0c;为各种规模和类型的企业提供了广泛的云计算服务。AWS的收费方式和计费项目是许多用户关心的问题&#xff0c;因为了解这些信息有助于企业更好地规划和控制成本。本文将介绍AWS的收费方式和常见…

MobileVitv1替换yolov8主干网络

一、原理介绍 MobileViT模型是为移动设备设计的轻量级、通用目的视觉变换器。它融合了卷积神经网络&#xff08;CNN&#xff09;和视觉变换器&#xff08;ViT&#xff09;的优势&#xff0c;旨在在保持高效性能的同时减少模型参数和降低延迟。以下是关于MobileViT模型的主要原理…

MATLAB中findall用法

目录 语法 说明 示例 查找具有可见或隐藏句柄的图窗 查找句柄处于隐藏状态的对象 查找 Text 对象 提示 findall的功能是查找所有图形对象。 语法 h findall(objhandles) h findall(objhandles,prop1,value1,...,propN,valueN) 说明 h findall(objhandles) 返回 ob…

电脑提示你的msvcr100.dll丢失要如何解决?一键修复msvcr100.dll的解决方法

电脑提示你的msvcr100.dll丢失要如何解决&#xff1f;首先我们就要了解msvcr100.dll这个文件&#xff0c;了解前因后果&#xff0c;这样我们才能知道使用什么方法去修复它。今天主要就是来给大家详细的讲解一下msvcr100.dll文件&#xff0c;并且教各种的修复方法&#xff0c;每…

excel表格如何换行,这几个操作方法要收藏好

Excel表格作为一款强大的数据处理工具&#xff0c;在日常工作和生活中被广泛应用。当需要在单元格内显示较长的文本内容或使数据更加清晰易读时&#xff0c;我们需要掌握一些换行技巧。下面将介绍几种常用的Excel换行方法&#xff1a; 一、使用快捷键换行 1、首先&#xff0c;…

介绍一些好玩且实用的开源的AI工具

介绍一些好玩且实用的开源的AI工具 随着人工智能技术的迅猛发展&#xff0c;开源社区涌现出了许多关于AI的项目&#xff0c;这些项目不仅展示了技术的创新力&#xff0c;也为开发者提供了丰富的工具和资源。本文将介绍几个既有趣又实用的开源人工智能工具&#xff0c;它们不仅…

前端Web开发HTML5+CSS3+移动web视频教程 Day4 CSS 第2天

P44 - P 四个知识点&#xff1a; 复合选择器 CSS特性 背景属性 显示模式 复合选择器 复合选择器仍然是选择器&#xff0c;只要是选择器&#xff0c;作用就是找标签。复合选择器就是把基础选择器进行组合使用。组合了之后就可以在大量的标签里面更快更精准地找标签了。找…

轻度图像处理工具,匹敌photoshop

一、简介 1、一款功能强大的在线图片编辑工具,用户可以将其安装为渐进式网页应用(PWA)。它提供了与 Photoshop 相似的核心功能,能够满足大多数图像编辑需求,非常适合那些不愿或无法安装 Photoshop 的用户。即使使用免费版本,用户也能享受所有功能,是轻度图像处理的理想选…

20240703在飞凌OK3588-C开发板上刷Rockchip原厂的Buildroot20220811

20240703在飞凌OK3588-C开发板上刷Rockchip原厂的Buildroot20220811 2024/7/3 18:25 详细的刷机LOG&#xff1a; [BEGIN] 2024/7/3 18:18:49 rootRK3588:/# DDR Version V1.07 20220412 LPDDR4X, 2112MHz channel[0] BW16 Col10 Bk8 CS0 Row16 CS1 Row16 CS2 Die BW16 Size204…

【高级篇】第9章 Elasticsearch 监控与故障排查

9.1 引言 在现代数据驱动的应用架构中,Elasticsearch不仅是海量数据索引和搜索的核心,其稳定性和性能直接影响到整个业务链路的健康度。因此,建立有效的监控体系和掌握故障排查技能是每一位Elasticsearch高级专家的必备能力。 9.2 监控工具:洞察与优化的利器 在Elastics…

MySQL加个索引都可能丢数据,这个坑你知道吗?

前言 近期&#xff0c;我们收到一位数据库运维小伙伴的咨询&#xff0c;他们有一个MySQL 5.6的数据库&#xff0c;需要对核心支付表做DDL加索引&#xff0c;咨询我们如何加索引更优雅。基于DBA经验&#xff0c;给表添加索引主要有以下几种方式&#xff1a; 用MySQL原生的DDL语…

BeautifulSoup 类通过查找方法选取节点

BeautifulSoup 类提供了一些基于 HTML 或 XML 节点树选取节点的方法&#xff0c;其中比较主流 的两个方法是 find() 方法和 find_all() 方法。 find() 方法用于查找符合条件的第一个节点&#xff1b; find_all() 方法用于查找所有符合条件的节点&#xff0c;并以列表的…

分页导航DOM更新实践:JavaScript与jQuery的结合使用

分页导航DOM更新实践&#xff1a;JavaScript与jQuery的结合使用 在Web开发中&#xff0c;分页导航是展示大量数据时不可或缺的UI组件。合理的分页不仅可以提高应用性能&#xff0c;还能优化用户体验。本博客将通过一个实际的DOM结构和模拟数据&#xff0c;讲解如何使用JavaScr…

计算机网络部分知识点整理

停止等待协议的窗口尺寸为 1。 √以太网标准是IEEE802.3TCP/IP四层&#xff0c;OSI模型有7层&#xff0c;地址解析协议 ARP 在 OSI 参考七层协议属于数据链路层&#xff0c;在TCP/IP 协议属于网络层&#xff0c;ARP作用&#xff1a;将 IP 地址映射到第二层地址&#xff0c;交换…

Zabbix 配置PING监控

Zabbix PING监控介绍 如果需要判断机房的网络或者主机是否正常&#xff0c;这就需要使用zabbix ping&#xff0c;Zabbix使用外部命令fping处理ICMP ping的请求&#xff0c;在基于ubuntu APT方式安装zabbix后默认已存在fping程序。另外zabinx_server配置文件参数FpingLocation默…