24. Regular Expressions

1. What Is a Regular Expression

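A regular expression (regex) is a text pattern. It uses a single string to describe and match strings that share a common format, and is typically used to find or replace text that fits a given pattern (rule). The core purpose of regular expressions is text processing. Regular expressions are not tied to any one programming language, but the syntax differs slightly from language to language.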

The C++ standard library has supported regular expressions since C++11 through the <regex> header. The library provides a set of classes and functions for regex matching, searching, and replacing.

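Regular expression object: the std::regex class holds a compiled regular expression.
Matching operations:
- std::regex_match() function: checks whether an entire string matches a regular expression.
- std::regex_search() function: searches a string for a part that matches the regular expression.
- std::regex_replace() function: replaces the text that matches the regular expression.
- std::sregex_iterator iterator class: iterates over all non-overlapping substrings of a std::string that match the regular expression. Dereferencing it yields a std::smatch object, which holds the result of one match.
Match results: the std::smatch class stores a match result, including the matched substring, its sub-matches, and related information such as the match position.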

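C++ offers two main ways to apply a pattern: searching, which looks for the pattern anywhere in the string, and matching, which tests whether the entire string, from its start to its end, fits the pattern. Searching is done with the regex_search() function, and matching with the regex_match() function.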

regex_match() always tries to match the pattern against the whole string, starting from the beginning. It returns true if the match succeeds and false otherwise.

bool regex_match(const std::string& text, const std::regex& e);
bool regex_match(const std::string& text, std::smatch& match, const std::regex& e);

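Parameter text is the input string to match.
Parameter e is a std::regex object representing the regular expression.
Parameter match is a std::smatch object (or a similar match-results type) that receives the result. On success it holds the whole match as well as any groups (sub-matches in parentheses).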

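regex_search() takes the same arguments as regex_match(), but instead of requiring the whole string to match, it scans the string and reports the first (leftmost) substring that matches the pattern. It returns true if such a substring is found and false otherwise.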

bool regex_search(const std::string& text, const std::regex& e);
bool regex_search(const std::string& text, std::smatch& match, const std::regex& e);

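Parameter text is the input string to search.
Parameter e is a std::regex object representing the regular expression.
Parameter match is a std::smatch object (or a similar match-results type) that receives the result. On success it holds the whole match as well as any groups (sub-matches in parentheses).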

The std::regex_replace() function replaces the text in a string that matches a regular expression.

std::string regex_replace(const std::string& text, const std::regex& e, const std::string& replace_text);
std::string regex_replace(const std::string& text, const std::regex& e, const std::string& replace_text, std::regex_constants::match_flag_type flags);

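Parameter text is the input string to process.
Parameter e is a std::regex object representing the regular expression.
Parameter replace_text is the replacement string.
Parameter flags specifies additional options.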

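The std::regex_constants namespace defines constants (values of type match_flag_type, not macros) that control how the replacement is performed.

std::regex_constants::format_default: interpret the replacement string with ECMAScript rules and replace all matches (the default).
std::regex_constants::format_first_only: replace only the first match.
std::regex_constants::format_no_copy: do not copy the parts of the input that do not match into the output.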

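When called with strings as shown above, std::regex_replace returns a new std::string containing the result and leaves the input unchanged. A separate overload takes an output iterator and a pair of input iterators; it writes the result through the output iterator and returns that iterator, advanced past the last character written.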

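In summary, regex_match() succeeds only if the pattern matches the entire string from its start to its end, whereas regex_search() scans the string strictly from left to right and reports the first position at which the pattern matches.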

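std::sregex_iterator (an alias for std::regex_iterator<std::string::const_iterator>) is usually constructed in one of the following two ways:

std::sregex_iterator(std::string::const_iterator first, std::string::const_iterator last, const std::regex& e);
std::sregex_iterator();

Parameter first points to where the search begins.
Parameter last points to where the search ends.
Parameter e is a std::regex object representing the regular expression. It must outlive the iterator, so a temporary std::regex cannot be passed.
The default constructor creates the end-of-sequence iterator; a loop compares against it to detect that no matches remain.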

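Matching a whole string (this example prints "No match found." because the pattern covers only part of the text):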

#include <iostream>
#include <regex>

using namespace std;

int main(void) 
{
    string text = "小樱同学";
    string pattern = "小樱";

    regex expression(pattern);
    smatch match;

    if (regex_match(text, match, expression))
    {
        cout << "Found a match: " << match.str() << endl;
    }
    else
    {
        cout << "No match found." << endl;
    }

    return 0;
}

Searching within a string:

#include <iostream>
#include <regex>

using namespace std;

int main(void) 
{
    string text = "小樱同学";
    string pattern = "同学";

    regex expression(pattern);
    smatch match;

    if (regex_search(text, match, expression)) 
    {
        cout << "Found a match: " << match.str() << endl;
        cout << "Match position: " << match.position() << endl;
        cout << "Match prefix: " << match.prefix() << endl;
        cout << "Match suffix: " << match.suffix() << endl;
    } 
    else 
    {
        cout << "No match found." << endl;
    }

    return 0;
}

Replacing within a string:

#include <iostream>
#include <regex>

using namespace std;

int main(void) 
{
    string text = "你好啊，小樱同学！从现在开始你就是我的朋友啊。小樱同学，请多多关照。";
    string pattern = "同学";

    regex expression(pattern);
    smatch match;

    cout << regex_replace(text, expression, "同志") << endl;

    return 0;
}

2. Basic Syntax

2.1 Escape Characters

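When a regular expression has to match certain special characters literally, those characters must be escaped; otherwise the search finds nothing, or constructing the regex fails with an error. In a regular expression, as in a C++ string literal, \ escapes the character that follows it. To get an ordinary \, escape it with another \.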

#include <iostream>
#include <regex>

using namespace std;

int main(void) 
{
    string text = "abc$def(123(456))";
    string pattern = "\\(456";

    regex expression(pattern);
    smatch match;

    if (regex_search(text, match, expression))
    {
        cout << "Found a match: " << match.str() << endl;
    }
    else
    {
        cout << "No match found." << endl;
    }

    return 0;
}

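Because the backslash is an escape character in C++ string literals, two backslashes are needed in the literal to produce one backslash in the pattern. Here, \\ yields the single backslash that escapes the left parenthesis (. A raw string literal is recommended instead, since backslashes inside it need no escaping.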

#include <iostream>
#include <regex>

using namespace std;

int main(void) 
{
    string text = "abc$def(123(456))";
    string pattern = R"(\(456)";

    regex expression(pattern);
    smatch match;

    if (regex_search(text, match, expression))
    {
        cout << "Found a match: " << match.str() << endl;
    }
    else
    {
        cout << "No match found." << endl;
    }

    return 0;
}

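Inside the raw string literal R"(...)", the pattern is written exactly as the regex engine sees it: \( is a single backslash followed by a left parenthesis, with no doubling required.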

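Common characters that must be escaped to be matched literally: . * + ? ( ) [ ] { } ^ $ | \ (the forward slash / is not special in the default ECMAScript grammar of std::regex and needs no escape).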

2.2 Character Classes

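Character class	Meaning	Example	Explanation
`[]`	List of accepted characters	[abc]	Any one of the characters a, b, c
`[^]`	List of rejected characters	[^abc]	Any one character other than a, b, c, including digits and special symbols
`-`	Range (hyphen inside brackets)	[a-z]	Any one lowercase letter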

#include <iostream>
#include <regex>

using namespace std;

int main(void)
{
    string text = "abc123def4567AbC";

    regex expression1("[abc]");
    regex expression2("[^abc]");
    regex expression3("[a-z]");
    smatch match;

    if (regex_search(text, match, expression1))
    {
        cout << "Match found: " << match.str() << endl;
    }

    if (regex_search(text, match, expression2))
    {
        cout << "Match found: " << match.str() << endl;
    }

    if (regex_search(text, match, expression3))
    {
        cout << "Match found: " << match.str() << endl;
    }
  
    return 0;
}

2.3 Metacharacters

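Metacharacter	Meaning
`.`	Matches any single character except a line terminator
`\d`	Matches any single digit 0-9
`\D`	Matches any single non-digit character
`\s`	Matches any single whitespace character
`\S`	Matches any single non-whitespace character
`\w`	Matches any single letter, digit, or underscore
`\W`	Matches any single character that is not a letter, digit, or underscore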

#include <iostream>
#include <regex>

using namespace std;

int main(void)
{
    string text = "abc123def4567AbC";

    regex expression1(R"(\d\d\d)");
    regex expression2("\\d\\w");
    smatch match;

    // Iterate over all matches with sregex_iterator
    for (sregex_iterator it(text.begin(), text.end(), expression1), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }

    cout << endl;

    for (sregex_iterator it(text.begin(), text.end(), expression2), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }

    return 0;
}

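The uppercase form of a metacharacter matches the complement of its lowercase form: \D, \S, and \W match exactly what \d, \s, and \w do not.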

2.4 Quantifiers

A quantifier specifies how many times in a row the preceding character or group must occur.

Quantifier	Meaning
`?`	0 or 1 time
`*`	0 or more times
`+`	1 or more times
`{n}`	Exactly n times
`{n,}`	At least n times
`{n,m}`	Between n and m times

#include <iostream>
#include <regex>

using namespace std;

int main(void)
{
    string text = "abc123def4567AbC89d115200a1";

    regex expression1(R"(\d{3,5})");
    regex expression2(R"(\d+)");
    smatch match;

    // Iterate over all matches with sregex_iterator
    for (sregex_iterator it(text.begin(), text.end(), expression1), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }

    cout << endl;

    for (sregex_iterator it(text.begin(), text.end(), expression2), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }

    return 0;
}

2.5 Anchors

Anchors specify where in the string a match must occur.

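Anchor	Meaning
`^`	Matches at the start of the string
`$`	Matches at the end of the string
`\b`	Matches at a word boundary, i.e. a position between a word character (\w) and a non-word character, or between a word character and the start or end of the string
`\B`	Matches at a position that is not a word boundary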

#include <iostream>
#include <regex>

using namespace std;

int main(void)
{
    string text = "abc123 def4567abc123abc abc89 d115200 a1abc";

    regex expression1("^abc");
    regex expression2("abc$");
    regex expression3(R"(abc\b)");
    regex expression4(R"(^ab\B)");
    smatch match;

    // Iterate over all matches with sregex_iterator
    for (sregex_iterator it(text.begin(), text.end(), expression1), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }

    cout << endl;

    for (sregex_iterator it(text.begin(), text.end(), expression2), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }
  
    cout << endl;

    for (sregex_iterator it(text.begin(), text.end(), expression3), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }

    cout << endl;
  
    for (sregex_iterator it(text.begin(), text.end(), expression4), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }

    return 0;
}

2.6 Alternation

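A regular expression uses the | symbol to express "or", also called alternation or a branch condition: the match succeeds if any one of the alternatives matches.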

#include <iostream>
#include <regex>

using namespace std;

int main(void)
{
    string text = "你好啊，小樱同学，欢迎你加入小樱班，从现在开始你就是我的朋友啊，小樱同志，请多多关照。";

    regex expression("小樱同学|小樱同志");
    smatch match;

    // Iterate over all matches with sregex_iterator
    for (sregex_iterator it(text.begin(), text.end(), expression), end; it != end; ++it) 
    {
        match = *it;
        cout << "Found match: " << match.str() << " at position: " << match.position() << endl;
    }

    return 0;
}

2.6 Grouping

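A quantifier applies only to the single character immediately to its left. Parentheses () create a group, so that the content inside the parentheses is treated as a single unit.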

2.6.1 Capturing Groups

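Match results are retrieved through the std::smatch class. The first pair of parentheses in the regular expression is group 1, the second pair is group 2, and so on. match[i] returns the corresponding part of the matched text, and match[0] returns the text of the whole match.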

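Capturing group	Description
`(pattern)`	Unnamed capture. Captures the matched substring. Capture number zero is the text matched by the whole regular expression; the other captures are numbered automatically from 1 in the order of their opening parentheses.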

#include <iostream>
#include <regex>

using namespace std;

int main(void)
{
    string text = "我是小樱，我的身份证明是37028419860401232X";
    string pattern = R"(\d{6}(\d{4})(\d{2})(\d{2})\d{3}[\dX])";

    regex expression(pattern);
    smatch match;

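    // Iterate over all matches with std::sregex_iterator
    for (sregex_iterator it = sregex_iterator(text.begin(), text.end(), expression); it != sregex_iterator(); ++it)
    {
        match = *it;

        // match[0] is the whole match
        // match[1] is the first capture group
        // match[2] is the second capture group

        cout << "Full match: " << match[0] << endl;
        // Use a separate, unsigned index so it neither shadows the iterator
        // nor triggers a signed/unsigned comparison with match.size()
        for (size_t k = 1; k < match.size(); k++)
        {
            cout << "Capture group " << k << ": " << match[k] << endl;
        }
    }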

    return 0;
}

2.6.2 Non-Capturing Groups

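Non-capturing group	Description
`(?:pattern)`	Matches pattern without capturing the sub-match, so it is not stored for later use. For example, "小樱(?:同学|同志)" is equivalent to "小樱同学|小樱同志".
`(?=pattern)`	Positive lookahead: matches at a position where pattern matches next, without consuming or capturing that text. For example, "Harmony(?=2|3)" matches the "Harmony" in "Harmony2" but not the "Harmony" in "Harmony1".
`(?!pattern)`	Negative lookahead: matches at a position where pattern does not match next, without consuming or capturing any text. For example, "Harmony(?!2|3)" matches the "Harmony" in "Harmony1" but not the "Harmony" in "Harmony2".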

#include <iostream>
#include <regex>

using namespace std;

int main(void)
{
    string text = "你好啊，小樱同学，欢迎你加入小樱班，从现在开始你就是我的朋友啊，小樱同志，请多多关照。";

    regex expression1("小樱(?:同学|同志)");
    regex expression2("小樱(?=同学|同志)");
    regex expression3("小樱(?!同学|同志)");
    smatch match;

    // Iterate over all matches with std::sregex_iterator
    for (sregex_iterator i = sregex_iterator(text.begin(), text.end(), expression1); i != sregex_iterator(); ++i)
    {
        match = *i;

        cout << "Full match: " << match[0] << endl;
    }

    cout << endl;

    for (sregex_iterator i = sregex_iterator(text.begin(), text.end(), expression2); i != sregex_iterator(); ++i) 
    {  
        match = *i;

        cout << "Full match: " << match[0] << endl;
    }

    cout << endl;

    for (sregex_iterator i = sregex_iterator(text.begin(), text.end(), expression3); i != sregex_iterator(); ++i) 
    {  
        match = *i;

        cout << "Full match: " << match[0] << endl;
    }

    return 0;
}

2.7 Non-Greedy Matching

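When a ? immediately follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), that quantifier becomes non-greedy (lazy). A non-greedy quantifier matches as few repetitions as possible, whereas the default greedy quantifier matches as many as possible.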

#include <iostream>
#include <regex>

using namespace std;

int main(void)
{
    string text = "abc111111abc";

    // Greedy match
    regex expression1(R"(\d{3,5})");
    // Non-greedy match
    regex expression2(R"(\d{3,5}?)");
    smatch match;

    // Iterate over all matches with std::sregex_iterator
    for (sregex_iterator i = sregex_iterator(text.begin(), text.end(), expression1); i != sregex_iterator(); ++i)
    {
        match = *i;

        cout << "Full match: " << match[0].str() << endl;
    }

    cout << endl;

    for (sregex_iterator i = sregex_iterator(text.begin(), text.end(), expression2); i != sregex_iterator(); ++i) 
    {  
        match = *i;

        cout << "Full match: " << match[0].str() << endl;
    }

    return 0;
}