C++/Regex

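<regex> is a header in the C++ standard library that defines the standard's regular expression facilities. It was introduced in C++11.

By default, C++11 <regex> uses the ECMAScript grammar (the ECMA-262 standard behind JavaScript), so look-behind assertions are not supported.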

Type definitions

syntax_option_type
match_flag_type
error_type

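Class templates

The header provides the following class templates:

basic_regex: a regular expression object.
sub_match: the character sequence captured by a subexpression match.
match_results: one regular expression match, including all of its subexpression matches.
regex_iterator: an iterator over all regular expression matches in a character sequence.
regex_token_iterator: an iterator over selected subexpressions in all regular expression matches in a given character sequence.
regex_error: the error report produced by the regular expression library.
regex_traits: character-type information required by the regular expression library.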

basic_regex

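A grammar type can be selected when a regular expression object is constructed:

flag	Effect on syntax	Notes
icase	Case-insensitive	Matching ignores differences in case.
nosubs	No subexpressions	Subexpressions are not treated as marked. The match_results object holds no subexpression matches.
optimize	Optimize matching	Matching speed takes priority over the cost of constructing the regex object.
collate	Locale-sensitive	Character ranges such as "[a-b]" are affected by the locale.
ECMAScript	ECMAScript grammar	The regular expression follows exactly one of these grammars; they cannot be combined. If none is set, ECMAScript is the default.
basic	Basic POSIX grammar
extended	Extended POSIX grammar
awk	POSIX awk grammar
grep	POSIX grep grammar
egrep	POSIX egrep grammar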

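The regex_iterator class template

When a sequence is searched with a regular expression, the read-only forward iterator regex_iterator visits every match position in turn.

template<
   class BidirIt,
   class CharT = typename std::iterator_traits<BidirIt>::value_type,
   class Traits = std::regex_traits<CharT>
> class regex_iterator;

// Constructor
regex_iterator(BidirIt first, BidirIt last, // begin and end iterators of the underlying sequence (two BidirIt instances)
       const regex_type& rgx, // reference to the regular expression
       regex_constants::match_flag_type flags = regex_constants::match_default); // match flags

The constructor takes the begin and end positions of the sequence to search, the regular expression object, and the match flags. It calls regex_search to find the first match. If there is no match, the iterator equals a default-constructed one, which denotes the end of the sequence. On each increment the iterator calls std::regex_search again and remembers the result (that is, it stores a copy of the std::match_results<BidirIt> value).

Incrementing a std::regex_iterator after the last match makes it equal to the end-of-sequence iterator. Dereferencing or incrementing the end-of-sequence iterator is undefined behavior.

Operator ++ advances the iterator; dereferencing it yields a reference to the internal match_results object.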

regex_token_iterator

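An iterator similar to regex_iterator, except that it points to specific sub_match objects within each match of the regular expression. The fourth constructor argument (after the two sequence iterators and the regex) selects which sub_match object or objects to visit: 0 stands for the whole match, 1, 2, ... stand for the corresponding submatches, and -1 stands for the character sequences that are not part of any match. The last form can be used to tokenize a sequence in which the unmatched parts are the wanted data.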

match_results

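match_results behaves much like a container: it stores the results of one matching operation performed by regex_match, regex_search, or a regex_iterator, and each element is a sub_match. Once a match_results object holds the outcome of a match attempt (even an empty one), its member function match_results::ready returns true; only then is it valid to inspect its contents. If the match succeeded, the empty member function returns false and the object holds a sequence of sub_match elements: the first is the whole match, followed in order by the subexpressions for each capture group (a group enclosed in parentheses). The results can also be reached through member functions such as str(i), length, and position, through operator[], or through the iterators begin, end, cbegin, and cend.

When a match_results object is filled by regex_search, the parts of the target sequence that lie before and after the match are available through the member functions prefix and suffix.

A match_results object in the ready state can produce formatted strings through its format member function. The available format specifiers are: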

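Specifier	Replaced with
$n	The n-th backreference (capture group); n must be greater than 0 and have at most two digits.
$&	The whole match
$`	The prefix (the part of the target sequence before the match)
$'	The suffix (the part of the target sequence after the match)
$$	A single $ character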

The following specializations are predefined:

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

Member types worth noting:

value_type	sub_match<BidirectionalIterator>	 
char_type	iterator_traits<BidirectionalIterator>::value_type	 
reference	value_type&	
const_reference	const value_type&	
iterator	a forward iterator to const value_type	
const_iterator	a forward iterator to const value_type	The same as iterator

sub_match

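sub_match is a class template derived from std::pair, declared as follows:

template <class BidirectionalIterator>
        class sub_match : public pair <BidirectionalIterator, BidirectionalIterator>;

A sub_match represents the result of matching one subexpression during a single regular expression match. Such a match is produced by the functions regex_match or regex_search, or by a regex iterator (regex_iterator or regex_token_iterator). The result of a subexpression match is a character sequence, but sub_match does not store the characters themselves. Instead it uses its std::pair base class to store the begin iterator and the past-the-end iterator of the sequence.

The member matched records whether the object took part in a match. It is false for a default-constructed sub_match, and true for a sub_match inside a match_results object when the corresponding subexpression participated in the match.

A sub_match object can be converted to a string object, behaves like a string in compare, and has a member function length that behaves like the string member function of the same name.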

Member types:

Type name	Definition	Meaning
value_type	iterator_traits<BidirectionalIterator>::value_type	Character type of the sequence
string_type	basic_string<value_type>	String type of the sequence
iterator	BidirectionalIterator	The template parameter, i.e. the iterator type of the sequence
difference_type	iterator_traits<BidirectionalIterator>::difference_type	Usually ptrdiff_t
first_type	BidirectionalIterator	First template parameter of the std::pair base class
second_type	BidirectionalIterator	Second template parameter of the std::pair base class

Predefined specializations:

typedef sub_match<const char*> csub_match;
typedef sub_match<const wchar_t*> wcsub_match;
typedef sub_match<string::const_iterator> ssub_match;
typedef sub_match<wstring::const_iterator> wssub_match;

regex_traits

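translate: maps a character to another character. If two characters translate to the same character, they are treated as equal during matching.
value: returns the int value of a character interpreted as a digit in a given radix.
isctype: tests whether a character belongs to a given character class. Character classes are represented as integer values.
lookup_classname: returns the bitmask value that represents a named character class.
lookup_collatename: returns the collating element with a given name, as a string.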

regex_error

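regex_error is the exception type thrown by the functions of the regex library. Its member function code() returns a regex_constants::error_type value:

flag	error
error_collate	The expression contains an invalid collating element name
error_ctype	The expression contains an invalid character class name
error_escape	The expression contains an invalid escaped character or a trailing escape
error_backref	The expression contains an invalid back reference
error_brack	The expression contains mismatched square brackets
error_paren	The expression contains mismatched parentheses
error_brace	The expression contains mismatched braces
error_badbrace	The expression contains an invalid range between braces
error_range	The expression contains an invalid character range
error_space	There was not enough memory to convert the expression into a finite state machine
error_badrepeat	The expression contains a repeat specifier (one of *?+{) that is not preceded by a valid regular expression
error_complexity	The complexity of an attempted match exceeded a predefined level
error_stack	There was not enough memory on the run-time stack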

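Algorithms

regex_match: attempts to match a regular expression against an entire character sequence.
regex_search: attempts to match a regular expression against any part of a character sequence.
regex_replace: replaces the parts that match a regular expression.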

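Non-member functions

std::swap(std::basic_regex): swap specialized for regular expression objects.
Comparison of two sub_match objects:
- operator==
- operator!=
- operator<
- operator<=
- operator>
- operator>=
operator<<: writes the matched character subsequence to an output stream.
Lexicographic comparison of the values of two match results:
- operator==
- operator!=
std::swap(std::match_results): swap specialized for regular expression match results.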

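Constants

match_flag_type

std::regex_constants::match_flag_type has the following bitmask values:

flag	effects on match	notes
match_default	Default	Default matching behavior; the value is 0.
match_not_bol	Not Beginning-Of-Line	The first character is not treated as the beginning of a line ("^" does not match).
match_not_eol	Not End-Of-Line	The last character is not treated as the end of a line ("$" does not match).
match_not_bow	Not Beginning-Of-Word	The escape sequence "\b" does not match at the beginning of a word.
match_not_eow	Not End-Of-Word	The escape sequence "\b" does not match at the end of a word.
match_any	Any match	If more than one match is possible, any of them is acceptable.
match_not_null	Not null	Empty sequences are not matched.
match_continuous	Continuous	The expression must match a subsequence that begins at the first character.
match_prev_avail	Previous Available	At least one character exists before the first character of the sequence (match_not_bol and match_not_bow are ignored).
format_default	Default formatting	Uses the ECMAScript replacement rules; the value is 0.
format_sed	sed formatting	Uses the replacement rules of the POSIX sed utility.
format_no_copy	No copy	Parts of the target sequence that do not match the regular expression are not copied when replacing.
format_first_only	First only	Only the first occurrence of the regular expression is replaced.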

syntax_option_type


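Value	Effect
icase	Matching ignores case.
nosubs	All subexpressions are treated as non-marking subexpressions (?:expr). As a result, no submatches are stored in the supplied std::match_results and mark_count() is 0.
optimize	Instructs the regular expression engine to spend more construction time on a representation that matches faster, for example by converting a non-deterministic finite state automaton (NFA) into a deterministic one.
collate	Character ranges of the form "[a-b]" are locale-sensitive.
multiline (C++17)	With the ECMAScript engine, specifies that ^ matches the beginning of a line and $ matches the end of a line.
ECMAScript	Uses the modified ECMAScript regular expression grammar.
basic	Uses the basic POSIX regular expression grammar.
extended	Uses the extended POSIX regular expression grammar.
awk	Uses the regular expression grammar of awk.
grep	Uses the regular expression grammar of grep. In effect this is the basic grammar with the newline character '\n' added as an alternation separator.
egrep	Uses the regular expression grammar of grep -E (egrep). In effect this is the extended grammar with the newline character '\n' added as an alternation separator in addition to '|'.

At most one grammar option may be chosen from ECMAScript, basic, extended, awk, grep, and egrep. If none is chosen, ECMAScript is the default. The other options act as modifiers. For example, std::regex("meow", std::regex::icase) is equivalent to std::regex("meow", std::regex::ECMAScript|std::regex::icase).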

ECMAScript matching is depth-first, whereas POSIX matching is leftmost-longest. For example, searching zzxayyzz with the regular expression .*(a|xayy) gives:

ECMA (depth first search) match: zzxa
POSIX (leftmost longest)  match: zzxayy

Example program

#include <iostream>
#include <string>
#include <regex>
 
int main()
{
    std::string fnames[] = {"foo.txt", "bar.txt", "baz.dat", "zoidberg"};
    std::regex pieces_regex("([a-z]+)\\.([a-z]+)");
    std::smatch pieces_match; 
    for (const auto &fname : fnames) {
        if (std::regex_match(fname, pieces_match, pieces_regex)) {
            std::cout << fname << '\n';
            for (size_t i = 0; i < pieces_match.size(); ++i) {
                std::ssub_match sub_match = pieces_match[i];
                std::string piece = sub_match.str();
                std::cout << "  submatch " << i << ": " << piece << '\n';
            }   
        }   
    }   
}

Output:

foo.txt
  submatch 0: foo.txt
  submatch 1: foo
  submatch 2: txt
bar.txt
  submatch 0: bar.txt
  submatch 1: bar
  submatch 2: txt
baz.dat
  submatch 0: baz.dat
  submatch 1: baz
  submatch 2: dat

References


C++ reference for Standard library header <regex>