Python/Regular Expression

To use regular expressions in a Python script, import the "re" module:

import re

Overview

A quick look at the regular expression functions available in Python:

import re
if re.search("l+", "Hello"):          print(1)  # Substring match suffices
if not re.match("ell.", "Hello"):     print(2)  # The beginning of the string has to match
if re.match(".el", "Hello"):          print(3)
if re.match("he..o", "Hello", re.I):  print(4)  # Case-insensitive match
print(re.sub("l+", "l", "Hello"))             # Prints "Helo"; replacement AKA substitution
print(re.sub(r"(.*)\1", r"\1", "HeyHey"))     # Prints "Hey"; backreference
print(re.sub("EY", "ey", "HEy", flags=re.I))  # Prints "Hey"; case-insensitive sub
print(re.sub(r"(?i)EY", r"ey", "HEy"))        # Prints "Hey"; case-insensitive sub
for match in re.findall("l+.", "Hello Dolly"):
    print(match)                              # Prints "llo" and then "lly"
for match in re.findall("e(l+.)", "Hello Dolly"):
    print(match)                              # Prints "llo"; findall returns group 1
for match in re.findall("(l+)(.)", "Hello Dolly"):
    print(match[0], match[1])                 # The groups end up as items in a tuple
matchObj = re.match("(Hello|Hi) (Tom|Thom)", "Hello Tom Bombadil")
if matchObj is not None:
    print(matchObj.group(0))                  # Prints the whole match disregarding groups
    print(matchObj.group(1) + matchObj.group(2))  # Prints "HelloTom"

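Matching and searching

The functions match and search perform matching and searching, respectively. match returns a result only if the pattern matches starting at the very beginning of the string. search succeeds if it finds a match anywhere in the string.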

>>> import re
>>> foo = re.compile(r'foo(.{,5})bar', re.I+re.S)
>>> st1 = 'Foo, Bar, Baz'
>>> st2 = '2. foo is bar'
>>> search1 = foo.search(st1)
>>> search2 = foo.search(st2)
>>> match1 = foo.match(st1)
>>> match2 = foo.match(st2)

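In the example above, match2 will be None, because the beginning of the string st2 does not match the pattern. The other three results will be match objects (described below).

You can also search or match directly, without compiling first: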

>>> search3 = re.search('oo.*ba', st1, re.I)

The methods of a compiled pattern object accept optional arguments giving the start and end positions of the search. For example:

>>> match3 = foo.match(st2, 3)

This attempts the match starting at the character at index 3 of the string st2.
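As a minimal sketch of the start and end arguments, the snippet below reuses the pattern and the string st2 from the example above; the variable name match_at_3 is only illustrative.

```python
import re

foo = re.compile(r'foo(.{,5})bar', re.I | re.S)
st2 = '2. foo is bar'

# With a start position, match() anchors at that index instead of index 0.
match_at_3 = foo.match(st2, 3)
print(match_at_3.group())   # foo is bar
print(match_at_3.start())   # 3

# An end position limits how far into the string the pattern may look;
# only the first 8 characters ('2. foo i') are considered here.
print(foo.search(st2, 0, 8))  # None
```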

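To find multiple instances of a pattern, there are two approaches: pass start and end positions to search inside a loop, or use the functions findall and finditer. findall returns a list of the matching strings (or, when the pattern contains a capture group, of the group contents). finditer returns an iterator that yields match objects as a loop consumes it. For example: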

>>> str3 = 'foo, Bar Foo. BAR Foo: bar'
>>> foo.findall(str3)
[', ', '. ', ': ']
>>> for match in foo.finditer(str3):
...     match.group(1)
...
', '
'. '
': '
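The first approach mentioned above, a loop that passes a start position to search, could look roughly like the following sketch. The names spans and pos are illustrative, and the loop assumes the pattern never matches an empty string (otherwise pos would not advance).

```python
import re

foo = re.compile(r'foo(.{,5})bar', re.I | re.S)
str3 = 'foo, Bar Foo. BAR Foo: bar'

spans = []
pos = 0
while True:
    m = foo.search(str3, pos)   # resume searching at index pos
    if m is None:
        break
    spans.append(m.span())
    pos = m.end()               # continue right after the previous match

print(spans)  # [(0, 8), (9, 17), (18, 26)]
```

In most code finditer is the simpler choice; the explicit loop is mainly useful when the next start position depends on something other than the end of the previous match.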

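Match objects

The functions search and match return a match object, which holds information about the pattern match. If there is no match, they return None.

Its group method returns the string corresponding to a capture group, that is, a part of the regular expression enclosed in (). Called with no argument, it returns the entire matched string.

For example: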

>>> search1.group()
'Foo, Bar'
>>> search1.group(1)
', '

A capture group that has a name can be accessed with matchobj.group('name').
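A short sketch of named groups, using the greeting pattern from the overview; the group names greeting and name are made up for this illustration.

```python
import re

m = re.match(r'(?P<greeting>Hello|Hi) (?P<name>Tom|Thom)', 'Hello Tom Bombadil')

print(m.group('greeting'))  # Hello
print(m.group('name'))      # Tom
print(m.group(2))           # Tom; named groups are still numbered
print(m.groupdict())        # {'greeting': 'Hello', 'name': 'Tom'}
```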

The methods start and end give the position of the match, or of a group, within the string:

>>> search1.start()
0
>>> search1.end()
8
>>> search1.start(1)
3
>>> search1.end(1)
5

These return the start and end positions of the entire match and of the first capture group, respectively.

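Attributes:

string: the text that was matched against.
re: the pattern object that was used for the match.
pos: the index in the string at which the regular expression started searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
endpos: the index in the string at which the regular expression stopped searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
lastindex: the number of the last capture group that matched. None if no group matched.
lastgroup: the name of the last capture group that matched. None if that group has no name or if no group matched.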

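Methods:

group([group1, …]): returns the string captured by one or more specified capture groups; with several arguments the result is a tuple. Arguments may be group numbers or names; number 0 stands for the entire matched substring; with no argument, group(0) is returned; a group that captured nothing yields None; a group that captured several times yields the last captured substring.
groups([default]): returns the strings of all capture groups as a tuple. Equivalent to calling group(1, 2, …, last). default is the value substituted for groups that captured nothing; it defaults to None.
groupdict([default]): returns a dictionary of the named capture groups, with the names as keys and the captured substrings as values. Groups without a name are not included. default has the same meaning as above.
start([group]): returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
end([group]): returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
span([group]): returns (start(group), end(group)).
expand(template): substitutes the captured groups into template and returns the result. template can refer to groups with \id, \g<id>, or \g<name>. \id and \g<id> are equivalent, but \10 is read as group 10; to express \1 followed by the character '0', you must write \g<1>0. The short form \0 does not refer to the whole match (it is read as an octal escape); use \g<0> for that.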
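The following sketch exercises several of the attributes and methods listed above on one match object. The pattern, with an optional named group sign that does not participate in the match, is invented for illustration.

```python
import re

# Two unnamed groups plus an optional named group that stays unmatched here.
m = re.match(r'(\w+) (\w+)(?P<sign>[.!?])?', 'Hello World')

print(m.groups())           # ('Hello', 'World', None)
print(m.groups(''))         # ('Hello', 'World', '')
print(m.groupdict('none'))  # {'sign': 'none'}
print(m.span(2))            # (6, 11)
print(m.lastindex)          # 2
print(m.lastgroup)          # None, because group 2 has no name

swapped = m.expand(r'\2 \1')     # 'World Hello'
suffixed = m.expand(r'\g<1>0')   # 'Hello0': group 1 followed by the character '0'
print(swapped, suffixed)
```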

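Substitution

The function sub replaces substrings. Its five parameters are, in order: the pattern for the substrings to be replaced, the replacement template string or a function (called with a match object), the string to operate on, an optional maximum number of replacements, and the regular expression flags. It returns the string after substitution.

The following example substitutes without compiling a pattern object first: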

>>> result = re.sub(r"b.*d","z","abccde")
>>> result
'aze'

Using a compiled object:

>>> import re
>>> mystring = 'This string has a q in it'
>>> pattern = re.compile(r'(a[n]? )(\w) ')
>>> newstring = pattern.sub(r"\1'\2' ", mystring)
>>> newstring
"This string has a 'q' in it"

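This takes a single word character (the meaning of \w in a regular expression: a letter, digit, or underscore) preceded by "a" or "an" and wraps it in single quotes. In the replacement template, \1 and \2 are backreferences to the two capture groups of the expression; they correspond to group(1) and group(2) of the match object produced by the search.

The function subn is similar to sub, but returns a tuple containing the resulting string and the number of replacements actually made. For example: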

>>> subresult = pattern.subn(r"\1'\2' ", mystring)
>>> subresult
("This string has a 'q' in it", 1)
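The replacement argument may also be a function, and the number of replacements can be capped. A minimal sketch follows; the helper name double and the sample text are hypothetical.

```python
import re

def double(match):
    # Called once per match; the return value replaces the matched text.
    return str(int(match.group()) * 2)

text = 'a1 b22 c333'
all_doubled = re.sub(r'\d+', double, text)
first_two = re.sub(r'\d+', double, text, count=2)

print(all_doubled)  # a2 b44 c666
print(first_two)    # a2 b44 c333
```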

Splitting

The function split splits a string using a regular expression:

>>> import re
>>> mystring = '1. First part 2. Second part 3. Third part'
>>> re.split(r'\d\.', mystring)
['', ' First part ', ' Second part ', ' Third part']
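Two variations on the example above may be useful: wrapping the separator in a capture group keeps the separators in the result, and maxsplit limits the number of splits. This is only a sketch reusing the same sample string.

```python
import re

mystring = '1. First part 2. Second part 3. Third part'

# A capture group in the pattern keeps the separators in the output list.
with_separators = re.split(r'(\d\.)', mystring)
print(with_separators)
# ['', '1.', ' First part ', '2.', ' Second part ', '3.', ' Third part']

# maxsplit=1 stops after the first split and leaves the rest untouched.
split_once = re.split(r'\d\.', mystring, maxsplit=1)
print(split_once)
# ['', ' First part 2. Second part 3. Third part']
```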

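Escaping

The function re.escape escapes the characters in a string that have a special meaning in a regular expression (before Python 3.7 it escaped every non-alphanumeric character). This is useful when a regular expression is built from an unknown string that may contain metacharacters such as ( or ..

>>> re.escape(r'This text (and this) must be escaped with a "\" to use in a regexp.')
'This\\ text\\ \\(and\\ this\\)\\ must\\ be\\ escaped\\ with\\ a\\ "\\\\"\\ to\\ use\\ in\\ a\\ regexp\\.'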
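A typical use, sketched below, is searching for literal text that happens to contain metacharacters. The strings user_text and sentence are invented for this example.

```python
import re

user_text = '1+1=2 (really)'               # contains +, ( and )
sentence = 'We know that 1+1=2 (really).'

# Escaped, the text is matched literally.
literal = re.search(re.escape(user_text), sentence)
print(literal.group())   # 1+1=2 (really)

# Unescaped, "+" means repetition and the parentheses form a group,
# so the pattern no longer matches the literal text.
unescaped = re.search(user_text, sentence)
print(unescaped)         # None
```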

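Flags

Regular expressions use the following flags:

Abbreviation	Full name	Description
`re.I`	`re.IGNORECASE`	Ignores case
`re.L`	`re.LOCALE`	Makes the behavior of some sequences (`\w, \W, \b, \B, \s, \S`) depend on the current locale (in Python 3, only allowed with bytes patterns)
`re.M`	`re.MULTILINE`	Makes `^` and `$` match at the start and end of every line, not only at the start and end of the string.
`re.S`	`re.DOTALL`	Makes `.` match any character, including a newline
`re.U`	`re.UNICODE`	Makes `\w, \W, \b, \B, \d, \D, \s, \S` depend on Unicode character properties (the default for str patterns in Python 3).
`re.X`	`re.VERBOSE`	Ignores whitespace, except inside a character class or when preceded by an unescaped backslash; also ignores `#` (with the same exceptions) and everything after it up to the end of the line, so it can be used for comments. This makes regular expressions easier to read.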
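The effect of three of these flags can be seen in a small sketch; the sample text and the verbose pattern named number are illustrative only.

```python
import re

text = 'first line\nsecond line'

# Without re.M, ^ matches only at the start of the string.
start_only = re.findall(r'^\w+', text)
every_line = re.findall(r'^\w+', text, re.M)
print(start_only)   # ['first']
print(every_line)   # ['first', 'second']

# Without re.S, "." does not match a newline.
no_dotall = re.search(r'a.b', 'a\nb')
dotall = re.search(r'a.b', 'a\nb', re.S)
print(no_dotall, dotall is not None)  # None True

# re.X lets the pattern be laid out with whitespace and comments.
number = re.compile(r"""
    (\d+)   # integer part
    \.      # decimal point
    (\d+)   # fractional part
""", re.X)
parts = number.match('3.14').groups()
print(parts)        # ('3', '14')
```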

Pattern objects

If a regular expression is used several times, you should create a pattern object to search or match with.

To do so, use the function compile:

import re
foo = re.compile(r'foo(.{,5})bar', re.I+re.S)

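The first argument specifies a match of the string "foo" followed by at most 5 arbitrary characters and then the string "bar", storing the characters in between as a group. The second argument is an optional set of flags that modify the behavior of the regular expression. Some regular expression functions do not accept a flags argument (re.sub, for example, did not in older Python versions), so it may be necessary to create such a pattern object first.

Read-only attributes:

pattern: the expression string the object was compiled from.
flags: the matching flags used at compile time, as a number.
groups: the number of capture groups in the expression.
groupindex: a dictionary whose keys are the names of the named capture groups in the expression and whose values are the corresponding group numbers. Groups without a name are not included.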
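These attributes can be inspected directly, as in the sketch below. The pattern and the group name word are hypothetical, and groupindex is converted with dict() because it is a read-only mapping rather than a plain dictionary.

```python
import re

p = re.compile(r'(?P<word>\w+) (\d+)', re.I)

print(p.pattern)           # (?P<word>\w+) (\d+)
print(p.groups)            # 2
print(dict(p.groupindex))  # {'word': 1}

# flags is a number with one bit per flag; test a flag with a bitwise AND.
ignores_case = bool(p.flags & re.I)
print(ignores_case)        # True
```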

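Miscellaneous

(?imx:pattern) turns on the I, M, or X option for pattern

(?-imx:pattern) turns off the I, M, or X option for pattern

(?P<name>pattern) names the group that matches pattern as name

(?P=name) refers back to a named group and matches the same text it matched, for example: re.findall(r'(?P<g1>[a-z]+)\d+(?P=g1)',s)

(?#some text) a comment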
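A brief sketch of these constructs in use. It assumes Python 3.6 or later, where the scoped forms (?i:...) and (?-i:...) are available; the sample string s is invented.

```python
import re

# Scoped flag: only "hello" is matched case-insensitively.
scoped_on = re.search(r'(?i:hello) World', 'HELLO World')
scoped_miss = re.search(r'(?i:hello) World', 'HELLO WORLD')
print(scoped_on is not None, scoped_miss)   # True None

# Global (?i) with case sensitivity switched back on for "b".
off_ok = re.search(r'(?i)a(?-i:b)c', 'AbC')
off_miss = re.search(r'(?i)a(?-i:b)c', 'ABC')
print(off_ok is not None, off_miss)         # True None

# Named group and named backreference: the same letters must repeat
# after the digits, so 'ab12ab' matches and 'cd3ce' does not.
s = 'ab12ab cd3ce'
repeated = re.findall(r'(?P<g1>[a-z]+)\d+(?P=g1)', s)
print(repeated)                             # ['ab']

# (?#...) is ignored by the matcher.
commented = re.match(r'\d+(?#one or more digits)', '42').group()
print(commented)                            # 42
```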

External links

Documentation for the re module in the Python standard library