Python/Regular Expression

To use regular expressions in a Python script, import the "re" module:

import re

Overview

A quick glance at the regular expression functions available in Python:

import re
if re.search("l+", "Hello"):         print(1)  # Substring match suffices
if not re.match("ell.", "Hello"):    print(2)  # The beginning of the string has to match
if re.match(".el", "Hello"):         print(3)
if re.match("he..o", "Hello", re.I): print(4)  # Case-insensitive match
print(re.sub("l+", "l", "Hello"))             # Prints "Helo"; replacement AKA substitution
print(re.sub(r"(.*)\1", r"\1", "HeyHey"))     # Prints "Hey"; backreference
print(re.sub("EY", "ey", "HEy", flags=re.I))  # Prints "Hey"; case-insensitive sub
print(re.sub(r"(?i)EY", r"ey", "HEy"))        # Prints "Hey"; case-insensitive sub
for match in re.findall("l+.", "Hello Dolly"):
  print(match)                                # Prints "llo" and then "lly"
for match in re.findall("e(l+.)", "Hello Dolly"):
  print(match)                                # Prints "llo"; match picks group 1
for match in re.findall("(l+)(.)", "Hello Dolly"):
  print(match[0], match[1])                   # The groups end up as items in a tuple
matchObj = re.match("(Hello|Hi) (Tom|Thom)", "Hello Tom Bombadil")
if matchObj is not None:
  print(matchObj.group(0))                    # Prints the whole match disregarding groups
  print(matchObj.group(1) + matchObj.group(2))  # Prints "HelloTom"

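Matching and searching

The functions match and search perform matching and searching respectively. match returns a result only if the pattern matches from the very beginning of the string. search succeeds if it finds a match anywhere in the string.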

>>> import re
>>> foo = re.compile(r'foo(.{,5})bar', re.I+re.S)
>>> st1 = 'Foo, Bar, Baz'
>>> st2 = '2. foo is bar'
>>> search1 = foo.search(st1)
>>> search2 = foo.search(st2)
>>> match1 = foo.match(st1)
>>> match2 = foo.match(st2)

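In the example above, match2 will be None, because the beginning of the string st2 does not match the pattern. The other three results will be match objects (see below).

You can also search or match directly, without compiling first: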

>>> search3 = re.search('oo.*ba', st1, re.I)

The methods of a compiled pattern object accept optional arguments giving the start and end positions of the search. For example:

>>> match3 = foo.match(st2, 3)

This starts matching at the character at index 3 of the string st2.
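As a minimal sketch of the start and end arguments, the snippet below reuses the foo pattern and st2 from above; the name match4 and the end position 10 are chosen here purely for illustration.

```python
import re

foo = re.compile(r'foo(.{,5})bar', re.I | re.S)
st2 = '2. foo is bar'

# pos=3: matching starts at index 3, which is where "foo" begins.
match3 = foo.match(st2, 3)
print(match3.group())   # foo is bar

# endpos=10: the text is treated as if it ended just before "bar",
# so the pattern can no longer match.
match4 = foo.match(st2, 3, 10)
print(match4)           # None
```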

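To find multiple instances of a pattern there are two approaches: call search in a loop, passing start and end positions, or use the functions findall and finditer. findall returns a list of the matched strings (or of the group contents when the pattern contains capture groups, as it does here). finditer returns an iterator that yields match objects, for use in a loop. For example: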

>>> str3 = 'foo, Bar Foo. BAR Foo: bar'
>>> foo.findall(str3)
[', ', '. ', ': ']
>>> for match in foo.finditer(str3):
...     match.group(1)
...
', '
'. '
': '
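The first approach, a loop that passes a start position to search, might look like the following sketch. The variable names found and pos are illustrative; the pattern and string are the ones used above.

```python
import re

foo = re.compile(r'foo(.{,5})bar', re.I | re.S)
str3 = 'foo, Bar Foo. BAR Foo: bar'

found = []
pos = 0
while True:
    m = foo.search(str3, pos)
    if m is None:
        break
    found.append(m.group(1))
    pos = m.end()   # resume right after the previous match

print(found)   # [', ', '. ', ': ']
```

This works here because every match is non-empty, so pos always advances; a pattern that can match the empty string would need extra care to avoid looping forever.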

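Match objects

The functions search and match return a match object, which holds information about the pattern match. If there is no match, they return None.

Its group method returns the string corresponding to a capture group, that is, a part of the regular expression enclosed in (). Called without arguments, it returns the entire matched string.

For example: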

>>> search1.group()
'Foo, Bar'
>>> search1.group(1)
', '

If a capture group is named, it can be accessed with matchobj.group('name').
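A short sketch of access by name follows. The group names greeting and name are made up for this example; the pattern is the one from the overview with names added.

```python
import re

m = re.match(r"(?P<greeting>Hello|Hi) (?P<name>Tom|Thom)", "Hello Tom Bombadil")
if m is not None:
    print(m.group('greeting'))   # Hello
    print(m.group('name'))       # Tom
    print(m.group(2))            # Tom; named groups keep their numbers
```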

The start and end methods give the position of the match, or of a group, in the string:

>>> search1.start()
0
>>> search1.end()
8
>>> search1.start(1)
3
>>> search1.end(1)
5

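These return the start and end positions of the entire match and of the first capture group, respectively.

Attributes:

string: The text that was matched against.
re: The Pattern object that produced this match (an object, not a string).
pos: The index in the string at which the regular expression started searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
endpos: The index in the string at which the regular expression stopped searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
lastindex: The number of the last capture group that matched. None if no group matched.
lastgroup: The name of the last capture group that matched. None if that group has no name or if no group matched.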
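The attributes can be inspected on any match object. The sketch below uses a made-up pattern and text; the optional second group does not take part in the match, which is why lastindex is 1.

```python
import re

pat = re.compile(r"(?P<word>[a-z]+)(\d+)?")
text = "ab12 cd"
m = pat.search(text, 5)     # start searching at index 5

print(m.string)             # ab12 cd
print(m.re is pat)          # True
print(m.pos, m.endpos)      # 5 7
print(m.lastindex)          # 1
print(m.lastgroup)          # word
```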

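Methods:

group([group1, …]): Returns the string captured by one or more specified capture groups; with several arguments the result is a tuple. Arguments may be group numbers or names; number 0 stands for the entire matched substring; with no argument it returns group(0); a group that captured nothing returns None; a group that captured several times returns the last captured substring.
groups([default]): Returns the strings captured by all capture groups as a tuple. Equivalent to calling group(1,2,…last). default is the value used for groups that captured nothing, None by default.
groupdict([default]): Returns a dictionary of the named capture groups, with the name as key and the captured substring as value. Groups without a name are not included. default has the same meaning as above.
start([group]): Returns the start position in string of the substring captured by the specified group (the index of its first character). group defaults to 0.
end([group]): Returns the end position in string of the substring captured by the specified group (the index of its last character + 1). group defaults to 0.
span([group]): Returns (start(group), end(group)).
expand(template): Substitutes the captured groups into template and returns the result. The template may refer to groups with \id, \g<id>, or \g<name>. The entire match is available as \g<0>, but not as \0, which is read as an octal escape. \id and \g<id> are equivalent, except that \10 is taken to mean group 10; to express \1 followed by the character '0', use \g<1>0.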
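The methods listed above can be tried on a single match object. This is only a sketch; the pattern, the group names first and last, and the sample text are assumptions made for the example.

```python
import re

# The third group is optional and does not match in this text.
m = re.match(r"(?P<first>\w+) (?P<last>\w+)(?: (\d+))?", "Ada Lovelace")

print(m.group(0))                  # Ada Lovelace
print(m.group(1, 'last'))          # ('Ada', 'Lovelace')
print(m.groups())                  # ('Ada', 'Lovelace', None)
print(m.groups('n/a'))             # ('Ada', 'Lovelace', 'n/a')
print(m.groupdict())               # {'first': 'Ada', 'last': 'Lovelace'}
print(m.start('last'), m.end('last'))  # 4 12
print(m.span('last'))              # (4, 12)
print(m.expand(r"\g<last>, \1"))   # Lovelace, Ada
print(m.expand(r"\g<1>0"))         # Ada0
```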

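Substitution

The function sub replaces substrings. Its five arguments are, in order: the pattern to be replaced; the replacement, which is either a string template or a function taking a match object; the string to operate on; an optional maximum number of replacements; and the regular expression flags. It returns the string after replacement.

The following example substitutes without compiling a pattern object first: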

>>> result = re.sub(r"b.*d","z","abccde")
>>> result
'aze'

Using a compiled object:

>>> import re
>>> mystring = 'This string has a q in it'
>>> pattern = re.compile(r'(a[n]? )(\w) ')
>>> newstring = pattern.sub(r"\1'\2' ", mystring)
>>> newstring
"This string has a 'q' in it"

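This takes a single alphanumeric character (which is what \w means in a regular expression) preceded by "a" or "an" and wraps it in single quotes. The \1 and \2 in the replacement template are backreferences to the two capture groups in the expression; they correspond to group(1) and group(2) of the Match object produced by the search.

The function subn is similar to sub, but returns a tuple containing the resulting string and the number of replacements actually made. For example: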

>>> subresult = pattern.subn(r"\1'\2' ", mystring)
>>> subresult
("This string has a 'q' in it", 1)
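The replacement may also be a function, and the number of replacements can be capped, as mentioned in the description of sub. The sketch below illustrates both; the helper name double and the sample sentence are invented for the example.

```python
import re

def double(match):
    # Called once per match; the returned string replaces the matched text.
    return str(int(match.group(0)) * 2)

text = "3 apples and 10 pears"
print(re.sub(r"\d+", double, text))            # 6 apples and 20 pears
print(re.sub(r"\d+", double, text, count=1))   # 6 apples and 10 pears
print(re.sub(r"APPLES", "plums", text, flags=re.I))  # 3 plums and 10 pears
```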

Splitting

The function split splits a string using a regular expression:

>>> import re
>>> mystring = '1. First part 2. Second part 3. Third part'
>>> re.split(r'\d\.', mystring)
['', ' First part ', ' Second part ', ' Third part']
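Two further behaviours of split, not covered in the text above but part of the standard re module, may be worth a quick sketch: a capture group in the pattern keeps the separators in the result, and maxsplit limits the number of splits. The variable names are illustrative.

```python
import re

mystring = '1. First part 2. Second part 3. Third part'

# A capture group around the separator keeps it in the output list.
with_separators = re.split(r'(\d\.)', mystring)
print(with_separators)

# maxsplit=1 splits only at the first separator.
first_only = re.split(r'\d\.', mystring, maxsplit=1)
print(first_only)
```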

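Escaping

The function re.escape escapes the characters in a string that have special meaning in a regular expression (before Python 3.7 it escaped every non-alphanumeric character). This is useful when an arbitrary string that may contain metacharacters such as ( or . is used to build a regular expression. The output below is from Python 3.7 or later.

>>> re.escape(r'This text (and this) must be escaped with a "\" to use in a regexp.')
'This\\ text\\ \\(and\\ this\\)\\ must\\ be\\ escaped\\ with\\ a\\ "\\\\"\\ to\\ use\\ in\\ a\\ regexp\\.'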
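A typical use is searching for text that comes from outside the program. In this sketch user_text stands in for such input and is purely hypothetical; without escaping, its + and parentheses would be read as regex operators.

```python
import re

user_text = "1+1=2 (really)"          # hypothetical untrusted input
haystack = "Proof: 1+1=2 (really)."

escaped = re.compile(re.escape(user_text))
print(escaped.search(haystack) is not None)       # True: matched literally
print(re.search(user_text, haystack) is not None) # False: "+" and "()" act as operators
```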

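Flags

Regular expressions accept the following flags:

Abbreviation	Full name	Description
`re.I`	`re.IGNORECASE`	Ignore case.
`re.L`	`re.LOCALE`	Makes the behaviour of certain sequences (`\w, \W, \b, \B, \s, \S`) depend on the current locale. In Python 3 it is accepted only with bytes patterns.
`re.M`	`re.MULTILINE`	`^` and `$` match at the start and end of each line, not only at the start and end of the string.
`re.S`	`re.DOTALL`	`.` matches any character, including a newline.
`re.U`	`re.UNICODE`	Makes `\w, \W, \b, \B, \d, \D, \s, \S` depend on Unicode character properties. In Python 3 this is already the default for str patterns.
`re.X`	`re.VERBOSE`	Ignores whitespace, except in a character class or when preceded by an unescaped backslash; also ignores `#` (with the same exceptions) and everything after it up to the end of the line, so it can be used for comments. This makes regexps more readable.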
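The effect of a few of these flags can be seen in a small sketch. The sample text and the name number are made up for illustration.

```python
import re

text = "first line\nsecond line"

# re.M: ^ also matches right after each newline.
print(re.findall(r"^\w+", text))         # ['first']
print(re.findall(r"^\w+", text, re.M))   # ['first', 'second']

# re.S: . also matches the newline between the two lines.
print(re.search(r"line.second", text))                     # None
print(re.search(r"line.second", text, re.S) is not None)   # True

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+          # integer part
    (?:\.\d+)?   # optional fractional part
""", re.X)
print(number.findall("3.14 and 42"))     # ['3.14', '42']
```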

Pattern objects

If a regular expression is used many times, create a pattern object and use it for searching and matching.

This is done with the function compile:

import re
foo = re.compile(r'foo(.{,5})bar', re.I+re.S)

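The first argument is the pattern: it matches the string "foo", followed by at most 5 arbitrary characters, followed by the string "bar", and stores the characters in between as a group. The second argument is an optional set of flags that modify the behaviour of the regular expression. In older Python versions some module-level functions did not accept a flags argument, so a pattern object had to be created first to use flags with them.

Read-only attributes:

pattern: The expression string used at compile time.
flags: The flags used at compile time, as an integer.
groups: The number of capture groups in the expression.
groupindex: A dictionary mapping the name of each named capture group to its group number. Groups without a name are not included.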
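These attributes can be read directly from a compiled pattern. The pattern and the group name year below are assumptions made for this sketch.

```python
import re

pat = re.compile(r"(?P<year>\d{4})-(\d{2})", re.I)

print(pat.pattern)              # (?P<year>\d{4})-(\d{2})
print(pat.groups)               # 2
print(dict(pat.groupindex))     # {'year': 1}
print(bool(pat.flags & re.I))   # True: test a flag with a bitwise AND
```

Testing a single flag with & is more robust than comparing pat.flags to a number, because the integer may also include flags that Python sets implicitly.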

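Miscellaneous

(?imx:pattern) Applies the I, M, or X option to pattern

(?-imx:pattern) Turns off the I, M, or X option for pattern

(?P<name>pattern) Gives the group matching pattern the name name

(?P=name) Refers back to a named group, for example: re.findall(r'(?P<g1>[a-z]+)\d+(?P=g1)',s)

(?#some text) A comment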
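The constructs above can be exercised as in the following sketch. The sample strings are invented; the scoped flag forms (?i:...) and (?-i:...) are assumed to be available, which requires Python 3.6 or later.

```python
import re

# (?i:...) makes only the enclosed part case-insensitive.
print(re.findall(r"(?i:hello) World", "HELLO World, hello world"))  # ['HELLO World']

# (?-i:...) switches case-insensitivity off for the enclosed part.
print(re.findall(r"(?i)a(?-i:b)", "AB Ab"))                          # ['Ab']

# (?P=g1) must repeat exactly the text captured by the group g1.
s = "abc123abc xyz9xyw"
print(re.findall(r"(?P<g1>[a-z]+)\d+(?P=g1)", s))                    # ['abc']

# (?#...) is a comment and does not affect matching.
print(re.findall(r"\d+(?#digits)", "a1b22"))                         # ['1', '22']
```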

External links

Documentation for the re module in the Python Standard Library