Regular Expression Fundamentals and Their Application in Python

Python Regular Expression

Before looking at how Python uses regular expressions, let's review some background on regular expressions themselves.

regular expression

A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace a matched substring, or to extract the substrings that meet a certain condition. Regular expressions are very widely used, for example to validate mobile phone numbers, ID card numbers, and email addresses, and in the text processing that comes up in everyday programming.
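As a minimal sketch of those three uses in Python (the sample text and the patterns are made up for illustration, and the email pattern is deliberately simplistic):

```python
import re

text = "Contact: alice@example.com, phone 555-0134"  # illustrative sample text

# 1. Check whether the string contains a substring that matches a pattern.
has_phone = re.search(r"\d{3}-\d{4}", text) is not None

# 2. Replace every matched substring (here, every digit).
masked = re.sub(r"\d", "*", text)

# 3. Extract the substrings that meet a condition.
emails = re.findall(r"[\w.]+@[\w.]+", text)

print(has_phone)  # True
print(masked)     # Contact: alice@example.com, phone ***-****
print(emails)     # ['alice@example.com']
```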

Ordinary, nonprinting, and special characters:

Ordinary characters:

Ordinary characters include all printable and nonprintable characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols.

Nonprinting characters:

Nonprinting characters can also be part of regular expressions.

Character : Description
  • \cx : Matches the control character specified by x. For example, \cM matches Control-M, the carriage return. The value of x must be a letter in A-Z or a-z; otherwise c is treated as a literal 'c' character. Mnemonic: c for control.
  • \f : Matches a form feed character. Equivalent to \x0c and \cL. Mnemonic: f for form feed.
  • \n : Matches a newline character. Equivalent to \x0a and \cJ. Mnemonic: n for new line.
  • \r : Matches a carriage return. Equivalent to \x0d and \cM. Mnemonic: r for return.
  • \s : Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v]. Note that Unicode regular expressions also match full-width whitespace characters. Mnemonic: s for space.
  • \S : Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
  • \t : Matches a tab character. Equivalent to \x09 and \cI. Mnemonic: t for tab.
  • \v : Matches a vertical tab character. Equivalent to \x0b and \cK. Mnemonic: v for vertical.

Special characters:

Special characters are characters that carry a special meaning in a pattern.

Many metacharacters need special treatment when you want to match them literally. To match one of these characters, you must first "escape" it, that is, put a backslash \ in front of it. The following table lists the special characters in regular expressions:

Special character : Description
  • $ : Matches the end position of the input string. If the RegExp object's Multiline property is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
  • ( ) : Marks the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
  • * : Matches the preceding subexpression zero or more times. To match the * character, use \*. Mnemonic: 0 * 1 = 0, so the minimum is zero.
  • + : Matches the preceding subexpression one or more times. To match the + character, use \+. Mnemonic: 0 + 1 = 1, so the minimum is one.
  • . : Matches any single character except the newline \n. To match a literal ., use \. .
  • [ : Marks the start of a bracket expression. To match [, use \[.
  • ? : Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match the ? character, use \?.
  • \ : Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", '\n' matches a newline, the sequence '\\' matches "\", and '\(' matches "(".
  • ^ : Matches the start position of the input string. When it is used as the first character inside a bracket expression, it instead negates the set, so the listed characters are not accepted. To match the ^ character itself, use \^.
  • { : Marks the start of a quantifier expression. To match {, use \{.
  • | : Indicates a choice between two alternatives. To match |, use \|.
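The effect of escaping can be sketched in Python as follows. The sample strings are invented for illustration; re.escape() is the standard-library helper that inserts the backslashes automatically.

```python
import re

price = "Total: $5.00 (approx.)"  # illustrative sample text

# Unescaped, "." matches any character, so the pattern 5.00 also matches "5x00".
loose = re.search(r"5.00", "5x00") is not None

# Escaped metacharacters match themselves literally.
strict = re.search(r"\$5\.00", price).group()

# re.escape() escapes every special character in a literal string.
auto = re.search(re.escape("$5.00"), price).group()

print(loose)   # True
print(strict)  # $5.00
print(auto)    # $5.00
```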

Meta characters:

Quantifiers (qualifiers)

Quantifiers specify how many times a given component of a regular expression must occur for the match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.

Character : Description
  • * : Matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". * is equivalent to {0,}.
  • + : Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
  • ? : Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do", the "does" in "does", and the "do" in "doxy". ? is equivalent to {0,1}.
  • {n} : n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
  • {n,} : n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'.
  • {n,m} : m and n are non-negative integers with n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood" (it is greedy, so it takes as many as it can). 'o{0,1}' is equivalent to 'o?'. Note that no spaces are allowed between the comma and the two numbers.
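The table's examples can be checked with a short Python sketch. The helper name matches is hypothetical, and the group in do(es)? is written as a non-capturing (?:es) only so that re.findall returns whole matches rather than the group's content.

```python
import re

def matches(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

star = matches(r"zo*", "z zo zoo")                # ['z', 'zo', 'zoo']
plus = matches(r"zo+", "z zo zoo")                # ['zo', 'zoo']
optional = matches(r"do(?:es)?", "do does doxy")  # ['do', 'does', 'do']
exact = matches(r"o{2}", "Bob food")              # ['oo']
at_least = matches(r"o{2,}", "Bob foooood")       # ['ooooo']
bounded = matches(r"o{1,3}", "fooooood")          # ['ooo', 'ooo']

for name, result in [("*", star), ("+", plus), ("?", optional),
                     ("{2}", exact), ("{2,}", at_least), ("{1,3}", bounded)]:
    print(name, result)
```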

To give a few examples:

/[1-9][0-9]*/ : matches any positive integer

/[1-9][0-9]?/ : matches any integer from 1 to 99; equivalent to /[1-9][0-9]{0,1}/

The * and + quantifiers are greedy: they match as much text as possible. Appending a ? to them makes the match non-greedy, that is, as short as possible.
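A small Python sketch of the difference between greedy and non-greedy matching, using an invented HTML-like string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # illustrative sample text

greedy = re.findall(r"<.+>", html)   # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", html)    # .+? stops at the first possible ">"

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```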

Locators (anchors)

A locator fixes the position at which the regular expression must match. A quantifier cannot be applied to a locator: an expression such as ^* is not allowed.

Character : Description
  • ^ : Matches the position where the input string begins.
  • $ : Matches the position where the input string ends.
  • \b : Matches a word boundary, that is, the position between a word and a space. For example, /\bABC/ looks for a match at the beginning of a word, while /ABC\b/ looks for a match at the end of a word, so with \b the position of the match matters.
  • \B : Matches a non-word boundary, that is, a position inside a word. For example, /\BAbcd/ matches Abcd in the word sdAbcdefg. With \B it does not matter where exactly the match is, as long as it is inside the word.
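The anchors can be tried out in Python with a sketch like this one. The sample sentence is invented; the last line reuses the \BAbcd example from the table.

```python
import re

text = "cat concat catalog bobcat"  # illustrative sample text

starts_with = re.search(r"^cat", text) is not None   # True: the string begins with "cat"
ends_at = re.search(r"cat$", text).span()            # (22, 25): the "cat" at the very end
whole_word = re.findall(r"\bcat\b", text)            # ['cat']
word_start = re.findall(r"\bcat\w*", text)           # ['cat', 'catalog']
word_end = re.findall(r"\w*cat\b", text)             # ['cat', 'concat', 'bobcat']
inside = re.search(r"\BAbcd", "sdAbcdefg").group()   # 'Abcd'

print(starts_with, ends_at, whole_word, word_start, word_end, inside)
```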

Other metacharacters

Character : Description
  • [abc] : A character set; matches any one of the listed characters. It does not match the sequence abc as a whole.
  • [^abc] : Matches any character that is not listed.
  • [a-z] : Matches any character in the specified range. (To match a literal hyphen -, place it at the start or end of the brackets, i.e. [-a-z] or [a-z-].)
  • [^a-z] : Matches any character that is not in the specified range.
  • \d : Matches a digit.
  • \D : Matches a non-digit. A handy rule: an uppercase metacharacter is usually the negation of its lowercase form.
  • \w : Matches a letter, digit, or underscore. Equivalent to '[A-Za-z0-9_]' for ASCII text.
  • \W : Matches anything other than a letter, digit, or underscore. Equivalent to '[^A-Za-z0-9_]' for ASCII text.
  • \xn : Matches n, where n is a hexadecimal escape value; for example, \x43 matches 'C'.
  • \num : Matches num, where num is a positive integer: a backreference to the captured group with that number. For example, '(.)\1' matches two consecutive identical characters.
  • (?imx: ) : Turns on the optional i, m, or x flags inside the parentheses.
  • (?-imx: ) : Turns off the optional i, m, or x flags inside the parentheses.
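A few of these metacharacters in action, as a Python sketch with invented sample strings:

```python
import re

sample = "Room 42, level_3!"  # illustrative sample text

one_of = re.findall(r"[abc]", "abc")             # one character per match
not_lower = re.findall(r"[^a-z]", "aB-c")        # everything outside a-z
digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)
non_words = re.findall(r"\W+", sample)
hex_c = re.fullmatch(r"\x43", "C") is not None   # \x43 is the letter C
doubled = re.findall(r"(.)\1", "bookkeeper")     # findall returns group 1 of each match

print(one_of)     # ['a', 'b', 'c']
print(not_lower)  # ['B', '-']
print(digits)     # ['4', '2', '3']
print(words)      # ['Room', '42', 'level_3']
print(non_words)  # [' ', ', ', '!']
print(hex_c)      # True
print(doubled)    # ['o', 'k', 'e']
```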

Operator precedence:

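Regular expressions are evaluated from left to right and follow an order of precedence. The table below lists the operators from highest to lowest precedence.

Operator : Description
  • \ : Escape
  • (), (?: ), (?=), [] : Parentheses and square brackets
  • *, +, ?, {n}, {n,}, {n,m} : Quantifiers
  • ^, $, \metacharacter, any character : Anchors and sequences
  • | : Alternation (the "or" operator)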
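Two consequences of this ordering are easy to get wrong, so here is a hedged Python sketch of them (the patterns are illustrative): alternation binds most loosely, and a quantifier applies only to the item immediately before it.

```python
import re

# "|" has the lowest precedence, so ^ab|cd$ means (^ab) or (cd$).
loose = re.compile(r"^ab|cd$")
# A group makes both anchors apply to both alternatives.
grouped = re.compile(r"^(?:ab|cd)$")

loose_hit = loose.search("abXYZ") is not None      # True: the string starts with "ab"
grouped_hit = grouped.search("abXYZ") is not None  # False: the whole string must be "ab" or "cd"

# Quantifiers bind tighter than sequences: ab+ is a(b+), not (ab)+.
abbb = re.fullmatch(r"ab+", "abbb") is not None          # True
abab = re.fullmatch(r"ab+", "abab") is not None          # False
abab_group = re.fullmatch(r"(?:ab)+", "abab") is not None  # True

print(loose_hit, grouped_hit, abbb, abab, abab_group)
```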

Example:

^[a-zA-Z0-9_]{1,}$ // any string made up of one or more letters, digits, or underscores
^[1-9][0-9]{0,}$ // any positive integer; equivalent to ^[1-9][0-9]*$
^\-{0,1}[0-9]{1,}$ // any integer; equivalent to ^\-{0,1}[0-9]+$
^[-]?[0-9]+\.?[0-9]+$ // decimal numbers with an optional sign and an optional decimal point (at least two digits are required, so a single digit such as 5 does not match)

/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/ // parses a URL into protocol, domain, port, and relative path
/<\s*(\S+)(\s[^>]*)?>[\s\S]*<\s*\/\1\s*>/ // matches a pair of HTML tags
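These patterns can be exercised in Python roughly as follows. The variable names and the test URL are illustrative; the slash delimiters and the escaped \/ of the JavaScript-style notation are not needed in a Python raw string.

```python
import re

positive_int = re.compile(r"^[1-9][0-9]*$")
any_int = re.compile(r"^-?[0-9]+$")
decimal = re.compile(r"^[-]?[0-9]+\.?[0-9]+$")

checks = {
    "positive 120": bool(positive_int.match("120")),  # True
    "positive 012": bool(positive_int.match("012")),  # False: leading zero
    "int -7": bool(any_int.match("-7")),              # True
    "decimal 3.14": bool(decimal.match("3.14")),      # True
    "decimal 5": bool(decimal.match("5")),            # False: pattern needs two digits
}
print(checks)

url = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")
parts = url.match("http://www.example.com:80/docs/index.html").groups()
print(parts)  # ('http', 'www.example.com', ':80', '/docs/index.html')
```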

Group and Capture

Grouping takes two forms: capturing groups and non-capturing groups.

Capture group:

Capture groups are numbered by counting their opening parentheses from left to right. Take (a)((b)(c(d))) as an example.

Here (a) is group 1, ((b)(c(d))) is group 2, (b) is group 3, (c(d)) is group 4, and (d) is group 5. Group 0 always represents the entire expression, (a)((b)(c(d))).

Capture groups are so named because, during matching, every subsequence of the input that matches such a group is saved. The captured subsequence can be used later in the expression through a backreference, or retrieved from the matcher after the matching operation is complete.
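The numbering can be confirmed in Python with a short sketch, using "abcd" as an illustrative input:

```python
import re

m = re.match(r"(a)((b)(c(d)))", "abcd")

# group(0) is the whole match; groups 1 to 5 follow the opening parentheses.
numbered = [m.group(i) for i in range(6)]
captured = m.groups()

print(numbered)  # ['abcd', 'a', 'bcd', 'b', 'cd', 'd']
print(captured)  # ('a', 'bcd', 'b', 'cd', 'd')
```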

Non-capturing groups:

Groups that start with (? such as (?:pattern) and the assertions listed below are non-capturing groups: they do not capture text and are not counted in the group numbering. (Named groups, written (?P<name>...) in Python, are an exception and do capture.)

Compared with capturing groups, non-capturing groups do not store the matched content, which saves memory.

  • (?:pattern): Matches pattern but does not capture the text, so the matched characters are not stored for later use.
industr(?:y|ies) matches industry or industries, which is equivalent to industry|industries
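A minimal Python comparison of the two group types, using the industry/industries example:

```python
import re

capturing = re.match(r"industr(y|ies)", "industries")
non_capturing = re.match(r"industr(?:y|ies)", "industries")

# Both match the same text, but only the capturing form stores the suffix.
print(capturing.group(0), capturing.groups())          # industries ('ies',)
print(non_capturing.group(0), non_capturing.groups())  # industries ()
```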
  1. (?=pattern) : Zero-width positive lookahead assertion. Matching continues only if pattern matches immediately to the right of the current position.
  2. (?!pattern) : Zero-width negative lookahead assertion. Matching continues only if pattern does not match immediately to the right of the current position.
  3. (?<=pattern) : Zero-width positive lookbehind assertion. Matching continues only if pattern matches immediately to the left of the current position; it is like the positive lookahead, but in the opposite direction.
  4. (?<!pattern) : Zero-width negative lookbehind assertion. Matching continues only if pattern does not match immediately to the left of the current position; it is like the negative lookahead, but in the opposite direction.

So items 1 and 3 mirror each other, as do items 2 and 4.

For example:

(?<!4)56(?=9): the text 56 must not be preceded by 4 and must be followed by 9. The 56 in 5569 therefore matches, while nothing in 4569 matches.

(?<=[^c]a)\d*(?=bd): matches the digits between the characters a and b, where the character before the a must not be c and the character after the b must be d.
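Both examples can be checked in Python. This is a sketch with invented inputs; note that Python's re module requires lookbehind patterns to have a fixed width, which [^c]a satisfies.

```python
import re

ok = re.search(r"(?<!4)56(?=9)", "5569")
rejected = re.search(r"(?<!4)56(?=9)", "4569")

between = re.search(r"(?<=[^c]a)\d*(?=bd)", "xa123bd")
blocked = re.search(r"(?<=[^c]a)\d*(?=bd)", "ca123bd")

# The assertions are zero-width: the 9 and the surrounding letters are not part of the match.
print(ok.group(), ok.span())  # 56 (1, 3)
print(rejected)               # None
print(between.group())        # 123
print(blocked)                # None
```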

Pattern modifiers

Commonly used pattern modifiers are i, g, m, s, x, e, and so on. They can be combined.

Modifier : Description
  • i : Case-insensitive matching; for example, /abc/i matches abc, aBC, ABc, and so on.
  • g : Global match.
  • m : Treats the string as multiple lines, so ^ and $ can match at the start and end of any line.
  • s : Treats the string as a single line, so . also matches newline characters.
  • x : Ignores whitespace in the pattern.
  • A : Forces the match to start at the beginning of the target string.
  • D : If $ is used to anchor the end, no trailing newline is allowed; for example, /abc$/D cannot match "adshabc\n".
  • U : Matches only the nearest string without extending further, similar to the non-greedy ?.
  • e : Used with the function preg_replace().
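The modifiers above are written in the slash-delimited style of other languages. In Python, the corresponding behavior for i, m, s, and x is selected with the re.I, re.M, re.S, and re.X flags, combined with |. The sketch below covers only these four, with invented sample strings; there is no g flag in Python, where re.findall, re.finditer, and re.sub already process every match.

```python
import re

ignore_case = re.findall(r"abc", "abc aBC ABc", re.I)
multiline = re.findall(r"^\w+", "one\ntwo", re.M)       # ^ matches at each line start
single_anchor = re.findall(r"^\w+", "one\ntwo")         # ^ matches only at the string start
dot_all = re.search(r"a.b", "a\nb", re.S) is not None   # . also matches the newline
dot_default = re.search(r"a.b", "a\nb") is not None

# re.X ignores whitespace in the pattern and allows comments.
verbose = re.compile(r"""
    \d{3}   # first block of digits
    -       # separator
    \d{4}   # second block of digits
""", re.X)
verbose_hit = verbose.search("call 555-0134").group()

combined = re.findall(r"^abc", "ABC\nabc", re.I | re.M)

print(ignore_case)    # ['abc', 'aBC', 'ABc']
print(multiline)      # ['one', 'two']
print(single_anchor)  # ['one']
print(dot_all, dot_default)  # True False
print(verbose_hit)    # 555-0134
print(combined)       # ['ABC', 'abc']
```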

Using regular expressions in Python

Python's re module enables full regular expression functionality.

The compile function generates a regular expression object from a pattern string and optional flag arguments. This object has a series of methods for regular expression matching and replacement.

Regular expression processing function

  • re.match(pattern, string, flags=0) : re.match attempts to match a pattern at the beginning of the string.

    1. pattern: the regular expression to match
    2. string: the string to search
    3. flags: flag bits that control how the regular expression is matched, that is, the modifiers

    Returns a match object if the match succeeds; otherwise returns None.

    Use the match object's group(num) or groups() methods to retrieve the matched text.

    Method : Description
      • group(num=0) : Returns the string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple of the corresponding values.
      • groups() : Returns a tuple of all group strings, from group 1 up to the last group.

    See an example:

    import re

    text = 'Hey Girl ! I love you so much!'
    match = re.match(r'(.*) love (.*?) .*', text, re.M | re.I)  # () marks a capture group

    if match:  # a match was found
        print("group(0) :", match.group())   # group 0 is the entire match; 0 is the default
        print("group(1) :", match.group(1))  # capture group 1
        print("group(2) :", match.group(2))  # capture group 2

    '''
    group(0) : Hey Girl ! I love you so much!
    group(1) : Hey Girl ! I
    group(2) : you
    '''
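As a further sketch of the group() and groups() methods described above, passing several group numbers to group() returns a tuple. The sample text here is invented.

```python
import re

m = re.match(r"(\w+) (\w+)", "hello world example")

whole = m.group()        # same as m.group(0)
both = m.group(1, 2)     # several group numbers at once give a tuple
all_groups = m.groups()  # every group from 1 upward

print(whole)       # hello world
print(both)        # ('hello', 'world')
print(all_groups)  # ('hello', 'world')
```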
    
  • re.search(pattern, string, flags=0) : re.search scans the entire string and returns the first successful match.

    The parameters are the same as for re.match().

    As before, the match object's group(num) or groups() methods return the matched text.

    import re

    print(re.match('you', 'I love you'))          # not found: match only looks at the start of the string
    print(re.search('you', 'I love you'))         # found: search scans the whole string
    print(re.search('you', 'I love you').span())  # (start, end) of the match

    '''
    None
    <re.Match object; span=(7, 10), match='you'>
    (7, 10)
    '''
    
    The capture-group example from re.match() works the same way with re.search():

    import re

    text = 'Hey Girl ! I love you so much!'
    match = re.search(r'(.*) love (.*?) .*', text, re.M | re.I)

    if match:  # a match was found
        print("group(0) :", match.group())   # group 0 is the entire match; 0 is the default
        print("group(1) :", match.group(1))  # capture group 1
        print("group(2) :", match.group(2))  # capture group 2

    '''
    group(0) : Hey Girl ! I love you so much!
    group(1) : Hey Girl ! I
    group(2) : you
    '''
    
  • re.sub(pattern, repl, string, count=0, flags=0) : used to replace matches in the string

    1. pattern: the regular expression pattern string
    2. repl: the replacement string; it can also be a function
    3. string: the original string to be searched and replaced
    4. count: the maximum number of replacements to make; the default 0 means replace all matches
    5. flags: the matching mode applied when the pattern is compiled, given as a numeric flag value

    example:

    import re

    text = "110-119-120 #emergency number"

    # Remove the trailing comment that starts with '#'
    num = re.sub(r'#.*$', "", text)
    print(num)

    # Remove every non-digit character
    num = re.sub(r'\D', "", text)
    print(num)

    '''
    110-119-120
    110119120
    '''
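The count parameter and a function passed as repl can be sketched as follows. The helper name double is hypothetical; re.sub calls it with each match object and substitutes the string it returns.

```python
import re

text = "110-119-120"

# count=1 replaces only the first match.
first_only = re.sub(r"-", " ", text, count=1)

def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)

# repl may be a function instead of a string.
doubled = re.sub(r"\d+", double, text)

print(first_only)  # 110 119-120
print(doubled)     # 220-238-240
```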
    
  • re.compile(pattern[, flags]) : The compile function compiles a regular expression into a pattern object (re.Pattern in Python 3), whose match() and search() methods can then be used.

    1. pattern: the regular expression, given as a string
    2. flags: the matching mode

    import re

    pattern = re.compile(r'\d+')

    m = pattern.match('one12twothree34four')         # index 0 is not a digit
    print(m)
    m = pattern.match('one12twothree34four', 2, 10)  # start at index 2, which is 'e'
    print(m)
    m = pattern.match('one12twothree34four', 3, 10)  # start at index 3, which is '1'
    print(m)
    print(m.group(0))
    print(m.start(0))  # start index of the substring matched by the group
    print(m.end(0))    # end index of the matched substring (index of its last character + 1)
    print(m.span(0))   # returns (start(group), end(group))

    '''
    None
    None
    <re.Match object; span=(3, 5), match='12'>
    12
    3
    5
    (3, 5)
    '''
    
  • pattern.findall(string[, pos[, endpos]]) : Finds all substrings of the string that match the regular expression and returns them as a list, or an empty list if nothing matches. This is the method of a compiled pattern; the module-level form is re.findall(pattern, string, flags=0).

    1. string: the string to be searched
    2. pos: optional; the position in the string at which to start, 0 by default
    3. endpos: optional; the position in the string at which to stop, by default the length of the string

    import re

    pattern = re.compile(r'[a-z]+')

    m1 = pattern.findall('one12twothree34four')
    print(m1)
    m2 = pattern.findall('one12twothree34four', 4, 10)  # search only indices 4 to 9
    print(m2)

    '''
    ['one', 'twothree', 'four']
    ['twoth']
    '''
    
  • re.finditer(pattern, string, flags=0) : Finds all substrings of the string that match the regular expression and returns them as an iterator of match objects.

    The parameters are the same as for re.match().

    import re

    pattern = re.compile(r'[a-z]+')

    m1 = pattern.finditer('one12twothree34four')
    for i in m1:
        print(i.group())
    print('-------------')
    m2 = pattern.finditer('one12twothree34four', 4, 10)  # search only indices 4 to 9
    for i in m2:
        print(i.group())

    '''
    one
    twothree
    four
    -------------
    twoth
    '''
    

re.split(pattern, string[, maxsplit=0, flags=0]) : Splits the string at every substring that matches the pattern and returns the pieces as a list.

  1. pattern: the regular expression to match
  2. string: the string to split
  3. maxsplit: the maximum number of splits; the default 0 means there is no limit
  4. flags: the matching mode

import re

print(re.split(r'[0-1]+', 'one12twothree34four'))  # split on runs of 0 and 1
print(re.split(r'o', 'one12twothree34four'))       # a leading match yields an empty first item

'''
['one', '2twothree34four']
['', 'ne12tw', 'three34f', 'ur']
'''
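A short sketch of the maxsplit parameter, reusing the sample string above. The last line shows one more behavior of re.split: when the pattern contains a capture group, the separators are kept in the result.

```python
import re

s = 'one12twothree34four'

unlimited = re.split(r'\d+', s)              # split at every run of digits
limited = re.split(r'\d+', s, maxsplit=1)    # split at most once
with_separators = re.split(r'(\d+)', s)      # captured separators are included

print(unlimited)        # ['one', 'twothree', 'four']
print(limited)          # ['one', 'twothree34four']
print(with_separators)  # ['one', '12', 'twothree', '34', 'four']
```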
