RegEx

From WikIT

In a regular expression (shortened to regex throughout), the following characters and sequences have a special meaning:

Single-character matches

.\c
Matches any character. If you check the box labelled “. matches newline”, the dot also matches line-ending characters, so a match can run over multiple lines. With the option unchecked, . only matches characters within a line, and not the line-ending characters (\r and \n).
\Г
This allows you to use a character Г that would otherwise have a special meaning. For example, \[ would be interpreted as [ and not as the start of a character set. Conversely, adding the backslash (escaping) can make special a character that otherwise isn’t: \d stands for “a digit”, while “d” is just an ordinary letter.
Non ASCII characters
\xnn
Specify a single character with code nn. What this stands for depends on the text encoding. For instance, \xE9 may match an é or a θ depending on the code page in an ANSI encoded document.
\x{nnnn}
Like above, but matches a full 16-bit Unicode character. If the document is ANSI encoded, this construct is invalid.
\0nnn
A single byte character whose code in octal is nnn.
[[.collating sequence.]]
The character the collating sequence stands for. For instance, in Spanish, “ch” is a single letter, though it is written using two characters. That letter would be represented as [[.ch.]]. This trick also works with symbolic names of control characters, like [[.BEL.]] for the character of code 0x07. See also the discussion on character ranges.
Control characters

\a

BEL 0x07 (alarm).

\f

FF 0x0C (form feed).

\R

Any newline character.

\b

BS 0x08 (backspace) (in character class definition. Outside: “word boundary”)

\n

LF 0x0A (line feed). This is the regular end of line under Unix systems.

\t

TAB 0x09 (tab, or hard tab, horizontal tab).

\e

ESC 0x1B.

\r

CR 0x0D (carriage return). Part of DOS/Windows end of line sequence CR-LF.

\C

\Ccharacter CTL char obtained from character by stripping all but 6 lowest order bits. E.g., \C1, \CA and \Ca all = SOH 0x01.

Ranges or kinds of characters

[]
This indicates a set of characters, for example, [abc] means any of the characters a, b or c. You can also use ranges, for example [a-z] for any lowercase letter. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequences in Spanish).
[^]
The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A,B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].
[[:name:]]
The whole character class named name. Most of the time, there is a single letter escape sequence for them – see below.
Recognised classes are:

[[:alnum:]]

ASCII letters and digits

[[:graph:]]

graphical character

[[:u:]]

uppercase letters

[[:alpha:]]

ASCII letters

[[:l:]]

lowercase letters

[[:unicode:]]

any character with code point > 255

[[:blank:]]

spacing which is not a line terminator

[[:print:]]

printable characters

[[:w:]]

word character

[[:cntrl:]]

control characters

[[:punct:]]

punctuation characters: , “ ‘ ? ! ; : # $ % & ( ) * + – / < > = @ [ ] \ ^ _ { } | ~

[[:xdigit:]]

hexadecimal digits

[[:d:]]

decimal digits

[[:s:]]

whitespace

 

 

\pshort name,\p{name}
Same as [[:name:]]. For instance, \pd and \p{digit} both stand for a digit, \d.
\Pshort name,\P{name}
Same as [^[:name:]] (not belonging to the class name).

\d

A digit in the 0-9 range, same as [[:digit:]].

\U

Not an uppercase letter. The note about “Match case” given for \l applies.

\h

Horizontal spacing. This only matches space, tab and line feed.

\D

Not a digit. Same as [^[:digit:]].

\w

A word character, which is a letter, digit or underscore. This appears not to depend on what the Scintilla component considers as word characters. Same as [[:word:]].

\H

Not horizontal whitespace.

\l

A lowercase letter. Same as [a-z] or [[:lower:]]. NOTE: this will fall back on “a word character” if the “Match case” search option is off.

\W

Not a word character. Same as [^[:word:]], that is, anything other than [[:alnum:]] and the underscore.

\v

Vertical whitespace. This encompasses the VT, FF and CR control characters: 0x0B (vertical tab), 0x0C (form feed) and 0x0D (carriage return).

\L

Not a lowercase letter. See the note about \l.

\s

A spacing character: space, EOLs and tabs count. Same as [[:space:]].

\V

Not vertical whitespace.

\u

An uppercase letter. Same as [[:upper:]]. See the note about \l.

\S

Not a space.

[[=?=]]

Ignores diacritic marks and case. E.g. [[=a=]] matches any of the characters: a, À, Á, Â, Ã, Ä, Å, A, à, á, â, ã, ä and å.
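The warning about complemented sets running across line ends can be checked outside the editor. The following is a minimal sketch using Python's built-in re module, which is an assumption made for illustration: it is not the engine behind the search dialog, and it does not support the [[:name:]] classes, but plain sets behave the same way here. The sample text is hypothetical.

```python
import re

# Hypothetical two-line sample; the first A, B or C is on the second line.
text = "xyz\nxyA tail"

# [^ABC]* also consumes the newline, so the match runs onto the second line.
across_lines = re.match(r"[^ABC]*", text).group()

# Adding \r and \n to the exception list confines the match to the first line.
single_line = re.match(r"[^ABC\r\n]*", text).group()

print(repr(across_lines))
print(repr(single_line))
```

The first pattern stops only at the A on the second line, while the second one stops at the end of the first line.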

Multiplying operators

+

Match 1 or more repetitions of the previous character. Sa+m matches Sam, Saam, Saaam, etc. [aeiou]+ matches consecutive strings of vowels.

*?

0 or more of the previous group, but lazy, rather than the longest string as with the “greedy” * operator. m.*?o applied to the text margin-bottom: 0 will match margin-bo, but m.*o will match margin-botto.

{n,}

Matches n or more copies of the element it applies to.

*

Matches 0 or more repetitions of the previous character. Sa*m matches Sm, Sam, Saam, etc.

+?

One or more of the previous group, but lazily.

{m,n}

Matches m to n copies of the element it applies to, greedily.

?

0 or one of the previous character. Sa?m matches Sm and Sam, but not Saam.

{n}

Matches n copies of the element it applies to.

{n,}?,
{m,n}?

Like the above, but match lazily.

*+, ?+, ++, {n,}+, {m,n}+

These so called “possessive” variants of greedy repeat marks do not backtrack. This allows failures to be reported much earlier, which can boost performance significantly. But they will eliminate matches that would require backtracking to be found.
Example: matching “.*” against “abc”x will find “abc”, because
  • “ then abc”x then $ fails
  • “ then abc” then x fails
  • “ then abc then “ succeeds.

However, matching “.*+” against “abc”x will fail, because the possessive repeat factor prevents backtracking.
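The difference between the greedy and lazy repeat marks described in this section can be reproduced with a short script. This is a sketch in Python's re module, chosen here only for illustration (it is not the editor's own engine); the possessive variants are left out because older Python versions do not accept them.

```python
import re

css = "margin-bottom: 0"

# Greedy: .* grabs as much as possible, then gives back until an "o" fits.
greedy = re.search(r"m.*o", css).group()

# Lazy: .*? stops at the first "o" that lets the expression match.
lazy = re.search(r"m.*?o", css).group()

# ? allows zero or one "a"; {2,3} allows two to three.
optional_ok = re.fullmatch(r"Sa?m", "Sam") is not None
optional_too_many = re.fullmatch(r"Sa?m", "Saam") is not None
counted = re.fullmatch(r"Sa{2,3}m", "Saaam") is not None

print(greedy, "|", lazy)
```

The greedy form yields margin-botto and the lazy form margin-bo, as stated in the table above.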

Anchors

Anchors match a position in the line, rather than a particular character.

^

This matches the start of a line (except when used inside a set, see above).

$

This matches the end of a line.

\A, \`

The start of the matching string.

\<

This matches the start of a word using Scintilla’s definitions of words.

\>

This matches the end of a word using Scintilla’s definition of words.

\z, \'

The end of the matching string.

\b

Matches either the start or end of a word.

\B

Not a word boundary.

\Z

Matches like \z with an optional sequence of newlines before it. Equivalent to (?=\v*\z).

Groups

()
Parentheses mark a subset of the regular expression. The string matched by the contents of the parentheses ( ) can be re-used as a backreference or as part of a replace operation; see Substitutions, below.
Groups may be nested.
(?<some name>…), (?'some name'…), (?(some name)…)
Names this group some name.
\gn , \g{n}
The n-th subexpression, aka parenthesised group. Using the second form has some small benefits, like allowing n to be more than 9, or disambiguating when n might be followed by digits. When n is negative, groups are counted backwards, so that \g-2 is the second last matched group.
\g{something},\k<something>
The string matching the subexpression named something.
\digit
Backreference: \1 matches an additional occurrence of a text matched by an earlier part of the regex. Example: the regular expression ([Cc][Aa][Ss][Ee]).*\1 would match a line such as Case matches Case but not Case doesn’t match cASE. A regex can have multiple subgroups, so \2, \3, etc. can be used to match others (numbers advance left to right with the opening parenthesis of the group). So \n is a synonym for \gn, but doesn’t support the extension syntax for the latter.
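The backreference example can be tried as written. A minimal sketch follows, assuming Python's re module as a stand-in engine; \1 works the same way there, whereas the \g and \k forms listed above are spelled differently in Python and are not shown.

```python
import re

# The group captures one particular casing of "case"; \1 demands that exact text again.
pattern = re.compile(r"([Cc][Aa][Ss][Ee]).*\1")

same_casing = pattern.search("Case matches Case")
other_casing = pattern.search("Case doesn’t match cASE")

print(same_casing.group(1))
print(other_casing)
```

The second search finds nothing, because no casing of the word occurs twice in that line.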

Readability enhancements

(?:)
A grouping construct that doesn’t count as a subexpression, just grouping things for easier reading of the regex.
(?#)
Comments. The whole group is for humans only and will be ignored in matching text.

Using the x flag modifier (see section below) is also a good way to improve readability in complex regular expressions.

Search modifiers

The following constructs control how matches condition other matches, or otherwise alter the way search is performed. For those readers familiar with Perl, \G is not supported.

\Q
Starts verbatim mode (Perl calls it “quoted”). In this mode, all characters are treated as-is, the only exception being the \E end verbatim mode sequence.
\E
Ends verbatim mode. Thus, “\Q\*+\Ea+” matches “\*+aaaa”.
(?flags-not-flags), (?flags-not-flags:…)
Applies flags and not-flags to search inside the parentheses. Such a construct may have flags and may have not-flags – if it has neither, it is just a non-marking group, which is just a readability enhancer. The following flags are known:

   i : case insensitive (default: off)

   m : ^ and $ match embedded newlines (default: as per “. matches newline”)

    s: dot matches newline (default: as per “. matches newline”)

    x: Ignore unescaped whitespace in regex (default: off)
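Two of these flags can be illustrated with a small script. This is a sketch under the assumption that Python's re module is an acceptable stand-in; it accepts the same inline i and x flags, but its defaults are not tied to any checkbox of the search dialog. The sample names are hypothetical.

```python
import re

# (?i:...) switches case-insensitivity on for the group only.
scoped = re.compile(r"(?i:sam) Smith")
first_name_any_case = scoped.fullmatch("SAM Smith") is not None
surname_wrong_case = scoped.fullmatch("SAM SMITH") is not None

# (?x) at the very start makes unescaped whitespace and # comments insignificant.
spaced = re.compile(r"(?x) S a+ m   # S, one or more a, then m")
compact_text = spaced.fullmatch("Saaam") is not None
literal_spaces = spaced.fullmatch("S a m") is not None

print(first_name_any_case, surname_wrong_case, compact_text, literal_spaces)
```

Only the part inside the flagged group ignores case, and with x the spaces in the pattern match nothing in the text.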

(?|expression using the alternation | operator)
If an alternation expression has subexpressions in some of its alternatives, you may want the subexpression counter not to be altered by what is in the other branches of the alternation. This construct will just do that.
For example, you get the following subexpression counter values:
/(a)(?|x(y)z|(p(q)r)|(t)u(v))(z)/x
# before  ———————-branch-reset—————- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
Without the construct, (p(q)r) would be group #3, and (t) group #5. With the construct, they both report as group #2.

Control flow

Normally, a regular expression parses from left to right linearly. But you may need to change this behaviour.

|
The alternation operator, which allows matching either of a number of options, as in one|two|three to match either of “one”, “two” or “three”. Matches are attempted from left to right. Use (?:) to match an empty string in such a construct.
(?n), (?signed-n)
Refers to subexpression #n. When a sign is present, n is counted relative to the current position in the pattern.
(?0), (?R)
Recurses to the start of the pattern.
(?&name)
Recurses to the subexpression named name.
(?(assertion)yes-pattern|no-pattern)
Matches yes-pattern if assertion is true, and no-pattern otherwise if provided. Supported assertions are:
  • (?=assert) (positive lookahead)
  • (?!assert) (negative lookahead)
  • (?(R)) (true if inside a recursion)
  • (?(Rn)) (true if in a recursion to subexpression numbered n)
PCRE doesn’t treat recursive expressions like Perl. In PCRE, as in Python, a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.
\K
Resets matched text at this point. For instance, matching “foo\Kbar” will not match “bar”. It will match “foobar”, but will pretend that only “bar” matches. Useful when you wish to replace only the tail of a matched subject and groups are clumsy to formulate.

Assertions

These special groups consume no characters. Their successful matching counts, but when they are done, matching resumes where it left off.

(?=pattern)
If pattern matches, backtrack to the start of pattern. This allows using logical AND for combining regexes.
For instance,
(?=.*[[:lower:]])(?=.*[[:upper:]]).{6,}
tries finding a lowercase letter anywhere. On success it backtracks and searches for an uppercase letter. On yet another success, it checks whether the subject has at least 6 characters.
“q(?=u)i” doesn’t match “quit”, because, as matching ‘u’ consumes 0 characters, matching “i” in the pattern fails at “u” in the subject.
(?!pattern)
Matches if pattern didn’t match.
(?<=pattern)
Asserts that pattern matches before some token.
(?<!pattern)
Asserts that pattern does not match before some token.
NOTE: pattern has to be of fixed length, so that the regex engine knows where to test the assertion.
(?>pattern)
Match pattern independently of surrounding patterns, and don’t backtrack into it. Failure to match will cause the whole subject not to match.
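The lookahead and lookbehind behaviour described in this section can be verified with a few lines of code. The sketch below assumes Python's re module purely for illustration; since it has no [[:lower:]] or [[:upper:]] classes, the plain ranges [a-z] and [A-Z] are substituted, and the sample strings are hypothetical.

```python
import re

# Two lookaheads act as a logical AND: a lowercase letter AND an uppercase letter,
# then at least 6 characters, all tested from the same starting position.
rule = re.compile(r"(?=.*[a-z])(?=.*[A-Z]).{6,}")
mixed_case = rule.match("abcDef") is not None
lower_only = rule.match("abcdef") is not None
too_short = rule.match("aB") is not None

# The lookahead consumes nothing: after q(?=u) the next pattern item still faces "u".
quit_with_i = re.search(r"q(?=u)i", "quit")
quit_q_only = re.search(r"q(?=u)", "quit").group()

# Fixed-length lookbehind: digits preceded by a dollar sign, the sign itself untouched.
masked = re.sub(r"(?<=\$)\d+", "N", "cost $42, code 42")

print(mixed_case, lower_only, too_short, quit_with_i, quit_q_only, masked)
```

The lookbehind replaces only the digits that follow the dollar sign and leaves the other 42 alone.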

Substitutions

\a, \e, \f, \n, \r, \t, \v

The corresponding control character, respectively BEL, ESC, FF, LF, CR, TAB and VT.

\Ccharacter, \xnn,\x{nnnn}

Like in search patterns, respectively the control character with the same low order bits, the character with code nn and the character with code nnnn (requires Unicode encoding).

\l

Causes next character to output in lowercase

\L

Causes next characters to be output in lowercase, until a \E is found.

\u

Causes next character to output in uppercase

\U

Causes next characters to be output in uppercase, until a \E is found.

\E

Puts an end to forced case mode initiated by \L or \U.

$&, $MATCH, ${^MATCH}

The whole matched text.

$`, $PREMATCH, ${^PREMATCH}

The text between the previous and current match, or the text before the match if this is the first one.

$', $POSTMATCH, ${^POSTMATCH}

Everything that follows current match.

$LAST_SUBMATCH_RESULT, $^N

Returns what the last matching subexpression matched.

$+, $LAST_PAREN_MATCH

Returns what matched the last subexpression in the pattern.

$$

Returns $.

$n, ${n}, \n

Returns what matched the subexpression numbered n. Negative indices are not allowed.

$+{name}

Returns what matched subexpression named name.

Examples

These examples come from an earlier version of this page: Notepad++ RegExp Help, by Author : Georg Dembowski

IMPORTANT
  • You have to check the box “regular expression” in search & replace dialog
  • When copying the strings out of here, pay close attention not to have additional spaces in front of them! Then the RegExp will not work!

Example 0

How to replace/delete full lines according to a regex pattern? Let’s say you wish to delete all the lines in a file that contain the word “unused”, without leaving blank lines in their stead. This means you need to locate the line, remove it all, and additionally remove its terminating newline.

So, you’d want to do this:

  • Find: ^.*?unused.*?$\R
  • Replace with: nothing, not even a space

The regular expression, which appears to always work, is to be read like this:

  • assert the start of a line
  • match some characters, stopping as early as required for the expression to match
  • the string you search in the file, “unused”
  • more characters, again stopping at the earliest necessary for the expression to match
  • assert line ends
  • A newline character or sequence

Remember that .* gobbles everything to the end of line if “. matches newline” is off, and to the end of file if the option is on!

Why does the text above say that the expression only appears to always work? Because this expression assumes each line ends with an end of line sequence. This is almost always true, but may fail for the last line in the file: without a terminating newline, it won’t match and won’t be deleted.

But the remedy is fairly simple: we translate into regex parlance that the newline should match if it is there. So the correct expression actually is:

^.*?unused.*?$\R?
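The same line deletion can be scripted. The following is a minimal sketch in Python, with two stated assumptions: Python's re module has no \R, so the group (?:\r\n|\n|\r) stands in for it, and the MULTILINE flag is needed for ^ and $ to match at line boundaries. The sample text is hypothetical and its last line deliberately has no trailing newline.

```python
import re

# Hypothetical file content; the final line is not terminated by a newline.
text = "keep 1\nthis is unused\nkeep 2\nunused tail"

# (?:\r\n|\n|\r)? plays the role of \R? : the newline is removed if it is there.
delete_unused = re.compile(r"^.*?unused.*?$(?:\r\n|\n|\r)?", re.MULTILINE)

cleaned = delete_unused.sub("", text)
print(repr(cleaned))
```

Because the newline is optional, the unterminated last line is deleted as well, which is the point of the corrected expression.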

Example 1

You use a MediaWiki (e.g. Wikipedia, Wikitravel) and want to make all headings one “level higher”, so a H2 becomes a H1 etc.

    • Search ^=(=)
    • Replace with \1
    • Click “Replace all”

      You do this to find all headings 2…9 (two equal sign characters are required) which begin at line beginning (^) and to replace the two equal sign characters by only the last of the two, so eliminating one and having one remaining.
    • Search =(=)$
    • Replace with \1
    • Click “Replace all”

      You do this to find all headings 2…9 (two equal sign characters are required) which end at line ending ($) and to replace the two equal sign characters by only the last of the two, so eliminating one and having one remaining.

== title == became = title =, you’re done :-)
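The two replace steps of this example can be reproduced in a script. This is a sketch using Python's re module as an assumed stand-in for the editor; the MULTILINE flag makes ^ and $ work per line, and the sample headings are hypothetical.

```python
import re

wiki = "== title ==\n=== sub ==="

# Step 1: at the start of each line, keep only the second of two equal signs.
step1 = re.sub(r"^=(=)", r"\1", wiki, flags=re.MULTILINE)

# Step 2: the same at the end of each line.
promoted = re.sub(r"=(=)$", r"\1", step1, flags=re.MULTILINE)

print(promoted)
```

Each heading loses exactly one equal sign on each side, so a level 3 heading becomes level 2.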

Example 2

You have a document with a lot of dates in German date format (dd.mm.yy) and you’d like to transform them to a sortable format (yy-mm-dd). Don’t be put off by the length of the search term: it’s long, but consists of fairly easy and short parts.

Do the following:

  • Search ([^0-9])([0123][0-9])\.([01][0-9])\.([0-9][0-9])([^0-9])
  • Replace with \1\4-\3-\2\5
  • Click “Replace all”

You do this to fetch

  • the day, whose first number can only be 0, 1, 2 or 3
  • the month, whose first number can only be 0 or 1
  • but only if the separator is . and not ‘any character’ ( . versus \. )
  • but only if no numbers are surrounding the date, as then it might be an IP address instead of a date

and to write all of this in the opposite order, except for the surroundings. Pay attention: Whatever SEARCH matches will be deleted and only replaced by the stuff in the REPLACE field, thus it is mandatory to have the surroundings in the REPLACE field as well!

Outcome:

  • 31.12.97 became 97-12-31
  • 14.08.05 became 05-08-14
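  • the IP address 14.13.14.14 is meant to stay unchanged; note, however, that the dot after its third number also counts as a non-digit, so this pattern still matches 14.13.14 there. Check such cases on your own data.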

You’re done :-)
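The date conversion can also be run as a script. The sketch below assumes Python's re module, where the search and replace strings from this example work unchanged; the sample sentence is hypothetical. Note that each date needs one non-digit character on both sides, which the replacement puts back through \1 and \5.

```python
import re

search = r"([^0-9])([0123][0-9])\.([01][0-9])\.([0-9][0-9])([^0-9])"
replace = r"\1\4-\3-\2\5"

# Hypothetical sample with the two dates from the outcome list.
line = "on 31.12.97 and 14.08.05 ok"

converted = re.sub(search, replace, line)
print(converted)
```

Groups 2, 3 and 4 are written back in reverse order, while the surrounding characters stay where they were.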

Example 3

You have printed a file list in Windows to the file filelist.txt using dir /b/s >filelist.txt and want to make local URLs out of the entries.

  1. Open filelist.txt with Notepad++
    • Search \\
    • Replace with /
    • Click “Replace all” to change windows path separator char \ into URL path separator char /
    • Search ^(.*)$
    • Replace with file:///\1
    • Click “Replace all” to add file:/// in the beginning of all lines

Depending on your requirements, proceed to escape some characters, like space to %20, etc. C:\!\aktuell.csv became file:///C:/!/aktuell.csv

You’re done :-)
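For a file list that is processed outside the editor, the two steps might look as follows. This is a sketch in Python's re module under the assumption that the list is already in a string without a trailing newline; the second path is a hypothetical addition to the one from the example.

```python
import re

# Raw string so the backslashes are taken literally.
file_list = r"C:\!\aktuell.csv" + "\n" + r"C:\docs\notes.txt"

# Step 1: \\ in the regex matches one backslash, which becomes a forward slash.
slashed = re.sub(r"\\", "/", file_list)

# Step 2: prefix every line with file:/// (MULTILINE makes ^ and $ work per line).
urls = re.sub(r"^(.*)$", r"file:///\1", slashed, flags=re.MULTILINE)

print(urls)
```

Escaping of spaces and other special characters is left out here, as in the steps above.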

Example 4

Another Search Replace Example

[Data]
AS AF AFG 004 Afghanistan
EU AX ALA 248 Åland Islands
EU AL ALB 008 Albania, People’s Socialist Republic of
AF DZ DZA 012 Algeria, People’s Democratic Republic of
OC AS ASM 016 American Samoa
EU AD AND 020 Andorra, Principality of
AF AO AGO 024 Angola, Republic of
NA AI AIA 660 Anguilla
AN AQ ATA 010 Antarctica (the territory South of 60 deg S)
NA AG ATG 028 Antigua and Barbuda
SA AR ARG 032 Argentina, Argentine Republic
AS AM ARM 051 Armenia
NA AW ABW 533 Aruba
OC AU AUS 036 Australia, Commonwealth of
  • Search for: ([A-Z]+) ([A-Z]+) ([A-Z]+) ([0-9]+) (.*)
  • Replace with: \1,\2,\3,\4,\5
  • Hit “Replace All”

Final Data:

AS,AF,AFG,004,Afghanistan
EU,AX,ALA,248,Åland Islands
EU,AL,ALB,008,Albania, People’s Socialist Republic of
AF,DZ,DZA,012,Algeria, People’s Democratic Republic of
OC,AS,ASM,016,American Samoa
EU,AD,AND,020,Andorra, Principality of
AF,AO,AGO,024,Angola, Republic of
NA,AI,AIA,660,Anguilla
AN,AQ,ATA,010,Antarctica (the territory South of 60 deg S)
NA,AG,ATG,028,Antigua and Barbuda
SA,AR,ARG,032,Argentina, Argentine Republic
AS,AM,ARM,051,Armenia
NA,AW,ABW,533,Aruba
OC,AU,AUS,036,Australia, Commonwealth of
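The conversion to comma-separated values can be scripted in the same way. This is a small sketch assuming Python's re module, which accepts the search and replace strings of this example as they are; only two of the data lines are used.

```python
import re

data = "AS AF AFG 004 Afghanistan\nOC AS ASM 016 American Samoa"

# Four space-separated fields, then the rest of the line as the fifth group.
csv_text = re.sub(r"([A-Z]+) ([A-Z]+) ([A-Z]+) ([0-9]+) (.*)",
                  r"\1,\2,\3,\4,\5", data)

print(csv_text)
```

Spaces inside the country name survive, because the fifth group takes the remainder of the line in one piece.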

Example 5

How to recognize a balanced expression, in mathematics or in programming?

Let’s first explicitly describe what we wish to match. An expression is balanced if and only if all areas delineated by parentheses contain a balanced expression. Like in: 1+f(x+g())-h(2).

This leads us to define the following kinds of groups:

balanced ::= no_paren paren no_paren

no_paren ::= [^()]* (a possibly empty group of characters without a single parenthesis)

paren ::= ( balanced )

Can we represent this as a regex? We cannot as-is.

The first hurdle is that there is no primitive construct to represent an alternating sequence of tokens. A common trick then is to represent the sequence as a repetition of the repeating pattern – here, no_paren followed by paren -, with any odd stuff at the end added.

So we have a more manageable, although slightly more complex, representation:

balanced ::= simple* no_paren

simple ::= no_paren paren

no_paren ::= [^()]*

paren ::= ( balanced )


A second hurdle is that parentheses are not ordinary characters. That’s ok, we’ll escape them as \( and \) respectively.

The third one is more interesting. How do we represent the whole of an expression inside a nested sub-expression? This smacks of recursion. PCRE has recursion. The simplest form of it is going back to the start of the search pattern (not the searched text!) and doing it again. It is written (?R). You remember seeing this one in the main list, right?

So:

  • we know how to match a no_paren. It will be nicer to give it an explicit name. This we’ll do in the Embellishments section below.
  • we just discovered how to write a paren: \((?R)\)

This gives us the following hard to read, but correct regex:

([^()]*\((?R)\))*[^()]*

Try it, it works. But it is about as hard to decrypt as a badly indented piece of code without a comment and with unpromising, unclear identifiers. This is only one of the reasons why old Perl earned itself the rare qualifier of “write-only language”.
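To see the recursion at work without a regex engine, the grammar can be followed by hand. The sketch below is written in Python, whose built-in re module has no (?R); it is therefore not the regex itself but a hand-written recursive function mirroring the rules balanced, no_paren and paren. The function names balanced_end and is_balanced are hypothetical.

```python
def balanced_end(text: str, pos: int = 0) -> int:
    """Return the index just past the longest balanced run starting at pos."""
    while True:
        # no_paren ::= [^()]*  (skip everything that is not a parenthesis)
        while pos < len(text) and text[pos] not in "()":
            pos += 1
        # paren ::= ( balanced )  (the recursive call plays the role of (?R))
        if pos < len(text) and text[pos] == "(":
            inner_end = balanced_end(text, pos + 1)
            if inner_end < len(text) and text[inner_end] == ")":
                pos = inner_end + 1
                continue  # simple* : try another no_paren paren pair
        # Either a stray ")" , an unclosed "(" or the end of the text.
        return pos


def is_balanced(text: str) -> bool:
    """True if the whole text is one balanced expression."""
    return balanced_end(text) == len(text)


print(is_balanced("1+f(x+g())-h(2)"))
print(is_balanced("f(x"))
```

The loop corresponds to the outer ( … )* of the regex and the trailing scan to the final [^()]*.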

Embellishments

First of all, let’s add some spacing so that we can identify the components of the regex. Spacing can be added using the x modifier flag, which is off by default.

So we can write something more legible:

(?x:  ([^ ( ) ]* \( (?R) \) )* [^()]* )

Now let’s add some commenting

(?x:  ([^ ( ) ]* \( (?# The next group means “start matching the 
beginning of the regex”)(?R) \) )* [^()]* )

In Perl, we could go further by assigning names to groups. However, in PCRE this will not work, because any named group, once matched, won’t change. This is obviously not what we want.


Derived from https://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions Originally adapted to MediaWiki format by CChris, further adapted to suit my formatting taste by Roy Grubb
