Table Of Contents
- Introduction
- Description
- Terminology
- Details
- Count
- Backreferences
- API
- CLI
- Test Suite
- Regex portability issues
- Alternatives
- Misc
- Links
Synopsis
#include "strregex.h"
int rc = strregex("Hello, world!", "^Hello", NULL); /* match, POSIX syntax */
int rc = strregex("Hello, world!", "m/^hello/i", NULL); /* match, Perl-ish syntax */
int rc = strregex("Hello, world!", "m/^He\\(l\\)\\1/", NULL); /* match with backreferences (requires BRE syntax to be portable) */
int rc = strregex("Hello, world!", "s/^hello/Hi/i", &result); /* substitute, Perl-ish syntax */
int rc = strregex("Hello, world!", "s/^(h)e(l)lo/$1$2i/i", &result); /* substitute with backreferences, Perl style */
int rc = strregex("Hello, world!", "s/^(h)e(l)lo/${1}${2}i/i", &result); /* substitute with backreferences, Perl style with braces */
int rc = strregex("Hello, world!", "s/^(h)e(l)lo/\\1\\2i/i", &result); /* substitute with backreferences, Perl and sed style */
Purpose
To make the use of regular expressions more powerful and easier than using regex.h directly, with no external libraries besides the header-only strregex.h.
Quick start
- Download or clone this project.
Compile and run tests:
make clean all test
Description
Minimal C API layer on top of regex.h, which allows for both matching and substitution using one single function, strregex(), without the need for additional cflags or eflags arguments.
The supported regex syntaxes are:
- For matching and substitution: A Perl-ish style syntax.
This syntax is a reduced set of the Perl Regexp Quote-Like Operators, m and s, and the Perl Modifiers i, g, and m:
- m/PATTERN/igm
- s/PATTERN/REPLACEMENT/igm
Including the modifiers in the regex string removes the need to use REG_ICASE or REG_NEWLINE explicitly.
The syntax for PATTERN follows the rules for BRE and ERE as described below.
For the complete regex syntax, see Details.
- For matching only: PATTERN may also be used as a regex "as is", without the Perl-ish Operators and Modifiers.
The supported syntax for PATTERN depends on the current regex engine (POSIX or GNU).
This API always tries first to compile the regex as ERE; if that fails, a second attempt recompiles the regex as BRE.
This means that this API deals with 4 similar, but not identical syntaxes for PATTERN:
- POSIX BRE
- POSIX ERE
- GNU BRE
- GNU ERE
See Terminology for the details of each syntax.
Terminology
The major regular expression types:
POSIX BRE (Basic Regular Expressions)
The strregex() API uses the BRE syntax only if the regex contains backreferences for matching, i.e. \1, \2 .. \9.
The BRE syntax does not apply to backreferences in the substitution string, see details below.
(It would have been easier to ignore BRE for this API; see the section on portability problems for why it is included at all.)
The POSIX BRE uses the following metacharacters: ^ . [ ] [^ ] $ ( ) \n * {m,n}
With BRE, the following metacharacters have to be escaped:
\{ \} \( \)
Here is a BRE regex for matching 4 integer numbers separated by a dot:
[[:digit:]]\{1,\}.[[:digit:]]\{1,\}.[[:digit:]]\{1,\}.[[:digit:]]\{1,\}
As we see, there is a lot to escape.
And it becomes even worse when using BRE in a C program, as each escape character has to be escaped itself for the program to compile:
const char *regex = "[[:digit:]]\\{1,\\}.[[:digit:]]\\{1,\\}.[[:digit:]]\\{1,\\}.[[:digit:]]\\{1,\\}";
The "backslash-cluttering" makes the regex almost illegible, and is the major reason why the usage of BRE has been avoided in this C API.
POSIX ERE (Extended Regular Expressions)
With this API, the ERE syntax is used, unless the regex contains backreferences.
The POSIX ERE uses the following metacharacters: ^ . [ ] [^ ] $ ( ) * {m,n} ? + |
To summarize: ERE added the metacharacters ? + |, but support for \n backreferences was removed from ERE.
This makes any combination of \1, \2 .. \9 and ? + | invalid.
(Even so, it may still work with your regex engine version; read more below: Portability problems when using backreferences with POSIX regular expressions.)
Besides the additional ? + |, an advantage over BRE is that ERE metacharacters don't need to be escaped.
The ERE version of the BRE expression above is more compact and legible:
const char *regex = "[[:digit:]]{1,}.[[:digit:]]{1,}.[[:digit:]]{1,}.[[:digit:]]{1,}";
The ERE-specific + metacharacter may replace {1,} to make the regex even more compact:
const char *regex = "[[:digit:]]+.[[:digit:]]+.[[:digit:]]+.[[:digit:]]+";
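The three spellings above can be compared directly against regex.h. The following is a minimal sketch that uses only regcomp(3) and regexec(3), not strregex(); the helper name matches() and the variable names are made up for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches `text`, 0 if not,
 * and -1 if the pattern does not compile. `cflags` is passed to regcomp(). */
static int matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* The same expression as BRE (escaped braces) and as ERE (two spellings). */
static const char *bre       = "[[:digit:]]\\{1,\\}.[[:digit:]]\\{1,\\}.[[:digit:]]\\{1,\\}.[[:digit:]]\\{1,\\}";
static const char *ere       = "[[:digit:]]{1,}.[[:digit:]]{1,}.[[:digit:]]{1,}.[[:digit:]]{1,}";
static const char *ere_short = "[[:digit:]]+.[[:digit:]]+.[[:digit:]]+.[[:digit:]]+";
```

Calling matches(bre, 0, "192.168.0.1") and matches(ere, REG_EXTENDED, "192.168.0.1") should both report a match. Note that the unescaped . matches any character, so "1a2b3c4" matches as well; write \\. in the C string if only a literal dot should be accepted.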
GNU extensions
If the current regex engine is GNU (Linux, Android), the GNU extension syntax is also supported (but it is obviously not portable to non-GNU regex engines).
These extensions in fact mean that GNU BREs have exactly the same features as GNU EREs, except that +, ?, |, braces and parentheses need backslashes to give them a special meaning instead of taking it away.
The GNU BRE syntax can be seen as a mix between POSIX BRE (escaped metacharacters) and POSIX ERE (additional metacharacters).
The GNU BRE uses the following metacharacters: ^ . [ ] [^ ] $ ( ) \n * {m,n}
PCRE (Perl Compatible Regular Expressions)
With this API, the PCRE syntax is NOT supported. It is only included here for completeness.
Perl programmers are used to the PCRE syntax, which has a wider regex grammar, and additional features such as lookahead/lookbehind and non-greedy quantifiers.
A Perl programmer would write the above expression in an even more compact way, like this:
my $regex = '\d+.\d+.\d+.\d+';
The C version (backslashes must still be escaped):
#include <pcre.h>
const char *regex = "\\d+.\\d+.\\d+.\\d+"; /* Requires PCRE */
Note that, although the strregex API supports the use of some Perl-ish style operators and modifiers, its regex syntax only supports ERE.
Once again, PCRE syntax is not supported. Use PCRE instead if ERE isn't enough for you.
If you need a really fast matching-only library, consider Hyperscan, which has a similar syntax to PCRE, but does not support substitution.
Details
The strregex() function has 3 arguments: <string>, <regex>, <result>.
<string>: The original string, never modified.
<regex>: The regular expression, see the regex syntax details below.
<result>: Depends on whether matching or substitution is used:
- With matching: Either a pointer to <string> (match) or NULL (no match).
- With substitution: Either the modified string (match) or a pointer to <string> (no match).
The return value from strregex() is either the number of matches, the number of substitutions, or an error code.
See Count - matches and substitutions for details about how the counts are calculated.
The <regex> syntax
The accepted syntax for <regex> is either:
- POSIX BRE (only if <regex> contains backreferences)
- POSIX ERE
- A Perl-ish style regex.
For substitution, the Perl-ish style syntax must be used.
The <regex> syntax - BRE/ERE
If <regex> does not start with either m/ or s/, this API auto-detects backreferences before passing the regex to regcomp(3) and regexec(3):
- If <regex> contains one or more backreferences (\1, \2 .. \9), you are "almost forced to" escape special characters, as the regex is compiled as BRE (REG_EXTENDED unset). Example: ^He\(l\)\1o
- If <regex> does NOT contain any backreferences, the string is compiled as ERE (REG_EXTENDED set). Example: ^Hel{2}o
ERE and backreferences and POSIX
As mentioned above, you are "almost forced to" use BRE syntax if you include backreferences in the regex.
The expression "almost forced to" could also be read as "highly recommended to":
/* Non-POSIX: Combines ERE unescaped parentheses and BRE backreference */
const char *regex_non_posix_compliant = "(a)\\1";
/* POSIX: Uses BRE escaped parentheses and BRE backreference */
const char *regex_posix_compliant = "\\(a\\)\\1";
When this API detects backreferences, it "probes" the current regex engine by first compiling the regex as BRE.
(In the example above, compiling regex_non_posix_compliant as BRE (REG_EXTENDED=0) will always fail, while regex_posix_compliant will always succeed.)
If compiling as BRE fails and regcomp() returns error REG_ESUBREG, then regcomp() is called a second time, but as ERE (REG_EXTENDED=1).
If the second attempt also fails, your regex engine is strictly POSIX compliant, and does not allow combining unescaped special characters and backreferences.
If the second attempt succeeds, your regex engine (probably on Linux) allows the more permissive, but non-POSIX compliant, combination of unescaped special characters and backreferences.
(In the example above, compiling regex_non_posix_compliant as ERE may or may not work, depending on your regex engine, while regex_posix_compliant will always fail.)
Note that such a regex is not portable; it may work with your current regex engine, but will probably fail on other platforms, so it is highly recommended to avoid it.
Read more in Portability problems when using backreferences with POSIX regular expressions.
This is the only "regex engine probe" that this API does.
It doesn't "probe" for any other non-POSIX compliant combinations, such as ERE-specific metacharacters escaped as BRE, for example \?.
That would require the API to "guess" whether the programmer meant a meta ? or a literal \?.
Instead, this API always assumes that the programmer uses ERE syntax, except for backreferences.
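The detection and the two-step probe described above can be sketched with plain regex.h calls. This is only an illustration of the described behaviour, not the code used inside strregex(); the names has_backreference() and compile_with_probe() are hypothetical, and the scan does not treat backslashes inside bracket expressions specially.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `regex` contains \1 .. \9 where the
 * backslash itself is not escaped, otherwise 0. */
static int has_backreference(const char *regex)
{
    assert(regex != NULL);
    for (const char *p = regex; *p != '\0'; p++) {
        if (*p != '\\')
            continue;
        if (p[1] >= '1' && p[1] <= '9')
            return 1;
        if (p[1] != '\0')
            p++;            /* skip the escaped character, e.g. "\\\\" or "\\(" */
    }
    return 0;
}

/* Sketch of the probe: ERE when there are no backreferences; otherwise BRE
 * first, and ERE only when BRE fails with REG_ESUBREG.
 * On success returns 0 and stores 1 (ERE) or 0 (BRE) in *used_ere;
 * on failure returns the regcomp() error code. */
static int compile_with_probe(regex_t *re, const char *regex, int *used_ere)
{
    int rc;

    assert(re != NULL && used_ere != NULL);
    if (!has_backreference(regex)) {
        *used_ere = 1;
        return regcomp(re, regex, REG_EXTENDED);
    }
    *used_ere = 0;
    rc = regcomp(re, regex, 0);                 /* BRE: REG_EXTENDED unset */
    if (rc == REG_ESUBREG) {
        *used_ere = 1;
        rc = regcomp(re, regex, REG_EXTENDED);  /* second attempt, engine-dependent */
    }
    return rc;
}
```

With this sketch, "\\(a\\)\\1" compiles as BRE on the first attempt, while the outcome for "(a)\\1" depends on whether the regex engine accepts backreferences in ERE.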
The <regex> syntax - Perl-ish
If <regex> starts with either m/ or s/, and ends with / (with optional trailing modifier letters i, g, m), it is considered a Perl-ish style regex.
The syntax is one of:
- m/PATTERN/igm
- s/PATTERN/REPLACEMENT/igm
The PATTERN syntax is always either BRE or ERE (depending on whether backreferences are used, see above).
The REPLACEMENT string may be a mix of string literals and backreferences for substitution, i.e. $1, $2 ... (not to be confused with backreferences for matching).
Any use of / inside either PATTERN or REPLACEMENT must be escaped, i.e. \/.
In this API, a subset of Perl's operators and modifiers is used, limited to the m/ and s/ operators, and the g, i and m modifiers.
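Splitting such a string into its parts could look roughly like the sketch below. It is not taken from strregex.h: struct perlish, copy_until_slash() and parse_perlish() are hypothetical names, the fixed buffer sizes are arbitrary, and escaped slashes are copied as-is rather than unescaped.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the parts of a Perl-ish regex string. */
struct perlish {
    char op;                 /* 'm' or 's' */
    char pattern[128];
    char replacement[128];   /* empty for 'm' */
    char modifiers[4];       /* any of "igm", in the order given */
};

/* Copy src up to the next unescaped '/' into dst; return a pointer just past
 * that '/', or NULL if there is none or dst is too small. */
static const char *copy_until_slash(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    while (*src != '\0' && *src != '/') {
        if (*src == '\\' && src[1] != '\0') {   /* keep "\/" and friends together */
            if (n + 1 >= dstsize)
                return NULL;
            dst[n++] = *src++;
        }
        if (n + 1 >= dstsize)
            return NULL;
        dst[n++] = *src++;
    }
    dst[n] = '\0';
    return *src == '/' ? src + 1 : NULL;
}

/* Returns 1 and fills *out if `regex` is m/PATTERN/mods or
 * s/PATTERN/REPLACEMENT/mods; returns 0 otherwise (treat as plain BRE/ERE). */
static int parse_perlish(const char *regex, struct perlish *out)
{
    const char *p;
    size_t nmods;

    assert(regex != NULL && out != NULL);
    if ((regex[0] != 'm' && regex[0] != 's') || regex[1] != '/')
        return 0;
    out->op = regex[0];
    out->replacement[0] = '\0';
    p = copy_until_slash(regex + 2, out->pattern, sizeof out->pattern);
    if (p != NULL && out->op == 's')
        p = copy_until_slash(p, out->replacement, sizeof out->replacement);
    if (p == NULL)
        return 0;
    nmods = strlen(p);
    if (nmods >= sizeof out->modifiers || strspn(p, "igm") != nmods)
        return 0;
    strcpy(out->modifiers, p);
    return 1;
}
```

A string that does not parse, such as /hello/i without the leading m, is simply reported as "not Perl-ish", which fits the rule that such strings are handled as plain POSIX regexes.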
The <regex> syntax - Perl-ish - the operators m/ and s/
The m/ operator (for matching): m/PATTERN/igm
When using this operator, strregex() returns the number of matches.
Examples:
- m/^He\\(l\\)\\1o/ : PATTERN contains a backreference for matching => compile as BRE, returns 2 matches
- m/^Hello/ : PATTERN does NOT contain any backreference for matching => compile as ERE, returns 1 match
Both examples above match the string "Hello", but the return value is different (2 matches versus 1 match).
See Count - matches for how matches are counted.
Note that, without any modifiers, the Perl-ish m/^Hello/ and the POSIX ^Hello are functionally identical and may be used interchangeably.
The s/ operator (for substitution): s/PATTERN/REPLACEMENT/igm
When using this operator, strregex() returns the number of substitutions.
The same rules apply for PATTERN as with the m/ operator.
Examples:
- s/^He\\(l\\)\\1o/Hi/ : PATTERN contains a backreference for matching => compile as BRE, 1 substitution
- s/^(H)ello/$1i/ : PATTERN does NOT contain any backreference for matching (even if it contains a backreference for substitution) => compile as ERE, 1 substitution
Both examples above match the string "Hello", store the substitution "Hi" in <result>, and return the same value (1 substitution).
See Count - substitutions for how substitutions are counted.
The <regex> syntax - Perl-ish - the modifiers i, g and m
The i modifier (case-insensitive):
Case-insensitive match, so m/^Hello/i matches both "Hello" and "hello".
Internally, REG_ICASE is set for regcomp().
The g modifier (global):
Match/substitute globally, that is, as many times as possible in a string.
The match/substitution count is affected by the g modifier:
- m/Hello/ matching "HelloHelloHello" returns 1 (matches once)
- m/Hello/g matching "HelloHelloHello" returns 3 (matches 3 times)
Read more about how matches and substitutions are counted in Count - matches and substitutions.
The m modifier (multiline):
Multiline match, so m/^there/m matches "Hello\nthere" (without m, ^ only matches at the start of the string).
Internally, REG_NEWLINE is set for regcomp().
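The mapping from modifier letters to regcomp() flags can be sketched as follows. The helper names modifiers_to_cflags() and match_with_modifiers() are made up for this example, and how strregex() handles unknown letters is not covered here; the sketch just ignores them.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: translate modifier letters into regcomp() cflags.
 * 'g' has no regcomp() flag; it is reported separately through *global. */
static int modifiers_to_cflags(const char *modifiers, int *global)
{
    int cflags = 0;

    assert(modifiers != NULL && global != NULL);
    *global = 0;
    for (const char *p = modifiers; *p != '\0'; p++) {
        switch (*p) {
        case 'i': cflags |= REG_ICASE;   break;
        case 'm': cflags |= REG_NEWLINE; break;
        case 'g': *global = 1;           break;
        default:  break;   /* unknown letters are ignored in this sketch */
        }
    }
    return cflags;
}

/* Returns 1 on match, 0 on no match, -1 if the ERE pattern does not compile. */
static int match_with_modifiers(const char *text, const char *pattern,
                                const char *modifiers)
{
    regex_t re;
    int global, rc;
    int cflags = REG_EXTENDED | REG_NOSUB | modifiers_to_cflags(modifiers, &global);

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, match_with_modifiers("Hello\nthere", "^there", "m") reports a match, while the same call with an empty modifier string does not, because ^ then only anchors at the start of the whole string.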
Count - matches and substitutions
The strregex() function returns the match count or the substitution count, depending on whether the m/ or the s/ operator was used in the regex.
To summarize:
- A match count depends on both the number of regex subgroups and the g modifier.
- A substitution count depends only on the g modifier.
Count - matches
For match-only regular expressions, the number of matches is obtained in the same way as with the POSIX regex engine.
That is, if REG_NOSUB is unset when calling regcomp(), the pmatch[] array will contain all matches, so the count is obtained by counting array elements (not equal to -1).
The pmatch[] array will contain one element for the entire match, and one element for each matching subgroup.
The pmatch[] array is also needed when using the g modifier, as the match will be repeated globally, to get as many matches as possible.
Even if there are no subgroups in the regex, each repeated match needs the offset of the previous match.
This offset is obtained from the pmatch[] array.
If there are no subgroups in the regex, and the g modifier is excluded, there can be only 0 or 1 matches.
The pmatch[] array is not needed, so in this single case, REG_NOSUB will be set when calling regcomp().
Count - substitutions
For substitution regular expressions, the POSIX regex engine does not offer a solution, so there isn't anything similar to the pmatch[] array for substitution.
Substitution is done entirely by this API, which keeps track of the number of substitutions.
However, the substitutions rely on the pmatch[] array to get the string offsets to substitute, so REG_NOSUB is always unset when using s/, even if there are no subgroups and/or the g modifier is excluded.
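How a substitution can be built from the pmatch[] offsets is sketched below, using only regex.h and the C standard library. This is an illustration under simplifying assumptions, not the strregex() implementation: the helper names append() and substitute() are hypothetical, the replacement is a literal string without backreference expansion, and an empty match simply ends the loop.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Append n bytes of src to the growing buffer *buf (length *len).
 * Returns 0 on success, -1 on allocation failure (the buffer is freed). */
static int append(char **buf, size_t *len, const char *src, size_t n)
{
    char *tmp = realloc(*buf, *len + n + 1);

    if (tmp == NULL) {
        free(*buf);
        *buf = NULL;
        return -1;
    }
    memcpy(tmp + *len, src, n);
    *len += n;
    tmp[*len] = '\0';
    *buf = tmp;
    return 0;
}

/* Hypothetical helper: replace matches of the ERE `pattern` in `text` with the
 * literal string `repl`. Stores the number of substitutions in *count and
 * returns a malloc()ed string, or NULL on error. */
static char *substitute(const char *text, const char *pattern,
                        const char *repl, int global, int *count)
{
    regex_t re;
    regmatch_t pmatch[1];
    char *out = NULL;
    size_t len = 0;
    int eflags = 0;

    assert(text != NULL && pattern != NULL && repl != NULL && count != NULL);
    *count = 0;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)   /* REG_NOSUB must stay unset */
        return NULL;

    while (*text != '\0' && regexec(&re, text, 1, pmatch, eflags) == 0) {
        size_t so = (size_t)pmatch[0].rm_so, eo = (size_t)pmatch[0].rm_eo;

        if (eo == so)
            break;                      /* empty match: stop, to keep the sketch short */
        if (append(&out, &len, text, so) != 0 ||
            append(&out, &len, repl, strlen(repl)) != 0)
            goto fail;
        (*count)++;
        text += eo;                     /* continue after the replaced span */
        eflags = REG_NOTBOL;            /* ^ must not match in the middle of the string */
        if (!global)
            break;
    }
    if (append(&out, &len, text, strlen(text)) != 0)   /* copy the unmatched tail */
        goto fail;
    regfree(&re);
    return out;
fail:
    regfree(&re);
    return NULL;
}
```

The caller owns the returned string and releases it with free(). When nothing matches, this sketch returns a copy of the input, which differs from the pointer-to-<string> behaviour described for strregex().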
Count - matches and substitutions - examples
This set of examples shows how the counts may differ even for identical or very similar regexes, depending on the use of subgroups and the g modifier:
#include "strregex.h"
int count1 = strregex("hihihihihihi", "m/hihi/", NULL); /* 1 match, no subgroups, no g modifier (REG_NOSUB is set) */
int count2 = strregex("hihihihihihi", "m/(hi)(hi)/", NULL); /* 2 matches, subgroups, no g modifier */
int count3 = strregex("hihihihihihi", "m/hihi/g", NULL); /* 3 matches, no subgroups, g modifier */
int count4 = strregex("hihihihihihi", "m/(hi)(hi)/g", NULL); /* 6 matches, subgroups, g modifier */
int count5 = strregex("hihihihihihi", "s/hihi/HO/", &result); /* 1 substitution, no subgroups, no g modifier (but REG_NOSUB is unset) */
int count6 = strregex("hihihihihihi", "s/(hi)(hi)/HO/", &result); /* 1 substitution, subgroups, no g modifier */
int count7 = strregex("hihihihihihi", "s/hihi/HO/g", &result); /* 3 substitutions, no subgroups, g modifier */
int count8 = strregex("hihihihihihi", "s/(hi)(hi)/HO/g", &result); /* 3 substitutions, subgroups, g modifier */
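The four match counts listed above can be reproduced with plain regex.h under one assumed counting rule: a match without subgroups counts as 1, and a match with subgroups counts one per participating subgroup (the same result Perl gives in list context). The sketch below implements that rule; it is an assumption chosen to fit these examples, not a description of the strregex() internals, and count_matches() and MAX_GROUPS are hypothetical names.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10   /* pmatch[0] plus up to 9 subgroups; arbitrary for this sketch */

/* Hypothetical helper: count matches of the ERE `pattern` in `text`.
 * With `global` set, matching restarts after the end of the previous match.
 * Returns -1 on a compile error. */
static int count_matches(const char *text, const char *pattern, int global)
{
    regex_t re;
    regmatch_t pmatch[MAX_GROUPS];
    int count = 0;
    int eflags = 0;

    assert(text != NULL && pattern != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, text, MAX_GROUPS, pmatch, eflags) == 0) {
        if (re.re_nsub == 0) {
            count++;                                /* whole match only */
        } else {
            for (size_t i = 1; i <= re.re_nsub && i < MAX_GROUPS; i++)
                if (pmatch[i].rm_so != -1)
                    count++;                        /* one per participating subgroup */
        }
        if (!global || text[pmatch[0].rm_eo] == '\0')
            break;
        /* Continue after this match; step one extra character on an empty match. */
        text += pmatch[0].rm_eo + (pmatch[0].rm_eo == pmatch[0].rm_so ? 1 : 0);
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return count;
}
```

Called with "hihihihihihi", the patterns "hihi" and "(hi)(hi)" give 1 and 2 without the global flag, and 3 and 6 with it.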
Count - matches and substitutions - the reference
This API mimics Perl in many ways, and the match and substitution counts are no exception.
So, if in doubt whether strregex() returns the expected count, use Perl's regex engine as a reference:
#!/usr/bin/env perl
use strict;
my $str;
my $count;
# match count
$str = 'hihihihihihi'; $count = () = $str =~ m/hihi/; print("MATCH: count=$count\n"); # 1 match
$str = 'hihihihihihi'; $count = () = $str =~ m/(hi)(hi)/; print("MATCH, SUBGROUPS: count=$count\n"); # 2 matches
$str = 'hihihihihihi'; $count = () = $str =~ m/hihi/g; print("MATCH, GLOBAL: count=$count\n"); # 3 matches
$str = 'hihihihihihi'; $count = () = $str =~ m/(hi)(hi)/g; print("MATCH, SUBGROUPS, GLOBAL: count=$count\n"); # 6 matches
# substitution count
$str = 'hihihihihihi'; $count = $str =~ s/hihi/HO/; print("SUBSTITUTE: str='$str', count=$count\n"); # 1 substitution
$str = 'hihihihihihi'; $count = $str =~ s/(hi)(hi)/HO/; print("SUBSTITUTE, SUBGROUPS: str='$str', count=$count\n"); # 1 substitution
$str = 'hihihihihihi'; $count = $str =~ s/hihi/HO/g; print("SUBSTITUTE, GLOBAL: str='$str', count=$count\n"); # 3 substitutions
$str = 'hihihihihihi'; $count = $str =~ s/(hi)(hi)/HO/g; print("SUBSTITUTE, GROUPS, GLOBAL: str='$str', count=$count\n"); # 3 substitutions
See also strregex.pl, the Perl version of the strregex command-line tool; both are used for testing strregex().
Backreferences for matching and for substitution
In this API, backreferences try to mimic Perl's syntax, so only \1 is allowed in the PATTERN string, while any of $1, ${1}, \1 may be used in the REPLACEMENT string.
This API supports \1 in REPLACEMENT, even if Perl treats it like a second-class citizen:
perl -wE '$_ = "foo"; s/(foo)/bar $1/; say'
perl -wE '$_ = "foo"; s/(foo)/bar ${1}/; say'
perl -wE '$_ = "foo"; s/(foo)/bar \1/; say' # shows warning: "\1 better written as $1 at -e line 1."
Just as in Perl and POSIX BRE, backreferences in PATTERN must refer to an existing group in the regex, i.e. (a)(b)\2 is valid, while (a)(b)\3 is invalid.
In case of a mismatch, this API returns the POSIX regerror: invalid backreference number.
In Perl, backreferences in REPLACEMENT don't need to match any group.
Perl silently evaluates non-matching references to the empty string: s/(a)(b)/$2/ returns b on match, while s/(a)(b)/$3/ returns an empty string.
This API acts differently from Perl in this case: s/(a)(b)/$2/ returns b on match, while s/(a)(b)/$3/ returns an error.
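Expanding the three reference styles against a pmatch[] array might look like the sketch below. Several details are assumptions made for the example and are not confirmed by the text above: $N reads all following digits, \N reads a single digit, group 0 is rejected, and a group that exists but did not take part in the match expands to nothing. All function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <regex.h>
#include <string.h>

/* Parse a run of decimal digits; stops early on absurdly large numbers. */
static const char *parse_number(const char *p, size_t *value)
{
    *value = 0;
    while (isdigit((unsigned char)*p) && *value < 1000000)
        *value = *value * 10 + (size_t)(*p++ - '0');
    return p;
}

/* Hypothetical helper: expand $N, ${N} and \N in `repl` using pmatch[0..ngroups]
 * (offsets relative to `text`). Returns 0 on success, -1 if a reference names
 * a group that does not exist or `out` is too small. */
static int expand_replacement(const char *repl, const char *text,
                              const regmatch_t *pmatch, size_t ngroups,
                              char *out, size_t outsize)
{
    size_t n = 0;

    assert(repl != NULL && text != NULL && pmatch != NULL && out != NULL && outsize > 0);
    while (*repl != '\0') {
        const char *end = NULL;     /* set when repl starts with a reference */
        size_t group = 0;

        if (repl[0] == '$' && repl[1] == '{' && isdigit((unsigned char)repl[2])) {
            end = parse_number(repl + 2, &group);
            end = (*end == '}') ? end + 1 : NULL;
        } else if (repl[0] == '$' && isdigit((unsigned char)repl[1])) {
            end = parse_number(repl + 1, &group);
        } else if (repl[0] == '\\' && isdigit((unsigned char)repl[1])) {
            group = (size_t)(repl[1] - '0');
            end = repl + 2;
        }
        if (end == NULL) {                      /* ordinary character */
            if (n + 1 >= outsize)
                return -1;
            out[n++] = *repl++;
            continue;
        }
        if (group == 0 || group > ngroups)
            return -1;                          /* unlike Perl: an error, not "" */
        if (pmatch[group].rm_so != -1) {
            size_t glen = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
            if (n + glen >= outsize)
                return -1;
            memcpy(out + n, text + pmatch[group].rm_so, glen);
            n += glen;
        }
        repl = end;
    }
    out[n] = '\0';
    return 0;
}

/* Match the ERE `pattern` against `text` and expand `repl` into `out`.
 * Returns 0 on success, -1 on compile error, no match, or a bad reference. */
static int expand_demo(const char *text, const char *pattern, const char *repl,
                       char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pmatch[10];
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (re.re_nsub < 10 && regexec(&re, text, 10, pmatch, 0) == 0)
        rc = expand_replacement(repl, text, pmatch, re.re_nsub, out, outsize);
    regfree(&re);
    return rc;
}
```

With text "ab" and pattern "(a)(b)", the replacement "$2" expands to "b", while "$3" is rejected, mirroring the difference from Perl described above.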
Backreferences - limits
In this API, the backreferences in PATTERN have the same range limit as the underlying POSIX regex engine.
The POSIX standard says that the range \1, \2 .. \9 is supported, which seems to be common practice in implemented regex engines.
Note that support for a larger number of regex subgroups (basically, the number of parentheses) in a regex is common, but only the first 9 subgroups may be referred to by backreferences.
Two examples using 26 subgroups, 1 for each letter, followed by backreferences 1-9 and 1-10, respectively:
Example 1 - BRE regex with 26 subgroups followed by backreferences in the range 1-9, matches:
string = "ABCDEFGHIJKLMNOPQRSTUVWXYZ IHGFEDCBA"
regex = "\\(A\\)\\(B\\)\\(C\\)\\(D\\)\\(E\\)\\(F\\)\\(G\\)\\(H\\)\\(I\\)\\(J\\)\\(K\\)\\(L\\)\\(M\\)\\(N\\)\\(O\\)\\(P\\)\\(Q\\)\\(R\\)\\(S\\)\\(T\\)\\(U\\)\\(V\\)\\(W\\)\\(X\\)\\(Y\\)\\(Z\\) \\9\\8\\7\\6\\5\\4\\3\\2\\1"
Example 2 - BRE regex with 26 subgroups followed by backreferences in the range 1-10, does not match, as backreference \\10 is not supported, but instead is interpreted as \\1 followed by a literal 0:
string = "ABCDEFGHIJKLMNOPQRSTUVWXYZ JIHGFEDCBA"
regex = "\\(A\\)\\(B\\)\\(C\\)\\(D\\)\\(E\\)\\(F\\)\\(G\\)\\(H\\)\\(I\\)\\(J\\)\\(K\\)\\(L\\)\\(M\\)\\(N\\)\\(O\\)\\(P\\)\\(Q\\)\\(R\\)\\(S\\)\\(T\\)\\(U\\)\\(V\\)\\(W\\)\\(X\\)\\(Y\\)\\(Z\\) \\10\\9\\8\\7\\6\\5\\4\\3\\2\\1"
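The single-digit rule from Example 2 can be checked with a much shorter pattern. This is a minimal sketch using regcomp(3) directly; the helper name bre_matches() is hypothetical, and the expected results assume a regex engine that reads \10 as \1 followed by a literal 0, as described above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the BRE `pattern` matches `text`, 0 if not,
 * -1 if the pattern does not compile. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)   /* REG_EXTENDED unset: BRE */
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Here bre_matches("^\\(a\\)\\10$", "aa0") reports a match, because the pattern means "a, the same a again, then the character 0", while the string "aa" does not match.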
Using the style with braces ${1} may be useful when REPLACEMENT contains a backreference followed by a literal digit: Compare $10 versus ${1}0.
However, Perl has a limitation here: While backreferences without braces ($1, $2 ..) have a theoretically unlimited range, backreferences with braces are limited to the range ${1} .. ${9}.
Trying to use the backreference ${10} in Perl causes an error.
The strregex API allows for two-or-more-digit backreferences with braces, with a range of ${1} .. ${INT_MAX}, i.e. ${2147483647}.
(Yes, you need a looong PATTERN with maaany subgroups to break that limit.)
API
The API includes the following functions:
int strregex(const char *string, const char *regex, char **result);
Pass a string string and a regular expression regex, get the matched/substituted string in result, or NULL on no match/no substitution.
For substitution, memory for result must be deallocated afterwards.
This can be done using free(result), but strregex_free(&str, &result) (see below) is recommended.
Returns: Number of matches/substitutions (0 or more), or an error code (negative number, see below).
const char *strregex_error(int errcode);
Get a human-readable string for an error code returned by strregex().
Note:
Internally, an error may be caused by:
- an error from regcomp()
- an error from strregex() itself
The error codes from regcomp() are defined in regex.h, normally as positive numbers.
Internally, strregex() converts these error codes (and its own error codes) to negative values.
That way, the return value from strregex() may be non-negative (indicating the number of matches) or negative (indicating the error code).
As all error codes are negative values, calling strregex_error() with a positive errcode value will return the same (empty) string as for errcode=0.
Note that the regexec() return values (zero for a successful match, otherwise REG_NOMATCH) are not considered errors, but only an indicator of whether a regex matched or not.
Returns: Error string
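The sign convention can be illustrated with regcomp(3) and regerror(3) alone. The sketch below covers only errors coming from regcomp(); the codes that strregex() defines for itself are not known here, and try_compile() and describe() are hypothetical names, not part of the API.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as ERE and return 0 on success or
 * the regcomp() error code converted to a negative value. */
static int try_compile(const char *pattern)
{
    regex_t re;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc == 0)
        regfree(&re);
    return rc > 0 ? -rc : rc;
}

/* Hypothetical counterpart of an error-string lookup: negative codes are
 * translated with regerror(); zero and positive codes give an empty string. */
static const char *describe(int errcode, char *buf, size_t bufsize)
{
    assert(buf != NULL && bufsize > 0);
    buf[0] = '\0';
    if (errcode < 0)
        regerror(-errcode, NULL, buf, bufsize);
    return buf;
}
```

For instance, try_compile("(") yields a negative value, because an unmatched parenthesis is an error in ERE, and describe() turns that value into the message that the regex engine provides.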
void strregex_free(const char **string, char **result);
Free a result only if needed.
If strregex() failed, or did not match/substitute anything, or if strregex() matched using a match-only regular expression, there is no need to call this function (but it does no harm, either).
If strregex() has been called using a substitution regular expression, and text has been successfully substituted, this function frees the resulting text.
If in doubt, always call this function instead of free().
Returns: Nothing
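One way to get the "free only if needed" behaviour is to compare the two pointers, since result may point at the original string when nothing was substituted. The sketch below is an assumption about how such a helper could work, not the actual strregex_free() code; free_if_owned() is a made-up name.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a "free only if needed" helper: *result is freed
 * only when it is non-NULL and does not alias the caller's *string.
 * In every case *result is reset to NULL afterwards. */
static void free_if_owned(const char **string, char **result)
{
    assert(string != NULL && result != NULL);
    if (*result != NULL && *result != *string)
        free(*result);
    *result = NULL;
}
```

With this approach the same call is safe after a match, after a substitution, and after a failure, which is why calling the helper unconditionally is simpler than deciding case by case whether free() is allowed.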
CLI
The strregex program may be used to test the API:
./strregex <string> <regex> [-c|--count] [-v|--verbose]
Its behaviour is similar to:
echo <string> | grep <regex>
echo <string> | sed 's/<regex>/<subst>/'
That is:
- for a match-only <regex>, output either the original string (match), or nothing (no match).
- for a substitution <regex>, output either the substituted string (substitution), or the original string (no substitution).
Flags:
-c: Show only the number of matches. A negative number means that an error occurred.
-v: Verbose output
Examples:
./strregex 'Hello, world!' '^Hello' <-- POSIX regex syntax, match, outputs the original string
./strregex 'Hello, world!' '^hello' <-- POSIX regex syntax, no match, no output
./strregex 'Hello, world!' 'm/^hello/' <-- Perl-ish regex syntax for matching, no match, no output
./strregex 'Hello, world!' 'm/^hello/i' <-- Perl-ish regex syntax for matching, case-insensitive, matches, outputs the original string
./strregex 'Hello, world!' 's/^hello/Hi/i' <-- Perl-ish regex syntax for substitution, outputs 'Hi, world!'
./strregex 'Hello, world!' 's/^Hola/Hi/' <-- No substitution, returns original string 'Hello, world!'
./strregex 'Halo, mundo!' 's/(a)(l)(o)/$3$2$1/' <-- Substitution using backreferences, outputs 'Hola, mundo!'
./strregex -c 'Hello, world!' 'm/l/g' <-- Matches 'l' three times, outputs '3'
Notes:
Unlike in Perl, the m operator is not optional when using the Perl-ish syntax.
When m is omitted, the regex is interpreted as a POSIX regular expression:
./strregex 'Hello, world!' 'm/hello/i' <-- Interpreted as Perl-ish. Match.
./strregex 'Hello, world!' '/hello/i' <-- Interpreted as POSIX. No match.
./strregex '/hello/i, world!' '/hello/i' <-- Interpreted as POSIX. Match.
To match a string which is a Perl-ish regular expression itself, the workaround is to put the entire string in a subgroup, with possible additional metacharacters:
./strregex 'm/hello/i' '^(m/hello/i){1}$' <-- Interpreted as POSIX. Match.
Test Suite
%%%
Portability problems when using backreferences with POSIX regular expressions
In this API, it would have been easier to avoid the BRE syntax altogether.
The BRE regex syntax uses backslashes for special characters while ERE doesn't.
In C code, backslashes themselves have to be escaped, which makes a BRE regex in C code much harder to read than a similar ERE regex.
However, there is a problem with backreferences: BRE supports them, while ERE (at least officially) does not:
The meaning of metacharacters escaped with a backslash is reversed for some characters in the POSIX Extended Regular Expression (ERE) syntax.
With this syntax, a backslash causes the metacharacter to be treated as a literal character. So, for example, \( \) is now ( ) and \{ \} is now { }.
Additionally, support is removed for \n backreferences ...
Not being aware of this may lead to confusion, as it is easy to assume that ERE is a successor of BRE, which is not entirely true.
Let's see an example regex which uses both the choice/alternation/set-union operator (|), and backreferences (\1\2):
(1|2)(3|4)\1\2
It would be useful to match certain patterns, like these:
1313 # MATCH
1414 # MATCH
2323 # MATCH
2424 # MATCH
2324 # NO MATCH
But...
- the choice/alternation/set-union operator (|) is only supported with ERE, not with BRE
- backreferences (\1\2) are only supported with BRE, not with ERE
So, at least officially, if we stick to the POSIX standard, we are stuck.
Unless switching to PCRE, there is no way to use the regex above. :-(
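Whether the regex engine at hand accepts the ERE above can be checked at run time. The following sketch uses only regex.h; backrefs_work() is a hypothetical helper, and the test strings are taken from the list of patterns above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical probe: returns 1 if `pattern` compiles with `cflags` and
 * behaves as expected (1313 and 2424 match, 2324 does not), otherwise 0. */
static int backrefs_work(const char *pattern, int cflags)
{
    regex_t re;
    int ok;

    assert(pattern != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return 0;
    ok = regexec(&re, "1313", 0, NULL, 0) == 0 &&
         regexec(&re, "2424", 0, NULL, 0) == 0 &&
         regexec(&re, "2324", 0, NULL, 0) == REG_NOMATCH;
    regfree(&re);
    return ok;
}
```

The call backrefs_work("^(1|2)(3|4)\\1\\2$", REG_EXTENDED) returns 1 or 0 depending on the platform. For single-character alternatives such as these, bracket expressions give a BRE spelling that stays within POSIX: backrefs_work("^\\([12]\\)\\([34]\\)\\1\\2$", 0). That rewrite does not help when the alternatives are longer than one character.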
To add more confusion, some regex.h library versions do support backreferences for both BRE and ERE.
Eli Bendersky has mentioned this in some notes on POSIX regular expressions.
He managed to use backreferences with ERE using his program regex_sample, compiled for gcc 4.6 on Ubuntu Linux 12.04.
I used his program with an increased pmatch array, and tested it on different platforms.
On FreeBSD 11.2 (clang 6.0.0) and MSYS2 (gcc 7.3.0):
regexec() fails to match when mixing BRE and ERE syntax
./regex_sample extended "([a-z]+)_\1" abc_abc
Program argc=4
[0]: ./regex_sample
[1]: extended
[2]: ([a-z]+)_\1
[3]: abc_abc
--------
No match
./regex_sample extended "(1|2)(3|4)\1\2" 1313
Program argc=4
[0]: ./regex_sample
[1]: extended
[2]: (1|2)(3|4)\1\2
[3]: 1313
--------
No match
The | metacharacter is not supported for BRE (as the POSIX standard says):
./regex_sample basic "\(A\|B\)\(C\|D\)\1\2" ACAC
Program argc=4
[0]: ./regex_sample
[1]: basic
[2]: \(A\|B\)\(C\|D\)\1\2
[3]: ACAC
--------
No match
On Ubuntu 16.04 (gcc 5.4.0):
regexec() matches successfully mixing BRE and ERE syntax
./regex_sample extended "([a-z]+)_\1" abc_abc
Program argc=4
[0]: ./regex_sample
[1]: extended
[2]: ([a-z]+)_\1
[3]: abc_abc
--------
Match group 0: abc_abc
Match group 1: abc
./regex_sample extended "(1|2)(3|4)\1\2" 1313
Program argc=4
[0]: ./regex_sample
[1]: extended
[2]: (1|2)(3|4)\1\2
[3]: 1313
--------
Match group 0: 1313
Match group 1: 1
Match group 2: 3
Contrary to what the POSIX standard says, the | metacharacter is supported for BRE on Linux, as long as it is escaped:
./regex_sample basic "\(A\|B\)\(C\|D\)\1\2" ACAC
Program argc=4
[0]: ./regex_sample
[1]: basic
[2]: \(A\|B\)\(C\|D\)\1\2
[3]: ACAC
--------
Match group 0: ACAC
Match group 1: A
Match group 2: C
Same result for all platforms:
More than 9 subgroups are supported for BRE, but only backreferences index 1-9 may be used (as the POSIX standard says).
Note that the backreference index limit to 1-9 also applies to Linux, even when using ERE.
Test: 12 subgroups and 9 backreferences => match
./regex_sample.exe basic "\\(A\\)\\(B\\)\\(C\\)\\(D\\)\\(E\\)\\(F\\)\\(G\\)\\(H\\)\\(I\\)\\(J\\)\\(K\\)\\(L\\) \\9\\8\\7\\6\\5\\4\\3\\2\\1" "ABCDEFGHIJKL IHGFEDCBA"
Program argc=4
[0]: ./regex_sample
[1]: basic
[2]: \(A\)\(B\)\(C\)\(D\)\(E\)\(F\)\(G\)\(H\)\(I\)\(J\)\(K\)\(L\) \9\8\7\6\5\4\3\2\1
[3]: ABCDEFGHIJKL IHGFEDCBA
--------
Match group 0: ABCDEFGHIJKL IHGFEDCBA
Match group 1: A
Match group 2: B
Match group 3: C
Match group 4: D
Match group 5: E
Match group 6: F
Match group 7: G
Match group 8: H
Match group 9: I
Match group 10: J
Match group 11: K
Match group 12: L
Test: 12 subgroups and 10 backreferences => no match
$ ./regex_sample.exe basic "\\(A\\)\\(B\\)\\(C\\)\\(D\\)\\(E\\)\\(F\\)\\(G\\)\\(H\\)\\(I\\)\\(J\\)\\(K\\)\\(L\\) \\10\\9\\8\\7\\6\\5\\4\\3\\2\\1" "ABCDEFGHIJKL JIHGFEDCBA"
Program argc=4
[0]: ./regex_sample
[1]: basic
[2]: \(A\)\(B\)\(C\)\(D\)\(E\)\(F\)\(G\)\(H\)\(I\)\(J\)\(K\)\(L\) \10\9\8\7\6\5\4\3\2\1
[3]: ABCDEFGHIJKL JIHGFEDCBA
--------
No match
Test: 12 subgroups and 10 backreferences with ERE => no match (does not match on any platform, Linux included)
./regex_sample extended "(A)(B)(C)(D)(E)(F)(G)(H)(I)(J)(K)(L) \\10\\9\\8\\7\\6\\5\\4\\3\\2\\1" "ABCDEFGHIJKL JIHGFEDCBA"
Program argc=4
[0]: ./regex_sample
[1]: extended
[2]: (A)(B)(C)(D)(E)(F)(G)(H)(I)(J)(K)(L) \10\9\8\7\6\5\4\3\2\1
[3]: ABCDEFGHIJKL JIHGFEDCBA
--------
No match
To summarize:
If you manage to compile an ERE regex with backreferences, don't be too sure it will be portable.
To save yourself some headaches, try to avoid matching backreferences altogether.
More on this subject:
https://stackoverflow.com/questions/53767426/c-posix-ere-without-back-references
https://stackoverflow.com/questions/13322996/do-extended-regexes-support-back-references
Alternatives
If you are looking for:
- a. A more powerful C regex library: Use PCRE or Hyperscan (very fast, matching-only library) instead.
- b. A more powerful CLI tool: Use Perl, or combine grep and sed:
Match (Perl or grep):
echo -e 'Hello, world!' | perl -ne 'print if m/Hello/'
echo -e 'Hello, world!' | grep Hello
Hello, world!
Replace (Perl or sed):
echo -e 'Hello, world!' | perl -pe 's/Hello/Hi/g'
echo -e 'Hello, world!' | sed 's/Hello/Hi/g'
Hi, world!
Misc
The other files in this repo are a collection of examples using regex.h.
To compile them:
make clean misc
Links
- regex on Wikipedia
- man(3) regex
- POSIX Basic and Extended Regular Expressions
- regex(3) for Linux and regex(3) for FreeBSD - Note that FreeBSD supports additional flags (which are not recommended to use, for portability issues)
- Some notes on POSIX regular expressions
- Regular expression use in C
- Regular Expression parsing in C
- PCRE
- Hyperscan - fast, matching-only regex library
- The Perl operator 'm'
- The Perl operator 's'
- The Perl modifiers 'i', 'g' and 'm'