strregex

strregex.h - minimal C API layer on top of regex.h

Synopsis

#include "strregex.h"

int rc = strregex("Hello, world!", "^Hello", NULL);                      /* match, POSIX syntax */
int rc = strregex("Hello, world!", "m/^hello/i", NULL);                  /* match, Perl-ish syntax */
int rc = strregex("Hello, world!", "m/^He\\(l\\)\\1/", NULL);            /* match with backreferences (requires BRE syntax to be portable)*/
int rc = strregex("Hello, world!", "s/^hello/Hi/i", &result);            /* substitute, Perl-ish syntax */
int rc = strregex("Hello, world!", "s/^(h)e(l)lo/$1$2i/i", &result);     /* substitute with backreferences, Perl style */
int rc = strregex("Hello, world!", "s/^(h)e(l)lo/${1}${2}i/i", &result); /* substitute with backreferences, Perl style with braces */
int rc = strregex("Hello, world!", "s/^(h)e(l)lo/\\1\\2i/i", &result);   /* substitute with backreferences, Perl and sed style */

Purpose

To make the use of regular expressions more powerful and easier than using regex.h directly, with no external libraries besides the header-only strregex.h.

Quick start

  1. Download or clone this project.
  2. Compile and run tests:

    make clean all test

Description

Minimal C API layer on top of regex.h, which allows both matching and substitution through a single function, strregex(), without the need for additional cflags or eflags arguments.
The supported regex syntaxes are:

Including the modifiers in the regex string removes the need to use REG_ICASE or REG_NEWLINE explicitly.
The syntax for PATTERN follows the rules for BRE and ERE as described below.
For the complete regex syntax, see Details.

The supported syntax for PATTERN depends on the current regex engine (POSIX or GNU).
This API always tries to compile the regex as ERE first; if that fails, a second attempt re-compiles the regex as BRE.
This means that this API deals with 4 similar, but not identical syntaxes for PATTERN:

See Terminology for the details of each syntax.
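
The "ERE first, then BRE" strategy described above can be sketched with plain regex.h. This is only an illustration of the idea, not the code inside strregex.h; the helper name compile_ere_then_bre() and the enum are made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Which syntax the pattern was finally compiled with. */
enum syntax_used { SYNTAX_NONE, SYNTAX_ERE, SYNTAX_BRE };

/*
 * Hypothetical helper (not part of strregex.h): compile `pattern` as ERE
 * first and, only if that fails, retry the same pattern as BRE.
 * Returns 0 on success, otherwise the regcomp() error code of the BRE attempt.
 * On success the caller owns `re` and must release it with regfree().
 */
static int compile_ere_then_bre(regex_t *re, const char *pattern, int cflags,
                                enum syntax_used *used)
{
    assert(re != NULL && pattern != NULL && used != NULL);

    int rc = regcomp(re, pattern, cflags | REG_EXTENDED);
    if (rc == 0) {
        *used = SYNTAX_ERE;
        return 0;
    }

    /* Second attempt: same pattern, same flags, but without REG_EXTENDED. */
    rc = regcomp(re, pattern, cflags & ~REG_EXTENDED);
    *used = (rc == 0) ? SYNTAX_BRE : SYNTAX_NONE;
    return rc;
}
```

For instance, "a|b" compiles on the first attempt, while "(ab" is rejected as ERE (unmatched parenthesis) and is then accepted as BRE, where "(" is a literal character.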

Terminology

The major regular expression types:

Details

The strregex() function has 3 arguments: <string>, <regex>, <result>.

<string>: The original string, never modified.
<regex>: The regular expression, see the regex syntax details below.
<result>: Depends on whether the regex is match-only or a substitution:

The return value from strregex() is either the number of matches, the number of substitutions, or an error code.
See Count - matches and substitutions for details about how the counts are calculated.

The <regex> syntax

The accepted syntax for <regex> is either:

For substitution, the Perl-ish style syntax must be used.

The <regex> syntax - BRE/ERE

If <regex> does not start with either m/ or s/, this API auto-detects backreferences before passing the regex to regcomp(3) and regexec(3):

The <regex> syntax - Perl-ish

If <regex> starts with either m/ or s/, and ends with / (with optional trailing modifier letters i, g, m), it is considered a Perl-ish style regex.
The syntax is one of:

The PATTERN syntax is always either BRE or ERE (depends if backreferences are used, see above).
The REPLACEMENT string may be a mix of string literals and substitution backreferences, i.e. $1, $2, ... (not to be confused with backreferences for matching).
Any use of / inside either PATTERN or REPLACEMENT must be escaped, i.e. \/.
In this API, a subset of Perl's operators and modifiers is used, limited to the m/ and s/ operators, and the g, i and m modifiers.
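
To make the rules above concrete, here is a rough sketch of how a Perl-ish regex string could be split into its parts. It is not the parser used by strregex.h: struct perlish, take_field() and parse_perlish() are hypothetical names, and mapping the m modifier to REG_NEWLINE is an assumption based on the description above.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct perlish {
    char  op;           /* 'm' or 's' */
    char *pattern;      /* malloc'd, with "\/" unescaped to "/" */
    char *replacement;  /* malloc'd for 's', NULL for 'm' */
    int   cflags;       /* REG_ICASE / REG_NEWLINE derived from i / m (assumed) */
    bool  global;       /* g modifier */
};

/* Copy one "/"-terminated field starting at *pp into a new string.
 * "\/" becomes "/"; every other escape is kept verbatim for later use.
 * On success *pp points just past the closing "/". Returns NULL on error. */
static char *take_field(const char **pp)
{
    const char *p = *pp;
    char *out = malloc(strlen(p) + 1);
    size_t n = 0;

    if (out == NULL)
        return NULL;
    while (*p != '\0' && *p != '/') {
        if (p[0] == '\\' && p[1] != '\0') {
            if (p[1] != '/')
                out[n++] = '\\';    /* keep e.g. "\1" or "\." intact */
            out[n++] = p[1];
            p += 2;
        } else {
            out[n++] = *p++;
        }
    }
    if (*p != '/') {                /* missing closing delimiter */
        free(out);
        return NULL;
    }
    out[n] = '\0';
    *pp = p + 1;
    return out;
}

/* Returns 0 and fills *out if `regex` is m/PATTERN/mods or
 * s/PATTERN/REPLACEMENT/mods; returns -1 otherwise. */
static int parse_perlish(const char *regex, struct perlish *out)
{
    assert(regex != NULL && out != NULL);
    *out = (struct perlish){0};

    if ((regex[0] != 'm' && regex[0] != 's') || regex[1] != '/')
        return -1;                  /* not Perl-ish: plain POSIX regex */
    out->op = regex[0];

    const char *p = regex + 2;
    out->pattern = take_field(&p);
    if (out->pattern != NULL && out->op == 's')
        out->replacement = take_field(&p);
    if (out->pattern == NULL || (out->op == 's' && out->replacement == NULL))
        goto fail;

    for (; *p != '\0'; p++) {
        switch (*p) {
        case 'i': out->cflags |= REG_ICASE;   break;
        case 'm': out->cflags |= REG_NEWLINE; break;
        case 'g': out->global = true;         break;
        default:  goto fail;        /* unknown modifier */
        }
    }
    return 0;

fail:
    free(out->pattern);
    free(out->replacement);
    *out = (struct perlish){0};
    return -1;
}
```

With this sketch, "s/^hello/Hi/i" yields op 's', pattern "^hello", replacement "Hi" and REG_ICASE, while "/hello/i" is rejected and would be handled as a plain POSIX regex.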

The <regex> syntax - Perl-ish - the operators m/ and s/

The <regex> syntax - Perl-ish - the modifiers i, g and m

Count - matches and substitutions

The strregex() function returns the match count or the substitution count, depending on whether m/ or s/ was used in the regex.
To summarize:

Count - matches

For match-only regex expressions, the number of matches is obtained in the same way as with the POSIX regex engine.
That is, if REG_NOSUB is unset when calling regcomp(), regexec() fills the pmatch[] array with all matches, so the count is obtained by counting the array elements whose offsets are not -1.
The pmatch[] elements will basically contain one element for the entire match, and one element for each matching subgroup.
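
As an illustration of what the pmatch[] array looks like after a single regexec() call, the sketch below counts the slots that were filled in. count_filled_slots() and MAX_GROUPS are made-up names for this example; how strregex() turns these slots into its return value is defined by the library, not by this sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 16   /* arbitrary capacity chosen for this sketch */

/*
 * Run one regexec() and count the pmatch[] slots that were filled in.
 * Slot 0 is the entire match; slots 1..re_nsub are the subgroups.
 * A subgroup that did not take part in the match has rm_so == -1.
 * Returns the number of filled slots, 0 on no match, -1 on a bad pattern.
 */
static int count_filled_slots(const char *string, const char *ere)
{
    assert(string != NULL && ere != NULL);

    regex_t re;
    regmatch_t pmatch[MAX_GROUPS];
    int filled = 0;

    if (regcomp(&re, ere, REG_EXTENDED) != 0)   /* REG_NOSUB is not set */
        return -1;

    if (regexec(&re, string, MAX_GROUPS, pmatch, 0) == 0) {
        size_t slots = re.re_nsub + 1;
        if (slots > MAX_GROUPS)
            slots = MAX_GROUPS;
        for (size_t i = 0; i < slots; i++)
            if (pmatch[i].rm_so != -1)
                filled++;
    }
    regfree(&re);
    return filled;
}
```

For "(a)|(b)" matched against "b", slot 0 and slot 2 are filled while slot 1 stays at -1, which is why the elements have to be checked one by one.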

The pmatch[] array is also needed when using the g modifier, as the match will be repeated globally, to get as many matches as possible.
Even if there are no subgroups in the regex, each repeated match needs the offset for the previous match.
This offset is obtained from the pmatch[] array.
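
A global match loop of this kind might look roughly as follows with plain regex.h. This is a sketch under assumptions, not the loop from strregex.h: count_global_matches() is a hypothetical name, and the handling of empty matches and of REG_NOTBOL is one possible choice.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Count non-overlapping matches of `ere` in `string`, restarting each
 * search at the end offset of the previous match (pmatch[0].rm_eo).
 * Returns the number of matches, or -1 if the pattern does not compile.
 */
static int count_global_matches(const char *string, const char *ere)
{
    assert(string != NULL && ere != NULL);

    regex_t re;
    regmatch_t pmatch[1];
    size_t offset = 0;
    size_t len = strlen(string);
    int count = 0;

    if (regcomp(&re, ere, REG_EXTENDED) != 0)
        return -1;

    while (offset <= len) {
        /* After the first round, "^" must not match at the new start. */
        int eflags = (offset == 0) ? 0 : REG_NOTBOL;

        if (regexec(&re, string + offset, 1, pmatch, eflags) != 0)
            break;
        count++;

        if (pmatch[0].rm_eo > pmatch[0].rm_so)
            offset += (size_t)pmatch[0].rm_eo;      /* resume after the match */
        else
            offset += (size_t)pmatch[0].rm_eo + 1;  /* empty match: step one char */
    }
    regfree(&re);
    return count;
}
```

The offsets in pmatch[] are relative to the pointer passed to regexec(), so they are added to the running offset rather than assigned to it.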

If there are no subgroups in the regex, and the g modifier is omitted, there can be only 0 or 1 matches.
The pmatch[] array is not needed, so in this single case, REG_NOSUB will be set when calling regcomp().

Count - substitutions

For substitution regexes, the POSIX regex engine does not offer a solution: there is nothing similar to the pmatch[] array for substitution.
Substitution is done entirely by this API, which keeps track of the number of substitutions.
However, the substitutions rely on the pmatch[] array to get the string offsets to substitute, so REG_NOSUB is always unset when using s/, even if there are no subgroups and the g modifier is omitted.
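
The following sketch shows how a substitution can be built on top of the pmatch[] offsets. It only handles a literal replacement text (no $1 or \1 expansion), always returns a newly allocated string, and is not the implementation found in strregex.h; append() and substitute_literal() are hypothetical helpers.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Append n bytes of src to the growing, NUL-terminated buffer *buf. */
static int append(char **buf, size_t *len, const char *src, size_t n)
{
    char *tmp = realloc(*buf, *len + n + 1);
    if (tmp == NULL)
        return -1;
    memcpy(tmp + *len, src, n);
    *len += n;
    tmp[*len] = '\0';
    *buf = tmp;
    return 0;
}

/*
 * Replace the first match (or all matches, if `global` is non-zero) of `ere`
 * in `string` with the literal text `repl`. Returns the number of
 * substitutions, or -1 on error. *result is malloc'd here (a plain copy of
 * `string` if nothing matched) and must be freed by the caller.
 */
static int substitute_literal(const char *string, const char *ere,
                              const char *repl, int global, char **result)
{
    assert(string != NULL && ere != NULL && repl != NULL && result != NULL);

    regex_t re;
    regmatch_t pm[1];
    char *out = NULL;
    size_t outlen = 0;
    int count = 0;
    const char *cur = string;

    *result = NULL;
    if (regcomp(&re, ere, REG_EXTENDED) != 0)   /* REG_NOSUB stays unset */
        return -1;

    while (regexec(&re, cur, 1, pm, cur == string ? 0 : REG_NOTBOL) == 0) {
        /* Copy the text before the match, then the replacement. */
        if (append(&out, &outlen, cur, (size_t)pm[0].rm_so) != 0 ||
            append(&out, &outlen, repl, strlen(repl)) != 0)
            goto fail;
        count++;
        cur += pm[0].rm_eo;
        if (pm[0].rm_eo == pm[0].rm_so) {       /* empty match */
            if (*cur == '\0')
                break;
            if (append(&out, &outlen, cur, 1) != 0)
                goto fail;
            cur++;                              /* step over one character */
        }
        if (!global)
            break;
    }
    if (append(&out, &outlen, cur, strlen(cur)) != 0)   /* unmatched tail */
        goto fail;
    regfree(&re);
    *result = out;
    return count;

fail:
    regfree(&re);
    free(out);
    return -1;
}
```

Called with "hihihihihihi", "hihi" and "HO", this sketch produces "HOhihihihi" with a count of 1 without the global flag, and "HOHOHO" with a count of 3 with it.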

Count - matches and substitutions - examples

This set of examples shows how the counts may differ even for identical or very similar regexes, depending on the use of subgroups and the g modifier:

#include "strregex.h"

int count1 = strregex("hihihihihihi", "m/hihi/", NULL);           /* 1 match, no subgroups, no g modifier (REG_NOSUB is set) */
int count2 = strregex("hihihihihihi", "m/(hi)(hi)/", NULL);       /* 2 matches, subgroups, no g modifier */
int count3 = strregex("hihihihihihi", "m/hihi/g", NULL);          /* 3 matches, no subgroups, g modifier */
int count4 = strregex("hihihihihihi", "m/(hi)(hi)/g", NULL);      /* 6 matches, subgroups, g modifier */

int count5 = strregex("hihihihihihi", "s/hihi/HO/", &result);     /* 1 substitution, no subgroups, no g modifier (but REG_NOSUB is unset) */
int count6 = strregex("hihihihihihi", "s/(hi)(hi)/HO/", &result); /* 1 substitution, subgroups, no g modifier */
int count7 = strregex("hihihihihihi", "s/hihi/HO/g", &result);    /* 3 substitutions, no subgroups, g modifier */
int count8 = strregex("hihihihihihi", "s/(hi)(hi)/HO/g", &result); /* 3 substitutions, subgroups, g modifier */

Count - matches and substitutions - the reference

This API mimics Perl in many ways, and the match and substitution counts are no exception.
So, if in doubt whether strregex() returns the expected count, use Perl's regex engine as a reference:

 #!/usr/bin/env perl
 use strict;

 my $str; my $count;

 # match count
 $str = 'hihihihihihi'; $count = () = $str =~ m/hihi/; print("MATCH: count=$count\n");                                     # 1 match
 $str = 'hihihihihihi'; $count = () = $str =~ m/(hi)(hi)/; print("MATCH, SUBGROUPS: count=$count\n");                      # 2 matches
 $str = 'hihihihihihi'; $count = () = $str =~ m/hihi/g; print("MATCH, GLOBAL: count=$count\n");                            # 3 matches
 $str = 'hihihihihihi'; $count = () = $str =~ m/(hi)(hi)/g; print("MATCH, SUBGROUPS, GLOBAL: count=$count\n");             # 6 matches
 
 # substitution count
 $str = 'hihihihihihi'; $count = $str =~ s/hihi/HO/; print("SUBSTITUTE: str='$str', count=$count\n");                      # 1 substitution
 $str = 'hihihihihihi'; $count = $str =~ s/(hi)(hi)/HO/; print("SUBSTITUTE, SUBGROUPS: str='$str', count=$count\n");       # 1 substitution
 $str = 'hihihihihihi'; $count = $str =~ s/hihi/HO/g; print("SUBSTITUTE, GLOBAL: str='$str', count=$count\n");             # 3 substitutions
 $str = 'hihihihihihi'; $count = $str =~ s/(hi)(hi)/HO/g; print("SUBSTITUTE, GROUPS, GLOBAL: str='$str', count=$count\n"); # 3 substitutions

See also strregex.pl, the Perl version of the strregex command-line tool; both are used for testing strregex().

Backreferences for matching and for substitution

Backreferences - limits

API

The API includes the following functions:

int strregex(const char *string, const char *regex, char **result);

Pass a string string and a regular expression regex, get matched/substituted string in result, or NULL on no match/no substitution.
For substitution, memory for result must be deallocated afterwards.
This can be done using free(result), but strregex_free(&str, &result) (see below) is recommended.

Returns: Number of matches/substitutions (0 or more), or any possible error code (negative number, see below).

const char *strregex_error(int errcode);

Get a human-readable string of an error code returned by strregex().

Note:
Internally, an error may be caused by:

The error codes from regcomp() are defined in regex.h, normally as positive numbers.
Internally, strregex() converts these error codes (and its own error codes) to negative values.
That way, the return value from strregex() may be non-negative (indicating the number of matches) or negative (indicating the error code).
As all error codes are negative values, calling strregex_error() with a positive errcode value will return the same (empty) string as for errcode=0.
Note that the regexec() return values (zero for a successful match, otherwise REG_NOMATCH) are not considered errors, but only an indicator of whether a regex matched or not.
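
The sign convention can be illustrated with plain regex.h as below. try_compile() is a hypothetical helper and not part of the API; it assumes, as noted above, that the platform's regcomp() error codes are positive.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Try to compile `ere`. Returns 0 on success, otherwise the regcomp() error
 * code as a negative number, so that non-negative return values stay free
 * for match/substitution counts.
 * errbuf receives the message from regerror(), or "" on success.
 */
static int try_compile(const char *ere, char *errbuf, size_t errlen)
{
    assert(ere != NULL && errbuf != NULL && errlen > 0);

    regex_t re;
    int rc = regcomp(&re, ere, REG_EXTENDED | REG_NOSUB);

    errbuf[0] = '\0';
    if (rc != 0) {
        /* regerror() expects the original code as returned by regcomp(). */
        regerror(rc, &re, errbuf, errlen);
        return (rc > 0) ? -rc : rc;     /* make sure the result is negative */
    }
    regfree(&re);
    return 0;
}
```

A caller can then tell the cases apart with a single comparison: a negative value is an error, anything else is a valid count.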

Returns: Error string

void strregex_free(const char **string, char **result);

Free a result only if needed.
If strregex() failed, or did not match/substitute anything, or if strregex() matched using a match-only regular expression, there is no need to call this function (but it does no harm either).
If strregex() has been called using a substitution regular expression, and text has been successfully substituted, this function frees the resulting text.
If in doubt, always call this function instead of free().

Returns: Nothing

CLI

The strregex program may be used to test the API:

    ./strregex <string> <regex> [-c|--count] [-v|--verbose]

Its behaviour is similar to:

    echo <string> | grep <regex>
    echo <string> | sed 's/<regex>/<subst>/'

That is:

Flags:

Examples:

    ./strregex 'Hello, world!' '^Hello'        <-- POSIX regex syntax, match, outputs the original string
    ./strregex 'Hello, world!' '^hello'        <-- POSIX regex syntax, no match, no output
    ./strregex 'Hello, world!' 'm/^hello/'     <-- Perl-ish regex syntax for matching, no match, no output
    ./strregex 'Hello, world!' 'm/^hello/i'    <-- Perl-ish regex syntax for matching, case-insensitive, matches, outputs the original string

    ./strregex 'Hello, world!' 's/^hello/Hi/i' <-- Perl-ish regex syntax for substitution, outputs 'Hi, world!'
    ./strregex 'Hello, world!' 's/^Hola/Hi/'   <-- No substitution, returns original string 'Hello, world!'

    ./strregex 'Halo, mundo!' 's/(a)l(o)/$2l$1/' <-- Substitution using backreferences, outputs 'Hola, mundo!'

    ./strregex -c 'Hello, world!' 'm/l/g'      <-- Matches 'l' three times, outputs '3'

Notes:

Unlike in Perl, the m operator is not optional when using the Perl-ish syntax.
When m is omitted, the regex is interpreted as a POSIX regular expression:

    ./strregex 'Hello, world!' 'm/hello/i'     <-- Interpreted as Perl-ish. Match.
    ./strregex 'Hello, world!' '/hello/i'      <-- Interpreted as POSIX. No match.
    ./strregex '/hello/i, world!' '/hello/i'   <-- Interpreted as POSIX. Match.

To match a string which is a Perl-ish regular expression itself, the workaround is to put the entire string in a subgroup, with possible additional metacharacters:

    ./strregex 'm/hello/i' '^(m/hello/i){1}$'  <-- Interpreted as POSIX. Match.

Test Suite

Portability problems when using backreferences with POSIX regular expressions

In this API, it would have been easier to avoid the BRE syntax altogether.
The BRE regex syntax uses backslashes for special characters while ERE doesn't.
In C code, backslashes themselves have to be escaped, which makes a BRE regex in C code much harder to read than a similar ERE regex.
However, there is a problem with backreferences: BRE supports them, while ERE (at least officially) does not:

The meaning of metacharacters escaped with a backslash is reversed for some characters in the POSIX Extended Regular Expression (ERE) syntax.
With this syntax, a backslash causes the metacharacter to be treated as a literal character. So, for example, \( \) is now ( ) and \{ \} is now { }.
Additionally, support is removed for \n backreferences ...

Not being aware of this may lead to confusion, as it is easy to assume that ERE is a successor of BRE, which is not entirely true.

Let's see an example regex which uses both the choice/alternation/set-union operator (|), and backreferences (\1\2):

    (1|2)(3|4)\1\2

It would be useful to match certain patterns, like these:

    1313 # MATCH
    1414 # MATCH
    2323 # MATCH
    2424 # MATCH
    2324 # NO MATCH

But...

So, at least officially, if we stick to the POSIX standard, we are stuck.
Without switching to PCRE, there is no way to use the regex above. :-(

To add more confusion, some regex.h library versions do support backreferences for both BRE and ERE.
Eli Bendersky has mentioned this in some notes on POSIX regular expressions.

He managed to use backreferences with ERE using his program regex_sample, compiled for gcc 4.6 on Ubuntu Linux 12.04.
I used his program with an increased pmatch array, and tested it on different platforms.

On FreeBSD 11.2 (clang 6.0.0) and MSYS2 (gcc 7.3.0):

regexec() fails to match when mixing BRE and ERE syntax

    ./regex_sample extended "([a-z]+)_\1" abc_abc
    Program argc=4
    [0]: ./regex_sample
    [1]: extended
    [2]: ([a-z]+)_\1
    [3]: abc_abc
    --------
    No match

    ./regex_sample extended "(1|2)(3|4)\1\2" 1313
    Program argc=4
    [0]: ./regex_sample
    [1]: extended
    [2]: (1|2)(3|4)\1\2
    [3]: 1313
    --------
    No match

The | metacharacter is not supported for BRE (as the POSIX standard says):

    ./regex_sample basic "\(A\|B\)\(C\|D\)\1\2"  ACAC
    Program argc=4
    [0]: ./regex_sample
    [1]: basic
    [2]: \(A\|B\)\(C\|D\)\1\2
    [3]: ACAC
    --------
    No match

On Ubuntu 16.04 (gcc 5.4.0):

regexec() matches successfully mixing BRE and ERE syntax

    ./regex_sample extended "([a-z]+)_\1" abc_abc
    Program argc=4
    [0]: ./regex_sample
    [1]: extended
    [2]: ([a-z]+)_\1
    [3]: abc_abc
    --------
    Match group 0: abc_abc
    Match group 1: abc

    ./regex_sample extended "(1|2)(3|4)\1\2" 1313
    Program argc=4
    [0]: ./regex_sample
    [1]: extended
    [2]: (1|2)(3|4)\1\2
    [3]: 1313
    --------
    Match group 0: 1313
    Match group 1: 1
    Match group 2: 3

Contrary to what the POSIX standard says, the | metacharacter is supported for BRE on Linux, as long as it is escaped:

    ./regex_sample basic "\(A\|B\)\(C\|D\)\1\2"  ACAC
    Program argc=4
    [0]: ./regex_sample
    [1]: basic
    [2]: \(A\|B\)\(C\|D\)\1\2
    [3]: ACAC
    --------
    Match group 0: ACAC
    Match group 1: A
    Match group 2: C

Same result for all platforms:
More than 9 subgroups are supported for BRE, but only backreference indices 1-9 may be used (as the POSIX standard says).
Note that the backreference index limit to 1-9 also applies to Linux, even when using ERE.

Test: 12 subgroups and 9 backreferences => match

    ./regex_sample.exe basic "\\(A\\)\\(B\\)\\(C\\)\\(D\\)\\(E\\)\\(F\\)\\(G\\)\\(H\\)\\(I\\)\\(J\\)\\(K\\)\\(L\\) \\9\\8\\7\\6\\5\\4\\3\\2\\1" "ABCDEFGHIJKL IHGFEDCBA"
    Program argc=4
    [0]: ./regex_sample
    [1]: basic
    [2]: \(A\)\(B\)\(C\)\(D\)\(E\)\(F\)\(G\)\(H\)\(I\)\(J\)\(K\)\(L\) \9\8\7\6\5\4\3\2\1
    [3]: ABCDEFGHIJKL IHGFEDCBA
    --------
    Match group 0: ABCDEFGHIJKL IHGFEDCBA
    Match group 1: A
    Match group 2: B
    Match group 3: C
    Match group 4: D
    Match group 5: E
    Match group 6: F
    Match group 7: G
    Match group 8: H
    Match group 9: I
    Match group 10: J
    Match group 11: K
    Match group 12: L

Test: 12 subgroups and 10 backreferences => no match

    $ ./regex_sample.exe basic "\\(A\\)\\(B\\)\\(C\\)\\(D\\)\\(E\\)\\(F\\)\\(G\\)\\(H\\)\\(I\\)\\(J\\)\\(K\\)\\(L\\) \\10\\9\\8\\7\\6\\5\\4\\3\\2\\1" "ABCDEFGHIJKL JIHGFEDCBA"
    Program argc=4
    [0]: ./regex_sample
    [1]: basic
    [2]: \(A\)\(B\)\(C\)\(D\)\(E\)\(F\)\(G\)\(H\)\(I\)\(J\)\(K\)\(L\) \10\9\8\7\6\5\4\3\2\1
    [3]: ABCDEFGHIJKL JIHGFEDCBA
    --------
    No match

Test: 12 subgroups and 10 backreferences with ERE => no match (does not match on any platform, Linux included)

    ./regex_sample extended "(A)(B)(C)(D)(E)(F)(G)(H)(I)(J)(K)(L) \\10\\9\\8\\7\\6\\5\\4\\3\\2\\1" "ABCDEFGHIJKL JIHGFEDCBA"
    Program argc=4
    [0]: ./regex_sample
    [1]: extended
    [2]: (A)(B)(C)(D)(E)(F)(G)(H)(I)(J)(K)(L) \10\9\8\7\6\5\4\3\2\1
    [3]: ABCDEFGHIJKL JIHGFEDCBA
    --------
    No match

To summarize:
If you manage to compile an ERE regex with backreferences, don't be too sure it will be portable.
To save yourself some headaches, try to avoid matching backreferences altogether.
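
If backreferences cannot be avoided, one option is to probe the local regex.h at run time before relying on them. The sketch below is only an illustration; probe() and ere_backrefs_work() are made-up helpers, and the outcome of the ERE probe depends on the platform, as the tests above show.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 on match, 0 on no match, -1 if the pattern does not compile. */
static int probe(const char *pattern, int cflags, const char *subject)
{
    assert(pattern != NULL && subject != NULL);

    regex_t re;
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    int matched = (regexec(&re, subject, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}

/*
 * Non-zero if this platform's regex.h honours "\1" and "\2" in an ERE:
 * the pattern must match "1313" and must not match "2324".
 */
static int ere_backrefs_work(void)
{
    const char *ere = "(1|2)(3|4)\\1\\2";

    return probe(ere, REG_EXTENDED, "1313") == 1
        && probe(ere, REG_EXTENDED, "2324") == 0;
}
```

A program could call ere_backrefs_work() once at startup and fall back to a BRE pattern (or report an error) when it returns 0.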

More on this subject:
https://stackoverflow.com/questions/53767426/c-posix-ere-without-back-references
https://stackoverflow.com/questions/13322996/do-extended-regexes-support-back-references

Alternatives

If you are looking for:

Match (Perl or grep):

    echo -e 'Hello, world!' | perl -ne 'print if m/Hello/'
    echo -e 'Hello, world!' | grep Hello

    Hello, world!

Replace (Perl or sed):

    echo -e 'Hello, world!' | perl -pe 's/Hello/Hi/g'
    echo -e 'Hello, world!' | sed 's/Hello/Hi/g'

    Hi, world!

Misc

The other files in this repo are a set of examples using regex.h.
To compile them:

    make clean misc

Links