Copyright Eric Niebler, 2002
The purpose of this document is to describe how to use the GRETA Regular Expression Template Archive. It describes the objects in the library, the methods defined on the objects, and the ways to use the objects and methods to perform regular expression pattern matching on strings in C++. It does not describe regular expression syntax. It is enough to say that the full Perl 5 syntax is supported. If you are not familiar with Perl’s regular expression syntax, I recommend reading Chapter 2 of Programming Perl, 2nd Ed. (a.k.a. The Camel Book), one of the many fine books put out by O’Reilly publishers.
GRETA: The GRETA Regular Expression Template Archive
Notice to
Users of Version 1.x
The
Match_results and Subst_results Objects
Known Issues
and Perl Incompatibilities
Embedded
Code in a Regular Expression
Comment
Blocks Before Quantifiers
Variable
Width Look-Behind Assertions
Appendix 2:
Implementation Details
The regular expression template library contains objects and functions that make it possible to perform pattern matching and substitution on strings in C++. They are:
To perform a search or replace operation, you will typically first initialize an rpattern object by giving it a string representing the pattern against which to match. You will then call a method on the rpattern object (match() or substitute(), for instance), passing it a string to match against and a match_results objects to receive the results of the match. If the match()/substitute() fails, the method returns false. If it succeeds, it returns true, and the match_results object stores the resulting array of backreferences internally. (Here, the term backreference has the same meaning as it does in Perl. Backreferences provide extra information about what parts of the pattern matched which parts of the string.) There are methods on the match_results object to make the backreference information available. For example:
#include
<iostream>
#include
<string>
#include
“regexpr2.h”
using namespace std;
using namespace regex;
int
main() {
match_results
results;
string str( “The book cost $12.34” );
rpattern
pat( “\\$(\\d+)(\\.(\\d\\d))?” );
// Match a
dollar sign followed by one or more digits,
// optionally
followed by a period and two more digits.
// The double-escapes are necessary to satisfy the compiler.
match_results::backref_type
br = pat.match( str, results );
if( br.matched ) {
cout
<< “match success!” << endl;
cout
<< “price: ” << br << endl;
} else {
cout
<< “match failed!” << endl;
}
return 0;
}
The above program would print out the following:
match
success!
price:
$12.34
The following sections discuss the rpattern object in detail and how to customize your searches to be faster and more efficient.
Note: all declarations in the header file (regexpr2.h) are contained in the regex namespace. To use any of the objects, methods or enumerations described in this document, you must prepend all declarations with “regex::” or you must have the “using namespace regex;” directive somewhere within the enclosing scope of your declarations. For simplicity, I’ve left off the “regex::” prefixes in the rest of my code snippets.
Different regex engines are good on different types of patterns. That said, I have found my regex engine to be pretty quick. For a benchmark, I matched the pattern “^([0-9]+)(\-| |$)(.*)$” against the string “100- this is a line of ftp response which contains a message string”. GRETA is about 7 times faster than the regex library in boost (http://www.boost.org), and about 10 times faster than the regular expression classes in ATL7. For this input, GRETA is even faster than Perl, although Perl is faster for some other patterns. Most regex engines I have seen build up an NFA (non-deterministic finite state automaton) and execute it iteratively, often with a big, slow switch statement. I have a different approach: patterns are compiled into a directed, possibly cyclic graph, and matching happens by traversing this graph recursively. In addition, the code makes heavy use of templates to freeze the state of the flags into the compiled pattern so that they don’t need to be checked at match time. The result is a pretty lean blob of code that can match your pattern quickly.
Even the best algorithms have their weaknesses, though. Matching regular expressions with backreferences is an NP-complete problem. There are patterns that will make any backtracking regex engine take exponential time to finish. (These usually involve nested quantifiers.) If you have a performance critical app, you would be smart to test your patterns for speed, or profile your app to make sure you are not spending too much time thrashing around in the regex code. You’ve been warned!
Also, see the section VC7 and Managed Code for some advice for compiling GRETA under VC7.
Many things have changed since version 1.x of the Regular Expression Template Archive. If you have code which uses version 1.x, you will not be able to use version 2 without making changes to your code. Sorry! There were a number of unsafe, unintuitive interface features of version 1 that I felt were worth fixing for version 2. If you need version 1, I have a copy and I’d be happy to give it to you.
Most notably, the regexpr object has gone away. It was a subclass of std::string, with match() and substitute() methods, and it stored the results of the match/substitute internally. Subclassing std::string is dangerous because std::string doesn’t have a virtual destructor. Also, matching is conceptually a const operation, and it seemed wrong that it should change internal state.
The match/count/substitute methods have moved to the rpattern object. The state that used to be stored in the regexpr object is now put in a match_results/subst_results container, which is passed as an out parameter to the match/substitute methods.
Also, the CSTRINGS flag has gone away. It is no longer necessary to optimize a pattern for use with C-style NULL-terminated strings. When you pass a C-style string to the rpattern::match method, the same optimization is used automatically. (In early 2.X versions of the library, there was a basic_rpattern_c object for performing this optimization, but it is no longer necessary and has been deprecated.)
Another minor change involves the register_intrinsic_charset() method. It used to be a part of rpattern’s interface, but it has moved to the syntax module.
Despite the sweeping interface changes, the majority of the back-end code is unchanged. You should expect patterns that worked in version 1.x to continue to work in version 2.
The rpattern object contains the regular expression pattern
against which to match. It also exposes
the match(), substitute(), and count() methods you will use to perform regular expression
matches. When you instantiate an rpattern object, the pattern is “compiled” into a structure
that speeds up pattern matching. Once
compiled, you may reuse the same pattern for multiple match operations.
Here is how rpattern is declared:
template<
typename CI,
typename SY = perl_syntax<std::iterator_traits<CI>::value_type>
>
class basic_rpattern {
…
};
typedef basic_rpattern<std::basic_string<TCHAR>::const_iterator> rpattern;
typedef basic_rpattern<TCHAR const *> rpattern_c;
The rpattern class is a template on iterator
type. It is also a template on the
syntax module. By default, the Perl
syntax module is used, but you are free to write your own syntax and specify it
as a template parameter. See the section
on the Syntax Module.
The following sections
describe the methods available on the rpattern object.
rpattern::string_type is a typedef that is used in many of the following function prototypes. It is defined as follows:
typedef CI const_iterator;
typedef std::iterator_traits<const_iterator>::value_type char_type;
typedef std::basic_string<char_type> string_type;
The typedef is a little complicated, but its effect is what you would expect. If the result of dereferencing a const_iterator is a char, then string_type is the same as std::string. If dereferencing a const_iterator results in a wchar_t, then string_type is the same as std::wstring.
There are two
constructors for instantiating an rpattern object. Here
are their prototypes:
rpattern::rpattern(
const string_type &
pat,
REGEX_FLAGS
flags=NOFLAGS,
REGEX_MODE
mode=MODE_DEFAULT ); //
throw(bad_alloc,bad_regexpr);
rpattern::rpattern(
const string_type &
pat,
const string_type &
sub,
REGEX_FLAGS
flags=NOFLAGS,
REGEX_MODE
mode=MODE_DEFAULT ); //
throw(bad_alloc,bad_regexpr);
Both methods require you
to specify a string containing the pattern to match. Both methods allow you to specify some flags
that customize your search and the mode with which the pattern should be
matched (see Matching Modes). The first method does not require you to
specify a substitution string; the second one does. If you do not specify a substitution string,
it is assumed to be the null string.
Notice the (commented
out) exception specification, “throw(bad_alloc, bad_regexpr),” on the constructors. This means that the constructor of an rpattern object can throw an exception of type bad_regexpr or bad_alloc. This can
happen when the specified pattern contains illegal regular expression syntax,
such as unbalanced parentheses. It can
also occur when the substitution string refers to a backref
that does not exist (for instance, if the substitution string contains “$6”, and there are not 6 groups in the pattern). The bad_regexpr exception inherits from the std::invalid_argument exception. The
bad_regexpr::what() method
returns a pointer to a C-style string that contains a description of the
problem.
The match method is used to find patterns in strings. There are a couple of different versions of the match method. The first takes a std::basic_string as a parameter. The second version takes two iterators, the beginning and end of the string to match. Finally, there is a version of the match() method that takes a pointer to a NULL-terminated string.
template<
typename CH, typename TR, typename
rpattern::backref_type rpattern::match(
const std::basic_string<CH,TR,AL> & str,
match_results
& results,
size_type pos = 0, //
offset to the start of the substring to match
size_type len
= npos ) const; //
length of the substring to match
rpattern::backref_type rpattern::match(
const_iterator ibegin, // start of
the string to match
const_iterator iend, // one
past end of string to match
match_results
& results ) const;
rpattern::backref_type rpattern::match(
const char_type * sz,
match_results_c
& results ) const;
Notice that the match() method that takes a std::basic_string is a template on character, character traits and allocator. This is necessary given that basic_rpattern is a template only on iterator, but it gives us a little more flexibility then we really want. If you accidentally pass in the wrong string type (say, a std::wstring when the match method is really expecting a std::string) you will get a compile-time error informing you of your mistake.
The match() method returns an object of type rpattern::backref_type. If the match fails, an “empty backref” will be returned. An empty backref contains “false” in the matched member. If the match succeeds, the backref contains information about where the pattern matched the string, and contains “true” in the matched member. The following code demonstrates the usage of the match() method:
const
char * sz = “Here is a string to match.”;
rpattern_c
pat( “str.ng” ); //
matches “string”, “strong”, etc.
match_results_c
results;
rpattern_c::backref_type br = pat.match( sz, results );
if( br.matched ) {
printf( “Backref: %.*s”, br.second – br.first, br.first );
} else {
printf(
“No match found.” );
}
In the above code, br.first is a pointer to the first character in the string that matched the pattern and br.second is a pointer to the last+1 character that matched the pattern. Therefore, the length of the matched pattern is br.second – br.first. (You’ll notice that this value is used as the precision of the string in the printf() format statement.) The above code will generate the following output: “Backref: string”.
Notice the use of rpattern_c and match_results_c in the above example. These are typedefs for basic_rpattern<const TCHAR *> and basic_match_results<const TCHAR *>, respectively.
The substitute() method finds patterns and replaces the found substrings with insertion strings you specify. It takes a std::string object as a parameter, and a subst_results container. It also takes two optional integer parameters, which you can use to specify the offset of the first character to search and the length of the search string. Here is the prototype:
template<
typename CH, typename TR, typename
size_t rpattern::substitute( std::basic_string<CH,TR,AL>
& str,
subst_results & results,
size_type pos = 0,
size_type len
= npos );
The return type is an unsigned integer representing the number of substitutions made. If the pattern is not found, no substitutions are made and the return value is 0. The following example finds all instances of \n not preceded by \r and replaces them with \r\n:
size_t insert_crlf(
string & str ) {
static
const rpattern s_crlf(
“(?<!\r)\n”, “\r\n”, GLOBAL );
subst_results results;
return s_crlf.substitute( str, results );
}
Later sections will describe the significance of “static const” and “GLOBAL” in the above example. Note the use of the look-behind assertion, (?<!\r), in the example. It evaluates to true when the preceding character is not a return character, but it doesn’t include the preceding character in the match.
When doing a global substitution, the next match operation begins where the last substitution left off. Thus, if the search string contains the word “foo”, pattern is “o”, and the substitution is “oo”, the resulting string will be, “foooo”.
Use the count() method when you want to know the number of times a given pattern matches a string, but you don’t care about where the matches occur or what they are exactly. The count() method has three variants, which are analogous to the three match() methods. Here’s one if its prototypes:
size_t rpattern::count(
const_iterator ibegin, // start of
the string to match
const_iterator iend
) const; //
one past end of string to match
Like the substitute() method, the next match begins where the previous one ended. Thus, if you want to know how many times the pattern “oo” shows up in the string “fooooo”, the answer is 2, even though this pattern could conceivably match the string in 4 unique places. Overlapping matches are not counted.
Sometimes you may want
to set the substitution string after the rpattern object has been instantiated. For this, you would use the set_substitution() method.
void rpattern::set_substitution( const string_type &
sub );
// throw(bad_alloc,bad_regexpr);
The set_substitution() method can be called as many times as you like. Each time it is called, it replaces the old
substitution string with the new one.
Notice that the set_substitution() method can also throw a bad_regexpr exception for ill-formed substitution strings.
This method returns the
count of groups in the pattern string.
For a successfully parsed pattern, this number is always at least one,
because the entire pattern is considered group number 0.
size_t rpattern::cgroups() const;
The results of a match() or a substitute() operation are stored in the match_results and subst_results containers, respectively. After a successful match() or substitute(), they contain useful information that can be queried for using the following methods. Unless otherwise noted, these methods appear on both the match_results and subst_results containers.
This method returns the count of the internally saved backrefs. It takes no parameters. Here is the prototype:
size_t match_results::cbackrefs() const;
After a successful match() operation, cbackrefs() will always return at least one. That is because the part of the string that matched the entire pattern is always saved in backref number 0 (like $& in Perl).
This method returns the requested backref. It takes one parameter, which is the integer representing the backref to return. The prototype for this method looks like:
const backref_type & match_results::backref(
size_t cbackref ) const;
As in Perl, backreferences are created as a side effect of grouping. The substring that matched the entire pattern is saved in backref number 0; the substring that matched the pattern enclosed in the first set of parentheses is backref number 1, the second set of parentheses is backref number 2, and so on. Sets of parentheses are numbered by counting opening parentheses from left to right.
If cbackref is greater than the number of saved backrefs, match_results::backref() throws a std::out_of_range exception.
The rstart() method returns the offset to the first character in the specified backref. Here is the prototype:
size_t match_results::rstart( size_t cbackref
= 0 ) const;
After a global substitution operation, rstart() returns the offset to the point in the string that would be the start of the backref if the string had not been modified. That’s because it is really an index into the backref string. See subst_results::backref_str below.
If cbackref is greater than the number of saved backrefs, match_results::rstart() throws a std::out_of_range exception.
The rlength() method returns the length of the specified backref. Here is the prototype:
size_t match_results::rlength( size_t cbackref
= 0 ) const;
If cbackref is greater than the number of saved backrefs, match_results::rlength() throws a std::out_of_range exception.
This method returns a
const reference to the internal vector of backrefs. The return value is const to prevent writing
to this vector. It is intended for
read-only.
const match_results::backref_vector & match_results::all_backrefs()
const;
This method is only defined on the subst_results container. It returns a const reference to the string to which the backreferences point. Here is the prototype:
const string_type & subst_results::backref_str()
const;
As I mentioned earlier, backrefs are really just a pair of iterators into a string. But a substitution modifies the string, and the backreferences need to refer to the string as it was before it was modified. For this to work, a copy of the string must be made before it is changed. The backreferences are pointers into the copy of the string. The backref_str() method returns a const reference to the copy.
Note: this implies that doing substitutions on really big strings is costly, because the regex engine must first make a copy of the string before performing the substitution. Often, you don’t need backreferences after a substitution, and a copy does not need to be made. There is a flag to turn off this expensive feature for these cases. This is discussed in the section entitled NOBACKREFS.
The regular expression syntax supported by default is the syntax of Perl 5.6. The syntax is defined in a separate module and is incorporated into the rpattern class by means of a template parameter. You can define your own syntax module if you like. Your module must implement the following interface:
template<
typename CH >
class my_syntax {
public:
typedef
CH char_type;
typedef
std::basic_string<CH>::iterator
iterator;
typedef
std::basic_string<CH>::const_iterator
const_iterator;
my_syntax( REGEX_FLAGS flags );
REGEX_FLAGS
get_flags()
const;
void set_flags(
REGEX_FLAGS flags );
TOKEN reg_token( iterator
& icur,
const_iterator iend );
TOKEN quant_token( iterator
& icur,
const_iterator iend );
TOKEN charset_token( iterator
& icur,
const_iterator iend );
TOKEN subst_token( iterator
& icur,
const_iterator iend );
TOKEN ext_token( iterator
& icur,
const_iterator iend );
void register_intrinsic_charset(
char_type
ch,
const std::basic_string<CH> & str
); // throw(bad_regexpr,std::bad_alloc);
};
The TOKEN type is an enumeration, defined as follows:
enum TOKEN {
// Can be returned by any syntax method
NO_TOKEN = 0,
// Token that can be returned by the reg_token method
BEGIN_GROUP,
END_GROUP,
ALTERNATION,
BEGIN_LINE,
END_LINE,
BEGIN_CHARSET,
MATCH_ANY,
ESCAPE,
// Tokens that can be returned by the quant_token method
ONE_OR_MORE,
ZERO_OR_MORE,
ZERO_OR_ONE,
ONE_OR_MORE_MIN,
ZERO_OR_MORE_MIN,
ZERO_OR_ONE_MIN,
BEGIN_RANGE,
RANGE_SEPARATOR,
END_RANGE,
END_RANGE_MIN,
// These may also
be returned by the reg_token method
ESC_DIGIT,
ESC_NOT_DIGIT,
ESC_SPACE,
ESC_NOT_SPACE,
ESC_WORD,
ESC_NOT_WORD,
ESC_BEGIN_STRING,
ESC_END_STRING,
ESC_END_STRING_z,
ESC_WORD_BOUNDARY,
ESC_NOT_WORD_BOUNDARY,
ESC_WORD_START,
ESC_WORD_STOP,
ESC_QUOTE_META_ON,
ESC_QUOTE_META_OFF,
// Tokens that can be returned by the subst_token method
SUBST_BACKREF,
SUBST_PREMATCH,
SUBST_POSTMATCH,
SUBST_MATCH,
SUBST_ESCAPE,
SUBST_QUOTE_META_ON,
SUBST_UPPER_ON,
SUBST_UPPER_NEXT,
SUBST_LOWER_ON,
SUBST_LOWER_NEXT,
SUBST_ALL_OFF,
// Tokens that can be returned by the charset_token method
CHARSET_NEGATE,
CHARSET_ESCAPE,
CHARSET_RANGE,
CHARSET_BACKSPACE,
CHARSET_END,
CHARSET_ALNUM,
CHARSET_NOT_ALNUM,
CHARSET_ALPHA,
CHARSET_NOT_ALPHA,
CHARSET_BLANK,
CHARSET_NOT_BLANK,
CHARSET_CNTRL,
CHARSET_ NOT_CNTRL,
CHARSET_DIGIT,
CHARSET_ NOT_DIGIT,
CHARSET_GRAPH,
CHARSET_NOT_GRAPH,
CHARSET_LOWER,
CHARSET_NOT_LOWER,
CHARSET_PRINT,
CHARSET_NOT_PRINT,
CHARSET_PUNCT,
CHARSET_NOT_PUNCT,
CHARSET_SPACE,
CHARSET_NOT_SPACE,
CHARSET_UPPER,
CHARSET_NOT_UPPER,
CHARSET_XDIGIT,
CHARSET_NOT_XDIGIT,
// Tokens that can be returned by the ext_token method
EXT_NOBACKREF,
EXT_POS_LOOKAHEAD,
EXT_NEG_LOOKAHEAD,
EXT_POS_LOOKBEHIND,
EXT_NEG_LOOKBEHIND,
EXT_INDEPENDENT,
EXT_UNKNOWN
};
If any syntax method returns a token other than NO_TOKEN, the character pointer (icur) must be advanced past the token. icur must not be advanced past the end of the string, iend. If NO_TOKEN is returned, icur should not be modified, except to move past any characters that should be ignored.
Once you have defined your syntax module, you need to instantiate an rpattern template that uses it. See the section on Template Instantiation for help.
The intrinsic character
sets are pretty handy. For instance, “\s” matches the same thing as “[\t\f\r\n
]”, and “\w” is (roughly) a shorthand for “[a-zA-Z_0-9]”. With the register_intrinsic_charset() method, you can create your own. Here’s the prototype:
static
void my_syntax::register_intrinsic_charset(
char_type
ch, const std::basic_string<char_type> & str );
// throw(bad_regexpr,std::bad_alloc);
You
would use it as follows:
my_syntax<char>::register_intrinsic_charset(
‘p’, “[aeiouAEIOU]” );
This
maps “\p” to “[aeiouAEIOU]”. In implementation, the character set is
“compiled” when you call register_intrinsic_charset(), and when “\p” appears in any new
patterns, this compiled form gets linked into your pattern. This speeds up pattern compilation, and
reduces memory usage, since the compiled character set is shared across all
your patterns. Note that with the above
call to register_intrinsic_charset(), no mapping of “\P” to “[^aeiouAEIOU]” is created. If you want that mapping, you need to create
it yourself with another call to register_intrinsic_charset().
The
case-sensitivity of a user-defined intrinsic character set is taken from the
context in which it is used. For
instance, if you define “\y” to be “[a-z]”, then in a case-sensitive
pattern, it matches only a-z, but in a case-insensitive pattern, it matches a-z and A-Z.
If
you redefine an intrinsic character set, any compiled patterns that refer to
that character set are now invalid. You
must reinitialize them.
It is possible to
customize your searches by specifying some flags in the rpattern
constructor. Here is an example:
// Find all instances of ‘foo’, replace with ‘bar’, ignore case
rpattern pat( “foo”,
“bar”, NOCASE | GLOBAL );
Do a case-insensitive
search. Searches are by default
case-sensitive.
This flag causes the rpattern::substitute() method to replace all substrings that match the
pattern. By default, only the first
substring that matches the pattern is replaced.
When used with the rpattern::match() method, it causes all substrings that match the
pattern to be found. See the section on NOBACKREFS, FIRSTBACKREFS and ALLBACKREFS to find out how to control what information gets
saved in the backref vector. This flag is ignored by the rpattern::count() method.
By default, ‘^’ matches only the beginning of the string, and ‘$’ matches only the end of the string (or at the newline preceding the end of the string). When you use the MULTILINE flag, ‘^’ will also match the beginning of a line (immediately after a newline character) and ‘$’ will also match before the end of the line (immediately before a carriage return or newline character). The assertions ‘\A’ and ‘\Z’ are used to match only the beginning and end of the string respectively, regardless of whether the MULTILINE flag has been specified.
By
default, ‘.’ matches any character besides the newline character. When you specify SINGLELINE, ‘.’ will match the newline character, also.
This can be used together with the MULTILINE
option. The two only sound mutually
exclusive.
This is the equivalent
of the /x pattern modifier in Perl. When you specify EXTENDED, any white space in you pattern that is not escaped
or in a character class is ignored. In
addition, any text following a #
up to and including a newline character is considered
a comment and is ignored. The idea is to
make patterns more readable.
Find the rightmost,
longest match. The default is to find
the leftmost, longest match.
This flag is used to
optimize substitution operations. By
default, backreferences are saved during substitution
operations. To make the backreferences available even after the substitution
operation has changed the string, it is necessary to save a copy of the string
before modifying it. This can be an
expensive operation, particularly if the string is very long. Often, there is no need to know the backreferences after a substitution operation completes, in
which case the copy operation was wasteful.
The NOBACKREFS flag is a hint to the regular expression engine that
you will not be using the backreferences after the
substitution operation completes. This
eliminates the need to make a copy of the string.
If your substitution
string contains backreferences, such as ‘$2’, then a copy of the string needs to be made anyway,
and the NOBACKREFS flag is ignored.
This flags is ignored by
the match() and count() methods.
By default, when doing a global match or substitute operation, the backref vector only contains the backrefs from the last successful match. Sometimes, it is helpful to know the backref information for all the matches. If you specify ALLBACKREFS, then the backref information for each successive match will be appended to the backref vector, instead of replacing it.
This flag is like ALLBACKREFS, except that only information from the complete match (backref 0) is saved for each successive match.
If the NORMALIZE flag is specified, the regular expression parser performs the following normalizations:
The string “\\n” is turned into the newline character, ‘\n’.
The string “\\r” is turned into the return character, ‘\r’.
The string “\\t” is turned into the tab character, ‘\t’.
The string “\\f” is turned into the form-feed character, ‘\f’.
The string “\\v” is turned into the vertical-feed character, ‘\v’.
The string “\\a” is turned into the bell character, ‘\a’.
The string “\\\\” is turned into the backslash character, ‘\\’.
This can be especially useful if you are accepting the pattern and substitution strings from a user as input.
By default, the pattern matching algorithm is a depth-first, recursive search. It is fast and effective, but it has a limitation: the size of your process’s stack. If you run out of stack space during a match operation, one of two things will happen. If you are on a Microsoft platform, the exception is handled and execution continues as if the pattern failed to match (which may or may not be the right answer). If you are not on a Microsoft platform, the exception cannot be handled, and your process will dump core and die. It’s not very nice to have a process die when given legal input, so GRETA gives you the option to perform matches in a completely safe manner, using an iterative algorithm that is stack-conservative. The different matching-modes are described below. You can specify these modes to the rpattern constructor.
On Microsoft platforms, this is the default mode. It uses the recursive algorithm. As the name suggests, it is fast, but it can tear through stack space pretty quick for certain types of patterns. The worst offender is a group quantifier. This pattern eats lots of stack: “(.)+” whereas this one does not: “.+”. You should use fast mode when your patterns do not quantify groups or use the (?R) extension (see Recursive Patterns), or when you are not accepting the string and the pattern from users.
This mode uses the slower, safer iterative algorithm. It will not overflow the stack. Use this mode when you cannot make assumptions about the pattern and the string (eg., you are accepting them as input from users). Also, you can use this mode when your stack space is small, or if you are in a DLL and you cannot make assumptions about the size of your process’s stack space.
You should be aware that safe mode will occasionally need to make heap allocations, which could fail in low-memory situations and generate an exception. You should wrap calls to match(), count() and substitute() in a try/catch block and handle any std::bad_alloc exceptions that may get thrown.
This is the default mode on non-Microsoft platforms. When operating in mixed mode, GRETA uses a simple heuristic to determine on a per-pattern basis which algorithm is the best to use. Patterns that contain quantified groups, like “(.)+”, are matched using safe mode, as are patterns that use the (?R) extension. All other patterns are matched using the fast mode.
There are a few things my code does a bit differently than Perl. These are things that could trip you up, and you should be aware of them.
Perl lets you put Perl code directly into a regex with the (?{ code }) extension. You can’t embed Perl code in a GRETA regular expression. Duh! But see Recursive Patterns for a useful hack that might get you what you’re looking for.
In the pattern “a(?i)b”, (?i) turns on case-insensitive pattern matching. In both Perl and GRETA, pattern modifiers such as (?i) have the scope of the nearest enclosing group. Outside of that group, they have no effect. However, in GRETA, the scope of the pattern modifier extends only to the right, whereas in Perl it extends both to the right and left. For example, in Perl, the above example would match the string “AB”, but in GRETA it would not because the “a” appears to the left of the (?i) modifier.
Perl lets you write patterns like /X(?#Match X 3 times){3}/, where the bit in (?#...) is a comment, and {3} is a quantifier that applies to X. (I don’t know who would write such a pattern, but Perl programmers are a pretty sick bunch.) In GRETA, comment groups logically separate quantifiers from the things quantified, so this will not match XXX like it does in Perl.
Perl doesn’t allow variable-width look-behind assertions, probably because it can be really expensive. GRETA allows it, and lets you shoot yourself if you’re not careful. For variable-width look-behind assertions, every evaluation of the look-behind is like a rightmost pattern match unto itself, complete with backtracking. This can result in very severe exponential behavior, especially if you’re using your pattern to match very long strings. I have done my best to protect you, though. GRETA does static width analysis of your look-behind assertions. This analysis is fairly robust; for example, it recognizes that the pattern “(\d{3}-){2}\d{4}” must match exactly 12 characters (a hyphen-separated telephone number, for instance). If the static analysis reveals that the look-behind assertion must match between N and M characters, it only tries matching in those places where a match is possible. In the previous example, the pattern matcher would back up exactly 12 characters and look for a phone number. If it fails, it gives up without looking anyplace else.
By the way, you are allowed to create back-references in your positive look-behind assertions and use them later in your pattern and access then after a successful match.
Perl lets you embed code directly into your patters with the (?{ code }) extension, which relies on the Perl interpreter to interpolate the code and execute it at pattern match time. One very nice application of this is to write recursive patterns for matching things like balanced, nested parenthesis, XML tags, etc. Obviously, I can’t call out to an interpreter in C++. But I have included experimental support for recursive patterns with the (?R) extension. For example, consider the following pattern (assume EXTENDED is used so white space is ignored):
\( ( (?>[^()]+) | (?R) )* \)
First, you match an opening paren. Then you match a bunch of stuff that is not an opening or closing paren. Then, if you can match a closing paren you are done. Otherwise, you recurse by matching an opening paren again, followed by a bunch of stuff that is not a paren, etc, etc. You finish when you find a matching paren for the first one you found.
This extension was inspired by PCRE (Perl-compatible regular expressions) at http://www.pcre.org.
Here are some #define’s to give you control of some aspects of GRETA’s behavior.
Define REGEX_WIDE_AND_NARROW if you plan to
match wide and narrow strings in the same executable. This forces the instantiation of both the
wide and narrow versions of the pattern matching routines. Note that this will make your executable
bigger.
Define REGEX_POSIX if you want to use the POSIX
syntax module. It causes the appropriate
templates to get instantiated.
Define REGEX_NO_PERL if you are not planning on
using the Perl syntax module. It
prevents the Perl templates from getting instantiated. This saves some space.
In debug mode, the GRETA performs sanity checks at
run-time. Among other things, it turns
off the arena allocator and turns on run-time type
checking in the type-unsafe parts of the code.
This makes the code run very slow, but it’s a good idea to try your
patterns a few times with REGEX_DEBUG on (-DREGEX_DEBUG=1), just to make sure
everything is kosher. If you define any
of: DEBUG, _DEBUG or DBG, then REGEX_DEBUG gets turned on automatically. To override this, explicitly turn it off with
–DREGEX_DEBUG=0.
When parsing a pattern, GRETA does a number of allocations. To speed up the parsing phase, the allocations come from an arena. This is a significant performance enhancement, but it can interfere with debug tools like AppVerifier and PageHeap which track allocations. By turning REGEX_DEBUG_HEAP on (-DREGEX_DEBUG_HEAP=1), you can effectively turn off arena allocation. By default, REGEX_DEBUG_HEAP is turned on when REGEX_DEBUG is on.
The aforementioned regex arena allocator is pretty sneaky. All memory associated with a parsed pattern resides within the arena, so when the pattern is being destroyed the arena is simply thrown away. The destructors for the objects allocated in the arena never get called. This is done for performance reasons. However, some of the objects allocated in the arena are STL containers, and a strict interpretation of the C++ specification reveals that this degenerate allocation scheme is not entirely kosher. It relies on stateful allocators – that is, allocators that maintain state and don’t necessarily compare equal to each other. Most modern implementations of the STL support stateful allocators as an extension. This includes Dinkumware (VC7 and later) and STLPort. If you are using a version of the STL that does not support stateful allocators, or if you are interested in strict conformance to the standard, you can compile GRETA with REGEX_NO_ALLOCATOR. The STL containers will use std::allocator instead of the regex arena, and destructors will get called to properly clean up the memory.
Anyone who has written a low-level memory routine and tried to make it portable has run into alignment problems. I hit one such problem when writing GRETA’s stack class. It constructs objects of unknown type in place from a block of pre-allocated memory, and it tries to be clever about alignment. But it’s not perfect. (Blame the C++ standardization committee or Bjarne Stroustrup, but don’t blame me.) If you’re getting a compiler error in restack.h on a line that looks like this:
static_assert< (ALIGNMENT >= alignof<T>::value) > const align_test;
then you have hit an alignment problem. Relax, it is easy to fix. Just compile with: –DREGEX_STACK_ALIGNMENT=8. If that doesn’t work, try an alignment of 16 or 32.
On many implementations of the STL, string::iterator is not a typedef for char*. Rather, it is a wrapper class. As a result, the regex code gets instantiated twice, once for bare pointers (rpattern_c) and once for the wrapped pointers (rpattern). But if there is a conversion from the bare pointer to the wrapped pointer, then we only need to instantiate the template for the wrapped pointers, and the code will work for the bare pointers, too. This can be a significant space savings. The REGEX_FOLD_INSTANTIONS macro controls this optimization. The default is "off" for backwards compatibility. To turn the optimization on, compile with: -DREGEX_FOLD_INSTANTIATIONS=1. When instantiation-folding is enabled, rpattern_c and match_results_c are typedefs for rpattern and match_results, and attempts to use basic_rpattern<const TCHAR *> directly will result in linker errors.
Here are a couple of
random things to be aware of.
Make your patterns “static const” if you can.
That way, they are compiled once when the thread of execution passes
over their declarations, not every time.
Consider the two following functions:
void good_function( string & str )
{
static const rpattern static_pattern( “your
pattern here” );
static_pattern.substitute( str );
}
void bad_function( string & str )
{
rpattern
auto_pattern( “your pattern here” );
auto_pattern.substitute( str );
}
In good_function(), the static_pattern object is initialized once the first time good_function() is called.
Thereafter, good_function() can be called as many times as you like, and you
won’t incur the penalty of recompiling the pattern. The static_pattern object is cleaned up only when the program
terminates. In bad_function(), the auto_pattern object is initialized every time bad_function() is called, and it is cleaned up every time bad_function() returns. This
is unnecessary overhead since the pattern you are looking for is not changing
each time. But read the Thread-safety section below for one small caveat.
A const rpattern object is completely thread-safe. For instance, if you declare an rpattern object to be “static const”,
you can use it in multiple threads to perform simultaneous match() or substitute()
operations.
However, VC does no
synchronization while constructing function-local, static objects. (Sad, but true.) It’s possible for two threads to try to
initialize a static const pattern simultaneously. But fear not! I’ve provided a
work-around. You can use the STATIC_RPATTERN macro to declare all your static const patterns. If _MT is
defined, then STATIC_RPATTERN will guard all your rpattern
constructions with a critical section.
By default, GRETA uses
recursion in a depth-first search for matches.
If you are matching against a very long string (several thousands of
characters long), then there is a chance you could overflow the stack. If you use the VC compiler on a Win32
platform, then GRETA traps the stack overflow exception and resets the stack
guard page so that any additional stack overflows that may occur can also be
handled. In that case, execution will
continue as if the pattern had failed to match.
An ounce of prevention
goes a long way, however. There are some
simple things you can do to prevent stack overflows. The first is to not match against strings
that are 1000’s of characters long. J Next, you can
avoid constructs that require deep recursion.
Quantifying a group is one such construct. So for instance, this pattern recurses deeply: “(.)+”, whereas
this one does not: “.+”. Also, you should compile GRETA with full
optimization, making sure to turn on inline expansion (/Ob1 or /Ob2). This drastically reduces GRETA’s
stack usage. Finally, you can use the
/STACK switch to the linker to increase your stack reserve. By default, the VC linker gives executables 1
Mb of stack reserve. By contrast, the
version of perl.exe that I have (from ActiveState)
has a whopping 16 Mb of stack reserve.
For most situations, 1 Mb is plenty, but if you allow deeply recursive
patterns to match long strings, then you might want to bump your stack reserve
up to 2 or 4 Mb. With 4 Mb of stack, you
can match a deeply recursive pattern against a string of 35,000 characters
without overflowing your stack.
Finally, if you are
still concerned about stack overflows, you can use the “safe” mode to match
patterns. When in safe mode, pattern
matching is done using an iterative algorithm, eliminating the possibility of
stack overflows. Safe mode is slower,
however. See Matching
Modes for more information.
The regular expression
engine is not DBCS-aware. You can use
only fixed-width character sets (that is, ANSI and UNICODE). Sorry.
This regular expression
package makes heavy use of STL (The Standard Template Library). STL is available in VC5 and higher. If you are not currently using STL in your
projects (why not?!) you may have to make some modifications to your build
settings to use this package. If you are
using the NT build environment, you should add the line “USE_STL=1” to your sources file. If you are building in VC, you may need to
explicitly link to one of the STL libraries (msvcprt.lib
or libcp.lib, or their debug equivalents).
If you are compiling GRETA under VC7, and you care about speed, you should not compile the code with managed extensions turned on. I have found that the performance of the regex engine degrades badly with managed code. You should compile GRETA as unmanaged code into a separate library, and then link the lib to your managed app.
I’ve needed to explicitly instantiate the basic_rpattern classes I thought would be most useful. I did it will the following hack, which lives at the bottom of regexpr2.cpp:
#ifndef REGEX_NO_PERL
# ifdef REGEX_WIDE_AND_NARROW
template class basic_rpattern<const char *>;
template class basic_rpattern<const wchar_t
*>;
template class basic_rpattern<string::const_iterator>;
template class basic_rpattern<wstring::const_iterator>;
# else
template class basic_rpattern<const TCHAR *>;
template class basic_rpattern<tstring::const_iterator>;
# endif
#endif
#ifdef REGEX_POSIX
# ifdef REGEX_WIDE_AND_NARROW
template class basic_rpattern<const char *,posix_syntax<char>
>;
template class basic_rpattern<const wchar_t
*,posix_syntax<wchar_t>
>;
template class basic_rpattern<string::const_iterator,posix_syntax<char>
>;
template class basic_rpattern<wstring::const_iterator,posix_syntax<wchar_t> >;
# else
template class basic_rpattern<const TCHAR *,posix_syntax<TCHAR>
>;
template class basic_rpattern<tstring::const_iterator,posix_syntax<TCHAR>
>;
# endif
#endif
By default, only the classes necessary to perform Perl-style pattern matches on strings of type TCHAR are instantiated. If you would like to do pattern matching on both char and wchar_t strings, then compile with REGEX_WIDE_AND_NARROW defined. Note that this will make your code twice as big. Also, by default, the Perl syntax module is compiled, and the POSIX syntax is not. If you want to use the POSIX syntax, then define REGEX_POSIX. If you don’t plan on using Perl syntax, you can turn off Perl template instantiation by defining REGEX_NO_PERL.
If you have defined your own syntax module that you want to use with the rpattern class, then you need to do something fancier. Delete the above code from the bottom of regexpr2.cpp and replace it with your own explicit instantiations, using the template parameters of your choosing. Or you can #include “regexpr2.cpp” and use implicit instantiation. But be aware of how many template instantiations you have, because each addition will greatly increase the size of your compiled code.
You should send bug
reports and feature requests to me, Eric Niebler. I’ll get to them as soon as I can.
|
Move implementation details into
internal namespace “detail". |
|
Fixed bad interaction between
NORMALIZE and EXTENDED flags that was causing “\n” escape sequences to be
ignored as white-space. |
|
Fixed some compile problems. |
|
rpattern can now be instantiated with non-const iterators. |
|
rpattern::set_flags() method has gone away. |
|
bad_regexpr now inherits from std::invalid_argument,
as it should. |
|
Fix NOBACKREFS flag, which has
been broken since 3/15. Many thanks to Michael Nelte
for the bug report. |
|
Make heterogeneous stack
exception safe and more generally useful. |
|
Fix linker error caused by 4/8
change. |
|
Safer alignment handling in unsafe_stack. |
|
Fix potential crashing bug in unsafe_stack. Only safe mode is affected. |
|
Retarget the code for VC7. |
|
Faster implementation of unsafe_stack resulting in an overall 10% perf improvement in “safe” mode. |
|
Some fixes for gcc 3.0.3. |
|
20% performance improvement in
character set matching. |
|
Fix VC7 build problem with the
forward declaration of _resetstkoflw. Thanks to Steve West for the suggested fix. |
|
Various portability fixes.
Compiles with gcc-2.96 with STLPort. |
|
Compiles cleanly on VC6 with /W4
and on VC7 with /W4 and /Wp64. |
|
GLOBAL match/substitute
operations with patterns that match zero-length substrings are now handled as
they are in Perl. |
|
Finally, a fix for the infamous
Repeated Pattern Matching Zero-Length Substring bug. |
|
Big perf
improvement for quantified literals and character sets. |
|
Statically initialize the Perl
syntax look-up tables to prevent global rpattern
objects from using the tables before they are initialized. |
|
Use group extent information to
make lookahead and lookbehind
assertions and independent subexpressions more
efficient in both time and (stack) space. |
|
Fix a bug when matching a
zero-width pattern (e.g. /^\s*$/) to an empty std::string. |
|
Add nathann’s
debug heap support. |
|
Fix AV when trying to match against
an uninitialized std::string. |
|
Add conditional subexpressions a-la Perl. |
|
Fix backtracking bug in
non-greedy group quantifiers. |
|
Fix bug in nested POSIX-style
character sets. |
|
Use small-object allocator to speed up pattern compilation. |
|
Complete overhaul of the
interface. |
|
Big
performance improvements. |
|
Add
negative posix charsets
like [:^alpha:]. |
|
Remove
vertical tab character (‘\v’) from \s intrinsic character set. |
|
Explicitly
zero Perl syntax look-up tables. |
|
Fix
assert when parsing certain invalid patterns. |
|
Fix
AV when reinitializing an rpattern. |
|
Remove
platform-dependent assumption that string iterators
are char*. |
|
Compiles
and runs with VC7. |
|
Use
pattern width analysis to optimize matching. |
|
Add
lookbehind assertions, which can be arbitrarily
complicated (Perl only allows fixed-width lookbehinds). |
|
Include
basic POSIX syntax module. |
|
Fix
leak when parsing invalid substitution strings. |
|
Support
quotemeta (\Q and \E) in pattern and substitution
strings. |
|
regexpr is now template on character
type, traits and allocator! |
|
Fixed
bug in Unicode charset matching. |
|
Expanded
the global match and substitute functionality |
|
Just
fixed a memory leak! Get the latest version. |
|
Submitted
to http://toolbox/ |
So, you want to know how GRETA works, huh? Well, it is a hybrid of a standard state machine and the Interpreter design pattern, as described in Design Patterns, by Gamma, et al. A state machine has states and transitions in a directed, possibly cyclic, graph. It also has a clear flow from the initial state to the terminal state. It is generally implemented with a p-code interpreter or with jump tables. The Interpreter pattern, on the other hand, doesn't deal with states -- it deals with grammar rules. Each rule in your language (e.g. regular expressions) gets its own polymorphic class, and you parse sentences in your language into trees of these polymorphic nodes according to your grammar rules. Recognizing sentences in your language (e.g., matching a pattern) happens by recursively traversing this tree.
GRETA is somewhere in between these two. It is basically a state machine, but the states are polymorphic nodes in a graph, and transitions are virtual function calls between nodes, much like Interpreter. There is one start node, and (essentially) one end node. In between, there are branches and loops. The goal is to find a way to the end state using only legal transitions. Each node either matches or it doesn't. If it doesn't, it returns false. If it does, it passes execution to the next node in the graph recursively. If there are branches, it picks one, and if its first guess fails to match (returns false), it tries the next one. The entire execution state during pattern matching is maintained in local variables on the stack. Since there is no dynamic allocation, it is very fast.
This approach can actually be as fast/faster that a traditional p-code state machine. With p-code, to move between states, you have an op-code and you execute a big switch on your op-code. Big switches are inefficient -- they introduce branches and blow locality out of the water. Compare that to a virtual function call. A couple of indirections and you're executing the code you want to be executing, with no branches. This is similar to the jump table school of state machines. But jump tables are hard to write, hard to understand and hard to maintain. With GRETA, the v-tables are the jump tables, and I let the compiler worry about writing them and maintaining them.
Finally, I use a lot of template tricks to choose branches at compile time and to turn virtual function calls into inline function calls. (Try that with a jump table!) This speeds things up a lot, at the expense of some code bloat. I have an article about this trick coming out in the C/C++ User's Journal in the next couple of months. Keep your eyes out for it.