Regex


Object Hierarchy:

GLib.Regex

Description:

[ Compact ]
[ Version ( since = "2.14" ) ]
[ CCode ( ref_function = "g_regex_ref" , type_id = "G_TYPE_REGEX" , unref_function = "g_regex_unref" ) ]
public class Regex

A `GRegex` is a compiled form of a regular expression.

After instantiating a `GRegex`, you can use its methods to find matches in a string, replace matches within a string, or split the string at matches.

`GRegex` implements regular expression pattern matching using syntax and semantics (such as character classes, quantifiers, and capture groups) similar to those of Perl regular expressions. See the [PCRE documentation](man:pcre2pattern(3)) for details.

A typical scenario for regex pattern matching is to check if a string matches a pattern. The following statements implement this scenario.

``` { .c }
const char *regex_pattern = ".*GLib.*";
const char *string_to_search = "You will love the GLib implementation of regex";
g_autoptr(GMatchInfo) match_info = NULL;
g_autoptr(GRegex) regex = NULL;

regex = g_regex_new (regex_pattern, G_REGEX_DEFAULT, G_REGEX_MATCH_DEFAULT, NULL);
g_assert (regex != NULL);

if (g_regex_match (regex, string_to_search, G_REGEX_MATCH_DEFAULT, &match_info))
  {
    int start_pos, end_pos;

    /* Match number 0 is the text matched by the whole pattern. */
    g_match_info_fetch_pos (match_info, 0, &start_pos, &end_pos);
    g_print ("Match successful! Overall pattern matches bytes %d to %d\n", start_pos, end_pos);
  }
else
  {
    g_print ("No match!\n");
  }
```

The constructor for `GRegex` includes two sets of bitmapped flags:

* [flags@GLib.RegexCompileFlags]: These flags control how GLib compiles the regex. There are options for case sensitivity, multiline, ignoring whitespace, etc.
* [flags@GLib.RegexMatchFlags]: These flags control `GRegex`’s matching behavior, such as anchoring and customizing definitions for newline characters.

Some regex patterns include backslash sequences, such as `\d` (digit) or `\D` (non-digit). When the pattern is written as a C string literal, each of those backslashes must itself be escaped. For example, the literal `"\\d\\D"` produces the pattern `\d\D`, which matches a digit followed by a non-digit.

GLib’s implementation of pattern matching includes a `start_position` argument for some of the match, replace, and split methods. Specifying a start position provides flexibility when you want to ignore the first _n_ characters of a string, but still want backslash assertions to take the character at position _n_ - 1 into account. For example, a database field contains inconsistent spelling for a job title: `healthcare provider` and `health-care provider`. The database manager wants to make the spelling consistent by adding a hyphen when it is missing. The following regex pattern tests for the string `care` that does not begin at a word boundary (that is, it directly follows a letter rather than a hyphen) and is followed by a space.

``` { .c }
const char *regex_pattern = "\\Bcare\\s";
```

An efficient way to match with this pattern is to start examining at `start_position` 6 in the string `healthcare` or `health-care`.

``` { .c }
const char *regex_pattern = "\\Bcare\\s";
const char *string_to_search = "healthcare provider";
g_autoptr(GMatchInfo) match_info = NULL;
g_autoptr(GRegex) regex = NULL;

regex = g_regex_new (regex_pattern, G_REGEX_DEFAULT, G_REGEX_MATCH_DEFAULT, NULL);
g_assert (regex != NULL);

g_regex_match_full (regex,
                    string_to_search,
                    -1,
                    6, /* position of 'c' in the test string */
                    G_REGEX_MATCH_DEFAULT,
                    &match_info,
                    NULL);
```

The method [method@GLib.Regex.match_full] (and the other methods that take a `start_position` argument) can look back before the start position to determine whether the preceding character satisfies an assertion.

Unless you set [flags@GLib.RegexCompileFlags.RAW] as one of the `GRegexCompileFlags`, all the strings passed to `GRegex` methods must be encoded in UTF-8. The lengths and the positions inside the strings are in bytes and not in characters, so, for instance, `\xc3\xa0` (i.e., `à`) is two bytes long but it is treated as a single character. If you set `G_REGEX_RAW`, the strings can be invalid UTF-8 and each byte is treated as a character, so `\xc3\xa0` is two bytes and two characters long.

Regarding line endings, `\n` matches a `\n` character, and `\r` matches a `\r` character. More generally, `\R` matches all typical line endings: CR + LF (`\r\n`), LF (linefeed, U+000A, `\n`), VT (vertical tab, U+000B, `\v`), FF (formfeed, U+000C, `\f`), CR (carriage return, U+000D, `\r`), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029).

The behaviour of the dot, circumflex, and dollar metacharacters is affected by newline characters. By default, `GRegex` recognizes any newline sequence matched by `\R`. You can limit the recognized newline characters by specifying the [flags@GLib.RegexCompileFlags.NEWLINE_CR], [flags@GLib.RegexCompileFlags.NEWLINE_LF], and [flags@GLib.RegexCompileFlags.NEWLINE_CRLF] compile options, and with the [flags@GLib.RegexMatchFlags.NEWLINE_ANY], [flags@GLib.RegexMatchFlags.NEWLINE_CR], [flags@GLib.RegexMatchFlags.NEWLINE_LF], and [flags@GLib.RegexMatchFlags.NEWLINE_CRLF] match options. These settings are also relevant when compiling a pattern if [flags@GLib.RegexCompileFlags.EXTENDED] is set and an unescaped `#` outside a character class is encountered. This indicates a comment that lasts until after the next newline.

Because `GRegex` does not modify its internal state between creation and destruction, you can safely use the same `GRegex` instance from different threads at the same time. In contrast, [struct@GLib.MatchInfo] is not thread-safe, so each thread needs its own.

The low-level regular expression functionality is provided by the excellent PCRE library written by Philip Hazel.


Namespace: GLib
Package: glib-2.0

Content:

Static methods:

Creation methods:

Methods: