normalize

Description:

[ CCode ( cname = "g_utf8_normalize" ) ]
public string normalize (ssize_t len = -1, NormalizeMode mode = DEFAULT)

Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character.

The string must be valid UTF-8; otherwise null is returned. You should generally call normalize before comparing two Unicode strings.
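As a minimal sketch of the null-on-invalid-input behavior described above, the following compares two strings only after both have been normalized successfully. The helper name `normalized_equal` is hypothetical and not part of GLib; treating invalid input as "not equal" is one possible policy, not the only one.

```vala
// Hypothetical helper: compares two strings after normalization.
// Returns false if either input is not valid UTF-8 (normalize () gave null).
bool normalized_equal (string a, string b) {
	string? na = a.normalize ();
	string? nb = b.normalize ();
	if (na == null || nb == null) {
		return false;
	}
	return na == nb;
}

public static int main (string[] args) {
	// A lone 0xFF byte can never occur in valid UTF-8.
	string invalid = "\xFF";

	// Output: ``true``
	bool is_null = (invalid.normalize () == null);
	print ("%s\n", is_null.to_string ());

	// U+212B ANGSTROM SIGN vs. "A" followed by U+030A COMBINING RING ABOVE
	// Output: ``true``
	bool same = normalized_equal ("\xE2\x84\xAB", "\x41\xCC\x8A");
	print ("%s\n", same.to_string ());

	// Output: ``false``
	bool bad = normalized_equal (invalid, invalid);
	print ("%s\n", bad.to_string ());

	return 0;
}
```

Checking for null before comparing avoids silently treating two different invalid inputs as equal.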

The normalization mode NormalizeMode.DEFAULT only standardizes differences that do not affect the text content, such as the above-mentioned accent representation. NormalizeMode.ALL also standardizes the "compatibility" characters in Unicode, for example mapping SUPERSCRIPT THREE to its standard form (DIGIT THREE). Formatting information may be lost, but for most text operations such characters should be considered the same.
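The difference between the two modes can be illustrated with the SUPERSCRIPT THREE case mentioned above. This is a sketch that assumes the Vala enum values `NormalizeMode.DEFAULT` and `NormalizeMode.ALL` correspond to the two modes described; `equal_under` is a hypothetical helper introduced only for this example.

```vala
// Hypothetical helper: true when both strings are identical after
// being normalized with the given mode.
bool equal_under (string a, string b, NormalizeMode mode) {
	return a.normalize (-1, mode) == b.normalize (-1, mode);
}

public static int main (string[] args) {
	string superscript = "\xC2\xB3"; // U+00B3 SUPERSCRIPT THREE
	string digit = "3";              // U+0033 DIGIT THREE

	// The default mode keeps compatibility characters distinct.
	// Output: ``false``
	bool res = equal_under (superscript, digit, NormalizeMode.DEFAULT);
	print ("%s\n", res.to_string ());

	// The "all" mode folds the superscript into the plain digit.
	// Output: ``true``
	res = equal_under (superscript, digit, NormalizeMode.ALL);
	print ("%s\n", res.to_string ());

	return 0;
}
```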

NormalizeMode.DEFAULT_COMPOSE and NormalizeMode.ALL_COMPOSE are like NormalizeMode.DEFAULT and NormalizeMode.ALL, but return a result with composed forms rather than a maximally decomposed form. This is often useful if you intend to convert the string to a legacy encoding or pass it to a system with less capable Unicode handling.
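To make the composed versus decomposed distinction concrete, the sketch below normalizes a single accented letter both ways and prints the resulting byte lengths. It assumes `NormalizeMode.DEFAULT` and `NormalizeMode.DEFAULT_COMPOSE` are the Vala names of the decomposing and composing default modes.

```vala
public static int main (string[] args) {
	// U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed (2 bytes in UTF-8)
	string precomposed = "\xC3\xA9";

	// Decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT
	string decomposed = precomposed.normalize (-1, NormalizeMode.DEFAULT);

	// Composed form: back to the single precomposed character
	string composed = decomposed.normalize (-1, NormalizeMode.DEFAULT_COMPOSE);

	// length counts bytes, not characters.
	// Output: ``3``
	print ("%d\n", decomposed.length);

	// Output: ``2``
	print ("%d\n", composed.length);

	// Output: ``true``
	bool res = (composed == precomposed);
	print ("%s\n", res.to_string ());

	return 0;
}
```

The composed form is the shorter one here, which is one reason it tends to be the better fit for legacy encodings that only have a code point for the precomposed character.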

Example: UTF-8 handling, normalization:

public static int main (string[] args) {
	string str1 = "\xE2\x84\xAB"; // U+212B ANGSTROM SIGN
	string str2 = "\x41\xCC\x8A"; // "A" followed by U+030A COMBINING RING ABOVE

	// Output: ``"Å", "Å"``
	print ("\"%s\", \"%s\"\n", str1, str2);

	// Output: ``false``
	bool res = (str1 == str2);
	print ("%s\n", res.to_string ());

	// Output: ``true``
	str1 = str1.normalize ();
	str2 = str2.normalize ();
	res = (str1 == str2);
	print ("%s\n", res.to_string ());

	return 0;
}

valac --pkg glib-2.0 string.normalize.vala

Parameters:

len	length of `str`, in bytes, or -1 if `str` is nul-terminated.
mode	the type of normalization to perform.
str	a UTF-8 encoded string (the string instance the method is called on).

Returns:

a newly allocated string that is the normalized form of str, or null if str is not valid UTF-8.