Tech Tip: Using Unicode property names in Match regex patterns

PRODUCT: 4D | VERSION: 11.6 | PLATFORM: Mac & Win

Published On: April 2, 2010

If you run your 4D v11 SQL database in Unicode mode, consider using Unicode property names in your regular expression (regex) pattern strings. They are very useful for testing the characters in a string with the 4D command Match regex. Testing for a property with "\p{UNICODE PROPERTY NAME}" or "\P{UNICODE PROPERTY NAME}" is often more effective than testing for a specific group of characters.

"\p{UNICODE PROPERTY NAME}" means "match any character with the specified Unicode Property".

"\P{UNICODE PROPERTY NAME}" means "match any character not having the specified Unicode Property".

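As a minimal sketch of the difference between the two forms, the following assumes a database in Unicode mode and uses the hypothetical variables $Char_T, $HasProperty_B and $LacksProperty_B. The backslash is doubled because 4D string literals treat \ as an escape character. With the two-parameter syntax, Match regex compares the whole string to the pattern, so a one-character string is used here.

```4d
` Sketch only: all variable names are hypothetical
C_TEXT($Char_T)
C_BOOLEAN($HasProperty_B;$LacksProperty_B)

$Char_T:=Char(233)  ` U+00E9, LATIN SMALL LETTER E WITH ACUTE

` \p{L}: the character has the Letter property, so True is expected
$HasProperty_B:=Match regex("\\p{L}";$Char_T)

` \P{L}: the character must NOT be a letter, so False is expected
$LacksProperty_B:=Match regex("\\P{L}";$Char_T)
```
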
Because Unicode is a large character set, a regular expression engine needs to provide for the recognition of whole categories of characters as well as simply ranges of characters; otherwise the listing of characters becomes impractical and error-prone. This is done by providing syntax for sets of characters based on the Unicode character properties.
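
To illustrate why a property is more practical than a character range, here is a small sketch that checks whether a name consists only of letters. It assumes Unicode mode; $Name_T, $RangeOnly_B and $AnyLetter_B are hypothetical names, and the sample value is arbitrary.

```4d
` Sketch only: all variable names are hypothetical
C_TEXT($Name_T)
C_BOOLEAN($RangeOnly_B;$AnyLetter_B)

$Name_T:="Zo"+Char(235)  ` "Zo" followed by U+00EB (e with diaeresis)

` Range-based test: expected False, because U+00EB is outside A-Z and a-z
$RangeOnly_B:=Match regex("[A-Za-z]+";$Name_T)

` Property-based test: expected True, because every character is a letter
$AnyLetter_B:=Match regex("\\p{L}+";$Name_T)
```

Extending the range-based pattern to cover every accented or non-Latin letter would mean listing many ranges by hand, which is the impractical and error-prone case described above.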

The most basic overall character property is the General Category, which is a basic categorization of Unicode characters into: Letters, Punctuation, Symbols, Marks, Numbers, Separators, and Other. Each of these property values has a single-letter abbreviation, which is the uppercase first character, except for Separators, which use Z, and Other, which uses C. The official data mapping Unicode characters to their General Category value is in the Unicode Character Database file UnicodeData.txt.

An additional reason to use Unicode property names is that they are more comprehensive than typical character tests. For example, \p{Whitespace} matches 26 code points, far more than the typical whitespace test, as the comparison below shows.

$IsWhitespace_B:=Match regex("\\p{Whitespace}";$Text_T)

compared to:

$IsWhitespace_B:=Match regex("[ \t\r\n]";$Text_T)

Note the need for the double backslash "\\" in the pattern. 4D recognizes the escape sequences \r, \n, \t, \", and \\ in string literals, but not "\p" or "\P". For "\p" or "\P" to reach the regex engine, the escape character \ must itself be escaped as \\.
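
Keep in mind that the two-parameter syntax of Match regex returns True only when the entire string matches the pattern. To look for whitespace anywhere inside a longer string, one option is the syntax that takes a start position. The sketch below assumes Unicode mode; $Text_T, $HasWhitespace_B, $Pos_L and $Len_L are hypothetical names.

```4d
` Sketch only: all variable names are hypothetical
C_TEXT($Text_T)
C_BOOLEAN($HasWhitespace_B)
C_LONGINT($Pos_L;$Len_L)

$Text_T:="Hello"+Char(160)+"World"  ` U+00A0, NO-BREAK SPACE, between the words

` Passing a start position (1) makes Match regex search within the string
` $Pos_L and $Len_L receive the position and length of the first match
$HasWhitespace_B:=Match regex("\\p{Whitespace}";$Text_T;1;$Pos_L;$Len_L)
```

Here the no-break space has the White_Space property, so a match is expected at position 6, whereas a class such as "[ \t\r\n]" would not find it.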

A partial listing of Unicode property names follows. For more detailed documentation and listings, see Unicode Technical Standard #18 - UNICODE REGULAR EXPRESSIONS and Unicode Regular Expressions - Regex Tutorial:

\p{L} or \p{Letter}: any kind of letter from any language.

\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.

\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.

\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.

\p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).

\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.

\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).

\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).

\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).

\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.

\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.

\p{Zl} or \p{Line_Separator}: line separator character U+2028.

\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.

\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.

\p{Sm} or \p{Math_Symbol}: any mathematical symbol.

\p{Sc} or \p{Currency_Symbol}: any currency sign.

\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.

\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.

\p{N} or \p{Number}: any kind of numeric character in any script.

\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.

\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.

\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0..9 (excluding numbers from ideographic scripts).

\p{P} or \p{Punctuation}: any kind of punctuation character.

\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.

\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.

\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.

\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.

\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.

\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.

\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.

\p{C} or \p{Other}: invisible control characters and unused code points.

\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00..0x1F and 0x7F..0x9F.

\p{Cf} or \p{Format}: invisible formatting indicator.

\p{Co} or \p{Private_Use}: any code point reserved for private use.

\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.

\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
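
As a sketch of how one of the two-letter categories above might be used, the following loop counts the uppercase letters in a string by calling Match regex repeatedly with an advancing start position. It assumes Unicode mode and the default case-sensitive matching; all variable names and the sample text are hypothetical.

```4d
` Sketch only: all variable names are hypothetical
C_TEXT($Text_T)
C_LONGINT($Start_L;$Pos_L;$Len_L;$Count_L)
C_BOOLEAN($Found_B)

$Text_T:="Paris, "+Char(201)+"vry and Oslo"  ` Char(201) is U+00C9 (E with acute)
$Start_L:=1
$Count_L:=0

Repeat
  ` Search for the next uppercase letter at or after $Start_L
  $Found_B:=Match regex("\\p{Lu}";$Text_T;$Start_L;$Pos_L;$Len_L)
  If ($Found_B)
    $Count_L:=$Count_L+1
    $Start_L:=$Pos_L+$Len_L  ` continue just after the match
  End if
Until (Not($Found_B))

` $Count_L now holds the number of uppercase letters found (3 expected here)
```

Swapping "\\p{Lu}" for another property from the list, such as "\\p{Nd}" or "\\p{P}", would count decimal digits or punctuation characters in the same way.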