Tech Tip: Use Match regex to fold or split text into strings of a maximum length

PRODUCT: 4D | VERSION: 11 | PLATFORM: Mac & Win

Published On: September 17, 2009

Occasionally there is a need to break up a block of text, such as the one below:

Lorem ipsum dolor sit amet, risus aliquam sollicitudin pede egestas massa libero, cursus integer sed quis et molestie metus, nulla pede adipiscing sed orci, ac eu consectetuer et massa adipiscing, sed blandit ligula enim turpis rutrum imperdiet. Hendrerit quam tincidunt viverra lacus, sem dolor venenatis ultrices mauris, praesent egestas eleifend inceptos metus rutrum, nullam nec maecenas erat et fusce nibh, elit pellentesque sed amet et adipiscing. Lacus urna et nunc, odio ipsum, posuere viverra praesent voluptatum urna ut scelerisque, massa commodo velit. In dictumst at cras. Eget feugiat. Mollis inceptos convallis eros sapien, facilisis scelerisque mauris mauris orci magna. Dolor arcu temporibus, maecenas et porta arcu.

It can be broken into a series of lines or strings of limited length, as is commonly seen in email replies:

> Lorem ipsum dolor sit amet, risus aliquam sollicitudin pede egestas
> massa libero, cursus integer sed quis et molestie metus, nulla pede
> adipiscing sed orci, ac eu consectetuer et massa adipiscing, sed blandit
> ligula enim turpis rutrum imperdiet. Hendrerit quam tincidunt viverra
> lacus, sem dolor venenatis ultrices mauris, praesent egestas eleifend
> inceptos metus rutrum, nullam nec maecenas erat et fusce nibh, elit
> pellentesque sed amet et adipiscing. Lacus urna et nunc, odio ipsum,
> posuere viverra praesent voluptatum urna ut scelerisque, massa commodo
> velit. In dictumst at cras. Eget feugiat. Mollis inceptos convallis eros
> sapien, facilisis scelerisque mauris mauris orci magna. Dolor arcu
> temporibus, maecenas et porta arcu.

Doing this in traditional 4D code is no small task. However, by using the 4D command Match regex the task becomes almost trivial.

The first step is to define a regular expression (regex) pattern that provides clean breaks on word boundaries. Those breaks must also fall at points that do not exceed the desired length. There are actually two patterns that can be used to accomplish this task. The first works only on the printable low ASCII characters (codes 32-126):

$Pat_T:="(?:[ -~]{1,72}(?:$|\\s))"

The construct (?:regex) is known as a "non-capturing" group. What it matches is not stored in memory for later use, for example in a "replace" function. All that is needed here is for it to reveal where the pattern matched.

The first part of this pattern, "[ -~]", matches ASCII characters 32 (space) through 126 (tilde). Next, "{1,72}" stipulates that a run of at least 1 but not more than 72 such characters is desired. Finally, the embedded pattern "(?:$|\\s)" is another non-capturing group that stipulates a break either at the end of the string ($) or at a whitespace character (\s), such as a space, carriage return or new line. See "Regular Expression Basic Syntax Reference" and "Regular Expression Advanced Syntax Reference".
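
To see how the greedy quantifier and the trailing group interact, the pattern can be tried on a short string. The lines below are only a sketch: the sample text and the width of 10 (instead of 72) are assumptions chosen so the result is easy to check by hand.

` Hypothetical sample: a width of 10 keeps the match easy to verify
C_TEXT($Pat_T;$Buf_T)
C_LONGINT($FoundAt_L;$Length_L)
C_BOOLEAN($Bool_B)
$Pat_T:="(?:[ -~]{1,10}(?:$|\\s))"
$Buf_T:="one two three four"
$Bool_B:=Match regex($Pat_T;$Buf_T;1;$FoundAt_L;$Length_L)
` Expected: $Bool_B is True, $FoundAt_L is 1, $Length_L is 8 ("one two" plus the space)

The run first takes the full 10 characters ("one two th"), then backs off one character at a time until what follows is whitespace or the end of the string. Note that the reported length includes the whitespace character that ended the run.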

However, this pattern will not work in all 4D environments. Since 4D v11 works in UNICODE mode, there is a need for a pattern that will work in UNICODE mode as well. Such a pattern is shown below:

$Pat_T:="(?:[^\\p{C}]{1,72})(?:$|\\p{Z})"

Again the pattern is "non-capturing". This time [^\\p{C}] matches every character that is NOT (^) an invisible control character or unused code point (\p{C}). Again "{1,72}" stipulates the group size to be from 1 to 72 characters. Finally (?:$|\\p{Z}) plays the same role as above, except that "\p{Z}" matches any kind of Unicode space or invisible separator. Note that Unicode classifies tab, carriage return and line feed as control characters, so they fall under \p{C} rather than \p{Z}. See "Unicode Character Properties" and "Regular Expression Unicode Syntax Reference".
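
Both patterns hard-code the maximum length. If the width needs to vary, the pattern can be assembled from a variable, as in the sketch below. The variable $Width_L is a hypothetical name that does not appear elsewhere in this tip.

` $Width_L is a hypothetical variable holding the maximum line length
C_LONGINT($Width_L)
C_TEXT($Pat_T)
$Width_L:=72
` Yields the same Unicode pattern as above when $Width_L is 72
$Pat_T:="(?:[^\\p{C}]{1,"+String($Width_L)+"})(?:$|\\p{Z})"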

Using these patterns the text can be "folded" using the code below (in these examples the text is folded at 72 characters as per the regular expressions used above):

$Start_L:=1
$FoundAt_L:=0
$Length_L:=0
$Bool_B:=Match regex($Pat_T;$Buf_T;$Start_L;$FoundAt_L;$Length_L)
While ($Bool_B)
    $Buf_T:=Insert string($Buf_T;"\r> ";$FoundAt_L)
    ` Plus 3 to account for the inserted characters
    $Start_L:=$FoundAt_L+$Length_L+3
    $Bool_B:=Match regex($Pat_T;$Buf_T;$Start_L;$FoundAt_L;$Length_L)
End while
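
Because the loop inserts "\r> " at the start of every match, including the first one, the folded text should begin with a carriage return whenever the first match is at position 1. If that leading line break is unwanted, it can be removed afterwards. This is a minimal sketch, assuming $Buf_T has just been folded by the loop above:

` Nested tests so that $Buf_T[[1]] is never evaluated on an empty text
If (Length($Buf_T)>0)
    If (Character code($Buf_T[[1]])=13)
        ` Drop the carriage return inserted ahead of the first line
        $Buf_T:=Substring($Buf_T;2)
    End if
End if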

Alternatively, to split the text into an array:

ARRAY TEXT($Fold_aT;0)
$Start_L:=1
$FoundAt_L:=0
$Length_L:=0
$Bool_B:=Match regex($Pat_T;$Buf_T;$Start_L;$FoundAt_L;$Length_L)
While ($Bool_B)
    APPEND TO ARRAY($Fold_aT;Substring($Buf_T;$FoundAt_L;$Length_L))
    $Start_L:=$FoundAt_L+$Length_L
    $Bool_B:=Match regex($Pat_T;$Buf_T;$Start_L;$FoundAt_L;$Length_L)
End while
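
Once the text is in the array, each element can be processed on its own. As an illustration only, the sketch below rebuilds a quoted block from the array; $Quoted_T and $i_L are hypothetical names introduced for this example.

` Rebuild a quoted block from the elements of $Fold_aT
C_TEXT($Quoted_T)
C_LONGINT($i_L)
$Quoted_T:=""
For ($i_L;1;Size of array($Fold_aT))
    $Quoted_T:=$Quoted_T+"> "+$Fold_aT{$i_L}+"\r"
End for

Each element still ends with the whitespace character that closed the match, since that character is part of the matched length. It may need to be trimmed if exact line lengths matter.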