Tech Tip: Creating Explicit UTF-8 Text Files

PRODUCT: 4D | VERSION: 12 | PLATFORM: Mac & Win

Published On: April 21, 2012

The following code creates a UTF-8 text file from 4D Text input. The CONVERT FROM TEXT command converts the 4D Text (which is UTF-16) into a BLOB containing UTF-8 text, and the BLOB is then saved to disk.

C_TEXT($1;$filePath_t)
C_TEXT($2;$fileBody_t)

C_BLOB($fileBody_b)
C_TIME($docRef_h)

$filePath_t:=$1
$fileBody_t:=$2

// Convert text to UTF-8.
CONVERT FROM TEXT($fileBody_t;"UTF-8";$fileBody_b)

$docRef_h:=Create document($filePath_t)

If (OK=1)
  // Add the content.
  SEND PACKET($docRef_h;$fileBody_b)
  CLOSE DOCUMENT($docRef_h)
End if

Here's the problem: a program opening the output file might interpret it as a different encoding, depending on the content. In part this is because UTF-8 is byte-for-byte identical to ASCII for the first 128 characters, so a file that contains only those characters gives no hint that it is UTF-8. Notepad++, for example, will interpret the file as ASCII instead of UTF-8 if there are no extended characters present. The key point is that the above code produces a file with the encoding "UTF-8 without BOM", where "BOM" stands for "Byte Order Mark". Without a BOM, the file encoding must be inferred from the content.
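To make the ambiguity concrete, here is a minimal sketch that converts two short strings and describes the resulting bytes. It assumes the database runs in Unicode mode, so that Char(233) is "é"; the variable names are illustrative only.

C_BLOB($ascii_b;$accent_b)

// "Hello" contains only characters below 128.
CONVERT FROM TEXT("Hello";"UTF-8";$ascii_b)
// Char(233) is "é" (assumes Unicode mode).
CONVERT FROM TEXT(Char(233);"UTF-8";$accent_b)

// $ascii_b holds 5 bytes (72, 101, 108, 108, 111), exactly the ASCII bytes.
// $accent_b holds 2 bytes (195, 169), a multi-byte UTF-8 sequence.

A file built from the first BLOB alone cannot be told apart from plain ASCII, while the second gives an editor something to base its guess on.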

If you wish to explicitly declare that the file contains UTF-8 text (and not "UTF-8 without BOM"), you can do so by inserting the BOM at the start of the file. Here is the code to do it:

C_TEXT($1;$filePath_t)
C_TEXT($2;$fileBody_t)

C_BLOB($fileBody_b;$bom_b)
C_TIME($docRef_h)

$filePath_t:=$1
$fileBody_t:=$2

SET BLOB SIZE($bom_b;3)
$bom_b{0}:=239 // EF (UTF-8)
$bom_b{1}:=187 // BB (UTF-8)
$bom_b{2}:=191 // BF (UTF-8)

// Convert text to UTF-8.
CONVERT FROM TEXT($fileBody_t;"UTF-8";$fileBody_b)

$docRef_h:=Create document($filePath_t)

If (OK=1)
  // Insert the BOM...
  SEND PACKET($docRef_h;$bom_b)
  // Add the content.
  SEND PACKET($docRef_h;$fileBody_b)
  CLOSE DOCUMENT($docRef_h)
End if
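As a possible variation on the method above, the BOM can be placed in the same BLOB as the content so that the file is written in a single call. This is only a sketch, not part of the original tip: it uses INSERT IN BLOB to make room for three bytes at the start of the BLOB and BLOB TO DOCUMENT to write the result, and it assumes that replacing any existing document at the given path is acceptable.

C_TEXT($1;$filePath_t)
C_TEXT($2;$fileBody_t)

C_BLOB($fileBody_b)

$filePath_t:=$1
$fileBody_t:=$2

// Convert text to UTF-8.
CONVERT FROM TEXT($fileBody_t;"UTF-8";$fileBody_b)

// Make room for three bytes at the start of the BLOB, then write the BOM.
INSERT IN BLOB($fileBody_b;0;3)
$fileBody_b{0}:=239 // EF
$fileBody_b{1}:=187 // BB
$fileBody_b{2}:=191 // BF

// Write the BOM and the content to disk in one call.
BLOB TO DOCUMENT($filePath_t;$fileBody_b)

After the call, OK can be tested in the same way as after Create document to find out whether the write succeeded.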

Please note: in a system that works only with UTF-8 files, the BOM is neither required nor recommended by the Unicode Standard. This technique is useful where files in different encodings might be encountered and the ability to detect the encoding is therefore important. Ideally, though, the encoding of a file should be known before accessing it, not detected.
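
For the reading side, here is a minimal sketch of how a method might check for the UTF-8 BOM before decoding a file. It assumes the whole file fits in memory, that the path arrives in $1 and the text is returned in $0, and that any file without a BOM should still be treated as UTF-8; the variable names are illustrative only.

C_TEXT($1;$filePath_t)
C_TEXT($0;$fileBody_t)

C_BLOB($fileBody_b)

$filePath_t:=$1

DOCUMENT TO BLOB($filePath_t;$fileBody_b)

If (OK=1)
  // Strip the BOM (EF BB BF) if the file starts with one.
  If (BLOB size($fileBody_b)>=3)
    If (($fileBody_b{0}=239) & ($fileBody_b{1}=187) & ($fileBody_b{2}=191))
      DELETE FROM BLOB($fileBody_b;0;3)
    End if
  End if
  // Decode the remaining bytes as UTF-8.
  $fileBody_t:=Convert to text($fileBody_b;"UTF-8")
End if

$0:=$fileBody_t

The size test is kept in its own If because 4D evaluates both sides of the & operator, so the byte comparisons must not run on a BLOB shorter than three bytes.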