Users have to select a
UTF-8 locale, for example with
export LANG=en_GB.UTF-8
in order to activate the
UTF-8 support in applications.
Application software that has to be aware of the character
encoding in use should always set the locale, for example with
setlocale(LC_CTYPE, "")
and programmers can then test the expression
strcmp(nl_langinfo(CODESET), "UTF-8") == 0
to determine whether a
UTF-8 locale has been selected and whether
therefore all plaintext standard input and output, terminal
communication, plaintext file content, filenames and environment
variables are encoded in
UTF-8.
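The two calls above can be combined into a small check. The following is a minimal sketch, not a canonical implementation: the helper names select_user_locale() and locale_is_utf8() are hypothetical, and the exact-match comparison against "UTF-8" simply mirrors the expression given above.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: adopt the character-type locale named by the
 * user's environment (LC_ALL, LC_CTYPE, LANG).
 * Returns 0 on success, -1 if the requested locale is unavailable.
 */
static int select_user_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL ? 0 : -1;
}

/*
 * Hypothetical helper: returns 1 if the currently selected LC_CTYPE
 * locale uses UTF-8 as its character encoding, 0 otherwise.
 */
static int locale_is_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would typically call select_user_locale() once at startup and consult locale_is_utf8() afterwards. Without the setlocale() call the process stays in the default "C" locale, so the test would not reflect the user's settings.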
Programmers accustomed to single-byte encodings such as
US-ASCII or
ISO 8859 have to be aware that two assumptions made so far are no longer valid
in
UTF-8 locales. First, a single byte no longer necessarily corresponds
to a single character. Second, since modern terminal emulators
in
UTF-8 mode also support Chinese, Japanese, and Korean
double-width characters as well as non-spacing
combining characters, outputting a single character does not necessarily advance the cursor
by one position as it did in
ASCII. Library functions such as
mbsrtowcs(3)
and
wcswidth(3)
should be used today to count characters and cursor positions.
The official ESC sequence to switch from an
ISO 2022 encoding scheme (as used for instance by VT100 terminals) to
UTF-8 is ESC % G
("\x1b%G"). The corresponding return sequence from
UTF-8 to ISO 2022 is ESC % @ ("\x1b%@"). Other ISO 2022 sequences (such as
for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
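The two sequences above are plain byte strings, so a program can emit them with ordinary stdio calls. The sketch below is only illustrative: the function names are hypothetical, and whether a given terminal honors the sequences is an assumption that has to be checked for that terminal.

```c
#include <stdio.h>

/* Byte sequences quoted in the text above. */
static const char ISO2022_TO_UTF8[] = "\x1b%G";  /* ESC % G */
static const char UTF8_TO_ISO2022[] = "\x1b%@";  /* ESC % @ */

/* Write one control sequence and flush so it reaches the terminal now. */
static int send_sequence(FILE *tty, const char *seq)
{
    if (fputs(seq, tty) == EOF)
        return -1;
    return fflush(tty) == EOF ? -1 : 0;
}

/* Hypothetical helper: ask the terminal to switch from ISO 2022 to UTF-8. */
static int terminal_enter_utf8(FILE *tty)
{
    return send_sequence(tty, ISO2022_TO_UTF8);
}

/* Hypothetical helper: ask the terminal to return from UTF-8 to ISO 2022. */
static int terminal_leave_utf8(FILE *tty)
{
    return send_sequence(tty, UTF8_TO_ISO2022);
}
```

fputs() is used rather than printf() because the sequences contain a percent sign, which printf() would treat as the start of a conversion specification.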
It is to be hoped that in the foreseeable future,
UTF-8 will replace
ASCII and
ISO 8859 at all levels as the common character encoding on POSIX systems,
leading to a significantly richer environment for handling plain text.