21 May, 2013

Episode Four: Not Enough Keys in the Keyboard

C++'s basic source character set consists of only 96 characters, yet the language also offers a way to name any character in the ISO10646 universal character set —the character repertoire of Unicode—. As small as that set may sound, nine of those characters lie outside the ISO646 invariant character set. This can be problematic when the encoding and/or keyboard used to write code does not support one or more of these nine characters. Hence, a workaround is born...

This article makes several references to the different translation phases of the C++ compiler. It suffices to know that translation phase X happens before translation phase X+1. For a complete specification of translation phases, see Phases of translation.

Character Sets

Basic Source Character Set

The C++ language is defined in terms of the basic source character set. The set of physical source file characters accepted is implementation-defined; in the very first phase of translation, source file characters are mapped to the basic source character set.

[2.3/1] The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

0 1 2 3 4 5 6 7 8 9

_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

Nine of them lie outside the ISO646 invariant character set; these are:

{ } [ ] # ^ | ~ \

For instance, some EBCDIC (Extended Binary Coded Decimal Interchange Code) code pages lack characters such as '{' and '}'. This may seem rare nowadays —living in a Unicode era—, but it was a real issue back when the C and C++ standards were being defined.

Universal Character Names

Other characters from the ISO10646 universal character set can be named using universal-character-names. When mapping source file characters, any source file character not in the basic source character set is replaced by the universal-character-name that designates that character. Note, however, that how source file characters map to universal-character-names depends on the source file character set, which is implementation-defined.

[2.3/2] The universal-character-name construct provides a way to name other characters.

hex-quad:
  hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

universal-character-name:
  \u hex-quad
  \U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. [...]

The translation of universal-character-names within an ordinary or wide string literal will depend on the execution character sets, which are also implementation-defined, creating potential portability issues. UTF-8, UTF-16 and UTF-32 character and string literals, on the other hand, use their respective encodings regardless of the actual execution character sets.
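
For instance, here is a minimal sketch (assuming a C++11 compiler and, for the narrow literal, an execution character set that can represent é) of the same universal-character-name in differently encoded string literals:

/** \u00E9 names é, LATIN SMALL LETTER E WITH ACUTE **/
const char*     narrow = "\u00E9";   // encoding depends on the execution character set
const char*     utf8   = u8"\u00E9"; // always UTF-8:  0xC3 0xA9
const char16_t* utf16  = u"\u00E9";  // always UTF-16: 0x00E9
const char32_t* utf32  = U"\u00E9";  // always UTF-32: 0x000000E9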

Universal-character-names can also be used to name characters of the basic source character set. However, using such a universal-character-name anywhere other than inside a character or string literal renders the program ill-formed.
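
A minimal sketch of that rule (the exact diagnostic will vary by compiler):

const char* ok = "\u0041BC";   // fine inside a string literal: "ABC"
int \u0041 = 0;                // ill-formed: \u0041 names 'A', a basic source character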

A note on Universal Character Names

Universal characters can be used in identifiers. The allowed ranges are specified in Annex E - Universal character names for identifier characters - of the C++ standard. The intention is not to write identifiers as d\u00E9j\u00E0_vu, but rather to use the —identical as far as C++ cares— déjà_vu form.

/** assuming the source file is in the expected character encoding **/
int déjà_vu = 0;
int d\u00E9j\u00E0_vu = 0; // error: 'déjà_vu' : redefinition
int d\U000000E9j\U000000E0_vu = 0; // error: 'déjà_vu' : redefinition

Whether that's a sensible thing to do is a subject for debate. It has to at least be considered that people are unlikely to remember the several ranges of allowed and initially disallowed characters for identifiers, much less the hexadecimal representation of each universal character. In general, universal character names should be avoided in identifiers unless absolutely necessary; the basic character set should suffice for almost every identifier [citation needed].

Trigraphs

Trigraph sequences are replaced by the corresponding single-character internal representations. This happens in the first translation phase, right after mapping source file characters to the basic source character set.

[2.4/1] Before any other processing takes place, each occurrence of one of the following sequences of three characters ("trigraph sequences") is replaced by the single character indicated in Table 1.

Trigraph Replacement
??=      #
??(      [
??<      {
??/      \
??)      ]
??>      }
??'      ^
??!      |
??-      ~

[2.4/2] [ Example:

??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)

becomes

#define arraycheck(a,b) a[b] || b[a]

—end example ]

As with everything that happens at the preprocessing stage, trigraph replacement has no knowledge of the language rules. This means that trigraph sequences are replaced everywhere —...or are they? Yes, they are. Keep reading—.

Within character and string literals

Trigraph sequences are replaced even when found inside character and string literals —the replacement doesn't even know the literals are there—.

auto s = "He said 'Hello??'."

/** after trigraph replacement: **/

auto s = "He said 'Hello^.".

And that is the whole reason why there is an escape-sequence \?, so that we can break a trigraph sequence within a character or string literal.

auto s = "He said 'Hello?\?'." // no trigraph sequence here, move along...

Another way to break trigraph sequences, one that works for string literals only, is to leverage string literal concatenation. In translation phase six, adjacent string literal tokens are concatenated; by then, trigraph sequences have lost their chance to be replaced.

auto s = "He said 'Hello?" "?'." // trigraph sequence formed at a late translation phase, all is well...

A note on raw string literals

Preprocessing tokens are formed at translation phase three, and by then replacements such as trigraph sequences —as well as universal-character-names and line splicing— have already been applied. It is at this stage that raw string literals are recognized, and they have the nice property that no transformations apply within them. In reality, some transformations have already happened, but they have to be reverted.

[2.5/3] If the input stream has been parsed into preprocessing tokens up to a given character:

  • If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; [...]

[ Example:

#define R "x"
  const char* s = R"y"; // ill-formed raw string, not "x" "y"

—end example ]
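
To see that reversion in action, here is a minimal sketch (assuming a compiler that performs trigraph replacement as the standard mandates):

auto raw    = R"(trigraph: ??= stays)";   // the phase 1 replacement is reverted; the value keeps the two question marks and the =
auto cooked =   "trigraph: ??= goes";     // the ordinary literal ends up holding "trigraph: # goes"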

Within comments

Besides character and string literals, the other place where two consecutive question marks can legitimately appear without being intended as a trigraph sequence is inside comments. Comments are turned into whitespace at translation phase three, leaving little wiggle room for disaster. Still, translation phase two performs line splicing —physical source lines terminated in a backslash character \ are spliced into a single logical line—, and it so happens that there is a trigraph sequence for \.

// Why isn't this an error??/
void i = 0;

/** after trigraph replacement: **/

// Why isn't this an error\
void i = 0;

/** and after line splicing: **/

// Why isn't this an errorvoid i = 0;

Alternative tokens

In 1994 a normative amendment to the C standard, included in C99, supplied digraphs as more readable alternatives to five of the trigraphs. Those are:

Digraph Replacement
<: [
:> ]
<% {
%> }
%: #

This is not only a more readable alternative to trigraphs —if one may call them that—, but a safer alternative as well. Unlike trigraphs, digraphs are full-fledged tokens, handled during tokenization at translation phase three. This means that digraphs won't be replaced within comments or character and string literals. In fact, they won't be replaced at all; a digraph always represents a full token by itself, or composes the token %:%:, which replaces the preprocessor concatenation token ##.
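
A minimal sketch of that difference:

const char* s = "<% ??<";    // after phase 1 the literal is "<% {": the trigraph was replaced, the digraph was not
int a<:3:> = <%1, 2, 3%>;    // digraphs as full tokens: int a[3] = {1, 2, 3};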

C++ also incorporated those alternatives, adding a few of its own that render the name digraph unsuitable. In particular, %:%: is treated as a single token rather than as two occurrences of %:. In C++ they are known as alternative tokens.

[2.6/1] Alternative token representations are provided for some operators and punctuators.

[2.6/2] In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling. The set of alternative tokens is defined in Table 2.

Alternative Primary
<:          [
:>          ]
<%          {
%>          }
%:          #
%:%:        ##
and         &&
bitor       |
or          ||
xor         ^
compl       ~
bitand      &
and_eq      &=
or_eq       |=
xor_eq      ^=
not         !
not_eq      !=

The standard further reads on a note for [2.6/1]:

These include “digraphs” and additional reserved words. The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren’t lexical keywords are colloquially known as “digraphs”.
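
For instance, a small sketch of alternative tokens standing in for their primary forms (the function is merely illustrative):

bool in_range(int x, int lo, int hi)
<%
    return x >= lo and x <= hi;   // identical to: return x >= lo && x <= hi;
%>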

These alternative tokens behave exactly the same as their primary counterparts, except when they are handed to the preprocessor stringizing operator, which preserves their spelling:

#define STRINGIZE(text) #text

std::cout << STRINGIZE([); // outputs "["
std::cout << STRINGIZE(<:); // outputs "<:"

A note on <:

The alternative token <: is the weirdest of them all. Preprocessing tokenization is greedy: it always forms the longest sequence of characters that could constitute a preprocessing token. This happens even when it results in an error that could otherwise be avoided.

However, C++ introduces an exception for the alternative token <:. This is needed due to an otherwise unwanted interaction between the digraph and the scope resolution operator :: —not present in C—.

auto b = x<::y; // x < ::y or the erroneous x[:y ?

std::vector<::std::string> v;  // std::vector< ::std::string > or the erroneous std::vector[:std::string> ?

int array<::> a; // int array[] ?

This results in the following rules:

[2.5/3] If the input stream has been parsed into preprocessing tokens up to a given character:

  • [...]
  • Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessor token by itself and not as the first character of the alternative token <:.
  • Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.

[2.5/5] [ Example: The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression. —end example ]
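
In practice, thanks to the exception for <:, the following now lexes the way one would expect (a minimal sketch; the headers are only there to make the fragment self-contained):

#include <string>
#include <vector>

std::vector<::std::string> v1;    // the exception applies: parsed as std::vector < ::std::string >
std::vector< ::std::string> v2;   // the pre-C++11 workaround, an explicit space, still works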

Summary

C++ has a few alternative ways of naming characters, both for back when there weren't enough keys and for today, when there are way too many characters.

  • Know about universal-character-names, trigraphs and alternative tokens so that you may recognize them.
  • Avoid using them when possible —unless you are going for obfuscated code of the year—.
  • Beware of unintended trigraph sequences.

References: