29 Sep, 2013

Episode Seven: One Char to Rule them All

There are 5 types of character literals in C++. Two types of character literal for the narrow-kings under the sky, two for the universal-lords in their halls of stone, one for the mortal wide doomed to die, in the land of C++ where the shadows lie.

Character literals

[2.14.3/1] A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by one of the letters u, U, or L, as in u'y', U'z', or L'x', respectively. [...]

Narrow character literals

These are also known as ordinary character literals.

[2.14.3/1] [...] A character literal that does not begin with u, U, or L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. [...]

[3.9.1/1] Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. [...]

The integral value of a narrow character literal is implementation defined, as are the basic character set and basic execution character set. It is guaranteed, however, that a null character has value 0, and that decimal digits occupy consecutive positions in ascending order.

[2.3/3] The basic execution character set and the basic execution wide-character set shall each contain [...] a null character (respectively, null wide character), whose representation has all zero bits. [...] In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. [...]
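The two guarantees above are enough to convert digits portably. The following is a minimal sketch; the helper names digit_value and parse_decimal are made up for illustration and are not part of any library.

```cpp
#include <cassert>

// The only portable facts used here: '\0' is 0 and '0'..'9' are consecutive.
static_assert('\0' == 0, "the null character has value 0");
static_assert('9' - '0' == 9, "decimal digits are consecutive and ascending");

// Hypothetical helper: numeric value of a decimal digit character.
int digit_value(char c) {
    assert(c >= '0' && c <= '9');  // precondition: c is a decimal digit
    return c - '0';
}

// Hypothetical helper: parses leading decimal digits, stops at the first non-digit.
unsigned parse_decimal(const char* s) {
    unsigned result = 0;
    for (; *s >= '0' && *s <= '9'; ++s) {
        result = result * 10 + static_cast<unsigned>(digit_value(*s));
    }
    return result;
}
```

Note that no similar guarantee exists for letters, so 'z' - 'a' should not be assumed to be 25 on every implementation.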

Narrow character types are integral types, and as such they can be used to perform integral arithmetic. With at least 8 bits, they are also among the smallest integral types. These properties appear to make them perfect candidates for arithmetic on small quantities. However, plain char is not appropriate for this task, as the semantics of most integral operations depend on whether the type is signed or unsigned, and that particular aspect of char happens to be implementation defined.

While one could naively expect char to be signed, as int is, C left the signedness of char as an implementation choice, allegedly for optimization and performance reasons, and in fact many implementations choose to make it unsigned. Eventually, the uncertainty was resolved by adding an explicit signed char, while still leaving the signedness of plain char implementation defined. As a consequence, there are three char types: plain char, signed char, and unsigned char. The three share the same representation and, since C is permissive about pointer conversions, this change did not raise any compatibility concerns in C.

[3.9.1/1] [...] It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. [...] In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.
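These properties can be checked at compile time. The sketch below assumes a C++11 compiler; plain_char_is_signed and to_byte_value are hypothetical names used only for this example.

```cpp
#include <limits>
#include <type_traits>

// Plain char is a distinct type, whichever signedness the implementation picks.
static_assert(!std::is_same<char, signed char>::value, "char is not signed char");
static_assert(!std::is_same<char, unsigned char>::value, "char is not unsigned char");

// All three occupy the same amount of storage.
static_assert(sizeof(char) == sizeof(signed char) &&
              sizeof(char) == sizeof(unsigned char), "same storage");

// Implementation-defined answer: true where plain char behaves like signed char.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Hypothetical helper: a non-negative value for any char, regardless of the
// signedness of plain char, obtained by going through unsigned char first.
int to_byte_value(char c) {
    return static_cast<unsigned char>(c);
}
```

Going through unsigned char, as to_byte_value does, is a common way to sidestep the signedness question when a char is used as a small number or as an array index.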

C++ inherited this quirk from C but, because of its stricter type safety and more advanced features, it does present situations where awareness of the three similar yet distinct types is required. For example,

  • pointers:

    char* a = 0;
    signed char* b = a; // ok in C, error in C++
    unsigned char* c = b; // ok in C, error in C++
  • function overloads:

    void foo(signed char) {}
    void foo(unsigned char) {}
    
    foo('C'); // error, ambiguous call to overloaded function
  • template specializations:

    template <typename T> struct bar;
    template <> struct bar<signed char> {};
    template <> struct bar<unsigned char> {};
    
    bar<char> dummy; // error, uses undefined struct bar<char>
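One way out of the last two situations is to cover all three narrow character types explicitly. This is only a sketch of that idea; kind and narrow_name are hypothetical names chosen for the example.

```cpp
#include <string>

// Overload set with one function per narrow character type: no ambiguity,
// since a char literal now has an exact match.
std::string kind(char)          { return "char"; }
std::string kind(signed char)   { return "signed char"; }
std::string kind(unsigned char) { return "unsigned char"; }

// Same idea for specializations: plain char needs its own.
template <typename T> struct narrow_name;  // primary template left undefined
template <> struct narrow_name<char>          { static const char* get() { return "char"; } };
template <> struct narrow_name<signed char>   { static const char* get() { return "signed char"; } };
template <> struct narrow_name<unsigned char> { static const char* get() { return "unsigned char"; } };
```

With these in place, kind('C') selects the char overload and narrow_name<char> is a complete type.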

Narrow character types are also special in that every one of their bits is meaningful: there are no padding bits. Additionally, every bit pattern of an unsigned narrow character type represents a number, so there are no trap representations.

[3.9.1/1] [...] For narrow character types, all bits of the object representation participate in the value representation. For unsigned narrow character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. [...]
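A small illustration of what "no padding bits" means for unsigned char: the number of value bits equals CHAR_BIT, and every one of the 2^CHAR_BIT bit patterns is a distinct number. The function name count_unsigned_char_values is made up for this sketch.

```cpp
#include <climits>
#include <limits>

// Every bit of the object representation is a value bit.
static_assert(std::numeric_limits<unsigned char>::digits == CHAR_BIT,
              "unsigned char has no padding bits");

// Hypothetical helper: counts the distinct values of unsigned char by walking
// from 0 until the value wraps around to 0 again.
unsigned long count_unsigned_char_values() {
    unsigned long count = 0;
    unsigned char c = 0;
    do {
        ++count;
        ++c;  // unsigned arithmetic wraps modulo 2^CHAR_BIT
    } while (c != 0);
    return count;
}
```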

Furthermore, the standard guarantees that the representation of an object can be accessed via a pointer to a narrow character type.

[3.10/10] If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

  • [...]
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • [...]
  • a char or unsigned char type.
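This is the rule that makes byte-wise inspection of objects legal. The following is a minimal sketch, assuming trivially copyable types; object_bytes and from_bytes are hypothetical helpers written for this example.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: copies the object representation of a value,
// reading it through a pointer to unsigned char.
template <typename T>
std::vector<unsigned char> object_bytes(const T& object) {
    static_assert(std::is_trivially_copyable<T>::value, "sketch assumes trivially copyable T");
    const unsigned char* first = reinterpret_cast<const unsigned char*>(&object);
    return std::vector<unsigned char>(first, first + sizeof(T));
}

// Hypothetical helper: rebuilds a value from bytes produced by object_bytes.
template <typename T>
T from_bytes(const std::vector<unsigned char>& bytes) {
    static_assert(std::is_trivially_copyable<T>::value, "sketch assumes trivially copyable T");
    assert(bytes.size() == sizeof(T));  // precondition: exactly one object's worth
    T object;
    std::memcpy(&object, bytes.data(), sizeof(T));
    return object;
}
```

The order of the bytes returned by object_bytes depends on the implementation (endianness, for instance), so only the round trip is portable, not the individual byte values.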

A note on multicharacter literals

Perhaps surprisingly, the type of a narrow character literal in C is not char but int.

A multicharacter literal is an abomination where several characters are packed together in a single literal, generally as many as fit in an int —e.g. 'foo', 'bar!'—. In C++, an implementation is not even required to support multicharacter literals, but if it does then those literals have type int as in C, and an implementation defined value.

[2.14.3/1] [...] An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
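The difference in types can be observed directly. This sketch assumes an implementation that supports multicharacter literals (they are only conditionally supported); the pragma is a GCC/Clang-specific assumption used to silence the warning those compilers emit, and the constant names are made up for the example.

```cpp
#include <cstddef>
#include <type_traits>

#if defined(__GNUC__)
// Compiler-specific: GCC and Clang warn about multicharacter literals by default.
#pragma GCC diagnostic ignored "-Wmultichar"
#endif

// In C++, a single-character narrow literal has type char...
static_assert(std::is_same<decltype('a'), char>::value, "'a' is a char in C++");

// ...while a multicharacter literal, where supported, has type int.
static_assert(std::is_same<decltype('ab'), int>::value, "'ab' is an int");

const std::size_t single_char_size = sizeof('a');   // 1 in C++, sizeof(int) in C
const std::size_t multi_char_size  = sizeof('ab');  // sizeof(int)
```

The value of 'ab' is deliberately not checked here, since it is implementation defined.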

Universal character literals

[2.14.3/2] A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed. A char16_t literal containing multiple c-chars is ill-formed. A character literal that begins with the letter U, such as U'z', is a character literal of type char32_t. The value of a char32_t literal containing a single c-char is equal to its ISO 10646 code point value. A char32_t literal containing multiple c-chars is ill-formed. [...]

The integral value of these character literals is defined by ISO 10646, the universal character set. While the standard is somewhat vague with respect to encoding, this is irrelevant for character literals, as the respective UCS and UTF encodings represent every valid character literal with the same bit pattern:

  • A char16_t literal can only hold characters in the basic multilingual plane. These are the characters representable by both UCS-2 and UTF-16 encodings, which have the same bit representation.

  • A char32_t literal can hold any character in the universal character set. The UCS-4 and UTF-32 encodings are identical, as they use a direct bit representation for code points.
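The values of these literals can be verified against their code points. The sketch uses universal-character-names so that it does not depend on the encoding of the source file; in_basic_multilingual_plane is a hypothetical helper.

```cpp
// The value of a char16_t or char32_t literal is its ISO 10646 code point.
static_assert(u'A' == 0x0041, "U+0041 LATIN CAPITAL LETTER A");
static_assert(u'\u00e9' == 0x00E9, "U+00E9 fits in a single 16-bit code unit");
static_assert(U'\u00e9' == 0x00E9, "same code point, same value as char32_t");
static_assert(U'\U0001F600' == 0x1F600, "char32_t can hold code points beyond the BMP");

// u'\U0001F600' would be ill-formed: the code point needs more than 16 bits.

// Hypothetical helper: true if the code point can be written as a char16_t literal.
constexpr bool in_basic_multilingual_plane(char32_t code_point) {
    return code_point <= 0xFFFF;
}
```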

For string literals, the standard explicitly mentions UTF-8, but it only implicitly specifies a UTF-16 encoding for char16_t string literals.

  • A UTF-8 string literal is, unsurprisingly, encoded in UTF-8. There is no need for a separate char8_t; a char suffices, as it must have at least 8 bits.

    [2.14.5/7] A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.

  • A char16_t string literal may contain surrogate pairs, which is the way in which UTF-16 represents characters outside the basic multilingual plane —those that cannot be encoded in UCS-2—.

    [2.14.5/9] A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.

  • A char32_t string literal may be encoded in either UCS-4 or UTF-32, as those encodings are identical.

    [2.14.5/10] A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
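The effect of each encoding shows up in the array sizes of the literals. A minimal sketch, where code_units is a hypothetical helper returning the number of array elements, terminating null included:

```cpp
#include <cstddef>

// Hypothetical helper: number of elements of a string literal's array type.
template <typename Char, std::size_t N>
constexpr std::size_t code_units(const Char (&)[N]) {
    return N;
}

// U+00E9 takes two code units in UTF-8 (0xC3 0xA9), one in UTF-16 and UTF-32.
static_assert(code_units(u8"\u00e9") == 3, "2 UTF-8 code units + null");
static_assert(code_units(u"\u00e9") == 2, "1 UTF-16 code unit + null");
static_assert(code_units(U"\u00e9") == 2, "1 UTF-32 code unit + null");

// U+1F600 lies outside the basic multilingual plane: a surrogate pair in UTF-16.
static_assert(code_units(u"\U0001F600") == 3, "surrogate pair + null");
static_assert(code_units(U"\U0001F600") == 2, "1 UTF-32 code unit + null");
```

For U+1F600 the two char16_t elements are the surrogates 0xD83D and 0xDE00, which is the single c-char producing more than one char16_t that [2.14.5/9] refers to.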

Don't be fooled by the _t suffix in their names; while they are typedefs in C, they are full-blown types in C++.

[3.9.1/5] [...] Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
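Being distinct types means they take part in overload resolution independently of their underlying types. The sketch below illustrates this; describe is a made-up overload set.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Same size and signedness as the underlying types...
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t), "same size");
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "same size");
static_assert(std::is_unsigned<char16_t>::value && std::is_unsigned<char32_t>::value,
              "same signedness as the unsigned underlying types");

// ...yet distinct from them.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value, "distinct types");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value, "distinct types");

// Hypothetical overload set: legal only because the types are distinct.
std::string describe(char16_t)            { return "char16_t"; }
std::string describe(std::uint_least16_t) { return "uint_least16_t"; }
```

If char16_t were a typedef, as in C, the two describe functions would be redefinitions of each other.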

Wide character literals

[2.14.3/2] [...] A character literal that begins with the letter L, such as L'x', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. [Note: The type wchar_t is able to represent all members of the execution wide-character set (see 3.9.1). —end note]. The value of a wide-character literal containing multiple c-chars is implementation-defined.

The integral value of a wide character literal is implementation defined, as are the basic wide-character set and basic execution wide-character set. They offer the same guarantees as their narrow counterparts, namely a null wide character with value 0 and consecutive decimal digits.

Like char, whether wchar_t is signed or unsigned is also implementation defined. However, there are no unsigned and signed variants of wchar_t nor a need for them, as one can use the underlying type for integral arithmetic.

As with char16_t and char32_t, the _t suffix is there because wchar_t is a typedef in C; in C++ it is a fundamental type.

[3.9.1/5] Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying type. [...]
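The properties of wchar_t discussed above can be sketched in a few compile-time checks, assuming a conforming C++11 implementation. The names unsigned_wide, wchar_is_signed, and wide_digit_value are hypothetical.

```cpp
#include <type_traits>

// A wide-character literal has type wchar_t, which is an integral type.
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "L'x' is a wchar_t");
static_assert(std::is_integral<wchar_t>::value, "wchar_t is an integral type");

// Same guarantees as the narrow set: null is 0, digits are consecutive.
static_assert(L'\0' == 0, "the null wide character has value 0");
static_assert(L'9' - L'0' == 9, "wide decimal digits are consecutive");

// There is no "unsigned wchar_t", but an unsigned integer type of the same
// size is available for arithmetic.
typedef std::make_unsigned<wchar_t>::type unsigned_wide;
static_assert(sizeof(unsigned_wide) == sizeof(wchar_t), "same size as wchar_t");

// Implementation-defined answer.
constexpr bool wchar_is_signed() {
    return std::is_signed<wchar_t>::value;
}

// Hypothetical helper: numeric value of a wide decimal digit.
int wide_digit_value(wchar_t c) {
    return static_cast<int>(c - L'0');
}
```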

Summary

There are two types of narrow character literals, char and int; two types of universal character literals, char16_t and char32_t; and one type of wide character literal, wchar_t.

  • The signedness of char is implementation defined. It is either signed or unsigned, but it is nevertheless a type distinct from signed char and unsigned char.
  • Decimal digits have consecutive integral values in the basic execution character set and the basic execution wide-character set.
  • A multicharacter literal is conditionally supported and has an implementation defined value.
  • The _t suffix in char16_t, char32_t, and wchar_t reflects those names being typedefs in C; in C++ they are fundamental types.
  • The size, signedness, and alignment of char16_t and char32_t are the same as those of uint_least16_t and uint_least32_t, respectively.
  • The signedness of wchar_t is implementation defined.
