| 1. | An implementation shall support input files
that are a sequence of UTF-8 code units (UTF-8 files). It may also support
an implementation-defined set of other kinds of input files, and,
if so, the kind of an input file is determined in
an implementation-defined manner
that includes a means of designating input files as UTF-8 files,
independent of their content.
If an input file is determined to be a UTF-8 file,
then it shall be a well-formed UTF-8 code unit sequence and
it is decoded to produce a sequence of Unicode
scalar values. A sequence of translation character set elements ([lex.charset]) is then formed
by mapping each Unicode scalar value
to the corresponding translation character set element. In the resulting sequence,
each pair of characters in the input sequence consisting of
U+000d carriage return followed by U+000a line feed,
as well as each
U+000d carriage return not immediately followed by a U+000a line feed,
is replaced by a single new-line character. For any other kind of input file supported by the implementation,
characters are mapped, in an
implementation-defined manner,
to a sequence of translation character set elements,
representing end-of-line indicators as new-line characters. |
| 2. | Each sequence comprising a backslash character (\)
immediately followed by
zero or more whitespace characters other than new-line followed by
a new-line character is deleted, splicing
physical source lines to form logical source lines. Only the last
backslash on any physical source line is eligible for being part
of such a splice. [Note 2: — end note]
A source file that is not empty and that (after splicing)
does not end in a new-line character
is processed as if an additional new-line character were appended
to the file. |
| 3. | The source file is decomposed into preprocessing
tokens ([lex.pptoken]) and sequences of whitespace characters
(including comments). New-line characters are
retained. Whether each nonempty sequence of whitespace characters other
than new-line is retained or replaced by one U+0020 space character is
unspecified. As characters from the source file are consumed
to form the next preprocessing token
(i.e., not being consumed as part of a comment or other forms of whitespace),
except when matching a
c-char-sequence,
s-char-sequence,
r-char-sequence,
h-char-sequence, or
q-char-sequence,
universal-character-names are recognized ([lex.universal.char]) and
replaced by the designated element of the translation character set ([lex.charset]). The process of dividing a source file's
characters into preprocessing tokens is context-dependent. [Example 1: — end example] |
| 4. | Preprocessing directives ([cpp]) are executed, macro invocations are
expanded ([cpp.replace]), and _Pragma unary operator expressions are executed ([cpp.pragma.op]). A #include preprocessing directive ([cpp.include]) causes the named header or
source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted. Whitespace characters separating preprocessing tokens are no longer significant. |
| 5. | For a sequence of two or more adjacent string-literal preprocessing tokens,
a common encoding-prefix is determined
as specified in [lex.string]. Each such string-literal preprocessing token is then considered to have
that common encoding-prefix. |
| 6. | Each preprocessing token is converted into a token ([lex.token]). |
| 7. | The tokens constitute a translation unit and
are syntactically and
semantically analyzed as a translation-unit ([basic.link]) and
translated. [Note 3: The process of analyzing and translating the tokens can occasionally
result in one token being replaced by a sequence of other
tokens ([temp.names]). — end note]
It is
implementation-defined
whether the sources for
module units and header units
on which the current translation unit has an interface
dependency ([module.unit], [module.import])
are required to be available. [Note 4: Source files, translation
units and translated translation units need not necessarily be stored as
files, nor need there be any one-to-one correspondence between these
entities and any external representation. The description is conceptual
only, and does not specify any particular implementation. — end note]
[Note 5: Previously translated translation units can be preserved individually or in libraries. The separate translation units of a program communicate ([basic.link]) by (for example)
calls to functions whose names have external or module linkage,
manipulation of variables whose names have external or module linkage, or
manipulation of data files. — end note]
While the tokens constituting translation units
are being analyzed and translated,
required instantiations are performed. [Note 6: This can include
instantiations which have been explicitly
requested ([temp.explicit]). — end note]
The contexts from which instantiations may be performed
are determined by their respective points of instantiation ([temp.point]). [Note 7: Other requirements in this document can further constrain
the context from which an instantiation can be performed. For example, a constexpr function template specialization
might have a point of instantiation at the end of a translation unit,
but its use in certain constant expressions could require
that it be instantiated at an earlier point ([temp.inst]). — end note]
Each instantiation results in new program constructs. The program is ill-formed if any instantiation fails. During the analysis and translation of tokens,
certain expressions are evaluated ([expr.const]). Constructs appearing at a program point P are analyzed
in a context where each side effect of evaluating an expression E
as a full-expression is complete if and only if
[Example 2: class S {
  class Incomplete;
  class Inner {
    void fn() {
      /* */ Incomplete i; // OK
    }
  }; /* */
  consteval {
    define_aggregate(^^Incomplete, {});
  }
}; /* */
— end example] |
| 8. | Translated translation units are combined, and
all external entity references are resolved ([basic.link]). Library
components are linked to satisfy external references to
entities not defined in the current translation. All such translator
output is collected into a program image which contains information
needed for execution in its execution environment. |
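The new-line handling of phase 1 and the line splicing of phase 2 can be sketched as two small string transformations. This is a minimal illustration, not how any particular implementation works: the names `normalize_newlines`, `is_horizontal_space`, `splice_lines`, and `check_phase_sketch` are hypothetical, and the sketch operates on plain `char` values of an already decoded input rather than on translation character set elements.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Phase 1 (part): replace each CR LF pair, and each CR not followed by LF,
// with a single new-line. Hypothetical helper, assumes decoded input.
std::string normalize_newlines(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r') {
            if (i + 1 < in.size() && in[i + 1] == '\n') ++i;  // skip the LF of a CR LF pair
            out.push_back('\n');
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Whitespace characters other than new-line.
bool is_horizontal_space(char c) {
    return c == ' ' || c == '\t' || c == '\v' || c == '\f';
}

// Phase 2: delete "backslash, optional whitespace, new-line" sequences, then
// append a new-line if a nonempty file does not end in one after splicing.
std::string splice_lines(std::string_view in) {
    std::string out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        const std::size_t nl = in.find('\n', pos);
        const bool has_nl = nl != std::string_view::npos;
        const std::size_t end = has_nl ? nl : in.size();
        std::size_t last = end;  // one past the last non-whitespace character
        while (last > pos && is_horizontal_space(in[last - 1])) --last;
        if (has_nl && last > pos && in[last - 1] == '\\') {
            // Only the last backslash on the physical line takes part in the splice.
            out.append(in.substr(pos, last - 1 - pos));
        } else {
            out.append(in.substr(pos, end - pos));
            if (has_nl) out.push_back('\n');
        }
        pos = has_nl ? nl + 1 : in.size();
    }
    if (!in.empty() && (out.empty() || out.back() != '\n')) out.push_back('\n');
    return out;
}

// Small self-check showing the intended use of the two helpers.
void check_phase_sketch() {
    assert(normalize_newlines("a\r\nb\rc\n") == "a\nb\nc\n");
    assert(splice_lines("int x = \\\n1;\n") == "int x = 1;\n");
}
```

Running `normalize_newlines` before `splice_lines` mirrors the order of the phases: a backslash followed by CR LF is only recognized as a splice once the pair has become a single new-line.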
character | glyph | |
U+0009 | character tabulation | |
U+000b | line tabulation | |
U+000c | form feed | |
U+0020 | space | |
U+000a | line feed | new-line |
U+0021 | exclamation mark | ! |
U+0022 | quotation mark | " |
U+0023 | number sign | # |
U+0024 | dollar sign | $ |
U+0025 | percent sign | % |
U+0026 | ampersand | & |
U+0027 | apostrophe | ' |
U+0028 | left parenthesis | ( |
U+0029 | right parenthesis | ) |
U+002a | asterisk | * |
U+002b | plus sign | + |
U+002c | comma | , |
U+002d | hyphen-minus | - |
U+002e | full stop | . |
U+002f | solidus | / |
U+0030 .. U+0039 | digit zero .. nine | 0 1 2 3 4 5 6 7 8 9 |
U+003a | colon | : |
U+003b | semicolon | ; |
U+003c | less-than sign | < |
U+003d | equals sign | = |
U+003e | greater-than sign | > |
U+003f | question mark | ? |
U+0040 | commercial at | @ |
U+0041 .. U+005a | latin capital letter a .. z | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
U+005b | left square bracket | [ |
U+005c | reverse solidus | \ |
U+005d | right square bracket | ] |
U+005e | circumflex accent | ^ |
U+005f | low line | _ |
U+0060 | grave accent | ` |
U+0061 .. U+007a | latin small letter a .. z | a b c d e f g h i j k l m n o p q r s t u v w x y z |
U+007b | left curly bracket | { |
U+007c | vertical line | | |
U+007d | right curly bracket | } |
U+007e | tilde | ~ |
alignas | constinit | extern | protected | throw |
alignof | const_cast | false | public | true |
asm | continue | float | register | try |
auto | contract_assert | for | reinterpret_cast | typedef |
bool | co_await | friend | requires | typeid |
break | co_return | goto | return | typename |
case | co_yield | if | short | union |
catch | decltype | inline | signed | unsigned |
char | default | int | sizeof | using |
char8_t | delete | long | static | virtual |
char16_t | do | mutable | static_assert | void |
char32_t | double | namespace | static_cast | volatile |
class | dynamic_cast | new | struct | wchar_t |
concept | else | noexcept | switch | while |
const | enum | nullptr | template | |
consteval | explicit | operator | this | |
constexpr | export | private | thread_local |
and | and_eq | bitand | bitor | compl | not |
not_eq | or | or_eq | xor | xor_eq |
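As an illustration of the alternative tokens listed above, the following sketch pairs several of them with their primary spellings. The function names `in_range`, `clear_bits`, and `toggle` are hypothetical and exist only for this example. It assumes a compiler in a conforming mode, since some compilers accept these spellings only with a strict-conformance option.

```cpp
// Each function uses alternative tokens; the comments give the primary spelling.
constexpr bool in_range(int v, int lo, int hi) {
    return v >= lo and v <= hi;        // and    is  &&
}

constexpr unsigned clear_bits(unsigned value, unsigned mask) {
    return value bitand compl mask;    // bitand is  & , compl is ~
}

constexpr unsigned toggle(unsigned value, unsigned mask) {
    value xor_eq mask;                 // xor_eq is  ^=
    return value;
}

static_assert(in_range(5, 1, 10));
static_assert(not in_range(11, 1, 10));                  // not is !
static_assert(clear_bits(0b1111u, 0b0101u) == 0b1010u);
static_assert(toggle(0b1100u, 0b1010u) == 0b0110u);
```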
integer-suffix | decimal-literal | integer-literal other than decimal-literal |
none | int, long int, long long int | int, unsigned int, long int, unsigned long int, long long int, unsigned long long int |
u or U | unsigned int, unsigned long int, unsigned long long int | unsigned int, unsigned long int, unsigned long long int |
l or L | long int, long long int | long int, unsigned long int, long long int, unsigned long long int |
Both u or U and l or L | unsigned long int, unsigned long long int | unsigned long int, unsigned long long int |
ll or LL | long long int | long long int, unsigned long long int |
Both u or U and ll or LL | unsigned long long int | unsigned long long int |
z or Z | the signed integer type corresponding to std::size_t ([support.types.layout]) | the signed integer type corresponding to std::size_t, std::size_t |
Both u or U and z or Z | std::size_t | std::size_t |
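The table can be checked at compile time: the type of a literal is the first type in the relevant list in which its value fits. The sketch below assumes C++17 or later (for `std::is_same_v`). The assertions that depend on the width of `int` are guarded by a check on `INT_MAX`, and the `z` suffix checks are guarded by the `__cpp_size_t_suffix` feature-test macro because older compilers do not support that suffix.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>

// A value that fits in int selects the first type in each suffix's list.
static_assert(std::is_same_v<decltype(1), int>);
static_assert(std::is_same_v<decltype(1u), unsigned int>);
static_assert(std::is_same_v<decltype(1L), long int>);
static_assert(std::is_same_v<decltype(1uL), unsigned long int>);
static_assert(std::is_same_v<decltype(1LL), long long int>);
static_assert(std::is_same_v<decltype(1uLL), unsigned long long int>);

#if INT_MAX == 2147483647
// With a 32-bit int, an unsuffixed decimal literal never becomes unsigned,
// while a hexadecimal literal with the same value can.
static_assert(std::is_signed_v<decltype(2147483648)>);
static_assert(std::is_same_v<decltype(0x80000000), unsigned int>);
#endif

#if defined(__cpp_size_t_suffix)
// z selects the signed counterpart of std::size_t, uz selects std::size_t.
static_assert(std::is_same_v<decltype(1z), std::make_signed_t<std::size_t>>);
static_assert(std::is_same_v<decltype(1uz), std::size_t>);
#endif
```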
Encoding prefix | Kind | Type | Associated character encoding | Example |
none | ordinary character literal | char | ordinary literal encoding | 'v' |
none | multicharacter literal | int | ordinary literal encoding | 'abcd' |
L | wide character literal | wchar_t | wide literal encoding | L'w' |
u8 | UTF-8 character literal | char8_t | UTF-8 | u8'x' |
u | UTF-16 character literal | char16_t | UTF-16 | u'y' |
U | UTF-32 character literal | char32_t | UTF-32 | U'z' |
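The type column of the table can be confirmed with `decltype`. This is a small sketch assuming C++17 or later. The `char8_t` checks are guarded by the `__cpp_char8_t` feature-test macro, because before C++20 a `u8` character literal has type `char`. The multicharacter literal row is left out because such literals are conditionally supported and commonly produce a warning.

```cpp
#include <type_traits>

static_assert(std::is_same_v<decltype('v'), char>);       // ordinary character literal
static_assert(std::is_same_v<decltype(L'w'), wchar_t>);   // wide character literal
static_assert(std::is_same_v<decltype(u'y'), char16_t>);  // UTF-16 character literal
static_assert(std::is_same_v<decltype(U'z'), char32_t>);  // UTF-32 character literal

#if defined(__cpp_char8_t)
static_assert(std::is_same_v<decltype(u8'x'), char8_t>);  // UTF-8 character literal
static_assert(u8'x' == 0x78);                             // UTF-8 code unit for U+0078
#endif

// For the UTF encodings, the value is the code unit for the character.
static_assert(u'y' == 0x0079);
static_assert(U'z' == 0x0000007a);
```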
character | | simple-escape-sequence |
U+000a | line feed | \n |
U+0009 | character tabulation | \t |
U+000b | line tabulation | \v |
U+0008 | backspace | \b |
U+000d | carriage return | \r |
U+000c | form feed | \f |
U+0007 | alert | \a |
U+005c | reverse solidus | \\ |
U+003f | question mark | \? |
U+0027 | apostrophe | \' |
U+0022 | quotation mark | \" |
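Each row of the table can be verified with a UTF-32 character literal, whose value is the Unicode code point itself and therefore does not depend on the ordinary literal encoding. A minimal sketch, assuming C++17 or later for `static_assert` without a message:

```cpp
// Each simple-escape-sequence denotes the character in the first column.
static_assert(U'\n' == 0x000a);  // line feed
static_assert(U'\t' == 0x0009);  // character tabulation
static_assert(U'\v' == 0x000b);  // line tabulation
static_assert(U'\b' == 0x0008);  // backspace
static_assert(U'\r' == 0x000d);  // carriage return
static_assert(U'\f' == 0x000c);  // form feed
static_assert(U'\a' == 0x0007);  // alert
static_assert(U'\\' == 0x005c);  // reverse solidus
static_assert(U'\?' == 0x003f);  // question mark
static_assert(U'\'' == 0x0027);  // apostrophe
static_assert(U'\"' == 0x0022);  // quotation mark

// In a string literal each escape sequence contributes a single element:
// 'a', tab, 'b', line feed, and the terminating null character.
static_assert(sizeof("a\tb\n") == 5);
```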
floating-point-suffix | type |
none | double |
f or F | float |
l or L | long double |
f16 or F16 | std::float16_t |
f32 or F32 | std::float32_t |
f64 or F64 | std::float64_t |
f128 or F128 | std::float128_t |
bf16 or BF16 | std::bfloat16_t |
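The suffix-to-type mapping for the standard floating-point types can be checked directly. The sketch assumes C++17 or later. The extended floating-point check is an optional extra that is compiled only when the implementation defines `__STDCPP_FLOAT32_T__` and provides `<stdfloat>`, and it is skipped everywhere else.

```cpp
#include <type_traits>

static_assert(std::is_same_v<decltype(1.0), double>);        // no suffix
static_assert(std::is_same_v<decltype(1.0f), float>);        // f
static_assert(std::is_same_v<decltype(1.0F), float>);        // F
static_assert(std::is_same_v<decltype(1.0l), long double>);  // l
static_assert(std::is_same_v<decltype(1e3L), long double>);  // L

// A hexadecimal floating-point literal follows the same suffix rules.
static_assert(std::is_same_v<decltype(0x1p-2), double>);

#if defined(__STDCPP_FLOAT32_T__) && defined(__has_include)
#if __has_include(<stdfloat>)
#include <stdfloat>
// Extended floating-point type, available only on some implementations.
static_assert(std::is_same_v<decltype(1.0f32), std::float32_t>);
#endif
#endif
```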
Encoding prefix | Kind | Type | Associated character encoding | Examples |
none | ordinary string literal | array of n const char | ordinary literal encoding | "ordinary string" R"(ordinary raw string)" |
L | wide string literal | array of n const wchar_t | wide literal encoding | L"wide string" LR"w(wide raw string)w" |
u8 | UTF-8 string literal | array of n const char8_t | UTF-8 | u8"UTF-8 string" u8R"x(UTF-8 raw string)x" |
u | UTF-16 string literal | array of n const char16_t | UTF-16 | u"UTF-16 string" uR"y(UTF-16 raw string)y" |
U | UTF-32 string literal | array of n const char32_t | UTF-32 | U"UTF-32 string" UR"z(UTF-32 raw string)z" |
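The array types in the table, the behavior of raw strings, and the common encoding-prefix determined in translation phase 5 can all be observed at compile time. A minimal sketch, assuming C++17 or later. Because a string literal is an lvalue, `decltype` yields a reference to the array type. The `char8_t` check is guarded by the `__cpp_char8_t` feature-test macro.

```cpp
#include <string_view>
#include <type_traits>

// n counts the terminating null character: "abc" has n == 4.
static_assert(std::is_same_v<decltype("abc"), const char (&)[4]>);
static_assert(std::is_same_v<decltype(L"abc"), const wchar_t (&)[4]>);
static_assert(std::is_same_v<decltype(u"abc"), const char16_t (&)[4]>);
static_assert(std::is_same_v<decltype(U"abc"), const char32_t (&)[4]>);
#if defined(__cpp_char8_t)
static_assert(std::is_same_v<decltype(u8"abc"), const char8_t (&)[4]>);
#endif

// In a raw string a backslash is an ordinary character: 'a', '\', 'n', null.
static_assert(sizeof(R"(a\n)") == 4);
// A delimiter (here x) lets the raw string contain quotation marks freely.
static_assert(std::string_view(R"x(say "hi")x") == "say \"hi\"");

// Adjacent string literals share a common encoding-prefix, so the
// unprefixed "cd" is treated as a UTF-16 string literal as well.
static_assert(std::is_same_v<decltype(u"ab" "cd"), const char16_t (&)[5]>);
static_assert(std::u16string_view(u"ab" "cd") == u"abcd");
```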