Commit d9bd482

[lex] Better specify whitespace characters
This commit defines a grammar term for _whitespace-character_ and uses it consistently wherever the plain-text term "whitespace character" is used. A whitespace character is defined as one of the five characters mentioned in the text that comes closest to providing a definition. The Unicode character name is (mostly) consistently used to name these characters and, for consistency, similar changes were made throughout [lex] to name Unicode characters rather than render the specified characters in code font. The one exception is backslash, which is retained as-is to avoid creating more issues for P2348. Note that this commit is not a replacement for P2348, merely a clearer statement of the existing specification without any normative changes.
1 parent a95c5ef commit d9bd482

2 files changed

+62
-33
lines changed

source/lex.tex

+58 -29
@@ -110,9 +110,9 @@
 \indextext{line splicing}%
 If the first translation character is \unicode{feff}{byte order mark},
 it is deleted.
-Each sequence of a backslash character (\textbackslash)
+Each sequence of a \unicode{005c}{reverse solidus} character (\tcode{\textbackslash})
 immediately followed by
-zero or more whitespace characters other than new-line followed by
+zero or more \grammarterm{whitespace-character}s followed by
 a new-line character is deleted, splicing
 physical source lines to form \defnx{logical source lines}{source line!logical}. Only the last
 backslash on any physical source line shall be eligible for being part
@@ -126,9 +126,13 @@
 shall be processed as if an additional new-line character were appended
 to the file.

-\item The source file is decomposed into preprocessing
-tokens\iref{lex.pptoken} and sequences of whitespace characters
-(including comments). A source file shall not end in a partial
+\item
+\indextext{whitespace}%
+\indextext{comment}%
+\indextext{token!preprocessing}%
+The source file is decomposed into preprocessing tokens\iref{lex.pptoken} and
+whitespace\iref{lex.whitespace}.
+A source file shall not end in a partial
 preprocessing token or in a partial comment.
 \begin{footnote}
 A partial preprocessing
@@ -140,10 +144,10 @@
 would arise from a source file ending with an unclosed \tcode{/*}
 comment.
 \end{footnote}
-Each comment\iref{lex.comment} is replaced by one space character. New-line characters are
-retained. Whether each nonempty sequence of whitespace characters other
-than new-line is retained or replaced by one space character is
-unspecified.
+Each comment\iref{lex.comment} is replaced by one \unicode{0020}{space} character.
+New-line characters are retained.
+Whether each nonempty sequence of \grammarterm{whitespace-character}s is
+retained or replaced by one \unicode{0020}{space} character is unspecified.
 As characters from the source file are consumed
 to form the next preprocessing token
 (i.e., not being consumed as part of a comment or other forms of whitespace),
@@ -181,7 +185,7 @@

 \item
 Each preprocessing token is converted into a token\iref{lex.token}.
-Whitespace characters separating tokens are no longer significant.
+Whitespace separating tokens is no longer significant.
 The resulting tokens constitute a \defn{translation unit} and
 are syntactically and
 semantically analyzed as a \grammarterm{translation-unit}\iref{basic.link} and
@@ -467,7 +471,34 @@
 None of these names or aliases have leading or trailing spaces.
 \end{note}

-\rSec1[lex.comment]{Comments}
+\rSec1[lex.whitespace]{Whitespace}
+\indextext{whitespace|(}%
+
+\rSec2[lex.whitespace.general]{General}
+
+\indextext{character!whitespace|(}%
+\begin{bnf}
+\nontermdef{whitespace-character}\br
+\unicode{0009}{character tabulation}\br
+\unicode{000b}{line tabulation}\br
+\unicode{000c}{form feed}\br
+\unicode{0020}{space}\br
+\end{bnf}
+
+\pnum
+Sequences of \grammarterm{whitespace-character}s, new-line characters, and
+comments\iref{lex.comment} form \defn{whitespace}, which carries no
+semantic significance other than to separate tokens\iref{lex.token}
+and preprocessing tokens\iref{lex.pptoken}.
+
+\pnum
+\begin{note}
+Implementations are permitted but not required to coalesce non-empty
+sequences of whitespace into a single \unicode{0020}{space}
+while retaining new-lines\iref{lex.phases}.
+\end{note}
+
+\rSec2[lex.comment]{Comments}

 \pnum
 \indextext{comment|(}%
@@ -477,8 +508,8 @@
 characters \tcode{*/}. These comments do not nest.
 \indextext{comment!\tcode{//}}%
 The characters \tcode{//} start a comment, which terminates immediately before the
-next new-line character. If there is a form-feed or a vertical-tab
-character in such a comment, only whitespace characters shall appear
+next new-line character. If there is a \unicode{000b}{line tabulation} or a \unicode{000c}{form feed}
+character in such a comment, only \grammarterm{whitespace-character}s shall appear
 between it and the new-line that terminates the comment; no diagnostic
 is required.
 \begin{note}
@@ -488,7 +519,14 @@
 characters \tcode{//} and \tcode{/*} have no special meaning within a
 \tcode{/*} comment.
 \end{note}
+
+\pnum
+\begin{note}
+Comments are turned into \unicode{0020}{space} characters in phase 3 of translation
+as part of decomposing a source file into preprocessor tokens and whitespace.
+\end{note}
 \indextext{comment|)}
+\indextext{whitespace|)}%

 \rSec1[lex.pptoken]{Preprocessing tokens}

@@ -506,7 +544,7 @@
 string-literal\br
 user-defined-string-literal\br
 preprocessing-op-or-punc\br
-\textnormal{each non-whitespace character that cannot be one of the above}
+\textnormal{each non-\grammarterm{whitespace-character} that cannot be one of the above}
 \end{bnf}

 \pnum
@@ -520,22 +558,15 @@
 (\grammarterm{import-keyword}, \grammarterm{module-keyword}, and \grammarterm{export-keyword}),
 identifiers, preprocessing numbers, character literals (including user-defined character
 literals), string literals (including user-defined string literals), preprocessing
-operators and punctuators, and single non-whitespace characters that do not lexically
+operators and punctuators,
+and single non-\grammarterm{whitespace-character}s that do not lexically
 match the other preprocessing token categories.
 If a \unicode{0027}{apostrophe} or a \unicode{0022}{quotation mark} character
 matches the last category, the program is ill-formed.
 If any character not in the basic character set matches the last category,
 the program is ill-formed.
-Preprocessing tokens can be separated by
 \indextext{whitespace}%
-whitespace;
-\indextext{comment}%
-this consists of comments\iref{lex.comment}, or whitespace characters
-(\unicode{0020}{space},
-\unicode{0009}{character tabulation},
-new-line,
-\unicode{000b}{line tabulation}, and
-\unicode{000c}{form feed}), or both.
+Preprocessing tokens can be separated by whitespace\iref{lex.whitespace}.
 As described in \ref{cpp}, in certain
 circumstances during translation phase 4, whitespace (or the absence
 thereof) serves as more than preprocessing token separation. Whitespace
@@ -824,9 +855,7 @@
 \end{footnote}
 operators, and other separators.
 \indextext{whitespace}%
-Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments
-(collectively, ``whitespace''), as described below, are ignored except
-as they serve to separate tokens.
+Whitespace\iref{lex.whitespace} is ignored except to separate tokens.
 \begin{note}
 Whitespace can separate otherwise adjacent identifiers, keywords, numeric
 literals, and alternative tokens containing alphabetic characters.
@@ -1786,8 +1815,8 @@
 \begin{bnf}
 \nontermdef{d-char}\br
 \textnormal{any member of the basic character set except:}\br
-\bnfindent\textnormal{\unicode{0020}{space}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis}, \unicode{005c}{reverse solidus},}\br
-\bnfindent\textnormal{\unicode{0009}{character tabulation}, \unicode{000b}{line tabulation}, \unicode{000c}{form feed}, and new-line}
+\bnfindent\textnormal{a \grammarterm{whitespace-character}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis},}\br
+\bnfindent\textnormal{\unicode{005c}{reverse solidus}, and new-line}
 \end{bnf}

 \pnum
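
As an illustration only, the line-splicing rule as reworded in the first hunk of source/lex.tex can be sketched in C++. The names `is_whitespace_character` and `splice_lines` are hypothetical and appear nowhere in the standard or in this commit. The predicate covers only the four code points listed in the new \nontermdef{whitespace-character} grammar, with new-line handled separately as the revised wording does. The sketch assumes a plain `char` source buffer and ignores every other translation-phase detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical predicate mirroring the whitespace-character grammar in the
// diff: four code points, with new-line deliberately excluded.
constexpr bool is_whitespace_character(char c) {
    return c == '\t'   // U+0009 character tabulation
        || c == '\v'   // U+000B line tabulation
        || c == '\f'   // U+000C form feed
        || c == ' ';   // U+0020 space
}

static_assert(is_whitespace_character(' '));
static_assert(!is_whitespace_character('\n'));

// Hypothetical helper sketching the splicing rule: a backslash immediately
// followed by zero or more whitespace-characters and then a new-line is
// deleted, joining two physical lines into one logical line.
std::string splice_lines(std::string_view src) {
    std::string out;
    out.reserve(src.size());
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '\\') {
            // Skip any whitespace-characters that follow the backslash.
            std::size_t j = i + 1;
            while (j < src.size() && is_whitespace_character(src[j])) {
                ++j;
            }
            // Only splice when the run ends in a new-line.
            if (j < src.size() && src[j] == '\n') {
                i = j + 1;  // drop backslash, whitespace run, and new-line
                continue;
            }
        }
        out.push_back(src[i]);
        ++i;
    }
    return out;
}
```

Because only whitespace-characters may sit between the backslash and the new-line, any backslash that triggers a splice is necessarily the last one on its physical line. The single forward pass also means a backslash exposed by an earlier splice is not spliced again.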

source/preprocessor.tex

+4 -4
@@ -289,12 +289,12 @@
 the directive name and the following new-line character.

 \pnum
-The only whitespace characters that shall appear
+The only \grammarterm{whitespace-character}s that shall appear
 between preprocessing tokens
 within a preprocessing directive
 (from just after the directive-introducing token
 through just before the terminating new-line character)
-are space and horizontal-tab
+are \unicode{0020}{space} and \unicode{0009}{character tabulation}
 (including spaces that have replaced comments
 or possibly other whitespace characters
 in translation phase 3).
@@ -1496,7 +1496,7 @@
 \indextext{name!macro|see{macro, name}}%
 \defnx{macro name}{macro!name}.
 There is one name space for macro names.
-Any whitespace characters preceding or following the
+Any \grammarterm{whitespace-character}s preceding or following the
 replacement list of preprocessing tokens are not considered
 part of the replacement list for either form of macro.

@@ -1573,7 +1573,7 @@
 right parenthesis preprocessing tokens.
 Within the sequence of preprocessing tokens making up an invocation
 of a function-like macro,
-new-line is considered a normal whitespace character.
+new-line is considered a {whitespace-character}.

 \pnum
 \indextext{macro!function-like!arguments}%
