1 change: 1 addition & 0 deletions src/SUMMARY.md
@@ -7,6 +7,7 @@
- [Lexical structure](lexical-structure.md)
- [Input format](input-format.md)
- [Shebang](shebang.md)
- [Frontmatter](frontmatter.md)
- [Keywords](keywords.md)
- [Identifiers](identifiers.md)
- [Comments](comments.md)
66 changes: 66 additions & 0 deletions src/frontmatter.md
@@ -0,0 +1,66 @@
r[frontmatter]
# Frontmatter

r[frontmatter.intro]
Frontmatter is an optional section of metadata whose syntax allows external tools to read it without parsing Rust.

> [!EXAMPLE]
> <!-- ignore: test runner doesn't support frontmatter -->
> ```rust,ignore
> #!/bin/env cargo
> --- cargo
> package.edition = 2024
> ---
>
> fn main() {}
> ```

r[frontmatter.syntax]
```grammar,lexer
@root FRONTMATTER ->
WHITESPACE_ONLY_LINE*
Contributor

Why is the whitespace considered part of the frontmatter by the lexer (here), while frontmatter.position describes whitespace as preceding (and thus not part of) the frontmatter?

I'm also a bit confused about the combination of this and the FRONTMATTER_INVALID grammar. What is the benefit of covering invalid frontmatter in the grammar specification itself, rather than only in the prose?

!FRONTMATTER_INVALID
FRONTMATTER_MAIN

WHITESPACE_ONLY_LINE -> (!LF WHITESPACE)* LF

FRONTMATTER_INVALID -> (!LF WHITESPACE)+ `---` ^ ⊥

FRONTMATTER_MAIN ->
`-`{n:3..=255} ^ FRONTMATTER_REST

FRONTMATTER_REST ->
FRONTMATTER_FENCE_START
FRONTMATTER_LINE*
FRONTMATTER_FENCE_END

FRONTMATTER_FENCE_START ->
MAYBE_INFOSTRING_OR_WS LF

FRONTMATTER_FENCE_END ->
`-`{n} HORIZONTAL_WHITESPACE* ( LF | EOF )

FRONTMATTER_LINE -> !`-`{n} ~[LF CR]* LF

MAYBE_INFOSTRING_OR_WS ->
HORIZONTAL_WHITESPACE* INFOSTRING? HORIZONTAL_WHITESPACE*

INFOSTRING -> (XID_Start | `_`) ( XID_Continue | `-` | `.` )*
```

r[frontmatter.position]
Frontmatter may appear at the start of the file (after the optional [byte order mark]) or after a [shebang]. In either case, it may be preceded by [whitespace].

r[frontmatter.fence]
Frontmatter must start and end with a *fence*. Each fence must start at the beginning of a line. The opening fence must consist of at least 3 and no more than 255 hyphens (`-`). The closing fence must have exactly the same number of hyphens as the opening fence. The hyphens of either fence may be followed by [horizontal whitespace].

r[frontmatter.infostring]
The opening fence, after optional [horizontal whitespace], may be followed by an infostring that identifies the format or purpose of the body. An infostring may be followed by horizontal whitespace.
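
The fence and infostring rules above can be sketched as a small checker for a single opening-fence line. This is an illustration under stated assumptions, not the compiler's implementation: `parse_opening_fence` is a hypothetical name, the line is assumed to have had its trailing LF removed, and `XID_Start`/`XID_Continue` are approximated with ASCII letters and digits because the standard library has no XID tables.

```rust
/// Hypothetical helper: checks one opening-fence line (trailing LF already removed).
/// Returns the hyphen count and the optional infostring, or `None` if the line
/// is not a valid opening fence.
fn parse_opening_fence(line: &str) -> Option<(usize, Option<&str>)> {
    // The fence must start at the beginning of the line: count leading hyphens.
    let n = line.bytes().take_while(|&b| b == b'-').count();
    if !(3..=255).contains(&n) {
        return None;
    }
    // Horizontal whitespace is only tab and space.
    let is_hws = |c: char| c == '\t' || c == ' ';
    let rest = line[n..].trim_matches(is_hws);
    if rest.is_empty() {
        return Some((n, None));
    }
    // Assumption: ASCII-only approximation of XID_Start and XID_Continue.
    let mut chars = rest.chars();
    let first_ok = chars
        .next()
        .map_or(false, |c| c.is_ascii_alphabetic() || c == '_');
    let rest_ok = chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | '.'));
    if first_ok && rest_ok {
        Some((n, Some(rest)))
    } else {
        None
    }
}
```

Under these assumptions, `--- cargo` yields a hyphen count of 3 with the infostring `cargo`, while `--` and `--- two words` are rejected.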

r[frontmatter.body]
No line in the body may start with a sequence of hyphens (`-`) equal to or longer than the opening fence. The body may not contain carriage returns.
Contributor

I got confused about the "body must not contain carriage returns" rule at first, since I use Windows. Then I remembered input.crlf, so it shouldn't be much of an issue. But can I suggest adding a note that CRLF normalization is applied before the presence of carriage returns is validated? Other multi-line contexts like doc blocks do allow a CR to remain after input.crlf normalization.


[byte order mark]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[horizontal whitespace]: grammar-HORIZONTAL_WHITESPACE
[shebang]: input-format.md#shebang-removal
[whitespace]: whitespace.md
22 changes: 21 additions & 1 deletion src/input-format.md
@@ -44,6 +44,25 @@ r[input.shebang]
r[input.shebang.removal]
If a [shebang] is present, it is removed from the input sequence (and is therefore ignored).

r[input.frontmatter]
## Frontmatter removal

r[input.frontmatter.removal]
If the remaining input begins with a [frontmatter] fence, optionally preceded by lines containing only [whitespace], the [frontmatter] and any preceding whitespace are removed.

For example, given the following file:

<!-- ignore: test runner doesn't support frontmatter -->
```rust,ignore
--- cargo
package.edition = 2024
---

fn main() {}
```

The first three lines (the opening fence, body, and closing fence) would be removed, leaving an empty line followed by `fn main() {}`.
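
The removal step can be sketched as follows. This is a simplified illustration, not the compiler's implementation: `strip_frontmatter` and `is_pattern_white_space` are hypothetical names, the input is assumed to have already gone through byte order mark removal, CRLF normalization, and shebang removal, and malformed frontmatter is returned unchanged where a real implementation would report an error. The infostring and the 255-hyphen limit are not validated here.

```rust
/// Characters with the `Pattern_White_Space` property.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Hypothetical helper: returns the input with leading whitespace-only lines
/// and the frontmatter removed, or the input unchanged if none is found.
fn strip_frontmatter(src: &str) -> &str {
    // Skip lines that contain only whitespace.
    let mut rest = src;
    while let Some((line, tail)) = rest.split_once('\n') {
        if line.chars().all(is_pattern_white_space) {
            rest = tail;
        } else {
            break;
        }
    }
    // The opening fence is at least three hyphens at the start of a line.
    let n = rest.bytes().take_while(|&b| b == b'-').count();
    if n < 3 {
        return src;
    }
    let Some((_opening, mut body)) = rest.split_once('\n') else {
        return src;
    };
    loop {
        let (line, tail) = match body.split_once('\n') {
            Some((l, t)) => (l, Some(t)),
            None => (body, None),
        };
        let hyphens = line.bytes().take_while(|&b| b == b'-').count();
        if hyphens >= n {
            // Such a line must be the closing fence: exactly `n` hyphens,
            // then only horizontal whitespace.
            let closes = hyphens == n && line[n..].chars().all(|c| c == ' ' || c == '\t');
            return if closes { tail.unwrap_or("") } else { src };
        }
        // The body may not contain carriage returns.
        if line.contains('\r') {
            return src;
        }
        match tail {
            Some(t) => body = t,
            None => return src, // reached end of input without a closing fence
        }
    }
}
```

Applied to the example file, this sketch returns the text starting at the empty line that precedes `fn main() {}`.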

r[input.tokenization]
## Tokenization

@@ -54,11 +73,12 @@ The resulting sequence of characters is then converted into tokens as described
>
> - Byte order mark removal.
> - CRLF normalization.
> - Shebang removal when invoked in an item context (as opposed to expression or statement contexts).
> - Shebang and frontmatter removal when invoked in an item context (as opposed to expression or statement contexts).
>
> The [`include_str!`] and [`include_bytes!`] macros do not apply these transformations.

[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[Crates and source files]: crates-and-source-files.md
[frontmatter]: frontmatter.md
[shebang]: shebang.md
[whitespace]: whitespace.md
2 changes: 1 addition & 1 deletion src/items/modules.md
@@ -123,7 +123,7 @@ r[items.mod.attributes]
## Attributes on modules

r[items.mod.attributes.intro]
Modules, like all items, accept outer attributes. They also accept inner attributes: either after `{` for a module with a body, or at the beginning of the source file, after the optional BOM and shebang.
Modules, like all items, accept outer attributes. They also accept inner attributes: either after `{` for a module with a body, or at the beginning of the source file, after the optional BOM, shebang, and frontmatter.

r[items.mod.attributes.supported]
The built-in attributes that have meaning on a module are [`cfg`], [`deprecated`], [`doc`], [the lint check attributes], [`path`], and [`no_implicit_prelude`]. Modules also accept macro attributes.
12 changes: 12 additions & 0 deletions src/notation.md
@@ -45,6 +45,18 @@ Mizushima et al. introduced [cut operators][cut operator paper] to parsing expre

The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.

r[notation.grammar.bottom]
### The bottom rule

In logic, ⊥ (*bottom*) represents absurdity --- a proposition that is always false. In type theory, it is the *empty type*: a type with no inhabitants. The grammar borrows both senses: the rule ⊥ matches nothing --- not any character, not even the end of input.

```grammar,notation
// The bottom rule does not match anything.
⊥ -> !(CHAR | EOF)
```

Placed after a [hard cut operator], `^ ⊥` makes a rule fail unconditionally once the parser has committed past the cut. This gives the grammar a way to express *recognition without acceptance*: the parser identifies the input, commits so that no other alternative can be tried, and then rejects it. In the frontmatter grammar, for example, [FRONTMATTER_INVALID] uses `^ ⊥` to recognize an opening fence preceded by whitespace on the same line --- input that is close enough to frontmatter to rule out other interpretations, but that is not valid.
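
One way to picture *recognition without acceptance* is a three-way parse outcome, sketched below. This is an illustration only: `Outcome`, `indented_fence`, `any_char`, and `choice` are hypothetical names, only space and tab are treated as whitespace, and the sketch does not model how the frontmatter grammar combines the rule with negative lookahead.

```rust
/// Result of trying one grammar alternative (names are illustrative).
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Matched and consumed this many bytes.
    Matched(usize),
    /// Failed before any cut; the next alternative may be tried.
    NoMatch,
    /// Failed after a hard cut; no other alternative may be tried.
    Rejected,
}

/// Mirrors the shape of FRONTMATTER_INVALID: whitespace, then `---`, then `^ ⊥`.
/// Simplification: only space and tab count as whitespace here.
fn indented_fence(input: &str) -> Outcome {
    let ws = input.bytes().take_while(|&b| b == b' ' || b == b'\t').count();
    if ws > 0 && input[ws..].starts_with("---") {
        // Past the cut, the bottom rule always fails.
        Outcome::Rejected
    } else {
        Outcome::NoMatch
    }
}

/// A stand-in fallback alternative that accepts any single character.
fn any_char(input: &str) -> Outcome {
    match input.chars().next() {
        Some(c) => Outcome::Matched(c.len_utf8()),
        None => Outcome::NoMatch,
    }
}

/// Ordered choice that stops as soon as an alternative commits.
fn choice(input: &str, alternatives: &[fn(&str) -> Outcome]) -> Outcome {
    for alt in alternatives {
        match alt(input) {
            Outcome::NoMatch => continue,
            committed => return committed,
        }
    }
    Outcome::NoMatch
}
```

With `indented_fence` listed before `any_char`, an input that starts with an indented `---` is rejected outright instead of falling through to the fallback alternative.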

r[notation.grammar.string-tables]
### String table productions

8 changes: 8 additions & 0 deletions src/whitespace.md
@@ -16,6 +16,10 @@ WHITESPACE ->
| U+2028 // Line separator
| U+2029 // Paragraph separator

HORIZONTAL_WHITESPACE ->
U+0009 // Horizontal tab, `'\t'`
| U+0020 // Space, `' '`

TAB -> U+0009 // Horizontal tab, `'\t'`

LF -> U+000A // Line feed, `'\n'`
@@ -26,10 +30,14 @@ CR -> U+000D // Carriage return, `'\r'`
r[lex.whitespace.intro]
Whitespace is any non-empty string containing only characters that have the [`Pattern_White_Space`] Unicode property.

r[lex.whitespace.horizontal]
[HORIZONTAL_WHITESPACE] is the horizontal space subset of [`Pattern_White_Space`] as categorized by [UAX #31, Section 4.1][uax31-4.1].
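
As a small illustration of the distinction, a predicate for this subset might look like the following. The function name `is_horizontal_whitespace` is hypothetical and not part of any API.

```rust
/// Hypothetical predicate for the HORIZONTAL_WHITESPACE production:
/// only horizontal tab (U+0009) and space (U+0020) qualify.
fn is_horizontal_whitespace(c: char) -> bool {
    matches!(c, '\u{0009}' | '\u{0020}')
}
```

Line terminators such as LF and CR, and the other `Pattern_White_Space` characters, are deliberately excluded by this predicate.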

r[lex.whitespace.token-sep]
Rust is a "free-form" language, meaning that all forms of whitespace serve only to separate _tokens_ in the grammar, and have no semantic significance.

r[lex.whitespace.replacement]
A Rust program has identical meaning if each whitespace element is replaced with any other legal whitespace element, such as a single space character.

[`Pattern_White_Space`]: https://www.unicode.org/reports/tr31/
[uax31-4.1]: https://www.unicode.org/reports/tr31/#Whitespace_and_Syntax