parse EOL as a token

2026-05-15 00:16:13 -07:00
parent c8ad7e74e7
commit 64e5467062
8 changed files with 16909 additions and 15378 deletions
+60 -93
@@ -1,106 +1,73 @@
# Known limitations of `tree-sitter-pico8-lua`
PICO-8's Lua dialect is **line-significant** in two places: the body of a
shorthand `if (cond) ...` / `while (cond) ...` extends to end-of-line, and
the optional `else` of a shorthand `if` must be on the same line as the
opening `if`. Tree-sitter has no built-in concept of newlines as syntactic
tokens — to encode line-significance correctly we'd need an **external
scanner** ( a C file that emits synthetic line-end tokens, the same
mechanism `tree-sitter-python` uses for `INDENT`/`DEDENT`/`NEWLINE` ).
This document used to track parse incorrectness around PICO-8's
line-significant shorthand `if (cond) ...` / `while (cond) ...`
constructs. As of v0.3 the external scanner emits a `LINE_END` token
when the parser is at the body-or-terminator decision point of a
shorthand statement and the next byte is `\n` / `\r` / EOF, so the body
of a shorthand is correctly bounded to its source line.
We have intentionally not written that scanner yet. This document tracks
the resulting parse incorrectness so it isn't forgotten when we revisit.
There are no other known parse-incorrectness issues at this time.
Removing this file (or leaving it as a brief stub) is fine once you're
confident no documentation links still point at the old limitation
sections.
## 1. Dangling-`else` mis-bind in nested `if`
## How line-significance is wired up (for reference)
```lua
-- intended: outer if/else, with shorthand-if as a single statement
-- inside the outer if's consequence.
if is_noisy then
if (is_goose()) honk()
else
toot()
end
```
PICO-8 deviates from standard Lua in two places where a newline is
syntactically significant:
The grammar's shorthand `if` rule uses `prec.right` on its optional `else`
clause, so it greedily eats any `else` it can see — matching the
classic "associate else with nearest if" convention from C / Java.
That's wrong for PICO-8, where the line break after `honk()` should
have closed the shorthand. The bound-too-tight parse:
- `if (cond) <stmts...>` — the consequence (and any same-line `else`
alternative) extends to end-of-line, not to a matching `end`.
- `while (cond) <stmts...>` — same line-bounded body as the
shorthand `if`.
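
As a rough illustration of the line-bounded rule in the two bullets above, the sketch below finds where a shorthand body would end in a source buffer. The helper name `shorthand_line_end` is hypothetical and not part of this repository, and the scan is deliberately naive: it assumes no multi-line block string or block comment appears on the line.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the grammar or scanner.
 * Returns the offset of the first '\n', '\r', or terminating NUL at or
 * after `start`, i.e. the position where a shorthand body would end.
 * Assumes `start` is within the NUL-terminated buffer `src`, and ignores
 * block strings and block comments that could span lines. */
static size_t shorthand_line_end(const char *src, size_t start) {
    size_t i = start;
    while (src[i] != '\0' && src[i] != '\n' && src[i] != '\r') {
        i++;
    }
    return i;
}
```

Everything between the closing `)` of the condition and the returned offset belongs to the shorthand; a statement that starts after that offset does not.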
- `else` is parsed as the shorthand's alternative, not the outer if's.
- The outer `if_statement` ends up with no `else_statement` child.
- The trailing `end` still resolves to the outer `if_statement`,
so the source still parses cleanly ( no `ERROR` node ).
Tree-sitter has no built-in concept of newlines as syntactic tokens
when `/\s/` is in `extras` (and we want it there: every other
construct treats whitespace transparently). The canonical fix is an
**external scanner** that gates a synthetic terminator token on
`valid_symbols`. We do exactly that:
**Indistinguishable case** — both parses are correct here, because the
`else` really is on the same line as the shorthand:
- `src/scanner.c` exposes a `LINE_END` external symbol. The scanner
looks at the raw lookahead before the lexer has a chance to skip
extras, and emits `LINE_END` only when the parser actually expects
one (i.e., `valid_symbols[LINE_END] == true`). At any other
  position, the scanner's `LINE_END` branch returns false, and the `\n`
falls through to be eaten silently by the `/\s/` extras pattern.
- `LINE_END` is **zero-width** — the scanner does not consume the
newline. This matters for nested shorthands: `if (a) if (b) c()\nd()`
has to terminate BOTH shorthands at the same `\n`. With a zero-width
terminator, each enclosing shorthand sees the same `\n` in turn and
reduces. Once no shorthand is on the stack, `LINE_END` is no longer
in `valid_symbols`, the scanner returns false, and the `\n` is
consumed by extras. The emit chain is bounded by static nesting
depth, so there's no infinite-loop risk despite the zero width.
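
The two bullets above can be modelled in a few lines of standalone C. This is only a sketch under stated assumptions, not the real scanner: `emits_line_end` stands in for the `LINE_END` branch in `src/scanner.c`, and `count_line_end_emits` is a hypothetical model of the parser side in which `LINE_END` is valid exactly while at least one shorthand is still open.

```c
#include <stdbool.h>

/* Stand-in for the scanner's LINE_END branch: it inspects the lookahead
 * but never advances past it, so the token is zero-width. */
static bool emits_line_end(bool line_end_valid, int lookahead) {
    return line_end_valid &&
           (lookahead == '\n' || lookahead == '\r' || lookahead == 0);
}

/* Hypothetical model of the parser: `open_shorthands` counts shorthand
 * statements still waiting for their terminator. Each emitted LINE_END
 * closes exactly one of them. Returns how many LINE_END tokens are
 * emitted at this lookahead before the newline falls through to extras. */
static int count_line_end_emits(int open_shorthands, int lookahead) {
    int emitted = 0;
    while (emits_line_end(open_shorthands > 0, lookahead)) {
        open_shorthands--; /* one shorthand reduces per LINE_END */
        emitted++;
    }
    return emitted;
}
```

For `if (a) if (b) c()\nd()` the model starts with two open shorthands at the `\n`, so two tokens are emitted and the loop stops. With no shorthand open, nothing is emitted, which corresponds to the newline being skipped as whitespace.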
```lua
if is_noisy then
if (is_goose()) honk() else toot()
end
```
The shorthand rules in `grammar.js` end with `$._line_end`; the body
and the optional `else` alternative are both `$.statement, repeat($.statement)`,
allowing PICO-8's multi-statement single-line bodies
(`if (falling) wheeee() splat()`).
## 2. Multi-statement shorthand body
The cross-language pattern is "external scanner + valid_symbols-gated
terminator," same as `tree-sitter-r` (the closest analogue) and
similar in spirit to Ruby's paired `_line_break` / `_no_line_break`
hint tokens. Removing `/\s/` from `extras` or adding per-rule extras is **not**
necessary for this style of line-significance; only Python-style
`INDENT`/`DEDENT` requires the heavier refactor.
```lua
-- both statements are conditional in PICO-8.
if (is_falling()) wheeee() splat()
```
## Test coverage
The grammar's `shorthand_if_statement` rule takes exactly one
consequence statement, so this parses as:
`test/corpus/shorthand_line_end.txt` exercises:
- `shorthand_if_statement` with consequence `wheeee()`
- followed by an unconditional `splat()` statement
A line-aware grammar would gather every statement up to end-of-line
into the shorthand body. Visually:
```lua
-- this and the previous example produce the SAME parse tree under
-- the current grammar, which is wrong for the previous example.
if (is_falling()) wheeee()
splat()
```
## What does this break?
The parse is structurally wrong but **token classification stays
correct**, because every keyword and identifier is still itself
regardless of which parent node owns it. So:
| Feature | Affected? | Notes |
|---|---|---|
| `highlights.scm` ( syntax highlighting ) | No | `else` is `@keyword.conditional` whether it's a child of `shorthand_if_statement` or `else_statement`. |
| `outline.scm` ( file outline ) | No | Doesn't traverse if-bodies. |
| Bracket matching | No | Independent of if/else structure. |
| Injections | No | Independent. |
| `indents.scm` ( auto-indent ) | Subtly | A mis-bound `else` is inside a `shorthand_if_statement`, which is not an `@indent` node; so the next line may land at the wrong indent column. |
| Semantic selection ( "expand selection" ) | Subtly | Cursor on `toot()` expands to `shorthand_if_statement` instead of `else_statement` → outer `if_statement`. |
| `folds.scm` / `textobjects.scm` | Potentially | Not currently shipped; would inherit the structural bug if we add them. |
| Static analysis / LSP-style features | Yes | Anything that walks the AST to reason about reachability or scope ( e.g. "unreachable code", goto-definition through a conditional branch ) will mis-report. None of this is shipped today. |
For v0.2's stated scope ( syntax highlighting + a basic outline ), the
visible symptom is "auto-indent occasionally off by one column inside a
nested-if-with-out-of-line-else", which only bites a relatively
uncommon code pattern. Deferred until v0.3 LSP work, which needs a
correct AST.
## Fixing it later
The canonical approach is an external scanner. Sketch:
1. Add an `external` symbol like `_logical_line_end` that emits at every
`\n` *not* preceded by line-continuation context.
2. Make `shorthand_if_statement` take the form
`seq('if', '(', expr, ')', stmt, optional(seq(\
/* not _logical_line_end yet */ 'else', stmt)), $._logical_line_end)`.
3. Allow `shorthand_if_statement` consequence to be `repeat1(stmt)` so a
one-line `if (x) a() b()` puts both calls in the shorthand body.
The scanner needs to be written in C, registered via the `externals`
field, and built into `src/scanner.c`. `tree-sitter-python`'s scanner is
a good reference for the pattern.
- Single- and multi-statement shorthand bodies, terminated by `\n` and
by EOF.
- Same-line `else` (single- and multi-statement alternative).
- The historical dangling-else case (shorthand inside a standard `if`,
with `else` on a later line — must bind to the outer `if`).
- Line comment trailing the shorthand body (the comment is in extras
and the trailing `\n` still triggers `LINE_END`).
- Shorthand inside a `do`-block (the `\n` before the closing `end`
terminates the shorthand cleanly).
- Nested shorthand `if`s on the same line (one `\n` must close both).
- Coexistence with standard `if (parenthesized) then ... end` — the
GLR conflict resolves on whether `then` follows.
+30 -9
@@ -68,6 +68,12 @@ export default grammar({
$._block_string_start,
$._block_string_content,
$._block_string_end,
// PICO-8 line-significance: terminates the body of `if (cond) ...` /
// `while (cond) ...` shorthand. The scanner emits this only when the
// parser is at a state expecting it; everywhere else a newline falls
// through to /\s/ in extras and is skipped. See src/scanner.c.
$._line_end,
],
supertypes: ($) => [$.statement, $.expression, $.declaration, $.variable],
@@ -168,14 +174,20 @@ export default grammar({
'end'
),
// PICO-8 single-line: while (cond) stmt
// PICO-8 single-line: while (cond) stmt {stmt}
// Body extends to end-of-line (or EOF). The $._line_end terminator
// is emitted by the external scanner when it sees \n/\r/EOF at a
// position where the parser expects line-end; until then, additional
// statements on the same line accumulate into the body.
shorthand_while_statement: ($) =>
seq(
'while',
'(',
field('condition', $.expression),
')',
field('body', $.statement)
field('body', $.statement),
repeat(field('body', $.statement)),
$._line_end
),
repeat_statement: ($) =>
@@ -205,19 +217,28 @@ export default grammar({
),
else_statement: ($) => seq('else', field('body', optional_block($))),
// PICO-8 single-line: if (cond) stmt [else stmt]
// prec.right resolves the dangling-else ambiguity in favor of greedy
// attach to the nearest preceding shorthand `if`, matching PICO-8
// semantics where shorthand if/else live on one line.
// PICO-8 single-line: if (cond) stmt {stmt} [else stmt {stmt}]
// Both the consequence and the alternative extend to end-of-line.
// The $._line_end terminator (emitted by the external scanner on
// \n/\r/EOF) prevents a later-line `else` from binding to a
// shorthand `if` on a previous line, matching PICO-8 semantics.
shorthand_if_statement: ($) =>
prec.right(seq(
seq(
'if',
'(',
field('condition', $.expression),
')',
field('consequence', $.statement),
optional(seq('else', field('alternative', $.statement)))
)),
repeat(field('consequence', $.statement)),
optional(
seq(
'else',
field('alternative', $.statement),
repeat(field('alternative', $.statement))
)
),
$._line_end
),
for_statement: ($) =>
seq(
+88 -47
@@ -538,6 +538,21 @@
"type": "SYMBOL",
"name": "statement"
}
},
{
"type": "REPEAT",
"content": {
"type": "FIELD",
"name": "body",
"content": {
"type": "SYMBOL",
"name": "statement"
}
}
},
{
"type": "SYMBOL",
"name": "_line_end"
}
]
},
@@ -729,50 +744,68 @@
]
},
"shorthand_if_statement": {
"type": "PREC_RIGHT",
"value": 0,
"content": {
"type": "SEQ",
"members": [
{
"type": "STRING",
"value": "if"
},
{
"type": "STRING",
"value": "("
},
{
"type": "FIELD",
"name": "condition",
"content": {
"type": "SYMBOL",
"name": "expression"
}
},
{
"type": "STRING",
"value": ")"
},
{
"type": "SEQ",
"members": [
{
"type": "STRING",
"value": "if"
},
{
"type": "STRING",
"value": "("
},
{
"type": "FIELD",
"name": "condition",
"content": {
"type": "SYMBOL",
"name": "expression"
}
},
{
"type": "STRING",
"value": ")"
},
{
"type": "FIELD",
"name": "consequence",
"content": {
"type": "SYMBOL",
"name": "statement"
}
},
{
"type": "REPEAT",
"content": {
"type": "FIELD",
"name": "consequence",
"content": {
"type": "SYMBOL",
"name": "statement"
}
},
{
"type": "CHOICE",
"members": [
{
"type": "SEQ",
"members": [
{
"type": "STRING",
"value": "else"
},
{
}
},
{
"type": "CHOICE",
"members": [
{
"type": "SEQ",
"members": [
{
"type": "STRING",
"value": "else"
},
{
"type": "FIELD",
"name": "alternative",
"content": {
"type": "SYMBOL",
"name": "statement"
}
},
{
"type": "REPEAT",
"content": {
"type": "FIELD",
"name": "alternative",
"content": {
@@ -780,15 +813,19 @@
"name": "statement"
}
}
]
},
{
"type": "BLANK"
}
]
}
]
}
}
]
},
{
"type": "BLANK"
}
]
},
{
"type": "SYMBOL",
"name": "_line_end"
}
]
},
"for_statement": {
"type": "SEQ",
@@ -3696,6 +3733,10 @@
{
"type": "SYMBOL",
"name": "_block_string_end"
},
{
"type": "SYMBOL",
"name": "_line_end"
}
],
"inline": [],
+3 -3
@@ -1195,7 +1195,7 @@
"named": true,
"fields": {
"alternative": {
"multiple": false,
"multiple": true,
"required": false,
"types": [
{
@@ -1215,7 +1215,7 @@
]
},
"consequence": {
"multiple": false,
"multiple": true,
"required": true,
"types": [
{
@@ -1231,7 +1231,7 @@
"named": true,
"fields": {
"body": {
"multiple": false,
"multiple": true,
"required": true,
"types": [
{
+16418 -15205
File diff suppressed because it is too large
+35
@@ -11,6 +11,13 @@ enum TokenType {
BLOCK_STRING_START,
BLOCK_STRING_CONTENT,
BLOCK_STRING_END,
// PICO-8 line-significance: terminates the body of `if (cond) ...` /
// `while (cond) ...` shorthand. Emitted only when the parser expects it
// (see scan() — this token is gated on valid_symbols[LINE_END]) so that
// newlines outside of shorthand contexts continue to fall through to
// extras and be skipped silently.
LINE_END,
};
static inline void consume(TSLexer *lexer) { lexer->advance(lexer, false); }
@@ -157,6 +164,34 @@ static bool scan_comment_content(Scanner *scanner, TSLexer *lexer) {
bool tree_sitter_pico8_lua_external_scanner_scan(void *payload, TSLexer *lexer, const bool *valid_symbols) {
Scanner *scanner = (Scanner *)payload;
// LINE_END must be checked before any whitespace-skipping path below,
// because the bytes that signal it (\n, \r, EOF) would otherwise be
// consumed as extras and be invisible to us. The check is also
// intentionally placed before the block_string / block_comment branches
// so that those branches' skip_whitespaces() can't eat our newline.
//
// The scanner emits LINE_END only when the parser's current state lists
// it as valid (i.e., we're at the body-or-terminator decision point of a
// shorthand_if_statement / shorthand_while_statement). Everywhere else,
// \n falls through to the /\s/ extras pattern and is skipped silently,
// so this branch is invisible to the rest of the grammar.
//
// LINE_END is intentionally zero-width: we do NOT consume the newline.
// That lets nested shorthands on the same line each see the same \n and
// close in turn (e.g. `if (a) if (b) c()\nd()` — the \n must terminate
// BOTH shorthands so that `d()` is a top-level statement). Once every
// enclosing shorthand has reduced, LINE_END is no longer in any parser
// state's valid_symbols, the scanner returns false, and the trailing
// \n is consumed by /\s/ in extras as usual. There is no infinite-loop
// risk: each LINE_END shift reduces one shorthand statement, so the
// emit chain is bounded by static nesting depth.
if (valid_symbols[LINE_END] &&
(lexer->lookahead == '\n' || lexer->lookahead == '\r' ||
lexer->lookahead == 0)) {
lexer->result_symbol = LINE_END;
return true;
}
if (valid_symbols[BLOCK_STRING_END] && scan_block_end(scanner, lexer)) {
reset_state(scanner);
lexer->result_symbol = BLOCK_STRING_END;
@@ -0,0 +1,249 @@
================================================================
shorthand if — single statement body, terminated by newline
================================================================
if (cond) honk()
toot()
----------------------------------------------------------------
(chunk
(shorthand_if_statement
condition: (identifier)
consequence: (function_call
name: (identifier)
arguments: (arguments)))
(function_call
name: (identifier)
arguments: (arguments)))
================================================================
shorthand if — single statement body, terminated by EOF
================================================================
if (cond) honk()
----------------------------------------------------------------
(chunk
(shorthand_if_statement
condition: (identifier)
consequence: (function_call
name: (identifier)
arguments: (arguments))))
================================================================
shorthand if — multi-statement body collected into shorthand
================================================================
if (is_falling()) wheeee() splat()
----------------------------------------------------------------
(chunk
(shorthand_if_statement
condition: (function_call
name: (identifier)
arguments: (arguments))
consequence: (function_call
name: (identifier)
arguments: (arguments))
consequence: (function_call
name: (identifier)
arguments: (arguments))))
================================================================
shorthand if — same-line else
================================================================
if (cond) honk() else toot()
----------------------------------------------------------------
(chunk
(shorthand_if_statement
condition: (identifier)
consequence: (function_call
name: (identifier)
arguments: (arguments))
alternative: (function_call
name: (identifier)
arguments: (arguments))))
================================================================
shorthand if — same-line multi-statement else
================================================================
if (cond) honk() else toot() squawk()
----------------------------------------------------------------
(chunk
(shorthand_if_statement
condition: (identifier)
consequence: (function_call
name: (identifier)
arguments: (arguments))
alternative: (function_call
name: (identifier)
arguments: (arguments))
alternative: (function_call
name: (identifier)
arguments: (arguments))))
================================================================
shorthand if nested in standard if — `else` on later line binds
to OUTER if, not the shorthand (PICO-8 line-significance)
================================================================
if is_noisy then
if (is_goose()) honk()
else
toot()
end
----------------------------------------------------------------
(chunk
(if_statement
condition: (identifier)
consequence: (block
(shorthand_if_statement
condition: (function_call
name: (identifier)
arguments: (arguments))
consequence: (function_call
name: (identifier)
arguments: (arguments))))
alternative: (else_statement
body: (block
(function_call
name: (identifier)
arguments: (arguments))))))
================================================================
shorthand if — line comment between body and newline still
terminates the shorthand at the newline (line comment is in
extras and is attached to the deepest enclosing node)
================================================================
if (cond) honk() -- inline
toot()
----------------------------------------------------------------
(chunk
(shorthand_if_statement
condition: (identifier)
consequence: (function_call
name: (identifier)
arguments: (arguments))
(comment
content: (comment_content)))
(function_call
name: (identifier)
arguments: (arguments)))
================================================================
shorthand if inside a do-block — newline before `end` terminates
shorthand, then `end` closes the do-block
================================================================
do
if (cond) honk()
end
----------------------------------------------------------------
(chunk
(do_statement
body: (block
(shorthand_if_statement
condition: (identifier)
consequence: (function_call
name: (identifier)
arguments: (arguments))))))
================================================================
shorthand while — multi-statement body, terminated by newline
================================================================
while (running) tick() draw()
cleanup()
----------------------------------------------------------------
(chunk
(shorthand_while_statement
condition: (identifier)
body: (function_call
name: (identifier)
arguments: (arguments))
body: (function_call
name: (identifier)
arguments: (arguments)))
(function_call
name: (identifier)
arguments: (arguments)))
================================================================
shorthand while — single statement body, terminated by EOF
================================================================
while (cond) tick()
----------------------------------------------------------------
(chunk
(shorthand_while_statement
condition: (identifier)
body: (function_call
name: (identifier)
arguments: (arguments))))
================================================================
nested shorthand ifs on the same line — a single newline must
terminate BOTH shorthands (otherwise the outer one greedily
absorbs the next-line statement)
================================================================
if (a) if (b) c()
d()
----------------------------------------------------------------
(chunk
(shorthand_if_statement
condition: (identifier)
consequence: (shorthand_if_statement
condition: (identifier)
consequence: (function_call
name: (identifier)
arguments: (arguments))))
(function_call
name: (identifier)
arguments: (arguments)))
================================================================
standard if with parenthesized condition coexists with shorthand
— GLR resolves on the token after `)` (then vs statement)
================================================================
if (cond) then a() end
if (cond) a()
----------------------------------------------------------------
(chunk
(if_statement
condition: (parenthesized_expression
(identifier))
consequence: (block
(function_call
name: (identifier)
arguments: (arguments))))
(shorthand_if_statement
condition: (identifier)
consequence: (function_call
name: (identifier)
arguments: (arguments))))