parse EOL as a token

2026-05-15 00:16:13 -07:00
parent c8ad7e74e7
commit 64e5467062
8 changed files with 16909 additions and 15378 deletions
+60 -93
@@ -1,106 +1,73 @@
# Known limitations of `tree-sitter-pico8-lua`
PICO-8's Lua dialect is **line-significant** in two places: the body of a
shorthand `if (cond) ...` / `while (cond) ...` extends to end-of-line, and
the optional `else` of a shorthand `if` must be on the same line as the
opening `if`. Tree-sitter has no built-in concept of newlines as syntactic
tokens — to encode line-significance correctly we'd need an **external
scanner** ( a C file that emits synthetic line-end tokens, the same
mechanism `tree-sitter-python` uses for `INDENT`/`DEDENT`/`NEWLINE` ).
This document used to track parse incorrectness around PICO-8's
line-significant shorthand `if (cond) ...` / `while (cond) ...`
constructs. As of v0.3 the external scanner emits a `LINE_END` token
when the parser is at the body-or-terminator decision point of a
shorthand statement and the next byte is `\n` / `\r` / EOF, so the body
of a shorthand is correctly bounded to its source line.
We have intentionally not written that scanner yet. This document tracks
the resulting parse incorrectness so it isn't forgotten when we revisit.
There are no other known parse-incorrectness issues at this time.
Removing this file (or leaving it as a brief stub) is fine once you're
confident no documentation links still point at the old limitation
sections.
## 1. Dangling-`else` mis-bind in nested `if`
## How line-significance is wired up (for reference)
```lua
-- intended: outer if/else, with shorthand-if as a single statement
-- inside the outer if's consequence.
if is_noisy then
if (is_goose()) honk()
else
toot()
end
```
PICO-8 deviates from standard Lua in two places where a newline is
syntactically significant:
The grammar's shorthand `if` rule uses `prec.right` on its optional `else`
clause, so it greedily eats any `else` it can see — matching the
classic "associate else with nearest if" convention from C / Java.
That's wrong for PICO-8, where the line break after `honk()` should
have closed the shorthand. The bound-too-tight parse:
- `if (cond) <stmts...>` — the consequence (and any same-line `else`
alternative) extends to end-of-line, not to a matching `end`.
- `while (cond) <stmts...>` — same line-bounded body as the
shorthand `if`.
- `else` is parsed as the shorthand's alternative, not the outer if's.
- The outer `if_statement` ends up with no `else_statement` child.
- The trailing `end` still resolves to the outer `if_statement`,
so the source still parses cleanly ( no `ERROR` node ).
Tree-sitter has no built-in concept of newlines as syntactic tokens
when `/\s/` is in `extras` (and we want it there: every other
construct treats whitespace transparently). The canonical fix is an
**external scanner** that gates a synthetic terminator token on
`valid_symbols`. We do exactly that:
**Indistinguishable case** — both parses are correct here, because the
`else` really is on the same line as the shorthand:
- `src/scanner.c` exposes a `LINE_END` external symbol. The scanner
looks at the raw lookahead before the lexer has a chance to skip
extras, and emits `LINE_END` only when the parser actually expects
one (i.e., `valid_symbols[LINE_END] == true`). At any other
position, the scanner's `LINE_END` branch returns false, and the `\n`
falls through to be eaten silently by the `/\s/` extras pattern.
- `LINE_END` is **zero-width** — the scanner does not consume the
newline. This matters for nested shorthands: `if (a) if (b) c()\nd()`
has to terminate BOTH shorthands at the same `\n`. With a zero-width
terminator, each enclosing shorthand sees the same `\n` in turn and
reduces. Once no shorthand is on the stack, `LINE_END` is no longer
in `valid_symbols`, the scanner returns false, and the `\n` is
consumed by extras. The emit chain is bounded by static nesting
depth, so there's no infinite-loop risk despite the zero width.
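
The gating and zero-width behaviour described above can be sketched in isolation. This is a minimal illustration, not the contents of `src/scanner.c`: `MockLexer`, `scan_line_end`, and `close_shorthands_at` are hypothetical names, and the mock stands in for tree-sitter's real `TSLexer` (where the scanner would inspect `lexer->lookahead` and set `lexer->result_symbol`). The loop in `close_shorthands_at` is an assumed stand-in for the parser asking for `LINE_END` once per open shorthand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the small slice of lexer state this sketch
 * needs; the real scanner receives a TSLexer from tree-sitter instead. */
typedef struct {
    const char *input; /* NUL-terminated source text */
    size_t pos;        /* byte offset of the raw lookahead */
} MockLexer;

/* Index of the external token in valid_symbols (assumed to be the only one). */
enum TokenType { LINE_END };

/* Report LINE_END only when the parser expects it AND the raw lookahead is
 * '\n', '\r', or end of input. The lexer is const: nothing is consumed, so
 * the token is zero-width and the newline stays in place for the next call. */
static bool scan_line_end(const MockLexer *lexer, const bool *valid_symbols) {
    assert(lexer != NULL && valid_symbols != NULL);
    if (!valid_symbols[LINE_END]) {
        return false; /* not at a shorthand decision point: leave '\n' to extras */
    }
    char c = lexer->input[lexer->pos];
    return c == '\n' || c == '\r' || c == '\0';
}

/* Simulate nested shorthands reducing on the same newline. Each successful
 * scan closes one shorthand; once none remain, LINE_END would no longer be
 * valid, so the loop stops. Returns how many LINE_END tokens were emitted. */
static int close_shorthands_at(const MockLexer *lexer, int open_shorthands) {
    int emitted = 0;
    while (open_shorthands > 0) {
        bool valid[1] = { true }; /* a shorthand is still open */
        if (!scan_line_end(lexer, valid)) {
            break; /* lookahead is not a line end */
        }
        emitted++;
        open_shorthands--;
    }
    return emitted;
}
```

For a source such as `if (a) if (b) c()\nd()`, calling `close_shorthands_at` with the lexer positioned at the `\n` and two open shorthands yields two emissions without moving `pos`, which is the bounded emit chain the bullet describes.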
```lua
if is_noisy then
if (is_goose()) honk() else toot()
end
```
The shorthand rules in `grammar.js` end with `$._line_end`; the body
and the optional `else` alternative are both `$.statement, repeat($.statement)`,
allowing PICO-8's multi-statement single-line bodies
(`if (falling) wheeee() splat()`).
## 2. Multi-statement shorthand body
The cross-language pattern is "external scanner + valid_symbols-gated
terminator," same as `tree-sitter-r` (the closest analogue) and
similar in spirit to Ruby's paired `_line_break` / `_no_line_break`
hint tokens. Reaching for `\s` removal or per-rule extras is **not**
necessary for this style of line-significance; only Python-style
`INDENT`/`DEDENT` requires the heavier refactor.
```lua
-- both statements are conditional in PICO-8.
if (is_falling()) wheeee() splat()
```
## Test coverage
The grammar's `shorthand_if_statement` rule takes exactly one
consequence statement, so this parses as:
`test/corpus/shorthand_line_end.txt` exercises:
- `shorthand_if_statement` with consequence `wheeee()`
- followed by an unconditional `splat()` statement
A line-aware grammar would gather every statement up to end-of-line
into the shorthand body. Visually:
```lua
-- this and the previous example produce the SAME parse tree under
-- the current grammar, which is wrong for the previous example.
if (is_falling()) wheeee()
splat()
```
## What does this break?
The parse is structurally wrong but **token classification stays
correct**, because every keyword and identifier is still itself
regardless of which parent node owns it. So:
| Feature | Affected? | Notes |
|---|---|---|
| `highlights.scm` ( syntax highlighting ) | No | `else` is `@keyword.conditional` whether it's a child of `shorthand_if_statement` or `else_statement`. |
| `outline.scm` ( file outline ) | No | Doesn't traverse if-bodies. |
| Bracket matching | No | Independent of if/else structure. |
| Injections | No | Independent. |
| `indents.scm` ( auto-indent ) | Subtly | A mis-bound `else` is inside a `shorthand_if_statement`, which is not an `@indent` node; so the next line may land at the wrong indent column. |
| Semantic selection ( "expand selection" ) | Subtly | Cursor on `toot()` expands to `shorthand_if_statement` instead of `else_statement` → outer `if_statement`. |
| `folds.scm` / `textobjects.scm` | Potentially | Not currently shipped; would inherit the structural bug if we add them. |
| Static analysis / LSP-style features | Yes | Anything that walks the AST to reason about reachability or scope ( e.g. "unreachable code", goto-definition through a conditional branch ) will mis-report. None of this is shipped today. |
For v0.2's stated scope ( syntax highlighting + a basic outline ), the
visible symptom is "auto-indent occasionally off by one column inside a
nested-if-with-out-of-line-else", which only bites a relatively
uncommon code pattern. Deferred until v0.3 LSP work, which needs a
correct AST.
## Fixing it later
The canonical approach is an external scanner. Sketch:
1. Add an `external` symbol like `_logical_line_end` that emits at every
`\n` *not* preceded by line-continuation context.
2. Make `shorthand_if_statement` take the form
`seq('if', '(', expr, ')', stmt, optional(seq(\
/* not _logical_line_end yet */ 'else', stmt)), $._logical_line_end)`.
3. Allow `shorthand_if_statement` consequence to be `repeat1(stmt)` so a
one-line `if (x) a() b()` puts both calls in the shorthand body.
The scanner needs to be written in C, registered via the `externals`
field, and built into `src/scanner.c`. `tree-sitter-python`'s scanner is
a good reference for the pattern.
- Single- and multi-statement shorthand bodies, terminated by `\n` and
by EOF.
- Same-line `else` (single- and multi-statement alternative).
- The historical dangling-else case (shorthand inside a standard `if`,
with `else` on a later line — must bind to the outer `if`).
- Line comment trailing the shorthand body (the comment is in extras
and the trailing `\n` still triggers `LINE_END`).
- Shorthand inside a `do`-block (the `\n` before the closing `end`
terminates the shorthand cleanly).
- Nested shorthand `if`s on the same line (one `\n` must close both).
- Coexistence with standard `if (parenthesized) then ... end` — the
GLR conflict resolves on whether `then` follows.