2026-05-15 00:16:13 -07:00

Known limitations of tree-sitter-pico8-lua

This document used to track incorrect parses of PICO-8's line-significant shorthand constructs, if (cond) ... and while (cond) .... As of v0.3, the external scanner emits a LINE_END token whenever the parser is at the body-or-terminator decision point of a shorthand statement and the next byte is \n, \r, or EOF, so the body of a shorthand is correctly bounded to its source line.

There are no other known parse-incorrectness issues at this time. Removing this file (or leaving it as a brief stub) is fine once you're confident no documentation links still point at the old limitation sections.

How line-significance is wired up (for reference)

PICO-8 deviates from standard Lua in two places where a newline is syntactically significant:

  • if (cond) <stmts...> — the consequence (and any same-line else alternative) extends to end-of-line, not to a matching end.
  • while (cond) <stmts...> — same line-bounded body as the shorthand if.

Tree-sitter has no built-in concept of newlines as syntactic tokens when /\s/ is in extras (and we want it there: every other construct treats whitespace transparently). The canonical fix is an external scanner that gates a synthetic terminator token on valid_symbols. We do exactly that:

  • src/scanner.c exposes a LINE_END external symbol. The scanner looks at the raw lookahead before the lexer has a chance to skip extras, and emits LINE_END only when the parser actually expects one (i.e., valid_symbols[LINE_END] == true). At any other position, the scanner's LINE_END branch returns false, and the \n falls through to be eaten silently by the /\s/ extras pattern.
  • LINE_END is zero-width — the scanner does not consume the newline. This matters for nested shorthands: if (a) if (b) c()\nd() has to terminate BOTH shorthands at the same \n. With a zero-width terminator, each enclosing shorthand sees the same \n in turn and reduces. Once no shorthand is on the stack, LINE_END is no longer in valid_symbols, the scanner returns false, and the \n is consumed by extras. The emit chain is bounded by static nesting depth, so there's no infinite-loop risk despite the zero width.
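
The two points above can be sketched as a small self-contained model. This is not the code in src/scanner.c: MockLexer, scan_line_end, and close_shorthands_at are hypothetical names, the NUL terminator stands in for EOF, and the loop only imitates what the parser does when several shorthands wait on the same newline. A real scanner would use TSLexer from tree_sitter/parser.h (lexer->lookahead, lexer->eof(lexer), lexer->result_symbol) instead of the mock struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; the real scanner works
 * against TSLexer from tree_sitter/parser.h. */
enum TokenType { LINE_END, TOKEN_TYPE_COUNT };

typedef struct {
    const char *src;   /* NUL-terminated input; the NUL models EOF (assumption) */
    size_t pos;        /* current byte offset */
    int result_symbol; /* set when a token is emitted */
} MockLexer;

/* Emit LINE_END only when the parser expects it and the next byte is
 * \n, \r, or EOF. The position is never advanced, so the token is
 * zero-width and the newline stays in the input. */
static bool scan_line_end(MockLexer *lexer, const bool *valid_symbols) {
    if (!valid_symbols[LINE_END]) return false;
    char c = lexer->src[lexer->pos];
    if (c == '\n' || c == '\r' || c == '\0') {
        lexer->result_symbol = LINE_END;
        return true;
    }
    return false;
}

/* Imitates the parser side: `open_shorthands` nested shorthand statements
 * all wait for LINE_END at the same position. Each emitted token lets one
 * of them reduce. Returns how many LINE_END tokens were emitted before the
 * scanner declined. */
static int close_shorthands_at(MockLexer *lexer, int open_shorthands) {
    assert(open_shorthands >= 0);
    bool valid[TOKEN_TYPE_COUNT];
    size_t start = lexer->pos;
    int emitted = 0;
    for (;;) {
        /* LINE_END is valid only while some shorthand is still open. */
        valid[LINE_END] = open_shorthands > 0;
        if (!scan_line_end(lexer, valid)) break;
        assert(lexer->pos == start); /* zero-width: nothing was consumed */
        open_shorthands--;
        emitted++;
    }
    return emitted;
}
```

In this model, calling close_shorthands_at with the lexer positioned at the \n of if (a) if (b) c()\nd() and two open shorthands emits exactly two tokens and leaves the position unchanged. The loop stops because the count of open shorthands reaches zero, which mirrors the bounded-by-nesting-depth argument above.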

The shorthand rules in grammar.js end with $._line_end. The body and the optional else alternative are each a $.statement followed by repeat($.statement), that is, one or more statements, which allows PICO-8's multi-statement single-line bodies (if (falling) wheeee() splat()).

The cross-language pattern is "external scanner + valid_symbols-gated terminator," the same one used by tree-sitter-r (the closest analogue) and similar in spirit to Ruby's paired _line_break / _no_line_break hint tokens. Removing /\s/ from extras or introducing per-rule extras is not necessary for this style of line-significance; only Python-style INDENT/DEDENT requires the heavier refactor.

Test coverage

test/corpus/shorthand_line_end.txt exercises:

  • Single- and multi-statement shorthand bodies, terminated by \n and by EOF.
  • Same-line else (single- and multi-statement alternative).
  • The historical dangling-else case (shorthand inside a standard if, with else on a later line — must bind to the outer if).
  • Line comment trailing the shorthand body (the comment is in extras and the trailing \n still triggers LINE_END).
  • Shorthand inside a do-block (the \n before the closing end terminates the shorthand cleanly).
  • Nested shorthand ifs on the same line (one \n must close both).
  • Coexistence with standard if (parenthesized) then ... end — the GLR conflict resolves on whether then follows.