Does anyone have a good solution for writing a recursive descent parser in C/C++ with the following constraints?
- produces an ordinary heterogeneous AST
- handles multiple errors (while being able to produce a partial syntax tree for the rest)
- doesn't use exception handling or setjmp/longjmp
Whenever I crank something like this out, I feel like I'm missing something and writing too much code. I guess I could just use a morass of macros?
@zwarich My preferred way to handle errors these days if I want a "maximally lossless" AST is to have one node variant per error case (and you discover what these are as you write the parser since they're your "else" cases). And you don't have any early-outs anywhere, the whole thing is infallible. But yes, you end up with a lot of "AST bloat" since you need a lot of extra node variants.
@zwarich The infallibility means that you need a global forward progress guarantee and you need some sort of language and parser specific design principle to have a good outcome here. For example, an easy way to achieve global forward progress is to have every error case consume at least one token. But that's actually very bad; you want something more subtle, e.g. expression-level recovery might sync to the next semicolon but not actually consume it, leaving that to the statement-level parsing.
@pervognsen Do you have any tricks on how to have good error recovery for "precedence climbing" / top-down operator parsing?
@zwarich Not really, beyond the obvious stuff. If you take the semantic approach to precedence that Pratt suggests in the paper, where you directly order semantic entities instead of tokens (semantic tokens just inherit the precedence), then I think you rediscover a lot of the classic Wirth-style recovery tricks, e.g. Declaration < Block < Statement < Expression where a token like ; belongs to the Statement level. Syncing to semantic tokens from the outer levels is the natural thing to do.
@zwarich Incidentally, a corollary is that "fully recursive, fully uniform" syntax like Lisp S-expressions is more or less the worst case but that's sort of obvious if you've ever tried to think through robust recovery for those languages. You need strict separation between levels and they need to be delimited purely by local information (e.g. a distinct set of tokens) not by non-local information like upstream context (lexical scope, bracket balancing or quote vs non-quote context).
@pervognsen Once you hit mismatched braces, is there any point in reporting any errors or should you just report brace errors and ignore all other errors?
Your mention of S-expressions reminded me of this paper, which tries to find a coarse paren-like structure for the purposes of error recovery:
https://link.springer.com/chapter/10.1007/3-540-08342-1_14
@zwarich I'll take a look. I guess you know of Jonathan Bachrach's late 1990s, early 2000s work on macros for infix languages where he uses a skeletal syntax tree? That's very close to the same idea as token trees in Rust (I assume there's influence).
@zwarich I must have mentioned that I converged on rigid indentation-based block structure (much more rigid than anything else I've seen) for my own language experiments? It solves a lot of problems. You don't have Pratt-like "semantic levels" intrinsically for the blocks since on their own nested blocks don't carry any meaning but you do have robust syntactic recovery and from that you can build robust semantic recovery by taking into account (well-formed) syntactic/semantic context.
@zwarich I use strict statement vs expression separation (for similar reasons to Pratt and Wirth) but I have a way to reflect between the statement and expression level while also covering use cases like (labeled) break. A let definition can take a statement block as an argument:
let result =
for (key, value) in items
if key == needle
yield Some(value)
yield None
You can also use yield e as x to label the yield destination explicitly.
@pervognsen I'll try to stop asking you mundane syntax questions, but I'm sort of fascinated by this variant of it. Is every indented let RHS automatically treated as a statement embedding, or is this a mere expression?
let result =
f(x, y)
@zwarich It's always a statement block. You'd have to write
let result =
yield f(x, y)
although of course there is no reason not to write let result = f(x, y) in this case. I have some ideas for how this block and yield syntax fits in as a special case of an algebraic effect system eventually though I haven't worked out the details.
@pervognsen If you're going to go that route, you should also treat conditionals and iteration as control effects like in Verse, although that does introduce a new asymmetry between `if` and `case`.
@zwarich Yeah, I've been considering things like that. For the pedagogical language, a lot of what I'm planning on doing is focused on making compilation simple, fast and good, so there's an essential lack of generality (which I'm intentionally putting in but making sure most of it can be gradually removed without changing the meaning of existing code).
@zwarich For example, the for loop is closer to Common Lisp's LOOP facility than what you usually see in other languages, but it desugars to a loop with sequential guards that has a pretty direct interpretation in terms of success/failure continuations for each for clause.
@zwarich For the initial revision of the pedagogical language, I'm essentially stratifying tests and expressions to make it simple and faster to generate good code. The grammar for that fragment is something like this:
conjunct = 'not' test | expr relation expr
disjunct = conjunct ('and' disjunct)?
test = disjunct ('or' test)?
The plan is to flatten the test/expr distinction syntactically and semantically later in a compatible way.
@pervognsen @zwarich It's also interesting how "silly newbie instincts", like `if (thing == true) { ... }` in this case, sometimes point to something deeper.