Contextual Lexer Leaking "Spam" Terminals #1331
This is a limitation of the LALR parsing algorithm. The BasicLexers are created based on the states of the parser, which can't fully reflect the truth. The LALR algorithm can't parse all CFGs, so if this is limiting what you can parse, you can't use lalr and need to use earley (which can parse all CFGs).
Thanks for the clarification @MegaIng. I was admittedly clinging to LALR(1) out of past habit and from wanting to run the transformer during the parse. Now seems a very good time to migrate properly to Earley. It will probably make my grammar a lot simpler too. For context, the language I'm parsing is full of edge cases where white space is sometimes essential and needs to be kept, versus cases where it is irrelevant and needs to be discarded. LALR(1) parsing has been tough because it turns into a chaotic war of trying to stop text being eaten by the wrong terminals.
I've given Earley a good try today, but the lack of transparency makes debugging difficult. To get around the problem of spam terminal leakage in LALR(1), I've added logic to my BasicLexer subclass to introspect the current parser state, use this to test for lexing errors, and send back corrected 'patched' replacement tokens. This has got me back on track.
@davidmcnabnz Yes, the Earley parser is harder to debug. I wouldn't say impossible. The "problem" with Earley (which is also why it's so powerful) is that it doesn't know what rules and tokens it will choose until the last token has been input. It might be possible to add some interactivity to the Earley parser, but it would be a different API to parse_interactive(). Anyway, glad to hear you're getting back on track. I wonder, what do you get from subclassing BasicLexer that you can't do with parse_interactive()?
Thanks, @erezsh for the thoughts. You're right to question it: as it happens, subclassing wasn't strictly necessary. Anyway, I've pivoted to a whole different approach which is simplifying the project, eliminating all terminal spam and "terminal priority arms races", and speeding up the task. This approach is based on:
This sounds like a lot, but it's turned out surprisingly easy. Also, it's helping me cover the legacy language's nuances more easily, where the significance of whitespace and the semantics of tokens vary widely according to context. Given that this newer methodology has resolved all the issues flagged in the ticket, I'd be happy for the ticket to be closed. Hopefully this discussion might help others who find themselves in similar struggles.
Nice milestone reached today using the custom lexer class. If anyone's interested, I'd be happy to refactor this class to remove the proprietary elements and make it available for general use, with examples.
Congrats on the milestone! Sounds interesting, what would the API look like, more or less? |
The API centres on a states table and a stack: for each state, there's a hardwired list of eligible match options, where each match option contains a token name, a regex, plus zero or more state-transition commands (noop, reset, goto, push, pop). The legacy language's features made a simpler finite state machine (FSM) unworkable. It required a push-down automaton to properly track how the meanings of input text often change dramatically according to context. For instance, the same keyword in different states needs to yield different tokens to the parser. @erezsh I'm thinking to publish this into a separate small repo of 'Lark examples'. But if you prefer, I could add the class to the examples section of my Lark fork and send you a PR.
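Since the class itself is proprietary, here is a hypothetical stdlib-only reconstruction of the design described above. All names (MatchOption, STATES, lex) are my own invention, and this is a sketch of the idea rather than the actual class or a Lark Lexer subclass:

```python
import re
from typing import NamedTuple, Optional

class MatchOption(NamedTuple):
    token_name: str
    regex: str
    command: Optional[str] = None   # None (noop), "push", "pop", "goto", "reset"
    arg: Optional[str] = None       # target state for "push"/"goto"

# The states table: each state maps to an ordered list of eligible
# match options, tried in order (first match wins).
STATES = {
    "default": [
        MatchOption("BEGIN", r"begin\b", "push", "block"),
        MatchOption("WORD",  r"[a-z]+"),
        MatchOption("WS",    r"\s+"),
    ],
    "block": [
        MatchOption("END",     r"end\b", "pop"),
        # Same text, different state, different token for the parser:
        MatchOption("KEYWORD", r"[a-z]+"),
        MatchOption("WS",      r"\s+"),
    ],
}

def lex(text, start="default"):
    """Push-down-automaton lexer: a state stack tracks nested contexts."""
    stack, state, pos, out = [], start, 0, []
    while pos < len(text):
        for opt in STATES[state]:
            m = re.match(opt.regex, text[pos:])
            if not m:
                continue
            out.append((opt.token_name, m.group()))
            pos += m.end()
            if opt.command == "push":
                stack.append(state)
                state = opt.arg
            elif opt.command == "pop":
                state = stack.pop()
            elif opt.command == "goto":
                state = opt.arg
            elif opt.command == "reset":
                stack.clear()
                state = start
            break
        else:
            raise ValueError(f"no match at {pos} in state {state!r}")
    return out
```

With this table, `lex("begin foo end foo")` yields BEGIN, WS, KEYWORD, WS, END, WS, WORD: the same word "foo" lexes as KEYWORD inside the block but as WORD outside it, which is the context-dependent behaviour described above.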
@davidmcnabnz Hard to say without seeing it first! Maybe it's safer to put it in a separate repo, and we can always copy it to the main one afterwards. |
@erezsh will do. Meanwhile, you could consider a slight relaxation on how custom lexer classes are validated. In my use case, the strict validation makes it harder to pass the state table into my own method which instantiates the lexer.
@davidmcnabnz I don't think closures are the worst thing, but I agree it would be nicer if there were a more relaxed way to do it. At the same time, I don't want to lose the guidance and protection that an ABC provides. Maybe in the future we can change it to a Protocol.
Describe the bug
In a contextual lexing setup, where numerous BasicLexer instances get spawned for the various contexts, many of these instances are being created with toxic "spam" terminals. In other words, their terminal lists include terminals that have no possibility of contributing to the fulfilment of any rule in their context.
This is causing text to be mis-classified in places, resulting in the creation of tokens that are illegal in the context, which in turn crashes the parse.
To Reproduce
At this stage I would struggle to distil my large proprietary parsing codebase into a simple example. With this bug report, I'm asking whether there are any general troubleshooting tips for finding out why invalid terminals are leaking into the context, and how to prevent this, short of writing some very aggressive introspection into my BasicLexer subclass.
I've tried juggling terminal priorities, but this is just an 'arms race' that doesn't resolve. If a terminal priority change fixes one context, it breaks others, and vice versa.