Does anyone use "reinforcement learning from compiler feedback" to train LLMs for code generation? It seems like, e.g., Codex was just made by doing extra training on GitHub code.
What I'm imagining is that you generate a bunch of completions and penalize the ones that don't compile (or that fail some other statically performable check, such as parsing without syntax errors).
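
To make the idea concrete, here is a rough sketch of what the reward signal might look like, assuming the completions are Python source and that "compiles" just means "parses without a syntax error". The names `compile_reward` and `score_completions` are made up for illustration, and the +1/-1 values are arbitrary. This is only the scoring step; the actual policy update (PPO or whatever) would consume these rewards.

```python
import ast


def compile_reward(completion: str) -> float:
    """Hypothetical reward: +1.0 if the text parses as Python, -1.0 otherwise.

    This is a purely static check; the code is never executed.
    """
    try:
        ast.parse(completion)
    except (SyntaxError, ValueError):
        # SyntaxError covers ordinary parse failures; ValueError covers
        # inputs such as null bytes on some Python versions.
        return -1.0
    return 1.0


def score_completions(completions: list[str]) -> list[float]:
    """Score a batch of sampled completions with the static check above."""
    return [compile_reward(c) for c in completions]


# Two sampled completions: the second is missing the colon after the signature.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b)\n    return a + b\n",
]
rewards = score_completions(samples)
```

For a compiled language you would presumably swap `ast.parse` for a call out to the real compiler and look at its exit status, but the shape of the loop stays the same.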