I'm looking for evidence for or against the claim that there was a GPT-2 training run that maximized the negative log-loss (i.e., ran with the sign of the usual objective flipped). I've heard it a couple of times on the internet, and I've already spread the meme myself, but I haven't seen it in a paper or blog post.
Consider this post a Schelling point for those who seek one.