If we had those widely distributed, people would likely use them for capabilities too and just widen the gap. (E.g. OpenAI, who talk about this as a strategy, are not to be trusted with it: I don't see them using such systems solely on alignment work for half a year; they would use them on both capabilities and alignment. Their plan is sound in that regard, though.)
But I disagree with the view that you can't have an alignment theorist that is not also a consequentialist.
I think the philosophy/math/cs system would be just as capable at capabilities work as at alignment work.
But I now remember an old idea of making STEMGPT: trained (in the weak case) only on STEM textbooks and arXiv, or (in the strong case) only on hadron collider data, protein structures, meteorological and geological data, &c. It would be hard to keep info about humans from leaking in, though.
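A minimal sketch of why that leakage is hard, assuming a naive keyword-based corpus filter (the term list, documents, and the `is_human_free` helper are all illustrative assumptions, not a real pipeline):

```python
# Hypothetical corpus filter for a "STEMGPT" dataset: reject any document
# that mentions human-related terms. Everything here is a toy illustration.

HUMAN_TERMS = {"we", "i", "human", "people", "person", "author", "society"}

def is_human_free(doc: str) -> bool:
    """Crude check: keep a document only if no human-related token appears."""
    tokens = {t.strip(".,;:!?()").lower() for t in doc.split()}
    return tokens.isdisjoint(HUMAN_TERMS)

corpus = [
    "Proton-proton collision events at 13 TeV, jet multiplicity spectra.",
    "We thank the author for helpful discussions.",  # leaks human context
]
filtered = [d for d in corpus if is_human_free(d)]
print(len(filtered))  # → 1: the acknowledgements line is dropped
```

Even this toy version shows the tension: nearly every arXiv paper says "we" and carries author metadata, so the weak case filters away most of its own corpus, and the strong case is effectively forced down to raw instrument data.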
How much of strawberry alignment is value-laden? 5%? 95%? Probably further along some logarithmic scale, if I had to bet.
Even with ML systems!
I agree that with most architectures, if you train them a lot to be capable alignment theorists, they probably develop inner optimizers that are capable consequentialists; but the alignment-theorist phase might be quite long (I could_{10%} see it extending past 100x human ability).