Sorry, let me expand on this a little.
In the talk, I heard Mixture of Experts explained as a technique that can reduce the compute requirements of large models by breaking them into smaller expert sub-models with sparse connections between them, so only a few experts run for any given input.
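To make that concrete, here's a minimal sketch of the sparse-routing idea as I understood it. The layer sizes, `num_experts`, `top_k`, and the gating scheme are my own illustrative choices, not anything stated in the talk.

```python
# Minimal sketch of a sparse Mixture-of-Experts layer with top-k gating.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The gate decides which experts each token is routed to.
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, d_model)
        logits = self.gate(x)                    # (batch, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so per-token compute
        # stays roughly constant even as num_experts (total parameters) grows.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

# Usage: each token touches only top_k of the 8 experts' parameters.
layer = SparseMoE()
y = layer(torch.randn(4, 64))
print(y.shape)  # torch.Size([4, 64])
```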
---
RT @lorenpmc
@cedar_xr @BasedBeffJezos Pointers to?
https://twitter.com/lorenpmc/status/1636598290727976960
I would love to know if I'm missing some major context on MoEs that distinguishes them from other modularization techniques and makes them especially deserving of discussion compared to the rest.