Sorry, let me expand on this a little.
In the talk, I heard Mixture of Experts explained as a way to reduce the compute requirements of large models by breaking them into smaller expert networks with sparse connections between them.
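To make the sparse-connection part concrete, here's a minimal sketch of a top-k gated MoE layer in the spirit of Shazeer et al.'s sparsely-gated MoE: a small learned gate picks k experts per input and only those experts run. The PyTorch framing, the 8-expert/top-2 setup, and the layer sizes are my own illustrative assumptions, not anything from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy Mixture-of-Experts layer with top-k sparse routing."""

    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                         # x: (batch, dim)
        scores = self.gate(x)                     # (batch, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)  # renormalize over the chosen k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]               # expert chosen in this slot, per input
            for e in idx.unique().tolist():
                mask = idx == e                   # inputs routed to expert e in this slot
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = SparseMoE(dim=64)
y = layer(torch.randn(16, 64))  # each input only runs through 2 of the 8 expert MLPs
```

The compute saving is that last point: total parameter count grows with the number of experts, but each input only pays for the k experts its gate selects.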
---
RT @lorenpmc
@cedar_xr @BasedBeffJezos Pointers to?
https://twitter.com/lorenpmc/status/1636598290727976960
Some of those techniques may perform better than the intuitive-sounding "partition the input space and use subnetworks, which could be seen as different models, for each sub-space" approach.
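For contrast, here's a rough sketch of that intuitive hard-partition idea (my own illustration, not taken from the tweet or the review): the input space is split up front, here by nearest fixed centroid, and each region gets its own subnetwork, so the routing is hard and not learned with the rest of the model.

```python
import torch
import torch.nn as nn

class HardPartitionNet(nn.Module):
    """Toy "one subnetwork per region of input space" model with a fixed partition."""

    def __init__(self, dim, num_parts=4):
        super().__init__()
        # Fixed centroids define the partition (Voronoi regions); a buffer, not learned.
        # In practice they might come from clustering data; random ones keep the sketch short.
        self.register_buffer("centroids", torch.randn(num_parts, dim))
        self.subnets = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_parts)
        )

    def forward(self, x):                                      # x: (batch, dim)
        parts = torch.cdist(x, self.centroids).argmin(dim=-1)  # hard, unlearned assignment
        out = torch.zeros_like(x)
        for p in range(len(self.subnets)):
            mask = parts == p
            if mask.any():
                out[mask] = self.subnets[p](x[mask])           # one subnetwork per region
        return out
```

The main difference from the MoE sketch above is that the assignment here is fixed and non-differentiable, while an MoE gate is trained jointly with the experts and can send one input to several experts at once.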
I tried my 1am best to find something that compares MoEs with other modularization techniques, but nothing came up. Here's a 2019 review on modularization techniques, though:
https://arxiv.org/abs/1904.12770