Sorry, let me expand on this a little.

In the talk, I heard Mixture of Experts explained as a technique that could reduce the compute requirements of large models by breaking them down into smaller expert models with sparse connections in between.
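To make the "sparse connections" part concrete, here's a minimal sketch of my rough understanding of a sparsely-gated MoE layer (all names and sizes, like num_experts and top_k, are my own illustrative choices, not anything from the talk):

```python
# Toy sparsely-gated MoE layer (illustrative sketch only).
# Each token is routed to only top_k of the num_experts sub-networks,
# so per-token compute scales with top_k rather than num_experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The gate decides which experts each token is sent to.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.gate(x)                  # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Which tokens were routed to expert e, and in which top-k slot.
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

moe = SparseMoE()
y = moe(torch.randn(10, 64))  # each of the 10 tokens only runs 2 of the 8 experts
```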
---
RT @lorenpmc
@cedar_xr @BasedBeffJezos Pointers to?
twitter.com/lorenpmc/status/16

This is something I typically hear in the context of modularity / sparsity (different names depending on whether you speak to the abstract people or the nitty-gritty people).

And my impression was that they are a very big family of techniques with varied performance.

Some of them may perform better than the intuitive-sounding "partition the input space and use subnetworks that could be seen as different models for each sub-space" approach.

I tried my 1am best to find something that compares MoEs with other modularization techniques, but nothing came up. Here's a 2019 review of modularization techniques, though:

arxiv.org/abs/1904.12770

I would love to know if I'm missing some major context on MoEs that distinguishes them from the other modularization techniques and makes them especially deserving of discussion.
