> GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference.
There was a post on HackerNews the other day about a 13B open source model.
Any 220B open source models? Why or why not?
I wonder what the 8 categories were. I wonder what goes into identifying tokens and then trying to guess which category/model you should look up. What if a token fits between two models? How do the models route between each other?
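GPT-4's actual routing mechanism isn't public, but a minimal sketch of standard top-k mixture-of-experts gating makes the question concrete: a learned gate scores each token, the top-k experts process it, and their outputs are mixed by the gate weights. All names here (`gate_w`, `expert_w`, `moe_layer`) are hypothetical illustration, not anything from the rumor above.

```python
# Minimal top-k mixture-of-experts routing sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2

# Hypothetical parameters: a gating matrix plus one linear "expert" per slot.
gate_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens):                             # tokens: (n_tokens, d_model)
    scores = softmax(tokens @ gate_w)              # gate score per token per expert
    top = np.argsort(-scores, axis=-1)[:, :top_k]  # the k experts chosen for each token
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        weights = scores[t, top[t]]
        weights = weights / weights.sum()          # renormalize over the chosen experts
        for w, e in zip(weights, top[t]):
            out[t] += w * (expert_w[e] @ token)    # weighted sum of expert outputs
    return out

print(moe_layer(rng.normal(size=(4, d_model))).shape)  # -> (4, 16)
```

In a setup like this a token never "goes between" two models in sequence; the gate just splits its weight across the chosen experts and the outputs are summed.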
220B open source models wouldn't be as useful for most users.
You already need two 24 GB RTX 3090 cards to run inference with a 65B model quantized to 4 bits. Going beyond that (already expensive) hardware is out of reach for the average hobbyist developer.
You could run it quantized to 4 bits on CPU with 256 GB of RAM, which is much cheaper to rent or buy. Sure, it might be somewhat slow, but for lots of use cases that doesn't matter.
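Back-of-the-envelope memory math (weights only, roughly 0.5 bytes per parameter at 4-bit; the KV cache and activations add several more GB on top):

```python
# Weights-only memory estimate at 4-bit quantization (~0.5 bytes/parameter).
for params_b in (65, 220):
    weight_gb = params_b * 0.5   # billions of params * 0.5 bytes/param ~= GB of weights
    print(f"{params_b}B parameters -> ~{weight_gb:g} GB of weights")
# 65B  -> ~32.5 GB: just fits across two 24 GB GPUs at 4-bit
# 220B -> ~110 GB: beyond consumer GPU VRAM, but comfortable in 256 GB of system RAM
```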
Benchmarks I've run on a Ryzen 7950X with 128 GB RAM and an NVIDIA GeForce RTX 3060 with 12 GB VRAM show less than a 2x slowdown when not using the GPU, with llama.cpp as the inference platform and various ggml open source models in the 7B-13B parameter range.
The Ryzen does best with 16 threads, not the 32 it is capable of, which is expected since it has 16 physical cores.
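A hypothetical sketch of that thread sweep, timing llama.cpp's CLI at different -t values (the model path and prompt are placeholders):

```python
# Hypothetical thread-sweep script for llama.cpp's CLI; not the exact
# benchmark run above.
import subprocess, time

MODEL = "models/7B/ggml-model-q4_0.bin"   # placeholder path to a 4-bit ggml model

for threads in (8, 16, 24, 32):
    start = time.time()
    subprocess.run(
        ["./main", "-m", MODEL, "-t", str(threads), "-n", "64",
         "-p", "Explain mixture-of-experts routing in one paragraph."],
        check=True, capture_output=True,
    )
    print(f"{threads:2d} threads: {time.time() - start:.1f} s")
# On a 16-core/32-thread part, 16 threads typically wins: generation is
# memory-bandwidth bound, so SMT siblings mostly add contention.
```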