> GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference.
There was a post on HackerNews the other day about a 13B open source model.
Any 220B open source models? Why or why not?
I wonder what the 8 categories were. I wonder what goes into identifying tokens and then trying to guess which category/model you should look up. What if a token fits between two models? How do the models route between each other?
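GPT-4's actual routing mechanism isn't public, but a minimal sketch of standard top-k mixture-of-experts gating makes the question concrete: a learned gate scores each token, the top-k experts process it, and their outputs are mixed by the gate weights. All names here (`gate_w`, `expert_w`, `moe_layer`) are hypothetical illustration, not anything from the rumor above.

```python
# Minimal top-k mixture-of-experts routing sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2

# Hypothetical parameters: a gating matrix plus one linear "expert" per slot.
gate_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens):                             # tokens: (n_tokens, d_model)
    scores = softmax(tokens @ gate_w)              # gate score per token per expert
    top = np.argsort(-scores, axis=-1)[:, :top_k]  # the k experts chosen for each token
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        weights = scores[t, top[t]]
        weights = weights / weights.sum()          # renormalize over the chosen experts
        for w, e in zip(weights, top[t]):
            out[t] += w * (expert_w[e] @ token)    # weighted sum of expert outputs
    return out

print(moe_layer(rng.normal(size=(4, d_model))).shape)  # -> (4, 16)
```

In a setup like this a token never "goes between" two models in sequence; the gate just splits its weight across the chosen experts and the outputs are summed.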
220B open source models wouldn't be as useful for most users.
You already need two 24 GB RTX 3090 cards to run inference with a 65B model quantized to 4 bits. Going beyond that (already expensive) hardware is out of reach for the average hobbyist developer.
You could run it quantized to 4 bits on CPU with 256 GB of RAM, which is much cheaper to rent or buy. Sure, it might be somewhat slow, but for lots of use cases that doesn't matter.
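Back-of-the-envelope memory math (weights only, roughly 0.5 bytes per parameter at 4-bit; the KV cache and activations add several more GB on top):

```python
# Weights-only memory estimate at 4-bit quantization (~0.5 bytes/parameter).
for params_b in (65, 220):
    weight_gb = params_b * 0.5   # billions of params * 0.5 bytes/param ~= GB of weights
    print(f"{params_b}B parameters -> ~{weight_gb:g} GB of weights")
# 65B  -> ~32.5 GB: just fits across two 24 GB GPUs at 4-bit
# 220B -> ~110 GB: beyond consumer GPU VRAM, but comfortable in 256 GB of system RAM
```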
Benchmarks I've run on a Ryzen 7950X with 128 GB RAM and an NVIDIA GeForce RTX 3060 with 12 GB VRAM show less than a 2x slowdown when not using the GPU, with llama.cpp as the inference platform and various ggml open source models in the 7B-13B parameter range.
The Ryzen does best with 16 threads, not the 32 it is capable of, which is expected since it has 16 physical cores.
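A hypothetical sketch of that thread sweep, timing llama.cpp's CLI at different -t values (the model path and prompt are placeholders):

```python
# Hypothetical thread-sweep script for llama.cpp's CLI; not the exact
# benchmark run above.
import subprocess, time

MODEL = "models/7B/ggml-model-q4_0.bin"   # placeholder path to a 4-bit ggml model

for threads in (8, 16, 24, 32):
    start = time.time()
    subprocess.run(
        ["./main", "-m", MODEL, "-t", str(threads), "-n", "64",
         "-p", "Explain mixture-of-experts routing in one paragraph."],
        check=True, capture_output=True,
    )
    print(f"{threads:2d} threads: {time.time() - start:.1f} s")
# On a 16-core/32-thread part, 16 threads typically wins: generation is
# memory-bandwidth bound, so SMT siblings mostly add contention.
```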