Andrej's series is excellent, Sebastian's book + this video are excellent. There's a lot of overlap but they go into more detail on different topics or focus on different things. Andrej's entire series is absolutely worth watching, his upcoming Eureka Labs stuff is looking extremely good too. Sebastian's blog and book are definitely worth the time and money IMO.
Nice write up Sebastian, looking forward to the book. There are lots of details on the LLM and how it’s composed, would be great if you can expand on how Llama and OpenAI could be cleaning and structuring their training data given it seems this is where the battle is heading in the long run.
But isn't the beauty of LLMs that they need comparatively little preparation (unstructured text as input) and pick out the features on their own, so to speak?
I really like Sebastian's content but I do agree with you. I didn't get into deep learning until starting with Karpathy's series, which starts by creating an autograd engine from scratch. Before that I tried learning with fast.ai, which dives immediately into building networks with PyTorch, but I noped out of there quickly. It felt about as fun as learning Java in high school. I need to understand what I'm working with!
Maybe it's just different learning styles. Some people, me included, like to start getting some immediate real world results to keep it relevant and form some kind of intuition, then start peeling back the layers to understand the underlying principles. With fastAI you are already doing this by the 3rd lecture.
Like driving a car: you don't need to understand what's under the hood when you start driving, but eventually understanding it makes you a better driver.
For sure! In both cases I imagine it is a conscious choice where the teachers thought about the trade-offs of each option. Both have their merits. Whenever you write learning material you have to decide where to draw the line of how far you want to break down the subject matter. You have to think quite hard about exactly who you are writing for. It's really hard to do!
You seem to be implying that the top-down approach is a trade-off that involves not breaking the subject matter down into lower-level details. I think the opposite is true - when you go top-down you can keep teaching lower and lower layers, all the way down to physics if you like!
Oh that’s interesting to know! I guess I gel better with bottom up. As soon as I start seeing API functions I don’t understand I immediately want to know how they work!
Bach (Johann Sebastian .. there were many musical Bachs in the family) owned and wrote for harpsichords, lute-harpsichords, violin, viola, cellos, a viola da gamba, lute and spinet.
Never had a piano, not even a fortepiano .. though reportedly he played one once.
He had to improvise on the Hammerklavier when visiting Frederick the Great in Potsdam. That (improvising for Frederick) is also the starting point for the later creation of https://en.wikipedia.org/wiki/The_Musical_Offering .
We’re digressing to get way off the whole point of the comment, but to address your point, actually piano design has been an area of great innovation over the centuries, with different companies doing it in considerably different ways.
Considering i seem to be the minority here based on all the other responses to the message you replied to, the answer i'd give is "by mine, i guess".
At least when i saw the title "Building LLMs from the Ground Up", what i expected was someone to open vim, emacs or their favorite text editor and start writing some C code (or something around that level) to implement, well, everything from the "ground" (the operating system's user space which in most OSes is around the overall level of C) and "up".
The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.
To a statistician or a practitioner approaching machine learning from a mathematical perspective, the computational details are a distraction.
Yes, these models would not be possible without automatic differentiation and massively parallel computing. But there is a lot of rich detail to consider in building up the model from first mathematical principles, motivating design choices with prior art from natural language processing, various topics related to how input data is represented and loss is evaluated, data processing considerations, putting things into the context of machine learning more broadly, etc. You could fill half a book chapter with that kind of content (and people do), without ever talking about computational details beyond a passing mention.
In my personal opinion, fussing over manual memory management is far afield from anything useful unless you want to actually work on hardware or core library implementations like PyTorch. Nobody else in industry is doing that.
> The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.
But if all is relative and depends on your PoV that implies that there isn't actually a problem here, right? :-P
I don't think there is anything wrong with "building up the model from first mathematical principles" as you wrote, it just wasn't what i personally had in mind with the "from the ground up" part.
And FWIW i'm not that stuck up on the "vim and C" aspect, i used those as an example that i expected most would understand and leave little room for misinterpretation in what you'd have to work with (i.e. very very little) and have to implement yourself (pretty much everything) - personally i'd consider it from "the ground up" even if it was in C#, D, Java, JavaScript or even Python, as long as the implementation was done in a way that didn't rely on 3rd party libraries so that whatever is implemented in, say, Java could also be implementable in C#, D, JavaScript or Python with just whatever is available out of the box in those languages or even C, if one doesn't mind writing the extra bookkeeping functionality themselves.
Right, but again I think the emphasis on avoiding 3rd party libraries isn't really relevant to machine learning. The "from scratch" here is avoiding 3rd party implementations of the transformer model, building up from the math on paper and then letting the AD/computation framework do its thing.
One does not exclude the other though. "Avoiding 3rd party implementations of the transformer model" is a subset of "avoiding 3rd party libraries". "From scratch" is, as seen, vague enough for different people to interpret it in different ways. Despite being the minority in this thread, i do not think my interpretation is any less valid - especially since some people have already done such "from scratch" implementations (i.e. in C or C++ with no 3rd party dependencies).
Gluing together premade components is not “from the ground up” by most people’s definition.
People are looking at the ground up for a clear picture of what the thing is actually doing, so masking the important part of what is actually happening, then calling it “ground up” is disingenuous.
Yes, but "what the thing is actually doing" is different depending on what your perspective is on what "the thing" and what "actually" consists of.
If you are interested in how the model works conceptually, how training works, how it represents text semantically, etc., then I maintain that computational details are an irrelevant distraction, not an essential foundation.
How about another analogy? Is SICP not a good foundation for learning about language design because it uses Scheme and not assembly or C?
From scratch is relative. To a python programmer, from scratch may mean starting with dictionaries but a non-programmer will have to learn what python dicts are first.
To someone who already knows Excel, from scratch with Excel sheets instead of Python may work for them.
For the record, if you do not know what a dict actually is, and how it works, it is impossible to use it effectively.
Although if your claim is then that most programmers do not care about being effective, that I would tend to agree with given the 64 gigs of ram my basic text editors need these days.
>For the record, if you do not know what a dict actually is, and how it works, it is impossible to use it effectively.
While I agree it's good to know how your collections work, "efficient key-value store" may be enough to use it effectively 80% of the time for somebody dabbling in Python.
Sadly I've met enough people that call themselves programmers that didn't even have such a surface level understanding of it.
No it is not. From scratch has a meaning. To me it means: in a way that lets you understand the important details, e.g. using a programming language without major dependencies.
Calling that from scratch is like saying "Just go to the store and tell them what you want" in a series called: "How to make sausage from scratch".
When I want to know how to do X from scratch I am not interested in "how to get X the fastest way possible", to be frank I am not even interested in "How to get X in the way others typically get it", what I am interested in is learning how to do all the stuff that is normally hidden away in dependencies or frameworks myself — or, you know, from scratch. And considering the comments here I am not alone in that reading.
Your definition doesn’t match mine. My definition is fuzzier. It is “building something using no more than the common tools of the trade”. The term “common” is very era dependent.
For example, building a web server from scratch - I’d probably assume the presence of a sockets library or at the very least network card driver support. For logging and configuration I’d assume standard I/O support.
It probably comes down to what you think makes LLMs interesting as programs.
It is okay to differ on this. Language is not an exact science. It is however always good to factor in expectations when you describe things.
E.g. when a title says it shows you how to do a thing in vanilla JavaScript from scratch, bringing in jQuery in the first step makes that title a lie. If you bring in a hefty dependency in step 1 and run three imported functions, the vanilla JavaScript part might be fine, but the "from scratch" starts to do some heavy lifting.
You could always go deeper and from some points of view, it's not "from the ground up" enough unless you build your own autograd and tensors from plain numpy arrays.
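For anyone wondering what "build your own autograd" involves at its smallest, here is a rough sketch. It is scalar-only and written in plain JavaScript rather than on top of numpy arrays, so it is an illustration of the idea (reverse-mode differentiation over a recorded graph), not a stand-in for a tensor library. The class name `Value` and its methods are made up for this example.

```javascript
// Minimal scalar reverse-mode autodiff (a micrograd-style sketch).
// `Value` and its methods are illustrative names, not a library API.
class Value {
  constructor(data, parents = []) {
    this.data = data;            // the number this node holds
    this.grad = 0;               // d(output)/d(this), filled in by backward()
    this.parents = parents;      // nodes this value was computed from
    this.backwardFn = () => {};  // pushes this.grad back to the parents
  }

  add(other) {
    const out = new Value(this.data + other.data, [this, other]);
    out.backwardFn = () => {
      this.grad += out.grad;
      other.grad += out.grad;
    };
    return out;
  }

  mul(other) {
    const out = new Value(this.data * other.data, [this, other]);
    out.backwardFn = () => {
      this.grad += other.data * out.grad;
      other.grad += this.data * out.grad;
    };
    return out;
  }

  tanh() {
    const t = Math.tanh(this.data);
    const out = new Value(t, [this]);
    out.backwardFn = () => {
      this.grad += (1 - t * t) * out.grad;
    };
    return out;
  }

  backward() {
    // Topologically sort the graph, then apply the chain rule in reverse.
    const order = [];
    const seen = new Set();
    const visit = (node) => {
      if (seen.has(node)) return;
      seen.add(node);
      node.parents.forEach(visit);
      order.push(node);
    };
    visit(this);
    this.grad = 1;
    for (let i = order.length - 1; i >= 0; i--) order[i].backwardFn();
  }
}

// y = tanh(w * x + b)
const x = new Value(2);
const w = new Value(-3);
const b = new Value(1);
const y = w.mul(x).add(b).tanh();
y.backward();
console.log(y.data, x.grad, w.grad, b.grad);
```

A framework does the same bookkeeping over whole tensors and with far more operations, which is roughly the layer this subthread is arguing about keeping or hiding.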
Your comment is one of the most pompous that I've ever read.
NVIDIA's value lies only in the PyTorch and CUDA optimizations relative to a pure C implementation, so saying that you need to go lower level than CUDA or PyTorch simply means reinventing NVIDIA. Good luck with that.
1. I only said the meaning of the title is wrong, and I praised the content
2. I didn't say CUDA wouldn't be ground up or low level (please re-read) (I say in another comment about a no-code guide with CUDA, but it's obviously a joke)
3. And finally, I think your comment comes out as holier than thou and finger pointing and making a huge deal out of a minor semantic observation.
PyTorch is low level enough to understand and interpret each and every passage. In PyTorch, you can use built-in transformers, or code them yourself down to the "lowest" level at which there's still a theoretical meaning. So PyTorch is just a tool and your comment was just pompous and empty.
Wanted to say the same thing. As an educator who once gave a course on a similar topic for non-programmers you need to start way, way earlier.
E.g.
1. Programming basics
2. How to manipulate text using programs (reading, writing, tokenization, counting words, randomization, case conversion, ...)
3. How to extract statistical properties from texts (ngrams, etc, ...)
4. How to generate crude text using markov chains
5. Improving on markov chains and thinking about/trying out different topologies
Etc.
Sure, Markov chains are not exactly LLMs, but they are a good starting point to build an intuition for how programs can extract statistical properties from text and generate new text based on that. It also gives you a feeling for how programs can work on text.
If you start directly with a framework there is some essential understanding missing.
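Steps 2 to 4 of the outline above fit in a few dozen lines, which is part of why they make a good starting point. The following is a minimal sketch, assuming a crude tokenizer that lowercases and splits on anything that is not a letter or apostrophe; the function names (`tokenize`, `buildBigramCounts`, `sampleNext`, `generate`) and the toy corpus are invented for illustration.

```javascript
// Step 2: crude tokenization (lowercase, split on non-letters).
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z']+/).filter(Boolean);
}

// Step 3: count bigrams, so counts.get(prev).get(next) is how often
// `next` followed `prev` in the training text.
function buildBigramCounts(tokens) {
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const prev = tokens[i];
    const next = tokens[i + 1];
    if (!counts.has(prev)) counts.set(prev, new Map());
    const row = counts.get(prev);
    row.set(next, (row.get(next) || 0) + 1);
  }
  return counts;
}

// Step 4: sample the next word in proportion to its count.
// `random` is injectable so runs can be made deterministic.
function sampleNext(counts, prev, random = Math.random) {
  const row = counts.get(prev);
  if (!row) return null; // word never had a successor in the corpus
  let total = 0;
  for (const c of row.values()) total += c;
  let r = random() * total;
  for (const [word, c] of row) {
    r -= c;
    if (r < 0) return word;
  }
  return null; // only reachable if random() returns 1 or more
}

function generate(counts, start, maxWords, random = Math.random) {
  const words = [start];
  while (words.length < maxWords) {
    const next = sampleNext(counts, words[words.length - 1], random);
    if (next === null) break;
    words.push(next);
  }
  return words.join(" ");
}

const corpus = "the cat sat on the mat. the dog sat on the rug.";
const counts = buildBigramCounts(tokenize(corpus));
console.log(generate(counts, "the", 8));
```

Note that this version ignores sentence boundaries, so "mat" is followed by "the" across the full stop. Fixing that, or moving from bigrams to longer contexts, is the kind of thing step 5 is about.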
Beyond learning how it all works and demo, there is not much practical usage. You can train it on current events if you feed that corpus during training instead of just OpenWebText. Shouldn't be hard.
Quite a cry, in a submission page from one of the most language "obsessed" in this community.
Now: "code" is something you establish - as the content of the codex medium (see https://en.wikipedia.org/wiki/Codex for its history); from the field of law, a set of rules, exported in use to other domains since at least the mid XVI century in English.
"Program" is something you publish, with the implied content of a set of intentions ("first we play Bach then Mozart" - the use postdates "code"-as-"set of rules" by centuries).
"Develop" is something you unfold - good, but it does not imply "rules" or "[sequential] process" like the other two terms.
I am from Brazil and I find this funny because in my circle of friends/co-workers we mostly use "coding" when speaking English, or "codar" (code as a Portuguese verb) with other Brazilians.
I am not sure why, but I think it is because "program" has a strong association with prostitution in Brazilian Portuguese.
I'm from Europe and my language doesn't have an equivalent to "coding" but i'm still using the English word "coder" and "coding" for decades - in my case i learned it from the demoscene where it was always used for programmers since the 80s. FWIW the Demoscene is (or was at least) largely a European thing (groups outside of Europe did exist but the majority of both groups and demoparties were -and i think still are- in Europe) so perhaps there is some truth about the "coding" word being a European thing (e.g. it sounded ok in some languages and spread from there).
Also in my ears coder always sounded cooler than programmer and it wasn't until a few years ago i first heard that to some people it has negative connotations. Too late to change though, it still sounds cooler to me :-P.
I am from Europe and I am not completely sure about that to be honest. I also prefer programming.
I also dislike software development as it reminds me of developing a photographic negative – like "oh let's check out how the software we developed came out".
It should be software engineering, and it should be held to a standard similar to other engineering fields unless it is done in a non-professional context.
The word "development" can mean several things. I don't think "software development" sounds bad when grouped with a phrase like "urban development".
It describes growing and tuning software for, well, working better, solving more needs, and with fewer failure modes.
I do agree that a "coder" creates code, and a programmer creates programs. I expect more of a complete program than of a bunch of code. If a text says "coder", it does set an expectation about the professionalism of the text.
And I expect even more from a software solution created by a software engineer. At least a specification!
Still, I, a professional software engineer and programmer, also write "code" for throwaway scripts, or just for myself, or that never gets completed. Or for fun.
I will read articles by and for coders too.
The word is a signal. It's neither good nor bad, but if that's not the signal the author wants to send, they should work on their communication.
Wrong angle. There is a problem, your consideration of the problem, the refinement of your solution to the problem: the solution gradually unfolds - it is developed.
This is the exact level of detail I was looking for. I'm fairly experienced with deep learning and PyTorch and don't want to see them built from scratch. I found Andrej's materials too low level and I tend to get lost in the weeds. This is not a criticism but just a comment for someone in a similar situation as I am.
This is great. Just yesterday I was wondering how exactly transformers/attention and LLMs work. I'd worked through how back-propagation works in a deep RNN a long while ago and thought it would be interesting to see the rest.
This is great! Hope it works on a Windows 11 machine too (I often find that when Windows isn't explicitly mentioned, the code isn't tested on it and usually fails to work due to random issues).
This page is just a container for a youtube video. I suggest updating this HN link to point to the video directly, which contains the same links as the page in its description.
yeah really valuable stuff. so we know how the ginormous model that we can't train or host works (in practice there are so many hacks and optimizations that none of them work like this). great.
This is excellent. Thanks for sharing. It's always good to go back to the fundamentals. There's another resource that is also quite good: https://jaykmody.com/blog/gpt-from-scratch/
Neither the author of the GPT from scratch post nor eclectic29, who recommended it above, ever promised that the post is about building LLMs from the ground up. That was the original post.
The GPT from scratch post explains, from the ground up, ground being numpy, what calculations take place inside a GPT model.
Language is the language model that extends Transformer. Transformer is a base model for any kind of token (words, pixels, etc.).
However, currently there is some language-specific stuff in Transformer that should be moved to Language :) I'm focusing first on language models, and getting into image generation next.
No, I mean, a transformer is a very specific model architecture, and your simple language model has nothing to do with that architecture. Unless I’m missing something.
I still call it a transformer because the inputs are tokenized and computed to produce completions, not from lookups or assembling based on rules.
> Unless I'm missing something.
Only that I said "without taking the LLM approach" meaning tokens aren't scored in high-dimensional vectors, just as far simpler JSON bigrams. I don't think that disqualifies using the term "transformer" - I didn't want to call it a "computer" or a "completer". Have a better word?
But the n-gram approach is better, I don't think vectors start to pull away on accuracy until they are capturing a lot more contextual information (where there is already a lot of context inferred from the structure of an n-gram).
Calling it a "transformer" is misleading when discussing language modelling because it now means a very specific ML architecture while your project seems to be about Markov chains + hardcoded rules using regexps https://github.com/bennyschmidt/llimo/blob/master/models/Cha...
The idea of tokenizing words and producing completions is not unique to the original transformers, it's a basic idea from NLP. So I'm not sure why you think it should be called a transformer just because it uses tokenized inputs and produces completions as well. It's like saying your new programming language has a "Java-based architecture" simply because they both have classes (and nothing else in common otherwise).
>I didn't want to call it a "computer" or a "completer". Have a better word?
I've seen projects which also use Markov chains + additional rules on top; for example, there are quite a few projects called "Markov chains with POS tagging".
Not quite sure about "it's not based on rules" when your code has things like:
const MATCH_FIRST_MODAL = new RegExp(/IS|AM|ARE|WAS|HAS|HAVE|HAD|MUST|MAY|MIGHT|WERE|WILL|SHALL|CAN|COULD|WOULD|SHOULD|OUGHT|DOES|DID/);
or
const properNoun = `${part.value} `;
if (isPrevNNP) {
result += prependArticle(query, properNoun);
}
Pretty sure your examples in the video are also cherry-picked. The very first example is you asking "where is Paris?" What really happens is, one of the hardcoded regexps transforms it to "Paris is" and then the bigram model repeats the second sentence in the Paris dataset verbatim.
Same with tile engines & game dev. Say I wanted to rotate a map:
Input
[
[0, 0, 1],
[0, 0, 0],
[0, 0, 0]
]
Output
[
[0, 0, 0],
[0, 0, 0],
[0, 0, 1]
]
The function is a "transformer" because it is not looking up some rule that says where to put the new values, it's performing math on the data structure whose result determines the new values.
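The input and output shown above are consistent with a 90-degree clockwise rotation (the top-right cell ends up bottom-right). A minimal sketch of that computation, assuming a non-empty rectangular grid; `rotateClockwise` is a name chosen for this example and is not taken from any of the projects discussed here.

```javascript
// Rotate a non-empty rectangular grid 90 degrees clockwise.
// The cell at (row, col) moves to (col, rows - 1 - row).
function rotateClockwise(grid) {
  const rows = grid.length;
  const cols = grid[0].length;
  const out = Array.from({ length: cols }, () => new Array(rows).fill(0));
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      out[c][rows - 1 - r] = grid[r][c];
    }
  }
  return out;
}

const map = [
  [0, 0, 1],
  [0, 0, 0],
  [0, 0, 0],
];
console.log(rotateClockwise(map));
```

The new positions come from index arithmetic rather than a stored table of moves, which is the sense of "transform" being argued for here. Whether that makes "transformer" a good name for a language model is the separate question this subthread is about.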
> Not quite sure about "it's not based on rules" when your code has things like:
>
> const MATCH_FIRST_MODAL
Totally irrelevant to the topic. This is the chat interface itself which mostly just parses questions into cursors to be completed. You would be a fool to think ChatGPT has no NLP or parts-of-speech analysis. text-ada-embedding itself uses POS.
> Pretty sure your examples in the video are also cherry-picked
Fantastic detective work, you caught me. But just to confirm - why not just use it yourself? npm i next-token-prediction
Don't forget to log the completions to prove that they aren't broken down by token and are instead just doing key/val lookups or text searches, as you said.
> What really happens is, one of the hardcoded regexps transforms it to "Paris is"
The only thing you got right - that questions are transformed into sentences using conventional NLP in order to complete them. This functionality is what makes it a chat bot that you can ask questions.
It's still misleading to call it a transformer in the context of NLP. It doesn't matter what it means in other, non-NLP areas (linear algebra, CSS or gamedev).
It's like creating a procedural language and calling it "functional" because it has functions. Sure the concept of functions existed long before compsci but it would be very misleading because "functional programming" is a well-established term.
>You would be a fool to think ChatGPT has no NLP or parts-of-speech analysis
Pretty sure it doesn't. At least it's not required to. I've run lots of local models and it's just model weights without hardcoded regexps. In fact, I was able to feed grammar rules of an invented language into Claude Sonnet and it was able to construct proper sentences.
Again they are the exact same concept. Whether vectors represent tiles in a video game, an object in CSS, matrix algebra you took in school, or the semantics of words used by LLMs, in all cases it's the same meaning of the word "transform". It's not specific to language models at all - which was the thesis of your whole argument.
What is meant by "sliding window" or "skip gram" is bigram mapping (or other n-gram).
This is ML 101.
It's the same training methodology and data structure used in my next-token-prediction lib, and is widely used for training for LLMs. Ask your local AI to explain the basics, or see examples like: https://www.kaggle.com/code/hamishdickson/training-and-plott...
> ChatGPT doesn't use parts-of-speech
Yes it does, there's not only a huge business in tagging data (both POS and NER) adjacent to AI, but OpenAI specifically famously used African workers on very low wages to tag a bunch of data. ChatGPT uses text-embedding-ada, you'll have to put 2 and 2 together as they don't open source that part.
Mistral says:
"The preprocessing stage of Text-Embedding-ADA-002 involves applying POS tags to the input text using a separate POS tagger like Spacy or Stanford NLP. These POS tags can be useful for segmenting sentences into individual words or tokens."
>It's not specific to language models at all - which was the thesis of your whole argument.
I didn't say that it's unique to LMs. My argument is that saying "my LM is a transformer" is misleading because "transformer" in the context of LMs means a very specific architecture. You're deliberately misusing terms, probably to draw attention to your project.
>OpenAI specifically famously used African workers on very low wages to tag a bunch of data
Did they tag Polish parts of speech too? Or Ancient Greek? ChatGPT constructs grammatically correct Ancient Greek. I thought they tagged "harmful/non-harmful", not parts of speech?
>ChatGPT uses text-embedding-ada
[Citation needed]
NanoGPT, for example, learns embeddings together with the rest of the network so, as I said, manual tagging is not required.
Anyway, looking forward to hearing news about your image generation project. Any news?
Nobody denied the term is used in language models, I only pointed out that they use that term because of what it already means in the context of vector operations (long before OpenAI).
The Wikipedia article on deep learning transformers:
All transformers have the same primary components:
- Tokenizers, which convert text into tokens.
- Embedding layer, which converts tokens and positions of the tokens into vector representations.
- Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
- Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
Where does it say bigrams can't be used for next-token prediction? Or that you can't tag data? Note "...which converts tokens and positions of the tokens..."
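Since the quoted list turns on what the "attention" layers do, here is a rough sketch of that one step: single-head, causal, scaled dot-product self-attention in plain JavaScript. It leaves out the tokenizer, positional information, feedforward layers, multiple heads and training entirely, and the helper names (`softmax`, `matmul`, `transpose`, `selfAttention`) and the toy inputs are made up for illustration.

```javascript
// Numerically stable softmax over one row of scores.
function softmax(row) {
  const max = Math.max(...row);
  const exps = row.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Plain matrix multiply: (n x k) times (k x m) gives (n x m).
function matmul(a, b) {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((acc, v, k) => acc + v * b[k][j], 0))
  );
}

function transpose(m) {
  return m[0].map((_, j) => m.map((row) => row[j]));
}

// x holds one embedding vector per token position.
// wq, wk, wv are the (normally learned) projection matrices.
function selfAttention(x, wq, wk, wv) {
  const q = matmul(x, wq);
  const k = matmul(x, wk);
  const v = matmul(x, wv);
  const scale = Math.sqrt(q[0].length);
  const scores = matmul(q, transpose(k));
  // Causal mask: position i may only look at positions j <= i.
  const weights = scores.map((row, i) =>
    softmax(row.map((s, j) => (j <= i ? s / scale : -Infinity)))
  );
  // Each output row is a weighted mix of the value vectors it can see.
  return { weights, output: matmul(weights, v) };
}

const x = [
  [1, 0],
  [0, 1],
  [1, 1],
];
const identity = [
  [1, 0],
  [0, 1],
];
const result = selfAttention(x, identity, identity, identity);
console.log(result.weights);
```

The point of comparison with a bigram table is that each position's output here depends on every earlier position, with the mixing weights computed from the vectors themselves, whereas a bigram lookup conditions only on the single previous token.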
> You're deliberately misusing terms, probably to draw attention to your project.
Haha well since I have like 30 followers and the npm is free/MIT whatever scheme you think I'm up to it's not working. Anyway a text autocomplete library is not exactly viral material. Jokes aside, no I am trying to use accurate terms that make sense for the project.
Could just make it anonymous - `export default () => {}` - and call the file `model.js`. What would you call it?
> Did they tag Polish parts of speech too? Or Ancient Greek?
Yes, all the foreign words with special characters were tokenized and trained on. An LLM doesn't "know any language". If it never trained on any Polish word sequences it would not be able to output very good Polish sequences any more than it could output good JavaScript. It's not that it has to train on Polish to translate Polish per se, but it does have to have the language coverage at the token level to be able to perform such vector transformations - which is probably most easily accomplished by training on Polish-specific data.
> The model was initialised from AUEB NLP Group's Greek BERT and subsequently trained on monolingual data from the First1KGreek Project, Perseus Digital Library, PROIEL Treebank and Gorman's Treebank
First1KGreek Project
> The goal of this project is to collect at least one edition of every Greek work composed between Homer and 250CE
> The new model, text-embedding-ada-002, replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks, while being priced 99.8% lower.
> Anyway, looking forward to hearing news about your image generation project. Any news?
Not yet! Feel free to follow on GitHub or even help out if you're really interested in it. Would be cool to have pixel prediction as snappy as text autocomplete.
And it fits the definition doesn't it since it tokenizes inputs to compute them against pre-trained ones, rather than being based on rules/lookups or arbitrary logic/algorithms?
Even in CSS a matrix "transform" is the same concept - the word "transform" is not unique to language models, more a reference to how 1 set of data becomes another by way of computation.
Same with tile engines / game dev. Say I wanted to rotate a map, this could be a simple 2D tic-tac-toe board or a 3D MMO tile map, anything in between:
Input
[
[0, 0, 1],
[0, 0, 0],
[0, 0, 0]
]
Output
[
[0, 0, 0],
[0, 0, 0],
[0, 0, 1]
]
The method that takes the input and gives that output is called a "transformer" because it is not looking up some rule that says where to put the new values, it's performing math on the data structure whose result determines the new values.
It's not unique to language models. If anything vector word embeddings are much later to this concept than math and game dev.
I used Three.js to build https://www.playshadowvane.com/ - built the engine from scratch and recall working with vectors (e.g. THREE Vector3 for XYZ stuff) years before they were being popularized by LLMs.
Please do yourself a favor and google “transformer paper”. Open the very first result and read the pdf. Hopefully it will become clear what people mean when they say “transformer” in ML context, and you will finally realize how silly you look in this thread.
You'll look twice as silly after thinking vectors are unique to LLMs, or that the word "transformer" has anything to do with LLMs rather than lower-level array math.
Consider that a "vector database" is a very specific technology - yet the word "vector" is not off limits in other database related libraries, especially if dealing with vectors.
In any case - if you think I'm trying to pass it off as something else, what I call "transformer" does tokenize lots of text (breaks it down by ~word, ~pixel) and derives semantic values (AKA trains) to produce real-time completions to inputs by way of math, not lookups. It fits the definition even in that sense where "transformer" meant something more abstract than the mathematical term.
^ If you look at this GitHub repo, should be obvious it's a token prediction library - the video of the browser demo shown there clearly shows it being used with an <input /> to autocomplete text based on your domain-specific data. Is THAT a Markov chain, nothing more? What a strange question, the answer is an obvious "No" - it's a front-end library for predicting text and pixels (AKA tokens).
This project, which uses the aforementioned library is a chat bot. There's an added NLP layer that uses parts-of-speech analysis to transform your inputs into a cursor that is completed (AKA "answered"). See the video where I am chatting with the bot about Paris? Is that nothing more than a standard Markov chain? Nothing else going on? Again the answer is an obvious "No" it's a chat bot - what about the NLP work, or the chat interface, etc. makes you ask if it's nothing more than a standard [insert vague philosophical idea]?
To me, your question is like when people were asking if jQuery "is just a monad"? I don't understand the significance of the question - jQuery is a library for web development. Maybe there are some similarities to this philosophical concept "monad"? See: https://stackoverflow.com/questions/10496932/is-jquery-a-mon...
It's like saying "I looked at your website and have concluded it is nothing more than an Array."
They are just inquiring as to what the underlying data structure and algorithm is, not what function it performs, or the myriad of ways it can be used.
It's an inquiry with an embedded false dichotomy/assumption that n-grams are not used in LLMs, when in fact ChatGPT also uses n-grams/"Markov chains". Popular embeddings including those ChatGPT uses like text-embedding-ada-002 and later also use parts-of-speech codes. And the chat interface uses conventional NLP too. Maybe some people think it's nothing but "magical vectors" doing all the work, but that's incorrect.
If you google "Is ChatGPT just a glorified Markov chain?" you will amazingly get pages of results of people asking this question, just like "Is jQuery just a glorified monad?" as if to reduce something novel down to useless, mere philosophy that "we've had" for thousands of years. Imagine suggesting using a state management library in React to improve FE dev and getting the retort: "Isn't that just a state machine?" in a discounting manner, and imagine the rest of the team actually nodding their head in agreement like a scene in Idiocracy - welcome to Hacker News.
For smart people, the answer to any question like this is "No". Google is not a glorified Array. Bitcoin is not a glorified LinkedList. Language models are not glorified Markov chains. To even ask that is so reductionist and incorrect that any answer obfuscates what they actually are.
Here's a gist you can paste into your browser that shows how both n-grams and conventional NLP (parts-of-speech analysis) are used to derive vector embeddings in the first place: https://gist.github.com/bennyschmidt/ba79ba64faa5ba18334b4ae... (following in the style of text-embedding-ada-002 albeit much tinier)
They are not mutually exclusive concepts to begin with. Never have been. None of these comments even deserve these lengthy replies (I am likely responding to a mix of 12- and 24-year-olds who don't care that much anyway, just want to "win"), yet I feel compelled to explain.
I think this is way too harsh. What if someone is not interested in learning a subject deeply, but still genuinely wonders whether they get the gist of it and/or wants to know where to start, just in case? Of course one of them will eventually remember Markov chains and start drawing parallels with modern LLMs. It is only natural. No need to berate people for that.
edit: I do appreciate your work and explanation, btw.
I’m not sure why you’d want to build an LLM these days - you won’t be able to train it anyway. It’d make a lot of sense to teach people how to build stuff with LLMs, not LLMs themselves.
This has been said about pretty much every subject: writing your own browsers, compilers, cryptography, etc. But at least for me, even if nothing comes of it, just knowing how it really works and what steps are involved is part of using things properly. Some people are perfectly happy using a black box, but without knowing how it's made, how do we know the limits? How will the next generation of LLMs happen if nobody can get excited about the internal workings?
You don’t need to write your own LLM to know how it works. And unlike, say, a browser it doesn’t really do anything even remotely impressive unless you have at least a few tens of thousands of dollars to spend on training. Source: my day job is to do precisely what I’m telling you not to bother doing, but I do have access to a large pool of GPUs. If I didn’t, I’d be doing what I suggest above.
But I mean people can always rent GPUs too. And they're getting pretty ubiquitous as we ramp up from the AI hype craze, I am just an IT monkey at the moment and even I have on-demand access to a server with something like 4x192GB GPUs at work.
It's possible to train useful LLMs on affordable hardware. It depends on what kind of LLM you want. Sure you won't build the next ChatGPT, but not every language task requires a universal general-purpose LLM with billions of parameters.
It's so fun! And for me at least, it sparks a lot of curiosity to learn the theory behind them, so I would imagine it is similar for others. And some of that theory will likely cross over to the next AI breakthrough. So I think this is a fun and interesting vehicle for a lot of useful knowledge. It's not like building compilers is still super relevant for most of us, but many people still learn to do it!
Anyway I will watch it tonight before bed. Thank you for sharing.