>It's not specific to language models at all - which was the thesis of your whole argument.
I didn't say that it's unique to LMs. My argument is that saying "my LM is a transformer" is misleading because "transformer" in the context of LMs means a very specific architecture. You're deliberately misusing terms, probably to draw attention to your project.
>OpenAI specifically famously used African workers on very low wages to tag a bunch of data
Did they tag Polish parts of speech too? Or Ancient Greek? ChatGPT constructs grammatically correct Ancient Greek. I thought they tagged "harmful/non-harmful", not parts of speech?
>ChatGPT uses text-embedding-ada
[Citation needed]
NanoGPT, for example, learns embeddings together with the rest of the network, so, as I said, manual tagging is not required.
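To make "learns embeddings together with the rest of the network" concrete, here's a toy sketch in plain JS (the vocabulary, training pairs and sizes are invented for illustration; nanoGPT does the equivalent in PyTorch at much larger scale): the embedding table is just another randomly initialised parameter, updated by the same gradient step as the output weights, with no tagging anywhere.

```javascript
// Toy sketch: token embeddings learned jointly with an output layer,
// with no manual tagging anywhere. Vocabulary, data and sizes are
// invented for illustration; nanoGPT does the equivalent in PyTorch.

const vocab = ["the", "cat", "sat", "on", "mat"];
const V = vocab.length; // vocabulary size
const d = 4;            // embedding dimension

// Randomly initialised parameters: embedding table E and output weights W.
const rand = () => (Math.random() - 0.5) * 0.1;
const E = Array.from({ length: V }, () => Array.from({ length: d }, rand));
const W = Array.from({ length: d }, () => Array.from({ length: V }, rand));

const softmax = (logits) => {
  const m = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - m));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / total);
};

// Training pairs (current token id -> next token id),
// e.g. from "the cat sat on the mat".
const pairs = [[0, 1], [1, 2], [2, 3], [3, 0], [0, 4]];
const lr = 0.1;

for (let step = 0; step < 500; step++) {
  for (const [x, y] of pairs) {
    // Forward: look up the embedding, project to logits, softmax.
    const e = E[x];
    const logits = Array.from({ length: V }, (_, j) =>
      e.reduce((sum, ei, i) => sum + ei * W[i][j], 0)
    );
    const probs = softmax(logits);

    // Backward: the cross-entropy gradient flows into BOTH the output
    // weights and the embedding row that was looked up.
    const dlogits = probs.slice();
    dlogits[y] -= 1;
    for (let i = 0; i < d; i++) {
      let de = 0;
      for (let j = 0; j < V; j++) {
        de += W[i][j] * dlogits[j];
        W[i][j] -= lr * e[i] * dlogits[j]; // update output weights
      }
      E[x][i] -= lr * de; // update the embedding itself
    }
  }
}

console.log("learned embedding for 'the':", E[0]);
```

The only supervision is "which token actually came next"; the geometry of the embedding space falls out of that.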
Anyway, looking forward to hearing news about your image generation project. Any news?
Nobody denied the term is used in language models; I only pointed out that they use that term because of what it already means in the context of vector operations (long before OpenAI).
The Wikipedia article on deep learning transformers says:
> All transformers have the same primary components:
> - Tokenizers, which convert text into tokens.
> - Embedding layer, which converts tokens and positions of the tokens into vector representations.
> - Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
> - Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
Where does it say bigrams can't be used for next-token prediction? Or that you can't tag data? Note "...which converts tokens and positions of the tokens..."
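For what it's worth, next-token prediction from bigrams is about as simple as it gets - a rough sketch (the corpus and whitespace tokenisation are invented for illustration, not my library's actual code):

```javascript
// Rough sketch: next-token prediction from plain bigram counts.
// The corpus and whitespace tokenisation are invented for illustration.

const corpus = "the cat sat on the mat the cat ran";
const tokens = corpus.split(/\s+/);

// Count how often each token follows each other token.
const counts = {};
for (let i = 0; i < tokens.length - 1; i++) {
  const cur = tokens[i];
  const next = tokens[i + 1];
  counts[cur] ??= {};
  counts[cur][next] = (counts[cur][next] ?? 0) + 1;
}

// Predict the next token as the most frequent follower of the current one.
const predictNext = (token) => {
  const followers = counts[token];
  if (!followers) return null;
  return Object.entries(followers).sort((a, b) => b[1] - a[1])[0][0];
};

console.log(predictNext("the")); // "cat"
console.log(predictNext("cat")); // "sat" (tied with "ran"; insertion order wins)
```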
> You're deliberately misusing terms, probably to draw attention to your project.
Haha, well, since I have like 30 followers and the npm is free/MIT, whatever scheme you think I'm up to isn't working. Anyway, a text autocomplete library is not exactly viral material. Jokes aside, no, I am trying to use accurate terms that make sense for the project.
Could just make it anonymous - `export default () => {}` - and call the file `model.js`. What would you call it?
> Did they tag Polish parts of speech too? Or Ancient Greek?
Yes, all the foreign words with special characters were tokenized and trained on. An LLM doesn't "know any language". If it never trained on any Polish word sequences it would not be able to output very good Polish sequences any more than it could output good JavaScript. It's not that it has to train on Polish to translate Polish per se, but it does have to have the language coverage at the token level to be able to perform such vector transformations - which is probably most easily accomplished by training on Polish-specific data.
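To make "coverage at the token level" concrete, here's a rough greedy longest-match subword sketch - the tiny vocabulary is invented, and real vocabularies (e.g. BPE) are learned from the training corpus rather than hand-written:

```javascript
// Rough sketch: greedy longest-match subword tokenisation.
// The tiny vocabulary below is invented for illustration; real subword
// vocabularies (e.g. BPE) are learned from the training corpus.

const vocab = new Set(["żó", "łw", "ż", "ół", "w"]);

const tokenize = (text) => {
  const out = [];
  let i = 0;
  while (i < text.length) {
    // Try the longest vocabulary match starting at position i.
    let piece = null;
    for (let len = text.length - i; len > 0; len--) {
      const candidate = text.slice(i, i + len);
      if (vocab.has(candidate)) {
        piece = candidate;
        break;
      }
    }
    // Fall back to a single character when nothing matches.
    piece = piece ?? text[i];
    out.push(piece);
    i += piece.length;
  }
  return out;
};

console.log(tokenize("żółw")); // ["żó", "łw"] - Polish "turtle", covered by the toy vocab
console.log(tokenize("mat"));  // ["m", "a", "t"] - no coverage, falls back to characters
```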
> The model was initialised from AUEB NLP Group's Greek BERT and subsequently trained on monolingual data from the First1KGreek Project, Perseus Digital Library, PROIEL Treebank and Gorman's Treebank
First1KGreek Project
> The goal of this project is to collect at least one edition of every Greek work composed between Homer and 250CE
> The new model, text-embedding-ada-002, replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks, while being priced 99.8% lower.
> Anyway, looking forward to hearing news about your image generation project. Any news?
Not yet! Feel free to follow on GitHub or even help out if you're really interested in it. Would be cool to have pixel prediction as snappy as text autocomplete.