
Considering I seem to be in the minority here, based on all the other responses to the message you replied to, the answer I'd give is "by mine, I guess".

At least when I saw "Building LLMs from the Ground Up", what I expected was someone opening vim, Emacs, or their favorite text editor and writing C code (or something around that level) to implement, well, everything from the "ground" (the operating system's user space, which in most OSes sits at roughly the level of C) and "up".




The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.

To a statistician or a practitioner approaching machine learning from a mathematical perspective, the computational details are a distraction.

Yes, these models would not be possible without automatic differentiation and massively parallel computing. But there is a lot of rich detail to consider in building up the model from first mathematical principles: motivating design choices with prior art from natural language processing, various topics related to how input data is represented and loss is evaluated, data processing considerations, putting things into the context of machine learning more broadly, etc. You could fill half a book chapter with that kind of content (and people do), without ever talking about computational details beyond a passing mention.
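To make one of those topics concrete: how loss is evaluated can be built up entirely from the math before any framework enters the picture. A minimal sketch in plain Python (the function names here are my own, purely illustrative) of next-token cross-entropy:

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # loss is the negative log-probability the model
    # assigns to the correct next token
    probs = softmax(logits)
    return -math.log(probs[target_index])

# toy example: 4-token vocabulary, model favours token 2
loss = cross_entropy([1.0, 0.5, 3.0, -1.0], target_index=2)
```

None of this needs more than a napkin and a standard library; the framework only matters once you want the gradients of that loss computed for you.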

In my personal opinion, fussing over manual memory management is far afield from anything useful unless you actually want to work on hardware or on core library implementations like PyTorch. Nobody else in industry is doing that.


> The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.

But if it's all relative and depends on your PoV, that implies there isn't actually a problem here, right? :-P

I don't think there is anything wrong with "building up the model from first mathematical principles" as you wrote, it just wasn't what I personally had in mind by "from the ground up".

And FWIW, I'm not that hung up on the "vim and C" aspect. I used those as an example that I expected most would understand, one that leaves little room for misinterpretation about what you'd have to work with (i.e. very, very little) and what you'd have to implement yourself (pretty much everything). Personally I'd consider it "from the ground up" even if it was in C#, D, Java, JavaScript or even Python, as long as the implementation didn't rely on 3rd party libraries, so that whatever is implemented in, say, Java could also be implemented in C#, D, JavaScript or Python with just what those languages provide out of the box, or even in C, if one doesn't mind writing the extra bookkeeping functionality themselves.
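As a sketch of what that dependency-free interpretation looks like in practice (a toy, not a real implementation; all names are illustrative), here is scaled dot-product attention in Python using only the standard library:

```python
import math

def matmul(a, b):
    # naive matrix multiply over lists of lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = len(q[0])
    kt = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, kt)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

# two queries/keys of dimension 2, two value rows
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

The same few dozen lines translate mechanically to C#, D, Java or JavaScript, which is exactly the portability the interpretation above asks for; what they don't give you, of course, is autodiff or speed.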


Right, but again I think the emphasis on avoiding 3rd party libraries isn't really relevant to machine learning. The "from scratch" here is avoiding 3rd party implementations of the transformer model, building up from the math on paper and then letting the AD/computation framework do its thing.


One does not exclude the other though. "Avoiding 3rd party implementations of the transformer model" is a subset of "avoiding 3rd party libraries". "From scratch" is, as seen, vague enough for different people to interpret it in different ways. Despite being in the minority in this thread, I do not think my interpretation is any less valid, especially since some people have already done such "from scratch" (i.e. in C or C++ with no 3rd party dependencies) implementations.


Gluing together premade components is not “from the ground up” by most people’s definition.

People look to "from the ground up" for a clear picture of what the thing is actually doing, so masking the important part of what is actually happening and then calling it "ground up" is disingenuous.


Yes, but "what the thing is actually doing" depends on your perspective on what "the thing" is and what "actually" consists of.

If you are interested in how the model works conceptually, how training works, how it represents text semantically, etc., then I maintain that computational details are an irrelevant distraction, not an essential foundation.

How about another analogy? Is SICP not a good foundation for learning about language design because it uses Scheme and not assembly or C?




