I hope I actually sit down and go through this, because it seems like an excellent resource. I took a grad-level compilers course in college (they just let us take them as undergrads) and it was the most fun I've had at uni. It was all SML, building a functional language that supported a solid type system with ADTs and parametric polymorphism, and it was so interesting to see that it wasn't as complicated to implement as it seemed (although it was time consuming to iron out the bugs).
My prof developed a production SML compiler that's fairly widely used, so he was the best resource I could have asked for. Such a humble guy as well. The thing that was most interesting to me was the use of ADTs on the implementation end. I couldn't imagine how much more tedious it would be dealing with the many types of trees a compiler deals with without them.
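To make that concrete, here's a minimal sketch of the idea (in OCaml rather than SML, but the encoding is identical; all names here are illustrative): the AST is an algebraic datatype, and every pass over the tree is an exhaustive pattern match, so the compiler complains when you forget a node kind.

    (* An expression AST as an algebraic datatype. *)
    type expr =
      | Int of int
      | Var of string
      | Add of expr * expr
      | Mul of expr * expr
      | Let of string * expr * expr   (* let x = e1 in e2 *)

    (* Constant folding in a dozen lines; a missing case is a
       compile-time warning rather than a latent crash. *)
    let rec fold (e : expr) : expr =
      match e with
      | Int _ | Var _ -> e
      | Add (a, b) ->
          (match fold a, fold b with
           | Int x, Int y -> Int (x + y)
           | a', b' -> Add (a', b'))
      | Mul (a, b) ->
          (match fold a, fold b with
           | Int x, Int y -> Int (x * y)
           | a', b' -> Mul (a', b'))
      | Let (x, e1, e2) -> Let (x, fold e1, fold e2)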
The class's impact on my overall career isn't direct, as I don't work near the compiler space at all (unfortunately), but there were so many good type level concepts I got out of that course. Plus, it was just a blast. Late nights with a bunch of tmux panes ironing out bugs to rack up test coverage - such fond memories :)
I highly recommend playing with compiler implementation for anyone who's unfamiliar with it. There are a lot of good introductory resources out there, and compilers are just such a fascinating part of our industry. A beautiful crossroads between CS theory and practical application.
> My prof developed a production SML compiler that's fairly widely used
MLton? I've been playing around on-and-off with a non-production SML compiler and MLton is such a fantastic resource.
This is just a hobby for me, but I have read (and own) several different compiler textbooks. What I really want is one that focuses on compiling functional languages down to assembly.
Yes! Professor Fluet was probably the most passionate professor I had. The man has three kids and a full-time job as a professor at the university, and he still works on MLton; it's incredible.
During our code gen lectures, he told us that MLton's original backend (not sure if it's still the default) generated C and shipped it over to a C compiler to do the dirty work. I know they have an official x86 backend now too, though.
I wish I had resource recommendations for what you're looking for (because it is a super interesting topic), but hopefully MLton can serve as a great resource. PolyML may also be a good one - not sure though.
Adrian is a great guy and I highly recommend all his content.
For undergrad level PL, I highly recommend Dan Grossman's MOOC[0] or recent class recorded lectures[1]. Dan's thesis work strongly influenced Rust's borrow checker.
Disclosure: Dan advised me in undergrad and Adrian is a friend.
> I wish my school offered something even half good as this.
Me too. At my school, we had great introductory and mid-level classes, but at the graduate level I found our classes underwhelming. Mostly profs/researchers teaching their narrow specialties and trying to recruit PhD students, without putting much time into their lectures because they didn't care about teaching: a bunch of slides and research papers to read. They wouldn't bother building a substantial programming project; that was left to the colleagues who did less research.
I'm in a Master's (and possibly jumping to PhD) program at a Big Tech School that you've probably heard of -- and that's my feeling as well.
Fairly underwhelming. Lots of online resources to chase after to catch up to the state of the art, but a lot of the courses are things like the history of, and developments of, the technology. Worth knowing, but I don't need multiple, year-long adventures in this stuff -- get me to modern, and then let's solve some actual problems!
I used to think like this until I actually spent some time with the modern things after learning a lot of the historical development paths. A lot of research is stitching together disparate ideas that have developed in their own little community silos. The best skills you can learn as a researcher are appreciating existing research, recognizing (non-surface-level) similarities between different lines of work, and synthesizing them.
There's an awful lot of cult of the new that goes on in modern research.
> modern things after learning a lot of the historical development paths
While not research, it is nonetheless interesting to watch every generation of programmers (including mine) rediscover the RDBMS after tripping over all the pitfalls of the hot new stuff.
It would have been if the papers covered were closer to the state of the art.
The most recent one is from 2015.
However, this course would actually give you the background required to read more recent papers and work on compiler research, so calling it PhD level is not a huge stretch.
Even if you're working at the PhD level, a course is only going to at best give you a survey of material instead of the latest-and-greatest papers hot off the presses at PLDI or ASPLOS or whatever conferences matter most for your field. If you're actually doing research on a particular subfield, then you are going to be finding and reading those papers on your own.
In that vein, it makes a lot more sense to focus on a solid, foundational paper than whatever incrementalism came out most recently. SLP vectorization is a good example here--yeah, the foundational paper is 20 years old at this point, but you'd rather students read and understand that than whatever the most recent addition to SLP vectorization came out in this (or last) year's PLDI. Even the other papers I think I might add or substitute for others in this course aren't going to be substantially newer than the stuff already here.
I mean, a lot of programming language design goes back decades; it's better to focus a course on the basics that every language will have vs. the most recent developments that haven't fully crystallized yet.
There is a critical difference between programming language design and compiler design.
You are right when it comes to programming language design and somewhat wrong when it comes to compiler design.
If you are doing compiler research/development, you have to be familiar with the latest and greatest research because the margins for improving things are fairly slim. It’s usually fine to be not familiar with how things worked ~30 years ago.
For programming languages, you absolutely have to understand the foundations to do absolutely anything.
> If you are doing compiler research/development, you have to be familiar with the latest and greatest research because the margins for improving things are fairly slim. It’s usually fine to be not familiar with how things worked ~30 years ago.
On the other hand, if the margins for improving compiler quality are slim they can be ignored without great consequences: it makes sense to study state of the art research only in the niches that turn out to be relevant for a specific need of a specific project, not blindly and at the expense of important general principles and techniques (i.e. mostly the problems of 75 years ago and the solutions of 25 years ago).
Again, this is not necessarily just for someone doing compiler research, but foundational knowledge for anyone at the PhD level.
For someone doing compiler research, this might be a first graduate-level course to get them used to the techniques and concepts used in more recent work; they might then take two or three additional courses getting closer and closer to the state of the art.
I took this course when it was PL-focused in 2015ish! It starts with an in-depth exploration of the untyped lambda calculus (with the usual fixins like Church numerals, various combinators, etc), spending half a week or so on semantic expansion rules, reduction strategies, recursion, fixpoints, continuations, how one might encode state with only lexical closures, etc. After maybe the tenth lecture, the class takes a quick detour to talk about predicate logic, Hoare logic, dynamic logic, and Kleene algebra, before deriving a typed variant of the lambda calculus as an illustrative pathway into the latter half of the course featuring algebraic datatypes, unification, type inference, and the duality between formal logic propositions and typing systems. We explored subtype polymorphism, bisimulation, equirecursive equality, and closed on a gentle introduction to monads (of all things) and a teaser of category theory (which IMO should have been taught first but oh well).
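For a taste of the Church-numeral part, here's a minimal sketch (OCaml for readability; the untyped-lambda-calculus terms are the same): a numeral n is just "apply f n times".

    (* Church numerals: a number is its own iteration operator. *)
    let zero = fun _f x -> x
    let succ n = fun f x -> f (n f x)
    let add m n = fun f x -> m f (n f x)
    let mul m n = fun f x -> m (n f) x

    (* Convert back to an ordinary int by instantiating f with (+1). *)
    let to_int n = n (fun k -> k + 1) 0

    let () =
      let two = succ (succ zero) in
      let three = succ two in
      Printf.printf "%d\n" (to_int (mul two three))   (* prints 6 *)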
As more of a math course than a computer engineering course, 6120 was a first-semester Ph.D.-level class, and I certainly wouldn't recommend taking it unless you really want to dive deep into programming language implementation. Incoming students were expected to be fluent in discrete mathematics and formal proof writing. I wasn't, so I found myself falling behind at the start even though I had some familiarity with lambda calculus and algebraic datatypes. You definitely benefit from a strong math background.
Are there any courses on creating a new language, interpreter, and compiler from scratch? Like not using LLVM or similar to generate intermediate representation but to literally pick an ISA and, well, compile for it.
You can find a PDF and the associated course by following a minimal number of links.
If you are thinking of making your own language, it's also good to learn something about programming language theory, if you don't already. Many languages make mistakes that have been solved 25+ years ago. PLAI is good for that: https://www.plai.org/
I compiled a basic C program and then read the output of gcc -S to see the assembly.
Using what I learned from the GCC output, I then wrote a basic amd64 expression compiler with simple live ranges, linear register allocation, and ANF (A-normal form), doing code generation via a postorder AST traversal. It can only evaluate nested add and mul expressions.
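In case it helps anyone, here's a toy sketch of that shape of compiler (illustrative pseudo-amd64 rather than real encodings, and skipping the ANF step): a postorder traversal of add/mul expressions with a trivial bump allocator over a fixed register file and no spilling.

    type expr = Int of int | Add of expr * expr | Mul of expr * expr

    let regs = [| "rax"; "rcx"; "rdx"; "r8"; "r9" |]

    (* Returns the index of the register holding the result; [next] is
       the first free register. Live ranges and stack spills are the
       obvious next steps. *)
    let rec emit (e : expr) (next : int) : int =
      match e with
      | Int n ->
          Printf.printf "  mov %s, %d\n" regs.(next) n;
          next
      | Add (a, b) | Mul (a, b) ->
          let ra = emit a next in
          let rb = emit b (ra + 1) in
          let op = (match e with Add _ -> "add" | _ -> "imul") in
          Printf.printf "  %s %s, %s\n" op regs.(ra) regs.(rb);
          ra

    (* 1 + 2 * 3: the result ends up in rax. *)
    let () = ignore (emit (Add (Int 1, Mul (Int 2, Int 3))) 0)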
To learn code generation not targeting amd64, I made up an imaginary assembly language, then wrote a switch-based virtual machine for it and a compiler targeting it, with a frontend that looks similar to JavaScript.
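The switch-based VM pattern is tiny; a minimal sketch (OCaml, with all opcodes invented for illustration) is just one big dispatch over the instruction at the program counter:

    type instr = Push of int | Add | Mul | Print | Halt

    let run (prog : instr array) =
      let stack = Stack.create () in
      let pc = ref 0 in
      let running = ref true in
      while !running do
        (match prog.(!pc) with        (* the "switch" *)
         | Push n -> Stack.push n stack
         | Add    -> let b = Stack.pop stack in
                     let a = Stack.pop stack in
                     Stack.push (a + b) stack
         | Mul    -> let b = Stack.pop stack in
                     let a = Stack.pop stack in
                     Stack.push (a * b) stack
         | Print  -> Printf.printf "%d\n" (Stack.top stack)
         | Halt   -> running := false);
        incr pc
      done

    (* A tiny "compiled" program: (1 + 2) * 3, prints 9. *)
    let () = run [| Push 1; Push 2; Add; Push 3; Mul; Print; Halt |]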
I tend to be biased towards what Jeremy Siek himself markets as "Proven in the classroom" when it comes to book authors in CS. Many book authors lack this experience and simply write for themselves, which is ok, but can result in bad didactics. Good teachers and authors from academia are invaluable.
That's a great point about experience teaching the topic. (I've written a few articles where I certainly wished for that experience.)
I just wish textbooks didn't have their own bad tendencies: they have pablum as an attractor, because on average students just want to get through the class, not doing too much worse than average among the other students. Even without this problem, there's a more basic one: like with enterprise software, the decision to buy the book is not typically up to the user. "Will people actually want to read this on their own time?" is a strong driver of quality, even though it has pitfalls too.
This does not cover compilers. I know because I specifically asked munificent (the author) about this before and he mentioned some other books to look at.
He's being modest :) Sure, the book doesn't cover many traditional compiler topics such as register allocation and code optimization. However, it does cover parsing and translating to intermediate representation, which is 90% of what you need to get your own programming language off the ground. And to be honest, many university compiler courses won't have time to go much farther than that anyway.
Most of the time when people talk about compiled languages/compilers, they mean languages that compile to native machine code, not languages that compile to a bytecode/run on a virtual machine.
Compiling to machine code rather than VM bytecode is a difference in degree rather than kind, and doing it well rather than naively is a deep topic.
Simple machine code generation with basic register allocation and evaluation which eagerly spills to the stack is not super hard. Producing fast code instead is a bottomless well.
It's just the nomenclature that I'm used to: compiled languages are so called because there is no VM, it's just machine code that is run, while C# and Java still require a VM (the CLR and the JVM, respectively) to do JIT compilation at runtime. But it's mostly splitting hairs; there is no agreed-upon definition of what makes a language compiled or interpreted.
There's literally no difference in the theory or technique of compiling for a virtual machine or a real one. Real machines are just messier and have a bunch of edge cases and weirdness that require you to read those multi-kilopage manuals to figure out. Having one or more intermediate steps is useful for portability and optimization though, so it's probably a good idea to target a virtual machine unless you only intend to ever target one specific architecture that you know inside and out.
Yes, but it's still the most beginner-friendly book on implementing programming languages that I've ever seen. It covers VMs on the second part, and going from that to asm is not a huge leap.
Is that really true? I would expect going from IR to native code to be the hardest part, because you have to worry about the architecture and about supporting different ISAs. IR can be simple because it is abstract and hides lots of the dirty details of the hardware.
Generating code is not all that hard; generating fast code is a challenge. Decoding instructions on some architectures is hard. If you have a nice general IR, naively generating native code can be just a table lookup plus a register tracking/allocation routine.
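As in, roughly this (a sketch with invented IR opcodes and instruction templates, not any real backend): a flat three-address IR where each op maps straight to a template.

    type op = IAdd | ISub | IMul

    let mnemonic = function IAdd -> "add" | ISub -> "sub" | IMul -> "imul"

    (* One IR instruction: dst := src1 op src2, with registers already
       assigned by a separate allocation routine. *)
    let emit (op, dst, src1, src2) =
      Printf.printf "  mov %s, %s\n  %s %s, %s\n"
        dst src1 (mnemonic op) dst src2

    let () =
      List.iter emit
        [ (IAdd, "rax", "rcx", "rdx"); (IMul, "rax", "rax", "r8") ]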
UPenn's CIS341[1] is pretty nice. It uses a bottom-up approach, where assignments start with a 64-bit integer-only x64 simulator, then goes up to compiling LLVM IR to x64, then compiling the lecturer's language ('Oat') into LLVM IR, and then optimisations like dead code elimination, register allocation, variable substitution, etc.
Nand2Tetris (https://www.nand2tetris.org/) is pretty neat. The compiler in part 2 isn't very advanced (it doesn't cover much in the space of optimization or the state of the art), but taken as a whole you end up designing the chip, ISA, VM, language, and compiler, so it's pretty good for that aspect.
When I did my undergraduate degree, you could opt to do that for one extra point. Compiling for an ISA is not that much more difficult than compiling for a bytecode IR like that of Java or .NET, which I imagine is the typical target for these courses.
Check out project Oberon by Niklaus Wirth. It involves building a compiler from scratch for an ISA designed from scratch and building an OS for it. You can even put the processor onto an FPGA and run the code you compiled on physical hardware you designed.
Jumping ahead to the video on LLVM: the instructor mentions his prior LLVM tutorial/intro, which looks really helpful as an overview, with more informative links. Probably the best get-started tutorial I've come across yet.
The Kaleidoscope tutorial for LLVM is a good intro to read before one gets deeper into LLVM. The tutorial is in C++, but you can follow along in any language with LLVM bindings: https://llvm.org/docs/tutorial/
It's one of those things where you know it when you see it. I got 18-year-old CMU students showing up knowing how to write x86 SIMD intrinsics, or ones with LLVM JIT experience. Dirty.
Thanks for this. His lecture on SSA was the best explanation of SSA vs. the alternatives that I've found (it builds up a motivation for why you'd want to use SSA, which is lacking in other explanations), and unfortunately I'd lost track of where I'd seen it. I was looking for it again recently without much luck (it's not on YouTube), and now here it is.
Why should online classes be any cheaper? I'm not talking about self-paced, certificate-only classes where you never interact with a real human; you can already find plenty of those for free from any number of top schools. I'm talking about a real class, for real credit, with a real professor and assignments graded by real humans. Online students end up consuming the same resources as regular students (with the exception of occupying space in a physical classroom). What's more, online students require _extra_ accommodations, since they can't just stop by the professor's office for office hours. They have to set up online meetings for that, etc.
Self-paced is the tough ask. Part of the value of formal schooling is the forcing function of deadlines to make you actually spend the time and absorb the material.
> An important component of credit courses is the time limitness
Huh? Why would that be an important component?
I mean I know that it's typical for courses that come with credit, but in my mind it's just a side effect of the administrative side of providing courses, not something that's at all central to accreditation. I.e. if you build up the knowledge over a longer period (which you can still do with e.g. resits) that's completely fine as well.
>This page lists the curriculum for following this course at the university of your imagination, for four imagination credits (ungraded).
Seems rather ridiculous of Cornell to dictate terms for the University of My Imagination (UMI), if an imaginary professor at UMI wants to grade my work I think they should be allowed to do so and even grant me an imaginary degree if they feel I have earned it. Guess it is true, universities do stifle creativity.