I hope I actually sit down and go through this, because it seems like an excellent resource. I took a grad-level compilers course in college (they just let us take them as undergrads) and it was the most fun I've had at uni. It was all SML, building a functional language that supported a solid type system with ADTs and parametric polymorphism, and it was so interesting to see that it wasn't as complicated to implement as it seemed (although it was time consuming to iron out the bugs).
My prof developed a production SML compiler that's fairly widely used, so he was the best resource I could have asked for. Such a humble guy as well. The thing that was most interesting to me was the use of ADTs on the implementation end. I couldn't imagine how much more tedious it would be dealing with the many types of trees a compiler deals with without them.
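To make that concrete, here's a minimal sketch of the idea (in OCaml rather than SML, but the encoding is identical; all names here are illustrative): the AST is an algebraic datatype, and every pass over the tree is an exhaustive pattern match, so the compiler complains when you forget a node kind.

    (* An expression AST as an algebraic datatype. *)
    type expr =
      | Int of int
      | Var of string
      | Add of expr * expr
      | Mul of expr * expr
      | Let of string * expr * expr   (* let x = e1 in e2 *)

    (* Constant folding in a dozen lines; a missing case is a
       compile-time warning rather than a latent crash. *)
    let rec fold (e : expr) : expr =
      match e with
      | Int _ | Var _ -> e
      | Add (a, b) ->
          (match fold a, fold b with
           | Int x, Int y -> Int (x + y)
           | a', b' -> Add (a', b'))
      | Mul (a, b) ->
          (match fold a, fold b with
           | Int x, Int y -> Int (x * y)
           | a', b' -> Mul (a', b'))
      | Let (x, e1, e2) -> Let (x, fold e1, fold e2)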
The class's impact on my overall career isn't direct, as I don't work near the compiler space at all (unfortunately), but there were so many good type level concepts I got out of that course. Plus, it was just a blast. Late nights with a bunch of tmux panes ironing out bugs to rack up test coverage - such fond memories :)
I highly recommend playing with compiler implementation for anyone who's unfamiliar with it. There are a lot of good introductory resources out there, and compilers are just such a fascinating part of our industry. A beautiful crossroads between CS theory and practical application.
> My prof developed a production SML compiler that's fairly widely used
MLton? I've been playing around on-and-off with a non-production SML compiler and MLton is such a fantastic resource.
This is just a hobby for me, but I have read (and own) several different compiler textbooks. What I really want is one that focuses on compiling functional languages down to assembly.
Yes! Professor Fluet was probably the most passionate professor I had. The man has three kids and a full-time job as a professor at the university, and he still works on MLton; it's incredible.
During our code gen lectures, he told us that MLton's original backend (not sure if it's still the default) generated C and shipped it over to a C compiler to do the dirty work. I know they have an official x86 backend now too, though.
I wish I had resource recommendations for what you're looking for (because it is a super interesting topic), but hopefully MLton can serve as a great resource. PolyML may also be a good one - not sure though.
Adrian is a great guy and I highly recommend all his content.
For undergrad level PL, I highly recommend Dan Grossman's MOOC[0] or recent class recorded lectures[1]. Dan's thesis work strongly influenced Rust's borrow checker.
Disclosure: Dan advised me in undergrad and Adrian is a friend.
> I wish my school offered something even half good as this.
Me too. At my school, we had great introductory and mid-level classes, but at the graduate level I found our classes underwhelming. Mostly profs/researchers teaching their narrow specialties and trying to recruit PhD students, without putting much time into their lectures because they didn't care about teaching: a bunch of slides and research papers to read. They wouldn't bother building a substantial programming project; that was left to the colleagues who did less research.
I'm in a Master's (and possibly jumping to PhD) program at a Big Tech School that you've probably heard of -- and that's my feeling as well.
Fairly underwhelming. Lots of online resources to chase after to catch up to the state of the art, but a lot of the courses are things like the history of, and developments of, the technology. Worth knowing, but I don't need multiple, year-long adventures in this stuff -- get me to modern, and then let's solve some actual problems!
I used to think like this until I actually spent some time with the modern things after learning a lot of the historical development paths. A lot of research is stitching together disparate ideas that have developed in their own little community silos. The best skills you can learn as a researcher are appreciating existing research, recognizing (non-surface-level) similarities between different lines of work, and synthesizing them.
There's an awful lot of cult of the new that goes on in modern research.
> modern things after learning a lot of the historical development paths
While not research, it is nonetheless interesting to watch every generation of programmers (including mine) rediscover the RDBMS after tripping over all the pitfalls of the hot new stuff.
It would have been if the papers covered were closer to the state of the art.
The most recent one is from 2015.
However, this course would actually give you the background required to read more recent papers and work on compiler research, so calling it PhD level is not a huge stretch.
Even if you're working at the PhD level, a course is only going to at best give you a survey of material instead of the latest-and-greatest papers hot off the presses at PLDI or ASPLOS or whatever conferences matter most for your field. If you're actually doing research on a particular subfield, then you are going to be finding and reading those papers on your own.
In that vein, it makes a lot more sense to focus on a solid, foundational paper than whatever incrementalism came out most recently. SLP vectorization is a good example here--yeah, the foundational paper is 20 years old at this point, but you'd rather students read and understand that than whatever the most recent addition to SLP vectorization came out in this (or last) year's PLDI. Even the other papers I think I might add or substitute for others in this course aren't going to be substantially newer than the stuff already here.
I mean, a lot of programming language design goes back decades; it's better to focus a course on the basics that every language will have vs. the most recent developments that haven't fully crystallized yet.
There is a critical difference between programming language design and compiler design.
You are right when it comes to programming language design and somewhat wrong when it comes to compiler design.
If you are doing compiler research/development, you have to be familiar with the latest and greatest research because the margins for improving things are fairly slim. It’s usually fine to be not familiar with how things worked ~30 years ago.
For programming languages, you absolutely have to understand the foundations to do absolutely anything.
> If you are doing compiler research/development, you have to be familiar with the latest and greatest research because the margins for improving things are fairly slim. It’s usually fine to be not familiar with how things worked ~30 years ago.
On the other hand, if the margins for improving compiler quality are slim they can be ignored without great consequences: it makes sense to study state of the art research only in the niches that turn out to be relevant for a specific need of a specific project, not blindly and at the expense of important general principles and techniques (i.e. mostly the problems of 75 years ago and the solutions of 25 years ago).
Again, this is not necessarily just for someone doing compiler research, but foundational knowledge for anyone at the PhD level.
For someone doing compiler research, this might be a first graduate-level course to get them used to the techniques and concepts used in more recent work; they might then take two or three additional courses getting closer and closer to the state of the art.
I took this course when it was PL-focused in 2015ish! It starts with an in-depth exploration of the untyped lambda calculus (with the usual fixins like Church numerals, various combinators, etc), spending half a week or so on semantic expansion rules, reduction strategies, recursion, fixpoints, continuations, how one might encode state with only lexical closures, etc. After maybe the tenth lecture, the class takes a quick detour to talk about predicate logic, Hoare logic, dynamic logic, and Kleene algebra, before deriving a typed variant of the lambda calculus as an illustrative pathway into the latter half of the course featuring algebraic datatypes, unification, type inference, and the duality between formal logic propositions and typing systems. We explored subtype polymorphism, bisimulation, equirecursive equality, and closed on a gentle introduction to monads (of all things) and a teaser of category theory (which IMO should have been taught first but oh well).
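For a taste of the Church-numeral part, here's a minimal sketch (OCaml for readability; the untyped-lambda-calculus terms are the same): a numeral n is just "apply f n times".

    (* Church numerals: a number is its own iteration operator. *)
    let zero = fun _f x -> x
    let succ n = fun f x -> f (n f x)
    let add m n = fun f x -> m f (n f x)
    let mul m n = fun f x -> m (n f) x

    (* Convert back to an ordinary int by instantiating f with (+1). *)
    let to_int n = n (fun k -> k + 1) 0

    let () =
      let two = succ (succ zero) in
      let three = succ two in
      Printf.printf "%d\n" (to_int (mul two three))   (* prints 6 *)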
As more of a math course than a computer engineering course, 6120 was a first-semester Ph.D.-level class, and I certainly wouldn't recommend taking it unless you really want to dive deep into programming language implementation. Incoming students were expected to be fluent in discrete mathematics and formal proof writing. I wasn't, so I found myself falling behind at the start even though I had some familiarity with lambda calculus and algebraic datatypes. You definitely benefit from a strong math background.
Are there any courses on creating a new language, interpreter, and compiler from scratch? Like not using LLVM or similar to generate intermediate representation but to literally pick an ISA and, well, compile for it.
You can find a PDF and the associated course by following a minimal number of links.
If you are thinking of making your own language, it's also good to learn something about programming language theory, if you don't already. Many languages make mistakes that have been solved 25+ years ago. PLAI is good for that: https://www.plai.org/
I compiled a basic C program and then read the output of gcc -S to see the assembly.
Using what I learned from the GCC output, I then wrote a basic amd64 expression compiler with simple live ranges, linear register allocation, and ANF (A-normal form), doing code generation via a postorder AST traversal. It can only evaluate nested add and mul expressions.
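In case it helps anyone, here's a toy sketch of that shape of compiler (illustrative pseudo-amd64 rather than real encodings, and skipping the ANF step): a postorder traversal of add/mul expressions with a trivial bump allocator over a fixed register file and no spilling.

    type expr = Int of int | Add of expr * expr | Mul of expr * expr

    let regs = [| "rax"; "rcx"; "rdx"; "r8"; "r9" |]

    (* Returns the index of the register holding the result; [next] is
       the first free register. Live ranges and stack spills are the
       obvious next steps. *)
    let rec emit (e : expr) (next : int) : int =
      match e with
      | Int n ->
          Printf.printf "  mov %s, %d\n" regs.(next) n;
          next
      | Add (a, b) | Mul (a, b) ->
          let ra = emit a next in
          let rb = emit b (ra + 1) in
          let op = (match e with Add _ -> "add" | _ -> "imul") in
          Printf.printf "  %s %s, %s\n" op regs.(ra) regs.(rb);
          ra

    (* 1 + 2 * 3: the result ends up in rax. *)
    let () = ignore (emit (Add (Int 1, Mul (Int 2, Int 3))) 0)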
To learn code generation not targeting amd64, I made up an imaginary assembly language, then wrote a switch-based virtual machine for it and a compiler targeting it, with a frontend that looks similar to JavaScript.
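The switch-based VM pattern is tiny; a minimal sketch (OCaml, with all opcodes invented for illustration) is just one big dispatch over the instruction at the program counter:

    type instr = Push of int | Add | Mul | Print | Halt

    let run (prog : instr array) =
      let stack = Stack.create () in
      let pc = ref 0 in
      let running = ref true in
      while !running do
        (match prog.(!pc) with        (* the "switch" *)
         | Push n -> Stack.push n stack
         | Add    -> let b = Stack.pop stack in
                     let a = Stack.pop stack in
                     Stack.push (a + b) stack
         | Mul    -> let b = Stack.pop stack in
                     let a = Stack.pop stack in
                     Stack.push (a * b) stack
         | Print  -> Printf.printf "%d\n" (Stack.top stack)
         | Halt   -> running := false);
        incr pc
      done

    (* A tiny "compiled" program: (1 + 2) * 3, prints 9. *)
    let () = run [| Push 1; Push 2; Add; Push 3; Mul; Print; Halt |]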
I tend to be biased towards what Jeremy Siek himself markets as "Proven in the classroom" when it comes to book authors in CS. Many book authors lack this experience and simply write for themselves, which is ok, but can result in bad didactics. Good teachers and authors from academia are invaluable.
That's a great point about experience teaching the topic. (I've written a few articles where I certainly wished for that experience.)
I just wish textbooks didn't have their own bad tendencies: they have pablum as an attractor, because on average students just want to get through the class, not doing too much worse than average among the other students. Even without this problem, there's a more basic one: like with enterprise software, the decision to buy the book is not typically up to the user. "Will people actually want to read this on their own time?" is a strong driver of quality, even though it has pitfalls too.
This does not cover compilers. I know because I specifically asked munificent (the author) about this before and he mentioned some other books to look at.
He's being modest :) Sure, the book doesn't cover many traditional compiler topics such as register allocation and code optimization. However, it does cover parsing and translating to intermediate representation, which is 90% of what you need to get your own programming language off the ground. And to be honest, many university compiler courses won't have time to go much farther than that anyway.
Most of the time when people talk about compiled languages/compilers, they mean languages that compile to native machine code, not languages that compile to a bytecode/run on a virtual machine.
Compiling to machine code rather than VM bytecode is a difference in degree rather than kind, and doing it well rather than naively is a deep topic.
Simple machine code generation with basic register allocation and evaluation which eagerly spills to the stack is not super hard. Producing fast code instead is a bottomless well.
It's just the nomenclature that I'm used to: compiled languages are so called because there is no VM, it's just machine code that is run, while C# and Java still require a VM (the CLR and the JVM, respectively) to do JIT compilation at runtime. But it's mostly splitting hairs; there is no agreed-upon definition of what makes a language compiled or interpreted.
There's literally no difference in the theory or technique of compiling for a virtual machine or a real one. Real machines are just messier and have a bunch of edge cases and weirdness that require you to read those multi-kilopage manuals to figure out. Having one or more intermediate steps is useful for portability and optimization though, so it's probably a good idea to target a virtual machine unless you only intend to ever target one specific architecture that you know inside and out.
Yes, but it's still the most beginner-friendly book on implementing programming languages that I've ever seen. It covers VMs on the second part, and going from that to asm is not a huge leap.
Is that really true? I would expect going from IR to native code to be the hardest part, because you have to worry about the architecture and about supporting different ISAs. IR can be simple because it is abstract and hides lots of the dirty details of the hardware.
Generating code is not all that hard; generating fast code is a challenge. Decoding instructions on some architectures is hard. If you have a nice general IR, naively generating native code can be just a table lookup plus a register tracking/allocation routine.
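As in, roughly this (a sketch with invented IR opcodes and instruction templates, not any real backend): a flat three-address IR where each op maps straight to a template.

    type op = IAdd | ISub | IMul

    let mnemonic = function IAdd -> "add" | ISub -> "sub" | IMul -> "imul"

    (* One IR instruction: dst := src1 op src2, with registers already
       assigned by a separate allocation routine. *)
    let emit (op, dst, src1, src2) =
      Printf.printf "  mov %s, %s\n  %s %s, %s\n"
        dst src1 (mnemonic op) dst src2

    let () =
      List.iter emit
        [ (IAdd, "rax", "rcx", "rdx"); (IMul, "rax", "rax", "r8") ]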
UPenn's CIS341[1] is pretty nice. It uses a bottom-up approach, where assignments start with a 64-bit integer-only x64 simulator, then goes up to compiling LLVM IR to x64, then compiling the lecturer's language ('Oat') into LLVM IR, and then optimisations like dead code elimination, register allocation, variable substitution, etc.
Nand2Tetris (https://www.nand2tetris.org/) is pretty neat. The compiler in part 2 isn't very advanced (it doesn't cover much in the space of optimization or the state of the art), but taken as a whole you end up designing the chip, ISA, VM, language, and compiler, so it's pretty good for that aspect.
When I did my undergraduate degree, you could opt to do that for one extra point. Compiling for an ISA is not that much more difficult than compiling for a bytecode IR like that of Java or .NET, which I imagine is the typical target for these courses.
Check out project Oberon by Niklaus Wirth. It involves building a compiler from scratch for an ISA designed from scratch and building an OS for it. You can even put the processor onto an FPGA and run the code you compiled on physical hardware you designed.
Jumping ahead to the video on LLVM: the instructor mentions his prior LLVM tutorial/intro, which looks really helpful as an overview, with more informative links. Probably the best get-started tutorial I've come across yet.
The Kaleidoscope tutorial for LLVM is a good intro to read before one gets deeper into LLVM. The tutorial is in C++, but you can follow along in any language with LLVM bindings: https://llvm.org/docs/tutorial/
It's one of those things where you know it when you see it. I got 18-year-old CMU students showing up knowing how to write x86 SIMD intrinsics, or ones with LLVM JIT experience. Dirty.
Thanks for this. His lecture on SSA was the best explanation of SSA vs. the alternatives that I've found (it builds up a motivation for why you'd want to use SSA, which is lacking in other explanations), and unfortunately I'd lost track of where I'd seen it. I was looking for it again recently without much luck (it's not on YouTube), and now here it is.
Why should online classes be any cheaper? I'm not talking about self-paced, certificate-only classes where you never interact with a real human; you can already find plenty of those for free from any number of top schools. I'm talking about a real class, for real credit, with a real professor and assignments graded by real humans. Online students end up consuming the same resources as regular students (with the exception of occupying space in a physical classroom). What's more, online students require _extra_ accommodations, since they can't just stop by the professor's office for office hours. They have to set up online meetings for that, etc.
Self-paced is the tough ask. Part of the value of formal schooling is the forcing function of deadlines to make you actually spend the time and absorb the material.
> An important component of credit courses is the time limitness
Huh? Why would that be an important component?
I mean I know that it's typical for courses that come with credit, but in my mind it's just a side effect of the administrative side of providing courses, not something that's at all central to accreditation. I.e. if you build up the knowledge over a longer period (which you can still do with e.g. resits) that's completely fine as well.
>This page lists the curriculum for following this course at the university of your imagination, for four imagination credits (ungraded).
Seems rather ridiculous of Cornell to dictate terms for the University of My Imagination (UMI), if an imaginary professor at UMI wants to grade my work I think they should be allowed to do so and even grant me an imaginary degree if they feel I have earned it. Guess it is true, universities do stifle creativity.