Life of an instruction in LLVM

stavros · on Nov 25, 2012

Can someone tell me what the advantage of emitting LLVM IR is, compared to machine code? I don't think I ever saw that clarified anywhere.

mahmud · on Nov 25, 2012

An IR (intermediate representation) is "retargetable": meaning new processors can be supported with a new backend addon, not a full rewrite.

Compilation, especially for the type of language LLVM is most used for (impure, procedural, object-oriented, with optional collection), lends itself to very similar optimization techniques. Optimization happens in phases, one after the other; sometimes a phase needs to be repeated after the code undergoes further "simplification". Since the phases pretty much operate on the same data format, and since they're mostly portable across processor architectures in wide use today, it makes sense to delay actual object code generation until last.
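
To make the "phases repeated over the same data format" point concrete, here is a toy sketch in Python. It is not LLVM code: the tuple-based IR and the pass names (constant_fold, propagate_constants, eliminate_dead_code) are made up for illustration, and it assumes every name is assigned only once.

```python
# Toy IR: each instruction is (dest, op, args); args are ints or names.
# Hypothetical format for illustration only, not an LLVM API.

def constant_fold(code):
    """Replace add/mul of two integer constants with a 'const'."""
    out = []
    for dest, op, args in code:
        if op in ("add", "mul") and all(isinstance(a, int) for a in args):
            value = args[0] + args[1] if op == "add" else args[0] * args[1]
            out.append((dest, "const", (value,)))
        else:
            out.append((dest, op, args))
    return out

def propagate_constants(code):
    """Substitute names bound to a 'const' into later uses."""
    known = {}
    out = []
    for dest, op, args in code:
        args = tuple(known.get(a, a) if isinstance(a, str) else a for a in args)
        if op == "const":
            known[dest] = args[0]
        out.append((dest, op, args))
    return out

def eliminate_dead_code(code):
    """Drop instructions whose result is never read (keep 'ret')."""
    used = {a for _, _, args in code for a in args if isinstance(a, str)}
    return [ins for ins in code if ins[1] == "ret" or ins[0] in used]

def optimize(code):
    """Run all passes repeatedly until the code stops changing."""
    passes = [constant_fold, propagate_constants, eliminate_dead_code]
    while True:
        new = code
        for run_pass in passes:
            new = run_pass(new)
        if new == code:
            return code
        code = new

program = [
    ("a", "const", (2,)),
    ("b", "const", (3,)),
    ("c", "add", ("a", "b")),
    ("d", "mul", ("c", 4)),
    (None, "ret", ("d",)),
]
optimized = optimize(program)
print(optimized)
```

Each pass only exposes work for the next one, so the loop needs several rounds before the program collapses to a single return of a constant. None of the passes knows anything about the target processor.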

The final assembly instructions generated are as good as "machine code". In fact, the whole reason it's called "assembly", as opposed to "compilation" or even "translation", is that there is a 1:1 correspondence between assembly instructions and machine instructions. Many assemblers use an instruction hash table to look up code by template :-)
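
A minimal sketch of that table-lookup idea, in Python. The four single-byte x86 opcodes are real, but the assemble function is hypothetical and only handles operand-free instructions; real assemblers also have to encode operands, prefixes and addressing modes.

```python
# Single-byte x86 opcodes for a few operand-free instructions.
OPCODES = {
    "nop": 0x90,
    "ret": 0xC3,
    "int3": 0xCC,
    "hlt": 0xF4,
}

def assemble(source):
    """Translate one mnemonic per line into one opcode byte each."""
    out = bytearray()
    for line in source.splitlines():
        mnemonic = line.split(";")[0].strip().lower()  # drop ';' comments
        if not mnemonic:
            continue
        if mnemonic not in OPCODES:
            raise ValueError(f"unknown instruction: {mnemonic!r}")
        out.append(OPCODES[mnemonic])
    return bytes(out)

machine_code = assemble("nop\nnop\nret ; return to caller")
print(machine_code.hex())
```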

Even when assembled, the program is not in a runnable state, as a lot of its external symbols and dependencies are not resolved. A file is its own compilation unit, and so functions referenced in other files in the project, or dependency libraries, or standard system libraries and system calls in use have to be either resolved, or registered somewhere handy for quick resolution later. Static linking does the former, dynamic linking does the latter. If the object file exports symbols for use by others, it might need to be made into a shared library.
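
Roughly, symbol resolution looks like the following Python sketch. The link function and the dict layout for an "object file" are invented for illustration; real object formats (ELF, Mach-O, COFF) carry far more than two lists of names.

```python
def link(objects):
    """Resolve each object's referenced symbols against all definitions.

    `objects` maps an object-file name to a dict with two keys:
    'defines' (symbols it exports) and 'needs' (symbols it references).
    Returns (symbol -> defining object, sorted list of unresolved symbols).
    """
    definitions = {}
    for name, obj in objects.items():
        for symbol in obj["defines"]:
            if symbol in definitions:
                raise ValueError(f"duplicate symbol {symbol!r}")
            definitions[symbol] = name
    unresolved = set()
    for obj in objects.values():
        for symbol in obj["needs"]:
            if symbol not in definitions:
                unresolved.add(symbol)
    return definitions, sorted(unresolved)

objects = {
    "main.o": {"defines": ["main"], "needs": ["helper", "printf"]},
    "util.o": {"defines": ["helper"], "needs": []},
}
definitions, unresolved = link(objects)
print(definitions, unresolved)
```

Here helper is resolved between the two files, while printf stays unresolved until a library provides it, either at static link time or when the dynamic linker loads the program.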

As you can imagine, all this work is both very platform-specific and tedious. That is why one might want to avoid the final "weaving" of the binary and leave it to someone who is intimately familiar with the target environment: someone like the vendor assembler and linker or, better yet, the high-quality binary tools from the good folks at GNU :-)

seanalltogether · on Nov 25, 2012

"An IR (intermediate representation) is "retargetable": meaning new processors can be supported with a new backend addon, not a full rewrite."

Why isn't this true of other high level languages like C, C++? Is LLVM just more strict? Could the others be made more strict to improve portability?

mahmud · on Nov 25, 2012

I don't think you understood what I was saying, and others replying to you are also off the mark, at least for the excerpt you quote.

LLVM is a compiler framework. Retargetability is a function of compilers, not languages. Most compilers are retargetable. A good sign is if a compiler generates code for more than two unrelated processors. Retargetability is just an engineering design. Most C and C++ compilers (in fact, most non-toy compilers maintained by more than half a dozen people) are retargetable.

Think of retargetability as related to portability, but not the same. A compiler can be non-portable but retargetable. Many compilers for small embedded processors target a huge number of processors, while themselves running on only x86 and Windows. That is called cross-compilation :-)

zanny · on Nov 25, 2012

Others have responded, but I don't think they get the gist of this question exactly.

C, C++, etc are just as retargetable as LLVM because you can take IR code, C, or C++ code (assuming standards compliance) and get a machine dependent binary anywhere you have a working compiler / assembler / linker stack. Clang uses LLVM as a translation layer for C and C++ because LLVM IR code is much more machine-like and makes writing the final assembler / linker easier, since you are just effectively translating abstracted assembly into machine specific assembly, instead of an object oriented / procedural dialect language into machine instructions.

alexk7 · on Nov 25, 2012

It is true of C and C++. In fact, LLVM is mostly used as a C/C++/Objective-C back-end (in Clang). The main advantage of LLVM is that it is a really good representation to perform optimizations on.

mahmud · on Nov 25, 2012

LLVM IR is in SSA (static single-assignment) form, THE most popular IR for compiler research of late. It is pretty much a lingua franca; optimization techniques old and new have been adapted to work with SSA.
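
For anyone who hasn't seen SSA: every variable is assigned exactly once, so a reassignment becomes a fresh name. A rough Python sketch for straight-line code follows; to_ssa is a made-up helper, and it ignores branches, which is where real SSA construction needs phi nodes.

```python
def to_ssa(statements):
    """Rename variables so each one is assigned exactly once.

    `statements` is a list of (target, operand_names) pairs for
    straight-line code; branches and phi nodes are out of scope here.
    """
    version = {}  # variable -> latest version number
    out = []
    for target, operands in statements:
        # A use refers to the most recent definition of the operand;
        # names never assigned in this snippet (inputs) stay unchanged.
        renamed = tuple(f"{v}{version[v]}" if v in version else v
                        for v in operands)
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", renamed))
    return out

# x = a + b; x = x + c; y = x + x
source = [("x", ("a", "b")), ("x", ("x", "c")), ("y", ("x", "x"))]
ssa = to_ssa(source)
print(ssa)
```

After renaming, the question "which assignment does this use see?" has exactly one answer per name, which is much of why optimizations are easier to write against SSA.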

derleth · on Nov 25, 2012

C and C++ are too high-level to directly represent a lot of useful things.

One of those things is multiple return values, implemented as a first-class part of LLVM since 2.3:

http://llvm.org/releases/2.3/docs/ReleaseNotes.html

Trying to fit this into C is painful; maybe it's easier in C++. You have to hope the optimizer on your compiler is really quite clever to get the same level of optimization the LLVM tools can provide with less effort because it is, as I said, directly representable in LLVM IR.

You also get memory-use intrinsics such as llvm.lifetime.start and llvm.lifetime.end, which preserve information about object lifetimes from the high-level code in the intermediate code, alongside separate intrinsics such as llvm.gcroot that help implement garbage collection. All such information would be thrown away in either C or C++ unless you do a lot more work.

http://llvm.org/docs/LangRef.html

stavros · on Nov 25, 2012

Very interesting, thank you. Why doesn't the GNU userland, for example, use LLVM? Although, I imagine gcc compiles to many architectures anyway. Still, the advantages sound many.

mahmud · on Nov 25, 2012

GCC predates LLVM; it also produces better output, has more maintainers and users, and runs on more platforms.

Sanddancer · on Nov 25, 2012

GCC doesn't always produce better output. A lot of networking code, like http daemons, as well as databases like PostgreSQL, tend to be sped up a bit by using llvm/clang. Also, the number of supported platforms for LLVM is growing pretty quickly; llvm/clang has support in the base tree for platforms like the MSP430, for example, that GCC doesn't.

exDM69 · on Nov 25, 2012

LLVM IR is a lot easier to generate than machine code and unlike machine code, it can be optimized after it is generated.

When using LLVM IR as a compiler target, you leave certain decisions to be made later by the LLVM to machine code compiler. In particular, you don't have to do exact instruction selection, instruction scheduling, register allocation (LLVM IR has infinite "registers") and memory allocation when compiling. The LLVM backend is very good at making these decisions and writing a compiler that is better than the LLVM backend would be years of work.
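
As a rough idea of what "infinite registers" leaves for the backend to sort out, here is a toy allocator in Python. allocate_registers is hypothetical and far simpler than LLVM's allocators: it assumes straight-line code, each virtual register defined once and before its uses, and it gives up instead of spilling to memory.

```python
def allocate_registers(code, physical):
    """Map virtual registers to physical ones for straight-line code.

    `code` is a list of (dest, sources) pairs over virtual register names.
    A physical register is recycled once its value has been read for the
    last time. This sketch raises instead of spilling.
    """
    last_use = {}
    for index, (dest, sources) in enumerate(code):
        for src in sources:
            last_use[src] = index
    free = list(physical)
    assignment = {}
    for index, (dest, sources) in enumerate(code):
        # Release registers whose values die at this instruction
        # (dict.fromkeys removes duplicates but keeps a stable order).
        for src in dict.fromkeys(sources):
            if last_use[src] == index:
                free.append(assignment[src])
        if not free:
            raise RuntimeError("out of registers; a real allocator would spill")
        assignment[dest] = free.pop(0)
    return assignment

code = [
    ("v0", ()),
    ("v1", ()),
    ("v2", ("v0", "v1")),
    ("v3", ("v2",)),
    ("v4", ("v3", "v2")),
]
assignment = allocate_registers(code, ["r0", "r1"])
print(assignment)
```

Five virtual registers fit into two physical ones here because their lifetimes overlap so little. A frontend emitting IR never has to think about any of this.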

LLVM IR is also machine independent so you can write just one compiler front end that emits LLVM IR and compile that to machine code of many different CPUs. You can even distribute the LLVM IR bitcode and use the LLVM JIT compiler to assemble machine code at run time when you know exactly what the target architecture is and have optimized code just for that platform (CPU instruction set extensions, like SSE, NEON and other SIMD extensions, etc).

In summary, LLVM IR is a lot easier to generate than machine code and the LLVM backend will emit better machine code than a hand built backend (unless you have years to spend on it).

eliben · on Nov 25, 2012

> LLVM IR is a lot easier to generate than machine code and unlike machine code, it can be optimized after it is generated.

Just a small nit: it's possible to optimize machine code. In fact, LLVM has some peephole optimization passes on the MachineInstr level, and hooks to add others. LLVM IR is, however, much more amenable to optimization, since it's higher-level, contains type information and was generally designed to be optimizable (i.e. SSA form).

azakai · on Nov 25, 2012

> LLVM IR is also machine independent

Not true. See http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/0437...

LLVM IR is more abstracted than a concrete machine assembly language, but it still contains implicit and explicit signs of the machine it was compiled for, for example

* Types that exist in one machine and not another (e.g. x86_fp80 does not exist on ARM).

* Endianness

* Structure alignment

As the link above says, LLVM IR is an excellent compiler IR, but it was never meant to be a portable machine-independent IR.

But it can be abstracted enough for some purposes, and some projects do use it for related things.
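
The endianness and alignment points are easy to see outside LLVM too. This sketch uses Python's struct module purely as an analogy for layout decisions that a frontend has already baked into the IR; the native size it prints depends on the machine it runs on.

```python
import struct
import sys

value = 0x01020304
little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

# Roughly struct { char c; int i; } with and without alignment padding.
packed_size = struct.calcsize("=ci")  # standard sizes, no padding: 1 + 4
native_size = struct.calcsize("@ci")  # native alignment, platform-dependent

print(sys.byteorder, little.hex(), big.hex(), packed_size, native_size)
```

Once code has been compiled assuming one byte order or one struct layout, the resulting constants and offsets sit in the IR, and moving that IR to a different target does not undo them.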

eliben · on Nov 25, 2012

LLVM IR can be target-independent. It can also contain target-dependent pieces when compiled from a target-dependent language (like C or C++).

azakai · on Nov 26, 2012

> LLVM IR can be target-independent.

Interesting, how would I generate that? What source languages are supported?

eliben · on Nov 26, 2012

Look at IR as a layer of its own, separate from Clang. You can even use it as your source language (although it's inconvenient, of course). For example, the Kaleidoscope language developed in the LLVM tutorial is target independent. The same is probably true for high-level languages that had LLVM backends attached, e.g. Haskell, Emscripten (JavaScript) etc.

azakai · on Nov 26, 2012

In a theoretical sense, yes. But it would be very hard to avoid introducing nonportable elements in your code. There is no practical way to go from any existing language to LLVM and keep it portable, with any existing LLVM frontend that I am familiar with.

stavros · on Nov 25, 2012

Thank you for the explanation, it seems that the advantages are many.

michael_miller · on Nov 25, 2012

LLVM IR provides a clean API boundary separating the compiler frontend from the backend. This has two notable advantages: it limits the complexity and "code tangle" of combining two extremely complex pieces of software, and it lets you swap in new frontend / backend components easily. For example, if you want to develop a new language, you don't need to worry about writing a fancy optimizer or backends for X86 / ARM / MIPS / esoteric processor X: you just target LLVM, and you get an advanced optimizer outputting many instruction sets for free. Additionally, if you're creating a new instruction set, and you write an LLVM backend, you get the benefit of supporting every language outputting LLVM.
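
A toy version of that frontend/backend split, in Python. Everything here is invented for illustration: the shared "IR" is just a list of stack-machine ops, with two frontends that produce it and two backends that consume it.

```python
# Shared IR: a list of stack-machine ops: ("push", n), ("add",), ("mul",).
OPS = {"+": ("add",), "*": ("mul",)}

def frontend_rpn(text):
    """Frontend 1: reverse Polish notation, e.g. '2 3 + 4 *'."""
    return [OPS[tok] if tok in OPS else ("push", int(tok))
            for tok in text.split()]

def frontend_prefix(text):
    """Frontend 2: prefix notation, e.g. '* + 2 3 4'."""
    tokens = text.split()

    def parse(pos):
        tok = tokens[pos]
        if tok in OPS:
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return left + right + [OPS[tok]], pos
        return [("push", int(tok))], pos + 1

    ir, _ = parse(0)
    return ir

def backend_interpreter(ir):
    """Backend A: execute the IR directly."""
    stack = []
    for op in ir:
        if op[0] == "push":
            stack.append(op[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op[0] == "add" else a * b)
    return stack.pop()

def backend_infix(ir):
    """Backend B: emit a fully parenthesized infix expression."""
    stack = []
    for op in ir:
        if op[0] == "push":
            stack.append(str(op[1]))
        else:
            b, a = stack.pop(), stack.pop()
            symbol = "+" if op[0] == "add" else "*"
            stack.append(f"({a} {symbol} {b})")
    return stack.pop()

ir_from_rpn = frontend_rpn("2 3 + 4 *")
ir_from_prefix = frontend_prefix("* + 2 3 4")
print(backend_interpreter(ir_from_rpn), backend_infix(ir_from_prefix))
```

Two frontends and two backends give four working combinations from four pieces of code. Adding a third backend would not touch either frontend, which is the whole appeal.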

stavros · on Nov 25, 2012

Ah, thank you, I just asked this above before reading your reply. That sounds very useful.

Someone · on Nov 25, 2012

As an addition to what others said: 'emitting' and the textual representation of LLVM IR may have put you on the wrong footing.

The LLVM intermediate representation is an in-memory data structure. Compilers can build that structure directly.

LLVM has code for serializing and deserializing that data structure. That makes it easier to debug optimization steps (for example, one can check a dead code elimination pass without having to look at CPU-specific things), makes it easier to write a compiler (a compiler can pipe the LLVM textual representation into LLVM tools), and also makes it easier for LLVM writers to discuss LLVM behavior.
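
The in-memory versus serialized distinction can be sketched like this in Python. The Instr class, dump and parse are hypothetical, and the text format is only loosely modeled on LLVM assembly (no types, for one).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    dest: str    # name of the result, without the leading '%'
    op: str      # operation name
    args: tuple  # operands as strings, e.g. ('%a', '4')

def dump(instrs):
    """Serialize the in-memory form to text, one instruction per line."""
    return "\n".join(f"%{i.dest} = {i.op} " + ", ".join(i.args)
                     for i in instrs)

def parse(text):
    """Rebuild the in-memory form from the text produced by dump()."""
    out = []
    for line in text.splitlines():
        lhs, rhs = line.split(" = ", 1)
        op, _, rest = rhs.partition(" ")
        args = tuple(a.strip() for a in rest.split(",")) if rest else ()
        out.append(Instr(lhs.lstrip("%"), op, args))
    return out

module = [
    Instr("sum", "add", ("%a", "%b")),
    Instr("prod", "mul", ("%sum", "4")),
]
text = dump(module)
print(text)
```

A compiler can build the list of Instr objects directly and never produce the text; the text exists so that humans and separate tools can inspect and exchange the same structure.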

For an example compiler, see http://llvm.org/docs/tutorial/LangImpl1.html. Section 3 shows how to directly emit the binary representation.

wladimir · on Nov 26, 2012

Exactly. Internally, the data structure serialized as "LLVM bitcode" (or the textual equivalent) is a list of functions, each function represented by a Control Flow Graph of basic blocks. Each basic block contains runs of SSA (static single assignment) instructions without jumps and flow control (effectively a DAG of operations). There is also some basic data structure layout information. This representation is very handy for code generation, code analysis, and implementing optimization passes.
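
In rough Python terms, the shape described above is something like the following. BasicBlock and Function here are made-up classes, not LLVM's C++ ones, and the instruction strings are just illustrative text that nothing validates.

```python
from dataclasses import dataclass, field

@dataclass
class BasicBlock:
    name: str
    instructions: list = field(default_factory=list)  # straight-line body
    successors: list = field(default_factory=list)    # names of next blocks

@dataclass
class Function:
    name: str
    blocks: dict = field(default_factory=dict)        # name -> BasicBlock
    entry: str = "entry"

    def add_block(self, name, instructions, successors=()):
        self.blocks[name] = BasicBlock(name, list(instructions),
                                       list(successors))

    def reachable(self):
        """Names of blocks reachable from the entry, depth-first."""
        seen, stack = [], [self.entry]
        while stack:
            name = stack.pop()
            if name not in seen:
                seen.append(name)
                stack.extend(reversed(self.blocks[name].successors))
        return seen

# An absolute-value function: negate only when the input is negative.
f = Function("abs")
f.add_block("entry", ["%neg = icmp slt %x, 0"], ["then", "done"])
f.add_block("then", ["%n = sub 0, %x"], ["done"])
f.add_block("done", ["ret"])
f.add_block("dead", ["ret"])  # nothing branches here
print(f.reachable())
```

With the control flow explicit in the data structure, an analysis such as "which blocks can never run" is a few lines of graph traversal. A module is then just a collection of such functions.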

stavros · on Nov 25, 2012

That's more or less what I got... The IR is similar to bytecode, isn't it?

Someone · on Nov 25, 2012

LLVM assembly and bitcode are both serializations of the intermediate representation. Using a somewhat loose definition, one could call all three of them IR, even though few tools would use assembly or bitcode as their internal representation of the code.

Reading http://llvm.org/docs/CommandGuide/index.html#basic-commands, llvm-as takes LLVM assembly to bitcode; llvm-dis does the reverse.

chrisaycock · on Nov 25, 2012

The IR acts as an intermediary between the compiler's front-end (parsing and semantic analysis) and the back-end (machine-code optimizations and generation). The independent IR makes it possible to reuse, say, a C-language front-end regardless of the targeted object code. Similarly, an x86 back-end can be reused with any programming language.

Another nicety of a "high-level" IR vs machine code is that a lot of details become easier. LLVM has an infinite number of registers, and abstracts out much of the calling convention.

stavros · on Nov 25, 2012

I see, thank you. I imagine the target audience is less "users of x86" and more "I just wrote a weird embedded CPU and I want compilers for it, I'll just write an LLVM backend and get them for free", is that right?

eliben · on Nov 28, 2012

I wouldn't say "for free", but yes, LLVM definitely contains some non-mainstream backends, and you can write your own.

munin · on Nov 25, 2012

LLVM IR is pretty easy to work with, and also very heavily documented and standardized. You'll have few/no "surprises" and the process of writing IR transformations is pretty well understood. So if you want to transform code, you have a solid base to start from.

BlackJack · on Nov 25, 2012

chrisaycock's answer is complete, but here are the relevant lines in the article:

The code generator is one of the most complex parts of LLVM. Its task is to "lower" the relatively high-level, target-independent LLVM IR into low-level, target-dependent "machine instructions" (MachineInstr).