An IR (intermediate representation) makes a compiler "retargetable": new processors can be supported by adding a new backend, not by a full rewrite.
Compilation, especially of the kinds of languages LLVM is most used for (impure, procedural, object-oriented, with optional garbage collection), lends itself to very similar optimization techniques. Optimization happens in phases, one after the other; sometimes a phase needs to be repeated after the code undergoes further "simplification". Since the phases mostly operate on the same data format, and since they are largely portable across the processor architectures in wide use today, it makes sense to delay actual object-code generation until last.
The final assembly instructions generated are as good as machine code. In fact, the whole reason it's called "assembly", as opposed to "compilation" or even "translation", is that there is a 1:1 correspondence between assembly instructions and machine instructions. Many assemblers just use a hash table of instruction templates to look up the encoding by mnemonic :-)
Even when assembled, the program is not in a runnable state, as many of its external symbols and dependencies are unresolved. A file is its own compilation unit, so functions referenced in other files in the project, in dependency libraries, or in standard system libraries and system calls have to be either resolved, or registered somewhere handy for quick resolution later. Static linking does the former, dynamic linking the latter. If the object file exports symbols for use by others, it may need to be made into a shared library.
As you can imagine, all this work is both very platform specific and tedious. Which is why one might want to avoid the final "weaving" of the binary and leave it to someone intimately familiar with the target environment. Someone like the vendor's assembler and linker, or better yet, the high-quality binary tools from the good folks at GNU :-)
I don't think you understood what I was saying, and others replying to you are also off the mark, at least for the excerpt you quote.
LLVM is a compiler framework. Retargetability is a property of compilers, not languages. Most compilers are retargetable. A good sign is if a compiler generates code for more than two unrelated processors. Retargetability is just an engineering design decision. Most C and C++ compilers (in fact, most non-toy compilers maintained by more than half a dozen people) are retargetable.
Think of retargetability as related to portability, but not the same. A compiler can be non-portable but retargetable. Many compilers for small embedded processors target a huge number of processors, while themselves running only on x86 and Windows. That is called cross-compilation :-)
Others have responded, but I don't think they get the gist of this question exactly.
C, C++, etc. are just as retargetable as LLVM: you can take IR, C, or C++ code (assuming standards compliance) and get a machine-dependent binary anywhere you have a working compiler / assembler / linker stack. Clang uses LLVM as a translation layer for C and C++ because LLVM IR is much more machine-like, which makes the final assembler / linker easier to write: you are effectively translating abstracted assembly into machine-specific assembly, instead of translating an object-oriented / procedural language into machine instructions.
It is true of C and C++. In fact, LLVM is mostly used as a C/C++/Objective-C back-end (in Clang). The main advantage of LLVM is that it is a really good representation to perform optimizations on.
LLVM IR is in SSA form (static single assignment), THE most popular style of IR in compiler research of late. It is pretty much a lingua franca; optimization techniques old and new have been adapted to work with SSA.
Trying to fit this into C is painful; maybe it's easier in C++. You have to hope your compiler's optimizer is quite clever to get the same level of optimization that the LLVM tools can provide with less effort, because the information is, as I said, directly representable in LLVM IR.
You also get things like memory use intrinsics, such as llvm.lifetime.start and llvm.lifetime.end, which help implement garbage collection by preserving information from the high-level code in the intermediate code. All such information would be thrown away in either C or C++ unless you do a lot more work.
Very interesting, thank you. Why doesn't the GNU userland, for example, use LLVM? Although, I imagine gcc compiles to many architectures anyway. Still, the advantages sound many.
GCC doesn't always produce better output. A lot of networking code, like http daemons, as well as databases like PostgreSQL, tends to be sped up a bit by using llvm/clang. Also, the number of platforms LLVM supports is growing pretty quickly; llvm/clang has support in the base tree for platforms that GCC doesn't, like the MSP430.
LLVM IR is a lot easier to generate than machine code and unlike machine code, it can be optimized after it is generated.
When using LLVM IR as a compiler target, you leave certain decisions to be made later by the LLVM-to-machine-code compiler. In particular, you don't have to do exact instruction selection, instruction scheduling, register allocation (LLVM IR has infinite "registers"), or memory allocation when compiling. The LLVM backend is very good at making these decisions, and writing a compiler that beats the LLVM backend would take years of work.
LLVM IR is also machine independent, so you can write just one compiler front end that emits LLVM IR and compile it to machine code for many different CPUs. You can even distribute LLVM IR bitcode and use the LLVM JIT compiler to generate machine code at run time, when you know exactly what the target architecture is, and get code optimized just for that platform (CPU instruction-set extensions like SSE, NEON, and other SIMD extensions, etc.).
In summary, LLVM IR is a lot easier to generate than machine code and the LLVM backend will emit better machine code than a hand built backend (unless you have years to spend on it).
> LLVM IR is a lot easier to generate than machine code and unlike machine code, it can be optimized after it is generated.
Just a small nit: it's possible to optimize machine code. In fact, LLVM has some peephole optimization passes on the MachineInstr level, and hooks to add others. LLVM IR is, however, much more amenable to optimization, since it's higher-level, contains type information and was generally designed to be optimizable (i.e. SSA form).
LLVM IR is more abstracted than a concrete machine assembly language, but it still contains implicit and explicit traces of the machine it was compiled for, for example:
* Types that exist in one machine and not another (e.g. x86_f80 does not exist on ARM).
* Endianness
* Structure alignment
As the link above says, LLVM IR is an excellent compiler IR, but it was never meant to be a portable machine-independent IR.
But it can be abstracted enough for some purposes, and some projects do use it for related things.
Look at IR as a layer of its own, separate from Clang. You can even use it as your source language (although it's inconvenient, of course). For example, the Kaleidoscope language developed in the LLVM tutorial is target independent. The same is probably true for high-level languages that have had LLVM backends attached, e.g. Haskell, Emscripten (JavaScript), etc.
In a theoretical sense, yes. But it would be very hard to avoid introducing nonportable elements in your code. There is no practical way to go from any existing language to LLVM and keep it portable, with any existing LLVM frontend that I am familiar with.
LLVM IR provides a clean API boundary separating the compiler frontend from the backend. This has two notable advantages: it limits the complexity and "code tangle" of combining two extremely complex pieces of software, and it lets you swap in new frontend / backend components easily. For example, if you want to develop a new language, you don't need to worry about writing a fancy optimizer or backends for X86 / ARM / MIPS / esoteric processor X: you just target LLVM, and you get an advanced optimizer outputting many instruction sets for free. Additionally, if you're creating a new instruction set, and you write an LLVM backend, you get the benefit of supporting every language outputting LLVM.
As an addition to what others said: 'emitting' and the textual representation of LLVM IR may have put you on the wrong footing.
The LLVM intermediate representation is an in-memory data structure. Compilers can build that structure directly.
LLVM has code for serializing and deserializing that data structure. That makes it easier to debug optimization steps (for example, one can check a dead-code elimination pass without having to look at CPU-specific things), makes it easier to write a compiler (a compiler can pipe the LLVM textual representation into LLVM tools), and also makes it easier for LLVM writers to discuss LLVM behavior.
Exactly. Internally, the data structure serialized as "LLVM bitcode" (or its textual equivalent) is a list of functions, each function represented as a control flow graph of basic blocks. Each basic block contains a run of SSA (static single assignment) instructions with no internal jumps or flow control (effectively a DAG of operations). There is also some basic data-structure layout information.
This representation is very handy for code generation, code analysis, and implementing optimization passes.
LLVM assembly and bitcode are both serializations of the intermediate representation. Using a somewhat loose definition, one could call all three of them IR, even though few tools would use assembly or bitcode as their internal representation of the code.
The IR acts as an intermediary between the compiler's front-end (parsing and semantic analysis) and the back-end (machine-code optimizations and generation). The independent IR makes it possible to reuse, say, a C-language front-end regardless of the targeted object code. Similarly, an x86 back-end can be reused with any programming language.
Another nicety of a "high-level" IR vs. machine code is that a lot of details become easier. LLVM has an infinite number of registers, and abstracts out much of the calling convention.
I see, thank you. I imagine the target audience is less "users of x86" and more "I just built a weird embedded CPU and I want compilers for it; I'll just write an LLVM backend and get them for free", is that right?
LLVM IR is pretty easy to work with, and also very heavily documented and standardized. You'll have few or no "surprises", and the process of writing IR transformations is well understood. So if you want to transform code, you have a solid base to start from.
chrisaycock's answer is complete, but here are the relevant lines in the article:
The code generator is one of the most complex parts of LLVM. Its task is to "lower" the relatively high-level, target-independent LLVM IR into low-level, target-dependent "machine instructions" (MachineInstr).