gccint: LTO Overview
1
1 25.1 Design Overview
1 ====================
1
1 Link time optimization is implemented as a GCC front end for a bytecode
1 representation of GIMPLE that is emitted in special sections of '.o'
1 files. Currently, LTO support is enabled in most ELF-based systems, as
1 well as darwin, cygwin and mingw systems.
1
1 Since GIMPLE bytecode is saved alongside final object code, object
1 files generated with LTO support are larger than regular object files.
1 This "fat" object format makes it easy to integrate LTO into existing
1 build systems, as one can, for instance, produce archives of the files.
1 Additionally, one might be able to ship one set of fat objects which
1 could be used both for development and the production of optimized
1 builds. A, perhaps surprising, side effect of this feature is that any
1 mistake in the toolchain leads to LTO information not being used (e.g.
1 an older 'libtool' calling 'ld' directly). This is both an advantage,
1 as the system is more robust, and a disadvantage, as the user is not
1 informed that the optimization has been disabled.
1
1 The current implementation only produces "fat" objects, effectively
1 doubling compilation time and increasing file sizes up to 5x the
1 original size. This hides the problem that some tools, such as 'ar' and
1 'nm', need to understand symbol tables of LTO sections. These tools
1 were extended to use the plugin infrastructure, and with these problems
1 solved, GCC will also support "slim" objects consisting of the
1 intermediate code alone.
1
1 At the highest level, LTO splits the compiler in two. The first half
1 (the "writer") produces a streaming representation of all the internal
1 data structures needed to optimize and generate code. This includes
1 declarations, types, the callgraph and the GIMPLE representation of
1 function bodies.
1
1 When '-flto' is given during compilation of a source file, the pass
1 manager executes all the passes in 'all_lto_gen_passes'. Currently,
1 this phase is composed of two IPA passes:
1
1 * 'pass_ipa_lto_gimple_out' This pass executes the function
1 'lto_output' in 'lto-streamer-out.c', which traverses the call
1 graph encoding every reachable declaration, type and function.
1 This generates a memory representation of all the file sections
1 described below.
1
1 * 'pass_ipa_lto_finish_out' This pass executes the function
1 'produce_asm_for_decls' in 'lto-streamer-out.c', which takes the
1 memory image built in the previous pass and encodes it in the
1 corresponding ELF file sections.
1
1 The second half of LTO support is the "reader". This is implemented as
1 the GCC front end 'lto1' in 'lto/lto.c'. When 'collect2' detects a link
1 set of '.o'/'.a' files with LTO information and the '-flto' is enabled,
1 it invokes 'lto1' which reads the set of files and aggregates them into
1 a single translation unit for optimization. The main entry point for
1 the reader is 'lto/lto.c':'lto_main'.
1
1 25.1.1 LTO modes of operation
1 -----------------------------
1
1 One of the main goals of the GCC link-time infrastructure was to allow
1 effective compilation of large programs. For this reason GCC implements
1 two link-time compilation modes.
1
1 1. _LTO mode_, in which the whole program is read into the compiler at
1 link-time and optimized in a similar way as if it were a single
1 source-level compilation unit.
1
1 2. _WHOPR or partitioned mode_, designed to utilize multiple CPUs
1 and/or a distributed compilation environment to quickly link large
1 applications. WHOPR stands for WHOle Program optimizeR (not to be
1 confused with the semantics of '-fwhole-program'). It partitions
1 the aggregated callgraph from many different '.o' files and
1 distributes the compilation of the sub-graphs to different CPUs.
1
1 Note that distributed compilation is not implemented yet, but since
1 the parallelism is facilitated via generating a 'Makefile', it
1 would be easy to implement.
1
1 WHOPR splits LTO into three main stages:
1 1. Local generation (LGEN) This stage executes in parallel. Every
1 file in the program is compiled into the intermediate language and
1 packaged together with the local call-graph and summary
1 information. This stage is the same for both the LTO and WHOPR
1 compilation mode.
1
1 2. Whole Program Analysis (WPA) WPA is performed sequentially. The
1 global call-graph is generated, and a global analysis procedure
1 makes transformation decisions. The global call-graph is
1 partitioned to facilitate parallel optimization during phase 3.
1 The results of the WPA stage are stored into new object files which
1 contain the partitions of program expressed in the intermediate
1 language and the optimization decisions.
1
1 3. Local transformations (LTRANS) This stage executes in parallel.
1 All the decisions made during phase 2 are implemented locally in
1 each partitioned object file, and the final object code is
1 generated. Optimizations which cannot be decided efficiently
1 during the phase 2 may be performed on the local call-graph
1 partitions.
1
1 WHOPR can be seen as an extension of the usual LTO mode of compilation.
1 In LTO, WPA and LTRANS are executed within a single execution of the
1 compiler, after the whole program has been read into memory.
1
1 When compiling in WHOPR mode, the callgraph is partitioned during the
1 WPA stage. The whole program is split into a given number of partitions
1 of roughly the same size. The compiler tries to minimize the number of
1 references which cross partition boundaries. The main advantage of
1 WHOPR is to allow the parallel execution of LTRANS stages, which are the
1 most time-consuming part of the compilation process. Additionally, it
1 avoids the need to load the whole program into memory.
1