gccint: LTO Overview

1 
1 25.1 Design Overview
1 ====================
1 
1 Link time optimization is implemented as a GCC front end for a bytecode
1 representation of GIMPLE that is emitted in special sections of '.o'
1 files.  Currently, LTO support is enabled in most ELF-based systems, as
1 well as darwin, cygwin and mingw systems.
1 
1  Since GIMPLE bytecode is saved alongside final object code, object
1 files generated with LTO support are larger than regular object files.
1 This "fat" object format makes it easy to integrate LTO into existing
1 build systems, as one can, for instance, produce archives of the files.
1 Additionally, one might be able to ship one set of fat objects which
1 could be used both for development and the production of optimized
1 builds.  A, perhaps surprising, side effect of this feature is that any
1 mistake in the toolchain leads to LTO information not being used (e.g.
1 an older 'libtool' calling 'ld' directly).  This is both an advantage,
1 as the system is more robust, and a disadvantage, as the user is not
1 informed that the optimization has been disabled.
1 
1  The current implementation only produces "fat" objects, effectively
1 doubling compilation time and increasing file sizes up to 5x the
1 original size.  This hides the problem that some tools, such as 'ar' and
1 'nm', need to understand symbol tables of LTO sections.  These tools
1 were extended to use the plugin infrastructure, and with these problems
1 solved, GCC will also support "slim" objects consisting of the
1 intermediate code alone.
1 
1  At the highest level, LTO splits the compiler in two.  The first half
1 (the "writer") produces a streaming representation of all the internal
1 data structures needed to optimize and generate code.  This includes
1 declarations, types, the callgraph and the GIMPLE representation of
1 function bodies.
1 
1  When '-flto' is given during compilation of a source file, the pass
1 manager executes all the passes in 'all_lto_gen_passes'.  Currently,
1 this phase is composed of two IPA passes:
1 
1    * 'pass_ipa_lto_gimple_out' This pass executes the function
1      'lto_output' in 'lto-streamer-out.c', which traverses the call
1      graph encoding every reachable declaration, type and function.
1      This generates a memory representation of all the file sections
1      described below.
1 
1    * 'pass_ipa_lto_finish_out' This pass executes the function
1      'produce_asm_for_decls' in 'lto-streamer-out.c', which takes the
1      memory image built in the previous pass and encodes it in the
1      corresponding ELF file sections.
1 
1  The second half of LTO support is the "reader".  This is implemented as
1 the GCC front end 'lto1' in 'lto/lto.c'.  When 'collect2' detects a link
1 set of '.o'/'.a' files with LTO information and the '-flto' is enabled,
1 it invokes 'lto1' which reads the set of files and aggregates them into
1 a single translation unit for optimization.  The main entry point for
1 the reader is 'lto/lto.c':'lto_main'.
1 
1 25.1.1 LTO modes of operation
1 -----------------------------
1 
1 One of the main goals of the GCC link-time infrastructure was to allow
1 effective compilation of large programs.  For this reason GCC implements
1 two link-time compilation modes.
1 
1   1. _LTO mode_, in which the whole program is read into the compiler at
1      link-time and optimized in a similar way as if it were a single
1      source-level compilation unit.
1 
1   2. _WHOPR or partitioned mode_, designed to utilize multiple CPUs
1      and/or a distributed compilation environment to quickly link large
1      applications.  WHOPR stands for WHOle Program optimizeR (not to be
1      confused with the semantics of '-fwhole-program').  It partitions
1      the aggregated callgraph from many different '.o' files and
1      distributes the compilation of the sub-graphs to different CPUs.
1 
1      Note that distributed compilation is not implemented yet, but since
1      the parallelism is facilitated via generating a 'Makefile', it
1      would be easy to implement.
1 
1  WHOPR splits LTO into three main stages:
1   1. Local generation (LGEN) This stage executes in parallel.  Every
1      file in the program is compiled into the intermediate language and
1      packaged together with the local call-graph and summary
1      information.  This stage is the same for both the LTO and WHOPR
1      compilation mode.
1 
1   2. Whole Program Analysis (WPA) WPA is performed sequentially.  The
1      global call-graph is generated, and a global analysis procedure
1      makes transformation decisions.  The global call-graph is
1      partitioned to facilitate parallel optimization during phase 3.
1      The results of the WPA stage are stored into new object files which
1      contain the partitions of program expressed in the intermediate
1      language and the optimization decisions.
1 
1   3. Local transformations (LTRANS) This stage executes in parallel.
1      All the decisions made during phase 2 are implemented locally in
1      each partitioned object file, and the final object code is
1      generated.  Optimizations which cannot be decided efficiently
1      during the phase 2 may be performed on the local call-graph
1      partitions.
1 
1  WHOPR can be seen as an extension of the usual LTO mode of compilation.
1 In LTO, WPA and LTRANS are executed within a single execution of the
1 compiler, after the whole program has been read into memory.
1 
1  When compiling in WHOPR mode, the callgraph is partitioned during the
1 WPA stage.  The whole program is split into a given number of partitions
1 of roughly the same size.  The compiler tries to minimize the number of
1 references which cross partition boundaries.  The main advantage of
1 WHOPR is to allow the parallel execution of LTRANS stages, which are the
1 most time-consuming part of the compilation process.  Additionally, it
1 avoids the need to load the whole program into memory.
1