gprof: Implementation

1 
1 9.1 Implementation of Profiling
1 ===============================
1 
1 Profiling works by changing how every function in your program is
1 compiled so that when it is called, it will stash away some information
1 about where it was called from.  From this, the profiler can figure out
1 what function called it, and can count how many times it was called.
1 This change is made by the compiler when your program is compiled with
1 the '-pg' option, which causes every function to call 'mcount' (or
1 '_mcount', or '__mcount', depending on the OS and compiler) as one of
1 its first operations.
1 
1    The 'mcount' routine, included in the profiling library, is
1 responsible for recording in an in-memory call graph table both its
1 parent routine (the child) and its parent's parent.  This is typically
1 done by examining the stack frame to find both the address of the child,
1 and the return address in the original parent.  Since this is a very
1 machine-dependent operation, 'mcount' itself is typically a short
1 assembly-language stub routine that extracts the required information,
1 and then calls '__mcount_internal' (a normal C function) with two
1 arguments--'frompc' and 'selfpc'.  '__mcount_internal' is responsible
1 for maintaining the in-memory call graph, which records 'frompc',
1 'selfpc', and the number of times each of these call arcs was traversed.
1 
1    GCC Version 2 provides a magical function
1 ('__builtin_return_address'), which allows a generic 'mcount' function
1 to extract the required information from the stack frame.  However, on
1 some architectures, most notably the SPARC, using this builtin can be
1 very computationally expensive, and an assembly language version of
1 'mcount' is used for performance reasons.
1 
1    Number-of-calls information for library routines is collected by
1 using a special version of the C library.  The programs in it are the
1 same as in the usual C library, but they were compiled with '-pg'.  If
1 you link your program with 'gcc ... -pg', it automatically uses the
1 profiling version of the library.
1 
1    Profiling also involves watching your program as it runs, and keeping
1 a histogram of where the program counter happens to be every now and
1 then.  Typically the program counter is looked at around 100 times per
1 second of run time, but the exact frequency may vary from system to
1 system.
1 
1    This is done is one of two ways.  Most UNIX-like operating systems
1 provide a 'profil()' system call, which registers a memory array with
1 the kernel, along with a scale factor that determines how the program's
1 address space maps into the array.  Typical scaling values cause every 2
1 to 8 bytes of address space to map into a single array slot.  On every
1 tick of the system clock (assuming the profiled program is running), the
1 value of the program counter is examined and the corresponding slot in
1 the memory array is incremented.  Since this is done in the kernel,
1 which had to interrupt the process anyway to handle the clock interrupt,
1 very little additional system overhead is required.
1 
1    However, some operating systems, most notably Linux 2.0 (and
1 earlier), do not provide a 'profil()' system call.  On such a system,
1 arrangements are made for the kernel to periodically deliver a signal to
1 the process (typically via 'setitimer()'), which then performs the same
1 operation of examining the program counter and incrementing a slot in
1 the memory array.  Since this method requires a signal to be delivered
1 to user space every time a sample is taken, it uses considerably more
1 overhead than kernel-based profiling.  Also, due to the added delay
1 required to deliver the signal, this method is less accurate as well.
1 
1    A special startup routine allocates memory for the histogram and
1 either calls 'profil()' or sets up a clock signal handler.  This routine
1 ('monstartup') can be invoked in several ways.  On Linux systems, a
1 special profiling startup file 'gcrt0.o', which invokes 'monstartup'
1 before 'main', is used instead of the default 'crt0.o'.  Use of this
1 special startup file is one of the effects of using 'gcc ... -pg' to
1 link.  On SPARC systems, no special startup files are used.  Rather, the
1 'mcount' routine, when it is invoked for the first time (typically when
1 'main' is called), calls 'monstartup'.
1 
1    If the compiler's '-a' option was used, basic-block counting is also
1 enabled.  Each object file is then compiled with a static array of
1 counts, initially zero.  In the executable code, every time a new
1 basic-block begins (i.e., when an 'if' statement appears), an extra
1 instruction is inserted to increment the corresponding count in the
1 array.  At compile time, a paired array was constructed that recorded
1 the starting address of each basic-block.  Taken together, the two
1 arrays record the starting address of every basic-block, along with the
1 number of times it was executed.
1 
1    The profiling library also includes a function ('mcleanup') which is
1 typically registered using 'atexit()' to be called as the program exits,
1 and is responsible for writing the file 'gmon.out'.  Profiling is turned
1 off, various headers are output, and the histogram is written, followed
1 by the call-graph arcs and the basic-block counts.
1 
1    The output from 'gprof' gives no indication of parts of your program
1 that are limited by I/O or swapping bandwidth.  This is because samples
1 of the program counter are taken at fixed intervals of the program's run
1 time.  Therefore, the time measurements in 'gprof' output say nothing
1 about time that your program was not running.  For example, a part of
1 the program that creates so much data that it cannot all fit in physical
1 memory at once may run very slowly due to thrashing, but 'gprof' will
1 say it uses little time.  On the other hand, sampling by run time has
1 the advantage that the amount of load due to other users won't directly
1 affect the output you get.
1