gettext: MO Files

1 
1 10.3 The Format of GNU MO Files
1 ===============================
1 
1    The format of the generated MO files is best described by a picture,
1 which appears below.
1 
1    The first two words serve to identify the file.  The magic number
1 always signals a GNU MO file.  It is stored in the byte order used when
1 the MO file was generated, so a reader must accept two values:
1 ‘0x950412de’ and its byte-swapped form, ‘0xde120495’.
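
   As an illustration of the byte-order check, here is a minimal Python
sketch.  The function name ‘detect_byte_order’ is hypothetical and not
part of GNU ‘gettext’; it reads the first word as little-endian and
compares it against both values.

```python
import struct

MAGIC = 0x950412DE          # seen when the reader's assumption matches the file
MAGIC_SWAPPED = 0xDE120495  # seen when the file uses the opposite byte order


def detect_byte_order(data):
    """Return the struct prefix ('<' or '>') matching the file's byte order."""
    if len(data) < 4:
        raise ValueError("too short to be an MO file")
    (magic,) = struct.unpack("<I", data[:4])  # read the word as little-endian
    if magic == MAGIC:
        return "<"  # file was written little-endian
    if magic == MAGIC_SWAPPED:
        return ">"  # file was written big-endian
    raise ValueError("not a GNU MO file")
```

   The returned prefix can then be used for every later 32-bit read from
the same file.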
1 
1    The second word describes the current revision of the file format,
1 composed of a major and a minor revision number.  The revision numbers
1 ensure that the readers of MO files can distinguish new formats from old
1 ones and handle their contents, as far as possible.  For now the major
1 revision is 0 or 1, and the minor revision is also 0 or 1.  More
1 revisions might be added in the future.  A program seeing an unexpected
1 major revision number should stop reading the MO file entirely; whereas
1 an unexpected minor revision number means that the file can be read but
1 will not reveal its full contents, when parsed by a program that
1 supports only smaller minor revision numbers.
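
   The rule for unexpected revisions can be sketched as follows.  The
text above does not say how the two numbers share the word; this sketch
assumes the major revision sits in the upper 16 bits and the minor
revision in the lower 16 bits, which is how Python's standard ‘gettext’
module interprets it.  The names ‘split_revision’ and ‘can_read’ are
hypothetical.

```python
SUPPORTED_MAJOR = (0, 1)  # major revisions this hypothetical reader understands


def split_revision(revision):
    """Split the revision word into (major, minor); assumes a 16/16 bit split."""
    return revision >> 16, revision & 0xFFFF


def can_read(revision):
    """An unexpected major revision means stop; any minor revision is readable."""
    major, _minor = split_revision(revision)
    return major in SUPPORTED_MAJOR
```

   A reader built this way ignores the minor revision when deciding
whether to continue, and simply may not see features introduced by a
newer minor revision.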
1 
1    The version is kept separate from the magic number, instead of using
1 different magic numbers for different formats, mainly because
1 ‘/etc/magic’ is not updated often.
1 
1    Then follow a number of pointers to later tables in the file.  This
1 indirection allows the prefix part of MO files to be extended without
1 having to recompile programs reading them.  This might become useful
1 for later inserting a few flag bits, an indication of the charset used,
1 new tables, or other things.
1 
1    Then, at offset O and offset T in the picture, two tables of string
1 descriptors can be found.  In both tables, each string descriptor uses
1 two 32-bit integers, one for the string length, another for the offset
1 of the string in the MO file, counting in bytes from the start of the
1 file.  The first table contains descriptors for the original strings,
1 and is sorted so the original strings are in increasing lexicographical
1 order.  The second table contains descriptors for the translated
1 strings, and is parallel to the first table: to find the corresponding
1 translation one has to access the array slot in the second array with
1 the same index.
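
   Reading the two parallel tables might look like the following Python
sketch.  The function name ‘read_pairs’ is hypothetical; the header
positions (N at byte 8, O at byte 12, T at byte 16) are taken from the
picture in this section, and the sketch assumes the magic number has
already been validated.

```python
import struct


def read_pairs(data):
    """Return [(original, translation), ...] as raw bytes, in table order."""
    (magic,) = struct.unpack("<I", data[:4])
    order = "<" if magic == 0x950412DE else ">"  # assumes a valid magic number
    n, o, t = struct.unpack(order + "3I", data[8:20])
    pairs = []
    for i in range(n):
        # Each descriptor is two 32-bit words: length, then offset from file start.
        olen, ooff = struct.unpack_from(order + "2I", data, o + 8 * i)
        tlen, toff = struct.unpack_from(order + "2I", data, t + 8 * i)
        # Slot i of the second table describes the translation of original i.
        pairs.append((data[ooff:ooff + olen], data[toff:toff + tlen]))
    return pairs
```

   The sketch performs no bounds checking; a robust reader would verify
that every offset plus length stays inside the file.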
1 
1    Having the original strings sorted enables the use of simple binary
1 search, for when the MO file does not contain a hash table, or for
1 when it is not practical to use the hash table provided in the MO
1 file.  This also has another advantage: GNU ‘gettext’ usually
1 _translates_ the empty string of a PO file into some system information
1 attached to that particular MO file, and the empty string necessarily
1 becomes the first in both the original and translated tables, making the
1 system information very easy to find.
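
   A binary search over the sorted originals can be sketched in Python
with the standard ‘bisect’ module.  The function name ‘lookup’ is
hypothetical, and the sketch assumes the two tables have already been
loaded into parallel lists of byte strings.

```python
from bisect import bisect_left


def lookup(originals, translations, msgid):
    """Find msgid in the sorted originals; return its translation or None."""
    i = bisect_left(originals, msgid)  # bytes compare lexicographically
    if i < len(originals) and originals[i] == msgid:
        return translations[i]  # same index in the parallel table
    return None
```

   Because the empty string sorts before every other string, the system
information is simply ‘translations[0]’ whenever ‘originals[0]’ is
empty.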
1 
1    The size S of the hash table can be zero.  In this case, the hash
1 table itself is not contained in the MO file.  Some people might prefer
1 this because a precomputed hashing table takes disk space, and does not
1 win _that_ much speed.  The hash table contains indices to the sorted
1 array of strings in the MO file.  Conflict resolution is done by double
1 hashing.  The precise hashing algorithm used is fairly dependent on GNU
1 ‘gettext’ code, and is not documented here.
1 
1    As for the strings themselves, they follow the hash table, and each
1 is terminated with a <NUL>; this <NUL> is not counted in the length
1 which appears in the string descriptor.  The ‘msgfmt’ program has an
1 option selecting the alignment for MO file strings.  With this option,
1 each string is separately aligned so it starts at an offset which is a
1 multiple of the alignment value.  On some RISC machines, a correct
1 alignment will speed things up.
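
   The effect of alignment on string offsets can be illustrated with a
small Python sketch.  The names ‘align_up’ and ‘place_strings’ are
hypothetical, and the sketch only models the arithmetic described above,
not the actual ‘msgfmt’ implementation.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def place_strings(strings, start, alignment=1):
    """Return one (length, offset) descriptor per string, laid out from start."""
    offset = start
    descriptors = []
    for s in strings:
        offset = align_up(offset, alignment)  # each string is aligned separately
        descriptors.append((len(s), offset))  # the length excludes the NUL
        offset += len(s) + 1                  # skip the string and its NUL
    return descriptors
```

   With an alignment of 1 the strings are packed back to back; larger
values leave padding bytes between them.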
1 
1    A context is stored by replacing the original string with the
1 concatenation of the context, an <EOT> byte, and the original string.
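
   For illustration, composing and splitting such a key might look like
this in Python.  The helper names are hypothetical; <EOT> is the byte
with value 4.

```python
EOT = b"\x04"  # the <EOT> byte separating context and original string


def context_key(msgctxt, msgid):
    """Build the key stored in the originals table for a string with context."""
    return msgctxt + EOT + msgid


def split_context(key):
    """Return (context, msgid); context is None when the key has no <EOT>."""
    ctx, sep, msgid = key.partition(EOT)
    return (ctx, msgid) if sep else (None, key)
```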
1 
1    Plural forms are stored by letting the plural of the original string
1 follow the singular of the original string, separated by a <NUL> byte.
1 The length which appears in the string descriptor includes both.
1 However, only the singular of the original string takes part in the hash
1 table lookup.  The plural variants of the translation are all stored
1 consecutively, separated by <NUL> bytes.  Here also, the length in the
1 string descriptor includes all of them.
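
   Splitting a plural entry back into its parts can be sketched as
follows.  The function name ‘split_plural’ is hypothetical, and the
inputs are the raw byte strings obtained from the two descriptors.

```python
def split_plural(original, translation):
    """Return (singular, plural, list of translated plural variants)."""
    # The original holds "singular NUL plural"; only the singular is the key.
    singular, _sep, plural = original.partition(b"\x00")
    # The translation holds all variants, each separated by a NUL byte.
    forms = translation.split(b"\x00")
    return singular, plural, forms
```

   Which variant to use for a given number is decided elsewhere, by the
plural formula in the catalog's system information; the MO layout only
stores the variants in order.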
1 
1    Nothing prevents an MO file from having embedded <NUL>s in strings.
1 However, the program interface currently used already presumes that
1 strings are <NUL> terminated, so embedded <NUL>s are somewhat useless.
1 But the MO file format is general enough that other interfaces would be
1 possible later, if, for example, we ever want to implement wide
1 characters right in MO files, where <NUL> bytes may accidentally appear.
1 (No, we don’t want to have wide characters in MO files.  They would make
1 the file unnecessarily large, and the ‘wchar_t’ type being platform
1 dependent, MO files would be platform dependent as well.)
1 
1    This particular issue has been strongly debated in the GNU ‘gettext’
1 development forum, and it is to be expected that the MO file format will
1 evolve or change over time.  It is even possible that many formats may later be
1 supported concurrently.  But surely, we have to start somewhere, and the
1 MO file format described here is a good start.  Nothing is cast in
1 concrete, and the format may later evolve fairly easily, so we should
1 feel comfortable with the current approach.
1 
1              byte
1                   +------------------------------------------+
1                0  | magic number = 0x950412de                |
1                   |                                          |
1                4  | file format revision = 0                 |
1                   |                                          |
1                8  | number of strings                        |  == N
1                   |                                          |
1               12  | offset of table with original strings    |  == O
1                   |                                          |
1               16  | offset of table with translation strings |  == T
1                   |                                          |
1               20  | size of hashing table                    |  == S
1                   |                                          |
1               24  | offset of hashing table                  |  == H
1                   |                                          |
1                   .                                          .
1                   .    (possibly more entries later)         .
1                   .                                          .
1                   |                                          |
1                O  | length & offset 0th string  ----------------.
1            O + 8  | length & offset 1st string  ------------------.
1                    ...                                    ...   | |
1      O + ((N-1)*8)| length & offset (N-1)th string           |  | |
1                   |                                          |  | |
1                T  | length & offset 0th translation  ---------------.
1            T + 8  | length & offset 1st translation  -----------------.
1                    ...                                    ...   | | | |
1      T + ((N-1)*8)| length & offset (N-1)th translation      |  | | | |
1                   |                                          |  | | | |
1                H  | start hash table                         |  | | | |
1                    ...                                    ...   | | | |
1        H + S * 4  | end hash table                           |  | | | |
1                   |                                          |  | | | |
1                   | NUL terminated 0th string  <----------------' | | |
1                   |                                          |    | | |
1                   | NUL terminated 1st string  <------------------' | |
1                   |                                          |      | |
1                    ...                                    ...       | |
1                   |                                          |      | |
1                   | NUL terminated 0th translation  <---------------' |
1                   |                                          |        |
1                   | NUL terminated 1st translation  <-----------------'
1                   |                                          |
1                    ...                                    ...
1                   |                                          |
1                   +------------------------------------------+
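
   To make the picture concrete, the following Python sketch writes a
minimal little-endian MO file with no hash table (S = 0) and revision 0.
The function name ‘build_mo’ is hypothetical, and the sketch ignores
alignment, contexts, and plural forms.

```python
import struct


def build_mo(catalog):
    """Serialize {original: translation} (both bytes) into MO file bytes."""
    keys = sorted(catalog)  # originals must be in increasing order
    n = len(keys)
    o = 28                  # O: right after the seven header words
    t = o + 8 * n           # T: right after the table of originals
    h = t + 8 * n           # H: where a hash table would start (S = 0 here)
    blob = b""              # the NUL terminated strings, placed at offset h
    orig_desc, trans_desc = [], []
    for k in keys:
        orig_desc += [len(k), h + len(blob)]   # length, then file offset
        blob += k + b"\x00"
    for k in keys:
        v = catalog[k]
        trans_desc += [len(v), h + len(blob)]
        blob += v + b"\x00"
    header = struct.pack("<7I", 0x950412DE, 0, n, o, t, 0, h)
    tables = struct.pack("<%dI" % (4 * n), *(orig_desc + trans_desc))
    return header + tables + blob
```

   A file built this way can be handed to any MO reader, for example
Python's ‘gettext.GNUTranslations’ class, to check that the layout is
understood as intended.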
1