gettext: MO Files
1
1 10.3 The Format of GNU MO Files
1 ===============================
1
1 The format of the generated MO files is best described by a picture,
1 which appears below.
1
1 The first two words identify the file. The first is the magic
1 number, which always signals a GNU MO file. The number is stored in the
1 byte order used when the MO file was generated, so a reader may
1 encounter either of two values: ‘0x950412de’ when the byte order matches
1 its own, or ‘0xde120495’ when the byte order is reversed.
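
A reader can use the magic number to decide how to decode every other
word in the file. The following is a minimal sketch of that check, not
the code GNU ‘gettext’ itself uses; the function name
‘detect_byte_order’ is hypothetical, and the return value is a prefix
for Python's ‘struct’ format strings.

```python
import struct

MO_MAGIC = 0x950412DE          # value given in the text
MO_MAGIC_SWAPPED = 0xDE120495  # the same word seen in the other byte order


def detect_byte_order(header):
    """Return '<' (little-endian) or '>' (big-endian) for an MO image.

    `header` holds at least the first four bytes of the file.
    """
    if len(header) < 4:
        raise ValueError("too short to be an MO file")
    # Decode the first word as little-endian and see which value shows up.
    (magic,) = struct.unpack("<I", header[:4])
    if magic == MO_MAGIC:
        return "<"
    if magic == MO_MAGIC_SWAPPED:
        return ">"
    raise ValueError("not a GNU MO file")
```

The returned prefix can then be prepended to every later format
string, for example ‘order + "I"’ for a single 32-bit word.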
1
1 The second word describes the current revision of the file format,
1 composed of a major and a minor revision number. The revision numbers
1 ensure that the readers of MO files can distinguish new formats from old
1 ones and handle their contents, as far as possible. For now the major
1 revision is 0 or 1, and the minor revision is also 0 or 1. More
1 revisions might be added in the future. A program seeing an unexpected
1 major revision number should stop reading the MO file entirely; whereas
1 an unexpected minor revision number means that the file can be read but
1 will not reveal its full contents, when parsed by a program that
1 supports only smaller minor revision numbers.
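
The rule for unexpected revision numbers can be sketched as follows.
The text does not say how the two numbers are packed into the word; the
split used here (major in the upper 16 bits, minor in the lower 16) is
an assumption that matches how existing readers treat the word. The
names ‘split_revision’ and ‘can_read’ are hypothetical.

```python
def split_revision(revision):
    """Split the 32-bit revision word into (major, minor).

    Assumption: major is the high 16 bits, minor is the low 16 bits.
    """
    return revision >> 16, revision & 0xFFFF


def can_read(revision, known_majors=(0, 1), known_minors=(0, 1)):
    """Return 'full' or 'partial'; raise if the file must not be read."""
    major, minor = split_revision(revision)
    if major not in known_majors:
        # Unexpected major revision: stop reading entirely.
        raise ValueError("unexpected major revision %d" % major)
    # Unexpected minor revision: readable, but some content may be missed.
    return "full" if minor in known_minors else "partial"
```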
1
1 The version is kept separate from the magic number, instead of using
1 different magic numbers for different formats, mainly because
1 ‘/etc/magic’ is not updated often.
1
1 A number of pointers to later tables in the file follow, allowing for
1 the extension of the prefix part of MO files without having to recompile
1 programs reading them. This might become useful for later inserting a
1 few flag bits, an indication of the charset used, new tables, or other
1 things.
1
1 Then, at offset O and offset T in the picture, two tables of string
1 descriptors can be found. In both tables, each string descriptor uses
1 two 32-bit integers, one for the string length, another for the offset
1 of the string in the MO file, counting in bytes from the start of the
1 file. The first table contains descriptors for the original strings,
1 and is sorted so the original strings are in increasing lexicographical
1 order. The second table contains descriptors for the translated
1 strings, and is parallel to the first table: to find the corresponding
1 translation one has to access the array slot in the second array with
1 the same index.
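
To make the two parallel tables concrete, here is a small sketch that
builds a tiny MO image in memory and reads it back. It assumes a
little-endian file with no hash table (S = 0); ‘build_sample_mo’,
‘read_descriptor’ and ‘read_pairs’ are hypothetical helpers, not part
of any library.

```python
import struct


def build_sample_mo(pairs):
    """Pack (original, translation) byte pairs into a little-endian MO
    image without a hash table. Illustration only."""
    pairs = sorted(pairs)            # originals must be in sorted order
    n = len(pairs)
    o_off = 28                       # right after the 7-word header
    t_off = o_off + 8 * n            # 8 bytes per descriptor
    str_off = t_off + 8 * n          # strings start after both tables
    o_tab, t_tab, blob = b"", b"", b""
    for original, _ in pairs:
        o_tab += struct.pack("<2I", len(original), str_off + len(blob))
        blob += original + b"\0"     # the NUL is not counted in the length
    for _, translation in pairs:
        t_tab += struct.pack("<2I", len(translation), str_off + len(blob))
        blob += translation + b"\0"
    header = struct.pack("<7I", 0x950412DE, 0, n, o_off, t_off, 0, str_off)
    return header + o_tab + t_tab + blob


def read_descriptor(data, table_offset, index, order="<"):
    """Return the bytes of string `index` in the table at `table_offset`."""
    length, offset = struct.unpack_from(order + "2I", data,
                                        table_offset + 8 * index)
    return data[offset:offset + length]


def read_pairs(data, order="<"):
    """Return [(original, translation), ...] using the same index in
    both tables."""
    _magic, _rev, n, o_off, t_off, _s, _h = struct.unpack_from(
        order + "7I", data, 0)
    return [(read_descriptor(data, o_off, i, order),
             read_descriptor(data, t_off, i, order)) for i in range(n)]


mo = build_sample_mo([(b"hello", b"bonjour"), (b"bye", b"au revoir")])
print(read_pairs(mo))
# [(b'bye', b'au revoir'), (b'hello', b'bonjour')]
```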
1
1 Having the original strings sorted enables the use of simple binary
1 search, for when the MO file does not contain a hash table, or for
1 when it is not practical to use the hash table provided in the MO
1 file. This also has another advantage, as the empty string in a GNU
1 ‘gettext’ PO file is usually _translated_ into some system information
1 attached to that particular MO file, and the empty string necessarily
1 becomes the first in both the original and translated tables, making the
1 system information very easy to find.
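
A binary search over the table of original strings might look like the
sketch below. It assumes the strings are sorted by plain byte
comparison and compares only the part of each original before the
first <NUL>, so entries that also carry a plural are still found by
their singular. The function name ‘lookup’ is hypothetical.

```python
import struct


def lookup(data, msgid, order="<"):
    """Return the translation bytes for `msgid`, or None if absent."""
    _magic, _rev, n, o_off, t_off = struct.unpack_from(order + "5I", data, 0)

    def string_at(table, i):
        length, offset = struct.unpack_from(order + "2I", data, table + 8 * i)
        return data[offset:offset + length]

    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare against the singular part only.
        candidate = string_at(o_off, mid).split(b"\0", 1)[0]
        if candidate == msgid:
            return string_at(t_off, mid)   # same index, second table
        if candidate < msgid:
            lo = mid + 1
        else:
            hi = mid
    return None
```

With such a function, ‘lookup(data, b"")’ returns the system
information, since the empty string sorts first in the table.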
1
1 The size S of the hash table can be zero. In this case, the hash
1 table itself is not contained in the MO file. Some people might prefer
1 this because a precomputed hashing table takes disk space, and does not
1 win _that_ much speed. The hash table contains indices to the sorted
1 array of strings in the MO file. Conflict resolution is done by double
1 hashing. The precise hashing algorithm used is fairly dependent on GNU
1 ‘gettext’ code, and is not documented here.
1
1 As for the strings themselves, they follow the hash table, and each is
1 terminated with a <NUL>, which is not counted in the length
1 that appears in the string descriptor. The ‘msgfmt’ program has an
1 option selecting the alignment for MO file strings. With this option,
1 each string is separately aligned so it starts at an offset which is a
1 multiple of the alignment value. On some RISC machines, a correct
1 alignment will speed things up.
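
The effect of the alignment value on string offsets can be sketched
with a little arithmetic. This only computes where each string would
start; the helper names ‘align_up’ and ‘place_strings’ are
hypothetical, and nothing is assumed about what fills the gaps between
strings.

```python
def align_up(offset, alignment):
    """Round `offset` up to the next multiple of `alignment` (>= 1)."""
    return (offset + alignment - 1) // alignment * alignment


def place_strings(strings, start, alignment=1):
    """Return (offsets, end) for NUL-terminated strings laid out from
    byte position `start`, each one separately aligned."""
    offsets = []
    pos = start
    for s in strings:
        pos = align_up(pos, alignment)
        offsets.append(pos)
        pos += len(s) + 1            # +1 for the terminating NUL
    return offsets, pos


print(place_strings([b"bye", b"hello"], 60, alignment=8))
# ([64, 72], 78)
```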
1
1 A context is stored by replacing the original string with the
1 concatenation of the context, an <EOT> byte, and the original string.
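
As a short illustration, the key stored for a string with a context
can be built and taken apart like this. <EOT> is the ASCII control
character with value 4; the helper names are hypothetical.

```python
EOT = b"\x04"  # ASCII End Of Transmission


def context_key(msgctxt, msgid):
    """Build the original string as stored for a message with a context."""
    return msgctxt + EOT + msgid


def split_context(stored):
    """Return (context, original); context is None when there is none."""
    context, sep, original = stored.partition(EOT)
    if not sep:
        return None, stored
    return context, original
```

For example, the context ‘menu’ and the original string ‘Open’ are
stored together as the eight-byte string ‘menu<EOT>Open’.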
1
1 Plural forms are stored by letting the plural of the original string
1 follow the singular of the original string, separated by a <NUL>
1 byte. The length which appears in the string descriptor includes both.
1 However, only the singular of the original string takes part in the hash
1 table lookup. The plural variants of the translation are all stored
1 consecutively, separated by <NUL> bytes. Here also, the length in
1 the string descriptor includes all of them.
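
Given the bytes that a descriptor points to, the plural layout can be
unpacked as in the sketch below. The helper names are hypothetical,
and choosing which variant to use for a given number is outside the
scope of this section.

```python
def split_original(stored):
    """Return (singular, plural); plural is None for non-plural entries."""
    singular, sep, plural = stored.partition(b"\0")
    return singular, (plural if sep else None)


def split_translations(stored):
    """Return the list of plural variants of a translation."""
    return stored.split(b"\0")


print(split_original(b"%d file\0%d files"))
# (b'%d file', b'%d files')
print(split_translations(b"%d fichier\0%d fichiers"))
# [b'%d fichier', b'%d fichiers']
```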
1
1 Nothing prevents an MO file from having embedded <NUL>s in strings.
1 However, the program interface currently used already presumes that
1 strings are <NUL> terminated, so embedded <NUL>s are somewhat useless.
1 But the MO file format is general enough that other interfaces would be
1 possible later if, for example, we ever wanted to implement wide
1 characters right in MO files, where <NUL> bytes may accidentally appear.
1 (No, we don’t want to have wide characters in MO files. They would make
1 the file unnecessarily large, and the ‘wchar_t’ type being platform
1 dependent, MO files would be platform dependent as well.)
1
1 This particular issue has been strongly debated in the GNU ‘gettext’
1 development forum, and the MO file format can be expected to evolve
1 or change over time. It is even possible that many formats may later be
1 supported concurrently. But surely, we have to start somewhere, and the
1 MO file format described here is a good start. Nothing is cast in
1 concrete, and the format may later evolve fairly easily, so we should
1 feel comfortable with the current approach.
1
1         byte
1              +------------------------------------------+
1           0  | magic number = 0x950412de                |
1              |                                          |
1           4  | file format revision = 0                 |
1              |                                          |
1           8  | number of strings                        |  == N
1              |                                          |
1          12  | offset of table with original strings    |  == O
1              |                                          |
1          16  | offset of table with translation strings |  == T
1              |                                          |
1          20  | size of hashing table                    |  == S
1              |                                          |
1          24  | offset of hashing table                  |  == H
1              |                                          |
1              .                                          .
1              .    (possibly more entries later)         .
1              .                                          .
1              |                                          |
1           O  | length & offset 0th string  ----------------.
1       O + 8  | length & offset 1st string  ------------------.
1               ...                                    ...   | |
1 O + ((N-1)*8)| length & offset (N-1)th string           |  | |
1              |                                          |  | |
1           T  | length & offset 0th translation  ---------------.
1       T + 8  | length & offset 1st translation  -----------------.
1               ...                                    ...   | | | |
1 T + ((N-1)*8)| length & offset (N-1)th translation      |  | | | |
1              |                                          |  | | | |
1           H  | start hash table                         |  | | | |
1               ...                                    ...   | | | |
1   H + S * 4  | end hash table                           |  | | | |
1              |                                          |  | | | |
1              | NUL terminated 0th string  <----------------' | | |
1              |                                          |    | | |
1              | NUL terminated 1st string  <------------------' | |
1              |                                          |      | |
1               ...                                    ...       | |
1              |                                          |      | |
1              | NUL terminated 0th translation  <---------------' |
1              |                                          |        |
1              | NUL terminated 1st translation  <-----------------'
1              |                                          |
1               ...                                    ...
1              |                                          |
1              +------------------------------------------+
1