find: Old Database Format

1 
1 4.2.4 Old Database Format
1 -------------------------
1 
1 The old database format is used by Unix 'locate' and 'find' programs and
1 earlier releases of the GNU ones.  'updatedb' produces this format if
1 given the '--old-format' option.
1 
1    'updatedb' runs programs called 'bigram' and 'code' to produce
1 old-format databases.  The old format differs from the new one in the
1 following ways.  Instead of each entry starting with an
1 offset-differential count byte and ending with a null, each entry
1 starts with a byte whose value, from 0 through 28, indicates an
1 offset-differential count from -14 through 14 (the byte value minus 14).
1 The byte value indicating that a long offset-differential count follows
1 is 0x1e (30), not 0x80.  The long counts are stored in host byte order,
1 which is not necessarily network byte order, and host integer word size,
1 which is usually 4 bytes.  Like the single-byte counts, they represent
1 a count 14 less than their stored value.  The database lines have no
1 termination byte; the start of the next line is indicated by its first
1 byte having a value <= 30.
1 
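The count encoding described above can be sketched in Python.  This is
only an illustration, not the code used by 'locate'; the function name
'decode_count' and its parameters are hypothetical, and the sketch
assumes that long counts are signed integers and that byte value 29 is
not a valid count byte.

```python
import struct

OFFSET = 14         # every stored count is biased by 14
LONG_ESCAPE = 0x1E  # byte value 30: a long count follows


def decode_count(buf, pos, byteorder="<", word_size=4):
    """Decode one offset-differential count starting at buf[pos].

    Returns (count, next_pos).  byteorder is a struct prefix ("<" or ">")
    and word_size is the host integer size used by the database writer;
    both are assumptions the caller must supply.
    """
    b = buf[pos]
    if b <= 28:
        # Short form: byte values 0..28 map to counts -14..14.
        return b - OFFSET, pos + 1
    if b == LONG_ESCAPE:
        # Long form: a host-sized word follows, also biased by 14.
        fmt = byteorder + {2: "h", 4: "i", 8: "q"}[word_size]
        (value,) = struct.unpack_from(fmt, buf, pos + 1)
        return value - OFFSET, pos + 1 + word_size
    raise ValueError("byte %d does not start an entry" % b)
```

For example, 'decode_count(bytes([14]), 0)' yields a count of 0, and a
byte 0x1e followed by the 4-byte word 114 yields a count of 100.
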
1    In addition, instead of starting with a dummy entry, the old database
1 format starts with a 256 byte table containing the 128 most common
1 bigrams in the file list.  A bigram is a pair of adjacent bytes.  Bytes
1 in the database that have the high bit set are indexes (with the high
1 bit cleared) into the bigram table.  The bigram and offset-differential
1 count coding makes these databases 20-25% smaller than the new format,
1 but makes them not 8-bit clean.  Any byte in a file name that is in the
1 ranges used for the special codes is replaced in the database by a
1 question mark, which not coincidentally is the shell wildcard to match a
1 single character.
1 
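As a rough sketch of how the pieces fit together, the following Python
decodes a whole old-format database held in memory.  It is not the
'locate' implementation.  The names 'decode_old_db' and 'sanitize' are
hypothetical, and the sketch assumes that the table stores each bigram
as two consecutive bytes, that the count is a change in the length of
the prefix shared with the previous entry (as in the new format), and
that the "special" bytes are those <= 30 or with the high bit set.

```python
import struct


def decode_old_db(data, byteorder="<", word_size=4):
    """Return the list of file names (as bytes) stored in data."""
    # Assumption: 128 bigrams stored as consecutive two-byte pairs.
    bigrams = [data[2 * i:2 * i + 2] for i in range(128)]
    fmt = byteorder + {2: "h", 4: "i", 8: "q"}[word_size]
    pos, prefix_len, prev, names = 256, 0, b"", []
    while pos < len(data):
        b = data[pos]
        pos += 1
        if b == 30:                    # long count follows
            (value,) = struct.unpack_from(fmt, data, pos)
            pos += word_size
            prefix_len += value - 14
        elif b <= 28:                  # short count
            prefix_len += b - 14
        else:
            raise ValueError("unexpected byte %d at entry start" % b)
        tail = bytearray()
        # The entry runs until the next byte with a value <= 30.
        while pos < len(data) and data[pos] > 30:
            c = data[pos]
            pos += 1
            if c & 0x80:               # high bit set: bigram index
                tail += bigrams[c & 0x7F]
            else:                      # plain byte
                tail.append(c)
        prev = prev[:prefix_len] + bytes(tail)
        names.append(prev)
    return names


def sanitize(name):
    """Replace bytes that collide with the special codes by '?'."""
    return bytes(c if 30 < c < 0x80 else ord("?") for c in name)
```

With a table whose first two bigrams are "/u" and "sr", the entry bytes
14, 0x80, 0x81 followed by "/bin" decode to "/usr/bin".  The 'sanitize'
helper shows why the format is not 8-bit clean: every byte of a UTF-8
multibyte character is replaced by a question mark.
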
1    The old format therefore cannot faithfully store file names that
1 contain non-ASCII characters, and should not be used in
1 internationalised environments.  That is, most installations should not
1 use it.
1 
1    Because the long counts are stored by the 'code' program as
1 native-order machine words, the database format is not easily used in
1 environments which differ in terms of byte order.  If locate databases
1 are to be shared between machines, the LOCATE02 database format should
1 be used.  This has other benefits as discussed above.  However, the
1 length of the file name currently being processed normally places
1 reasonable limits on the long counts, and 'locate' uses this
1 information to help it guess the byte ordering of an old-format
1 database.  Unless it finds evidence to the contrary, 'locate' will
1 assume that the byte order of the database is the same as the native
1 byte order of the machine running 'locate'.  The output of 'locate
1 --statistics' also includes information about the byte order of
1 old-format databases.
1 
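The idea behind the byte-order guess can be illustrated as follows.
This is a simplified sketch and not the heuristic 'locate' actually
implements; the function 'guess_byteorder' and its parameters are
hypothetical.  It assumes that a long count is plausible only if the
resulting shared-prefix length lies between 0 and the length of the
previous file name.

```python
import sys


def guess_byteorder(raw, prefix_len, prev_len, native=sys.byteorder):
    """Guess the byte order ("little" or "big") of one long count.

    raw        -- the bytes of the stored word
    prefix_len -- shared-prefix length before applying this count
    prev_len   -- length of the previous file name
    native     -- byte order to assume when there is no contrary evidence
    """
    other = "big" if native == "little" else "little"

    def plausible(order):
        # Stored words represent a count 14 less than their value.
        count = int.from_bytes(raw, order, signed=True) - 14
        return 0 <= prefix_len + count <= prev_len

    # Switch away from the native order only when it is implausible
    # and the opposite order is plausible.
    if plausible(other) and not plausible(native):
        return other
    return native
```

For instance, the word 114 written in little-endian order gives a count
of 100 when read as little-endian but an enormous count when read as
big-endian, so a previous file name of 200 bytes is enough to settle
the question.
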
1    The output of 'locate --statistics' will give an incorrect count of
1 the number of file names containing newlines or high-bit characters for
1 old-format databases.
1 
1    Old versions of GNU 'locate' fail to correctly handle very long file
1 names, possibly leading to security problems relating to a heap buffer
1 overrun.  See ⇒Security Considerations for locate for a detailed
1 explanation.
1