libidn: IDNA Functions

6 IDNA Functions
****************

Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire.  The IDNA document defines
internationalized domain names (IDNs) and a mechanism called IDNA for
handling them in a standard fashion.  IDNs use characters drawn from a
large repertoire (Unicode), but IDNA allows the non-ASCII characters to
be represented using only the ASCII characters already allowed in
so-called host names today.  This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be introduced
with no changes to the existing infrastructure.  IDNA is only meant for
processing domain names, not free text.

6.1 Header file ‘idna.h’
========================

To use the functions explained in this chapter, you need to include the
file ‘idna.h’ using:

     #include <idna.h>

6.2 Control Flags
=================

The IDNA ‘flags’ parameter can take on the following values, or a
bit-wise inclusive OR of any subset of them:

 -- Return code: Idna_flags IDNA_ALLOW_UNASSIGNED
     Allow unassigned Unicode code points.

 -- Return code: Idna_flags IDNA_USE_STD3_ASCII_RULES
     Check output to make sure it is a STD3 conforming host name.
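To make the STD3 check more concrete, here is a minimal sketch of the
two conditions it implies for an ASCII label: only letters, digits and
hyphens, and no hyphen at either end.  The helper
‘looks_like_std3_label’ is hypothetical and not part of libidn; the
library’s own check may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libidn: returns true if LABEL is a
   non-empty ASCII label made only of letters, digits and hyphens (LDH)
   with no leading or trailing hyphen. */
bool
looks_like_std3_label (const char *label)
{
  assert (label != NULL);
  size_t len = strlen (label);

  if (len == 0 || label[0] == '-' || label[len - 1] == '-')
    return false;

  for (size_t i = 0; i < len; i++)
    {
      char c = label[i];
      bool ldh = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                 || (c >= '0' && c <= '9') || c == '-';
      if (!ldh)
        return false;
    }
  return true;
}
```

A label such as ‘under_score’ is rejected by this sketch because the
underscore is not an LDH character.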

6.3 Prefix String
=================

 -- Macro: #define IDNA_ACE_PREFIX
     String with the official IDNA prefix, ‘xn--’.
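A caller may want to know whether a label already carries the prefix.
The sketch below assumes the comparison should ignore ASCII case, since
host names are case-insensitive.  Both ‘has_ace_prefix’ and
‘EXAMPLE_ACE_PREFIX’ are hypothetical names; the latter stands in for
‘IDNA_ACE_PREFIX’ so that the block compiles without the library.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for IDNA_ACE_PREFIX from <idna.h>, repeated here so the
   sketch compiles without the library. */
#define EXAMPLE_ACE_PREFIX "xn--"

/* Hypothetical helper: true if LABEL starts with the ACE prefix,
   compared without regard to ASCII case. */
bool
has_ace_prefix (const char *label)
{
  assert (label != NULL);
  const char *prefix = EXAMPLE_ACE_PREFIX;

  for (size_t i = 0; prefix[i] != '\0'; i++)
    {
      /* A shorter label reaches its terminating zero here and fails
         the comparison before anything past the end is read. */
      if (tolower ((unsigned char) label[i]) != prefix[i])
        return false;
    }
  return true;
}
```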

6.4 Core Functions
==================

The idea behind the IDNA function names is as follows: the
‘idna_to_ascii_4i’ and ‘idna_to_unicode_44i’ functions are the core IDNA
primitives.  The ‘4’ indicates that the function takes UCS-4 strings
(i.e., Unicode code points encoded in a 32-bit unsigned integer type) of
the specified length.  The ‘i’ indicates that the data is written
“inline” into the buffer.  This means the caller is responsible for
allocating (and de-allocating) the string, and for providing the library
with the allocated length of the string.  The output length is written
to the output length variable.  The remaining functions all contain the
‘z’ indicator, which means the strings are zero terminated.  For these
functions, all output strings are allocated by the library, and must be
de-allocated by the caller.  The ‘4’ indicator again means that the
string is UCS-4, the ‘8’ means the strings are UTF-8 and the ‘l’
indicator means the strings are encoded in the encoding used by the
current locale.
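The difference between the ‘i’ and ‘z’ conventions can be seen with a
small UCS-4 example.  The sketch below holds one label as a
zero-terminated array, which is the form a ‘4z’ function expects, and
counts its code points, which is the length an ‘i’ function would need
to be given.  The names ‘ucs4_length’ and ‘example_label’ are
hypothetical and not part of libidn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the code points in a zero-terminated
   UCS-4 string, not including the terminator. */
size_t
ucs4_length (const uint32_t *s)
{
  assert (s != NULL);
  size_t n = 0;
  while (s[n] != 0)
    n++;
  return n;
}

/* The label "räksmörgås" as UCS-4 code points, zero terminated.
   Each element holds exactly one Unicode code point. */
const uint32_t example_label[] = {
  0x72, 0xE4, 0x6B, 0x73, 0x6D, 0xF6, 0x72, 0x67, 0xE5, 0x73, 0
};
```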

   The functions provided are the following entry points:

idna_to_ascii_4i
----------------

 -- Function: int idna_to_ascii_4i (const uint32_t * IN, size_t INLEN,
          char * OUT, int FLAGS)
     IN: input array with Unicode code points.

     INLEN: length of input array with Unicode code points.

     OUT: output zero terminated string that must have room for at least
     63 characters plus the terminating zero.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     The ToASCII operation takes a sequence of Unicode code points that
     make up one domain label and transforms it into a sequence of code
     points in the ASCII range (0..7F).  If ToASCII succeeds, the
     original sequence and the resulting sequence are equivalent labels.

     It is important to note that the ToASCII operation can fail.
     ToASCII fails if any step of it fails.  If any step of the ToASCII
     operation fails on any label in a domain name, that domain name
     MUST NOT be used as an internationalized domain name.  The method
     for dealing with this failure is application-specific.

     The inputs to ToASCII are a sequence of code points, the
     AllowUnassigned flag, and the UseSTD3ASCIIRules flag.  The output
     of ToASCII is either a sequence of ASCII code points or a failure
     condition.

     ToASCII never alters a sequence of code points that are all in the
     ASCII range to begin with (although it could fail).  Applying the
     ToASCII operation multiple times has exactly the same effect as
     applying it just once.

     Return value: Returns 0 on success, or an ‘Idna_rc’ error code.

idna_to_unicode_44i
-------------------

 -- Function: int idna_to_unicode_44i (const uint32_t * IN, size_t
          INLEN, uint32_t * OUT, size_t * OUTLEN, int FLAGS)
     IN: input array with Unicode code points.

     INLEN: length of input array with Unicode code points.

     OUT: output array with Unicode code points.

     OUTLEN: on input, maximum size of output array with Unicode code
     points; on exit, actual size of output array with Unicode code
     points.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     The ToUnicode operation takes a sequence of Unicode code points
     that make up one domain label and returns a sequence of Unicode
     code points.  If the input sequence is a label in ACE form, then
     the result is an equivalent internationalized label that is not in
     ACE form, otherwise the original sequence is returned unaltered.

     ToUnicode never fails.  If any step fails, then the original input
     sequence is returned immediately in that step.

     The Punycode decoder can never output more code points than it
     inputs, but Nameprep can, and therefore ToUnicode can.  Note that
     the number of octets needed to represent a sequence of code points
     depends on the particular character encoding used.

     The inputs to ToUnicode are a sequence of code points, the
     AllowUnassigned flag, and the UseSTD3ASCIIRules flag.  The output
     of ToUnicode is always a sequence of Unicode code points.

     Return value: Returns an ‘Idna_rc’ error condition, but it must only
     be used for debugging purposes.  The output buffer is always
     guaranteed to contain the correct data according to the
     specification (apart from malloc-induced errors).  NB! This means
     that you normally ignore the return code from this function, as
     checking it means breaking the standard.

6.5 Simplified ToASCII Interface
================================

idna_to_ascii_4z
----------------

 -- Function: int idna_to_ascii_4z (const uint32_t * INPUT, char **
          OUTPUT, int FLAGS)
     INPUT: zero terminated input Unicode string.

     OUTPUT: pointer to newly allocated output string.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     Convert UCS-4 domain name to ASCII string.  The domain name may
     contain several labels, separated by dots.  The output buffer must
     be deallocated by the caller.

     Return value: Returns ‘IDNA_SUCCESS’ on success, or an error code.
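Unlike the core primitives, which handle a single label, the simplified
functions accept a whole domain name.  To illustrate what “several
labels, separated by dots” means, the sketch below walks an ASCII name
and reports how many labels it has and how long the longest one is.
The helper ‘count_labels’ is hypothetical and not part of libidn, and
it only recognizes the ASCII full stop as a separator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the dot-separated labels in NAME and
   store the length of the longest one in *LONGEST.  Empty labels,
   such as the one after a trailing dot, are counted as well. */
size_t
count_labels (const char *name, size_t *longest)
{
  assert (name != NULL && longest != NULL);
  size_t labels = 0;
  size_t current = 0;       /* length of the label being scanned */

  *longest = 0;
  for (const char *p = name;; p++)
    {
      if (*p == '.' || *p == '\0')
        {
          /* End of a label: record it and start the next one. */
          labels++;
          if (current > *longest)
            *longest = current;
          current = 0;
          if (*p == '\0')
            break;
        }
      else
        current++;
    }
  return labels;
}
```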

idna_to_ascii_8z
----------------

 -- Function: int idna_to_ascii_8z (const char * INPUT, char ** OUTPUT,
          int FLAGS)
     INPUT: zero terminated input UTF-8 string.

     OUTPUT: pointer to newly allocated output string.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     Convert UTF-8 domain name to ASCII string.  The domain name may
     contain several labels, separated by dots.  The output buffer must
     be deallocated by the caller.

     Return value: Returns ‘IDNA_SUCCESS’ on success, or an error code.

idna_to_ascii_lz
----------------

 -- Function: int idna_to_ascii_lz (const char * INPUT, char ** OUTPUT,
          int FLAGS)
     INPUT: zero terminated input string encoded in the current locale’s
     character set.

     OUTPUT: pointer to newly allocated output string.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     Convert domain name in the locale’s encoding to ASCII string.  The
     domain name may contain several labels, separated by dots.  The
     output buffer must be deallocated by the caller.

     Return value: Returns ‘IDNA_SUCCESS’ on success, or an error code.

6.6 Simplified ToUnicode Interface
==================================

idna_to_unicode_4z4z
--------------------

 -- Function: int idna_to_unicode_4z4z (const uint32_t * INPUT, uint32_t
          ** OUTPUT, int FLAGS)
     INPUT: zero-terminated Unicode string.

     OUTPUT: pointer to newly allocated output Unicode string.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     Convert possibly ACE encoded domain name in UCS-4 format into a
     UCS-4 string.  The domain name may contain several labels,
     separated by dots.  The output buffer must be deallocated by the
     caller.

     Return value: Returns ‘IDNA_SUCCESS’ on success, or an error code.

idna_to_unicode_8z4z
--------------------

 -- Function: int idna_to_unicode_8z4z (const char * INPUT, uint32_t **
          OUTPUT, int FLAGS)
     INPUT: zero-terminated UTF-8 string.

     OUTPUT: pointer to newly allocated output Unicode string.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     Convert possibly ACE encoded domain name in UTF-8 format into a
     UCS-4 string.  The domain name may contain several labels,
     separated by dots.  The output buffer must be deallocated by the
     caller.

     Return value: Returns ‘IDNA_SUCCESS’ on success, or an error code.

idna_to_unicode_8z8z
--------------------

 -- Function: int idna_to_unicode_8z8z (const char * INPUT, char **
          OUTPUT, int FLAGS)
     INPUT: zero-terminated UTF-8 string.

     OUTPUT: pointer to newly allocated output UTF-8 string.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     Convert possibly ACE encoded domain name in UTF-8 format into a
     UTF-8 string.  The domain name may contain several labels,
     separated by dots.  The output buffer must be deallocated by the
     caller.

     Return value: Returns ‘IDNA_SUCCESS’ on success, or an error code.

idna_to_unicode_8zlz
--------------------

 -- Function: int idna_to_unicode_8zlz (const char * INPUT, char **
          OUTPUT, int FLAGS)
     INPUT: zero-terminated UTF-8 string.

     OUTPUT: pointer to newly allocated output string encoded in the
     current locale’s character set.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     Convert possibly ACE encoded domain name in UTF-8 format into a
     string encoded in the current locale’s character set.  The domain
     name may contain several labels, separated by dots.  The output
     buffer must be deallocated by the caller.

     Return value: Returns ‘IDNA_SUCCESS’ on success, or an error code.

idna_to_unicode_lzlz
--------------------

 -- Function: int idna_to_unicode_lzlz (const char * INPUT, char **
          OUTPUT, int FLAGS)
     INPUT: zero-terminated string encoded in the current locale’s
     character set.

     OUTPUT: pointer to newly allocated output string encoded in the
     current locale’s character set.

     FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
     ‘IDNA_USE_STD3_ASCII_RULES’.

     Convert possibly ACE encoded domain name in the locale’s character
     set into a string encoded in the current locale’s character set.
     The domain name may contain several labels, separated by dots.  The
     output buffer must be deallocated by the caller.

     Return value: Returns ‘IDNA_SUCCESS’ on success, or an error code.

6.7 Error Handling
==================

idna_strerror
-------------

 -- Function: const char * idna_strerror (Idna_rc RC)
     RC: an ‘Idna_rc’ return code.

     Convert a return code integer to a text string.  This string can be
     used to output a diagnostic message to the user.

     *IDNA_SUCCESS:* Successful operation.  This value is guaranteed to
     always be zero; the remaining ones are only guaranteed to hold
     non-zero values, for logical comparison purposes.

     *IDNA_STRINGPREP_ERROR:* Error during string preparation.

     *IDNA_PUNYCODE_ERROR:* Error during punycode operation.

     *IDNA_CONTAINS_NON_LDH:* For IDNA_USE_STD3_ASCII_RULES, indicates
     that the string contains non-LDH ASCII characters.

     *IDNA_CONTAINS_MINUS:* For IDNA_USE_STD3_ASCII_RULES, indicates that
     the string contains a leading or trailing hyphen-minus (U+002D).

     *IDNA_INVALID_LENGTH:* The final output string is not within the
     (inclusive) range 1 to 63 characters.

     *IDNA_NO_ACE_PREFIX:* The string does not contain the ACE prefix
     (for ToUnicode).

     *IDNA_ROUNDTRIP_VERIFY_ERROR:* The ToASCII operation on the output
     string does not equal the input.

     *IDNA_CONTAINS_ACE_PREFIX:* The input contains the ACE prefix (for
     ToASCII).

     *IDNA_ICONV_ERROR:* Could not convert string in locale encoding.

     *IDNA_MALLOC_ERROR:* Could not allocate buffer (this is typically a
     fatal error).

     *IDNA_DLOPEN_ERROR:* Could not dlopen the libcidn DSO (only used
     internally in libc).

     Return value: Returns a pointer to a statically allocated string
     containing a description of the error with the return code ‘RC’.