libidn: IDNA Functions
1
1 6 IDNA Functions
1 ****************
1
1 Until now, there has been no standard method for domain names to use
1 characters outside the ASCII repertoire. The IDNA document defines
1 internationalized domain names (IDNs) and a mechanism called IDNA for
1 handling them in a standard fashion. IDNs use characters drawn from a
1 large repertoire (Unicode), but IDNA allows the non-ASCII characters to
1 be represented using only the ASCII characters already allowed in
1 so-called host names today. This backward-compatible representation is
1 required in existing protocols like DNS, so that IDNs can be introduced
1 with no changes to the existing infrastructure. IDNA is only meant for
1 processing domain names, not free text.
1
1 6.1 Header file ‘idna.h’
1 ========================
1
1 To use the functions explained in this chapter, you need to include the
1 file ‘idna.h’ using:
1
1 #include <idna.h>
1
1 6.2 Control Flags
1 =================
1
1 The IDNA ‘flags’ parameter can take any of the following values, or a
1 bitwise inclusive OR of any subset of them:
1
1 -- Return code: Idna_flags IDNA_ALLOW_UNASSIGNED
1 Allow unassigned Unicode code points.
1
1 -- Return code: Idna_flags IDNA_USE_STD3_ASCII_RULES
1 Check the output to make sure it is a STD3-conforming host name.
1
1 6.3 Prefix String
1 =================
1
1 -- Macro: #define IDNA_ACE_PREFIX
1 String with the official IDNA prefix, ‘xn--’.
1
1 6.4 Core Functions
1 ==================
1
1 The idea behind the IDNA function names is as follows: the
1 ‘idna_to_ascii_4i’ and ‘idna_to_unicode_44i’ functions are the core IDNA
1 primitives. The ‘4’ indicates that the function takes UCS-4 strings
1 (i.e., Unicode code points encoded in a 32-bit unsigned integer type) of
1 the specified length. The ‘i’ indicates that the data is written
1 “inline” into the buffer. This means the caller is responsible for
1 allocating (and de-allocating) the string, and providing the library
1 with the allocated length of the string. The output length is written
1 in the output length variable. The remaining functions all contain the
1 ‘z’ indicator, which means the strings are zero terminated. All output
1 strings are allocated by the library, and must be de-allocated by the
1 caller. The ‘4’ indicator again means that the string is UCS-4, the ‘8’
1 means the strings are UTF-8 and the ‘l’ indicator means the strings are
1 encoded in the encoding used by the current locale.
1
1 The functions provided are the following entry points:
1
1 idna_to_ascii_4i
1 ----------------
1
1 -- Function: int idna_to_ascii_4i (const uint32_t * IN, size_t INLEN,
1 char * OUT, int FLAGS)
1 IN: input array with Unicode code points.
1
1 INLEN: length of input array with Unicode code points.
1
1 OUT: output zero terminated string that must have room for at least
1 63 characters plus the terminating zero.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 The ToASCII operation takes a sequence of Unicode code points that
1 make up one domain label and transforms it into a sequence of code
1 points in the ASCII range (0..7F). If ToASCII succeeds, the
1 original sequence and the resulting sequence are equivalent labels.
1
1 It is important to note that the ToASCII operation can fail.
1 ToASCII fails if any step of it fails. If any step of the ToASCII
1 operation fails on any label in a domain name, that domain name
1 MUST NOT be used as an internationalized domain name. The method
1 for dealing with this failure is application-specific.
1
1 The inputs to ToASCII are a sequence of code points, the
1 AllowUnassigned flag, and the UseSTD3ASCIIRules flag. The output
1 of ToASCII is either a sequence of ASCII code points or a failure
1 condition.
1
1 ToASCII never alters a sequence of code points that are all in the
1 ASCII range to begin with (although it could fail). Applying the
1 ToASCII operation multiple times has exactly the same effect as
1 applying it just once.
1
1 Return value: Returns 0 on success, or an ‘Idna_rc’ error code.
1
1 idna_to_unicode_44i
1 -------------------
1
1 -- Function: int idna_to_unicode_44i (const uint32_t * IN, size_t
1 INLEN, uint32_t * OUT, size_t * OUTLEN, int FLAGS)
1 IN: input array with Unicode code points.
1
1 INLEN: length of input array with Unicode code points.
1
1 OUT: output array with Unicode code points.
1
1 OUTLEN: on input, the maximum size of the output array, in Unicode
1 code points; on exit, the actual number of code points in the
1 output array.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 The ToUnicode operation takes a sequence of Unicode code points
1 that make up one domain label and returns a sequence of Unicode
1 code points. If the input sequence is a label in ACE form, then
1 the result is an equivalent internationalized label that is not in
1 ACE form, otherwise the original sequence is returned unaltered.
1
1 ToUnicode never fails. If any step fails, then the original input
1 sequence is returned immediately in that step.
1
1 The Punycode decoder can never output more code points than it
1 inputs, but Nameprep can, and therefore ToUnicode can. Note that
1 the number of octets needed to represent a sequence of code points
1 depends on the particular character encoding used.
1
1 The inputs to ToUnicode are a sequence of code points, the
1 AllowUnassigned flag, and the UseSTD3ASCIIRules flag. The output
1 of ToUnicode is always a sequence of Unicode code points.
1
1 Return value: Returns an ‘Idna_rc’ error condition, but it must only
1 be used for debugging purposes. The output buffer is always
1 guaranteed to contain the correct data according to the
1 specification (barring memory allocation errors). Note that this
1 means you should normally ignore the return code from this
1 function, because acting on it would break the standard.
1
1 6.5 Simplified ToASCII Interface
1 ================================
1
1 idna_to_ascii_4z
1 ----------------
1
1 -- Function: int idna_to_ascii_4z (const uint32_t * INPUT, char **
1 OUTPUT, int FLAGS)
1 INPUT: zero terminated input Unicode string.
1
1 OUTPUT: pointer to newly allocated output string.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 Convert UCS-4 domain name to ASCII string. The domain name may
1 contain several labels, separated by dots. The output buffer must
1 be deallocated by the caller.
1
1 Return value: Returns ‘IDNA_SUCCESS’ on success, or error code.
1
1 idna_to_ascii_8z
1 ----------------
1
1 -- Function: int idna_to_ascii_8z (const char * INPUT, char ** OUTPUT,
1 int FLAGS)
1 INPUT: zero terminated input UTF-8 string.
1
1 OUTPUT: pointer to newly allocated output string.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 Convert UTF-8 domain name to ASCII string. The domain name may
1 contain several labels, separated by dots. The output buffer must
1 be deallocated by the caller.
1
1 Return value: Returns ‘IDNA_SUCCESS’ on success, or error code.
1
1 idna_to_ascii_lz
1 ----------------
1
1 -- Function: int idna_to_ascii_lz (const char * INPUT, char ** OUTPUT,
1 int FLAGS)
1 INPUT: zero terminated input string encoded in the current locale’s
1 character set.
1
1 OUTPUT: pointer to newly allocated output string.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 Convert domain name in the locale’s encoding to ASCII string. The
1 domain name may contain several labels, separated by dots. The
1 output buffer must be deallocated by the caller.
1
1 Return value: Returns ‘IDNA_SUCCESS’ on success, or error code.
1
1 6.6 Simplified ToUnicode Interface
1 ==================================
1
1 idna_to_unicode_4z4z
1 --------------------
1
1 -- Function: int idna_to_unicode_4z4z (const uint32_t * INPUT, uint32_t
1 ** OUTPUT, int FLAGS)
1 INPUT: zero-terminated Unicode string.
1
1 OUTPUT: pointer to newly allocated output Unicode string.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 Convert possibly ACE encoded domain name in UCS-4 format into a
1 UCS-4 string. The domain name may contain several labels,
1 separated by dots. The output buffer must be deallocated by the
1 caller.
1
1 Return value: Returns ‘IDNA_SUCCESS’ on success, or error code.
1
1 idna_to_unicode_8z4z
1 --------------------
1
1 -- Function: int idna_to_unicode_8z4z (const char * INPUT, uint32_t **
1 OUTPUT, int FLAGS)
1 INPUT: zero-terminated UTF-8 string.
1
1 OUTPUT: pointer to newly allocated output Unicode string.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 Convert possibly ACE encoded domain name in UTF-8 format into a
1 UCS-4 string. The domain name may contain several labels,
1 separated by dots. The output buffer must be deallocated by the
1 caller.
1
1 Return value: Returns ‘IDNA_SUCCESS’ on success, or error code.
1
1 idna_to_unicode_8z8z
1 --------------------
1
1 -- Function: int idna_to_unicode_8z8z (const char * INPUT, char **
1 OUTPUT, int FLAGS)
1 INPUT: zero-terminated UTF-8 string.
1
1 OUTPUT: pointer to newly allocated output UTF-8 string.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 Convert possibly ACE encoded domain name in UTF-8 format into a
1 UTF-8 string. The domain name may contain several labels,
1 separated by dots. The output buffer must be deallocated by the
1 caller.
1
1 Return value: Returns ‘IDNA_SUCCESS’ on success, or error code.
1
1 idna_to_unicode_8zlz
1 --------------------
1
1 -- Function: int idna_to_unicode_8zlz (const char * INPUT, char **
1 OUTPUT, int FLAGS)
1 INPUT: zero-terminated UTF-8 string.
1
1 OUTPUT: pointer to newly allocated output string encoded in the
1 current locale’s character set.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 Convert possibly ACE encoded domain name in UTF-8 format into a
1 string encoded in the current locale’s character set. The domain
1 name may contain several labels, separated by dots. The output
1 buffer must be deallocated by the caller.
1
1 Return value: Returns ‘IDNA_SUCCESS’ on success, or error code.
1
1 idna_to_unicode_lzlz
1 --------------------
1
1 -- Function: int idna_to_unicode_lzlz (const char * INPUT, char **
1 OUTPUT, int FLAGS)
1 INPUT: zero-terminated string encoded in the current locale’s
1 character set.
1
1 OUTPUT: pointer to newly allocated output string encoded in the
1 current locale’s character set.
1
1 FLAGS: an ‘Idna_flags’ value, e.g., ‘IDNA_ALLOW_UNASSIGNED’ or
1 ‘IDNA_USE_STD3_ASCII_RULES’.
1
1 Convert possibly ACE encoded domain name in the locale’s character
1 set into a string encoded in the current locale’s character set.
1 The domain name may contain several labels, separated by dots. The
1 output buffer must be deallocated by the caller.
1
1 Return value: Returns ‘IDNA_SUCCESS’ on success, or error code.
1
1 6.7 Error Handling
1 ==================
1
1 idna_strerror
1 -------------
1
1 -- Function: const char * idna_strerror (Idna_rc RC)
1 RC: an ‘Idna_rc’ return code.
1
1 Convert a return code integer to a text string. This string can be
1 used to output a diagnostic message to the user.
1
1 *IDNA_SUCCESS:* Successful operation. This value is guaranteed to
1 always be zero; the remaining ones are only guaranteed to hold
1 non-zero values, for logical comparison purposes.
1
1 *IDNA_STRINGPREP_ERROR:* Error during string preparation.
1
1 *IDNA_PUNYCODE_ERROR:* Error during punycode operation.
1
1 *IDNA_CONTAINS_NON_LDH:* For IDNA_USE_STD3_ASCII_RULES, indicates
1 that the string contains non-LDH ASCII characters.
1
1 *IDNA_CONTAINS_MINUS:* For IDNA_USE_STD3_ASCII_RULES, indicates that
1 the string contains a leading or trailing hyphen-minus (U+002D).
1
1 *IDNA_INVALID_LENGTH:* The final output string is not within the
1 (inclusive) range 1 to 63 characters.
1
1 *IDNA_NO_ACE_PREFIX:* The string does not contain the ACE prefix
1 (for ToUnicode).
1
1 *IDNA_ROUNDTRIP_VERIFY_ERROR:* The ToASCII operation on output
1 string does not equal the input.
1
1 *IDNA_CONTAINS_ACE_PREFIX:* The input contains the ACE prefix (for
1 ToASCII).
1
1 *IDNA_ICONV_ERROR:* Could not convert string in locale encoding.
1
1 *IDNA_MALLOC_ERROR:* Could not allocate buffer (this is typically a
1 fatal error).
1
1 *IDNA_DLOPEN_ERROR:* Could not dlopen the libcidn DSO (only used
1 internally in libc).
1
1 Return value: Returns a pointer to a statically allocated string
1 containing a description of the error with the return code RC.
1