libidn: Punycode Functions

5 Punycode Functions
********************

Punycode is a simple and efficient transfer encoding syntax designed for
use with Internationalized Domain Names in Applications. It uniquely
and reversibly transforms a Unicode string into an ASCII string. ASCII
characters in the Unicode string are represented literally, and
non-ASCII characters are represented by ASCII characters that are
allowed in host name labels (letters, digits, and hyphens). A general
algorithm called Bootstring allows a string of basic code points to
uniquely represent any string of code points drawn from a larger set.
Punycode is an instance of Bootstring that uses particular parameter
values, appropriate for IDNA.

5.1 Header file ‘punycode.h’
============================

To use the functions explained in this chapter, you need to include the
file ‘punycode.h’ using:

     #include <punycode.h>

5.2 Unicode Code Point Data Type
================================

The Punycode functions use a special type to denote Unicode code points.
It is guaranteed to always be a 32-bit unsigned integer.

 -- Punycode Unicode code point: uint32_t punycode_uint
     An unsigned integer that holds Unicode code points.

5.3 Core Functions
==================

Note that the current implementation will fail if ‘input_length’
exceeds 4294967295 (the maximum value of a ‘punycode_uint’). This
restriction may be removed in the future. Meanwhile, applications are
encouraged not to depend on this limitation, and to use ‘sizeof’ to
initialize ‘input_length’ and ‘output_length’.

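The advice about ‘sizeof’ can be sketched as follows. This is only an
illustration and not part of libidn: ‘code_point’ is a stand-in typedef
for ‘punycode_uint’, and ‘ARRAY_LEN’, ‘sample_input’,
‘length_is_supported’ and ‘sample_input_length’ are hypothetical names.
Because ‘sizeof’ yields a size in bytes, it is divided by the element
size to obtain the element count that the length arguments expect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for punycode_uint (assumption: a 32-bit unsigned integer,
   as the data type section states); real code gets it from punycode.h. */
typedef uint32_t code_point;

/* Element count of a true array; this does not work on pointers. */
#define ARRAY_LEN(a) (sizeof (a) / sizeof ((a)[0]))

/* "bücher" as Unicode code points. */
static const code_point sample_input[] =
  { 0x62, 0xFC, 0x63, 0x68, 0x65, 0x72 };

/* Returns nonzero when a length stays within the limit noted above. */
int
length_is_supported (size_t length)
{
  return (uintmax_t) length <= (uintmax_t) UINT32_MAX;
}

/* Value suitable for passing as the input_length argument. */
size_t
sample_input_length (void)
{
  size_t length = ARRAY_LEN (sample_input);

  assert (length_is_supported (length));
  return length;
}
```

The same macro applied to an output array gives the capacity to store
in ‘output_length’ before the call.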
The following two entry points are provided:

punycode_encode
---------------

 -- Function: int punycode_encode (size_t INPUT_LENGTH, const
          punycode_uint [] INPUT, const unsigned char [] CASE_FLAGS,
          size_t * OUTPUT_LENGTH, char [] OUTPUT)
     INPUT_LENGTH: The number of code points in the ‘input’ array and
     the number of flags in the ‘case_flags’ array.

     INPUT: An array of code points. They are presumed to be Unicode
     code points, but that is not strictly REQUIRED. The array contains
     code points, not code units. UTF-16 uses code units D800 through
     DFFF to refer to code points 10000..10FFFF. The code points
     D800..DFFF do not occur in any valid Unicode string. The code
     points that can occur in Unicode strings (0..D7FF and E000..10FFFF)
     are also called Unicode scalar values.

     CASE_FLAGS: A ‘NULL’ pointer or an array of boolean values parallel
     to the ‘input’ array. Nonzero (true, flagged) suggests that the
     corresponding Unicode character be forced to uppercase after being
     decoded (if possible), and zero (false, unflagged) suggests that it
     be forced to lowercase (if possible). ASCII code points (0..7F)
     are encoded literally, except that ASCII letters are forced to
     uppercase or lowercase according to the corresponding case flags.
     If ‘case_flags’ is a ‘NULL’ pointer, then ASCII letters are left as
     they are, and other code points are treated as unflagged.

     OUTPUT_LENGTH: The caller passes in the maximum number of ASCII
     code points that it can receive. On successful return it will
     contain the number of ASCII code points actually output.

     OUTPUT: An array of ASCII code points. It is *not*
     null-terminated; it will contain zeros if and only if the ‘input’
     contains zeros. (Of course the caller can leave room for a
     terminator and add one if needed.)

     Converts a sequence of code points (presumed to be Unicode code
     points) to Punycode.

     Return value: The return value can be any of the ‘Punycode_status’
     values described under ‘punycode_strerror’ except
     ‘PUNYCODE_BAD_INPUT’. If it is not ‘PUNYCODE_SUCCESS’, then
     ‘output_length’ and ‘output’ might contain garbage.

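Since ‘input’ must hold code points rather than UTF-16 code units, a
caller that starts from UTF-16 text has to combine surrogate pairs
first. The following is a minimal sketch of that step, not something
libidn provides: ‘utf16_to_code_points’ and its return convention are
hypothetical, and ‘uint32_t’ is used in place of ‘punycode_uint’ so that
the fragment compiles without the library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts UTF-16 code units to code points.
   On entry *out_length is the capacity of out; on success it is set
   to the number of code points written.  Returns 0 on success and -1
   on an unpaired surrogate or when out is too small. */
int
utf16_to_code_points (const uint16_t *in, size_t in_length,
                      uint32_t *out, size_t *out_length)
{
  size_t i = 0, n = 0;

  while (i < in_length)
    {
      uint32_t unit = in[i++];

      if (unit >= 0xD800 && unit <= 0xDBFF)
        {
          /* High surrogate: a low surrogate must follow. */
          if (i >= in_length || in[i] < 0xDC00 || in[i] > 0xDFFF)
            return -1;
          unit = 0x10000 + ((unit - 0xD800) << 10) + (in[i++] - 0xDC00);
        }
      else if (unit >= 0xDC00 && unit <= 0xDFFF)
        return -1;              /* Low surrogate without a high one. */

      if (n >= *out_length)
        return -1;
      out[n++] = unit;
    }

  *out_length = n;
  return 0;
}
```

The resulting array and count could then be passed as ‘input’ and
‘input_length’.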
punycode_decode
---------------

 -- Function: int punycode_decode (size_t INPUT_LENGTH, const char []
          INPUT, size_t * OUTPUT_LENGTH, punycode_uint [] OUTPUT,
          unsigned char [] CASE_FLAGS)
     INPUT_LENGTH: The number of ASCII code points in the ‘input’ array.

     INPUT: An array of ASCII code points (0..7F).

     OUTPUT_LENGTH: The caller passes in the maximum number of code
     points that it can receive into the ‘output’ array (which is also
     the maximum number of flags that it can receive into the
     ‘case_flags’ array, if ‘case_flags’ is not a ‘NULL’ pointer). On
     successful return it will contain the number of code points
     actually output (which is also the number of flags actually output,
     if ‘case_flags’ is not a ‘NULL’ pointer). The decoder will never
     need to output more code points than the number of ASCII code
     points in the input, because of the way the encoding is defined.
     The number of code points output cannot exceed the maximum possible
     value of a ‘punycode_uint’, even if the supplied ‘output_length’ is
     greater than that.

     OUTPUT: An array of code points like the input argument of
     ‘punycode_encode()’ (see above).

     CASE_FLAGS: A ‘NULL’ pointer (if the flags are not needed by the
     caller) or an array of boolean values parallel to the ‘output’
     array. Nonzero (true, flagged) suggests that the corresponding
     Unicode character be forced to uppercase by the caller (if
     possible), and zero (false, unflagged) suggests that it be forced
     to lowercase (if possible). ASCII code points (0..7F) are output
     already in the proper case, but their flags will be set
     appropriately so that applying the flags would be harmless.

     Converts Punycode to a sequence of code points (presumed to be
     Unicode code points).

     Return value: The return value can be any of the ‘Punycode_status’
     values described under ‘punycode_strerror’. If it is not
     ‘PUNYCODE_SUCCESS’, then ‘output_length’, ‘output’, and
     ‘case_flags’ might contain garbage.

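The decoder only suggests a case for each code point; acting on the
flags is left to the caller. The sketch below shows one way a caller
might do that for ASCII letters only. It is an assumption-laden
illustration, not library code: ‘apply_case_flags’ is a hypothetical
name, ‘uint32_t’ stands in for ‘punycode_uint’, and non-ASCII code
points are left untouched because mapping them would need a Unicode
case-mapping facility that is outside the scope of this chapter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical caller-side helper: forces ASCII letters to the case
   suggested by the parallel case_flags array.  A NULL case_flags
   pointer means the flags were not requested, so nothing is changed. */
void
apply_case_flags (uint32_t *code_points, const unsigned char *case_flags,
                  size_t length)
{
  size_t i;

  assert (code_points != NULL || length == 0);
  if (case_flags == NULL)
    return;

  for (i = 0; i < length; i++)
    {
      uint32_t c = code_points[i];

      if (case_flags[i] && c >= 'a' && c <= 'z')
        code_points[i] = c - ('a' - 'A');   /* flagged: uppercase */
      else if (!case_flags[i] && c >= 'A' && c <= 'Z')
        code_points[i] = c + ('a' - 'A');   /* unflagged: lowercase */
    }
}
```

As the parameter description notes, ASCII code points already arrive in
the proper case, so for them this pass should change nothing.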
5.4 Error Handling
==================

punycode_strerror
-----------------

 -- Function: const char * punycode_strerror (Punycode_status RC)
     RC: a ‘Punycode_status’ return code.

     Convert a return code integer to a text string. This string can be
     used to output a diagnostic message to the user.

     *PUNYCODE_SUCCESS:* Successful operation. This value is guaranteed
     to always be zero; the remaining ones are only guaranteed to hold
     non-zero values, for logical comparison purposes.

     *PUNYCODE_BAD_INPUT:* Input is invalid.

     *PUNYCODE_BIG_OUTPUT:* Output would exceed the space provided.

     *PUNYCODE_OVERFLOW:* Input needs wider integers to process.

     Return value: Returns a pointer to a statically allocated string
     containing a description of the error with the return code ‘rc’.
