Thursday March 20, 2025

UTF


UTF is the abbreviation of UCS Transformation Format (the name used in the ISO 10646 standard); it is also called Unicode Transformation Format. The Unicode Standard itself defines three encoding forms: UTF-8, UTF-16 and UTF-32.

UTF-7

Is described in RFC 2152 (UTF-7: A Mail-Safe Transformation Format of Unicode).
It's a 7-bit encoding method.
Was developed for use in mail environments and uses a shift sequence to switch in and out of a modified Base64 encoding:
+ : start encoding
- : stop encoding

Example (from RFC 2152): "Hi Mom -☺-!" is encoded as "Hi Mom -+Jjo--!"
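
The shift mechanism can be tried out with Python's built-in `utf-7` codec. This is only a small sketch and not part of RFC 2152; the variable names are illustrative, and the sample string with U+263A (WHITE SMILING FACE) follows the example given in the RFC.

```python
# Sketch: round-trip a string through Python's built-in UTF-7 codec.
text = "Hi Mom -\u263a-!"            # U+263A is outside the ASCII range

encoded = text.encode("utf-7")       # every resulting byte fits in 7 bits
decoded = encoded.decode("utf-7")

print(encoded)
print(decoded == text)

# '+' opens a modified Base64 run and '-' closes it again.
beijing = b"+UxdOrA-".decode("utf-7")
print(beijing)                       # the two characters U+5317 U+4EAC
```

Plain ASCII text passes through unchanged; only the characters between `+` and `-` are Base64-encoded UTF-16 code units.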

UTF-8

Is described in RFC 2279 (UTF-8, a transformation format of ISO 10646), which has since been obsoleted by RFC 3629. It was formerly known as UTF-2.
It's an 8-bit variable-length encoding method. All Unicode characters with a value smaller than 128 are transmitted as is; the rest are encoded as sequences of two or more bytes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem (byte order is only an issue for encoding forms that use 16-bit or 32-bit code units). If a BOM is used, it only serves to distinguish UTF-8 from other UTF encodings and has nothing to do with the byte order.

Example: 日 本 語
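
To make the variable-length scheme concrete, here is a minimal hand-written encoder, checked against Python's own codec. The function name `utf8_encode` is made up for this sketch; it assumes the 4-byte limit of RFC 3629 (code points up to U+10FFFF) and does not reject surrogate values.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 (sketch, no surrogate check)."""
    if code_point < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 4 bytes: 11110xxx followed by 3 x 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


word = "\u65e5\u672c\u8a9e"                     # the three characters of the example
encoded = b"".join(utf8_encode(ord(ch)) for ch in word)

print(" ".join(f"{b:02X}" for b in encoded))    # E6 97 A5 E6 9C AC E8 AA 9E
print(encoded == word.encode("utf-8"))          # True
```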

UTF-16

Is described in RFC 2781 (UTF-16, an encoding of ISO 10646).
Characters are represented using either one or two unsigned 16-bit integers, depending on the character value: values below 0x10000 take a single integer, higher values take a pair of surrogates. All characters represented in UTF-16 can be represented as a single 32-bit unit in UTF-32.
The UTF-16 sequence
[0048] [0069] [D800] [DC00] [0021] [0021]
is mapped to UCS-4 as
[0000 0048] [0000 0069] [0001 0000] [0000 0021] [0000 0021]
and represents "Hi<0001 0000>!!".
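
The mapping above can be reproduced with a few lines of Python. This is a sketch that assumes well-formed input (every high surrogate is followed by a low surrogate); the function name `utf16_units_to_code_points` is invented for illustration.

```python
def utf16_units_to_code_points(units):
    """Combine 16-bit code units into code points (sketch, well-formed input only)."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(units):
            low = units[i + 1]
            # The high surrogate carries the upper 10 bits, the low surrogate
            # the lower 10 bits; 0x10000 is added to the combined value.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


units = [0x0048, 0x0069, 0xD800, 0xDC00, 0x0021, 0x0021]
code_points = utf16_units_to_code_points(units)

print([f"{cp:08X}" for cp in code_points])
# ['00000048', '00000069', '00010000', '00000021', '00000021']
```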

UTF-32

Encodes a Unicode code point as a sequence of 4 bytes, in either big-endian or little-endian byte order. An initial sequence corresponding to U+FEFF is interpreted as a BOM (byte order mark), which is used to distinguish between the two byte orders. The BOM is not considered part of the content of the text.
UTF-32 was originally specified as Unicode Standard Annex #19: UTF-32, but is now incorporated into the core specification of the Unicode Standard.
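
The effect of the byte order and of the BOM can be seen with Python's standard codecs. This is an illustrative sketch; the sample string "Hi" and the variable names are arbitrary choices.

```python
import codecs

text = "Hi"

big = text.encode("utf-32-be")        # explicit byte order, no BOM is written
little = text.encode("utf-32-le")

print(big.hex())                      # 0000004800000069
print(little.hex())                   # 4800000069000000

# With a BOM in front, the generic "utf-32" decoder detects the byte order
# itself and does not return the BOM as part of the text.
print((codecs.BOM_UTF32_BE + big).decode("utf-32"))        # Hi
print((codecs.BOM_UTF32_LE + little).decode("utf-32"))     # Hi
```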

Overview

Encoding        Beijing                              北 京
UTF-7 (ASCII)   Beijing                              +UxdOrA-
UTF-8 (hex)     42 65 69 6A 69 6E 67                 E5 8C 97 E4 BA AC
UTF-16 (hex)    0042 0065 0069 006A 0069 006E 0067   5317 4EAC

