UTF
UTF is the abbreviation of UCS Transformation Format (ISO 10646 standard); it is also called Unicode Transformation Format. UTF-16 was historically the default encoding form of the Unicode Standard, which today defines UTF-8, UTF-16 and UTF-32 as equally valid encoding forms.
UTF-7
Described in RFC 2152 (UTF-7: A Mail-Safe Transformation Format of Unicode). It is a 7-bit encoding method, developed for use in mail environments. Mail-safe ASCII characters are sent as is; all other characters are sent as modified Base64 inside a shift sequence:
+ : start encoding
- : stop encoding
Example: Hi Mom ☺!
utf-7 | Hi Mom | + | Jjo | - | ! |
meaning | ASCII | start | ☺ | end | ASCII |
hex | 0048 0069 0020 004D 006F 006D 0020 | | 263A | | 0021 |
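The shift sequence in the example can be reproduced with a few lines of Python. This is only a minimal sketch: `utf7_shift` is a hypothetical helper name, and it assumes the whole input goes into one shifted sequence, whereas a real encoder would leave mail-safe ASCII outside the `+...-` brackets.

```python
import base64

def utf7_shift(text: str) -> str:
    """Wrap text in one UTF-7 shifted sequence: '+' + modified Base64 + '-'.

    Hypothetical helper for illustration only; it shifts every character,
    including ASCII that a real encoder would send directly.
    """
    raw = text.encode("utf-16-be")               # UTF-7 works on big-endian UTF-16 code units
    b64 = base64.b64encode(raw).decode("ascii")  # ordinary Base64 first
    return "+" + b64.rstrip("=") + "-"           # modified Base64 drops the '=' padding

shifted = utf7_shift("\u263a")                   # U+263A, the smiley from the example
print(shifted)                                   # +Jjo-

# Round trip through Python's built-in utf-7 codec.
decoded = b"Hi Mom +Jjo-!".decode("utf-7")
print(ascii(decoded))
```

The helper only covers the shifted part; the surrounding "Hi Mom " and "!" stay plain ASCII, as in the table above.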
UTF-8
Described in RFC 2279 (UTF-8, a transformation format of ISO 10646), which has since been obsoleted by RFC 3629. It was formerly known as UTF-2. It is an 8-bit, variable-length encoding method. All Unicode characters with a value smaller than 128 are transmitted as is; the rest are encoded as sequences of two to four bytes (RFC 2279 originally allowed up to six). Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem (that problem only affects encoding forms that use 16-bit or 32-bit code units). If a BOM is used, it only serves to distinguish UTF-8 from other UTF encodings and has nothing to do with byte order.
Example: 日本語
char | 日 | 本 | 語 |
hex | 65E5 | 672C | 8A9E |
utf-8 | E6 97 A5 | E6 9C AC | E8 AA 9E |
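The byte values in the table follow from the UTF-8 bit layout. The sketch below shows one way to derive them by hand. `utf8_encode` is a hypothetical name, and the function assumes a valid code point up to U+10FFFF. It does not reject surrogates or out-of-range values.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point (given as an integer) into UTF-8 bytes.

    Illustrative sketch; assumes 0 <= cp <= 0x10FFFF and does no validation.
    """
    if cp < 0x80:       # up to 7 bits: transmitted as is
        return bytes([cp])
    if cp < 0x800:      # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def hex_bytes(data: bytes) -> str:
    """Format bytes the way the table does, e.g. 'E6 97 A5'."""
    return " ".join(f"{b:02X}" for b in data)

for cp in (0x65E5, 0x672C, 0x8A9E):
    print(f"{cp:04X} -> {hex_bytes(utf8_encode(cp))}")
```

In practice Python's built-in `str.encode("utf-8")` does the same job; the manual version is here only to make the bit layout visible.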
UTF-16
Described in RFC 2781 (UTF-16, an encoding of ISO 10646). Characters are represented using either one or two unsigned 16-bit integers, depending on the character value: values below 0x10000 take a single integer, higher values take a pair of surrogates. All characters represented in UTF-16 can be represented as a single 32-bit unit in UTF-32.
The UTF-16 sequence
[0048] [0069] [D800] [DC00] [0021] [0021]
is mapped to UCS-4 as
[0000 0048] [0000 0069] [0001 0000] [0000 0021] [0000 0021]
and represents "Hi<0001 0000>!!".
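The mapping above can be sketched as a small decoder that combines each surrogate pair into one 32-bit value. `utf16_to_code_points` is a hypothetical name, and as a simplification the sketch passes unpaired surrogates through unchanged, where a strict decoder would report an error.

```python
def utf16_to_code_points(units):
    """Map a list of 16-bit UTF-16 code units to a list of UCS-4 values.

    Illustrative sketch; unpaired surrogates are passed through unchanged.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # 10 bits from the high surrogate, 10 bits from the low one,
            # offset by 0x10000.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

seq = [0x0048, 0x0069, 0xD800, 0xDC00, 0x0021, 0x0021]
print(["%08X" % cp for cp in utf16_to_code_points(seq)])
```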
UTF-32
Encodes a Unicode code point as a sequence of 4 bytes, in either big-endian or little-endian order. An initial sequence corresponding to U+FEFF is interpreted as a BOM (byte order mark) and is used to distinguish between the two byte orders. The BOM is not considered part of the content of the text. UTF-32 was originally specified in Unicode Standard Annex #19: UTF-32, but is now incorporated into the core specification of the Unicode Standard.
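BOM handling for UTF-32 might look like the sketch below. `decode_utf32` is a hypothetical name, and the fallback to big-endian when no BOM is present is an assumption made for the example; in practice a higher-level protocol may specify the byte order instead.

```python
BOM_BE = b"\x00\x00\xfe\xff"   # U+FEFF serialized big-endian
BOM_LE = b"\xff\xfe\x00\x00"   # U+FEFF serialized little-endian

def decode_utf32(data: bytes) -> str:
    """Decode UTF-32 bytes, using a leading BOM to pick the byte order.

    Illustrative sketch; the BOM is stripped because it is not part of
    the text content.
    """
    if data[:4] == BOM_BE:
        return data[4:].decode("utf-32-be")
    if data[:4] == BOM_LE:
        return data[4:].decode("utf-32-le")
    # Assumption for this sketch: no BOM means big-endian.
    return data.decode("utf-32-be")

print(decode_utf32(BOM_BE + b"\x00\x00\x00\x48\x00\x00\x00\x69"))  # Hi
print(decode_utf32(BOM_LE + b"\x48\x00\x00\x00\x69\x00\x00\x00"))  # Hi
```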
Overview
Encoding | Beijing | 北 京 |
UTF-7 (ASCII) | Beijing | +UxdOrA- |
UTF-8 (hex) | 42 65 69 6A 69 6E 67 | E5 8C 97 E4 BA AC |
UTF-16 (hex) | 0042 0065 0069 006A 0069 006E 0067 | 5317 4EAC |