[Snake] Yi-Si year, Ji-Mao month, Wu-Zi day / 22nd day of the 2nd lunar month
Thursday March 20, 2025

Java

Java comes with two classes, InputStreamReader and OutputStreamWriter, that translate between local encodings and Unicode. Two of the supported encodings are GB2312 and Big5.

Java 2 allows the programmer to directly access the fonts on the machine. Before the introduction of Swing, a set of peerless (lightweight) components built on top of the AWT, Java could not display Chinese except on Chinese operating systems. With Swing, you can display Chinese in any component, provided you have fonts that support Chinese on your system. Current versions of Java can therefore display Chinese, Japanese, and Korean text directly if the corresponding fonts are installed.
Java represents characters internally as UTF-16 (UCS-2 in early versions); in addition, every Java Virtual Machine (JVM) has a default charset. The default charset of the JVM is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
Note that Java's historical name for UTF-8 is UTF8; the java.io classes accept both spellings.
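As a rough illustration of the two points above, the following sketch prints the JVM's default charset and checks how the names UTF8 and UTF-8 are resolved. The class name CharsetInfo and its helper methods are hypothetical; the calls themselves come from java.nio.charset.Charset and java.io.InputStreamReader.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class CharsetInfo {
    /** Returns the canonical java.nio name for a charset name or alias. */
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    /** Returns the name reported by an InputStreamReader created for the given charset. */
    static String readerEncoding(String charsetName) {
        // An empty in-memory stream is enough; nothing is actually read.
        InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(new byte[0]), Charset.forName(charsetName));
        return reader.getEncoding();
    }

    public static void main(String[] args) {
        // Depends on the operating system locale and JVM settings.
        System.out.println("Default charset:  " + Charset.defaultCharset());
        // Both spellings resolve to the same Charset object.
        System.out.println("UTF8 resolves to: " + canonicalName("UTF8"));
        // getEncoding() prefers the historical java.io name where one exists.
        System.out.println("Reader reports:   " + readerEncoding("UTF-8"));
    }
}
```

The first line of output varies from machine to machine, which is the reason for naming the encoding explicitly rather than relying on the default.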

Displaying Chinese characters

It's good practice to specify the encoding when reading and displaying a stream. InputStreamReader and OutputStreamWriter use the default encoding if no encoding is specified. The encoding of an InputStreamReader or OutputStreamWriter can be determined with the getEncoding method:
// fis is an open InputStream, e.g. new FileInputStream("test.txt")
InputStreamReader defaultReader = new InputStreamReader(fis);
String defaultEncoding = defaultReader.getEncoding();

To set the encodings:
FileInputStream fis = new FileInputStream("test.txt");
// decodes the Big5 bytes of the input stream into Unicode characters
InputStreamReader isr = new InputStreamReader(fis, "Big5");
// encodes characters as UTF-8 on the way out; output is an open OutputStream
PrintWriter writer = new PrintWriter(new OutputStreamWriter(output, "UTF-8"));
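Putting the reader and the writer together, a minimal sketch of converting text from one encoding to another might look like the following. The class name Transcoder and its methods are hypothetical, and the sketch assumes the JVM ships the Big5 charset (standard desktop and server JDKs normally do). It works on in-memory byte arrays so that it can be run without a test.txt file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class Transcoder {
    /** Copies text from in to out, decoding with fromCharset and encoding with toCharset. */
    static void transcode(InputStream in, String fromCharset,
                          OutputStream out, String toCharset) {
        try (Reader reader = new InputStreamReader(in, Charset.forName(fromCharset));
             Writer writer = new OutputStreamWriter(out, Charset.forName(toCharset))) {
            char[] buffer = new char[4096];
            int n;
            // The reader yields Unicode chars; the writer re-encodes them.
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience wrapper for data that is already in memory. */
    static byte[] transcode(byte[] input, String fromCharset, String toCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(input), fromCharset, out, toCharset);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] big5 = text.getBytes(Charset.forName("Big5"));
        byte[] utf8 = transcode(big5, "Big5", "UTF-8");
        System.out.println(big5.length + " Big5 bytes -> " + utf8.length + " UTF-8 bytes");
    }
}
```

For files, the same transcode method can be called with a FileInputStream and a FileOutputStream instead of the in-memory streams.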


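You can use the native2ascii tool (shipped with the JDK through Java 8) to convert a file between its native encoding and plain ASCII in which every non-ASCII character is written as a \uXXXX escape.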
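To make the \uXXXX format concrete, here is a small sketch of the kind of conversion involved. It is not the tool's actual implementation; the class UnicodeEscapes and its methods are hypothetical, and the sketch ignores corner cases such as input that already contains a literal backslash followed by u.

```java
class UnicodeEscapes {
    /** Replaces every non-ASCII UTF-16 code unit with a six-character hex escape. */
    static String toAscii(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                sb.append(c); // plain ASCII passes through unchanged
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    /** Reverses toAscii; throws NumberFormatException on malformed escapes. */
    static String fromAscii(String escaped) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < escaped.length()) {
            if (escaped.startsWith("\\u", i) && i + 6 <= escaped.length()) {
                // Four hex digits follow the two-character escape prefix.
                sb.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(escaped.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = toAscii("Java \u4e2d\u6587");
        System.out.println(escaped);
        System.out.println(fromAscii(escaped).length() + " characters after decoding");
    }
}
```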

Double-byte character set support for Java Server Pages

East Asian languages such as Japanese, Chinese, and Korean are typically encoded with double-byte character sets (DBCS): an individual character requires two bytes, as opposed to a single byte for an English-language character. For example, Japanese requires 16 bits to represent its roughly 32,000 double-byte characters.
To support JSP pages written in a DBCS, the JSP compiler checks the page directive to determine which character set to use: the value of charset in the contentType attribute gives the encoding of the JSP page. If charset is not defined, ISO-8859-1 encoding is assumed. For example, to specify an encoding for simplified Chinese characters, set the contentType attribute to the appropriate character set in a page directive:
<%@ page contentType="text/html;charset=eucgb"%>

The generated Java code always uses UTF8 encoding.
The browser displaying the JSP must also support the character set. Browsers that are compliant with HTML 4.0 support the Basic Multilingual Plane (BMP) of Unicode, a standardized 16-bit character set that supports most of the world's languages. The browser must also have the required fonts to correctly display the characters of your target language.

If you want to display a Chinese character based on its Unicode value, you can use
(char)Integer.parseInt("7557", 16)
(where 7557 is the hexadecimal Unicode value of the character; the cast to char only works for values up to FFFF)
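A slightly more general variant, sketched below, also handles code points above FFFF, which do not fit in a single char. The class name CodePoints and the method fromHex are hypothetical; Character.toChars is the standard library call doing the work.

```java
class CodePoints {
    /** Converts a hexadecimal Unicode code point such as "7557" to a String. */
    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        // Character.toChars yields one char for BMP code points and a
        // surrogate pair for supplementary ones (above FFFF).
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromHex("7557");
        String supplementary = fromHex("20000");
        System.out.println(bmp + " uses " + bmp.length() + " char(s)");
        System.out.println(supplementary + " uses " + supplementary.length() + " char(s)");
    }
}
```

Whether the characters actually appear on screen still depends on the console encoding and the installed fonts, as described above.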
