Han unification is an effort by the authors of
Unicode and the
Universal Character Set to map
multiple
character sets of the
so-called
CJK languages into a single set of
unified
characters.
Han characters are a common feature of written
Chinese (hanzi), Japanese
(kanji), Korean
(hanja), and Traditional Chinese in Hong Kong
and Taiwan
, and—at
least historically—other East and Southeast Asian languages.
(See
Vietnamese Hán Tự and
Chữ Nôm.)
Modern Chinese, Korean, and Japanese
typefaces typically use regional or historical
variants of a given Han
character. In the formulation of Unicode, an attempt was made
to unify these variants by considering them different
glyphs representing the same "
grapheme", or
orthographic unit, hence, "Han unification",
with the resulting character repertoire sometimes contracted to
Unihan.
Unihan can also refer to the Unihan Database
maintained by the Unicode Consortium, which provides information
about all of the unified Han characters encoded in the Unicode
standard, including mappings to various national and industry
standards, indices into standard dictionaries, encoded variants,
pronunciations in various languages, and an English definition. The
database is available to the public as
a text file and via an
interactive Web site. The latter also includes
representative glyphs and definitions for compound words drawn from
the free Japanese
EDICT and Chinese
CEDICT dictionary projects (which are provided for
convenience and are not a formal part of the Unicode
standard).
Rationale and controversy
Rules for Han unification are given in the East Asian Scripts
chapter of the various versions of the Unicode Standard (Chapter 11
in Unicode 4.0). The
Ideographic Rapporteur Group
(IRG), made up of experts from the Chinese-speaking countries,
North and South Korea, Japan, Vietnam, and other countries, is
responsible for the process.
One possible rationale is the desire to limit the size of the full
Unicode character set, where CJK characters as represented by
discrete
ideograms may approach or exceed
100,000 (while those required for ordinary literacy in any language
are probably under 3,000).
The secret life of Unicode article
located on IBM DeveloperWorks attempts to illustrate part of the
motivation for Han unification:
In fact, the three ideographs for "one" are encoded separately in
Unicode, as they are not considered national variants. The first
and second are used on financial instruments to prevent tampering
(they may be considered variants), while the third is the common
form in all three countries.
However, Han unification has also caused considerable controversy,
particularly among the Japanese public, who, with the nation's
literati, have a history of protesting the culling of historically
and culturally significant variants. (See
Kanji#Orthographic
reform and lists of kanji. Today, the list of characters
officially recognized for use in proper names continues to expand
at a modest pace.)
Graphemes versus glyphs

The Latin small "a" has widely
differing glyphs that all represent concrete instances of the same
abstract grapheme.
While a native reader of any language using the Latin script
recognizes these two glyphs as the same grapheme, to others they
might appear to be completely unrelated.
A grapheme is the smallest abstract unit of meaning in a writing
system. Any grapheme has many possible glyph expressions, but all
are recognized as the same grapheme by those with reading and
writing knowledge of a particular writing system. While Unicode
typically assigns characters to code points to express the
graphemes within a system of writing, the Unicode standard (
section 3.4 D7) does caution:
An abstract character does not necessarily correspond to what
a user thinks of as a “character” and should not be confused with a
grapheme.
However, this refers to the fact that some graphemes are composed
of several characters. So, for example, the character "a" (U+0061)
combined with a circle above (U+030A) might be understood by a user
as a single grapheme while being composed of multiple Unicode
abstract characters. In addition, Unicode also assigns some code
points to a small number (other than for compatibility reasons) of
formatting characters, whitespace characters, and other abstract
characters that are not graphemes, but instead used to control the
breaks between lines, words, graphemes and grapheme clusters. With
the unified Han ideographs, the Unicode standard makes a departure
from prior practices in assigning abstract characters not as
graphemes, but according to the underlying meaning of the grapheme:
what linguists sometimes call
sememes. This
departure therefore is not simply explained by the oft quoted
distinction between an abstract character and a glyph, but is more
rooted in the difference between an abstract character assigned as
a grapheme and an abstract character assigned as a sememe. In
contrast, consider Unicode’s unification of
punctuation and
diacritics, where graphemes with widely different
meanings (for example, an apostrophe and a single quotation mark)
are unified because the graphemes are the same. For Unihan the
characters are not unified by their appearance, but by their
definition or meaning.
For a grapheme to be represented by various glyphs means that the
grapheme has glyph variations that are usually determined by
selecting one font or another or using glyph substitution features
where multiple glyphs are included in a single font. Such glyph
variations are considered by Unicode a feature of rich text
protocols and not properly handled by the plain text goals of
Unicode. However, when the change from one glyph to another
constitutes a change from one grapheme to another—where a glyph
cannot possibly still, for example, mean the same grapheme
understood as the small letter "a"—Unicode separates those into
separate code points. For Unihan the same thing is done whenever
the abstract meaning changes, however rather than speaking of the
abstract meaning of a grapheme (the letter "a"), the unification of
Han ideographs assigns a new code point for each different
meaning—even if that meaning is expressed by distinct graphemes in
different languages. While a grapheme such as “ö” might mean
something different in English (as used in the word “coördinated”)
than it does in German, it is still the same grapheme and can be
easily unified so that English and German can share a common
abstract Latin writing system (along with Latin itself).
To deal with the use of different graphemes for the same Unihan
sememe, Unicode has relied on several mechanisms to deal with the
issue: especially as it relates to rendering text. One has been to
treat it as simply a font issue so that different fonts might be
used to render Chinese, Japanese or Korean. Also font formats such
as OpenType allow for the mapping of alternate glyphs according to
language so that a text rending system can look to the user’s
environmental settings to determine which glyph to use. The problem
with these approaches is that they fail to meet the goals of
Unicode to support multilingual text within the same
document.
So rather than treat the issue as a rich text problem of glyph
alternates, Unicode added the concept of variation selectors, first
introduced in version 3.2 and supplemented in version 4.0. While
variation selectors are treated as combining characters, they have
no associated diacritic or mark. Instead, by combining with a base
character, they signal the two character sequence selects a
variation (typically in terms of grapheme, but also in terms of
underlying meaning as in the case of a location name or other
proper noun) of the base character. This then is not a selection of
an alternate glyph, but the selection of a grapheme variation or a
variation of the base abstract character. Such a two-character
sequence however can be easily mapped to a separate single glyph in
modern fonts. Since Unicode has assigned 256 separate variation
selectors, it is capable of assigning 256 variations for any Han
ideograph. Such variations can be specific to one language or
another and enable the encoding of plain text that includes such
grapheme variations.
Unihan "abstract characters"
Since the Unihan standard encodes "abstract characters", not
"glyphs", the graphical artifacts produced by Unicode have been
considered temporary technical hurdles, and at most, cosmetic.
However, again, particularly in Japan, due in part to the way in
which Chinese characters were incorporated into Japanese writing
systems historically, the inability to specify a particular variant
is considered a significant obstacle to the use of Unicode in
scholarly work. For example, the unification of "grass" (explained
above), means that a historical text cannot be encoded so as to
preserve its peculiar orthography. Instead, for example, the
scholar would be required to locate the desired glyph in a specific
typeface in order to convey the text as written, defeating the
purpose of a unified character set. Unicode has responded to these
needs by assigning variation selectors so that authors can select
grapheme variations of particular ideographs (or even other
characters).
Small differences in graphical representation are also problematic
when they affect legibility or the wrong cultural tradition.
Besides making some Unicode fonts unusable for texts involving
multiple "Unihan languages", names or other orthographically
sensitive terminology might be displayed incorrectly. (Proper names
tend to be especially orthographically conservative--compare this
to changing the spelling of one's name to suit a language reform in
the U.S. or U.K.) While this may be considered primarily a
graphical representation or rendering problem to be overcome by
more artful fonts, the widespread use of Unicode would make it
difficult to preserve such distinctions. The problem of one
character representing semantically different concepts is also
present in the Latin part of Unicode. The Unicode character for an
apostrophe is the same as the character for a right single quote:
’. On the other hand, it is sometimes pointed out that the capital
Latin letter "A" is not unified with the Greek letter "Α" (Alpha).
This is, of course, desirable for reasons of compatibility, and
deals with a much smaller alphabetic character set.
While the unification aspect of Unicode is controversial in some
quarters for the reasons given above, Unicode itself does now
encode a vast number of seldom-used characters of a more-or-less
antiquarian nature.
Some of the controversy stems from the fact that the very decision
of performing Han unification was made by the initial Unicode
Consortium, which at the time was a consortium of North American
companies and organizations (most of them in California), but
included no East Asia government representatives. The initial
design goal was to create a 16-bit standard, and Han unification
was therefore a critical step for avoiding tens of thousands of
character duplications. This 16-bit requirement was later
abandoned, making the size of the character set less an issue
today.
The controversy later extended to the internationally
representative ISO: the initial CJK-JRG group favored a proposal
(DIS 10646) for a non-unified character set, "which was thrown out
in favor of unification with the Unicode Consortium's unified
character set by the votes of American and European ISO members"
(even though the Japanese position was unclear). Endorsing the
Unicode Han unification was a necessary step for the heated ISO
10646/Unicode merger.
Much of the controversy surrounding Han unification is based on the
distinction between
glyphs, as defined in
Unicode, and the related but distinct idea of
graphemes. Unicode assigns abstract characters
(graphemes), as opposed to glyphs, which are a particular visual
representations of a character in a specific
typeface. One character may be represented by many
distinct glyphs, for example a "g" or an "a", both of which may
have one loop or two. Yet for a reader of Latin script based
languages the two variations of the "a" character are both
recognized as the same grapheme. Graphemes present in national
character code standards have been added to Unicode, as required by
Unicode's Source Separation rule, even where they can be composed
of characters already available. The national character code
standards existing in CJK languages are considerably more involved,
given the technological limitations under which they evolved, and
so the official CJK participants in Han unification may well have
been amenable to reform.
Unlike European versions, CJK Unicode fonts, due to Han
unification, have large but irregular patterns of overlap,
requiring language-specific fonts. Unfortunately, language-specific
fonts also make it difficult to access to a variant which, as with
the "grass" example, happens to appear more typically in another
language style. (That is to say, it would be difficult to access
"grass" with the four-stroke radical more typical of Traditional
Chinese in a Japanese environment, which fonts would typically
depict the three-stroke radical.) Unihan proponents tend to favor
markup languages for defining language strings, but this would not
ensure the use of a specific variant in the case given, only the
language-specific font more likely to depict a character as that
variant. (At this point, merely stylistic differences do enter in,
as a selection of Japanese and Chinese fonts are not likely to be
visually compatible.)
Chinese
users seem to have fewer objections to Han unification, largely
because Unicode did not attempt to unify Simplified Chinese characters
(an invention of the People's Republic of China
, and in use among Chinese speakers in the PRC,
Singapore
, and Malaysia
), with
Traditional Chinese
characters, as used in Hong Kong, Taiwan (Big5), and, with some differences, more familiar to
Korean and Japanese users. Unicode is seen as neutral with
regards to this politically charged issue, and has encoded
Simplified and Traditional Chinese glyphs separately (e.g. the
ideograph for "discard" is 丟 U+4E1F for Traditional Chinese Big5
#A5E1 and 丢 U+4E22 for Simplified Chinese GB #2210). It is also
noted that Traditional and Simplified characters should be encoded
separately according to Unicode Han Unification rules, because they
are distinguished in pre-existing PRC character sets. Furthermore,
as with other variants, Traditional to Simplified characters is not
a one-to-one relationship.
Specialist character sets developed to address , or regarded by
some as not suffering from, these perceived deficiencies include:
However,
none of these alternative standards has been as widely adopted as
Unicode, which is now the base character set
for many new standards and protocols, and is built into the
architecture of operating systems (Microsoft Windows, Apple
Mac OS X, and many other
Unix-like systems), programming languages
(Perl, Python, C#, Java, Common LISP, APL), and libraries (IBM International Components
for Unicode (ICU) along with the Pango,
Graphite, Scribe, Uniscribe, and
ATSUI
rendering engines), font formats (TrueType
and OpenType) and so on.
Examples of language dependent characters
In each row of the following table, the same character is repeated
in all five columns. However, each column is marked (via the HTML
lang attribute) as being in a different language:
Chinese (3 varieties: unmarked "Chinese",
simplified characters,
and
traditional
characters),
Japanese, or
Korean. The
browser should select, for each character, a
glyph (from a
font)
suitable to the specified language. (Besides actual character
variation--look for differences in stroke order, number, or
direction--the typefaces may also reflect different typographical
styles, as with serif and non-serif alphabets.) This only works for
fallback glyph selection if you have CJK fonts installed on your
system and the font selected to display this article does not
include glyphs for these characters. Note also that Unicode
includes non-graphical language tag characters in the range U+E0000
– U+E007F for plain text language tagging.
| Code |
Chinese
(Generic) |
Chinese
Simplified |
Chinese
Traditional |
Japanese |
Korean |
|
与 |
与 |
与 |
与 |
与 |
| U+4ECA |
今 |
今 |
今 |
今 |
今 |
| U+4EE4 |
令 |
令 |
令 |
令 |
令 |
| U+514D |
免 |
免 |
免 |
免 |
免 |
| U+5165 |
入 |
入 |
入 |
入 |
入 |
| U+5168 |
全 |
全 |
全 |
全 |
全 |
| U+5177 |
具 |
具 |
具 |
具 |
具 |
| U+5203 |
刃 |
刃 |
刃 |
刃 |
刃 |
| U+5316 |
化 |
化 |
化 |
化 |
化 |
| U+5340 |
區 |
區 |
區 |
區 |
區 |
| U+5916 |
外 |
外 |
外 |
外 |
外 |
| U+60C5 |
情 |
情 |
情 |
情 |
情 |
| U+624D |
才 |
才 |
才 |
才 |
才 |
| U+6B21 |
次 |
次 |
次 |
次 |
次 |
| U+6D77 |
海 |
海 |
海 |
海 |
海 |
| U+6F22 |
漢 |
漢 |
漢 |
漢 |
漢 |
| U+753B |
画 |
画 |
画 |
画 |
画 |
| U+76F4 |
直 |
直 |
直 |
直 |
直 |
| U+771F |
真 |
真 |
真 |
真 |
真 |
| U+7A7A |
空 |
空 |
空 |
空 |
空 |
| U+7D00 |
紀 |
紀 |
紀 |
紀 |
紀 |
| U+8349 |
草 |
草 |
草 |
草 |
草 |
| U+89D2 |
角 |
角 |
角 |
角 |
角 |
| U+8ACB |
請 |
請 |
請 |
請 |
請 |
| U+9053 |
道 |
道 |
道 |
道 |
道 |
| U+9913 |
餓 |
餓 |
餓 |
餓 |
餓 |
| U+9AA8 |
骨 |
骨 |
骨 |
骨 |
骨 |
Examples of some non-unified Han ideographs
For some glyphs, Unicode has encoded variant characters, making it
unnecessary to switch between fonts or language tags. In the
following table, the separate rows in each group contains the
Unicode equivalent character using different code points. Note that
for characters such as 入 (U+5165), the only way to display the two
variants is to change font (or language tag) as described in the
previous table. However, for 內 (U+5167), there is an alternate
character 内 (U+5185) as illustrated below. For some characters,
like 兌/兑 (U+514C/U+5151), either method can be used to display the
different glyphs.
| Code |
Chinese
(Generic) |
Chinese
Simplified |
Chinese
Traditional |
Japanese |
Korean |
| U+9AD8 |
高 |
高 |
高 |
高 |
高 |
| U+9AD9 |
髙 |
髙 |
髙 |
髙 |
髙 |
|
| U+7D05 |
紅 |
紅 |
紅 |
紅 |
紅 |
| U+7EA2 |
红 |
红 |
红 |
红 |
红 |
|
| U+4E1F |
丟 |
丟 |
丟 |
丟 |
丟 |
| U+4E22 |
丢 |
丢 |
丢 |
丢 |
丢 |
|
| U+4E57 |
乗 |
乗 |
乗 |
乗 |
乗 |
| U+4E58 |
乘 |
乘 |
乘 |
乘 |
乘 |
|
| U+4FA3 |
侣 |
侣 |
侣 |
侣 |
侣 |
| U+4FB6 |
侶 |
侶 |
侶 |
侶 |
侶 |
|
| U+514C |
兌 |
兌 |
兌 |
兌 |
兌 |
| U+5151 |
兑 |
兑 |
兑 |
兑 |
兑 |
|
| U+5167 |
內 |
內 |
內 |
內 |
內 |
| U+5185 |
内 |
内 |
内 |
内 |
内 |
|
| U+7522 |
產 |
產 |
產 |
產 |
產 |
| U+7523 |
産 |
産 |
産 |
産 |
産 |
|
| U+7A05 |
稅 |
稅 |
稅 |
稅 |
稅 |
| U+7A0E |
税 |
税 |
税 |
税 |
税 |
|
| U+4E80 |
亀 |
亀 |
亀 |
亀 |
亀 |
| U+9F9C |
龜 |
龜 |
龜 |
龜 |
龜 |
| U+9F9F |
龟 |
龟 |
龟 |
龟 |
龟 |
|
| U+5225 |
別 |
別 |
別 |
別 |
別 |
| U+522B |
别 |
别 |
别 |
别 |
别 |
|
| U+4E21 |
両 |
両 |
両 |
両 |
両 |
| U+4E24 |
两 |
两 |
两 |
两 |
两 |
| U+5169 |
兩 |
兩 |
兩 |
兩 |
兩 |
Unicode ranges
Ideographic characters assigned by Unicode appear in the following
blocks:
- CJK Unified Ideographs (4E00–9FFF) ()
- CJK Unified Ideographs Extension A (3400–4DBF) ()
- CJK Unified Ideographs Extension B (20000–2A6DF)
- CJK Unified Ideographs Extension C (2A700–2B73F)
- CJK Compatibility Ideographs (F900–FAFF) (the twelve characters
at FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21, FA23, FA24, FA27, FA28
and FA29 are actually "unified ideographs" not "compatibility
ideographs")
Unicode includes support of CJKV radicals, strokes, punctuation,
marks and symbols in the following blocks:
Additional compatibility (discouraged use) characters appear in
these blocks:
- Kangxi Radicals (2F00–2FDF)
- Enclosed CJK Letters and Months (3200–32FF) ()
- CJK Compatibility (3300–33FF) ()
- CJK Compatibility Ideographs (F900–FAFF) ()
- CJK Compatibility Ideographs (2F800–2FA1F)
- CJK Compatibility Forms (FE30–FE4F) ()
These compatibility characters (excluding the twelve unified
ideographs in the CJK Compatibility Ideographs block) are included
for compatibility with legacy text handling system and other legacy
character sets. They include forms of characters for vertical text
layout and rich text characters that Unicode recommends handling
through other means.
Unihan database files
The Unihan project has always made an effort to make available
their build database.
An
Unihan.zip file is provided on unicode.org. It
contains all the data the Unihan team have collected.
A
project libUnihan (0.5.3) provides a normalized
SQLite Unihan database and corresponding C library. All tables in
this database are in fifth normal form.
libUnihan is released as
LGPL, while its
database, UnihanDb, is released as
MIT
License.
See also
Notes
- The Unicode Standard, Version 4.0-online
edition
- http://www.info.gov.hk/digital21/eng/structure/irg.html
- This example also points to another reason that “abstract
character” and grapheme as an abstract unit in a written language
do not necessarily map one-ton-one. In English the combining
diaeresis, “¨”, and the “o” it modifies may be seen as two separate
graphemes, while in languages such as Swedish, the letter “ö” may
be seen as a single grapheme. Similarly in English the dot on an
“i” is understood as a part of the “i” grapheme while in other
languages the dot may be seen as a separate grapheme added to the
“i”.
- " Unicode (see the first paragraph) defines a consistent
way of encoding multilingual text that enables the exchange of text
data internationally and creates the foundation for global
software." ( Introduction).
- See the Unicode Consortium’s Ideographic Variation
Database and the PDF code charts for variation selectors:
Variation Selectors and Variation Selectors Supplement
- See the Unicode Consortium’s Ideographic Variation
Database and the PDF code charts for variation selectors:
Variation Selectors and Variation Selectors Supplement
- Ten Years
- http://www.unicode.org/history/unicode88.pdf
- Character Set List
- First proposed in 1998. However, , adoption of this proposed
counter-standard is nearly non-existent. There has been little
definitive standardization process or documents on UTF-2000 except
for some conference presentations in 2000 and 2001.
- 汉字整理研究获重要成果
- See unicode.org/charts/unihan.html
- See libunihan.sourceforge.net
References
External links
- Unihan Database
- Unicode standard
- IRG Page
- IRG
working documents – many big size pdfs, some of them with
details of CJK extensions
- Han Unification in Unicode by Otfried
Cheong
- Why Unicode Won't Work on the Internet: Linguistic,
Political, and Technical Limitations
- Why Unicode Will Work On The Internet
- Per-character summary of differences in
characters
- The secret life of Unicode
- GB18030 Support Package for Windows 2000/XP,
including Chinese, Tibetan, Yi, Mongolian and Thai font by
Microsoft
- Proposal to encode additional grass radicals in the
UCS – A humorous proposal to encode all possible variants of
the grass radical, made as an April
Fool's Day joke
- Unicode Technical Note 26: On the Encoding of Latin,
Greek, Cyrillic, and Han
- "Unicode Revisited" – the strong point of view
of some people working on the competing TRON proposal
- "Unicode in Japan, guide to a technical and psychological
struggle" – A more balanced take on the arguments for and
against Unicode for Japanese.