What is Unicode?
Developed in cooperation between the Unicode Consortium and the International Organization for Standardization, Unicode is an attempt to consolidate the alphabets and ideographs of the world’s languages into a single, international character set. It is language agnostic, focusing instead on the characters themselves. Thus, a letter shared between English and French, or, for that matter, an ideograph shared between Chinese and Japanese writing, has the same Unicode character. As a multilingual standard, Unicode makes it possible for developers to create applications without having to resort to the often costly and time-consuming task of releasing localized versions for each language.
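The shared-character idea can be sketched with Python's built-in `ord()` and `hex()` functions: the Latin letter "e" used in both English and French is a single character, U+0065, and the unified Han ideograph "中" used in both Chinese and Japanese is a single character, U+4E2D.

```python
# One code point per character, regardless of the language it appears in.
print(hex(ord("e")))   # 0x65   - Latin small letter e (English, French, ...)
print(hex(ord("中")))  # 0x4e2d - unified Han ideograph (Chinese, Japanese)
```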
Most Western character sets are 7-bit (e.g., US-ASCII) or 8-bit (e.g., Latin-1), limiting them, respectively, to 128 or 256 characters. This limitation has resulted in a slew of sets customized for each language. For languages like Chinese, Korean, and Japanese, whose heavily ideographic writing systems (i.e., based on the meaning of a word rather than its component sounds) comprise thousands of characters, traditional 7- and 8-bit character sets are not adequate. Therefore, to include the character sets of the world’s principal writing systems, Unicode primarily uses a 16-bit set, allowing up to 65,536 characters. One consequence is that Unicode text takes up twice as much disk space as text using an 8-bit character set.
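The size difference can be demonstrated directly, as a small sketch, by encoding the same text in an 8-bit character set (Latin-1) and in 16-bit Unicode (UTF-16, big-endian without a byte-order mark): the 16-bit form takes exactly twice as many bytes.

```python
# Compare storage size: 8-bit Latin-1 vs. 16-bit UTF-16 for the same text.
text = "Hello, world"
latin1 = text.encode("latin-1")    # one byte per character
utf16 = text.encode("utf-16-be")   # two bytes per character (no BOM)
print(len(latin1), len(utf16))     # 12 24
```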
As a character set, Unicode does not concern itself with the specific appearance, or glyph, of a character. Instead, it includes only a code and a name for each character. Individual fonts are assigned the task of rendering characters into glyphs, with the exact appearance of glyphs varying between fonts. Similarly, Unicode does not, for the most part, distinguish between plain and rich text, instead allowing applications to apply their own text processing and formatting.
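The code-and-name model can be inspected with Python's standard `unicodedata` module, which exposes the name the Unicode standard assigns to each character; note that nothing here describes a glyph, since rendering is left to fonts.

```python
import unicodedata

# Unicode assigns each character a code point and a name -- but no glyph.
print(hex(ord("é")), unicodedata.name("é"))
# 0xe9 LATIN SMALL LETTER E WITH ACUTE
```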
For more information about Unicode, visit the Unicode Consortium’s Web site.