Unicode: Characters
built 655 days ago
Since Unicode 3.1, extensions have been defined so that the code point range is now U+0000 to U+10FFFF (which fits in 21 bits); the original ISO/IEC 10646 design formally allowed a 31-bit code space to leave room for future expansion. It is a myth that there are only 65,536 Unicode code points and that every Unicode character can be squeezed into two bytes. It is ... incorrect to think that UTF-8 can represent fewer characters than UTF-16. UTF-8 simply uses a variable number of bytes per character, sometimes just one byte (8 bits).
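As a rough illustration of the point above, the following Python sketch checks the upper end of the code point range and confirms that a character above U+FFFF can be encoded in both UTF-8 and UTF-16. The helper name `encodes_in_both` and the sample character are chosen only for this example.

```python
import sys

# Python's largest valid code point matches the top of the Unicode range.
MAX_CODE_POINT = sys.maxunicode  # 0x10FFFF


def encodes_in_both(ch: str) -> bool:
    """Return True if ch round-trips through both UTF-8 and UTF-16."""
    return (
        ch.encode("utf-8").decode("utf-8") == ch
        and ch.encode("utf-16-le").decode("utf-16-le") == ch
    )


# U+1F600 lies above U+FFFF, so it cannot fit in a single 16-bit unit.
sample = "\U0001F600"

print(hex(MAX_CODE_POINT), MAX_CODE_POINT.bit_length())
print(len(sample.encode("utf-8")), len(sample.encode("utf-16-le")))
print(encodes_in_both(sample), encodes_in_both(chr(MAX_CODE_POINT)))
```

Both encodings cover the same set of characters; they differ only in how many bytes each character takes.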
Unicode-enabled functions are often referred to as wide character functions, as described in Conventions for Function Prototypes. This designation is made because of the use of the UTF-16 encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems. Each code value is 16 bits wide, in contrast to the older code page approach to character and string data, which uses 8-bit code values. The use of 16 bits allows the direct encoding of 65,536 characters. In fact, the universe of symbols used to transcribe human languages is even larger than that, so UTF-16 code values in the range U+D800 through U+DFFF are reserved to form surrogate pairs, which constitute 32-bit encodings of supplementary characters. See Surrogates and Supplementary Characters for further discussion.
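The surrogate pair mechanism described above can be sketched in a few lines of Python. This is a minimal illustration of the standard UTF-16 arithmetic, not code from the quoted documentation; the function name `to_surrogate_pair` is hypothetical.

```python
import struct
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 code values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM).
codec_units = struct.unpack("<2H", "\U0001F600".encode("utf-16-le"))

print(hex(high), hex(low), codec_units == (high, low))
```

A character in the 16-bit range needs no such split and is stored as a single 16-bit code value.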
While many customers realize the need for and benefits of using Unicode, they are often intimidated by its perceived complexity. In some cases it is the fear of the unknown, but some challenges to using Unicode need to be addressed. Most of these revolve around UTF-8 being a variable-width encoding scheme. ASCII characters occupy 1 byte each. Accented Latin characters, Arabic, Cyrillic, Greek, and Hebrew occupy 2 bytes each. Most other characters (including Chinese, Indic scripts, Japanese, and Korean) occupy 3 bytes each, and supplementary characters are represented in 4 bytes.
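The byte counts listed above can be checked directly. The sketch below compares the length reported by Python's UTF-8 encoder with a length predicted from the code point alone; the sample characters and the helper `utf8_length` are illustrative choices, not part of the quoted text.

```python
def utf8_length(code_point: int) -> int:
    """Predict how many bytes UTF-8 needs for one code point."""
    if code_point < 0x80:
        return 1        # ASCII
    if code_point < 0x800:
        return 2        # accented Latin, Greek, Cyrillic, Hebrew, Arabic
    if code_point < 0x10000:
        return 3        # rest of the 16-bit range, including CJK and Indic scripts
    return 4            # supplementary characters


samples = {
    "ASCII": "A",
    "Accented Latin": "\u00e9",
    "Greek": "\u03b1",
    "Cyrillic": "\u0416",
    "Hebrew": "\u05d0",
    "Arabic": "\u0639",
    "Devanagari": "\u0915",
    "Chinese": "\u4e2d",
    "Japanese": "\u3042",
    "Korean": "\ud55c",
    "Supplementary": "\U0001F600",
}

# Map each label to the byte count produced by the real encoder.
encoded_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, ch in samples.items():
    print(f"{name:15} U+{ord(ch):04X} {encoded_lengths[name]} byte(s)")
```

Because the width varies, the number of characters in a string and the number of bytes it occupies generally differ, which matters when sizing buffers or database columns.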
Source:
This comprehensive reference covers everything you need to grasp Unicode and takes you on a detailed tour of the complex world of characters. Learn how to identify and classify characters, use their properties, and process data in a robust manner. Other topics include collation and sorting, line-breaking rules, and Unicode encodings. Suitable for both beginning and seasoned programmers.
Source:
To simplify matters further, Unicode defines its first 128 characters to be identical to the ASCII characters. An encoding that makes use of this fact is UTF-8. UTF-8 is a variable-length encoding with several properties that make it ideal for storing Unicode when the majority of characters are ASCII characters. Among the properties of UTF-8 are:
Unicode has started to replace character-encoding schemes such as ASCII, EUC, and ISO 8859 at all levels. With the UTF-8 encoding, Unicode can be used in a convenient and backward-compatible way in environments that were designed entirely around ASCII, such as UNIX. UTF-8 is the encoding commonly used on Unix, Linux, and similar systems.
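The backward compatibility with ASCII mentioned above can be demonstrated with a short Python sketch. The function name `is_ascii_compatible` and the sample strings are assumptions made for this example.

```python
def is_ascii_compatible(text: str) -> bool:
    """Return True if pure-ASCII text has identical bytes in ASCII and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


plain = "plain ASCII text"
mixed = "na\u00efve caf\u00e9"  # contains two non-ASCII letters

mixed_bytes = mixed.encode("utf-8")

# Bytes below 0x80 in UTF-8 output are always genuine ASCII characters:
# every byte of a multi-byte sequence has its high bit set.
ascii_only = bytes(b for b in mixed_bytes if b < 0x80).decode("ascii")

print(is_ascii_compatible(plain))
print(len(mixed), len(mixed_bytes), ascii_only)
```

This property is why tools that scan for ASCII delimiters such as `/` or newline can usually process UTF-8 data without modification.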
Source: