LYCOS RETRIEVER
Unicode: Unicode Standards
With the Unicode 16-bit encoding system, 65,536 characters can be encoded (2^16 = 65,536). However, the total number of characters that need to be encoded has exceeded that limit, mainly to accommodate the CJK extensions. To make room for new characters, the developers of the Unicode Standard introduced the notion of surrogate pairs. In a surrogate pair, a Unicode code point from the range U+D800 to U+DBFF (called the "high surrogate") is combined with another Unicode code point from the range U+DC00 to U+DFFF (called the "low surrogate") to represent a single new character, allowing the encoding of over 1 million additional characters (1,024 x 1,024 = 1,048,576). Unlike the lead and trail bytes of MBCS text, high and low surrogates cannot be mistaken for ordinary characters: they have no interpretation unless they appear as part of a surrogate pair, which avoids one of the major challenges of lead-byte and trail-byte processing.
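The arithmetic behind surrogate pairs can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for this example.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value, 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate back into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x20000)  # first code point of CJK Extension B
    print(f"{high:04X} {low:04X}")          # prints "D840 DC00"
```

Each surrogate carries 10 bits, which is where the 1,024 x 1,024 additional characters come from. A lone surrogate is rejected instead of being interpreted as a character.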
Unicode is a universal character-coding standard for the interchange and display of the principal written languages. It covers the languages of the Americas, Europe, the Middle East, Africa, India, Asia, and Pacifica, as well as historic scripts and technical symbols. Unicode allows for the exchange, processing, and display of multilingual texts, as well as the use of common technical and mathematical symbols. It aims to resolve internationalization problems in multilingual computing, such as differing national character standards. Not all modern or archaic scripts... are currently supported.
Unicode is compatible with ASCII characters. The first 128 Unicode characters correspond to the ASCII characters and have the same numeric values. ASCII's 0x41 is the same as Unicode's \u0041. While ASCII's 128 characters support just the Latin alphabet, the more than 65,000 characters addressable with a 16-bit Unicode value can support many different languages. Unicode is fully compatible with the ISO 10646-1 and UCS-2 standards. JavaScript programs will still be written in the ASCII-set characters.
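The ASCII compatibility claim is easy to check mechanically. The following Python sketch uses only built-in functions; the helper name `ascii_values_match_unicode` is an assumption introduced for this example.

```python
def ascii_values_match_unicode() -> bool:
    """Check that each ASCII byte value decodes to the Unicode
    character with the same numeric value."""
    return all(bytes([value]).decode("ascii") == chr(value) for value in range(128))


if __name__ == "__main__":
    print(ascii_values_match_unicode())   # prints "True"
    print(hex(ord("A")))                  # prints "0x41"
    print("\u0041" == "A")                # prints "True"
```

Because the values coincide, pure ASCII text yields the same bytes whether it is encoded as ASCII or as UTF-8.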
The Unicode Standard provides compatibility mappings for a number of characters. A compatibility mapping indicates a relationship to another character, but the exact nature of the relationship varies: in some cases it means "is based on"; in others it denotes a property. When plain text is marked up, it may make sense to map some of these characters to a combination of their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents, and the contexts in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately.
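As a rough illustration of the difference between applying a compatibility mapping blindly and pairing it with markup, the Python sketch below uses the standard `unicodedata` module. The function `compat_to_markup` and the `TAG_TO_MARKUP` table are hypothetical names for this example, and only the superscript and subscript mappings are handled; every other character is deliberately left alone.

```python
import unicodedata

# Assumption for this sketch: only these two compatibility tags are converted,
# and HTML-style <sup>/<sub> elements are the target markup.
TAG_TO_MARKUP = {"<super>": "sup", "<sub>": "sub"}


def compat_to_markup(text: str) -> str:
    """Replace superscript/subscript compatibility characters with
    their base characters wrapped in markup; leave everything else as-is."""
    pieces = []
    for ch in text:
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in TAG_TO_MARKUP:
            element = TAG_TO_MARKUP[parts[0]]
            base = "".join(chr(int(p, 16)) for p in parts[1:])
            pieces.append(f"<{element}>{base}</{element}>")
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    text = "x\u00b2"                               # x followed by SUPERSCRIPT TWO
    print(unicodedata.decomposition("\u00b2"))     # prints "<super> 0032"
    print(unicodedata.normalize("NFKC", text))     # prints "x2"
    print(compat_to_markup(text))                  # prints "x<sup>2</sup>"
```

Blanket NFKC normalization turns "x squared" into "x2" and loses the meaning, while the markup-aware version keeps it, which is one reason indiscriminate mapping is discouraged.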
In its 16-bit form, the Unicode standard is a fixed-width, uniform encoding scheme. It is intended for the interchange and display of many different languages, as well as historic scripts, technical and mathematical symbols, and multilingual texts. The Unicode standard specifies the identity of each character and its numeric value. The 16-bit numeric value is written as a hexadecimal number with the prefix \u (a backslash followed by a lowercase u). The Unicode value \u0041, for example, represents the character A. The unique Unicode name for this character is LATIN CAPITAL LETTER A.
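The pairing of a numeric value with a unique name can be inspected with Python's standard `unicodedata` module. The helper `describe` below is a hypothetical name for this sketch, and it only handles characters whose value fits in 16 bits.

```python
import unicodedata
from typing import Tuple


def describe(ch: str) -> Tuple[str, str]:
    """Return the \\u escape and the Unicode name of a 16-bit character."""
    value = ord(ch)
    if value > 0xFFFF:
        raise ValueError("character does not fit in a four-digit \\u escape")
    return f"\\u{value:04X}", unicodedata.name(ch)


if __name__ == "__main__":
    escape, name = describe("A")
    print(escape, name)   # prints "\u0041 LATIN CAPITAL LETTER A"
    # The name maps back to the same character.
    print(unicodedata.lookup("LATIN CAPITAL LETTER A"))   # prints "A"
```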
The Unicode Standard defines 66 non-character code points, or noncharacters. These are the last two positions on each of the 17 planes (in other words, all 34 code points ending in ...FFFE or ...FFFF), as well as the 32 code points from U+FDD0 to U+FDEF. Applications are free to use any of these code points internally but should never attempt to interchange them. In effect, noncharacters can be thought of as application-internal private-use code points.
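The definition translates directly into a small predicate. The Python sketch below is illustrative only, and the function name `is_noncharacter` is an assumption for this example.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True if the code point is one of the 66 Unicode noncharacters."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The contiguous block U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # The last two positions of each of the 17 planes: ...FFFE and ...FFFF.
    return (code_point & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    total = sum(is_noncharacter(cp) for cp in range(0x110000))
    print(total)   # prints "66"
```

A check like this could be run on outgoing text to confirm that no application-internal code points are being interchanged.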