Unicode: Unicode Standard
built 643 days ago
As a further caveat, not all Unicode-based text processing is a matter of simple character-by-character parsing. Complex text-based operations such as hyphenation, line breaking, and glyph formation need to take into account the context in which they are being used (the relation to surrounding characters, for instance). The complexity of these operations hinges on language rules and has nothing to do with Unicode as an encoding standard. Instead, the software implementation should define a higher-level protocol for handling these operations.
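One small illustration of why character-by-character parsing can fall short is a letter followed by a combining accent: two code points that a reader perceives as one character. The sketch below is a simplified assumption for illustration only. The helper name `group_with_combining_marks` is hypothetical, and it covers neither hyphenation and line breaking nor full text segmentation, which need language-specific rules.

```python
import unicodedata


def group_with_combining_marks(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplified sketch: it only checks the canonical combining class and
    does not implement complete grapheme-cluster segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        # combining() returns 0 for ordinary (non-combining) characters.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a".
text = "e\u0301a"
print(len(text))                              # number of code points
print(len(group_with_combining_marks(text)))  # number of grouped units
```

A naive loop over code points would treat the accent as a separate character, while the grouped view keeps it attached to its base letter.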
Source:
Unicode is an industry standard whose goal is to provide the means by which text of all forms and languages can be encoded for use by computers through a single character set. Originally, text characters were represented in computers using byte-wide data: each printable character (and many non-printing, or "control", characters) was stored in a single byte, which allowed for 256 characters in total. However, globalization has created a need for computers to accommodate many different alphabets (and other writing systems) from around the world in an interchangeable way.
The UTF-8 encoding of Unicode is compatible with ASCII and is supported by many programs. The first 128 Unicode characters correspond to the ASCII characters and have the same byte values; in particular, the printable characters U+0020 through U+007E are the ASCII characters 0x20 through 0x7E. Unlike ASCII, which is a 7-bit character set covering the basic Latin alphabet, UTF-8 uses between one and four octets for each character. ("Octet" means a byte, or 8 bits.) This is enough to cover every Unicode code point, of which there are just over 1.1 million. An alternative encoding, UTF-16, uses one 16-bit unit (two octets) for the most common characters (code points up to U+FFFF) and a pair of 16-bit units (four octets) for all others.
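The octet counts can be checked with a short Python sketch. The helper names `utf8_octets` and `utf16_octets` and the sample characters are assumptions chosen for illustration; only the built-in `str.encode` method is used.

```python
def utf8_octets(ch: str) -> int:
    """Return how many octets the UTF-8 encoding of one character uses."""
    return len(ch.encode("utf-8"))


def utf16_octets(ch: str) -> int:
    """Return how many octets the UTF-16 encoding of one character uses.

    The big-endian variant is used so that no byte-order mark is added.
    """
    return len(ch.encode("utf-16-be"))


# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1D11E (G clef)
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

for ch in samples:
    hex_bytes = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(
        f"U+{ord(ch):04X}: UTF-8 uses {utf8_octets(ch)} octet(s) [{hex_bytes}], "
        f"UTF-16 uses {utf16_octets(ch)} octet(s)"
    )

# ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "Hello".encode("utf-8") == "Hello".encode("ascii")
```

The four samples were picked so that each lands in a different UTF-8 length class, from one octet up to four.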
Unicode encompasses virtually all characters used widely in computers today. It is capable of addressing more than 1.1 million code points. The standard has provisions for 8-bit, 16-bit and 32-bit encoding forms. The 16-bit encoding is used as its default encoding, and the million-plus code points are distributed across 17 "planes", each addressing 65,536 code points. The characters in Plane 0, commonly called the "Basic Multilingual Plane" (BMP), are used to represent most of the world's written scripts, characters used in publishing, mathematical and technical symbols, geometric shapes, basic dingbats (including all level-100 Zapf Dingbats), and punctuation marks. But in addition to the support for characters in modern languages and for the symbols and shapes just mentioned, Unicode ... offers coverage for other characters, such as less commonly used Chinese, Japanese, and Korean (CJK) ideographs, Arabic presentation forms, and musical symbols.
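The plane arithmetic can be sketched in a few lines of Python. The constant and function names here (`PLANE_SIZE`, `PLANE_COUNT`, `plane_of`, `is_bmp`) are hypothetical and exist only for this illustration.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1  # 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane spans 2**16 code points, so the plane is the high part.
    return code_point >> 16


def is_bmp(code_point: int) -> bool:
    """Return True if the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(PLANE_SIZE * PLANE_COUNT)  # total number of addressable code points
print(plane_of(ord("A")))        # a Latin letter
print(plane_of(0x1D11E))         # MUSICAL SYMBOL G CLEF
```

Seventeen planes of 65,536 code points each give 1,114,112 code points in total, which is where the "more than 1.1 million" figure comes from.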
Source:
The Unicode character set is designed to cover the characters of all known encodings. Unicode is modeled after the ASCII (American Standard Code for Information Interchange) character set. It uses a numerical value and name for each character. The character encoding specifies the identity of the character and its numeric value (code position), as well as the representation of this value in bits. The numeric value (code value) is written as a hexadecimal number of at least four digits with the prefix U+; for example, U+0041 represents A. The unique name for this character is LATIN CAPITAL LETTER A.
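The pairing of a numeric value with a unique name can be inspected using Python's standard `unicodedata` module. This is a minimal sketch: the helper name `describe` is an assumption, and the names available depend on the version of the Unicode database bundled with the Python interpreter.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' using U+ notation."""
    # Some characters (control characters, for example) have no name,
    # so a placeholder is supplied instead of raising an error.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name}"


print(describe("A"))

# The mapping also works in the other direction, from name to character.
print(unicodedata.lookup("LATIN CAPITAL LETTER A"))
```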
Not all the Unicode characters represented below or in other writeups may be viewable in your browser. In fact, some of the characters may not be viewable in any browser. This is because Unicode is an evolving and ever-growing standard that has room for over a million characters and symbols from hundreds of languages and cultures past and present, and not all software has the ability to display all the characters. If your browser does not understand the Unicode value of the character, it will usually display a small square box or a question mark. This is normal and expected behavior, and does not mean there is a problem with this writeup or your web browser. For additional information see Using Unicode on E2.
Source: