LYCOS RETRIEVER
Unicode: Characters
Unicode is a multi-byte character set, portable across all major computing platforms and with decent coverage over most of the world. It is ... single-locale; it includes no code pages or other complexities that make software harder to write and test. There is no competing character set that's reasonably multiplatform. For these reasons, Trolltech uses Unicode as the native character set for Qt (since version 2.0).
Source:
Unicode is a system for representing characters and symbols in binary[1]. To that extent, it is similar to other encoding schemes like ASCII. However, the goal of Unicode is to be able to uniquely represent every character in every language. Use of Unicode promotes easy internationalization of files and applications.
Source:
Converts all Unicode characters in the string that have a case to lowercase. The exact manner in which this is done depends on the current locale, and may change the number of characters in the string.
Source:
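As a small illustration of how lowercasing can change the number of characters, here is a sketch in Python. Note the assumption: Python's built-in `str.lower()` does not consult the current locale, so this is not the function described above, only a demonstration of the same length-changing effect. The variable names are made up for the example.

```python
# U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) has no single-character
# lowercase form; its full lowercase mapping is "i" plus U+0307
# (COMBINING DOT ABOVE).
capital_dotted_i = "\u0130"
lowered = capital_dotted_i.lower()

length_before = len(capital_dotted_i)  # one code point before lowercasing
length_after = len(lowered)            # two code points afterwards

print(length_before, length_after)
```

Code that assumes a lowercased string has the same length as its input (for example, when reusing a fixed-size buffer) can be caught out by cases like this one.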
Unicode characters and strings use data types that are distinct from those for code page-based characters and strings. Along with a series of macros and naming conventions, this distinction minimizes the chance of accidentally mixing the two types of character data. It facilitates compiler type checking to ensure that only Unicode parameter values are used with functions expecting Unicode strings.
Source:
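The idea of keeping the two kinds of character data in distinct types can be seen, by loose analogy, in Python 3, where decoded text (`str`) and encoded bytes (`bytes`) are separate types. This is only an analogy and not the API the paragraph above describes; Python reports the mismatch at run time rather than at compile time, and the helper `try_mix` is hypothetical.

```python
text = "caf\u00e9"             # Unicode text (str)
raw = text.encode("latin-1")   # the same text as code-page-style bytes

def try_mix(a, b):
    """Try to concatenate two values; return the error if the types clash."""
    try:
        return a + b
    except TypeError as exc:
        return exc

# Mixing text and bytes is rejected instead of silently producing garbage.
mix_result = try_mix(text, raw)
print(type(mix_result).__name__)

# Converting explicitly makes the intent clear and works.
joined = text + raw.decode("latin-1")
print(joined)
```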
There are lots of different ways of encoding Unicode code points into bits, but the most popular encoding is UTF-8. Using UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes (the original design allowed sequences of up to 6 bytes, but the current definition in RFC 3629 limits UTF-8 to 4). This has the useful side effect that English text looks exactly the same in UTF-8 as it did in ASCII, because for every ASCII character with hexadecimal value 0xXY, the corresponding Unicode code point is U+00XY and is stored as that same single byte. This backwards compatibility is why, if you are developing an application that is only used by English speakers, you can often get away without handling characters properly and still expect things to work most of the time. Of course, if you use a different encoding such as UTF-16, this doesn't apply, since no code point is encoded in 8 bits.
Source:
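The byte counts described above can be checked with a short Python sketch. The sample characters and variable names are chosen only for illustration.

```python
# One sample character from each UTF-8 length class.
samples = [
    "A",           # U+0041, ASCII range
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

# Number of bytes each character occupies when encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)

# Plain English text produces identical bytes in ASCII and UTF-8.
ascii_text = "Hello"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)

# In UTF-16 even an ASCII letter takes two bytes.
utf16_a = "A".encode("utf-16-le")
print(utf16_a)
```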
Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.
Source:
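The BOM behaviour described above can be observed with Python's built-in codecs. This is a minimal sketch; the variable names are illustrative, and the byte order chosen by the plain `utf-16` codec when encoding depends on the platform, so the example accepts either one.

```python
import codecs

# Encoding with "utf-16" writes a BOM first, in the platform's byte order.
encoded = "hi".encode("utf-16")
starts_with_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(starts_with_bom)

# Decoding with "utf-16" uses the BOM to pick the byte order, then drops it.
round_trip = encoded.decode("utf-16")
print(round_trip)

# The fixed-order variant writes no BOM at all...
encoded_le = "hi".encode("utf-16-le")
print(encoded_le)

# ...and does not strip one on input: the BOM survives as U+FEFF.
with_bom = codecs.BOM_UTF16_LE + encoded_le
decoded_le = with_bom.decode("utf-16-le")
print(repr(decoded_le))
```

When reading data of unknown origin with a fixed-order codec, a leading U+FEFF may therefore need to be removed by hand.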