Thursday, December 23, 2010

New to Unicode? Go through this

For a computer to store text and numbers that humans can understand, there needs to be a code that transforms characters into numbers. The Unicode standard defines such a code by means of character encoding.

Character Encoding

All character encoding does is assign a number to every character that can be used. If I really wanted to, I could make up a character encoding right now. For example, I could say "A" becomes the number 13, "a" = 14, "1" = 33, "#" = 123 and so on. My character encoding scheme might work brilliantly on my computer, but when I come to send some text to another computer I will have problems. The other computer won't know what I'm talking about unless it understands my encoding scheme too.
This is where industry-wide standards come in. If the whole computer industry uses the same character encoding scheme, all the computers can display the same characters.
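
To make that concrete, here is a quick Python sketch of the made-up scheme above (the table and its numbers are just the invented ones from this post, not any real standard):

    # A toy character encoding: assign a number to every character we care about.
    # These are the made-up assignments from the paragraph above.
    my_encoding = {"A": 13, "a": 14, "1": 33, "#": 123}
    my_decoding = {number: char for char, number in my_encoding.items()}

    encoded = [my_encoding[c] for c in "Aa1#"]            # [13, 14, 33, 123]
    decoded = "".join(my_decoding[n] for n in encoded)    # "Aa1#"
    print(encoded, decoded)

    # A computer with a different table would turn [13, 14, 33, 123] into
    # completely different characters -- which is why a shared standard matters.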

What is Unicode?

ASCII, which stands for American Standard Code for Information Interchange, became the first widespread encoding scheme. However, it is limited to only 128 character definitions, which is fine for the most common English characters, numbers and punctuation but is a bit limiting for the rest of the world. Everyone else naturally wanted to be able to encode their characters too, and for a while, depending on where you were, a different character might be displayed for the same code. Eventually the other parts of the world began creating their own encoding schemes, and things started to get a little confusing. Not only were the coding schemes of different lengths, but programs also needed to figure out which encoding scheme they were meant to be using.
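
To see how messy that gets, here is a small Python illustration (it uses two European 8-bit code pages rather than ASCII variants, but the problem is the same): the very same byte decodes to different characters depending on which scheme you assume.

    # The same byte means different things under different legacy encodings.
    raw = bytes([0xA4])
    print(raw.decode("latin-1"))       # '¤' (currency sign) under ISO 8859-1
    print(raw.decode("iso8859-15"))    # '€' (euro sign) under ISO 8859-15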
It became apparent that a new character encoding scheme was needed, and the Unicode standard was created. The objective of Unicode is to unify all the different encoding schemes so that the confusion between computers can be limited as much as possible. These days the Unicode standard defines values for over 100,000 characters, and the full list can be seen at the Unicode Consortium. It has several character encoding forms, UTF standing for Unicode Transformation Format (there is a quick comparison in code after this list):
  • UTF-8: only uses one byte (8 bits) to encode English characters. It can use a sequence of bytes to encode the other characters. UTF-8 is widely used in email systems and on the Internet.
  • UTF-16: uses two bytes (16 bits) to encode the most commonly used characters. If needed, the additional characters can be represented by a pair of 16-bit numbers.
  • UTF-32: uses four bytes (32 bits) to encode each character. As the Unicode standard grew, it became apparent that a 16-bit number is too small to represent every character, so UTF-32 is capable of representing every Unicode character as a single number.
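
As a rough comparison, here is how the same two-character string comes out under each form in Python 3 (big-endian variants chosen so no byte-order mark is added):

    text = "A€"   # "A" is U+0041, "€" is U+20AC

    utf8  = text.encode("utf-8")       # 1 byte + 3 bytes  = 4 bytes
    utf16 = text.encode("utf-16-be")   # 2 bytes + 2 bytes = 4 bytes (two 16-bit code units)
    utf32 = text.encode("utf-32-be")   # 4 bytes + 4 bytes = 8 bytes (two 32-bit code units)

    for name, data in [("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)]:
        print(name, len(data), "bytes:", data.hex())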

Code Points

A code point is the value that a character is given in the Unicode standard. Code points are written as hexadecimal numbers with a "U+" prefix. For example, to encode the characters I looked at earlier, "A" is U+0041, "a" is U+0061, "1" is U+0031 and "#" is U+0023. These code points are split into 17 different sections called planes, each holding 65,536 code points. The first plane, which holds the most commonly used characters, is known as the basic multilingual plane.
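
You can check these values yourself with Python's ord(), and working out the plane is just integer division by 65,536 (the plane() helper below is my own little function, not part of any library):

    for ch in "Aa1#":
        print(ch, "is U+%04X" % ord(ch))   # A is U+0041, a is U+0061, 1 is U+0031, # is U+0023

    def plane(code_point):
        # Planes are numbered 0 to 16; each one holds 0x10000 (65,536) code points.
        return code_point // 0x10000

    print(plane(0x0041))    # 0 -> the basic multilingual plane
    print(plane(0x1D160))   # 1 -> the first supplementary plane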

Code Units

The encoding schemes are made up of code units. They are a way of indexing where a character is positioned on a plane. For instance, with UTF-16 each 16-bit number is a code unit. The code units can be transformed into code points. For example, the musical eighth-note symbol has a code point of U+1D160 and lives on the second plane of the Unicode standard (the first supplementary plane). It would be encoded using the combination of the following two 16-bit code units: 0xD834 and 0xDD60.
For the basic multilingual plane the values of the code points and code units are identical. This allows a shortcut for UTF-16 that saves a lot of storage space. It only needs to use one 16-bit number to represent those characters.
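
Here is a rough Python sketch of the surrogate-pair arithmetic for U+1D160, plus the basic multilingual plane shortcut (this is my own illustration of the calculation, not taken from any particular library):

    code_point = 0x1D160
    offset = code_point - 0x10000           # 0x0D160
    high = 0xD800 + (offset >> 10)          # 0xD834 (high surrogate)
    low  = 0xDC00 + (offset & 0x3FF)        # 0xDD60 (low surrogate)
    print(hex(high), hex(low))              # 0xd834 0xdd60

    # Python's built-in codec produces the same pair of code units:
    print(chr(code_point).encode("utf-16-be").hex())   # 'd834dd60'

    # A basic multilingual plane character only needs one 16-bit code unit:
    print("A".encode("utf-16-be").hex())               # '0041'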
