Friday, December 24, 2010

(My first ever blog post, 100% written and devised by me)


What is UTF-8?
UTF-8 stands for UCS[1] Transformation Format, 8-bit, where UCS stands for Universal Character Set. It is the most widely used character encoding on the web. To understand it, start with ASCII. ASCII assigns the values 0 to 127 (128 values in total) to the lowercase letters a to z, the uppercase letters A to Z, the digits, punctuation marks, and some non-printable control characters such as carriage return and newline (\n). In other words, it covers more or less every symbol on an English keyboard, enough to write a properly punctuated sentence. The next question is how ASCII is stored in a computer, since, as you are aware, every value on your disk is stored as bits (0 and 1). Here is the ASCII list with its characters:


After that, the second major thing is how it is stored on disk:






By this point it should be fairly clear how the computer deals with ASCII and how it is stored. Ultimately, everything on your computer lives and breathes in 0s and 1s.
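As a small illustration of the point above, this Python sketch prints the ASCII code and the 8-bit pattern for a few characters. The helper name `ascii_bits` and the sample string are only assumptions made for this example.

```python
def ascii_bits(text):
    """Return (character, code, 8-bit binary string) for each ASCII character."""
    rows = []
    for ch in text:
        code = ord(ch)  # numeric value of the character
        if code > 127:
            raise ValueError(f"{ch!r} is not an ASCII character")
        rows.append((ch, code, format(code, "08b")))
    return rows


# One uppercase letter, one lowercase letter, and the newline control character.
for ch, code, bits in ascii_bits("Az\n"):
    print(repr(ch), code, bits)
```

Every ASCII value fits in 7 bits, so the highest bit of the stored byte is always 0.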

ASCII only contained the English characters plus some punctuation symbols. However, the byte mentioned above can store 256 different values (0 to 255), and only 128 of them were allotted to standard ASCII. So the question became: what about the other 128 values? The big players in IT started using them for whatever symbols and languages they required, but none of this was standard across the world. One machine would take an extended ASCII value to mean one character and another machine would take it to mean something else, which created a mess. That is where UTF-8 came into the picture.
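To make that mess concrete, here is a rough Python sketch: the same byte above 127 is decoded with two legacy code pages and gives two different characters. The byte value 0xE9 and the choice of Latin-1 and ISO 8859-5 are just examples picked for illustration.

```python
raw = bytes([0xE9])  # one byte with value 233, outside the ASCII range

as_latin1 = raw.decode("latin-1")      # Western European code page
as_cyrillic = raw.decode("iso8859_5")  # Cyrillic code page

# Same byte on disk, two different characters depending on the machine's code page.
print(as_latin1, as_cyrillic)

# In UTF-8 a character has one fixed byte sequence, whatever the machine.
print("\u00e9".encode("utf-8"))
```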

First, to understand how UTF-8 works you need to be aware of some terms like collations, encodings, and character sets. To learn what they are, follow the link below.


What are collations and charsets? Click on the link.


Presuming you are aware of the terms mentioned above, I think we can proceed. UTF-8 uses up to 4 bytes per character.
As you can see in the table cited below, the first range is the same as ASCII and everything is unchanged there. So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
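A quick way to check these ranges is to ask Python how many bytes each character needs. This is only a sketch; the four sample characters are my own picks, one from each range.

```python
samples = {
    "A": "US-ASCII letter",
    "\u00e9": "Latin letter with a diacritic (e with acute)",
    "\u20ac": "rest of the Basic Multilingual Plane (euro sign)",
    "\U0001D11E": "outside the Basic Multilingual Plane (musical G clef)",
}

for ch, note in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {note}: {len(encoded)} byte(s)")

# Byte counts in the same order as the samples above.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
```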


How is Unicode stored on disk?

To understand the way UTF-8 works, we have to examine the binary representation of each byte. If the first bit (the high-order bit) is zero, then it’s a single-byte character, and we can directly map its remaining bits to the Unicode characters 0 – 127. If the first bit is a one, then this byte is part of a multi-byte character (either its first byte or one of the bytes that follow it).

For a multi-byte character (any character whose Unicode number is 128 or above), we need to know how many bytes will make up this character. This is stored in the leading bits of the first byte in the character. We can identify how many total bytes will make up this character by counting the number of leading 1’s before we encounter the first 0. Thus, for the first byte in a multi-byte character, 110xxxxx represents a two-byte character, 1110xxxx represents a three-byte character, and so on. Every byte after the first one starts with 10 (10xxxxxx), so it can never be mistaken for the start of a character.
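The counting rule above can be sketched in a few lines of Python. The function name `utf8_sequence_length` is hypothetical, and this is a simplified illustration rather than a full UTF-8 validator.

```python
def utf8_sequence_length(first_byte):
    """Count the leading 1 bits of a first byte to get the character's length in bytes."""
    if first_byte & 0b10000000 == 0:
        return 1  # 0xxxxxxx: single-byte (ASCII) character

    ones = 0
    mask = 0b10000000
    while first_byte & mask:  # walk from the high bit down until the first 0
        ones += 1
        mask >>= 1

    if ones == 1:
        raise ValueError("10xxxxxx is a follow-up byte, not the first byte of a character")
    if ones > 4:
        raise ValueError("UTF-8 characters are at most 4 bytes long")
    return ones


print(utf8_sequence_length(0b11000101))  # first byte of a two-byte character
```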


Let's go through an example



We want to store the character shown in the image above. Its decimal value is 362, and UTF-8 requires 2 bytes to store it.
Its binary value is 101101010. In hexadecimal, 362 is 16A, and writing each hex digit as 4 bits gives the binary 000101101010.
But UTF-8 has its own encoding rules, so the value is converted to follow the leading-ones rules above. After that we get the hex value C5AA, and that is what is stored on disk.


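How it will be stored
The first bit is 1, because this is a multi-byte character. One more byte follows the first byte, so the second bit is also 1.
Now the binary is 11XXXXXX:XXXXXXXX

Next comes the zero that ends the count, so the pattern becomes 110XXXXX:XXXXXXXX (1110xxxx would represent a three-byte character, and so on). The second byte is a follow-up byte, which always starts with 10, so we have 110XXXXX:10XXXXXX.

With this we have the fixed part of the sequence, and the binary value of the character is filled into the remaining X positions from the right. 101101010 padded to 11 bits is 00101101010, which splits into 00101 for the first byte and 101010 for the second.

So the UTF-8 binary form for the character in the image is 11000101:10101010, which is C5 AA in hex.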
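The same steps can be done by hand with bit operations. The Python sketch below handles only the two-byte range, and the function name `encode_two_byte` is made up for this example. The result is compared against Python's built-in UTF-8 encoder as a check.

```python
def encode_two_byte(code_point):
    """Encode a character number in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    first = 0b11000000 | (code_point >> 6)        # 110xxxxx: the top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: the low 6 bits
    return bytes([first, second])


encoded = encode_two_byte(362)
print(format(encoded[0], "08b"), format(encoded[1], "08b"))

# Check against the built-in encoder.
print(encoded == chr(362).encode("utf-8"))
```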




What is an encoded byte in UTF-8 terms?

An encoded byte is the hexadecimal value of a byte that UTF-8 stores for that particular symbol or character. See the image mentioned.
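If you want to see the encoded bytes of a character yourself, a short Python sketch like this prints them in hex and then decodes them back. The value 362 is the character number from the example above; the variable names are my own.

```python
encoded = chr(362).encode("utf-8")  # the bytes that would be written to disk

# Show each encoded byte as a two-digit hex value.
hex_bytes = [format(b, "02X") for b in encoded]
print(hex_bytes)

# Decoding the bytes gives back the original character number.
decoded = encoded.decode("utf-8")
print(ord(decoded))
```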
