Tuesday, December 28, 2010

Unicode Fundamentals in SQL server 2005

Unicode maps code points to characters, but does not actually specify how the data will be represented in memory, in a database, or on a Web page.

What is Unicode and what is UTF-8??
Unicode is clear from above statement now come to UTF-8.Its a encoding scheme(What is Encoding ?click here). In short you can say that There is a Hex value for each and every character which is defined in Character Set of Unicode.Like that many other Encoding scheme are there UTF-8,UTF-16,UTF-32.

The thing is All the character will have different HEX value for each Encoding.


Information that is stored in Unicode in Microsoft SQLServer 2000 and Microsoft SQLServer 2005 uses the UCS-2 encoding, which stores every character as two bytes, regardless of which character is used. Therefore, the Latin letter "A" is treated the same way as the Cyrillic letter Sha (), the Hebrew letter Lamed (), the Tamil letter Rra (), or the Japanese Hiragana letter E (). Each one has a unique code point (for these letters, the code points are U+0041, U+0248, U+05DC, U+0BB1, and U+3048, respectively, where each four-digit hexadecimal number represents the two bytes that UCS-2 uses).

Now as you know UCS-2 is a old Encoding nowa days its outdated.It strictly use a 2 byte encoding no matter you are storing ASCII value or non ASCII value.So It means 2 power of 16 that means it can store up to 65536 Different characters.

Now if we will have SQL SERVER 2005 Centric views then Storage part in a nutshell work like this.
-UCS-2 is obsolete: It can no longer represent every Unicode character.
-For Nonunicode data types like char,varchar,text sql server will take 1 byte to store data for each character 
-For Unicode data types like nchar,nvarchar sql server will take take 2 bytes to store data for each character as It is using UTF-16 internally to store data(MySQL is using Internally UTF-8 to store data so it again Depend on collation and character set which you are using for DB,Table,Column)
-UTF-8 in a few XML-centric places, without you doing anything weird.

-So you can say that its not using UTF-8 Or MSSQL does not support utf-8 Encoding.
IF you are writing .NET framework code,Java code or C# code all the things are stored into UTF-16 format internally and Microsoft Products works good will UTF-16 Encoding.ASP.NET used in C# is again a UTF-8 encoding  This all things work in background you don't need to bother.

Now lets compare sizes of these three encodings:






In non-Unicode data types (TEXT, VAR CHAR), only a single byte is used per character, and only characters of the collation's code page can be stored 

Now for if particular collation you want to move on you can use the collation and keep it data type as VAR CHAR and it will store.Internal conversions will be made from code page and will be stored and it can work up to limited characters which are defined in code page.


Now first thing what is Code page?

In sql server A code page also known as a character set, is a set of 256 uppercase and lowercase letters, numbers, and symbols. The printable characters of the first 128 values are the same for all character set choices. The last 128 characters, sometimes referred to as extended characters, differ from set to set. The code page for SQL Server determines both the types of characters that SQL Server recognizes in your database, and the results that SQL Server returns for queries that involve character comparisons.Total 1252 Pages are there to work with

If you want to see the page number for particular collation then you can use the Inbuilt following function:
(If collation is not having a code page it will return 0)

SELECT COLLATIONPROPERTY('SQL_Latvian_CP1257_CS_AS', 'CodePage')
Will give result 1257.

This Code page will have total ASCII representations as well as Extended Chars for that particular Collation


Now Working with multiple Language a same time

Its impossible that we can find all the characters of multiple languages in single code page.The same time you need to work with nvarchar like Unicode data types.But there Each Character you are going to store whether a ASCII value or Non ASCII value it will always accommodate 2 Bytes.So we can say it is space consuming.At a same time we no need to do any internal conversions like in code pages sql server was doing so again a good thing that it will be same for what ever client want to see where ever will be one single character for each encoded number.The Unicode specification defines a single encoding scheme for most characters widely used in businesses around the world. All computers consistently translate the bit patterns in Unicode data into characters using the single Unicode specification. This ensures that the same bit pattern is always converted to the same character on all computers. Data can be freely transferred from one database or computer to another without concern that the receiving system will translate the bit patterns into characters incorrectly.

Unicode data is stored using the nchar, nvarchar, and ntext data types in SQL Server. Use these data types for columns that store characters from more than one character set. Use nvarchar when a column's entries vary in the number of Unicode characters (up to 4,000) they contain. Use nchar when every entry for a column has the same fixed length (up to 4,000 Unicode characters). Use ntext when any entry for a column is longer than 4,000 Unicode characters.


Keep in mind this things While select a Datatype as Unicode data type

Note  The n prefix for these data types comes from the SQL-92 standard for National (Unicode) data types.

Use of nchar, nvarchar, and ntext is the same as char, varchar, and text, respectively, except that:

Unicode supports a wider range of characters.

More space is needed to store Unicode characters.

The maximum size of nchar and nvarchar columns is 4,000 characters, not 8,000 characters like char and varchar.

Unicode constants are specified with a leading N: N'A Unicode string'.

All Unicode data uses the same Unicode code page. Collations do not control the code page used for Unicode columns, only attributes such as comparison rules and case sensitivity.


Thanks and Regards 
Kamesh