Administering MySQL: International Usage and Log Files
4.7.4 The Character Definition Arrays
If you need to administer MySQL, this article gets you off to a good start. In this section, we discuss localization and international usage, as well as the MySQL log files. The sixth of a multi-part series, it is excerpted from chapter four of the book MySQL Administrator's Guide, written by Paul Dubois (Sams; ISBN: 0672326345).
to_lower[] and to_upper[] are simple arrays that hold the lowercase and uppercase characters corresponding to each member of the character set. For example:
to_lower['A'] should contain 'a'
to_upper['a'] should contain 'A'
sort_order[] is a map indicating how characters should be ordered for comparison and sorting purposes. Quite often (but not for all character sets) this is the same as to_upper[], which means that sorting will be case-insensitive. MySQL will sort characters based on the values of sort_order[] elements. For more complicated sorting rules, see the discussion of string collating in Section 4.7.5, "String Collating Support."
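As a minimal sketch of how these three arrays relate, the following C fragment fills them for a single-byte, ASCII-compatible character set and compares two strings by their sort_order[] weights. The helper names init_tables() and compare_by_sort_order() are hypothetical and are not part of the MySQL source. The fragment assumes that sort_order[] is simply a copy of to_upper[].

```c
/* Hypothetical 256-entry tables for a single-byte, ASCII-compatible set. */
static unsigned char to_lower[256];
static unsigned char to_upper[256];
static unsigned char sort_order[256];

/* Fill the tables: identity mapping everywhere except the ASCII letters. */
static void init_tables(void)
{
    int c;
    for (c = 0; c < 256; c++) {
        to_lower[c] = (unsigned char) c;
        to_upper[c] = (unsigned char) c;
    }
    for (c = 'A'; c <= 'Z'; c++) {
        to_lower[c] = (unsigned char) (c - 'A' + 'a');
        to_upper[c - 'A' + 'a'] = (unsigned char) c;
    }
    /* Reusing to_upper[] as the sort map makes comparison case-insensitive. */
    for (c = 0; c < 256; c++)
        sort_order[c] = to_upper[c];
}

/* Compare two NUL-terminated strings by sort_order[] weight.
   Hypothetical helper, not a MySQL function. */
static int compare_by_sort_order(const char *a, const char *b)
{
    const unsigned char *s = (const unsigned char *) a;
    const unsigned char *t = (const unsigned char *) b;
    while (*s && *t && sort_order[*s] == sort_order[*t]) {
        s++;
        t++;
    }
    return (int) sort_order[*s] - (int) sort_order[*t];
}
```

After init_tables() has run, compare_by_sort_order("Apple", "aPPLE") returns 0, because both strings map to the same sequence of weights.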
ctype[] is an array of bit values, with one element per character. (Note that to_lower[], to_upper[], and sort_order[] are indexed by character value, but ctype[] is indexed by character value + 1. This is a legacy convention that makes it possible to handle EOF.)
You can find the following bitmask definitions in m_ctype.h:
The ctype[] entry for each character should be the union (bitwise OR) of the applicable bitmask values that describe the character. The values are written in octal. For example, 'A' is an uppercase character (_U) as well as a hexadecimal digit (_X), so ctype['A'+1] should contain the value:
_U + _X = 01 + 0200 = 0201
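The following C fragment is a rough sketch of the "character value + 1" indexing and the bitmask union described above. Only the values 01 (uppercase) and 0200 (hexadecimal digit) come from the text. The CT_ names, the lowercase and digit flags (02 and 04), and the helpers init_ctype() and ctype_of() are assumptions made for illustration. The sketch also assumes that EOF is -1, as it is on common platforms.

```c
/* Flag values: CT_U and CT_X match the _U and _X values quoted in the text.
   CT_L and CT_N are assumed values, and all CT_ names are illustrative. */
#define CT_U 01    /* uppercase letter */
#define CT_L 02    /* lowercase letter (assumed value) */
#define CT_N 04    /* decimal digit (assumed value) */
#define CT_X 0200  /* hexadecimal digit */

/* 257 entries: slot 0 is reserved for EOF (-1), slot c + 1 describes c. */
static unsigned char ctype_tab[257];

/* Mark ASCII letters and digits; every other entry stays 0. */
static void init_ctype(void)
{
    int c;
    for (c = 'A'; c <= 'Z'; c++) ctype_tab[c + 1] |= CT_U;
    for (c = 'a'; c <= 'z'; c++) ctype_tab[c + 1] |= CT_L;
    for (c = '0'; c <= '9'; c++) ctype_tab[c + 1] |= CT_N | CT_X;
    for (c = 'A'; c <= 'F'; c++) ctype_tab[c + 1] |= CT_X;
    for (c = 'a'; c <= 'f'; c++) ctype_tab[c + 1] |= CT_X;
}

/* Look up the flags for c, where c is -1 (EOF) or in the range 0..255. */
static int ctype_of(int c)
{
    return ctype_tab[c + 1];
}
```

With these tables, ctype_of('A') yields 0201, the same value as the worked example, and ctype_of(-1) reads slot 0 without going outside the array.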
4.7.5 String Collating Support
If the sorting rules for your language are too complex to be handled with the simple sort_order[] table, you need to use the string collating functions.
Right now the best documentation for this is the character sets that are already implemented. Look at the big5, czech, gbk, sjis, and tis620 character sets for examples.
You must specify the strxfrm_multiply_MYSET=N value in the special comment at the top of the file. N should be set to the maximum ratio by which strings may grow during my_strxfrm_MYSET (it must be a positive integer).
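To illustrate what the "maximum ratio" means, here is a small C sketch of a transform in which one input byte can expand to two sort weights, so the growth factor is 2. The function sketch_strxfrm(), the constant STRXFRM_MULTIPLY, and the expansion rule for byte 0xDF are all hypothetical. They are not taken from any MySQL collation.

```c
#include <stddef.h>

/* Hypothetical growth factor: no input byte yields more than two weights. */
#define STRXFRM_MULTIPLY 2

/* Hypothetical transform: writes sort weights for src into dst and returns
   the number of weights written. Byte 0xDF expands to the two weights
   'S','S'. Lowercase ASCII letters fold to uppercase. Every other byte is
   its own weight. Writing stops early if dst would overflow. */
static size_t sketch_strxfrm(unsigned char *dst, size_t dst_len,
                             const unsigned char *src, size_t src_len)
{
    size_t i, n = 0;
    for (i = 0; i < src_len; i++) {
        unsigned char c = src[i];
        if (c == 0xDF) {
            if (n + 2 > dst_len)
                break;
            dst[n++] = 'S';
            dst[n++] = 'S';
        } else {
            if (n + 1 > dst_len)
                break;
            dst[n++] = (c >= 'a' && c <= 'z')
                           ? (unsigned char) (c - 'a' + 'A')
                           : c;
        }
    }
    return n;
}
```

A caller that allocates src_len * STRXFRM_MULTIPLY bytes for dst can never be truncated by this transform, which is the guarantee the N value is meant to express.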
4.7.6 Multi-Byte Character Support
If you want to add support for a new character set that includes multi-byte characters, you need to use the multi-byte character functions.
Right now the best documentation on this consists of the character sets that are already implemented. Look at the euc_kr, gb2312, gbk, sjis, and ujis character sets for examples. These are implemented in the ctype-charset.c files in the strings directory, where charset is the character set name (for example, ctype-sjis.c).
You must specify the mbmaxlen_MYSET=N value in the special comment at the top of the source file. N should be set to the size in bytes of the largest character in the set.
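As a sketch of why the largest character size matters, the C fragment below counts characters in a byte string for a made-up two-byte encoding and computes a worst-case buffer size. The lead-byte range 0x81..0xFE, the constant MBMAXLEN, and the helpers is_lead_byte(), count_chars(), and max_bytes_for() are assumptions for illustration. They do not describe any real MySQL character set.

```c
#include <stddef.h>

/* Hypothetical: the widest character in this set occupies two bytes. */
#define MBMAXLEN 2

/* Hypothetical lead-byte rule: 0x81..0xFE starts a two-byte character. */
static int is_lead_byte(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Count the characters in the first len bytes of s. A lead byte with no
   trail byte left in the buffer is counted as one truncated character. */
static size_t count_chars(const unsigned char *s, size_t len)
{
    size_t i = 0, chars = 0;
    while (i < len) {
        size_t step = (is_lead_byte(s[i]) && i + 1 < len) ? MBMAXLEN : 1;
        i += step;
        chars++;
    }
    return chars;
}

/* Worst-case number of bytes needed to store n_chars characters. */
static size_t max_bytes_for(size_t n_chars)
{
    return n_chars * MBMAXLEN;
}
```

In this sketch, a column declared to hold 4 characters would need max_bytes_for(4), that is 8 bytes, to be safe for any content.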
4.7.7 Problems with Character Sets
If you try to use a character set that is not compiled into your binary, you might run into the following problems:
Your program has an incorrect path to where the character sets are stored (the default is /usr/local/mysql/share/mysql/charsets). This can be fixed by using the --character-sets-dir option when you run the program in question.
The character set is a multi-byte character set that can't be loaded dynamically. In this case, you must recompile the program with support for the character set.
The character set is a dynamic character set, but you don't have a configuration file for it. In this case, you should install the configuration file for the character set from a new MySQL distribution.
If your Index file doesn't contain the name for the character set, your program will display the following error message:
ERROR 1105: File '/usr/local/share/mysql/charsets/?.conf' not found (Errcode: 2)
In this case, you should either get a new Index file or manually add the name of any missing character sets to the current file.
For MyISAM tables, you can check the character set name and number for a table with myisamchk -dvv tbl_name.