Copyright © 2001 Radovan Garabík
Revision History

| Revision | Date | Changes |
|---|---|---|
| 0.3.9 | 2004-10-22 | upgrade vim information, add rxvt-unicode (thanks to Eduard Bloch <blade @ debian.org>) |
| 0.3.8 | 2004-01-22 | upgrade mc information |
| 0.3.7 | 2003-10-31 | mention mc and lynx |
| 0.3.6 | 2003-04-09 | consoletools now includes unicode patch |
| 0.3.5 | 2002-11-25 | note about LaTeX added |
| 0.3.4 | 2002-11-22 | changes in X11 section |
| 0.3.3 | 2002-09-08 | notes about KDE and loadkeys (by Jochen Hein <jochen @ jochen.org>), info about fonty-rg package |
| 0.3.2 | 2002-08-18 | info about fvwm (thanks to Toni Müller <support @ oeko.net>) |
| 0.3.1 | 2002-07-22 | readline now handles multibyte characters upstream |
| 0.3.0 | 2002-07-09 | minor corrections |
| 0.2.9 | 2002-01-15 | uxterm is part of debian |
| 0.2.8 | 2002-01-01 | mutt-utf8 is part of debian |
| 0.2.7 | 2001-12-19 | no need to patch xterm anymore |
| 0.2.6 | 2001-10-13 | vim debian package is with multibyte support now |
| 0.2.5 | 2001-10-11 | link to gpm kernel patch |
| 0.2.4 | 2001-10-04 | mention vim-gtk debian package |
| 0.2.3 | 2001-10-03 | add modified console-tools for dead key kernel patch, add more console fonts |
| 0.2.2 | 2001-09-28 | vim 6.0 is now stable, mention the dead keys kernel patch |
| 0.2.1 | 2001-09-21 | Fixed minor typos. |
| 0.2 | 2001-09-14 | Paragraph on console Compose in UTF-8 mode added |
| 0.1 | 2001-09-13 | Few minor improvements |
| 0.09 | 2001-09-12 | Converted to docbook. |
Even though it is supposed to be well-documented how to switch to UTF-8 encoding, there are many pitfalls and gotchas. Often one has to locate the relevant information somewhere on the net. This HOWTO intends to fill that gap.
Double-width (CJK) characters, XIM, and right-to-left scripts are not dealt with in this document.
Information here is collected from various sources.
Snippets of scripts, fonts, programs and similar are from these people:
Ričardas Čepas http://x-lt.richard.eu.org/
Markus Kuhn http://www.cl.cam.ac.uk/~mgk25/
Pablo Saratxaga http://www.srtxg.easynet.be/
Ilya Ketris
This document is Public Domain.
This text is valid for the Debian unstable (woody) distribution as of April 2003.
Most notably, you need glibc 2.2, the Debian locales package, and XFree86 4.1, along with reasonably new versions of all the packages mentioned.
Pick the locale you want to use. I decided to use en_GB; you may use something else, as the important part is the UTF-8 encoding. The easy way is to run
dpkg-reconfigure locales
# localedef -v -c -i en_GB -f UTF-8 /usr/lib/locale/en_GB.UTF-8
Alternatives:
You may want to put lines like these into /etc/locale.gen:
en_GB.UTF-8 UTF-8
en_GB ISO-8859-1
This creates an en_GB locale with the default ISO-8859-1 encoding (for use by non-converted users, if necessary) and an en_GB locale with UTF-8 encoding in the directory /usr/lib/locale/en_GB.utf8 (glibc normalizes the name), which you should rename to /usr/lib/locale/en_GB.UTF-8.
Update: renaming no longer seems to be necessary.
Ulrich Drepper wrote:
"There were some versions of glibc where the charset name returned by nl_langinfo() and locale(1) were normalized but that's only a matter of updating your libc which is a good thing anyway. "
Or put these lines into /etc/locale.gen:
en_GB UTF-8
en_GB.ISO-8859-1 ISO-8859-1
Or combine both variants, like this:
en_GB.UTF-8 UTF-8
en_GB.ISO-8859-1 ISO-8859-1
You have to run locale-gen(8) after you modify /etc/locale.gen.
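Editing /etc/locale.gen by hand is easy to get wrong when it is repeated on several machines. Below is a minimal sketch of a helper that appends a locale line only if it is not there yet. The name ensure_locale_line is made up for this example, and the demonstration works on a scratch file rather than on /etc/locale.gen itself.

```shell
#!/bin/sh
# Sketch: add a locale definition to a locale.gen-style file only if it is
# not already present. ensure_locale_line is a hypothetical helper name.
ensure_locale_line() {
    file=$1
    line=$2
    # -x: match the whole line, -F: fixed string, -q: quiet
    if ! grep -qxF "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstration on a scratch file, so nothing in /etc is touched.
scratch=$(mktemp)
ensure_locale_line "$scratch" "en_GB.UTF-8 UTF-8"
ensure_locale_line "$scratch" "en_GB.ISO-8859-1 ISO-8859-1"
ensure_locale_line "$scratch" "en_GB.UTF-8 UTF-8"   # second call is a no-op
cat "$scratch"
rm -f "$scratch"
```

On a real system you would call the helper as root with /etc/locale.gen as the first argument and then run locale-gen.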
Also by Ulrich Drepper:
"It always is a good idea to spell out the complete locale name. These abbreviations are evil since they don't allow consistency tests. Just assume an environment where not all machines are configured the same way. Telling the user to use en_GB instead of en_GB.UTF-8 would mean s/he can get different locales on different machines. Explicitly mentioning the charset allows the programs (and glibc itself) to perform consistency checks."
$ export LC_CTYPE=en_GB.UTF-8
Alternatively, you may want to do $ export LANG=en_GB.UTF-8 (this does not make much of a difference for an English locale).
Check the result; locale charmap should print UTF-8:
$ locale charmap
$ locale
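The check with locale charmap can be put into a login script. The following is only a sketch: is_utf8_charmap is a name invented here, and the list of accepted spellings is an assumption (glibc normally prints UTF-8).

```shell
#!/bin/sh
# is_utf8_charmap: succeed if the given charmap name denotes UTF-8.
# Hypothetical helper; the accepted spellings are an assumption.
is_utf8_charmap() {
    case $1 in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# Ask the C library which charmap the current locale uses.
current=$(locale charmap 2>/dev/null)
if is_utf8_charmap "$current"; then
    echo "locale charmap reports UTF-8"
else
    echo "locale charmap reports '${current:-unknown}', not UTF-8"
fi
```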
vim 6.0 supports unicode.
Upgrade if you have an older version.
vim should be able to figure out that you are using UTF-8 from your locale. If (for whatever reason) it fails, you can force it by issuing this command: :set enc=utf-8
For gvim, you can select another font using a syntax like this:
set guifont=Fixed\ 13
Note that vim uses its own methods to guess an encoding, so it will open a file in a traditional encoding correctly (e.g. it tries Latin1 for de_DE.UTF-8) and store it correctly. However, you need to make sure that the file contents are not mixed, i.e. that the file does not contain both Latin1 and UTF-8 characters. In that case, vim will always decide to open the file as Latin1 and will break the UTF-8 characters when storing it. Possible solution: ensure that the relevant parts of the file are UTF-8 encoded and use "vim -b" (no recoding).
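Before opening a suspicious file, you can check whether it decodes cleanly as UTF-8. The sketch below relies on iconv(1) exiting with a non-zero status when it meets an invalid sequence; is_valid_utf8 is a name chosen for this example. A file that fails the check is either in a traditional encoding or mixed, so it cannot tell those two cases apart.

```shell
#!/bin/sh
# is_valid_utf8: succeed if FILE decodes cleanly as UTF-8.
# Relies on iconv(1) exiting non-zero when it meets an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

tmpdir=$(mktemp -d)
printf 'caf\303\251\n' > "$tmpdir/utf8.txt"            # e-acute as UTF-8
printf 'caf\351\n' > "$tmpdir/latin1.txt"              # e-acute as Latin1
printf 'caf\303\251 caf\351\n' > "$tmpdir/mixed.txt"   # both in one file

for f in utf8.txt latin1.txt mixed.txt; do
    if is_valid_utf8 "$tmpdir/$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8 (traditional encoding or mixed)"
    fi
done
rm -rf "$tmpdir"
```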
yudit has its own input routines and unfortunately uses XLookupString(3), so you HAVE to use its input routines instead of the regular XFree86 keyboard map (which should be switched to "us" for use with yudit).
You need to install the mutt-utf8 package along with slang1-utf8, and run the mutt-utf8 executable to get a UTF-8 enabled mutt.
mutt should be able to figure out that you are using UTF-8 from your locale. You can verify whether mutt recognized your locale by typing
:set charset=TAB
If you need to force UTF-8 mode, you can do it by putting
set charset="utf-8"
into your ~/.muttrc.
If incoming mails have correct headers, they will be displayed correctly. Unfortunately, many mails (especially those originating in Russia) declare an incorrect encoding, mostly iso-8859-1. You can force the charset by typing
:exec edit-type
gnome-terminal handles unicode, though it is somewhat limited compared to xterm (no combining characters, no double-width characters). You have to turn on UTF-8 mode with
$ echo -e '\e%G'
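echo -e is a bash feature, so here is a sketch of the same escape sequence written with printf, which also works in plain sh. ESC % G selects UTF-8 and ESC % @ switches back to the default mode; the function names utf8_on and utf8_off are made up for this example.

```shell
#!/bin/sh
# ESC % G selects UTF-8; ESC % @ returns to the default (ISO 2022) mode.
# utf8_on and utf8_off are hypothetical helper names.
utf8_on()  { printf '\033%%G'; }
utf8_off() { printf '\033%%@'; }

# Only send the sequence when standard output is a terminal.
if [ -t 1 ]; then
    utf8_on
fi
```

The guard keeps the escape sequence out of files and pipes when the script is run non-interactively.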
You have to tell lynx manually that the Display Character Set is UTF-8, by going to the (O)ptions menu.
Unfortunately, lynx-cur is not compiled against the wide-character version of ncurses, so it does not compute line lengths correctly in UTF-8 mode. I am using an aliened RedHat (uhm) version of lynx; get one from http://rpmfind.net/ (you should download the version for RawHide). Then do:
# alien lynx-2.8.5-13.i386.rpm
# dpkg -i lynx_2.8.5-14_i386.deb
# cd /usr/lib
# ln -s libcrypto.so.0.9.7 libcrypto.so.4
# ln -s libssl.so.0.9.7 libssl.so.4
Midnight Commander in Debian does not support UTF-8. You can use the RedHat version from http://rpmfind.net/ (download the version for RawHide or Fedora). You need to create a link for the slang library. Make sure you have the slang1a-utf8 Debian package installed, and create the link:
# cd /lib
# ln -s libslang.so.1-UTF8.4.9 libslang-utf8.so.1
# dpkg --remove mc
# alien mc-4.6.0-6.i386.rpm
# dpkg -i mc_4.6.0-7_i386.deb
There is an extension for TeX, called Omega, and an extension for LaTeX, called Lambda, both aiming to provide unicode support for TeX. However, I was not very successful in trying to make these two work.
An alternative is http://www.unruh.de/DniQ/latex/unicode/, which is just a standard LaTeX package and enables you to use UTF-8 as an input encoding in your LaTeX files. You have to select appropriate font encodings manually, though. The Debian package is called latex-ucs.
Although in theory just running /usr/bin/unicode_start is sufficient to switch the console into UTF-8 mode, you need a suitable font as well (otherwise, why would you switch to unicode in the first place...).
The LatArCyrHeb font comes bundled with console-data. It has good coverage of all common latin, cyrillic, arabic and hebrew glyphs (why there are arabic and hebrew glyphs, when the console does not support right-to-left direction, is beyond me; they just take up space in the font). You can use
consolechars -f LatArCyrHeb-16
The LatCyrGr-16.psf font has good coverage of all common latin, cyrillic and greek glyphs. It is a 512-character font, which means that you lose bold colours if you use it on a VGA console (the framebuffer does not suffer from this problem). The second font, chavo.psf, contains only 256 characters and as such retains bold colours even on a plain VGA console. It includes the characters needed to write Czech, Slovak, Polish, Hungarian, Russian, German and Esperanto. Of course, it accidentally covers a much wider range. The package also contains two scripts, the first called iso, the second uni. iso X turns your console into ISO-8859-X encoding. uni turns your console into UTF-8 mode and automatically loads the LatCyrGr-16.psf font. uni 2 will use chavo.psf instead.
If you choose to use a different font, be aware that not all the fonts in /usr/share/consolefonts/ have a proper unicode map!
Test the console: download UTF-8-demo.txt.gz and cat it:
$ zcat UTF-8-demo.txt.gz
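If you do not have the demo file at hand, a few sample characters are enough for a quick spot check of the font. This is only a rough sketch and no replacement for UTF-8-demo.txt; the selection of characters is arbitrary, and print_samples is a name made up here. The characters are written as octal escapes so that the script itself stays pure ASCII.

```shell
#!/bin/sh
# print_samples: emit one short line per script, encoded as UTF-8.
# Octal escapes keep the script itself pure ASCII.
print_samples() {
    # e-acute, s-caron, o-double-acute
    printf 'Latin:    \303\251 \305\241 \305\221\n'
    # cyrillic ya, cyrillic zhe
    printf 'Cyrillic: \321\217 \320\266\n'
    # greek lambda, greek omega
    printf 'Greek:    \316\273 \317\211\n'
}

print_samples
```

If a line shows blocks or question marks instead of letters, the loaded font (or its unicode map) does not cover that script.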
Alternative (14 pixels high) font: download uni-511-14.psf.gz (by Ilya Ketris). It does not have arabic, but has greek instead (and less readable cyrillic). I modified Ilya's font (removed Hebrew, redesigned cyrillic, added some missing ISO-8859-2 glyphs and some less used cyrillic glyphs, modified some other glyphs to be more readable); you can download it as rg.psf.gz (source rg.sbf.gz).
Install the dynafont package (it depends on konwert).
Type:
$ filterm - dynafont
If you are not using the framebuffer, you can use this command:
$ filterm - 512bold+dynafont
Test the console:
$ cat UTF-8-demo.txt
You end dynafont by logging out of the pseudoterminal it created.
There are not many unicode keymaps. However, if you use the kbd package instead of the default console-tools, unicode_start from that package reloads the current console keyboard layout via
dumpkeys | loadkeys --unicode
You can test the console this way:
$ loadkeys ua-utf.kmap
$ cat
Switch the keyboard back to the us layout by pressing right alt again.
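Unfortunately, compose (= dead keys) does not work in unicode mode. The reason is that the kernel stores each compose entry in
struct kbdiacr {
    unsigned char diacr, base, result;
};
and the result field, being an unsigned char, cannot hold a character code above 255.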
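To see why an unsigned char result is a problem, it helps to look at a few code points. The sketch below just compares numbers; fits_in_uchar is a name made up for this illustration and has nothing to do with the kernel interface itself.

```shell
#!/bin/sh
# fits_in_uchar: succeed if the code point fits in one unsigned char (0..255).
# Hypothetical helper, for illustration only.
fits_in_uchar() {
    [ "$(( $1 ))" -ge 0 ] && [ "$(( $1 ))" -le 255 ]
}

# U+00E9 (e-acute), U+0161 (s-caron), U+0436 (cyrillic zhe)
for cp in 0x00E9 0x0161 0x0436; do
    if fits_in_uchar "$cp"; then
        echo "U+${cp#0x} fits in an unsigned char"
    else
        echo "U+${cp#0x} does not fit in an unsigned char"
    fi
done
```

Only the ISO-8859-1 range passes, so a dead key that should produce s-caron or any cyrillic letter has nowhere to store its result.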
I have prepared a small kernel patch that fixes it. You can download it here: download/unicode_dead_keys_linux-2.4.9.patch.gz.
You also need to compile console-tools against this patched kernel. Since version 1:0.2.3dbs-30, console-tools in Debian includes a patch facilitating the transition into unicode mode, so all you need to do is recompile console-tools against the patched kernel. First install the needed build dependencies (groff and dbs), and then do:
$ cd /tmp
$ apt-get source console-tools
$ cd console-tools*
$ dpkg-buildpackage -rfakeroot -us -uc
I put the recompiled packages and the patch here: download/console-tools/.
Unfortunately, linux console cannot display line drawing characters in utf-8 mode.
This is an ugly hack trying to improve the situation a bit:
Download the linux+utf8 terminfo definition (based on the original definition by Ričardas Čepas) and compile it:
$ tic linux+utf8
Now set your TERM environment variable to linux+utf8:
$ export TERM=linux+utf8
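Setting TERM to an entry that is not installed breaks most full-screen programs, so it may be worth checking first. A minimal sketch, assuming infocmp(1) from ncurses is available; term_entry_exists is a name invented for this example.

```shell
#!/bin/sh
# term_entry_exists: succeed if the terminfo database has an entry NAME.
# Assumes infocmp(1) is installed; a missing infocmp counts as "not found".
term_entry_exists() {
    infocmp "$1" >/dev/null 2>&1
}

if term_entry_exists linux+utf8; then
    echo "linux+utf8 is installed; TERM can be set to it"
else
    echo "linux+utf8 not found; run tic first"
fi
```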
As this terminfo entry is just a crude hack, do expect some graphical glitches. More glitches result from the multibyte nature of unicode characters and are being addressed upstream (ncurses/slang)
libreadline supports multibyte locales (including UTF-8) since version 4.3. Please upgrade.
You might need to set up /etc/inputrc:
###
# Be 8 bit clean.
set input-meta on
set output-meta on
# To allow the use of 8bit-characters like the german umlauts, leave the
# line below uncommented. However this makes the meta key not work as a meta key,
# which is annoying to those which don't need to type in 8-bit characters.
set convert-meta off
# You may need to add the following lines if moving the cursor does not work properly:
"\e[D": backward-char
"\e[C": forward-char
###
see http://mail.nl.linux.org/linux-utf8/1999-08/msg00001.html (no, situation has not changed since then)
get the patch from ftp://ftp.ilog.fr/pub/Users/haible/utf8
I have not tried the patch, I use the kernel input editor with non-ascii chars very rarely.
copy&paste copies only ASCII characters in UTF-8 mode.
There were several stale kernel patches trying to improve the situation, until Jochen Hein made a version for 2.4.19. Get a patch here: download/unicode_copypaste_2.4.19.patch.gz. It seems to mess up making a selection in non-utf8 mode in strange ways, though.
The XFree86 4.2.0 configuration already contains entries for many commonly used locales; those with UTF-8 encoding use the files from the en_US.UTF-8 locale. I am going to keep this convention.
If you want to input non-ascii characters, you may need a compose map. Older versions of XFree86 did not have a proper Compose map, but 4.2.0 should be good enough. For reference, here is a good latin compose map by Pablo Saratxaga: Compose.gz.
See the file /usr/X11R6/lib/X11/locale/compose.dir and make sure it contains these lines:
en_US.UTF-8/Compose: en_GB.UTF-8
en_US.UTF-8/Compose en_GB.UTF-8
See the file /usr/X11R6/lib/X11/locale/locale.dir and make sure it contains these lines:
en_US.UTF-8/XLC_LOCALE: en_GB.UTF-8
en_US.UTF-8/XLC_LOCALE en_GB.UTF-8
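Checking compose.dir and locale.dir by eye is tedious, so here is a sketch that looks for both spellings of an alias (with and without the trailing colon). The helper names has_dir_entry and check_alias are invented for this example, and I assume the two fields are separated by spaces or tabs.

```shell
#!/bin/sh
# has_dir_entry FILE KEY LOCALE: succeed if FILE has a line whose first two
# whitespace-separated fields are KEY and LOCALE.
has_dir_entry() {
    awk -v key="$2" -v loc="$3" \
        '$1 == key && $2 == loc { found = 1 } END { exit !found }' "$1"
}

# check_alias FILE ENTRY LOCALE: both spellings (with and without the
# trailing colon) must be present.
check_alias() {
    has_dir_entry "$1" "$2:" "$3" && has_dir_entry "$1" "$2" "$3"
}

dir=/usr/X11R6/lib/X11/locale
for pair in compose.dir:Compose locale.dir:XLC_LOCALE; do
    file=$dir/${pair%%:*}
    entry=en_US.UTF-8/${pair##*:}
    if [ -r "$file" ] && check_alias "$file" "$entry" en_GB.UTF-8; then
        echo "$file: ok"
    else
        echo "$file: en_GB.UTF-8 alias missing or file unreadable"
    fi
done
```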
rxvt-unicode is a modern, slim and fast replacement for xterm. It handles UTF-8 and most other locales, provided the locale was set in the environment where urxvt was started (run "LANG=xx_YY.UTF-8 urxvt" to be sure). It also supports charset guessing for a few modern charsets, so some users can continue using old programs.
If you did not install the recommended fonts for your locale, you may need to configure the fallback font list. See the manpage for details.
Xterm has handled compose (dead keys) properly in UTF-8 locales since version 157. You can find out which version you are using by typing:
$ xterm -v
To run xterm in UTF-8 mode, start it with either of:
$ xterm -u8
$ uxterm
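The version check against 157 can be scripted. The sketch below assumes that xterm -v prints the patch level in parentheses, as in "XFree86 4.2.0(165)" or "XTerm(353)"; the function names xterm_patch and compose_ok are made up for this example.

```shell
#!/bin/sh
# xterm_patch: extract the number in parentheses from a version string
# such as "XFree86 4.2.0(165)" or "XTerm(353)" (assumed output format).
xterm_patch() {
    printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\)).*/\1/p'
}

# compose_ok: succeed if the patch level is at least 157.
compose_ok() {
    patch=$(xterm_patch "$1")
    [ -n "$patch" ] && [ "$patch" -ge 157 ]
}

if command -v xterm >/dev/null 2>&1; then
    version=$(xterm -v 2>&1)
    if compose_ok "$version"; then
        echo "$version: compose should work in UTF-8 locales"
    else
        echo "$version: older than 157, or version not recognised"
    fi
fi
```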
Start uxterm and type
$ setxkbmap us_intl
$ cat
Start gvim and write some accented letters there.
In xterm, type
$ setxkbmap ru
If all goes OK, you are done.
KDE 3 seems to work well with unicode - you just need to use correct locale, and have necessary fonts installed (otherwise the missing characters will be displayed either without diacritics, or taken from other fonts - scaled bitmapped characters look particularly ugly).
yudit has its own input method and you have to use US keyboard layout to use it effectively (as usual, entering characters outside ISO-8859-1 range does not work)
There are many graphical glitches with all ncurses and slang applications. Just remember to press CTRL+L :-)
fvwm fails to display text in a UTF-8 locale. Edit the /usr/X11R6/lib/X11/locale/en_US.UTF-8/XLC_LOCALE file and change
on_demand_loading True
to
on_demand_loading False
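The same edit can be done with sed. This is a sketch under the assumption that the keyword and its value are separated by spaces or tabs; disable_on_demand_loading is a name invented here. It prints the changed file instead of editing it in place, so you can review the result before installing it as root.

```shell
#!/bin/sh
# disable_on_demand_loading FILE: print FILE with every
# "on_demand_loading True" setting switched to False.
disable_on_demand_loading() {
    sed 's/^\([[:space:]]*on_demand_loading[[:space:]][[:space:]]*\)True/\1False/' "$1"
}

src=/usr/X11R6/lib/X11/locale/en_US.UTF-8/XLC_LOCALE
if [ -r "$src" ]; then
    new=$(mktemp)
    disable_on_demand_loading "$src" > "$new"
    diff "$src" "$new" || true   # diff exits 1 when the files differ
    echo "review $new, then install it over $src as root"
fi
```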
man-db since version 2.4.0 uses groff's utf8 device to display pages in UTF-8, in UTF-8 locale. However, groff assumes source pages to be in ISO-8859-1 encoding.
XFree86 i18n mailing list archives: http://www.xfree86.org/pipermail/i18n/
linux-utf8 mailing list archives: http://mail.nl.linux.org/linux-utf8/
Roman Czyborra's unicode info (must read!): http://www.czyborra.com/
UTF-8 and Unicode FAQ for Unix/Linux by Markus Günther Kuhn (must read!): http://www.cl.cam.ac.uk/~mgk25/unicode.html
UTF-8 enabled mutt by Edmund Grimley Evans: http://rano.org/mutt.html
Unicode home page: http://www.unicode.org