Step by step introduction to switching your debian installation to utf-8 encoding.

Radovan Garabík

Revision History
Revision 0.3.9 2004-10-22
upgrade vim information, add rxvt-unicode (thanks to Eduard Bloch <blade @ debian.org>)
Revision 0.3.8 2004-01-22
upgrade mc information
Revision 0.3.7 2003-10-31
mention mc and lynx
Revision 0.3.6 2003-04-09
consoletools now includes unicode patch
Revision 0.3.5 2002-11-25
note about LaTeX added
Revision 0.3.4 2002-11-22
changes in X11 section
Revision 0.3.3 2002-09-08
notes about KDE and loadkeys (by Jochen Hein <jochen @ jochen.org>), info about fonty-rg package
Revision 0.3.2 2002-08-18
info about fvwm (thanks to Toni Müller <support @ oeko.net>)
Revision 0.3.1 2002-07-22
readline now handles multibyte characters upstream
Revision 0.3.0 2002-07-09
minor corrections
Revision 0.2.9 2002-01-15
uxterm is part of debian
Revision 0.2.8 2002-01-01
mutt-utf8 is part of debian
Revision 0.2.7 2001-12-19
no need to patch xterm anymore
Revision 0.2.6 2001-10-13
vim debian package is with multibyte support now
Revision 0.2.5 2001-10-11
link to gpm kernel patch
Revision 0.2.4 2001-10-04
mention vim-gtk debian package
Revision 0.2.3 2001-10-03
add modified console-tools for dead key kernel patch, add more console fonts
Revision 0.2.2 2001-09-28
vim 6.0 is now stable, mention the dead keys kernel patch
Revision 0.2.1 2001-09-21
Fixed minor typos.
Revision 0.2 2001-09-14
Paragraph on console Compose in UTF-8 mode added
Revision 0.1 2001-09-13
Few minor improvements
Revision 0.09 2001-09-12
Converted to docbook.
Converted to docbook.

Switching to UTF-8 encoding is supposed to be well documented, yet there are many pitfalls and gotchas, and the relevant information is often scattered around the net. This HOWTO intends to fill that gap.

Double-width (CJK) characters, XIM and right-to-left scripts are not dealt with in this document.


Credits

Information here is collected from various sources.

Snippets of scripts, fonts, programs and similar are from these people:

This document is Public Domain.


Prerequisites

This text is valid for the debian unstable (woody) distribution as of April 2003.

Most notably, you need glibc 2.2, the locales debian package and xfree86 4.1, as well as reasonably new versions of all the packages mentioned.


Preparing correct locale

Choosing a locale

Pick the locale you want to use. I decided to use en_GB; you may use something else, the important part is the UTF-8 encoding. The easy way is to run
dpkg-reconfigure locales
and then select a UTF-8 locale. In that case you may skip the following section about generating one by hand.

Generate the locale

# localedef -v -c -i en_GB -f UTF-8 /usr/lib/locale/en_GB.UTF-8
Check that the /usr/lib/locale/en_GB.UTF-8 directory exists.
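
The check above can be scripted. What follows is only a minimal sketch: the function names normalized_locale_name and find_locale_dir are made up for this example, and it assumes that locales are stored as directories under /usr/lib/locale, either under the name you gave or under the normalized spelling (charset lowercased, hyphens removed) mentioned further below.

```shell
#!/bin/sh
# normalized_locale_name NAME
# Prints NAME with the charset part lowercased and stripped of hyphens,
# e.g. en_GB.UTF-8 -> en_GB.utf8 (the spelling glibc may use on disk).
normalized_locale_name() {
    case "$1" in
        *.*) printf '%s.%s\n' "${1%%.*}" \
                 "$(printf '%s' "${1#*.}" | tr 'A-Z' 'a-z' | sed 's/-//g')" ;;
        *)   printf '%s\n' "$1" ;;
    esac
}

# find_locale_dir BASE NAME
# Prints the first existing directory among BASE/NAME and
# BASE/<normalized NAME>; fails if neither exists.
find_locale_dir() {
    for d in "$2" "$(normalized_locale_name "$2")"; do
        if [ -d "$1/$d" ]; then
            printf '%s\n' "$1/$d"
            return 0
        fi
    done
    return 1
}

find_locale_dir /usr/lib/locale en_GB.UTF-8 \
    || echo "en_GB.UTF-8 has not been generated yet"
```

Replace en_GB.UTF-8 with your own locale name; the script only reads the directory and changes nothing.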

Alternatives:

  • You may want to put lines like these into /etc/locale.gen:
    en_GB.UTF-8 UTF-8
    en_GB ISO-8859-1

    This creates an en_GB locale with the default ISO-8859-1 encoding (for use by non-converted users, if necessary) and an en_GB locale with UTF-8 encoding in the directory /usr/lib/locale/en_GB.utf8 (glibc normalizes the name), which you should rename to /usr/lib/locale/en_GB.UTF-8.

    Update: renaming no longer seems to be necessary.

    Ulrich Drepper wrote:

    "There were some versions of glibc where the charset name returned by nl_langinfo() and locale(1) were normalized but that's only a matter of updating your libc which is a good thing anyway. "

  • Or put these lines into /etc/locale.gen:
    en_GB UTF-8
    en_GB.ISO-8859-1 ISO-8859-1
    This makes UTF-8 the default encoding for en_GB and puts the ISO-8859-1 locale into /usr/lib/locale/en_GB.iso88591.

  • Combining both forms like this:
    en_GB.UTF-8 UTF-8
    en_GB.ISO-8859-1 ISO-8859-1
    is not optimal, since specifying the locale as en_GB is then not sufficient and you always have to specify the encoding (either en_GB.UTF-8 or en_GB.ISO-8859-1).

  • You have to run locale-gen(8) after you modify /etc/locale.gen.

  • Also by Ulrich Drepper:

    "It always is a good idea to spell out the complete locale name. These abbreviations are evil since they don't allow consistency tests. Just assume an environment where not all machines are configured the same way. Telling the user to user en_GB instead of en_GB.UTF-8 would mean s/he can get different locales on different machines. Explictly mentioning the charset allows the programs (and glibc itself) to perform consistency checks. "

Set up the locale

$ export LC_CTYPE=en_GB.UTF-8

Alternatively, you may want to do
$ export LANG=en_GB.UTF-8
(this does not make much of a difference for an English locale).

Test the locale

$ locale charmap
The output should be UTF-8.
$ locale
The output should contain LC_CTYPE=en_GB.UTF-8; the other entries can be either en_GB.UTF-8 or POSIX, and LC_ALL should be empty.
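
If you want to run this test from a login script, something along the following lines might do. It is a sketch only: the helper name is_utf8_charmap is invented here, and the comparison is deliberately loose because some glibc versions reported the normalized spelling (utf8) instead of UTF-8.

```shell
#!/bin/sh
# is_utf8_charmap NAME
# Succeeds if NAME (as printed by "locale charmap") denotes UTF-8.
# The comparison ignores case and hyphens, so "UTF-8" and "utf8"
# are both accepted.
is_utf8_charmap() {
    case "$(printf '%s' "$1" | tr 'A-Z' 'a-z' | sed 's/-//g')" in
        utf8) return 0 ;;
        *)    return 1 ;;
    esac
}

# Ask the current environment; an empty answer counts as "not UTF-8".
charmap=$(locale charmap 2>/dev/null || true)
if is_utf8_charmap "$charmap"; then
    echo "OK: charmap is $charmap"
else
    echo "charmap is '${charmap}', expected UTF-8"
fi
```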


Notes about several applications:

vim

vim supports unicode since version 6.0.

Upgrade if you have an older version.

vim should be able to figure out from your locale that you are using utf-8. If (for whatever reason) it fails, you can force it by issuing this command: :set enc=utf-8

For gvim, you can select another font using a syntax like this:
set guifont=Fixed\ 13
(put this into /etc/vim/gvimrc if you want)

Note that vim uses its own methods to guess the encoding of a file, so it will open a file in a traditional encoding correctly (e.g. it tries Latin1 for de_DE.UTF-8) and store it correctly. However, you need to make sure that the file contents are not mixed, i.e. do not contain both Latin1 and UTF-8 characters. In that case, vim will always decide to open the file as Latin1 and will break the UTF-8 characters when storing it. Possible solution: ensure that the relevant parts of the file are UTF-8 encoded and use "vim -b" (no recoding).
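
One way to spot a file with mixed contents before opening it is to let iconv(1) validate it. This is a sketch, not something vim does itself: the function name is_valid_utf8 is made up, and it assumes an iconv that exits with a non-zero status on an invalid input sequence (the glibc one does). Note that a pure ASCII file passes the test as well.

```shell
#!/bin/sh
# is_valid_utf8 FILE
# Succeeds if every byte sequence in FILE is valid UTF-8.
# iconv stops with a non-zero status at the first invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Demonstration with two temporary files (octal escapes keep the
# script itself pure ASCII).
good=$(mktemp)
bad=$(mktemp)
printf 'caf\303\251\n' > "$good"          # e-acute as UTF-8
printf 'caf\303\251 caf\351\n' > "$bad"   # UTF-8 and Latin1 mixed

for f in "$good" "$bad"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: contains non-UTF-8 bytes (mixed encoding?)"
    fi
done
rm -f "$good" "$bad"
```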


yudit

yudit has its own input routines and unfortunately uses XLookupString(3), so you HAVE to use its input routines instead of the regular XFree86 keyboard map (which should be switched to "us" for use with yudit).


mutt

You need to install the mutt-utf8 package along with slang1-utf8, and run the mutt-utf8 executable to get a utf-8 enabled mutt.

mutt should be able to figure out from your locale that you are using utf-8. You can verify whether mutt recognized your locale by typing
:set charset=TAB
where TAB is the TAB key.

If you need to force UTF-8 mode, you can do it by putting
set charset="utf-8"
into /etc/Muttrc.

If incoming mails have correct headers, they will be displayed correctly. Unfortunately, many mails (especially those originating in Russia) declare an incorrect encoding, mostly iso-8859-1. You can override the charset of such a mail by typing
:exec edit-type


gnome-terminal

gnome-terminal handles unicode, though its support is somewhat limited compared to xterm (no combining characters, no doublewidth). You have to turn on utf-8 mode with
$ echo -e '\e%G'
first.
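
Since echo -e is not understood by every shell, the same escape sequence can be sent with printf. A small sketch follows; the function names utf8_mode_on and utf8_mode_off are invented for this example. ESC % G selects UTF-8 and ESC % @ switches back to the default mode.

```shell
#!/bin/sh
# ESC % G selects UTF-8, ESC % @ returns to the default (ISO 2022) mode.
# The percent sign is doubled because printf treats % specially.
utf8_mode_on()  { printf '\033%%G'; }
utf8_mode_off() { printf '\033%%@'; }

# Only send the sequence when stdout is really a terminal.
if [ -t 1 ]; then
    utf8_mode_on
fi
```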


lynx

You have to tell lynx manually that the Display Character Set is UTF-8, by going to the (O)ptions Menu.

Unfortunately, lynx-cur is not compiled with the wide version of ncurses, so it does not compute the length of lines correctly in UTF-8 mode. I am using an aliened RedHat (uhm) version of lynx; get one from http://rpmfind.net/ (download the version for RawHide). Then do:
# alien lynx-2.8.5-13.i386.rpm
# dpkg -i lynx_2.8.5-14_i386.deb
# cd /usr/lib
# ln -s libcrypto.so.0.9.7 libcrypto.so.4
# ln -s libssl.so.0.9.7 libssl.so.4
and run lynx.


Midnight Commander

Midnight Commander in Debian does not support UTF-8. You can use the RedHat version from http://rpmfind.net/ (download the version for RawHide or Fedora). You need to create a link for the slang library. Make sure you have the slang1a-utf8 debian package installed, and create the link:
# cd /lib
# ln -s libslang.so.1-UTF8.4.9 libslang-utf8.so.1
Then remove the old debian mc, and convert and install the redhat package:
# dpkg --remove mc
# alien mc-4.6.0-6.i386.rpm
# dpkg -i mc_4.6.0-7_i386.deb
After that, browsing directories full of UTF-8 named files should work. Unfortunately, mcedit still does not support UTF-8 characters, and UTF-8 characters do not work in mc dialogs either.


LaTeX

There is an extension for TeX, called Omega, and an extension for LaTeX, called Lambda, both aiming at providing unicode support for TeX. However, I was not very successful in trying to make these two work.

An alternative is http://www.unruh.de/DniQ/latex/unicode/ which is just a standard LaTeX package and enables you to use UTF-8 as the input encoding in your LaTeX files. You have to select the appropriate font encodings manually, though. The debian package is called latex-ucs.


Linux console

Although in theory just running /usr/bin/unicode_start is sufficient to switch the console into utf-8 mode, you need a suitable font as well (otherwise, why would you switch to unicode in the first place...).

The console-data package comes with the LatArCyrHeb font, which has good coverage of all the common latin, cyrillic, arabic and hebrew glyphs (why there are arabic and hebrew glyphs when the console does not support right-to-left direction is beyond me - they just take up space in the font). You can use
consolechars -f LatArCyrHeb-16
to load the font. However, this font is not particularly legible, nor are its glyphs of what I would call good quality. I have created the package fonty-rg, which contains fonts for the linux console in various encodings, including two unicode fonts.

The LatCyrGr-16.psf font has good coverage of all the common latin, cyrillic and greek glyphs. It is a 512-character font, which means that you lose bold colours if you use it on a VGA console (the framebuffer console does not suffer from this problem). The second font, chavo.psf, contains only 256 characters and as such retains bold colours even on a plain VGA console. It includes the characters needed to write Czech, Slovak, Polish, Hungarian, Russian, German and Esperanto. Of course, it accidentally covers a much wider range. The package also contains two scripts, iso and uni. iso X turns your console into ISO-8859-X encoding. uni turns your console into UTF-8 mode and automatically loads the LatCyrGr-16.psf font. uni 2 will use chavo.psf instead.

If you choose to use a different font, be aware that not all the fonts in /usr/share/consolefonts/ have a proper unicode map!

Test the console: download UTF-8-demo.txt.gz and cat it:
$ zcat UTF-8-demo.txt.gz
You should see some accented letters and some cyrillic, but many of the more exotic glyphs will not display. If you have a proper unicode locale, you can also view the file with less(1).
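
If you do not have the demo file at hand, a few test characters can be printed directly. This is only a rough substitute, sketched under the assumption that your printf understands octal escapes (POSIX requires it); the function name utf8_sample is made up. The characters are written as escapes so that the script itself stays pure ASCII.

```shell
#!/bin/sh
# utf8_sample: print a few non-ASCII characters encoded as UTF-8.
utf8_sample() {
    # a-acute (U+00E1), o-double-acute (U+0151), z-caron (U+017E)
    printf 'Latin: \303\241 \305\221 \305\276\n'
    # zhe (U+0436), ya (U+044F)
    printf 'Cyrillic: \320\266 \321\217\n'
    # lambda (U+03BB), omega (U+03C9)
    printf 'Greek: \316\273 \317\211\n'
}

utf8_sample
```

Every character should show up as a single glyph. Two garbage characters per letter mean that the console is not in UTF-8 mode; a blank or a replacement glyph means the font lacks that character.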

Alternative (14 pixels high) font: download uni-511-14.psf.gz (by Ilya Ketris). It does not have arabic, but has greek instead (and a less readable cyrillic). I modified Ilya's font (removed Hebrew, redesigned cyrillic, added some missing ISO-8859-2 glyphs and some less used cyrillic glyphs, modified some other glyphs to be more readable); you can download it as rg.psf.gz (source rg.sbf.gz).


(optional) Using dynafont

Install the package dynafont (it depends on konwert).

Type:
$ filterm - dynafont
dynafont will automatically download the necessary fonts as new characters appear. Of course you can have only 512 glyphs on the screen simultaneously, but that is good enough in most cases. dynafont will also handle combining characters.

If you are not using the framebuffer, you can use this command:
$ filterm - 512bold+dynafont
and dynafont will try to balance the number of glyphs and colours so that you can have bold colours on the console. The result is quite good if there are not many colours displayed.

Test the console:
$ cat UTF-8-demo.txt
Compare the result with and without dynafont running.

You end dynafont by logging out of the pseudoterminal it created.


Keyboard

There are not many unicode keymaps. However, if you use the kbd package instead of the default consoletools, unicode_start from that package reloads the current console keyboard layout via
dumpkeys | loadkeys --unicode
which is not possible with consoletools, since its loadkeys lacks the --unicode option.

You can test the console this way:

$ loadkeys ua-utf.kmap
Now type
$ cat
(so that unicode characters do not confuse your shell) and press right alt (press and release it, do not hold it down) - this switches the keyboard into the ukrainian layout. Press some keys and observe the cyrillic letters.

Switch the keyboard back to the us layout by pressing right alt again.

Unfortunately, compose (= dead keys) does not work in unicode mode. The reason is that the structure
struct kbdiacr {
    unsigned char diacr, base, result;
};
in /usr/src/linux/include/linux/kd.h declares result as an unsigned char, which cannot hold unicode characters.
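
To see which characters are affected: an unsigned char holds values from 0 to 255, so only code points up to U+00FF can be stored in result. The sketch below just does that comparison for a few sample code points; the function name fits_in_uchar is made up for illustration and has nothing to do with the kernel sources.

```shell
#!/bin/sh
# fits_in_uchar CODEPOINT
# Succeeds if the Unicode code point (decimal, or hex with 0x prefix)
# can be stored in a single unsigned char, i.e. is at most 255.
fits_in_uchar() {
    [ "$(( $1 ))" -le 255 ]
}

# e-acute, o-double-acute, Cyrillic zhe
for cp in 0xE9 0x151 0x436; do
    if fits_in_uchar "$cp"; then
        echo "U+${cp#0x} fits in unsigned char"
    else
        echo "U+${cp#0x} does not fit in unsigned char"
    fi
done
```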

I have prepared a small kernel patch that fixes it. You can download it here: download/unicode_dead_keys_linux-2.4.9.patch.gz.

You also need to compile console-tools against this patched kernel. Since version 1:0.2.3dbs-30, consoletools in debian includes a patch facilitating the transition into unicode mode - all you need to do is to recompile console-tools against the patched kernel. First install the necessary build dependencies (groff and dbs), and then do:
$ cd /tmp
$ apt-get source console-tools
$ cd console-tools*
$ dpkg-buildpackage -rfakeroot -us -uc
and install the resulting debian packages.

I put the recompiled packages and the patch here: download/console-tools/.


line drawing characters

Unfortunately, the linux console cannot display line drawing characters in utf-8 mode.

This is an ugly hack that tries to improve the situation a bit:

Download the linux+utf8 terminfo definition (based on the original definition by Ričardas Čepas) and compile it:
$ tic linux+utf8
(if you do it as root, this will install the entry into the system-wide directory)

Now set your TERM environment variable to linux+utf8
$ export TERM=linux+utf8
and test it by running midnight commander (the old debian version, not the one patched by RedHat :-)). Note that mc and some other programs might not recognise linux+utf8 as a colour terminal; you can force colours in mc with the -c option.

As this terminfo entry is just a crude hack, do expect some graphical glitches. More glitches result from the multibyte nature of unicode characters and are being addressed upstream (ncurses/slang).


Conclusion

Linux console definitely needs improving (read: rewriting)


libreadline

libreadline supports multibyte locales (including UTF-8) since version 4.3. Please upgrade.

You might need to set up /etc/inputrc:

###
# Be 8 bit clean.
set input-meta on
set output-meta on

# To allow the use of 8bit-characters like the german umlauts, comment out
# the line below. However this makes the meta key not work as a meta key,
# which is annoying to those which don't need to type in 8-bit characters.
set convert-meta off

#You may need to add following lines if moving cursor does not work properly:
"\e[D" backward-char
"\e[C" forward-char
###


stty console editor

see http://mail.nl.linux.org/linux-utf8/1999-08/msg00001.html (no, situation has not changed since then)

get the patch from ftp://ftp.ilog.fr/pub/Users/haible/utf8

I have not tried the patch, I use the kernel input editor with non-ascii chars very rarely.


gpm

copy&paste copies only ASCII characters in UTF-8 mode.

There were several stale kernel patches trying to improve the situation, until Jochen Hein made a version for 2.4.19. Get a patch here: download/unicode_copypaste_2.4.19.patch.gz. It seems to mess up making a selection in non-utf8 mode in strange ways, though.


X window system

Prepare locale and keyboard

The XFree86 4.2.0 configuration already contains many commonly used locales; those using UTF-8 encoding use the files from the en_US.UTF-8 locale. I am going to keep this convention.

If you want to input non-ascii characters, you may need a compose map. There was no proper Compose map in older versions of XFree86, but the one in 4.2.0 should be good enough. For reference, here is a good latin compose map by Pablo Saratxaga: Compose.gz.

Look at the file /usr/X11R6/lib/X11/locale/compose.dir and make sure it contains these lines:
en_US.UTF-8/Compose:           en_GB.UTF-8
en_US.UTF-8/Compose            en_GB.UTF-8
Our en_GB.UTF-8 should be there; if you are using some exotic locale, the lines might be missing. Add them.

See file /usr/X11R6/lib/X11/locale/locale.dir and make sure there are these lines:
en_US.UTF-8/XLC_LOCALE: en_GB.UTF-8
en_US.UTF-8/XLC_LOCALE  en_GB.UTF-8
Similarly, if they are not there for your locale, add them.
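
Both files can be checked with a few lines of shell. This is a sketch under the assumption that the files use the two-column format shown above (target file, then locale name); the function name x11_locale_target is invented for the example.

```shell
#!/bin/sh
# x11_locale_target FILE LOCALE
# Prints the file that LOCALE is mapped to in compose.dir or locale.dir
# (first column, trailing colon removed); fails if there is no entry.
x11_locale_target() {
    awk -v loc="$2" '
        $1 ~ /^#/ { next }     # skip comment lines
        $2 == loc { sub(/:$/, "", $1); print $1; found = 1; exit }
        END { exit !found }
    ' "$1"
}

dir=/usr/X11R6/lib/X11/locale
for f in compose.dir locale.dir; do
    if target=$(x11_locale_target "$dir/$f" en_GB.UTF-8 2>/dev/null); then
        echo "$f: en_GB.UTF-8 -> $target"
    else
        echo "$f: no entry for en_GB.UTF-8"
    fi
done
```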


rxvt-unicode

rxvt-unicode is a modern, slim and fast replacement for xterm. It handles UTF-8 and most other locales, provided the locale was set in the environment where urxvt was started (run "LANG=xx_YY.UTF-8 urxvt" to be sure). It also supports charset guessing for a few modern charsets, so some users can continue using old programs.

If you did not install the recommended fonts for your locale, you may need to configure the fallback font list. See the manpage for details.


xterm

Xterm has handled compose (dead keys) properly in UTF-8 locales since version 157. You can find out which version you are using by typing:
$ xterm -v
If you have an older version, you should upgrade. The xterm debian version 4.1.0-11 is ok. In theory, xterm should recognize a utf-8 locale by itself, but if that does not work, you have to use the -u8 option:
$ xterm -u8
In addition, of course, you have to select appropriate unicode fonts. There is a script called uxterm which does everything for you. Just use
$ uxterm
instead of xterm to get a UTF-8 enabled xterm.
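
The version check can be automated roughly like this. It is a sketch which assumes that xterm -v prints its patch level in parentheses, as in "XFree86 4.1.0(157)"; the exact wording of that line differs between builds, and the function name xterm_patch_level is made up.

```shell
#!/bin/sh
# xterm_patch_level STRING
# Extracts the number in parentheses from a version string such as
# "XFree86 4.1.0(157)" and prints it; prints nothing if there is none.
xterm_patch_level() {
    printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\)).*/\1/p'
}

# An empty version string means xterm is not installed or not in PATH.
version=$(xterm -v 2>/dev/null || true)
level=$(xterm_patch_level "$version")
if [ -n "$level" ] && [ "$level" -ge 157 ]; then
    echo "xterm patch level $level: compose should work in UTF-8 locales"
else
    echo "xterm missing or older than 157: ${version:-not found}"
fi
```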


Testing

start uxterm, type
$ setxkbmap us_intl
$ cat
and type accented letters (e.g. 'a gives a with acute accent (á), ^a gives you a with circumflex (â))

Start gvim and write there some accented letters.

In xterm, type
$ setxkbmap ru
Press right alt and _hold it pressed_ while typing letters. You should see some cyrillic.

If all goes OK, you are done.


What does not work

KDE

KDE 3 seems to work well with unicode - you just need to use correct locale, and have necessary fonts installed (otherwise the missing characters will be displayed either without diacritics, or taken from other fonts - scaled bitmapped characters look particularly ugly).


cooledit

cooledit cannot input characters outside ISO-8859-1 range.


screen

screen seems to have problems with line drawings


yudit

yudit has its own input method, and you have to use the US keyboard layout to use it effectively (as usual, entering characters outside the ISO-8859-1 range does not work).


Ncurses and Slang

There are many graphical glitches with all ncurses and slang applications. Just remember to press CTRL+L :-)


fvwm

fvwm fails to display text in UTF-8 locale. Edit /usr/X11R6/lib/X11/locale/en_US.UTF-8/XLC_LOCALE file and change
on_demand_loading       True
into:
on_demand_loading       False
It should help a little.


man and groff

man-db since version 2.4.0 uses groff's utf8 device to display pages in UTF-8, in UTF-8 locale. However, groff assumes source pages to be in ISO-8859-1 encoding.


And more...

and undoubtedly many more...


A. References