From "Software Engineering For Computer Scientists" (Botting & Ross)
A P P E N D I X A
===================
B I N A R Y C O D E S
========================
This book makes use of the American Standard Code for Information
Interchange (ASCII) because it is available on most computers. We will
stick to the 7-bit subset with its 128 different codes.
The ASCII code includes control characters that do not print a
character. Each has a short standard name and a standard function, plus a
large number of non-standard applications.
The original code was developed in the days of mechanical
terminals such as Teletypes. The meanings of the control codes are
defined in terms of typewriter-like actions: tab, ring bell,
backspace, return, and line feed. These have been reinterpreted
as cursor movements for CRTs. Many computer systems use the
control codes for special purposes. A competent software
engineer will know about the control codes: what they were
designed to mean, and how they are used or misused in real
systems.
In many high-level languages the ASCII characters are
representable as a function (e.g. Pascal - chr(i), C - (char)i, or Ada - CHARACTER'VAL(i)),
where "i" is an integer. Ada specifies a special standard package that
defines ASCII with standard names for constants representing the coded
characters. In C they can be indicated by a backslash character (\)
followed by either a special letter or a hexadecimal or octal
number.
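As a small illustration of the paragraph above, here is a sketch in Python
(a language this appendix does not itself use; it is chosen here only for
brevity). Python's chr(i) and ord(c) play the role of Pascal's chr(i) and
ord(c), and its backslash escapes happen to coincide with the C escapes
listed in the table. The names code_of and char_of are made up for this
example.

```python
# Map between integer codes and characters, and check that the
# backslash escapes denote the control codes listed in the table.
ESCAPES = {
    "\a": 7,   # BEL
    "\b": 8,   # BS
    "\t": 9,   # HT
    "\n": 10,  # LF
    "\v": 11,  # VT
    "\f": 12,  # FF
    "\r": 13,  # CR
}

def code_of(ch):
    """Return the integer code of a single character."""
    return ord(ch)

def char_of(i):
    """Return the character with 7-bit ASCII code i."""
    if not 0 <= i <= 127:
        raise ValueError("not a 7-bit ASCII code")
    return chr(i)

for ch, code in ESCAPES.items():
    assert code_of(ch) == code

# Octal and hexadecimal escapes name the same character (ESC, code 27).
assert "\033" == "\x1b" == char_of(27)
```

The range check in char_of reflects the decision above to stay within the
128 codes of 7-bit ASCII.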
The following table includes decimal, octal and hexadecimal values of
all ASCII codes, plus any C escape codes, Ada names, the
common ways of inputting them from a keyboard, and the standard meaning
of each code. This table is also useful for converting between decimal,
octal and hexadecimal.
Table of ASCII Code (0-127)
------------------------------------------------------------------
Symbol Codes ^CTRL Key American Standard Meaning
decimal, (Ada ASCII constant)
octal,hex \ C escape symbol
------------------------------------------------------------------
NUL 0 ^@ Fills in time* (ASCII'NUL)
SOH 1 ^A Start Of Header (routing info)(ASCII'SOH)
STX 2 ^B Start Of Text (end of header)(ASCII'STX)
ETX 3 ^C End Of Text(ASCII'ETX)
EOT 4 ^D End Of Transmission(End of UNIX input)(ASCII'EOT)
ENQ 5 ^E ENQuiry, asking who is there(ASCII'ENQ)
ACK 6 ^F Receiver ACKnowledges positively(ASCII'ACK)
BEL 7 ^G Rings BELl or beeps(ASCII'BEL)\a
BS 8,10 ^H,BKSP Move print head Back one Space(ASCII'BS)\b
HT 9,11 ^I,TAB Move to next Tab-stop(ASCII'HT)\t
LF 10,12,A ^J Line Feed (ASCII'LF)\n
VT 11,13,B ^K Vertical Tabulation(ASCII'VT)\v
FF 12,14,C ^L Form Feed - new page or form(ASCII'FF)\f
CR 13,15,D ^M, Enter Carriage Return to left margin(ASCII'CR)\r
SO 14,16,E ^N Shift Out of ASCII(ASCII'SO)
SI 15,17,F ^O Shift into ASCII(ASCII'SI)
DLE 16,20,10 ^P Data Link Escape(ASCII'DLE)
DC1 17,21,11 ^Q Device control(ASCII'DC1)
DC2 18,22,12 ^R Device control(ASCII'DC2)
DC3 19,23,13 ^S Device control(ASCII'DC3)
DC4 20,24,14 ^T Device control(ASCII'DC4)
NAK 21,25,15 ^U Negative Acknowledgment(ASCII'NAK)
SYN 22,26,16 ^V Sent in place of data to keep systems
synchronized(ASCII'SYN)
ETB 23,27,17 ^W End of transmission block(ASCII'ETB)
CAN 24,30,18 ^X Cancel previous data(ASCII'CAN)
EM 25,31,19 ^Y End of Medium(ASCII'EM)
SUB 26,32,1A ^Z Substitute(End of Input to DOS)(ASCII'SUB)
ESC 27,33,1B ^[,ESC Escape to extended character set(ASCII'ESC)
FS 28,34,1C File separator(ASCII'FS)
GS 29,35,1D Group separator(ASCII'GS)
RS 30,36,1E Record separator(ASCII'RS)
US 31,37,1F Unit separator(ASCII'US)
SP 32,40,20 Blank Space character(ASCII'SP)
! 33,41,21 Shift 1 Exclamation mark(ASCII'EXCLAM)
" 34,42,22 Quotation mark (ASCII'QUOTATION)\"
# 35,43,23 Shift 3 Number sign, hash mark(ASCII'SHARP)
$ 36,44,24 Shift 4 Dollar sign(ASCII'DOLLAR)
% 37,45,25 Shift 5 Per Cent sign (ASCII'PERCENT)
& 38,46,26 Shift 7 Ampersand(ASCII'AMPERSAND)
' 39,47,27 Acute accent or single quote, \'
( 40,50,28 Shift 9 Open parenthesis - English and Maths
) 41,51,29 Shift 0 Close parenthesis - English and Maths
* 42,52,2A Shift 8 Asterisk=Star - multiplication in HLL
+ 43,53,2B Plus sign
, 44,54,2C Comma, terminates clauses and phrases
- 45,55,2D minus, hyphens and dashes
. 46,56,2E Period and decimal point(ASCII'PERIOD)
/ 47,57,2F slash - indicates division, used in dates,
0-9 48-57,60-71,30-39 Decimal digits
: 58,72,3A colon (ASCII'COLON)
; 59,73,3B semicolon
< 60,74,3C Less than sign
= 61,75,3D Equals sign
> 62,76,3E Greater than sign
? 63,77,3F Question mark(ASCII'QUERY)
@ 64,100,40 Shift 2 'at' sign, internet addresses(ASCII'AT_SIGN)
A-Z 65-90,101-132,41-5A Upper case letters(Shifted lower case)
[ 91,133,5B Left Bracket(ASCII'L_BRACKET)
\ 92,134,5C BackSlash (ASCII'BACK_SLASH) \\
] 93,135,5D Close Bracket(ASCII'R_BRACKET)
^ 94,136,5E Shift 6 Circumflex accent,(ASCII'CIRCUMFLEX)
_ 95,137,5F Underline (ASCII'UNDERLINE)
` 96,140,60 Grave accent, open quotes (ASCII'GRAVE)
a-z 97-122,141-172,61-7A Lower case letters(ASCII'LC_A,LC_B,...)
{ 123,173,7B Left Brace(ASCII'L_BRACE)
| 124,174,7C Vertical Bar(ASCII'BAR)
} 125,175,7D Right Brace(ASCII'R_BRACE)
~ 126,176,7E Tilde (another accent)(ASCII'TILDE)
DEL 127,177,7F ^?,DEL,RUBOUT Punch out all bits on paper tape
ASCII' - Ada constant
\ - C escape code
^ - sent if CTRL key down when key is hit.
*Notice that NO ASCII character sends a BREAK signal. This is not a
character. It is transmitted through an RS232 cable by holding the
transmit data line in the spacing (0) state for longer than one character
time, or through a modem by ceasing to send the carrier frequency for a
fixed length of time. NUL transmits a character (with all bits=0), BREAK
does not.
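The Codes and ^CTRL Key columns of the table follow a simple rule that can
be sketched in a few lines. The Python below is only an illustration; the
function names table_row, ctrl_code and caret_name are invented for this
example. It assumes the usual convention that holding CTRL flips bit 6
(value 64) of the key's code, which is why ^@ gives NUL and ^? gives DEL.

```python
def table_row(code):
    """Return 'decimal,octal,hex' for one code, as in the Codes column."""
    return "%d,%o,%X" % (code, code, code)

# Keys that have a control code under this convention: '@'..'_' and '?'.
CTRL_KEYS = [chr(c) for c in range(0x40, 0x60)] + ["?"]

def ctrl_code(key):
    """Code sent when CTRL is held down while `key` is hit."""
    k = key.upper()
    if k not in CTRL_KEYS:
        raise ValueError("no control code for this key")
    return ord(k) ^ 0x40

def caret_name(code):
    """Caret notation (^@ ... ^_ and ^?) for a control code."""
    if not (0 <= code <= 31 or code == 127):
        raise ValueError("not a control code")
    return "^" + chr(code ^ 0x40)

print(table_row(27), caret_name(27))   # the ESC row of the table
```

For example, ctrl_code("g") gives 7 (BEL) and caret_name(127) gives "^?",
matching the BEL and DEL rows.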
The Ada standard defines a name for all printable characters. The next
table gives the name of each character and cross references its usages in
this book.
ASCII is a standard but most of the CTRL codes have been
reinterpreted at one time or another. For example, the character sent to
indicate the end of a line (EOLN::=`End of Line`) is CR on many
Personal Computers, LF on most UNIX systems, and both CR and LF on
others. We have used machines that would only accept {LF CR} or even {LF CR
CR}. Similarly, on a modern online computer system a signal that interrupts
a running process is vital - we will call it INT - but the following have
all been used to send it: ETX, DLE, DC4, DEL.
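Because the end-of-line convention varies so much, programs that exchange
text between systems often normalize it first. The following Python sketch
(the name normalize_eoln is invented here) maps CR, LF, {CR LF} and {LF CR}
to a single LF. It is one possible policy, not a standard: a sequence such
as LF CR LF is inherently ambiguous, and this version reads it as two line
ends.

```python
CR, LF = 13, 10

def normalize_eoln(data):
    """Replace CR LF, LF CR, lone CR and lone LF by a single LF."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b in (CR, LF):
            # A CR/LF or LF/CR pair counts as one end of line.
            nxt = data[i + 1] if i + 1 < len(data) else None
            if nxt in (CR, LF) and nxt != b:
                i += 1
            out.append(LF)
        else:
            out.append(b)
        i += 1
    return bytes(out)
```

The odder {LF CR CR} convention mentioned above would need its own rule and
is deliberately left out.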
The following have been used for cursor movements:
System character line word end start
left right down up forward backward of line
Apple][: ^k ^L ^, ^O
Turbo: ^S ^D ^X ^E ^F ^A ^QS ^QD
M100 : ^S,^H ^D ^X ^E ^F ^A ^R ^Q
'vi' h l j k w b 0 $
The following have been used to mark the end of a string: NUL, ESC, 2
ESCs, grave accent, apostrophe, quotation, EOLN, slash. The following
have been used to indicate the end of input text: 2 CRs, EOT, SUB. The
following have been used to kill or delete the previous character:
DEL, BS, #, @. BS has been used to delete the
character under the cursor as well. The following have been used to delete
the character following the cursor: DEL, BEL. The following have been
used to cancel the current line of input: DEL, NAK(^U), hash(#).
On a network the special characters take on yet more meanings. For
example, RS232 communications commonly use DC3(^S) and DC1(^Q) to pause and
restart data transmission (originally to allow data to be punched). In
an X.25 packet switched network SI(^O) forces the data through the
intervening machines and DLE(^P) allows you to send commands to your local
"Pad". Proprietary networks often have a special 'escape' character as
well.
UNIX programs have several common 'escape' characters that change the
meaning of an input line:
ed,ex,vi,write,mail,... ! shell command
vi : ex command
mail,cu,... ~ mail command
nroff,troff,... . format command
The following can allow a following control character to appear in text:
SYN(^V), DLE(^P),...
From: mike@vlsivie.tuwien.ac.at
Newsgroups: comp.unix.questions,comp.unix.admin,comp.std.internat,soc.culture.german,soc.culture.french,soc.culture.nordic,comp.answers,soc.answers,news.answers
Subject: ISO 8859-1 National Character Set FAQ
Followup-To: comp.unix.questions,comp.unix.admin,soc.culture.german,soc.culture.french,soc.culture.nordic
Date: 25 Sep 1994 00:14:37 GMT
Organization: TU Wien
Lines: 649
Approved: news-answers-request@MIT.EDU
Expires: 8 Nov 1994 00:14:27 GMT
NNTP-Posting-Host: bloom-picayune.mit.edu
Summary: This FAQ discusses the use of the standardized ISO 8859-1
national character set (supports all (W-)European languages).
X-Last-Updated: 1994/08/02
Originator: faqserv@bloom-picayune.MIT.EDU
Archive-name: character-sets/iso-8859-1-faq
Posting-Frequency: monthly
ISO 8859-1 National Character Set FAQ
DISCLAIMER: THE AUTHOR MAKES NO WARRANTY OF ANY KIND WITH REGARD TO
THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Note: Most of this was tested on a Sun 10, running SunOS 4.1.* - other
systems might differ slightly
This FAQ discusses topics related to the use of ISO 8859-1 based 8 bit
character sets. It discusses how to use European (Latin American)
national character sets on UNIX-based systems and the internet.
1. Which coding should I use for accented characters?
Use the internationally standardized ISO-8859-1 character set to type
accented characters. This character set contains all characters
necessary to type (West) European languages. This encoding is also the
preferred encoding on the Internet (where accepted - see below).
This character set is also used by MS-Windows (Actually, MS-Windows
uses UNICODE (ISO 10646) truncated to 8 bit, which gives an equivalent
encoding.), VMS and (practically all) UNIX implementations. MS-DOS
uses a different character set and is not compatible with this
character set. (It can, however, be translated to this format with
various tools. See section 7.)
ISO 8859-1 supports the following languages:
Afrikaans, Catalan, Danish, Dutch, English, Faeroese, Finnish, French,
German, Galician, Irish, Icelandic, Italian, Norwegian, Portuguese,
Spanish and Swedish.
(It has been called to my attention that Albanian can be written with
ISO 8859-1 also. However, from a standards point of view, ISO 8859-2
is the appropriate character set for Balkan countries.)
ISO 8859-1 is just one part of the ISO-8859 standard, which specifies
several character sets, e.g.:
8859-1 Europe, Latin America
8859-2 Eastern Europe
8859-3 SE Europe
8859-4 Scandinavia (mostly covered by 8859-1 also)
8859-5 Cyrillic
8859-6 Arabic
8859-7 Greek
8859-8 Hebrew
2. Getting your terminal to handle ISO characters.
Terminal drivers normally do not pass 8 bit characters. To enable
proper handling of ISO characters, add the following lines to your
.cshrc:
----------------------------------
tty -s
if ($status == 0) stty cs8 -istrip -parenb
----------------------------------
If you don't use csh, add equivalent code to your shell's start up
file.
Note that it is necessary to check whether your standard I/O streams
are connected to a terminal. Only then should you reconfigure the
terminal driver.
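For readers who would rather do this from a program than from a shell
start-up file, the same three settings can be sketched with Python's
standard termios module (available on Unix only). This is an illustration
of what 'stty cs8 -istrip -parenb' changes, not a replacement for it; the
function names are made up for this example.

```python
import os
import termios

def eight_bit_attrs(attrs):
    """Return a copy of tcgetattr()-style attrs with cs8 -istrip -parenb."""
    iflag, oflag, cflag, lflag, ispeed, ospeed, cc = attrs
    iflag &= ~termios.ISTRIP                    # -istrip: keep the 8th bit
    cflag &= ~(termios.CSIZE | termios.PARENB)  # clear size bits, -parenb
    cflag |= termios.CS8                        # cs8: 8 data bits
    return [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]

def make_tty_eight_bit(fd=0):
    """Reconfigure fd only if it is a terminal (the 'tty -s' check)."""
    if os.isatty(fd):
        new = eight_bit_attrs(termios.tcgetattr(fd))
        termios.tcsetattr(fd, termios.TCSANOW, new)
```

As in the csh version, the terminal check comes first, so the code does
nothing when standard input is a file or a pipe.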
3. Selecting the right font under X-11 for xterm (and other applications)
To actually display accented characters, you need to select a font
which contains bitmaps for ISO 8859-1 characters in the
correct character positions. The names of these fonts normally
have the suffix "iso8859-1". Use the command
# xlsfonts
to list the fonts available on your system. You can preview a
particular font with the
# xfd -fn <fontname>
command.
Add the appropriate font selection to your ~/.Xdefaults file, e.g.:
----------------------------------------------------------------------------
XTerm*Font: -adobe-courier-medium-r-normal--18-180-75-75-m-110-iso8859-1
Mosaic*XmLabel*fontList: -*-helvetica-bold-r-normal-*-14-*-*-*-*-*-iso8859-1
----------------------------------------------------------------------------
Footnote: The X11R5 distribution has some fonts which are labeled as
ISO fonts, but which do not contain the ISO characters.
4. Getting the locale setting right.
For the ctype macros (and by extension, applications you are running
on your system) to correctly identify accented characters, you
may have to set the ctype locale to an ISO 8859-1 conformant
configuration. On SunOS this may be done by placing
------------------------------------
setenv LANG C
setenv LC_CTYPE iso_8859_1
------------------------------------
in your .login script (if you use the csh). An equivalent statement
will adjust the ctype locale for non-csh users.
The process is the same for other operating systems, e.g. on HP/UX use
'setenv LANG german'; on IRIX 5.2 use 'setenv LANG de'; on Ultrix 4.3
use 'setenv LANG GER_DE.8859' and on OSF/1 use 'setenv LANG
de_DE.88591'. The examples given here are for German. Other
languages work too, depending on your operating system. Check out
'man setlocale' on your system for more information.
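Why the ctype setting matters can be shown with a small analogy in Python.
This sketch does not consult LC_CTYPE at all; it only contrasts an
ASCII-only classification (roughly what the ctype macros report in the "C"
locale) with one that treats each byte as an ISO 8859-1 character. The
function names are invented for the example.

```python
def is_alpha_ascii(byte):
    """ASCII-only test: bytes above 127 are never letters."""
    return bytes([byte]).isalpha()

def is_alpha_latin1(byte):
    """Classify the byte as an ISO 8859-1 character instead."""
    return bytes([byte]).decode("latin-1").isalpha()

for byte in (0x61, 0xE4, 0xD7):   # 'a', a-umlaut, multiplication sign
    print(hex(byte), is_alpha_ascii(byte), is_alpha_latin1(byte))
```

The a-umlaut (0xE4) is rejected by the first test and accepted by the
second, which is the difference an ISO 8859-1 conformant ctype locale makes
to applications.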
5. Printing accented characters.
5.1 PostScript printers
If you want to print accented characters on a postscript printer, you
may need a PS filter which can handle ISO characters.
Our Postscript filter of choice is a2ps, the more recent version of
which can handle ISO 8859-1 characters with the -8 option.
a2ps V4.3 is available via anonymous ftp from imag.imag.fr under the
file name /archive/postscript/a2ps.V4.3.tar.Z.
5.2 Other (non-PS) printers:
If you want to print to non-PS printers, your success rate depends on
the encoding the printer uses. Several alternatives are possible:
* Your printer accepts ISO 8859-1:
You're lucky. No conversion is needed, just send your files to the
printer.
* Your printer supports a PC-compatible font:
You can use the recode tool to translate from ISO 8859-1 to this
encoding. (If you are using a SunOS based computer, you can also use
the unix2dos utility which is part of the standard distribution.)
Just add the appropriate invocation as a built-in filter to your
printer driver.
* Your printer uses a national ISO 646 variant (7 bit ASCII
with some special characters replaced by national characters):
You will have to use a translation tool; this tool would
then be installed in the printer driver and translate character
conventions before sending a file to the printer. The recode
program supports many national ISO 646 norms. (If you add support
for another one, please submit it to the maintainers of recode, so that
it can benefit everybody.)
Unfortunately, you will not be able to print all characters with
the built-in character set. Most printers have user-definable
bitmap characters, which you can use to print all ISO characters.
You just have to generate a bitmap for any particular character and
send this bitmap to the printer. The syntax for these characters
varies, but a few conventions have gained universal acceptance
(e.g., many printers can process Epson-compatible escape sequences).
* Your printer supports a strange format:
If your printer supports some other strange format (e.g. HP Roman8,
DEC MCS, Atari, NeXTStep, EBCDIC or what have you), you have to add a
filter which will translate ISO 8859-1 to this encoding before
sending your data to the printer. 'recode' supports many of these
character sets already. If you have to write your own conversion
tool, consider recode a good starting point. (If you add support for
any new character sets, please submit your code changes to the
maintainers of recode).
If your printer supports DEC MCS, this is nearly equivalent to ISO
8859-1 (actually, it is a former ISO 8859-1 draft standard) - the
difference is only a few characters. You could probably get by
with just sending ISO 8859-1 to the printer.
* Your printer supports ASCII only:
You have several options:
+ If your printer supports user-defined characters, you can print all
ISO characters not supported by ASCII by sending the appropriate
bitmaps.
+ Add a filter to the printer driver which will strip the accent
characters and just print the unaccented characters.
+ Add a filter which will generate escape sequences (such as
" BS a, i.e. quotation mark, backspace, a, for Umlaut-a (ä), etc.)
to be printed. Recode supports this encoding under the name ascii-bs.
Footnote: For more information on character translation and the
'recode' tool, see section 7.
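The "strip the accents" option for ASCII-only printers can be sketched as a
small filter. The Python below is one possible approach and is not how
recode does it: it decodes the input as ISO 8859-1, uses the standard
unicodedata module to split each accented letter into base letter plus
accent, and drops the accent. The name strip_accents is invented here.
Characters with no unaccented form (such as the German sharp s) come out as
"?".

```python
import unicodedata

def strip_accents(data):
    """Turn ISO 8859-1 bytes into plain ASCII by dropping accent marks."""
    text = data.decode("latin-1")
    # NFD splits e.g. a-umlaut into 'a' followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII is replaced by '?'.
    return base.encode("ascii", "replace")
```

A real printer filter would read standard input and write standard output;
that wrapper is omitted to keep the sketch short.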
6. TeX and ISO 8859-1
If you want to write TeX without having to type {\"a}-style escape
sequences, you can either get a TeX version configured to read 8-bit
ISO characters, or you can translate between ISO and TeX codings.
The latter is arduous if done by hand, but can be automated if you use
emacs. If you use Emacs 19.23 or higher, simply add the following line
to your .emacs startup file. This mode will perform the necessary
translations for you automatically:
------------------
(require 'iso-cvt)
------------------
If you are using pre-19.23 versions of emacs, get the "gm-lingo.el"
lisp file via anonymous ftp from ftp.vlsivie.tuwien.ac.at in /pub/8bit.
Load gm-lingo from your .emacs startup file and this mode will perform
the necessary translations for you automatically.
If you want to configure TeX to read 8 bit characters, check out the
configuration files available via anonymous ftp from
ftp.vlsivie.tuwien.ac.at in /pub/8bit. The new LaTeX2e reportedly
supports 8 bit characters by default.
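To make the idea of translating between ISO and TeX codings concrete, here
is a rough Python sketch of the ISO-to-TeX direction. It is not the code
used by iso-cvt or gm-lingo; the table TEX_ACCENTS and the function
iso_to_tex are invented for this example, and only five accents plus the
sharp s are handled. Everything else is passed through unchanged.

```python
import unicodedata

# Combining mark -> TeX accent command (only the marks handled here).
TEX_ACCENTS = {
    "\u0308": '\\"',  # diaeresis / umlaut
    "\u0301": "\\'",  # acute
    "\u0300": "\\`",  # grave
    "\u0302": "\\^",  # circumflex
    "\u0303": "\\~",  # tilde
}

def iso_to_tex(data):
    """Rewrite accented ISO 8859-1 letters as {\\"a}-style TeX escapes."""
    out = []
    for ch in data.decode("latin-1"):
        parts = unicodedata.normalize("NFD", ch)
        if ch == "\u00df":                      # German sharp s
            out.append("{\\ss}")
        elif len(parts) == 2 and parts[1] in TEX_ACCENTS:
            out.append("{%s%s}" % (TEX_ACCENTS[parts[1]], parts[0]))
        else:
            out.append(ch)
    return "".join(out)
```

Refinements such as the dotless i that TeX wants under an accent are left
out of this sketch.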
7. Translating between different international character sets.
While ISO 8859-1 is an international standard, not everybody uses this
encoding. Many computers use their own, vendor-specific character sets
(most notably Microsoft for MS-DOS). If you want to edit or view files
written in different encoding, you will have to translate them to an
ISO 8859-1 based representation.
There are several PD character set translators available on the
internet, the most notable being 'recode'. recode is available via
anonymous ftp from prep.ai.mit.edu and resides in the directory
/u2/emacs. recode is covered by FSF copyright and is freely
redistributable. Under SunOS, the dos2unix and unix2dos programs
(distributed with SunOS) will translate between MS-DOS and ISO 8859-1
formats.
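The kind of translation these tools perform can be sketched with the codecs
built into Python. The example assumes that "MS-DOS format" means IBM code
page 437; the text above does not name a code page, and other DOS code
pages (for instance 850) would need a different codec name. The function
names are invented for the example. Characters with no counterpart in the
target set are replaced by "?".

```python
def dos_to_iso(data, dos_codepage="cp437"):
    """Translate bytes in a DOS code page to ISO 8859-1 bytes."""
    return data.decode(dos_codepage).encode("latin-1", "replace")

def iso_to_dos(data, dos_codepage="cp437"):
    """Translate ISO 8859-1 bytes to a DOS code page."""
    return data.decode("latin-1").encode(dos_codepage, "replace")

# a-umlaut, u-umlaut and sharp s as MS-DOS (cp437) bytes
print(dos_to_iso(b"\x84\x81\xe1"))
```

Unlike this sketch, a complete conversion between MS-DOS and UNIX text
files also has to deal with the different end-of-line conventions.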
8. ISO 8859-1 and emacs
Emacs 19 (as opposed to Emacs 18) can automatically handle 8 bit
characters. (If you have a choice, upgrade to Emacs version 19.23,
which has the most complete ISO support.) Emacs 19 has extensive
support for ISO 8859-1. If your display supports ISO 8859-1 encoded
characters, add the following line to your .emacs startup file:
-----------------------------
(standard-display-european t)
-----------------------------
If you want to display ISO-8859-1 encoded files by using TeX-like escape
sequences (e.g. if your terminal supports only ASCII characters), you
should add the following line to your .emacs file (DON'T DO THIS IF
YOUR TERMINAL SUPPORTS ISO OR SOME OTHER ENCODING OF NATIONAL
CHARACTERS):
--------------------
(require 'iso-ascii)
--------------------
If your terminal supports a non-ISO 8859-1 encoding of national
characters (e.g. 7 bit national variant ISO 646 character sets,
aka. 'national ASCII' variants), you should configure your own display
table. The standard emacs distribution contains a configuration
(iso-swed.el) for terminals which have ASCII in the G0 set and a
Swedish/Finnish version of ISO 646 in the G1 set. If you want to
create your own display table configuration, take a look at this
sample configuration and at disp-table.el for available support
functions.
Emacs can also accept 8 bit ISO 8859-1 characters as input. These
character codes might either come from a national keyboard (and
driver) which generates ISO-compliant codes, or may have been entered
by use of a COMPOSE-character mechanism.
If you use such an input format, execute the following expression in
your .emacs startup file to enable Emacs to understand them:
-------------------------------------------------
(set-input-mode (car (current-input-mode))
(nth 1 (current-input-mode))
0)
-------------------------------------------------
In order to configure emacs to handle commands operating on words
properly (such as 'beginning of word', etc.), you should also add the
following line to your .emacs startup file:
-------------------------------
(require 'iso-syntax)
-------------------------------
For further information on using ISO 8859-1 with emacs, also see the
Emacs manual section on "European Display" (available as hypertext
document by typing C-h i in emacs or as a printed version).
9. Typing ISO with US-style keyboards.
Many computer users use US-ASCII keyboards, which do not have keys for
national characters. You can use escape sequences to enter these
characters. For ASCII terminals (or PCs), check the documentation of
your terminal for particulars.
9.1 US-keyboards under X11
Under X Windows, the COMPOSE multi-language support key can be
used to enter accented characters.
Thus, when running X11 on a SunOS-based computer (or any other X11R5
server supporting COMPOSE characters), you can type three character
sequences such as
COMPOSE " a -> ä
COMPOSE s s -> ß
COMPOSE ` e -> è
to type accented characters.
Note that this COMPOSE capability has been removed as of X11R6,
because it does not adequately support all the languages in the world.
Instead, compose processing is supposed to be performed in the client
using an 'input method'. (In the short term, this is a step backward,
as few clients support this type of processing at the moment.)
Input methods are controlled by the locale environment variables (LANG
and LC_xxx). The values for these variables are (or at least should be
made so by any sane vendor) equivalent to those expected by
the ANSI/POSIX locale library. For a list of possible settings see
section 4.
9.2 US-keyboards and emacs
There are several modes to enter Umlaut characters under emacs when
using a US-style keyboard. One such mode is iso-transl, which is
distributed with the standard emacs distribution. This mode uses the
Alt-key for entering diacritical marks (accents et al.). An extended
iso-transl mode (iso-transl+) which allows the definition of language
specific short cuts is available via anonymous ftp from
ftp.vlsivie.tuwien.ac.at in /pub/8bit/iso-transl+.shar. This file
also includes sample configurations for the German and Spanish
languages.
An alternative to using Alt-sequences for entering diacritical marks
is the use of 'electric accents', such as used on old type writers or
under many MS Windows programs. With this method, typing an accent
character will place this accent on the next character entered. One
mode which supports this entry method is the iso-acc minor mode which
comes with the standard emacs distribution. Just add
------------------
(require 'iso-acc)
------------------
to your emacs startup script, and the ' ` ~ / ^ " keys will be electric
accents.
10. File names with ISO characters
If your OS is 8 bit clean, you can use ISO characters in file names.
(This is possible under SunOS.)
11. Command names with ISO 8859-1
If your OS supports file names with ISO characters, and your shell is
8 bit clean, you can use command names containing ISO characters. If
your shell does not handle ISO characters correctly, use one of the
many PD shells which do (e.g. tcsh, an extended csh). These are
available from a multitude of ftp sites around the world.
For tcsh, versions 6.04 or higher are 8 bit clean (if compiled
correctly), for bash the relevant version is 1.14.1 or higher.
12. Spell checking
Ispell 3.1 has by far the best understanding of non-English
languages and can be configured to handle 8-bit characters
(Thus, it can handle ISO-8859-1 encoded files).
Ispell 3.1 now comes with hash tables for several languages (English,
German, French,...). It is available via anonymous ftp from
ftp.cs.ucla.edu in /pub. Ispell also contains a list of international
dictionaries and their availability in the file
ispell/languages/Where.
The following sites also have dictionaries for ispell available via
anonymous ftp:
language site file name
french ireq-robot.hydro.qc.ca /pub/ispell
french ftp.inria.fr /INRIA/Projects/algo/INDEX/iepelle
french ftp.inria.fr /gnu/ispell3.0-french.tar.gz
german ftp.vlsivie.tuwien.ac.at /pub/8bit/dicts/deutsch.tar.gz
(spanish ftp.vlsivie.tuwien.ac.at /pub/8bit/dicts/spanish.shar.gz)
Some spell checkers use strange encodings for accented characters. If
you have to use one of these spell checkers, you may have to run
recode before invoking the spell checker to generate a file using your
spell checker's coding conventions. After running the spell checker,
you have to translate the file back to ISO with recode.
Of course, this can be automated with a shell script:
---------------------
recode $i tmp.file
spell_check
recode tmp.file $i
---------------------
Footnote: Ispell 4.* is not a superset of ispell 3.*. Ispell 4.* was
developed independently from a common ancestor; it DOES NOT
support any internationalization and is restricted to the
English language.
13. TCP and ISO 8859-1
TCP was specified by US-Americans, for US-Americans. TCP still carries
this heritage: while the TCP/IP protocol itself *is* 8 bit clean, no
effort was made to support the transfer of non-English characters in
many application level protocols (mail, news, etc.). Some of these
protocols still only specify the transfer of 7-bit data, leaving
anything else implementation dependent.
Since the TCP/IP protocol itself transfers 8 bit data correctly,
writing applications based on TCP/IP does not lead to any loss of
encoding information.
13.1 FTP and ISO 8859-1
FTP has support for transferring 8 bit binary data. This mode should be
used when transferring ISO coded data between two hosts. This mode is
normally enabled by the command "binary".
Note, however, that use of the binary mode for text files will disable
translation between the line-ending conventions of different operating
systems. You might have to provide some filter to convert between the
LF-only convention of Unix and the CR-LF convention of VMS and MS
Windows when you copy from one of these systems to another.
13.2 Mail and ISO 8859-1
The original sendmail protocol specification (SMTP) in RFC 821
specified the transfer of only 7 bit messages. Many sendmail
implementations have been made 8 bit transparent (see RFC 1428), but
some SMTP handling agents are still strictly conforming to the
(somewhat outdated) RFC 821 and intentionally cut off the 8th bit.
This behavior stymies all efforts to transfer messages containing
national characters. Thus, only if all SMTP agents between mail
originator and mail recipient are 8 bit clean, will messages be
transferred correctly. Otherwise, accented characters are mapped to
some ASCII character (e.g. Umlaut a -> 'd'), but the rest of the
message is still transferred correctly.
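The Umlaut a -> 'd' example follows directly from the code values: ä is
0xE4 in ISO 8859-1, and clearing the eighth bit leaves 0x64, which is
ASCII 'd'. A short Python sketch of what a strictly 7 bit agent does to a
message (the function name is invented for this example):

```python
def strip_eighth_bit(data):
    """Clear the top bit of every byte, as a strict 7 bit agent would."""
    return bytes(b & 0x7F for b in data)

message = "B\u00e4r".encode("latin-1")   # 'B', a-umlaut, 'r'
print(strip_eighth_bit(message))
```

Plain ASCII bytes pass through unchanged, which is why only the accented
characters in a message are damaged.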
A new, enhanced (and compatible) SMTP standard, ESMTP, has been
released as RFC 1425. This standard defines and standardizes 8 bit
extensions. This should be the mail protocol of choice for newly
shipped versions of sendmail.
DEC Ultrix sendmail still implements the somewhat outdated RFC 821 to
the letter, and thus cuts off the eighth bit of all mail passing
through it. Thus ISO encoded mail will always lose the accent marks
when transferred through a DEC host.
Much of the European and Latin American network infrastructure
supports the transfer of 8 bit mail messages; the success rate is
somewhat lower for the US.
The MIME standard defines a mail transfer protocol which can handle
different character sets and multimedia mail, independent of the
network infrastructure. This protocol should eventually solve
problems with 7-bit mailers etc. Unfortunately, no mail transfer
agents (mail routers) and few end user mail readers support this
standard. Source for supporting MIME (the `metamail' package) in
various mail readers is available via anonymous ftp from
thumper.bellcore.com in /pub/nsb. MIME is specified in RFC 1521 which
is available from ftp.uu.net.
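One of the ways MIME gets 8 bit text through 7 bit mailers is the
"quoted-printable" transfer encoding, which writes each byte above 127 as
"=" followed by two hexadecimal digits. The sketch below uses Python's
standard quopri module to show the idea on an ISO 8859-1 string; it covers
only the body encoding, not the MIME headers that announce the character
set, and the function names are invented for the example.

```python
import quopri

def to_seven_bit(text):
    """Encode ISO 8859-1 text as 7 bit quoted-printable bytes."""
    return quopri.encodestring(text.encode("latin-1"))

def from_seven_bit(encoded):
    """Decode quoted-printable bytes back to ISO 8859-1 text."""
    return quopri.decodestring(encoded).decode("latin-1")

encoded = to_seven_bit("Gr\u00fc\u00dfe")   # u-umlaut and sharp s
print(encoded)
print(from_seven_bit(encoded))
```

Because the encoded form contains only ASCII, a mailer that cuts off the
eighth bit leaves it intact, and the receiving mail reader can restore the
accented characters.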
PS: If your computer is running DEC Ultrix and you want it to handle
ISO characters properly, you can get the source for
/usr/lib/sendmail from its home at UCB and many other FTP sites. OR,
you can simply call DEC, complain that their standard mail system
cannot handle international 8 bit mail, encourage them to implement 8
bit transparent SMTP, or (even better) ESMTP, and ask for the sendmail
patch which makes their current sendmail 8 bit transparent.
(Reportedly, such a patch is available from DEC for those who ask.)
Newer versions of sendmail support ESMTP negotiation and can pass 8
bit data. However, they do not (yet?) support downgrading of 8 bit
MIME messages.
13.3 News and ISO 8859-1
Much as mail, the Usenet news protocol specification is 7 bit based,
but a significant part of the infrastructure has recently been
upgraded to 8 bit service... Thus, accented characters are transferred
correctly between much of Europe (and Latin America), but accents
sometimes get lost in networks which run old news software (BNews).
ISO 8859-1 is _the_ standard for typing accented characters in most
newsgroups (may be different for MS-DOS centered newsgroups ;-), and
is preferred in most European news group hierarchies, such as at.* or
de.*
For those who speak French, there is an excellent FAQ on using ISO
8859-1 coded characters on Usenet by François Yergeau. This FAQ is
regularly posted in soc.culture.french and other relevant newsgroups.
13.4 WWW (and other information servers)
The WWW protocol can transfer 8 bit data without any problems and you
can advertise ISO-8859-1 encoded data from your client. The display
of data is dependent upon the user client. xmosaic (freely available
from the NCSA), which is available for most UNIX platforms, uses an
ISO-8859-1 compliant font by default and will display data correctly.
13.5 rlogin
For rlogin to pass 8 bit data correctly, invoke it with 'rlogin -8' or
'rlogin -L'.
14. Some applications and ISO 8859-1
14.1 bash
You need version 1.14.1 or higher and set the locale correctly (see
section 4).
14.2 less
Set the LESSCHARSET environment variable with
'setenv LESSCHARSET latin1'.
14.3 metamail
To configure the metamail package for ISO 8859-1 input/output, set the
MM_CHARSET environment variable with 'setenv MM_CHARSET ISO-8859-1'.
Also, set the MM_AUXCHARSETS variable with 'setenv MM_AUXCHARSETS
iso-8859-1t'.
14.4 nn
Add the line
-----------------
set data-bits 8
-----------------
to your ~/.nn/init file for nn to be able to process 8 bit characters.
14.5 nroff
The GNU replacement for nroff, groff, has an option to generate ISO
8859-1 coded output, instead of plain ASCII. Thus, you can preview
nroff documents with correctly displayed accented characters. Invoke
groff with the 'groff -Tlatin1' option to achieve this.
Groff is free software. It is available via anonymous ftp from
prep.ai.mit.edu in /pub/gnu and many other GNU archives around the
world.
14.6 sendmail
BSD Sendmail v8 has a flag in the configuration file set to True or
False which determines whether v8 passes any 8-bit data it encounters,
presumably to match the behavior of other 8-bit-transparent MTAs and to
meet the wants of non-ASCII users, or if it strips to 7 bits to conform
to SMTP.
14.7 tcsh
You need version 6.04 or higher, and your locale has to be set properly (see
section 4). Tcsh also needs to be compiled with the national language
support feature, see the config.h file in the tcsh source directory.
15. Terminals
15.1a X11/xterm
If you are using X11 and xterm as your terminal emulator, you should
place the following line in ~/.Xdefaults (this seems to be required in
some releases of X11, not in all):
-------------------------
XTerm*EightBitInput: True
-------------------------
15.1b rxvt
rxvt is another terminal emulator used for X11, used mostly under
Linux. Invoke rxvt with the 'rxvt -8' command line.
15.2 VT2xx, VT3xx
The character encoding used in VT2xx terminals is a preliminary
version of the ISO-8859-1 standard, so some characters (the more
obscure ones) differ slightly. However, these terminals can be used
with ISO 8859-1 characters without problems.
The newer VT3xx terminals use the official ISO 8859-1 standard.
The international versions of the VT[23]xx terminals have a COMPOSE
key which can be used to enter accented characters, e.g.
COMPOSE e ' will give an e with acute accent (é).
15.3 Various UNIX terminals
Some terminals support down-loadable fonts. If characters sent to
these terminals can be 8 bit wide, you can down-load your own ISO
character set. To see how this can be achieved, take a look at
/pub/culture/russian/comp/cyril-term on nic.funet.fi.
15.4 MS-DOS PCs
MS-DOS PCs normally use a different encoding for accented characters,
so there are two options:
* you can use a terminal emulator which will translate between the
different encodings.
* you can reconfigure your MS-DOS PC to use an ISO-8859-1 code page. Check
out the anonymous ftp archive ftp.uni-erlangen.de, which contains
data on how to do this (and other ISO-related stuff) in
/pub/doc/ISO/charsets. The README file contains an index of the
files you need.
16. Programming applications which support the use of ISO 8859-1
For information on how to write applications with support for
localization (to the ISO 8859-1 and other character representations)
check out the file /pub/8bit/ISO-programming available via anonymous
ftp from ftp.vlsivie.tuwien.ac.at.
17. Comments
This FAQ is somewhat Sun-centered, though I have tried to include
other machine types. If you have figured out how to configure your
machine type, please let me (mike@vlsivie.tuwien.ac.at) know so that I
can include it in future revisions of this FAQ.
18. Home location of this document
The most recent version of this document is available via anonymous
ftp from ftp.vlsivie.tuwien.ac.at under the file name
/pub/8bit/FAQ-ISO-8859-1
-----------------
Copyright © 1994 Michael Gschwind (mike@vlsivie.tuwien.ac.at)
This document may be copied for non-commercial purposes, provided this
copyright notice appears.
Dieses Dokument darf unter Angabe dieser urheberrechtlichen
Bestimmungen zum Zwecke der nicht-kommerziellen Nutzung beliebig
vervielfältigt werden.
Michael Gschwind, Institut f. Technische Informatik, TU Wien
snail: Treitlstrasse 3-182-2 || A-1040 Wien || Austria
email: mike@vlsivie.tuwien.ac.at note: real time != real fast
phone: +(43)(1)58801 8156 fax: +(43)(1)569 697