GNU roff special character and glyph repertoire
Name
groff_char - GNU roff special character and glyph repertoire
Description
The GNU roff typesetting system has a large glyph repertoire
suitable for production of varied literary, professional,
technical, and mathematical documents. However, its input
character set is restricted to that defined by the standards ISO
Latin-1 (ISO 8859-1) and IBM code page 1047 (an arrangement of
EBCDIC). For ease of document maintenance in UTF-8 environments,
it is advisable to use only the Unicode basic Latin code points,
a subset of all of the foregoing historically referred to as
US-ASCII, which has only 94 visible, printable code points.
AT&T troff in the 1970s faced a similar problem of typesetter
devices with a glyph repertoire differing from that of the
computers that controlled them. The solution troff adopted was a
form of escape sequence known as a special character to access
several dozen additional glyphs available in the fonts prepared
for mounting in the phototypesetter. These glyphs were mapped
onto a two-character name space for a degree of mnemonic
convenience; for example, the escape sequence \(aa encoded an
acute accent and \(sc a section sign. (Characters that don't
require an escape sequence for their expression, like 'a', are
termed 'ordinary'.)
As in other respects, groff has removed historical roff
limitations on the lengths of special character escape sequences,
but recognizes and retains compatibility with the historical
names. groff expands the lexicon of glyphs available by name and
permits users to define their own special character escape
sequences with the .char request.
This document lists all of the glyph names predefined by groff
and describes the systematic notation by which it enables access
to arbitrary Unicode code points and construction of composite
glyphs. The glyphs listed in this document may not be available,
or may vary in appearance, depending on the output driver chosen
when the page was rendered (with the -T option to the man(1) or
roff programs). The driver used in generation of this page was
'utf8'.
A few escape sequences that are not groff special characters also
produce glyphs; these exist for syntactical or historical
reasons. \', \`, \-, and \_ are translated on input to the
special characters \[aa], \[ga], \[-], and \[ul], respectively.
Others include \\, \. (backslash-dot), and \e; see groff(7). A
small number of special characters represent glyphs that are not
encoded in Unicode; examples include the baseline rule \[ru] and
the Bell System logo \[bs].
In groff, you can test output driver support for any character
(ordinary or special) with the conditional expression operator
'c'.
.ie c \[bs] \{Welcome to the \[bs] Bell System;
did you get the Wehrmacht helmet or the Death Star?\}
.el No Bell System logo.
For brevity in the remainder of this document, we shall refer to
systems conforming to the ISO 646:1991 IRV, ISO 8859, or ISO
10646 ('Unicode') character encoding standards as 'ISO' systems,
and those employing IBM code page 1047 as 'EBCDIC' systems. That
said, EBCDIC systems that support groff are known to also support
UTF-8.
While groff accepts eight-bit encoded input, not all such code
points are valid as input. On ISO platforms, character codes 0,
11, 13–31, and 128–159 are invalid. (This is all C0 and C1
controls except for SOH through LF [Control+A to Control+J], and
FF [Control+L].) On EBCDIC platforms, 0, 8–9, 11, 13–20, 23–31,
and 48–63 are invalid. Some of these code points are used by
groff for internal purposes, which is one reason it does not
support UTF-8 natively.
Fundamental character set
The ninety-four characters catalogued above, plus the space, tab,
and newline, form the fundamental character set for groff input;
anything in the language, even over one million code points in
Unicode, can be expressed using it. On ISO systems, code points
in the range 33–126 comprise a common set of printable glyphs in
all of the aforementioned ISO character encoding standards. It
is this character set and (with some noteworthy exceptions) the
corresponding glyph repertoire for which AT&T troff was
implemented. On EBCDIC systems, printable characters are in the
range 66–201 and 203–254; those without counterparts in the ISO
range 33–126 are discussed in the next subsection.
All of the following characters map to glyphs as you would
expect.
┌──────────────────────────────────────────────────────────┐
│! # $ % & ( ) * + , . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ │
│A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ ] _ │
│a b c d e f g h i j k l m n o p q r s t u v w x y z { | } │
└──────────────────────────────────────────────────────────┘
The remaining seven of the ninety-four code points in this range
surprise computing professionals and others intimately familiar
with the ISO character encodings. The developers of AT&T troff
chose mappings for them that would be useful for typesetting
technical literature in a broad range of scientific disciplines;
the preparation of AT&T's patent filings with the U.S. government
was the application of the system that 'paid the bills' at the
Bell Labs site where troff and Unix were first developed. It is
also worth noting that the prevailing character encoding standard
in the 1970s, USAS X3.4-1968 ('ASCII') deliberately supported
semantic ambiguity at some code points, and outright substitution
at several others, to suit the localization demands of various
national standards bodies.
The table below presents the seven exceptional code points with
their typical keycap engravings, their glyph mappings and
semantics in roff systems, and the escape sequences producing the
Unicode basic Latin character they replace. The first, the
neutral double quote, is a partial exception because it does
represent itself, but since it is also used by roff systems to
quote macro arguments, groff supports a special character escape
as an alternative form so that the glyph can be easily included
in macro arguments without requiring the user to master the
quoting rules that AT&T troff required in that context.
Furthermore, not all of the special character escape sequences
are portable to AT&T troff and all of its descendants; these
groff extensions are presented using its special character escape
form \[], whereas portable special character escape sequences are
shown in the traditional \( form. \- and \e are portable to all
known troffs. \e means 'the glyph of the current escape
character'; it therefore can produce unexpected output if the .ec
or .eo requests are used. On devices with a limited glyph
repertoire, the appearances of glyphs on the same row of the
table may be identical; except for the neutral double quote, this
will not be the case on more-capable devices. Review your
document using as many different output drivers as possible.
┌────────────────────────────────────────────────────────────────┐
│Keycap   Appearance and meaning   Special character and meaning │
├────────────────────────────────────────────────────────────────┤
│"        " neutral double quote   \[dq]   neutral double quote  │
│'        ’ closing single quote   \[aq]   neutral apostrophe    │
│-        ‐ hyphen                 \- or \[-]   minus sign       │
│\        (escape character)       \e or \[rs]   reverse solidus │
│^        ˆ modifier circumflex    \(ha   circumflex/caret/'hat' │
│`        ‘ opening single quote   \(ga   grave accent           │
│~        ˜ modifier tilde         \(ti   tilde                  │
└────────────────────────────────────────────────────────────────┘
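As a small illustration of the caveat about \e, the sketch below changes the escape character with the .ec request. The choice of '@' is arbitrary and only for demonstration. While the change is in effect, escape sequences must be introduced with '@', so the special character for the reverse solidus is written @[rs].

```roff
.\" Sketch only: "@" is an arbitrary, illustrative escape character.
By default, \e and \[rs] both yield a backslash.
.ec @
Now @e yields an at sign, whereas @[rs] still yields a backslash.
.ec
```

The final .ec with no argument restores the backslash as the escape character. If a literal backslash is what you need regardless of .ec, the special character form is the safer choice in groff.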
The hyphen-minus is a particularly unfortunate case of
overloading. Its awkward name in ISO 8859 and later standards
reflects the many conflicting purposes to which it had already
been put in the 1980s, including a hyphen, a minus sign, and
(alone or in repetition) dashes of varying widths. For best
results in groff, use the '-' character in input without an
escape only to mean a hyphen, as in the phrase 'long-term'. For
a minus sign in running text or a Unix command-line option dash,
use \- (or \[-] in groff if you find it helps the clarity of the
source document). (Another minus sign, for use in mathematical
equations, is available as \[mi].) AT&T troff supported em-dashes
as \(em, as does groff.
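The distinctions above might look as follows in running text. This is a minimal sketch; the sentences are invented for illustration, and the rendered appearance depends on the output driver.

```roff
.\" Hyphen, option dash, mathematical minus, and em-dash, in that order.
This is a long-term plan.
Run "ls \-l" to see the details.
The temperature fell to \[mi]5 degrees.
It works\(emmostly.
```

Because \( always takes exactly two characters as the glyph name, the input \(emmostly is read as the em-dash followed by the word 'mostly'.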
The special character escape for the apostrophe as a neutral
single quote is typically needed only in technical content;
typing words like 'can't' and 'Anne's' in a natural way will
render correctly, because in ordinary prose an apostrophe is
typeset either as a closing single quotation mark or as a neutral
single quote, depending on the capabilities of the output device.
By contrast, special character escape sequences should be used
for quotation marks unless portability to limited or historical
troff implementations is necessary; on those systems, the input
convention is to pair the grave accent with the apostrophe for
single quotes, and to double both characters for double quotes.
AT&T troff defined no special characters for quotation marks or
the apostrophe. Repeated single quotes (‘‘thus’’) will be
visually distinguishable from double quotes (“thus”) on terminal
devices, and perhaps on others (depending on the font selected).
┌────────────────────────────────────────────────────────────────┐
│AT&T troff input           recommended groff input              │
├────────────────────────────────────────────────────────────────┤
│A Winter's Tale            A Winter's Tale                      │
│`U.K. outer quotes'        \[oq]U.K. outer quotes\[cq]          │
│`U.K. ``inner'' quotes'    \[oq]U.K. \[lq]inner\[rq] quotes\[cq]│
│``U.S. outer quotes''      \[lq]U.S. outer quotes\[rq]          │
│``U.S. `inner' quotes''    \[lq]U.S. \[oq]inner\[cq] quotes\[rq]│
└────────────────────────────────────────────────────────────────┘
If you expect to use quotation marks frequently in your document,
see if the macro package you're using defines strings or macros
to facilitate quotation.
Using Unicode basic Latin characters to compose boxes and lines
is ill-advised. roff systems have special characters for drawing
straight horizontal and vertical lines; see subsection 'Rules and
lines' below. Preprocessors like tbl(1) and pic(1) draw boxes
and will produce the best possible output for the device, falling
back to basic Latin glyphs only when necessary.
Eight-bit encodings and Latin-1 supplement
ISO 646 is a seven-bit code encoding 128 code points; eight-bit
codes are twice the size. ISO 8859-1 and code page 1047
allocated the additional space to what Unicode calls 'C1
controls' (control characters) and the 'Latin-1 supplement'. The
C1 controls are neither printable nor usable as groff input.
Two characters in the Latin-1 supplement are handled specially.
troff never produces them as output.
NBSP encodes the no-break space. On input it is mapped to \~,
the adjustable non-breaking space escape sequence.
SHY encodes the soft hyphen character. On input it is mapped
to \%, the hyphenation control escape sequence.
The remaining characters in the Latin-1 supplement represent
themselves. Although they can be specified directly with the
keyboard on systems configured to use Latin-1 as the character
encoding, it is more portable, both to other roff systems and to
UTF-8 environments, to use their glyph names, shown below.
¡ \[r!] inverted exclamation mark Ñ \[~N] N tilde
¢ \[ct] cent sign Ò \[`O] O grave
£ \[Po] pound sign Ó \['O] O acute
¤ \[Cs] currency sign Ô \[^O] O circumflex
¥ \[Ye] yen sign Õ \[~O] O tilde
¦ \[bb] broken bar Ö \[:O] O dieresis
§ \[sc] section sign × \[mu] multiplication sign
¨ \[ad] dieresis accent Ø \[/O] O slash
© \[co] copyright sign Ù \[`U] U grave
ª \[Of] feminine ordinal indicator Ú \['U] U acute
« \[Fo] left double chevron Û \[^U] U circumflex
¬ \[no] logical not Ü \[:U] U dieresis
® \[rg] registered sign Ý \['Y] Y acute
¯ \[a-] macron accent Þ \[TP] uppercase thorn
° \[de] degree sign ß \[ss] lowercase sharp s
± \[+-] plus-minus à \[`a] a grave
² \[S2] superscript two á \['a] a acute
³ \[S3] superscript three â \[^a] a circumflex
´ \[aa] acute accent ã \[~a] a tilde
µ \[mc] micro sign ä \[:a] a dieresis
¶ \[ps] pilcrow sign å \[oa] a ring
· \[pc] centered period æ \[ae] ae ligature
¸ \[ac] cedilla accent ç \[,c] c cedilla
¹ \[S1] superscript one è \[`e] e grave
º \[Om] masculine ordinal indicator é \['e] e acute
» \[Fc] right double chevron ê \[^e] e circumflex
¼ \[14] one quarter symbol ë \[:e] e dieresis
½ \[12] one half symbol ì \[`i] i grave
¾ \[34] three quarters symbol í \['i] i acute
¿ \[r?] inverted question mark î \[^i] i circumflex
À \[`A] A grave ï \[:i] i dieresis
Á \['A] A acute ð \[Sd] lowercase eth
Â \[^A] A circumflex ñ \[~n] n tilde
Ã \[~A] A tilde ò \[`o] o grave
Ä \[:A] A dieresis ó \['o] o acute
Å \[oA] A ring ô \[^o] o circumflex
Æ \[AE] AE ligature õ \[~o] o tilde
Ç \[,C] C cedilla ö \[:o] o dieresis
È \[`E] E grave ÷ \[di] division sign
É \['E] E acute ø \[/o] o slash
Ê \[^E] E circumflex ù \[`u] u grave
Ë \[:E] E dieresis ú \['u] u acute
Ì \[`I] I grave û \[^u] u circumflex
Í \['I] I acute ü \[:u] u dieresis
Î \[^I] I circumflex ý \['y] y acute
Ï \[:I] I dieresis þ \[Tp] lowercase thorn
Ð \[-D] uppercase eth ÿ \[:y] y dieresis
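As a brief illustration of the glyph names tabulated above, the following sketch enters Latin-1 supplement characters using only basic Latin input. The sample sentences are invented for demonstration.

```roff
.\" Latin-1 supplement glyphs entered by name rather than typed directly.
\[r?]Qu\['e] tal, se\[~n]or?
Stra\[ss]e, Gr\[:u]\[ss]e, \[:O]l.
The part costs \[Po]5 \[+-] 10\[ct]; \[co] and \[rg] apply.
Bake at 200\[de] for 1\[12] hours.
```

Input written this way reads the same whether the source file is stored as Latin-1 or UTF-8, since every byte is a basic Latin character.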
Special character escape forms
Glyphs that lack a character code in the basic Latin repertoire
to directly represent them are entered by one of several special
character escape forms. Such glyphs can be simple or composite,
and accessed either by name or numerically by code point. Code
points and combining properties are determined by character
encoding standards, whereas glyph names originated in AT&T troff
special character escape sequences. Glyph names are not limited
to alphanumeric characters; any of the printable characters from
the Unicode basic Latin repertoire may be used.
\(gl
is a special character escape for the glyph with the
two-character name gl. This is the syntax form supported by
AT&T troff. The acute accent, \(aa, is an example.
\[glyph-name]
is a special character escape for glyph-name, which can be
of arbitrary length. The foregoing acute accent example
could be expressed in groff as \[aa].
An ordinary input character 'c' is not the same as \[c];
the latter is internally mapped to glyph name '\c'. In
other words, '\[a]' is not 'a', but rather \a, the
uninterpreted leader escape sequence. By default, groff
defines a single glyph name of length one, namely the
minus sign, which can be accessed as either \- or \[-].
\[base-glyph composite-1 composite-2 ... composite-n]
is a composite glyph. Glyphs like a lowercase 'e' with an
acute accent, as in the word 'café', can be expressed as
\[e aa]. See subsection 'Accents' below for a table of
combining glyph names.
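A short sketch of the composite form follows. It uses only the accent glyph names aa (acute) and ga (grave) that appear earlier in this document. The sample sentence is invented, and whether the result renders as intended depends on the output driver and font.

```roff
.\" Composite glyphs: a base glyph followed by an accent glyph name.
Voil\[a ga], the caf\[e aa] serves cr\[e ga]me caramel.
```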
Unicode encodes far more characters than groff has glyph names
for; special character escape forms based on numerical code
points enable access to any of them. Frequently used glyphs or
glyph combinations can be stored in strings, and new glyph names
can be created with the .char request, enabling the user to
devise ad hoc names for them; see groff(7).
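The two approaches just mentioned could be sketched as below. The string name 'cafe' and the glyph name 'degC' are hypothetical, chosen only for illustration; groff does not predefine either.

```roff
.\" Hypothetical names: string "cafe" and glyph "degC" are illustrative.
.ds cafe caf\[e aa]
.char \[degC] \[de]C
The \*[cafe] keeps its cellar at 12\[degC].
```

A string is interpolated with \*[name], whereas a glyph defined with .char is used through the ordinary special character escape form.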
\[unnnn[n[n]]]
is a Unicode numeric special character escape sequence.
With this form, any Unicode code point can be indicated using
four to six hexadecimal digits, with hexadecimal letters
accepted in uppercase form only. Thus, \[u02DA] accesses
the (spacing) ring accent, producing '˚'.
Unicode code points can be composed as well; when they are, troff
requires NFD (Normalization Form D), where all Unicode glyphs are
maximally decomposed. (Exception: precomposed characters in the
Latin-1 supplement described above are also accepted. Do not
count on this exception remaining in a future troff that accepts
UTF-8 input directly.) Thus, troff accepts 'caf\['e]',
'caf\[e aa]', and 'caf\[u0065_0301]' as ways to input 'café'.
(Due to its legacy 8-bit encoding compatibility, at present it
also accepts 'caf\[u00E9]' on ISO Latin-1 systems.)
\[ubase-glyph[_combining-component]...]
constructs a composite glyph from Unicode numeric special
character escape sequences. The code points of the base
glyph and the combining components are each expressed in
hexadecimal, with an underscore (_) separating each
component. Thus, \[u0065_0301] produces 'é'.
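The numeric forms described above might be used as in the following sketch. The code points are standard Unicode assignments, but the sample text is invented, and a given font or output driver may lack some of these glyphs.

```roff
.\" Unicode numeric escapes; hexadecimal letters must be uppercase.
An em-dash by code point: \[u2014].
A spacing ring accent: \[u02DA].
A base plus one combining mark (NFD): caf\[u0065_0301].
.\" a, then combining dot below, then combining circumflex.
A base plus two combining marks: \[u0061_0323_0302].
```

In the last line, the combining components are listed in the order that Normalization Form D prescribes for that character.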
\[charnnn]
expresses an eight-bit code point where nnn is the code
point of the character, a decimal number between 0 and 255
without leading zeroes. This legacy numeric special
character escape is used to map characters onto glyphs via
the .trin request in macro files loaded by grotty(1).