Diffstat (limited to 'doc/xorg-docs/specs/CTEXT/ctext.tbl.ms')
-rw-r--r-- doc/xorg-docs/specs/CTEXT/ctext.tbl.ms | 450
1 files changed, 450 insertions, 0 deletions
diff --git a/doc/xorg-docs/specs/CTEXT/ctext.tbl.ms b/doc/xorg-docs/specs/CTEXT/ctext.tbl.ms
new file mode 100644
index 000000000..de9dc3145
--- /dev/null
+++ b/doc/xorg-docs/specs/CTEXT/ctext.tbl.ms
@@ -0,0 +1,450 @@
+.\" $XdotOrg: xc/doc/specs/CTEXT/ctext.tbl.ms,v 1.2 2004/04/23 18:42:15 eich Exp $
+.\" Use tbl and -ms
+.sp 8
+.ce 5
+\s+2\fBCompound Text Encoding\fP\s-2
+.sp 6p
+Version 1.1
+X Consortium Standard
+X Version 11, Release 6.8
+Robert W. Scheifler
+.sp 2
+.LP
+Copyright \(co 1989 by X Consortium
+.LP
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the ``Software''), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+.LP
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+.LP
+THE SOFTWARE IS PROVIDED ``AS IS'', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+X CONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN
+AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+.LP
+Except as contained in this notice, the name of the X Consortium shall not be
+used in advertising or otherwise to promote the sale, use or other dealings
+in this Software without prior written authorization from the X Consortium.
+.sp 2
+.NH 1
+Overview
+.LP
+Compound Text is a format for multiple character set data, such as
+multi-lingual text. The format is based on ISO
+standards for encoding and combining character sets. Compound Text is intended
+to be used in three main contexts: inter-client communication using selections,
+as defined in the \fIInter-Client Communication Conventions Manual\fP (ICCCM);
+window properties (e.g., window manager hints as defined in the ICCCM);
+and resources (e.g., as defined in Xlib and the Xt Intrinsics).
+.LP
+Compound Text is intended as an external representation, or interchange format,
+not as an internal representation. It is expected (but not required) that
+clients will convert Compound Text to some internal representation for
+processing and rendering, and convert from that internal representation to
+Compound Text when providing textual data to another client.
+.NH 1
+Values
+.LP
+The name of this encoding is ``COMPOUND_TEXT''. When text values are used in
+the ICCCM-compliant selection mechanism or are stored as window properties in
+the server, the type used should be the atom for ``COMPOUND_TEXT''.
+.LP
+Octet values are represented in this document as two decimal numbers in the
+form col/row. This means the value (col * 16) + row. For example, 02/01 means
+the value 33.
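The col/row notation above can be sketched in a few lines of Python. This is only an illustration; the helper names `octet` and `col_row` are hypothetical and are not part of the specification.

```python
def octet(col: int, row: int) -> int:
    """Convert the spec's col/row notation to an octet value: (col * 16) + row."""
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError("col and row must each be in the range 0..15")
    return col * 16 + row


def col_row(value: int) -> str:
    """Format an octet value in the spec's two-digit decimal col/row notation."""
    if not 0 <= value <= 255:
        raise ValueError("octet value must be in the range 0..255")
    col, row = divmod(value, 16)
    return f"{col:02d}/{row:02d}"
```

For instance, `octet(2, 1)` gives 33, matching the example in the text, and `col_row(0x1B)` gives the notation used for ESC, `01/11`.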
+.LP
+For our purposes, the octet encoding space is divided into four ranges:
+.RS
+.TS
+l l.
+C0 octets from 00/00 to 01/15
+GL octets from 02/00 to 07/15
+C1 octets from 08/00 to 09/15
+GR octets from 10/00 to 15/15
+.TE
+.RE
+.LP
+C0 and C1 are ``control character'' sets, while GL and GR are ``graphic
+character'' sets. Only a subset of C0 and C1 octets are used in the encoding,
+and depending on the character set encoding defined as GL or GR, a subset of
+GL and GR octets may be used; see below for details. All octets (00/00 to
+15/15) may appear inside the text of extended segments (defined below).
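As a rough sketch of the four ranges listed above, the following Python function classifies an octet value. The function name `octet_range` is an assumption made for illustration; it ignores extended segments, inside which any octet may appear.

```python
def octet_range(value: int) -> str:
    """Return which of the four ranges (C0, GL, C1, GR) an octet falls in."""
    if not 0 <= value <= 0xFF:
        raise ValueError("octet value must be in the range 0..255")
    if value <= 0x1F:   # 00/00 to 01/15
        return "C0"
    if value <= 0x7F:   # 02/00 to 07/15
        return "GL"
    if value <= 0x9F:   # 08/00 to 09/15
        return "C1"
    return "GR"         # 10/00 to 15/15
```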
+.LP
+[For those familiar with ISO 2022, we will use only an 8-bit environment, and
+we will always use G0 for GL and G1 for GR.]
+.NH 1
+Control Characters
+.LP
+In C0, only the following values will be used:
+.RS
+.TS
+l l l.
+00/09 HT HORIZONTAL TABULATION
+00/10 NL NEW LINE
+01/11 ESC (ESCAPE)
+.TE
+.RE
+.LP
+In C1, only the following value will be used:
+.RS
+.TS
+l l l.
+09/11 CSI CONTROL SEQUENCE INTRODUCER
+.TE
+.RE
+.LP
+[The alternate 7-bit CSI encoding 01/11 05/11 is not used in Compound Text.]
+.LP
+No control sequences are defined in Compound Text for changing the C0 and C1
+sets.
+.LP
+A horizontal tab can be represented with the octet 00/09. Specification of
+tabulation width settings is not part of Compound Text and must be obtained
+from context (in an unspecified manner).
+.LP
+[Inclusion of horizontal tab is for consistency with the STRING type currently
+defined in the ICCCM.]
+.LP
+A newline (line separator/terminator) can be represented with the octet 00/10.
+.LP
+[Note that 00/10 is normally LINEFEED, but is being interpreted as NEWLINE.
+This can be thought of as using the (deprecated) NEW LINE mode, E.1.3, in ISO
+6429. Use of this value instead of 08/05 (NEL, NEXT LINE) is for consistency
+with the STRING type currently defined in the ICCCM.]
+.LP
+The remaining C0 and C1 values (01/11 and 09/11) are only used in the control
+sequences defined below.
+.NH 1
+Standard Character Set Encodings
+.LP
+The default GL and GR sets in Compound Text correspond to the left and right
+halves of ISO 8859-1 (Latin 1). As such, any legal instance of a STRING type
+(as defined in the ICCCM) is also a legal instance of type COMPOUND_TEXT.
+.LP
+.nf
+[The implied initial state in ISO 2022 is defined with the sequence:
+ 01/11 02/00 04/03 G0 and G1 in an 8-bit environment only. Designation also invokes.
+ 01/11 02/00 04/07 In an 8-bit environment, C1 represented as 8-bits.
+ 01/11 02/00 04/09 Graphic character sets can be 94 or 96.
+ 01/11 02/00 04/11 8-bit code is used.
+ 01/11 02/08 04/02 Designate ASCII into G0.
+ 01/11 02/13 04/01 Designate right-hand part of ISO Latin-1 into G1.
+]
+.fi
+.LP
+To define one of the approved standard character set encodings to be
+the GL set, one of the following control sequences is used:
+.RS
+.TS
+l l.
+01/11 02/08 {I} F 94 character set
+01/11 02/04 02/08 {I} F 94\u\s-2N\s+2\d character set
+.TE
+.RE
+.LP
+To define one of the approved standard character set encodings to be
+the GR set, one of the following control sequences is used:
+.RS
+.TS
+l l.
+01/11 02/09 {I} F 94 character set
+01/11 02/13 {I} F 96 character set
+01/11 02/04 02/09 {I} F 94\u\s-2N\s+2\d character set
+.TE
+.RE
+.LP
+The ``F'' in the control sequences above stands for ``Final character'', which
+is always in the range 04/00 to 07/14. The ``{I}'' stands for zero or more
+``intermediate characters'', which are always in the range 02/00 to 02/15, with
+the first intermediate character always in the range 02/01 to 02/03. The
+registration authority has defined an ``{I} F'' sequence for each registered
+character set encoding.
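One possible way to recognize the five designation sequences above is sketched below in Python. The function name `parse_designation` and its return shape are hypothetical; the sketch checks only the syntax described in this section and does not check that the ``{I} F'' value names an approved encoding.

```python
# Prefix octets for each designation sequence, with the set being defined
# and the kind of character set (01/11 is ESC, 02/04 is "$").
_PREFIXES = [
    (b"\x1b\x24\x28", "GL", "94^N"),  # 01/11 02/04 02/08
    (b"\x1b\x24\x29", "GR", "94^N"),  # 01/11 02/04 02/09
    (b"\x1b\x28", "GL", "94"),        # 01/11 02/08
    (b"\x1b\x29", "GR", "94"),        # 01/11 02/09
    (b"\x1b\x2d", "GR", "96"),        # 01/11 02/13
]


def parse_designation(data: bytes):
    """Parse a designation at the start of data.

    Returns (target, kind, i_f_octets, consumed_length), or None if data
    does not begin with a well-formed designation sequence.
    """
    for prefix, target, kind in _PREFIXES:
        if not data.startswith(prefix):
            continue
        start = i = len(prefix)
        # Zero or more intermediate characters in 02/00 to 02/15.
        while i < len(data) and 0x20 <= data[i] <= 0x2F:
            i += 1
        # The Final character must be in 04/00 to 07/14; private finals
        # (03/00 to 03/15) are rejected.
        if i >= len(data) or not 0x40 <= data[i] <= 0x7E:
            return None
        # The first intermediate character, if any, must be 02/01 to 02/03.
        if i > start and not 0x21 <= data[start] <= 0x23:
            return None
        return target, kind, data[start:i + 1], i + 1
    return None
```

For example, `parse_designation(b"\x1b-A")` reports a 96 character set defined as GR with Final character 04/01.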
+.LP
+[Final characters for private encodings (in the range 03/00 to 03/15) are not
+permitted here in Compound Text.]
+.LP
+For GL, octet 02/00 is always defined as SPACE, and octet 07/15 (normally
+DELETE) is never used. For a 94-character set defined as GR, octets 10/00 and
+15/15 are never used.
+.LP
+[This is consistent with ISO 2022.]
+.LP
+A 94\u\s-2N\s+2\d character set uses N octets (N > 1) for each character.
+The value of N is derived from the column value for F:
+.RS
+.TS
+l l.
+column 04 or 05 2 octets
+column 06 3 octets
+column 07 4 or more octets
+.TE
+.RE
+.LP
+In a 94\u\s-2N\s+2\d encoding, the octet values 02/00 and 07/15 (in GL) and
+10/00 and 15/15 (in GR) are never used.
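The column rule above could be expressed as follows. This is a sketch only: the name `octets_per_char` is hypothetical, and because column 07 means ``4 or more'', the function returns a flag saying whether the count is exact rather than guessing a value.

```python
def octets_per_char(final: int):
    """Derive N for a 94^N set from its Final character F.

    Returns (n, exact). For column 07 the text says "4 or more octets",
    so exact is False and n is only a lower bound.
    """
    if not 0x40 <= final <= 0x7E:
        raise ValueError("Final character must be in the range 04/00 to 07/14")
    column = final >> 4
    if column in (4, 5):
        return 2, True
    if column == 6:
        return 3, True
    return 4, False  # column 07
```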
+.LP
+[The column definitions come from ISO 2022.]
+.LP
+Once a GL or GR set has been defined, all further octets in that range (except
+within control sequences and extended segments) are interpreted with respect to
+that character set encoding, until the GL or GR set is redefined. GL and GR
+sets can be defined independently; they do not have to be defined in pairs.
+.LP
+Note that when actually using a character set encoding as the GR set, you must
+force the most significant bit (08/00) of each octet to be a one, so that it
+falls in the range 10/00 to 15/15.
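A minimal sketch of forcing the most significant bit, assuming the input octets are given in their 7-bit (GL-range) form. The helper name `to_gr` is hypothetical.

```python
def to_gr(data: bytes) -> bytes:
    """Set the most significant bit (08/00) of each octet so it falls in GR."""
    for b in data:
        if not 0x20 <= b <= 0x7F:
            raise ValueError("expected octets in the GL range 02/00 to 07/15")
    return bytes(b | 0x80 for b in data)
```

For example, the two octets 03/00 02/01 of a two-octet character become 11/00 10/01 when its character set is used as GR.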
+.LP
+[Control sequences to specify character set encoding revisions (as in section
+6.3.13 of ISO 2022) are not used in Compound Text. Revision indicators do not
+appear to provide useful information in the context of Compound Text. The most
+recent revision can always be assumed, since revisions are upward compatible.]
+.NH 1
+Approved Standard Encodings
+.LP
+The following are the approved standard encodings to be used with Compound
+Text. Note that none have Intermediate characters; however, a good parser will
+still deal with Intermediate characters in the event that additional encodings
+are later added to this list.
+.RS
+.TS
+l l l.
+_
+.sp 4p
+\fB{I} F\fP \fB94/96\fP \fBDescription\fP
+.sp 4p
+_
+.sp 6p
+04/02 94 7-bit ASCII graphics (ANSI X3.4-1968),
+ Left half of ISO 8859 sets
+04/09 94 Right half of JIS X0201-1976 (reaffirmed 1984),
+ 8-Bit Alphanumeric-Katakana Code
+04/10 94 Left half of JIS X0201-1976 (reaffirmed 1984),
+ 8-Bit Alphanumeric-Katakana Code
+.sp 6p
+04/01 96 Right half of ISO 8859-1, Latin alphabet No. 1
+04/02 96 Right half of ISO 8859-2, Latin alphabet No. 2
+04/03 96 Right half of ISO 8859-3, Latin alphabet No. 3
+04/04 96 Right half of ISO 8859-4, Latin alphabet No. 4
+04/06 96 Right half of ISO 8859-7, Latin/Greek alphabet
+04/07 96 Right half of ISO 8859-6, Latin/Arabic alphabet
+04/08 96 Right half of ISO 8859-8, Latin/Hebrew alphabet
+04/12 96 Right half of ISO 8859-5, Latin/Cyrillic alphabet
+04/13 96 Right half of ISO 8859-9, Latin alphabet No. 5
+.sp 6p
+04/01 94\u\s-22\s+2\d GB2312-1980, China (PRC) Hanzi
+04/02 94\u\s-22\s+2\d JIS X0208-1983, Japanese Graphic Character Set
+04/03 94\u\s-22\s+2\d KS C5601-1987, Korean Graphic Character Set
+.sp 6p
+_
+.TE
+.RE
+.LP
+The sets listed as ``Left half of ...'' should always be defined as GL. The
+sets listed as ``Right half of ...'' should always be defined as GR. Other
+sets can be defined either as GL or GR.
+.NH 1
+Non-Standard Character Set Encodings
+.LP
+Character set encodings that are not in the list of approved standard
+encodings can be included
+using ``extended segments''. An extended segment begins with one of the
+following sequences:
+.RS
+.TS
+l l.
+01/11 02/05 02/15 03/00 M L variable number of octets per character
+01/11 02/05 02/15 03/01 M L 1 octet per character
+01/11 02/05 02/15 03/02 M L 2 octets per character
+01/11 02/05 02/15 03/03 M L 3 octets per character
+01/11 02/05 02/15 03/04 M L 4 octets per character
+.TE
+.RE
+.LP
+[This uses the ``other coding system'' of ISO 2022, using private Final
+characters.]
+.LP
+The ``M'' and ``L'' octets represent a 14-bit unsigned value giving the number
+of octets that appear in the remainder of the segment. The number is computed
+as ((M - 128) * 128) + (L - 128). The most significant bits of M and L are always
+set to one. The remainder of the segment consists of two parts, the name of
+the character set encoding and the actual text. The name of the encoding comes
+first and is separated from the text by the octet 00/02 (STX, START OF TEXT).
+Note that the length defined by M and L includes the encoding name and
+separator.
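The framing of an extended segment might be sketched as below. All function names are hypothetical, and the encoding name `example-0` used in the usage note is made up for illustration, not a registered name.

```python
ESC, STX = 0x1B, 0x02


def encode_length(n: int) -> bytes:
    """Encode a 14-bit length as the octets M and L, each with its high bit set."""
    if not 0 <= n <= 0x3FFF:
        raise ValueError("length must fit in 14 bits")
    return bytes([(n // 128) + 128, (n % 128) + 128])


def decode_length(m: int, low: int) -> int:
    """Compute ((M - 128) * 128) + (L - 128)."""
    if not (128 <= m <= 255 and 128 <= low <= 255):
        raise ValueError("M and L must have their most significant bit set")
    return ((m - 128) * 128) + (low - 128)


def build_extended_segment(name: bytes, text: bytes, octets_per_char: int = 0) -> bytes:
    """Build a segment; octets_per_char 0 means a variable number of octets."""
    if not 0 <= octets_per_char <= 4:
        raise ValueError("octets_per_char must be 0 (variable) or 1..4")
    if STX in name:
        raise ValueError("encoding name must not contain STX")
    body = name + bytes([STX]) + text  # the length covers name, separator, text
    header = bytes([ESC, 0x25, 0x2F, 0x30 + octets_per_char])
    return header + encode_length(len(body)) + body


def parse_extended_segment(data: bytes):
    """Return (octets_per_char, name, text, remaining_data)."""
    if len(data) < 6 or data[:3] != bytes([ESC, 0x25, 0x2F]) or not 0x30 <= data[3] <= 0x34:
        raise ValueError("not an extended segment")
    length = decode_length(data[4], data[5])
    body = data[6:6 + length]
    if len(body) != length or STX not in body:
        raise ValueError("truncated segment or missing STX separator")
    name, text = body.split(bytes([STX]), 1)
    return data[3] - 0x30, name, text, data[6 + length:]
```

Calling `build_extended_segment(b"example-0", b"AB", 1)` yields a 12-octet remainder (9 for the name, 1 for STX, 2 for the text), so M is 128 and L is 140.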
+.LP
+[The encoding of the length is chosen to avoid having zero octets in Compound
+Text when possible, because embedded NUL values are problematic in many C
+language routines. The use of zero octets cannot be ruled out entirely,
+however, since some octets in the actual text of the extended segment may have
+to be zero.]
+.LP
+The name of the encoding should be registered with the X Consortium to avoid
+conflicts and should, when appropriate, match the CharSet Registry and Encoding
+registration used in the X Logical Font Description. The name itself should be
+encoded using ISO 8859-1 (Latin 1), should not use question mark (03/15) or
+asterisk (02/10), and should use hyphen (02/13) only in accordance with the X
+Logical Font Description.
+.LP
+Extended segments are not to be used for any character set encoding that can
+be constructed from a GL/GR pair of approved standard encodings. For
+example, it is incorrect to use an extended segment for any of the ISO 8859
+family of encodings.
+.LP
+It should be noted that the contents of an extended segment are arbitrary;
+for example,
+they may contain octets in the C0 and C1 ranges, including 00/00, and
+octets comprising a given character may differ in their most significant bit.
+.LP
+[ISO-registered ``other coding systems'' are not used in Compound Text;
+extended segments are the only mechanism for non-2022 encodings.]
+.NH 1
+Directionality
+.LP
+If desired, horizontal text direction can be indicated using the following
+control sequences:
+.RS
+.TS
+l l.
+09/11 03/01 05/13 begin left-to-right text
+09/11 03/02 05/13 begin right-to-left text
+09/11 05/13 end of string
+.TE
+.RE
+.LP
+[This is a subset of the SDS (START DIRECTED STRING) control in the Draft
+Bidirectional Addendum to ISO 6429.]
+.LP
+Directionality can be nested. Logically, a stack of directions is maintained.
+Each of the first two control sequences pushes a new direction on the stack,
+and the third sequence (revert) pops a direction from the stack. The stack
+starts out empty at the beginning of a Compound Text string. When the stack is
+empty, the directionality of the text is unspecified.
+.LP
+Directionality applies to all subsequent text, whether in GL, GR, or an
+extended segment. If the desired directionality of GL, GR, or extended
+segments differs, then directionality control sequences must be inserted when
+switching between them.
+.LP
+Note that definition of GL and GR sets is independent of directionality;
+defining a new GL or GR set does not change the current directionality, and
+pushing or popping a directionality does not change the current GL and GR
+definitions.
+.LP
+Specification of directionality is entirely optional; text direction should be
+clear from context in most cases. However, it must be the case that either
+all characters in a Compound Text string have explicitly specified direction
+or that all characters have unspecified direction. That is, if directionality
+control sequences are used, the first such control sequence must precede the
+first graphic character in a Compound Text string, and graphic characters are
+not permitted whenever the directionality stack is empty.
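The stack model and the all-or-nothing rule above can be sketched as a small checker. This is a simplified illustration under stated assumptions: the input contains only graphic octets, HT, NL, and the three directionality sequences (no designation sequences or extended segments), and popping an empty stack is treated as an error, which the text does not state explicitly. The name `check_directionality` is hypothetical.

```python
CSI = 0x9B
LTR = bytes([CSI, 0x31, 0x5D])  # 09/11 03/01 05/13, begin left-to-right text
RTL = bytes([CSI, 0x32, 0x5D])  # 09/11 03/02 05/13, begin right-to-left text
END = bytes([CSI, 0x5D])        # 09/11 05/13, end of string


def check_directionality(data: bytes) -> bool:
    """Return True if data obeys the directionality rules sketched above."""
    stack = []            # logical stack of directions
    used = False          # True once any push sequence has been seen
    seen_graphic = False  # True once any graphic character has been seen
    i = 0
    while i < len(data):
        if data.startswith(LTR, i) or data.startswith(RTL, i):
            # The first directionality sequence must precede all graphics.
            if seen_graphic and not used:
                return False
            stack.append("ltr" if data[i + 1] == 0x31 else "rtl")
            used = True
            i += 3
        elif data.startswith(END, i):
            if not stack:  # assumption: popping an empty stack is invalid
                return False
            stack.pop()
            i += 2
        else:
            b = data[i]
            if b >= 0x20 and not 0x80 <= b <= 0x9F:  # octet in GL or GR
                # Once directionality is in use, no graphics on an empty stack.
                if used and not stack:
                    return False
                seen_graphic = True
            i += 1
    return True
```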
+.NH 1
+Resources
+.LP
+To use Compound Text in a resource, you can simply treat all octets as if they
+were ASCII/Latin-1 and just replace all ``\\'' octets (05/12) with the two
+octets ``\\\\'', all newline octets (00/10) with the two octets ``\\n'', and
+all zero octets with the four octets ``\\000''.
+It is up to the client making use of the resource to interpret the data as
+Compound Text; the policy by which this is ascertained is not constrained by
+the Compound Text specification.
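The three replacements described above might look like this in Python. The function name `escape_for_resource` is hypothetical, and the sketch performs only the substitutions named in the text.

```python
def escape_for_resource(data: bytes) -> bytes:
    """Escape Compound Text octets for inclusion in a resource value."""
    out = bytearray()
    for b in data:
        if b == 0x5C:      # backslash (05/12) becomes two backslashes
            out += b"\\\\"
        elif b == 0x0A:    # newline (00/10) becomes backslash, "n"
            out += b"\\n"
        elif b == 0x00:    # zero octet becomes backslash, "000"
            out += b"\\000"
        else:
            out.append(b)
    return bytes(out)
```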
+.NH 1
+Font Names
+.LP
+The following CharSet names for the standard character set encodings are
+registered for use in font names under the X Logical Font Description:
+.RS
+.TS
+l l l.
+_
+.sp 6p
+\fBName\fP \fBEncoding Standard\fP \fBDescription\fP
+.sp 6p
+_
+.sp 6p
+ISO8859-1 ISO 8859-1 Latin alphabet No. 1
+ISO8859-2 ISO 8859-2 Latin alphabet No. 2
+ISO8859-3 ISO 8859-3 Latin alphabet No. 3
+ISO8859-4 ISO 8859-4 Latin alphabet No. 4
+ISO8859-5 ISO 8859-5 Latin/Cyrillic alphabet
+ISO8859-6 ISO 8859-6 Latin/Arabic alphabet
+ISO8859-7 ISO 8859-7 Latin/Greek alphabet
+ISO8859-8 ISO 8859-8 Latin/Hebrew alphabet
+ISO8859-9 ISO 8859-9 Latin alphabet No. 5
+JISX0201.1976-0 JIS X0201-1976 (reaffirmed 1984) 8-bit Alphanumeric-Katakana Code
+GB2312.1980-0 GB2312-1980, GL encoding China (PRC) Hanzi
+JISX0208.1983-0 JIS X0208-1983, GL encoding Japanese Graphic Character Set
+KSC5601.1987-0 KS C5601-1987, GL encoding Korean Graphic Character Set
+.sp 6p
+_
+.TE
+.RE
+.LP
+.NH 1
+Extensions
+.LP
+There is no absolute requirement for a parser to deal with anything but the
+particular encoding syntax defined in this specification. However, it is
+possible that Compound Text may be extended in the future, and as such it may
+be desirable to construct the parser to handle 2022/6429 syntax more generally.
+.LP
+There are two general formats covering all control sequences that are expected
+to appear in extensions:
+.LP
+01/11 {I} F
+.IP
+For this format, I is always in the range 02/00 to 02/15, and F is always
+in the range 03/00 to 07/14.
+.LP
+09/11 {P} {I} F
+.IP
+For this format, P is always in the range 03/00 to 03/15, I is always in
+the range 02/00 to 02/15, and F is always in the range 04/00 to 07/14.
+.LP
+In addition, new (singleton) control characters (in the C0 and C1 ranges) might
+be defined in the future.
+.LP
+Finally, new kinds of ``segments'' might be defined in the future using syntax
+similar to extended segments:
+.LP
+01/11 02/05 02/15 F M L
+.IP
+For this format, F is in the range 03/05 to 03/15. M and L are as defined
+in extended segments. Such a segment will always be followed by the number
+of octets defined by M and L. These octets can have arbitrary values and
+need not follow the internal structure defined for current extended
+segments.
+.LP
+If extensions to this specification are defined in the future, then any string
+incorporating instances of such extensions must start with one of the following
+control sequences:
+.RS
+.TS
+l l.
+01/11 02/03 V 03/00 ignoring extensions is OK
+01/11 02/03 V 03/01 ignoring extensions is not OK
+.TE
+.RE
+.LP
+In either case, V is in the range 02/00 to 02/15 and indicates the major
+version minus one of the specification being used. These version control sequences are
+for use by clients that implement earlier versions, but have implemented a
+general parser. The first control sequence indicates that it is acceptable to
+ignore all extension control sequences; no mandatory information will be lost
+in the process. The second control sequence indicates that it is unacceptable
+to ignore any extension control sequences; mandatory information would be lost
+in the process. In general, it will be up to the client generating the
+Compound Text to decide which control sequence to use.
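A general parser could detect a leading version control sequence roughly as follows. The name `parse_version_sequence` is hypothetical, and the sketch assumes that the row of V (its low four bits) carries the major version minus one, so that 02/00 denotes major version 1.

```python
def parse_version_sequence(data: bytes):
    """Return (major_version, may_ignore_extensions), or None if data does
    not start with a version control sequence (01/11 02/03 V F)."""
    if len(data) < 4 or data[0] != 0x1B or data[1] != 0x23:
        return None
    v, final = data[2], data[3]
    if not 0x20 <= v <= 0x2F or final not in (0x30, 0x31):
        return None
    major = (v - 0x20) + 1  # assumption: V encodes the major version minus one
    return major, final == 0x30  # 03/00: ignoring extensions is OK
```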
+.NH 1
+Errors
+.LP
+If a Compound Text string does not match the specification here (e.g., uses
+undefined control characters, or undefined control sequences, or incorrectly
+formatted extended segments), it is best to treat the entire string as invalid,
+except as indicated by a version control sequence.