author Ingo Schwarze <schwarze@cvs.openbsd.org> 2017-05-31 17:58:57 +0000
committer Ingo Schwarze <schwarze@cvs.openbsd.org> 2017-05-31 17:58:57 +0000
commit 8b03cf129e5e2d5a745565e8568bd6f752b55f19
tree 051e31a5047ac1d40cab3ec626010efc72303b6f /share/man/man7
parent b6b0addf6ab08f91536d8cb34d0dc8526ecb64f7
about ten different improvements; OK tedu@ espie@ bentley@
Diffstat (limited to 'share/man/man7')
-rw-r--r-- share/man/man7/utf8.7 56
1 files changed, 29 insertions, 27 deletions
diff --git a/share/man/man7/utf8.7 b/share/man/man7/utf8.7
index 567edf41af0..d27891dd8f0 100644
--- a/share/man/man7/utf8.7
+++ b/share/man/man7/utf8.7
@@ -1,4 +1,4 @@
-.\" $OpenBSD: utf8.7,v 1.5 2017/05/31 17:16:48 schwarze Exp $
+.\" $OpenBSD: utf8.7,v 1.6 2017/05/31 17:58:56 schwarze Exp $
.\"
.\" Copyright (c) 2017 Ted Unangst <tedu@openbsd.org>
.\"
@@ -21,34 +21,36 @@
.Nm utf8
.Nd UTF-8 text encoding
.Sh DESCRIPTION
-UTF-8 is a multibyte encoding for Unicode text.
+UTF-8 is a multibyte character encoding for Unicode text.
It is the preferred format for non ASCII text.
.Pp
-The length of a UTF-8 sequence varies depending on the encoded value.
-If the high bit of the first byte is zero, the sequence length is one and
-the value is the remaining seven bits.
-If the high bit is set, then the number of high bits set, followed by a zero
-bit, indicates the length of the sequence and the value is formed by combining
-the low bits of each byte.
-Continuation bytes all have the same format, with the top two bits set and
-unset, respectively, and six value bits.
-.Pp
-Unicode ranges and their encoding formats:
+Unicode codepoints are encoded as follows:
.Bl -tag -width Ds
-.It 0x0 - 0x7f
-One byte.
-0.......
-.It 0x80 - 0x7ff
-Two bytes.
-110..... 10.......
-.It 0x800 - 0xffff
-Three bytes.
-1110.... 10...... 10......
-.It 0x1000 - 0x10ffff
-Four bytes.
-11110... 10...... 10...... 10......
+.It U+0000 \(en U+007F:
+One byte: 0....... (compatible with ASCII)
+.It U+0080 \(en U+07FF:
+Two bytes: 110..... 10.......
+.It U+0800 \(en U+D7FF and U+E000 \(en U+FFFF:
+Three bytes: 1110.... 10...... 10......
+.It U+10000 \(en U+10FFFF:
+Four bytes: 11110... 10...... 10...... 10......
.El
+.Pp
+The bits shown as dots contain the codepoint represented as a binary
+integer.
+.Pp
+Bytes starting with the bit pattern 11...... are called UTF-8 start
+bytes, and those starting with 10...... UTF-8 continuation bytes.
+The number of leading 1 bits in a start byte indicates the total
+number of bytes used to encode the codepoint, including the start
+byte.
+.Pp
+Encodings using more bytes than required are invalid.
+In particular, 11000000 and 11000001 are not valid start bytes,
+the byte after 11100000 must be at least 10100000,
+and the byte after 11110000 must be at least 10010000.
.Sh SEE ALSO
+.Xr locale 1 ,
.Xr ascii 7
.Sh STANDARDS
.Rs
@@ -58,6 +60,6 @@ Four bytes.
.%T UTF-8, a transformation format of ISO 10646
.Re
.Pp
-The Unicode Standard.
-.Sh CAVEATS
-Beware of overlong encodings.
+.Lk http://www.unicode.org/versions/latest/ "The Unicode Standard"
+.Pp
+.Lk http://www.unicode.org/reports/tr44/ "The Unicode Character Database"
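
The encoding table and the overlong rules added by this patch can be illustrated with a short sketch. It is not part of the commit; the function names utf8_encode and is_overlong are hypothetical, chosen only for this example. The sketch assumes that each continuation byte carries six payload bits (10......), as the text removed by the patch states, and that the surrogate range U+D800 to U+DFFF is rejected, as the three-byte row of the new table implies.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one codepoint using the byte patterns from the utf8.7 table.

    Hypothetical helper for illustration; Python's own str.encode("utf-8")
    is what real code would normally use.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if cp <= 0x7F:
        # One byte: 0.......
        return bytes([cp])
    if cp <= 0x7FF:
        # Two bytes: 110..... 10......
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # Three bytes: 1110.... 10...... 10......
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110... 10...... 10...... 10......
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def is_overlong(seq: bytes) -> bool:
    """Apply the three overlong rules from the patch to a byte sequence.

    Only the start byte and the byte after it are inspected; this is not
    a complete UTF-8 validator.
    """
    if not seq:
        return False
    first = seq[0]
    # 11000000 and 11000001 are never valid start bytes.
    if first in (0xC0, 0xC1):
        return True
    if len(seq) < 2:
        return False
    second = seq[1]
    # The byte after 11100000 must be at least 10100000.
    if first == 0xE0 and second < 0xA0:
        return True
    # The byte after 11110000 must be at least 10010000.
    if first == 0xF0 and second < 0x90:
        return True
    return False
```

For example, utf8_encode(0x20AC) produces the three bytes e2 82 ac, while the two-byte sequence c0 80, an overlong form of U+0000, is flagged by is_overlong.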