author: Ted Unangst <tedu@cvs.openbsd.org> 2017-05-31 10:09:32 +0000
committer: Ted Unangst <tedu@cvs.openbsd.org> 2017-05-31 10:09:32 +0000
commit: 11dd47918fc4c7f67ad087c9d623a1a1049ce39e
tree: 516d88287b3319cfa7496d91c862d92dbe9ca4d8
parent: d2c0786fef756b4566b0220e77b93dd0b46a579e
perhaps a few more words about encoding format
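
The encoding rules this change adds to utf8.7 can be illustrated with a short C sketch. It is not part of the commit or of OpenBSD's libc; the function names utf8_seqlen and utf8_decode are hypothetical, and the decoder is deliberately minimal: it assumes sequences of at most four bytes and does not reject overlong forms or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the sequence length implied by the first byte: the number
 * of high bits set before the first zero bit, or 1 if the high bit
 * is clear. Returns 0 if the byte cannot start a sequence.
 * (Hypothetical helper, for illustration only.)
 */
static size_t
utf8_seqlen(unsigned char c)
{
	if ((c & 0x80) == 0x00)
		return 1;	/* 0xxxxxxx */
	if ((c & 0xe0) == 0xc0)
		return 2;	/* 110xxxxx */
	if ((c & 0xf0) == 0xe0)
		return 3;	/* 1110xxxx */
	if ((c & 0xf8) == 0xf0)
		return 4;	/* 11110xxx */
	return 0;		/* continuation byte or invalid lead byte */
}

/*
 * Decode one sequence from s (n bytes available) into *out.
 * Returns the number of bytes consumed, or 0 on malformed or
 * truncated input. Overlong forms and surrogates are NOT rejected.
 */
static size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
	size_t i, len;
	uint32_t v;

	if (n == 0 || (len = utf8_seqlen(s[0])) == 0 || len > n)
		return 0;
	/* Low bits of the first byte: 7 for length 1, else 7 - len. */
	v = (len == 1) ? s[0] : (uint32_t)(s[0] & (0x7f >> len));
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80)	/* must be 10xxxxxx */
			return 0;
		v = (v << 6) | (s[i] & 0x3f);	/* six value bits each */
	}
	*out = v;
	return len;
}
```

For example, the bytes 0xc3 0xa9 decode to 0xe9: the first byte contributes its low five bits (00011) and the continuation byte its low six bits (101001).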
-rw-r--r-- share/man/man7/utf8.7 13
1 files changed, 9 insertions, 4 deletions
diff --git a/share/man/man7/utf8.7 b/share/man/man7/utf8.7
index 200565d5a7b..28b0ee692b8 100644
--- a/share/man/man7/utf8.7
+++ b/share/man/man7/utf8.7
@@ -1,4 +1,4 @@
-.\" $OpenBSD: utf8.7,v 1.2 2017/05/31 09:58:36 tedu Exp $
+.\" $OpenBSD: utf8.7,v 1.3 2017/05/31 10:09:31 tedu Exp $
.\"
.\" Copyright (c) 2017 Ted Unangst
.\" All rights reserved.
@@ -33,11 +33,16 @@
UTF-8 is a multibyte encoding for Unicode text.
It is the preferred format for non ASCII text.
.Pp
-The first byte of a sequence indicates the length in its high bits.
+The length of a UTF-8 sequence varies depending on the encoded value.
+If the high bit of the first byte is zero, the sequence length is one and
+the value is the remaining seven bits.
+If the high bit is set, then the number of high bits set, followed by a zero
+bit, indicates the length of the sequence and the value is formed by combining
+the low bits of each byte.
Continuation bytes all have the same format, with the top two bits set and
-unset, respectively.
+unset, respectively, and six value bits.
.Pp
-Ranges:
+Unicode ranges and their encoding formats:
.Bl -tag -width Ds
.It 0x0 - 0x7f
One byte.