author | Ted Unangst <tedu@cvs.openbsd.org> | 2017-05-31 10:09:32 +0000 |
---|---|---|
committer | Ted Unangst <tedu@cvs.openbsd.org> | 2017-05-31 10:09:32 +0000 |
commit | 11dd47918fc4c7f67ad087c9d623a1a1049ce39e | |
tree | 516d88287b3319cfa7496d91c862d92dbe9ca4d8 | |
parent | d2c0786fef756b4566b0220e77b93dd0b46a579e | |
perhaps a few more words about encoding format
-rw-r--r-- | share/man/man7/utf8.7 | 13 |
1 files changed, 9 insertions, 4 deletions
diff --git a/share/man/man7/utf8.7 b/share/man/man7/utf8.7
index 200565d5a7b..28b0ee692b8 100644
--- a/share/man/man7/utf8.7
+++ b/share/man/man7/utf8.7
@@ -1,4 +1,4 @@
-.\" $OpenBSD: utf8.7,v 1.2 2017/05/31 09:58:36 tedu Exp $
+.\" $OpenBSD: utf8.7,v 1.3 2017/05/31 10:09:31 tedu Exp $
 .\"
 .\" Copyright (c) 2017 Ted Unangst
 .\" All rights reserved.
@@ -33,11 +33,16 @@
 UTF-8 is a multibyte encoding for Unicode text.
 It is the preferred format for non ASCII text.
 .Pp
-The first byte of a sequence indicates the length in its high bits.
+The length of a UTF-8 sequence varies depending on the encoded value.
+If the high bit of the first byte is zero, the sequence length is one and
+the value is the remaining seven bits.
+If the high bit is set, then the number of high bits set, followed by a zero
+bit, indicates the length of the sequence and the value is formed by combining
+the low bits of each byte.
 Continuation bytes all have the same format, with the top two bits set and
-unset, respectively.
+unset, respectively, and six value bits.
 .Pp
-Ranges:
+Unicode ranges and their encoding formats:
 .Bl -tag -width Ds
 .It 0x0 - 0x7f
 One byte.
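
The rules in the new wording can be sketched in C roughly as follows. This is an illustration only and not part of the commit: the names utf8_seqlen and utf8_decode are hypothetical, the four-byte upper limit is assumed from RFC 3629 rather than taken from the text above, and the decoder does not reject overlong forms or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sequence length implied by the first byte, or 0 if the byte cannot
 * start a sequence (a continuation byte, or more than four high bits
 * set; the four-byte limit is an assumption taken from RFC 3629).
 */
static size_t
utf8_seqlen(unsigned char b)
{
	if ((b & 0x80) == 0x00)
		return 1;	/* 0xxxxxxx: high bit zero, seven value bits */
	if ((b & 0xe0) == 0xc0)
		return 2;	/* 110xxxxx: two high bits, then a zero bit */
	if ((b & 0xf0) == 0xe0)
		return 3;	/* 1110xxxx */
	if ((b & 0xf8) == 0xf0)
		return 4;	/* 11110xxx */
	return 0;
}

/*
 * Decode one sequence from s (at most n bytes) into *out.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Hypothetical helper: overlong forms and surrogates are not checked.
 */
static size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
	/* Mask selecting the value bits of the first byte, by length. */
	static const unsigned char firstmask[] = { 0, 0x7f, 0x1f, 0x0f, 0x07 };
	size_t i, len;
	uint32_t v;

	if (n == 0)
		return 0;
	len = utf8_seqlen(s[0]);
	if (len == 0 || len > n)
		return 0;
	v = s[0] & firstmask[len];
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80)	/* continuation: 10xxxxxx */
			return 0;
		v = (v << 6) | (s[i] & 0x3f);	/* six value bits per byte */
	}
	*out = v;
	return len;
}
```

For example, the two bytes 0xc3 0xa9 decode to 0xe9: the first byte contributes the low five bits (0x03) and the continuation byte contributes six (0x29).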