author | Stefan Sperling <stsp@cvs.openbsd.org> | 2010-12-05 14:59:50 +0000 |
---|---|---|
committer | Stefan Sperling <stsp@cvs.openbsd.org> | 2010-12-05 14:59:50 +0000 |
commit | e6f31fe8a556a8d28837be272f778f6b65cfc0bf | |
tree | 533eb64e87d0fa16f2f5a517509c8f37d85d7507 | |
parent | 44e707728c3054de0a9f38f5a360830f082a4720 |
Rewrite the mbrtowc(3) man page so we can make sense of this function.
tweaks from jmc, help from uwe, "We are going to have to trust you :-)" deraadt
-rw-r--r-- | lib/libc/locale/mbrtowc.3 | 280 |
1 files changed, 187 insertions, 93 deletions
diff --git a/lib/libc/locale/mbrtowc.3 b/lib/libc/locale/mbrtowc.3
index 2980638171f..23c79bb5cf6 100644
--- a/lib/libc/locale/mbrtowc.3
+++ b/lib/libc/locale/mbrtowc.3
@@ -1,4 +1,4 @@
-.\" $OpenBSD: mbrtowc.3,v 1.2 2007/05/31 19:19:29 jmc Exp $
+.\" $OpenBSD: mbrtowc.3,v 1.3 2010/12/05 14:59:49 stsp Exp $
 .\" $NetBSD: mbrtowc.3,v 1.5 2003/09/08 17:54:31 wiz Exp $
 .\"
 .\" Copyright (c)2002 Citrus Project,
@@ -25,166 +25,220 @@
 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 .\" SUCH DAMAGE.
 .\"
-.Dd $Mdocdate: May 31 2007 $
+.Dd $Mdocdate: December 5 2010 $
 .Dt MBRTOWC 3
 .Os
-.\" ----------------------------------------------------------------------
 .Sh NAME
 .Nm mbrtowc
 .Nd converts a multibyte character to a wide character (restartable)
-.\" ----------------------------------------------------------------------
 .Sh SYNOPSIS
 .Fd #include <wchar.h>
 .Ft size_t
-.Fn mbrtowc "wchar_t * restrict pwc" "const char * restrict s" "size_t n" \
-"mbstate_t * restrict ps"
-.\" ----------------------------------------------------------------------
+.Fn mbrtowc "wchar_t * restrict wc" "const char * restrict s" "size_t n" \
+"mbstate_t * restrict mbs"
 .Sh DESCRIPTION
 The
 .Fn mbrtowc
-usually converts the multibyte character pointed to by
-.Fa s
-to a wide character, and stores the wide character
+function examines at most
+.Fa n
+bytes of the multibyte character byte string pointed to by
+.Fa s ,
+converts those bytes to a wide character, and stores the wide character
 in the wchar_t object pointed to by
-.Fa pwc
+.Fa wc
 if
-.Fa pwc
-is non-null and
+.Fa wc
+is not
+.Dv NULL
+and
 .Fa s
 points to a valid character.
-The conversion happens in accordance with the conversion state
-described in the mbstate_t object pointed to by
-.Fa ps .
-This function may examine at most
-.Fa n
-bytes of the array beginning from
-.Fa s .
 .Pp
-If
-.Fa s
-points to a valid character and the character corresponds to a null wide
-character, then the
+Conversion happens in accordance with the conversion state described
+by the mbstate_t object pointed to by
+.Fa mbs .
+The mbstate_t object must be initialized to zero before the application's
+first call to
+.Fn mbrtowc .
+If the previous call to
 .Fn mbrtowc
-places the mbstate_t object pointed to by
-.Fa ps
-to an initial conversion state.
+did not return (size_t)-1, the mbstate_t object can safely be reused
+without reinitialization.
+.Pp
+The behaviour of
+.Fn mbrtowc
+is affected by the
+.Dv LC_CTYPE
+category of the current locale.
+If the locale is changed without reinitialization of the mbstate_t object
+pointed to by
+.Fa mbs ,
+the behaviour of
+.Fn mbrtowc
+is undefined.
 .Pp
 Unlike
 .Xr mbtowc 3 ,
-the
 .Fn mbrtowc
-may accept the byte sequence pointed to by
+will accept an incomplete byte sequence pointed to by
 .Fa s
-not forming a complete multibyte character
-but which may be part of a valid character.
-In this case, this function will accept all such bytes
-and save them into the conversion state object pointed to by
-.Fa ps .
-They will be used at subsequent calls of this function to restart
-the conversion suspended.
+which does not form a complete character but is potentially part of
+a valid character.
+In this case,
+.Fn mbrtowc
+consumes all such bytes.
+The conversion state saved in the mbstate_t object pointed to by
+.Fa mbs
+will be used to restart the suspended conversion during the next
+call to
+.Fn mbrtowc .
 .Pp
-The behaviour of the
+In state-dependent encodings,
+.Fa s
+may point to a special sequence of bytes called a
+.Dq shift sequence .
+Shift sequences switch between character code sets available within an
+encoding scheme.
+One encoding scheme using shift sequences is ISO/IEC 2022-JP, which
+can switch e.g. from ASCII (which uses one byte per character) to
+JIS X 0208 (which uses two bytes per character).
+Shift sequence bytes correspond to no individual wide character, so
 .Fn mbrtowc
-is affected by the
-.Dv LC_CTYPE
-category of the current locale.
+treats them as if they were part of the subsequent multibyte character.
+Therefore they do contribute to the number of bytes in the multibyte character.
 .Pp
-These are the special cases:
+Special cases in interpretation of arguments are as follows:
 .Bl -tag -width 012345678901
-.It "s == NULL"
-.Fn mbrtowc
-sets the conversion state object pointed to by
-.Fa ps
-to an initial state and always returns 0.
-Unlike
-.Xr mbtowc 3 ,
-the value returned does not indicate whether the current encoding of
-the locale is state-dependent.
+.It "wc == NULL "
+The conversion from a multibyte character to a wide character is performed
+and the conversion state may be affected, but the resulting wide character
+is discarded.
 .Pp
-In this case,
+This can be used to find out how many bytes are contained in the
+multibyte character pointed to by
+.Fa s .
+.It "s == NULL "
 .Fn mbrtowc
 ignores
-.Fa pwc
+.Fa wc
 and
 .Fa n ,
-and is equivalent to the following call:
+and behaves equivalent to
 .Bd -literal -offset indent
-mbrtowc(NULL, "", 1, ps);
+mbrtowc(NULL, "", 1, mbs);
 .Ed
-.It "pwc == NULL"
-The conversion from a multibyte character to a wide character has
-taken place and the conversion state may be affected, but the resultant
-wide character is discarded.
-.It "ps == NULL"
+.Pp
+which attempts to use the mbstate_t object pointed to by
+.Fa mbs
+to start or continue conversion using the empty string as input,
+and discards the conversion result.
+.Pp
+If conversion succeeds, this call always returns zero.
+Unlike
+.Xr mbtowc 3 ,
+the value returned does not indicate whether the current encoding of
+the locale is state-dependent, i.e. uses shift sequences.
+.It "mbs == NULL "
 .Fn mbrtowc
 uses its own internal state object to keep the conversion state,
-instead of
-.Fa ps
-mentioned in this manual page.
+instead of an mbstate_t object pointed to by
+.Fa mbs .
+This internal conversion state is initialized once at program startup.
+It is not safe to call
+.Fn mbrtowc
+again with a
+.Dv NULL
+.Fa mbs
+argument if
+.Fn mbrtowc
+returned (size_t)-1 because at this point the internal conversion state
+is undefined.
 .Pp
 Calling any other functions in
 .Em libc
-never change the internal
-state of
-.Fn mbrtowc ,
-which is initialized at startup time of the program.
+never changes the internal
+conversion state object of
+.Fn mbrtowc .
 .El
-.\" ----------------------------------------------------------------------
 .Sh RETURN VALUES
-In the usual cases,
-.Fn mbrtowc
-returns:
 .Bl -tag -width 012345678901
 .It 0
-The next bytes pointed to by
+The bytes pointed to by
 .Fa s
-form a null character.
+form a terminating NUL character.
+If
+.Fa wc
+is not
+.Dv NULL ,
+a NUL wide character has been stored in the wchar_t object pointed to by
+.Fa wc .
 .It positive
+.Fa s
+points to a valid character, and the value returned is the number of
+bytes completing the character.
 If
+.Fa wc
+is not
+.Dv NULL ,
+the corresponding wide character has been stored in the wchar_t object
+pointed to by
+.Fa wc .
+.It (size_t)-1
 .Fa s
-points to a valid character,
+points to an illegal byte sequence which does not form a valid multibyte
+character in the current locale.
 .Fn mbrtowc
-returns the number of bytes in the character.
+sets
+.Va errno
+to EILSEQ.
+The conversion state object pointed to by
+.Fa mbs
+is left in an undefined state and must be reinitialized before being
+used again.
+.Pp
+Because applications using
+.Fn mbrtowc
+are shielded from the specifics of the multibyte character encoding scheme,
+it is impossible to repair byte sequences containing encoding errors.
+Such byte sequences must be treated as invalid and potentially malicious input.
+Applications must stop processing the byte string pointed to by
+.Fa s
+and either discard any wide characters already converted, or cope with
+truncated input.
 .It (size_t)-2
 .Fa s
-points to the byte sequence which possibly contains part of a valid
-multibyte character, but which is incomplete.
-When
+points to an incomplete byte sequence of length
 .Fa n
-is at least
-.Dv MB_CUR_MAX
-only occurs if the array pointed to by
-.Fa s
-contains a redundant shift sequence.
-.It (size_t)-1
-.Fa s
-points to an illegal byte sequence which does not form a valid multibyte
-character.
-In this case,
+which has been consumed and contains part of a valid multibyte character.
 .Fn mbrtowc
 sets
 .Va errno
-to indicate the error.
+to EILSEQ.
+The character may be completed by calling
+.Fn mbrtowc
+again with
+.Fa s
+pointing to one or more subsequent bytes of the multibyte character and
+.Fa mbs
+pointing to the conversion state object used during conversion of the
+incomplete byte sequence.
 .El
-.\" ----------------------------------------------------------------------
 .Sh ERRORS
 The
 .Fn mbrtowc
-may causes an error in the following case:
+function may cause an error in the following cases:
 .Bl -tag -width Er
 .It Bq Er EILSEQ
 .Fa s
 points to an invalid or incomplete multibyte character.
 .It Bq Er EINVAL
-.Fa ps
+.Fa mbs
 points to an invalid or uninitialized mbstate_t object.
 .El
-.\" ----------------------------------------------------------------------
 .Sh SEE ALSO
 .Xr mbrlen 3 ,
 .Xr mbtowc 3 ,
 .Xr setlocale 3
-.\" ----------------------------------------------------------------------
 .Sh STANDARDS
 The
 .Fn mbrtowc
@@ -196,3 +250,43 @@ The restrict qualifier is added at
 .\" .St -isoC99 .
 ISO/IEC 9899:1999
 .Pq Dq ISO C99 .
+.Sh CAVEATS
+.Fn mbrtowc
+is not suitable for programs that care about internals of the character
+encoding scheme used by the byte string pointed to by
+.Fa s .
+.Pp
+It is possible that
+.Fn mbrtowc
+fails because of locale configuration errors.
+An
+.Dq invalid
+character sequence may simply be encoded in a different encoding than that
+of the current locale.
+.Pp
+The special cases for
+.Fa s
+== NULL and
+.Fa mbs
+== NULL do not make any sense.
+Instead of passing
+.Dv NULL
+for
+.Fa mbs ,
+.Xr mbtowc 3
+can be used.
+.Pp
+Earlier versions of this man page implied that calling
+.Fn mbrtowc
+with a
+.Dv NULL
+.Fa s
+argument would always set
+.Fa mbs
+to the initial conversion state.
+But this is true only if the previous call to
+.Fn mbrtowc
+using
+.Fa mbs
+did not return (size_t)-1 or (size_t)-2.
+It is recommended to zero the mbstate_t object instead.
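
For readers of this change, the calling pattern the rewritten page describes can be sketched in C: zero the mbstate_t before the first call, treat (size_t)-2 as "all n bytes were consumed, call again with the following bytes", and stop on (size_t)-1. This is only an illustration and not code from the commit. The helper name decode_chunked and its parameters are hypothetical, and ending the conversion at the first NUL is a simplifying assumption.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of libc: convert `len` bytes at `buf`
 * to wide characters, handing mbrtowc() at most `chunk` bytes per call
 * to mimic input that arrives in pieces.
 * Returns the number of wide characters stored in `out`, or -1 on an
 * encoding error or if the input ends in the middle of a character.
 */
long
decode_chunked(const char *buf, size_t len, size_t chunk,
    wchar_t *out, size_t outmax)
{
	mbstate_t mbs;
	size_t off = 0, nout = 0;
	int pending = 0;

	assert(buf != NULL && out != NULL && chunk > 0);
	memset(&mbs, 0, sizeof(mbs));	/* initial conversion state */

	while (off < len && nout < outmax) {
		size_t n = (len - off < chunk) ? len - off : chunk;
		wchar_t wc;
		size_t r = mbrtowc(&wc, buf + off, n, &mbs);

		if (r == (size_t)-1)
			return -1;	/* invalid input; mbs is now undefined */
		if (r == (size_t)-2) {
			/* all n bytes consumed, character still incomplete */
			off += n;
			pending = 1;
			continue;
		}
		pending = 0;
		if (r == 0)
			break;		/* NUL: treated as end of input here */
		out[nout++] = wc;
		off += r;		/* bytes that completed the character */
	}
	return pending ? -1 : (long)nout;
}
```

A caller that wants non-ASCII input decoded would first select a locale with setlocale(LC_CTYPE, ""), since the conversion depends on the LC_CTYPE category. Without that call the program runs in the "C" locale.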