Diffstat (limited to 'lib/libc/locale/mbrtowc.3')
-rw-r--r-- lib/libc/locale/mbrtowc.3 280
1 files changed, 187 insertions, 93 deletions
diff --git a/lib/libc/locale/mbrtowc.3 b/lib/libc/locale/mbrtowc.3
index 2980638171f..23c79bb5cf6 100644
--- a/lib/libc/locale/mbrtowc.3
+++ b/lib/libc/locale/mbrtowc.3
@@ -1,4 +1,4 @@
-.\" $OpenBSD: mbrtowc.3,v 1.2 2007/05/31 19:19:29 jmc Exp $
+.\" $OpenBSD: mbrtowc.3,v 1.3 2010/12/05 14:59:49 stsp Exp $
.\" $NetBSD: mbrtowc.3,v 1.5 2003/09/08 17:54:31 wiz Exp $
.\"
.\" Copyright (c)2002 Citrus Project,
@@ -25,166 +25,220 @@
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
-.Dd $Mdocdate: May 31 2007 $
+.Dd $Mdocdate: December 5 2010 $
.Dt MBRTOWC 3
.Os
-.\" ----------------------------------------------------------------------
.Sh NAME
.Nm mbrtowc
.Nd converts a multibyte character to a wide character (restartable)
-.\" ----------------------------------------------------------------------
.Sh SYNOPSIS
.Fd #include <wchar.h>
.Ft size_t
-.Fn mbrtowc "wchar_t * restrict pwc" "const char * restrict s" "size_t n" \
-"mbstate_t * restrict ps"
-.\" ----------------------------------------------------------------------
+.Fn mbrtowc "wchar_t * restrict wc" "const char * restrict s" "size_t n" \
+"mbstate_t * restrict mbs"
.Sh DESCRIPTION
The
.Fn mbrtowc
-usually converts the multibyte character pointed to by
-.Fa s
-to a wide character, and stores the wide character
+function examines at most
+.Fa n
+bytes of the multibyte character byte string pointed to by
+.Fa s ,
+converts those bytes to a wide character, and stores the wide character
in the wchar_t object pointed to by
-.Fa pwc
+.Fa wc
if
-.Fa pwc
-is non-null and
+.Fa wc
+is not
+.Dv NULL
+and
.Fa s
points to a valid character.
-The conversion happens in accordance with the conversion state
-described in the mbstate_t object pointed to by
-.Fa ps .
-This function may examine at most
-.Fa n
-bytes of the array beginning from
-.Fa s .
.Pp
-If
-.Fa s
-points to a valid character and the character corresponds to a null wide
-character, then the
+Conversion happens in accordance with the conversion state described
+by the mbstate_t object pointed to by
+.Fa mbs .
+The mbstate_t object must be initialized to zero before the application's
+first call to
+.Fn mbrtowc .
+If the previous call to
.Fn mbrtowc
-places the mbstate_t object pointed to by
-.Fa ps
-to an initial conversion state.
+did not return (size_t)-1, the mbstate_t object can safely be reused
+without reinitialization.
+.Pp
+The behaviour of
+.Fn mbrtowc
+is affected by the
+.Dv LC_CTYPE
+category of the current locale.
+If the locale is changed without reinitialization of the mbstate_t object
+pointed to by
+.Fa mbs ,
+the behaviour of
+.Fn mbrtowc
+is undefined.
.Pp
Unlike
.Xr mbtowc 3 ,
-the
.Fn mbrtowc
-may accept the byte sequence pointed to by
+will accept an incomplete byte sequence pointed to by
.Fa s
-not forming a complete multibyte character
-but which may be part of a valid character.
-In this case, this function will accept all such bytes
-and save them into the conversion state object pointed to by
-.Fa ps .
-They will be used at subsequent calls of this function to restart
-the conversion suspended.
+which does not form a complete character but is potentially part of
+a valid character.
+In this case,
+.Fn mbrtowc
+consumes all such bytes.
+The conversion state saved in the mbstate_t object pointed to by
+.Fa mbs
+will be used to restart the suspended conversion during the next
+call to
+.Fn mbrtowc .
.Pp
-The behaviour of the
+In state-dependent encodings,
+.Fa s
+may point to a special sequence of bytes called a
+.Dq shift sequence .
+Shift sequences switch between character code sets available within an
+encoding scheme.
+One encoding scheme using shift sequences is ISO/IEC 2022-JP, which
+can switch e.g. from ASCII (which uses one byte per character) to
+JIS X 0208 (which uses two bytes per character).
+Shift sequence bytes correspond to no individual wide character, so
.Fn mbrtowc
-is affected by the
-.Dv LC_CTYPE
-category of the current locale.
+treats them as if they were part of the subsequent multibyte character.
+Therefore they do contribute to the number of bytes in the multibyte character.
.Pp
-These are the special cases:
+Special cases in interpretation of arguments are as follows:
.Bl -tag -width 012345678901
-.It "s == NULL"
-.Fn mbrtowc
-sets the conversion state object pointed to by
-.Fa ps
-to an initial state and always returns 0.
-Unlike
-.Xr mbtowc 3 ,
-the value returned does not indicate whether the current encoding of
-the locale is state-dependent.
+.It "wc == NULL "
+The conversion from a multibyte character to a wide character is performed
+and the conversion state may be affected, but the resulting wide character
+is discarded.
.Pp
-In this case,
+This can be used to find out how many bytes are contained in the
+multibyte character pointed to by
+.Fa s .
+.It "s == NULL "
.Fn mbrtowc
ignores
-.Fa pwc
+.Fa wc
and
.Fa n ,
-and is equivalent to the following call:
+and behaves equivalent to
.Bd -literal -offset indent
-mbrtowc(NULL, "", 1, ps);
+mbrtowc(NULL, "", 1, mbs);
.Ed
-.It "pwc == NULL"
-The conversion from a multibyte character to a wide character has
-taken place and the conversion state may be affected, but the resultant
-wide character is discarded.
-.It "ps == NULL"
+.Pp
+which attempts to use the mbstate_t object pointed to by
+.Fa mbs
+to start or continue conversion using the empty string as input,
+and discards the conversion result.
+.Pp
+If conversion succeeds, this call always returns zero.
+Unlike
+.Xr mbtowc 3 ,
+the value returned does not indicate whether the current encoding of
+the locale is state-dependent, i.e. uses shift sequences.
+.It "mbs == NULL "
.Fn mbrtowc
uses its own internal state object to keep the conversion state,
-instead of
-.Fa ps
-mentioned in this manual page.
+instead of an mbstate_t object pointed to by
+.Fa mbs .
+This internal conversion state is initialized once at program startup.
+It is not safe to call
+.Fn mbrtowc
+again with a
+.Dv NULL
+.Fa mbs
+argument if
+.Fn mbrtowc
+returned (size_t)-1 because at this point the internal conversion state
+is undefined.
.Pp
Calling any other functions in
.Em libc
-never change the internal
-state of
-.Fn mbrtowc ,
-which is initialized at startup time of the program.
+never changes the internal
+conversion state object of
+.Fn mbrtowc .
.El
-.\" ----------------------------------------------------------------------
.Sh RETURN VALUES
-In the usual cases,
-.Fn mbrtowc
-returns:
.Bl -tag -width 012345678901
.It 0
-The next bytes pointed to by
+The bytes pointed to by
.Fa s
-form a null character.
+form a terminating NUL character.
+If
+.Fa wc
+is not
+.Dv NULL ,
+a NUL wide character has been stored in the wchar_t object pointed to by
+.Fa wc .
.It positive
+.Fa s
+points to a valid character, and the value returned is the number of
+bytes completing the character.
If
+.Fa wc
+is not
+.Dv NULL ,
+the corresponding wide character has been stored in the wchar_t object
+pointed to by
+.Fa wc .
+.It (size_t)-1
.Fa s
-points to a valid character,
+points to an illegal byte sequence which does not form a valid multibyte
+character in the current locale.
.Fn mbrtowc
-returns the number of bytes in the character.
+sets
+.Va errno
+to EILSEQ.
+The conversion state object pointed to by
+.Fa mbs
+is left in an undefined state and must be reinitialized before being
+used again.
+.Pp
+Because applications using
+.Fn mbrtowc
+are shielded from the specifics of the multibyte character encoding scheme,
+it is impossible to repair byte sequences containing encoding errors.
+Such byte sequences must be treated as invalid and potentially malicious input.
+Applications must stop processing the byte string pointed to by
+.Fa s
+and either discard any wide characters already converted, or cope with
+truncated input.
.It (size_t)-2
.Fa s
-points to the byte sequence which possibly contains part of a valid
-multibyte character, but which is incomplete.
-When
+points to an incomplete byte sequence of length
.Fa n
-is at least
-.Dv MB_CUR_MAX
-only occurs if the array pointed to by
-.Fa s
-contains a redundant shift sequence.
-.It (size_t)-1
-.Fa s
-points to an illegal byte sequence which does not form a valid multibyte
-character.
-In this case,
+which has been consumed and contains part of a valid multibyte character.
.Fn mbrtowc
sets
.Va errno
-to indicate the error.
+to EILSEQ.
+The character may be completed by calling
+.Fn mbrtowc
+again with
+.Fa s
+pointing to one or more subsequent bytes of the multibyte character and
+.Fa mbs
+pointing to the conversion state object used during conversion of the
+incomplete byte sequence.
.El
-.\" ----------------------------------------------------------------------
.Sh ERRORS
The
.Fn mbrtowc
-may causes an error in the following case:
+function may cause an error in the following cases:
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
points to an invalid or incomplete multibyte character.
.It Bq Er EINVAL
-.Fa ps
+.Fa mbs
points to an invalid or uninitialized mbstate_t object.
.El
-.\" ----------------------------------------------------------------------
.Sh SEE ALSO
.Xr mbrlen 3 ,
.Xr mbtowc 3 ,
.Xr setlocale 3
-.\" ----------------------------------------------------------------------
.Sh STANDARDS
The
.Fn mbrtowc
@@ -196,3 +250,43 @@ The restrict qualifier is added at
.\" .St -isoC99 .
ISO/IEC 9899:1999
.Pq Dq ISO C99 .
+.Sh CAVEATS
+.Fn mbrtowc
+is not suitable for programs that care about internals of the character
+encoding scheme used by the byte string pointed to by
+.Fa s .
+.Pp
+It is possible that
+.Fn mbrtowc
+fails because of locale configuration errors.
+An
+.Dq invalid
+character sequence may simply be encoded in a different encoding than that
+of the current locale.
+.Pp
+The special cases for
+.Fa s
+== NULL and
+.Fa mbs
+== NULL do not make any sense.
+Instead of passing
+.Dv NULL
+for
+.Fa mbs ,
+.Xr mbtowc 3
+can be used.
+.Pp
+Earlier versions of this man page implied that calling
+.Fn mbrtowc
+with a
+.Dv NULL
+.Fa s
+argument would always set
+.Fa mbs
+to the initial conversion state.
+But this is true only if the previous call to
+.Fn mbrtowc
+using
+.Fa mbs
+did not return (size_t)-1 or (size_t)-2.
+It is recommended to zero the mbstate_t object instead.