author | Todd C. Miller <millert@cvs.openbsd.org> | 2001-05-24 18:26:20 +0000 |
---|---|---|
committer | Todd C. Miller <millert@cvs.openbsd.org> | 2001-05-24 18:26:20 +0000 |
commit | 483d4e680bd2a6db14835b1b4d65be33488d532b (patch) | |
tree | 129a4c95425cb37ed928ef53a27eb7dce5de3345 /gnu/usr.bin/perl/pod/perlunicode.pod | |
parent | 8757fe6728b9db37919ad703b336ebbbc84413aa (diff) |
stock perl 5.6.1
Diffstat (limited to 'gnu/usr.bin/perl/pod/perlunicode.pod')
-rw-r--r-- | gnu/usr.bin/perl/pod/perlunicode.pod | 48 |
1 files changed, 23 insertions, 25 deletions
diff --git a/gnu/usr.bin/perl/pod/perlunicode.pod b/gnu/usr.bin/perl/pod/perlunicode.pod
index 5333ac495c0..5b0fe2faaf2 100644
--- a/gnu/usr.bin/perl/pod/perlunicode.pod
+++ b/gnu/usr.bin/perl/pod/perlunicode.pod
@@ -1,16 +1,18 @@
 =head1 NAME
 
-perlunicode - Unicode support in Perl
+perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)
 
 =head1 DESCRIPTION
 
 =head2 Important Caveat
 
-WARNING: The implementation of Unicode support in Perl is incomplete.
+    WARNING: As of the 5.6.1 release, the implementation of Unicode
+    support in Perl is incomplete, and continues to be highly experimental.
 
-The following areas need further work.
+The following areas need further work. They are being rapidly addressed
+in the 5.7.x development branch.
 
-=over
+=over 4
 
 =item Input and Output Disciplines
 
@@ -114,13 +116,7 @@ will typically occur directly within the literal strings as UTF-8
 characters, but you can also specify a particular character with an
 extension of the C<\x> notation. UTF-8 characters are specified by
 putting the hexadecimal code within curlies after the C<\x>. For instance,
-a Unicode smiley face is C<\x{263A}>. A character in the Latin-1 range
-(128..255) should be written C<\x{ab}> rather than C<\xab>, since the
-former will turn into a two-byte UTF-8 code, while the latter will
-continue to be interpreted as generating a 8-bit byte rather than a
-character. In fact, if the C<use warnings> pragma of the C<-w> switch
-is turned on, it will produce a warning
-that you might be generating invalid UTF-8.
+a Unicode smiley face is C<\x{263A}>.
 
 =item *
 
@@ -163,20 +159,10 @@ C<(?:\PM\pM*)>.
 
 =item *
 
-The C<tr///> operator translates characters instead of bytes. It can also
-be forced to translate between 8-bit codes and UTF-8. For instance, if you
-know your input in Latin-1, you can say:
-
-    while (<>) {
-        tr/\0-\xff//CU; # latin1 char to utf8
-        ...
-    }
-
-Similarly you could translate your output with
-
-    tr/\0-\x{ff}//UC; # utf8 to latin1 char
-
-No, C<s///> doesn't take /U or /C (yet?).
+The C<tr///> operator translates characters instead of bytes. Note
+that the C<tr///CU> functionality has been removed, as the interface
+was a mistake. For similar functionality see pack('U0', ...) and
+pack('C0', ...).
 
 =item *
 
@@ -214,6 +200,18 @@ byte-oriented C<chr()> and C<ord()> under utf8.
 
 =item *
 
+The bit string operators C<& | ^ ~> can operate on character data.
+However, for backward compatibility reasons (bit string operations
+when the characters all are less than 256 in ordinal value) one cannot
+mix C<~> (the bit complement) and characters both less than 256 and
+equal or greater than 256. Most importantly, the DeMorgan's laws
+(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
+Another way to look at this is that the complement cannot return
+B<both> the 8-bit (byte) wide bit complement, and the full character
+wide bit complement.
+
+=item *
+
 And finally, C<scalar reverse()> reverses by character rather than by byte.
 
 =back